Modeling Programming as a Reinforcement Learning Task

Published on July 12, 2019 • 2 min read

Let's say you spend your days assembling RC quadcopters. It's fun, but eventually you get bored of it. So you decide to build a small factory that builds the quadcopters for you. You now don't have to build any more quadcopters! However, you realize you're spending more and more of your time working on the factory, so you build a meta-factory which builds the factory that builds your quadcopters. This is known as metaprogramming.

Metaprogramming is the concept of having computer programs write other computer programs.

# metaprogram
echo '#!/bin/sh' > program
for i in $(seq 10)
    echo "echo $i" >> program
chmod +x program

This an example of a Bash script which generates and runs another program that prints out the numbers 1-10.

Now, let's think in terms of reinforcement learning. What if we were to define a programming language interpreter as an agent's environment, the language's syntax as possible actions an agent could take, and a sparse reward function that would reward the agent when the generated program successfully did what we wanted it to. We would have a reinforcement learning agent capable of writing its own programs — a metaprogramming agent. Unfortunately, there are a number of practical problems with this idea. Programming tends to have sparse rewards, either a problem is solved or it isn't, and sparse rewards often result in agents not finding the global minima of a problem. As human programmers, we're able to tell when we're getting closer to solving a problem because we have experience, however a naive agent lacks our intuition.

The problem to solve then becomes, how can we craft an effective reward function for programming itself? Such that, as an agent gets closer to solving a programming problem, we can increase its reward. A good place to start would be to observe the thought processes behind a human programmer.

Let's say we want to create a reinforcement learning agent that prints out the integers 1-10. A human programmer would decompose this program into two separate programs, looping through the integers, and then printing them.

I would first write a a for-loop:

for i in range(1,11):

Then I would nest a print statement inside the for-loop.

for i in range(1,11):

While this program may appear simple at first glance, it's not immediately obvious how to devise a reward function that would reward the agent as it went along. Sure, we could hardcode a reward function that did something like "+1 reward if there's a for-loop" and "+1 reward if there's a nested print statement", but hand-crafting such a function would be even more complex than outright writing the program itself, and the whole point of a metaprogramming agent would be to avoid writing the program.

With that in mind, I'll leave you with this problem to solve: How can we develop a non-sparse reward function for solving programming problems?