Applying Optimal Control Theory to Chutes and Ladders

by LosAlfajores in Circuits > Robots

1034 Views, 5 Favorites, 0 Comments

Applying Optimal Control Theory to Chutes and Ladders

ChutesAndLadders.png
20211206_211956.jpg
20211206_212259.jpg

When your first kid gets to an age where they start thinking more critically and can follow instructions well, every parent has the thought "we should definitely get a board game and play it with our kid! That would be so cute!" Shortly followed by "Ooh! Remember Chutes and Ladders? That would be a great first game!"

Then, the unsuspecting family tries it out, only to discover that it is the most boring game on the planet. 100% based in chance, and with giant leaps forward and backward in the gameplay, it's just so dang boring.

So, naturally, when something is boring and systematic, the roboticist inside of me screams "YoU sHoUlD oPtImIzE iT!!!!" And I have no choice but to listen. Now I'll infect you with the knowledge of how we go about bludgeoning this (and other) useless board games with some optimal control theory and a few simulation techniques in Python.

Supplies

20211206_212135.jpg

For this instructable, I'll be using Python with Numpy for data processing, but the concepts here are language agnostic, so any scientific computing language should do the trick! I'll not go over how to get Python set up here, since that's out of the scope of what we're trying to do here.

As for the game, Chutes and Ladders, these principles can apply to pretty much any board game, as long as there's not too much strategy involved. The more complicated the game, the more difficult it is to define a distinct optimum way to play. For any game, we'll have to identify states and actions in order to define the system.

Define States and Actions

20211206_211737.jpg

In this context, a state defines the status of the game at any given time. For Chutes and Ladders, the state of the game is defined by the position of each of the players (2 or 3, by the rules). The board has tiles 1-100, but players start off the edge of the board, so we'll define that as state 0. This means positions 0-100 are our possible states.

The actions you can take in Chutes and Ladders are trivial. All you can do is spin for your next turn. No decision necessary. That's why this game is so boring. We want to change that, so instead of a player's turn consisting of spinning a random number between 1 and 6, they player will select a number between 1 and 6, and will have a 50% chance of moving forward the correct amount, a 25% chance of undershooting by one tile, and a 25% chance of overshooting by one tile.

For example, I have my character on tile 14, and I choose to move forward 3. There's a 25% chance I will move forward 2, 50% chance I will move forward 3, and 25% chance I will move forward 4. To do this in real life, just flip a coin twice. Two heads = overshoot. Two tails = undershoot. One heads, one tails = proceed the correct amount.

Now we have a proper game that requires some strategy! Let's dive into figuring out the optimal way to win.

Define the Problem

20211206_212112.jpg

Here's what we want to know:

What is the best action I can take from any given tile to minimize the amount of time it takes me to reach tile 100?

A more complicated question could include strategies based on your opponent's position (for example, taking greater risks, since your opponent is ahead), but that is a bit more difficult. Later, we'll introduce a discount factor that will allow us to put pressure on finishing quickly vs. avoiding risk.

To answer this question, we want to learn the optimal policy for the game. The optimal policy is basically a map: it says "in space ___, you should choose to try to move forward ___ tiles". Functionally, it'll be a lookup table, and if you follow it, you will be following the exact statistically optimal solution to the game.

Value Maps

20211206_212103.jpg

To get an optimal policy, we first need to compute the value map, which is a map that assigns a value to each tile. Basically, we assign a high value to the tiles that cause us to win (tiles 100 and 80 [tile 80 has a ladder to 100]), then we go through each other tile and figure out the action that has the best chance of taking us to a tile with a high value. For any given tile, the value is computed as the average reward from taking the best possible action from that tile. Tiles that are closer to the end generally have a higher value. Tiles with a ladder lead us to tiles of better value, therefore they have better value themselves.

For example, let's say we are on tile 45, and tile 50 has a super awesome ladder that takes us forward 20 spaces. However, let's pretend tile 51 has a chute that takes us back to tile 1. Should we choose to jump to tile 50 and risk a 25% chance of falling down a huge chute, just for the 50% chance of moving forward 20 spaces? Tile 50 would have a high value, since it gets us closer to the reward. Tile 51 would have a low value, since it takes us far away from the reward.

As we iteratively construct our value map, situations like that will be taken into account as the "reward" on tile 100 trickles down the map and show us which paths result in the highest value earned. The optimal policy can then be constructed by going tile by tile and finding which action gives you the best reward on average.

Example: Gridworld

0.png
1.png
2.png
3.png
4.png

Let's do a simple example of a value map. Let's say we are in a 2D world, and we can move horizontally or vertically one tile at each time step (like Pokemon, or simple 2D video games), and we always go the direction we want. This is called a deterministic system - there's no randomness. We want to know the fastest way to the top right corner from any given point on the map.

So, what do we do? We give a reward value to that top right corner - 80 points, then set a discount factor of 0.5. This means that each tile will have a value of 1/2 the best tile you can reach from there. To fill out the map, you just go over each tile, find the best tile you can reach, then multiply that tile's value by 0.5 and stick it in the current tile. This way, the reward in the top right "flows" down the map. To take the best action, just choose the action that will take you to the best tile you can get to - viola! You know the best way to get to the top right!

Example: Stochastic Gridworld

5.png
6.png

Now let's mix it up. Let's say we have a stochastic system, where if we try to go forward, sometimes we accidently move left or right. In this case, we have a penalty for colliding with those barriers in the middle of the field. We fill out the value map similarly, but the value of a tile is now the value gained on average from the best action to take. We also add a penalty to hitting a barrier, to make it more fun!

Now, when we are choosing the best action to take, we need to consider the average reward of each action. We look at all the tiles a certain action could put us on, then multiply the value of each of those tiles by the probability of ending up there, then add those numbers:

reward_of_action = sum(reward_of_each_possible_outcome * probability_of_that_outcome)

This way, we make sure to follow a path that minimizes the risk of crashing into barriers.

NOTE: The values shown in the images here would be close to the actual value map, but not exactly. If we were to go over this map again, and apply our new formula for the reward_of_action, we would find that each tile changes slightly in value, since you could end up moving to a tile adjacent to your target tile. This is really hard to do by hand, so we'll force a computer to do it over and over again until we get the correct map. That's what's coming up next!


Simulate the Game

20211206_212002.jpg

Now let's get back to Chutes and Ladders! Depending on the game you've chosen to simulate, you'll need to structure the simulation differently. For a linear game like Chutes and Ladders, we know that we will progress relatively linearly from tile 0 to tile 100, with a few exceptions (the chutes and the ladders). We will construct a tileMap variable so we can call

tileMap[tile_I_just_landed_on]

and return the tile number a chute or ladder would move me to.

Starting with library imports, we set up our header:

import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy


nStates = 107   # the number of possible states attainable immediately after a roll
tileMap = np.arange(nStates)    # initialize the tilemap
mapFrom = [1,  4,  9,  21, 28, 36, 48, 49, 51, 56, 62, 64, 71, 80,  87, 93, 95, 98]
mapTo =   [38, 14, 31, 42, 84, 44, 26, 11, 67, 53, 18, 60, 91, 100, 24, 73, 76, 78]
tileMap[mapFrom] = mapTo    # mapFrom is the start of a chute/ladder, mapTo is its destination
tileMap[101:] = 99   # overshooting tile 100 leaves the player on tile 99

The Numpy library will handle all our math, matplotlib will handle plotting, and deepcopy will be used to fix a bug that took me WAY TOO DANG LONG to track down.

NOTE: Despite only having tiles 0-100, I've defined 107 states here - why? If you're standing on tile 99, and you (very unwisely) try to move forward 6 tiles, and move 7 tiles due to overshoot, you'd land on tile 106, meaning we actually have to account for 107 total states (0-106). On the last line, we say that landing on any tile over 100 just plops you back on tile 99.

Set Up Value Iteration Parameters

20211206_212124.jpg

Here, we set up everything we need for value iteration, the process by which we will update our value map. We set our reward for winning, initialize a few variables, then set our stopping criteria (convergeTol and maxIter).

We also set our action probabilities (actionProb). Here, we have the probabilities of undershooting by one tile, hitting our chosen tile, and overshooting our tile, respectively. As long as these numbers sum to 1.0, you can put in whatever you like, and it'll change our results to match a game with those probabilities!

The last hyperparameter (value that changes how an algorithm works) is our discount factor. The discount factor is a measure of how much the reward decreases every time it's passed on to the next tile. This effectively sets how aggressive we want our optimal policy. A low discount factor (like 0.5) says that, even if our next move is guaranteed to land us on the winning space and win the game, the tile we're sitting on is only worth a small amount compared to the final goal.

In practice, a low discount factor makes your policy advise you to take greater risks to get to the end ASAP. A high discount factor (say, 0.99) will be less aggressive, since even if you miss the exact optimal path, there's likely another ladder that will get you close to the goal soon.

winReward = 100 # the reward for winning - this is arbitrary
valueMap = np.zeros(nStates)    # the value of each tile, 1:nStates
newVal = np.zeros(nStates)      # for calculating the change in value in a given timestep
policy = np.zeros(nStates)      # the best action to take for each tile

convergeTol = 0.001     # when the total change in value is this low, we stop
totalChange = 100       # initialize to an arbitrary high value
actionProb = np.array([0.25, 0.5, 0.25])    # the probability of [undershoot, perfect, overshoot]
discountFactor = 0.9    # this controls how quickly the "win reward" decays in the value map
maxIter = 1000      # maximum number of value iterations
iter = 0    # initialize iteration counter

Numerically Construct the Value Map and Optimal Policy

20211206_212254.jpg

Here we get to the meat and potatoes of our algorithm. We iterate until we time out (iterations exceed our set max iterations) or we make such a small change to our value map that effectively we've found the optimum.

At every step of the while loop, we go through all the tiles backwards (start at 99, work towards 0). For each tile, we check the result of each action we can take, and figure out, on average, how much reward we'd get for taking that action. This can be expressed as:

reward_of_action = sum(reward_of_each_possible_outcome * probability_of_that_outcome)

We set the value of a given tile as the reward of the best action you can take from there, multiplied by our discount factor. Since we know which action was best, we save that off into our optimal policy, so we can reference it later.

Rinse, repeat, and when we've converged, we should have our optimal policy and value map, ready to interpret!

# loop until we converge or time out
while totalChange > convergeTol and iter < maxIter:
    iter += 1
    for i in reversed(range(100)):      # step backwards through tiles
        choices = np.zeros(6)           # keep track of the value of each action
        for j in range(0, 6):           # iterate through possible actions to find the best action from a tile
            # the value achieved with each action is a probability-weighted sum of the value of each tile
            # that can be reached by taking a given action
            choices[j] = np.sum(valueMap[tileMap[i+j : i+j+3]] * actionProb)


        bestChoice = np.max(choices)    # the value of the best action
        policy[i] = np.argmax(choices)  # the action resulting in the best value
        newVal[i] = bestChoice * discountFactor     # set the new value of tile i
    
    newVal[100] = winReward     # ensure the win reward is unchanged
    newVal[80] = winReward      # tile 80 has a ladder to 100, so it's basically a win
    newVal[101:] = newVal[99]   # tiles 101 to 106 just map back to tile 99


    totalChange = np.sum(abs(newVal - valueMap))    # figure out how much the value map changed
    valueMap = deepcopy(newVal)         # set the new value map
    print(totalChange)

NOTE: On that second to last line, you see the stinking bug that took me way to long to identify. If you omit the deepcopy() call on that line, valueMap becomes a pointer to newVal, causing the totalChange to be zero when computed on the second iteration of the while() loop. There's probably a better way to handle this, but this should work fine for now.

Plot the Results

Policy.png
Chutes and Ladders Optima.jpg

Nothing fancy here, just show us what we came to see!

# plot everything, excluding the slush tiles (101 - 106)
fig, axs = plt.subplots(2)


axs[0].plot(valueMap[:101])
axs[0].set_ylabel('Value')
axs[1].set_ylabel('Optimal Action')
axs[1].set_xlabel('Tile')
axs[1].plot(policy[:101] + 1)   # add 1 because action '0' moves you forward 1 tile
plt.show()

From the plots, we see a value map on the top, and our optimal policy on the bottom. The value map shows the value of each tile, where the goal is to head uphill as quickly as possible. It's interesting to see where the value map dips down. Each place the value map dips down is where we've missed a ladder, or come on a section with a bunch of chutes.

The optimal policy is our key to winning. Find the tile you're currently on, take the suggested action, and on average you'll win more than any other strategy!

Congratulations! You've now mercilessly butchered a children's game into a paint by numbers way to demoralize anyone who plays with you! We started with a boring children's game, 100% rooted in chance, and turned it into a game that could potentially be fun, but we've made not fun again by turning it into a paint-by-the-numbers experience! BUT - we had fun doing it =)

What Happens If...?

20211206_211948.jpg

Now, time to play with our algorithm! It should solve pretty quickly, so here's a couple scenarios to try out by tweaking the hyperparameters:

What happens if you increase the randomness? Make it an even chance for overshoot, undershoot, and perfect hits!

Make it deterministic! That means you always land on the tile you choose.

Reduce the discount factor! What happens to the value graph?

increase the discount factor! Compare the optimal policies - what changed?

The Whole Code

20211206_212152.jpg

All of this code can be found on my Github page: https://github.com/TylerQuacken/chutes_and_ladders_stats

In that repo, there are also other explorations of the Chutes and Ladders game, including one with a Monte Carlo simulation of the game. If this Instructable gets enough attention, I'll do another that accomplishes the same purposes, but goes more into Monte Carlo simulation instead of value iteration.

Here's the full code, in case you're a copy and paste kind of person:

import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy


nStates = 107   # the number of possible states attainable immediately after a roll
tileMap = np.arange(nStates)    # initialize the tilemap
mapFrom = [1,  4,  9,  21, 28, 36, 48, 49, 51, 56, 62, 64, 71, 80,  87, 93, 95, 98]
mapTo =   [38, 14, 31, 42, 84, 44, 26, 11, 67, 53, 18, 60, 91, 100, 24, 73, 76, 78]
tileMap[mapFrom] = mapTo    # mapFrom is the start of a chute/ladder, mapTo is its destination
tileMap[101:] = 99   # overshooting tile 100 leaves the player on tile 99


winReward = 100 # the reward for winning - this is arbitrary
valueMap = np.zeros(nStates)    # the value of each tile, 1:nStates
newVal = np.zeros(nStates)      # for calculating the change in value in a given timestep
policy = np.zeros(nStates)      # the best action to take for each tile


convergeTol = 0.001     # when the total change in value is this low, we stop
totalChange = 100       # initialize to an arbitrary high value
actionProb = np.array([0.25, 0.5, 0.25])    # the probability of [undershoot, perfect, overshoot]
discountFactor = 0.9    # this controls how quickly the "win reward" decays in the value map
maxIter = 1000      # maximum number of value iterations
iter = 0    # initialize iteration counter


# loop until we converge or time out
while totalChange > convergeTol and iter < maxIter:
    iter += 1
    for i in reversed(range(100)):      # step backwards through tiles
        choices = np.zeros(6)           # keep track of the value of each action
        for j in range(0, 6):           # iterate through possible actions to find the best action from a tile
            # the value achieved with each action is a probability-weighted sum of the value of each tile
            # that can be reached by taking a given action
            choices[j] = np.sum(valueMap[tileMap[i+j : i+j+3]] * actionProb)


        bestChoice = np.max(choices)    # the value of the best action
        policy[i] = np.argmax(choices)  # the action resulting in the best value
        newVal[i] = bestChoice * discountFactor     # set the new value of tile i
    
    newVal[100] = winReward     # ensure the win reward is unchanged
    newVal[80] = winReward      # tile 80 has a ladder to 100, so it's basically a win
    newVal[101:] = newVal[99]   # tiles 101 to 106 just map back to tile 99


    totalChange = np.sum(abs(newVal - valueMap))    # figure out how much the value map changed
    valueMap = deepcopy(newVal)         # set the new value map
    print(totalChange)


# plot everything, excluding the slush tiles (101 - 106)
fig, axs = plt.subplots(2)


axs[0].plot(valueMap[:101])
axs[0].set_ylabel('Value')
axs[1].set_ylabel('Optimal Action')
axs[1].set_xlabel('Tile')
axs[1].plot(policy[:101] + 1)   # add 1 because action '0' moves you forward 1 tile
plt.show()