Monte Carlo Reinforcement Learning: A Hands-On Approach

Tawsif Kamal
15 min readMay 25, 2021

Learning through pure randomness shouldn’t make any sense.

But it works…

Using some statistics magic, you can calculate unknown values through randomness with great accuracy and solve complex problems including Reinforcement Learning problems. One of these methods of “random learning” is called Monte Carlo Simulation.

In this article, we are going to understand Monte Carlo (MC) methods applied to Reinforcement Learning. Alongside the explanations, we are going to apply this algorithm to play the game of blackjack!

Part 1 Monte Carlo RL Tutorial

Applying MC Learning onto Blackjack Open AI Gym


  • What is Monte Carlo Learning
  • Dynamic Programming vs MC Learning
  • Policy Iteration
  • MC Prediction
  • MC Control

What is Monte Carlo Learning

Monte Carlo Reinforcement Learning methods are intuitive as it contains one fundamental concept: Averaging returns from several episodes to estimate value functions.

Some key features of Monte Carlo Learning are the following:

  • the algorithm only works on episodic tasks
  • learns from interaction with the environment (called experience)
  • Takes a value-based approach for estimating optimal policies
  • calculates the return after the end of the episode
  • Model-Free

Sidenote… if you are not familiar with the basics of Reinforcement Learning through the formulation of Markov Decision Processes, check out my previous article explaining the basic theory and math behind RL + other great introductory resources to get up to speed!

To get an intuition behind MC learning, we have to go back to the basics of Dynamic Programming.

Dynamic Programming Vs Monte Carlo Learning

If you are familiar with dynamic programming (DP), recall that the method to estimate value functions is by using planning algorithms such as policy iteration or value iteration. However, dynamic programming algorithms in RL are model-based, which means that the environment must be fully visible (we must know all components of MDP features including the state transition probabilities) to the agent in order to implement policy or value iteration.

Given the nature of planning algorithms, we don’t even have to make the agent interact with the environment to compute optimal value functions. We can just use our amazing math skills to solve the Bellman equations and boom we solved the RL problem! Since we are able to calculate the optimal policy without any interaction with the environment, this means that there isn’t any actual learning going on in dynamic programming methods. It is just your plain old “pick up a textbook, write down your givens, and solve the equation” type of problem.


While on the other hand, Monte Carlo Learning is a model-free algorithm. This means that the state-transition dynamics are unknown to us. In other words, we don’t know the physics of the world. Since we now have an unknown variable, this means that we cannot solve the Bellman equation to then predict the optimal policy!

The Bellman Equations. Hmm… ring a bell?

state-value function bellman equation

The above diagram represents the state-value function bellman equation which assigns a value to a particular state. For planning algorithms, we would usually solve the right-hand side of this equation to get an accurate value for each state in our state space. However, we have the unknown variable p(s’, r | s, a) in Monte Carlo. This missing variable is what differentiates dynamic programming from Monte Carlo methods.

dynamics of the environment

The above expression is the probability transition matrix that describes the physics of the environment. If we don’t know the values for this transition matrix, then we can't predict the stochasticity associated with the environment.

Probabilities are unknown in MC Learning!

MC learning allows us to solves RL problems without needing to calculate the transition probabilities.

This is what makes MC a powerful learning algorithm since we can start to apply it in real-life environments such as Blackjack, Chess, and the famous game of Go where the transition probabilities are likely going to be unknown.

Just think about it… how can you calculate the probability of your opponent making a certain move in chess for all positions in a chess game? Sure maybe the most immediate moves wouldn’t be too hard to predict using simple logic, but what about the 5th, 6th, 50th move from the current state of the game?

Hard… Isn’t it.

I wish we were mind readers :(

Anyways, to understand how MC learning can calculate value functions, we have to get an intuition behind the infamous dynamic programming algorithm called Policy Iteration.

Policy Iteration

Policy iteration is a DP algorithm that helps us compute optimal value functions by iteratively updating the values of each state and improving a random policy from these updates until it converges to be the most optimal one. Iteratively updating the values is called policy evaluation, while improving a random policy is called policy improvement.

“updating” and “improving” may sound extremely vague at the moment, but these concepts will be cleared up as we explore them more in detail below.

Another side note… If you can get a good grasp of Policy Iteration, then it will be much easier to understand most of the RL algorithms out there including Monte Carlo, Q-learning, SARSA, Actor-Critic, etc. This is because most RL algorithms are derived from Policy Iteration. Really spend some time understanding the concepts/derivations of Policy Iteration, and it will feel as though everything in RL finally clicks!

Let us first try to understand policy evaluation.

Policy Evaluation

Policy evaluation aims to answer the question, “given a policy, what is the value function that corresponds to this policy?” In other words, we are trying to “evaluate” the value of each state (prediction problem). To perform policy evaluation, we first initialize a random policy 𝝿 (choosing random actions) and arbitrarily assign values of zero to each state. The key point to note is that everything is random in the initialization step. More importantly, the value function we initialized with zeros (v_k(s)) is NOT the accurate value function for the given policy. Our job is to find the true value function for policy 𝝿 so that we can perform further useful operations.

the initial value function is not equal to the actual value function

For this reason, we need to formulate an update rule that we can iteratively use to approximate the true value function of the corresponding policy. Usually, the update rule is just variations of the bellman equations but will depend on the algorithm at hand. The MC update rule will be covered later on, but the important idea to take away is that we iteratively update v_k with a recursive equation infinitely a number of times until v_k converges to the optimal policy.

Now you may be asking, how is it that by iteratively updating a random value function, it will converge to the true value function? This question requires an article on its own, so I will not be talking about the proof of converges here. But, check out the following resources that explain in detail why policy evaluation is always guaranteed to converge!

After having computed the value function for the given policy, we can move on to policy improvement.

Policy Improvement

Policy improvement aims to answer the question, “given a value function for a policy 𝝿, how can we improve this policy so that it becomes the most greedy policy?” Greedy means to take the action that will give us the highest value for that current state. We already know the state value when we choose to follow policy 𝝿. But what if there is a better policy 𝝿’ that we don't know about? What if we tried following a different policy instead of our current one and see if the state value we receive is higher?

This form of thinking is the fundamental idea of policy improvement. In essence, we are trying to compute the value of each state-action pair for that current value function. Therefore, we have to go back to the q-function.

The q-function

In the end, whichever action gives us the highest q-value will then become our greedy policy.

Acting greedily by choosing the action with max q_𝝿

Now that we have this improved policy, we now have to again compute the value function because the current value function corresponds to v_𝝿, not v_𝝿’. In other words, we now have to perform policy evaluation again. See how this is becoming a bit recursive?

Policy Evaluation + Policy Improvement = Policy Iteration

Policy iteration composes of performing policy evaluation and policy improvement enough times until it reaches the most optimal policy. The following visualization provides a clear overview of how a random policy can become optimal through recursion.


In essence, the steps are as follows:

  1. We initialize an arbitrary value function.+
  2. find the true value function through policy evaluation.
  3. given that value function, improve the current policy by acting greedily.
  4. Take this new policy, and compute the value function again using policy evaluation.
  5. Repeat.

NBA x Policy Iteration


Policy Iteration is very similar to how humans learn. Take the example of an NBA team. The players on a team would play the game to the best of their ability given their current knowledge. After the game, the team gathers together to reflect and improve on their performance by going over the gameplay recordings.

“Playing the game” is basically the value function and the knowledge that the players currently have is the policy. Remember how that policy evaluation aimed to answer the question, “given a current policy, what is the value function?” Considering this question in the context of NBA players, it becomes, “given the current basketball knowledge/chemistry, how much will the team score?” Hint: playing the game is policy evaluation.

Now that you have “computed the value function,” it is time to improve by rewatching the gameplay (hint: this is policy improvement). The question now to answer is, “given the game played, how can the team improve upon the knowledge/chemistry between the players?”

The players will then repeat this process: play a game, rewatch the highlight tape, play a game, rewatch, play, and on. This recursive process will allow them to get to a point where they are playing optimally for the current season (through recursion).

In essence, this is exactly how policy iteration works! Lovely how you can take computer science concepts and apply them to real-world scenarios right?

Now that we have a better grasp on policy iteration, let's move onto understanding Monte Carlo Learning.

Monte Carlo Methods

Monte Carlo methods in RL break down into two parts just like policy iteration: MC Prediction and MC Control.

MC Prediction

Very similar to how we evaluated the value function for policy evaluation, we are going to focus on evaluating the q-function in MC Prediction. Again the question will become, “given a policy, what is the q-function for that corresponding policy?”

The q-function with respect to accumulated rewards

The q-function is basically the sum of expected rewards in an episode until the terminal timestep given the current state-action pair (s, a) following policy 𝝿.

However, because we don’t know the state-transition function for the environment in MC, it is not possible for us to calculate the expectation since we don’t know the probabilities associated with each reward!

However, what we can do is run a bunch of episodes with random actions and observe the rewards received in each episode. Let's consider the case of a simple grid world.

Simple 3x3 Gridworld
  • The values represent the rewards for transitioning into that state
  • the state with +10 rewards is the terminal state

Given this small grid world, our goal is to calculate the q-value of being in state S_0 and taking action right. Each diagram represents one complete episode for the purpose of illustration.

Recall that in MC methods we have to average the rewards. This means that we have to calculate the empirical mean instead of the expectation since we don't know the probabilities.

First-Vist MC vs Every-Visit MC

Each occurrence of a state-action pair in an episode is called a visit to that state-action pair. The number of times our agent visits a given state-action pair in an episode will change the way in which we update our q-values.

For First-Visit Monte Carlo, we will only consider the return from the first time that S_0 and action right were visited. For Every-Visit Monte Carlo, we will consider the returns from ALL visits to S_0 and action right. In the following example, we will perform every visit Monte Carlo Learning.

Gridworld Example

First, we will initialize all of our q_values to 0 and set a random stochastic policy 𝝿. We will play out 4 episodes and accumulate 4 returns. It will be easy now to calculate the incremental mean. The discount factor is taken as 1 for simple explanation purposes.

Example Episodes

Sample Return 1: G_t = 0 + 1+ 5 + 0 + 10 =16

Sample Return 2: G_t = 0 + 1 - 1 + 0 + 10 = 10

Sample Return 3: G_t = 0 + 1+ 5+ 0 - 1 + 0 - 2 + 5 + 10 = 18

Sample Return 4: G_t = 0 + 1 + 5 + 0 - 1 + 5 + 10 = 20

Estimating the q-value for s_0 taking action right

By using the incremental mean, we can estimate the q-value of being in state S_0 and taking action right. But a question arises, how will calculating the empirical mean guarantee us that the this q_value will be the accurate q_function approximation for the given policy?

According to the law of large numbers, as a sample size grows, its mean gets closer to the average of the whole population. In MC prediction, as the number of sample returns approaches infinity, the average of all of those returns will become more and more of an accurate estimate of the q-value.

This is the essence of Monte Carlo Methods!

We were able to evaluate the q_function WITHOUT having to know the dynamics of the environment.

Another way to find the mean value of total returns is to update the mean incremental using the following update rule:

Incremental Mean Updates
  • Q represents the q value for a state-action pair
  • G is the current return accumulated after the episode has finished
  • N is the total number of times that state-action pair has been encountered
  • G - Q is called the temporal difference as we are subtracting the old average return (Q) from the new return (G).

It is important to remember that the Q-values are updated after an episode has finished in Monte Carlo Learning! That is the only way we will get the return.

Another interesting thing to note is that once the value of N becomes relatively large, the temporal difference will not have much of an effect on updating the Q_value. This is a problem because updates in the earlier episodes will be favoured more than updates in later episodes making our algorithm biased. To combat this notion, we will multiply by a constant alpha instead of N:

The Monte Carlo Update Rule For The Q-Function

If you are familiar with gradient descent from machine learning, this update rule looks awfully similar to calculating the weights. You have your old value Q, the “loss” quantified as G - Q, and multiply that loss by a constant. The only difference is that we add the loss instead of subtracting. By adding the loss, we are able to get closer to the mean due to the incremental nature of calculating the mean.

Now that we are able to calculate the q-value for each state-action pair, all we have to do is run many episodes until the random q-function becomes the true q-function for policy 𝝿.

k-represents the number of episodes

Monte Carlo Control

Now that we have calculated our q-function, we have to now find the best policy that will maximize our q-values for each state. MC Control is basically policy improvement but with a subtle difference: we have to deal with the exploration-exploitation problem.

Exploration = choosing actions randomly from the whole action space

Exploitation = only choose the action that returns the highest q-value

Just like in Dynamic Programming, we are going to choose the action with that max q-value in a given state. However, we also have to account for trajectories that our agent has not taken yet.

For example, if an agent finds an action a_1 from state S_0 and receives a return of +5, it would make sense for that agent to exploit that action given that it would be the highest q-value for that given state. However, what if there is another action a_2 from state S_0 that will give an even better return of +10?

If our agent always decided to choose the action with the highest q-value (in this case would be +5 since our agent hasn’t explored all other actions), then it would actually be choosing the suboptimal action every time!

For this reason, we need to devise some method where the agent chooses the suboptimal action with some probability of the time in order to ensure that it explores all possible actions in a given state.

Epsilon-Greedy Strategy

One of the ways to deal with this exploration problem is through the epsilon greedy strategy. Let ε represent the probability that we are going to take a random action. This means that we are going to choose the most optimal action 1 - ε probability of the time. Therefore, our policy will become the following:

The |A(s)| represents the number of actions available from our action space. For example, if our agent is only able to move left or right, |A(s)| is going to equal 2.

We have to divide by the |A(s)| in order to account for the number of actions we have to ensure that the probabilities that we assign are uniformly distributed between all actions.

Once we define our policy like this, we are able to always allow our agent to explore the state space in order to ensure that it reaches the most optimal action every time!

Now that we have accounted for exploration vs exploitation, formalizing our Monte Carlo Control becomes easy. We are going to improve our policy by choosing the action that maximizes our q-value but then also choosing a random action ε probability of the time.

By choosing our actions greedily, we influence what rewards our agent is going to extract from the environment. This accumulation of rewards will change our q-value estimates since it is going to be fed into the update rule as mentioned in the equation above. Once we loop through MC prediction and control enough times, our policy will start to reach optimality just like in DP!

Full Monte Carlo Learning Loop

On Policy Monte Carlo Learning with ε-Greedy Exploration

Given that we are initializing a random policy and improving upon that same policy, this means that our algorithm is coined as an On-Policy algorithm. This means that our initial policy will be improved to the final policy (target policy = behavioural policy).

Now that we have formalized all steps, let's see how we would implement this algorithm!

Pseudocode for On-Policy MC Learning

Epsilon-greedy is just a subset of Epsilon soft policies. So don’t be thrown off in the pseudocode where it talks about initializing an epsilon soft policy.

After understanding all of the concepts and derivations, it should feel much more intuitive to apply this algorithm to code. Check out my video (linked above) on applying MC Learning to Blackjack!

Key Takeaways

  • Monte Carlo Learning a model-free algorithm that estimates value functions by averaging rewards from random samples of the interaction of our agent with the environment
  • Dynamic Programming can find optimal policies without any interaction with the environment. There is no learning involved. In MC, the value functions are estimated through sampled episodes. There is learning involved.
  • Policy iteration is composed of Policy Evaluation and Policy Improvement
  • MC Prediction is derived from Policy Evaluation and MC Control is derived from Policy Improvement
  • The Epsilon-Greedy strategy is one way of dealing with the exploration-exploitation trade-off

Hey, I’m Tawsif Kamal, an 18-year-old high school student who is passionate about using programming and mathematics to drive social change!

Thank you so much for reading this article! If you enjoyed it or learned something new, leave some claps, connect with me on LinkedIn, and see what I’m up to in my monthly newsletter!

Really Helpful Additional Resources

Tawsif Kamal

Reinforcement Learning enthusiast