
RL Series Part 1: A Deep Dive into Policy and Value Iteration


Welcome to the start of a new series on Reinforcement Learning! Our journey begins with a fundamental question: if we knew all the rules of a game, how could we determine the best possible strategy? This post dives into the mathematical framework for answering that, exploring how to find optimal solutions in known environments using the cornerstone algorithms of Policy and Value Iteration.


1. The Framework: Markov Decision Processes (MDPs)

Before we can solve anything, we need a formal language to describe the problem. In RL, this language is the Markov Decision Process (MDP). An MDP is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

An MDP is defined by a tuple $(S, A, P, R, \gamma)$:

- $S$: the set of states.
- $A$: the set of actions.
- $P(s'|s,a)$: the transition probability of landing in state $s'$ after taking action $a$ in state $s$.
- $R(s,a)$: the expected immediate reward for taking action $a$ in state $s$.
- $\gamma \in [0, 1)$: the discount factor, which trades off immediate against future rewards.

The agent’s goal is to learn a policy, denoted $\pi(a|s)$, which is a strategy that specifies what action to take in each state. Our ultimate objective is to find the optimal policy, $\pi^*$, that maximizes the cumulative discounted reward.
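Before moving on, it may help to see these objects in code. Below is a minimal sketch of one way to represent an MDP and a policy as NumPy arrays; the 2-state, 2-action MDP and every number in it are invented purely for illustration.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# R[s, a]     = expected immediate reward for taking action a in state s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# A stochastic policy pi[s, a] = pi(a|s); here, uniform over both actions.
pi = np.full((2, 2), 0.5)

# Sanity checks: every P[s, a, :] and every pi[s, :] is a probability distribution.
assert np.allclose(P.sum(axis=2), 1.0)
assert np.allclose(pi.sum(axis=1), 1.0)
```

Storing $P$ with shape $|S| \times |A| \times |S|$ turns each Bellman backup into a single vectorized expression: `P @ V` computes $\sum_{s'} P(s'|s,a)V(s')$ for every state-action pair at once.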

2. The Heart of RL: The Bellman Equations

To find the optimal policy, we first need a way to measure how good a policy is. We do this with two value functions:

- The state-value function $V^\pi(s)$: the expected cumulative discounted reward from starting in state $s$ and following policy $\pi$ thereafter.
- The action-value function $Q^\pi(s,a)$: the expected cumulative discounted reward from starting in state $s$, taking action $a$, and following $\pi$ thereafter.

These two functions are connected by the Bellman Expectation Equation, which provides a recursive definition of value:

$$\begin{aligned} V^\pi(s) = \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \right) \end{aligned}$$

This equation says that the value of a state under a policy $\pi$ is the sum of the expected immediate reward and the expected discounted future rewards.

While this tells us the value of a given policy, it doesn’t tell us how to find the best policy. For that, we need the Bellman Optimality Equation:

$$\begin{aligned} V^*(s) = \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s') \right) \end{aligned}$$

The only difference is the max operator. Instead of averaging over the policy’s actions, it chooses the best action at each step. If we can solve this equation, we can find the optimal policy. The algorithms below are methods for solving it.

3. Algorithm 1: Policy Iteration

Policy Iteration is a “two-step dance” that reliably finds the optimal policy. It iterates between evaluating a policy and then improving it.

Step 1: Policy Evaluation

First, we take a policy $\pi$ and compute its value function $V^\pi$. We start with an arbitrary value function $V_0$ and iteratively apply the Bellman Expectation Equation as an update rule:

$$\begin{aligned} V_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V_k(s') \right) \end{aligned}$$

We repeat this for all states until the value function converges (i.e., $V_{k+1} \approx V_k$).
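As a sketch of this loop, here is iterative policy evaluation in NumPy for a uniformly random policy on a hypothetical 2-state, 2-action MDP (all numbers invented):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'], R[s, a] (numbers invented).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)  # uniformly random policy pi(a|s)

V = np.zeros(2)  # arbitrary V_0
for _ in range(10_000):
    # Bellman expectation backup for every state at once:
    # V_{k+1}(s) = sum_a pi(a|s) (R(s,a) + gamma * sum_{s'} P(s'|s,a) V_k(s'))
    V_new = (pi * (R + gamma * (P @ V))).sum(axis=1)
    done = np.max(np.abs(V_new - V)) < 1e-10  # converged: V_{k+1} ≈ V_k
    V = V_new
    if done:
        break
```

At convergence, `V` satisfies the Bellman Expectation Equation up to the chosen threshold.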

Step 2: Policy Improvement

Now that we have the value function $V^\pi$ for our policy, we can improve the policy. For each state, we find the action that leads to the highest expected return, according to our just-calculated value function. This gives us a new, greedy policy $\pi'$:

$$\begin{aligned} \pi'(s) &= \arg\max_{a \in A} Q^\pi(s,a) \\ &= \arg\max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \right) \end{aligned}$$

The Policy Improvement Theorem guarantees that this new policy $\pi'$ is at least as good as, if not better than, the original policy $\pi$.

We repeat these two steps—evaluation and improvement—until the policy stabilizes. At that point, the policy is optimal.
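The full dance can be sketched in a few lines. One common variant, shown here on an invented 2-state, 2-action MDP, solves the evaluation step exactly as the linear system $V^\pi = R^\pi + \gamma P^\pi V^\pi$ instead of iterating:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n = P.shape[0]

def evaluate(policy):
    # Exact policy evaluation for a deterministic policy:
    # solve (I - gamma * P_pi) V = R_pi.
    idx = np.arange(n)
    P_pi, R_pi = P[idx, policy], R[idx, policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

policy = np.zeros(n, dtype=int)  # arbitrary initial deterministic policy
while True:
    V = evaluate(policy)                  # Step 1: policy evaluation
    Q = R + gamma * (P @ V)               # Q(s, a) from the evaluated V
    improved = Q.argmax(axis=1)           # Step 2: greedy policy improvement
    if np.array_equal(improved, policy):  # policy stable => optimal
        break
    policy = improved
```

Because a finite MDP has finitely many deterministic policies and each improvement step never makes the policy worse, this loop terminates after finitely many iterations.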

4. Algorithm 2: Value Iteration

Value Iteration is a more direct approach that combines the two steps of Policy Iteration into one. It directly iterates on the Bellman Optimality Equation:

$$\begin{aligned} V_{k+1}(s) = \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V_k(s') \right) \end{aligned}$$

We start with an arbitrary value function $V_0$ and repeatedly apply this update to all states. This process is guaranteed to converge to the optimal value function $V^*$ because the Bellman operator is a contraction mapping, which mathematically ensures it has a unique fixed point that the iterations will converge to.

Once we have $V^*$, we can extract the optimal policy $\pi^*$ by acting greedily with respect to it:

$$\begin{aligned} \pi^*(s) = \arg\max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^*(s') \right) \end{aligned}$$
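A minimal NumPy sketch of Value Iteration, including the greedy extraction step, on the same kind of invented 2-state, 2-action MDP:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'], R[s, a] (numbers invented).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)  # arbitrary V_0
for _ in range(10_000):
    # Bellman optimality backup: V_{k+1}(s) = max_a (R(s,a) + gamma * E[V_k(s')])
    V_new = (R + gamma * (P @ V)).max(axis=1)
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

# Extract the optimal policy by acting greedily with respect to V ≈ V*.
pi_star = (R + gamma * (P @ V)).argmax(axis=1)
```

Note that the loop never represents a policy explicitly; evaluation and improvement are fused into the single `max` backup.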

5. Why Does Value Iteration Converge? The Contraction Mapping Proof

We mentioned that Value Iteration is guaranteed to converge because the Bellman operator is a contraction mapping. This is a fundamental concept from mathematics that, when applied here, provides the theoretical bedrock for our algorithm. Let’s briefly walk through the proof.

A mapping (or operator) $B$ is a contraction if, for any two vectors $V_1$ and $V_2$, the following inequality holds:

$$\begin{aligned} ||BV_1 - BV_2||_\infty \le \gamma ||V_1 - V_2||_\infty \end{aligned}$$

where $0 \le \gamma < 1$. In plain English, applying the operator $B$ to any two value functions shrinks the distance between them by at least a factor of $\gamma$. The Banach Fixed-Point Theorem states that if an operator is a contraction on a complete metric space, it has a unique fixed point, and iterating the operator from any starting point will converge to that fixed point.

Proof for the Bellman Optimality Operator $B$

Let’s prove this for the Bellman optimality operator used in Value Iteration.

$$\begin{aligned} (BV)(s) = \max_{a \in A} \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right) \end{aligned}$$

Consider two arbitrary value functions, $V_1$ and $V_2$. Let’s look at the difference for a single state $s$:

$$\begin{aligned} |(BV_1)(s) - (BV_2)(s)| &= \left| \max_a \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_1(s') \right) - \max_a \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_2(s') \right) \right| \\ &\le \max_a \left| \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_1(s') \right) - \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_2(s') \right) \right| \\ &= \max_a \left| \gamma \sum_{s'} P(s'|s,a) (V_1(s') - V_2(s')) \right| \\ &\le \max_a \left( \gamma \sum_{s'} P(s'|s,a) |V_1(s') - V_2(s')| \right) \\ &\le \max_a \left( \gamma \sum_{s'} P(s'|s,a) ||V_1 - V_2||_\infty \right) \\ &= \gamma ||V_1 - V_2||_\infty \max_a \left( \sum_{s'} P(s'|s,a) \right) \\ &= \gamma ||V_1 - V_2||_\infty \end{aligned}$$

This holds for any state $s$, so it must also hold for the maximum over all states:

$$\begin{aligned} ||BV_1 - BV_2||_\infty = \max_s |(BV_1)(s) - (BV_2)(s)| \le \gamma ||V_1 - V_2||_\infty \end{aligned}$$

This completes the proof that the Bellman optimality operator $B$ is a contraction. A nearly identical, slightly simpler proof (it doesn’t involve the $\max$ operator) shows that the policy-specific Bellman operator $B^\pi$ is also a contraction. This is why both Value Iteration and Policy Evaluation are guaranteed to converge.
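The contraction property is also easy to check numerically. A sketch (on an invented toy MDP): apply $B$ to random pairs of value functions and verify that the $\infty$-norm distance shrinks by at least a factor of $\gamma$.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def B(V):
    # Bellman optimality operator: (BV)(s) = max_a (R(s,a) + gamma * E[V(s')]).
    return (R + gamma * (P @ V)).max(axis=1)

rng = np.random.default_rng(0)
for _ in range(1000):
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    before = np.max(np.abs(V1 - V2))       # ||V1 - V2||_inf
    after = np.max(np.abs(B(V1) - B(V2)))  # ||B V1 - B V2||_inf
    assert after <= gamma * before + 1e-12
```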

6. Practical Considerations

Applying these algorithms brings up important practical issues: each iteration requires a full sweep over every state, which becomes expensive for large state spaces; we need a convergence threshold to decide when to stop iterating; and, most importantly, both algorithms assume a complete model of the transition probabilities $P$ and rewards $R$.

7. How Good is a Greedy Policy? Performance Bounds

A fascinating theoretical question: if we stop Value Iteration early, we are left with an imperfect value function $V$. If we then extract the greedy policy $\pi$ from this imperfect $V$, how bad can that policy be compared to the true optimal policy $\pi^*$?

This is where the Bellman Residual (or Bellman error) comes in. It measures how much a value function $V$ fails to satisfy the Bellman Optimality Equation:

$$\begin{aligned} \varepsilon = ||BV - V||_\infty = \max_{s \in S} |(BV)(s) - V(s)| \end{aligned}$$

If $V = V^*$, the error $\varepsilon$ is zero; for any other $V$, it is positive. It turns out we can use this error to create a powerful performance guarantee.

The Performance Bound Theorem

The theorem states that for a greedy policy $\pi$ extracted from an arbitrary value function $V$:

$$\begin{aligned} V^\pi(s) \ge V^*(s) - \frac{2\varepsilon}{1 - \gamma} \end{aligned}$$

This result tells us that the performance of our greedy policy ($V^\pi$) is guaranteed to be no worse than a bounded amount below the true optimal performance ($V^*$). This performance gap is directly proportional to the Bellman error $\varepsilon$ of the value function we started with.

A Quick But Important Clarification: The Cast of Characters

A common point of confusion in these proofs is keeping track of the different value functions and policies. Let’s be explicit:

- $V$: an arbitrary, possibly imperfect value function (e.g., the result of stopping Value Iteration early).
- $\pi$: the greedy policy extracted from $V$.
- $V^\pi$: the true long-term value of actually following $\pi$.
- $V^*$, $\pi^*$: the optimal value function and the optimal policy.

The core of the proof relies on the fact that $\pi$ is greedy with respect to $V$, but its actual long-term value is $V^\pi$. The goal is to bound the difference between $V^\pi$ and $V^*$.

The Proof

Let’s walk through the proof, as it’s a great example of how these theoretical concepts connect. Our goal is to bound the “value loss,” $V^*(s) - V^\pi(s)$.

  1. Start with the value loss and apply the triangle inequality:

    $$\begin{aligned} |V^*(s) - V^\pi(s)| \le ||V^* - V^\pi||_\infty \le ||V^* - V||_\infty + ||V - V^\pi||_\infty \end{aligned}$$
  2. Bound each term separately. To do this, we need to prove a general and very useful lemma that bounds the distance between an arbitrary value function $V$ and a policy’s true value function $V^\pi$. The lemma states:

    $$\begin{aligned} ||V - V^\pi||_\infty \le \frac{||V - B^\pi V||_\infty}{1 - \gamma} \end{aligned}$$

    Let’s prove this, as it’s a classic RL proof technique.

    $$\begin{aligned} ||V - V^\pi||_\infty &= ||V - B^\pi V + B^\pi V - V^\pi||_\infty \\ &\le ||V - B^\pi V||_\infty + ||B^\pi V - V^\pi||_\infty \\ &= ||V - B^\pi V||_\infty + ||B^\pi V - B^\pi V^\pi||_\infty \\ &\le ||V - B^\pi V||_\infty + \gamma ||V - V^\pi||_\infty \end{aligned}$$

    Rearranging the final inequality gives the result:

    $$\begin{aligned} (1 - \gamma) ||V - V^\pi||_\infty &\le ||V - B^\pi V||_\infty \\ ||V - V^\pi||_\infty &\le \frac{||V - B^\pi V||_\infty}{1 - \gamma} \end{aligned}$$

    An identical proof (just replacing $\pi$ with $*$) shows the same relationship for the optimal value function:

    $$\begin{aligned} ||V - V^*||_\infty \le \frac{||V - BV||_\infty}{1 - \gamma} \end{aligned}$$
  3. Substitute the bounds back in.

    $$\begin{aligned} ||V^* - V^\pi||_\infty \le \frac{||V - BV||_\infty}{1 - \gamma} + \frac{||V - B^\pi V||_\infty}{1 - \gamma} \end{aligned}$$
  4. Use the greedy policy definition. Our policy $\pi$ was extracted greedily from $V$. This means that for any state, applying the policy-specific operator $B^\pi$ to $V$ is the same as applying the full optimality operator $B$; in other words, $B^\pi V = BV$. We can now substitute this into our inequality:

    $$\begin{aligned} ||V^* - V^\pi||_\infty &\le \frac{||V - BV||_\infty}{1 - \gamma} + \frac{||V - BV||_\infty}{1 - \gamma} \\ &= \frac{2||V - BV||_\infty}{1 - \gamma} \end{aligned}$$
  5. Final Result. Letting $\varepsilon = ||BV - V||_\infty$, we have shown that for any state $s$:

    $$\begin{aligned} V^*(s) - V^\pi(s) \le \frac{2\varepsilon}{1 - \gamma} \end{aligned}$$

    Rearranging this gives the final performance bound:

    $$\begin{aligned} V^\pi(s) \ge V^*(s) - \frac{2\varepsilon}{1 - \gamma} \end{aligned}$$

This provides a crucial stopping condition for our algorithms and a certificate of quality for our final policy.
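The bound can also be checked numerically. In this sketch (invented toy MDP again), we stop Value Iteration after only two sweeps, measure the Bellman residual $\varepsilon$ of the resulting $V$, and verify that the greedy policy’s true value respects the guarantee.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, n = 0.9, 2

def B(V):
    # Bellman optimality operator.
    return (R + gamma * (P @ V)).max(axis=1)

# Near-exact V* (many sweeps) versus a crude V (only two sweeps).
V_star = np.zeros(n)
for _ in range(2000):
    V_star = B(V_star)
V = B(B(np.zeros(n)))

eps = np.max(np.abs(B(V) - V))                    # Bellman residual of the crude V
pi_greedy = (R + gamma * (P @ V)).argmax(axis=1)  # greedy policy from the crude V

# True value of the greedy policy: solve (I - gamma * P_pi) V_pi = R_pi.
idx = np.arange(n)
V_pi = np.linalg.solve(np.eye(n) - gamma * P[idx, pi_greedy], R[idx, pi_greedy])

# The theorem: V_pi(s) >= V*(s) - 2*eps/(1 - gamma) in every state.
assert np.all(V_pi >= V_star - 2 * eps / (1 - gamma) - 1e-9)
```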

Numerical Intuition: The Cost of Myopia

The term $\frac{1}{1 - \gamma}$ is a crucial part of these bounds. It represents how much the single-step Bellman error $\varepsilon$ can be magnified into the final performance gap.

Consider a discount factor $\gamma = 0.9$. This represents an agent that is fairly farsighted. The error magnification term becomes $\frac{1}{1 - 0.9} = \frac{1}{0.1} = 10$. Let’s see how this plays out.

If we have a more myopic agent, with $\gamma = 0.5$, the magnification is only $\frac{1}{1 - 0.5} = 2$. A smaller $\gamma$ means future errors are discounted more heavily, so the single-step error has less impact on the total value. Conversely, as $\gamma$ approaches 1, the magnification factor approaches infinity, meaning that even tiny single-step errors can accumulate into a massive performance loss over an extremely long horizon. This gives us a concrete way to understand the trade-off between horizon and the precision required in our value function approximation.
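The magnification factor is plain arithmetic, which a short loop makes vivid:

```python
# Worst-case error magnification 1/(1 - gamma) from the performance bound.
for gamma in (0.5, 0.9, 0.99, 0.999):
    print(f"gamma = {gamma}: magnification = {1 / (1 - gamma):.1f}")
```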

8. A Deeper Look: Horizons and Stationarity

A crucial distinction in MDPs is whether the problem has a finite or infinite horizon. This choice fundamentally changes the nature of the optimal policy.

Can we find a $\gamma$ that makes the infinite-horizon policy match a finite-horizon one?

What’s Next?

Both Policy and Value Iteration are powerful algorithms, but they share a critical requirement: we must have a perfect model of the environment (the transition probabilities $P$ and the reward function $R$). In many real-world scenarios, we don’t have this.

In the next post, we’ll explore model-free reinforcement learning, where we learn optimal policies directly from experience, without ever needing to know the underlying dynamics of the world. This will lead us to foundational algorithms like Monte Carlo methods and Temporal Difference learning.

