Dynamic Programming in Reinforcement Learning
"Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP)." β Sutton & Barto
The Core Concept: Bootstrapping
In Reinforcement Learning, Dynamic Programming is used to solve MDPs. The catch? You must have a perfect model of the environment. You need to know all transition probabilities $P(s'|s,a)$ and rewards $R(s,a)$.
DP algorithms work by bootstrapping: they update the value estimate of a state based on the current estimated values of its successor states, using the Bellman Equation as an update rule.
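To make this concrete, here is a minimal sketch of a tabular model and a single Bellman backup. The two-state `mdp`, its dict-of-lists layout, and the names `gamma` and `bellman_backup` are illustrative assumptions, not anything defined in the text:

```python
# A tabular model: mdp[s][a] is a list of (prob, next_state, reward) tuples,
# one convenient encoding of the dynamics p(s', r | s, a).
mdp = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.0)]},
}
gamma = 0.9  # discount factor

def bellman_backup(V, s, a):
    """Expected one-step return of action a in state s, bootstrapping
    from the current value estimates V of the successor states."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
```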
Policy Iteration
Policy Iteration consists of two interacting, alternating processes:
- Policy Evaluation: Computes the state-value function $v_\pi$ for an arbitrary policy $\pi$. It does this iteratively until the values stabilize. The update rule is:
  $v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$
- Policy Improvement: Uses the newly evaluated $v_\pi$ to create a new policy by acting greedily with respect to the value function; the new policy is guaranteed to be at least as good as $\pi$, and strictly better unless $\pi$ is already optimal.
This cycle repeats until the policy stops changing, guaranteeing convergence to the optimal policy $\pi^*$.
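A compact Policy Iteration loop, reusing the assumed `mdp`, `gamma`, and `bellman_backup` from the sketch above, might look like the following. For brevity it assumes a deterministic policy (a dict mapping each state to one action) rather than the stochastic $\pi(a|s)$ in the update rule:

```python
def policy_evaluation(policy, V, theta=1e-8):
    """Sweep the state space until V stabilizes under the current policy."""
    while True:
        delta = 0.0
        for s in mdp:
            v_new = bellman_backup(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V):
    """Return the policy that acts greedily with respect to V."""
    return {s: max(mdp[s], key=lambda a: bellman_backup(V, s, a)) for s in mdp}

def policy_iteration():
    policy = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary initial policy
    V = {s: 0.0 for s in mdp}
    while True:
        V = policy_evaluation(policy, V)
        new_policy = policy_improvement(V)
        if new_policy == policy:  # policy stable -> optimal
            return policy, V
        policy = new_policy
```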
Value Iteration
Policy Iteration can be slow because Policy Evaluation involves multiple sweeps through the state space. Value Iteration truncates the evaluation step to a single sweep and combines it directly with the improvement step.
Instead of calculating the expected return for a specific policy, it immediately takes the maximum expected return across all actions:
$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$
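Value Iteration under the same assumed setup is a sketch like this; note how the `max` over actions replaces the policy-weighted sum used in Policy Evaluation:

```python
def value_iteration(theta=1e-8):
    """Each sweep evaluates and improves at once via the max over actions."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = max(bellman_backup(V, s, a) for a in mdp[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(mdp[s], key=lambda a: bellman_backup(V, s, a)) for s in mdp}
    return policy, V
```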
Frequently Asked Questions
What is the difference between Policy Iteration and Value Iteration?
Policy Iteration: Separates the process into complete evaluation (calculating $V$ for the current policy until convergence) and improvement (updating the policy greedily).
Value Iteration: Merges both steps. It updates the value function directly by taking the `max` over all actions at each state, essentially doing one sweep of evaluation and improvement simultaneously.
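On the toy two-state `mdp` from the sketches above (again, an illustrative example, not from the original text), both routines recover the same greedy policy:

```python
pi_pi, V_pi = policy_iteration()
pi_vi, V_vi = value_iteration()
print(pi_pi == pi_vi)  # True: both converge to the optimal policy
```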
Why isn't DP used for complex real-world problems?
DP suffers from the Curse of Dimensionality. As the number of states and actions grows (e.g., chess, Go, or robotics), sweeping the entire state space becomes computationally intractable. Additionally, DP requires a perfect model of the environment ($p(s',r|s,a)$), which is rarely available in the real world. This leads us to Model-Free methods like Q-Learning, which learn from sampled experience instead of a known model.
