Dynamic Programming in Reinforcement Learning
"Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP)." β Sutton & Barto
The Core Concept: Bootstrapping
In Reinforcement Learning, Dynamic Programming is used to solve MDPs. The catch? You must have a perfect model of the environment. You need to know all transition probabilities $P(s'|s,a)$ and rewards $R(s,a)$.
DP algorithms work by bootstrapping: they update the value estimate of a state based on the current estimated values of its successor states, using the Bellman Equation as an update rule.
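To make this concrete, here is a minimal sketch of a tabular model and a single Bellman backup. The two-state `mdp`, its dict-of-lists layout, and the names `gamma` and `bellman_backup` are illustrative assumptions, not anything defined in the text:

```python
# A tabular model: mdp[s][a] is a list of (prob, next_state, reward) tuples,
# one convenient encoding of the dynamics p(s', r | s, a).
mdp = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.0)]},
}
gamma = 0.9  # discount factor

def bellman_backup(V, s, a):
    """Expected one-step return of action a in state s, bootstrapping
    from the current value estimates V of the successor states."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
```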
Policy Iteration
Policy Iteration consists of two interacting, alternating processes:
- Policy Evaluation: Computes the state-value function $v_\pi$ for an arbitrary policy $\pi$. It does this iteratively until the values stabilize. The update rule is:
  $v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$
- Policy Improvement: Uses the newly evaluated $v_\pi$ to create a new policy by acting greedily with respect to the value function; the new policy is guaranteed to be at least as good as $\pi$, and strictly better unless $\pi$ is already optimal.
This cycle repeats until the policy stops changing, guaranteeing convergence to the optimal policy $\pi^*$.
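A compact Policy Iteration loop, reusing the assumed `mdp`, `gamma`, and `bellman_backup` from the sketch above, might look like the following. For brevity it assumes a deterministic policy (a dict mapping each state to one action) rather than the stochastic $\pi(a|s)$ in the update rule:

```python
def policy_evaluation(policy, V, theta=1e-8):
    """Sweep the state space until V stabilizes under the current policy."""
    while True:
        delta = 0.0
        for s in mdp:
            v_new = bellman_backup(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V):
    """Return the policy that acts greedily with respect to V."""
    return {s: max(mdp[s], key=lambda a: bellman_backup(V, s, a)) for s in mdp}

def policy_iteration():
    policy = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary initial policy
    V = {s: 0.0 for s in mdp}
    while True:
        V = policy_evaluation(policy, V)
        new_policy = policy_improvement(V)
        if new_policy == policy:  # policy stable -> optimal
            return policy, V
        policy = new_policy
```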
Value Iteration
Policy Iteration can be slow because Policy Evaluation involves multiple sweeps through the state space. Value Iteration truncates the evaluation step to a single sweep and combines it directly with the improvement step.
Instead of calculating the expected return for a specific policy, it immediately takes the maximum expected return across all actions:
$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s')]$
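Value Iteration under the same assumed setup is a sketch like this; note how the `max` over actions replaces the policy-weighted sum used in Policy Evaluation:

```python
def value_iteration(theta=1e-8):
    """Each sweep evaluates and improves at once via the max over actions."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = max(bellman_backup(V, s, a) for a in mdp[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(mdp[s], key=lambda a: bellman_backup(V, s, a)) for s in mdp}
    return policy, V
```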
Frequently Asked Questions
What is the difference between Policy Iteration and Value Iteration?
Policy Iteration: Separates the process into complete evaluation (calculating $V$ for the current policy until convergence) and improvement (updating the policy greedily).
Value Iteration: Merges both steps. It updates the value function directly by taking the `max` over all actions at each state, essentially doing one sweep of evaluation and improvement simultaneously.
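On the toy two-state `mdp` from the sketches above (again, an illustrative example, not from the original text), both routines recover the same greedy policy:

```python
pi_pi, V_pi = policy_iteration()
pi_vi, V_vi = value_iteration()
print(pi_pi == pi_vi)  # True: both converge to the optimal policy
```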
Why isn't DP used for complex real-world problems?
DP suffers from the Curse of Dimensionality. As the number of states and actions grows (e.g., chess, Go, or robotics), sweeping the entire state space becomes computationally intractable. Additionally, DP requires a perfect model of the environment ($p(s',r|s,a)$), which is rarely available in the real world. This leads us to Model-Free methods like Q-Learning, which learn from sampled experience instead of a known model.
