Policy Gradients: Direct Optimization
"While Q-learning asks 'What is the value of this action?', Policy Gradients ask 'Which action should I take right now?' By directly optimizing the policy, we can handle continuous action spaces and stochastic environments."
The Objective Function
Our goal in RL is to find a policy $\pi_\theta(a|s)$ that maximizes the expected cumulative return $J(\theta)$. The Policy Gradient Theorem gives us the gradient of this objective with respect to our network weights $\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(A_t|S_t) \, G_t\right]$$
This equation is profound. It tells us that to improve our policy, we should move our weights in the direction of the gradient of the log-probability of the action taken ($\nabla_\theta \log \pi_\theta$), scaled by how good the outcome was ($G_t$).
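To make the update concrete, consider a toy 3-armed bandit where the logits themselves are the parameters $\theta$. For a softmax policy, the score $\nabla_\theta \log \pi_\theta(a)$ has the closed form $\text{onehot}(a) - \pi$, so a single gradient-ascent step fits in a few lines of plain Python (the names `softmax` and `grad_log_prob` are ours, not from any library):

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, action):
    # For a softmax policy whose parameters ARE the logits:
    # d/d theta_i of log pi(a) = 1[i == a] - pi(i)
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

theta = [0.0, 0.0, 0.0]          # start with a uniform policy
action, G, alpha = 1, 2.0, 0.1   # sampled action, its return, learning rate

# theta <- theta + alpha * G * grad log pi(action)
g = grad_log_prob(theta, action)
theta = [t + alpha * G * gi for t, gi in zip(theta, g)]
```

After this single step, the probability of the rewarded action rises above its initial 1/3, which is exactly the "reinforce what worked" behavior the equation describes.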
The REINFORCE Algorithm
REINFORCE is a Monte Carlo implementation of the Policy Gradient theorem. Because we cannot compute the true expectation $\mathbb{E}$ over all possible trajectories, we approximate it by sampling entire episodes.
- Generate an Episode: Follow the current policy $\pi_\theta$ to generate a trajectory $S_0, A_0, R_1, ..., S_T$.
- Calculate Return: For each step $t$, compute $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$.
- Update Policy: Apply gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(A_t|S_t) G_t$.
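The return calculation in step 2 is usually done in a single backward pass over the episode rather than with nested sums. A minimal sketch in plain Python (the function name `discounted_returns` is ours), where `rewards[t]` holds $R_{t+1}$:

```python
def discounted_returns(rewards, gamma=0.99):
    # G_t = R_{t+1} + gamma * G_{t+1}, computed backwards in O(T)
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# e.g. three steps of reward 1.0 with gamma = 0.5
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each $G_t$ then scales the corresponding $\nabla_\theta \log \pi_\theta(A_t|S_t)$ term in the step-3 update.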
❓ Frequently Asked Questions (RL)
What is the difference between DQN and REINFORCE?
DQN (Value-Based): Learns the expected value of an action, then implicitly defines the policy (e.g., take the action with the highest Q-value). Struggles in continuous action spaces, where maximizing over all actions is intractable.
REINFORCE (Policy-Based): Directly outputs the probability of taking an action. Can naturally handle stochastic policies and continuous action spaces.
Why do we take the log probability instead of just probability?
The `log` arises from the mathematical derivation of the Policy Gradient Theorem (often called the "log-derivative trick"). Practically, it improves numerical stability, turns products of probabilities into sums, and makes gradient calculations much cleaner in deep learning frameworks like PyTorch.
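A quick way to see the numerical-stability point: the probability of a long trajectory is a product of many small numbers, which underflows in floating point, while the equivalent sum of log-probabilities stays finite. Illustrative numbers only:

```python
import math

probs = [1e-4] * 200          # 200 steps, each action has probability 1e-4

product = 1.0
for p in probs:
    product *= p              # 1e-800 underflows double precision to 0.0

log_sum = sum(math.log(p) for p in probs)   # about -1842, perfectly finite
```

This is why deep learning frameworks work with `log_prob` directly instead of multiplying raw probabilities.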
What is the "High Variance" problem in REINFORCE?
Because REINFORCE uses full episode Monte Carlo sampling, the return $G_t$ can vary wildly from episode to episode depending on random environmental factors. This high variance leads to noisy gradients and slow learning. The standard solution is to subtract a "Baseline" from the return.
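Why a baseline helps, in miniature: subtracting the mean return from each $G_t$ leaves the expected gradient direction unchanged but shrinks the magnitude of the multiplier on each gradient term, since $\mathbb{E}[(G - b)^2]$ is minimized at $b = \mathbb{E}[G]$. A sketch with illustrative numbers:

```python
# returns from five hypothetical episodes
returns = [10.0, 12.0, 9.0, 11.0, 13.0]

baseline = sum(returns) / len(returns)        # mean return, here 11.0
advantages = [g - baseline for g in returns]  # centered multipliers

# mean squared magnitude of the gradient multiplier, before and after
raw_scale = sum(g * g for g in returns) / len(returns)        # E[G^2]
adv_scale = sum(a * a for a in advantages) / len(advantages)  # Var(G)
```

Here `adv_scale` is far smaller than `raw_scale`, so the per-episode gradient updates are much less noisy; in practice the baseline is often a learned state-value estimate rather than a batch mean.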