
Policy Gradients

Skip value estimation and optimize the behavior directly. Master the REINFORCE algorithm and continuous action spaces.


Tutor: Unlike value-based methods such as DQN, which predict the value of each action, Policy Gradient methods train a neural network to directly output a probability for each action.



Concept: The Policy

A parameterized policy $\pi_\theta$ directly maps states to probabilities of actions.
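
As a concrete (if minimal) sketch, a discrete-action policy network in PyTorch might look like the following; the layer sizes and the names `state_dim` / `n_actions` are placeholders for illustration, not part of the lesson's code:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores (logits) into the action probabilities pi_theta(a|s).
        return torch.softmax(self.net(state), dim=-1)
```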

System Check

What is the primary advantage of a Policy-Based agent over a Value-Based (Q-learning) agent?


Policy Gradients: Direct Optimization

"While Q-learning asks 'What is the value of this action?', Policy Gradients ask 'Which action should I take right now?' By directly optimizing the policy, we can handle continuous action spaces and stochastic environments."

The Objective Function

Our goal in RL is to find a policy $\pi_\theta(a|s)$ that maximizes the expected cumulative return $J(\theta)$. Using the Policy Gradient Theorem, we can calculate the gradient of this objective with respect to our network weights $\theta$:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a|s) \, G_t]$

This equation is profound. It tells us that to improve our policy, we should move our weights in the direction of the gradient of the log-probability of the action taken ($\nabla_\theta \log \pi_\theta$), scaled by how good the outcome was ($G_t$).
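
In a framework like PyTorch, which minimizes losses, the same idea is usually written as descent on the negated term. A minimal sketch, assuming `probs` comes from a policy network such as the one above and `G_t` is a sampled return (the numbers here are purely illustrative):

```python
import torch
from torch.distributions import Categorical

probs = torch.tensor([0.1, 0.7, 0.2])   # illustrative pi_theta(.|s); in practice, the network output
G_t = 5.0                                # illustrative sampled return for this timestep

dist = Categorical(probs)
action = dist.sample()                   # a ~ pi_theta(.|s)
loss = -dist.log_prob(action) * G_t      # negate: optimizers minimize, we want ascent on J(theta)

# When `probs` is produced by the policy network, loss.backward() accumulates
# grad_theta log pi_theta(a|s) scaled by G_t into the network parameters.
```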

The REINFORCE Algorithm

REINFORCE is a Monte Carlo implementation of the Policy Gradient Theorem. Because we cannot compute the true expectation $\mathbb{E}$ over all possible trajectories, we approximate it by sampling entire episodes and following the three steps below (a PyTorch sketch follows the list).

  • Generate an Episode: Follow current policy $\pi_\theta$ to generate $S_0, A_0, R_1, ..., S_T$.
  • Calculate Return: For each step $t$, compute $G_t = \sum_{k = t + 1}^{T} \gamma^{k - t - 1} R_k$.
  • Update Policy: Apply gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(A_t|S_t) G_t$.
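
One way these three steps could look in PyTorch for a discrete-action environment is sketched below. It is a sketch rather than a definitive implementation: it assumes the Gymnasium-style API (where `reset` returns `(obs, info)` and `step` returns a 5-tuple) and a policy network like the one sketched earlier.

```python
import torch
from torch.distributions import Categorical

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    """Run one episode, then apply the Monte Carlo policy-gradient update."""
    log_probs, rewards = [], []

    # Step 1: generate an episode by following the current policy.
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # Step 2: compute discounted returns G_t by sweeping backwards over the rewards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Step 3: gradient ascent on J(theta) == gradient descent on its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```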

❓ Frequently Asked Questions (RL)

What is the difference between DQN and REINFORCE?

DQN (Value-Based): Learns the expected value of each action and then defines the policy implicitly (e.g., take the action with the highest Q-value). It struggles in continuous action spaces, where maximizing over actions is intractable.

REINFORCE (Policy-Based): Directly outputs the probability of taking an action. Can naturally handle stochastic policies and continuous action spaces.
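
To make the continuous-action point concrete, one common pattern (sketched here, not prescribed by the lesson) is to have the network output the mean of a Gaussian, learn a log standard deviation, and sample actions from that distribution:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Outputs a Normal distribution over a continuous action vector."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std: one modeling choice

    def forward(self, state: torch.Tensor) -> Normal:
        h = self.body(state)
        return Normal(self.mean(h), self.log_std.exp())

# Usage mirrors the discrete case:
#   dist = policy(state); action = dist.sample(); logp = dist.log_prob(action).sum()
```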

Why do we take the log probability instead of just probability?

The `log` arises from the mathematical derivation of the Policy Gradient Theorem (often called the "log-derivative trick"). Practically, it improves numerical stability, turns products of probabilities into sums, and makes gradient calculations much cleaner in deep learning frameworks like PyTorch.
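
For intuition, here is the one-line derivation the trick enables, sketched over a finite set of trajectories $\tau$ with probability $P_\theta(\tau)$ (the environment's dynamics terms inside $\log P_\theta(\tau)$ do not depend on $\theta$, so only the policy's log-probabilities survive the gradient):

$\nabla_\theta \mathbb{E}_{\pi_\theta}[G(\tau)] = \nabla_\theta \sum_\tau P_\theta(\tau)\, G(\tau) = \sum_\tau P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, G(\tau) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log P_\theta(\tau)\, G(\tau)]$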

What is the "High Variance" problem in REINFORCE?

Because REINFORCE uses full episode Monte Carlo sampling, the return $G_t$ can vary wildly from episode to episode depending on random environmental factors. This high variance leads to noisy gradients and slow learning. The standard solution is to subtract a "Baseline" from the return.
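
A common minimal version of this fix is to standardize the returns within each episode before the update; a sketch, assuming `returns` and `log_probs` come from a loop like the one above (the exact normalization is a heuristic, not part of the theorem):

```python
import torch

returns = torch.tensor([12.0, 9.5, 7.0, 4.0])  # illustrative episode returns G_t

# Subtracting a baseline (here the episode mean) leaves the gradient unbiased;
# dividing by the standard deviation is a common extra variance-reduction heuristic.
advantages = (returns - returns.mean()) / (returns.std() + 1e-8)

# The update then scales each log-probability by the advantage instead of the raw return:
# loss = -(torch.stack(log_probs) * advantages).sum()
```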

RL Glossary

Policy Network
A neural network that maps states to action probabilities $\pi_\theta(a|s)$.
Monte Carlo
A method that relies on repeated random sampling (full episodes) to obtain numerical results.
Categorical Distribution
A discrete probability distribution representing probabilities of mutually exclusive events (actions).
Baseline
A value (often the state value $V(s)$) subtracted from the return $G_t$ to reduce gradient variance without introducing bias.
Trajectory
A sequence of states, actions, and rewards experienced by the agent during an episode.
Log-Derivative Trick
The mathematical substitution $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ that allows us to sample the expectation in the Policy Gradient Theorem.