Sometimes, the best way to solve a problem is to focus on the behavior itself. Policy Gradients skip the intermediate value estimates and optimize the action strategy (the policy) directly.
1. The Direct Strategy
In value-based methods like DQN, the agent learns 'how good is each action in this state?' and then acts greedily on those estimates. Policy Gradients (PG) cut out the value calculation: the network directly outputs a probability distribution over actions. During training, we use the return of each episode to 'push' the probabilities of the actions taken up when the return is high and down when it is low. This loosely resembles how humans learn skills: by trial and error, adjusting future behavior based on the outcome.
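To make the 'push' concrete, here is a minimal sketch assuming a softmax policy over three actions with one logit per action and no state input. The names `softmax` and `nudge`, the learning rate, and the example returns are illustrative choices, not part of any particular library. For a softmax policy, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot vector of that action minus the current probabilities.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nudge(logits, action, ret, lr=0.1):
    """One policy-gradient step: logits += lr * return * grad log pi(action)."""
    probs = softmax(logits)
    return [
        z + lr * ret * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0, 0.0, 0.0]          # uniform policy to start (assumed)
before = softmax(logits)
after_good = softmax(nudge(logits, action=1, ret=+10.0))
after_bad = softmax(nudge(logits, action=1, ret=-10.0))

print("before:     ", before)
print("return +10: ", after_good)  # probability of action 1 goes up
print("return -10: ", after_bad)   # probability of action 1 goes down
```

The same action is reinforced or suppressed depending only on the sign and size of the return, which is the core idea behind every policy-gradient method.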
2. The REINFORCE Loop
The most basic PG algorithm is REINFORCE. It follows a simple logic: 1) Play a full episode. 2) For every step, calculate the gradient of the log-probability of the action taken. 3) Multiply that gradient by the total return. 4) Take a gradient-ascent step on the policy parameters using the sum of these terms. As a result, actions from an episode with a $+10$ return receive a large 'boost' in probability, while actions from an episode with a $-10$ return are suppressed, each in proportion to the size of the return. Because it uses the full sampled return rather than a learned estimate, the REINFORCE gradient estimate is unbiased but can have very high variance.
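The loop above can be sketched on a toy problem. Everything about the environment here is an assumption made for illustration: a stateless softmax policy over two actions, episodes of five steps, and a reward of $+1$ for action 1 and $-1$ for action 0. The function names, learning rate, and episode count are likewise hypothetical. The update uses the total episode return, as described above, rather than the per-step reward-to-go that many implementations prefer.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(logits, rng, steps=5):
    """Step 1: play a full episode with the current policy."""
    probs = softmax(logits)
    actions, rewards = [], []
    for _ in range(steps):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(a)
        rewards.append(1.0 if a == 1 else -1.0)  # toy reward (assumed)
    return actions, rewards

def reinforce_update(logits, actions, rewards, lr=0.05):
    """Steps 2-4: sum grad log pi(a_t), scale by the return, ascend."""
    probs = softmax(logits)
    total_return = sum(rewards)
    grad = [0.0] * len(logits)
    for a in actions:
        for i, p in enumerate(probs):
            # d/d logit_i of log softmax(logits)[a] = 1[i == a] - p_i
            grad[i] += (1.0 if i == a else 0.0) - p
    return [z + lr * total_return * g for z, g in zip(logits, grad)]

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(episodes):
        actions, rewards = run_episode(logits, rng)
        logits = reinforce_update(logits, actions, rewards)
    return softmax(logits)

final_probs = train()
print("P(action 0), P(action 1):", final_probs)
```

On this toy task the policy should end up strongly preferring action 1. Individual updates can still be noisy, which is the variance problem in miniature; subtracting a baseline from the return is a common remedy.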
3. Handling the Infinite
Q-Learning struggles with continuous action spaces because its 'max' over actions would have to range over infinitely many values, which in general requires solving a separate optimization problem at every step. Policy Gradients thrive here. Instead of scoring a discrete list of actions, the network can output the parameters of a distribution (such as the mean and standard deviation of a Gaussian). The agent then samples from this distribution, allowing it to take precise, fluid actions like applying exactly $4.52$ Newtons of force to a motor.
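A minimal sketch of such a Gaussian policy follows. It assumes a one-dimensional state and action and uses a hypothetical linear map in place of a real network; the weight names and values are invented for illustration. The policy parameterizes the log of the standard deviation so that the standard deviation stays positive, which is a common convention rather than a requirement.

```python
import math
import random

def gaussian_policy_params(state, weights):
    """Stand-in for a network: map a state to (mean, std) of a Gaussian."""
    mean = weights["w_mean"] * state + weights["b_mean"]
    std = math.exp(weights["log_std"])  # exp keeps std strictly positive
    return mean, std

def sample_action(mean, std, rng):
    """Draw a continuous action from the policy distribution."""
    return rng.gauss(mean, std)

def log_prob(action, mean, std):
    """Log-density of the action; this is what the gradient step differentiates."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(0)
weights = {"w_mean": 2.0, "b_mean": 0.5, "log_std": math.log(0.1)}  # assumed values

mean, std = gaussian_policy_params(state=2.0, weights=weights)
force = sample_action(mean, std, rng)  # a real-valued force near the mean
print(f"mean={mean:.2f} N, std={std:.2f} N, sampled force={force:.3f} N")
print("log-probability of the sample:", log_prob(force, mean, std))
```

No maximization over actions is needed at any point: acting is a single sample, and learning only requires the log-probability of the action that was actually taken.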
