Actor-Critic: The Best of Both Worlds
In Reinforcement Learning, we historically had to choose between Value-Based methods (like Q-Learning), which struggle with continuous action spaces, and Policy-Based methods (like REINFORCE), which suffer from high variance and slow learning. Actor-Critic bridges this gap by combining the two in a single agent.
The Actor (Policy)
The Actor is the agent's "muscle" and "intuition". It looks at the current state s and outputs a probability distribution over the available actions: π(a|s). The Actor's job is to decide what to do.
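As a concrete illustration, here is a minimal sketch of an Actor in PyTorch. The state dimension, action count, and hidden size (4, 2, 64) are placeholder values for illustration, not tied to any particular environment:

```python
import torch
import torch.nn as nn

# A tiny Actor: maps a state vector to a categorical distribution over actions.
# The sizes (state_dim=4, n_actions=2, hidden=64) are illustrative placeholders.
class Actor(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)  # pi(a|s)

actor = Actor()
state = torch.randn(1, 4)          # dummy observation
dist = actor(state)
action = dist.sample()             # sample an action from pi(a|s)
log_prob = dist.log_prob(action)   # saved for the policy-gradient update later
```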
The Critic (Value)
The Critic is the "evaluator". It doesn't choose actions; instead, it observes the state the agent is in and predicts how much total reward the agent can expect from that point onward: V(s). That estimate is what lets us tell the Actor how good its action was.
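A matching sketch for the Critic, under the same illustrative dimensions: it takes the same state input but outputs a single scalar estimate V(s).

```python
import torch
import torch.nn as nn

# A tiny Critic: maps the same state vector to a single scalar V(s).
# Dimensions are again illustrative placeholders.
class Critic(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)  # V(s): predicted return from this state

critic = Critic()
value = critic(torch.randn(1, 4))  # estimated return for a dummy state
```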
The Advantage Function
To train the Actor, we don't just use the raw reward. We use the Advantage: A(s, a) = r + γ·V(s') − V(s), i.e. the actual reward received, plus the discounted value of the next state, minus the value of the current state.
If the advantage is positive, the action was better than the Critic expected, so the Actor should increase that action's probability. If it is negative, the action was worse, and the probability should decrease. Using this baseline instead of raw returns greatly reduces the variance of the training updates.
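Putting the pieces together, a minimal one-step update might look like the sketch below. It reuses the actor, critic, state, and log_prob from the sketches above; gamma, the reward, and the next state are placeholder values, and terminal-state handling is omitted for brevity.

```python
import torch

# One-step update sketch, reusing `actor`, `critic`, `state`, and `log_prob`
# from the sketches above; gamma, reward, and next_state are placeholders.
gamma = 0.99
reward = 1.0
next_state = torch.randn(1, 4)

value = critic(state)                                # V(s)
with torch.no_grad():
    td_target = reward + gamma * critic(next_state)  # r + gamma * V(s')
    # (terminal-state handling omitted for brevity)

advantage = (td_target - value).detach()             # A = r + gamma*V(s') - V(s)

actor_loss = -(log_prob * advantage).mean()      # raise pi(a|s) when A > 0, lower it when A < 0
critic_loss = (td_target - value).pow(2).mean()  # regress V(s) toward the TD target
(actor_loss + critic_loss).backward()            # gradients for both networks in one pass
```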
⚙️ Architecture FAQ
What is the difference between A2C and A3C?
A3C (Asynchronous Advantage Actor-Critic): Multiple independent agents interact with their own copies of the environment and update a global network asynchronously.
A2C (Advantage Actor-Critic): A synchronous version. It waits for every worker to finish its segment of experience, then performs a single batched update of the global network. A2C is often preferred because the batched update utilizes GPUs more efficiently and is easier to implement and debug.
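A rough schematic of the synchronous pattern, reusing the actor and critic sketches above. The parallel environments are faked with random tensors purely to show the batching; this is an assumption-laden sketch, not a full training loop.

```python
import torch

# Synchronous (A2C-style) pattern: gather one transition from every worker,
# then perform a single batched update. A3C would instead have each worker
# push its own gradients to the global network asynchronously.
n_envs, state_dim, gamma = 8, 4, 0.99

states = torch.randn(n_envs, state_dim)        # one observation per worker (faked)
dists = actor(states)                          # batched pi(a|s) for all workers
actions = dists.sample()
log_probs = dists.log_prob(actions)

rewards = torch.randn(n_envs)                  # placeholder rewards
next_states = torch.randn(n_envs, state_dim)   # placeholder next observations

with torch.no_grad():
    td_targets = rewards + gamma * critic(next_states)
advantages = td_targets - critic(states)
loss = -(log_probs * advantages.detach()).mean() + advantages.pow(2).mean()
```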
Why do Actor-Critic methods use an Entropy bonus?
In policy gradients, agents can prematurely converge on a sub-optimal action. By adding an entropy bonus to the loss function, we penalize the Actor for being "too certain". This encourages the policy to remain slightly random, forcing the agent to continue exploring the environment.
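In code, the bonus is just one extra term in the loss. This sketch reuses the batched quantities from the A2C example above; entropy_coef is an illustrative hyperparameter (values around 0.01 are a common starting point, but that is an assumption, not a fixed rule).

```python
# Entropy bonus added to the combined loss, reusing `dists`, `log_probs`,
# and `advantages` from the A2C sketch above.
entropy_coef = 0.01  # illustrative value, typically tuned per task

entropy = dists.entropy().mean()     # high when pi(a|s) stays close to uniform
actor_loss = -(log_probs * advantages.detach()).mean()
critic_loss = advantages.pow(2).mean()

# Subtracting the entropy term rewards the policy for staying stochastic,
# which discourages premature collapse onto a single action.
loss = actor_loss + critic_loss - entropy_coef * entropy
```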