Actor-Critic: The Best of Both Worlds
In Reinforcement Learning, we historically had to choose between Value-Based methods (like Q-Learning), which struggle with continuous action spaces, and Policy-Based methods (like REINFORCE), which suffer from high variance and slow learning. Actor-Critic bridges this gap by combining the two in a single agent.
The Actor (Policy)
The Actor is the agent's "muscle" and "intuition". It looks at the current state s and outputs a probability distribution over the available actions: π(a|s). The Actor's job is to decide what to do.
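As a concrete illustration, here is a minimal sketch of an Actor in PyTorch. The state dimension, action count, and hidden size (4, 2, 64) are placeholder values for illustration, not tied to any particular environment:

```python
import torch
import torch.nn as nn

# A tiny Actor: maps a state vector to a categorical distribution over actions.
# The sizes (state_dim=4, n_actions=2, hidden=64) are illustrative placeholders.
class Actor(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)  # pi(a|s)

actor = Actor()
state = torch.randn(1, 4)          # dummy observation
dist = actor(state)
action = dist.sample()             # sample an action from pi(a|s)
log_prob = dist.log_prob(action)   # saved for the policy-gradient update later
```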
The Critic (Value)
The Critic is the "evaluator". It doesn't choose actions; instead, it observes the state the agent is in and predicts how much total reward the agent can expect from that point onward: V(s). That estimate is what lets us tell the Actor how good its action was.
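A matching sketch for the Critic, under the same illustrative dimensions: it takes the same state input but outputs a single scalar estimate V(s).

```python
import torch
import torch.nn as nn

# A tiny Critic: maps the same state vector to a single scalar V(s).
# Dimensions are again illustrative placeholders.
class Critic(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)  # V(s): predicted return from this state

critic = Critic()
value = critic(torch.randn(1, 4))  # estimated return for a dummy state
```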
The Advantage Function
To train the Actor, we don't just use the raw reward. We use the Advantage: A(s, a) = r + γ·V(s') − V(s), i.e. the actual reward received, plus the discounted value of the next state, minus the value of the current state.
If the advantage is positive, the action was better than the Critic expected, so the Actor should increase that action's probability. If it is negative, the action was worse, and the probability should decrease. Using this baseline instead of raw returns greatly reduces the variance of the training updates.
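Putting the pieces together, a minimal one-step update might look like the sketch below. It reuses the actor, critic, state, and log_prob from the sketches above; gamma, the reward, and the next state are placeholder values, and terminal-state handling is omitted for brevity.

```python
import torch

# One-step update sketch, reusing `actor`, `critic`, `state`, and `log_prob`
# from the sketches above; gamma, reward, and next_state are placeholders.
gamma = 0.99
reward = 1.0
next_state = torch.randn(1, 4)

value = critic(state)                                # V(s)
with torch.no_grad():
    td_target = reward + gamma * critic(next_state)  # r + gamma * V(s')
    # (terminal-state handling omitted for brevity)

advantage = (td_target - value).detach()             # A = r + gamma*V(s') - V(s)

actor_loss = -(log_prob * advantage).mean()      # raise pi(a|s) when A > 0, lower it when A < 0
critic_loss = (td_target - value).pow(2).mean()  # regress V(s) toward the TD target
(actor_loss + critic_loss).backward()            # gradients for both networks in one pass
```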
⚙️ Architecture FAQ
What is the difference between A2C and A3C?
A3C (Asynchronous Advantage Actor-Critic): Multiple independent agents interact with their own copies of the environment and update a global network asynchronously.
A2C (Advantage Actor-Critic): A synchronous version. It waits for every worker to finish its segment of experience, then performs a single batched update of the global network. A2C is often preferred because the batched update utilizes GPUs more efficiently and is easier to implement and debug.
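A rough schematic of the synchronous pattern, reusing the actor and critic sketches above. The parallel environments are faked with random tensors purely to show the batching; this is an assumption-laden sketch, not a full training loop.

```python
import torch

# Synchronous (A2C-style) pattern: gather one transition from every worker,
# then perform a single batched update. A3C would instead have each worker
# push its own gradients to the global network asynchronously.
n_envs, state_dim, gamma = 8, 4, 0.99

states = torch.randn(n_envs, state_dim)        # one observation per worker (faked)
dists = actor(states)                          # batched pi(a|s) for all workers
actions = dists.sample()
log_probs = dists.log_prob(actions)

rewards = torch.randn(n_envs)                  # placeholder rewards
next_states = torch.randn(n_envs, state_dim)   # placeholder next observations

with torch.no_grad():
    td_targets = rewards + gamma * critic(next_states)
advantages = td_targets - critic(states)
loss = -(log_probs * advantages.detach()).mean() + advantages.pow(2).mean()
```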
Why do Actor-Critic methods use an Entropy bonus?
In policy gradients, agents can prematurely converge on a sub-optimal action. By adding an entropy bonus to the loss function, we penalize the Actor for being "too certain". This encourages the policy to remain slightly random, forcing the agent to continue exploring the environment.
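In code, the bonus is just one extra term in the loss. This sketch reuses the batched quantities from the A2C example above; entropy_coef is an illustrative hyperparameter (values around 0.01 are a common starting point, but that is an assumption, not a fixed rule).

```python
# Entropy bonus added to the combined loss, reusing `dists`, `log_probs`,
# and `advantages` from the A2C sketch above.
entropy_coef = 0.01  # illustrative value, typically tuned per task

entropy = dists.entropy().mean()     # high when pi(a|s) stays close to uniform
actor_loss = -(log_probs * advantages.detach()).mean()
critic_loss = advantages.pow(2).mean()

# Subtracting the entropy term rewards the policy for staying stochastic,
# which discourages premature collapse onto a single action.
loss = actor_loss + critic_loss - entropy_coef * entropy
```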