Decoding the Reward Hypothesis
"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." - Richard S. Sutton
Immediate Reward vs. Return
In Reinforcement Learning, the environment provides a scalar signal to the agent at every step: the Reward ($R_t$). However, the agent's goal is NOT to maximize the immediate reward, but rather the cumulative sum of all future rewards.
This cumulative sum is known as the Return ($G_t$); per the reward hypothesis, the agent maximizes its expected value. Think of it like a game of chess: sacrificing a queen gives a large negative immediate reward, but if it leads to checkmate (a large positive return), it is the optimal action.
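Written out in the standard Sutton & Barto notation, where $T$ is the final time step (possibly infinite), the return is simply the sum of the rewards collected after time $t$:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$$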
The Infinity Problem
We divide RL tasks into two categories:
- Episodic Tasks: Games like Chess or Pac-Man. They have a clear beginning and a terminal state (win/loss). The return is a finite sum.
- Continuing Tasks: Tasks that never naturally terminate, like a robot managing server temperatures. If an agent receives a +1 reward every second forever, the return $G_t$ grows without bound, and the math breaks down: you cannot rank policies whose returns are all infinite (see the sum below).
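To see the problem concretely, plug a constant reward of $+1$ into the undiscounted return defined above:

$$G_t = \sum_{k=0}^{\infty} R_{t+k+1} = 1 + 1 + 1 + \dots = \infty$$

Every policy that keeps collecting $+1$ forever achieves the same infinite return, so there is no basis for preferring one over another.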
The Discount Factor (γ)
To solve the infinity problem, and to encode uncertainty about the future, we introduce the Discount Factor (Gamma, or $\gamma$), a number between 0 and 1 (kept strictly below 1 in continuing tasks so the infinite sum converges).
Future rewards are multiplied by Gamma raised to a power that grows with their distance: in the standard convention, the next reward $R_{t+1}$ is undiscounted, and a reward received $k$ steps from now is weighted by $\gamma^{k-1}$ (so $R_{t+3}$ contributes $\gamma^2 R_{t+3}$). Because Gamma is less than 1, these weights shrink toward zero, meaning distant rewards are "worth less" today than immediate rewards.
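Putting this together gives the discounted return, again in the standard notation:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Revisiting the +1-per-second example: with $\gamma < 1$ this becomes a geometric series that converges, $\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$. For $\gamma = 0.99$, the return is a finite $100$ instead of infinity.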
❓ Core Concepts FAQ
What is the difference between Reward and Return in Reinforcement Learning?
Reward ($R_t$): The immediate, scalar feedback given by the environment at a single time step after an action is taken.
Return ($G_t$): The total, discounted sum of all future rewards from time step $t$ onward (until the terminal state in an episodic task, or over the infinite horizon in a continuing one). Agents optimize for Return, not immediate Reward.
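As a minimal sketch of the distinction (a hypothetical `discounted_return` helper in plain Python, not from any RL library), this shows how many individual Rewards collapse into a single Return:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the return G_0 from a list of per-step rewards.

    Iterating backwards uses the recursion G_t = R_{t+1} + gamma * G_{t+1},
    so each reward is folded in with one multiply and one add.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # five separate Rewards of +1
# They collapse into one Return: 1 + 0.9 + 0.81 + 0.729 + 0.6561 = 4.0951
print(discounted_return(rewards, gamma=0.9))  # 4.0951
```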
How do you choose the right value for Gamma (γ)?
Setting $\gamma = 0$ makes the agent completely myopic (it only cares about the next immediate reward). Setting $\gamma$ closer to $1$ (e.g., $0.99$ or $0.999$) makes the agent far-sighted, giving heavy weight to long-term strategies. For most Deep RL tasks, a value between $0.95$ and $0.99$ is standard.
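For intuition about these values, a quick plain-Python check (the step counts and the rule-of-thumb effective horizon $1/(1-\gamma)$ are illustrative assumptions) shows how much of a reward's value survives $k$ steps of discounting:

```python
# Weight gamma**k that a reward k steps in the future retains today,
# alongside the rule-of-thumb effective horizon 1 / (1 - gamma).
for gamma in (0.9, 0.95, 0.99):
    horizon = 1 / (1 - gamma)
    weights = ", ".join(f"k={k}: {gamma**k:.5f}" for k in (10, 100, 500))
    print(f"gamma={gamma} (horizon ≈ {horizon:.0f} steps) -> {weights}")
```

With $\gamma = 0.99$, a reward 100 steps away still retains about 37% of its value; with $\gamma = 0.9$, only about 0.003%. This is why tasks with long-horizon dependencies need Gamma close to 1.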