MODULE 1: REINFORCEMENT LEARNING /// REWARDS /// RETURN /// DISCOUNT FACTOR /// GAMMA /// BELLMAN EQUATION

Rewards & Return

The foundation of agent motivation. Learn to formulate Expected Return ($G_t$) and manipulate agent behavior via the Discount Factor ($\gamma$).


[Instructor]: In Reinforcement Learning, the agent's sole objective is to maximize the cumulative reward it receives over time.



Concept: The Reward

The scalar feedback signal $R_t$ passed from environment to agent.

Policy Check

What does the agent ultimately want to maximize?



Decoding the Reward Hypothesis

"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." - Richard S. Sutton

Immediate Reward vs. Return

In Reinforcement Learning, the environment provides a scalar signal to the agent at every step: the Reward ($R_t$). However, the agent's goal is NOT to maximize the immediate reward, but rather the cumulative sum of all future rewards.

This cumulative sum is known as the Return ($G_t$); because future rewards are uncertain, the agent actually maximizes its expected value, the Expected Return. Think of it like a game of chess: sacrificing a queen incurs a large negative immediate reward, but if it leads to checkmate (a large positive return), it is the optimal action.
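A toy sketch of the distinction, with made-up reward numbers (not from the lesson): judged by immediate reward alone, the queen sacrifice looks terrible, but judged by return it wins.

```python
# Hypothetical per-move rewards for two lines of play in a chess-like game.
line_a = [-9, 0, 100]  # sacrifice the queen (-9), then deliver checkmate (+100)
line_b = [0, 0, 0]     # keep the queen, game drifts to a draw

return_a = sum(line_a)  # 91
return_b = sum(line_b)  # 0

# Greedy on immediate reward, the agent picks line_b (0 > -9).
# Optimizing the return, it correctly picks line_a.
assert return_a > return_b
```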

The Infinity Problem

We divide RL tasks into two categories:

  • Episodic Tasks: Games like Chess or Pac-Man. They have a clear beginning and a terminal state (win/loss). The return is a finite sum.
  • Continuing Tasks: Tasks that never naturally terminate, like a robot managing server temperatures. If an agent receives a +1 reward every second forever, the Return $G_t$ diverges to infinity, and comparing infinite returns is meaningless.
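The divergence is easy to see numerically. This sketch (illustrative values only) also previews the fix: multiplying each future reward by a factor $\gamma < 1$ turns the sum into a convergent geometric series.

```python
# Continuing task: the agent receives R = +1 at every step, forever.
gamma = 0.9
steps = 10_000

# Undiscounted, the return just keeps growing with the number of steps.
undiscounted = sum(1 for _ in range(steps))        # equals `steps`, unbounded

# Discounted, it converges to the geometric-series limit 1 / (1 - gamma).
discounted = sum(gamma**k for k in range(steps))   # -> 1 / (1 - 0.9) = 10

assert undiscounted == steps
assert abs(discounted - 1 / (1 - gamma)) < 1e-6
```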

The Discount Factor (γ)

To solve the infinity problem, and to encode uncertainty about the future, we introduce the Discount Factor (Gamma, $\gamma$), a number between 0 and 1.

Each future reward is multiplied by $\gamma$ raised to the power of the number of steps into the future:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

A reward received $k$ steps into the future is scaled by $\gamma^k$. Because $\gamma < 1$, $\gamma^k$ shrinks as $k$ grows, so distant rewards are "worth less" today than immediate rewards, and as long as rewards are bounded the sum stays finite even for continuing tasks.
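As a minimal sketch, the discounted return can be computed with a backwards recursion (the helper name `discounted_return` is my own):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Walk backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 1 (no discounting), this is the plain sum of rewards.
print(discounted_return([1, 1, 1], gamma=1.0))  # 3.0
# With gamma = 0.5: 1 + 0.5*1 + 0.25*1 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

The backwards loop is the recursive form $G_t = R_{t+1} + \gamma G_{t+1}$, the identity that later underlies the Bellman equation.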

Core Concepts FAQ

What is the difference between Reward and Return in Reinforcement Learning?

Reward ($R_t$): The immediate, scalar feedback given by the environment at a single time step after an action is taken.

Return ($G_t$): The total, discounted sum of all future rewards from time step $t$ until the end of the episode. Agents optimize for Return, not immediate Reward.

How do you choose the right value for Gamma (γ)?

Setting $\gamma = 0$ makes the agent completely myopic (it only cares about the next immediate reward). Setting $\gamma$ closer to $1$ (e.g., $0.99$ or $0.999$) makes the agent far-sighted, giving heavy weight to long-term strategies. For most Deep RL tasks, a value between $0.95$ and $0.99$ is standard.
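A common rule of thumb (a heuristic, not stated in this lesson): $1/(1-\gamma)$ gives the agent's effective planning horizon in steps, which makes the far-sightedness of each $\gamma$ concrete.

```python
# Effective-horizon heuristic: rewards further than roughly 1/(1 - gamma)
# steps into the future contribute little to the return.
for gamma in (0.9, 0.95, 0.99, 0.999):
    horizon = 1 / (1 - gamma)
    print(f"gamma={gamma}: effective horizon ~ {horizon:.0f} steps")
```

So $\gamma = 0.99$ corresponds to caring about roughly the next 100 steps, while $\gamma = 0.999$ stretches that to about 1000.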

Neural Lexicon

Reward ($R_t$)
Immediate numerical feedback received from the environment at a given time step.

Return ($G_t$)
The cumulative, discounted sum of future rewards. The true objective function of the RL agent.

Discount Factor ($\gamma$)
A parameter (0 to 1) that controls the importance of future rewards compared to immediate rewards.

Episodic Task
A task that breaks naturally into identifiable episodes, reaching a terminal state.