An agent is only as good as its reward signal. Designing these signals—and understanding how they accumulate over time—is the core of AI guidance.
1The Reward Hypothesis
The Reward Hypothesis is a central idea in RL: all what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called Reward). Whether you want an AI to win at Go or drive a car, you must reduce that complex goal into a stream of numbers. If the reward function is perfect, maximizing it *is* the same as solving the problem.
2The Discount Factor (γ)
Why do we discount the future? Mathematically, the Discount Factor (γ) ensures that the sum of rewards (the Return) doesn't become infinite in tasks that never end. Perceptually, it models the 'Uncertainty' of the future. A reward today is worth more than a reward tomorrow because the environment might change. By setting γ, we control the agent's Horizon. A low γ makes the agent impulsive; a high γ makes it a strategic planner.
3Reward Hacking
AI is incredibly good at finding 'Shortcuts'. Reward Hacking occurs when an agent finds a way to get high rewards without actually performing the intended task. A classic example is an agent designed to play a racing game that discovers it can get infinite points by driving in circles at the start line rather than finishing the race. To prevent this, rewards must be designed to be Sparse (only at the goal) or carefully Shaped to prevent unintended behaviors.
