Introduction to Reinforcement Learning
"Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal." - Sutton & Barto
The Paradigm Shift
Machine Learning is generally divided into three categories: Supervised Learning (learning from labeled data), Unsupervised Learning (finding hidden patterns), and Reinforcement Learning (RL). RL is fundamentally different because it is interactive. The algorithm, called an Agent, learns by interacting with an Environment and observing the results of its actions.
There is no supervisor explicitly telling the agent what to do. Instead, the agent discovers which actions yield the highest reward by trying them out.
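This interaction loop can be sketched in a few lines of Python. The environment below (`CoinFlipEnv`) and its `reset`/`step` methods are illustrative names, not a real library API; the point is the shape of the loop: the agent acts, the environment responds with a new state and a scalar reward, and no labels are ever provided.

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a coin flip; reward 1 if correct, else 0."""

    def reset(self):
        self.outcome = random.choice([0, 1])
        return 0  # a single, uninformative state

    def step(self, action):
        reward = 1 if action == self.outcome else 0
        self.outcome = random.choice([0, 1])  # next flip
        return 0, reward

env = CoinFlipEnv()
state = env.reset()
total_reward = 0
for _ in range(100):
    action = random.choice([0, 1])   # a random policy, for illustration
    state, reward = env.step(action)
    total_reward += reward           # scalar feedback, no "correct answer"
```

A learning agent would replace the random `action = random.choice(...)` line with a policy that improves as rewards accumulate.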
The Core Components
- Agent: The learner and decision-maker.
- Environment: The world the agent interacts with. It responds to actions and presents new situations to the agent.
- State (S): A representation of the current situation of the environment.
- Action (A): What the agent decides to do based on the state.
- Reward (R): A scalar feedback signal indicating how good or bad the latest action was.
- Policy (π): The agent's strategy or rulebook that maps states to actions.
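The components above can be made concrete with a tiny corridor world. Everything here is an illustrative sketch: states are cell indices, actions are moves, the policy is a lookup table, and the reward comes from the environment's dynamics.

```python
states = [0, 1, 2, 3]        # S: four cells in a corridor
actions = ["left", "right"]  # A
policy = {0: "right", 1: "right", 2: "right", 3: "right"}  # pi: state -> action

def environment_step(state, action):
    """Environment dynamics: move along the corridor; reward 1 at the goal."""
    next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 1 if next_state == 3 else 0  # R: scalar feedback
    return next_state, reward

# The agent follows its policy for three steps.
state = 0
for _ in range(3):
    state, reward = environment_step(state, policy[state])
# Three "right" moves reach state 3 and earn reward 1.
```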
FAQ
What is the exploration vs. exploitation tradeoff in RL?
To maximize reward, an agent must prefer actions it has tried in the past and found to be effective (exploitation). However, to discover such actions, it has to try actions it has not selected before (exploration). The agent has to balance exploiting what it already knows against exploring new actions to potentially find better rewards.
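One standard way to manage this balance is epsilon-greedy action selection: with probability epsilon the agent picks a random action (exploration), otherwise it picks the action with the highest estimated value (exploitation). The value estimates below are illustrative placeholders.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q = [0.2, 0.9, 0.5]                       # estimated value of each action
action = epsilon_greedy(q, epsilon=0.0)   # pure exploitation picks action 1
```

Setting epsilon to 0 never explores; setting it to 1 never exploits. In practice epsilon is often decayed over time, exploring heavily early and exploiting more as estimates improve.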
How does Reinforcement Learning differ from Supervised Learning?
In supervised learning, the model is provided with a dataset of inputs paired with the correct "answers" (labels). In reinforcement learning, there is no dataset of correct answers. The agent must generate its own data through interaction, relying on a delayed scalar reward signal to evaluate its behavior over time.
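The "delayed scalar reward" part is worth making concrete. Where supervised learning scores each prediction against a known label, RL typically scores a whole trajectory by its discounted return; the function below is the standard aggregation, with `gamma` as the discount factor and the reward sequence chosen for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards backwards, discounting later rewards by gamma per step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Reward arrives only at the final step, yet earlier steps still
# receive credit through discounting.
g0 = discounted_return([0, 0, 1], gamma=0.9)  # 0.81
```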
What is an MDP (Markov Decision Process)?
An MDP is a mathematical framework used to describe an environment in RL. It relies on the Markov Property, which states that the future dynamics of the system depend only on the current state and action, not on the sequence of events that preceded it.
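A small MDP can be written down as transition tables. In the sketch below (states, actions, and numbers are all invented for illustration), `P[s][a]` lists `(probability, next_state, reward)` triples; the Markov property is built in because the distribution depends only on the current state-action pair, never on history.

```python
P = {
    "sunny": {
        "walk": [(0.8, "sunny", 1.0), (0.2, "rainy", 0.0)],
        "stay": [(1.0, "sunny", 0.5)],
    },
    "rainy": {
        "walk": [(0.6, "rainy", -1.0), (0.4, "sunny", 0.0)],
        "stay": [(1.0, "rainy", 0.0)],
    },
}

def expected_reward(state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in P[state][action])

expected_reward("sunny", "walk")  # 0.8
```

Tables like this are the starting point for dynamic-programming methods such as value iteration, which compute optimal policies directly from the known transition model.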