How do you learn to play like a pro while you are still a rookie? Q-Learning is one answer: it is among the most widely used reinforcement learning algorithms for discovering an optimal action strategy through trial and error.
1. The Q-Value Foundation
The 'Q' in Q-Learning stands for Quality. We want to know the quality of an action $a$ in a state $s$. We store these values in a Q-Table, a grid where rows are states and columns are actions. Initially, the table is full of zeros (the agent knows nothing). As the agent explores, it fills the table with the 'Expected Future Return' for every action, eventually creating a complete map of the best possible moves for any situation.
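As a rough illustration of the table described above, here is a minimal sketch in plain Python. The problem size (4 states, 2 actions) and the helper names `make_q_table` and `best_action` are assumptions made up for this example, not part of any particular library.

```python
# Hypothetical toy problem: the sizes below are illustrative assumptions.
N_STATES = 4
N_ACTIONS = 2


def make_q_table(n_states, n_actions):
    """Return a table of zeros: one row per state, one column per action."""
    return [[0.0 for _ in range(n_actions)] for _ in range(n_states)]


def best_action(q_table, state):
    """Return the index of the highest Q-value in the row for `state`.

    Ties are broken in favour of the lowest action index.
    """
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


q_table = make_q_table(N_STATES, N_ACTIONS)

# Pretend the agent has already learned that action 1 is good in state 2.
q_table[2][1] = 0.5
```

Once the table has been filled in, reading off the "map of best moves" is just a matter of calling `best_action` for each state.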
2. The Off-Policy Secret
What makes Q-Learning special is that it is Off-Policy. This means it learns about the Optimal Policy (the best way to win) while following a different Behavior Policy (which includes random exploration). The update rule uses the max of the next state's Q-values: it assumes that from the next step onward the agent will always pick the action it currently rates highest, even though right now it is still exploring. This allows the agent to learn the 'true' best strategy even from a path full of exploratory mistakes.
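The update rule mentioned above is usually written as $Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,]$. The sketch below is one possible way to code it. The learning rate $\alpha$ and discount factor $\gamma$ are not discussed in the text, so the defaults here (0.1 and 0.9) are assumptions, as are the function name `q_update` and the `done` flag for terminal states.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-Learning update in place and return the new Q-value.

    alpha (learning rate) and gamma (discount factor) are assumed defaults.
    """
    # Off-policy step: use the best next-state value, regardless of which
    # action the exploring agent will actually take next.
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Tiny example: two states, two actions, hand-picked values.
q = [[0.0, 0.0],
     [0.0, 1.0]]
new_value = q_update(q, state=0, action=0, reward=0.0, next_state=1)
# new_value is roughly 0.09: a tenth of the way toward the target 0.9
```

Note that the target uses `max(q_table[next_state])` even if the agent's next move turns out to be random. That single `max` is what makes the method off-policy.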
3. Epsilon-Greedy Strategy
If an agent finds a small reward, it might stop looking for a bigger one. This is the 'Local Optima' trap. To avoid this, we use $\epsilon$-Greedy Exploration. With probability $\epsilon$ (often a small value such as 0.1), the agent ignores its table and takes a Random Action. With probability $1-\epsilon$, it takes the best action it currently knows. Over time, we usually 'decay' $\epsilon$, so the agent explores less as it becomes more confident in its knowledge.
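The selection rule and the decay step can be sketched in a few lines of Python. The function names, the multiplicative decay schedule, and the values `rate=0.99` and `minimum=0.01` are assumptions chosen for illustration; other schedules (linear, step-wise) are equally common.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table.

    With probability epsilon, explore with a uniformly random action;
    otherwise exploit the action with the highest Q-value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def decay_epsilon(epsilon, rate=0.99, minimum=0.01):
    """Shrink epsilon multiplicatively, never going below `minimum`.

    Both `rate` and `minimum` are assumed values, not prescribed ones.
    """
    return max(minimum, epsilon * rate)


epsilon = 0.1
action = epsilon_greedy([0.2, 0.7, 0.1], epsilon)
epsilon = decay_epsilon(epsilon)  # typically called once per episode
```

Keeping a small floor on $\epsilon$ means the agent never stops exploring entirely, which can help if the environment changes later.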
