Monte Carlo Methods: Learning from Episodes
"We don't need a map of the world. We just need to wander, see where we end up, and remember the paths that led to the best destinations."
Model-Free Learning
Dynamic Programming (DP) requires a perfect model of the environment: the transition probabilities and expected rewards. Monte Carlo (MC) methods, by contrast, are model-free. They learn optimal behavior directly from episodes of simulated or real interaction with the environment.
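To make this concrete, here is a minimal sketch of model-free interaction. It assumes a hypothetical environment object with a Gym-style `reset()`/`step()` interface (an assumption, not a specific library); notice that the agent never inspects transition probabilities, it only samples transitions by acting.

```python
# Minimal sketch of model-free interaction. `env` is a hypothetical
# environment with a Gym-style reset()/step() API; the agent never
# queries transition probabilities, it only samples transitions.

def generate_episode(env, policy):
    """Roll out one episode as a list of (state, action, reward) triples."""
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                        # sample an action
        next_state, reward, done = env.step(action)   # sample a transition
        episode.append((state, action, reward))
        state = next_state
    return episode
```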
The Power of the Episode
MC methods mandate that tasks be episodic: you cannot update the value of a state until the episode finishes. Why? Because the value of a state, $V(s)$, is defined as the expected return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$. To compute an actual sample of that return, you must see the episode out to the terminal time step $T$.
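As a sketch of why the terminal state matters, here is one way to compute the sampled return $G_t$ for every step of a finished episode, working backward from the end. The discount factor $\gamma$ and the list-of-rewards representation are assumptions of this example.

```python
def compute_returns(rewards, gamma=0.9):
    """Return G_t for each time step of a finished episode.

    rewards[t] plays the role of R_{t+1}; the backward pass uses the
    recursion G_t = R_{t+1} + gamma * G_{t+1}, anchored at the terminal state.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a reward of 1 only at the end, discounted back to earlier states.
print(compute_returns([0, 0, 1]))  # [0.81, 0.9, 1.0]
```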
No Bootstrapping
A critical characteristic of MC is that it does not bootstrap. Bootstrapping means updating estimates based on other learned estimates, as TD learning does. MC methods update their estimates strictly from actual observed returns, which makes them unbiased, though often high-variance.
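The contrast is easiest to see in the update rules themselves. The sketch below compares a constant-step-size MC update, whose target is the actual return $G$, with a TD(0) update, whose target borrows the learned estimate of the next state; the value table `V` and step size `alpha` are illustrative assumptions.

```python
def mc_update(V, state, G, alpha=0.1):
    # Target is the actual observed return G: no learned estimate appears
    # in the target, so it is unbiased (but high variance).
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, gamma=0.9, alpha=0.1):
    # Target uses the learned estimate V[next_state]: this is bootstrapping.
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])
```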
🤖 AI Query Resolution (FAQ)
Why does Monte Carlo require episodic tasks in RL?
Because Monte Carlo evaluates a state based on the total return accumulated from that state until the end of the sequence. If the task is continuing (never-ending), the final return can never be fully computed, making basic MC evaluation impossible without artificially truncating the episode.
What is the difference between First-Visit and Every-Visit MC?
If an agent visits state $S$ multiple times within a single episode, the two variants differ in which visits contribute to the estimate (see the sketch after this list):
- First-Visit MC: averages only the return following the first time $S$ was visited in that episode.
- Every-Visit MC: averages the returns following every visit to $S$, treating them as separate data points.
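Here is a minimal sketch of MC prediction that toggles between the two variants. It assumes each episode is stored as a list of `(state, reward)` pairs; the function name is illustrative.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9, first_visit=True):
    """Estimate V(s) by averaging sampled returns over many episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        states = [s for s, _ in episode]
        g = 0.0
        for t in reversed(range(len(episode))):   # walk backward from the end
            state, reward = episode[t]
            g = reward + gamma * g                # G_t = R_{t+1} + gamma * G_{t+1}
            # First-visit: skip this occurrence if `state` also appears
            # earlier in the episode, so only the first visit contributes.
            if first_visit and state in states[:t]:
                continue
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```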
How does Monte Carlo handle the exploration-exploitation dilemma?
If an agent always follows a fixed deterministic policy, it never tries alternative actions, so the values of those state-action pairs can never be estimated. MC solves this via:
- Exploring Starts: Every state-action pair has a non-zero probability of being selected as the starting point of an episode.
- $\epsilon$-greedy policies: Ensuring the agent has a small probability ($\epsilon$) of taking a random action instead of the greedy one (a minimal sketch follows this list).
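For the second approach, an $\epsilon$-greedy selector is only a few lines. This sketch assumes `Q` is a dictionary mapping `(state, action)` pairs to value estimates (missing pairs default to 0):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon act randomly (explore); else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

In practice, $\epsilon$ is often decayed over time so the policy becomes greedy in the limit while still exploring early on.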