Planning calculations are useless if the environment's rules are unknown. Monte Carlo methods bypass the need for a model by simply averaging the returns observed in played episodes.
1. Sample-Based Learning
Unlike Dynamic Programming, Monte Carlo (MC) methods do not assume knowledge of the environment's transitions or rewards. Instead, they learn from Experience. The agent plays an entire Episode from start to finish. At the end, it computes the Return (G) that followed each state it visited, meaning the rewards accumulated from that visit until the episode ended, and uses it to update that state's estimated value. As the returns from many episodes are averaged, the estimate converges to the expected value: true 'Learning from Trial and Error'.
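As a minimal sketch of that end-of-episode bookkeeping, the function below computes the return that follows every time step by walking the reward list backward. The name `discounted_returns` and the discount factor `gamma` are illustrative assumptions, not part of the text above; with `gamma=1.0` the return is simply the sum of the remaining rewards.

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t for every time step t of one finished episode.

    rewards[t] is assumed to be the reward received after step t.
    gamma is an assumed discount factor; 1.0 means no discounting.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so each return reuses the one that follows it:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# One three-step episode with rewards 1, 0, 2.
print(discounted_returns([1, 0, 2]))  # [3.0, 2.0, 2.0]
```

Averaging these per-state returns over many episodes gives the Monte Carlo value estimate.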
2. The Counting Rules
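When a state is visited multiple times in a single episode, how should we update its value? First-Visit MC only uses the return that follows the very first time the state was hit in each episode, which makes the samples independent across episodes, keeps the estimate unbiased, and makes it easier to analyze. Every-Visit MC adds the return following every single visit to the average. Its samples within one episode are correlated, so the estimate is biased for a finite number of episodes, but it extracts more samples from each episode and is slightly simpler to implement. Both are mathematically sound: given enough samples, they converge to the same value function, namely the true value function of the policy the agent is following (not, by themselves, the optimal one).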
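The two counting rules can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the function name `mc_prediction`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state), and the discount factor `gamma` are all hypothetical choices made for this example.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging sampled returns.

    episodes: list of episodes, each a list of (state, reward) pairs
              (an assumed format for this sketch).
    first_visit: True for First-Visit MC, False for Every-Visit MC.
    """
    totals = defaultdict(float)  # sum of sampled returns per state
    counts = defaultdict(int)    # number of sampled returns per state

    for episode in episodes:
        # Record the time step at which each state is first visited.
        first_seen = {}
        for t, (state, _) in enumerate(episode):
            first_seen.setdefault(state, t)

        g = 0.0
        # Walk backward so g always holds the return from step t onward.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            g = reward + gamma * g
            if first_visit and first_seen[state] != t:
                continue  # First-Visit MC skips repeat visits
            totals[state] += g
            counts[state] += 1

    return {state: totals[state] / counts[state] for state in totals}


# One episode in which state "A" is visited twice.
episodes = [[("A", 1), ("B", 0), ("A", 2)]]
print(mc_prediction(episodes, first_visit=True))
print(mc_prediction(episodes, first_visit=False))
```

In this single episode the returns following the two visits to "A" are 3 and 2, so First-Visit MC estimates V(A) = 3.0 while Every-Visit MC estimates V(A) = 2.5. Both estimate V(B) = 2.0. With many episodes the two estimates would be expected to approach each other.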
3. Terminal Constraints
The biggest weakness of Monte Carlo is that it is strictly episodic. Because the update rule requires the 'Final Return,' which is only known once the episode terminates, the agent can only learn once the game is over. In continuing tasks (like keeping a drone level or managing a stock portfolio), there is no 'end,' so a pure MC agent would never update its knowledge. Even in episodic tasks, very long episodes mean long waits between updates. This limitation is the primary motivation for Temporal Difference methods, which learn while the action is still happening.
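The constraint is visible in the update rule itself. The equations below are a sketch in standard textbook notation, which the text above does not introduce: the step size $\alpha$, the discount factor $\gamma$, and the terminal time $T$ are assumed symbols.

```latex
% Monte Carlo update: the target G_t sums rewards up to the terminal step T,
% so it cannot be formed until the episode has ended.
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr],
\qquad
G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k

% One-step Temporal Difference update, shown for contrast: the target needs
% only the next reward and the current estimate of the next state's value.
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
```

If $T$ never arrives, $G_t$ is never available and the first update never fires, whereas the second can be applied after every step.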
