Markov Decision Processes: The Blueprint

Pascual Vila
AI/ML Instructor // Code Syllabus
"Almost all Reinforcement Learning problems can be formalized as Markov Decision Processes. If you understand MDPs, you understand the language of AI decision making."
The Environment & The Agent
At the core of RL is the interaction between an Agent and an Environment. An MDP provides the mathematical framework for this interaction. The agent observes a state, takes an action, and the environment responds with a new state and a reward.
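To make the loop concrete, here is a minimal sketch in Python. The `CoinFlipEnv` class and its two-state dynamics are invented for illustration, not a standard library; what matters is the shape of the loop: observe a state, choose an action, receive a new state and a reward.

```python
import random

class CoinFlipEnv:
    """A toy two-state environment used purely for illustration."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment responds with a reward and a new state.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = random.choice([0, 1])   # stochastic transition
        done = random.random() < 0.1         # episode eventually terminates
        return self.state, reward, done

env = CoinFlipEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # placeholder random policy
    state, reward, done = env.step(action)   # agent observes S' and R
    print(f"state={state}, reward={reward}")
```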
The Markov Property
The defining feature of an MDP is the Markov Property: the current state captures everything from the history that is relevant for choosing the next action. You don't need the entire record of past states and actions; the present state is sufficient.
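In symbols, using the standard convention where $S_t$ and $A_t$ are the state and action at time $t$, the property says that conditioning on the whole history gives the same next-state distribution as conditioning on the present alone:

$$\Pr(S_{t+1} = s' \mid S_t, A_t) = \Pr(S_{t+1} = s' \mid S_1, A_1, \ldots, S_t, A_t)$$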
Returns and Discounting
An agent's goal isn't just to get the immediate reward, but to maximize the Cumulative Return: the sum of all future rewards. In continuing tasks (tasks that don't naturally end), this sum can grow without bound. We solve this using a discount factor $\gamma \in [0, 1]$ (strictly below 1 for continuing tasks), made precise in the formula after this list.
- If $\gamma$ is close to 0, the agent is short-sighted (greedy).
- If $\gamma$ is close to 1, the agent is far-sighted, valuing long-term payoffs almost as much as immediate ones.
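Putting the two bullets together: the discounted return from time $t$, in the standard notation where $R_{t+1}$ is the reward received after acting at time $t$, is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

If every reward is bounded by some $R_{\max}$ and $\gamma < 1$, the geometric series bounds the return by $R_{\max} / (1 - \gamma)$, which is exactly why discounting keeps continuing tasks finite.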
❓ Frequently Asked Questions
What is the difference between an MDP and a POMDP?
MDP (Markov Decision Process): The agent has perfect visibility of the environment state.
POMDP (Partially Observable MDP): The agent only sees part of the state (like playing poker, where the opponent's cards are hidden). It must maintain a probability distribution over the possible states, known as a belief state.
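A rough sketch of what "maintaining a probability distribution" means in practice, as a Bayes-filter belief update. The two hidden states, the transition model `P`, and the observation model `O` below are all invented for illustration; only the update rule itself is standard.

```python
def belief_update(belief, action, observation, P, O):
    """Bayes filter: b'(s') is proportional to O(o | s') * sum_s P(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(P[(s, action)][s_next] * belief[s] for s in belief)
        new_belief[s_next] = O[s_next][observation] * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()}

# Toy poker-flavored example; every number below is an assumption.
P = {("hidden_ace", "check"): {"hidden_ace": 1.0, "no_ace": 0.0},
     ("no_ace", "check"):     {"hidden_ace": 0.0, "no_ace": 1.0}}
O = {"hidden_ace": {"raise": 0.7, "fold": 0.3},
     "no_ace":     {"raise": 0.2, "fold": 0.8}}

belief = {"hidden_ace": 0.5, "no_ace": 0.5}
belief = belief_update(belief, "check", "raise", P, O)
print(belief)  # seeing a raise shifts belief toward "hidden_ace" (~0.78)
```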
Why do we need Transition Probabilities (Dynamics)?
The real world is rarely deterministic. If a robot tries to move forward, its wheels might slip. The transition function $P$ models this randomness, giving the probability $P(s' \mid s, a)$ of landing in state $s'$ after taking action $a$ in state $s$.
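A small sketch of what such a $P$ can look like in code: a lookup table over (state, action) pairs for a one-dimensional robot whose wheels slip. The 80/20 split is an assumed number, not a property of any real robot.

```python
import random

# P(s' | s, a): moving forward usually advances the robot,
# but 20% of the time the wheels slip and it stays put (assumed values).
P = {
    (0, "forward"): {1: 0.8, 0: 0.2},
    (1, "forward"): {2: 0.8, 1: 0.2},
}

def sample_next_state(state, action):
    """Draw S' from the distribution P(. | s, a)."""
    next_states = list(P[(state, action)])
    weights = list(P[(state, action)].values())
    return random.choices(next_states, weights=weights, k=1)[0]

# Empirically, about 80% of forward moves from state 0 reach state 1.
samples = [sample_next_state(0, "forward") for _ in range(10_000)]
print(samples.count(1) / len(samples))  # ≈ 0.8
```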