
Markov Decision Processes

The mathematical blueprint of AI decision making. Model environments, define rewards, and train agents to optimize future returns.


Markov Decision Processes (MDPs) are the mathematical foundation of Reinforcement Learning. They formalize sequential decision making.



States & Actions

The environment is defined by its state space $S$. The agent influences transitions by choosing from an action space $A$.
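As a concrete sketch, consider a hypothetical 2×2 grid world; both spaces can be enumerated directly in Python:

```python
# Hypothetical 2x2 grid world: the state space S is every cell coordinate,
# and the action space A is the set of moves the agent may choose from.
S = [(row, col) for row in range(2) for col in range(2)]  # 4 states
A = ["up", "down", "left", "right"]                       # 4 actions

print(f"{len(S)} states, {len(A)} actions")  # -> 4 states, 4 actions
```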




Markov Decision Processes: The Blueprint

Author

Pascual Vila

AI/ML Instructor // Code Syllabus

"Almost all Reinforcement Learning problems can be formalized as Markov Decision Processes. If you understand MDPs, you understand the language of AI decision making."

The Environment & The Agent

At the core of RL is the interaction between an Agent and an Environment. An MDP provides the mathematical framework for this interaction. The agent observes a state, takes an action, and the environment responds with a new state and a reward.
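This interaction loop is easy to see in code. Below is a minimal sketch; `GridEnv` and `random_policy` are hypothetical stand-ins for illustration, not any specific library's API.

```python
import random

# Minimal sketch of the agent-environment loop in a toy corridor world:
# states 0..4, the episode ends with reward +1 on reaching state 4.
class GridEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_policy(state):
    return random.choice([-1, +1])  # the agent's (uninformed) strategy

env, state, done = GridEnv(), 0, False
while not done:
    action = random_policy(state)           # agent observes state, acts
    state, reward, done = env.step(action)  # environment responds
```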

The Markov Property

The defining feature of an MDP is the Markov Property: the assumption that the current state encapsulates everything needed to decide the next action. You don't need the entire history of the game; the present state is sufficient.

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$
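One way to internalize this: in a tabular model, the transition distribution is looked up by the pair $(s_t, a_t)$ alone, so earlier history simply has no way to enter the computation. A toy sketch with a hypothetical weather example:

```python
# Tabular transition model: P[s][a] is a distribution over next states.
# Because the lookup key is only the pair (state, action), nothing about
# earlier history can influence the result -- the Markov property is
# baked into the data structure. (Hypothetical weather example.)
P = {
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}},
}

def next_state_dist(state, action):
    return P[state][action]  # history never enters the computation

print(next_state_dist("sunny", "walk"))  # {'sunny': 0.8, 'rainy': 0.2}
```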

Returns and Discounting

An agent's goal isn't just to get the immediate reward, but to maximize the Cumulative Return. In continuing tasks (tasks that never naturally end), this sum can grow without bound. We solve this using a discount factor, $\gamma \in [0, 1]$, which shrinks rewards the further they lie in the future; the sketch after the list below makes the effect concrete.

  • If $\gamma$ is close to 0, the agent is short-sighted (greedy).
  • If $\gamma$ is close to 1, the agent strives for long-term payoffs.
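The discounted return is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. A quick sketch with a hypothetical reward sequence shows how $\gamma$ shifts the agent's priorities:

```python
# Discounted return G = sum_k gamma^k * r_k for a hypothetical reward list.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # the big payoff arrives last
print(discounted_return(rewards, 0.1))   # ~1.12  -- short-sighted, ignores the 10
print(discounted_return(rewards, 0.99))  # ~12.67 -- values the delayed payoff
```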

Frequently Asked Questions

What is the difference between an MDP and a POMDP?

MDP (Markov Decision Process): The agent has perfect visibility of the environment state.

POMDP (Partially Observable MDP): The agent only sees a piece of the state (like playing poker where opponent cards are hidden). It must maintain a probability distribution over possible states.
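For intuition, here is a minimal, hypothetical poker-flavored belief update: the agent never observes the true state, so it applies Bayes' rule to a distribution over states each time an observation arrives.

```python
# Sketch of a POMDP belief update (hypothetical two-state example).
belief = {"opponent_strong": 0.5, "opponent_weak": 0.5}

# Hypothetical observation model: P(observation | state)
likelihood = {"raise": {"opponent_strong": 0.7, "opponent_weak": 0.2}}

def update_belief(belief, obs):
    # Bayes' rule: multiply prior by likelihood, then renormalize
    unnormalized = {s: likelihood[obs][s] * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

print(update_belief(belief, "raise"))
# -> {'opponent_strong': ~0.78, 'opponent_weak': ~0.22}
```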

Why do we need Transition Probabilities (Dynamics)?

The real world is rarely deterministic. If a robot tries to move forward, its wheels might slip. The transition function $P$ models this randomness, telling us the probability of landing in state $s'$ given state $s$ and action $a$.
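A stochastic transition is just a sample from that distribution. The sketch below models a hypothetical robot whose forward command succeeds 80% of the time:

```python
import random

# Hypothetical slippery robot: moving forward works with probability 0.8.
def step_forward(position):
    if random.random() < 0.8:
        return position + 1  # intended outcome: P(s+1 | s, forward) = 0.8
    return position          # wheel slip:       P(s   | s, forward) = 0.2

random.seed(0)  # reproducible run
pos = 0
for _ in range(10):
    pos = step_forward(pos)
print("position after 10 forward commands:", pos)
```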

RL Terminology Dictionary

State (S)
A representation of the environment at a specific time step.
Action (A)
The set of valid choices an agent can make in a given state.
Transition Model (P)
The probability distribution defining how the environment changes in response to an action.
Reward (R)
A scalar feedback signal indicating how good or bad a transition was.
Discount Factor (γ)
A multiplier (between 0 and 1) that determines the present value of future rewards.
Policy (π)
The agent's strategy: a mapping from states to actions, chosen to maximize the expected return.
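Tying the dictionary together: the five ingredients above form the classic tuple $(S, A, P, R, \gamma)$, and a policy $\pi$ sits on top of it. A minimal sketch, with all names and toy numbers purely illustrative:

```python
from dataclasses import dataclass

# The five glossary ingredients bundled into one object; this is an
# illustrative structure, not any particular library's API.
@dataclass
class MDP:
    states: list       # S
    actions: list      # A
    transitions: dict  # P: (s, a) -> {s': prob}
    rewards: dict      # R: (s, a, s') -> scalar
    gamma: float       # discount factor

mdp = MDP(
    states=["s0", "s1"],
    actions=["go"],
    transitions={("s0", "go"): {"s0": 0.1, "s1": 0.9}},
    rewards={("s0", "go", "s1"): 1.0},
    gamma=0.95,
)

policy = {"s0": "go", "s1": "go"}  # pi: a deterministic mapping S -> A
print(mdp.transitions[("s0", "go")])  # {'s0': 0.1, 's1': 0.9}
```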