
Markov Decision Processes

The mathematical blueprint of AI decision making. Model environments, define rewards, and train agents to optimize future returns.


Markov Decision Processes (MDPs) are the mathematical foundation of Reinforcement Learning. They formalize sequential decision making.



States & Actions

The environment is defined by its state space $S$. The agent influences transitions by choosing from an action space $A$.
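As a concrete sketch, consider a hypothetical 2×2 grid world; both spaces can be enumerated directly in Python:

```python
# Hypothetical 2x2 grid world: the state space S is every cell coordinate,
# and the action space A is the set of moves the agent may choose from.
S = [(row, col) for row in range(2) for col in range(2)]  # 4 states
A = ["up", "down", "left", "right"]                       # 4 actions

print(f"{len(S)} states, {len(A)} actions")  # -> 4 states, 4 actions
```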




Markov Decision Processes: The Blueprint

Author

Pascual Vila

AI/ML Instructor // Code Syllabus

"Almost all Reinforcement Learning problems can be formalized as Markov Decision Processes. If you understand MDPs, you understand the language of AI decision making."

The Environment & The Agent

At the core of RL is the interaction between an Agent and an Environment. An MDP provides the mathematical framework for this interaction. The agent observes a state, takes an action, and the environment responds with a new state and a reward.
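This interaction loop is easy to see in code. Below is a minimal sketch; `GridEnv` and `random_policy` are hypothetical stand-ins for illustration, not any specific library's API.

```python
import random

# Minimal sketch of the agent-environment loop in a toy corridor world:
# states 0..4, the episode ends with reward +1 on reaching state 4.
class GridEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_policy(state):
    return random.choice([-1, +1])  # the agent's (uninformed) strategy

env, state, done = GridEnv(), 0, False
while not done:
    action = random_policy(state)           # agent observes state, acts
    state, reward, done = env.step(action)  # environment responds
```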

The Markov Property

The defining feature of an MDP is the Markov Property: the assumption that the current state encapsulates everything needed to decide the next action. You don't need the entire history of the game; the present state is sufficient.

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$
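One way to internalize this: in a tabular model, the transition distribution is looked up by the pair $(s_t, a_t)$ alone, so earlier history simply has no way to enter the computation. A toy sketch with a hypothetical weather example:

```python
# Tabular transition model: P[s][a] is a distribution over next states.
# Because the lookup key is only the pair (state, action), nothing about
# earlier history can influence the result -- the Markov property is
# baked into the data structure. (Hypothetical weather example.)
P = {
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}},
}

def next_state_dist(state, action):
    return P[state][action]  # history never enters the computation

print(next_state_dist("sunny", "walk"))  # {'sunny': 0.8, 'rainy': 0.2}
```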

Returns and Discounting

An agent's goal isn't just to get the immediate reward, but to maximize the Cumulative Return. In continuing tasks (tasks that never naturally end), this sum can grow without bound. We solve this using a discount factor, $\gamma \in [0, 1]$, which shrinks rewards the further they lie in the future; the sketch after the list below makes the effect concrete.

  • If $\gamma$ is close to 0, the agent is short-sighted (greedy).
  • If $\gamma$ is close to 1, the agent strives for long-term payoffs.
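The discounted return is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. A quick sketch with a hypothetical reward sequence shows how $\gamma$ shifts the agent's priorities:

```python
# Discounted return G = sum_k gamma^k * r_k for a hypothetical reward list.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]          # the big payoff arrives last
print(discounted_return(rewards, 0.1))   # ~1.12  -- short-sighted, ignores the 10
print(discounted_return(rewards, 0.99))  # ~12.67 -- values the delayed payoff
```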

Frequently Asked Questions

What is the difference between an MDP and a POMDP?

MDP (Markov Decision Process): The agent has perfect visibility of the environment state.

POMDP (Partially Observable MDP): The agent only sees a piece of the state (like playing poker where opponent cards are hidden). It must maintain a probability distribution over possible states.
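For intuition, here is a minimal, hypothetical poker-flavored belief update: the agent never observes the true state, so it applies Bayes' rule to a distribution over states each time an observation arrives.

```python
# Sketch of a POMDP belief update (hypothetical two-state example).
belief = {"opponent_strong": 0.5, "opponent_weak": 0.5}

# Hypothetical observation model: P(observation | state)
likelihood = {"raise": {"opponent_strong": 0.7, "opponent_weak": 0.2}}

def update_belief(belief, obs):
    # Bayes' rule: multiply prior by likelihood, then renormalize
    unnormalized = {s: likelihood[obs][s] * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

print(update_belief(belief, "raise"))
# -> {'opponent_strong': ~0.78, 'opponent_weak': ~0.22}
```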

Why do we need Transition Probabilities (Dynamics)?

The real world is rarely deterministic. If a robot tries to move forward, its wheels might slip. The transition function $P$ models this randomness, telling us the probability of landing in state $s'$ given state $s$ and action $a$.
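A stochastic transition is just a sample from that distribution. The sketch below models a hypothetical robot whose forward command succeeds 80% of the time:

```python
import random

# Hypothetical slippery robot: moving forward works with probability 0.8.
def step_forward(position):
    if random.random() < 0.8:
        return position + 1  # intended outcome: P(s+1 | s, forward) = 0.8
    return position          # wheel slip:       P(s   | s, forward) = 0.2

random.seed(0)  # reproducible run
pos = 0
for _ in range(10):
    pos = step_forward(pos)
print("position after 10 forward commands:", pos)
```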

RL Terminology Dictionary

State (S)
A representation of the environment at a specific time step.
Action (A)
The set of valid choices an agent can make in a given state.
Transition Model (P)
The probability distribution defining how the environment changes in response to an action.
Reward (R)
A scalar feedback signal indicating how good or bad a transition was.
Discount Factor (γ)
A multiplier (between 0 and 1) that determines the present value of future rewards.
Policy (π)
The agent's strategy: a mapping from states to actions, chosen to maximize the expected return.
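Tying the dictionary together: the five ingredients above form the classic tuple $(S, A, P, R, \gamma)$, and a policy $\pi$ sits on top of it. A minimal sketch, with all names and toy numbers purely illustrative:

```python
from dataclasses import dataclass

# The five glossary ingredients bundled into one object; this is an
# illustrative structure, not any particular library's API.
@dataclass
class MDP:
    states: list       # S
    actions: list      # A
    transitions: dict  # P: (s, a) -> {s': prob}
    rewards: dict      # R: (s, a, s') -> scalar
    gamma: float       # discount factor

mdp = MDP(
    states=["s0", "s1"],
    actions=["go"],
    transitions={("s0", "go"): {"s0": 0.1, "s1": 0.9}},
    rewards={("s0", "go", "s1"): 1.0},
    gamma=0.95,
)

policy = {"s0": "go", "s1": "go"}  # pi: a deterministic mapping S -> A
print(mdp.transitions[("s0", "go")])  # {'s0': 0.1, 's1': 0.9}
```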