Multi-Agent Reinforcement Learning: The Swarm Mind
Single-agent RL operates in a stationary environment. In Multi-Agent RL (MARL), the environment is non-stationary from each agent's perspective: the other learning agents are constantly updating their policies, which breaks the Markov property that single-agent algorithms rely on.
Cooperation vs. Competition
MARL environments are generally split into three categories based on the reward structure:
- Fully Cooperative: All agents share the exact same reward function. Their goal is to maximize a joint return (e.g., controlling traffic lights to minimize total congestion).
- Fully Competitive (Zero-Sum): One agent's gain is another agent's loss (e.g., Chess, Go, 1v1 games). The optimal policy often converges to a Nash Equilibrium.
- Mixed-Sum (General-Sum): Agents pursue their own interests, which may align or conflict depending on the state (e.g., self-driving cars negotiating an intersection).
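The three reward structures above can be made concrete with tiny two-agent payoff matrices. The numbers below are illustrative toy games (matching pennies for zero-sum, a hypothetical yield-vs-go intersection for mixed-sum), not taken from any specific benchmark:

```python
# Toy two-agent, two-action payoff matrices, one per reward structure.
# Entries are (reward_agent_1, reward_agent_2), indexed [action_1][action_2].

cooperative = [  # fully cooperative: both agents always share the same reward
    [(2, 2), (0, 0)],
    [(0, 0), (1, 1)],
]

zero_sum = [  # matching pennies: one agent's gain is the other's loss
    [(+1, -1), (-1, +1)],
    [(-1, +1), (+1, -1)],
]

mixed_sum = [  # toy intersection game: yield vs. go
    [(0, 0), (-1, 2)],    # both yield / agent 1 yields, agent 2 goes
    [(2, -1), (-5, -5)],  # agent 1 goes, agent 2 yields / both go (crash)
]

def is_zero_sum(game):
    """Zero-sum iff rewards cancel for every joint action."""
    return all(r1 + r2 == 0 for row in game for (r1, r2) in row)

def is_fully_cooperative(game):
    """Fully cooperative iff both agents share every reward."""
    return all(r1 == r2 for row in game for (r1, r2) in row)

print(is_zero_sum(zero_sum))              # True
print(is_fully_cooperative(cooperative))  # True
print(is_zero_sum(mixed_sum), is_fully_cooperative(mixed_sum))  # False False
```

The mixed-sum matrix shows why the category is hard: the agents' interests align on avoiding the crash but conflict over who yields.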
The CTDE Paradigm
Centralized Training, Decentralized Execution (CTDE) is the gold-standard architecture for cooperative MARL, used by algorithms such as MAPPO and QMIX.
During training, a centralized "Critic" network evaluates actions using the global state (the true underlying environment state plus all agents' actions). This stabilizes training and mitigates non-stationarity. During execution (deployment), however, each "Actor" network must select actions relying only on its own local, limited observations.
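The information asymmetry at the heart of CTDE can be sketched in a few lines. This is a hypothetical toy, not MAPPO or QMIX: the actor and critic are stand-in functions rather than trained networks, and the coordination "value" is invented for illustration.

```python
import random

random.seed(0)

N_AGENTS, N_ACTIONS = 2, 2

def actor(agent_id, local_obs):
    """Decentralized actor: chooses from its local observation only.
    A trivial random policy stands in for a trained network."""
    return random.randrange(N_ACTIONS)

def centralized_critic(global_state, joint_action):
    """Centralized critic: sees the full state AND every agent's action.
    This function exists only at training time."""
    # Toy value: reward each agent whose action matches the state bit.
    return sum(1.0 for a in joint_action if a == global_state)

# --- training time: the critic consumes global information ----------
global_state = 1
local_obs = [global_state, global_state]  # obs happen to equal the state here
joint_action = [actor(i, local_obs[i]) for i in range(N_AGENTS)]
value = centralized_critic(global_state, joint_action)

# --- execution time: the critic is discarded; actors act alone ------
deployed_actions = [actor(i, local_obs[i]) for i in range(N_AGENTS)]
print(joint_action, value, deployed_actions)
```

The key design point is the signatures: `centralized_critic` takes `global_state` and the full `joint_action`, while `actor` takes only one agent's `local_obs`, so nothing deployed ever depends on global information.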
📡 Extracted Intelligence (FAQ)
What is the difference between Single-Agent and Multi-Agent RL?
In single-agent RL, the environment is stationary from the agent's perspective. In MARL, multiple agents learn simultaneously. As other agents update their policies, the environment's dynamics change from the perspective of any single agent. This causes non-stationarity, making standard algorithms like Q-learning unstable.
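Non-stationarity is easy to see numerically. In the toy game below (hypothetical payoffs), agent A's expected reward for the *same* action changes as agent B updates its policy, so the value estimates A has learned go stale:

```python
# Agent A's payoff, indexed by [a_action][b_action]. Toy numbers.
payoff_A = [[1.0, 0.0],
            [0.0, 1.0]]

def expected_reward_for_A(a_action, b_policy):
    """A's expected reward, marginalized over B's current policy."""
    return sum(p * payoff_A[a_action][b] for b, p in enumerate(b_policy))

early = [0.9, 0.1]  # early in training: B mostly plays action 0
late  = [0.1, 0.9]  # later: B has learned to play action 1

print(expected_reward_for_A(0, early))  # 0.9 -> action 0 looks great to A
print(expected_reward_for_A(0, late))   # 0.1 -> the same action now looks bad
```

From A's point of view nothing about the environment changed, yet its effective reward function did; this drift is exactly what destabilizes a vanilla Q-learner whose convergence proof assumes fixed dynamics.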
What is the Multi-Agent Credit Assignment problem?
When a team of agents receives a single shared reward (e.g., a team wins a match), it is difficult to determine which agent's specific actions contributed to the success. Was it agent A's brilliant move, or was agent B carrying the team? Algorithms use techniques like counterfactual baselines (COMA) to isolate individual contributions.
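The counterfactual-baseline idea behind COMA can be sketched with toy numbers. For agent i, the advantage compares the joint-action value against a baseline that marginalizes out agent i's own action while holding the other agents' actions fixed: A_i(s, a) = Q(s, a) - Σ_{a_i'} π_i(a_i' | s) Q(s, (a_{-i}, a_i')). The Q-values and policy below are invented for illustration, not the output of any trained model:

```python
# Q-values for a 2-agent, 2-action game, indexed Q[a1][a2]. Toy numbers.
Q = [[1.0, 0.0],
     [3.0, 2.0]]

pi_1 = [0.5, 0.5]  # agent 1's current policy over its two actions

def counterfactual_advantage_agent1(a1, a2):
    """COMA-style advantage for agent 1: did its choice a1 beat what it
    would have done on average, with agent 2's action a2 held fixed?"""
    baseline = sum(pi_1[alt] * Q[alt][a2] for alt in range(2))
    return Q[a1][a2] - baseline

# With agent 2 fixed at action 0: baseline = 0.5*1.0 + 0.5*3.0 = 2.0
print(counterfactual_advantage_agent1(1, 0))  # 3.0 - 2.0 = 1.0 (a1=1 helped)
print(counterfactual_advantage_agent1(0, 0))  # 1.0 - 2.0 = -1.0 (a1=0 hurt)
```

Because the baseline varies only agent 1's action, a positive advantage isolates agent 1's individual contribution from whatever the rest of the team did.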
What is PettingZoo in Reinforcement Learning?
PettingZoo is a Python library that serves as the multi-agent equivalent of OpenAI Gym (now Gymnasium). It provides a standardized API for defining MARL environments, supporting both the sequential Agent Environment Cycle (AEC) model and a parallel execution model.
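In PettingZoo's AEC model, agents act one at a time via `env.agent_iter()`, reading the pending observation with `env.last()` and advancing with `env.step(action)`. To keep this runnable without the library installed, the sketch below uses a hypothetical dependency-free stand-in environment (`ToyAECEnv`) that only mimics the shape of that loop; the real API lives in the `pettingzoo` package.

```python
class ToyAECEnv:
    """Hypothetical stand-in mimicking PettingZoo's AEC interface.
    Not the real library -- just enough to run the loop shape below."""

    def __init__(self, n_steps=4):
        self.agents = ["player_0", "player_1"]
        self._steps_left = n_steps

    def agent_iter(self):
        # Yield whichever agent is due to act, until the episode ends.
        while self._steps_left > 0:
            yield self.agents[self._steps_left % len(self.agents)]

    def last(self):
        # (observation, reward, termination, truncation, info)
        done = self._steps_left <= 1
        return 0, 0.0, done, False, {}

    def step(self, action):
        self._steps_left -= 1

env = ToyAECEnv()
trace = []
for agent in env.agent_iter():  # agents take turns, one step at a time
    obs, reward, termination, truncation, info = env.last()
    # PettingZoo convention: a finished agent must be stepped with None.
    action = None if termination or truncation else 0
    env.step(action)
    trace.append(agent)
print(trace)
```

With a real environment, only the first line changes (e.g., constructing an env from a PettingZoo module and sampling from `env.action_space(agent)` instead of the hard-coded action); the loop body is the standard AEC pattern.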