Multi-Agent Reinforcement Learning: The Swarm Mind
Single-agent RL operates in a stationary environment. In Multi-Agent RL (MARL), the environment is non-stationary from each agent's perspective: the other learning agents are constantly updating their policies, which breaks the Markov property that single-agent algorithms rely on.
Cooperation vs. Competition
MARL environments are generally split into three categories based on the reward structure:
- Fully Cooperative: All agents share the exact same reward function. Their goal is to maximize a joint return (e.g., controlling traffic lights to minimize total congestion).
- Fully Competitive (Zero-Sum): One agent's gain is another agent's loss (e.g., Chess, Go, 1v1 games). The optimal policy often converges to a Nash Equilibrium.
- Mixed-Sum (General-Sum): Agents pursue their own interests, which may align or conflict depending on the state (e.g., self-driving cars negotiating an intersection).
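The three reward structures above can be made concrete with tiny two-agent payoff matrices. The numbers below are illustrative toy games (matching pennies for zero-sum, a hypothetical yield-vs-go intersection for mixed-sum), not taken from any specific benchmark:

```python
# Toy two-agent, two-action payoff matrices, one per reward structure.
# Entries are (reward_agent_1, reward_agent_2), indexed [action_1][action_2].

cooperative = [  # fully cooperative: both agents always share the same reward
    [(2, 2), (0, 0)],
    [(0, 0), (1, 1)],
]

zero_sum = [  # matching pennies: one agent's gain is the other's loss
    [(+1, -1), (-1, +1)],
    [(-1, +1), (+1, -1)],
]

mixed_sum = [  # toy intersection game: yield vs. go
    [(0, 0), (-1, 2)],    # both yield / agent 1 yields, agent 2 goes
    [(2, -1), (-5, -5)],  # agent 1 goes, agent 2 yields / both go (crash)
]

def is_zero_sum(game):
    """Zero-sum iff rewards cancel for every joint action."""
    return all(r1 + r2 == 0 for row in game for (r1, r2) in row)

def is_fully_cooperative(game):
    """Fully cooperative iff both agents share every reward."""
    return all(r1 == r2 for row in game for (r1, r2) in row)

print(is_zero_sum(zero_sum))              # True
print(is_fully_cooperative(cooperative))  # True
print(is_zero_sum(mixed_sum), is_fully_cooperative(mixed_sum))  # False False
```

The mixed-sum matrix shows why the category is hard: the agents' interests align on avoiding the crash but conflict over who yields.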
The CTDE Paradigm
Centralized Training, Decentralized Execution (CTDE) is the gold-standard architecture for cooperative MARL, used by algorithms such as MAPPO and QMIX.
During training, a centralized "Critic" network evaluates actions using the global state (the true underlying environment state plus all agents' actions). This stabilizes training and mitigates non-stationarity. During execution (deployment), however, each "Actor" network must select actions relying only on its own local, limited observations.
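The information asymmetry at the heart of CTDE can be sketched in a few lines. This is a hypothetical toy, not MAPPO or QMIX: the actor and critic are stand-in functions rather than trained networks, and the coordination "value" is invented for illustration.

```python
import random

random.seed(0)

N_AGENTS, N_ACTIONS = 2, 2

def actor(agent_id, local_obs):
    """Decentralized actor: chooses from its local observation only.
    A trivial random policy stands in for a trained network."""
    return random.randrange(N_ACTIONS)

def centralized_critic(global_state, joint_action):
    """Centralized critic: sees the full state AND every agent's action.
    This function exists only at training time."""
    # Toy value: reward each agent whose action matches the state bit.
    return sum(1.0 for a in joint_action if a == global_state)

# --- training time: the critic consumes global information ----------
global_state = 1
local_obs = [global_state, global_state]  # obs happen to equal the state here
joint_action = [actor(i, local_obs[i]) for i in range(N_AGENTS)]
value = centralized_critic(global_state, joint_action)

# --- execution time: the critic is discarded; actors act alone ------
deployed_actions = [actor(i, local_obs[i]) for i in range(N_AGENTS)]
print(joint_action, value, deployed_actions)
```

The key design point is the signatures: `centralized_critic` takes `global_state` and the full `joint_action`, while `actor` takes only one agent's `local_obs`, so nothing deployed ever depends on global information.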
📡 Extracted Intelligence (FAQ)
What is the difference between Single-Agent and Multi-Agent RL?
In single-agent RL, the environment is stationary from the agent's perspective. In MARL, multiple agents learn simultaneously. As other agents update their policies, the environment's dynamics change from the perspective of any single agent. This causes non-stationarity, making standard algorithms like Q-learning unstable.
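Non-stationarity is easy to see numerically. In the toy game below (hypothetical payoffs), agent A's expected reward for the *same* action changes as agent B updates its policy, so the value estimates A has learned go stale:

```python
# Agent A's payoff, indexed by [a_action][b_action]. Toy numbers.
payoff_A = [[1.0, 0.0],
            [0.0, 1.0]]

def expected_reward_for_A(a_action, b_policy):
    """A's expected reward, marginalized over B's current policy."""
    return sum(p * payoff_A[a_action][b] for b, p in enumerate(b_policy))

early = [0.9, 0.1]  # early in training: B mostly plays action 0
late  = [0.1, 0.9]  # later: B has learned to play action 1

print(expected_reward_for_A(0, early))  # 0.9 -> action 0 looks great to A
print(expected_reward_for_A(0, late))   # 0.1 -> the same action now looks bad
```

From A's point of view nothing about the environment changed, yet its effective reward function did; this drift is exactly what destabilizes a vanilla Q-learner whose convergence proof assumes fixed dynamics.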
What is the Multi-Agent Credit Assignment problem?
When a team of agents receives a single shared reward (e.g., a team wins a match), it is difficult to determine which agent's specific actions contributed to the success. Was it agent A's brilliant move, or was agent B carrying the team? Algorithms use techniques like counterfactual baselines (COMA) to isolate individual contributions.
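The counterfactual-baseline idea behind COMA can be sketched with toy numbers. For agent i, the advantage compares the joint-action value against a baseline that marginalizes out agent i's own action while holding the other agents' actions fixed: A_i(s, a) = Q(s, a) - Σ_{a_i'} π_i(a_i' | s) Q(s, (a_{-i}, a_i')). The Q-values and policy below are invented for illustration, not the output of any trained model:

```python
# Q-values for a 2-agent, 2-action game, indexed Q[a1][a2]. Toy numbers.
Q = [[1.0, 0.0],
     [3.0, 2.0]]

pi_1 = [0.5, 0.5]  # agent 1's current policy over its two actions

def counterfactual_advantage_agent1(a1, a2):
    """COMA-style advantage for agent 1: did its choice a1 beat what it
    would have done on average, with agent 2's action a2 held fixed?"""
    baseline = sum(pi_1[alt] * Q[alt][a2] for alt in range(2))
    return Q[a1][a2] - baseline

# With agent 2 fixed at action 0: baseline = 0.5*1.0 + 0.5*3.0 = 2.0
print(counterfactual_advantage_agent1(1, 0))  # 3.0 - 2.0 = 1.0 (a1=1 helped)
print(counterfactual_advantage_agent1(0, 0))  # 1.0 - 2.0 = -1.0 (a1=0 hurt)
```

Because the baseline varies only agent 1's action, a positive advantage isolates agent 1's individual contribution from whatever the rest of the team did.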
What is PettingZoo in Reinforcement Learning?
PettingZoo is a Python library that serves as the multi-agent equivalent of OpenAI Gym (now Gymnasium). It provides a standardized API for defining MARL environments, supporting both the sequential Agent Environment Cycle (AEC) model and a parallel execution model.
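In PettingZoo's AEC model, agents act one at a time via `env.agent_iter()`, reading the pending observation with `env.last()` and advancing with `env.step(action)`. To keep this runnable without the library installed, the sketch below uses a hypothetical dependency-free stand-in environment (`ToyAECEnv`) that only mimics the shape of that loop; the real API lives in the `pettingzoo` package.

```python
class ToyAECEnv:
    """Hypothetical stand-in mimicking PettingZoo's AEC interface.
    Not the real library -- just enough to run the loop shape below."""

    def __init__(self, n_steps=4):
        self.agents = ["player_0", "player_1"]
        self._steps_left = n_steps

    def agent_iter(self):
        # Yield whichever agent is due to act, until the episode ends.
        while self._steps_left > 0:
            yield self.agents[self._steps_left % len(self.agents)]

    def last(self):
        # (observation, reward, termination, truncation, info)
        done = self._steps_left <= 1
        return 0, 0.0, done, False, {}

    def step(self, action):
        self._steps_left -= 1

env = ToyAECEnv()
trace = []
for agent in env.agent_iter():  # agents take turns, one step at a time
    obs, reward, termination, truncation, info = env.last()
    # PettingZoo convention: a finished agent must be stepped with None.
    action = None if termination or truncation else 0
    env.step(action)
    trace.append(agent)
print(trace)
```

With a real environment, only the first line changes (e.g., constructing an env from a PettingZoo module and sampling from `env.action_space(agent)` instead of the hard-coded action); the loop body is the standard AEC pattern.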