The Capstone: Training AI to Play Games

Pascual Vila
AI Architecture Lead // Code Syllabus
Bridging the gap between theory and execution is the final hurdle in Reinforcement Learning. Translating Markov Decision Processes into a working codebase forces you to confront hyperparameter tuning, exploration decay, and reward optimization.
Bootstrapping The Environment
We utilize Gymnasium (formerly OpenAI Gym) to standardize how our agent perceives the world. Every environment follows a strict API contract: reset() yields an initial observation, and step() takes an action and returns the next observation, a numerical reward, and boolean terminated and truncated flags (plus an info dict).
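A minimal sketch of that contract, using CartPole-v1 as an illustrative environment and a random policy as a stand-in for the agent:

import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)  # reset yields the initial observation

for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the agent's policy
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()  # start a fresh episode

env.close()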
The DQN Architecture
A Deep Q-Network acts as the brain. Rather than storing a massive lookup table of Q-values (infeasible for continuous state variables like velocity or coordinates), a neural network approximates the Q-value of every possible action given an input state.
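As a sketch, assuming PyTorch, a flat observation vector, and a discrete action space (the class name, layer sizes, and hidden width are all illustrative):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps an observation vector to one Q-value per discrete action.
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output head per action
        )

    def forward(self, state):
        return self.net(state)

# CartPole example: four state variables, two actions
# q_net = QNetwork(state_dim=4, n_actions=2)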
To prevent catastrophic forgetting and break the correlation between consecutive samples, the agent uses an Experience Replay buffer: it stores past transitions (state, action, reward, next state, done flag) and randomly samples mini-batches of them to train the network.
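One common minimal implementation backs the buffer with a deque; the class name and capacity here are illustrative:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions fall off automatically once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)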
Reward Shaping & Heuristics
Often, the default environment rewards are too sparse (e.g., +1 only when winning the game). Reward Shaping means programmatically adjusting the reward returned by env.step() before the agent sees it. You might penalize the agent for taking too long, or reward it for moving closer to the target.
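One clean way to do this in Gymnasium is a Wrapper subclass that intercepts step(); the per-step penalty value and the commented goal-distance bonus below are hypothetical and depend on your environment:

import gymnasium as gym

class ShapedReward(gym.Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        reward -= 0.01  # hypothetical per-step penalty: discourages dawdling
        # Hypothetical bonus for progress, if a goal coordinate is known:
        # reward += 0.1 * progress_toward_goal(obs)
        return obs, reward, terminated, truncated, info

# env = ShapedReward(gym.make("MountainCar-v0"))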
Hyperparameter Tuning Specs
Gamma (Discount Factor): 0.95 to 0.99. Dictates how much the agent cares about long-term rewards vs immediate rewards.
Epsilon Decay: Start at 1.0 (100% random exploration), decay by a factor (e.g., 0.995) per episode, until hitting a minimum (e.g., 0.01); see the sketch after this list.
Batch Size: Typically 32 or 64. The number of memory tuples sampled during experience replay.
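A sketch of the epsilon-greedy loop those numbers plug into; select_action and the q_values list are illustrative stand-ins for your agent's wiring:

import random

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995  # values from the specs above

def select_action(q_values):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# At the end of each episode:
# epsilon = max(epsilon_min, epsilon * epsilon_decay)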
❓ Training Protocol FAQ
Why is my agent not learning anything, despite running for thousands of episodes?
Verify your epsilon-greedy schedule. If epsilon decays to its minimum too quickly, the agent stops exploring and forever exploits a terrible, randomly initialized policy. Alternatively, your neural network's learning rate may be too high, causing the weights to diverge.
What is the difference between terminated and truncated?
In Gymnasium, terminated means the natural end of an episode (e.g., the CartPole fell, or the agent reached the goal). truncated means an artificial time limit was reached (e.g., surviving for 500 steps).
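The distinction matters when computing the Q-learning target: a terminated state is truly final, while a truncated one is not. A sketch, assuming gamma and a max_next_q estimate from your network:

def q_target(reward, max_next_q, terminated, gamma=0.99):
    # Natural end: no future rewards exist, so the target is just the reward.
    if terminated:
        return reward
    # Truncation is an artificial cutoff: the state wasn't terminal,
    # so we still bootstrap from the estimated value of the next state.
    return reward + gamma * max_next_q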
How do I save the agent so I don't have to retrain it?
Save the neural network weights. If using PyTorch: torch.save(agent.model.state_dict(), 'model.pth'). Next time, initialize the same architecture and load the weights to restore the trained agent without retraining.
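Loading mirrors saving; a sketch assuming the illustrative QNetwork class from earlier and CartPole's dimensions:

import torch

# Rebuild the same architecture first, then load the saved weights into it.
model = QNetwork(state_dim=4, n_actions=2)
model.load_state_dict(torch.load('model.pth'))
model.eval()  # inference mode (relevant if the net uses dropout/batch norm)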