Training a neural network is an optimization problem. We seek the set of weights that minimizes the model's error in a landscape of billions of possibilities.
1. The Cost of Error
Before a neural network can improve, it needs to know how badly it failed. That is the job of the Loss Function (or Cost Function).
The Loss Function calculates a single numerical score representing the penalty for being wrong. If you are predicting continuous numbers (like house prices), you use Mean Squared Error (MSE). If you are classifying categories (like "cat" vs "dog"), you use Cross-Entropy Loss, which heavily penalizes the model for being confidently incorrect. Without a differentiable loss function, the network has no 'error signal' to guide its learning.
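To make the two penalties concrete, here is a minimal pure-Python sketch of both formulas. The function names and the sample numbers are illustrative assumptions, not part of any library; in practice you would use the built-in PyTorch losses.

```python
import math

def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy for one example; prob is the predicted probability of class 1."""
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Regression: hypothetical house prices in thousands.
house_mse = mse([310.0, 195.0], [300.0, 200.0])

# Classification: the true label is 1 (say, "dog").
unsure_loss = binary_cross_entropy(0.60, 1)           # hesitant but right: mild penalty
confident_wrong_loss = binary_cross_entropy(0.01, 1)  # confidently wrong: large penalty

print(house_mse, unsure_loss, confident_wrong_loss)
```

The confidently wrong prediction is punished far more than the hesitant one, which is the behavior described above.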
import torch.nn as nn
# Binary Cross-Entropy for Yes/No (expects probabilities in [0, 1], e.g. sigmoid outputs)
criterion = nn.BCELoss()
# Mean Squared Error for numbers
criterion_reg = nn.MSELoss()
2. Gradient Descent Mechanics
Once we have an error score, we need to minimize it. Gradient Descent is the algorithm that achieves this.
Imagine standing in a foggy mountain range and trying to find the lowest valley. You can't see the whole map, so you check the slope of the ground beneath your feet and take a step downhill. In a neural network, the slope (the gradient of the loss with respect to each weight) is computed via backpropagation, and taking the step is done by the Optimizer. The size of the step you take is controlled by the Learning Rate.
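The downhill walk can be sketched on a toy problem. The example below assumes a one-weight "landscape" f(w) = (w - 3)^2 whose slope is known analytically; in a real network the gradient would come from backpropagation instead. The function names are hypothetical.

```python
def gradient(w):
    """Slope of the toy loss f(w) = (w - 3)^2 at the current position."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Repeatedly step downhill: w = w - learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Start far from the valley floor (w = 3) and walk downhill.
w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final)  # ends very close to 3.0
```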
# Weight Update Rule
# w = w - (learning_rate * gradient)
# The 'learning_rate' determines step size.
3. The Learning Rate Dilemma
The Learning Rate is often regarded as the single most important hyperparameter in deep learning.
If your learning rate is too low, your model takes microscopic steps; training becomes very slow and can stall in a shallow valley (a local minimum). If your learning rate is too high, your model takes massive leaps; it can overshoot the deepest valley, bounce between its walls, and even diverge until the loss becomes NaN. Finding the 'Goldilocks' zone for the learning rate is essential for convergence.
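A rough illustration of the three regimes, again assuming the toy loss f(w) = (w - 3)^2 with a hypothetical helper `run`. The specific learning rates are chosen for this toy problem only and do not transfer to real models. Note that this landscape has a single valley, so the "too low" case shows slow progress rather than a local minimum.

```python
def run(learning_rate, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)^2 and return the final distance from the minimum."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * (w - 3.0)
    return abs(w - 3.0)

too_low = run(0.001)    # barely moves from the starting distance of 3.0
too_high = run(1.1)     # every step overshoots further: the distance grows
reasonable = run(0.1)   # shrinks steadily toward zero

print(too_low, too_high, reasonable)
```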
"""
LR too low -> Stagnation
LR too high -> Divergence (NaN loss)
LR just right -> Smooth convergence
"""
4. Adam: The Smart Engine
In the early days, standard Stochastic Gradient Descent (SGD) was the usual choice. Today, a common default for many projects is Adam (Adaptive Moment Estimation).
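Adam is a 'smart' optimizer. Instead of using a single, fixed step size for all weights, Adam automatically adapts the step for *each individual parameter* based on its past gradients. It tracks a running average of each gradient and of its square. If a weight's gradients have been consistent, the step stays close to the full learning rate. If they have been bouncing around wildly, the averaged gradient shrinks relative to its typical magnitude, and Adam takes smaller steps.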
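The per-parameter adaptation can be sketched for a single weight as follows. This is a simplified pure-Python rendering of the standard Adam update rule, not PyTorch's implementation; `adam_step` and the `state` dictionary are hypothetical names, and the default values mirror the commonly used ones (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8).

```python
import math

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; state holds t, m and v."""
    state["t"] += 1
    # Running averages of the gradient (m) and of the squared gradient (v).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Two weights whose gradients differ by a factor of 10,000.
big = adam_step(0.0, 100.0, {"t": 0, "m": 0.0, "v": 0.0})
small = adam_step(0.0, 0.01, {"t": 0, "m": 0.0, "v": 0.0})
print(big, small)  # both move by roughly the learning rate
```

Plain SGD would move the first weight 10,000 times further than the second. Here both take a step of about the same size, because each gradient is divided by its own typical magnitude.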
import torch.optim as optim
# Adam: The smart choice (adaptive)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# SGD: The classic choice (one fixed learning rate for all weights)
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01)
5. Momentum
Another key feature of modern optimizers like Adam is Momentum.
If you roll a heavy ball down a hill, it gains momentum. If it hits a small bump, its momentum carries it over. In optimization, the loss landscape is often filled with jagged noise (mini-batch variance) and shallow false valleys (local minima). By remembering past gradients (adding momentum), the optimizer can roll through the noise and past small bumps. This makes it more likely to settle in a good minimum, although nothing guarantees that it reaches the true global minimum.
# Momentum helps 'roll' through noise.
# Without it, the model gets stuck easily in flat regions.
# Adam has a momentum-like term built in: a running average of past gradients (beta1).
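The effect on flat regions can be sketched with a constant, gentle slope. The example assumes the classic momentum formulation (velocity = momentum * velocity + gradient, then w = w - learning_rate * velocity); the helper `travel` and its default values are hypothetical.

```python
def travel(momentum, steps=100, learning_rate=0.01, grad=1.0):
    """Distance covered on a constant gentle slope after a fixed number of steps."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad  # accumulate past gradients
        w = w - learning_rate * velocity       # step using the accumulated velocity
    return abs(w)

plain_distance = travel(momentum=0.0)     # ordinary gradient descent
momentum_distance = travel(momentum=0.9)  # the heavy ball picks up speed

print(plain_distance, momentum_distance)
```

With momentum 0.9 the optimizer covers several times the distance in the same number of steps, which is why it is less likely to crawl to a halt on a plateau.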