Navigating the Terrain: Loss & Descent

Pascual Vila
AI Engineer // Code Syllabus
To build intelligent applications, you must first define what it means to be wrong. Only by mathematically measuring failure can we systematically chart a path toward success.
The Loss Function (Cost)
A neural network initially makes random predictions. A Loss Function (or Cost Function) evaluates how far those predictions are from reality. It outputs a single number: the larger the number, the worse the model.
- Mean Squared Error (MSE): Used for regression (predicting continuous values like prices). It heavily penalizes large errors. Formula: $MSE = \frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$
- Cross-Entropy Loss: Used for classification (predicting categories like Cat vs. Dog). It measures the divergence between probability distributions.
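Both losses can be computed in a few lines. A minimal sketch, using toy values chosen purely for illustration (the house prices and "Cat" probabilities are not from any real dataset):

```python
import math

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred):
    # Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

# Regression: true prices vs. predicted prices (each off by 10).
print(mse([300.0, 150.0], [310.0, 140.0]))            # → 100.0

# Classification: label 1 = "Cat" at 80% confidence, label 0 = "Dog" at 20%.
print(binary_cross_entropy([1, 0], [0.8, 0.2]))       # ≈ 0.223
```

Note how the confident-and-correct predictions (0.8 for "Cat") produce a small cross-entropy; had the model said "0.2 Cat" instead, the loss would be much larger.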
Optimization: Gradient Descent
If the Loss Function maps out a landscape of hills (high error) and valleys (low error), Gradient Descent is the algorithm that tells us how to walk downhill to find the lowest point (the minimum).
By calculating the derivative (gradient) of the loss function with respect to the network's weights, we find the direction of the steepest ascent. We then step in the opposite direction to reduce the error.
The Learning Rate ($\alpha$)
The weight update formula is: $w_{new} = w_{old} - \alpha \cdot \nabla J(w)$. The $\alpha$ represents the Learning Rate. It dictates how large of a step we take downhill. If $\alpha$ is too small, the model takes ages to converge. If it is too large, the model takes chaotic, massive steps, completely missing the valley (divergence).
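The update rule above can be sketched end-to-end on a tiny linear regression problem. The data ($y = 2x$), the learning rate, and the iteration count are all illustrative assumptions, not a prescription:

```python
def gradient_step(w, b, xs, ys, alpha):
    n = len(xs)
    # Gradients of MSE with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step opposite the gradient: w_new = w_old - alpha * grad.
    return w - alpha * grad_w, b - alpha * grad_b

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relation: y = 2x
w, b = 0.0, 0.0
for _ in range(1000):
    w, b = gradient_step(w, b, xs, ys, alpha=0.05)
print(round(w, 2), round(b, 2))             # → 2.0 0.0
```

Each step walks a little way downhill; after enough steps the parameters settle into the valley at $w = 2$, $b = 0$.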
❓ Neural Engine FAQs
Why do we need different Loss Functions?
Because different tasks have different mathematical goals. If you are predicting a house price (Regression), you want to measure the exact distance from the true price (MSE). If you are predicting "Cat vs Dog" (Classification), predicting "80% Cat" is a probability problem, perfectly suited for Cross-Entropy.
What is Stochastic Gradient Descent (SGD)?
Standard Gradient Descent calculates the loss over the entire dataset before taking a single step. This is computationally expensive. Stochastic Gradient Descent (SGD) calculates the error and updates weights using only a single sample (or a small "mini-batch") at a time. It's noisier, but much faster and often avoids getting stuck in local minima.
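The difference is just where the gradient comes from: one sample at a time instead of the whole dataset. A minimal sketch of one SGD epoch on the same kind of toy data (sample values, learning rate, and epoch count are illustrative):

```python
import random

def sgd_epoch(w, data, alpha):
    random.shuffle(data)            # visit samples in random order
    for x, y in data:               # one weight update per single sample
        grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2 for this sample
        w = w - alpha * grad
    return w

random.seed(0)                      # reproducible shuffling
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true relation: y = 2x
w = 0.0
for _ in range(50):
    w = sgd_epoch(w, data, alpha=0.01)
print(round(w, 3))                  # → 2.0
```

Each individual update is a noisy estimate of the true gradient, but the updates are cheap, so the model makes many small corrections per pass over the data.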
What happens if my loss becomes NaN (Not a Number)?
This usually means your gradients exploded. Your learning rate is likely way too high, causing the weight updates to swing so wildly that the numbers overflowed the limits of floating-point representation. Lower your learning rate significantly (e.g., from 0.1 to 0.001) and restart training.
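You can reproduce this failure in miniature. A sketch on the steep toy loss $J(w) = 50w^2$ (an illustrative choice, not any real network): each update multiplies $w$ by $(1 - 100\alpha)$, so with $\alpha = 0.1$ the iterates grow past the floating-point range to infinity, and `inf - inf` then yields NaN, while $\alpha = 0.001$ converges quietly:

```python
def descend(lr, steps=400, w=1.0):
    # Plain gradient descent on J(w) = 50 * w^2.
    for _ in range(steps):
        grad = 100.0 * w       # dJ/dw for J(w) = 50 * w^2
        w = w - lr * grad
    return w

print(descend(0.1))    # explodes: overflows to inf, then becomes nan
print(descend(0.001))  # converges toward 0.0
```

Exactly the fix from above applies: dropping the learning rate from 0.1 to 0.001 turns the divergent run into a convergent one.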