Frequently Asked Questions

If Adam is so smart, why do people still use SGD?

While Adam often converges faster early in training, SGD with manually tuned momentum and a learning rate schedule often reaches slightly better test-set generalization on some computer vision tasks (such as training ResNets). Adam and its variants remain the usual choice in NLP and for rapid prototyping.

What does it mean if my loss suddenly becomes 'NaN' (Not a Number)?

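A NaN loss usually means a value overflowed or an invalid operation occurred somewhere in the computation. The most common cause is 'exploding' gradients: the learning rate is too high, so the weights update by such a massive amount that the values exceed the range of floating-point numbers. Other causes include taking the logarithm of zero, dividing by zero, or NaN values already present in the input data. Lower the learning rate first; if that does not help, clip the gradients and check your inputs.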
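
The two usual defenses against a NaN loss can be sketched in plain Python, without any deep learning library. The helper names `clip_by_norm` and `is_bad_loss` are hypothetical and chosen only for this illustration; in PyTorch, gradient clipping is available as `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in gradients]
    return list(gradients)

def is_bad_loss(loss_value):
    """Return True if the loss is NaN or infinite."""
    return not math.isfinite(loss_value)

# A gradient with norm 500 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_norm([300.0, -400.0], max_norm=1.0)

# Subtracting infinity from infinity is one way a NaN appears after an overflow.
overflowed = float("inf") - float("inf")
```

Here `clipped` is approximately `[0.6, -0.8]`, and `is_bad_loss(overflowed)` is true. Checking the loss every step with a test like this lets you stop training as soon as something goes wrong.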

What is the difference between an Epoch and a Step?

A Step occurs every time the optimizer updates the weights (after processing one batch of data). An Epoch is completed when the model has seen every piece of data in the entire training dataset exactly once.
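
The relationship between the two is simple arithmetic. The sketch below assumes a dataset of 50,000 examples and a batch size of 128; both numbers, and the helper name `steps_per_epoch`, are made up for the example.

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of optimizer steps needed to see every example once."""
    if drop_last:
        # The final, smaller batch is discarded.
        return num_examples // batch_size
    # The final, smaller batch still counts as one step.
    return math.ceil(num_examples / batch_size)

steps = steps_per_epoch(num_examples=50_000, batch_size=128)
total_steps = steps * 10  # steps taken over 10 epochs
```

With these numbers, one epoch is 391 steps, so 10 epochs is 3,910 steps.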

Loss & Optimizers in Artificial Intelligence

This tutorial covers loss functions and optimizers. You will see how cost functions and gradient descent work together, learn the trade-offs between classic SGD and modern Adam, and understand why the learning rate is often the most critical knob in deep learning.

Training a neural network is an optimization problem. We seek the set of weights that minimizes the model's error in a landscape of billions of possibilities.

1. The Cost of Error

Before a neural network can improve, it needs to know how badly it failed. That is the job of the Loss Function (or Cost Function).

The Loss Function calculates a single numerical score representing the penalty for being wrong. If you are predicting continuous numbers (like house prices), a common choice is Mean Squared Error (MSE). If you are classifying categories (like "cat" vs "dog"), the standard choice is Cross-Entropy Loss, which heavily penalizes the model for being confidently incorrect. The loss must be differentiable: without gradients of the loss, the network has no 'error signal' to guide its learning.

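import torch.nn as nn

# Binary cross-entropy for yes/no labels.
# nn.BCELoss expects probabilities in [0, 1] (for example, sigmoid outputs).
criterion = nn.BCELoss()
# Mean squared error for continuous targets such as prices.
criterion_reg = nn.MSELoss()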
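
To see what these two losses compute, here is a minimal plain-Python sketch of the formulas. The function names and the sample numbers are assumptions made for illustration; this is not how PyTorch implements them internally.

```python
import math

def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(prob, label, eps=1e-12):
    """Loss for one example; prob is the predicted probability of class 1."""
    prob = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Regression: both predictions miss the target by 10.
house_loss = mse([310.0, 190.0], [300.0, 200.0])

# Classification: the true label is 1 in both cases.
unsure_wrong = binary_cross_entropy(0.4, 1)      # wrong, but hesitant
confident_wrong = binary_cross_entropy(0.01, 1)  # wrong and very confident
```

`house_loss` is 100.0, since each squared error is 100. The confident mistake costs several times more than the hesitant one, which is the 'confidently incorrect' penalty described above.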

2. Gradient Descent Mechanics

Once we have an error score, we need to minimize it. Gradient Descent is the algorithm that achieves this.

Imagine standing in a foggy mountain range and trying to find the lowest valley. You can't see the whole map, so you check the slope of the ground beneath your feet and take a step downhill. In AI, calculating the slope is done via backpropagation, and taking the step is done by the Optimizer. The size of the step you take is called the Learning Rate.

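# Weight update rule (the numbers below are illustrative values only)
w, gradient, learning_rate = 0.5, 0.2, 0.1
w = w - (learning_rate * gradient)

# 'learning_rate' determines the step size;
# 'gradient' is the slope computed by backpropagation.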
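
Repeating that update in a loop is the whole algorithm. The sketch below runs it on a toy one-weight problem, `loss(w) = (w - 3)^2`, whose lowest point is at `w = 3`. The toy loss and the function names are assumptions for illustration; a real network has millions of weights and gets its gradients from backpropagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss: the slope under our feet."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Apply the weight update rule repeatedly and return the final weight."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=50)
```

Starting from `w = 0`, fifty steps bring `w_final` very close to 3, and the loss drops from 9 to nearly 0.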

3. The Learning Rate Dilemma

The Learning Rate is often regarded as the single most important hyperparameter in deep learning.

If your learning rate is too low, your model takes microscopic steps; training will take forever and might get stuck in a shallow valley (a local minimum). If your learning rate is too high, your model takes massive leaps; it can overshoot the valley on every step, so the loss oscillates or grows instead of shrinking. Finding the 'Goldilocks' zone for the learning rate is essential for convergence.

"""
LR too low -> Stagnation
LR too high -> Divergence (NaN loss)
LR just right -> Smooth convergence
"""
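
The three regimes can be reproduced on a toy problem. The sketch below assumes the loss `(w - 3)^2`, a starting weight of 0, and 20 steps; the three learning rates were picked only to make the contrast visible and are not recommendations for real models.

```python
def final_loss(learning_rate, steps=20, w=0.0):
    """Run gradient descent on loss(w) = (w - 3)^2 and return the final loss."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - learning_rate * grad
    return (w - 3.0) ** 2

start_loss = final_loss(learning_rate=0.0, steps=0)  # loss before any update

# Too low, about right, too high.
results = {lr: final_loss(lr) for lr in (0.001, 0.1, 1.1)}
```

With `lr=0.001` the loss barely moves from its starting value of 9. With `lr=0.1` it falls close to zero. With `lr=1.1` every step overshoots by more than the last, and the loss ends up far above where it started.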

4. Adam: The Smart Engine

In the early days, most deep learning work used standard Stochastic Gradient Descent (SGD). Today, Adam (Adaptive Moment Estimation) is a common default for new projects.

Adam is a 'smart' optimizer. Instead of applying a single, fixed learning rate to all weights, Adam adapts the step size for *each individual parameter* using running averages of that parameter's past gradients and squared gradients. If a weight's gradients consistently point the same way, Adam takes relatively larger steps. If they bounce around wildly, Adam takes smaller ones.

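import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # placeholder model, assumed for illustration

# Adam: adaptive, per-parameter step sizes
optimizer = optim.Adam(model.parameters(), lr=0.001)

# SGD: the classic choice, one global learning rate for every parameter
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01)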
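
What happens inside an Adam step can be sketched for a single weight in plain Python. This follows the standard published update rule with its usual default constants (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the function name `adam_minimize` and the toy gradient are assumptions for this example, and library implementations add further options.

```python
import math

def adam_minimize(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-weight function with the Adam update rule."""
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction: m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_adam = adam_minimize(toy_grad, w=0.0, lr=0.1, steps=500)

# The very first step has size ~lr regardless of how large the gradient is.
first_step = adam_minimize(toy_grad, w=0.0, lr=0.1, steps=1)
first_step_big = adam_minimize(lambda w: 1000.0 * toy_grad(w), w=0.0, lr=0.1, steps=1)
```

Dividing by the square root of `v_hat` is what makes the step size adaptive: a gradient 1000 times larger produces essentially the same first step, so each parameter's update is normalized by its own gradient history.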

5. Momentum

Momentum is another key ingredient of modern optimizers, including Adam.

If you roll a heavy ball down a hill, it gains momentum. If it hits a small bump, its momentum carries it over. In optimization, the loss landscape is often filled with jagged noise (mini-batch variance) and shallow false valleys (local minima). By remembering past gradients (adding momentum), the optimizer can roll through the noise and past small bumps. This improves its chances of reaching a good minimum, although nothing guarantees that it finds the true global minimum.

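# Momentum helps the optimizer 'roll' through noisy gradients.
# Without it, progress can stall in flat regions.
# Adam has momentum built in: it keeps a running average of past gradients.
# Plain SGD needs it switched on explicitly, e.g. optim.SGD(..., momentum=0.9).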
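
The effect on flat regions can be shown with a small sketch. It assumes a deliberately shallow toy loss, `0.01 * (w - 3)^2`, and uses the classic velocity form of momentum (velocity = momentum * velocity + gradient). The function names and constants are chosen for illustration only.

```python
def shallow_grad(w):
    """Gradient of the shallow toy loss 0.01 * (w - 3)^2."""
    return 0.02 * (w - 3.0)

def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=100):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad_fn(w)
        w = w - lr * velocity
    return w

w_plain = sgd_momentum(shallow_grad, w=0.0, momentum=0.0)     # no memory of past gradients
w_with_momentum = sgd_momentum(shallow_grad, w=0.0, momentum=0.9)
```

After the same 100 steps, the run with momentum ends much closer to the minimum at `w = 3` than the plain run, because the small gradients keep adding up in `velocity` instead of being applied once and forgotten.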

Pascual Vila

Frontend Instructor // Code Syllabus

Lesson Glossary

[01] Loss Function (The Scorecard)

A mathematical formula that measures how well the model's prediction matches the target.

[02] Optimizer (The Driver)

The algorithm that updates the network's weights to minimize the loss.

[03] Learning Rate (Step Size)

A hyperparameter that controls the step size taken during gradient descent.

[04] Adam (Modern Standard)

An adaptive optimizer that combines momentum and parameter-specific learning rates.

[05] Cross-Entropy (Log-Loss)

The standard loss function for classification tasks.
