Gradient Descent Optimization
Visualize how different optimization algorithms navigate loss landscapes to find optimal parameters
Gradient Descent: The Foundation of Machine Learning Optimization
Introduction
Gradient descent is the workhorse optimization algorithm behind most machine learning models. Whether you're training a simple linear regression or a complex deep neural network, gradient descent (or one of its variants) is likely doing the heavy lifting to find the best parameters.
At its core, gradient descent is beautifully simple: it iteratively adjusts parameters in the direction that most reduces the loss function. Think of it like hiking down a mountain in fog—you can't see the bottom, but you can feel which direction is steepest downward, so you take small steps in that direction.
The Gradient Descent Algorithm
Figure: Gradient descent iteratively moves toward the minimum of the loss function
Basic Concept
Given a loss function L(θ) that measures how well our model performs, gradient descent updates the parameters θ using this rule:
θ = θ - α∇L(θ)
Where:
- θ represents the model parameters (weights, biases, etc.)
- α is the learning rate (step size)
- ∇L(θ) is the gradient of the loss function (direction of steepest ascent)
The negative sign means we move in the opposite direction of the gradient—downhill toward lower loss.
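To make the update rule concrete, here is a minimal sketch in NumPy; the quadratic bowl loss and the `grad_fn` callback are illustrative choices, not part of any particular library:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, steps=100):
    """Repeatedly apply theta = theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# Quadratic bowl L(theta) = theta_x^2 + theta_y^2 has gradient 2*theta.
theta_min = gradient_descent(lambda theta: 2.0 * theta, [3.0, -2.0])
print(theta_min)  # close to [0, 0], the global minimum
```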
The Learning Rate
The learning rate α is crucial:
- Too small: Convergence is slow, requiring many iterations
- Too large: May overshoot the minimum or even diverge
- Just right: Efficient convergence to a good solution
Finding the right learning rate often requires experimentation.
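One way to see all three regimes is to run the update on a one-dimensional quadratic, where the behavior is exactly predictable: for L(x) = x², each step multiplies x by (1 - 2α), so any α > 1 diverges. A toy sketch, with values chosen purely for illustration:

```python
# L(x) = x^2 has gradient 2x, so each step multiplies x by (1 - 2*alpha).
for alpha in (0.01, 0.1, 1.1):
    x = 1.0
    for _ in range(50):
        x -= alpha * 2 * x
    print(f"alpha={alpha}: x = {x:.3g} after 50 steps")
# alpha=0.01 crawls toward 0, alpha=0.1 converges fast, alpha=1.1 diverges
```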
Optimizer Variants
While basic gradient descent works, researchers have developed sophisticated variants that converge faster and more reliably.
SGD (Stochastic Gradient Descent)
The simplest variant just follows the gradient. In true stochastic gradient descent, ∇L(θ) is estimated from a random mini-batch rather than the full dataset, which makes each step noisy but cheap:
θ = θ - α∇L(θ)
- Pros: Simple, easy to understand
- Cons: Can be slow, sensitive to the learning rate, oscillates in ravines
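The sketch below shows the "stochastic" part in action on a hypothetical linear-regression problem: each step uses the mean-squared-error gradient on a random mini-batch rather than the full dataset. The data, sizes, and learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                  # synthetic inputs
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)    # noisy targets

w, alpha, batch_size = np.zeros(2), 0.05, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)             # random mini-batch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size   # MSE gradient
    w -= alpha * grad
print(w)  # noisy estimate close to [2.0, -1.0]
```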
SGD with Momentum
Figure: Momentum helps accelerate convergence and reduce oscillations
Adds "momentum" to accelerate convergence and dampen oscillations:
v = βv + ∇L(θ)
θ = θ - αv
The velocity v accumulates gradients over time, helping the optimizer:
- Build speed in consistent directions
- Dampen oscillations in inconsistent directions
- Escape shallow local minima
Typical β value: 0.9 (90% of previous velocity retained)
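A single momentum step, written to match the two update equations above (a sketch; libraries such as PyTorch implement the same idea with slightly different conventions):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """v = beta*v + grad; theta = theta - alpha*v."""
    v = beta * v + grad
    return theta - alpha * v, v

# Elongated valley L = x^2 + 10*y^2 (gradient [2x, 20y]): momentum builds
# speed in the shallow x direction, where plain GD at the same alpha crawls.
theta, v = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(200):
    grad = np.array([2 * theta[0], 20 * theta[1]])
    theta, v = momentum_step(theta, v, grad)
print(theta)  # approaches [0, 0]
```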
RMSProp (Root Mean Square Propagation)
Adapts the learning rate for each parameter based on recent gradient magnitudes (writing g = ∇L(θ) for the gradient):
E[g²] = βE[g²] + (1-β)g²
θ = θ - α·g / √(E[g²] + ε)
This helps with:
- Ill-conditioned problems: Different parameters need different learning rates
- Ravines: Steep in some directions, shallow in others
Typical β value: 0.9
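One RMSProp step, following the equations above (a sketch; `eps` guards against division by zero). Because the running average is tracked per coordinate, steep directions get small steps and shallow directions get large ones:

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """Scale each parameter's step by the RMS of its recent gradients."""
    sq_avg = beta * sq_avg + (1 - beta) * grad**2   # running E[g^2]
    theta = theta - alpha * grad / np.sqrt(sq_avg + eps)
    return theta, sq_avg
```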
Adam (Adaptive Moment Estimation)
Combines the best of momentum and RMSProp:
m = β₁m + (1-β₁)g # First moment (momentum)
v = β₂v + (1-β₂)g² # Second moment (RMSProp)
m̂ = m / (1-β₁ᵗ) # Bias correction
v̂ = v / (1-β₂ᵗ) # Bias correction
θ = θ - α·m̂ / (√v̂ + ε)
Adam is often the default choice because it:
- Adapts learning rates per parameter
- Includes momentum for acceleration
- Corrects for initialization bias
- Works well across many problems
Typical values: β₁=0.9, β₂=0.999, ε=1e-8
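Putting the pieces together, one Adam step as a sketch of the equations above (note that `t` must start at 1 so the bias corrections are well-defined):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2      # second moment (RMSProp)
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v
```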
Loss Landscapes
Figure: Different types of loss landscapes with local minima and saddle points
Understanding loss landscapes helps explain optimizer behavior.
Convex Functions
Simple bowl-shaped functions have one global minimum. All optimizers easily find it, though at different speeds.
Example: Quadratic bowl (x² + y²)
Ill-Conditioned Functions
Elongated valleys where gradients are much steeper in some directions. Basic SGD oscillates; momentum and adaptive methods handle these better.
Example: Elongated valley (x² + 10y²)
Non-Convex Functions
Real-world loss landscapes with:
- Local minima: Suboptimal solutions that trap basic optimizers
- Saddle points: Points where the gradient vanishes without being minima; the nearly flat regions around them slow convergence
- Plateaus: Nearly flat regions with tiny gradients
Examples: Rosenbrock function, Himmelblau function
Multi-Modal Functions
Multiple local minima of varying quality. The starting point matters!
Example: Rastrigin function (many local minima)
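For reference, here are the standard textbook definitions of the landscapes named above, so the examples can be reproduced:

```python
import numpy as np

def bowl(x, y):         # convex: one global minimum at (0, 0)
    return x**2 + y**2

def valley(x, y):       # ill-conditioned: 10x steeper in y than in x
    return x**2 + 10 * y**2

def rosenbrock(x, y):   # non-convex: narrow curved valley, minimum at (1, 1)
    return (1 - x)**2 + 100 * (y - x**2)**2

def himmelblau(x, y):   # non-convex: four global minima of equal value
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def rastrigin(x, y):    # multi-modal: grid of local minima, global minimum at (0, 0)
    return 20 + x**2 + y**2 - 10 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y))
```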
Practical Considerations
Convergence Criteria
When to stop optimizing?
- Gradient norm: Stop when ||∇L(θ)|| < ε (gradient is tiny)
- Loss change: Stop when loss improvement is negligible
- Maximum iterations: Prevent infinite loops
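A minimal driver loop combining the first and third criteria (a sketch; the tolerance and iteration cap are arbitrary illustrative values):

```python
import numpy as np

def optimize(grad_fn, theta, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Gradient descent with a gradient-norm stop and an iteration cap."""
    for t in range(max_iters):
        g = grad_fn(theta)
        if np.linalg.norm(g) < tol:   # gradient is tiny: declare convergence
            return theta, t
        theta = theta - alpha * g
    return theta, max_iters           # cap prevents an infinite loop
```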
Hyperparameter Tuning
Key hyperparameters to tune:
- Learning rate (α): Most important; try 0.001, 0.01, 0.1
- Momentum (β): Usually 0.9 or 0.99
- Batch size: Affects gradient noise (not shown in this visualization)
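A coarse sweep over those suggested learning rates on the quadratic bowl shows how strongly α controls the iteration count (illustrative numbers only; real models need a sweep on validation loss):

```python
import numpy as np

# Gradient descent on L = x^2 + y^2 (gradient 2*theta) until ||grad|| < 1e-6.
for alpha in (0.001, 0.01, 0.1):
    theta, t = np.array([3.0, -2.0]), 0
    while np.linalg.norm(2 * theta) > 1e-6 and t < 100_000:
        theta = theta - alpha * 2 * theta
        t += 1
    print(f"alpha={alpha}: converged in {t} iterations")
```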
Common Pitfalls
- Exploding gradients: Learning rate too high, parameters diverge
- Vanishing gradients: Gradients become too small to make progress
- Poor initialization: Starting in a bad region of the loss landscape
- Wrong optimizer: Some problems need specific optimizers
Interactive Exploration
Use the interactive visualization to:
- Compare optimizers: See how SGD, Momentum, RMSProp, and Adam navigate the same landscape
- Adjust learning rate: Observe the impact on convergence speed and stability
- Try different landscapes: See how optimizer performance varies by problem type
- Step through optimization: Watch the path unfold iteration by iteration
Key Takeaways
- Gradient descent is the foundation of ML optimization
- The learning rate is the most critical hyperparameter
- Momentum helps accelerate convergence and escape shallow local minima
- Adaptive methods (RMSProp, Adam) adjust learning rates per parameter
- Loss landscape shape dramatically affects optimizer performance
- Adam is often a good default choice for most problems