Training & Optimization
How neural networks learn: backpropagation, gradient descent, and optimization techniques that make deep learning possible.
Training at Scale
- GPT-3 training cost: an estimated $4.6 million in compute
- Adam is among the most widely used optimizers in production models
- Training an ImageNet classifier from scratch can take days to weeks, depending on the model and hardware
- Learning rate is often the single most important hyperparameter to tune
Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes gradients of the loss function with respect to all parameters using the chain rule of calculus. These gradients tell us how to adjust parameters to reduce the loss.
The Algorithm
- Forward pass: Compute predictions and loss
- Backward pass: Compute gradients starting from output layer
- Apply chain rule: ∂L/∂wᵢ = (∂L/∂y) × (∂y/∂z) × (∂z/∂wᵢ)
- Update parameters: wᵢ = wᵢ - α × ∂L/∂wᵢ
The beauty of backpropagation is its efficiency: computing gradients for all parameters requires just one forward and one backward pass.
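The four steps above can be sketched on the smallest possible example: a single linear neuron z = w·x + b with identity activation and squared-error loss. This is a minimal illustration (all names are ours, not from any library); real frameworks apply the same chain rule automatically across millions of parameters.

```python
# One neuron: z = w*x + b, y = z (identity), L = (y - t)^2

def forward(w, b, x):
    z = w * x + b
    return z  # identity activation: y = z

def backward(w, b, x, t):
    """Backward pass: apply the chain rule dL/dw = dL/dy * dy/dz * dz/dw."""
    y = forward(w, b, x)
    dL_dy = 2.0 * (y - t)   # derivative of (y - t)^2
    dy_dz = 1.0             # identity activation
    dz_dw = x
    dz_db = 1.0
    dL_dw = dL_dy * dy_dz * dz_dw
    dL_db = dL_dy * dy_dz * dz_db
    return dL_dw, dL_db

# Parameter update: w = w - alpha * dL/dw
w, b, alpha = 0.5, 0.0, 0.1
x, t = 1.0, 2.0
for _ in range(100):
    dw, db = backward(w, b, x, t)
    w -= alpha * dw
    b -= alpha * db

print(round(forward(w, b, x), 3))  # prediction converges toward the target 2.0
```

Note that one backward pass yields the gradients for every parameter (here w and b) at once, which is exactly the efficiency the text describes.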
Gradient Descent Variants
Batch Gradient Descent
Computes gradients using entire dataset. Accurate but slow for large datasets.
θ = θ - α∇J(θ)
Stochastic Gradient Descent (SGD)
Updates parameters after each example. Fast but noisy.
θ = θ - α∇J(θ; xᵢ, yᵢ)
Mini-batch GD
Combines the stability of batch GD with the speed of SGD. Updates using small batches (typically 32-256 examples). The industry standard.
SGD with Momentum
Accumulates a velocity vector from past gradients, damping oscillations and accelerating progress along consistently downhill directions. Can also help carry updates past shallow local minima and saddle points.
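The variants above differ mainly in how much data feeds each gradient and whether a velocity term is kept. A sketch on a toy least-squares problem (names and hyperparameters are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize J(theta) = mean((X @ theta - y)^2).
X = rng.normal(size=(256, 2))
theta_true = np.array([3.0, -1.0])
y = X @ theta_true

def grad(theta, Xb, yb):
    # Gradient of J on a batch: (2/|B|) * Xb^T (Xb @ theta - yb)
    return 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)

def minibatch_sgd_momentum(alpha=0.05, beta=0.9, batch=32, epochs=50):
    theta = np.zeros(2)
    v = np.zeros(2)                    # accumulated velocity
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # shuffle, then walk over mini-batches
        for s in range(0, len(X), batch):
            b = idx[s:s + batch]
            v = beta * v + grad(theta, X[b], y[b])  # momentum accumulation
            theta = theta - alpha * v               # parameter update
    return theta

theta = minibatch_sgd_momentum()
print(theta)  # recovers approximately [3., -1.]
```

Setting `batch=len(X)` turns this into batch gradient descent, `batch=1` into pure SGD, and `beta=0.0` removes momentum, so all four variants are special cases of the same loop.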
Advanced Optimizers
Adam (Adaptive Moment Estimation)
The most popular optimizer in deep learning. Combines momentum and adaptive learning rates per parameter. Maintains running averages of gradients and their squares.
- Automatically adjusts learning rate for each parameter
- Works well with sparse gradients
- Default choice for most applications
- Typical hyperparameters: β₁=0.9, β₂=0.999, α=0.001
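The moment estimates and bias correction can be written out directly. Below is a minimal Adam sketch using the defaults listed above (β₁=0.9, β₂=0.999, α=0.001), applied to f(θ) = θ² as a stand-in for a real loss; the function names are ours.

```python
import numpy as np

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=5000):
    m = np.zeros_like(theta)  # running average of gradients (1st moment)
    v = np.zeros_like(theta)  # running average of squared gradients (2nd moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction: early averages
        v_hat = v / (1 - beta2 ** t)  # start at zero and must be rescaled
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = adam(lambda th: 2 * th, np.array([1.0]))
print(theta)  # close to the minimum at 0
```

Dividing by √v̂ is what makes the learning rate adaptive per parameter: coordinates with consistently large gradients take proportionally smaller steps.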
RMSprop
Adapts learning rate per parameter using moving average of squared gradients. Good for RNNs.
AdaGrad
Adapts learning rate based on the accumulated history of squared gradients. Works well for sparse data, but because the effective learning rate only ever shrinks, it can become too small before training finishes.
Learning Rate Scheduling
The learning rate determines the step size when updating parameters. Too high causes instability, too low slows training. Learning rate schedules adjust this over time for better convergence.
Step Decay
Reduce learning rate by factor (e.g., 0.5) every N epochs
Exponential Decay
Smoothly decrease: α = α₀ × e^(-kt)
Cosine Annealing
Follows cosine curve from max to min
Warmup
Start with small LR, gradually increase to target
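Each of the four schedules above is a short formula mapping the epoch t to a learning rate. Illustrative implementations (parameter names and defaults are our own choices):

```python
import math

def step_decay(alpha0, t, factor=0.5, every=10):
    # Multiply by `factor` once every `every` epochs
    return alpha0 * factor ** (t // every)

def exponential_decay(alpha0, t, k=0.05):
    # Smooth decrease: alpha = alpha0 * e^(-k*t)
    return alpha0 * math.exp(-k * t)

def cosine_annealing(alpha0, t, T, alpha_min=0.0):
    # Follow a cosine curve from alpha0 down to alpha_min over T epochs
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T))

def warmup(alpha0, t, warmup_epochs=5):
    # Ramp linearly from near zero up to the target rate
    return alpha0 * min(1.0, (t + 1) / warmup_epochs)

print(step_decay(0.1, 20))                       # two halvings: 0.025
print(round(cosine_annealing(0.1, 50, 100), 4))  # midpoint of the curve: 0.05
```

In practice warmup is often composed with one of the decay schedules: ramp up for the first few epochs, then hand off to cosine or step decay.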
Regularization Techniques
Regularization prevents overfitting by constraining model complexity or adding noise during training.
Dropout
Randomly deactivate neurons during training (typically 20-50%). Forces the network to learn redundant, robust features. One of the most widely used regularization techniques for deep networks.
L2 Regularization (Weight Decay)
Adds penalty proportional to squared weights: Loss + λ||w||²
Early Stopping
Stop training when validation loss stops improving
Batch Normalization
Normalize activations, acts as regularizer and speeds training
Data Augmentation
Artificially expand training set with transformations
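Two of the techniques above reduce to a few lines each. A sketch of inverted dropout and the L2 penalty term (illustrative only; frameworks ship tested versions of both):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p, and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged,
    letting the network run as-is at test time."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

def l2_penalty(weights, lam=1e-4):
    """Weight-decay term added to the loss: lambda * ||w||^2."""
    return lam * np.sum(weights ** 2)

a = np.ones(10000)
dropped = dropout(a, p=0.3)
print(round(float(dropped.mean()), 1))            # ~1.0: expectation preserved
print(l2_penalty(np.array([1.0, 2.0]), lam=0.1))  # 0.1 * (1 + 4) = 0.5
```

Early stopping, batch normalization, and data augmentation live in the training loop and data pipeline rather than in the loss, so they are not shown here.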
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- Adam optimizer is the default choice for most deep learning tasks
- Learning rate is the most important hyperparameter to tune carefully
- Regularization (dropout, weight decay) prevents overfitting in deep networks