Training & Optimization
How neural networks learn: backpropagation, gradient descent, and optimization techniques that make deep learning possible.
Training at Scale
- GPT-3 training cost: an estimated $4.6 million in compute
- Adam is among the most widely used optimizers in production models
- Training an ImageNet classifier from scratch can take days to weeks, depending on the model and hardware
- Learning rate is often the single most important hyperparameter to tune
Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It efficiently computes gradients of the loss function with respect to all parameters using the chain rule of calculus. These gradients tell us how to adjust parameters to reduce the loss.
The Algorithm
- Forward pass: Compute predictions and loss
- Backward pass: Compute gradients starting from output layer
- Apply chain rule: ∂L/∂wᵢ = (∂L/∂y) × (∂y/∂z) × (∂z/∂wᵢ)
- Update parameters: wᵢ = wᵢ - α × ∂L/∂wᵢ
The beauty of backpropagation is its efficiency: computing gradients for all parameters requires just one forward and one backward pass.
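The four steps above can be sketched on the smallest possible example: a single linear neuron z = w·x + b with identity activation and squared-error loss. This is a minimal illustration (all names are ours, not from any library); real frameworks apply the same chain rule automatically across millions of parameters.

```python
# One neuron: z = w*x + b, y = z (identity), L = (y - t)^2

def forward(w, b, x):
    z = w * x + b
    return z  # identity activation: y = z

def backward(w, b, x, t):
    """Backward pass: apply the chain rule dL/dw = dL/dy * dy/dz * dz/dw."""
    y = forward(w, b, x)
    dL_dy = 2.0 * (y - t)   # derivative of (y - t)^2
    dy_dz = 1.0             # identity activation
    dz_dw = x
    dz_db = 1.0
    dL_dw = dL_dy * dy_dz * dz_dw
    dL_db = dL_dy * dy_dz * dz_db
    return dL_dw, dL_db

# Parameter update: w = w - alpha * dL/dw
w, b, alpha = 0.5, 0.0, 0.1
x, t = 1.0, 2.0
for _ in range(100):
    dw, db = backward(w, b, x, t)
    w -= alpha * dw
    b -= alpha * db

print(round(forward(w, b, x), 3))  # prediction converges toward the target 2.0
```

Note that one backward pass yields the gradients for every parameter (here w and b) at once, which is exactly the efficiency the text describes.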
Gradient Descent Variants
Batch Gradient Descent
Computes gradients using entire dataset. Accurate but slow for large datasets.
θ = θ - α∇J(θ)
Stochastic Gradient Descent (SGD)
Updates parameters after each example. Fast but noisy.
θ = θ - α∇J(θ; xᵢ, yᵢ)
Mini-batch GD
Combines the stability of batch GD with the speed of SGD. Updates using small batches (typically 32-256 examples). The industry standard.
SGD with Momentum
Accumulates a velocity vector from past gradients, damping oscillations and accelerating progress along consistently downhill directions. Can also help carry updates past shallow local minima and saddle points.
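The variants above differ mainly in how much data feeds each gradient and whether a velocity term is kept. A sketch on a toy least-squares problem (names and hyperparameters are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize J(theta) = mean((X @ theta - y)^2).
X = rng.normal(size=(256, 2))
theta_true = np.array([3.0, -1.0])
y = X @ theta_true

def grad(theta, Xb, yb):
    # Gradient of J on a batch: (2/|B|) * Xb^T (Xb @ theta - yb)
    return 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)

def minibatch_sgd_momentum(alpha=0.05, beta=0.9, batch=32, epochs=50):
    theta = np.zeros(2)
    v = np.zeros(2)                    # accumulated velocity
    for _ in range(epochs):
        idx = rng.permutation(len(X))  # shuffle, then walk over mini-batches
        for s in range(0, len(X), batch):
            b = idx[s:s + batch]
            v = beta * v + grad(theta, X[b], y[b])  # momentum accumulation
            theta = theta - alpha * v               # parameter update
    return theta

theta = minibatch_sgd_momentum()
print(theta)  # recovers approximately [3., -1.]
```

Setting `batch=len(X)` turns this into batch gradient descent, `batch=1` into pure SGD, and `beta=0.0` removes momentum, so all four variants are special cases of the same loop.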
Advanced Optimizers
Adam (Adaptive Moment Estimation)
The most popular optimizer in deep learning. Combines momentum and adaptive learning rates per parameter. Maintains running averages of gradients and their squares.
- Automatically adjusts learning rate for each parameter
- Works well with sparse gradients
- Default choice for most applications
- Typical hyperparameters: β₁=0.9, β₂=0.999, α=0.001
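The moment estimates and bias correction can be written out directly. Below is a minimal Adam sketch using the defaults listed above (β₁=0.9, β₂=0.999, α=0.001), applied to f(θ) = θ² as a stand-in for a real loss; the function names are ours.

```python
import numpy as np

def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=5000):
    m = np.zeros_like(theta)  # running average of gradients (1st moment)
    v = np.zeros_like(theta)  # running average of squared gradients (2nd moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction: early averages
        v_hat = v / (1 - beta2 ** t)  # start at zero and must be rescaled
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = adam(lambda th: 2 * th, np.array([1.0]))
print(theta)  # close to the minimum at 0
```

Dividing by √v̂ is what makes the learning rate adaptive per parameter: coordinates with consistently large gradients take proportionally smaller steps.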
RMSprop
Adapts learning rate per parameter using moving average of squared gradients. Good for RNNs.
AdaGrad
Adapts learning rate based on the accumulated history of squared gradients. Works well for sparse data, but because the effective learning rate only ever shrinks, it can become too small before training finishes.
Learning Rate Scheduling
The learning rate determines the step size when updating parameters. Too high causes instability, too low slows training. Learning rate schedules adjust this over time for better convergence.
Step Decay
Reduce learning rate by factor (e.g., 0.5) every N epochs
Exponential Decay
Smoothly decrease: α = α₀ × e^(-kt)
Cosine Annealing
Follows cosine curve from max to min
Warmup
Start with small LR, gradually increase to target
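Each of the four schedules above is a short formula mapping the epoch t to a learning rate. Illustrative implementations (parameter names and defaults are our own choices):

```python
import math

def step_decay(alpha0, t, factor=0.5, every=10):
    # Multiply by `factor` once every `every` epochs
    return alpha0 * factor ** (t // every)

def exponential_decay(alpha0, t, k=0.05):
    # Smooth decrease: alpha = alpha0 * e^(-k*t)
    return alpha0 * math.exp(-k * t)

def cosine_annealing(alpha0, t, T, alpha_min=0.0):
    # Follow a cosine curve from alpha0 down to alpha_min over T epochs
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T))

def warmup(alpha0, t, warmup_epochs=5):
    # Ramp linearly from near zero up to the target rate
    return alpha0 * min(1.0, (t + 1) / warmup_epochs)

print(step_decay(0.1, 20))                       # two halvings: 0.025
print(round(cosine_annealing(0.1, 50, 100), 4))  # midpoint of the curve: 0.05
```

In practice warmup is often composed with one of the decay schedules: ramp up for the first few epochs, then hand off to cosine or step decay.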
Regularization Techniques
Regularization prevents overfitting by constraining model complexity or adding noise during training.
Dropout
Randomly deactivate neurons during training (typically 20-50%). Forces the network to learn redundant, robust features. One of the most widely used regularization techniques for deep networks.
L2 Regularization (Weight Decay)
Adds penalty proportional to squared weights: Loss + λ||w||²
Early Stopping
Stop training when validation loss stops improving
Batch Normalization
Normalize activations, acts as regularizer and speeds training
Data Augmentation
Artificially expand training set with transformations
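Two of the techniques above reduce to a few lines each. A sketch of inverted dropout and the L2 penalty term (illustrative only; frameworks ship tested versions of both):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p, and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged,
    letting the network run as-is at test time."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

def l2_penalty(weights, lam=1e-4):
    """Weight-decay term added to the loss: lambda * ||w||^2."""
    return lam * np.sum(weights ** 2)

a = np.ones(10000)
dropped = dropout(a, p=0.3)
print(round(float(dropped.mean()), 1))            # ~1.0: expectation preserved
print(l2_penalty(np.array([1.0, 2.0]), lam=0.1))  # 0.1 * (1 + 4) = 0.5
```

Early stopping, batch normalization, and data augmentation live in the training loop and data pipeline rather than in the loss, so they are not shown here.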
Key Takeaways
- Backpropagation efficiently computes gradients using the chain rule
- Adam optimizer is the default choice for most deep learning tasks
- Learning rate is the most important hyperparameter to tune carefully
- Regularization (dropout, weight decay) prevents overfitting in deep networks