AI Guide for Senior Software Engineers

Deep Learning Architectures

Exploring specialized neural network architectures designed for specific types of data and tasks.

Deep Learning Milestones

  • AlexNet (2012) sparked the deep learning revolution with 8 layers
  • ResNet-152 (2015) achieved breakthrough with 152 layers using skip connections
  • CNNs reduced the ImageNet classification error rate below 3%, surpassing the estimated human error rate of roughly 5%
  • Modern architectures can exceed a thousand layers (e.g., the 1001-layer ResNet)

Why Deep Learning?

Deep learning refers to neural networks with many layers (typically more than three). The "deep" in deep learning signifies the depth of the network: the number of layers between input and output. Deep architectures can learn hierarchical representations that shallow networks cannot.

However, depth alone isn't enough. Specialized architectures have been developed to handle different types of data structures: convolutional networks for spatial data, recurrent networks for sequential data, and more.

Convolutional Neural Networks (CNNs)

CNNs are designed for processing grid-like data, especially images. They use convolution operations that apply learnable filters across the input, detecting local patterns like edges, textures, and shapes.

Key Components

  • Convolutional Layers: Apply filters (kernels) that slide over the input, computing dot products. Each filter learns to detect specific features (edges, textures, patterns).
  • Pooling Layers: Downsample spatial dimensions, reducing computation and providing translation invariance. Max pooling selects maximum values, average pooling computes averages.
  • Feature Maps: Output of convolutional layers. Each filter produces one feature map, highlighting where that feature appears in the input.
  • Stride & Padding: Stride controls filter movement step size. Padding adds zeros around borders to control output dimensions.
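The components above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, image, and edge-detecting filter are all made up for the example, and deep learning frameworks actually compute cross-correlation (no kernel flip), which is what this sketch does too.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a filter over a 2D input, computing a dot product at each position."""
    if padding > 0:
        image = np.pad(image, padding)  # zero-pad all borders
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1  # output height
    ow = (image.shape[1] - kw) // stride + 1  # output width
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the filter
    return out

def max_pool(fmap, size=2, stride=2):
    """Downsample a feature map by taking the max over each window."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

# A vertical-edge filter applied to a tiny synthetic image
image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # left half dark, right half bright
edge_filter = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
fmap = conv2d(image, edge_filter, padding=1)  # 6x6 feature map (padding preserves size)
pooled = max_pool(fmap)                       # 3x3 after 2x2 max pooling
```

Note how padding of 1 with a 3×3 filter keeps the output the same size as the input, and how the feature map responds strongly only where the dark-to-bright edge occurs.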

Why CNNs Work for Images

  • Local connectivity: Each neuron connects only to a small region (receptive field)
  • Parameter sharing: Same filter used across entire image, drastically reducing parameters
  • Translation invariance: Can detect features regardless of position in image
  • Hierarchical features: Early layers detect edges, later layers detect complex objects
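Parameter sharing is easy to quantify with a back-of-the-envelope comparison. The layer sizes below are illustrative, but the arithmetic shows why convolution scales where full connectivity does not:

```python
# Input: a 224x224 RGB image (150,528 values)
h, w, c = 224, 224, 3
n_units = 64            # fully connected layer with 64 neurons
n_filters = 64          # conv layer with 64 filters of size 3x3

# Fully connected: one weight per input value per neuron, plus biases
fc_params = h * w * c * n_units + n_units

# Convolutional: each 3x3xC filter is reused at every spatial position
conv_params = 3 * 3 * c * n_filters + n_filters

print(fc_params)    # 9,633,856 parameters
print(conv_params)  # 1,792 parameters
```

The convolutional layer uses over 5,000× fewer parameters, and its count is independent of the image resolution.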

Famous CNN Architectures

  • LeNet-5 (1998): Early CNN for digit recognition
  • AlexNet (2012): Won ImageNet, sparked deep learning revolution
  • VGGNet (2014): Demonstrated power of depth with small filters
  • ResNet (2015): Introduced skip connections, enabling 100+ layer networks
  • Inception/GoogLeNet (2014): Multi-scale processing with inception modules
  • EfficientNet (2019): Optimized scaling of network dimensions

Recurrent Neural Networks (RNNs)

RNNs are designed for sequential data where order matters: text, speech, time series. Unlike feedforward networks, RNNs have loops that allow information to persist across time steps.

How RNNs Work

h_t = f(W_h · h_(t-1) + W_x · x_t + b)

At each time step t, the network produces a hidden state h_t based on the current input x_t and the previous hidden state h_(t-1). This allows the network to maintain a "memory" of previous inputs.

  • h_t: Hidden state at time t
  • x_t: Input at time t
  • W_h, W_x: Weight matrices (shared across all time steps)
  • f: Activation function (typically tanh)
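The recurrence above translates directly into code. Here is a minimal NumPy sketch of a vanilla RNN forward pass; the dimensions and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

# Weight matrices are shared across all time steps (sizes are illustrative)
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """h_t = tanh(W_h @ h_(t-1) + W_x @ x_t + b)"""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Unroll over a sequence of 5 time steps
xs = rng.normal(size=(5, input_size))
h = np.zeros(hidden_size)      # initial hidden state
for x_t in xs:
    h = rnn_step(h, x_t)       # "memory" of earlier inputs carried forward
```

The same `rnn_step` function (with the same weights) is applied at every time step; only the hidden state changes.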

⚠️ The Vanishing Gradient Problem

Standard RNNs struggle to learn long-term dependencies due to vanishing gradients during backpropagation through time. Gradients become exponentially smaller as they propagate backwards, making it difficult to learn connections between distant time steps.
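The effect can be demonstrated numerically. Backpropagation through time multiplies the gradient by the per-step Jacobian, roughly diag(1 − h²) · W_h, once per time step. The sketch below simplifies by holding the hidden state fixed, but it shows the exponential decay:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size = 8
W_h = rng.normal(scale=0.2, size=(hidden_size, hidden_size))  # small weights

grad = np.ones(hidden_size)                  # gradient at the final time step
h = rng.uniform(-0.5, 0.5, hidden_size)      # a representative hidden state
norms = []
for t in range(50):
    # Per-step Jacobian of h_t w.r.t. h_(t-1): tanh derivative times W_h
    J = np.diag(1 - np.tanh(h) ** 2) @ W_h
    grad = J.T @ grad                        # one step backwards in time
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the norm shrinks by many orders of magnitude
```

Because the tanh derivative is at most 1 and the weights here are small, each backward step shrinks the gradient, so the signal from 50 steps in the past is effectively zero. (With large weights the opposite happens: gradients explode.)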

LSTM & GRU: Advanced RNNs

Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were developed to address the vanishing gradient problem:

  • LSTM: Uses gates (forget, input, output) to control information flow and maintain long-term memory
  • GRU: Simplified version of LSTM with fewer parameters, often performs comparably
  • Both can learn dependencies spanning hundreds of time steps
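As a concrete illustration of gating, here is a single GRU step in NumPy. The weight names and sizes are made up for the example, and note that gate-ordering conventions vary slightly between papers and frameworks:

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

rng = np.random.default_rng(2)
n_in, n_h = 4, 6
# One weight set per gate (update z, reset r) plus the candidate state
Wz, Wr, Wc = (rng.normal(scale=0.1, size=(n_h, n_in)) for _ in range(3))
Uz, Ur, Uc = (rng.normal(scale=0.1, size=(n_h, n_h)) for _ in range(3))
bz = br = bc = np.zeros(n_h)

def gru_step(h, x):
    z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate: old vs. new
    r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate: how much history to use
    h_cand = np.tanh(Wc @ x + Uc @ (r * h) + bc)  # candidate state
    return (1 - z) * h + z * h_cand               # interpolate old state and candidate

h = np.zeros(n_h)
for x in rng.normal(size=(10, n_in)):
    h = gru_step(h, x)
```

The key difference from a vanilla RNN is the final line: when the update gate z is near 0, the old state passes through almost unchanged, giving gradients a direct path across many time steps.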

Attention Mechanisms

Attention mechanisms allow networks to focus on relevant parts of the input when producing each output. Rather than compressing all information into a fixed-size vector, attention lets the model dynamically select what to focus on.
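One common formulation, scaled dot-product attention (the form used in Transformers), can be sketched in a few lines of NumPy. The shapes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of each query to each key
    weights = softmax(scores)       # each row sums to 1: a focus distribution
    return weights @ V, weights

rng = np.random.default_rng(3)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys, one per input position
V = rng.normal(size=(5, 8))   # 5 values, one per input position
out, w = attention(Q, K, V)   # out: (2, 8); w: (2, 5) attention weights
```

Each output row is a weighted average of the value vectors, with the weights computed dynamically from query-key similarity; inspecting `w` is exactly the interpretability benefit mentioned below.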

Why Attention Matters

  • Allows networks to process variable-length inputs and outputs
  • Provides interpretability: can visualize what the model attends to
  • Foundation for Transformers, the dominant architecture for NLP
  • Enables parallelization unlike sequential RNNs

Key Takeaways

  • CNNs excel at spatial data through local connectivity and parameter sharing
  • RNNs, LSTMs, and GRUs handle sequential data by maintaining hidden states across time
  • Attention mechanisms enable models to focus on relevant input parts dynamically
  • Architecture choice depends on data structure: images (CNNs), sequences (RNNs/Transformers)