AI Guide for Senior Software Engineers

Large Language Models (LLMs)

Understanding the engineering and science behind models like GPT, Claude, and Gemini that power modern AI applications.

What Makes an LLM "Large"?

Large Language Models are transformer-based neural networks with billions (or trillions) of parameters, trained on vast amounts of text data. Their size and training scale enable emergent capabilities not seen in smaller models.

Scale Milestones (as of November 2025)

The evolution of LLMs continues at a rapid pace. Model details change frequently — refer to provider docs for exact specs.

  • GPT-2 (2019): 1.5B parameters — early large LM milestone
  • GPT-3 (2020): 175B parameters — in-context learning emerged
  • PaLM (2022): 540B parameters — strong reasoning benchmarks
  • GPT-4 (2023): multimodal capabilities; launched with 8K/32K context, later extended to 128K with GPT-4 Turbo
  • GPT-4o (2024): omni-modal (text/vision/audio), 128K context, 2x faster than GPT-4 Turbo
  • Claude 3.5 Sonnet (2024): 200K context, exceptional coding (solved 64% of problems in Anthropic's internal agentic coding evaluation)
  • GPT-5 (Aug 2025): flagship model with improved reasoning, thinking built-in, better coding and agentic capabilities
  • GPT-5.1 (Nov 2025): more conversational, improved personality and steerability
  • Claude Sonnet 4.5 (Sep 2025): state-of-the-art coding, enhanced alignment, supports agentic workflows
  • Gemini 3 Pro (2025): Google's most intelligent model, state-of-the-art multimodal understanding, up to 1M token context
  • Gemini 3 Deep Think (2025): extended reasoning variant for complex problem-solving
  • o3/o4-mini (Apr 2025): OpenAI's advanced reasoning models with chain-of-thought for STEM problems
  • Sora 2 (Sep 2025): physically accurate video generation with synchronized dialogue and sound effects

Training LLMs

Pre-training

Models learn language by predicting the next token on massive text corpora (Common Crawl, books, code, etc.). This requires enormous compute: thousands of GPUs/TPUs running for weeks or months.

  • Data scale: Trillions of tokens (TB to PB of text)
  • Compute: Thousands of A100/H100 GPUs
  • Cost: Millions to tens of millions of dollars
  • Time: Weeks to months of continuous training
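The pre-training objective behind all of this compute is simple: next-token prediction, i.e., minimizing cross-entropy between the model's predicted distribution over the vocabulary and the token that actually came next. A minimal numpy sketch of that loss (the toy vocabulary and logits are illustrative, not from any real model):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: unnormalized scores over the vocabulary for the next position.
    target_id: index of the token that actually came next in the corpus.
    """
    # Softmax with max-subtraction for numerical stability
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[target_id])

# Toy example: a 5-token vocabulary where the model strongly favors token 2
logits = np.array([0.1, 0.2, 3.0, 0.1, 0.3])
loss = next_token_loss(logits, target_id=2)
# Loss is low when the model assigns high probability to the true next token
```

Pre-training is this loss, averaged over trillions of token positions, driving gradient updates across thousands of accelerators.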

Fine-tuning

After pre-training, models are adapted for specific tasks or behaviors:

  • Instruction tuning: Teach model to follow instructions
  • RLHF: Reinforcement Learning from Human Feedback for alignment
  • Task-specific: Adapt for domain-specific applications
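Concretely, instruction tuning is supervised learning on curated conversations. A sketch of one training record, using an OpenAI-style messages schema as an illustrative assumption (exact formats vary by provider):

```python
# One supervised fine-tuning (SFT) record in a typical chat format.
# The schema here mirrors the common "messages" layout; real pipelines
# differ in field names and tokenization details.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Reverse a string in Python."},
        {"role": "assistant", "content": "Use slicing: s[::-1]"},
    ]
}

# During instruction tuning, the loss is typically masked so it applies
# only to assistant turns: the model learns to produce responses,
# not to regurgitate prompts.
assistant_turns = [m for m in sft_example["messages"] if m["role"] == "assistant"]
```

RLHF then goes a step further: instead of imitating fixed responses, the model is optimized against a reward model trained on human preference comparisons.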

Emergent Capabilities

As models scale, they develop abilities not explicitly programmed or trained for. These emerge from the combination of scale, architecture, and training data.

In-Context Learning

Learn new tasks from examples in the prompt, without parameter updates
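In practice this means the "training examples" live entirely in the prompt. A sketch of assembling a few-shot prompt (the helper and format are illustrative, not a standard API):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then
    the query the model should complete in the same pattern."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify sentiment as positive or negative.",
    [("Great product!", "positive"), ("Total waste of money.", "negative")],
    "Works exactly as advertised.",
)
# A capable LLM continues the pattern with a sentiment label — no
# parameter updates, no fine-tuning; the task is inferred from context.
```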

Advanced Reasoning

Chain-of-thought reasoning built into models like GPT-5, o3/o4, and Gemini 3 Deep Think

Native Multimodality

Process and generate text, images, audio, and video seamlessly (Gemini 3, GPT-5, Sora 2)

Extended Context Windows

Up to 1M+ tokens (Gemini 3) for entire codebases, books, or long conversations

Engineering LLM Systems

Inference Optimization

  • Quantization: Reduce precision (FP16, INT8) to save memory and speed up inference
  • KV caching: Cache key-value pairs to avoid recomputation
  • Flash Attention: Optimized attention implementation
  • Model sharding: Split model across multiple GPUs (tensor/pipeline parallelism)
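To see why KV caching matters, consider autoregressive decoding: each new token must attend over all previous tokens, but the keys and values of past tokens never change. A toy single-head sketch (random vectors stand in for real projections):

```python
import numpy as np

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query
    over the cached keys/values."""
    scores = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Autoregressive decoding with a KV cache: each step appends one
# key/value pair instead of recomputing projections for every past token.
d = 8
k_cache, v_cache = [], []
rng = np.random.default_rng(0)
for step in range(5):
    q = rng.standard_normal(d)               # current token's query
    k_cache.append(rng.standard_normal(d))   # current token's key
    v_cache.append(rng.standard_normal(d))   # current token's value
    out = attend(q, np.stack(k_cache), np.stack(v_cache))
# Per-step cost is O(current_length) with the cache; without it, every
# step would redo work for the whole prefix, giving O(length^2) decoding.
```

This is also why long-context inference is memory-hungry: the cache grows linearly with sequence length, per layer and per head.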

Prompt Engineering

The art and science of crafting prompts to elicit desired behaviors:

  • Zero-shot, few-shot, and chain-of-thought prompting
  • System messages and role-playing
  • Temperature and sampling strategies
  • Context window management
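Of these, temperature is the most commonly misunderstood knob: it rescales logits before sampling, trading determinism for diversity. A minimal sketch of temperature sampling (the logits are illustrative):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token id from logits; temperature reshapes the distribution.

    temperature < 1 sharpens it (more deterministic);
    temperature > 1 flattens it (more diverse, more surprising).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]
# As temperature approaches zero, sampling approaches greedy argmax decoding
greedy_like = sample_token(logits, temperature=0.01)
```

Top-p and top-k sampling work on the same distribution, truncating it before sampling rather than rescaling it.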

LLM Architectures

GPT-5 / GPT-5.1

OpenAI's flagship. Thinking built-in, exceptional coding, improved steerability. Released Aug/Nov 2025.

Claude Sonnet 4.5

Anthropic's most aligned model. State-of-the-art coding, reasoning, and computer use. Sep 2025.

Gemini 3 Pro

Google's most intelligent model. Up to 1M token context. State-of-the-art multimodal understanding. 2025.

o3 / o4-mini

OpenAI's advanced reasoning models with full tool access. Chain-of-thought for STEM. Apr 2025.

Challenges & Limitations

  • Hallucinations: Models can confidently generate false information (improving with reasoning models)
  • Context limits: Now 128K-1M+ tokens, substantially improved but still finite
  • Computational cost: Expensive to train and run, especially with extended thinking or long-context inference
  • Biases: Reflect and amplify biases in training data despite alignment efforts
  • Grounding: Improving with tool use and search, but still require RAG for real-time info
  • Reasoning gaps: While reasoning models excel at STEM, complex multi-step planning remains challenging

Key Takeaways

  • LLMs are transformer models scaled to billions (or trillions) of parameters
  • Pre-training on massive data enables emergent capabilities
  • Fine-tuning and RLHF align models with human preferences
  • Engineering systems around LLMs requires optimization and careful prompting
  • LLMs have significant limitations and biases to be aware of