Large Language Models (LLMs)
Understanding the engineering and science behind models like GPT, Claude, and Gemini that power modern AI applications.
What Makes an LLM "Large"?
Large Language Models are transformer-based neural networks with billions (or trillions) of parameters, trained on vast amounts of text data. Their size and training scale enable emergent capabilities not seen in smaller models.
Scale Milestones (as of November 2025)
The evolution of LLMs continues at a rapid pace. Model details change frequently; refer to provider docs for exact specs.
- GPT-2 (2019): 1.5B parameters — early large LM milestone
- GPT-3 (2020): 175B parameters — in-context learning emerged
- PaLM (2022): 540B parameters — strong reasoning benchmarks
- GPT-4 (2023): multimodal capabilities, 128K context
- GPT-4o (2024): omni-modal (text/vision/audio), 128K context, 2x faster than GPT-4
- Claude 3.5 Sonnet (2024): 200K context, exceptional performance on agentic coding benchmarks
- GPT-5 (Aug 2025): flagship model with built-in reasoning ("thinking"), improved coding and agentic capabilities
- GPT-5.1 (Nov 2025): more conversational, improved personality and steerability
- Claude Sonnet 4.5 (Sep 2025): state-of-the-art coding, enhanced alignment, supports agentic workflows
- Gemini 3 Pro (2025): Google's most intelligent model, state-of-the-art multimodal understanding, up to 1M token context
- Gemini 3 Deep Think (2025): extended reasoning variant for complex problem-solving
- o3/o4-mini (Apr 2025): OpenAI's advanced reasoning models with chain-of-thought for STEM problems
- Sora 2 (Sep 2025): physically accurate video generation with synchronized dialogue and sound effects
Training LLMs
Pre-training
Models learn language by predicting the next token on massive text corpora (Common Crawl, books, code, etc.). This requires enormous compute: thousands of GPUs/TPUs running for weeks or months.
- Data scale: Trillions of tokens (TB to PB of text)
- Compute: Thousands of A100/H100 GPUs
- Cost: Millions to tens of millions of dollars
- Time: Weeks to months of continuous training
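The next-token objective above can be sketched in a few lines. This is a minimal illustration of how a token stream becomes (context, next-token) training pairs; real pipelines use subword tokenizers, context windows of thousands of tokens, and streaming data loaders, and the token IDs here are made up.

```python
# Minimal sketch of the next-token prediction objective used in pre-training.
# A window slides over the corpus; each context predicts the token after it.

def make_training_pairs(token_ids, context_len):
    """Turn a flat token stream into (context, next-token) pairs."""
    pairs = []
    for i in range(len(token_ids) - context_len):
        context = token_ids[i:i + context_len]
        target = token_ids[i + context_len]
        pairs.append((context, target))
    return pairs

corpus = [5, 12, 9, 3, 12, 9, 7]  # toy token-ID stream, not real tokenizer output
pairs = make_training_pairs(corpus, context_len=3)
print(pairs[0])  # ([5, 12, 9], 3)
```

The model is then trained to maximize the probability of each target token given its context, which is what "predicting the next token" means at scale.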
Fine-tuning
After pre-training, models are adapted for specific tasks or behaviors:
- Instruction tuning: Teach model to follow instructions
- RLHF: Reinforcement Learning from Human Feedback for alignment
- Task-specific: Adapt for domain-specific applications
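A common mechanic behind instruction tuning is loss masking: the prompt tokens are included as input but excluded from the loss, so the model is only trained on the assistant's reply. The sketch below is illustrative; the chat template, the stand-in tokenizer, and the helper name are assumptions, though the `-100` ignore-label convention is widely used in practice.

```python
# Sketch of instruction-tuning data preparation with loss masking.
# Only the response tokens carry labels; prompt tokens get the ignore
# label -100 (a common convention) so no loss is computed on them.

def build_example(instruction, response, tokenize):
    prompt_ids = tokenize(f"User: {instruction}\nAssistant: ")
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids
    return input_ids, labels

# Stand-in tokenizer for the sketch; real systems use subword tokenizers.
fake_tokenize = lambda s: [ord(c) % 101 for c in s]
ids, labels = build_example("Say hi", "Hello!", fake_tokenize)
```

During training, positions labeled -100 are skipped by the loss function, so gradient updates come only from the response the model should learn to produce.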
Emergent Capabilities
As models scale, they develop abilities not explicitly programmed or trained for. These emerge from the combination of scale, architecture, and training data.
In-Context Learning
Learn new tasks from examples in the prompt, without parameter updates
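In-context learning is driven purely by how the prompt is built. A minimal sketch of few-shot prompt construction (the format and helper name are illustrative, not any provider's API):

```python
# Few-shot prompting: the "training" happens entirely inside the prompt.
# No weights change; the model infers the task from the demonstrations.

def few_shot_prompt(examples, query):
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")  # leave the answer for the model
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("cat", "CAT"), ("dog", "DOG")],  # demonstrations of the task
    "bird",                            # the model should infer "BIRD"
)
print(prompt)
```

Given two demonstrations of upper-casing, a capable model completes the final line with "BIRD" despite never being fine-tuned on that task.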
Advanced Reasoning
Chain-of-thought reasoning built into models like GPT-5, o3/o4-mini, and Gemini 3 Deep Think
Native Multimodality
Process and generate text, images, audio, and video seamlessly (Gemini 3, GPT-5, Sora 2)
Extended Context Windows
Up to 1M+ tokens (Gemini 3) for entire codebases, books, or long conversations
Engineering LLM Systems
Inference Optimization
- Quantization: Reduce precision (FP16, INT8) to save memory and speed up inference
- KV caching: Cache key-value pairs to avoid recomputation
- Flash Attention: Optimized attention implementation
- Model sharding: Split model across multiple GPUs (tensor/pipeline parallelism)
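The quantization idea above can be shown concretely. This is a toy sketch of symmetric per-tensor INT8 quantization; production systems typically quantize per-channel or per-group and handle activations and outliers separately.

```python
import numpy as np

# Symmetric INT8 weight quantization: store int8 values plus one FP scale,
# dequantize on the fly. Cuts weight memory ~4x vs FP32 at some precision cost.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.4, -1.2, 0.05, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Rounding error per weight is at most scale/2
```

The trade-off is visible in `w_hat`: each reconstructed weight is within half a quantization step of the original, which is usually tolerable for inference.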
Prompt Engineering
The art and science of crafting prompts to elicit desired behaviors:
- Zero-shot, few-shot, and chain-of-thought prompting
- System messages and role-playing
- Temperature and sampling strategies
- Context window management
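Temperature, mentioned above, rescales the model's output logits before sampling. A self-contained sketch (the logits are toy values; real vocabularies have tens of thousands of entries):

```python
import math
import random

# Temperature-scaled sampling over next-token logits. Low temperature
# sharpens the distribution toward the argmax; high temperature flattens
# it toward uniform, producing more varied output.

def sample(logits, temperature=1.0, rng=random.random):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
# temperature -> 0 approaches greedy decoding; large values approach uniform
```

Greedy decoding (temperature near 0), nucleus/top-p, and top-k sampling are all variations on how this distribution is truncated or sharpened before drawing a token.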
LLM Architectures
GPT-5 / GPT-5.1
OpenAI's flagship models. Built-in reasoning ("thinking"), exceptional coding, improved steerability. Released Aug/Nov 2025.
Claude Sonnet 4.5
Anthropic's most aligned model. State-of-the-art coding, reasoning, and computer use. Sep 2025.
Gemini 3 Pro
Google's most intelligent model. Up to 1M token context. State-of-the-art multimodal understanding. 2025.
o3 / o4-mini
OpenAI's advanced reasoning models with full tool access. Chain-of-thought for STEM. Apr 2025.
Challenges & Limitations
- Hallucinations: Models can confidently generate false information (improving with reasoning models)
- Context limits: Now 128K-1M+ tokens, substantially improved but still finite
- Computational cost: Expensive to train and run, especially with extended thinking or long-context inference
- Biases: Reflect and amplify biases in training data despite alignment efforts
- Grounding: Improving with tool use and search, but still require RAG for real-time info
- Reasoning gaps: While reasoning models excel at STEM, complex multi-step planning remains challenging
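The grounding point above is why RAG pipelines exist: retrieve relevant passages and put them in the prompt so the model answers from supplied context instead of stale parametric memory. The sketch below uses a toy keyword-overlap scorer in place of a real embedding retriever, and all names are illustrative.

```python
# Toy retrieval-augmented generation (RAG) sketch: rank documents by
# keyword overlap with the query, then build a grounded prompt. Real
# systems use embedding similarity over a vector index instead.

def retrieve(query, docs, k=2):
    query_words = set(query.lower().split())
    def score(doc):
        return len(query_words & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]

def grounded_prompt(query, docs):
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The capital of France is Paris.",
    "Bananas are yellow.",
]
print(grounded_prompt("What is the capital of France?", docs))
```

The model then answers from the retrieved context, which can be refreshed continuously, sidestepping the training-data cutoff.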
Key Takeaways
- LLMs are transformer models scaled to billions of parameters
- Pre-training on massive data enables emergent capabilities
- Fine-tuning and RLHF align models with human preferences
- Engineering systems around LLMs requires optimization and careful prompting
- LLMs have significant limitations and biases to be aware of