AI Guide for Senior Software Engineers

Large Language Models (LLMs)

Understanding the engineering and science behind models like GPT, Claude, and Gemini that power modern AI applications.

What Makes an LLM "Large"?

Large Language Models are transformer-based neural networks with billions (or trillions) of parameters, trained on vast amounts of text data. Their size and training scale enable emergent capabilities not seen in smaller models.

Scale Milestones (as of November 2025)

The evolution of LLMs continues at a rapid pace. Model details change frequently — refer to provider docs for exact specs.

  • GPT-2 (2019): 1.5B parameters — early large LM milestone
  • GPT-3 (2020): 175B parameters — in-context learning emerged
  • PaLM (2022): 540B parameters — strong reasoning benchmarks
  • GPT-4 (2023): multimodal capabilities; launched with 8K/32K context, later extended to 128K with GPT-4 Turbo
  • GPT-4o (2024): omni-modal (text/vision/audio), 128K context, 2x faster than GPT-4 Turbo
  • Claude 3.5 Sonnet (2024): 200K context, exceptional coding (solved 64% of problems in Anthropic's internal agentic coding evaluation)
  • GPT-5 (Aug 2025): flagship model with improved reasoning, thinking built-in, better coding and agentic capabilities
  • GPT-5.1 (Nov 2025): more conversational, improved personality and steerability
  • Claude Sonnet 4.5 (Sep 2025): state-of-the-art coding, enhanced alignment, supports agentic workflows
  • Gemini 3 Pro (2025): Google's most intelligent model, state-of-the-art multimodal understanding, up to 1M token context
  • Gemini 3 Deep Think (2025): extended reasoning variant for complex problem-solving
  • o3/o4-mini (Apr 2025): OpenAI's advanced reasoning models with chain-of-thought for STEM problems
  • Sora 2 (Sep 2025): physically accurate video generation with synchronized dialogue and sound effects

Training LLMs

Pre-training

Models learn language by predicting the next token on massive text corpora (Common Crawl, books, code, etc.). This requires enormous compute: thousands of GPUs/TPUs running for weeks or months.

  • Data scale: Trillions of tokens (TB to PB of text)
  • Compute: Thousands of A100/H100 GPUs
  • Cost: Millions to tens of millions of dollars
  • Time: Weeks to months of continuous training
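The pre-training objective behind all of this compute is simple: next-token prediction, i.e., minimizing cross-entropy between the model's predicted distribution over the vocabulary and the token that actually came next. A minimal numpy sketch of that loss (the toy vocabulary and logits are illustrative, not from any real model):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: unnormalized scores over the vocabulary for the next position.
    target_id: index of the token that actually came next in the corpus.
    """
    # Softmax with max-subtraction for numerical stability
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[target_id])

# Toy example: a 5-token vocabulary where the model strongly favors token 2
logits = np.array([0.1, 0.2, 3.0, 0.1, 0.3])
loss = next_token_loss(logits, target_id=2)
# Loss is low when the model assigns high probability to the true next token
```

Pre-training is this loss, averaged over trillions of token positions, driving gradient updates across thousands of accelerators.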

Fine-tuning

After pre-training, models are adapted for specific tasks or behaviors:

  • Instruction tuning: Teach model to follow instructions
  • RLHF: Reinforcement Learning from Human Feedback for alignment
  • Task-specific: Adapt for domain-specific applications
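Concretely, instruction tuning is supervised learning on curated conversations. A sketch of one training record, using an OpenAI-style messages schema as an illustrative assumption (exact formats vary by provider):

```python
# One supervised fine-tuning (SFT) record in a typical chat format.
# The schema here mirrors the common "messages" layout; real pipelines
# differ in field names and tokenization details.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Reverse a string in Python."},
        {"role": "assistant", "content": "Use slicing: s[::-1]"},
    ]
}

# During instruction tuning, the loss is typically masked so it applies
# only to assistant turns: the model learns to produce responses,
# not to regurgitate prompts.
assistant_turns = [m for m in sft_example["messages"] if m["role"] == "assistant"]
```

RLHF then goes a step further: instead of imitating fixed responses, the model is optimized against a reward model trained on human preference comparisons.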

Emergent Capabilities

As models scale, they develop abilities not explicitly programmed or trained for. These emerge from the combination of scale, architecture, and training data.

In-Context Learning

Learn new tasks from examples in the prompt, without parameter updates
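In practice this means the "training examples" live entirely in the prompt. A sketch of assembling a few-shot prompt (the helper and format are illustrative, not a standard API):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then
    the query the model should complete in the same pattern."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify sentiment as positive or negative.",
    [("Great product!", "positive"), ("Total waste of money.", "negative")],
    "Works exactly as advertised.",
)
# A capable LLM continues the pattern with a sentiment label — no
# parameter updates, no fine-tuning; the task is inferred from context.
```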

Advanced Reasoning

Chain-of-thought reasoning built into models like GPT-5, o3/o4, and Gemini 3 Deep Think

Native Multimodality

Process and generate text, images, audio, and video seamlessly (Gemini 3, GPT-5, Sora 2)

Extended Context Windows

Up to 1M+ tokens (Gemini 3) for entire codebases, books, or long conversations

Engineering LLM Systems

Inference Optimization

  • Quantization: Reduce precision (FP16, INT8) to save memory and speed up inference
  • KV caching: Cache key-value pairs to avoid recomputation
  • Flash Attention: Optimized attention implementation
  • Model sharding: Split model across multiple GPUs (tensor/pipeline parallelism)
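To see why KV caching matters, consider autoregressive decoding: each new token must attend over all previous tokens, but the keys and values of past tokens never change. A toy single-head sketch (random vectors stand in for real projections):

```python
import numpy as np

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query
    over the cached keys/values."""
    scores = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Autoregressive decoding with a KV cache: each step appends one
# key/value pair instead of recomputing projections for every past token.
d = 8
k_cache, v_cache = [], []
rng = np.random.default_rng(0)
for step in range(5):
    q = rng.standard_normal(d)               # current token's query
    k_cache.append(rng.standard_normal(d))   # current token's key
    v_cache.append(rng.standard_normal(d))   # current token's value
    out = attend(q, np.stack(k_cache), np.stack(v_cache))
# Per-step cost is O(current_length) with the cache; without it, every
# step would redo work for the whole prefix, giving O(length^2) decoding.
```

This is also why long-context inference is memory-hungry: the cache grows linearly with sequence length, per layer and per head.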

Prompt Engineering

The art and science of crafting prompts to elicit desired behaviors:

  • Zero-shot, few-shot, and chain-of-thought prompting
  • System messages and role-playing
  • Temperature and sampling strategies
  • Context window management
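Of these, temperature is the most commonly misunderstood knob: it rescales logits before sampling, trading determinism for diversity. A minimal sketch of temperature sampling (the logits are illustrative):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token id from logits; temperature reshapes the distribution.

    temperature < 1 sharpens it (more deterministic);
    temperature > 1 flattens it (more diverse, more surprising).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]
# As temperature approaches zero, sampling approaches greedy argmax decoding
greedy_like = sample_token(logits, temperature=0.01)
```

Top-p and top-k sampling work on the same distribution, truncating it before sampling rather than rescaling it.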

LLM Architectures

GPT-5 / GPT-5.1

OpenAI's flagship. Thinking built-in, exceptional coding, improved steerability. Released Aug/Nov 2025.

Claude Sonnet 4.5

Anthropic's most aligned model. State-of-the-art coding, reasoning, and computer use. Sep 2025.

Gemini 3 Pro

Google's most intelligent model. Up to 1M token context. State-of-the-art multimodal understanding. 2025.

o3 / o4-mini

OpenAI's advanced reasoning models with full tool access. Chain-of-thought for STEM. Apr 2025.

Challenges & Limitations

  • Hallucinations: Models can confidently generate false information (improving with reasoning models)
  • Context limits: Now 128K-1M+ tokens, substantially improved but still finite
  • Computational cost: Expensive to train and run, especially with extended thinking or long-context inference
  • Biases: Reflect and amplify biases in training data despite alignment efforts
  • Grounding: Improving with tool use and search, but still require RAG for real-time info
  • Reasoning gaps: While reasoning models excel at STEM, complex multi-step planning remains challenging

Key Takeaways

  • LLMs are transformer models scaled to billions (or trillions) of parameters
  • Pre-training on massive data enables emergent capabilities
  • Fine-tuning and RLHF align models with human preferences
  • Engineering systems around LLMs requires optimization and careful prompting
  • LLMs have significant limitations and biases to be aware of