🏗️ LLM Architecture Deep Dive: Understanding Modern Language Models
A comprehensive exploration of how modern Large Language Models are built, from the transformer foundation to advanced architectural innovations driving today's most capable AI systems.
Table of Contents
- Foundation: The Transformer
- Attention Mechanism
- Modern Variations
- Training & Optimization
- Scaling Laws
- Future Architectures
🏗️ Foundation: The Transformer
The Transformer Block
Transformers are built from repeating blocks containing:
Input Sequence (Embeddings)
        ↓
[Multi-head Self-Attention]
 ├─ Parallel attention heads
 ├─ Different representation subspaces
 └─ Combined outputs
        ↓
[Add & Norm] (Residual connection)
        ↓
[Feed-Forward Network]
 ├─ Dense layer 1 (expand)
 │    Dimension: d_model → 4*d_model
 ├─ ReLU/GELU activation
 └─ Dense layer 2 (contract)
      Dimension: 4*d_model → d_model
        ↓
[Add & Norm] (Residual connection)
        ↓
Output (same shape as input)
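The flow above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: it uses a single attention head, a toy d_model, and random weights, but it follows the same attention → add & norm → FFN → add & norm sequence as the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy size; real models use 4096+

def layer_norm(x, eps=1e-5):
    # learned scale/shift parameters omitted for brevity
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # single head for brevity; multi-head splits d_model across heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V        # softmax(QK^T/sqrt(d)) V

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, p):
    # Self-attention, then residual add & norm
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # FFN: expand d_model -> 4*d_model, activate, contract back
    h = gelu(x @ p["W1"]) @ p["W2"]
    return layer_norm(x + h)                         # second residual add & norm

shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
          "Wv": (d_model, d_model), "W1": (d_model, 4 * d_model),
          "W2": (4 * d_model, d_model)}
p = {k: rng.normal(0, 0.1, s) for k, s in shapes.items()}

x = rng.normal(size=(5, d_model))   # a sequence of 5 token embeddings
out = transformer_block(x, p)
print(out.shape)                    # (5, 8): same shape as the input
```

Because every sub-layer preserves the (sequence, d_model) shape, blocks like this can be stacked to arbitrary depth.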
Positional Encoding
Since attention has no inherent sense of position, we add positional information:
Sinusoidal Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Benefits:
- No learned parameters (fully deterministic)
- Consistent across sequences
- Can extrapolate to positions beyond the training length
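A direct NumPy implementation of the formulas above (max_len=16 and d_model=8 are arbitrary illustration values):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None]   # the 2i values: 0, 2, 4, ...
    angle = pos / 10000 ** (two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angle)              # PE(pos, 2i+1) = cos(...)
    return pe

pe = sinusoidal_pe(16, 8)
print(pe[0])   # position 0: sin(0)=0 at even dims, cos(0)=1 at odd dims
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique, smoothly varying code.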
The Full Transformer Stack
┌─────────────────────────────────┐
│ Input Embeddings                │
│ + Positional Encoding           │
├─────────────────────────────────┤
│ Encoder Block 1 (Self-Att)      │
├─────────────────────────────────┤
│ Encoder Block 2 (Self-Att)      │
├─────────────────────────────────┤
│ ...                             │
├─────────────────────────────────┤
│ Encoder Block N (Self-Att)      │
├─────────────────────────────────┤
│ Decoder Block 1 (Att + X-Att)   │
├─────────────────────────────────┤
│ Decoder Block 2 (Att + X-Att)   │
├─────────────────────────────────┤
│ ...                             │
├─────────────────────────────────┤
│ Decoder Block N (Att + X-Att)   │
├─────────────────────────────────┤
│ Output Layer (Softmax)          │
└─────────────────────────────────┘
🧠 Attention Mechanism
Scaled Dot-Product Attention
The core of transformers:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
Where:
- Q (Query): What we're looking for
- K (Key): What information is available
- V (Value): What information to retrieve
- d_k: Dimension of key vectors (for scaling)
- √d_k: Scaling prevents the dot products from growing too large, which would saturate the softmax and shrink gradients
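The formula transcribes directly into NumPy; the shapes below (4 queries/keys, d_k = 16) are toy values for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaling keeps logits in softmax's useful range
    weights = softmax(scores)        # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 16)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))     # (4, 16), each row of weights sums to 1
```

Each output row is a weighted average of the value vectors, with weights determined by query–key similarity.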
Multi-Head Attention
Use multiple attention heads in parallel:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)·W^O
Where each head uses:
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Advantages:
- Different heads attend to different aspects
- Semantic vs. syntactic vs. positional
- Improved gradient flow during training
Cross-Attention
Attention from decoder to encoder outputs:
Decoder attends to Encoder:
Attention(Q_decoder, K_encoder, V_encoder)
Enables:
- Seq2Seq translation
- Information flow
- Encoder-decoder coupling
🚀 Modern Variations
Causal/Masked Attention
For autoregressive decoding:
Token 1 sees: [Token 1]
Token 2 sees: [Token 1, Token 2]
Token 3 sees: [Token 1, Token 2, Token 3]
Mask prevents looking at future tokens:
Attention_matrix[i, j] = -∞ for j > i
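Building and applying the mask in NumPy (the all-zero scores here are just for illustration; in practice the mask is added to Q·K^T before the softmax):

```python
import numpy as np

def causal_mask(n):
    # -inf strictly above the diagonal blocks attention to future positions
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)   # pretend all raw scores are equal
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)    # softmax row by row
print(weights.round(2))
# Row i spreads its attention uniformly over tokens 0..i
# and assigns exactly 0 weight to every future token.
```

exp(-∞) = 0, so masked positions contribute nothing and each row still sums to 1.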
Rotary Position Embeddings (RoPE)
Encodes position in rotation:
Traditional: Add position to embedding
RoPE: Rotate query/key vectors
- Encodes relative positions
- Better extrapolation
- Used in GPT-NeoX, LLaMA, and Qwen
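A simplified NumPy sketch of the rotation for a single vector (real implementations vectorize over whole sequences and apply this to both queries and keys; the pairing and frequency schedule follow the standard RoPE formulation):

```python
import numpy as np

def rope(x, pos):
    """Rotate consecutive (even, odd) pairs of x by position-scaled angles."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * cos - x[1::2] * sin      # 2-D rotation of each pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The attention score q·k after RoPE depends only on the relative
# offset (here 4 in both cases), not on absolute positions:
print(np.allclose(rope(q, 3) @ rope(k, 7), rope(q, 10) @ rope(k, 14)))  # True
```

This relative-position property is why RoPE models generalize better to positions beyond those seen in training.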
Grouped Query Attention (GQA)
Efficient attention variant:
Multi-Head Attention:
- h heads each with unique K, V
Grouped Query:
- Groups of Q heads share a single K,V head
- Shrinks the KV cache, cutting memory and bandwidth
- Maintains near-baseline quality
Multi-Query (Extreme GQA):
- All Q heads share one K,V pair
- Maximum efficiency
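The sharing pattern can be illustrated in NumPy by broadcasting each K,V head across its group of query heads. The head counts and dimensions below are arbitrary illustration values:

```python
import numpy as np

# 8 query heads sharing 2 KV heads -> groups of 4
n, d_head, n_q_heads, n_kv_heads = 6, 16, 8, 2
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, n, d_head))
K = rng.normal(size=(n_kv_heads, n, d_head))
V = rng.normal(size=(n_kv_heads, n, d_head))

# Broadcast each KV head to its group of query heads
group = n_q_heads // n_kv_heads
K_exp = np.repeat(K, group, axis=0)     # (8, n, d_head)
V_exp = np.repeat(V, group, axis=0)

scores = Q @ K_exp.transpose(0, 2, 1) / np.sqrt(d_head)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ V_exp
print(out.shape)   # (8, 6, 16): full query-head count, but only 2 KV heads cached
```

Only K and V need to be cached during generation, so storing 2 KV heads instead of 8 cuts that cache by 4x.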
Flash Attention
Algorithm optimization:
Standard Attention: Reads entire matrices from memory
└─ O(N²) memory bandwidth
Flash Attention:
- Block-wise computation
- Minimizes memory I/O
- 2-3x speedup
- No approximation loss
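The core trick, an online (streaming) softmax over key/value blocks, can be demonstrated in NumPy. This toy version shows why the result is exact without materializing the full N×N score matrix, but it has none of the GPU memory-hierarchy engineering that gives the real kernel its speed:

```python
import numpy as np

def flash_attention(Q, K, V, block=32):
    """Block-wise attention with a running softmax; never forms the N x N matrix."""
    n, d = Q.shape
    m = np.full((n, 1), -np.inf)   # running row-wise max
    l = np.zeros((n, 1))           # running softmax denominator
    acc = np.zeros((n, d))         # running weighted sum of V
    for s0 in range(0, K.shape[0], block):
        Kb, Vb = K[s0:s0 + block], V[s0:s0 + block]
        s = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)
        corr = np.exp(m - m_new)   # rescale statistics from earlier blocks
        l = l * corr + p.sum(axis=1, keepdims=True)
        acc = acc * corr + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 16)) for _ in range(3))

# Reference: standard full-matrix attention
s = Q @ K.T / np.sqrt(16)
p = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (p / p.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(flash_attention(Q, K, V, block=16), ref))  # True
```

Because the running max and denominator are rescaled as each block arrives, the final output matches standard attention to floating-point precision.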
📈 Training & Optimization
Pre-training Objectives
Causal Language Modeling
Predict next token given context
Loss = -log P(token_t | token_1 ... token_{t-1})
Used by: GPT family
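The loss above in a few lines of NumPy. The logits are assumed to come from the model, one row per position; the uniform-logits case below shows the ln(vocab) baseline an untrained model should sit near:

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean -log P(token_t | tokens_<t), given per-position next-token logits."""
    z = logits - logits.max(-1, keepdims=True)              # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Uniform logits over a 10-token vocab: loss is exactly ln(10)
loss = causal_lm_loss(np.zeros((4, 10)), np.array([3, 1, 4, 1]))
print(round(loss, 4))   # 2.3026
```

During pre-training this quantity is minimized over every position of every sequence in the corpus.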
Masked Language Modeling
Randomly mask tokens, predict them
Loss = -log P(masked_tokens | unmasked)
Used by: BERT, RoBERTa
Contrastive Learning
Maximize similarity between related pairs
Minimize similarity between unrelated pairs
Used by: Embedding models
Optimization Techniques
Layer Normalization
Normalizes hidden states per sample
- Speeds up convergence
- Reduces internal covariate shift
- Applied before or after attention/FFN
Gradient Checkpointing
Trade compute for memory
- Store activations at checkpoints
- Recompute others in backward pass
- Enables training larger models
Mixed Precision Training
Compute in float16/bfloat16 while keeping float32 master weights
- Reduces memory by ~2x
- Maintains accuracy
- Better hardware utilization
📊 Scaling Laws
The Scaling Laws
Empirically observed relationships:
Loss ≈ a * N^(-α) + b * D^(-β)
Where:
- N: Model size (parameters)
- D: Dataset size (tokens)
- α, β: Scaling exponents (empirically around 0.05–0.1)
Key finding: Similar diminishing returns for model & data
Implications
Optimal Allocation:
- Scale parameters and training tokens together, in roughly equal proportion
- An oversized model starved of data underperforms a smaller, better-trained one
- For a fixed model, more training tokens keep helping (with diminishing returns)
Compute Budget:
- For a fixed budget, the compute-optimal model is often smaller than the ones commonly trained
- But inference cost matters: a smaller model trained longer is cheaper to serve
- The practical choice trades off size against training duration
Chinchilla Optimal
The "Chinchilla" compute-optimal frontier:
For a training budget of C FLOPs (using the approximation C ≈ 6·N·D):
Optimal D ≈ 20·N (about 20 training tokens per parameter)
Example:
- A 20B-parameter model is compute-optimal at roughly 400B tokens
- Chinchilla itself: 70B parameters trained on 1.4T tokens
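A back-of-envelope sizing helper based on these relationships. Note that C ≈ 6·N·D and ~20 tokens per parameter are rules of thumb from the Chinchilla analysis, not exact constants:

```python
def chinchilla_optimal(C):
    """Compute-optimal (params, tokens) for a training budget of C FLOPs,
    assuming C = 6*N*D and D = 20*N, so N = sqrt(C / 120)."""
    N = (C / 120) ** 0.5
    D = 20 * N
    return N, D

C = 5.9e23                      # roughly Chinchilla's training budget
N, D = chinchilla_optimal(C)
print(f"{N:.2e} params, {D:.2e} tokens")
# 7.01e+10 params, 1.40e+12 tokens -- close to Chinchilla's actual 70B / 1.4T
```

Doubling compute thus grows both the model and the dataset by about √2, not just one of them.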
🔮 Future Architectures
State Space Models (SSMs)
Alternative to attention:
Selective SSM (Mamba):
├─ Linear complexity in sequence length
├─ Competitive performance with transformers
├─ Great for long sequences
└─ Hardware-aware implementation for fast training and inference
Architecture:
Input → SSM layers (not attention) → Output
Mixture of Experts (MoE)
Sparse architecture:
Input
  ↓
[Router: Assign to experts]
 ├─ Expert 1 (FFN)
 ├─ Expert 2 (FFN)
 ├─ Expert 3 (FFN)
 └─ Expert N (FFN)
  ↓
[Combine expert outputs]
Advantages:
- Activate only relevant experts
- More parameters, same compute
- Examples: Google's Switch Transformer, Mixtral
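A toy top-k router in NumPy to illustrate the control flow; the expert count, dimensions, and random weights are placeholders, and real MoE layers route whole batches of tokens with load-balancing losses:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# One tiny ReLU FFN per expert; the router is a linear layer + softmax.
experts = [(rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d)))
           for _ in range(n_experts)]
W_router = rng.normal(0, 0.1, (d, n_experts))

def moe_layer(x):
    logits = x @ W_router
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                          # router distribution over experts
    chosen = np.argsort(probs)[-top_k:]          # keep only the top-k experts
    gates = probs[chosen] / probs[chosen].sum()  # renormalize their gate weights
    out = np.zeros(d)
    for g, i in zip(gates, chosen):              # only k experts actually run
        W1, W2 = experts[i]
        out += g * (np.maximum(x @ W1, 0) @ W2)
    return out

x = rng.normal(size=d)                           # one token's hidden state
print(moe_layer(x).shape)                        # (8,)
```

With top_k=2 of 4 experts, only half the FFN parameters are touched per token, which is the "more parameters, same compute" effect.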
Retrieval-Augmented Models
Hybrid approach:
Query + Context → Model
 ├─ Retrieve relevant documents
 ├─ Include in context
 └─ Generate answer
Benefits:
- External knowledge
- Up-to-date information
- Reduced hallucination
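The retrieval step is typically a nearest-neighbor search over document embeddings. A minimal cosine-similarity sketch, with random vectors standing in for a real embedding model:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = docs @ q                      # cosine similarity per document
    return np.argsort(sims)[::-1][:k]    # best-first

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 8))                # pretend document embeddings
query = doc_vecs[3] + 0.01 * rng.normal(size=8)   # query nearly identical to doc 3
top = retrieve(query, doc_vecs)
print(top[0])   # 3: the nearest document is retrieved first
```

The retrieved texts are then prepended to the prompt so the model can condition its answer on them.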
Multimodal Architectures
Unified processing:
Text  ──┐
        ├──[Shared Transformers]── Output
Image ──┤
Audio ──┘
Cross-modal attention enables transfer learning
📊 Model Comparison
| Architecture | Speed | Quality | Memory | Sequence Length |
|---|---|---|---|---|
| Standard Transformer | 1x | Baseline | 1x | 2K-4K |
| GQA | 1.2x | 99% baseline | 0.8x | 2K-4K |
| Flash Attention | 2x | 100% | 0.9x | 4K-8K |
| ALiBi | 1x | 95% baseline | 1x | 100K+ |
| RetNet | 1.5x | 90-95% | 0.7x | 100K+ |
| Mamba | 3x | 95%+ | 0.8x | Unlimited |
⚙️ Practical Considerations
Inference Optimization
Quantization
- INT8: 75% memory reduction
- INT4: 87.5% reduction
- Minimal quality loss
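A sketch of symmetric per-tensor INT8 quantization in NumPy (real deployments usually quantize per-channel or per-group, but the mechanism is the same):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: store int8 values plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()
print(q.nbytes / w.nbytes, err < scale)  # 0.25 (4x smaller), True (error < 1 step)
```

INT8 storage is 4x smaller than float32 (the 75% figure above), and the worst-case error is half a quantization step.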
Batching
- Group requests
- Amortize model load
- Better throughput
Caching
- Cache KV states
- Reduce recomputation
- Trade memory for speed
Deployment Trade-offs
Factors to Balance:
├─ Latency: Response time
├─ Throughput: Requests/second
├─ Cost: Infrastructure expense
├─ Quality: Accuracy/coherence
└─ Availability: Uptime requirement
🔗 Related Topics
- LLM Fundamentals - Introduction to LLMs
- NLP Fundamentals - Language understanding
- AI Agents - Using models as agents
- Advanced Model Implementations - Practical ML
📚 Influential Papers
- "Attention Is All You Need" - Vaswani et al. (2017) - Transformer foundation
- "An Image is Worth 16x16 Words" - Dosovitskiy et al. (2020) - Vision Transformers
- "Scaling Laws for Neural Language Models" - Kaplan et al. (2020) - Compute efficiency
- "Efficient Transformers: A Survey" - Tay et al. (2020) - Optimizations
- "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" - Gu & Dao (2023) - Beyond transformers
- "An Attention Free Transformer" - Zhai et al. (2021) - Alternative architectures
🛠️ Implementation Resources
Libraries
- HuggingFace Transformers: Pre-built models
- PyTorch: Low-level implementation
- JAX: Research-friendly framework
- vLLM: Inference optimization
- Flash-Attention: Efficient attention
Platforms
- Together AI: Build with models
- Replicate: Model deployment
- Hugging Face Spaces: Demo hosting