Join us for an insightful exploration into the cutting-edge design of today's Large Language Models. Seven years on from the original GPT architecture, have we truly seen groundbreaking changes, or are we simply refining existing foundations? This podcast focuses on the architectural developments that define flagship open models in 2025, moving beyond benchmark performance or training algorithms.
In this episode, we'll unpack the key ingredients contributing to LLM performance, examining how developers are pushing the boundaries of efficiency, memory management, and training stability. Discover the evolution and intricacies of:
- Attention Mechanisms: From Multi-Head Attention (MHA) to the more efficient Grouped-Query Attention (GQA), and innovative approaches like Multi-Head Latent Attention (MLA) in DeepSeek-V3, which compresses key and value tensors into a lower-dimensional latent space for memory savings. We also delve into Sliding Window Attention in Gemma 3, which restricts each token's attention to a local window to shrink the KV cache.
- Normalization Layers: Explore the shift from LayerNorm to RMSNorm and the crucial placement of these layers (Pre-Norm, Post-Norm) as seen in OLMo 2 and Gemma 3, including the addition of QK-Norm for enhanced training stability.
- Mixture-of-Experts (MoE): Understand why this approach has seen a significant resurgence in 2025. Learn how MoE, as implemented in models like DeepSeek-V3, Llama 4, and Qwen3's sparse variants, allows for massive total parameter counts (e.g., DeepSeek-V3's 671 billion parameters) while activating only a small subset (e.g., 37 billion) per token for remarkable inference efficiency.
- Positional Embeddings: Discover how positional information is handled, from rotary positional embeddings (RoPE) to the radical concept of No Positional Embeddings (NoPE) in SmolLM3, which aims for better length generalization.
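To make one of these ingredients concrete before you listen, here is a minimal pure-Python sketch of rotary positional embeddings (RoPE): pairs of query/key dimensions are rotated by position-dependent angles. This is a toy, single-vector version; real implementations operate on batched tensors.

```python
import math

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of a query/key vector `x`
    by angles that depend on the token position `pos` (toy sketch)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out
```

Because position enters as a relative rotation between query and key vectors, RoPE needs no learned positional table; NoPE goes one step further and drops even this, leaving the causal mask as the only source of ordering information.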
We'll compare the structural nuances of leading models such as:
- DeepSeek-V3: A massive 671-billion-parameter model known for MLA and MoE with a shared expert.
- OLMo 2: Notable for its transparency and specific RMSNorm placements for training stability.
- Gemma 3 & 3n: Featuring sliding window attention for KV cache memory savings and unique normalization layer placements; Gemma 3n also introduces Per-Layer Embedding and MatFormer concepts.
- Mistral Small 3.1: Prioritizing lower inference latency through a custom tokenizer and specific architectural choices.
- Llama 4: Adopting an MoE approach similar to DeepSeek-V3 but with its own distinct expert configuration.
- Qwen3: Available in both dense and MoE variants, offering flexibility for various use cases and moving away from shared experts in some MoE configurations.
- SmolLM3: A compact 3-billion-parameter model exploring the effectiveness of NoPE.
- Kimi K2: An impressive 1-trillion-parameter model, building on the DeepSeek-V3 architecture with more experts and fewer MLA heads, setting new standards for open-weight performance.
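The sparse MoE behavior shared by several of the models above comes down to a router that picks a few experts per token. The following sketch shows top-k routing in its simplest form; the expert count and scores are illustrative, not any model's actual configuration.

```python
def route_token(router_scores, k=2):
    """Select the top-k experts for one token from router scores.
    Only the chosen experts' feed-forward weights run for this token,
    which is how a huge total parameter count yields a small active count."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:k])

# With 8 experts and top-2 routing, 6 of the 8 expert FFNs stay idle:
chosen = route_token([0.05, 0.40, 0.10, 0.02, 0.25, 0.08, 0.06, 0.04], k=2)
# -> [1, 4]
```

DeepSeek-V3's shared expert, mentioned above, sits alongside this mechanism: one expert is run for every token regardless of what the router selects.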
Tune in to understand the intricate design decisions driving the next generation of large language models.