Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

Apr 29, 2026
Reiner Pope, MatX CEO and former Google engineer, turns a chalkboard into a tour of how frontier LLMs really run. He gets into batching, sparsity, MoE routing, rack design, pipeline parallelism, KV cache bottlenecks, and why decode is pricier than prefill. There’s also a fun detour into API pricing, long-context costs, and links between neural nets and cryptography.
INSIGHT

MoE Expert Parallelism Wants One Rack

  • MoE layers map naturally onto a single rack because expert parallelism requires all-to-all traffic, and a rack provides dense, fast connectivity.
  • Reiner Pope lays out 256 experts across 64 GPUs, four experts per GPU; crossing racks hits a much slower scale-out network and becomes the bottleneck (see the placement sketch below).
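
A minimal Python sketch of that placement, using the 256-expert / 64-GPU numbers from the snip. The blocked expert-to-GPU assignment is an illustrative assumption, not Pope's exact mapping; the point is that every expert lives inside one scale-up domain, so the MoE all-to-all never leaves the rack.

```python
# Expert placement sketch: 256 experts across 64 GPUs in one rack,
# 4 experts per GPU (numbers from the episode; assignment scheme assumed).

NUM_EXPERTS = 256
NUM_GPUS_PER_RACK = 64
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS_PER_RACK  # = 4

def expert_to_gpu(expert_id: int) -> int:
    """Map an expert index to the GPU (within the rack) that hosts it."""
    return expert_id // EXPERTS_PER_GPU

if __name__ == "__main__":
    # Tokens routed to any expert only cross the fast in-rack fabric.
    for expert_id in (0, 3, 4, 255):
        print(f"expert {expert_id:3d} -> GPU {expert_to_gpu(expert_id)}")
```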
INSIGHT

Scaling Up Runs Into Physical Cable Limits

  • Larger scale-up domains are constrained by physical rack design, not just chip design or switch math.
  • Reiner Pope says moving from smaller systems to Blackwell-style racks required denser cabling, tighter power and cooling, and managing bend radius, weight, and connector density.
INSIGHT

Pipelining Fits Cross-Rack Communication Better

  • Pipeline parallelism works across racks because layer-to-layer transfers are far cheaper than MoE all-to-all traffic within a layer.
  • Reiner Pope shows the scale-out network only has to overcome roughly an 8x bandwidth penalty once activated experts and layers per stage are accounted for, so one rack per layer can make sense (see the back-of-envelope sketch below).
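
A rough back-of-envelope sketch of that argument in Python. The 8x scale-out penalty is the figure from the snip; the traffic accounting (two all-to-all hops per MoE layer, two activated experts per token, four layers per stage) is an illustrative assumption rather than Pope's exact model.

```python
# Why pipelining across racks can tolerate a slower scale-out link.
# 8x penalty is from the episode; the other parameters are assumed.

SCALE_OUT_PENALTY = 8    # cross-rack link ~8x slower than in-rack fabric
activated_experts = 2    # experts each token is routed to (assumed)
layers_per_stage = 4     # transformer layers hosted on one rack (assumed)

# Per pipeline stage, the in-rack fabric carries roughly one dispatch and
# one combine per activated expert for every MoE layer in the stage...
in_rack_traffic = 2 * activated_experts * layers_per_stage  # in activation-sizes

# ...while the cross-rack link only carries the activations once, at the
# stage boundary.
cross_rack_traffic = 1

# Pipelining is not bandwidth-bound as long as the in-rack traffic per stage
# exceeds the penalty on the (much smaller) cross-rack hop.
ratio = in_rack_traffic / cross_rack_traffic
print(f"in-rack : cross-rack traffic = {ratio:.0f}x "
      f"(exceeds the {SCALE_OUT_PENALTY}x penalty: {ratio >= SCALE_OUT_PENALTY})")
```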