
Dwarkesh Podcast: Reiner Pope – The math behind how LLMs are trained and served
Apr 29, 2026

Reiner Pope, MatX CEO and former Google engineer, turns a chalkboard into a tour of how frontier LLMs really run. He gets into batching, sparsity, MoE routing, rack design, pipeline parallelism, KV cache bottlenecks, and why decode is pricier than prefill. There's also a fun detour into API pricing, long-context costs, and links between neural nets and cryptography.
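To ground the decode-vs-prefill claim before the snips below, here is a minimal arithmetic sketch. The layer count, hidden size, and context length are illustrative assumptions, not figures from the episode: prefill processes the whole prompt in large compute-bound matmuls, while decode must re-read the full KV cache from memory for every generated token.

```python
# Why decode costs more per token than prefill: a minimal arithmetic sketch
# with illustrative (assumed) model dimensions, not numbers from the episode.

N_LAYERS = 64
D_MODEL = 8192
BYTES = 2                  # bf16
CONTEXT = 32_000           # tokens already in context

# KV cache per token: keys + values for every layer.
kv_bytes_per_token = 2 * N_LAYERS * D_MODEL * BYTES

# Prefill streams all context tokens through matmul-heavy passes
# (compute-bound); decode must re-read the entire KV cache from memory
# for each single new token (bandwidth-bound).
cache = CONTEXT * kv_bytes_per_token
print(f"KV cache: {cache / 2**30:.1f} GiB read per decoded token")
```

Grouped-query attention or multi-query attention shrinks this cache, but the bandwidth-bound shape of decode stays the same.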
MoE Expert Parallelism Wants One Rack
- MoE layers map naturally onto one rack because expert parallelism needs all-to-all traffic, and a rack provides dense, fast connectivity.
- Reiner Pope lays 256 experts across 64 GPUs, four experts per GPU; crossing racks hits a much slower scale-out network and becomes the bottleneck (see the traffic sketch after this list).
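A back-of-envelope sketch of that placement and its per-token all-to-all volume. The 256-experts-over-64-GPUs layout is from the episode; the hidden size, top-k routing, and bf16 activations are assumptions for illustration:

```python
# Hypothetical numbers for placing MoE experts in one rack. Values are
# illustrative assumptions, except the 256 experts over 64 GPUs
# (4 experts per GPU) that Pope cites.

NUM_EXPERTS = 256          # experts in one MoE layer
GPUS_PER_RACK = 64         # scale-up domain: one rack
EXPERTS_PER_GPU = NUM_EXPERTS // GPUS_PER_RACK   # = 4

D_MODEL = 8192             # assumed hidden size
TOP_K = 2                  # assumed experts activated per token
BYTES_PER_ACT = 2          # bf16 activations

# Each token's activation vector is scattered to its top-k experts and the
# results gathered back: roughly 2 * top_k * d_model * bytes of all-to-all
# traffic per token, per MoE layer.
bytes_per_token_per_layer = 2 * TOP_K * D_MODEL * BYTES_PER_ACT
print(f"{EXPERTS_PER_GPU} experts/GPU, "
      f"{bytes_per_token_per_layer / 1024:.0f} KiB all-to-all per token per layer")
```

The scatter/gather happens inside every MoE layer, which is why it has to ride the rack's fast scale-up fabric rather than the scale-out network.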
Scaling Up Runs Into Physical Cable Limits
- Larger scale-up domains are constrained by physical rack design, not just chip design or switch math.
- Reiner Pope says moving from smaller systems to Blackwell-style racks required denser cabling, tighter power and cooling, and managing bend radius, weight, and connector density.
Pipelining Fits Cross-Rack Communication Better
- Pipeline parallelism works across racks because layer-to-layer transfers are far cheaper than the MoE all-to-all traffic within a layer.
- Reiner Pope shows the scale-out network only needs to overcome roughly an 8x bandwidth penalty, a factor set by how many experts a token activates and how many layers sit in each stage, so one rack per layer can make sense (worked through in the sketch below).
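A worked version of that ratio, under assumed values (top-k of 2, two layers per stage, bf16). These choices happen to reproduce the 8x figure, but they are illustrative rather than quoted from the episode:

```python
# Rough traffic comparison between the within-rack MoE all-to-all and the
# cross-rack pipeline hop. top_k and layers_per_stage are illustrative
# assumptions; the headline number is the ~8x ratio itself.

D_MODEL = 8192             # assumed hidden size
BYTES_PER_ACT = 2          # bf16 activations
TOP_K = 2                  # assumed experts activated per token
LAYERS_PER_STAGE = 2       # assumed layers mapped onto one rack

# Within a stage: every MoE layer scatters and gathers top-k activation
# copies over the fast scale-up fabric.
all_to_all = 2 * TOP_K * D_MODEL * BYTES_PER_ACT * LAYERS_PER_STAGE

# Between stages: the token's activation crosses the slower scale-out
# network exactly once per stage boundary.
pipeline_hop = D_MODEL * BYTES_PER_ACT

print(f"scale-up moves {all_to_all / pipeline_hop:.0f}x more bytes per token")
# -> 8x with these assumptions, so a scale-out network ~8x slower than the
#    rack fabric still isn't the bottleneck.
```

Raising top-k or the number of layers per stage widens the gap, which is why pipelining tolerates a much slower cross-rack network.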

