Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

Apr 29, 2026
Reiner Pope, MatX CEO and former Google engineer, turns a chalkboard into a tour of how frontier LLMs really run. He gets into batching, sparsity, MoE routing, rack design, pipeline parallelism, KV cache bottlenecks, and why decode is pricier than prefill. There’s also a fun detour into API pricing, long-context costs, and links between neural nets and cryptography.
INSIGHT

MoE Expert Parallelism Wants One Rack

  • MoE layers map naturally onto a single rack because expert parallelism requires all-to-all traffic, and a rack provides dense, fast connectivity.
  • Reiner Pope lays out 256 experts across 64 GPUs, four experts per GPU; crossing racks hits a much slower scale-out network and becomes the bottleneck (see the placement sketch below).
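
A minimal Python sketch of that placement, using the 256-expert / 64-GPU numbers from the snip. The blocked expert-to-GPU assignment is an illustrative assumption, not Pope's exact mapping; the point is that every expert lives inside one scale-up domain, so the MoE all-to-all never leaves the rack.

```python
# Expert placement sketch: 256 experts across 64 GPUs in one rack,
# 4 experts per GPU (numbers from the episode; assignment scheme assumed).

NUM_EXPERTS = 256
NUM_GPUS_PER_RACK = 64
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS_PER_RACK  # = 4

def expert_to_gpu(expert_id: int) -> int:
    """Map an expert index to the GPU (within the rack) that hosts it."""
    return expert_id // EXPERTS_PER_GPU

if __name__ == "__main__":
    # Tokens routed to any expert only cross the fast in-rack fabric.
    for expert_id in (0, 3, 4, 255):
        print(f"expert {expert_id:3d} -> GPU {expert_to_gpu(expert_id)}")
```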
INSIGHT

Scaling Up Runs Into Physical Cable Limits

  • Larger scale-up domains are constrained by physical rack design, not just chip design or switch math.
  • Reiner Pope says moving from smaller systems to Blackwell-style racks required denser cabling, tighter power and cooling, and managing bend radius, weight, and connector density.
INSIGHT

Pipelining Fits Cross-Rack Communication Better

  • Pipeline parallelism works across racks because layer-to-layer transfers are far cheaper than MoE all-to-all traffic within a layer.
  • Reiner Pope shows the scale-out network only has to overcome roughly an 8x bandwidth penalty once activated experts and layers per stage are accounted for, so one rack per layer can make sense (see the back-of-envelope sketch below).
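
A rough back-of-envelope sketch of that argument in Python. The 8x scale-out penalty is the figure from the snip; the traffic accounting (two all-to-all hops per MoE layer, two activated experts per token, four layers per stage) is an illustrative assumption rather than Pope's exact model.

```python
# Why pipelining across racks can tolerate a slower scale-out link.
# 8x penalty is from the episode; the other parameters are assumed.

SCALE_OUT_PENALTY = 8    # cross-rack link ~8x slower than in-rack fabric
activated_experts = 2    # experts each token is routed to (assumed)
layers_per_stage = 4     # transformer layers hosted on one rack (assumed)

# Per pipeline stage, the in-rack fabric carries roughly one dispatch and
# one combine per activated expert for every MoE layer in the stage...
in_rack_traffic = 2 * activated_experts * layers_per_stage  # in activation-sizes

# ...while the cross-rack link only carries the activations once, at the
# stage boundary.
cross_rack_traffic = 1

# Pipelining is not bandwidth-bound as long as the in-rack traffic per stage
# exceeds the penalty on the (much smaller) cross-rack hop.
ratio = in_rack_traffic / cross_rack_traffic
print(f"in-rack : cross-rack traffic = {ratio:.0f}x "
      f"(exceeds the {SCALE_OUT_PENALTY}x penalty: {ratio >= SCALE_OUT_PENALTY})")
```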