
OpenAI Podcast Episode 18 - Why AI needs a new kind of supercomputer network
May 6, 2026 Mark Handley, a longtime networking researcher and UCL professor, and Greg Steinbrecher, a GPU-cluster engineer optimizing large-scale training, explain how AI training breaks traditional networks. They discuss why synchronous GPU workloads stall on small faults, how Multipath Reliable Connection sprays traffic across paths and trims packets to route around failures, and why making it an open standard can lower cost and boost reliability for massive GPU fleets.
AI Snips
Synchronous Training Makes the Network Only as Fast as Its Worst Link
- AI training workloads are synchronized computations across thousands of GPUs, so the worst single link sets the overall speed.
- Mark Handley explains that unlike internet traffic, AI training is governed by p100 (worst-case) tail behavior: one bottleneck link stalls the entire job.
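The point above can be sketched numerically. This is a toy model (the delay figures are made up for illustration, not from the episode): because a synchronous all-reduce step finishes only when the slowest link does, step time tracks the maximum link delay, not the mean.

```python
def sync_step_time(link_delays_ms):
    """A synchronous step cannot finish until the slowest link does,
    so step time is the max, not the mean, of the link delays."""
    return max(link_delays_ms)

# Hypothetical numbers: 1023 healthy links at 10 ms, one congested link at 100 ms.
delays = [10.0] * 1023 + [100.0]

mean_ms = sum(delays) / len(delays)  # ~10.1 ms: the average looks perfectly healthy
step_ms = sync_step_time(delays)     # 100 ms: the whole job runs at p100 speed
```

One slow link out of 1024 barely moves the average, yet the entire job runs 10x slower, which is why tail behavior, not average throughput, is what matters here.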
More GPUs Means More Failures Per Unit Time
- As cluster size grows, component failures become more frequent, and the mean time between failures drops roughly in inverse proportion to scale.
- Greg Steinbrecher notes a single GPU connects through many optical components, so a building can have orders of magnitude more network parts than GPUs.
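The inverse scaling above can be made concrete with some back-of-the-envelope arithmetic. All the specific numbers below (per-component MTBF, optics fan-out per GPU, fleet size) are assumptions for illustration, not figures from the episode:

```python
def fleet_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """With roughly independent failures, N components each failing at
    rate 1/MTBF give a combined rate of N/MTBF, so the fleet-wide mean
    time between failures falls as 1/N."""
    return component_mtbf_hours / num_components

# Hypothetical figures:
transceiver_mtbf = 5 * 365 * 24   # each optical part lasts ~5 years (~43,800 h)
per_gpu_optics = 8                # assumed optical components per GPU
gpus = 100_000
parts = gpus * per_gpu_optics     # 800,000 network components in the building

print(fleet_mtbf_hours(transceiver_mtbf, parts))  # ~0.05 h: a fault every few minutes
```

Even with very reliable individual parts, a building-scale fleet sees faults continuously, which is why the network has to tolerate failures rather than hope to avoid them.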
Packet Trimming Removes Loss Versus Reorder Ambiguity
- Multipath Reliable Connection (MRC) spreads packets across many paths and uses packet trimming to remove the ambiguity between loss and reordering.
- Mark Handley describes trimming the payload and forwarding a tiny header so the receiver can request a retransmission immediately.
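The trimming idea above can be sketched as a toy model. The class and function names here are invented for illustration and do not reflect MRC's actual wire format; the point is the control flow: instead of silently dropping a packet, a congested switch forwards just the header, so the receiver knows exactly which sequence number to re-request without waiting out a timeout to distinguish loss from reordering.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    seq: int
    payload: Optional[bytes]  # None means the switch trimmed the payload

def switch_forward(pkt: Packet, queue_full: bool) -> Packet:
    """Toy packet trimming: when the queue is full, strip the payload
    and forward the tiny header instead of silently dropping."""
    if queue_full:
        return Packet(seq=pkt.seq, payload=None)
    return pkt

def receiver(pkt: Packet) -> str:
    """A trimmed header identifies the lost packet unambiguously, so
    the receiver can request retransmission immediately."""
    if pkt.payload is None:
        return f"NACK {pkt.seq}"  # immediate retransmit request
    return f"ACK {pkt.seq}"
```

With a silent drop, the receiver would have to wait to see whether packet 7 was merely reordered onto a slower path; the trimmed header resolves that question in one round trip.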


