
OpenAI Podcast Episode 18 - Why AI needs a new kind of supercomputer network
May 6, 2026
Meet Greg Steinbrecher, a workload systems engineer who makes GPUs talk to each other efficiently, and Mark Handley, a veteran networking researcher shaping data center protocols. They discuss why large-scale GPU training breaks the assumptions internet networking was built on. They explain Multipath Reliable Connection, packet trimming, fast failure detection, and how a new network design boosts reliability and speed for massive model training.
AI Snips
MRC Sprays Traffic And Trims Packets
- Multipath Reliable Connection (MRC) sprays packets across many paths and uses packet trimming to avoid ambiguity on loss.
- Greg Steinbrecher describes trimming payloads and forwarding headers so receivers can request retransmission immediately.
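The trimming idea described above can be sketched as a toy simulation: instead of silently dropping a packet under congestion, the switch strips the payload and forwards the header, so the receiver knows exactly which sequence number to NACK. All names here (`Packet`, `switch_forward`, `receiver`) are invented for illustration; this is not MRC's actual wire format or implementation.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False

QUEUE_LIMIT = 4  # toy switch buffer capacity, in packets

def switch_forward(queue, pkt):
    """Enqueue a packet; on congestion, trim the payload but keep the header."""
    if len(queue) < QUEUE_LIMIT:
        queue.append(pkt)
    else:
        # A real switch would place the tiny header-only packet on a
        # high-priority queue; here we just append it for simplicity.
        # The key point: the receiver still learns WHICH packet was lost.
        queue.append(Packet(seq=pkt.seq, payload=b"", trimmed=True))

def receiver(delivered):
    """Return the sequence numbers to NACK: every trimmed packet."""
    return [p.seq for p in delivered if p.trimmed]

if __name__ == "__main__":
    queue = []
    for i in range(6):
        switch_forward(queue, Packet(seq=i, payload=b"x" * 1024))
    print(receiver(queue))  # the two packets past the buffer limit
```

Because the headers survive, the receiver can request retransmission immediately rather than waiting for a timeout to infer that something was lost.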
Endpoints Self-Detect Failures Fast
- MRC lets endpoints detect and avoid bad paths in milliseconds, removing slow distributed routing convergence.
- Greg Steinbrecher says endpoints independently stop using a failed path so jobs aren't paused for seconds during route reconvergence.
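The endpoint-driven failover above can be sketched as a small path tracker: the sender sprays round-robin over healthy paths and, after a few consecutive losses on one path, simply stops using it, with no need to wait for distributed routing to reconverge. The class name, threshold, and round-robin policy are all assumptions for illustration, not MRC's actual algorithm.

```python
class PathSet:
    """Toy per-connection path health tracker at a sending endpoint."""

    def __init__(self, n_paths, loss_threshold=3):
        self.healthy = set(range(n_paths))
        self.losses = {p: 0 for p in range(n_paths)}
        self.threshold = loss_threshold
        self._rr = 0  # round-robin cursor

    def next_path(self):
        """Spray the next packet over the currently healthy paths."""
        paths = sorted(self.healthy)
        path = paths[self._rr % len(paths)]
        self._rr += 1
        return path

    def on_ack(self, path):
        self.losses[path] = 0  # any success resets the loss count

    def on_loss(self, path):
        self.losses[path] += 1
        # Stop using the path immediately once losses pile up,
        # instead of pausing the job while routes reconverge.
        if self.losses[path] >= self.threshold and len(self.healthy) > 1:
            self.healthy.discard(path)

if __name__ == "__main__":
    ps = PathSet(4)
    for _ in range(3):
        ps.on_loss(2)          # path 2 goes bad
    print(2 in ps.healthy)     # the endpoint has already routed around it
```

Because each endpoint makes this decision locally from its own acks and losses, a failed link is avoided in milliseconds rather than the seconds a routing protocol might take to converge.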
Buildout Link Flaps Went Unnoticed With MRC
- During data center buildout, links frequently went up and down as technicians performed manual fiber work.
- Greg Steinbrecher says MRC made those frequent link flaps invisible to training jobs because endpoints auto-switched paths.

