OpenAI Podcast

Episode 18 - Why AI needs a new kind of supercomputer network

May 6, 2026
Meet Greg Steinbrecher, a workload systems engineer who makes GPUs communicate efficiently, and Mark Handley, a veteran networking researcher shaping data center protocols. They discuss why large-scale GPU training breaks the internet's networking assumptions, and explain Multipath Reliable Connection (MRC), packet trimming, fast failure detection, and how a new network design improves reliability and speed for massive model training.
INSIGHT

MRC Sprays Traffic And Trims Packets

  • Multipath Reliable Connection (MRC) sprays packets across many network paths and uses packet trimming so that loss is signaled explicitly rather than inferred from timeouts.
  • Greg Steinbrecher describes how switches trim payloads and forward the headers, so receivers learn exactly which packets were lost and can request retransmission immediately.
INSIGHT

Endpoints Self Detect Failures Fast

  • MRC lets endpoints detect and avoid failing paths within milliseconds, rather than waiting for slow distributed routing convergence.
  • Greg Steinbrecher says each endpoint independently stops using a failed path, so jobs aren't paused for seconds while routes reconverge.
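The endpoint-side failure avoidance described above can be sketched as follows. This is a minimal illustration, assuming a simple consecutive-loss threshold; the class name, threshold, and round-robin spraying are all assumptions for the example, not MRC's actual parameters.

```python
# Sketch: an endpoint sprays packets over several paths and independently
# stops using any path that shows repeated loss, with no routing-protocol
# involvement. Names and thresholds are illustrative assumptions.
class MultipathSender:
    FAIL_THRESHOLD = 3  # consecutive losses before a path is avoided

    def __init__(self, paths: list[str]):
        self.paths = paths
        self.fail_counts = {p: 0 for p in paths}

    def live_paths(self) -> list[str]:
        # Paths the endpoint still considers healthy.
        return [p for p in self.paths if self.fail_counts[p] < self.FAIL_THRESHOLD]

    def next_path(self, seq: int) -> str:
        # Spray round-robin over the currently healthy paths only.
        live = self.live_paths()
        return live[seq % len(live)]

    def on_ack(self, path: str) -> None:
        self.fail_counts[path] = 0            # success clears the count

    def on_loss(self, path: str) -> None:
        self.fail_counts[path] += 1           # e.g. a trimmed header or timeout

# Usage: three consecutive losses on path-B take it out of rotation.
sender = MultipathSender(["path-A", "path-B", "path-C", "path-D"])
for _ in range(3):
    sender.on_loss("path-B")
print(sender.live_paths())                    # ['path-A', 'path-C', 'path-D']
print([sender.next_path(i) for i in range(3)])  # ['path-A', 'path-C', 'path-D']
```

Because the decision is local to each endpoint, a bad link is sidestepped on the next packet rather than after seconds of network-wide route reconvergence.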
ANECDOTE

Buildout Link Flaps Went Unnoticed With MRC

  • During data center buildout, links frequently flapped as technicians performed manual fiber work.
  • Greg Steinbrecher says MRC made those frequent link flaps invisible to training jobs because endpoints automatically switched to healthy paths.