OpenAI Podcast

Episode 18 - Why AI needs a new kind of supercomputer network

May 6, 2026
Meet Greg Steinbrecher, a workload systems engineer who makes GPUs communicate efficiently, and Mark Handley, a veteran networking researcher shaping data center protocols. They discuss why large-scale GPU training breaks the internet's networking assumptions, and explain Multipath Reliable Connection (MRC), packet trimming, fast failure detection, and how a new network design improves reliability and speed for massive model training.
INSIGHT

MRC Sprays Traffic And Trims Packets

  • Multipath Reliable Connection (MRC) sprays packets across many network paths and uses packet trimming so that loss is signaled explicitly rather than inferred from timeouts.
  • Greg Steinbrecher describes how switches trim payloads and forward the headers, so receivers learn exactly which packets were lost and can request retransmission immediately.
INSIGHT

Endpoints Self Detect Failures Fast

  • MRC lets endpoints detect and avoid failing paths within milliseconds, rather than waiting for slow distributed routing convergence.
  • Greg Steinbrecher says each endpoint independently stops using a failed path, so jobs aren't paused for seconds while routes reconverge.
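The endpoint-side failure avoidance described above can be sketched as follows. This is a minimal illustration, assuming a simple consecutive-loss threshold; the class name, threshold, and round-robin spraying are all assumptions for the example, not MRC's actual parameters.

```python
# Sketch: an endpoint sprays packets over several paths and independently
# stops using any path that shows repeated loss, with no routing-protocol
# involvement. Names and thresholds are illustrative assumptions.
class MultipathSender:
    FAIL_THRESHOLD = 3  # consecutive losses before a path is avoided

    def __init__(self, paths: list[str]):
        self.paths = paths
        self.fail_counts = {p: 0 for p in paths}

    def live_paths(self) -> list[str]:
        # Paths the endpoint still considers healthy.
        return [p for p in self.paths if self.fail_counts[p] < self.FAIL_THRESHOLD]

    def next_path(self, seq: int) -> str:
        # Spray round-robin over the currently healthy paths only.
        live = self.live_paths()
        return live[seq % len(live)]

    def on_ack(self, path: str) -> None:
        self.fail_counts[path] = 0            # success clears the count

    def on_loss(self, path: str) -> None:
        self.fail_counts[path] += 1           # e.g. a trimmed header or timeout

# Usage: three consecutive losses on path-B take it out of rotation.
sender = MultipathSender(["path-A", "path-B", "path-C", "path-D"])
for _ in range(3):
    sender.on_loss("path-B")
print(sender.live_paths())                    # ['path-A', 'path-C', 'path-D']
print([sender.next_path(i) for i in range(3)])  # ['path-A', 'path-C', 'path-D']
```

Because the decision is local to each endpoint, a bad link is sidestepped on the next packet rather than after seconds of network-wide route reconvergence.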
ANECDOTE

Buildout Link Flaps Went Unnoticed With MRC

  • During data center buildout, links frequently flapped as technicians performed manual fiber work.
  • Greg Steinbrecher says MRC made those frequent link flaps invisible to training jobs because endpoints automatically switched to healthy paths.