
OpenAI Podcast Episode 18 - Why AI needs a new kind of supercomputer network
May 6, 2026 Mark Handley, a longtime networking researcher and UCL professor, and Greg Steinbrecher, a GPU-cluster engineer optimizing large-scale training, explain how AI training breaks traditional networks. They discuss why synchronous GPU workloads stall on small faults, how Multipath Reliable Connection sprays traffic across paths and trims packets to route around failures, and why making it an open standard can lower cost and boost reliability for massive GPU fleets.
AI Snips
Synchronous Training Makes the Network Only as Fast as Its Worst Link
- AI training workloads are synchronized computations across thousands of GPUs, so the worst single link sets the overall speed.
- Mark Handley explains that unlike internet traffic, AI training is governed by p100 (worst-case) tail behavior: one bottleneck link stalls the entire job.
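The point above can be sketched numerically. This is a toy model (the delay figures are made up for illustration, not from the episode): because a synchronous all-reduce step finishes only when the slowest link does, step time tracks the maximum link delay, not the mean.

```python
def sync_step_time(link_delays_ms):
    """A synchronous step cannot finish until the slowest link does,
    so step time is the max, not the mean, of the link delays."""
    return max(link_delays_ms)

# Hypothetical numbers: 1023 healthy links at 10 ms, one congested link at 100 ms.
delays = [10.0] * 1023 + [100.0]

mean_ms = sum(delays) / len(delays)  # ~10.1 ms: the average looks perfectly healthy
step_ms = sync_step_time(delays)     # 100 ms: the whole job runs at p100 speed
```

One slow link out of 1024 barely moves the average, yet the entire job runs 10x slower, which is why tail behavior, not average throughput, is what matters here.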
More GPUs Means More Failures Per Unit Time
- As cluster size grows, component failures become more frequent, and the mean time between failures drops roughly in inverse proportion to scale.
- Greg Steinbrecher notes a single GPU connects through many optical components, so a building can have orders of magnitude more network parts than GPUs.
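The inverse scaling above can be made concrete with some back-of-the-envelope arithmetic. All the specific numbers below (per-component MTBF, optics fan-out per GPU, fleet size) are assumptions for illustration, not figures from the episode:

```python
def fleet_mtbf_hours(component_mtbf_hours: float, num_components: int) -> float:
    """With roughly independent failures, N components each failing at
    rate 1/MTBF give a combined rate of N/MTBF, so the fleet-wide mean
    time between failures falls as 1/N."""
    return component_mtbf_hours / num_components

# Hypothetical figures:
transceiver_mtbf = 5 * 365 * 24   # each optical part lasts ~5 years (~43,800 h)
per_gpu_optics = 8                # assumed optical components per GPU
gpus = 100_000
parts = gpus * per_gpu_optics     # 800,000 network components in the building

print(fleet_mtbf_hours(transceiver_mtbf, parts))  # ~0.05 h: a fault every few minutes
```

Even with very reliable individual parts, a building-scale fleet sees faults continuously, which is why the network has to tolerate failures rather than hope to avoid them.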
Packet Trimming Removes Loss Versus Reorder Ambiguity
- Multipath Reliable Connection (MRC) spreads packets across many paths and uses packet trimming to remove the ambiguity between loss and reordering.
- Mark Handley describes trimming the payload and forwarding a tiny header so the receiver can request a retransmission immediately.
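The trimming idea above can be sketched as a toy model. The class and function names here are invented for illustration and do not reflect MRC's actual wire format; the point is the control flow: instead of silently dropping a packet, a congested switch forwards just the header, so the receiver knows exactly which sequence number to re-request without waiting out a timeout to distinguish loss from reordering.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    seq: int
    payload: Optional[bytes]  # None means the switch trimmed the payload

def switch_forward(pkt: Packet, queue_full: bool) -> Packet:
    """Toy packet trimming: when the queue is full, strip the payload
    and forward the tiny header instead of silently dropping."""
    if queue_full:
        return Packet(seq=pkt.seq, payload=None)
    return pkt

def receiver(pkt: Packet) -> str:
    """A trimmed header identifies the lost packet unambiguously, so
    the receiver can request retransmission immediately."""
    if pkt.payload is None:
        return f"NACK {pkt.seq}"  # immediate retransmit request
    return f"ACK {pkt.seq}"
```

With a silent drop, the receiver would have to wait to see whether packet 7 was merely reordered onto a slower path; the trimmed header resolves that question in one round trip.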


