SemiAnalysis Weekly

Feb 19, 2026 - InferenceX (Cam Quilici, Bryan Shan, Doug O'Laughlin, Jordan Nanos)

Cam Quilici is an AI/ML systems practitioner focused on inference performance and benchmarking. He discusses InferenceX's evolution from its predecessor and the move to multi-node DeepSeek serving. The conversation covers major hardware wins, nightly large-scale benchmarking, software tuning complexities, multi-token prediction benefits, cost/TCO modeling, and roadmaps for TPU, multimodal, and multi-turn benchmarks.
INSIGHT

No Single Optimal Serving Point

  • There is no single Pareto-optimal serving configuration; tradeoffs exist between per-user latency and total throughput.
  • Different customers value extremes like very fast per-user interactivity versus maximizing aggregate throughput.
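The latency/throughput tension described above can be sketched with a toy cost model: a decode step pays a fixed overhead plus a per-sequence cost, so batching more users raises aggregate tokens/sec while lowering each user's interactivity. All numbers and function names here are hypothetical illustrations, not figures from the episode.

```python
# Toy model of the LLM serving tradeoff: larger batches raise total
# throughput but slow each individual user. Numbers are illustrative only.
FIXED_MS = 5.0     # assumed fixed cost per decode step (ms)
PER_SEQ_MS = 0.8   # assumed incremental cost per batched sequence (ms)

def decode_step_ms(batch: int) -> float:
    """Time for one decode step serving `batch` concurrent sequences."""
    return FIXED_MS + PER_SEQ_MS * batch

for batch in (1, 8, 32, 128):
    step = decode_step_ms(batch)
    per_user_tok_s = 1000.0 / step            # tokens/sec seen by one user
    aggregate_tok_s = batch * per_user_tok_s  # tokens/sec across all users
    print(f"batch={batch:4d}  per-user={per_user_tok_s:6.1f} tok/s  "
          f"aggregate={aggregate_tok_s:8.1f} tok/s")
```

Under this model, per-user speed falls monotonically as batch size grows while aggregate throughput rises, so there is no single "best" point: the right batch size depends on whether the customer values interactivity or total throughput.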
INSIGHT

Software Tuning Can Triple Performance Fast

  • Software and runtime optimizations can triple performance in weeks without hardware changes.
  • Continuous engineering pushes the performance frontier as much as new GPUs do.
ADVICE

Work With Vendors And Validate Recipes

  • Collaborate closely with hardware vendors to discover and validate performance recipes.
  • Test community recipes and novel flags but verify combinations for real workloads.