SemiAnalysis Weekly

Feb 19, 2026 - InferenceX (Cam Quilici, Bryan Shan, Doug O'Laughlin, Jordan Nanos)

Cam Quilici is an AI/ML systems practitioner focused on inference performance and benchmarking. He discusses InferenceX's evolution from its predecessor and the move to multi-node DeepSeek serving. The conversation covers major hardware wins, nightly large-scale benchmarking, software tuning complexities, multi-token prediction benefits, cost/TCO modeling, and roadmaps for TPU, multimodal, and multi-turn benchmarks.
INSIGHT

No Single Optimal Serving Point

  • There is no single Pareto-optimal serving configuration; tradeoffs exist between per-user latency and total throughput.
  • Different customers value extremes like very fast per-user interactivity versus maximizing aggregate throughput.
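The latency/throughput tension described above can be sketched with a toy cost model: a decode step pays a fixed overhead plus a per-sequence cost, so batching more users raises aggregate tokens/sec while lowering each user's interactivity. All numbers and function names here are hypothetical illustrations, not figures from the episode.

```python
# Toy model of the LLM serving tradeoff: larger batches raise total
# throughput but slow each individual user. Numbers are illustrative only.
FIXED_MS = 5.0     # assumed fixed cost per decode step (ms)
PER_SEQ_MS = 0.8   # assumed incremental cost per batched sequence (ms)

def decode_step_ms(batch: int) -> float:
    """Time for one decode step serving `batch` concurrent sequences."""
    return FIXED_MS + PER_SEQ_MS * batch

for batch in (1, 8, 32, 128):
    step = decode_step_ms(batch)
    per_user_tok_s = 1000.0 / step            # tokens/sec seen by one user
    aggregate_tok_s = batch * per_user_tok_s  # tokens/sec across all users
    print(f"batch={batch:4d}  per-user={per_user_tok_s:6.1f} tok/s  "
          f"aggregate={aggregate_tok_s:8.1f} tok/s")
```

Under this model, per-user speed falls monotonically as batch size grows while aggregate throughput rises, so there is no single "best" point: the right batch size depends on whether the customer values interactivity or total throughput.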
INSIGHT

Software Tuning Can Triple Performance Fast

  • Software and runtime optimizations can triple performance in weeks without hardware changes.
  • Continuous engineering pushes the performance frontier as much as new GPUs do.
ADVICE

Work With Vendors And Validate Recipes

  • Collaborate closely with hardware vendors to discover and validate performance recipes.
  • Test community recipes and novel flags but verify combinations for real workloads.