
SemiAnalysis Weekly Feb 19, 2026 - InferenceX (Cam Quilici, Bryan Shan, Doug O'Laughlin, Jordan Nanos)
Cam Quilici is an AI/ML systems practitioner focused on inference performance and benchmarking. He discusses InferenceX's evolution from its predecessor and the move to multi-node DeepSeek serving. The conversation covers major hardware wins, nightly large-scale benchmarking, software tuning complexities, multi-token prediction benefits, cost/TCO modeling, and roadmaps for TPU, multimodal, and multi-turn benchmarks.
No Single Optimal Serving Point
- There is no single Pareto-optimal serving configuration; tradeoffs exist between per-user latency and total throughput.
- Different customers sit at different extremes: some prioritize very fast per-user interactivity, others maximize aggregate throughput.
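The latency/throughput tradeoff above can be sketched as a Pareto-frontier check. The configuration names and numbers below are hypothetical, not from the episode; the point is only that several non-dominated serving points can coexist, so there is no single "best" configuration.

```python
# Hypothetical serving configs: (per-user latency in ms/token, aggregate throughput in tokens/s).
# Lower latency and higher throughput are both "better", and they pull in opposite directions.
configs = {
    "low-batch":  (10.0, 2_000.0),   # snappy per-user experience, low total throughput
    "mid-batch":  (25.0, 8_000.0),
    "high-batch": (60.0, 15_000.0),  # high aggregate throughput, slow per-user
    "oversized":  (70.0, 7_000.0),   # dominated: slower AND lower throughput than "mid-batch"
}

def pareto_frontier(points):
    """Return names of configs no other config dominates
    (dominates = latency <= and throughput >=, with at least one strict)."""
    frontier = []
    for name, (lat, thr) in points.items():
        dominated = any(
            (l2 <= lat and t2 >= thr) and (l2 < lat or t2 > thr)
            for n2, (l2, t2) in points.items() if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(configs))  # → ['low-batch', 'mid-batch', 'high-batch']
```

Three of the four configurations survive on the frontier; which one a customer should pick depends entirely on whether they weight interactivity or aggregate throughput.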
Software Tuning Can Triple Performance Fast
- Software and runtime optimizations can triple performance in weeks without hardware changes.
- Continuous engineering pushes the performance frontier as much as new GPUs do.
Work With Vendors And Validate Recipes
- Collaborate closely with hardware vendors to discover and validate performance recipes.
- Test community recipes and novel flags, but verify that flag combinations actually hold up on real workloads.