Software Engineering Radio - the podcast for professional software developers

SE Radio 703: Sahaj Garg on Low Latency AI

Jan 14, 2026
In this engaging discussion, Sahaj Garg, CTO and co-founder of Whispr.ai, shares his expertise on low-latency AI applications. He explains how latency affects user experience and offers insights into measuring and diagnosing latency issues. The conversation covers critical trade-offs between speed, accuracy, and cost in AI models. Sahaj also introduces optimization techniques like quantization and distillation, stressing the importance of low latency for user engagement in interactive apps. Tune in for invaluable tips on navigating the latency landscape!
INSIGHT

Cascade Models To Meet Different Latency Needs

  • Recommendation systems use cascaded models with different latency needs.
  • Serve results from a fast candidate-retrieval stage first, and run the expensive ranking models for later pages.
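A minimal sketch of the cascade idea in Python. The catalog, substring matching, and length-based scoring are illustrative placeholders, not anything described in the episode; a real system would use a cheap retrieval index for the first stage and an ML scoring model for the second.

```python
def retrieve_candidates(query, catalog, k=10):
    # Fast first stage: cheap substring match over the whole catalog.
    # Low latency, rough quality -- good enough for the first page.
    return [item for item in catalog if query in item][:k]

def rank(candidates):
    # Slow second stage: stand-in for an expensive ranking model.
    # Here we just sort by length; a real system would score each candidate.
    return sorted(candidates, key=len)

# First page: return retrieval order immediately (low latency).
# Later pages: spend the extra compute to rank the candidates properly.
page_one = retrieve_candidates("ai", ["ai chat", "maps", "ai assistant"])
page_two = rank(page_one)
```

The point of the cascade is that each stage only pays its latency cost when the user actually needs its quality.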
INSIGHT

Autoregression Makes AI Sequential And Costly

  • Autoregressive generation creates sequential dependency for each token.
  • AI workloads often require GPUs and have many sequential decode steps, unlike typical web requests.
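The sequential dependency can be sketched with a toy decode loop. The "model" below is a placeholder function, not a real LLM; the structure to notice is that step N cannot start until step N-1 has appended its token, which is why decode latency grows with output length.

```python
def decode_step(context):
    # Toy stand-in for a model forward pass: the next token is
    # derived from the full context so far (placeholder logic).
    return len(context) % 10

def generate(prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        # Each iteration depends on every previous token -- the steps
        # cannot run in parallel, unlike independent web requests.
        tokens.append(decode_step(tokens))
    return tokens
```

A typical web request is one round trip; generating 500 tokens is 500 dependent forward passes, which is why autoregressive AI latency behaves so differently.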
ADVICE

Choose Between Immediate Serve Or Batching

  • Balance latency vs throughput by choosing processing strategies.
  • Either serve requests immediately for low latency or batch them to increase GPU throughput and lower cost.
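A small sketch of the two strategies, assuming a `model` callable that accepts a list of requests (the batching parameters `max_batch` and `max_wait_s` are illustrative, not values from the episode):

```python
import time

def serve_immediately(request, model):
    # Lowest latency: run a batch of one as soon as the request arrives.
    # GPU utilization suffers, so cost per request is higher.
    return model([request])[0]

def serve_batched(queue, model, max_batch=8, max_wait_s=0.01):
    # Higher throughput: wait briefly to collect requests, then run one
    # batched forward pass. Latency rises by up to max_wait_s, but the
    # GPU processes many requests per pass, lowering cost.
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch and time.monotonic() < deadline:
        if queue:
            batch.append(queue.pop(0))
    return model(batch) if batch else []
```

Interactive apps lean toward the first strategy; offline or bulk workloads lean toward the second, and many serving systems tune the batching window to sit between the two.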