
Software Engineering Radio - the podcast for professional software developers

SE Radio 703: Sahaj Garg on Low Latency AI
Jan 14, 2026

In this discussion, Sahaj Garg, CTO and co-founder of Whispr.ai, shares his expertise on low-latency AI applications. He explains how latency affects user experience and offers insights into measuring and diagnosing latency issues. The conversation covers critical trade-offs between speed, accuracy, and cost in AI models. Sahaj also introduces optimization techniques like quantization and distillation, stressing the importance of low latency for user engagement in interactive apps.
AI Snips
Cascade Models To Meet Different Latency Needs
- Recommendation systems cascade models with different latency budgets.
- Serve results from a fast candidate-retrieval stage first, and run the expensive ranking model only for later pages.
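The cascade idea can be sketched as a two-stage pipeline: a cheap pass shortlists candidates from the whole catalog, and an expensive scorer only ever runs on that shortlist. This is a minimal illustration, not the method discussed in the episode; the catalog layout, the `quality` field, and `expensive_score` are all hypothetical stand-ins.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, catalog, k=3):
    """Fast first stage: shortlist candidates with a cheap dot product."""
    return sorted(catalog, key=lambda item: dot(query, item["vec"]), reverse=True)[:k]

def expensive_score(query, item):
    # Placeholder for a heavyweight model call (e.g., a GPU-bound cross-encoder).
    return dot(query, item["vec"]) + item.get("quality", 0.0)

def rerank(query, shortlist):
    """Slow second stage: runs only on the shortlist, not the full catalog."""
    return sorted(shortlist, key=lambda item: expensive_score(query, item), reverse=True)

catalog = [
    {"id": "a", "vec": [1.0, 0.0], "quality": 0.1},
    {"id": "b", "vec": [0.9, 0.1], "quality": 0.5},
    {"id": "c", "vec": [0.0, 1.0], "quality": 0.9},
    {"id": "d", "vec": [0.8, 0.2], "quality": 0.0},
]
shortlist = retrieve([1.0, 0.0], catalog, k=2)  # cheap pass over everything
page_one = rerank([1.0, 0.0], shortlist)        # expensive pass over 2 items only
```

The latency win comes from the asymmetry: the cheap stage touches every item, while the costly one touches a constant-size slice.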
Autoregression Makes AI Sequential And Costly
- Autoregressive generation creates a sequential dependency: each token depends on the ones before it.
- AI workloads often require GPUs and have many sequential decode steps, unlike typical web requests.
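The sequential dependency above can be made concrete with a toy decode loop. The `next_token` function here is a hypothetical stand-in for one full model forward pass (typically a GPU call); the point is only that iteration t cannot begin until iteration t-1 has produced its token.

```python
def next_token(context):
    # Stand-in for one model forward pass; real workloads pay GPU time here.
    return (sum(context) + 1) % 10

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        # Each iteration reads the token the previous iteration appended,
        # so the steps cannot run in parallel.
        tokens.append(next_token(tokens))
    return tokens

out = generate([1, 2], steps=3)
```

A typical web request does one round trip of work; a generation request does `steps` dependent rounds, which is why decode latency grows with output length.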
Choose Between Immediate Serve Or Batching
- Balance latency against throughput when choosing a request-processing strategy.
- Either serve requests immediately for low latency or batch them to increase GPU throughput and lower cost.
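A back-of-envelope cost model shows why batching helps throughput, under the assumption (mine, not the episode's) that each model invocation has a fixed overhead plus a small per-item cost:

```python
CALL_OVERHEAD = 1.0   # fixed cost per model invocation (arbitrary units)
PER_ITEM_COST = 0.1   # marginal cost per request within a call

def cost_immediate(n_requests):
    """One model call per request: lowest latency, overhead paid every time."""
    return n_requests * (CALL_OVERHEAD + PER_ITEM_COST)

def cost_batched(n_requests, batch_size):
    """Group requests into batches: overhead amortized, but requests
    wait in a queue until their batch fills, adding latency."""
    full_calls, remainder = divmod(n_requests, batch_size)
    calls = full_calls + (1 if remainder else 0)
    return calls * CALL_OVERHEAD + n_requests * PER_ITEM_COST

immediate = cost_immediate(32)      # 32 calls, overhead paid 32 times
batched = cost_batched(32, 8)       # 4 calls, overhead paid 4 times
```

With these numbers, batching cuts total GPU cost from 35.2 to 7.2 units; the price is that the last request in each batch of 8 waits for the other 7 to arrive.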
