Data Science at Home

There Is No AI. There's a Stateless Function on 10,000 GPUs Pretending to Know You (Ep. 299)

Mar 3, 2026
A deep dive into how large language models are served at scale, including model weight sizes, GPU setups, and context window limits. The conversation covers trade-offs between latency, throughput, memory, and cost. Learn about model parallelism, KV cache mechanics, continuous batching, prompt caching, and practical patterns for short, mid, and long-term memory.
ADVICE

Use KV Cache And Continuous Batching

  • Use a KV cache to avoid recomputing attention keys and values, and implement continuous batching to keep GPUs busy and reduce per-user latency.
  • Continuous batching dynamically inserts and removes requests so that finished slots are refilled immediately, maximizing GPU utilization.
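The refill idea above can be shown with a toy scheduler. This is a minimal sketch, not any serving framework's actual code: each hypothetical request just needs a fixed number of decode steps, and a freed slot is taken by the next queued request at the very next step instead of waiting for the whole batch to finish.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Simulate continuous batching: `requests` maps a request id to
    the number of tokens it still needs; `max_batch` is the number of
    concurrent slots on the (imaginary) GPU."""
    queue = deque(requests.items())
    active = {}           # request id -> tokens left to generate
    completed_order = []
    steps = 0
    while queue or active:
        # Refill free slots as soon as they open up.
        while queue and len(active) < max_batch:
            rid, n_tokens = queue.popleft()
            active[rid] = n_tokens
        # One decode step emits one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed_order.append(rid)
        steps += 1
    return completed_order, steps
```

With two slots and 13 total tokens across five requests, the simulation finishes in 7 steps, the minimum possible, because no slot ever idles while work is queued.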
INSIGHT

KV Cache Is Just Virtual Memory Pages

  • The KV cache can be managed like paged virtual memory; these are established OS-style techniques, not novel inventions.
  • Francesco emphasizes that these engineering patterns have existed for decades and are adaptations, not magic.
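The virtual-memory analogy can be made concrete with a toy block-table allocator. This is a sketch of the bookkeeping only (no tensors, no attention), with invented names; real paged-attention implementations manage GPU memory this way but with far more machinery.

```python
class PagedKVCache:
    """Toy page table for a KV cache: each sequence maps its token
    positions onto fixed-size physical blocks allocated on demand,
    like virtual-memory pages."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq id -> list of physical block ids
        self.seq_lens = {}      # seq id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token; a new block is
        grabbed only when the last one is full, so memory is never
        over-reserved for a sequence's maximum length."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # last block full, or no blocks yet
            if not self.free_blocks:
                raise MemoryError("KV cache out of blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

As with OS paging, the payoff is that fragmentation is bounded by one partially filled block per sequence, and a finished request's memory is immediately reusable by others.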
INSIGHT

The Stateless Illusion Of Memory

  • LLMs are stateless; the perceived memory comes from the client and backend reconstructing and resending the conversation context on each turn.
  • The browser stores the history locally, the application backend persists it, and the inference API receives the rebuilt prompt on every request.
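The reconstruction loop above can be sketched in a few lines. The chat-style message format and the `generate` callable standing in for the inference API are assumptions for illustration, not any specific provider's interface; the point is that the model receives the entire rebuilt conversation every single turn.

```python
def build_prompt(history, user_message, system="You are a helpful assistant."):
    """Rebuild the full prompt the stateless model sees on every turn:
    system message, stored history, then the new user message."""
    messages = [{"role": "system", "content": system}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages

def chat_turn(store, session_id, user_message, generate):
    """One turn: load persisted history, resend everything through
    `generate` (a stand-in for the inference API), persist the reply."""
    history = store.setdefault(session_id, [])
    reply = generate(build_prompt(history, user_message))
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": reply})
    return reply
```

Nothing about the model changes between calls; the "memory" lives entirely in `store`, which is why the prompt (and cost) grows with every turn until the history is truncated or summarized.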