
Data Science at Home: There Is No AI. There's a Stateless Function on 10,000 GPUs Pretending to Know You (Ep. 299)
Mar 3, 2026
A deep dive into how large language models are served at scale, including model weight sizes, GPU setups, and context window limits. The conversation covers trade-offs between latency, throughput, memory, and cost. Learn about model parallelism, KV cache mechanics, continuous batching, prompt caching, and practical patterns for short, mid, and long-term memory.
Use KV Cache And Continuous Batching
- Use KV cache to avoid recomputing attention keys/values and implement continuous batching to keep GPUs busy and reduce per-user latency.
- Continuous batching dynamically inserts/removes requests so finished slots are immediately refilled, maximizing GPU utilization.
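The refill behavior described above can be sketched as a toy scheduler. This is an illustrative sketch, not a real serving implementation: each request is reduced to a count of decode steps remaining, and the function names (`continuous_batching`) and request tuples are hypothetical.

```python
from collections import deque

def continuous_batching(requests, batch_size):
    """Toy scheduler: each request is (id, decode steps remaining).
    Finished slots are refilled immediately from the queue, so the
    batch stays full instead of waiting for the slowest request."""
    queue = deque(requests)
    active = {}                      # request_id -> steps remaining
    completed = []
    steps = 0
    while queue or active:
        # Refill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step advances every active request by one token.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:     # finished: slot freed this step
                completed.append(rid)
                del active[rid]
    return completed, steps

# One long request plus three short ones, two slots:
done, steps = continuous_batching(
    [("a", 4), ("b", 1), ("c", 1), ("d", 1)], batch_size=2)
# Continuous batching finishes in 4 steps; static batching
# ([a, b] then [c, d]) would need 5, since b's slot sits idle.
```

The short requests slip through the slot that `b` vacates while the long request `a` is still decoding, which is exactly how per-user latency drops without sacrificing utilization.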
KV Cache Is Just Virtual Memory Pages
- KV cache can be treated like virtual memory with pages; these are established OS-style solutions, not novel inventions.
- Francesco emphasizes that these engineering patterns have existed for decades; they are adaptations of operating-system techniques, not magic.
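The OS analogy can be made concrete with a minimal paged allocator, in the spirit of paged attention. This is a sketch under assumptions: class and method names (`PagedKVCache`, `append_token`) are invented for illustration, and each "page" is just an integer block id rather than real GPU memory.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator in the style of OS virtual memory:
    each sequence gets fixed-size blocks ('pages') on demand, and a
    block table maps the sequence's tokens to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token's keys/values; a new page
        is allocated only when the current one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # page full, or first token
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("req-1")   # 20 tokens -> 2 pages of 16
```

Because pages are fixed-size and recycled, memory fragments far less than with per-sequence contiguous buffers, which is the decades-old OS insight being reused.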
The Stateless Illusion Of Memory
- LLMs are stateless; perceived memory comes from client/backend reconstructing and resending conversation context each turn.
- The browser stores the history locally and the application backend persists it; the inference API then receives the rebuilt prompt on every request.
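The reconstruct-and-resend loop above can be sketched in a few lines. This is a minimal illustration, assuming a plain-text prompt format and a hypothetical `generate` callable standing in for the stateless inference API; no real provider API is shown.

```python
def build_prompt(history, user_message):
    """The 'memory' is just the full transcript resent every turn:
    the model sees one long prompt, not persistent server state."""
    lines = [f"{role}: {text}" for role, text in history]
    lines.append(f"user: {user_message}")
    return "\n".join(lines)

def chat_turn(history, user_message, generate):
    """One turn: rebuild the context, call the stateless model
    function, then persist the new exchange on the client/backend."""
    prompt = build_prompt(history, user_message)
    reply = generate(prompt)          # stateless call: nothing is remembered
    history.append(("user", user_message))
    history.append(("assistant", reply))
    return reply

# Fake stateless 'model' that just reports how much context it was sent:
fake_model = lambda prompt: f"saw {prompt.count(chr(10)) + 1} lines"

history = []
first = chat_turn(history, "hi", fake_model)    # prompt is 1 line
second = chat_turn(history, "bye", fake_model)  # prompt is now 3 lines
```

Each call sees a longer prompt only because the client appended to `history` and resent it, which is the whole illusion of memory: drop the history and the "AI" forgets you instantly.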
