The a16z Show

Inferact: Building the Infrastructure That Runs Modern AI

Jan 22, 2026
Simon Mo and Woosuk Kwon, co-founders of Inferact, discuss their work on building a universal open-source inference layer for AI. They trace the evolution of vLLM from a UC Berkeley prototype to a key piece of AI infrastructure, and dig into the complexities of running large AI models, including scheduling and memory management. They emphasize the role of open source in driving diversity and interoperability, and envision a future where efficient inference is as foundational as operating systems.
ANECDOTE

A Demo Optimization Turned Research Project

  • Woosuk started by optimizing a demo of Meta's OPT model and discovered that serving LLMs was much harder than expected.
  • That side project grew into a research effort, a paper, and the open-source vLLM project.
INSIGHT

Language Models Demand Dynamic Runtimes

  • LLM inference is dynamic: inputs and outputs vary widely and unpredictably.
  • This dynamism forces new scheduling and memory-first designs unlike static ML workloads.
INSIGHT

Scheduling And KV Cache Are Core Challenges

  • Efficient LLM serving centers on two problems: scheduling and memory management of the KV cache.
  • Static batching no longer suffices; engines must schedule token-by-token steps across heterogeneous in-flight requests.
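The scheduling pattern described in this snip can be sketched in miniature. The code below is a simplified illustration, not vLLM's actual implementation: the `ContinuousBatcher` class, its block-based KV-cache bookkeeping, and all sizes are invented for this example. Each call to `step()` admits waiting requests whose prompt KV cache fits in free blocks, then advances every running request by one token, allocating a new cache block whenever a request crosses a block boundary.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    """A single in-flight generation request (hypothetical shape)."""
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    kv_blocks: list = field(default_factory=list)  # indices of KV-cache blocks held

class ContinuousBatcher:
    """Toy token-level scheduler: every step advances all active requests by one token."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.free_blocks = deque(range(num_blocks))  # pool of fixed-size KV-cache blocks
        self.block_size = block_size
        self.waiting = deque()
        self.running = []

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def _blocks_needed(self, tokens: int) -> int:
        return -(-tokens // self.block_size)  # ceiling division

    def step(self) -> list:
        # Admit waiting requests whose prompt-phase KV cache fits right now.
        while self.waiting:
            need = self._blocks_needed(self.waiting[0].prompt_len)
            if need > len(self.free_blocks):
                break
            req = self.waiting.popleft()
            req.kv_blocks = [self.free_blocks.popleft() for _ in range(need)]
            self.running.append(req)

        finished = []
        for req in self.running:
            total = req.prompt_len + req.generated
            # The next token may spill into a new block; allocate it lazily.
            if self._blocks_needed(total + 1) > len(req.kv_blocks):
                if not self.free_blocks:
                    continue  # a real engine would preempt or swap here
                req.kv_blocks.append(self.free_blocks.popleft())
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                finished.append(req)

        # Reclaim KV-cache blocks from completed requests.
        for req in finished:
            self.running.remove(req)
            self.free_blocks.extend(req.kv_blocks)
            req.kv_blocks = []
        return finished
```

The point of the sketch is the contrast with static batching: requests of different prompt lengths and output budgets join and leave the batch mid-flight, and memory pressure is handled per token rather than per batch.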