Complex Systems with Patrick McKenzie (patio11)

Inference engineering and the real-world deployment of LLMs, with Philip Kiely

Mar 12, 2026
Philip Kiely, author and inference engineering practitioner who helped build Baseten, walks through the inference stack and real-world LLM deployment, breaking down what inference engineers actually build. He explores model-size tradeoffs, agentic workflows that multiply inference calls, routing between local and SOTA models, and practical harnesses for testing and scaling inference.
INSIGHT

Inference Is The Runtime That Ships Model Weights

  • Inference is the runtime layer that turns model weights into usable APIs for businesses.
  • Philip Kiely frames it as the engineering work (prompts, harnesses) that transforms a giant weight file into production value.
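A minimal sketch of the "runtime" idea from this snip: a weight file alone does nothing until a serving layer loads it and exposes a callable API. All names and the class shape here are illustrative assumptions, not Baseten's or any framework's actual interface.

```python
# Hypothetical serving layer: turns a loaded weight file into a request/response API.
class InferenceServer:
    def __init__(self, weights):
        # Stand-in for loading a multi-gigabyte weight file into memory.
        self.weights = weights

    def predict(self, prompt):
        # A real runtime would tokenize, run a forward pass on accelerators,
        # and decode tokens; here we only echo to show the API shape
        # that business code actually calls.
        return {"prompt": prompt, "completion": f"[{len(self.weights)} weights loaded]"}

server = InferenceServer(weights=[0.1, 0.2, 0.3])
resp = server.predict("Hello")
```

The point of the sketch is the boundary: everything above `predict` is "model weights"; everything around it (loading, batching, prompts, harnesses) is the inference engineering the episode discusses.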
ANECDOTE

Stable Diffusion Shocked Engineers With 2GB Compression

  • Philip Kiely recounts the shock of Stable Diffusion's ~2GB model, a mind‑blowing compression of visual data.
  • He contrasts that with modern trillion‑parameter language models that can be hundreds of gigabytes on disk.
INSIGHT

Agents Turn Single Queries Into Many Inference Calls

  • Agentic systems change inference from one request → one reply to one request → many actions and decisions.
  • Function calling or listing tools lets LLMs choose structured actions, which blows up inference volumes and enables new real‑world behaviors.
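The loop described above can be sketched in a few lines: the model is called once per step, may request a tool, and the tool result is fed back in, so one user query fans out into several inference calls. Everything here (the tool registry, the `fake_llm` stub, the message format) is a hypothetical stand-in, not any provider's real function-calling API.

```python
import json

# Hypothetical tool registry; real systems describe these tools to the model
# via a schema so it can emit structured calls.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_f": 68},
}

def fake_llm(messages):
    """Stand-in for a model call. A real agent would send `messages` plus the
    tool schemas to an LLM and parse its structured tool-call response."""
    last = messages[-1]["content"]
    if "weather" in last and not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Tokyo"}}
    return {"answer": f"Done after {len(messages)} messages."}

def run_agent(user_query, max_steps=5):
    messages = [{"role": "user", "content": user_query}]
    calls = 0
    for _ in range(max_steps):
        decision = fake_llm(messages)  # one inference call per loop iteration
        calls += 1
        if "tool" in decision:
            # Execute the chosen tool and feed its result back to the model.
            result = TOOLS[decision["tool"]](**decision["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return decision["answer"], calls
    return None, calls

answer, calls = run_agent("What's the weather in Tokyo?")
# A single user request triggered 2 inference calls here; real agentic
# workflows can run dozens, which is the volume blow-up the snip describes.
```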