
Complex Systems with Patrick McKenzie (patio11) Inference engineering and the real-world deployment of LLMs, with Philip Kiely
Mar 12, 2026 Philip Kiely, author and inference engineering practitioner who helped build Baseten, walks through the inference stack and what engineers actually build to deploy LLMs in the real world. He explores model-size tradeoffs, agentic workflows that multiply inference calls, routing between local and SOTA models, and practical harnesses for testing and scaling inference.
AI Snips
Inference Is The Runtime That Ships Model Weights
- Inference is the runtime layer that turns model weights into usable APIs for businesses.
- Philip Kiely frames it as the engineering work (prompts, harnesses) that makes a giant weight file into production value.
Stable Diffusion Shocked Engineers With 2GB Compression
- Philip Kiely recounts the shock of Stable Diffusion's roughly 2 GB model: a mind-blowing compression of visual data.
- He contrasts that with modern trillion‑parameter language models that can be hundreds of gigabytes on disk.
Agents Turn Single Queries Into Many Inference Calls
- Agentic systems change inference from one request → one reply to one request → many actions and decisions.
- Function calling, or listing available tools, lets LLMs choose structured actions, which multiplies inference volume and enables new real-world behaviors.
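The tool-calling pattern described above can be sketched in a few lines: the model emits a structured action (a tool name plus JSON arguments) rather than plain text, and the harness dispatches it to a registered function. This is a minimal illustrative sketch, not Baseten's or any specific provider's API; the tool name, registry, and `dispatch` helper are all hypothetical.

```python
import json

# Hypothetical registry of tools the model is allowed to call.
TOOLS = {}

def tool(fn):
    """Register a function so the model can select it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # Stub standing in for a real external API call.
    return f"Sunny in {city}"

def dispatch(tool_call: str) -> str:
    """Execute one structured tool call emitted by the model."""
    call = json.loads(tool_call)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# An agent loop may issue several such calls per user request,
# which is why agentic workflows multiply inference volume.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Tokyo"}}')
```

Each round of "model picks a tool, harness runs it, result goes back into the context" is another inference call, which is how one user query fans out into many.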