Complex Systems with Patrick McKenzie (patio11)

Inference engineering and the real-world deployment of LLMs, with Philip Kiely

Mar 12, 2026
Philip Kiely, author and inference engineering practitioner who helped build Baseten, walks through the inference stack and real-world LLM deployment, breaking down what inference engineers actually build. He explores model-size tradeoffs, agentic workflows that multiply inference calls, routing between local and SOTA models, and practical harnesses for testing and scaling inference.
INSIGHT

Inference Is The Runtime That Ships Model Weights

  • Inference is the runtime layer that turns model weights into usable APIs for businesses.
  • Philip Kiely frames it as the engineering work (prompts, harnesses) that transforms a giant weight file into production value.
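A minimal sketch of the "runtime" idea from this snip: a weight file alone does nothing until a serving layer loads it and exposes a callable API. All names and the class shape here are illustrative assumptions, not Baseten's or any framework's actual interface.

```python
# Hypothetical serving layer: turns a loaded weight file into a request/response API.
class InferenceServer:
    def __init__(self, weights):
        # Stand-in for loading a multi-gigabyte weight file into memory.
        self.weights = weights

    def predict(self, prompt):
        # A real runtime would tokenize, run a forward pass on accelerators,
        # and decode tokens; here we only echo to show the API shape
        # that business code actually calls.
        return {"prompt": prompt, "completion": f"[{len(self.weights)} weights loaded]"}

server = InferenceServer(weights=[0.1, 0.2, 0.3])
resp = server.predict("Hello")
```

The point of the sketch is the boundary: everything above `predict` is "model weights"; everything around it (loading, batching, prompts, harnesses) is the inference engineering the episode discusses.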
ANECDOTE

Stable Diffusion Shocked Engineers With 2GB Compression

  • Philip Kiely recounts the shock of Stable Diffusion's ~2GB model, a mind‑blowing compression of visual data.
  • He contrasts that with modern trillion‑parameter language models that can be hundreds of gigabytes on disk.
INSIGHT

Agents Turn Single Queries Into Many Inference Calls

  • Agentic systems change inference from one request → one reply to one request → many actions and decisions.
  • Function calling or listing tools lets LLMs choose structured actions, which blows up inference volumes and enables new real‑world behaviors.
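The loop described above can be sketched in a few lines: the model is called once per step, may request a tool, and the tool result is fed back in, so one user query fans out into several inference calls. Everything here (the tool registry, the `fake_llm` stub, the message format) is a hypothetical stand-in, not any provider's real function-calling API.

```python
import json

# Hypothetical tool registry; real systems describe these tools to the model
# via a schema so it can emit structured calls.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_f": 68},
}

def fake_llm(messages):
    """Stand-in for a model call. A real agent would send `messages` plus the
    tool schemas to an LLM and parse its structured tool-call response."""
    last = messages[-1]["content"]
    if "weather" in last and not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Tokyo"}}
    return {"answer": f"Done after {len(messages)} messages."}

def run_agent(user_query, max_steps=5):
    messages = [{"role": "user", "content": user_query}]
    calls = 0
    for _ in range(max_steps):
        decision = fake_llm(messages)  # one inference call per loop iteration
        calls += 1
        if "tool" in decision:
            # Execute the chosen tool and feed its result back to the model.
            result = TOOLS[decision["tool"]](**decision["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return decision["answer"], calls
    return None, calls

answer, calls = run_agent("What's the weather in Tokyo?")
# A single user request triggered 2 inference calls here; real agentic
# workflows can run dozens, which is the volume blow-up the snip describes.
```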