Stratechery

The Inference Shift

May 11, 2026
Discussion of why AI is driving a semiconductor surge and how GPUs became central to modern models. Exploration of training-scale needs like HBM and chip-to-chip networking. A breakdown of inference stages and the memory-versus-speed tradeoffs for agentic versus answer-style workloads. A look at wafer-scale SRAM designs, disaggregated memory trends, and how different architectures reshape market opportunities.
AI Snips
INSIGHT

Why GPUs Dominated Early AI Compute

  • GPUs became central to AI because programmable graphics processors and CUDA mapped the parallel structure of graphics work onto the parallel calculations in models (see the sketch after this list).
  • NVIDIA paired HBM memory and chip networking to scale models across many GPUs for training and large-model inference.
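A minimal NumPy sketch of that parallelism, illustrative only and not from the episode: each output element of a matrix-vector product is independent work, which is exactly the kind of computation a GPU fans out across thousands of threads.

```python
import numpy as np

# Each output element of a matrix-vector product is an independent dot
# product, so every row can be computed at the same time; this is the
# data-parallel shape of work a GPU spreads across thousands of threads.
def matvec_rowwise(W, x):
    out = np.empty(W.shape[0], dtype=W.dtype)
    for i in range(W.shape[0]):       # each iteration is independent of the others
        out[i] = np.dot(W[i], x)      # one "thread's" worth of work
    return out

W = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
assert np.allclose(matvec_rowwise(W, x), W @ x, rtol=1e-3)
```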
INSIGHT

Inference Is Serial And Memory Bandwidth Bound

  • Inference splits into a pre-fill stage and a token-by-token decode stage; decode is serial and memory-bandwidth bound.
  • For each generated token, the model weights and the KV cache must be read in full, so memory bandwidth and capacity drive latency and throughput (see the sketch below).
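A back-of-the-envelope sketch of why decode is bandwidth-bound. The model size, KV-cache size, and bandwidth figure below are illustrative assumptions (roughly 70 GB of weights and 10 GB of KV cache on an H100-class 3.35 TB/s part), not numbers from the episode.

```python
# Rough per-request decode ceiling: every generated token must stream the
# full model weights plus the KV cache through memory, so memory bandwidth
# divides directly into tokens per second.
def decode_tokens_per_sec(weight_bytes, kv_cache_bytes, mem_bw_bytes_per_sec):
    bytes_read_per_token = weight_bytes + kv_cache_bytes
    return mem_bw_bytes_per_sec / bytes_read_per_token

# Illustrative assumption: ~70 GB of weights, ~10 GB of KV cache,
# 3.35 TB/s of HBM bandwidth (H100-class).
print(round(decode_tokens_per_sec(70e9, 10e9, 3.35e12)))  # ~42 tokens/s ceiling
```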
ANECDOTE

Cerebras Wafer-Scale Chip And WSE-3 Specs

  • Cerebras uses a wafer-scale approach that wires across reticle scribe lines to treat the whole wafer as a single chip.
  • The WSE-3 offers 44 GB of on-chip SRAM at 21 PB/s of bandwidth, versus the H100's 80 GB of HBM at 3.35 TB/s, yielding a massive on-chip bandwidth advantage (unit check below).
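Simple unit arithmetic on the two bandwidth figures quoted above, just to show the size of the gap:

```python
# Bandwidth figures as quoted in the snip.
WSE3_SRAM_BW = 21e15      # 21 PB/s of on-wafer SRAM bandwidth
H100_HBM_BW = 3.35e12     # 3.35 TB/s of HBM bandwidth

ratio = WSE3_SRAM_BW / H100_HBM_BW
print(f"~{ratio:,.0f}x more bandwidth")   # roughly 6,000x
# The catch: the model (or its shard) has to fit in 44 GB of SRAM.
```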