
Stratechery: The Inference Shift
May 11, 2026

Discussion of why AI is driving a semiconductor surge and how GPUs became central to modern models. Exploration of training-scale needs like HBM and chip-to-chip networking. A breakdown of inference stages and the memory-versus-speed tradeoffs for agentic versus answer-style workloads. A look at wafer-scale SRAM designs, disaggregated memory trends, and how different architectures reshape market opportunities.
AI Snips
Why GPUs Dominated Early AI Compute
- GPUs became central to AI because programmable graphics processors and CUDA let hardware built for parallel graphics work run the parallel calculations in modern models.
- NVIDIA paired HBM and chip-to-chip networking to scale models across many GPUs for training and large-model inference.
Inference Is Serial And Memory Bandwidth Bound
- Inference splits into pre-fill and decode, two steps that alternate; decode generates one token at a time, making it serial and memory-bandwidth bound.
- For each generated token, the full model weights and the KV cache must be read, so memory bandwidth and capacity drive latency and throughput (see the back-of-envelope sketch below).
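A minimal back-of-envelope sketch of that bandwidth bound. The numbers are illustrative assumptions, not figures from the episode: a 70B-parameter model in 8-bit weights, a roughly 10 GB KV cache, and H100-class HBM bandwidth.

```python
# Back-of-envelope sketch (numbers are assumptions, not from the episode):
# decode is bandwidth bound because every token re-reads the weights and KV cache.

weights_bytes = 70e9          # assumed 70B-parameter model in 8-bit weights (~70 GB)
kv_cache_bytes = 10e9         # assumed KV cache size for a long context (~10 GB)
hbm_bandwidth = 3.35e12       # H100-class HBM bandwidth, bytes per second

bytes_per_token = weights_bytes + kv_cache_bytes
tokens_per_sec = hbm_bandwidth / bytes_per_token
print(f"bandwidth-bound ceiling: {tokens_per_sec:.0f} tokens/s per sequence")
# ~42 tokens/s: the ceiling is set by how fast memory can be read, not by FLOPs.
```

Under these assumptions the ceiling is around 42 tokens per second per sequence, which is why the discussion centers on memory bandwidth rather than raw compute.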
Cerebras Wafer-Scale Chip And WSE-3 Specs
- Cerebras uses a wafer-scale approach that wires across the reticle scribe lines to treat the whole wafer as a single chip.
- The WSE-3 offers 44 GB of on-chip SRAM at 21 PB/s of bandwidth, versus the H100's 80 GB of HBM at 3.35 TB/s, a massive on-chip bandwidth advantage (a rough comparison follows below).
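A rough sketch comparing the two bandwidth figures quoted above; the bandwidth and capacity numbers come from the snip, while the framing of the capacity tradeoff is an inference from the episode description's mention of disaggregated memory.

```python
# Rough comparison of the quoted bandwidth figures (WSE-3 SRAM vs H100 HBM).
# The capacity caveat matters: SRAM is far faster, but there is less of it per device.

h100_hbm_bw = 3.35e12    # bytes/s, 80 GB of HBM on an H100
wse3_sram_bw = 21e15     # bytes/s, 44 GB of on-chip SRAM on a WSE-3

print(f"raw bandwidth ratio: {wse3_sram_bw / h100_hbm_bw:,.0f}x")
# ~6,269x more bandwidth, but only 44 GB of capacity per wafer, so large models
# must be split across wafers or paired with external memory.
```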
