Stratechery

The Inference Shift

May 11, 2026
Discussion of why AI is driving a semiconductor surge and how GPUs became central to modern models. Exploration of training-scale needs like HBM and chip-to-chip networking. A breakdown of inference stages and the memory-versus-speed tradeoffs for agentic versus answer-style workloads. A look at wafer-scale SRAM designs, disaggregated memory trends, and how different architectures reshape market opportunities.
AI Snips
INSIGHT

Why GPUs Dominated Early AI Compute

  • GPUs became central to AI because programmable graphics processors and CUDA mapped the parallel structure of graphics work onto the parallel calculations in models (see the sketch after this list).
  • NVIDIA paired HBM memory and chip networking to scale models across many GPUs for training and large-model inference.
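A minimal NumPy sketch of that parallelism, illustrative only and not from the episode: each output element of a matrix-vector product is independent work, which is exactly the kind of computation a GPU fans out across thousands of threads.

```python
import numpy as np

# Each output element of a matrix-vector product is an independent dot
# product, so every row can be computed at the same time; this is the
# data-parallel shape of work a GPU spreads across thousands of threads.
def matvec_rowwise(W, x):
    out = np.empty(W.shape[0], dtype=W.dtype)
    for i in range(W.shape[0]):       # each iteration is independent of the others
        out[i] = np.dot(W[i], x)      # one "thread's" worth of work
    return out

W = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
assert np.allclose(matvec_rowwise(W, x), W @ x, rtol=1e-3)
```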
INSIGHT

Inference Is Serial And Memory Bandwidth Bound

  • Inference splits into a pre-fill stage and a token-by-token decode stage; decode is serial and memory-bandwidth bound.
  • For each generated token, the model weights and the KV cache must be read in full, so memory bandwidth and capacity drive latency and throughput (see the sketch below).
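A back-of-the-envelope sketch of why decode is bandwidth-bound. The model size, KV-cache size, and bandwidth figure below are illustrative assumptions (roughly 70 GB of weights and 10 GB of KV cache on an H100-class 3.35 TB/s part), not numbers from the episode.

```python
# Rough per-request decode ceiling: every generated token must stream the
# full model weights plus the KV cache through memory, so memory bandwidth
# divides directly into tokens per second.
def decode_tokens_per_sec(weight_bytes, kv_cache_bytes, mem_bw_bytes_per_sec):
    bytes_read_per_token = weight_bytes + kv_cache_bytes
    return mem_bw_bytes_per_sec / bytes_read_per_token

# Illustrative assumption: ~70 GB of weights, ~10 GB of KV cache,
# 3.35 TB/s of HBM bandwidth (H100-class).
print(round(decode_tokens_per_sec(70e9, 10e9, 3.35e12)))  # ~42 tokens/s ceiling
```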
ANECDOTE

Cerebras Wafer-Scale Chip And WSE-3 Specs

  • Cerebras uses a wafer-scale approach that wires across reticle scribe lines to treat the whole wafer as a single chip.
  • The WSE-3 offers 44 GB of on-chip SRAM at 21 PB/s of bandwidth, versus the H100's 80 GB of HBM at 3.35 TB/s, yielding a massive on-chip bandwidth advantage (unit check below).
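Simple unit arithmetic on the two bandwidth figures quoted above, just to show the size of the gap:

```python
# Bandwidth figures as quoted in the snip.
WSE3_SRAM_BW = 21e15      # 21 PB/s of on-wafer SRAM bandwidth
H100_HBM_BW = 3.35e12     # 3.35 TB/s of HBM bandwidth

ratio = WSE3_SRAM_BW / H100_HBM_BW
print(f"~{ratio:,.0f}x more bandwidth")   # roughly 6,000x
# The catch: the model (or its shard) has to fit in 44 GB of SRAM.
```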