OpenAI's partnership with Cerebras and Nvidia's announcement of context memory storage raise a fundamental question: as agentic AI demands long sessions with massive context windows, can SRAM-based accelerators designed before the LLM era keep up, or will they converge with GPUs?

Key Takeaways
1. Context is the new bottleneck. As agentic workloads demand long sessions with massive codebases, storing and retrieving KV cache efficiently becomes critical.
2. There's no one-size-fits-all. Sachin Khatti (OpenAI, ex-Intel) signals a shift toward heterogeneous compute: matching specific accelerators to specific workloads.
3. Cerebras has 44GB of SRAM per wafer — orders of magnitude more than typical chips — but the question remains: where does the KV cache go for long context?
4. Pre-GPT accelerators may converge toward GPUs. If they need to add HBM or external memory for long context, some of their differentiation erodes.
5. Post-GPT accelerators (Etched, MatX) are the ones to watch. Designed specifically for transformer inference, they may solve the KV cache problem from first principles.
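
To put rough numbers on takeaways 3 and 4, here is a back-of-the-envelope sketch in Python. Everything except the 44GB-per-wafer figure is an assumption for illustration: a Llama-70B-style model with 80 layers, 8 key/value heads of dimension 128, 16-bit weights and cache values, and "GB" read as 10^9 bytes. It is not a description of how Cerebras actually lays out memory.

```python
import math

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: one key and one value vector
    per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama-70B-style configuration (illustrative, not from the episode).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
PARAMS = 70e9
BYTES_PER_WEIGHT = 2          # assumes 16-bit weights

# 44 GB of SRAM per wafer, as quoted in the episode (read here as 44e9 bytes).
SRAM_PER_WAFER = 44e9

# Wafers needed just to hold the weights in SRAM.
weight_bytes = PARAMS * BYTES_PER_WEIGHT
wafers_for_weights = math.ceil(weight_bytes / SRAM_PER_WAFER)

# SRAM left over for KV cache once the weights are resident.
spare_sram = wafers_for_weights * SRAM_PER_WAFER - weight_bytes

# KV cache for a single token and for a single 128K-token session.
per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
cache_128k = kv_cache_bytes(128 * 1024, LAYERS, KV_HEADS, HEAD_DIM)

print(f"wafers for weights:  {wafers_for_weights}")
print(f"spare SRAM:          {spare_sram / 1e9:.1f} GB")
print(f"KV cache per token:  {per_token / 1024:.0f} KiB")
print(f"KV cache at 128K:    {cache_128k / 2**30:.0f} GiB")
print(f"fits in spare SRAM:  {cache_128k <= spare_sram}")
```

Under these assumptions the weights alone need four wafers, which matches the "Llama 70B on four wafers" chapter, and a single 128K-token session needs about 40 GiB of KV cache, more than the SRAM left over after the weights. Different precisions, model shapes, or batch sizes would change the numbers, but the sketch shows why the question of where the KV cache goes matters for SRAM-only designs.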

Chapters
- 00:00 — Intro
- 01:20 — What is context memory storage?
- 03:30 — When Claude runs out of context
- 06:00 — Tokens, attention, and the KV cache explained
- 09:07 — The AI memory hierarchy: HBM → DRAM → SSD → network storage
- 12:53 — Nvidia's G1/G2/G3 tiers and the missing G0 (SRAM)
- 14:35 — Bluefield DPUs and GPU Direct Storage
- 15:53 — Token economics: cache hits vs misses
- 20:03 — OpenAI + Cerebras: 750 megawatts for faster Codex
- 21:29 — Why Cerebras built a wafer-scale engine
- 25:07 — 44GB SRAM and running Llama 70B on four wafers
- 25:55 — Sachin Khatti on heterogeneous compute strategy
- 31:43 — The big question: where does Cerebras store KV cache?
- 34:11 — If SRAM offloads to HBM, does it lose its edge?
- 35:40 — Pre-GPT vs Post-GPT accelerators
- 36:51 — Etched raises $500M at $5B valuation
- 38:48 — Wrap up

Can Pre-GPT AI Accelerators Handle Long Context Workloads?

Semi Doped
