
The Data Exchange with Ben Lorica: Breaking the Memory Wall in the Age of Inference
Feb 12, 2026
Sid Sheth, founder and CEO of D-Matrix, builds memory-centric AI inference hardware optimized for low-latency reasoning. He discusses SRAM-first accelerator designs, why HBM favors training rather than inference, digital in-memory compute to cut data movement, and the trade-offs between latency and throughput. The conversation also covers practical deployment, software porting, and future multimodal and agentic inference trends.
Betting Early On Cloud Inference
- D-Matrix bet on cloud inference in 2019, before ChatGPT made inference widely visible.
- The team chose SRAM-first and aimed to pack 10x more SRAM capacity than competitors.
Inference Is A Memory Problem
- Inference workloads are increasingly memory-bound as model sizes and KV caches explode (see the sizing sketch after this list).
- D-Matrix focused on integrating memory and compute to break through the memory wall for inference.
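
A quick way to see the memory problem is to size the KV cache for a long-context deployment. Below is a minimal Python sketch assuming a hypothetical 70B-class model with grouped-query attention and fp16 values; the dimensions and batch size are illustrative assumptions, not figures from the episode.

```python
# Back-of-envelope KV-cache sizing for autoregressive decoding.
# Model shape is an assumption (roughly a 70B-class model with
# grouped-query attention), not a number quoted in the episode.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Each layer caches one K and one V vector per KV head for every token."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
    return per_token * seq_len * batch

cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=32_768, batch=16)
print(f"KV cache: {cache / 1e9:.0f} GB")  # ~172 GB for 16 concurrent 32k-token sequences
```

Even before counting weights, a modest batch of long-context requests can demand more memory than a single accelerator carries, which is why the cache (and where it lives) dominates inference economics.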
Bring Memory Next To Compute
- Put memory as close to compute as possible to serve larger models and reduce latency (a rough latency sketch follows this list).
- Pack more SRAM near compute as a first practical step for low-latency cloud inference.
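
Why does proximity help latency? During autoregressive decode, each new token streams the model weights (and the KV cache) out of memory, so per-token latency is roughly bytes moved divided by memory bandwidth. The sketch below uses illustrative bandwidth assumptions for contrast, not measured or vendor-quoted figures.

```python
# Rough decode-latency model: generating each token streams the full weight
# set (plus KV cache) from memory, so per-token time ≈ bytes moved / bandwidth.
# Bandwidth figures are illustrative assumptions, not vendor specs; batching,
# parallelism, and compute time are ignored.

def tokens_per_second(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

weights = 70e9 * 2   # 70B parameters in fp16
kv = 10.7e9          # one 32k-token sequence from the sketch above

for name, bw in [("HBM-class (~3 TB/s)", 3e12), ("on-chip SRAM-class (~30 TB/s)", 30e12)]:
    print(f"{name}: ~{tokens_per_second(weights, kv, bw):.0f} tokens/s per sequence")
```

Under these assumptions the bandwidth of the memory tier, not compute, sets per-token latency, which is the motivation for packing SRAM right next to the compute.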
