The Data Exchange with Ben Lorica

Breaking the Memory Wall in the Age of Inference

Feb 12, 2026
Sid Sheth, founder and CEO of D-Matrix, builds memory-centric AI inference hardware optimized for low-latency reasoning. He discusses SRAM-first accelerator designs, why HBM favors training over inference, digital in-memory compute as a way to cut data movement, and the trade-offs between latency and throughput. Practical deployment, software porting, and future multimodal/agentic inference trends are also covered.
ANECDOTE

Betting Early On Cloud Inference

  • D-Matrix bet on cloud inference in 2019, before ChatGPT made inference widely visible.
  • The team chose an SRAM-first design and aimed to pack 10x more SRAM capacity than competing accelerators.
INSIGHT

Inference Is A Memory Problem

  • Inference workloads are increasingly memory-bound as model sizes and KV caches explode; see the sizing sketch after this list.
  • D-Matrix focused on integrating memory and compute to collapse the memory wall for inference.
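
To make the KV-cache point concrete, here is a rough sizing sketch. The model shape (layers, KV heads, head dimension), context length, and batch size are illustrative assumptions, not figures quoted in the episode.

    # Rough KV-cache sizing: why long contexts and large batches make
    # inference memory-bound. Model shape is an illustrative assumption
    # (roughly a 70B-class model with grouped-query attention).

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
        """Bytes needed to hold keys and values for every token in context."""
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
        return per_token * seq_len * batch

    # 80 layers, 8 KV heads, head_dim 128, 32k-token context, 8 concurrent
    # requests, fp16 values.
    size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                          seq_len=32_768, batch=8)
    print(f"KV cache: {size / 2**30:.1f} GiB")  # 80.0 GiB, before counting weights

At that scale the cache alone rivals the model weights, which is the memory wall the episode title refers to.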
ADVICE

Bring Memory Next To Compute

  • Put memory as close to compute as possible to serve larger models and reduce latency (see the bandwidth sketch after this list).
  • Pack more SRAM near compute as a first practical step for low-latency cloud inference.
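
A back-of-the-envelope bandwidth model shows why. When a decode step is memory-bound, each generated token must stream the weights and KV cache past the compute units, so per-token latency is roughly bytes moved divided by memory bandwidth. The byte counts and bandwidth figures below are illustrative assumptions, not D-Matrix specifications.

    # Bandwidth-bound lower bound on per-token decode latency:
    # step time ~ bytes_moved / memory_bandwidth.
    # Byte counts and bandwidths are illustrative assumptions only.

    def decode_step_ms(model_bytes, kv_bytes, bandwidth_bytes_per_s):
        """Time per generated token when purely memory-bandwidth limited."""
        return (model_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3

    weights = 140e9  # ~70B parameters in fp16
    kv = 10e9        # KV cache for one long-context request

    for name, bw in [("HBM-class, ~3 TB/s", 3e12),
                     ("near-compute SRAM-class, ~30 TB/s", 30e12)]:
        print(f"{name}: {decode_step_ms(weights, kv, bw):.0f} ms/token")

Under these assumptions the same decode step runs an order of magnitude faster when the data sits next to the compute, which is the intuition behind packing SRAM close to the arithmetic units.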