
The Data Exchange with Ben Lorica: Breaking the Memory Wall in the Age of Inference
Feb 12, 2026
Sid Sheth, founder and CEO of D-Matrix, builds memory-centric AI inference hardware optimized for low-latency reasoning. He discusses SRAM-first accelerator designs, why HBM favors training rather than inference, digital in-memory compute to cut data movement, and the trade-offs between latency and throughput. The conversation also covers practical deployment, software porting, and future multimodal and agentic inference trends.
Betting Early On Cloud Inference
- D-Matrix bet on cloud inference in 2019, before ChatGPT made inference widely visible.
- The team chose SRAM-first and aimed to pack 10x more SRAM capacity than competitors.
Inference Is A Memory Problem
- Inference workloads are increasingly memory-bound as model sizes and KV caches explode (see the sizing sketch after this list).
- D-Matrix focused on integrating memory and compute to break through the memory wall for inference.
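
A quick way to see the memory problem is to size the KV cache for a long-context deployment. Below is a minimal Python sketch assuming a hypothetical 70B-class model with grouped-query attention and fp16 values; the dimensions and batch size are illustrative assumptions, not figures from the episode.

```python
# Back-of-envelope KV-cache sizing for autoregressive decoding.
# Model shape is an assumption (roughly a 70B-class model with
# grouped-query attention), not a number quoted in the episode.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Each layer caches one K and one V vector per KV head for every token."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
    return per_token * seq_len * batch

cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=32_768, batch=16)
print(f"KV cache: {cache / 1e9:.0f} GB")  # ~172 GB for 16 concurrent 32k-token sequences
```

Even before counting weights, a modest batch of long-context requests can demand more memory than a single accelerator carries, which is why the cache (and where it lives) dominates inference economics.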
Bring Memory Next To Compute
- Put memory as close to compute as possible to serve larger models and reduce latency (a rough latency sketch follows this list).
- Pack more SRAM near compute as a first practical step for low-latency cloud inference.
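
Why does proximity help latency? During autoregressive decode, each new token streams the model weights (and the KV cache) out of memory, so per-token latency is roughly bytes moved divided by memory bandwidth. The sketch below uses illustrative bandwidth assumptions for contrast, not measured or vendor-quoted figures.

```python
# Rough decode-latency model: generating each token streams the full weight
# set (plus KV cache) from memory, so per-token time ≈ bytes moved / bandwidth.
# Bandwidth figures are illustrative assumptions, not vendor specs; batching,
# parallelism, and compute time are ignored.

def tokens_per_second(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

weights = 70e9 * 2   # 70B parameters in fp16
kv = 10.7e9          # one 32k-token sequence from the sketch above

for name, bw in [("HBM-class (~3 TB/s)", 3e12), ("on-chip SRAM-class (~30 TB/s)", 30e12)]:
    print(f"{name}: ~{tokens_per_second(weights, kv, bw):.0f} tokens/s per sequence")
```

Under these assumptions the bandwidth of the memory tier, not compute, sets per-token latency, which is the motivation for packing SRAM right next to the compute.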
