Tech Talks Daily

d-Matrix - Ultra-low Latency Batched Inference for Gen AI

Mar 7, 2026
Satyam Srivastava is an electrical engineer and d-Matrix co-founder who built energy-efficient ML systems after stints at NVIDIA and Intel. He discusses why inference, not training, is becoming the real scaling challenge. He explains the memory, data-movement, and power bottlenecks behind that shift, why general-purpose GPUs struggle with it, and how purpose-built, efficiency-first hardware and 3D memory designs can change data center economics.
INSIGHT

Inference Is The Real Productization Bottleneck

  • Inference, not training, will define AI's real-world success because models must serve millions of daily interactions.
  • Satyam explains that serving those interactions strains memory, energy, and data movement far more than training the models did.
ADVICE

Plan Data Centers For Efficiency Not Raw Power

  • Plan capacity around efficiency rather than brute-force scaling to avoid unsustainable power and cooling demands.
  • Satyam urges custom designs that deliver the work of 10 GPUs within existing power and cooling limits.
INSIGHT

Low GPU Utilization Reveals Hidden Cost Problem

  • Low utilization of expensive general-purpose hardware signals a fundamental ROI problem for inference workloads.
  • Satyam notes top-end hardware often runs at low single-digit utilization, wasting capital and infrastructure.