Improving GPU utilization and time to first token

Alex describes engineering changes—KV cache tiering and DPU offload—that raise GPU utilization and reduce time to first token.

Episode notes

What does it really take to turn a massive AI infrastructure investment into actual business value?

In this episode, I'm joined by Alex Bouzari, founder and CEO of DDN, for a conversation that gets right to the heart of where AI infrastructure is heading next. There is a lot of noise in the market about faster chips, larger models, and bigger data centers, but Alex argues that the real story has changed. According to him, GPUs are no longer the main constraint. The true bottleneck now lies in the data layer, where data is moved, cached, served, and managed across increasingly complex AI environments.

That shift matters because many organizations are still thinking about AI in terms of hardware acquisition. Buy more GPUs, add more power, build more capacity. But as Alex explains, that mindset misses the bigger picture.

If your data architecture cannot keep pace, those expensive systems stall, efficiency drops, and the return on investment quickly becomes shaky. It was a timely discussion, especially as NVIDIA's Rubin platform points toward rack-scale AI factories where compute, networking, storage, and offload all need to work together as one operational system.

One part I found especially interesting was Alex's focus on measuring efficiency. He argued that the future winners in AI will not simply be the companies with the most hardware. They will be the ones who think like industrial operators, measuring cost per token, rack utilization, time-to-value, and power consumption per unit of intelligence output. That is a very different conversation from the hype cycle, and it is one that business leaders need to hear. AI value is no longer about showing that something can work. It is about proving that it can work predictably, securely, and economically at scale.

We also talked about DDN's collaboration with NVIDIA, the role of BlueField-4 DPUs, and why inference performance now depends on intelligent memory architecture and data movement just as much as raw compute. Alex shared how DDN is helping customers reach up to 99 percent GPU utilization and reduce time to first token for long context workloads. Those numbers are impressive on their own, but what matters most is what they represent—better throughput, lower waste, and AI systems that move from science project to production reality.

There is also an important leadership lesson running through this conversation. DDN has been profitable for over a decade, powers more than one million GPUs worldwide, and has built its business by staying close to real customer pain points. Alex speaks with the kind of clarity that comes from building through constraints rather than simply talking around them.

If AI factories are going to define the next phase of enterprise technology, how should leaders rethink infrastructure, efficiency, and value creation before they invest in the next wave? What do you think?
