MLOps.community

We Cut LLM Latency by 70% in Production



Choosing models by pre-fill vs decoding needs

Maher contrasts workloads dominated by pre-fill (prompt processing) with those dominated by decoding (token generation), and routes each request to the GPUs and models best suited to its profile.
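The idea above can be sketched as a simple router. This is an illustrative heuristic only, not the speaker's actual system: the `Request` fields, pool names, and threshold are assumptions. The intuition is that pre-fill is compute-bound (one batched pass over the whole prompt) while decode is memory-bandwidth-bound (one token per step), so the prompt-to-output ratio is a cheap proxy for which phase dominates.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int       # tokens processed in the pre-fill phase
    max_output_tokens: int   # expected/allowed tokens generated during decode

def route(req: Request, threshold: float = 4.0) -> str:
    """Classify a request as pre-fill- or decode-dominated.

    The threshold is a hypothetical tuning knob: a request whose prompt
    is at least `threshold` times longer than its expected output is
    treated as pre-fill-dominated.
    """
    ratio = req.prompt_tokens / max(req.max_output_tokens, 1)
    if ratio >= threshold:
        # Long prompt, short answer (e.g. summarization, classification):
        # favor compute-heavy GPUs and aggressive prompt batching.
        return "prefill_pool"
    # Short prompt, long generation (e.g. code or story generation):
    # favor GPUs with high memory bandwidth for the decode loop.
    return "decode_pool"

# Example: an 8k-token summarization prompt vs. a short chat prompt
# with a long generation budget land in different pools.
print(route(Request(prompt_tokens=8000, max_output_tokens=200)))
print(route(Request(prompt_tokens=50, max_output_tokens=1000)))
```

In a real deployment the routing signal would likely also include batch occupancy and KV-cache pressure; the ratio here is just the simplest version of the contrast Maher describes.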

Snippet starts at 19:51 in the episode.
