MLOps.community

Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable

Feb 19, 2026
Igor Šušić, founding ML engineer focused on large-scale inference and performance tuning, and Ioana Apetrei, senior product manager building accessible, cost-effective LLM deployment, debate why deployments fail at scale. They cover model routing and the cost-versus-accuracy trade-off, time-sharing GPUs, quantization, prefill/decode separation, and when self-hosting or managed endpoints make sense.
INSIGHT

Self-Hosting Is Harder Than It Looks

  • Self-hosting is more complex than expected due to quotas, capacity constraints, and expensive idle GPUs.
  • Proper orchestration, autoscaling, and domain knowledge can yield 2-3x effective-capacity gains.
ANECDOTE

Timeshare Setup Cut Costs For A Customer

  • One customer migrated from AWS SageMaker to a timeshare setup and saved 40% with equal or better latency.
  • The cost savings let that team expand faster.
ADVICE

Profile First, Exotic Tricks Later

  • Profile and benchmark your workload before chasing exotic optimizations.
  • Continuous batching, quantization, and bottleneck analysis will solve most performance problems.
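The "profile first" advice can be sketched as a minimal latency benchmark harness. This is an illustrative example, not from the episode: `fake_generate` is a hypothetical stand-in for your real model or endpoint client, and the percentile math assumes a simple sorted-list estimate.

```python
import statistics
import time


def benchmark(fn, prompts, warmup=2):
    """Measure per-request latency for an inference callable.

    `fn` is a placeholder for your model's generate call; swap in
    a real client before trusting the numbers.
    """
    # Warm up caches / connection pools before timing.
    for p in prompts[:warmup]:
        fn(p)

    latencies = []
    for p in prompts:
        start = time.perf_counter()
        fn(p)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "throughput_rps": len(latencies) / sum(latencies),
    }


# Hypothetical stand-in for a real model call.
def fake_generate(prompt):
    time.sleep(0.01)  # simulate ~10 ms of inference work
    return prompt.upper()


stats = benchmark(fake_generate, ["hello"] * 20)
print(stats)
```

Numbers like these (p50, p95, sequential throughput) establish the baseline against which continuous batching or quantization changes should be judged.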