
MLOps.community Serving LLMs in Production: Performance, Cost & Scale // CAST AI Roundtable
Feb 19, 2026. Igor Šušić, founding ML engineer focused on large-scale inference and performance tuning, and Ioana Apetrei, senior product manager building accessible, cost-effective LLM deployment, debate why deployments fail at scale. They cover model routing and cost vs. accuracy trade-offs, and explain time-sharing GPUs, quantization, prefill vs. decode separation, and when self-hosting or managed endpoints make sense.
Self-Hosting Is Harder Than It Looks
- Self-hosting is more complex than expected due to quotas, capacity, and expensive idle GPUs.
- Proper orchestration, autoscaling and domain knowledge can yield 2-3x effective capacity gains.
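The 2-3x figure follows directly from raising average GPU utilization. A minimal arithmetic sketch, assuming illustrative utilization numbers (roughly 30% for dedicated idle-prone GPUs vs. roughly 70% with orchestration and time-sharing; these figures are assumptions, not from the episode):

```python
def effective_capacity_gain(baseline_util: float, orchestrated_util: float) -> float:
    """Capacity gain from raising average GPU utilization.

    Serving the same workload on GPUs that are busier on average means
    fewer GPUs (or more traffic) for the same spend.
    """
    return orchestrated_util / baseline_util

# Illustrative: 30% utilization lifted to 70% is about a 2.3x gain,
# in the 2-3x range mentioned above.
gain = effective_capacity_gain(0.30, 0.70)
print(f"{gain:.1f}x")
```

The same ratio also explains why idle GPUs dominate self-hosting cost: capacity you pay for but do not use is the denominator.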
Timeshare Setup Cut Costs For A Customer
- One customer migrated from AWS SageMaker to a timeshare setup and saved 40% with equal or better latency.
- The cost savings allowed that team to expand faster.
Profile First, Exotic Tricks Later
- Profile and benchmark your workload before chasing exotic optimizations.
- Continuous batching, quantization, and bottleneck analysis solve most performance problems.
