
Software Engineering Daily Open-Weight AI Models
Apr 28, 2026 Benny Chen, co-founder of Fireworks AI and former Meta ML infrastructure engineer, shares his work building platforms to serve and fine-tune open-weight models at scale. He discusses custom kernels, speculative decoding for faster code completion, multi-hardware support, reinforcement fine-tuning, and using production traces to evaluate real-world performance.
Custom Kernels Ensure Numeric Consistency and Multi-Hardware Support
- Fireworks built in-house FireAttention kernels to control numeric fidelity and to support multiple hardware vendors, including NVIDIA and AMD.
- They prioritize numeric alignment between training and inference kernels, since mismatched kernels can destabilize RL training.
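Why kernel choice affects numerics: floating-point addition is not associative, so two kernels that reduce in different orders can return different sums for the same inputs. The sketch below is purely illustrative (not Fireworks' kernels, and in float64 rather than the fp16/bf16 used on GPUs, where the effect is larger); it contrasts a sequential accumulation with the pairwise tree reduction a parallel kernel might use.

```python
def sequential_sum(xs):
    # Left-to-right accumulation: the order a single-threaded loop
    # (or one kernel implementation) might use.
    total = 0.0
    for x in xs:
        total += x
    return total

def tree_sum(xs):
    # Pairwise (tree) reduction: the order a parallel reduction
    # kernel might use. Same inputs, different rounding at each step.
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

# Values chosen so float64 rounding makes the two orders disagree.
vals = [1e16, 1.0, 1.0, 1.0, 1.0, -1e16]

print(sequential_sum(vals))  # 0.0 (the 1.0s are absorbed by 1e16)
print(tree_sum(vals))        # 2.0 (pairing lets two 1.0s survive)
```

If a training stack and an inference stack disagree like this on logits, an RL loop that scores inference outputs with training-side probabilities is effectively grading one model with another, which is one way instability creeps in.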
Speculative Decoding Speeds Interactive Coding Completions
- Speculative decoding uses a small speculator model to propose tokens the large model will accept, speeding interactive tasks.
- Fireworks trains and continuously updates speculators for customers' fine-tuned models and changing data distributions.
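The mechanic of speculative decoding can be sketched in a few lines. This is a toy greedy variant, not Fireworks' implementation: `large_next` and `small_next` are hypothetical stand-ins for an expensive target model and a cheap speculator, and the deterministic arithmetic rules exist only to make the example runnable.

```python
def large_next(context):
    # Hypothetical "large model": a deterministic next-token rule
    # standing in for an expensive LLM forward pass.
    return (sum(context) + len(context)) % 50

def small_next(context):
    # Hypothetical "speculator": a cheap approximation that agrees
    # with the large model most of the time, but not always.
    base = (sum(context) + len(context)) % 50
    return base if sum(context) % 3 else (base + 1) % 50

def speculative_step(context, k=4):
    # 1) The speculator drafts k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = small_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) The large model verifies the drafts. In a real system this is
    #    one batched forward pass over all k positions, which is where
    #    the latency win over token-by-token decoding comes from.
    accepted, ctx = [], list(context)
    for t in draft:
        target = large_next(ctx)
        if t == target:
            accepted.append(t)   # draft accepted: a "free" token
            ctx.append(t)
        else:
            accepted.append(target)  # first mismatch: take the large
            ctx.append(target)       # model's token and stop
            break
    return accepted
```

The key invariant is that the output is exactly what greedy decoding with the large model alone would produce; the speculator only changes how many tokens each large-model pass yields. That is also why Fireworks retrains speculators as customer models and data distributions drift: the speedup is proportional to the draft acceptance rate.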
Optimizer Database Automates Deployment Trade-Offs
- 3D Fire Optimizer is an internal performance database plus prediction system for deployment trade-offs.
- It stores past optimizations and predicted results across workload patterns, hardware, and cache rates to automate scaling decisions.
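A minimal sketch of that idea, assuming a simple in-memory table (the profile fields, metrics, and numbers below are hypothetical, not Fireworks' schema): store measured results keyed by workload, hardware, and cache-hit rate, then answer "cheapest deployment that meets this latency budget" by lookup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    workload: str        # e.g. "chat-short-context" (hypothetical label)
    hardware: str        # e.g. "H100", "MI300X"
    cache_hit_rate: float

@dataclass
class Measurement:
    tokens_per_sec: float
    p50_latency_ms: float
    cost_per_hour: float

# Hypothetical stand-in for the performance database of past runs.
DB = {
    Profile("chat-short-context", "H100", 0.8):   Measurement(9000, 120, 12.0),
    Profile("chat-short-context", "A100", 0.8):   Measurement(4500, 210, 6.0),
    Profile("chat-short-context", "MI300X", 0.8): Measurement(8000, 140, 10.0),
}

def pick_deployment(workload, cache_hit_rate, latency_budget_ms):
    # Match stored measurements against the workload profile, keep only
    # those within the latency budget, then pick the cheapest hardware.
    candidates = [
        (p.hardware, m)
        for p, m in DB.items()
        if p.workload == workload
        and abs(p.cache_hit_rate - cache_hit_rate) < 0.1
        and m.p50_latency_ms <= latency_budget_ms
    ]
    if not candidates:
        return None  # no measured config meets the budget
    return min(candidates, key=lambda hm: hm[1].cost_per_hour)[0]
```

For example, with a 150 ms budget the lookup would prefer the cheaper MI300X entry over H100; tighten the budget to 130 ms and H100 wins. A prediction layer, as described in the episode, would extend this by interpolating for profiles the table has never measured.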


