Software Engineering Daily

Open-Weight AI Models

Apr 28, 2026
Benny Chen, co-founder of Fireworks AI and former Meta ML infrastructure engineer, shares his work building platforms to serve and fine-tune open-weight models at scale. He discusses custom kernels, speculative decoding for faster code completion, multi-hardware support, reinforcement fine-tuning, and using production traces to evaluate real-world performance.
INSIGHT

Custom Kernels Ensure Numeric Consistency And Multi-Hardware Support

  • Fireworks built in-house FireAttention kernels to control numeric fidelity and to support multiple hardware platforms, including NVIDIA and AMD GPUs.
  • They prioritize training-inference numeric alignment to avoid RL training instability caused by mismatched kernels.
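Why would two kernels computing the "same" math disagree? Floating-point addition is not associative, so a training kernel and an inference kernel that reduce in different orders can produce different results, and those small divergences can destabilize RL training. A minimal, generic illustration (not Fireworks' actual kernels) using simulated float32 accumulation:

```python
import struct

def f32(x: float) -> float:
    """Round a Python double to the nearest float32, mimicking an fp32 accumulator."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(xs) -> float:
    """Sequential float32 accumulation; the result depends on operand order."""
    acc = 0.0
    for x in xs:
        acc = f32(acc + x)
    return acc

# Same multiset of values, two reduction orders.
# At magnitude 1e8 a float32 ulp is 8, so each +1.0 is absorbed (lost)
# when the large value is accumulated first.
forward   = sum_f32([1e8] + [1.0] * 100 + [-1e8])   # every 1.0 is lost
reordered = sum_f32([1.0] * 100 + [1e8, -1e8])      # the 100 survives rounding
```

Here `forward` is exactly `0.0` while `reordered` is not, even though both sum the same numbers; two kernels that reduce in different orders diverge in exactly this way.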
INSIGHT

Speculative Decoding Speeds Interactive Coding Completions

  • Speculative decoding uses a small speculator model to propose tokens the large model will accept, speeding interactive tasks.
  • Fireworks trains and continuously updates speculators for customers' fine-tuned models and changing data distributions.
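The mechanism described above can be sketched in a few lines. This is a greedy toy model, not Fireworks' implementation: a small draft model proposes `k` tokens, the target model keeps the longest prefix it agrees with, and then contributes one token itself (a correction on mismatch, a bonus token otherwise), so output always matches what the target alone would have produced. The integer "tokens" and both model functions are illustrative stand-ins:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_tokens=16):
    """Greedy speculative decoding sketch.

    target_next(seq) -> next token from the large target model.
    draft_next(seq)  -> next token from the small speculator.
    """
    seq = list(prompt)
    generated = 0
    while generated < max_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: keep the longest prefix it agrees with.
        for t in proposal:
            if generated >= max_tokens or target_next(seq) != t:
                break
            seq.append(t)
            generated += 1
        # Target supplies one token itself: the correction on a mismatch,
        # or a free extra token when the whole draft was accepted.
        if generated < max_tokens:
            seq.append(target_next(seq))
            generated += 1
    return seq[len(prompt):]

# Toy stand-ins (assumptions, not real models): the target counts mod 10;
# the draft agrees except immediately after a 5.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10
```

When the draft agrees often (as a well-trained speculator does), most tokens are accepted in batches and the expensive target runs far fewer sequential steps; this is also why Fireworks retrains speculators as customers' fine-tuned models and data distributions drift, since acceptance rate is what buys the speedup.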
INSIGHT

Optimizer Database Automates Deployment Trade-Offs

  • 3D Fire Optimizer is an internal performance database plus prediction system for deployment trade-offs.
  • It stores past optimizations and predicted results across workload patterns, hardware, and cache rates to automate scaling decisions.
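A database of past measurements plus a predictor for unseen configurations can be sketched as follows. All names and the nearest-neighbor fallback are hypothetical, illustrating the idea of keying performance data by workload pattern, hardware, and cache rate rather than describing Fireworks' actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    """Hypothetical key for a deployment configuration."""
    model: str
    hardware: str
    cache_hit_rate: float  # prefix-cache hit rate, 0.0 to 1.0

class OptimizerDB:
    """Store measured throughput per workload; predict unseen workloads
    from the nearest measured cache rate on the same model and hardware."""

    def __init__(self):
        self._measured = {}  # Workload -> tokens/sec

    def record(self, w: Workload, tokens_per_sec: float) -> None:
        self._measured[w] = tokens_per_sec

    def predict(self, w: Workload) -> float:
        # Exact match wins; otherwise fall back to the closest cache rate.
        if w in self._measured:
            return self._measured[w]
        peers = [(abs(r.cache_hit_rate - w.cache_hit_rate), tps)
                 for r, tps in self._measured.items()
                 if r.model == w.model and r.hardware == w.hardware]
        if not peers:
            raise KeyError(f"no measurements for {w.model} on {w.hardware}")
        return min(peers)[1]
```

A production system would interpolate over many more dimensions (batch size, sequence length, quantization), but the core pattern is the same: look up measured points first, predict only where data is missing, and let the prediction drive scaling decisions automatically.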