
Software Engineering Daily Open-Weight AI Models
Apr 28, 2026 Benny Chen, co-founder of Fireworks AI and former Meta ML infrastructure engineer, shares his work building platforms to serve and fine-tune open-weight models at scale. He discusses custom kernels, speculative decoding for faster code completion, multi-hardware support, reinforcement fine-tuning, and using production traces to evaluate real-world performance.
Custom Kernels Ensure Numeric Consistency and Multi-Hardware Support
- Fireworks built in-house FireAttention kernels to control numeric fidelity and to support multiple hardware vendors, including NVIDIA and AMD.
- They prioritize numeric alignment between training and inference kernels, since mismatched kernels can destabilize RL training.
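Why kernel choice affects numerics: floating-point addition is not associative, so two kernels that reduce in different orders can return different sums for the same inputs. The sketch below is purely illustrative (not Fireworks' kernels, and in float64 rather than the fp16/bf16 used on GPUs, where the effect is larger); it contrasts a sequential accumulation with the pairwise tree reduction a parallel kernel might use.

```python
def sequential_sum(xs):
    # Left-to-right accumulation: the order a single-threaded loop
    # (or one kernel implementation) might use.
    total = 0.0
    for x in xs:
        total += x
    return total

def tree_sum(xs):
    # Pairwise (tree) reduction: the order a parallel reduction
    # kernel might use. Same inputs, different rounding at each step.
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

# Values chosen so float64 rounding makes the two orders disagree.
vals = [1e16, 1.0, 1.0, 1.0, 1.0, -1e16]

print(sequential_sum(vals))  # 0.0 (the 1.0s are absorbed by 1e16)
print(tree_sum(vals))        # 2.0 (pairing lets two 1.0s survive)
```

If a training stack and an inference stack disagree like this on logits, an RL loop that scores inference outputs with training-side probabilities is effectively grading one model with another, which is one way instability creeps in.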
Speculative Decoding Speeds Interactive Coding Completions
- Speculative decoding uses a small speculator model to propose tokens the large model will accept, speeding interactive tasks.
- Fireworks trains and continuously updates speculators for customers' fine-tuned models and changing data distributions.
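The mechanic of speculative decoding can be sketched in a few lines. This is a toy greedy variant, not Fireworks' implementation: `large_next` and `small_next` are hypothetical stand-ins for an expensive target model and a cheap speculator, and the deterministic arithmetic rules exist only to make the example runnable.

```python
def large_next(context):
    # Hypothetical "large model": a deterministic next-token rule
    # standing in for an expensive LLM forward pass.
    return (sum(context) + len(context)) % 50

def small_next(context):
    # Hypothetical "speculator": a cheap approximation that agrees
    # with the large model most of the time, but not always.
    base = (sum(context) + len(context)) % 50
    return base if sum(context) % 3 else (base + 1) % 50

def speculative_step(context, k=4):
    # 1) The speculator drafts k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = small_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) The large model verifies the drafts. In a real system this is
    #    one batched forward pass over all k positions, which is where
    #    the latency win over token-by-token decoding comes from.
    accepted, ctx = [], list(context)
    for t in draft:
        target = large_next(ctx)
        if t == target:
            accepted.append(t)   # draft accepted: a "free" token
            ctx.append(t)
        else:
            accepted.append(target)  # first mismatch: take the large
            ctx.append(target)       # model's token and stop
            break
    return accepted
```

The key invariant is that the output is exactly what greedy decoding with the large model alone would produce; the speculator only changes how many tokens each large-model pass yields. That is also why Fireworks retrains speculators as customer models and data distributions drift: the speedup is proportional to the draft acceptance rate.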
Optimizer Database Automates Deployment Trade-Offs
- 3D Fire Optimizer is an internal performance database plus prediction system for deployment trade-offs.
- It stores past optimizations and predicted results across workload patterns, hardware, and cache rates to automate scaling decisions.
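A minimal sketch of that idea, assuming a simple in-memory table (the profile fields, metrics, and numbers below are hypothetical, not Fireworks' schema): store measured results keyed by workload, hardware, and cache-hit rate, then answer "cheapest deployment that meets this latency budget" by lookup.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    workload: str        # e.g. "chat-short-context" (hypothetical label)
    hardware: str        # e.g. "H100", "MI300X"
    cache_hit_rate: float

@dataclass
class Measurement:
    tokens_per_sec: float
    p50_latency_ms: float
    cost_per_hour: float

# Hypothetical stand-in for the performance database of past runs.
DB = {
    Profile("chat-short-context", "H100", 0.8):   Measurement(9000, 120, 12.0),
    Profile("chat-short-context", "A100", 0.8):   Measurement(4500, 210, 6.0),
    Profile("chat-short-context", "MI300X", 0.8): Measurement(8000, 140, 10.0),
}

def pick_deployment(workload, cache_hit_rate, latency_budget_ms):
    # Match stored measurements against the workload profile, keep only
    # those within the latency budget, then pick the cheapest hardware.
    candidates = [
        (p.hardware, m)
        for p, m in DB.items()
        if p.workload == workload
        and abs(p.cache_hit_rate - cache_hit_rate) < 0.1
        and m.p50_latency_ms <= latency_budget_ms
    ]
    if not candidates:
        return None  # no measured config meets the budget
    return min(candidates, key=lambda hm: hm[1].cost_per_hour)[0]
```

For example, with a 150 ms budget the lookup would prefer the cheaper MI300X entry over H100; tighten the budget to 130 ms and H100 wins. A prediction layer, as described in the episode, would extend this by interpolating for profiles the table has never measured.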


