
Super Data Science: ML & AI Podcast with Jon Krohn 976: NVIDIA’s Nemotron 3 Super: The Perfect LLM for Multi-Agent Systems
Mar 20, 2026

They unpack NVIDIA's Nemotron 3 Super architecture and how the 120B-parameter model activates only 12B parameters at a time for efficiency. Listeners hear about the hybrid Mamba-Transformer design and latent mixture-of-experts routing. The conversation covers million-token context windows, NVFP4 precision with Blackwell GPUs, throughput benchmarks, and where to access and deploy the model for multi-agent systems.
Hybrid Mamba Transformer Backbone
- NVIDIA combined Mamba state-space layers and transformer attention layers to form a hybrid backbone.
- Mamba layers give linear-time efficiency over long contexts, while the attention layers preserve precise retrieval at key depths.
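
The interleaving idea can be sketched in plain Python. The layer ratio (one attention layer per six), the cost formulas, and every name below are illustrative assumptions for this sketch, not NVIDIA's published Nemotron configuration:

```python
# Illustrative sketch of a hybrid Mamba/attention backbone: mostly
# linear-time state-space layers, with periodic full-attention layers
# kept for precise retrieval. Ratios and costs are toy assumptions.

def hybrid_stack(n_layers: int, attention_every: int = 6) -> list[str]:
    """Interleave Mamba layers with an attention layer every few blocks."""
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(n_layers)]

def layer_cost(kind: str, seq_len: int) -> int:
    """Toy compute model: Mamba scales linearly with sequence length,
    full attention quadratically."""
    return seq_len if kind == "mamba" else seq_len * seq_len

def stack_cost(stack: list[str], seq_len: int) -> int:
    """Total toy cost of one forward pass over the stack."""
    return sum(layer_cost(kind, seq_len) for kind in stack)
```

Under this toy cost model, a mostly-Mamba stack is far cheaper at long sequence lengths than an all-attention stack of the same depth, which is the efficiency argument the snip describes.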
Latent MoE Routes In Latent Space
- Latent MoE compresses tokens into a smaller latent space before routing them to experts, cutting routing compute overhead.
- The savings allow activating four times more experts per token, raising specialization and accuracy.
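
A minimal sketch of that routing step, using a toy fixed projection and made-up dimensions (in a real model the projection and expert embeddings are learned weights at vastly larger sizes):

```python
import math

# Hypothetical latent-MoE routing sketch: project each token from model
# space into a smaller latent space, then score experts there. Per-expert
# scoring costs LATENT_DIM multiplies instead of MODEL_DIM, which is the
# routing-compute saving the snip describes. All sizes are illustrative.
MODEL_DIM, LATENT_DIM, N_EXPERTS, TOP_K = 8, 2, 16, 4

# Fixed toy projection matrix and expert centroids (learned in practice).
PROJ = [[math.sin(i * LATENT_DIM + j) for j in range(LATENT_DIM)]
        for i in range(MODEL_DIM)]
EXPERTS = [[math.cos(e * LATENT_DIM + j) for j in range(LATENT_DIM)]
           for e in range(N_EXPERTS)]

def project(token: list[float]) -> list[float]:
    """Compress a model-dim token into the latent routing space."""
    return [sum(t * PROJ[i][j] for i, t in enumerate(token))
            for j in range(LATENT_DIM)]

def route(token: list[float], k: int = TOP_K) -> list[int]:
    """Score experts by dot product in latent space and pick the top-k."""
    z = project(token)
    scores = [sum(a * b for a, b in zip(z, e)) for e in EXPERTS]
    return sorted(range(N_EXPERTS), key=lambda e: scores[e], reverse=True)[:k]
```

Because each expert score is now a LATENT_DIM-wide dot product rather than a MODEL_DIM-wide one, the same routing budget can afford scoring (and activating) several times more experts per token.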
Multi Token Prediction For Faster Decoding
- Nemotron 3 Super implements multi-token prediction (MTP) to predict multiple future tokens in one forward pass.
- MTP acts as a built-in draft model enabling speculative decoding and up to 3x speedup on structured generation like code.
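
The draft-then-verify loop behind that claim can be sketched with deterministic toy stand-ins for both models. Everything here (the toy "models", the draft length, the mismatch rule) is an assumption made for illustration; only the accept-the-matching-prefix structure reflects how speculative decoding works:

```python
# Hypothetical speculative-decoding sketch: a cheap draft proposes
# several future tokens, the expensive target model verifies them, and
# the longest matching prefix is accepted in one go. Toy models only.

def target_next(ctx: tuple[int, ...]) -> int:
    """Stand-in for the full model's (expensive) next-token prediction."""
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_next(ctx: tuple[int, ...]) -> int:
    """Stand-in for a built-in draft/MTP head: usually agrees with the
    target, but deliberately misses at every third position."""
    t = target_next(ctx)
    return t if len(ctx) % 3 else (t + 1) % 100

def draft_tokens(ctx: tuple[int, ...], k: int) -> list[int]:
    """Propose k future tokens cheaply by rolling the draft forward."""
    out = []
    for _ in range(k):
        out.append(draft_next(ctx))
        ctx = ctx + (out[-1],)
    return out

def greedy(ctx: tuple[int, ...], n_new: int) -> list[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(ctx)
    for _ in range(n_new):
        tokens.append(target_next(tuple(tokens)))
    return tokens[len(ctx):]

def speculative_decode(ctx: tuple[int, ...], n_new: int, k: int = 4) -> list[int]:
    tokens = list(ctx)
    while len(tokens) - len(ctx) < n_new:
        for tok in draft_tokens(tuple(tokens), k):
            correct = target_next(tuple(tokens))
            if tok == correct:          # drafted token verified: accept
                tokens.append(tok)
            else:                       # miss: keep the target's token, redraft
                tokens.append(correct)
                break
    return tokens[len(ctx):len(ctx) + n_new]
```

The output is identical to plain greedy decoding; the speedup comes from verifying several drafted tokens per expensive target-model pass instead of generating one token per pass.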
