Super Data Science: ML & AI Podcast with Jon Krohn

976: NVIDIA’s Nemotron 3 Super: The Perfect LLM for Multi-Agent Systems

Mar 20, 2026
They unpack NVIDIA’s Nemotron 3 Super architecture and how a 120B-parameter model activates only 12B parameters per token for efficiency. Listeners hear about the hybrid Mamba-Transformer design and latent mixture-of-experts routing. The conversation covers million-token context windows, NVFP4 precision on Blackwell GPUs, throughput benchmarks, and where to access and deploy the model for multi-agent systems.
INSIGHT

Hybrid Mamba Transformer Backbone

  • NVIDIA combined Mamba state-space layers and transformer attention layers to form a hybrid backbone.
  • Mamba layers give linear-time efficiency over long contexts, while attention layers at key depths preserve precise token retrieval.
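The efficiency argument above can be sketched with a toy cost model. Everything here is an illustrative assumption (the layer schedule, the 12-layer depth, and the `attn_every` interval are made up, not NVIDIA's actual configuration): Mamba-style state-space scans cost roughly O(n) in sequence length, attention costs O(n²), and a hybrid stack keeps only a few attention layers at key depths.

```python
# Toy cost model for a hybrid Mamba/attention backbone.
# All numbers and the layer schedule are illustrative assumptions.

def layer_cost(kind: str, seq_len: int) -> int:
    """Rough per-layer cost: SSM scans scale linearly, attention quadratically."""
    return seq_len if kind == "mamba" else seq_len * seq_len

def build_hybrid_stack(n_layers: int, attn_every: int = 6) -> list:
    """Mostly Mamba layers, with an attention layer every `attn_every` layers."""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba"
            for i in range(n_layers)]

def total_cost(stack, seq_len: int) -> int:
    return sum(layer_cost(kind, seq_len) for kind in stack)

stack = build_hybrid_stack(12)              # 10 mamba + 2 attention layers
hybrid = total_cost(stack, seq_len=1_000_000)
pure_attn = total_cost(["attention"] * 12, seq_len=1_000_000)
print(f"hybrid / pure-attention cost ratio: {hybrid / pure_attn:.4f}")
```

At a million-token context the handful of attention layers dominates the hybrid's cost, yet the stack is still far cheaper than an all-attention model of the same depth.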
INSIGHT

Latent MoE Routing in Latent Space

  • Latent MoE compresses token representations into a smaller latent space before routing them to experts, cutting routing compute overhead.
  • The savings allow activating four times as many experts per token, raising specialization and accuracy.
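A minimal sketch of the routing idea, with every detail (dimensions, the down-projection, the expert keys) a stand-in assumption rather than Nemotron's actual design: the router scores experts against a small latent vector instead of the full hidden state, so each expert comparison costs `LATENT` multiplies rather than `HIDDEN`, leaving budget to activate more experts per token.

```python
# Hypothetical latent-MoE routing sketch; all names and sizes are illustrative.
import random

random.seed(0)
HIDDEN, LATENT, N_EXPERTS = 512, 64, 16

# Each expert's routing key lives in the small LATENT space, so scoring one
# expert costs LATENT (not HIDDEN) multiply-adds.
expert_keys = [[random.gauss(0, 1) for _ in range(LATENT)]
               for _ in range(N_EXPERTS)]

def down_project(hidden):
    """Stand-in for a learned down-projection: fold hidden dims into latent buckets."""
    latent = [0.0] * LATENT
    for i, v in enumerate(hidden):
        latent[i % LATENT] += v
    return latent

def route(hidden, top_k):
    """Score all experts in latent space and return the top_k expert indices."""
    latent = down_project(hidden)
    scores = [sum(k * z for k, z in zip(key, latent)) for key in expert_keys]
    return sorted(range(N_EXPERTS), key=lambda e: -scores[e])[:top_k]

token = [random.gauss(0, 1) for _ in range(HIDDEN)]
picked = route(token, top_k=8)
print("activated experts:", picked)
```

Because scoring is ~8x cheaper here (64 vs. 512 dims per expert), the same routing budget covers several times more activated experts per token — the trade the insight describes.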
INSIGHT

Multi-Token Prediction for Faster Decoding

  • Nemotron 3 Super implements multi-token prediction (MTP) to predict multiple future tokens in one forward pass.
  • MTP acts as a built-in draft model, enabling speculative decoding and up to a 3x speedup on structured generation such as code.
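The draft-then-verify loop behind speculative decoding can be sketched as follows. Both "models" here are hypothetical stand-ins on integer tokens (the draft is deliberately wrong after two tokens to show partial acceptance); the real mechanism uses MTP heads as the draft and the main model as the verifier.

```python
# Toy speculative-decoding loop; the draft and verifier are illustrative stand-ins.

def draft_tokens(prefix, n=4):
    """Stand-in for MTP heads: cheaply propose n future tokens in one pass.
    Deliberately wrong after 2 tokens, so only a prefix gets accepted."""
    return [(prefix[-1] + i + 1) % 100 if i < 2 else 0 for i in range(n)]

def verify(prefix, proposed):
    """Stand-in for the main model: keep the longest prefix of proposed tokens
    that matches its own next-token choices."""
    accepted = []
    for tok in proposed:
        target = (prefix[-1] + len(accepted) + 1) % 100  # verifier's next token
        if tok != target:
            break                                        # reject from here on
        accepted.append(tok)
    return accepted

def generate(prompt, n_tokens):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        accepted = verify(seq, draft_tokens(seq))
        if not accepted:                  # fall back to one ordinary decode step
            accepted = [(seq[-1] + 1) % 100]
        seq.extend(accepted)              # commit several tokens per iteration
    return seq[: len(prompt) + n_tokens]

print(generate([7], 8))
```

Each loop iteration commits multiple verified tokens per verifier pass instead of one, which is where the speedup comes from; highly predictable output like code lets more drafted tokens survive verification, matching the "up to 3x on structured generation" claim.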