Super Data Science: ML & AI Podcast with Jon Krohn

976: NVIDIA’s Nemotron 3 Super: The Perfect LLM for Multi-Agent Systems

Mar 20, 2026
They unpack NVIDIA’s Nemotron 3 Super architecture and how a 120B-parameter model activates only 12B parameters per token for efficiency. Listeners hear about the hybrid Mamba-Transformer design and latent mixture-of-experts routing. The conversation covers million-token context windows, NVFP4 precision on Blackwell GPUs, throughput benchmarks, and where to access and deploy the model for multi-agent systems.
INSIGHT

Hybrid Mamba Transformer Backbone

  • NVIDIA combined Mamba state-space layers and transformer attention layers to form a hybrid backbone.
  • Mamba layers give linear-time efficiency over long contexts, while attention layers at key depths preserve precise token retrieval.
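The efficiency argument above can be sketched with a toy cost model. Everything here is an illustrative assumption (the layer schedule, the 12-layer depth, and the `attn_every` interval are made up, not NVIDIA's actual configuration): Mamba-style state-space scans cost roughly O(n) in sequence length, attention costs O(n²), and a hybrid stack keeps only a few attention layers at key depths.

```python
# Toy cost model for a hybrid Mamba/attention backbone.
# All numbers and the layer schedule are illustrative assumptions.

def layer_cost(kind: str, seq_len: int) -> int:
    """Rough per-layer cost: SSM scans scale linearly, attention quadratically."""
    return seq_len if kind == "mamba" else seq_len * seq_len

def build_hybrid_stack(n_layers: int, attn_every: int = 6) -> list:
    """Mostly Mamba layers, with an attention layer every `attn_every` layers."""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba"
            for i in range(n_layers)]

def total_cost(stack, seq_len: int) -> int:
    return sum(layer_cost(kind, seq_len) for kind in stack)

stack = build_hybrid_stack(12)              # 10 mamba + 2 attention layers
hybrid = total_cost(stack, seq_len=1_000_000)
pure_attn = total_cost(["attention"] * 12, seq_len=1_000_000)
print(f"hybrid / pure-attention cost ratio: {hybrid / pure_attn:.4f}")
```

At a million-token context the handful of attention layers dominates the hybrid's cost, yet the stack is still far cheaper than an all-attention model of the same depth.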
INSIGHT

Latent MoE Routing in Latent Space

  • Latent MoE compresses token representations into a smaller latent space before routing them to experts, cutting routing compute overhead.
  • The savings allow activating four times as many experts per token, raising specialization and accuracy.
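A minimal sketch of the routing idea, with every detail (dimensions, the down-projection, the expert keys) a stand-in assumption rather than Nemotron's actual design: the router scores experts against a small latent vector instead of the full hidden state, so each expert comparison costs `LATENT` multiplies rather than `HIDDEN`, leaving budget to activate more experts per token.

```python
# Hypothetical latent-MoE routing sketch; all names and sizes are illustrative.
import random

random.seed(0)
HIDDEN, LATENT, N_EXPERTS = 512, 64, 16

# Each expert's routing key lives in the small LATENT space, so scoring one
# expert costs LATENT (not HIDDEN) multiply-adds.
expert_keys = [[random.gauss(0, 1) for _ in range(LATENT)]
               for _ in range(N_EXPERTS)]

def down_project(hidden):
    """Stand-in for a learned down-projection: fold hidden dims into latent buckets."""
    latent = [0.0] * LATENT
    for i, v in enumerate(hidden):
        latent[i % LATENT] += v
    return latent

def route(hidden, top_k):
    """Score all experts in latent space and return the top_k expert indices."""
    latent = down_project(hidden)
    scores = [sum(k * z for k, z in zip(key, latent)) for key in expert_keys]
    return sorted(range(N_EXPERTS), key=lambda e: -scores[e])[:top_k]

token = [random.gauss(0, 1) for _ in range(HIDDEN)]
picked = route(token, top_k=8)
print("activated experts:", picked)
```

Because scoring is ~8x cheaper here (64 vs. 512 dims per expert), the same routing budget covers several times more activated experts per token — the trade the insight describes.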
INSIGHT

Multi-Token Prediction for Faster Decoding

  • Nemotron 3 Super implements multi-token prediction (MTP) to predict multiple future tokens in one forward pass.
  • MTP acts as a built-in draft model, enabling speculative decoding and up to a 3x speedup on structured generation such as code.
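The draft-then-verify loop behind speculative decoding can be sketched as follows. Both "models" here are hypothetical stand-ins on integer tokens (the draft is deliberately wrong after two tokens to show partial acceptance); the real mechanism uses MTP heads as the draft and the main model as the verifier.

```python
# Toy speculative-decoding loop; the draft and verifier are illustrative stand-ins.

def draft_tokens(prefix, n=4):
    """Stand-in for MTP heads: cheaply propose n future tokens in one pass.
    Deliberately wrong after 2 tokens, so only a prefix gets accepted."""
    return [(prefix[-1] + i + 1) % 100 if i < 2 else 0 for i in range(n)]

def verify(prefix, proposed):
    """Stand-in for the main model: keep the longest prefix of proposed tokens
    that matches its own next-token choices."""
    accepted = []
    for tok in proposed:
        target = (prefix[-1] + len(accepted) + 1) % 100  # verifier's next token
        if tok != target:
            break                                        # reject from here on
        accepted.append(tok)
    return accepted

def generate(prompt, n_tokens):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        accepted = verify(seq, draft_tokens(seq))
        if not accepted:                  # fall back to one ordinary decode step
            accepted = [(seq[-1] + 1) % 100]
        seq.extend(accepted)              # commit several tokens per iteration
    return seq[: len(prompt) + n_tokens]

print(generate([7], 8))
```

Each loop iteration commits multiple verified tokens per verifier pass instead of one, which is where the speedup comes from; highly predictable output like code lets more drafted tokens survive verification, matching the "up to 3x on structured generation" claim.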