Vanishing Gradients

LLM Architecture in 2026: What You Need to Know with Sebastian Raschka

Apr 13, 2026
Sebastian Raschka, independent AI researcher and author of practical, code-first LLM guides, digs into what modern model architectures actually contain. The conversation covers inference-scaling tricks, hybrid transformer/state-space designs, KV-cache and long-context tactics, Multi-head Latent Attention, and the tradeoffs of running local versus frontier models.
INSIGHT

Reasoning Is Often Distilled In Training Data

  • Reasoning behavior is often already present in base models because their training data contains worked examples; chain-of-thought is better understood as a behavior than as a single technique.
  • Pre-training can already distill reasoning traces from post-trained models, since training corpora now include reasoning-style content.
ANECDOTE

Website CSS Fixes Feel Like Magic And Frustration

  • Sebastian uses LLMs to fix long-accumulated CSS cruft on his 12-year-old static website, and it often feels like magic when it succeeds.
  • When the model misaligns elements, he feeds it screenshots and iterates on fixes, sometimes falling back to manual CSS tweaks.
ADVICE

Automate Writing QA With A Fixed Checklist

  • Automate tedious editorial and QA tasks with LLMs using a fixed checklist to save time.
  • Sebastian runs a ~20-item checklist (title casing, link checks, code-notebook sync) and asks the model to flag mismatches across notebook and docs.
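The checklist-driven QA above can be sketched in a few lines of Python. This is a minimal illustration, not Sebastian's actual setup: the checklist items are shortened examples, and `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
# Sketch: run a fixed editorial checklist over a document with an LLM.
# The checklist items below are illustrative; `ask_llm` is a hypothetical
# callable(str) -> str wrapping your preferred chat API.

CHECKLIST = [
    "Headings use consistent title casing",
    "All hyperlinks resolve (no placeholder URLs)",
    "Code listings match the companion notebook cell-for-cell",
]

def build_qa_prompt(document: str, checklist: list[str]) -> str:
    """Assemble one prompt asking the model to flag checklist violations."""
    items = "\n".join(f"{i}. {item}" for i, item in enumerate(checklist, 1))
    return (
        "Review the document below against this checklist. "
        "For each item, reply PASS or FAIL with a one-line reason.\n\n"
        f"Checklist:\n{items}\n\nDocument:\n{document}"
    )

def run_qa(document: str, ask_llm) -> str:
    # Single call keeps the checklist fixed across runs, so results
    # stay comparable from draft to draft.
    return ask_llm(build_qa_prompt(document, CHECKLIST))
```

Keeping the checklist in code (rather than retyping it each session) is what makes the review repeatable: the same items run against every draft, and only the document changes.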