AXRP - the AI X-risk Research Podcast

29 - Science of Deep Learning with Vikrant Varma

Apr 25, 2024
Vikrant Varma discusses challenges with unsupervised knowledge discovery, grokking in neural networks, circuit efficiency, and the role of complexity in deep learning. The conversation delves into the balance between memorization and generalization, exploring neural circuits, implicit priors, optimization, and alignment projects at DeepMind.
INSIGHT

Limits of Linear Probes for Model Concepts

  • Linear probes may fail to capture complex or alien model concepts that lack a short natural-language decoding.
  • Many model concepts may be encoded nonlinearly, with no direct mapping to natural language tokens.
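To make the probing idea concrete, here is a minimal sketch of a linear probe: logistic regression trained on frozen activations. The activations are simulated (two Gaussian clusters separated along one "concept direction"), so this shows the easy case the insight contrasts against — a concept that *is* linearly decodable; all dimensions and data here are illustrative assumptions, not anything from the episode.

```python
import numpy as np

# Simulated hidden activations (assumption: in practice these would come
# from a real model's residual stream or layer outputs).
rng = np.random.default_rng(0)
d, n = 32, 500                      # activation dim, examples per class
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

acts = rng.normal(size=(2 * n, d))
labels = np.array([0] * n + [1] * n)
acts[labels == 1] += 2.0 * direction  # concept varies along one direction

# The "probe" is just logistic regression on the frozen activations,
# trained here with plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    grad = p - labels
    w -= lr * (acts.T @ grad) / len(labels)
    b -= lr * grad.mean()

acc = (((acts @ w + b) > 0) == labels).mean()
print(f"probe accuracy: {acc:.2f}")  # well above chance for this toy data
```

A nonlinearly encoded concept (e.g. labels depending on a product of two activation coordinates) would defeat this probe entirely, which is the failure mode the insight points at.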
INSIGHT

Grokking Reveals Training Dynamics

  • Grokking describes a sudden shift in neural networks from memorization to generalization after extended training.
  • Understanding grokking can illuminate training dynamics and inductive biases relevant to AI alignment.
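Grokking was first reported on small algorithmic tasks such as modular addition. The sketch below builds that kind of dataset; the modulus p = 97 and the 50% train split are illustrative choices, not the episode's exact setup.

```python
import itertools
import random

# Build the full table for (a + b) mod p -- the classic grokking task.
p = 97
data = [((a, b), (a + b) % p)
        for a, b in itertools.product(range(p), repeat=2)]

random.seed(0)
random.shuffle(data)
cut = len(data) // 2                 # illustrative 50% train fraction
train, test = data[:cut], data[cut:]

# A small network trained on `train` can first memorize it (near-perfect
# train accuracy, near-chance test accuracy) and only after much longer
# training "grok", suddenly jumping to high test accuracy.
print(len(train), len(test))
```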
INSIGHT

Circuit Model for Grokking Explains Behavior

  • Memorizing and generalizing circuits conceptually compete within a network, but overlap heavily in parameters.
  • Circuit interactions are complex; models may morph one circuit into another rather than scaling each independently.
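One way the "efficiency" framing is often illustrated: when many parameter settings implement the same function, weight decay pushes gradient descent toward the lowest-norm one. The toy below is an illustration of that selection pressure only, not the episode's actual model: two redundant parameters w1, w2 implement f(x) = (w1 + w2) * x, so any split with the right sum fits the data, but decay drifts the solution toward the equal (minimum-norm) split.

```python
import numpy as np

# Toy redundant parameterization: f(x) = (w1 + w2) * x.
# Every (w1, w2) with w1 + w2 near 3 fits the data equally well.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x

w1, w2 = 5.0, -2.0        # start at an inefficient (high-norm) solution
lr, wd = 0.05, 0.1        # learning rate and weight decay (assumptions)
for _ in range(2000):
    g = (((w1 + w2) * x - y) * x).mean()   # shared data gradient
    w1 -= lr * (g + wd * w1)
    w2 -= lr * (g + wd * w2)

# Weight decay shrinks the gap w1 - w2 toward zero while keeping the sum
# close to 3: among functionally similar solutions, the lower-norm
# ("more efficient") one wins.
print(round(w1, 2), round(w2, 2))
```

The same logic, scaled up, is one candidate story for grokking: the generalizing circuit produces the same training loss at lower parameter norm, so regularization slowly favors it over the memorizing circuit.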