Interconnects

An unexpected RL Renaissance

Feb 13, 2025
Reinforcement learning is experiencing a renaissance, fueled by advancing research and improved infrastructure. Reinforcement learning from human feedback (RLHF) has transformed language models, reshaping AI capabilities. New tools like TRL and OpenRLHF are making it easier to train innovative models. The evolution of techniques such as deep RL is paving the way for scalable, adaptable AI. With a wealth of funding and open-source resources, the future of reinforcement learning promises to be both dynamic and groundbreaking.
INSIGHT

Scaling Laws vs. Instruction Tuning

  • Scaling laws focused on next-token prediction accuracy, while language models are primarily used for instruction following.
  • This disconnect highlights the gap between model development and real-world application.
INSIGHT

O1's Paradigm Shift

  • OpenAI's o1 shifted the focus to verifiable rewards, driving a significant change in the AI landscape.
  • The demand for AI models that deliver tangible results fueled this shift.
INSIGHT

RL for Language Models

  • RL training for language models involves a policy (text generation), action (completion), state (prompt), and reward model (score function), as illustrated in the sketch below.
  • This framework adapts RL principles to the specific context of language model interaction.
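
A minimal Python sketch of that mapping, assuming a generic Hugging Face causal LM as the policy; the model name and the score_reward helper are illustrative placeholders (a real setup would use a learned reward model), not a recipe from the episode:

```python
# State = prompt, action = generated completion, policy = the language model,
# reward model = a scalar score for the (prompt, completion) pair.
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "gpt2"  # small illustrative model standing in for the policy
tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)

def score_reward(prompt: str, completion: str) -> float:
    """Placeholder reward model: a learned scorer would go here."""
    return float(len(completion.split()))

prompt = "Explain reinforcement learning in one sentence."  # the state
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = policy.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Strip the prompt tokens so only the newly generated text remains (the action).
completion = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
reward = score_reward(prompt, completion)  # the reward signal
print(f"reward={reward:.2f} for completion: {completion!r}")
```

In an actual RLHF or RL-with-verifiable-rewards loop, this reward would feed back into a policy-gradient update of the language model; libraries such as TRL and OpenRLHF package that training step.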