
791: Reinforcement Learning from Human Feedback (RLHF), with Dr. Nathan Lambert
Super Data Science: ML & AI Podcast with Jon Krohn
Challenges of Aligning Human Preferences in Reinforcement Learning from Human Feedback
This chapter covers the difficulty of capturing human preferences with reward models in RLHF, touching on The Alignment Ceiling, model-based RL versus RLHF, constitutional AI, synthesizing preference data, and handling subjective disagreements among labelers when aggregating human preferences.
Chapter begins at 22:37.