
Deep Papers ChatGPT and InstructGPT: Aligning Language Models to Human Intention
Jan 18, 2023
Episode notes
Instruction-Focused Models Change Behavior
- InstructGPT fine-tunes LLMs to follow natural-language instructions rather than predict internet text continuations.
- That shift makes models behave like assistants, carrying out diverse tasks from simple natural-language prompts instead of merely continuing text.
Use Human Rankings To Train A Reward Model
- Collect human preference rankings between model outputs and train a reward model to predict those preferences.
- Then use reinforcement learning to optimize a policy model to maximize the learned reward.
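The ranking step above can be sketched with a pairwise preference loss (a Bradley-Terry-style objective, consistent with how InstructGPT trains its reward model). The function and toy scores below are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise preference loss for reward-model training: push the
    reward model to score the human-preferred output above the
    rejected one. Inputs are scalar reward-model scores."""
    # -log sigmoid(r_preferred - r_rejected): near zero when the
    # preferred output already scores much higher, large otherwise.
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# Toy scores (hypothetical): the model agrees with the human ranking
print(reward_model_loss(2.0, 0.5))   # low loss
# ...and disagrees with it
print(reward_model_loss(0.5, 2.0))   # high loss
```

Minimizing this loss over many human-ranked pairs yields a scalar reward signal that the subsequent reinforcement-learning stage can maximize.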
Demonstrations Bootstrap; RLHF Fine-Tunes
- Supervised demonstrations bootstrap model behavior by providing high-quality examples to imitate.
- RLHF then fine-tunes behavior more subtly by optimizing for human preference scores.
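One way InstructGPT keeps the RL stage "subtle" is a per-token KL penalty toward the supervised (demonstration-trained) policy, so the model chases reward without drifting far from imitated behavior. The shaped-reward sketch below is a simplification with assumed names and an assumed `beta` value:

```python
def rl_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Shaped reward for RLHF-style fine-tuning (a sketch):
    reward-model score minus a penalty for diverging from the
    supervised fine-tuned (SFT) policy. The KL term is estimated
    per token as log pi_policy(token) - log pi_sft(token)."""
    kl_estimate = logprob_policy - logprob_sft
    return rm_score - beta * kl_estimate

# Toy values (hypothetical): a well-scored output that has drifted
# somewhat from what the SFT model would have generated.
print(rl_reward(rm_score=1.2, logprob_policy=-5.0, logprob_sft=-8.0))
```

The policy optimizer (PPO in the paper) then maximizes this shaped reward, trading off human-preference score against staying close to the demonstration-bootstrapped behavior.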
