
The Nonlinear Library LW - On the lethality of biased human reward ratings by Eli Tyre
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the lethality of biased human reward ratings, published by Eli Tyre on November 17, 2023 on LessWrong.
I'm rereading the List of Lethalities carefully and considering what I think about each point.
I think I strongly don't understand #20, and I thought that maybe you could explain what I'm missing?
20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors. To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).
If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
I think that I don't understand this.
(Maybe one concrete thing that would help is just having examples.)
* * *
One thing that this could be pointing towards is the problem of what I'll call "dynamic feedback schemes", like RLHF. The key feature of a dynamic feedback scheme is that the AI system is generating outputs and a human rater is giving it feedback to reinforce good outputs and anti-reinforce bad outputs.
The problem with schemes like this is that there is adverse selection for outputs that look good to the human rater but are actually bad. This means that, in the long run, you're reinforcing initial accidental misrepresentation and shaping it into more and more sophisticated deception (because you anti-reinforce all the cases of misrepresentation that are caught out, and reinforce all the ones that aren't).
That seems like a recipe for ending up in a world where all the metrics look great, but the underlying reality is awful or hollow, as Paul describes in Part I of What Failure Looks Like.
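(To make that dynamic concrete, here's a toy sketch. Everything in it is made up by me for illustration, not taken from the list or from any real training setup: candidate outputs vary both in true quality and in how much better they look to a fallible rater than they really are, and the loop keeps reinforcing whatever the rater scores highest.)

```python
# Toy sketch of a dynamic feedback scheme with a fallible rater
# (illustrative assumptions throughout; not from the original post).
import random

random.seed(0)

# Each candidate output is (true_quality, looks_better_than_it_is).
# The second term is the gap between how good the output looks to the
# rater and how good it actually is.
population = [(random.gauss(0, 1), abs(random.gauss(0, 1))) for _ in range(200)]

def rater_score(candidate):
    true_quality, apparent_gap = candidate
    # The rater can only reward what they can see: quality plus the gap, plus noise.
    return true_quality + apparent_gap + random.gauss(0, 0.1)

for generation in range(25):
    # "Reinforce" the top-rated half: keep those and add mutated copies of them.
    population.sort(key=rater_score, reverse=True)
    survivors = population[: len(population) // 2]
    children = [
        (q + random.gauss(0, 0.1), abs(g + random.gauss(0, 0.1)))
        for q, g in survivors
    ]
    population = survivors + children

mean_quality = sum(q for q, _ in population) / len(population)
mean_gap = sum(g for _, g in population) / len(population)
print(f"mean true quality: {mean_quality:.2f}")
print(f"mean 'looks better than it is' gap: {mean_gap:.2f}")
# Both go up, because both are rewarded, but the gap is selected for just as
# hard as genuine quality: the rater's impression drifts away from reality.
```

The only point of the sketch is the selection pressure: anything that inflates the rater's score relative to reality gets reinforced right alongside genuine quality, and nothing in the loop pushes back on it.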
It seems like maybe you could avoid this with a static feedback regime, where you take a bunch of descriptions of outcomes, maybe procedurally generated, maybe from fiction, maybe from news reports, whatever, and have humans score those outcomes on how good they are, to build a reward model that can be used for training. As long as the ratings don't get fed back into the generator, there's not much systematic incentive towards training deception.
...Actually, on reflection, I suppose this just pushes the problem back one step. Now you have a reward model which is giving feedback to some AI system that you're training. And the AI system will learn to adversarially game the reward model in the same way that it would have gamed the human.
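(Same caveat as before: a made-up toy, just to check my own picture. The "reward model" here is nothing but an interpolation over a batch of static human ratings, and the systematic error is a region of outcome-space that the raters consistently over-rate.)

```python
# Toy sketch of "pushing the problem back one step": a frozen reward model is
# fit to static human ratings that contain a systematic error, and an optimizer
# searching against it lands exactly where the model over-values outcomes.
# (Illustrative assumptions throughout; not from the original post.)
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # Hypothetical ground truth: the genuinely best outcomes are near x = 2.
    return -((x - 2.0) ** 2)

def human_rating(x):
    # Humans rate outcomes with a *systematic* (not random) error: outcomes
    # near x = 5 consistently look about 12 points better than they are.
    systematic_error = 12.0 * np.exp(-((x - 5.0) ** 2))
    noise = rng.normal(0.0, 0.1, size=np.shape(x))
    return true_value(x) + systematic_error + noise

# Build the static dataset of rated outcomes.
xs = np.sort(rng.uniform(-2.0, 8.0, size=400))
ratings = human_rating(xs)

def reward_model(x):
    # A stand-in reward model: interpolate between the rated training outcomes.
    return np.interp(x, xs, ratings)

# The system being trained then searches for whatever the reward model likes most.
candidates = np.linspace(-2.0, 8.0, 2001)
best = candidates[np.argmax(reward_model(candidates))]
print(f"optimizer picks x = {best:.2f}")
print(f"reward model's score there: {reward_model(best):.2f}")
print(f"true value there:           {true_value(best):.2f}")
# The search is pulled toward the systematically over-rated region rather than
# the genuinely good one, even though no ratings ever fed back into the generator.
```

Statically collecting the ratings didn't remove the exploit; it just moved it from the human to the frozen model of the human.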
That seems like a real problem, but it also doesn't seem like what this point from the list is trying to get at. It seems to be saying something more like "the reward model is going to be wrong, because there's going to be systematic biases in the human ratings."
Which, fair enough, that seems true, but I don't see why that's lethal. It seems like the reward model will be wrong in some places, and we would lose value in those places. But why does the reward model need to be an exact, high-fidelity representation, across all domains, in order not to kill us? Why is a reward model that's a little off, in a predictable direction, catastrophic?
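(Here's the kind of toy picture I have in mind when I ask that, again entirely made up by me: a proxy score equal to true value plus a small, systematic, one-directional error, with different amounts of selection pressure applied to it.)

```python
# A toy calculation (entirely made up, just to make the question concrete):
# the proxy reward is true value plus an error that is small on average but
# systematic and heavy-tailed. How much do we lose at different levels of
# optimization pressure?
import numpy as np

rng = np.random.default_rng(1)

n_options = 1_000_000
true_value = rng.normal(0.0, 1.0, n_options)
# "Off in a predictable direction": an error term that only ever inflates scores.
error = 0.2 * np.abs(rng.standard_t(df=3, size=n_options))
proxy = true_value + error

for top_k in (100_000, 1_000, 10, 1):
    chosen = np.argsort(proxy)[-top_k:]
    print(
        f"select top {top_k:>7} by proxy: "
        f"mean true value {true_value[chosen].mean():+.2f}, "
        f"mean error {error[chosen].mean():+.2f}"
    )
# Under mild selection the proxy mostly tracks true value (a reward model
# that's a little off only loses a little). Under extreme selection, what gets
# picked is increasingly whatever the error term is largest on.
```

Under weak selection this matches my intuition that being a little off only loses a little value; the question is whether the regimes we actually care about look more like the bottom rows.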
First things first:
What you're calling the "dynamic feedback schemes" problem is indeed a lethal problem, which I think is not quite the same as Yudkowsky's #20, as you said.
"there's going to be systematic biases in the human ratings" is... technically correct, but I think a mi...
