Quick Concepts: Reinforcement Learning from Human Feedback
What is RLHF and why is it important?
RLHF, or Reinforcement Learning from Human Feedback, is a method of AI model training in which human feedback is used indirectly to fine-tune a model. RLHF is a form of reinforcement learning, which is neither supervised nor unsupervised but a third, distinct type of machine learning. In RLHF, a pretrained model is fine-tuned using a separate reward model that has itself been trained on human feedback. The main model seeks to maximize the reward (score) it receives from the reward model, improving its outputs in the process. OpenAI recently used RLHF to train its InstructGPT and ChatGPT models, sparking widespread interest in this training method.
RLHF is a powerful way to help align AI with human intent. Large language models trained on a vast corpus of internet data can generate strange, inaccurate, and even harmful responses, especially when coaxed by users to do so. RLHF uses human input to provide guidance that narrows the field of possible model responses. In this way, models can get closer to OpenAI’s goal of being “helpful, honest, and harmless” at all times.
How does RLHF work?
RLHF has three phases. The first phase is selecting a pretrained model as the main model. For language models in particular, starting from a pretrained model is key because of the immense volume of data and compute required to train one from scratch.
The second phase involves creating a second model, called the reward model. The reward model is trained on human feedback: annotators are shown two or more samples of model-generated output and asked to rank them by quality. These rankings are converted into a scoring signal that the reward model uses to evaluate the performance of the main model.
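One common way to turn those rankings into a training signal is a pairwise (Bradley-Terry style) loss: the reward model is penalized whenever it scores the human-preferred output below the rejected one. The sketch below, in plain Python for clarity, shows the loss for a single comparison; the function name and the example scores are illustrative, not part of any particular library.

```python
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise ranking loss for one human comparison.

    Computes -log(sigmoid(score_chosen - score_rejected)): small when the
    reward model already scores the human-preferred output higher, large
    when it ranks the pair the wrong way around.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ordered pair -> small loss; inverted pair -> large loss.
low = pairwise_ranking_loss(2.0, -1.0)
high = pairwise_ranking_loss(-1.0, 2.0)
```

Averaging this loss over many human-ranked pairs is what teaches the reward model to imitate the annotators' quality judgments.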
In the third phase of RLHF, the reward model is fed outputs from the main model and returns a quality score that tells the main model how well it has performed. The main model uses this feedback to improve its performance on future tasks.
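In practice, the score the main model optimizes in this third phase is usually not the raw reward-model output alone. InstructGPT-style setups also subtract a KL penalty that keeps the fine-tuned model from drifting too far from the original pretrained model (which is one way reward hacking is discouraged). That penalty is not described above, so the sketch below should be read as an assumption about a typical implementation; `beta` and the log-probability inputs are illustrative.

```python
def rlhf_reward(rm_score: float,
                logprob_policy: float,
                logprob_ref: float,
                beta: float = 0.1) -> float:
    """Combined reward used during RL fine-tuning (illustrative).

    rm_score: the reward model's quality score for a sampled output.
    logprob_policy / logprob_ref: log-probability of that output under
    the fine-tuned model and the frozen pretrained reference model.
    beta: strength of the KL penalty (hypothetical default).
    """
    # Sample-based estimate of the KL divergence between the fine-tuned
    # policy and the reference model for this output.
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

When the fine-tuned model assigns the same log-probability as the reference model, the penalty vanishes and the model simply receives the reward model's score; the more it drifts, the more that score is discounted.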
Here is how OpenAI described the process of using RLHF to train an ML model to simulate a robot doing a backflip:
Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal—in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human’s judgments. It then uses RL to learn how to achieve that goal. As its behavior improves, it continues to ask for human feedback on trajectory pairs where it’s most uncertain about which is better, and further refines its understanding of the goal.
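The last sentence of that description, asking humans about the trajectory pairs the agent is "most uncertain about," can be sketched as a simple active-learning step: score both clips with the current reward estimate, convert the score gap into a preference probability, and query the pair whose probability is closest to a coin flip. This is a minimal illustration assuming a Bradley-Terry preference model; all names are hypothetical.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def most_uncertain_pair(pairs, reward_fn):
    """Pick the pair of clips the current reward estimate is least sure
    about, i.e. whose predicted preference is closest to 50/50. That
    pair is the most informative one to show a human annotator next."""
    return min(
        pairs,
        key=lambda p: abs(preference_probability(reward_fn(p[0]), reward_fn(p[1])) - 0.5),
    )

# With scores standing in for clips, the near-tied pair is selected.
pairs = [(0.0, 3.0), (1.0, 1.1), (2.0, -2.0)]
chosen = most_uncertain_pair(pairs, reward_fn=lambda clip: clip)
```

Querying humans only on near-ties is what lets this approach learn a usable reward function from relatively few comparisons.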
Current Challenges in RLHF
There are still a number of challenges and limitations to consider when using RLHF, such as:
- Even with fine-tuning, model responses can still be inaccurate or toxic.
- Human involvement is expensive and slow compared to unsupervised learning. As RLHF processes continue to develop, qualified, efficient human annotators will remain critical.
- Human feedback is not infallible. Sometimes ML models devise workarounds that fool human annotators and get undeservedly high scores from reward models.
- Humans can disagree on evaluations, making the reward model’s training data messy or noisy.
- Reward models may not reflect the values or norms of people who are different from their human trainers, i.e., they may be biased towards certain cultural, linguistic, or socioeconomic groups.
These limitations notwithstanding, RLHF is an exciting avenue for progress in aligning ML outputs with human intent. With further research, development, and refinement, RLHF is likely to make AI models safer, more reliable, and more beneficial for users.