Reinforcement Learning from Human Feedback (RLHF) is a technique that uses human feedback to align a language model with human preferences. It consists of three steps: pre-training a language model, training a reward model from human-labeled data, and fine-tuning the language model with reinforcement learning.
It is a key technique for aligning large language models and was used to train models such as ChatGPT to be helpful and harmless.
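To make the second step concrete, below is a minimal sketch of reward-model training on human-labeled comparisons, assuming a toy linear reward head over fixed-size feature vectors and a Bradley-Terry-style pairwise loss; the class, variable names, and random stand-in data are illustrative, not taken from any particular implementation.

```python
# Minimal sketch of reward-model training (step two of RLHF).
# Assumes toy fixed-size feature vectors in place of real model
# representations; names and data here are hypothetical.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry) loss: push the reward of the
    # human-preferred response above that of the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

dim = 16
model = RewardModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Random tensors stand in for representations of the preferred
    # and rejected responses from human-labeled comparison data.
    chosen = torch.randn(8, dim)
    rejected = torch.randn(8, dim)
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the third step, the scalar output of a reward model like this one serves as the reinforcement-learning reward signal when fine-tuning the language model.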