Reinforcement Learning from Human Feedback (RLHF)

What is Reinforcement Learning from Human Feedback (RLHF)?

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains a 'reward model' directly from human feedback and then uses that model to optimize an agent's policy via reinforcement learning. In the context of Large Language Models (LLMs), it is used to align the model's outputs with human intent, making the model more helpful, harmless, and honest.
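The reward-model step described above is commonly trained on pairwise human preferences with a Bradley-Terry style loss, minimizing -log sigmoid(r(chosen) - r(rejected)). The sketch below is a minimal toy illustration of that idea, assuming responses are represented as simple feature vectors (in a real RLHF pipeline they would be an LLM's hidden states) and using a linear reward model; all names here are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): each "response" is a fixed feature vector.
# Chosen responses are drawn near +1, rejected responses near -1, so a
# good reward model should learn to score chosen above rejected.
dim = 4
pairs = [(rng.normal(size=dim) + 1.0, rng.normal(size=dim) - 1.0)
         for _ in range(50)]

w = np.zeros(dim)  # parameters of a linear reward model r(x) = w . x

def reward(x, w):
    """Scalar reward assigned to a response's features."""
    return x @ w

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the reward model with the pairwise preference loss:
#   L = -log sigmoid(r(chosen) - r(rejected))
# Gradient w.r.t. w: -(1 - sigmoid(margin)) * (chosen - rejected)
lr = 0.1
for _ in range(100):
    for chosen, rejected in pairs:
        margin = reward(chosen, w) - reward(rejected, w)
        grad = -(1.0 - sigmoid(margin)) * (chosen - rejected)
        w -= lr * grad

# After training, the reward model prefers the human-chosen response.
chosen, rejected = pairs[0]
print(reward(chosen, w) > reward(rejected, w))
```

In a full RLHF pipeline, this learned reward would then be maximized by a reinforcement learning algorithm (e.g., PPO) that updates the LLM's policy, typically with a penalty that keeps the policy close to the original model.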

Where did the term "Reinforcement Learning from Human Feedback (RLHF)" come from?

The technique was developed by researchers at OpenAI and DeepMind. It was notably described in the 2017 paper 'Deep Reinforcement Learning from Human Preferences' and refined for LLMs in OpenAI's 2022 InstructGPT paper, 'Training language models to follow instructions with human feedback'.

How is "Reinforcement Learning from Human Feedback (RLHF)" used today?

RLHF is a key technique behind the success of ChatGPT and other modern conversational AI assistants. It helped transform LLMs from raw next-token predictors into capable and safer assistants that follow user instructions.

Related Terms