RLHF (Reinforcement Learning from Human Feedback) is a technique for aligning language models with human values. Human annotators compare or rank model outputs, those preferences are used to train a reward model, and the language model is then optimized against the learned reward signal with a reinforcement learning algorithm.
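To make the reward-modeling step concrete, here is a minimal sketch (not from the source) of fitting a reward model on pairwise preferences with the standard Bradley-Terry pairwise loss. The `RewardModel` class, the embedding dimension, and the random "response embeddings" are illustrative placeholders; a real system would score responses using the LLM's own hidden states.

```python
import torch
import torch.nn as nn

# Toy reward model: a linear head over fixed-size "response embeddings".
# In practice the embedding would come from the LLM's final hidden state;
# random vectors stand in here so the sketch runs on its own.
EMBED_DIM = 16

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Return a scalar reward per response embedding.
        return self.head(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / pairwise logistic loss: push the reward of the
    # human-preferred response above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch of human preference pairs (chosen vs. rejected embeddings).
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for step in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.4f}")
```

In the subsequent RL stage, the policy is typically optimized to maximize this learned reward while a KL penalty keeps it close to the original model, which limits reward hacking and preserves fluency.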
It was key to the success of InstructGPT and ChatGPT, helping ensure that models are helpful, harmless, and honest, and it has become the industry standard for post-training alignment of commercial LLMs.