Reinforcement Learning from Human Feedback (RLHF) is a technique that uses human feedback to align a language model with human preferences. It consists of three steps: pre-training a language model, training a reward model from human-labeled data, and fine-tuning the language model with reinforcement learning.
It is a key technique for aligning large language models and was used to train models such as ChatGPT to be helpful and harmless.
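To make the second step concrete, below is a minimal sketch of reward-model training on human-labeled comparisons, assuming a toy linear reward head over fixed-size feature vectors and a Bradley-Terry-style pairwise loss; the class, variable names, and random stand-in data are illustrative, not taken from any particular implementation.

```python
# Minimal sketch of reward-model training (step two of RLHF).
# Assumes toy fixed-size feature vectors in place of real model
# representations; names and data here are hypothetical.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise (Bradley-Terry) loss: push the reward of the
    # human-preferred response above that of the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

dim = 16
model = RewardModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Random tensors stand in for representations of the preferred
    # and rejected responses from human-labeled comparison data.
    chosen = torch.randn(8, dim)
    rejected = torch.randn(8, dim)
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the third step, the scalar output of a reward model like this one serves as the reinforcement-learning reward signal when fine-tuning the language model.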