Reinforcement Learning from Human Feedback (RLHF)

What is Reinforcement Learning from Human Feedback (RLHF)?

Reinforcement Learning from Human Feedback (RLHF) is a technique that uses human feedback to align a language model with human preferences. It consists of three steps: starting from a pre-trained language model, training a reward model on human preference data (typically rankings of candidate responses), and fine-tuning the language model with reinforcement learning to maximize the reward model's score.
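The reward-model step can be sketched with the Bradley-Terry pairwise objective commonly used in RLHF: the reward model is trained so that the human-preferred response scores higher than the rejected one. The scalar reward values below are hypothetical stand-ins for a real model's outputs.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair:
    -log sigmoid(r_chosen - r_rejected).
    The loss is small when the reward model already ranks the
    human-preferred response above the rejected one."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Hypothetical reward scores for two labeled preference pairs:
well_ranked = pairwise_reward_loss(2.0, 0.5)   # preferred response scored higher -> small loss
mis_ranked = pairwise_reward_loss(0.5, 2.0)    # preferred response scored lower -> large loss
print(well_ranked, mis_ranked)
```

Minimizing this loss over many labeled pairs pushes the reward model to assign higher scalar rewards to responses humans prefer; that learned reward then drives the reinforcement-learning fine-tuning step.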

Where did the term "Reinforcement Learning from Human Feedback (RLHF)" come from?

The approach traces to Christiano et al.'s 2017 paper "Deep Reinforcement Learning from Human Preferences," which trained RL agents from pairwise human comparisons rather than hand-designed reward functions. It was later applied to large language models, most prominently in OpenAI's InstructGPT work (2022), after which "RLHF" became the standard name for this alignment technique.

How is "Reinforcement Learning from Human Feedback (RLHF)" used today?

RLHF is used to align instruction-following assistants such as ChatGPT: human labelers rank model responses, a reward model learns those preferences, and the assistant is fine-tuned to produce responses that are helpful and harmless rather than merely likely continuations of text.

Related Terms