RLHF (Reinforcement Learning from Human Feedback) is a technique for aligning language models with human values. Human annotators compare or rank model outputs, those preferences are used to train a reward model, and the language model is then optimized against the learned reward signal with a reinforcement learning algorithm.
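To make the reward-modeling step concrete, here is a minimal sketch (not from the source) of fitting a reward model on pairwise preferences with the standard Bradley-Terry pairwise loss. The `RewardModel` class, the embedding dimension, and the random "response embeddings" are illustrative placeholders; a real system would score responses using the LLM's own hidden states.

```python
import torch
import torch.nn as nn

# Toy reward model: a linear head over fixed-size "response embeddings".
# In practice the embedding would come from the LLM's final hidden state;
# random vectors stand in here so the sketch runs on its own.
EMBED_DIM = 16

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Return a scalar reward per response embedding.
        return self.head(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry / pairwise logistic loss: push the reward of the
    # human-preferred response above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel(EMBED_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch of human preference pairs (chosen vs. rejected embeddings).
chosen = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for step in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.4f}")
```

In the subsequent RL stage, the policy is typically optimized to maximize this learned reward while a KL penalty keeps it close to the original model, which limits reward hacking and preserves fluency.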
It was key to the success of InstructGPT and ChatGPT, helping ensure that models are helpful, harmless, and honest, and it has become the industry standard for post-training alignment of commercial LLMs.