Direct Preference Optimization (DPO) is a technique for aligning language models with human preferences. It simplifies the traditional Reinforcement Learning from Human Feedback (RLHF) pipeline by removing the need for an explicit reward model. Instead, DPO uses a simple classification-style loss that increases the likelihood of preferred responses relative to dispreferred ones, based on a dataset of human comparisons (e.g., 'Response A is better than Response B'). This makes the alignment process more stable, easier to implement, and less computationally expensive.
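The core loss can be sketched in a few lines. The following is a minimal illustration in PyTorch, assuming the caller has already computed the summed token log-probabilities of each chosen and rejected response under both the policy being trained and a frozen reference model; the function name and the default `beta` value are illustrative choices, not a fixed API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective.

    Each tensor holds per-example summed log-probabilities of a response
    given its prompt; beta controls how far the policy may drift from the
    frozen reference model.
    """
    # Implicit rewards: how much more (or less) likely each response is
    # under the policy than under the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: minimized when the policy
    # raises the preferred response's relative likelihood above the
    # dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

When the policy and reference assign identical likelihoods, the margin is zero and the loss equals log 2; as the policy learns to favor the chosen responses, the margin grows and the loss falls, with no reward model or reinforcement-learning loop involved.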
DPO was introduced in the 2023 paper 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model' by researchers from Stanford. It offered a groundbreaking, simplified alternative to the complex, multi-stage RLHF process, which involves training a separate reward model and then fine-tuning the language model with reinforcement learning. DPO demonstrated that the language model itself can be directly optimized on preference data, acting as its own reward model.
Since its introduction, DPO and its variants have been widely adopted for fine-tuning and aligning large language models. It has become a standard technique for creating instruction-tuned models, as seen in popular open-source models like Llama 3, Zephyr, and Tulu. Its stability and simplicity have made it a preferred choice over RLHF for many researchers and practitioners, significantly influencing the development of safer and more helpful AI.