DPO (Direct Preference Optimization)

What is DPO (Direct Preference Optimization)?

DPO is a method for aligning language models with human preferences by training directly on pairwise preference data (response A preferred over response B), with no separately trained reward model and no reinforcement learning loop. It recasts the RLHF objective as a simple classification-style loss on the policy itself, making it a simpler and often more stable alternative to RLHF.
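Concretely, for a prompt x with preferred response y_w and rejected response y_l, the published DPO loss is -log sigmoid(beta * ((log pi_theta(y_w|x) - log pi_ref(y_w|x)) - (log pi_theta(y_l|x) - log pi_ref(y_l|x)))), where pi_theta is the trainable policy, pi_ref is a frozen reference model (usually the pre-DPO checkpoint), and beta controls how far the policy may drift from the reference. Below is a minimal PyTorch sketch of that loss, assuming the sequence-level log-probabilities have already been computed; the function and variable names are illustrative, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of the chosen or
    rejected response under the trainable policy or the frozen
    reference model.
    """
    # Implicit rewards: how far the policy has moved from the
    # reference on each response, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of two preference pairs (made-up log-probabilities).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -11.0]),
    ref_chosen_logps=torch.tensor([-12.5, -10.0]),
    ref_rejected_logps=torch.tensor([-13.5, -10.5]),
)
print(loss)  # in real training, the policy log-probs would carry gradients
```

Because the loss depends only on log-probability ratios, the "reward model" is implicit in the policy itself, which is exactly what removes the separate reward-modeling stage of RLHF.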

Where did the term "DPO (Direct Preference Optimization)" come from?

The term was coined in the 2023 paper "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" by Rafailov et al. (NeurIPS 2023), which showed that the reward model in RLHF can be written in closed form in terms of the policy, so the policy can be optimized on preference data directly. The method drove a major shift in alignment practice through 2023 and 2024.

How is "DPO (Direct Preference Optimization)" used today?

DPO is now a standard post-training step for open-weight models: Hugging Face's Zephyr and Meta's Llama 3 instruct models, among others, used DPO on preference data as part of their alignment pipelines. A usage sketch follows below.
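In practice, most teams run DPO through an off-the-shelf trainer rather than hand-rolling the loss. The sketch below uses Hugging Face's trl library, whose DPOTrainer implements the loss shown earlier; the model name and tiny dataset are placeholders, and exact constructor arguments vary across trl versions (older releases take tokenizer= instead of processing_class=):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: each row pairs a prompt with a chosen and a rejected reply.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain DPO in one sentence."],
    "chosen": ["DPO aligns a model directly on preference pairs."],
    "rejected": ["DPO is a kind of database protocol."],
})

config = DPOConfig(output_dir="dpo-out", beta=0.1)  # beta = KL strength
trainer = DPOTrainer(
    model=model,                 # ref_model defaults to a frozen copy
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Leaving ref_model unset is the common choice: the trainer keeps a frozen copy of the starting model as the reference that the beta penalty anchors against.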

Related Terms

RLHF (Reinforcement Learning from Human Feedback), reward model, preference data