DPO (Direct Preference Optimization) is a method to align language models to human preferences by optimizing the model directly on pairwise preference data (a chosen vs. a rejected response) without training a separate reward model, offering a simpler and often more stable alternative to RLHF.
Introduced in 2023, it drove a major shift in alignment methodology through 2024.
Now standard for aligning open-source instruction-tuned models such as Llama 3 Instruct.
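
A minimal PyTorch sketch of the DPO objective, assuming per-sequence log-probabilities have already been computed for the policy and a frozen reference model; the function and argument names (dpo_loss, beta, etc.) are illustrative, not taken from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,): the summed log-prob
    the policy / frozen reference model assigns to the chosen or
    rejected response given the prompt. beta controls how far the
    policy may drift from the reference model.
    """
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: push chosen above rejected.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

if __name__ == "__main__":
    # Toy example with random log-probs for a batch of 4 preference pairs.
    torch.manual_seed(0)
    pc, pr, rc, rr = (torch.randn(4) for _ in range(4))
    print(dpo_loss(pc, pr, rc, rr).item())
```

Because the loss depends only on log-probability differences against a fixed reference, training reduces to supervised-style optimization on preference pairs, with no sampling or reward-model inference in the loop.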