PPO (Proximal Policy Optimization)

What is PPO (Proximal Policy Optimization)?

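A policy-gradient reinforcement learning algorithm widely used to fine-tune LLMs, typically as the optimization step in RLHF. It updates the model's policy to increase the reward score while limiting how far each update can move the policy from the previous one (staying "proximal" to the old policy), most commonly by clipping the probability ratio between the new and old policies.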
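
To make the "proximal" idea concrete, here is a minimal sketch of the clipped surrogate objective from the PPO paper, written in plain Python. The function name `ppo_clipped_objective`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, not part of any particular library; a real LLM fine-tuning setup would compute this per token on tensors and usually add value-function and KL terms.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the previous policy. advantages: advantage estimates.
    All names here are hypothetical and chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any gain from moving the ratio
        # outside the clipping range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage, raising the ratio above `1 + clip_eps` yields no further gain; with a negative advantage, lowering it below `1 - clip_eps` yields none either, which is what keeps each update close to the old policy.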

Where did the term "PPO (Proximal Policy Optimization)" come from?

Introduced by John Schulman and colleagues at OpenAI in the 2017 paper "Proximal Policy Optimization Algorithms".

How is "PPO (Proximal Policy Optimization)" used today?

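It is a widely used reinforcement learning step in RLHF pipelines: a reward model scores the LLM's outputs, and PPO updates the LLM to raise those scores. OpenAI used it to align GPT-3-based models (InstructGPT) and reports using it for GPT-4.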

Related Terms

RLHF
Reward Model