A reinforcement learning algorithm used to fine-tune LLMs. It updates the model's policy to maximize expected reward while limiting how far each update can move the policy from the previous one (staying 'proximal' to the old policy), typically by clipping the probability ratio between the new and old policies.
Introduced by OpenAI (Schulman et al., 2017).
The standard RL algorithm in RLHF pipelines for aligning models such as GPT-3 (as InstructGPT) and GPT-4.
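
The 'proximal' constraint in the definition above is most often implemented as a clipped surrogate objective. The following is a minimal sketch of that objective in plain Python, not a full training loop. The function name `ppo_clipped_objective`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions, and the advantages are assumed to be precomputed elsewhere.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the previous policy. advantages: precomputed advantage
    estimates. All names here are hypothetical, chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the smaller term removes the incentive to push the ratio
        # outside the clip range in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, with a positive advantage and a ratio of 2, the sample contributes 1.2 times the advantage rather than 2 times, so there is no gain from moving further from the old policy. RLHF setups for LLMs commonly add a KL penalty against a reference model on top of this; that term is not part of this sketch.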