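A separate model trained to predict human preferences between candidate outputs (e.g., 'Response A is better than Response B'). It serves as the 'judge' or scorekeeper during the reinforcement learning phase of RLHF (Reinforcement Learning from Human Feedback), scoring each response so that the model being trained can be optimized toward higher-scoring outputs.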
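A crucial component of RLHF: the model being trained is optimized directly against its scores.
Often the bottleneck of high-quality model alignment, since errors in its scores carry over into the model trained on them.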
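As a rough illustration of how predicting a preference can become a training signal, here is a minimal sketch assuming the Bradley-Terry pairwise formulation, which is a common choice but is not stated in this entry. The function names `pairwise_loss` and `preference_probability` are hypothetical, and plain floats stand in for the scalar scores a real reward model would produce.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Assumes the Bradley-Terry model: P(chosen > rejected) = sigmoid(margin).
    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Predicted probability that response A is preferred over response B."""
    return math.exp(-pairwise_loss(reward_a, reward_b))


# Hypothetical scores for two responses to the same prompt.
score_a, score_b = 1.8, 0.3
print(preference_probability(score_a, score_b))  # above 0.5: A is favored
print(pairwise_loss(score_a, score_b))           # small when the ranking matches the label
print(pairwise_loss(score_b, score_a))           # larger when the ranking is reversed
```

Under this assumption, training lowers the loss by widening the score gap in favor of whichever response the human labeler preferred; only the difference between the two scores matters, not their absolute values.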