AI alignment is the field of research dedicated to ensuring that artificial intelligence systems pursue their intended goals and operate in accordance with human values and ethical principles. It addresses the 'value alignment problem': a capable AI may pursue a literally specified objective in destructive ways, for example through 'reward hacking' or 'specification gaming'. Researchers distinguish 'outer alignment' (whether the objective we specify captures what we actually want) from 'inner alignment' (whether the goal the trained model actually pursues matches that specified objective).
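The gap between a specified reward and the designer's intent can be shown with a minimal, hypothetical sketch (the coin scenario and all names here are illustrative, not from the source): the designer wants coins collected, but the reward as written pays for merely being near a coin, so a policy that loiters beside a coin outscores one that does the task.

```python
# Toy specification-gaming sketch (hypothetical scenario, not a real benchmark).
# Intended goal: bank coins. Specified reward: +1 per step spent near a coin.

def specified_reward(state):
    # Mis-specified objective: pays for proximity, not for banking coins.
    return 1.0 if state["near_coin"] else 0.0

def intended_value(state):
    # What the designer actually wanted: coins banked.
    return float(state["coins_banked"])

def run(policy, steps=10):
    state = {"near_coin": False, "coins_banked": 0}
    total = 0.0
    for _ in range(steps):
        state = policy(state)
        total += specified_reward(state)
    return total, intended_value(state)

def honest(state):
    # Walk to a coin, then bank it: near a coin only every other step.
    if state["near_coin"]:
        return {"near_coin": False, "coins_banked": state["coins_banked"] + 1}
    return {"near_coin": True, "coins_banked": state["coins_banked"]}

def gamer(state):
    # Stand next to a coin forever and never bank it.
    return {"near_coin": True, "coins_banked": state["coins_banked"]}

print(run(honest))  # → (5.0, 5.0): half the specified reward, real value delivered
print(run(gamer))   # → (10.0, 0.0): maximal specified reward, zero intended value
```

The proxy-maximizing policy strictly dominates on the written objective while producing nothing the designer wanted, which is the essence of reward hacking.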
The underlying concepts date back to Norbert Wiener (1960) and Asimov's Three Laws of Robotics, but were formalized more recently by researchers such as Stuart Russell and by organizations such as OpenAI and Anthropic.
Alignment is now a critical pillar of modern AI development, implemented via techniques such as RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI to make models helpful, honest, and harmless.
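At the core of RLHF is a reward model trained on human preference pairs. A common formulation (a sketch of the standard Bradley-Terry objective; the function name and example scores are illustrative) minimizes the negative log-probability that the human-preferred response scores higher than the rejected one:

```python
import math

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    # Small when the reward model ranks the preferred response well above
    # the rejected one; large when the ranking is wrong or too close.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin in favor of the chosen response yields a lower loss.
print(preference_loss(2.0, 0.0))  # smaller
print(preference_loss(0.5, 0.0))  # larger
```

In a full RLHF pipeline, `r_chosen` and `r_rejected` would be scalar scores produced by a learned reward model over response pairs, and the language model is then fine-tuned with reinforcement learning against that reward.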