Reinforcement Learning from Human Feedback (RLHF)
Definition
A fine-tuning technique in which AI models are trained on human feedback about output quality, safety, and helpfulness. Human evaluators rank model outputs, the rankings are used to train a reward model, and the base model is then optimized against that reward model (typically with a policy-gradient algorithm such as PPO). RLHF is used by OpenAI, Anthropic, and Google to align models with human preferences.
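To make the reward-model step concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry) objective commonly used to train reward models from human rankings. The architecture, embeddings, and data are placeholders for illustration, not any vendor's actual pipeline.

```python
import torch
import torch.nn as nn

# Toy reward model: scores a fixed-size embedding of a (prompt, response) pair.
# In practice the reward model is a fine-tuned language model; a small MLP
# over random features stands in for it here.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical preference data: for each comparison, `chosen` is the response
# a human evaluator ranked above `rejected` (random tensors stand in for
# real embeddings).
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

for _ in range(100):
    r_chosen = reward_model(chosen)      # scalar reward per preferred response
    r_rejected = reward_model(rejected)  # scalar reward per dispreferred response
    # Bradley-Terry pairwise loss: push the preferred response's reward
    # above the dispreferred one's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```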
Why It Matters
RLHF improves model behavior on average but doesn't eliminate tail risk. Models trained with RLHF can still produce harmful outputs in edge cases, novel contexts, or under adversarial conditions. RLHF reduces the frequency of failures but doesn't prevent them — it shifts the distribution, not the boundary.
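A back-of-the-envelope illustration of why a lower failure rate is a shifted distribution rather than a boundary: the rates and call volume below are assumed numbers, not measured RLHF results.

```python
# Illustrative only: made-up per-output failure rates before and after RLHF.
p_before, p_after = 0.01, 0.0001
n_calls = 1_000_000  # outputs served in production

for label, p in [("before RLHF", p_before), ("after RLHF", p_after)]:
    expected_failures = p * n_calls
    p_at_least_one = 1 - (1 - p) ** n_calls
    print(f"{label}: ~{expected_failures:.0f} expected failures, "
          f"P(at least one) = {p_at_least_one:.6f}")

# Even at the improved rate, ~100 failures are expected at this volume,
# and at least one failure is essentially certain.
```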
How Exogram Addresses This
RLHF makes models better. Exogram makes execution governed. They solve different problems: RLHF reduces the probability of harmful outputs through training, while Exogram blocks harmful actions through deterministic infrastructure. The result is a 0% false negative rate on governed actions rather than a probabilistic improvement.
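For contrast, here is a minimal sketch of what deterministic, deny-by-default enforcement looks like at the execution layer. The allowlist and `execute` function are hypothetical illustrations, not Exogram's actual API.

```python
# Hypothetical deny-by-default gate: the decision is a set lookup, not a
# model judgment, so the same input always yields the same allow/deny result.
ALLOWED_ACTIONS = {"read_document", "summarize", "search"}

def execute(action: str, payload: dict) -> dict:
    # Any action not explicitly permitted is blocked, regardless of how
    # the request was phrased or what the model intended.
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"Action '{action}' is not on the allowlist")
    return {"status": "executed", "action": action}

execute("summarize", {"doc_id": "123"})   # permitted
# execute("delete_database", {})          # would raise PermissionError
```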
Key Takeaways
- RLHF is one layer in the broader AI governance landscape, not a complete safety solution
- Production AI requires multiple layers of protection, because training-time alignment leaves tail risk
- Deterministic enforcement at execution time provides zero-error-rate guarantees that probabilistic training cannot