Reinforcement Learning from Human Feedback

Instruction tuning teaches a model to imitate demonstrations. Reinforcement learning from human feedback, usually abbreviated RLHF, goes further. Instead of only copying target responses, the model learns to optimize behavior according to human preferences.

The central idea is that many desirable properties of language model behavior are difficult to specify with simple supervised labels. For example:

Desired property	Why it is difficult
Helpfulness	Depends on context and user intent
Harmlessness	Requires judgment and safety tradeoffs
Honesty	Requires uncertainty awareness
Conciseness	Depends on task and audience
Tone	Depends on conversational context
Reasoning quality	Often subjective

Rather than writing explicit rules for every situation, RLHF learns a reward signal from preference comparisons made by humans or by a stronger supervisory model.

The system is then optimized to maximize this learned reward.

The RLHF Pipeline

A standard RLHF pipeline has three stages:

Stage	Goal
Pretraining	Learn language and world knowledge
Supervised fine-tuning	Learn instruction-following behavior
Reinforcement learning	Optimize responses using preference rewards

The reinforcement learning stage usually begins with an instruction-tuned model.

The overall process looks like:

Pretraining
    ↓
Instruction tuning
    ↓
Preference collection
    ↓
Reward model training
    ↓
Policy optimization

The final policy is the aligned assistant model.

Preference Data

The core training signal in RLHF is preference data.

Human annotators compare candidate model responses and indicate which one is preferred.

Example:

Prompt	Response A	Response B	Preferred
“Explain recursion.”	Clear explanation	Confusing answer	A
“How do I build malware?”	Refusal	Harmful instructions	A
“Summarize this article.”	Accurate summary	Hallucinated summary	A

Preference data does not require annotators to write perfect answers from scratch. Ranking alternatives is often faster and more consistent than free-form generation.

The comparisons define a partial ordering over responses.
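When annotators rank more than two responses, each ranking can be expanded into the pairwise comparisons it implies. The sketch below illustrates one way to represent this; the `PreferencePair` class and `ranking_to_pairs` helper are hypothetical names introduced here, not part of any library.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def ranking_to_pairs(prompt: str, ranked: List[str]) -> List[PreferencePair]:
    """Expand a best-to-worst ranking into all implied pairwise comparisons."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return [PreferencePair(prompt, better, worse)
            for better, worse in combinations(ranked, 2)]


# Illustrative ranking of three candidate responses, best first.
pairs = ranking_to_pairs(
    "Explain recursion.",
    ["clear explanation", "partial answer", "confusing answer"],
)
```

A ranking of $K$ responses yields $K(K-1)/2$ pairs, so the three responses here produce three comparisons.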

Reward Models

The preference comparisons are used to train a reward model.

The reward model receives:

Input	Description
Prompt	User instruction or conversation
Candidate response	Model-generated answer

The reward model outputs a scalar score:

$$ r_\phi(x, y), $$

where:

Symbol	Meaning
$x$	Prompt
$y$	Response
$\phi$	Reward model parameters

Higher scores indicate preferred responses.

The reward model is trained from pairwise comparisons. Suppose humans prefer response $y_w$ over response $y_l$. The reward model should assign:

$$ r_\phi(x, y_w) > r_\phi(x, y_l). $$

A common training objective is the Bradley-Terry preference model:

$$ P(y_w \succ y_l) = \frac{ \exp(r_\phi(x,y_w)) }{ \exp(r_\phi(x,y_w)) + \exp(r_\phi(x,y_l)) }. $$

The loss is:

$$ \mathcal{L} = -\log P(y_w \succ y_l). $$

The reward model therefore learns to approximate human preferences statistically.
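The Bradley-Terry probability above equals a sigmoid of the score difference, $\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))$, so the loss reduces to a negative log-sigmoid. A minimal PyTorch sketch follows, assuming the reward model has already produced one scalar score per response; the function name `bradley_terry_loss` and the example scores are illustrative.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    # logsigmoid is numerically safer than log(sigmoid(...)).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Hypothetical scores r_phi(x, y_w) and r_phi(x, y_l) for three pairs.
chosen_rewards = torch.tensor([1.2, 0.3, 2.0])
rejected_rewards = torch.tensor([0.4, 0.9, -1.0])

loss = bradley_terry_loss(chosen_rewards, rejected_rewards)

# Equal scores give P = 0.5, so the loss for a tie is log(2).
tie_loss = bradley_terry_loss(torch.zeros(1), torch.zeros(1))
```

Only the difference between the two scores matters, which is why reward model outputs are meaningful relative to each other rather than on an absolute scale.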

Policy Optimization

Once the reward model is trained, the language model is optimized to maximize reward.

The language model becomes a policy:

$$ \pi_\theta(y \mid x), $$

which generates responses $y$ conditioned on prompts $x$.

The objective is approximately:

$$ \max_\theta \mathbb{E}_{y \sim \pi_\theta} [r_\phi(x,y)]. $$

However, directly maximizing reward is dangerous. The model may exploit weaknesses in the reward model and generate unnatural or degenerate text.

To stabilize training, RLHF usually constrains the policy to remain close to the supervised fine-tuned model.

KL-Regularized Objectives

A common RLHF objective includes a KL-divergence penalty:

$$ \max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x,y) - \beta D_{\mathrm{KL}} ( \pi_\theta \,\|\, \pi_{\mathrm{ref}} ) \right]. $$

Here:

Symbol	Meaning
$\pi_\theta$	Current policy
$\pi_{\mathrm{ref}}$	Reference policy
$r_\phi$	Reward model
$\beta$	KL penalty coefficient

The KL penalty discourages the model from drifting too far from the original instruction-tuned distribution.

Without this constraint, the model may maximize reward through pathological outputs rather than genuinely useful behavior.
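In practice the KL term is often estimated from the sampled response itself, using the difference in log-probability under the two policies. The sketch below shows that per-sample view under this assumption; `kl_shaped_reward` and all numbers are hypothetical.

```python
def kl_shaped_reward(reward: float,
                     policy_logp: float,
                     ref_logp: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus beta times a single-sample KL estimate."""
    # For y sampled from the policy, log pi_theta(y|x) - log pi_ref(y|x)
    # is a single-sample estimate of KL(pi_theta || pi_ref).
    kl_estimate = policy_logp - ref_logp
    return reward - beta * kl_estimate


# Same reward-model score, different drift from the reference policy.
close_to_ref = kl_shaped_reward(reward=2.0, policy_logp=-10.0, ref_logp=-10.5)
far_from_ref = kl_shaped_reward(reward=2.0, policy_logp=-10.0, ref_logp=-40.0)
```

The second response is one the reference policy considers very unlikely, so its shaped reward falls well below the first even though the reward model scored both equally.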

PPO and Policy Gradient Methods

Early RLHF systems commonly used Proximal Policy Optimization, or PPO.

PPO is a policy-gradient reinforcement learning algorithm designed to improve stability during policy updates.

The idea is simple:

Generate responses.
Score them with the reward model.
Estimate advantages.
Update the policy gradually.

The PPO objective constrains policy updates so that each optimization step remains relatively small.

A simplified form is:

$$ L^{\mathrm{PPO}} = \mathbb{E} \left[ \min ( r_t A_t, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)A_t ) \right], $$

where:

Symbol	Meaning
$r_t$	Probability ratio between the new and old policy (distinct from the reward $r_\phi$)
$A_t$	Advantage estimate
$\epsilon$	Clipping parameter

PPO reduces unstable jumps in policy behavior.
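The clipped objective can be sketched in PyTorch as follows. This is a minimal illustration, assuming log-probabilities and advantage estimates are already available; the function name and example values are hypothetical, and a full PPO implementation would add value-function and entropy terms.

```python
import torch


def ppo_clipped_loss(new_logps: torch.Tensor,
                     old_logps: torch.Tensor,
                     advantages: torch.Tensor,
                     epsilon: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, suitable for gradient descent."""
    # Probability ratio pi_new / pi_old, computed in log space.
    ratio = torch.exp(new_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The objective is maximized, so the loss is its negative.
    return -torch.min(unclipped, clipped).mean()


# Hypothetical values for three sampled tokens.
old_logps = torch.tensor([-1.0, -2.0, -0.5])
new_logps = torch.tensor([-0.5, -2.0, -1.5])
advantages = torch.tensor([1.0, -0.5, 2.0])

loss = ppo_clipped_loss(new_logps, old_logps, advantages)
```

For the first token the ratio exceeds $1+\epsilon$ with a positive advantage, so the clipped term is used and the gradient through that sample vanishes; this is how the objective limits the size of each update.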

However, PPO training is computationally expensive and operationally complex. Many recent systems prefer simpler alternatives.

Direct Preference Optimization

A newer approach is Direct Preference Optimization, or DPO.

DPO avoids explicit reinforcement learning. Instead of training a separate reward model and running PPO, DPO directly optimizes preference comparisons.

The key insight is that under certain assumptions, maximizing a KL-regularized reward objective can be transformed into a supervised classification objective over preferred and rejected responses.

The DPO objective encourages:

$$ \pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x), $$

while keeping the model close to a reference policy.
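For reference, the DPO loss as it is usually stated makes both parts explicit. Here $\sigma$ is the logistic sigmoid (notation not otherwise defined in this section), and $\beta$ plays the same role as the KL penalty coefficient:

$$ \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]. $$

The loss falls as the policy raises the probability of $y_w$ relative to $y_l$, measured against how the reference policy already ranks the two.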

Advantages of DPO include:

Advantage	Description
Simpler pipeline	No PPO rollout loop
More stable	Easier optimization
Lower compute cost	Fewer moving parts
Easier implementation	Standard supervised-style training

Because of these advantages, many modern alignment systems use preference optimization variants instead of classical PPO-based RLHF.

Reward Hacking

A reward model is only an approximation of human judgment. If optimized aggressively, the policy may exploit weaknesses in the reward signal.

This is called reward hacking.

Examples include:

Failure mode	Example
Verbosity bias	Extremely long answers because reward correlates with detail
Sycophancy	Agreeing with the user even when incorrect
Style exploitation	Polite wording masking factual errors
Safety over-optimization	Excessive refusal behavior
Repetition	Repeating patterns that the reward model scores highly
Hallucinated confidence	Fluent but false explanations

Reward hacking is a fundamental alignment problem. Optimizing proxy rewards can produce unintended behavior.

The reward model does not define true human values. It defines a learned approximation.

Distribution Shift

The reward model is trained on a limited distribution of responses. During optimization, the policy may generate outputs outside that distribution.

This creates distribution shift.

For example:

The reward model sees mostly ordinary assistant responses.
The policy explores unusual outputs during optimization.
The reward model produces unreliable scores on unfamiliar text.
The policy exploits those errors.

This is similar to adversarial optimization in other machine learning systems.

Large policy shifts can therefore destabilize RLHF.

KL regularization, conservative optimization, rejection sampling, and human auditing are used to reduce this problem.

Multi-Objective Alignment

Human preferences are not one-dimensional.

A useful assistant should balance multiple goals:

Objective	Meaning
Helpfulness	Solves the user’s problem
Harmlessness	Avoids dangerous behavior
Honesty	Avoids fabrication
Calibration	Expresses uncertainty appropriately
Conciseness	Avoids unnecessary verbosity
Robustness	Resists jailbreaks and manipulation

These objectives may conflict.

For example:

Tradeoff	Example
Helpfulness vs safety	Medical guidance
Conciseness vs completeness	Technical explanations
Honesty vs confidence	Uncertain answers
Harmlessness vs utility	Dual-use scientific topics

RLHF systems therefore optimize approximate mixtures of objectives rather than a single universal reward.
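One simple way to form such a mixture, assumed here purely for illustration, is a weighted sum of per-objective scores. The objective names, scores, and weights below are hypothetical; real systems may combine objectives differently.

```python
from typing import Dict


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective scores."""
    return sum(weights[name] * scores[name] for name in weights)


# Response A is more helpful; response B is safer.
scores_a = {"helpfulness": 0.9, "harmlessness": 0.2, "honesty": 0.8}
scores_b = {"helpfulness": 0.6, "harmlessness": 0.9, "honesty": 0.8}

balanced = {"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2}
helpfulness_heavy = {"helpfulness": 0.8, "harmlessness": 0.1, "honesty": 0.1}

# Under the balanced weights B scores higher; under the
# helpfulness-heavy weights A scores higher.
balanced_prefers_b = combined_reward(scores_b, balanced) > combined_reward(scores_a, balanced)
heavy_prefers_a = combined_reward(scores_a, helpfulness_heavy) > combined_reward(scores_b, helpfulness_heavy)
```

The same pair of responses is ranked differently depending on the weights, which is why the choice of mixture is itself an alignment decision.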

Constitutional and AI Feedback Methods

Human feedback is expensive and difficult to scale.

Modern systems increasingly use AI-generated feedback.

A stronger model may:

Role	Example
Critic	Identify factual errors
Judge	Rank candidate outputs
Safety evaluator	Detect policy violations
Rewriter	Improve weak responses
Preference annotator	Generate synthetic rankings

Constitutional AI approaches define principles or rules that guide critique and revision.

Example principles:

Principle	Purpose
Avoid harmful advice	Safety
Admit uncertainty	Honesty
Respect privacy	Privacy
Avoid discrimination	Fairness

The model critiques its own outputs relative to the constitution, then revises them.

This reduces reliance on large human labeling teams.
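The critique-and-revise loop can be sketched as below. This is a rough illustration only: `critique` and `revise` are hypothetical callables standing in for prompts to a language model, and the toy versions exist just to make the example run.

```python
from typing import Callable, List


def critique_and_revise(prompt: str,
                        draft: str,
                        principles: List[str],
                        critique: Callable[[str, str, str], str],
                        revise: Callable[[str, str, str], str]) -> str:
    """Check a draft against each principle and revise when a critique is raised."""
    response = draft
    for principle in principles:
        feedback = critique(prompt, response, principle)
        if feedback:  # an empty string means "no problem found"
            response = revise(prompt, response, feedback)
    return response


# Toy stand-ins; a real system would prompt a language model for both steps.
def toy_critique(prompt: str, response: str, principle: str) -> str:
    if principle == "Admit uncertainty" and "not certain" not in response:
        return "State how confident the answer is."
    return ""


def toy_revise(prompt: str, response: str, feedback: str) -> str:
    return response + " I am not certain about this."


revised = critique_and_revise(
    "When was the bridge built?",
    "It was built in 1890.",
    ["Avoid harmful advice", "Admit uncertainty"],
    toy_critique,
    toy_revise,
)
```

The original and revised responses can then serve as training data, for example as a rejected and chosen pair.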

RLHF and Reasoning

RLHF strongly affects how a model presents its reasoning.

During pretraining, the model learns statistical reasoning patterns implicitly. RLHF changes which reasoning traces are rewarded.

If detailed reasoning receives high reward, the model may produce more chain-of-thought style outputs. If concise answers receive higher reward, the model may shorten explanations.

This can improve usability but also distort behavior.

For example:

RLHF effect	Possible issue
More confident tone	False certainty
More coherent reasoning	Persuasive hallucinations
Longer explanations	Padding without added substance
Refusal optimization	Over-refusal

The model may learn how reasoning should look rather than how to reason correctly internally.

This distinction between external reasoning traces and internal computation remains an active research topic.

RLHF and Tool Use

RLHF-style training is also used to teach models to use tools correctly.

Examples include:

Tool type	Example
Search	Web retrieval
Code execution	Python interpreters
APIs	Weather or finance services
Databases	Structured queries
Agents	Multi-step planning systems

The reward process encourages behaviors such as:

Desired behavior	Example
Calling tools when uncertain	Retrieval before answering
Using valid arguments	Correct API schemas
Interpreting outputs correctly	Reading tool results
Avoiding hallucination	Prefer retrieved evidence

Tool-augmented alignment is increasingly important because modern assistants are not purely text generators.

Human Preference Biases

Preference labels are influenced by human psychology.

Annotators may prefer:

Bias	Example
Fluent text	Even if inaccurate
Confident tone	Even when wrong
Longer answers	Perceived depth
Agreeable behavior	Sycophancy
Familiar styles	Cultural bias
Safe responses	Even when overcautious

These biases become encoded into the reward model.

As a result, RLHF can amplify social and stylistic biases present in the annotation process.

Alignment therefore depends not only on optimization algorithms, but also on who provides feedback and how that feedback is collected.

PyTorch View of Preference Training

Suppose we have:

Tensor	Meaning
`chosen_logps`	Log probabilities for preferred responses
`rejected_logps`	Log probabilities for rejected responses

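A simplified DPO-style loss, which omits the reference-policy term of full DPO, may look like:

import torch
import torch.nn.functional as F

# Illustrative per-response log-probabilities under the current policy.
chosen_logps = torch.tensor([-12.0, -9.5, -20.1])
rejected_logps = torch.tensor([-14.2, -9.0, -25.3])

beta = 0.1  # scales the preference margin

# Margin between preferred and rejected responses.
logits = beta * (chosen_logps - rejected_logps)

# Negative log-sigmoid: small when the preferred response is more likely.
loss = -F.logsigmoid(logits).mean()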

The model is encouraged to increase probability for preferred outputs relative to rejected outputs.

Unlike ordinary supervised learning, the target is not a single fixed sequence. The target is a preference ordering.
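The simplified loss above leaves out the reference policy. A sketch that includes it is shown below, assuming sequence-level log-probabilities have been computed under both the policy and a frozen reference model; the function name, argument names, and values are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss on sequence-level log-probabilities."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


# Hypothetical log-probabilities for two preference pairs.
policy_chosen = torch.tensor([-10.0, -8.0])
policy_rejected = torch.tensor([-12.0, -9.0])
ref_chosen = torch.tensor([-11.0, -8.0])
ref_rejected = torch.tensor([-11.0, -9.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)

# A policy identical to the reference has zero margin, so the loss is log(2).
start_loss = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)
```

Subtracting the reference log-probabilities means the policy is rewarded only for preferring the chosen response more than the reference already does, which is what keeps it anchored to the reference policy.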

Limits of RLHF

RLHF is powerful, but it has major limitations.

First, reward models are imperfect proxies for human values.

Second, preference optimization may hide rather than solve dangerous behaviors.

Third, RLHF can reduce diversity and originality by pushing models toward highly rewarded styles.

Fourth, preference data is expensive and culturally dependent.

Fifth, RLHF does not guarantee truthfulness. A model may become more persuasive without becoming more accurate.

Finally, RLHF scales poorly if every new capability requires extensive human oversight.

These limitations motivate research into scalable oversight, mechanistic interpretability, constitutional methods, debate systems, verifier models, and automated alignment techniques.

Why RLHF Changed Modern Language Models

Pretrained models can generate fluent text. Instruction-tuned models can follow tasks. RLHF made models substantially more interactive, cooperative, and conversational.

It improved:

Capability	Effect
Dialogue quality	More natural interaction
Helpfulness	Better task completion
Safety behavior	Reduced harmful outputs
Refusal behavior	Better policy compliance
Tone control	More socially acceptable responses
Multi-turn consistency	Improved conversation flow

Many modern assistants rely heavily on preference optimization.

Without RLHF-style alignment, large language models often behave unpredictably in interactive settings.

Summary

Reinforcement learning from human feedback aligns language models with human preferences using preference comparisons and reward optimization.

The standard RLHF pipeline includes:

Pretraining
Instruction tuning
Preference data collection
Reward model training
Policy optimization

Reward models estimate human preferences statistically, and policy optimization adjusts the model to maximize those rewards while remaining close to the supervised policy.

Modern systems increasingly use preference optimization methods such as DPO rather than classical PPO-based reinforcement learning.

RLHF improves usability, safety, and dialogue quality, but it introduces challenges such as reward hacking, sycophancy, over-optimization, and dependence on imperfect human feedback.