Reinforcement Learning from Human Feedback

Instruction tuning teaches a model to imitate demonstrations. Reinforcement learning from human feedback, usually abbreviated RLHF, goes further. Instead of only copying target responses, the model learns to optimize behavior according to human preferences.

The central idea is that many desirable properties of language model behavior are difficult to specify with simple supervised labels. For example:

| Desired property | Why it is difficult |
| --- | --- |
| Helpfulness | Depends on context and user intent |
| Harmlessness | Requires judgment and safety tradeoffs |
| Honesty | Requires uncertainty awareness |
| Conciseness | Depends on task and audience |
| Tone | Depends on conversational context |
| Reasoning quality | Often subjective |

Rather than writing explicit rules for every situation, RLHF learns a reward signal from preference comparisons made by humans or by a stronger supervisory model.

The system is then optimized to maximize this learned reward.

The RLHF Pipeline

A standard RLHF pipeline has three stages:

| Stage | Goal |
| --- | --- |
| Pretraining | Learn language and world knowledge |
| Supervised fine-tuning | Learn instruction-following behavior |
| Reinforcement learning | Optimize responses using preference rewards |

The reinforcement learning stage usually begins with an instruction-tuned model.

With the reinforcement learning stage expanded into its substeps, the overall process looks like this:

Pretraining
    ↓
Instruction tuning
    ↓
Preference collection
    ↓
Reward model training
    ↓
Policy optimization

The final policy is the aligned assistant model.

Preference Data

The core training signal in RLHF is preference data.

Human annotators compare candidate model responses and indicate which one is preferred.

Example:

| Prompt | Response A | Response B | Preferred |
| --- | --- | --- | --- |
| “Explain recursion.” | Clear explanation | Confusing answer | A |
| “How do I build malware?” | Refusal | Harmful instructions | A (refusal) |
| “Summarize this article.” | Accurate summary | Hallucinated summary | A (accurate summary) |

Preference data does not require annotators to write perfect answers from scratch. Ranking alternatives is often faster and more consistent than free-form generation.

The comparisons define a partial ordering over responses.
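As a minimal sketch, one comparison could be stored as a small record. The `PreferencePair` class and the `from_ab_label` helper below are hypothetical names chosen for illustration, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def from_ab_label(prompt: str, response_a: str, response_b: str, preferred: str) -> PreferencePair:
    """Convert an annotator's 'A' or 'B' label into a (chosen, rejected) pair."""
    if preferred == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if preferred == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"preferred must be 'A' or 'B', got {preferred!r}")


pair = from_ab_label("Explain recursion.", "Clear explanation", "Confusing answer", "A")
print(pair.chosen)
```

Storing the winner and loser explicitly, rather than the raw A/B label, keeps later training code independent of the order in which responses were shown to the annotator.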

Reward Models

The preference comparisons are used to train a reward model.

The reward model receives:

| Input | Description |
| --- | --- |
| Prompt | User instruction or conversation |
| Candidate response | Model-generated answer |

The reward model outputs a scalar score:

$$ r_\phi(x, y), $$

where:

| Symbol | Meaning |
| --- | --- |
| $x$ | Prompt |
| $y$ | Response |
| $\phi$ | Reward model parameters |

Higher scores indicate preferred responses.

The reward model is trained from pairwise comparisons. Suppose humans prefer response $y_w$ over response $y_l$. The reward model should assign:

$$ r_\phi(x, y_w) > r_\phi(x, y_l). $$

A common training objective is the Bradley-Terry preference model:

$$ P(y_w \succ y_l) = \frac{ \exp(r_\phi(x,y_w)) }{ \exp(r_\phi(x,y_w)) + \exp(r_\phi(x,y_l)) }. $$

The loss is:

$$ \mathcal{L} = -\log P(y_w \succ y_l). $$

The reward model therefore learns to approximate human preferences statistically.
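The Bradley-Terry loss can be checked numerically for a single comparison. The sketch below uses plain Python floats in place of reward model outputs; the function names and example scores are illustrative assumptions.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response wins.

    exp(a) / (exp(a) + exp(b)) is algebraically equal to sigmoid(a - b),
    so only the reward margin matters.
    """
    margin = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-margin))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical scores for three situations.
loss_equal = reward_model_loss(0.0, 0.0)  # equal scores: probability 0.5
loss_good = reward_model_loss(2.0, 0.0)   # correct ordering with a margin
loss_bad = reward_model_loss(0.0, 2.0)    # wrong ordering
```

The loss depends only on the difference between the two scores. Adding a constant to every reward leaves it unchanged, so the learned scores are meaningful only relative to each other.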

Policy Optimization

Once the reward model is trained, the language model is optimized to maximize reward.

The language model becomes a policy:

$$ \pi_\theta(y \mid x), $$

which generates responses $y$ conditioned on prompts $x$.

The objective is approximately:

$$ \max_\theta \mathbb{E}_{y \sim \pi_\theta} [r_\phi(x,y)]. $$

However, directly maximizing reward is dangerous. The model may exploit weaknesses in the reward model and generate unnatural or degenerate text.

To stabilize training, RLHF usually constrains the policy to remain close to the supervised fine-tuned model.

KL-Regularized Objectives

A common RLHF objective includes a KL-divergence penalty:

$$ \max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[ r_\phi(x,y) - \beta D_{\mathrm{KL}} ( \pi_\theta \| \pi_{\mathrm{ref}} ) \right]. $$

Here:

| Symbol | Meaning |
| --- | --- |
| $\pi_\theta$ | Current policy |
| $\pi_{\mathrm{ref}}$ | Reference policy |
| $r_\phi$ | Reward model |
| $\beta$ | KL penalty coefficient |

The KL penalty discourages the model from drifting too far from the original instruction-tuned distribution.

Without this constraint, the model may maximize reward through pathological outputs rather than genuinely useful behavior.
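In practice the KL term is often estimated from the sampled response itself. The sketch below assumes sequence-level log-probabilities are already available and uses the difference of log-probabilities as a one-sample estimate; the function name and the numbers are hypothetical.

```python
def kl_penalized_reward(reward: float, policy_logp: float, ref_logp: float, beta: float) -> float:
    """Reward minus beta times a single-sample KL estimate.

    For y sampled from the policy, log pi_theta(y|x) - log pi_ref(y|x)
    is a one-sample estimate of KL(pi_theta || pi_ref).
    """
    kl_estimate = policy_logp - ref_logp
    return reward - beta * kl_estimate


# Same raw reward in both cases. The second response is far more likely
# under the policy than under the reference, so it is penalized more.
close = kl_penalized_reward(reward=1.0, policy_logp=-12.0, ref_logp=-12.5, beta=0.1)
drifted = kl_penalized_reward(reward=1.0, policy_logp=-5.0, ref_logp=-30.0, beta=0.1)
```

A larger $\beta$ makes the penalty dominate sooner, which keeps the policy closer to the reference at the cost of slower reward improvement.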

PPO and Policy Gradient Methods

Early RLHF systems commonly used Proximal Policy Optimization, or PPO.

PPO is a policy-gradient reinforcement learning algorithm designed to improve stability during policy updates.

The idea is simple:

  1. Generate responses.
  2. Score them with the reward model.
  3. Estimate advantages.
  4. Update the policy gradually.

The PPO objective constrains policy updates so that each optimization step remains relatively small.

A simplified form is:

$$ L^{\mathrm{PPO}} = \mathbb{E} \left[ \min ( r_t A_t, \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)A_t ) \right], $$

where:

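| Symbol | Meaning |
| --- | --- |
| $r_t$ | Probability ratio between the updated policy and the policy that generated the sample (not the reward $r_\phi$) |
| $A_t$ | Advantage estimate |
| $\epsilon$ | Clipping parameter |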

PPO reduces unstable jumps in policy behavior.
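The effect of clipping is easiest to see on single values. The sketch below evaluates the simplified objective above for one action; the function name, the default $\epsilon = 0.2$, and the example numbers are assumptions made for illustration.

```python
import math


def clipped_surrogate(new_logp: float, old_logp: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO clipped objective for a single action (a quantity to maximize)."""
    ratio = math.exp(new_logp - old_logp)  # r_t
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Ratio 0.5 with a negative advantage: the value is capped at (1 - epsilon) * A.
capped_neg = clipped_surrogate(math.log(0.5), 0.0, advantage=-2.0)
# Ratio 1.5 with a negative advantage: the unclipped, more pessimistic term is used.
unclipped = clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0)
```

Once the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective, the value stops changing. The gradient for that sample becomes zero, so the update gains nothing from moving further.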

However, PPO training is computationally expensive and operationally complex. Many modern systems now prefer simpler alternatives.

Direct Preference Optimization

A newer approach is Direct Preference Optimization, or DPO.

DPO avoids explicit reinforcement learning. Instead of training a separate reward model and running PPO, DPO directly optimizes preference comparisons.

The key insight is that under certain assumptions, maximizing a KL-regularized reward objective can be transformed into a supervised classification objective over preferred and rejected responses.

The DPO objective encourages:

$$ \pi_\theta(y_w \mid x) > \pi_\theta(y_l \mid x), $$

while keeping the model close to a reference policy.
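For reference, the DPO loss is usually written as follows. It is the Bradley-Terry loss from the reward model section with the reward replaced by the implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$. Here $\sigma$ denotes the logistic sigmoid, and the remaining symbols are those defined earlier.

$$ \mathcal{L}_{\mathrm{DPO}} = -\log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right] \right). $$

The reference policy appears in both ratios, so the loss rewards increasing the probability of $y_w$ relative to the reference more than that of $y_l$. It does not reward raising $\pi_\theta(y_w \mid x)$ in isolation.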

Advantages of DPO include:

| Advantage | Description |
| --- | --- |
| Simpler pipeline | No PPO rollout loop |
| More stable | Easier optimization |
| Lower compute cost | Fewer moving parts |
| Easier implementation | Standard supervised-style training |

Because of these advantages, many modern alignment systems use preference optimization variants instead of classical PPO-based RLHF.

Reward Hacking

A reward model is only an approximation of human judgment. If optimized aggressively, the policy may exploit weaknesses in the reward signal.

This is called reward hacking.

Examples include:

| Failure mode | Example |
| --- | --- |
| Verbosity bias | Extremely long answers because reward correlates with detail |
| Sycophancy | Agreeing with the user even when the user is incorrect |
| Style exploitation | Polite wording masking factual errors |
| Safety over-optimization | Excessive refusal behavior |
| Repetition | Repeating patterns that the reward model scores highly |
| Hallucinated confidence | Fluent but false explanations |

Reward hacking is a fundamental alignment problem. Optimizing proxy rewards can produce unintended behavior.

The reward model does not define true human values. It defines a learned approximation.

Distribution Shift

The reward model is trained on a limited distribution of responses. During optimization, the policy may generate outputs outside that distribution.

This creates distribution shift.

For example:

  1. The reward model sees mostly ordinary assistant responses.
  2. The policy explores unusual outputs during optimization.
  3. The reward model produces unreliable scores on unfamiliar text.
  4. The policy exploits those errors.

This is similar to adversarial optimization in other machine learning systems.

Large policy shifts can therefore destabilize RLHF.

KL regularization, conservative optimization, rejection sampling, and human auditing are used to reduce this problem.
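Rejection sampling, often called best-of-n, is the simplest of these to sketch. Candidates are drawn from an unmodified policy and the reward model only selects among them, so outputs stay within the distribution the policy already produces. The `toy_reward` function and its fixed scores below are hypothetical stand-ins for a trained reward model.

```python
from typing import Callable, Sequence


def best_of_n(prompt: str, candidates: Sequence[str], reward_fn: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reward score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


# Hypothetical fixed scores standing in for reward model outputs.
TOY_SCORES = {
    "I cannot help.": -1.0,
    "Recursion is when a function calls itself.": 2.5,
    "Recursion is recursion.": 0.3,
}


def toy_reward(prompt: str, response: str) -> float:
    """Look up a precomputed score; a real system would run the reward model."""
    return TOY_SCORES[response]


best = best_of_n("Explain recursion.", list(TOY_SCORES), toy_reward)
```

Selection still inherits the reward model's errors. With a very large number of candidates, the highest-scoring one is increasingly likely to be one the reward model overrates.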

Multi-Objective Alignment

Human preferences are not one-dimensional.

A useful assistant should balance multiple goals:

| Objective | Meaning |
| --- | --- |
| Helpfulness | Solves the user’s problem |
| Harmlessness | Avoids dangerous behavior |
| Honesty | Avoids fabrication |
| Calibration | Expresses uncertainty appropriately |
| Conciseness | Avoids unnecessary verbosity |
| Robustness | Resists jailbreaks and manipulation |

These objectives may conflict.

For example:

| Tradeoff | Example |
| --- | --- |
| Helpfulness vs safety | Medical guidance |
| Conciseness vs completeness | Technical explanations |
| Honesty vs confidence | Uncertain answers |
| Harmlessness vs utility | Dual-use scientific topics |

RLHF systems therefore optimize approximate mixtures of objectives rather than a single universal reward.
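One simple way to form such a mixture is a weighted sum of per-objective scores. This is a sketch under that assumption only; the objective names, weights, and scores are invented for illustration, and real systems may combine objectives differently.

```python
from typing import Mapping


def combined_reward(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-objective scores; every weighted objective must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights and per-objective scores in [0, 1].
weights = {"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2}
detailed_but_risky = combined_reward(
    {"helpfulness": 0.9, "harmlessness": 0.2, "honesty": 0.8}, weights
)
cautious = combined_reward(
    {"helpfulness": 0.5, "harmlessness": 0.9, "honesty": 0.8}, weights
)
```

With these weights the cautious response scores slightly higher. Shifting weight from harmlessness to helpfulness reverses the ordering, which shows that the choice of weights is itself a value judgment.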

Constitutional and AI Feedback Methods

Human feedback is expensive and difficult to scale.

Modern systems increasingly use AI-generated feedback.

A stronger model may:

| Role | Example |
| --- | --- |
| Critic | Identify factual errors |
| Judge | Rank candidate outputs |
| Safety evaluator | Detect policy violations |
| Rewriter | Improve weak responses |
| Preference annotator | Generate synthetic rankings |

Constitutional AI approaches define principles or rules that guide critique and revision.

Example principles:

| Principle | Purpose |
| --- | --- |
| Avoid harmful advice | Safety |
| Admit uncertainty | Honesty |
| Respect privacy | Security |
| Avoid discrimination | Fairness |
Avoid discrimination Fairness

The model critiques its own outputs relative to the constitution, then revises them.

This reduces reliance on large human labeling teams.
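The critique-and-revise loop can be sketched as follows. The `critique` and `revise` callables stand in for model calls, and the toy versions below are hypothetical string rules used only to make the example runnable.

```python
from typing import Callable, List, Optional


def critique_and_revise(
    draft: str,
    principles: List[str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str], str],
) -> str:
    """Apply each principle in turn: critique the response, revise if a problem is found."""
    response = draft
    for principle in principles:
        problem = critique(response, principle)  # None means no violation was found
        if problem is not None:
            response = revise(response, problem)
    return response


# Toy stand-ins for model calls (hypothetical; a real system would query a model).
def toy_critique(response: str, principle: str) -> Optional[str]:
    if principle == "Admit uncertainty" and "definitely" in response:
        return "overconfident wording"
    return None


def toy_revise(response: str, problem: str) -> str:
    return response.replace("definitely", "probably")


revised = critique_and_revise(
    "This is definitely the cause.",
    ["Avoid harmful advice", "Admit uncertainty"],
    toy_critique,
    toy_revise,
)
```

The original draft and the revised response can then serve as training data, for example as a rejected and chosen pair for preference optimization.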

RLHF and Reasoning

RLHF affects reasoning behavior strongly.

During pretraining, the model learns statistical reasoning patterns implicitly. RLHF changes which reasoning traces are rewarded.

If detailed reasoning receives high reward, the model may produce more chain-of-thought style outputs. If concise answers receive higher reward, the model may shorten explanations.

This can improve usability but also distort behavior.

For example:

| RLHF effect | Possible issue |
| --- | --- |
| More confident tone | False certainty |
| More coherent reasoning | Persuasive hallucinations |
| Longer explanations | Rewarding verbosity |
| Refusal optimization | Over-refusal |

The model may learn how reasoning should look rather than how to reason correctly internally.

This distinction between external reasoning traces and internal computation remains an active research topic.

RLHF and Tool Use

RLHF-style training is also used to teach models to use tools correctly.

Examples include:

| Tool type | Example |
| --- | --- |
| Search | Web retrieval |
| Code execution | Python interpreters |
| APIs | Weather or finance services |
| Databases | Structured queries |
| Agents | Multi-step planning systems |

The reward process encourages behaviors such as:

| Desired behavior | Example |
| --- | --- |
| Calling tools when uncertain | Retrieval before answering |
| Using valid arguments | Correct API schemas |
| Interpreting outputs correctly | Reading tool results |
| Avoiding hallucination | Prefer retrieved evidence |

Tool-augmented alignment is increasingly important because modern assistants are not purely text generators.
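Some of these behaviors can be checked automatically. As a sketch, the function below assigns a reward of 1.0 only when a tool call parses as JSON and its arguments match an expected schema. The call format, the `arguments` key, and the weather schema are assumptions made for this example, not a real API.

```python
import json
from typing import Dict


def tool_call_reward(raw_call: str, schema: Dict[str, type]) -> float:
    """Return 1.0 if the call is valid JSON whose arguments match the schema, else 0.0."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(call, dict):
        return 0.0
    args = call.get("arguments")
    # Require exactly the expected argument names.
    if not isinstance(args, dict) or set(args) != set(schema):
        return 0.0
    # Require each argument to have the expected type.
    types_ok = all(isinstance(args[name], expected) for name, expected in schema.items())
    return 1.0 if types_ok else 0.0


# Hypothetical schema for a weather tool.
weather_schema = {"city": str, "days": int}

valid = tool_call_reward('{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}', weather_schema)
wrong_type = tool_call_reward('{"name": "get_weather", "arguments": {"city": "Oslo", "days": "three"}}', weather_schema)
malformed = tool_call_reward('{"name": "get_weather", "arguments": ', weather_schema)
```

A check like this covers only the form of the call. Whether the tool was the right one to call, and whether its output was interpreted correctly, still requires preference data or other evaluation.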

Human Preference Biases

Preference labels are influenced by human psychology.

Annotators may prefer:

| Bias | Example |
| --- | --- |
| Fluent text | Even if inaccurate |
| Confident tone | Even when wrong |
| Longer answers | Perceived depth |
| Agreeable behavior | Sycophancy |
| Familiar styles | Cultural bias |
| Safe responses | Even when overcautious |

These biases become encoded into the reward model.

As a result, RLHF can amplify social and stylistic biases present in the annotation process.

Alignment therefore depends not only on optimization algorithms, but also on who provides feedback and how that feedback is collected.
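Some of these biases can be screened for before training. The sketch below measures how often the preferred response is the longer one, using word counts; the function name and the example pairs are invented for illustration.

```python
from typing import Iterable, Tuple


def longer_chosen_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs in which the chosen response has more words."""
    longer = 0
    total = 0
    for chosen, rejected in pairs:
        total += 1
        if len(chosen.split()) > len(rejected.split()):
            longer += 1
    if total == 0:
        raise ValueError("no pairs given")
    return longer / total


# Hypothetical (chosen, rejected) pairs.
pairs = [
    ("a long and detailed answer", "short answer"),
    ("yes", "a rambling reply that says little"),
    ("clear step by step explanation", "no"),
    ("thorough reply with examples", "brief"),
]
rate = longer_chosen_rate(pairs)
```

A rate well above 0.5 suggests, but does not prove, a length bias, since longer answers may also be genuinely better. It is a prompt to inspect the data, not a verdict.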

PyTorch View of Preference Training

Suppose we have:

| Tensor | Meaning |
| --- | --- |
| `chosen_logps` | Log probabilities for preferred responses |
| `rejected_logps` | Log probabilities for rejected responses |

A simplified DPO-style loss may look like:

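```python
import torch
import torch.nn.functional as F

beta = 0.1  # scales the log-probability margin

# Placeholder values: summed log-probabilities of each response under the policy.
chosen_logps = torch.tensor([-10.0, -8.0])
rejected_logps = torch.tensor([-12.0, -9.0])

# Simplification: full DPO also subtracts the reference model's log-probabilities.
logits = beta * (chosen_logps - rejected_logps)

# Negative log-sigmoid of the margin, averaged over the batch.
loss = -F.logsigmoid(logits).mean()
```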

The model is encouraged to increase probability for preferred outputs relative to rejected outputs.

Unlike ordinary supervised learning, the target is not a single fixed sequence. The target is a preference ordering.
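A fuller sketch adds the reference policy, which is what keeps the model close to its starting point. The `dpo_loss` function and the tensor values below are illustrative assumptions; each tensor is taken to hold one summed sequence log-probability per example.

```python
import torch
import torch.nn.functional as F


def dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of summed sequence log-probabilities."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = chosen_logps - ref_chosen_logps
    rejected_logratio = rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


# Hypothetical values for a batch of two preference pairs.
policy_chosen = torch.tensor([-10.0, -8.0])
policy_rejected = torch.tensor([-12.0, -9.0])
ref_chosen = torch.tensor([-11.0, -8.0])
ref_rejected = torch.tensor([-11.0, -9.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```

When the policy equals the reference, both log-ratios are zero and the loss is $\log 2$ for every pair. The loss falls below that value only as the policy shifts probability toward preferred responses relative to the reference.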

Limits of RLHF

RLHF is powerful, but it has major limitations.

First, reward models are imperfect proxies for human values.

Second, preference optimization may hide rather than solve dangerous behaviors.

Third, RLHF can reduce diversity and originality by pushing models toward highly rewarded styles.

Fourth, preference data is expensive and culturally dependent.

Fifth, RLHF does not guarantee truthfulness. A model may become more persuasive without becoming more accurate.

Finally, RLHF scales poorly if every new capability requires extensive human oversight.

These limitations motivate research into scalable oversight, mechanistic interpretability, constitutional methods, debate systems, verifier models, and automated alignment techniques.

Why RLHF Changed Modern Language Models

Pretrained models can generate fluent text. Instruction-tuned models can follow tasks. RLHF made models substantially more interactive, cooperative, and conversational.

It improved:

| Capability | Effect |
| --- | --- |
| Dialogue quality | More natural interaction |
| Helpfulness | Better task completion |
| Safety behavior | Reduced harmful outputs |
| Refusal behavior | Better policy compliance |
| Tone control | More socially acceptable responses |
| Multi-turn consistency | Improved conversation flow |

Many modern assistants rely heavily on preference optimization.

Without RLHF-style alignment, large language models often behave unpredictably in interactive settings.

Summary

Reinforcement learning from human feedback aligns language models with human preferences using preference comparisons and reward optimization.

The standard RLHF pipeline includes:

  1. Pretraining
  2. Instruction tuning
  3. Preference data collection
  4. Reward model training
  5. Policy optimization

Reward models estimate human preferences statistically, and policy optimization adjusts the model to maximize those rewards while remaining close to the supervised policy.

Modern systems increasingly use preference optimization methods such as DPO rather than classical PPO-based reinforcement learning.

RLHF improves usability, safety, and dialogue quality, but it introduces challenges such as reward hacking, sycophancy, over-optimization, and dependence on imperfect human feedback.