Instruction Tuning

Pretraining teaches a language model to predict text. It does not directly teach the model to follow user instructions, answer safely, maintain dialogue structure, or format outputs in a useful way.

A pretrained model may continue text well but still behave poorly in interactive settings. For example, it may ignore instructions, generate irrelevant continuations, produce unsafe content, or imitate undesirable patterns from the training corpus.

Instruction tuning adapts a pretrained language model into a system that responds to tasks expressed in natural language.

The core idea is simple: instead of training on generic text continuation, we train the model on pairs of instructions and desired responses.

A typical example looks like:

Instruction	Response
“Translate this sentence into French.”	Correct French translation
“Summarize the following article.”	Summary
“Write a Python function for binary search.”	Python code
“Explain gradient descent.”	Educational explanation

Instruction tuning changes the model’s behavior from generic next-token continuation toward task-oriented response generation.

From Language Modeling to Task Following

A pretrained autoregressive model learns

$$ p_\theta(x_t \mid x_{<t}). $$

The model predicts the next token from previous tokens. During pretraining, the corpus may contain instructions, answers, conversations, code, essays, and many unrelated text types mixed together.

Instruction tuning reorganizes the training distribution. Instead of arbitrary web text, the model receives structured examples:

$$ (\text{instruction}, \text{response}). $$

The model then learns the conditional distribution

$$ p_\theta(\text{response} \mid \text{instruction}). $$

This appears superficially similar to pretraining, since the model still predicts tokens autoregressively. The difference is the structure of the data distribution.

Pretraining teaches language structure broadly. Instruction tuning teaches cooperative task behavior.

Supervised Fine-Tuning

Instruction tuning is usually implemented as supervised fine-tuning, often abbreviated SFT.

The dataset contains demonstrations written by humans, synthetic systems, or a mixture of both. Each example typically includes:

Field	Purpose
System prompt	Defines global behavior
User prompt	Contains the instruction
Assistant response	Desired output

A training sample may look like:

<system>
You are a helpful assistant.

<user>
Explain backpropagation in simple terms.

<assistant>
Backpropagation computes gradients by applying the chain rule ...

The model is trained to predict the assistant tokens conditioned on all previous tokens.

The supervised loss is standard cross-entropy:

$$ \mathcal{L} = -\sum_{t} \log p_\theta(y_t \mid x, y_{<t}), $$

where:

Symbol	Meaning
$x$	Prompt or instruction
$y_t$	Target response token
$y_{<t}$	Previous response tokens

Only assistant tokens usually contribute to the loss. User and system tokens provide conditioning context but are not prediction targets.
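The following is a minimal plain-Python sketch of how the masked loss above is evaluated for one toy sequence. It takes two hypothetical inputs. The first, `target_probs`, holds the probability the model assigned to each observed token. The second, `is_assistant`, is a boolean flag marking which positions belong to the assistant response. A real trainer gets these quantities from the model's logits.

```python
import math

def masked_sft_loss(target_probs, is_assistant):
    """Return -sum(log p) over assistant positions only.

    target_probs[t]: probability the model assigned to the observed token at t
    is_assistant[t]: True if token t belongs to the assistant response
    """
    if len(target_probs) != len(is_assistant):
        raise ValueError("target_probs and is_assistant must have the same length")
    # System and user positions are skipped: they condition but are not targets.
    return -sum(math.log(p) for p, keep in zip(target_probs, is_assistant) if keep)

# Toy values: three context tokens (system/user), then two assistant tokens.
probs = [0.10, 0.05, 0.20, 0.50, 0.25]
mask = [False, False, False, True, True]
loss = masked_sft_loss(probs, mask)
print(round(loss, 4))  # -(ln 0.5 + ln 0.25) = ln 8, about 2.0794
```

Changing the probabilities at the three context positions leaves the loss unchanged, which is exactly what "conditioning context but not prediction targets" means.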

Why Instruction Tuning Works

Instruction tuning works because pretrained models already contain broad latent capabilities. Pretraining exposes the model to many tasks indirectly through text. The model may already contain useful representations for translation, reasoning, summarization, coding, and dialogue.

Instruction tuning teaches the model when and how to use those capabilities.

This is often described as eliciting latent knowledge rather than creating entirely new knowledge.

The model learns patterns such as:

Behavior	Example
Obeying instructions	Following formatting requests
Maintaining dialogue roles	Responding as assistant rather than continuing user text
Producing concise answers	Avoiding irrelevant continuation
Refusing unsafe requests	Safety alignment
Using chain-of-thought style reasoning	Stepwise solutions
Formatting outputs	Markdown, JSON, code blocks

A relatively small instruction dataset can significantly change model behavior because the pretrained model already contains strong language representations.

Prompt Formatting and Chat Templates

Modern instruction-tuned models usually rely on structured prompt templates.

A dialogue is converted into a token sequence with role markers:

<system>
You are a concise assistant.

<user>
What is overfitting?

<assistant>

The model generates the assistant continuation.
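Rendering a dialogue into this layout is a small string transformation. The sketch below is hypothetical: the `render_chat` name, the message dictionaries, and the exact whitespace are illustrative assumptions. Each real model family defines its own template.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}] using the role markers shown above.

    Each turn becomes "<role>\\n<content>\\n\\n". When add_generation_prompt is
    True, an empty "<assistant>" marker is appended so generation continues
    in the assistant role.
    """
    allowed = {"system", "user", "assistant"}
    parts = []
    for msg in messages:
        role = msg["role"]
        if role not in allowed:
            raise ValueError(f"unknown role: {role}")
        parts.append(f"<{role}>\n{msg['content']}\n\n")
    if add_generation_prompt:
        parts.append("<assistant>\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is overfitting?"},
])
print(prompt)
```

The trailing open `<assistant>` marker is what tells the model that the next tokens should be the assistant's reply and not more user text.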

Different model families use different formatting conventions:

Model family	Example format
ChatML-style	`<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`, each turn closed by `<|im_end|>`
Instruction-style	`### Instruction:`
Llama-style chat	`[INST] ... [/INST]`
XML-style	`<instruction>` tags
JSON-style	Structured objects

The formatting matters because the model learns statistical associations between role markers and behavior.

Prompting a model with a template that differs from the one used during its fine-tuning can degrade performance substantially.
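As an illustration, here is a minimal sketch that renders the same single-turn dialogue in two conventions. The first is ChatML, which wraps each turn in `<|im_start|>role` and `<|im_end|>`. The second is a simplified `### Instruction:` layout. Both functions are hypothetical helpers, and the second omits the fixed preamble that real instruction-style datasets usually add. Always use the exact template a given model was trained with.

```python
def to_chatml(system, user):
    # ChatML: each turn is <|im_start|>role\n...<|im_end|>, and the prompt ends
    # with an open assistant turn for the model to complete.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def to_instruction_style(system, user):
    # Simplified "### Instruction:" layout (no fixed preamble, assumed spacing).
    return f"{system}\n\n### Instruction:\n{user}\n\n### Response:\n"

system, user = "You are a concise assistant.", "What is overfitting?"
chatml_prompt = to_chatml(system, user)
instr_prompt = to_instruction_style(system, user)
print(chatml_prompt == instr_prompt)  # False: same dialogue, different token sequences
```

A model fine-tuned on one layout has learned to associate its particular markers with the assistant role, so feeding it the other layout is effectively an out-of-distribution prompt.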

Multi-Task Instruction Tuning

Instruction datasets often combine many tasks:

Task type	Example
Question answering	Factual responses
Summarization	Compress documents
Translation	Convert languages
Coding	Generate programs
Classification	Assign labels
Dialogue	Multi-turn interaction
Reasoning	Solve structured problems
Tool use	Call APIs or functions

The model learns a unified interface: natural language instructions.

Instead of separate models for each task, one instruction-tuned model learns many conditional behaviors.

This unification is one reason large language models are flexible. The instruction itself acts as part of the program specification.

Zero-Shot and Few-Shot Generalization

Instruction tuning improves zero-shot generalization. A zero-shot task is one where the model receives only the instruction, without examples.

Example:

Classify this review as positive or negative:
"The battery life is excellent."

The model may perform the task correctly even without task-specific training examples in the prompt.

Few-shot prompting provides demonstrations inside the prompt itself:

Input: "Amazing product."
Label: Positive

Input: "Very disappointing."
Label: Negative

Input: "Battery life is excellent."
Label:
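A few-shot prompt like the one above can be assembled programmatically. The following is a minimal sketch with a hypothetical `few_shot_prompt` helper. It leaves the final `Label:` empty so that the model supplies it.

```python
def few_shot_prompt(demos, query):
    """Build an Input/Label prompt from (text, label) demonstrations.

    The last block has an empty label for the model to fill in.
    """
    blocks = [f'Input: "{text}"\nLabel: {label}' for text, label in demos]
    blocks.append(f'Input: "{query}"\nLabel:')
    return "\n\n".join(blocks)

demos = [("Amazing product.", "Positive"), ("Very disappointing.", "Negative")]
prompt = few_shot_prompt(demos, "Battery life is excellent.")
print(prompt)
```

The zero-shot case is the same function called with an empty `demos` list. An instruction-tuned model is expected to handle both forms consistently.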

Instruction tuning improves the model’s ability to interpret such prompts consistently.

Pretraining alone may give weak task following. Instruction tuning steers the model toward cooperative interaction.

Chain-of-Thought Supervision

Some instruction datasets include intermediate reasoning steps rather than only final answers.

Example:

Question: If a train travels 60 km in 2 hours, what is its average speed?

Reasoning:
Speed = distance / time
= 60 / 2
= 30 km/h

Answer: 30 km/h

Training on reasoning traces can improve performance on multi-step reasoning tasks.

The model learns statistical patterns associated with decomposition, intermediate computation, verification, and explanation.

This is called chain-of-thought supervision.
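A chain-of-thought training target is a text layout: reasoning lines come first and a final answer line comes last. The sketch below builds the train-speed example in that layout and extracts the answer from it. Keeping the answer on its own line lets evaluation score correctness independently of the reasoning text. The function names and the `Answer:` convention are assumptions that mirror the example above.

```python
def format_cot_target(steps, answer):
    """Training target: reasoning lines first, final answer last."""
    return "Reasoning:\n" + "\n".join(steps) + f"\n\nAnswer: {answer}"

def extract_answer(text):
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(text.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None

distance_km, time_h = 60, 2
speed = distance_km / time_h  # 30.0
target = format_cot_target(
    ["Speed = distance / time", f"= {distance_km} / {time_h}", f"= {speed:g} km/h"],
    f"{speed:g} km/h",
)
print(extract_answer(target))  # 30 km/h
```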

However, chain-of-thought introduces several concerns:

Concern	Description
Verbosity	Longer outputs increase cost
Faithfulness	Reasoning text may not reflect internal computation
Data contamination	Public reasoning datasets may leak benchmarks
Safety	Exposing intermediate reasoning may reveal unsafe content

Some systems therefore separate visible reasoning from internal latent reasoning.

Instruction Diversity

An instruction-tuned model must generalize across many instruction styles.

If the dataset is too narrow, the model may overfit to specific phrasing. High-quality instruction tuning datasets therefore vary:

Variation	Example
Wording	“Summarize” versus “Give a short overview”
Tone	Formal versus conversational
Format	JSON, markdown, prose
Difficulty	Simple and complex tasks
Domain	Science, law, code, dialogue
Language	Multilingual prompts

Diversity improves robustness.

The model learns abstract task semantics rather than memorizing exact templates.
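A simple way to vary wording is to draw each instruction from a pool of paraphrase templates for the same task. The following is a minimal sketch. The template strings and the `diversify` helper are illustrative assumptions, and real pipelines also vary tone, format, domain, and language.

```python
import random

# Hypothetical paraphrase templates for one task (summarization).
SUMMARIZE_TEMPLATES = [
    "Summarize the following text:\n{doc}",
    "Give a short overview of this passage:\n{doc}",
    "In two or three sentences, what is this about?\n{doc}",
    "TL;DR please:\n{doc}",
]

def diversify(docs, templates, seed=0):
    """Pair each document with a randomly chosen instruction wording."""
    rng = random.Random(seed)  # local RNG so the sample is reproducible
    return [rng.choice(templates).format(doc=d) for d in docs]

docs = ["Doc A ...", "Doc B ...", "Doc C ..."]
prompts = diversify(docs, SUMMARIZE_TEMPLATES)
for p in prompts:
    print(p.splitlines()[0])  # show which wording each document received
```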

Synthetic Instruction Data

Human-written instruction datasets are expensive. Many modern systems therefore generate synthetic instruction data.

A strong model can generate:

Synthetic component	Example
Instructions	“Write a SQL query for…”
Responses	High-quality completions
Reasoning traces	Stepwise derivations
Critiques	Error analysis
Preference labels	Ranking candidate answers

Synthetic data generation creates a recursive training loop:

1. Train a strong model.
2. Use the model to generate instruction data.
3. Filter or rank the outputs.
4. Train a new model on the expanded dataset.

This process scales data generation beyond purely human annotation.

However, synthetic data can amplify errors, stylistic artifacts, and model biases. Filtering and evaluation become increasingly important.
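The filtering step can be as simple as deduplication plus a quality threshold. The following is a minimal sketch. Here `score_fn` stands in for whatever scorer a pipeline actually uses, such as a reward model, a verifier, or heuristics. The `toy_score` heuristic is a placeholder assumption, not a recommended quality measure.

```python
def filter_synthetic(examples, score_fn, min_score=0.5):
    """Keep deduplicated synthetic examples whose score passes a threshold.

    examples: list of {"instruction": str, "response": str}
    score_fn: hypothetical quality scorer returning a float
    """
    seen = set()
    kept = []
    for ex in examples:
        key = " ".join(ex["instruction"].lower().split())  # normalize case/whitespace
        if key in seen:
            continue  # drop duplicates that differ only in case or spacing
        seen.add(key)
        if score_fn(ex) >= min_score:
            kept.append(ex)
    return kept

def toy_score(ex):
    # Placeholder heuristic: penalize empty or very short responses.
    return min(len(ex["response"]) / 20, 1.0)

data = [
    {"instruction": "Write a SQL query for all users.", "response": "SELECT * FROM users;"},
    {"instruction": "write a SQL query  for all users.", "response": "SELECT * FROM users;"},
    {"instruction": "Explain recursion.", "response": ""},
]
clean = filter_synthetic(data, toy_score)
print(len(clean))  # 1: one duplicate and one empty response removed
```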

Catastrophic Forgetting

Instruction tuning changes the model distribution. If done poorly, it can damage capabilities learned during pretraining.

This is called catastrophic forgetting.

Possible symptoms include:

Problem	Example
Reduced factual recall	Worse knowledge retrieval
Lower language diversity	Repetitive responses
Reduced multilingual ability	Strong English bias
Style collapse	Overly uniform outputs
Short-answer bias	Failure on long reasoning tasks

Instruction tuning datasets are much smaller than pretraining corpora. Aggressive fine-tuning can therefore distort the pretrained representation space.

Several techniques reduce forgetting:

Technique	Purpose
Small learning rates	Preserve pretrained features
Mixed training data	Blend instruction and pretraining text
Parameter-efficient tuning	Update fewer parameters
Regularization	Prevent large parameter drift
Replay buffers	Reintroduce older data

Balancing specialization and preservation is a major practical challenge.
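One concrete form of the "Regularization" row is an L2 penalty that pulls parameters back toward their pretrained values (sometimes called L2-SP). The following is a minimal sketch over flat lists of floats. The list representation, the toy weights, and `strength=0.01` are illustrative assumptions. In practice the penalty is computed over model tensors and added to the SFT loss.

```python
def drift_penalty(params, pretrained, strength=0.01):
    """strength * sum((theta - theta0)^2): discourages drift from pretrained weights."""
    if len(params) != len(pretrained):
        raise ValueError("params and pretrained must have the same length")
    return strength * sum((p - p0) ** 2 for p, p0 in zip(params, pretrained))

theta0 = [1.0, -2.0, 0.5]   # pretrained weights (toy values)
theta = [1.1, -2.0, 0.3]    # weights after some fine-tuning steps
task_loss = 2.0             # placeholder SFT loss value
total = task_loss + drift_penalty(theta, theta0)
print(round(total, 6))  # 2.0005
```

A larger `strength` preserves more of the pretrained solution at the cost of slower adaptation, which is the specialization/preservation trade-off in miniature.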

Parameter-Efficient Instruction Tuning

Full fine-tuning updates all parameters. For large models, this is expensive.

Parameter-efficient fine-tuning updates only small subsets of parameters.

Common approaches include:

Method	Idea
LoRA	Low-rank weight updates
Adapters	Small trainable modules inserted into layers
Prefix tuning	Train virtual prompt vectors
Prompt tuning	Learn soft prompts
BitFit	Train only bias terms

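For example, LoRA freezes a pretrained weight matrix $W \in \mathbb{R}^{d \times k}$ and learns a low-rank update, so the adapted layer uses $W + \Delta W$ with

$$ \Delta W = AB, $$

where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, so $\Delta W$ has rank at most $r$. Only $A$ and $B$ are trained, which means $r(d + k)$ trainable values instead of $dk$.

This greatly reduces the number of trainable parameters and the optimizer memory, while often preserving much of the performance of full fine-tuning.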
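The savings are easy to quantify. The following is a minimal sketch that counts trainable parameters for one matrix and multiplies out a tiny rank-1 update. The sizes `d = k = 4096` and `r = 8` are illustrative assumptions, not values from any specific model.

```python
def lora_param_counts(d, k, r):
    """Trainable values for a full d x k update vs. LoRA factors A (d x r), B (r x k)."""
    return d * k, r * (d + k)

def matmul(A, B):
    # Plain-Python product of a (d x r) and an (r x k) list-of-lists matrix.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625

# Rank-1 example: every row of Delta W is a multiple of the single row of B.
A = [[1.0], [2.0], [3.0]]   # 3 x 1
B = [[1.0, 0.0, -1.0]]      # 1 x 3
delta_W = matmul(A, B)       # 3 x 3, rank 1
print(delta_W)
```

With these sizes LoRA trains 1/256 of the values a full update would, and the update it can express is correspondingly constrained to rank at most `r`.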

Parameter-efficient tuning is widely used for domain adaptation and open-source fine-tuning.

Instruction Tuning and Alignment

Instruction tuning improves usability, but it does not fully solve alignment.

A model may still:

Failure mode	Example
Hallucinate	Invent facts
Follow harmful requests	Unsafe outputs
Over-refuse	Reject harmless queries
Manipulate users	Social persuasion
Leak training data	Memorized content
Produce biased responses	Social stereotypes

Instruction tuning mainly teaches behavioral imitation from demonstrations.

More advanced alignment methods, such as reinforcement learning from human feedback, constitutional training, and preference optimization, further shape the model’s behavior.

PyTorch View of Supervised Fine-Tuning

Suppose a tokenized batch has shape:

[B, T]

where:

Symbol	Meaning
`B`	Batch size
`T`	Sequence length

The model produces logits:

[B, T, V]

where $V$ is the vocabulary size.

Instruction tuning usually masks non-assistant tokens from the loss.

Example:

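```python
import torch
import torch.nn.functional as F

# input_ids: [B, T] token ids of the formatted dialogue
# labels:    [B, T] copy of input_ids with non-assistant positions set to -100
# `model` is assumed to return raw logits of shape [B, T, V]
logits = model(input_ids)

# Causal shift: the logits at position t predict the token at position t + 1,
# so drop the last logit step and the first label before comparing them.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),  # [B*(T-1), V]
    shift_labels.reshape(-1),                         # [B*(T-1)]
    ignore_index=-100,  # masked positions add no loss and no gradient
)
```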

The label tensor may look like:

Input tokens: [SYSTEM  USER  USER  ASSISTANT  ASSISTANT]
Labels:       [ -100  -100  -100     y1         y2    ]

Only assistant outputs contribute gradients.
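Producing the label sequence is a per-token substitution. The following is a minimal sketch that assumes each token already carries a role tag from the chat template. The `build_labels` helper and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # must match the ignore_index passed to the loss

def build_labels(token_ids, roles):
    """Copy token ids into labels, masking every non-assistant position."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must align")
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

ids = [11, 42, 43, 7, 8]  # made-up token ids
roles = ["system", "user", "user", "assistant", "assistant"]
labels = build_labels(ids, roles)
print(labels)  # [-100, -100, -100, 7, 8]
```

The labels stay aligned with the input positions, and the one-step shift is applied later in the loss computation.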

Data Mixture and Curriculum

Instruction tuning datasets are often mixtures of many sources:

Source	Example
Human annotation	Expert-written prompts
Public QA datasets	Reading comprehension
Code datasets	Programming tasks
Synthetic conversations	Generated dialogues
Tool traces	API interaction examples
Reasoning datasets	Math and logic problems

The mixture ratio matters.

Too much conversational data may weaken reasoning. Too much code data may distort natural language style. Too much synthetic data may create repetitive outputs.
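In practice, a mixture ratio often becomes a sampling weight per source. The following is a minimal sketch using `random.choices`. The source names and weights are illustrative assumptions, not recommended proportions.

```python
import random

# Hypothetical mixture weights over the sources in the table above.
MIXTURE = {
    "human": 0.20,
    "public_qa": 0.15,
    "code": 0.20,
    "synthetic": 0.25,
    "tool_traces": 0.05,
    "reasoning": 0.15,
}

def sample_sources(mixture, n, seed=0):
    """Choose the source for each of n training examples according to the weights."""
    rng = random.Random(seed)
    names = list(mixture)
    return rng.choices(names, weights=[mixture[s] for s in names], k=n)

draws = sample_sources(MIXTURE, 10_000)
counts = {s: draws.count(s) for s in MIXTURE}
print(counts)  # roughly proportional to the weights
```

Adjusting one weight, for example lowering `synthetic`, is how a pipeline would counter the repetitive-output effect described above.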

Some training pipelines also use curricula:

1. Easier tasks first.
2. More complex reasoning later.
3. Specialized tasks near the end.

Curriculum design can improve stability and convergence.
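The following is a minimal sketch of the "easier first" ordering. It sorts examples by a difficulty score and splits them into consecutive stages. The `difficulty` function is a hypothetical input. The string-length proxy used below is a toy assumption chosen only to make the example deterministic.

```python
def curriculum_stages(examples, difficulty, n_stages=3):
    """Sort examples by difficulty (higher = harder) and split into consecutive stages."""
    if not examples:
        return []
    ordered = sorted(examples, key=difficulty)
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

tasks = ["sum two numbers", "translate a sentence", "prove a lemma about primes",
         "sort a list", "write a compiler pass", "classify sentiment"]
stages = curriculum_stages(tasks, difficulty=len)  # toy proxy: longer = harder
for i, stage in enumerate(stages, 1):
    print(i, stage)
```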

Why Instruction Tuning Changed Language Models

Early large language models were often difficult to control. Users needed carefully engineered prompts to obtain reliable behavior.

Instruction tuning changed the interaction model. Instead of treating the model as a generic text completer, users could treat it as a cooperative assistant.

This shift enabled:

Capability	Impact
Conversational systems	Multi-turn dialogue
General-purpose assistants	Broad task coverage
Tool integration	API and retrieval systems
Coding assistants	Natural language programming
Educational tutors	Explanatory interaction
Agent systems	Planning and execution loops

Instruction tuning therefore transformed pretrained language models into usable interactive systems.

Summary

Instruction tuning adapts pretrained language models for task-following behavior using supervised examples of instructions and desired responses.

The model learns conditional generation:

$$ p_\theta(\text{response} \mid \text{instruction}). $$

Instruction tuning improves usability, formatting, dialogue structure, reasoning style, and zero-shot task generalization.

Modern instruction-tuned systems rely on structured prompts, diverse datasets, chain-of-thought supervision, synthetic data generation, and parameter-efficient adaptation methods.

Instruction tuning greatly improves interaction quality, but it does not fully solve factuality, robustness, or safety. Later alignment stages further shape model behavior beyond supervised imitation.