Summarization

Summarization is the task of producing a shorter version of one or more source texts while preserving the important information. The input may be a news article, a scientific paper, a legal document, a support thread, a meeting transcript, a code review, or a set of retrieved passages. The output is a compact text that should be faithful, readable, and appropriate for the user’s purpose.

Examples:

Input: A long news article about an election result.
Output: A paragraph describing who won, by what margin, and what happens next.

Input: A meeting transcript.
Output: Decisions, open questions, and action items.

Summarization is a sequence-to-sequence problem. The model receives a source sequence

$$ x = (x_1, x_2, \ldots, x_T) $$

and produces a target sequence

$$ y = (y_1, y_2, \ldots, y_M), $$

where usually

$$ M < T. $$

The goal is not merely to shorten text. The goal is to select, compress, organize, and express the relevant content.

Extractive and Abstractive Summarization

There are two main forms of summarization.

| Type | Description |
| --- | --- |
| Extractive summarization | Selects sentences or spans from the source |
| Abstractive summarization | Generates new wording based on the source |

Extractive summarization is closer to ranking. The system chooses important units from the original document. For example, it may select three sentences from a news article.

Abstractive summarization is closer to conditional generation. The system writes a new summary that may paraphrase, combine, or reorder information.

Extractive systems are easier to constrain because every selected sentence comes from the source. Abstractive systems are more flexible but can introduce unsupported claims.

Encoder-Decoder Formulation

Modern abstractive summarization often uses an encoder-decoder transformer.

The encoder reads the source document and produces hidden states:

$$ H = \text{Encoder}(x). $$

The decoder generates the summary one token at a time:

$$ p(y \mid x) = \prod_{m=1}^{M} p(y_m \mid y_{<m}, x). $$

Training usually uses teacher forcing. At step $m$, the decoder receives the gold previous tokens and predicts the next gold token.

The loss is token-level negative log-likelihood:

$$ \mathcal{L} = -\sum_{m=1}^{M} \log p_\theta(y_m^\star \mid y_{<m}^\star, x). $$

This objective teaches the model to imitate reference summaries. It does not directly optimize factuality, coverage, usefulness, or brevity. Those properties must be handled through data, decoding, evaluation, and system design.
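A small worked example may make the loss concrete. The sketch below assumes the decoder's per-step distributions are already available as dictionaries; the tokens and probabilities are invented for illustration and do not come from any real model.

```python
import math

def sequence_nll(step_distributions, gold_tokens):
    """Sum of -log p(gold token) over all target positions (teacher forcing)."""
    total = 0.0
    for dist, gold in zip(step_distributions, gold_tokens):
        total += -math.log(dist[gold])
    return total

# Hypothetical distributions p(. | y*_{<m}, x) for a three-token reference.
step_distributions = [
    {"rates": 0.7, "prices": 0.2, "</s>": 0.1},
    {"rise": 0.5, "fall": 0.4, "</s>": 0.1},
    {"</s>": 0.9, "again": 0.1},
]
gold_tokens = ["rates", "rise", "</s>"]

loss = sequence_nll(step_distributions, gold_tokens)
# exp(-loss) recovers the sequence probability 0.7 * 0.5 * 0.9.
sequence_probability = math.exp(-loss)
```

The loss only rewards probability mass on the reference tokens. Nothing in it checks whether "rise" is actually supported by the source, which is why factuality has to be addressed elsewhere in the system.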

Decoder-Only Summarization

Decoder-only language models can also perform summarization. The source text and instruction are placed in the prompt, and the model continues with the summary.

Example prompt:

Summarize the following article in five bullet points.

Article:
...

The model defines the same autoregressive factorization:

$$ p(y \mid c) = \prod_{m=1}^{M} p(y_m \mid c, y_{<m}), $$

where $c$ is the prompt containing the instruction and source text.

Decoder-only models are convenient for instruction-following summarization. They can adapt the output format using natural language instructions. Encoder-decoder models are often more efficient when the task is fixed and the source text is long relative to the output.
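As a rough sketch, the prompt $c$ can be assembled with a small helper. The function name build_summary_prompt and the trailing "Summary:" cue are assumptions made for this example; real systems typically use the chat or instruction template that their model was trained with.

```python
def build_summary_prompt(article, num_bullets=5):
    """Place the instruction and the source text in a single prompt string."""
    instruction = (
        f"Summarize the following article in {num_bullets} bullet points."
    )
    # The trailing cue marks where the model's continuation should begin.
    return f"{instruction}\n\nArticle:\n{article.strip()}\n\nSummary:\n"

prompt = build_summary_prompt("The central bank raised rates by 0.25 points.")
```

Changing the instruction string changes the output format without retraining, which is the main convenience of the decoder-only setup.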

Building a Summarization Dataset

A supervised summarization dataset contains source-summary pairs:

{
  "source": "Long document text...",
  "summary": "Short summary..."
}

The quality of this dataset largely determines the behavior of the model. If the references are brief, the model learns brief summaries. If the references include opinions, the model learns opinions. If the references contain details that the source does not support, the model learns to hallucinate.

Important dataset properties include:

| Property | Why it matters |
| --- | --- |
| Domain | News, legal, medical, code, meetings, and research papers require different summaries |
| Compression ratio | Controls how much shorter the summary should be |
| Reference style | Bullets, paragraph, headline, abstract, action items |
| Factual alignment | Reference should be supported by the source |
| Recency | For time-sensitive domains, summaries must reflect current conventions |
| Length distribution | Training length should match deployment length |

A dataset for meeting summarization may need decisions and action items. A dataset for scientific summarization may need methods, results, and limitations. A dataset for legal summarization may need parties, claims, holdings, and procedural posture.
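Two of these properties, compression ratio and factual alignment, can be screened with simple statistics before training. The sketch below is a crude proxy under stated assumptions: it counts whitespace-separated words rather than model tokens, and it treats a summary word as "novel" when it never appears in the source. The function names are made up for this example.

```python
PUNCTUATION = ".,;:!?\"'()"

def compression_ratio(source, summary):
    """Source length divided by summary length, in whitespace-separated words."""
    return len(source.split()) / max(len(summary.split()), 1)

def novel_word_fraction(source, summary):
    """Fraction of summary words that never appear in the source."""
    source_vocab = {w.lower().strip(PUNCTUATION) for w in source.split()}
    summary_words = [w.lower().strip(PUNCTUATION) for w in summary.split()]
    summary_words = [w for w in summary_words if w]
    if not summary_words:
        return 0.0
    novel = [w for w in summary_words if w not in source_vocab]
    return len(novel) / len(summary_words)

source = (
    "The council met on Monday. It approved the new budget after a long "
    "debate. The vote was 7 to 2."
)
good = "The council approved the budget 7 to 2."
suspect = "The mayor approved the budget."

ratio = compression_ratio(source, good)            # 20 words / 8 words
good_novelty = novel_word_fraction(source, good)
suspect_novelty = novel_word_fraction(source, suspect)  # "mayor" is not in the source
```

A high novel-word fraction does not prove a reference is wrong, since abstractive references paraphrase by design, but pairs with unusually high values are worth inspecting by hand.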

Tokenization and Batching

Summarization uses both input tokenization and target tokenization.

For a batch of $B$ examples:

input_ids      # [B, T_src]
attention_mask # [B, T_src]
labels         # [B, T_tgt]

The source length $T_{\text{src}}$ and target length $T_{\text{tgt}}$ may differ.

Padding makes the examples in a batch the same length. The loss should ignore padded positions in the labels. In PyTorch and Hugging Face-style training, those positions are usually set to -100, which is the default ignore_index of PyTorch's cross-entropy loss.

labels[labels == tokenizer.pad_token_id] = -100

This prevents padding tokens from contributing to the loss.
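To show where the three batch fields come from, here is a minimal collate function written with plain Python lists. It is a sketch, not a library API: the pad id of 0 is an assumption (real tokenizers define their own), and in practice the lists would be converted to tensors before being passed to the model.

```python
PAD_ID = 0           # assumed pad token id; take it from the tokenizer in practice
IGNORE_INDEX = -100  # label value that the loss skips

def collate(examples):
    """Pad a list of {'input_ids', 'labels'} examples to common lengths."""
    max_src = max(len(e["input_ids"]) for e in examples)
    max_tgt = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        src_pad = max_src - len(e["input_ids"])
        tgt_pad = max_tgt - len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [PAD_ID] * src_pad)
        # 1 marks real tokens, 0 marks padding the encoder should not attend to.
        batch["attention_mask"].append([1] * len(e["input_ids"]) + [0] * src_pad)
        # Labels are padded with IGNORE_INDEX directly, so no later masking is needed.
        batch["labels"].append(e["labels"] + [IGNORE_INDEX] * tgt_pad)
    return batch

batch = collate([
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
])
```

Padding the labels with the ignore value at collate time avoids a subtle problem with masking by pad id afterwards: if a tokenizer reuses the same id for padding and end-of-sequence, masking by id would also hide the end-of-sequence target.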

A Minimal PyTorch Wrapper

A practical summarization model often wraps a pretrained encoder-decoder model.

import torch
import torch.nn as nn

class Summarizer(nn.Module):
    def __init__(self, seq2seq_model):
        super().__init__()
        self.model = seq2seq_model

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )

        return outputs.loss, outputs.logits

This looks small because the pretrained model contains the encoder, decoder, attention layers, embeddings, language modeling head, and generation utilities.

A training step is similar to other supervised sequence tasks:

def train_step(model, batch, optimizer, device):
    model.train()

    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    optimizer.zero_grad(set_to_none=True)

    loss, logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels,
    )

    loss.backward()
    optimizer.step()

    return loss.item()

Decoding Summaries

At inference time, the model must decode a sequence. Common decoding methods include greedy search, beam search, top-k sampling, nucleus sampling, and constrained decoding.

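For summarization, deterministic decoding is often preferred. Greedy search and beam search are common because they return the same summary for the same input, which makes outputs reproducible and easier to audit. Determinism alone does not make a summary faithful, but it removes sampling noise as a source of unsupported content.

| Method or setting | Behavior |
| --- | --- |
| Greedy search | Chooses the highest-probability token at each step |
| Beam search | Keeps several candidate sequences |
| Top-k sampling | Samples from the top $k$ tokens |
| Nucleus sampling | Samples from the smallest set of most-probable tokens whose cumulative probability mass exceeds $p$ |
| Length penalty | Adjusts preference for shorter or longer outputs |
| Repetition penalty | Reduces repeated phrases |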

Beam search can improve fluency, but large beam sizes may produce generic summaries. Sampling can produce varied summaries, but it can also increase hallucination. For factual summarization, lower-temperature decoding is usually safer.
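The difference between top-k and nucleus sampling is easiest to see on a single next-token distribution. The sketch below works on a plain dictionary of made-up probabilities rather than on model logits, and the function names are chosen for this example only.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

# Invented next-token distribution.
next_token = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}

greedy_choice = max(next_token, key=next_token.get)  # "the"
top2 = top_k_filter(next_token, k=2)                 # keeps "the" and "a"
nucleus = nucleus_filter(next_token, p=0.9)          # keeps "the", "a", "an"
```

Both filters remove the low-probability tail ("zebra" here) before sampling. Top-k always keeps a fixed number of tokens, while the nucleus set shrinks when the model is confident and grows when it is not.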

Example generation call:

@torch.no_grad()
def summarize(model, tokenizer, text, device):
    model.eval()

    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    )

    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    # model is the Summarizer wrapper; generate() lives on the wrapped pretrained model.
    output_ids = model.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=160,
        num_beams=4,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
    )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

Long-Document Summarization

Many documents exceed the context length of a standard model. Long-document summarization needs special handling.

Common strategies include:

| Strategy | Description |
| --- | --- |
| Truncation | Keep only the beginning of the document |
| Sliding windows | Summarize overlapping chunks |
| Map-reduce summarization | Summarize chunks, then summarize the summaries |
| Hierarchical encoding | Encode chunks, then aggregate at document level |
| Retrieval-based summarization | Select relevant passages before summarizing |
| Long-context models | Use models designed for long inputs |

Naive truncation is dangerous. In many documents, important information appears near the end: conclusions, decisions, risks, action items, or exceptions.

A simple map-reduce pipeline:

def map_reduce_summarize(summarize_fn, chunks):
    chunk_summaries = []

    for chunk in chunks:
        chunk_summaries.append(summarize_fn(chunk))

    combined = "\n".join(chunk_summaries)
    final_summary = summarize_fn(combined)

    return final_summary

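This approach scales to long inputs, but each chunk is summarized without seeing the others, so details that span chunks can be lost, and errors in the first stage propagate into the final summary. If the joined chunk summaries are themselves too long for the model, the reduce step has to be repeated.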
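The map-reduce function takes chunks as given. One way to produce them is a sliding window with overlap, sketched below. This version counts whitespace-separated words as a simplifying assumption; a real pipeline would usually count tokens with the model's tokenizer and might prefer to break at paragraph or section boundaries. The default sizes are arbitrary.

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into overlapping windows of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks

document = " ".join(f"w{i}" for i in range(10))
windows = chunk_words(document, chunk_size=4, overlap=1)
# windows: ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

The overlap repeats a little context at each boundary, so a sentence cut by one window is seen whole by the next. The resulting list can be passed as the chunks argument of a map-reduce summarizer.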

Controlling Summary Style

A summarizer should match the user’s purpose. The same source may need different outputs.

| Purpose | Good output form |
| --- | --- |
| Executive reading | Short paragraph with key consequences |
| Meeting review | Decisions, action items, owners, dates |
| Research paper | Problem, method, result, limitations |
| Legal document | Issue, rule, holding, reasoning |
| Customer support | Problem, resolution, next step |
| Search result | One-sentence snippet with evidence |

With instruction-tuned models, style can be controlled through prompts:

Summarize the document as:
- Decision:
- Evidence:
- Risks:
- Next actions:

For supervised models, style is controlled mainly by training data. If the model is fine-tuned on bullet summaries, it will tend to generate bullets.
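When a structured format is requested, it helps to check that the model actually followed it. The sketch below parses a summary written in the four-field template shown earlier and reports which fields are empty or absent. It assumes one line per field and does not handle multi-line values; the helper names are hypothetical.

```python
REQUIRED_FIELDS = ["Decision", "Evidence", "Risks", "Next actions"]

def parse_structured_summary(text, fields=REQUIRED_FIELDS):
    """Map each '- Field: value' line to its field; missing fields stay None."""
    parsed = {field: None for field in fields}
    for line in text.splitlines():
        line = line.strip().lstrip("-").strip()
        for field in fields:
            prefix = field + ":"
            if line.lower().startswith(prefix.lower()):
                # An empty value counts as missing.
                parsed[field] = line[len(prefix):].strip() or None
    return parsed

def missing_fields(parsed):
    """List the fields that the summary left empty or omitted."""
    return [field for field, value in parsed.items() if value is None]

model_output = (
    "- Decision: Ship version 2 on Friday.\n"
    "- Evidence: All regression tests pass.\n"
    "- Risks:\n"
    "- Next actions: Notify the support team."
)
parsed = parse_structured_summary(model_output)
gaps = missing_fields(parsed)  # ["Risks"]
```

A check like this can trigger a retry or a fallback, rather than passing an incomplete summary on to the user.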

Factuality and Hallucination

A summarization model hallucinates when its summary states something the source does not support, whether by inventing content or by distorting what is there.

Hallucinations include:

| Type | Example |
| --- | --- |
| Entity error | Wrong person, company, drug, or place |
| Number error | Wrong amount, date, percentage, or count |
| Relation error | Reverses who did what |
| Causal error | Invents a cause or consequence |
| Temporal error | Misstates when something happened |
| Unsupported inference | Adds a conclusion absent from the source |

Factuality is central in summarization. A fluent summary can still be wrong.

Methods to reduce hallucination include:

  1. Prefer extractive or evidence-grounded summaries for high-stakes use.
  2. Use retrieval or citation constraints.
  3. Decode conservatively.
  4. Ask the model to quote or cite supporting spans.
  5. Run a separate factual consistency checker.
  6. Use domain-specific evaluation data.
  7. Avoid asking for information that the source does not contain.

A practical rule is simple: the summary should not contain a claim that cannot be traced to the input.
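The traceability rule can be partly automated for number errors. The sketch below flags numbers that appear in a summary but nowhere in the source. It is a deliberately narrow check based on exact string matching, so it assumes the summary writes numbers in the same form as the source; "12 percent" versus "12%" would be a false alarm. It is not a substitute for a trained consistency checker.

```python
import re

# Integers, decimals, and thousands-separated numbers, with an optional percent sign.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*%?")

def unsupported_numbers(source, summary):
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(summary) if n not in source_numbers]

source = "Revenue rose 12% to 4.5 million in 2023."
faithful = "Revenue grew 12% in 2023."
drifted = "Revenue rose 15% to 4.5 million in 2023."

clean = unsupported_numbers(source, faithful)   # []
flagged = unsupported_numbers(source, drifted)  # ["15%"]
```

The same pattern extends to other claim types that can be extracted reliably, such as dates or named entities, though each extractor brings its own false positives.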

Evaluation Metrics

Summarization evaluation is difficult because many valid summaries may exist for the same source.

Common evaluation methods include the automatic metrics below, with human evaluation listed for comparison:

| Metric | Measures | Limitation |
| --- | --- | --- |
| ROUGE | N-gram overlap with reference | Rewards surface similarity |
| BLEU | Precision-oriented overlap | Designed for translation |
| METEOR | Overlap with stemming/synonyms | Still reference-dependent |
| BERTScore | Semantic similarity | May miss factual errors |
| QAEval-style metrics | Answer consistency | Depends on QA quality |
| Human evaluation | Relevance, coherence, factuality | Expensive |

ROUGE is common but incomplete. A summary can have high ROUGE and still contain factual errors. A summary can have low ROUGE and still be useful if it is phrased differently from the reference.
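Both cases can be reproduced with a few lines of code. The sketch below is a simplified ROUGE-N that uses lowercased whitespace tokens and no stemming, so its scores will not match an official ROUGE implementation exactly. The example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most min(cand, ref) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the company reported a profit"
wrong_but_similar = "the company reported a loss"
right_but_different = "earnings were positive"

high = rouge_n(wrong_but_similar, reference)   # f1 is about 0.8
low = rouge_n(right_but_different, reference)  # f1 is 0.0
```

The first candidate reverses the key fact yet shares four of five unigrams with the reference. The second conveys the right idea with no shared words and scores zero.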

Human evaluation often uses dimensions such as:

| Dimension | Meaning |
| --- | --- |
| Coverage | Captures important source information |
| Faithfulness | Does not add unsupported claims |
| Coherence | Reads naturally |
| Concision | Avoids unnecessary detail |
| Usefulness | Serves the intended task |

Extractive Baselines

Before training a large abstractive model, build extractive baselines. They are easy to implement and help detect whether the task really requires generation.

A simple baseline ranks sentences by similarity to the document centroid or by term importance. Another baseline selects the first few sentences, which is surprisingly strong for news articles because important information often appears at the beginning.

A minimal lead baseline:

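import re

def lead_summary(text, num_sentences=3):
    # Naive split after ., !, or ? plus whitespace; use a real segmenter in practice.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:num_sentences])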

A neural system should beat this baseline on the metrics that matter. If it does not, the dataset or evaluation protocol may be weak.
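The term-importance baseline mentioned above can be sketched just as briefly. This version scores each sentence by the average document-level frequency of its words and returns the top sentences in their original order. It is one simple variant among many: it uses a naive regex sentence splitter and no stop-word removal, both of which are simplifying assumptions.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive split after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def frequency_summary(text, num_sentences=2):
    """Pick the sentences whose words are most frequent in the document."""
    sentences = split_sentences(text)
    doc_counts = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(doc_counts[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore source order
    return " ".join(sentences[i] for i in chosen)

text = (
    "The budget vote passed. The weather was mild. "
    "The council approved the budget after the vote."
)
summary = frequency_summary(text, num_sentences=2)
```

On this toy input the off-topic weather sentence is dropped because its words are rare in the document. Without stop-word filtering, frequent function words such as "the" inflate scores, so a practical version would remove them or use TF-IDF weights.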

Common Failure Modes

Summarization systems fail in recurring ways.

| Failure mode | Description |
| --- | --- |
| Hallucination | Adds unsupported facts |
| Omission | Leaves out central information |
| Over-compression | Removes necessary context |
| Redundancy | Repeats the same idea |
| Entity drift | Confuses names or references |
| Number drift | Changes numerical values |
| Style mismatch | Wrong tone, length, or format |
| Lost chronology | Events appear in the wrong order |
| Source bias amplification | Repeats biased framing without qualification |
| Prompt overreach | Answers beyond the supplied document |

Long documents add more failure modes. The model may focus on early sections, miss tables, ignore appendices, or confuse similar entities across sections.
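Some of these failure modes can be detected mechanically during error analysis. As one example, the sketch below flags redundancy by listing word n-grams that occur more than once in a summary. The function name and the default of trigrams are choices made for this illustration; repeated n-grams catch verbatim repetition only, not paraphrased repetition.

```python
from collections import Counter

def repeated_ngrams(summary, n=3):
    """Return the word n-grams that occur more than once, sorted alphabetically."""
    tokens = summary.lower().split()
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return sorted(" ".join(gram) for gram, count in counts.items() if count > 1)

summary = "the board approved the plan and the board approved the budget"
repeats = repeated_ngrams(summary)
# repeats: ["board approved the", "the board approved"]
```

Running such checks over a validation set gives a rough rate for each failure mode, which is often more actionable than a single aggregate score.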

Practical System Design

A production summarization system should make several design choices explicitly.

| Decision | Typical choices |
| --- | --- |
| Input scope | Single document, many documents, retrieved passages |
| Summary type | Extractive, abstractive, hybrid |
| Output format | Paragraph, bullets, structured fields |
| Length control | Token limit, sentence count, section budget |
| Evidence policy | No citations, inline citations, quoted support |
| Update behavior | Static summary, incremental summary |
| Risk tolerance | Creative, conservative, high-faithfulness |

For low-risk consumer summaries, an abstractive model may be acceptable. For legal, medical, financial, or compliance settings, evidence-grounded summarization is safer. The system should preserve important source text, expose uncertainty, and support auditability.
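One way to make these decisions explicit is to record them in a configuration object that is validated at startup. The sketch below is illustrative only: the class name, field names, allowed values, and defaults are assumptions derived from the table above, and the rule that ties high-faithfulness settings to an evidence policy is one possible policy, not a standard.

```python
from dataclasses import dataclass

ALLOWED = {
    "input_scope": {"single_document", "multi_document", "retrieved_passages"},
    "summary_type": {"extractive", "abstractive", "hybrid"},
    "output_format": {"paragraph", "bullets", "structured_fields"},
    "evidence_policy": {"none", "inline_citations", "quoted_support"},
    "update_behavior": {"static", "incremental"},
    "risk_tolerance": {"creative", "conservative", "high_faithfulness"},
}

@dataclass(frozen=True)
class SummarizerConfig:
    input_scope: str = "single_document"
    summary_type: str = "abstractive"
    output_format: str = "paragraph"
    max_summary_tokens: int = 160
    evidence_policy: str = "none"
    update_behavior: str = "static"
    risk_tolerance: str = "conservative"

    def problems(self):
        """Return a list of human-readable configuration problems."""
        found = []
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                found.append(f"{name}={value!r} is not one of {sorted(allowed)}")
        if self.max_summary_tokens <= 0:
            found.append("max_summary_tokens must be positive")
        # Assumed policy: high-stakes settings must expose supporting evidence.
        if (self.risk_tolerance == "high_faithfulness"
                and self.evidence_policy == "none"):
            found.append("high_faithfulness requires an evidence policy")
        return found

consumer = SummarizerConfig()
compliance = SummarizerConfig(
    summary_type="hybrid",
    evidence_policy="quoted_support",
    risk_tolerance="high_faithfulness",
)
```

Keeping the choices in one place makes it harder for a high-risk deployment to inherit defaults that were chosen for a low-risk one.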

Summary

Summarization compresses source text into a shorter output. Extractive summarization selects source spans. Abstractive summarization generates new text. Encoder-decoder models and decoder-only language models are both widely used.

The core training objective is next-token likelihood conditioned on the source. This objective gives a useful model, but it does not guarantee factuality. Reliable summarization requires careful data, decoding, long-context handling, evaluation, and error analysis.

A good summarizer should preserve what matters, omit what does not, and avoid unsupported claims.