Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size

$$ T \times T. $$

This gives attention complexity

$$ O(T^2D), $$

where $D$ is the model dimension.

Quadratic scaling becomes expensive for long documents, code repositories, videos, audio streams, retrieval contexts, and agent memory traces. Efficient transformer methods reduce attention cost, memory usage, or latency while preserving as much modeling quality as possible.

Efficient transformers are therefore both an algorithmic and systems topic.

The Quadratic Attention Bottleneck

Standard self-attention computes

$$ S = \frac{QK^\top}{\sqrt{d_k}}, $$

where

$$ Q,K \in \mathbb{R}^{T \times d_k}. $$

The matrix multiplication

$$ QK^\top $$

produces

$$ S\in\mathbb{R}^{T\times T}. $$

The memory required for $S$ grows quadratically with sequence length.

For example:

Sequence length Attention matrix size
1,024 $\approx 10^6$ entries
8,192 $\approx 6.7\times10^7$ entries
32,768 $\approx 1.1\times10^9$ entries

Even when activations are stored in low precision, long-context attention quickly becomes expensive.

Efficient transformers aim to reduce one or more of:

Resource Problem
Compute Too many attention operations
Memory Large attention matrices
Latency Slow autoregressive decoding
Bandwidth Excessive movement of KV tensors

Categories of Efficient Attention

Efficient transformer methods can be grouped into several broad categories.

Category Main idea
Sparse attention Attend only to selected positions
Local attention Restrict attention to nearby tokens
Linear attention Avoid explicit $T\times T$ matrices
Low-rank attention Approximate attention structure
Memory compression Reduce stored sequence representations
Retrieval augmentation Move information outside attention
State-space models Replace attention with recurrent dynamics
Kernel fusion Improve implementation efficiency

Different methods optimize different bottlenecks. Some reduce theoretical complexity. Others mainly improve hardware efficiency.

Local Attention

Local attention restricts each token to a fixed-size neighborhood.

Suppose each token attends only within a window of size $w$. Then token $i$ attends to positions

$$ [i-w, \ldots, i+w]. $$

Instead of attending to all $T$ tokens, each token attends to roughly $2w+1$ positions.

The attention complexity becomes

$$ O(TwD). $$

If $w \ll T$, this is much cheaper than quadratic attention.

Local attention works well when most useful dependencies are nearby. This is common in language, audio, and images.

However, purely local attention struggles with long-range dependencies unless information propagates gradually across layers.

Sliding Window Attention

Sliding window attention is a practical form of local attention.

Each attention window overlaps with nearby windows:

Window 1: [1 2 3 4]
Window 2:   [3 4 5 6]
Window 3:     [5 6 7 8]

Overlapping windows allow information to propagate across long sequences over multiple layers.

If each layer expands the receptive field slightly, deep stacks can still model long-range structure.

Sliding-window attention is common in long-context language models because it preserves locality while remaining efficient.

Global Tokens

Local attention can be augmented with global tokens.

A global token attends to all positions and is visible to all positions. This creates a communication pathway across distant regions.

Examples include:

Global structure Purpose
CLS token Sequence summarization
Retrieval tokens External memory access
Sink tokens Stabilize long-context attention
Landmark tokens Hierarchical compression

A hybrid local-global pattern may look like:

Local windows + selected global connections

This preserves efficient local computation while enabling long-range interaction.

Sparse Attention

Sparse attention allows each token to attend only to selected positions rather than to the full sequence.

Instead of a dense $T\times T$ attention matrix, sparse attention uses a sparse graph.

Common sparse patterns include:

Pattern Description
Local Nearby tokens only
Strided Every $k$-th token
Block sparse Dense blocks with sparse global structure
Random Randomly selected links
Dilated Exponentially spaced positions

A sparse attention mask might look conceptually like:

x . . x . . x
. x . . x . .
. . x . . x .
x . . x . . x

Here x indicates allowed attention positions.

Sparse attention reduces complexity approximately to

$$ O(TsD), $$

where $s$ is the average number of attended positions per token.

The challenge is preserving information flow and hardware efficiency. Sparse patterns may reduce theoretical complexity but perform poorly on GPUs if memory access becomes irregular.

Block Sparse Attention

Block sparse attention groups tokens into blocks and applies sparse connectivity among blocks.

Suppose the sequence is divided into blocks of size $b$. Instead of attending token-by-token, attention is computed block-by-block.

This improves hardware utilization because matrix operations remain dense inside blocks.

A block sparse pattern may include:

Connection type Purpose
Local neighboring blocks Nearby context
Global blocks Shared information
Random blocks Long-range connectivity

Block sparse methods often achieve better accelerator efficiency than fine-grained sparse masks.

Linear Attention

Linear attention avoids explicitly constructing the $T\times T$ attention matrix.

Standard attention computes

$$ \text{softmax}(QK^\top)V. $$

Linear attention methods rewrite or approximate this expression so computation scales linearly with sequence length.

A common idea is to approximate attention using feature maps:

$$ \phi(Q)\phi(K)^\top. $$

Then attention becomes

$$ \text{Attention}(Q,K,V) = \frac{ \phi(Q) \left( \phi(K)^\top V \right) }{ \phi(Q) \left( \phi(K)^\top \mathbf{1} \right) }. $$

The key observation is that

$$ \phi(K)^\top V $$

can be accumulated incrementally without forming a full attention matrix.

This changes complexity from

$$ O(T^2D) $$

to approximately

$$ O(TD^2). $$

Linear attention is attractive for very long sequences, but approximating softmax attention accurately is difficult. Many linear methods trade modeling quality for efficiency.

Kernelized Attention

Linear attention often uses kernel methods.

Suppose softmax attention is approximated by a positive kernel:

$$ \exp(q^\top k) \approx \phi(q)^\top \phi(k). $$

The feature map $\phi$ transforms queries and keys into a space where the attention interaction becomes associative.

This allows reordering computation:

$$ (QK^\top)V = Q(K^\top V). $$

The right-hand side avoids constructing a $T\times T$ matrix explicitly.

Kernelized attention methods differ mainly in their feature map design and numerical stabilization.

Low-Rank Attention

Attention matrices often have redundant structure. Low-rank methods approximate the attention matrix using fewer dimensions.

A low-rank approximation may represent

$$ S \approx UV^\top, $$

where

$$ U,V\in\mathbb{R}^{T\times r}, $$

and $r \ll T$.

Low-rank methods reduce memory and compute cost, especially when the attention matrix is approximately compressible.

However, some tasks require highly detailed token-token interactions, making aggressive compression harmful.

Attention Pooling and Compression

Another approach compresses the sequence itself.

Suppose the original sequence length is $T$. A compression layer produces a shorter representation of length $T'$:

$$ T' < T. $$

Attention is then computed over the compressed representation.

Compression methods include:

Method Description
Pooling Average or max pooling
Learned projection Linear compression
Clustering Group similar tokens
Landmark selection Keep important positions
Downsampling Reduce temporal resolution

This is common in hierarchical transformers for documents, videos, and multimodal systems.

Memory-Augmented Transformers

Memory-augmented transformers extend context using external memory.

Instead of storing everything in attention, the model maintains memory states:

Memory type Description
Cached hidden states Reuse previous activations
Retrieval memory External database or vector store
Learned memory tokens Persistent trainable memory
Compressed summaries Historical context compression

Transformer-XL introduced segment recurrence. Hidden states from previous segments become memory for future segments.

Retrieval-augmented systems use external retrieval instead of expanding the context window indefinitely.

FlashAttention

Some efficient transformer methods do not change the mathematical attention formula. Instead, they improve implementation efficiency.

FlashAttention computes exact attention while reducing memory movement between GPU memory and on-chip SRAM.

The key idea is tiled attention computation:

  1. Load small query/key/value blocks.
  2. Compute partial attention.
  3. Fuse softmax and matrix multiplication.
  4. Avoid materializing the full attention matrix.

FlashAttention reduces memory overhead and improves throughput while preserving exact softmax attention.

This is important because modern accelerators are often bandwidth-bound rather than arithmetic-bound.

In practice, FlashAttention is widely used even in standard transformers because it improves efficiency without changing model behavior.

Grouped and Multi-Query Attention

During autoregressive decoding, KV cache memory becomes expensive.

Standard multi-head attention stores separate keys and values for every head:

$$ K,V\in\mathbb{R}^{B\times h\times T\times d_h}. $$

Grouped-query attention and multi-query attention reduce KV storage.

Method KV sharing
Multi-head attention Separate KV per head
Multi-query attention One shared KV for all heads
Grouped-query attention One KV per head group

This reduces memory bandwidth and inference latency.

Grouped-query attention is widely used in large decoder-only language models because it improves serving efficiency while preserving quality better than fully shared KV.

Chunked and Streaming Attention

Streaming systems process sequences incrementally rather than as complete fixed contexts.

For audio or live interaction, tokens arrive continuously:

x1 → x2 → x3 → ...

Streaming attention uses chunked processing:

Method Description
Fixed chunks Process windows independently
Overlapping chunks Preserve continuity
Rolling memory Carry limited state forward
Streaming causal attention Online autoregressive processing

Streaming models must balance latency and context size.

State Space Alternatives

Some architectures replace attention entirely.

State space models process sequences using learned dynamical systems rather than token-token attention.

A simplified state-space recurrence is

$$ h_t = Ah_{t-1} + Bx_t, $$

$$ y_t = Ch_t. $$

Modern state-space models use structured parameterizations and parallel scan algorithms to scale efficiently.

Advantages include:

Property Benefit
Linear sequence scaling Better long-context efficiency
Constant-size hidden state Compact memory
Streaming-friendly Natural online processing

Attention remains stronger for some reasoning and retrieval tasks, but state-space methods are increasingly competitive for long sequences.

Efficient Attention in Vision

Vision transformers often use structured efficient attention.

Common strategies include:

Method Purpose
Window attention Restrict attention locally
Shifted windows Cross-window interaction
Hierarchical pooling Reduce resolution
Patch merging Compress sequence length
Sparse global tokens Long-range communication

Images have strong spatial locality, so efficient local attention works particularly well.

Choosing an Efficient Transformer Design

The best efficient transformer depends on the application.

Application Common strategy
Long documents Sliding-window or sparse attention
Real-time generation KV-efficient decoding
Mobile inference Quantized compact attention
Video modeling Hierarchical compression
Streaming speech Chunked causal attention
Retrieval systems Retrieval augmentation
Million-token contexts Sparse/local/retrieval hybrids

There is no universally best efficient transformer. Tradeoffs involve:

Tradeoff Example
Quality vs speed Dense attention is usually strongest
Simplicity vs scalability Sparse patterns complicate kernels
Memory vs compute Checkpointing recomputes activations
Exactness vs approximation Linear attention approximates softmax

Practical systems often combine multiple techniques.

Summary

Standard transformer attention scales quadratically with sequence length because it constructs a full attention matrix. Efficient transformers reduce compute, memory, latency, or bandwidth cost using sparse attention, local attention, linear attention, low-rank approximations, sequence compression, retrieval augmentation, optimized kernels, or alternative sequence models.

Some methods modify the attention algorithm itself. Others keep exact attention but optimize implementation. Modern large-scale systems often combine efficient attention, KV-efficient decoding, FlashAttention kernels, retrieval systems, and distributed memory management.