Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size

$$ T \times T. $$

This gives attention complexity

$$ O(T^2D), $$

where $D$ is the model dimension.

Quadratic scaling becomes expensive for long documents, code repositories, videos, audio streams, retrieval contexts, and agent memory traces. Efficient transformer methods reduce attention cost, memory usage, or latency while preserving as much modeling quality as possible.

Efficient transformers are therefore both an algorithmic and systems topic.

The Quadratic Attention Bottleneck

Standard self-attention computes

$$ S = \frac{QK^\top}{\sqrt{d_k}}, $$

where

$$ Q,K \in \mathbb{R}^{T \times d_k}. $$

The matrix multiplication

$$ QK^\top $$

produces

$$ S\in\mathbb{R}^{T\times T}. $$

The memory required for $S$ grows quadratically with sequence length.

For example:

Sequence length	Attention matrix size
1,024	$\approx 10^6$ entries
8,192	$\approx 6.7\times10^7$ entries
32,768	$\approx 1.1\times10^9$ entries

Even when activations are stored in low precision, long-context attention quickly becomes expensive.

Efficient transformers aim to reduce one or more of:

Resource	Problem
Compute	Too many attention operations
Memory	Large attention matrices
Latency	Slow autoregressive decoding
Bandwidth	Excessive movement of KV tensors

Categories of Efficient Attention

Efficient transformer methods can be grouped into several broad categories.

Category	Main idea
Sparse attention	Attend only to selected positions
Local attention	Restrict attention to nearby tokens
Linear attention	Avoid explicit $T\times T$ matrices
Low-rank attention	Approximate attention structure
Memory compression	Reduce stored sequence representations
Retrieval augmentation	Move information outside attention
State-space models	Replace attention with recurrent dynamics
Kernel fusion	Improve implementation efficiency

Different methods optimize different bottlenecks. Some reduce theoretical complexity. Others mainly improve hardware efficiency.

Local Attention

Local attention restricts each token to a fixed-size neighborhood.

Suppose each token attends only within a window of size $w$. Then token $i$ attends to positions

$$ [i-w, \ldots, i+w]. $$

Instead of attending to all $T$ tokens, each token attends to roughly $2w+1$ positions.

The attention complexity becomes

$$ O(TwD). $$

If $w \ll T$, this is much cheaper than quadratic attention.

Local attention works well when most useful dependencies are nearby. This is common in language, audio, and images.

However, purely local attention struggles with long-range dependencies unless information propagates gradually across layers.

Sliding Window Attention

Sliding window attention is a practical form of local attention.

Each attention window overlaps with nearby windows:

Window 1: [1 2 3 4]
Window 2:   [3 4 5 6]
Window 3:     [5 6 7 8]

Overlapping windows allow information to propagate across long sequences over multiple layers.

If each layer expands the receptive field slightly, deep stacks can still model long-range structure.

Sliding-window attention is common in long-context language models because it preserves locality while remaining efficient.

Global Tokens

Local attention can be augmented with global tokens.

A global token attends to all positions and is visible to all positions. This creates a communication pathway across distant regions.

Examples include:

Global structure	Purpose
CLS token	Sequence summarization
Retrieval tokens	External memory access
Sink tokens	Stabilize long-context attention
Landmark tokens	Hierarchical compression

A hybrid local-global pattern may look like:

Local windows + selected global connections

This preserves efficient local computation while enabling long-range interaction.

Sparse Attention

Sparse attention allows each token to attend only to selected positions rather than to the full sequence.

Instead of a dense $T\times T$ attention matrix, sparse attention uses a sparse graph.

Common sparse patterns include:

Pattern	Description
Local	Nearby tokens only
Strided	Every $k$-th token
Block sparse	Dense blocks with sparse global structure
Random	Randomly selected links
Dilated	Exponentially spaced positions

A sparse attention mask might look conceptually like:

x . . x . . x
. x . . x . .
. . x . . x .
x . . x . . x

Here x indicates allowed attention positions.

Sparse attention reduces complexity approximately to

$$ O(TsD), $$

where $s$ is the average number of attended positions per token.

The challenge is preserving information flow and hardware efficiency. Sparse patterns may reduce theoretical complexity but perform poorly on GPUs if memory access becomes irregular.

Block Sparse Attention

Block sparse attention groups tokens into blocks and applies sparse connectivity among blocks.

Suppose the sequence is divided into blocks of size $b$. Instead of attending token-by-token, attention is computed block-by-block.

This improves hardware utilization because matrix operations remain dense inside blocks.

A block sparse pattern may include:

Connection type	Purpose
Local neighboring blocks	Nearby context
Global blocks	Shared information
Random blocks	Long-range connectivity

Block sparse methods often achieve better accelerator efficiency than fine-grained sparse masks.

Linear Attention

Linear attention avoids explicitly constructing the $T\times T$ attention matrix.

Standard attention computes

$$ \text{softmax}(QK^\top)V. $$

Linear attention methods rewrite or approximate this expression so computation scales linearly with sequence length.

A common idea is to approximate attention using feature maps:

$$ \phi(Q)\phi(K)^\top. $$

Then attention becomes

$$ \text{Attention}(Q,K,V) = \frac{ \phi(Q) \left( \phi(K)^\top V \right) }{ \phi(Q) \left( \phi(K)^\top \mathbf{1} \right) }. $$

The key observation is that

$$ \phi(K)^\top V $$

can be accumulated incrementally without forming a full attention matrix.

This changes complexity from

$$ O(T^2D) $$

to approximately

$$ O(TD^2). $$

Linear attention is attractive for very long sequences, but approximating softmax attention accurately is difficult. Many linear methods trade modeling quality for efficiency.

Kernelized Attention

Linear attention often uses kernel methods.

Suppose softmax attention is approximated by a positive kernel:

$$ \exp(q^\top k) \approx \phi(q)^\top \phi(k). $$

The feature map $\phi$ transforms queries and keys into a space where the attention interaction becomes associative.

This allows reordering computation:

$$ (QK^\top)V = Q(K^\top V). $$

The right-hand side avoids constructing a $T\times T$ matrix explicitly.

Kernelized attention methods differ mainly in their feature map design and numerical stabilization.

Low-Rank Attention

Attention matrices often have redundant structure. Low-rank methods approximate the attention matrix using fewer dimensions.

A low-rank approximation may represent

$$ S \approx UV^\top, $$

where

$$ U,V\in\mathbb{R}^{T\times r}, $$

and $r \ll T$.

Low-rank methods reduce memory and compute cost, especially when the attention matrix is approximately compressible.

However, some tasks require highly detailed token-token interactions, making aggressive compression harmful.

Attention Pooling and Compression

Another approach compresses the sequence itself.

Suppose the original sequence length is $T$. A compression layer produces a shorter representation of length $T'$:

$$ T' < T. $$

Attention is then computed over the compressed representation.

Compression methods include:

Method	Description
Pooling	Average or max pooling
Learned projection	Linear compression
Clustering	Group similar tokens
Landmark selection	Keep important positions
Downsampling	Reduce temporal resolution

This is common in hierarchical transformers for documents, videos, and multimodal systems.

Memory-Augmented Transformers

Memory-augmented transformers extend context using external memory.

Instead of storing everything in attention, the model maintains memory states:

Memory type	Description
Cached hidden states	Reuse previous activations
Retrieval memory	External database or vector store
Learned memory tokens	Persistent trainable memory
Compressed summaries	Historical context compression

Transformer-XL introduced segment recurrence. Hidden states from previous segments become memory for future segments.

Retrieval-augmented systems use external retrieval instead of expanding the context window indefinitely.

FlashAttention

Some efficient transformer methods do not change the mathematical attention formula. Instead, they improve implementation efficiency.

FlashAttention computes exact attention while reducing memory movement between GPU memory and on-chip SRAM.

The key idea is tiled attention computation:

Load small query/key/value blocks.
Compute partial attention.
Fuse softmax and matrix multiplication.
Avoid materializing the full attention matrix.

FlashAttention reduces memory overhead and improves throughput while preserving exact softmax attention.

This is important because modern accelerators are often bandwidth-bound rather than arithmetic-bound.

In practice, FlashAttention is widely used even in standard transformers because it improves efficiency without changing model behavior.

Grouped and Multi-Query Attention

During autoregressive decoding, KV cache memory becomes expensive.

Standard multi-head attention stores separate keys and values for every head:

$$ K,V\in\mathbb{R}^{B\times h\times T\times d_h}. $$

Grouped-query attention and multi-query attention reduce KV storage.

Method	KV sharing
Multi-head attention	Separate KV per head
Multi-query attention	One shared KV for all heads
Grouped-query attention	One KV per head group

This reduces memory bandwidth and inference latency.

Grouped-query attention is widely used in large decoder-only language models because it improves serving efficiency while preserving quality better than fully shared KV.

Chunked and Streaming Attention

Streaming systems process sequences incrementally rather than as complete fixed contexts.

For audio or live interaction, tokens arrive continuously:

x1 → x2 → x3 → ...

Streaming attention uses chunked processing:

Method	Description
Fixed chunks	Process windows independently
Overlapping chunks	Preserve continuity
Rolling memory	Carry limited state forward
Streaming causal attention	Online autoregressive processing

Streaming models must balance latency and context size.

State Space Alternatives

Some architectures replace attention entirely.

State space models process sequences using learned dynamical systems rather than token-token attention.

A simplified state-space recurrence is

$$ h_t = Ah_{t-1} + Bx_t, $$

$$ y_t = Ch_t. $$

Modern state-space models use structured parameterizations and parallel scan algorithms to scale efficiently.

Advantages include:

Property	Benefit
Linear sequence scaling	Better long-context efficiency
Constant-size hidden state	Compact memory
Streaming-friendly	Natural online processing

Attention remains stronger for some reasoning and retrieval tasks, but state-space methods are increasingly competitive for long sequences.

Efficient Attention in Vision

Vision transformers often use structured efficient attention.

Common strategies include:

Method	Purpose
Window attention	Restrict attention locally
Shifted windows	Cross-window interaction
Hierarchical pooling	Reduce resolution
Patch merging	Compress sequence length
Sparse global tokens	Long-range communication

Images have strong spatial locality, so efficient local attention works particularly well.

Choosing an Efficient Transformer Design

The best efficient transformer depends on the application.

Application	Common strategy
Long documents	Sliding-window or sparse attention
Real-time generation	KV-efficient decoding
Mobile inference	Quantized compact attention
Video modeling	Hierarchical compression
Streaming speech	Chunked causal attention
Retrieval systems	Retrieval augmentation
Million-token contexts	Sparse/local/retrieval hybrids

There is no universally best efficient transformer. Tradeoffs involve:

Tradeoff	Example
Quality vs speed	Dense attention is usually strongest
Simplicity vs scalability	Sparse patterns complicate kernels
Memory vs compute	Checkpointing recomputes activations
Exactness vs approximation	Linear attention approximates softmax

Practical systems often combine multiple techniques.

Summary

Standard transformer attention scales quadratically with sequence length because it constructs a full attention matrix. Efficient transformers reduce compute, memory, latency, or bandwidth cost using sparse attention, local attention, linear attention, low-rank approximations, sequence compression, retrieval augmentation, optimized kernels, or alternative sequence models.

Some methods modify the attention algorithm itself. Others keep exact attention but optimize implementation. Modern large-scale systems often combine efficient attention, KV-efficient decoding, FlashAttention kernels, retrieval systems, and distributed memory management.