GPU Tensor Kernels

Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how...

GPU Tensor Kernels

Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how efficiently tensor kernels execute on accelerators.

GPU tensor kernels are the computational substrate of modern deep learning, scientific computing, and differentiable simulation. Reverse-mode AD amplifies the importance of kernel efficiency because every forward operation usually induces at least one backward operation.

The challenge is therefore not only:

$$ \text{compute derivatives}, $$

but also:

$$ \text{compute derivatives with high hardware efficiency}. $$

GPU Execution Model

GPUs are throughput-oriented processors.

They achieve performance through massive parallelism:

Hardware Unit Role
Thread Smallest execution unit
Warp / wavefront SIMD execution group
Block / workgroup Cooperative thread group
Streaming multiprocessor Parallel execution engine
Global memory Large but slow
Shared memory Fast local scratchpad
Registers Fastest storage

Tensor operations map naturally onto GPUs because many array elements can be processed independently.

Example:

$$ Y_{ij} = X_{ij} + Z_{ij} $$

can assign one thread per element.

Matrix multiplication assigns thread blocks to output tiles.

Tensor Kernel

A tensor kernel is a low-level operation executed on accelerator hardware.

Examples:

Kernel Type Example
Elementwise ReLU, exp, add
Reduction sum, max
Contraction GEMM
Gather/scatter indexing
Convolution CNN kernels
Softmax normalized exponential
Attention batched contraction

An AD engine lowers tensor programs into sequences of kernels.

Forward and Backward Kernels

Every differentiable primitive induces backward kernels.

Example:

Forward:

$$ Y = XW. $$

Backward:

$$ \bar{X} = \bar{Y}W^T, $$

$$ \bar{W} = X^T\bar{Y}. $$

Thus one matrix multiplication creates two additional matrix multiplications during reverse mode.

Backward computation is therefore often more expensive than forward computation.

Approximate cost:

Pass Relative FLOPs
Forward 1x
Backward 2x to 3x

Memory traffic may dominate arithmetic cost even more strongly in backward passes.

Arithmetic Intensity

GPU efficiency depends heavily on arithmetic intensity:

$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}} {\text{Memory Traffic}}. $$

High arithmetic intensity means many computations per byte loaded.

Examples:

Operation Intensity
Matrix multiplication High
Elementwise add Low
Reduction Moderate
Scatter-add Very low

Matrix multiplication maps extremely well to GPUs because data reuse is high.

Elementwise kernels are often memory-bandwidth bound.

Matrix Multiplication Kernels

General matrix multiplication (GEMM) is the dominant dense tensor primitive.

$$ C = AB. $$

Efficient GPU GEMM uses:

tiling
shared memory reuse
register blocking
warp-level matrix instructions
tensor cores

A tile-based kernel computes submatrices:

$$ C_{tile} = A_{tile}B_{tile}. $$

Threads cooperatively load blocks into shared memory and reuse them repeatedly.

Backward rules reuse GEMM:

Gradient Expression
Input gradient $\bar{X} = \bar{Y}W^T$
Weight gradient $\bar{W} = X^T\bar{Y}$

Thus highly optimized GEMM kernels are critical for AD performance.

Tensor Cores

Modern GPUs include specialized matrix-multiply hardware:

Vendor Unit
NVIDIA Tensor cores
AMD Matrix cores
TPU Systolic arrays

These units accelerate:

$$ D = AB + C. $$

They often use reduced precision:

Precision Common Use
FP16 Training
BF16 Training
TF32 Mixed precision
FP8 Emerging large-scale training

AD systems must preserve numerical stability under reduced precision arithmetic.

Kernel Fusion

Kernel launch overhead is expensive.

Instead of:

tmp1 = x + y
tmp2 = relu(tmp1)
tmp3 = tmp2 * z

a compiler may fuse into:

out = relu(x + y) * z

This avoids:

Cost Reason
Intermediate memory writes Temporary tensors
Extra global memory reads Reloading intermediates
Kernel launch overhead Multiple launches

Fusion is especially important for backward passes, which otherwise produce many tiny kernels.

Backward Fusion

Reverse-mode AD naturally creates chains of small operations.

Example:

$$ y = \sin(x^2 + z). $$

Backward:

$$ \bar{x} = \bar{y} \cos(x^2+z) 2x. $$

Naively, this creates several intermediate tensors.

Fusion combines them into one kernel.

Without fusion:

tmp1 = x*x
tmp2 = tmp1 + z
tmp3 = cos(tmp2)
tmp4 = tmp3 * 2*x
dx   = dy * tmp4

With fusion:

dx = dy * cos(x*x + z) * 2*x

Fusion reduces memory bandwidth pressure dramatically.

Reduction Kernels

Reductions are harder than elementwise kernels because they require communication.

Example:

$$ y = \sum_i x_i. $$

Parallel reduction proceeds hierarchically:

  1. Local thread accumulation.
  2. Warp reduction.
  3. Block reduction.
  4. Global reduction.

Backward rule:

$$ \bar{x}_i = \bar{y}. $$

Forward reduction is communication-heavy. Backward broadcast is embarrassingly parallel.

Reductions are often synchronization-bound.

Softmax Kernels

Softmax combines reduction and elementwise operations:

$$ y_i = \frac{e^{x_i}} {\sum_j e^{x_j}}. $$

Efficient kernels fuse:

max reduction
stabilized subtraction
exponentiation
sum reduction
division

Backward softmax also fuses reductions:

$$ \bar{x} = y \odot ( \bar{y} - \langle \bar{y}, y\rangle ). $$

Attention kernels heavily depend on efficient softmax implementations.

Convolution Kernels

Convolutions are structured tensor contractions.

Forward:

$$ Y = X * K. $$

Backward produces:

Gradient Operation
Input gradient Convolution with flipped kernel
Kernel gradient Correlation
Bias gradient Reduction

GPU implementations use:

Method Use
Direct convolution Small kernels
im2col + GEMM General dense convolution
FFT convolution Large kernels
Winograd Small dense kernels

Backward convolutions often have different optimal algorithms from forward convolutions.

Attention Kernels

Transformer attention dominates modern large models.

Basic attention:

$$ S = \frac{QK^T}{\sqrt{d}}, $$

$$ P = \operatorname{softmax}(S), $$

$$ Y = PV. $$

Backward pass involves multiple matrix multiplications and softmax gradients.

Attention is memory-intensive because:

$$ S \in \mathbb{R}^{n \times n}. $$

Long sequences create quadratic memory growth.

Flash Attention

Flash Attention avoids materializing the full attention matrix.

Key idea:

stream blocks through shared memory
recompute local statistics
avoid large intermediate tensors

This reduces:

Resource Reduction
Global memory traffic Large
HBM usage Large
Intermediate allocation Eliminated

Backward pass recomputes local softmax statistics instead of storing them.

This is an example of recomputation trading FLOPs for memory.

Checkpointing and Recomputation

Backward passes require forward intermediates.

Storing all activations is expensive.

Checkpointing stores only selected tensors:

store sparse checkpoints
recompute missing intermediates during backward

Tradeoff:

Strategy Memory Compute
Full storage High Low
Full recomputation Low High
Checkpointing Medium Medium

Large-scale AD systems rely heavily on recomputation.

Memory Bandwidth

Many tensor kernels are memory-bound rather than compute-bound.

Example:

$$ y_i = x_i + z_i. $$

This performs:

  • 1 addition
  • multiple memory loads/stores

Arithmetic cost is tiny compared to memory movement.

GPU optimization therefore focuses on:

coalesced access
cache reuse
shared memory reuse
kernel fusion
layout optimization

Bandwidth dominates many backward passes.

Tensor Layout

Tensor layout affects kernel efficiency.

Common layouts:

Layout Description
Row-major C layout
Column-major Fortran layout
NHWC Channels last
NCHW Channels first

Different kernels prefer different layouts.

Layout transforms themselves cost memory bandwidth.

AD systems often propagate layout constraints through graphs to avoid unnecessary transposes.

Strides and Views

Many tensor operations create views rather than copies.

Example:

y = x.transpose()

No data movement may occur.

The kernel instead uses modified strides.

Backward kernels must preserve aliasing semantics.

View-based AD is significantly harder than purely functional tensor models.

Scatter and Gather

Sparse and indexing operations use:

Operation Description
Gather Read scattered locations
Scatter Write scattered locations
Scatter-add Accumulate scattered writes

Backward rules swap gather and scatter-add.

Example:

Forward:

$$ y_i = x_{idx_i}. $$

Backward:

$$ \bar{x}_{idx_i} \mathrel{+}= \bar{y}_i. $$

Scatter-add is difficult on GPUs because many threads may update the same location.

This requires:

atomic operations
segmented reductions
parallel conflict resolution

Parallel Reduction Nondeterminism

Floating-point addition is not associative.

Parallel reductions may produce:

$$ (a+b)+c \ne a+(b+c). $$

Different thread scheduling gives slightly different results.

Thus GPU gradients are often nondeterministic at low precision.

Tradeoff:

Goal Cost
Deterministic reductions Lower throughput
Maximum throughput Nondeterministic accumulation

Large-scale training usually prioritizes throughput.

Mixed Precision Training

Modern AD systems use mixed precision extensively.

Forward:

Tensor Type Precision
Activations FP16/BF16
Weights FP16/BF16
Accumulators FP32

Backward:

Quantity Precision
Gradients Often FP16/BF16
Master weights FP32

Loss scaling prevents gradient underflow:

$$ L' = sL. $$

Backward computes scaled gradients, then rescales:

$$ \nabla L = \frac{1}{s}\nabla L'. $$

Mixed precision is essential for modern large-scale training.

Compiler Systems

Modern AD systems increasingly compile tensor graphs.

Examples:

System Compiler
JAX XLA
PyTorch TorchInductor
TensorFlow XLA
TVM Tensor compiler
Triton GPU DSL

These systems perform:

fusion
layout optimization
kernel scheduling
memory planning
vectorization
autotuning

AD is now tightly integrated with compiler optimization.

Autotuning

Kernel performance depends on:

tile sizes
warp shapes
block dimensions
shared memory usage
register pressure

Compilers often benchmark multiple variants dynamically.

Autotuning is especially important for:

Primitive Sensitivity
GEMM Very high
Convolution Very high
Attention High
Reductions Moderate

GPU-Aware AD Design

A GPU-efficient AD system should minimize:

Problem Cause
Tiny kernels Poor fusion
Excessive intermediates Materialization
Global memory traffic Unfused graphs
Layout conversions Mismatched kernels
Synchronization Reductions/scatter
Recomputation imbalance Poor checkpointing

Good AD design is therefore also hardware design.

Practical Principles

Prefer:

Good Practice Avoid
Fused kernels Long chains of tiny ops
Tensor contractions Scalar loops
Structured reductions Random scatter
Recomputation checkpointing Storing every activation
Mixed precision Full FP64 everywhere
Layout-aware graphs Constant transposes

For compiler designers:

Focus Importance
Memory bandwidth Critical
Fusion Critical
Tensor contraction scheduling Critical
Reduction efficiency Major
Autotuning Major
Sparse kernel optimization Increasingly important

Summary

GPU tensor kernels are the execution layer beneath modern automatic differentiation systems. The mathematical derivative rules are relatively small compared to the systems complexity required to execute them efficiently.

Reverse-mode AD amplifies memory traffic, synchronization cost, and intermediate tensor allocation. Efficient implementations therefore rely on fusion, tiling, mixed precision, checkpointing, layout optimization, and compiler-guided kernel generation.

Modern AD systems are best understood not merely as differentiation engines, but as tensor compiler runtimes specialized for accelerator hardware.