Testing Derivatives

An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or...

Testing Derivatives

An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or scientific computation. Derivative testing therefore belongs at the center of AD system design.

A correct primal computation with incorrect gradients is usually worse than a runtime error. Numerical optimization may still appear to progress while converging to meaningless results.

Derivative testing should validate:

  • local operator rules
  • graph traversal logic
  • gradient accumulation
  • broadcasting semantics
  • tensor reductions
  • higher-order behavior
  • numerical stability

The most important principle:

Every primitive operator should be testable independently.

Classes of Errors

Derivative bugs usually fall into a few categories.

Error type Example
Wrong derivative formula Missing factor or sign
Incorrect accumulation Using = instead of +=
Shape mismatch Incorrect reduction axes
Broadcasting bug Missing sum during backward
Aliasing bug Overwritten saved value
Numerical instability Overflow in backward
Traversal bug Visiting node multiple times
Missing gradient path Detached edge
Stateful inconsistency Different random mask

Many of these errors produce gradients with plausible values. That is why testing matters.

Finite Difference Approximation

The simplest validation method compares AD gradients against finite differences.

For:

$$ f : \mathbb{R} \to \mathbb{R} $$

central difference approximation:

$$ f'(x) \approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon} $$

A helper:

func FiniteDiff(
    f func(float64) float64,
    x float64,
) float64 {
    eps := 1e-6

    return (
        f(x+eps) -
        f(x-eps)
    ) / (2 * eps)
}

Example function:

$$ f(x) = x^2 + \sin(x) $$

func scalarFunction(x float64) float64 {
    return x*x + math.Sin(x)
}

AD version:

func scalarGrad(xval float64) float64 {
    var t Tape

    x := t.Var(xval)

    y := t.Add(
        t.Mul(x, x),
        t.Sin(x),
    )

    t.Backward(y)

    return t.Grads[x]
}

Test:

func TestScalarGrad(t *testing.T) {
    x := 2.0

    fd := FiniteDiff(scalarFunction, x)
    ad := scalarGrad(x)

    if math.Abs(fd-ad) > 1e-5 {
        t.Fatalf(
            "gradient mismatch: fd=%g ad=%g",
            fd,
            ad,
        )
    }
}

This is the basic derivative sanity check.

Choosing Epsilon

Finite differences are approximate. The choice of:

$$ \epsilon $$

matters.

Too small:

  • cancellation error dominates

Too large:

  • truncation error dominates

Typical values:

  • 1e-4
  • 1e-5
  • 1e-6

Central differences are preferred over forward differences because they have smaller truncation error.

Forward difference:

$$ \frac{f(x+\epsilon)-f(x)}{\epsilon} $$

Central difference:

$$ \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon} $$

Central difference is usually much more accurate for testing.

Relative Error

Absolute error alone can be misleading.

Better comparison:

$$ \text{relative error} = \frac{|a-b|} {\max(1, |a|, |b|)} $$

Helper:

func RelativeError(a, b float64) float64 {
    return math.Abs(a-b) /
        math.Max(
            1,
            math.Max(math.Abs(a), math.Abs(b)),
        )
}

Then:

if RelativeError(fd, ad) > 1e-5 {
    t.Fatalf("gradient mismatch")
}

Relative error scales better across small and large gradients.

Operator-Level Testing

Each primitive operator should have isolated tests.

Example for multiplication:

$$ f(x, y) = xy $$

Expected derivatives:

$$ \frac{\partial f}{\partial x} = y $$

$$ \frac{\partial f}{\partial y} = x $$

Test:

func TestMulGrad(t *testing.T) {
    var tape Tape

    x := tape.Var(3)
    y := tape.Var(5)

    z := tape.Mul(x, y)

    tape.Backward(z)

    if tape.Grads[x] != 5 {
        t.Fatalf("wrong grad dx")
    }

    if tape.Grads[y] != 3 {
        t.Fatalf("wrong grad dy")
    }
}

Primitive operator tests should be:

  • small
  • exact when possible
  • easy to inspect

These tests catch most implementation bugs early.

Shared Subexpression Testing

Gradient accumulation is a common failure point.

Example:

$$ f(x) = x^2 + x^2 $$

Derivative:

$$ f'(x) = 4x $$

Test:

func TestSharedNodeAccumulation(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x)
    y := tape.Add(a, a)

    tape.Backward(y)

    want := 12.0

    if tape.Grads[x] != want {
        t.Fatalf(
            "got %g want %g",
            tape.Grads[x],
            want,
        )
    }
}

This catches incorrect overwrite behavior:

grad = ...

instead of:

grad += ...

Traversal Testing

Reverse traversal should visit nodes exactly once while still accumulating every edge contribution.

A graph with sharing is a good test case:

x -> a -> y
 \       /
  ------- 

Tests should verify:

  • no duplicate backward execution
  • no missing edge contributions
  • stable topological ordering

An incorrect traversal often produces gradients that are:

  • doubled
  • partially missing
  • dependent on node allocation order

These bugs are subtle without dedicated tests.

Tensor Shape Testing

Tensor AD systems need shape-aware tests.

Broadcasting example:

x shape: [2, 3]
b shape: [3]

y = x + b

Expected backward:

grad_b = reduce_sum(grad_y, axis=0)

A shape test should verify:

  • output shape
  • gradient shape
  • numerical values

Shape mismatches are among the most common tensor AD bugs.

Reduction Testing

Reduction operators require careful backward logic.

Example:

$$ y = \sum_i x_i $$

Backward should broadcast the output gradient back to all inputs.

Test:

func TestSumBackward(t *testing.T) {
    x := Tensor{
        Data: []float64{1, 2, 3},
        Shape: []int{3},
    }

    gradOut := 2.0

    gradX := SumBackward(x, gradOut)

    want := []float64{2, 2, 2}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf("bad sum grad")
        }
    }
}

Reduction tests should cover:

  • axis reduction
  • keepdims behavior
  • empty dimensions
  • singleton dimensions

Numerical Stability Testing

Some operators have mathematically correct but numerically unstable derivatives.

Example:

$$ \log(1 + e^x) $$

Large positive x may overflow.

Tests should include:

  • very large inputs
  • very small inputs
  • zero
  • infinities
  • NaNs

Example:

func TestSoftplusStability(t *testing.T) {
    xs := []float64{
        -100,
        -10,
        0,
        10,
        100,
    }

    for _, x := range xs {
        var tape Tape

        v := tape.Var(x)
        y := tape.Softplus(v)

        tape.Backward(y)

        if math.IsNaN(tape.Grads[v]) {
            t.Fatalf("nan gradient at %g", x)
        }
    }
}

Stability tests are often more important than ordinary correctness tests in real optimization systems.

Randomized Gradient Checks

A useful strategy:

  1. generate random inputs
  2. compare AD with finite differences
  3. repeat many times

Example:

func RandomGradCheck(
    f func(*Tape, Slot) Slot,
) {
    for i := 0; i < 1000; i++ {
        x := rand.Float64()*10 - 5

        fd := finiteDiffScalar(f, x)
        ad := adScalar(f, x)

        if RelativeError(fd, ad) > 1e-5 {
            panic("gradient mismatch")
        }
    }
}

Random testing explores edge cases humans often miss.

Useful distributions:

  • small values
  • large values
  • near-zero values
  • mixed signs
  • powers of two
  • denormalized ranges

Higher-Order Testing

Higher-order AD requires additional validation.

Example:

$$ f(x) = x^3 $$

Derivatives:

$$ f'(x) = 3x^2 $$

$$ f''(x) = 6x $$

A second-order system should be checked against known analytic results.

Higher-order systems often fail because:

  • saved gradients are incorrect
  • perturbation confusion occurs
  • nested tapes interfere
  • backward rules are not themselves differentiable

These bugs usually appear only in nested differentiation tests.

Stateful Operator Testing

Stateful operators require consistency between forward and backward.

Dropout example:

forward uses random mask M
backward must reuse M

Test:

func TestDropoutConsistency(t *testing.T) {
    seed := int64(42)

    forwardMask := GenerateMask(seed)
    backwardMask := GenerateMask(seed)

    if !Equal(forwardMask, backwardMask) {
        t.Fatalf("mask mismatch")
    }
}

Without this consistency, gradients become random noise.

Property Testing

Some derivative properties hold independently of specific values.

Example:

  • derivative of addition is linear
  • derivative of constant is zero
  • chain rule should compose correctly

Property testing checks algebraic invariants rather than individual numeric examples.

Example:

grad(a*x + b*y)
=
a*grad(x) + b*grad(y)

Property tests are useful because they scale across many input combinations.

Graph Integrity Testing

The graph itself should be validated.

Checks:

  • no dangling references
  • no cycles in acyclic graphs
  • valid slot indexes
  • backward only visits reachable nodes
  • gradients reset correctly
  • tape reuse preserves correctness

Minimal invariant:

Every backward rule reads valid primal values and writes valid gradient slots.

Violating this usually produces silent corruption.

Test Organization

A practical AD engine should separate tests by level.

Test type Purpose
Operator unit tests Validate local derivative rules
Graph tests Validate traversal and accumulation
Numerical tests Validate stability
Randomized tests Explore broad input space
Tensor shape tests Validate broadcasting and reductions
End-to-end optimization tests Validate realistic training behavior
Regression tests Prevent performance and correctness regressions

Operator-level tests should dominate. Most AD failures originate there.

End-to-End Validation

Eventually the engine should optimize a real objective successfully.

Simple example:

$$ f(x) = (x-3)^2 $$

Gradient descent:

x := 0.0

for i := 0; i < 100; i++ {
    var tape Tape

    xv := tape.Var(x)

    diff := tape.Sub(xv, tape.Const(3))
    loss := tape.Mul(diff, diff)

    tape.Backward(loss)

    x -= 0.1 * tape.Grads[xv]
}

Expected:

x approaches 3

This does not prove the engine is correct, but it validates that gradients are directionally useful.

Minimal Correctness Contract

A small reverse-mode engine should guarantee:

Backward computes gradients consistent with the chain rule and local operator derivatives, up to floating point arithmetic.

Testing exists to validate this contract operationally.

No finite collection of tests proves correctness completely. But a layered testing strategy:

  • operator tests
  • finite differences
  • randomized checks
  • graph invariants
  • end-to-end optimization

can make derivative failures rare, localizable, and reproducible.