Taylor Mode AD

Taylor mode automatic differentiation computes derivatives by propagating truncated Taylor series through a program.

Instead of tracking only values and first derivatives, Taylor mode tracks entire local polynomial expansions.

For a scalar function

$$ f(x), $$

expanded around point $x_0$,

$$ f(x_0 + h) = f(x_0) + f'(x_0)h + \frac{f''(x_0)}{2!}h^2 + \frac{f^{(3)}(x_0)}{3!}h^3 + \cdots $$

Taylor mode computes these coefficients automatically.

This gives direct access to higher-order derivatives.

Basic Idea

Forward mode propagates first-order tangent information:

$$ x + \epsilon v. $$

Taylor mode propagates higher-order polynomial structure:

$$ x(t) = x_0 + x_1 t + x_2 t^2 + \cdots + x_p t^p. $$

Each coefficient represents derivative information.

Typically:

$$ x_k = \frac{1}{k!} \frac{d^k x}{dt^k}(0). $$

The program is evaluated symbolically over truncated power series rather than over ordinary numbers.

Truncated Polynomial Algebra

Taylor mode works in the algebra:

$$ \mathbb{R}[t]/(t^{p+1}), $$

the ring of polynomials truncated at degree $p$.

A Taylor object has form:

$$ a_0 + a_1 t + a_2 t^2 + \cdots + a_p t^p. $$

Addition is coefficient-wise:

$$ (a+b)_k = a_k + b_k. $$

Multiplication uses convolution:

$$ (ab)k = \sum{i=0}^k a_i b_{k-i}. $$

Higher-order derivatives emerge from polynomial coefficient propagation.

Example: Polynomial Function

Let

$$ f(x) = x^3. $$

Suppose:

$$ x(t) = x_0 + x_1 t + x_2 t^2. $$

Expand:

$$ f(x(t)) = (x_0 + x_1 t + x_2 t^2)^3. $$

Keeping terms up to degree 2:

$$= x_0^3 + 3x_0^2x_1 t + (3x_0^2x_2 + 3x_0x_1^2)t^2.$$

The coefficients are:

Degree	Coefficient
$t^0$	$x_0^3$
$t^1$	$3x_0^2x_1$
$t^2$	$3x_0^2x_2 + 3x_0x_1^2$

The second-order term contains the quadratic interaction naturally.

No explicit Hessian rules are required. The polynomial algebra produces the correct structure automatically.

Recovering Derivatives

If we choose:

$$ x(t) = x_0 + t, $$

then:

$$ f(x_0+t) = \sum_{k=0}^p \frac{f^{(k)}(x_0)}{k!} t^k. $$

Therefore:

Coefficient	Meaning
$t^0$	function value
$t^1$	first derivative
$t^2$	second derivative divided by $2!$
$t^3$	third derivative divided by $3!$

and so on.

Taylor mode computes all derivatives up to order $p$ simultaneously.

Comparison with Nested Forward Mode

Nested forward mode computes higher derivatives by nesting dual numbers repeatedly.

Taylor mode computes all orders together in one algebraic object.

Method	Representation
nested duals	repeated first-order structure
Taylor mode	truncated polynomial
hyper-duals	structured second-order dual algebra
symbolic differentiation	explicit algebraic manipulation

Taylor mode avoids some nesting complexity and perturbation tagging issues.

Elementary Functions

Taylor mode requires propagation rules for elementary functions.

Suppose:

$$ y(t) = \exp(x(t)). $$

$$ x(t) = \sum_{k=0}^p x_k t^k, $$

then

$$ y(t) = \sum_{k=0}^p y_k t^k $$

must satisfy:

$$ y'(t) = y(t)x'(t). $$

Matching coefficients gives recurrence relations for the Taylor coefficients.

Similarly:

Function	Defining relation
$\exp x$	$y' = yx'$
$\sin x$	$y' = \cos(x)x'$
$\cos x$	$y' = -\sin(x)x'$
$\log x$	$y'x = x'$

Taylor mode implementations derive coefficient recurrences from these identities.

Recurrence Computation

Taylor coefficients are usually computed recursively.

Suppose:

$$ x(t) = \sum_{k=0}^p x_k t^k. $$

For multiplication:

$$ z(t) = x(t)y(t), $$

the coefficient recurrence is:

$$ z_k = \sum_{i=0}^k x_i y_{k-i}. $$

This is discrete convolution.

For division:

$$ z(t) = \frac{x(t)}{y(t)}, $$

coefficients satisfy:

$$ x_k = \sum_{i=0}^k y_i z_{k-i}. $$

Solving recursively gives $z_k$.

Taylor mode therefore becomes a coefficient propagation system.

Directional Higher Derivatives

Taylor mode naturally computes directional higher derivatives.

Suppose:

$$ x(t) = x_0 + tv. $$

Then:

$$ f(x_0 + tv) = \sum_{k=0}^p \frac{1}{k!} D^k f(x_0)[v,\ldots,v] t^k. $$

The coefficients correspond to repeated directional derivatives.

This is valuable because full higher-order derivative tensors are usually too large to materialize.

Taylor mode computes directional expansions efficiently.

Example: Second Directional Derivative

Let

$$ f(x,y)=x^2y. $$

Choose direction:

$$ v=(v_1,v_2). $$

Define:

$$ x(t)=x+tv_1, \quad y(t)=y+tv_2. $$

Then:

$$ f(x(t),y(t)) = (x+tv_1)^2(y+tv_2). $$

Expand:

$$= x^2y + (2xyv_1 + x^2v_2)t + (yv_1^2 + 2xv_1v_2)t^2 + v_1^2v_2 t^3.$$

The coefficient of $t^2$ gives:

$$ \frac{1}{2} D^2f(x,y)[v,v]. $$

Taylor mode extracts this automatically.

Complexity

Suppose:

Symbol	Meaning
$C$	cost of evaluating original function
$p$	Taylor order

Then Taylor mode often costs roughly:

$$ O(p^2 C) $$

because coefficient propagation involves convolution-like operations across orders.

More optimized methods can reduce some operations using FFT-like techniques or specialized recurrences.

Still, high-order derivatives become expensive quickly.

Memory Structure

Taylor mode stores coefficients for every intermediate variable.

If the original program has $N$ intermediates and Taylor order $p$, storage is roughly:

$$ O(pN). $$

This is usually better than explicitly materializing high-order derivative tensors.

The representation is compact because it stores directional polynomial structure rather than full multidimensional derivative arrays.

Taylor Mode vs Reverse Mode

Taylor mode and reverse mode have different strengths.

Property	Taylor mode	Reverse mode
gradients	moderate	excellent
high-order directional derivatives	excellent	difficult
full Hessians	possible	possible
high-order tensors	structured	memory-heavy
implementation complexity	algebraic	tape-heavy
nested differentiation	simpler	harder
scalar-output optimization	weaker	strong

Taylor mode is particularly attractive for moderate-order derivatives and local series expansions.

Taylor Models

Taylor mode connects naturally to Taylor models used in validated numerics.

A Taylor model represents a function as:

$$ P(x) + R(x), $$

where:

Component	Meaning
$P(x)$	polynomial approximation
$R(x)$	rigorous remainder bound

This allows:

Capability	Purpose
interval arithmetic	rigorous bounds
verified ODE solving	guaranteed solutions
uncertainty propagation	bounded error
validated optimization	correctness guarantees

Taylor-mode AD provides the polynomial component automatically.

Differential Equation Solvers

Taylor mode is especially useful for ODE solvers.

Suppose:

$$ \frac{dx}{dt} = f(x). $$

A Taylor-series integrator computes:

$$ x(t+h) = x(t) + hx'(t) + \frac{h^2}{2!}x''(t) + \cdots $$

Automatic differentiation computes the required derivatives recursively.

This yields high-order integrators with strong local accuracy.

Taylor integrators are widely used in celestial mechanics, validated numerics, and stiff dynamical systems.

Sparse Higher-Order Structure

Full higher-order derivative tensors are enormous.

For dimension $n$:

Order	Tensor size
1	$n$
2	$n^2$
3	$n^3$
4	$n^4$

Taylor mode avoids constructing these directly.

Instead it propagates compressed directional structure through coefficient series.

This is often the only practical way to obtain high-order derivative information.

Compiler Representation

A compiler-based Taylor-mode system may lower each variable into coefficient arrays:

x[0] = primal
x[1] = first-order coefficient
x[2] = second-order coefficient
...
x[p] = pth-order coefficient

Operations become coefficient recurrences.

For multiplication:

for k in 0..p:
    z[k] = sum(x[i] * y[k-i] for i in 0..k)

The transformed program resembles polynomial arithmetic.

Numerical Stability

High-order Taylor coefficients can become unstable.

Common problems include:

Problem	Cause
coefficient explosion	factorial growth
cancellation	alternating signs
truncation instability	insufficient order
floating point amplification	repeated convolutions
stiff dynamics	rapidly varying derivatives

Scaling and normalization techniques are often required.

Practical Uses

Taylor mode appears in:

Domain	Usage
celestial mechanics	high-order integration
validated numerics	rigorous bounds
dynamical systems	local flow expansion
chaos analysis	sensitivity growth
PDE solvers	local expansions
perturbation methods	asymptotic analysis
scientific computing	repeated directional derivatives

It is less common in large neural network training because reverse-mode gradients dominate that workload.

Conceptual Interpretation

Forward mode propagates tangent vectors.

Taylor mode propagates local polynomial geometry.

Instead of asking:

$$ \text{How does the function change infinitesimally?} $$

Taylor mode asks:

$$ \text{What local power series does the function induce?} $$

The program becomes a transformation over truncated polynomial algebras rather than over ordinary scalar values.