Functional Languages

Functional programming languages provide a natural semantic foundation for automatic differentiation. Programs are expressed as compositions of functions, immutable values,...

Functional Languages

Functional programming languages provide a natural semantic foundation for automatic differentiation. Programs are expressed as compositions of functions, immutable values, and transformations over data. This aligns closely with the mathematical structure of differentiation.

In a pure functional language, a function behaves like a mathematical mapping:

$$ f : X \rightarrow Y $$

The absence of mutable state means evaluation can be analyzed as a composition graph. Automatic differentiation can therefore operate directly on the structure of computation.

Functional languages also emphasize higher-order functions, algebraic data types, and typed abstractions. These features allow AD systems to represent differentiation as a compositional program transformation rather than as an external runtime mechanism.

Pure Functions and Referential Transparency

A pure function always produces the same output for the same input and has no observable side effects.

square x = x * x

The expression

square 3

can always be replaced by 9.

This property is called referential transparency. It allows the compiler and AD system to reason about computations algebraically.

A derivative transformation can therefore be viewed as another pure function:

$$ D(f)(x, \dot{x}) = (f(x), J_f(x)\dot{x}) $$

Forward mode lifts a function into one that propagates tangent information alongside primal values.

Forward Mode as Function Lifting

Suppose a function is defined as

f x = (x + 1) * sin x

Forward mode transforms it into:

fForward (x, dx) =
  let
    v1  = x + 1
    dv1 = dx

    v2  = sin x
    dv2 = cos x * dx

    v3  = v1 * v2
    dv3 = dv1 * v2 + v1 * dv2
  in
    (v3, dv3)

Each operation is lifted from ordinary arithmetic into tangent arithmetic.

This transformation is compositional. If functions f and g are differentiable:

$$ D(f \circ g) = D(f) \circ D(g) $$

The derivative transformation itself becomes a structure-preserving mapping over programs.

Dual Numbers in Functional Languages

Functional languages often implement forward mode through dual numbers.

A dual number has the form:

$$ a + b\varepsilon $$

with

$$ \varepsilon^2 = 0 $$

$$ \varepsilon^2 = 0

data Dual a = Dual a a
instance Num a => Num (Dual a) where
  (Dual x dx) + (Dual y dy) =
    Dual (x + y) (dx + dy)
  (Dual x dx) * (Dual y dy) =
    Dual (x * y) (dx * y + x * dy)
map
foldl
filter
compose
sumSquares xs =
  foldl (+) 0 (map (\x -> x * x) xs)
f x = (y, backprop)
backprop dy = dx
f x = y
f x k = k y
x = expensiveComputation

$$

In Haskell-like notation:

where:

Field Meaning
First value Primal value
Second value Tangent value

Arithmetic operators are overloaded:

The product rule emerges directly from algebraic structure.

This approach is elegant because differentiation becomes ordinary evaluation over an extended numeric domain.

Higher-Order Functions

Functional languages rely heavily on higher-order functions.

Examples include:

An AD system must differentiate through these abstractions.

For example:

The differentiation system propagates derivatives through both:

  1. The lambda expression \x -> x * x
  2. The structural operators map and foldl

This creates a recursive view of differentiation:

Construct Differentiation behavior
Function composition Chain rule
Mapping Elementwise derivative propagation
Folding Sequential accumulation
Recursion Repeated derivative propagation

Functional structure therefore mirrors derivative structure.

Reverse Mode in Functional Languages

Reverse mode is more subtle.

The derivative must flow backward from outputs to inputs. This requires constructing a reverse dependency graph.

Functional languages often represent reverse mode using continuations, closures, or pullbacks.

Instead of returning only a value, a function returns:

  1. The primal result
  2. A backward propagation function

Conceptually:

where:

The backward function consumes an output adjoint and produces input adjoints.

This style appears in modern differentiable functional systems such as:

System Technique
Haskell AD libraries Continuation-passing or tapes
JAX Tracing + pullbacks
Zygote Source transformation
Swift AD Compiler-generated adjoints

Continuation-Passing Style

Continuation-passing style (CPS) provides a useful interpretation of reverse mode.

A continuation represents "the rest of the computation."

Instead of computing directly:

the program becomes:

where k is the continuation.

Reverse mode can then accumulate gradients by composing continuations backward.

This interpretation reveals a deep connection between:

Concept Interpretation
Forward mode Tangent propagation
Reverse mode Continuation propagation
Chain rule Composition of continuations

Many theoretical treatments of reverse AD are formulated in CPS terms.

Lazy Evaluation

Some functional languages use lazy evaluation.

Expressions are evaluated only when needed.

does not execute immediately.

This complicates automatic differentiation because the order of evaluation becomes implicit.

A reverse-mode system needs a stable execution order to build adjoint dependencies. Laziness can therefore introduce:

Problem Effect
Delayed evaluation Unclear tape structure
Shared thunks Repeated gradients
Infinite structures Non-terminating differentiation
Hidden allocations Memory unpredictability

Practical AD systems often restrict laziness or force evaluation during differentiation.

Typed Differentiation

Strong type systems are one of the most important advantages of functional languages.

The derivative operator itself can be typed.

For example:

$$ D : (a \rightarrow b) \rightarrow (a \rightarrow T(a,b))

grad :: (Vector -> Scalar) -> Vector -> Vector
v1 = add x 1
v2 = sin x
v3 = mul v1 v2

$$

where T(a,b) represents tangent structure.

In practice:

The type system ensures:

  • Gradients are only computed for differentiable functions
  • Tangent dimensions match primal dimensions
  • Invalid derivative compositions are rejected statically

Advanced systems encode even richer information:

Type feature Purpose
Linear types Control gradient accumulation
Effect systems Track mutation
Shape types Verify tensor dimensions
Differentiable constraints Restrict AD domains

Typed AD becomes increasingly important in large differentiable systems.

Functional Intermediate Representations

Many modern AD systems lower programs into a functional intermediate representation (IR).

The IR is often:

  • Immutable
  • SSA-like
  • Graph-oriented
  • Purely functional

A functional IR simplifies optimization because dependencies are explicit.

For example:

This representation behaves simultaneously as:

View Interpretation
Program Sequence of computations
Graph Dependency structure
Algebra Composition of maps

Modern machine learning compilers frequently use this model internally even if the surface language is imperative.

Category-Theoretic Interpretation

Functional languages encouraged categorical interpretations of automatic differentiation.

A differentiable function can be lifted into a morphism carrying derivative structure:

$$ f : X \rightarrow Y $$

becomes

$$ D(f) : X \times TX \rightarrow Y \times TY $$

where TX is the tangent space.

The derivative transformation preserves composition:

$$ D(f \circ g) = D(f) \circ D(g) $$

$$ D(f \circ g) = D(f) \circ D(g) $$

This makes AD resemble a functor over computational structure.

Functional languages made these abstractions practical because programs were already represented compositionally.

Advantages of Functional Languages for AD

Property Benefit
Pure functions Clear dependency structure
Immutability Stable reverse-mode semantics
Higher-order functions Compositional derivative operators
Strong typing Static correctness guarantees
Algebraic structure Natural chain-rule interpretation
Functional IRs Easier optimization and transformation

These properties strongly influenced the design of modern differentiable programming systems.

Limitations

Functional languages also face practical difficulties.

Persistent immutable structures can increase allocation pressure. Reverse mode often requires storing intermediate states, producing large memory overhead. Lazy evaluation complicates execution ordering. Pure semantics may conflict with efficient GPU kernels or in-place tensor updates.

As a result, practical systems often introduce controlled impurity:

Mechanism Purpose
Mutation under the hood Efficient tensor updates
Linear types Safe destructive operations
Runtime tracing Dynamic graph construction
Compiler lowering Hardware-efficient execution

Modern differentiable systems therefore combine functional semantics with low-level optimization techniques.

Functional Languages and Modern AD

Many ideas from functional programming now appear across modern AD frameworks:

Modern system Functional influence
JAX Pure functional transformations
Zygote Source-to-source differentiation
Swift AD Typed differentiable functions
Enzyme Compiler-level transformation
MLIR Functional graph IR structure

Even frameworks written in imperative languages increasingly adopt functional internal representations because differentiation fundamentally operates on compositional structure.

Functional programming provided one of the clearest conceptual models for automatic differentiation: differentiation as a structure-preserving transformation over pure computations.