AD in Julia

Julia was designed for high-performance technical computing. It combines interactive syntax with a compiler capable of specializing code aggressively based on types. This...

AD in Julia

Julia was designed for high-performance technical computing. It combines interactive syntax with a compiler capable of specializing code aggressively based on types. This makes it a strong environment for automatic differentiation because AD systems can operate close to ordinary mathematical code while still generating optimized machine code.

Unlike Python, where tensor operations are often dispatched into external runtimes, Julia programs are commonly written directly in the host language itself. Numerical kernels, array operations, loops, and scientific algorithms are all expressed in Julia. As a result, Julia AD systems frequently differentiate ordinary language-level programs rather than only tensor graphs.

Multiple Dispatch and Generic Programming

Julia’s core abstraction mechanism is multiple dispatch.

Functions specialize on argument types:

f(x) = (x + 1) * sin(x)

The same function can run on:

  • Float64
  • Complex numbers
  • Dual numbers
  • Static arrays
  • GPU tensors
  • Reverse-mode tracked values

This makes AD integration natural. An AD library introduces new numeric types, and Julia dispatches specialized methods automatically.

Forward Mode with Dual Numbers

Forward mode in Julia is commonly implemented with dual numbers.

A simplified representation:

struct Dual{T}
    x::T
    dx::T
end

Arithmetic propagates tangent values:

Base.:+(a::Dual, b::Dual) =
    Dual(a.x + b.x, a.dx + b.dx)

Base.:*(a::Dual, b::Dual) =
    Dual(
        a.x * b.x,
        a.dx * b.x + a.x * b.dx
    )

Elementary functions define derivative propagation:

Base.sin(a::Dual) =
    Dual(sin(a.x), cos(a.x) * a.dx)

A generic Julia function automatically becomes differentiable:

f(x) = (x + 1) * sin(x)

Evaluating:

f(Dual(2.0, 1.0))

returns both the primal value and derivative.

This works because Julia numerical code is usually written generically rather than hardcoding concrete scalar types.

Type Specialization

Julia compiles methods specialized to concrete argument types.

If:

f(x::Float64)

and:

f(x::Dual{Float64})

are called, the compiler generates separate optimized machine code for each.

This gives several advantages for AD:

Benefit Explanation
Static specialization Removes dynamic dispatch overhead
Inlining Derivative propagation becomes efficient
SIMD/vectorization Works on differentiated code
Loop optimization AD inside loops remains fast
Generic reuse Same source works for many numeric domains

Forward-mode dual arithmetic therefore performs well in Julia compared to many dynamic languages.

Reverse Mode in Julia

Reverse mode is essential for machine learning and optimization workloads with many parameters.

Julia reverse-mode systems usually construct pullbacks or tapes representing adjoint propagation.

For:

y = f(x)

the AD system computes:

  1. The primal output
  2. A backward propagation function

Conceptually:

(y, back) = pullback(f, x)

where:

dx = back(dy)

propagates output adjoints backward into input adjoints.

This functional pullback representation became central in modern Julia AD systems.

Source-to-Source AD

One influential Julia approach is source-to-source differentiation.

Instead of tracing runtime execution, the AD system transforms Julia intermediate representations into derivative code.

A simplified primal function:

f(x) = x * sin(x)

may conceptually transform into:

function f_pullback(x)
    y1 = sin(x)
    y2 = x * y1

    function back(dy)
        dx = dy * y1 + dy * x * cos(x)
        return dx
    end

    return y2, back
end

The derivative is represented directly as Julia code.

Advantages include:

Property Benefit
Native language semantics AD works on ordinary Julia code
Compiler visibility Optimizer sees derivative code
Control-flow support Loops and branches differentiate naturally
Higher-order support Derivative code itself is Julia
Composability AD interacts with generic dispatch

This approach differs from tensor-only graph systems.

Julia Intermediate Representations

Julia exposes multiple compiler IR stages.

IR stage Role
AST Parsed syntax
Lowered IR Simplified control flow
Typed IR Type-inferred representation
LLVM IR Low-level optimized representation
Machine code Final compiled execution

AD systems can operate at different levels.

Level Characteristics
AST transformation Simple but limited semantic information
Lowered IR Explicit control flow
Typed IR Full type information
LLVM IR Hardware-level optimization

Many Julia AD tools operate around typed IR because it preserves language semantics while exposing compiler information.

Control Flow

Julia AD systems typically support ordinary control flow directly.

Example:

function f(x)
    y = 0.0

    for i in eachindex(x)
        if x[i] > 0
            y += x[i]^2
        else
            y -= x[i]
        end
    end

    return y
end

The differentiated program follows the executed path.

This is important because Julia scientific code frequently contains:

  • Adaptive solvers
  • Iterative algorithms
  • Dynamic branching
  • Recursive methods
  • Simulation loops

The AD system must therefore differentiate general programs, not only static computation graphs.

Mutation and Arrays

Mutation is one of the difficult areas for Julia AD.

Julia arrays are mutable:

x[1] = 5

Reverse mode must define how adjoints behave through updates.

Problems include:

Problem Example
Aliasing Multiple references to same array
In-place updates Overwriting needed primal values
Broadcasting mutation Elementwise adjoint accumulation
Views/slices Shared memory regions
Sparse updates Scatter-style reverse rules

Some Julia AD systems restrict mutation or transform mutating code into functional equivalents internally.

Others provide specialized adjoint rules for common mutation patterns.

Broadcast Fusion

Julia supports fused broadcasting:

y .= sin.(x) .+ x .* x

This lowers into a fused loop rather than allocating intermediates.

For AD, this matters because naive differentiation may accidentally break fusion and introduce temporary arrays.

A high-performance Julia AD system should preserve:

  • Loop fusion
  • SIMD vectorization
  • Allocation elimination
  • Static dispatch

Otherwise the derivative program may be much slower than the primal program.

Higher-Order Differentiation

Julia’s compositional design makes nested AD natural.

Example:

gradient(x -> gradient(f, x)[1], x)

This computes second derivatives.

Because derivative transformations themselves produce Julia functions, higher-order differentiation becomes recursive transformation over ordinary code.

This supports:

Operation Use
Hessians Curvature methods
Hessian-vector products Large-scale optimization
Meta-gradients Differentiating optimizers
Implicit differentiation Solver sensitivities
Differential equations Sensitivity analysis

The challenge is perturbation confusion and tape interaction across nested passes.

AD for Scientific Computing

Julia is heavily used in scientific computing, so its AD ecosystem emphasizes more than neural networks.

Typical workloads include:

Domain AD use
Differential equations Sensitivity analysis
Optimization Gradients and Hessians
PDE solvers Parameter inference
Probabilistic programming Gradient-based inference
Control systems Trajectory optimization
Physics simulation Differentiable dynamics
Computational biology Parameter fitting

This shaped Julia AD systems toward differentiating general numerical programs rather than only tensor layers.

Differentiating Through Solvers

Scientific code often calls iterative solvers.

Example:

x = solve(A, b)

Naively differentiating the implementation may unroll every iteration. This can be expensive and numerically unstable.

Julia AD frameworks increasingly support implicit differentiation:

$$ A x = b $$

Differentiate both sides:

$$ dA , x + A , dx = db $$

and solve directly for dx.

$$ A x = b

loss(params)

$$

This avoids differentiating every internal solver iteration.

GPU Support

Julia supports GPU execution through native language tooling.

A tensor operation may compile directly into GPU kernels.

AD systems therefore must support:

Requirement Purpose
GPU-compatible pullbacks Reverse mode on device
Kernel differentiation Gradients through GPU code
Memory synchronization Host-device consistency
Broadcast lowering Efficient fused kernels
Mixed precision Accelerator efficiency

Since Julia kernels are written in Julia itself, AD can sometimes operate directly on GPU-level code.

Compiler Integration

Julia’s compiler visibility enables unusually deep AD integration.

The compiler can optimize differentiated code using ordinary optimization passes:

  • Dead code elimination
  • Inlining
  • Constant propagation
  • Loop optimization
  • Escape analysis
  • Allocation removal

This is a major advantage over black-box runtime tracing systems.

The derivative program becomes an ordinary compilable Julia program.

Major Julia AD Systems

Representative Julia AD systems include:

System Main style
ForwardDiff.jl Dual-number forward mode
ReverseDiff.jl Tape-based reverse mode
Zygote.jl Source-to-source reverse mode
Tracker.jl Dynamic graph reverse mode
Enzyme.jl LLVM IR differentiation
Diffractor.jl Compiler-integrated experimentation

These systems explore different tradeoffs between:

  • Performance
  • Mutation support
  • Compiler integration
  • GPU compatibility
  • Higher-order differentiation
  • Language coverage

Performance Characteristics

Julia AD systems often achieve strong performance because:

Feature Effect
Specialization Efficient derivative kernels
Generic numerical code AD integrates naturally
Compiler optimization Derivative code optimized normally
Multiple dispatch Clean derivative extensibility
Low-level access Efficient tensor and solver integration

However, performance still depends heavily on allocation patterns, mutation handling, and compiler inference quality.

Design Philosophy

Julia AD systems tend to treat differentiation as a compiler and language problem rather than only a runtime graph problem.

The important idea is that numerical programs themselves should remain ordinary Julia programs.

A user writes:

and the AD system transforms, specializes, optimizes, and compiles the derivative computation automatically.

This aligns closely with the broader goal of differentiable programming: gradients should emerge from ordinary programs rather than requiring a separate graph language or restricted execution environment.