Effect Systems and Mutation

Automatic differentiation is easiest to define for pure functions. A pure function behaves like a mathematical mapping: it consumes inputs, produces outputs, and has no...

Effect Systems and Mutation

Automatic differentiation is easiest to define for pure functions. A pure function behaves like a mathematical mapping: it consumes inputs, produces outputs, and has no observable side effects. Real programs are rarely this simple. They mutate arrays, allocate memory, read state, launch kernels, perform I/O, sample random values, and synchronize across devices.

An effect system describes these behaviors explicitly. It tracks what a computation does in addition to what value it returns. This becomes important for automatic differentiation because effects change the meaning of the derivative transformation.

Pure Computation and the Chain Rule

Consider a pure function:

$$ f : X \rightarrow Y $$

The derivative transformation:

$$ D(f) $$

can be defined compositionally.

If:

$$ f = h \circ g $$

then:

$$ D(f) = D(h) \circ D(g) $$

$$ D(f)=D(h) \circ D(g)

x = [1.0, 2.0]
y = x
x[0] = 5.0
f : A -> B
f : A -> B [Pure]
g : A -> B [State]
x = Dual(2.0, 1.0)
x = x * x
x = 3
x = x * 2
x = x + 1
store old x = 3
x = x * 2
store old x = 6
x = x + 1
x[i] += y
x2 = update(x, i, x[i] + y)
x += y
x : Linear<Tensor>
w -= lr * grad
if x > 0:
    state += 1
else:
    state -= 1
z = sample_normal(mu, sigma)
f : A -> B [Random]
print(loss)
save_model(weights)
parallel_for(...)

$$

This works because the computation has no hidden state. Every dependency is visible in the expression graph.

Mutation breaks this simplicity because values can change over time.

Mutation Changes Semantics

Suppose:

Now y also changes because x and y alias the same storage.

A derivative system must answer:

  • Which version of x does the gradient refer to?
  • Was the original value needed later?
  • Should the reverse pass reconstruct old state?
  • Is the mutation observable outside the differentiated region?

These are semantic questions, not only implementation details.

Effects as Typed Behavior

An effect system augments function types with behavioral information.

Instead of:

the language may track:

or:

Possible effects include:

Effect Meaning
Pure No observable side effects
State Mutation of memory
IO External interaction
Random Random number generation
Exception Non-local control flow
Async Concurrent execution
Device GPU or accelerator execution

Differentiation semantics depend strongly on these effects.

Mutation in Forward Mode

Forward mode is relatively tolerant of mutation because derivatives propagate alongside primal execution.

Example:

The primal and tangent update together.

Problems appear when:

  • Aliasing exists
  • Mutation order matters
  • Shared buffers are updated
  • State escapes the local computation

Even then, forward mode usually follows ordinary execution naturally.

Mutation in Reverse Mode

Reverse mode is harder because the backward pass occurs after forward execution.

Suppose:

The reverse pass must propagate gradients through earlier states of x.

Conceptually:

Step Primal value
Initial 3
After multiply 6
After add 7

Backward propagation needs access to the intermediate value 6.

If the old value was overwritten, reverse mode may fail unless the system saved enough information.

The State Reconstruction Problem

Reverse mode often needs to reconstruct earlier program state.

This creates the central problem of mutation-aware AD:

Requirement Reason
Preserve overwritten values Needed for derivative rules
Preserve execution order Reverse pass depends on it
Preserve aliasing semantics Shared state affects gradients
Preserve control decisions Branches determine path

Several implementation strategies exist.

Tape-Based State Saving

A tape records operations and saved values during the forward pass.

Example:

The backward pass replays derivative rules using saved intermediates.

Advantages:

Benefit Explanation
Correct reverse semantics Exact replay
Dynamic control flow Works naturally
Flexible mutation Handles arbitrary updates

Costs:

Cost Explanation
Memory overhead Stores many intermediates
Allocation pressure Tape growth
Runtime overhead Recording operations

Large tensor systems can spend more memory on saved activations than on parameters themselves.

Functionalization

Functionalization rewrites mutation into immutable updates.

Instead of:

the system internally rewrites:

This restores functional semantics.

Advantages:

Benefit Explanation
Simpler AD rules Pure transformations
Easier optimization Explicit dataflow
Safer parallelism No hidden state

Disadvantages:

Cost Explanation
Extra allocations Immutable copies
Loss of locality More memory movement
Potential slowdown Large tensor duplication

Modern systems often functionalize internally while exposing mutable syntax externally.

Versioned Tensors

Some frameworks track tensor versions.

Example:

increments an internal version counter.

If reverse mode later needs the old tensor value, the system checks whether the tensor was modified unsafely.

This detects invalid gradient computations.

Version tracking is common in eager tensor frameworks because it preserves user-facing imperative syntax while protecting reverse-mode correctness.

Linear and Ownership Types

Ownership systems help manage mutation safely.

A linearly owned value has exactly one owner:

If ownership is unique, mutation becomes safer because no aliases exist.

Advantages for AD:

Property Benefit
No hidden aliases Easier adjoint reasoning
Safe destructive updates Efficient kernels
Better memory reuse Reduced allocations
Predictable lifetimes Easier checkpointing

Languages with ownership systems can optimize reverse mode aggressively because they know when values can be safely reused or overwritten.

Mutation in Tensor Systems

Tensor frameworks rely heavily on mutation internally even when the user API appears functional.

Examples include:

  • In-place GPU kernels
  • Gradient accumulation buffers
  • Optimizer updates
  • Memory pooling
  • Buffer reuse

An optimizer step:

mutates parameter storage.

The AD system must ensure this occurs outside the differentiated region or treat it explicitly as stateful computation.

Effects and Control Flow

Effects interact with branching and loops.

Example:

The backward pass depends on:

  • Which branch executed
  • What state existed before mutation
  • Whether state affects future computation

Differentiating such programs requires preserving both control-flow structure and state transitions.

Randomness and Probabilistic Effects

Random sampling is another important effect.

Example:

Sampling is not ordinarily differentiable because the output changes discontinuously with the random seed.

Several approaches exist:

Method Idea
Reparameterization Move randomness outside differentiable path
Score-function estimators Differentiate expectation instead
Straight-through estimators Approximate backward behavior
Deterministic replay Reuse random seeds

An effect system can track random behavior explicitly:

This prevents the compiler from incorrectly treating stochastic computations as pure.

I/O and External State

I/O effects usually lie outside derivative semantics.

Example:

These operations do not contribute mathematical derivatives.

However, external state can still influence computation:

Effect Problem
Reading files Input changes across executions
Global variables Hidden dependencies
Device synchronization Timing-sensitive behavior
Distributed state Non-local mutation

Differentiable regions are therefore often restricted to pure numerical subgraphs.

Effect Systems for AD Compilers

An AD compiler benefits from explicit effect tracking.

The compiler can answer:

Question Why it matters
Is this function pure? Safe for source transformation
Does it mutate memory? Requires state preservation
Does randomness occur? Requires stochastic semantics
Can operations reorder? Important for reverse correctness
Can buffers be reused? Memory optimization

Effect-aware optimization is increasingly important in large differentiable systems.

Checkpointing and Effects

Checkpointing trades memory for recomputation.

Instead of storing all intermediates:

  1. Save selected checkpoints.
  2. Recompute missing values during reverse mode.

This works best for pure computations.

If mutation or randomness occurs, recomputation may not reproduce the original execution unless:

  • State is restored
  • Random seeds are replayed
  • External effects are isolated

Effects therefore constrain checkpointing strategies.

Parallelism and Concurrency

Concurrent mutation complicates AD further.

Example:

Multiple threads may update shared adjoints.

The AD system must define:

Issue Requirement
Adjoint accumulation Associative reduction
Ordering Deterministic or nondeterministic
Synchronization Correct race-free updates
Distributed state Cross-device aggregation

Effect systems can describe concurrent regions explicitly.

Referential Transparency Recovery

Many AD systems attempt to recover functional semantics internally even if the surface language is imperative.

This often involves:

Transformation Purpose
Functionalization Remove mutation
SSA conversion Make dataflow explicit
Alias analysis Track shared state
Effect isolation Separate differentiable regions
Purity inference Enable optimization

The resulting IR behaves more like a pure functional program even when the original source was imperative.

Why Effect Systems Matter

Without explicit effect handling, AD systems become difficult to reason about.

Incorrect handling of mutation can produce:

  • Wrong gradients
  • Silent corruption
  • Invalid checkpoint recomputation
  • Nondeterministic derivatives
  • Excessive memory retention

As differentiable programming expands into systems software, databases, simulators, robotics, and distributed infrastructure, these issues become central.

Broader Perspective

Effect systems and mutation handling ultimately determine whether differentiation behaves as a mathematically meaningful transformation or merely as a runtime heuristic.

Pure functional semantics make differentiation elegant:

$$ D(f \circ g)

D(f) \circ D(g) $$

Imperative programs introduce time, state, aliasing, and side effects. A differentiable programming language must therefore explain not only how values compute, but also how state evolves across execution.

Effect systems provide the framework for expressing those semantics explicitly.