Differentiation of Large Stateful Systems

Automatic differentiation works naturally on pure mathematical functions:

Automatic differentiation works naturally on pure mathematical functions:

$$ y = f(x). $$

The function receives inputs, computes outputs, and produces no persistent side effects. Most theoretical treatments of AD assume this model.

Large real systems rarely behave this way.

Operating systems, databases, distributed services, simulations, robotics stacks, game engines, browsers, network services, and long-running machine learning pipelines all contain substantial mutable state. Their behavior depends not only on explicit inputs, but also on internal memory, caches, queues, synchronization, randomness, external events, and temporal ordering.

Differentiating such systems remains one of the major open problems in automatic differentiation.

What Is a Stateful System?

A stateful system maintains internal state across time.

Instead of:

$$ y = f(x), $$

the system behaves more like:

$$ (s_{t+1}, y_t) = F(s_t, x_t). $$

Where:

Symbol Meaning
$s_t$ internal state at time $t$
$x_t$ external input
$y_t$ output
$F$ transition function

Examples include:

System Internal state
database pages, indexes, transactions
web server connection pools, caches
robot controller sensor history, control state
physics engine world state
distributed system replicated logs, consensus state
neural network training loop optimizer moments, parameter buffers
browser engine DOM tree, rendering state
operating system scheduler queues, memory tables

The derivative now depends on the evolution of state over time, not only on one isolated function evaluation.

Reverse Mode Across Time

Suppose a system evolves for $T$ steps:

$$ s_{t+1} = F(s_t, x_t). $$

A scalar loss depends on the final state:

$$ L = \ell(s_T). $$

Reverse mode computes gradients backward through time:

$$ \bar{s}t = \bar{s}{t+1} \frac{\partial s_{t+1}}{\partial s_t}. $$

This resembles recurrent neural network backpropagation.

However, large stateful systems introduce much harder conditions:

Neural network Large stateful system
mostly tensor operations heterogeneous operations
explicit training graph implicit evolving state
bounded semantics external side effects
controlled primitives arbitrary program behavior
accelerator-oriented distributed heterogeneous runtime

The problem becomes less like differentiating a tensor graph and more like differentiating a live computational process.

Mutation

Mutation is the first major obstacle.

Consider:

buffer[i] += x

The reverse pass may require:

  • previous value of buffer[i]
  • index i
  • ordering of writes
  • alias relationships

If the same memory is modified repeatedly:

for i in range(n):
    state = update(state, data[i])

reverse mode may need all intermediate states:

$$ s_0, s_1, s_2, \ldots, s_T. $$

Storing every state can become impossible.

This creates the central memory problem of stateful reverse mode.

Temporal Depth

Long-running systems can evolve for millions or billions of state transitions.

Examples:

System Temporal scale
database server months
robot controller continuous
online recommender indefinite
distributed cache persistent
simulation millions of timesteps

Reverse mode over an indefinitely evolving process is fundamentally difficult because backward propagation requires access to earlier computational history.

A naive tape model becomes infeasible.

Checkpointing Limits

Checkpointing reduces storage by saving selected states and recomputing others.

For a short numerical simulation this works well.

Large stateful systems introduce new problems:

Issue Consequence
nondeterminism recomputation may diverge
external events replay impossible
concurrency ordering may differ
random scheduling execution path unstable
network timing state evolution changes

Recomputation assumes the system can replay identically. Many real systems cannot guarantee this.

External Side Effects

Stateful systems interact with the outside world.

Examples:

send_packet(data)
write_to_disk(page)
emit_event(msg)

These operations are not ordinary differentiable transformations.

Questions immediately appear:

Question Difficulty
what is the derivative of a disk write? unclear semantics
what is the derivative of network latency? stochastic external process
what is the derivative of cache eviction? discontinuous policy
what is the derivative of scheduler behavior? nondeterministic execution

Most AD systems avoid these questions by treating such operations as nondifferentiable boundaries.

This is acceptable for neural networks. It is not sufficient for general differentiable systems.

Hidden State

Large systems often contain hidden state not visible in the user program.

Examples include:

  • allocator metadata
  • kernel page cache
  • GPU runtime queues
  • filesystem buffers
  • thread scheduling state
  • hardware prefetch behavior

A system’s behavior may depend on these hidden states even if the program source does not expose them.

This creates a semantic problem:

$$ \text{What exactly is the function being differentiated?} $$

The observed output may depend on implementation details outside the formal computational graph.

Differentiable Databases

Databases illustrate the challenge clearly.

A query system contains:

Component Stateful behavior
buffer manager page cache
optimizer adaptive plans
transaction manager locks and MVCC
storage engine mutable disk layout
replication distributed logs

Suppose one wishes to optimize latency:

$$ L = \text{query latency}. $$

Latency depends on:

  • cache state
  • branch prediction
  • disk layout
  • concurrent traffic
  • scheduler behavior
  • network contention

The resulting function is noisy, discontinuous, and history-dependent.

Differentiating such systems requires combining AD with stochastic modeling, systems instrumentation, and probabilistic reasoning.

Distributed Systems

Distributed systems create even harder problems.

A distributed system evolves under:

$$ s_{t+1} = F(s_t, x_t, m_t, \sigma_t), $$

where:

Symbol Meaning
$m_t$ message arrivals
$\sigma_t$ scheduling decisions

The system state depends on message timing and asynchronous interleavings.

Small perturbations may change:

  • ordering of events
  • consensus leader election
  • retry behavior
  • timeout triggers
  • partition recovery

The execution path may bifurcate discontinuously.

Classical reverse mode assumes stable local derivatives. Distributed systems often violate that assumption.

Differentiable Simulators

Physics simulators are among the most successful large stateful differentiable systems so far.

A simulator evolves:

$$ s_{t+1} = \Phi(s_t, u_t). $$

Reverse mode propagates adjoints through time integration.

Differentiable simulators work because:

Property Benefit
deterministic dynamics replay possible
explicit state visible transitions
mathematical structure known local derivatives
controlled operations stable semantics

Even here, problems appear:

  • collisions introduce discontinuities
  • contact solvers are iterative
  • friction models may be nonsmooth
  • adaptive timesteps alter trajectories

Large engineering simulators remain difficult to differentiate robustly.

Concurrency

Concurrency fundamentally complicates AD.

Suppose two threads update shared state:

thread A: x += 1
thread B: x *= 2

The final state depends on execution ordering.

Reverse mode would need to know:

  • exact interleaving
  • memory visibility rules
  • synchronization ordering
  • atomic semantics

The derivative is therefore partly a property of the scheduler.

Most AD systems implicitly assume sequential execution. Extending reverse mode to concurrent systems remains largely unresolved.

Sparse Credit Assignment

Large stateful systems often contain delayed effects.

A decision made at time $t$ may affect outcomes at time $t + 10^6$.

This creates a long-horizon credit assignment problem:

$$ \frac{\partial L}{\partial s_t}. $$

Examples:

Domain Long-range effect
database caching later query latency
network routing future congestion
compiler optimization downstream runtime
operating systems future scheduling behavior
autonomous systems delayed task success

Backpropagating through extremely long horizons is computationally and numerically unstable.

Hybrid Symbolic and Learned Systems

Modern systems increasingly combine:

  • neural components
  • rule engines
  • databases
  • search systems
  • planners
  • simulators

Example:

retrieval
  -> ranking
  -> planning
  -> execution
  -> environment interaction

Some components are differentiable. Others are discrete or symbolic.

Differentiating across such hybrid boundaries remains an open research area.

Semantic Boundaries

A deeper issue is semantic meaning.

For a tensor operation:

$$ y = x^2, $$

the derivative is mathematically clear.

For a database transaction scheduler:

$$ y = \text{latency}(x), $$

what exactly should the derivative mean?

Possible interpretations include:

Interpretation Meaning
local sensitivity infinitesimal perturbation
expected effect stochastic average
policy gradient optimization signal
finite perturbation empirical response
relaxed continuous model smoothed approximation

Different interpretations lead to different "gradients."

The problem becomes partly philosophical: what does differentiability mean for complex computational systems?

Possible Directions

Several research directions appear promising.

Differentiable State Abstractions

Instead of differentiating raw mutable systems, one may differentiate structured state models:

$$ s_{t+1} = F_\theta(s_t, x_t). $$

This reduces hidden behavior and makes adjoint propagation explicit.

Event-Sourced Systems

Event sourcing naturally preserves history:

state = replay(events)

Because prior states can be reconstructed from event logs, reverse replay may become easier.

This is one reason differentiable event-driven architectures are interesting research targets.

Probabilistic Differentiation

Instead of exact trajectories, differentiate expectations:

$$ \nabla_\theta \mathbb{E}[L]. $$

This connects AD with stochastic control and reinforcement learning.

Compiler-Level AD

Systems like Enzyme suggest that low-level compiler analysis may help differentiate larger stateful programs by leveraging alias analysis, memory SSA, and optimization infrastructure.

Differentiable Runtime Systems

Future runtimes may expose:

  • reversible execution
  • replayable scheduling
  • deterministic concurrency
  • differentiable memory tracking

This would make state evolution more compatible with reverse mode.

Why This Problem Matters

Differentiating large stateful systems would enable:

Capability Example
self-optimizing databases learned storage layouts
differentiable operating systems adaptive scheduling
differentiable networks optimized routing policies
adaptive distributed systems learned replication strategies
autonomous infrastructure self-tuning clusters
differentiable software stacks end-to-end optimization

Today, optimization boundaries are usually local. Models optimize tensor parameters. The surrounding systems remain hand-engineered.

Large-scale stateful differentiation would extend optimization into the systems layer itself.

Summary

Automatic differentiation works best for pure, structured computations with explicit dependency graphs and limited mutable state.

Large stateful systems violate these assumptions.

They contain:

  • mutation
  • hidden state
  • concurrency
  • external effects
  • nondeterminism
  • distributed execution
  • long temporal horizons

Differentiating such systems requires more than extending reverse mode mechanically. It requires new semantic models for time, state, memory, concurrency, and external interaction.

This remains one of the central unsolved problems in differentiable programming and systems research.