Differentiation of Large Stateful Systems

Automatic differentiation works naturally on pure mathematical functions:

$$ y = f(x). $$

The function receives inputs, computes outputs, and produces no persistent side effects. Most theoretical treatments of AD assume this model.

Large real systems rarely behave this way.

Operating systems, databases, distributed services, simulations, robotics stacks, game engines, browsers, network services, and long-running machine learning pipelines all contain substantial mutable state. Their behavior depends not only on explicit inputs, but also on internal memory, caches, queues, synchronization, randomness, external events, and temporal ordering.

Differentiating such systems remains one of the major open problems in automatic differentiation.

What Is a Stateful System?

A stateful system maintains internal state across time.

Instead of:

$$ y = f(x), $$

the system behaves more like:

$$ (s_{t+1}, y_t) = F(s_t, x_t). $$

Where:

Symbol	Meaning
$s_t$	internal state at time $t$
$x_t$	external input
$y_t$	output
$F$	transition function

Examples include:

System	Internal state
database	pages, indexes, transactions
web server	connection pools, caches
robot controller	sensor history, control state
physics engine	world state
distributed system	replicated logs, consensus state
neural network training loop	optimizer moments, parameter buffers
browser engine	DOM tree, rendering state
operating system	scheduler queues, memory tables

The derivative now depends on the evolution of state over time, not only on one isolated function evaluation.

Reverse Mode Across Time

Suppose a system evolves for $T$ steps:

$$ s_{t+1} = F(s_t, x_t). $$

A scalar loss depends on the final state:

$$ L = \ell(s_T). $$

Reverse mode computes gradients backward through time:

$$ \bar{s}t = \bar{s}{t+1} \frac{\partial s_{t+1}}{\partial s_t}. $$

This resembles recurrent neural network backpropagation.

However, large stateful systems introduce much harder conditions:

Neural network	Large stateful system
mostly tensor operations	heterogeneous operations
explicit training graph	implicit evolving state
bounded semantics	external side effects
controlled primitives	arbitrary program behavior
accelerator-oriented	distributed heterogeneous runtime

The problem becomes less like differentiating a tensor graph and more like differentiating a live computational process.

Mutation

Mutation is the first major obstacle.

Consider:

buffer[i] += x

The reverse pass may require:

previous value of buffer[i]
index i
ordering of writes
alias relationships

If the same memory is modified repeatedly:

for i in range(n):
    state = update(state, data[i])

reverse mode may need all intermediate states:

$$ s_0, s_1, s_2, \ldots, s_T. $$

Storing every state can become impossible.

This creates the central memory problem of stateful reverse mode.

Temporal Depth

Long-running systems can evolve for millions or billions of state transitions.

Examples:

System	Temporal scale
database server	months
robot controller	continuous
online recommender	indefinite
distributed cache	persistent
simulation	millions of timesteps

Reverse mode over an indefinitely evolving process is fundamentally difficult because backward propagation requires access to earlier computational history.

A naive tape model becomes infeasible.

Checkpointing Limits

Checkpointing reduces storage by saving selected states and recomputing others.

For a short numerical simulation this works well.

Large stateful systems introduce new problems:

Issue	Consequence
nondeterminism	recomputation may diverge
external events	replay impossible
concurrency	ordering may differ
random scheduling	execution path unstable
network timing	state evolution changes

Recomputation assumes the system can replay identically. Many real systems cannot guarantee this.

External Side Effects

Stateful systems interact with the outside world.

Examples:

send_packet(data)
write_to_disk(page)
emit_event(msg)

These operations are not ordinary differentiable transformations.

Questions immediately appear:

Question	Difficulty
what is the derivative of a disk write?	unclear semantics
what is the derivative of network latency?	stochastic external process
what is the derivative of cache eviction?	discontinuous policy
what is the derivative of scheduler behavior?	nondeterministic execution

Most AD systems avoid these questions by treating such operations as nondifferentiable boundaries.

This is acceptable for neural networks. It is not sufficient for general differentiable systems.

Hidden State

Large systems often contain hidden state not visible in the user program.

Examples include:

allocator metadata
kernel page cache
GPU runtime queues
filesystem buffers
thread scheduling state
hardware prefetch behavior

A system’s behavior may depend on these hidden states even if the program source does not expose them.

This creates a semantic problem:

$$ \text{What exactly is the function being differentiated?} $$

The observed output may depend on implementation details outside the formal computational graph.

Differentiable Databases

Databases illustrate the challenge clearly.

A query system contains:

Component	Stateful behavior
buffer manager	page cache
optimizer	adaptive plans
transaction manager	locks and MVCC
storage engine	mutable disk layout
replication	distributed logs

Suppose one wishes to optimize latency:

$$ L = \text{query latency}. $$

Latency depends on:

cache state
branch prediction
disk layout
concurrent traffic
scheduler behavior
network contention

The resulting function is noisy, discontinuous, and history-dependent.

Differentiating such systems requires combining AD with stochastic modeling, systems instrumentation, and probabilistic reasoning.

Distributed Systems

Distributed systems create even harder problems.

A distributed system evolves under:

$$ s_{t+1} = F(s_t, x_t, m_t, \sigma_t), $$

where:

Symbol	Meaning
$m_t$	message arrivals
$\sigma_t$	scheduling decisions

The system state depends on message timing and asynchronous interleavings.

Small perturbations may change:

ordering of events
consensus leader election
retry behavior
timeout triggers
partition recovery

The execution path may bifurcate discontinuously.

Classical reverse mode assumes stable local derivatives. Distributed systems often violate that assumption.

Differentiable Simulators

Physics simulators are among the most successful large stateful differentiable systems so far.

A simulator evolves:

$$ s_{t+1} = \Phi(s_t, u_t). $$

Reverse mode propagates adjoints through time integration.

Differentiable simulators work because:

Property	Benefit
deterministic dynamics	replay possible
explicit state	visible transitions
mathematical structure	known local derivatives
controlled operations	stable semantics

Even here, problems appear:

collisions introduce discontinuities
contact solvers are iterative
friction models may be nonsmooth
adaptive timesteps alter trajectories

Large engineering simulators remain difficult to differentiate robustly.

Concurrency

Concurrency fundamentally complicates AD.

Suppose two threads update shared state:

thread A: x += 1
thread B: x *= 2

The final state depends on execution ordering.

Reverse mode would need to know:

exact interleaving
memory visibility rules
synchronization ordering
atomic semantics

The derivative is therefore partly a property of the scheduler.

Most AD systems implicitly assume sequential execution. Extending reverse mode to concurrent systems remains largely unresolved.

Sparse Credit Assignment

Large stateful systems often contain delayed effects.

A decision made at time $t$ may affect outcomes at time $t + 10^6$.

This creates a long-horizon credit assignment problem:

$$ \frac{\partial L}{\partial s_t}. $$

Examples:

Domain	Long-range effect
database caching	later query latency
network routing	future congestion
compiler optimization	downstream runtime
operating systems	future scheduling behavior
autonomous systems	delayed task success

Backpropagating through extremely long horizons is computationally and numerically unstable.

Hybrid Symbolic and Learned Systems

Modern systems increasingly combine:

neural components
rule engines
databases
search systems
planners
simulators

Example:

retrieval
  -> ranking
  -> planning
  -> execution
  -> environment interaction

Some components are differentiable. Others are discrete or symbolic.

Differentiating across such hybrid boundaries remains an open research area.

Semantic Boundaries

A deeper issue is semantic meaning.

For a tensor operation:

$$ y = x^2, $$

the derivative is mathematically clear.

For a database transaction scheduler:

$$ y = \text{latency}(x), $$

what exactly should the derivative mean?

Possible interpretations include:

Interpretation	Meaning
local sensitivity	infinitesimal perturbation
expected effect	stochastic average
policy gradient	optimization signal
finite perturbation	empirical response
relaxed continuous model	smoothed approximation

Different interpretations lead to different "gradients."

The problem becomes partly philosophical: what does differentiability mean for complex computational systems?

Possible Directions

Several research directions appear promising.

Differentiable State Abstractions

Instead of differentiating raw mutable systems, one may differentiate structured state models:

$$ s_{t+1} = F_\theta(s_t, x_t). $$

This reduces hidden behavior and makes adjoint propagation explicit.

Event-Sourced Systems

Event sourcing naturally preserves history:

state = replay(events)

Because prior states can be reconstructed from event logs, reverse replay may become easier.

This is one reason differentiable event-driven architectures are interesting research targets.

Probabilistic Differentiation

Instead of exact trajectories, differentiate expectations:

$$ \nabla_\theta \mathbb{E}[L]. $$

This connects AD with stochastic control and reinforcement learning.

Compiler-Level AD

Systems like Enzyme suggest that low-level compiler analysis may help differentiate larger stateful programs by leveraging alias analysis, memory SSA, and optimization infrastructure.

Differentiable Runtime Systems

Future runtimes may expose:

reversible execution
replayable scheduling
deterministic concurrency
differentiable memory tracking

This would make state evolution more compatible with reverse mode.

Why This Problem Matters

Differentiating large stateful systems would enable:

Capability	Example
self-optimizing databases	learned storage layouts
differentiable operating systems	adaptive scheduling
differentiable networks	optimized routing policies
adaptive distributed systems	learned replication strategies
autonomous infrastructure	self-tuning clusters
differentiable software stacks	end-to-end optimization

Today, optimization boundaries are usually local. Models optimize tensor parameters. The surrounding systems remain hand-engineered.

Large-scale stateful differentiation would extend optimization into the systems layer itself.

Summary

Automatic differentiation works best for pure, structured computations with explicit dependency graphs and limited mutable state.

Large stateful systems violate these assumptions.

They contain:

mutation
hidden state
concurrency
external effects
nondeterminism
distributed execution
long temporal horizons

Differentiating such systems requires more than extending reverse mode mechanically. It requires new semantic models for time, state, memory, concurrency, and external interaction.

This remains one of the central unsolved problems in differentiable programming and systems research.