Probabilistic Automatic Differentiation

Classical automatic differentiation computes derivatives of deterministic programs.

A probabilistic program instead describes random variables, probability distributions, and stochastic transformations. The output is not a single value but a distribution, expectation, likelihood, or sampled trajectory.

Probabilistic automatic differentiation studies how derivatives propagate through such systems.

This includes:

| Problem | Example |
| --- | --- |
| differentiating expectations | stochastic optimization |
| differentiating sampling procedures | variational inference |
| differentiating probabilistic programs | Bayesian learning |
| differentiating Monte Carlo estimators | simulation gradients |
| differentiating stochastic dynamics | diffusion models |

The central challenge is that randomness introduces discontinuities, variance, and estimator bias into the differentiation process.

Deterministic vs Stochastic Computation

A deterministic computation defines

$$ y = f(x,\theta). $$

A stochastic computation introduces random variables:

$$ y = f(x,\theta,\omega), $$

where

$$ \omega \sim p(\omega). $$

The quantity of interest is often an expectation:

$$ L(\theta) = \mathbb{E}_{\omega \sim p_\theta} [\ell(f(\theta,\omega))]. $$

The derivative becomes

$$ \nabla_\theta L(\theta). $$

The difficulty is that both the sampled value and the distribution itself may depend on θ.

Differentiating Expectations

Suppose

$$ L(\theta)=\mathbb{E}_{x\sim p_\theta(x)}[\ell(x)]. $$

Expanding the expectation:

$$ L(\theta) = \int \ell(x)p_\theta(x)\,dx. $$

Differentiate under the integral:

$$ \nabla_\theta L = \int \ell(x)\nabla_\theta p_\theta(x)\,dx. $$

Using

$$ \nabla_\theta p_\theta(x) = p_\theta(x)\nabla_\theta \log p_\theta(x), $$

we obtain

$$ \nabla_\theta L = \mathbb{E}_{x\sim p_\theta} [ \ell(x)\nabla_\theta \log p_\theta(x) ]. $$

This is the score-function estimator.

It is also called:

| Name | Context |
| --- | --- |
| REINFORCE estimator | reinforcement learning |
| likelihood-ratio estimator | statistics |
| score-function gradient | probabilistic inference |

The estimator does not require differentiating through the sampled value itself.

Score-Function Estimator

The score-function estimator is

$$ \nabla_\theta \mathbb{E}_{x\sim p_\theta} [\ell(x)] = \mathbb{E}_{x\sim p_\theta} [ \ell(x)\nabla_\theta \log p_\theta(x) ]. $$

Monte Carlo approximation gives

$$ \nabla_\theta L \approx \frac{1}{N} \sum_{i=1}^N \ell(x_i)\nabla_\theta \log p_\theta(x_i). $$
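As a concrete check of the Monte Carlo formula, the following minimal sketch estimates the gradient with respect to the mean of a Gaussian, using only the Python standard library. The setup is an assumption made for illustration: the distribution is N(mu, sigma^2), the loss is ℓ(x) = x², and the helper name `score_function_grad` is hypothetical. For this loss the exact gradient is 2·mu, which gives a reference value to compare against.

```python
import random


def score_function_grad(mu, sigma, loss, n_samples, rng):
    """Estimate d/dmu of E_{x ~ N(mu, sigma^2)}[loss(x)] from samples.

    The loss is only evaluated, never differentiated.
    """
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        # Score of a Gaussian with respect to its mean: (x - mu) / sigma^2.
        score = (x - mu) / sigma**2
        total += loss(x) * score
    return total / n_samples


rng = random.Random(0)
# Illustrative loss: E[x^2] = mu^2 + sigma^2, so the exact gradient is 2 * mu.
estimate = score_function_grad(
    mu=1.5, sigma=1.0, loss=lambda x: x * x, n_samples=200_000, rng=rng
)
exact = 2 * 1.5
```

Because `loss` is treated as a black box, the same function would apply unchanged to a non-differentiable loss; only the score of the sampling distribution has to be known.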

This estimator is general.

It works even when:

| Situation | Supported |
| --- | --- |
| discrete random variables | yes |
| non-differentiable samples | yes |
| black-box simulators | yes |

However, it often has very high variance.

Large variance leads to unstable optimization and slow convergence.

Reparameterization Trick

Suppose samples can be written as

$$ x = g(\theta,\epsilon), \qquad \epsilon \sim p(\epsilon), $$

where the distribution of ε does not depend on θ.

Then

$$ L(\theta) = \mathbb{E}_{\epsilon} [ \ell(g(\theta,\epsilon)) ]. $$

Now the expectation is over a fixed distribution. The derivative becomes

$$ \nabla_\theta L = \mathbb{E}_{\epsilon} [ \nabla_\theta \ell(g(\theta,\epsilon)) ]. $$

This allows ordinary reverse-mode AD through the sampled computation.

This is the reparameterization estimator.

Gaussian Example

Suppose

$$ x \sim \mathcal{N}(\mu,\sigma^2). $$

Reparameterize:

$$ x = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0,1). $$

Then

$$ L(\mu,\sigma) = \mathbb{E}_\epsilon [ \ell(\mu+\sigma\epsilon) ]. $$

Gradients become

$$ \nabla_\mu L = \mathbb{E} [ \nabla_x \ell(x) ], $$

and

$$ \nabla_\sigma L = \mathbb{E} [ \epsilon \nabla_x \ell(x) ]. $$

The stochasticity is isolated in ε. The remaining computation is differentiable.

This estimator usually has much lower variance than the score-function estimator.
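The Gaussian formulas above can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the loss is ℓ(x) = x², its derivative is written by hand as a stand-in for what an AD system would supply, and the helper names are hypothetical. It computes per-sample pathwise gradients for both parameters and, for comparison, per-sample score-function gradients for the mean.

```python
import random
import statistics


def loss(x):
    return x * x


def dloss_dx(x):
    # Hand-written derivative standing in for reverse-mode AD.
    return 2.0 * x


def pathwise_samples(mu, sigma, n, rng):
    """Per-sample pathwise gradients (d/dmu, d/dsigma) for x = mu + sigma * eps."""
    out = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        g = dloss_dx(mu + sigma * eps)
        out.append((g, eps * g))  # dx/dmu = 1, dx/dsigma = eps
    return out


def score_samples_mu(mu, sigma, n, rng):
    """Per-sample score-function gradients with respect to mu."""
    xs = (rng.gauss(mu, sigma) for _ in range(n))
    return [loss(x) * (x - mu) / sigma**2 for x in xs]


rng = random.Random(1)
mu, sigma, n = 1.5, 1.0, 50_000
path = pathwise_samples(mu, sigma, n, rng)
score = score_samples_mu(mu, sigma, n, rng)

grad_mu = statistics.fmean(g for g, _ in path)      # exact value: 2 * mu
grad_sigma = statistics.fmean(g for _, g in path)   # exact value: 2 * sigma
var_pathwise = statistics.variance([g for g, _ in path])
var_score = statistics.variance(score)
```

For this particular loss, the per-sample variance of the pathwise estimate of the mean gradient is 4σ², whereas the score-function estimate has a much larger variance at the same parameters. That ordering is typical but not guaranteed for every loss.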

Pathwise Derivatives

The reparameterization trick is also called the pathwise derivative estimator.

The derivative propagates through the sampled path itself:

epsilon -> sample x -> loss

The stochastic node becomes a differentiable transformation.

This makes probabilistic programs compatible with ordinary reverse-mode AD systems.

Comparison of Gradient Estimators

| Estimator | Requires differentiable sample path | Supports discrete variables | Variance |
| --- | --- | --- | --- |
| score-function | no | yes | high |
| reparameterization | yes | usually no | lower |
| finite differences | no | yes | very high |
| implicit estimators | partial | partial | moderate |

No estimator is uniformly best.

The choice depends on distribution structure and computational constraints.

Variance Reduction

Monte Carlo gradient estimators are noisy.

Variance reduction is therefore central in probabilistic differentiation.

Baselines

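Subtract a baseline b that does not depend on the sample x:

$$ \mathbb{E} [ (\ell(x)-b)\nabla_\theta \log p_\theta(x) ]. $$

The estimator remains unbiased because the score has zero mean:

$$ \mathbb{E}[\nabla_\theta \log p_\theta(x)] = \int \nabla_\theta p_\theta(x)\,dx = \nabla_\theta \int p_\theta(x)\,dx = 0. $$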

A well-chosen baseline, such as an estimate of the expected loss, can reduce variance substantially.
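The effect of a baseline can be seen in a small experiment. This is a sketch under assumptions chosen for illustration: a Gaussian sampling distribution, the loss ℓ(x) = x² + 10 (the constant offset leaves the gradient unchanged but inflates the variance), and a baseline set to the exact expected loss, which is known here only because the example is constructed that way. The function name `score_grad_samples` is hypothetical.

```python
import random
import statistics


def score_grad_samples(mu, sigma, loss, baseline, n, rng):
    """Per-sample score-function gradients w.r.t. mu with a constant baseline."""
    samples = []
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        score = (x - mu) / sigma**2
        samples.append((loss(x) - baseline) * score)
    return samples


def loss(x):
    return x * x + 10.0


mu, sigma, n = 1.5, 1.0, 50_000
rng = random.Random(2)
plain = score_grad_samples(mu, sigma, loss, 0.0, n, rng)
# Baseline = exact expected loss mu^2 + sigma^2 + 10 (an assumption of this toy setup).
centered = score_grad_samples(mu, sigma, loss, mu**2 + sigma**2 + 10.0, n, rng)

mean_plain = statistics.fmean(plain)        # both means estimate 2 * mu
mean_centered = statistics.fmean(centered)
var_plain = statistics.variance(plain)
var_centered = statistics.variance(centered)
```

Both sample means target the same gradient 2·mu, while the centered version has a markedly smaller per-sample variance. In practice the expected loss is unknown and the baseline is usually estimated, for instance with a running average.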

Control variates

Subtract an auxiliary term that is correlated with the estimator and add back its known expectation, so the mean is unchanged while the variance drops.

Antithetic sampling

Use negatively correlated sample pairs, such as ε and −ε for a symmetric noise distribution, so that their errors partly cancel.
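A minimal sketch of antithetic sampling for a pathwise gradient follows. The assumptions are illustrative: x = mu + sigma·ε with standard normal ε, loss ℓ(x) = exp(x) with a hand-written derivative, and hypothetical helper names. Each estimate averages two derivative evaluations, either at two independent noise draws or at the antithetic pair ε and −ε, so the cost per estimate is the same.

```python
import math
import random
import statistics


def dloss_dx(x):
    # Derivative of the illustrative loss exp(x), written by hand.
    return math.exp(x)


def independent_pairs(mu, sigma, n_pairs, rng):
    """Average the pathwise gradient over two independent noise draws."""
    out = []
    for _ in range(n_pairs):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append(0.5 * (dloss_dx(mu + sigma * e1) + dloss_dx(mu + sigma * e2)))
    return out


def antithetic_pairs(mu, sigma, n_pairs, rng):
    """Average the pathwise gradient over the mirrored draws eps and -eps."""
    out = []
    for _ in range(n_pairs):
        e = rng.gauss(0.0, 1.0)
        out.append(0.5 * (dloss_dx(mu + sigma * e) + dloss_dx(mu - sigma * e)))
    return out


rng = random.Random(4)
mu, sigma, n_pairs = 0.0, 0.5, 20_000
indep = independent_pairs(mu, sigma, n_pairs, rng)
anti = antithetic_pairs(mu, sigma, n_pairs, rng)
# Exact gradient w.r.t. mu of E[exp(mu + sigma * eps)] is exp(mu + sigma^2 / 2).
exact = math.exp(mu + sigma**2 / 2)
```

Both lists have the same mean in expectation. In this setup the antithetic version has the smaller variance because the errors at ε and −ε are negatively correlated.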

Rao-Blackwellization

Integrate analytically over some variables instead of sampling them.

These methods are essential for practical stochastic gradient estimation.

Discrete Random Variables

Discrete sampling is difficult because sampled values change discontinuously.

Suppose

$$ x \sim \operatorname{Categorical}(p_\theta). $$

A tiny parameter perturbation may abruptly change the sampled category.

Ordinary pathwise differentiation fails because:

$$ \frac{\partial x}{\partial \theta} $$

does not exist in the classical sense.

The score-function estimator still works because it differentiates the probability distribution rather than the sampled value.
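To make this concrete, the sketch below applies the score-function estimator to a categorical variable. It assumes, for illustration only, that the probabilities come from a softmax over logits θ and that each category k has a fixed loss value; the function names are hypothetical. With that parameterization the score is ∂ log p_k / ∂θ_j = 1[j = k] − p_j, and the exact gradient p_j(ℓ_j − Σ_k p_k ℓ_k) is available for comparison.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def categorical_score_grad(logits, losses, n, rng):
    """Score-function estimate of d/dlogits of E_{k ~ softmax(logits)}[losses[k]]."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n):
        k = rng.choices(range(len(probs)), weights=probs)[0]
        for j in range(len(logits)):
            # d log p_k / d logit_j = 1[j == k] - p_j
            grad[j] += losses[k] * ((1.0 if j == k else 0.0) - probs[j])
    return [g / n for g in grad]


def exact_grad(logits, losses):
    """Closed-form gradient, available because the sum over categories is small."""
    probs = softmax(logits)
    mean_loss = sum(p * l for p, l in zip(probs, losses))
    return [p * (l - mean_loss) for p, l in zip(probs, losses)]


logits, losses = [0.0, 0.5, -0.5], [1.0, 2.0, 4.0]
estimate = categorical_score_grad(logits, losses, 50_000, random.Random(6))
reference = exact_grad(logits, losses)
```

The sampled category `k` is never differentiated. Only the log-probability of the chosen category is, which is why the discontinuity of the sample does not matter.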

Relaxed Distributions

A common workaround replaces discrete variables with continuous approximations.

For categorical sampling, the Gumbel-Softmax trick uses:

$$ y_i = \frac{ \exp((\log p_i + g_i)/\tau) }{ \sum_j \exp((\log p_j + g_j)/\tau) }, $$

where:

$$ g_i \sim \operatorname{Gumbel}(0,1). $$

As temperature τ approaches zero, the relaxed sample approaches a one-hot discrete sample.

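For any temperature τ > 0, the sample remains a differentiable function of the probabilities, at the cost of bias relative to the original discrete objective.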

This allows approximate pathwise differentiation through discrete choices.
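The relaxed sample defined above can be written in a few lines. This is a minimal standard-library sketch, not a reference implementation. The function names are hypothetical, and Gumbel noise is drawn by inverse-CDF sampling, using the fact that −log(−log U) is Gumbel(0, 1) when U is uniform on (0, 1).

```python
import math
import random


def sample_gumbel(rng):
    """Draw from Gumbel(0, 1) via -log(-log(U)) with U uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which would break the log
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample for strictly positive probabilities and tau > 0."""
    scores = [(math.log(p) + sample_gumbel(rng)) / tau for p in probs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(7)
probs = [0.2, 0.3, 0.5]
soft_sample = gumbel_softmax(probs, tau=5.0, rng=rng)    # close to uniform weights
sharp_sample = gumbel_softmax(probs, tau=0.05, rng=rng)  # close to one-hot
```

The index of the largest entry is distributed according to `probs` at every temperature, since it coincides with the Gumbel-max sample. The temperature only controls how sharply the remaining mass is concentrated on that entry.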

Probabilistic Computational Graphs

A probabilistic program can be represented as a graph containing both deterministic and stochastic nodes.

Example:

theta -> z ~ p(z|theta)
z -> x ~ p(x|z)
x -> loss

Differentiation propagates through:

| Node type | Gradient rule |
| --- | --- |
| deterministic node | ordinary chain rule |
| stochastic node | estimator-specific rule |

Modern probabilistic programming systems combine AD with stochastic estimators to differentiate entire probabilistic models.

Variational Inference

Variational inference optimizes an approximate distribution

$$ q_\phi(z) $$

to approximate a target posterior.

The evidence lower bound (ELBO) is

$$ \mathcal{L}(\phi) = \mathbb{E}_{z\sim q_\phi} [ \log p(x,z)-\log q_\phi(z) ]. $$

Gradients require differentiating expectations over learned distributions.

Reparameterization gradients made deep variational models practical.

Variational autoencoders are a canonical example.
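A small worked case shows how reparameterization gradients drive ELBO optimization. The model below is an assumption chosen because its posterior is known in closed form: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), and variational family q(z) = N(m, s²). The exact posterior is then N(x/2, 1/2). The sketch writes the derivative of log p(x, z) by hand and treats the entropy of q analytically (it equals log s plus a constant). The function name and hyperparameters are hypothetical.

```python
import math
import random


def fit_gaussian_posterior(x_obs, steps=3000, batch=32, lr=0.02, seed=0):
    """Stochastic gradient ascent on the ELBO with reparameterized samples."""
    rng = random.Random(seed)
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m, grad_log_s = 0.0, 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps                   # reparameterized sample from q
            dlogp_dz = -z + (x_obs - z)       # d/dz of log p(x, z)
            grad_m += dlogp_dz                # dz/dm = 1
            grad_log_s += dlogp_dz * eps * s  # dz/dlog_s = s * eps
        # Entropy of q is log(s) + const, contributing 1 to the log_s gradient.
        m += lr * grad_m / batch
        log_s += lr * (grad_log_s / batch + 1.0)
    return m, math.exp(log_s)


m_hat, s_hat = fit_gaussian_posterior(x_obs=1.0)
# Exact posterior for comparison: mean x_obs / 2, standard deviation sqrt(1 / 2).
```

Optimizing log s instead of s keeps the standard deviation positive without a constraint. The fitted values fluctuate around the exact posterior parameters because each step uses a noisy gradient.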

Variational Autoencoders

A variational autoencoder defines:

| Component | Role |
| --- | --- |
| encoder | parameterizes latent distribution |
| latent variable | sampled representation |
| decoder | reconstructs data |

The encoder predicts:

$$ \mu(x),\quad \sigma(x). $$

A latent sample is drawn using reparameterization:

$$ z=\mu+\sigma\epsilon. $$

The decoder computes reconstruction loss.

Reverse-mode AD then differentiates the entire stochastic pipeline.

Without reparameterization, efficient training would be much harder.

Probabilistic Programs

A probabilistic program includes random choices:

z = sample(normal(mu, sigma))
x = sample(decoder(z))
observe(data, x)

The program defines a probability distribution over execution traces.
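One way to picture an execution trace is a small recorder that stores each random choice and accumulates the log joint density as the program runs. The sketch below is a simplification built on assumptions: the `Trace` class and its `sample` and `observe` methods are hypothetical and are not the API of any particular framework, all distributions are Gaussian, and the decoder is replaced by a fixed linear map so that the observation is conditioned on directly.

```python
import math
import random


def normal_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


class Trace:
    """Records random choices and the log joint density of one execution."""

    def __init__(self, rng):
        self.rng = rng
        self.choices = {}
        self.log_joint = 0.0

    def sample(self, name, mean, std):
        value = self.rng.gauss(mean, std)
        self.choices[name] = value
        self.log_joint += normal_logpdf(value, mean, std)
        return value

    def observe(self, name, value, mean, std):
        # Observed values are not sampled; they only contribute likelihood.
        self.log_joint += normal_logpdf(value, mean, std)


def model(trace, mu, sigma, data):
    z = trace.sample("z", mu, sigma)
    # Hypothetical decoder: the observation is Gaussian around 2 * z.
    trace.observe("x", data, 2.0 * z, 1.0)
    return z


trace = Trace(random.Random(8))
z = model(trace, mu=0.0, sigma=1.0, data=1.3)
```

After one run, `trace.choices` holds the sampled latent value and `trace.log_joint` holds log p(z) + log p(data | z). A probabilistic AD system would differentiate this accumulated quantity, or an estimator built from it, with respect to `mu` and `sigma`.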

Differentiation may involve:

| Quantity | Meaning |
| --- | --- |
| log probability | likelihood gradient |
| posterior expectation | inference objective |
| sampled trajectory | simulation sensitivity |

Probabilistic AD systems combine tracing, sampling, and reverse-mode differentiation.

Monte Carlo Differentiation

Suppose

$$ L(\theta)=\mathbb{E}[f_\theta(X)]. $$

Monte Carlo estimates use samples:

$$ L_N(\theta) = \frac{1}{N} \sum_{i=1}^N f_\theta(X_i). $$

When the distribution of X does not depend on θ, differentiating term by term gives

$$ \nabla_\theta L_N = \frac{1}{N} \sum_i \nabla_\theta f_\theta(X_i). $$

This estimator is itself a random variable, because it depends on the drawn samples.

Thus optimization uses stochastic gradients of stochastic objectives.

Understanding variance propagation becomes critical.
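The randomness of the gradient estimate can be observed directly. The sketch below assumes, purely for illustration, that f_θ(X) = (θ − X)² with X drawn from a standard normal that does not depend on θ, so the exact gradient is 2θ and each per-sample gradient is 2(θ − X). The function names are hypothetical. Repeating the estimate many times shows how its spread shrinks as the sample size N grows.

```python
import random
import statistics


def grad_estimate(theta, n, rng):
    """Gradient of the Monte Carlo objective L_N for f(X) = (theta - X)^2."""
    return sum(2.0 * (theta - rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def spread(theta, n, repeats, rng):
    """Standard deviation of the gradient estimate across independent batches."""
    return statistics.stdev([grad_estimate(theta, n, rng) for _ in range(repeats)])


rng = random.Random(9)
theta = 1.0
spread_small = spread(theta, n=10, repeats=200, rng=rng)
spread_large = spread(theta, n=1000, repeats=200, rng=rng)
big_batch_grad = grad_estimate(theta, n=100_000, rng=rng)  # exact value: 2 * theta
```

For this toy objective the spread scales as 2/√N, so a hundredfold increase in samples reduces the noise by about a factor of ten.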

Stochastic Differential Equations

Probabilistic dynamics often use stochastic differential equations:

$$ dz = f(z,t)\,dt + g(z,t)\,dW_t, $$

where W_t is Brownian motion.

These systems appear in:

| Domain | Example |
| --- | --- |
| diffusion models | generative modeling |
| finance | stochastic volatility |
| physics | thermal noise |
| biology | random population dynamics |

Differentiation through SDEs requires handling stochastic integrals and noise-dependent trajectories.
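One simple approach, sketched here under assumptions, is to discretize the SDE and differentiate the discretized path while holding the noise fixed. The example uses an Ornstein-Uhlenbeck process, dz = −θ z dt + σ dW, with an Euler-Maruyama step, and propagates the sensitivity s = ∂z/∂θ alongside the state in forward mode. The process, step size, and function name are illustrative choices. A finite-difference check on the same noise path confirms the sensitivity.

```python
import math
import random


def simulate_with_sensitivity(theta, sigma, z0, dt, noise):
    """Euler-Maruyama for dz = -theta * z dt + sigma dW, plus s = dz/dtheta."""
    z, s = z0, 0.0
    scale = sigma * math.sqrt(dt)
    for xi in noise:
        # Differentiating z <- z - theta * z * dt + scale * xi w.r.t. theta gives
        # s <- s - (z + theta * s) * dt, evaluated at the pre-update z.
        z, s = z - theta * z * dt + scale * xi, s - (z + theta * s) * dt
    return z, s


rng = random.Random(3)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]  # one fixed discretized noise path
z_final, dz_dtheta = simulate_with_sensitivity(0.8, 0.3, 1.0, 0.01, noise)

# Central finite difference on the same noise path, for comparison.
h = 1e-5
z_plus, _ = simulate_with_sensitivity(0.8 + h, 0.3, 1.0, 0.01, noise)
z_minus, _ = simulate_with_sensitivity(0.8 - h, 0.3, 1.0, 0.01, noise)
finite_diff = (z_plus - z_minus) / (2 * h)
```

Reusing the same `noise` list for every evaluation is what makes this a pathwise derivative: the Brownian increments play the role of ε in the reparameterization trick. This differentiates the discretized solver, so it inherits the discretization error of the Euler-Maruyama scheme.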

Diffusion Models

Modern generative diffusion models evolve data through noisy dynamics.

Forward diffusion adds noise:

$$ dx = -\frac{1}{2}\beta(t)x\,dt + \sqrt{\beta(t)}\,dW_t. $$

Reverse diffusion learns to invert the stochastic process.

Training involves expectations over noisy trajectories and repeated stochastic sampling.

Probabilistic AD is fundamental to these systems.

Measure-Theoretic Issues

Differentiating probabilistic systems introduces mathematical subtleties.

Questions include:

| Question | Issue |
| --- | --- |
| Can the derivative move inside the expectation? | dominated convergence |
| Does a density exist? | measure regularity |
| Is the estimator unbiased? | interchange of limits |
| Does the variance exist? | integrability |

Many practical estimators rely on assumptions that may fail in heavy-tailed or discontinuous systems.

Stochastic Control Flow

Programs with random branching are especially difficult.

Example:

if sample(bernoulli(p)):
    y = f1(theta)
else:
    y = f2(theta)

The execution trace itself becomes stochastic.

The derivative depends on both:

| Component | Effect |
| --- | --- |
| branch probability | score-function term |
| branch computation | pathwise term |

Hybrid estimators are often required.
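A hybrid estimator for the branching example can be sketched as follows. The concrete choices are assumptions for illustration: the branch probability is p = sigmoid(phi), the branches are f1(θ) = θ² and f2(θ) = 3θ with hand-written derivatives, and the function names are hypothetical. The gradient with respect to phi uses a score-function term, while the gradient with respect to θ is taken pathwise through whichever branch was executed.

```python
import math
import random


def f1(theta):
    return theta * theta


def df1(theta):
    return 2.0 * theta


def f2(theta):
    return 3.0 * theta


def df2(theta):
    return 3.0


def hybrid_grads(phi, theta, n, rng):
    """Estimate (dL/dphi, dL/dtheta) for L = E[y] under a random branch."""
    p = 1.0 / (1.0 + math.exp(-phi))
    g_phi = g_theta = 0.0
    for _ in range(n):
        if rng.random() < p:
            y, dy = f1(theta), df1(theta)
            dlogprob = 1.0 - p   # d/dphi of log p
        else:
            y, dy = f2(theta), df2(theta)
            dlogprob = -p        # d/dphi of log(1 - p)
        g_phi += y * dlogprob    # score-function term for the branch probability
        g_theta += dy            # pathwise term through the executed branch
    return g_phi / n, g_theta / n


def exact_grads(phi, theta):
    """Closed form from L = p * f1(theta) + (1 - p) * f2(theta)."""
    p = 1.0 / (1.0 + math.exp(-phi))
    return p * (1.0 - p) * (f1(theta) - f2(theta)), p * df1(theta) + (1.0 - p) * df2(theta)


est_phi, est_theta = hybrid_grads(0.3, 2.0, 50_000, random.Random(10))
ref_phi, ref_theta = exact_grads(0.3, 2.0)
```

With only two branches the expectation could be summed exactly, as `exact_grads` does. The sampled version matters when the number of possible traces is too large to enumerate.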

Gradient Estimator Bias

Some estimators are unbiased:

$$ \mathbb{E}[\hat{g}] = \nabla_\theta L. $$

Others trade bias for lower variance.

A low-variance biased estimator may outperform a theoretically correct unbiased estimator in optimization.

This creates a central engineering tradeoff:

| Goal | Cost |
| --- | --- |
| unbiasedness | high variance |
| low variance | possible bias |

Practical probabilistic learning often prefers stable optimization over exact gradient fidelity.

Probabilistic AD Systems

Modern systems combine:

| Capability | Purpose |
| --- | --- |
| reverse-mode AD | deterministic differentiation |
| stochastic estimators | random variables |
| trace graphs | probabilistic execution |
| Monte Carlo sampling | expectation approximation |
| symbolic density tracking | log-likelihood computation |

Examples include probabilistic programming frameworks and differentiable simulators.

Failure Modes

Probabilistic differentiation introduces several sources of instability.

High variance

Gradient estimates may fluctuate wildly.

Rare-event instability

Extreme samples dominate gradients.

Discontinuous sampling

Discrete variables create undefined pathwise derivatives.

Monte Carlo noise

Optimization may become noisy or biased.

Numerical underflow

Tiny probabilities destabilize log-likelihoods.
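A standard remedy is to keep quantities in log space and combine them with a log-sum-exp that subtracts the maximum before exponentiating. The sketch below is a minimal standard-library version with a hypothetical function name. It contrasts the stable computation with a naive sum of probabilities that underflows to zero.

```python
import math


def log_sum_exp(log_values):
    """Compute log(sum(exp(v))) without underflow or overflow."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf  # every term has probability zero
    return m + math.log(sum(math.exp(v - m) for v in log_values))


log_probs = [-1000.0, -1001.0, -1002.0]
naive_sum = sum(math.exp(v) for v in log_probs)  # each term underflows to 0.0
stable_log_total = log_sum_exp(log_probs)
```

Taking `math.log(naive_sum)` would fail here because the sum is exactly zero in floating point, whereas `stable_log_total` is a finite value close to −1000.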

Correlated randomness

Dependent samples complicate variance analysis.

These issues often dominate how training behaves in practice.

Conceptual Shift

Classical AD differentiates functions.

Probabilistic AD differentiates distributions, expectations, and stochastic processes.

The chain rule alone is no longer sufficient. Gradient estimation becomes a statistical problem as well as a computational one.

This changes the meaning of differentiation itself.

Instead of asking:

$$ \frac{dy}{d\theta}, $$

we ask:

$$ \nabla_\theta \mathbb{E}[y]. $$

The derivative becomes an expectation over random trajectories.

Summary

Probabilistic automatic differentiation extends AD into stochastic systems.

Differentiation may occur through expectations, random variables, Monte Carlo estimators, stochastic differential equations, or probabilistic programs.

The two dominant techniques are:

| Method | Core idea |
| --- | --- |
| score-function estimators | differentiate probabilities |
| reparameterization estimators | differentiate sampled paths |

These methods make modern probabilistic machine learning practical, including variational inference, stochastic simulation, diffusion models, and probabilistic programming.

The central challenge is no longer only correctness of the chain rule. It is managing variance, bias, stochasticity, and numerical stability while preserving useful gradients through random computation.