Probabilistic Programming

Probabilistic programming represents uncertainty using executable probabilistic models. A probabilistic program defines a distribution rather than only a deterministic computation.

Probabilistic Programming

Probabilistic programming represents uncertainty using executable probabilistic models. A probabilistic program defines a distribution rather than only a deterministic computation.

Automatic differentiation is important because most modern probabilistic inference methods rely on gradients of probability densities, likelihoods, or variational objectives.

A probabilistic model typically defines:

$$ x \sim p(x \mid \theta), $$

where:

Symbol Meaning
$x$ latent variables
$\theta$ model parameters
$p$ probability distribution

Given observations $y$, inference computes the posterior:

$$ p(x,\theta \mid y). $$

AD enables efficient optimization and sampling in high-dimensional probabilistic systems.

Probabilistic Programs as Computation Graphs

A probabilistic program mixes deterministic computation with random sampling.

Example:

latent variables
    -> deterministic transforms
    -> likelihood computation
    -> log probability
    -> inference objective

The deterministic parts are ordinary differentiable programs. The probabilistic parts introduce distributions, expectations, and stochasticity.

Inference algorithms often differentiate quantities such as:

$$ \log p(y \mid \theta), $$

$$ \log p(x,\theta,y), $$

or variational objectives.

Log Probability and Gradients

Most gradient-based inference methods work with log densities.

Suppose

$$ p(x \mid y,\theta) \propto p(y \mid x,\theta)p(x \mid \theta). $$

Then the log posterior is

$$ \log p(x \mid y,\theta) = \log p(y \mid x,\theta) + \log p(x \mid \theta) - \log Z. $$

The normalization constant $Z$ is usually ignored during gradient computation because it does not depend on $x$.

AD computes gradients such as

$$ \nabla_x \log p(x \mid y,\theta). $$

These gradients drive modern inference algorithms.

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo augments latent variables with momentum variables:

$$ (x,p). $$

The Hamiltonian is

$$ H(x,p) = U(x)+K(p), $$

where

$$ U(x)=-\log p(x \mid y), $$

and

$$ K(p)=\frac{1}{2}p^\top M^{-1}p. $$

The dynamics are

$$ \frac{dx}{dt}=\frac{\partial H}{\partial p}, \qquad \frac{dp}{dt}=-\frac{\partial H}{\partial x}. $$

AD computes

$$ \nabla_x U(x), $$

which is the gradient of the negative log posterior.

Without AD, deriving gradients for large probabilistic models would be extremely tedious.

Variational Inference

Variational inference approximates the posterior using a simpler distribution

$$ q_\phi(x). $$

The objective is usually the evidence lower bound:

$$ \mathcal{L}(\phi) = \mathbb{E}{q\phi(x)} \left[ \log p(x,y)-\log q_\phi(x) \right]. $$

The goal is

$$ \max_\phi \mathcal{L}(\phi). $$

AD computes gradients of this objective with respect to variational parameters.

Reparameterization Trick

Direct differentiation through random sampling is difficult.

Suppose

$$ x \sim q_\phi(x). $$

Instead of sampling $x$ directly, we write

$$ x = g(\epsilon,\phi), \qquad \epsilon \sim p(\epsilon), $$

where the randomness is separated from the parameters.

For a Gaussian:

$$ x = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0,1). $$

Now expectations become

$$ \mathbb{E}{q\phi(x)}[f(x)] = \mathbb{E}_{p(\epsilon)} [f(g(\epsilon,\phi))]. $$

AD can differentiate through $g$.

This produces lower-variance gradient estimators than score-function methods.

Score-Function Estimators

Some distributions cannot be reparameterized easily.

In that case, we use the identity

$$ \nabla_\phi \mathbb{E}{q\phi(x)}[f(x)] = \mathbb{E}{q\phi(x)} \left[ f(x)\nabla_\phi \log q_\phi(x) \right]. $$

This is the score-function estimator, also called REINFORCE.

It works for discrete variables and non-reparameterizable distributions, but it often has high variance.

Method Strength Weakness
Reparameterization Low variance Requires differentiable transform
Score-function estimator General High variance
Hybrid methods Flexible More complex

AD computes the deterministic derivatives inside these estimators.

Stochastic Computation Graphs

Probabilistic programs are stochastic computation graphs.

Nodes may represent:

Node type Meaning
Deterministic node Ordinary differentiable computation
Random node Sampling operation
Observation node Conditioning on data
Objective node Log probability or reward

Gradient propagation depends on node type.

Deterministic paths use ordinary chain rule differentiation. Random paths require estimator identities.

Discrete Random Variables

Discrete latent variables create differentiation problems because sampling is discontinuous.

Example:

$$ z \sim \operatorname{Categorical}(\pi). $$

The sampled value changes discontinuously as $\pi$ changes.

Approaches include:

Method Idea
Score-function estimator Differentiate probability, not sample
Gumbel-softmax Continuous relaxation
Marginalization Sum exactly over states
Straight-through estimator Approximate backward pass

The Gumbel-softmax relaxation replaces hard categories with differentiable soft assignments:

$$ y_i = \frac{ \exp((\log \pi_i + g_i)/\tau) }{ \sum_j \exp((\log \pi_j + g_j)/\tau) }. $$

As temperature $\tau$ decreases, the distribution becomes more discrete.

Probabilistic Graphical Models

Many probabilistic systems are structured graphical models.

Examples include:

  • Bayesian networks,
  • hidden Markov models,
  • factor graphs,
  • state-space models,
  • probabilistic circuits.

Inference often reduces to repeated local message computations.

AD is useful for:

  • learning parameters,
  • differentiable message passing,
  • variational optimization,
  • differentiable filtering and smoothing.

State-Space Models

A probabilistic state-space model evolves latent states:

$$ x_{t+1}\sim p(x_{t+1}\mid x_t), $$

$$ y_t\sim p(y_t\mid x_t). $$

Inference computes posterior trajectories:

$$ p(x_{0:T}\mid y_{0:T}). $$

Differentiable filtering and smoothing algorithms allow gradient-based learning of:

  • transition dynamics,
  • observation models,
  • noise covariances,
  • control parameters.

Kalman filters are differentiable under standard linear algebra assumptions. Particle filters are more difficult because resampling is discrete.

Particle Filters

Particle filters approximate distributions using weighted samples.

A resampling step selects particles according to weights:

$$ w_i \propto p(y_t \mid x_i). $$

Resampling is discrete and therefore non-differentiable in the usual sense.

Differentiable particle filtering methods use:

Method Idea
Soft resampling Smooth weight redistribution
Relaxed categorical sampling Continuous approximation
Gradient estimators Differentiate expectation
Deterministic transport Replace stochastic resampling

This remains an active research area.

Bayesian Neural Networks

A Bayesian neural network places distributions over weights:

$$ w \sim p(w). $$

Predictions marginalize over weights:

$$ p(y \mid x) = \int p(y \mid x,w)p(w),dw. $$

Variational inference approximates the posterior over weights.

AD computes gradients through:

  • neural network computations,
  • variational objectives,
  • stochastic parameter samples.

This combines probabilistic inference with deep learning infrastructure.

Normalizing Flows

A normalizing flow transforms a simple distribution into a complex one using invertible mappings.

Suppose

$$ z_0 \sim p_0(z), $$

and

$$ z_K = f_K \circ \cdots \circ f_1(z_0). $$

The density becomes

$$ \log p(z_K) = \log p_0(z_0) - \sum_k \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|. $$

AD computes:

  • Jacobian determinants,
  • parameter gradients,
  • inverse transform sensitivities.

Flows are fundamentally differentiable probabilistic systems.

Probabilistic Programs with Control Flow

Probabilistic programs may contain loops, branching, and recursion.

Example:

sample latent state
if condition:
    sample another variable
else:
    apply deterministic transform
accumulate log probability

Control flow creates challenges:

Issue Consequence
Dynamic graph structure Variable execution paths
Discrete branching Piecewise derivatives
Variable-dimensional latent space Complex gradient semantics
Recursive stochastic structure Dynamic memory requirements

Modern probabilistic programming systems often use tracing or graph capture to represent execution.

Log-Density Stability

Probabilistic inference is numerically sensitive.

Probabilities may underflow:

$$ p(x)\approx 10^{-300}. $$

Therefore computations are usually performed in log space.

AD systems must preserve stable numerical forms such as:

$$ \log \sum_i e^{x_i}. $$

Direct exponentiation and summation can produce catastrophic overflow or underflow.

Stable primitives should expose stable derivatives.

Automatic Differentiation Variational Inference

Automatic differentiation variational inference, or ADVI, treats inference generically:

  1. transform latent variables to unconstrained space,
  2. define variational distribution,
  3. estimate ELBO gradients,
  4. optimize with stochastic gradient methods.

The important idea is that inference becomes a differentiable optimization problem.

The probabilistic model no longer needs manually derived updates.

Constraints and Transformations

Latent variables may have constraints:

Variable type Constraint
Variance positive
Probability vector sums to one
Covariance matrix positive definite
Correlation bounded

Inference systems transform constrained variables into unconstrained coordinates.

Examples:

$$ x = \exp(z), $$

$$ p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}. $$

AD differentiates through these transforms automatically.

Sparse and Structured Probabilistic Systems

Large probabilistic systems often have sparse dependence graphs.

Examples:

  • sparse Gaussian processes,
  • factor graphs,
  • probabilistic graphical models,
  • structured variational families.

Efficient differentiation exploits this structure.

Dense Jacobian construction is usually infeasible.

Failure Modes

Differentiable probabilistic systems fail in characteristic ways.

Failure mode Cause
Exploding gradients Sharp posterior geometry
Vanishing gradients Saturated likelihoods
High-variance estimators Stochastic gradient noise
Unstable ELBO optimization Poor variational family
NaNs in log probability Underflow or invalid parameters
Broken inference Discrete non-differentiable operations
Memory explosion Reverse mode through long stochastic traces

Gradient correctness alone does not guarantee good inference behavior.

Practical Architecture

A robust differentiable probabilistic system typically separates:

Layer Responsibility
Random variable representation Sampling and constraints
Deterministic transforms Differentiable computation
Log-density accumulation Stable probability evaluation
Inference objective ELBO, posterior log density, likelihood
Gradient estimator Reparameterization or score-function
Optimizer or sampler Variational optimization or MCMC

Each layer needs carefully defined derivative semantics.

Summary

Probabilistic programming turns inference into a differentiable computation problem. Automatic differentiation supplies gradients for posterior densities, variational objectives, stochastic simulators, and probabilistic transformations.

The main challenges are stochastic sampling, discrete variables, estimator variance, dynamic control flow, and numerical stability in probability computations. Effective systems combine AD with reparameterization, stable log-density algebra, structured inference algorithms, and carefully designed gradient estimators.