Inverse Problems

An inverse problem asks for causes from effects. A forward model predicts observations from parameters. An inverse model tries to recover parameters from observations.

The usual form is

$$ y = F(\theta), $$

where $\theta$ is an unknown parameter vector and $y$ is the model output. In practice, we observe noisy data

$$ z \approx F(\theta). $$

The inverse problem is to estimate $\theta$ from $z$.

Examples include seismic imaging, medical tomography, material parameter estimation, source reconstruction, system identification, and calibration of physical simulations.

Forward and Inverse Maps

The forward map is usually well-defined:

$$ \theta \mapsto F(\theta). $$

The inverse map may be unstable, non-unique, or only partially defined.

| Problem | Forward direction | Inverse direction |
| --- | --- | --- |
| Heat equation | Initial temperature gives later temperature | Recover initial temperature from later temperature |
| CT scan | Tissue density gives projections | Recover density from projections |
| Seismic imaging | Earth model gives waveforms | Recover subsurface structure |
| Material fitting | Material parameters give deformation | Recover parameters from measured deformation |

Automatic differentiation is useful because inverse problems are often solved by optimization. The gradient of the mismatch between simulated and observed data gives the direction for improving the parameter estimate.

Least-Squares Formulation

A common formulation defines a residual

$$ r(\theta) = F(\theta) - z. $$

The loss is

$$ L(\theta) = \frac{1}{2}\lVert r(\theta)\rVert^2. $$

The gradient is

$$ \nabla_\theta L = J(\theta)^\top r(\theta), $$

where

$$ J(\theta)=\frac{\partial F}{\partial \theta}. $$

This equation explains why reverse-mode AD is central. We usually do not need the full Jacobian. We need the product $J^\top r$, which is a vector-Jacobian product.
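As a minimal sketch of the formula $\nabla_\theta L = J^\top r$, the example below uses a hypothetical two-parameter decay model $F(\theta) = a\,e^{-kt}$ and a hand-coded vector-Jacobian product standing in for what reverse-mode AD would compute. The model, the measurement times, and the function names are assumptions made for illustration. A finite-difference check confirms the result.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 20)           # hypothetical measurement times

def forward(theta):
    """Hypothetical forward model F(theta) = a * exp(-k * t)."""
    a, k = theta
    return a * np.exp(-k * t)

def vjp(theta, w):
    """Hand-coded J(theta)^T w; the Jacobian itself is never formed."""
    a, k = theta
    e = np.exp(-k * t)
    return np.array([e @ w, -a * (t * e) @ w])

def loss(theta, z):
    r = forward(theta) - z
    return 0.5 * r @ r

def grad(theta, z):
    r = forward(theta) - z               # residual r(theta)
    return vjp(theta, r)                 # gradient = J^T r

z = forward(np.array([2.0, 1.5]))        # synthetic noise-free data
theta = np.array([1.0, 1.0])
g = grad(theta, z)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
g_fd = np.array([
    (loss(theta + eps * e_i, z) - loss(theta - eps * e_i, z)) / (2 * eps)
    for e_i in np.eye(2)
])
print(g, g_fd)
```

The finite-difference check needs two loss evaluations per parameter, whereas the vector-Jacobian product costs roughly one extra pass regardless of the number of parameters.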

Ill-Posedness

Many inverse problems are ill-posed. Small errors in the data can cause large errors in the recovered parameters.

A well-posed problem should have:

| Property | Meaning |
| --- | --- |
| Existence | A solution exists |
| Uniqueness | The solution is determined by the data |
| Stability | Small data changes cause small solution changes |

Inverse problems often violate uniqueness or stability. For example, many parameter settings may produce almost identical observations.

Automatic differentiation gives accurate gradients, but accurate gradients do not remove ill-posedness. The model, data, and objective must still be designed carefully.
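The loss of stability can be seen in a small linear example. The sketch below assumes a Hilbert matrix as the forward operator, a standard ill-conditioned test case chosen here only for illustration, and compares the relative error in the data with the relative error in the recovered parameters.

```python
import numpy as np

n = 8
idx = np.arange(n)
# Hilbert matrix: an assumed, deliberately ill-conditioned forward operator.
A = 1.0 / (idx[:, None] + idx[None, :] + 1.0)

theta_true = np.ones(n)
z_clean = A @ theta_true

rng = np.random.default_rng(0)
noise = 1e-6 * rng.standard_normal(n)    # small measurement error
z_noisy = z_clean + noise

# Naive inversion of the noisy data.
theta_noisy = np.linalg.solve(A, z_noisy)

data_err = np.linalg.norm(noise) / np.linalg.norm(z_clean)
param_err = np.linalg.norm(theta_noisy - theta_true) / np.linalg.norm(theta_true)
amplification = param_err / data_err

print("condition number:", np.linalg.cond(A))
print("relative data error:", data_err)
print("relative parameter error:", param_err)
```

The exact gradient of the least-squares loss is available here, and the solve is exact up to rounding, yet the recovered parameters are far from the truth. The difficulty lies in the forward map, not in the derivative computation.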

Regularization

Regularization adds prior structure to the solution. Instead of minimizing only data mismatch,

$$ \frac{1}{2}\lVert F(\theta)-z\rVert^2, $$

we minimize

$$ L(\theta) = \frac{1}{2}\lVert F(\theta)-z\rVert^2 + \lambda R(\theta). $$

Here $R(\theta)$ penalizes undesirable solutions, and $\lambda$ controls the strength of the penalty.

Common regularizers include:

| Regularizer | Effect |
| --- | --- |
| $\lVert\theta\rVert^2$ | Prefers small parameters |
| $\lVert\nabla \theta\rVert^2$ | Prefers smooth fields |
| $\lVert\theta\rVert_1$ | Encourages sparsity |
| Total variation | Preserves edges while reducing noise |
| Physics constraints | Enforces known conservation laws |

AD computes gradients for both the forward mismatch and the regularization term, provided both are implemented as differentiable programs.
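A minimal sketch of the first regularizer in the table, assuming a linear forward model $F(\theta) = A\theta$ with a Hilbert matrix for $A$ and an untuned, hypothetical weight `lam`. For this quadratic loss the gradient is $A^\top(A\theta - z) + 2\lambda\theta$, so the minimizer solves a linear system.

```python
import numpy as np

n = 8
idx = np.arange(n)
A = 1.0 / (idx[:, None] + idx[None, :] + 1.0)   # Hilbert matrix (assumed test operator)
theta_true = np.ones(n)

rng = np.random.default_rng(0)
z = A @ theta_true + 1e-4 * rng.standard_normal(n)   # noisy observations

lam = 1e-6   # hypothetical regularization weight, not tuned

def grad(theta):
    """Gradient of 0.5 * ||A theta - z||^2 + lam * ||theta||^2."""
    return A.T @ (A @ theta - z) + 2.0 * lam * theta

# Setting the gradient to zero gives a linear system for this quadratic loss.
theta_reg = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ z)
theta_naive = np.linalg.solve(A, z)                   # no regularization

err_reg = np.linalg.norm(theta_reg - theta_true)
err_naive = np.linalg.norm(theta_naive - theta_true)
print("error without regularization:", err_naive)
print("error with regularization:   ", err_reg)
```

The regularized estimate is biased toward zero, but it avoids the noise amplification of the naive inverse. Choosing `lam` is a modeling decision that this sketch does not address.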

Adjoint Methods

Inverse problems often have many parameters and relatively few scalar objectives. Reverse-mode AD and adjoint methods are therefore natural.

Suppose the forward model is defined implicitly by a state equation, for example a discretized differential equation:

$$ G(u,\theta)=0, $$

where $u$ is the state and $\theta$ is the parameter. The loss is

$$ L(u,\theta). $$

Direct differentiation gives

$$ G_u \frac{du}{d\theta} + G_\theta = 0. $$

So

$$ \frac{du}{d\theta} = -G_u^{-1}G_\theta. $$

Substituting into the derivative of $L$ would require solving one linear system per parameter. This is too expensive when $\theta$ has many components.

The adjoint method avoids this. Define an adjoint variable $\lambda$ (not to be confused with the regularization weight $\lambda$ used earlier) by

$$ G_u^\top \lambda = L_u^\top. $$

Then the gradient is

$$ \nabla_\theta L = L_\theta - G_\theta^\top \lambda. $$

This requires one adjoint solve per scalar loss, rather than one forward sensitivity solve per parameter.
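The step from the sensitivity equation to the adjoint formula can be written out explicitly. Reading $L_u$ and $L_\theta$ as row vectors of partial derivatives, the chain rule gives

$$ \frac{dL}{d\theta} = L_\theta + L_u \frac{du}{d\theta} = L_\theta - L_u G_u^{-1} G_\theta = L_\theta - \lambda^\top G_\theta, $$

where the last step uses $\lambda^\top = L_u G_u^{-1}$, which is the transpose of the adjoint equation $G_u^\top \lambda = L_u^\top$. Transposing the result gives the gradient formula above, with $L_\theta$ then read as a column vector. The product $L_u G_u^{-1}$ is evaluated once, so the cost does not grow with the number of parameters.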

Discrete Inverse Problems

Many inverse problems are solved after discretization. The state becomes a vector $u$, the parameters become a vector $\theta$, and the governing equation becomes a finite-dimensional system.

For example:

$$ A(\theta)u = b. $$

The observation model might be

$$ F(\theta)=Hu, $$

where $H$ selects measured components. The loss is

$$ L(\theta) = \frac{1}{2}\lVert Hu-z\rVert^2. $$

AD can differentiate the full computational path:

$$ \theta \to A(\theta) \to u=A(\theta)^{-1}b \to Hu \to L. $$

For efficiency, the linear solve should have a custom derivative rule. With such a rule, reverse mode performs one solve with $A(\theta)^\top$ instead of differentiating through every iteration of an iterative solver.
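The transpose-solve rule can be sketched by hand for a small system. The example assumes $A(\theta) = K + \operatorname{diag}(\theta)$ with a hypothetical tridiagonal $K$, an observation operator $H$ that picks three components, and made-up measurements $z$. Since $\partial A / \partial \theta_i = e_i e_i^\top$, the gradient reduces to $-\lambda_i u_i$, where $\lambda$ solves $A^\top \lambda = H^\top (Hu - z)$.

```python
import numpy as np

n = 5
# Hypothetical stiffness-like matrix; A(theta) = K + diag(theta).
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
H = np.eye(n)[[0, 2, 4]]                 # observe components 0, 2 and 4
z = np.array([0.5, 0.8, 0.5])            # made-up measurements

def assemble(theta):
    return K + np.diag(theta)

def loss(theta):
    u = np.linalg.solve(assemble(theta), b)
    r = H @ u - z
    return 0.5 * r @ r

def loss_and_grad(theta):
    A = assemble(theta)
    u = np.linalg.solve(A, b)             # forward solve
    r = H @ u - z
    lam = np.linalg.solve(A.T, H.T @ r)   # single transpose (adjoint) solve
    # dA/dtheta_i = e_i e_i^T, so dL/dtheta_i = -lam_i * u_i.
    return 0.5 * r @ r, -lam * u

theta = np.full(n, 0.3)
value, g = loss_and_grad(theta)

# Finite-difference check: needs 2 * n extra forward solves.
eps = 1e-6
g_fd = np.array([
    (loss(theta + eps * e_i) - loss(theta - eps * e_i)) / (2 * eps)
    for e_i in np.eye(n)
])
print(g, g_fd)
```

The gradient costs one forward solve and one transpose solve, independent of the number of parameters. A dense direct solve is used here for brevity; an iterative or sparse solver would follow the same pattern.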

Differentiating Through Solvers

Inverse problems often contain numerical solvers:

| Solver type | Example |
| --- | --- |
| Linear solver | $Ax=b$ |
| Nonlinear solver | $F(x,\theta)=0$ |
| ODE solver | Time integration |
| PDE solver | Finite element simulation |
| Optimization solver | Inner minimization |

There are two main differentiation strategies.

The first is unrolled differentiation. We differentiate through every solver iteration. This is simple and matches the implemented computation, but it can be memory-heavy and sensitive to iteration count.

The second is implicit differentiation. We differentiate the equation solved at convergence. This is often cleaner and cheaper, but it assumes the solver reached a meaningful fixed point.

| Strategy | Differentiates | Advantage | Cost |
| --- | --- | --- | --- |
| Unrolled AD | Actual iterations | Exact for the executed program | High memory for many iterations |
| Implicit AD | Converged equation | Avoids long tapes | Requires linearized solve |
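The two strategies can be compared on a scalar fixed-point problem $x = g(x, \theta)$. The sketch below assumes a hypothetical contraction $g(x,\theta) = \tfrac{1}{2}\cos(\theta x)$ with hand-written partial derivatives. For simplicity the unrolled version carries a forward-mode tangent through each iteration; a reverse-mode tape would instead have to store every iterate.

```python
import math

def g(x, theta):
    """Hypothetical contraction map; the solver finds x = g(x, theta)."""
    return 0.5 * math.cos(theta * x)

def g_x(x, theta):
    return -0.5 * theta * math.sin(theta * x)

def g_theta(x, theta):
    return -0.5 * x * math.sin(theta * x)

def solve_unrolled(theta, iters):
    """Fixed-point iteration with dx/dtheta propagated through every step."""
    x, dx = 0.0, 0.0
    for _ in range(iters):
        # Chain rule applied to x_{k+1} = g(x_k, theta), evaluated at x_k.
        dx = g_x(x, theta) * dx + g_theta(x, theta)
        x = g(x, theta)
    return x, dx

def implicit_derivative(x_star, theta):
    """Differentiate the converged equation x = g(x, theta)."""
    return g_theta(x_star, theta) / (1.0 - g_x(x_star, theta))

theta = 1.0
x_star, dx_unrolled = solve_unrolled(theta, iters=60)
dx_implicit = implicit_derivative(x_star, theta)
_, dx_truncated = solve_unrolled(theta, iters=2)

print(dx_unrolled, dx_implicit, dx_truncated)
```

After enough iterations the two derivatives agree. With only two iterations the unrolled derivative is exact for the truncated program but differs from the derivative of the converged solution, which is the sensitivity to iteration count mentioned above.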

Identifiability

Identifiability asks whether the parameters can be determined from the observations.

If two different parameters produce the same output,

$$ F(\theta_1)=F(\theta_2), $$

then the inverse problem cannot distinguish them.

Local identifiability is related to the rank of the Jacobian:

$$ J = \frac{\partial F}{\partial \theta}. $$

If $J$ does not have full column rank, there are parameter directions that do not change the observations to first order.

These directions are called null directions:

$$ Jv = 0. $$

Moving along such a direction leaves the output locally unchanged.

AD helps identify these directions through Jacobian-vector products, vector-Jacobian products, and approximate Hessian methods.
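A minimal sketch of a null direction, assuming a hypothetical model in which only the product $\theta_0 \theta_1$ affects the output. The Jacobian is written by hand and is small enough to form explicitly; its singular value decomposition exposes the direction that the data cannot see.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 6)

def forward(theta):
    """Hypothetical model: only the product theta[0] * theta[1] matters."""
    return theta[0] * theta[1] * t

def jacobian(theta):
    # Columns are dF/dtheta_0 and dF/dtheta_1.
    return np.column_stack([theta[1] * t, theta[0] * t])

theta = np.array([2.0, 3.0])
J = jacobian(theta)

# Right singular vectors with (near-)zero singular values are null directions.
_, s, Vt = np.linalg.svd(J)
v_null = Vt[-1]
v_seen = Vt[0]

step = 1e-3
change_null = np.linalg.norm(forward(theta + step * v_null) - forward(theta))
change_seen = np.linalg.norm(forward(theta + step * v_seen) - forward(theta))

print("singular values:", s)
print("output change along null direction:", change_null)
print("output change along identifiable direction:", change_seen)
```

The output change along the null direction is second order in the step size, while the change along the other singular vector is first order. For large problems the full decomposition is replaced by iterative methods that only need products with $J$ and $J^\top$.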

Gauss-Newton Methods

For nonlinear least squares,

$$ L(\theta)=\frac{1}{2}\lVert r(\theta)\rVert^2, $$

the Hessian is

$$ \nabla^2 L = J^\top J + \sum_i r_i \nabla^2 r_i. $$

Gauss-Newton drops the second term:

$$ H_{\text{GN}} = J^\top J. $$

The update solves

$$ J^\top J \Delta \theta = -J^\top r. $$

This method is powerful when residuals are small or the model is close to linear near the solution.

AD supports Gauss-Newton without materializing $J$. We only need products:

$$ Jv $$

and

$$ J^\top w. $$

Forward mode gives $Jv$. Reverse mode gives $J^\top w$.
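A matrix-free Gauss-Newton iteration can be sketched with these two products alone. The example assumes a hypothetical decay model $F(\theta) = a\,e^{-kt}$, hand-coded `jvp` and `vjp` functions standing in for forward-mode and reverse-mode AD, and a plain conjugate gradient loop for the normal equations. No damping or line search is included, so it relies on a starting point near the solution.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 20)

def forward(theta):
    a, k = theta
    return a * np.exp(-k * t)

def jvp(theta, v):
    """J v: what forward mode would return (hand-coded here)."""
    a, k = theta
    e = np.exp(-k * t)
    return e * v[0] - a * t * e * v[1]

def vjp(theta, w):
    """J^T w: what reverse mode would return (hand-coded here)."""
    a, k = theta
    e = np.exp(-k * t)
    return np.array([e @ w, -a * (t * e) @ w])

def conjugate_gradient(matvec, rhs, iters=10, rtol=1e-24):
    """Solve matvec(x) = rhs for a symmetric positive definite operator."""
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs <= rtol * (rhs @ rhs):      # also stops immediately when rhs is zero
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gauss_newton(theta, z, steps=10):
    for _ in range(steps):
        r = forward(theta) - z
        normal_op = lambda v, th=theta: vjp(th, jvp(th, v))   # v -> J^T J v
        delta = conjugate_gradient(normal_op, -vjp(theta, r))
        theta = theta + delta
    return theta

z = forward(np.array([2.0, 1.5]))         # synthetic noise-free data
theta_hat = gauss_newton(np.array([1.8, 1.2]), z)
print(theta_hat)
```

Each conjugate gradient iteration costs one $Jv$ and one $J^\top w$ product, so the Jacobian is never stored. With noisy data or a poor starting point, a damped variant such as Levenberg-Marquardt would be a safer choice.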

Bayesian Inverse Problems

A Bayesian inverse problem treats parameters as random variables. Instead of returning one estimate, it returns a posterior distribution:

$$ p(\theta \mid z) \propto p(z \mid \theta)p(\theta). $$

The negative log posterior often becomes an optimization objective:

$$ L(\theta) = -\log p(z \mid \theta) - \log p(\theta). $$

AD provides gradients for sampling and variational methods, including:

| Method | Uses gradients for |
| --- | --- |
| Hamiltonian Monte Carlo | Simulating posterior dynamics |
| Langevin dynamics | Gradient-informed sampling |
| Variational inference | Optimizing approximate posterior |
| Laplace approximation | Computing local curvature |

In this setting, derivatives support uncertainty quantification, not only point estimation.
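As a minimal sketch of the Laplace approximation, consider a linear forward model with Gaussian noise and a Gaussian prior, where the approximation happens to be exact. The matrix `A`, the noise level `sigma`, and the prior width `tau` are assumptions chosen for illustration, and the gradient and Hessian are written by hand in place of AD.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3))          # hypothetical linear forward map
theta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.1, 10.0                    # assumed noise and prior standard deviations
z = A @ theta_true + sigma * rng.standard_normal(30)

def neg_log_posterior(theta):
    misfit = A @ theta - z
    return 0.5 * misfit @ misfit / sigma**2 + 0.5 * theta @ theta / tau**2

def grad(theta):
    return A.T @ (A @ theta - z) / sigma**2 + theta / tau**2

# The Hessian is constant for this linear-Gaussian model.
hessian = A.T @ A / sigma**2 + np.eye(3) / tau**2

# MAP estimate: the gradient vanishes, which here is a linear system.
theta_map = np.linalg.solve(hessian, A.T @ z / sigma**2)

# Laplace approximation: Gaussian centered at the MAP, covariance = inverse Hessian.
covariance = np.linalg.inv(hessian)
posterior_std = np.sqrt(np.diag(covariance))

print("MAP estimate:", theta_map)
print("posterior standard deviations:", posterior_std)
```

For a nonlinear forward model the MAP point would come from an iterative optimizer, and the Hessian, or its Gauss-Newton approximation, would be evaluated at that point.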

Noise and Observation Models

The loss function should match the noise model.

If observation noise is Gaussian,

$$ z = F(\theta) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2 I), $$

then least squares is natural.

If noise is not Gaussian, another likelihood may be better.

| Noise model | Typical loss |
| --- | --- |
| Gaussian | Squared error |
| Laplace | Absolute error |
| Poisson | Poisson negative log likelihood |
| Bernoulli | Cross entropy |
| Heavy-tailed | Robust losses |

AD makes it easy to change the loss, but the statistical meaning changes with it.
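The effect of the noise model can be seen with the simplest possible forward model, a single constant $F(\theta) = \theta$ measured several times. The data below are made up and contain one outlier. The squared loss is minimized by the mean and the absolute loss by the median, so the two likelihoods give different estimates from the same data.

```python
import numpy as np

# Repeated measurements of one constant; the last value is an outlier (made-up data).
z = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 10.0])

def squared_loss(theta):
    # Gaussian negative log likelihood, up to constants.
    return 0.5 * np.sum((theta - z) ** 2)

def absolute_loss(theta):
    # Laplace negative log likelihood, up to constants.
    return np.sum(np.abs(theta - z))

# For the constant model the minimizers are known in closed form.
theta_gaussian = np.mean(z)       # minimizes the squared loss
theta_laplace = np.median(z)      # minimizes the absolute loss

print("Gaussian estimate:", theta_gaussian)
print("Laplace estimate: ", theta_laplace)
```

The Gaussian estimate is pulled toward the outlier, while the Laplace estimate stays near the bulk of the data. Neither is correct in general; the choice encodes an assumption about how the measurement errors are distributed.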

Constraints

Many inverse problems have constraints:

$$ \theta \in C. $$

Examples include positivity, conservation laws, bounds, smoothness, monotonicity, and geometric feasibility.

Constraints can be handled by:

| Method | Idea |
| --- | --- |
| Reparameterization | Write $\theta = g(\phi)$ so constraints hold automatically |
| Penalty methods | Add constraint violation to loss |
| Projected methods | Project updates back into feasible set |
| Barrier methods | Prevent crossing constraint boundaries |
| Constrained solvers | Solve KKT systems directly |

AD supplies derivatives for the objective and constraint functions. The optimization algorithm must still enforce feasibility.
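Reparameterization is the simplest of these to sketch. The example assumes a positivity constraint $\theta > 0$, the transform $\theta = e^{\phi}$, and a hypothetical scalar model $F(\theta) = \theta t$. Gradient descent runs on the unconstrained variable $\phi$, and the chain rule supplies $dL/d\phi = (dL/d\theta)\,\theta$. The step size is a hand-picked assumption.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 11)
z = 2.0 * t                               # synthetic data generated with theta = 2

def dloss_dtheta(theta):
    """Derivative of 0.5 * ||theta * t - z||^2 with respect to theta."""
    return t @ (theta * t - z)

phi = 0.0                                 # unconstrained variable, theta = exp(phi)
learning_rate = 0.05                      # hand-picked for this example
for _ in range(200):
    theta = np.exp(phi)                   # positive by construction
    dloss_dphi = dloss_dtheta(theta) * theta   # chain rule: dtheta/dphi = theta
    phi -= learning_rate * dloss_dphi

theta_hat = np.exp(phi)
print(theta_hat)
```

The optimizer never sees the constraint, so any unconstrained method can be used. The transform does change the conditioning of the problem, and a solution on the boundary $\theta = 0$ can only be approached, not reached.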

Failure Modes

Inverse problems fail in characteristic ways.

| Failure mode | Cause |
| --- | --- |
| Non-unique solution | Insufficient observations |
| Unstable solution | Ill-conditioned forward map |
| Overfitting noise | Weak regularization |
| Biased estimate | Wrong model class |
| Bad gradient | Discontinuous solver logic |
| Slow convergence | Poor scaling or conditioning |
| False confidence | Ignored uncertainty |

AD solves the derivative computation problem. It does not solve the modeling problem.

Practical Design Pattern

A practical differentiable inverse problem usually has this structure:

parameters theta
    -> constrained parameter transform
    -> physical or statistical forward model
    -> numerical solver
    -> observation operator
    -> residual
    -> regularized loss
    -> gradient by AD
    -> optimizer or sampler

Each stage should have clear derivative semantics. The most important design decision is where to use general AD and where to provide custom rules.

Good candidates for custom rules include:

| Component | Reason |
| --- | --- |
| Linear solves | Use transpose solves |
| Nonlinear fixed points | Avoid differentiating many iterations |
| ODE/PDE solvers | Control memory and stability |
| Interpolation | Define consistent boundary behavior |
| Discontinuous events | Expose piecewise derivative semantics |

Summary

Inverse problems recover hidden causes from observed effects. They are usually solved by minimizing a mismatch between simulated and observed data, often with regularization or Bayesian priors.

Automatic differentiation is central because it provides gradients, adjoints, Jacobian products, and Hessian approximations for complex forward models. However, inverse problems remain limited by identifiability, conditioning, noise, model error, and constraints.

The best AD implementations for inverse problems are solver-aware. They combine reverse mode, implicit differentiation, sparse linear algebra, and checkpointing rather than differentiating every low-level operation blindly.