Inverse Problems

An inverse problem asks for causes from effects. A forward model predicts observations from parameters. An inverse model tries to recover parameters from observations.

The usual form is

$$ y = F(\theta), $$

where $\theta$ is an unknown parameter vector and $y$ is the model output. In practice, we observe noisy data

$$ z \approx F(\theta). $$

The inverse problem is to estimate $\theta$ from $z$.

Examples include seismic imaging, medical tomography, material parameter estimation, source reconstruction, system identification, and calibration of physical simulations.

Forward and Inverse Maps

The forward map is usually well-defined:

$$ \theta \mapsto F(\theta). $$

The inverse map may be unstable, non-unique, or only partially defined.

Problem	Forward direction	Inverse direction
Heat equation	Initial temperature gives later temperature	Recover initial temperature from later temperature
CT scan	Tissue density gives projections	Recover density from projections
Seismic imaging	Earth model gives waveforms	Recover subsurface structure
Material fitting	Material parameters give deformation	Recover parameters from measured deformation

Automatic differentiation is useful because inverse problems are often solved by optimization. The negative gradient of the mismatch between simulated and observed data gives a local descent direction for improving the parameter estimate.

Least-Squares Formulation

A common formulation defines a residual

$$ r(\theta) = F(\theta) - z. $$

The loss is

$$ L(\theta) = \frac{1}{2}\|r(\theta)\|^2. $$

The gradient is

$$ \nabla_\theta L = J(\theta)^\top r(\theta), $$

where

$$ J(\theta)=\frac{\partial F}{\partial \theta}. $$

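This equation explains why reverse-mode AD is central. We usually do not need the full Jacobian, only the product $J^\top r$. This is a vector-Jacobian product, which reverse mode evaluates at a cost of a small constant multiple of one evaluation of $F$, independent of the number of parameters.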
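As a minimal sketch of this formula, the NumPy snippet below uses a hypothetical two-parameter model $F(\theta)_i = \sin(\theta_0 t_i) + \theta_1 t_i^2$. The model, the sample times `t`, and all function names are assumptions made for illustration. The product $J^\top r$ is written by hand, standing in for what reverse-mode AD would produce, and is compared with central finite differences.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 15)  # hypothetical sample times

def forward(theta):
    # Illustrative forward model F(theta); not taken from the text.
    return np.sin(theta[0] * t) + theta[1] * t**2

def loss(theta, z):
    r = forward(theta) - z
    return 0.5 * r @ r

def grad_via_vjp(theta, z):
    # J^T r from the analytic columns of J (small enough to write by hand).
    r = forward(theta) - z
    col0 = t * np.cos(theta[0] * t)   # dF/dtheta_0
    col1 = t**2                       # dF/dtheta_1
    return np.array([col0 @ r, col1 @ r])

def grad_fd(theta, z, eps=1e-6):
    # Central finite differences, used only as an independent check.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (loss(theta + step, z) - loss(theta - step, z)) / (2 * eps)
    return g

z = forward(np.array([2.0, 0.5]))   # synthetic noise-free data
theta = np.array([1.5, 0.8])        # current parameter estimate
g_vjp = grad_via_vjp(theta, z)
g_fd = grad_fd(theta, z)
```

The two gradients should agree up to finite-difference error. With an AD system, `grad_via_vjp` would not need to be derived by hand.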

Ill-Posedness

Many inverse problems are ill-posed. Small errors in the data can cause large errors in the recovered parameters.

A well-posed problem should have:

Property	Meaning
Existence	A solution exists
Uniqueness	The solution is determined by the data
Stability	Small data changes cause small solution changes

Inverse problems often violate uniqueness or stability. For example, many parameter settings may produce almost identical observations.

Automatic differentiation gives accurate gradients, but accurate gradients do not remove ill-posedness. The model, data, and objective must still be designed carefully.
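The instability can be made concrete with a small linear example. The sketch below assumes a forward map given by an 8-by-8 Hilbert matrix, a standard ill-conditioned test matrix chosen here only for illustration, and perturbs the data along the direction the inverse amplifies most.

```python
import numpy as np

n = 8
# Hilbert matrix: an illustrative ill-conditioned linear forward map.
idx = np.arange(n)
A = 1.0 / (idx[:, None] + idx[None, :] + 1.0)

theta_true = np.ones(n)
z = A @ theta_true

# Perturb the data along the left singular vector of the smallest singular value.
U, s, Vt = np.linalg.svd(A)
delta = 1e-8 * U[:, -1]

theta_hat = np.linalg.solve(A, z + delta)

data_error = np.linalg.norm(delta)
param_error = np.linalg.norm(theta_hat - theta_true)
amplification = param_error / data_error   # roughly 1 / s[-1]
```

The inversion itself is computed accurately. The large error in `theta_hat` comes from the conditioning of the forward map, not from the derivative or solver code.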

Regularization

Regularization adds prior structure to the solution. Instead of minimizing only data mismatch,

$$ \frac{1}{2}\|F(\theta)-z\|^2, $$

we minimize

$$ L(\theta) = \frac{1}{2}\|F(\theta)-z\|^2 + \lambda R(\theta). $$

Here $R(\theta)$ penalizes undesirable solutions, and $\lambda$ controls the strength of the penalty.

Common regularizers include:

Regularizer	Effect
$\|\theta\|^2$	Prefers small parameters
$\|\nabla \theta\|^2$	Prefers smooth fields
$\|\theta\|_1$	Encourages sparsity
Total variation	Preserves edges while reducing noise
Physics constraints	Enforces known conservation laws

AD computes gradients for both the forward mismatch and the regularization term, provided both are implemented as differentiable programs.
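For a linear forward map and $R(\theta)=\|\theta\|^2$, the regularized objective has a closed-form minimizer, which makes the stabilizing effect easy to see. The following is a sketch under assumed values: the Hilbert-matrix forward map, the noise level, and the hand-picked weight `lam` are all illustrative choices.

```python
import numpy as np

n = 8
idx = np.arange(n)
A = 1.0 / (idx[:, None] + idx[None, :] + 1.0)   # ill-conditioned Hilbert matrix

rng = np.random.default_rng(0)
theta_true = np.ones(n)
z = A @ theta_true + 1e-6 * rng.standard_normal(n)   # noisy observations

def ridge_solution(A, z, lam):
    # Minimizer of 0.5*||A theta - z||^2 + lam*||theta||^2.
    # Setting the gradient to zero gives (A^T A + 2*lam*I) theta = A^T z.
    return np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(A.shape[1]), A.T @ z)

theta_plain = np.linalg.solve(A, z)          # no regularization
theta_reg = ridge_solution(A, z, lam=1e-8)   # lam is a hand-picked illustrative value

err_plain = np.linalg.norm(theta_plain - theta_true)
err_reg = np.linalg.norm(theta_reg - theta_true)
```

The regularized estimate trades a small bias for a much smaller sensitivity to noise. In practice $\lambda$ is chosen with a criterion such as cross-validation rather than by hand.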

Adjoint Methods

Inverse problems often have many parameters but only a single scalar objective. Reverse-mode AD and adjoint methods are therefore natural.

Suppose the forward model is defined by a differential equation:

$$ G(u,\theta)=0, $$

where $u$ is the state and $\theta$ is the parameter. The loss is

$$ L(u,\theta). $$

Direct differentiation gives

$$ G_u \frac{du}{d\theta} + G_\theta = 0. $$

$$ \frac{du}{d\theta} = -G_u^{-1}G_\theta. $$

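Substituting into the total derivative $\frac{dL}{d\theta} = L_\theta + L_u \frac{du}{d\theta}$ would require one linear solve with $G_u$ per parameter, one for each column of $G_\theta$. This is too expensive when $\theta$ is high-dimensional.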

The adjoint method avoids this. Define an adjoint variable $\lambda$ (not to be confused with the regularization weight used earlier) by

$$ G_u^\top \lambda = L_u^\top. $$

Then the gradient is

$$ \nabla_\theta L = L_\theta^\top - G_\theta^\top \lambda. $$

This requires one adjoint solve per scalar loss, rather than one forward sensitivity solve per parameter.
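The equivalence of the two routes can be checked numerically. The sketch below assumes a small nonlinear state equation $G(u,\theta) = Ku + \theta \odot u^3 - b = 0$ and the loss $L = \frac{1}{2}\|u - z\|^2$. The matrix `K`, the vectors `b` and `z`, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

n = 5
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # symmetric positive definite
b = 0.1 * np.ones(n)
z = np.linspace(0.1, 0.3, n)                              # hypothetical observations

def G(u, theta):
    # Illustrative state equation G(u, theta) = K u + theta * u^3 - b.
    return K @ u + theta * u**3 - b

def solve_state(theta, iters=30):
    # Newton's method on G(u, theta) = 0.
    u = np.zeros(n)
    for _ in range(iters):
        G_u = K + np.diag(3.0 * theta * u**2)
        u = u - np.linalg.solve(G_u, G(u, theta))
    return u

theta = np.array([0.5, 1.0, 1.5, 1.0, 0.5])
u = solve_state(theta)

G_u = K + np.diag(3.0 * theta * u**2)   # dG/du at the solution
G_theta = np.diag(u**3)                 # dG/dtheta at the solution
L_u = u - z                             # dL/du as a column vector; L_theta = 0 here

# Direct sensitivities: one right-hand side per parameter (per column of G_theta).
du_dtheta = -np.linalg.solve(G_u, G_theta)
grad_direct = du_dtheta.T @ L_u

# Adjoint: a single transposed solve, independent of the number of parameters.
lam = np.linalg.solve(G_u.T, L_u)
grad_adjoint = -G_theta.T @ lam
```

Both gradients agree to rounding error. The difference is cost: the direct route scales with the number of parameters, while the adjoint route needs one extra linear solve.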

Discrete Inverse Problems

Many inverse problems are solved after discretization. The state becomes a vector $u$, the parameters become a vector $\theta$, and the governing equation becomes a finite-dimensional system.

For example:

$$ A(\theta)u = b. $$

The observation model might be

$$ F(\theta)=Hu, $$

where $H$ selects measured components. The loss is

$$ L(\theta) = \frac{1}{2}\|Hu-z\|^2. $$

AD can differentiate the full computational path:

$$ \theta \to A(\theta) \to u=A(\theta)^{-1}b \to Hu \to L. $$

For efficiency, the linear solve should have a custom derivative rule. Reverse mode uses a transpose solve rather than differentiating through every iteration of an iterative solver.
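A hand-written version of such a rule might look like the sketch below. It assumes the hypothetical parameterization $A(\theta) = K + \operatorname{diag}(\theta)$, so that $\partial A / \partial \theta_i = e_i e_i^\top$. The matrices `K` and `H` and the data `z` are made up for illustration, and the result is checked against finite differences.

```python
import numpy as np

n = 6
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
H = np.eye(n)[[1, 4]]            # observe components 1 and 4 only
z = np.array([0.9, 1.1])         # hypothetical measurements

def loss(theta):
    u = np.linalg.solve(K + np.diag(theta), b)
    misfit = H @ u - z
    return 0.5 * misfit @ misfit

def loss_grad(theta):
    A = K + np.diag(theta)
    u = np.linalg.solve(A, b)                  # forward solve
    misfit = H @ u - z
    lam = np.linalg.solve(A.T, H.T @ misfit)   # transpose (adjoint) solve
    # dA/dtheta_i = e_i e_i^T, so dL/dtheta_i = -lam_i * u_i.
    return -lam * u

theta = np.full(n, 0.5)
g = loss_grad(theta)

# Central finite differences as an independent check.
eps = 1e-6
g_fd = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

The gradient costs one forward solve and one transpose solve, whatever method is used internally to solve the linear systems.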

Differentiating Through Solvers

Inverse problems often contain numerical solvers:

Solver type	Example
Linear solver	$Ax=b$
Nonlinear solver	$F(x,\theta)=0$
ODE solver	Time integration
PDE solver	Finite element simulation
Optimization solver	Inner minimization

There are two main differentiation strategies.

The first is unrolled differentiation. We differentiate through every solver iteration. This is simple and matches the implemented computation, but it can be memory-heavy and sensitive to iteration count.

The second is implicit differentiation. We differentiate the equation solved at convergence. This is often cleaner and cheaper, but it assumes the solver reached a meaningful fixed point.

Strategy	Differentiates	Advantage	Cost
Unrolled AD	Actual iterations	Exact for the executed program	High memory for many iterations
Implicit AD	Converged equation	Avoids long tapes	Requires linearized solve
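The two strategies can be compared on a scalar fixed-point iteration. The sketch below assumes the hypothetical update $x_{k+1} = g(x_k,\theta) = \tfrac{1}{2}\cos(x_k) + \theta$, which is a contraction. The unrolled derivative is propagated by hand alongside the iterates, standing in for AD through the loop.

```python
import math

def g(x, theta):
    # Illustrative contraction; its fixed point x* depends on theta.
    return 0.5 * math.cos(x) + theta

def unrolled(theta, iters):
    # Differentiate through every iteration: dx_{k+1} = g_x(x_k) * dx_k + g_theta.
    x, dx = 0.0, 0.0
    for _ in range(iters):
        dx = -0.5 * math.sin(x) * dx + 1.0   # uses x_k, so update dx before x
        x = g(x, theta)
    return x, dx

def implicit_derivative(x_star):
    # Differentiate x* = g(x*, theta): dx*/dtheta = g_theta / (1 - g_x).
    return 1.0 / (1.0 + 0.5 * math.sin(x_star))

theta = 0.3
x_few, dx_few = unrolled(theta, 2)       # solver stopped early
x_many, dx_many = unrolled(theta, 60)    # solver run to convergence
dx_implicit = implicit_derivative(x_many)
```

After many iterations the unrolled and implicit derivatives coincide. After only two iterations the unrolled derivative is exact for the truncated program but differs noticeably from the derivative of the true fixed point.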

Identifiability

Identifiability asks whether the parameters can be determined from the observations.

If two different parameters produce the same output,

$$ F(\theta_1)=F(\theta_2), $$

then the inverse problem cannot distinguish them.

Local identifiability is related to the rank of the Jacobian:

$$ J = \frac{\partial F}{\partial \theta}. $$

If $J$ has deficient rank, there are parameter directions that do not change the observations to first order.

These directions are called null directions:

$$ Jv = 0. $$

Moving along such a direction leaves the output locally unchanged.

AD helps identify these directions through Jacobian-vector products, vector-Jacobian products, and approximate Hessian methods.
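As a small illustration, consider a hypothetical forward model that depends on two parameters only through their product, so the individual values cannot be identified. The sketch below builds the Jacobian by central finite differences (a stand-in for AD) and reads the null direction from its singular value decomposition.

```python
import numpy as np

def forward(theta):
    # Illustrative model: the output depends only on the product theta_0 * theta_1.
    p = theta[0] * theta[1]
    return np.array([p, p**2, np.sin(p)])

def jacobian_fd(f, theta, eps=1e-6):
    # Central finite-difference Jacobian, one column per parameter.
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((f(theta + step) - f(theta - step)) / (2 * eps))
    return np.stack(cols, axis=1)

theta = np.array([2.0, 3.0])
J = jacobian_fd(forward, theta)

U, s, Vt = np.linalg.svd(J)
null_dir = Vt[-1]       # right singular vector of the smallest singular value
strong_dir = Vt[0]      # direction the observations are most sensitive to

h = 1e-3
change_null = np.linalg.norm(forward(theta + h * null_dir) - forward(theta))
change_strong = np.linalg.norm(forward(theta + h * strong_dir) - forward(theta))
```

Here `s[1]` is numerically zero, and a step along `null_dir` changes the output only at second order in `h`. For large problems the full SVD would be replaced by iterative methods that need only products with $J$ and $J^\top$.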

Gauss-Newton Methods

For nonlinear least squares,

$$ L(\theta)=\frac{1}{2}\|r(\theta)\|^2, $$

the Hessian is

$$ \nabla^2 L = J^\top J + \sum_i r_i \nabla^2 r_i. $$

Gauss-Newton drops the second term:

$$ H_{\text{GN}} = J^\top J. $$

The update solves

$$ J^\top J \Delta \theta = -J^\top r. $$

This method is powerful when residuals are small or the model is close to linear near the solution.

AD supports Gauss-Newton without materializing $J$. We only need products:

$$ Jv $$

and

$$ J^\top w. $$

Forward mode gives $Jv$. Reverse mode gives $J^\top w$.
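A matrix-free Gauss-Newton iteration can be sketched as follows. The example assumes a hypothetical exponential-decay model $F(\theta)_i = a\,e^{-b t_i}$ with $\theta = (a, b)$. The functions `jvp` and `vjp` are written by hand and stand in for forward-mode and reverse-mode AD. Each step solves $J^\top J\,\Delta\theta = -J^\top r$ by conjugate gradients using only these two products. No damping or line search is included, so this is not a robust solver.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 20)   # hypothetical sample times

def model(theta):
    a, b = theta
    return a * np.exp(-b * t)

def jvp(theta, v):
    # J v, as forward mode would compute it.
    a, b = theta
    e = np.exp(-b * t)
    return e * v[0] - a * t * e * v[1]

def vjp(theta, w):
    # J^T w, as reverse mode would compute it.
    a, b = theta
    e = np.exp(-b * t)
    return np.array([e @ w, -(a * t * e) @ w])

def cg(matvec, rhs, iters=50, rtol=1e-12):
    # Conjugate gradients for a symmetric positive definite operator.
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    stop = rtol**2 * rs
    for _ in range(iters):
        if rs <= stop:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gauss_newton(theta, z, steps=20):
    for _ in range(steps):
        r = model(theta) - z
        g = vjp(theta, r)                                     # gradient J^T r
        step = cg(lambda v: vjp(theta, jvp(theta, v)), -g)    # (J^T J) step = -J^T r
        theta = theta + step
    return theta

theta_true = np.array([2.0, 1.5])
z = model(theta_true)                        # synthetic noise-free data
theta_hat = gauss_newton(np.array([1.0, 1.0]), z)
```

The Jacobian is never stored. Each conjugate-gradient iteration costs one `jvp` and one `vjp`, which is what makes the approach viable when $\theta$ has many components.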

Bayesian Inverse Problems

A Bayesian inverse problem treats parameters as random variables. Instead of returning one estimate, it returns a posterior distribution:

$$ p(\theta \mid z) \propto p(z \mid \theta)p(\theta). $$

The negative log posterior often becomes an optimization objective:

$$ L(\theta) = -\log p(z \mid \theta) - \log p(\theta). $$
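As a worked instance, assume (purely for illustration) Gaussian noise $z = F(\theta) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma^2 I)$ and a Gaussian prior $\theta \sim \mathcal{N}(0,\tau^2 I)$, where $\tau$ is a hypothetical prior scale. Then

$$ -\log p(z \mid \theta) - \log p(\theta) = \frac{1}{2\sigma^2}\|F(\theta)-z\|^2 + \frac{1}{2\tau^2}\|\theta\|^2 + \text{const}. $$

Multiplying by $\sigma^2$ and dropping the constant gives

$$ \frac{1}{2}\|F(\theta)-z\|^2 + \frac{\sigma^2}{2\tau^2}\|\theta\|^2, $$

which is a regularized least-squares objective with $R(\theta)=\|\theta\|^2$ and $\lambda = \sigma^2/(2\tau^2)$. Under these assumptions, the maximum a posteriori estimate coincides with the $\|\theta\|^2$-regularized solution.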

AD provides gradients for sampling and variational methods, including:

Method	Uses gradients for
Hamiltonian Monte Carlo	Simulating posterior dynamics
Langevin dynamics	Gradient-informed sampling
Variational inference	Optimizing approximate posterior
Laplace approximation	Computing local curvature

In this setting, derivatives support uncertainty quantification, not only point estimation.

Noise and Observation Models

The loss function should match the noise model.

If observation noise is Gaussian,

$$ z = F(\theta) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2 I), $$

then maximizing the likelihood is equivalent to minimizing the least-squares loss.

If noise is not Gaussian, another likelihood may be better.

Noise model	Typical loss
Gaussian	Squared error
Laplace	Absolute error
Poisson	Poisson negative log likelihood
Bernoulli	Cross entropy
Heavy-tailed	Robust losses

AD makes it easy to change the loss, but the statistical meaning changes with it.
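The effect of the noise model is visible even when estimating a single constant. The sketch below assumes the trivial forward model $F(\theta) = \theta$ and made-up data with one outlier, and minimizes the squared and absolute losses over a grid of candidate values.

```python
import numpy as np

z = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 10.0])   # last value is an outlier

# Candidate estimates of a single constant parameter theta, with F(theta) = theta.
grid = np.linspace(0.0, 10.0, 10001)
resid = grid[:, None] - z[None, :]

# Negative log likelihoods, up to scale and additive constants.
squared_loss = 0.5 * np.sum(resid**2, axis=1)    # Gaussian noise
absolute_loss = np.sum(np.abs(resid), axis=1)    # Laplace noise

theta_gauss = grid[np.argmin(squared_loss)]      # close to the sample mean
theta_laplace = grid[np.argmin(absolute_loss)]   # lies between the two middle samples
```

The squared loss is pulled toward the outlier because its minimizer is the mean, while the absolute loss recovers a median, which stays near the bulk of the data.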

Constraints

Many inverse problems have constraints:

$$ \theta \in C. $$

Examples include positivity, conservation laws, bounds, smoothness, monotonicity, and geometric feasibility.

Constraints can be handled by:

Method	Idea
Reparameterization	Write $\theta = g(\phi)$ so constraints hold automatically
Penalty methods	Add constraint violation to loss
Projected methods	Project updates back into feasible set
Barrier methods	Prevent crossing constraint boundaries
Constrained solvers	Solve KKT systems directly

AD supplies derivatives for the objective and constraint functions. The optimization algorithm must still enforce feasibility.
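Reparameterization is the simplest of these to sketch. The example below enforces positivity by writing $\theta = e^{\phi}$ and running plain gradient descent on $\phi$. The linear model $F(\theta) = \theta x$, the step size, and the iteration count are assumptions chosen for illustration. The chain-rule factor is written by hand where AD would supply it.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
theta_true = 0.3
z = theta_true * x                 # synthetic data from F(theta) = theta * x

def grad_theta(theta):
    # Gradient of 0.5*||theta*x - z||^2 with respect to theta.
    return x @ (theta * x - z)

phi = 0.0                          # unconstrained variable; theta = exp(phi) = 1 initially
lr = 0.1                           # hand-picked step size
for _ in range(500):
    theta = np.exp(phi)
    # Chain rule: dL/dphi = dL/dtheta * dtheta/dphi, with dtheta/dphi = exp(phi) = theta.
    phi -= lr * grad_theta(theta) * theta

theta_hat = np.exp(phi)
```

Every iterate is positive by construction, so the optimizer needs no projection step. The price is a change in the geometry of the problem, since the curvature in $\phi$ differs from the curvature in $\theta$.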

Failure Modes

Inverse problems fail in characteristic ways.

Failure mode	Cause
Non-unique solution	Insufficient observations
Unstable solution	Ill-conditioned forward map
Overfitting noise	Weak regularization
Biased estimate	Wrong model class
Bad gradient	Discontinuous solver logic
Slow convergence	Poor scaling or conditioning
False confidence	Ignored uncertainty

AD solves the derivative computation problem. It does not solve the modeling problem.

Practical Design Pattern

A practical differentiable inverse problem usually has this structure:

parameters theta
    -> constrained parameter transform
    -> physical or statistical forward model
    -> numerical solver
    -> observation operator
    -> residual
    -> regularized loss
    -> gradient by AD
    -> optimizer or sampler

Each stage should have clear derivative semantics. The most important design decision is where to use general AD and where to provide custom rules.

Good candidates for custom rules include:

Component	Reason
Linear solves	Use transpose solves
Nonlinear fixed points	Avoid differentiating many iterations
ODE/PDE solvers	Control memory and stability
Interpolation	Define consistent boundary behavior
Discontinuous events	Expose piecewise derivative semantics

Summary

Inverse problems recover hidden causes from observed effects. They are usually solved by minimizing a mismatch between simulated and observed data, often with regularization or Bayesian priors.

Automatic differentiation is central because it provides gradients, adjoints, Jacobian products, and Hessian approximations for complex forward models. However, inverse problems remain limited by identifiability, conditioning, noise, model error, and constraints.

The best AD implementations for inverse problems are solver-aware. They combine reverse mode, implicit differentiation, sparse linear algebra, and checkpointing rather than differentiating every low-level operation blindly.