Summary and Further Reading

Probabilistic deep learning extends neural networks with explicit probability models. Instead of producing only point estimates, a probabilistic model represents uncertainty, likelihood, latent structure, or a posterior distribution over parameters.

This chapter covered five core ideas.

First, Bayesian neural networks treat weights as random variables. A prior describes plausible parameters before observing data. The posterior updates this belief after seeing data. Prediction averages over plausible networks rather than relying on one fixed parameter setting.

Second, variational inference turns posterior inference into optimization. It introduces an approximate posterior and fits it by maximizing the evidence lower bound. This makes Bayesian learning usable with neural networks, although the approximation can be crude.

Third, Monte Carlo methods approximate expectations using samples. They appear in posterior prediction, variational inference, dropout uncertainty, latent variable models, diffusion models, and reinforcement learning.

Fourth, uncertainty estimation separates prediction from confidence. Aleatoric uncertainty comes from irreducible data noise. Epistemic uncertainty comes from lack of knowledge. Ensembles, Bayesian methods, dropout sampling, probabilistic heads, calibration, and conformal prediction are common tools.
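
One common way to make the aleatoric/epistemic split concrete is the law of total variance applied to an ensemble of Gaussian regression heads. The sketch below assumes each member outputs a mean and a variance per input; the function name `decompose_uncertainty` and the toy numbers are hypothetical, and other decompositions (for example, entropy-based ones for classifiers) are equally valid.

```python
import torch

def decompose_uncertainty(means, variances):
    """Split predictive variance for an ensemble of Gaussian heads.

    means, variances: tensors of shape (members, batch).
    """
    # Aleatoric part: the noise variance the members predict, on average.
    aleatoric = variances.mean(dim=0)
    # Epistemic part: how much the members' means disagree with each other.
    epistemic = means.var(dim=0, unbiased=False)
    return aleatoric, epistemic, aleatoric + epistemic

# Toy ensemble of three members evaluated on two inputs (made-up numbers).
means = torch.tensor([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
variances = torch.tensor([[0.5, 0.1], [0.5, 0.1], [0.5, 0.1]])
aleatoric, epistemic, total = decompose_uncertainty(means, variances)
print(aleatoric, epistemic, total)
```

On the first input the members agree, so all of the uncertainty is attributed to noise. On the second input they disagree strongly, so most of the variance is epistemic.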

Fifth, Gaussian processes provide a distribution over functions. They give exact Bayesian regression in small settings, strong uncertainty estimates, and a theoretical bridge between kernel methods and infinitely wide neural networks.

Core Equations

Bayesian posterior:

$$ p(\theta \mid D) = \frac{p(D \mid \theta)p(\theta)}{p(D)}. $$

Posterior predictive distribution:

$$ p(y^\star \mid x^\star,D) = \int p(y^\star \mid x^\star,\theta)\,p(\theta\mid D)\,d\theta. $$

Monte Carlo approximation, with samples $\theta_s \sim p(\theta \mid D)$:

$$ p(y^\star \mid x^\star,D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y^\star \mid x^\star,\theta_s). $$
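
The approximation can be checked numerically on a toy model where the integral has a closed form. The sketch below assumes a scalar parameter with a Gaussian posterior and a Gaussian likelihood $y \sim \mathcal{N}(\theta, \sigma^2)$ with no input $x^\star$; all numbers and the helper name `mc_predictive_density` are illustrative assumptions, and only the Python standard library is used.

```python
import random
from statistics import NormalDist

def mc_predictive_density(y_star, post_mean, post_std, noise_std,
                          num_samples=20_000, seed=0):
    """Estimate p(y* | D) by averaging the likelihood over posterior samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta_s = rng.gauss(post_mean, post_std)             # theta_s ~ p(theta | D)
        total += NormalDist(theta_s, noise_std).pdf(y_star)  # p(y* | theta_s)
    return total / num_samples

# Assumed toy posterior N(1.0, 0.5^2) and observation noise std 1.0.
post_mean, post_std, noise_std = 1.0, 0.5, 1.0
y_star = 1.7

estimate = mc_predictive_density(y_star, post_mean, post_std, noise_std)
# For this conjugate model the predictive is N(post_mean, post_std^2 + noise_std^2).
exact = NormalDist(post_mean, (post_std**2 + noise_std**2) ** 0.5).pdf(y_star)
print(f"MC estimate: {estimate:.4f}, exact: {exact:.4f}")
```

For a neural network no closed form exists, so the sample average is the only practical option, and its error shrinks at the usual $1/\sqrt{S}$ rate.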

Variational ELBO:

$$ \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)} \left[ \log p(D\mid\theta) \right] - \mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta)\right). $$
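
As a rough illustration of how the ELBO becomes a training objective, the sketch below fits a mean-field Gaussian $q_\phi$ over a single regression weight using reparameterized samples from `torch.distributions`. The toy data, the standard normal prior, the known noise level, and the softplus parameterization of the standard deviation are all assumptions made for this example.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + noise.
x = torch.linspace(-1, 1, 50)
y = 2.0 * x + 0.1 * torch.randn(50)

# Variational parameters phi = (mu, rho) for one scalar weight theta.
mu = torch.zeros(1, requires_grad=True)
rho = torch.full((1,), -3.0, requires_grad=True)  # std = softplus(rho) > 0

prior = Normal(0.0, 1.0)  # assumed prior p(theta)
noise_std = 0.1           # assumed known observation noise

def elbo(num_samples=8):
    q = Normal(mu, torch.nn.functional.softplus(rho))
    theta = q.rsample((num_samples,))  # shape (S, 1), reparameterized samples
    # log p(D | theta) for each sample: sum over the 50 data points.
    log_lik = Normal(theta * x, noise_std).log_prob(y).sum(dim=1)
    # Monte Carlo expectation minus the closed-form KL term.
    return log_lik.mean() - kl_divergence(q, prior).sum()

optimizer = torch.optim.Adam([mu, rho], lr=0.05)
for _ in range(300):
    optimizer.zero_grad()
    loss = -elbo()  # maximizing the ELBO = minimizing its negative
    loss.backward()
    optimizer.step()

print(f"posterior mean: {mu.item():.3f}")
```

The expectation term is estimated by sampling while the KL term is computed analytically, which is possible here because both $q_\phi$ and the prior are Gaussian.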

Gaussian process prior:

$$ f \sim \mathcal{GP}(m(x),k(x,x')). $$

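Gaussian process predictive mean, assuming a zero prior mean, training kernel matrix $K$, test-to-training kernel vector $k_\star$, and noise variance $\sigma_n^2$: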

$$ \mu_\star = k_\star^\top(K+\sigma_n^2I)^{-1}y. $$

Gaussian process predictive variance:

$$ \sigma_\star^2 = k(x_\star,x_\star) - k_\star^\top(K+\sigma_n^2I)^{-1}k_\star. $$
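
The two predictive equations translate almost line by line into code. The sketch below assumes a zero mean function, a squared-exponential kernel with fixed hyperparameters, and one-dimensional inputs; the helper names `rbf_kernel` and `gp_predict` are hypothetical. It uses a Cholesky factorization instead of an explicit inverse, which is the usual numerically safer choice.

```python
import torch

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs of shape (n,) and (m,)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * torch.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(x_train, y_train, x_star, noise_std=0.1):
    n = len(x_train)
    K = rbf_kernel(x_train, x_train) + noise_std**2 * torch.eye(n, dtype=x_train.dtype)
    k_star = rbf_kernel(x_train, x_star)               # shape (n, m)
    L = torch.linalg.cholesky(K)                       # K + sigma_n^2 I = L L^T
    alpha = torch.cholesky_solve(y_train[:, None], L)  # (K + sigma_n^2 I)^{-1} y
    v = torch.cholesky_solve(k_star, L)                # (K + sigma_n^2 I)^{-1} k_star
    mean = (k_star.T @ alpha).squeeze(-1)
    var = rbf_kernel(x_star, x_star).diagonal() - (k_star * v).sum(dim=0)
    return mean, var

x_train = torch.tensor([-1.0, 0.0, 1.0], dtype=torch.float64)
y_train = torch.sin(x_train)
x_star = torch.tensor([1.0, 10.0], dtype=torch.float64)  # one near, one far

mean, var = gp_predict(x_train, y_train, x_star)
print(mean, var)
```

Near the training data the variance collapses toward the noise level, while far from it the prediction reverts to the prior mean and prior variance.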

Practical Patterns

Most probabilistic PyTorch models follow one of these patterns.

| Pattern | Network output | Loss |
| --- | --- | --- |
| Gaussian regression | Mean and variance | Gaussian negative log likelihood |
| Classification | Logits | Cross-entropy or categorical NLL |
| Mixture density network | Mixture weights, means, variances | Mixture negative log likelihood |
| VAE | Latent posterior and decoder likelihood | Negative ELBO |
| Bayesian neural network | Weight posterior parameters | Negative ELBO |
| Ensemble | Multiple model predictions | Averaged likelihood or probability |
| MC dropout | Stochastic forward passes | Averaged prediction at inference |

The common implementation idea is direct: compute distribution parameters, construct a probability distribution, evaluate log_prob, and minimize negative log probability.
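
A minimal sketch of that recipe for Gaussian regression follows. The architecture, the softplus parameterization of the scale, the small positive floor, and the synthetic sine data are assumptions chosen for brevity, not a recommended configuration.

```python
import torch
from torch import nn
from torch.distributions import Normal

class GaussianHead(nn.Module):
    """Small network whose forward pass returns a Normal distribution."""

    def __init__(self, in_features=1, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_features, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, 1)
        self.raw_scale = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        # Step 1: compute distribution parameters (softplus keeps the scale positive).
        scale = nn.functional.softplus(self.raw_scale(h)) + 1e-4
        # Step 2: construct the probability distribution.
        return Normal(self.mean(h), scale)

torch.manual_seed(0)
x = torch.linspace(-2, 2, 128).unsqueeze(-1)
y = torch.sin(x) + 0.2 * torch.randn_like(x)

model = GaussianHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
losses = []
for _ in range(200):
    optimizer.zero_grad()
    # Steps 3 and 4: evaluate log_prob and minimize its negative.
    loss = -model(x).log_prob(y).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"NLL: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Swapping `Normal` for another distribution from `torch.distributions`, such as `StudentT` or `Categorical`, changes the likelihood while the training loop stays the same.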

When to Use Probabilistic Deep Learning

Use probabilistic methods when the output distribution matters, not only the best prediction.

They are appropriate when:

| Situation | Useful method |
| --- | --- |
| Noisy regression targets | Gaussian or Student-t output head |
| Multimodal targets | Mixture density network |
| Limited data | Bayesian neural network or Gaussian process |
| Need calibrated probabilities | Temperature scaling, ensembles, Bayesian methods |
| Need prediction intervals | Probabilistic regression or conformal prediction |
| Need latent representations | Variational autoencoder |
| Need sample generation | VAE, flow, diffusion, autoregressive model |
| Expensive experiments | Gaussian process Bayesian optimization |
| Safety-critical deployment | Ensembles, uncertainty thresholds, conformal prediction |

For many production systems, deep ensembles plus calibration provide a strong baseline. Full Bayesian neural networks are conceptually clean but can be harder to scale and tune.
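
The two ingredients of that baseline can be sketched in a few lines. The example below assumes classifier logits are already available: `ensemble_probs` averages member probabilities, and `fit_temperature` learns a single scalar temperature by minimizing validation NLL. Both function names, the synthetic logits, and the choice to apply one shared temperature to every member are illustrative assumptions; calibrating the averaged prediction instead is an equally reasonable design.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.05):
    """Fit a scalar temperature T > 0 by minimizing NLL on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

def ensemble_probs(member_logits, temperature=1.0):
    """Average softmax probabilities over members; input shape (members, batch, classes)."""
    return F.softmax(member_logits / temperature, dim=-1).mean(dim=0)

torch.manual_seed(0)
# Synthetic validation set: labels are drawn from softmax(true_logits),
# but the "model" reports logits that are three times too large.
true_logits = torch.randn(2000, 5)
labels = torch.distributions.Categorical(logits=true_logits).sample()
overconfident = 3.0 * true_logits

T = fit_temperature(overconfident, labels)
nll_before = F.cross_entropy(overconfident, labels).item()
nll_after = F.cross_entropy(overconfident / T, labels).item()

# Three hypothetical members: noisy copies of the overconfident logits.
members = torch.stack([overconfident + 0.1 * torch.randn_like(overconfident)
                       for _ in range(3)])
probs = ensemble_probs(members, temperature=T)
print(f"T = {T:.2f}, NLL {nll_before:.3f} -> {nll_after:.3f}")
```

Temperature scaling leaves the argmax unchanged, so accuracy is unaffected; only the confidence attached to each prediction moves.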

Common Failure Modes

Probabilistic outputs can look precise while being poorly calibrated. A model can produce a variance, probability, or interval that does not match empirical reality.

Common problems include:

| Failure mode | Description |
| --- | --- |
| Overconfident softmax | Classifier assigns high probability to wrong outputs |
| Underestimated variance | Regression intervals are too narrow |
| Poor posterior approximation | Variational family misses important uncertainty |
| Bad prior choice | Prior conflicts with the task or data scale |
| Distribution mismatch | Gaussian likelihood used for heavy-tailed targets |
| OOD overconfidence | Model is confident far from training data |
| Sample inefficiency | Monte Carlo estimate has high variance |
| Excessive serving cost | Ensembles or MC sampling are too expensive |
Uncertainty estimates must be validated empirically. Calibration plots, negative log likelihood, coverage tests, and out-of-distribution benchmarks are more informative than accuracy alone.
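
A coverage test is among the simplest of these checks. The sketch below assumes a regression model that reports a Gaussian mean and standard deviation per example, and counts how often the target falls inside the central interval. The function name `empirical_coverage` and the synthetic data are hypothetical, and only the standard library is used.

```python
import random
from statistics import NormalDist

def empirical_coverage(y_true, means, stds, nominal=0.9):
    """Fraction of targets inside the central `nominal` Gaussian interval."""
    z = NormalDist().inv_cdf(0.5 + nominal / 2)  # half-width in standard deviations
    hits = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, means, stds))
    return hits / len(y_true)

rng = random.Random(0)
n = 5000
y_true = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true noise std is 1.0
means = [0.0] * n

# A model that reports the correct std versus one that reports half of it.
well_calibrated = empirical_coverage(y_true, means, [1.0] * n)
overconfident = empirical_coverage(y_true, means, [0.5] * n)
print(f"nominal 0.90 | calibrated {well_calibrated:.3f} | overconfident {overconfident:.3f}")
```

Both models here have identical mean predictions and therefore identical mean squared error, yet only one of them produces intervals that can be trusted.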

Further Reading

For Bayesian modeling and probabilistic machine learning, read Kevin Murphy’s Probabilistic Machine Learning: An Introduction and Probabilistic Machine Learning: Advanced Topics.

For Gaussian processes, read Rasmussen and Williams, Gaussian Processes for Machine Learning.

For variational inference, read Blei, Kucukelbir, and McAuliffe, “Variational Inference: A Review for Statisticians.”

For Bayesian deep learning, read work on Bayes by Backprop, Monte Carlo dropout, deep ensembles, Laplace approximations, and stochastic gradient MCMC.

For practical PyTorch modeling, study torch.distributions, Pyro, NumPyro, GPyTorch, and BoTorch.

Exercises

  1. Implement a Gaussian regression model that predicts both mean and variance. Compare mean squared error with Gaussian negative log likelihood.

  2. Train a classifier and measure its expected calibration error. Then apply temperature scaling on a validation set.

  3. Implement Monte Carlo dropout for a small image classifier. Compare predictive entropy on in-distribution and out-of-distribution examples.

  4. Train an ensemble of three neural networks. Measure disagreement across ensemble members.

  5. Implement a mixture density network on a synthetic dataset where one input can map to two possible outputs.

  6. Train a small variational autoencoder. Plot samples from the prior and reconstructions from the encoder.

  7. Use GPyTorch to fit a Gaussian process regression model on a one-dimensional dataset. Plot predictive mean and uncertainty intervals.

  8. Compare a Gaussian process and a neural network on a small-data regression problem.

  9. For a Bayesian linear layer, inspect how increasing the KL penalty affects posterior variance.

  10. Evaluate prediction interval coverage for a probabilistic regression model. Compare nominal coverage with empirical coverage.