Practical Probabilistic Modeling in PyTorch

Probabilistic deep learning adds distributions to ordinary neural networks.

Probabilistic deep learning adds distributions to ordinary neural networks. A model no longer predicts only a value. It predicts parameters of a probability distribution, samples from latent variables, estimates uncertainty, or defines a likelihood for observed data.

In PyTorch, this usually means combining three pieces:

Piece Role
Neural network Computes distribution parameters
Probability distribution Defines likelihood or sampling rule
Loss function Optimizes negative log likelihood or ELBO

The neural network remains deterministic in its computation. The probabilistic part appears in how we interpret its outputs.

Deterministic Versus Probabilistic Outputs

A deterministic regression model predicts

$$ \hat{y}=f_\theta(x). $$

A probabilistic regression model predicts a distribution:

$$ p(y\mid x) = \mathcal{N}(y;\mu_\theta(x),\sigma_\theta^2(x)). $$

The network outputs both $\mu_\theta(x)$ and $\sigma_\theta^2(x)$. The mean gives the central prediction. The variance gives uncertainty.

For classification, the network outputs logits:

$$ z=f_\theta(x). $$

The logits define a categorical distribution:

$$ p(y=k\mid x) = \mathrm{softmax}(z)_k. $$

Thus, ordinary classification with cross-entropy already has a probabilistic interpretation.

Using torch.distributions

PyTorch provides probability distributions through torch.distributions.

A normal distribution can be created as follows:

import torch
from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])

dist = Normal(mu, sigma)

sample = dist.sample()
log_prob = dist.log_prob(sample)

The method sample() draws random values. The method log_prob() evaluates the log probability density.

This is useful because many probabilistic losses are negative log likelihoods:

$$ \mathcal{L} = -\log p(y\mid x). $$

Gaussian Regression Model

A Gaussian regression model predicts a mean and a variance.

import torch
from torch import nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
        )

        self.mean = nn.Linear(hidden_features, 1)
        self.log_std = nn.Linear(hidden_features, 1)

    def forward(self, x):
        h = self.net(x)

        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)

The model returns a distribution object, not a raw tensor.

Training uses negative log likelihood:

def gaussian_nll_loss(model, x, y):
    dist = model(x)
    loss = -dist.log_prob(y).mean()
    return loss

This loss rewards accurate means and calibrated variances. A model cannot make the variance arbitrarily large without paying a log-standard-deviation penalty.

Bernoulli Models

For binary prediction, the Bernoulli distribution is common.

A network can output logits:

from torch.distributions import Bernoulli

class BernoulliClassifier(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        logits = self.net(x)
        return Bernoulli(logits=logits)

The loss is again negative log likelihood:

def bernoulli_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()

This is equivalent to binary cross-entropy with logits, but the probabilistic version makes the modeling assumption explicit.

Categorical Models

Multiclass classification uses the categorical distribution.

from torch.distributions import Categorical

class CategoricalClassifier(nn.Module):
    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, num_classes),
        )

    def forward(self, x):
        logits = self.net(x)
        return Categorical(logits=logits)

The negative log likelihood is:

def categorical_nll_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y).mean()

Here y should contain integer class indices.

This is equivalent to torch.nn.functional.cross_entropy, but the distribution form generalizes more naturally to probabilistic models.

Mixture Density Networks

A single Gaussian may be too simple. Some regression problems are multimodal: the same input can lead to several plausible outputs.

A mixture density network predicts a mixture distribution:

$$ p(y\mid x) = \sum_{k=1}^{K} \pi_k(x) \mathcal{N}(y;\mu_k(x),\sigma_k^2(x)). $$

Here:

Symbol Meaning
$K$ Number of mixture components
$\pi_k(x)$ Mixture weight
$\mu_k(x)$ Component mean
$\sigma_k(x)$ Component standard deviation

In PyTorch:

from torch.distributions import Categorical, Normal, MixtureSameFamily

class MixtureDensityNetwork(nn.Module):
    def __init__(self, in_features, hidden_features, num_components):
        super().__init__()

        self.num_components = num_components

        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.Tanh(),
            nn.Linear(hidden_features, hidden_features),
            nn.Tanh(),
        )

        self.logits = nn.Linear(hidden_features, num_components)
        self.means = nn.Linear(hidden_features, num_components)
        self.log_stds = nn.Linear(hidden_features, num_components)

    def forward(self, x):
        h = self.net(x)

        logits = self.logits(h)
        means = self.means(h)
        stds = self.log_stds(h).clamp(-10.0, 5.0).exp()

        mixture = Categorical(logits=logits)
        components = Normal(means, stds)

        return MixtureSameFamily(mixture, components)

Training:

def mdn_loss(model, x, y):
    dist = model(x)
    return -dist.log_prob(y.squeeze(-1)).mean()

Mixture density networks are useful when the conditional target distribution has several modes.

Latent Variable Models

A latent variable model introduces an unobserved variable $z$:

$$ p_\theta(x,z) = p_\theta(x\mid z)p(z). $$

The latent variable may represent hidden factors of variation. In an image model, $z$ may encode object identity, pose, lighting, or style.

The marginal likelihood is

$$ p_\theta(x) = \int p_\theta(x\mid z)p(z),dz. $$

This integral is usually intractable. Variational inference introduces an approximate posterior:

$$ q_\phi(z\mid x). $$

This is the foundation of variational autoencoders.

Minimal Variational Autoencoder

A variational autoencoder contains an encoder, a decoder, and a latent distribution.

The encoder predicts

$$ q_\phi(z\mid x) = \mathcal{N}(z;\mu_\phi(x),\sigma_\phi^2(x)). $$

The decoder predicts

$$ p_\theta(x\mid z). $$

A minimal PyTorch VAE:

import torch
from torch import nn
from torch.distributions import Normal, Bernoulli
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )

        self.z_mean = nn.Linear(hidden_dim, latent_dim)
        self.z_log_std = nn.Linear(hidden_dim, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def encode(self, x):
        h = self.encoder(x)
        mean = self.z_mean(h)
        log_std = self.z_log_std(h).clamp(-10.0, 5.0)
        std = log_std.exp()

        return Normal(mean, std)

    def decode(self, z):
        logits = self.decoder(z)
        return Bernoulli(logits=logits)

    def forward(self, x):
        q_z = self.encode(x)
        z = q_z.rsample()
        p_x = self.decode(z)

        return p_x, q_z, z

The method rsample() uses the reparameterization trick, allowing gradients to pass through the sampled latent variable.

VAE Loss

The VAE optimizes the negative ELBO.

$$ \mathcal{L} = -\mathbb{E}{q\phi(z\mid x)} [ \log p_\theta(x\mid z) ] + D_{KL}(q_\phi(z\mid x)|p(z)). $$

For a standard normal prior $p(z)=\mathcal{N}(0,I)$, the KL term has a closed form when $q_\phi$ is diagonal Gaussian.

def vae_loss(model, x):
    p_x, q_z, z = model(x)

    reconstruction_loss = -p_x.log_prob(x).sum(dim=-1).mean()

    mean = q_z.loc
    std = q_z.scale
    var = std.pow(2)

    kl = 0.5 * (
        mean.pow(2) + var - var.log() - 1.0
    ).sum(dim=-1).mean()

    return reconstruction_loss + kl

The reconstruction term trains the decoder. The KL term keeps the approximate posterior close to the prior.

Sampling from a VAE

After training, we can generate samples by drawing from the prior:

$$ z\sim\mathcal{N}(0,I), $$

then decoding:

$$ x\sim p_\theta(x\mid z). $$

In PyTorch:

@torch.no_grad()
def sample_vae(model, num_samples, latent_dim, device):
    z = torch.randn(num_samples, latent_dim, device=device)
    p_x = model.decode(z)

    return p_x.sample()

For image data, the generated tensor can be reshaped back into image format.

Normalizing Flows

Normalizing flows transform a simple distribution into a complex distribution using invertible mappings.

Let

$$ z_0 \sim p_0(z_0) $$

and

$$ z_K = f_K\circ \cdots \circ f_1(z_0). $$

The density is computed using the change-of-variables formula:

$$ \log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|. $$

Flows are useful because they provide exact likelihoods and expressive distributions.

Their main constraint is architectural: transformations must be invertible and have tractable Jacobian determinants.

Probabilistic Forecasting

Probabilistic modeling is also useful in time series and forecasting.

Instead of predicting a single future value, the model predicts a distribution over future values:

$$ p(y_{t+1:t+H}\mid x_{1:t}). $$

Common output distributions include:

Distribution Use case
Gaussian Continuous targets
Student-t Heavy-tailed noise
Negative binomial Count data
Quantile outputs Prediction intervals
Mixture distributions Multimodal futures

A probabilistic forecast can answer questions such as:

  • What is the expected value?
  • What is the 90 percent prediction interval?
  • What is the probability of exceeding a threshold?

Choosing a Distribution

The output distribution should match the target type.

Target type Common distribution
Binary value Bernoulli
Class label Categorical
Real value Normal, Student-t, mixture of normals
Positive value LogNormal, Gamma
Count Poisson, negative binomial
Bounded value Beta
Sequence Autoregressive categorical or continuous likelihood

Poor distribution choice can produce misleading uncertainty. For example, a Gaussian likelihood may handle symmetric noise well but may perform poorly on heavy-tailed data.

Training Loop Pattern

Most probabilistic models follow a common training pattern:

def train_step(model, optimizer, x, y):
    model.train()

    optimizer.zero_grad()

    dist = model(x)
    loss = -dist.log_prob(y).mean()

    loss.backward()
    optimizer.step()

    return loss.item()

For latent variable models, the loss may include extra KL terms:

loss = reconstruction_loss + kl_loss

For Bayesian neural networks, the loss may include a KL term between the variational posterior and the prior.

Numerical Stability

Probabilistic models require careful numerical handling.

Use logits instead of probabilities when possible. For example:

Bernoulli(logits=logits)
Categorical(logits=logits)

This avoids unstable calls to log(sigmoid(x)) or log(softmax(x)).

Clamp log standard deviations:

log_std = log_std.clamp(-10.0, 5.0)

This prevents standard deviations from becoming zero or exploding.

Add small epsilons when computing logs manually:

torch.log(x + 1e-8)

Prefer library distribution methods when possible, since they often implement stable formulas internally.

Evaluation

Probabilistic models should be evaluated with probabilistic metrics.

Metric Meaning
Negative log likelihood Quality of predicted distribution
Calibration error Whether probabilities match frequencies
Prediction interval coverage Whether intervals contain true values
Sharpness How concentrated predictions are
Brier score Accuracy of probabilistic classification
CRPS Probabilistic forecasting quality

Accuracy alone is insufficient. A probabilistic model should be both accurate and calibrated.

Summary

Practical probabilistic modeling in PyTorch usually means making neural networks output probability distributions.

Regression models can output Gaussian or mixture distributions. Classifiers define Bernoulli or categorical distributions. Latent variable models introduce hidden variables and optimize ELBO-style objectives.

The key implementation pattern is simple: the network computes distribution parameters, the distribution defines log_prob, and training minimizes negative log likelihood or negative ELBO.