
From MLE to Variational Inference


Introduction: The Foundation of Generative Modeling

Understanding modern generative models requires grasping a fundamental progression in machine learning theory: from simple maximum likelihood estimation to sophisticated variational inference methods. This mathematical journey reveals how we moved from modeling observable data directly to learning complex latent representations that capture the hidden structure of our world.

The path we’ll explore connects three crucial concepts: Maximum Likelihood Estimation as our optimization foundation, Latent Variable Models as our structural framework, and Variational Inference as our computational solution. Each builds upon the previous, creating a coherent theoretical narrative that explains why modern generative models like Variational Autoencoders work the way they do.

Part I: Maximum Likelihood Learning - The Foundation

The Core Principle

Maximum Likelihood Estimation (MLE) forms the bedrock of modern generative modeling. Given a dataset $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(n)}\}$, our goal is to find model parameters $\theta$ that maximize the likelihood of observing our data:

$$\theta^* = \arg\max_\theta \mathcal{L}(\theta) = \arg\max_\theta \prod_{i=1}^n p_\theta(x^{(i)})$$

In practice, we maximize the log-likelihood for computational convenience:

$$\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x^{(i)})$$
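As a concrete illustration (a numpy sketch of my own, not from the post): for a univariate Gaussian the MLE has a closed form - the sample mean and the (biased) sample variance maximize the log-likelihood.

```python
import numpy as np

# Illustrative example: MLE for a univariate Gaussian N(mu, var).
# The closed-form maximizers of the log-likelihood are the sample mean
# and the (biased) sample variance.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_mle = data.mean()                     # argmax over mu
var_mle = ((data - mu_mle) ** 2).mean()  # argmax over sigma^2

def log_likelihood(x, mu, var):
    """Average Gaussian log-likelihood of the data under N(mu, var)."""
    return np.mean(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var))
```

Because the MLE is the global maximizer, `log_likelihood(data, mu_mle, var_mle)` is at least as large as the score under any other parameters, including the true ones that generated the sample.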

The Connection to KL Divergence

A profound insight emerges when we recognize that MLE is equivalent to minimizing the KL divergence between the true data distribution $p_{data}(x)$ and our model distribution $p_\theta(x)$:

$$\theta^* = \arg\min_\theta D_{KL}(p_{data}(x) || p_\theta(x))$$

This connection reveals why MLE is so fundamental: we’re not just fitting parameters, we’re finding the model distribution that best approximates reality.

For a detailed derivation of this connection and its implications (including the crucial asymmetry of KL divergence), see my previous post on Maximum Likelihood Learning and the KL Connection.

From Expected to Empirical Log-Likelihood

In practice, we don’t know the true data distribution $p_{data}(x)$. Instead, we approximate the expected log-likelihood with the empirical log-likelihood using our dataset:

$$\mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)] \approx \frac{1}{|D|}\sum_{x \in D} \log p_\theta(x)$$

Maximum likelihood learning then becomes:

$$\max_{p_\theta} \frac{1}{|D|}\sum_{x \in D} \log p_\theta(x)$$

The Bias-Variance Tradeoff

A crucial challenge in MLE is balancing model complexity. Consider the bias-variance decomposition:

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$

Simple models (high bias, low variance) underfit, while complex models (low bias, high variance) overfit. The art lies in finding the sweet spot.

Optimization Through Gradient Descent

MLE objectives are typically optimized with gradient-based methods. Since we are maximizing the log-likelihood, the update is a gradient ascent step (equivalently, gradient descent on the negative log-likelihood):

$$\theta_{t+1} = \theta_t + \eta \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate. The gradient points in the direction of steepest increase in likelihood, guiding us toward better parameter values.
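The update rule can be sketched in a few lines. The example below is an illustrative toy case of my own choosing: a Gaussian model $\mathcal{N}(\mu, 1)$, for which the gradient of the average log-likelihood with respect to $\mu$ is simply $\bar{x} - \mu$.

```python
import numpy as np

# Gradient ascent on the average log-likelihood of N(mu, 1) for the mean mu.
# d/dmu of the average log-likelihood is mean(data) - mu.
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=5_000)

mu, eta = 0.0, 0.1  # initial parameter and learning rate
for _ in range(200):
    grad = np.mean(data - mu)  # gradient of the average log-likelihood
    mu = mu + eta * grad       # ascent step: move uphill in likelihood
```

After a few hundred steps `mu` converges to the sample mean, which is the closed-form MLE for this model.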

For neural networks, we use backpropagation to compute gradients efficiently, enabling the deep learning revolution in generative modeling.

Part II: Latent Variable Models - Capturing Hidden Structure

Beyond Observable Variables

While MLE works well for simple distributions, real-world data often has underlying structure that isn’t directly observable. Consider images: the pixels we see are generated by hidden factors like object identity, lighting, pose, and style. Latent Variable Models capture this intuition by introducing unobserved variables $z$ that explain the data generation process.

The fundamental latent variable model assumes:

$$p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$$

where:

  • $z$ is the latent variable, with a prior $p(z)$ that is typically simple (e.g., $\mathcal{N}(0, I)$)
  • $p_\theta(x|z)$ is the conditional likelihood of the data given the latents
  • the integral marginalizes over every possible latent configuration

The Intractability Problem

This integral is the heart of the challenge. For complex models such as neural-network decoders, it has no closed form and cannot be computed exactly:

$$p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz \quad \text{is intractable!}$$

We cannot directly optimize $\log p_\theta(x)$ because we cannot compute it.

Example: Mixture of Gaussians

Consider a simple latent variable model - Mixture of Gaussians:

$$p(z) = \text{Categorical}(z; \pi), \qquad p_\theta(x|z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$$

Even this simple case requires summing over all mixture components. For neural networks with continuous latent spaces, the intractability becomes severe.
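To make the marginalization concrete, here is an illustrative numpy sketch of $p(x)$ for a 1-D mixture (the component weights, means, and scales below are made up for the example):

```python
import numpy as np

# p(x) for a 1-D mixture of Gaussians: sum over every latent component k.
pi = np.array([0.3, 0.7])      # p(z): Categorical prior over components
mus = np.array([-2.0, 2.0])    # component means
sigmas = np.array([1.0, 0.5])  # component standard deviations

def gauss_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal(x):
    """p(x) = sum_k p(z=k) * N(x; mu_k, sigma_k^2)."""
    return np.sum(pi * gauss_pdf(x, mus, sigmas))
```

With two components this sum is trivial; the point is that a continuous latent space replaces the sum with an integral that has no such shortcut.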

The Posterior Inference Challenge

Even if we could compute $p_\theta(x)$, we face another challenge: computing the posterior $p_\theta(z|x)$ for inference:

$$p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}$$

This posterior tells us what latent factors likely generated a given observation - crucial for understanding and manipulating our model.
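For the mixture of Gaussians the posterior is still tractable, which makes it a useful reference point. The numpy sketch below (parameters made up for illustration) computes $p(z=k|x)$ via Bayes’ rule:

```python
import numpy as np

# Exact posterior p(z=k | x) for a 1-D mixture of Gaussians via Bayes' rule.
# Tractable here; intractable once the decoder is a neural network.
pi = np.array([0.3, 0.7])
mus = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 0.5])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x):
    """p(z=k|x) = p(x|z=k) p(z=k) / p(x); normalizing handles p(x) implicitly."""
    joint = pi * gauss_pdf(x, mus, sigmas)  # p(x|z=k) * p(z=k) for each k
    return joint / joint.sum()              # divide by p(x) = sum_k joint_k
```

Note that the denominator is exactly the intractable marginal $p_\theta(x)$; for the mixture it is a cheap sum, but in general it is the integral we cannot compute.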

Part III: Variational Inference - The Computational Solution

Approximating the Intractable

Variational inference provides an elegant solution to intractability. Instead of computing the exact posterior $p_\theta(z|x)$, we approximate it with a simpler distribution $q_\phi(z|x)$ parameterized by $\phi$.

The key insight is to optimize this approximation to be as close as possible to the true posterior:

$$\phi^* = \arg\min_\phi D_{KL}(q_\phi(z|x) || p_\theta(z|x))$$

Deriving the Evidence Lower Bound (ELBO)

The breakthrough comes from a clever mathematical manipulation. Starting with the log-likelihood:

$$\log p_\theta(x) = \log \int p_\theta(x,z)\,dz$$

We multiply and divide by our approximation $q_\phi(z|x)$:

$$\log p_\theta(x) = \log \int \frac{p_\theta(x,z)}{q_\phi(z|x)}\, q_\phi(z|x)\, dz$$

Applying Jensen’s inequality (since log is concave):

$$\log p_\theta(x) \geq \int q_\phi(z|x) \log \frac{p_\theta(x,z)}{q_\phi(z|x)}\, dz$$

Understanding Jensen’s Inequality: The key mathematical insight is that the logarithm is concave, so for any random variable $X$ we have $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$. Applying this with the expectation taken under $q_\phi(z|x)$ is exactly what turns the log of the integral into the lower bound above.

Figure: Jensen’s inequality for the logarithm function. Because the log is concave, the log of the expectation (the point on the curve) lies above the expectation of the log (the point on the chord). This mathematical property is crucial for deriving the ELBO.

This gives us the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]$$

Rearranging terms: We can rewrite this by using the factorization $p_\theta(x,z) = p_\theta(x|z)\,p(z)$:

$$\begin{aligned}
\mathcal{L}(\theta,\phi) &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log (p_\theta(x|z)p(z))] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z) + \log p(z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}[\log p(z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}[\log p(z) - \log q_\phi(z|x)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p(z)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))
\end{aligned}$$

This final form reveals the ELBO’s intuitive structure as a balance between reconstruction and regularization.
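The equality of the joint form and the reconstruction-minus-KL form can be checked numerically. The sketch below uses a toy conjugate model of my own choosing ($p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, $q(z|x) = \mathcal{N}(m, s^2)$), with the Gaussian KL in closed form:

```python
import numpy as np

# Two ways to compute the same ELBO for a toy model:
#   Form 1: E_q[log p(x,z) - log q(z|x)]           (all Monte Carlo)
#   Form 2: E_q[log p(x|z)] - KL(q || p(z))        (KL in closed form)
rng = np.random.default_rng(0)
x, m, s = 1.0, 0.4, 0.8
z = rng.normal(m, s, size=200_000)  # samples from q(z|x)

def log_normal(v, mu, sigma):
    """Log density of N(v; mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (v - mu) ** 2 / (2 * sigma**2)

# Form 1: joint minus log q, averaged over samples from q.
elbo_joint = np.mean(log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
                     - log_normal(z, m, s))

# Form 2: Monte Carlo reconstruction term minus analytic KL(N(m,s^2) || N(0,1)).
kl = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))
elbo_recon_kl = np.mean(log_normal(x, z, 1.0)) - kl
```

The two estimates agree up to Monte Carlo error, mirroring the algebra above.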

Understanding the ELBO

The ELBO decomposes into two intuitive terms:

  1. Reconstruction Term: $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$

    • Measures how well we can reconstruct $x$ from latent $z$
    • Encourages the decoder to be accurate
  2. Regularization Term: $D_{KL}(q_\phi(z|x) || p(z))$

    • Keeps the approximate posterior close to the prior
    • Prevents overfitting and ensures smooth latent space

The Variational Gap

The difference between the true log-likelihood and the ELBO is exactly the KL divergence we wanted to minimize. To see where this comes from, let’s start with the definition of KL divergence between our approximate posterior and the true posterior:

$$D_{KL}(q_\phi(z|x) || p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]$$

Using Bayes’ rule, we know that $p_\theta(z|x) = \frac{p_\theta(x,z)}{p_\theta(x)}$, so:

$$\begin{aligned}
D_{KL}(q_\phi(z|x) || p_\theta(z|x)) &= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(x,z)/p_\theta(x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x) \cdot p_\theta(x)}{p_\theta(x,z)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x)] - \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)] + \log p_\theta(x) - \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] \\
&= \log p_\theta(x) - \left(\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]\right) \\
&= \log p_\theta(x) - \mathcal{L}(\theta,\phi)
\end{aligned}$$

Therefore:

$$\log p_\theta(x) - \mathcal{L}(\theta,\phi) = D_{KL}(q_\phi(z|x) || p_\theta(z|x))$$

Why maximizing the ELBO simultaneously improves both model and approximation:

Since $D_{KL}(q_\phi(z|x) || p_\theta(z|x)) \geq 0$ always, we have:

$$\log p_\theta(x) = \mathcal{L}(\theta,\phi) + D_{KL}(q_\phi(z|x) || p_\theta(z|x)) \geq \mathcal{L}(\theta,\phi)$$

When we maximize $\mathcal{L}(\theta,\phi)$, we can do so by:

  1. Improving our model (w.r.t. $\theta$): Increasing $\mathcal{L}(\theta,\phi)$ raises the lower bound on $\log p_\theta(x)$ (since the KL term is non-negative), pushing our model toward assigning higher probability to the observed data.

  2. Improving our approximation (w.r.t. $\phi$): Increasing $\mathcal{L}(\theta,\phi)$ while $\log p_\theta(x)$ stays fixed forces the KL divergence $D_{KL}(q_\phi(z|x) || p_\theta(z|x))$ to decrease, meaning our approximate posterior $q_\phi(z|x)$ becomes closer to the true posterior $p_\theta(z|x)$.

This is the beauty of variational inference: a single objective simultaneously pushes us toward both a better model and a better approximation!
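The identity $\log p_\theta(x) = \mathcal{L}(\theta,\phi) + D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$ can be verified exactly on a conjugate toy model where everything is available in closed form. The model and numbers below are illustrative assumptions of mine, not from the post:

```python
import numpy as np

# Toy conjugate model: p(z) = N(0,1), p(x|z) = N(z,1)
#   => marginal  p(x)   = N(0, 2)
#   => posterior p(z|x) = N(x/2, 1/2)
# The identity log p(x) = ELBO + KL(q || p(z|x)) holds for ANY q = N(m, s^2).
x, m, s = 1.3, 0.1, 0.9  # observation and an arbitrary approximate posterior

log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)  # log N(x; 0, 2)

# ELBO = E_q[log p(x|z)] - KL(q || p(z)), both in closed form here.
recon = -0.5 * np.log(2 * np.pi) - ((x - m) ** 2 + s**2) / 2.0
kl_prior = 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))
elbo = recon - kl_prior

# KL between q = N(m, s^2) and the true posterior N(x/2, 1/2).
mu_p, var_p = x / 2.0, 0.5
kl_post = 0.5 * (np.log(var_p / s**2) + (s**2 + (m - mu_p) ** 2) / var_p - 1.0)
```

Adding `elbo` and `kl_post` recovers `log_px` to machine precision, regardless of how badly `m` and `s` are chosen - only the split between the two terms changes.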

Part IV: Variational Autoencoders - The Practical Realization

From Theory to Architecture

Variational Autoencoders (VAEs) represent the practical implementation of the variational inference framework we’ve developed. In the landscape of generative models, VAEs stand out for their elegant probabilistic approach to learning latent representations of data. Unlike autoregressive models that predict data sequentially, VAEs learn a continuous latent space, which allows for powerful generation and manipulation of data.

The VAE Architecture

A VAE consists of two neural networks working in tandem:

Encoder $q_\phi(z|x)$: maps an input $x$ to the parameters of the approximate posterior over the latent code - typically the mean $\mu_\phi(x)$ and variance $\sigma_\phi^2(x)$ of a diagonal Gaussian.

Decoder $p_\theta(x|z)$: maps a latent code $z$ back to a distribution over data space, from which reconstructions of $x$ are produced.

The VAE Objective in Practice

The ELBO we derived translates directly into the VAE training objective:

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))$$

This decomposes into:

  1. Reconstruction Loss: How well can we reconstruct the input from its latent representation?
  2. KL Regularization: How close is our learned posterior to the simple prior?

The beauty of this objective is that it naturally balances two competing goals: reconstruction fidelity, which pushes the model to explain each input accurately, and latent regularity, which keeps the posterior close to the prior so that samples drawn from $p(z)$ decode to plausible data.

Understanding the Latent Space

The KL regularization term serves a crucial purpose beyond mathematical convenience. By forcing $q_\phi(z|x) \approx p(z) = \mathcal{N}(0,I)$, we ensure that the latent space is smooth and concentrated around the origin, so that nearby codes decode to similar outputs and samples from the prior decode to realistic data.

When the bound becomes tight (i.e., when $q_\phi(z|x) \approx p_\theta(z|x)$), the VAE achieves optimal performance:

$$\log p_\theta(x) = \text{ELBO} + D_{KL}(q_\phi(z|x) || p_\theta(z|x))$$

The gap between the ELBO and the true log-likelihood measures exactly the quality of this posterior approximation.

Part V: Advanced Variational Techniques

Importance Weighted Autoencoders (IWAE)

The ELBO can sometimes be a loose bound. IWAE provides a tighter bound using multiple samples:

$$\mathcal{L}_K = \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z|x)} \left[ \log \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x,z_k)}{q_\phi(z_k|x)} \right]$$

Key properties:

  • $\mathcal{L}_K$ is monotonically non-decreasing in $K$
  • $\mathcal{L}_K \to \log p_\theta(x)$ as $K \to \infty$ (under mild conditions)
  • $K = 1$ recovers the standard ELBO

The bound hierarchy is:

$$\log p_\theta(x) \geq \mathcal{L}_K \geq \mathcal{L}_1 = \text{ELBO}$$
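This hierarchy can be observed numerically on the same kind of toy model used earlier ($p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, $q = \mathcal{N}(m, s^2)$; all parameters below are illustrative):

```python
import numpy as np

# Monte Carlo estimate of the IWAE bound L_K for a toy model where the
# exact log-likelihood log p(x) = log N(x; 0, 2) is known for comparison.
rng = np.random.default_rng(0)
x, m, s = 1.3, 0.1, 0.9

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

def iwae_bound(K, n_runs=20_000):
    z = rng.normal(m, s, size=(n_runs, K))                   # z_k ~ q(z|x)
    log_w = (log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
             - log_normal(z, m, s**2))                       # log importance weights
    # log (1/K) sum_k w_k, computed stably with log-sum-exp.
    lse = np.logaddexp.reduce(log_w, axis=1) - np.log(K)
    return lse.mean()

log_px = log_normal(x, 0.0, 2.0)  # exact log-likelihood for reference
```

Evaluating `iwae_bound(1)` and `iwae_bound(10)` shows the bound tightening toward `log_px` as $K$ grows, as the hierarchy predicts.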

β-VAE: Controlling Disentanglement

The β-VAE modifies the ELBO to control the latent space structure:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta \cdot D_{KL}(q_\phi(z|x) || p(z))$$

Different β values yield different behaviors:

  • $\beta = 1$: the standard VAE objective
  • $\beta > 1$: stronger regularization, encouraging disentangled latent factors at the cost of reconstruction quality
  • $\beta < 1$: sharper reconstructions but a less structured latent space

This provides a principled way to trade off between reconstruction fidelity and interpretable representations.

Semi-Supervised VAE (SSVAE)

SSVAE extends VAEs to leverage both labeled and unlabeled data:

For labeled data:

$$\mathcal{L}_{labeled} = \mathbb{E}_{q(z|x,y)}[\log p(x|z)] - D_{KL}(q(z|x,y) || p(z)) + \log p(y|x)$$

For unlabeled data:

$$\mathcal{L}_{unlabeled} = \sum_y q(y|x)\, \mathcal{L}(x,y) + \mathcal{H}[q(y|x)]$$

where $\mathcal{H}[q(y|x)]$ is the entropy of the classifier, encouraging diverse predictions.
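A minimal numpy sketch of the unlabeled objective. Here `elbo_xy` is a hypothetical stand-in for the per-class bound $\mathcal{L}(x,y)$, which a real model would compute from its encoder and decoder networks:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over the last axis."""
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unlabeled_bound(logits_y, elbo_xy):
    """sum_y q(y|x) L(x,y) + H[q(y|x)].

    logits_y: (B, C) classifier logits defining q(y|x)
    elbo_xy:  (B, C) the labeled bound L(x, y) evaluated for each class y
    """
    q_y = softmax(logits_y)                           # q(y|x)
    expected = (q_y * elbo_xy).sum(axis=-1)           # sum_y q(y|x) L(x,y)
    entropy = -(q_y * np.log(q_y)).sum(axis=-1)       # H[q(y|x)]
    return expected + entropy
```

With a uniform classifier and a class-independent bound, the result is just $\mathcal{L} + \log C$, which is a handy sanity check.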

Fully Supervised VAE (FSVAE)

When full supervision is available, we can condition the entire model on labels:

$$\mathcal{L}_{FSVAE} = \mathbb{E}_{q(z|x,y)}[\log p(x|z,y)] - D_{KL}(q(z|x,y) || p(z|y))$$

This allows the model to learn label-specific latent structures.

Part VI: The Reparameterization Trick - Making It Trainable

The Gradient Problem

A crucial challenge in training variational models is that the ELBO involves expectations over the approximate posterior $q_\phi(z|x)$. We need gradients with respect to $\phi$, but sampling operations aren’t differentiable.

The Solution

The reparameterization trick transforms the sampling process. Instead of sampling directly from qϕ(zx)q_\phi(z|x), we:

  1. Sample noise from a fixed distribution: $\epsilon \sim p(\epsilon)$
  2. Apply a deterministic transformation: $z = g_\phi(\epsilon, x)$

For Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0,I)$$

Now gradients can flow through the deterministic path from $\phi$ to the loss.
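The payoff is that gradients of expectations can be estimated pathwise. An illustrative numpy check (my own toy objective, not from the post): estimating $\frac{d}{d\mu}\,\mathbb{E}_{z \sim \mathcal{N}(\mu,\sigma^2)}[z^2]$, whose true value is $2\mu$, by differentiating through $z = \mu + \sigma\epsilon$ instead of through the sampling step:

```python
import numpy as np

# Pathwise (reparameterization) gradient estimator for d/dmu E[z^2],
# z ~ N(mu, sigma^2). The true gradient is 2*mu.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.normal(size=1_000_000)  # eps ~ N(0, 1): noise from a fixed source
z = mu + sigma * eps              # deterministic in mu for a given eps
# d(z^2)/dmu = 2*z * dz/dmu = 2*z, averaged over the noise:
grad_estimate = np.mean(2 * z)
```

Autodiff frameworks perform exactly this differentiation through `z` automatically, which is why the trick makes the ELBO trainable end to end.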

Monte Carlo Estimation

With reparameterization, we estimate the ELBO using Monte Carlo:

$$\mathcal{L}(\theta,\phi) \approx \frac{1}{L} \sum_{l=1}^L \left[\log p_\theta(x|z^{(l)}) - D_{KL}(q_\phi(z|x) || p(z))\right]$$

where $z^{(l)} = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon^{(l)}$.

Part VII: Practical Implementation Insights

Neural Network Architectures

Encoder $q_\phi(z|x)$: typically a stack of convolutional or fully connected layers that outputs $\mu_\phi(x)$ and $\log \sigma_\phi^2(x)$ for a diagonal Gaussian posterior.

Decoder $p_\theta(x|z)$: mirrors the encoder, mapping $z$ through fully connected or transposed-convolutional layers to the parameters of the output distribution (e.g., Bernoulli logits for binary images).

Key Implementation Functions

A minimal PyTorch implementation centers on three helpers:

```python
import math

import torch
import torch.nn.functional as F

def sample_gaussian(mu, log_var):
    """Sample from N(mu, exp(log_var)) using the reparameterization trick."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)  # eps ~ N(0, I)
    return mu + eps * std        # z = mu + sigma * eps

def log_normal(x, mu, log_var):
    """Log density of a diagonal Gaussian, summed over the last dimension."""
    return torch.sum(
        -0.5 * (math.log(2 * math.pi) + log_var + (x - mu).pow(2) / log_var.exp()),
        dim=-1,
    )

def negative_elbo_bound(x_hat, x, mu, log_var):
    """Negative ELBO: reconstruction loss plus analytic KL(q || N(0, I))."""
    reconstruction_loss = F.binary_cross_entropy(x_hat, x, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction_loss + kl_divergence
```
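The closed-form KL term used above can be sanity-checked against a Monte Carlo estimate. A numpy sketch with made-up parameters:

```python
import numpy as np

# Check: the analytic KL term -0.5 * (1 + log_var - mu^2 - exp(log_var))
# should match a Monte Carlo estimate of KL(N(mu, var) || N(0, 1)).
rng = np.random.default_rng(0)
mu, log_var = 0.8, np.log(0.4)

kl_closed = -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

z = rng.normal(mu, np.exp(0.5 * log_var), size=1_000_000)  # z ~ q
log_q = -0.5 * (np.log(2 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var))
log_p = -0.5 * (np.log(2 * np.pi) + z**2)
kl_mc = np.mean(log_q - log_p)  # E_q[log q(z) - log p(z)]
```

The two values agree up to sampling noise, confirming the algebraic formula that makes VAE training cheap: no samples are needed for the KL term at all.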

Training Considerations

Optimization: Adam optimizer typically works well with learning rates around 1e-3 to 1e-4.

KL Annealing: Gradually increase the KL weight from 0 to 1 during training to avoid posterior collapse:

$$\mathcal{L}_{annealed} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta(t) \cdot D_{KL}(q_\phi(z|x) || p(z))$$

where $\beta(t)$ increases from 0 to 1 over training.
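One possible schedule is a simple linear ramp (an illustrative sketch; cyclical and sigmoid ramps are also used in practice, and the warmup length below is an arbitrary choice):

```python
def beta_schedule(step, warmup_steps=10_000):
    """beta(t): ramps linearly from 0 to 1 over `warmup_steps`, then stays at 1."""
    return min(1.0, step / warmup_steps)
```

During training, the returned value multiplies the KL term of the loss at each step.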

Part VIII: Connecting Theory to Practice

From MLE to VAE

The progression we’ve explored connects directly:

  1. MLE: $\max_\theta \sum_i \log p_\theta(x^{(i)})$
  2. Latent Variables: $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ (intractable)
  3. Variational Approximation: $\max_{\theta,\phi} \mathcal{L}(\theta,\phi)$ (tractable ELBO)

Each step addresses a fundamental limitation of the previous approach while maintaining theoretical rigor.

The Power of the Framework

This variational framework enables principled generation (sample $z \sim p(z)$ and decode), learned low-dimensional representations of data, and extensions to semi-supervised and conditional settings.

Real-World Applications

These techniques power applications such as image generation and editing, anomaly detection via reconstruction likelihood, and representation learning for downstream tasks.

Conclusion: The Elegant Mathematical Journey

Our journey from Maximum Likelihood to Variational Inference reveals the elegant mathematical progression underlying modern generative modeling. We started with the fundamental principle of fitting models to data, encountered the challenge of hidden structure, and developed sophisticated approximation techniques to make complex models tractable.

Key Insights:

  1. MLE provides the optimization foundation - maximizing likelihood is equivalent to minimizing KL divergence with the true distribution

  2. Latent variables capture hidden structure - real data is generated by unobservable factors that we must model explicitly

  3. Variational inference makes complexity tractable - by approximating intractable posteriors, we can optimize complex models efficiently

  4. The ELBO unifies reconstruction and regularization - balancing data fidelity with model simplicity emerges naturally from the mathematical framework

  5. Extensions provide practical control - β-VAE, IWAE, and SSVAE show how mathematical insights translate to practical improvements

The mathematics isn’t just theoretical abstraction - it’s the foundation that makes modern generative AI possible. From the images created by diffusion models to the text generated by large language models, the principles of maximum likelihood, latent variables, and variational inference continue to drive innovation in machine learning.

This mathematical journey continues to evolve, with new developments in normalizing flows, diffusion models, and transformer architectures all building upon these foundational concepts. Understanding this progression provides the theoretical grounding needed to both use and extend the cutting edge of generative modeling.

