
From Exponential Complexity to Chain Rules: Understanding Autoregressive Generative Models


Recently, I’ve been diving deep into the world of generative modeling, and I wanted to share the key insights I’ve gained about one of the most fundamental approaches: autoregressive models. This post serves as both my personal study notes and hopefully a useful resource for others trying to understand the theoretical foundations that connect everything from classical Bayesian networks to modern language models like GPT.


The Exponential Wall: Why Naive Approaches Fail

Let’s start with a seemingly simple question: how do we represent a joint probability distribution over $n$ discrete random variables?

For $n$ binary variables $X_1, X_2, \ldots, X_n$, a naive approach would store the probability for every possible configuration. But here’s the catch: we need $2^n - 1$ independent parameters (the $-1$ comes from the normalization constraint that all probabilities sum to 1).

This is exponential in the number of variables. For just 20 binary variables, we’d need over a million parameters. For 300 variables, the count would exceed the estimated number of atoms in the observable universe (about $10^{80}$).
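These counts are easy to verify in plain Python:

```python
# Independent parameters for a full joint over n binary variables: 2**n - 1.
for n in [10, 20, 300]:
    print(n, 2**n - 1)

# 2**20 - 1 = 1,048,575: already over a million parameters.
# 2**300 - 1 has 91 digits, dwarfing the ~10**80 atoms in the observable universe.
```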

This exponential explosion is often called the “curse of dimensionality,” and it’s the first major obstacle any generative model must overcome.

The Chain Rule: Our Universal Tool

The solution lies in one of probability theory’s most fundamental identities: the chain rule of probability.

For any joint distribution, we can always write:

$$
\begin{aligned}
p(x_1, x_2, \ldots, x_n) &= p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1}) \\
&= \prod_{i=1}^n p(x_i \mid x_{<i})
\end{aligned}
$$

where $x_{<i} = (x_1, \ldots, x_{i-1})$.

This factorization is always exact - no approximation involved. The key insight is that we’ve transformed the problem from modeling a massive joint distribution to modeling a sequence of conditional distributions.
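We can sanity-check that the factorization is exact on a tiny example: a random joint over three binary variables, stored as a full table (a minimal numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint distribution over three binary variables, as a 2x2x2 table.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def chain_rule(x1, x2, x3):
    """Chain-rule factorization p(x1) * p(x2|x1) * p(x3|x1,x2)."""
    p_x1 = joint.sum(axis=(1, 2))[x1]
    p_x2_given_x1 = joint.sum(axis=2)[x1, x2] / joint.sum(axis=(1, 2))[x1]
    p_x3_given_x12 = joint[x1, x2, x3] / joint.sum(axis=2)[x1, x2]
    return p_x1 * p_x2_given_x1 * p_x3_given_x12

# The factorization is exact: it reproduces every entry of the joint table.
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            assert np.isclose(chain_rule(x1, x2, x3), joint[x1, x2, x3])
```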

From Exponential to Linear: Conditional Independence

But we’re not done yet. Each conditional $p(x_i \mid x_{<i})$ could still require exponential parameters if we model all dependencies. This is where conditional independence assumptions become crucial.

Bayesian Networks make this concrete. If we assume that each variable $X_i$ only depends on a small set of parents $\text{Pa}(i)$, then:

$$
p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_{\text{Pa}(i)})
$$

With clever conditional independence assumptions, we can dramatically reduce parameter counts. For example, if each binary variable depends on at most one parent, the root variable needs 1 parameter and every other variable needs 2 (one per parent value), for a total of $2n - 1$.

That’s a reduction from $2^n - 1$ (exponential) to $2n - 1$ (linear) - a massive improvement!
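The counting argument can be sketched in a few lines; the helper `bayes_net_params` below is my own illustration, not a standard library function:

```python
def bayes_net_params(parents):
    """Independent parameters for binary variables: each CPT row needs one
    number, and a variable with k parents has 2**k rows."""
    return sum(2 ** len(p) for p in parents)

n = 20
chain = [[]] + [[i - 1] for i in range(1, n)]  # each variable: at most one parent
full = [list(range(i)) for i in range(n)]      # x_i depends on all of x_<i

print(bayes_net_params(chain))  # 2n - 1 = 39
print(bayes_net_params(full))   # 2**n - 1 = 1048575
```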

The Autoregressive Family: From Simple to Sophisticated

This chain rule factorization is the foundation for the entire family of autoregressive models. Let me walk through the evolution from simple to state-of-the-art:

Fully Visible Sigmoid Belief Networks (FVSBN)

The simplest approach is the Fully Visible Sigmoid Belief Network (FVSBN). It’s called “fully visible” because there are no hidden layers - all variables are observed/visible. This is essentially a pure autoregressive model where each conditional is modeled as logistic regression:

$$
p(X_i = 1 \mid x_{<i}) = \sigma\left(\alpha_0^i + \sum_{j=1}^{i-1} \alpha_j^i x_j\right)
$$

Image Generation Process: To generate a 28×28 MNIST digit, we sample pixels sequentially:

  1. Sample pixel 1: $x_1 \sim \text{Bernoulli}(\sigma(\alpha_0^1))$
  2. Sample pixel 2: $x_2 \sim \text{Bernoulli}(\sigma(\alpha_0^2 + \alpha_1^2 x_1))$
  3. Continue until pixel 784: $x_{784} \sim \text{Bernoulli}(\sigma(\alpha_0^{784} + \sum_{j=1}^{783} \alpha_j^{784} x_j))$

This gives us tractable parameters, exact likelihood computation, and sequential generation - but no hidden representations or explicit feature learning.
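The generation loop above can be sketched directly. The weights here are random stand-ins rather than a trained model, so the samples are noise, but the sequential mechanics are exactly FVSBN’s:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 784  # a flattened 28x28 binary image
# alpha[i] holds (bias alpha_0^i, weights alpha_1^i ... alpha_{i-1}^i);
# toy random values standing in for learned parameters.
alpha = [rng.normal(scale=0.1, size=i + 1) for i in range(n)]

x = np.zeros(n)
for i in range(n):
    logit = alpha[i][0] + alpha[i][1:] @ x[:i]  # alpha_0^i + sum_j alpha_j^i x_j
    x[i] = rng.random() < sigmoid(logit)        # x_i ~ Bernoulli(sigma(logit))
print(x[:10])
```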

Neural Autoregressive Distribution Estimator (NADE)

NADE [1] introduced a clever parameter-sharing scheme that reduces overfitting while maintaining expressiveness:

$$
\begin{aligned}
h_i &= \sigma(W_{\cdot, <i} \, x_{<i} + c) \\
\hat{x}_i &= p(x_i \mid x_1, \ldots, x_{i-1}) = \sigma(\alpha_i h_i + b_i)
\end{aligned}
$$

FVSBN vs NADE Architecture

Figure: Architectural comparison between FVSBN (left) and NADE (right). FVSBN uses direct connections from all previous variables (v₁, v₂, v₃) to each output, resulting in $O(n^2)$ parameters. NADE introduces shared hidden layers (h₁, h₂, h₃, h₄) with weight tying: the blue arrows show how the same weight matrix $W$ is reused across positions, dramatically reducing parameters to $O(n)$ (for fixed hidden size $d$) while maintaining expressiveness. (Source: Larochelle & Murray, 2011) [1]

Key insight: instead of learning separate parameters for each conditional as FVSBN does, NADE uses weight tying. The architectural difference is clear from the figure: FVSBN has direct connections from all previous variables to each output, while NADE routes every conditional through a hidden layer whose weight matrix $W$ is shared across positions, dramatically reducing parameters while improving generalization.

Concrete example: for MNIST ($n = 784$ pixels) with $d = 500$ hidden units, FVSBN needs $n(n+1)/2 \approx 308\text{K}$ separate logistic-regression weights, while NADE shares one matrix $W \in \mathbb{R}^{d \times n}$ across all positions, so its parameter count scales as $O(nd)$ rather than $O(n^2)$.
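A minimal sketch of the NADE forward pass, again with random stand-in weights. The accumulator `a` implements the standard recurrence $a_{i+1} = a_i + W_{\cdot,i} x_i$, which is what lets the shared weights compute all $n$ conditionals in $O(nd)$ total time:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 784, 500
W = rng.normal(scale=0.01, size=(d, n))  # shared: only columns < i feed h_i
c = np.zeros(d)
V = rng.normal(scale=0.01, size=(n, d))  # per-output weights (alpha_i)
b = np.zeros(n)

x = rng.integers(0, 2, size=n).astype(float)  # an observed binary input

# Compute every conditional p(x_i | x_<i) in one O(nd) sweep.
a = c.copy()
p_hat = np.empty(n)
for i in range(n):
    h_i = sigmoid(a)                       # h_i = sigma(W[:, :i] @ x[:i] + c)
    p_hat[i] = sigmoid(V[i] @ h_i + b[i])  # x_hat_i
    a += W[:, i] * x[i]                    # extend the conditioning set

print(p_hat[:5])
```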

Connection to Autoencoders

At first glance, NADE looks similar to a standard autoencoder - both have an encoder that creates hidden representations and a decoder that reconstructs the input. However, there’s a crucial difference:

Standard Autoencoder: the reconstruction $\hat{x}_i$ may depend on the entire input, including $x_i$ itself, so nothing stops the network from learning a near-identity mapping.

NADE (Autoregressive Autoencoder): each output $\hat{x}_i$ depends only on $x_{<i}$, so the outputs are valid conditional distributions whose product defines a proper joint.

The key insight is that a vanilla autoencoder is not a generative model - it doesn’t define a distribution over xx that we can sample from to generate new data points. NADE solves this by enforcing the autoregressive property, making it properly generative while learning useful hidden representations as a byproduct of modeling the conditional distributions.

Masked Autoencoder for Distribution Estimation (MADE)

MADE [2] solved a key computational inefficiency in NADE through a clever masking approach. While NADE can theoretically compute all conditional probabilities in parallel during training (since all input values $x_1, x_2, \ldots, x_n$ are known), the original implementation required separate forward passes for different conditioning sets.

MADE Architecture

Figure: MADE uses masked connections to transform a standard autoencoder into an autoregressive model. The masks (shown in center) ensure that output $\hat{x}_i$ only depends on inputs $x_1, \ldots, x_{i-1}$, preserving the autoregressive property while enabling parallel computation. (Source: Germain et al., 2015) [2]

MADE’s Key Innovation: apply binary masks to the weight matrices of an ordinary autoencoder so that the network as a whole respects an autoregressive ordering. A single forward pass then yields all $n$ conditionals at once.

The Masking Mechanism:

  1. Assign each hidden unit a number $m \in \{1, 2, \ldots, n-1\}$
  2. Hidden unit $h_j$ with number $m_j$ can only connect to inputs $x_1, \ldots, x_{m_j}$
  3. Output $\hat{x}_i$ can only connect to hidden units with $m_j < i$

This elegant solution made autoregressive modeling much more hardware-efficient and would later inspire the masked self-attention mechanism in Transformers.
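The three masking rules can be checked mechanically. This sketch builds the masks for a one-hidden-layer MADE and verifies that no output can be influenced by its own input or any later one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8

m_in = np.arange(1, n + 1)           # input x_j carries label j
m_hid = rng.integers(1, n, size=d)   # hidden labels drawn from {1, ..., n-1}

# Rule 2: hidden unit with label m may see inputs with label <= m.
M1 = (m_hid[:, None] >= m_in[None, :]).astype(float)                 # (d, n)
# Rule 3: output with label i may see hidden units with label < i.
M2 = (np.arange(1, n + 1)[:, None] > m_hid[None, :]).astype(float)   # (n, d)

# Dependency check: reach[i, j] is True iff input j can influence output i
# through some unmasked path. Output i must not reach inputs j >= i.
reach = (M2 @ M1) > 0
for i in range(n):
    assert not reach[i, i:].any()
print("autoregressive property holds")
```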

Modern Transformers

Today’s language models like GPT follow exactly the same principle, just with more sophisticated architectures: masked self-attention enforces the autoregressive ordering (each position attends only to earlier positions), each conditional $p(x_i \mid x_{<i})$ is a softmax over the token vocabulary, and training is plain maximum likelihood on next-token prediction.

Maximum Likelihood Learning and the KL Connection

How do we train these models? The standard approach is Maximum Likelihood Estimation (MLE), but there’s a beautiful connection to information theory that’s worth understanding deeply.

The Key Insight: Maximizing the likelihood of our data is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true data distribution and our model’s distribution.

Let’s unpack that.

The Math: From KL Divergence to Maximum Likelihood

The KL divergence between the true data distribution $P_{\text{data}}$ and our model’s distribution $P_\theta$ is defined as:

$$
D_{KL}(P_{\text{data}} \,\|\, P_\theta) = \mathbb{E}_{x \sim P_{\text{data}}} \left[ \log \frac{P_{\text{data}}(x)}{P_\theta(x)} \right]
$$

This formula measures the “divergence” of our model’s predictions from the true distribution. Let’s expand it:

$$
\begin{aligned}
D_{KL}(P_{\text{data}} \,\|\, P_\theta) &= \mathbb{E}_{x \sim P_{\text{data}}} [\log P_{\text{data}}(x) - \log P_\theta(x)] \\
&= \underbrace{\mathbb{E}_{x \sim P_{\text{data}}} [\log P_{\text{data}}(x)]}_{\text{negative entropy of the data}} - \underbrace{\mathbb{E}_{x \sim P_{\text{data}}} [\log P_\theta(x)]}_{\text{expected log-likelihood of the model}}
\end{aligned}
$$

The first term is the negative entropy of the data: a fixed value we cannot change. Therefore, to minimize the KL divergence, we must maximize the second term: $\mathbb{E}_{x \sim P_{\text{data}}} [\log P_\theta(x)]$.

Since we don’t know the true $P_{\text{data}}$, we use our training dataset $D = \{x^{(1)}, \ldots, x^{(m)}\}$ as an empirical sample. This turns the expectation into an average:

$$
\arg\max_\theta \, \mathbb{E}_{x \sim P_{\text{data}}} [\log P_\theta(x)] \;\approx\; \arg\max_\theta \, \frac{1}{|D|} \sum_{x \in D} \log P_\theta(x)
$$

This is exactly the objective for Maximum Likelihood Estimation.

Therefore: Minimizing KL divergence is the same as maximizing the log-likelihood.
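The equivalence is easy to see numerically. For a Bernoulli data distribution and a grid of model parameters, the $\theta$ that minimizes the KL divergence and the $\theta$ that maximizes the expected log-likelihood are the same point:

```python
import numpy as np

p_data = np.array([0.3, 0.7])          # true P(X=0), P(X=1)
thetas = np.linspace(0.01, 0.99, 99)   # candidate models: P(X=1) = theta

def kl(theta):
    q = np.array([1 - theta, theta])
    return np.sum(p_data * np.log(p_data / q))

def expected_loglik(theta):
    q = np.array([1 - theta, theta])
    return np.sum(p_data * np.log(q))

best_kl = thetas[np.argmin([kl(t) for t in thetas])]
best_ll = thetas[np.argmax([expected_loglik(t) for t in thetas])]
assert best_kl == best_ll   # both land at theta ~ 0.7, the true parameter
```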

An Intuitive Guide to Information Theory

For those new to information theory, let’s build an intuition for these concepts using a simple example.

Imagine two people, Alice and Bob. Alice is flipping a coin and Bob is trying to guess the outcome.

In our training scenario: the data distribution $P_{\text{data}}$ plays the role of Alice’s actual coin, and our model $P_\theta$ plays the role of Bob’s beliefs about it. The expected log-likelihood measures how unsurprised Bob is, on average, by what the coin actually does, and the KL divergence is exactly the extra surprise Bob suffers because his beliefs differ from reality.

The Asymmetry of KL Divergence: A Crucial Detail

A key property of KL divergence is that it’s asymmetric: $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$. This isn’t just a mathematical quirk; it has profound implications for how models are trained.

Let’s use our coin example again, with $P$ as the true distribution and $Q$ as the model:

Forward KL: $D_{KL}(P \,\|\, Q)$ (what we use in MLE). The expectation is taken under $P$, so $Q$ is heavily penalized wherever real outcomes occur but $Q$ assigns them low probability. The result is mode-covering behavior: $Q$ spreads out to cover everything $P$ can produce.

Reverse KL: $D_{KL}(Q \,\|\, P)$. The expectation is taken under $Q$, so $Q$ is penalized for placing mass where $P$ has little. The result is mode-seeking behavior: $Q$ can safely collapse onto a single mode of $P$.

Since we train generative models via Maximum Likelihood, we are implicitly using the forward KL divergence, $D_{KL}(P_{\text{data}} \,\|\, P_{\text{model}})$. This encourages our models to be broad and cover all the variations present in the training data. (This holds true for the models we’ve discussed, which are trained via MLE. As we’ll touch on in the final section, other prominent generative models like VAEs and GANs use different training objectives that can lead to different behaviors.)

Forward vs Reverse KL Divergence

Figure: A visual comparison of forward and reverse KL divergence. In forward KL ($D_{KL}(P_d \,\|\, P_g)$), the model $P_g$ (gray) must cover both modes of the true data distribution $P_d$ (blue) to avoid high penalties. In reverse KL ($D_{KL}(P_g \,\|\, P_d)$), the model can focus on a single mode of the data to avoid penalization for generating unlikely samples. (Source: Manisha & Gujar, 2018) [3]
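The asymmetry is easy to verify numerically. Here is a small discrete example: a bimodal “data” distribution and a unimodal “model” over four outcomes (both distributions are illustrative choices of mine):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

p_data = np.array([0.45, 0.05, 0.05, 0.45])   # bimodal "truth"
p_model = np.array([0.10, 0.40, 0.40, 0.10])  # unimodal "model"

forward = kl(p_data, p_model)   # penalizes missing the data's modes
reverse = kl(p_model, p_data)   # penalizes mass where data is rare
print(forward, reverse)         # the two values differ: KL is asymmetric
```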

A Fundamental Limitation: Forward vs Reverse Factorizations

Here’s something that surprised me: different factorization orders can represent different hypothesis spaces.

Consider factoring a joint distribution $p(x_1, x_2)$ in two ways:

$$
\begin{aligned}
\text{Forward:} \quad p(x_1, x_2) &= p(x_1) \, p(x_2 \mid x_1) \\
\text{Reverse:} \quad p(x_1, x_2) &= p(x_2) \, p(x_1 \mid x_2)
\end{aligned}
$$

If we use neural networks with Gaussian outputs for each conditional, these can represent different sets of distributions!

Example: Suppose $p(x_1) = \mathcal{N}(0, 1)$ and $p(x_2 \mid x_1) = \mathcal{N}(\mu_2(x_1), \epsilon)$ where $\mu_2(x_1) = 0$ if $x_1 \leq 0$ and $\mu_2(x_1) = 1$ otherwise.

The marginal becomes:

$$
p(x_2) = \int_{-\infty}^{\infty} p(x_1) \, p(x_2 \mid x_1) \, dx_1 = 0.5\,\mathcal{N}(0, \epsilon) + 0.5\,\mathcal{N}(1, \epsilon)
$$

This marginal $p(x_2)$ is a mixture of two Gaussians, which the single Gaussian $p(x_2)$ in the reverse factorization cannot represent. This shows that the hypothesis spaces are genuinely different.
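Sampling from the forward factorization makes the bimodality visible; here $\epsilon = 0.05$ is an arbitrary small variance for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05  # variance of p(x2 | x1); an illustrative choice

# Forward factorization: x1 ~ N(0, 1), then x2 | x1 ~ N(mu(x1), eps).
x1 = rng.normal(size=100_000)
mu = np.where(x1 <= 0, 0.0, 1.0)
x2 = rng.normal(loc=mu, scale=np.sqrt(eps))

# The marginal of x2 is a 50/50 mixture of N(0, eps) and N(1, eps):
# mass piles up near 0 and near 1, with a dip in between at 0.5 -
# a shape no single Gaussian can match.
near_0 = np.mean(np.abs(x2 - 0.0) < 0.3)
near_1 = np.mean(np.abs(x2 - 1.0) < 0.3)
mid = np.mean(np.abs(x2 - 0.5) < 0.1)
print(near_0, near_1, mid)
```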

From Theory to Practice: GPT-2 Implementation

Working with a GPT-2 model brings these concepts to life. The architecture follows our chain rule exactly:

  1. Token Embeddings: Each of 50,257 possible tokens gets a 768-dimensional embedding
  2. Transformer Layers: Self-attention and feed-forward networks process the sequence
  3. Output Layer: Projects back to 50,257-dimensional logits for $p(x_i \mid x_{<i})$
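Once we have per-position logits, scoring a whole sequence is just the chain rule: sum the log-softmax probability of each observed token. The logits below are random stand-ins (a real GPT-2 would compute them from $x_{<i}$), but the bookkeeping is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 50_257, 8

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Stand-in for a language model: at each position, logits over the next token.
logits = rng.normal(size=(seq_len, vocab))
tokens = rng.integers(0, vocab, size=seq_len)  # an observed token sequence

# Chain rule: log p(x_1, ..., x_n) = sum_i log p(x_i | x_<i)
log_probs = log_softmax(logits)
seq_log_prob = log_probs[np.arange(seq_len), tokens].sum()
print(seq_log_prob)
```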

Temperature Scaling

Temperature scaling provides an interesting way to control the sharpness of predictions:

$$
p_T(x_i \mid x_{<i}) \propto e^{\log p(x_i \mid x_{<i}) / T}
$$

The temperature parameter TT has the following effects:

$$
\begin{aligned}
T < 1 &: \text{ sharper distributions (more confident predictions)} \\
T = 1 &: \text{ original model distribution} \\
T > 1 &: \text{ smoother distributions (more diverse outputs)}
\end{aligned}
$$

Interestingly, applying temperature scaling token-by-token doesn’t recover joint temperature scaling:

$$
\prod_{i=0}^{M} p_T(x_i \mid x_{<i}) \neq p_T^{\text{joint}}(x_0, x_1, \ldots, x_M)
$$

This is another manifestation of how different factorizations can lead to different behaviors.
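A tiny two-token example makes the mismatch concrete (the distributions and $T = 0.5$ are arbitrary illustrative choices):

```python
import numpy as np

def temper(p, T):
    """Temperature-scale a distribution: p^(1/T), renormalized."""
    q = p ** (1.0 / T)
    return q / q.sum()

T = 0.5
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.9, 0.1],   # p(x2 | x1 = 0)
                          [0.2, 0.8]])  # p(x2 | x1 = 1)

joint = p_x1[:, None] * p_x2_given_x1   # p(x1, x2) via the chain rule

# Tempering the joint directly...
joint_T = temper(joint.ravel(), T).reshape(2, 2)

# ...vs. tempering each conditional and multiplying (what token-by-token
# sampling with temperature actually does):
tokenwise_T = temper(p_x1, T)[:, None] * np.array(
    [temper(row, T) for row in p_x2_given_x1])

print(np.allclose(joint_T, tokenwise_T))   # False: the distributions differ
```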

The Bigger Picture: Why This All Matters

What strikes me most about this journey through autoregressive models is how a single mathematical principle - the chain rule - connects such a wide range of techniques:

  1. Classical graphical models (Bayesian networks) use it with conditional independence assumptions
  2. Early neural approaches (FVSBN, NADE) parameterize the conditionals with neural networks
  3. Modern language models (GPT, etc.) scale up the same principle with massive architectures and data

The information-theoretic perspective provides the unifying learning framework: we’re always trying to minimize the “compression loss” when encoding real data using our model’s learned code.

Reflections and What’s Next

Understanding these fundamentals has been crucial for appreciating more advanced generative models. Variational autoencoders, GANs, and diffusion models all build on different approaches to the same core challenge: how to represent and learn complex, high-dimensional distributions.

It’s also crucial to recognize that the Maximum Likelihood approach, and its connection to forward KL divergence ($D_{KL}(P_{\text{data}} \,\|\, P_{\text{model}})$), is characteristic of autoregressive models but not universal. Other major families of generative models are defined by their avoidance of direct likelihood maximization. For instance, Variational Autoencoders (VAEs) optimize a lower bound on the likelihood (the ELBO) which involves a reverse KL term, leading to different model behavior. Generative Adversarial Networks (GANs) bypass likelihood entirely, instead training via a minimax game whose objective relates to Jensen-Shannon divergence. These different training paradigms are a key reason why the “generative zoo” is so diverse and fascinating.

The autoregressive approach is particularly elegant because it’s:

- Exact: the chain rule involves no approximation, so the likelihood is available in closed form
- Simple: every model in the family, from FVSBN to GPT, parameterizes the same factorization
- Scalable: the identical principle carries from tiny Bayesian networks to billion-parameter Transformers

For anyone diving into generative modeling, I’d recommend really understanding this foundation before moving to more complex approaches. The chain rule isn’t just a mathematical identity - it’s the key that unlocks scalable generative modeling.


This post represents my understanding from working through the foundational concepts of generative modeling. If you spot any errors or have questions, I’d love to hear from you!

References


  1. Larochelle, H., & Murray, I. (2011). The Neural Autoregressive Distribution Estimator. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15, 29-37.

  2. Germain, M., Gregor, K., Murray, I., & Larochelle, H. (2015). MADE: Masked Autoencoder for Distribution Estimation. arXiv preprint arXiv:1502.03509.

  3. Manisha, P., & Gujar, S. (2018). Generative Adversarial Networks (GANs): What it can generate and What it cannot?. arXiv preprint arXiv:1804.00140.

