Generative vs. discriminative models
Posted by Jeremy Minton

There is a conventional explanation of discriminative and generative classification models; a good example is Chapter 4 of Bishop’s Pattern Recognition and Machine Learning 1. As a quick summary:

Discriminative models predict the probability of a label for a given data point:

\[p(y|\mathbf{x}).\]

With such probabilities predicted, it is easy to perform classification by assigning the label with the greatest probability.

Generative models predict the probability of a data point occurring, jointly with or conditioned on the label:

\[p(\mathbf{x}, y) \quad \textrm{or}\quad p(\mathbf{x}|y).\]

These two cases are related by \(p(\mathbf{x},y)=p(\mathbf{x}|y)p(y)\), where \(p(y)\) is the prior probability of the classes, usually estimated by label frequencies in the dataset.

Classification can be performed by conditioning the distribution on the observed data point:

\[p(y|\mathbf{x}) = \frac{p(\mathbf{x}, y)}{p(\mathbf{x})} = \frac{p(\mathbf{x}, y)}{\sum_y p(\mathbf{x}, y)}.\]
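To make the conditioning step concrete, here is a minimal numpy sketch using a made-up joint probability table over one discrete feature and a binary label (the numbers are illustrative, not from any dataset):

```python
import numpy as np

# Hypothetical joint probabilities p(x, y) for a single discrete feature
# x in {0, 1, 2} (rows) and label y in {0, 1} (columns); values sum to 1.
joint = np.array([
    [0.10, 0.05],
    [0.20, 0.15],
    [0.05, 0.45],
])

# Condition on an observed data point x by normalising its row:
# p(y | x) = p(x, y) / sum_y p(x, y).
x_observed = 2
p_y_given_x = joint[x_observed] / joint[x_observed].sum()
print(p_y_given_x)            # [0.1, 0.9]
print(p_y_given_x.argmax())   # predicted label: 1
```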

Of course, for implementation, each of these probability distributions will be estimated with a model whose parameters or weights need to be fit.

Generative example

The Gaussian Naive Bayes classifier uses Bayes’ rule to calculate the label probability from a generative model that relies on the naive assumption that, given the label, each feature is drawn independently from its own Gaussian distribution.

\[p(y|\mathbf{x}) = p(y)\frac{p(\mathbf{x}|y)}{p(\mathbf{x})} = \frac{N_y}{N}\frac{\mathcal{N}(\mathbf{x}; \mu_y, \sigma_y)}{\sum_{k}\frac{N_k}{N}\mathcal{N}(\mathbf{x}; \mu_k, \sigma_k)},\]

where \(N_y\) is the number of training examples with label \(y\) out of \(N\) in total.
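As a rough sketch of how this is implemented, the code below fits the per-class means, variances and priors with plain numpy and applies Bayes’ rule in log space; the toy data, variance floor and function names are my own choices rather than anything from the post:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit per-class feature means, variances and priors (the generative model)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0) + 1e-9, len(Xk) / len(X))
    return params

def predict_proba(x, params):
    """p(y | x) via Bayes' rule under the naive (independent Gaussian) assumption."""
    log_joint = {}
    for k, (mu, var, prior) in params.items():
        # log p(x | y=k) + log p(y=k)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_joint[k] = log_lik + np.log(prior)
    # Normalise: p(y | x) = p(x, y) / sum_k p(x, k), done stably in log space.
    logs = np.array(list(log_joint.values()))
    probs = np.exp(logs - logs.max())
    return dict(zip(log_joint.keys(), probs / probs.sum()))

# Toy data (hypothetical): two 2D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict_proba(np.array([2.5, 2.5]), fit_gaussian_nb(X, y)))
```

In practice one would usually reach for an existing implementation such as scikit-learn’s `GaussianNB`, which fits the same per-class statistics.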

Discriminative example

By contrast, logistic regression models the log odds of the positive class as a linear function of the features, hence modelling the class probability directly:

\[p(y=1|\mathbf{x}) = \frac{1}{1 + \exp{\left(-\left(\mathbf{a}\cdot\mathbf{x} + b\right)\right)}}.\]
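A minimal prediction sketch, assuming the weights \(\mathbf{a}\) and \(b\) have already been fitted (the values below are placeholders, not a trained model):

```python
import numpy as np

def predict_proba(x, a, b):
    """Logistic regression: p(y=1 | x) is the sigmoid of a linear function of x."""
    log_odds = a @ x + b                    # the modelled log odds
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical fitted weights for a 2D feature vector.
a = np.array([1.5, -0.7])
b = -0.2
print(predict_proba(np.array([0.4, 1.0]), a, b))   # ~0.43
```

Fitting those weights is typically done by minimising the binary cross-entropy, which is exactly the loss examined further down.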

Pros and Cons

  • Generative models are useful for unsupervised machine learning tasks; discriminative models are useful for supervised tasks.
  • Generative models are robust to missing or incomplete data, since unobserved features can be marginalised out (see the sketch after this list).
  • Discriminative models are computationally less intensive.
  • Discriminative models are more robust to outliers.
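To illustrate the missing-data point for the Gaussian Naive Bayes model above: because the features are modelled as conditionally independent given the class, marginalising out an unobserved feature simply drops its likelihood term. A minimal sketch, with hypothetical per-class parameters and a NaN marking the missing value:

```python
import numpy as np

def log_likelihood(x, mu, var):
    """Per-feature Gaussian log-likelihoods; NaN marks a missing feature."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    # Marginalising an independent feature out of the joint just drops its term,
    # because its Gaussian integrates to one.
    return np.where(np.isnan(x), 0.0, ll).sum()

# Hypothetical per-class (mean, variance, prior) for two features.
params = {0: (np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.5),
          1: (np.array([3.0, 3.0]), np.array([1.0, 1.0]), 0.5)}

x = np.array([2.5, np.nan])   # second feature unobserved
log_joint = np.array([log_likelihood(x, mu, var) + np.log(p)
                      for mu, var, p in params.values()])
probs = np.exp(log_joint - log_joint.max())
print(probs / probs.sum())    # posterior using only the observed feature
```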

A more rigorous analysis

Given we’ve transformed a generative model into a discriminative model, with

\[p(y|\mathbf{x}) = \frac{p(\mathbf{x}, y)}{\sum_y p(\mathbf{x}, y)},\]

aren’t generative models just a subset of discriminative models? Perhaps a subset imbued with useful properties? And can any discriminative model be reformulated as a generative model, even if not in closed form?

A short article by Minka 2, expanded in a later paper 3, systematically explores this idea. There is also a nice post on the Passing Thoughts blog 4 that presents the results as factor graphs.

The argument is thus:

The task we are interested in is to predict the label \(y\) given some feature data \(\mathbf{x}\). In addition, there is some labelled training data \(D = \{\mathbf{X}, Y\}\).

To achieve this, the ideal probability distribution is approximated with a model that has parameters, \(\theta\), fitted to the data. Ideally, the label predictions would come from marginalising over those parameters but, in practice, this is often approximated with the MAP (maximum a posteriori) estimate:

\[p(y | \mathbf{x}, D) = \int_\theta p(y | \mathbf{x}, \theta)\, p(\theta | D)\, d\theta \approx p(y | \mathbf{x}, \theta_\mathrm{MAP}).\]

The posterior distribution over the parameters can be computed as

\[p(\theta | D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta, D)}{\int_\theta p(\theta, D)d\theta}\]

and the MAP estimate is

\[\theta_\mathrm{MAP} = \operatorname*{arg\,max}_{\theta \in \Theta} p(\theta | D) = \operatorname*{arg\,max}_{\theta \in \Theta} p(\theta, D),\]

since the denominator \(p(D)\) does not depend on \(\theta\).
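As a toy illustration of the difference between marginalising and the MAP plug-in (my own example, not one from the post or the references), consider a Beta-Bernoulli model, where both predictions have closed forms:

```python
# Toy Beta-Bernoulli example: with a Beta(a, b) prior on the head probability
# theta and k heads in n flips, the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0          # prior pseudo-counts (arbitrary assumptions)
k, n = 7, 10             # observed data D

# Marginalising over theta: p(y=1 | D) = E[theta | D] = (a + k) / (a + b + n)
p_marginal = (a + k) / (a + b + n)

# MAP plug-in: theta_MAP = (a + k - 1) / (a + b + n - 2), then p(y=1 | D) ~ theta_MAP
theta_map = (a + k - 1) / (a + b + n - 2)

print(p_marginal, theta_map)   # 0.642..., 0.666... -- close but not identical
```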

We can now propose two models for this joint probability: a generative model

\[\max_{\theta_g \in \Theta_g} \quad p_\mathrm{g}(\theta_g, D) = \max_{\theta_g \in \Theta_g} \quad p_{x\mathrm{g}}(\mathbf{X} | Y, \lambda) p_{y\mathrm{g}}(Y | \pi) p_{\theta\mathrm{g}}(\lambda, \pi), \tag{1} \label{generative-factorisation}\]

where \(\theta_g := \{\lambda, \pi\}\), and a discriminative model

\[\max_{\theta_d \in \Theta_d} \quad p_\mathrm{d}(\theta_d, D) = \max_{\theta_d \in \Theta_d} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi) p_{x\mathrm{d}}(\mathbf{X} | \psi) p_{\theta\mathrm{d}}(\phi, \psi), \tag{2} \label{discriminative-factorisation}\]

where \(\theta_d := \{\phi, \psi\}\).

Now by defining the discriminative model in terms of the generative model

\[p_{y\mathrm{d}}(y | \mathbf{x}, \phi) = \frac{p_\mathrm{g}(\mathbf{x}, y| \phi)}{\sum_{y} p_{\mathrm{g}}(\mathbf{x}, y | \phi)} \tag{3} \label{label-distribution}\]

and

\[p_{x\mathrm{d}}(\mathbf{x} | \psi) = \sum_{y} p_{\mathrm{g}}(\mathbf{x}, y | \psi), \tag{4} \label{feature-distribution}\]

we see that \(\phi \in \Theta_g\) and \(\psi \in \Theta_g\), so \(\theta_d \in \Theta_g^2\), and therefore a discriminative model has (roughly) twice as many parameters as a generative one. We can then conclude that it has lower statistical bias but greater uncertainty in the parameters.
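As a sanity check on equation \eqref{label-distribution}, and on the idea that generative models are a subset of discriminative ones, the sketch below builds a toy generative model with equal-variance Gaussian class conditionals (all parameter values are my own arbitrary choices) and confirms numerically that normalising the joint over labels reproduces exactly the logistic form of the earlier discriminative example:

```python
import numpy as np

# Hypothetical generative model: two classes with unit-variance (shared)
# Gaussian class conditionals and equal priors.
mu0, mu1, var, prior1 = 0.0, 3.0, 1.0, 0.5

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def p_y1_given_x(x):
    """Equation (3): normalise the joint over labels."""
    joint1 = prior1 * gaussian(x, mu1, var)
    joint0 = (1 - prior1) * gaussian(x, mu0, var)
    return joint1 / (joint0 + joint1)

def sigmoid_of_linear(x):
    """The same conditional written as a logistic function of x."""
    w = (mu1 - mu0) / var
    b = (mu0 ** 2 - mu1 ** 2) / (2 * var) + np.log(prior1 / (1 - prior1))
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

xs = np.linspace(-2, 5, 7)
print(np.allclose(p_y1_given_x(xs), sigmoid_of_linear(xs)))   # True
```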

Personally, there are two steps here that I need to convince myself of: first, that discriminative models do, in fact, maximise the joint probability, including the implicit treatment of \(p_{x\mathrm{d}}\); and second, that it is reasonable to represent each discriminative distribution with the generative model, in particular whether \(p_{x\mathrm{d}}\) is free to be defined given its implicit treatment, and whether \(p_{y\mathrm{d}}\) can be represented in that form.

Do discriminative models maximise the joint probability?

Firstly, most discriminative models minimise a loss between predicted and observed labels. Is this equivalent to maximising the joint probability \(p_\mathrm{d}(D, \theta_d)\)?

Breaking this problem down, let’s consider whether maximising the conditional probability is equivalent to maximising the joint probability. That is, do

\[\phi^* = \operatorname*{arg\,max}_{\phi \in \Phi} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi)\]

and

\[\{\phi^*, \psi^*\} = \operatorname*{arg\,max}_{\theta_d \in \Theta_d} p_\mathrm{d}(D, \theta_d)\]

produce the same \(\phi^*\)?

Assuming the two sets of parameters are independent, \(p_{\theta \mathrm{d}}(\phi, \psi) = p_{\phi \mathrm{d}}(\phi)\,p_{\psi \mathrm{d}}(\psi)\), equation \eqref{discriminative-factorisation} can be factored as

\[\max_{\theta_d \in \Theta_d} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi)\, p_{x\mathrm{d}}(\mathbf{X} | \psi)\, p_{\theta\mathrm{d}}(\phi, \psi) = \max_{\phi \in \Phi} \quad p_{y\mathrm{d}}(Y|\mathbf{X}, \phi)\, p_{\phi\mathrm{d}}(\phi) \quad \max_{\psi \in \Psi} \quad p_{x\mathrm{d}}(\mathbf{X}|\psi)\, p_{\psi\mathrm{d}}(\psi) \tag{5} \label{discriminative-prior}\]

This shows that \(\phi\) and \(\psi\) can be maximised independently; hence maximising the joint probability is equivalent to maximising the discriminative conditional probability, weighted by the parameter prior \(p_{\phi\mathrm{d}}(\phi)\).

Next we compare maximising the conditional probability with minimising an error. As a simpler exercise, consider the binary cross-entropy loss

\[\min_{\phi \in \Phi} \mathcal{L}_\mathrm{BCE} = \min_{\phi \in \Phi} \left( - \sum_{\{\bar{y}, \bar{\mathbf{x}}\} \in D} \left( (1 - \bar{y})\log{\left(1 - p(y=1|\bar{\mathbf{x}}, \phi)\right)} + \bar{y} \log{p(y=1|\bar{\mathbf{x}}, \phi)} \right) \right),\]

which can be shown to be equivalent to maximising the conditional probability (and hence, by the argument above, the joint) by separating terms where \(\bar{y}=0\) and \(\bar{y}=1\), and exploiting the fact that the location of an extremum is unchanged under a logarithm:

\[\begin{align*} \min_{\phi \in \Phi} \mathcal{L}_\mathrm{BCE} &= \min_{\phi \in \Phi} \quad -\left( \sum_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=0} \log{\left(1 - p(y=1|\bar{\mathbf{x}}, \phi)\right)} + \sum_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=1} \log{p(y=1|\bar{\mathbf{x}}, \phi)} \right) \\ &= \max_{\phi \in \Phi} \quad \left( \prod_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=0} \left(1 - p(y=1|\bar{\mathbf{x}}, \phi)\right) \prod_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=1} p(y=1|\bar{\mathbf{x}}, \phi) \right) \\ &= \max_{\phi \in \Phi} \quad \left( \prod_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=0} p(y=0|\bar{\mathbf{x}}, \phi) \prod_{\{\bar{y}, \bar{\mathbf{x}}\} \in D: \bar{y}=1} p(y=1|\bar{\mathbf{x}}, \phi) \right) \\ &= \max_{\phi \in \Phi} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi) \end{align*}\]
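As a quick numerical sanity check of this equivalence (with made-up labels and predicted probabilities rather than a fitted model), the binary cross-entropy equals the negative log of the product of per-point conditional probabilities of the observed labels:

```python
import numpy as np

rng = np.random.default_rng(1)
y_bar = rng.integers(0, 2, size=20)          # observed labels
p1 = rng.uniform(0.01, 0.99, size=20)        # model's p(y=1 | x, phi) for each point

# Binary cross-entropy, summed over the dataset.
bce = -np.sum((1 - y_bar) * np.log(1 - p1) + y_bar * np.log(p1))

# Log of the product of per-point conditional probabilities of the observed labels.
log_cond_lik = np.sum(np.log(np.where(y_bar == 1, p1, 1 - p1)))

print(np.isclose(bce, -log_cond_lik))        # True: minimising BCE maximises the likelihood
```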

I assume this result would generalise to multi-class and regression settings with the corresponding loss functions. However, other loss functions are obviously not equivalent. This explains why binary cross-entropy is the go-to choice, but what does it say about these results in other cases? Are they true by analogy, at least enough for intuition?

Can a discriminative model be written in terms of a generative model?

Trivially, a discriminative model can be defined from any generative model using equation \eqref{label-distribution}, but is the inverse also true: is there an equivalent generative distribution for any discriminative model? If the discriminative model is a proper probability distribution then it must follow the rules of probability and hence can be rewritten as such, although likely not in closed form. If it is not a proper probability distribution then, like non-BCE loss functions, these results may or may not hold. That said, one would hope that models are chosen to approximate probabilities, so these results are, at least, applicable by analogy.

Regarding the data distribution term, this is not explicitly defined in a discriminative model. In fact, the consequence of equation \eqref{discriminative-prior} is that the discriminative part of the model can be maximised without any constraints on this term. We are therefore free to define its form to conform to equation \eqref{feature-distribution} for the analysis; however, this places additional restrictions on the discriminative model and so only considers a subset of possible solutions. This term would be more accurately described as the non-parametrised distribution, which can be made obvious by rewriting equation \eqref{discriminative-factorisation} with this independence in mind:

\[\max_{\phi \in \Phi} \quad p_\mathrm{d}(D, \phi) = \max_{\phi \in \Phi} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi)\, p_{x\mathrm{d}}(\mathbf{X})\, p_{\phi\mathrm{d}}(\phi) \propto \max_{\phi \in \Phi} \quad p_{y\mathrm{d}}(Y | \mathbf{X}, \phi)\, p_{\phi\mathrm{d}}(\phi).\]

Consequently, the ‘Generative models consider a more restrictive parameter space’ section in the Passing Thoughts post is comparing generative models to discriminative models with additional constraints. Of course, lifting constraints from the less restrictive model would only make its conclusions stronger, so the argument is directionally correct and useful for developing insight.

Struggles with symmetries

By the symmetry of Bayes’ rule, can’t the generative model formulation of equation \eqref{generative-factorisation} be substituted for the discriminative one and the whole argument run in reverse? Well, doing just that gets as far as equations \eqref{label-distribution} and \eqref{feature-distribution}, which would still rely on generative probabilities, only with labels instead of features marginalised over. Attempts at further substitution become rather circular, or are left wanting for a defined feature distribution term \(p_{x \mathrm{d}}(\mathbf{x} | \psi)\).

An alternative symmetry to explore is to consider a data imputation problem, where none of the data columns is designated as the label. Framing this another way: build a model to predict a feature column instead of the label column. The analysis still applies to this construction, and it becomes apparent that a discriminative model could be considered any model that predicts only a subset of the data columns, while a generative model predicts all of them. Perhaps this is another dimension along which to explore hybrid models: predict the labels and some of the features? But maybe that wouldn’t be useful in practice.

Conclusion

I think the best mental model from this analysis is to consider the coupling/independence of the priors in equation \eqref{discriminative-prior}.

Substituting in the generative forms of the discriminative distributions gives

\[p_\mathrm{d}(D, \theta_d) = p_{\theta\mathrm{d}}(\phi, \psi)\frac{p_\mathrm{g}(\mathbf{x}, y| \phi)}{\sum_{y} p_{\mathrm{g}}(\mathbf{x}, y | \phi)} \sum_{y} p_{\mathrm{g}}(\mathbf{x}, y | \psi).\]

If the prior enforces \(\phi=\psi\) then this reduces to a generative model, but if the prior is that these parameter sets are independent, then the data distribution just becomes a constant factor when determining \(\phi\).

An interpretation of this is that the label and feature distribution terms are each a source of data for determining the model parameters. With complete coupling, both the labels and the features are used to fit the parameters; if they are independent, only the label data is used, because the feature data determines \(\psi\), which plays no part in inference.

Future work

Having worked through the details of Minka’s analysis, there are definitely some possible limitations, such as loss functions or model parametrisations that don’t correspond to proper probability distributions. Exploring some of these exceptions and their implications could be interesting, although likely hard work.

Otherwise, I would like to learn more about the hybrid models, with examples and applications.

References

  1. Bishop, C.M.; 2006; Pattern Recognition and Machine Learning; Springer. 

  2. Minka, T.P.; 2005; Discriminative models, not discriminative training; Microsoft Research Technical Report MSR-TR-2005-144. 

  3. Lasserre, J.A.; Bishop, C.M.; Minka, T.P.; 2006; Principled Hybrids of Generative and Discriminative Models; Conference on Computer Vision and Pattern Recognition (CVPR). 

  4. Generative vs. discriminative models; Passing Thoughts blog.