# Product of experts

Max Welling (2007), Scholarpedia, 2(10):3879. | doi:10.4249/scholarpedia.3879 | revision #91672 [link to/cite this article] |

A **Product of Experts** model (PoE) (Hinton 2002) combines a
number of individual component models (the experts) by taking
their product and normalizing the result. Each expert is defined
as a possibly unnormalized probabilistic model \(f(x)\)
over its input space.

\[\tag{1} P(x|\{\theta_j\})=\frac{1}{Z}\prod_{j=1}^M f_j(x|\theta_j) \]

with \[ Z=\int\mbox{d}x~\prod_{j=1}^M f_j(x|\theta_j)\]

PoEs stand in contrast to *Mixture Models* which combine expert
models additively,

\[\tag{2} P(x|\{\theta_j\})=\sum_{j=1}^M \alpha_j p_j(x|\theta_j) \]

where each component model \(p_j(x)\) is normalized over \(x\) and \( \sum_{j=1}^M ~\alpha_j = 1 \)

Note that *Mixture of Expert Models* are usually associated with
conditional models where the experts are of the form
\(p(y|x)\) and the mixture coefficients \(\alpha(x)\) (known as gating
functions) may depend on \(x\) as well. Conditional PoEs may be defined as well.

One can qualitatively understand the difference between mixtures and products by observing that a mixture distribution can have high probability for event \(x\) when only a single expert assigns high probability to that event. In contrast, a product can only have high probability for an event \(x\) when no expert assigns an especially low probability to that event. Hence, metaphorically speaking, a single expert in a mixture has the power to pass a bill while a single expert in a product has the power to veto it.

Put another way, each component in a product represents a soft
*constraint*, while each expert in a mixture represents a soft
template or prototype. For an event to be likely under a product
model, all constraints must be (approximately) satisfied, while an
event is likely under a mixture model if it (approximately)
matches with any single template. Hence, to sculpt the overall probability
distribution of a mixture, each expert adds a lump of
probability mass which is usually localised to one region, while to sculpt
the overall probability of a product, each expert scales the
probability at each point by a different factor.
A product can also be viewed as adding together lumps in the log probability domain.
This essential difference can result in much sharper
boundaries, especially for high-dimensional input spaces
(Hinton 2002).

## Contents |

## Training a Product of Experts

Given data \(\{x_n\},~n=1..N\) with \(x\in {R}^d\ ,\) one can use the log-likelihood as an objective function to train a PoE,

\[\tag{3} L(\{\theta_j\}|\{x_n\})=\sum_{n=1}^N\sum_{j=1}^M \log f_j(x_n|\theta_j)-N\log Z \]

Denoting the gradient of the
objective w.r.t. \(\theta_j\) with \(\nabla_j
L\) one can compute the following gradient,

\[\tag{4} \nabla_jL = \sum_{n=1}^N \nabla_j \log f_j(x_n) - N \langle \nabla_j \log f_j(x) \rangle_{P(x)} \]

where
\(\langle\cdot\rangle_{P(x)}\) denotes taking the average
w.r.t. \(P(x)\ .\) Learning is achieved by changing the
parameters incrementally according to the following update rule,

\[\tag{5} \theta_j \rightarrow \theta_j + \eta \nabla_j L \]

where \(\eta\) represents the learning rate. Learning
efficiency can usually be improved by using a stochastic
approximation of the full gradient based on a single data-case or
a few data-cases (a mini-batch). Another effective heuristic to speed up learning
is to add a "momentum term" to the gradient (Plaut et al, 1986).

The first term of the gradient (Eqn.(4) can
be interpreted as *increasing* the probability of expert
\(i\) on the dataset. The second term on the other hand
can be interpreted as *decreasing* the probability of expert
\(i\) in regions of input space where the model assigns
high probability. When these terms balance, learning has converged
to a local maximum of the log-likelihood.

### Contrastive Divergence learning

The simplicity of the gradient in eqn.(4) is
deceptive: it requires the evaluation of an intractable average
over \(P(x)\ .\) For most interesting models this average
requires methods like MCMC sampling to approximate it.
But MCMC sampling is computationally expensive and results in
high-variance estimates of the required averages. A cheaper,
lower-variance alternative was proposed by (Hinton 2002) under
the name * contrastive divergence* (CD). The idea is to run
\(N\) samplers in parallel, one for each data-case in the
(mini-)batch. These samplers must be initialized at the respective
data-cases and will move towards equilibrium using
MCMC sampling. After only a few steps of sampling, long before the MCMC converges,there is
usually sufficient signal in the population of samples to change the
parameters. A surrogate learning rule can be derived by replacing
the log-likelihood with a new objective: the contrastive
divergence \(KL(P_{data}||P_{model})-KL(P_k||P_{model})\)
where \(P_{data}\) is the empirical distribution
\(P_{data}=\frac{1}{N}\sum_n \delta(x-x_n)\ ,\)
\(P_{model}\) is the current estimate of the model
distribution and \(P_k\) is the distribution based on
\(k\) steps of sampling. Taking gradients and ignoring a
term which is usually very small, the new learning rule is almost
identical to the one based on the log-likelihood, but using the approximation
\[ \langle \log f_j(x)\rangle_{P(x)} \approx \frac{1}{N}
\sum_n f_j(x_n^k) \] where \(x_n^k\) is the sample
obtained from MCMC sampler \(n\) after \(k\)
steps of sampling.

One can view this approximation as trading variance for bias. Thus, at convergence, we expect that the estimates of the parameters will not be equal to those of maximum likelihood learning, but will be slightly biased. To correct this, one can increase \(k\) close to convergence (Carreira-Perpinan and Hinton 2005).

## Restricted Boltzmann Machines and Exponential Family Harmoniums

Perhaps the simplest PoE is given by a restricted Boltzmann machine (see Figure 1). In this model there are two layers of binary (0/1) variables where the bottom layer is observed while the top layer remains unobserved or hidden. The joint probability distribution over hidden and observed variables is given as,

\[\tag{6} P(x,h)=\frac{1}{Z} \exp\left(\sum_i \alpha_ix_i + \sum_j \beta_j h_j + \sum_{ij} W_{ij}x_ih_j\right) \]

where the undirected edges in the graphical model inFigure 1 are representing
\(\{W_{ij}\}\ .\) The bias terms are parameterized by
\(\{\alpha_i,\beta_j\}\ .\) Marginalizing over
\(\{h_j\}\) the PoE structure becomes evident,

\[\tag{7} P(x)=\frac{1}{\tilde{Z}} \prod_i\exp(\alpha_ix_i)\prod_j\left(1+\exp(\beta_j+\sum_iW_{ij}x_i)\right) \]

where elements in the first product represent single-variable experts and elements in the second product represent constraints between the input variables.

The conditional Bernoulli expert distributions can be generalized
to distributions in the exponential family . The resulting joint
model is called an **exponential family harmonium** (EFH)
(Welling et. al. 2004). The joint distribution can be
obtained by replacing \(x_i\rightarrow f_i(x_i)\) and
\(h_j\rightarrow g_j(h_j)\ .\) where \(f(\cdot)\)
and \(g(\cdot)\) are the features for the corresponding
exponential family distribution.

The special bipartite structure of the RBM and EFH results in a
very efficient Gibbs sampler that alternates between sampling all
hidden variables independently given values for the observed
variables and vice versa sampling all visible variables
independently given values for the hidden variables. The efficient
Gibbs sampler directly translates into an efficient * contrastive*
divergence learning algorithm* (see previous section Figure 1).*

## Relation to Independent and Extreme Components Analysis

Noiseless **Independent Components Analysis** (ICA)
(Comon 1994) with an equal number of input dimensions and source
distributions can be written as a PoE model as follows,

\[\tag{8} P(x|\{w_j\}) = |\det(W)|~\prod_{j=1}^M p_j\left(\sum_i w_{ji} x_i\right) \]

where \(W\) is the matrix with

elements \(w_{ji}\ .\)

Note that each expert, \(p_j\ ,\) is defined as a distribution on a one-dimensional projection of the input space. One can think of each projection as a "source" and a linear combination of these sources generated the signal. Unless \(W\) is rank deficient, the product is a well defined distribution over the entire input space.

Choosing heavy tailed Student-T distributions as the experts one obtains the general form of the "Products of Student-T" distribution (PoT) (Welling et. al. 2002). The PoT can be represented with the help of auxiliary variables (taking the role of hidden variables) as follows,

\[\tag{9} P(x,h)=\frac{1}{Z} \prod_{j=1}^M\exp\left(-h_j[1+\frac{1}{2} (\sum_iw_{ji}x_i)^2] + (1-\alpha_j)\log h_j \right) \]

where \(P(x|h)\) is
a full covariance Gaussian distribution and \(P(h|x)\) a
product of Gamma distributions.

The PoT becomes different from ICA if one chooses the number of
experts to be larger than the number of input dimensions (a.k.a.
an *over-complete representation*). In this case *marginal*
independence* between the hidden variables is lost, but *
*conditional independence* between the hidden variables is
retained. Over-complete variants of ICA that retain marginal
independence have also been proposed (Lewicki and Sejnowski 1998).
Over-complete ICA models have conditional dependencies between the hidden
variables known as *explaining away* which makes inference
difficult. In contrast, for the over-complete PoT model inference
over the hidden variables given observations is trivial due to the
absence of such conditional dependencies
(Teh et. al. 2003).

Instead of the non-Gaussian experts used for ICA, one can also
choose an *under-complete* (\(M<d\)) set of
one-dimensional * Gaussian* experts, i.e. \(p_j\left(\sum_i
w_{ji} x_i\right)\) with \(p_j(\cdot)\) Gaussian. Using the
fact that the inverse-covariance of the product is equal to the sum of the
inverse-covariances of the individual Gaussian experts one can formulate
probabilistic principal component analysis (Roweis, 1997; Tipping and Bishop, 1999) or probabilistic minor
component analysis (Williams and Agakov, 2002). However, it is also possible to formulate a
model that extracts the optimal combination of principal and minor components in the spectrum of
the sample covariance matrix. The probabilistic model, known as
``eXtreme Components Analysis* (XCA), is described in*
(Welling et. al. 2003).

## Applications of PoEs

Variants of PoEs have been applied under different names to
various data-domains: for example the *rate-coded RBM* to face
recognition (Teh and Hinton 2001), the *dual wing harmonium* to
video-track data (Xing et. al. 2005), the *rate adapting*
Poisson* model (see Figure 1) to text and image*
data (Gehler et. al 2006), the *product of HMMs* model to
language data (Brown and Hinton 2001), *hierarchical* versions of
PoE to digits (Hinton et. al 2006), text and collaborative filtering data
(Salakhutdinov et. al. 2007).

## References

- C.K.I. Williams and F.V. Agakov. Products of gaussians and probabilistic minor components analysis. Neural Computation, 14(5):1169--1182, 2002.

- A. Brown and G. Hinton, Products of hidden Markov models, In Proceedings of the Conference on Artifcial Intelligence and Statistics}, 2001.

- M. Carreira-Perpinan and G.E. Hinton, On contrastive divergence learning, Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.

- P. Comon, Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

- M.E. Tipping and C.M. Bishop, Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622, 1999.

- P.V. Gehler, A.D. Holub, and M. Welling, The rate adapting Poisson model for information retrieval and object recognition, ACM, 06 2006.

- G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation, 14:1771--1800, 2002.

- G.E. Hinton, S. Osindero, and Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18:1527-1554, 2006.

- M.S. Lewicki and T.J. Sejnowski, Learning overcomplete representations, Neural Computation, 12:p.337-365, 2000.

- D.S. Plaut, S. Nowlan and G.E. Hinton, Experiments on learning by back-propagation, Technical report CMU-CS-86-126, Dept. Comp. Science, CMU, Pittsburgh, PA, 1986.

- S. Roweis, EM Algorithms for PCA and SPCA, Advances in Neural Information Processing Systems 10, pp.626-632, 1997.

- R.R. Salakhutdinov, A. Mnih, and G.E. Hinton, Restricted Boltzmann machines for collaborative filtering, Proceedings of the 21st International Conference on Machine Learning, 2007.

- Y.W. Teh and G.E. Hinton, Rate-coded restricted Boltzmann machines for face recognition, Advances in Neural Information Processing Systems, volume 13, 2001.

- Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton, Energy-based models for sparse overcomplete representations, Journal of Machine Learning Research - Special Issue on ICA, 4:1235--1260, 2003.

- M. Welling, F. Agakov, and C.K.I. Williams, Extreme components analysis, Advances in Neural Information Processing Systems, volume 16, Vancouver, Canada, 2003.

- M. Welling, G.E. Hinton, and S. Osindero, Learning sparse topographic representations with products of Student-t distributions, Advances in Neural Information Processing Systems, volume 15, Vancouver, Canada, 2002.

- M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, Advances in Neural Information Processing Systems, volume 17, Vancouver, Canada, 2004.

- E. Xing, R. Yan, and A. Hauptman, Mining associated text and images with dual-wing harmoniums, Proc. Uncertainty in Artificial Intelligence 2005.

**Internal references**

- Geoffrey E. Hinton (2007) Boltzmann machine. Scholarpedia, 2(5):1668.
- Eugene M. Izhikevich (2007) Equilibrium. Scholarpedia, 2(10):2014.
- Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.

## See also

Independent Component Analysis, Mixture of Experts, Pattern Recognition