# Neural net language models

Yoshua Bengio (2008), Scholarpedia, 3(1):3881. | doi:10.4249/scholarpedia.3881 | revision #91566 [link to/cite this article] |

A **language model** is a function, or an algorithm for learning such a
function, that captures the salient statistical characteristics of the
distribution of sequences of words in a natural language, typically
allowing one to make probabilistic predictions of the next word given
preceding ones.

A **neural network language model** is a language model based on Neural Networks , exploiting their ability to learn
distributed representations to reduce the impact of the
*curse of dimensionality*.

In the context of learning algorithms, the curse of dimensionality refers to the need for huge numbers of training examples when learning highly complex functions. When the number of input variables increases, the number of required examples can grow exponentially. The curse of dimensionality arises when a huge number of different combinations of values of the input variables must be discriminated from each other, and the learning algorithm needs at least one example per relevant combination of values. In the context of language models, the problem comes from the huge number of possible sequences of words, e.g., with a sequence of 10 words taken from a vocabulary of 100,000 there are \(10^{50}\) possible sequences...

A **distributed representation** of a symbol is a tuple (or vector)
of features which characterize the meaning of the symbol, and are not mutually
exclusive. If a human
were to choose the features of a word, he might pick grammatical features
like gender or plurality, as well as semantic features like *animate"*
or *invisible*. With a neural network language model, one relies
on the learning algorithm to discover these features, and the
features are continuous-valued (making the optimization problem
involved in learning much simpler).

The basic idea is to learn to associate each word in the dictionary with a continuous-valued vector representation. Each word corresponds to a point in a feature space. One can imagine that each dimension of that space corresponds to a semantic or grammatical characteristic of words. The hope is that functionally similar words get to be closer to each other in that space, at least along some directions. A sequence of words can thus be transformed into a sequence of these learned feature vectors. The neural network learns to map that sequence of feature vectors to a prediction of interest, such as the probability distribution over the next word in the sequence. What pushes the learned word features to correspond to a form of semantic and grammatical similarity is that when two words are functionally similar, they can be replaced by one another in the same context, helping the neural network to compactly represent a function that makes good predictions on the training set, the set of word sequences used to train the model.

The advantage of this distributed representation approach is that it allows the model to generalize well to sequences that are not in the set of training word sequences, but that are similar in terms of their features, i.e., their distributed representation. Because neural networks tend to map nearby inputs to nearby outputs, the predictions corresponding to word sequences with similar features are mapped to similar predictions. Because many different combinations of feature values are possible, a very large set of possible meanings can be represented compactly, allowing a model with a comparatively small number of parameters to fit a large training set.

## Contents |

## History

### N-gram language models

The dominant methodology for probabilistic language modeling since the
1980's has been based on *n-gram* models (Jelinek and Mercer, 1980;Katz 1987). See
(Manning and Schutze, 1999) for a review. These non-parametric learning algorithms are based on storing and combining
frequency counts of word subsequences of different lengths, e.g., 1, 2 and
3 for 3-grams. If a sequence of words ending in \(\cdots w_{t-2},
w_{t-1},w_t,w_{t+1}\) is observed and has been seen frequently in the training
set, one can estimate the probability \(P(w_{t+1}|w_1,\cdots, w_{t-2},w_{t-1},w_t)\) of
\(w_{t+1}\) following \(w_1,\cdots w_{t-2},w_{t-1},w_t\) by ignoring context
beyond \(n-1\) words, e.g., 2 words, and dividing the number of
occurrences of \(w_{t-1},w_t,w_{t+1}\) by the number of occurrences of
\(w_{t-1},w_t\ .\) Note that in doing so we ignore the identity of
words that preceded \(w_{t-1}\ .\) Furthermore, a new observed sequence
typically will have occurred rarely or not at all in the training set. An important
idea in n-grams is therefore to combine the above estimator of
\(P(w_{t+1}|w_{t-1},w_t)\) with one obtained from a shorter suffix of the
currently observed sequence. For example, here we can also predict the
probability of \(w_{t+1}\) (given the context that precedes it)
by dividing the number of occurrences of
\(w_t,w_{t+1}\) by the number of occurrences of \(w_t\) (this
is called a *bigram*). Similarly, using only the relative frequency of
\(w_{t+1}\ ,\) one obtains a unigram estimator. The three estimators
can then be combined, either by choosing only one of them in a particular context (e.g., based
the frequency counts of the subsequences), or by combining them
(usually in a linear mixture). A large literature on techniques
to *smooth* frequency counts of subsequences has given rise to
a number of algorithms and variants.

### Distributed representations

The idea of **distributed representation** has been at the core of the
revival of artificial neural network research in the early 1980's,
best represented by the connectionist
bringing
together computer scientists, cognitive psychologists, physicists,
neuroscientists, and others. The main proponent of this idea
has been Geoffrey Hinton,
in articles such as (Hinton 1986) and (Hinton 1989). An early discussion
can also be found in the Parallel Distributed Processing book (1986),
a landmark of the connectionist approach.

The idea of distributed representations was introduced with reference to
cognitive representations: a mental object can be represented efficiently
(both in terms of number of bits and in terms of number of examples needed
to generalize about it) by characterizing the object using many features,
each of which can separately each be *active* or *inactive*. For example,
with \(m\) binary features, one can describe up to
\(2^m\) different objects. The idea is that the brain would be
learning and using such representations because they help it generalize to
new objects that are similar to known ones in many respects. A distributed
representation is opposed to a *local* representation, in which only one
neuron (or very few) is active at each time, i.e., as with grandmother cells.
One can view n-gram models as a mostly *local* representation: only
the units associated with the specific subsequences of the input
sequence are turned on. Hence the number of units needed to capture
the possible sequences of interest grows exponentially with sequence length.

Previously to the neural network language models introduced in (Bengio et al 2001, 2003), several neural network models had been proposed that exploited distributed representations for learning about symbolic data (Bengio and Bengio, 2000; Paccanaro and Hinton, 2000), modeling linguistic data (Miikkulainen 1991) and character sequences (Schmidhuber 1996). In (Bengio et al 2001, Bengio et al 2003), it was demonstrated how distributed representations for symbols could be combined with neural network probability predictions in order to surpass standard n-gram models on statistical language modeling tasks. Experiments on related algorithms for learning distributed representations of words have shown that the learned features make sense linguistically (Blitzer et al 2005).

## The mathematics of neural net language models

The probability of a sequence of words can be obtained from the probability of each word given the context of words preceding it, using the chain rule of probability (a consequence of Bayes theorem): \[ P(w_1, w_2, \ldots, w_{t-1},w_t) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) \ldots P(w_t | w_1, w_2, \ldots w_{t-1}). \] Most probabilistic language models (including published neural net language models) approximate \(P(w_t | w_1, w_2, \ldots w_{t-1})\) using a fixed context of size \(n-1\ ,\) i.e. using \(P(w_t | w_{t-n+1}, \ldots w_{t-1})\ ,\) as in n-grams.

In the model introduced in (Bengio et al 2001, Bengio et al 2003), the probabilistic prediction \(P(w_t | w_{t-n+1}, \ldots w_{t-1})\) is obtained as follows. First, each word \(w_{t-i}\) (represented with an integer in \([1,N]\)) in the \(n-1\)-word context is mapped to an associated \(d\)-dimensional feature vector \(C_{w_{t-i}}\ ,\) which is column \(w_{t-i}\) of parameter matrix \(C\ .\) Vector \(C_k\) contains the learned features for word \(k\ .\) Let vector \(x\) denote the concatenation of these \(n-1\) feature vectors: \[ x = (C_{w_{t-n+1},1}, \ldots, C_{w_{t-n+1},d}, C_{w_{t-n+2},1}, \ldots C_{w_{t-2},d}, C_{w_{t-1},1}, \ldots C_{w_{t-1},d}). \] The probabilistic prediction of the next word, starting from \(x\) is then obtained using a standard artificial neural network architecture for probabilistic classification, using the softmax activation function at the output units (Bishop, 1995): \[ P(w_t=k | w_{t-n+1}, \ldots w_{t-1}) = \frac{e^{a_k}}{\sum_{l=1}^N e^{a_l}} \] where \[ a_k = b_k + \sum_{i=1}^h W_{ki} \tanh(c_i + \sum_{j=1}^{(n-1)d} V_{ij} x_j) \] where the vectors \(b,c\) and matrices \(W,V\) are also parameters (in addition to matrix \(C\)). Let us denote \(\theta\) for the concatenation of all the parameters. The capacity of the model is controlled by the number of hidden units \(h\) and by the number of learned word features \(d\ .\)

The neural network is trained using a gradient-based optimization algorithm
to maximize the training set *log-likelihood*
\[
L(\theta) = \sum_t \log P(w_t | w_{t-n+1}, \ldots w_{t-1}) .
\]
The gradient \(\frac{\partial L(\theta)}{\partial \theta}\)
can be computed using the error back-propagation algorithm, extended
to provide the gradient with respect to \(C\) as well as with
respect to the other parameters. Note that the gradient on most of \(C\)
is zero (and need not be computed or used) for most of the columns of \(C\ :\)
only those corresponding to words in the input subsequence have a non-zero gradient.
Because of the large number of examples (millions to hundreds of millions),
the only known practical optimization algorithm for
artificial neural networks
are online algorithms, such as stochastic gradient descent: the
gradient on the log-likelihood of a single example at a time (one word in its
context) or a mini-batch of examples (e.g., 100 words) is iteratively used to perform
each update of the parameters.

In a similar spirit, other variants of the above equations have been proposed (Bengio et al 2001, 2003;Schwenk and Gauvain 2004;Blitzer et al 2005; Morin and Bengio 2005; Bengio and Senecal 2008).

## Computational issues and applications

Several variants of the above neural network language model were compared
by several authors (Schwenk and Gauvain 2002, Bengio et al 2003, Xu et al 2005, Schwenk et al 2006, Schwenk 2007, Mnih and Hinton 2007) against n-gram based language models, either
in terms of log-likelihood or in terms of classification accuracy of a
speech recognition or statistical machine translation system (such systems use a probabilistic language model
as a component). The experiments have been mostly on small corpora, where
training a neural network language model is easier, and show important
improvements on both log-likelihood and speech recognition accuracy. Resampling techniques may be used to train
the neural network language model on corpora of several hundreds of millions of words (Schwenk and Gauvain 2004).
It has been noted that neural network language
models and n-gram based language models make errors in different
places: hence simply averaging the probabilistic predictions from the two
types of models often yields improved
predictions. However, naive implementations of the above
equations yield predictors that are too slow for large scale natural
language applications. Schwenk and Gauvain (2004) were able to build systems in
which the neural network component took less than 5% of *real-time*
(the duration of the speech being analyzed).

English vocabulary sizes used in natural language processing applications such as speech recognition and translation involve tens of thousands, possibly hundreds of thousands of different words. With \(N=100,000\) in the above equations, the computational bottleneck is at the output layer, where one computes \(O(N h)\) operations. This is much more than the number of operations typically involved in computing probability predictions for n-gram models. Several researchers have developed techniques to speed-up either probability prediction (when using the model) or estimating gradients (when training the model). One of the ideas behind these techniques is to use the neural network language models for only a subset of words (Schwenk 2004), or storing in a cache the most relevant softmax normalization constants (Zamora et al 2009). Another idea is to decompose the probability computation hierarchically, using a tree of binary probabilistic decisions, so as to replace \(O(N)\) computations by \(O(\log N)\) computations (Morin and Bengio 2005). Yet another idea is to replace the exact gradient by a stochastic estimator obtained using a Monte-Carlo sampling technique (Bengio and Senecal 2008).

## Challenges ahead

In addition to the computational challenges briefly described above, several weaknesses of the neural network language model are being worked on by researchers in the field. One of them is the representation of a fixed-size context. To represent longer-term context, one may employ a recurrent network formulation, which learns a representation of context that summarizes the past word sequence in a way that preserves information predictive of the future. This learned summarization would keep higher-level abstract summaries of more remote text, and a more detailed summary of very recent words. A fundamental obstacle to progress in this direction has to do with the diffusion of gradients through long chains of non-linear transformations, making it difficult to learn long-term dependencies (Bengio et al 1994) in sequential data.

Another weakness is the shallowness of the current model and the difficult optimization problem of training a neural net language model. For a discussion of shallow vs deep architectures, see (Bengio and LeCun 2007). Whereas current models have two or three layers, theoretical research on deep architectures suggests that representing high-level semantic abstractions efficiently may require deeper networks. Until 2006, it was not clear how one could train deep neural networks, as training appeared to get stuck in poor local minima, but papers published since 2006 (Hinton 2006, Bengio et al 2007, Ranzato et al 2007) on Deep Belief Networks, auto-encoders and Restricted Boltzmann Machines suggest avenues for addressing this issue.

There remains a debate between the use of local non-parametric methods based on n-grams, and methods based on more compact and distributed representations such as neural net language models. Optimizing the latter remains a difficult challenge. In addition, it could be argued that using a huge training set (e.g., all the text in the Web), one could get n-gram based language models that appear to capture semantics correctly. However, in the light of the exponential nature of the curse of dimensionality, one should also ask the question of how much closer to human understanding of language one can get by multiplying n-gram training corpora size by a mere 100 or 1000.

## References

- Jelinek, F. and Mercer, R.L. (1980) Interpolated Estimation of Markov Source Parameters from Sparse Data. Pattern Recognition in Practice, Gelsema E.S. and Kanal L.N. eds, North-Holland. pp. 381-397.
- Hinton, G.E. (1986) Learning Distributed Representations of Concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society:1-12
- Rumelhart, D. E. and McClelland, J. L (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge.
- Katz, S.M. (1987) Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing 3:400-401.
- Hinton, G.E. (1989) Connectionist Learning Procedures. Artificial Intelligence J. 40:185-234.
- Miikkulainen, R. and Dyer, M.G. (1991) Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science 15:343-399.
- Elman, J. L. (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195-224.
- Bengio, Y., Simard, P., and Frasconi, P. (1994) Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5:157-166.
- Schmidhuber, J. and Heil, S. (1996) Sequential Neural Text Compression. IEEE Transactions on Neural Networks 7:142-146.
- Manning, C. and Schutze H. (1999) Foundations of Statistical Natural Language Processing, MIT Press.
- Paccanaro, A. and Hinton, G.E. (2000) Extracting Distributed Representations of Concepts and Relations from Positive and Negative Propositions. IJCNN'2000.
- Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. (2001, 2003) A Neural Probabilistic Language Model. NIPS'2000 13:933-938, and revised in J. Machine Learning Research (2003) 3:1137-1155.
- Schwenk, H., Gauvain, J.-L. (2002) Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition. In ICASSP, pages I:765-768.
- Xu, P., Emami, A., and Jelinek, F. (2003) Training Connectionist Models for the Structured Language Model, EMNLP'2003.
- Schwenk, H. and Gauvain, J.-L. (2004) Training Neural Network Language Models On Very Large Corpora. In Joint Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 201-208.
- Blitzer, J., Weinberger, K., Saul, L., and Pereira F. (2005) Hierarchical Distributed Representations for Statistical Language Modeling, NIPS'04.
- Morin, F. and Bengio, Y. (2005) Hierarchical Probabilistic Neural Network Language Model. AISTATS'2005.
- Schwenk, H., Dchelotte, D., and Gauvain, J.-L. (2006), Continuous space language models for statistical machine translation, COLING/ACL'2006.
- Hinton, G.E., Osindero, S. and Teh, Y. (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural Computation 18:1527-1554.
- Bengio, Y. and LeCun, Y. (2007) Scaling Learning Algorithms towards AI. Large Scale Kernel Machines, MIT Press.
- Bengio, Y., Lamblin, P., Popovici, D. and Larochelle H. (2007) Greedy Layer-Wise Training of Deep Networks. NIPS'2006 19:153-160.
- Ranzato, M-A., Poultney, C., Chopra, S. and LeCun, Y. (2007) Efficient Learning of Sparse Representations with an Energy-Based Model. NIPS'2006.
- Schwenk, H. (2007), Continuous Space Language Models, Computer Speech and language, vol 21, pages 492-518, Academic Press.
- Mnih, A. and Hinton, G.E. (2007) Three New Graphical Models for Statistical Language Modelling, ICML'2007.
- Bengio, Y. and Senecal, J.-S. (2008) Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model. IEEE Transactions on Neural Networks, to appear.
- C. M. Bishop. (1995). Neural networks for pattern recognition. Oxford University Press.
- Zamora-Martínez, F., Castro-Bleda, M., España-Boquera, S.: Fast evaluation of connectionist language models. In: 10th International Work-Conference on Artificial Neural Networks. LNCS. Springer (2009) 144--151.

**Internal references**

- Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.

- Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.

- Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.

## External Links

- See above links to reference papers.