# Neural net language models

 Yoshua Bengio (2008), Scholarpedia, 3(1):3881. doi:10.4249/scholarpedia.3881 revision #140963 [link to/cite this article]
Post-publication activity

Curator: Yoshua Bengio

A language model is a function, or an algorithm for learning such a function, that captures the salient statistical characteristics of the distribution of sequences of words in a natural language, typically allowing one to make probabilistic predictions of the next word given preceding ones.

A neural network language model is a language model based on Neural Networks , exploiting their ability to learn distributed representations to reduce the impact of the curse of dimensionality.

In the context of learning algorithms, the curse of dimensionality refers to the need for huge numbers of training examples when learning highly complex functions. When the number of input variables increases, the number of required examples can grow exponentially. The curse of dimensionality arises when a huge number of different combinations of values of the input variables must be discriminated from each other, and the learning algorithm needs at least one example per relevant combination of values. In the context of language models, the problem comes from the huge number of possible sequences of words, e.g., with a sequence of 10 words taken from a vocabulary of 100,000 there are $$10^{50}$$ possible sequences...

A distributed representation of a symbol is a tuple (or vector) of features which characterize the meaning of the symbol, and are not mutually exclusive. If a human were to choose the features of a word, he might pick grammatical features like gender or plurality, as well as semantic features like animate or invisible. With a neural network language model, one relies on the learning algorithm to discover these features, and the features are continuous-valued (making the optimization problem involved in learning much simpler).

The basic idea is to learn to associate each word in the dictionary with a continuous-valued vector representation. Each word corresponds to a point in a feature space. One can imagine that each dimension of that space corresponds to a semantic or grammatical characteristic of words. The hope is that functionally similar words get to be closer to each other in that space, at least along some directions. A sequence of words can thus be transformed into a sequence of these learned feature vectors. The neural network learns to map that sequence of feature vectors to a prediction of interest, such as the probability distribution over the next word in the sequence. What pushes the learned word features to correspond to a form of semantic and grammatical similarity is that when two words are functionally similar, they can be replaced by one another in the same context, helping the neural network to compactly represent a function that makes good predictions on the training set, the set of word sequences used to train the model.

The advantage of this distributed representation approach is that it allows the model to generalize well to sequences that are not in the set of training word sequences, but that are similar in terms of their features, i.e., their distributed representation. Because neural networks tend to map nearby inputs to nearby outputs, the predictions corresponding to word sequences with similar features are mapped to similar predictions. Because many different combinations of feature values are possible, a very large set of possible meanings can be represented compactly, allowing a model with a comparatively small number of parameters to fit a large training set.

## History

### N-gram language models

The dominant methodology for probabilistic language modeling since the 1980's has been based on n-gram models (Jelinek and Mercer, 1980;Katz 1987). See (Manning and Schutze, 1999) for a review. These non-parametric learning algorithms are based on storing and combining frequency counts of word subsequences of different lengths, e.g., 1, 2 and 3 for 3-grams. If a sequence of words ending in $$\cdots w_{t-2}, w_{t-1},w_t,w_{t+1}$$ is observed and has been seen frequently in the training set, one can estimate the probability $$P(w_{t+1}|w_1,\cdots, w_{t-2},w_{t-1},w_t)$$ of $$w_{t+1}$$ following $$w_1,\cdots w_{t-2},w_{t-1},w_t$$ by ignoring context beyond $$n-1$$ words, e.g., 2 words, and dividing the number of occurrences of $$w_{t-1},w_t,w_{t+1}$$ by the number of occurrences of $$w_{t-1},w_t\ .$$ Note that in doing so we ignore the identity of words that preceded $$w_{t-1}\ .$$ Furthermore, a new observed sequence typically will have occurred rarely or not at all in the training set. An important idea in n-grams is therefore to combine the above estimator of $$P(w_{t+1}|w_{t-1},w_t)$$ with one obtained from a shorter suffix of the currently observed sequence. For example, here we can also predict the probability of $$w_{t+1}$$ (given the context that precedes it) by dividing the number of occurrences of $$w_t,w_{t+1}$$ by the number of occurrences of $$w_t$$ (this is called a bigram). Similarly, using only the relative frequency of $$w_{t+1}\ ,$$ one obtains a unigram estimator. The three estimators can then be combined, either by choosing only one of them in a particular context (e.g., based the frequency counts of the subsequences), or by combining them (usually in a linear mixture). A large literature on techniques to smooth frequency counts of subsequences has given rise to a number of algorithms and variants.

### Distributed representations

The idea of distributed representation has been at the core of the revival of artificial neural network research in the early 1980's, best represented by the connectionist bringing together computer scientists, cognitive psychologists, physicists, neuroscientists, and others. The main proponent of this idea has been Geoffrey Hinton, in articles such as (Hinton 1986) and (Hinton 1989). An early discussion can also be found in the Parallel Distributed Processing book (1986), a landmark of the connectionist approach.

The idea of distributed representations was introduced with reference to cognitive representations: a mental object can be represented efficiently (both in terms of number of bits and in terms of number of examples needed to generalize about it) by characterizing the object using many features, each of which can separately each be active or inactive. For example, with $$m$$ binary features, one can describe up to $$2^m$$ different objects. The idea is that the brain would be learning and using such representations because they help it generalize to new objects that are similar to known ones in many respects. A distributed representation is opposed to a local representation, in which only one neuron (or very few) is active at each time, i.e., as with grandmother cells. One can view n-gram models as a mostly local representation: only the units associated with the specific subsequences of the input sequence are turned on. Hence the number of units needed to capture the possible sequences of interest grows exponentially with sequence length.

Figure 1: Example of 2-dimensional distributed representation for words obtained in (Blitzer et al 2005).

Previously to the neural network language models introduced in (Bengio et al 2001, 2003), several neural network models had been proposed that exploited distributed representations for learning about symbolic data (Bengio and Bengio, 2000; Paccanaro and Hinton, 2000), modeling linguistic data (Miikkulainen 1991) and character sequences (Schmidhuber 1996). In (Bengio et al 2001, Bengio et al 2003), it was demonstrated how distributed representations for symbols could be combined with neural network probability predictions in order to surpass standard n-gram models on statistical language modeling tasks. Experiments on related algorithms for learning distributed representations of words have shown that the learned features make sense linguistically (Blitzer et al 2005).

## The mathematics of neural net language models

The probability of a sequence of words can be obtained from the probability of each word given the context of words preceding it, using the chain rule of probability (a consequence of Bayes theorem): $P(w_1, w_2, \ldots, w_{t-1},w_t) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) \ldots P(w_t | w_1, w_2, \ldots w_{t-1}).$ Most probabilistic language models (including published neural net language models) approximate $$P(w_t | w_1, w_2, \ldots w_{t-1})$$ using a fixed context of size $$n-1\ ,$$ i.e. using $$P(w_t | w_{t-n+1}, \ldots w_{t-1})\ ,$$ as in n-grams.

In the model introduced in (Bengio et al 2001, Bengio et al 2003), the probabilistic prediction $$P(w_t | w_{t-n+1}, \ldots w_{t-1})$$ is obtained as follows. First, each word $$w_{t-i}$$ (represented with an integer in $$[1,N]$$) in the $$n-1$$-word context is mapped to an associated $$d$$-dimensional feature vector $$C_{w_{t-i}}\ ,$$ which is column $$w_{t-i}$$ of parameter matrix $$C\ .$$ Vector $$C_k$$ contains the learned features for word $$k\ .$$ Let vector $$x$$ denote the concatenation of these $$n-1$$ feature vectors: $x = (C_{w_{t-n+1},1}, \ldots, C_{w_{t-n+1},d}, C_{w_{t-n+2},1}, \ldots C_{w_{t-2},d}, C_{w_{t-1},1}, \ldots C_{w_{t-1},d}).$ The probabilistic prediction of the next word, starting from $$x$$ is then obtained using a standard artificial neural network architecture for probabilistic classification, using the softmax activation function at the output units (Bishop, 1995): $P(w_t=k | w_{t-n+1}, \ldots w_{t-1}) = \frac{e^{a_k}}{\sum_{l=1}^N e^{a_l}}$ where $a_k = b_k + \sum_{i=1}^h W_{ki} \tanh(c_i + \sum_{j=1}^{(n-1)d} V_{ij} x_j)$ where the vectors $$b,c$$ and matrices $$W,V$$ are also parameters (in addition to matrix $$C$$). Let us denote $$\theta$$ for the concatenation of all the parameters. The capacity of the model is controlled by the number of hidden units $$h$$ and by the number of learned word features $$d\ .$$

Figure 2: Architecture of neural net language model introduced in (Bengio et al 2001)

The neural network is trained using a gradient-based optimization algorithm to maximize the training set log-likelihood $L(\theta) = \sum_t \log P(w_t | w_{t-n+1}, \ldots w_{t-1}) .$ The gradient $$\frac{\partial L(\theta)}{\partial \theta}$$ can be computed using the error back-propagation algorithm, extended to provide the gradient with respect to $$C$$ as well as with respect to the other parameters. Note that the gradient on most of $$C$$ is zero (and need not be computed or used) for most of the columns of $$C\ :$$ only those corresponding to words in the input subsequence have a non-zero gradient. Because of the large number of examples (millions to hundreds of millions), the only known practical optimization algorithm for artificial neural networks are online algorithms, such as stochastic gradient descent: the gradient on the log-likelihood of a single example at a time (one word in its context) or a mini-batch of examples (e.g., 100 words) is iteratively used to perform each update of the parameters.

In a similar spirit, other variants of the above equations have been proposed (Bengio et al 2001, 2003;Schwenk and Gauvain 2004;Blitzer et al 2005; Morin and Bengio 2005; Bengio and Senecal 2008).

## Computational issues and applications

Several variants of the above neural network language model were compared by several authors (Schwenk and Gauvain 2002, Bengio et al 2003, Xu et al 2005, Schwenk et al 2006, Schwenk 2007, Mnih and Hinton 2007) against n-gram based language models, either in terms of log-likelihood or in terms of classification accuracy of a speech recognition or statistical machine translation system (such systems use a probabilistic language model as a component). The experiments have been mostly on small corpora, where training a neural network language model is easier, and show important improvements on both log-likelihood and speech recognition accuracy. Resampling techniques may be used to train the neural network language model on corpora of several hundreds of millions of words (Schwenk and Gauvain 2004). It has been noted that neural network language models and n-gram based language models make errors in different places: hence simply averaging the probabilistic predictions from the two types of models often yields improved predictions. However, naive implementations of the above equations yield predictors that are too slow for large scale natural language applications. Schwenk and Gauvain (2004) were able to build systems in which the neural network component took less than 5% of real-time (the duration of the speech being analyzed).

English vocabulary sizes used in natural language processing applications such as speech recognition and translation involve tens of thousands, possibly hundreds of thousands of different words. With $$N=100,000$$ in the above equations, the computational bottleneck is at the output layer, where one computes $$O(N h)$$ operations. This is much more than the number of operations typically involved in computing probability predictions for n-gram models. Several researchers have developed techniques to speed-up either probability prediction (when using the model) or estimating gradients (when training the model). One of the ideas behind these techniques is to use the neural network language models for only a subset of words (Schwenk 2004), or storing in a cache the most relevant softmax normalization constants (Zamora et al 2009). Another idea is to decompose the probability computation hierarchically, using a tree of binary probabilistic decisions, so as to replace $$O(N)$$ computations by $$O(\log N)$$ computations (Morin and Bengio 2005). Yet another idea is to replace the exact gradient by a stochastic estimator obtained using a Monte-Carlo sampling technique (Bengio and Senecal 2008).

In addition to the computational challenges briefly described above, several weaknesses of the neural network language model are being worked on by researchers in the field. One of them is the representation of a fixed-size context. To represent longer-term context, one may employ a recurrent network formulation, which learns a representation of context that summarizes the past word sequence in a way that preserves information predictive of the future. This learned summarization would keep higher-level abstract summaries of more remote text, and a more detailed summary of very recent words. A fundamental obstacle to progress in this direction has to do with the diffusion of gradients through long chains of non-linear transformations, making it difficult to learn long-term dependencies (Bengio et al 1994) in sequential data.

Another weakness is the shallowness of the current model and the difficult optimization problem of training a neural net language model. For a discussion of shallow vs deep architectures, see (Bengio and LeCun 2007). Whereas current models have two or three layers, theoretical research on deep architectures suggests that representing high-level semantic abstractions efficiently may require deeper networks. Until 2006, it was not clear how one could train deep neural networks, as training appeared to get stuck in poor local minima, but papers published since 2006 (Hinton 2006, Bengio et al 2007, Ranzato et al 2007) on Deep Belief Networks, auto-encoders and Restricted Boltzmann Machines suggest avenues for addressing this issue.

There remains a debate between the use of local non-parametric methods based on n-grams, and methods based on more compact and distributed representations such as neural net language models. Optimizing the latter remains a difficult challenge. In addition, it could be argued that using a huge training set (e.g., all the text in the Web), one could get n-gram based language models that appear to capture semantics correctly. However, in the light of the exponential nature of the curse of dimensionality, one should also ask the question of how much closer to human understanding of language one can get by multiplying n-gram training corpora size by a mere 100 or 1000.

## References

Internal references

• Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.
• Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.