Neural net language models
From Scholarpedia
| Yoshua Bengio (2008), Scholarpedia, 3(1):3881. | revision #37398 [link to/cite this article] | |||||||||||||||||||
A language model is a function, or an algorithm for learning such a
function, that captures the salient statistical characteristics of the
distribution of sequences of words in a natural language, typically
allowing one to make probabilistic predictions of the next word given
preceding ones.
A neural network language model is a language model based on Neural Networks , exploiting their ability to learn distributed representations to reduce the impact of the curse of dimensionality.
In the context of learning algorithms, the
curse of dimensionality
refers to the need for huge numbers of training examples when learning
highly complex functions. When the number of input variables
increases, the number of required examples can grow exponentially. The
curse of dimensionality arises when a huge number of different combinations
of values of the input variables must be discriminated from each other,
and the learning algorithm needs at least one example per relevant combination
of values. In the context of
language models, the problem comes from the huge number of possible
sequences of words, e.g., with a sequence
of 10 words taken from a vocabulary of 100,000 there are
possible sequences...
A distributed representation of a symbol is a tuple (or vector) of features which characterize the meaning of the symbol, and are not mutually exclusive. If a human were to choose the features of a word, he might pick grammatical features like gender or plurality, as well as semantic features like animate" or invisible. With a neural network language model, one relies on the learning algorithm to discover these features, and the features are continuous-valued (making the optimization problem involved in learning much simpler).
The basic idea is to learn to associate each word in the dictionary with a continuous-valued vector representation. Each word corresponds to a point in a feature space. One can imagine that each dimension of that space corresponds to a semantic or grammatical characteristic of words. The hope is that functionally similar words get to be closer to each other in that space, at least along some directions. A sequence of words can thus be transformed into a sequence of these learned feature vectors. The neural network learns to map that sequence of feature vectors to a prediction of interest, such as the probability distribution over the next word in the sequence. What pushes the learned word features to correspond to a form of semantic and grammatical similarity is that when two words are functionally similar, they can be replaced by one another in the same context, helping the neural network to compactly represent a function that makes good predictions on the training set, the set of word sequences used to train the model.
The advantage of this distributed representation approach is that it allows the model to generalize well to sequences that are not in the set of training word sequences, but that are similar in terms of their features, i.e., their distributed representation. Because neural networks tend to map nearby inputs to nearby outputs, the predictions corresponding to word sequences with similar features are mapped to similar predictions. Because many different combinations of feature values are possible, a very large set of possible meanings can be represented compactly, allowing a model with a comparatively small number of parameters to fit a large training set.
Contents |
History
N-gram language models
The dominant methodology for probabilistic language modeling since the
1980's has been based on n-gram models (Jelinek and Mercer, 1980;Katz 1987). See
(Manning and Schutze, 1999) for a review. These non-parametric learning algorithms are based on storing and combining
frequency counts of word subsequences of different lengths, e.g., 1, 2 and
3 for 3-grams. If a sequence of words ending in
is observed and has been seen frequently in the training
set, one can estimate the probability
of
following
by ignoring context
beyond
words, e.g., 2 words, and dividing the number of
occurrences of
by the number of occurrences of
. Note that in doing so we ignore the identity of
words that preceded
. Furthermore, a new observed sequence
typically will have occurred rarely or not at all in the training set. An important
idea in n-grams is therefore to combine the above estimator of
with one obtained from a shorter suffix of the
currently observed sequence. For example, here we can also predict the
probability of
(given the context that precedes it)
by dividing the number of occurrences of
by the number of occurrences of
(this
is called a bigram). Similarly, using only the relative frequency of
, one obtains a unigram estimator. The three estimators
can then be combined, either by choosing only one of them in a particular context (e.g., based
the frequency counts of the subsequences), or by combining them
(usually in a linear mixture). A large literature on techniques
to smooth frequency counts of subsequences has given rise to
a number of algorithms and variants.
Distributed representations
The idea of distributed representation has been at the core of the revival of artificial neural network research in the early 1980's, best represented by the connectionist bringing together computer scientists, cognitive psychologists, physicists, neuroscientists, and others. The main proponent of this idea has been Geoffrey Hinton, in articles such as (Hinton 1986) and (Hinton 1989). An early discussion can also be found in the Parallel Distributed Processing book (1986), a landmark of the connectionist approach.
The idea of distributed representations was introduced with reference to
cognitive representations: a mental object can be represented efficiently
(both in terms of number of bits and in terms of number of examples needed
to generalize about it) by characterizing the object using many features,
each of which can separately each be active or inactive. For example,
with
binary features, one can describe up to
different objects. The idea is that the brain would be
learning and using such representations because they help it generalize to
new objects that are similar to known ones in many respects. A distributed
representation is opposed to a local representation, in which only one
neuron (or very few) is active at each time, i.e., as with grandmother cells.
One can view n-gram models as a mostly local representation: only
the units associated with the specific subsequences of the input
sequence are turned on. Hence the number of units needed to capture
the possible sequences of interest grows exponentially with sequence length.
Previously to the neural network language models introduced in (Bengio et al 2001, 2003), several neural network models had been proposed that exploited distributed representations for learning about symbolic data (Bengio and Bengio, 2000; Paccanaro and Hinton, 2000), modeling linguistic data (Miikkulainen 1991) and character sequences (Schmidhuber 1996). In (Bengio et al 2001, Bengio et al 2003), it was demonstrated how distributed representations for symbols could be combined with neural network probability predictions in order to surpass standard n-gram models on statistical language modeling tasks. Experiments on related algorithms for learning distributed representations of words have shown that the learned features make sense linguistically (Blitzer et al 2005).
The mathematics of neural net language models
The probability of a sequence of words can be obtained from the probability of each word given the context of words preceding it, using the chain rule of probability (a consequence of Bayes theorem):
Most probabilistic language models
(including published neural net language models)
approximate
using a fixed context of size
, i.e. using
,
as in n-grams.
In the model introduced in (Bengio et al 2001, Bengio et al 2003),
the probabilistic prediction
is obtained as follows. First, each word
(represented
with an integer in
) in the
-word context is mapped
to an associated
-dimensional feature vector
, which is
column
of parameter matrix
. Vector
contains the learned features for word
.
Let vector
denote the concatenation of these
feature vectors:
The probabilistic prediction of the next word, starting from
is then obtained using a standard artificial neural network architecture
for probabilistic classification:
where
where the vectors
and matrices
are also
parameters (in addition to matrix
). Let us denote
for the concatenation of all the parameters.
The capacity of the model is controlled by the number of hidden units
and by the number of learned word features
.
The neural network is trained using a gradient-based optimization algorithm to maximize the training set log-likelihood
The gradient
can be computed using the error back-propagation algorithm, extended
to provide the gradient with respect to
as well as with
respect to the other parameters. Note that the gradient on most of
is zero (and need not be computed or used) for most of the columns of
:
only those corresponding to words in the input subsequence have a non-zero gradient.
Because of the large number of examples (millions to hundreds of millions),
the only known practical optimization algorithm for
artificial neural networks
are online algorithms, such as stochastic gradient descent: the
gradient on the log-likelihood of a single example at a time (one word in its
context) or a mini-batch of examples (e.g., 100 words) is iteratively used to perform
each update of the parameters.
In a similar spirit, other variants of the above equations have been proposed (Bengio et al 2001, 2003;Schwenk and Gauvain 2004;Blitzer et al 2005; Morin and Bengio 2005; Bengio and Senecal 2008).
Computational issues and applications
Several variants of the above neural network language model were compared by several authors (Schwenk and Gauvain 2002, Bengio et al 2003, Xu et al 2005, Schwenk et al 2006, Schwenk 2007, Mnih and Hinton 2007) against n-gram based language models, either in terms of log-likelihood or in terms of classification accuracy of a speech recognition or statistical machine translation system (such systems use a probabilistic language model as a component). The experiments have been mostly on small corpora, where training a neural network language model is easier, and show important improvements on both log-likelihood and speech recognition accuracy. Resampling techniques may be used to train the neural network language model on corpora of several hundreds of millions of words (Schwenk and Gauvain 2004). It has been noted that neural network language models and n-gram based language models make errors in different places: hence simply averaging the probabilistic predictions from the two types of models often yields improved predictions. However, naive implementations of the above equations yield predictors that are too slow for large scale natural language applications. Schwenk and Gauvain (2004) were able to build systems in which the neural network component took less than 5% of real-time (the duration of the speech being analyzed).
English vocabulary sizes used in natural language processing applications
such as speech recognition and translation involve tens of thousands, possibly
hundreds of thousands of different words. With
in
the above equations, the computational bottleneck is at the output layer,
where one computes
operations. This is much more than
the number of operations typically involved in computing probability predictions
for n-gram models. Several researchers have developed techniques to
speed-up either probability prediction (when using the model) or
estimating gradients (when training the model). One of the ideas
behind these techniques is to use the neural network language models
for only a subset of words (Schwenk 2004). Another idea is to
decompose the probability computation
hierarchically, using a tree of binary probabilistic decisions,
so as to replace
computations by
computations (Morin and Bengio 2005).
Yet another idea is to replace the exact gradient
by a stochastic estimator obtained using a Monte-Carlo
sampling technique (Bengio and Senecal 2008).
Challenges ahead
In addition to the computational challenges briefly described above, several weaknesses of the neural network language model are being worked on by researchers in the field. One of them is the representation of a fixed-size context. To represent longer-term context, one may employ a recurrent network formulation, which learns a representation of context that summarizes the past word sequence in a way that preserves information predictive of the future. This learned summarization would keep higher-level abstract summaries of more remote text, and a more detailed summary of very recent words. A fundamental obstacle to progress in this direction has to do with the diffusion of gradients through long chains of non-linear transformations, making it difficult to learn long-term dependencies (Bengio et al 1994) in sequential data.
Another weakness is the shallowness of the current model and the difficult optimization problem of training a neural net language model. For a discussion of shallow vs deep architectures, see (Bengio and LeCun 2007). Whereas current models have two or three layers, theoretical research on deep architectures suggests that representing high-level semantic abstractions efficiently may require deeper networks. Until 2006, it was not clear how one could train deep neural networks, as training appeared to get stuck in poor local minima, but papers published since 2006 (Hinton 2006, Bengio et al 2007, Ranzato et al 2007) on Deep Belief Networks, auto-encoders and Restricted Boltzmann Machines suggest avenues for addressing this issue.
There remains a debate between the use of local non-parametric methods based on n-grams, and methods basead on more compact and distributed representations such as neural net language models. Optimizing the latter remains a difficult challenge. In addition, it could be argued that using a huge training set (e.g., all the text in the Web), one could get n-gram based language models that appear to capture semantics correctly. However, in the light of the exponential nature of the curse of dimensionality, one should also ask the question of how much closer to human understanding of language one can get by multiplying n-gram training corpora size by a mere 100 or 1000.
References
- Jelinek, F. and Mercer, R.L. (1980) Interpolated Estimation of Markov Source Parameters from Sparse Data. Pattern Recognition in Practice, Gelsema E.S. and Kanal L.N. eds, North-Holland. pages??
- Hinton, G.E. (1986) Learning Distributed Representations of Concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society:1-12
- Rumelhart, D. E. and McClelland, J. L (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge.
- Katz, S.M. (1987) Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing 3:400-401.
- Hinton, G.E. (1989) Connectionist Learning Procedures. Artificial Intelligence J. 40:185-234.
- Miikkulainen, R. and Dyer, M.G. (1991) Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science 15:343-399.
- Elman, J. L. (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195-224.
- Bengio, Y., Simard, P., and Frasconi, P. (1994) Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5:157-166.
- Schmidhuber, J. and Heil, S. (1996) Sequential Neural Text Compression. IEEE Transactions on Neural Networks 7:142-146.
- Manning, C. and Schutze H. (1999) Foundations of Statistical Natural Language Processing, MIT Press.
- Paccanaro, A. and Hinton, G.E. (2000) Extracting Distributed Representations of Concepts and Relations from Positive and Negative Propositions. IJCNN'2000.
- Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. (2001, 2003) A Neural Probabilistic Language Model. NIPS'2000 13:933-938, and revised in J. Machine Learning Research (2003) 3:1137-1155.
- Schwenk, H., Gauvain, J.-L. (2002) Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition. In ICASSP, pages I:765-768.
- Xu, P., Emami, A., and Jelinek, F. (2003) Training Connectionist Models for the Structured Language Model, EMNLP'2003.
- Schwenk, H. and Gauvain, J.-L. (2004) Training Neural Network Language Models On Very Large Corpora. In Joint Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 201-208.
- Blitzer, J., Weinberger, K., Saul, L., and Pereira F. (2005) Hierarchical Distributed Representations for Statistical Language Modeling, NIPS'04.
- Morin, F. and Bengio, Y. (2005) Hierarchical Probabilistic Neural Network Language Model. AISTATS'2005.
- Schwenk, H., Dchelotte, D., and Gauvain, J.-L. (2006), Continuous space language models for statistical machine translation, COLING/ACL'2006.
- Hinton, G.E., Osindero, S. and Teh, Y. (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural Computation 18:1527-1554.
- Bengio, Y. and LeCun, Y. (2007) Scaling Learning Algorithms towards AI. Large Scale Kernel Machines, MIT Press.
- Bengio, Y., Lamblin, P., Popovici, D. and Larochelle H. (2007) Greedy Layer-Wise Training of Deep Networks. NIPS'2006 19:153-160.
- Ranzato, M-A., Poultney, C., Chopra, S. and LeCun, Y. (2007) Efficient Learning of Sparse Representations with an Energy-Based Model. NIPS'2006.
- Schwenk, H. (2007), Continuous Space Language Models, Computer Speech and language, vol 21, pages 492-518, Academic Press.
- Mnih, A. and Hinton, G.E. (2007) Three New Graphical Models for Statistical Language Modelling, ICML'2007.
- Bengio, Y. and Senecal, J.-S. (2008) Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model. IEEE Transactions on Neural Networks, to appear.
Internal references
- Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.
- Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.
- Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.
External Links
- See above links to reference papers.
See also
| Yoshua Bengio (2008) Neural net language models. Scholarpedia, 3(1):3881, (go to the first approved version) Created: 20 May 2007, reviewed: 14 January 2008, accepted: 14 January 2008 |
| Invited by: | Dr. Ke CHEN, The University of Manchester |
| Action editor: | Dr. Ke CHEN, The University of Manchester |



