Echo state network
From Scholarpedia
| Herbert Jaeger (2007), Scholarpedia, 2(9):2330. | revision #20147 [link to/cite this article] | |||||||||||||||||||
Revision as of 07:00, 6 September 2007; view current revision
←Older revision | Newer revision→
Curator: Herbert Jaeger, Jacobs University Bremen, Bremen, Germany
Contents |
Basic principle
Echo state networks (ESN) provide an architecture and supervised learning principle for recurrent neural networks (RNNs). The main idea is (i) to drive a random, large, fixed recurrent neural network with the input signal, thereby inducing in each neuron within this "reservoir" network a nonlinear response signal, and (ii) combine a desired output signal by a trainable linear combination of all of these response signals.
The basic idea of ESNs is shared with Liquid State Machines (LSM), which were developed independently from and simultaneously with ESNs by Wolfgang Maass (Maass W., Natschlaeger T., Markram H. 2002). Increasingly often, LSMs, ESNs and the more recently explored Backpropagation Decorrelation learning rule for RNNs (Schiller and Steil 2006) are subsumed under the name of Reservoir Computing. Schiller and Steil (2006) also showed that in traditional training methods for RNNs, where all weights (not only the output weights) are adapted, the dominant changes are in the output weights.
For an illustration, consider the task of training an RNN to behave
as a tunable frequency generator (download the Matlab code of this example). The
input signal
is a slowly varying frequency setting, the desired
output
is a sinewave of a frequency indicated by the current
input. Assume that a training input-output sequence
is given (see the input and output signals in figure ; here
the input is a slow random step function indicating frequencies
ranging from 1/16 to 1/4 Hz). The task is to train a RNN from these
training data such that on slow test input signals, the output is
again a sinewave of the input-determined frequency.
In the ESN approach, this task is solved by the following steps.
- Step 1: Provide a random RNN. (i) Create a random dynamical reservoir RNN, using any neuron model (in the frequency generator demo example, non-spiking leaky integrator neurons were used). The reservoir size
is task-dependent. In the frequency generator demo task,
was used. (ii) Attach input units to the reservoir by creating random all-to-all connections. (iii) Create output units. If the task requires output feedback (the frequency-generator task does), install randomly generated output-to-reservoir connections (all-to-all). If the task does not require output feedback, do not create any connections to/from the output units in this step.
- Step 2: Harvest reservoir states. Drive the dynamical reservoir with the training data
for times
. In the demo example, where there are output-to-reservoir feedback connections, this means to write both the input
into the input unit and the teacher output
into the output unit ("teacher forcing"). In tasks without output feedback, the reservoir is driven by the input
only. This results in a sequence
of
-dimensional reservoir states. Each component signal
is a nonlinear transform of the driving input. In the demo, each
is an individual mixture of both the slow step input signal and the fast output sinewave (see the five exemplary neuron state plots in figure 1).
- Step 3: Compute output weights. Compute the output weights as the linear regression weights of the teacher outputs
on the reservoir states
. Use these weights to create reservoir-to-output connections (dotted arrows in figure ). The training is now completed and the ESN ready for use. Figure 2 shows the output signal obtained when the trained ESN was driven with the slow step input shown in the same figure.
Variants
Echo state networks can be set up with or without direct trainable input-to-output connections, with or without output-to-reservoir feedback, with different neuron types, different reservoir-internal connectivity patterns, etc. Furthermore, the output weights can be computed with any of the available offline or online algorithms for linear regression. Besides least-mean-square error solutions (i.e., linear regression weights), margin-maximization criteria known from training support vector machines have been used to determine output weights (Schmidhuber et al. 2007)
The unifying theme throughout all these variations is to use a fixed RNN as a random nonlinear excitable medium, whose high-dimensional dynamical "echo" response to a driving input (and/or output feedback) is used as a non-orthogonal signal basis to reconstruct the desired output by a linear combination, minimizing some error criterium.
Formalism and theory
System equations. The basic
discrete-time, sigmoid-unit echo state network with
reservoir units,
inputs and
outputs is governed by the
state update equation
(1)
,
where
is the
-dimensional
reservoir state,
is a sigmoid function (usually the
logistic sigmoid or the tanh function),
is the
reservoir weight matrix,
is the
input
weight matrix,
is the
dimensional input signal,
is the
output feedback matrix, and
is the
-dimensional output
signal. In tasks where no output feedback is required,
is nulled. The extended system state
at
time
is the concatenation of the reservoir and input
states. The output is obtained from the extended system state by
(2)
,
where
is an output activation function (typically the
identity or a sigmoid) and
is a
-dimensional matrix of output
weights.
Learning equations. In the state harvesting stage of the
training, the ESN is driven by an input sequence
, which yields a sequence
of extended
system states. The system equations (1), (2) are used here. If the
model includes output feedback (i.e., nonzero
), then during the generation of the
system states, the correct outputs
(part of
the training data) are written into the output units ("teacher
forcing"). The obtained extended system states are filed row-wise into
a state collection matrix
of size
. Usually some initial portion of
the states thus collected are discarded to accomodate for a washout of
the arbitrary (random or zero) initial reservoir state needed at time
1. Likewise, the desired outputs
are sorted
row-wise into a teacher output collection matrix
of size
.
The desired output weights
are the
linear regression weights of the desired outputs
on the harvested extended states
. Let
be the correlation matrix of the extended
reservoir states (the prime denotes transpose), and let
be the
cross-correlation matrix of the states vs. the desired outputs. Then,
one way to compute
is to invoke the Wiener-Hopf
solution
(3)
.
Another way is to use the pseudoinverse (denoted by
) of
:
(4)
.
Both methods are, in principle, equivalent, but when
is ill-conditioned, (4) is numerically
stable, while (3) is not. However, (3) is faster to compute than (4)
(much faster if
is large), so one should
choose between the two versions on a case by case basis. (3) and (4)
are offline algorithms. Online adaptive methods known from linear
signal processing can also be used to compute output weights (Jaeger 2003).
Echo state property. In order for the ESN principle to work, the reservoir must have the echo state property (ESP), which relates asymptotic properties of the excited reservoir dynamcis to the driving signal. Intuitively, the ESP states that the reservoir will asymptotically wash out any information from initial conditions. The ESP is guaranteed for additive-sigmoid neuron reservoirs, if the reservoir weight matrix (and the leaking rates) satisfy certain algebraic conditions in terms of singular values. For such reservoirs with a tanh sigmoid, the ESP is violated for zero input if the spectral radius of the reservoir weight matrix is larger than unity. Conversely, it is empirically observed that the ESP is granted for any input if this spectral radius is smaller than unity. This has led in the literature to a far-spread but erroneous identification of the ESP with a spectral radius below 1. Specifically, the larger the input amplitude, the further above unity the spectral radius may be while still obtaining the ESP. An abstract characterization of the ESP for arbitrary reservoir types, and algebraic conditions for additive-sigmoid neuron reservoirs are given in Jaeger (2001a); for an important subclass of reservoirs, tighter algebraic conditions are given in Buehner and Young (2006); for leaky integrator neurons, algebraic conditions are spelled out in Jaeger et al. (2007).
Memory capacity. Due to the auto-feedback nature of RNNs, the
reservoir states
reflect traces of the past input
history. This can be seen as a dynamical short-term memory. For a
single-input ESN, this short-term memory's capacity
can be
quantified by
, where
is the squared correlation coefficient between
the input signal delayed by
and a trained output signal
which was trained on the task to retrodict (memorize)
on the
input signal
. It turns out that for i.i.d. input, the memory
capacity
of an echo state network of size
is bounded by
; in the absence
of numerical errors and with a linear reservoir the bound is attained
(Jaeger 2002a; White and Sompolinsky 2004). These findings imply that
it is impossible to train ESNs on tasks which require unbounded-time
memory, like for instance context-free
grammar parsing tasks (Schmidhuber et al. 2007). However, if output
units with feedback to the reservoir are trained as attractor memory
units, unbounded memory spans can be realized with ESNs, too (cf. the
multistable switch example in Jaeger 2002a; beginnings of a theory of feedback-induced memory-hold attractors in Maass, Joshi & Sontag 2007).
Universal computation and approximation properties. ESNs can realize every nonlinear filter with bounded memory arbitrarily well. This line of theoretical research has been started and advanced in the field of Liquid State Machines (Maass, Natschlaeger & Markram 2002; Maass, Joshi & Sontag 2007), and the reader is referred to the LSM article for detail.
Practical issues: tuning global controls and regularization
When using ESNs in practical nonlinear modeling tasks, the ultimate objective is to minimize the test error. A standard method in machine learning to get an estimate of the test error is to use only a part of the available training data for model estimation, and monitor the model's performance on the withheld portion of the original training data (the validation set). The question is, how can the ESN models be optimized in order to reduce the error on the validation set? In the terminology of machine learning, this boils down to the question how one can equip the ESN models with a task-appropriate bias. With ESNs, there are three sorts of bias (in a wide sense) which one should adjust.
The first sort of bias is to employ regularization. This essentially means that the models are smoothed. Two standard ways to achieve some kind of smoothing are the following:
- Ridge regression (also known as Tikhonov regularization): modify the linear regression equation (3) for the output weights:
- (5)
,
- where
is some nonnegative number (the larger, the stronger the smoothing effect), and
is the identity matrix.
- State noise: During the state harvesting, instead of (1) use a state update which adds a noise vector
to the reservoir states:
- (6)
.
Both methods lead to smaller output weights. Adding state noise is computationally more expensive, but appears to have the additional benefit of stabilizing solutions in models with output feedback (Jaeger 2002a; Jaeger, Lukosevicius, Popovici & Siewert 2007).
The second sort of bias is effected by making the echo state network, as one could say, "dynamically similar" to the system that one wants to model. For instance, if the original system is evolving on a slow timescale, the ESN should do the same; or if the original system has long memory spans, so should the ESN. This shaping of major dynamical characteristics is realized by adjusting a small number of global control parameters:
- The spectral radius of the reservoir weight matrix codetermines (i) the effective time constant of the echo state network (larger spectral radius implies slower decay of impulse response) and (ii) the amount of nonlinear interaction of input components through time (larger spectral radius implies longer-range interactions).
- The input scaling codetermines the degree of nonlinearity of the reservoir dynamics. In one extreme, with very small effective input amplitudes the reservoir behaves almost like a linear medium, while in the other extreme, very large input amplitudes drive the neurons to the saturation of the sigmoid and a binary switching dynamics results.
- The output feedback scaling determines the extent to which the trained ESN has an autonomous pattern generation component. ESNs without any output feedback are the typical choice for purely input-driven dynamical pattern recognition and classification tasks. The frequency generator demo task, on the other hand, needed strong output feedback to generate oscillations (which are not present in the input). Nonzero output feedback entails the danger of dynamical instability.
- The connectivity of the reservoir weight matrix is often claimed (starting with the early techreports of Jaeger) to be responsible for the "richness" of the response signals in the reservoir, following this line of reasoning: sparse connectivity
decomposition of reservoir dynamics into loosely coupled subsystems
large variation among the reservoir signals (desirable). However, contrary to this intuition, many authors have reported that fully connected reservoirs work as well as sparsely connected ones. Considering that sparsely but randomly connected networks have small-world properties, it appears plausible that a sparse random wiring does not lead to a dynamical decoupling, so the original intuitions are misguided. A more practically important aspect of a sparse connectivity is that it engenders linear scaling of computational complexity. If reservoirs are set up such that each neuron on average connects to a fixed number
of other neurons, regardless of network size
, the computational cost of running the trained networks grows only linearly with
.
Finally, a third sort of bias (here the terminology is stretched a bit) is simply the reservoir size
. In the sense of statistical learning theory, increasing the reservoir size is the most direct way of increasing the model capacity.
All these kinds of bias have to be optimized jointly. The current standard practice to do this is manual experimentation.
Significance
A number of algorithms for the supervised training of RNNs have been known since the early 1990'ies, most notably real-time recurrent learning (Williams and Zipser 1989), backpropagation through time (Werbos 1990), extended Kalman filtering based methods (Puskorius and Feldkamp 2004), and the Atiya-Parlos algorithm (Atiya and Parlos 2000). All of these algorithms adapt all connections (input, recurrent, output) by some version of gradient descent. This renders these algorithms slow, and what is maybe even more cumbersome, makes the learning process prone to become disrupted by bifurcations (Doya 1992); convergence cannot be guaranteed. As a consequence, RNNs were rarely fielded in practical engineering applications. ESN training, by contrast, is fast, does not suffer from bifurcations, and is easy to implement. On a number of benchmark tasks, ESNs have starkly outperformed all other methods of nonlinear dynamical modelling (Jaeger and Haas 2004, Jaeger et al. 2007). Reservoir computing techniques thus promise to re-invigorate RNN research and applications (Jaeger, Maass and Principe 2007).
Research topics
Besides work on practical applications and biologically oriented modelling studies in the LSM branch of reservoir computing, there are three research themes which are relevant to enlarge the spectrum of engineering applications.
Optimization of reservoirs
It has been pointed out above in the discussion of global control parameters that a dynamical reservoir should be adapted to the task, and that this is currently mostly done by manual search. Optimizing reservoirs for a particular task or a class of tasks in an automated fashion is currently the most important field of engineering-oriented echo state network research. Numerous methods for an unsupervised, semi-supervised or supervised optimization of reservoirs, by structural or algebraic design, genetic search, online adaptation or pre-training are being investigated. A comprehensive survey is given in Lukosevicius and Jaeger (2007).
Stability of pattern-generating ESNs
An ESN without output feedback is inherently stable due to the echo state property, but with nonzero feedback of output signals into the reservoir, stability can be lost. A formal analysis of stability conditions is challenging; fundamental insights about conditions that preclude noise magnification are made in Maass, Joshi and Sontag (2007). For practical purposes, inserting state or teacher noise during training appears to aid stabilizing solutions, as does statistical regularization (Jaeger et al. 2007). Research on this topic has just begun.
Hierachical / modular ESNs
In mature fields of machine learning, tasks which involve complex multiscale datasets invavriably have led to modular / hierarchical models, where different modules / levels take care of different modes / scales in the data. A first stride in this direction has been taken for ESNs by the construction of hierarchical ESNs in Jaeger (2007).
Online resources
A reservoir computing toolbox (Matlab) geared toward efficiency and practical applications is available at the Reservoir Lab at the University of Gent; an ESN toolbox for academic research (Matlab, too) at Jaeger's ESN resources webpage, and a simulation engine for LSMs at Maass' Neural Microcircuits website. A moderated reservoir computing mailing list is likewise hosted at the Reservoir Lab at the University of Gent.
Patent
The Fraunhofer Institute for Intelligent Analysis and Information Systems claims international patents for commercial exploits of the ESN architecture and learning principle.
References
- Atiya A.F. and Parlos A.G. (2000) New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Trans. Neural Networks, 11(3):697-709
- Buehner M. and Young P. (2006) A tighter bound for the echo state property. IEEE Transactions on Neural Networks, 17(3):820-824
- Doya K. (1992) Bifurcations in the learning of recurrent neural networks. In Proceedings of 1992 IEEE Int. Symp. on Circuits and Systems, Vol. 6, pages 2777-2780
- Jaeger H. (2001a) The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, GMD - German National Research Institute for Computer Science
- Jaeger H. (2002a) Short term memory in echo state networks. GMD-Report 152, GMD - German National Research Institute for Computer Science
- Jaeger H. (2002b) Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach. GMD Report 159, Fraunhofer Institute AIS
- Jaeger H. (2003): Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, K. Obermayer (Eds), (MIT Press, Cambridge, MA, 2003) pp. 593-600
- Jaeger H. (2007) Discovering multiscale dynamical features with hierarchical echo state networks. Technical report 10, School of Engineering and Science, Jacobs University
- Jaeger H. and Haas, H. (2004) Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.
- Jaeger H., Lukosevicius M., Popovici D. and Siewert U. (2007) Optimization and applications of echo state networks with leaky integrator neurons. Neural Networks, 20(3):335-352
- Jaeger H., Maass, W. and Principe J. (2007) Special issue on echo state networks and liquid state machines: Editorial. Neural Networks, 20(3), 287-289
- Lukosevicius M. and Jaeger H. (2007) Overview of reservoir recipes. Technical Report 11, School of Engineering and Science, Jacobs University Bremen
- Maass W., Joshi P. and Sontag E. (2007) Computational aspects of feedback in neural circuits. PLOS Computational Biology, 3(1):1-20
- Maass W., Natschlaeger T., Markram H. (2002) Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531-2560.
- Puskorius G.V. and Feldkamp L.A. (1994) Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent network. IEEE Trans. Neural Networks, 5(2):279-297
- Schiller U.D. and Steil J. J. (2005) Analyzing the weight dynamics of recurrent learning algorithms. Neurocomputing, 63C:5-23
- Schmidhuber J., Gomez F., Wierstra D. and Gagliolo, M. (2007) Training recurrent networks by evolino. Neural Computation, 19(3):757-779
- Werbos P.J. (1990) Backpropagation through time: what it does and how to do it. Proc. of the IEEE, October, 78(10, October):1550-1560
- White O.L., Lee D.D. and Sompolinsky, H.S. (2004) Short-term memory in orthogonal neural networks. Phys. Rev. Lett., 92(14):148102
- Williams R.J. and Zipser, D. (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280
See Also
Liquid State Machine, Recurrent Neural Networks, Supervised Learning
| Herbert Jaeger (2007) Echo state network. Scholarpedia, 2(9):2330, (go to the first approved version) Created: 2 November 2006, reviewed: 6 September 2007, accepted: 6 September 2007 |

