# Recurrent neural networks

Curator: Stephen Grossberg

A recurrent neural network (RNN) is any network whose neurons send feedback signals to each other. This concept includes a huge number of possibilities. A number of reviews already exist of some types of RNNs. These include [1], [2], [3], [4].

Typically, these reviews consider RNNs that are artificial neural networks (aRNN) useful in technological applications. To complement these contributions, the present summary focuses on biological recurrent neural networks (bRNN) that are found in the brain. Since feedback is ubiquitous in the brain, this task, in full generality, could include most of the brain's dynamics. The current review divides bRNNs into two classes. In the first, feedback signals occur among neurons within a single processing layer, as in networks for such diverse functional roles as storing spatial patterns in short-term memory, winner-take-all decision making, contrast enhancement and normalization, hill climbing, oscillations of multiple types (synchronous, traveling waves, chaotic), storing temporal sequences of events in working memory, and serial learning of lists. In the second, feedback signals occur between multiple processing layers, as when bottom-up adaptive filters activate learned recognition categories and top-down learned expectations focus attention on expected patterns of critical features and thereby modulate both types of learning.

## Types of Recurrent Neural Networks

There are at least three streams of bRNN research: binary, linear, and continuous-nonlinear (Grossberg, 1988):

### Binary

Binary systems were inspired in part by neurophysiological observations showing that signals between many neurons are carried by all-or-none spikes. The binary stream was initiated by the classical McCulloch and Pitts (1943) model of threshold logic systems that describes how the activities, or short-term memory (STM) traces, $$x_i$$ of the $$i^{th}$$ node in a network interact in discrete time according to the equation:

$\tag{1} x_i(t+1) = \text{sgn} \left[ \sum_j A_{ij} x_j(t) - B_i \right],$

where $$\text{sgn}(w) = +1$$ if $$w > 0$$, $$0$$ if $$w = 0$$, and $$-1$$ if $$w < 0$$. The McCulloch-Pitts model had an influence far beyond the field of neural networks through its influence on von Neumann's development of the digital computer.
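As a minimal sketch of the discrete-time update in (1), the following Python fragment applies one synchronous threshold-logic step. The two-node weight matrix, the zero thresholds, and the function names are hypothetical choices made only for illustration; the threshold subtracted for node $$i$$ is taken to be that node's own threshold.

```python
def sgn(w):
    """Signum as defined for equation (1): +1, 0, or -1."""
    return (w > 0) - (w < 0)


def mcculloch_pitts_step(x, A, B):
    """One synchronous update x(t) -> x(t+1) of equation (1)."""
    n = len(x)
    return [
        sgn(sum(A[i][j] * x[j] for j in range(n)) - B[i])
        for i in range(n)
    ]


# Hypothetical network: two nodes that excite each other, zero thresholds.
A = [[0.0, 1.0],
     [1.0, 0.0]]
B = [0.0, 0.0]

x = [1, -1]
x_next = mcculloch_pitts_step(x, A, B)
print(x, "->", x_next)
```

With these illustrative weights each node copies the sign of the other, so the state alternates between two patterns, a simple example of an oscillation sustained by recurrent feedback.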

Caianiello (1961) used a binary STM equation that is influenced by activities at multiple times in the past:

$\tag{2} x_i(t + \tau) = 1 \left[ \sum_{j=1}^n \sum_{k=0}^{l(m)} A_{ij}^{(k)} x_j(t - k \tau) - B_i \right] ,$

where $$1[w] = 1$$ if $$w \ge 0$$ and $$0$$ if $$w < 0$$.

Rosenblatt (1962) used an STM equation that evolves in continuous time, whose activities can spontaneously decay, and which can generate binary signals above a non-zero threshold:

$\tag{3} \frac{d}{dt} x_i = -Ax_i + \sum_{j=1}^n \phi (B_j + x_j) C_{ij},$

where $$\phi(w) = 1$$ if $$w \ge \theta$$ and $$0$$ if $$w < \theta$$. This equation was used in the classical Perceptron model.

Both Caianiello (1961) and Rosenblatt (1962) introduced equations to change the weights $$A_{ij}^{(k)}$$ in (2) and $$C_{ij}$$ in (3) through learning. Such adaptive weights are often called long-term memory (LTM) traces. In both these models, interactions between STM and LTM were uncoupled in order to simplify the analysis. These LTM equations also had a digital aspect. The Caianiello (1961) LTM equations increased or decreased at constant rates until they hit finite upper or lower bounds. The Rosenblatt (1962) LTM equations were used to classify patterns into two distinct classes, as in the Perceptron Learning Theorem.

### Linear

Widrow (1962) drew inspiration from the brain to introduce the gradient descent Adaline adaptive pattern recognition machine. Anderson (1968) initially described his intuitions about neural pattern recognition using a spatial cross-correlation function. Concepts from linear system theory were adapted to represent some aspects of neural dynamics, including solutions of simultaneous linear equations $$Y = AX$$ using matrix theory, and concepts about cross-correlation. Kohonen (1971) made a transition from linear algebra concepts such as the Moore-Penrose pseudoinverse to more biologically motivated studies that are summarized in his books (Kohonen, 1977, 1984). These ideas began with a mathematically familiar engineering framework before moving towards more biologically motivated nonlinear interactions.

### Continuous-Nonlinear

Continuous-nonlinear network laws typically arose from an analysis of behavioral or neural data. Neurophysiological experiments on the lateral eye of the Limulus, or horseshoe crab, led to the award of a Nobel prize to H.K. Hartline. These data inspired the steady state Hartline-Ratliff model (Hartline and Ratliff, 1957):

$\tag{4} r_i = e_i - \sum_{j=1}^n k_{ij} \left[ r_j - r_{ij} \right]^+ ,$

where $$[w]^+ = \text{max}(w, 0)$$. Equation (4) describes how cell activations $$e_i$$ are transformed into smaller net responses $$r_i$$ due to recurrent inhibitory threshold-linear signals $$-k_{ij} \left[r_j - r_{ij}\right]^+$$. The Hartline-Ratliff model is thus a kind of continuous threshold-logic system. Ratliff et al. (1963) extended this steady-state model to a dynamical model:

$\tag{5} r_i(t) = e_i(t) - \sum_{j=1}^n k_{ij} \left[ \frac{1}{\tau} \int_0^t e^{- \frac{t-s}{\tau}} r_j(s) ds - r_{ij} \right]^+,$

which also behaves linearly above threshold. This model is a precursor of the Additive Model that is described below.
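Because $$r_i$$ appears on both sides of the steady-state equation (4), it has to be solved self-consistently. One simple way to do this, sketched below, is fixed-point iteration. This is an illustrative numerical choice, not a method taken from the cited papers, and it can only be expected to converge when the inhibitory coefficients $$k_{ij}$$ are small. The two-cell parameters are hypothetical, with self-inhibition set to zero.

```python
def hartline_ratliff_steady_state(e, k, r_thresh, iterations=200):
    """Iterate equation (4) toward a fixed point, starting from r = e.

    e[i]           excitation of cell i
    k[i][j]        inhibitory coefficient from cell j onto cell i
    r_thresh[i][j] threshold that r_j must exceed to inhibit cell i
    """
    n = len(e)
    r = list(e)
    for _ in range(iterations):
        r = [
            e[i] - sum(k[i][j] * max(r[j] - r_thresh[i][j], 0.0)
                       for j in range(n))
            for i in range(n)
        ]
    return r


# Hypothetical two-cell example with weak mutual inhibition.
e = [10.0, 5.0]
k = [[0.0, 0.2],
     [0.2, 0.0]]
r_thresh = [[0.0, 0.0],
            [0.0, 0.0]]

r = hartline_ratliff_steady_state(e, k, r_thresh)
print(r)
```

In this example both net responses are smaller than the corresponding excitations, and the more strongly excited cell suppresses its neighbor proportionally more than it is suppressed itself.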

Another classical tradition arose from the analysis of how the excitable membrane of a single neuron can generate electrical spikes capable of rapidly and non-decrementally traversing the axon, or pathway, from one neuron's cell body to a neuron to which it is sending signals. This experimental and modeling work on the squid giant axon by Hodgkin and Huxley (1952) also led to the award of a Nobel prize. Since this work focused on individual neurons rather than neural networks, it will not be further discussed herein except to note that it provides a foundation for the Shunting Model described below.

Another source of continuous-nonlinear RNNs arose through a study of adaptive behavior in real time, which led to the derivation of neural networks that form the foundation of most current biological neural network research (Grossberg, 1967, 1968b, 1968c). These laws were discovered in 1957-58 when Grossberg, then a college Freshman, introduced the paradigm of using nonlinear systems of differential equations to model how brain mechanisms can control behavioral functions. The laws were derived from an analysis of how psychological data about human and animal learning can arise in an individual learner adapting autonomously in real time. Apart from the Rockefeller Institute student monograph Grossberg (1964), it took a decade to get them published.

The following equation is called the Additive Model because it adds the terms, possibly nonlinear, that determine the rate of change of neuronal activities, or potentials, $$x_i$$:

$\tag{6} \frac{d}{dt} x_i = - A_i x_i + \sum_{j=1}^n f_j(x_j) B_{ji} z_{ji}^{(+)} - \sum_{j=1}^n g_j(x_j) C_{ji} z_{ji}^{(-)} + I_i .$

Equation (6) includes a term for passive decay ($$-A_i x_i$$), positive feedback ($$\sum_{j=1}^n f_j(x_j) B_{ji} z_{ji}^{(+)}$$), negative feedback ($$-\sum_{j=1}^n g_j(x_j) C_{ji} z_{ji}^{(-)}$$) and input ($$I_i$$). Each feedback term includes an activity-dependent (possibly) nonlinear signal ($$f_j(x_j)$$, $$g_j(x_j)$$); a connection, or path, strength ($$B_{ji}, C_{ji}$$), and an adaptive weight, or LTM trace ($$z_{ji}^{(+)}, z_{ji}^{(-)}$$). If the positive and negative feedback terms are lumped together and the connection strengths are lumped with the LTM traces, then the Additive Model may be written in the simpler form:

$\tag{7} \frac{d}{dt} x_i = -A_i x_i + \sum_{j=1}^n f_j(x_j)z_{ji} + I_i.$

Early applications of the Additive Model included computational analyses of vision, learning, recognition, reinforcement learning, and learning of temporal order in speech, language, and sensory-motor control (Grossberg, 1969b, 1969c, 1969d, 1970a, 1970b, 1971a, 1971b, 1972a, 1972b, 1974, 1975; Grossberg and Pepe, 1970, 1971). The Additive Model has continued to be a cornerstone of neural network research to the present time; e.g., in decision-making (Usher and McClelland, 2001). Physicists and engineers unfamiliar with the classical status of the Additive Model in neural networks called it the Hopfield model after the first application of this equation in Hopfield (1984). Grossberg (1988) summarizes historical factors that contributed to their unfamiliarity with the neural network literature. The Additive Model in (7) may be generalized in many ways, including the effects of delays and other factors. In the limit of infinitely many cells, an abstraction which does not exist in the brain, the discrete sum in (7) may be replaced by an integral (see Neural fields).

### Shunting STM equation

Grossberg (1964, 1968b, 1969b) also derived an STM equation for neural networks that more closely model the shunting dynamics of individual neurons (Hodgkin, 1964). In such a shunting equation, each STM trace is bounded within an interval $$[-D,B]$$. Automatic gain control, instantiated by multiplicative shunting, or mass action, terms, interacts with balanced positive and negative signals and inputs to maintain the sensitivity of each STM trace within its interval (see The Noise-Saturation Dilemma):

$\tag{8} \frac{d}{dt} x_i = -A_i x_i + (B - x_i) \left[ \sum_{j=1}^n f_j(x_j) C_{ji} z_{ji}^{(+)} + I_i \right] - (D + x_i) \left[ \sum_{j=1}^n g_j(x_j) E_{ji} z_{ji}^{(-)} + J_i \right].$

The Shunting Model is approximated by the Additive Model in cases where the inputs are sufficiently small that the resulting activities $$x_i$$ do not come close to their saturation values $$-D$$ and $$B$$.
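The relationship between the two models can be illustrated with a single cell that receives only an excitatory input $$I$$ and no feedback. This is a deliberately stripped-down special case of (7) and (8), chosen here for illustration, and the parameter values are hypothetical. For small activities the shunting term $$(B - x)I$$ is close to $$BI$$, so the additive cell is given the input $$BI$$. The sketch integrates both equations with the forward Euler method.

```python
def euler(rhs, x0, dt=0.001, steps=20000):
    """Forward-Euler integration of dx/dt = rhs(x) for a scalar x."""
    x = x0
    for _ in range(steps):
        x += dt * rhs(x)
    return x


A, B = 1.0, 1.0  # hypothetical decay rate and upper bound


def additive_steady(I):
    # dx/dt = -A x + B I : one additive cell, no feedback.
    return euler(lambda x: -A * x + B * I, 0.0)


def shunting_steady(I):
    # dx/dt = -A x + (B - x) I : one shunting cell, excitatory input only.
    return euler(lambda x: -A * x + (B - x) * I, 0.0)


for I in (0.01, 100.0):
    print(I, additive_steady(I), shunting_steady(I))
```

For the small input the two cells settle at nearly the same activity, whereas for the large input the additive activity grows in proportion to $$I$$ while the shunting activity remains below $$B$$.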

The Wilson-Cowan model (Wilson and Cowan, 1972) also uses a combination of shunting and additive terms, as in (8). However, instead of using sums of sigmoid signals that are multiplied by shunting terms, as in the right hand side of (8), the Wilson-Cowan model uses a sigmoid of sums that is multiplied by a shunting term, as in the expression $$(B - x_i) f_i \left( \sum_{j=1}^n C_{ji} x_j z_{ji}^{(+)} - x_j E_{ji} z_{ji}^{(-)} + I_i \right)$$. This form can saturate activities when inputs or recurrent signals get large, unlike (8), as noted in Grossberg (1973).

### Generalized STM equation

Equations (6) and (8) are special cases of an STM equation, introduced in Grossberg (1968c), which includes LTM terms and medium-term memory (MTM) terms, the latter changing at a rate intermediate between the faster STM and the slower LTM. The laws for STM, MTM, and LTM are specialized to deal with different evolutionary pressures in neural models of different brain systems, including additional factors such as transmitter mobilization (Grossberg, 1969b, 1969c). This generalized STM equation is:

$\tag{9} \frac{dx_i}{dt} = -A x_i + (B - Cx_i) \left[ \sum_{k=1}^n f_k(x_k) D_{ki} y_{ki} z_{ki} + I_i \right] - (E + Fx_i) \left[ \sum_{k=1}^n g_k (x_k) G_{ki} Y_{ki} Z_{ki} + J_i \right].$

In the shunting model, the parameters $$C \ne 0$$ and $$F \ne 0$$. The parameter $$E = 0$$ when there is "silent" shunting inhibition, whereas $$E \ne 0$$ describes the case of hyperpolarizing shunting inhibition. In the Additive Model, parameters $$C = F = 0$$. The excitatory interaction term $$\left[ \sum_{k=1}^n f_k (x_k) D_{ki} y_{ki} z_{ki} + I_i \right]$$ describes an external input $$I_i$$ plus the total excitatory feedback signal $$\left[ \sum_{k=1}^n f_k (x_k) D_{ki} y_{ki} z_{ki} \right]$$ that is a sum of signals from other populations via their output signals $$f_k (x_k)$$. The term $$D_{ki}$$ is a constant connection strength between cell populations $$v_k$$ and $$v_i$$, whereas terms $$y_{ki}$$ and $$z_{ki}$$ describe MTM and LTM variables, respectively. The inhibitory interaction term $$\left[ \sum_{k=1}^n g_k (x_k) G_{ki} Y_{ki} Z_{ki} + J_i \right]$$ has a similar interpretation. Equation (9) assumes "fast inhibition"; that is, inhibitory interneurons respond instantaneously to their inputs. Slower inhibition with inhibitory interneuronal activities $$X_i$$ uses an equation like (9) to describe the temporal evolution of the inhibitory activities. The output signals from these inhibitory interneurons provide inhibitory feedback signals to the excitatory activities. With slow inhibition, the inhibitory feedback signals in (9) would be $$g_k (X_k)$$ instead of $$g_k (x_k)$$.

Cohen and Grossberg (1983) derived a Liapunov function for a generalization of the Additive and Shunting Models in (9), with constant MTM and LTM variables and symmetric connections. This Liapunov function includes as a special case the Liapunov function that Hopfield (1984) stated for the Additive Model (see Cohen-Grossberg model, Liapunov function, and theorem).

### MTM: Habituative Transmitter Gates and Depressing Synapses

Medium-term memory (MTM), or activity-dependent habituation, often called habituative transmitter gates, has multiple roles. One role is to carry out intracellular adaptation that divides the response to a current input by a time-average of recent input intensity. A related role is to prevent recurrent activation from persistently choosing the same neuron, by reducing the net input to this neuron. MTM traces also enable reset events to occur. For example, in a gated dipole opponent processing network, they enable an antagonistic rebound in activation to occur in the network's OFF channel in response to either a rapidly decreasing input to the ON channel, or to an arousal burst to both channels that is triggered by an unexpected event (Grossberg, 1972b, 1980a). This property enables a resonance that reads out a predictive error to be quickly reset, thereby triggering a memory search, or hypothesis testing, to discover a recognition category capable of better representing an attended object or event (see Adaptive Resonance Theory; Grossberg, 2012; [5]). MTM reset dynamics also help to explain data about the dynamics of visual perception, cognitive-emotional interactions, decision-making under risk, and sensory-motor control (Francis and Grossberg, 1996; Francis et al., 1994; Gaudiano and Grossberg, 1991, 1992; Grossberg, 1972b, 1980a, 1984a, 1984b; Grossberg and Gutowski, 1987; Ogmen and Gagné, 1990).

In (9), the $$i^{th}$$ MTM trace, or habituative transmitter gate, $$y_i$$, typically obeys the equation:

$\tag{10} \frac{dy_i}{dt} = H(K - y_i) - Lf_k(x_k)y_k.$

By (10), $$y_i$$ accumulates at a fixed rate $$H$$ to its maximum value $$K$$ via term $$H(K - y_i)$$ and is inactivated, habituated, or depressed via a mass action interaction between the feedback signal $$f_k(x_k)$$ and the gate concentration $$y_k$$ via term $$Lf_k(x_k)y_k$$. Abbott et al. (1997) reported neurophysiological data from the visual cortex and rederived this MTM equation from it, calling it a depressing synapse. Tsodyks and Markram (1997) derived a related equation using their data from the somatosensory cortex, calling it a dynamic synapse. The mass action term may be more complex than it is in (10) in some situations; e.g., Gaudiano and Grossberg (1991, 1992) and Grossberg and Seitz (2003). The habituative transmitter gate $$Y_k$$ in the inhibitory feedback term of (9) obeys a similar equation. By multiplying intercellular signals, transmitter gates can modulate their efficacy in an activity-dependent way. Not all signals need to be habituative.
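The adaptation described above can be illustrated with a single gate $$y$$ driven by a signal $$S$$ that is held constant, where $$S$$ stands in for the feedback signal $$f_k(x_k)$$. Under that simplifying assumption the gate law reads $$dy/dt = H(K - y) - LSy$$, with equilibrium $$y = HK/(H + LS)$$, so the gated output $$Sy$$ is the signal divided by a term that grows with the signal itself. The sketch below uses hypothetical parameter values and forward Euler integration to show the overshoot and habituation that follow a step increase in $$S$$.

```python
def simulate_gate(y0, S, H=0.1, K=1.0, L=1.0, dt=0.01, steps=5000):
    """Euler-integrate dy/dt = H (K - y) - L S y for a constant signal S."""
    y = y0
    for _ in range(steps):
        y += dt * (H * (K - y) - L * S * y)
    return y


def gate_equilibrium(S, H=0.1, K=1.0, L=1.0):
    """Closed-form equilibrium of the gate for a constant signal S."""
    return H * K / (H + L * S)


# Habituate to a weak signal, then step the signal up by a factor of 4.
y_low = simulate_gate(y0=1.0, S=1.0)
transient = 4.0 * y_low            # gated output just after the step
y_high = simulate_gate(y0=y_low, S=4.0)
sustained = 4.0 * y_high           # gated output after habituation

print(transient, sustained)
```

The gated output is largest immediately after the signal increases and then habituates to a lower sustained level.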

Figure 1: Outstar network for spatial pattern learning. Red bars are a schematic for cell activities, and green hemidisks for synaptic weights. Two typical learning laws for Outstar learning are also indicated.

### LTM: Gated steepest descent learning: Not Hebbian learning

An oft-used equation for the learning of adaptive weights, or long-term memory (LTM) traces, is called gated steepest descent learning. Gated steepest descent learning permits adaptive weights to increase or decrease (Grossberg, 1967, 1968b, 1968c). This is because the unit of LTM in the Additive and Shunting Models was proved to be a distributed pattern of LTM traces across a network, and the LTM traces learn to match the pattern of activities, or STM traces, of cells across the network (see Processing and STM of Spatial Patterns). If an STM activity is large (small), then the LTM trace can increase (decrease). These learning laws are thus not Hebbian, because the Hebb (1949) learning postulate says that: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased". This postulate only allows LTM traces to increase. Thus, after sufficient learning took place, Hebbian traces would saturate at their maximum values. The Hebb postulate assumed the wrong processing unit: It is not the strength of an individual connection; rather it is a distributed pattern of LTM traces.

Figure 2: Outstar drawn to clarify where the name Outstar comes from.

One variant of gated steepest descent learning, called Outstar Learning, was introduced in Grossberg (1968b) for spatial pattern learning (Figure 1 and Figure 2).

Figure 3: Instar network for adaptive filtering and spatial pattern classification, or categorization. Same conventions as in Figure 1.

Another variant is called Instar Learning, which was used in Grossberg (1976a) for the learning of bottom-up adaptive filters (Figure 3) in Self-Organizing Map (SOM) models [6]. A SOM uses a recurrent on-center off-surround network (Figure 4) to choose one, or a small number, of cells for storage in STM (see Processing and STM of Spatial Patterns), before the stored activities trigger learning of LTM traces in abutting synapses (see Sparse Stable Category Learning Theorem). Kohonen (1984) also used Instar Learning in his applications of the SOM model.

Figure 4: Competitive Learning and Self-Organizing Map models use Instar inputs to drive a contrast-enhancing competition by a recurrent on-center off-surround network.

Outstar and Instar Learning are dual networks in the sense that they are the same, except for reversing which cells are sampling and which are sampled (Figure 5).

Figure 5: The Outstar and Instar networks are related to each other by duality; that is, by the exchange of the sampling and sampled cells.

Outstars and Instars were combined in Grossberg (1976a) to form a three-layer Instar-Outstar network for learning multi-dimensional maps from any m-dimensional input space to any n-dimensional output space (Figure 6). The Instars learn recognition categories that selectively respond to an m-dimensional input pattern (see Sparse Stable Category Learning Theorem), and an active category samples a simultaneously active n-dimensional output pattern (see Outstar Learning Theorem). Hecht-Nielsen (1987) called such a network a counterpropagation network.

Figure 6: An Instar-Outstar network can learn an arbitrary mapping between m-dimensional input vectors and n-dimensional output vectors.

In ART models, these concepts were used to define a bRNN. In the article Grossberg (1976b) that introduced ART, Instars define the learning in bottom-up adaptive filters, and Outstars define the learning in top-down expectations (Figure 7). The learning instabilities of competitive learning and SOM models that were described in Grossberg (1976a) led Grossberg (1976b) to show how matching of bottom-up feature patterns by top-down learned expectations, and the ensuing focusing of attention upon critical feature patterns, can dynamically stabilize the memories learned in SOM models, as well as the multi-dimensional maps learned by an Instar-Outstar network (see Adaptive Resonance Theory).

Figure 7: In an ART model, bottom-up category learning often uses Instars, whereas top-down learning of expectations often uses Outstars.

Outstar Learning Equation:

$\tag{11} \frac{dz_{ij}}{dt} = Mf_i(x_i) \left[ h_j(x_j) - z_{ij} \right].$

Instar Learning Equation:

$\tag{12} \frac{dz_{ij}}{dt} = Mf_j(x_j) \left[ h_i(x_i) - z_{ij} \right].$

Equation (11) describes the outstar learning equation, by which the $$i^{th}$$ source, or sampling, cell can sample and learn a distributed spatial pattern of activation across a network of sampled cells ($$j \in J$$). When the gating signal $$f_i(x_i)$$ is positive, the adaptive weights $$z_{ij}$$ can sample the activity-dependent signals $$h_j(x_j)$$ across the sampled network of cells. Equation (12) describes the instar learning equation, by which the $$j^{th}$$ target cell can sample and learn the distributed pattern of signals ($$i \in I$$) that activated it. There are many variations of these gated steepest descent equations, including doubly-gated learning, spike-timing dependent learning, and self-normalizing learning (e.g., Gorchetchnikov et al., 2005; Grossberg and Seitz, 2003). Not all connections need to be adaptive.
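A minimal sketch of gated steepest descent for one Outstar sampling cell follows. It assumes that the gating signal $$f_i(x_i)$$ and the sampled signals $$h_j(x_j)$$ are held constant while learning proceeds, and it uses forward Euler steps with hypothetical values of $$M$$, the step size, and the initial weights. The Instar law (12) has the same form with the roles of the gating and sampled cells exchanged.

```python
def outstar_step(z, gate, h, M=1.0, dt=0.01):
    """One Euler step of (11) for a single sampling cell i.

    z[j]  adaptive weight z_ij onto sampled cell j
    gate  gating signal f_i(x_i) of the sampling cell
    h[j]  sampled signal h_j(x_j)
    """
    return [z_j + dt * M * gate * (h_j - z_j) for z_j, h_j in zip(z, h)]


z0 = [0.9, 0.1]   # hypothetical initial weights
h = [0.2, 0.8]    # hypothetical sampled pattern

# Gate closed: no learning and no forgetting.
z_off = outstar_step(z0, gate=0.0, h=h)

# Gate open: the weights track the sampled pattern.
z_on = z0
for _ in range(2000):
    z_on = outstar_step(z_on, gate=1.0, h=h)

print(z_off, z_on)
```

With the gate open, the first weight decreases and the second increases until the weight vector matches the sampled pattern, which illustrates why this law is not Hebbian in the sense discussed earlier. With the gate closed, the weights do not change.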

As illustrated below, various combinations of these STM, MTM, and LTM equations have been used in scores of modeling studies since the 1960s. In particular, they were used by O'Reilly and Munakata (2000) in what they call the Leabra model.

## Processing and STM of Spatial Patterns

### Transformation and short-term storage of distributed input patterns by neural networks

The brain is designed to process patterned information that is distributed across networks of neurons. For example, a picture is meaningless as a collection of independent pixels. In order to understand 2D pictures and 3D scenes, the brain processes the spatial pattern of inputs that is received from them by the photosensitive retinas. Within the context of a spatial pattern, the information from each pixel can acquire meaning. The same is true during temporal processing. For example, individual speech sounds heard out of context may sound like meaningless chirps. They sound like speech and language when they are part of a characteristic temporal pattern of signals. The STM, MTM, and LTM equations enable the brain to effectively process and learn from both spatial and temporal patterns of information.

Both spatial and temporal patterns may be received at multiple intensities. Scenes can be seen in dim or bright light, and speech can be heard if it is whispered or shouted. In order to process either spatial or temporal patterns using neurons, brains have evolved network designs that can compensate for variable input intensities without a loss of pattern information.

### The Noise-Saturation Dilemma

Without suitable interactions between neurons, their input patterns can be lost in cellular noise if they are too small, or can saturate cell activities at their maximum values if they are too large. Input amplitudes can also vary from small to large through time, just as the intensity of light can vary from dim to bright.  During the processing of a visual input from a fixed object, the total intensity of an input can change while the relative intensity remains constant. The relative intensity is called the reflectance of the surface that reflects variable intensities of light to the eye. Many other examples exist wherein total intensity changes while relative intensity remains constant.

What sort of network interactions enable neurons to retain their sensitivities to the relative sizes of their inputs across the network, even while these inputs may vary in size through time over several orders of magnitude?  The answer is: an on-center off-surround network whose cells obey the membrane, or shunting, equations of neurophysiology (Grossberg, 1973, 1980a). This fact helps to explain why such networks are ubiquitous in the brain.

### A thought experiment to solve the noise-saturation dilemma

Suppose that a spatial pattern $$I_i = \theta_i I$$ of inputs is processed by a network of cells $$v_i, i=1,2,...,n$$. Each $$\theta_i$$ is the constant relative size, or reflectance, of its input $$I_i$$ and $$I = \sum_{k=1}^n I_k$$ is the variable total input size. Thus $$\sum_{k=1}^n \theta_k = 1$$. How can each cell $$v_i$$ maintain its sensitivity to $$\theta_i$$ when $$I$$ is parametrically increased? How is saturation avoided?

To compute $$\theta_i = I_i \left( \sum_{k=1}^n I_k \right)^{-1}$$, each cell $$v_i$$ must have information about all the inputs $$I_k, k=1,2,...,n$$. Rewriting the ratio $$\theta_i$$ as $$\theta_i = I_i \left( I_i + \sum_{k \ne i} I_k \right)^{-1}$$ calls attention to the fact that increasing $$I_i$$ increases $$\theta_i$$, whereas increasing any $$I_k, k \ne i$$, decreases $$\theta_i$$. When this property is translated into an anatomy for delivering feedforward inputs to the cells $$v_i$$, it suggests that the input $$I_i$$ excites $$v_i$$ and that all the inputs $$I_k, k \ne i$$, inhibit $$v_i$$. In other words, all the inputs compete among themselves while trying to activate their own cell. This rule represents a feedforward on-center off-surround anatomy. It has been known that on-center off-surround anatomies are ubiquitous in the brain at least since they were reported in the cat retina by Kuffler (1953).

How does the on-center off-surround anatomy activate and inhibit the cells $$v_i$$ through time? Suppose that each cell possesses $$B$$ excitable sites of which $$x_i(t)$$ are excited and $$B - x_i(t)$$ are not excited at time $$t$$. Thus, at cell $$v_i$$, input $$I_i$$ excites the $$B - x_i(t)$$ unexcited sites, and the total inhibitory input $$\sum_{k \ne i} I_k$$ from the off-surround inhibits the $$x_i(t)$$ excited sites. Suppose, in addition, that excitation $$x_i(t)$$ can spontaneously decay at a fixed rate $$A$$, so that the cell can return to an equilibrium point, set to equal $$0$$ for simplicity, after all inputs cease. Putting these properties together in one equation yields:

$\tag{13} \frac{d}{dt} x_i = -A x_i + (B - x_i) I_i - x_i \sum_{k \ne i} I_k .$

Equation (13) defines a feedforward on-center ($$I_i$$) off-surround ($$\sum_{k \ne i} I_k$$) network whose cells obey a simple version of the Shunting Model in equation (8).

If a fixed spatial pattern $$I_i = \theta_i I$$ is presented and the total input $$I$$ is held constant for a while, then each $$x_i(t)$$ approaches an equilibrium value that is found by setting $$\frac{d}{dt} x_i = 0$$ in equation (13). Then

$\tag{14} x_i = \theta_i \frac{BI}{A + I}.$

Note that the relative activity $$X_i = x_i \left( \sum_{k=1}^n x_k \right)^{-1}$$ equals $$\theta_i$$ no matter how large $$I$$ is chosen; there is no saturation. However, if the off-surround input is removed, then all the $$x_i$$ saturate at $$B$$ as the total input $$I$$ becomes large.
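The two claims above can be checked numerically. The sketch below evaluates the equilibrium (14) and, for comparison, the equilibrium $$x_i = BI_i/(A + I_i)$$ that results from (13) when the off-surround term is deleted. The reflectances, parameter values, and function names are hypothetical choices for illustration.

```python
def steady_state_with_surround(inputs, A=1.0, B=1.0):
    """Equilibrium (14) of the on-center off-surround network (13)."""
    total = sum(inputs)
    return [B * I_i / (A + total) for I_i in inputs]


def steady_state_without_surround(inputs, A=1.0, B=1.0):
    """Equilibrium of dx_i/dt = -A x_i + (B - x_i) I_i (no off-surround)."""
    return [B * I_i / (A + I_i) for I_i in inputs]


theta = [0.1, 0.3, 0.6]  # hypothetical reflectances, summing to 1

for I in (1.0, 100.0, 10000.0):
    inputs = [t * I for t in theta]
    x = steady_state_with_surround(inputs)
    relative = [x_i / sum(x) for x_i in x]
    saturated = steady_state_without_surround(inputs)
    print(I, relative, saturated)
```

With the off-surround in place, the relative activities reproduce the reflectances at every total intensity. Without it, all activities crowd toward $$B$$ as $$I$$ grows, and the pattern is lost.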

### Automatic gain control by the off surround prevents saturation

Saturation is prevented in (13) due to automatic gain control by the inhibitory inputs from the off-surround. In other words, the off-surround $$\sum_{k \ne i} I_k$$ multiplies $$x_i$$. The total gain is found by rewriting (13) as:

$\tag{15} \frac{d}{dt} x_i = -(A + I) x_i + BI_i.$

The gain is the coefficient of $$x_i$$, namely $$-(A + I)$$. Indeed, if $$x_i(0) = 0$$, then (15) can be integrated to yield:

$\tag{16} x_i(t) = \theta_i \frac{BI}{A + I} \left( 1 - e^{-(A + I)t} \right) .$

By (16), both the steady state and the rate of change of $$x_i$$ depend upon input strength $$I$$. This is characteristic of mass action, or shunting, networks but not of additive networks, in which the inputs do not multiply the activities $$x_i$$.
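As a check on (16), the following sketch integrates (13) with the forward Euler method from $$x_i(0) = 0$$ and compares the result with the closed-form solution. The parameter values and the step size are hypothetical. It also computes the time needed to reach half of the steady state, $$\ln 2 / (A + I)$$, which shortens as the total input grows.

```python
import math


def closed_form(theta_i, I, t, A=1.0, B=1.0):
    """Solution (16) of equation (15) with x_i(0) = 0."""
    return theta_i * B * I / (A + I) * (1.0 - math.exp(-(A + I) * t))


def euler_13(theta, I, t_end, A=1.0, B=1.0, dt=1e-4):
    """Forward-Euler integration of (13) for the pattern I_i = theta_i * I."""
    inputs = [th * I for th in theta]
    x = [0.0] * len(theta)
    for _ in range(round(t_end / dt)):
        x = [
            x_i + dt * (-A * x_i + (B - x_i) * I_i - x_i * (I - I_i))
            for x_i, I_i in zip(x, inputs)
        ]
    return x


def half_time(I, A=1.0):
    """Time at which x_i reaches half of its steady state."""
    return math.log(2.0) / (A + I)


theta = [0.2, 0.8]
numeric = euler_13(theta, I=10.0, t_end=0.1)
exact = [closed_form(th, 10.0, 0.1) for th in theta]
print(numeric, exact, half_time(1.0), half_time(10.0))
```

The numerical and closed-form values agree up to the discretization error of the Euler scheme, and the half-time is shorter for the larger total input.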

### Contrast normalization and pattern processing by real-time probabilities

Another property of (14) is that the total activity:

$\tag{17} x = \sum_{k=1}^n x_k = \frac{BI}{A + I}$

is independent of the number of active cells and approaches $$B$$ as $$I$$ increases. This normalization rule is a conservation law which says, for example, that increasing one activity forces a decrease in other activities. This property helps to explain such properties of visual perception as brightness constancy and brightness contrast (Cornsweet, 1970; Grossberg and Todorovic, 1988). During brightness contrast, increasing the luminance of inputs to the off-surround makes the on-center look darker. The normalization property is called contrast normalization in applications to visual perception. More generally, normalization underlies many properties of limited capacity processing in the brain, notably in perception and cognition, with working memory capacity limits being a classical example.

### Weber Law and shift property

Writing equation (14) in logarithmic coordinates shows that increasing the off-surround input does not reduce the sensitivity of the network to inputs to the on-center; rather, it shifts network responses to larger input sizes without a loss of sensitivity. In particular, let $$K_i = \ln(I_i)$$ and $$I_i = e^{K_i}$$. Also write the total off-surround input as $$J_i = \sum_{k \ne i} I_k$$. Then (14) can be written in logarithmic coordinates as:

$\tag{18} x_i(K_i, J_i) = \frac{B e^{K_i}}{A + e^{K_i} + J_i}.$

How does the activity $$x_i$$ change if the off-surround input $$J_i$$ is parametrically set at increasingly high values? Equation (18) shows that the entire response curve of $$x_i$$ to its on-center input $$K_i$$ shifts, and thus its dynamic range is not compressed. For example, suppose that the off-surround input is increased from $$J_i^{(1)}$$ to $$J_i^{(2)}$$. Then the response curve shifts by an amount $$S_i$$, in the sense that $$x_i(K_i + S_i, J_i^{(2)}) = x_i(K_i, J_i^{(1)})$$ for all $$K_i$$, where:

$\tag{19} S_i = \ln \frac{A + J_i^{(2)}}{A + J_i^{(1)}}.$

Such a shift property is found, for example, in the retina of the mudpuppy Necturus (Werblin, 1971). Generalizations of the feedforward on-center off-surround shunting network equations generate many other useful properties, including Weber law processing, adaptation level processing, and edge and spatial frequency processing (Grossberg, 1983).
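The shift formula (19) can be verified directly from (18): evaluating the response at the larger off-surround input and at a log input displaced by $$S_i$$ should reproduce the response at the smaller off-surround input. The sketch below performs this check for a few log inputs. The parameter values and the two off-surround levels are hypothetical.

```python
import math


def response(K, J, A=1.0, B=1.0):
    """Equation (18): activity as a function of log input K and surround J."""
    return B * math.exp(K) / (A + math.exp(K) + J)


def shift(J1, J2, A=1.0):
    """Equation (19): shift of the response curve when J goes from J1 to J2."""
    return math.log((A + J2) / (A + J1))


J1, J2 = 1.0, 9.0
S = shift(J1, J2)

for K in (-2.0, 0.0, 3.0):
    print(K, response(K, J1), response(K + S, J2))
```

The two response columns coincide, so raising the off-surround input translates the whole curve along the log-input axis without compressing it.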

### Physiological interpretation of shunting dynamics: The membrane equation of neurophysiology

The Shunting equation (13) has the form of the membrane equation on which cellular neurophysiology is based. This membrane equation is the voltage equation that appears in the equations of Hodgkin and Huxley (1952). In other words, the gedanken experiment shows how the noise-saturation dilemma is solved by using the membrane, or shunting, equation of neurophysiology to describe cells interacting in on-center off-surround anatomies. Because on-center off-surround anatomy and shunting dynamics work together to solve the noise-saturation dilemma, it is reasonable to predict that they coevolved during evolution.

The membrane equation describes the voltage $$V(t)$$ of a cell by the law:

$\tag{20} C \frac{\partial V}{\partial t} = (V^+ - V)g^+ + (V^- - V)g^- + (V^p - V) g^p .$

In (20), $$C$$ is a capacitance; $$V^+$$, $$V^-$$, and $$V^p$$ are constant excitatory, inhibitory, and passive saturation voltages, respectively; and $$g^+$$, $$g^-$$, and $$g^p$$ are excitatory, inhibitory, and passive conductances, respectively. When the saturation voltages are chosen to satisfy $$V^- \le V^p < V^+$$, then the cell voltage satisfies $$V^- \le V(t) \le V^+$$. Often $$V^+$$ represents the saturation point of a $$Na^+$$ channel, and $$V^-$$ represents the saturation point of a $$K^+$$ channel.
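The correspondence between the membrane equation and the shunting equation (13) can be made explicit by direct substitution. Choosing $$C = 1$$, $$V^+ = B$$, $$V^- = V^p = 0$$, $$g^+ = I_i$$, $$g^- = \sum_{k \ne i} I_k$$, and $$g^p = A$$ turns the right hand side of the membrane equation into that of (13). The sketch below checks this identity numerically and also computes the steady-state voltage for constant conductances, which is a conductance-weighted average of the three saturation voltages and therefore stays between the smallest and largest of them. All numerical values are hypothetical.

```python
def membrane_rhs(V, g_plus, g_minus, g_p, V_plus, V_minus, V_p, C=1.0):
    """Right hand side of the membrane equation, divided by C."""
    return ((V_plus - V) * g_plus
            + (V_minus - V) * g_minus
            + (V_p - V) * g_p) / C


def shunting_rhs(x, I_on, I_off, A, B):
    """Right hand side of the feedforward shunting equation (13)."""
    return -A * x + (B - x) * I_on - x * I_off


def steady_voltage(g_plus, g_minus, g_p, V_plus, V_minus, V_p):
    """Equilibrium voltage for constant conductances."""
    return ((V_plus * g_plus + V_minus * g_minus + V_p * g_p)
            / (g_plus + g_minus + g_p))


# Substitution that maps the membrane equation onto (13).
x, I_on, I_off, A, B = 0.3, 2.0, 3.0, 0.5, 1.0
lhs = membrane_rhs(x, g_plus=I_on, g_minus=I_off, g_p=A,
                   V_plus=B, V_minus=0.0, V_p=0.0)
rhs = shunting_rhs(x, I_on, I_off, A, B)

# Equilibrium with a hyperpolarizing inhibitory saturation voltage.
V_eq = steady_voltage(2.0, 3.0, 0.5, V_plus=1.0, V_minus=-0.2, V_p=0.0)
print(lhs, rhs, V_eq)
```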

There is symmetry-breaking in (20) because $$V^+ - V^p$$ is usually much larger than $$V^p - V^-$$. Symmetry-breaking implies a noise suppression property when it is coupled to an on-center off-surround anatomy (Grossberg, 1988). Then the network suppresses uniformly active inputs and generates suprathreshold responses only to inputs that are larger than a baseline value, or adaptation level. This property illustrates that excitation and inhibition need to be properly balanced to achieve efficient neuronal dynamics: When excitation is too large, seizure activity can occur in a bRNN. When inhibition is too large, processing can never get started. Symmetry-breaking can be achieved during development by an opposites attract rule whereby the relative sizes of the intracellular excitatory and inhibitory saturation voltages $$V^+$$ and $$V^-$$ control the relative total strengths of the intercellular off-surround and on-center connections, respectively (Grossberg, 1978a, Section 45).

### Recurrent competitive fields

The activities $$x_i$$ in (13) rapidly decay if their inputs $$I_i$$ are shut off. Persistent storage in STM is achieved when feedback signals exist among the various populations, thereby creating a bRNN. The noise-saturation dilemma confronts all cellular tissues which process input patterns, whether the cells exist in a feedforward or feedback anatomy. To solve the noise-saturation dilemma in a RNN, excitatory feedback signals need to be balanced by inhibitory feedback signals. The simplest recurrent on-center off-surround shunting RNN, also called a recurrent competitive field (RCF), is defined by (Grossberg, 1973):

$\tag{21} \frac{d}{dt} x_i = -A x_i + (B - x_i) \left[ I_i + f(x_i) \right] - x_i \left[ J_i + \sum_{k \ne i} f(x_k) \right] .$

### Winner-take-all, contrast enhancement, normalization, and quenching threshold

Grossberg (1973) proved theorems showing how the choice of feedback signal function $$f(w)$$ transforms an input pattern before it is stored persistently in STM. Given the fundamental nature of these results for all bRNNs, they will be reviewed below.

Figure 8: How different choices of the feedback signal function $$f$$ in a recurrent on-center off-surround shunting network can transform input patterns before they are stored in short-term memory.

Figure 8 summarizes the results. These theorems provided the first rigorous proofs of winner-take-all (WTA) properties, and of the use of sigmoid signal functions to generate a self-normalizing "bubble", or partial contrast-enhancement, above a quenching threshold. The theorems began the mathematical classification of cooperative-competitive recurrent nonlinear dynamical systems, whose properties are applicable to many fields, ranging from morphogenesis to economics (Grossberg, 1988).

To prove the theorems, (21) is transformed into the total activity variable $$x = \sum_{k=1}^n x_k$$ and the pattern variables $$X_i = x_i x^{-1}$$ under the assumption that the inputs $$I_i$$ and $$J_i$$ are set to zero during the STM storage process. Then (21) may be rewritten as:

$\tag{22} \frac{d}{dt} X_i = BX_i \sum_{k=1}^n X_k \left[ h(X_i x) - h(X_k x) \right]$

and

$\tag{23} \frac{d}{dt} x = -Ax + (B - x) \sum_{k=1}^n f(X_k x),$

where the function $$h(w) = f(w)w^{-1}$$ is called the hill function because it exhibits a "hill" of activity for every transition between a faster-than-linear, linear, and slower-than-linear shape in the signal function, as shown for the sigmoid function in Figure 8 and Figure 9.

Figure 9: A sigmoid signal function exploits the properties of its faster-than-linear, approximately linear, and slower-than-linear processing ranges to partially contrast-enhance an input pattern before storing it in short-term memory, while suppressing signals that fall below a quenching threshold.
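As a rough illustration of the hill function, the sketch below samples $$h(w) = f(w)w^{-1}$$ on a grid and reports whether it is constant, increasing, decreasing, or mixed. The four signal functions, the sampling interval, and the helper names are assumptions made for this example only.

```python
def hill(f, w):
    """Hill function h(w) = f(w) / w, defined for w > 0."""
    return f(w) / w


def hill_trend(f, lo=0.05, hi=3.0, samples=60):
    """Classify h on [lo, hi] as 'constant', 'increasing', 'decreasing' or 'mixed'."""
    ws = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    hs = [hill(f, w) for w in ws]
    diffs = [b - a for a, b in zip(hs, hs[1:])]
    tol = 1e-12
    if all(abs(d) <= tol for d in diffs):
        return "constant"
    if all(d > -tol for d in diffs):
        return "increasing"
    if all(d < tol for d in diffs):
        return "decreasing"
    return "mixed"


# Illustrative signal functions (assumed forms, one per case in the text).
signal_functions = {
    "linear": lambda w: 2.0 * w,
    "slower-than-linear": lambda w: w / (1.0 + w),
    "faster-than-linear": lambda w: w ** 2,
    "sigmoid": lambda w: w ** 2 / (1.0 + w ** 2),
}

for name, f in signal_functions.items():
    print(name, "->", hill_trend(f))
```

For the assumed sigmoid, $$h(w) = w(1 + w^2)^{-1}$$ rises up to $$w = 1$$ and falls afterwards, which is the single "hill" referred to above.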

If $$f(w)$$ is linear, that is, $$f(w) = Cw$$, then $$h(w) = C$$ and all $$\frac{d}{dt} X_i = 0$$ in (22). Then (21) can preserve any pattern in STM! However, taking $$C = 1$$ in (23), if $$A \ge B$$, then $$x(t)$$ approaches $$0$$ as $$t \rightarrow \infty$$, so that no pattern is stored in STM. A pattern is stored in STM only if $$B > A$$. Then $$x(t) \rightarrow B - A$$ as $$t \rightarrow \infty$$, so that the total activity is normalized. (For general $$C > 0$$, the storage condition becomes $$BC > A$$ and the limit becomes $$B - A/C$$.) This result implies that, if STM storage is possible and $$x(0) > 0$$, then $$x(t) \rightarrow B - A$$ even if no input occurs. In other words, noise will be amplified as vigorously as inputs. A linear signal function amplifies noise, and is therefore inadequate despite its perfect memory of any input pattern. That is why nonlinear signal functions are needed.

A slower-than-linear signal function, for example $$f(w) = Cw(D + w)^{-1}$$ or, more generally, any $$f(w)$$ whose hill function $$h(w)$$ is monotone decreasing, is even worse, because it amplifies noise and eliminates all differences in inputs within the stored pattern. This happens because, by (22), if $$X_i > X_k, k \ne i$$, then $$\frac{d}{dt} X_i < 0$$ and if $$X_i < X_k, k \ne i$$, then $$\frac{d}{dt} X_i > 0$$. Thus the maximum activity decreases and the minimum activity increases until all the activities become equal.
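The linear and slower-than-linear cases can be explored with a small simulation of (21) with the inputs switched off. This is only a sketch under stated assumptions: forward-Euler integration, $$A = 1$$, $$B = 3$$, the linear signal $$f(w) = w$$ (so $$C = 1$$), and the slower-than-linear signal $$f(w) = w(1 + w)^{-1}$$. The function names are hypothetical.

```python
def simulate_rcf(x0, f, A=1.0, B=3.0, dt=0.01, steps=20000):
    """Forward-Euler integration of (21) with I_i = J_i = 0."""
    x = list(x0)
    for _ in range(steps):
        signals = [f(v) for v in x]
        total_signal = sum(signals)
        # On-center term (B - x_i) f(x_i), off-surround term x_i * sum_{k != i} f(x_k).
        x = [v + dt * (-A * v + (B - v) * s - v * (total_signal - s))
             for v, s in zip(x, signals)]
    return x


def relative(x):
    """Pattern variables X_i = x_i / sum_k x_k."""
    total = sum(x)
    return [v / total for v in x]


def linear(w):
    return w


def slower(w):
    return w / (1.0 + w)


pattern = [0.2, 0.5, 0.8]
noise = [0.001, 0.002, 0.003]

stored_linear = simulate_rcf(pattern, linear)
noise_linear = simulate_rcf(noise, linear)
stored_slower = simulate_rcf(pattern, slower)

print(relative(stored_linear), sum(stored_linear))
print(relative(noise_linear), sum(noise_linear))
print(stored_slower)
```

With these assumed parameters the linear network keeps the relative activities of both the pattern and the noise while driving the total toward $$B - A = 2$$, and the slower-than-linear network flattens the pattern to equal activities.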

If both linear and slower-than-linear signal functions amplify noise, then one must turn to faster-than-linear functions in the hope that they suppress noise. If $$f(w)$$ is faster-than-linear, for example $$f(w) = Cw^n, n > 1$$, or, more generally, any $$f(w)$$ whose hill function $$h(w)$$ is monotone increasing, then noise is, indeed, suppressed. In this case, if $$X_i > X_k, k \ne i$$, then $$\frac{d}{dt} X_i > 0$$ and if $$X_i < X_k, k \ne i$$, then $$\frac{d}{dt} X_i < 0$$. As a result, the network chooses the population with the initial maximum of activity and totally inhibits activity in all other populations. This network behaves like a winner-take-all binary choice machine. The same is true for total activity, since as $$t \rightarrow \infty$$, (23) becomes approximately:

$\tag{24} \frac{d}{dt} x \cong x [ -A + (B - x) h(x) ].$

Thus, the equilibrium points of $$x(t)$$ as $$t \rightarrow \infty$$ are $$E_0 = 0$$ and all the solutions of the equation

$\tag{25} h(x) = A(B - x)^{-1};$

see Figure 10. If $$h(0) < A / B$$, then the smallest solution, $$E_1$$, of (25) is unstable, so that activities $$x(t) < E_1$$ are suppressed as $$t \rightarrow \infty$$. This is noise suppression due to recurrent competition. Every other solution $$E_2, E_4, ...$$ of (25) is a stable equilibrium point of $$x(t)$$ as $$t \rightarrow \infty$$ (total activity quantization) and all equilibria are smaller than $$B$$ (normalization).

Figure 10: Equilibrium points of a winner-take-all recurrent competitive field with a faster-than-linear signal function are intersections of the hill function $$h(w)$$ with the function $$A(B - w)^{-1}$$. See text for details.
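A worked instance may help. Assuming the faster-than-linear signal $$f(w) = w^2$$, the hill function is $$h(w) = w$$ and (25) reduces to the quadratic $$x^2 - Bx + A = 0$$. The sketch below, with the illustrative values $$A = 0.2$$ and $$B = 3$$ and hypothetical function names, computes the two roots and integrates (21) by forward Euler from one pattern above and one pattern below the smaller root.

```python
import math


def simulate_rcf(x0, f, A, B, dt=0.005, steps=20000):
    """Forward-Euler integration of (21) with I_i = J_i = 0."""
    x = list(x0)
    for _ in range(steps):
        signals = [f(v) for v in x]
        total_signal = sum(signals)
        x = [v + dt * (-A * v + (B - v) * s - v * (total_signal - s))
             for v, s in zip(x, signals)]
    return x


def wta_equilibria(A, B):
    """Nonzero roots of (25) when f(w) = w**2, i.e. of x**2 - B*x + A = 0."""
    disc = math.sqrt(B * B - 4.0 * A)
    return (B - disc) / 2.0, (B + disc) / 2.0


def square(w):
    return w * w


A, B = 0.2, 3.0
E1, E2 = wta_equilibria(A, B)

chosen = simulate_rcf([0.8, 1.0, 1.2], square, A, B)        # total above E1
quenched = simulate_rcf([0.005, 0.01, 0.015], square, A, B)  # total below E1

print("E1 =", E1, "E2 =", E2)
print("suprathreshold pattern ->", chosen)
print("subthreshold pattern   ->", quenched)
```

In this example the initially largest activity approaches $$E_2$$ while the others are driven toward zero, and the pattern whose total starts below $$E_1$$ decays as a whole.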

The faster-than-linear signal contrast-enhances the pattern so vigorously that the good property of noise suppression is joined to the extreme property of winner-take-all (WTA) choice. Although WTA is often a useful property in applications to choice behavior (e.g., Dev, 1975; Grossberg, 1976a; Grossberg and Pilly, 2008; Koch and Ullman, 1985; Wang, 2008), there are many cases where noise suppression is desired but more than one feature or category needs to be stored in STM. How can this be accomplished?

The results above show that any signal function that suppresses noise must be faster-than-linear at small activities. In addition, all signal functions in biology must be bounded. Such a combination is achieved most simply by using a sigmoid signal function, which is a hybrid of faster-than-linear at small activities, approximately linear at intermediate activities, and slower-than-linear at high activities (Figure 9). Then there exists a quenching threshold (QT) such that if initial activity falls below the QT, then its activity is quenched. All initial activities that exceed the QT are contrast-enhanced and stored in STM (Figure 8). The faster-than-linear part of the sigmoid suppresses noise and starts to contrast-enhance the activity pattern. As total activity normalizes, the approximately linear range is reached and tends to store the partially contrast-enhanced pattern. The QT converts the network into a tunable filter. For example, a burst of nonspecific arousal in response to an unexpected event that multiplicatively inhibits all the recurrent inhibitory interneurons will lower the QT and facilitate storage of inputs in STM until the cause of the unexpected event can be determined.
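The quenching behavior can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the article: the sigmoid $$f(w) = w^2(D^2 + w^2)^{-1}$$ with $$D = 0.5$$, the parameters $$A = 1$$ and $$B = 3$$, forward-Euler integration of (21) with inputs off, and a hand-picked initial pattern with two small and two large activities.

```python
def simulate_rcf(x0, f, A=1.0, B=3.0, dt=0.01, steps=10000):
    """Forward-Euler integration of (21) with I_i = J_i = 0."""
    x = list(x0)
    for _ in range(steps):
        signals = [f(v) for v in x]
        total_signal = sum(signals)
        x = [v + dt * (-A * v + (B - v) * s - v * (total_signal - s))
             for v, s in zip(x, signals)]
    return x


def sigmoid(w, D=0.5):
    """Assumed sigmoid: faster-than-linear near 0, bounded by 1 for large w."""
    return w * w / (D * D + w * w)


start = [0.02, 0.05, 0.8, 1.0]
final = simulate_rcf(start, sigmoid)
print(final)
```

With these particular values the two small activities are quenched and the two large ones remain stored. Because both stored activities sit on the decreasing side of the hill function here, they also drift toward each other; a different parameter choice would be needed to show a graded stored pattern.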

### Shunting dynamics in cortical models

Multiple generalizations of RCFs have been studied and used to explain data ranging from visual and speech perception and attentive category learning (see Unifying horizontal, bottom-up, and top-down STM and LTM interactions) to the selection of commands for arm movement control (e.g., Cisek, 2006) and for eye movement control in response to probabilistically defined visual motion signals (e.g., Grossberg and Pilly, 2008). As noted above, Usher and McClelland (2001) modeled probabilistic decision making using an Additive Model. This model does not exhibit the self-normalization properties that arise from RCF shunting dynamics.

A number of authors have applied shunting properties to simulate data about the properties of the cortical circuits that subserve visual perception; e.g., Douglas et al. (1995), Grossberg and Mingolla (1985), Grossberg and Todorovic (1988), Heeger (1992), and McLaughlin et al. (2000). Shunting dynamics also played a key role in the development of the Competitive Learning (CL), Self-Organizing Map (SOM), and Adaptive Resonance Theory (ART) models (Scholarpedia: Adaptive Resonance Theory; Grossberg, 1976a, 1976b, 1980a), but not in the CL and SOM versions of von der Malsburg (1973) and Kohonen (1984). An RCF with spiking neurons has also been shown to replicate key properties of the Grossberg (1973) theorems for rate-based neurons (Palma et al., 2012).

### Decision-making in Competitive Systems: Liapunov methods

The ubiquity of RCFs led to a search for the most general networks that could guarantee stable STM storage. The RCF in (21) is a special case of a competitive dynamical system. In general, a competitive dynamical system is defined by a system of differential equations such that:

$\tag{26} \frac{d}{dt} x_i = f_i(x_1, x_2, ..., x_n)$

where

$\tag{27} \frac{\partial f_i}{\partial x_j} \le 0, i \ne j$

and the $$f_i$$ are chosen to generate bounded trajectories. By (27), increasing the activity $$x_j$$ of a given population can only decrease the growth rates $$\frac{d}{dt}x_i$$ of other populations, $$i \ne j$$, or may not influence them at all. In such systems, cooperative interactions typically occur within a population while competitive interactions can occur between populations, as in the recurrent on-center off-surround network (21). Grossberg (1978d, 1980b) developed a mathematical method to classify the dynamics of competitive dynamical systems by proving that any competitive system can be analyzed by keeping track of the population that is winning the competition through time. This method defines jump sets at the times when the winning population is replaced by (that is, jumps to) another population. Tracking trajectories through jump sets formalizes this bookkeeping of the current winner. Jump sets define a kind of decision hypersurface. If the jumps form a cycle, so that no globally consistent winner exists, then oscillations can occur. In particular, such a jump cycle occurs in the May and Leonard (1975) model of the voting paradox. If the jumps only form decision trees, without cycles, then all trajectories converge to limits. A global Liapunov functional was defined and provides the "energy" that moves system trajectories through these oscillatory or convergent decision hypersurfaces through time. See Grossberg (1988, Section 11) for a review.
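Condition (27) can be spot-checked for the recurrent competitive field (21). The sketch below estimates the off-diagonal partial derivatives by central differences at one non-negative state; the signal function $$f(w) = w^2$$, the parameter values, the test state, and the helper names are assumptions for illustration. A numerical check at one point is not a proof of (27).

```python
def rcf_rhs(x, f, A=1.0, B=3.0):
    """Right-hand side of (21) with I_i = J_i = 0."""
    signals = [f(v) for v in x]
    total_signal = sum(signals)
    return [-A * v + (B - v) * s - v * (total_signal - s)
            for v, s in zip(x, signals)]


def off_diagonal_jacobian(rhs, x, eps=1e-6):
    """Central-difference estimates of d f_i / d x_j for all i != j."""
    n = len(x)
    entries = {}
    for j in range(n):
        up = list(x)
        dn = list(x)
        up[j] += eps
        dn[j] -= eps
        f_up, f_dn = rhs(up), rhs(dn)
        for i in range(n):
            if i != j:
                entries[(i, j)] = (f_up[i] - f_dn[i]) / (2.0 * eps)
    return entries


def square(w):
    return w * w


state = [0.3, 1.0, 2.0]
jacobian = off_diagonal_jacobian(lambda x: rcf_rhs(x, square), state)
for (i, j), value in sorted(jacobian.items()):
    print(f"d f_{i} / d x_{j} = {value:.6f}")
```

For this choice the exact off-diagonal entry is $$-x_i f'(x_j)$$, which is non-positive whenever activities are non-negative and $$f$$ is non-decreasing.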

### Competition, decision, and consensus

This method was applied to study a general problem that has intrigued philosophers and scientists for hundreds of years, and which includes many RCFs as special cases: How do arbitrarily many individuals, populations, or states, each obeying unique and personal laws, ever succeed in harmoniously interacting with each other to form some sort of stable society, or collective mode of behavior? If each individual obeys complex laws, and is ignorant of other individuals except via locally received signals, how is social chaos averted? How can local ignorance and global order, or consensus, be reconciled? Considerable interest has focused on the question: How simple can a system be and still generate "chaotic" behavior (e.g., Alligood et al., 1996)? The above issue considers the converse question: How complicated can a system be and still generate order?

Grossberg (1978c) posed these questions and introduced a class of bRNNs in which this type of global consensus arises, along with mathematical methods to prove it. Consensus arises in these systems because, despite essentially arbitrary irregularities and nonlinearities in local system design, there exists a powerful symmetry in the global rules that bind together the interacting populations. This symmetry is expressed by the existence of a shared, but state-dependent, inter-population competition function, also called an adaptation level. These results suggest that a breakdown of symmetry in competitive RNNs, say due to the existence of asymmetric biases in short-range inter-population interactions, is a basic cause of oscillations and chaos in these systems, as is illustrated by the voting paradox. There appears to exist a trade-off between how global the adaptation level ("communal understanding") is and how freely local signals ("individual differences") can be chosen without destroying global consensus.

### Adaptation level systems: Globally-consistent decision-making

System (21) is a special case of a competitive network with a broad inhibitory surround. A much more general class of systems, the adaptation level systems, also has this property:

$\tag{28} \frac{d}{dt} x_i = a_i(x) \left[ b_i(x_i) - c(x) \right] ,$

where $$x = (x_1, x_2, ..., x_n)$$, $$a_i(x)$$ is a state-dependent amplification function, $$b_i(x_i)$$ is a self-signal function, and $$c(x)$$ is the state-dependent adaptation level against which each $$b_i(x_i)$$ is compared. For the special case of (21),

$\tag{29} a_i(x) = x_i,$

$\tag{30} b_i(x_i) = x_i^{-1} B [f(x_i) + I_i] - A - I_i - J_i,$

and

$\tag{31} c(x) = \sum_{k=1}^n f(x_k).$

The same equations hold with $$A$$, $$B$$, and $$f(x_i)$$ in (21) replaced by $$A_i$$, $$B_i$$, and $$f_i(x_i)$$; that is, different parameters and signal functions for each cell, for arbitrarily many cells.
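The correspondence between (21) and the adaptation level form (28) can be verified numerically in the STM storage case, where the inputs $$I_i$$ and $$J_i$$ are zero and $$b_i(x_i)$$ reduces to $$B f(x_i) x_i^{-1} - A$$. The sketch below compares the two right-hand sides at one positive state; the sigmoid signal function, the parameter values, and the function names are illustrative assumptions.

```python
def f(w):
    """Assumed sigmoid signal function."""
    return w * w / (1.0 + w * w)


def rcf_rhs(x, A, B):
    """Right-hand side of (21) with I_i = J_i = 0."""
    signals = [f(v) for v in x]
    total_signal = sum(signals)
    return [-A * v + (B - v) * s - v * (total_signal - s)
            for v, s in zip(x, signals)]


def adaptation_level_rhs(x, A, B):
    """Right-hand side of (28) using a_i = x_i, b_i = B f(x_i)/x_i - A, c = sum_k f(x_k)."""
    c = sum(f(v) for v in x)            # shared adaptation level (31)
    return [v * ((B * f(v) / v - A) - c) for v in x]


state = [0.3, 1.0, 2.0]                 # strictly positive so b_i is defined
lhs = rcf_rhs(state, A=1.0, B=3.0)
rhs = adaptation_level_rhs(state, A=1.0, B=3.0)
print(lhs)
print(rhs)
```

With inputs off, $$b_i(x_i) = B h(x_i) - A$$, which makes explicit how the self-signal functions extend the hill function.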

Grossberg (1978c) proved that all trajectories in such systems are "stored in STM"; that is, converge to equilibrium values as $$t \rightarrow \infty$$, even in systems which possess infinitely many equilibrium points. The proof shows how each $$x_i(t)$$ gets trapped within a sequence of decision boundaries that get laid down through time at the abscissa values of the peaks in the graphs of the signal functions $$b_i(x_i)$$, starting with the highest peaks and working down. These signal functions generalize the hill function in (22); see (30). Multiple peaks correspond to multiple cooperating subpopulations. These graphs may thus be very complex if each population contains multiple cooperating subpopulations. After all the decision boundaries get laid down, each $$x_i(t)$$ is trapped within a single valley of its $$b_i$$ graph. After this occurs for all the $$x_i$$ variables, the function $$B(x(t)) = \max [ b_i (x_i(t)) : i = 1, 2, ..., n]$$ is a Liapunov function, whose Liapunov property is then used to complete the proof of the theorem.

A special case of the theorem concerns a competitive market with an arbitrary number of competing firms (Grossberg, 1988, Section 12). Each firm can choose one of infinitely many production and savings strategies that is unknown to the other firms. The firms know each other's behaviors only through their effect on a competitive market price, and they produce more goods at any time only if application of their own firm's production and savings strategy will lead to a net profit with respect to that market price. The theorem proves that the price in such a market is stable and that each firm balances its books. The theorem does not, however, determine which firms become rich and which go broke.

### Cohen-Grossberg model, Liapunov function, and theorem

Due to the importance of symmetry in proving global approach to equilibria, as in the adaptation level systems (28), Cohen and Grossberg attempted to prove that all trajectories of systems of the Cohen-Grossberg form:

$\tag{32} \frac{d}{dt} x_i = a_i(x_i) \left[ b_i(x_i) - \sum_{j=1}^n c_{ij} d_j(x_j) \right],$

with symmetric interaction coefficients $$c_{ij} = c_{ji}$$ and weak assumptions on their defining functions, approach equilibria as $$t \rightarrow \infty$$. Systems (32) include both Additive Model and Shunting Model networks (6) and (8) with distance-dependent, and thus symmetric, interaction coefficients, the Brain-State-in-a-Box model (Anderson et al., 1977), the continuous-time version of the McCulloch and Pitts (1943) model, the Boltzmann Machine equation (Ackley et al., 1985), the Ratliff et al. (1963) model, the Volterra-Lotka model (Lotka, 1956), the Gilpin and Ayala (1973) model, the Eigen and Schuster (1978) model, the Cohen and Grossberg (1986, 1997) Masking Field model, and so on.

Cohen and Grossberg first attempted to prove global equilibrium by showing that all Cohen-Grossberg systems generate jump trees, and thus no jump cycles, which would immediately prove the desired result. This hypothesis still stands as an unproved conjecture. While doing this, inspired by the use of Liapunov methods for more general competitive systems, Cohen and Grossberg (1983; see also Grossberg (1982)) discovered the Cohen-Grossberg Liapunov function that they used to prove that global equilibria exist:

$\tag{33} V = - \sum_{i=1}^n \int^{x_i} b_i (\xi_i) d_i^{\prime} (\xi_i) d \xi_i + \frac{1}{2} \sum_{j, k=1}^n c_{jk} d_j(x_j) d_k (x_k).$

Equation (33) defines a Liapunov function because differentiating $$V$$ along trajectories implies that:

$\tag{34} \frac{d}{dt} V = - \sum_{i=1}^n a_i d_i^{\prime} \left[ b_i - \sum_{j=1}^n c_{ij} d_j \right]^2 .$

If $$a_i d_i^{\prime} \ge 0$$, then (34) implies that $$\frac{d}{dt} V \le 0$$ along trajectories. Once this basic property of a Liapunov function is in place, it is a technical matter to rigorously prove that every trajectory approaches one of a possibly large, or infinite, number of equilibrium points.
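The monotone decrease of $$V$$ can be observed on a small instance of the Cohen-Grossberg form. The sketch below assumes $$a_i = 1$$, $$b_i(x_i) = -x_i$$, $$d_j(x_j) = \tanh(x_j)$$, and a hand-picked symmetric coefficient matrix; for this choice the integral in (33) has the closed form $$x_i \tanh x_i - \ln \cosh x_i$$. All names and numbers are illustrative, and forward Euler with a small step only approximates the continuous trajectory.

```python
import math

# Symmetric interaction coefficients c_ij = c_ji (assumed values).
C = [[0.0, 0.8, -0.4],
     [0.8, 0.0, 0.6],
     [-0.4, 0.6, 0.0]]


def rhs(x):
    """dx_i/dt = a_i [b_i(x_i) - sum_j c_ij d_j(x_j)] with a_i = 1, b_i = -x_i, d_j = tanh."""
    d = [math.tanh(v) for v in x]
    return [-v - sum(c * dj for c, dj in zip(row, d)) for v, row in zip(x, C)]


def liapunov(x):
    """Function (33) for this instance, with the integral taken from 0 to x_i."""
    d = [math.tanh(v) for v in x]
    integral = sum(v * math.tanh(v) - math.log(math.cosh(v)) for v in x)
    coupling = 0.5 * sum(C[j][k] * d[j] * d[k]
                         for j in range(len(x)) for k in range(len(x)))
    return integral + coupling


def run(x0, dt=0.001, steps=20000):
    """Integrate by forward Euler and record V after every step."""
    x = list(x0)
    values = [liapunov(x)]
    for _ in range(steps):
        x = [v + dt * g for v, g in zip(x, rhs(x))]
        values.append(liapunov(x))
    return x, values


final_state, values = run([1.0, -0.5, 0.3])
print(values[0], values[-1])
```

Here $$a_i d_i^{\prime} = 1 - \tanh^2(x_i) > 0$$, so the sign condition on (34) holds at every state.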

As noted above, the Liapunov function (33) proposed in Cohen and Grossberg (1983) includes both the Additive Model and Shunting Model, among others. A year later, Hopfield (1984) published the special case of the Additive Model and Liapunov function and asserted, without proof, that trajectories approach equilibria. Based on this 1984 article, the Additive Model has been erroneously called the Hopfield network by a number of investigators, despite the fact that it was published in multiple articles since the 1960s and its Liapunov function was also published in 1982-83. A historically more correct name, if indeed names must be given, is the Cohen-Grossberg-Hopfield model, which is the name already used in articles such as Burton (1993), Burwick (2006), Guo et al. (2004), Hoppensteadt and Izhikevich (2001), Menon et al. (1996), and Wu and Zou (1995).

### Symmetry does not imply convergence: Synchronized oscillations

Cohen (1988) showed that symmetric coefficients are not sufficient to ensure global convergence by constructing distance-dependent (hence symmetric) on-center off-surround networks that support persistent oscillations. These networks can send excitatory feedback signals to populations other than themselves. They are a special case of (8) with fast-acting inhibitory interneurons. It has long been known that shunting networks with slow inhibitory interneurons can persistently oscillate (e.g., Ellias and Grossberg, 1975). This observation led to the prediction that neural networks can undergo synchronized oscillations, first called order-preserving limit cycles (Grossberg, 1976b), during attentive resonant states. The early articles concerning synchronized oscillations during attentive brain dynamics (e.g., Gray et al., 1989; Grossberg and Somers, 1991; Grossberg and Grunewald, 1997; Eckhorn et al., 1990; Somers and Kopell, 1993) have been followed by hundreds more. Persistent oscillations can also occur in RCFs in which, instead of slow inhibitory interneurons, habituative gates (10) multiply the recurrent signals in (9) (e.g., Carpenter and Grossberg, 1983).

### Unifying horizontal, bottom-up, and top-down STM and LTM interactions

Most of the RNNs considered above characterize their interaction terms in abstract mathematical terms; e.g., symmetry of connection strengths. In contrast, the bRNNs in the brain are embodied in architectures with highly differentiated anatomical circuits. For example, models of how the cerebral cortex works are defined by bRNNs that integrate bottom-up, horizontal, and top-down interactions in laminar circuits with identified cells. These models illustrate the computational paradigm of Laminar Computing (Grossberg, 1999, 2003) which has begun to classify how different behavioral functions, such as vision, cognition, speech, and behavioral choice, may be realized by architectures that are all variations on a shared laminar design. These architectures include the:

LAMINART family of models of how the visual cortex, notably cortical areas V1, V2, and V4, interact together to see (Figure 11; e.g., Cao and Grossberg, 2005, 2012; Grossberg and Raizada, 2000; Grossberg and Versace, 2008; Raizada and Grossberg, 2001),

Figure 11: Schematic of the laminar circuitry within the LAMINART model for bottom-up adaptive filtering, horizontal perceptual grouping, and top-down attention in lateral geniculate nucleus (LGN) and cortical areas V1 and V2. See text and Raizada and Grossberg (2001) for details.

the LIST PARSE model of how prefrontal cortical working memory and list chunk learning in the ventrolateral and dorsolateral prefrontal cortices interact with adaptively-timed volitional processes in the basal ganglia to generate variable-speed motor trajectory commands in the motor cortex and cerebellum (Figure 12; Grossberg and Pearson, 2008),

Figure 12: Schematic of the laminar circuits for cognitive working memory and learned list chunking, proposed to be in the ventrolateral prefrontal cortex, that is modeled by the LIST PARSE model. These circuits interact with circuits for motor working memory, variable-rate volitional control and timing, and trajectory generation. See text and Grossberg and Pearson (2008) for details.

the cARTWORD model of contextual interactions during speech perception by the auditory cortex, including backwards effects in time (Figure 13; Grossberg and Kazerounian, 2011),

Figure 13: (a) The ARTWORD model processing stages. (b) The conscious ARTWORD, or cARTWORD, laminar circuits for processing sequences of speech sounds and generating conscious representations of them. See text and Grossberg and Kazerounian (2011) for details.

the TELOS model of learning and choice of saccadic eye movement commands by interactions between prefrontal cortex (PFC), frontal eye fields (FEF), posterior parietal cortex (PPC), anterior and posterior inferotemporal cortex (ITa, ITp), nigro-thalamic and nigro-collicular circuits of the basal ganglia (BG), superior colliculus (SC), and related brain regions (Figure 14; Brown et al., 1999, 2004),

Figure 14: Schematic of the laminar circuits within the TELOS model of how the frontal eye fields (FEF) interact with the basal ganglia and multiple other brain regions to learn and perform different saccadic eye movement tasks. See text and Brown et al. (2004) for details.

and the lisTELOS model of learning and choice of sequences of saccadic eye movements, wherein an Item-Order-Rank spatial working memory in the prefrontal cortex (PFC) stores sequences of saccadic eye movement commands that can include repeats, and which are selected in the supplementary eye fields (SEF) as these regions interact with posterior parietal cortex (PPC), frontal eye fields (FEF), basal ganglia (BG), and superior colliculus (SC) to carry out operations such as loading the sequences in working memory, opening gates to enable the various processing stages to selectively generate their outputs, and releasing saccadic commands (Figure 15; Silver et al., 2011).

Figure 15: Schematic of the lisTELOS model of how an Item-Order-Rank spatial working memory in the prefrontal cortex (PFC) stores sequences of saccadic eye movement commands that are loaded into working memory and performed by interactions of multiple brain regions. See text and Silver et al. (2011) for details.

There are also bRNNs of cognitive-emotional interactions during reinforcement learning and motivated attention, such as the MOTIVATOR model, which can focus on valued goals and block learning of irrelevant events (Grossberg and Levine, 1987; Kamin blocking) by interactions of the object categories in the inferotemporal (IT) and rhinal (RHIN) cortices, the object-value categories in the lateral and medial orbitofrontal cortices (ORBl, ORBm), the value categories in the amygdala (AMYGD) and lateral hypothalamus (LH), and the reward expectation filter in several parts of the basal ganglia (Figure 16; Dranias et al., 2008),

Figure 16: Schematic of the MOTIVATOR model for how reinforcement learning and motivated attention enable animals to choose environmental options that can satisfy current needs. See text and Dranias et al. (2008) for details.

the ARTSCAN model of view-invariant object learning and visual search during unconstrained saccadic eye movements by interactions between visual cortices V1, V2, V3A, and V4, prefrontal cortex (PFC), posterior parietal cortex (PPC) and lateral intraparietal area (LIP), posterior and anterior inferotemporal cortex (pIT, aIT), and superior colliculus (SC) (Figure 17; Fazl et al., 2009; Grossberg, 2009),

Figure 17: Schematic of the ARTSCAN model for how view-invariant object categories can be learned and recognized while the eyes freely scan a scene. See text and Fazl et al. (2009) for details.

the ARTSCENE Search model of object and spatial contextual cueing of visual search for desired objects in a scene by interactions between visual cortices V1, V2, and V4, ventral and dorsolateral prefrontal cortex (VPFC, DLPFC), perirhinal cortex (PRC), parahippocampal cortex (PHC), anterior and posterior inferotemporal cortex (ITa, ITp), posterior parietal cortex (PPC), and superior colliculus (SC) (Figure 18; Huang and Grossberg, 2010),

Figure 18: Schematic of the ARTSCENE model for how object and spatial contextual cues can be learned and used to drive an efficient search for a desired target in a scene. See text and Huang and Grossberg (2010) for details.

and the GridPlaceMap model of entorhinal grid cell learning of hexagonal receptive fields, and hippocampal place cell learning of (primarily) unimodal receptive fields, in a hierarchy of Self-Organizing Maps that obey the same laws, driven by path integration signals that are generated as the model navigates realistic trajectories (Figure 19; Pilly and Grossberg, 2012).

Figure 19: Schematic of the GridPlaceMap model for how a hierarchy of Self-Organizing Maps can self-organize entorhinal grid cells and hippocampal place cells during spatial navigation. See text and Pilly and Grossberg (2012) for details.

## Interactions of STM and LTM during Neuronal Learning

### Unbiased spatial pattern learning by Generalized Additive RNNs

Several of the architectures above include interactions between STM and LTM that allow them to learn from their environments. The fact that these architectures "work" is based on a foundation of mathematical theorems which demonstrate how STM and LTM laws can be joined to design the most general networks capable of learning spatial patterns in an unbiased way, even when the cells in the network sample each other's activities through recurrent interactions. These theorems demonstrate how unbiased learning can be achieved in networks with an arbitrarily large number of neurons, or neuron populations, that interact in suitable anatomies under general neurophysiological constraints. Once such spatial pattern learning is assured, then the results can be extended to demonstrate learning of any number of arbitrarily complicated space-time patterns, and to build from there in a series of steps towards the type of complexity that is found in primate brains. Some of these steps are reviewed below.

The theorem in this section shows that two types of anatomy and variants thereof are particularly well suited to spatial pattern learning: Let any finite number of cells $$v_i, i \in I$$, send axons to any finite number of cells $$v_j, j \in J$$. The cases $$I = J$$ (complete recurrence) and $$I \cap J = \varnothing$$ (complete non-recurrence) permit perfect pattern learning even if the strengths of axon connections from $$I$$ to $$J$$ are arbitrary positive numbers. In these anatomies, axon diameters can be chosen with complete freedom, and one can grow axons between cells separated by arbitrary distances without concern about their diameters. Grossberg (1969a, 1971b) summarizes how to extend these results to more general anatomies.

Only certain types of signal transmission between cells can compensate for differences in connection strengths, and thereby yield unbiased pattern learning (Grossberg, 1974, Section VI). The simplest possibility is to let action potentials propagate along the circumference of a cylindrical axon to the axon's synaptic knob (Hodgkin and Huxley, 1952). Let the signal disperse throughout the cross-sectional area of the synaptic knob as ionic fluxes, and let local chemical transmitter production/release in the knob be proportional to the local signal density. Finally, let the effect of the signal on the postsynaptic cell be proportional to the product of local signal density, available transmitter density, and the cross-sectional area of the knob (Katz, 1969). By contrast, signals that propagate throughout the cross-sectional area of the axon, such as electrotonic signals, do not produce unbiased learning given arbitrary axon connection strengths.

Another constraint is that the time lag for signals to propagate from any cell to all the cells that it activates should be (almost) the same. How can different axons from a given source cell have the same time lag if they have different lengths? This property is achieved if signal velocity is proportional to axon length. But signal velocity is a local property of signal transmission, whereas axon length is a global property of the anatomy. How can this global property be controlled by a local one? This is possible if axon length is proportional to axon diameter, and signal velocity is proportional to axon diameter. The latter is often the case during spike transmission down an axon (Ruch et al., 1961, p. 73) and the former is qualitatively true: longer axons are usually thicker; see Cohen and Grossberg (1986) for developmental laws whereby this can happen. Systems with self-similar connections of this kind can be converted, through a coordinate change, into systems whose connections depend only on the source, or sampling, cells. The following Generalized Additive system is of this type. Its activities, or STM traces, $$x_i$$ and adaptive weights, or LTM traces, $$z_{ij}$$, obey:

$\tag{35} \frac{d}{dt} x_i = Ax_i + \sum_{k \in J} B_k z_{ki} + \theta_i C(t)$

and

$\tag{36} \frac{d}{dt} z_{ji} = D_j z_{ji} + E_j x_i,$

where the number of sampled cells ($$i \in I$$) and sampling cells ($$j \in J$$) can be arbitrarily large, and $$A$$, $$B_j$$, $$D_j$$, and $$E_j$$ can be continuous functionals, possibly highly nonlinear, of the entire past of the system. The signal functional $$B_j$$ and the sampling functional $$E_j$$ are non-negative, since they represent spike-based signaling terms. The decay functional $$D_j$$ also includes a wide range of possibilities, including passive decay of associative learning, and gated steepest descent learning (Figure 1 and Figure 3). The terms $$\theta_i$$ represent an arbitrary spatial pattern ($$\sum_{i \in I} \theta_i = 1$$), and different spatial patterns can be presented ($$C(t) > 0$$) as different combinations of sampling cells are active. Of particular note is the stimulus sampling operation, which means that learning only occurs if the sampling functional $$E_j > 0$$. If both the decay and learning functionals equal zero ($$D_j = E_j = 0$$), then neither learning nor forgetting occurs. The stimulus sampling property enables arbitrary subsets of sampling cells to learn different spatial patterns through time; see Serial learning.

The Unbiased Spatial Pattern Learning Theorem proves how unbiased learning may occur in response to sampling signals, or conditioned stimuli (CS), that are correlated with particular spatial patterns, or unconditioned stimuli (US). This simple form of associative learning is also called classical, or Pavlovian, conditioning. The theorems prove that, if the system is bounded, and each CS and US are practiced together sufficiently often, then perfect pattern learning occurs (Grossberg, 1969a, 1971b). That is, the relative activities $$X_i = x_i \left( \sum_k x_k \right)^{-1}$$ and $$Z_{ji} = z_{ji} \left( \sum_k z_{jk} \right)^{-1}$$ approach the training pattern $$\theta_i$$ without bias as time goes on, no matter how many sampling cells $$j \in J$$ are simultaneously active, each with its own signaling, sampling, and decay functionals, even in a fully recurrent anatomy.
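
The convergence of the relative activities and relative weights can be illustrated numerically in a very restricted special case of (35) and (36). The sketch below assumes constant functionals ($$A = -a$$, $$B_k = b$$, $$D_j = -d$$, $$E_j = e$$), a constant input $$C(t) = 1$$, and forward-Euler integration; the function name and all parameter values are illustrative choices, not part of the theorem, which covers far more general functionals.

```python
import numpy as np

def simulate_pattern_learning(theta, n_sampling=2, a=1.0, b=0.2, d=0.5, e=0.5,
                              t_end=60.0, dt=0.01, seed=0):
    """Euler-integrate a constant-coefficient special case of (35)-(36).

    Assumed (not from the source): A = -a, B_k = b, D_j = -d, E_j = e, C(t) = 1.
    Returns the relative activities X_i and relative weights Z_ji.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    x = np.zeros(n)                                   # STM traces x_i
    z = rng.uniform(0.1, 1.0, size=(n_sampling, n))   # LTM traces z[j, i], random start
    for _ in range(int(round(t_end / dt))):
        dx = -a * x + b * z.sum(axis=0) + theta * 1.0  # equation (35) with C(t) = 1
        dz = -d * z + e * x[None, :]                   # equation (36)
        x = x + dt * dx
        z = z + dt * dz
    X = x / x.sum()                                    # relative activities
    Z = z / z.sum(axis=1, keepdims=True)               # relative weights per sampling cell
    return X, Z

theta = np.array([0.5, 0.3, 0.2])
X, Z = simulate_pattern_learning(theta)
print(np.round(X, 4))
print(np.round(Z, 4))
```

In this linear special case the parameters were chosen so that $$ad$$ exceeds $$b e$$ times the number of sampling cells, which keeps the system bounded; each row of `Z` then approaches `theta` even though the initial weights are random.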

If the delays from a given cell to all of its target cells are not identical, properly designed networks can rapidly resynchronize the activities of the target cells using recurrent interactions (Grossberg and Somers, 1991; Grossberg and Grunewald, 1997; Somers and Kopell, 1993, 1995), even in laminar cortical circuits with realistic synaptic and axonal delays (Yazdanbakhsh and Grossberg, 2004).

### Outstar Learning Theorem

The simplest case of the Generalized Additive model in (35) and (36) occurs for the Outstar Learning Theorem (Grossberg, 1968b), in which the network has a single sampling cell (population) in $$J$$ and a non-recurrent anatomy (Figure 1 and Figure 2). Given this theorem, the stimulus sampling operation suggests how a series of Outstars can learn an arbitrary spatiotemporal pattern as a series of spatial patterns, ordered in time; see Avalanches.

### Sparse Stable Category Learning Theorem

Another version of spatial pattern learning occurs using the dual network to the Outstar, namely the Instar (Figure 3 and Figure 5). When multiple Instars compete with each other via an RCF, they form a Competitive Learning or Self-Organizing Map network (Figure 4; Grossberg, 1976a; Kohonen, 1984; von der Malsburg, 1973). Grossberg (1976a) proved that, if there are not too many input spatial patterns presented sequentially to the network, relative to the number of available category learning cells, then category learning occurs with adaptive weights that track the input statistics, self-normalize, and lead to stable LTM, and the network has Bayesian decision properties. However, in response to a sequence of sufficiently dense non-stationary input patterns, the system can experience catastrophic forgetting in which previously learned categories are recoded by intervening input patterns (Carpenter and Grossberg, 1987; Grossberg, 1976a). Adaptive Resonance Theory, or ART, was introduced in Grossberg (1976b) to propose how top-down learned expectations and attentional focusing could dynamically stabilize learning in a Competitive Learning or Self-Organizing Map model in response to an arbitrary series of input patterns.

Kosko (1987, 1988) adapted the Cohen-Grossberg model and Liapunov function (Cohen and Grossberg, 1983), which proved global convergence of STM, to define a system that combines STM and LTM and which also globally converges to a limit. The main trick was to observe how the symmetric connections in the Cohen-Grossberg equation (32) could be used to define symmetric LTM traces interacting reciprocally between two processing levels. An Additive Model BAM system is, accordingly, defined by:

$\tag{37} \frac{d}{dt} x_i = -x_i + \sum_k f(y_k) z_{ki} + I_i$

and

$\tag{38} \frac{d}{dt} y_j = -y_j + \sum_m f(x_m) z_{mj} + J_j.$

A Shunting Model BAM can also be analogously defined. One type of learning law to which BAM methods apply is the passive decay associative law that was introduced in Grossberg (1967, 1968b, 1968c); see Figure 1 and Figure 3:

$\tag{39} \frac{d}{dt} z_{ij} = -z_{ij} + f(x_i) f(y_j).$

Kosko calls the equation in (39) the signal Hebb law, although it does not obey the property of monotonely increasing learned weights that Hebb (1949) ascribed to his law. Kosko (1988) wrote that: "When the BAM neurons are activated, the network quickly evolves to a stable state of two-pattern reverberation, or resonance". Indeed, another inspiration for BAM was Adaptive Resonance Theory, or ART.
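
The convergence property of an Additive Model BAM with passive decay learning can be illustrated with a small numerical sketch. It assumes a logistic signal function, a single weight matrix `z[i, j]` used in both directions (following the symmetric LTM traces described above), a learning term that pairs the signal of $$x_i$$ with the signal of $$y_j$$, and forward-Euler integration. The function names, network size, inputs, and step size are hypothetical choices made only for illustration.

```python
import numpy as np

def signal(u):
    """Logistic signal function f (an assumed choice; any bounded sigmoid would do)."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate_bam(I, J, t_end=200.0, dt=0.02, seed=0):
    """Euler-integrate an Additive Model BAM with passive decay learning."""
    rng = np.random.default_rng(seed)
    I = np.asarray(I, dtype=float)
    J = np.asarray(J, dtype=float)
    x = np.zeros(I.size)                                # first-level STM traces
    y = np.zeros(J.size)                                # second-level STM traces
    z = rng.uniform(0.0, 1.0, size=(I.size, J.size))    # shared LTM traces z[i, j]

    def derivatives(x, y, z):
        dx = -x + z @ signal(y) + I                 # equation (37)
        dy = -y + z.T @ signal(x) + J               # equation (38)
        dz = -z + np.outer(signal(x), signal(y))    # passive decay associative law
        return dx, dy, dz

    for _ in range(int(round(t_end / dt))):
        dx, dy, dz = derivatives(x, y, z)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz

    # Largest remaining rate of change: near zero once STM and LTM have settled.
    residual = max(np.abs(d).max() for d in derivatives(x, y, z))
    return x, y, z, residual

x, y, z, residual = simulate_bam(I=[0.5, 0.0], J=[0.0, 0.5])
print(residual)
```

At the limit, each weight equals the product of the two signals that it links, so the weights can decrease as well as increase during learning.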

Adaptive Resonance Theory, or ART, is currently the most advanced cognitive and bRNN theory of how the brain autonomously learns to categorize, recognize, and predict objects and events in a changing world. To a remarkable degree, humans can rapidly learn new facts without being forced to just as rapidly forget what they already know. Grossberg (1980) called this problem the stability-plasticity dilemma. It is also sometimes called the problem of catastrophic forgetting (Carpenter, 2001; French, 1999; Page, 2000). ART proposes a solution of this problem by demonstrating how top-down expectations (Figure 7) can learn to focus attention on salient combinations of cues ("critical feature patterns"), and characterizing how attention may operate via a form of self-normalizing "biased competition" (Desimone, 1998). In particular, when a good enough match between a bottom-up input pattern and a top-down expectation occurs, a synchronous resonant state emerges that embodies an attentional focus and is capable of driving fast learning of bottom-up recognition categories and top-down expectations; hence the name adaptive resonance. For a review of ART, see (Scholarpedia: Adaptive Resonance Theory). For a more comprehensive review, see Grossberg (2012; [7]).

## Working Memory: Processing and STM of Temporal Sequences

Intelligent behavior depends upon the capacity to think about, plan, execute, and evaluate sequences of events. Whether we learn to understand and speak a language, solve a mathematics problem, cook an elaborate meal, or merely dial a phone number, multiple events in a specific temporal order must somehow be stored temporarily in working memory. A working memory (WM) is thus a network that is capable of temporarily storing a sequence of events in STM (e.g., Baddeley, 1986; Baddeley and Hitch, 1974; Bradski et al., 1994; Cooper and Shallice, 2000; see Working memory). As event sequences are temporarily stored, they are grouped, or chunked, through learning into unitized plans, or list chunks, and can later be performed at variable rates under volitional control, either via imitation or from a previously learned plan. How these processes work remains one of the most important problems confronting cognitive scientists and neuroscientists.

Figure 20: An Item-and-Order working memory circuit whose outputs are triggered by a nonspecific rehearsal wave, and wherein perseverative performance of the last item to be performed is prevented by self-inhibitory feedback (inhibition of return). See text and Grossberg and Pearson (2008) for details.

### Relative activity codes temporal order in working memory

Grossberg (1978a, 1978b) introduced an Item-and-Order WM to explain how, as successive items in a list are presented through time, they may be stored in WM as a temporally evolving spatial pattern of activity across working memory cells (Figure 20). The "relative activity" of different cell populations codes the temporal order in which the items will be rehearsed. Items with the largest activities are rehearsed earliest. Hence, the name Item-and-Order working memory for this class of models. This representation was a radical break from the popular model of Atkinson and Shiffrin (1971), which proposed binary activations of a series of linearly ordered "slots" wherein each item moves to the next slot as additional items are stored.

### Working memory design enables stable learning of list chunks

How is an Item-and-Order WM in the brain designed? In particular, is a WM a bRNN and, if it is, how could evolution discover a bRNN to embody a function as seemingly sophisticated as a WM? Grossberg (1978a, 1978b) noted that WMs would be useless unless the item sequences that they temporarily stored could be unitized through learning into list categories, or chunks, for recognition and recall of familiar lists, much as words and motor skills are recognized and recalled. He predicted that all WMs are designed to solve the temporal chunking problem; namely, WMs are designed to be able to learn a novel list chunk, under unsupervised learning conditions, from a sequence of stored items some of whose subsequences may have already learned to code their own list chunks, without forcing the previously learned list chunks to be forgotten. For example, a list chunk for the novel word MYSELF can be learned even when there may already be strong learned representations for the familiar words MY, SELF, and/or ELF. Why does storage of MYSELF in WM not distort the storage of its subwords MY, SELF, and ELF in a way that leads to catastrophic forgetting of their already learned list chunks?

### LTM Invariance and Normalization Rule are realized by specialized RCFs

Grossberg (1978a, 1978b) predicted that Item-and-Order models embody two constraints to ensure that stable learning and memory of list chunks can occur: the LTM Invariance Principle and the Normalization Rule. The LTM Invariance Principle is the main postulate to ensure that a new superset list chunk, such as MYSELF, can be learned without forcing catastrophic forgetting of familiar subset list chunks, such as MY, SELF, and ELF. As a result, subset list chunks can continue to activate their familiar list chunks until they are inhibited by contextually more predictive superset list chunks; e.g., until MY is supplanted by MYSELF through time. Mathematically, this postulate implies the following property: activities of items in working memory preserve their relative activations, or ratios, throughout the time that they are stored in working memory.

The Normalization Rule assumes that the total activity of the working memory network has a maximum that is (approximately) independent of the total number of actively stored items. In other words, working memory has a limited capacity and activity is redistributed, not just added, when new items are stored.

It was proved in Grossberg (1978a, 1978b) that these simple rules generate working memories that can support stable learning and long-term memory of list chunks. This analysis also showed that Item-and-Order WMs could be embodied by specialized recurrent on-center off-surround shunting networks, or RCFs, which are ubiquitous in the brain, thereby clarifying how WMs could arise through evolution. The recurrent connections in an RCF help to store inputs in short-term memory after the inputs shut off. An RCF obeys the LTM Invariance Principle due to the way that shunting, or multiplicative, interactions compute ratios of cell activities across the network; e.g., equation (14). The Normalization Rule follows from the tendency of RCFs to normalize total network activity; e.g., equation (24). As explained below, an RCF behaves like an Item-and-Order working memory model when it is equipped with a volitionally-activated nonspecific rehearsal wave to initiate read-out of stored activity patterns, and output-contingent self-inhibitory feedback interneurons to prevent perseverative performance of the most active stored item (Figure 20).

The prediction that all WMs are specialized RCFs that obey the LTM Invariance Principle and Normalization Rule implies the additional prediction that all verbal, spatial, and motor WMs have a similar network design. For example, the LIST PARSE model predicts how such a WM can be realized in the deeper layers of ventrolateral prefrontal cortex and how list chunks of the stored sequences can be learned in the superficial layers of the same cortex (see Figure 12, Cognitive Working Memory). Item-and-Order WMs have also been generalized to Item-Order-Rank working memories in which rank, or positional, information is also included, thereby permitting the temporary storage of repeated items in a list, as in the list ABACBDE (Figure 15; Grossberg and Pearson, 2008; Silver et al., 2011).

### Primacy, recency, and bowed activation gradients

Free recall data were one source of inspiration for the discovery of Item-and-Order WMs. During free recall, a human tries to recall a once-heard list in any order. Typically, the beginning and end of the list are recalled earlier and with higher probability. If the brain is adaptively designed, then why are lists not always recalled in the correct temporal order?

It was mathematically proved that, under constant attentional conditions, the pattern of activation that evolves in an Item-and-Order working memory is one of the following types (Bradski et al., 1992, 1994; Grossberg, 1978a, 1978b):

1. Primacy gradient. Here, the first item to be stored has the largest activity and the last item to be stored has the smallest activity. A primacy gradient allows the stored items to be rehearsed in their presented order.
2. Recency gradient. Here the first item is stored with the smallest activity and the last item with the largest activity. Rehearsal of a recency gradient recalls the most recent item first and the first item last.
3. Bowed gradient. Here, the first and last items to be stored have larger activities, and thus are earlier rehearsed, than items in the middle of the list.
4. From primacy to bowed gradient. It was also proved that, as more and more items are stored, a primacy gradient becomes a bowed pattern whose recency part becomes increasingly dominant. This last result predicted that there is a fundamental reason for the ubiquitous occurrence of bowed gradients in many types of serially ordered behavior: Stable learning and memory of list chunks imposes an upper bound on the number of items that can be recalled from WM in the correct temporal order.
5. Rehearsal. Item-and-Order models predict that a volitionally-controlled, nonspecific rehearsal wave activates all the working memory cells to initiate recall, with the most active cell generating its output first (Figure 20). This happens whether the WM is activated bottom-up by sequences of individual items, or top-down by a list chunk. Rehearsing the most active item first immediately raises the problem of perseveration; namely, what keeps the most active item from being rehearsed over and over again, thereby preventing less active items from being rehearsed? Perseveration can, in fact, occur in some clinical syndromes.
6. Inhibition of return. Item-and-Order models predict that the perseveration problem is solved by output-contingent self-inhibition. In other words, each rehearsed cell’s output activates a recurrent inhibitory interneuron that shuts off the cell, and thereby enables less active items to be rehearsed in the order of their relative activation (Figure 20). Self-inhibition of each rehearsed item causes inhibition of return to that item.
7. Recurrent normalization. The activity level of a stored item influences the reaction time with which it is rehearsed, with the largest activity rehearsed first. As items with the largest activities are rehearsed, the remaining items have progressively smaller activities. However, as already rehearsed items self-inhibit, the remaining stored items are no longer inhibited by them via the recurrent off-surround. As a result, the stored pattern of cell activations can begin to renormalize itself through time as items are rehearsed, and can thereby, at least partially, compensate for the slow-down in reaction time that would otherwise have occurred due to small activities. This normalization property follows from the fact that Item-and-Order networks are RCFs whose cells obey the Shunting Model.
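
The rehearsal cycle described in items 5 to 7 can be caricatured algorithmically. The sketch below is only an abstraction under stated assumptions: it replaces the shunting dynamics with a discrete loop, treats the rehearsal wave as a repeated "read out the most active item" step, models self-inhibition by deleting the rehearsed item, and models recurrent normalization by rescaling the remaining activities to a fixed total. The function name and the example activities are hypothetical.

```python
def rehearse(stored):
    """Read out items from most to least active, with inhibition of return.

    `stored` maps each item to its stored activity (assumed positive).
    Returns the rehearsal order and the activity of each item at read-out.
    """
    remaining = dict(stored)
    order, readout = [], []
    while remaining:
        # Rehearsal wave: the most active cell generates its output first.
        winner = max(remaining, key=remaining.get)
        order.append(winner)
        readout.append(remaining[winner])
        # Output-contingent self-inhibition shuts the rehearsed cell off.
        del remaining[winner]
        # Renormalization: the remaining activities are rescaled to total 1.
        total = sum(remaining.values())
        if total > 0:
            remaining = {item: act / total for item, act in remaining.items()}
    return order, readout

primacy = {"A": 0.5, "B": 0.3, "C": 0.2}
bowed = {"A": 0.40, "B": 0.15, "C": 0.10, "D": 0.35}
print(rehearse(primacy)[0])  # presented order
print(rehearse(bowed)[0])    # list ends are rehearsed before the middle
```

In this caricature a primacy gradient is recalled in its presented order, a bowed gradient recalls the ends of the list before the middle, and later items are read out at larger activities than they had when first stored.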

These hypotheses have found their way into many variants of the Item-and-Order WM design (e.g., Boardman and Bullock, 1991; Houghton and Hartley, 1996; Page and Norris, 1998; Rhodes et al., 2004). Houghton (1990) called Item-and-Order models Competitive Queuing models.

### Experimental support

Item-and-Order WM properties have received support from many subsequent psychological and neurobiological experiments. Farrell and Lewandowsky (2004) wrote: “Several competing theories of short-term memory can explain serial recall performance at a quantitative level. However, most theories to date have not been applied to the accompanying pattern of response latencies…these data rule out three of the four representational mechanisms. The data support the notion that serial order is represented by a primacy gradient that is accompanied by suppression of recalled items”.

Averbeck et al. (2002, 2003a, 2003b) reported the first neurophysiological evidence in monkeys that a primacy gradient, together with inhibition of the most active cell after its command is read out, governs the performance of sequential copying movements.

Jones et al. (1995) reported similar performance characteristics to those of verbal WM for a spatial serial recall task, in which visual locations were remembered. Agam et al. (2005) reported psychophysical evidence of Item-and-Order WM properties in humans as they perform sequential copying movements. Silver et al. (2011) used Item-and-Order WMs to simulate neurophysiological data about spatial WMs. The fact that verbal, spatial, and motor sequences, in both humans and monkeys, seem to obey the same WM laws provides accumulating evidence for the Grossberg (1978a, 1978b) prediction that all working memories have a similar design to enable stable list chunks to be learned.

Agam et al. (2007) reported data consistent with the formation of list chunks as movement sequences are practiced, thereby supporting the Grossberg (1978a) prediction that WM networks are designed to interact closely with list chunking networks.

### Stable chunk learning implies the Magical Numbers Four and Seven

The Grossberg (1978a, 1978b) prediction that primacy gradients become bows for longer lists provides a conceptually satisfying explanation of the well-known immediate memory span of 7 +/- 2 items (Miller, 1956). Because relative activity translates into both relative order and probability of recall (bigger activities can provide more reliable recall in a noisy brain), such a model helps to explain why items from the beginning and end of a list in free recall may be recalled earlier and with larger probability (Murdock, 1962). Transposition errors also have a natural explanation in such a working memory, since stored items with similar activity levels will transpose their relative activities, and thus their rehearsal order, more easily than items with very different activity levels if noise perturbs these levels through time. Grossberg (1978a, 1978b) also proved that, if attention varies across items, then multi-modal bows, or Von Restorff (1933) effects, also called isolation effects (Hunt and Lamb, 2001), can be obtained by altering the relative sizes of stored activities. Von Restorff effects can also be caused by rate and feature similarity differences across items, factors that also influence bowing in the present modeling framework. Associative and competitive mechanisms can also cause Von Restorff effects during serial verbal learning (see Serial learning and Grossberg, 1969, 1974).

Grossberg (1978a) distinguished between the classical immediate memory span (IMS) of Miller (1956) and the then new concept of the transient memory span (TMS). The TMS was predicted to be the result of purely STM WM storage and recall, without a significant top-down long-term memory (LTM) component. The TMS is, accordingly, the longest list length for which a WM can store a primacy gradient. The IMS was predicted to be the result of combining bottom-up inputs and top-down read-out of list chunk learned expectations on the relative activities stored in WM, and thus the temporal order that is recalled. Grossberg (1978a) proved that the TMS is smaller than the IMS. Estimating the IMS at the famous Magical Number Seven, it was predicted that the TMS would be around four. Cowan (2001) has reviewed experimental data that support the existence of a four plus-or-minus one WM capacity limit when LTM and grouping influences are minimized, consistent with this prediction. Indeed, long-term memory (LTM) does bias working memory toward more primacy dominance (e.g., Knoedler, 1999), and its influence can be difficult to limit. Also see Visual short term memory.

### Equations for some Item-and-Order RNNs

An Item-and-Order RCF with mathematically provable primacy, recency, and bowed gradient properties is defined by the family of STORE (Sustained Temporal Order REcurrent) models (Bradski et al., 1992, 1994). The STORE 1 model is defined by the following RNN:

Input: Let the input $$I_i(t)$$ to the $$i^{th}$$ neuronal population satisfy:

$\tag{40} I_i(t) = 1 \text{ if } t_i - \alpha_i < t < t_i \text{ and } 0 \text{ otherwise.}$

Let the inter-input delays $$\beta_i$$ be long enough for the system variables $$x_i$$ and $$y_i$$ to approach equilibrium.

Layer $$F_1$$: Let the activity $$x_i$$ of the $$i^{th}$$ population in layer $$F_1$$ satisfy:

$\tag{41} \frac{d}{dt} x_i = (AI_i + y_i - x_i x)I,$

where $$x = \sum_{k=1}^n x_k$$ and $$I = \sum_{k=1}^n I_k$$.

Layer $$F_2$$: Let the activity $$y_i$$ of the $$i^{th}$$ population in layer $$F_2$$ satisfy:

$\tag{42} \frac{d}{dt} y_i = (x_i - y_i) I^c ,$

where $$I^c = 1 - I$$. Initially, $$x_i(0) = y_i(0) = 0$$. Equation (41) is an RCF with a broad off-surround $$x$$ that can update its STM pattern only when some input is on; that is, when $$I > 0$$. Equations (41) and (42) together define recurrent feedback loops whereby $$y_i$$ can store a new value of $$x_i$$ when all inputs shut off; that is, when $$I = 0$$. While some inputs are on, previously stored values of $$y_i$$ influence the STM pattern of $$x_i$$'s that is stored, without themselves being changed by them.
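
The way in which the gain $$A$$ shapes the stored gradient can be explored with a short numerical sketch of equations (41) and (42). It assumes that items are presented one at a time (so $$I$$ is either 0 or 1), that every input has the same duration `alpha` and is followed by the same delay `beta`, and that forward-Euler integration with step `dt` is accurate enough. These names and values, and the function name `store1`, are illustrative choices rather than settings from Bradski et al.

```python
import numpy as np

def store1(n_items, A, alpha=40.0, beta=20.0, dt=0.02):
    """Present n_items inputs in sequence and return the stored pattern x."""
    x = np.zeros(n_items)   # layer F1 activities
    y = np.zeros(n_items)   # layer F2 activities
    on_steps = int(round(alpha / dt))
    off_steps = int(round(beta / dt))
    for k in range(n_items):
        inputs = np.zeros(n_items)
        inputs[k] = 1.0                 # only input k is on
        I = inputs.sum()                # total input, here I = 1
        for _ in range(on_steps):
            # Equation (41): x updates only while an input is on; y is frozen.
            x = x + dt * (A * inputs + y - x * x.sum()) * I
        for _ in range(off_steps):
            # Equation (42) with I = 0, so I^c = 1: y tracks x; x is frozen.
            y = y + dt * (x - y)
    return x

print(np.round(store1(4, A=0.1), 3))   # primacy gradient
print(np.round(store1(7, A=0.1), 3))   # bowed gradient
print(np.round(store1(4, A=2.0), 3))   # recency gradient
```

Setting the right-hand sides to zero shows why this happens under these assumptions: when the $$n^{th}$$ item is stored, each older item keeps its ratio to every other older item, and the newest item's activity relative to the previous item's equals the total activity $$S_{n-1}$$ that was reached when the previous item was stored, where $$S_n^2 = A + S_{n-1}$$ and $$S_1 = \sqrt{A}$$. For small $$A$$ the totals start below 1 (primacy) and eventually exceed 1 as the list grows (bowing); for $$A > 1$$ they always exceed 1 (recency).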

## Serial Learning: From Command Cells to Values, Decisions, and Plans

When sequences of items from a list are stored in STM, they can trigger learning and LTM that enable them to be fluently recalled at a future time. Such serial learning has been studied experimentally and theoretically for a long time in experimental psychology; e.g., Dixon and Horton (1968); Hovland (1938a, 1938); Hull et al. (1940); Jung (1968); McGeogh and Irion (1952); Osgood (1953), and Underwood (1952). In fact, the Additive Model and Shunting Model were first derived in order to explain associative learning of temporal order information in serial learning and related paradigms (Grossberg, 1969c; Grossberg and Pepe, 1970, 1971).  The step-wise historical development of models for learning of temporal order, leading to sophisticated bRNNs that can respond with increasing flexibility to different types of environmental feedback, can be summarized as follows:

### Avalanches

The properties of stimulus sampling and of encoding LTM in spatial pattern units show how to learn an arbitrary act, such as a piano recital, a dance, or a sequence of sensory images, in a minimal way. The simplest example, called an Avalanche (Grossberg, 1969d, 1970b, 1974), describes a ritualistic encoding wherein performance is insensitive to environmental feedback. In this case, only one cell is needed to encode the memory of an arbitrary space-time pattern. This fact shows that encoding complexity per se is relatively easy. Indeed, nervous systems with few cells can activate complicated behaviors, as is well known in invertebrates. The ritualistic construction is also universal, because such a cell can encode any act.

Figure 21: An Avalanche circuit learns to perform arbitrary spatiotemporal patterns by sampling and learning them using sequentially-activated Outstars.

Intuitively, an Avalanche samples and learns a spatiotemporal pattern as a sequence of spatial patterns, much as we experience the continuous flow of scenes in a movie from a rapidly presented sequence of still pictures (Figure 21). In the Avalanche, a series of Outstars (black circuits in Figure 21) are sequentially activated (red series of connections in Figure 21). The Outstars sample a spatiotemporal pattern (green region in Figure 21) as they are sequentially activated by the sampling pulse (blue pulse in Figure 21). A related concept is the synfire chain (Abeles, 1991).

Despite their simplicity, Avalanche-type circuits occur in vivo. Figure 22 illustrates that an Avalanche-type circuit occurs in the HVC-RA network that controls songbird singing (Hahnloser et al., 2002). As illustrated by the bRNNs in Figure 12, Figure 13, and Figure 15, in addition to a primary circuit for temporally ordered recall, many other circuits, such as those in frontal cortex and the basal ganglia, are also needed to ensure flexible performance, at least in higher species. Even in the songbird, frontal and basal ganglia circuits modulate song performance (Andalman and Fee, 2009).

Figure 22: The songbird HVC-RA circuit (http://web.mit.edu/feelab/research.html) includes an Avalanche anatomy to execute song sounds in the correct order.

### Command cells and nonspecific arousal

Once a pulse activates the Avalanche in Figure 21, there is no way to stop it. If, for example, the Avalanche controlled the performance of a long dance, and the stage on which the dance was being performed began to burn, there would be no way to stop the dance in mid-course to escape the flames. Sensitivity to environmental feedback is possible only if the pulse can be abruptly terminated as it travels along the Avalanche axon, and replaced by a more appropriate escape behavior. Grossberg (1969d, 1970b) proposed that the minimal circuit for including such sensitivity to environmental feedback would include command cells (Figure 23). Command cells are, in fact, found even in invertebrates, where they control such stereotyped behaviors as the rhythmic beating of crayfish swimmerets (Stein, 1971).

Figure 23: Command cell modulation of Avalanche sequential performance.

Suppose that activation of a command cell is necessary to fire the chain of Avalanche cells (Figure 23). The command cell simultaneously sends signals to all of the Outstars within the Avalanche, which can now fire only if they receive a signal from the previous Outstar source cell and from the command cell ("polyvalence"). Thus, the command cell provides nonspecific arousal to the Avalanche. Withdrawal of the command cell arousal can abruptly terminate output from the next link in the Avalanche. In addition, changing the size of the command cell signal can vary the speed of performance, with larger command signals causing faster performance speeds. Command cells are also familiar in the control of other behavioral acts in invertebrates (Carlson, 1968; Dethier, 1968). Competition between command cells can then determine which ritualistic behavior the system will activate.
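
The gating role of the command cell can be sketched as a discrete toy model. The assumptions are strong and purely illustrative: each Outstar is reduced to a stored label, each link integrates the command signal until a hypothetical firing `threshold` is reached, and withdrawal of arousal halts the chain permanently. None of the names or values below come from the cited papers.

```python
def perform(patterns, arousal, threshold=1.0):
    """Read out an Avalanche's stored patterns under command-cell gating.

    patterns: spatial patterns stored by successive Outstars (any objects).
    arousal:  command-cell signal at each time step (non-negative numbers).
    A link fires only when the previous link has fired AND enough command
    signal has accumulated ("polyvalence").
    """
    performed = []
    link = 0        # index of the next Outstar in the chain
    charge = 0.0    # command signal integrated by the next link
    for signal in arousal:
        if link >= len(patterns):
            break                      # the whole sequence has been performed
        if signal <= 0.0:
            break                      # arousal withdrawn: the chain stops here
        charge += signal
        while charge >= threshold and link < len(patterns):
            performed.append(patterns[link])
            link += 1
            charge -= threshold
    return performed

dance = ["step 1", "step 2", "step 3"]
print(perform(dance, [1.0, 1.0, 1.0]))   # full performance
print(perform(dance, [1.0, 0.0, 1.0]))   # aborted after the first step
print(perform(dance, [0.5] * 6))         # same dance at half speed
```

A larger command signal reaches the firing threshold in fewer time steps, which is how this toy model reproduces the link between arousal size and performance speed.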

Grossberg (1978a) describes a series of increasingly sophisticated mechanisms that modulate Avalanche performance, leading to ever-greater sensitivity to environmental feedback, including recurrent interactions. These include issues such as sensitivity to the value of actions for achieving desired goals, and the ability to volitionally decide what actions to perform and at what speed.

Concerning the former issue: There is a difference between aborting your dance on stage if the theater is being consumed by flames, and risking your career because a mosquito is hovering above. Only more important events should be able to shut off the arousal that supports a given act. Knowing what is important to an organism requires that the network can evaluate what events are rewarding and punishing. This issue historically led to the Cognitive-Emotional-Motor (CogEM) theory of reinforcement learning in which incentive motivation plays the role of a conditionable form of nonspecific arousal, and competition between different drive representations that control the incentive motivation can determine switching between different valued actions (Grossberg, 1971a, 1972a, 1972b, 1975; see also Armony et al. (1995) and Computational models of classical conditioning). These ideas and their generalizations and extensions led eventually to the MOTIVATOR model (Figure 16).

Concerning the latter issue: How does an organism decide what act to perform? This question involves issues about volitional control of behavioral choice by prefrontal-basal ganglia circuits that led eventually to circuits such as TELOS and lisTELOS (Figure 14 and Figure 15) and related models (e.g., Frank and Claus, 2006; Schultz et al., 1997). The ritualistic performance in Avalanches hereby focused attention in Grossberg (1978a) and thereafter on multiple issues concerning the global organization of brain mechanisms that are sensitive to different kinds of environmental feedback. Articulating these mechanisms led to the types of high-dimensional bRNNs that are illustrated in Figure 11-Figure 19 and that are familiar in advanced brains.

### Self-organizing avalanches: Instar-outstar maps and serial learning of temporal order

The Outstar source cells and the links between them are pre-wired in an Avalanche (Figure 21). These limitations led Grossberg (1972, 1976a, 1976b) to interact through the published literature with von der Malsburg (1973; see also Willshaw and Malsburg, 1976) to introduce Competitive Learning and Self-Organizing Maps (Figure 4) so that the source, or sampling, cells could self-organize as learned category cells.

Figure 24: In a Self-Organizing Avalanche, Instars (often embedded in Self-Organizing Map or Adaptive Resonance Theory circuits) learn to activate categories that act as source, or sampling, cells of Outstars. Each Outstar is capable of learning an arbitrary spatial pattern. The temporal order in which the Outstar sampling cells are activated is determined by serial associative learning among the sampling cells, starting out with adaptive completely recurrent connections.

After the Outstar source cells self-organize, they need to learn the spatial patterns that they will perform (Figure 24), as occurs in pre-wired Outstars as well. Taken together, these learned Instars and Outstars define the Instar-Outstar associative map learning circuits (Figure 6) that were introduced in Grossberg (1976a).

If the source cells self-organize, then the links between them must also be learned. This is the problem of serial learning.  The simplest network capable of learning an arbitrary temporal order among its constituent cells is a fully-recurrent RNN (Figure 25) whose sampling cells can sequentially learn to embed a temporal order of performance in the network, by building on the guarantee of the Unbiased Spatial Pattern Learning Theorem; see equations (35) and (36). Grossberg (1969c) and Grossberg and Pepe (1970, 1971) provide mathematical analyses of how serial learning can proceed through time, and thereby explain classical data properties such as the bowed serial position curve. The net result of all these learning processes is a Self-Organizing Avalanche that can learn its sampling cells, its temporal order links, and its output spatial patterns (Figure 24).

Figure 25: The simplest circuit for serial learning is a completely recurrent neural network with adaptive synapses that obey the Outstar Learning Theorem.

### Context-Sensitive Self-Organizing Avalanches: What categories control temporal order?

Once a mechanism is in place for learning categories that act as sampling cells, the question arises: What do these categories code? In Avalanches, each link in the associative chain is sensitive only to the previous link. However, in many types of tasks, information about more than one previous event or action is needed to choose the correct subsequent action. This issue led to the introduction of Item-and-Order working memories (Figure 20) so that list chunks could be learned which are sensitive to whole sequences of previous events. In such a network, list chunks are the sampling cells that are linked through serial learning into a temporally ordered circuit. List chunks also play the role of planning nodes through read-out of their learned top-down spatial patterns and serial links. Such a network is called a Context-Sensitive Self-Organizing Avalanche, or a Context-Sensitive Avalanche, for short.

### Serial learning

Issues about what previous events control subsequent responses were articulated in the classical literature about serial verbal learning. New verbal units are continually being synthesized as a result of practice, and need not be the obvious units that the experimentalist is directly manipulating. Indeed, entire sequences of previous events can create the context that determines the next response. The same problem arises in verbal, spatial, and motor learning. The concept of list chunks was introduced to explain such learned sequence-sensitive contextual control.

The severity of such difficulties led the serial learning expert Young (1968, p. 146) to write: "If an investigator is interested in studying verbal learning processes ... he would do well to choose some method other than serial learning". Underwood (1966, p. 491) went even further by writing: "The person who originates a theory that works out to almost everyone's satisfaction will be in line for an award in psychology equivalent to the Nobel prize". The mechanisms summarized in this review enable many of the classical serial learning data that inspired these statements to be explained and simulated. However, a full discussion of these data and their explanations goes beyond the scope of the current review. See Grossberg (1969c) and Grossberg and Pepe (1970, 1971) for explanations and simulations of classical serial learning data, and Grossberg (1978a, 1993) for reviews.

## References

• Abbott, L.F., Varela, J.A., Sen, K., and Nelson, S.B. (1997). Synaptic depression and cortical gain control. Science, 275, 220-223.
• Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge, UK: Cambridge University Press.
• Ackley, D.H., Hinton, G.E., and Sejnowski, T.J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147-169.
• Agam, Y., Bullock, D., and Sekuler, R. (2005). Imitating unfamiliar sequences of connected linear motions. Journal of Neurophysiology, 94, 2832-2843.
• Agam, Y., Galperin, H., Gold, B. J., and Sekuler, R. (2007). Learning to imitate novel motion sequences. Journal of Vision, doi:10.1167/7.5.1
• Alligood, K.T., Sauer, T.D., and Yorke, J.A. (1996). Chaos: An introduction to dynamical systems. New York: Springer.
• Andalman, A.S., and Fee, M.S. (2009). A basal ganglia-forebrain circuit in the songbird biases motor output to avoid vocal errors. Proceedings of the National Academy of Sciences, 106, 12518-12523.
• Anderson, J.A. (1968). A memory model utilizing spatial correlation functions. Kybernetik, 5, 113-119.
• Anderson, J.A., Silverstein, J.W., Ritz, S.R., and Jones, R.S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413-451.
• Armony, J.L., Servan-Schreiber, D., Cohen, J.D., and LeDoux, J.E. (1995). An anatomically constrained neural network model of fear conditioning. Behavioral Neuroscience, 109, 246-257.
• Atkinson, R.C., and Shiffrin, R. M. (1971). The control of short term memory. Scientific American, 225(2), 82-90.
• Averbeck, B. B., Chafee, M. V., Crowe, D. A., and Georgopoulos, A. P. (2002). Parallel processing of serial movements in prefrontal cortex. Proceedings of the National Academy of Sciences, USA, 99, 20, 13172–13177.
• Averbeck, B. B., Crowe, D. A., Chafee, M. V., and Georgopoulos, A. P. (2003a). Neural activity in prefrontal cortex during copying geometrical shapes. I. Single cells encode shape, sequence, and metric parameters. Experimental Brain Research, 150, 127-141.
• Averbeck, B. B., Crowe, D. A., Chafee, M. V., and Georgopoulos, A. P. (2003b). Neural activity in prefrontal cortex during copying geometrical shapes. II. Decoding shape segments from neural ensembles. Experimental Brain Research, 150, 142–153.
• Baddeley, A. (1986). Working Memory. London: Oxford University Press.
• Baddeley, A. D., and Hitch, G. J. (1974). Working memory. In G. H. Bower (Ed.), Recent Advances in Learning and Motivation (Vol. 8), New York: Academic Press.
• Boardman, I., and Bullock, D. (1991). A neural network model of serial order recall from short-term memory. Proceedings of the International Joint Conference on Neural Networks, 2, 879-884. Piscataway, NJ: IEEE Service Center.
• Bradski, G., Carpenter, G. A., and Grossberg, S. (1994). STORE working memory networks for storage and recall of arbitrary temporal sequences. Biological Cybernetics, 71, 469-480.
• Brown, J., Bullock, D., and Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19, 10502-10511.
• Brown, J.W., Bullock, D., and Grossberg, S. (2004). How laminar frontal cortex and basal ganglia circuits interact to control planned and reactive saccades. Neural Networks, 17, 471-510.
• Burton, T.A. (1993). Averaged neural networks. Neural Networks, 6, 677-680.
• Burwick, T. (2006). Oscillatory networks: Pattern recognition without a superposition catastrophe. Neural Computation, 18, 356-380.
• Caianiello, E.R. (1961). Outline of a theory of thought and thinking machines. Journal of Theoretical Biology, 1, 204-235.
• Cao, Y. and Grossberg, S. (2005). A laminar cortical model of stereopsis and 3D surface perception: Closure and da Vinci stereopsis. Spatial Vision, 18, 515-578.
• Cao, Y., and Grossberg, S. (2012). Stereopsis and 3D surface perception by spiking neurons in laminar cortical circuits: A method of converting neural rate models into spiking models. Neural Networks, 26, 75-98.
• Carlson, F.O. (Ed.). Physiological and biochemical aspects of nervous integration. Englewood Cliffs, New Jersey: Prentice-Hall.
• Carpenter, G.A. (2001). Neural network models of learning and memory: Leading questions and an emerging framework. Trends in Cognitive Sciences, 5, 114-118.
• Carpenter, G.A. and Grossberg, S. (1983). A neural theory of circadian rhythms: The gated pacemaker. Biological Cybernetics, 48, 35-59.
• Carpenter, G.A., and Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115.
• Cisek, P. (2006). Integrated neural processes for defining potential actions and deciding between them: A computational model. The Journal of Neuroscience, 26, 9761-9770.
• Cohen, M.A. (1988). Sustained oscillations in a symmetric cooperative-competitive neural network: Disproof of a conjecture about content addressable memory. Neural Networks, 1, 217-221.
• Cohen, M.A., and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815-826.
• Cohen, M.A. and Grossberg, S. (1986). Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short-term memory. Human Neurobiology, 5, 1-22.
• Cohen, M.A. and Grossberg, S. (1987). Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of patterned data. Applied Optics, 26, 1866-1891.
• Cooper, R. P., and Shallice, T. (2000). Contention scheduling and the control of routine activities. Cognitive Neuropsychology, 17, 297–338.
• Cornsweet, T.N. (1970). Visual perception. New York: Academic Press.
• Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-185.
• Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society of London, 353, 1245–1255.
• Dethier, V.G. (1968). Physiology of insect senses. London: Methuen.
• Dev, P. (1975). Perception of depth surfaces in random-dot stereograms: a neural model. International Journal of Man-Machine Studies, 7, 511-528.
• Douglas, R.J., Koch, C., Mahowald, M., Martin, K.A., and Suarez, H.H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981-985.
• Dranias, M., Grossberg, S., and Bullock, D. (2008). Dopaminergic and non-dopaminergic value systems in conditioning and outcome-specific revaluation. Brain Research, 1238, 239-287.
• Eckhorn, R., Reitboeck, H.J., Arndt, M., and Dicke, P. (1990). Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Computation, 2, 293-307.
• Eigen, M., and Schuster, P. (1978). The hypercycle: A principle of natural self-organization, B: The abstract hypercycle. Naturwissenschaften, 65, 7-41.
• Ellias, S.A. and Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short-term memory of shunting on-center off-surround networks. Biological Cybernetics, 20, 69-98.
• Farrell, S., and Lewandowsky, S. (2002). An endogenous distributed model of ordering in serial recall. Psychonomic Bulletin & Review, 9, 59–79.
• Fazl, A., Grossberg, S., and Mingolla, E. (2009). View-invariant object category learning, recognition, and search: How spatial and object attention are coordinated using surface-based attentional shrouds. Cognitive Psychology, 58, 1-48.
• Francis, G. and Grossberg, S. (1996). Cortical dynamics of boundary segmentation and reset: Persistence, afterimages, and residual traces. Perception, 35, 543-567.
• Francis, G., Grossberg, S., and Mingolla, E. (1994). Cortical dynamics of feature binding and reset: Control of visual persistence. Vision Research, 34, 1089-1104.
• Frank, M.J., and Claus, E.D. (2006). Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113, 300-326.
• French, R.M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3, 128-135.
• Gorchetchnikov, A., Versace, M., and Hasselmo, M.E. (2005). A model of STDP based on spatially and temporally local information: Derivation and combination with gated decay. Neural Networks, 16, 458-466.
• Gaudiano, P., and Grossberg, S. (1991). Vector associative maps: Unsupervised real-time error-based learning and control of movement trajectories. Neural Networks, 4, 147-183.
• Gaudiano, P., and Grossberg, S. (1992). Adaptive vector integration to endpoint: Self-organizing neural circuits for control of planned movement trajectories. Human Movement Science, 11, 141-155.
• Gilpin, M.E., and Ayala, F.J. (1973). Global models of growth and competition. Proceedings of the National Academy of Sciences, 70, 3590-3593.
• Gray, C.M., Konig, P., Engel, A.K., and Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334-337.
• Grossberg, S. (1964). The theory of embedding fields with applications to psychology and neurophysiology. Rockefeller Institute for Medical Research.
• Grossberg, S. (1967). Nonlinear difference-differential equations in prediction and learning theory. Proceedings of the National Academy of Sciences, 58, 1329-1334.
• Grossberg, S. (1968a). A prediction theory for some nonlinear functional-differential equations, II: Learning of patterns. Journal of Mathematical Analysis and Applications, 22, 490-522.
• Grossberg, S. (1968b). Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proceedings of the National Academy of Sciences, 59, 368-372.
• Grossberg, S. (1968c). Some physiological and biochemical consequences of psychological postulates. Proceedings of the National Academy of Sciences, 60, 758-765.
• Grossberg, S. (1969a). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
• Grossberg, S. (1969b). On the production and release of chemical transmitters and related topics in cellular control. Journal of Theoretical Biology, 22, 325-364.
• Grossberg, S. (1969c). On the serial learning of lists. Mathematical Biosciences, 4, 201-253.
• Grossberg, S. (1969d). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, I. Journal of Mathematics and Mechanics, 19, 53-91.
• Grossberg, S. (1970a). Neural pattern discrimination. Journal of Theoretical Biology, 27, 291-337.
• Grossberg, S. (1970b). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, II. Studies in Applied Mathematics, 49, 135-166.
• Grossberg, S. (1971a). On the dynamics of operant conditioning. Journal of Theoretical Biology, 33, 225-255.
• Grossberg, S. (1971b). Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 68, 828-831.
• Grossberg, S. (1972a). A neural theory of punishment and avoidance, I: Qualitative theory. Mathematical Biosciences, 15, 39-67.
• Grossberg, S. (1972b). A neural theory of punishment and avoidance, II: Quantitative theory. Mathematical Biosciences, 15, 253-285.
• Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213-257.
• Grossberg, S. (1974). Classical and instrumental learning by neural networks. In Rosen, R., and Snell, F. (Eds.), Progress in theoretical biology. New York: Academic Press, pp. 51-141.
• Grossberg, S. (1975). A neural model of attention, reinforcement, and discrimination learning. International Review of Neurobiology, 18, 263-327.
• Grossberg, S. (1976a). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
• Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
• Grossberg, S. (1978a). A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen and F. Snell (Eds.), Progress in theoretical biology, Volume 5. New York: Academic Press, pp. 233-374.
• Grossberg, S. (1978b). Behavioral contrast in short-term memory: Serial binary memory models or parallel continuous memory models? Journal of Mathematical Psychology, 3, 199-219.
• Grossberg, S. (1978c). Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493.
• Grossberg, S. (1978d). Decisions, patterns, and oscillations in nonlinear competitive systems with applications to Volterra-Lotka systems. Journal of Theoretical Biology, 73, 101-130.
• Grossberg, S. (1980a). How does a brain build a cognitive code? Psychological Review, 87, 1-51.
• Grossberg, S. (1980b). Biological competition: Decision rules, pattern formation, and oscillations. Proceedings of the National Academy of Sciences, 77, 2338-2342.
• Grossberg, S. (1982). Associative and competitive principles of learning and development: The temporal unfolding and stability of STM and LTM patterns. In Amari, S.I., and Arbib, M. (Eds.), Competition and cooperation in neural networks. New York: Springer-Verlag.
• Grossberg, S. (1983). The quantized geometry of visual space: The coherent computation of depth, form, and lightness. The Behavioral and Brain Sciences, 6, 625-692.
• Grossberg, S. (1984a). Some normal and abnormal behavioral syndromes due to transmitter gating of opponent processes. Biological Psychiatry, 19, 1075-1118.
• Grossberg, S. (1984b). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In R. Karrer, J. Cohen, and P. Tueting (Eds.), Brain and information: Event related potentials. New York: New York Academy of Sciences, pp. 58-142.
• Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17-61.
• Grossberg, S. (1993). Self-organizing neural networks for stable control of autonomous behavior in a changing world. In J.G. Taylor (Ed.), Mathematical approaches to neural networks. Amsterdam: Elsevier Science Publishers, pp. 139-197.
• Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention and grouping by the laminar circuits of visual cortex. Spatial Vision, 12, 163-186.
• Grossberg, S. (2003). How does the cerebral cortex work? Development, learning, attention, and 3D vision by laminar circuits of visual cortex. Behavioral and Cognitive Neuroscience Reviews, 2, 47-76.
• Grossberg, S. (2009). Cortical and subcortical predictive dynamics and learning during perception, cognition, emotion and action. Philosophical Transactions of the Royal Society of London B Biological Sciences, 364, 1223-1234.
• Grossberg, S. (2012). Adaptive Resonance Theory: How a brain learns to consciously attend, learn, and recognize a changing world. Neural Networks, 37, 1-47.
• Grossberg, S. and Grunewald, A. (1997). Cortical synchronization and perceptual framing. Journal of Cognitive Neuroscience, 9, 117-132.
• Grossberg, S. and Gutowski, W.E. (1987). Neural dynamics of decision making under risk: Affective balance and cognitive-emotional interactions. Psychological Review, 94, 300-318.
• Grossberg, S., and Kazerounian, S. (2011). Laminar cortical dynamics of conscious speech perception: A neural model of phonemic restoration using subsequent context in noise. Journal of the Acoustical Society of America, 130, 440-460.
• Grossberg, S. and Levine, D.S. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning: Blocking, inter-stimulus interval, and secondary reinforcement. Applied Optics, 26, 5015-5030.
• Grossberg, S. and Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception and Psychophysics, 38, 141-171.
• Grossberg, S., and Pearson, L. (2008). Laminar cortical dynamics of cognitive and motor working memory, sequence learning and performance: Toward a unified theory of how the cerebral cortex works. Psychological Review, 115, 677-732.
• Grossberg, S. and Pepe, J. (1970). Schizophrenia: Possible dependence of associational span, bowing, and primacy vs. recency on spiking threshold. Behavioral Science, 15, 359-362.
• Grossberg, S. and Pepe, J. (1971). Spiking threshold and overarousal effects in serial learning. Journal of Statistical Physics, 3, 95-125.
• Grossberg, S. and Pilly, P. (2008). Temporal dynamics of decision-making during motion perception in the visual cortex. Vision Research, 48, 1345-1373.
• Grossberg, S. and Raizada, R. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Research, 40, 1413-1432.
• Grossberg, S., and Seitz, A. (2003). Laminar development of receptive fields, maps, and columns in visual cortex: The coordinating role of the subplate. Cerebral Cortex, 13, 852-863.
• Grossberg, S. and Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466.
• Grossberg, S. and Todorovic, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241-277.
• Grossberg, S., and Versace, M. (2008). Spikes, synchrony, and attentive learning by laminar thalamocortical circuits. Brain Research, 1218, 278-312.
• Guo, S., Huang, L., and Wang, L. (2004). Linear stability and Hopf bifurcation in a two-neuron network with three delays. International Journal of Bifurcation and Chaos, 14, 2799-2810.
• Hahnloser, R.H.R., Kozhevnikov, A.A., and Fee, M.S. (2002). An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature, 419, 65-70.
• Hartline, H.K., and Ratliff, F. (1957). Inhibitory interaction of receptor units in the eye of Limulus. Journal of General Physiology, 40, 357-376.
• Hebb, D.O. (1949). The organization of behavior. New York: Wiley.
• Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979-4983.
• Heeger, D.J. (1992). Normalization of cell responses in cat visual cortex. Visual neuroscience, 9, 181-197.
• Hodgkin, A.L. (1964). The conduction of the nervous impulse. Springfield: C.C. Thomas.
• Hodgkin, A.L., and Huxley, A.F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500-544.
• Hopfield, J.J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
• Houghton, G. (1990). The problem of serial order: A neural network model of sequence learning and recall. In R. Dale, C. Mellish, and M. Zock (Eds.), Current research in natural language generation, pp. 287-319. London: Academic Press.
• Houghton, G., and Hartley, T. (1996). Parallel models of serial behavior: Lashley revisited. PSYCHE, 2, 25.
• Hovland, C.I. (1938a). Experimental studies in rote-learning theory. I. Reminiscence following learning by massed and by distributed practice. Journal of Experimental Psychology, 22, 201-224.
• Hovland, C.I. (1938b). Experimental studies in rote-learning theory. III. Distribution of practice with varying speeds of syllable presentation. Journal of Experimental Psychology, 23, 172-190.
• Huang, T.-R., and Grossberg, S. (2010). Cortical dynamics of contextually cued attentive visual learning and search: Spatial and object evidence accumulation. Psychological Review, 117, 1080-1112.
• Hull, C.L., Hovland, C.I., Ross, R.T., Hall, M., Perkins, D.T., and Fitch, F.B. (1940). Mathematico-deductive theory of rote learning. New Haven: Yale University Press.
• Hunt, R. R., and Lamb, C. A. (2001). What causes the isolation effect? Journal of Experimental Psychology: Learning, Memory and Cognition, 27, 6, 1359-66.
• Jones, D., Farrand, P., Stuart, G., and Morris, N. (1995). The functional equivalence of verbal and spatial memory in serial short-term memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 21, 1008-1018.
• Katz, B. (1969). The release of neural transmitter substances. Liverpool: Liverpool University Press.
• Knoedler, A. J., Hellwig, K. A., and Neath, I. (1999). The shift from recency to primacy with increasing delay. Journal of Experimental Psychology: Learning, Memory and Cognition, 25, 474-487.
• Koch, C., and Ullman, S. (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4, 219-227.
• Kohonen, T. (1971). A class of randomly organized associative memories. Acta Polytechnica Scandinavica, E1. 25.
• Kohonen, T. (1977). Associative memory: A system-theoretical approach. New York: Springer-Verlag.
• Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.
• Kosko, B. (1987). Adaptive bidirectional associative memories. Applied Optics, 26, 4947-4960.
• Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, SMC-18, 49-60.
• Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16, 37-68.
• May, R.M., and Leonard, W.J. (1975). Nonlinear aspects of competition between three species. SIAM Journal on Applied Mathematics, 29, 243-253.
• McCulloch, W.S., and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
• McLaughlin, D., Shapley, R., Shelley, M., and Wielaard, D.J. (2000). A neuronal network model of macaque primary visual cortex (V1): Orientation selectivity and dynamics in the input layer 4Ca. Proceedings of the National Academy of Sciences, 97, 8087-8092.
• Menon, A., Mehrotra, K., Mohan, C.K., and Ranka, S. (1996). Characterization of a class of sigmoid functions with applications to neural networks. Neural Networks, 9, 819-835.
• Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63, 2, 81-97.
• Murdock, B. B. (1962). The serial position effect of free recall. Journal of Experimental Psychology, 64, 482-488.
• Ogmen, H., and Gagné, S. (1990). Neural network architectures for motion perception and elementary motion detection in the fly visual system. Neural Networks, 3, 487-505.
• O'Reilly, R.C., and Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
• Osgood, C.E. (1953). Method and theory in experimental psychology. Oxford University Press: London.
• Page, M. (2000). Connectionist modeling in psychology: A localist manifesto. Behavioral and Brain Sciences, 23, 443-512.
• Page, M. P. A., and Norris, D. (1998). The primacy model: A new model of immediate serial recall. Psychological Review, 105, 4, 761-781.
• Palma, J., Grossberg, S., and Versace, M. (2012). Persistence and storage of activity patterns in spiking recurrent cortical networks: Modulation of sigmoid signals by after-hyperpolarization currents and acetylcholine. Frontiers in Computational Neuroscience, 6:42. doi: 10.3389/fncom.2012.00042
• Pilly, P.K. and Grossberg, S. (2012). How do spatial learning and memory occur in the brain? Coordinated learning of entorhinal grid cells and hippocampal place cells. Journal of Cognitive Neuroscience, 5, 1031-1054.
• Raizada, R. and Grossberg, S. (2001). Context-sensitive bindings by the laminar circuits of V1 and V2: A unified model of perceptual grouping, attention, and orientation contrast. Visual Cognition, 8, 431-466.
• Ratliff, F., Hartline, H.K., and Miller, W.H. (1963). Spatial and temporal aspects of retinal inhibitory interactions. Journal of the Optical Society of America, 53, 110-120.
• Rhodes, B. J., Bullock, D., Verwey, W. B., Averbeck, B. B., and Page, M. P. A. (2004). Learning and production of movement sequences: Behavioral, neurophysiological, and modeling perspectives. Human Movement Science, 23, 683-730.
• Rosenblatt, F. (1962). Principles of neurodynamics. Washington, D.C.: Spartan Books.
• Ruch, T.C., Patton, H.D., Woodbury, J.W., and Towe, A.L. Neurophysiology. Philadelphia: W.B. Saunders.
• Schultz, W., Dayan, P., and Montague, P.R. (1997). A neural substrate of prediction and reward. Science, 275, 1593-1599.
• Silver, M.R., Grossberg, S., Bullock, D., Histed, M.H., and Miller, E.K. (2011). A neural model of sequential movement planning and control of eye movements: Item-order-rank working memory and saccade selection by the supplementary eye fields. Neural Networks, 26, 29-58.
• Somers, D., and Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biological Cybernetics, 68, 393-407.
• Somers, D., and Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D: Nonlinear phenomena, 89, 169-183.
• Stein, P.S.G. (1971). Intersegmental coordination of swimmeret motoneuron activity in crayfish. Journal of Neurophysiology, 34, 310-318.
• Tsodyks, M.V., and Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proceedings of the National Academy of Sciences, 94, 719-723.
• Underwood, B.J. (1952). Studies of distributed practice: VII. Learning and retention of serial nonsense lists as a function of intralist similarity. Journal of Experimental Psychology, 44, 80-87.
• Usher, M., and McClelland, J.L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108, 550-592.
• Von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Biological Cybernetics, 14, 85-100.
• Von Restorff, H. (1933). Über die Wirkung von Bereichsbildungen im Spurenfeld (The effects of field formation in the trace field). Psychologische Forschung, 18, 299-34.
• Wang, X.-J. (2008). Decision making in recurrent neuronal circuits. Neuron, 60, 215-234.
• Werblin, F.S. (1971). Adaptation in a vertebrate retina: Intracellular recordings in Necturus. Journal of Neurophysiology, 34, 228-241.
• Widrow, B. (1962). Generalization and information storage in networks of Adaline neurons. In Yovits, M.C., Jacobi, G.T., and Goldstein, G.D. (Eds.). Self-organizing systems. Washington, D.C.: Spartan Books.
• Willshaw, D.J., and Malsburg, C. von der (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, 194, 431-445.
• Wilson, H.R., and Cowan, J.D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1-24.
• Wu, J., and Zou, X. (1995). Patterns of sustained oscillations in neural networks with delayed interactions. Applied Mathematics and Computation, 73, 55-75.
• Yazdanbakhsh, A., and Grossberg, S. (2004). Fast synchronization of perceptual grouping in laminar visual cortical circuits. Neural Networks, 17, 707-718.