Computational models of visual attention

John K. Tsotsos, Dept. of Computer Science & Engineering, and Centre for Vision Research, York University, Canada
Albert Rothenstein, Dept. of Computer Science and Engineering and Centre for Vision Research, York University

Figure 1: A Venn diagram of Computational Models organized by the hypotheses that influence them. The outer large circle represents all possible models while the inner four ovals represent the four major hypotheses discussed in the paper.

A Model of Visual Attention addresses the observed and/or predicted behavior of human and non-human primate visual attention. Models can be descriptive, mathematical, algorithmic or computational and attempt to mimic, explain and/or predict some or all of visual attentive behavior. A Computational Model of Visual Attention not only includes a process description for how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present a subject, and then seeing how the model performs by comparison.

Introduction

This article presents an overview of a wide variety of models of visual attention that have been presented over the course of the past few decades. A number of model classes will be defined within an organizational taxonomy in an attempt to organize a literature that is rapidly growing and with a view towards guiding future research. The taxonomy will reflect the differing schools of thought as well as the different modeling strategies. Further, it is important to keep in mind that not all models were developed with the same goals and that modelers do not always follow only one school of thought or strategy. Motivations for all models come from two sources. The first is an interest in understanding the human perceptual capability to select, process and act upon parts of one's sensory experience differentially from the rest. The second is the need to reduce the quantity of sensory information processed by a perceptual system (see Computational Foundations for Attentive Processes).

This article focuses on models whose goal is to provide an understanding of all or part of human or non-human primate visual attention. The bulk of models that focus primarily on the development of artefacts for computer vision or robotic systems will not be mentioned, even if they might include significant biological inspiration. Biological relevance is the key here, that is, research that attempts to model a particular set of experimental observations and simultaneously makes predictions that would extend that set and could be verified by future experiments. We try to not judge any model but to provide factual information about modeling in general, about kinds of models (or modeling ‘camps’), and about the kinds of functions different models cover. Interested readers can draw their own conclusions.

An important class of models is not covered here solely because of the emphasis on models that have claims on explaining the biology of attention. Those are many efforts to use aspects of attentive processing in applied settings, in robotics, for surveillance and other applications. Fortunately, a recent excellent survey exists for those interested (Frintrop et al. 2010).

What is a model of visual attention?

A Model of Visual Attention is a description of the observed and/or predicted behavior of human and non-human primate visual attention. Models can employ natural language, system block diagrams, mathematics, algorithms or computations as their embodiment and attempt to mimic, explain and/or predict some or all of visual attentive behavior. Of importance are the accompanying assumptions, the set of statements or principles devised to provide the explanation, and the extent of the facts or phenomena that are explained. These cannot all be laid out here due to the resulting article length but the reader is encouraged to follow the citations provided. Models must be tested by experiments, and such experiments replicated, both with respect to their explanations of existing phenomena but also to test their predictive validity.

What is a computational model of visual attention?

A Computational Model of Visual Attention is an instance of a model of visual attention, and not only includes a formal description for how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present to a subject, and then seeing how the model performs by comparison. The bulk of this article will focus on computational models. It should be pointed out that this definition differs from the usual, almost casual, use of the term ‘computational’ in the area of neurobiological modeling, where it has come to mean almost any model that includes a mathematical formulation of some kind. Mathematical equations can be solved and/or simulated on a computer, and thus the term computational has seemed appropriate to many authors. Marr’s levels of analysis (Marr 1982) provide a different view. He specified three levels of analysis: the computational level (a formal statement of the problems that must be overcome), the algorithmic level (the strategy that may be used), and the implementation level (how the task is actually performed in the brain or in a computer, solving the problems laid out at the computational level, using the strategies of the algorithmic level and adding in the details required for their implementation). Our use of the term ‘computational model’ is intended to capture models that specify all three of Marr’s levels in a testable manner. Our description of the functional elements of attention in Section 3 corresponds to Marr's first level of analysis, the problems that must be addressed. The terms ‘descriptive’, ‘data-fitting’ and ‘algorithmic’ as used here describe three different methodologies for specifying Marr’s algorithmic level of analysis. Section 2 will provide definitions and further discussion on the model classification strategy used here.

Models of attention are complex, providing mechanisms and explanations for a number of functions all tied together with a control system; this is basically the specification at Marr’s ‘computational level’ of analysis. More detail on each of these functions is provided in Visual Attention. Because of this complexity, model evaluation is not a simple matter and objective conclusions are still elusive.

The point of the next section is to create a context for such models; this enables one to see their scientific heritage, to distinguish models on the basis of their modeling strategy, and to situate new models appropriately to enable comparisons and evaluations.

A taxonomy of models

We present a taxonomy of models with computational models clearly lying in the intersection of how the biological community and how the computer vision community view attentive processes. See Figure 2. There are two main roots in this lattice – one for the uses of attentive methods in computer vision and one for the development of attention models in the biological vision community. Although both have proceeded independently, and indeed, the use of attention appears in the computer vision literature before most biological models, the major point of intersection is the class of computational models (using the definition given above).

It is quite clear that the motivations for all the modeling efforts come from two sources. The first is the deep interest to understand the perceptual capability that has been observed for centuries, that is, the ability to select, process and act upon parts of one's sensory experience differentially from the rest. The second is the need to reduce the quantity of sensory information entering any system, biological or otherwise, by selecting or ignoring parts of the sensory input. Although the motivation seems distinct, the conclusion is the same, and in reality the motivation for attention in any system is to reduce the quantity of information to process in order to complete some task (see Computational Foundations for Attentive Processes). But depending on one's interest, modeling efforts do not always have the same goals. That is, one may be trying to model a particular set of experimental observations, one may be trying to build a robotic vision system and attention is used to select landmarks for navigation, one may have interest in eye movements, or in the executive control function, or any one or more of the functional elements described in Visual Attention. As a result, comparing models is not straightforward, fair, or useful. Comparing pieces that represent the same functionality is more relevant, but there are so many of these combinations that it would be an exercise beyond the scope of this overview.

Figure 2: Model Taxonomy

The computer vision branch

The use of attentive methods has pervaded the computer vision literature, demonstrating the importance of reducing the amount of information to be processed. It is important to note that several early analyses of the extent of the information load issue appeared (Uhr 1972, Feldman and Ballard 1982, Tsotsos 1987) with converging suggestions for its solution, those convergences appearing in a number of the models below (particularly those of Burt 1988 or Tsotsos 1990). Specifically, the methods can be grouped into four categories. Within modern computer vision, there are many variations and combinations of these themes because, regardless of the impressive and rapid increases in the power of modern computers, the inherent difficulty of processing images demands attentional processes (see Computational Foundations for Attentive Processes).

Interest point operations

One way to reduce the amount of an image to be processed is to concentrate on the points or regions that are most interesting or relevant for the next stage of processing (such as for recognition or action). The idea is that perhaps 'interestingness' can be computed in parallel across the whole image and then those interesting points or regions can be processed in more depth serially. The first of these methods is due to Moravec (1981) and since then a large number of different kinds of 'interest point' computations have been used. It is interesting to note the parallel here with the Saliency Map Hypothesis described below.
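The parallel-then-serial idea can be sketched in a few lines. The following is a minimal illustration in the spirit of a Moravec-style operator, not a reproduction of the original algorithm: it scores every location by the smallest change seen when a small window is shifted in a few directions, then ranks locations so that the highest-scoring ones could be examined first. The function names, the window size and the choice of four shift directions are assumptions made for this example.

```python
def moravec_interest(image, window=1):
    """Moravec-style interest map: the minimum, over a few shift directions,
    of the sum of squared differences between a window and its shifted copy."""
    height, width = len(image), len(image[0])
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]  # illustrative subset of directions
    margin = window + 1  # keeps every shifted window inside the image
    interest = [[0] * width for _ in range(height)]
    for r in range(margin, height - margin):
        for c in range(margin, width - margin):
            scores = []
            for dr, dc in shifts:
                ssd = 0
                for i in range(-window, window + 1):
                    for j in range(-window, window + 1):
                        diff = image[r + i + dr][c + j + dc] - image[r + i][c + j]
                        ssd += diff * diff
                scores.append(ssd)
            # Flat regions and straight edges have at least one shift with
            # no change, so their minimum is zero; corners score higher.
            interest[r][c] = min(scores)
    return interest


def top_points(interest, k):
    """Return up to k locations with the largest non-zero interest values."""
    ranked = sorted(
        ((value, r, c) for r, row in enumerate(interest) for c, value in enumerate(row)),
        reverse=True,
    )
    return [(r, c) for value, r, c in ranked[:k] if value > 0]


# A 10x10 test image with a bright square in the lower-right quadrant.
image = [[1 if r >= 5 and c >= 5 else 0 for c in range(10)] for r in range(10)]
interest = moravec_interest(image)
print(top_points(interest, 1))
```

On this test image the only strongly interesting location is the corner of the square; uniform areas and the straight edges of the square score zero, so a later serial stage would have very few candidates to visit.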

Perceptual organization

The computational load is due less to the large number of image locations (this is not so large a number as to cause much difficulty for modern computers) than to the combinatorial number of possible groupings of positions or regions. In perceptual psychology, how the brain might organize items is a major concern, pioneered by the Gestaltists (Wertheimer 1923). Thus, computer vision has used grouping strategies following Gestalt principles in order to limit the possible subsets of combinatorially defined items to consider. The first such use appeared in Muerle and Allen (1968) in the context of object segmentation.
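A small count makes the combinatorial point concrete. The sketch below simply compares the number of non-empty subsets of raw image locations with the number of non-empty subsets of a handful of pre-formed groups. The image size and the number of groups are hypothetical values chosen for illustration, and treating candidates as arbitrary subsets is a simplifying assumption.

```python
def nonempty_subsets(n_items):
    """Number of non-empty subsets of a set with n_items elements."""
    return 2 ** n_items - 1


pixels = 16 * 16   # a tiny hypothetical image
groups = 12        # hypothetical number of Gestalt-style groups found in it

print(nonempty_subsets(pixels))  # candidate regions built from raw locations
print(nonempty_subsets(groups))  # candidate regions built from whole groups
```

Even for a 16 by 16 image the first number has dozens of digits, while the second is a few thousand, which is why grouping before search is attractive.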

Active vision

Human eyes move, and humans move around their world in order to acquire visual information. Active vision in computer vision uses intelligent control strategies applied to the data acquisition process depending on the current state of data interpretation (Bajcsy 1985, Tsotsos 1992). A variety of methods have appeared following this idea; perhaps the earliest one most relevant to this discussion is the robotic binocular camera system of Clark and Ferrier (1988), featuring a salience-based fixation control mechanism.

Predictive methods

The application of domain and task knowledge to guide or predict processing is a powerful tool for limiting processing, a fact that has been formally proved (Tsotsos 1989; Parodi et al. 1998). The first use was for oriented line location in a face-recognition task (Kelly 1971). The first instance for temporal window prediction was in a motion recognition task (Tsotsos et al. 1980).

The biological vision branch

Clearly, in this class, the major motivation has always been to provide explanations for the characteristics of biological, especially human, vision. Typically, these models have been developed to explain a particular body of experimental observations. This is a strength; the authors usually are the ones who have done some or all of the experiments and thus completely understand the experimental methods and conclusions. Simultaneously, however, this is also a weakness because the models are often difficult to extend to a broader class of observations. Along the biological vision branch, the three classes identified here are:

Descriptive models

These models are described primarily using natural language and/or block diagrams. Their value lies in the explanation they provide of certain attentional processes; the abstractness of explanation is also their major problem because it is typically open to interpretation. Classic models, even though they were motivated by experiments in auditory attention, have been very influential. Early Selection (Broadbent 1958), Late Selection (Deutsch & Deutsch 1963, Moray 1969, Norman 1968), and Attenuator Theory (Treisman 1964) are all descriptive models. Others such as Feature Integration Theory (Treisman and Gelade 1980), Guided Search (Wolfe et al. 1989), Animate Vision (Ballard 1991), Biased Competition (Desimone and Duncan 1995), FeatureGate (Cave 1999), Feature Similarity Gain Model (Treue & Martinez-Trujillo 1999), RNA (Shipp 2004), and the model of Knudsen (2007) are also considered descriptive. The Biased Competition Model has garnered many followers, mostly because it combines competition with top-down bias, concepts that actually appeared in earlier models (such as Grossberg 1982 or Tsotsos 1990). These are conceptual frameworks, ways of thinking about the problem of attention. Many have played important, indeed foundational, roles in how the field has developed.

Data-fitting models

These models are mathematical and are developed to capture parameter variations in experimental data in as compact and parsimonious a form as possible. Their value lies primarily in how well they provide a fit to experimental data, and in interpolation or extrapolation of parameter values to other experimental scenarios. Good examples are the Theory of Visual Attention (Bundesen 1990) and the set of models that employ normalization as a basic processing element. An early one is the model of Reynolds et al. (1999) that proposed a quantification of the Biased Competition model. Subsequently, this was refined further into the Normalization Model of Attention, a marriage of divisive normalization with biased competition (Reynolds & Heeger 2009). At the same time a further normalization model appeared, the Normalization Model of Attentional Modulation (Lee & Maunsell 2009), showing how attention changes the gain of responses to individual stimuli and why attentional modulation is more than a gain change when multiple stimuli are present in a receptive field.
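The flavour of a normalization account can be conveyed with a toy calculation. The sketch below is a deliberately simplified weighted-normalization form and should not be read as the published equations of either normalization model. The function name, the parameter names and all numerical values are assumptions for illustration. Each stimulus contributes a drive equal to its contrast scaled by an attentional gain, and the response to a pair is the drive-weighted mix of the responses to each stimulus alone, divided by the total drive plus a constant.

```python
def pair_response(resp_pref, resp_nonpref, contrast_pref, contrast_nonpref,
                  attn_pref=1.0, attn_nonpref=1.0, sigma=0.1):
    """Toy normalized response to two stimuli in one receptive field.

    resp_pref and resp_nonpref are the (hypothetical) responses each stimulus
    would drive alone; attention multiplies the drive of the attended stimulus
    before normalization."""
    drive_pref = attn_pref * contrast_pref
    drive_nonpref = attn_nonpref * contrast_nonpref
    numerator = drive_pref * resp_pref + drive_nonpref * resp_nonpref
    return numerator / (drive_pref + drive_nonpref + sigma)


# Hypothetical neuron: 60 spikes/s to its preferred stimulus, 10 to the other.
neutral = pair_response(60.0, 10.0, 1.0, 1.0)
attend_pref = pair_response(60.0, 10.0, 1.0, 1.0, attn_pref=3.0)
attend_nonpref = pair_response(60.0, 10.0, 1.0, 1.0, attn_nonpref=3.0)
print(neutral, attend_pref, attend_nonpref)
```

In this toy form, attention does more than scale the output: with two stimuli present, attending to one pulls the pair response toward the response that stimulus would evoke alone, which is the qualitative behaviour that biased competition describes.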

Algorithmic models

These models provide mathematics and algorithms that govern their performance and as a result present a process by which attention might be computed and deployed. They, however, do not provide sufficient detail or methodology so that the model might be tested on real stimuli. These models often provide simulations to demonstrate their actions. In a real sense they are a combination of descriptive and data-fitting models; they provide more detail on descriptions so they may be simulated while showing good comparison to experimental data at qualitative levels (and perhaps also quantitative). The best known of these models is the Saliency Map Model (Koch and Ullman 1985 - defined in Section 2.3.2); it has given rise to many subsequent models. It is interesting to note that the Saliency Map Model is strongly related to the Interest Point Operations on the other side of this taxonomy. Other algorithmic models include Adaptive Resonance Theory (Grossberg 1982), Temporal Tagging (Niebur et al. 1993; Usher and Niebur 1996), Shifter Circuits (Anderson and Van Essen 1987), Visual Routines (Ullman 1984), CODAM (Taylor and Rogers 2002), and a SOAR-based model (Wiesmeyer & Laird 1990).

The computational branch

As mentioned earlier, the point of intersection between the computer vision and biological vision communities is represented by the set of computational models in the taxonomy. Computational Models not only include a process description for how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present a subject, and then seeing how the model performs by comparison. The biological connection is key and pure computer vision efforts are not included here. Under this definition, computational models generally provide more complete specifications and permit more objective evaluations as well. This greater level of detail is a strength but also a weakness because there are more details that require experimental validation.

Many models have elements from more than one class so the separation is not a strict one. Computational models necessarily are Algorithmic Models and often also include Data-Fitting elements. Nevertheless, in recent years four major schools of thought have emerged, schools that will be termed 'hypotheses' here since each has both supporting and detracting evidence. In what follows, an attempt is made to provide the intellectual antecedents for each of these major hypotheses. The taxonomy is completed in Section 2.4 when several instances of each of the classes are added.

The selective routing hypothesis

This hypothesis focuses on how attention solves the problems associated with stimulus selection and then transmission through the visual cortex. The issues of how signals in the brain are transmitted to ensure correct perception appear, in part, in a number of works. Milner (1974), for example, mentions that attention acts in part to activate feedback pathways to the early visual cortex for precise localization, implying a pathway search problem. The complexity of the brain’s network of feed-forward and feedback connectivity highlights the physical problems of search, transmission and finding the right path between input and output (see Felleman and Van Essen 1991). Anderson and Van Essen's Shifter Circuits proposal (Anderson & Van Essen 1987) was presented primarily to solve these physical routing and transmission problems using control signals to each layer of processing that shift selected inputs from one path to another. The routing issues, described in Tsotsos et al. (1995), are: 1) A single unit at the top of the visual processing network receives input from a sub-network of converging inputs, and thus from a large portion of the visual field (the Context Problem - see Figure 3a); 2) A single event at the input will affect a large number of units in the network due to a diverging feed-forward signal, resulting in a loss of localization information (the Blurring Problem - see Figure 3b); 3) Two separate visual events in the visual field will activate two overlapping sub-networks of units and connections, whose region of overlap will contain units whose activity is a function of both events. Thus, each event interferes with the interpretation of other events in the visual field (the Cross-Talk Problem - see Figure 3c).

Any model that uses a biologically plausible network of neural processing units needs to address these problems. One class of solutions is that of an attentional 'beam' through the processing network as shown in Figure 3d.
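The three routing problems follow directly from converging and diverging connectivity, and they can be counted in a toy network. The sketch below assumes a one-dimensional layered network in which every unit connects to the three nearest units in the adjacent layer. The layer width, depth and connection radius are arbitrary choices for illustration and are not taken from any of the models cited. Because the assumed connectivity is symmetric, the same function gives both the set of inputs feeding a top-layer unit and the set of top-layer units reached by one input. The attentional beam itself is not modeled here.

```python
def spread(indices, width, radius=1):
    """Units in the adjacent layer connected to any unit in `indices`."""
    reached = set()
    for i in indices:
        for j in range(i - radius, i + radius + 1):
            if 0 <= j < width:
                reached.add(j)
    return reached


def propagate(indices, width, n_layers, radius=1):
    """Follow connections across n_layers layer-to-layer steps."""
    current = set(indices)
    for _ in range(n_layers):
        current = spread(current, width, radius)
    return current


WIDTH, DEPTH = 21, 4

# Context Problem: inputs that feed a single top-layer unit.
receptive_field = propagate({10}, WIDTH, DEPTH)

# Blurring Problem: top-layer units affected by a single input event.
influence = propagate({10}, WIDTH, DEPTH)

# Cross-Talk Problem: top-layer units affected by two separate input events.
overlap = propagate({8}, WIDTH, DEPTH) & propagate({12}, WIDTH, DEPTH)

print(len(receptive_field), len(influence), sorted(overlap))
```

With only four layers and a fan-in of three, a top unit already pools nine inputs, one input reaches nine top units, and two events four positions apart share five top units, so deeper networks with larger fan-in make all three problems worse.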

Models that fall into the Selective Routing class include Pyramid Vision (Burt 1988), Olshausen et al. (1993), Selective Tuning (Tsotsos et al. 1995; Zaharescu et al. 2004, Tsotsos et al. 2005; Rodriguez-Sanchez et al. 2007, Rothenstein et al. 2008), NeoCognitron (Fukushima 1986), and SCAN (Postma et al. 1997).

Figure 3: Illustrating the signal routing problems in a neural network (the bottom layer is the input and the top the highest level of processing). a) The Context Problem - In this feed-forward scenario, it is easy to see that many neurons in the input layer affect each single neuron in the highest layer. If the 'attended' stimulus is the one highlighted by the arrow, then there is no neuron in the highest layer that 'sees' only it; they all 'see' the desired stimulus within the context of other input stimuli. (adapted from Tsotsos et al. 1995) b) The Blurring Problem - In a different feed-forward scenario, a single stimulus can affect the response of the entire set of highest layer neurons. Although the stimulus is well localized within the input layer, any localization information is blurred across the highest layer if no remedial processing is added. (adapted from Tsotsos et al. 1995) c) The Cross-Talk Problem - This is also a feed-forward scenario but with two stimulus elements in the input layer, one in blue, covering only a single neuron, and the other, in red, larger and covering 2 neurons. The sets of feed-forward connections they activate overlap, shown by the purple coloured neurons. The overlapping signals interfere with one another and this corrupted signal covers most of the highest layer. (adapted from Tsotsos et al. 1995) d) A solution to these three problems involves an attentional 'beam' that modulates all layers of the network to allow the selected items to pass through while suppressing stimuli in the context that might interfere with the processing of the selected stimulus. (adapted from Tsotsos 1990) e) The modulatory action of the beam strategy shown in d) causes changes in the configuration shown in c). The selected (attended) neuron in the highest layer is indicated by the arrow. The recurrent modulation of the attentional beam leads to the selected neurons shown in black. The pathways that are suppressed (and resulting neurons deprived of input) are in grey. (adapted from Tsotsos et al. 1995)

The saliency map hypothesis

This hypothesis has its roots in Feature Integration Theory (Treisman and Gelade 1980) and appears first in the class of algorithmic models above (Koch and Ullman 1985). It includes the following elements (see Figure 4): (i) an early representation composed of a set of feature maps, computed in parallel, permitting separate representations of several stimulus characteristics; (ii) a topographic saliency map where each location encodes the combination of properties across all feature maps as a conspicuity measure; (iii) a selective mapping into a central non-topographic representation, through the topographic saliency map, of the properties of a single visual location; (iv) a winner-take-all (WTA) network implementing the selection process based on one major rule: conspicuity of location (minor rules of proximity or similarity preference are also suggested); and, (v) inhibition of this selected location that causes an automatic shift to the next most conspicuous location. Feature maps code conspicuity within a particular feature dimension. The saliency map combines information from each of the feature maps into a global measure where points corresponding to one location in a feature map project to single units in the saliency map. Saliency at a given location is determined by the degree of difference between that location and its surround. The models of Clark & Ferrier (1988), Sandon (1990; the first implementation of the Koch & Ullman model), Itti et al. (1998), Itti & Koch (2000), Walther et al. (2002), Navalpakkam & Itti (2005), Itti & Baldi (2006), SERR (Humphreys & Müller 1993), Zhang et al. (2008), and Bruce & Tsotsos (2009) are all in this class. The drive to discover the best representation of saliency or conspicuity is a major current activity; whether or not a single such representation exists in the brain remains an open question with evidence supporting many potential loci (summarized in Tsotsos et al. 2005).
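Elements (i), (ii), (iv) and (v) can be sketched compactly. The code below is a minimal illustration under stated assumptions and is not the algorithm of any of the cited models: conspicuity is taken to be the absolute difference between a location and the mean of its immediate neighbours, feature maps are combined by simple summation, a plain maximum stands in for the winner-take-all network, and inhibition of return zeroes a small square around each selected location. All function names and parameters are hypothetical.

```python
def center_surround(feature_map):
    """Conspicuity: absolute difference between a location and its surround mean."""
    h, w = len(feature_map), len(feature_map[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            surround = [feature_map[rr][cc]
                        for rr in range(max(0, r - 1), min(h, r + 2))
                        for cc in range(max(0, c - 1), min(w, c + 2))
                        if (rr, cc) != (r, c)]
            out[r][c] = abs(feature_map[r][c] - sum(surround) / len(surround))
    return out


def saliency_map(feature_maps):
    """Sum the conspicuity of every feature map at each location."""
    maps = [center_surround(m) for m in feature_maps]
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(w)] for r in range(h)]


def scan_path(saliency, n_fixations, ior_radius=1):
    """Select the most salient location repeatedly, inhibiting each winner."""
    sal = [row[:] for row in saliency]
    h, w = len(sal), len(sal[0])
    path = []
    for _ in range(n_fixations):
        # A plain maximum stands in for the winner-take-all network.
        _, r, c = max((sal[i][j], i, j) for i in range(h) for j in range(w))
        path.append((r, c))
        # Inhibition of return: suppress the winner and its neighbourhood.
        for rr in range(max(0, r - ior_radius), min(h, r + ior_radius + 1)):
            for cc in range(max(0, c - ior_radius), min(w, c + ior_radius + 1)):
                sal[rr][cc] = 0.0
    return path


# Two hypothetical 7x7 feature maps, each with one odd item.
colour = [[0.0] * 7 for _ in range(7)]
orientation = [[0.0] * 7 for _ in range(7)]
colour[1][1] = 1.0
orientation[5][5] = 0.5
print(scan_path(saliency_map([colour, orientation]), 2))
```

The strong colour singleton is selected first; once it is inhibited, the weaker orientation singleton becomes the most conspicuous location and attention shifts to it.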

Figure 4: The Saliency Map Model as originally conceived by Koch & Ullman 1985. (figure adapted from Koch & Ullman 1985)

The temporal tagging hypothesis

The earliest conceptualization of this idea seems to be due to Grossberg, who between 1973 and 1980 presented ideas and theoretical arguments regarding the relationship among neural oscillations, visual perception and attention (see Grossberg 1980). His work led to the ART model that provided details on how neurons may reach stable states given both top-down and bottom-up signals and play roles in attention and learning (Grossberg 1982). Milner also suggested that the unity of a figure at the neuronal level is defined by synchronized firing activity (Milner 1974). von der Malsburg (1981) wrote that neural modulation is governed by correlations in temporal structure of signals and that timing correlations signal objects. He defined a detailed model of how this might be accomplished, including neurons with dynamically modifiable synaptic strengths that became known as von der Malsburg synapses. Crick & Koch (1990) later proposed that an attentional mechanism binds together all those neurons whose activity relates to the relevant features of a single visual object. This is done by generating coherent semi-synchronous oscillations in the 40-70 Hz range. These oscillations then activate a transient short-term memory. Models subscribing to this hypothesis typically consist of pools of excitatory and inhibitory neurons connected as shown in Figure 5. The actions of these neuron pools are governed by sets of differential equations; it is a dynamical system. Strong support for this view appears in a summary by Sejnowski and Paulsen (2006). The model of Hummel & Biederman (1992) and those from Deco's group - Deco & Zihl (2001), Corchs & Deco (2001), Deco, Pollatos & Zihl (2002) - are within this class. A number of other models exist but do not conform to our definition of computational model; they are mathematical models that only provide simulations of their performance. As such, we cannot include them here but do provide these citations because of the intrinsic interest in this model class (Niebur et al. 1993, Usher & Niebur 1996, Kazanovich & Borisyuk 1999, Wu & Guo 1999). Clearly, there is room for expansion of these models into computational form. This hypothesis remains controversial (see Shadlen and Movshon 1999).
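To show what "pools governed by differential equations" means in practice, here is a minimal rate-model sketch of one excitatory and one inhibitory pool, each receiving a constant drive, integrated with the forward Euler method. It is a generic Wilson-Cowan-style system written for illustration and is not the set of equations used by any of the models cited. Every weight, time constant and threshold is an arbitrary assumption, and whether the pair oscillates or settles to a steady state depends on those values.

```python
import math


def sigmoid(x):
    """Saturating response function mapping net input to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_ei_pools(steps=5000, dt=0.01, drive=1.0,
                      w_ee=10.0, w_ei=10.0, w_ie=10.0, w_ii=2.0,
                      tau_e=1.0, tau_i=2.0, threshold=4.0):
    """Euler integration of coupled excitatory (e) and inhibitory (i) pool rates."""
    e, i = 0.1, 0.1
    trace = []
    for _ in range(steps):
        # Excitation raises both pools; inhibition lowers both.
        input_e = w_ee * e - w_ei * i + drive - threshold
        input_i = w_ie * e - w_ii * i + drive - threshold
        e_next = e + dt * (-e + sigmoid(input_e)) / tau_e
        i_next = i + dt * (-i + sigmoid(input_i)) / tau_i
        e, i = e_next, i_next
        trace.append((e, i))
    return trace


trace = simulate_ei_pools(steps=500)
print(trace[-1])
```

In a model of this family, attention would typically enter as a change in the drive or in the coupling between pools, and the quantity of interest would be the timing relationship between the resulting activity traces rather than their mean level.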

Figure 5: Typical neural connectivity pattern for attentional models focusing on oscillatory behavior. The model network consists of a fully connected set of excitatory and inhibitory neurons. Each excitatory and inhibitory neuron also receives a constant driving current, I. (figure adapted from Buia and Tiesinga 2006; further discussion can be found there)

The emergent attention hypothesis

The emergent attention hypothesis proposes that attention is a property of large assemblies of neurons involved in competitive interactions (of the kind mediated by lateral connections) and selection is the combined result of local dynamics and top-down biases (see Figure 6). In other words, there is no explicit selection process of any kind. The mathematics of the dynamical system of equations leads through its evolution alone to single peaks of response that represent the focus of attention. Duncan (1979) provided an early discussion of properties of attention having an emergent quality in the context of divided attention. Grossberg's 1982 ART (Adaptive Resonance Theory) model played a formative role here. Such an emergent view took further root with work on the role of emergent features in attention by Pomerantz and Pristach (1989) and Treisman and Paterson (1984). Later, Styles (1997) suggested that attentional behaviour emerges as a result of the complex underlying processing in the brain. Shipp's review (2004) concludes that this is the most likely hypothesis. The models of Heinke and Humphreys (SAIM; 1997, 2003), Hamker (1999; 2000; 2004; 2005; 2006), Spratling (2008), Deco and Zihl (2001), and Corchs and Deco (2001) belong in this class, among others. Clearly, there must be mechanisms that support the process behind this; Hamker's model provides a good view of how this might be accomplished and shows, for example, how interactions between hierarchical representations are employed. Desimone and Duncan (1995) view their biased competition model as a member of this class, writing "attention is an emergent property of slow, competitive interactions that work in parallel across the visual field". In turn, many of the models in this class are also strongly based on Biased Competition.
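How a single peak can arise without an explicit selector can be seen in a very small dynamical system. The sketch below is a generic rectified rate network with mutual lateral inhibition and an additive top-down bias. It is offered only as an illustration of the principle and does not reproduce the equations of any model named above. The function name, the inhibition weight, the step size and the input values are all assumptions.

```python
def compete(inputs, bias, w_inhibit=2.0, dt=0.05, steps=1000):
    """Evolve unit activities under lateral inhibition plus top-down bias.

    Each unit is driven by its bottom-up input and bias and is inhibited by
    the summed activity of all other units. No step picks a winner; any
    peak that appears is a product of the dynamics alone."""
    n = len(inputs)
    x = [0.0] * n
    for _ in range(steps):
        total = sum(x)
        updated = []
        for k in range(n):
            net = inputs[k] + bias[k] - w_inhibit * (total - x[k])
            updated.append(x[k] + dt * (-x[k] + max(0.0, net)))
        x = updated
    return x


# Two equally strong stimuli and a weaker one; the bias favours unit 1.
bottom_up = [1.0, 1.0, 0.8]
top_down = [0.0, 0.2, 0.0]
activity = compete(bottom_up, top_down)
print([round(a, 3) for a in activity])
```

With these values the biased unit ends up as the only active one, even though its bottom-up input was no stronger than that of its neighbour: the small top-down advantage is amplified by the competition until the other units are silenced.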

Figure 6: The concept of competitive interactions that form the basis of the Emergent Attention Models. Shown is an example of the concept from Hamker (2005). Each of the visual representations (the rectangles) cooperates and competes with several other representations. Within each representation, additional local competitions help define the contents. No separate attentive mechanisms are provided.

The computational models that are instances of the four major hypotheses

A number of models have appeared over the years that borrow from the major attentional hypotheses and as noted earlier, many borrow from more than one. This section will classify a number of computational models conforming to the definition presented earlier. The directory of models follows while Figure 1 groups them according to their foundational ideas.

Model Directory:

AIM        Bruce & Tsotsos (2005; 2009)
ART        Grossberg (1975; 1982), Carpenter et al. (1998)
ClaFer     Clark & Ferrier (1988)
DraLio     Draper & Lionelle (2005)
FastGBA    Sharma (2016)
Hamker     Hamker (1999; 2000; 2004; 2005; 2006) 
HumBie     Hummel & Biederman (1992)
LanDen     Lanyon & Denham (2004)
LeeBux     Lee et al. (2003)
LiZ        Li (2001)
MORSEL     Mozer (1991)
NeoCog     Fukushima (1986)
NeurDyn    Deco & Zihl (2001), Corchs & Deco (2001), Deco, Pollatos & Zihl (2002)
NowSej     Nowlan & Sejnowski (1995)
OliTor     Oliva et al. (2003)
OlshAn     Olshausen et al. (1993)
PC/BC-DIM  Spratling (2008)
PyrVis     Burt (1988)
SAIM       Heinke & Humphreys (1997,2003) 
Sandon     Sandon (1990)
SCAN       Postma et al. (1997)
SERR       Humphreys & Müller (1993)
SM         Itti et al. (1998)
SMOC       Itti & Koch (2000)
SMSurp     Itti & Baldi  (2006)
SMTask     Navalpakkam & Itti (2005)
ST         Tsotsos et al. (1995)
STActive   Zaharescu et al. (2004)
STBind     Tsotsos et al. (2008), Rothenstein et al. (2008)
STFeature  Rodriguez-Sanchez et al. (2007)
STRec      Tsotsos et al. (2005)
SUN        Zhang et al. (2008)
SunFish    Sun et al. (2008)
UshNie     Usher & Niebur (1996)
vaHeGi     van de Laar et al. (1997)
VISIT      Ahmad (1992)
WalItt     Walther et al. (2002)

Figure 1 makes clear that the Saliency Map hypothesis is the most popular. Further, it is evident that few of the possible combinations of hypotheses have been explored. We would suggest that those empty joint classes are potentially valuable avenues of exploration because it is clear that no single hypothesis covers the full breadth of attentional behavior, as was argued in Section 2 and is further discussed in Section 3.

Functional elements

What are the functional elements of attention that a complete modeling effort must include? This is a difficult question, and several previous papers have attempted to address it. Itti and Koch (2001), for example, review the state of attentional modeling, but from a point of view that assumes attention is primarily a bottom-up process based largely on their notion of saliency maps. Knudsen (2007) provides a more recent review; his perspective favors an early selection model. He identifies four functional components fundamental to attention: working memory, competitive selection, top-down sensitivity control, and filtering for stimuli that are likely to be behaviorally important (salience filters). In his model, the world is first filtered by the salience filters in a purely bottom-up manner, creating the various neural representations on which competitive selection is based. The role of top-down information and control is to compute sensitivity control that affects the neural representations by incorporating the results of selection, working memory and gaze. A third functional structure is that of Hamker (1999), whose work is an excellent example of the neuro-dynamical approach. The focus is on excitatory and inhibitory neural pools, the ordering of their effects, and the neural sites affected; top-down bias is a simple bias arising from area IT. 'What' and 'where' functions are separated: features are computed and represented in the ventral stream and spatial location in the dorsal stream. A review by Rothenstein & Tsotsos (2008) presents a classification of models with details on the functional elements each includes. Finally, Shipp (2004) provides yet another useful overview in which he compares several different models along the dimension of how they map onto system-level circuits in the brain. He presents his Real Neural Architecture (RNA) model for attention, integrating several different modes of operation – parallel or serial, bottom-up or top-down, pre-attentive or attentive – found in cognitive models of attention for visual search.
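To make the interplay of Knudsen's four components more concrete, the following is a minimal sketch, not Knudsen's own formulation. It assumes a grayscale image held in a NumPy array, uses deviation from the mean intensity as a stand-in for salience filters, and omits gaze entirely. All function names and the gain parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def salience_filter(image):
    """Bottom-up stage (assumed form): deviation from the mean intensity."""
    return np.abs(image - image.mean())

def competitive_selection(representation):
    """Winner-take-all: return the (row, col) of the strongest location."""
    index = np.unravel_index(np.argmax(representation), representation.shape)
    return tuple(int(i) for i in index)

def sensitivity_control(shape, winner, gain=0.5):
    """Top-down stage (assumed form): boost sensitivity at the selected location."""
    gains = np.ones(shape)
    gains[winner] += gain
    return gains

def attend(image, steps=3):
    """Run the loop: filter, select, store in working memory, adjust sensitivity."""
    gains = np.ones(image.shape)
    working_memory = []
    for _ in range(steps):
        # Neural representation = bottom-up salience modulated by top-down gains.
        representation = gains * salience_filter(image)
        winner = competitive_selection(representation)
        working_memory.append(winner)
        # The result of selection feeds back into sensitivity control.
        gains = sensitivity_control(image.shape, winner, gain=0.5)
    return working_memory

if __name__ == "__main__":
    image = np.zeros((5, 5))
    image[2, 3] = 1.0  # a single bright item on a dark background
    print(attend(image))
```

In this sketch the top-down signal only reinforces the current winner. A model with inhibition of return would instead lower the gain at previously selected locations, which is one of the design choices on which the models reviewed here differ.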

It would seem that there is value in providing an additional perspective, namely, one that is orthogonal to the neural correlates of function and that is independent of model and modeling strategy. This alternate functional decomposition is presented in Visual Attention and covers the breadth of visual attention from information reduction, to representations, to control, to external manifestations of attentional behavior. It is fair to say that a complete model should account for each; it is also fair to say that no model yet comes close. These functional elements are listed below. We invite modelers to annotate each with a brief description of how their model provides the functionality listed; those details are beyond the scope of this article. The main elements of attention are now given. They are detailed further in Visual Attention, where one can also see all the appropriate citations and biological evidence.

An important point here is that models of visual attention should be able to deal with each of these. It would be in the best interests of the readers of this article that each modeler provide some annotation through this article (maybe through the use of a SUB-PAGE) on how their model incorporates these attentional elements. It would form a major contribution to the comparison of models.

Evaluating a model

The above lists of elements are unlikely to be complete or to be the optimal partitioning of the problem, but they are representative of most current thinking. The effectiveness of any model, regardless of type as laid out in Section 2, is determined by how well it provides explanations for what is known about as many of the above functional elements as possible. As important, models must be falsifiable; that is, they must make testable predictions regarding new behaviors or functions not yet observed - behaviors that are not easily deduced from current knowledge, that are counterintuitive - that would enable one to support or reject the model. To test all the models on these criteria is beyond the scope of this article but is a necessary task for anyone wishing to answer the question "Which is the best model of visual attention?"

Nevertheless, several authors are making strong attempts at comparative evaluation using large databases of images and providing executable code that others can use. Primarily, these evaluations are for models that focus on representations of saliency that drive fixation models in the Saliency Map Hypothesis class. Itti's Neuromorphic Vision Toolkit was the first; more recently others, such as Bruce, Draper and Lionelle, and Zhang et al., show serious evaluations and provide public databases for others to use. We add that Draper & Lionelle (2003) laid out the first steps for a principled comparative evaluation. This is very positive, even though the statistical validity of databases and the relevant comparative dimensions remain issues needing more work.
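As an illustration of what such a comparison can involve, the sketch below scores a saliency map against recorded fixation locations with an area-under-the-ROC-curve measure. This is one commonly used kind of score and is offered only as an assumption about how an evaluation might be set up; it is not the specific procedure of any of the authors named above. The function name `fixation_auc` and the (row, column) fixation format are hypothetical.

```python
import numpy as np

def fixation_auc(saliency, fixations):
    """Area under the ROC curve for saliency values at fixated vs. other pixels.

    saliency  : 2-D array of saliency values (higher = more salient).
    fixations : iterable of (row, col) pixel coordinates of human fixations.
    Returns the probability that a fixated pixel outranks a non-fixated one,
    counting ties as one half.
    """
    saliency = np.asarray(saliency, dtype=float)
    fixated = np.zeros(saliency.shape, dtype=bool)
    for row, col in fixations:
        fixated[row, col] = True

    positives = saliency[fixated]    # values where observers looked
    negatives = saliency[~fixated]   # values everywhere else
    if positives.size == 0 or negatives.size == 0:
        raise ValueError("need at least one fixated and one non-fixated pixel")

    # Pairwise comparison of every positive with every negative.
    # Fine for small maps; a rank-based formula scales better.
    greater = (positives[:, None] > negatives[None, :]).sum()
    ties = (positives[:, None] == negatives[None, :]).sum()
    return float((greater + 0.5 * ties) / (positives.size * negatives.size))

if __name__ == "__main__":
    saliency_map = np.array([[0.1, 0.9],
                             [0.2, 0.3]])
    print(fixation_auc(saliency_map, [(0, 1)]))
```

A score of 0.5 corresponds to chance and 1.0 to a map that ranks every fixated pixel above every non-fixated one. The choice of which pixels count as negatives is itself one of the comparative dimensions that, as noted above, still needs more work.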

Acknowledgement

We thank Mazyar Fallah, Heather Jordan, Fred Hamker and an anonymous reviewer for their comments on earlier drafts.

References

Ahmad, S. (1992). VISIT: a neural model of covert visual attention, in Advances in Neural Information Processing Systems, edited by J.E. Moody, et al., 4:420-427, San Mateo, CA: Morgan Kaufmann.
Anderson, C. and D. Van Essen (1987). Shifter Circuits: a computational strategy for dynamic aspects of visual processing. Proc. Natl. Academy Sci. USA 84, p6297-6301.
Bajcsy, R. (1985). Active perception vs passive perception. In Proc. IEEE Workshop on Computer Vision: Representation and Control, Oct., Bellaire, Mich., p55–62.
Ballard, D. (1991). Animate vision. Artificial Intelligence, 48, p57–86.
Broadbent, D. (1958). Perception and communication, Pergamon Press, NY.
Bruce, N.D.B., Tsotsos, J.K. (2009). Saliency, Attention, and Visual Search: An Information Theoretic Approach, Journal of Vision 9:3, p1-24.
Bruce, N.D.B., Tsotsos, J.K. (2005). Saliency Based on Information Maximization, Proc. NIPS 2005, Vancouver, BC.
Buia, C., Tiesinga, P. (2006). Attentional modulation of firing rate and synchrony in a model cortical network, Journal of Computational Neuroscience 20(3), p1573-6873.
Bundesen, C. (1990). A theory of visual attention, Psychological Review, 97, p523-547.
Burt, P. (1988). Attention mechanism for vision in a dynamic world, Proc. 9th Int. Conf. on Pattern Recognition, p977–987.
Carpenter, G.A., Grossberg, S., Lesher, G.W. (1998). The what-and-where filter: A spatial mapping neural network for object recognition and image understanding, Computer Vision and Image Understanding, 69, p1-22.
Cave, K. (1999). The FeatureGate model of visual selection, Psychological Res. 62, p182-194.
Clark, J.J., Ferrier, N. (1988). Modal control of an attentive vision system. Proc. ICCV, Tarpon Springs Florida, p514–523.
Corchs, S., Deco, G. (2001). A neurodynamical model for selective visual attention using oscillators, Neural Networks 14, p981-990.
Crick, F., Koch, C. (1990). Some reflections on visual awareness. Cold Spring Harbor Symp. Quant. Biol. 55, p953–962.
Deco, G., Zihl, J. (2001). A neurodynamical model of visual attention: feedback enhancement of spatial resolution in a hierarchical system, J Comput Neurosci 10(3), p231-53.
Deco, G., Pollatos, O., Zihl, J. (2002). The time course of selective visual attention: theory and experiments, Vision Research 42, p2925–2945
Desimone, R., Duncan, J. (1995). Neural mechanisms of selective visual attention, Ann. Rev. of Neuroscience 18, p193-222.
Deutsch, J., Deutsch, D. (1963). Attention: Some theoretical considerations, Psych. Review 70, p80-90.
Draper, B., Lionelle, A. (2005). Evaluation of Selective Attention under Similarity Transforms, Computer Vision and Image Understanding 100(1-2), p152-171.
Duncan, J. (1979). Divided attention: the whole is more than the sum of its parts, J Exp Psychol Hum Percept Perform 5(2), p216-28.
Feldman, J. & Ballard, D. (1982). Connectionist models and their properties, Cognitive Science 6, p205 - 254.
Felleman, D., Van Essen, D., (1991). Distributed hierarchical processing in the primate visual cortex. Cerebral Cortex 1, p1–47.
Frintrop, S., Rome, E., Christensen, H.I. (2010). Computational Visual Attention Systems and their Cognitive Foundation: A Survey, ACM Transactions on Applied Perception (TAP), 7(1).
Frith, C. (2005). The Top in Top-Down Attention, in Neurobiology of Attention, ed. by Itti, Rees, Tsotsos, Elsevier Press, p105-108
Fukushima, K. (1986). A neural network model for selective attention in visual pattern recognition, Biological Cybernetics 55(1), p5 - 15.
Grossberg, S. (1980). Biological Competition: Decision Rules, pattern formation and oscillations, PNAS 77, p2338-2342.
Grossberg, S. (1982). A psychophysiological theory of reinforcement, drive, motivation, and attention. Journal of Theoretical Neurobiology, 1, p286-369.
Grossberg, S. (1975). A neural model of attention, reinforcement, and discrimination learning, International Review of Neurobiology 18, p263-327.
Hamker, F.H. (1999). The role of feedback connections in task-driven visual search. In: D. Heinke, G. W. Humphreys & A. Olson (eds.) Connectionist Models in Cognitive Neuroscience, Proc. of the 5th Neural Computation and Psychology Workshop (NCPW'98). London: Springer Verlag, 252-261.
Hamker, F.H. (2000). Distributed competition in directed attention, Proceedings in Artificial Intelligence, Vol. 9. Dynamische Perzeption,Workshop der GI-Fachgruppe 1.0.4 Bildverstehen. Hrsg. von G. Baratoff, H.Neumann. Berlin: AKA, Akademische Verlagsgesellschaft, p39-44.
Hamker, F.H. (2004). A dynamic model of how feature cues guide spatial attention, Vision Research 44, p 501-521.
Hamker, F. H. (2005) The emergence of attention by population-based inference and its role in distributed processing and cognitive control of vision, Computer Vision and Image Understanding, 100, p. 64-106.
Hamker, F. H., Zirnsak, M. (2006) V4 receptive field dynamics as predicted by a systems-level model of visual attention using feedback from the frontal eye field. Neural Networks. 19:1371-1382.
Hanson,A.R., Riseman, E.M., (1978). Computer Vision Systems, Academic Press.
Heinke, D., Humphreys, G.W. (1997). SAIM: A Model of Visual Attention and Neglect, 7th International Conference on Artificial Neural Networks, Lausanne, Switzerland, Springer Verlag.
Heinke, D., Humphreys, G.W., (2003). Attention, Spatial Representation, and Visual Neglect: Simulating Emergent Attention and Spatial Memory in the Selective Attention for Identification Model (SAIM). Psychological Review, 110(1), pp.29-87.
Hummel, J.E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition, Psychological Review 99, p480–517.
Humphreys, G., Müller, H., (1993). Search via Recursive Rejection (SERR): A Connectionist Model of Visual Search, Cognitive Psychology, 25, p45 - 110.
Itti, L., Baldi, P. (2006). Bayesian Surprise Attracts Human Attention. Advances in Neural Information Processing Systems 18, 547–554.
Itti, L. (2005), Models of Bottom-up Attention and Saliency, in Neurobiology of Attention, ed. by Itti, Rees and Tsotsos, p576-582.
Itti, L., Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention, Vision Res 40(10-12), p1489-506
Itti, L., C. Koch, et al. (1998). A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), p1254-1259.
Kazanovich, Y. B., Borisyuk, R. M. (1999). Dynamics of neural networks with a central element, Neural Networks, 12, p441-454.
Kelly, M. (1971). Edge detection in pictures by computer using planning, Machine Intell. 6, p397-409.
Koch, C. Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Hum. Neurobiology 4, p219–227.
Knudsen, E. (2007). Fundamental Components of Attention, Annu. Rev. Neurosci. 30, p57–78
Lanyon, L. J., Denham, S.L. (2004). A model of active visual search with object-based attention guiding scan paths, Neural Networks 17, 873–897.
Lee, K. W., H. Buxton, et al. (2003). Selective attention for cue-guided search using a spiking neural network. International Workshop on Attention and Performance in Computer Vision, Graz, Austria.
Lee, J., Maunsell, J.H. (2009). A Normalization Model of Attentional Modulation of Single Unit Responses, PLoS One 4: e4651.
Li, Z. (2002). A saliency map in primary visual cortex, Trends in Cognitive Sciences Vol. 6, No. 1, Jan. 2002, p9-16.
Li, Z. (2001). Computational design and nonlinear dynamics of a recurrent network model of the primary visual cortex, Neural Computation 13(8), p1749-1780.
Macmillan, N.A., Creelman, C.D. (2004). Detection Theory: A User's Guide, Routledge.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., New York
Milner, P. (1974). A model for visual shape recognition, Psych. Rev. 81, p521-535.
Moravec, H. (1981). Rover visual obstacle avoidance, IJCAI, Vancouver, BC, p785-790.
Moray, N. (1969). Attention: Selective Processes in Vision and Hearing, Hutchinson, London.
Mozer, M. C. (1991). The perception of multiple objects: a connectionist approach, Cambridge, Mass., MIT Press.
Muerle, J., Allen, D. (1968). Experimental Evaluation of Techniques for Automatic Segmentation of Objects in a Complex Scene, in G. Cheng et al., Eds. Pictorial Pattern Recognition, Thompson, Washington DC, p3 - 13
Navalpakkam, V., Itti, L. (2005). Modeling the influence of task on attention, Vision Res. 45(2), p205-31.
Niebur, E., Koch, C., Rosin, C. (1993). An oscillation-based model for the neural basis of attention, Vision Research 33, p2789-2802.
Niebur, E., Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons, J. Comput. Neuroscience 1(1), p141-158.
Norman, D. (1968). Toward a theory of memory and attention, Psych. Review 75, p522-536.
Nowlan, S. and Sejnowski, T. (1995). A selection model for motion processing in area MT of primates. The Journal of Neuroscience,15(2), p1195–1214.
Oliva, A., A. Torralba, et al. (2003). Top-Down control of visual attention in object detection. IEEE International Conference on Image Processing, Barcelona, Spain.
Olshausen, B. A., C. H. Anderson, et al. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J Neurosci 13(11), p4700-19.
Parodi, P., Lanciwicki, R., Vijh, A., Tsotsos J.K. (1998). Empirically-Derived Estimates of the Complexity of Labeling Line Drawings of Polyhedral Scenes, Artificial Intelligence 105, p47 - 75.
Parkhurst, D., Law, K., Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention, Vision Research 42, p107–123.
Pomerantz, J.R., Pristach, E.A. (1989). Emergent Features, Attention, and Perceptual Glue in Visual Form Perception, Journal of Experimental Psychology: Human Perception and Performance, 15(4), p635-649
Postma, E. O. et al. (1997). SCAN: A scalable model of attentional selection. Neural Networks 10(6), p993-1015.
Reynolds, J., Chelazzi, L., Desimone, R. (1999). Competitive Mechanisms Subserve Attention in Macaque Areas V2 and V4, J. Neurosci. 19 (5), p1736–1753.
Reynolds, J., Heeger, D. (2009). The Normalization Model of Attention, Neuron 61, p168 - 185.
Riesenhuber, M., Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, p1019–25.
Rodriguez-Sanchez, A.J., Simine, E., Tsotsos, J.K. (2007). Attention And Visual Search, Int. J. Neural Systems 17(4), p275-88.
Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, DC: Spartan Books.
Roskies, A. (1999). The binding problem - introduction. Neuron 24, p7–9.
Rothenstein, A.L., Rodriguez-Sanchez, A.J., Simine, E., Tsotsos, J.K. (2008). Visual Feature Binding within the Selective Tuning Attention Framework, Int. J. Pattern Recognition and Artificial Intelligence - Special Issue on Brain, Vision and Artificial Intelligence, p861-881.
Sandon, P. (1990). Simulating visual attention, J. Cognitive Neuroscience 2, p213-231.
Sejnowski, T., Paulsen, O. (2006). Network Oscillations: Emerging Computational Principles, The Journal of Neuroscience 26(6), p1673-1676.
Serre, T., Wolf, L. Bileschi, S., Riesenhuber, M., Poggio, T. (2007). Recognition with Cortex-like Mechanisms, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), p411-426.
Sharma, P. (2016). Modeling Bottom-Up Visual Attention Using Dihedral Group D4. Symmetry, 8(8):79, p1-14.
Shadlen, M., Movshon, A. (1999). Synchrony Unbound: A Critical Evaluation of the Temporal Binding Hypothesis, Neuron 24, p67–77.
Shipp, S. (2004). The brain circuitry of attention, Trends in Cognitive Sciences 8(5), p223-230.
Spratling, M.W. (2008). Predictive coding as a model of biased competition in visual attention. Vision Research 48(12), p1391-408.
Styles, E. (1997). The Psychology of Attention, Psychology Press.
Sun, Y., Fisher, R., Wang, F., Gomes, H. (2008). A computer vision model for visual-object-based attention and eye movements, Computer Vision and Image Understanding 112(2), p126-142.
Taylor, J.G., Rogers, M. (2002). A control model of the movement of attention. Neural Networks 15, p309-326
Treisman, A. (1964). The effect of irrelevant material on the efficiency of selective listening, American J. Psychology 77, p533-546.
Treisman, A., Gelade, G. (1980). A feature integration theory of attention, Cognitive Psychology 12, p97-136.
Treisman, A., Paterson, R. (1984). Emergent features, attention, and object perception. Journal of Experimental Psychology: Human Perception and Performance, 10, p12-31.
Treue, S., Martinez-Trujillo, J. (1999). Feature-Based Attention Influences Motion Processing Gain in Macaque Visual Cortex, Nature 399(6736), p575-9.
Tsotsos, J.K. (1987). A ‘complexity level’ analysis of vision, Proceedings of International Conference on Computer Vision, London, England.
Tsotsos, J.K. (1989). The complexity of perceptual search tasks, Proc. Int. J. Conf. Artif. Intell. Detroit p1571–1577.
Tsotsos, J.K. (1990). A Complexity Level Analysis of Vision, Behavioral and Brain Sciences 13, p423 – 445.
Tsotsos, J., Mylopoulos, J., Covvey, H., Zucker, S. (1980). A framework for visual motion understanding, IEEE Patt. Anal. Machine Intell. 2, p563-573.
Tsotsos, J.K. (1992). On the Relative Complexity of Passive vs Active Visual Search, International Journal of Computer Vision 7-2, p 127 - 141.
Tsotsos, J. K., S. M. Culhane, et al. (1995). Modeling Visual-Attention Via Selective Tuning. Artificial Intelligence 78(1-2), p507-545.
Tsotsos, J.K., Liu, Y., Martinez-Trujillo, J., Pomplun, M., Simine, E., Zhou, K. (2005). Attending to Visual Motion, Computer Vision and Image Understanding 100(1-2), p3 - 40.
Tsotsos, J.K., Rodriguez-Sanchez, A.J., Rothenstein, A.L., Simine, E. (2008). Different Binding Strategies for the Different Stages of Visual Recognition, Brain Research 1225, p119-132.
Tsotsos, J.K., Itti, L., Rees, G. (2005). A Brief and Selective History of Attention, in Neurobiology of Attention, Editors Itti, Rees & Tsotsos, Elsevier Press, 2005
Uhr, L. (1972). Layered `recognition cone' networks that preprocess, classify and describe, IEEE Transactions on Computers, p758-768.
Usher, M., Niebur, E. (1996). Modeling the temporal dynamic of IT neurons in visual search: A mechanism for top-down selective attention, J. Cognitive Neuroscience 8:4, p311-327.
van de Laar, P., Heskes, T., Gielen, S. (1997). Task-dependent learning of attention, Neural Networks 10, p981-992.
von der Malsburg, C. (1981). The correlation theory of brain function, Internal Rpt. 81-2, Dept. of Neurobiology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany.
Wolfe, J., Cave, K., Franzel, S. (1989). Guided search: An alternative to the feature integration model for visual search, J. Exp. Psychology: Human Perception and Performance 15, p419-433.
Ullman, S. (1984). Visual routines, Cognition 18, p97–159
Walther, D., L. Itti, et al. (2002). Attentional selection for object recognition - A gentle way. Biologically Motivated Computer Vision, Proceedings 2525, p472-479.
Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt II, Psychologische Forschung 4, p301-350.
Wiesmeyer, M., Laird, J. (1990). A Computer Model of 2D Visual Attention, Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, p582 - 589.
Wixson, L. (1994). Gaze selection for visual search. Rochester, N.Y., University of Rochester Dept. of Computer Science.
Wu, A., Guo, A. (1999). Selective visual attention in a neurocomputational model of phase oscillators, Biol. Cybern. 80, p205-214.
Zaharescu, A., Rothenstein, A.L., et al. (2004). Towards a Biologically Plausible Active Visual Search Model. in Attention and Performance in Computational Vision: Second International Workshop, WAPCV 2004, Revised Selected Papers, Lecture Notes in Computer Science Volume 3368 / 2005, Springer-Verlag Heidelberg, p133-147.
Zhang, L., Tong, M. H., Marks, T.K., Shan, H., & Cottrell, G.W. (2008). SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, p1–20.

Additional reading

Itti, L., Rees, G., Tsotsos, J.K. (Editors) (2005) Neurobiology of Attention, Elsevier Press.
Tsotsos, J.K., Itti, L., Rees, G. (2005). A Brief and Selective History of Attention, in Neurobiology of Attention, Editors Itti, Rees & Tsotsos, Elsevier Press, 2005
Itti, L., Koch, C. (2001), Computational modeling of visual attention, Nature Reviews Neuroscience 2, p 1-11.
Rothenstein, A.L., Tsotsos, J.K., (2008). Attention Links Sensing with Perception, Image & Vision Computing Journal, Special Issue on Cognitive Vision Systems (ed. H. Buxton), 26(1) p114-126.
Tsotsos, J.K. (2011). A Computational Perspective on Visual Attention, MIT Press, Cambridge MA.

Links to relevant Scholarpedia articles

Computational Models of Attention

Biology (general link to the category Vision)