# Visual salience


Curator: Laurent Itti

Visual salience (or visual saliency) is the distinct subjective perceptual quality which makes some items in the world stand out from their neighbors and immediately grab our attention.

## Definitions

Our attention is attracted to visually salient stimuli. It is important for complex biological systems to rapidly detect potential prey, predators, or mates in a cluttered visual world. However, simultaneously identifying any and all interesting targets in one's visual field has prohibitive computational complexity, making it a daunting task even for the most sophisticated biological brains (Tsotsos, 1991), let alone for any existing computer. One solution, adopted by primates and many other animals, is to restrict the complex object-recognition process to a small area or a few objects at any one time. The many objects or areas in the visual scene can then be processed one after the other. This serialization of visual scene analysis is operationalized through mechanisms of visual attention: A common (although somewhat inaccurate) metaphor for attention is that of a virtual spotlight, shifting to and highlighting different sub-regions of the visual world, so that one region at a time can be subjected to more detailed visual analysis (Treisman & Gelade, 1980; Crick, 1984; Weichselgartner & Sperling, 1987).

Visual attention may be a solution to the inability to fully process all locations in parallel. However, this solution produces a problem. If you are only going to process one region or object at a time, how do you select that target of attention? Visual salience helps your brain achieve reasonably efficient selection. Early stages of visual processing give rise to a distinct subjective perceptual quality which makes some stimuli stand out from among other items or locations. Our brain has evolved to rapidly compute salience in an automatic manner and in real-time over the entire visual field. Visual attention is then attracted towards salient visual locations.

The core of visual salience is a bottom-up, stimulus-driven signal that announces “this location is sufficiently different from its surroundings to be worthy of your attention”. This bottom-up deployment of attention towards salient locations can be strongly modulated or even sometimes overridden by top-down, user-driven factors (Desimone & Duncan, 1995; Itti & Koch, 2001). Thus, a lone red object in a green field will be salient and will attract attention in a bottom-up manner (see illustration below). In addition, if you are looking through a child’s toy bin for a red plastic dragon, amidst plastic objects of many vivid colors, no one color may be especially salient until your top-down desire to find the red object renders all red objects, whether dragons or not, more salient.

Visual salience is sometimes carelessly described as a physical property of a visual stimulus. It is important to remember that salience is the consequence of an interaction of a stimulus with other stimuli, as well as with a visual system (biological or artificial). As a straightforward example, consider that a color-blind person will have a dramatically different experience of visual salience than a person with normal color vision, even when both look at exactly the same physical scene (see, e.g., the first example image below). As a more controversial example, it may be that expertise changes the salience of some stimuli for some observers. Nevertheless, because visual salience arises from fairly low-level and stereotypical computations in the early stages of visual processing (details in the following section), the factors contributing to salience are generally quite comparable from one observer to the next, leading to similar experiences across a range of observers and of behavioral conditions.

## Experiencing visual salience

Below are simple examples of so-called search-array stimuli, which contain many items, one of which should appear to the reader as highly visually salient.

One item in the array strongly pops out and effortlessly and immediately attracts attention. Many studies have suggested that in simple displays like this, no scanning occurs: Attention is immediately drawn to the salient item, no matter how many other items (called distractors) are present in the display (Treisman & Gelade, 1980; Wolfe, 1998). This suggests that the image is processed in parallel (all at once) to determine salience at every location and to orient towards the most salient location.
In this display, the vertical bar is visually salient. Comparing this example to the previous one suggests that local visual properties of a given item do not determine how perceptually salient this item will be; rather, looking at a given item within its surrounding context is crucial. Compare, for example, the red bar in the top-left corner of this image to the salient bar in the image above: both bars are red, roughly horizontal, and they both have very similar local appearances. Yet the one in the top-left corner here has low salience and attention is much more strongly attracted to the more salient vertical bar, while the red bar in the above image is highly salient.
In this display, there is again one bar that is unique and different from all the other ones. However, by design and through judicious choice of distracting items, there is little salience to guide you towards the target bar (why that is will be discussed in the following section). The target is a so-called conjunction target: it is the only bar that is both red and vertical (Treisman & Gelade, 1980). Because salience does not help you direct attention towards potentially interesting items in the display, you find yourself scanning the image, seemingly at random, looking for something interesting.
Items in visual displays can be salient for many reasons; here is an example where a distinct pattern of motion is the only thing that distinguishes the salient dot from its neighboring distractor dots. Take a snapshot of any frame in this movie, and you will not be able to tell which one is the salient dot.
Perceptual salience is computed automatically, effortlessly, and in real-time. In natural environments, highly salient objects tend to automatically draw attention towards them. Designers have long relied on their own salience system to create objects, such as this emergency triangle, which would also appear highly salient to others in a wide range of viewing conditions.

While a large body of experiments has been mainly concerned with simple tasks, such as looking for an odd-man-out target embedded within an array of distractors, there is mounting evidence that perceptual salience is not a fixed quantity, but rather is strongly modulated in real-time by the task demands of the moment, by previous stimuli and responses, by motivation and reward, and by many other poorly understood factors. For example, it has recently become clear that your brain processes visual information quite differently when you are interested, say, in upwards moving things as compared to downward moving things (e.g., Treue & Martinez-Trujillo, 1999). Because perceptual salience arises from this top-down-modulated early visual processing, it is consequently also quite strongly affected by the task at hand. This top-down, task-based modulation of salience has been shown to have a number of behavioral consequences (e.g., Yeshurun & Carrasco, 1998; Navalpakkam & Itti, 2007).

## Neural and computational mechanisms

The basic principle behind computing salience is the detection of locations whose local visual attributes significantly differ from the surrounding image attributes, along some dimension or combination of dimensions. This significant difference could be in a number of simple visual feature dimensions which are believed to be represented in the early stages of cortical visual processing: color, edge orientation, luminance, or motion direction (Treisman & Gelade, 1980; Itti & Koch, 2001). Wolfe and Horowitz (2004) provide a very nice review of which elementary visual features may strongly contribute to visual salience and guide visual search.

### Simple computational framework

A simple framework to think about how salience may be computed in biological brains has been developed over the past three decades (Treisman & Gelade, 1980; Koch & Ullman, 1985; Wolfe, 1994; Niebur & Koch, 1996; Itti & Koch, 2001). According to the framework, incoming visual information is first analyzed by early visual neurons, which are sensitive to the various elementary visual features of the stimulus. This analysis, operated in parallel over the entire visual field and at multiple spatial and temporal scales, gives rise to a number of cortical feature maps, where each map represents the amount of a given visual feature at any location in the visual field. Within each of the feature maps, locations which significantly differ from their neighbors are highlighted, as further discussed below. Finally, all highlighted locations from all feature maps combine into a single saliency map which represents a pure salience signal that is independent of visual features (Koch & Ullman, 1985; Nothdurft, 2000). According to several models, the relative contributions of different feature maps to the final saliency map are dependent upon the current behavioral goals and subjective state of the observer (Wolfe, 1994; Navalpakkam & Itti, 2005). In the absence of any particular task, such as, for example, during casual viewing, attention is drawn towards the most salient locations in the saliency map, as detected, for example, via a winner-take-all mechanism (Didday, 1976; Koch & Ullman, 1985). This, in turn, triggers motor actions which direct the eyes and the head towards salient visual locations (Dominey & Arbib, 1992; Findlay & Walker, 1999). Note that a number of theories exist as to whether an explicit saliency map is necessary or not (Hamker 1999; Li, 2002; see Saliency Map for additional discussion).

Simple framework for computing salience (Itti et al., 1998).
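This processing pipeline can be sketched in a few lines of code. The following is a minimal, self-contained illustration rather than the actual Itti et al. (1998) implementation: a single-scale box-filter center-surround difference stands in for the multi-scale difference-of-Gaussians feature maps, a simplified map-weighting rule stands in for the normalization operator, and an argmax stands in for the winner-take-all network.

```python
import numpy as np

def center_surround(feature_map, surround_size=9):
    """Center-minus-surround contrast: each location's value minus a
    coarse local average (a box-filter stand-in for the multi-scale
    difference-of-Gaussians used in Itti et al., 1998)."""
    pad = surround_size // 2
    padded = np.pad(feature_map, pad, mode="edge")
    # Box-filter the surround using an integral image (2-D cumulative sum).
    ii = np.pad(np.cumsum(np.cumsum(padded, axis=0), axis=1),
                ((1, 0), (1, 0)))
    h, w = feature_map.shape
    s = surround_size
    surround = (ii[s:s + h, s:s + w] - ii[:h, s:s + w]
                - ii[s:s + h, :w] + ii[:h, :w]) / (s * s)
    return np.abs(feature_map - surround)

def normalize(conspicuity_map):
    """Simplified normalization: scale to [0, 1], then down-weight maps
    whose mean activity is close to their peak (many comparable peaks)."""
    m = conspicuity_map - conspicuity_map.min()
    if m.max() > 0:
        m = m / m.max()
    return m * (1.0 - m.mean()) ** 2

def saliency_map(feature_maps):
    """Average the normalized center-surround maps of all channels."""
    maps = [normalize(center_surround(f)) for f in feature_maps]
    return sum(maps) / len(maps)

# Toy scene: a lone "red" item on a uniform-luminance background.
red = np.zeros((32, 32)); red[16, 16] = 1.0  # color channel: one odd item
lum = np.ones((32, 32))                      # luminance channel: uniform
sal = saliency_map([red, lum])
# Winner-take-all: attention goes to the most salient location.
winner = tuple(int(i) for i in np.unravel_index(np.argmax(sal), sal.shape))
print(winner)  # (16, 16): attention is drawn to the lone red item
```

The uniform luminance channel contributes nothing (its center-surround map is flat), so the odd color item alone determines the winning location, as the framework predicts.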

### The essence of salience: competing for representation

The essence of salience lies in enhancing the neural and perceptual representation of locations whose local visual statistics significantly differ from the broadly surrounding image statistics in some behaviorally relevant manner. This basic principle is intuitively motivated as follows. Imagine a simple search array as depicted below, where one bar pops out because of its unique orientation. Now imagine examining a feature map which is tuned to stimulus intensity (luminance) contrast: because there are many white bars on a black background, early visual neurons sensitive to local intensity contrast will respond vigorously to each of the bars (distractors and target alike, since all have identical intensity). Based on the pattern of activity in this map, in which essentially every bar elicits a strong peak of activity, one would be hard pressed to pick one location as being clearly more interesting and worthy of attention than all the others. Hence, intuitively, one might want to apply some normalization operator $$N(.)$$ which would give a very low overall weight to this map's contribution to the final saliency map. The situation is quite different when examining a feature map where neurons are tuned to local vertically oriented edges. In this map, one location (where the single roughly vertical bar is) would strongly excite the neural feature detectors, while all other locations would elicit much weaker responses. Hence, one location clearly stands out and becomes an obvious target for attention. It would be desirable in this situation that the normalization operator $$N(.)$$ give a high weight to this map's contribution to the final saliency map (Itti et al., 1998; Itti & Koch, 2000; Itti & Koch, 2001).

Salience depends on context and on how unique a response is elicited by a given item in a display (Itti et al., 1998).

In summary, a feature map containing numerous responses of comparable amplitude might contribute little to visual salience, because it does not help orient attention towards a given interesting location in particular. In contrast, feature maps containing one or a few locations where feature responses are much stronger than anywhere else contribute strongly to perceptual salience, because they clearly single out one location as being significantly different from any other.
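This weighting principle can be demonstrated numerically. The sketch below uses one simplified form of $$N(.)$$, loosely following Itti & Koch (2001): each map is weighted by the squared difference between its global maximum and the mean of its other significant responses (the 0.1 significance threshold is an arbitrary illustrative choice).

```python
import numpy as np

def peak_weight(feature_map):
    """Weight a map by (M - mbar)^2: the squared difference between the
    global maximum M and the mean mbar of the other significant
    responses (a simplified form of the N(.) operator)."""
    m = feature_map / feature_map.max()   # scale so the global max M = 1
    flat = np.sort(m.ravel())[::-1]
    others = flat[1:][flat[1:] > 0.1]     # other significant responses
    mbar = others.mean() if others.size else 0.0
    return (1.0 - mbar) ** 2

# Intensity-contrast map: every bar in the array responds comparably.
many_peaks = np.zeros((8, 8)); many_peaks[::2, ::2] = 1.0
# Vertical-orientation map: only the odd vertical bar responds.
one_peak = np.zeros((8, 8)); one_peak[3, 3] = 1.0

print(peak_weight(many_peaks))  # 0.0: many comparable peaks, low weight
print(peak_weight(one_peak))    # 1.0: one dominant peak, high weight
```

The map with many equal peaks is silenced, while the map that singles out one location dominates the saliency map, exactly as the summary above describes.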

Related formulations of this basic principle have been expressed in slightly different terms, including defining salient locations as those which contain spatial outliers (Rosenholtz, 1999), which may be more informative in Shannon's sense (Bruce & Tsotsos, 2006), or, in a more general formulation, which may be more surprising in a Bayesian sense (Itti & Baldi, 2006).

### Neural and behavioral correlates

Interestingly, neural correlates of some mechanisms behaving like our putative normalization operator $$N(.)$$ have been extensively characterized in early visual cortex. Through so-called non-classical receptive field inhibition (e.g., Allman et al., 1985), neural responses to a central local pattern can be substantially modulated by the presence of other surrounding patterns nearby but outside the receptive field of the neuron of interest. When the surrounding items closely resemble the central one, the neuron responding to the central location is inhibited; but when they differ grossly (e.g., the central location contains a vertical bar, but surrounding locations contain many horizontal bars), no such inhibition occurs (Allman et al., 1985; Cannon & Fullenkamp, 1991; Sillito et al., 1995). Such long-range inhibition by like items, probably mediated by long-range horizontal connections in early visual cortex, is believed to be responsible for some content-based normalization of neural activity similar to the operator $$N(.)$$ depicted above. In the end, after combination across all visual features which contribute to salience, a location will be salient if it was locally unique in at least one of the elementary feature dimensions.
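This kind of iso-feature surround suppression can be caricatured in a few lines. The following is a purely illustrative toy model, not fitted to any data: the divisive form and the constants `base` and `k` are arbitrary choices.

```python
import numpy as np

def surround_suppressed_response(center_ori, surround_oris, base=1.0, k=0.8):
    """Toy non-classical surround effect: the response to a center
    stimulus is divisively suppressed in proportion to how similar the
    surrounding orientations are to it (base and k are arbitrary
    illustrative constants)."""
    # Orientation similarity in [-1, 1]; angles are in radians, and
    # orientation is periodic with period pi, hence the factor of 2.
    similarity = np.mean(np.cos(2 * (np.array(surround_oris) - center_ori)))
    similarity = max(float(similarity), 0.0)  # only like surrounds suppress
    return base / (1.0 + k * similarity)

# A vertical bar among vertical bars: strong iso-orientation suppression.
r_same = surround_suppressed_response(np.pi / 2, [np.pi / 2] * 8)
# A vertical bar among horizontal bars: no suppression, high response.
r_diff = surround_suppressed_response(np.pi / 2, [0.0] * 8)
print(r_same, r_diff)  # the odd-one-out elicits the larger response
```

The locally unique item survives suppression and thus stands out in the resulting map, mirroring the single-unit findings cited above.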

In addition to visual search behavior as described above, computation of salience with mechanisms as described in this article has been widely demonstrated to quite strongly predict where humans look while inspecting complex images or video clips (Parkhurst et al., 2002; Itti, 2005; Carmi & Itti, 2006).

### Top-down modulation by task demands

The basic mechanism just described can be strongly influenced top-down. Behaviorally, this is manifested by the ability to more efficiently guide search towards visual targets whose appearance is known in advance (Wolfe, 1994; note the weights applied to each feature map in Wolfe's figure below; also see Niebur & Koch, 1996), and to more efficiently ignore irrelevant distractors. At the single-unit level and in functional neuroimaging studies, top-down modulation of salience is apparent from so-called feature-based attention, whereby neurons encoding specific visual features of current behavioral interest to the animal show enhanced activity compared to baseline (Motter, 1994; Treue & Martinez-Trujillo, 1999; Saenz et al., 2002).

Guided Search model of J. Wolfe (1994). Note the weights applied to each feature map and determined top-down by task demands.

One question which has remained outstanding until recently is how exactly to set the weights so as to promote the salience of arbitrarily complex targets in complex backgrounds. Computationally, Navalpakkam & Itti (2007) have proposed that the gain of different neuronal populations sensitive to different features (e.g., different edge orientations) will be relatively modulated depending on the features of the desired targets and of the distractors, so as to maximize the overall signal-to-noise ratio at the level of the entire population. In simple cases, where targets and distractors are differentiated in very obvious ways (e.g., a vertical target among horizontal distractors), this optimal biasing theory correlates well with intuition and single-unit observations (e.g., neurons responsive to vertical orientations should be boosted while neurons responsive to horizontal should be suppressed). In more complicated cases (e.g., with multimodal distributions of target and/or distractor features, or when significant overlap exists between the features of the targets and of the distractors), the theory generalizes these results in a manner which has been shown to be compatible with human behavior (Navalpakkam & Itti, 2007).
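A minimal numerical sketch of this gain-setting idea follows. The response values are made up for illustration, and setting each gain proportional to the target/distractor response ratio is a simplification of the full signal-to-noise-ratio-maximizing derivation in Navalpakkam & Itti (2007).

```python
import numpy as np

# Hypothetical mean responses of four orientation-tuned populations
# (0, 45, 90, 135 degrees) to the target and to the distractors.
target     = np.array([0.1, 0.2, 1.0, 0.2])  # vertical target drives 90-deg units
distractor = np.array([1.0, 0.2, 0.1, 0.2])  # horizontal distractors drive 0-deg units

def optimal_gains(t, d):
    """Set each population's gain proportional to its target/distractor
    response ratio (a simplification of the SNR-maximizing rule of
    Navalpakkam & Itti, 2007), normalized so total gain is conserved."""
    g = t / d
    return g * len(g) / g.sum()

g = optimal_gains(target, distractor)
print(g)  # 90-deg units boosted (g > 1), 0-deg units suppressed (g < 1)
```

In this obvious vertical-among-horizontal case the rule matches intuition: boost the populations that prefer the target and suppress those that prefer the distractors. The interesting predictions of the full theory arise in the harder cases with overlapping or multimodal feature distributions.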

## Beyond biology: technological applications

Computing visual salience has become a topic of recent technological interest. Indeed, until recently, most computer vision algorithms have relied on brute-force, systematic scanning of images from left to right and top to bottom when attempting to locate objects of interest. Visual salience provides a relatively inexpensive and rapid mechanism to select a few likely candidates and eliminate obvious clutter (Itti & Koch, 2000; Navalpakkam & Itti, 2005).

Applications of computational models of visual salience include, among many others:

• Automatic target detection (e.g., finding traffic signs along the road or military vehicles in a savanna; Itti & Koch, 2000);
• Robotics (using salient objects in the environment as navigation landmarks; Frintrop et al., 2006; Siagian & Itti, 2007);
• Image and video compression (e.g., giving higher quality to salient objects at the expense of degrading background clutter; Maeder et al., 1996; Itti, 2004);
• Automatic cropping/centering of images for display on small portable screens (Le Meur et al., 2006);
• Finding tumors in mammograms (Hong & Brady, 2003);
• And many more.
Example where a salience model immediately located a vehicle as being the most salient object in a complex scene (Itti & Koch, 2000).

## References

J. Allman, F. Miezin, & E. McGuinness (1985). Stimulus specific responses from beyond the classical receptive field: neurophysiological mechanisms for local-global comparisons in visual neurons. Annual Review of Neuroscience 8:407-430.

N. Bruce & J. K. Tsotsos (2006). Saliency Based on Information Maximization. In: Advances in Neural Information Processing Systems, 18:155-162.

M. W. Cannon & S. C. Fullenkamp (1991). Spatial interactions in apparent contrast: inhibitory effects among grating patterns of different spatial frequencies, spatial positions and orientations. Vision Research 31:1985-1998.

R. Carmi & L. Itti (2006). Visual Causes versus Correlates of Attentional Selection in Dynamic Scenes. Vision Research 46(26):4333-4345.

F. Crick (1984). Function of the thalamic reticular complex: the searchlight hypothesis. Proceedings of the National Academy of Sciences USA 81(14):4586-90.

R. Desimone & J. Duncan (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience 18:193-222.

R. L. Didday (1976). A model of visuomotor mechanisms in the frog optic tectum. Mathematical Biosciences 30:169-180.

P. F. Dominey & M. A. Arbib (1992). A cortico-subcortical model for generation of spatially accurate sequential saccades. Cerebral Cortex 2(2):153-175.

J. M. Findlay & R. Walker (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences 22:661-674.

S. Frintrop, P. Jensfelt, & H. Christensen (2006). Attentional Landmark Selection for Visual SLAM. In: Proc. IEEE International Conference on Intelligent Robots and Systems (IROS'06).

F.H. Hamker (1999). The role of feedback connections in task-driven visual search, in: D. Heinke, G.W. Humphreys, A. Olson (Eds.), Connectionist Models in Cognitive Neuroscience. Springer Verlag. London, pp. 252-261.

B.-W. Hong & M. Brady (2003). A Topographic Representation for Mammogram Segmentation. In: Lecture Notes in Computer Science 2879:730-737.

L. Itti, C. Koch, & E. Niebur (1998). A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11):1254-1259.

L. Itti & C. Koch (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 40(10-12):1489-1506.

L. Itti & C. Koch (2001). Computational Modeling of Visual Attention. Nature Reviews Neuroscience 2(3):194-203.

L. Itti (2004). Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention. IEEE Transactions on Image Processing 13(10):1304-1318.

L. Itti (2005). Quantifying the Contribution of Low-Level Saliency to Human Eye Movements in Dynamic Scenes. Visual Cognition 12(6):1093-1123.

L. Itti & P. Baldi (2006). Bayesian Surprise Attracts Human Attention. In: Advances in Neural Information Processing Systems, Vol. 19 (NIPS*2005), Cambridge, MA:MIT Press.

C. Koch & S. Ullman (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology 4:219-227.

Z. Li (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences 6(1):9-16.

A. Maeder, J. Diederich, & E. Niebur (1996). Limiting human perception for image sequences, in: Proceedings of the SPIE, Human Vision and Electronic Imaging, vol. 2657, pp. 330-337.

O. Le Meur, P. Le Callet, D. Barba, & D. Thoreau (2006). A coherent computational approach to model bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(5):802-817.

B. C. Motter (1994), Neural correlates of attentive selection for color or luminance in extrastriate area V4. Journal of Neuroscience 14(4):2178–2189.

V. Navalpakkam & L. Itti (2005). Modeling the influence of task on attention, Vision Research 45(2):205-231.

V. Navalpakkam & L. Itti (2007). Search goal tunes visual features optimally, Neuron 53(4):605-617.

E. Niebur & C. Koch (1996). Control of Selective Visual Attention: Modeling the 'Where' Pathway. Neural Information Processing Systems 8:802-808.

H. Nothdurft (2000). Salience from feature contrast: additivity across dimensions. Vision Research 40:1183-1201.

D. Parkhurst, K. Law, & E. Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research 42(1):107-123.

R. Rosenholtz (1999). A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39:3157-3163.

M. Saenz, G. T. Buracas, & G. M. Boynton (2002). Global effects of feature-based attention in human visual cortex. Nature Neuroscience 5(7):631-632.

C. Siagian & L. Itti (2007). Biologically-Inspired Robotics Vision Monte-Carlo Localization in the Outdoor Environment, In: Proc. IEEE International Conference on Intelligent Robots and Systems (IROS'07).

A. M. Sillito, K. L. Grieve, H. E. Jones, J. Cudeiro, & J. Davis (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature 378:492-496.

A. Treisman & G. Gelade (1980). A feature integration theory of attention. Cognitive Psychology 12:97-136.

S. Treue & J. C. Martinez-Trujillo (1999). Feature-based attention influences motion processing gain in macaque visual cortex. Nature 399:575-579.

J. K. Tsotsos (1991). Is Complexity Theory appropriate for analysing biological systems? Behavioral and Brain Sciences 14(4):770-773.

E. Weichselgartner & G. Sperling (1987). Dynamics of automatic and controlled visual attention. Science 238:778-780.

J. M. Wolfe (1994). Guided Search 2.0: A Revised Model of Visual Search. Psychonomic Bulletin & Review 1(2):202-238.

J. M. Wolfe (1998). Visual Search. In: Pashler H., editor. Attention. London UK: University College London Press.

J. M. Wolfe & T. S. Horowitz (2004). What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience 5:1-7.

Y. Yeshurun & M. Carrasco (1998). Attention improves or impairs visual performance by enhancing spatial resolution. Nature 396:72-75.
