# Biological object recognition

Curator: Gabriel Kreiman

Modern computers can perform many apparently complex tasks much faster, more efficiently and more precisely than humans. A trivial example of this would be evaluating the square root of 10. However, in many other problems, such as tasks involving pattern recognition, a three-year-old can outperform the most sophisticated algorithms available today. Some of these pattern recognition tasks constitute part of the basis for robust intelligence, that is, the ability to learn, to adapt and to extrapolate. The interpretation of sensory information and its transformation into behaviorally meaningful signals is crucial in everyday life. Therefore, it is perhaps not too surprising that the human brain (and the mammalian brain in general) has achieved, through millions of years of evolution, a remarkable ability to recognize visual patterns in a robust, selective and fast manner. It is likely that, upon understanding how the neuronal circuitries can achieve these remarkable properties, it will be possible to translate the biological circuits into algorithms for machine vision and pattern recognition.

This review focuses on visual object recognition because this is one of the most studied problems in pattern recognition. The fact that it has been extensively studied does not imply that visual object recognition in biological systems is thoroughly understood. On the contrary, much remains to be learned as emphasized throughout this review. Many of the essential principles and concepts, as well as the challenges ahead, are very similar in other pattern recognition problems such as biological motion recognition, scene recognition, auditory scene analysis, olfactory discrimination, etc. The main goal in this overview is to highlight some of the essential principles and concepts that underlie the computational studies of biological object recognition. The current overview will summarize many studies that use different techniques ranging from psychophysics to computational modeling, electrophysiological recordings and functional imaging. It is not possible to exhaustively summarize all the literature in the field. Thus, the aficionados will easily note many missing references. I hope that many of the reviews cited here will point to the missing references.

Figure 1: (A) Transformations of Van Gogh's Sunflowers including changes in noise, rotation, blurring, etc. In spite of the large changes at the pixel level, it is quite straightforward to identify all of these images. (B) Simple line drawings of common objects. In spite of the simplicity of the diagrams and the dissimilarity with the real objects, identification is also straightforward.

## The visual system is very selective, robust and fast

The human visual system is quite powerful. It has an exquisite selectivity that allows us to distinguish among very similar objects, such as the faces of identical twins. Some estimates indicate that the human visual system can discriminate among at least tens of thousands of different object categories [1]. It would be relatively easy to build a computer system that can be extremely selective by simply memorizing all the pixels in several training images. However, such a system would lack any power to generalize. A famous piece of fiction describes the misadventures of Funes, the memorious, a character with a prodigious memory and with no ability to generalize [2]. Such a character would not be able to recognize a face after a minute change such as a change in expression or rotation; even slightly transformed objects would represent different novel objects to him.

One of the remarkable features of the visual system is that, in addition to precise selectivity, it also shows robustness to transformations of those images. An object remains the same object (at the perceptual level) after changes in its position, scale, rotation, illumination, color, occlusion and many other properties (Figure 1A). Perhaps even more striking is our ability to recognize objects from line drawings and caricatures (Figure 1B). Given the drastic changes in the images at the pixel level, these recognition problems are extremely difficult for computer vision algorithms. Yet, recognizing the objects in Figure 1A or Figure 1B is rather trivial even for children. It would be easy to build a computer system that could display strong invariance at the expense of selectivity. Achieving both selectivity and invariance at once, despite the trade-off between them, constitutes one of the most astounding accomplishments of the primate visual recognition machinery and also one of the key challenges for computer vision [3].

Figure 2: Schematic examples of alternative extreme representations. (A) A representation that achieves strong robustness (by responding in the same way to all variations of object 1) but lacks selectivity because it fails to differentiate among distinct objects. (B) A representation that achieves high selectivity but cannot generalize (this is a so-called Funes representation in honor of Funes, the memorious, a fictitious character with prodigious memory but with no capability of intelligent learning [2]). (C) The representation in the human brain allows the visual system to achieve high selectivity (right) while at the same time maintaining strong robustness to transformations of the images (left).

Another remarkable aspect of our visual system is that recognition can be very fast. We can recognize objects in about 100 to 200 ms [4-6]. Considering that there are at least 10 synapses from the photoreceptors in the retina to some of the areas involved in object recognition such as inferior temporal cortex (see Anatomy of the primate visual system below), this leaves only about 10 to 20 ms per synapse. This sets a strong constraint on how many computational processing steps can take place in the brain before recognition and how long each step can take.
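The timing argument above is simple arithmetic, and can be written out as a short sketch. The function name `per_synapse_budget_ms` is hypothetical, and the calculation assumes that the total recognition time is divided evenly across the synaptic stages, which is an idealization rather than a claim about actual conduction and integration delays.

```python
def per_synapse_budget_ms(recognition_ms, n_synapses):
    """Average time available per synaptic stage, in milliseconds.

    Assumes (as an idealization) that the total time is split evenly.
    """
    return recognition_ms / n_synapses


# Values quoted in the text: recognition in about 100 to 200 ms,
# and at least 10 synapses from the photoreceptors to IT cortex.
fast_budget = per_synapse_budget_ms(100.0, 10)
slow_budget = per_synapse_budget_ms(200.0, 10)
print(f"{fast_budget:.0f} to {slow_budget:.0f} ms per synapse")
```

Because 10 is quoted as a lower bound on the number of synapses, the per-synapse budget computed this way is an upper bound.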

## Anatomy of the primate visual system

A detailed knowledge of the architecture of the neuronal circuits involved in visual recognition is very important towards building biophysically-plausible models of biological pattern recognition. Most of our detailed knowledge about the neuroanatomy of the visual system comes from studies in the non-human primate brain. Several indications suggest that there is a strong homology between the non-human primate visual system and the human visual system. The invasive nature of current tools to study the brain at high spatial resolution implies that it is difficult to achieve the same level of detail in the description of the circuitry in the human brain. One of the most thorough overviews of the architecture of the visual system is the work of Felleman and Van Essen [7].

Figure 3: Schematic representation of the anatomy of the visual object recognition areas in the primate brain (see also Fig. 4A). This is a highly schematic representation of information processing stages along the ventral visual stream (for more details about the anatomy of the visual system, see main text and [7]). In contrast to the other areas represented in this diagram, the frontal cortex and the medial temporal lobe are not exclusively visual areas.

When a photon impinges on the retina, the light signal is transduced into an electrical signal. The retina itself is derived from the central nervous system and includes a complex circuitry to process the incoming information into a signal that is conveyed to other brain areas by the retinal ganglion cells [8-11]. The main pathway emerging from the retina, in the context of object recognition, takes the information to the lateral geniculate nucleus in the thalamus. After processing the information, the thalamus sends the signals to primary visual cortex (V1) [12-15].

V1 (and neocortex in general) is composed of six different layers. Some of these layers may in turn have separate sublayers [7, 16, 17]. Physiological studies (see Functional studies of the ventral visual stream below) have shown that groups of neurons perpendicular to the cortical surface typically share similar properties [18]. This columnar organization seems to be prevalent throughout neocortex and is absent or less clear in other brain areas. In addition to receiving feed-forward input from the thalamus, V1 receives feedback signals from higher cortical areas. There are also extensive horizontal connections along the V1 cortex [7, 17, 19, 20].

Two main pathways of information processing emerge from V1. These are usually referred to as the dorsal where / action pathway and the ventral what / object pathway [7, 21, 22]. The dorsal pathway is particularly involved in the spatial localization of objects within their environments and in guiding action towards those objects [23]. The ventral pathway is particularly involved in the recognition of those objects. Although these are parallel pathways, there are multiple connections that bridge across the two systems, which are therefore not independent.

V1 sends signals to the ventral pathway through visual areas V2 and V4. V1 also projects back to the thalamus. In fact, all the areas studied so far in cortex send back-projections to the areas that conveyed the feed-forward input (see also Backprojections below). The circuitry is rather stereotypical in that specific layers and sublayers display different functions throughout cortex [7, 17, 19, 24]. Information flows from V1 to V2 and V4 and from V4 to the inferior temporal (IT) cortex. IT cortex is in turn usually divided into anterior and posterior areas (usually labeled AIT and PIT respectively). There are also by-pass routes that can skip one or more processing steps and convey the information directly (and presumably more rapidly), e.g., from V2 to IT. IT represents the last exclusively visual area. IT projects to structures in the medial temporal lobe (MTL) involved in memory consolidation [25, 26] and also to pre-frontal cortex (PFC) areas [27].

In addition to the feed-forward information processing steps described in the previous paragraph, there are strong backprojections at essentially all stages of the visual cortex (with the exception of the retino-thalamic synapse). Thus, each area projects back (through different layers than the ones involved in processing the incoming input) to its input area. There are also long-range back-projections. This is a highly schematic description of the neuroanatomy of the visual system (see [7, 16, 19, 28-37] for further information and references).

## Lesion and neurological studies

Converging evidence from multiple studies and different techniques suggests that IT plays a fundamental role in the recognition of complex objects. An important piece of evidence comes from studying the effects of lesions. Lesions in IT in macaque monkeys produce severe deficits in object recognition [38-40]. In humans, there are no studies that show clear lesions that render subjects with sight but incapable of recognizing objects in general. A condition called prosopagnosia describes patients who have normal sight but cannot recognize faces from visual stimuli [41, 42]. Importantly, these subjects can recognize multiple other object types and they can also recognize people from other traits (including voice, gait, etc.). Prosopagnosic patients show lesions in the temporal lobe. There is also evidence of human subjects who show deficits for other categories beyond faces (e.g. [43-45]); whether these other deficits are restricted to the visual domain remains less clear.

## Functional studies of the ventral visual stream

By inserting a thin microelectrode into the brain, it is possible to monitor the spiking electrical activity of single neurons. This technique constitutes the basis for approximately four decades of studies on the responses of neurons in different parts of cortex to the presentation of visual stimuli. Ascending through the visual hierarchy, neurons show longer latencies to visual presentation, larger receptive field sizes and more complex feature preferences [29, 46, 47].

The pioneering studies of Hubel and Wiesel showed that (i) individual neurons in V1 have a location within the visual field that elicits a maximum response (called a receptive field), (ii) this receptive field changes smoothly over space forming a retinotopic map of the visual environment and (iii) individual V1 neurons are particularly responsive to the presentation of a bar of a specific orientation within their receptive field [20]. Hubel and Wiesel proposed a simple model that could explain the responses of such orientation-tuned cells: this response pattern could arise by combining the responses of on-center lateral geniculate nucleus (LGN) cells that have adjacent and overlapping receptive fields and are aligned to the orientation of the V1 neuron preferences. V1 is by far the most studied part of visual cortex. Still, Hubel and Wiesel’s model is neither fully accepted nor disproved and several authors have claimed that we do not yet fully understand the responses of V1 neurons [48]. Yet, the simple model of Hubel and Wiesel has inspired many computational models of visual cortex. For the purposes of the computational efforts discussed below, many authors model the responses of V1 simple neurons by using an oriented Gabor filter. It is beyond the scope of this article to discuss the multiple and more sophisticated models of V1 responses (see e.g. [49-52] among many others). Also, this article does not discuss the very important temporal aspects of the responses of V1 neurons and their motion direction selectivity or color preferences.
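As a rough illustration of the oriented Gabor filter mentioned above, the following minimal sketch builds a two-dimensional Gabor kernel (a cosine carrier under a Gaussian envelope) and compares its response to a matched and to an orthogonal grating. All names (`gabor_kernel`, `filter_response`, `grating`) and parameter values are hypothetical choices for illustration, not the parameters used in any of the cited models.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, phase=0.0, aspect=1.0):
    """Return a size x size Gabor kernel as a list of rows.

    theta is the direction (in radians) along which the carrier oscillates,
    i.e. the normal to the preferred bar orientation. wavelength and sigma
    are in pixels. All parameter values are illustrative assumptions.
    """
    half = size // 2
    kernel = []
    for row in range(size):
        y = row - half
        values = []
        for col in range(size):
            x = col - half
            # Rotate pixel coordinates into the filter's own frame.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (aspect * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + phase)
            values.append(envelope * carrier)
        kernel.append(values)
    return kernel


def grating(size, wavelength, theta):
    """Cosine grating whose luminance varies along direction theta."""
    half = size // 2
    return [
        [
            math.cos(
                2 * math.pi
                * ((c - half) * math.cos(theta) + (r - half) * math.sin(theta))
                / wavelength
            )
            for c in range(size)
        ]
        for r in range(size)
    ]


def filter_response(kernel, patch):
    """Linear response: dot product of the kernel with a same-sized patch."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))


kernel = gabor_kernel(size=11, wavelength=4.0, theta=0.0, sigma=2.0)
matched = filter_response(kernel, grating(11, 4.0, 0.0))
orthogonal = filter_response(kernel, grating(11, 4.0, math.pi / 2))
print("matched:", matched, "orthogonal:", orthogonal)
```

The unit responds much more strongly to the grating aligned with its carrier than to the orthogonal one, which is the orientation tuning that the Gabor description is meant to capture. This linear sketch ignores the temporal, motion and color properties that the text explicitly sets aside.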

Compared to V1, much less work has been done to characterize and model the responses of neurons in V2, V4 and higher visual areas. Extending the ideas about how orientation selectivity may arise from LGN responses, several investigators have suggested that neurons in V2 are sensitive to angles (in its simplest form, two intersecting oriented bars) [53, 54]. V2 neurons also respond to illusory borders [55]. At the level of V4, there are neurons that seem to prefer much more complex shapes such as spirals and contour patterns [56-59].

Electrophysiological recordings in IT cortex have revealed single neurons that respond selectively to complex objects including faces as well as other stimuli [47, 60-63]. One of the remarkable aspects of the IT responses is that they show high selectivity while at the same time maintaining robustness to many stimulus transformations. In particular, IT neuronal responses show invariance to scale and position changes [6, 61, 64-66], robustness to eye movements [67], invariance to the type of cue defining the shape [68], rotation [66] and other transformations. Therefore, IT is ideally positioned to resolve many of the fundamental challenges in visual object recognition discussed in the Introduction.

Little is known about the activity of individual neurons in the human brain [69]. Single unit recordings in human epileptic patients have revealed that neurons in the medial temporal lobe also show a remarkable degree of selectivity and invariance to object transformations [69-72]. It remains unclear whether these responses are necessary for visual object recognition or instead constitute an important step in transforming explicit representations into visual memories.

In spite of the extensive work of several decades of research into the responses of IT neurons, we still lack a clear principled understanding of the types of features preferred by IT neurons (the equivalent to orientation preferences in primary visual cortex). Several investigators have tried to start from the responses of an IT neuron to complex objects and gradually decompose the preferences into different object part preferences [73-76].

Two other pieces of evidence point to the key role of IT in object recognition. First, electrical stimulation of networks of neurons in IT cortex can bias a monkey’s performance in recognition tasks [77]. Second, functional imaging evidence from humans has revealed areas, presumably related to the macaque monkey inferior temporal cortex, that respond to the presentation of complex visual stimuli (see e.g. [78, 79]).

## Computational theories of object recognition

### Feedforward architectures

A quantitative, computational theory of object recognition can provide a framework for integrating existing data and for planning, coordinating and interpreting new experiments. This theory needs to be constrained and inspired by experimental findings. In contrast to multiple models that aim to explain the computations in any one part of the visual system or a particular visual phenomenon, this review focuses on integrative models of object recognition.

Many of the existing models of object recognition can be traced back to some of the original ideas developed by Hubel and Wiesel [80] and the influential work of Marr [81-82]. If the receptive fields and response preferences of a V1 neuron could be thought of as integration over specific sets of LGN neurons, then maybe the responses of V2 neurons can be explained by the appropriately weighted integration of V1 responses, V4 responses as arising from the integration of V2 responses and so on. Thus, several computational models show a hierarchical structure, perhaps even a stricter hierarchy than the one present in visual cortex.

One of the first such models was able to account for invariance to position changes in the image [83]. The hierarchical architecture of the neocognitron model has an input layer which conveys the pixels in the image in a retinotopic manner, much like the output of a digital camera [83]. Subsequent processing stages alternate between S and C layers which are inspired by the distinction between simple and complex cells in primary visual cortex [80]. Each unit throughout the network has a receptive field whose size is determined by the range of input cells. S-layer units extract specific localized features in the images; these features increase in complexity upon ascending the hierarchical network. The C-layer units provide the key invariance to position changes. This model succeeded in accounting for positional changes in the recognition of hand-written letters and digits with relatively high accuracy. A similar algorithm was applied to build a character recognition system [84].

The neocognitron model was followed by several other biologically-inspired recognition models including [3, 85-90]. Common to most of these computational efforts are the following principles: (i) A hierarchical architecture, (ii) Feed-forward architecture (in most cases), (iii) Increase in receptive field size and complexity in unit feature preferences along the hierarchy, (iv) Increase in invariance (at least to position and scale) along the hierarchy, (v) Learning and plasticity at multiple levels, (vi) Cascades of linear and non-linear processing steps.

Mel developed a similar system that incorporated the possibility of explicitly analyzing texture, color and contours [86]. Using real objects embedded in video sequences, the system could perform recognition in the presence of changes in position, scale and also rotation. The basic features are view-invariant and belong to five major groups that include Gabor-based features, colors, angles, blobs and contours.

Olshausen et al took a somewhat different perspective, following earlier psychological theories that posit that an object needs to be represented in a canonical format [89]. This canonical representation does not require much storage space and allows for a comparatively easier recognition problem. One of the main computational difficulties resides in converting objects to such a canonical representation. To achieve this, the authors use a dynamical routing circuit to transform the representation from a retina-based reference frame to an object-centered reference frame. A key and original component of this model is the existence of control neurons that are in charge of routing information from lower-level areas to higher-level areas.

Of note, the spatial relations among objects or object parts are not explicitly encoded in these computational models. In contrast, Biederman and colleagues have proposed a framework based on recognizing a specific basis set of components labeled “geons” [1, 91] consisting of generalized cones and cylinders. An object is defined by its components (the geons) and the configuration of those components relative to each other. Some authors have argued that it is not trivial to fit or uncover these geon-like shapes in natural scenes [92]. In a way, there seems to be a requirement for a separate object recognition mechanism to detect geons in the first place before being able to do computations with them. Still, there is an interesting ongoing debate about component-based versus view-based representations [1, 92-94].

### A theory of immediate object recognition

One of the most recent instantiations of such a class of hierarchical feed-forward models is described here in further detail [95], yet many of the observations below are likely to apply to many of the models cited above. This hierarchical and feed-forward model is represented schematically in Fig. 4. The model starts at the level of primary visual cortex (V1) by convolving the original image with a Gabor filter with a specific orientation, spatial frequency and position. Two main operations are prevalent throughout the architecture: a tuning operation and a non-linear invariance operation. Units in the S layers show a Gaussian-like tuning. The S comes from the simple cells in V1 following the work of Hubel and Wiesel [20, 83, 96]. Units in the C layers show the same tuning as their input S-layer units but achieve a higher degree of position and scale invariance by combining information from units with slightly different receptive fields or scale preferences through a soft-max operation. The C comes from the complex cells in V1. These two specific operations constitute a distinctive aspect of this particular feed-forward architecture.

Figure 4: (A) Schematic diagram illustrating the connectivity patterns in the macaque monkey visual cortex (left) adapted from [7]. (B) Architecture of a recent instantiation of a hierarchical feedforward model of object recognition discussed in the text. The colors in the model match specific areas on the anatomical connectivity diagram on the left [3, 95, 97, 98].

Ascending through the hierarchy, S units in subsequent layers show progressively larger receptive fields and also more complex feature preferences. The units that provide the input to a given S unit have the same receptive field. The model is static but could be modified to incorporate the dynamics of firing in visual cortex [95]. Tuning is perhaps one of the most ubiquitous and prevalent properties in cortex. Although we do not understand the feature preferences of neurons in higher visual areas, it is clear that neurons respond in a highly selective fashion and show very distinct firing patterns to different stimuli (see e.g. [74] among many others). The model in Fig. 4 assumes a particular shape for the tuning function (Gaussian tuning). Gaussian tuning has been observed and described, particularly in early visual areas. Although the particular shape and properties of the tuning function remain unclear in higher visual areas, some instances of Gaussian-like tuning have also been described [66]. Still, tuned functions that are only approximately Gaussian would also work [95] and perhaps even simpler tuning functions might be sufficient. From a computational viewpoint, Gaussian-like tuning profiles may play an important role in the generalization ability of cortex. Networks that combine the activity of several units tuned with a Gaussian profile to different training examples have proved to provide a powerful learning scheme [99, 100].

Let $$y$$ denote the response of a unit (simple or complex). The $$N$$ inputs to the unit are indexed by $$j = 1, \ldots, N$$. When presented with a pattern of activity $$x=(x_1, \ldots, x_N)$$ as input, an idealized description of a simple unit response is given by

$y = \exp\bigg[-\frac{1}{2\sigma^2} \sum_{j=1}^N (x_j - w_j)^2\bigg].$

This equation defines the tuning of the S unit around its preferred stimulus, given by $$w=(w_1,\ldots,w_N)$$, with the width of the tuning set by $$\sigma$$. The maximum activation for the S unit, $$y=1$$, corresponds to $$x=w$$.
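The Gaussian tuning equation above translates directly into a few lines of code. This is a minimal sketch of the idealized S unit only; the function name `s_unit_response` and the example values are hypothetical, and no claim is made about how the cited model is actually implemented.

```python
import math


def s_unit_response(x, w, sigma):
    """Idealized S unit: Gaussian tuning around the preferred stimulus w.

    x and w are equal-length sequences of input activities and weights;
    sigma sets the width of the tuning.
    """
    squared_distance = sum((xj - wj) ** 2 for xj, wj in zip(x, w))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


preferred = [0.2, 0.9, 0.4]          # hypothetical preferred stimulus w
at_preferred = s_unit_response(preferred, preferred, sigma=0.5)
nearby = s_unit_response([0.3, 0.8, 0.4], preferred, sigma=0.5)
far_away = s_unit_response([0.9, 0.1, 0.9], preferred, sigma=0.5)
print(at_preferred, nearby, far_away)
```

The response is maximal when the input equals the preferred stimulus and falls off smoothly with distance from it; a smaller `sigma` gives sharper selectivity and a larger one gives broader generalization.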

Ascending through the hierarchy, C units in subsequent layers show progressively larger receptive fields and also higher degrees of invariance to changes in the position and scale of their preferred input. The C units perform a max-like operation over their afferent inputs to gain invariance to object transformations. All the S units that project to a given C unit show the same stimulus preferences (e.g. the same orientation preference) but they have slightly shifted position or scale preferences. The non-linear combination of these inputs provides a certain amount of invariance to changes in position and scale in the image. In the simplest possible scenario, the output of a C unit is given by $$y=\max(x_j)$$ for $$j=1,\ldots,N$$.
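To make the invariance argument concrete, the following toy sketch pools, with a hard max, over S units that share one template but look at shifted positions of a one-dimensional signal. The one-dimensional setting, the template, and the names `s_unit` and `c_unit` are simplifying assumptions for illustration; the model discussed in the text operates on images and uses a soft-max rather than this hard max.

```python
import math


def s_unit(patch, template, sigma=1.0):
    """Gaussian-tuned S unit comparing a local patch with its template."""
    squared_distance = sum((p - t) ** 2 for p, t in zip(patch, template))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def c_unit(signal, template, positions, sigma=1.0):
    """C unit: max over S units with the same template at shifted positions."""
    n = len(template)
    return max(s_unit(signal[p:p + n], template, sigma) for p in positions)


template = [0.0, 1.0, 0.0]                        # hypothetical preferred feature
signal_a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # feature at position 0
signal_b = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]    # same feature at position 3
positions = range(0, 5)                           # afferent S-unit positions

# A single S unit fixed at position 0 is sensitive to the shift...
s_a = s_unit(signal_a[0:3], template)
s_b = s_unit(signal_b[0:3], template)
# ...whereas the C unit pooling over positions is not.
c_a = c_unit(signal_a, template, positions)
c_b = c_unit(signal_b, template, positions)
print("S unit:", s_a, s_b, "C unit:", c_a, c_b)
```

The C unit inherits the feature preference of its afferents while its response no longer depends on where, within the pooled range, the feature appears.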

The intercalation of S and C layers provides a gradual increase in both selectivity and invariance (as observed along the ventral stream). This gradual increase may prove to be critical in order to avoid a combinatorial explosion in the number of units, and the binding problem between features [101, 102]. Recent work has shown that the two basic operations described above can be implemented by a similar biophysically-plausible circuit based on feedforward and/or feedback shunting inhibition combined with normalization [95].

The connectivity shown in Fig. 4, although apparently very complex, is still rather simple compared to the myriad of different connections in visual cortex [7, 16, 17, 103]. The spirit, in this and other computational models, is not to reproduce every single connection in the brain but to attempt to extract the basic computational principles. This approach is rooted in the Physics tradition of building progressively more complex models that start with very simple ideas (even if they require assuming a point elephant with no friction).

The development of the circuitry in visual cortex involves a complex series of events including positioning of neurons in cortex, formation of the layered and columnar structures, formation of long-range connections giving rise to hypercolumnar topology and the specification of the precise connectivity weights. The development of such a complex circuit requires the interplay of genetically determined mechanisms with activity-dependent mechanisms [104-106]. In the context of the architecture of Fig. 4, Serre et al assumed that the synaptic weights for the model units are specified by learning which combinations of features appear most frequently in natural images [95, 97]. The importance of learning the statistics of natural images has been recently emphasized by many authors (see, for example [50, 107]). The wiring of the S layers relies on learning correlations of features in the image at the same time (i.e. inputs with different feature preferences but the same positional preferences). The wiring of the C layers may reflect learning to associate different variations of an image across time. Foldiak emphasized that invariance could be learnt by temporal association (e.g. when a face is rotated or when an object approaches or shifts its position) [90].

The model in Fig. 4 provides a multi-scale summary of many aspects of immediate visual recognition. The quantitative account of information processing in the feedforward path of the ventral stream proposes a model to explain a particular challenging aspect of human vision, namely the ability of the human brain to perform ultra-rapid (i.e. ca. 150 ms) complex discrimination tasks. The model can explain data at the level of psychophysics [108] and also physiological observations throughout the visual cortex [3, 6, 58, 95, 109].

## The limits of feed-forward processing and visual attention

### Limitations in feed-forward recognition

Feed-forward architectures provide powerful vision systems to solve challenging recognition problems. Although many aspects of immediate object recognition can be explained on the basis of feed-forward architectures, there is much more to vision than immediate recognition. More challenging tasks involving complex backgrounds, multiple objects and occlusion require more processing time (e.g. [110-112]). A recent study has quantified the limits of feed-forward architectures in complex recognition problems [98] and the problem has also been justified based on complexity theory arguments [113].

### Backprojections, attention and visual object recognition

The circuitry of the cortex involves a massive amount of backprojections that convey information from higher areas back to the lower areas. The anatomy has been more extensively studied in the visual system where it is clear that feedforward connections constitute only a small fraction of the total connectivity (see for example [7, 16, 17, 103]). In spite of the anatomical evidence, most computational efforts, such as the ones highlighted in the sections on feedforward architectures and immediate object recognition above, have focused on feedforward processing of information. Feedback using backprojections provides the opportunity to use previous knowledge, memory and task-dependent expectations.

Information conveyed by backprojections can be generally thought of in terms of modulating and constraining the interpretation of sensory information [114]. As emphasized by several authors [115-120], Bayesian inference offers a useful framework to understand the relative contributions of bottom-up and top-down processing mediated by the feedforward and backprojections. At any given stage in cortex, we are interested in finding the (unknown) activity $$z$$ that is compatible with the input and the context. Let us represent the input observations by $$x_{obs}$$. This could correspond to the sensory input from the retina or the input from earlier cortical areas. Let us represent the context by $$x_{context}$$. This context can take the form of information from memory, tasks, attention and expectations. The goal of the visual system is to infer the hidden variable or set of hypotheses $$z$$ (e.g., the category of the object stimulus presented) given the current observation $$x_{obs}$$ (e.g., the pattern of neural activity in IT) and the context $$x_{context}$$ (e.g. the pattern of neural activity in PFC). This requires estimating the conditional probability of all hypotheses given the observation $$x_{obs}$$ and the context $$x_{context}$$. This conditional probability can be estimated from:

$P(z|x_{obs},x_{context}) = \frac{P(x_{obs}|z,x_{context})P(z|x_{context})}{P(x_{obs}|x_{context})}$

The factor $$P(z|x_{context})$$ is where backprojections exert their influence (e.g., top-down influence from pre-frontal cortex to IT) and reflects prior constraints (e.g., knowledge about the task or other information pertinent to the current conditions). This could take the form of indicating that the probability of finding a lion inside an office is very small or it could represent the memory for specific features of a face being searched for in a crowd.
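The role of the context-dependent prior can be illustrated with a small numerical sketch of the Bayes rule above, using the lion-in-an-office example. The function name `posterior`, the two hypotheses and all probability values are invented for illustration; they are not estimates from any experiment.

```python
def posterior(likelihood, prior):
    """Compute P(z | x_obs, x_context) for every hypothesis z.

    likelihood[z] stands for P(x_obs | z, x_context) and prior[z] for
    P(z | x_context). The normalizer is P(x_obs | x_context).
    """
    unnormalized = {z: likelihood[z] * prior[z] for z in likelihood}
    evidence = sum(unnormalized.values())
    return {z: value / evidence for z, value in unnormalized.items()}


# Hypothetical bottom-up evidence: the image looks slightly more lion-like.
likelihood = {"lion": 0.6, "dog": 0.4}

# Hypothetical top-down priors supplied by two different contexts.
prior_office = {"lion": 0.001, "dog": 0.999}
prior_savanna = {"lion": 0.7, "dog": 0.3}

post_office = posterior(likelihood, prior_office)
post_savanna = posterior(likelihood, prior_savanna)
print("office:", post_office)
print("savanna:", post_savanna)
```

With identical sensory evidence, the inferred category changes with the context: the same ambiguous image is almost certainly a dog in the office and most likely a lion in the savanna. This is the sense in which the prior term lets top-down signals constrain the interpretation of bottom-up input.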

A particular focus of several models that include backprojections has been the deployment of attentional resources to specific locations or features [88, 112, 121-123]. This could well be accounted for within the above framework by making $$P(z|x_{context})$$ represent, for example, an attentional enhancement factor that biases neuronal competition towards a particular location in space (see [118, 120] for biologically plausible implementations). Indeed it has been suggested that neurons may act as probabilistic integrators of bottom-up and top-down signals and that attention may be used by the visual system for reducing uncertainty in complex scenes. Thus, this framework can include previous models of attention while at the same time providing flexibility for other roles of backprojections [114].

This approach can also be related to the reverse hierarchy routine [124] and to the shifter circuit of Olshausen and colleagues [89]. Briefly, a program running in pre-frontal cortex decides, depending on the initial feedforward categorization, the next question to ask in order to resolve ambiguity or improve accuracy. Typically, answering this question involves zooming in on a particular subregion of the image at the appropriate level and using appropriate units (for instance at the C1 level in the model presented in Fig. 4) and calling a specific classifier, out of a repertoire, to provide the answer. This framework involves a flavor of the 20 questions game and the use of reverse hierarchy routines which control access to lower level units.
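The control flow described above can be illustrated with a toy sketch: an initial feedforward guess is accepted if it is confident enough; otherwise a follow-up routine is selected, a subregion of the image is cropped, and a specific classifier from a repertoire is called. Every name here (`crop`, `recognize`, `routines`, the confidence threshold and the toy classifiers) is hypothetical and introduced only for illustration. This is not an implementation of the reverse hierarchy routine [124] or of the shifter circuit [89].

```python
def crop(image, region):
    """Return the sub-image given by region = (top, bottom, left, right)."""
    top, bottom, left, right = region
    return [row[left:right] for row in image[top:bottom]]


def recognize(image, coarse_classifier, routines, threshold=0.8):
    """One round of the 'next question' loop.

    coarse_classifier(image) -> (label, confidence): initial feedforward guess.
    routines[label] -> (region, classifier): follow-up question for that guess.
    """
    label, confidence = coarse_classifier(image)
    if confidence >= threshold or label not in routines:
        return label                        # the feedforward answer is kept
    region, classifier = routines[label]    # choose the next question
    return classifier(crop(image, region))  # zoom in, call the chosen classifier


# Toy 4x4 "image" of pixel intensities (hypothetical data).
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]


def coarse(img):
    """Hypothetical feedforward pass returning an ambiguous categorization."""
    return "animal", 0.5


def cat_or_dog(patch):
    """Hypothetical specialized classifier based on mean patch intensity."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return "cat" if mean > 5 else "dog"


routines = {"animal": ((1, 3, 1, 3), cat_or_dog)}
answer = recognize(image, coarse, routines)
print(answer)
```

A fuller version would iterate, with each answer determining the next question, in the spirit of the 20 questions game.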

There is considerable evidence that gating information through attention serves as a mechanism to focus resources on the most salient or task-relevant aspects of an image [121, 125-128]. In its simplest formulation (spatial attention), a particular location within the image is salient (e.g., through its color, motion or other physical aspect) or is particularly important for a given task (e.g., a subject is instructed to attend to a particular location or a relevant event is expected to occur in a given location). Under these circumstances, bottom-up or top-down mechanisms could enhance information coming from the attended location at the expense of other parts of the image. Multiple experiments have shown that neuronal responses throughout visual cortex can be modulated by spatial attention (e.g., [113, 117, 118]). Alternatively, feature-based attention may enhance specific aspects of the image (as opposed to specific locations). For example, when searching for a person wearing a red dress in a crowd, it may be useful to enhance the processing of red objects. Recently, physiological evidence has suggested that feature-based attention modulates the processing of specific simple features such as color or orientation in a parallel manner in early visual areas [119]. Backprojections could play an important role in using task-relevant information (perhaps coming from frontal cortex) to gate or reroute the information that reaches higher cortical stages.
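One simple way to picture enhancement "at the expense of" unattended inputs is a multiplicative gain on the attended channel followed by renormalization of the total activity. The sketch below applies the same operation to locations (spatial attention) and to features (feature-based attention). The gain-plus-renormalization rule, the function name `gate_by_attention` and all values are illustrative assumptions, not a model proposed in the studies cited above.

```python
def gate_by_attention(responses, attended, gain=2.0):
    """Boost the attended channel by `gain`, then rescale all channels so that
    the summed activity is unchanged (the other channels are suppressed)."""
    total = sum(responses.values())
    boosted = {k: r * (gain if k == attended else 1.0)
               for k, r in responses.items()}
    boosted_total = sum(boosted.values())
    if boosted_total == 0:
        return dict(responses)  # nothing to redistribute
    scale = total / boosted_total
    return {k: r * scale for k, r in boosted.items()}


# Spatial attention: channels are image locations (hypothetical responses).
by_location = {"left": 1.0, "center": 1.0, "right": 1.0}
spatial = gate_by_attention(by_location, attended="left")

# Feature-based attention: channels are features, e.g. searching for red.
by_feature = {"red": 0.5, "green": 0.5, "blue": 0.5}
feature = gate_by_attention(by_feature, attended="red")

print(spatial)
print(feature)
```

In both cases the attended channel ends up above its original response and the unattended channels below theirs, while the total activity is preserved.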

## Computer vision

Visual object recognition is an important and challenging problem for many everyday applications, including security, robot navigation, character recognition, clinical image understanding and many others. Therefore, it remains an area of intense research in computer vision. An efficient engineering solution to this problem does not necessarily have to be based on biological circuits, so computer vision can use a diversity of mathematical and algorithmic tricks that may be unavailable to biological neuronal networks. Multiple object recognition algorithms have been proposed. Yet, none of the algorithms available today can surpass the performance of the human brain. This is by no means a statement (let alone a proof) that human performance sets a fundamental limit on object recognition. It merely suggests that more work needs to be done in the field to solve the multiple problems in robust intelligence that the human brain is so good at. Eventually, it is quite likely that computer vision may surpass and even enhance human performance.

## References

1. I. Biederman, Recognition-by-components: A theory of human image understanding. Psychological Review, 1987. 94: 115-147.

2. J.L. Borges, Fictions (Ficciones), ed. e.b.J. Sturrock. 1942: Grove Press.

3. M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999. 2: 1019-1025.

4. M. Potter and E. Levy, Recognition memory for a rapid sequence of pictures. Journal of Experimental Psychology, 1969. 81: 10-15.

5. S. Thorpe, D. Fize and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381: 520-522.

6. C. Hung, G. Kreiman, T. Poggio and J. DiCarlo, Fast Read-out of Object Identity from Macaque Inferior Temporal Cortex. Science, 2005. 310: 863-866.

7. D.J. Felleman and D.C. Van Essen, Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1991. 1: 1-47.

8. C. Koch, T. Poggio and V. Torre, Retinal ganglion cells: a functional interpretation of dendritic morphology. Philos Trans R Soc Lond B Biol Sci, 1982. 298: 227-63.

9. M. Meister, L. Lagnado and D.A. Baylor, Concerted signaling by retinal ganglion cells. Science, 1995. 270: 1207-1210.

10. M. Meister, Multineuronal Codes in Retinal Signaling. PNAS, 1996. 93: 609-614.

11. F. Rieke, D. Warland, R. van Steveninck and W. Bialek, Spikes. 1997, Cambridge, Massachusetts: The MIT Press.

12. P. Reinagel, D. Godwin, S. Sherman and C. Koch, Encoding of visual information by LGN bursts. Journal of Neurophysiology, 1999. 81: 2558-2569.

13. S. Sherman, Tonic and burst firing: dual modes of thalamocortical relay. Trends in Neurosciences, 2001. 24: 122-126.

14. J. Alonso, W. Usrey and R. Reid, Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 1996. 383: 815-819.

15. N.A. Lesica and G.B. Stanley, Encoding of natural scene movies by tonic and burst spikes in the lateral geniculate nucleus. Journal of Neuroscience, 2004. 24: 10731-40.

16. E.M. Callaway, Feedforward, feedback and inhibitory connections in primate visual cortex. Neural Netw, 2004. 17: 625-32.

17. R.J. Douglas and K.A. Martin, Neuronal circuits of the neocortex. Annu Rev Neurosci, 2004. 27: 419-51.

18. V.B. Mountcastle, Modality and topographic properties of single neurons of cat's somatic sensory cortex. J Neurophysiol, 1957. 20: 408-34.

19. E.M. Callaway, Local circuits in primary visual cortex of the macaque monkey. Annu Rev Neurosci, 1998. 21: 47-74.

20. D.H. Hubel and T.N. Wiesel, Early exploration of the visual cortex. Neuron, 1998. 20: 401-12.

21. J. Haxby, et al., Dissociation of object and spatial visual processing pathways in human extrastriate cortex. PNAS, 1991. 88: 1621-1625.

22. M. Mishkin, A memory system in the monkey. Philosophical Transactions of the Royal Society of London Series B, 1982. 298: 85.

23. M. Goodale and A. Milner, Separate visual pathways for perception and action. Trends in Neurosciences, 1992. 15: 20-25.

24. R.J. Douglas, C. Koch, M. Mahowald, K.A. Martin and H.H. Suarez, Recurrent excitation in neocortical circuits. Science, 1995. 269: 981-5.

25. S. Zola and L. Squire, Remembering the hippocampus. Behavioral and Brain Sciences, 1999. 22: 469-486.

26. S. Zola-Morgan and L.R. Squire, Neuroanatomy of memory. Annual Review of Neuroscience, 1993. 16: 547-563.

27. E. Miller, The prefrontal cortex and cognitive control. Nature Reviews Neuroscience, 2000. 1: 59-65.

28. K. Cheng, K.S. Saleem and K. Tanaka, Organization of Corticostriatal and Corticoamygdalar Projections Arising from the Anterior Inferotemporal Area TE of the Macaque Monkey: A Phaseolus vulgaris Leucoagglutinin Study. Journal of Neuroscience, 1997. 17: 7902-7925.

29. M. Livingstone and D. Hubel, Segregation of form, color, movement and depth: anatomy, physiology and perception. Science, 1988. 240: 740-749.

30. W.A. Suzuki, Neuroanatomy of the monkey entorhinal, perirhinal and parahippocampal cortices: Organization of cortical inputs and interconnections with amygdala and striatum. Seminars in the Neurosciences, 1996. 8: 3-12.

31. K.S. Saleem and K. Tanaka, Divergent projections from the anterior inferotemporal area TE to the perirhinal and entorhinal cortices in the macaque monkey. Journal of Neuroscience, 1996. 16: 4757-4775.

32. J. Nolte, The human brain: an introduction to its functional anatomy. 4th ed. 1998, New York: Mosby.

33. K. Brodmann, Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909, Leipzig: Barth.

34. F. Crick and C. Koch, Constraints on cortical thalamic projections: the no-strong-loops hypothesis. Nature, 1998. 391: 245-250.

35. R. Lorente de No, Studies on the structure of the cerebral cortex. II - Continuation of the study of the Ammonic System. 1934, Central Institute for the Deaf: Saint Louis.

36. D.C. Van Essen, C.H. Anderson and D.J. Felleman, Information processing in the primate visual system: an integrated systems perspective. Science, 1992. 255: 419-23.

37. R.T. Born and D.C. Bradley, Structure and function of visual area MT. Annu Rev Neurosci, 2005. 28: 157-89.

38. P. Dean, Effects of inferotemporal lesions on the behavior of monkeys. Psychological Bulletin, 1976. 83: 41-71.

39. C.G. Gross, How inferior temporal cortex became a visual area. Cerebral Cortex, 1994. 4: 455-469.

40. E. Holmes and C. Gross, Stimulus equivalence after inferior temporal lesions in monkeys. Behavioral Neuroscience, 1984. 98: 898-901.

41. A. Damasio, D. Tranel and H. Damasio, Face agnosia and the neural substrates of memory. Annual Review of Neuroscience, 1990. 13: 89-109.

42. N. Kanwisher and M. Moscovitch, The cognitive neuroscience of face processing: An introduction. Cognitive Neuropsychology, 2000. 17: 1-11.

43. E. Warrington and T. Shallice, Category specific semantic impairments. Brain, 1984. 107: 829-854.

44. E. Warrington and R. McCarthy, Categories of knowledge - Further fractionations and an attempted integration. Brain, 1987. 110: 1273-1296.

45. R. McCarthy and E. Warrington, Disorders of semantic memory. Philosophical Transactions of the Royal Society of London Series B, 1994. 346: 89-96.

46. M. Schmolesky, Y. Wang, D. Hanes, K. Thompson, S. Leutgeb, J. Schall and A. Leventhal, Signal timing across the macaque visual system. Journal of Neurophysiology, 1998. 79: 3272-3278.

47. N.K. Logothetis and D.L. Sheinberg, Visual object recognition. Annual Review of Neuroscience, 1996. 19: 577-621.

48. M. Carandini, et al., Do we know what the early visual system does? J Neurosci, 2005. 25: 10577-97.

49. M. Carandini, D.J. Heeger and J.A. Movshon, Linearity and normalization in simple cells of the macaque primary visual cortex. J Neurosci, 1997. 17: 8621-44.

50. E. Simoncelli and B. Olshausen, Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, 2001. 24: 193-216.

51. B.A. Olshausen and D.J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996. 381: 607-9.

52. S.V. David and J.L. Gallant, Predicting neuronal responses during natural vision. Network, 2005. 16: 239-60.

53. M. Ito and H. Komatsu, Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. J Neurosci, 2004. 24: 3313-24.

54. A. Plebe. A model of angle selectivity development in visual area v2. in Computational neuroscience. 2006.

55. R. von der Heydt, E. Peterhans and G. Baumgartner, Illusory contours and cortical neuron responses. Science, 1984. 224: 1260-1262.

56. A. Pasupathy and C. Connor, Responses to contour features in macaque area V4. Journal of Neurophysiology, 1999. 82: 2490-2502.

57. J.L. Gallant, J. Braun and D.C. Van Essen, Selectivity for polar, hyperbolic, and Cartesian gratings in macaque visual cortex. Science, 1993. 259: 100-3.

58. C. Cadieu, M. Kouh, A. Pasupathy, C. Connor, M. Riesenhuber and T. Poggio, A model of shape representation in area V4. Submitted to Neuron, submitted.

59. S.V. David, B.Y. Hayden and J.L. Gallant, Spectral receptive field properties explain shape selectivity in area V4. J Neurophysiol, 2006. 96: 3492-505.

60. K. Tanaka, Inferotemporal cortex and object vision. Annual Review of Neuroscience, 1996. 19: 109-139.

61. R. Desimone, T. Albright, C. Gross and C. Bruce, Stimulus-selective properties of inferior temporal neurons in the macaque. Journal of Neuroscience, 1984. 4: 2051-2062.

62. C.G. Gross, P.H. Schiller, C. Wells and G.L. Gerstein, Single-unit activity in temporal association cortex of the monkey. J Neurophysiol, 1967. 30: 833-43.

63. E. Rolls, D. Perrett and F. Wilson, Neuronal responses related to visual recognition. Brain, 1982. 105: 611-646.

64. M. Ito, H. Tamura, I. Fujita and K. Tanaka, Size and position invariance of neuronal responses in monkey inferotemporal cortex. J Neurophysiol, 1995. 73: 218-26.

65. E. Rolls, Neural organization of higher visual functions. Current Opinion in Neurobiology, 1991. 1: 274-278.

66. N.K. Logothetis, J. Pauls and T. Poggio, Shape representation in the inferior temporal cortex of monkeys. Current Biology, 1995. 5: 552-563.

67. J. DiCarlo and H. Maunsell, Form representation in monkey inferotemporal cortex is virtually unaltered by free viewing. Nature Neuroscience, 2000. 3: 814-821.

68. G. Sary, R. Vogels and G.A. Orban, Cue-invariant shape selectivity of macaque inferior temporal neurons. Science, 1993. 260: 995-997.

69. G. Kreiman, Single neuron approaches to human vision and memories. Current Opinion in Neurobiology, 2007. 17: 471-475.

70. R. Quian Quiroga, L. Reddy, G. Kreiman, C. Koch and I. Fried, Invariant visual representation by single neurons in the human brain. Nature, 2005. 435: 1102-1107.

71. G. Kreiman, C. Koch and I. Fried, Category-specific visual responses of single neurons in the human medial temporal lobe. Nature Neuroscience, 2000. 3: 946-953.

72. R. Quian Quiroga, G. Kreiman, C. Koch and I. Fried, Sparse but not 'Grandmother-cell' coding in the medial temporal lobe. Trends in Cognitive Science, 2008. 12: 87-91.

73. C. Gross, C. Rocha-Miranda and D. Brender, Visual properties of neurons in inferotemporal cortex of the Macaque. Journal of Neurophysiology, 1972. 35: 96-111.

74. E. Kobatake and K. Tanaka, Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J Neurophysiol, 1994. 71: 856-67.

75. G. Kayaert, I. Biederman and R. Vogels, Shape tuning in macaque inferior temporal cortex. J Neurosci, 2003. 23: 3016-27.

76. K. Tsunoda, Y. Yamane, M. Nishizaki and M. Tanifuji, Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns. Nat Neurosci, 2001. 4: 832-8.

77. S.R. Afraz, R. Kiani and H. Esteky, Microstimulation of inferotemporal cortex influences face categorization. Nature, 2006. 442: 692-5.

78. N. Kanwisher, J. McDermott and M.M. Chun, The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 1997. 17: 4302-4311.

79. A. Ishai, J.V. Haxby and L.G. Ungerleider. Human Neural Systems for the generation of visual images. in Society for Neuroscience. 1999. Miami.

80. D.H. Hubel and T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol, 1962. 160: 106-54.

81. D. Marr, Vision. 1982: Freeman publishers.

82. D. Marr and H.K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes. Proc R Soc Lond B Biol Sci, 1978. 200: 269-94.

83. K. Fukushima, Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980. 36: 193-202.

84. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition. Proc of the IEEE, 1998. 86: 2278-2324.

85. D. Perrett and M. Oram, Neurophysiology of shape processing. Img. Vis. Comput., 1993. 11: 317-333.

86. B. Mel, SEEMORE: Combining color, shape and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 1997. 9: 777.

87. G. Wallis and E.T. Rolls, Invariant face and object recognition in the visual system. Progress in Neurobiology, 1997. 51: 167-94.

88. G. Deco and E.T. Rolls, A neurodynamical cortical model of visual attention and invariant object recognition. Vision Res, 2004. 44: 621-42.

89. B.A. Olshausen, C.H. Anderson and D.C. Van Essen, A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J Neurosci, 1993. 13: 4700-19.

90. P. Foldiak, Learning Invariance from Transformation Sequences. Neural Computation, 1991. 3: 194-200.

91. I. Biederman and E.E. Cooper, Priming contour-deleted images: evidence for intermediate representations in visual object recognition. Cognit Psychol, 1991. 23: 393-419.

92. S. Edelman and S. Duvdevani-Bar, A model of visual recognition and categorization. Philos Trans R Soc Lond B Biol Sci, 1997. 352: 1191-202.

93. H.H. Bulthoff, S.Y. Edelman and M.J. Tarr, How are three-dimensional objects represented in the brain? Cereb Cortex, 1995. 5: 247-60.

94. T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich and T. Poggio, A quantitative theory of immediate visual recognition. Progress In Brain Research, 2007. 165C: 33-56.

95. T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman and T. Poggio, A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. 2005, MIT: Boston. p. CBCL Paper #259/AI Memo #2005-036.

96. D. Hubel and T. Wiesel, Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology (London), 1959. 148: 574-591.

97. T. Serre, Thomas Thesis, in Brain and Cognitive Science. 2006, MIT: Boston.

98. T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich and T. Poggio, A quantitative theory of immediate visual recognition. Progress In Brain Research, In Press.

99. T. Poggio and S. Edelman, A network that learns to recognize three-dimensional objects. Nature, 1990. 343: 263-6.

100. C.M. Bishop, Neural Networks for Pattern Recognition. 1995, Oxford: Clarendon Press.

101. A. Treisman, The binding problem. Curr Opin Neurobiol, 1996. 6: 171-8.

102. M. Riesenhuber and T. Poggio, Are cortical models really bound by the "binding problem"? Neuron, 1999. 24: 87-93.

103. T. Binzegger, R.J. Douglas and K.A. Martin, A quantitative map of the circuit of cat primary visual cortex. J Neurosci, 2004. 24: 8441-53.

104. T.N. Wiesel and D.H. Hubel, Single-Cell Responses in Striate Cortex of Kittens Deprived of Vision in One Eye. J Neurophysiol, 1963. 26: 1003-17.

105. M. Sur and J.L. Rubenstein, Patterning and plasticity of the cerebral cortex. Science, 2005. 310: 805-10.

106. C.S. Goodman and C.J. Shatz, Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell, 1993. 72 Suppl: 77-98.

107. E.C. Smith and M.S. Lewicki, Efficient auditory coding. Nature, 2006. 439: 978-82.

108. T. Serre, A. Oliva and T. Poggio, Feedforward theories of visual cortex account for human performance in rapid categorization. PNAS, In Press.

109. I. Lampl, D. Ferster, T. Poggio and M. Riesenhuber, Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. J Neurophysiol, 2004. 92: 2704-13.

110. A.M. Treisman and G. Gelade, A feature-integration theory of attention. Cognit Psychol, 1980. 12: 97-136.

111. M.I. Posner, Attention: the mechanisms of consciousness. Proc Natl Acad Sci U S A, 1994. 91: 7398-403.

112. J.M. Wolfe and T.S. Horowitz, What attributes guide the deployment of visual attention and how do they do it? Nat Rev Neurosci, 2004. 5: 495-501.

113. J. Tsotsos, Analyzing Vision at the Complexity Level. Behavioral and Brain Sciences, 1990. 13-3: 423-445.

114. T.S. Lee and D. Mumford, Hierarchical Bayesian inference in the visual cortex. J Opt Soc Am A Opt Image Sci Vis, 2003. 20: 1434-48.

115. D. Mumford, On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol Cybern, 1992. 66: 241-51.

116. A. Yuille and D. Kersten, Vision as Bayesian inference: analysis by synthesis? Trends Cogn Sci, 2006. 10: 301-8.

117. D.C. Knill and W. Richards, eds. Perception as Bayesian Inference. 1996, Cambridge University Press.

118. R.P.N. Rao, B.A. Olshausen and M.S. Lewicki, eds. Probabilistic Models of the Brain: Perception and Neural Function. 2002, MIT Press: Cambridge.

119. K. Friston, Learning and inference in the brain. Neural Netw, 2003. 16: 1325-52.

120. A.J. Yu and P. Dayan, Uncertainty, neuromodulation, and attention. Neuron, 2005. 46: 681-92.

121. R. Desimone and J. Duncan, Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 1995. 18: 193-222.

122. A. Compte and X.J. Wang, Tuning curve shift by attention modulation in cortical neurons: a computational study of its mechanisms. Cereb Cortex, 2006. 16: 761-78.

123. J.H. Reynolds and L. Chelazzi, Attentional modulation of visual processing. Annu Rev Neurosci, 2004. 27: 611-47.

124. S. Hochstein and M. Ahissar, View from the top: hierarchies and reverse hierarchies in the visual system. Neuron, 2002. 36: 791-804.

125. D.B. Walther and C. Koch, Attention in hierarchical models of object recognition. Prog Brain Res, 2007. 165: 57-78.

126. D. Walther, U. Rutishauser, C. Koch and P. Perona, Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 2005. 100: 41-63.

127. L. Itti and C. Koch, Computational modelling of visual attention. Nat Rev Neurosci, 2001. 2: 194-203.

128. Y. Amit and M. Mascaro, An integrated network for invariant visual detection and recognition. Vision Research, 2003. 43: 2073-2088.

129. C.E. Connor, D.C. Preddie, J.L. Gallant and D.C. Van Essen, Spatial attention effects in macaque area V4. J Neurosci, 1997. 17: 3201-14.

130. Y.B. Saalmann, I.N. Pigarev and T.R. Vidyasagar, Neural mechanisms of visual attention: how top-down feedback highlights relevant locations. Science, 2007. 316: 1612-5.

131. N.P. Bichot, A.F. Rossi and R. Desimone, Parallel and serial neural mechanisms for visual search in macaque area V4. Science, 2005. 308: 529-34.

Internal references

• Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
• Keith Rayner and Monica Castelhano (2007) Eye movements. Scholarpedia, 2(10):3649.
• William D. Penny and Karl J. Friston (2007) Functional imaging. Scholarpedia, 2(5):1478.
• Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.
• Kunihiko Fukushima (2007) Neocognitron. Scholarpedia, 2(1):1717.
• Almut Schüz (2008) Neuroanatomy. Scholarpedia, 3(3):3158.
• John Dowling (2007) Retina. Scholarpedia, 2(12):3487.
• S. Murray Sherman (2006) Thalamus. Scholarpedia, 1(9):1583.
• Nicholas V. Swindale (2008) Visual map. Scholarpedia, 3(6):4607.

Logothetis, N. K., and Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience 19, 577-621.

Marr, D. (1982). Vision, Freeman publishers.

Riesenhuber, M., and Poggio, T. (2000). Models of object recognition. Nature Neuroscience 3 Suppl, 1199-1204.