Models of Visual Cortex

From Scholarpedia
This article has not yet been published; it may contain inaccuracies, unapproved changes, or be unfinished.


The past decades of experimental work in visual neuroscience have generated a large and rapidly growing amount of data. Starting with Rosenblatt's perceptron (Rosenblatt, 1958), there has been an explosion in the development of computational models for neuroscience, with hundreds of models for early vision alone. Some of these models have been quite successful at describing aspects of the processing of information in the visual cortex, and they account for a wide array of visual phenomena, bridging the gap from biophysics to physiology and behavior.

Computational modeling is one of the central methods in vision research, and recent developments in computational neuroscience, machine learning, and computer vision have provided a wealth of new tools for developing computational accounts of primate vision. Models of the visual cortex can provide a much-needed framework for summarizing and integrating existing data and for planning, coordinating and interpreting new experiments. Models can be powerful tools in basic research, integrating knowledge across several levels of analysis -- from molecular to synaptic, cellular, systems and complex visual behavior. Models, however, are limited in their explanatory power; ideally they should eventually lead to a deeper and more general theory. As noted by Chuck Stevens, “Models are common; good theories are scarce”.

Disclaimer: There exist many models of vision that are not models of the visual cortex. These include, for instance, extensive work on the fly, the beetle, the rabbit, and others. There is also a large body of work from the past twenty years of research in computer vision. Related systems that do not try to mimic the processing of information in the visual cortex will not be reviewed here.


Architecture and function of the visual cortex

Here we briefly review some of the main aspects of the anatomy of the visual cortex -- its hierarchical organization, its organization in terms of feature maps and visual modules as well as the so-called feedforward-feedback dichotomy -- because they provide a useful way to classify models.

Anatomy of the visual cortex

<review> I do understand that this is an important foundation for models of visual cortex and the respective scholarpedia article on the visual cortex is not yet finished. However, it is kind of awkward to read through a lengthy description of the real brain first, before it comes to models. </review>

The visual cortex is the set of cortical sensory areas that are mainly involved in the processing of visual information. The visual cortex includes the primary visual cortex, also known as the striate cortex (in primates also referred to as Brodmann area 17 by anatomists and V1 by electrophysiologists), as well as several extrastriate areas (about thirty in monkeys <review> please add information on number of areas in humans </review> ).

Since the pioneering work of Hubel & Wiesel, who first recorded in the primary visual cortex and characterized the visual responses of neurons in this area, it has become commonplace to think of vision as residing in the visual cortex. Indeed, an increasing body of evidence from patient, physiology and imaging studies suggests that many of our visual abilities depend on the visual cortex <review> you could cite the work on blindsight. Either a current review "The blindsight saga. Cowey A. Exp Brain Res. 2010 Jan;200(1):3-24." or original work "Weiskrantz L, Barbur JL, Sahraie A (1995) Factors affecting conscious versus unconscious visual discrimination with V1. Proc Natl Acad Sci USA 92:6122–6126" </review> . A note of caution is required here, however, since subcortical pathways and structures may play a larger role than previously suspected, even for high-level visual capabilities such as object recognition (Kveraga et al. 2007).

The visual cortex is composed of several areas that are organized hierarchically (Felleman & Van Essen 1991). In a somewhat oversimplified account, it has been customary to describe the processing of visual information in the brain along two parallel and concurrent streams <review> here a citation is missing. Mortimer Mishkin has just published a nice review in Nat Rev Neurosci </review> . The ventral (what) stream processes visual shape and appearance and is largely responsible for object recognition. The dorsal (where) stream encodes spatial locations and processes motion information. In an extreme version of this view, the two streams underlie the perception of what and where concurrently and relatively independently of each other. Lesions in a key area of the ventral stream (the inferior temporal cortex) cause severe deficits in visual discrimination tasks without affecting performance on visuospatial tasks, such as visually guided reaching or judgments of proximity between an object and a visual landmark. In contrast, parietal lesions in the dorsal stream cause severe deficits in visuospatial tasks while sparing visual discrimination ability. In everyday life, however, the identity and location of objects must somehow be integrated to enable us to direct appropriate actions to objects. <review> the division into magno/parvocellular systems should be mentioned. It feeds the ventral/dorsal division. </review>

Visual modules

<review> here the relation to areas is unclear. Are there different areas for different functions, or is it a subdivision of areas and the modules are rather small? </review>

A widely held, often implicit assumption is that vision can be thought of as relying on several independent modules, each implementing a different visual function (Hubel & Wiesel, 1977; Swindale, 2000). For instance, color perception is often assumed to rely on brain processes that are somewhat independent of the processes underlying motion, shape or stereo. This assumption is based in part on the existence of feature maps for the analysis of color, orientation, depth, motion, spatial frequency, etc. <review> This is kind of difficult. There is a bias, but a clear cut separation into small modules probably is a myth. See, Gegenfurtner, K. R. (2003). Sensory systems: Cortical mechanisms of colour vision. Nature Reviews Neuroscience, 4(7), 563–572. doi:10.1038/nrn1138 figure 5&6, there is no anti-correlation of orientation . </review>

The feedforward-feedback dichotomy

Studies of cortico-cortical circuits (e.g., from V1 to extrastriate areas) <review> here the 6-layer structure of neocortex is important. It defines the asymmetric connectivity in the hierarchy. </review> have shown that feedforward connections are focused, while feedback connections (e.g., from extrastriate cortex to V1) tend to be more widespread (although this picture is likely to change with the development of new tracing techniques). Note that under this dichotomy, lateral connections would be treated as feedback connections. Despite the widespread nature of feedback connections, classical receptive fields (in V1, for instance) are relatively small. This has led scientists to suggest that feedforward inputs shape the receptive fields, and therefore the selectivity, of individual neurons. Feedback connections, on the other hand, would play more of a modulatory role, influencing neuronal responses primarily when visual stimuli are placed outside the classical receptive field (Bullier, 1996).

Back-projections are typically thought of as being neither sufficient (i.e., on their own they could not activate target neurons without feedforward inputs; see (Grossberg 2005) for a review) nor necessary (neurons tend to be selective from the very onset of their responses, before back-projections could be active). Indeed, one criterion often used to isolate back-projections is that they are only activated after the neuron onto which they project (Callaway 1998). At the same time, this picture is rapidly changing: recent neuroimaging experiments suggest that patterns of functional magnetic resonance imaging response in human foveal retinotopic cortex contain information about objects presented in the periphery, far away from the fovea (Williams et al 2008) <review> compared to the context, this seems like a rather special piece of information. </review> . The feedforward-feedback dichotomy can be used to distinguish between classes of models of object recognition (Riesenhuber & Poggio 1999).

Models, levels of analysis and biological realism

In the computational paradigm of visual processing, the human visual system can be regarded as an information processor performing computations on internal symbolic representations of visual information <review> well, here I'd disagree and put forward an embodied perspective. Either formulate more neutral or label as one of a small number of viewpoints. </review> . As a consequence, one should distinguish between the meaning of the symbols and the operations carried out on them, on the one hand, and the physical manifestation of these symbols, on the other.

A classical distinction made between modeling approaches is the bottom-up vs. top-down approach. The bottom-up approach aims at reverse-engineering the visual system. This often takes the form of electrophysiology experiments aiming to systematically dissect <review> "dissect" emphasizes the gathering of detailed information. The synthesis, putting everything together again into a large functioning system, is the crucial step of the bottom-up approach (i.e., maybe point out that the bottom-up approach has nothing whatsoever to do with the bottom-up pathway in the hierarchy). </review> the electrical properties of cortical circuits. The Blue Brain Project is a notable example of the bottom-up approach. The top-down approach, on the other hand, corresponds to forward engineering, which is the general goal of computer vision, machine learning and artificial intelligence. <review> Here it is not as detailed, how it is done. Examples? </review>

As originally described by Marr and Poggio (1977), a complex system -- and models of it -- can and must be understood at different levels of analysis. In this article we distinguish a) models of functions and computations performed by the visual cortex, such as motion computation, and b) models of circuits of the visual cortex performing specific operations, such as gain control, which may be the building blocks for several different computations.

There is a certain tension between these two views, and only some of the models try to combine both. Both levels of modeling are in fact critically important, and they will eventually merge (as happened in physics, for example, with thermodynamics and statistical mechanics). A notable example of work spanning both levels is the research program of Steve Grossberg, which has produced many models at all levels, from biophysics to the full visual cortex and beyond.

Biological realism

Models are by definition simplifications of reality, and thus none of the models we list below is fully realistic. <review> Sturm and Konig 2000 is a nice review on this issue if I might say so. </review> Especially at the level of explaining visual functions, only some of the models attempt to relate to actual cortical structures and specific areas, and they almost never go down to smaller components such as layers or specific types of neurons or interneurons. The notion of biological plausibility for models of the visual cortex has remained elusive. For instance, Ullman (1979) suggested that biologically plausible computations should exploit the inherent parallelism of cortex as well as its apparent uniformity (a somewhat more modern version of this is the idea of canonical circuits (Douglas & Martin 1991) and short-range connectivity). Koch’s Biophysics of Computation (Koch 1999) provides a tentative list of mathematical operations that neural circuits could, in principle, implement. While not a necessary condition, a model that uses only operations from this list can, in principle, be considered biologically plausible.

Normative models

Normative models typically assume that our visual cortex has evolved to become optimally adapted to the statistical structure of our visual world. They aim at understanding the visual cortex on the basis of optimality: starting with a theoretical formulation of the problems and constraints of visual functions, they propose an answer for how visual tasks should be solved optimally given these constraints. Such approaches typically take the form of Bayesian probabilistic models, with prominent examples including coding models of the primary visual cortex (Olshausen 1996) and visual statistics (Geisler 2008), ideal observer models of motion processing (Weiss et al, 2002) and eye movements (Najemnik & Geisler 2005), as well as Bayesian inference models of visual perception (Kersten et al 2004; Yuille & Kersten 2006). <review> here more material is dearly needed. A good review on normative models is: Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216. doi:10.1146/annurev.neuro.24.1.1193 As hierarchy is important Wyss, R., König, P., & Verschure, P. F. M. J. (2006). A model of the ventral visual system based on temporal stability and local memory PLoS Biology, 4(5), e120. doi:10.1371/journal.pbio.0040120 and Franzius, M., Sprekeler, H., & Wiskott, L. (2007). Slowness and Sparseness Lead to Place, Head-Direction, and Spatial-View Cells. PLoS Computational Biology, 3(8), e166. doi:10.1371/journal.pcbi.0030166.sd001 should be cited </review>
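
As a concrete instance of the Bayesian approach, the sketch below combines a Gaussian likelihood for a measured speed with a zero-mean Gaussian prior favoring slow speeds, in the spirit of the ideal observer models of motion perception cited above. It is a one-dimensional toy, not the model of Weiss et al. (2002): the rule linking noise to contrast and all numerical values are hypothetical assumptions chosen only for illustration.

```python
def posterior_speed(measured_speed, likelihood_sd, prior_sd):
    """Posterior mean speed for a Gaussian likelihood centered on the measured
    speed and a zero-mean Gaussian prior favoring slow speeds.

    For two Gaussians the posterior mean is a precision-weighted average of
    the likelihood mean and the prior mean (here 0)."""
    weight = prior_sd**2 / (prior_sd**2 + likelihood_sd**2)
    return weight * measured_speed


def likelihood_sd_for_contrast(contrast, k=0.5):
    # Hypothetical assumption: measurement noise is inversely proportional
    # to stimulus contrast, so low-contrast motion is less reliable.
    return k / contrast


true_speed = 4.0   # deg/s, arbitrary value for illustration
prior_sd = 2.0     # width of the slow-speed prior, also arbitrary
for contrast in (0.8, 0.1):
    sd = likelihood_sd_for_contrast(contrast)
    estimate = posterior_speed(true_speed, sd, prior_sd)
    print(f"contrast={contrast:.1f}  estimated speed={estimate:.2f}")
```

Because the likelihood broadens at low contrast, the prior pulls the estimate further toward zero, so the low-contrast stimulus is "perceived" as slower; this is the qualitative pattern that such ideal observer models are used to explain.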

Spiking vs. non-spiking models

Spiking vs. non-spiking is yet another dichotomy that is often used to discriminate between models of the visual cortex. Non-spiking models (often loosely called rate-based) usually rely on rather simplified neural circuits and assume that neural activity, or firing rate, is encoded by analog values, either from single cells over “long” time windows (>100 ms) or from populations of cells over shorter time windows. A prominent model of visual processing based on spiking circuits is the SpikeNet model (see (Thorpe 2002) for a review). <review> spiking vs. non-spiking seems like a sub-issue of biological realism. Compartment models vs. point neurons, rate coding versus temporal detail are equally important. </review>
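
To make the contrast with rate codes concrete, here is a minimal sketch of rank-order coding of the kind associated with SpikeNet-style models: only the order in which inputs fire matters, and each successive spike is weighted by a decreasing modulation factor. The latency rule (stronger inputs fire earlier), the weights and the modulation value of 0.8 are illustrative assumptions, not parameters taken from SpikeNet.

```python
def latency_order(intensities):
    # Assumption: stronger inputs fire earlier, so the firing order is the
    # input indices sorted by decreasing intensity (ties keep index order).
    return sorted(range(len(intensities)), key=lambda i: -intensities[i])


def rank_order_activation(spike_order, weights, modulation=0.8):
    """The k-th input to fire (k = 0, 1, ...) contributes
    weights[input] * modulation**k, so early spikes count the most."""
    return sum(weights[i] * modulation**k for k, i in enumerate(spike_order))


weights = [1.0, 0.6, 0.3, 0.1]   # hypothetical synaptic weights ("template")
matched = latency_order([0.9, 0.7, 0.4, 0.2])          # fires in weight order
reversed_order = latency_order([0.2, 0.4, 0.7, 0.9])   # fires in reverse order

print(rank_order_activation(matched, weights))
print(rank_order_activation(reversed_order, weights))
```

The unit responds most strongly when the spikes arrive in the same order as its weights, so a single wave of first spikes is enough to perform template matching without estimating any firing rate.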

Models of operations and circuits

<review> when I imagine a reader working down to this section, it is perceived as an example. Which is a good idea. But the classification above should be complete. However, further down more classification is following and even a section on models of early cortex. Maybe I'm confused by the structure. Better phrase this as an example detailing the classification developed above (and below)? </review>

The first model of the visual cortex dates back to Hubel & Wiesel, who described a qualitative model of simple and complex cells in primary visual cortex in non-human primates (see Box 1 for details). Since then, a tremendous amount of data has been collected about the visual cortex and a large number of models of the visual cortex have been developed.

[Box 1]

Most of the models developed today are at the level of circuits or microcircuits of cortex, some with quantitative details in terms of the underlying biophysics of synapses and neurons. Only a partial list is possible here. In addition to the circuits described below, other interesting models include those referred to as ‘winner-take-all’ circuits (Yuille 1988; Hahnloser, Sarpeshkar et al. 2000; Rousselet et al. 2003) and models of ‘oscillations and synchronization’. Note that there is a very large literature on the latter subject, often related to attention (see below), starting with (Niebur & Koch 1994; Borgers and Kopell 2003; Tiesinga and Sejnowski 2004). A close cousin of the ‘winner-take-all’ circuit is the softmax operation, which selects and transmits (approximately) the most active response among a set of neural inputs (Nowlan and Sejnowski 1995; Riesenhuber and Poggio 2000).
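
A winner-take-all circuit can be sketched with the classic MAXNET scheme, in which every unit is inhibited by the summed activity of all the others until a single unit remains active. This is only one simple way to obtain winner-take-all behavior and is not meant to reproduce any of the specific circuits cited above. The inhibition strength `eps` is an assumed parameter that must be smaller than 1/n for the competition to converge, and exactly tied inputs are never separated.

```python
def maxnet(inputs, eps=None, max_iter=1000):
    """MAXNET-style winner-take-all with rectified units.

    Each iteration subtracts eps times the summed activity of all *other*
    units from every unit, then rectifies, until at most one unit is active."""
    x = [max(0.0, v) for v in inputs]
    n = len(x)
    if eps is None:
        eps = 0.5 / n   # any 0 < eps < 1/n works for distinct inputs
    for _ in range(max_iter):
        if sum(1 for v in x if v > 0) <= 1:
            break
        total = sum(x)
        x = [max(0.0, v - eps * (total - v)) for v in x]
    return x


out = maxnet([0.2, 0.5, 0.9, 0.4], eps=0.2)
winner = max(range(len(out)), key=lambda i: out[i])
print(out, winner)
```

Only the unit that received the largest input survives the mutual inhibition; its final activity is reduced but still positive, which is why such circuits are said to select, rather than copy, the strongest response.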

Circuits for normalization, gain control and tuning of cortical cells

<review> what is the relation of this section to the overall topic? Is it an application area or a longer list of examples? </review>

One of the well-established models of complex cells is the energy model (Adelson & Bergen 1985), which proposes that the outputs of a quadrature pair of linear filters are squared and then summed, in order to explain the phase invariance of complex V1 cells. Biophysical explanations have since been proposed for some aspects of the model, such as the approximate squaring (Miller 2002). A different operation, normalization, is addressed by the divisive normalization model (Heeger 1993; Carandini and Heeger 1994), which assumes gain-controlling, divisive inhibition to explain contrast-dependent, sigmoid-like response profiles within a pool of neurons in V1. Divisive normalization is also likely to play an important part in explaining motion selectivity in V1 and MT (Nowlan & Sejnowski 1995; Rust, Schwartz et al. 2005; Rust, Mante et al. 2006) and attentional mechanisms (Lee, Itti et al. 1999; Reynolds, Chelazzi et al. 1999; Reynolds & Heeger 2009; Chikkerur et al. 2010). A divisive operation had been introduced earlier as the basic operation to account for motion detection (Torre and Poggio 1978) and to describe quantitatively the detection of motion discontinuities in the fly (Poggio et al, 1981).
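
The two operations described above can be sketched in a few lines under simplifying assumptions: a one-dimensional stimulus, Gabor filters with hypothetical parameters, and a single normalization pool. The energy unit squares and sums the outputs of an even/odd quadrature pair, so its response to a grating is essentially independent of the grating's phase, whereas a half-rectified even filter (a caricature of a simple cell) is not. The normalization function divides each drive raised to a power `n` by a semi-saturation constant `sigma**n` plus the pooled drive, which produces sigmoid-like contrast-response curves.

```python
import math

XS = [i * 0.05 for i in range(-200, 201)]   # 1-D "retina" from -10 to 10


def gabor_pair(xs, sigma=2.0, freq=0.25):
    """Even (cosine) and odd (sine) Gabor filters: a quadrature pair."""
    env = [math.exp(-x * x / (2 * sigma**2)) for x in xs]
    even = [e * math.cos(2 * math.pi * freq * x) for e, x in zip(env, xs)]
    odd = [e * math.sin(2 * math.pi * freq * x) for e, x in zip(env, xs)]
    return even, odd


EVEN, ODD = gabor_pair(XS)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def responses(phase, freq=0.25):
    """Return (simple, energy) responses to a grating with the given phase."""
    stim = [math.cos(2 * math.pi * freq * x + phase) for x in XS]
    s_even, s_odd = dot(EVEN, stim), dot(ODD, stim)
    simple = max(0.0, s_even)            # half-rectified linear filter
    energy = s_even**2 + s_odd**2        # quadrature outputs squared and summed
    return simple, energy


def divisive_normalization(drives, sigma=0.2, n=2.0):
    """Each (non-negative) drive raised to n, divided by sigma**n plus the
    summed drives of the pool."""
    pool = sum(d**n for d in drives)
    return [d**n / (sigma**n + pool) for d in drives]


for phase in (0.0, math.pi / 2, math.pi):
    simple, energy = responses(phase)
    print(f"phase={phase:.2f}  simple={simple:8.3f}  energy={energy:10.3f}")
print(divisive_normalization([0.05, 0.2, 1.0]))
```

Sweeping the phase drives the "simple cell" output from its maximum to zero while the energy stays constant. For a single-unit pool, `divisive_normalization([c])` reaches half of its maximum at `c == sigma` and saturates at high contrast, and adding other active units to the pool suppresses each unit's response.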

Still another operation common to many neurons in visual cortex has been described as a Gaussian-like operation corresponding to a bell-shaped response tuned to a specific, optimal pattern of activation of the presynaptic inputs (Poggio 2004), as induced for instance by a bar of a specific orientation for simple cells in V1 or by a face for neurons in specific patches of IT (Tsao, Schweers et al. 2008).
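
The Gaussian-like tuning operation can be written as a response that decays with the Euclidean distance between the current pattern of presynaptic activity and the unit's preferred pattern. The sketch below assumes exactly that form, with a hypothetical preferred pattern and width `sigma`.

```python
import math


def gaussian_tuning(x, w, sigma=0.5):
    """Bell-shaped tuning: the response is 1 when the afferent activity
    pattern x equals the preferred pattern w and falls off with the squared
    Euclidean distance between them."""
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-d2 / (2 * sigma**2))


preferred = [0.9, 0.1, 0.4]   # hypothetical optimal presynaptic pattern
print(gaussian_tuning(preferred, preferred))        # 1.0
print(gaussian_tuning([0.8, 0.2, 0.4], preferred))  # close to the optimum
print(gaussian_tuning([0.1, 0.9, 0.4], preferred))  # far from the optimum
```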

Interestingly, all of these distinct neural operations – in particular the energy model, the tuning and the softmax operation described above – may be computed by similar circuitry, involving divisive normalization and biophysically plausible polynomial nonlinearities, with different parameter values within the circuit (Kouh and Poggio 2008). Thorpe and colleagues have described a related circuit for template matching based on the timing of the arrival of spikes from a group of neurons (Thorpe, 2002). There is in fact some limited evidence for this type of encoding in somatosensory areas (see (VanRullen 2005) for a review).
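
The shared circuit can be sketched with a generic divisive form, y = Σ w_i x_i^p / (k + (Σ x_i^q)^r), along the lines of Kouh and Poggio (2008), where the choice of exponents selects the operation. The settings below (p = q + 1 with r = 1 for a softmax that approaches the maximum as q grows; p = 1, q = 2, r = 1/2 for a normalized dot product that behaves like tuning) and all input values are offered as an illustrative sketch, not as the paper's exact parameters.

```python
def canonical_circuit(x, w, p, q, r, k=1e-6):
    """y = sum_i w_i * x_i**p / (k + (sum_i x_i**q)**r): a weighted sum with a
    polynomial nonlinearity, divided by a normalizing pool (inputs >= 0)."""
    num = sum(wi * xi**p for wi, xi in zip(w, x))
    den = k + sum(xi**q for xi in x) ** r
    return num / den


x = [0.2, 0.5, 0.9, 0.4]
ones = [1.0] * len(x)

# Softmax regime (p = q + 1, r = 1): the output approaches max(x) as q grows.
for q in (1, 4, 20):
    print(q, canonical_circuit(x, ones, p=q + 1, q=q, r=1))

# Tuning regime (p = 1, q = 2, r = 1/2): a normalized dot product w.x / |x|,
# largest when the input pattern points in the same direction as w.
w = [0.0, 0.0, 1.0, 0.0]
print(canonical_circuit([0.0, 0.0, 0.7, 0.0], w, p=1, q=2, r=0.5))
print(canonical_circuit([0.5, 0.5, 0.5, 0.5], w, p=1, q=2, r=0.5))
```

The same two ingredients, a polynomial nonlinearity on the inputs and a divisive pool, produce both a max-like pooling operation and a tuning operation; only the exponents change.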

A more complex cortical circuit, derived from a combination of anatomical and physiological studies, was proposed earlier by Douglas & Martin (1991) as a canonical circuit for cortex. The circuit explains the intracellular responses to pulse stimulation in V1 in terms of the interactions between three basic populations of neurons. Activation of a microcircuit sets in motion a sequence of excitation and inhibition in every neuron of the module, rather than initiating a sequential activation of separate neurons at different hypothetical processing stages. As described more recently by Logothetis (2008), re-excitation is tightly controlled by local inhibition, and the time evolution of excitation and inhibition is far longer than the synaptic delays of the circuits involved. This means that the magnitude and timing of any local mass activation arise as properties of the microcircuits. Computational modeling suggests that such microcircuits, with precisely balanced excitation and inhibition, can indeed account for a large variety of observations of cortical activity over time scales longer than 10 ms.
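
To illustrate how local inhibition can keep re-excitation in check, the sketch below integrates a generic two-population rate model (one excitatory and one inhibitory population) driven by a brief input pulse. It is deliberately simpler than the three-population circuit of Douglas & Martin (1991), and all weights and time constants are hypothetical. Recurrent excitation is strong enough (`w_ee > 1`) that the excitatory population would run away on its own, yet with feedback inhibition the response is a transient that outlasts the synaptic time constants and then returns to baseline.

```python
def simulate_ei(w_ee=1.5, w_ei=2.0, w_ie=1.5, tau_e=10.0, tau_i=5.0,
                pulse=(10.0, 30.0), amp=1.0, t_max=300.0, dt=0.1):
    """Euler-integrate a two-population rate model (excitatory, inhibitory).

    With w_ee > 1, recurrent excitation alone would run away; feedback
    inhibition (w_ei, w_ie) controls it. Times are in ms.
    Returns a list of (t, excitatory rate, inhibitory rate) samples."""

    def relu(v):
        return max(0.0, v)

    exc, inh = 0.0, 0.0
    samples = []
    for n in range(int(t_max / dt)):
        t = n * dt
        h = amp if pulse[0] <= t < pulse[1] else 0.0   # brief input pulse
        d_exc = (-exc + relu(w_ee * exc - w_ei * inh + h)) / tau_e
        d_inh = (-inh + relu(w_ie * exc)) / tau_i
        exc, inh = exc + dt * d_exc, inh + dt * d_inh
        samples.append((t, exc, inh))
    return samples


balanced = simulate_ei()
peak_t, peak_e, _ = max(balanced, key=lambda s: s[1])
print(f"E peaks at {peak_e:.2f} (t = {peak_t:.1f} ms), final E = {balanced[-1][1]:.1e}")

runaway = simulate_ei(w_ei=0.0)   # same pulse, inhibition removed
print(f"without inhibition, final E = {runaway[-1][1]:.1e}")
```

Removing the inhibitory feedback turns the same pulse into runaway excitation, which is the sense in which the timing and magnitude of the response are properties of the balanced microcircuit rather than of its input.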

Models of columnar and laminar organization

<review> Columnar and laminar organization have not really been described above. </review>

Recent advances in technologies for dissecting the detailed circuitry of the cortex will soon allow neuroscientists to study neural circuits at an unprecedented level of detail and to test detailed models of columnar and laminar organization. One of the most prominent models of laminar processing is the LAMINART model (see (Grossberg, 2007) for a review). Grossberg and colleagues have suggested that the laminar organization of visual cortex is critical for: (1) learning and development, whereby the cortex adapts to its visual environment; (2) binding distributed representations across visual modules into coherent object representations; and (3) selecting behaviorally relevant events or stimuli via attentional processes. The LAMINART circuit was shown to behave like a real-time probabilistic decision circuit that operates in a fast feedforward mode when there is little uncertainty, and automatically switches to a slower feedback mode when there is significant uncertainty.

Microcircuits of attention

Attention is the process by which stimuli are selected based on their saliency (bottom-up) or on current behavioral goals (top-down). Several groups have recently developed detailed biophysical models of the attentional effects in V4 neurons, with spiking neurons and conductance-based synapses, which go beyond previous phenomenological models (Reynolds et al, 1999). For instance, Buia & Tiesinga (2008) described a model that includes populations of spiking excitatory cells and two types of interneurons and that predicts different types of oscillatory behavior for spatial vs. feature-based attention.
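
The phenomenological account of attention can be illustrated with a simplified normalization-style equation for a neuron with two stimuli in its receptive field, in which attention multiplies one stimulus's contribution to both the excitatory drive and the suppressive pool. The equation and all values below are an illustrative assumption in the spirit of the normalization accounts cited earlier (e.g., Reynolds & Heeger 2009); they are not the spiking, conductance-based models described in this section.

```python
def attended_response(drives, contrasts, attention, sigma=0.05):
    """R = sum_i a_i * e_i / (sigma + sum_i a_i * c_i)

    e_i: excitatory drive of stimulus i (how well the neuron likes it),
    c_i: contribution of stimulus i to the suppressive pool,
    a_i: attentional gain on stimulus i (1 = unattended)."""
    num = sum(a * e for a, e in zip(attention, drives))
    den = sigma + sum(a * c for a, c in zip(attention, contrasts))
    return num / den


drives = [1.0, 0.2]      # preferred and non-preferred stimulus (hypothetical)
contrasts = [1.0, 1.0]   # equal contrast for both stimuli

alone_pref = attended_response([1.0], [1.0], [1.0])
alone_null = attended_response([0.2], [1.0], [1.0])
pair = attended_response(drives, contrasts, [1.0, 1.0])
att_pref = attended_response(drives, contrasts, [3.0, 1.0])
att_null = attended_response(drives, contrasts, [1.0, 3.0])
print(alone_pref, alone_null, pair, att_pref, att_null)
```

With both stimuli present the response lies between the two single-stimulus responses, and attending to either stimulus moves it toward the response that stimulus evokes on its own, the qualitative signature of biased competition that such phenomenological models were built to capture.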

Models of visual functions

Many of our perceptual abilities in vision have been extensively studied in psychophysics and corresponding algorithms usually exist in computer vision.

Many of the models that have been developed do not have a direct correspondence with the visual cortex. Others attempt to take into account the anatomy and the physiology of cortex. We consider only the latter set of models, though, as usual, there are plenty of examples that do not directly map onto cortical structures but are relevant nonetheless. For instance, the first model of stereo disparity by Marr & Poggio (1976) is not per se a model of cortex, since it mainly spells out some of the constraints to be used to solve the stereo matching problem, together with an algorithm. It can, however, be mapped onto circuits of neurons that could correspond to cortical disparity-tuned cells. The second stereo model by Marr and Poggio (1979) is instead a model of cortex, in the sense that it takes into account physiological data about cortical neurons and makes specific predictions about them. Still, it lacks much of the detail that one would expect from a model of cortex today.

A partial list of models of cortical functions -- roughly in historical order of appearance -- comprises contour and edge detection, lightness and color perception, segmentation, stereo vision and depth perception, motion processing, attention and object recognition. Below, we consider three of them.

Models of early vision

<review> the model of simple and complex cells should go here. Or is it mixing early vision and early visual cortex? </review>

A number of models for early vision have been described (mostly in the eighties, following the work of Marr, Poggio, Ullman, Horn, Grimson, Richards, Winston, Ballard, Koch, Hildreth and others). These include models of edge detection (Marr 1979), spatio-temporal interpolation and approximation, computation of optical flow and direction selectivity (Ullman 1979, Marr 1981), computation of lightness and albedo, shape from contours, shape from texture, shape from shading, binocular stereo matching (Marr & Poggio 1976), structure from motion, structure from stereo, surface reconstruction (Grimson 1982) and filling-in (Ullman 1976), and computation of surface color (Barrow & Tenenbaum 1981, Marr 1982; Hurlbert 1988).

Stereo vision

<review> now comes a list of visual features. You see, I'm lost in the structure and do not know whether I read the information as a systematic classification of models of visual cortex or as examples demonstrating such classes. What is the systematicity of the list? </review>

Stereopsis is the ability to reconstruct 3D information by combining two (or more) 2D views, such as the views from the two eyes; it relies on binocular disparity, the difference between the positions of an image feature in the two views. Some of the first computational models of visual cortex were about stereopsis (Marr & Poggio 1976). Since then, the disparity-specific responses of disparity-tuned neurons in V1 have been described by an energy model based on local, feedforward interactions (Read 2002). Little progress has been made on global properties of stereopsis and their representation at the neural level. This is an area where computational efforts have been lacking.
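
The energy-model account of disparity tuning can be sketched in one dimension with a position-shift binocular receptive field: the left-eye filters are centered at 0, the right-eye filters at the preferred disparity, and each binocular simple unit sums the two monocular filter outputs before squaring. Adding a quadrature pair of such units gives a disparity-energy response. The thin-bar stimulus, the Gabor parameters and the preferred disparity of 1.0 are hypothetical choices for illustration, not the specific formulation of Read (2002).

```python
import math


def gabor(x, sigma=1.0, freq=0.25, phase=0.0):
    """1-D Gabor receptive-field profile evaluated at position x."""
    return math.exp(-x * x / (2 * sigma**2)) * math.cos(2 * math.pi * freq * x + phase)


def binocular_energy(disparity, preferred=1.0):
    """Position-shift disparity-energy unit probed with a thin bar.

    The bar sits at 0 in the left eye and at `disparity` in the right eye.
    A thin bar drives each linear filter by the filter's value at the bar,
    so no convolution is needed."""
    energy = 0.0
    for phase in (0.0, -math.pi / 2):                  # quadrature pair (cos, sin)
        left = gabor(0.0, phase=phase)                 # left RF centered at 0
        right = gabor(disparity - preferred, phase=phase)  # right RF shifted
        energy += (left + right) ** 2                  # binocular unit, squared
    return energy


disparities = [k * 0.25 for k in range(-16, 17)]
energies = [binocular_energy(d) for d in disparities]
best = disparities[energies.index(max(energies))]
print(f"preferred disparity recovered: {best}")
```

The response peaks when the stimulus disparity matches the offset between the left and right receptive fields, and it is computed entirely from local, feedforward filtering, which is the property the energy model exploits.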

Edge detection and orientation selectivity

Several models exist for explaining the orientation tuning of simple cells in the primary visual cortex (V1); they fall into two major classes, feedforward models and recurrent models. Though it is still unclear which model best fits the data, it is likely that both the feedforward input geometry from the LGN -- as originally proposed by Hubel and Wiesel -- and recurrent intracortical circuits are involved in shaping the selectivity of simple cells. It is also possible that a version of the feedforward model is well suited for explaining orientation tuning in simple cells, whereas recurrent models involving recurrent normalization are suited to account for complex cells (Teich & Qian 2006).
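
The feedforward proposal of Hubel and Wiesel can be sketched by letting a simple cell sum the outputs of center-surround LGN cells whose centers lie along a line. The toy below assumes ON-center cells modeled as balanced differences of Gaussians, a thin bright line as the stimulus, and hypothetical sizes and spacings; it uses the fact that a normalized 2-D Gaussian integrated along a line at distance r from its center gives a 1-D Gaussian in r. Recurrent contributions are ignored.

```python
import math


def gauss1d(r, s):
    """Normalized 1-D Gaussian density with standard deviation s."""
    return math.exp(-r * r / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)


def lgn_response(r, center=1.0, surround=2.0):
    """Rectified response of an ON-center LGN cell to a thin bright line
    passing at distance r from its center (line integral of a balanced
    difference of 2-D Gaussians)."""
    return max(0.0, gauss1d(r, center) - gauss1d(r, surround))


def simple_cell(theta_deg, positions=range(-3, 4)):
    """Feedforward simple cell summing LGN cells centered at (0, y).

    theta is the angle of the stimulus line (through the origin) from
    vertical; its distance from the point (0, y) is |y * sin(theta)|."""
    s = math.sin(math.radians(theta_deg))
    return sum(lgn_response(abs(y * s)) for y in positions)


angles = range(0, 91, 15)
tuning = {a: simple_cell(a) for a in angles}
for a, v in tuning.items():
    print(f"{a:3d} deg  {v:.3f}")
```

Orientation selectivity here comes entirely from the geometry of the converging LGN inputs: a line aligned with the row of LGN centers excites all of them, while other orientations cross only one or two.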

Motion processing

The analysis of motion is also fairly well studied. This includes problems such as the analysis of motion information in visual areas such as MT, as well as the recognition of biological motion (e.g., an object or person moving left or right). A number of models (Nowlan and Sejnowski 1995; Rust, Mante et al. 2006) have quantitatively described the properties of motion-sensitive neurons in MT, a visual area involved in motion processing. In one of these models (Rust, Mante et al. 2006), itself an evolution of previous models of MT, the computation is performed in two stages, corresponding to neurons in cortical areas V1 and MT. Each stage computes a weighted linear sum of its inputs, followed by rectification and divisive normalization. The output of the model corresponds to the steady-state firing rates of a population of MT neurons, which form a distributed representation (population code) of image velocity for each local spatial region of the visual stimulus. The model accounts for a wide range of physiological data.
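
The two-stage structure described above (weighted linear sum, rectification and divisive normalization, applied once for V1 and once for MT) can be sketched as follows. Everything beyond that structure is an assumption made for illustration: von Mises direction tuning for the V1 stage, squaring after rectification, cosine-shaped MT weights that become negative for opposite directions, and a 16-direction population. It is not the fitted model of Rust, Mante et al. (2006).

```python
import math

N = 16
DIRS = [2 * math.pi * k / N for k in range(N)]   # preferred directions


def normalize(drives, sigma=0.1):
    """Rectify, square, and divide by the pooled activity."""
    r = [max(0.0, d) ** 2 for d in drives]
    pool = sum(r)
    return [v / (sigma**2 + pool) for v in r]


def v1_stage(stim_dir, kappa=2.0):
    # Linear drive of direction-tuned V1 units (von Mises tuning assumed),
    # followed by rectification and normalization.
    drives = [math.exp(kappa * (math.cos(d - stim_dir) - 1)) for d in DIRS]
    return normalize(drives)


def mt_stage(v1):
    # Each MT unit takes a weighted sum of V1 outputs; the cosine weights are
    # excitatory near its preferred direction and inhibitory (negative) for
    # opposite directions. Then rectification and normalization again.
    drives = [sum(math.cos(dj - di) * vi for di, vi in zip(DIRS, v1)) for dj in DIRS]
    return normalize(drives)


stim = DIRS[5]
mt = mt_stage(v1_stage(stim))
best = max(range(N), key=lambda j: mt[j])
print(f"most active MT unit prefers {math.degrees(DIRS[best]):.1f} deg")
```

The MT population output is a distributed code whose peak sits at the stimulus direction; the negative weights sharpen that code by suppressing units tuned to the opposite direction.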

Models of learning and development

Learning is arguably the key to understanding intelligence (Poggio & Smale, 2003). One of the most striking features of the cortex is its ability to wire itself. Understanding how the visual cortex wires up during development and how plasticity refines connections into adulthood is likely to provide necessary constraints for computational models of visual processing. An active area of research concerns the learning of invariances in the visual cortex, such as invariance to 2D transformations (e.g., translation and scale) as well as to 3D pose (Foldiak 1991; see also invariance learning and slow feature analysis). Several models of self-organization of brain function based on Kohonen maps have been described as plausible mechanisms for the development of cortical columns, ocular dominance, direction selectivity, spatial frequency selectivity, disparity and even color.

Models of high level functions

The most studied visual function is probably object recognition, i.e., our ability to assign a label or meaning to an image of an object irrespective of the precise size, position, illumination or context and clutter. The main computational problem in object recognition is achieving invariance while preserving selectivity. It is natural to think that the hierarchical organization of cortex into a sequence of visual areas serves to achieve a tradeoff between selectivity and invariance, by building into neurons at successively higher levels an increasing degree of invariance to image transformations such as translations and scale changes. Many of the models are feedforward (note that they allow for recurrent circuits within areas but not between areas; the term feedforward is used here with a different meaning than earlier in the section about circuits) and are restricted to describing at most the first ~100 ms of information processing after an image is flashed on the retina. Of course, a model of vertebrate vision must take into account multiple fixations, image sequences, as well as top-down signals, attentional effects and the structures mediating them (e.g., the extensive back-projections present throughout cortex).

Action recognition and body movements

Several models exist that try to account for the neural mechanisms involved in the processing of dynamic body stimuli (see Giese & Poggio (2003)). These models are based on hierarchical neural architectures, including detectors that extract form or motion features from image sequences. Position and scale invariance are accounted for by pooling neural responses along the hierarchy. Such models have been shown to reproduce several properties of neurons that are selective for body movements, as well as behavioral and brain imaging data (Giese & Poggio, 2003). Recent work has shown that biologically inspired architectures for the recognition of body movements can perform in the range of the best non-biological algorithms in computer vision (Jhuang et al., 2007).

Face processing

It is becoming well accepted that a network of patches of visual cortex, mostly in IT, may constitute a system that processes various aspects of face recognition – from the detection of a face in an image to its identification and the classification of its expression (see Livingstone & Tsao 2008). It is still unclear whether computational strategies similar to the models proposed for general object recognition (see above), but with different parameters (e.g., the number and selectivity of tuned cells and the spatial extent of the features coded by neurons at intermediate levels), can account for the properties of face neurons and for the psychophysical signatures of face perception.

Attention and eye movements

<review> Obviously, I love this stuff. Yet, is a saliency map part of visual cortex? In case you decide to keep it, the current developments on processing natural stimuli are relevant. Then this section is on par with the general problems addressed in the other parts of this article. Maybe you find Einhäuser, W., & König, P. (2010). Getting real-sensory processing of natural stimuli Current Opinion in Neurobiology, 20(3), 389–395. doi:10.1016/j.conb.2010.03.010 relevant </review>

Several theoretical proposals and computational models have been put forward to explain the main functional and computational role of visual attention. One of the first computational models of attention was the saliency model by Koch & Ullman (1985). An important proposal by Tsotsos (1997) is that attention reflects evolution's attempt to fix the processing bottleneck in the visual system (Broadbent 1958) by directing the finite computational capacity of the visual system preferentially to relevant stimuli within the visual field while ignoring everything else.

Several computational models have attempted to account for specific behavioral and physiological effects of attention. Behavioral effects include pop-out of salient objects (Itti et al. 1998), top-down bias of target features (Wolfe 2007), influence from scene context (Torralba 2003) and serial vs. parallel search effects. Physiological effects include multiplicative modulation of neuronal responses under spatial and feature-based attention. Several studies have shown that image-based bottom-up cues can capture attention, particularly during free viewing. Locations where the stimulus differs significantly from the rest of the image are said to 'pop out'. In Itti et al. (1998), center-surround differences across the color, intensity and orientation dimensions are used as a measure of saliency. These models, however, cannot account for the task dependency of eye movements: depending on the search task, human eye movements may differ substantially, even when the stimuli are identical.

A seminal proposal for how top-down visual search may operate is the Guided Search model of Wolfe (2007), according to which the various feature maps are weighted by their relevance to the task at hand and then combined into a single saliency map. Several approaches have been proposed that build on Wolfe's model. In natural scenes, spatial cues need not come from direct cueing; they may also be derived indirectly from context (Torralba 2003). Spatial relations between objects, and their locations within a scene, have been shown to play a significant role in visual search and object recognition.
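The weighting idea can be sketched in a few lines of Python/NumPy. This is a toy illustration, not Guided Search 4.0 itself: the feature names, the binary feature maps, and the rule of adding a fixed top-down gain (the hypothetical `top_down_gain` parameter) to every map that matches the target are simplifying assumptions.

```python
import numpy as np

def priority_map(feature_maps, target_features, top_down_gain=2.0, bottom_up_weight=1.0):
    """Toy Guided-Search-style priority map.

    feature_maps: dict mapping a feature name (e.g. "red") to an (H, W) array.
    target_features: set of feature names that describe the current target.
    Every map contributes bottom-up with the same weight; maps for target
    features receive an extra top-down gain before all maps are summed.
    """
    shape = next(iter(feature_maps.values())).shape
    priority = np.zeros(shape)
    for name, fmap in feature_maps.items():
        weight = bottom_up_weight + (top_down_gain if name in target_features else 0.0)
        priority += weight * fmap
    return priority

def make_scene():
    """Three items on a 10 x 10 grid, each coded by binary feature maps."""
    maps = {name: np.zeros((10, 10)) for name in ("red", "green", "horizontal", "vertical")}
    items = {(2, 2): ("red", "horizontal"),
             (7, 7): ("green", "vertical"),
             (4, 8): ("red", "vertical")}
    for (row, col), features in items.items():
        for f in features:
            maps[f][row, col] = 1.0
    return maps

def first_fixation(priority):
    """Location of the highest priority, i.e. where search would go first."""
    row, col = np.unravel_index(np.argmax(priority), priority.shape)
    return int(row), int(col)

scene = make_scene()
print(first_fixation(priority_map(scene, {"red", "vertical"})))  # (4, 8)
print(first_fixation(priority_map(scene, {"green"})))            # (7, 7)
```

The same scene yields different first fixations for different targets, the kind of task dependence that a purely bottom-up saliency map cannot produce.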

Hierarchical models of visual cortex

<review> Now, this seems to be a most important chapter. Wiki style it should move way up. Please give a more complete coverage of relevant literature (see above). </review>
Figure 1: The model of Riesenhuber and Poggio (1999) consists of a hierarchy of layers with tuning (‘S’ units) and pooling (‘C’ units) operations. Tuning provides pattern specificity, while pooling over an appropriate set of templates (for instance, the same template at different positions) provides invariance to transformations such as translation.

Most models of visual function involve a single visual area or a small number of them. Very few models consider most visual areas, and all of these are models of visual recognition. It is natural, then, to consider object recognition as the best example of hierarchical models that reflect one of the most obvious features of cortex: the hierarchy of areas from V1 to IT.

Feedforward hierarchical models have a long history, beginning in the 1970s with Marko and Giebel’s homogeneous multilayered architecture and, later, Fukushima’s Neocognitron. One of their key computational mechanisms originates from the pioneering physiological studies and models of Hubel and Wiesel (see Figure 1). The basic idea is to build an increasingly complex and invariant object representation in a hierarchy of stages by progressively integrating, or pooling, convergent inputs from lower levels. Since then, many models have been proposed (see Serre & Poggio 2010 for a recent review) that extend the classical simple-to-complex cell model of Hubel & Wiesel (see Box 1) to extrastriate areas, and they have been shown to account for a host of experimental data. Such models assume two functional classes of cells, simple and complex, and make specific predictions about their respective wiring and resulting functions.
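The alternation of tuning and pooling can be illustrated with a toy four-stage network (S1, C1, S2, C2) in Python/NumPy. It is a sketch in the spirit of the model in Figure 1, not a reimplementation of it: the 3 x 3 edge filters, the pool size, the single S2 prototype and the Gaussian width `sigma` are assumptions chosen to keep the example small. It requires NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 3 x 3 oriented edge filters standing in for Gabor-like S1 templates.
FILTERS = np.array([
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # vertical edges
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # horizontal edges
    [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]],    # diagonal edges
    [[1, 1, 0], [1, 0, -1], [0, -1, -1]],    # anti-diagonal edges
], dtype=float)

def s1(img):
    """Tuning: rectified response of each oriented filter at every position."""
    win = sliding_window_view(img, (3, 3))                   # (H-2, W-2, 3, 3)
    return np.abs(np.einsum("ijkl,fkl->fij", win, FILTERS))  # (4, H-2, W-2)

def c1(s, pool=3):
    """Pooling: max over pool x pool neighbourhoods, separately per orientation."""
    win = sliding_window_view(s, (pool, pool), axis=(1, 2))  # (4, H', W', pool, pool)
    return win.max(axis=(-2, -1))

def s2(c, prototype, sigma=1.0):
    """Tuning: Gaussian similarity between every C1 patch and a stored prototype."""
    size = prototype.shape[-1]
    win = sliding_window_view(c, (size, size), axis=(1, 2))  # (4, H'', W'', size, size)
    win = np.moveaxis(win, 0, 2)                             # (H'', W'', 4, size, size)
    d2 = ((win - prototype) ** 2).sum(axis=(-3, -2, -1))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def c2(s):
    """Pooling: global max over positions, giving a translation-invariant value."""
    return float(s.max())

def c2_response(img, prototype):
    """Full S1 -> C1 -> S2 -> C2 pass for one prototype."""
    return c2(s2(c1(s1(img)), prototype))

def make_t(top, left, size=24):
    """A 5 x 5 'T' shape drawn on a blank size x size image."""
    img = np.zeros((size, size))
    img[top, left:left + 5] = 1.0       # crossbar
    img[top:top + 5, left + 2] = 1.0    # stem
    return img

# Store one C1 patch of a "T" as the S2 prototype, then probe a shifted copy.
proto = c1(s1(make_t(8, 8)))[:, 7:10, 7:10]
print(c2_response(make_t(8, 8), proto), c2_response(make_t(13, 11), proto))  # 1.0 1.0
```

Because the C2 stage takes a maximum over all positions, shifting the T leaves its response unchanged, whereas a different pattern (for example a single vertical bar) gives a lower value. Selectivity comes from the tuning stages and invariance from the pooling stages.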

Now, why hierarchies? The answer, for models in the Hubel and Wiesel spirit, is that the hierarchy provides a solution to the trade-off between invariance and selectivity in visual recognition <review> this needs more explanation. </review>. Each tuning stage makes units more selective by combining inputs into more complex features, and each pooling stage makes them more invariant by combining responses to the same feature at different positions and scales; interleaving the two builds up invariance gradually without discarding the selectivity needed to tell objects apart. Hierarchical organization in cortex is, however, not limited to the visual pathways, and thus a more general explanation may be needed <review> hierarchies are mainly observed in the sensory cortices, which is a lot, but has a common problem. </review>. Interestingly, from the point of view of classical learning theory (Poggio & Smale 2003), there is no need for architectures with more than three layers. So why hierarchies? One reason may be efficiency, for example in the use of computational resources: the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks.

There may also be a more fundamental issue: sample complexity, the number of training examples required for good generalization. An obvious difference between the best classifiers derived from learning theory and human learning is the number of examples required in tasks such as object detection. The theory shows that the complexity of the hypothesis space sets the speed limit and the sample complexity for learning. If a task, such as visual recognition, can be decomposed into low-complexity learning tasks for each layer of a hierarchical learning machine, then each layer may require only a small number of training examples. Neuroscience suggests that what humans can learn can be represented by hierarchies that are locally simple. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex.
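The link between hypothesis-space complexity and sample complexity can be made concrete with a standard textbook bound for a finite hypothesis space and a learner that outputs a hypothesis consistent with the training data (the realizable case). The bound is added here for illustration and is not taken from Poggio & Smale (2003); the per-layer line assumes, as the argument above does, that each layer can be trained on its own simpler subtask with hypothesis space of size |H_ℓ|.

```latex
\begin{aligned}
&m \ge \frac{1}{\varepsilon}\Big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\Big)
  \;\Longrightarrow\;
  \Pr\big[\text{every consistent } h \in \mathcal{H} \text{ has } \mathrm{err}(h) \le \varepsilon\big] \ge 1-\delta,\\[4pt]
&m_\ell \ge \frac{1}{\varepsilon_\ell}\Big(\ln|\mathcal{H}_\ell| + \ln\frac{1}{\delta_\ell}\Big),
  \qquad \ell = 1,\dots,L .
\end{aligned}
```

Under these assumptions, the number of examples each layer needs grows only with the complexity of its own, locally simple, hypothesis space.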

Backprojections, image inference and how visual cortex really works

Most models of visual cortex assume feedforward connections between visual areas. Consider again the obvious case of object recognition. Starting with the pioneering work of Fukushima and the Neocognitron, a number of models have been developed that describe the visual processing required for recognition from V1 up to IT, thus spanning all the main areas of the ventral stream. All these models are purely feedforward hierarchical models. In reality, visual cortex has massive backprojections, between areas and not just within them, which actually outnumber the forward projections. Thus a major question for modeling visual cortex revolves around the role of backprojections and the related fact that vision is more than object detection.

Vision requires interpreting and parsing visual scenes. A human observer can answer an essentially unlimited number of questions about an image (one could in fact imagine a Turing test for vision). From this point of view, should one think of the ventral stream as performing a continuous, task-independent inference about the world between eye movements? In this scenario the ventral stream would continuously build a complex data structure representing the components of the scene and their relationships. Or should one follow the more conventional intuition that visual cortex acts as a high-resolution buffer, with areas higher than visual cortex performing inference in a task-dependent way only when needed, through a search involving attention and eye movements?

There are at least two broad classes of models that include backprojections and could account for image inference abilities: attentional models, and hierarchical generative models such as the one proposed by Lee & Mumford (2003).

In the first case, the basic idea, which is not new and is more or less accepted in these general terms, is that one key role of backprojections is to select and modulate specific connections in early areas in a top-down fashion (in addition to managing and controlling learning processes). In this extension, the backprojections mediate attentional modulations in lower areas; they also route information from specific lower areas to specialized, task-dependent routines or classifiers running in higher areas such as PFC. This class of models corresponds to the belief that our subjective feeling of the richness of vision is based on the ability to look at the image again when needed, possibly by shifting attention instead of shifting gaze.

Thus backprojections may not only control the gain of specific neurons (the simplest model of a spotlight of attention) but may also effectively run routines for reading out specific task-dependent information from IT. For instance, one program may correspond to the question “is the object in the scene an animal?”, while another may read out information about the size of the object in the image from activity in IT. Backprojections may also select programs in areas lower than IT (probably by modulating connection weights) to carry out image inference tasks (is the animal to the right or to the left of the tree?).
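One published formalization of attentional gain control is the normalization model of attention (Reynolds & Heeger 2009, listed in the references). The Python/NumPy sketch below is a simplified, spatial-only, one-dimensional version written for illustration: the Gaussian attention field, the suppressive pooling width, the semi-saturation constant `sigma` and all parameter values are assumptions, and the feature dimension of the original model is omitted.

```python
import numpy as np

def gaussian(x, center, width):
    """Unnormalized Gaussian profile."""
    return np.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def normalization_model_response(positions, stim_drive, attend_at, attn_gain=2.0,
                                 attn_width=2.0, supp_width=10.0, sigma=0.1):
    """Simplified 1-D spatial sketch of the normalization model of attention.

    R = (A * E) / (S + sigma), where E is the stimulus drive, A an attention
    field (1 everywhere, rising to attn_gain at the attended location), and S
    the attention-weighted drive pooled by a broad Gaussian suppressive kernel.
    """
    attention = 1.0 + (attn_gain - 1.0) * gaussian(positions, attend_at, attn_width)
    excitatory = attention * stim_drive
    kernel = gaussian(positions[:, None], positions[None, :], supp_width)
    kernel /= kernel.sum(axis=1, keepdims=True)   # each row averages its neighbourhood
    suppressive = kernel @ excitatory
    return excitatory / (suppressive + sigma)

positions = np.arange(100.0)
stimulus = gaussian(positions, 30, 3.0) + gaussian(positions, 70, 3.0)  # two identical stimuli
attended = normalization_model_response(positions, stimulus, attend_at=30)
neutral = normalization_model_response(positions, stimulus, attend_at=30, attn_gain=1.0)
print(attended[30] / attended[70], neutral[30] / neutral[70])
```

With the attention field centered on one of two identical stimuli, the attended stimulus evokes a larger response than the unattended one; setting `attn_gain=1.0` removes the difference.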

During normal vision, backprojections are likely to control, in a dynamic way, routines running at all levels of the visual system across attentional shifts (and fixations). In particular, information from small regions of the visual field may be routed from the appropriate early visual area (as early as V1) to circuits specialized for any number of specific tasks, through covert attentional shifts controlled from the top (Poggio 1984). This highly speculative framework fits best with the point of view described by Hochstein & Ahissar (2002), who suggested that explicit vision advances in the reverse hierarchical direction: it starts with “vision at a glance” (corresponding to our “immediate recognition”) at the top of the cortical hierarchy and returns downward as needed in a “vision with scrutiny” mode, in which reverse-hierarchy routines focus attention on specific, active, low-level units.

The emphasis of this first class of models is thus somewhat different from that of hierarchical generative models, which are related to prediction-verification recursions, an approach known in AI as “hypothesis-verification” (Mumford 1996; Rao & Ballard 1999; Hawkins & Blakeslee 2002). Hierarchical generative models assume that the main function of the backprojections is to carry top-down hypotheses so that they can be compared with bottom-up sensory information (the idea is old; for a modern version see Hawkins & Blakeslee 2002).
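The hypothesis-verification loop can be sketched as a minimal one-level, linear version of predictive coding in the spirit of Rao & Ballard (1999). This Python/NumPy example is illustrative only: the generative weights `U` are fixed rather than learned, the prior is a simple Gaussian (L2) penalty, and the step size and iteration count are assumptions.

```python
import numpy as np

def predictive_coding_inference(x, U, n_steps=200, lr=0.1, prior_weight=0.01):
    """Minimal one-level, linear predictive-coding loop with fixed weights U.

    The higher level holds a hypothesis r. Feedback carries the prediction U @ r;
    feedforward carries the residual x - U @ r, which updates r by gradient
    descent on 0.5 * ||x - U r||^2 + 0.5 * prior_weight * ||r||^2.
    Returns the final hypothesis and the norm of the prediction error per step.
    """
    r = np.zeros(U.shape[1])
    errors = []
    for _ in range(n_steps):
        prediction = U @ r                 # top-down "synthetic" input
        error = x - prediction             # bottom-up prediction error
        errors.append(float(np.linalg.norm(error)))
        r = r + lr * (U.T @ error - prior_weight * r)
    return r, errors

rng = np.random.default_rng(0)
U = rng.normal(size=(16, 4))
U /= np.linalg.norm(U, axis=0)   # unit-norm columns keep the step size stable
r_true = rng.normal(size=4)
x = U @ r_true                   # input generated by a known hidden cause
r_hat, errors = predictive_coding_inference(x, U, n_steps=500)
print(errors[0], errors[-1])     # prediction error before vs. after inference
```

The prediction error shrinks as the higher-level hypothesis `r` comes to explain the input, which is the sense in which top-down predictions are verified against bottom-up data.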

One of the most attractive versions of this class of models is due to Lee & Mumford (2003). In their Bayesian framework, the recurrent feedforward/feedback loops in the cortex serve to integrate top-down contextual priors and bottom-up observations so as to implement concurrent probabilistic inference along the visual hierarchy. Here recognition proceeds by iteration: higher levels in the ventral stream generate a guess about the meaning of the incoming signal and feed it back to lower levels, which generate what is in essence a synthetic image that can be compared with the image delivered by the retina. In its strongest form, this “generative” point of view implies that, after a series of bottom-up and top-down iterations, neurons in the ventral pathways represent mutually and globally consistent conditional probabilities of certain features, given the sensory input and the high-level hypotheses or priors.
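The core of this Bayesian combination can be illustrated with a toy two-level chain (image, V1, V2) with two discrete states per level, written in Python/NumPy. The states, the probabilities and the separate "context" evidence for V2 are hypothetical; on a chain this small the beliefs can be computed exactly in one pass, whereas Lee & Mumford's framework spans many levels, where beliefs are refined over repeated bottom-up and top-down exchanges.

```python
import numpy as np

def normalize(p):
    """Rescale a non-negative vector so it sums to 1."""
    return p / p.sum()

def two_level_beliefs(likelihood_v1, p_v1_given_v2, prior_v2, context_v2):
    """Exact beliefs for a toy two-level chain: image -> V1 -> V2.

    likelihood_v1[i]     = P(image | V1 state i)       (bottom-up evidence)
    p_v1_given_v2[j, i]  = P(V1 state i | V2 state j)  (generative link)
    prior_v2[j]          = P(V2 state j)
    context_v2[j]        = evidence for V2 state j from other inputs (context)
    """
    # Top-down message to V1: predicted distribution of V1 given V2's evidence.
    top_down = normalize(prior_v2 * context_v2) @ p_v1_given_v2
    belief_v1 = normalize(likelihood_v1 * top_down)
    # Bottom-up message to V2: how well each V2 hypothesis explains V1's evidence.
    bottom_up = p_v1_given_v2 @ likelihood_v1
    belief_v2 = normalize(prior_v2 * context_v2 * bottom_up)
    return belief_v1, belief_v2

likelihood_v1 = np.array([0.55, 0.45])   # weak evidence for "edge" vs "no edge"
p_v1_given_v2 = np.array([[0.9, 0.1],    # V2 = "contour": an edge is likely
                          [0.2, 0.8]])   # V2 = "background": an edge is unlikely
prior_v2 = np.array([0.5, 0.5])
b1_contour, _ = two_level_beliefs(likelihood_v1, p_v1_given_v2, prior_v2, np.array([0.9, 0.1]))
b1_background, _ = two_level_beliefs(likelihood_v1, p_v1_given_v2, prior_v2, np.array([0.1, 0.9]))
print(round(b1_contour[0], 3), round(b1_background[0], 3))  # 0.856 0.311
```

The same ambiguous bottom-up input yields a confident "edge" belief in V1 when the higher level favors a contour, and a weak one when it favors background.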

State of the field and final remarks

<review> you end with comments not on developments in computational neuroscience, but in anatomy!? </review>

Feedforward models of recognition and of other visual abilities have been useful for exploring the power of fixed hierarchical organization, as originally suggested by Hubel and Wiesel. They have led, for instance, to algorithms competitive with the best computer vision systems (Serre & Poggio 2010). Their limitations, however, are becoming increasingly obvious. Not only are top-down effects key to normal, everyday vision, but backprojections are also likely to be a key part of what cortex computes and how. Unfortunately, relatively little is known experimentally about backprojections and their role. Thus one cannot escape the conclusion that the next critical step towards better models of the visual cortex has to wait for new experimental techniques that can manipulate identified populations of neurons in a reversible, transient, deliberate and delicate manner. New techniques are being developed at a rapid pace, ranging from new optical imaging methods to optical activation and inactivation of specific neurons to viral vectors.

Additional resources

The visual cortex by Matthew Schmolesky at University of Utah

Eye, Brain and Vision (online book) by David Hubel at Harvard University

Computational Vision: Information Processing in Perception and Visual Behavior, by H.A. Mallot. MIT Press, Cambridge (MA).

Computational models of visual processing, M. S. Landy & J. A. Movshon (editors). MIT Press, Cambridge (MA).

References

Adelson, E. H. and J. R. Bergen (1985). "Spatiotemporal energy models for the perception of motion." J Opt Soc Am A 2(2): 284-99.

Barrow, H. G. and J. M. Tenenbaum (1981a). "Computational vision." Proceedings of the IEEE 69(5): 572-595.

Borgers, C. and N. Kopell (2003). "Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity." Neural Comput 15(3): 509-38.

Broadbent, D. E. (1958). Perception and communication.

Buia, C. I. and P. H. Tiesinga (2008). "Role of interneuron diversity in the cortical microcircuit for attention." J Neurophysiol 99(5): 2158-82.

Bullier, J., J. M. Hupe, A. James, and P. Girard (1996). "Functional interactions between areas V1 and V2 in the monkey." J Physiol Paris 90: 217-220.

Callaway, E. M. (1998). "Local circuits in primary visual cortex of the macaque monkey." Annu Rev Neurosci 21: 47-74.

Carandini, M. and D. J. Heeger (1994). "Summation and division by neurons in primate visual cortex." Science 264(5163): 1333-6.

Chikkerur, S., T. Serre, C. Tan, and T. Poggio (2010). "What and where: A Bayesian inference theory of attention." Vision Res 50(22): 2233-47. doi:10.1016/j.visres.2010.05.013.

Douglas, R. J. and K. A. Martin (1991). "A functional microcircuit for cat visual cortex." J Physiol 440: 735-69.

Foldiak, P. (1991). "Learning invariance from transformation sequences." Neural Comp., 3:194–200.

Giese, M. and T. Poggio (2003). Neural mechanisms for the recognition of biological movements and action. Nat. Rev. Neurosci., 4:179–192.

Geisler, W. S. (2008). "Visual perception and the statistical properties of natural scenes." Annu Rev Psychol 59: 167-192.

Grimson, W. E. L. (1982). "A computational theory of visual surface interpolation." Philos Trans R Soc Lond B 298: 395-427.

Grossberg, S. (2005). "Linking attention to learning, expectation, competition, and consciousness" in Neurobiology of attention, pages 652–662. Elsevier, San Diego.

Grossberg, S. (2007). "Towards a unified theory of neocortex: Laminar cortical circuits for vision and cognition." In Computational Neuroscience: From Neurons to Theory and Back Again, eds: Paul Cisek, Trevor Drew, John Kalaska; Elsevier, Amsterdam, pp. 79-104.

Hahnloser, R. H., R. Sarpeshkar, et al. (2000). "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit." Nature 405(6789): 947-51.

Hamker, F. (2004). "A dynamic model of how feature cues guide spatial attention." Vision Research vol. 44 (5) pp. 501-521

Hawkins, J. and S. Blakeslee (2002). "On Intelligence." New York, Times Books, Holt.

Heeger, D. J. (1993). "Modeling simple-cell direction selectivity with normalized, half-squared, linear operators." J Neurophysiol 70(5): 1885-98.

Hochstein, S. and M. Ahissar (2002). "View from the top: hierarchies and reverse hierarchies in the visual system." Neuron 36(5): 791-804.

Hubel, D. H. and T. N. Wiesel (1962). "Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex." J Physiol 160: 106-154.

Hubel, D. H. and T. N. Wiesel (1977). "Ferrier lecture: Functional architecture of macaque monkey visual cortex." Proc R Soc Lond B 198: 1-59.

Hurlbert, A. (1989). "The computation of color." PhD thesis, Harvard Medical School / Massachusetts Institute of Technology, Department of Brain & Cognitive Sciences.

Itti, L., C. Koch and E. Niebur (1998). "A model of saliency-based visual attention for rapid scene analysis." IEEE Trans Pattern Anal Mach Intell 20(11): 1254-1259.

Felleman, D. J. and D. C. Van Essen (1991). "Distributed hierarchical processing in the primate cerebral cortex." Cereb Cortex 1: 1-47.

Kersten, D., Mamassian, P., & Yuille, A. (2004). "Object perception as Bayesian Inference." Annual Review of Psychology, 55, 271-304.

Koch, C. and S. Ullman (1985). "Shifts in selective visual attention: towards the underlying neural circuitry." Hum Neurobiol 4(4): 219-27.

Koch, C. (1999). "Biophysics of Computation: Information Processing in Single Neurons." Oxford University Press. ISBN 0-19-510491-9.

Kouh, M. and T. Poggio (2008). "A canonical neural circuit for cortical nonlinear operations." Neural Comput 20(6): 1427-51.

Kveraga, K., J. Boshyan, et al. (2007). "Magnocellular projections as the trigger of top-down facilitation in recognition." J Neurosci 27(48): 13232-40.

Lee, D. K., L. Itti, et al. (1999). "Attention activates winner-take-all competition among visual filters." Nat Neurosci 2(4): 375-81.

Lee, T. S. and D. Mumford (2003). "Hierarchical Bayesian inference in the visual cortex." J Opt Soc Am A Opt Image Sci Vis 20(7): 1434-48.

Logothetis, N. K. (2008). "What we can do and what we cannot do with fMRI." Nature 453: 869 - 878.

Marr, D. and T. Poggio (1976). "Cooperative computation of stereo disparity." Science 194(4262): 283-7.

Marr, D. and T. Poggio (1977). “From Understanding Computation to Understanding Neural Circuitry” In: Neuronal Mechanisms in Visual Perception, E. Poppel, R. Held and J.E. Dowling (eds.), Neurosciences Res. Prog. Bull., 15, 470-488.

Marr, D. and T. Poggio (1979). "A computational theory of human stereo vision." Proc R Soc Lond B Biol Sci 204(1156): 301-28.

Miller, K. D., and T.W. Troyer (2002). "Neural Noise Can Explain Expansive, Power-Law Nonlinearities in Neural Response Functions." Journal of Neurophysiology 87(2): 653-659.

Mumford, D. (1996). "Pattern theory: a unifying perspective." New York, NY USA, Cambridge University Press

Najemnik, J. and Geisler, W.S. (2005). "Optimal eye movement strategies in visual search." Nature, 434, 387-391.

Niebur, E. and C. Koch (1994). "A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons." J Comput Neurosci 1(1-2): 141-58.

Nowlan, S. J. and T. J. Sejnowski (1995). "A selection model for motion processing in area MT of primates." J Neurosci 15(2): 1195-214.

Olshausen, B. A. and D. J. Field (1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature vol. 381 (6583) pp. 607-9.

Poggio, T. (1984). "Routing Thoughts." Massachusetts Institute of Technology. Artificial Intelligence Laboratory, AI Working Paper (#258).

Poggio, T. and E. Bizzi (2004). "Generalization in vision and motor control." Nature 431: 768-774.

Poggio, T. and S. Smale (2003). "The Mathematics of Learning: Dealing with Data." Notices of the American Mathematical Society (AMS) 50(05): 537-544.

Poggio, T., W. Reichardt and W. Hausen (1981). "A neuronal circuitry for relative movement discrimination by the visual system of the fly." Naturwissenschaften 68(9): 43-466.

Rao, R. P. N. and D. H. Ballard (1999). "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects." Nat Neurosci 2(1): 79-87.

Read, J. C. A., A. J. Parker and B. G. Cumming (2002). "A simple model accounts for the response of disparity-tuned V1 neurons to anticorrelated images." Visual Neuroscience 19(06): 735-753.

Reichardt, L. F. and P. Early (1983). "Neurobiology." Science 219(4589): 1213.

Reynolds, J. H and D. J. Heeger (2009). "The normalization model of attention." Neuron vol. 61 (2) pp. 168-85.

Reynolds, J. H., L. Chelazzi, et al. (1999). "Competitive mechanisms subserve attention in macaque areas V2 and V4." J Neurosci 19(5): 1736-53.

Riesenhuber, M. and T. Poggio (1999). "Hierarchical models of object recognition in cortex." Nat Neurosci 2(11): 1019-1025.

Riesenhuber, M. and T. Poggio (2000). "Models of object recognition." Nat Neurosci 3 Suppl: 1199-204.

Rosenblatt, F. (1958). "The perceptron: a probabilistic model for information storage and organization in the brain." Psychol Rev 65(6): 386-408. doi:10.1037/h0042519.

Rousselet G. A., S. J. Thorpe and M. Fabre-Thorpe (2003). "Taking the MAX from neuronal responses." Trends Cogn Sci 7: 99-102.

Rust, N. C., O. Schwartz, et al. (2005). "Spatiotemporal elements of macaque V1 receptive fields." Neuron 46(6): 945-56.

Rust, N. C., V. Mante, et al. (2006). "How MT cells analyze the motion of visual patterns." Nat Neurosci 9(11): 1421-31.

Serre, T. and T. Poggio (2010). "Reverse-engineering the brain." Communications of the ACM 53(10): 54-61.

Serre, T., G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich and T. Poggio (2007). "A quantitative theory of immediate visual recognition." Prog Brain Res 165: 33-56.

Serre, T., A. Oliva and T. Poggio (2007). "A feedforward architecture accounts for rapid categorization." Proc Natl Acad Sci U S A 104(15): 6424-9.

Swindale, N. V. (2000) "How many maps are there in visual cortex?" Cereb. Cortex 10, 633–643.

Teich, A. F. and N. Qian (2006). "Comparison among some models of orientation selectivity." J Neurophysiol 96(1): 404-19.

Thorpe, S. J. (2002). "Ultra-rapid scene categorization with a wave of spikes." Proc. of Biologically Motivated Computer Vision: 2nd International Workshop, Tübingen, Germany.

Tiesinga, P. H. and T. J. Sejnowski (2004). "Rapid temporal modulation of synchrony by competition in cortical interneuron networks." Neural Comput 16(2): 251-75.

Torralba, A. (2003). "Modeling global scene factors in attention." J Opt Soc Am A 20(7): 1407-1418.

Torre, V. and T. Poggio (1978). "A synaptic mechanism possibly underlying directional selectivity to motion." Proc R Soc Lond B 202: 409-416.

Tsao, D. Y., N. Schweers, et al. (2008). "Patches of face-selective cortex in the macaque frontal lobe." Nat Neurosci 11(8): 877-9.

Tsao, D. Y. and M. S. Livingstone (2008). "Mechanisms of face perception." Annu Rev Neurosci 31: 411-437.

Tsotsos, J. (1997). "Limited capacity of any realizable perceptual system is a sufficient reason for attentive behavior." Consciousness and cognition, 6(2–3), 429–436.

Ullman, S. (1976). "Filling in the gaps: the shape of subjective contours and a model for their generation." Biol Cybern 25: 1-6.

Ullman, S. (1979). "The Interpretation of Visual Motion." MIT Press, Cambridge, MA.

VanRullen, R., R. Guyonneau, and S.J. Thorpe (2005). "Spike times make sense." Trends in Neurosci., 28(1).

Weiss, Y., E. P. Simoncelli and E. H. Adelson (2002). "Motion illusions as optimal percepts." Nat Neurosci 5(6): 598-604.

Williams, M. A., C. I. Baker, H. P. Op de Beeck, W. M. Shim, S. Dang, C. Triantafyllou and N. Kanwisher (2008). "Feedback of visual object information to foveal retinotopic cortex." Nat Neurosci 11(12): 1439-1445.

Wolfe, J. M. (2007). "Guided Search 4.0: Current progress with a model of visual search." In Integrated Models of Cognitive Systems, pp. 99-119.

Yuille, A., & Kersten, D. (2006). "Vision as Bayesian inference: analysis by synthesis?" Trends Cogn Sci, 10(7), 301-308.

Yuille, A. and N.M Grzywacz (1988). "A computational theory for the perception of coherent visual motion." Nature 333: 71-74.
