Computational foundations for attentive processes
From Scholarpedia
| John K. Tsotsos and Neil D. B. Bruce (2008), Scholarpedia, 3(12):6545. | revision #48025 [link to/cite this article] | |||||||||||||||||||
Revision as of 20:02, 20 September 2008; view current revision
←Older revision | Newer revision→
Curator: Dr. John K. Tsotsos, Centre for Vision Research, York University, Canada
Curator: Dr. Neil D. B. Bruce, Centre for Vision Research, York University, Canada
Computational foundations for attentive processes - a formal, theoretical basis for why algorithms that address perceptual or cognitive problems require attentional mechanisms in order for their realizations to perform at human-like performance levels.
Contents |
Introduction
Although attention is a human ability we all intuitively think we understand, the Computational Foundations for Attentive Processes in the brain or in computer systems are not quite as obvious. Notions such as capacity limits pervade the attention literature but remain vague. This presentation, focusing on vision and sensory perception mostly, attempts to make these more concrete. Through mathematical proofs, it is possible to derive the necessity of attentive processes, and through algorithmic approximations and processing optimizations it is possible to discover realizations given either biological or computational resource constraints. Perhaps the most important conclusion is that the brain is not solving the generic perception problem, and by extension, the generic cognition problem. Rather, the generic problem is re-shaped through approximations so that it becomes solvable by the amount of processing power available in the brain.
The human cognitive ability to attend has been widely researched in cognitive and perceptual psychology, neurophysiology and in computational systems. Regardless of discipline, the core issue has been identified to be Information Reduction. Humans, and many other animals as well, are faced with immense amounts of information and are limited in their ability to process all of it by the size of our brains. It is not simply that there is too much information; the problem is that each component of each stimulus can be matched to many different objects and scenes in our memory resulting in a combinatorial explosion of potential interpretations, as is caricatured in Figure 1.
The basic idea that humans can be viewed as limited-capacity information processing systems was first proposed by Broadbent in his 1958 book Perception and Communication. For example, attention appears in early artificial intelligence (AI) systems explicitly as a focus of attention mechanism or implicitly as a search-limiting heuristic motivated primarily from practical concerns – computers were not powerful enough and one had to do whatever possible to limit the amount of processing required so that resources could be allocated to the most relevant tasks. All search methods that involve ordering or pruning of the search space perform information reduction. Information reduction is needed because the size of the full search space for a problem does not match the computing resources and system performance requirements and thus a brute-force search scheme is not sufficient. The classic mechanisms include: Branch-and-bound search, Hill-climbing, Best-first search, A* search, and more. However, motivation from cognitive psychology also made an important impact with the early appearance of a number of systems. Anderson's 1976 ACT system was intended as a general model of cognition. ACT has a focus of attention which changes as nodes in long-term memory are activated and put in working memory and as other nodes are pushed out of working memory. Focus is implemented with a small working memory (capacity limit), with strategies for determining which productions are applicable at any time. Along a very different application domain, Barstow's 1979 automatic programming system PECOS uses intermediate level grouping to focus attention on the relevant and to ignore detail. LIBRA was a system developed by Kant in 1979 for efficient analysis of code. LIBRA has explicit attention and resource management rules. Rules determine how LIBRA's own resources are to be allocated on the basis of greatest utility. A landmark in AI, the 1980 HEARSAY-II system for speech understanding due to Erman, Hayes-Roth, Lesser and Reddy, ranked concepts using goodness-of-fit in order to focus in on the strongest and those with greatest utility. Several computer vision systems also included attention strategies to limit the region of interest that is processed in an image (the earliest being Kelly 1971 in the context of face outline detection) or even the window in time for video sequence input (the first being Tsotsos 1980 for heart motion analysis). There are many more examples that help make the point that efficient matching of input, processing methods and resources has played a major role in the development of computer systems whose performance attempts to match that of humans.
Capacity limits and computational complexity
One of the most frustrating things about studying attention is that research is so often accompanied by vague discussions of capacity limits, bottlenecks, and resource limits. How can these notions be made more concrete? The area of computer science known as Computational Complexity is concerned with the theoretical issues dealing with the cost of achieving solutions to problems in terms of time, memory and processing power as a function of problem size. It thus provides the necessary theoretical foundation on which to base an answer to the capacity question.
It is reasonable to ask whether computational complexity has relevance for real problems.
Stockmeyer & Chandra (1988) present a compelling argument. The most powerful
computer that could conceivably be built could not be larger than the known universe,
could not consist of hardware smaller than the proton, and could not transmit information
faster than the speed of light. Such a computer could consist of at most
pieces of
hardware. It can be proved that, regardless of the ingenuity of its design and the
sophistication of its program, this ideal computer would take at least 20 billion years to
solve certain mathematical problems that are known to be solvable in principle (for
example, the well-known traveling salesman problem with a sufficiently large number of
destinations). Since the universe is probably less than 20 billion years old, it seems safe to
say that such problems defy computer analysis. There exist many real problems for which
this argument applies (see Garey & Johnson, 1979, for a catalog), and they form the
foundation for the theorems presented here.
With respect to neurobiology, many have considered complexity constraints in the past but mostly with coarse, counting arguments (for example, Feldman & Ballard 1982 or Tsotsos 1987). All roads lead to the same conclusions: the brain cannot fully process all stimuli in parallel in the observed response times. But this is like saying there is a capacity limit: This does not constrain a solution. By arguing in this manner we are no closer to knowing what exactly the brain is doing to solve this problem. This paper takes the position that a more formal analysis of vision at the appropriate level of abstraction will help to reveal quantitative constraints on visual architectures and processing. First, however, it is important to address the applicability of this analysis for the neurobiology of the brain.
Human perception/cognition and computation
Can cognition, intelligence, perception or vision be expressed in computational form? Surely, the discovery of explanatory mechanisms that yield observed performance is central to all experimental studies. In a very real sense, such explanations are presented as algorithms, a step-by-step procedure for achieving some result, and algorithms are a central component of the language of computation. But could the brain actually process signals in an algorithmic manner? This non-trivial issue is important because if it could be proved that human brain processes cannot be modeled computationally (and this is irrespective of current computer hardware), then modeling efforts are futile. The jury is still out on this point, but a perhaps overwhelming amount of support has accumulated beginning with a critical theoretical notion, decidability. A proof of decidability is sufficient to guarantee that a problem can be modeled computationally (Davis, 1958, 1965). Decidability should thus not be confused with tractability. Tractability refers to the sort of problem Stockmeyer and Chandra described earlier: a tractable problem is one where enough resources can be found and enough time can be allocated so that the problem can be solved reasonably. An intractable problem may be decidable; but for an undecidable problem, one cannot determine its tractability. Intractable problems are those that have exponential complexity in space and/or time, that is, the mathematical function that relates processing time/space to the size of the input is exponential in that input size. If the input is large, then its role as an exponent leads to the need for enormous amounts of processing resources, as described by Stockmeyer and Chandra above. There are several classes of such problems with differing characteristics and NP-Complete is one of those classes.
To show that vision is decidable, then it must first be formulated as a Decision Problem.
This means that if it is the case that for some problem we wish to know of each element in
a countably infinite set
, whether or not that element belongs to a certain set
which is a
proper subset of
, then that problem can be formulated as a decision problem. Such a
problem is decidable if there exists a Turing Machine which computes ‘yes’ or ‘no’ for
each element of
in answer to the decision question. A Turing Machine is a hypothetical
computing device consisting of a finite state control, a read-write head and a two-way
infinite sequence of labelled tape squares. A program then provides input to the machine,
is executed by the finite state control, and computations specified by the program read and
write symbols on the squares of the tape.
If we look at the perceptual task definitions provided by Macmillan and Creelman (2005), we see that all psychophysical judgments are of one stimulus relative to another - the basic process is comparison. The most basic task is discrimination, the ability to tell two stimuli apart. This certainly looks very much like the decision problem defined above. Macmillan and Creelman define recognition as a discrimination task where a subject is required to say to which of two classes a given a stimulus belongs. Strongly related to this, the visual match problem defined by Tsotsos (1989) - does an image contain an instance of a given target - is an instance of Yashuhara’s Comparing Turing Machine (1971) and thus decidable.
This however is not a proof that all of human vision can be modeled computationally. If no sub-problem of vision could be found to be decidable, then it might be that perception as a whole is undecidable and thus cannot be computationally modeled. But, what if there are other undecidable vision sub-problems? Even if some other aspect of vision is determined to be undecidable, this does not mean that all of vision is also undecidable or that other aspects of perception cannot be modeled computationally. Hilbert’s 10th problem in mathematics and the Halting Problem for Turing Machines are two examples of famous undecidable problems. The former does not imply that mathematics is not possible while the latter does not mean that computers are impossible. It seems that most domains feature both decidable as well as undecidable sub-problems and these co-exist with no insurmountable difficulty.
Once a problem is known to be decidable, it may in fact be an instance of one of the many known NP-Complete problems and proof methods exist for demonstrating this. The proofs need to show that for any possible algorithm or implementation, the time complexity function of a solution is exponential in the size of the input. The analysis is asymptotic; that is, the problem is intractable only as the input size becomes very large. Small values of an exponent (i.e., small input sizes) are often quite realizable.
Some might think that since the brain’s neurons operate in parallel, that parallelization of computations makes the overall process tractable. NP-complete problems are strongly conjectured to not be parallelizable and showing a problem is NP-complete is very strong evidence that the problem is not parallelizable. It is important to note that the problem may still be parallelizable on some of the possible inputs and it might be that all sufficiently small instances may be parallelizable. NP-completeness is definitely a sign that the problem may be hard, it does not rule out efficient algorithms for the search problems the brain must actually compute. We are faced with two choices as a result: can we find those algorithms or ask if the brain is perhaps not even solving the original problem. We approach these questions in the context of vision (since 40% or more of the cortex is devoted to vision, it is a reasonable starting point).
The computational complexity of vision
What is the generic vision problem? For our purposes we will use the following: Given a sequence of images, for each pixel determine whether it belongs to some particular object or other spatial construct, localize all objects in space, detect and localize all events in time, determine the identity of all the objects and events in the sequence, determine relationships among objects and relate all objects and events to the available world knowledge. This section will briefly summarize the steps of an argument that concludes that the generic vision problem as defined here is intractable. If a vision system needs to search through the set of all possible image locations (pixels) and compare them to each element of memory, then without any task guidance or knowledge of the characteristics of the subset it seeks, it cannot know which subset may be more likely than another. As a result, it is the powerset of all locations that gives the number of image subsets to examine, an exponential function. Thus, purely feed-forward, unconstrained visual processing seems to have an inherent exponential nature.
Two abstract problems can be defined to make this more concrete. The important variables
that play a role in the definition are: the number of image locations (
), the number of
object/event prototypes in memory (
) and the number of measurements made at each
image location (
). The first problem is Unbounded Visual Match. This was intended to
model recognition where no task guidance to optimize search is permitted. It corresponds
to recognition with all top-down connections in the visual processing hierarchy removed or
disabled. In other words, this is pure data-directed vision. More precisely, it determines if
there exists a subset of pixels in an image that matches a particular target. The second
problem is Bounded Visual Match. This is recognition with knowledge of a target and
task in advance, such that the knowledge is used to optimize the process. The basic
theorems, proved in (Tsotsos 1989, 1990a) are:
- Theorem 1
- Unbounded (Bottom-Up) Visual Matching (UVM) is NP-Complete, with time complexity an exponential function of
(the worst-case time complexity is
).
- Theorem 2
- Bounded (Task-Directed) Visual Match (BVM) has time complexity linear in
(the worst-case time complexity is
).
The results have important implications. They first tell us that the pure data-directed approach to vision (and in fact to perception in any sensory modality) is computationally intractable in the general case. They also tell us that bounded visual search takes time linearly dependent on the size of the input, something that has been observed in a huge number of experiments. Even small amounts of task guidance can turn an exponential problem into one with linear time complexity.
However, perhaps biological vision systems are designed around average or best-case assumptions and not worst-case assumptions as are the basis of the above theorems; it might be likely that expected case analysis more correctly reflects biological systems. Parodi et al. (1998) addressed this criticism by developing an algorithm for generating random instances of polyhedral scenes and examining median case complexity of labeling the scenes. The key results echo the worst-case results: the unbounded median-case time complexity is exponential in the number of polyhedral junctions while the bounded median-case time complexity is linear in number of junctions. As a result, the objection that worst-case analysis is not appropriate is defused. The key conclusion remains: The application of task knowledge guides the processing and converts an exponential complexity problem into a linear one. This is one of the main forms of attentive processing.
Complexity constrains the architecture for visual processing
How can one deal with an intractable problem in practice? NP-Completeness eliminates the possibility of developing a completely optimal and general algorithm. Garey & Johnson (1979) provide a number of guidelines. The relevant guideline for this analysis is: Use natural parameters to guide the search for approximation algorithms, algorithms that solve the problem but not optimally, providing a solution only to within some specified error tolerance. One way of doing this it to reduce the exponential effect of the largest valued parameters. The process was presented in (Tsotsos 1987, 1988, 1990a, 1990b, 1991).
The natural parameters of the computation for visual search are
,
and
with worstcase
time complexity is
). </math>N</math> is a large number but any reduction leads to linear
improvements.
is also a large number but reduction in </math>P</math> can lead to exponential
improvement, as does reduction in
, however
is not so large a number. We can
conclude that the best improvement would come from reductions in
. What would such
changes look like? Here are several that are straightforward yet sufficient (but not
necessary):
- Hierarchical organization takes search of model space from
to
.
- Search within a pyramidal representation of the image (a layered representation, each layer with decreasing spatial resolution and with bidirectional connections between adjacent layers) operates in a top-down fashion, beginning with a smaller more abstract image, and is then refined locally thus reducing
.
- Spatio-temporally localized receptive fields reduce number of possible receptive fields from
to
(this assumes contiguous receptive fields of all possible sizes centered at all locations in the image array).
- Spatial selectivity can further reduce the
term if one selects the receptive field that is to be processed. This is not a selection of location only, but rather a selection of a local region and its size at a particular location.
- Feature selectivity can further reduce the
term, that is, which subset of all possible features actually is represented in the image or is important for the task at hand.
- Object selectivity can further reduce the
term, reflecting task-specific information.
After applying the first three constraints,
is the worst-case time
complexity. The next three reduce
,
and
all to 1 to bring the expression down to
perhaps its lowest value. These last three are all manifestations of attentive processing.
But how do these actions affect the generic perception problem described above? Hierarchical organization and logical map separation do not affect the nature of the vision problem. However, the other mechanisms have the following effects:
- Pyramidal abstraction affects the problem through the loss of location information and signal combination.
- Spatio-temporally localized receptive fields force the system to look at features across a receptive field instead of finer grain combinations and thus arbitrary combinations of locations are disallowed.
- Attentional selection further limits what is processed in the location, feature and object domains. The knowledge that task-related attentive processes employ to accomplish the selection can be as generic as statistical regularities.
As a result of these, the generic problem as defined earlier has been altered. Unfortunately it is not easy to formally characterize this altered problem. But it is clear that certain aspects of the problem are lost with these approximations.
Extending to cognition and action
Much of human cognition is action-based and actions need sensory input for their guidance. The kind of visual search described above seems ubiquitous within intelligent behavior. Simple behavior may be conceptualized as :
- localize and recognize a stimulus (with or without a pre-specified target)
- link the stimulus to an applicable action
- decide among all possible actions
- generate actuator commands.
If sensory perception is an integral component of intelligence, sensory search problems are integral sub-problems of intelligent behavior.
The first step is exactly the UVM problem. The second step can be solved by table look-up, the table size being the number of possible stimuli times the number of possible behaviors and where the table entries code strength of applicability, perhaps dependent on stimulus characteristics (thus, each stimulus may be associated with more than one behavior). The third step can be done by choosing the largest applicability strength from among all the entries, and the fourth would involve a function call or table-look-up again.
The problems are formalized in Tsotsos (1995) with proofs for the following:
- Theorem 3
- Unbounded Stimulus-Behavior Search is NP-hard.
- Theorem 4
- Bounded Stimulus-Behavior Search has time complexity linear in the number of image locations.
Extensions to time-varying input, active camera systems, and to the sensor planning problem in visual search have all been presented and mirror the above results (Tsotsos 1992, Ye & Tsotsos 1996). The generic, strictly data-directed versions of these problems all exhibit inherent exponential time complexity and the addition of attentive, task-specific guidance is sufficient to reduce this complexity to linear or better.
Conclusions
We have shown that visual search, and by extension any other intelligent behavior that requires some sensory perception as an integral component, has exponential time complexity where the exponents are too large to enable it to be tractably solved in any realization. By re-shaping the problem – through optimizations and approximations – the problem can be solvable with very low complexity. Those optimizations and approximations are all different kinds of attention. Attention, thus, is a set of mechanisms that help optimize the search processes inherent in perception and cognition. What are those mechanisms? Several have already been mentioned, and the full group can be classified as being of three main types with several specializations within each:
Selection spatio-temporal region of interest world/task/object/event model gaze/viewpoint best interpretation/response Restriction task relevant search space pruning location cues fixation points search depth control Suppression spatial/feature surround inhibition inhibition of return
Details on how such mechanisms may manifest themselves in computational systems appear in many of the works cited earlier.
References
- Anderson, J (1976). Language, Memory and Thought. Erlbaum Associates, Hillsdale, NJ.
- Barstow, D R (1979). Knowledge-Based Program Construction. North-Holland, New York.
- Broadbent, D (1958). Perception and Communication. Pergamon Press, London.
- Davis, M (1958). Computability and Unsolvability. McGraw-Hill, New York.
- Davis, M (1965). The Undecidable. Hewlett Raven Press, New York.
- Erman, L; Hayes-Roth, F; Lesser, V and Reddy, R (1980). The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys 12: 213-253.
- Feldman, J and Ballard, D (1982). Connectionist models and their properties. Cognitive Science 6: 205-254.
- Garey, M and Johnson, D (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
- Kant, E (1979). Efficiency Considerations in Program Synthesis: A knowledge-based Approach. Stanford University, Palo Alto, CA.
- Kelly, M (1971). Edge Detection in Pictures by Computer Using Planning. Machine Intelligence 6: 397-409.
- Macmillan, N A and Creelman, C D (2005). Signal Detection Theory: A User's Guide. Routledge, ADDRESS.
- P, Lancewicki; R, Vijh; A, Tsotsos and J K, {{{13}}} (47-75). Empirically-Derived Estimates of the Complexity of Labelling Line Drawings of Polyhedral Scenes. Artificial Intelligence105 1998: Parodi.
- Stockmeyer, L and Chandra, A (1988). Intrinsically difficult problems. Scientific American Trends in Computing JOURNAL 1: 88-97.
- Tsotsos, J K (1980). A Framework for Visual Motion Understanding. University of Toronto, Toronto.
- Tsotsos, J K (1987). A `Complexity Level' Analysis of Vision. Proceedings of International Conference on Computer Vision: Human and Machine Vision Workshop, London.
- Tsotsos, J K (1989). The Complexity of Perceptual Search Tasks. Proc. IJCAI, Detroit. 1571-1577
- Tsotsos, J K (1988). A `complexity level' analysis of immediate vision. International Journal of Computer Vision 1(4): 303-320.
- Tsotsos, J K (1990a). A Complexity Level Analysis of Vision. Behavioral and Brain Sciences 13(3): 423-455.
- Tsotsos, J K (1990b). A Little Complexity Analysis Goes a Long Way. Behavioral and Brain Sciences 13(3): 459-469.
- Tsotsos, J K (1991). Is Complexity Analysis Appropriate for Analyzing Biological Systems? Behavioral and Brain Sciences 14(4): 770-773.
- Tsotsos, J K (1992). On the relative complexity of active vs passive visual search. International Journal of Computer Vision 7(2): 127-141.
- Tsotsos, J K (1995). On Behaviorist Intelligence and the Scaling Problem. Artifical Intelligence 75: 135-160.
- Turing, A M (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 2(42): 230-265.
- Turing, A M (1948 (1969 reprint)). Intelligent Machinery (reprint in Machine Intelligence 5). Meltzer, B., Michie, D. Edinburgh University Press, Edinburgh.
- Ye, Y and Tsotsos, J K (1996). 3D Sensor Planning: Its Formulation and Complexity. International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL.
- Yashuhara, A (1971). Recursive function theory and logic. Academic Press, New York.
Further reading
- None.
External links
See also
Attention, Complexity, Visual search
| John K. Tsotsos, Neil D. B. Bruce (2008) Computational foundations for attentive processes. Scholarpedia, 3(12):6545, (go to the first approved version) Created: 11 February 2008, reviewed: 23 December 2008, accepted: 23 December 2008 |
| Invited by: | Dr. Naotsugu Tsuchiya, California Institute of Technology, Pasadena, CA, USA |
| Action editor: | Dr. Naotsugu Tsuchiya, California Institute of Technology, Pasadena, CA, USA |




