The neocognitron, proposed by Fukushima (1980), is a hierarchical multilayered neural network capable of robust visual pattern recognition through learning (Fukushima, 1988; 2003).
Outline of the Neocognitron
Figure 1 shows a typical architecture of the network. The lowest stage is the input layer, consisting of a two-dimensional array of cells that correspond to photoreceptors of the retina. There are retinotopically ordered connections between cells of adjoining layers. Each cell receives input connections that lead from cells situated in a limited area on the preceding layer. Layers of "S-cells" and "C-cells" are arranged alternately in the hierarchical network. (In the network shown in Figure 1, a contrast-extracting layer is inserted between the input layer and the S-cell layer of the first stage.)
S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response. Their input connections are variable and are modified through learning. After learning has finished, each S-cell comes to respond selectively to a particular feature presented in its receptive field. The features extracted by S-cells are determined during the learning process. Generally speaking, local features, such as edges or lines in particular orientations, are extracted in lower stages. More global features, such as parts of the training patterns, are extracted in higher stages.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output. Even if the stimulus feature shifts in position and another S-cell comes to respond instead of the first one, the same C-cell keeps responding. Thus, the C-cell's response is less sensitive to shifts in position of the input pattern. We can also say that C-cells perform a blurring operation, because the response of a layer of S-cells is spatially blurred in the response of the succeeding layer of C-cells.
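The shift tolerance of a C-cell can be illustrated with a minimal sketch. It assumes the simplest possible reading of "responds if at least one of these S-cells yields an output", namely taking the maximum over a small neighbourhood; the function name `c_cell_layer` and the parameter `radius` are hypothetical, and the C-cells of the cited papers are not necessarily computed this way.

```python
def c_cell_layer(s_plane, radius=1):
    """Each C-cell takes the strongest S-cell output within `radius` of its position."""
    height, width = len(s_plane), len(s_plane[0])
    c_plane = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # S-cells of the same cell-plane at slightly different positions.
            neighbours = [
                s_plane[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            c_plane[y][x] = max(neighbours)
    return c_plane


# One active S-cell, and the same response shifted one position to the right.
s_original = [[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0]]
s_shifted = [[0, 0, 0, 0, 0],
             [0, 0, 0, 1, 0],
             [0, 0, 0, 0, 0]]

c_original = c_cell_layer(s_original)
c_shifted = c_cell_layer(s_shifted)

# The C-cell at row 1, column 2 keeps responding after the shift.
print(c_original[1][2], c_shifted[1][2])  # 1 1
```

The output plane is also a blurred version of the input plane: a single active S-cell activates a 3-by-3 block of C-cells when `radius` is 1.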
Each layer of S-cells or C-cells is divided into sub-layers, called "cell-planes", according to the features to which the cells respond. The cells in each cell-plane are arranged in a two-dimensional array. A cell-plane is a group of cells that are arranged retinotopically and share the same set of input connections. In other words, the connections to a cell-plane have a translational symmetry. As a result, all the cells in a cell-plane have receptive fields of an identical characteristic, but the locations of the receptive fields differ from cell to cell. During learning, the modification of variable connections also proceeds under the restriction of shared connections.
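The idea of shared connections can be sketched as follows. This is an illustration under simplifying assumptions, not the network's actual response function: each cell simply sums its weighted inputs, no padding is used at the borders, and the names `cell_plane_response` and `shared_weights` are hypothetical.

```python
def cell_plane_response(input_plane, shared_weights):
    """Apply one shared set of input connections at every position (no padding)."""
    size = len(shared_weights)
    rows = len(input_plane) - size + 1
    cols = len(input_plane[0]) - size + 1
    plane = []
    for y in range(rows):
        row = []
        for x in range(cols):
            # Every cell uses the same weights; only its receptive field moves.
            total = sum(
                shared_weights[dy][dx] * input_plane[y + dy][x + dx]
                for dy in range(size)
                for dx in range(size)
            )
            row.append(total)
        plane.append(row)
    return plane


# Shared connections tuned to a vertical line.
vertical_line = [[0, 1, 0],
                 [0, 1, 0],
                 [0, 1, 0]]

image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0]]
shifted = [[0, 0, 0, 1, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 1, 0]]

print(cell_plane_response(image, vertical_line))    # [[0, 3, 0]]
print(cell_plane_response(shifted, vertical_line))  # [[0, 0, 3]]
```

Because of the translational symmetry, shifting the input line by one position moves the responding cell by one position within the cell-plane and leaves the strength of the response unchanged.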
Principles of Deformation-Resistant Recognition
In the whole network, with its alternate layers of S-cells and C-cells, the process of feature extraction by S-cells and toleration of positional shift by C-cells is repeated. During this process, local features extracted in lower stages are gradually integrated into more global features, as illustrated in Figure 2.
Since small positional errors of local features are absorbed by the blurring operation of C-cells, an S-cell in a higher stage comes to respond robustly to a specific feature even if the feature is slightly deformed or shifted.
Figure 3 illustrates this situation. Suppose an S-cell in an intermediate stage of the network has already been trained to extract a global feature consisting of three local features of a training pattern 'A', as illustrated in Figure 3(a). The cell tolerates a positional error of each local feature if the deviation falls within the dotted circle. Hence, the S-cell responds to any of the deformed patterns shown in Figure 3(b). The tolerance for positional errors should not be too large at this stage. If large errors are tolerated at any one step, the network may come to respond erroneously, such as by recognizing a stimulus like Figure 3(c) as an 'A' pattern.
Thus, tolerating positional error a little at a time at each stage, rather than all in one step, plays an important role in endowing the network with the ability to recognize even distorted patterns.
The C-cells in the highest stage work as recognition cells, which indicate the result of the pattern recognition. Each C-cell of the recognition layer at the highest stage integrates all the information of the input pattern, and responds only to one specific pattern. Since errors in the relative position of local features are tolerated in the process of extracting and integrating features, the same C-cell responds in the recognition layer at the highest stage, even if the input pattern is deformed, changed in size, or shifted in position. In other words, after learning has finished, the neocognitron can recognize input patterns robustly, with little effect from deformation, change in size, or shift in position.
Self-Organization of the Network
The neocognitron can be trained to recognize patterns through learning. Only S-cells in the network have their input connections modified through learning. Various training methods, including unsupervised learning and supervised learning, have been proposed so far. This section introduces a process of unsupervised learning.
In the case of unsupervised learning, the self-organization of the network is performed using two principles. The first principle is a kind of winner-take-all rule: among the cells situated in a certain small area, which is called a hypercolumn, only the one responding most strongly becomes the winner. The winner has its input connections strengthened. The amount of strengthening of each input connection to the winner is proportional to the intensity of the response of the cell from which the relevant connection leads.
To be more specific, an S-cell receives variable excitatory connections from a group of C-cells of the preceding stage, as illustrated in Figure 5. Each S-cell is accompanied by an inhibitory cell, called a V-cell. The S-cell also receives a variable inhibitory connection from the V-cell. The V-cell receives fixed excitatory connections from the same group of C-cells as does the S-cell, and always responds with the average intensity of the output of the C-cells.
The initial strength of the variable connections is very weak and nearly zero. Suppose the S-cell responds most strongly among the S-cells in its vicinity when a training stimulus is presented. According to the winner-take-all rule described above, variable connections leading from activated C-cells are strengthened. The variable excitatory connections to the S-cell grow into a template that exactly matches the spatial distribution of the response of the cells in the preceding layer. The inhibitory variable connection from the V-cell is also strengthened at the same time to the average strength of the excitatory connections.
After the learning, the S-cell acquires the ability to extract a feature of the stimulus presented during the learning period. Through the excitatory connections, the S-cell receives signals indicating the existence of the relevant feature to be extracted. If an irrelevant feature is presented, the inhibitory signal from the V-cell becomes stronger than the direct excitatory signals from the C-cells, and the response of the S-cell is suppressed.
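The interplay of excitation and inhibition described above can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the exact formulation of the cited papers: the "average" computed by the V-cell is taken to be a root-mean-square, the S-cell output is assumed to be the rectified ratio (1 + e) / (1 + theta * h) - 1 of excitation e to inhibition h, and the names `v_cell`, `strengthen`, `s_cell`, `rate`, and `theta` are hypothetical.

```python
import math


def v_cell(c_outputs):
    """V-cell output: an average of the C-cell outputs (root-mean-square assumed)."""
    return math.sqrt(sum(u * u for u in c_outputs) / len(c_outputs))


def strengthen(a, b, c_outputs, rate=100.0):
    """One learning step for a winning S-cell.

    Each excitatory connection grows in proportion to the output of the C-cell
    it leads from; the inhibitory connection grows with the V-cell output.
    """
    n = len(c_outputs)
    new_a = [w + rate * u / n for w, u in zip(a, c_outputs)]
    new_b = b + rate * v_cell(c_outputs)
    return new_a, new_b


def s_cell(c_outputs, a, b, theta=0.7):
    """Rectified S-cell output: excitation divided by V-cell inhibition (assumed form)."""
    excitation = sum(w * u for w, u in zip(a, c_outputs))
    inhibition = theta * b * v_cell(c_outputs)
    return max(0.0, (1.0 + excitation) / (1.0 + inhibition) - 1.0)


# Connections start at zero (the text says "nearly zero") and are strengthened once.
training_feature = [1.0, 0.0, 1.0, 0.0]
a, b = strengthen([0.0] * 4, 0.0, training_feature)

learned = s_cell(training_feature, a, b)          # the feature seen during learning
partial = s_cell([1.0, 1.0, 1.0, 0.0], a, b)      # a partly matching feature
irrelevant = s_cell([0.0, 1.0, 0.0, 1.0], a, b)   # a feature with no overlap

print(learned > partial > 0.0)  # True
print(irrelevant)               # 0.0
```

Under these assumptions the excitatory connections form a template of the training input, and the ratio of excitation to inhibition behaves like a similarity measure between the template and the current input. The hypothetical parameter `theta` then sets how dissimilar an input may be before the inhibition suppresses the response entirely.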
Once an S-cell is thus selected and has learned to respond to a feature, the cell usually loses its responsiveness to other features. When a different feature is presented, a different cell usually yields the maximum output and learns the second feature. Thus, a "division of labor" among the cells occurs automatically.
The second principle of learning is introduced so that the connections being strengthened always preserve translational symmetry, that is, the condition of shared connections. The maximum-output cell not only grows by itself, but also controls the growth of neighboring cells, working, so to speak, like a seed in crystal growth. To be more specific, all of the other S-cells in the cell-plane from which the "seed cell" is selected follow the seed cell, and have their input connections strengthened with the same spatial distribution as those of the seed cell.
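The two principles can be combined in a short sketch. It rests on several assumptions that go beyond the text: a hypercolumn is taken to be all cell-planes at positions within `radius` of a cell, at most one seed is chosen per cell-plane, the receptive field of the cell at (y, x) starts at input position (y, x), and the names `select_seeds`, `strengthen_plane`, `radius`, and `rate` are hypothetical.

```python
def select_seeds(responses, radius=1):
    """Pick at most one seed cell per cell-plane.

    responses[k][y][x] is the output of the S-cell at (y, x) in cell-plane k.
    A cell is a candidate if it responds and is the strongest cell in the
    hypercolumn around it (all planes, positions within `radius`).
    """
    planes = len(responses)
    height, width = len(responses[0]), len(responses[0][0])
    seeds = {}
    for k in range(planes):
        for y in range(height):
            for x in range(width):
                value = responses[k][y][x]
                if value <= 0:
                    continue
                hypercolumn = [
                    responses[p][j][i]
                    for p in range(planes)
                    for j in range(max(0, y - radius), min(height, y + radius + 1))
                    for i in range(max(0, x - radius), min(width, x + radius + 1))
                ]
                if value < max(hypercolumn):
                    continue  # another cell won this hypercolumn
                best = seeds.get(k)
                if best is None or value > responses[k][best[0]][best[1]]:
                    seeds[k] = (y, x)
    return seeds


def strengthen_plane(shared_weights, input_plane, seed, rate=1.0):
    """Grow the shared connections of a cell-plane from the seed's input patch."""
    y0, x0 = seed
    size = len(shared_weights)
    return [
        [shared_weights[dy][dx] + rate * input_plane[y0 + dy][x0 + dx]
         for dx in range(size)]
        for dy in range(size)
    ]


# Outputs of two cell-planes (2 x 3 cells each) for one training stimulus.
responses = [
    [[0.1, 0.9, 0.2], [0.0, 0.3, 0.1]],  # cell-plane 0
    [[0.0, 0.4, 0.0], [0.0, 0.0, 0.0]],  # cell-plane 1
]
stimulus = [[0, 0, 1, 0, 0],
            [0, 0, 1, 0, 0],
            [0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0]]

seeds = select_seeds(responses)
print(seeds)  # {0: (0, 1)}

weights = [[0.0] * 3 for _ in range(3)]
weights = strengthen_plane(weights, stimulus, seeds[0])
print(weights)  # [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
```

Because the weights are stored once per cell-plane, updating them from the seed cell's input patch strengthens every cell of the plane in the same way, which is the condition of shared connections. Cell-plane 1 receives no seed here because its strongest cell lies in a hypercolumn won by a cell of plane 0.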
Various Types of Neocognitron
Various modifications of the neocognitron have been proposed to improve the recognition rate or to make the model more biologically natural. If backward (i.e., top-down) connections are added to the conventional neocognitron, for example, the network comes to have the ability to recognize occluded patterns correctly and can restore the occluded parts of the patterns (Fukushima, 1987; 2005). Even if many patterns are presented simultaneously, the network focuses attention on individual patterns one by one and recognizes them correctly.
- K. Fukushima: "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, 36, pp. 193-202 (April 1980).
- K. Fukushima: "Neocognitron: A hierarchical neural network capable of visual pattern recognition", Neural Networks, 1, pp. 119-130 (1988).
- K. Fukushima: "Neocognitron for handwritten digit recognition", Neurocomputing, 51, pp. 161-180 (April 2003).
- K. Fukushima: "Neural network model for selective attention in visual pattern recognition and associative recall", Applied Optics, 26, pp. 4985-4992 (Dec. 1987).
- K. Fukushima: "Restoring partly occluded patterns: a neural network model", Neural Networks, 18, pp. 33-43 (2005).