Slow feature analysis

From Scholarpedia

This article has not been peer-reviewed or accepted for publication yet; It may be unfinished, contain inaccuracies, or unapproved changes.

Author: Prof. Laurenz Wiskott, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
Author: Dr. Pietro Berkes, Volen Center for Complex Systems, Brandeis University, Waltham, MA, USA
Author: Dr. Mathias Franzius, Honda Research Institute Europe
Author: Dr. Henning Sprekeler, Laboratory of Computational Neuroscience, EPFL, Lausanne, Switzerland
Author: Mr. Niko Wilbert, Humboldt Universität zu Berlin, Berlin, Germany

Dr. Laurenz Wiskott accepted the invitation on 6 October 2008 (self-imposed deadline: 6 April 2009).

Under Construction!

Figure 1: Schematics of the optimization problem solved by slow feature analysis.
Enlarge
Figure 1: Schematics of the optimization problem solved by slow feature analysis.

Slow feature analysis (SFA) refers to an algorithm for extracting slowly varying features from a quickly varying input signal. It has been successfully applied to the self-organization of complex-cell receptive fields, the invariant recognition of whole objects, the self-organization of place-cells, and to nonlinear blind source separation.

Contents

Slow feature analysis

The slowness principle

Image:SlowFeatureAnalysis-SlownessPrinciple.png
Figure 2: Illustration of the slowness principle.

Slowness as a learning principle in vision is based on the observation that different representations of our environment vary on different time scales. The environment itself varies on a time scale of a large fraction of a second. The primary sensory signal, i.e. responses of single retinal receptors or the gray values of a single pixel in a CCD camera, vary on a much faster time scale. A useful high-level representation of the environment varies on a slow time scale again. Somehow the brain manages to extract a high-level representation from the primary sensory signal, and it has to do so by unsupervised learning. One marked difference between the primary sensory signal and the high-level representation is that the former varies quickly while the latter varies slowly. This difference can be employed for unsupervised learning and leads to the central idea of the slowness principle: If we find functions that extract from the quickly varying input signal slowly varying output signals, we have a good chance of obtaining a useful high-level representation of the environment.

The optimization problem

In mathematical terms the optimization problem of the slowness principle can be stated as follows (cf. Fig. 1): Given a (potentially high-dimensional) input signal \mathbf{x}(t) find functions g_j(\mathbf{x}) such that the output signals y_j(t) := g_j(\mathbf{x}(t))

minimize
(1)
\Delta(y_j) := \langle \dot{y}_j^2 \rangle_t

under the constraints

(2)
\langle y_j \rangle_t = 0 (zero mean),
(3)
\langle y_j^2 \rangle_t = 1 (unit variance),
(4)
\forall i<j: \langle y_i y_j \rangle_t = 0 (decorrelation and order).

\langle \cdot \rangle_t indicates averaging over time and \dot{y} is the derivative of y with respect to time. The \Delta-value defined by (1) measures the slowness (or rather fastness) of a signal. A low value indicates that the signal has a small slope on average and is therefore slow. The constraints (2) and (3) avoid the trivial constant solution, which would otherwise be optimal, since its derivative is zero. The constraint (4) guarantees that different output signal components code for different information, otherwise all y_j might be identical.

It is important to note that although the objective is slowness, the function g_j extracts the output signal y_j instantaneously. There is no low-pass filtering involved. This also means that one can get slowly varying output signals only if there are slowly varying causes underlying the quickly varying input signal. If the input is white noise, the output will be white noise.

The algorithm

As stated above the optimization problem is one of variational calculus. If one confines the functions g_j to a finite dimensional function space, such as all polynomials of degree two, the problem becomes simpler and one can use algebraic methods, which are the basis for the slow feature analysis algorithm.

Figure 3: Input signal of the simple example.
Enlarge
Figure 3: Input signal of the simple example.

Consider as a simple example the two dimensional input signal x_1(t) := \sin(t) + \cos(11t)^2 and x_2(t) := \cos(11t), see Fig. 3. Both components are quickly varying, but hidden in the signal is the slowly varying 'feature' y(t) = x_1(t) - x_2(t)^2 = \sin(t), which can be extracted with a polynomial of degree two, namely g(\mathbf{x})=x_1-x_2^2.

Figure 4: Illustration of the Algorithm.
Enlarge
Figure 4: Illustration of the Algorithm.

For finding the function g(\mathbf{x}) that extracts the slow feature from the input signal, one proceeds as follows.

  • Step 1: First expand the input into the space of nonlinear functions you are considering, see Fig. 4B. I.e. if you consider only the polynomials of degree two as possible functions for g(\mathbf{x}), then create the signal components \tilde{z}_1 := x_1, \tilde{z}_2 := x_2, \tilde{z}_3 := x_1^2, \tilde{z}_4 := x_1x_2, \tilde{z}_5 := x_2^2. Any polynomial of degree two in x_1 and x_2 can be written as a linear combination of these five components (except for a constant, which will be taken care of in the next step). This makes the problem linear from now on.
  • Step 2: Normalize the signal. First subtract the mean (this determines the missing constant of the previous step), then apply a linear transformation such that the signal has unit variance in all directions, see Fig. 4C. This is referred to as whitening or sphering and can be done with principal component analysis. If you project the sphered signal onto any direction (i.e. you take the inner product between the signal vector \mathbf{z} and any unit vector), the resulting signal has zero mean and unit variance; if you project the sphered signal onto two orthogonal directions, the two resulting signals are uncorrelated. These properties make it easy to obey the constraints (2-4).
  • Step 3: Calculate the time derivative \mathbf{\dot{z}}(t) of the sphered signal, see Fig. 4D.
  • Step 4: Find the direction of least variance of the time derivative signal, see Fig. 4D. This is the direction in which the sphered signal varies most slowly, because on average the square of the time derivative is smallest. If you need more than one output component, take orthogonal directions with the next smallest variance. Finding these directions can again be done with principal component analysis. The directions are then the principal components with the smallest eigenvalues, and the eigenvalues are exactly the \Delta-values of the projected signals as defined by (1).

The sphered signal projected onto the direction of least variance of the time derivative signal is the slow feature extracted, see Fig. 4E. Combining all the steps above (nonlinear expansion, whitening, projection onto the direction of least variance of the time derivative signal) yields function g(\mathbf{x}), see Fig. 4F. Evaluating g(\mathbf{x}) along the trajectory \mathbf{x}(t) yields y(t) = g(\mathbf{x}(t)) as shown in Fig. 4E.

It is possible to combine steps 2 and 4 in one by solving a generalized eigenvalue problem.

Relation to other approaches and historical remarks (under construction)

Probably the first explicit mentioning of slowness (or rather smoothness) as a possible objective for learning can be found in (Hinton, 1989, on page 208).

Early models were presented by Földiák (1991) and Mitchison (1991). Földiák (1991) introduced the trace learning rule, which became quite popular. He based his approach on models of classical conditioning, which were even earlier than (Hinton, 1989).

A related approach to the slowness principle is the maximization of mutual information between the output of a function on neighboring image patches, e.g. (Becker and Hinton, 1992). This approach works on the spatial rather than the temporal domain but has been extended to time as well. With certain approximations this approach leads to a similar optimization problem as above (1-4). Relations between information theoretic measures and the optimization problem of slow feature analysis have later been shown more explicitly by Shaw (2003), Turner and Sahani (2007), and Creutzig and Sprekeler (2008). These results are instructive in that they provide a more fundamental derivation of the optimization problem.

Slow feature analysis has been developed independently of earlier publications on the slowness principle and has first been published in (Wiskott, 1998). The principal publication introducing slow feature analysis is (Wiskott and Sejnowski, 2002). The main difference between slow feature analysis and all other algorithms is that it finds a closed form solution while the latter are all incremental or online learning rules, often in the form of a gradient ascent/descent method. This has several consequences:

  • Slow feature analysis is guaranteed to find the optimal solution within the given (finite-dimensional) function space. It does therefore not suffer from local optima as the other algorithms do.
  • Slow feature analysis yields a set of many uncorrelated output signals that are ordered by slowness. In online learning rules the output signals are either not ordered, i.e. it is not guaranteed that any of the output signal components is the slowest possible one, or it takes a long time to find the faster signal components, because the slower ones have to converge before the faster ones can.
  • Due to the nonlinear expansion, slow feature analysis suffers from the curse of dimensionality. For instance if the function space considered is the set of all polynomials of degree two (except for the constant), then for 10-dimensional input the expanded space is 65-dimensional while for 100-dimensional input the input is 5150-dimensional. Thus the dimensionality of the expanded space grows rapidly with the number of input dimensions, which may cause problems in practice due to limitations in memory size of a computer. This problem can be solved by hierarchical processing, which breaks down a large input space into many smaller ones, cf. Invariant visual object recognition and Grid, place, head-direction, and view cells in the hippocampus.


Applications in computational neuroscience

Hierarchical SFA networks for high-dimensional data (under construction)

In the previous section it was already mentioned that SFA suffers from the curse of dimensionality when combined with nonlinear expansion. This is especially a problem if the input data already has a high dimensionality, like visual input data. One natural solution for this problem are hierarchical networks that mimic the feedforward aspects of the ventral visual system. Such a networks consists of a converging hierarchy of layers of SFA nodes (which also perform a nonlinear). Each node only receives data from a patch of the layer below, thereby breaking the data down into smaller pieces. Of course this approach requires that the structure of the data is compatible. For visual stimuli this is typically the case, because the correlation of pixels decreases with their distance. So an SFA node that looks at a small local image patch can learn this local structure and extract useful features (see Complex-cell receptive fields in primary visual cortex).

In the higher layers of a hierarchical network the effective receptive field size becomes larger, so it is possible to extract complex features (like recognizing whole objects). This is helped by the accumulation of computational power with each layer. For example in a three-layer network with quadratic expansion in each layer the whole network can be represented by functions from a subspace of the polyomials of degree up to 2^3=8. For SFA this was already used in (Wiskott and Sejnowski, 2002) and was later applied to more complex stimuli for the modeling of Grid, place, head-direction, and view cells in the hippocampus and Invariant visual object recognition.

Complex-cell receptive fields in primary visual cortex (empty)

Grid, place, head-direction, and view cells in the hippocampus (under construction)

Figure 5: Spatial and orientation tuning of different cell types in the hippocampal formation: Idealized Grid Cell, Place Cell, Head-Direction Cell, and Spatial-View Cell
Enlarge
Figure 5: Spatial and orientation tuning of different cell types in the hippocampal formation: Idealized Grid Cell, Place Cell, Head-Direction Cell, and Spatial-View Cell

A number of cell types with firing correlates of an animal's position and head direction in space have been identified in the hippocampus and neighboring areas. These "oriospatial" cells include place cells, grid cells, head direction cells and spatial view cells (Fig 5). Oriospatial cells are driven by multimodal stimuli but are typically dominated by vestibular and visual input. In comparison with the rapidly changing sensory input during an animal's movement in a rich visual world, the firing rates of these cell types change relatively slowly. This observation motivates a model of unsupervised formation of oriospatial cells based on slow feature analysis and sparse coding. The model architecture is depicted in Fig 6.

Figure 6: Architecture of the hierarchical model for spatial learning. At a given position and orientation of the virtual rat (arrow) in the naturally textured virtual-reality environment (A), input views are generated (B), and processed in a hierarchical network (C). The lower three layers perform the same sequence (D) of linear SFA (for dimensionality reduction), expansion, additive noise, linear SFA (for feature extraction), and clipping; the last layer performs linear sparse coding.
Enlarge
Figure 6: Architecture of the hierarchical model for spatial learning. At a given position and orientation of the virtual rat (arrow) in the naturally textured virtual-reality environment (A), input views are generated (B), and processed in a hierarchical network (C). The lower three layers perform the same sequence (D) of linear SFA (for dimensionality reduction), expansion, additive noise, linear SFA (for feature extraction), and clipping; the last layer performs linear sparse coding.

On the one hand, a hierarchical and converging setup is a technical means for solving SFA in highly nonlinear function spaces on large inputs, in this case polynomials of degree 8 on 38.400 input dimensions. On the other hand, this architecture mimics the feedforward aspects of the ventral visual system together with a final layer for simulating the hippocampal formation on top, where all layers but the last implement the same computational objective.


Figure 7: Theoretical predictions and simulation results.
Enlarge
Figure 7: Theoretical predictions and simulation results.

We simulate the movement of a rat moving in a static virtual environment (Fig 6A) and generate a video stream as an approximation of a rat's visual stimuli (Fig 6B). Although the model operates on such high-dimensional data, this setup allows us to make detailed theoretical predictions on the optimal representations generated by the model (see Section "Theory of slow feature analysis", Fig 7 A,D), which are closely matched by the results of computer simulations of the model (see Fig 7 B,E). These predictions are solely based on the low-dimensional configuration space of position and orientation of the virtual rat and its movement trajectories in space. Consider a typical place cell experiment, where a rat explores an open environment while food pellets are thrown into its cage at random positions. In this scenario, the rat will typically turn its head often while it moves rather slowly. Intuitively, our model should develop representations that do not depend on head direction, as such dependence would increase their \Delta-values. The representations could, on the other hand, depend on the animal's position and have smaller \Delta-values, and thus form a better solution of the optimization problem given above. In other words, for this movement paradigm, representations should be invariant with respect to head direction but specific w.r.t. spacial position. The theoretically optimal solutions in Fig 7 A indeed show these properties. The slowest solution in this case is a firing map in the form of a half sine (north-south) wave in the virtual environment. The following representation is a half sine in the orthogonal direction (east-west). Later solutions show higher frequencies and mixtures of both spatial directions. These higher solutions resemble to some degree grid cells (see Fig 5A).

Using a different movement paradigm, representations become invariant to spatial position and specific for head direction (see Fig 7D,E), which is a crucial property of head direction cells. Such representations generated by optimizing a slowness function, however, are not localized in position space like place cells or in orientation space like head direction cells. For modelling these properties, we apply a final step of sparse coding to more closely match these cell types (see Fig 7C,F). Further information on modelling other movement paradigms and environmental settings can be found in Franzius, Sprekeler, and Wiskott 2007.

Note that the model representations are generated instantaneously for a single visual input, in accordance with the rapid firing onset of place and head direction cells when lights are switched on in a previously dark room. This, of course, also implies that the model performs no temporal integration after the training phase and, specifically, cannot perform path integration (or dead reckoning), which is assumed to play a crucial role in oriospatial cells. Instead, we consider this model as complementary to a path integration system. Nevertheless, to our knowledge, this model is the only one that simulates all known oriospatial cell types with a single computational principle and that operates from realistic visual stimuli.

Invariant visual object recognition (under construction)

In object recognition tasks the identity of objects is typically not the only relevant information. Just as important is the configuration of the objects (e.g., the position and orientation of an object). The identities of objects and their configurations are typically slow features in the sense of SFA. After training a hierarchical SFA network with such data it should therefore be able to extract features like object identity and configuration. Another important aspect is that ideally the individual features should be independent of each other, i.e., one wants a position representation that is invariant under changes in object orientation. It has been shown that for simple situations a hierarchical SFA network is indeed able to directly extract the desired features (Fig 8).

Figure 8: Output of a hierarchical SFA network used for object recognition. The five slowest features are plotted over the in-plane angle for three different objects. Object identity can easily be deducted from the first two outputs, which resemble step functions and are largely angle invariant. The following outputs contain angle information. All five outputs are practically translation invariant (the gray areas around the lines show the standard deviation under translation).
Enlarge
Figure 8: Output of a hierarchical SFA network used for object recognition. The five slowest features are plotted over the in-plane angle for three different objects. Object identity can easily be deducted from the first two outputs, which resemble step functions and are largely angle invariant. The following outputs contain angle information. All five outputs are practically translation invariant (the gray areas around the lines show the standard deviation under translation).

However, in more complicated situations (e.g., more objects or more configuration features) it is generally not possible to directly interpret the output of a hierarchical SFA network. Nevertheless the relevant features might be much more accessible after the data processing by the network. This can be dealt with by further processing of the SFA output, with supervised or unsupervised methods. In Franzius (2008) a simple linear regression was used to extract continuous variables from the output, and a Gaussian classifier was used to extract the object identity (both methods are supervised). The simplicity and performance of the post-processing methods shows that the hard part of the feature extraction was performed by the SFA network. In a different experiment it was demonstrated that the SFA output can be used as the input for a reinforcement learning model. This enabled the reinforcement learning network to work on very high dimensional visual input data which would otherwise have been out of reach. The only supervision signal here was the reward signal.

Technical applications (empty)

Extraction of driving forces from nonlinear dynamical systems

Nonlinear dynamical systems can be observed by monitoring one or several of their variables over time. The resulting time series can be quite complex and difficult to analyze. If the system has internal parameters, so called driving forces, that vary over time, the analysis is even more difficult. If such driving forces vary more slowly than the internal dynamics of the system, they can be estimated in an unsupervised fashion by slow feature analysis (Wiskott, 2003). Knowing the time course of the driving forces can be useful in itself or can subsequently simplify the analysis of the dynamical system.

Figure 9: Tent map function (black) and its cyclicly shifted version (red).
Enlarge
Figure 9: Tent map function (black) and its cyclicly shifted version (red).
Slow feature analysis
Enlarge
Figure 10: Time series of the iterative tent map (top) produced with a slowly varying cyclic shift (bottom, solid line). The first SFA output (bottom, dots) is highly correlated (r=0.96) with the true driving force.

As a simple example consider an iterative tent-map (Fig 9), which produces a discrete time series (Fig 10, top) by repeatedly applying a tent-like function to an initial value between 0 and 1. The tent-map can be cyclicly shifted by \gamma within the interval [0,1], and this shift is then the driving force of the system. Since the resulting time series is one-dimensional and slow feature analysis requires a multi-dimensional input signal to be useful, one needs to perform time embedding. In this case 10 successive time points form a 10-dimensional input vector, with a shift by one time point from one to the next input vector, i.e. if w_i indicates the one-dimensional time series, the input vectors are \vec{x}_i:=(w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}, w_{i}, w_{i+1}, w_{i+2}, w_{i-+3}, w_{i-+4}, w_{i+5})^T. Using \vec{x}_i as an input to slow feature analysis with polynomials of degree 3 results in a first slow signal that is highly correlated with the true driving force (correlation r=0.96), thus extracting the driving force in an unsupervised fashion.

Nonlinear blind source separation (empty)

Theory of slow feature analysis (under construction)

The "Harmonic Oscillation" Result

Several theoretical insights have been gained which can help substantially in the interpretation of simulation results. First, it was shown (Wiskott, 2003) that the optimal output signals of SFA are harmonic oscillations, with faster signals (higher index j \Leftrightarrow larger \Delta-value) oscillating at higher frequencies. Although, in a strict sense, these optimal output signals result only from limited training data in the case of extreme overfitting, this result helps significantly in the interpretation of many simulation results:

  • Complex Cells: The orientation and frequency tuning of the complex cell units simulated by Berkes (2005) can be understood as a means of generating harmonic oscillations when the input patches are rotated or zoomed at constant velocity.
  • Place- and Head-direction cells: The position and head-direction dependence of the highest SFA units in the spatial learning hierarchy is such that when the animal moves or turns at constant speed, the output signals are harmonic oscillations.
  • Invariant Object Recognition: The orientation dependence of the invariant object recognition system leads to sinusoidal output signals if the objects are rotated (in-plane) with constant velocity.

Input Signals from a Manifold

The intuition gained from the "harmonic oscillation" result are supported by further theoretical analysis of the case in which the input data lie on a smooth manifold (Franzius et al, 2007 Sprekeler et al., 2009). The assumptions are that

  • the training data are sampled from a smooth manifold with a probability distribution p(s,\dot s). Here, s is either the input data or an arbitrary parametrization of the manifold and \dot s is its derivative. The theory treats the limit of infinite amount of training data, where this distribution is fully sampled.
  • the function space of SFA is sufficiently rich to generate arbitrary (smooth) functions on this input manifold.

Under these assumptions, it can be shown that the optimal functions g_j(s) are the solutions of a partial differential eigenvalue problem (Franzius et al., 2007):

- \nabla_s \cdot p(s) K(s) \nabla_s g_j(s) = \Delta_j p(s) g_j(s)

where K(s) is a matrix that contains the second moments of the velocity, conditioned on s:

K(s) = \langle \dot s \dot s^T \rangle_{{\dot s}|s} = \int p({\dot s}|s) {\dot s} {\dot s}^T \mathrm{d} {\dot s}.

The eigenvalue equation is complemented by von Neumann boundary conditions, i.e., for every boundary point s with normal vector n(s):

n(s) \cdot K(s) \nabla_s g_j = 0.

The eigenvalue equation has the structure of a Sturm-Liouville problem, i.e., it is a generalized wave equation. In line with the harmonic oscillation result, the optimal functions are oscillatory eigenmodes on the input manifold, with the eigenvalue \Delta-value \Delta_j corresponding to the squared oscillation frequency. The mathematical structure of the eigenvalue equation ensures that the output signals of the eigenfunctions obey the zero mean and the decorrelation constraint.

In cases, where the input manifold is low-dimensional, the optimal functions can be calculated analytically. The highest SFA modules in the hierarchical spatial learning architecture are an illustrative example: The visual input signals is fully determined by the position and head direction of the simulated rat. Therefore, these three parameters form a parametrization of the input manifold. The optimal functions on this manifold (assuming a uniform distribution p(s,{\dot s}) = p({\dot s})) are shown in Figures 7A and 7B and closely match the simulation results.

Transformation-Based Input Signals (empty)

References

  • Becker, S and Hinton, GE (1992). A Self-Organizing Neural Network that Discovers Surfaces in Random-Dot Stereograms. Nature 355(6356): 161-163.
  • Berkes, P and Wiskott, L (2005). Slow Feature Analysis Yields a Rich Repertoire of Complex Cell Properties Journal of Vision 5(6): 579-602.
  • Creutzig, F and Sprekeler, H (2008). Predictive Coding and the Slowness Principle: An Information-Theoretic Approach Neural Computation 20(4): 1026-41.
  • Földiák, P (1991). Learning Invariance from Transformation Sequences. Neural Computation 3(2): 194-200.
  • Hinton, GE (1989). Connectionist Learning Procedures. Artificial Intelligence 40(1-3): 185-234.
  • Mitchison, G (1991). Removing Time Variation with the Anti-Hebbian Differential Synapse. Neural Computation 3(3): 312-320.
  • Wiskott, L (1998). Learning invariance manifolds. Proc. 5th Joint Symposium on Neural Computation, JSNC'98, San Diego, May 16, publ. University of California, San Diego, pp. 196-203.
  • Wiskott, L (2003). Slow Feature Analysis: A Theoretical Analysis of Optimal Free Responses. Neural Computation 15(9) : 2147-2177.

External links

Laurenz Wiskott's website

Bibliography on Slowness by Henning Sprekeler

Modular Toolkit for Data Processing (MDP) Free open source software library (in Python) with a standard implementation of the SFA algorithm and tools for hierarchical SFA networks.

Invited by: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors