# Fisherfaces

Curator: Aleix Martinez

A key problem in computer vision, pattern recognition and machine learning is to define an appropriate data representation for the task at hand.

One way to represent the input data is by finding a subspace which captures most of the data variance. This can be obtained with the use of Principal Components Analysis (PCA). When applied to face images, PCA yields a set of eigenfaces. These eigenfaces are the eigenvectors associated with the largest eigenvalues of the covariance matrix of the training data, and they correspond to the least-squares (LS) solution. This is a powerful way to represent the data because it preserves the data variance while eliminating unnecessary correlations among the original features (dimensions) of the sample vectors.
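
As a minimal sketch (plain NumPy; the function and variable names are illustrative, not from the original work), the eigenfaces of a training set of vectorized face images can be obtained from the SVD of the centered data matrix, which yields the eigenvectors of the sample covariance matrix without forming it explicitly:

```python
import numpy as np

def eigenfaces(X, k):
    """Leading k eigenfaces of X, an (n, p) stack of vectorized face images.

    The right singular vectors of the centered data are the eigenvectors of
    its covariance matrix, already ordered by decreasing eigenvalue.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k], (s[:k] ** 2) / (len(X) - 1)   # eigenfaces, eigenvalues
```

Each row of the returned matrix, reshaped back to the original image dimensions, is one eigenface.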

When the goal is classification rather than representation, the LS solution may not yield the most desirable results. In such cases, one wishes to find a subspace that maps the sample vectors of the same class in a single spot of the feature representation and those of different classes as far apart from each other as possible. The techniques derived to achieve this goal are known as discriminant analysis (DA).

The best-known DA method is Linear Discriminant Analysis (LDA), which can be derived from an idea suggested by R.A. Fisher in 1936. When LDA is used to find the subspace representation of a set of face images, the resulting basis vectors defining that space are known as Fisherfaces.

Figure 1: Shown here are the first four Fisherfaces from a set of 100 classes (subjects).

## Discriminant Scores

To compute the Fisherfaces, we assume the data in each class is Normally distributed. We denote the multivariate Normal distribution as $$N_i(\mu_i,\Sigma_i)\ ,$$ with mean $$\mu_i$$ and covariance matrix $$\Sigma_i\ ,$$ and its probability density function is $$f_i(\mathbf{x}|\mu_i,\Sigma_i)\ .$$

In the $$C$$-class problem, we have $$N_i(\mu_i, \Sigma_i)\ ,$$ with $$i = 1,\dots,C\ .$$ Given these Normal distributions and their class prior probabilities $$P_i\ ,$$ the classification of a test sample $$\mathbf{x}$$ is given by comparing the log-likelihoods of $$f_i(\mathbf{x}|\mu_i,\Sigma_i) P_i$$ for all $$i\ .$$ That is,

$$\arg \min_{1 \leq i \leq C} d_i(\mathbf{x}),$$

where $$d_i(\mathbf{x}) = (\mathbf{x}-\mu_i)^T {\Sigma}_i^{-1} (\mathbf{x}-\mu_i) + \ln|\Sigma_i| - 2\ln P_i$$ are known as the discriminant scores of each class. The discriminant scores thus defined yield the Bayes optimal solution.
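
In code, the discriminant scores and the resulting Bayes classifier can be sketched as follows (plain NumPy; the names are illustrative):

```python
import numpy as np

def discriminant_score(x, mu, cov, prior):
    """d_i(x) = (x - mu)^T Sigma_i^{-1} (x - mu) + ln|Sigma_i| - 2 ln P_i."""
    diff = x - mu
    return (diff @ np.linalg.inv(cov) @ diff
            + np.log(np.linalg.det(cov))
            - 2.0 * np.log(prior))

def classify(x, mus, covs, priors):
    """Assign x to the class with the smallest discriminant score."""
    scores = [discriminant_score(x, m, S, p)
              for m, S, p in zip(mus, covs, priors)]
    return int(np.argmin(scores))
```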

The discriminant scores generally result in quadratic classification boundaries between classes. However, when all the covariance matrices are equal, $$\Sigma_i = \Sigma$$ $$\forall i\ ,$$ the quadratic parts of $$d_i$$ cancel out, yielding linear classifiers. These classifiers are called linear discriminant bases; hence the name Linear Discriminant Analysis. Normal distributions with identical covariance matrices are known as homoscedastic.

Assume that $$C=2$$ and that the classes are homoscedastic Normals. Project the sample feature vectors onto the one-dimensional subspace orthogonal to the classification hyperplane given by the discriminant scores. It follows that the number of misclassified samples in the original $$p$$-dimensional space and in this subspace of just one dimension is the same. This is easily verified: since the classification boundary is linear, all the samples that were on one side of it in the original space remain on the same side in the one-dimensional subspace. This important point was first noted by R.A. Fisher and is what allows us to define the LDA algorithm and the Fisherfaces.
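
This can be checked numerically. In the sketch below (illustrative names; a shared covariance matrix enforces homoscedasticity), classifying with the full two-dimensional discriminant scores and thresholding the one-dimensional projection onto $$\mathbf{w} = \Sigma^{-1}(\mu_2-\mu_1)$$ produce identical labels:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # shared covariance (homoscedastic)
X = np.vstack([rng.multivariate_normal(mu1, Sigma, 100),
               rng.multivariate_normal(mu2, Sigma, 100)])

# Classification in the full space (equal priors): pick the smaller d_i(x).
Si = np.linalg.inv(Sigma)
d = lambda x, mu: (x - mu) @ Si @ (x - mu)
labels_full = np.array([0 if d(x, mu1) < d(x, mu2) else 1 for x in X])

# Classification after projecting onto the one-dimensional subspace
# orthogonal to the decision hyperplane, i.e. onto w = Sigma^{-1}(mu2 - mu1).
w = Si @ (mu2 - mu1)
t = 0.5 * (w @ mu1 + w @ mu2)    # threshold: midpoint of the projected means
labels_1d = (X @ w > t).astype(int)
```

Both label vectors agree because the quadratic terms of the two discriminant scores cancel, leaving exactly the linear rule $$\mathbf{w}^T\mathbf{x} \gtrless t\ .$$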

## Computing the Fisherfaces

The theoretical argument given in the preceding section shows how to obtain the Bayes optimal solution for the 2-class homoscedastic case. In general, we will have more than 2-classes. In such a case, we reformulate the above stated problem as that of minimizing within-class differences and maximizing between-class distances.

Within-class differences can be estimated using the within-class scatter matrix, given by

$$\mathbf{S}_w = \sum_{j=1}^{C}\sum_{i=1}^{n_j} (\mathbf{x}_{ij}-\mu_j)(\mathbf{x}_{ij}-\mu_j)^T,$$

where $$\mathbf{x}_{ij}$$ is the $$i$$th sample of class $$j\ ,$$ $$\mu_j$$ is the mean of class $$j\ ,$$ and $$n_j$$ the number of samples in class $$j\ .$$

Likewise, the between-class differences are computed using the between-class scatter matrix,

$$\mathbf{S}_b = \sum_{j=1}^{C} (\mu_j-\mu)(\mu_j-\mu)^T,$$

where $$\mu$$ represents the mean of all classes.
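
The two scatter matrices can be computed directly from their definitions (a sketch in plain NumPy; names are illustrative, and $$\mu$$ is taken as the unweighted mean of the class means, matching the definition of $$\mathbf{S}_b$$ above):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_w) and between-class (S_b) scatter matrices.

    X : (n, p) array of sample vectors; y : (n,) array of class labels.
    """
    classes = np.unique(y)
    p = X.shape[1]
    # mu is the mean of the class means
    mu = np.mean([X[y == c].mean(axis=0) for c in classes], axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        D = Xc - mu_c
        Sw += D.T @ D                        # sum of (x - mu_j)(x - mu_j)^T
        Sb += np.outer(mu_c - mu, mu_c - mu)  # (mu_j - mu)(mu_j - mu)^T
    return Sw, Sb
```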

We now want to find the basis vectors $$\mathbf{V}$$ that simultaneously minimize the within-class scatter and maximize the between-class scatter, where $$\mathbf{V}$$ is a matrix whose columns $$\mathbf{v}_i$$ are the basis vectors defining the subspace. These are found by maximizing the ratio

$$\frac{| \mathbf{V}^T \mathbf{S}_b \mathbf{V} |}{| \mathbf{V}^T \mathbf{S}_w \mathbf{V} |} .$$

The solution to this problem is given by the generalized eigenvalue decomposition

$$\mathbf{S}_b \mathbf{V} = \mathbf{S}_w \mathbf{V} \mathbf{\Lambda},$$

where $$\mathbf{V}$$ is (as above) the matrix of eigenvectors and $$\mathbf{\Lambda}$$ is a diagonal matrix of corresponding eigenvalues.

The columns of $$\mathbf{V}$$ associated with non-zero eigenvalues are the Fisherfaces. There is a maximum of $$C-1$$ Fisherfaces. This can be readily seen from the definition of $$\mathbf{S}_b\ .$$ Note that in our definition, $$\mathbf{S}_b$$ is the sum of $$C$$ rank-one matrices, each built from a vector $$\mu_j-\mu\ ,$$ and that these $$C$$ vectors sum to zero because $$\mu$$ is the mean of the class means. Hence, $$\mathbf{S}_b$$ has rank $$C-1$$ or less, with equality when the vectors $$\mu_j-\mu$$ span a subspace of dimension $$C-1\ .$$ Figure 1 shows the first four Fisherfaces obtained when applying the algorithm just described to a set of frontal face images of 100 different subjects. The images were selected to have a neutral expression.
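
Putting the pieces together, the Fisherfaces can be computed with a generalized symmetric eigensolver, here SciPy's `eigh`, which solves $$\mathbf{S}_b \mathbf{v} = \lambda \mathbf{S}_w \mathbf{v}\ .$$ This sketch assumes $$\mathbf{S}_w$$ is nonsingular (see the Technicalities section below):

```python
import numpy as np
from scipy.linalg import eigh

def fisherfaces(Sw, Sb, C):
    """Solve S_b V = S_w V Lambda and keep the leading eigenvectors.

    Returns at most C - 1 basis vectors (the Fisherfaces) and their
    eigenvalues; assumes S_w is nonsingular.
    """
    evals, evecs = eigh(Sb, Sw)          # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1][:C - 1]
    return evecs[:, order], evals[order]
```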

## Technicalities

To obtain the Fisherfaces, we need to compute the inverse of $$\mathbf{S}_w\ ,$$ i.e., $$\mathbf{S}_w^{-1}\ .$$ If the sample feature vectors are defined in a $$p$$-dimensional space and $$p$$ is larger than the total number of samples $$n\ ,$$ then $$\mathbf{S}_w$$ is singular. Two solutions are typically used. In the first, we project the sample vectors (or, equivalently, the between- and within-class scatter matrices) onto a PCA space of $$r$$ dimensions, with $$r\leq \mathrm{rank}(\mathbf{S}_w)\ ,$$ and compute the Fisherfaces there. The second solution is to add a regularizing term to $$\mathbf{S}_w\ ,$$ that is, to use $$\mathbf{S}_w+\epsilon \mathbf{I}\ ,$$ where $$\mathbf{I}$$ is the identity matrix and $$\epsilon$$ is a small positive constant.
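
Both remedies are easy to sketch (illustrative names; suitable values of `eps` and `r` depend on the data at hand):

```python
import numpy as np
from scipy.linalg import eigh

def lda_regularized(Sw, Sb, C, eps=1e-3):
    """Second remedy: solve the eigenproblem with S_w + eps*I in place of S_w."""
    evals, evecs = eigh(Sb, Sw + eps * np.eye(Sw.shape[0]))
    order = np.argsort(evals)[::-1][:C - 1]
    return evecs[:, order]

def project_to_pca(X, r):
    """First remedy: map the samples into an r-dimensional PCA subspace,
    where the reduced within-class scatter can be made nonsingular."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X @ Vt[:r].T        # run the Fisherface computation on this output
```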

One can also replace the between- and within-class scatter matrices with other measures that serve a similar purpose. First, note that these two matrices are symmetric and positive semi-definite. Hence, each defines a metric. This means we can substitute other matrices for them, as long as the resulting metrics still minimize within-class variance and maximize between-class distances. For example, we can replace the within-class scatter matrix with the sample covariance matrix.

The Fisherfaces obtained with the approach described thus far rest on the linearity assumption mentioned above, which holds when the classes are homoscedastic Normals. In general, this assumption is violated. To resolve this problem, one can add another metric into the equation. The goal of this new metric is to map the original heteroscedastic (i.e., with different covariances) problem into a homoscedastic one. This mapping function can be converted into a kernel mapping thanks to the Representer Theorem, and the metric is then given by the Gram (or kernel) matrix. For this reason, this alternative method is called Kernel LDA (KLDA for short). Several authors have also defined heteroscedastic measures of within- and between-class differences. Yet another alternative for relaxing the Normal assumption is to represent the samples in each class as a mixture of Normal distributions. In this approach, the main difficulty is determining the number of mixture components per class. A popular solution is given by the Subclass Discriminant Analysis (SDA) algorithm and its kernel extension, KSDA. Kernel methods are generally preferred when the number of training samples is sufficiently large to facilitate the learning of the nonlinear mapping.
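
One simple way to realize a kernel variant (a sketch, not the only formulation of KLDA) is through the empirical kernel map: represent each sample by its vector of kernel evaluations against the training set, then run regularized LDA in that $$n$$-dimensional space. The RBF kernel, the `gamma` and `eps` values, and all names below are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def rbf_gram(X, Z, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kda_empirical_map(X, y, gamma=1.0, eps=1e-3):
    """Kernel DA sketch: each sample becomes its row of the Gram matrix,
    then regularized LDA is run in that n-dimensional space."""
    K = rbf_gram(X, X, gamma)             # empirical kernel map of each sample
    classes = np.unique(y)
    n = K.shape[0]
    mu = np.mean([K[y == c].mean(axis=0) for c in classes], axis=0)
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Kc = K[y == c]
        mc = Kc.mean(axis=0)
        Sw += (Kc - mc).T @ (Kc - mc)
        Sb += np.outer(mc - mu, mc - mu)
    evals, A = eigh(Sb, Sw + eps * np.eye(n))
    order = np.argsort(evals)[::-1][:len(classes) - 1]
    return A[:, order]
```

Training samples are projected via `rbf_gram(X, X, gamma) @ A`, and a new sample `x` via `rbf_gram(x[None], X, gamma) @ A`.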

## Fisherface extensions

More recently, Two-dimensional LDA (2DLDA), a tensor extension of LDA, has been proposed. Unlike LDA, which requires the input patterns to be converted to one-dimensional vectors, 2DLDA extracts the discriminant features directly from the image matrices based on Fisher's Linear Discriminant criterion.

## References

• P. Belhumeur, J. Hespanha and D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
• K. Etemad and R. Chellappa, Discriminant Analysis for Recognition of Human Face Images, Journal of the Optical Society of America A, 14(8):1724-1733, 1997.
• D.L. Swets and J.J. Weng, Using Discriminant Eigenfeatures for Image Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.
• A.M. Martinez and A.C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228-233, 2001.
• M.H. Yang, Face Recognition Using Kernel Methods, Advances in Neural Information Processing Systems (NIPS), pp. 215-220, 2002.
• A.M. Martinez and M. Zhu, Where Are Linear Feature Extraction Methods Applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934-1944, 2005.
• J. Ye, R. Janardan and Q. Li, Two-Dimensional Linear Discriminant Analysis, Advances in Neural Information Processing Systems (NIPS), 2004.
• H. Kong, L. Wang, E.K. Teoh, J.G. Wang and V. Ronda, A Framework of 2D Fisher Discriminant Analysis: Application to Face Recognition with Small Number of Training Samples, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1083-1088, 2005.