Information theoretic clustering

From Scholarpedia

This article is undergoing 2 initial reviews; It may contain inaccuracies and unapproved changes made by anonymous reviewers.

Peer review status

You are not logged in.

Reviewer A: Awaits author's response (last modifications by reviewer - 164 days ago)
Reviewer B: Awaits author's response (last modifications by reviewer - 149 days ago)
This article is not accepted to Scholarpedia yet. It may contain inaccuracies and unapproved changes made by anonymous reviewers.
Dr. Robert Jenssen prefers old-fashioned peer-review process: all comments should be put into the 'reviews' part of the article.

Author: Dr. Robert Jenssen, Department of Physics and Technology, University of Tromso, N-9037 Tromso, Norway
Author: Dr. Jose Carlos Principe, Electrical and Computer Engineering Department, University of Florida, Gainesville, FL 32611, USA

Information theoretic clustering is the process of grouping, or clustering, the items comprising a data set, according to a Renyi entropy-based divergence measure between probability density functions. Such an information theoretic divergence measure captures all the statistical information contained in the data, as opposed to mean squared error measures, and can thus handle non-linear cluster boundaries. Sample-based estimators of the divergence measure is obtained via kernel density estimation, without having to estimate the full density function. Information theoretic clustering is related to graph-theoretic clustering and Mercer kernel based learning.

Contents

Problem specification

Information theoretic clustering
Enlarge
Figure 1: Information theoretic clustering: A label l_k, k=1,\dots,N, out of C possibilities, is assigned to each data point \mathbf{x}_k, k=1,\dots,N. This creates C clusters in the data set. The labels are assigned in such a way that a Renyi entropy-based divergence measure between the resulting clusters is maximized.

In data clustering, a label l_k \in \{1,\dots,C \}, k=1,\dots,N, is assigned to each d-dimensional data point \mathbf{x}_k, k=1,\dots,N, comprising a data set. The data points with the same label form a group in the data set. Each group should correspond to a "natural" cluster of data points, such that members of the same cluster are more similar, in some sense, than members of different clusters. In information theoretic clustering, similarity is quantified by a divergence measure, D(p_1,\dots,p_C), between the probability density functions (pdfs), p_1(\mathbf{x}),\dots,p_C(\mathbf{x}), of the clusters. The divergence measure is based on the quadratic Renyi entropy, which is an information theoretic measure of the uncertainty of a random variable, and include measures based on the Cauchy-Schwarz (CS) inequality and the integrated squared error (ISE). The goal is to assign labels such that an estimate, D(\hat{p_1},\dots,\hat{p}_c), of the divergence is maximized, because this corresponds to clusters which are maximally distant in terms of the cluster pdfs. This process is illustrated in Fig. 1.

Renyi entropy-based divergence measures

The quadratic Renyi entropy is given by

(1)
H_2(p) = - \log \int p^2(\mathbf{x}) d\mathbf{x},

where p(\mathbf{x}) is the pdf. It is a special case of the family of Renyi entropies, H_\alpha(p) = \frac{1}{1-\alpha} \log \int p^\alpha(\mathbf{x}) d\mathbf{x}, where \alpha is the Renyi entropy parameter (Renyi, 1976). In the limit \alpha \rightarrow 1 the Shannon entropy is obtained.

Consider two pdfs, p_1(\mathbf{x}) and p_2(\mathbf{x}). A divergence measure between these two pdfs, based on the Cauchy-Schwarz inequality, is given by (Principe et al., 2000)

(2)
D_{CS}(p_1,p_2) = -\log \frac{\int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x}}{\sqrt{\int p_1^2(\mathbf{x}) d\mathbf{x} \int p_2^2(\mathbf{x}) d\mathbf{x}}}.

This measure is symmetric and obeys D_{CS}(p_1,p_2) \in [0,\infty). Note that the divergence is zero only if p_1(\mathbf{x}) = p_2(\mathbf{x}). The measure is based on the quadratic Renyi entropy since D_{CS}(p_1,p_2) = - \log \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x} - \frac{1}{2} H_2(p_1) - \frac{1}{2} H_2(p_2), where - \log \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x} can be considered a "cross" Renyi entropy between p_1(\mathbf{x}) and p_2(\mathbf{x}).

Another divergence measure based on the quadratic Renyi entropy is the integrated squared error between p_1(\mathbf{x}) and p_2(\mathbf{x}). It is given by (Principe et al., 2000)

(3)
D_{ISE}(p_1,p_2) = \int [p_1(\mathbf{x}) - p_2(\mathbf{x})]^2 d\mathbf{x}.

This measure is also symmetric, obeying D_{ISE}(p_1,p_2) \in [0,\infty), being zero only if p_1(\mathbf{x}) = p_2(\mathbf{x}). The ISE divergence is intrinsically related to the quadratic Renyi entropy, since D_{ISE}(p_1,p_2) = \int p_1^2(\mathbf{x}) d\mathbf{x} - 2 \int p_1(\mathbf{x}) p_2(\mathbf{x}) d\mathbf{x} + \int p_2^2(\mathbf{x}) d\mathbf{x}.

Both the CS divergence and the ISE divergence may be extended to the C-cluster case as follows

D_{CS}(p_1,\dots,p_C) = -\log \frac{1}{\kappa} \sum_{i=1}^{C-1} \sum_{j > i} \frac{\int p_i(\mathbf{x})p_j(\mathbf{x}) d\mathbf{x}}{\sqrt{\int p_i^2(\mathbf{x}) d\mathbf{x} \int p_j^2(\mathbf{x}) d\mathbf{x}}},

where \kappa = \sum_{c=1}^{C-1} c. Similarly

D_{ISE}(p_1,\dots,p_C) =  \sum_{i=1}^{C-1} \sum_{j > i} \int [p_i(\mathbf{x}) - p_j(\mathbf{x})]^2 d\mathbf{x}.


Sample-based estimators via kernel density estimation

In order to be able to utilize the Renyi entropy-based divergences as cost functions for clustering, sample-based estimators need to be derived.

Notice that by (1), H_2(p) = - \log V(p) where V(p) = \mathcal{E}_p(p) and \mathcal{E}_p(\cdot) denotes the expectation operator wrt. p(\mathbf{x}). Hence, the Renyi quadratic entropy is a function of the expectation value of the density p(\mathbf{x}). This expectation can be estimated via the kernel density estimator formalism, without having to obtain an estimate of the full density function, as follows:

Assume that a data set \mathbf{x}_k, k=1,\dots,N, is available, generated from p(\mathbf{x}). A kernel density estimator of p(\mathbf{x}) based on these samples is given by (Parzen, 1962)

\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{k=1}^N W_\sigma(\mathbf{x},\mathbf{x}_k).

There are many valid kernel functions W_\sigma(\cdot,\cdot). The Gaussian function is the most widely used W_\sigma(\mathbf{x},\mathbf{x}_k) = \frac{1}{2 \pi \sigma^d} \exp \left\{ - \frac{1}{2\sigma^2}  \left\| \mathbf{x} - \mathbf{x}_k \right\|^2  \right\}, where \sigma is the scale parameter.

A (biased) sample-based estimator of the quadratic Renyi entropy is obtained by approximating the expectation operator using the sample mean, i.e. \mathcal{E}_p(p) \approx \frac{1}{N} \sum_{k=1}^N p(\mathbf{x}_k) and by replacing p(\mathbf{x}) by \hat{p}(\mathbf{x}), obtaining

\begin{array}{lcl} H_2(\hat{p}) &=& - \log \frac{1}{N} \sum_{k=1}^N \hat{p}(\mathbf{x}_k) \\              &=& - \log \frac{1}{N} \sum_{k=1}^N \frac{1}{N} \sum_{k^\prime = 1}^N W_{\sigma}(\mathbf{x}_k,\mathbf{x}_{k^\prime}) \\              &=& - \log \frac{1}{N^2} \sum_{k,k^\prime = 1}^{N,N} W_{\sigma}(\mathbf{x}_k,\mathbf{x}_{k^\prime}). \end{array}

Assume furthermore that samples corresponding to two clusters are available, such that \mathbf{x}_i, i=1,\dots,N_1, are generated from p_1(\mathbf{x}) and \mathbf{x}_j, j=1,\dots,N_2, are generated from p_2(\mathbf{x}). Following an exactly similar derivation as above, replacing p_1(\mathbf{x}) and p_2(\mathbf{x}) by kernel density estimators \hat{p}_1(\mathbf{x}) and \hat{p}_2(\mathbf{x}), (biased) sample-based estimators for the CS divergence (2) and the ISE divergence (3) are obtained as follows

(4)
D_{CS}(\hat{p}_1,\hat{p}_2) =  - \log \frac{\frac{1}{N_1N_2} \sum_{i,j=1}^{N_1,N_2} W_\sigma(\mathbf{x}_i,\mathbf{x}_j)}{\sqrt{\frac{1}{N_1^2} \sum_{i,i^\prime = 1}^{N_1,N_1} W_\sigma(\mathbf{x}_i,\mathbf{x}_{i^\prime} ) \frac{1}{N_2^2} \sum_{j,j^\prime = 1}^{N_2,N_2} W_\sigma(\mathbf{x}_j,\mathbf{x}_{j^\prime})}},
(5)
D_{ISE}(\hat{p}_1,\hat{p}_2) =  \frac{1}{N_1^2} \sum_{i,i^\prime = 1}^{N_1,N_1} W_\sigma(\mathbf{x}_i,\mathbf{x}_{i^\prime} ) - \frac{2}{N_1N_2} \sum_{i,j=1}^{N_1,N_2} W_\sigma(\mathbf{x}_i,\mathbf{x}_j) + \frac{1}{N_2^2} \sum_{j,j^\prime = 1}^{N_2,N_2} W_\sigma(\mathbf{x}_j,\mathbf{x}_{j^\prime}).

These estimators are easily extended to the C-cluster case. The use of kernel functions other than the Gaussian is also possible (Jenssen et al., 2007).


Gradient descent for clustering optimization

To be useful as clustering cost functions, the sample-based estimators of the CS divergence or the ISE divergence must be expressed in terms of the labels of the data points. The goal of the clustering is precisely to adjust the labels, such that the cost function used obtains its maximum value. The CS divergence has been the most widely used information theoretic clustering cost function, and will be focused upon in the remainder of this section.

It is helpful to introduce the cluster membership vectors \mathbf{l}_i, \ i = 1,\dots,N, defined as C dimensional binary vectors. Only the c'th element, l_{ic}, of any \mathbf{l}_i equals one, meaning that data pattern \mathbf{x}_i is assigned to cluster c (crisp cluster memberships). The relevant CS-based clustering cost function (focusing on the argument of the \log in (2) since minimization of this quantity is equivalent to maximization of the CS divergence) may be expressed in terms of these membership vectors as

J(\mathbf{l}_1,\dots,\mathbf{l}_N) = \frac{\frac{1}{2} \sum_{i,j=1}^{N,N} \left( 1-\mathbf{l}_i^T\mathbf{l}_j \right) W(\mathbf{x}_i,\mathbf{x}_j)}{\sqrt{\prod_{c=1}^C  \sum_{i,j=1}^{N,N} l_{ic} l_{jc} W(\mathbf{x}_i,\mathbf{x}_j)}},

where i and j now are indexes running over all the N data points.

Gokcay and Principe (2002) made a first attempt at optimizing this function. However, they only optimized the numerator of this expression, and the optimization was carried out in an exhaustive search manner. This means that every clustering possibility had to be examined by computing the cost function in each case, for then to pick the best result in the end. Since the computation of the cost function scales as O(N^2), where N is the number of data points, this method became prohibitive for anything but very small and manageable data sets. The results obtained were however very promising, and instigated research focused on more efficient optimization rules.

Jenssen et al. (2007) instead derived a fuzzy fixed point optimization procedure based on the technique of Lagrange multipliers, resulting in an efficient gradient descent-based information theoretic clustering algorithm. Being a fuzzy algorithm, each data point may thus be assigned to several clusters at the same time during the optimization process, yielding a smooth cost function. This is similar in spirit to Fuzzy C-means cluster analysis.

Let \mathbf{l}_i \in [0,1]^d, i=1,\dots,N. The following constrained optimization problem was defined

(6)
\min_{\mathbf{l}_1,\dots,\mathbf{l}_N} \ J(\mathbf{l}_1,\dots,\mathbf{l}_N),

subject to \mathbf{l}_j^T \mathbf{1} - 1 = 0, \quad j=1,\dots,N, where \mathbf{1} is a C-dimensional vector whose elements are all one. Hence, a data pattern is allowed to have a certain degree of membership to any cluster, but the constraint ensures that the sum of the memberships adds up to one. Furthermore, the following useful change of variables was introduced. Let l_{ic}=v_{ic}^2, \ c=1,\dots,C. Consider

(7)
\min_{\mathbf{v}_1,\dots,\mathbf{v}_N} \ J(\mathbf{v}_1,\dots,\mathbf{v}_N),

subject to \mathbf{v}_j^T \mathbf{v}_j - 1 = 0, \quad j=1,\dots,N. The constraints in (7) are equivalent to the constraints in (6). The optimization problem, (7), amounts to adapting the vectors \mathbf{v}_i, \ i=1,\dots,N, such that

\frac{\partial J}{{\partial \mathbf{v}_i}} = \left( \frac{\partial J}{{\partial \mathbf{l}_i}}^T \frac{\partial \mathbf{l}_i}{{\partial \mathbf{v}_i}} \right)^T = \mathbf{\Gamma} \frac{\partial J}{{\partial \mathbf{l}_i}} \rightarrow \mathbf{0},

where \mathbf{\Gamma}=diag(2\sqrt{l_{i1}},\dots,2\sqrt{l_{iC}}). Notice that if all of the diagonal elements 2 \sqrt{l_{ic}}, \ c=1,\dots,C, are positive, \frac{\partial J}{{\partial \mathbf{v}_i}} \rightarrow \mathbf{0} implies that \frac{\partial J}{{\partial \mathbf{l}_i}} \rightarrow \mathbf{0}. The elements of the membership vectors \mathbf{l}_i, \ i=1,\dots,N, are forced to always be positive, by adding a small positive constant \epsilon (e.g \epsilon \sim 0.05) to all the elements during each membership update.

The Lagrange function is given by

L=J(\mathbf{v}_1,\dots,\mathbf{v}_N) + \sum_{j=1}^N \lambda_j \left( \mathbf{v}_j^T \mathbf{v}_j - 1  \right),

where \lambda_j, \ j=1,\dots,N, are the Lagrange multipliers. The necessary conditions for the extremum of L are given by

(8)
\frac{\partial L}{{\partial \mathbf{v}_i}}  = \frac{\partial J}{{\partial \mathbf{v}_i}} + \sum_{k=1}^N \lambda_k \frac{\partial}{{\partial \mathbf{v}_i}} \left( \mathbf{v}_k^T \mathbf{v}_k - 1  \right) = \mathbf{0},
(9)
\frac{\partial L}{{\partial \lambda}_j}  = \mathbf{v}_j^T \mathbf{v}_j - 1 = 0,

for i=1,\dots,N and j=1,\dots,N. From (8) the following fixed-point adaptation rule for the vector \mathbf{v}_i is derived as follows

\frac{\partial J}{{\partial \mathbf{v}_i}} + 2\lambda_i \mathbf{v}_i = \mathbf{0} \quad \Rightarrow \quad \mathbf{v}_i^+= - \frac{1}{2 \lambda_i} \ \frac{\partial J}{\partial \mathbf{v}_i},

i=1,\dots,N, and where \mathbf{v}_i^+ denotes the updated vector.

The Lagrange multipliers, \lambda_i, \ i=1,\dots,N, are solved for by evaluating the constraints given by (9) as follows

\begin{array}{ccl} \mathbf{v}_i^{+T} \mathbf{v}_i^+ - 1 & = & 0, \\  & \Rightarrow & \left( - \frac{1}{2 \lambda_i} \ \frac{\partial J}{\partial \mathbf{v}_i} \right)^T \left( - \frac{1}{2 \lambda_i} \ \frac{\partial J}{\partial \mathbf{v}_i} \right) - 1 = 0, \\  & \Rightarrow & \lambda_i = \frac{1}{2} \sqrt{\frac{\partial J}{\partial \mathbf{v}_i}^T \frac{\partial J}{\partial \mathbf{v}_i}}. \end{array}

See Jenssen et al. (2007) for the derivation of \frac{\partial J}{\partial \mathbf{v}_i}. After convergence of the algorithm, or after a predetermined number of iterations, the maximum value of the elements of each membership vector \mathbf{l}_i, \ i=1,\dots,N, is assigned the value one, and the rest zero.

This information theoretic clustering algorithm has several advantages. Firstly, the computational complexity of calculating gradients may be reduced by using only a random subset of size M \ll N of the data points in the calculation. This creates an O(MN) algorithm at each iteration step. Secondly, the problem of local optima of non-convex cost functions can to a large degree be avoided by incorporating kernel annealing as an optimization strategy. Hence, the kernel (window) width may start out large during the initial iterations, for the to be annealed, or reduced, as the number of iterations increases. Thirdly, the annealing procedure may help the algorithm cope with different data scales, by discovering large scale structures in the initial phase, followed by a "fine tuning" as the kernel width is decreased.


Some examples

A few examples of information theoretic clustering results are provided. See for example Jenssen et al. (2007) or Gokcay and Principe (2002) for more examples. In information theoretic clustering no parametric assumptions are made with respect to the shape of the cluster probability density functions. Therefore, clusters with elongated shapes can be clustered, as shown in Fig. 2. In the left panel, the data set to be clustered is shown. In the right panel, the obtained clusters are shown using different symbols to indicate cluster membership. Non-linear cluster boundaries may also be discovered, as shown in Fig. 3. Finally, Fig. 4 shows an example of information theoretic clustering for segmentation of a textured image, where image features are created using a bank of Gabor filters.

Figure 2: Information theoretic clustering of data set with elongated clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.
Enlarge
Figure 2: Information theoretic clustering of data set with elongated clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.
Figure 3: Information theoretic clustering of a data set with highly non-linear clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.
Enlarge
Figure 3: Information theoretic clustering of a data set with highly non-linear clusters. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.
Figure 4: Information theoretic clustering of textured image. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.
Enlarge
Figure 4: Information theoretic clustering of textured image. Reprinted from Pattern Recognition, 40 (3), R. Jenssen, D. Erdogmus, K. E. Hild III, J. C. Principe and T. Eltoft, Information Cut for Clustering using a Gradient Descent Approach, Pages No. 796-806, Copyright (2007), with permission from Elsevier.

Connection to graph-theoretic clustering

A set of points, \mathbf{x}_k, k=1,\dots,N, in an arbitrary data space can be represented as a weighted undirected graph \mathcal{G}. Each node in the graph corresponds to a data point. The edge formed between a pair of nodes, say i and j, is weighted according to the similarity between the corresponding data points. The edge-weight is denoted v(\mathbf{x}_i,\mathbf{x}_j), and is often computed using a radial basis function, i.e v(\mathbf{x}_i,\mathbf{x}_j)=\exp \left\{ - \frac{1}{2\sigma^2}  \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2  \right\}. The graph cut provides a measure of the cost of partitioning a graph \mathcal{G} into two subgraphs \mathcal{G}_1 and \mathcal{G}_2, and is defined as

(10)
Cut(\mathcal{G}_1,\mathcal{G}_2) = \sum_{i,j=1}^{N_1,N_2} v(\mathbf{x}_i,\mathbf{x}_j),

where the index i=1,\dots,N_1, runs over the N_1 nodes of subgraph \mathcal{G}_1 and the index j=1,\dots,N_2, runs over the N_2 nodes of subgraph \mathcal{G}_2. That is, the cut measures the weight of the edges which have to be removed in order to create the two subgraphs. Wu and Leahy (1993) first proposed to minimize the cut-cost as a means for clustering and image segmentation. Many attempts have been made at somehow normalizing the cut-cost, as it tends to produce a skewed partition. A prominent example is Shi and Malik (2000), who introduced the normalized cut, optimized using the spectral properties of an affinity, or kernel, matrix.

By comparing (4) to (10), it becomes clear that the CS divergence estimator may be related to the graph cut, since the numerator of (4) is exactly equivalent to (10) by considering W_\sigma(\mathbf{x}_i,\mathbf{x}_j) an edge weight similar to v(\mathbf{x}_i,\mathbf{x}_j). Moreover, the volume of subgraph \mathcal{G}_1 is given by Vol(\mathcal{G}_1)= \sum_{i,i^\prime=1}^{N_1,N_1} v(\mathbf{x}_i,\mathbf{x}_{i^\prime}) and the volume of subgraph \mathcal{G}_2 is given by Vol(\mathcal{G}_2)= \sum_{j,j^\prime=1}^{N_2,N_2} v(\mathbf{x}_j,\mathbf{x}_{j^\prime}). Hence, the CS divergence estimator may be expressed as

D(\hat{p}_1,\hat{p}_2) = - \log \frac{Cut\mathcal{G}_1,\mathcal{G}_2}{\sqrt{Vol(\mathcal{G}_1) Vol(\mathcal{G}_2)}}.

Thus, the CS divergence estimated using kernel density estimation yields a well-defined normalization of the graph cut. This version of the graph cut has been named the Information cut (Jenssen et al., 2007). Optimization the CS divergence for clustering is therefore equivalent to the optimization of a normalized version of the graph cut. In (Jenssen et al., 2006a), a graph spectral version of information theoretic clustering was proposed.


The ISE divergence may also be expressed in a similar manner, based on (5), as

D(\hat{p}_1,\hat{p}_2) = \frac{1}{N_1^2} Vol(\mathcal{G}_1) - \frac{2}{N_1 N_2} Cut(\mathcal{G}_1,\mathcal{G}_2) + \frac{1}{N_2^2} Vol(\mathcal{G}_2).


Connection to Mercer kernel based learning

Mercer kernel based learning (Shawe-Taylor and Cristianini, 2004) is based on deriving non-linear learning algorithms by computing inner-products in a potentially infinite dimensional space obtained by the non-linear mapping \mathbf{x}_k \rightarrow \mathbf{\Phi}(\mathbf{x}_k), \ k=1,\dots,N. Inner-products are computed by means of a Mercer kernel function v(\mathbf{x}_i,\mathbf{x}_j) = \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle. A Mercer kernel is a positive semidefinite function. The most widely used such kernel is the radial basis function.

Jenssen et al., (2005, 2006b) noticed that the Gaussian kernel W_\sigma(\mathbf{x}_i,\mathbf{x}_j) is a Mercer kernel. Thus, W_\sigma(\mathbf{x}_i,\mathbf{x}_j) = \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle. Hence, based on (4), the CS divergence estimator may be rewritten as follows

\begin{array}{ccl} D_{CS}(\hat{p}_1,\hat{p}_2) & = & - \log \frac{\frac{1}{N_1 N_2} \sum_{i,j=1}^{N_1,N_2} \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_j) \right \rangle}{\sqrt{\frac{1}{N_1^2} \sum_{i,i^\prime=1}^{N_1,N_1} \left \langle \mathbf{\Phi}(\mathbf{x}_i),\mathbf{\Phi}(\mathbf{x}_{i^\prime}) \right \rangle  \frac{1}{N_2^2} \sum_{j,j^\prime=1}^{N_2,N_2} \left \langle \mathbf{\Phi}(\mathbf{x}_j),\mathbf{\Phi}(\mathbf{x}_{j^\prime}) \right \rangle }} \\ &=& - \log \frac{\left \langle \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i), \frac{1}{N_2}\sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j)\right \rangle  }{\sqrt{\left \langle \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i), \frac{1}{N_1}\sum_{i^\prime=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_{i^\prime})\right \rangle \left \langle \frac{1}{N_2} \sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j), \frac{1}{N_2}\sum_{j^\prime=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_{j^\prime})\right \rangle}} \\ &=&  - \log \frac{ \left \langle \mathbf{m}_{1},\mathbf{m}_{2} \right \rangle }{\sqrt{ \left \langle \mathbf{m}_{1},\mathbf{m}_{1} \right \rangle \left \langle \mathbf{m}_{2},\mathbf{m}_{2} \right \rangle }} =  - \log \cos \angle (\mathbf{m}_{1},\mathbf{m}_{2}), \end{array}

where \mathbf{m}_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} \mathbf{\Phi}(\mathbf{x}_i) and \mathbf{m}_2 = \frac{1}{N_2} \sum_{j=1}^{N_2} \mathbf{\Phi}(\mathbf{x}_j) can be considered mean vectors of feature space data clusters corresponding to the data points associated with p_1(\mathbf{x}) and p_2(\mathbf{x}), respectively. This means that maximization of the CS divergence for clustering is equivalent to minimization of the cosine of the angle between the Mercer kernel feature space cluster mean vectors.

Using a similar derivation, the ISE divergence may be expressed in terms of the squared Euclidean distance between the Mercer kernel space cluster mean vectors, as

D_{ISE}(\hat{p}_1,\hat{p}_2) = \lVert \mathbf{m}_1 - \mathbf{m}_2 \rVert.

References

Renyi A. (1976). On measures of entropy and information. Selected Papers of Alfred Renyi, Akademiai Kiado, Budapest, 2:565-580.

Principe J. C., Xu D. and Fisher J. (2000). Information Theoretic Learning. Unsupervised Adaptive Filtering, Vol. 1, Chapter 7 (Haykin S., Editor), John Wiley and Sons, New York.

Parzen E. (1962). On the Estimation of a Probability Density Function and the Mode. The Annals of Mathematical Statistics, 32:1065-1076.

Gokcay E. and Principe J. C. (2002). Information Theoretic Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):158-170.

Jenssen R., Erdogmus D., Hild K. E., Principe J. C. and Eltoft T. (2007). Information Cut for Clustering using a Gradient Descent Approach. Pattern Recognition, 40:796-806.

Wu Z. and Leahy R. (1993). An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Applications to Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1101-1113.

Shi J. and Malik J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905.

Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2006a). Information Theoretic Angle-Based Spectral Clustering: A Theoretical Analysis and an Algorithm. In Proceedings of the International Joint Conference on Neural Networks, 4904-4911.

Shawe-Taylor J. and Cristianini N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge UK.

Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2005). The Laplacian PDF Distance: A Cost function for Clustering in a Kernel Feature Space. In Advances in Neural Information Processing Systems 17: 625-632, Cambridge MA, USA.

Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2006b). Some Equivalences Between Kernel Methods and Information Theoretic Methods. Journal of VLSI Signal Processing, 45:49-65.

Further reading

Erdogmus D. and Principe J. C. (2006). From Linear Adaptive Filtering to Nonlinear Information Processing. IEEE Signal Processing Magazine, 23(6):14-33.

Jenssen R. (2005). An Information Theoretic Approach to Machine Learning. Dr. Scient (Ph.D) Thesis. University of Tromso, Norway. Figure 5: Image:Example.jpgFigure 6: Image:Example.jpg

Invited by: Dr. Ke CHEN, School of Computer Science, The University of Manchester, U.K.
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors