Information theoretic clustering
From Scholarpedia
| This article is undergoing 2 initial reviews; It may contain inaccuracies and unapproved changes made by anonymous reviewers. | ||||||||||||||||||||
Author: Dr. Robert Jenssen, Department of Physics and Technology, University of Tromso, N-9037 Tromso, Norway
Author: Dr. Jose Carlos Principe, Electrical and Computer Engineering Department, University of Florida, Gainesville, FL 32611, USA
Information theoretic clustering is the process of grouping, or clustering, the items comprising a data set, according to a Renyi entropy-based divergence measure between probability density functions. Such an information theoretic divergence measure captures all the statistical information contained in the data, as opposed to mean squared error measures, and can thus handle non-linear cluster boundaries. Sample-based estimators of the divergence measure is obtained via kernel density estimation, without having to estimate the full density function. Information theoretic clustering is related to graph-theoretic clustering and Mercer kernel based learning.
Problem specification
In data clustering, a label
,
, is assigned to each
-dimensional data point
,
, comprising a data set. The data points with the same label form a group in the data set. Each group should correspond to a "natural" cluster of data points, such that members of the same cluster are more similar, in some sense, than members of different clusters. In information theoretic clustering, similarity is quantified by a divergence measure,
between the probability density functions (pdfs),
of the clusters. The divergence measure is based on the quadratic Renyi entropy, which is an information theoretic measure of the uncertainty of a random variable, and include measures based on the Cauchy-Schwarz (CS) inequality and the integrated squared error (ISE). The goal is to assign labels such that an estimate,
, of the divergence is maximized, because this corresponds to clusters which are maximally distant in terms of the cluster pdfs. This process is illustrated in Fig. 1.
Renyi entropy-based divergence measures
The quadratic Renyi entropy is given by
- (1)
where
is the pdf. It is a special case of the family of Renyi entropies,
, where
is the Renyi entropy parameter (Renyi, 1976). In the limit
the Shannon entropy is obtained.
Consider two pdfs,
and
. A divergence measure between these two pdfs, based on the Cauchy-Schwarz inequality, is given by (Principe et al., 2000)
- (2)
This measure is symmetric and obeys
. Note that the divergence is zero only if
. The measure is based on the quadratic Renyi entropy since
where
can be considered a "cross" Renyi entropy between
and
.
Another divergence measure based on the quadratic Renyi entropy is the integrated squared error between
and
. It is given by (Principe et al., 2000)
- (3)
This measure is also symmetric, obeying
, being zero only if
. The ISE divergence is intrinsically related to the quadratic Renyi entropy, since
Both the CS divergence and the ISE divergence may be extended to the
-cluster case as follows
where
. Similarly
Sample-based estimators via kernel density estimation
In order to be able to utilize the Renyi entropy-based divergences as cost functions for clustering, sample-based estimators need to be derived.
Notice that by (1),
where
and
denotes the expectation operator wrt.
. Hence, the Renyi quadratic entropy is a function of the expectation value of the density
. This expectation can be estimated via the kernel density estimator formalism, without having to obtain an estimate of the full density function, as follows:
Assume that a data set
,
, is available, generated from
. A kernel density estimator of
based on these samples is given by (Parzen, 1962)
There are many valid kernel functions
. The Gaussian function is the most widely used
where
is the scale parameter.
A (biased) sample-based estimator of the quadratic Renyi entropy is obtained by approximating the expectation operator using the sample mean, i.e.
and by replacing
by
, obtaining
Assume furthermore that samples corresponding to two clusters are available, such that
,
, are generated from
and
,
, are generated from
. Following an exactly similar derivation as above, replacing
and
by kernel density estimators
and
, (biased) sample-based estimators for the CS divergence (2) and the ISE divergence (3) are obtained as follows
These estimators are easily extended to the
-cluster case. The use of kernel functions other than the Gaussian is also possible (Jenssen et al., 2007).
Gradient descent for clustering optimization
To be useful as clustering cost functions, the sample-based estimators of the CS divergence or the ISE divergence must be expressed in terms of the labels of the data points. The goal of the clustering is precisely to adjust the labels, such that the cost function used obtains its maximum value. The CS divergence has been the most widely used information theoretic clustering cost function, and will be focused upon in the remainder of this section.
It is helpful to introduce the cluster membership vectors
, defined as
dimensional binary vectors. Only the
'th element,
, of any
equals one, meaning that data pattern
is assigned to cluster
(crisp cluster memberships). The relevant CS-based clustering cost function (focusing on the argument of the
in (2) since minimization of this quantity is equivalent to maximization of the CS divergence) may be expressed in terms of these membership vectors as
where
and
now are indexes running over all the
data points.
Gokcay and Principe (2002) made a first attempt at optimizing this function. However, they only optimized the numerator of this expression, and the optimization was carried out in an exhaustive search manner. This means that every clustering possibility had to be examined by computing the cost function in each case, for then to pick the best result in the end. Since the computation of the cost function scales as
, where
is the number of data points, this method became prohibitive for anything but very small and manageable data sets. The results obtained were however very promising, and instigated research focused on more efficient optimization rules.
Jenssen et al. (2007) instead derived a fuzzy fixed point optimization procedure based on the technique of Lagrange multipliers, resulting in an efficient gradient descent-based information theoretic clustering algorithm. Being a fuzzy algorithm, each data point may thus be assigned to several clusters at the same time during the optimization process, yielding a smooth cost function. This is similar in spirit to Fuzzy C-means cluster analysis.
Let
,
. The following constrained optimization problem was defined
- (6)
subject to
,
where
is a
-dimensional vector whose elements are all one. Hence, a data pattern is allowed to have a certain degree of membership to any cluster, but the constraint ensures that the sum of the memberships adds up to one. Furthermore, the following useful change of variables was introduced. Let
. Consider
- (7)
subject to
. The constraints in (7) are equivalent to the constraints in (6). The optimization problem, (7), amounts to adapting the vectors
such that
where
. Notice that if all of the diagonal elements
are positive,
implies that
. The elements of the membership vectors
are forced to always be positive, by adding a small positive constant
(e.g
) to all the elements during each membership update.
The Lagrange function is given by
where
, are the Lagrange multipliers. The necessary conditions for the extremum of
are given by
for
and
. From (8) the following fixed-point adaptation rule for the vector
is derived as follows
, and where
denotes the updated vector.
The Lagrange multipliers,
, are solved for by evaluating the constraints given by (9) as follows
See Jenssen et al. (2007) for the derivation of
. After convergence of the algorithm, or after a predetermined number of iterations, the maximum value of the elements of each membership vector
, is assigned the value one, and the rest zero.
This information theoretic clustering algorithm has several advantages. Firstly, the computational complexity of calculating gradients may be reduced by using only a random subset of size
of the data points in the calculation. This creates an
algorithm at each iteration step. Secondly, the problem of local optima of non-convex cost functions can to a large degree be avoided by incorporating kernel annealing as an optimization strategy. Hence, the kernel (window) width may start out large during the initial iterations, for the to be annealed, or reduced, as the number of iterations increases. Thirdly, the annealing procedure may help the algorithm cope with different data scales, by discovering large scale structures in the initial phase, followed by a "fine tuning" as the kernel width is decreased.
Some examples
A few examples of information theoretic clustering results are provided. See for example Jenssen et al. (2007) or Gokcay and Principe (2002) for more examples. In information theoretic clustering no parametric assumptions are made with respect to the shape of the cluster probability density functions. Therefore, clusters with elongated shapes can be clustered, as shown in Fig. 2. In the left panel, the data set to be clustered is shown. In the right panel, the obtained clusters are shown using different symbols to indicate cluster membership. Non-linear cluster boundaries may also be discovered, as shown in Fig. 3. Finally, Fig. 4 shows an example of information theoretic clustering for segmentation of a textured image, where image features are created using a bank of Gabor filters.
Connection to graph-theoretic clustering
A set of points,
, in an arbitrary data space can be represented as a weighted undirected graph
. Each node in the graph corresponds to a data point. The edge formed between a pair of nodes, say
and
, is weighted according to the similarity between the corresponding data points. The edge-weight is denoted
, and is often computed using a radial basis function, i.e
. The graph cut provides a measure of the cost of partitioning a graph
into two subgraphs
and
, and is defined as
- (10)
where the index
, runs over the
nodes of subgraph
and the index
, runs over the
nodes of subgraph
. That is, the cut measures the weight of the edges which have to be removed in order to create the two subgraphs. Wu and Leahy (1993) first proposed to minimize the cut-cost as a means for clustering and image segmentation. Many attempts have been made at somehow normalizing the cut-cost, as it tends to produce a skewed partition. A prominent example is Shi and Malik (2000), who introduced the normalized cut, optimized using the spectral properties of an affinity, or kernel, matrix.
By comparing (4) to (10), it becomes clear that the CS divergence estimator may be related to the graph cut, since the numerator of (4) is exactly equivalent to (10) by considering
an edge weight similar to
. Moreover, the volume of subgraph
is given by
and the volume of subgraph
is given by
. Hence, the CS divergence estimator may be expressed as
Thus, the CS divergence estimated using kernel density estimation yields a well-defined normalization of the graph cut. This version of the graph cut has been named the Information cut (Jenssen et al., 2007). Optimization the CS divergence for clustering is therefore equivalent to the optimization of a normalized version of the graph cut. In (Jenssen et al., 2006a), a graph spectral version of information theoretic clustering was proposed.
The ISE divergence may also be expressed in a similar manner, based on (5), as
Connection to Mercer kernel based learning
Mercer kernel based learning (Shawe-Taylor and Cristianini, 2004) is based on deriving non-linear learning algorithms by computing inner-products in a potentially infinite dimensional space obtained by the non-linear mapping
Inner-products are computed by means of a Mercer kernel function
A Mercer kernel is a positive semidefinite function. The most widely used such kernel is the radial basis function.
Jenssen et al., (2005, 2006b) noticed that the Gaussian kernel
is a Mercer kernel. Thus,
Hence, based on (4), the CS divergence estimator may be rewritten as follows
where
and
can be considered mean vectors of feature space data clusters corresponding to the data points associated with
and
, respectively. This means that maximization of the CS divergence for clustering is equivalent to minimization of the cosine of the angle between the Mercer kernel feature space cluster mean vectors.
Using a similar derivation, the ISE divergence may be expressed in terms of the squared Euclidean distance between the Mercer kernel space cluster mean vectors, as
References
Renyi A. (1976). On measures of entropy and information. Selected Papers of Alfred Renyi, Akademiai Kiado, Budapest, 2:565-580.
Principe J. C., Xu D. and Fisher J. (2000). Information Theoretic Learning. Unsupervised Adaptive Filtering, Vol. 1, Chapter 7 (Haykin S., Editor), John Wiley and Sons, New York.
Parzen E. (1962). On the Estimation of a Probability Density Function and the Mode. The Annals of Mathematical Statistics, 32:1065-1076.
Gokcay E. and Principe J. C. (2002). Information Theoretic Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):158-170.
Jenssen R., Erdogmus D., Hild K. E., Principe J. C. and Eltoft T. (2007). Information Cut for Clustering using a Gradient Descent Approach. Pattern Recognition, 40:796-806.
Wu Z. and Leahy R. (1993). An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Applications to Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1101-1113.
Shi J. and Malik J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905.
Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2006a). Information Theoretic Angle-Based Spectral Clustering: A Theoretical Analysis and an Algorithm. In Proceedings of the International Joint Conference on Neural Networks, 4904-4911.
Shawe-Taylor J. and Cristianini N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge UK.
Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2005). The Laplacian PDF Distance: A Cost function for Clustering in a Kernel Feature Space. In Advances in Neural Information Processing Systems 17: 625-632, Cambridge MA, USA.
Jenssen R., Erdogmus D., Principe J. C. and Eltoft T. (2006b). Some Equivalences Between Kernel Methods and Information Theoretic Methods. Journal of VLSI Signal Processing, 45:49-65.
Further reading
Erdogmus D. and Principe J. C. (2006). From Linear Adaptive Filtering to Nonlinear Information Processing. IEEE Signal Processing Magazine, 23(6):14-33.
Jenssen R. (2005). An Information Theoretic Approach to Machine Learning. Dr. Scient (Ph.D) Thesis. University of Tromso, Norway.

,
