Support vector clustering
From Scholarpedia
| Asa Ben-Hur (2008), Scholarpedia, 3(6):5187. | doi:10.4249/scholarpedia.5187 | revision #51603 [link to/cite this article] | |||||||||||||||||||
Curator: Dr. Asa Ben-Hur, Computer Science Department, Colorado State University, Fort Collins, CO
The objective of clustering is to partition a data set into groups according to some criterion in an attempt to organize data into a more meaningful form. There are many ways of achieving this goal. Clustering may proceed according to some parametric model or by grouping points according to some distance or similarity measure as in hierarchical clustering. A natural way to put cluster boundaries is in regions in data space where there is little data, i.e. in "valleys" in the probability distribution of the data. This is the path taken in support vector clustering (SVC), which is based on the support vector approach.
In SVC data points are mapped from data space to a high dimensional feature space using a kernel function. In feature space we look for the smallest sphere that encloses the image of the data using the Support Vector Domain Description algorithm. This sphere, when mapped back to data space, forms a set of contours which enclose the data points. We interpret these contours as cluster boundaries, and points enclosed by each contour are associated by SVC to the same cluster.
Contents |
The algorithm
SVC uses the
Support Vector Domain Description (SVDD)
to delineate the region in data space where the
input examples are concentrated.
SVDD belongs to the general category of kernel based learning.
In its "linear" version SVDD looks for the smallest sphere that encloses the data.
When used in conjunction with a kernel function, it looks for the smallest enclosing sphere
in the feature space defined by the kernel function.
While in feature space the data is described by a sphere,
when mapped back to data-space the sphere is transformed into
a set of non-linear contours that enclose the data
(see Figure 2).
SVDD provides a decision function that tells whether
a given input is inside the feature-space sphere or not,
indicating whether a given point belongs to the
support of the distribution.
More specifically, it is the radius-squared of the feature-space sphere
minus the distance-squared of the image of a data point
from the center of the feature-space sphere.
This function, denoted by
returns a value greater than 0 if
is inside the feature space sphere and negative otherwise.
For more details on SVDD the reader is referred to the SVDD article.
We now interpret the manifold or contour where
as cluster boundaries.
An example of such contours is shown in Figure 2.
However, these boundaries define the clusters
implicitly, and an additional step is required
to "tease" the cluster membership out of the SVDD.
The key geometrical observation that enables to infer clusters
out of the SVDD is that
given a pair of data points that belong to different components (clusters),
the line segment that connects them
must go through a region in data space which is part of a "valley"
in the probability density of the data, i.e. does not belong
to the support of the distribution.
Such a line must then go outside the feature-space sphere, and therefore have
a segment with points that return a negative value when tested
with the SVDD decision function (see Figure 1).
This observation leads to the definition of
an adjacency matrix
between pairs of points in our dataset.
For a given pair of points
and
the
element of
is given by
Clusters are now defined as the connected components of the
graph induced by
.
Checking the line segment is implemented
by sampling a number of points
(20 points were used in numerical experiments).
Examples
SVC uses SVDD to generate non-linear cluster boundaries in conjunction with the Gaussian kernel:
The boundaries generated by SVDD become increasingly non-linear as the
parameter
which governs the width of the Gaussian,
is increased as shown in Figure 2.
As the value of
is increased the
SVDD boundary fits the data more tightly, and
at several values of
the enclosing boundary
splits, forming an increasing number of components (clusters).
The so-called support vectors are the data points that are
on the boundary.
As
is increased their number increases, increasing the
complexity of the boundary.
The choice of the Gaussian kernel is motivated by Roberts (1997)
where a related method produces a sequence of cluster splitting when
used in conjunction with the Gaussian kernel.
In real data, clusters are usually not as well separated as in Figure 2.
Thus, in order to observe splitting of contours, we must allow
some data points to fall outside the support of the distribution, i.e. outside the
feature-space sphere defined by SVDD.
The fraction of such points is controlled using the parameter
of the SVDD.
The effect of allowing outliers is demonstrated in Figure 3.
Without allowing outliers contour splitting does
not occur for the two outer rings for any value of
.
Allowing for outliers,
the clusters are separated easily (right panel of Figure 3).
Note that the outliers are
unclassified SVC, since they lie outside the enclosing sphere.
One may decide either to leave them unclassified, or to assign them to the cluster closest to them.
The original SVC paper recommends exploring the parameter
space of
by starting with a low value of
and high value of
where a single cluster with no outliers occurs.
is then iteratively decreased to look for cluster
bifurcations;
whenever the number of support vectors becomes excessive, indicating
a non-smooth boundary,
is decreased, allowing for more outliers.
Note that whenever applying a clustering algorithm the question arises whether the clustering captures
structure that is inherent in the data.
There are many approaches for clustering validation, one of which is the stability of the
clustering under sampling or other perturbations.
Stability of the clustering with respect to varying the width of the gaussian kernel
could be an indicator of stability of the clustering, but further research is required to show that.
Summary
SVC is a nonparametric clustering algorithm that does not make any assumption on the number or shape of the clusters in the data. In our experience it works best for low-dimensional data, so if your data is high-dimensional, a preprocessing step, e.g. using principal component analysis, is usually required. Several enhancements of the original algorithm were proposed that provide specialized algorithms for computing the clusters by only computing a subset of the edges in the adjacency matrix.
References
- A. Ben-Hur, D. Horn, H.T. Siegelmann and V. Vapnik. Support vector clustering. Journal of Machine Learning Research 2:125-137, 2001.
- S.J. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2): 261-272, 1997.
Internal references
- John Guckenheimer (2007) Bifurcation. Scholarpedia, 2(6):1517.
- Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
- Philip Holmes and Eric T. Shea-Brown (2006) Stability. Scholarpedia, 1(10):1838.
Further reading
External links
See also
Data Clustering, Support Vector Domain Description
| Asa Ben-Hur (2008) Support vector clustering. Scholarpedia, 3(6):5187, (go to the first approved version) Created: 20 September 2007, reviewed: 14 June 2008, accepted: 14 June 2008 |
