Support vector domain description
From Scholarpedia
| This article has not been peer-reviewed or accepted for publication yet; It may be unfinished, contain inaccuracies, or unapproved changes. | ||||||||||||||||||||
Dr. Robert. P.W. Duin accepted the invitation on 24 September 2007 (self-imposed deadline: 24 November 2007).
This article will briefly cover: a support vector approach for novelty (or outlier) detection.
Contents |
Simple description
The support vector domain description (SVDD) is a classifier that tries to solve the [outlier detection] problem. It distinguishes a set of target objects (called the target class) from the outlier class. It does it by computing a spherically shaped decision boundary around a training set of target objects. The sphere is put such that the volume of the sphere is minimized, while still keeping all (or most of the) target objects inside. The sphere is actually determined by just a few objects from the target set, located on the sphere boundary. The sphere center and radius can therefore be expressed in terms of these objects. Analogous to the Support Vector Machine, these objects are called the support objects. A new object is classified by computing the distance to the sphere center. When the distance is larger than the sphere radius, the object will be classified as outlier, otherwise it is being accepted as a genuine target object.
The mathematical formulation
Hard-ball solution
The most simple formulation of the SVDD is the hard hypersphere, where
all target objects are inside the sphere. Assume we have a set of target
objects
. To find the center
and radius
of the sphere with minimal volume, we
have to solve:
such that
The constraints that the objects are inside the sphere can be incorporated in the optimization by using Lagrange Multipliers. The so-called dual representation becomes:
such that
where
are the Lagrange multipliers corresponding to each
of the constraints.
It appears that the sphere center can be written as a linear
combination of the original objects
.
The sphere radius
does not appear directly from the
optimization. For this, the distance of one of the support vectors to the sphere
centers has to be computed. For the hard-ball solution, this distance is the same
for all support vectors.
To evaluate a new object
, the (squared) distance to the sphere
center has to be computed:
.
The object is rejected when this squared distance is larger than the sphere radius
.
Soft-ball solution
The formulation can be made more robust by allowing some target objects
to fall outside the description. Objects that are far from the bulk of
the data have a large influence on the sphere solution. Therefore a
tradeoff between the distance of an object to the sphere center and the
sphere volume has to be made. This is done by introducing so-called
slack variables
for each of the training objects.
The constraints of some objects may be slackened when the sphere volume
will decrease sufficiently by it. This can be formulated as follows:
such that
where
controls the tradeoff between the sphere volume and
the object distance.
The dual representation is remarkably similar to the original one:
such that
. The only difference is the upper bound on
,
that indicates that the influence of a single training object is bounded.
Objects for which
are called bounded support
vectors, and they are outside the sphere. Objects with
are the unbounded support vectors and are located on the
decision boundary.
To compute the final sphere radius
from the solution
one now has to take care that one has to compute
the distance from the center to an unbounded support vector.
Kernelized solution
The spherical description can be made more flexible by introducing
kernel functions, analogous to the Support Vector Machines. The inner
products
are replaced by a general kernel function
. When a Gaussian kernel is used (with an extra
free parameter
)
solutions ranging form a Parzen density estimation to the original
spherical description are obtained. Also a procedure for choosing the
appropriate value for s is given such that for all types of data a tight
description can be obtained.
Remarks
This tool can now answer other questions as well. It can solve the classification problems in which the different classes are very poorly balanced (or in which one of the classes is completely absent). This happens in medial applications where the population of normal, healthy people is far bigger than the abnormal population. It also opens the possibility to give an indication that a test set is sufficiently similar to the training set.
The support vector data description does not depend on a density estimation of the target data, it only covers an area where target objects appear. In most practical applications it is not only very hard to obtain a genuine sampling of the outlier objects, also the true object distribution of the target object is up to the practitioners ability and skill.
| Invited by: | Dr. Ke CHEN, School of Computer Science, The University of Manchester, U.K. |

