Rival penalized competitive learning
From Scholarpedia
| Lei Xu (2007), Scholarpedia, 2(8):1810. | doi:10.4249/scholarpedia.1810 | revision #73360 [link to/cite this article] | |||||||||||||||||||
| Hosting and maintenance of this article is sponsored by Brain Corporation. |
Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong
Rival penalized competitive learning (RPCL) is a development of competitive learning in help of an appropriate balance between two opposite mechanisms (namely a participating mechanism and a leaving mechanism), such that an appropriate number of agents or learners will be allocated to learn multiple structures underlying observations. Moreover, RPCL has been further extended into a general divide-conquer learning framework for multi-agents to divide and conquer a complicated task.
Classical competitive learning: from WTA to FSCL
Proposed decades ago, classical competitive learning (CCL) is usually used for making data clustering by a number of units or agents. Each agent is featured by a parametric point
that moves among the observation samples and seeks to be located at the center of a cluster of samples, such that several points collectively perform clustering analysis. As a sample
comes, each
evaluates a degree of its suitability on representing
by an individual value or criterion
, and competes to be allocated to represent
. The competition is guided globally by the winner-take-all (WTA) policy
- (1)
Then, each
is updated to further reduce
by
- (2)
When
are initialized as in Fig.1(a), as a sample
comes,
becomes the winner and then moves a little bit towards
. Next, as another sample
comes in Fig.1(b),
becomes the winner and then moves similarly. Finally,
and
each converge to one clusters' center as in Fig.1(c).
However, if
are initialized as in the top part of Fig.1(d),
always becomes the winner and finally moves the middle point between two clusters, while
never wins and becomes dead. This is known as the under-utilized or dead unit problem, featured by a WTA based competitive learning (Rumelhart et al, 1985; Grossberg, 1978; Hecht-Nielsen, 1987), which makes the task of clustering analysis badly performed. This problem comes from unequal chances for the dead agents to participate competition due to their unfavorable initial locations. Thus, it is desired that every agent should be initialized to have an equal chance to join competition, which is actually difficult to implement without a priori knowledge on the entire clustering structure, while such a knowledge is exactly the goal of this competitive learning.
One way to tackle this difficulty is a so called frequency sensitive competitive learning (FSCL) (Ahalt, et al, 1990), in help of modifying
into
- (3)
where
is the frequency that the
-th agent won in past. Illustrated in Fig.2 is an example of four clusters, the CCL resulted in three dead units in Fig.3, while the FSCL makes all the four units placed to the cluster's centers instead of becoming dead, as shown in Fig.4.
Such a sprit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al, 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in Kohonen network (Kohonen,1982).
Rival penalized competitive learning
The way of reducing the winning rates works when the number
of clusters is known. Unfortunately, it is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when
is given a smaller number, but also FSCL fails when
is pre-assigned to a value larger than an appropriate one. As illustrated in Fig.5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine
is a critical difficulty too.
One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990's for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992 , 1993). As illustrated in Fig.6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample
comes,
is the winner and
is its rival in Fig.6(a),
is moved towards
by a small step size, while
is repelled away from
by a much smaller step size. Similarly in Fig.6(b),
is the winner and moved towards
, while
is its rival and repelled away from
.
More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by
- (4)
where
approximately takes a number between
and
for controlling the penalizing strength.
As illustrated in Fig.6(c),
and
finally converge to clusters' centers, while
is driven far away from samples. Shown in Fig.7 is an experimental result that improves its FSCL counterpart in Fig.5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable nature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In the past decade, a number of applications of RPCL have been made (Acciani, et al, 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).
A general framework for divide-conquer learning
CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by one learning agent such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the four basic ingredients of this general framework.
(a) Agent's structure Each agent needs a hardware
that is able to handle one specific problem solving task. This
is usually featured by a pre-designed structure with a set
of parameters to be determined. The simplest example is that
is given by a point
. In general,
models or fits its environment, from which we get
as a reconstruction of
by
. Beyond a point
,
can also be a curve or shape as shown in Fig.8(a) for objection detection in a noise situation (Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally,
can also be other parametric structure too.
(b) Individual criterion
Each agent has an individual value
to evaluate how good this
is reconstructed by
. Considering the error
, it can be measured by one of the following two related choices:
- The errors are directly measured by certain norm, e.g., in a
norm
. For a simple case
, we have
and
.
- Observing that a larger error occurs with a smaller chance, an error may also be measured by its probability distribution, e.g.,
is measured by a Gaussian as shown in Fig.8(b). In general, we have
or
, which is further concisely rewritten into
. Plus a priori
for the chance of the corresponding agent to be involved, we get the following probabilistic distribution based measure:
- (5)
Type B can be regarded as a further extension of eq.(3).
(c) Global principle
Agents are coordinated globally to combine individual criteria
, in one of the following two manners:
- One is operational based, e.g., the simple ones by eq.(1) or eq.(4), and a further example will be discussed in the next section. Generally, we can denote an allocation scheme as follows:
- (6)
- The other is evaluation based, by a principle on the overall performance jointly by all the agents, which will be further discussed in the next section.
(d) Updating rule
As a further extension of eq.(2), agents are modified to adapt
according to the allocation by
, with
- (7)
where
denotes an operator that acts on
and
to yield a direction along which
reduces within a neighborhood region of
. When
is differentiable,
can be given by
or
on
, where
has a positive projection on
.
In this framework, a conquering action is performed via a updating rule by eq.(7), while a dividing action is performed via an allocation scheme by eq.(6), which are iteratively implemented per sample as shown in Fig.8(c). To get a useful iteration, the following three issues should be considered:
- The updating by Step (b) needs an appropriate control of a small enough learning stepsize.
- During the iteration, we should keep
that usually consists of three types of constraints. One is
, the second is the covariance matrix
that should be positive definite. One typical way is transforming them to be function of parameters that are free of constraints, e.g.,
and
(Xu, 2001). Moreover, there is also a constraint of a matrix within the Stiefel manifold
, for which one way bases on projection (Xu, 2002) while the other bases on differential geometry (Edelman, Arias,& Smith, 1998).
- In addition to estimating unknown parameters
, it also needs to select an appropriate number
of agents, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved, which highly depends on
. In the literatures of statistics and machine learning, the task of determining the number
belongs to what is usually called model selection.
Model selection and typical implementation
In general, model selection refers to select an appropriate one among a family of infinite many candidate structures
with each
in a same configuration but in different scales, each of which is labeled by a scale parameter
in term of one integer or a set of integers. Selecting an appropriate
means getting a structure consisting of an appropriate number of free parameters. At a given
, we expect that
best fits or represents a given set
of samples with a finite size
and determine
of all the unknown parameters within
, such that the error of using
to represent
becomes the least. For our problem discussed in previous section, we have
and
, while
is simply the number
.
However, by a typical best fitting principle (e.g., a maximum likelihood (ML) principle ), its fitting error monotonically decreases as
grows up, until the error reaches zero at
, a value that relates to
and is usually much bigger than an appropriate
. In this case, a large part of structure has actually been used to fit noises or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples not in
but from a same underlying regularity. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate
.
One major direction for tackling the problem is a two stage implementation. First, enumerate a candidate set of
and estimate the unknown set
of parameters by ML learning for a solution
at each
. Second, use a model selection criterion
to select a best
. In the literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), Minimum description length (MDL) (Rissanen, 1986, 2007), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, anyone of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as
increases, which makes the value of
evaluated unreliably.
The other direction is stepwise model selection either incrementally or decrementally. Incremental algorithms attempt to keep as much as possible what already learned as
increases step by step, focusing on learning newly added parameters (Ghahramani & Beal, 2000; Salah and & Alpaydin, 2004). Such an incremental implementation can save computing costs, but usually leads to suboptimal performance because not only those newly added parameters but also the old parameter set
need to be re-learned but actually has been not updated. This suboptimal problem may be lessened by a decremental implementation that starts with
at a large value and decreases
step by step. At each step, one takes a subset out of
with the remaining parameter updated, and discard the subset with a biggest decreasing of
after trying a number of such subsets. Such a procedure can be formulated as a tree searching. The initial parameter set
is the root of the tree, and discarding one subset moves to one immediate descendant. A depth-first searching suffers from a suboptimal performance seriously, while a breadth-first searching suffers a huge combinatorial computing cost. Usually, a trade off between the two extremes is adopted.
Another direction of studies is called automatic model selection. One early exemplar is the effort on RPCL that is featured by driving an extra agent far away from data or equivalently pushing the proportion
that samples are allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu,1995, 2002) and systematically developed in the past decade, which leads to not only new model selection criteria but also algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).
In general, this automatic model selection demonstrates a quite difference nature from the usual incremental or decremental implementation that bases on evaluating the change
as a subset
of parameters is added or removed. Thus, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:
- There is an indicator
on a subset
, we have
, if
consists of parameters of a redundant structural part.
- In implementation of this learning algorithm or optimizing this learning principle, there is an mechanism that automatically drives
. Thus, the corresponding redundant structural part is effectively discarded.
Coordinated opposite mechanisms versus Ying Yang harmony
As illustrated in Fig.9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.
As mentioned early, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate. For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. However, it lacks of a mechanism for an incapable or extra agent to leave. It fails when the number of agents in consideration is larger than an appropriate one. As previously introduced, RPCL improves FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism, as
illustrated in Fig.9(a), leads to a favorable RPCL nature of automatic model selection, with the balance controlled by the de-learning rate
in eq.( 4).
Instead of using a global allocating policy such as eq,(6), we may also get a leaving mechanism via an appropriate local criterion too. Still using the WTA by eq.(1), we consider Type A in eq.(5) with a special Gaussian
, while Type B can be analyzed similarly.
It follows that
varies with
as shown in Fig.9(b), where
is the dimension of
. If the
-th agent wins a lot of samples,
will become larger than a threshold, and not only
will increase but also the gap between different agents reduces. In other words, the competing ability of an agent that won too much in past is self-penalized and the competing ability of a weaker agent is compensated for a relative boosting, which effectively provides a participating mechanism similar to
in eq.(3). This similarity can also be observed from the term
in which
takes a role similar to
in eq.(3).
On the other hand, when the
-th agent won too fewer samples and thus its
drops below certain threshold, not only
will increase but also the corresponding gap between different agents increases. Therefore, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the
-th agent decreases, its
decreases and the other
will relatively increase. That is, the term
effectively enhances this self-penalizing featured leaving mechanism. As a result, the local criterion internally seeks an appropriate balance between its participating mechanism and leaving mechanism, with an automatic model selection nature.
Instead of at a bottom level via two opposite mechanisms, automatic model selection nature can also be observed in a top-down way with a global learning principle or theory that guides parameter learning and model selection. We consider the BYY system in i.i.d. cases, as shown in Figure 7 of the Scholarpedia page Bayesian Ying Yang learning. Our problem is actually a special case that is simplified into the above Fig.9(c). Moreover, considering
a degenerated case with
collapsed to
and
simplified as follows:
where
is a priori distribution of parameters and is ingored when there is no priori knowledge. Moreover,
consists of
labels or indices of agents that enter to represent
. One typical example is the Climax neighborhood policy for simplifying the Yang machine by only considering its climax and neighbors given a sample
. The rationale is that a Yang distribution conditioning on
is more reliably described by its apex realm that actually takes a major role in a harmony of the corresponding Ying part. Details are referred to the Scholarpedia page Bayesian Ying Yang learning, especially eqn.(3), eqn.(6), Fig.7, and Fig.8 there. Specifically, we observe the following three key points:
- For eq.(8) here , the Yang machine is simply
with
being a discrete label, and
consists of the
-Climax neighbors (CN) as follows
- The structure of
is designed under a uncertainty range that is restricted to perserve ones of the Ying machine
as follows:
- (10)
- It further leads to different specific forms of
via maximizing
. The following are three typical examples:
Moreover, we observe an implementation of
along the gradient flow:
from which we see a favorable nature of automatic model selection on
. The flow
becomes the one for the ML learning if
. The BYY harmony learning differs from the ML learning by a relative fitness
. If
, updating goes along the same direction of the ML learning but with an increased strength. If
, updating still goes along the ML learning direction but with a reduced strength. When
, updating reverses to the opposite direction, i.e., becoming de-learning,
which becomes similar to RPCL. This can be observed more clearly at a special case of either the choice (a) with
or the choice (b) with
, which leads to
- (13)
Moreover, there are two improvements. First, it avoids the difficulty of specifying an appropriate
that is required in RPCL. Second, either the winner or the rival may be updated by de-learning as
or learning as
instead of only de-learning by RPCL. Therefore, RPCL can be explained as a bottom level simplified approximation of the BYY harmony learning. An even closer link to RPCL is given by the choice (a) with
in eq.(11), by which we simply have
In a summary, the Ying Yang best harmony at the global level in Fig.9(c) has an inherent relation with the local balance of two mechanisms in Fig.9(a). In addition, it is also interesting to notice the choice (b) with
that all the candidates in
are considered instead of only the first one since none of them can be regarded to suit the current sample
.
Moreover,
actually provides a new allocation scheme for the iterative learning in Fig.8(c) with
, which implements a specific formulation of BYY harmony learning. Furthermore, let
, it follows from eq.(8) that we get the following
global principle:
- (15)
with
obtained in a same way as eq.(11). In a degenerated case, minimizing
with respect to
under the constraint
returns to the WTA allocation by eq.(1).
Learning algorithms for GMM, RBF nets, and mixture-of-experts
Since the coordinating nature between a participating mechanism and a leaving mechanism is actually featured via a combination of a specific individual criterion
and a specific instance of the global allocating scheme by eq.(6), here we categorize different combinations to get an outline on algorithms that fall in the proposed divide-conquer learning framework. Shown in Fig.10(a) is an outline of learning algorithms derived from Fig.8(c) on Gaussian mixture model (GMM).
The first column is featured by the WTA allocation by eq.(1). At the bottom, each learning agent is simply a center point
and multiple agents jointly perform a task of cluster analysis by either the standard CL or the FS-CL with the previously discussed improvement. Moreover, each agent is extended to a general Gaussian
with a priori
. For Type A, multiple agents jointly describe a Gaussian mixture:
- (16)
It improves the standard CL not only by extending a spherical cluster to an elliptic cluster but also by getting the balancing mechanism as discussed in Fig.9(b). For Type B, multiple agents jointly describe a product Gaussian mixture:
- (17)
It improves FS-CL similarly as the above for Type A, and differs from Type A in that
takes a role similar to the one in eq.(3), which further enhances a participating mechanism.
Featured by eq.(4), the second column consists of the standard RPCL (Xu, Krzyzak, & Oja, 1992 , 1993) and its generalization (Xu, 1998a). Next, the last two columns lead to an adaptive version of the EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996) and Bayesian Ying Yang harmony learning not only on a Gaussian mixture by eq.(16) for the cases of Type A, which was already discussed after eq.(8) in the previous section, but also on product Gaussian mixture by eq.(17) for the cases of Type B.
In addition to Gaussian mixture, learning algorithms in Fig.10(a) is also applicable to the cases of supervised learning, i.e., knowing each pairing
, by simply letting the winner
for RPCL learning and
for BYY harmony learning with
by eq.(11) in the choice (a) with
. Another choice for supervised learning is using a normalized radial basis function (RBF) networks. Actually, in its original papers (Xu, Krzyzak, & Oja, 1992 , 1993) one application of RPCL was for learning RBF nets in a two stage implementation. First, a clustering algorithm (e.g., k-means) in (Moody, & Darken, 1989) is replaced by RPCL with the number of clusters determined automatically. Second, the output layer is estimated by a least square linear fitting. It has been subsequently shown that a normalized RBF networks can be regarded as a simplified case of an alternative mixture of experts (ME):
- (18)
for which the maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995). Also, an alternative ME becomes an extended normalized radial basis function (RBF) networks at the following special case
- (19)
which degenerates to a normalized RBF networks (Moody, & Darken, 1989) simply with
. As a result, the conventional two stage but suboptimal implementation of a normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Moreover, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 1998b, 2001).
Similar to Fig.10(a), these learning algorithms can also be summarized under the general divide-conquer learning framework here, for which we further interpret the following points:
- Each agent also has a part
that makes an action
to the environment.
- Individual criterion also needs to evaluate the error
, for which we generalize the Type A case of eq.(5) into :
- (20)
With the above
replacing its previous appearances on Gaussian mixture, it follows from Fig.8(c) that similarly we get the learning algorithms for the alternative ME and normalized RBF networks, as outlined in Fig.10 (b). Details are referred to (Xu, 2009).
In many practices, there is only a finite size of training samples distributed in small dimensional subspaces, instead of scattering over all the dimensions of the observation space. These subspace structures can not be well captured by considering a basis
supported on the entire space of
. There are too many free parameters in
, which usually leads to poor performances. Instead, we can further consider that observed samples are regarded as generated from a subspace with independent factors distributed along each coordinate of a
dimensional inner representation
. Readers are referred to (Xu, 2009) for details and to the subpage /APPENDIX (a) subspace based functions.
Grouped agents, k-CN majority voting, and Ying Yang Alternation
Also in many practices, the uncertainty structure of each agent
is beyond a Gaussian structure. One choice is considering eq.(20) by choosing specific
and
if we know their forms in advance. A general way is considering each
to be described by a mixture of multiple components, with each one in a simple distribution. Typically, we consider a mixture of multiple Gaussian components, that is, each agent consists of
in a Gaussian mixture by eq.(16) and
by eq.(18). Equivalently, we consider
groups of agents with each group consisting of a number of agents in Gaussian structures, which can be made by simply letting the previously discussed single label
or
(e.g., in eq.(16) and eq.(18) as well as eq.(8) ) to be replaced by a double labeling
or
, with
added for labeling Gaussian components within each group. Taking eq.(16) as an example, in sequel we consider its further details, while the cases with
by eq.(18) can be considered similarly.
We consider multiple groups of Gaussian components
with their corresponding priors
. That is, the entire data set is considered by a two level mixture of Gaussians
- (21)
Still, the algorithms in Fig.10(a) can be used to learn their parameters
via simply replacing every
single label
or
with a double labeling
or
, plus a modification on getting
through the following k-CN based majority rule:
- (22)
where
denotes the cardinality of the set
. If knowing that
comes from the
-th group, we simply let
, plus specifically considering
by eq.(11) in the choice (a) with
,
to implement BYY harmony learning. As introduced before eq.(9),
-CN policy focuses on these reliable contributions to
for a discrimination of
. Using a majority voting to get
further considers these reliable contributions based on their ranks instead of their probability values.
To be more specific, it follows from the the gradient flow by eq.(12) that we can implement
via the following Ying-Yang alternation:
- Yang step: replacing every single label
or
by a double labeling
or
, we get
(1)
by eq.(22) and
by eq.(10) with
and
.
(2)
by eq.(11) and
by eq.(12) or eq.(14), from which we get
.
(3) update the data smoothing parameter by
where
is the average gap between samples under a smoothing parameter
. Given a sample set,
needs to be computed only once and stored as a table for a quick access at each updating of eq.(23). Further details are referred to (Xu, 2007).
- Ying step: update parameters by one of the following three choices:
(a) An EM-like updating
(b) A gradient ascent updating
(c) A constraint ensured gradient ascent updating
The above Ying-Yang alternation is repeated until the iteration converges, during which automatic model selection is made as one
with its corresponding Gaussian component discarded.
Remarks:
- The above choice (a) is ensured to make updating along the direction to increase
. However, its stepsize may sometimes overshot. To avoid this, an extra check is needed on whether
is increased after updating. If not, updating can be made by the Choice (b) with a recuded stepsize
. Alternative, we may also make updating by the choice (b) with a small enough stepsize
, without checking whether
is increased. The choice (b) becomes equivalent to the choice (a) when
. Still, the updatings can not guarantee the requirement
and
remains to be positive definite, which is instead guaranteed by the choice (c).
- Though the above discussed is learning in a batch way. It directly applies to adaptive learning by setting
in eq.(24). It also covers a spectrum that
varies from
to the entire size of samples.
- We can either get
by eq.(21) or
. Moreover, as a new sample
comes we may classify
by the k-NN rule given by the choice (b) in eq.(22).
- It degenerates to the BYY harmony learning for a Gaussian mixture when
takes only one value for every
.
- To simplify computation, we may ignore eq.(23) by fixing
or a small real number.
References
- Acciani, G., et al (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.
- Ahalt, SC., et al (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.
- Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.
- Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53(3): 407-415.
- Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.
- Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.
- Edelman, A, Arias, TA, & Smith, ST, (1998), "The geometry of algorithms with orthogonality constraints", SIAM J. Matrix Anal. Appl., 20:303-353.
- Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
- Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.
- Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.
- Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.
- Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
- Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
- Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.
- Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.
- Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.
- Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.
- Nair, TM., et al, (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.
- Nakkrasae, S, & Sophatsathit, P (2004), "An rpcl-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.
- Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
- Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
- Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.
- Rivals, I & Personnaz, L (1999) "On Cross Validation for Model Selection", Neural Computation 11: 863-870.
- Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist. 22:400-407.
- Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.
- Salah, AA & Alpaydin, E. (2004), "Incremental mixtures of factor analyzers", Proc. the 17th International Conference on Pattern Recognition, 23-26 Aug. 2004, Cambridge, UK, Vol.1, 276-279.
- Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.
- Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.
- Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
- Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11: 185-194.
- Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
- Xu, L (2009), "Learning Algorithms for RBF Functions and Subspace Based Functions", Ch.3 in "Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques", eds. by Olivas, Guerrero, Sober, Benedito, & Lopez, IGI Global publication, pp60-94.
- Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition 40: 2129-2153.
- Xu, L (2007), "A Trend on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in "Studies in Computational Intelligence 63: Challenges to Computational Intelligence", Duch, W. and Mandziuk, J., 365-405.
- Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.
- Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.
- Xu, L (1998a), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.
- Xu, L (1998b), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
- Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
- Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
- Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.
- Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.
- Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.
- Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.
Internal links
- The subpage of /APPENDIX
- Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809.
- Teuvo Kohonen and Timo Honkela (2007) Kohonen network. Scholarpedia, 2(1):1568.
External links
Author's website (http://www.cse.cuhk.edu.hk/~lxu/)
See also
- Bayesian Ying Yang Learning
- Unsupervised Learning
- Data Clustering
- Competitive Learning
- Winner-Take-All
- Kohonen Network
- Mixtures of Experts
- Gaussian mixture
- Radial Basis Function Networks
| Lei Xu (2007) Rival penalized competitive learning. Scholarpedia, 2(8):1810, (go to the first approved version) Created: 31 July 2006, reviewed: 17 August 2007, accepted: 31 August 2007 |
| Action editor: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |




