Rival penalized competitive learning
From Scholarpedia
| Lei Xu (2007), Scholarpedia, 2(8):1810. | revision #58907 [link to/cite this article] | |||||||||||||||||||
Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong
Rival penalized competitive learning (RPCL) is a development of competitive learning in help of an appropriate balance between two opposite mechanisms (namely a participating mechanism and a leaving mechanism), such that an appropriate number of agents or learners will be allocated to learn multiple structures underlying observations. Moreover, RPCL has been further extended into a general divide-conquer learning framework for multi-agents to divide and conquer a complicated task.
Classical competitive learning: from WTA to FSCL
Proposed decades ago, classical competitive learning (CCL) is usually used for making data clustering by a number of units or agents. Each agent is featured by a parametric point
that moves among the observation samples and seeks to be located at the center of a cluster of samples, such that several points collectively perform clustering analysis. As a sample
comes, each
evaluates a degree of its suitability on representing
by an individual value or criterion
, and competes to be allocated to represent
. The competition is guided globally by the winner-take-all (WTA) policy
- (1)
Then, each
is updated to further reduce
by
- (2)
When
are initialized as in Fig.1(a), as a sample
comes,
becomes the winner and then moves a little bit towards
. Next, as another sample
comes in Fig.1(b),
becomes the winner and then moves similarly. Finally,
and
each converge to one clusters' center as in Fig.1(c).
However, if
are initialized as in the top part of Fig.1(d),
always becomes the winner and finally moves the middle point between two clusters, while
never wins and becomes dead. This is known as the under-utilized or dead unit problem, featured by a WTA based competitive learning (Rumelhart et al, 1985; Grossberg, 1978; Hecht-Nielsen, 1987), which makes the task of clustering analysis badly performed. This problem comes from unequal chances for the dead agents to participate competition due to their unfavorable initial locations. Thus, it is desired that every agent should be initialized to have an equal chance to join competition, which is actually difficult to implement without a priori knowledge on the entire clustering structure, while such a knowledge is exactly the goal of this competitive learning.
One way to tackle this difficulty is a so called frequency sensitive competitive learning (FSCL) (Ahalt, et al, 1990), in help of modifying
into
- (3)
where
is the frequency that the
-th agent won in past. Illustrated in Fig.2 is an example of four clusters, the CCL resulted in three dead units in Fig.3, while the FSCL makes all the four units placed to the cluster's centers instead of becoming dead, as shown in Fig.4.
Such a sprit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al, 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in Kohonen network (Kohonen,1982).
Rival penalized competitive learning
The way of reducing the winning rates works when the number
of clusters is known. Unfortunately, it is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when
is given a smaller number, but also FSCL fails when
is pre-assigned to a value larger than an appropriate one. As illustrated in Fig.5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine
is a critical difficulty too.
One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990's for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992 , 1993). As illustrated in Fig.6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample
comes,
is the winner and
is its rival in Fig.6(a),
is moved towards
by a small step size, while
is repelled away from
by a much smaller step size. Similarly in Fig.6(b),
is the winner and moved towards
, while
is its rival and repelled away from
.
More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by
- (4)
where
approximately takes a number between
and
for controlling the penalizing strength.
As illustrated in Fig.6(c),
and
finally converge to clusters' centers, while
is driven far away from samples. Shown in Fig.7 is an experimental result that improves its FSCL counterpart in Fig.5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable nature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In the past decade, a number of applications of RPCL have been made (Acciani, et al, 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).
A general framework for divide-conquer learning
CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by one learning agent such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the four basic ingredients of this general framework.
(a) Agent's structure Each agent needs a hardware
that is able to handle one specific problem solving task. This
is usually featured by a pre-designed structure with a set
of parameters to be determined. The simplest example is that
is given by a point
. In general,
models or fits its environment, from which we get
as a reconstruction of
by
. Beyond a point
,
can also be a curve or shape as shown in Fig.8(a) for objection detection in a noise situation (Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally,
can also be other parametric structure too.
(b) Individual criterion
Each agent has an individual value
to evaluate how good this
is reconstructed by
. Considering the error
, it can be measured by one of the following two related choices:
- The errors are directly measured by certain norm, e.g., in a
norm
. For a simple case
, we have
and
.
- Observing that a larger error occurs with a smaller chance, an error may also be measured by its probability distribution, e.g.,
is measured by a Gaussian as shown in Fig.8(b). In general, we have
or
, which is further concisely rewritten into
. Plus a priori
for the chance of the corresponding agent to be involved, we get the following probabilistic distribution based measure:
- (5)
Type B can be regarded as a further extension of eq.(3).
(c) Global principle
Agents are coordinated globally to combine individual criteria
, in one of the following two manners:
- One is operational based, e.g., the simple ones by eq.(1) or eq.(4), and a further example will be discussed in the next section. Generally, we can denote an allocation scheme as follows:
- (6)
- The other is evaluation based, by a principle on the overall performance jointly by all the agents, which will be further discussed in the next section.
(d) Updating rule
As a further extension of eq.(2), agents are modified to adapt
according to the allocation by
, with
where
denotes an operator that acts on
and
to yield a direction along which
reduces within a neighborhood region of
. When
is differentiable,
can be given by
or
on
, where
has a positive projection on
.
In this framework, a conquering action is performed via a updating rule by eq.(), while a dividing action is performed via an allocation scheme by eq.(6), which are iteratively implemented per sample as shown in Fig.8(c). To get a useful iteration, the following three issues should be considered:
- The updating by Step (b) needs an appropriate control of a learning stepsize, which should be kept small enough.
- During the iteration, we should keep
that usually consists of three types of constraints. One is
, the second is the covariance matrix
that should be positive definite. One typical way is transforming them to be function of parameters that are free of constraints, e.g.,
and
(Xu, 2001). Moreover, there is also a constraint of a matrix within the Stiefel manifold
, for which one way bases on projection (Xu, 2002) while the other bases on differential geometry (Edelman, Arias,& Smith, 1998).
- In addition to estimating unknown parameters
, it also needs to select an appropriate number
of agents, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved, which highly depends on
. In the literatures of statistics and machine learning, the task of determining the number
belongs to what is usually called model selection.
Model selection and typical implementation
In general, model selection refers to select an appropriate one among a family of infinite many candidate structures
with each
in a same configuration but in different scales, each of which is labeled by a scale parameter
in term of one integer or a set of integers. Selecting an appropriate
means getting a structure consisting of an appropriate number of free parameters. At a given
, we expect that
best fits or represents a given set
of samples with a finite size
and determine
of all the unknown parameters within
, such that the error of using
to represent
becomes the least. For our problem discussed in previous section, we have
and
, while
is simply the number
.
However, by a typical best fitting principle (e.g., a maximum likelihood (ML) principle ), its fitting error monotonically decreases as
grows up, until the error reaches zero at
, a value that relates to
and is usually much bigger than an appropriate
. In this case, a large part of structure has actually been used to fit noises or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples not in
but from a same underlying regularity. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate
.
One major direction for tackling the problem is a two stage implementation. First, enumerate a candidate set of
and estimate the unknown set
of parameters by ML learning for a solution
at each
. Second, use a model selection criterion
to select a best
. In the literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), Minimum description length (MDL) (Rissanen, 1986, 2007), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, anyone of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as
increases, which makes the value of
evaluated unreliably.
The other direction is stepwise model selection either incremently or decrementally. Incremental algorithms attempt to keep as much as possible what already learned as
increases step by step, focusing on learning newly added parameters (Ghahramani & Beal, 2000; Salah and & Alpaydin, 2004). Such an incremental implementation can save computing costs, but usually leads to suboptimal performance because not only those newly added parameters but also the old parameter set
need to be re-learned but actually has been not updated. This suboptimal problem may be lessened by a decremental implementation that starts with
at a large value and decreases
step by step. At each step, one takes a subset out of
with the remaining parameter updated, and discard the subset with a biggest decreasing of
after trying a number of such subsets. Such a procedure can be formulated as a tree searching. The initial parameter set
is the root of the tree, and discarding one subset moves to one immediate descendent. A depth-first searching suffers from a suboptimal performance seriously, while a breadth-first searching suffers a huge combinatorial computing cost. Usually, a trade off between the two extremes is adopted.
Another direction of studies is called automatic model selection. One early exemplar is the effort on RPCL that is featured by driving an extra agent far away from data or equivalently pushing the proportion
that samples are allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu,1995, 2002) and systematically developed in the past decade, which leads to not only new model selection criteria but also algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).
In general, this automatic model selection direction demonstrates a quite difference nature from the usual incremental or decremental implementation that bases on evaluating the change
as a subset
of parameters is added or removed. Thus, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:
- There is an indicator
on a subset
, we have
, if
consists of parameters of a redundant structural part.
- In implementation of this learning algorithm or optimizing this learning principle, there is an mechanism that automatically drives
. Thus, the corresponding redundant structural part is effectively discarded.
Coordinated opposite mechanisms versus Ying Yang harmony
As illustrated in Fig.9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.
As mentioned early, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate. For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. It lacks of a mechanism for an incapable or extra agent to leave. It fails when the number of agents in consideration is larger than an appropriate one. As previously introduced, RPCL improve FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism, as
illustrated in Fig.9(a), leads to a favorable RPCL nature of automatic model selection, with the balance controlled by the de-learning rate
in eq.( 4).
Instead of using a global allocating policy such as eq.(), we can get a leaving mechanism via an appropriate local criterion too. Still using the WTA by eq.(1), we consider Type A in eq.(5) with a special Gaussian
, while Type B can be analyzed similarly. It follows that
varies with
as shown in Fig.9(b), where
is the dimension of
. If the
-th agent wins a lot of samples,
will become larger than a threshold, and not only
will increase but also the gap between different agents reduces. In other words, the competing ability of an agent that won too much in past is self-penalized and the competing ability of a weaker agent is compensated for a relative boosting, which effectively provides a participating mechanism similar to
in eq.(3). This similarity can also be observed from the term
in which
takes a role similar to
in eq.(3).
On the other hand, when the
-th agent won too fewer samples and thus its
drops below certain threshold, not only
will increase but also the corresponding gap between different agents increases. Therefore, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the
-th agent decreases, its
decreases and the other
will relatively increase. That is, the term
effectively enhances this self-penalizing featured leaving mechanism. As a result, the local criterion internally seeks an appropriate balance between its participating mechanism and leaving mechanism, with an automatic model selection nature.
Instead of at a bottom level via two opposite mechanisms, automatic model selection nature can also be observed in a top-down way with a global learning principle or theory that guides parameter learning and model selection. To seek a relationship between RPCL learning and Bayesian Ying Yang learning, we consider the BYY system in i.i.d. cases, as shown in Figure 6 of the Scholarpedia page Bayesian Ying Yang learning. Our problem is actually a special case that is simplified into the above Fig.9(c). Moreover, considering
a degenerated case of equations (12)&(14) with
collapsed to
and also
,from which we thus have
and
simplified as follows:
As discussed in that Scholarpedia page too,
along the gradient flow
also leads to a favorable nature of automatic model selection on
. Similar to the discussion after equation (10) there, the flow
becomes the one for the ML learning if
. In other words, BYY harmony learning differs from the ML learning by a relative fitness
. If
, updating goes along the same direction of the ML learning but with an increased strength. If
, i.e., the fitness is worse than the average and thus is doubtful, updating still goes along the same direction of the ML learning but with a reduced strength. When
, updating reverses to the opposite direction, i.e., becoming de-learning,
which becomes similar to RPCL. Further, there are two improvements. First, it avoids the difficulty of specifying an appropriate
that is required in RPCL. Second, the rival
is updated by either de-learning or learning instead of only de-learning by RPCL. Therefore, RPCL can be explained as a bottom level simplified approximation of the BYY harmony learning. In other words, the Ying Yang best harmony at the global level in Fig.9(c) has an inherent relation with the local balance of two mechanisms in Fig.9(a).
Even interestingly, letting
we find that it is just the one in eq.(5). Therefore, the above relation by eq.(7) actually provides the following new allocation scheme
- (8)
with which the iterative learning in Fig.8(c) actually implements a specific formulation of BYY harmony learning.
Let
, it follows from eq.(7) that we get the following
global principle:
- (9)
and the allocation scheme by eq.(8) also comes from minimizing this
along the gradient flow
. In other words, knowing a global level principle can guide us to obtain and improve an allocation scheme via modifying either the gradient flow or the structure of
. For examples, minimizing
with respect to
under the constraint
results in the WTA allocation by eq.(1). Also, we are lead to
- (10)
by minimizing
with respect to
under the constraint
As discussed after Figure 9 of the scholarpedia page Bayesian Ying Yang learning, letting every appearance of
replaced by
leads to an approximate implementation of BYY harmony learning that tends towards RPCL learning, especially at
.
Learning algorithms for Gaussian mixture, RBF nets, and mixture-of-experts
Since the coordinating nature between a participating mechanism and a leaving mechanism is actually featured via a combination of a specific individual criterion
and a specific instance of the global allocating scheme by eq.(6), here we categorize different combinations to get an outline on algorithms that fall in the proposed divide-conquer learning framework. Shown in Fig.10(a) is an outline of learning algorithms derived from Fig.8(c) on Gaussian mixture.
The first column is featured by the WTA allocation by eq.(1). At the bottom, each learning agent is simply a center point
and multiple agents jointly perform a task of cluster analysis by either the standard CL or the FS-CL with the previously discussed improvement. Moreover, each agent is extended to a general Gaussian
with a priori
. For Type A, multiple agents jointly describe a Gaussian mixture:
- (11)
It improves the standard CL not only by extending a spherical cluster to an elliptic cluster but also by getting the balancing mechanism as discussed in Fig.9(b). For Type B, multiple agents jointly describe a product Gaussian mixture:
- (12)
It improves FS-CL similarly as the above for Type A, and differs from Type A in that
takes a role similar to the one in eq.(3), which further enhances a participating mechanism.
Featured by eq.(4), the second column consists of the standard RPCL (Xu, Krzyzak, & Oja, 1992 , 1993) and its generalization (Xu, 1998a). Next, the last two columns lead to an adaptive version of the EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996) and Bayesian Ying Yang harmony learning not only on a Gaussian mixture by eq.(11) for the cases of Type A, which was already discussed after eq.(7) in the previous section, but also on product Gaussian mixture by eq.(12) for the cases of Type B.
One application of RPCL in its original papers (Xu, Krzyzak, & Oja, 1992 , 1993) was for learning a normalized radial basis function (RBF) networks in a two stage implementation, First, a clustering algorithm (e.g., k-means) in (Moody, & Darken, 1989) is replaced by RPCL with the number of clusters determined automatically. Second, the output layer is estimated by a least square linear fitting. It has been subsequently shown that a normalized RBF networks can be regarded as a simplified case of an alternative mixture of experts (ME):
- (13)
for which the maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995). Also, an alternative ME becomes an extended normalized radial basis function (RBF) networks at the following special case
- (14)
which degenerates to a normalized RBF networks (Moody, & Darken, 1989) simply with
. As a result, the conventional two stage but suboptimal implementation of a normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Moreover, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 1998b, 2001).
Similar to Fig.10(a), these learning algorithms can also be summarized under the general divide-conquer learning framework here, for which we further interpret the following points:
- Each agent also has a part
that makes an action
to the environment.
- Individual criterion also needs to evaluate the error
, for which we generalize the Type A case of eq.(5) into :
- (15)
With the above
replacing its previous appearances on Gaussian mixture, it follows from Fig.8(c) that similarly we get the learning algorithms for the alternative ME and normalized RBF networks, as outlined in Fig.10 (b).
Though we observe here and also in the previous section that the general divide-conquer learning framework leads to an approximate implementation of BYY harmony learning, the latter actually goes far beyond in both performances and coverage. First , it follows from equations (12)&(14) in the Scholarpedia page Bayesian Ying Yang learning that we generally have the term
to regularize the effect of a finite size of samples. Second, it goes beyond the coverage of the formulation by eq.(9) and eq.(8), not only in the above cases
but also in the cases to be further introduced in the next section.
Learning algorithms for subspace based functions
In many practices, there is only a finite size of training samples distributed in small dimensional subspaces, instead of scattering over all the dimensions of the observation space. These subspace structures can not be well captured by considering a basis
supported on the entire space of
. There are too many free parameters in
, which usually leads to poor performances. Instead, we consider a basis on a subspace as shown in Fig.(a), where observed samples are regarded as generated from a subspace with independent factors distributed along each coordinate of a
dimensional inner representation
.
Shown in Fig.11, we may let
in eq.(13) and eq.(14) to be replaced by
that considers
as generated from a lower dimensional subspace spanned by the columns of
, while the mapping to
is described by
based on this subspace also. Specifically, there are two typical choices:
- Type A is indicated by
, which corresponds to the previous ME by eq.(13) and RBF networks by eq.(14) with
for
directly while the gating net in eq.(13) and basis function in eq.(14) are supported on the subspace of
instead of the original space of
.
- Type B is indicated by
. It performs a mapping
from the lower dimension subspace. We seek a mapping
to get a cascade mapping
. From two Gaussians
and
, a choice for
is their posteriori inverse in a Bayesian sense, from which we get
by a Gaussian
as
in eq.(13) with
. Putting them into eq.(15), learning is made by those algorithms in Fig.10(b) again.
Correspondingly, we get two types of subspace based gating networks and subspace based functions (SBF). Type B further improves Type A as the mapping
acts as feature extraction, such that redundant parts are discarded.
To get the advantages expected as above, a key issue is how to determine an appropriate dimension
for each subspace, which is another model selection task. One way to tackle this problem is the two stage implementation based approaches introduced in a previous section. Under the framework of the formulation by eq.(9) and eq.(8), we can not get automatic model selection nature by those algorithms in Fig.10(b). Alternatively, we turn this subspace based model into a specific example of the BYY system given in Fig.6 of the Scholarpedia page Bayesian Ying Yang learning, by which an appropriate dimension
of each subspace will be determined automatically during implementing the algorithm for Bayesian Ying Yang harmony learning. This is however beyond the scope of the divide-conquer learning framework introduced in the previous section. For this task in Fig.11,
in equation (16) of the Scholarpedia page Bayesian Ying Yang learning has a formulation
from which we generally can not the allocation scheme by eq.(8) via
except some degenerated case that is able to be reformulated into a format
.
References
- Acciani, G., et al (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.
- Ahalt, SC., et al (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.
- Akaike, H (1974), "A new look at the statistical model identification", "IEEE Tr. Automatic Control" 19: 714-723.
- Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", "Psychometrika" { 53} (3): 407-415.
- Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", "Statistics & Probability Letters ", {33}: 201-208.
- Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.
- Edelman, A, Arias, TA, & Smith, ST, (1998), "The geometry of algorithms with orthogonality constraints", SIAM J. Matrix Anal. Appl., 20:303-353.
- Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
- Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.
- Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.
- Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.
- Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
- Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
- Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.
- Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.
- Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.
- Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.
- Nair, TM., et al, (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.
- Nakkrasae, S, & Sophatsathit, P (2004), "An rpcl-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.
- Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
- Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
- Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.
- Rivals, I & Personnaz, L (1999) "On Cross Validation for Model Selection", " Neural Computation" 11: 863-870.
- Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist., 22, 400-407.
- Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.
- Salah, AA & Alpaydin, E. (2004), "Incremental mixtures of factor analyzers", Proc. the 17th International Conference on Pattern Recognition, 23-26 Aug. 2004, Cambridge, UK, Vol.1, 276-279.
- Schwarz, G (1978), "Estimating the dimension of a model", "Annals of Statistics" 6: 461-464.
- Stone, M (1978), "Cross-validation: A review", "Math. Operat. Statist. " 9: 127-140.
- Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
- Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11, 185-194.
- Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
- Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", (in press), " Pattern Recognition", 40: 2129-2153.
- Xu, L (2007), "A Trend on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in "Studies in Computational Intelligence 63: Challenges to Computational Intelligence", Duch, W. and Mandziuk, J., 365-405.
- Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.
- Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.
- Xu, L (1998a), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.
- Xu, L (1998b), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
- Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
- Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
- Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.
- Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.
- Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.
- Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.
Internal links
- The subpage of /APPENDIX
- Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809.
- Teuvo Kohonen and Timo Honkela (2007) Kohonen network. Scholarpedia, 2(1):1568.
External links
Author's website (http://www.cse.cuhk.edu.hk/~lxu/)
See also
- Bayesian Ying Yang Learning
- Unsupervised Learning
- Data Clustering
- Competitive Learning
- Winner-Take-All
- Kohonen Network
- Mixtures of Experts
- Gaussian mixture
- Radial Basis Function Networks
| Lei Xu (2007) Rival penalized competitive learning. Scholarpedia, 2(8):1810, (go to the first approved version) Created: 31 July 2006, reviewed: 17 August 2007, accepted: 31 August 2007 |
| Action editor: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |




