Rival penalized competitive learning

From Scholarpedia

Lei Xu (2007), Scholarpedia, 2(8):1810. doi:10.4249/scholarpedia.1810 revision #73360 [link to/cite this article]

Hosting and maintenance of this article is sponsored by Brain Corporation.

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong

Rival penalized competitive learning (RPCL) is a development of competitive learning in help of an appropriate balance between two opposite mechanisms (namely a participating mechanism and a leaving mechanism), such that an appropriate number of agents or learners will be allocated to learn multiple structures underlying observations. Moreover, RPCL has been further extended into a general divide-conquer learning framework for multi-agents to divide and conquer a complicated task.


Contents

Classical competitive learning: from WTA to FSCL

Figure 1: Competitive learning for clustering analysis and dead unit problem.
Enlarge
Figure 1: Competitive learning for clustering analysis and dead unit problem.


Figure 2:   Samples of four clusters.
Enlarge
Figure 2: Samples of four clusters.

Proposed decades ago, classical competitive learning (CCL) is usually used for making data clustering by a number of units or agents. Each agent is featured by a parametric point \theta_j=m_j that moves among the observation samples and seeks to be located at the center of a cluster of samples, such that several points collectively perform clustering analysis. As a sample x_t comes, each m_j evaluates a degree of its suitability on representing x_t by an individual value or criterion \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2, and competes to be allocated to represent x_t. The competition is guided globally by the winner-take-all (WTA) policy

Figure 3:    Three dead units by CCL (double click it to see animation).
Enlarge
Figure 3: Three dead units by CCL (double click it to see animation).
(1)
p_{ j, t}= \begin{cases} 1, & \mbox{if}\quad j=c, \\  0, & \mbox{otherwise;}\end{cases} \qquad c=arg \ min_{ j} \varepsilon_t(\theta_j).

Then, each m_j is updated to further reduce \varepsilon_t(\theta_j) by

(2)
m_j^{new}= m_j^{old}+ \eta p_{ j, t} (x_t- m_j^{old}).


When m_1, m_2 are initialized as in Fig.1(a), as a sample x comes, m_1 becomes the winner and then moves a little bit towards x. Next, as another sample x comes in Fig.1(b), m_2 becomes the winner and then moves similarly. Finally, m_1 and m_2 each converge to one clusters' center as in Fig.1(c).

Figure 4:   FSCL makes all the four units placed (double click it to see animation).
Enlarge
Figure 4: FSCL makes all the four units placed (double click it to see animation).


However, if m_1, m_2 are initialized as in the top part of Fig.1(d), m_2 always becomes the winner and finally moves the middle point between two clusters, while m_1 never wins and becomes dead. This is known as the under-utilized or dead unit problem, featured by a WTA based competitive learning (Rumelhart et al, 1985; Grossberg, 1978; Hecht-Nielsen, 1987), which makes the task of clustering analysis badly performed. This problem comes from unequal chances for the dead agents to participate competition due to their unfavorable initial locations. Thus, it is desired that every agent should be initialized to have an equal chance to join competition, which is actually difficult to implement without a priori knowledge on the entire clustering structure, while such a knowledge is exactly the goal of this competitive learning.

Rival penalized competitive learning
Enlarge
Figure 5: When k is pre-assigned to 5. the frequency sensitive mechanism also brings the extra one into data to disturb the correct locations of others (double click it to see animation).

One way to tackle this difficulty is a so called frequency sensitive competitive learning (FSCL) (Ahalt, et al, 1990), in help of modifying \varepsilon_t(\theta_j) into

(3)
\varepsilon_t(\theta_j)=\alpha_j \Vert x_t- m_j \Vert^2,

where \alpha_j is the frequency that the j-th agent won in past. Illustrated in Fig.2 is an example of four clusters, the CCL resulted in three dead units in Fig.3, while the FSCL makes all the four units placed to the cluster's centers instead of becoming dead, as shown in Fig.4.


Such a sprit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al, 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in Kohonen network (Kohonen,1982).


Rival penalized competitive learning

Figure 6:  RPCL and automatic allocation.
Enlarge
Figure 6: RPCL and automatic allocation.

The way of reducing the winning rates works when the number k of clusters is known. Unfortunately, it is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when k is given a smaller number, but also FSCL fails when k is pre-assigned to a value larger than an appropriate one. As illustrated in Fig.5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine k is a critical difficulty too.


Figure 7:  Rival penalized mechanism makes extra agents driven far away(double click it to see animation).
Enlarge
Figure 7: Rival penalized mechanism makes extra agents driven far away(double click it to see animation).


One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990's for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992 , 1993). As illustrated in Fig.6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample x comes, m_1 is the winner and m_2 is its rival in Fig.6(a), m_1 is moved towards x by a small step size, while m_2 is repelled away from x by a much smaller step size. Similarly in Fig.6(b), m_3 is the winner and moved towards x, while m_2 is its rival and repelled away from x.

More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by

(4)
p_{ j, t}= \begin{cases} 1, & \mbox{if} \, j=c, \\  -\gamma, & \mbox{if}\,  j= r, \\ 0, & \mbox{otherwise}, \end{cases} \begin{cases} c=arg \ min_{ j} \varepsilon_t(\theta_j), \\  r=arg \ min_{ j \ne c } \varepsilon_t(\theta_j),\end{cases}

where \gamma approximately takes a number between 0.05 and 0.1 for controlling the penalizing strength.

As illustrated in Fig.6(c), m_1 and m_3 finally converge to clusters' centers, while m_2 is driven far away from samples. Shown in Fig.7 is an experimental result that improves its FSCL counterpart in Fig.5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable nature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In the past decade, a number of applications of RPCL have been made (Acciani, et al, 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).


A general framework for divide-conquer learning

CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by one learning agent such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the four basic ingredients of this general framework.

Figure 8:   (a) Ellipse fitting, (b) Gaussian distribution, (c) a two step iterative learning
Enlarge
Figure 8: (a) Ellipse fitting, (b) Gaussian distribution, (c) a two step iterative learning

(a) Agent's structure Each agent needs a hardware M_j that is able to handle one specific problem solving task. This M_j is usually featured by a pre-designed structure with a set \theta_j of parameters to be determined. The simplest example is that M_j is given by a point \theta_j=m_j. In general, M_j models or fits its environment, from which we get x_t(\theta_j) as a reconstruction of x_t by M_j. Beyond a point m_j, M_j can also be a curve or shape as shown in Fig.8(a) for objection detection in a noise situation (Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally, M_j(\theta_j) can also be other parametric structure too.

(b) Individual criterion\quad Each agent has an individual value \varepsilon_t(\theta_j) to evaluate how good this x_t(\theta_j) is reconstructed by M_j. Considering the error \ e_t (\theta_j )=x_t- x_t(\theta_j), it can be measured by one of the following two related choices:

  • The errors are directly measured by certain norm, e.g., in a L_2 norm \varepsilon_t(\theta_j)=\Vert e_t (\theta_j )\Vert^2. For a simple case \!\theta_j=m_j, we have \!x_t(\theta_j)=m_j and \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2.
  • Observing that a larger error occurs with a smaller chance, an error may also be measured by its probability distribution, e.g., x - m_j is measured by a Gaussian as shown in Fig.8(b). In general, we have \!q(\varepsilon(\theta_j)| \theta_j) or q(x | x (\theta_j), \theta_j), which is further concisely rewritten into q(x |\theta_j). Plus a priori \alpha_j for the chance of the corresponding agent to be involved, we get the following probabilistic distribution based measure:
(5)
\begin{array}{l} \varepsilon_t(\theta_j)=-\begin{cases}  \ln{[\alpha_j q(x_t|\theta_j)]},  & (a)\, Type\, A, \\   \ln{ q(x_t|\theta_j)^{\alpha_j} } -\ln  \mathcal C(  \theta), & (b)\, Type\, B. \end{cases}\\  C(\theta)= \int \prod_j q(x|\theta_j)^{\alpha_j}dx\approx \sum_t \prod_j{ q(x_t|\theta_j)^{\alpha_j}}.    \end{array}

Type B can be regarded as a further extension of eq.(3).

(c) Global principle\quad Agents are coordinated globally to combine individual criteria \{ \varepsilon_t(\theta_j)\}_{j=1}^k, in one of the following two manners:

  • One is operational based, e.g., the simple ones by eq.(1) or eq.(4), and a further example will be discussed in the next section. Generally, we can denote an allocation scheme as follows:
(6)
p_{ j, t}= ALOC[\{{\varepsilon_t(\theta_j) }\}_{j=1}^k]
  • The other is evaluation based, by a principle on the overall performance jointly by all the agents, which will be further discussed in the next section.

(d) Updating rule\quad As a further extension of eq.(2), agents are modified to adapt x_t according to the allocation by p_{ j, t}\delta \theta_j, with

(7)
\delta \theta_j=Reduce(\varepsilon_t(\theta_j^{old})),

where Reduce(\varepsilon_t(\theta_j)) denotes an operator that acts on \theta_j^{old} and x_t to yield a direction along which \varepsilon_t(\theta_j) reduces within a neighborhood region of \theta_j^{old}. When \varepsilon_t(\theta_j) is differentiable, Reduce(\varepsilon_t(\theta_j)) can be given by -\nabla_{  \theta_j}\varepsilon_t(\theta_j) or -{\mathcal P}\nabla_{  \theta_j}\varepsilon_t(\theta_j) on D(\theta_j), where -{\mathcal P}\nabla_{  \theta_j}\varepsilon_t(\theta_j) has a positive projection on -\nabla_{  \theta_j}\varepsilon_t(\theta_j).

In this framework, a conquering action is performed via a updating rule by eq.(7), while a dividing action is performed via an allocation scheme by eq.(6), which are iteratively implemented per sample as shown in Fig.8(c). To get a useful iteration, the following three issues should be considered:

  • The updating by Step (b) needs an appropriate control of a small enough learning stepsize.
  • During the iteration, we should keep \begin{array}{l}\theta_j^{new}\in D(\theta_j) \end{array} that usually consists of three types of constraints. One is \begin{array}{l}\alpha_j>0, \ \sum_j \alpha_j=1 \end{array}, the second is the covariance matrix \Sigma_j that should be positive definite. One typical way is transforming them to be function of parameters that are free of constraints, e.g., \begin{array}{l}\alpha_j=exp(c_j)/\sum_j exp(c_j)\end{array} and \begin{array}{l}\Sigma_j=S_jS_j^T\end{array} (Xu, 2001). Moreover, there is also a constraint of a matrix within the Stiefel manifold \begin{array}{l}UU^T=I\end{array}, for which one way bases on projection (Xu, 2002) while the other bases on differential geometry (Edelman, Arias,& Smith, 1998).
  • In addition to estimating unknown parameters \begin{array}{l}\{ \theta_j \}_{j=1}^k\end{array}, it also needs to select an appropriate number k of agents, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved, which highly depends on p_{ j, t}. In the literatures of statistics and machine learning, the task of determining the number k belongs to what is usually called model selection.


Model selection and typical implementation

In general, model selection refers to select an appropriate one among a family of infinite many candidate structures \{S_{\mathbf k}( \theta)\} with each S_{\mathbf k} in a same configuration but in different scales, each of which is labeled by a scale parameter {\mathbf  k} in term of one integer or a set of integers. Selecting an appropriate {\mathbf  k} means getting a structure consisting of an appropriate number of free parameters. At a given {\mathbf k}, we expect that S_{\mathbf k} best fits or represents a given set \Chi_N of samples with a finite size N and determine \theta_{\mathbf k} of all the unknown parameters within S_{\mathbf k}, such that the error of using S_{\mathbf k}(\theta^*_{\mathbf k} ) to represent \Chi_N becomes the least. For our problem discussed in previous section, we have S_{\mathbf k}=\{M_{j}\}_{j=1}^k and \theta_{\mathbf k}=\{\theta_{j}\}_{j=1}^k, while {\mathbf  k} is simply the number {\mathbf k}.

However, by a typical best fitting principle (e.g., a maximum likelihood (ML) principle ), its fitting error monotonically decreases as {\mathbf k} grows up, until the error reaches zero at {\mathbf k}={\mathbf k}_N, a value that relates to N and is usually much bigger than an appropriate {\mathbf k}^*. In this case, a large part of structure has actually been used to fit noises or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples not in \Chi_N but from a same underlying regularity. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate k^*.

One major direction for tackling the problem is a two stage implementation. First, enumerate a candidate set of {\mathbf  k} and estimate the unknown set \theta_{\mathbf k} of parameters by ML learning for a solution \theta^*_{\mathbf k} at each {\mathbf  k} . Second, use a model selection criterion J(\theta^*_{\mathbf k}) to select a best {\mathbf  k}^* . In the literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), Minimum description length (MDL) (Rissanen, 1986, 2007), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, anyone of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as {\mathbf  k} increases, which makes the value of J(\theta^*_{\mathbf k}) evaluated unreliably.

The other direction is stepwise model selection either incrementally or decrementally. Incremental algorithms attempt to keep as much as possible what already learned as {\mathbf  k} increases step by step, focusing on learning newly added parameters (Ghahramani & Beal, 2000; Salah and & Alpaydin, 2004). Such an incremental implementation can save computing costs, but usually leads to suboptimal performance because not only those newly added parameters but also the old parameter set \theta_{\mathbf k} need to be re-learned but actually has been not updated. This suboptimal problem may be lessened by a decremental implementation that starts with {\mathbf  k} at a large value and decreases {\mathbf  k} step by step. At each step, one takes a subset out of \theta_{\mathbf k} with the remaining parameter updated, and discard the subset with a biggest decreasing of J(\theta^*_{\mathbf k}) after trying a number of such subsets. Such a procedure can be formulated as a tree searching. The initial parameter set \theta_{\mathbf k} is the root of the tree, and discarding one subset moves to one immediate descendant. A depth-first searching suffers from a suboptimal performance seriously, while a breadth-first searching suffers a huge combinatorial computing cost. Usually, a trade off between the two extremes is adopted.

Another direction of studies is called automatic model selection. One early exemplar is the effort on RPCL that is featured by driving an extra agent far away from data or equivalently pushing the proportion \alpha_j that samples are allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu,1995, 2002) and systematically developed in the past decade, which leads to not only new model selection criteria but also algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).

In general, this automatic model selection demonstrates a quite difference nature from the usual incremental or decremental implementation that bases on evaluating the change J(\theta^*_{\mathbf k})- J(\theta^*_{\mathbf k}\cup \theta_{\delta}) as a subset \theta_{\delta} of parameters is added or removed. Thus, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:

  • There is an indicator \rho(\theta_e) on a subset \theta_e \in \theta_{\mathbf k}, we have \rho(\theta_e)=0, if \theta_e consists of parameters of a redundant structural part.
  • In implementation of this learning algorithm or optimizing this learning principle, there is an mechanism that automatically drives \rho(\theta_e)\to0. Thus, the corresponding redundant structural part is effectively discarded.


Coordinated opposite mechanisms versus Ying Yang harmony

As illustrated in Fig.9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.

Figure 9:   (a) coordinated mechanisms, (b)  rewarding vs penalizing via variances.
Enlarge
Figure 9: (a) coordinated mechanisms, (b) rewarding vs penalizing via variances.

As mentioned early, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate. For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. However, it lacks of a mechanism for an incapable or extra agent to leave. It fails when the number of agents in consideration is larger than an appropriate one. As previously introduced, RPCL improves FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism, as illustrated in Fig.9(a), leads to a favorable RPCL nature of automatic model selection, with the balance controlled by the de-learning rate \gamma in eq.( 4).

Instead of using a global allocating policy such as eq,(6), we may also get a leaving mechanism via an appropriate local criterion too. Still using the WTA by eq.(1), we consider Type A in eq.(5) with a special Gaussian p(x| \theta_j)=G(x|m_j, \sigma_j^2I), while Type B can be analyzed similarly.

It follows that \begin{array}{l}\varepsilon_t(\theta_j)=-\ln{\alpha_j }+ 0.5 [{\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}\end{array} +d\ln{(2\pi \sigma_j^2)}] varies with \sigma_j^{2} as shown in Fig.9(b), where d is the dimension of x. If the j-th agent wins a lot of samples, \sigma_j^{2} will become larger than a threshold, and not only \varepsilon_t(\theta_j) will increase but also the gap between different agents reduces. In other words, the competing ability of an agent that won too much in past is self-penalized and the competing ability of a weaker agent is compensated for a relative boosting, which effectively provides a participating mechanism similar to \alpha_j in eq.(3). This similarity can also be observed from the term 0.5  {\Vert x_t-m_j\Vert^2 / \sigma_j^{2}} in which \sigma_j^{-2}\propto \alpha_j takes a role similar to \alpha_j in eq.(3).

On the other hand, when the j-th agent won too fewer samples and thus its \sigma_j^{2} drops below certain threshold, not only \varepsilon_t(\theta_j) will increase but also the corresponding gap between different agents increases. Therefore, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the j-th agent decreases, its \alpha_j decreases and the other \alpha_i will relatively increase. That is, the term -\ln{\alpha_j } effectively enhances this self-penalizing featured leaving mechanism. As a result, the local criterion internally seeks an appropriate balance between its participating mechanism and leaving mechanism, with an automatic model selection nature.

Instead of at a bottom level via two opposite mechanisms, automatic model selection nature can also be observed in a top-down way with a global learning principle or theory that guides parameter learning and model selection. We consider the BYY system in i.i.d. cases, as shown in Figure 7 of the Scholarpedia page Bayesian Ying Yang learning. Our problem is actually a special case that is simplified into the above Fig.9(c). Moreover, considering a degenerated case with x,y collapsed to x and H(p \Vert q, \Theta ) simplified as follows:

(8)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell \in \mathcal{C}_{\kappa}(x_t)}   p(\ell|x_t) \tilde L_{\ell}(x_t, \theta_{\ell}), \ \theta_{\ell}=\{\alpha_{\ell}, \phi_{\ell}\}, \\  \tilde L_{\ell}(x, \theta_{\ell})=\ln{[ \alpha_{\ell}q(x| \phi_{\ell})q( \theta_{\ell}, h)]}+0.5h^2\nabla_{xx^T}^2\ln{q(x| \phi_{\ell})},    \end{array}

where q( \theta_{\ell}) is a priori distribution of parameters and is ingored when there is no priori knowledge. Moreover, \mathcal{C}_{\kappa}(x_t) \subset  \{1,2,\dots, k\} consists of \kappa labels or indices of agents that enter to represent x_t. One typical example is the Climax neighborhood policy for simplifying the Yang machine by only considering its climax and neighbors given a sample x_t. The rationale is that a Yang distribution conditioning on x_t is more reliably described by its apex realm that actually takes a major role in a harmony of the corresponding Ying part. Details are referred to the Scholarpedia page Bayesian Ying Yang learning, especially eqn.(3), eqn.(6), Fig.7, and Fig.8 there. Specifically, we observe the following three key points:

  • For eq.(8) here , the Yang machine is simply p(\ell|x_t) with \ell being a discrete label, and \mathcal{C}_{\kappa}(x_t) consists of the \kappa-Climax neighbors (CN) as follows
(9)
\begin{array}{l} \ \mathcal{C}_{\kappa}(x_t)\  \mbox{consists of  those  of} \   \ell  \  \mbox{that correspond the}\  \kappa \  \mbox{largest values of} \  \tilde L_{\ell}(x_t, \theta_{\ell}).    \end{array}
  • The structure of p(\ell|x_t) is designed under a uncertainty range that is restricted to perserve ones of the Ying machine q(\ell|x_t) as follows:
(10)
\begin{array}{l}    q(\ell|x_t)={ e^{L_{\ell}(x_t, \theta_{\ell})} \over  \sum_{j \in \mathcal{C}_{\kappa}(x_t)}e^{L_{j}(x_t, \theta_{j})}}, \ \ell^c_t=arg\max_{\ell} \tilde   L_{\ell}(x, \theta_{\ell}), \mbox{ we have}  \\ p(\ell|x_t)=\begin{cases}  \mbox{to be determined within} \ [0, q(\ell|x_t)], & (a) \mbox{Bounded  by Ying}, \\ \mbox{to be determined within} \ [0, q(\ell^c_t|x_t)],  & (b)  \mbox{Ceilinged by Ying}.  \end{cases}   \\  \mbox{subject to} \   \sum_{\ell \in \mathcal{C}_{\kappa}(x_t) } p(\ell|x_t)=s_t \ with  \ q(\ell^c_t|x_t) \le s_t < 1.   \end{array}
  • It further leads to different specific forms of p(\ell|x_t) via maximizing H(p \Vert q, \Theta ). The following are three typical examples:
(11)
\begin{array}{l} (a) \  \mbox{For the choice (a) with} \  s_t=1, \  \mbox{we have}  \ p(\ell|x_t)= q(\ell|x_t), \forall \ell.  \\ (b) \ \mbox{For the choice (a) with} \  s_t=q(\ell^c_t|x_t), \  \mbox{we get}  \\  p(\ell|x_t)=\begin{cases} q(\ell^c_t|x_t), & if \ \ell =\ell^c_t,    \\ 0, & otherwise.  \end{cases}  \\ (c) \ \mbox{For the choice (b), let} \ \kappa^c \ \mbox{be an integer with}  \ \kappa^c-1  \le 1/q(\ell^c_t|x_t) \le  \kappa^c, \mbox{we have}  \\  p(\ell|x_t)=\begin{cases} q(\ell^c_t|x_t), & if  \  \ell \in \mathcal{C}^c_{\kappa^c}(x_t) \ but \  \ell \ne  \ell^r_t, \ with \ \mathcal{C}^c_{\kappa^c}(x_t)= \mathcal{C}_{\kappa=\kappa^c}(x_t), \\ 1-(\kappa^c-1)q(\ell^c_t|x_t), & if  \ \ell =\ell^r_t, \  \ell^r_t=arg\min_{ \ell \in \mathcal{C}^c_{\kappa^c}(x_t)}\tilde  L_{\ell}(x, \theta_{\ell}),   \\ 0, & otherwise.   \end{cases}    \end{array}

Moreover, we observe an implementation of \begin{array}{l}\max_{\Theta} H(p \Vert q, \Theta )\end{array} along the gradient flow:

(12)
\begin{array}{l}  \nabla_{\theta_{\ell}}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\{\sum_{\ell \in \mathcal{C}_{\kappa}(x_t) }  [p(\ell|x_t)+ \delta_{\ell,t}]\nabla_{\theta_{\ell}}L_{\ell}(x_t, \theta_{\ell})\\ +\sum_{\ell \in \mathcal{C}_{\kappa}(x_t) }p(\ell|x_t) \nabla_{\theta_{\ell}}[\ln{q( \theta_{\ell}, h)}+0.5h^2\nabla_{xx^T}^2\ln{q(x| \phi_{\ell})}]\}, \\  L_{\ell}(x, \theta_{\ell})=\ln{[ \alpha_{\ell}q(x| \theta_{\ell})]},\   \ \delta_{\ell,t}=p(\ell|x_t)\tilde L_{\ell}(x_t, \theta_{\ell}) -q(\ell|x_t)\sum_{j\in \mathcal{C}_{\kappa}(x_t)}p(j|x_t)\tilde L_{j}(x, \theta_{j}),   \end{array}

from which we see a favorable nature of automatic model selection on k. The flow \nabla_{\Theta}H(p \Vert q, \Theta ) becomes the one for the ML learning if \delta_{\ell,t}=0. The BYY harmony learning differs from the ML learning by a relative fitness \delta_{\ell,t}. If \delta_{\ell,t}>0, updating goes along the same direction of the ML learning but with an increased strength. If 0>\delta_{\ell,t}> - p(\ell|x_t), updating still goes along the ML learning direction but with a reduced strength. When \delta_{\ell,t}<-p(\ell|x_t), updating reverses to the opposite direction, i.e., becoming de-learning, which becomes similar to RPCL. This can be observed more clearly at a special case of either the choice (a) with \kappa=2 or the choice (b) with q(\ell^c_t|x_t)\ge 0.5, which leads to

(13)
\begin{array}{l}    \ell^c_t=arg\max_{\ell} \tilde   L_{\ell}(x, \theta_{\ell}),   \ \ell^r_t=arg\max_{\ell \ne \ell^c_t} \tilde   L_{\ell}(x, \theta_{\ell}), \\  \delta_{\ell,t}=\begin{cases} \eta_t, &   \ell =\ell^c_t, \\ -\eta_t, &  \ell =\ell^r_t,  \end{cases}  \  \eta_t=p(\ell^c_t|x_t)p(\ell^r_t|x_t)[\tilde   L_{\ell^c_t}(x_t, \theta_{\ell^c_t})-\tilde   L_{\ell^r_t}(x_t, \theta_{\ell^r_t})].   \end{array}

Moreover, there are two improvements. First, it avoids the difficulty of specifying an appropriate \gamma that is required in RPCL. Second, either the winner or the rival may be updated by de-learning as \delta_{\ell,t}< -p(\ell|x_t) or learning as \delta_{\ell,t}>-p(\ell|x_t) instead of only de-learning by RPCL. Therefore, RPCL can be explained as a bottom level simplified approximation of the BYY harmony learning. An even closer link to RPCL is given by the choice (a) with s_t=q(\ell^c_t|x_t) in eq.(11), by which we simply have

(14)
\begin{array}{l} p(\ell|x_t)+\delta_{\ell,t}=\begin{cases} q(\ell^c_t|x_t)[2- q(\ell^c_t|x_t)]\tilde L_{\ell^c_t}(x_t, \theta_{\ell^c_t}), & \mbox{if} \, \ell =\ell^c_t, \\   - q(\ell^c_t|x_t)[1- q(\ell^c_t|x_t)]\tilde  L_{\ell^c_t}(x_t, \theta_{\ell^c_t}), & \mbox{otherwise}.  \end{cases}  \end{array}

In a summary, the Ying Yang best harmony at the global level in Fig.9(c) has an inherent relation with the local balance of two mechanisms in Fig.9(a). In addition, it is also interesting to notice the choice (b) with s_t=q(\ell^c_t|x_t)<0.5 that all the candidates in \mathcal{C}^c_{\kappa^c}(x_t) are considered instead of only the first one since none of them can be regarded to suit the current sample x_t.

Moreover, p(\ell|x_t)+ \delta_{\ell,t} actually provides a new allocation scheme for the iterative learning in Fig.8(c) with \begin{array}{l} \varepsilon_t(\theta_j)=- \tilde  L_{j}(x, \theta_{j})\end{array}, which implements a specific formulation of BYY harmony learning. Furthermore, let \varepsilon(\theta)=-H(p \Vert q, \Theta ), it follows from eq.(8) that we get the following global principle:

(15)
\varepsilon(\theta)=\sum_{t=1}^N \varepsilon_t(\theta),\varepsilon_t(\theta) =\sum_{j\in \mathcal{C}_{\kappa}(x_t)} p(j|x_t) \varepsilon_t(\theta_j), \ q(\ell|x_t)={exp(-\varepsilon_t(\theta_{\ell}))\over \sum_{j=1}^k exp(-\varepsilon_t(\theta_{j}))},

with p(\ell|x_t) obtained in a same way as eq.(11). In a degenerated case, minimizing \varepsilon(\theta) with respect to p(\ell|x_t) \ge 0 under the constraint \begin{array}{l}  \sum_j p(\ell|x_t)=1,\end{array} p(\ell|x_t)\ge 0 returns to the WTA allocation by eq.(1).


Learning algorithms for GMM, RBF nets, and mixture-of-experts

Figure 10:    Allocating policy versus individual criteria. (a) learning  algorithms for Gaussian mixture, (b) learning  algorithms for alternative ME and normalized RBF nets.
Enlarge
Figure 10: Allocating policy versus individual criteria. (a) learning algorithms for Gaussian mixture, (b) learning algorithms for alternative ME and normalized RBF nets.

Since the coordinating nature between a participating mechanism and a leaving mechanism is actually featured via a combination of a specific individual criterion \varepsilon_t(\theta_j) and a specific instance of the global allocating scheme by eq.(6), here we categorize different combinations to get an outline on algorithms that fall in the proposed divide-conquer learning framework. Shown in Fig.10(a) is an outline of learning algorithms derived from Fig.8(c) on Gaussian mixture model (GMM).

The first column is featured by the WTA allocation by eq.(1). At the bottom, each learning agent is simply a center point m_j and multiple agents jointly perform a task of cluster analysis by either the standard CL or the FS-CL with the previously discussed improvement. Moreover, each agent is extended to a general Gaussian G(x|m_j,\Sigma_j) with a priori \alpha_j. For Type A, multiple agents jointly describe a Gaussian mixture:

(16)
\begin{array}{l}  q(x|\phi)=\sum_{j=1}^k \alpha_j G(x|m_j,\Sigma_j). \end{array}

It improves the standard CL not only by extending a spherical cluster to an elliptic cluster but also by getting the balancing mechanism as discussed in Fig.9(b). For Type B, multiple agents jointly describe a product Gaussian mixture:

(17)
\begin{array}{l}  q(x|\phi)=\prod_{j=1}^k G(x|m_j,\Sigma_j)^{\alpha_j}/\int \prod_{j=1}^k G(x|m_j,\Sigma_j)^{\alpha_j}dx.  \end{array}

It improves FS-CL similarly as the above for Type A, and differs from Type A in that \alpha_j takes a role similar to the one in eq.(3), which further enhances a participating mechanism.

Featured by eq.(4), the second column consists of the standard RPCL (Xu, Krzyzak, & Oja, 1992 , 1993) and its generalization (Xu, 1998a). Next, the last two columns lead to an adaptive version of the EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996) and Bayesian Ying Yang harmony learning not only on a Gaussian mixture by eq.(16) for the cases of Type A, which was already discussed after eq.(8) in the previous section, but also on product Gaussian mixture by eq.(17) for the cases of Type B.

In addition to Gaussian mixture, learning algorithms in Fig.10(a) is also applicable to the cases of supervised learning, i.e., knowing each pairing \{x_t, \ell_t\}, by simply letting the winner \ell_c=\ell_t for RPCL learning and \ell_t^t=\ell_t for BYY harmony learning with p(\ell | x_t) by eq.(11) in the choice (a) with s_t= q(\ell_t^c | x_t). Another choice for supervised learning is using a normalized radial basis function (RBF) networks. Actually, in its original papers (Xu, Krzyzak, & Oja, 1992 , 1993) one application of RPCL was for learning RBF nets in a two stage implementation. First, a clustering algorithm (e.g., k-means) in (Moody, & Darken, 1989) is replaced by RPCL with the number of clusters determined automatically. Second, the output layer is estimated by a least square linear fitting. It has been subsequently shown that a normalized RBF networks can be regarded as a simplified case of an alternative mixture of experts (ME):

(18)
\begin{array}{l} q(z |x, \theta)=\sum_{j=1}^k p(j|x, \phi)G(z|f_j(x,\phi_j), \Gamma_j), \    E[z|x]=\sum_{j=1}^k p(j|x, \phi) f_j(x,\phi_j), \\    p(j|x, \phi)={ \alpha_j G(x|m_j,\Sigma_j)/ q(x|\phi)}, \end{array}

for which the maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995). Also, an alternative ME becomes an extended normalized radial basis function (RBF) networks at the following special case

(19)
\begin{array}{l} f(x|\phi_j)=w_j^T x +c_j, \ \alpha_j={|\Sigma_j|/\sum_{i=1}^k |\Sigma_i|}, \\  E[z|x]=\sum_{j=1}^k {  exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)]  \over \sum_{r=1}^k exp[-0.5(x-m_r)^T\Sigma_r^{-1} (x-m_r)]}f_j(x,\phi_j),   \end{array}

which degenerates to a normalized RBF networks (Moody, & Darken, 1989) simply with w_j=0. As a result, the conventional two stage but suboptimal implementation of a normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Moreover, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 1998b, 2001).

Similar to Fig.10(a), these learning algorithms can also be summarized under the general divide-conquer learning framework here, for which we further interpret the following points:

  • Each agent also has a part M_j^a(\theta_j^a) that makes an action z_t^a(\theta_j^a) to the environment.
  • Individual criterion also needs to evaluate the error e_t^z(\theta_j^a)=z_t- z_t^a(\theta_j^a), for which we generalize the Type A case of eq.(5) into :
(20)
\begin{array}{l} \varepsilon_t(\theta_j)=- \ln{[\alpha_j q(x_t| \theta_j^x)q(z_t|x_t, \theta_j^a)]}, \mbox{ in general;} \\  \varepsilon_t(\theta_j)=- \ln{[\alpha_j G(x|m_j,\Sigma_j)G(z|f_j(x,\phi_j), \Gamma_j)]}, \ \mbox{for the alternative ME.}  \end{array}

With the above \varepsilon_t(\theta_j) replacing its previous appearances on Gaussian mixture, it follows from Fig.8(c) that similarly we get the learning algorithms for the alternative ME and normalized RBF networks, as outlined in Fig.10 (b). Details are referred to (Xu, 2009).

In many practices, there is only a finite size of training samples distributed in small dimensional subspaces, instead of scattering over all the dimensions of the observation space. These subspace structures can not be well captured by considering a basis exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)] supported on the entire space of x. There are too many free parameters in \Sigma_j, which usually leads to poor performances. Instead, we can further consider that observed samples are regarded as generated from a subspace with independent factors distributed along each coordinate of a m_{\ell} dimensional inner representation y. Readers are referred to (Xu, 2009) for details and to the subpage /APPENDIX (a) subspace based functions.


Grouped agents, k-CN majority voting, and Ying Yang Alternation

Also in many practices, the uncertainty structure of each agent \varepsilon_t(\theta_j) is beyond a Gaussian structure. One choice is considering eq.(20) by choosing specific q(x_t|\theta_j) and q(z_t|x_t, \theta_j^a) if we know their forms in advance. A general way is considering each \varepsilon_t(\theta_j) to be described by a mixture of multiple components, with each one in a simple distribution. Typically, we consider a mixture of multiple Gaussian components, that is, each agent consists of q(x_t|\theta_j) in a Gaussian mixture by eq.(16) and q(z_t|x_t, \theta_j^a) by eq.(18). Equivalently, we consider k groups of agents with each group consisting of a number of agents in Gaussian structures, which can be made by simply letting the previously discussed single label j or \ell (e.g., in eq.(16) and eq.(18) as well as eq.(8) ) to be replaced by a double labeling j=(i,j) or \ell=(i,\ell), with i added for labeling Gaussian components within each group. Taking eq.(16) as an example, in sequel we consider its further details, while the cases with q(z |x, \theta_j^a) by eq.(18) can be considered similarly.

We consider multiple groups of Gaussian components \{G(x|m_{ij},\Sigma_{ij})\}_{ij} with their corresponding priors \{\alpha_{ij}\}_{ij}. That is, the entire data set is considered by a two level mixture of Gaussians

(21)
\begin{array}{l}  q(x|\phi)=\sum_{i,j} \alpha_{ij} G(x|m_{ij},\Sigma_{ij}), \ \sum_{i,j} \alpha_{ij}=1, \   \alpha_{ij} \ge 0. \end{array}

Still, the algorithms in Fig.10(a) can be used to learn their parameters \{ m_{ij},\Sigma_{ij}, \alpha_{ij}\}_{ij} via simply replacing every single label j or \ell with a double labeling j=(i,j) or \ell=(i,\ell), plus a modification on getting \ell_t^c through the following k-CN based majority rule:

(22)
\begin{array}{l} \ell_t^c=(i_t^*,\ell_t^*) \ \mbox{by first getting}  \   \ell_t^*=  arg\max_{\ell}  \# \{ (i,\ell)  \in \mathcal{C}_{\kappa}(x_t), \ fixing \ \ell  \}, \\   \mbox{then getting }  \  i^*_t=  arg\max_{i} [\alpha_{ij} G(x|m_{ij},\Sigma_{ij})]_{j= \ell_t^* }.   \end{array}

where \#S denotes the cardinality of the set S. If knowing that x_t comes from the \ell_t-th group, we simply let \ell_t^*=\ell_t, plus specifically considering p(ij| x_t) by eq.(11) in the choice (a) with s_t=q(\ell^c_t|x_t), to implement BYY harmony learning. As introduced before eq.(9), \kappa-CN policy focuses on these reliable contributions to p(\ell|x_t) for a discrimination of x_t. Using a majority voting to get \ell_t^* further considers these reliable contributions based on their ranks instead of their probability values.

To be more specific, it follows from the the gradient flow by eq.(12) that we can implement \begin{array}{l} \max_{\Theta} H(p \Vert q, \Theta )\end{array} via the following Ying-Yang alternation:

  • Yang step: replacing every single label j or \ell by a double labeling j=(i,j) or \ell=(i,\ell), we get

(1) \ell_t^c by eq.(22) and q(ij|x_t) by eq.(10) with L_{\ell}(x, \theta_{\ell})=\ln[\alpha_{ij} G(x|m_{ij},\Sigma_{ij})] and q( \theta_{\ell})\propto \alpha^{\lambda}_{ij} |\Sigma_{ij}|^{-0.5\lambda}_{ij}.

(2) p(ij|x_t) by eq.(11) and \delta_{ij,t} by eq.(12) or eq.(14), from which we get p_{ij,t}=p(ij|x_t)+ \delta_{ij,t}.

(3) update the data smoothing parameter by

(23)
\begin{array}{l} h^{new} - h^{old} \propto \{{ 1 \over N}-\sum_{ij}{n_{ij} h\over Nd}Tr[\Sigma_{ij}^{-1}]-{h_0(h^2) \over h^2}\}. \end{array}

where h_0(h^2) is the average gap between samples under a smoothing parameter h^2. Given a sample set, h_0(h^2) needs to be computed only once and stored as a table for a quick access at each updating of eq.(23). Further details are referred to (Xu, 2007).

  • Ying step: update parameters by one of the following three choices:

(a) An EM-like updating

(24)
\begin{array}{l} \bar \alpha_{ij}={n^{\delta}_{ij}+\lambda n_{ij} \over \sum_{ij}(n^{\delta}_{ij}+\lambda n_{ij})}, \ n_{ij}= \sum_{ t=1}^N p(ij|x_t),\ n^{\delta}_{ij}= \sum_{ t=1}^Np_{ij,t}, \\ \bar m_{ij}={1 \over n^{\delta}_{ij}} \sum_{ t=1}^Np_{ij,t}x_t, \ \bar \Sigma_{ij} ={1 \over n^{\delta}_{ij}+\lambda n_{ij}} [h^2 n_{ij} + \sum_{ t=1}^Np_{ij,t}( x_t- m^{old}_{ij})( x_t- m^{old}_{ij})^T]. \end{array}

(b) A gradient ascent updating

(25)
\begin{array}{l} \alpha^{new}_{ij}={ (1-\eta)\alpha^{old}_{ij} + \eta \bar \alpha_{ij} \over 1-\eta   + \eta\sum_{ij}\bar \alpha_{ij}}, \ m^{new}_{ij}= (1-\eta)m^{old}_{ij} +\eta \bar m_{ij}, \ \Sigma^{new}_{ij}= (1-\eta)\Sigma^{old}_{ij} +\eta \bar \Sigma_{ij}.    \end{array}

(c) A constraint ensured gradient ascent updating

(26)
\begin{array}{l} m^{new}_{ij}- m^{old}_{ij} \propto \bar m_{ij}-m^{old}_{ij} , \ \alpha^{new}_{ij}=e^{c^{new}_{ij}}/\sum_{ij} e^{c^{new}_{ij}},\  c^{new}_{ij}-   c^{old}_{ij} \propto \bar \alpha_{ij}-  \alpha^{old}_{ij}, \\  \Sigma_{ij}=S_{ij}S^T_{ij}, \ S^{new}_{ij}- S^{old}_{ij} \propto \Sigma^{old \ -1}_{ij}[\bar \Sigma_{ij}-\Sigma^{old}_{ij}]\Sigma^{old \ -1}_{ij}S^{old\ -T}_{ij}. \end{array}

The above Ying-Yang alternation is repeated until the iteration converges, during which automatic model selection is made as one \alpha_{ij}\to 0 with its corresponding Gaussian component discarded.

Remarks:

  • The above choice (a) is ensured to make updating along the direction to increase H(p \Vert q, \Theta ). However, its stepsize may sometimes overshot. To avoid this, an extra check is needed on whether H(p \Vert q, \Theta ) is increased after updating. If not, updating can be made by the Choice (b) with a recuded stepsize 0< \eta < 1. Alternative, we may also make updating by the choice (b) with a small enough stepsize 0< \eta, without checking whether H(p \Vert q, \Theta ) is increased. The choice (b) becomes equivalent to the choice (a) when \eta =1. Still, the updatings can not guarantee the requirement \alpha_{ij}\ge 0 and \Sigma_{ij} remains to be positive definite, which is instead guaranteed by the choice (c).
  • Though the above discussed is learning in a batch way. It directly applies to adaptive learning by setting N=1 in eq.(24). It also covers a spectrum that N varies from 1 to the entire size of samples.
  • We can either get q(x|\phi) by eq.(21) or q(x_t|\theta_j)=\sum_{i} \alpha_{ij} G(x|m_{ij},\Sigma_{ij}) /\sum_{i} \alpha_{ij}. Moreover, as a new sample x_t comes we may classify x_t \to \ell_t^c by the k-NN rule given by the choice (b) in eq.(22).
  • It degenerates to the BYY harmony learning for a Gaussian mixture when i takes only one value for every j.
  • To simplify computation, we may ignore eq.(23) by fixing h=0 or a small real number.


References

  • Acciani, G., et al (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.
  • Ahalt, SC., et al (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.
  • Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.
  • Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53(3): 407-415.
  • Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.
  • Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.
  • Edelman, A, Arias, TA, & Smith, ST, (1998), "The geometry of algorithms with orthogonality constraints", SIAM J. Matrix Anal. Appl., 20:303-353.
  • Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
  • Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.
  • Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.
  • Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.
  • Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
  • Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
  • Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.
  • Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.
  • Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.
  • Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.
  • Nair, TM., et al, (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.
  • Nakkrasae, S, & Sophatsathit, P (2004), "An rpcl-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.
  • Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
  • Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
  • Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.
  • Rivals, I & Personnaz, L (1999) "On Cross Validation for Model Selection", Neural Computation 11: 863-870.
  • Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist. 22:400-407.
  • Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.
  • Salah, AA & Alpaydin, E. (2004), "Incremental mixtures of factor analyzers", Proc. the 17th International Conference on Pattern Recognition, 23-26 Aug. 2004, Cambridge, UK, Vol.1, 276-279.
  • Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.
  • Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.
  • Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
  • Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11: 185-194.
  • Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
  • Xu, L (2009), "Learning Algorithms for RBF Functions and Subspace Based Functions", Ch.3 in "Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques", eds. by Olivas, Guerrero, Sober, Benedito, & Lopez, IGI Global publication, pp60-94.
  • Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition 40: 2129-2153.
  • Xu, L (2007), "A Trend on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in "Studies in Computational Intelligence 63: Challenges to Computational Intelligence", Duch, W. and Mandziuk, J., 365-405.
  • Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.
  • Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.
  • Xu, L (1998a), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.
  • Xu, L (1998b), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
  • Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
  • Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
  • Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.
  • Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.
  • Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.
  • Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.


Internal links

External links

Author's website (http://www.cse.cuhk.edu.hk/~lxu/)


See also


Lei Xu (2007) Rival penalized competitive learning. Scholarpedia, 2(8):1810, (go to the first approved version)
Created: 31 July 2006, reviewed: 17 August 2007, accepted: 31 August 2007
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors
PDF