# Rival penalized competitive learning


Curator: Lei Xu

Rival penalized competitive learning (RPCL) is a development of competitive learning that maintains an appropriate balance between two opposite mechanisms (namely, a participating mechanism and a leaving mechanism), such that an appropriate number of agents or learners is allocated to learn the multiple structures underlying the observations. Moreover, RPCL has been further extended into a general competitive divide-conquer learning framework for multiple agents to divide and conquer a complicated task. Furthermore, RPCL may be regarded as a simplified and approximate implementation of Bayesian Ying Yang learning, from which we are actually led to a family of RPCL extensions.

## Classical competitive learning: from WTA to FSCL

Proposed decades ago, classical competitive learning (CCL) is commonly used for data clustering by a number of units or agents. Each agent is characterized by a parametric point $$\theta_j=m_j$$ that moves among the observed samples and seeks to locate itself at the center of a cluster of samples, such that several points collectively perform a clustering analysis. As a sample $$x_t$$ comes, each $$m_j$$ evaluates its suitability for representing $$x_t$$ by an individual value or criterion, e.g., $$\varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2 \ ,$$ and competes to be allocated to represent $$x_t\ .$$ The competition is guided globally by the winner-take-all (WTA) policy as follows $\tag{1} p_{ j, t}= \begin{cases} 1, & \mbox{if}\quad j=c, \\ 0, & \mbox{otherwise;}\end{cases} \,\,\, c=\arg\min_{ j} \varepsilon_t(\theta_j).$

Then, each $$m_j$$ is updated to further reduce $$\varepsilon_t(\theta_j)$$ by $\tag{2} m_j^{new}= m_j^{old}+ \eta p_{ j, t} (x_t- m_j^{old}).$
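For concreteness, eqs.(1)-(2) can be sketched adaptively in a few lines (a minimal NumPy illustration of ours; the function name is hypothetical, not part of the original presentation):

```python
import numpy as np

def ccl_step(x, M, eta=0.05):
    """One adaptive step of classical competitive learning.

    x : (d,) incoming sample; M : (k, d) agent locations m_j (updated in place).
    """
    errors = np.sum((M - x) ** 2, axis=1)  # individual criteria ||x - m_j||^2
    c = int(np.argmin(errors))             # WTA allocation, eq.(1)
    M[c] += eta * (x - M[c])               # winner moves toward x, eq.(2)
    return c
```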

When $$m_1, m_2$$ are initialized as in Figure 1(a), as a sample $$x$$ comes, $$m_1$$ becomes the winner and then moves a little bit towards $$x\ .$$ Next, as another sample $$x$$ comes in Figure 1(b), $$m_2$$ becomes the winner and then moves similarly. Finally, $$m_1$$ and $$m_2$$ each converge to one cluster's center as in Figure 1(c).

However, if $$m_1, m_2$$ are initialized as in the top part of Figure 1(d), $$m_2$$ always becomes the winner and finally moves to the middle point between the two clusters, while $$m_1$$ never wins and becomes dead. This is known as the under-utilized or dead unit problem that besets WTA based competitive learning (Rumelhart et al., 1985; Grossberg, 1978; Hecht-Nielsen, 1987), and it makes the clustering analysis perform badly. The problem comes from the unequal chances of dead agents to participate in the competition due to their unfavorable initial locations. Thus, it is desirable that every agent be initialized to have an equal chance to join the competition, which is actually difficult to implement without a priori knowledge of the entire clustering structure, while such knowledge is exactly the goal of this competitive learning.

One way to tackle this difficulty is the so-called frequency sensitive competitive learning (FSCL) (Ahalt, et al., 1990), which modifies $$\varepsilon_t(\theta_j)$$ into $\tag{3} \varepsilon_t(\theta_j)=\alpha_j \Vert x_t- m_j \Vert^2,$

where $$\alpha_j$$ is the frequency with which the $$j$$-th agent has won in the past. Illustrated in Figure 2 is an example of four clusters, for which CCL results in three dead units as in Figure 3, while FSCL places all four units at the clusters' centers instead of letting them become dead, as shown in Figure 4.
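As a sketch (hypothetical NumPy code of ours, with the winning counts initialized equally so that every agent starts with the same chance), an FSCL step differs from CCL only in weighting the criterion by eq.(3):

```python
import numpy as np

def fscl_step(x, M, wins, eta=0.05):
    """One adaptive FSCL step: the winning frequency alpha_j penalizes
    frequent winners, so under-utilized agents eventually get to win.

    wins : (k,) winning counts, e.g. initialized to ones.
    """
    alpha = wins / wins.sum()                       # frequencies alpha_j
    errors = alpha * np.sum((M - x) ** 2, axis=1)   # criterion of eq.(3)
    c = int(np.argmin(errors))
    wins[c] += 1                                    # update the frequency count
    M[c] += eta * (x - M[c])                        # same update as eq.(2)
    return c
```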

Such a spirit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al., 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in the Kohonen network (Kohonen, 1982). Figure 5: When $$k$$ is pre-assigned to 5, the frequency sensitive mechanism also brings the extra unit into the data, disturbing the correct locations of the others.

## Rival penalized competitive learning

The way of reducing the winning rates works when the number $$k$$ of clusters is known. Unfortunately, $$k$$ is usually unknown. Not only do both CCL and FSCL undoubtedly perform badly when $$k$$ is given a smaller number, but FSCL also fails when $$k$$ is pre-assigned a value larger than the appropriate one. As illustrated in Figure 5, the frequency sensitive mechanism will bring the extra agents into the data and disturb the correct locations of the other agents. Thus, how to determine $$k$$ is a critical difficulty too.

One solution is provided under the name Rival Penalized Competitive Learning (RPCL), proposed in the early 1990s for clustering analysis and for detecting lines in an image (Xu, Krzyzak, & Oja, 1992, 1993). As illustrated in Figure 6, not only does the winner learn, but its rival (i.e., the second winner) is also repelled a little bit from the sample, to reduce duplicated information allocation. As a sample $$x$$ comes, $$m_1$$ is the winner and $$m_2$$ is its rival in Figure 6(a); $$m_1$$ is moved towards $$x$$ by a small step size, while $$m_2$$ is repelled away from $$x$$ by a much smaller step size. Similarly in Figure 6(b), $$m_3$$ is the winner and moved towards $$x\ ,$$ while $$m_2$$ is its rival and repelled away from $$x\ .$$ Figure 7: The rival penalized mechanism drives extra agents far away.

More precisely, RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by $\tag{4} p_{ j, t}= \begin{cases} 1, & \mbox{if} \, j=c, \\ -\gamma, & \mbox{if}\, j= r, \\ 0, & \mbox{otherwise}, \end{cases} \begin{cases} c=\arg\min_{ j} \varepsilon_t(\theta_j), \\ r=\arg\min_{ j \ne c } \varepsilon_t(\theta_j),\end{cases}$ where $$\gamma$$ typically takes a value between $$0.05$$ and $$0.1$$ to control the penalizing strength.

As illustrated in Figure 6(c), $$m_1$$ and $$m_3$$ finally converge to the clusters' centers, while $$m_2$$ is driven far away from the samples. Shown in Figure 7 is an experimental result that improves on its FSCL counterpart in Figure 5, with the disturbing agent driven away from the samples. Adding this rival penalized mechanism drives extra agents far away from the data, as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect the correct number of clusters or objects automatically during learning. This automatic allocation is a favorable property that neither CCL and FSCL nor conventional clustering algorithms (e.g., k-means) have. In the past decades, a large number of applications of RPCL have been made (Acciani, et al., 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).
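Putting eq.(4) together with the update of eq.(2), one RPCL step can be sketched as follows (a hypothetical NumPy illustration of ours; the function name is not from the original papers):

```python
import numpy as np

def rpcl_step(x, M, eta=0.05, gamma=0.05):
    """One adaptive RPCL step, eqs.(2) and (4): the winner c learns
    toward x, while the rival r (the second winner) is de-learned,
    i.e., repelled away from x with the much smaller rate gamma * eta."""
    errors = np.sum((M - x) ** 2, axis=1)
    c = int(np.argmin(errors))             # winner
    masked = errors.copy()
    masked[c] = np.inf
    r = int(np.argmin(masked))             # rival: best among j != c
    M[c] += eta * (x - M[c])               # p_{c,t} = 1
    M[r] -= gamma * eta * (x - M[r])       # p_{r,t} = -gamma
    return c, r
```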

## A general framework for competitive divide-conquer learning

CCL, FSCL, and RPCL can be investigated systematically from the perspective of a general framework for learning-agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks, and each subtask is mainly conquered by one learning agent, such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the basic ingredients of this general framework.

(a) Agent's structure $$\quad$$ Each agent needs a hardware $$M_j$$ that is able to handle one specific problem solving task. This $$M_j$$ is usually characterized by a pre-designed structure with a set $$\theta_j$$ of parameters to be determined. The simplest example is that $$M_j$$ is given by a point $$\theta_j=m_j\ .$$ In general, $$M_j$$ models or fits its environment, from which we get $$x_t(\theta_j)$$ as a reconstruction of $$x_t$$ by $$M_j\ .$$ Beyond a point $$m_j\ ,$$ $$M_j$$ can also be a curve or shape, as shown in Figure 8(a), for object detection in a noisy situation (Xu, 2002; Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and the Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally, $$M_j(\theta_j)$$ can be another parametric structure too.

(b) Individual criterion$$\quad$$ Each agent has an individual value $$\varepsilon_t(\theta_j)$$ to evaluate how well $$x_t$$ is reconstructed as $$x_t(\theta_j)$$ by $$M_j.$$ Considering the error $$\ e_t (\theta_j )=x_t- x_t(\theta_j)\ ,$$ it can be measured by one of the following two related choices:

• The errors are directly measured by a certain norm, e.g., in the $$L_2$$ norm $$\varepsilon_t(\theta_j)=\Vert e_t (\theta_j )\Vert^2\ .$$ For the simple case $$\!\theta_j=m_j\ ,$$ we have $$\!x_t(\theta_j)=m_j$$ and $$\varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2\ .$$
• Observing that a larger error occurs with a smaller chance, an error may alternatively be measured by its probability distribution, e.g., $$x - m_j$$ is measured by a Gaussian as shown in Figure 8(b). In general, we have $$\!q(\varepsilon(\theta_j)| \theta_j)$$ or $$q(x | x (\theta_j), \theta_j) \ ,$$ which is concisely rewritten as $$q(x |\theta_j) \ .$$ Adding a prior $$\alpha_j$$ for the chance of the corresponding agent being involved, we get the following probability distribution based measure:

$\tag{5} \begin{array}{l} \varepsilon_t(\theta_j)=-\begin{cases} \ln{[\alpha_j q(x_t|\theta_j)]}, & (a)\, Type\, A, \\ \ln{ q(x_t|\theta_j)^{\alpha_j} } -\ln \mathcal C( \theta), & (b)\, Type\, B. \end{cases}\\ \mathcal C(\theta)= \int \prod_j q(x|\theta_j)^{\alpha_j}dx\approx \sum_t \prod_j{ q(x_t|\theta_j)^{\alpha_j}}. \end{array}$ Type B with $$q(x| \theta_j)=G(x|m_j, \Sigma_j)$$ can be regarded as a further extension of eq.(3).

(c) Global principle$$\quad$$ Agents are coordinated globally to combine individual criteria $$\{ \varepsilon_t(\theta_j)\}_{j=1}^k\ ,$$ in one of the following two manners:

• One is operational based, e.g., the simple ones by eq.(1) or eq.(4). Generally, we can denote an allocation scheme as follows:

$\tag{6} p_{ j, t}= ALOC[\{{\varepsilon_t(\theta_j) }\}_{j=1}^k]$

• The other is evaluation based, via a principle on the overall performance achieved jointly by all the agents.

(d) Updating rule$$\quad$$ As a further extension of eq.(2), agents are modified to adapt to $$x_t$$ according to the allocation, by $$p_{ j, t}\delta \theta_j\ ,$$ with $\tag{7} \delta \theta_j=Reduce(\varepsilon_t(\theta_j^{old})),$ where $$Reduce(\varepsilon_t(\theta_j))$$ denotes an operator that acts on $$\theta_j^{old}$$ and $$x_t$$ to yield a direction along which $$\varepsilon_t(\theta_j)$$ reduces within a neighborhood region of $$\theta_j^{old}\ .$$ When $$\varepsilon_t(\theta_j)$$ is differentiable, $$Reduce(\varepsilon_t(\theta_j))$$ can be given by $$-\nabla_{ \theta_j}\varepsilon_t(\theta_j)$$ or $$-{\mathcal P}\nabla_{ \theta_j}\varepsilon_t(\theta_j)$$ on $$D(\theta_j) \ ,$$ where $$-{\mathcal P}\nabla_{ \theta_j}\varepsilon_t(\theta_j)$$ has a positive projection on $$-\nabla_{ \theta_j}\varepsilon_t(\theta_j) \ .$$

(e) A-I/(Yang-Ying) alternation $$\quad$$ As shown in Figure 8(c), a conquering action is performed via the updating rule of eq.(7) to improve the fit of the agents to the samples, while a dividing action is performed via the allocation scheme of eq.(6). The two steps A-I, also called the Yang-Ying steps from the perspective of Bayesian Ying Yang learning (see the last two subsections of this article), are implemented iteratively. To get a useful iteration, the following three issues should be considered:

• The updating in Step I/Ying needs an appropriately controlled, small enough learning stepsize.
• During the iteration, we should keep $$\theta_j^{new}\in D(\theta_j)\ ,$$ which usually consists of three types of constraints. One is $$\alpha_j>0, \ \sum_j \alpha_j=1 \ ;$$ the second is that the covariance matrix $$\Sigma_j$$ should be positive definite. One typical way to handle them is to transform the constrained parameters into functions of parameters that are free of constraints, e.g., $$\alpha_j=\exp(c_j)/\sum_j \exp(c_j)$$ and $$\Sigma_j=S_jS_j^T$$ (Xu, 2001). Moreover, there is also the constraint that a matrix lies within the Stiefel manifold $$UU^T=I\ ,$$ for which one way is based on projection (Xu, 2002) while another is based on differential geometry (Edelman, Arias, & Smith, 1998).
• In addition to estimating the unknown parameters $$\{ \theta_j \}_{j=1}^k\ ,$$ one also needs to select an appropriate number $$k$$ of agents, such that not only does each agent perform best according to its individual criterion, but also the best overall performance is achieved, which highly depends on $$p_{ j, t}\ .$$ In the statistics and machine learning literature, the task of determining the number $$k$$ belongs to what is usually called model selection.
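The reparameterization mentioned in the second item above can be sketched as follows (hypothetical code of ours; the free parameters $$c_j$$ and $$S_j$$ are unconstrained):

```python
import numpy as np

def to_constrained(c, S):
    """Map free parameters to constrained ones, in the typical way noted
    above (Xu, 2001): a softmax yields alpha_j > 0 with sum_j alpha_j = 1,
    and Sigma = S S^T is symmetric positive semi-definite for any square S."""
    e = np.exp(c - c.max())          # subtract the max for numerical stability
    alpha = e / e.sum()              # alpha_j = exp(c_j) / sum_i exp(c_i)
    Sigma = S @ S.T                  # Sigma_j = S_j S_j^T
    return alpha, Sigma
```

Gradient steps can then be taken on $$c_j$$ and $$S_j$$ directly, and the constraints hold automatically.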

## Two stage, stepwise, and automatic model selection

In general, model selection refers to selecting an appropriate one among a family of infinitely many candidate structures $$\{S_{\mathbf k}( \theta)\}\ ,$$ with each $$S_{\mathbf k}$$ in a same configuration but at a different scale, labeled by a scale parameter $${\mathbf k}$$ in terms of one integer or a set of integers. Selecting an appropriate $${\mathbf k}$$ means getting a structure consisting of an appropriate number of free parameters. At a given $${\mathbf k} \ ,$$ we expect that $$S_{\mathbf k}$$ best fits or represents a given set $$\mathcal{X}_N$$ of samples with a finite size $$N\ ,$$ and determine the set $$\theta_{\mathbf k}$$ of all the unknown parameters within $$S_{\mathbf k}\ ,$$ such that the error of using $$S_{\mathbf k}(\theta^*_{\mathbf k} )$$ to represent $$\mathcal{X}_N$$ becomes the least. For the problem discussed in the previous section, we have $$S_{\mathbf k}=\{M_{j}\}_{j=1}^k$$ and $$\theta_{\mathbf k}=\{\theta_{j}\}_{j=1}^k\ ,$$ while $${\mathbf k}$$ is simply the number $$k \ .$$

For a best fitting principle (e.g., the maximum likelihood (ML) principle), the fitting error monotonically decreases as $${\mathbf k}$$ grows, until the error reaches zero at $${\mathbf k}={\mathbf k}_N\ ,$$ a value that depends on $$N$$ and is usually much bigger than the appropriate $${\mathbf k}^*\ .$$ In this case, a large part of the structure has actually been used to fit noise or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples with the same underlying regularity. This is usually called the ''over-fitting'' problem, for which we cannot get an appropriate $$k^* \ .$$ Typically, this problem is tackled along one of the following three directions.

(a) Two stage implementation $$\quad$$ First, enumerate a candidate set of $${\mathbf k}$$ and, for each $${\mathbf k}\ ,$$ estimate the corresponding unknown set $$\theta_{\mathbf k}$$ of parameters by ML learning for a solution $$\theta^*_{\mathbf k}.$$ Second, use a model selection criterion $$J(\theta^*_{\mathbf k})$$ to select the best $${\mathbf k}^*.$$ In the literature, there exist several typical model selection criteria, including the Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of the Bayesian inference criterion (BIC) (Schwarz, 1978), minimum message length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999), and minimum description length (MDL) (Rissanen, 1986, 2007), cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, these criteria usually provide rough estimates that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as $${\mathbf k}$$ increases, which makes the evaluated value of $$J(\theta^*_{\mathbf k})$$ unreliable.

(b) Stepwise model selection, either incrementally or decrementally $$\quad$$ Incremental algorithms attempt to keep as much as possible of what has already been learned as $${\mathbf k}$$ increases step by step, focusing on learning the newly added parameters (Ghahramani & Beal, 2000; Salah & Alpaydin, 2004). Such an incremental implementation can save computing costs, but it usually leads to a suboptimal performance, because both the newly added parameters and the old parameter set $$\theta_{\mathbf k}$$ need to be re-learned, while the old parameter set $$\theta_{\mathbf k}$$ has actually not been updated. This suboptimality may be lessened by a decremental implementation that starts with $${\mathbf k}$$ at a large value and decreases $${\mathbf k}$$ step by step. At each step, one takes a subset of $$\theta_{\mathbf k}$$ out for consideration. Such a procedure can be formulated as a tree search: the initial parameter set $$\theta_{\mathbf k}$$ is the root of the tree, and discarding one subset moves to one immediate descendant. A depth-first search suffers seriously from suboptimal performance, while a breadth-first search suffers a huge combinatorial computing cost. Usually, a trade-off between the two extremes is adopted.

(c) Automatic model selection $$\quad$$ One early exemplar is RPCL, which is featured by driving an extra agent far away from the data, or equivalently pushing the proportion $$\alpha_j$$ of samples allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu, 1995, 2002) and systematically developed, which leads not only to new model selection criteria but also to algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).

In general, this automatic model selection demonstrates a quite different nature from the usual incremental or decremental implementation, which is based on evaluating the change $$J(\theta^*_{\mathbf k})- J(\theta^*_{\mathbf k}\cup \theta_{\delta})$$ as a subset $$\theta_{\delta}$$ of parameters is added or removed. Instead, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:

• There is an indicator $$\rho(\theta_e)$$ associated with a subset $$\theta_e \subset \theta_{\mathbf k}$$, such that a particular structural component is effectively discarded if its corresponding $$\rho(\theta_e)=0$$; e.g., a Gaussian component in Fig. 8(a) is discarded if either or both of its corresponding $$\alpha_l=0$$ and $$Tr[\Sigma_l]=0$$ hold.
• In implementing this learning algorithm or optimizing this learning principle, there is a mechanism that automatically drives $$\rho(\theta_e)\to 0$$ if the corresponding structure is redundant and thus can be effectively discarded.

Readers are referred to page 287 of a recent overview (Xu, 2010), and Sect.2.2.2 of a recent tutorial (Xu, 2012) for further details about automatic model selection.

## Automatic model selection from coordinated opposite mechanisms

As illustrated in Figure 9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to the coordination of two opposite mechanisms, namely, one for each agent to participate and the other for each agent to leave.

As mentioned earlier, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem, because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent have an equal initial chance to participate. For FSCL, which is still based on the WTA allocation by eq.(1), the local criterion by eq.(3) provides a participating mechanism that penalizes frequent winners and compensates constant losers. However, it fails when the number of agents under consideration is larger than the appropriate one, due to the lack of a mechanism for an incapable or extra agent to leave.

As illustrated in Figure 9(a), it is an appropriate balance between such a leaving mechanism and the participating mechanism that leads to automatic model selection, and such a balance may come from one or more of the following different aspects.

(a) Allocating policy $$\quad$$ RPCL improves FSCL by penalizing the rival via eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism is controlled by the de-learning rate $$\gamma$$ in eq.(4).

(b) Local criterion $$\quad$$ We consider Type A in eq.(5) with the special Gaussian $$q(x| \theta_j)=G(x|m_j, \sigma_j^2I)\ ,$$ while Type B can be analyzed similarly. It follows that $$\varepsilon_t(\theta_j)=-\ln{\alpha_j }+ 0.5 [{\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}+d\ln{(2\pi \sigma_j^2)}]$$ varies with $$\sigma_j^{2}$$ as shown in Figure 9(b), where $$d$$ is the dimension of $$x\ .$$ If the $$j$$-th agent wins a lot of samples, $$\sigma_j^{2}$$ becomes larger than a threshold, and not only does $$\varepsilon_t(\theta_j)$$ increase but the gap between different agents also reduces. In other words, the competing ability of an agent that has won too much in the past is self-penalized, and the competing ability of a weaker agent is compensated for a relative boost, which effectively provides a participating mechanism similar to $$\alpha_j$$ in eq.(3). This similarity can also be observed from the term $$0.5 {\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}\ ,$$ in which $$\sigma_j^{-2}$$ takes a role similar to $$\alpha_j$$ in eq.(3). On the other hand, when the $$j$$-th agent has won too few samples and thus its $$\sigma_j^{2}$$ drops below a certain threshold, not only does $$\varepsilon_t(\theta_j)$$ increase but the corresponding gap between different agents also increases. Therefore, an incapable or extra agent is self-penalized to fade out, which provides a leaving mechanism different from the RPCL one by eq.(4). Moreover, as the competing ability of the $$j$$-th agent decreases, its $$\alpha_j$$ decreases and the other $$\alpha_i$$ relatively increase. That is, the term $$-\ln{\alpha_j }$$ effectively enhances this self-penalizing leaving mechanism. Readers are referred to Sect.3.3.1 of (Xu, 2007) for further details.
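The behavior described above can be checked numerically with a short sketch of ours: as a function of $$\sigma_j^2$$, the criterion is minimized at $$\sigma_j^2=\Vert x_t-m_j\Vert^2/d$$ and increases on both sides, matching the U-shape of Figure 9(b).

```python
import numpy as np

def eps_type_a(x, m, alpha_j, sigma2):
    """Type A criterion of eq.(5) for q(x|theta_j) = G(x | m_j, sigma_j^2 I):
    eps = -ln(alpha_j) + 0.5*[||x - m_j||^2 / sigma_j^2 + d*ln(2*pi*sigma_j^2)]."""
    d = len(x)
    sq = float(np.sum((x - m) ** 2))
    return -np.log(alpha_j) + 0.5 * (sq / sigma2 + d * np.log(2 * np.pi * sigma2))
```

A too small or too large $$\sigma_j^2$$ both raise $$\varepsilon_t(\theta_j)$$, and a smaller $$\alpha_j$$ raises it further via the $$-\ln\alpha_j$$ term.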

(c) Bayes priors $$\quad$$ In Bayesian approaches, the role of a prior is basically equivalent to adding a term $$-\ln{q(\theta_j)}$$ to the error $$\varepsilon_t(\theta_j)$$, that is, $\tag{8} \begin{array}{l} L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}, \end{array}$ where the minimization of $$-\ln{q(\theta_j)}$$ drives redundant structures to be discarded. Readers are referred to a systematic comparative investigation (Shi, Tu, & Xu, 2011) of three typical Bayesian approaches for the case of Type A in eq.(5) with $$q(x|\theta_j)= G(x|m_j,\Sigma_j),$$ namely, a Gaussian mixture as follows $\tag{9} \begin{array}{l} q(x|\phi)=\sum_{j=1}^k \alpha_j G(x|m_j,\Sigma_j). \end{array}$ The three Bayesian approaches are minimum message length (MML) (Wallace & Dowe, 1999; Figueiredo & Jain, 2002), variational Bayes (VB) (Corduneanu & Bishop, 2001), and BYY harmony learning, respectively. Without any priors, both VB and MML, as well as other existing Bayesian approaches, degenerate to maximum likelihood learning and thus are poor at automatic model selection. However, BYY is still capable of automatic model selection, and its performance can be further improved when appropriate priors are incorporated.

## Automatic model selection: BYY harmony learning, RPCL extensions, and alternation-in-batch

Bayesian Ying Yang learning provides a unified framework that coordinately accommodates the above three scenarios. Not only is $$L(x_t,\theta_{\ell})$$ by eq.(8) used as one basic component, with the last two scenarios covered naturally, but BYY is also capable of automatic model selection even without priors. Interestingly, RPCL learning may be regarded as a simplified and approximate implementation of the BYY harmony learning, from which we are actually led to a family of RPCL extensions.

The BYY harmony learning is implemented by maximizing $\tag{10} \begin{array}{l} H(p \Vert q, \theta )=\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell|x_t) L(x_t,\theta_{\ell}), \ L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}, \\ subject \ to \begin{cases} p(\ell |x_t) \le q(\ell |x_t, \theta), & \ell\in \mathcal{C}_{\kappa}(x_t), \\ p(\ell |x_t)=0, & otherwise, \end{cases} \ \ \ \kappa \le k, \ \ q(\ell|x_t, \theta) = { e^{L(x_t,\theta_{\ell})} \over \sum_{j \in \mathcal{C}_{\kappa}(x_t)} e^{L(x_t,\theta_{j})}}, \\ \mathcal{C}_{\kappa}(x_t) \subset \{1,2,\dots, k\} \ consists \ of \ the \ \ell \ indices \ that \ correspond \ to \ the \ first \ \kappa \ largest \ values \ of \ L(x_t,\theta_{\ell}). \end{array}$ When $$\kappa=k$$, $$q(\ell|x_t, \theta)$$ is the Bayesian posterior structure that corresponds to $$L(x_t,\theta_{\ell})$$. Generally, when $$\kappa<k$$, this $$\mathcal{C}_{\kappa}(x_t)$$ makes $$q(\ell|x_t, \theta)$$ focus on an apex zone of $$L(x_t,\theta_{\ell})$$. The structure of $$p(\ell|x_t)$$ comes from $$q(\ell |x_t, \theta)$$ subject to a principle of matched dynamic range, that is, $$p(\ell |x_t) \le q(\ell |x_t, \theta)$$ (see the page for more details).

Maximizing $$H(p \Vert q, \theta )$$ with respect to $$p(\ell|x_t)$$ and then solving the gradient flow with respect to $$\theta_{j}$$, we get $\tag{11} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=\sum_{t=1}^{N} \delta_{j\in \mathcal{C}_{\kappa}(x_t)} p_{j, t}\nabla_{\theta_{j}} L(x_t,\theta_{j}), \ p_{j, t}= p(j|x_t, \theta^{old})[1+ \Delta_{j,t}(\theta^{old})], \\ \Delta_{j,t}(\theta)= L(x_t,\theta_{j})- \sum_{\ell \in \mathcal{C}_{\kappa}(x_t) } p(\ell|x_t, \theta) L(x_t,\theta_{\ell}), \\ p(\ell |x_t) =\begin{cases} q(\ell |x_t, \theta), & \ell\in \mathcal{C}_{\kappa}(x_t), \\ 0, & otherwise, \end{cases} \ \ \delta_{j\in \mathcal{C}_{\kappa}(x_t)}=\begin{cases} 1, & j\in \mathcal{C}_{\kappa}(x_t), \\ 0, & otherwise. \end{cases} \end{array}$

This flow becomes the same as the one for ML learning if we let $$\Delta_{\ell,t}=0.$$ It is in this relative fitness $$\Delta_{\ell,t}$$ that the BYY harmony learning differs from ML learning. If $$\Delta_{\ell,t}>0,$$ updating goes along the same direction as ML learning but with an increased strength. If $$0>\Delta_{\ell,t}> - 1$$, updating still goes along the ML learning direction but with a reduced strength. When $$\Delta_{\ell,t}<-1\ ,$$ updating reverses to the opposite direction, i.e., it becomes de-learning, which is similar to RPCL. Therefore, a global level maximization of $$H(p \Vert q, \theta )$$ actually induces a local balance of the leaving mechanism and the participating mechanism, similar to the ones discussed in Figure 9.

Actually, $$p_{j, t}$$ in eq.(11) provides a new allocation scheme, which leads to a family of RPCL extensions by implementing the alternation in Figure 8(c), simply with $$\delta \theta_j=\nabla_{\theta_{j}}L(x_t,\theta_{j})$$. Further insight can be obtained from the following three aspects:

(a) RPCL approximation and further improvements $$\quad$$ In the special case $$\kappa=2$$, we observe that $\tag{12} \begin{array}{l} \ell^c_t=arg\max_{\ell} L(x_t, \theta_{\ell}), \ \ \ \ell^r_t=arg\max_{\ell \ne \ell^c_t} L(x_t, \theta_{\ell}), \\ \Delta_{\ell,t}=\begin{cases} \eta_t, & \ell =\ell^c_t, \\ -\eta_t, & \ell =\ell^r_t, \end{cases} \ \ \eta_t=p(\ell^c_t|x_t, \theta^{old})p(\ell^r_t|x_t, \theta^{old})[L(x_t, \theta_{\ell^c_t})- L(x_t, \theta_{\ell^r_t})] \ge 0, \end{array}$ from which we have $$p(\ell^c_t|x_t, \theta^{old})(1+ \Delta_{\ell^c_t,t})\ge 0$$ for the winner, while $$p(\ell^r_t|x_t, \theta^{old})(1+ \Delta_{\ell^r_t,t}) <0$$ when $$\eta_t>1$$ for the rival. This situation is exactly the same as in RPCL. Thus, RPCL can be explained as a bottom level, simplified approximation of the BYY harmony learning; e.g., one simple way is modifying $$L(x_t,\theta_{\ell^c_t})$$ in eq.(8) by adding a large enough positive constant. Unlike in RPCL, it is also possible to have $$1 \ge \eta_t >0\ ,$$ for which there will still be a weak learning but no de-learning. That is, penalizing is not unconditional for every rival, and this conditional penalizing actually provides an improvement over RPCL. Also, here we have avoided the difficulty of specifying an appropriate de-learning strength $$\gamma$$, as required by RPCL in eq.(4).

(b) A direct extension to supervised learning $$\quad$$ The trick of adding a large enough positive constant to one of the $$L(x_t,\theta_{\ell})$$ in eq.(8) makes the above discussed RPCL directly applicable to supervised learning, i.e., given each pairing $$\{x_t, \ell_t\}$$. Simply, we modify eq.(8) into

$\tag{13} \begin{array}{l} L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}+C\delta_{\ell, \ell^*}, \ C >0 \ is \ a \ large \ enough \ constant,\\ \delta_{\ell, \ell^*}=\begin{cases} 1 & \ell= \ell^*, \\ 0, & otherwise. \end{cases} \\ \ell^*=\begin{cases} \ell_t, &(a) \ given \ a \ pairing \ \{x_t, \ell_t\}, \\arg\max_{\ell} [-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}], & (b) \ \ given \ only \ \{x_t\}. \end{cases} \end{array}$ Actually, it covers the cases of (a) supervised learning with every pairing $$\{x_t, \ell_t\}$$ given, (b) unsupervised learning with only $$\{x_t\}$$ available, and (c) a mixed situation of (a) and (b).

(c) A spectrum from WTA CL to BYY harmony learning $$\quad$$ At one extreme, $$\kappa =1$$, we are led to the classical competitive learning similar to eq.(1). At the other extreme, $$\kappa =k$$, we are led to the standard BYY harmony learning. Generally, as $$\kappa$$ varies from $$\kappa =1$$ to $$\kappa =k$$, we get a spectrum of variants and extensions of RPCL.

In addition to implementing the alternation in Figure 8(c) adaptively, it follows from eq.(11) that we may also perform the following Ying Yang alternation-in-batch:

• A step: get $$p_{j, t}$$ in eq.(11) for all the samples in $$\{x_t\}$$;
• I step: update $$\theta_{\ell}^{new}=(1-\gamma)\theta_{\ell}^{old}+\gamma\theta_{\ell}^*$$ for every $$\ell$$ with an appropriate learning stepsize $$\gamma$$, where $$\theta^*=\{\theta_{\ell}^* \}$$ is the root of the following equations:

$\tag{14} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=0, \forall j. \end{array}$

Typically, taking the Gaussian mixture by eq.(9) as an example, $$\theta^*=\{\theta_{\ell}^* \}$$ is obtained simply in the same format as the M-step of the conventional EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996). That is, we have $\tag{15} \begin{array}{l} \alpha_{j}^*={1 \over N}\sum_{ t=1}^N p_{j, t}, \ m_{j}^*={1 \over N\alpha_{j}^*} \sum_{ t=1}^N p_{j, t}x_t, \ \Sigma_{j}^*={1 \over N\alpha_{j}^*} \sum_{ t=1}^N p_{j, t}( x_t- m^{old}_{j})( x_t- m^{old}_{j})^T. \end{array}$ Readers are referred to the page for further details about this topic.
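For the Gaussian mixture of eq.(9), the I-step update of eq.(15) can be sketched as follows (hypothetical code of ours; the allocations $$p_{j,t}$$ are taken as given from the A-step):

```python
import numpy as np

def i_step_gmm(X, P, M_old):
    """Batch I-step in the M-step format of eq.(15).

    X : (N, d) samples, P : (N, k) allocations p_{j,t},
    M_old : (k, d) old means m_j^old. Returns alpha^*, m^*, Sigma^*."""
    N = X.shape[0]
    alpha = P.sum(axis=0) / N                       # alpha_j^*
    m = (P.T @ X) / (N * alpha)[:, None]            # m_j^*
    R = X[:, None, :] - M_old[None, :, :]           # residuals x_t - m_j^old
    Sigma = np.einsum('tj,tji,tjl->jil', P, R, R)   # sum_t p_{j,t} R R^T
    Sigma /= (N * alpha)[:, None, None]
    return alpha, m, Sigma
```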

## Extensions to supervised learning: mixture-of-experts, RBF nets, and subspace based functions

In the original papers (Xu, Krzyzak, & Oja, 1992, 1993), one application of RPCL was learning normalized radial basis function (RBF) networks in a two stage implementation. First, the clustering algorithm (e.g., k-means) in (Moody & Darken, 1989) is replaced by RPCL, with the number of clusters determined automatically. Second, the output layer is estimated by a least squares linear fitting. Subsequently, it has been shown that a normalized RBF network can be regarded as a simplified case of the following alternative mixture of experts (ME): $\tag{16} \begin{array}{l} q(z |x, \theta)=\sum_{j=1}^k p(j|x, \phi)G(z|f_j(x,\phi_j), \Gamma_j), \ p(j|x, \phi)={ \alpha_j G(x|m_j,\Sigma_j)/ q(x|\phi)}, \end{array}$ where $$q(x|\phi)$$ is given by eq.(9). Its maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995).

For a regression task, typically we consider $$f_j(x,\phi_j)=w_j^T x +c_j$$ with $$E[z|x]=\sum_{j=1}^k p(j|x, \phi) f_j(x,\phi_j)$$, which implements a type of piecewise linear regression. It becomes an extended normalized radial basis function (RBF) network in the following special case $\tag{17} \begin{array}{l} E[z|x]=\sum_{j=1}^k { exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)] \over \sum_{r=1}^k exp[-0.5(x-m_r)^T\Sigma_r^{-1} (x-m_r)]}f_j(x,\phi_j) \ when \ \alpha_j={|\Sigma_j|/\sum_{i=1}^k |\Sigma_i|}, \end{array}$ and further degenerates to a normalized RBF network (Moody & Darken, 1989) simply with $$w_j=0\ .$$ As a result, the conventional two-stage but suboptimal implementation of normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Furthermore, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 2001).
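The mapping of eq.(17) can be illustrated by a short sketch: the gating weights are softmax-normalized Gaussian kernels, and each expert is a linear map. The function name and argument layout are hypothetical, and the parameters are assumed to be already learned (e.g., by the EM algorithm).

```python
import numpy as np

def extended_rbf_predict(x, m, Sigma_inv, W, c):
    """Piecewise-linear regression E[z|x] of eq.(17).

    Gating: normalized Gaussian kernels exp[-0.5 (x-m_j)^T Sigma_j^{-1} (x-m_j)].
    Experts: linear maps f_j(x) = W_j x + c_j; setting W_j = 0 recovers a
    plain normalized RBF network (Moody & Darken, 1989).
    """
    k = len(c)
    logits = np.array([-0.5 * (x - m[j]) @ Sigma_inv[j] @ (x - m[j])
                       for j in range(k)])
    g = np.exp(logits - logits.max())
    g /= g.sum()                          # normalize over the k basis functions
    experts = np.array([W[j] @ x + c[j] for j in range(k)])
    return g @ experts                    # gated combination E[z|x]
```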

These learning algorithms can also be summarized under the general divide-conquer learning framework shown in Figure 8, with the Type A case of eq.(5) extended into: $\tag{18} \begin{array}{l} \varepsilon_t(\theta_j)=- \ln{[\alpha_j G(x|m_j,\Sigma_j)G(z|f_j(x,\phi_j), \Gamma_j)]}, \ \mbox{for the alternative ME,} \end{array}$ and the learning algorithms introduced in the previous section directly apply. Readers are further referred to Fig. 2 and Fig. 3 of (Xu, 2009), where adaptive learning algorithms of EM, RPCL, and BYY harmony learning are summarized in a unified format for learning the alternative ME and RBF networks.

For a finite size of training samples distributed within low-dimensional subspaces, a single Gaussian $$G(x|m_j,\Sigma_j)$$ cannot capture these subspace structures well. Instead, we regard the samples as generated from each subspace, with independent factors distributed along each coordinate of a lower dimensional subspace.

In implementation, we may simply replace $$G(x|m_j,\Sigma_j)$$ by $$G(x|m_j,A_j\Lambda_jA_j^T+\Sigma_j)$$ in both eq.(16) and eq.(18), with basis functions on the lower dimensional subspace spanned by the columns of $$A_j.$$ The weak point is that automatic model selection is not applicable to the dimensions of these subspaces, which need to be specified in advance or obtained via a two-stage implementation of model selection.
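As a minimal sketch of this replacement, the covariance $$A_j\Lambda_j A_j^T+\Sigma_j$$ can be assembled as below, assuming for illustration an isotropic noise term $$\Sigma_j=\sigma^2 I$$ (the function name is hypothetical):

```python
import numpy as np

def subspace_covariance(A, lam, sigma2):
    """Covariance A diag(lam) A^T + sigma2 I: samples are loaded from
    independent factors with variances lam along the columns of A,
    plus isotropic observation noise (an illustrative simplification
    of Sigma_j)."""
    d = A.shape[0]
    return A @ np.diag(lam) @ A.T + sigma2 * np.eye(d)
```

Such a structured covariance has the factor-analysis form, with the subspace dimension given by the number of columns of $$A_j$$.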

Instead, we may consider the subspace based functions (SBF) shown in Figure 10. Differently from eq.(18), the BYY harmony learning is made by the maximization of $$H(p \Vert q, \theta )$$ with the subspace dimensions determined via automatic model selection, where $$\varepsilon_t(\theta_j)$$ and $$q(\ell|x_t, \theta)$$ are extended as follows: $\tag{19} \begin{array}{l} \varepsilon_t(\theta_j)= e(x_t,y_{j,t},\, \theta_j)+0.5Tr[\Gamma_j^{y|x}\Pi_j^y], \ y_{j,t}=arg\min_y \ e(x_t,y,\, \theta_j), \\ e(x_t,y,\, \theta_j)=-\ln{[\alpha_j G(x_t|A_jy+\mu_j,\Sigma_j)G(y|0, \Lambda_j)G(z|W_j^z\xi+c_j,\Gamma_j^z)]}, \\ \theta_j=\{\alpha_j, A_j, \mu_j,\Sigma_j, \Lambda_j, W_j^z, c_j,\Gamma_j^z\}, \\ \Pi_j^y=\nabla_y^2 e(x_t,y,\theta_j), \ \Gamma_j^{y|x}=\varepsilon_{j,t}\varepsilon_{j,t}^T, \ \varepsilon_{j,t}=W_jx_t+w_j-y_{j,t}, \\ q(\ell|x_t, \theta) = { e^{-e(x_t,y_{\ell,t},\ \theta_{\ell})+0.5\ln{|\Pi_{\ell}^{\ y}|}+0.5m_{\ell}\ln{(2\pi)}} \over \sum_{j \in \mathcal{C}_{\kappa}(x_t)} e^{-e(x_t,y_{j,t},\ \theta_j)+0.5\ln{|\Pi_{j}^{\ y}|}+0.5m_{j}\ln{(2\pi)}}}, \end{array}$ based on which we piecewise-linearly implement the mapping $$x \to z$$ when $$\xi=z$$ by $$z=W_j^zx+c_j$$, and a cascading mapping $$y \to z$$ when $$\xi=y$$ by $$z=W_j^zy +c_j$$, which follows the mapping $$x_t \to y_t=W_jx_t+w_j$$ that is also obtained during learning.

We may still use the Ying Yang alternation-in-batch by eq.(14), additionally including "getting $$y_{j,t}$$ in eq.(19) for all the samples in $$\{x_t\}$$". Moreover, $$\nabla_{\theta_{j}}H(p \Vert q, \theta )$$ in eq.(11) is modified into $\tag{20} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=\sum_{t=1}^{N}\delta_{j\in \mathcal{C}_{\kappa}(x_t)} \nabla_{\theta_{j}}H_t(p \Vert q, \theta ),\\ \nabla_{\theta_{j}}H_t(p \Vert q, \theta )= p_{j, t}\nabla_{\theta_{j}} L(x_t,\theta_{j})-0.5 p(j|x_t, \theta^{old}) \nabla_{\theta_{j}} Tr[\Gamma_j^{y|x}\Pi_j^y] -0.5 \Delta_{j,t}(\theta^{old})\nabla_{\theta_{j}} |\Pi_j^y | \end{array}$

Readers are referred to Fig. 6 and Fig. 7 of (Xu, 2009), where adaptive learning algorithms of RPCL and BYY harmony learning are summarized in a unified format for learning the above SBF functions and their variants with binary $$y$$, with data smoothing regularization also taken into consideration.

## Relations to reinforcement learning

For reinforcement learning, there are a set $$S$$ of states and a set $$A$$ of actions. There is also a policy that chooses an action $$a_t \in A$$ at a state $$s_t$$. The action $$a_t$$ makes the learner move to a new state $$s_{t+1}$$. Associated with the transition $$(s_t, a_t, s_{t+1})$$, there is a scalar immediate reward $$r_{t+1}$$. The goal is to maximize $$\sum_{t=1}^{\infty}\gamma^t r_{t+1}$$ by determining a sequence of actions $$a_1, \cdots, a_t$$. One popular algorithm is called Q-learning, featured by learning a table $$Q(s_t, a_t)$$ as follows: $\tag{21} \begin{array}{l} Q^{new}(s_t, a_t)=(1-\eta_t) Q^{old}(s_t, a_t) + \eta_t[r_{t+1}+\gamma\max_a Q^{old}(s_{t+1}, a)], \end{array}$ where $$\eta_t >0$$ is a learning step size and $$0<\gamma<1$$ is a discount factor. At a state $$s_t$$, an action $$a_t \in A$$ is selected by the following winner-take-all (WTA) competition $\tag{22} \begin{array}{l} a_t=arg\max_a Q^{old}(s_t, a). \end{array}$
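A tabular sketch of this pair of steps, with the Q-table stored as a dict of dicts (names hypothetical); the update target uses the next state $$s_{t+1}$$, as in standard Q-learning:

```python
def q_learning_step(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- (1-eta) Q(s,a) + eta [r + gamma max_a' Q(s_next, a')]."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - eta) * Q[s][a] + eta * target
    return Q[s][a]

def wta_action(Q, s):
    """WTA action selection of eq.(22): a_t = argmax_a Q(s_t, a)."""
    return max(Q[s], key=Q[s].get)
```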

Noticing that the action $$a$$ takes a role similar to $$j$$ and $$\varepsilon_t(\theta_j)=-\ln{Q^{old}(s_t, j)}=-\ln{Q^{old}(s_t, a)}$$, the above WTA competition is equivalent to giving $$p_{ j, t}$$ by eq.(1), and can easily be extended to the ones by eq.(4), eq.(6), and eqs.(10)&(11). Moreover, we can modify eq.(21) into $\tag{23} \begin{array}{l} Q^{new}(s_t, j_t)=(1-\eta_t) Q^{old}(s_t, j_t)+ \eta_tp_{ j, t}[r_{t+1}+\gamma Q^{old}(s_t, j_t)], \forall j_t \in A_{j^*_t}^{\kappa}, \ j^*_t=arg\max_{j}Q^{old}(s_t, j), \end{array}$ where $$A_{j^*_t}^{\kappa}$$ consists of the indices $$j$$ that correspond to the first $$\kappa$$ largest values of $$Q^{old}(s_t, j)$$. From this equation, we can obtain improved variants of reinforcement learning with the help of rival penalized competitive learning, competitive divide-conquer learning, and BYY harmony learning, respectively.
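A sketch of the modified update of eq.(23): the $$\kappa$$ actions with the largest $$Q^{old}(s_t,\cdot)$$ are each updated with an allocation weight $$p_{j,t}$$, e.g., $$+1$$ for the winner and a small negative value for the rival, in the RPCL spirit. The mapping `p` from action to weight and the function name are assumptions for illustration.

```python
def rpcl_q_step(Q, s, r, p, kappa=2, eta=0.1, gamma=0.9):
    """Update the kappa top-valued actions at state s as in eq.(23):
    Q(s,j) <- (1-eta) Q(s,j) + eta p[j] [r + gamma Q(s,j)],
    where p maps an action to its allocation weight p_{j,t}
    (actions missing from p get weight 0, i.e. no effective target)."""
    top = sorted(Q[s], key=Q[s].get, reverse=True)[:kappa]
    for j in top:                         # winner and its kappa-1 nearest rivals
        target = r + gamma * Q[s][j]
        Q[s][j] = (1 - eta) * Q[s][j] + eta * p.get(j, 0.0) * target
```

With $$\kappa=2$$ and a negative weight for the runner-up, the rival's value is penalized while the winner's is reinforced, mirroring the RPCL mechanism in the competitive-learning setting.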

On the other hand, it follows from eq.(23) that we may also implement the BYY harmony learning by eq.(10) with the help of a modified Q-learning. Let $$\theta$$ be quantized into a set $$S$$ of discrete values, with each value as a state, and let the values of $$\ell$$ denote different actions. Given a set $$X_{t}=\{x_1, \cdots, x_{t}\}$$ of samples, we consider $\tag{24} \begin{array}{l} Q(s_t, \ell)=H(\theta_{\ell} )=\sum_{x_t\in X_t} p_{ \ell, t} L(x_t, s_t, \ell), \end{array}$ with $$p_{ \ell, t}=p(\ell|x_t)$$ and $$r_{t}=L(x_t, s_t, \ell)$$. Then, eq.(23) can be interpreted as an incremental updating formula for the discrete value $$Q(s_t, \ell)=H(\theta_{\ell} )$$ upon receiving a new sample $$x_{t+1}$$, while a transition $$(s_t, \ell_t, s_{t+1})$$ or $$T(s_t, \ell_t)= s_{t+1}$$ corresponds to updating $$\theta^{old}$$ into $$\theta^{new}$$ along an ascent direction of $$H(p \Vert q, \theta )$$.