Rival penalized competitive learning

Figure 6: RPCL and automatic allocation.

The way of reducing the winning rates works when the number \(k\) of clusters is known. Unfortunately, \(k\) is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when \(k\) is given a smaller number, but also FSCL fails when \(k\) is pre-assigned to a value larger than an appropriate one. As illustrated in Figure 5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine \(k\) is a critical difficulty too.

One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990s for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992, 1993). As illustrated in Figure 6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample \(x\) comes, \(m_1\) is the winner and \( m_2\) is its rival in Figure 6(a), \(m_1\) is moved towards \(x\) by a small step size, while \(m_2\) is repelled away from \(x\) by a much smaller step size. Similarly in Figure 6(b), \(m_3\) is the winner and moved towards \(x\ ,\) while \(m_2\) is its rival and repelled away from \(x\ .\)

Figure 7: Rival penalized mechanism makes extra agents driven far away(double click it to see animation).

More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by \[\tag{4} p_{ j, t}= \begin{cases} 1, & \mbox{if} \, j=c, \\ -\gamma, & \mbox{if}\, j= r, \\ 0, & \mbox{otherwise}, \end{cases} \begin{cases} c=arg \ min_{ j} \varepsilon_t(\theta_j), \\ r=arg \ min_{ j \ne c } \varepsilon_t(\theta_j),\end{cases} \] where \( \gamma\) approximately takes a number between \(0.05\) and \(0.1\) for controlling the penalizing strength.

As illustrated in Figure 6(c), \(m_1\) and \( m_3\) finally converge to clusters' centers, while \( m_2\) is driven far away from samples. Shown in Figure 7 is an experimental result that improves its FSCL counterpart in Figure 5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable nature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In past decades, a large number of applications of RPCL have been made (Acciani, et al., 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).

A general framework for competitive divide-conquer learning

CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by one learning agent such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the basic ingredients of this general framework.

Figure 8: (a) Ellipse fitting, (b) Gaussian distribution, (c) a two step iterative learning

(a) Agent's structure \(\quad\) Each agent needs a hardware \(M_j\) that is able to handle one specific problem solving task. This \(M_j\) is usually featured by a pre-designed structure with a set \(\theta_j\) of parameters to be determined. The simplest example is that \(M_j\) is given by a point \(\theta_j=m_j\ .\) In general, \(M_j\) models or fits its environment, from which we get \(x_t(\theta_j) \) as a reconstruction of \(x_t\) by \(M_j\ .\) Beyond a point \(m_j\ ,\) \(M_j\) can also be a curve or shape as shown in Figure 8(a) for objection detection in a noise situation (Xu, 2002; Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally, \(M_j(\theta_j)\) can be other parametric structure too.

(b) Individual criterion\(\quad\) Each agent has an individual value \(\varepsilon_t(\theta_j) \) to evaluate how good this \(x_t(\theta_j) \) is reconstructed by \(M_j.\) Considering the error \(\ e_t (\theta_j )=x_t- x_t(\theta_j)\ ,\) it can be measured by one of the following two related choices:

The errors are directly measured by certain norm, e.g., in a \(L_2\) norm \(\varepsilon_t(\theta_j)=\Vert e_t (\theta_j )\Vert^2\ .\) For a simple case \(\!\theta_j=m_j\ ,\) we have \(\!x_t(\theta_j)=m_j\) and \( \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2\ .\)

Observing that a larger error occurs with a smaller chance, an error may be alternatively measured by its probability distribution, e.g., \( x - m_j \) is measured by a Gaussian as shown in Figure 8(b). In general, we have \(\!q(\varepsilon(\theta_j)| \theta_j) \) or \(q(x | x (\theta_j), \theta_j) \ ,\) which is further concisely rewritten into \( q(x |\theta_j) \ .\) Plus a priori \( \alpha_j \) for the chance of the corresponding agent to be involved, we get the following probabilistic distribution based measure:

\[\tag{5} \begin{array}{l} \varepsilon_t(\theta_j)=-\begin{cases} \ln{[\alpha_j q(x_t|\theta_j)]}, & (a)\, Type\, A, \\ \ln{ q(x_t|\theta_j)^{\alpha_j} } -\ln \mathcal C( \theta), & (b)\, Type\, B. \end{cases}\\ C(\theta)= \int \prod_j q(x|\theta_j)^{\alpha_j}dx\approx \sum_t \prod_j{ q(x_t|\theta_j)^{\alpha_j}}. \end{array} \] Type B with \(q(x| \theta_j)=G(x|m_j, \Sigma_j)\) can be regarded as a further extension of eq.(3).

(c) Global principle\(\quad\) Agents are coordinated globally to combine individual criteria \(\{ \varepsilon_t(\theta_j)\}_{j=1}^k\ ,\) in one of the following two manners:

One is operational based, e.g., the simple ones by eq.(1) or eq.(4). Generally, we can denote an allocation scheme as follows:

\[\tag{6} p_{ j, t}= ALOC[\{{\varepsilon_t(\theta_j) }\}_{j=1}^k] \]

The other is evaluation based, by a principle on the overall performance jointly by all the agents.

(d) Updating rule\(\quad\) As a further extension of eq.(2), agents are modified to adapt \(x_t\) according to the allocation by \( p_{ j, t}\delta \theta_j\ ,\) with \[\tag{7} \delta \theta_j=Reduce(\varepsilon_t(\theta_j^{old})),\] where \(Reduce(\varepsilon_t(\theta_j)) \) denotes an operator that acts on \(\theta_j^{old}\) and \(x_t\) to yield a direction along which \(\varepsilon_t(\theta_j) \) reduces within a neighborhood region of \(\theta_j^{old}\ .\) When \( \varepsilon_t(\theta_j) \) is differentiable, \(Reduce(\varepsilon_t(\theta_j)) \) can be given by \(-\nabla_{ \theta_j}\varepsilon_t(\theta_j) \) or \(-{\mathcal P}\nabla_{ \theta_j}\varepsilon_t(\theta_j) \) on \(D(\theta_j) \ ,\) where \(-{\mathcal P}\nabla_{ \theta_j}\varepsilon_t(\theta_j) \) has a positive projection on \(-\nabla_{ \theta_j}\varepsilon_t(\theta_j) \ .\)

(e) A-I/(Yang-Ying) alternation \(\quad\) As shown in Figure 8(c), a conquering action is performed via an updating rule by eq.(7) to improve the fit of agents to samples, while a dividing action is performed via an allocation scheme by eq.(6). The two steps A-I, also called Yang-Ying steps from the perspective of Bayesian Ying Yang learning (see the last two subsections of this paper), are iteratively implemented. To get a useful iteration, the following three issues should be considered:

The updating by Step I/Ying needs an appropriate control of a small enough learning stepsize.
During the iteration, we should keep \(\begin{array}{l}\theta_j^{new}\in D(\theta_j) \end{array}\) that usually consists of three types of constraints. One is \(\begin{array}{l}\alpha_j>0, \ \sum_j \alpha_j=1 \end{array}\ ,\) the second is the covariance matrix \(\Sigma_j\) that should be positive definite. One typical way is transforming them to be function of parameters that are free of constraints, e.g., \(\begin{array}{l}\alpha_j=exp(c_j)/\sum_j exp(c_j)\end{array}\) and \(\begin{array}{l}\Sigma_j=S_jS_j^T\end{array}\) (Xu, 2001). Moreover, there is also a constraint that a matrix locates within the Stiefel manifold \(\begin{array}{l}UU^T=I\end{array}\ ,\) for which one way bases on projection (Xu, 2002) while the other bases on differential geometry (Edelman, Arias,& Smith, 1998).
In addition to estimating unknown parameters \(\begin{array}{l}\{ \theta_j \}_{j=1}^k\end{array}\ ,\) it also needs to select an appropriate number \(k\) of agents, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved, which highly depends on \(p_{ j, t}\ .\) In the literature of statistics and machine learning, the task of determining the number \(k\) belongs to what is usually called model selection.

Two stage, stepwise, and automatic model selection

In general, model selection refers to select an appropriate one among a family of infinite many candidate structures \(\{S_{\mathbf k}( \theta)\}\) with each \(S_{\mathbf k}\) in a same configuration but in different scales, each of which is labeled by a scale parameter \({\mathbf k} \) in term of one integer or a set of integers. Selecting an appropriate \({\mathbf k} \) means getting a structure consisting of an appropriate number of free parameters. At a given \({\mathbf k} \ ,\) we expect that \(S_{\mathbf k} \) best fits or represents a given set \(\Chi_N \) of samples with a finite size \(N \) and determine \(\theta_{\mathbf k} \) of all the unknown parameters within \(S_{\mathbf k}\ ,\) such that the error of using \(S_{\mathbf k}(\theta^*_{\mathbf k} )\) to represent \(\Chi_N \) becomes the least. For our problem discussed in previous section, we have \(S_{\mathbf k}=\{M_{j}\}_{j=1}^k\) and \(\theta_{\mathbf k}=\{\theta_{j}\}_{j=1}^k\ ,\) while \({\mathbf k} \) is simply the number \({\mathbf k} \ .\)

For a best fitting principle (e.g., a maximum likelihood (ML) principle ), its fitting error monotonically decreases as \({\mathbf k}\) grows up, until the error reaches zero at \({\mathbf k}={\mathbf k}_N\ ,\) a value that relates to \(N \) and is usually much bigger than an appropriate \({\mathbf k}^*\ .\) In this case, a large part of structure has actually been used to fit noises or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples with a same underlying regularity. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate \(k^* \ .\) Typically, this problem is tackled along one of the following three directions.

(a) Two stage implementation \(\quad\) First, enumerate a candidate set of \({\mathbf k} \) and its corresponding unknown set \(\theta_{\mathbf k} \) of parameters by ML learning for a solution \(\theta^*_{\mathbf k} \) at each \({\mathbf k}.\) Second, use a model selection criterion \(J(\theta^*_{\mathbf k})\) to select a best \({\mathbf k}^*.\) In the literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), Minimum description length (MDL) (Rissanen, 1986, 2007), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, these criteria usually provides a rough estimate that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as \({\mathbf k} \) increases, which makes the value of \(J(\theta^*_{\mathbf k})\) evaluated unreliably.

(b) Stepwise model selection either incrementally or decrementally \(\quad\) Incremental algorithms attempt to keep as much as possible what already learned as \({\mathbf k} \) increases step by step, focusing on learning newly added parameters (Ghahramani & Beal, 2000; Salah and & Alpaydin, 2004). Such an incremental implementation can save computing costs, but usually leads to suboptimal performance because both those newly added parameters and the old parameter set \(\theta_{\mathbf k}\) need to be re-learned while the old parameter set \(\theta_{\mathbf k}\) actually has been not updated. This suboptimal problem may be lessened by a decremental implementation that starts with \({\mathbf k} \) at a large value and decreases \({\mathbf k} \) step by step. At each step, one takes a subset out of \(\theta_{\mathbf k}\) to consider. Such a procedure can be formulated as a tree searching. The initial parameter set \(\theta_{\mathbf k} \) is the root of the tree, and discarding one subset moves to one immediate descendant. A depth-first searching suffers from a suboptimal performance seriously, while a breadth-first searching suffers a huge combinatorical computing cost. Usually, a trade off between the two extremes is adopted.

(c) Automatic model selection \(\quad\) One early exemplar is the effort on RPCL that is featured by driving an extra agent far away from data or equivalently pushing the proportion \(\alpha_j\) that samples are allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu,1995, 2002) and systematically developed, which leads to not only new model selection criteria but also algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).

In general, this automatic model selection demonstrates a quite difference nature from the usual incremental or decremental implementation that bases on evaluating the change \(J(\theta^*_{\mathbf k})- J(\theta^*_{\mathbf k}\cup \theta_{\delta})\) as a subset \( \theta_{\delta}\) of parameters is added or removed. Instead, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:

There is an indicator \(\rho(\theta_e)\) associated with a subset \(\theta_e \in \theta_{\mathbf k}\), indicating a particular structural component that is effectively discarded if either or both of its corresponding \(\rho(\theta_e)=0\), e.g., a Gaussian component in Fig. 8(a) is discarded if either or both of its corresponding \(\alpha_l=0\) and \(Tr[\Sigma_l]=0\).

In implementation of this learning algorithm or optimizing this learning principle, there is an mechanism that automatically drives \(\rho(\theta_e)\to 0\), if the corresponding structure is redundant and thus can be effectively discarded.

Readers are referred to page 287 of a recent overview (Xu, 2010), and Sect.2.2.2 of a recent tutorial (Xu, 2012) for further details about automatic model selection.

Automatic model selection from coordinated opposite mechanisms

Figure 9: (a) coordinated mechanisms, (b) rewarding vs penalizing via variances.

As illustrated in Figure 9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.

As mentioned early, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate. For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. However, it fails when the number of agents in consideration is larger than an appropriate one, due to lacking of a mechanism for an incapable or extra agent to leave.

As illustrated in Figure 9(a), it is an appropriate balance between such a leaving mechanism and the participating mechanism that leads to automatic model selection, and such a balance may come from one or more of the following different aspects.

(a) Allocating policy \(\quad\) RPCL improves FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism is controlled by the de-learning rate \(\gamma\) in eq.(4).

(b) Local criterion \(\quad\) We consider Type A in eq.(5) with a special Gaussian \(p(x| \theta_j)=G(x|m_j, \sigma_j^2I)\ ,\) while Type B can be analyzed similarly. It follows that \(\varepsilon_t(\theta_j)=-\ln{\alpha_j }+ 0.5 [{\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}+d\ln{(2\pi \sigma_j^2)}] \) varies with \(\sigma_j^{2}\) as shown in Figure 9(b), where \(d\) is the dimension of \(x\ .\) If the \(j\)-th agent wins a lot of samples, \(\sigma_j^{2}\) will become larger than a threshold, and not only \(\varepsilon_t(\theta_j)\) will increase but also the gap between different agents reduces. In other words, the competing ability of an agent that won too much in past is self-penalized and the competing ability of a weaker agent is compensated for a relative boosting, which effectively provides a participating mechanism similar to \(\alpha_j\) in eq.(3). This similarity can also be observed from the term \(0.5 {\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}\) in which \(\sigma_j^{-2}\propto \alpha_j\) takes a role similar to \(\alpha_j\) in eq.(3). On the other hand, when the \(j\)-th agent won too fewer samples and thus its \(\sigma_j^{2}\) drops below certain threshold, not only \(\varepsilon_t(\theta_j)\) will increase but also the corresponding gap between different agents increases. Therefore, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the \(j\)-th agent decreases, its \(\alpha_j\) decreases and the other \(\alpha_i\) will relatively increase. That is, the term \(-\ln{\alpha_j }\) effectively enhances this self-penalizing featured leaving mechanism. Readers are referred to Sect.3.3.1 of Ref (Xu, 2007) for further details.

(c) Bayes priories \(\quad\) In Bayesian approaches, the role of a prior is basically equivalent to adding a term \(-\ln{q(\theta_j)}\) into the error \( \varepsilon_t(\theta_j)\), that is, \[\tag{8} \begin{array}{l} L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}, \end{array} \] where the minimization of \(-\ln{q(\theta_j)}\) drives those redundant structures discarded. Readers are referred to a systematic comparative investigation(Shi, Tu, & Xu, 2011) of three typical Bayesian approaches on the case of Type A in eq.(5) with \( q(x|\theta_j)= G(x|m_j,\Sigma_j),\) namely, a Gaussian mixture as follows \[\tag{9} \begin{array}{l} q(x|\phi)=\sum_{j=1}^k \alpha_j G(x|m_j,\Sigma_j). \end{array} \] The three Bayesian approaches are the minimum message length (MML) (Wallace & Dowe, 1999; Figueiredo & Jain, 2002), variational Bayes (VB) (Corduneanu & Bishop, 2001), and BYY harmony learning, respectively. Without any priors, both VB and MML, as well as other existing Bayesian approaches, degenerate to the maximum likelihood learning and thus are poor in automatic model selection. However, BYY is still capable of automatic model selection, and its performance can be further improved when appropriate priors are incorporated.

Automatic model selection : BYY harmony learning, RPCL extensions, and alternation-in-batch

Bayesian Ying Yang learning provides a unified framework that coordinately accommodates the above three scenarios. Not only \(L(x_t,\theta_{\ell})\) by eq.(8) is used as one basic component with the last two scenarios covered naturally, but also BYY is capable of automatic model selection even without priors. Interestingly, RPCL learning may be regarded as a simplified and approximated implementation of the BYY harmony learning, from which we are actually lead to a family of RPCL extensions.

The BYY harmony learning is implemented by maximizing \[\tag{10} \begin{array}{l} H(p \Vert q, \theta )=\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell|x_t) L(x_t,\theta_{\ell}), \ L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}, \\ subject \ to \begin{cases} p(\ell |x_t) \le q(\ell |x_t, \theta), & j\in \mathcal{C}_{\kappa}(x_t), \\ p(\ell |x_t)=0, & otherwise. \end{cases}, \ \ \ \kappa \le k, \ \ q(\ell|x_t, \theta) = { e^{L(x_t,\theta_{\ell})} \over \sum_{j \in \mathcal{C}_{\kappa}(x_t)} e^{L(x_t,\theta_{j})}}, \\ \mathcal{C}_{\kappa}(x_t) \subset \{1,2,\dots, k\} \ consists \ of \ \ell \ indices \ that \ correspond \ the \ first \ \kappa \ largest \ values \ of \ L(x_t,\theta_{\ell}). \end{array} \] When \(\kappa=k\), \( q(\ell|x_t, \theta)\) is the Bayesian posteriori structure that corresponds to \(L(x_t,\theta_{\ell})\). Generally when \(\kappa<k\), this \( \mathcal{C}_{\kappa}(x_t)\) makes \( q(\ell|x_t, \theta)\) focus on an apex zone of \(L(x_t,\theta_{\ell})\). The structure of \( p(\ell|x_t)\) comes from \( q(\ell |x_t, \theta)\) subject to a principle of matched dynamic range, that is, \( p(\ell |x_t) \le q(\ell |x_t, \theta)\) (see the page for more details).

Maximizing \( H(p \Vert q, \theta ) \) with respect to \( p(\ell|x_t)\) and then solving the gradient flow with respect to \(\theta_{j}\), we get \[\tag{11} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=\sum_{t=1}^{N} \delta_{j\in \mathcal{C}_{\kappa}(x_t)} p_{j, t}\nabla_{\theta_{j}} L(x_t,\theta_{j}), \ p_{j, t}= p(\ell|x_t, \theta^{old})[1+ \Delta_{j,t}(\theta^{old})], \\ \Delta_{j,t}(\theta)= L(x_t,\theta_{j})- \sum_{\ell \in \mathcal{C}_{\kappa}(x_t) } p(\ell|x_t, \theta) L(x_t,\theta_{\ell}), \\ p(\ell |x_t) =\begin{cases} q(\ell |x_t, \theta), & j\in \mathcal{C}_{\kappa}(x_t), \\ 0, & otherwise. \end{cases}, \ \ \delta_{j\in \mathcal{C}_{\kappa}(x_t)}=\begin{cases} 1, & j\in \mathcal{C}_{\kappa}(x_t), \\ 0, & otherwise. \end{cases} \end{array} \]

This flow becomes same as the one for the ML learning if we let \(\Delta_{\ell,t}=0.\) It is this relative fitness \(\Delta_{\ell,t}\) that the BYY harmony learning differs from the ML learning. If \(\Delta_{\ell,t}>0,\) updating goes along the same direction of the ML learning but with an increased strength. If \(0>\Delta_{\ell,t}> - 1 \), updating still goes along the ML learning direction but with a reduced strength. When \(\Delta_{\ell,t}<-1\) updating reverses to the opposite direction, i.e., becoming de-learning, which becomes similar to RPCL. Therefore, a global level maximization of \( H(p \Vert q, \theta ) \) actually induces a local balance of leaving mechanism and participating mechanism, similar to ones discussed in Figure 9.

Actually, \( p_{\ell, t} \) in eq.(11) provides a new allocation scheme, which leads to a family of RPCL extensions by implementing the alternation in Figure 8(c), simply with \(\delta \theta_j=\nabla_{\theta_{j}}L(x_t,\theta_{j})\). Further insight can be obtained from the following three aspects:

(a) RPCL approximation and further improvements \(\quad\) In the special case \(\kappa=2\), we observe that \[\tag{12} \begin{array}{l} \ell^c_t=arg\max_{\ell} L(x_t, \theta_{\ell}), \ \ \ \ell^r_t=arg\max_{\ell \ne \ell^c_t} L(x_t, \theta_{\ell}), \\ \Delta_{\ell,t}=\begin{cases} \eta_t, & \ell =\ell^c_t, \\ -\eta_t, & \ell =\ell^r_t, \end{cases} \ \ \eta_t=p(\ell^c_t|x_t, \theta^{old})p(\ell^r_t|x_t, \theta^{old})[L(x_t, \theta_{\ell^c_t})- L(x_t, \theta_{\ell^r_t})] \ge 0, \end{array} \] from which we have \( p(\ell^c_t|x_t, \theta^{old})(1+ \Delta_{\ell^c_t,t})\ge 0 \) for the winner while \( p(\ell^r_t|x_t, \theta^{old})(1+ \Delta_{\ell^r_t,t}) <0 \) when \( \eta_t>1 \) for the rival. This situation is exactly the same as RPCL. Thus, RPCL can be explained as a bottom level simplified approximation of the BYY harmony learning, e.g., one simple way is modifying \( L(x_t,\theta_{\ell^c_t})\) in eq.(8) by adding a large enough positive constant. Being different from RPCL, it is also possible to have \(1 \ge \eta_t >0\) for which there will be still a weak learning but no de-learning. That is, penalizing is not unconditional for every rival, and this conditional penalizing actually provides an improvement over RPCL. Also, here we have avoided the difficulty of specifying an appropriate de-learning strength \(\gamma\) that is required by RPCL in eq.(4).

(b) A direct extension to supervised learning \(\quad\) The trick of adding a large enough positive constant to one of \( L(x_t,\theta_{\ell})\) in eq.(8) will make the above discussed RPCL become directly applicable to supervised learning, i.e., given each pairing \(\{x_t, \ell_t\}\). Simply, we modify eq.(8) into

\[\tag{13} \begin{array}{l} L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}+C\delta_{\ell, \ell^*}, \ C >0 \ is \ a \ large \ enough \ constant,\\ \delta_{\ell, \ell^*}=\begin{cases} 1 & \ell= \ell^*, \\ 0, & otherwise. \end{cases} \\ \ell^*=\begin{cases} \ell_t, &(a) \ given \ a \ pairing \ \{x_t, \ell_t\}, \\arg\max_{\ell} [-\varepsilon_t(\theta_{\ell})+\ln{q(\theta_{\ell})}], & (b) \ \ given \ only \ \{x_t\}. \end{cases} \end{array} \] Actually, it covers the cases of (a) supervised learning with every pairing \(\{x_t, \ell_t\}\) given, (b) unsupervised learning with only \(\{x_t\}\) available, and (c) a mixed situation of (a) and (b).

(c) A spectrum from WTA CL to BYY harmony learning \(\quad\) At one extreme \(\kappa =1\), we are lead to the classical competitive learning similar to eq.(1). At the other extreme \(\kappa =k\), we are lead to the standard BYY harmony learning. Generally, as \(\kappa\) varies from \(\kappa =1\) to \(\kappa =k\), we get a spectrum of variants and extensions of RPCL.

In addition to implementing the alternation in Figure 8(c) adaptively, it follows from eq.(11) that we may also make the following Ying Yang alternation-in-batch

A step: get \(p_{j, t}\) in eq.(11) for all the samples in \(\{x_t\}\);

I step: update \(\theta_{\ell}^{new}=(1-\gamma)\theta_{\ell}^{old}+\gamma\theta_{\ell}^*\) for every \(\ell\) with an appropriate learning stepsize \(\gamma\), where \(\theta^*=\{\theta_{\ell}^* \}\) is the root of the following equations:

\[\tag{14} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=0, \forall j. \end{array} \]

Typically, taking Gaussian mixture by eq.(9) as an example, \(\theta^*=\{\theta_{\ell}^* \}\) is obtained simply in a format same as the M-step of the conventional EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996). That is, we have \[\tag{15} \begin{array}{l} \alpha_{j}^*={1 \over N}\sum_{ t=1}^N p_{j, t}, \ m_{j}^*={1 \over N\alpha_{j}^*} \sum_{ t=1}^N p_{j, t}x_t, \ \Sigma_{j}^*={1 \over N\alpha_{j}^*} \sum_{ t=1}^N p_{j, t}( x_t- m^{old}_{j})( x_t- m^{old}_{j})^T]. \end{array} \] Further details about this topic are referred to the page.

Extensions to supervised learning: mixture-of-experts, RBF nets, and subspace based functions

In the original papers (Xu, Krzyzak, & Oja, 1992, 1993), one application of RPCL was for learning normalized radial basis function (RBF) networks in a two stage implementation. First, a clustering algorithm (e.g., k-means) in (Moody, & Darken, 1989) is replaced by RPCL with the number of clusters determined automatically. Second, the output layer is estimated by a least square linear fitting. Subsequently, it has been shown that a normalized RBF networks can be regarded as a simplified case of the following alternative mixture of experts (ME): \[\tag{16} \begin{array}{l} q(z |x, \theta)=\sum_{j=1}^k p(j|x, \phi)G(z|f_j(x,\phi_j), \Gamma_j), \ p(j|x, \phi)={ \alpha_j G(x|m_j,\Sigma_j)/ q(x|\phi)}, \end{array} \] where \(q(x|\phi)\) is given by eq.(9). Its maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995).

For a regression task, typically we consider \(f(x|\phi_j)=w_j^T x +c_j\) with \(E[z|x]=\sum_{j=1}^k p(j|x, \phi) f_j(x,\phi_j)\) that implements a type of piecewise linear regression. It becomes an extended normalized radial basis function (RBF) networks at the following special case \[\tag{17} \begin{array}{l} E[z|x]=\sum_{j=1}^k { exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)] \over \sum_{r=1}^k exp[-0.5(x-m_r)^T\Sigma_r^{-1} (x-m_r)]}f_j(x,\phi_j) \ when \ \alpha_j={|\Sigma_j|/\sum_{i=1}^k |\Sigma_i|}, \end{array} \] and further degenerates to a normalized RBF networks (Moody, & Darken, 1989) simply with \(w_j=0\ .\) As a result, the conventional two stage but suboptimal implementation of a normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Furthermore, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 2001).

These learning algorithms can also be summarized under the general divide-conquer learning framework shown in Figure 8, with the Type A case of eq.(5) extended into : \[\tag{18} \begin{array}{l} \varepsilon_t(\theta_j)=- \ln{[\alpha_j G(x|m_j,\Sigma_j)G(z|f_j(x,\phi_j), \Gamma_j)]}, \ \mbox{for the alternative ME,} \end{array} \] and the learning algorithms introduced in the previous section directly apply. Readers are further referred to Fig.2 and Fig. 3 of (Xu, 2009), where adaptive learning algorithms of EM, RPCL, and BYY harmony learning are summarized in a unified format for learning the alternative ME and RBF networks.

Figure 10: Subspace based function

For a finite size of training samples distributed in small dimensional subspaces, considering each \( G(x|m_j,\Sigma_j)\) can not well capture these subspace structures. Instead, we regard samples as generated from each subspace with independent factors distributed along each coordinate of a lower dimensional subspace.

In implementation, we may simply let \( G(x|m_j,\Sigma_j)\) to be replaced by \( G(x|m_j,A_j\Lambda_jA_j+\Sigma_j)\) in both eq.(16) and eq.(18), with basis functions on lower dimensional subspace spanned by the columns of \(A_j.\) The weak point is that automatic model selection is not applicable to the dimensions of these subspaces, which need to be specified in advance or obtained via two stage implementation of model selection.

Instead, we may consider the subspace based functions (SBF) as shown in Figure 10. Being different from eq.(18), the BYY harmony learning is made by the maximization of \(H(p \Vert q, \theta )\) with their dimensions determined via automatic model selection, where \(\varepsilon_t(\theta_j)\) and \(q(\ell|x_t, \theta) \) are extended as follows : \[\tag{19} \begin{array}{l} \varepsilon_t(\theta_j)=- e(x_t,y_{j,t},\, \theta_j)+0.5Tr[\Gamma_j^{y|x}\Pi_j^y], \ y_{j,t}=arg\min_y \ e(x_t,y,\, \theta_j), \\ e(x_t,y,\, \theta_j)=-\ln{[\alpha_j G(x_t|A_jy+\mu_j,\Sigma_j)G(y|0, \Lambda_j)G(z|W_j^z\xi+c_j,\Gamma_j^z)]}, \\ \theta_j=\{\alpha_j, A_j, \mu_j,\Sigma_j, \Lambda_j, W_j^z, c_j,\Gamma_j^z)\}, \\ \Pi_j^y=-\nabla_y^2 e(x_t,y,\theta_j), \ \Gamma_j^{y|x}=\varepsilon_{j,t}\varepsilon_{j,t}^T, \ \varepsilon_{j,t}=W_jx_t+w_j-y_{j,t}, \\ q(\ell|x_t, \theta) = { e^{-e(x_t,y_{j,t},\ \theta_{\ell})+0.5\ln{|\Pi_{\ell}^{\ y}|}+0.5m_{\ell}\ln{(2\pi)}} \over \sum_{j \in \mathcal{C}_{\kappa}(x_t)} e^{-e(x_t,y_{j,t},\ \theta_j)+0.5\ln{|\Pi_{j}^{\ y}|}+0.5m_{j}\ln{(2\pi)}}}, \end{array} \] based on which we piecewise-linearly implement the mapping \(x \to z\) when \(\xi=z\) by \(z=W_j^zx+c_j\), and a cascading mapping \(y \to z\) when \(\xi=y\) by \(z=W_j^zy +c_j\) that follows the mapping \(x_t \to y_t=W_jx_t+w_j,\) which is also obtained during learning.

We may still use the Ying Yang alternation-in-batch by eq.(14), additionally including "getting \(y_{j,t}\) in eq.(19) for all the samples in \(\{x_t\}\)". Moreover, \( \nabla_{\theta_{j}}H(p \Vert q, \theta )\) in eq.(11) is modified into \[\tag{20} \begin{array}{l} \nabla_{\theta_{j}}H(p \Vert q, \theta )=\sum_{t=1}^{N}\delta_{j\in \mathcal{C}_{\kappa}(x_t)} \nabla_{\theta_{j}}H_t(p \Vert q, \theta ),\\ \nabla_{\theta_{j}}H_t(p \Vert q, \theta )= p_{j, t}\nabla_{\theta_{j}} L(x_t,\theta_{j})-0.5 p(j|x_t, \theta^{old}) \nabla_{\theta_{j}} Tr[\Gamma_j^{y|x}\Pi_j^y] -0.5 \Delta_{j,t}(\theta^{old})\nabla_{\theta_{j}} |\Pi_j^y | \end{array} \]

Readers are referred to Fig.6 and Fig. 7 of (Xu, 2009), where adaptive learning algorithms of RPCL and BYY harmony learning are summarized in a unified format for learning the above SBF functions and their variants with binary \(y\), as well as data smoothing regularization in consideration.

Relations to reinforcement learning

For reinforcement learning, there are a set \(S\) of states and a set \(A\) of actions. There is also a policy that chooses an action \(a_t \in A\) at a state \(s_t\). The action \(a_t\) makes the learner move to a new state \(s_{t+1}\). Associated with the transition \((s_t, a_t, s_{t+1})\), there is a scalar immediate reward \(r_{t+1}\). The goal is to maximize \(\sum_{t=1}^{\infty}\gamma^t r_{t+1}\) by determining a sequence of actions \(a_1, \cdots, a_t\). One popular algorithm is called Q-learning, featured by learning a table \(Q(s_t, a_t)\) as follows: \[\tag{21} \begin{array}{l} Q^{new}(s_t, a_t)=(1-\eta_t) Q^{old}(s_t, a_t) + \eta_t[r_{t+1}+\gamma\max_a Q^{old}(s_t, a)], \end{array} \] where \(\eta_t >0\) is a learning step size and \(\gamma >0\) is a weighting constant. At a state \(s_t\), an action \(a_t \in A\) is selected by the following winner-take-all (WTA) competition \[\tag{22} \begin{array}{l} a_t=arg\max_a Q^{old}(s_t, a). \end{array} \]

Noticing that the action \(a\) takes a role similar to \(j\) and \(\varepsilon_t(\theta_j)=-\ln{Q^{old}(s_t, j)}=-\ln{Q^{old}(s_t, a)}\), the above WTA competition is equivalent to giving \(p_{ j, t}\) by eq.(1), and can be easily extended to the ones by eq.(4), eq.(6), and eqs.(10)& (11). Moreover, we can modify eq.(21) into \[\tag{23} \begin{array}{l} Q^{new}(s_t, j_t)=(1-\eta_t) Q^{old}(s_t, j_t)+ \eta_tp_{ j, t}[r_{t+1}+\gamma Q^{old}(s_t, j_t)], \forall j_t \in A_{j^*_t}^{\kappa}, \ j^*_t=arg\max_{j}Q^{old}(s_t, j), \end{array} \] where \(A_{j^*_t}^{\kappa}\) consists of the indices of \(j\) that correspond the first \(\kappa\) largest values of \(Q^{old}(s_t, j)\). From this equation, we can get improved variants of reinforcement learning with help of rival penalized competitive learning, competitive divide-conquer learning, and BYY harmony learning, respectively.

On the other hand, it follows from eq.(23) that we may also implement the BYY harmony learning by eq.(10) with help of a modified Q-learning. Let \(\theta\) to be quantized into a set \(S\) of discrete values with each value as a state, and let the values of \(\ell\) to denote different actions. Given a set \(X_{t}=\{x_1, \cdots, x_{t}\}\) of samples, we consider \[\tag{24} \begin{array}{l} Q(s_t, \ell)=H(\theta_{\ell} )=\sum_{x_t\in X_t} p_{ \ell, t} L(x_t, s_t, \ell), \end{array} \] with \(p_{ \ell, t}=p(\ell|x_t)\) and \(r_{t}=L(x_t, s_t, \ell)\). Then, eq.(23) can be interpreted as an incremental updating formulae of the discrete value \(Q(s_t, \ell)=H(\theta_{\ell} )\) upon receiving a new sample \(x_{t+1}\), while a transition \((s_t, \ell_t, s_{t+1})\) or \(T(s_t, \ell_t)= s_{t+1}\) corresponds to an updating \(\theta^{old}\) into \(\theta^{new}\) along one ascent direction of \(H(p \Vert q, \theta )\).

References

Acciani, G., et al. (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.

Ahalt, SC., et al. (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.

Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.

Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53(3): 407-415.

Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.

Corduneanu A & Bishop C M. (2001), " Variational Bayesian model selection for mixture distributions", Proceedings of the 8th International Conference on Artificial Intelligence and Statistics, 2001, 27–34

Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.

Edelman, A, Arias, TA, & Smith, ST, (1998), "The geometry of algorithms with orthogonality constraints", SIAM J. Matrix Anal. Appl., 20:303-353.

Figueiredo M A F, Jain A K. (2002), " Unsupervised learning of finite mixture models ", IEEE Transactions on Pattern Analysis and Machine Intelligence , 24(3): 381–396

Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.

Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.

Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.

Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.

Jacobs, RA., et al. (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.

Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.

Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.

Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.

Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.

Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.

Nair, TM., et al., (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.

Nakkrasae, S, & Sophatsathit, P (2004), "An RPCL-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.

Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.

Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.

Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.

Rivals, I & Personnaz, L (1999) "On cross validation for model selection", Neural Computation 11: 863-870.

Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist. 22:400-407.

Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.

Salah, AA & Alpaydin, E. (2004), "Incremental mixtures of factor analyzers", Proc. the 17th International Conference on Pattern Recognition, 23-26 Aug. 2004, Cambridge, UK, Vol.1, 276-279.

Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.

Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.

Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.

Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.

Xu, L (2009), "Learning Algorithms for RBF Functions and Subspace Based Functions", Ch.3 in "Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques", eds. by Olivas, Guerrero, Sober, Benedito, \& Lopez, IGI Global publication, pp60-94.

Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition 40: 2129-2153.

Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.

Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.

Xu, L (1998a), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.

Xu, L (1998b), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.

Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.

Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.

Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.

Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.

Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.

Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.

Internal links

Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809.
Teuvo Kohonen and Timo Honkela (2007) Kohonen network. Scholarpedia, 2(1):1568.

External links

Author's website (http://www.cse.cuhk.edu.hk/~lxu/)

Rival penalized competitive learning

Contents

Classical competitive learning: from WTA to FSCL