Rival penalized competitive learning

From Scholarpedia

Lei Xu (2007), Scholarpedia, 2(8):1810. revision #58907 [link to/cite this article]

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong

Rival penalized competitive learning (RPCL) is a development of competitive learning in help of an appropriate balance between two opposite mechanisms (namely a participating mechanism and a leaving mechanism), such that an appropriate number of agents or learners will be allocated to learn multiple structures underlying observations. Moreover, RPCL has been further extended into a general divide-conquer learning framework for multi-agents to divide and conquer a complicated task.


Contents

Classical competitive learning: from WTA to FSCL

Figure 1: Competitive learning for clustering analysis and dead unit problem.
Enlarge
Figure 1: Competitive learning for clustering analysis and dead unit problem.

Proposed decades ago, classical competitive learning (CCL) is usually used for making data clustering by a number of units or agents. Each agent is featured by a parametric point \theta_j=m_j that moves among the observation samples and seeks to be located at the center of a cluster of samples, such that several points collectively perform clustering analysis. As a sample x_t comes, each m_j evaluates a degree of its suitability on representing x_t by an individual value or criterion \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2, and competes to be allocated to represent x_t. The competition is guided globally by the winner-take-all (WTA) policy

(1)
p_{ j, t}= \begin{cases} 1, & \mbox{if}\quad j=c, \\  0, & \mbox{otherwise;}\end{cases} \qquad c=arg \ min_{ j} \varepsilon_t(\theta_j).

Then, each m_j is updated to further reduce \varepsilon_t(\theta_j) by

(2)
m_j^{new}= m_j^{old}+ \eta p_{ j, t} (x_t- m_j^{old}).

When m_1, m_2 are initialized as in Fig.1(a), as a sample x comes, m_1 becomes the winner and then moves a little bit towards x. Next, as another sample x comes in Fig.1(b), m_2 becomes the winner and then moves similarly. Finally, m_1 and m_2 each converge to one clusters' center as in Fig.1(c).

However, if m_1, m_2 are initialized as in the top part of Fig.1(d), m_2 always becomes the winner and finally moves the middle point between two clusters, while m_1 never wins and becomes dead. This is known as the under-utilized or dead unit problem, featured by a WTA based competitive learning (Rumelhart et al, 1985; Grossberg, 1978; Hecht-Nielsen, 1987), which makes the task of clustering analysis badly performed. This problem comes from unequal chances for the dead agents to participate competition due to their unfavorable initial locations. Thus, it is desired that every agent should be initialized to have an equal chance to join competition, which is actually difficult to implement without a priori knowledge on the entire clustering structure, while such a knowledge is exactly the goal of this competitive learning.

One way to tackle this difficulty is a so called frequency sensitive competitive learning (FSCL) (Ahalt, et al, 1990), in help of modifying \varepsilon_t(\theta_j) into

Figure 2:   Samples of four clusters.
Enlarge
Figure 2: Samples of four clusters.
Figure 3:    Three dead units by CCL (double click it to see animation).
Enlarge
Figure 3: Three dead units by CCL (double click it to see animation).
(3)
\varepsilon_t(\theta_j)=\alpha_j \Vert x_t- m_j \Vert^2,

where \alpha_j is the frequency that the j-th agent won in past. Illustrated in Fig.2 is an example of four clusters, the CCL resulted in three dead units in Fig.3, while the FSCL makes all the four units placed to the cluster's centers instead of becoming dead, as shown in Fig.4.

Figure 4:   FSCL makes all the four units placed (double click it to see animation).
Enlarge
Figure 4: FSCL makes all the four units placed (double click it to see animation).
Rival penalized competitive learning
Enlarge
Figure 5: When k is pre-assigned to 5. the frequency sensitive mechanism also brings the extra one into data to disturb the correct locations of others (double click it to see animation).

Such a sprit of reducing the winning rates of frequent winners can also be implemented in alternative ways, e.g., conscience (Desieno, 1988), leaky learning (Rumelhart et al, 1985; Grossberg, 1978), convex bridge (Hecht-Nielsen, 1987), and neighbor sharing in Kohonen network (Kohonen,1982).


Rival penalized competitive learning

Figure 6:  RPCL and automatic allocation.
Enlarge
Figure 6: RPCL and automatic allocation.

The way of reducing the winning rates works when the number k of clusters is known. Unfortunately, it is usually unknown. Not only it is no doubt that both CCL and FSCL will result in bad performances when k is given a smaller number, but also FSCL fails when k is pre-assigned to a value larger than an appropriate one. As illustrated in Fig.5, the frequency sensitive mechanism will bring the extra agents into data to disturb the correct locations of other agents. Thus, how to determine k is a critical difficulty too.

One solution is provided under the name of Rival Penalized Competitive Learning (RPCL), proposed in the early 1990's for clustering analysis and detecting lines in an image (Xu, Krzyzak, & Oja, 1992 , 1993). As illustrated in Fig.6, not only the winner is learned but also its rival (i.e., the second winner) is repelled a little bit from the sample to reduce a duplicated information allocation. As a sample x comes, m_1 is the winner and m_2 is its rival in Fig.6(a), m_1 is moved towards x by a small step size, while m_2 is repelled away from x by a much smaller step size. Similarly in Fig.6(b), m_3 is the winner and moved towards x, while m_2 is its rival and repelled away from x.

Figure 7:  Rival penalized mechanism makes extra agents driven far away(double click it to see animation).
Enlarge
Figure 7: Rival penalized mechanism makes extra agents driven far away(double click it to see animation).

More precisely, the RPCL differs from FSCL in that eq.(2) is implemented with eq.(1) replaced by

(4)
p_{ j, t}= \begin{cases} 1, & \mbox{if} \, j=c, \\  -\gamma, & \mbox{if}\,  j= r, \\ 0, & \mbox{otherwise}, \end{cases} \begin{cases} c=arg \ min_{ j} \varepsilon_t(\theta_j), \\  r=arg \ min_{ j \ne c } \varepsilon_t(\theta_j),\end{cases}

where \gamma approximately takes a number between 0.05 and 0.1 for controlling the penalizing strength.

As illustrated in Fig.6(c), m_1 and m_3 finally converge to clusters' centers, while m_2 is driven far away from samples. Shown in Fig.7 is an experimental result that improves its FSCL counterpart in Fig.5, with a disturbing agent driven away from samples. Adding this rival penalized mechanism makes extra agents driven far away from data as long as the number of agents is initially given to be larger than the number of clusters to be detected. In other words, RPCL can allocate an appropriate number of learning agents to detect a correct number of clusters or objects automatically during learning. This automatic allocation is a favorable nature that both CCL and FSCL as well as the conventional clustering algorithms (e.g., k-means) do not have. In the past decade, a number of applications of RPCL have been made (Acciani, et al, 2003; Nair, 2003; Nakkrasae & Sophatsathit, 2004).


A general framework for divide-conquer learning

CCL, FSCL, and RPCL can be investigated systemically from the perspective of a general framework for learning agent based problem solving. That is, multiple learning agents coordinately divide a complicated task into a number of subtasks and each subtask is mainly conquered by one learning agent such that the entire task can be well solved collectively by these multiple learning agents (Xu, 2007). In the following, we introduce the four basic ingredients of this general framework.

Figure 8:   (a) Ellipse fitting, (b) Gaussian distribution, (c) a two step iterative learning
Enlarge
Figure 8: (a) Ellipse fitting, (b) Gaussian distribution, (c) a two step iterative learning

(a) Agent's structure Each agent needs a hardware M_j that is able to handle one specific problem solving task. This M_j is usually featured by a pre-designed structure with a set \theta_j of parameters to be determined. The simplest example is that M_j is given by a point \theta_j=m_j. In general, M_j models or fits its environment, from which we get x_t(\theta_j) as a reconstruction of x_t by M_j. Beyond a point m_j, M_j can also be a curve or shape as shown in Fig.8(a) for objection detection in a noise situation (Liu, Qiao, & Xu, 2006), outperforming the classical Hough transform (HT) (Hough, 1962) and Randomized Hough transform (RHT) (Xu, Oja, & Kultanen, 1990; Xu & Oja, 1993). Generally, M_j(\theta_j) can also be other parametric structure too.

(b) Individual criterion\quad Each agent has an individual value \varepsilon_t(\theta_j) to evaluate how good this x_t(\theta_j) is reconstructed by M_j. Considering the error \ e_t (\theta_j )=x_t- x_t(\theta_j), it can be measured by one of the following two related choices:

  • The errors are directly measured by certain norm, e.g., in a L_2 norm \varepsilon_t(\theta_j)=\Vert e_t (\theta_j )\Vert^2. For a simple case \!\theta_j=m_j, we have \!x_t(\theta_j)=m_j and \varepsilon_t(\theta_j)=\Vert x_t- m_j \Vert^2.
  • Observing that a larger error occurs with a smaller chance, an error may also be measured by its probability distribution, e.g., x - m_j is measured by a Gaussian as shown in Fig.8(b). In general, we have \!q(\varepsilon(\theta_j)| \theta_j) or q(x | x (\theta_j), \theta_j), which is further concisely rewritten into q(x |\theta_j). Plus a priori \alpha_j for the chance of the corresponding agent to be involved, we get the following probabilistic distribution based measure:
(5)
\begin{array}{l} \varepsilon_t(\theta_j)=-\begin{cases}  \ln{[\alpha_j q(x_t|\theta_j)]},  & (a)\, Type\, A, \\   \ln{ q(x_t|\theta_j)^{\alpha_j} } -\ln  \mathcal C(  \theta), \ C(\theta)= \int \prod_j q(x|\theta_j)^{\alpha_j}dx\approx \sum_t \prod_j{ q(x_t|\theta_j)^{\alpha_j}},   & (b)\, Type\, B. \end{cases} \end{array}

Type B can be regarded as a further extension of eq.(3).

(c) Global principle\quad Agents are coordinated globally to combine individual criteria \{ \varepsilon_t(\theta_j)\}_{j=1}^k, in one of the following two manners:

  • One is operational based, e.g., the simple ones by eq.(1) or eq.(4), and a further example will be discussed in the next section. Generally, we can denote an allocation scheme as follows:
(6)
p_{ j, t}= ALOC[\{{\varepsilon_t(\theta_j) }\}_{j=1}^k]
  • The other is evaluation based, by a principle on the overall performance jointly by all the agents, which will be further discussed in the next section.

(d) Updating rule\quad As a further extension of eq.(2), agents are modified to adapt x_t according to the allocation by p_{ j, t}\delta \theta_j, with

\delta \theta_j=Reduce(\varepsilon_t(\theta_j^{old})),

where Reduce(\varepsilon_t(\theta_j)) denotes an operator that acts on \theta_j^{old} and x_t to yield a direction along which \varepsilon_t(\theta_j) reduces within a neighborhood region of \theta_j^{old}. When \varepsilon_t(\theta_j) is differentiable, Reduce(\varepsilon_t(\theta_j)) can be given by -\nabla_{  \theta_j}\varepsilon_t(\theta_j) or -{\mathcal P}\nabla_{  \theta_j}\varepsilon_t(\theta_j) on D(\theta_j), where -{\mathcal P}\nabla_{  \theta_j}\varepsilon_t(\theta_j) has a positive projection on -\nabla_{  \theta_j}\varepsilon_t(\theta_j).

In this framework, a conquering action is performed via a updating rule by eq.(), while a dividing action is performed via an allocation scheme by eq.(6), which are iteratively implemented per sample as shown in Fig.8(c). To get a useful iteration, the following three issues should be considered:

  • The updating by Step (b) needs an appropriate control of a learning stepsize, which should be kept small enough.
  • During the iteration, we should keep \begin{array}{l}\theta_j^{new}\in D(\theta_j) \end{array} that usually consists of three types of constraints. One is \begin{array}{l}\alpha_j>0, \ \sum_j \alpha_j=1 \end{array}, the second is the covariance matrix \Sigma_j that should be positive definite. One typical way is transforming them to be function of parameters that are free of constraints, e.g., \begin{array}{l}\alpha_j=exp(c_j)/\sum_j exp(c_j)\end{array} and \begin{array}{l}\Sigma_j=S_jS_j^T\end{array} (Xu, 2001). Moreover, there is also a constraint of a matrix within the Stiefel manifold \begin{array}{l}UU^T=I\end{array}, for which one way bases on projection (Xu, 2002) while the other bases on differential geometry (Edelman, Arias,& Smith, 1998).
  • In addition to estimating unknown parameters \begin{array}{l}\{ \theta_j \}_{j=1}^k\end{array}, it also needs to select an appropriate number k of agents, such that not only each agent performs best according to its individual criterion but also a best overall performance is achieved, which highly depends on p_{ j, t}. In the literatures of statistics and machine learning, the task of determining the number k belongs to what is usually called model selection.


Model selection and typical implementation

In general, model selection refers to select an appropriate one among a family of infinite many candidate structures \{S_{\mathbf k}( \theta)\} with each S_{\mathbf k} in a same configuration but in different scales, each of which is labeled by a scale parameter {\mathbf  k} in term of one integer or a set of integers. Selecting an appropriate {\mathbf  k} means getting a structure consisting of an appropriate number of free parameters. At a given {\mathbf k}, we expect that S_{\mathbf k} best fits or represents a given set \Chi_N of samples with a finite size N and determine \theta_{\mathbf k} of all the unknown parameters within S_{\mathbf k}, such that the error of using S_{\mathbf k}(\theta^*_{\mathbf k} ) to represent \Chi_N becomes the least. For our problem discussed in previous section, we have S_{\mathbf k}=\{M_{j}\}_{j=1}^k and \theta_{\mathbf k}=\{\theta_{j}\}_{j=1}^k, while {\mathbf  k} is simply the number {\mathbf k}.

However, by a typical best fitting principle (e.g., a maximum likelihood (ML) principle ), its fitting error monotonically decreases as {\mathbf k} grows up, until the error reaches zero at {\mathbf k}={\mathbf k}_N, a value that relates to N and is usually much bigger than an appropriate {\mathbf k}^*. In this case, a large part of structure has actually been used to fit noises or outliers, which not only wastes computing resources but also degrades the generalization performance on new samples not in \Chi_N but from a same underlying regularity. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate k^*.

One major direction for tackling the problem is a two stage implementation. First, enumerate a candidate set of {\mathbf  k} and estimate the unknown set \theta_{\mathbf k} of parameters by ML learning for a solution \theta^*_{\mathbf k} at each {\mathbf  k} . Second, use a model selection criterion J(\theta^*_{\mathbf k}) to select a best {\mathbf  k}^* . In the literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), Minimum description length (MDL) (Rissanen, 1986, 2007), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999), and the VC dimension based generalization error bound (Vapnik, 1995). Unfortunately, anyone of these criteria usually provides a rough estimate that may not yield a satisfactory performance. Also, this two stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as {\mathbf  k} increases, which makes the value of J(\theta^*_{\mathbf k}) evaluated unreliably.

The other direction is stepwise model selection either incremently or decrementally. Incremental algorithms attempt to keep as much as possible what already learned as {\mathbf  k} increases step by step, focusing on learning newly added parameters (Ghahramani & Beal, 2000; Salah and & Alpaydin, 2004). Such an incremental implementation can save computing costs, but usually leads to suboptimal performance because not only those newly added parameters but also the old parameter set \theta_{\mathbf k} need to be re-learned but actually has been not updated. This suboptimal problem may be lessened by a decremental implementation that starts with {\mathbf  k} at a large value and decreases {\mathbf  k} step by step. At each step, one takes a subset out of \theta_{\mathbf k} with the remaining parameter updated, and discard the subset with a biggest decreasing of J(\theta^*_{\mathbf k}) after trying a number of such subsets. Such a procedure can be formulated as a tree searching. The initial parameter set \theta_{\mathbf k} is the root of the tree, and discarding one subset moves to one immediate descendent. A depth-first searching suffers from a suboptimal performance seriously, while a breadth-first searching suffers a huge combinatorial computing cost. Usually, a trade off between the two extremes is adopted.

Another direction of studies is called automatic model selection. One early exemplar is the effort on RPCL that is featured by driving an extra agent far away from data or equivalently pushing the proportion \alpha_j that samples are allocated to this agent towards zero. Subsequently, Bayesian Ying Yang learning was proposed in (Xu,1995, 2002) and systematically developed in the past decade, which leads to not only new model selection criteria but also algorithms with automatic model selection via maximizing a Ying Yang harmony functional (Xu, 2002, 2004).

In general, this automatic model selection direction demonstrates a quite difference nature from the usual incremental or decremental implementation that bases on evaluating the change J(\theta^*_{\mathbf k})- J(\theta^*_{\mathbf k}\cup \theta_{\delta}) as a subset \theta_{\delta} of parameters is added or removed. Thus, automatic model selection is associated with a learning algorithm or a learning principle with the following two features:

  • There is an indicator \rho(\theta_e) on a subset \theta_e \in \theta_{\mathbf k}, we have \rho(\theta_e)=0, if \theta_e consists of parameters of a redundant structural part.
  • In implementation of this learning algorithm or optimizing this learning principle, there is an mechanism that automatically drives \rho(\theta_e)\to0. Thus, the corresponding redundant structural part is effectively discarded.


Coordinated opposite mechanisms versus Ying Yang harmony

As illustrated in Fig.9(a), an appropriate combination of a global principle and individual criteria actually results in automatic model selection during parameter learning, due to a coordination of two opposite mechanisms, namely one for each agent to participate and the other for each agent to leave.

Figure 9:   (a) coordinated mechanisms, (b)  rewarding vs penalizing via variances.
Enlarge
Figure 9: (a) coordinated mechanisms, (b) rewarding vs penalizing via variances.

As mentioned early, the WTA allocation by eq.(1) used in CCL leads to the dead unit problem because agents do not have an equal chance to participate. This problem can be solved via a participating mechanism that lets each agent to have an equal initial chance to participate. For the FSCL that bases still on the WTA allocation by eq.(1), its local criterion by eq.(3) provides a participating mechanism that penalizes those frequent winners and compensates those constant losers. It lacks of a mechanism for an incapable or extra agent to leave. It fails when the number of agents in consideration is larger than an appropriate one. As previously introduced, RPCL improve FSCL via penalizing a rival by eq.(4), which provides a mechanism to drive an incapable or extra agent away. An appropriate balance between this leaving mechanism and the participating mechanism, as illustrated in Fig.9(a), leads to a favorable RPCL nature of automatic model selection, with the balance controlled by the de-learning rate \gamma in eq.( 4).

Instead of using a global allocating policy such as eq.(), we can get a leaving mechanism via an appropriate local criterion too. Still using the WTA by eq.(1), we consider Type A in eq.(5) with a special Gaussian p(x| \theta_j)=G(x|m_j, \sigma_j^2I), while Type B can be analyzed similarly. It follows that \begin{array}{l}  \varepsilon_t(\theta_j)=-\ln{\alpha_j }+ 0.5 [{\Vert x_t-m_j\Vert^2 / \sigma_j^{2}}\end{array} +d\ln{(2\pi \sigma_j^2)}] varies with \sigma_j^{2} as shown in Fig.9(b), where d is the dimension of x. If the j-th agent wins a lot of samples, \sigma_j^{2} will become larger than a threshold, and not only \varepsilon_t(\theta_j) will increase but also the gap between different agents reduces. In other words, the competing ability of an agent that won too much in past is self-penalized and the competing ability of a weaker agent is compensated for a relative boosting, which effectively provides a participating mechanism similar to \alpha_j in eq.(3). This similarity can also be observed from the term 0.5  {\Vert x_t-m_j\Vert^2 / \sigma_j^{2}} in which \sigma_j^{-2}\propto \alpha_j takes a role similar to \alpha_j in eq.(3).

On the other hand, when the j-th agent won too fewer samples and thus its \sigma_j^{2} drops below certain threshold, not only \varepsilon_t(\theta_j) will increase but also the corresponding gap between different agents increases. Therefore, an incapable/extra agent is self-penalized to fade out, which also provides a leaving mechanism that is different from the RPCL one by eq.(4). Moreover, as the competing ability of the j-th agent decreases, its \alpha_j decreases and the other \alpha_i will relatively increase. That is, the term -\ln{\alpha_j } effectively enhances this self-penalizing featured leaving mechanism. As a result, the local criterion internally seeks an appropriate balance between its participating mechanism and leaving mechanism, with an automatic model selection nature.

Instead of at a bottom level via two opposite mechanisms, automatic model selection nature can also be observed in a top-down way with a global learning principle or theory that guides parameter learning and model selection. To seek a relationship between RPCL learning and Bayesian Ying Yang learning, we consider the BYY system in i.i.d. cases, as shown in Figure 6 of the Scholarpedia page Bayesian Ying Yang learning. Our problem is actually a special case that is simplified into the above Fig.9(c). Moreover, considering a degenerated case of equations (12)&(14) with x,y,z collapsed to x and also h_x=0,from which we thus have R_{\ell}(x, y, \Theta_{\ell}^q)=0 and H(p \Vert q, \Theta ) simplified as follows:

(7)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell}p(\ell|x_t) L_{\ell}(x, \theta_{\ell}), \ \nabla_{\theta_{\ell}}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p(\ell|x_t)(1+ \delta_{\ell,t})\nabla_{\theta_{\ell}}L_{\ell}(x_t, \theta_{\ell}) \\ L_{\ell}(x, \theta_{\ell})=\ln{[\alpha_{\ell}q(x| \theta_{\ell})]}, \ p(\ell|x_t)=e^{L_{\ell}(x, \theta_{\ell})}/\sum_{j}e^{L_{j}(x, \theta_{j})}, \ \delta_{\ell,t}=L_{\ell}(x_t, \theta_{\ell}) -\sum_{j}p(j|x_t)L_{j}(x, \theta_{\ell}). \end{array}

As discussed in that Scholarpedia page too, \begin{array}{l} \max_{ \Theta} H(p \Vert q, \Theta )\end{array} along the gradient flow \begin{array}{l} \nabla_{\Theta}H(p \Vert q, \Theta )\end{array} also leads to a favorable nature of automatic model selection on k. Similar to the discussion after equation (10) there, the flow \begin{array}{l} \nabla_{\Theta}H(p \Vert q, \Theta )\end{array} becomes the one for the ML learning if \delta_{\ell,t}=0. In other words, BYY harmony learning differs from the ML learning by a relative fitness \delta_{\ell,t}. If \delta_{\ell,t}>0, updating goes along the same direction of the ML learning but with an increased strength. If 0>\delta_{\ell,t}> -1, i.e., the fitness is worse than the average and thus is doubtful, updating still goes along the same direction of the ML learning but with a reduced strength. When \delta_{\ell,t}<-1, updating reverses to the opposite direction, i.e., becoming de-learning, which becomes similar to RPCL. Further, there are two improvements. First, it avoids the difficulty of specifying an appropriate \gamma that is required in RPCL. Second, the rival c is updated by either de-learning or learning instead of only de-learning by RPCL. Therefore, RPCL can be explained as a bottom level simplified approximation of the BYY harmony learning. In other words, the Ying Yang best harmony at the global level in Fig.9(c) has an inherent relation with the local balance of two mechanisms in Fig.9(a).

Even interestingly, letting \begin{array}{l} \varepsilon_t(\theta_j)=-L_{j}(x, \theta_{j})\end{array} we find that it is just the one in eq.(5). Therefore, the above relation by eq.(7) actually provides the following new allocation scheme

(8)
\begin{array}{l}  p_{\ell,t}=p(\ell|x_t)(1+ \delta_{\ell,t}), \ p(\ell|x_t)=e^{-\varepsilon_t(\theta_{\ell})}/\sum_{j} e^{-\varepsilon_t(\theta_{j})}, \ \delta_{\ell,t}=\sum_{j}p(j|x_t)\varepsilon_t(\theta_{j})- \varepsilon_t(\theta_{\ell}), \end{array}

with which the iterative learning in Fig.8(c) actually implements a specific formulation of BYY harmony learning.

Let \varepsilon(\theta)=-H(p \Vert q, \Theta ), it follows from eq.(7) that we get the following global principle:

(9)
\varepsilon(\theta)=\sum_{t=1}^N \varepsilon_t(\theta),\varepsilon_t(\theta) =\sum_{j=1}^k p(\ell|x_t) \varepsilon_t(\theta_j),

and the allocation scheme by eq.(8) also comes from minimizing this \varepsilon(\theta) along the gradient flow \begin{array}{l} \nabla_{\theta} \varepsilon(\theta)\end{array}. In other words, knowing a global level principle can guide us to obtain and improve an allocation scheme via modifying either the gradient flow or the structure of p(\ell|x). For examples, minimizing \varepsilon(\theta) with respect to p(\ell|x_t) under the constraint \begin{array}{l}  \sum_j p(\ell|x_t)=1,\end{array} p(\ell|x_t)\ge 0 results in the WTA allocation by eq.(1). Also, we are lead to

(10)
\begin{array}{l}  p(\ell|x_t)=exp(-\varepsilon_t(\theta_{\ell}))/\sum_{j\in \mathcal{C}_{\kappa}} exp(-\varepsilon_t(\theta_{j})), \\ \mathcal{C}_{\kappa}= \{ j :  \ \varepsilon_t(\theta_j)  \mbox{ among the } \kappa \mbox{ smallest ones of } \{ \varepsilon_t(\theta_j) \}_{j=1}^k \}, \end{array}

by minimizing \varepsilon(\theta) with respect to p(\ell|x_t) under the constraint \begin{array}{l} \sum_j p(\ell|x_t) =1, \ p(\ell|x_t) \ge 0, \ and \ p(\ell|x_t) \propto e^{- {\varepsilon_t(\theta_j)  }}, \ j\in \mathcal{C}_{\kappa}.\end{array} As discussed after Figure 9 of the scholarpedia page Bayesian Ying Yang learning, letting every appearance of \begin{array}{l}\sum_{\ell}\end{array} replaced by \begin{array}{l}\sum_{\ell \in \mathcal{C}_{\kappa}}\end{array} leads to an approximate implementation of BYY harmony learning that tends towards RPCL learning, especially at \kappa=2.


Learning algorithms for Gaussian mixture, RBF nets, and mixture-of-experts

Figure 10:    Allocating policy versus individual criteria. (a) learning  algorithms for Gaussian mixture, (b) learning  algorithms for alternative ME and normalized RBF nets.
Enlarge
Figure 10: Allocating policy versus individual criteria. (a) learning algorithms for Gaussian mixture, (b) learning algorithms for alternative ME and normalized RBF nets.

Since the coordinating nature between a participating mechanism and a leaving mechanism is actually featured via a combination of a specific individual criterion \varepsilon_t(\theta_j) and a specific instance of the global allocating scheme by eq.(6), here we categorize different combinations to get an outline on algorithms that fall in the proposed divide-conquer learning framework. Shown in Fig.10(a) is an outline of learning algorithms derived from Fig.8(c) on Gaussian mixture.

The first column is featured by the WTA allocation by eq.(1). At the bottom, each learning agent is simply a center point m_j and multiple agents jointly perform a task of cluster analysis by either the standard CL or the FS-CL with the previously discussed improvement. Moreover, each agent is extended to a general Gaussian G(x|m_j,\Sigma_j) with a priori \alpha_j. For Type A, multiple agents jointly describe a Gaussian mixture:

(11)
\begin{array}{l}  q(x|\phi)=\sum_{j=1}^k \alpha_j G(x|m_j,\Sigma_j). \end{array}

It improves the standard CL not only by extending a spherical cluster to an elliptic cluster but also by getting the balancing mechanism as discussed in Fig.9(b). For Type B, multiple agents jointly describe a product Gaussian mixture:

(12)
\begin{array}{l}  q(x|\phi)=\prod_{j=1}^k G(x|m_j,\Sigma_j)^{\alpha_j}/\int \prod_{j=1}^k G(x|m_j,\Sigma_j)^{\alpha_j}dx.  \end{array}

It improves FS-CL similarly as the above for Type A, and differs from Type A in that \alpha_j takes a role similar to the one in eq.(3), which further enhances a participating mechanism.

Featured by eq.(4), the second column consists of the standard RPCL (Xu, Krzyzak, & Oja, 1992 , 1993) and its generalization (Xu, 1998a). Next, the last two columns lead to an adaptive version of the EM algorithm (Redner & Walker, 1984; Xu & Jordan, 1996) and Bayesian Ying Yang harmony learning not only on a Gaussian mixture by eq.(11) for the cases of Type A, which was already discussed after eq.(7) in the previous section, but also on product Gaussian mixture by eq.(12) for the cases of Type B.

One application of RPCL in its original papers (Xu, Krzyzak, & Oja, 1992 , 1993) was for learning a normalized radial basis function (RBF) networks in a two stage implementation, First, a clustering algorithm (e.g., k-means) in (Moody, & Darken, 1989) is replaced by RPCL with the number of clusters determined automatically. Second, the output layer is estimated by a least square linear fitting. It has been subsequently shown that a normalized RBF networks can be regarded as a simplified case of an alternative mixture of experts (ME):

(13)
\begin{array}{l} q(z |x, \theta)=\sum_{j=1}^k p(j|x, \phi)G(z|f_j(x,\phi_j), \Gamma_j), \    E[z|x]=\sum_{j=1}^k p(j|x, \phi) f_j(x,\phi_j), \\    p(j|x, \phi)={ \alpha_j G(x|m_j,\Sigma_j)/ q(x|\phi)}, \end{array}

for which the maximum likelihood (ML) learning is implemented by the EM algorithm (Xu, Jordan, & Hinton, 1995). Also, an alternative ME becomes an extended normalized radial basis function (RBF) networks at the following special case

(14)
\begin{array}{l} f(x|\phi_j)=w_j^T x +c_j, \ \alpha_j={|\Sigma_j|/\sum_{i=1}^k |\Sigma_i|}, \  E[z|x]=\sum_{j=1}^k {  exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)]  \over \sum_{r=1}^k exp[-0.5(x-m_r)^T\Sigma_r^{-1} (x-m_r)]}f_j(x,\phi_j),   \end{array}

which degenerates to a normalized RBF networks (Moody, & Darken, 1989) simply with w_j=0. As a result, the conventional two stage but suboptimal implementation of a normalized RBF learning is replaced by the EM algorithm (Xu, 1998b). Moreover, RPCL and Bayesian Ying Yang learning have also been adopted for determining the number of basis functions (Xu, 1998b, 2001).

Similar to Fig.10(a), these learning algorithms can also be summarized under the general divide-conquer learning framework here, for which we further interpret the following points:

  • Each agent also has a part M_j^a(\theta_j^a) that makes an action z_t^a(\theta_j^a) to the environment.
  • Individual criterion also needs to evaluate the error e_t^z(\theta_j^a)=z_t- z_t^a(\theta_j^a), for which we generalize the Type A case of eq.(5) into :
(15)
\begin{array}{l} \varepsilon_t(\theta_j)=- \ln{[\alpha_j q(x_t| \theta_j^x)q(z_t|x_t, \theta_j^a)]}, \mbox{ in general;} \\  \varepsilon_t(\theta_j)=- \ln{[\alpha_j G(x|m_j,\Sigma_j)G(z|f_j(x,\phi_j), \Gamma_j)]}, \ \mbox{for the alternative ME.}  \end{array}

With the above \varepsilon_t(\theta_j) replacing its previous appearances on Gaussian mixture, it follows from Fig.8(c) that similarly we get the learning algorithms for the alternative ME and normalized RBF networks, as outlined in Fig.10 (b).

Though we observe here and also in the previous section that the general divide-conquer learning framework leads to an approximate implementation of BYY harmony learning, the latter actually goes far beyond in both performances and coverage. First , it follows from equations (12)&(14) in the Scholarpedia page Bayesian Ying Yang learning that we generally have the term R_{\ell}(x, y, \Theta_{\ell}^q)\ne 0 to regularize the effect of a finite size of samples. Second, it goes beyond the coverage of the formulation by eq.(9) and eq.(8), not only in the above cases R_{\ell}(x, y, \Theta_{\ell}^q)\ne 0 but also in the cases to be further introduced in the next section.


Learning algorithms for subspace based functions

In many practices, there is only a finite size of training samples distributed in small dimensional subspaces, instead of scattering over all the dimensions of the observation space. These subspace structures can not be well captured by considering a basis exp[-0.5(x-m_j)^T\Sigma_j^{-1} (x-m_j)] supported on the entire space of x. There are too many free parameters in \Sigma_j, which usually leads to poor performances. Instead, we consider a basis on a subspace as shown in Fig.(a), where observed samples are regarded as generated from a subspace with independent factors distributed along each coordinate of a m_{\ell} dimensional inner representation y .

Figure 11:   Subspace based function
Enlarge
Figure 11: Subspace based function


Shown in Fig.11, we may let G(x|m_j,\Sigma_j) in eq.(13) and eq.(14) to be replaced by G(x|m_j,A_j\Lambda_jA_j+\Sigma_j) that considers x as generated from a lower dimensional subspace spanned by the columns of A_j, while the mapping to z is described by q(z|x,y,\ell) based on this subspace also. Specifically, there are two typical choices:

  • Type A is indicated by i_Z=0, which corresponds to the previous ME by eq.(13) and RBF networks by eq.(14) with f_j(x,\phi_j)=  W_jx+c_j for x \to z directly while the gating net in eq.(13) and basis function in eq.(14) are supported on the subspace of y instead of the original space of x.
  • Type B is indicated by i_Z=1. It performs a mapping y \to z from the lower dimension subspace. We seek a mapping x \to y to get a cascade mapping x \to y \to z. From two Gaussians G(y|0, \Lambda_j) and G(x|Ay+m_j), \Sigma_j, a choice for x \to y is their posteriori inverse in a Bayesian sense, from which we get x \to z by a Gaussian \begin{array}{l}\int G(y|U(x-m_j), \Pi_j^{y \ -1})G(y|0, \Lambda_j)dy\end{array} as G(z|f_j(x,\phi_j), \Gamma_j) in eq.(13) with f_j(x,\phi_j)=  W_jU(x-m_j)+c_j. Putting them into eq.(15), learning is made by those algorithms in Fig.10(b) again.

Correspondingly, we get two types of subspace based gating networks and subspace based functions (SBF). Type B further improves Type A as the mapping x \to y acts as feature extraction, such that redundant parts are discarded.

To get the advantages expected as above, a key issue is how to determine an appropriate dimension m_{\ell} for each subspace, which is another model selection task. One way to tackle this problem is the two stage implementation based approaches introduced in a previous section. Under the framework of the formulation by eq.(9) and eq.(8), we can not get automatic model selection nature by those algorithms in Fig.10(b). Alternatively, we turn this subspace based model into a specific example of the BYY system given in Fig.6 of the Scholarpedia page Bayesian Ying Yang learning, by which an appropriate dimension m_{\ell} of each subspace will be determined automatically during implementing the algorithm for Bayesian Ying Yang harmony learning. This is however beyond the scope of the divide-conquer learning framework introduced in the previous section. For this task in Fig.11, \begin{array}{l} \nabla_{\Theta}H(p \Vert q, \Theta ) \end{array} in equation (16) of the Scholarpedia page Bayesian Ying Yang learning has a formulation \begin{array}{l} \sum_{\ell} (p_{\ell,t} H_{\ell,t} + \delta_{\ell,t}G_{\ell,t}) \end{array} from which we generally can not the allocation scheme by eq.(8) via \varepsilon_t(\theta_j)=- H_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q) except some degenerated case that is able to be reformulated into a format \begin{array}{l}\sum_{\ell} p(\ell|x_t)(1+ \delta_{\ell,t}) \tilde H_{\ell,t}  \end{array}.

References

  • Acciani, G., et al (2003), "A feature extraction unsupervised neural network for an environmental data set", Neural Networks 16: 427-436.
  • Ahalt, SC., et al (1990), "Competitive learning algorithms for vector quantization", Neural Networks 277-291.
  • Akaike, H (1974), "A new look at the statistical model identification", "IEEE Tr. Automatic Control" 19: 714-723.
  • Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", "Psychometrika" { 53} (3): 407-415.
  • Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", "Statistics & Probability Letters ", {33}: 201-208.
  • Desieno, D (1988), "Adding a conscience to competitive learning", Proc. of IEEE int. conf. on Neural Networks, Vol.I, pp.117-124.
  • Edelman, A, Arias, TA, & Smith, ST, (1998), "The geometry of algorithms with orthogonality constraints", SIAM J. Matrix Anal. Appl., 20:303-353.
  • Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
  • Grossberg, S (1987), "Competitive learning: from iterative activation to adaptive resonance", Cognitive Science, 11: 23-63.
  • Hecht-Nielsen, R (1987), "Counterpropagation networks", Applied Optics 26: 4979-4984.
  • Hough, PVC "Method and means for recognizing complex patterns", U.S. Patent 3069654, Dec.18, 1962.
  • Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
  • Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
  • Kohonen, T (1982), "Self-organized formation of topologically correct feature maps", Biological Cybernetics 43: 59-69.
  • Liu, ZY, Qiao, H, & Xu, L (2006), "Multisets Mixture learning based Ellipse Detection", Pattern Recognition 39: 731-735.
  • Ma, J, Wang, T, & Xu, L (2004), "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection", Neurocomputing 56: 481-487.
  • Moody, J & Darken, J (1989) "Fast learning in networks of locally-tuned processing units", Neural Computation 1: 281-294.
  • Nair, TM., et al, (2003) "Rival penalized competitive learning (RPCL): a topology-determining algorithm for analyzing gene expression data", Computational Biology and Chemistry 27: 565-574.
  • Nakkrasae, S, & Sophatsathit, P (2004), "An rpcl-based indexing approach for software component classification" International J. of Software Engineering and Knowledge Engineering 14: 497-518.
  • Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
  • Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
  • Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.
  • Rivals, I & Personnaz, L (1999) "On Cross Validation for Model Selection", " Neural Computation" 11: 863-870.
  • Robbins, H., & Monro, S., (1951), "A stochastic approximation method", Ann. Math. Statist., 22, 400-407.
  • Rumelhart, DE & Zipser, D (1985), "Feature discovery by competitive learning", Cognitive Science 9: 75-112.
  • Salah, AA & Alpaydin, E. (2004), "Incremental mixtures of factor analyzers", Proc. the 17th International Conference on Pattern Recognition, 23-26 Aug. 2004, Cambridge, UK, Vol.1, 276-279.
  • Schwarz, G (1978), "Estimating the dimension of a model", "Annals of Statistics" 6: 461-464.
  • Stone, M (1978), "Cross-validation: A review", "Math. Operat. Statist. " 9: 127-140.
  • Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
  • Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11, 185-194.
  • Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
  • Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", (in press), " Pattern Recognition", 40: 2129-2153.
  • Xu, L (2007), "A Trend on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in "Studies in Computational Intelligence 63: Challenges to Computational Intelligence", Duch, W. and Mandziuk, J., 365-405.
  • Xu, L (2002), "BYY Harmony Learning, Structural RPCL, and Topological Self-Organizing on Unsupervised and Supervised Mixture Models", Neural Networks 15: 1125-1151.
  • Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, ME-RBF Models and Three-Layer Nets", International J. of Neural Systems 11: 3-69.
  • Xu, L (1998a), "Rival Penalized Competitive Learning, Finite Mixture, and Multisets Clustering", Proc. of IJCNN98, Anchorage, Vol.II, pp.2525-2530.
  • Xu, L (1998b), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
  • Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
  • Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
  • Xu, L, Krzyzak, A, & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Tr. on Neural Networks 4: 636-649.
  • Xu, L, & Oja, E (1993), "Randomized Hough Transform (RHT): Basic Mechanisms, Algorithms and Complexities", Computer Vision, Graphics, and Image Processing: Image Understanding 57: 131-154.
  • Xu, L, Krzyzak, A & Oja, E (1992), "Unsupervised and Supervised Classifications by Rival Penalized Competitive Learning", Proc. of 11th Intl. Conf. on Pattern Recognition, Hauge, Netherlands, Aug.30-Sept.3, 1992, Vol.I, pp.672-675.
  • Xu, L, Oja, E & Kultanen, P (1990), "A New Curve Detection Method: Randomized Hough transform (RHT)", Pattern Recognition Letters 11: 331-338.


Internal links

External links

Author's website (http://www.cse.cuhk.edu.hk/~lxu/)


See also


Lei Xu (2007) Rival penalized competitive learning. Scholarpedia, 2(8):1810, (go to the first approved version)
Created: 31 July 2006, reviewed: 17 August 2007, accepted: 31 August 2007
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors