Bayesian Ying Yang learning

From Scholarpedia

Lei Xu (2007), Scholarpedia, 2(3):1809. revision #40243 [link to/cite this article]

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong

Bayesian Ying-Yang learning is a statistical learning theory for a two pathway featured intelligent system via two complementary Bayesian representations of the joint distribution on the external observation and its inner representation, with all unknowns in the system determined by a principle that two Bayesian representations become best harmony.

Contents

Two types of abilities and two pathway approaches

It is speculated that a learning system or a biological brain survives in its world with at least two types of intelligent abilities. Implemented by a top-down or outbound pathway, Type-I consists of abilities of 'fitting' its world, including not only understanding ability to explain its world but also motoring ability to track the changes in its world. On the other hand, implemented by a bottom-up or inbound pathway, Type-II consists of problem solving skills, ranging from perceiving what events encountered to generating signals that activate the outbound pathway. Both the pathways were initially set up via biological inheritance and then further developed gradually during interactions between the brain system and its surviving world.

The outbound pathway is developed mainly via a learning process for mining the common features or discovering regularities among an ensemble of uncertain evidences (or called samples) from the world. While the inbound pathway is demanded to fast implement problem solving timely, usually per one or several samples encountered. This inbound pathway is developed either adaptively via learning from previous problem solving experiences or off-line via seeking problem solving strategies based on the Type-I knowledge (i.e., regularities or dependencies about the world) by reasoning, inference, optimization, and then designing the strategies into certain fast implementable functions.

The two-pathway approach has been adopted in the literature of modelling a perception system for decades. One early example is the adaptive resonance theory developed in the 1970s (Grossberg, 1976; Grossberg & Carpenter, 2002), featured by a resonance between bottom-up input and top-down expectation in help of a mechanism motivated from a cognitive science view. In the past one or two decades, efforts have further been made on multilayer net featured two-pathway approaches under guidance of a statistical principle, e.g., under the least mean square error criterion by the auto-association (Bourlard & Kamp, 1988) and the LMSER self-organization (Xu, 1991 & 93), and under the Helmholtz energy by the Helmholtz machine and sleep-wake learning (Hinton, et al, 1995; Dayan, 2002). The basic spirit of LMSER self-organizing was further developed into the Bayesian Ying-Yang (BYY) learning in the mid-1990s, not only with multi-layer two-pathway replaced by a general system via two complementary Bayesian representations but also with the least mean square error criterion replaced by a best Ying-Yang harmony criterion (Xu, 1995, 2002).

Over-fitting, model selection, and regularization

The hardware of a biological brain is featured by an alive body with its organization specified genetically. Such an organization, regardless whether or not in a two pathway form, can be mathematically denoted by a sophisticated manifold S(X, \Theta)\, \, with a parametric architecture or structure S(\cdot, \cdot)\, to accommodate all functions or tasks that are expected to perform. To build up an artificial intelligent system, our task is to set up such a S(X, \Theta)\, \,.

Only from X\,\,, we are unable to specify an appropriate S(\cdot, \cdot)\,. It needs to be appropriately designed, which is task-dependent and needs knowledge about the structure underlying the samples of X\,\,. Except some particular situations, we usually have no such type of knowledge. Instead of seeking such knowledge, we consider S(\cdot, \cdot)\, to be a combination of a number of individual simple structures. Shown in Fig.1(a) are two examples of such combined architectures. Provided that each simple structure and a combing architecture are pre-designed, we can use S_k\,\, to denote such a combined architecture with k\,\, being one or a set of integers that accounts for the scale of the combined architecture, i.e., the number of involved simple structures. Given k\,\,, we also need to determine \Theta\,\, that consists of all the unknown parameters within the combing architecture. Usually, what we have is a given set \Chi_N\,\, of samples with a finite size N\,\,, we need to determine both k\,\, and \Theta\,\,. In other words, we need to specify one S_k = S_k(X, \Theta)\,\, to model or represent the given set \Chi_N\,\,. More specifically, the task of determining \Theta\,\, at a given k\,\, is usually called parameter learning, while the task of selecting an appropriate scale k\,\, is called model selection since we select among a set of candidate models as the value of k\,\, varies.

Figure 1:   A combination of a series of individual simple structures (a) two examples, (b) fitting error curve
Enlarge
Figure 1: A combination of a series of individual simple structures (a) two examples, (b) fitting error curve

Intuitively, we expect that S_k\,\, best fits or represents \Chi_N\,\, such that the error of usingS_k\,\, to represent \Chi_N\,\, becomes the least. With such a type of best fitting principle, we expect to determine \Theta^*_k\, at a given k\,\,. However, as shown in Fig.1(b), its fitting error will monotonically decrease as k\, grows up, until the error reaches zero at k=k_N\,\,, a value that relates to N\, and is usually much bigger than an appropriatek^*\,\,. In this case, a large part of structure has actually been used to fit noises or outliers. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate k^*\,\,. Only taking parameter learning in consideration, the previously mentioned two pathway approaches, (such as auto-association, LMSER self-organization and Helmholtz machine, etc), can not avoid this over-fitting problem too.

One major direction for tackling the problem is a two stage implementation, with parameter learning made on a set of candidate models among which one is selected by a model selection criterion. Beyond the best-fitting principle, such a criterion is obtained from a different statistical learning theory. In literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace, 1999), and Minimum description length (MDL) (Rissanen, 1986), Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999). Unfortunately, not only a two stage implementation is very expensive to compute, but also these criteria can only provide a rough estimation or even a wrong estimate especially when N\,\, is small and k\,\, consists of more than one integers.

Another direction for tackling over-fitting is called regularization. Instead of searching an appropriate k^*\,\,, we may consider a S_k\,\, with its scale k\,\, large enough to accommodate the true structure underlying the samples \Chi_N \,\,, and then impose certain constraint on \Theta\,\, or certain regularity on S_k\,\, of a scale k\,\, (Tikhonov & Arsenin, 1977; Poggio & Girosi, 1990), such that we can still determine a unique \Theta^*\,\, to let S_k\,\, approximate the true structure of a lower scale. Equivalently, the situation also applies to the cases that the true structure underlying \Chi_N \,\, is known but the size N\,\, is not large enough to determine a unique \Theta^*\,\,. However, the regularization methods still suffer one key difficulty and one crucial weakness.

The difficulty is to appropriately control the strength of regularization, which is only computable for a simple structure. Usually, it is roughly estimated via Cross validation with extensive computing costs (Stone, 1978; Rivals & Personnaz, 1999). The crucial weakness comes from the fact that a constraint or regularity has to be imposed in an isotropic or uniform manner, which usually does not work. First, data may come from multiple disconnected substructures. Second, even when data come from an isotropic or uniform structure but with a unknown scale, we still can not treat the parameters of S_k\,\, in an isotropic or uniform way since those parameters in the extra part of S_k\,\, should be discarded when we use a S_k\,\, with a larger scale. For an example, given a set of samples that come from y=x^2+ 3x+ 2\,\,, we desire to force all the parameters \{ a_i, i \ge 3\}\, to be zero if we use a polynomial model y=\sum_{i=0}^k a_i x^i, k >3\, to fit the data.

In the rest of this paper, we introduce Bayesian Ying-Yang learning that tackles the over-fitting problem from the third direction.

Bayesian Ying-Yang system

As shown in Fig.2, we consider a general two-pathway approach from a statistical perspective. The joint distribution of the external observations X\,\, and its inner representations R\,\, via the following two types of Bayesian decomposition:

(1)
p({ X}, { R})=p({ R}| { X})p({ X}),   \  q({ X}, { R})=q({ X} | { R})q({ R}). \,
Figure 2:   Bayesian Ying-Yang System
Enlarge
Figure 2: Bayesian Ying-Yang System

In a compliment to the famous Chinese ancient Ying-Yang philosophy, the decomposition of p({  X}, { R})\,\, coincides the Yang concept with a visible domain p({ X})\,\, for a Yang space and a forward pathway by p({ R}| { X})\,\, as a Yang pathway. Thus, p({ X}, { R})\, is called Yang machine. Similarly, q({ X}, { R})\,\, is called Ying machine with an invisible domain q({R})\,\, for a Ying space and a backward pathway by q({ X} | { R})\,\, as a Ying pathway. Such a pair of Ying-Yang machines is called Bayesian Ying-Yang (BYY) system.

More specifically, R=\{Y,\Theta\}\, can be further divided into two parts, and thus the system is also divided into two layers, as shown in Fig.2. One is Y\, that consists of a set of inner encoding or state variables, which timely respond the external environment per sample or per several samples observed. This Y\, is supported by a parametric substructure q(Y|\Theta_y)\,. On one hand, taking a key role in the information flow within the front layer, Y\,\, is the source of the information flow to fit the observations \Chi_N \, via the top-down pathway q({ X} | Y, \Theta_{x|y} )\, that implements the abilities of Type I. On the other hand, featuring the abilities of Type II, this Y\, comes from the information flow via a bottom-up pathway p(Y| { X}, \Theta_{y|x})\, from \Chi_N \, that is either in a smoothed form, e.g.,

(2)
p({ X}|\Theta_x )=p_h(X), \, \, p_h(X)  = \prod_{t=1}^N G(x_t|\bar x_t, h^2I),  \,\,

or p_0(X)=p_h(X)|_{h=0}\, directly.

In a summary, the front layer itself is a parametric Ying-Yang pair:

(3)
p({ X}, { Y}|\Theta_p )=p(Y| { X}, \Theta_{y|x})p_h(X),   \  q({ X}, { Y}|\Theta_q )=q({ X} | Y, \Theta_{x|y} )q(Y|\Theta_y),\,

which consists of four parametric substructures with each depending on a subset of parameters \Theta =\{\Theta_p, \Theta_q\}\, with \Theta_q =\{\Theta_y, \Theta_{x|y} \}\, and \Theta_p=\{\Theta_{y|x}, h\}\,, where \Theta_{x}\, is simply h for the case by eq.(2). Being different from Y\, that responds the environment per sample, \Theta\, is determined collectively from all the samples and represents the common regularities or dependencies among data as the knowledge about the world. This \Theta\, is accommodated in the back layer with some a priori structure to back up the front layer. A feedback from the front layer to the back layer goes via p(\Theta | X)\,. Moreover, the back layer may be modulated by some meta knowledge from q(\Xi)\, on a deeper layer.

The Ying-Yang system in Fig.2 are featured by two manifolds p({ X}, { Y}|\Theta_p )\, and q({ X}, { Y}|\Theta_q )\,. After the structures q(Y|\cdot), q({ X} | Y, \cdot)\,, and p(Y| { X}, \cdot)\, are pre-designed, there are three levels of tasks for determining three levels of unknowns:

Figure 3:  Local subspace analysis
Enlarge
Figure 3: Local subspace analysis
  1. Association or Relaxation: The task is to determine Y\, from X\, via certain dependence or association within the front layer and is usually referred to under the terms of reasoning, mapping, repression, etc. Also, Y\, may be indirectly determined via minimizing an energy or cost that incurs from violating certain constraints in S\,, under the term of relaxation.
  2. Parameter learning: The task is to determine \Theta\,, which includes determining Y\, as a subtask. Determined collectively by all the samples, \Theta\, is updated as samples come, in a speed much slower than Y\, that varies as one or a subset of samples come. This parameter learning is made via optimizing the performances of implementing either or both of Type I and Type II abilities, subject to either none or some priori knowledge q(\Theta)\, from the back layer.
  3. Model Selection: The scale for representing R\,\, (or equivalently the scale of the entire system) is featured jointly by the scale k_p\,\, of p({ X}, { Y}|\Theta_p )\, and the scale k_q\, of q({ X}, { Y}|\Theta_q )\,. One common part shared by both k_p\, and k_q\, is the scale k_Y\, for representing Y\,. Moreover, q({ X} | Y, \Theta_{x|y} )\, may be designed via a combined architecture, which contributes a rest part \! k_q^- of \! k_q after taking off the effect of \! k_Y. Also, p(Y| { X}, \Theta_{y|x})\, may be designed via a combined architecture, which contributes a rest part k_p^- \, of k_p\, after taking off the effect of k_Y\,. So, the challenging task of model selection includes determining all of k_{Y} \,, k^-_q\,, and k^-_p\,. Via the BYY harmony learning, p(Y| { X}, \Theta_{y|x}) \, is determined completely or partially from q({ X}, { Y}|\Theta_q ) \,. Thus, k^-_p \, or a part of k^-_p \, needs not to be looked for separately but can be obtained from k^-_q \, compeletely or partially. Even interestingly, the model selection problems of many typical learning tasks can be formulated into selecting merely the k_{Y} \, part in a BYY system (Xu,2005).
Figure 4:  BYY harmony learning for LSF
Enlarge
Figure 4: BYY harmony learning for LSF

Shown in Fig.3 is the problem of local factor analysis, which can be regarded as an extension of two well known unsupervised data analysis tools, namely Gaussian mixture and factor analysis. Shown in Fig.4 is the BYY learning system formulated for this problem. Type I knowledge is described by several Gaussian clusters with each described by a subspace located at \mu_{\ell}\,. Type II part is learned in term of not only the decision boundaries for separating Gaussian clusters but also the linear mappings x \to y_{\ell,t}, \ell=1, \cdots, k\, for projecting each sample x_t\, onto the coordinates y_{\ell,t}\, of each subspace. In this problem, what needs to be appropriately selected is the unknown k_Y=\{k, \{ m_{\ell} \}_{\ell=1}^k\}\,, where k\, is the number of Gaussian clusters and m_{\ell}\, is the dimension of each subspace. Moreover, k_q\, consists of the effective free parameter number in the set of m_{\ell}, \Sigma_{\ell}, U_{\ell}, \Lambda_{\ell},  \alpha_{\ell}, \ell=1, \cdots, k\,, while k_p\, consists of the effective free parameter number for representing p(Y| { X}, \Theta_{y|x})\,. Readers are referred to the last section for other details and comparative experiments.

BYY harmony learning

An analogy of the Ying-Yang system in Fig.2 to the ancient Chinese Ying-Yang philosophy motivates to determine all the unknowns of three levels under a best harmony principle, which is mathematically implemented by maximizing the following harmony measure

(4)
H(p \Vert q) \,\!\,= \int p({R}|{X})p_h({X}) \ln{[ q({X} | {R})q({R})]} d{X} d{R}\,

= \int p(\Theta | {X})p(Y| { X}, \Theta_{y|x})p_h(X) \ln{[q({ X} | Y, \Theta_{x|y} )q(Y|\Theta_y)q( \Theta)]} d{X} dY d\Theta, \, with respect to K=\{k_Y,\, k_q^-, \, k_p^-\} \, as well as p(\Theta | {X})\,. Instead of seeking a best \Theta^*\, value, it seeks a density p^*(\Theta | {X})\, that represents the uncertainty on estimating \Theta\, from a finite size of samples. However, it needs a priori knowledge to design an appropriate structure for p(\Theta | {X})\,, which is very difficult.

To avoid the difficulty, we only seek p(\Theta | {X})\, at a data set {X}={\mathcal X}_N\,, i.e., we consider p(\Theta | {X})=p(\Theta | {\mathcal X}_N )\, such that eq.(4) can be further rewritten into

(5)
H(p \Vert q)= \int p( \Theta  |{\mathcal X}_N ) H(p \Vert q, \Theta )d\Theta, \, H(p \Vert q, \Theta )=H_f(p \Vert q, \Theta )-Z(\Theta  ),  \,\!\,

H_f(p \Vert q, \Theta )=\int p( {Y} | {X}, \Theta_{y|x}) p_h({X}) \ln{[ q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})]} d{X} d{Y}, \, where Z(\Theta) =-\ln{q( \Theta)}\, represents a priori knowledge for regularizing learning.

Further noticing the following inequity

(6)
H(p \Vert q)= \int  p( \Theta  |{\mathcal X}_N ) H(p \Vert q, \Theta )d\Theta  \le \max_{\Theta} H(p \Vert q, \Theta  ), \,

we can avoid the difficulty for the maximization of H(p \Vert q)\, via approximately seeking the maximization of its upper bound \max_{\Theta} H(p \Vert q, \Theta )\,. That is, maximizing H(p \Vert q)\, with respect to K and p(\Theta |  {X})\, is turned into

(7)
\max_{\{\Theta,  K\}   } H(p \Vert q, \Theta ).  \,

A simplest and also mostly encountered case is that there is no priori knowledge for q(\Theta)\,. Let q( \Theta )=1\, or Z(\Theta )=0\,, \max_{h} H(p \Vert q, \Theta )\, will force h=0\, and thus p_h(X)=p_0(X)\, by eq.(2). Provided that \bar k_q, \bar k_p\, are pre-given, we have that eq.(5) becomes

(8)
\max_{\{\Theta, k_Y \} } \{H_f(p \Vert q, \Theta )= \int p( {Y} | {\mathcal X}_N, \Theta_{y|x}) \ln{[ q({\mathcal X}_N | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})]} d{Y}\},\,

which seeks a best harmony within the front layer Ying-Yang system in Fig.2. Also, the integral over Y\, can be further removed in implementation via one of several choices. Details are referred to (Xu, 2007a&b) for a recent summary.

In the cases of a small sample size, we return back to consider an appropriate h \ne 0\, such that eq.(2) takes a role of making data more smooth. There are two directions for handling the problem. One is made under the name of equal covariance, i.e., getting h^2=h(\Theta_q)\, under the constraint that the variance of p_h(X)\, is equal to the variance of \int q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})dY\, (Xu, 2007a). The second direction is seeking a priori by q(h)\, via decomposing q( \Theta)\, into

(9)
q( \Theta)=q( \Theta^-|h)q(h), \ with \ \Theta=\Theta^-\cup\{h\}, \ \Theta^-\cap\{h\}=\emptyset,  \,

with q( \Theta  )=1 relaxed into q( \Theta^-|h)=q( \Theta^-)=1. One way is to let q(h) to be a \chi^2 distribution or a Gamma distribution (see Sec.3.4.3 of Xu, 2007a). The other is considering q(h) dependent on p_h(X). It follows from eq.(5) and p(\Theta |{\mathcal X}_N )=p(\Theta^- | {\mathcal X}_N, h)p(h|{\mathcal X}_N) that maximizing H(p \Vert q) with respect to q(h) results in q(h)=p(h|{\mathcal X}_N), for which we consider

(10)
p(h|{\mathcal X}_N)\propto 1/ \sum_{t=1}^N p_h(x_t), \, p_h(x)=N^{-1}\sum_{t=1}^N G(x|\bar x_t, h^2I), \,

where p_h(x) is an i.i.d version of eq.(2). With Z(\Theta )=-\ln{q(h)}, \ q(h)=p(h|{\mathcal X}_N) by eq.(10), we implement eq.(7) for an appropriate h \ne 0 such that it takes a role called data smoothing regularization. Details are referred to (Xu, 2003; Xu, 2007b) and especially Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks on studies by the present author since 1997.

Here we provide a rationale for using a priori in a format of q(\psi)\propto 1/ \sum_{t=1}^N q(u_t|\psi). In an infinite sample size, we have \int q(u|\psi) du=1 that does not depend on \psi. However, this is no longer true for s(\psi)=\sum_{t=1}^N q(u_t|\psi) on a finite sample size, which varies with \psi and imposes an implicit distribution \propto s(\psi). Considering a priori q(\psi) \propto 1/s(\psi) can balance off this unnecessary bias.

Furthermore, we can extend the above approach to get a priori q(\Theta^-)=q(\Theta_{y|x}|\Theta_q)q(\Theta_q)=q(\Theta_q) with q(\Theta_{y|x}|\Theta_q)=q(\Theta_{y|x})=1. Taking the i.i.d. version q(x|y, \theta_{x|y})q(y|\theta_y) of q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y}) as an example, we consider

(11)
q(\Theta_q) \propto 1/\sum_{t=1}^N q(x_t|\Theta_q), \, q(x|\Theta_q)=\begin{cases} \int q(x |y,\theta_{x|y})q(y|\theta_y)dy, & \mbox{ Choice (a)}, \\   q(x_t |y_t, \theta_{x|y})q(y_t|\theta_y), & \mbox{ Choice (b),} \end{cases}

where each y_t is estimated in responding to each x_t for Choice (b) (Xu, 2005; Xu, 2006b). With Z(\Theta)=-\ln{q(\Theta_q)}, we may implement eq.(7) with a regularization imposed via q(\Theta_q), under the name of normalization regularization. Details are referred to (Xu, 2007b) and especially Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks on studies by the present author since 1997.

In help of the inequality by eq.(6), we are able to get rid of the difficulty in getting p( \Theta | {\mathcal X}_N), however, which ignored the feedback interaction from the front to the back layer. Still, we may somewhat take p( \Theta |{\mathcal X}_N) back in consideration for a further improvement on selecting K, such that the task can be approximately decomposed into the following two stages:

(12)
\begin{matrix} Stage~~ I:&\Theta^*=arg \max_{\Theta}H(p \Vert q,\Theta), \mbox{for a set of values of } K;~~~~~~~~ \\ Stage~~ II: & K^*=arg\min_{K} J(K),   J(K)=- H(p \Vert q, \Theta^*) + 0.5 d(\Theta), \\ where & d(\Theta)\approx  \mbox{ the rank of the Hessian} \,   \Pi=- {\partial^2  H(p \Vert q, \Theta ) \over \partial \Theta \partial {\Theta }^T}_{\Theta=\Theta^*}.\\ \end{matrix}

To understand the above J(K), we expand H(p \Vert q,\Theta) with respect to \Theta around the obtained \Theta^* at Stage I up to the second order, and ignoring the first order term since \nabla_{\Theta} H(p \Vert q, \Theta )=0 at \Theta=\Theta^*, resulting in H(p \Vert q, \Theta )=H(p \Vert q, \Theta^* )-0.5Tr[\Sigma \Pi] and \Sigma=\int(\Theta -\Theta^* )(\Theta -\Theta^* )^Tp( \Theta  | {\mathcal X}_N)d \Theta. In a neighbor area of \Theta^*, H(p \Vert q, \Theta ) can be approximately regarded as a quadratic function of \Theta. Thus, we can regard p( \Theta  | {\mathcal X}_N) as a Gaussian that concentrates in this area with its mean \Theta^* and its covariance \Sigma=\Pi^{-1}. So we have Tr[\Sigma \Pi]=Tr[\Pi^{-1} \Pi] that is the dimension of \Theta. If \Pi is singular, d(\Theta) is given by the rank of \Pi, i.e., an effective dimension of \Theta (Xu, 2004b, Xu, 2007a&b).

It should be noted that the representation form of Y can have several choices too. The simplest and most widely encountered case is that Y consists of a set of random vectors \{ y_t \}_{t=1}^N that are independent and identically distributed (i.i.d.), in a correspondence to X that consists of a set of i.i.d. samples vectors \{ x_t \}_{t=1}^N. The other important case is a time series Y=y_1y_2\cdots y_t \cdots, corresponding to X=x_1x_2\cdots x_t \cdots with temporal relation among samples. Both the i.i.d. and temporal cases have been studied and applied to several learning tasks. Details are referred to (Xu, 2000, 2001a, 2004a). Furthermore, Y may have other structures too, i.e., in a tree representation, for which further investigation are demanded.

The last but nontrivial, it also deserve to observe a degenerated case of the Ying-Yang system in Fig.2 from {R}=\{Y, \Theta\} to simply {R}= \Theta. It follows from eq.(7), eq.(4),and eq.(9), that

(13)
\max_{\{\Theta,  K \}   } H(p \Vert q, \Theta),  \,\!\,\Theta=\{\Theta, h\},

H(p \Vert q) =   \int  p(\Theta^- | {\mathcal X}_N)p(h|{\mathcal X}_N)p_h(X) \ln{[q({ X} |   \Theta^- )q(h)q( \Theta^-)]} d{X} d\Theta.  \,

It is interesting to observe the following special cases of handling a priori q( \Theta  )=q(h)q( \Theta^-):

  • Let q(h)=1 and q( \Theta^-)=1, similar to that discussed before eq.(8), \max_{h} H(p \Vert q,\Theta ) will force h=0 and thus we have p_h(X)=p_0(X), H(p \Vert q) =   \int  p(\Theta^- | {\mathcal X}_N) p_0(X) \ln{[q({ X} |   \Theta^- )]} d{X} d\Theta. In eq.(12), Stage I becomes to implement the ML learning on H(p \Vert q, \Theta )=\ln{[q({\mathcal X}_N| \Theta^- )]} while the stage II actually degenerated to implement AIC.
  • Let q( \Theta^-)=1 only, we are lead to one similar to the case of q(h)=p(h|{\mathcal X}_N) by eq.(10). Stage I implements the regularized ML learning for not only \Theta^- but also an appropriate hyperparameter h \ne 0, while the stage II is no longer simply AIC but a regularized extension.
  • Let q(h)=1 while q( \Theta^-) is given by eq.(11), i.e., q(\Theta^-) \propto 1/\sum_{t=1}^N q(x_t|\Theta^-), Stage I implements a Bayesian regularization with a new type of priori, while the stage II further improves via model selection.
  • The previous two cases were obtained as byproducts of BYY learning studies in 1997 under the name of data smoothing learning and normalization learning, respectively. Details are referred to (Xu, 2003) for a recent introduction and especially to Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks on studies by the present author since 1997. Moreover, the features of the two cases can be combined jointly with q(h) by eq.(10) and q( \Theta^-) by eq.(11).

Another direction on model selection and regularization

Being different not only from those existing two pathway approaches such as auto-association, LMSER self-organization, Helmholtz machine, and varianional Bayes that have not taken the over-fitting problem in consideration, but also from those typical model selection criteria that base on a computational expensive two-stage implementation, the BYY learning provides an alternative direction for tackling the over-fitting challenge with the following favorable features:

  • The scale of the BYY system is considered separately for the part k_Y\, and for the rest parts by k_q^-\, and k_p^-\,. Actually, the model selection problems of many typical learning tasks (Xu, 2005) can be reformulated into selecting only the k_{Y}\, part in a BYY system. Such a favorable separation makes both parameter learning for \Theta\, and model selection for k_{Y}\, implemented simultaneously during the maximization of eq.(8) in a new mechanism, namely automatic model selection during parameter learning. That is, with k_{Y}\, initialized at values large enough, an appropriate k_{Y}\, will be determined automatically during implementing parameter learning \max_{\Theta } H_f(p \Vert q)\, by eq.(8) that includes maximizing \int p( {Y}) \ln{q({Y}|\Theta_{y})} d{Y} \, with p( {Y})=\int p( {Y} | {X}, \Theta_{y|x}) p({X}) d{X}\,. Readers are referred to (Xu, 2004b&c, 2005, 2007a&b) for further details. Applied to several typical learning problems, adaptive algorithms have been developed with model selection automatically performed during implementing these algorithms.
  • As the size N\, of samples drops below a limit, the above automatic model selection performance will drop too. In such a situation, we can return to a two stage implementation of learning by eq.(12). The separation of k_Y\, and the rests makes their contributions to J(K)\, estimated in two different manners. The contribution by k_Y\, can be estimated more accurately while the contribution by the rest parts k_q^-\, and k_p^-\, still has to be estimated roughly in a way similar to those previously discussed typical model selection criteria. As a result, though the advantage of saving computing costs no longer exists, J(K)\, in eq.(12) as a whole can outperform these typical model selection criteria, which have been already confirmed by experiments (see the last section).
  • Moreover, the separation of k_Y\, from the rest also provides a general framework that integrates the roles of regularization and model selection such that not only the crucial weakness caused by isotropic regularization can be avoided by automatic selection mechanism on k_Y\,, but also rough approximation caused by handling the integrals can be remedied via imposing certain regularization. This regularization is imposed indirectly on one or more components in the Ying-Yang pair, e.g., on q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y}) \, by eq.(11), p_h({X}) by eq.(10) or variance equalization, and p( {Y} | {X}, \Theta_{y|x}) via one of particular structures such as posteriori structure, a mixture of several parametric p( {Y} | {X}, \Theta_{y|x}, \ell), \ell=1, \cdots, \kappa \,, and a Gaussian approximation in help of variance equalization. The details are referred to (Xu, 2004b&c, 2005, 2007a& b). Even interestingly, imposing a regularization on p_h({X}) \, or p( {Y} | {X}, \Theta_{y|x}) \, will not encounter a difficulty of controlling regularization strength as encountered by the conventional regularization approaches.
  • Furthermore, the BYY system acts as a general framework that integrates a number of typical statistical learning structures or models from a unified perspective, with new insights and motivations.
  • In addition, the BYY learning provides a natural perspective for developing alternative minimization algorithms that include and extend the well known EM algorithm, with a guaranteed convergence.

Readers are referred to (Xu, 1995, 2000, 2001a&b, 2002, 2003, 2004b&c, 2005, 2007a&b) for details and overviews, as well as results on a number of typical learning tasks, with some of them listed as follows:

  • Cluster analysis, Gaussian mixture, and mixture of shape-structures (including lines, planes, curves, surfaces, and even complicated shapes).
  • Factor analysis (FA) and local FA, including PCA, subspace analysis and local subspaces, etc.
  • Independent subspace analysis, including independence components analysis (ICA), binary factor analysis (BFA), nonGaussian factor analysis (NFA), and LMSER, as well as three layer net.
  • Independent state space analysis, including temporal factor analysis (TFA), independent hidden Markov model (HMM), temporal LMSER, and variants.
  • Combination of multiple inferences, including multiple classifier combination, RBF nets, mixture of experts, etc.

Relations to other approaches: links and differences

For a deep insight on BYY harmony learning, its relations to a number of existing learning approaches have been also explored in past years, which are briefly summarized from the following four aspects:

  • One early effort on automatic model selection started from Rival Penalized Competitive Learning (RPCL) for clustering analysis and curve detection (Xu, Krzyzak, & Oja, 1993). It is heuristically proposed on a level of updating learning rule, with an automatic model selection mechanism emerging collectively. In recent years, it has been shown that the algorithms derived from the BYY learning under different settings indeed demonstrate different versions of RPCL-like mechanism. Readers are referred to Sec. 23.7.3 in (Xu, 2004c) for a historic remark on RPCL-like mechanisms versus the BYY harmony learning.
  • The BYY learning relates to a family of generalized maximum likelihood (ML) learning approaches, featured by an EM like two pathway implementation (i.e., with the E step for the bottom-up pathway while the M step for the top-down pathway). Typical examples of this family include the EM algorithm implemented ML learning (Redner & Walker, 1984), information geometry based em-algorithm (Amari, 1995), Helmholtz Machine (Hinton, et al, 1995; Dayan, 2002), variational approximation (Jordan et al, 1999), the bits-back based MDL (Hinton & Zemel, 1994), etc. On one hand, the BYY system plus best Ying-Yang matching via minimizing Kullback divergence provides a unified perspective for not only this generalized ML learning family but also those best information transfer featured approaches. However, similar to the conventional ML learning, this generalized ML learning family as well as a best Ying-Yang matching featured BYY system can not handle the over-fitting challenge well because the challenge has not been intentionally taken in consideration. On the other hand, the BYY harmony learning, i.e., the BYY system plus the best Ying-Yang harmony by eq.(4), intentionally tackles this challenge and provides a new direction for model selection and regularization, as discussed in the previous section. Readers are referred to (Xu, 2002, 2004b&c, 2005, 2007a&b) for further details on the links and differences, as well as for a deep understanding via two different types of projection geometry (Xu, 2004a&b&c).
  • Having been interpreted from a two channel based best information transfer perspective (Xu, 2002, 2004b&c), the BYY learning also relates to the information theoretic approaches, such as AIC and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), MML (Wallace, 1999), and BIC (Schwarz,1978) / MDL (Rissanen, 1986), etc. However, as already discussed in the previous section, there is a key difference, that is, the separation of k_Y\, from the rest parts enables the BYY learning to make the scale K either determined automatically during parameter learning or estimated more accurately such that an improved criterion is obtained for a two stage implementation.
  • As discussed previously, the BYY learning also relates to regularization techniques with a new framework that integrates the roles of regularization and model selection.

There are also two other streams of studies on tackling the over-fitting challenge. One is on the cross validation (CV) approach (Stone, 1978; Rivals & Personnaz, 1999), which is a general technique for handling a small sample size. It may deserve to investigate whether the CV approach can be applied to further improve the BYY learning on a small size of samples. The other stream is on the VC dimension based generalization error (Vapnik, 1995), based on which the support vector machine is developed and has become a popular topic in the machine learning literature. Little is known theoretically on the relation of BYY learning to the VC dimension based generalization error. There has been only a preliminary discussion on a link between SVM and a BYY learning based kernel density estimation (Xu, 2001). A further effort may deserve to be made along this direction.

Readers are further referred to Sec.5.2 and Sec.5.3 in Xu (2007b) for a number of open problems and challenges.

Experiments on local factor analysis

We take the problem of local factor analysis in Fig.3 as an example to show how the BYY harmony learning by eq.(5) works via comparative experiments. In this case, samples in \mathcal{X}_N \, are i.i.d., and each sample x \, has its inner representation \{y, \ell\} \, with a real vector y \, and a integer label \ell \,. The corresponding BYY system is shown in Fig.4 and eq.(5) becomes

(14)
H(p \Vert q, \Theta ) \,\!\,=   \sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell|x_t)H_t(\Theta_{\ell})-Z(\Theta ), \,\!\, H_t(\Theta_{\ell}) \,\!\, given in Fig.4,

which is maximized via an iterative procedure that implements a Yang step and a Ying step alternatively. At the Yang step, \ell_t, y_{\ell, t}, W_{\ell} \, are estimated as shown by the left side of Fig.4 while the parameters in \theta_{\ell}=\{ m_{\ell}, \Sigma_{\ell}, U_{\ell}, \Lambda_{\ell},\alpha_{\ell} \}_{\ell=1}^k \, are updated at the Ying step via computing the gradient \nabla_{\Theta_{\ell}} H_t(\Theta_{\ell}) \, or {\mathcal P}\nabla_{\theta_{\ell}} H_t(\Theta_{\ell}) \, with {\mathcal P} \, being a positive definite matrix, subject to the constraints that \Sigma_{\ell} \, is positive definite, \Lambda_{\ell}=diag[\lambda_{\ell,1}, \cdots, \lambda_{\ell,m_{\ell}}]  \, with \lambda_{\ell,j}\ge 0 \,, U_{\ell}^TU_{\ell}=I \,, and \alpha_{\ell} \ge 0, \sum_{\ell=1}^k\alpha_{\ell}=1 \,. Moreover, there are three choices for the parameter h^2 \,, as already introduced around eq.(9) and eq.(10).

During implementing the iterative procedure, \lambda_{\ell,j} \, will tend to zero if the corresponding dimension is extra and thus discarded; \alpha_{\ell} \, also tends to zero if the corresponding Gaussian cluster is extra and thus discarded too. As a result, appropriate k^*, \{m_{\ell}^*\} \, are determined automatically during learning, i.e., automatic model selection is made during parameter learning. Alternatively, on a small sample size N \,, a better k^*, \{m_{\ell}^*\} \, can be selected by a two stage implementation in help of eq.(12), with the Stage II detailed as follows

(15)
\begin{array}{l} \{k^*, \{m_{\ell}^*\}\}=arg\min_{k, \{m_{\ell}\}} \{ J( k, \{m_{\ell}\}) + 0.5 d(\theta^*)\}, \\  J( k, \{m_{\ell}\})=\frac{1}{2}\sum_{\ell=1}^k \alpha_{\ell}[\ln{|\Sigma_{\ell}|}+ \ln{|\Lambda_{\ell}|}+ m_{\ell}( \ln{(2\pi)}+1)] - \sum_{\ell=1}^k \alpha_{\ell} \ln{\alpha_{\ell}} \\ +\sum_{\ell=1}^k \alpha_{\ell} \{ h^2Tr[\Sigma_{\ell}^{-1}] +Tr[\Gamma_{\ell}(U_{\ell}\Sigma_{\ell}^{-1}U_{\ell}^T+\Lambda_{\ell}^{-1})] \} - \frac{kd +\sum_{\ell=1}^k  m_{\ell}}{N}, \end{array}

where d(\Theta^*) \, is approximately given by the number of free parameters within the model in Fig.4, i.e., we approximately let d(\Theta^*) =dk+k-1+ 0.5d(d+1)k+ \sum_{\ell=1}^{k}(m_{\ell}+d_{U_{\ell}}), and d_{U_{\ell}}= m_{\ell}d-0.5m_{\ell}(m_{\ell}+1) \,.

First, experiments have been made with a large number of simulated data sets generated by varying k, d, m_{\ell}, \Sigma_{\ell}=\Psi^2_{\ell}I \,. The task aims at checking whether k, \{ m_{\ell}\}\, can be correctly selected. For comparison, we conduct the maximum likelihood (ML) learning by the EM algorithm for parameter learning, and then make model selection on k, \{ m_{\ell}\}\, by several typical criteria, including AIC and its modification CAIC, BIC or equivalently MDL, cross-validation (CV). Moreover, it is also intented to make a comparision with a VC-dimension based SRM error bound. After an extensive search of the existing literature, only one criterion has been found for selecting k\, on a Gaussian mixture (Wang & Feng, 2005), but there is no criterion available for local factor analysis. Via q(x, \ell) \, in Fig.3, we have been able to use the criterion in (Wang&Feng, 2005) for k\, but unable to determine the hidden factor number m_{\ell} \, for each Gaussian component. Furthermore, comparisons have also made with two algorithms that makes selection on k, \{ m_{\ell}\} \, incrementally such that the huge computing cost by a two stage implementation of ML + Criterion can be significantly saved. One is the variantional inference for Bayesian mixtures of factor analyser (VBMFA) (Ghahramani and Beal, 1999) and the other is named incremental mixture of factor analyzer (IMoFA) (Salah & Alpaydin, 2004).

In a correspondence to those criteria, we implement BYY-C, i.e., the BYY harmony learning via a two stage implementation by eq.(12) together with eq.(15), while in a correspondence to VBMFA and IMoFA, we implement BYY-A, i.e., the BYY learning with automatic model selection. Both performances and computing times are compared in experiments.

Bayesian Ying Yang learning
Enlarge
Figure 5: The results on three simulated data sets with samples of a small N=40 \,, medium N=5\times 40=200 \,, and large size N=5\times 200=1000 \,, respectively, with d=5, k=3, m_{\ell}=3 \, and \Psi^2_{\ell}=0.2\zeta_{\ell}, \ \zeta_{\ell}=\min[\lambda_{\ell,1}, \cdots, \lambda_{\ell,m_{\ell}}] \,. The counts (e.g., 76 in the 1st block) indicate the times that the scales (k,m)\, (e.g., k=3,m=3) has been confirmed in 100 experiments repeated with different initializations. The BYY learning is made by setting \Gamma_{\ell}=0\, for each \ell  \, but with h^2>0 \, learned via q(h)=p(h|{\mathcal X}_N)\, by eq.(10). The computing times are made by MATLAB 7.0.1(R14) on a P4 3.2GHz 1GB DRAM PC, in a form mean\pm standard \ variance.

Given in Fig.5 are experimental results on three simulated data sets with samples of a small, medium, and large size, respectively. For those existing criteria plus VBMFA and IMoFA, it can be observed that no one is always best. Instead, some performs better in one case while some other performs better in another case. Interestingly, BYY-C considerably outperforms all these criteria as well as VBMFA, IMoFA, and BYY-A, and BYY-A outperforms its counterparts VBMFA and IMoFA, while VBMFA and IMoFA perform quite similarly. Moreover, the computing times used by BYY-A and IMoFA are similar but both are only 3\%-30\% \, of the computing times by two stage implementation based criteria and BYY-C. Though being inferior to BYY-C, BYY-A is still better or comparable to the one that performs best among the criteria as well as VBMFA and IMoFA, with a considerable computing cost saving. Again, the observations can be consistently obtained from Fig.6 with comparisons made on 27 data sets.

Bayesian Ying Yang learning
Enlarge
Figure 6: Comparisons made on 27 data sets that generated by varying the sample size N, the dimension d of sample space, and the variance \Psi_l^2 of noise, with each taking three levels as shown at the top-left corner. For a compact expression, only the correct rate obtained in 100 experiments are given. E.g., the blue diagonal ray (1)-(7) at the top-right corner indicates experiments on three data sets featured by (N=1000, d=5, \Psi_l^2=0.2\varsigma_{\ell}), (N=200, d=7, \Psi_l^2=0.5\varsigma_{\ell}), and (N=40, d=9, \Psi_l^2=0.8\varsigma_{\ell}), respectively. For CV-5, the corresponding correct rates are 87, 69, and 60, respectively.

Second, experiments also have been made on a number of real world data sets for pattern recognition tasks. On these data sets, it is hard to directly check whether k, \{ m_{\ell}\} \, are appropriate. Instead, what we can directly compare with are the average classification rates on the testing sets. Shown in Fig.7 are the comparison results on several widely used data sets. In favor of saving computing cost for a real application purpose, we only take BYY-A to compare with other approaches. Again, it can be observed that BYY-A outperforms the others in most cases, with a computing time similar to IMoFA but 3\%-30\% \, of the computing times needed by those two stage implemented criteria.

Bayesian Ying Yang learning
Enlarge
Figure 7: Experiments on eight real world data sets. On each data set, 20 independent runs are made for learning from different initializations, and then tested on the testing sets, with the classification rates given in the form mean\pm standard \ variance \,.

The above experiments are all made by Mr. Lei Shi. Readers are referred to Web-link II for further details and also experiments on other data sets as well as the applications on two widely used handwritten digits databases MNIST and CEDAR.

External links

Web-link I: For further readings on Bayesian Ying Yang learning

Web-link II: For details of the experiments and For other experiments and applications

References

  • Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.
  • Amari, S (1995), "Information geometry of the EM and em algorithms for neural networks", Neural Networks 8(9): 1379-1408.
  • Bourlard, H & Kamp, Y (1988), "Auto-association by multilayer Perceptrons and singular value decomposition", Biological Cybernetics 59: 291-294.
  • Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53 (3): 407-415.
  • Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.
  • Ghahramani, Z & Beal, MJ (1999), "Variational inference for Bayesian mixtures of factor analysers", Advances in Neural Information Processing Systems 11, 1999, 449-455.
  • Grossberg, S (1976), "Adaptive patten classification and universal recording: I &II", Biological Cybernetics 23: 121-134 and 23: 187-202.
  • Grossberg, S & Carpenter, GA (2002), "Adaptive Resonance Theory", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, 87-90.
  • Hinton, GE, Dayan, P., Frey, BJ, & Neal, RN (1995), "The wake-sleep algorithm for unsupervised learning neural networks", Science, 268:1158-1160.
  • Hinton, GE & Zemel, RS (1994), "Autoencoders, minimum description length and Helmholtz free energy", Advances in NIPS, 6, 3-10.
  • Jordan, MI, Ghahramani, Z, Jaakkola, TS & Saul, LK (1999), "An Introduction to Variational Methods for Graphical Models ", Machine Learning 37(2): 183-233.
  • Poggio, T & Girosi, F (1990), "Networks for approximation and learning", Proc. of IEEE 78: 1481-1497.
  • Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
  • Rissanen, J (1986), "Stochastic complexity and modelling", Annals of Statistics 14 (3): 1080-1100.
  • Rivals, I & Personnaz, L (1999), "On Cross Validation for Model Selection", Neural Computation 11: 863-870.
  • Salah, A & Alpaydin, E (2004), "Incremental mixtures of factor analysers", Proc.17th Intl Conf. on Pattern Recognition, vol.1, 276-279.
  • Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.
  • Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.
  • Tikhonov, AN & Arsenin, VY (1977), Solutions of Ill-posed Problems, Winston and Sons.
  • Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
  • Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
  • Wang, L & Feng, J (2005), "Learning Gaussian mixture models by structural risk minimization", "Proc. 4th Intl Conf. Machine Learning and Cybernetics", 18-21 August, 2005, Guangzhou, China, Vol. 8, 4858-4863.
  • Xu, L (2007a), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition, Article in Press, doi:10.1016/j.patcog.2006.12.016.
  • Xu, L (2007b), "Trends on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in Challenges to Computational Intelligence (in press), Duch, W, Mandziuk, J & Zurada, JM eds, Springer-Verlag, 2006.
  • Xu, L (2005), "Fundamentals, Challenges, and Advances of Statistical Learning for Knowledge Discovery and Problem Solving: A BYY Harmony Perspective", Keynote talk, Proc. of Intl. Conf. on Neural Networks and Brain, Oct. 13-15, 2005, Beijing, China, Vol. 1, 24-55.
  • Xu, L (2004a), "Temporal BYY Encoding, Markovian State Spaces, and Space Dimension Determination", IEEE Trans on Neural Networks 15( 5): 1276-1295.
  • Xu, L (2004b), "Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor auto-determination", IEEE Trans on Neural Networks 15(4): 885-902.
  • Xu, L (2004c), "Bayesian Ying Yang Learning", Intelligent Technologies for Information Analysis", N. Zhong and J. Liu (eds), Springer, 615-706.
  • Xu, L (2003), "Data smoothing regularization, multi-sets-learning, and problem solving strategies", Neural Networks, Vol. 15, No.5-6, 817-825.
  • Xu, L (2002), "Bayesian Ying Yang Harmony Learning", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, pp1231-1237.
  • Xu, L (2001), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-Layer Nets and ME-RBF-SVM Models", Intl J of Neural Systems 11(1):43-69.
  • Xu, L (1995), "Bayesian-Kullback Coupled YING-YANG Machines: Unified Learnings and New Results on Vector Quantization", Proc. Intl. Conf. on Neural Information Processing (ICONIP95), Beijing, Oct 30-Nov.3, 1995, pp.977-988.
  • Xu, L, Krzyzak, A & Oja, E (1993), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Trans on Neural Networks 4: 636-649.
  • Xu, L (1991&93) "Least mean square error reconstruction for self-organizing neural-nets", Neural Networks 6: 627-648, 1993. Its early version on Proc. IJCNN91'Singapore, 2363-2373, 1991.

Internal references

External links

See also


Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809, (go to the first approved version)
Created: 31 July 2006, reviewed: 15 February 2007, accepted: 22 March 2007
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors