Bayesian Ying Yang learning

From Scholarpedia

Lei Xu (2007), Scholarpedia, 2(3):1809. revision #60628 [link to/cite this article]

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong


Bayesian Ying-Yang learning is a statistical learning theory for a two pathway featured intelligent system via a Ying representation and a Yang representation, i.e., two complementary Bayesian representations of the joint distribution on the external observation and its inner representation. The system architecture is built under three general designing principles and all the rest unknowns in the system are learned from a set of samples under a Ying-Yang best harmony principle.

Firstly proposed in 1995 and systematically developed over a decade, Bayesian Ying-Yang learning provides not only a general framework that accommodates several typical learning theories and approaches under a unified perspective but also a new theory that leads to both improved model selection criteria and learning algorithms with automatic model selection. Moreover, it is able to implement best harmony by Ying-Yang alternative maximization and to cooperate model selection via Ying representation and learning regularization via Yang representation.


Contents

Two types of abilities and two pathway approaches

It is speculated that a learning system or a biological brain survives in its world with at least two types of intelligent abilities. Implemented by a top-down or outbound pathway, Type-I consists of abilities of 'fitting' its world, including not only understanding ability to explain its world but also motoring ability to track the changes in its world. On the other hand, implemented by a bottom-up or inbound pathway, Type-II consists of problem solving skills, ranging from perceiving what events encountered to generating signals that activate the outbound pathway. Both the pathways were initially set up via biological inheritance and then further developed gradually during interactions between the brain system and its surviving world.

The outbound pathway is developed mainly via a learning process for mining the common features or discovering regularities among an ensemble of uncertain evidences (or called samples) from the world. While the inbound pathway is demanded to fast implement problem solving timely, usually per one or several samples encountered. This inbound pathway is developed either adaptively via learning from previous problem solving experiences or off-line via seeking problem solving strategies based on the Type-I knowledge (i.e., regularities or dependencies about the world) by inference and optimization, and then designing the strategies into certain fast implementable functions.

The two-pathway approach has been adopted in the literature of modeling a perception system for decades. One early example is the adaptive resonance theory developed in the 1970s (Grossberg, 1976; Grossberg & Carpenter, 2002), featured by a resonance between bottom-up input and top-down expectation in help of a mechanism motivated from a cognitive science view. In the past one or two decades, efforts have further been made on multilayer net featured two-pathway approaches under guidance of a statistical principle, e.g., under the least mean square error criterion by the auto-association (Bourlard & Kamp, 1988) and the LMSER self-organization (Xu, 1991 & 93), and under the Helmholtz energy by the Helmholtz machine and sleep-wake learning (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002). The basic spirit of LMSER self-organizing was further developed into the Bayesian Ying-Yang (BYY) learning in the mid-1990s, not only with multi-layer two-pathway replaced by a general system via two complementary Bayesian representations but also with the least mean square error criterion replaced by a best Ying-Yang harmony criterion (Xu, 1995, 2002).


Over-fitting, model selection, and regularization

The hardware of a biological brain is featured by an alive body with its organization specified genetically. Such an organization, regardless whether or not in a two pathway form, can be mathematically denoted by a sophisticated manifold S(X, \Theta)\, \, with a parametric architecture or structure S(\cdot, \cdot)\, to accommodate all functions or tasks that are expected to perform. To build up an artificial intelligent system, our task is to set up such a S(X, \Theta)\, \,.

Only from X\,\,, we are unable to specify an appropriate S(\cdot, \cdot)\,. It needs to be appropriately designed, which is task-dependent and needs knowledge about the structure underlying the samples of X\,\,. Except some particular situations, we usually have no such type of knowledge. Instead, we consider S(\cdot, \cdot)\, to be a combination of a number of individual simple structures via a simple combination scheme. Shown in Fig.1(a) are two examples of such combined architectures. In general, we consider a family of infinite many structures \{S_{k}\} with each S_{ k} in a same configuration but in different scales, each of which is labeled by a scale parameter {k} in term of one integer or a set of integers that accounts for the scale of the combined architecture, i.e., the number of involved simple structures. Given k\,\,, we also need to determine \Theta\,\, that consists of all the unknown parameters within the combing architecture. Usually, what we have is a given set \Chi_N\,\, of samples with a finite size N\,\,, we need to determine both k\,\, and \Theta\,\,. In other words, we need to specify one S_k = S_k(X, \Theta)\,\, to model or represent the given set \Chi_N\,\,. More specifically, the task of determining \Theta\,\, at a given k\,\, is usually called parameter learning, while the task of selecting an appropriate scale k\,\, is called model selection since we select among a set of candidate models as the value of k\,\, varies.

Bayesian Ying Yang learning
Enlarge
Figure 1: A combination of a series of individual simple structures (a) two examples, (b) fitting error curve. Where { k} is enumerated via k_1\prec k_2 denoting that k_1 proceeds k_2, i.e., S_{k_1} is a part (or called a substructure) of S_{k_2}. If k consists of only one integer, k_1\prec k_2 becomes simply k_1<k_2.

We expect that S_k\,\, best fits or represents \Chi_N\,\, such that the error of usingS_k\,\, to represent \Chi_N\,\, becomes the least. With such a type of best fitting principle, we expect to determine \Theta^*_k\, at a given k\,\,. As shown in Fig.1(b), its fitting error monotonically decreases as k\, grows up, until the error reaches zero at k=k_N\,\,, a value that relates to N\, and is usually much bigger than an appropriate k^*\,\,. In this case, a large part of structure has actually been used to fit noises or outliers. This is usually called ''over-fitting'' problem, for which we cannot get an appropriate k^*\,\,. Only taking parameter learning in consideration, the previously mentioned two pathway approaches, (such as auto-association, LMSER self-organization and Helmholtz machine, etc), can not avoid this over-fitting problem too.

One major direction for tackling the problem is a two stage implementation, with parameter learning made on a set of candidate models among which one is selected by a model selection criterion. Beyond the best-fitting principle, such a criterion is obtained from a different statistical learning theory. In literature, there exist several typical model selection criteria, including Akaike information criterion (AIC) and extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian approach related criteria under the names of Bayesian inference criterion (BIC) (Schwarz, 1978), Minimum message length (MML) (Wallace & Boulton,1968; Wallace & Dowe, 1999), and Minimum description length (MDL) (Rissanen, 1986, 2007), and Cross validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999). Unfortunately, not only a two stage implementation is very expensive to compute, but also these criteria can only provide a rough estimation or even a wrong estimate because estimating performance deteriorates considerably when N\,\, is small and k\,\, consists of more than one integers.

Another direction for tackling over-fitting is called regularization. Instead of searching an appropriate k^*\,\,, we consider a S_k\,\, with its scale k\,\, large enough to accommodate the regularity underlying \Chi_N \,\,, and then impose certain constraint on either or both of \Theta\,\, and S_k\,\, (Tikhonov & Arsenin, 1977; Poggio & Girosi, 1990) to specify one \Theta^*\,\, for approximately describing a true structure that is actually in a lower scale. Equivalently, the situation also applies to the cases that \Chi_N \,\, comes from a underlying structure with a known scale k\,\, while the size N\,\, is not large enough to specify a unique \Theta^*\,\,. Unfortunately, it is difficult to appropriately choose what type of constraint to impose. Instead, it is imposed in an isotropic or uniform manner, which usually does not work. For an example, if we use a polynomial of a degree k >3\, to fit a set of samples from y=x^2+ 3x+ 2\,\,, a model selection purpose desires to force all the parameters \{ a_i, i \ge 3\}\, to be zero, but which can not be achieved by an isotropic or uniform regularization, e.g., minimizing a_1^2+a_2^2+\cdots + a_k^2\, fails to treat the parameters \{ a_i, i \ge 3\}\, differently from those \{ a_i, i \le 2\}\,. Actually, such a regularization disturbs or conflicts with the purpose of model selection. Even given a type of constraint, another difficulty is to control an appropriate regularization strength, which is able to be roughly estimated only for a rather simple structure via either handling the integral of marginal density or in help of cross validation with extensive computing costs (Stone, 1978; Rivals & Personnaz, 1999).


Three levels of tasks from inverse problem perspective

Figure 2:   Three levels of inverse problems
Enlarge
Figure 2: Three levels of inverse problems

The learning involved tasks discussed above can be further summarized into the following three levels from an inverse problem perspective as illustrated in Fig.2:

  • Inference or Relaxation \quad Provided with an observation X that can be regarded as either generated from an inner representation Y or a consequence from a cause Y via a given mapping G: Y \to X, the Type-II ability makes an inverse inference X \to Y, as shown in Fig.2(a). When G: Y \to X is one-to-one, its inverse mapping F: X \to Y is analytically solvable. Generally, the task of determining Y\, from X\, is involved in reasoning, mapping, representation, etc. It is usually not a simple task due to various reasons. One is due to uncertainty incurred externally by observation noises, which can be described by a distribution q(X | Y) for a probabilistic mapping Y \to X. Uncertainty also origins internally from a mapping G: Y \to X of many-to-one or infinite many to one, which can be considered by q(X | Y) plus a distribution q(Y) for each reasonable cause or inner representation Y, as shown in Fig.2 (b). Actually, q(X | Y) and q(Y) jointly act as the knowledge of Type I, based on which a Type-II ability is obtained via statistical inference methods, as shown in the column (a) of the table in Fig.3. Go further beyond, Y\, may be indirectly determined via a dynamics of minimizing an energy or cost that incurs from violating certain constraints in S\,, under the term of relaxation.
Figure 3:  Typical Statistical Inference Methods for  Inverse Problems
Enlarge
Figure 3: Typical Statistical Inference Methods for Inverse Problems
  • Parameter Learning\quad The second level of inverse problems considers the situations that q(X|Y) and q(Y) are unknown but provided with their parametric structures q(X|Y,\theta_{x|y}) and q(Y|\theta_y) with \Theta=\{\theta_{x|y},\theta_{y}\}. As illustrated in Fig.2 (c), the scenario becomes that we get a set of samples \Chi_N generated from a map \Theta \to \Chi_N, and the task is getting an inverse mapping \Chi_N \to \Theta, usually referred by the term estimation or parameter learning for \Theta. Actually, the task of determining \Theta\, also includes an inverse inference X \to Y as a subtask. While X \to Y is made per sample of X but \Chi_N \to \Theta collectively bases on all the samples. We may also update \Theta\, as samples come, but in a speed much slower than that Y\, varies as one or a subset of samples come. Via either a marginal distribution \begin{array}{l} q(X |\Theta)=\int q(X |Y, \theta_{x|y})q(Y|\theta_y)dY\end{array} or directly a parametric structure q(X |\Theta) with a unknown parameter set \Theta, we encounter the inverse problem shown in Fig.2 (c) with uncertainties described by q(\Chi_N |\Theta) and q(\Theta). Again, one most widely studied direction consists of statistical inference methods, as shown in the column (b) of the table in Fig.3.
  • Model Selection\quad The third level of inverse problems considers a family of structures \{S_{k}\} with each S_{ k} in a same prespecified configuration but in a unknown scale k. Now, a set of samples \Chi_N is generated from a map S_{k} \to \Chi_N and the task is getting an inverse mapping \Chi_N \to S_{k} or equivalently \Chi_N \to k, which is usually called model selection. This task also includes the subtask of determining \Theta\,\, in S_{k}(\Theta) for each k\,\,. Via \begin{array}{l} q(X | S_{k})=\int q(X | S_{k}(\Theta))q(\Theta)d\Theta \end{array}, we encounter the inverse problem shown in Fig.2 (d), with uncertainties described by q(X | S_{k}) and q(k), which is tackled by statistical inference methods shown in the column (c) of the table in Fig.3.

From these inverse problems and Fig.3, we get a general perspective to summarize typical statistical approaches for the three levels of learning tasks. The first three items in the columns (b)&(c) summarize those widely studied approaches of parameter learning (e.g., ML, MAP, etc) and model selection approaches ( e.g., BIC, MDL, MML, etc), while the first three items in the column (a) cover typical inverse solving techniques, either nested in parameter learning or used independently in tasks such as classification, inference, and control.

Though having extensive applications, these classical approaches still suffer at least three kinds of limitations.

  • The integrals over Y and \Theta\,\, are analytically solvable only in some specific distributions, but become intractable in many cases. One way for this problem is using Laplace approximation (Schwarz, 1978). The other direction is featured by either variational approximations (Hinton & Zemel, 1994; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Jaakkola, 2001) or BYY learning, which are related to the LPD items in Fig.3 and will be further introduced later.
  • The difficulty of getting an appropriate q(\Theta) is similar to the previously discussed difficulty of choosing what type of constraint for regularization. Many efforts have been made on this topic in the literature of machine learning, ranging from Jeffreys prior and other improper priors to Dirichlet prior and other conjugate priors.
  • Fig.3 merely summarizes, instead of unifying, these statistical approaches. Also, it fails to cover other information theoretic related approaches. Moreover, it merely considers the bottom-up pathway as an inverse of the top-down pathway with other possible scenarios ignored.


Bayesian Ying-Yang system : A unified framework

As shown in Fig.4, a set X=\{ x \} of samples are regarded as generated via a top-down path from its inner representation R=\{ Y, \Theta \}, not only with a long term memory \Theta that is a collection of all unknown parameters in the system for collectively representing the underlying structure or regularities among data X as the knowledge about the world, but also with a short term memory Y that each element y \in Y is the corresponding inner representation of one element x \in X. A mapping R \to X and an inverse X \to R are jointly considered via the joint distribution of X, R in two types of Bayesian decomposition:

(1)
p({ X}, { R})=p({ R}| { X})p({ X}),   \  q({ X}, { R})=q({ X} | { R})q({ R}),  \,
Figure 4:   Bayesian Ying-Yang System
Enlarge
Figure 4: Bayesian Ying-Yang System

In a compliment to the famous Chinese ancient Ying-Yang philosophy, the decomposition of p({  X}, { R})\,\, coincides the Yang concept with a visible domain p({ X})\,\, for a Yang space and a forward pathway by p({ R}| { X})\,\, as a Yang pathway. Thus, p({ X}, { R})\, is called Yang machine. Similarly, q({ X}, { R})\,\, is called Ying machine with an invisible domain q({R})\,\, for a Ying space and a backward pathway by q({ X} | { R})\,\, as a Ying pathway. Such a pair of Ying-Yang machines is called Bayesian Ying-Yang (BYY) system. It actually consists of two layers. The back layer accommodates \Theta\, and supports the front layer by a priori q(\Theta|\Xi), while p({ R} | { X})\,\, consists of the posteriori p(\Theta|X, \Xi) that transfers the knowledge from observations to the back layer. The front layer is itself is a Ying-Yang pair for X \to Y and Y \to X, modeled in a parametric Ying-Yang pair p({ X}, { Y}|\Theta_p ),   \  q({ X}, { Y}|\Theta_q ) where \Theta =\{\Theta_p, \Theta_q\} and \Theta_q =\{\Theta_y, \Theta_{x|y} \}, \Theta_p=\{\Theta_{y|x}, \Theta_x\}\,.

Taking a key role in the information flow within the front layer, Y\, is supported by a parametric substructure q(Y|\Theta_y)\,. On one hand, Y\,\, is the source of the information flow to fit the observations \Chi_N \, via the top-down pathway q({ X} | Y, \Theta_{x|y} )\, that implements the abilities of Type I. On the other hand, featuring the abilities of Type II, this Y\, comes from the information flow via a bottom-up pathway p(Y| { X}, \Theta_{y|x})\,. The input to the system comes from a sample set \Chi_N=\{x_t\}_{t=1}^N \, either in a smoothed form p({ X} |\Theta_x )= G(X-\Chi_N|0, h^2I) or directly \Chi_N \, by setting h=0, where G(u|\mu, \Sigma) denotes a Gaussian density of vector u with a mean vector \mu and a covariance matrix \Sigma. We need to design appropriately other three substructures, with different designs performing different learning tasks. In general, designs are guided by the following three principles, respectively:

Figure 5:  Several choices for measuring the uncertainty or  information contained in a distribution
Enlarge
Figure 5: Several choices for measuring the uncertainty or information contained in a distribution

Least Redundancy Principle \quad Subject to the nature of learning tasks, q({R})\,\, should be in a structure with the inner representation of \Chi_N \, encoded with a redundancy as least as possible. First, the number of variables and parameters of R=\{ Y, \Theta \} should be as less as possible, which is itself the model selection task that is beyond the task of design. Second, the dependences among these variables should be as least as possible, which is possibly or at least partly to be considered during design. E.g., when Y consists of multiple components Y=\{Y^{(j)}\}, we may design \begin{array}{l} q(Y|\Theta_y)=\prod_j q(Y^{(j)}|\theta_y^{(j)}) \end{array}.

Divide and Conquer Principle \quad Subject to the representation formats of X and Y, a complicated mapping Y \to X is modeled by designing q({ X} | { R})=q({ X} | Y, \Theta ) via a mixture of a number of simple structures. E.g., we may design q({ X} | Y, \Theta ) via a mixture of a number of linear regression featured by Gaussians of Y conditional on X. In some situation, it is not necessary to design q({R}) and q({ X} | { R}) separately. Instead, we design q({ X},   { R}), especially q({ X},  Y, \Theta_q ) via a single parametric model as a whole but still attempting to follow the above two principles. One example is the one encountered in Type B of Rival penalized competitive learning, actually via a product Gaussian mixture (see Scholarpedia page Rival penalized competitive learning). In general, we can consider an integral measure {\mathcal M}({ X},  Y, \Theta_q ) to get

(2)
\begin{array}{l}  q({ X},  Y, \Theta_q ) =  {\mathcal M}({ X},  Y, \Theta_q )/\int  {\mathcal M}({ X},  Y, \Theta_q )dXdY. \end{array}

Uncertainty Conversation Principle In a compliment to the Ying-Yang philosophy, Ying is primary and thus is designed first, while Yang is relatively secondary and thus is designed basing on Ying. Moreover, as illustrated by the Ying-Yang sign located at the top-left of Fig.4, the room of varying or dynamic range of Yang should balance with that of Ying, which motivates to design p(X, R) (in fact, only p({ R} | { X}) because p({ X} |\Theta_x ) is given by \Chi_N \, already) under a principle of uncertainty conversation between Ying-Yang. In other words, Yang machine preserves a varying room or dynamic range that is appropriate to accommodate uncertainty or information contained within the Ying machine, i.e., U( p)= U( q) under a uncertainty measure U( p) in one of the choices given in Fig.5. Since a Yang machine consists of two components p({ R}| { X})p({ X}), we may have one or two of the following uncertainty conversations:

(3)
\begin{array}{l} U( p({ R} |{ X}))= U( q({ R} |{ X})), \,  q({ R} |{ X})= {q({ X}, { R})\over q({ X})}, \  U(p({ X} |\Theta_x ))= U( q({ X})), \,  q({ X})=\int q({ X}, { R})dR.   \end{array}

One typical case of Likelihood based (a) is p({ R} |{ X})=  q^{\gamma}({ R} |{ X}) with a scalar 0\le \gamma< \infty. It includes Likelihood based (b) at \gamma=1< \infty, which represents the strongest conversation that can be achieved between Ying and Yang since p({ R} , { X})=  q({ R} , { X}) is unreachable as p({ X} |\Theta_x ) fixed by \Chi_N \, already. Moreover, it usually encounters an implementing difficulty to get q({ X}), except some special cases that the above intergal over R is either analytically solvable or becoming a inexpensively computable sum. One other typical case is the Hessian based (b) for avoiding the computing difficulty of handling the intergal over R. As previously discussed (Xu, 2004b, p889), this is consistent to the celebrated Cramér-Rao inequality stating that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of { R}, while -\nabla_{RR^T} \ln{q({ X}, { R})} provides an approximate Fisher information matrix.

Still, we can use the notation S_k(\Theta)\,\, to refer a BYY system, with its configuration S_k\,\, specified by the above design and with \Theta =\{\Theta_p, \Theta_q\} consisting of unknown parameters in both Ying and Yang. Moreover, the scale k\,\, features the complexity of R, including the scale k_Y\, for representing Y\, and the rest part that is contributed by \Theta. The remaining task consists of parameter learning for determining \Theta\,\, and model selection for selecting an appropriate scale k\,\,.

The BYY system provides a unified framework on a number typical learning tasks in the following senses:

  • It includes the three levels of inverse problems discussed in the previous section.
  • It applies to not only the cases that \Chi_N=\{ x_t \}_{t=1}^N of random vectors that are independent and identically distributed (i.i.d.), as shown in Fig.6(a), but also the cases that there are certain temporal structure underlying \Chi_N via q(Y|\Theta_y) with a built in temporal dependence (Xu, 2001b, 2004a).
  • It implements typical intelligent tasks via special cases of the inner representation \{y_t, \ell_t\} per sample x_t. When there is only \ell_t that takes several labels, x_t\to \ell_t implements a pattern classification or decision making. When there is only \ell_t of a binary vector, x_t\to \ell_t implements an inner encoding task. When there is only y_t of a real vector, x_t\to y_t implements an dimension reduction or generating control signal. Each of them can be regarded as a subtask of a combined general task by x_t\to \{y_t, \ell_t\}.
  • It covers both unsupervised learning and supervised learning. Conventionally, a learning that only bases on input sample x_t is called unsupervised, while a learning that bases on a sample of an input-output pair \{ x_t, z_t \} is called supervised. The latter differs from the former in that it focuses on the dependence relation x_t\to  z_t. In a BYY system, we can simply regard \{ x_t, z_t \} jointly as an input, with the pair \{ x_t, z_t \} taking the place of x_t, that is, we have
(4)
\begin{array}{l} p(\{ X, Z\},{ Y}|\Theta_p )=p( Y|X, \theta_{y|x} )p(X )p(Z|X,Y),   \  q(\{ X, Z\}, { Y}|\Theta_q )=q(X| Y, \theta_{x|y} )q({ Y}|\theta_y )q( Z |X, { Y}, \theta_{z|x,y} ),  \end{array}
\begin{array}{l} \mbox{with}\  p({ X}   )= \prod_{t=1}^NG(x-x_t|0, h^2_xI), \   p(Z|X,Y) =\prod_{t=1}^N G(z(x_t)-z_t|0, h^2_zI)\end{array} both as input to the system. The latter comes from that z_t becomes independent of y_t once knowing the pairing \{ x_t, z_t \}. Typically, \begin{array}{l}    q( Z |X, { Y}, \theta_{z|x,y} ) \end{array} is in one of two special cases as shown in Fig.6(a).
Figure 6:  (a) BYY system  in  i.i.d. cases (front layer), (b) Local factor analysis (LFA).
Enlarge
Figure 6: (a) BYY system in i.i.d. cases (front layer), (b) Local factor analysis (LFA).

For a further insight, we observe Fig.6(a) for the details of a BYY system in i.i.d. cases, which is actually the front layer of the BYY system in Fig.4 with eq.(4). Following the principle by eq.(3) , we design p(\ell | x, \theta_{\ell|x}) according to the likelihood based measure and get the variance of p(y | { x},   \ell,   \theta_{y|x, \ell}) according to the Hessian matrix measure. Moreover, we remove the integral of y to get \rho( x, \Theta_{\ell}^q) by Laplace approximation. In addition to modeling \Chi_N=\{ x_t \}_{t=1}^N, we may also implement x\to z by the following distribution q( z |x ) or the regression E( z |x ):

(5)
\begin{array}{lll} q( z |x ) =\sum_{\ell} p(\ell | x, \theta_{\ell|x})q( z |x, \ell ), & and \quad   E(z|x, \ell) =\sum_{\ell} p(\ell | x, \theta_{\ell|x})E(z|x, \ell). \\ q( z |x, \ell ) =\begin{cases}   q( z |x,  \ell, \theta_{z|x, \ell}),  & (a), \\ \int q( z |y, \theta_{z|y, \ell})p(y | { x},   \ell,   \theta_{y|x, \ell})dy,   & (b). \end{cases} & and \quad    E(z|x, \ell)=\begin{cases}   \int z q( z |x,  \ell, \theta_{z|x, \ell})dz=f(x, \theta_{z|x, \ell}),   & (a), \\ \int f(y, \theta_{z|y, \ell})p(y | { x},   \ell,   \theta_{y|x, \ell})dy \approx F(\varsigma(x,\theta_{y|x, \ell}),  & (b).\end{cases} \\  \int zq( z |y, \ell )dz=f(y, \theta_{z|y, \ell}), &   F(y)=f(y, \theta_{z|y, \ell})+0.5Tr[\Pi_{\ell}^{y \ -1} \nabla_{yy^T}^2f(y, \theta_{z|y, \ell}) ].  \end{array}

In other words, this BYY system also covers the mixture-of-experts (ME) model (Jacobs, et al, 1991; Jordan & Xu, 1995), alternative ME model (Xu, Jordan, & Hinton, 1995), and its special cases of normalized radial basis function (RBF) networks and extended normalized RBF networks (Xu, 1998). Instead of using the conventional maximum likelihood (ML) learning, these models can be obtained by BYY harmony learning with favorable features to be introduced in the rest sections. Readers are also referred to maximum likelihood learning. Further details are also referred to the Scholarpedia page Rival penalized competitive learning.

To understand more specifically, we proceed to a simplified case that \{ X, Z \} degenerates back to X only, i.e., for the cases of unsupervised learning. We consider the problem of local factor analysis in Fig.6(b). It can be regarded as an combination of two well known unsupervised data analysis tools, i.e., samples are models by a mixture of a number of Gaussians, with each Gaussian generated by a linear system from independent factors of either Gaussian for a conventional factor analysis (FA) and or Bernoulli for binary factor analysis (BFA). Specifically, when q(y |     \theta_{y, \ell}) and q(x | { y},   \ell,   \theta_{x|y, \ell}) are both Gaussian, and thus p(y | { x},   \ell,   \theta_{y|x, \ell}) and q( x| \theta_{x,\ell}) are also Gaussian. In this case, \rho( x, \Theta_{\ell}^q) is obtained without any approximation. In addition to determining unknown parameters, we also need to appropriately select the unknown k_Y=\{k, \{ m_{\ell} \}_{\ell=1}^k\}\,, where k\, is the number of subspaces and m_{\ell}\, is the dimension of each subspace.


BYY harmony learning (A): fundamentals

According to the Chinese Ying-Yang philosophy, one key principle is Ying-Yang harmony, in a sense that Ying and Yang not only matches each other but also seeks a best match in a most compact way, as illustrated by the Ying-Yang sign located at the top-right of Fig.4. This motivates us to determine the unknowns \Theta\,\, and k\,\, in a BYY system under such a best harmony principle, which is mathematically implemented by maximizing the following harmony measure

(6)
H(p \Vert q) = \int p({R}|{X})p({X}) \ln{[ q({X} | {R})q({R})]} d{X} d{R}.

On one hand, this maximization forces q({X} | {R})q({R}) to match p({R}|{X})p({X}). In other words, q({X} | {R})q({R}) attempts to describe \Chi_N=\{x_t\}_{t=1}^N \, in help of p({R}|{X}), which uses \begin{array}{l} q({X} )= \int q({X} | {R})q({R}) d{R} \end{array} to fit p({ X} |\Theta_x )= G(X|\Chi_N, h^2I) not in a maximum likelihood sense but with a promising least complexity nature. Due to a finite size N, this matching aims at (but may not really reach) q({X} | {R})q({R})= p({R}|{X})p({ X} |\Theta_x ). Still we get a trend towards this equality by which H(p \Vert q) becomes the negative entropy, and its further maximization will minimize the system complexity, which consequently provides a model selection nature on k\,\,.

At the first glance, one may feel the formulae eq.(6) somewhat familiar. In fact, its degenerated case with R vanished leads to \begin{array}{l} H(p \Vert q) = \int p({X}) \ln{ q({X} )} d{X}.\end{array} With p({ X} )= G(X|\Chi_N, h^2I) at h=0 and q({X} )= q({ X} | \Theta ), maximizing this H(p \Vert q) becomes \begin{array}{l}\max_{\Theta}\ln{ q(\Chi_N |\Theta )}\end{array}, i.e., maximum likelihood (ML) learning. The situation with p({ X} ) beyond G(X|\Chi_N, h^2I) was also explored in the engineering literature (e.g., signal processing), via \begin{array}{l}\min_{q({X})}-H(p \Vert q) \end{array} under the name of Minimum Cross-Entropy (Minxent). It was noticed that \begin{array}{l}\min_{p({X} )}- H(p \Vert q)\end{array} or \begin{array}{l}\max_{ p({X} )}H(p \Vert q) \end{array} leads to a singular result that \begin{array}{l} p({ X} )=\delta({ X} -{ X}^*),\end{array}, \begin{array}{l}{X}^*=arg\max_X q(X|\Theta )\end{array}, which was regarded as irregular and not useful. Instead, efforts were turned to \begin{array}{l}\min_{p({X})} {KL}( p({X} )\Vert q({X} )) \end{array}, where {KL}(p \Vert q) is the classic Kullback–Leibler divergence:

(7)
{KL}(p \Vert q) = \int p({u}) \ln{ p({u}) \over q({u} )} du, \ with \ a \ relation \ {KL}(p \Vert q) =  H(p \Vert p)- H(p \Vert q).

Minimizing this {KL}(p \Vert q) with respect to p will also push p to match q, and thus the above singular result is avoided. Moreover, if p is given, {KL}(p \Vert q) is still equivalent to \begin{array}{l}\min_{ q({X} )}- H(p \Vert q) \end{array}. Thereafter, Minimum Cross-Entropy is usually used to refer this \min {KL}(p \Vert q).

Interestingly, with the inner representation R considered in the BYY system, the scenario becomes different from the above classic situation. With p({ X} )= G(X| \Chi_N, h^2I) fixed, \begin{array}{l} \max_{ p } H(p \Vert q) \end{array} is made with respect to only p({R}|{X}), which is no longer useless but responsible to the promising least complexity nature discussed after eq.(6). In other words, maximizing H(p \Vert q) with respect to unknowns not only in Ying part q({X},{R}) but also in Yang part p({X},{R}) makes the Ying-Yang system become a best harmony. Alternatively, we may regard such a mathematical formulation as an information theoretic interpretation of the ancient Ying Yang philosophy.

More generally, we can extend probability densities to probability measures defined over arbitrary sets. Let Q({X},{R}) for the Ying machine and P({R},{X}) for the Yang machine are probability measures over the domain of {X},{R}. Let \mu({X},{R}) is a sigma finite measure on the domain of {X},{R}, which acts as a benchmark or reference measure on the universe that Ying and Yang machines are supported. When Q is absolutely continuous in respect to \mu, it follows from the Radon–Nikodym theorem that we can generalize eqn.(6) into

(8)
H(P \Vert Q) = \int   \ln{d Q({X},{R}) \over d\mu({X},{R})} d P({R},{X}),

which returns back to eqn.(6) when \mu({X},{R}) is the usual Lebesgue measure. On one hand, the above -H(Q \Vert Q) becomes the entropy of the Ying machine at the special setting Q =P, and \max H pushes the Ying machine to a least complexity with respect to the benchmark measure \mu. However, there is no mechanism that makes Q to model the sample size \Chi_N. On the other hand, -H(P \Vert Q) =KL (P \Vert Q) becomes Kullback–Leibler divergence by eqn.(7) between the Ying and Yang machine at another special setting \mu=P and \max H makes a best Ying-Yang matching during which Q attempts to fit the sample size \Chi_N. However, there is no mechanism for making the Ying machine to a least complexity due to the disappearance of the benchmark measure \mu. Both the specical cases consider only the relationship between two measures, while the general case by eqn.(8) handles the relationship among three measures such that the least complexity nature and the best matching nature are jointly obtained.

Without losing generality, the rest of this article focuses on the cases of probability densities by eqn.(6), while we may still consider other cases similarly in help of eqn.(8). To further proceed, we rewrite eq.(6) into

(9)
\begin{array}{l} H(p \Vert q) = \int p(\Theta | {X}) H(p \Vert q, \Theta )d\Theta +H_{b}(\Xi), \   H(p \Vert q, \Theta )=H_f(p \Vert q, \Theta )-Z(\Theta_a|\Xi), \  Z(\Theta_a|\Xi) =-\ln{q( \Theta_a|\Xi)} , \\  H_f(p \Vert q, \Theta )=\int p( {Y} | {X}, \Theta_{y|x}) p({X}| \Theta_{x})  \ln{[ q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})]} d{X} d{Y}, \ H_{b}(\Xi) =\int p(\Theta_b | {X})p({X}| \Theta_{x})  \ln{q(\Theta_b|\Xi)} dX d\Theta_b,    \end{array}

where the partition \Theta=\Theta_a\cup \Theta_b,  \Theta_a\cap \Theta_b=\emptyset is made such that H_{b}(\Xi) is obtained with the integral over \Theta_b solved analytically, e.g., for some cases that p(\Theta_b | {X}),  q(\Theta_b|\Xi) are conjugate distributions. If it is difficult to find such conjugate distributions or to compute H_{b}(\Xi), we may simply consider \Theta=\Theta_a, \  \Theta_b=\emptyset.

Considering a Taylor expansion of Q(u) around u^*=arg\max_{u}  Q(u) up to the second order, we have the following approximation

(10)
\begin{array}{l} \int  p(u)Q(u)du \approx  Q( u^*)+ 0.5Tr[\Sigma H_Q( u^* )], \  H_Q(u) = \nabla^2_{uu^T} Q(u), \    \Sigma= E_{p(u)}(u- u^*)(u- u^*)^T.  \end{array}

With p(\Theta | {X}) designed according to the principle by eq.(3) with the Hessian matrix measure in Fig.5, it follows from using eq.(10) on \begin{array}{l}\int p(\Theta | {X}) H(p \Vert q, \Theta )d\Theta \end{array} that we approximately get

(11)
\begin{array}{l} H(p \Vert q) \approx  H(p \Vert q, \Theta^* )+ 0.5d_k(\Xi), \ \Theta^*=arg\max_{ \Theta} H(p \Vert q, \Theta ), \\ d_k(\Xi)=n_f(\Theta)+ (\Theta(X)- \Theta^*)^T \Omega(\Theta^*,\Xi)(\Theta(X)- \Theta^*)+H_{b}(\Xi),  \   \Omega(\Theta^*,\Xi)=\nabla^2_{ \Theta \Theta^T} H(p \Vert q, \Theta ), \end{array}

where n_f(\Theta) is the number of free parameters in \Theta. We can observe that the maximization of H(p \Vert q) with respect to \Theta, \Xi, k consists of an inner maximization of H(p \Vert q, \Theta ) that seeks the best harmony among the front layer Ying-Yang supported by a priori q( \Theta|\Xi) and an outer maximization that seeks the best harmony of the entire Ying Yang system with d_k(\Xi) taken in consideration for the interaction between the front and back layers.

We may get further insights on \max_{ \Theta} H(p \Vert q, \Theta ) by considering its gradient flow \nabla_{ \Theta} H(p \Vert q, \Theta ) with p( {Y} | {X}, \Theta_{y|x}) designed according to the principle by eq.(3) with the choice (b) of likelihood based measure in Fig.5, that is, we have

(12)
p( {Y} | {X}, \Theta_{y|x})= q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})/ q({X} | \Theta), \  q({X} | \Theta)=\int q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y})d{Y}.

It follows from H(p \Vert q, \Theta ) in eq.(9) that

(13)
\begin{array}{l} \nabla_{\Theta}H(p \Vert q, \Theta ) = \int p( {Y} | {X}, \Theta_{y|x}) [1+\delta_L(X,Y)]\nabla_{\Theta}L(X,Y) p({X}) d{X} d{Y},   \\ \delta_L(X,Y)= L(X,Y)- \int p( {Y} | {X}, \Theta_{y|x}) L(X,Y) d{Y}, \ and \  L(X,Y)=\ln{[ q({X} | {Y}, \Theta_{x|y})q({Y}|\Theta_{y}) q( \Theta|\Xi)]}. \end{array}

Noticing that L(X,Y) describes the fitness of an inner representation Y on the observation X, we observe that \delta_L(X,Y) indicates whether the considered Y fits X better than the average of all the possible choices of Y. Letting \delta_L(X,Y)=0, \nabla_{\Theta}H(p \Vert q, \Theta ) is simplified into \begin{array}{l} \int p( {Y} | {X}, \Theta_{y|x})\nabla_{\Theta}L(X,Y) p({X}) d{X} d{Y}, \end{array} which is also be derived from \begin{array}{l} \nabla_{\Theta} \int p({X}) \ln{ [q({X} | \Theta) q( \Theta|\Xi)]} d{X} \end{array} with q({X} | \Theta) by eq.(3). In other words, with p({ X}  )= G(X-\Chi_N|0, h^2I) at h=0, updating \Theta along this gradient flow actually implementing the classic Bayesian learning or the classic ML learning for a non-informative priori q( \Theta|\Xi). With \delta_L(X,Y)\ne 0, the gradient flow \nabla_{\Theta}L(X,Y) with all possible choices of Y is integrated via weighting not just by p({Y}|{X}) but also by a modification of a relative fitness measure 1+\delta_L(X,Y). If \delta_L(X,Y)> 0, updating goes along the same direction of the Bayesian learning / ML learning but with an increased strength. If 0>\delta_L(X,Y)> -1, i.e., the fitness is worse than the average and thus this Y is doubtful, updating still goes along the same direction of the Bayesian learning / ML learning while with a reduced strength. When \delta_L(X,Y)< -1, updating reverses to the opposite direction, i.e., becoming de-learning. In other words, the BYY harmony learning has a nature similar to Rival Penalized Competitive Learning (Xu, Krzyzak, & Oja, 1992&93) but with an improvement that there is no need on a pre-specified de-learning strength.


BYY harmony learning (B): implementations

Figure 7:   (a) Two stage iterative procedure for BYY harmony learning,  (b) BYY harmony model selection criterion, and (c) an illustration of automatic model selection.
Enlarge
Figure 7: (a) Two stage iterative procedure for BYY harmony learning, (b) BYY harmony model selection criterion, and (c) an illustration of automatic model selection.

From eq.(11), we can implement the maximization of H(p \Vert q, \Theta ) via an iterative procedure as shown in Fig.7(a), during which a previous estimate \Theta^{(t-\tau)}, \tau\ge 1 is used as \Theta(X). Specifically, Stage I(b) relates to choosing a priori q( \Theta|\Xi) and may be discarded when we have no priori to use or we do not want to consider this part. As to Stage I(a), its implementation can be made in help of \nabla_{\Theta}H(p \Vert q, \Theta ) by eq.(13), for which a key point is to handle the integral over Y\,. In general, Y\, may include either or both of a part Y' of real variables and a part L\, of discrete labels, e.g., in Fig.6 we have real variables \{y^{(j)}\} for coordinates in each subspace and a discrete label \ell for different subspaces. Let Y, L to take the places of Y in H(p \Vert q, \Theta ) by eq.(9) We have

(14)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{L} \int p( L | {X}, \Theta_{y|x})    p( {Y} | {X}, L, \Theta_{y|x})p({X}| \Theta_{x})  \ln{[ q({X} | {Y}, \Theta_{x|y})q({Y}|L, \Theta_{y}) q(L| \Theta_{y}) ]} d{X} d{Y}+\ln{q(\Theta_a|\Xi)}.  \end{array}

We follow the principle by eq.(3) to design p( L | {X}, \Theta_{y|x}) according to the likelihood based measure and to design p( {Y} | {X}, L, \Theta_{y|x}) according to the Hessian matrix measure. Moreover, it follows from eq.(4) that the above H(p \Vert q, \Theta ) can be extended to cover supervised learning x \to z simply with p({X}| \Theta_{x}) replaced by p({ X}   )   p(Z|X,Y), q({X} | {Y}, \Theta_{x|y}) replaced by q(X| Y, \theta_{x|y} )q({ Y}|\theta_y )q( Z |X, { Y}, \theta_{z|x,y} ), and dX replaced by dXdZ.

There could be different ways to get an approximation of \begin{array}{l} \nabla_{\Theta}H(p \Vert q, \Theta )\end{array} for implementing a gradient based learning. Two typical ways are featured by swapping the order of handling \begin{array}{l}\nabla_{\Theta} \end{array} and \begin{array}{l} \sum_{L} \int \end{array}. Let Y, L to take the places of Y in eq.(13), one is making the operation \begin{array}{l}\nabla_{\Theta} \end{array} first and then approximating the integral over Y and the summation over L to get \begin{array}{l}\delta_L(X,Y, L)\end{array} and \begin{array}{l}\nabla_{\Theta}H(p \Vert q, \Theta )\end{array}. The other is using eq.(10) to remove the above integral over Y and then get the corresponding \begin{array}{l}  \nabla_{\Theta}H(p \Vert q, \Theta )\end{array}.

Figure 8:   Iterative procedure for iid BYY system.
Enlarge
Figure 8: Iterative procedure for iid BYY system.


For a detailed insight, we further use the above second way on the i.i.d. cases shown in Fig.6(a). It follows from eq.(9) that we have

(15)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p(\ell|x_t) H_{\ell}(x_t, z_t, y^*_{\ell}(x_t,z_t), \Theta_{\ell}^q), \ y^*_{\ell}(x,z)=arg\max_{y} [L_{\ell}(x, y, \Theta_{\ell}^q)+i_Z \ln{q(z|x, y, \ell)}],\\ H_{\ell}(x, z,  y, \Theta_{\ell}^q)=L_{\ell}(x, y, \Theta_{\ell}^q) +i_Z\ln{q(z|x, y, \ell)}+R_{\ell}(x, y, \Theta_{\ell}^q),  \  L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{[  q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell)]}, \end{array}

where i_Z=1 indicates that we consider not only unsupervised learning for the dependence structure underlying {\mathcal X}_N but also supervised learning for the relation {\mathcal X}_N \to {\mathcal Z}_N, and i_Z=0 degenerates back to consider only unsupervised learning.

Bayesian Ying Yang learning
Enlarge
Figure 9: Adaptive algorithm for local factor analysis with i_Z=0 in getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}),  q(x|y, \ell, \theta_{ x|y, \ell}).

Conceptually, those priors studied in the literature of Bayesian approaches may be adopted as q( \Theta_{\ell}^q|\Xi_{\ell}) accordingly, ranging from Jeffreys prior to Dirichlet prior and other conjugate priors. Alternatively, a data sensitive improper priori q(\Theta) without requiring a hyper-parameter \Xi was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite size samples, the key points are given as follows:

  • It has been shown empirically that a good choice of q(h_x|{\mathcal X}_N), q(h_z|{\mathcal Z}_N) is simply given as follows:
(16)
\begin{array}{l} q(h|{\mathcal U}_N) \propto 1/ \sum_{t=1}^N p_h(u_t), \ p_h(u)=N^{-1}\sum_{t=1}^N G(u|\bar u_t, h^2I). \end{array}
  • Eqn.(16) is just a special case of the following one:
(17)
\begin{array}{l} q(\Theta|{\mathcal U}_N) \propto 1/ \sum_{t=1}^N p(u_t|\Theta), \end{array}

with a rationale called Cancelling Induced Bias (CIB) underlying a parametric model p(u|\Theta). Conceptually, \begin{array}{l}\int p(u|\Theta) du=1\end{array} shields \Theta to take effect. However, a finite size of samples makes \begin{array}{l}\mu(\Theta)=\sum_{t=1}^N p(u_t|\Theta) \end{array} becomes a measure of \Theta, which acts as an unwanted improper prior. Eqn.(17) aims at to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.

One typical way to implement Stage I(a) in Fig.7(a) is following the gradient flow \nabla_{\Theta}H(p \Vert q, \Theta ). From eq.(15), we have

(18)
\begin{array}{l}  \nabla_{\Theta}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p_{\ell,t} [ (1+ \delta h_{\ell,t})\nabla_{\Theta_{\ell}^q }L_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q)+ i_{Z} \nabla_{\Theta_{\ell}^q}\ln{q(z_t|x_t,y^*_{\ell}(x_t, z_t),  \ell)}]\\ - 0.5\sum_{t=1}^{N}\sum_{\ell} p_{\ell,t} \delta h_{\ell,t} \nabla_{\Theta_{\ell}^q} \ln{|\Pi_{\ell}^y}| + \sum_{t=1}^{N}\sum_{\ell}p_{\ell,t} \nabla_{\Theta} R_{\ell}(x_t, y^*_{\ell}(x_t, z_t), \Theta_{\ell}^q),     \end{array}

by which Stage I(a) in Fig.7(a) can be adaptively implemented per sample x_t by iterating alternatively the Yang step and Yang step in Fig.8.

Figure 10:   Adaptive algorithm for local subspace based function.
Enlarge
Figure 10: Adaptive algorithm for local subspace based function.

Again, it degenerates back to considering only unsupervised learning by letting i_Z=0, Further shown in Fig.9 is the details for local factor analysis in Fig.6(b). These updating formulae all comes from gradient based updating, along either a gradient direction directly or a direction that has a positive projection on the gradient direction by multiplying a positive definite matrix such that certain constraints on \Theta are satisfied and that computation is simplified, e.g., updating of \theta_{y|x,   \ell} is simplified with \Pi_{\ell}^{y\ -1} removed from its computation.

A further extension can be made to also consider the relation x\to z via two cascadeded mapplings x\to y and y\to z. An example is given in Fig.10, for which learning is made by swithing on i_Z=1 in Fig.9 for getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}),  q(x|y, \ell, \theta_{ x|y, \ell}), plus updating G(z|W_{\ell}+c_{\ell}, \Gamma_{\ell}). We get a regression E(z|x) that combines a number of local linear regression functions with each supported on a local subspace, and thus we call it subspace based function (SBF).

Moreover, the algorithms in Fig.9 and Fig.10, as well generally Fig.8 and eq.(18), can be modified into a number of variants shown in Fig.11, as remarked as follows:

Bayesian Ying Yang learning
Enlarge
Figure 11: Several variants, where \gamma >0 is far less than 1, and \bar \delta(u)=0 except \bar \delta(u)=1 at u=0.
  • At the first column, we may ignore q(\Theta|\Xi) by simply setting q(\Theta|\Xi)=1, and thus remove the part of updating hyperparameter. Also, we may remove data smoothing regularization by simply setting h_x^2=0, h_z^2=0. Ignoring this part will further save computing cost but also reduce the performance.
  • The previous discussion on eq.(13) directly applies to eq.(18) because H(p \Vert q, \Theta ) in eq.(15) is just a specific example of eq.(9). Thus, by simply setting \delta_{\ell,t}=0 in eq.(), we are lead to the ML learning at the bottom of the second column, which is further generalized to the other three choices in the same column by considering either or both data smoothing and a priori q(\Theta|\Xi).
  • Instead of following the principle by eq.(3), we may simply let p(\ell|x) to be structure free and to be decided by maximizing \begin{array}{l}\max_{p(\ell|x)}  H(p \Vert q, \Theta )\end{array}, resulting in \begin{array}{l} p(\ell|x)=\bar \delta(\ell-\ell^*),  \ell^*=arg\max_{\ell}  H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q)\end{array}. Consequently, we get \delta_{\ell,t}=0 and thus those special cases shown in the third column.
  • Taking the rival \ell_r^*=arg\max_{\ell\ne r}  H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q) in consideration, we are further lead to the fourth colum that consists of Rival Penalized Competitive Learning (RPCL)(Xu, Krzyzak, & Oja, 1992&93) and extensions with either or both data smoothing and a priori q(\Theta|\Xi) in consideration.

In addition to tackling computational difficulty by removing the integral over Y in help of eq.(10), we may also encounter the computational difficulty of the summation \begin{array}{l}\sum_{L}\end{array} or \begin{array}{l}\sum_{\ell}\end{array}, if there are too many items to sum up. As above discussed, maximizing H(p \Vert q, \Theta ) with respect to a structure free p(\ell|x) leads to p(\ell|x)=\bar \delta(\ell-\ell^*), and the corresponding summation \begin{array}{l}\sum_{\ell}\end{array} per sample x_t reduces into merely one most important term. However, it may be too crude to use merely this term to approximate an entire summation. Instead, we may extend one \ell^* into L^* that is a subset of all the values that \ell will take. E.g., if certain constraints prevent those of p(\ell|x) with \ell \in  L^* to become zeros, \begin{array}{l}\max_{p(\ell|x)}  H(p \Vert q, \Theta )\end{array} will reduce \begin{array}{l}\sum_{\ell}\end{array} into \begin{array}{l}\sum_{\ell \in L^*}\end{array} with

  • L^*=\{\ell:  H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q) \  is \  \mbox{among the first}\ \kappa \ \mbox{largest ones} \}, where \kappa=\# L^* denotes the cardinality of the set L^*.

Then, we approximately let every appearance of \begin{array}{l}\sum_{\ell}\end{array} or \begin{array}{l}\sum_{L}\end{array} to be replaced by \begin{array}{l}\sum_{\ell \in L^*}\end{array}. Such an approximation will considerably save computing cost but does not affect performance seriously for an appropriate \kappa. Particularly, when \kappa=2, it follows from the reason similar to eq.(13) that the BYY learning algorithms in the first column of Fig.11 becomes very close to RPCL algorithms in the fourth column of Fig.11, with two points of difference. First, it avoids the difficulty of specifying an appropriate \gamma for RPCL. Second, the rival c is updated by either de-learning or learning instead of only de-learning by RPCL.

In the cases with a binary vector \begin{array}{l}{y}\end{array} , the integral \begin{array}{l}\int_{y}\end{array} becomes a summation \begin{array}{l}\sum_{y}\end{array} over all the \begin{array}{l}2^{m_{\ell}}\end{array} terms. In a strict sense, the concepts \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} in Fig.8 become not applicable to a binary vector \begin{array}{l}{y}\end{array}. Still, we have those equations in Fig.8 simply as if \begin{array}{l}{y}\end{array} is real since we can rewrite \begin{array}{l}H_{\ell}(x, z,  y, \Theta_{\ell}^q)\end{array} in a format \begin{array}{l}b_0+b_1[y-Ey]+b_2[y-Ey][y-Ey]^T\end{array} from which we get those equations again without invovling \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array}. Moreover, in addition to the choice \begin{array}{l}\Gamma_{\ell}^{y|x}=\Pi_{\ell}^{y\ -1}\end{array} in Fig.9, we may also consider the following choices:

  • \begin{array}{l}\Gamma_{\ell}^{y|x}=diag[\varsigma(x_t,\theta_{y|x, \ell}) (\mathbf 1 -\varsigma(x_t,\theta_{y|x, \ell}) )^T] \end{array}, which is equivalently considering an independent distribution \begin{array}{l}p(y|x,\ell,\theta_{y|x, \ell}) \end{array} in Fig.6;
  • \begin{array}{l}\Gamma_{\ell}^{y|x}={ 1 \over \# N(y_{\ell}^*(x_t))} \sum_{ y \in N(y_{\ell}^*(x_t))} [y- \varsigma(x_t,\theta_{y|x, \ell})][y- \varsigma(x_t,\theta_{y|x, \ell})]^T \end{array}, where N(y_{\ell}^*(x_t)) consists of y_{\ell}^*(x_t) and those values with \kappa bit distance away, typically \kappa= 0, \ or \  1.

The last but not the least, an important nature needs to be addressed is automatic model selection. In a BYY system, there is a subset \tilde {\theta}_{y} \subseteq \theta_{y} of parameters on which there exists an indicator \rho(\tilde {\theta}_{y}). Maximizing H(p \Vert q, \Theta ), that includes maximizing \begin{array}{l} \int p( {Y}) \ln{q({Y}|\Theta_{y})} d{Y} \end{array} with \begin{array}{l} p( {Y})=\int p( {Y} | {X}, \Theta_{y|x}) p({X}) d{X}\,\end{array}, will exert a force to push \rho(\tilde {\theta}_{y})\to 0, which means that its associated contribution to k_Y should be discarded because it is redundant. Readers are referred to (Xu, 2004b&c, 2005, 2007a&b) for further details. During learning, \alpha_{\ell} \to 0 means that the corresponding components q(y, \ell | \theta_y), q(x|  y, \ell,  \theta_{x|y,   \ell}), and q(y |\ell, \theta_{y,   \ell}) are extra and thus discarded. As a result, k effectively reduces to k-1. As illustrated in Fig.7(c), \alpha_j\to 0 means its contribution to \tilde J(k) is 0, and a number of such parameters becoming 0 result in that \tilde J(k) has effectively no change on a range [\hat k, \tilde k]. Also, if the variance of q(y^{(j)} |\ell, \theta_{y,  \ell}) tends to zero, e.g., \lambda_{\ell,j} \to 0 for the local factor analysis in Fig., the corresponding dimension y^{(j)} is extra and thus discarded. As illustrated beyond \tilde k in Fig.7(c), such a parameter becoming 0 contributes to \tilde J(k) by -\infty. As long as k_Y is initialized at a big enough value, appropriate k^*, \{m_{\ell}^*\} \, are determined automatically during an iterative implementation of Stage I(a) in Fig.7(a). That is, model selection is incurred during parameter learning, which shares with and further improve the automatic model selection nature of Rival Penalized Competitive Learning (RPCL)(Xu, Krzyzak, & Oja, 1992&93). Further discussions are referred the subpage of /Appendix.

Further improved k^*, \{m_{\ell}^*\} can be obtained by also considering Stage II of the two stage implementation in Fig.7(a), especially for a small sample size N \,. For the task of local factor analysis and local binary factor analysis (BFA) in Fig.6(b), we have J({\mathbf k})=\tilde{J}(\mathbf k) + 0.5 n_f(\Theta_{\mathbf k}), \ {\mathbf k}=\{k, \{m_{\ell}\} \} with

(19)
\begin{array}{l} \tilde{J}(\mathbf k) =0.5 \sum_{\ell=1}^k \alpha_{\ell}\ln{|\Sigma_{\ell}|}  - \sum_{\ell=1}^k \alpha_{\ell} \ln{\alpha_{\ell}}  + \tilde{J}_y(\mathbf k) +0.5\sum_{\ell=1}^k \alpha_{\ell} ( m_{\ell}+h^2Tr[\Sigma_{\ell}^{-1}]), \\  \tilde{J}_y(\mathbf k) = \begin{cases}  0.5 \sum_{\ell=1}^k \alpha_{\ell}[\ln{|\Lambda_{\ell}|}+ m_{\ell}  \ln{(2\pi)}+m_{\ell}],  & (a)\, Gaussian  G(y|0,\Lambda_{\ell}), \\    \sum_{\ell=1}^k \alpha_{\ell}\sum_{i=1}^{m_{\ell}} [q_{\ell}^{(i)}\ln{q_{\ell}^{(i)}} +(1-q_{\ell}^{(i)})\ln{(1-q_{\ell}^{(i)})},  & (b)\, Bernoulli B(y|q_{\ell}). \end{cases} \end{array}

where \begin{array}{l} n_f(\Theta_{\mathbf k})=dk+k-1+ 0.5d(d+1)k+ \sum_{\ell=1}^{k}(m_{\ell}+d_{U_{\ell}}),\end{array} and d_{U_{\ell}}= m_{\ell}d-0.5m_{\ell}(m_{\ell}+1).

A number of experiments have been conducted on the task of local factor analysis and shown that BYY harmony learning outperforms considerably several typical learning methods and model selection criteria on both performances and computing times. Details are referred to the subpage of /experimental results, where one may also find experimental results on a number of other typical learning tasks.


BYY harmony learning (C): favorable features

Bayesian Ying Yang learning is featured by two fundamental differences from other existing learning approaches:

  • Systematically considering two pathways and two domains coordinately as a BYY system under a probability theory ground and designing three system components under the principles of Least Redundancy, Divide-Conquer, and Uncertainty Conversation, respectively. This is different not only from cognitive science motivated adaptive resonance theory (Grossberg, 1976; Grossberg & Carpenter, 2002) and the least mean square error reconstruction based auto-association (Bourlard & Kamp, 1988) and LMSER self-organization (Xu, 1991 & 93), but also from those probability theoretic approaches that either merely consider a bottom-up pathway as the inverse of a top-down pathway (e.g., those for inverse problems summarized in Fig.2), or approximate such inverses for tackling intractable computations (e.g., Helmholtz machine, varianional Bayes, and extensions).
  • Mathematically implementing the Ying-Yang harmony philosophy, in a sense that Ying and Yang not only matches each other but also seeks a best match in a most compact way, by determining all unknowns in a BYY system via maximizing the functional by eq.(6) that can be regarded as an further extension of the traditional Minimum Cross-Entropy principle to the BYY system, while with its known useless irregular aspect turned into a promising nature of least complexity, as discussed in previous sections.

These two fundamental issues make Bayesian Ying Yang learning possess the favorable promising features.

First, the conventional model selection approaches aim at model complexity conveyed at either or both of the level of structure S_{ k} the level of parameter set \Theta, as summarized in Fig.2 (c)&(d) and Columns (b)&(c) in Fig.3. This task is usually difficult to estimate, with some rough bounds provided by those approaches discussed after Fig.2. Bayesian Ying Yang learning considers not only the levels of structure S_{ k} and parameter set \Theta but also the level of short memory representation Y\, shown in Fig.2 (a)&(b) by the front layer of BYY system in Fig.4. That is, the complexity or scale k of the BYY system is considered with the part k_Y\, for representing Y\, and the rest part contributed by S_{ k} and \Theta. The second part is estimated via Z(\Theta|\Xi) =-\ln{q( \Theta|\Xi)} in eq.(9) and d_k(\Xi) in eq.(11), which is along a line similar to the above conventional approaches. However, the part k_Y\, is modeled via q({Y}|\Theta_{y}) in H_f(p \Vert q, \Theta ), which is estimated more accurately than the second part. Promisingly, the model selection problems of many typical learning tasks can be reformulated into selecting merely the k_{Y}\, part in a BYY system (Xu, 2005). Therefore, the resulted BYY harmony criterion shown in Fig.7(b) may considerably improve the performances by typical model selection approaches, which has been shown by in experiments, with details referred to the subpage of /experimental results.

Second and even interestingly, the feature of considering k_Y\, via q({Y}|\Theta_{y}) in H_f(p \Vert q, \Theta ) makes both parameter learning for \Theta\, and model selection for k_{Y}\, implemented simultaneously. That is, we get an important nature of automatic model selection, as illustrated in Fig.7(c). As addressed in the previous section, maximizing H(p \Vert q, \Theta ) will exert a force that pushes some indicator \rho(\tilde {\theta}_{y})\to 0 if a subset \tilde {\theta}_{y} \subseteq \theta_{y} is extra, which means that its associated contribution to k_Y can be discarded.

Remarks: at the first glance, the scenario of Fig.2 (b) with q({Y}|\Theta_{y}) has also been considered in a number of typical learning approaches, especially those with an EM like two pathway implementation, such as the EM algorithm implemented ML learning (Redner & Walker, 1984), information geometry based em-algorithm (Amari, 1995), Helmholtz Machine (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002), variational approximation (Jordan, Ghahramani, Jaakkola, & Saul, 1999), the bits-back based MDL (Hinton & Zemel, 1994), etc. As to be explained in the next section, those studies actually have neither put q({Y}|\Theta_{y}) in a role for describing k_Y nor sought for the above nature of automatic model selection.

Third, the separated consideration of k_Y\, from the rest of k also provides a general framework that integrates the roles of regularization and model selection, such that not only the automatic model selection mechanism on k_Y\, can avoid the previously mentioned disturbance by a regularization with an inappropriate priori q( \Theta|\Xi), but also imprecise approximations caused by handling the integrals may be alleviated via regularization. Specifically, model selection is made via q({Y}|\Theta_{y}) in Ying machine, while regularization is imposed in Yang machine via either or both of designing the structures of p( L | {X}, \Theta_{y|x}),     p( {Y} | {X}, L, \Theta_{y|x}) under a uncertainty conservation principle given in Fig.5 and making data smoothing regularization in help of p({ X} |\Theta_x )= G(X-\Chi_N|0, h^2I) with h\ne 0. This h takes a role similar to regularization strength. However, the difficulty of the conventional regularization approaches on controlling this strength has been avoided because an appropriate h determined via \begin{array}{l}\max_h H(p \Vert q, \Theta )\end{array} in help of Z(\Theta )=-\ln{q(h|{\mathcal X}_N)} by eq.(16).

Fourth, in addition to making the best harmony learning by maximizing H(p \Vert q) in eq.(6), an alternative has also been proposed in (Xu,1995) under the name of Bayesian Kullback Ying Yang (BKYY) learning that performs a best Ying Yang matching by minimizing:

(20)
KL(p({R}|{X})p({X})  \Vert p({R}|{X})p({X}) ) = \int p({R}|{X})p({X}) \ln{[ p({R}|{X})p({X}) ]} d{X} d{R}-H(p \Vert q),

that is, a best Ying Yang harmony includes not only a best Ying Yang matching as a part but also minimizing the entropy or the complexity of a Yang machine. In other words, it seeks a Yang machine that not only best matches the Ying machine but also keeps itself in a least complexity. As to be further addressed in the next subsection, the perspective of eq.(20) provides a bridge to revisit a number of typical statistical learning approaches but with new insights.

Fifth, considering a learning system in a Ying-Yang pair naturally motivates to implement the maximization of H(p \Vert q) or the minimization of KL(p \Vert q) by an alternative iteration of

  • Yang step: fixing all the unknowns in the Ying machine, we update the rest of the unknowns in the Yang machine (after excluding those common unknowns shared by the Ying machine);
  • Ying step: fixing those just updated unknowns in the Yang step, we update all the unknowns in the Ying machine.

Not only this iteration is guaranteed to converge, but also it includes the well known EM algorithm (see the subpage of /APPENDIX) and provides a general perspective for developing other EM-like algorithms, e.g., ones in Fig..


Relations to other approaches

In previous sections, not only it has been discussed after eqn.(8) that the BYY best harmony in its general expression by eqn.(8) provides a unified view on the Ying Yang harmony \max H(p \Vert q) by eqn.(6) at \mu in the Lebesgue measure and the Ying-Yang matching \min KL(p \Vert q) by eqn.(7) at the special setting \mu=P, but also it has been discussed after eq.(13) the relation of BYY harmony learning to Rival Penalized Competitive Learning (RPCL)(Xu, Krzyzak, & Oja, 1992&93) from a updating flow view and after eq.(18) the difference of the Ying Yang harmony \max H(p \Vert q) by eqn.(6) from the maximum likelihood (ML) and Bayesian learning. In sequel, we further discuss systematically relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in Fig.12 and summarized in Fig.13.


Figure 12:   Relations to Other Approaches
Enlarge
Figure 12: Relations to Other Approaches

We start at the left bottom corner of Fig.12. One classic perspective considers to use either a parametric model q(X |\Theta) directly or a top-down model \begin{array}{l} q(X |\Theta)=\int q(X |Y, \theta_{x|y})q(Y|\theta_y)dY \end{array} (called a latent or generative model) for a best matching or modeling of observation data. When p(X )=\delta(X-X_N), it follows from the second row of Type B table and Type C table in Fig.13 that the Ying-Yang matching \min KL(p \Vert q) becomes equivalent to ML learning and also other likelihood based studies under the names of marginal likelihood, evidence, Bayesian information criterion, and Bayesian approach (Schwarz,1978; Hinton & Zemel, 1994; Neath & Cavanaugh,1997; MacKay, 1992, 2003); while it follows from the third row of Type B table and Type C table in Fig.13 that the Ying Yang harmony \max H(p \Vert q) becomes equivalent to MAP estimate and those q(\Theta) featured studies under the name of Bayesian learning (Press, 1989) or Minimum Message Length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999). Also, these Bayesian based studies will degenerate into the likelihood based studies if q(\Theta) is ignored, as illustrated at the right bottom corner of Fig.12. Moreover, the model selection AIC criterion (AIC) can be obtained not only from ML learning (i.e., Ying-Yang matching at the second row of Type B table) in its standard derivation (Akaike, 1974; Bozdogan, 1987; Cavanaugh, 1997), but also from the Ying-Yang harmony at the first row of Type C table in Fig.13 as a special case of several outcomes of a derivation in help of designing p( \Theta | {X}) by following the principle by eq.(3) with the Hessian matrix measure and using eq.(10) to approximately remove the integral over \Theta (Xu, 2008a&b).

We move to the left middle of Fig.12. The second perspective considers a bottom-up pathway (called transformation / recognition / representative) for a best encoding X \to Y. When p(X )=\delta(X-X_N), the Ying-Yang matching \min KL(p \Vert q) at the 4th row of Type B table in Fig.13 seeks a best matching between the inner representation p_S(Y ) by the Yang machine and its counterpart q(Y ) by the Yang machine, which further leads to the upper center in Fig.12, including the minimum mutual information (MMI) and the maximum information transfer (INFOMAX) for and independent subspace analyses (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski,1995; Xu, 2008c) at the special case that q(Y ) consists of independent components and that p(Y|X ) degenerates into a deterministic linear or post-linear function of X. Further studies may be explored beyond this special case. Interestingly, the Ying-Yang harmony \max H(p \Vert q) at the 4th row of Type B table in Fig.13 not only seeks such an inner best matching but also consists of maximizing H_{Yang} that pushes p(Y|X ) to approach a deterministic one, and even to arrive it if p(Y|X ) is free of any constraint.

Figure 13:   Best Ying-Yang harmony vs best Ying-Yang  matching: three types of formulation
Enlarge
Figure 13: Best Ying-Yang harmony vs best Ying-Yang matching: three types of formulation

Putting the above two perspectives in a joint consideration, we further come to the center of Fig.12 for the third perspective, i.e., those studies on a probability theoretic based two pathway. The Ying Yang harmony \max H(p \Vert q) leads to Bayesian Ying-Yang harmony learning, introduced in the previous sections. Moreover, extensive efforts have also been made during the same period under the name of the Helmholtz free energy based learning or Helmholtz machine (Hinton & Zemel, 1994; Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002) and further under the name of variational approximation (Ghahramani & Beal, 2000; Jaakkola, 2001; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Corduneanu & Bishop, 2001). When p(X )=\delta(X-X_N), the Ying-Yang matching \min KL(p \Vert q) at the 1st row of Type B table in Fig.13 becomes equivalent to Helmholtz machine if p(Y|X ) is a parametric structure by a forward networks as those in (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002); while the Ying-Yang matching \min KL(p \Vert q) at the 1st row of Type C table in Fig.13 leads to variational Bayes approach if p(\Theta|X ) is given parametric structure as that in (Ghahramani & Beal, 2000). Moreover, these variational approximation approaches in general correspond to the best Ying-Yang matching \min KL(p \Vert q) at the 1st row of both Type B and Type C tables in Fig.13, with p(Y|X ) and p(\Theta|X ) designed in given parametric structures that have unknown parameters to be determined, as those called bi-directional BYY architectures in (Xu, 1995, 2004a&b&c).

As sketched within the left ellipse in Fig.12, the Ying-Yang matching covers not only the studies of the likelihood based types, the AIC types, and the variational approximation based types, but also the studies of seeking a best matching between the Ying Yang inner representations, as well as variations and extensions with regularizations by p({ X}  )= G(X-\Chi_N|0, h^2I) and q(h|{\mathcal X}_N) by eq.(16). However, it follows from KL(p \Vert q) = H_{Yang}   - H(p \Vert q) at the top of the table in Fig.13 that the Ying-Yang matching differs from the Ying-Yang harmony in minimizing H_{Yang} as well, which cancels out the least complexity nature because the chance of maximizing \begin{array}{l} \int p_S( {Y}) \ln{q({Y}|\Theta_{y})} d{Y} \end{array} has been removed. Therefore, the Ying-Yang matching does not have the first three favorable features of the best Ying-Yang harmony, as elaborated in the previous section.

Figure 14:  An Optimal Information Transfer Perspective
Enlarge
Figure 14: An Optimal Information Transfer Perspective

For a further insight, we can also consider the best Ying Yang harmony from an optimal communication perspective, in other words, searching a minimum code length to communicate data, i.e., a formal information theory restatement of Occam's Razor. The first study along this direction is the Minimum message length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999), which is subsequently followed by the minimum description length (MDL) (Rissanen, 1986, 2007). As illustrated in Fig.14, the key idead is encoding a set of samples in two parts. Assume that a parametric form of p(x|\theta) is already known by the receiver, the model information is encoded via encoding \theta, which needs b_{\theta} bits. The second part b_{\varepsilon} consists of the bits to encode the residuals that can not be described by the model. Conceptually, MDL has a slight difference that bases on Kraft-McMillan inequality and turns the problem of searching for an efficient code into searching for a good distribution p(X). However, a widely used practical implementation for MDL actually degenerates into one that is same as Bayesian information criterion (BIC)(Schwarz,1978; Neath & Cavanaugh,1997) that can still be regarded as an example of the above two part coding (MacKay, 1992, 2003). Unfortunately, the bits b_{\theta} is difficult to be estimated since we usually has no a priori knowledge about p(\theta). Instead, a rough bound is considered, with deteriorated performances. In fact, other typical criteria share a similar difficulty that is either directly or equivalently related to getting p(\theta). Interestingly, the best Ying Yang harmony is equivalent to minimizing the total bits of three parts of encoding, as shown in Fig.14. The difficult part b_{\theta} is now separated into two parts b_{\theta}+ b_{y}. One is getting b_{y} that consists of bits for encoding y_t via b_t^{\varepsilon}, which is not only no longer so difficult to describe but also carries an information about k_Y per sample x_t. The other part is getting a new b_{\theta} for the bits of the rest part of model scale k, which is still estimated in a way similar to that for MDL and those typical model selection criteria. This provides another interpretation on the first two favorable features elaborated in the previous section. Though a three part coding has also been considered under the name of bit-back based MDL (Hinton & Zemel, 1994), what it considers is b_{\theta}+ b_{y}+b_{\varepsilon} -b_{back}, where b_{back} is actually equivalent to H_{Yang} at the bottom of Type B table in Fig.13 with p(X )=\delta(X-X_N). In other words, the bit-back based MDL is just another interpretation of the above discussed Helmholtz machine and variational approximation.

There are also two other streams of studies on tackling the over-fitting challenge. One is on the cross validation (CV) approach (Stone, 1978; Rivals & Personnaz, 1999. It may deserve to investigate whether the CV approach can be applied to further improve the BYY learning on a small size of samples. The other stream is on the VC dimension based generalization error bound (Vapnik, 1995), based on which the support vector machine is developed and has become a popular topic in the machine learning literature. Little is known theoretically on the relation of the BYY learning to the VC dimension based generalization error bound. There has been only a preliminary discussion on a link between SVM and a BYY learning based kernel density estimation (Xu, 2001a). A further effort may deserve to be made along this direction too.



References

  • Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.
  • Amari, S (1995), "Information geometry of the EM and em algorithms for neural networks", Neural Networks 8(9): 1379-1408.
  • Amari, S, Cichocki, A, & Yang, H (1996), "A new learning algorithm for blind separation of sources", In Touretzky, Mozer, & Hasselmo (Eds.), Advances in Neural Information Processing System 8, MIT Press, 757-763.
  • Bell, A & Sejnowski, T (1995), "An information-maximization approach to blind separation and blind deconvolution", Neural Computation 7, 1129-1159.
  • Bourlard, H & Kamp, Y (1988), "Auto-association by multilayer Perceptron and singular value decomposition", Biological Cybernetics 59: 291-294.
  • Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53 (3): 407-415.
  • Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.
  • Dayan, P., (2002), " Helmholtz machines and sleep-wake learning", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, pp522-525.
  • Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
  • Grossberg, S (1976), "Adaptive pattern classification and universal recording: I &II", Biological Cybernetics 23: 121-134 and 23: 187-202.
  • Grossberg, S & Carpenter, GA (2002), "Adaptive Resonance Theory", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, 87-90.
  • Hinton, GE, Dayan, P., Frey, BJ, & Neal, RN (1995), "The wake-sleep algorithm for unsupervised learning neural networks", Science, 268:1158-1160.
  • Hinton, GE & Zemel, RS (1994), "Autoencoders, minimum description length and Helmholtz free energy", in Cowan, Tesauro, & Alspector (eds), Advances in Neural Information Processing Systems, 6, 3-10.
  • Jaakkola, TS (2001), "Tutorial on variational approximation methods", in Opper & Saad (eds), Advanced Mean Field Methods: Theory and Practice, MIT press, 129-160.
  • Jordan, MI, Ghahramani, Z, Jaakkola, TS & Saul, LK (1999), "An Introduction to Variational Methods for Graphical Models ", Machine Learning 37(2): 183-233.
  • Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
  • Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
  • Mackey, D (1992), "A practical Bayesian framework for backpropagation", Neural Computation 4, 448-472.
  • MacKay, D (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press.
  • Poggio, T & Girosi, F (1990), "Networks for approximation and learning", Proc. of IEEE 78: 1481-1497.
  • Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
  • Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
  • Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer, 2007.
  • Rivals, I & Personnaz, L (1999), "On Cross Validation for Model Selection", Neural Computation 11: 863-870.
  • Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.
  • Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.
  • Tikhonov, AN & Arsenin, VY (1977), Solutions of Ill-posed Problems, Winston and Sons.
  • Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
  • Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11, 185-194.
  • Wallace, CS & Dowe, DR (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
  • Xu, L (2008a), "Bayesian Ying Yang System, Best Harmony Learning, and Gaussian Manifold Based Family", In Zurada et al (eds.), Computational Intelligence: Research Frontiers, WCCI2008 Plenary/Invited Lectures, LNCS5050, 48–78.
  • Xu, L (2008b), "Machine learning problems from optimization perspective", A special issue for CDGO 07, Journal of Global Optimization, in press, DOI10.1007/s10898-008-9364-0.
  • Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.
  • Xu, L (2007a), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition, 40, 2129-2153.
  • Xu, L (2007b), "Trends on Regularization and Model Selection in Statistical Learning: A Perspective from Bayesian Ying Yang Learning", an invited chapter in Challenges to Computational Intelligence, Duch, W & Mandziuk, J, eds, Springer-Verlag, 365-406.
  • Xu, L (2005), "Fundamentals, Challenges, and Advances of Statistical Learning for Knowledge Discovery and Problem Solving: A BYY Harmony Perspective", Keynote talk, Proc. of Intl. Conf. on Neural Networks and Brain, Oct. 13-15, 2005, Beijing, China, Vol. 1, 24-55.
  • Xu, L (2004a), "Temporal BYY Encoding, Markovian State Spaces, and Space Dimension Determination", IEEE Trans on Neural Networks 15( 5): 1276-1295.
  • Xu, L (2004b), "Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor auto-determination", IEEE Trans on Neural Networks 15(4): 885-902.
  • Xu, L (2004c), "Bayesian Ying Yang Learning", Intelligent Technologies for Information Analysis", N. Zhong and J. Liu (eds), Springer, 615-706.
  • Xu, L (2003), "Data smoothing regularization, multi-sets-learning, and problem solving strategies", Neural Networks, Vol. 15, No.5-6, 817-825.
  • Xu, L (2002), "Bayesian Ying Yang Harmony Learning", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, pp1231-1237.
  • Xu, L (2001a), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-Layer Nets and ME-RBF-SVM Models", Intl J of Neural Systems 11(1):43-69.
  • Xu, L (2001b), "BYY harmony learning, independent state space and generalized APT financial analyses", IEEE Trans. Neural Networks 12(4):822-849.
  • Xu, L (1998), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
  • Xu, L (1997), "Bayesian Ying Yang system and theory as a unified statistical learning approach (II): from unsupervised learning to supervised learning and temporal modeling", Theoretical Aspects of Neural Computation : A Multidisciplinary Perspective (TANC97) , in Wong, KM, et al, eds, Springer, pp25-42.
  • Xu, L (1995), "Bayesian-Kullback Coupled YING-YANG Machines: Unified Learnings and New Results on Vector Quantization", Proc. Intl. Conf. on Neural Information Processing (ICONIP95), Beijing, Oct 30-Nov.3, 1995, pp.977-988.
  • Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
  • Xu, L, Krzyzak, A & Oja, E (1992&93), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Trans on Neural Networks 4: 636-649, An early version on Proc. 1992 IJCNN, Nov.3-6, 1992, Beijing, 665-670.
  • Xu, L (1991&93) "Least mean square error reconstruction for self-organizing neural-nets", Neural Networks 6: 627-648, 1993. Its early version on Proc. IJCNN91'Singapore, 2363-2373, 1991.


Internal links

External links

See also


Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809, (go to the first approved version)
Created: 31 July 2006, reviewed: 15 February 2007, accepted: 22 March 2007
Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors