APPENDIX

From Scholarpedia

< Bayesian Ying Yang learning
This revision has not been approved by curators yet; It may contain inaccuracies.

Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong

This page is in a revision process and thus may contain errors. Please revisit this page later


Contents

(A) Uncertainty Conversation Principle


Figure 1:  Several choices for measuring the uncertainty or  information contained in a distribution
Enlarge
Figure 1: Several choices for measuring the uncertainty or information contained in a distribution

In a compliment to the Ying-Yang philosophy, Ying is primary and thus is designed first, while Yang is relatively secondary and thus is designed basing on Ying. Moreover, as illustrated by the Ying-Yang sign located at the top-left of Fig., the room of varying or dynamic range of Yang should balance with that of Ying, which motivates to design p(X, R) (in fact, only p({ R} | { X}) because p({ X} |\Theta_x ) is given by \Chi_N \, already) under a principle of uncertainty conversation between Ying-Yang. In other words, Yang machine preserves a varying room or dynamic range that is appropriate to accommodate uncertainty or information contained within the Ying machine, i.e., U( p)= U( q) under a uncertainty measure U( p) in one of the choices given in Fig.1. Since a Yang machine consists of two components p({ R}| { X})p({ X}), we may have one or two of the following uncertainty conversations:

(1)
\begin{array}{l} U( p({ R} |{ X}))= U( q({ R} |{ X})), \,  q({ R} |{ X})= {q({ X}, { R})\over q({ X})}, \  U(p({ X} |\Theta_x ))= U( q({ X})), \,  q({ X})=\int q({ X}, { R})dR.   \end{array}

One typical case of Likelihood based (a) is p({ R} |{ X})=  q^{\gamma}({ R} |{ X}) with a scalar 0\le \gamma< \infty. It includes Likelihood based (b) at \gamma=1< \infty, which represents the strongest conversation that can be achieved between Ying and Yang since p({ R} , { X})=  q({ R} , { X}) is unreachable as p({ X} |\Theta_x ) fixed by \Chi_N \, already. Moreover, it usually encounters an implementing difficulty to get q({ X}), except some special cases that the above intergal over R is either analytically solvable or becoming a inexpensively computable sum. One other typical case is the Hessian based (b) for avoiding the computing difficulty of handling the intergal over R. As previously discussed (Xu, 2004b, p889), this is consistent to the celebrated Cramér-Rao inequality stating that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of { R}, while -\nabla_{RR^T} \ln{q({ X}, { R})} provides an approximate Fisher information matrix.



(B) BYY system for supervised learning and Subspace basd functions


Typically, \begin{array}{l}    q( Z |X, { Y}, \theta_{z|x,y} ) \end{array} is in one of two special cases as shown in Fig.2(a).

Figure 2:  (a) BYY system  in  i.i.d. cases (front layer), (b) Local factor analysis (LFA).
Enlarge
Figure 2: (a) BYY system in i.i.d. cases (front layer), (b) Local factor analysis (LFA).

For a further insight, we observe Fig.2(a) for the details of a BYY system in i.i.d. cases, which is actually the front layer of the BYY system in Fig. with eq.(). Following the principle by eq.(1) , we design p(\ell | x, \theta_{\ell|x}) according to the likelihood based measure and get the variance of p(y | { x},   \ell,   \theta_{y|x, \ell}) according to the Hessian matrix measure. Moreover, we remove the integral of y to get \rho( x, \Theta_{\ell}^q) by Laplace approximation. In addition to modeling \Chi_N=\{ x_t \}_{t=1}^N, we may also implement x\to z by the following distribution q( z |x ) or the regression E( z |x ):

(2)
\begin{array}{lll} q( z |x ) =\sum_{\ell} p(\ell | x, \theta_{\ell|x})q( z |x, \ell ), & and \quad   E(z|x, \ell) =\sum_{\ell} p(\ell | x, \theta_{\ell|x})E(z|x, \ell). \\ q( z |x, \ell ) =\begin{cases}   q( z |x,  \ell, \theta_{z|x, \ell}),  & (a), \\ \int q( z |y, \theta_{z|y, \ell})p(y | { x},   \ell,   \theta_{y|x, \ell})dy,   & (b). \end{cases} & and \quad    E(z|x, \ell)=\begin{cases}   \int z q( z |x,  \ell, \theta_{z|x, \ell})dz=f(x, \theta_{z|x, \ell}),   & (a), \\ \int f(y, \theta_{z|y, \ell})p(y | { x},   \ell,   \theta_{y|x, \ell})dy \approx F(\varsigma(x,\theta_{y|x, \ell}),  & (b).\end{cases} \\  \int zq( z |y, \ell )dz=f(y, \theta_{z|y, \ell}), &   F(y)=f(y, \theta_{z|y, \ell})+0.5Tr[\Pi_{\ell}^{y \ -1} \nabla_{yy^T}^2f(y, \theta_{z|y, \ell}) ].  \end{array}

In other words, this BYY system also covers the mixture-of-experts (ME) model (Jacobs, et al, 1991; Jordan & Xu, 1995), alternative ME model (Xu, Jordan, & Hinton, 1995), and its special cases of normalized radial basis function (RBF) networks and extended normalized RBF networks (Xu, 1998). Instead of using the conventional maximum likelihood (ML) learning, these models can be obtained by BYY harmony learning with favorable features to be introduced in the rest sections. Readers are also referred to maximum likelihood learning. Further details are also referred to the Scholarpedia page Rival penalized competitive learning.

To understand more specifically, we proceed to a simplified case that \{ X, Z \} degenerates back to X only, i.e., for the cases of unsupervised learning. We consider the problem of local factor analysis in Fig.2(b). It can be regarded as an combination of two well known unsupervised data analysis tools, i.e., samples are models by a mixture of a number of Gaussians, with each Gaussian generated by a linear system from independent factors of either Gaussian for a conventional factor analysis (FA) and or Bernoulli for binary factor analysis (BFA). Specifically, when q(y |     \theta_{y, \ell}) and q(x | { y},   \ell,   \theta_{x|y, \ell}) are both Gaussian, and thus p(y | { x},   \ell,   \theta_{y|x, \ell}) and q( x| \theta_{x,\ell}) are also Gaussian. In this case, \rho( x, \Theta_{\ell}^q) is obtained without any approximation. In addition to determining unknown parameters, we also need to appropriately select the unknown k_Y=\{k, \{ m_{\ell} \}_{\ell=1}^k\}\,, where k\, is the number of subspaces and m_{\ell}\, is the dimension of each subspace.




In previous sections, not only it has been discussed after eqn.() that the BYY best harmony in its general expression by eqn.() provides a unified view on the Ying Yang harmony \max H(p \Vert q) by eqn.() at \mu in the Lebesgue measure and the Ying-Yang matching \min KL(p \Vert q) by eqn.() at the special setting \mu=P, but also it has been discussed after eq.() the relation of BYY harmony learning to Rival Penalized Competitive Learning (RPCL)(Xu, Krzyzak, & Oja, 1992&93) from a updating flow view and after eq.(9) the difference of the Ying Yang harmony \max H(p \Vert q) by eqn.() from the maximum likelihood (ML) and Bayesian learning. In sequel, we further discuss systematically relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in Fig. and summarized in Fig..


Figure 3:   Iterative procedure for iid BYY system.
Enlarge
Figure 3: Iterative procedure for iid BYY system.


For a detailed insight, we further use the above second way on the i.i.d. cases shown in Fig.2(a). It follows from eq.() that we have

(10)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p(\ell|x_t) H_{\ell}(x_t, z_t, y^*_{\ell}(x_t,z_t), \Theta_{\ell}^q), \ y^*_{\ell}(x,z)=arg\max_{y} [L_{\ell}(x, y, \Theta_{\ell}^q)+i_Z \ln{q(z|x, y, \ell)}],\\ H_{\ell}(x, z,  y, \Theta_{\ell}^q)=L_{\ell}(x, y, \Theta_{\ell}^q) +i_Z\ln{q(z|x, y, \ell)}+R_{\ell}(x, y, \Theta_{\ell}^q),  \  L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{[  q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell)]}, \end{array}

where i_Z=1 indicates that we consider not only unsupervised learning for the dependence structure underlying {\mathcal X}_N but also supervised learning for the relation {\mathcal X}_N \to {\mathcal Z}_N, and i_Z=0 degenerates back to consider only unsupervised learning.

Bayesian Ying Yang learning/APPENDIX
Enlarge
Figure 4: Adaptive algorithm for local factor analysis with i_Z=0 in getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}),  q(x|y, \ell, \theta_{ x|y, \ell}).


One typical way to implement Stage I(a) in Fig.(a) is following the gradient flow \nabla_{\Theta}H(p \Vert q, \Theta ). From eq.(10), we have

(9)
\begin{array}{l}  \nabla_{\Theta}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p_{\ell,t} [ (1+ \delta h_{\ell,t})\nabla_{\Theta_{\ell}^q }L_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q)+ i_{Z} \nabla_{\Theta_{\ell}^q}\ln{q(z_t|x_t,y^*_{\ell}(x_t, z_t),  \ell)}]\\ - 0.5\sum_{t=1}^{N}\sum_{\ell} p_{\ell,t} \delta h_{\ell,t} \nabla_{\Theta_{\ell}^q} \ln{|\Pi_{\ell}^y}| + \sum_{t=1}^{N}\sum_{\ell}p_{\ell,t} \nabla_{\Theta} R_{\ell}(x_t, y^*_{\ell}(x_t, z_t), \Theta_{\ell}^q),     \end{array}

by which Stage I(a) in Fig.(a) can be adaptively implemented per sample x_t by iterating alternatively the Yang step and Yang step in Fig.3.

Figure 5:   Adaptive algorithm for local subspace based function.
Enlarge
Figure 5: Adaptive algorithm for local subspace based function.

Again, it degenerates back to considering only unsupervised learning by letting i_Z=0, Further shown in Fig.4 is the details for local factor analysis in Fig.2(b). These updating formulae all comes from gradient based updating, along either a gradient direction directly or a direction that has a positive projection on the gradient direction by multiplying a positive definite matrix such that certain constraints on \Theta are satisfied and that computation is simplified, e.g., updating of \theta_{y|x,   \ell} is simplified with \Pi_{\ell}^{y\ -1} removed from its computation.

A further extension can be made to also consider the relation x\to z via two cascadeded mapplings x\to y and y\to z. An example is given in Fig.5, for which learning is made by swithing on i_Z=1 in Fig.4 for getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}),  q(x|y, \ell, \theta_{ x|y, \ell}), plus updating G(z|W_{\ell}+c_{\ell}, \Gamma_{\ell}). We get a regression E(z|x) that combines a number of local linear regression functions with each supported on a local subspace, and thus we call it subspace based function (SBF).


A further extension can be made to also consider the relation x\to z via two cascadeded mapplings x\to y and y\to z. An example is given in Fig.5, for which learning is made by swithing on i_Z=1 in Fig.4 for getting y^*_{\ell,t} and updating G(y|0,\Lambda_{\ell}),  q(x|y, \ell, \theta_{ x|y, \ell}), plus updating G(z|W_{\ell}+c_{\ell}, \Gamma_{\ell}). We get a regression E(z|x) that combines a number of local linear regression functions with each supported on a local subspace, and thus we call it subspace based function (SBF).

In addition to tackling computational difficulty by removing the integral over Y in help of eq.(), we may also encounter the computational difficulty of the summation \begin{array}{l}\sum_{L}\end{array} or \begin{array}{l}\sum_{\ell}\end{array}, if there are too many items to sum up. As above discussed, maximizing H(p \Vert q, \Theta ) with respect to a structure free p(\ell|x) leads to p(\ell|x)=\bar \delta(\ell-\ell^*), and the corresponding summation \begin{array}{l}\sum_{\ell}\end{array} per sample x_t reduces into merely one most important term. However, it may be too crude to use merely this term to approximate an entire summation. Instead, we may extend one \ell^* into L^* that is a subset of all the values that \ell will take. E.g., if certain constraints prevent those of p(\ell|x) with \ell \in  L^* to become zeros, \begin{array}{l}\max_{p(\ell|x)}  H(p \Vert q, \Theta )\end{array} will reduce \begin{array}{l}\sum_{\ell}\end{array} into \begin{array}{l}\sum_{\ell \in L^*}\end{array} with

  • L^*=\{\ell:  H_{\ell}(x, y^*_{\ell}(x_t), \Theta_{\ell}^q) \  is \  \mbox{among the first}\ \kappa \ \mbox{largest ones} \}, where \kappa=\# L^* denotes the cardinality of the set L^*.

Then, we approximately let every appearance of \begin{array}{l}\sum_{\ell}\end{array} or \begin{array}{l}\sum_{L}\end{array} to be replaced by \begin{array}{l}\sum_{\ell \in L^*}\end{array}. Such an approximation will considerably save computing cost but does not affect performance seriously for an appropriate \kappa. Particularly, when \kappa=2, it follows from the reason similar to eq.() that the BYY learning algorithms in the first column of Fig. becomes very close to RPCL algorithms in the fourth column of Fig., with two points of difference. First, it avoids the difficulty of specifying an appropriate \gamma for RPCL. Second, the rival c is updated by either de-learning or learning instead of only de-learning by RPCL.

In the cases with a binary vector \begin{array}{l}{y}\end{array} , the integral \begin{array}{l}\int_{y}\end{array} becomes a summation \begin{array}{l}\sum_{y}\end{array} over all the \begin{array}{l}2^{m_{\ell}}\end{array} terms. In a strict sense, the concepts \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array} in Fig.3 become not applicable to a binary vector \begin{array}{l}{y}\end{array}. Still, we have those equations in Fig.3 simply as if \begin{array}{l}{y}\end{array} is real since we can rewrite \begin{array}{l}H_{\ell}(x, z,  y, \Theta_{\ell}^q)\end{array} in a format \begin{array}{l}b_0+b_1[y-Ey]+b_2[y-Ey][y-Ey]^T\end{array} from which we get those equations again without invovling \begin{array}{l}\nabla_y, \nabla_{yy^T} \end{array}. Moreover, in addition to the choice \begin{array}{l}\Gamma_{\ell}^{y|x}=\Pi_{\ell}^{y\ -1}\end{array} in Fig.4, we may also consider the following choices:

  • \begin{array}{l}\Gamma_{\ell}^{y|x}=diag[\varsigma(x_t,\theta_{y|x, \ell}) (\mathbf 1 -\varsigma(x_t,\theta_{y|x, \ell}) )^T] \end{array}, which is equivalently considering an independent distribution \begin{array}{l}p(y|x,\ell,\theta_{y|x, \ell}) \end{array} in Fig.2;
  • \begin{array}{l}\Gamma_{\ell}^{y|x}={ 1 \over \# N(y_{\ell}^*(x_t))} \sum_{ y \in N(y_{\ell}^*(x_t))} [y- \varsigma(x_t,\theta_{y|x, \ell})][y- \varsigma(x_t,\theta_{y|x, \ell})]^T \end{array}, where N(y_{\ell}^*(x_t)) consists of y_{\ell}^*(x_t) and those values with \kappa bit distance away, typically \kappa= 0, \ or \  1.

(C) priors: data smoothing and normalization


Conceptually, those priors studied in the literature of Bayesian approaches may be adopted as q( \Theta_{\ell}^q|\Xi_{\ell}) accordingly, ranging from Jeffreys prior to Dirichlet prior and other conjugate priors. Alternatively, a data sensitive improper priori q(\Theta) without requiring a hyper-parameter \Xi was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite size samples, the key points are given as follows:

  • It has been shown empirically that a good choice of q(h_x|{\mathcal X}_N), q(h_z|{\mathcal Z}_N) is simply given as follows:
(5)
\begin{array}{l} q(h|{\mathcal U}_N) \propto 1/ \sum_{t=1}^N p_h(u_t), \ p_h(u)=N^{-1}\sum_{t=1}^N G(u|\bar u_t, h^2I). \end{array}
  • Eqn.(5) is just a special case of the following one:
(6)
\begin{array}{l} q(\Theta|{\mathcal U}_N) \propto 1/ \sum_{t=1}^N p(u_t|\Theta), \end{array}

with a rationale called Cancelling Induced Bias (CIB) underlying a paramtric model p(u|\Theta). Conceptually, \begin{array}{l}\int p(u|\Theta) du=1\end{array} shields \Theta to take effect. However, a finite size of samples makes \begin{array}{l}\mu(\Theta)=\sum_{t=1}^N p(u_t|\Theta) \end{array} become a measure of \Theta, which acts as an unwanted improper prior. Eqn.(6) aims at to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.


Next, we can use eq.() to remove the above integral over Y and then get the corresponding \nabla_{\Theta}H(p \Vert q, \Theta ). Particularly, for the i.i.d. cases shown in Fig.2(a) it follows from eq.() that we have

(10)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell}\int p(y,\ell|x_t) H_{\ell}(x_t, y, \Theta_{\ell}^q)dy, \ H_{\ell}(x,  y, \Theta_{\ell}^q)=L_{\ell}(x, y, \Theta_{\ell}^q) +\ln{q(z|x,y, \ell)}+R_{\ell}(\Theta_{\ell}^q), \\  L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{[q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell)]}, \\ R_{\ell}(\Theta_{\ell}^q)= N^{-1} \ln{[q( \Theta_{\ell}^q|\Xi_{\ell})q(h_x|{\mathcal X}_N)q(h_z|{\mathcal Z}_N)]}-0.5Tr[h^2_x \Pi_{\ell}^x + h^2_z \Pi_{\ell}^z]. \\   \Pi_{\ell}^x=-\nabla_{xx^T}\ln{q(x|y, \ell, \theta_{ x|y, \ell }) },  \   \Gamma_{\ell}^z=-\nabla_{zz^T}\ln{q(z|x,y, \ell)},  \end{array}

which degenerates back to the cases of unsupervised learning when we let

(11)
\begin{array}{l} \ln{q(z|x,y, \ell)}=0, \ h_z=0,  \ \ln{q(h_z|{\mathcal Z}_N)}=0, \   and \  z \  is \ deleted.  \end{array}
(9)
\begin{array}{l}  \nabla_{\Theta_{\ell}^q}H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell}\int p(y,\ell|x_t)\{  [1+ \delta_{\ell}^H(x, y)]\nabla_{\Theta_{\ell}^q }L_{\ell}(x_t, y, \Theta_{\ell}^q) + \nabla_{\Theta_{\ell}^q}\ln{q(z_t|x_t,y, \ell)} + \nabla_{\Theta_{\ell}^q}R_{\ell}(\Theta_{\ell}^q)\},  \\  \delta_{\ell}^H(x, y)=H_{\ell}(x,  y, \Theta_{\ell}^q) -\sum_{\ell}\int p(y,\ell|x) H_{j}(x, y, \Theta_{\ell}^q)dy,     \\ \nabla_{h_x}H(p \Vert q, \Theta )=-h_x N\sum_{\ell} p_{\ell,t} Tr[\Pi_{\ell}^x]+\nabla_{h_x}\ln{q(h_x|{\mathcal X}_N)},\\  \nabla_{h_z}H(p \Vert q, \Theta )=-h_zN\sum_{\ell} p_{\ell,t} Tr[ \Gamma_{\ell}^z]+\nabla_{h_z}\ln{q(h_z|{\mathcal Z}_N)}.  \end{array}


Next, we can use  eq.()  to remove the above integral over  Y and then get the corresponding \nabla_{\Theta}H(p \Vert q, \Theta ). Particularly, 

for the i.i.d. cases shown in Fig.2(a) it follows from eq.() that we have

(10)
\begin{array}{l} H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell} p(\ell|x_t) H_{\ell}(x_t, y^*_{\ell}(x_t), \Theta_{\ell}^q), \ H_{\ell}(x,  y, \Theta_{\ell}^q)=L_{\ell}(x, y, \Theta_{\ell}^q) +\ln{q(z|x,y, \ell)}+R_{\ell}(x, y, \Theta_{\ell}^q), \\     y^*_{\ell}(x)=arg\max_{y} [L_{\ell}(x, y, \Theta_{\ell}^q)+\ln{q(z|x,y, \ell)}], \\  L_{\ell}(x, y, \Theta_{\ell}^q)=\ln{q(x,y, \ell|\Theta_{\ell}^q)}, \ q(x,y, \ell|\Theta_{\ell}^q)= q(x|y, \ell, \theta_{ x|y, \ell }) q(y|\ell, \theta_{y, \ell }) q(\ell), \\ R_{\ell}(x, y, \Theta_{\ell}^q)=  N^{-1}\ln{[q( \Theta_{\ell}^q|\Xi_{\ell})q(h_x|{\mathcal X}_N)q(h_z|{\mathcal X}_N)]}-0.5Tr[h^2 \Pi_{\ell}^x + \varepsilon_{\ell}(x)\varepsilon_{\ell}^T(x) (\Pi_{\ell}^y+ \Gamma_{\ell}^y)+ \Pi_{\ell}^{y\ -1} \Gamma_{\ell}^y]-0.5 m_{\ell}. \\ \varepsilon_{\ell}(x)= y^*_{\ell}(x)-\varsigma_{\ell}(x,\theta_{y|x,   \ell} ),  \ \Pi_{\ell}^x=-\nabla_{xx^T}L_{\ell}(x,y,\Theta_{\ell}^q)=-\nabla_{xx^T}\ln{q(x|y, \ell, \theta_{ x|y, \ell }) }, \  \Gamma_{\ell}^z=-\nabla_{zz^T}\ln{q(z|x,y, \ell)},   \end{array}

which degenerates back to the cases of unsupervised learning when we let

(11)
\begin{array}{l} \ln{q(z|x,y, \ell)}=0, \ h_z=0,  \ \ln{q(h_z|{\mathcal Z}_N)}=0, \   and \  z \  is \ deleted.  \end{array}


One typical way to implement Stage I(a) in Fig.(a) is following the gradient flow \nabla_{\Theta}H(p \Vert q, \Theta ). From eq.(10), we have by which Stage I(a) in Fig.(a) can be adaptively implemented per sample x_t coming by iterating alternatively the Yang step and Yang step in Fig.(a).

Derivations for BYY harmony learning in i.i.d. cases

The corresponding BYY system is shown in Fig. and eq.() becomes

H(p \Vert q, \Theta )=\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell|x_t)[L_{\ell}(x_t, y^*_{t,\ell}, \Theta_{\ell})-0.5T_{\ell}(x_t, y^*_{t,\ell}, \Theta_{\ell})  +N^{-1} \ln{q( \Theta_{\ell}|\Xi_{\ell})}]+\ln{q(h|{\mathcal X}_N)},
(12)
T_{\ell}(x_t, y^*_{t,\ell}, \Theta_{\ell})  = -h^2 Tr[\nabla_{xx^T}L_{\ell}(x,y,\Theta_{\ell})]  + m_{\ell}  +Tr[(y^*_{t,\ell}-\eta_{\ell}(x_t,\theta_{y|x,   \ell} ))(y^*_{t,\ell}-\eta_{\ell}(x_t,\theta_{y|x,   \ell} ))^T\Pi^y_{\ell} ],
L_{\ell}(x,y,\Theta_{\ell})=\ln{[q(x|  y, \ell,  \theta_{x|y,   \ell})q(y |\ell, \theta_{y,   \ell})\alpha_{\ell}]}, \  p(\ell | x_t)= {\alpha_{\ell}q(x|    \theta_{\ell}) \over \sum_{j} \alpha_{j}q(x|    \theta_{j}) }\approx {e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5} \over \sum_{j} e^{L_{\ell}(x,y^*_{t,j}, \Theta_{j})}(2\pi)^{0.5m_{j}}|\Pi^y_{j}|^{-0.5}},
\nabla_{xx^T}L_{\ell}(x,y,\Theta_{\ell})  =\nabla_{xx^T} \ln{q(x|  y, \ell,  \theta_{x|y,   \ell})},  \   \Pi^y_{\ell}=-\nabla_{yy^T}^2 L_{\ell}(x,y,\Theta_{\ell})=-\nabla_{yy^T}^2\ln{[q(x|  y, \ell,  \theta_{x|y,   \ell})q(y |\ell, \theta_{y,   \ell})]},

where \eta_{\ell}(x_t,\theta_{y|x,   \ell} )=\int y p(y | { x},   \ell,   \theta_{y|x, \ell})dy is the regression function of p(y | { x},   \ell,   \theta_{y|x, \ell}), and thus we can directly consider \eta_{\ell}(x_t,\theta_{y|x,   \ell} ) with no need to model p(y | { x},   \ell,   \theta_{y|x, \ell}). Moreover, p(\ell | x_t) is designed under the principle by eq.(1) according to the likelihood based measure, via the following eq.() based approximation

\alpha_{\ell}q(x|    \theta_{\ell}) =\alpha_{\ell}\int q(x|  y, \ell,  \theta_{x|y,   \ell})q(y |\ell,  \theta_y)dy=\int e^{L_{\ell}(x,y,\Theta_{\ell})}dy\approx e^{L_{\ell}(x,y^*_{t,\ell}, \Theta_{\ell})}(2\pi)^{0.5m_{\ell}}|\Pi^y_{\ell}|^{-0.5}

Data sensitive improper priori

Here we just introduce a rationale for using a priori in a format of q(\psi)\propto 1/ \sum_{t=1}^N q(u_t|\psi). In an infinite sample size, we have \int q(u|\psi) du=1 that does not depend on \psi. However, this is no longer true for s(\psi)=\sum_{t=1}^N q(u_t|\psi) on a finite sample size, which varies with \psi and imposes an implicit distribution \propto s(\psi). Considering a priori q(\psi) \propto 1/s(\psi) can balance off this unnecessary bias.

Relations to variational approximation

Maximizing the likelihood function q({\mathbf X} |\Theta)= \int q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y)d{\mathbf Y} is suggested to be replaced by maximizing one of its lower bound via the Helmholtz free energy or variational free energy (Day95, Neal99), that is, \max_{\Theta} q({\mathbf X} |\Theta) is replaced by maximizing the following cost

(13)
\begin{array}{l} F=- \int p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ln{p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}) \over q({\mathcal X}_N | {\mathbf Y}, \Theta )q({\mathbf Y}|\theta_y) } d {\mathbf Y} \\ =- \int p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x})\ln{p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}) \over q({\mathbf Y}| {\mathcal X}_N, \Theta)} d {\mathbf Y} +\ln{q({\mathcal X}_N |\Theta)} \le \ln{q({\mathcal X}_N |\Theta)}, \\  q({\mathbf Y}| {\mathcal X}_N, \Theta)={q({\mathcal X}_N | {\mathbf Y}, \theta_{x|y} )q({\mathbf Y}|\theta_y) / q({\mathcal X}_N |\Theta)}. \end{array}

Instead of computing q({\mathcal X}_N |\Theta) and q({\mathbf Y}| {\mathcal X}_N, \Theta), a pre-specified parametric model is considered for p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}), and learning is made for determining the unknown parameters \theta_{y|x} together with \Theta via maximizing F.

Actually, maximizing F by eq.(13) is equivalent to \min_{\Theta} KL(p \Vert q,  \Theta) by eq.() with p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N). In other words, two approaches coincide in this situation, while they were motivated from two different perspectives. Maximizing F by eq.(13) directly aims at approximating the ML learning on q({\mathcal X}_N |\Theta), with an approximation gap that trades off computational efficiency via a pre-specified parametric p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}). This gap disappears if p({\mathbf Y}| {\mathcal X}_N, \theta_{y|x}) is able to reach the posteriori q({\mathbf Y}| {\mathcal X}_N, \Theta ). However, minimizing KL(p \Vert q,  \Theta) by eq.() is not motivated from a purpose of approximating the ML learning though it was also shown in (Xu, 1995) that \min_{p({\mathbf Y}| {\mathbf X}, \theta_{y|x})} KL(p \Vert q,  \Theta) for a p({\mathbf Y}| {\mathbf X}, \theta_{y|x}) free from of constraints makes \min_{\Theta} KL(p \Vert q,  \Theta) become the ML learning when p({\mathbf X})=\delta({\mathbf X}-{\mathcal X}_N). Instead, the motivation is determining all the unknowns in the Ying-Yang pair to make the pair best matched. The approaches of the shadowed center in Fig. are special cases of minimizing the Helmholtz free energy -F by eq.(13) and of minimizing KL(p \Vert q,  \Theta) by eq.(). In addition to being equivalent to the ML learning and approximating the ML learning, studies on \min_{\Theta} KL(p \Vert q,  \Theta) by eq.() further covers not only extensions to p({\mathbf X}, h), but also the problems of \min_{q({\mathbf X} | {\mathbf Y}, \theta_{x|y} )} KL(p \Vert q,  \Theta) with respect to a free q({\mathbf X} | {\mathbf Y}, \theta_{x|y} ), which leads to the minimum mutual information (MMI) base ICA learning (Amari96).

References

  • Method of steepest descent, Wikipedia, http://en.wikipedia.org/wiki/Laplace_approximation
  • McLachlan, GJ & Geoffrey, J (1997), The EM Algorithms and Extensions, Wiley.
  • Rissanen, J., Information and Complexity in Statistical Modeling, Springer, 2007.
  • Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.

Action editor: Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia
For authors