APPENDIX
From Scholarpedia
| This revision has not been approved by curators yet; It may contain inaccuracies. | ||||||||||||||||||||
Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong
This page is in a revision process and thus may contain errors. Please revisit this page later
Contents |
(A) Uncertainty Conversation Principle
In a compliment to the Ying-Yang philosophy, Ying is primary and thus is designed first, while Yang is relatively secondary and thus is designed basing on Ying. Moreover, as illustrated by the Ying-Yang sign located at the top-left of Fig., the room of varying or dynamic range of Yang should balance with that of Ying, which motivates to design
(in fact, only
because
is given by
already) under a principle of uncertainty conversation between Ying-Yang. In other words, Yang machine preserves a varying room or dynamic range that is appropriate to accommodate uncertainty or information contained within the Ying machine, i.e.,
under a uncertainty measure
in one of the choices given in Fig.1. Since a Yang machine consists of two components
, we may have one or two of the following uncertainty conversations:
- (1)
One typical case of Likelihood based (a) is
with a scalar
. It includes Likelihood based (b) at
, which represents the strongest conversation that can be achieved between Ying and Yang since
is unreachable as
fixed by
already. Moreover, it usually encounters an implementing difficulty to get
, except some special cases that the above intergal over
is either analytically solvable or becoming a inexpensively computable sum. One other typical case is the Hessian based (b) for avoiding the computing
difficulty of handling the intergal over
. As previously discussed (Xu, 2004b, p889), this is consistent to the celebrated Cramér-Rao inequality stating that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of
, while
provides an approximate Fisher information matrix.
(B) BYY system for supervised learning and Subspace basd functions
Typically,
is in one of two special cases as shown in Fig.2(a).
For a further insight, we observe Fig.2(a) for the details of a BYY system in i.i.d. cases, which is actually the front layer of the BYY system in Fig. with eq.(). Following the principle by eq.(1) , we design
according to the likelihood based measure and get the variance of
according to the Hessian matrix measure. Moreover, we remove the integral of
to get
by Laplace approximation. In addition to modeling
, we may also implement
by the following distribution
or the regression
:
- (2)
In other words, this BYY system also covers the mixture-of-experts (ME) model (Jacobs, et al, 1991; Jordan & Xu, 1995), alternative ME model (Xu, Jordan, & Hinton, 1995), and its special cases of normalized radial basis function (RBF) networks and extended normalized RBF networks (Xu, 1998). Instead of using the conventional maximum likelihood (ML) learning, these models can be obtained by BYY harmony learning with favorable features to be introduced in the rest sections. Readers are also referred to maximum likelihood learning. Further details are also referred to the Scholarpedia page Rival penalized competitive learning.
To understand more specifically, we proceed to a simplified case that
degenerates back to
only, i.e., for the cases of unsupervised learning. We consider the problem of local factor analysis in Fig.2(b). It can be regarded as an combination of two well known unsupervised data analysis tools, i.e., samples are models by a mixture of a number of Gaussians, with each Gaussian generated by a linear system from independent factors of either Gaussian for a conventional factor analysis (FA) and or Bernoulli for binary factor analysis (BFA). Specifically, when
and
are both Gaussian, and thus
and
are also Gaussian. In this case,
is obtained without any approximation. In addition to determining unknown parameters, we also need to appropriately select the unknown
, where
is the number of subspaces and
is the dimension of each subspace.
In previous sections, not only it has been discussed after eqn.() that the BYY best harmony in its general expression by eqn.() provides a unified view on the Ying Yang harmony
by eqn.() at
in the Lebesgue measure and the Ying-Yang matching
by eqn.() at the special setting
, but also it has been discussed
after eq.() the relation of BYY harmony learning to Rival Penalized Competitive Learning (RPCL)(Xu, Krzyzak, & Oja, 1992&93) from a updating flow view and after eq.(9) the difference of the Ying Yang harmony
by eqn.() from the maximum likelihood (ML) and Bayesian learning. In sequel, we further discuss systematically
relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in Fig. and summarized in Fig..
For a detailed insight, we further use the above second way on the i.i.d. cases shown in Fig.2(a). It follows from eq.() that we have
- (10)
where
indicates that we consider not only unsupervised learning for the dependence structure underlying
but also supervised learning for the relation
, and
degenerates back to consider only unsupervised learning.
One typical way to implement Stage I(a) in Fig.(a) is following the gradient flow
. From eq.(10), we have
- (9)
by which Stage I(a) in Fig.(a) can be adaptively implemented per sample
by iterating alternatively the Yang step and Yang step in
Fig.3.
Again, it degenerates back to considering only unsupervised learning by letting
, Further shown in Fig.4 is the details for local factor analysis in Fig.2(b). These updating formulae all comes from gradient based updating, along either a gradient direction directly or a direction that has a positive projection on the gradient direction by multiplying a positive definite matrix such that certain constraints on
are satisfied and that computation is simplified, e.g., updating of
is simplified with
removed from its computation.
A further extension can be made to also consider the relation
via two cascadeded mapplings
and
. An example is given in Fig.5, for which learning is made by swithing on
in Fig.4 for getting
and updating
, plus updating
. We get a regression
that combines a number of local linear regression functions with each supported on a local subspace, and thus we call it subspace based function (SBF).
A further extension can be made to also consider the relation
via two cascadeded mapplings
and
. An example is given in Fig.5, for which learning is made by swithing on
in Fig.4 for getting
and updating
, plus updating
. We get a regression
that combines a number of local linear regression functions with each supported on a local subspace, and thus we call it subspace based function (SBF).
In addition to tackling computational difficulty by removing the integral over
in help of eq.(), we may also encounter the computational difficulty of the summation
or
, if there are too many items to sum up. As above discussed, maximizing
with respect to a structure free
leads to
, and the corresponding summation
per sample
reduces into merely one most important term. However, it may be too crude to use merely this term to approximate an entire summation. Instead, we may extend one
into
that is a subset of all the values that
will take. E.g.,
if certain constraints prevent those of
with
to become zeros,
will reduce
into
with
-
, where
denotes the cardinality of the set
.
Then, we approximately let every appearance of
or
to be replaced by
. Such an approximation will considerably save computing cost but does not affect performance seriously for an appropriate
. Particularly, when
, it follows from the reason similar to eq.() that the BYY learning algorithms in the first column of Fig. becomes very close to RPCL algorithms in the fourth column of Fig., with two points of difference. First, it avoids the difficulty of specifying an appropriate
for RPCL. Second, the rival
is updated by either de-learning or learning instead of only de-learning by RPCL.
In the cases with a binary vector
, the integral
becomes a summation
over all the
terms. In a strict sense, the concepts
in Fig.3 become not applicable to a binary vector
. Still, we have those equations in Fig.3 simply as if
is real since we can rewrite
in a format
from which we get those equations again without invovling
. Moreover, in addition to the choice
in Fig.4, we may also consider the following choices:
-
, which is equivalently considering an independent distribution
in Fig.2;
-
, where
consists of
and those values with
bit distance away, typically
(C) priors: data smoothing and normalization
Conceptually, those priors studied in the literature of Bayesian approaches may be adopted as
accordingly, ranging from Jeffreys prior to Dirichlet prior and other conjugate priors. Alternatively, a data sensitive improper priori
without requiring a hyper-parameter
was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite size samples, the key points are given as follows:
- It has been shown empirically that a good choice of
is simply given as follows:
- (5)
- Eqn.(5) is just a special case of the following one:
- (6)
with a rationale called Cancelling Induced Bias (CIB) underlying a paramtric model
. Conceptually,
shields
to take effect. However, a finite size of samples makes
become a measure of
, which acts as an unwanted improper prior. Eqn.(6) aims at to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.
Next, we can use eq.() to remove the above integral over
and then get the corresponding
. Particularly,
for the i.i.d. cases shown in Fig.2(a) it follows from eq.() that we have
- (10)
which degenerates back to the cases of unsupervised learning when we let
- (11)
- (9)
Next, we can use eq.() to remove the above integral overand then get the corresponding
. Particularly,
for the i.i.d. cases shown in Fig.2(a) it follows from eq.() that we have
- (10)
which degenerates back to the cases of unsupervised learning when we let
- (11)
One typical way to implement Stage I(a) in Fig.(a) is following the gradient flow
. From eq.(10), we have
by which Stage I(a) in Fig.(a) can be adaptively implemented per sample
coming by iterating alternatively the Yang step and Yang step in
Fig.(a).
Derivations for BYY harmony learning in i.i.d. cases
The corresponding BYY system is shown in Fig. and eq.() becomes
- (12)
where
is the regression function of
, and thus we can directly consider
with no need to model
. Moreover,
is designed under the principle by eq.(1) according to the likelihood based measure, via the following eq.() based approximation
Data sensitive improper priori
Here we just introduce a rationale for using a priori in a format of
. In an infinite sample size, we have
that does not depend on
. However, this is no longer true for
on a finite sample size, which varies with
and imposes an implicit distribution
. Considering a priori
can balance off this unnecessary bias.
Relations to variational approximation
Maximizing the likelihood function
is suggested to be replaced by maximizing one of its lower bound via the Helmholtz free energy or variational free energy (Day95, Neal99), that is,
is replaced by maximizing the following cost
- (13)
Instead of computing
and
, a pre-specified parametric model is considered for
, and learning is made for determining the unknown parameters
together with
via maximizing
.
Actually, maximizing
by eq.(13) is equivalent to
by eq.() with
. In other words, two approaches coincide in this situation, while they were motivated from two different perspectives. Maximizing
by eq.(13) directly aims at approximating the ML learning on
, with an approximation gap that trades off computational efficiency via a pre-specified parametric
. This gap disappears if
is able to reach the posteriori
. However, minimizing
by eq.() is not motivated from a purpose of approximating the ML learning though it was also shown in (Xu, 1995) that
for a
free from of constraints
makes
become the ML learning when
. Instead, the motivation is determining all the unknowns in the Ying-Yang pair to make the pair best matched. The approaches of the shadowed center in Fig. are special cases of minimizing the Helmholtz free energy
by eq.(13) and of minimizing
by eq.(). In addition to being equivalent to the ML learning and approximating the ML learning, studies on
by eq.() further covers not only extensions to
, but also the problems of
with respect to a free
, which leads to the minimum mutual information (MMI) base ICA learning (Amari96).
References
- Method of steepest descent, Wikipedia, http://en.wikipedia.org/wiki/Laplace_approximation
- McLachlan, GJ & Geoffrey, J (1997), The EM Algorithms and Extensions, Wiley.
- Rissanen, J., Information and Complexity in Statistical Modeling, Springer, 2007.
- Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.
| Action editor: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |


