Bayesian Ying Yang learning
From Scholarpedia
Lei Xu (2007), Scholarpedia, 2(3):1809. Revision #67870.
Curator: Dr. Lei Xu, Dept. Computer Science & Engineering, Chinese University of Hong Kong
Bayesian Ying-Yang learning is a statistical learning theory for an intelligent system with two pathways, built on a Ying representation and a Yang representation, i.e., two complementary Bayesian representations of the joint distribution of the external observation and its inner representation. The system architecture is designed under three general principles, and all remaining unknowns in the system are learned from a set of samples under a Ying-Yang best harmony principle.
First proposed in 1995 and systematically developed over the following decade, Bayesian Ying-Yang learning provides not only a general framework that accommodates several typical learning theories and approaches under a unified perspective but also a new theory that leads to both improved model selection criteria and learning algorithms with automatic model selection. Moreover, it implements best harmony by alternating Ying-Yang maximization, and it combines model selection via the Ying representation with learning regularization via the Yang representation.
Two types of abilities and two pathway approaches
It is speculated that a learning system or a biological brain survives in its world with at least two types of intelligent abilities. Type-I, implemented by a top-down or outbound pathway, consists of abilities for 'fitting' its world, including not only the ability to understand and explain its world but also the motor ability to track changes in it. Type-II, implemented by a bottom-up or inbound pathway, consists of problem-solving skills, ranging from perceiving what events are encountered to generating signals that activate the outbound pathway. Both pathways were initially set up via biological inheritance and then developed gradually through interactions between the brain system and the world in which it survives.
The outbound pathway is developed mainly via a learning process that mines common features or discovers regularities among an ensemble of uncertain evidence (also called samples) from the world. The inbound pathway, in contrast, must solve problems quickly, usually per one or several samples encountered. It is developed either adaptively, by learning from previous problem-solving experiences, or off-line, by seeking problem-solving strategies based on the Type-I knowledge (i.e., regularities or dependencies about the world) through inference and optimization and then building these strategies into functions that can be implemented quickly.
The two-pathway approach has been adopted in the literature on modeling perception systems for decades. One early example is the adaptive resonance theory developed in the 1970s (Grossberg, 1976; Grossberg & Carpenter, 2002), featured by a resonance between bottom-up input and top-down expectation, with the help of a mechanism motivated by a cognitive science view. In the past one or two decades, further efforts have been made on two-pathway approaches based on multilayer nets under the guidance of a statistical principle, e.g., under the least mean square error criterion by auto-association (Bourlard & Kamp, 1988) and LMSER self-organization (Xu, 1991, 1993), and under the Helmholtz energy by the Helmholtz machine and sleep-wake learning (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002). The basic spirit of LMSER self-organization was further developed into Bayesian Ying-Yang (BYY) learning in the mid-1990s, with the multilayer two-pathway structure replaced by a general system via two complementary Bayesian representations and with the least mean square error criterion replaced by a best Ying-Yang harmony criterion (Xu, 1995, 2002).
Over-fitting, model selection, and regularization
The hardware of a biological brain is a living body whose organization is specified genetically. Such an organization, whether or not in a two-pathway form, can be mathematically denoted by a sophisticated manifold
with a parametric architecture or structure
to accommodate all the functions or tasks that it is expected to perform. To build an artificial intelligent system, our task is to set up such a
.
Only from
, we are unable to specify an appropriate
. It needs to be appropriately designed, which is task-dependent and needs knowledge about the structure underlying the samples of
. Except in some particular situations, we usually have no such knowledge. Instead, we consider
to be a combination of a number of individual simple structures via a simple combination scheme. Shown in Fig.1(a) are two examples of such combined architectures. In general, we consider a family of infinitely many structures
with each
in the same configuration but at a different scale, each of which is labeled by a scale parameter
in terms of one integer or a set of integers that accounts for the scale of the combined architecture, i.e., the number of simple structures involved. Given
, we also need to determine
that consists of all the unknown parameters within the combined architecture. Usually, what we have is a given set
of samples with a finite size
, from which we need to determine both
and
. In other words, we need to specify one
to model or represent the given set
. More specifically, the task of determining
at a given
is usually called parameter learning, while the task of selecting an appropriate scale
is called model selection since we select among a set of candidate models as the value of
varies.
We expect that
best fits or represents
such that the error of using
to represent
becomes the least. With such a type of best fitting principle, we expect to determine
at a given
. As shown in Fig.1(b), the fitting error decreases monotonically as
grows, until the error reaches zero at
, a value that relates to
and is usually much bigger than an appropriate
. In this case, a large part of the structure has actually been used to fit noise or outliers. This is usually called the "over-fitting" problem, under which we cannot obtain an appropriate
. Since they take only parameter learning into consideration, the previously mentioned two-pathway approaches (such as auto-association, LMSER self-organization, and the Helmholtz machine) cannot avoid this over-fitting problem either.
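The monotone decrease of the fitting error can be reproduced with a small numerical sketch. The example below is only an illustration under assumed settings (a polynomial family as the candidate structures, a true degree of 2, Gaussian noise, and the hypothetical helper `training_errors`); none of these come from the article itself. It fits polynomials of increasing degree to N samples by least squares and reports the training error, which falls to zero once the degree reaches N - 1.

```python
import numpy as np

def training_errors(x, y, max_degree):
    """Least-squares training MSE of polynomial fits for degrees 0..max_degree."""
    errors = []
    for k in range(max_degree + 1):
        # Design matrix with columns 1, x, ..., x^k
        A = np.vander(x, k + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        errors.append(float(np.mean((A @ coef - y) ** 2)))
    return errors

rng = np.random.default_rng(0)
N = 8  # assumed sample size
x = np.linspace(-1.0, 1.0, N)
# Assumed ground truth: a degree-2 polynomial plus Gaussian noise
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(N)

errs = training_errors(x, y, N - 1)
for k, e in enumerate(errs):
    print(f"degree {k}: training MSE = {e:.2e}")
```

The error at degree N - 1 is zero up to round-off because the polynomial interpolates every sample, including its noise, even though the data were generated at degree 2.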
One major direction for tackling the problem is a two-stage implementation: parameter learning is performed on a set of candidate models, among which one is selected by a model selection criterion. Going beyond the best-fitting principle, such a criterion is obtained from a different statistical learning theory. The literature offers several typical model selection criteria, including the Akaike information criterion (AIC) and its extensions (Akaike, 1974; Bozdogan & Ramirez, 1988; Cavanaugh, 1997), Bayesian-approach-related criteria under the names of Bayesian information criterion (BIC) (Schwarz, 1978), minimum message length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999), and minimum description length (MDL) (Rissanen, 1986, 2007), and cross-validation (CV) based criteria (Stone, 1978; Rivals & Personnaz, 1999). Unfortunately, not only is a two-stage implementation very expensive to compute, but these criteria also provide only a rough estimate, or even a wrong one, because their performance deteriorates considerably when
is small and
consists of more than one integer.
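A two-stage implementation can be sketched as follows. This is a minimal illustration, assuming polynomial candidates, Gaussian noise, and the common regression form of BIC, N ln(RSS/N) + k ln N with additive constants dropped; the function name `bic_for_degree` and all data settings are hypothetical. Stage one fits every candidate by least squares; stage two picks the candidate with the smallest criterion value.

```python
import numpy as np

def bic_for_degree(x, y, degree):
    """BIC of a least-squares polynomial fit under an assumed Gaussian noise model."""
    n = len(x)
    A = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # stage 1: parameter learning
    rss = float(np.sum((A @ coef - y) ** 2))
    n_params = degree + 2                              # coefficients plus noise variance
    return n * np.log(rss / n) + n_params * np.log(n)  # constants dropped

rng = np.random.default_rng(1)
N = 60  # assumed sample size
x = np.linspace(-1.0, 1.0, N)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(N)  # assumed true degree 2

candidates = list(range(0, 9))
bics = [bic_for_degree(x, y, k) for k in candidates]  # one full fit per candidate
best_degree = candidates[int(np.argmin(bics))]         # stage 2: model selection
print("selected degree:", best_degree)
```

Every candidate scale requires its own complete parameter learning run, which is the computing cost the paragraph above refers to; with a scale made of several integers the number of candidates grows combinatorially.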
Another direction for tackling over-fitting is called regularization. Instead of searching for an appropriate
, we consider a
with its scale
large enough to accommodate the regularity underlying
, and then impose certain constraints on either or both of
and
(Tikhonov & Arsenin, 1977; Poggio & Girosi, 1990) to specify one
for approximately describing a true structure that is actually at a lower scale. Equivalently, the situation also applies to the cases in which
comes from an underlying structure with a known scale
while the size
is not large enough to specify a unique
. Unfortunately, it is difficult to choose appropriately what type of constraint to impose. Instead, the constraint is typically imposed in an isotropic or uniform manner, which usually does not work. For example, if we use a polynomial of degree
to fit a set of samples from
, the purpose of model selection is to force all the parameters
to be zero, which cannot be achieved by an isotropic or uniform regularization, e.g., minimizing
fails to treat the parameters
differently from those
. In fact, such a regularization disturbs or conflicts with the purpose of model selection. Even given a type of constraint, another difficulty is to control the regularization strength appropriately; it can be roughly estimated only for a rather simple structure, either by handling the integral of the marginal density or with the help of cross validation at extensive computing cost (Stone, 1978; Rivals & Personnaz, 1999).
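The polynomial example can be checked numerically. The sketch below assumes that the isotropic regularizer is the usual squared-norm (ridge) penalty with one strength shared by all coefficients; the exact penalty in the original formula is not recoverable from this text, so treat this choice, the helper `ridge_polyfit`, and the data settings as assumptions. A degree-6 polynomial is fitted to samples generated at degree 2.

```python
import numpy as np

def ridge_polyfit(x, y, degree, lam):
    """Minimize ||A w - y||^2 + lam * ||w||^2 with the same lam for every coefficient."""
    A = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(A.T @ A + lam * np.eye(degree + 1), A.T @ y)

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(x.size)  # assumed true degree 2

w = ridge_polyfit(x, y, degree=6, lam=0.1)     # isotropic penalty
w_ls = ridge_polyfit(x, y, degree=6, lam=0.0)  # plain least squares for comparison
print("ridge coefficients:", np.round(w, 3))
```

The penalty shrinks the coefficient vector as a whole, but the coefficients of orders 3 to 6 stay nonzero, and the genuinely needed low-order coefficients are penalized by the same strength. No extra structure is switched off, which is the conflict with model selection described above.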
Three levels of tasks from inverse problem perspective
The learning tasks discussed above can be summarized into the following three levels from an inverse problem perspective, as illustrated in Fig.2:
- Inference or Relaxation
Provided with an observation
that can be regarded as either generated from an inner representation
or a consequence from a cause
via a given mapping
, the Type-II ability makes an inverse inference
, as shown in Fig.2(a). When
is one-to-one, its inverse mapping
is analytically solvable. Generally, the task of determining
from
is involved in reasoning, mapping, representation, etc. It is usually not a simple task, for various reasons. One is the uncertainty incurred externally by observation noise, which can be described by a distribution
for a probabilistic mapping
. Uncertainty also originates internally from a mapping
that is many-to-one or infinitely-many-to-one, which can be handled by
plus a distribution
for each reasonable cause or inner representation
, as shown in Fig.2 (b). Actually,
and
jointly act as the knowledge of Type I, based on which a Type-II ability is obtained via statistical inference methods, as shown in column (a) of the table in Fig.3. Going further,
may be determined indirectly via the dynamics of minimizing an energy or cost incurred by violating certain constraints in
, under the term of relaxation.
- Parameter Learning
The second level of inverse problems considers the situations that
and
are unknown but provided with their parametric structures
and
with
. As illustrated in Fig.2 (c), the scenario becomes that we get a set of samples
generated from a map
, and the task is getting an inverse mapping
, usually referred to as estimation or parameter learning for
. Actually, the task of determining
also includes an inverse inference
as a subtask. However,
is made per sample of X, whereas
is based collectively on all the samples. We may also update
as samples arrive, but at a speed much slower than that at which
varies as one sample or a subset of samples arrives. Via either a marginal distribution
or directly a parametric structure
with an unknown parameter set
, we encounter the inverse problem shown in Fig.2 (c) with uncertainties described by
and
. Again, the most widely studied direction consists of statistical inference methods, as shown in column (b) of the table in Fig.3.
- Model Selection
The third level of inverse problems considers a family of structures
with each
in the same prespecified configuration but at an unknown scale
. Now, a set of samples
is generated from a map
and the task is getting an inverse mapping
or equivalently
, which is usually called model selection. This task also includes the subtask of determining
in
for each
. Via
, we encounter the inverse problem shown in Fig.2 (d), with uncertainties described by
and
, which is tackled by statistical inference methods shown in the column (c) of the table in Fig.3.
From these inverse problems and Fig.3, we get a general perspective that summarizes typical statistical approaches for the three levels of learning tasks. The first three items in columns (b) and (c) summarize the widely studied approaches to parameter learning (e.g., ML, MAP) and model selection (e.g., BIC, MDL, MML), while the first three items in column (a) cover typical inverse solving techniques, either nested in parameter learning or used independently in tasks such as classification, inference, and control.
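As a concrete instance of the first level (inverse inference), the sketch below computes a Bayesian posterior over a small set of discrete causes from one observation and then takes the MAP decision. The symbols and numbers are assumptions made for illustration (a Gaussian observation model per cause, a three-valued cause, hypothetical names `gaussian_pdf` and `posterior`); the article's own notation is not recoverable from this text.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior(x, means, variances, prior):
    """Bayesian inverse inference: posterior over causes is proportional to likelihood times prior."""
    joint = gaussian_pdf(x, means, variances) * prior
    return joint / joint.sum()

# Assumed Type-I knowledge: one Gaussian observation model per cause, plus a prior
means = np.array([-2.0, 0.0, 3.0])
variances = np.array([1.0, 0.5, 1.0])
prior = np.array([0.3, 0.5, 0.2])

post = posterior(2.5, means, variances, prior)  # inference per observed sample
y_map = int(np.argmax(post))                    # MAP decision
print("posterior:", np.round(post, 4), "MAP cause:", y_map)
```

The observation model and the prior play the role of the Type-I knowledge, and the posterior computation is the fast per-sample Type-II step. At the second and third levels the same computation is nested inside parameter learning and model selection.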
Despite their extensive applications, these classical approaches still suffer from at least three kinds of limitations.
- The integrals over
and
are analytically solvable only for some specific distributions and become intractable in many cases. One way to address this problem is the Laplace approximation (Schwarz, 1978). The other direction is featured by either variational approximations (Hinton & Zemel, 1994; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Jaakkola, 2001) or BYY learning, which are related to the LPD items in Fig.3 and will be introduced later.
- The difficulty of getting an appropriate
is similar to the previously discussed difficulty of choosing what type of constraint for regularization. Many efforts have been made on this topic in the literature of machine learning, ranging from Jeffreys prior and other improper priors to Dirichlet prior and other conjugate priors.
- Fig.3 merely summarizes, instead of unifying, these statistical approaches. Also, it fails to cover other information theoretic related approaches. Moreover, it merely considers the bottom-up pathway as an inverse of the top-down pathway with other possible scenarios ignored.
Bayesian Ying-Yang system: A unified framework
As shown in Fig.4, a set
of samples is regarded as generated via a top-down path from its inner representation
, not only with a long-term memory
that is a collection of all unknown parameters in the system for collectively representing the underlying structure
or regularities among data
as the knowledge about the world, but also with a short-term memory
in which each element
is the corresponding inner representation of one element
. A mapping
and an inverse
are jointly considered via the joint distribution of
in two types of Bayesian decomposition:
- (1)
In keeping with the famous ancient Chinese Ying-Yang philosophy, the decomposition of
coincides with the Yang concept, with a visible domain
for a Yang space and a forward pathway by
as a Yang pathway. Thus,
is called the Yang machine. Similarly,
is called the Ying machine, with an invisible domain
for a Ying space and a backward pathway by
as a Ying pathway. Such a pair of Ying-Yang machines is called a Bayesian Ying-Yang (BYY) system. It actually consists of two layers. The back layer accommodates
and supports the front layer by a prior
, while
consists of the posterior
that transfers the knowledge from observations to the back layer. The front layer is itself a Ying-Yang pair for
and
, modeled in a parametric Ying-Yang pair
where
and
.
Playing a key role in the information flow within the front layer,
is supported by a parametric substructure
. On one hand,
is the source of the information flow to fit the observations
via the top-down pathway
that implements the abilities of Type I. On the other hand, featuring the abilities of Type II, this
comes from the information flow via a bottom-up pathway
. The input to the system comes from a sample set
either in a smoothed form
or directly
by setting
, where
denotes a Gaussian density of vector
with a mean vector
and a covariance matrix
. We need to design the other three substructures appropriately, with different designs performing different learning tasks. In general, the designs are guided by the following three principles, respectively:
Least Redundancy Principle
Subject to the nature of the learning tasks,
should have a structure in which the inner representation of
is encoded with as little redundancy as possible. First, the number of variables and parameters of
should be as small as possible, which is itself the model selection task and lies beyond the task of design. Second, the dependencies among these variables should be as few as possible, which can be considered, at least partly, during design. E.g., when
consists of multiple components
, we may design
.
Divide and Conquer Principle
Subject to the representation formats of
and
, a complicated mapping
is modeled by designing
via a mixture of a number of simple structures. E.g., we may design
via a mixture of a number of linear regressions featured by Gaussians of
conditional on
. In some situations, it is not necessary to design
and
separately. Instead, we design
, especially
via a single parametric model as a whole but still attempting to follow the above two principles. One example is the one encountered in Type B of Rival penalized competitive learning, actually via a product Gaussian mixture (see Scholarpedia page Rival penalized competitive learning). In general, we can consider an integral measure
to get
- (2)
Uncertainty Preservation Principle
In keeping with the Ying-Yang philosophy, Ying is primary and is thus designed first, while Yang is relatively secondary and is thus designed based on Ying. Moreover, as illustrated by the Ying-Yang sign at the top-left of Fig.4, the room for variation, or dynamic range, of Yang should be consistent with that of Ying, which motivates designing
(in fact, only
because
is given by
already) under a principle of uncertainty preservation between Ying and Yang; that is, the Yang machine preserves a room for variation, or dynamic range, that is appropriate to accommodate the uncertainty or information contained within the Ying machine. One rationale is that the representation of
by the Yang machine is more uncertain than that by the Ying machine, from which we have
- (3)
where
is the domain of
by the Ying machine. The strongest preservation between Ying and Yang is
achieved at
or equivalently at
. As to be addressed in the next subsection, the best Ying Yang harmony principle with
leads us to consider a relaxed preservation that
. To be more specific,
consists of an apex realm that varies with the input
and focuses on those possible representations of
with high probabilities. At a given
, we have
as follows:
- (4)
which may be called the apex zone or climax neighborhood that corresponds to
. One degenerate case is
that consists of only the apex point
, e.g., occurring when
. The domain outside
, i.e., the part that corresponds to the tail
, may be handled in several ways. One is letting
be small enough that this part can simply be ignored. The other is considering a lumped representation, e.g., eqn.(11c) in the Scholarpedia page Rival penalized competitive learning.
For the cases in which the representation
or partially a subset
consists of real numbers, it is usually difficult in implementation to get
, except in special cases where the integral over
is analytically solvable. Typically, the integral is handled by the Laplace approximation around
, which actually makes a Peak Convex Analysis on an apex zone or climax neighborhood of
. In this case, the inequality in eq.(3) implies that
- (5)
which is consistent, as previously discussed (Xu, 2004b, p889), with the celebrated Cramér-Rao inequality, which states that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of
, while
approximates Fisher information matrix, where
, and
denotes the covariance matrix of
. For two positive definite matrices
,
means
for any
. A simple example is
- (6)
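The matrix ordering used here can be checked numerically. The sketch below assumes the standard reading of the ordering between two positive definite matrices, namely that one quadratic form is no smaller than the other for every vector, which is equivalent to their difference being positive semidefinite. The function name `dominates` and the example matrices are hypothetical and only illustrate a covariance sitting above or below a Cramér-Rao-type bound.

```python
import numpy as np

def dominates(A, B, tol=1e-12):
    """True if u^T A u >= u^T B u for every u, i.e. A - B is positive semidefinite."""
    diff = A - B
    diff = (diff + diff.T) / 2.0  # symmetrize against round-off
    return bool(np.min(np.linalg.eigvalsh(diff)) >= -tol)

fisher = np.array([[4.0, 1.0], [1.0, 2.0]])  # hypothetical Fisher information matrix
bound = np.linalg.inv(fisher)                 # inverse Fisher information (lower bound)
cov_ok = bound + 0.1 * np.eye(2)              # a covariance that respects the bound
cov_bad = bound - 0.1 * np.eye(2)             # a covariance that violates it

print(dominates(cov_ok, bound), dominates(cov_bad, bound))
```

Checking the smallest eigenvalue of the difference replaces the quantifier over all vectors with a single finite computation.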
Another possible way to relax eq.(3) is to let
be replaced by
. That is, being bounded by the Ying machine is relaxed into being ceilinged by the Ying machine. In addition, we may also consider uncertainty conservation with the help of either a distributive or collective measure, with further details given in the subpage /APPENDIX (A).
Still, we use the notation
to refer to a BYY system, with its configuration
specified by the above design and with
consisting of unknown parameters in both Ying and Yang. Moreover, the scale
features the complexity of
, including the scale
for representing
and the rest part that is contributed by
. The remaining task consists of parameter learning for determining
and model selection for selecting an appropriate scale
.
is an example of eq.(3) with
given in Fig.7 and with the integral of
for getting
removed by Laplace approximation. As an example of eq.(5),
is given indirectly via the corresponding
and
. The BYY system provides a unified framework for a number of typical learning tasks in the following senses:
- It includes the three levels of inverse problems discussed in the previous section.
- It applies to not only the cases that
of random vectors that are independent and identically distributed (i.i.d.), as shown in Fig.5, but also the cases that there are certain temporal structure underlying
via
with a built in temporal dependence (Xu, 2001b, 2004a).
- It implements typical intelligent tasks via specific cases of the inner representation
per sample
coming. When there is only
that takes several labels,
implements a pattern classification or decision making. When there is only
of a binary vector,
implements an inner encoding task. When there is only
of a real vector,
implements dimension reduction or generates a control signal. Each of them can be regarded as a subtask of a combined task by
.
- It covers both unsupervised learning and supervised learning. Conventionally, learning that is based only on an input sample
is called unsupervised, while learning that is based on a sample of an input-output pair is called supervised, which further arises in two scenarios. One is directly implemented by the Yang machine if either or both of
are available per sample
coming. The other is implemented by the entire BYY system in a cascaded mapping
with each pairing
available, for which we can simply regard
jointly as an input, with the pair
taking the place of
, that is, we have
- (7)
Further details are given in the subpage /APPENDIX(B).
BYY harmony learning (A): fundamentals
According to the Chinese Ying-Yang philosophy, one key principle is Ying-Yang harmony, in the sense that Ying and Yang not only match each other but also seek a best match in the most compact way, as illustrated by the Ying-Yang sign at the top-right of Fig.4. This motivates us to determine the unknowns
and
in a BYY system under such a best harmony principle, which is mathematically implemented by maximizing the following harmony measure
- (8)
On one hand, this maximization forces
to match
. In other words,
attempts to describe
with the help of
, which uses
to fit
not in a maximum likelihood sense but with a promising least complexity nature. Due to a finite size
, this matching aims at (but may not really reach)
. Still we get a trend towards this equality by which
becomes the negative entropy, and its further maximization will minimize the system complexity, which consequently provides a model selection nature on
.
At first glance, eq.(8) may look somewhat familiar. In fact, its degenerate case with
vanished leads to
With
at
and
, maximizing this
becomes
, i.e., maximum likelihood (ML) learning. The situation with
beyond
was also explored in the engineering literature (e.g., signal processing), via
under the name of Minimum Cross-Entropy (Minxent). It was noticed that
or
leads to a singular result that
,
, which was regarded as irregular and not useful. Instead, efforts were turned to
, where
is the classic Kullback–Leibler divergence:
- (9)
Minimizing this
with respect to
will also push
to match
, and thus the above singular result is avoided. Moreover, if
is given,
is still equivalent to
. Thereafter, Minimum Cross-Entropy is usually used to refer to this
.
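The equivalence stated above, that with the first distribution fixed, minimizing the Kullback–Leibler divergence gives the same answer as maximizing the cross-entropy-type quantity and hence as ML learning, can be verified on a toy case. The sketch assumes the usual discrete forms, sum of p ln(p/q) for the divergence and sum of p ln q for the quantity being maximized; the Bernoulli data, the grid of candidate parameters, and the function names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_u p(u) * ln(p(u) / q(u)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(u) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def neg_cross_entropy(p, q):
    """sum_u p(u) * ln q(u); with p empirical this is the mean log-likelihood."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(q[mask])))

tosses = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])  # hypothetical data: 7 heads in 10
p_emp = np.array([np.mean(tosses == 0), np.mean(tosses == 1)])  # fixed empirical p

thetas = np.linspace(0.01, 0.99, 99)  # candidate Bernoulli parameters
kl = [kl_divergence(p_emp, [1 - t, t]) for t in thetas]
nce = [neg_cross_entropy(p_emp, [1 - t, t]) for t in thetas]

theta_kl = thetas[int(np.argmin(kl))]   # minimum KL divergence
theta_ml = thetas[int(np.argmax(nce))]  # maximum likelihood
print(theta_kl, theta_ml)
```

Both searches land on the sample frequency of heads, because the two objectives differ only by the entropy of the fixed empirical distribution, which does not depend on the parameter.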
Interestingly, with the inner representation
considered in the BYY system, the scenario becomes different from the above classic situation. With
fixed,
is made with respect to only
, which is no longer useless but is responsible for the promising least complexity nature discussed after eq.(8). In other words, maximizing
with respect to unknowns not only in Ying part
but also in Yang part
brings the Ying-Yang system to a best harmony. Alternatively, we may regard such a mathematical formulation as an information theoretic interpretation of the ancient Ying-Yang philosophy.
More generally, we can extend probability densities to probability measures defined over arbitrary sets. Let
for the Ying machine and
for the Yang machine be probability measures over the domain of
. Let
be a sigma-finite measure on the domain of
, which acts as a benchmark or reference measure on the universe on which the Ying and Yang machines are supported. When
is absolutely continuous with respect to
, it follows from the Radon–Nikodym theorem that we can generalize eqn.(8) into
- (10)
which reduces to eqn.(8) when
is the usual Lebesgue measure.
- On one hand, the above
becomes the entropy of the Ying machine at the special setting
, and
pushes the Ying machine to a least complexity with respect to the benchmark measure
. However,
lacks a mechanism that can make
to model the sample set
- On the other hand,
becomes Kullback–Leibler divergence by eqn.(9) between the Ying and Yang machine at another special setting
and
makes a best Ying-Yang matching during which
attempts to fit the sample set
However, there lacks a mechanism to push the Ying machine into a least complexity due to the disappearance of the benchmark measure
.
Both special cases consider only the relationship between two measures, while the general case by eqn.(10) handles the relationship among three measures, so that the least complexity nature and the best matching nature are obtained jointly.
Without loss of generality, the rest of this article focuses on the case of probability densities by eqn.(8), while other cases may be considered similarly with the help of eqn.(10). To proceed, we rewrite eq.(8) into
- (11)
where the partition
is made such that
is obtained with the integral over
solved analytically, e.g., for some cases that
are conjugate distributions. If it is difficult to find such conjugate distributions or to compute
, we may simply consider
.
Considering a Taylor expansion of
around
up to the second order, we have the following approximation
- (12)
With
designed according to eq.(5) around the peak
, it follows from using eq.(12) on
that we approximately get
- (13)
where
is the number of free parameters in
. We can observe that the maximization of
with respect to
consists of an inner maximization of
that seeks the best harmony within the front-layer Ying-Yang pair supported by a prior
and an outer maximization that seeks the best harmony of the entire Ying-Yang system with
taken into consideration for the interaction between the front and back layers.
We may get further insights on
by considering its gradient flow
with
by eq.(3) at
. That is, we have
- (14)
It follows from
in eq.(11) that
- (15)
Noticing that
describes the fitness of an inner representation
on the observation
, we observe that
indicates whether the considered
fits
better than the average of all the possible choices of
. Letting
, we get simply
which can also be derived from
. In other words, with
at
, updating
along this gradient flow actually implements classic Bayesian learning, or classic ML learning for a non-informative prior
. With
, the gradient flow
with all possible choices of
is integrated via weighting not just by
but also by a modification of a relative fitness measure
. If
, updating goes in the same direction as Bayesian learning / ML learning but with an increased strength. If
, i.e., the fitness is worse than the average and thus this
is doubtful, updating still goes in the same direction as Bayesian learning / ML learning but with a reduced strength. When
, updating reverses to the opposite direction, i.e., it becomes de-learning. In other words, BYY harmony learning has a nature similar to Rival Penalized Competitive Learning (Xu, Krzyzak, & Oja, 1992, 1993), with the improvement that no pre-specified de-learning strength is needed.
BYY harmony learning (B): implementations
From eq.(13), we can implement the maximization of
via an iterative procedure as shown in Fig.6(a), during which a previous estimate
is used as
. Specifically, Stage I(b) relates to choosing a prior
and may be discarded when we have no prior to use or do not want to consider this part. As to Stage I(a), its implementation can be made with the help of
, for which a key point is to handle the integral over
. In general,
may include either or both of a part
of real variables and a part
of discrete labels. Let
take the place of
in
by eq.(11). We have
- (16)
which can be directly extended to cover supervised learning
simply as discussed after eq.(7).
One can get an approximation of
for implementing a gradient based learning. Two typical ways are featured by swapping the order of handling
and
. One is making the operation
first and then approximating the integral over
and the summation over
to get
and
. The other is using eq.(12) to remove the above integral over
and then get the corresponding
. For example, using this second way on the i.i.d. cases shown in Fig.5, it follows from eq.(11) that we get the iterative implementing procedure shown in Fig.7 and its application to the task of local factor analysis shown in Fig.8, with the detailed algorithm derived from the following theorem. Readers are further referred to the subpage /APPENDIX(B) for the learning algorithms on supervised learning and subspace based functions.
means that
shares the same direction of
, i.e.,
for some
. (b)
is given in Fig.5 and
is obtained from
under the constraints of the dynamic range conservation by eq.(3) and of the apex approximation by eq.(4), with
being the climax neighborhood (in short,
-CN). (c) Each Ying-Yang step is conducted either adaptively per sample
with
consisting of only one index
or in a batch with
consisting of not only the sample
but also other samples, even
of the entire sample set. (d) the updating equation for
comes from
by eq.(18), with
being the average gap between samples under a smoothing parameter
. Given a sample set,
needs to be computed only once and stored as a table for a quick access at each updating of
.
Theorem. Given a decomposition
with
being positive definite matrices, we have
.
Its proof is simple by noticing
and
. Based on this theorem, we can get the following line search updating for increasing or maximizing
- (17)
Moreover, it follows from eqn.(5.2) in (Xu & Jordan, 1996) that the convergence rate of the above case (b) with
is faster than that of the gradient updating
with
.
Conceptually, the priors studied in the literature on Bayesian approaches may be adopted as
accordingly, ranging from the Jeffreys prior (e.g.,
in Fig.7) to the Dirichlet prior and other conjugate priors. Alternatively, a data-sensitive improper prior
without requiring a hyper-parameter
was proposed in Sec.3.4.3 of (Xu, 2007a) under the name of data smoothing and in (Xu, 2007b) under the name of normalization, for regularizing the irregularity of finite-size samples. The key points are as follows:
means that
shares the same direction of
, i.e.,
for some
. For updating
, one even can let
. The updating equations on
and
remain the same as in Fig.7.
- It has been shown empirically that a good choice of
is simply given as follows (e.g., used in Fig.7):
- (18)
- Eqn.(18) is just a special case of the following one:
- (19)
with a rationale called Induced Bias Cancellation (IBC) underlying a parametric model
. Conceptually,
shields
from taking effect. However, a finite size of samples makes
become a measure of
, which acts as an unwanted improper prior. Eqn.(19) aims to cancel this induced bias. Readers are further referred to (Xu, 2007a&b) for a recent overview and Sec.23.7.4 in (Xu, 2004c) for a summary and historical remarks.
Moreover, the algorithm in Fig.7 can be further modified into a number of variants shown in Fig.9, as remarked below:
- In the first column, we may ignore
by simply setting
to remove the hyperparameter part. Also, we may remove data smoothing regularization by simply setting
. Ignoring this part will further save computing cost but may also reduce the performance.
- The previous discussion on eq.(15) still applies. Simply setting
leads to the ML learning at the bottom of the second column, which is further generalized upwards to the other three choices in the same column by considering either or both data smoothing and a priori
.
- Simply letting
be structure free subject to
, maximizing
results in
. Consequently, we get
and thus those special cases shown in the third column. In fact, getting
is a Maximum A Posteriori (MAP) competition. Thus, this type of learning is an extension of competitive learning to MAP-BYY competitive learning and goes to further extensions upwards.
- Taking the rival
into consideration, we are further led to the fourth column that consists of Rival Penalized Competitive Learning (RPCL) (Xu, Krzyzak, & Oja, 1992&93) and extensions with either or both data smoothing and a priori
taken into consideration. Moreover, special cases of the first column also generalize RPCL learning in different ways; for details, see the Scholarpedia page Rival Penalized Competitive Learning.
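The MAP competition and the rival penalization remarked in the last two items can be sketched for a one-dimensional Gaussian mixture. This is only a minimal illustration under stated assumptions: the function names, the restriction to updating the means, and the two learning rates `lr_winner` and `lr_rival` are hypothetical choices and are not taken from Fig.9.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def map_competition(x, weights, means, variances):
    """Rank components by weight times likelihood (in logs), best first."""
    scores = [math.log(w) + gaussian_logpdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def rpcl_step(x, weights, means, variances, lr_winner=0.05, lr_rival=0.005):
    """One RPCL-style update of the means (needs at least two components).

    The winner moves toward the sample; the rival (second best) is
    pushed slightly away from it. Learning rates are assumed values.
    """
    order = map_competition(x, weights, means, variances)
    winner, rival = order[0], order[1]
    new_means = list(means)
    new_means[winner] += lr_winner * (x - means[winner])  # learning
    new_means[rival] -= lr_rival * (x - means[rival])     # de-learning
    return new_means, winner, rival
```

Setting `lr_rival=0` in this sketch reduces it to plain winner-take-all competitive learning on the means.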
Last but not least, an important feature that needs to be addressed is automatic model selection. In a BYY system, there is a subset
of parameters on which there exists an indicator
. Maximizing
, which includes maximizing
with
, will exert a force to push
, which means that its associated contribution to
should be discarded because it is redundant. Readers are referred to (Xu, 2004b&c, 2005, 2007a&b) for further details. During learning,
means that the corresponding
components
,
, and
are extra and thus discarded. As a result,
effectively reduces to
. As illustrated in Fig.6(c),
means its contribution to
is 0, and a number of such parameters becoming 0 result in that
has effectively no change on a range
. Also, if the variance of
tends to zero, the corresponding dimension
is extra and thus discarded. As illustrated beyond
in Fig.6(c), such a parameter becoming 0 contributes to
by
. As long as
is initialized at a big enough value, appropriate
are determined automatically during an iterative implementation of Stage I(a) in Fig.6(a). That is, model selection occurs during parameter learning, which shares and further improves the automatic model selection nature of Rival Penalized Competitive Learning (RPCL) (Xu, Krzyzak, & Oja, 1992&93).
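The discarding of extra components can be sketched as a pruning pass applied during learning. The sketch assumes a mixture model in which the indicator is the mixing weight of each component; the function name `prune_components` and the threshold `tol` are hypothetical, since no numerical threshold is specified here.

```python
def prune_components(weights, means, variances, tol=1e-3):
    """Drop components whose mixing weight has been pushed to (near) zero.

    The surviving weights are renormalized so that they sum to one.
    The threshold `tol` is an assumed value, not one given in the article.
    """
    keep = [j for j, w in enumerate(weights) if w > tol]
    if not keep:
        raise ValueError("all components fell below the threshold")
    total = sum(weights[j] for j in keep)
    return ([weights[j] / total for j in keep],
            [means[j] for j in keep],
            [variances[j] for j in keep])
```

Starting from a deliberately large number of components and calling such a pass between iterations leaves the number of surviving components as the selected model scale.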
Further improved
can be obtained by also considering Stage II of the two stage implementation in Fig.6(a), especially for a small sample size
. For the task of local factor analysis in Fig.8 and its extension to
local binary factor analysis (BFA) (b), we have
with
- (20)
where
and
.
A number of experiments have been conducted on the task of local factor analysis and have shown that BYY harmony learning considerably outperforms several typical learning methods and model selection criteria in both performance and computing time. Details are given on the subpage /experimental results, where one may also find experimental results on a number of other typical learning tasks.
BYY harmony learning (C): favorable features
Bayesian Ying Yang learning is featured by two fundamental differences from other existing learning approaches:
- Systematically considering two pathways and two domains coordinately as a BYY system on a probability theoretic ground and designing three system components under the principles of Least Redundancy, Divide-Conquer, and Uncertainty Conservation, respectively. This is different not only from cognitive science motivated adaptive resonance theory (Grossberg, 1976; Grossberg & Carpenter, 2002) and the least mean square error reconstruction based auto-association (Bourlard & Kamp, 1988) and LMSER self-organization (Xu, 1991 & 93), but also from those probability theoretic approaches that either merely consider a bottom-up pathway as the inverse of a top-down pathway (e.g., those for inverse problems summarized in Fig.2), or approximate such inverses for tackling intractable computations (e.g., Helmholtz machine, variational Bayes, and extensions).
- Mathematically implementing the Ying-Yang harmony philosophy, in the sense that Ying and Yang not only match each other but also seek a best match in a most compact way, by determining all unknowns in a BYY system via maximizing the functional by eq.(8). This functional can be regarded as a further extension of the traditional Minimum Cross-Entropy principle to the BYY system, with its previously known useless irregular aspect turned into a promising nature of least complexity, as discussed in previous sections.
These two fundamental differences give Bayesian Ying Yang learning the following favorable features.
First, the conventional model selection approaches aim at model complexity conveyed at either or both of the level of structure
and the level of parameter set
, as summarized in Fig.2 (c)&(d) and Columns (b)&(c) in Fig.3. This complexity is usually difficult to estimate, with some rough bounds provided by those approaches discussed after Fig.2. Bayesian Ying Yang learning considers not only the levels of structure
and parameter set
but also the level of short memory representation
shown in Fig.2 (a)&(b) by the front layer of BYY system in Fig.4. That is, the complexity or scale
of the BYY system is considered with the part
for representing
and the rest part contributed by
and
. The second part is estimated via
in eq.(11) and
in eq.(13), which is along a line similar to the above conventional approaches. However, the part
is modeled via
in
, which is estimated more accurately than the second part. Promisingly, the model selection problems of many typical learning tasks can be reformulated into selecting merely the
part in a BYY system (Xu, 2005). Therefore, the resulting BYY harmony criterion shown in Fig.6(b) may considerably improve on the performance of typical model selection approaches, as has been shown in experiments; details are given on the subpage /experimental results.
Second, and even more interestingly, the feature of considering
via
in
makes both parameter learning for
and model selection for
implemented simultaneously. That is, we get the important feature of automatic model selection, as illustrated in Fig.6(c). As addressed in the previous section, maximizing
will exert a force that pushes some indicator
if a subset
is extra, which means that its associated contribution to
can be discarded.
Remarks: at first glance, the scenario of Fig.2 (b) with
has also been considered in a number of typical learning approaches, especially those with an EM type two pathway implementation, such as the EM algorithm implemented ML learning (Redner & Walker, 1984), information geometry based em-algorithm (Amari, 1995), Helmholtz Machine (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002), variational approximation (Jordan, Ghahramani, Jaakkola, & Saul, 1999), the bits-back based MDL (Hinton & Zemel, 1994), etc. As will be explained in the next section, those studies have actually neither put
in a role for describing
nor sought the above feature of automatic model selection.
Third, the separated consideration of
from the rest of
also provides a general framework that integrates the roles of regularization and model selection, such that not only the automatic model selection mechanism on
can avoid the previously mentioned disturbance by a regularization with an inappropriate prior
, but also imprecise approximations caused by handling the integrals may be alleviated via regularization. Specifically, model selection is made via
in the Ying machine, while regularization is imposed in the Yang machine via either or both of designing the structures of
under an uncertainty conservation principle and applying data smoothing regularization with the help of
with
. This
takes a role similar to regularization strength. However, the difficulty of the conventional regularization approaches on controlling this strength has been avoided because an appropriate
is determined via
with the help of
by eq.(18).
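As a rough illustration of data smoothing, the sketch below assumes that the smoothed density is a Gaussian Parzen window estimate, i.e., each sample is replaced by a Gaussian bump whose standard deviation is the smoothing parameter. The function name `smoothed_density` is hypothetical, and the value of `h` is simply passed in by hand; the sketch does not implement the estimate by eq.(18).

```python
import math

def smoothed_density(x, samples, h):
    """Gaussian Parzen window estimate at x for 1-D samples.

    `h` plays the role of the smoothing parameter (kernel standard
    deviation); here it is supplied by the caller as an assumed value.
    """
    if h <= 0:
        raise ValueError("h must be positive")
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * h)
    kernels = (norm * math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
    return sum(kernels) / len(samples)
```

A larger `h` spreads each sample over a wider neighborhood and thus regularizes more strongly, while a very small `h` approaches the unsmoothed empirical distribution.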
Fourth, in addition to making the best harmony learning by maximizing
in eq.(8), an alternative has also been proposed in (Xu, 1995) under the name of Bayesian Kullback Ying Yang (BKYY) learning that performs a best Ying Yang matching by minimizing:
- (21)
That is, a best Ying Yang harmony includes not only a best Ying Yang matching as a part but also minimizing the entropy or the complexity of a Yang machine. In other words, it seeks a Yang machine that not only best matches the Ying machine but also keeps itself in a least complexity. As will be further addressed in the next subsection, the perspective of eq.(21) provides a bridge to revisit a number of typical statistical learning approaches but with new insights.
Fifth, considering a learning system in a Ying-Yang pair naturally motivates implementing the maximization of
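The relation between harmony and matching can be checked numerically on discrete distributions. The sketch assumes a discrete analogue in which the harmony measure is the sum of p ln q and the matching measure is the Kullback-Leibler divergence from p to q; under that assumption, harmony equals minus the divergence minus the entropy of p. The function names are hypothetical.

```python
import math

def harmony(p, q):
    """Discrete harmony measure: sum of p * ln q (assumed discrete analogue)."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence from p to q, used here as the matching measure."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]   # stands in for the Yang side
q = [0.5, 0.3, 0.2]   # stands in for the Ying side
# Identity: harmony(p, q) = -KL(p || q) - entropy(p)
gap = harmony(p, q) + kl_divergence(p, q) + entropy(p)
```

Here `gap` is zero up to rounding, so maximizing the harmony amounts to minimizing the divergence together with the entropy of `p`, which is the extra least complexity force that pure matching lacks.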
or the minimization of
by an alternating iteration of the following two steps:
- Yang step: fixing all the unknowns in the Ying machine, we update the rest of the unknowns in the Yang machine (after excluding those common unknowns shared by the Ying machine);
- Ying step: fixing those just updated unknowns in the Yang step, we update all the unknowns in the Ying machine.
Not only is this iteration guaranteed to converge, but it also includes the well known EM algorithm (see the subpage of /APPENDIX) and provides a general perspective for developing other EM-like algorithms.
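The EM special case of this alternation can be sketched for a one-dimensional Gaussian mixture, assuming that the Yang step amounts to computing the posterior of each component given each sample (the E-step) and the Ying step amounts to re-estimating the mixture parameters (the M-step). The function names, the variance floor, and the fixed iteration count are hypothetical choices for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def yang_step(data, weights, means, variances):
    """With the Ying side fixed, compute component posteriors per sample.

    Assumes at least one component gives each sample a nonzero density.
    """
    posteriors = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    return posteriors

def ying_step(data, posteriors, var_floor=1e-6):
    """With the posteriors fixed, re-estimate weights, means and variances."""
    n, k = len(data), len(posteriors[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nj = max(sum(r[j] for r in posteriors), 1e-12)  # guard empty component
        mean = sum(r[j] * x for r, x in zip(posteriors, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(posteriors, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, var_floor))
    return weights, means, variances

def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def alternate(data, weights, means, variances, iters=50):
    """Alternate the two steps and record the log-likelihood after each pass."""
    history = []
    for _ in range(iters):
        posteriors = yang_step(data, weights, means, variances)
        weights, means, variances = ying_step(data, posteriors)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history
```

In this special case the recorded log-likelihood does not decrease from one pass to the next, which is the familiar convergence behavior of EM.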
Relations to other approaches
In previous sections, not only has it been discussed after eqn.(10) that the BYY best harmony in its general expression by eqn.(10) provides a unified view on the Ying Yang harmony
by eqn.(8) at
in the Lebesgue measure and the Ying-Yang matching
by eqn.(9) at the special setting
, but it has also been discussed
after eq.(15) the relation of BYY harmony learning to Rival Penalized Competitive Learning (RPCL) (Xu, Krzyzak, & Oja, 1992&93) from an updating flow view and the difference of the Ying Yang harmony
by eqn.(8) from the maximum likelihood (ML). In the sequel, we further discuss systematically
relations of Bayesian Ying Yang learning to other typical approaches from three typical perspectives, as sketched in Fig.10 and summarized in Fig.11.
We start at the left bottom corner of Fig.10. One classic perspective considers using either a parametric model
directly or a top-down model
(called a latent or generative model) for a best matching or modeling of observation data. When
, it follows from the second row of Type B table and Type C table in Fig.11 that the Ying-Yang matching
becomes equivalent to ML learning and also other likelihood based studies under the names of marginal likelihood, evidence, Bayesian information criterion, and Bayesian approach (Schwarz, 1978; Hinton & Zemel, 1994; Neath & Cavanaugh, 1997; MacKay, 1992, 2003); while it follows from the third row of Type B table and Type C table in Fig.11 that the Ying Yang harmony
becomes equivalent to MAP estimate and those
featured studies under the name of Bayesian learning (Press, 1989) or Minimum Message Length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999). Also, these Bayesian based studies will degenerate into the likelihood based studies if
is ignored, as illustrated at the right bottom corner of Fig.10.
Moreover, the Akaike information criterion (AIC) for model selection can be obtained not only from ML learning (i.e., Ying-Yang matching at the second row of Type B table) in its standard derivation (Akaike, 1974; Bozdogan, 1987; Cavanaugh, 1997), but also from the Ying-Yang harmony at the first row of Type C table in Fig.11 as a special case of several outcomes of a derivation with the help of designing
by following the uncertainty conservation principle by eq.(5) with the Hessian matrix measure and using eq.(12) to approximately remove the integral over
(Xu, 2008a&b).
We move to the left middle of Fig.10. The second perspective considers a bottom-up pathway (called transformation / recognition / representative) for a best encoding
. When
, the Ying-Yang matching
at the 4th row of Type B table in Fig.11 seeks a best matching between the inner representation
by the Yang machine and its counterpart
by the Ying machine, which further leads to the upper center in Fig.10, including the minimum mutual information (MMI) and the maximum information transfer (INFOMAX) for independent component and independent subspace analyses (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995; Xu, 2008c) at the special case that
consists of independent components and that
degenerates into a deterministic linear or post-linear function of
. Further studies may be explored beyond this special case. Interestingly, the Ying-Yang harmony
at the 4th row of Type B table in Fig.11 not only seeks such an inner best matching but also consists of maximizing
that pushes
to approach a deterministic one, and even to reach it if
is free of any constraint.
Putting the above two perspectives in a joint consideration, we further come to the center of Fig.10 for the third perspective, i.e., those studies on a probability theoretic based two pathway. The Ying Yang harmony
leads to Bayesian Ying-Yang harmony learning, introduced in the previous sections. Moreover, extensive efforts have also been made during the same period under the name of the Helmholtz free energy based learning or Helmholtz machine (Hinton & Zemel, 1994; Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002) and further under the name of variational approximation (Ghahramani & Beal, 2000; Jaakkola, 2001; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Corduneanu & Bishop, 2001). When
, the Ying-Yang matching
at the 1st row of Type B table in Fig.11 becomes equivalent to Helmholtz machine if
is a parametric structure given by a forward network as in (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002); while the Ying-Yang matching
at the 1st row of Type C table in Fig.11 leads to variational Bayes approach if
is given a parametric structure as in (Ghahramani & Beal, 2000). Moreover, these variational approximation approaches in general correspond to the best Ying-Yang matching
at the 1st row of both Type B and Type C tables in Fig.11, with
and
designed in given parametric structures that have unknown parameters to be determined, as those called bi-directional BYY architectures in (Xu, 1995, 2004a&b&c).
As sketched within the left ellipse in Fig.10, the Ying-Yang matching covers not only the studies of the likelihood based types, the AIC types, and the variational approximation based types, but also the studies of seeking a best matching between the Ying Yang inner representations, as well as variations and extensions with regularizations by
and
by eq.(18). However, it follows from
at the top of the table in Fig.11 that the Ying-Yang matching differs from
the Ying-Yang harmony in minimizing
as well, which cancels out the least complexity nature because the chance of maximizing
has been removed. Therefore, the Ying-Yang matching does not have the first three favorable features of the best Ying-Yang harmony, as elaborated in the previous section.
For a further insight, we can also consider the best Ying Yang harmony from an optimal communication perspective, in other words, searching for a minimum code length to communicate data, i.e., a formal information theory restatement of Occam's Razor. The first study along this direction is the Minimum message length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999), which is subsequently followed by the minimum description length (MDL) (Rissanen, 1986, 2007). As illustrated in Fig.12, the key idea is to encode a set of samples in two parts. Assuming that a parametric form of
is already known by the receiver, the model information is encoded via encoding
, which needs
bits. The second part
consists of the bits to encode the residuals that cannot be described by the model. Conceptually, MDL differs in that it is based on the Kraft-McMillan inequality and turns the problem of searching for an efficient code into searching for a good distribution
. However, a practical implementation of MDL usually degenerates into one that is the same as the Bayesian information criterion (BIC) (Schwarz, 1978; Neath & Cavanaugh, 1997), which can still be regarded as an example of the above two part coding (MacKay, 1992, 2003). Unfortunately, the bits
are difficult to estimate since we usually have no a priori knowledge about
. Instead, a rough bound is considered, with deteriorated performances. In fact, other typical criteria share a similar difficulty that is either directly or equivalently related to getting
. Interestingly, the best Ying Yang harmony is equivalent to minimizing the total bits of three parts of encoding, as shown in Fig.12. The difficult part
is now separated into two parts
. One is getting
that consists of bits for encoding
via
, which is not only no longer so difficult to describe but also carries information about
per sample
. The other part is getting a new
for the bits of the rest part of model scale
, which is still estimated in a way similar to that for MDL and those typical model selection criteria. This provides another interpretation on the first two favorable features elaborated in the previous section. Though a three part coding has also been considered under the name of bits-back based MDL (Hinton & Zemel, 1994), what it considers is
, where
is actually equivalent to
at the bottom of Type B table in Fig.11 with
. In other words, the bits-back based MDL is just another interpretation of the above discussed Helmholtz machine and variational approximation.
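The conventional two part coding can be sketched on a toy family of one-dimensional Gaussian models. The sketch assumes the BIC-style cost of half the number of free parameters times the log of the sample size for the model part, and the negative log-likelihood for the residual part, both in nats and up to a constant for discretizing continuous data. The function names and the three candidate models are hypothetical.

```python
import math

def gaussian_nll(data, mean, var):
    """Negative log-likelihood (nats) of the data under one Gaussian."""
    n = len(data)
    sq = sum((x - mean) ** 2 for x in data)
    return 0.5 * n * math.log(2.0 * math.pi * var) + 0.5 * sq / var

def two_part_code_length(data, fit_mean, fit_var):
    """Model part plus residual part, in nats.

    Parameters that are not fitted stay at the defaults mean 0, variance 1,
    which the receiver is assumed to know already.
    """
    n = len(data)
    mean = sum(data) / n if fit_mean else 0.0
    var = sum((x - mean) ** 2 for x in data) / n if fit_var else 1.0
    var = max(var, 1e-12)                       # guard against zero variance
    num_params = int(fit_mean) + int(fit_var)
    model_part = 0.5 * num_params * math.log(n)  # BIC-style rough bound
    residual_part = gaussian_nll(data, mean, var)
    return model_part + residual_part

def select_model(data):
    """Pick the candidate with the shortest total code length."""
    candidates = {"fixed": (False, False),
                  "mean": (True, False),
                  "mean+var": (True, True)}
    lengths = {name: two_part_code_length(data, *flags)
               for name, flags in candidates.items()}
    return min(lengths, key=lengths.get), lengths
```

The model part here depends only on the number of free parameters, which reflects the rough bound discussed above rather than an actual estimate of the bits needed to encode the parameters.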
There are also two other streams of studies on tackling the over-fitting challenge. One is on the cross validation (CV) approach (Stone, 1978; Rivals & Personnaz, 1999). It may be worth investigating whether the CV approach can be applied to further improve the BYY learning on a small size of samples. The other stream is on the VC dimension based generalization error bound (Vapnik, 1995), based on which the support vector machine is developed and has become a popular topic in the machine learning literature. Little is known theoretically on the relation of the BYY learning to the VC dimension based generalization error bound. There has been only a preliminary discussion on a link between SVM and a BYY learning based kernel density estimation (Xu, 2001a). Further effort along this direction may also be worthwhile.
References
- Akaike, H (1974), "A new look at the statistical model identification", IEEE Tr. Automatic Control 19: 714-723.
- Amari, S (1995), "Information geometry of the EM and em algorithms for neural networks", Neural Networks 8(9): 1379-1408.
- Amari, S, Cichocki, A, & Yang, H (1996), "A new learning algorithm for blind separation of sources", In Touretzky, Mozer, & Hasselmo (Eds.), Advances in Neural Information Processing System 8, MIT Press, 757-763.
- Bell, A & Sejnowski, T (1995), "An information-maximization approach to blind separation and blind deconvolution", Neural Computation 7, 1129-1159.
- Bourlard, H & Kamp, Y (1988), "Auto-association by multilayer Perceptron and singular value decomposition", Biological Cybernetics 59: 291-294.
- Bozdogan, H & Ramirez, DE (1988), "FACAIC: Model selection algorithm for the orthogonal factor model using AIC and FACAIC", Psychometrika 53 (3): 407-415.
- Cavanaugh, JE (1997), "Unifying the derivations for the Akaike and corrected Akaike information criteria", Statistics & Probability Letters 33: 201-208.
- Dayan, P (2002), "Helmholtz machines and sleep-wake learning", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, pp522-525.
- Ghahramani, Z & Beal, MJ (2000), "Variational inference for Bayesian mixtures of factor analyzers", in Solla, Leen, & Muller (eds), Advances in Neural Information Processing Systems 12, 449-455.
- Grossberg, S (1976), "Adaptive pattern classification and universal recording: I &II", Biological Cybernetics 23: 121-134 and 23: 187-202.
- Grossberg, S & Carpenter, GA (2002), "Adaptive Resonance Theory", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, 87-90.
- Hinton, GE, Dayan, P, Frey, BJ, & Neal, RM (1995), "The wake-sleep algorithm for unsupervised neural networks", Science, 268:1158-1160.
- Hinton, GE & Zemel, RS (1994), "Autoencoders, minimum description length and Helmholtz free energy", in Cowan, Tesauro, & Alspector (eds), Advances in Neural Information Processing Systems, 6, 3-10.
- Jaakkola, TS (2001), "Tutorial on variational approximation methods", in Opper & Saad (eds), Advanced Mean Field Methods: Theory and Practice, MIT press, 129-160.
- Jordan, MI, Ghahramani, Z, Jaakkola, TS & Saul, LK (1999), "An Introduction to Variational Methods for Graphical Models", Machine Learning 37(2): 183-233.
- Jacobs, RA., et al (1991), "Adaptive mixtures of local experts", Neural Computation 3: 79-87.
- Jordan, MI & Xu, L (1995), "Convergence results for the EM approach to mixtures of experts", Neural Networks 8: 1409-1431.
- MacKay, D (1992), "A practical Bayesian framework for backpropagation", Neural Computation 4, 448-472.
- MacKay, D (2003), Information Theory, Inference, and Learning Algorithms, Cambridge University Press.
- Poggio, T & Girosi, F (1990), "Networks for approximation and learning", Proc. of IEEE 78: 1481-1497.
- Redner, RA & Walker, HF (1984), "Mixture densities, maximum likelihood, and the EM algorithm", SIAM Review 26: 195-239.
- Rissanen, J (1986), "Stochastic complexity and modeling", Annals of Statistics 14 (3): 1080-1100.
- Rissanen, J (2007), Information and Complexity in Statistical Modeling, Springer.
- Rivals, I & Personnaz, L (1999), "On Cross Validation for Model Selection", Neural Computation 11: 863-870.
- Schwarz, G (1978), "Estimating the dimension of a model", Annals of Statistics 6: 461-464.
- Stone, M (1978), "Cross-validation: A review", Math. Operat. Statist. 9: 127-140.
- Tikhonov, AN & Arsenin, VY (1977), Solutions of Ill-posed Problems, Winston and Sons.
- Vapnik, VN (1995), The Nature of Statistical Learning Theory, Springer.
- Wallace, CS & Boulton, DM (1968), "An information measure for classification", Computer Journal 11, 185-194.
- Wallace, CS & Dowe, DL (1999), "Minimum message length and Kolmogorov complexity", Computer Journal 42 (4): 270-280.
- Xu, L (2008a), "Bayesian Ying Yang System, Best Harmony Learning, and Gaussian Manifold Based Family", In Zurada et al (eds.), Computational Intelligence: Research Frontiers, WCCI2008 Plenary/Invited Lectures, LNCS5050, 48–78.
- Xu, L (2008b), "Machine learning problems from optimization perspective", A special issue for CDGO 07, Journal of Global Optimization, in press, DOI 10.1007/s10898-008-9364-0.
- Xu, L (2008c), "Independent Subspaces", in Ramón, Dopico, Dorado & Pazos (Eds.), Encyclopedia of Artificial Intelligence, IGI Global (IGI) publishing company, 903-912.
- Xu, L (2007), "A Unified Perspective and New Results on RHT Computing, Mixture Based Learning, and Multi-learner Based Problem Solving", Pattern Recognition, 40, 2129-2153.
- Xu, L & Jordan, MI (1996), "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation 8(1): 129-151.
- Xu, L (2005), "Fundamentals, Challenges, and Advances of Statistical Learning for Knowledge Discovery and Problem Solving: A BYY Harmony Perspective", Keynote talk, Proc. of Intl. Conf. on Neural Networks and Brain, Oct. 13-15, 2005, Beijing, China, Vol. 1, 24-55.
- Xu, L (2004a), "Temporal BYY Encoding, Markovian State Spaces, and Space Dimension Determination", IEEE Trans on Neural Networks 15( 5): 1276-1295.
- Xu, L (2004b), "Advances on BYY harmony learning: information theoretic perspective, generalized projection geometry, and independent factor auto-determination", IEEE Trans on Neural Networks 15(4): 885-902.
- Xu, L (2004c), "Bayesian Ying Yang Learning", Intelligent Technologies for Information Analysis, N. Zhong and J. Liu (eds), Springer, 615-706.
- Xu, L (2003), "Data smoothing regularization, multi-sets-learning, and problem solving strategies", Neural Networks, Vol. 15, No.5-6, 817-825.
- Xu, L (2002), "Bayesian Ying Yang Harmony Learning", The Handbook of Brain Theory and Neural Networks, Second edition, (MA Arbib, Ed.), Cambridge, MA: The MIT Press, pp1231-1237.
- Xu, L (2001a), "Best Harmony, Unified RPCL and Automated Model Selection for Unsupervised and Supervised Learning on Gaussian Mixtures, Three-Layer Nets and ME-RBF-SVM Models", Intl J of Neural Systems 11(1):43-69.
- Xu, L (2001b), "BYY harmony learning, independent state space and generalized APT financial analyses", IEEE Trans. Neural Networks 12(4):822-849.
- Xu, L (1998), "RBF nets, mixture experts, and Bayesian Ying-Yang learning", Neurocomputing 19(1-3): 223-257.
- Xu, L (1997), "Bayesian Ying Yang system and theory as a unified statistical learning approach (II): from unsupervised learning to supervised learning and temporal modeling", Theoretical Aspects of Neural Computation : A Multidisciplinary Perspective (TANC97) , in Wong, KM, et al, eds, Springer, pp25-42.
- Xu, L (1995), "Bayesian-Kullback Coupled YING-YANG Machines: Unified Learnings and New Results on Vector Quantization", Proc. Intl. Conf. on Neural Information Processing (ICONIP95), Beijing, Oct 30-Nov.3, 1995, pp.977-988.
- Xu, L, Jordan, MI, & Hinton, GE (1995), "An Alternative Model for Mixtures of Experts", in Cowan, Tesauro, and Alspector, eds., Advances in Neural Information Processing Systems 7, MIT Press, 633-640.
- Xu, L, Krzyzak, A & Oja, E (1992&93), "Rival Penalized Competitive Learning for Clustering Analysis, RBF net and Curve Detection", IEEE Trans on Neural Networks 4: 636-649, An early version on Proc. 1992 IJCNN, Nov.3-6, 1992, Beijing, 665-670.
- Xu, L (1991&93) "Least mean square error reconstruction for self-organizing neural-nets", Neural Networks 6: 627-648, 1993. Its early version on Proc. IJCNN91'Singapore, 2363-2373, 1991.
Internal links
- The subpage of /APPENDIX
- The subpage of /Experimental Results
- Rival Penalized Competitive Learning
See also
- Rival Penalized Competitive Learning
- Model Selection
- Regularization
- Bayesian Learning
- Bayesian Inference
- Adaptive Resonance Theory
- Helmholtz Machine
- Minimum Description Length
- Independent State Spaces
| Lei Xu (2007) Bayesian Ying Yang learning. Scholarpedia, 2(3):1809, (go to the first approved version) Created: 31 July 2006, reviewed: 15 February 2007, accepted: 22 March 2007 |
| Action editor: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |
denoting that
precedes
, i.e.,
is a part (or called a substructure) of
. If
.
is far less than 1, and
except
at
.

