Algorithmic probability
From Scholarpedia
| Marcus Hutter et al. (2007), Scholarpedia, 2(8):2572. | doi:10.4249/scholarpedia.2572 | revision #73369 [link to/cite this article] | |||||||||||||||||||
| Hosting and maintenance of this article is sponsored by Brain Corporation. |
Curator: Dr. Marcus Hutter, Australian National University, Canbera, Australia
Curator: Dr. Shane Legg, Dalle Molle Institute for Artificial Intelligence (IDSIA)
Curator: Dr. Paul M.B. Vitanyi, CWI and Computer Science, University of Amsterdam, The Netherlands
Algorithmic "Solomonoff" Probability (AP) assigns to objects an a priori probability that is in some sense universal. This prior distribution has theoretical applications in a number of areas, including inductive inference theory and the time complexity analysis of algorithms. Its main drawback is that it is not computable and thus can only be approximated in practice.
Contents |
Bayes, Occam and Epicurus
In an inductive inference problem there is some observed data
and a set of hypotheses
, one of which may be the true hypothesis generating
.
The task is to decide which hypothesis, or hypotheses, are the most likely to be
responsible for the observations.
An elegant solution to this problem is Bayes' rule,
To compute the relative probabilities of different hypotheses
can be
dropped as it is an independent constant. Furthermore, for a given
hypothesis
it is often straight forward to compute
.
A conceptually more difficult problem is to assign the prior probability
. Indeed, how can one assign a probability to a hypothesis
before observing any data?
A theoretically sound solution to this problem was not known until R.J. Solomonoff
founded algorithmic probability theory in the 1960s.
Algorithmic probability rests upon two philosophical
principles. The first is Epicurus' (342? B.C. - 270 B.C.) principle
of multiple explanations which states that one should
keep all hypotheses that are consistent with the data.
From Bayes' rule it is clear that in order to keep consistent hypotheses
what is needed is
.
The second philosophical foundation is the principle of Occam's razor (1285 - 1349, sometimes spelt Ockham). Occam's razor states that when inferring causes entities should not be multiplied beyond necessity. This is widely understood to mean: Among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability, and more complex ones lower probability.
Using Turing's model of universal computation, Solomonoff (1964) produced a universal prior distribution that unifies these two principles. This work was later generalized and extended by a number of researchers, in particular L. A. Levin who formalized the initial intuitive approach in a rigorous mathematical setting and supplied the basic theorems of the field. There now exist a number of strongly related universal prior distributions, however this article will describe only the discrete universal a priori probability, and its continuous counterpart. See Chapters 4 and 5 of Li and Vitányi (1997) for a general survey.
Discrete Universal A Priori Probability
Consider an unknown process producing a binary string of one hundred 1s. The probability that such
a string is the result of a uniform random process, such as fair coin
flips, is just
, like that of any other string of the same length. Intuitively, we feel that there is a difference between a string that we can recognize and distinguish, and the vast majority of strings that are indistinguishable to us. In fact, we feel that a far more
likely explanation is that some deterministic algorithmic simple process has
generated this string. This observation was already made by P.-S. Laplace in about 1800,
but the techniques were lacking for that great mathematician to quantify this insight.
Presently we do this using the lucky confluence of combinatorics, information theory and theory of algorithms.
It is clear that, in a world with computable processes,
patterns which result from simple processes are relatively likely,
while patterns that can only be produced by very complex processes are
relatively unlikely.
Formally, a computable process that produces a string
is a
program
that when executed on a universal Turing machine
produces the string
as output.
As
is itself a binary string, we can define
the discrete universal a priori probability,
, to be the
probability that the output of a universal prefix Turing machine
is
when provided with fair coin flips on
the input tape. Formally,
where the sum is over all halting programs
for which
outputs the string
. As
is
a prefix universal Turing machine the set of valid programs forms a
prefix-free set and thus the sum is bounded due to Kraft's inequality.
In fact, the boundedness of the sum
should be an obvious consequence from realizing that the above expression is
indeed a probability, which hinges on the fact that a prefix machine is one
with only {0,1} input symbols and no end-of-input marker of any kind.
It is easy to see that this distribution is strongly related to Kolmogorov complexity in that
is at least the maximum term in the summation,
which is
.
The Coding Theorem of L.A. Levin (1974) states that equality also
holds in the other direction in the sense that
,
or equivalently,
As every string can be computed by at least one program on
, it follows
that
assigns a non-zero probability to all
computable hypotheses and thus this distribution respects Epicurus' principle.
Furthermore, if we measure the complexity of
by its
Kolmogorov complexity, we see that the simplest strings have the
highest probability under
, while complex ones have a
correspondingly low probability. In this way
also formalises the principle of Occam's razor.
Unfortunately, it is impossible in general to know whether any given
program
will halt when run on
, due to
the halting problem. This means that the sum can not be computed, it
can only be approximated from below. Stated technically, the function
is only lower semi-computable. Furthermore,
is not a proper probability measure but rather a semi-measure as
. This can be corrected by normalising,
however this introduces other theoretical weaknesses and so will not
be discussed here.
The universal a priori probability
has many
remarkable properties.
A theorem of L.A. Levin states that
among the lower semi-computable semi-measures it is
the largest such function in the following sense: If
is a lower
semi-computable semi-measure, then there is a
constant
such that
for all
. In fact, we can take
with
the prefix Kolmogorov complexity of the distribution
(defined as the least complexity of a prefix Turing machine lower semi-computing the distribution).
This justifies calling
a universal distribution.
Thus, the same mathematical object
arises in three different ways:
- The family of lower semi-computable semi-measures contains an element that multiplicatively dominates every other element: a `universal' lower semi-computable semi-measure
;
- The probability mass function defined as the probability that the universal prefix machine
outputs
when the input is provided by fair coin flips, is the `a priori' probability
; and
- The probability
has
as its associated Shannon-Fano code (
).
When three very different definitions (or properties) turn out to define the same object, then this is usually taken in Mathematics to mean that this object is objectively important.
Continuous Universal A Priori Probability
The universal distribution
is defined in a discrete
domain, its arguments are finite binary strings. For applications such
as the prediction of growing sequences it is
necessary to define a similar distribution on infinite binary
sequences. This leads to the universal semi-measure
defined as the probability that the output of a monotone universal
Turing machine
starts with
when provided
with fair coin flips on the input tape. Formally,
where the sum is over all minimal programs
for which
's output
starts with
(denoted by *).
Like
, it can be shown that
is
only a semi-computable semi-measure. Due to a theorem of
L.A. Levin it is the largest such function: If
is a
lower semi-computable semi-measure, then there is a constant
such that
for all
. The precise definition of these notions is similar
but more complicated than in the discrete setting above and thus
we refer the reader to the literature (see for example chapter 4 of
Li and Vitanyi 1997).
The semi-measure
has similar remarkable properties to
, for example,
,
where
is the monotone complexity.
Also
divided by the uniform measure
is the maximal semicomputable supermartingale.
Additionally, if the infinite
binary sequences are distributed according to a computable measure
, then the predictive distribution
converges rapidly to
with
-probability 1. Hence,
predicts
almost as well as does the true distribution
.
Applications
Algorithmic probability has a number of important theoretical applications; some of them are listed below. For a more complete survey see Li and Vitanyi (1997).
Solomonoff Induction
Solomonoff (1964) developed a quantitative formal theory of induction based on universal Turing machines (to compute, quantify and assign codes to all quantities of interest) and algorithmic complexity (to define what simplicity/complexity means). The important Prediction Error Theorem is due to Solomonoff (1978): Let
be the expected squared error made in the
-th prediction
of the next bit of an infinite binary sequence
in the ensemble of infinite binary sequences
distributed according to computable measure
when
predicting according to the fixed semi-measure
. Then,
that is, the total summed expected squared prediction error is bounded
by a constant, and therefore, if smooth enough, typically decreases
faster than
. In particular it implies
with
-probability 1, stated above.
This is remarkable as it means that
Solomonoff's inductive inference system will learn to correctly predict
any computable sequence with only the absolute minimum amount of data.
It would thus, in some sense, be the perfect universal prediction algorithm,
if only it were computable.
AIXI and a Universal Definition of Intelligence
It can also be shown that
leads to excellent
predictions and decisions in more general stochastic environments.
Essentially AIXI is a generalisation of Solomonoff induction
to the reinforcement learning setting, that is, where the agent's
actions can influence the state of the environment.
AIXI can be proven to be, in some sense, an optimal reinforcement
learning agent embedded in an arbitrary unknown environment (Hutter 2004).
Like Solomonoff induction, the AIXI agent is not computable and thus can
only be approximated in practice.
Instead of defining an optimal general agent, it is possible to turn things around and define a universal performance measure for agents acting in computable environments, known as universal intelligence.
Expected Time/Space Complexity of Algorithms under the Universal Distribution
For every algorithm whatsoever, the
-average-case complexity is always of the same order of magnitude as the worst-case complexity for computational resources such as time and storage space. This result of Li and Vitanyi means that, if the inputs are distributed according to the universal distribution, rather than according to the uniform distribution as is customary, then the expected case is as bad as the worst case. For example, the sorting procedure Quicksort has worst-case running time
, but expected running time
on a list of n keys for permutations drawn at random from uniformly distributed permutations of
keys. But if the permutations are distributed according to the universal distribution, that is, according to Occam's dictum that the simple permutations have the highest probabilities, as is often the case in practice, then the expected running time rises to the worst-case of
. Experiments appear to confirm this theoretical prediction.
PAC Learning Using the Universal Distribution in the Learning Phase
Using
as the sampling distribution in the learning phase in
PAC-learning properly increases the learning power for the family
of discrete concept classes under computable distributions, and
similarly for PAC learning of continuous concepts over computable
measures by using
. This model of Li and Vitanyi is akin to using a
teacher in the learning phase who gives the simple examples first.
See Applications of
AIT.
Halting Probability
A formally related quantity is the probability that
halts when provided with fair coin flips on the input tape (i.e., that
a random computer program will eventually halt). This halting
probability
is also known as Chaitin's constant
or "the number of wisdom", and has numerous remarkable
mathematical properties. For example, it can be used to quantify
Goedel's Incompleteness Theorem.
References
M. Hutter. Universal artificial intelligence: Sequential Decisions based on algorithmic probability. Springer, Berlin, 2004. http://www.springer.com/chl/home/generic/search/results?SGWID=2-40109-22-34425910-0
M. Li and P. M B. Vitanyi. An introduction to Kolmogorov complexity and its applications. Springer, New York, 2nd edition 1997, and 3rd edition 2008. http://www.springer.com/chl/home/generic/search/results?SGWID=2-40109-22-165245230-0
L.A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems Information Transmission, 10(3):206-210, 1974.
R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1--22 and 224--254, 1964.
R. J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, IT-24:422-432, 1987.
A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83--124, 1970.
Internal references
- Marcus Hutter (2007) Algorithmic information theory. Scholarpedia, 2(3):2519.
- Rodney G. Downey and Jan Reimann (2007) Algorithmic randomness. Scholarpedia, 2(10):2574.
- Ming Li and Paul M.B. Vitanyi (2007) Applications of algorithmic information theory. Scholarpedia, 2(5):2658.
- Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
External Links
- Homepages of
See Also
Algorithmic information theory, Algorithmic Complexity, Algorithmic Randomness
| Marcus Hutter, Shane Legg, Paul M.B. Vitanyi (2007) Algorithmic probability. Scholarpedia, 2(8):2572, (go to the first approved version) Created: 4 December 2006, reviewed: 23 August 2007, accepted: 23 August 2007 |
| Invited by: | Dr. Eugene M. Izhikevich, Editor-in-Chief of Scholarpedia, the peer-reviewed open-access encyclopedia |
| Action editor: | Dr. Marcus Hutter, Australian National University, Canbera, Australia |




