Applications of algorithmic information theory
| Ming Li and Paul M.B. Vitanyi (2007), Scholarpedia, 2(5):2658. | doi:10.4249/scholarpedia.2658 | revision #145413 [link to/cite this article] |
Algorithmic Information Theory, more frequently called Kolmogorov complexity, has a wide range of applications, many of them described in detail by Li and Vitanyi (1997/2007). The range of applications is so vast that every such selection cannot be comprehensive or even cover all important items. The applications can use the property that:
- most strings are effectively incompressible, or
- some strings can be effectively compressed, or
- some strings can be effectively compressed but take a lot of effort to do so, or
- different strings can be effectively compressed in various degrees, singly, jointly, or conditionally on one another, and that approximating the Kolmogorov complexity with real-world compressors still gives acceptable, or even very good, results.
Contents |
Application of Incompressibility
A new mathematical proof technique was developed, now known as the incompressibility method, a basic general technique such as the "pigeon hole" argument, "the counting method" or the "probabilistic method". It is based on the fact that most strings (the so-called Martin-Loef random strings) cannot be effectively compressed.
List of selected application areas
The incompressibility method has been applied in, among others,
- recursion theory to analyse notions of randomness, this line of research is compiled in Downey and Hirschfeldt (2008),
- combinatorics, combinatorial geometry,
- statistics,
- graph theory, expanders,
- Kolmogorov 0-1 Laws, probability theory,
- theory of parallel and distributed computation,
- time and space complexity of computations, language recognition, string matching, Turing machine complexity, complexity <review> The automatic linking refers to a wrong article, i think, and this is a general problem in Scholarpedia design.</review> of tapes, stacks, queues, solving many well-researched open problems of several decades standing,
- sorting algorithms,
- communication complexity,
- routing in computer networks,
- circuit theory,
- formal language and automata theory, parallel computation, lower bound proof techniques.
- physics: thermodynamics; explanation of Maxwell's Demon paradox, quantum information theory.
The method is always a matter of using regularity in an object, or algorithm, imposed by a property under investigation and quantified in an assumption to be contradicted, to compress the object's or algorithm's description to below its minimal value. <review>
I think the last sentence is not easy to read and understand, mostly because it tries to put too much contents in few words. Since it is a method, not a formal statement or definition, I'd suggest to write here something rather informal, avoiding the "encyclopedic" style. (And I would prefer short and simple enough sentences, e.g. "The matter of the incompressibility method is like this. We consider a property imposing some regularity in objects (strings, numbers, algorithms, etc.). We use this regularity to construct a description of a certain form. Then we take a random object. We compare the length of our description with object's complexity (or, actually, with a lower bound for it). And we get a contradiction or some bound for parameters.")
I like the introduction to Ch.6 in the LV book. Maybe, you can reproduce here part of it, with some modifications? The incompressibility method is the oldest (is it true? a few historical comments are also worth adding) and the most used application of algorithmic complexity, and it makes sense to spend more words and explain the general scheme of the method in detail. Examples (even with the two sentences at the end of the Goedel theorem example) are not a sufficient source for a reasonably interested novice, I'm afraid. If you later write a separate article on incompressibility method, you can move some material there. </review>
A simple example from number theory
In the nineteenth century, Chebychev showed that the number of primes less than \(n\) grows asymptotically like \(n/\log n\). Using the incompressibility method we cannot (yet) prove this statement precisely, but we can come remarkably close with a minimal amount of effort. We first prove, following G.J. Chaitin, that for infinitely many \(n\), the number of primes less than or equal to \(n\) is at least \(\log n/ \log \log n\). The proof method is as follows. For each \(n\), we construct a description from which \(n\) can be effectively retrieved. This description will involve the primes less than \(n\). For some \(n\) this description must be long, which shall give the desired result.
Assume that \(p_1 , p_2 , \ldots , p_m\) is the list of all the primes less than \(n\). Then, \[ n = p_1^{{e}_1} p_2^{{e}_2} \cdots p_m^{{e}_m} \] can be reconstructed from the vector of the exponents. Each exponent is at most \(\log n\) and can be represented by \(\log \log n\) bits. The description of \(n\) (given \(\log n\)) can be given in \(m \log \log n\) bits.
It can be shown that each \(n\) that is random (given \(\log n\)) cannot be described in fewer than \(\log n\) bits, whence the result follows. Can we do better? This is slightly more complicated. The original idea is due to P. Berman, and improved by J.T. Tromp. Let \(l(x)\) denote the length of the binary representation of \(x\). We shall show that for infinitely many \(n\) of the form \(n=m \log^2 m\), the number of distinct primes less than \(n\) is at least \(m\).
Firstly, we can describe any given integer \(n\) by \(E(m) n/p_m\), where \(E(m)\) is a prefix-free encoding of \(m\), and \(p_m\) is the largest prime dividing \(n\). For random \(n\), the length of this description, \(l(E(m)) + \log n - \log p_m\), must exceed \(\log n\). Therefore, \(\log p_m < l(E(m))\). It is known (and straightforward from the earlier discussion) that we can set \(l(E(m)) \leq \log m + 2 \log \log m\). Hence, <review> The argument will be easier to follow if you formulate what exactly is proven. "If p_m is any prime dividing a random number, then "</review> \(p_m < m \log^2 m\). Setting \(n := m \log^2 m\), and observing from our previous result that \(p_m\) must grow with \(n\) <review> "there are infinitely many primes dividing random numbers" is more precise and clear. Btw, the last n is not that n=mlog^2m. </review>, we have proven our claim. The claim is equivalent to the statement that for our special sequence of values of \(n\) the number of primes less than \(n\) exceeds \(n/ \log^2 n\).
Example: Goedel's incompleteness result
<review> Here should be an introductory note that Goedel proved that in any consistent powerful enough theory, there are true, but unprovable statements (or a link at least).</review>
A formal system (consisting of definitions, axioms, rules of inference) is consistent if no statement that can be expressed in the system can be proved to be both true and false in the system. A formal system is sound if only true statements can be proved to be true in the system. (Hence, a sound formal system is consistent.) The idea below goes back to Ya. Barzdins and was popularized by G.J. Chaitin.
Let \(x\) be a finite binary string. We write `\(x\) is random' if the shortest binary description of \(x\) has length at least that of the literal description of \(x\) (in other words, the Kolmogorov complexity of \(x\) is not less than its length). A simple counting argument shows that there are random \(x\)'s of each length.
Fix any sound formal system \(F\) in which we can express statements like `\(x\) is random.' We claim that for all but finitely many random strings \(x\), the sentence `\(x\) is random' is not provable in \(F\). Assume the contrary. Suppose \(F\) can be described in \(f\) bits---assume, for example, that this is the number of bits used in the exhaustive description of \(F\) in the first chapter of the textbook `Foundations of \(F\)'. Then given \(F\), we can start to exhaustively search for a proof that some string of length <review> Formally, we have to search for x of length greater than n (and for arbitrary n to cope with constants). I'm not sure whether this should be explained in this popular article, but the current form may confuse a meticulous reader.</review> \(n \gg f \) is random, and print it when we find such a string. This is an \(x\) satisfying the `\(x\) is random' sentence. This procedure to print \(x\) of length \(n\) uses only \( \log n + f\) bits of data, which is much less than \(n\). But \(x\) is random by the proof and the fact that \(F\) is sound. Hence, \(F\) is not consistent, <review> I do not understand why consistency is mentioned here. It is not shown that F proves 'x is random' and F proves 'x is not random' or anything like that. And we do not need it, since we already have a contradiction. </review> which is a contradiction.
This shows that although most strings are random, it is impossible to effectively prove them random. In a way, this explains why the incompressibility method is so successful. We can argue about a `typical' individual element, which is difficult or impossible by other methods.
Average-case complexity of Algorithms
The incompressibility method has been very successful to analyze the difficult average case complexity of algorithms. For example, the average case time complexity of an algorithm is the average running time over all inputs of binary length \(n\), for each \(n\). This involves analyzing the running time of all binary strings of length \(n\), for each \(n\). With the incompressibility method, we only need to analyze the running time of a single incompressible (that is, ML-random) string of length \(n\), because such a string is so typical that its running time is about equal to that of most other strings, and hence the average time.
The method simply analyzes the algorithm with respect to this single string, by showing that the string is compressible if the algorithm does not satisfy a certain running time. This method has been used to analyze the average case running time for many well-known sorting algorithms, including the Heapsort and the Shellsort algorithm. A successful example is a \(\Omega (p n^{1+1/p})\) lower bound on the average-case running time (uniform distribution) for sorting \(n\) items, of \(p\)-pass Shellsort. This is the first nontrivial general lower bound for average-case Shellsort in 40 years.
Applications of Compressibility
Traditional wisdom has it that the better a theory compresses the learning data concerning some phenomenon under investigation, the better we learn, generalize, and the better the theory predicts unknown data. This belief is vindicated in practice but before the advent of Kolmogorov complexity has not been rigorously proved in a general setting. Ray Solomonoff invented the notion of universal prediction using the Kolmogorov complexity based universal distribution, see the section on Algorithmic Probability in Algorithmic Information Theory. The discrete version of the universal probability, denoted as \(m\), has miraculous applications:
- For every algorithm whatsoever, the average-case computation complexity resource (e.g. running time, storage) is the same order of magnitude as the worst-case computation complexity resource (running time, storage, respectively), the average taken with inputs distributed according to the universal distribution \(m\).
- In PAC-learning, using \(m\) as the distribution from which to draw labeled examples in the learning phase, considering families of discrete concept classes with computable distributions, the learning power increases: Some concept classes that were NP-hard to learn now become polynomially learnable. Similarly, PAC learning of continuous concept classes from families over computable measures is improved by using \(M\), the continuous version of \(m\), in the learning phase. Some continuous concept classes that are non-computable in the general setting become computable now. That is, in both cases we draw in the learning phase labeled examples according to the universal distribution, and we PAC-learn according to the real computable distrinution. This model is akin to using a teacher in the learning phase who gives the simple examples first.
- A formally related quantity is the probability that \(U\) halts when provided with fair coin flips on the input tape (i.e., that a random computer program will eventually halt). This halting probability, \(\Omega = \sum_x m(x)\) also known as Chaitin's constant, or "the number of wisdom" has numerous remarkable mathematical properties, and can be used for instance to quantify Goedel's Incompleteness Theorem and is a natural example of a Martin-Loef random sequence.
- The continuous version of \(m\), denoted above by \(M\), leads to excellent predictions and decisions in general stochastic environments. A theory of learning by experiments in a reactive environment, has been developed by combining universal distribution predictions with sequential decision theory, results in an optimal reinforcement learning agent embedded in an arbitrary unknown environment Hutter (2005), and a formal definition and test of intelligence.
Universal prediction is related to optimal effective compression. The latter is almost always a best strategy in hypotheses identification (the minimum description length (MDL) principle). While most strings are incompressible, they represent data where there is no meaningful law or regularity to learn; it is precisely the compressible strings that represent data where we can learn meaningful laws from. As perhaps the last mathematical innovation of an extraordinary scientific career, Kolmogorov in 1973 proposed to found statistical theory on finite combinatorial principles independent of probabilistic assumptions. Technically, the new statistics is expressed in terms of Kolmogorov complexity. The relation between the individual data and its explanation (model) is expressed by Kolmogorov's Structure function. This entails a non-probabilistic approach to statistics and model selection. Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of given maximal (Kolmogorov) complexity. The Structure function of the given data expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. Recently it has been shown that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best-fitting model in the class irrespective of whether the `true' model is in the model class considered or not. This approach led to a new foundation for the celebrated Minimum Description Length (MDL) principle for Statistical Modeling or Statistical Inference, Rissanen (2007). Essentially, for given data, one analysis tells which models obtained by the MDL principle are the right ones, the best fitting ones, while another analysis using the structure function tells how to obtain them. In this setting, this happens with certainty, rather than with high probability as is in the classical case. Related ideas have been applied, by the same authors <review> Who are these anonymous authors?</review>, to rate-distortion theory and lossy compression of individual data. Cognitive psychology has a long tradition of applying formal models of simplicity and complexity. The work by E. L. J. Leeuwenberg even predates the advent of Kolmogorov complexity. Not surprisingly, this field has a large and significant literature on applications of Kolmogorov complexity, for example the circles around N. Chater and P.A. van der Helm.
Applications of Compressibility Requiring a Great Effort
Some strings can be compressed but take a great amount of effort, in time or space, to do so. This leads to applications in
- Structural Complexity Theory about relations between resource-bounded computational classes,
- Language compression and coding,
- Probabilistic computational classes and oracle classes,
- Universal optimal search,
- Logical Depth.
Potential applications in this setting are
Applications of Differences in Compressibility by Real Compressors
Using the compressibility of nonrandom strings, a new theory of information distance and normalized information distance has been introduced, initially using Kolmogorov complexity and applied through approximation by real-life compressors like "gzip", "bzip2", and "PPMZ". In classical physics, we know how to measure the physical distance between a pair of objects. In the information age, it is important to measure the "information distance" between two objects: two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, or two genomes. Such a measurement should not be application dependent or arbitrary. The universal similarity metric is probably the greatest practical success of Algorithmic Information Theory. A reasonable definition for the similarity between two objects is how difficult it is to transform them into each other. Formally, one can define the similarity between strings \(x\) and \(y\) as the length of the shortest program that computes \(x\) from \(y\) and vice versa. This was proven to be equal to \[ \max \{K(x|y),K(y|x)\} \] up to logarithmic additive terms which can be ignored. These distances are absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length 1,000,000 differ by 1000 bits, then we are inclined to think that those strings are relatively more similar than two strings of 1000 bits that have that distance. Hence we need to normalize to obtain a universal similarity metric, and to apply it in practice for e.g. evolutionary trees of species, we can approximate the Kolmogorov complexity by real-world compressors. This bold step was taken first for the slightly different sum distance \(K(x|y)+K(y|x)\), originally derived with a thermodynamic interpretation in mind. Many applications including chain letter phylogeny, plagiarism detection, protein sequence/structure classification, and phylogenetic reconstruction have followed. In particular, researchers from the datamining community noticed that this methodology is in fact a parameter-free, feature-free, data-mining tool. They have experimentally tested a similar <review> similar or similarity? </review> metric on a large variety of sequence benchmarks. Comparing the compression method with 51 major methods found in 7 major data-mining conferences over the past decade they established clear superiority of the compression method for clustering heterogenous data, and for anomaly detection, and competitiveness in clustering domain data. It has been shown that the normalized information distance (NID), \[ NID(x,y) = \frac{ \max\{K{(x|y)},K{(y|x)}\} }{ \max \{K(x),K(y)\}}, \] where \(K(x|y)\) is algorithmic information of \(x\) conditioned on \(y\), is a similarity metric. The function \(NID(x,y)\) has been shown to satisfy the basic requirements for a metric distance measure. The universality conjecture was proved by recently, that is, that the NID as well as the normalized sum distance proposed earlier are universal in that they minimize, are at least as small as, every computable distance measure satisfying a natural density requirement. While this metric is not computable, it has an abundance of applications. Simply approximating \(K\) by real-world compressors, with \(C(x)\) is the binary length of the file \(x\) compressed with compressor \(C\) (for example "gzip", "bzip2", "PPMZ", in order to make NID easy to apply, we can rewrite the NID to obtain the normalized compression distance (NCD) \[ NCD(x,y) = \frac{C(xy) - \min \{C(x),C(y)\}}{\max \{C(x),C(y)\}}. \] NCD is actually a family of distances parametrized with the compressor \(C\). The better \(C\) is, the closer the NCD approaches the NID, and the better the results are. The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees as above. It can also be used for new applications of general clustering and classification of natural data in arbitrary domains, for clustering of heterogenous data, and for anomaly detection across domains. The NID and NCD have been further applied to authorship attribution, stemmatology, music classification, internet knowledge discovery, to analyze network traffic and cluster computer worms and virusses, software metrics and obfuscation, web page authorship, topic and domain identification, hurricane risk assessment, ortholog detection, and clustering fetal heart rate tracings. Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of `War and Peace' by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red." Using code-word lengths obtained from the page-hit counts returned by Google from the web, we obtain a semantic distance using the NCD formula and viewing Google as a compressor useful for data mining, text comprehension, classification, and translation. The associated NCD, now called the normalized Google distance (NGD) can be rewritten as: \[ NGD(x,y)= \frac{ \max \{\log f(x), \log f(y)\} - \log f(x,y) }{ \log N - \min\{\log f(x), \log f(y) \}}, \] where \(f(x)\) denotes the number of pages containing \(x\), and \(f(x,y)\) denotes the number of pages containing both \(x\) and \(y\), as reported by Google. The number \(N\) can be set to the number of pages indexed by Google. For a publicly available open-source downloadable software tool CompLearn, written by Rudi Cilibrasi, for both NCD and NGD, see http://www.complearn.org. For an on-line demo of the NGD, see http://clo.complearn.org/: Simply think of 4-25 words (or compound words containing spaces) in a list. Enter the words in the box below and then press "run experiment" to see the computer make a tree. The computer will use the Google search engine page counts to compute NGD for each pair of words. Then it will use a new quartet method to find the best tree.
Stimulated by this work, a competitive approach based on compression has been developed to Pearson-Neyman hypothesis testing (null-hypothesis versus alternatve hypothesis), tests for randomness of strings generated by random number generators, and lossy compression and denoising via compression.
References
R. Downey, D. Hirschfeldt, Algorithmic Randomness and Complexity, Springer-Verlag, New York, 2008.
M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications,Springer-Verlag, New York, 1997. (2nd Edition 1997. 3rd Edition 2007, to appear.)
J.J. Rissanen, Information and Complexity in Statistical Modeling, Springer-Verlag, New York, 2007.
Bibliography, http://www.cs.uwaterloo.ca/~mli, http://www.cwi.nl/~paulv
See Also
Algorithmic Complexity, Algorithmic Information Theory, Algorithmic Probability, Algorithmic Randomness, Recursion Theory, Universal Search


