Algorithmic Information Dynamics

From Scholarpedia
This article has not yet been published; it may contain inaccuracies, unapproved changes, or be unfinished.
Jump to: navigation, search

Algorithmic Information Dynamics (AID) is an algorithmic probabilistic framework for causal discovery and causal analysis. It allows solving inverse problems using computational tools drawn from algorithmic complexity theory and based on algorithmic probability. AID studies dynamical systems in software space where all possible computable models live and are able to explain discrete longitudinal data such as a particle orbit in state and phase space, or approximate continuous systems by discrete computable models. AID provides tools such as perturbation analysis to guide the search and use of precomputed models--extracted initially from an exhaustive search--to find relevant generative mechanisms representing first principles of a domain of interest and is an alternative or complementary to other approaches and methods of experimental inference such as statistical machine learning and classical information theory.

AID is related to other areas such as computational mechanics, and program synthesis. However, unlike other methods such as Bayesian networks, AID does not rely on graphical models or the (often inaccessible) empirical estimation of mass probability distributions. AID lays the foundations and methods to make the area of algorithmic information and algorithmic complexity more relevant to the practice of science, knowledge discovery, and causal analysis. AID also connects with and across other parallel fields of active research such as logical inference, causal reasoning, and neuro-symbolic computation. AID studies how candidate discrete computable equations as generating mechanisms are affected by changes from observed phenomena over time as a result of the system evolving (e.g. under noise) or perturbations introduced (whose place in time and space can be hypothesised under algorithmic perturbation analysis).

Figure 1: The basic principles of Algorithmic Information Dynamics consist in finding computable models or sequences of (short) computable models able to explain a piece of data or an observation and study the effect that interventions have on such models in the software space. Fully deterministic systems subject to no noise or no external influence will find an invariant set of candidate models with description lengths varying by only a logarithmic or sublogarithmic term; any deviation will suggest noise or external influence.


Foundations and Motivation

Algorithmic Probability

Algorithmic Probability was introduced by Solomonoff, and Levin in its modern formulation, and by Kolmogorov and Chaitin indirectly through their work on algorithmic complexity and Chaitin's halting probability number. Algorithmic Probability considers the probability of a (discrete) object to be produced by a computable mechanistic set of rules able of Turing universality. Algorithmic Probability imposes a mapping distribution between input and output called the Universal Distribution. Formally, a computable process that produces a string \(s\) is a program \(p\) that when executed on a universal Turing machine \(U\) produces the string \(s\) as output and halts.

As \(p\) is itself a binary string, we can define the discrete universal a priori probability, \(m(s),\) to be the probability that the output of an arbitrary binary input of universal prefix Turing machine \(U\) is \(s\) when this input is provided with fair coin flips on the input tape. Formally,

\[ m(s) \;:=\; \sum_{p\;:\;U(p)=s} 2^{-\ell(p)}, \]

where the sum is over all halting programs \(p\) for which \(U\) outputs the string \(s\). As \(U\) is a prefix universal Turing machine the set of valid programs forms a prefix-free set (or self-delimited programming language) and thus the sum is bounded due to Kraft's inequality (i.e., it defines a probability semi-measure).

A formulation of the methods behind AID can be described as a combination of Bayes' theorem and Computability theory where the default agnostic prior distribution is the Universal Distribution instead of some other agnostic distributions such as the uniform. In this way any updates to the specific distribution of the problem is updated according to Algorithmic Probability.

Introduced by Ray Solomonoff, the theory of universal inductive inference as he also called it, is a theory of prediction based on observations, assuming each individual observed object is generated by arbitrary computable processes. An example is the prediction of the next digit in the sequence \(s=1234567...\) According to Algorithmic Probability, the next digit is expected to be 8, if the simplest model (i.e., the shortest self-delimiting program) able to generate that sequence is the successor function \(x_0=1;x_i:=x_{i-1}+1\) and such a model would generate 8 as a predictor of the next digit given the data. The only assumption is that the prior probability distribution follows a computable probability distribution even if such distribution is unknown because as proven by Levin, all computable distributions converge into what is known the Universal Distribution. As one of the pillars of algorithmic information theory (AIT), the emerging Universal Distribution from the Algorithmic Coding Theorem conjoins algorithmic complexity, universal a priori probability, and universal/maximal computably enumerable discrete semi-measure into a single precise and ubiquitous mathematical formalization of a necessary "bias toward simplicity" for spaces of computably generated objects. A result that has been called 'miraculous' in the scientific literature and was praised by the late Marvin Minsky as the most important theory in science relevant to AI.

Algorithmic Probability captures (not originally on purpose) longstanding principles on which science has been founded, the principle (also known as Occam's razor) that the simplest explanation (in this case the most algorithmically probably which turns out to be the shortest algorithmic description) is more likely the correct one, and the principle of multiple explanations (Epicurus) to keep all explanations consistent with the data and Bayes's Rule (transform the a priori distribution to a posterior distribution according to the evidence), keep all hypothesis consistent with data.

Numerical Methods

AID was conceived and introduced in the early 2010s and is currently an active area of research but the methods enabling AID were introduced in the mid-2000s. While it was not known, and no experiment had been conceived to test if the concept of algorithmic probability would empirically capture the intuition of Occam's razor principle other than as established in the theory, Zenil and colleagues studied the empirical behaviour of probability distributions produced by different models of computation. The results suggested that the output distributions were more compatible than possibly theoretically expected but were conformant with both the theoretical and empirical expectations conforming with Occam's as the elements with assigned highest probability were also found to be the most algorithmic simple according to various complementary order parameters, including computable ones which converge in value pointing to the same direction while the least frequent elements were also more random by all measures.

The numerical application of AID beyond its theoretical formulation strongly relies upon such numerical methods designed to combine and expand classical information theory at characterizing randomness with measures different to the popular use of compression algorithms such as Lempel-Ziv widely used to claim approximations to algorithmic complexity. These methods on which AID relies upon (but it is also independent of) consist of the so-called Coding Theorem (CTM) and Block Decomposition (BDM) methods.

Causal Discovery with the Coding Theorem Method (CTM)

Most attempts to approximate algorithmic complexity have been made by way of popular lossless compression algorithms such as LZ or LZW. However, if we take the example of the sequence \(s=1234567...\), an algorithm such as LZ or LZW will fail at compressing \(s\) despite its obvious compressed form (\(x_0=1;x_i:=x_{i-1}+1\)). This is because algorithms such as LZ and LZW are entropy estimators and they build a dictionary of most frequent words to assign them shorter codes. For example, if \(s\) is a computable Borel normal number, no contiguous subsequence is approximately represented more than any other of the same length and the resulting compressed file tends to be of about the same size of \(s\) itself (modulo change of data type transformation which is only a transliteration between e.g. ASCII to binary). Because of these limitations of popular lossless compression algorithms, and other reasons, Zenil et al. introduced a method based on the so-called Algorithmic Coding theorem. The Algorithmic Coding Theorem in the context of Algorithmic Probability due L.A. Levin (1974) establishes equality between universal a priori probability \(m(s)\) and algorithmic complexity \(K(s)\), formally:

\[ m(s) = 2^{-K(s)} + c \]

or equivalently,

\[ - \log \; m(s) \;=\; K(s)+ c , \] where \( c\) is a constant.

Based on this fundamental theorem, and under the assumption of optimality of the reference Turing machine, the Coding Theorem Method (CTM) is a method that provides an alternative path to the widespread use of statistical compression algorithms such as Lempel-Ziv to approximate algorithmic complexity. CTM does not rely upon statistical methods such as those based on dictionaries on which popular lossless compression algorithms such as LZW are based (designed to find statistical regularities and thus more related to classical information theory than to algorithmic complexity). The aim of CTM is to embrace Turing universality and explore the space of computer programs able to capture properties beyond statistical patterns and, as consequence, ignored by statistical approaches. Then, the method pumps it into the generative models that support a natural or artificial phenomenon, as usually done in science. As an example, traditionally the boundary of what part of the probabilistic content of an object or state belongs to the model itself, or to the explanation of the possible explanatory models, and these have traditionally been conflated in statistical and probabilistic approaches.

For example, when modelling the outcome of throwing a dice, what a statistical model ends up quantifying is the uncertainty external to the process itself, the degree of uncertainty as an observer hence quantifying a property of the observer and not the process related to the dice itself. This is because the process of throwing the dice is deterministic according to the laws of classical mechanics that are known to govern the dice trajectory (if quantum fluctuations are believed to be probabilistic, at short distances wouldn't have any effect), the probabilistic content of the model describing the dice outcome is, therefore, strange to the actual process. What an algorithmic-probability-like approach like CTM would do in principle, and its major difference is to produce a set of deterministic models describing the dice trajectory and outcome without making any recourse to probability. It is then in the distribution of possible computable models explaining the dice where a probability consequentially emerges but is not assumed at the core of the individual models themselves. In turn, these models cannot only be tested beyond their outcome predictions as mechanistic (step-by-step) descriptions of the process but also offer a non-black-box approach where each model can be followed step-by-step and model states correspond to constructive (e.g. physical) states, as opposed to random variables with no state-to-state correspondence between the model and phenomenological data.

Unlike popular lossless compression algorithms that are guaranteed not to be able to characterize objects such as \(s\), CTM can, in principle, because when running all possible computer programs up to the size of \(s\) in bits there is a non-zero probability that CTM will find such a program if such a program exists. Indeed, this finding is guaranteed as the worse case is when the program in question has the form of \(print[s]\) and \(s\) is algorithmically random (i.e., \(s\) is incompressible or with algorithmic complexity value equal to the length of \(s\), up to a constant). This holds because of the ergodicity of the algorithmic complexity approximation by the CTM over the software space, if enough computational resources are expended, which in turn is allowed by the lower semi-computability of the universal a priori probability.

The Block Decomposition Method (BDM)

One way to see BDM is as a weighted version of Shannon's entropy that introduces algorithmic randomness into the original formulation of classical information, one that is able to help tell apart statistical randomness from algorithmic randomness. To illustrate this let's take the sequence \(s = 1234567... \) again. While the digits of \(s\) are distributed with maximal entropy because it can be seen as the digital expansion of the so-called Champernowne constant that has been proven to be Borel normal (because it does not show any statistical patterns in the sense of overrepresentations of any digit subsequence repeated any more times than any other subsequence), when considering the set of all possible sequences some of which do show statistical patterns, the appearance of each digit in \(s\) would be characterized as unexpected when we may have liked to say that the most salient feature of \(s\) is, in reality, its not random nature as it can be generated by a simple deterministic/computable generative process.

This is a key methodological distinction because (1) the sequence would be characterized as maximally 'disordered' by statistical approaches such as Shannon Entropy unless upon its application one already has the information that \(s\) has been generated by a deterministic process, thereby self-defeating the use of the measure in the first place (in science this is the rule rather than the exception when observing data: one wants to investigate the nature of an object when the nature of such object is not yet known); and (2) it is of high empirical value in practice and application to science.

For example, if the purpose is to quantify human memory, a person would not need to learn \(s\) digit by digit in order to generate it illustrating the algorithmic nature of human cognitive processes beyond statistical patterns. The sequence \(s\) may look very special and, in fact, most sequences are of this type. In addition to not having any statistical regularity, most sequences do not even have any short description (low algorithmic randomness) when looking at the set of all possible sequences. That is, most sequences are algorithmically random. Moreover, the difference between those that do not have a short algorithmic description versus those that appear statistically random but have a short description is divergent. We only chose \(s\) because it was the most obvious for illustration (another example are the digits of a mathematical constant such as \(\pi\)). Indeed, CTM and BDM have found many applications in psychometrics and cognition due to this advantage (see Applications of AID).

What BDM does is to extend the power gained by CTM to quantify algorithmic randomness by implementing a divide-and-conquer approach whereby the data is decomposed in pieces small enough that an exhaustive computational search finds the set of all computable models able to generate each piece of data. The sequence of small generators then supports the larger piece of data from which insight is gained as to their algorithmic properties. Small pieces of data supported by even smaller programs than those pieces are called causal patches in the context of AID as they are the result of a causal relationship between the set of these computable models and the observed data. On the other hand, models of about the same length than the data are not causally explained by a shorter generating mechanism and are therefore considered (algorithmically) random and not causally supported.

A sequence of computer programs smaller than their matched patches constitutes a sufficient test for non-randomness and is, therefore, informative of its causal content, as its components can be explained by underlying small computable models shorter than the original whole data itself. This way, the stronger the coarse-graining of the original data (i.e., the weaker the decomposition) the closer the value resulting from the BDM is to the theoretical optimal value that corresponds to the whole data's algorithmic complexity. Hence, the more fine-grained (i.e., the stronger the decomposition), the more the BDM calculated value can diverge from the whole data's algorithmic complexity. This is because BDM value and the algorithmic complexity are proven to converge (up to a constant that only depends on the error of the CTM relative to the algorithmic complexity) as the data partition size tends to the whole data size.

Algorithmic Intervention Analysis

Unlike other approaches, such as Pearl's do-calculus, Algorithmic Information Dynamics can help in the initial process of causal discovery and is able to perform fully unsupervised hypothesis generation from causal computable models. It also provides the tools to explore the algorithmic effects that perturbations to systems and data may have on their underlying computable models, effectively also providing a framework for causal analysis similar to the do-calculus, but without traditional probability distributions. AID can substitute, or complement, other approaches to causal analysis. For example, Judea Pearl et al. methods for causal analysis assumes the existence of an educated causal model but cannot provide the means for primary causal discovery.

Figure 2: Causal intervention matching with AID: Central to AID is to study the effect of interventions at the data level to the set of computable candidate models, their change in program space are an indication of both the nature of the original data and the nature of the perturbation.

One can formulate a cause-effect question in the language of the do-calculus as a probability between a random variable and an intervention \(P(L|do(D))\) and one can also do so in the context of AID. In one of the typical examples by Pearl himself, if \(L\) means human lifespan and \(do(D)\) the use of some drug \(D\), the probability \(P\) of \(D\) having an effect on \(L\) is quantified by classical probability to calculate \(P\). What AID does is to substitute \(P\) by AP, the algorithmic probability that \(D\) exerts an effect on \(L\) with the chief advantage that not only one obtains a probability for the dependency (and direction) between \(D\) and \(L\) but also a set of candidate models explaining such a connection where no classical probability is involved in any of the inference or description of each individual model. AP would then impose a natural non-uniform algorithmic probability distribution over the space of candidate models connecting \(D\) and \(L\) based on estimations of the Universal Distribution, which in effect introduces a simplicity bias that favours shorter models as opposed to algorithmically random ones, all of which have already been found to explain the causal connection between \(D\) and \(L\). AID thus removes the need for classical probability distributions and, more important, produces a set of generative models no longer derived from traditional statistics (e.g. regression, correlation), which is used even in the do-calculus to determine the probability of a dependency thereby falling back to the methods the do-calculus itself was circumventing.

While the do-calculus is a major improvement in the area of causal discover, AID complements it offering a path to leave classical probability behind and take the next step towards departing from correlation confounding methods. Another key difference between the original do-calculus and the algorithmic probability calculus is that the do-calculus makes a distinction between \(P(L|do(D))\) and \(P(L|D)\) but in AID both hypothesis, \(AP(L|D)\) and \(AP(D|L)\), can be tested independently as they have different meanings under algorithmic probability. \(AP(L|D)\) means that \(L\) is the generative mechanism of \(D\) and \(AP(D|L)\) means that \(D\) is the generative mechanism of \(L\). Each of them would trigger a different explorative process under AID, the former would look for the set of smaller to larger computable models denoted by \({L}\) that can explain \(D\) while the latter would look for all the generative models \(\lbrace D \rbrace\) that generate \(L\). Intuitively, this means that, for example, to explain how a barometer falling (\( X \)) could be the cause for a storm would likely not find many shorter generative mechanisms to causally explain the occurrence of a storm (\( Y \)), while the other direction would find shorter computable models for the barometer fall to be the effect and not the cause. Clearly, disregarding the result, \(AP(X|Y)\) and \(AP(Y|X)\) are not equivalent.

In the language of AID, conforming more to a typical notation in graph and set theory, the first example would be written as \(AP(L \backslash D\)) or \(C(L \backslash D)\), where \(C\) is the algorithmic complexity of \(L\) with intervention \(D\) (in the context of AID that is usually a deletion to a model that includes D) and the second example as \(AP(X \backslash Y)\) or \(C(X \backslash Y)\). Alternatively, \(AP(L \backslash \lbrace D \rbrace )\) would be \(L\) with a set of interventions \(\lbrace D \)\(\rbrace\). Multivariate causal analysis can be achieved by conditional versions of algorithmic complexity and is currently an area of active research.

In Pearl's account to causal analysis, the so-called causal ladder up to human-grade reasoning consists of 3 levels, with the first being that of pattern observation covered by traditional statistics long-based on regression and correlation, interventions as the one his do-calculus suggests and AID allows, and, the 3rd level, that of counterfactuals or the power to imagine what would happen if conditions were different. AID also provides the most fundamental step in the causal analysis which is causal discovery under the same framework without having to make recourse to different approaches for different purposes (discovery vs analysis). While the do-calculus comes with no first guidelines for how to come up with a first testable model, AID can generate a set of computable models ranked by likeliness and offers a full path towards the full replacement and substitution of regression and correlation in the description of a model, taking the statistical nature out of the causal description. In addition, AID can also potentially cover all three levels of causality in Judea's ladder and provides the methodological framework to address each without any making recourse to classical probability, regression, or correlation.

Both interventions and counterfactuals in AID are covered by the underlying simulation of all possible computable hypotheses (hence all 'if' questions) up to the length of the original observation, and methods such as CTM and BDM allow the estimation of processes at all these 3 levels by decomposing the original problem into smaller problems with the additional advantage that AID ranks the counterfactuals naturally by algorithmic probability likeliness based on the Universal Distribution as a prior.

Information Difference and Algorithmic Information Calculus

With the ability to move data away from or toward algorithmic randomness by perturbing the system, AID enables the investigation of which data's elements contribute the most to the information necessary for the underlying causal computable model (the positive information elements) and which elements would trigger an increase in the system's algorithmic complexity (the negative information elements). This way, the "algorithmic randomness control" performed by the Algorithmic Intervention Analysis puts forward new general methods for studying perturbation effects on non-linear systems beyond those from classical control theory, and without making strong a priori assumptions of linearity. For example, AID has shown not only how to reconstruct the space-time evolution of Elementary Cellular Automata from gathered disordered states, but also correlated, as expected, positive information genes in E-Coli with homeostatic processes and more negative information genes with specialization processes in the cellular network.

Let \(|S| \) denote the whole size of the object \(S\), and not only the number of constitutive elements. Let \(N\) denote the total number of constitutive elements of \(S\). (Therefore, \(N\) may be smaller than \(| S | \) , but \(|S| = \mathbf{O}( f_c(N) )\), where \(f_c\) is a total computable function). For example, in the case of networks, \(|S| \) grows as the number of all possible edges of a graph (in \(\mathbf{O}( N^2 )\)) and \(N\) is the total number of vertices.

Let \(F \neq \empty\) be a subset of the set of elements of the object \(S\) to be deleted. In order to grasp such a formal measure of the algorithmic-informational contribution of each element, one may look into the difference in algorithmic complexity between the original data \(S\) and the destructively perturbed data \(S \setminus F\). We define the information difference between \(S\) and \(S \setminus F\) as \[I(S,F) = K(S) - K(S \setminus F).\] Note that the same holds analogously for constructive perturbation, i.e., elements insertion. When \(F = \left\{ e \right\}\) with \(I(S,F)=I(S,e)\) and this difference assumes an absolute value \[| I(S,e) | = | K(S) - K(S \setminus e) | \leq \log(N) + \mathbf{O}(1),\] we say \(e\) is a neutral information element, since the algorithmic information contribution of the insertion of \(e\) back into \(S\) (or its deletion from \(S\)) would necessarily be of the same order of a computable reversible procedure. Furthermore, we say \(e\) is a positive information element, if it is not neutral and \(I(S,e) > 0\); and negative information subset of elements, if it is not neutral and \(I(S,e) < 0\). Thus, negative content is immediately associated with information gain from perturbations, and positive content with information loss.

In the context of causal analysis, when considering a stochastically random bit-flip being performed to a string pattern, such as \(0000000...\) repeated \(n\) times, such perturbation will have a significant effect with respect to the length of the original shortest program \(p\) consisting of a loop printing \(n\) number of \(0\)'s, because the new program \(p'\) would need to take into account the \(1\) inserted at a stochastically random position \(x \leq n\). If, however, the string is algorithmically random, the effect of any stochastically random single-bit perturbation will have no effect (up to a logarithmic term of string's algorithmic complexity, which is already maximal), because every bit is already necessary information for the string's own description. Not only for strings, but also for graphs and other multidimensional data structures, (stochastically) random single perturbations have similar worst-case effects in both simple and algorithmically random objects amounting to up to about \(\log( | S | )\). Thus, the information difference impact of moving an object \(S\) toward or away from algorithmic randomness may be:

  1. of the same order (i.e., \(I(S,e) = \mathbf{\Theta}( \log(N) ) \)) of the object's algorithmic complexity \(K(S)\), if e.g. \(S\) is a computable and, therefore, logarithmically compressible in the form \(K(S) \leq \log(N) + \mathbf{O}( \log( \log(N) ) ) + \mathbf{O}(1)\);
  2. or strong-asymptotically dominated by the object's algorithmic complexity \(K(S) \geq |S| - \mathbf{O}(1)\), as \(I(S,e) = \mathbf{O}( \log( |S| ) ) = \mathbf{o}\left( |S| \right)\) , if e.g. \(S\) is algorithmically random and, therefore, incompressible (up to a constant);

In other words, in the asymptotic limit when \(N \to \infty\), the ratio of worst-case information difference per bit of algorithmic complexity \[\frac{I(S,e)}{K(S)}\] tends to \(0\) for algorithmic random objects and tends to a constant \(\epsilon > 0\) for computable objects. The more the object's algorithmic complexity diverges from the maximal value (i.e., the larger the algorithmic randomness deficiency), the larger the worst-case relative impact on the object's algorithmic complexity (i.e., on its optimal losslessly compressibility). Thus, within the context of computably generated objects, this phenomenon suggests some kind of "robustness" to stochastically random perturbations relative to objects' intrinsic algorithmic randomness. Indeed, as one of the main results of AID, these impacts were studied and demonstrated to be related to a thermodynamic-like behaviour in stochastic-randomly moving an object toward or away from algorithmic randomness. In addition, this, in turn, led to the existence of optimum points of reprogrammability between both extremes of algorithmic complexity.

Connection to Dynamical Systems

While some theoretical results have connected algorithmic complexity and dynamical systems not many applications have connected them. One of the main results in AID is that if a chain of causally connected processes is unaffected and generated from the same generative mechanism then its algorithmic complexity remains constant up to a (double) logarithmic term accounting for the time step when the system is to be reproduced up to a certain observable time, anything departing from such quantity indicates that the process has been subject to an external perturbation or interaction which can be pinpointed by AID. Numerical experiments quantifying the change in the number of attractors in a dynamical system (and therefore the average depth or shallowness of their attractors) demonstrates that, on the one hand, when removing elements that reduce the algorithmic complexity of a dynamical system (with respect to the average length of the models explaining its original description) systematically reduces the number of attractors. On the other hand, when removing elements that increase the dynamical system algorithmic complexity (hence makes it more algorithmically random), the number of attractors increases. Current open topics of AID research include multivariable modelling e.g. describing multiple particles individually rather than as a system.

Applications to Data Science

Dimensionality \(\) Reduction and Minimal Loss of Information

Besides causation, control, and interventional analysis, AID methods and algorithmic information difference can also be applied to study lossy compression, so that the information loss is minimized or quantified. One can study the rate at which information is lost, recovered or reconstructed in reduced complex artificial and real networks, while retaining the typical features of biological, social, and engineering networks, such as scale-free edge distribution and the small-world property.

For example, it was evaluated the loss of information and their ability to preserve information content of network in previous dimensionality reduction methods based on motif analysis, graph sparsification, and graph spectra. Graph spectra being the most irregular, losing most of the information content of the original network, and network motif analysis the method that better preserved the original information content, which is in agreement with local regularities preserving global algorithmic information.

On the other hand, AID methods can be applied directly in order to contribute to areas of feature selection and dimensionality reduction. In the case of graphs, for example, new methods for network summarisation based on successive local perturbations in the form of single edge deletions that produces minimises loss of algorithmic information was introduced. The methods constitute an interesting computation-time-feasible approach to designing theoretically optimal lossy compression techniques based on principles and estimations to theoretical optimal lossless compression. It was shown that graph-theoretic indices (ranging from degree distribution and clustering coefficient to edge betweenness and degree and eigenvector centralities) measured on synthetic and real-world networks were preserved, achieving equal or significantly better results than other data reduction and some of the leading network sparsification methods.

The same methods have been proposed to enable machine intelligence with an inference engine based on AID to help AI algorithms to build computable hypotheses from data that can then be tested against observation. This would complement other approaches such as statistical machine learning that would fail at tasks such as inference and abstraction.

Relation to other concepts

Resource-bounded complexity

AID, as based on CTM, is a resource-bounded algorithmic complexity measure as it takes informed runtime cutoffs (usually very short to deal with short strings) to estimate candidate upper bounds algorithmic complexity under assumptions of optimality and for the reference universal enumeration chosen.

Minimum Message and Minimal Description Length methods

While AID, MML, and MDL are ultimately related and based on the principles of Algorithmic probability, these minimum length approaches depart from AID (and AIT) in fundamental ways. MDL is a model selection principle rather than a model generating method. As such, AID incorporates the principle of MDL and represents a generalisation with MDL a particular case based on learning algorithms using the statistical notion of information rather than the more general--and powerful--notion of algorithmic information that requires Turing-completeness as used by AID. In practice, however, MDL and AID can complement each other because AID is more difficult to estimate while MDL provides some statistical shortcuts and this is similar and mostly already embedded in the Block Decomposition Method (BDM) that combines both classical and algorithmic information theories.

In fact, AID is agnostic of the underlying method even if it currently relies completely on the use of CTM and BDM as described above. So, for example, when AID constrains itself to statistical models not produced by CTM, AID would collapse into methods such as MDL or MML, but any methods that introduce attempts to go beyond statistical would approximate the spirit of AID as using CTM and BDM.

Computational Mechanics

AID can be considered as a special case or generalisation of the area of Computational Mechanics depending on the perspective, they differ in that (1) the formulation of Crutchfield involves the introduction of stochastic processes in the models themselves (rather than on the probability distribution of the models) while from AID no probabilistic content is at the core of the candidate generative model but only at the level of the universal distribution (the distribution governing all computer programs), and (2) AID by way of CTM and BDM provide the means to implement other computational-mechanic approaches including potentially that of Crutchfield's as the means to mine the space of computable finite (deterministic or stochastic) automata.

Measures of sophistication

Because AID has the potential property (under assumptions of optimality) to distinguish simple and random and assign them similar values, AID can be considered a measure of sophistication similar to Bennett's Logical Depth. Moreover, a measure of (re)programmability is a measure of sophistication by design and is based on AID. Some thermodynamic-like properties associated to reprogramming systems were found and published.

Granger causality and transfer entropy

This family of methods are of statistic nature and mostly for causal analysis and not causal discover, especially because they are unable to deal with models in phase-space corresponding to possible physical state models. Some relationships of variable, causal or temporal can be derived as it can from intervention analysis similar to Pearl's do-calculus, but they don't produce candidate mechanistic models in the first place and the causal analysis falls back into the purely classical probabilistic framework that is only partially successful to be removed from but has no other option to resort to it again when it comes to testing the associative nature between any 2 or more variables.

Challenges and limitations

The computational resources needed to compute CTM at the core of AID limit the size of its application but BDM extends and combines CTM with other computable measures such as Shannon entropy itself. One of the most active areas of AID research is to find ways to make the algorithmic measures to be more relevant and to find possible shortcuts towards the computation of the computable region in the uncomputable space that is relevant for applications in science.


  • G. J. Chaitin. On the length of programs for computing finite binary sequences.Journal of the ACM, 13(4):547--569, 1966.
  • R. G. Downey and D. R. Hirschfeldt, Algorithmic Randomness and Complexity. New York, NY: Springer New York, 2010.
  • P. Gàcs. On the symmetry of algorithmic information. Soviet Mathematics Doklady, 15:1477--1480, 1974.
  • R. G. Gallager.Information Theory and Reliable Communication. Wiley, New York, 1968.
  • N. Gauvrit, F. Soler-Toscano, H. Zenil, Natural Scene Statistics Mediate the Perception of Image Complexity, Visual Cognition, Volume 22, Issue 8, pp. 1084-1091, 2014.
  • N. Gauvrit, H. Singmann, F. Soler-Toscano, H. Zenil, Algorithmic complexity for psychology: A user-friendly implementation of the coding theorem method, Behavior Research Methods, Volume 48, Issue 1, pp. 1-16, 2015.
  • N. Gauvrit, H. Zenil, F. Soler-Toscano and J.-P. Delahaye, Algorithmic complexity for short binary strings applied to psychology: a primer, Behavior Research Methods, vol. 46-3, pp 732-744, 2014.
  • S. Hernández-Orozco, N.A. Kiani, H. Zenil, Algorithmically Probable Mutations Reproduce Aspects of Evolution, such as Convergence Rate, Genetic Memory, and Modularity, Royal Society Open Science 5:180399, 2018.
  • M. Hutter. On Universal Prediction and Bayesian Confirmation. Theoretical Computer Science, 384:1 (2007) 33-48
  • A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information and Transmission, 1(1):1--7, 1965.
  • A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys, 38(4):27--36, 1983.
  • L. A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems of Information Transmission, 10(3):206--210, 1974.
  • M. Li and P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications. Springer, New York, 2nd edition, 1997.
  • F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines, PLoS ONE 9(5): e96223, 2014.
  • F. Soler-Toscano, H. Zenil, A Computable Measure of Algorithmic Probability by Finite Approximations with an Application to Integer Sequences, Complexity vol. 2017 (2017), Article ID 7208216
  • F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit , Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Computability, vol. 2, no. 2, pp 125-140, 2013.
  • R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1--22 and 224--254, 1964.
  • G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Computing, Natural Computing, vol 12(4): 499--515, 2013.
  • H. Zenil and J-P. Delahaye, “On the Kolmogorov-Chaitin complexity for short sequences”. In C. Calude (ed) Randomness and Complexity: From Leibniz to Chaitin, World Scientific Publishing Company, 2007.
  • H. Zenil and J-P. Delahaye, Numerical Evaluation of the Complexity of Short Strings: A Glance Into the Innermost Structure of Algorithmic Randomness, Applied Mathematics and Computation 219, pp. 63-77, 2012.
  • H. Zenil, G. Ball and J. Tegnér, Testing Biological Models for Non-linear Sensitivity with a Programmability Test. In P. Liò, O. Miglino, G. Nicosia, S. Nolfi and M. Pavone (eds), Advances in Artificial Intelligence, ECAL 2013, pp. 1222-1223, MIT Press, 2013.
  • H. Zenil, Programmability: A Turing Test Approach to Computation. In L. De Mol and G. Primiero (eds.), Turing in Context, Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten (Belgian Academy of Sciences and Arts), Contactforum, 2014.
  • H. Zenil, F. Soler-Toscano, K. Dingle and A. Louis, Correlation of Automorphism Group Size and Topological Properties with Program-size Complexity Evaluations of Graphs and Complex Networks, Physica A: Statistical Mechanics and its Applications, vol. 404, pp. 341–358, 2014.
  • H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, PeerJ Computer Science, 1:e23, 2015.
  • H. Zenil, Algorithmicity and Programmability in Natural Computing with the Game of Life as an In Silico Case Study, Journal of Experimental & Theoretical Artificial Intelligence, Volume 27, Issue 1, pp. 109-121, 2015.
  • H. Zenil, N.A. Kiani, F.S. Abrahão, A.Rueda-Toicen, A.A. Zea, J. Tegnér, Minimal Algorithmic Information Loss Methods for Dimension Reduction, Feature Selection and Network Sparsification, 2020.
  • H. Zenil, N.A. Kiani, F. Marabita, Y. Deng, S. Elias, A. Schmidt, G. Ball and J. Tegnér, An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems, iScience, vol. 19, pp. 1160–1172, 2019.
  • H. Zenil, N.A. Kiani, J. Tegnér, Numerical Investigation of Graph Spectra and Information Interpretability of Eigenvalues, In F. Ortuño & I. Rojas (Eds.): 3rd International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO) 2015, Part II, LNCS 9044, pp. 395--405. Springer, 2015.
  • H. Zenil, N.A. Kiani and J. Tegnér, Methods of Information Theory and Algorithmic Complexity for Network Biology, Seminars in Cell and Developmental Biology, vol. 51, pp. 32-43, 2016.
  • H. Zenil, Approximations to Algorithmic Probability. In Robert A. Meyers (ed), 2nd. Edition of the Springer Encyclopedia of Complexity and Systems Science, 2017.
  • H. Zenil, N.A. Kiani and J. Tegnér, Algorithmic Information Dynamics of Emergent, Persistent, and Colliding Particles in the Game of Life. In A. Adamatzky (ed), From Parallel to Emergent Computing, Taylor & Francis / CRC Press, pp.367--383, 2019.
  • H. Zenil, N.A. Kiani and J. Tegnér, Low Algorithmic Complexity Entropy-deceiving Graphs, Physical Review E 96, 012308, 2017.
  • H. Zenil, N.A. Kiani, and J. Tegnér, Quantifying loss of information in network-based dimensionality reduction techniques, J. Complex Networks, vol. 4, no. 3, pp. 342–362, 2016.
  • H. Zenil, N.A. Kiani, and J. Tegnér, The Thermodynamics of Network Coding, and an Algorithmic Refinement of the Principle of Maximum Entropy, Entropy, vol. 21, no. 6, p. 560, 2019.
  • H. Zenil, S. Hernández-Orozco, N.A. Kiani, F. Soler-Toscano, A. Rueda-Toicen, A Decomposition Method for Global Evaluation of Shannon Entropy and Local Estimations of Algorithmic Complexity, Entropy 20(8), 605, 2018.
  • H. Zenil, N.A. Kiani, A. Zea, J. Tegnér, Causal Deconvolution by Algorithmic Generative Models, Nature Machine Intelligence, vol 1, pages 58–66, 2019.
  • H. Zenil, P. Minary, Training-free Measures Based on Algorithmic Probability Identify High Nucleosome Occupancy in DNA Sequences, Nucleic Acids Research, gkz750, 2019.
  • A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83--124, 1970.

Internal references

  • Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
  • Tomasz Downarowicz (2007) Entropy. Scholarpedia, 2(11):3901.

Recommended reading

For an introduction to Kolmogorov complexity, history and more references:

Foundational Papers

Applications Papers

Media and Popular Science Publications

External links

See also

Algorithmic "Kolmogorov" Complexity, Algorithmic "Solomonoff" Probability, Universal "Levin" Search, Algorithmic "Martin-Loef" Randomness, Recursion Theory, Applications of AIT.

Personal tools

Focal areas