# Entropy


Curator: Tomasz Downarowicz

Figure 1: In a naive analogy, energy in a physical system may be compared to water in lakes, rivers and the sea. Only the water that is above sea level can be used to do work (e.g. propel a turbine). Entropy represents the water contained in the sea.
• In classical physics, the entropy of a physical system is proportional to the quantity of energy no longer available to do physical work. Entropy is central to the second law of thermodynamics, which states that in an isolated system any activity increases the entropy.
• In probability theory, the entropy of a random variable measures the uncertainty about the value that might be assumed by the variable.
• In information theory, the compression entropy of a message (e.g. a computer file) quantifies the information content carried by the message in terms of the best lossless compression rate.
• In sociology, entropy is the natural decay of structure (such as law, organization, and convention) in a social system.
• In common usage, entropy means disorder or chaos.

## History

The term entropy was coined in 1865 [Cl] by the German physicist Rudolf Clausius from Greek en- = in + trope = a turning (point). The word reveals an analogy to energy, and etymologists believe that it was designed to denote the form of energy that any energy eventually and inevitably turns into -- useless heat. The idea was inspired by an earlier formulation by Sadi Carnot [Ca] of what is now known as the second law of thermodynamics.

The Austrian physicist Ludwig Boltzmann [B] and the American scientist Willard Gibbs [G] put entropy into the probabilistic setting of statistical mechanics (around 1875). This idea was later developed by Max Planck. Entropy was generalized to quantum mechanics in 1932 by John von Neumann [N]. This line of work later led Claude Shannon [Sh] (1948) to introduce entropy into probability theory; the notion, popularized in a joint book [SW] with Warren Weaver, provided foundations for information theory.

The concept of entropy in dynamical systems was introduced by Andrei Kolmogorov [K] and made precise by Yakov Sinai [Si] in what is now known as the Kolmogorov-Sinai entropy.

The formulation of the Maxwell's demon paradox by James C. Maxwell (around 1871) triggered a search for the physical meaning of information, which resulted in the finding by Rolf Landauer [L] (1961) of the heat equivalent of the erasure of one bit of information. This brought the notions of entropy in thermodynamics and information theory together.

The term entropy is now used in many other sciences (such as sociology), sometimes distant from physics or mathematics, where it no longer maintains its rigorous quantitative character. Usually, it roughly means disorder, chaos, decay of diversity or tendency toward uniform distribution of kinds.

## Entropy in physics

### Thermodynamical entropy - macroscopic approach

In thermodynamics, a physical system is a collection of objects (bodies) whose state is parametrized by several characteristics such as the distribution of density, pressure, temperature, velocity, chemical potential, etc. The change of entropy of a physical system when it passes from one state to another equals

$\Delta S = \int \frac {dQ}T,$

where $$dQ$$ denotes an element of heat being absorbed (or emitted; then it has negative sign) by a body, $$T$$ is the absolute temperature of that body at that moment, and the integration is over all elements of heat active in the passage. The above formula allows one to compare the entropies of different states of a system, or to compute the entropy of each state up to a constant (which is satisfactory in most cases). The absolute value of entropy is established by the third law of thermodynamics.

Notice that when an element $$dQ$$ of heat is transmitted from a warmer body at temperature $$T_1$$ to a cooler one at temperature $$T_2\ ,$$ then the entropy of the first body changes by $$-\frac {dQ}{T_1}\ ,$$ while that of the other rises by $$\frac {dQ}{T_2}\ .$$ Since $$T_2<T_1\ ,$$ the absolute value of the latter fraction is larger and jointly the entropy of the two-body system increases (while the global energy remains the same).
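The two-body argument above can be checked numerically. The sketch below uses hypothetical values for the heat element and the two temperatures; the point is only that the cooler body gains more entropy than the warmer one loses.

```python
# Entropy change when heat dQ flows from a warm body (T1) to a cooler one (T2).
# The numbers are illustrative: dQ in joules, temperatures in kelvin.
dQ = 100.0      # heat transferred, J
T1 = 400.0      # temperature of the warmer body, K
T2 = 300.0      # temperature of the cooler body, K

dS_warm = -dQ / T1          # the warmer body loses entropy
dS_cool = dQ / T2           # the cooler body gains entropy
dS_total = dS_warm + dS_cool

# Since T2 < T1, the gain outweighs the loss: total entropy rises
# even though the total energy is unchanged.
assert dS_total > 0
print(dS_warm, dS_cool, dS_total)
```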

A system is isolated if it does not interact with its surroundings (i.e., is not influenced in any way). In particular, an isolated system does not exchange energy or matter (or even information) with its surroundings. By virtue of the first law of thermodynamics (the conservation of energy principle), an isolated system can pass only between states of the same global energy. The second law of thermodynamics introduces irreversibility of the evolution: an isolated system cannot pass from a state of higher entropy to a state of lower entropy. Equivalently, the second law says that it is impossible to perform a process whose only final effect is the transmission of heat from a cooler medium to a warmer one. Any such transmission must involve outside work; the elements participating in the work will also change their states and the overall entropy will rise.

The first and second laws of thermodynamics together imply that an isolated system will tend to the state of maximal entropy among all states of the same energy. This state is called the equilibrium state and reaching it is interpreted as the thermodynamical death of the system. The energy distributed in this state is incapable of any further activity.

### Boltzmann entropy and Gibbs entropy - microscopic approach

Boltzmann [B] gave another, statistical meaning to entropy. A thermodynamical state $$A$$ (or macrostate, as described in terms of a distribution of pressure, temperature, etc.) can be realized in many different ways at the microscopic level, corresponding to many points $$\omega$$ (called microstates) in phase space $$\Omega\ .$$ For example, the microscopic description $$\omega$$ may consist of specifying the positions and velocities of all particles, requiring $$6n$$ real numbers for $$n$$ particles; in this case, phase space is $$\Omega = \mathbb{R}^{6n}\ ,$$ or the set of those $$\omega \in \mathbb{R}^{6n}$$ with a certain energy. The Boltzmann entropy of $$A$$ is defined to be proportional to the logarithm of the phase space volume of the set $$M_A$$ of all $$\omega$$ that realize the state $$A\ :$$

$\tag{1} S(A) = S_0 + k \log_2(\mathrm{vol}(M_A))$

The constant $$S_0$$ depends on the unit of phase space volume, and the proportionality factor $$k$$ is known as the Boltzmann constant. One also writes $$S(\omega)=S(A)$$ for every $$\omega\in M_A\ .$$

Analogously, if we take $$\Omega$$ to be a finite set, and $$\#M_A$$ denotes the number of elements in the set $$M_A\ ,$$ then the Boltzmann entropy is defined to be

$\tag{2} S(A) = k\log_2 (\#M_A)$

(The constant $$S_0$$ is not needed any more as no choice of volume unit is involved.) The logarithm is used in the formulas (1) and (2) because different macrostates correspond to sets of vastly different sizes in phase space, the largest of which (by far) belongs to the equilibrium state.

The formulas (1) and (2) can be rephrased in probabilistic terms if we imagine that one element of $$\Omega$$ is chosen at random with uniform distribution. That is, if the set $$\Omega$$ of all microstates has $$M$$ elements and the set $$M_A$$ of those realizing the macrostate $$A$$ has $$N$$ elements, we assign probability $$\frac 1M$$ to each microstate and $$\frac NM$$ to the state $$A\ .$$ Then Boltzmann's formula can be rewritten as

$\tag{3} S(A) = S_0' + k\log_2(Prob(A)),$

where $$S_0'= k \log_2(\#\Omega)\ .$$ Practically $$S_0' = S_{max}$$ is the maximal possible entropy of a macrostate, since $$M_A$$ cannot have more elements than $$\Omega\ .$$ (And indeed, the thermal equilibrium state $$E$$ contains most microstates of a given energy, so that $$\log_2(\#M_E) \approx \log_2(\#\Omega)$$ and thus $$S(E) \approx S_0'\ .$$ This means that the equilibrium state has the maximal possible entropy.) Notice that logarithms of probabilities are negative.
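A toy finite phase space makes formulas (2) and (3) concrete. In the sketch below (with the Boltzmann constant set to 1), the microstates are the $$2^n$$ sequences of $$n$$ coin tosses and a macrostate is "exactly $$k$$ heads"; the balanced macrostate contains by far the most microstates, so its entropy comes close to the maximum $$\log_2(\#\Omega) = n\ .$$

```python
from math import comb, log2

# Microstates: all 2**n sequences of n coin tosses.
# Macrostate A_k = "exactly k heads"; #M_{A_k} = C(n, k).
# Boltzmann entropy (2), with the Boltzmann constant set to 1:
#   S(A_k) = log2(#M_{A_k}).
n = 20
S = {k: log2(comb(n, k)) for k in range(n + 1)}

# The balanced macrostate k = n/2 has the largest entropy ...
assert max(S, key=S.get) == n // 2

# ... and it is already close to the maximum log2(#Omega) = n,
# as the approximation S(E) ≈ S_0' below formula (3) predicts.
print(S[n // 2], n)
```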

Gibbs entropy refines (3) to the case where the microstates $$\omega$$ realizing the macrostate $$A$$ may have different probabilities $$p_\omega\ :$$

$\tag{4} S(A) = -k\sum_{\omega\in M_A} p_\omega\log_2 (p_\omega).$

The formula (4) reduces to (2) if $$p_\omega = \frac 1N$$ for all microstates in $$M_A\ .$$ For a continuous probability distribution with density function $$\rho\ ,$$ the Gibbs entropy is defined as

$S(A) = -k\int \rho(\omega) \log_2(\rho(\omega)) \, d\omega.$

Probability distributions over phase space arise in particular because thermal equilibrium states are connected with such distributions, known as the microcanonical, canonical, and grand canonical distributions.
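As a small illustration of the Gibbs formula (4) (again with the Boltzmann constant set to 1), one can compute the entropy of a hypothetical three-level system whose microstate probabilities follow a canonical (Boltzmann) distribution $$p_\omega \propto e^{-E_\omega/T}\ ;$$ the energy values and temperature below are invented for the example.

```python
from math import exp, log2

def gibbs_entropy(p):
    """Gibbs entropy (4) of a discrete distribution, with k = 1."""
    return -sum(x * log2(x) for x in p if x > 0)

# Hypothetical three-level system: energies and temperature are illustrative.
energies = [0.0, 1.0, 2.0]
T = 1.0
weights = [exp(-E / T) for E in energies]
Z = sum(weights)                 # normalizing constant (partition function)
p = [w / Z for w in weights]     # canonical distribution

S = gibbs_entropy(p)
S_uniform = gibbs_entropy([1 / 3] * 3)   # (4) reduces to (2): log2(3)

# A non-uniform distribution has strictly smaller entropy than the
# uniform one on the same three microstates.
assert 0 < S < S_uniform
print(S, S_uniform)
```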

### Entropy in quantum mechanics

John von Neumann [N] found analogs in quantum mechanics for both the Boltzmann entropy and the Gibbs entropy. The quantum Boltzmann entropy is defined by

$S(A) = k \log_2(\dim(\mathcal{H}_A)),$

where $$\mathcal{H}_A$$ is the subspace of Hilbert space containing those microstates realizing the macrostate $$A\ .$$ The Hilbert space plays a role roughly analogous to that of the classical phase space.

The quantum Gibbs entropy, usually called von Neumann entropy, is

$S(\rho) = -k \mathrm{tr}(\rho \log(\rho)),$

where $$\rho$$ is a density matrix, which plays a role roughly analogous to that of a probability distribution over phase space. See also the main article on von Neumann entropy.

### Black hole entropy

Einstein's General Theory of Relativity implies that black holes exist and that (under reasonable technical assumptions) a certain quantity about a black hole, the surface area $$A$$ of its horizon, behaves much like entropy in thermodynamics: for any system of black holes, the sum of their surface areas cannot decrease. This statement was proven by Stephen W. Hawking (around 1972) and is known as the second law of black hole dynamics. Indeed, further considerations pointed out by Jacob Bekenstein (1973) and others suggest calling the quantity

$S = \frac{c^3k}{4\gamma\hbar} A$

(rather than $$A$$ itself) the black hole's entropy, where the prefactor involves only constants of nature: $$c$$ is the speed of light, $$k$$ is Boltzmann's constant, $$\gamma$$ is Newton's gravitational constant, and $$\hbar$$ is Planck's constant.

It is believed that the second law of black hole dynamics can be violated in reality (because of quantum effects not taken into account by General Relativity) using processes (such as Hawking radiation) that increase the thermodynamical entropy. It is further believed that the sum of black hole entropy and thermodynamical entropy cannot decrease, and should thus be regarded as the entropy of a system containing black holes. A proof of this claim would require a quantum theory of gravitation, which is not available to date in a satisfactory way; it is also not clear whether and how the black hole entropy is connected to the number of microstates as involved in Boltzmann's formula (2).

## Entropy in mathematics

### Shannon entropy

In probability theory, a probability vector $$P$$ (also called a partition of unity) is a collection of finitely many nonnegative numbers $$\{p_1,p_2,\dots,p_n\}$$ whose sum equals 1. The Shannon entropy of a probability vector $$P$$ is a straightforward adaptation of the Gibbs entropy formula (4) (but leaving out the proportionality factor $$k$$):

$\tag{5} H(P) = -\sum_{i=1}^n\, p_i\log_2 (p_i).$

Probability vectors occur naturally in connection with finite partitions of a probability space. Consider an abstract space $$\Omega$$ equipped with a probability measure $$\mu$$ assigning probabilities (numbers between 0 and 1) to subsets of $$\Omega$$ (more precisely, a measure usually does not assign values to all subsets, only to certain selected subsets called measurable sets; such sets form a large family closed under set operations such as unions and intersections, called a sigmafield). A finite partition $$\mathcal A$$ of $$\Omega$$ is a collection of pairwise disjoint measurable sets $$\{A_1,A_2,\dots,A_n\}$$ whose union is $$\Omega\ .$$ Then the probabilities $$p_i = \mu(A_i)$$ form a probability vector $$P_{\mathcal A}\ .$$ One associates the entropy of this vector to the partition $$\mathcal A\ :$$

$H_\mu(\mathcal A) = H(P_{\mathcal A}).$
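Formula (5) is easy to compute directly. A minimal sketch, using example probability vectors chosen for illustration:

```python
from math import log2

def shannon_entropy(p):
    """Shannon entropy (5) of a probability vector, in bits."""
    assert abs(sum(p) - 1.0) < 1e-9   # the entries must sum to 1
    return -sum(x * log2(x) for x in p if x > 0)

# The uniform vector on n outcomes attains the maximum log2(n).
print(shannon_entropy([0.25] * 4))                  # 2.0 bits
# A skewed vector on the same 4 outcomes has lower entropy.
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```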

In this setup entropy can be viewed as a parameter strictly related to the notion of information. Given a measurable set $$A\ ,$$ the amount of information associated with $$A$$ is defined as

$\tag{6} I(A) = -\log_2(\mu(A)).$

The information function $$I_{\mathcal A}$$ associated with a partition $$\mathcal A=\{A_1,A_2,\dots,A_n\}$$ is defined on the space $$\Omega$$ and it assumes the constant value $$I(A_i)$$ at all points $$\omega$$ belonging to the set $$A_i\ ,$$ i.e.,

$\tag{7} I_{\mathcal A}(\omega) = \begin{cases} -\log_2(\mu(A_1)), & \text{if }\ \omega\in A_1; \\ -\log_2(\mu(A_2)), & \text{if }\ \omega\in A_2; \\ \vdots\\ -\log_2(\mu(A_n)), & \text{if }\ \omega\in A_n. \\ \end{cases}$

One easily verifies that the expected value of the information function with respect to $$\mu$$ equals the entropy $$H_\mu(\mathcal A)\ :$$

$\tag{8} H_\mu(\mathcal A) = \int I_{\mathcal A}\, d\mu.$

This explains the relation between information and entropy: while information depends on the points in the probability space, entropy is the constant representing the mean value of the information.
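Identity (8) can be verified numerically; the partition measures below are invented for the check.

```python
from math import log2

# Hypothetical partition of Omega into three sets with these measures.
mu = [0.5, 0.3, 0.2]                       # mu(A_1), mu(A_2), mu(A_3)

I = [-log2(p) for p in mu]                 # values of the information function (7)
expected_information = sum(p * i for p, i in zip(mu, I))   # ∫ I_A dμ
entropy = -sum(p * log2(p) for p in mu)                    # H_mu(A), formula (5)

# Identity (8): the expectation of I_A equals the entropy.
assert abs(expected_information - entropy) < 1e-12
print(expected_information)
```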

In some sciences (e.g. in neuroscience) the term information refers to the difference between the entropy of a signal and the entropy of the noise. So defined information corresponds to the notion of conditional entropy (as explained below) and the similarity of names with the information function is rather incidental.

### Interpretation of Shannon entropy

• The partition $$\mathcal A$$ of the space $$\Omega$$ associates with each element $$\omega\in\Omega$$ the information that answers the question "in which $$A_i$$ are you?". That is the maximal knowledge about the points depending solely on the partition. One bit of information is equivalent to acquiring the answer to a binary question, i.e., to a question asking for a choice between two possibilities. Unless the partition has two elements, the question "in which $$A_i$$ are you?" is not binary, but it can be replaced by a series of binary questions, and one is free to use any arrangement (tree) of such questions. In such an arrangement, the number of questions $$N(\omega)$$ (i.e., the amount of information in bits) needed to determine the location of the point $$\omega$$ within the partition may vary from point to point. The smaller the expected value of $$N(\omega)\ ,$$ the better the arrangement. The best arrangement satisfies $$\mathsf{E}(I_{\mathcal A})\le \mathsf{E}(N)$$ (where $$\mathsf{E}$$ denotes the expected value), and $$N(\omega)\le I_{\mathcal A}(\omega)+1$$ for almost every $$\omega\ .$$ The difference between $$I_{\mathcal A}(\omega)$$ and $$N(\omega)$$ results from the crudeness of measuring information by counting binary questions: the count is always a positive integer, while the function $$I_{\mathcal A}$$ can be interpreted as the precise value. Entropy is the expected amount of information needed to locate a point in the partition.
• Another interpretation of Shannon entropy deals with the notion of uncertainty. Let $$X$$ be a random variable defined on the probability space $$\Omega$$ and assuming values in a finite set $$\{x_1,x_2,\dots,x_n\}\ .$$ The variable $$X$$ generates a partition $$\mathcal A$$ of $$\Omega$$ into the sets $$A_i = \{\omega\in\Omega: X(\omega)=x_i\}$$ (called the preimage partition). The probabilities $$p_i = \mu(A_i) = {Prob}(X=x_i)$$ form a partition of unity called the distribution of $$X$$. Suppose an experimenter knows the distribution of $$X$$ and tries to guess the outcome of $$X$$ before performing the experiment, i.e., before picking some $$\omega\in\Omega$$ and reading the value $$X(\omega)\ .$$ His uncertainty about the outcome is the expected value of the information he is missing to be certain. As explained above that is exactly the entropy $$H_\mu(\mathcal A)\ .$$

Notice that the entropy does not depend on the metric structure of the set of values of $$X\ ,$$ so entropy cannot be compared to variance. Variance measures a different kind of uncertainty of the outcome of a real random variable $$X\ ,$$ which takes into account the distances between the outcome values.
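The question-tree interpretation can be sketched in code. The Huffman merging procedure below builds an optimal arrangement of binary questions and returns the expected number of questions $$\mathsf{E}(N)\ ,$$ which indeed satisfies $$H \le \mathsf{E}(N) < H+1\ ;$$ the probability vector is an illustrative example (dyadic, so the tree is exactly optimal).

```python
import heapq
from math import log2

def expected_questions(p):
    """Expected number of binary questions E(N) in a Huffman-optimal tree."""
    # Each merge of two subtrees raises every leaf below them by one level,
    # so the accumulated cost grows by the merged weight at each step.
    heap = [(w, 0.0) for w in p]     # (subtree weight, accumulated cost)
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, c1 = heapq.heappop(heap)
        w2, c2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, c1 + c2 + w1 + w2))
    return heap[0][1]

p = [0.5, 0.25, 0.125, 0.125]        # illustrative probability vector
H = -sum(x * log2(x) for x in p)     # Shannon entropy (5)
EN = expected_questions(p)

assert H <= EN < H + 1               # the bound described above
print(H, EN)   # dyadic probabilities: the tree is exact and E(N) = H
```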

### Properties of the information function and of the Shannon entropy

• (a) The information $$I(A)$$ (see (6)) associated with a set $$A$$ is a nonnegative and decreasing function of $$\mu(A)\ ;$$ the smaller the set, the more information is encoded in the fact that $$\omega\in A\ .$$ $$I(A)=0$$ if and only if $$\mu(A)=1$$ (there is no information encoded in an event which is certain).
• (b) The entropy of a partition does not depend on the order in which the elements of the partition are numbered.
• (c) The entropy of a partition is nonnegative and equal to zero if and only if one of the elements $$A_i$$ of the partition has measure 1 (and all other elements have measure zero).
• (d) The entropy of a partition into $$n$$ sets is highest for the measure which assigns equal values $$\frac 1n$$ to these sets. The entropy then equals $$\log_2n\ .$$
• (e) The entropy of a partition into $$n$$ sets is a continuous function of the measures of these sets.
• (f) If the elements of a partition $$\mathcal B$$ are obtained by uniting elements of the partition $$\mathcal A$$ (i.e., if $$\mathcal A$$ is a refinement of $$\mathcal B$$) then $$H_\mu(\mathcal B)\le H_\mu(\mathcal A)\ .$$
• (g) The entropy of the least common refinement $$\mathcal A\vee\mathcal B$$ of two partitions $$\mathcal A$$ and $$\mathcal B$$ is not larger than the sum of the entropies of $$\mathcal A$$ and $$\mathcal B$$ (this property is called subadditivity).
• (h) The equality $$H_\mu(\mathcal A\vee\mathcal B) = H_\mu(\mathcal A)+H_\mu(\mathcal B)$$ holds if and only if $$\mathcal A$$ and $$\mathcal B$$ are stochastically independent.

Shannon proved that the above properties (b), (c), (d), (e) and (h) determine the defining formula (5).

### Conditional entropy

Given two finite partitions $$\mathcal A$$ and $$\mathcal B$$ of the same probability space $$(\Omega,\mu)\ ,$$ the conditional entropy of $$\mathcal A$$ given $$\mathcal B$$ is defined as

$\tag{9} H_\mu(\mathcal A|\mathcal B) = H_\mu(\mathcal A\vee\mathcal B) - H_\mu(\mathcal B)$

and is interpreted as the amount of information added by introducing the partition $$\mathcal A$$ when the partition $$\mathcal B$$ (and its information) is already known. The following identity holds:

$H_\mu(\mathcal A|\mathcal B) = \sum_{B\in\mathcal B}\mu(B) H_{\mu_B}(\mathcal A),$

where $$\mu_B$$ is the conditional probability measure on $$B$$ obtained by restricting and normalizing $$\mu$$ (to normalize a finite measure means to multiply it by a constant so that it becomes a probability measure).
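The agreement between definition (9) and the weighted-sum formula can be checked on a toy example; the four-point space and its measure below are invented for the check.

```python
from math import log2

# Toy probability space of 4 points with a hypothetical measure mu.
mu = {0: 0.4, 1: 0.1, 2: 0.2, 3: 0.3}
A = [{0, 1}, {2, 3}]          # partition A
B = [{0, 2}, {1, 3}]          # partition B

def H(partition):
    """Shannon entropy (5) of a partition under mu."""
    probs = [sum(mu[w] for w in cell) for cell in partition]
    return -sum(p * log2(p) for p in probs if p > 0)

# Definition (9): H(A|B) = H(A ∨ B) - H(B).
join = [a & b for a in A for b in B if a & b]   # least common refinement
lhs = H(join) - H(B)

# Weighted-sum formula: sum over B of mu(B) * H_{mu_B}(A),
# where mu_B is mu restricted to B and normalized.
rhs = 0.0
for b in B:
    mb = sum(mu[w] for w in b)
    cond = [sum(mu[w] for w in a & b) / mb for a in A]
    rhs += mb * (-sum(p * log2(p) for p in cond if p > 0))

assert abs(lhs - rhs) < 1e-9
print(lhs)
```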

Sometimes the known information comes not from a finite partition but from a family of partitions, or, more generally, from a family of measurable sets that form a sigmafield $$\Sigma$$ smaller than the sigmafield of all measurable sets. In such a case one considers the conditional entropy of a partition $$\mathcal A$$ given a sigmafield $$\Sigma\ ,$$ defined as

$\tag{10} H_\mu(\mathcal A|\Sigma) = \inf H_\mu(\mathcal A|\mathcal B),$

where the infimum is taken over all finite partitions $$\mathcal B$$ measurable with respect to $$\Sigma$$ (infimum of a set of numbers is either the smallest number in this set or the largest number smaller than all elements in the set; for example the infimum of all strictly positive numbers is 0).

### Properties of the conditional entropy

• (a) $$H_\mu(\mathcal A|\mathcal B)\le H_\mu(\mathcal A)$$ and $$H_\mu(\mathcal A|\mathcal A)=0\ .$$
• (b) If $$\mathcal B$$ is a refinement of $$\mathcal C$$ then $$H_\mu(\mathcal A|\mathcal B)\le H_\mu(\mathcal A|\mathcal C)\ .$$
• (c) $$H_\mu(\mathcal A|\mathcal B\vee \mathcal C) = H_\mu(\mathcal A\vee\mathcal B| \mathcal C)-H_\mu(\mathcal B| \mathcal C)\ .$$

The property (h) of Shannon entropy can be reformulated as follows:

• (h') The partitions $$\mathcal A$$ and $$\mathcal B$$ are stochastically independent if and only if $$H_\mu(\mathcal A|\mathcal B) = H_\mu(\mathcal A)$$ (by symmetry also $$H_\mu(\mathcal B|\mathcal A) = H_\mu(\mathcal B)$$).

### Kolmogorov-Sinai entropy

See the main article on Kolmogorov-Sinai entropy.

This is the key entropy notion in ergodic theory. Let $$T$$ be a measurable transformation of the probability space $$(\Omega,\mu)\ ,$$ which preserves the measure $$\mu\ ,$$ i.e., such that $$\mu(T^{-1}(A)) = \mu(A)$$ for every measurable set $$A\ .$$ (In dynamical systems it is more natural to consider preimages of sets $$T^{-1}(A) = \{x: T(x)\in A\}$$ rather than their forward images. For instance, preimages of disjoint sets are disjoint, which is not true for images. The image of a measurable set is usually not measurable unless the transformation is invertible, which is not assumed in the setup of the Kolmogorov-Sinai entropy.) Let $$\mathcal A$$ be a finite measurable partition of $$\Omega$$ and let $$\mathcal A^n$$ denote the least common refinement $$\mathcal A \vee T^{-1}(\mathcal A) \vee \cdots \vee T^{-n+1}(\mathcal A)\ .$$ By a subadditivity argument, the sequence of Shannon entropies $$\frac1n H_\mu(\mathcal A^n)$$ converges to its infimum. The entropy of $$T$$ with respect to the partition $$\mathcal A$$ is defined as the limit

$\tag{11} h_\mu(T,\mathcal A) = \lim_{n\to\infty} \frac1n H_\mu(\mathcal A^n) = \inf_{n\in\Bbb N} \frac1n H_\mu(\mathcal A^n).$

The Kolmogorov-Sinai entropy of the measure-preserving system $$(\Omega,\mu, T)$$ is the supremum

$\tag{12} h_\mu(T) = \sup_{\mathcal A} h_\mu(T,\mathcal A),$

where $$\mathcal A$$ ranges over all finite measurable partitions of $$\Omega\ .$$

A system $$(\Omega,\mu,T)$$ with a fixed measurable partition $$\mathcal A$$ is called a process. The most important part of the Kolmogorov-Sinai entropy theory deals with processes.
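Definition (11) can be made concrete for the simplest process, a Bernoulli shift. In the sketch below, $$\Omega$$ is the space of 0-1 sequences with independent coordinates, $$T$$ is the shift, and $$\mathcal A$$ is the partition by the first symbol, so $$\mathcal A^n$$ partitions $$\Omega$$ into cylinders of length $$n\ ;$$ the parameter $$p$$ is an illustrative choice. For an independent process $$\frac1n H_\mu(\mathcal A^n)$$ is constant, so the limit equals the single-step entropy.

```python
from itertools import product
from math import log2

p = 0.3   # illustrative probability of the symbol 1

def block_entropy(n):
    """Shannon entropy of the partition A^n into cylinders of length n."""
    H = 0.0
    for block in product([0, 1], repeat=n):
        prob = 1.0
        for s in block:
            prob *= p if s == 1 else 1 - p   # product (Bernoulli) measure
        H -= prob * log2(prob)
    return H

# Single-step entropy H(p) of the generating partition.
h1 = -(p * log2(p) + (1 - p) * log2(1 - p))

# By independence, (1/n) H(A^n) = H(p) for every n, so the limit in (11)
# -- the Kolmogorov-Sinai entropy of the process -- equals H(p).
for n in (1, 2, 5):
    assert abs(block_entropy(n) / n - h1) < 1e-9
print(h1)
```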

### Interpretation of the Kolmogorov-Sinai entropy of a process

By the definition (11), the entropy $$h_{\mu}(T,\mathcal A)$$ of a process generated by a partition $$\mathcal A$$ can be interpreted as the average gain of information per unit of time delivered by the partition $$\mathcal A\ .$$

The same entropy can be computed using the notion of conditional entropy (see (10)):

$h_\mu(T,\mathcal A) = \inf_{n\in\mathbb N}H_\mu(\mathcal A\,|\,T^{-1}(\mathcal A) \vee T^{-2}(\mathcal A) \vee \cdots \vee T^{-n+1}(\mathcal A)) = H_\mu(\mathcal A|\mathcal A^-),$

where $$\mathcal A^-$$ is the sigmafield generated by all partitions $$T^{-n}(\mathcal A)$$ ($$n\ge 1$$) and is sometimes called the past (or future depending on the interpretation) of the process. This formula provides another interpretation of the Kolmogorov-Sinai entropy: it is the new information obtained in one step of a process given all the information from the past.

### The main entropy theorems in ergodic theory

In the context of a measure-preserving transformation $$T\ ,$$ a partition $$\mathcal A$$ is called a one-sided generator if the sigmafield generated jointly by the partitions $$T^{-n}(\mathcal A): \ n\ge 0$$ equals the sigmafield of all measurable sets. If $$T$$ is invertible one also defines a two-sided generator as a partition $$\mathcal A$$ such that the sigmafield generated jointly by the partitions $$T^{n}(\mathcal A): \ n\in\Bbb Z$$ is the sigmafield of all measurable sets.

• The Kolmogorov-Sinai Theorem: If $$\mathcal A$$ is a generator (one-sided or two-sided), then $$h_\mu(T) = h_\mu(T,\mathcal A)\ .$$
• The Krieger Generator Theorem: If $$T$$ is invertible then $$h_\mu(T)<\infty$$ if and only if the system has a finite two-sided generator. Moreover, $$h_\mu(T)=0$$ if and only if both $$T$$ is invertible and the system has a one-sided generator.

The main theorem concerning entropy of processes is

• The Shannon-McMillan-Breiman Theorem: Given an ergodic process $$(\Omega,\mu,T,\mathcal A)\ ,$$ the convergence $$\lim_{n\to\infty}\frac 1n I_{\mathcal A^n}(x) = h_\mu(T,\mathcal A)$$ holds at $$\mu$$-almost every point $$x\ .$$ (Recall that $$I_{\mathcal A^n}(x)$$ denotes the information function associated with the partition $$\mathcal A^n\ ,$$ see (7).)

The interpretation of this last theorem is that for large $$n$$ the partition $$\mathcal A^n$$ cuts the major part of the space into sets of roughly equal measures. One has to be careful with the meaning of "roughly": the ratios between measures of the typical sets in $$\mathcal A^n$$ may be far from unity; only their logarithms must have absolute values much smaller than $$n\ .$$

Entropy plays the key role in the theory of Bernoulli processes developed by Donald Ornstein, where it turns out to be a complete invariant of the isomorphism:

• The Ornstein Theorem: Two Bernoulli processes are isomorphic if and only if their entropies are equal.

### Topological entropy

This is the main entropy notion in topological dynamics. In a topological dynamical system topological entropy measures the exponential speed of the growth of the number of distinguishable $$n$$-orbits as $$n$$ grows to infinity.

See the main article on topological entropy.

## Compression entropy

Consider a message in the form of a very long sequence (word) $$B$$ composed from a finite collection $$\Lambda$$ of letters. For example, every text, computer file, or even TV program has such a form. Formally, $$B\in\Lambda^N\ ,$$ where $$N$$ is the length of the message. In view of what was said about the information associated with a set, all messages of the same length $$N$$ should contain the same amount of information $$N\log_2(\#\Lambda)\ ,$$ because the uniform measure of every such word is $$(\#\Lambda)^{-N}$$ (here $$\#A$$ denotes the cardinality of the set $$A\ ,$$ i.e., the number of its elements). The simple example below shows, however, that the amount of information, if properly understood, varies from word to word. For example, let

$B_1 = 001001001001001001001001001 {\rm \ and \ }B_2=001101111010110100000101111.$

Using obvious notation, $$B_1$$ can be written as $$(001)^9\ ,$$ while there is no obvious abbreviation for $$B_2\ .$$ One is inclined to believe that $$B_2$$ carries more information than $$B_1\ ,$$ because its description cannot be essentially compacted.

This intuition was given a formal shape by scientists such as Shannon and Kolmogorov in the 1960s and was later developed by Leonid Levin (1974).

A lossless data compression code is formally any function transforming, in a one-to-one manner, all finite messages over $$\Lambda$$ into finite sequences (of various lengths) over another finite alphabet, say, $$\Sigma\ .$$ Usually one requires that the code also contain all instructions needed to perform the inverse map (i.e., to decode); only commonly used conventions need not be included. The precise requirements imposed on the code lead to slightly different notions (see for example Kolmogorov complexity).

The information content of a message $$B$$ equals $$m\log_2(\#\Sigma)\ ,$$ where $$m$$ is the length of the shortest possible image of $$B$$ by a lossless data compression code. The ratio

$R(B)=\frac{m\log_2(\#\Sigma)}{N\log_2(\#\Lambda)}$

is called the (best possible) compression rate of $$B$$ or its compression entropy (it represents the ratio between the lengths of the compressed and original messages when both are transformed to the binary alphabet by a simple binary encoding of the letters).
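The contrast between the regular word $$B_1$$ and the irregular word $$B_2$$ can be demonstrated with a real compressor. The sketch below uses zlib as a rough stand-in for the best possible lossless code (a practical compressor only upper-bounds the true information content); longer analogues of the two words are generated for the comparison.

```python
import random
import zlib

# A highly regular message, analogous to B_1 = (001)^9 but longer.
B1 = b"001" * 2000

# An irregular message of the same length: random bits, as in B_2.
random.seed(0)
B2 = bytes(random.choice(b"01") for _ in range(6000))

# zlib at maximum compression level approximates (from above) the
# length of the shortest lossless encoding.
c1 = len(zlib.compress(B1, 9))
c2 = len(zlib.compress(B2, 9))

# The regular message compresses far better, i.e., it carries less
# information and has lower compression entropy.
assert c1 < c2
print(c1, c2)
```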

## Entropy as disorder

In nearly all its meanings, entropy can be viewed as a measure of disorder and chaos, as long as by order one understands segregating things by their kind (e.g. by similar properties or parameter values). Chaos is the state of a system (physical or dynamical) in which elements of all kinds are mixed evenly throughout the space, so that the space is homogeneous. For example, a container with gas is in its state of maximal entropy when the temperature and pressure are constant throughout the volume. That means there is approximately the same number of particles in every unit of the volume, and the proportion between slow and fast particles is everywhere the same. States of lower entropy occur when particles are organized, for example: slower ones in one area, faster ones in another. A message $$B$$ has high entropy if all short words appear with equal frequencies in all sufficiently long subwords of $$B\ .$$ Any trace of organization and logic in the structure of the message allows for its compression and hence lowers its compression entropy. These observations lead to the common sense meaning of entropy.

To have order in the house means to have food separated from utensils and plates, clothing arranged in the closet by type, trash deposited in the trash container, etc. When these things get mixed together, entropy increases causing disorder and chaos. In a social system, order is associated with classification of the individuals by their skills and assigning to them appropriate positions in the system. Law and other mechanisms are enforced to keep such order. When this classification and assignment fails, the system falls into chaos.