# Entropy/connections between different meanings of entropy

## Connection between Boltzmann entropy and information

Recall the Boltzmann entropy formula

$k\log(Prob(A)) = S(A) - S_{max},$

where $$S_{max}$$ is the entropy of the equilibrium state. The left-hand side of the formula above equals $$-I(A)$$, i.e., minus the information associated with the state $$A$$ viewed as a subset of the space $$\Omega$$.

This interpretation causes confusion, because the negative sign reverses the direction of the relationship between entropy and information: the more information is associated with a macrostate $$A$$, the smaller its Boltzmann entropy. This is usually explained by examining what it means to associate information with a state. Namely, the information about the state of the system is information available to an outside observer. It is thus reasonable to assume that this information actually escapes from the system, and hence it should receive the negative sign. Indeed, it is the knowledge about the system possessed by an outside observer that increases the usability of the energy contained in that system to do physical work, i.e., it decreases the system's entropy. For an observer who knows nothing about the system (except its global energy level), the system is extremely likely to be in its maximal entropy state.
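The relation $$S(A) = S_{max} - I(A)$$ can be checked numerically on a toy system. The sketch below assumes a hypothetical system of ten two-state particles with all microstates equally likely, and sets $$k = 1$$ with logarithms in base 2, so that entropy and information are both measured in bits:

```python
import math

# Toy system: 10 two-state particles, all 2**10 microstates equally likely.
n_particles = 10
n_micro = 2 ** n_particles

# Macrostate A: "exactly 2 particles are excited" -- a subset of Omega.
size_A = math.comb(n_particles, 2)
prob_A = size_A / n_micro

# Information gained by learning that the system is in A (bits):
info_A = -math.log2(prob_A)

# Boltzmann formula: log2(Prob(A)) = S(A) - S_max, hence S(A) = S_max - I(A).
S_max = math.log2(n_micro)          # entropy of the equilibrium state
S_A = S_max + math.log2(prob_A)     # = S_max - info_A

print(info_A, S_A)
```

Note that $$S(A)$$ comes out as $$\log_2(\#A)$$, the Boltzmann entropy of the macrostate counted directly from its number of microstates, which confirms the formula on this example.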

The above approach has been the subject of many discussions, as it makes the entropy of a system depend on the seemingly non-physical notion of the knowledge of a mysterious observer. The classical paradox of Maxwell's demon is based on the assumption that it is possible to acquire information about the parameters of individual particles without any expense in the form of heat or work. To avoid such paradoxes, one must agree that acquiring every bit of information physically changes the system. Moreover, it increases the entropy of the memory of the observer by a certain amount. Landauer established that amount to be $$k\ln 2$$, where $$k$$ is the Boltzmann constant. As a consequence, erasing one bit of information from a memory (say, of a computer) at temperature $$T$$ results in the emission of heat in the amount $$kT\ln 2$$ to the environment. This fact sets limits on the theoretical maximal speed of computers, because heat can be removed only at a limited rate.
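The size of the Landauer bound is easy to work out numerically. A minimal sketch, taking room temperature $$T = 300\,\mathrm{K}$$ as an illustrative value:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K (exact SI value)
T = 300.0           # room temperature, K (illustrative choice)

# Landauer's bound: erasing one bit dissipates at least k_B * T * ln(2) of heat.
heat_per_bit = k_B * T * math.log(2)
print(heat_per_bit)  # ~2.87e-21 J per erased bit

# Heat released by erasing one gigabyte (8e9 bits) at room temperature:
heat_per_GB = heat_per_bit * 8e9
print(heat_per_GB)
```

The bound is minuscule compared with the heat dissipated by present-day hardware, but it is a hard physical lower limit that no computing technology can evade.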

## Connection between Gibbs and Shannon entropy

Now suppose that the probability distribution $$\mu$$ on the space $$\Omega$$ of all microstates is not uniform. Then the Boltzmann formula above does not hold. In this case one uses the Gibbs entropy formula. The Gibbs entropy of the equilibrium state $$\Omega$$ is proportional to the Shannon entropy

$S_{max} = kH_{\mu}(\xi),$

where $$\mu$$ is the probability distribution on $$\Omega$$, and $$\xi$$ denotes the partition of $$\Omega$$ into points $$\omega$$.

The detection that the system is in state $$A$$ changes the probability distribution from $$\mu$$ to the conditional measure $$\mu_A$$, obtained by restricting $$\mu$$ to $$A$$ and normalizing. The Gibbs entropy of $$A$$ is then proportional to the Shannon entropy of the partition $$\xi$$ (into points) with respect to the conditional measure $$\mu_A$$:

$S(A) = kH_{\mu_A}(\xi).$

This formula and the discussion in the preceding section allow one to interpret the entropy of a macrostate $$A$$ as the amount of information still hiding in the system and unavailable to the observer who has detected that the system is actually in the macrostate $$A$$.
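Both formulas can be illustrated on a small finite example. The sketch below assumes a hypothetical four-point space with a non-uniform distribution $$\mu$$, sets $$k = 1$$, and measures Shannon entropy in bits:

```python
import math

def shannon(probs):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Non-uniform distribution mu on a 4-point space Omega (illustrative values).
mu = {"w1": 0.5, "w2": 0.25, "w3": 0.125, "w4": 0.125}

# Gibbs entropy of the equilibrium state: S_max = k * H_mu(xi), with k = 1.
S_max = shannon(mu.values())

# Detecting macrostate A = {w1, w2}: restrict mu to A and renormalize.
A = {"w1", "w2"}
mass_A = sum(mu[w] for w in A)
mu_A = [mu[w] / mass_A for w in A]

# Gibbs entropy of the macrostate: S(A) = k * H_{mu_A}(xi).
S_A = shannon(mu_A)

print(S_max, S_A)
```

As expected, conditioning on the detected macrostate lowers the entropy: the observer's knowledge that the system is in $$A$$ leaves less information hidden in the system.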

## Kolmogorov-Sinai entropy and the compression rate

Recall that $$R(B)$$ denotes the best compression rate of a message $$B$$. An approximate value of the best compression rate of a message $$B$$ of very large length $$N$$ can be computed using the Kolmogorov-Sinai entropy. First observe that, for $$n$$ smaller than $$N$$, the string $$B$$ defines a probability measure $$\mu_B$$ on $$\Lambda^n$$ by frequencies, as follows:

$\mu_B(A) = \frac 1{N-n+1}\#\{i:B[i,i+n-1]=A\}.$

If $$N$$ is very large, $$\mu_B$$ in fact approximates a shift-invariant measure defined on measurable subsets of the space $$\Lambda^{\Bbb N}$$ of all infinite strings over the alphabet $$\Lambda$$. The fact that the measure is shift-invariant allows one to apply the tools of ergodic theory, in particular the Kolmogorov-Sinai entropy. The following limit theorem holds:
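The frequency measure $$\mu_B$$ is straightforward to compute by sliding a window of length $$n$$ along the message. A minimal sketch (the helper name `empirical_measure` is an illustrative choice):

```python
from collections import Counter

def empirical_measure(B, n):
    """Frequency measure mu_B on words of length n, by sliding-window counts."""
    total = len(B) - n + 1
    counts = Counter(B[i:i + n] for i in range(total))
    return {word: c / total for word, c in counts.items()}

# A highly regular message: only two subwords of length 2 ever occur.
B = "abababababab"
mu = empirical_measure(B, 2)
print(mu)
```

For this periodic message the measure concentrates on just two of the four possible words of length 2, which is exactly the kind of low-entropy structure a compressor can exploit.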

• Compression Theorem: Let $$\mu$$ be any $$T$$-invariant measure on $$\Lambda^{\Bbb N}$$, where $$T$$ denotes the shift transformation. Then for $$\mu$$-almost every infinite string $$B$$ it holds that $$\lim_{N\to\infty} R(B_N) = \frac{h_\mu(T)}{\log_2(\#\Lambda)}$$, where $$B_N$$ denotes the initial word of length $$N$$ of $$B$$.

In practice, since every message $$B$$ has finite length $$N$$, the following approximate formula is used:

$R(B) \approx \frac{\frac 1n H_{\mu_B}(\Lambda^n)}{\log_2(\#\Lambda)},$

where $$n$$ is a large integer yet much smaller than $$N$$ ($$n$$ should satisfy $$3n(\#\Lambda)^n<N$$), and $$\Lambda^n$$ is the finite space of all words of length $$n$$, partitioned into individual words $$A\in\Lambda^n$$.
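Combining the frequency measure with the block entropy gives a direct estimate of the compression rate. A sketch, assuming a two-letter alphabet and $$n = 3$$ (which satisfies $$3n(\#\Lambda)^n < N$$ for $$N = 1000$$); the function name `compression_rate_estimate` is an illustrative choice:

```python
import math
import random
from collections import Counter

def compression_rate_estimate(B, n, alphabet):
    """Estimate R(B) as (H_{mu_B}(Lambda^n) / n) / log2(#Lambda)."""
    total = len(B) - n + 1
    counts = Counter(B[i:i + n] for i in range(total))
    H = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return (H / n) / math.log2(len(alphabet))

alphabet = "ab"

# A periodic (highly compressible) message: estimated rate well below 1.
rate_periodic = compression_rate_estimate("ab" * 500, 3, alphabet)

# A random (incompressible) message: estimated rate close to 1.
random.seed(0)
rand_B = "".join(random.choice(alphabet) for _ in range(1000))
rate_random = compression_rate_estimate(rand_B, 3, alphabet)

print(rate_periodic, rate_random)
```

The periodic message admits only two distinct blocks of length 3, so its block entropy per symbol is far below $$\log_2(\#\Lambda)$$, while the random message uses all blocks with nearly equal frequencies.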

### Remarks

• It is important to realize that the average compression rate of any lossless data compression code, taken over all messages $$B$$ of a given length $$N$$, is always nearly equal to 1, i.e., on average there can be no compression. The vast majority of messages generate a nearly uniform frequency measure on the subwords, and hence their theoretical compression rate equals 1. Only relatively few, very exceptional messages generate measures of lower entropy and allow for essential compression. Luckily, most computer files, due to their organized form, are exceptional in this sense and can be compressed substantially.
• The Compression Theorem holds almost everywhere, which means that it may fail for exceptional infinite strings. See Example.
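The first remark can be observed empirically with any general-purpose lossless compressor. A sketch using Python's `zlib`; the exact ratios depend on the compressor, but the qualitative contrast does not:

```python
import os
import zlib

N = 100_000

# A typical (random) message: compresses to essentially its original size,
# since its subword frequencies are nearly uniform.
random_msg = os.urandom(N)
random_ratio = len(zlib.compress(random_msg, 9)) / len(random_msg)

# An exceptional (highly organized) message: compresses dramatically.
organized_msg = b"All work and no play makes Jack a dull boy. " * (N // 44)
organized_ratio = len(zlib.compress(organized_msg, 9)) / len(organized_msg)

print(random_ratio, organized_ratio)
```

The random message typically comes out slightly *larger* than the original (the compressor's own overhead), while the repetitive message shrinks to a small fraction of its length.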