Entropy/connections
Connection between Shannon's and Boltzmann's entropy
Both notions refer to probability and there is evident similarity in the formulas. But the analogy fails to be obvious. In the literature many different attempts toward understanding the relation can be found. In simple words, the interpretation relies on the distinction between the macroscopic state considered in classical thermodynamics and the microscopic states of statistical mechanics. A thermodynamical state \(A\) (a given distribution of pressure, temperature, etc.) can be realized in many different ways \(\omega\) at the microscopic level, where one distinguishes all individual particles, their positions and velocity vectors (see e.g. example above). By definition, the difference of Boltzmann's entropies \(S(A)-S_{max}\) is proportional to \(\log_2({Prob}(A))\ ,\) the logarithm of the probability of the macroscopic state \(A\) in the probability space \(\Omega\) of all microscopic states \(\omega\ .\) This leads to the equation
\[\tag{1} kI(A) = S_{max} - S(A) \]
where \(I(A)\) is the probabilistic information associated with the set \(A\subset\Omega\ .\)
This interpretation causes confusion, because \(S(A)\) appears in this equation with negative sign, which reverses
the direction of monotonicity: The more information is associated with a macrostate \(A\) the smaller its
Boltzmann's entropy.
This is usually explained by interpreting what it means to associate information with a state. Namely, the information about the state of the system is an information available to an outside observer. Thus it is reasonable to assume that this information actually escapes from the system, and hence it should receive the negative sign. Indeed, it is the knowledge about the system possessed by an outside observer that increases the usability of the energy contained in that system to do physical work, i.e., it decreases the system's entropy.
The interpretation goes further: Each microstate \(\omega\) in a system appearing to the observer as being in macrostate \(A\) still hides the information about its identity. Let \(I_h(A)\) denote the joint information still hiding in the system if its state is identified as \(A\ .\) This entropy is clearly maximal at the maximal state, and then it equals \(S_{max}/k\ .\) In a state \(A\) it is diminished by \(I(A)\ ,\) the information already stolen by the observer. So, one has
\[ I_h(A) = \frac{S_{max}}k - I(A). \]
This, together with (1), yields
\[ S(A) = kI_h(A), \]
which provides a new interpretation to the Boltzmann's entropy: it is proportional to the information still hiding in the system provided the macrostate \(A\) has been detected.
It is extremely hard to calculate the absolute value (i.e., the proper additive constant) of Boltzmann's entropy, because the entropy of the maximal state depends on the level of precision of identyfying the microstates. Without quantum approach, the space \(\Omega\) is infinite and so is the maximal entropy. However, if the space of states is assumed finite, the absolute entropy obtains a new interpretation, already in terms of Shannon's entropy (not just of the information fuction). In such case, the highest possible Shannon's entropy \(H_\mu(\mathcal A)\) is achieved when \(\mathcal A = \xi\) is the partition of the space \(\Omega\) into single points \(\omega\) and \(\mu\) is the uniform measure on \(\Omega\) (i.e., such that all points have equal probabilities). It is thus natural to set
\[ S_{max} = kH_\mu(\xi). \]
The detection that the system is in state \(A\) changes the probability distribution from \(\mu\) to the uniform conditional distribution \(\mu_A\) supported only on \(A\ .\) For such distributions the difference \(H_\mu(\xi) - H_{\mu_A}(\xi)\) equals precisely \(-\log (\mu(A))\ ,\) i.e. \(I(A)\ .\) This, together with (1) implies that
\[ S(A) = kH_{\mu_A}(\xi), \]
i.e., that Boltzmann's entropy is proportional to the Shannon's entropy of the partition \(\xi\) (into points) with respect to the uniform measure on \(A\ .\)
The whole above interpretation is subject of many discussions, as it makes entropy of a system depend on the seemingly non-physical notion of knowledge of a mysterious observer. The classical Maxwell's paradox is based on the assumption that it is possible to acquire information about the parameters of individual particles without any expense of heat or work. To avoid such paradoxes, one must agree that every bit of acquired information has its physical entropy equivalent (equal to the Boltzmann's constant \(k\)), by which the entropy of the memory of the observer increases. In consequence, erasing one bit of information from a memory (say, of a computer) at temperature \(T\ ,\) results in the emission of heat in amount \(kT\) to the environment. Such calculations set limits on the theoretical maximal speed of computers, because the heat can be driven away with a limited speed only.
Kolmogorov-Sinai entropy and the compression rate
An approximate value of the best compression rate of a message \(B\) of very large length \(N\) can be computed using the Kolmogorov-Sinai entropy. First observe that for \(n\) smaller than \(N\) the string \(B\) defines a probability measure \(\mu_B\) on \(\Lambda^n\) by frequencies, as follows:
\[ \mu_B(A) = {fr}_B(A) = \frac 1{N-n+1}\#\{i:B[i,i+n-1]=A\}. \]
If \(N\) is very large \(\mu_B\) approximates in fact a shift-invariant measure defined on measurable subsets of the space \(\Lambda^{\Bbb N}\) of all infinite strings over the alphabet \(\Lambda\ .\) The following limit theorem holds:
- Compression Theorem: Let \(\mu\) be any \(T\)-invariant measure on \(\Lambda^{\Bbb N}\ ,\) where \(T\) denotes the shift transformation. Then for \(\mu\)-almost every infinite string \(B\) holds \( \lim_{N\to\infty} R(B_N) = \frac{h_\mu(T)}{\log_2(\#\Lambda)}\ ,\) where \(B_N\) is the message of length \(N\) obtained as the initial word of length \(N\) in \(B\ .\)
In practice, since every message \(B\) has finite length \(N\ ,\) the following approximate formula is used:
\[\tag{2} R(B) \approx \frac{\frac 1nH_{\mu_B}(\Lambda^n)}{\log_2(\#\Lambda)}, \]
where \(n\) is a large integer yet much smaller than \(N\) (\(n\) should satisfy \(3n(\#\Lambda)^n<N\)), \(\Lambda^n\) is the finite space of all words of length \(n\)
partitioned into individual words \(A\in\Lambda^n\ .\)
See Example of computing the compression rate.
Remarks
- It is important to realize, that the average compression rate of any losless data compression code achieved on all messages \(B\) of a given length \(N\) is always nearly equal to 1, i.e., in the average there is no compression. Vast majority of the messages generate nearly the uniform frequency measure on the subwords and hence their theoretical compression rate equals 1. Only relatively few very exceptional messages generate measures of lower entropy and allow for essential compression. Luckily, most computer files, due to their organized form, are exceptional in this sense and allow for their compression.
- The Compression Theorem holds almost everywhere, which means that it may fail for exceptional infinite strings. See Example.