connections between different meanings of entropy
From Scholarpedia
Tomasz Downarowicz (2007), Scholarpedia, 2(11):3901. Revision #25685.
Connection between Boltzmann entropy and information
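Recall the Boltzmann entropy formula
$$k \log_2 \mathsf{Prob}(A) = S(A) - S_{\max}, \qquad (1)$$
where $S(A)$ is the entropy of the macrostate $A$ and $S_{\max}$ is the entropy of the equilibrium state. The left hand side of the formula (1) equals $-k\,I(A)$, where $I(A) = -\log_2 \mathsf{Prob}(A)$, i.e., up to the factor $k$ it is minus the information associated with the state $A$ viewed as a subset of the space $\Omega$ of all microstates.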
This interpretation causes confusion, because the negative sign reverses the direction of the relationship between entropy and information:
The more information is associated with a macrostate $A$, the smaller its Boltzmann entropy.
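The inverse relationship can be checked numerically on a toy system. The sketch below is illustrative only and assumes things the article does not state: a system of `n_coins` two-state particles with the uniform measure on its $2^{n}$ microstates, macrostates of the form "exactly `n_heads` particles are in state 1", and the information function $I(A) = -\log_2 \mathsf{Prob}(A)$. The function names are hypothetical.

```python
from math import comb, log2


def macrostate_information(n_coins: int, n_heads: int) -> float:
    """Information I(A) = -log2 Prob(A), in bits, of the macrostate
    'exactly n_heads of n_coins particles are in state 1', assuming the
    uniform measure on all 2**n_coins microstates (toy assumption)."""
    # Prob(A) = C(n, h) / 2**n; work with logarithms to avoid underflow.
    log2_prob = log2(comb(n_coins, n_heads)) - n_coins
    return -log2_prob


def entropy_deficit_in_units_of_k(n_coins: int, n_heads: int) -> float:
    """(S(A) - S_max) / k, which equals log2 Prob(A) = -I(A)."""
    return -macrostate_information(n_coins, n_heads)


if __name__ == "__main__":
    for heads in (50, 25, 0):
        info = macrostate_information(100, heads)
        deficit = entropy_deficit_in_units_of_k(100, heads)
        print(f"heads={heads:3d}  I(A)={info:8.3f} bits  (S(A)-S_max)/k={deficit:9.3f}")
```

The balanced macrostate (50 of 100) carries the least information and has the entropy closest to the maximum, while the fully ordered macrostate (0 of 100) carries 100 bits and has the lowest entropy.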
This is usually explained by interpreting what it means to associate information with a state. Namely, the
information about the state of the system is information available to an outside observer. Thus it is reasonable
to assume that this information actually escapes from the system, and hence it should receive the negative sign.
Indeed, it is the knowledge about the system possessed by an outside observer that increases the usability of
the energy contained in that system to do physical work, i.e., it decreases the system's entropy. For an observer who knows nothing about the system (except its global energy level), the system is extremely likely to be in its maximal entropy state.
The above approach has been the subject of many discussions, as it makes the entropy of a system depend on the seemingly non-physical notion of the knowledge of a mysterious observer. The classical Maxwell paradox is based on the assumption that it is possible to acquire information about the parameters of individual particles without any expense in the form of heat or work. To avoid such paradoxes, one must accept that acquiring each bit of information physically changes the system. Moreover, it increases the entropy of the memory of the observer by a certain amount. Landauer established that amount to be equal to the Boltzmann constant $k$ per bit (with entropy measured using base-2 logarithms). As a consequence, erasing one bit of information from a memory (say, of a computer) at temperature $T$ results in the emission of heat in amount $kT$ to the environment; in terms of the standard Boltzmann constant $k_B$ of natural-logarithm units, this is $k_B T \ln 2$.
This fact sets limits on the theoretical maximal speed of computers, because heat can be removed only at a limited rate.
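To get a feeling for the magnitude of this bound, the following minimal sketch evaluates the heat released by erasing bits at a given temperature. It assumes the standard form of Landauer's bound, $k_B T \ln 2$ per bit, with $k_B$ the SI value of the Boltzmann constant; the function name is hypothetical.

```python
from math import log

# Boltzmann constant in joules per kelvin (exact value in the 2019 SI).
K_B = 1.380649e-23


def landauer_heat(temperature_kelvin: float, bits: float = 1.0) -> float:
    """Minimal heat (in joules) released when `bits` bits are erased at the
    given temperature, assuming Landauer's bound k_B * T * ln 2 per bit."""
    return bits * K_B * temperature_kelvin * log(2)


if __name__ == "__main__":
    # One bit at room temperature, and one gigabyte (8e9 bits).
    print(landauer_heat(300.0))
    print(landauer_heat(300.0, bits=8e9))
```

At 300 K the bound is roughly $2.87 \times 10^{-21}$ joules per bit, many orders of magnitude below what present-day hardware dissipates, so the limit is a theoretical one.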
Connection between Gibbs and Shannon entropy
Now suppose that the probability distribution $\mu$ on the space $\Omega$ of all microstates $\omega$ is not uniform. Then the formula (1) does not hold. In this case one uses the Gibbs entropy formula.
The Gibbs entropy of the equilibrium state is proportional to the Shannon entropy
$$H_\mu(\xi) = -\sum_{\omega \in \Omega} \mu\{\omega\} \log_2 \mu\{\omega\},$$
where $\mu$ is the probability distribution on $\Omega$, and $\xi$ denotes the partition of $\Omega$ into points $\{\omega\}$.
The detection that the system is in state $A$ changes the probability distribution from $\mu$ to the conditional measure $\mu_A$ obtained by restricting $\mu$ to $A$ and normalizing, $\mu_A(E) = \mu(A \cap E)/\mu(A)$, and the Gibbs entropy of $A$ is proportional to the Shannon entropy of the partition $\xi_A$ of $A$ (into points) with respect to the conditional measure $\mu_A$:
$$S(A) \propto H_{\mu_A}(\xi_A) = -\sum_{\omega \in A} \mu_A\{\omega\} \log_2 \mu_A\{\omega\}.$$
This formula and the discussion in the preceding section allow one to interpret the entropy of a macrostate $A$ as the amount of information hiding in the system and still unavailable to the observer who detected that the system is actually in the macrostate $A$.
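The passage from a distribution to its conditional measure, and the resulting drop in Shannon entropy, can be sketched as follows. This is an illustrative example under stated assumptions: a finite space of four microstates with made-up probabilities, base-2 logarithms, and hypothetical function names.

```python
from math import log2


def shannon_entropy(dist: dict) -> float:
    """Shannon entropy (in bits) of the partition into points, for a
    probability distribution given as {microstate: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def condition_on(dist: dict, macrostate: set) -> dict:
    """Conditional measure: restrict the distribution to the macrostate
    and normalize so that the probabilities sum to 1 again."""
    total = sum(dist[w] for w in macrostate)
    return {w: dist[w] / total for w in macrostate}


# Hypothetical non-uniform distribution on four microstates.
mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
A = {"c", "d"}                      # the detected macrostate
mu_A = condition_on(mu, A)

if __name__ == "__main__":
    print(shannon_entropy(mu))      # entropy before the detection
    print(shannon_entropy(mu_A))    # information still hidden inside A
```

Here the entropy falls from 1.75 bits to 1 bit: after the observer learns that the system is in $A$, one bit of information about the microstate remains unavailable.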
Kolmogorov-Sinai entropy and the compression rate
Recall that $R(B)$ denotes the best compression rate of a message $B$.
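An approximate value of the best compression rate of a message $B$ of very large length $N$ can be computed using the Kolmogorov-Sinai entropy. First observe that for $n$ smaller than $N$ the string $B$ defines a probability measure $\mu_B$ on $\Lambda^n$, the set of all words of length $n$ over the alphabet $\Lambda$, by frequencies, as follows:
$$\mu_B(w) = \frac{\#\{\text{occurrences of } w \text{ in } B\}}{N-n+1}, \qquad w \in \Lambda^n.$$
If $N$ is very large, $\mu_B$ in fact approximates a shift-invariant measure defined on measurable subsets of the space $\Lambda^{\mathbb N}$ of all infinite strings over the alphabet $\Lambda$.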
The fact that the measure is shift-invariant allows one to apply the tools of ergodic theory, in particular the Kolmogorov-Sinai entropy. The following limit theorem holds:
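- Compression Theorem: Let $\mu$ be any ergodic $\sigma$-invariant measure on $\Lambda^{\mathbb N}$, where $\sigma$ denotes the shift transformation. Then for $\mu$-almost every infinite string $x$ it holds that $\lim_{n\to\infty} R(B_n) = \frac{h_\mu(\sigma)}{\log_2 \#\Lambda}$, where $h_\mu(\sigma)$ is the Kolmogorov-Sinai entropy and $B_n$ is the message of length $n$ obtained as the initial word of length $n$ in $x$.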
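In practice, since every message $B$ has finite length $N$, the following approximate formula is used:
$$R(B) \approx \frac{1}{n \log_2 \#\Lambda}\, H_{\mu_B}(\Lambda^n), \qquad (2)$$
where $n$ is a large integer yet much smaller than $N$ ($n$ should be small enough that the $(\#\Lambda)^n$ words of length $n$ are well sampled by the $N-n+1$ positions in $B$), $\mu_B$ is the frequency measure that $B$ defines on words of length $n$ over the alphabet $\Lambda$, and $\Lambda^n$ is the finite space of all words of length $n$ partitioned into individual words.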
See Example of computing the compression rate.
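The approximate formula can be tried out with a few lines of code. The sketch below is not the article's own procedure. It assumes that the frequency measure counts overlapping occurrences of each word of length $n$ and divides by $N-n+1$, and that the estimate is the Shannon entropy (base 2) of that measure divided by $n \log_2 \#\Lambda$. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def block_frequencies(message: str, n: int) -> dict:
    """Frequency measure on words of length n: the fraction of the
    N - n + 1 positions of `message` at which each word occurs."""
    positions = len(message) - n + 1
    counts = Counter(message[i:i + n] for i in range(positions))
    return {word: c / positions for word, c in counts.items()}


def estimated_compression_rate(message: str, n: int, alphabet_size: int) -> float:
    """Entropy of the length-n frequency measure, normalized by
    n * log2(alphabet_size); values near 1 mean 'incompressible'."""
    freqs = block_frequencies(message, n)
    block_entropy = -sum(p * log2(p) for p in freqs.values())
    return block_entropy / (n * log2(alphabet_size))


if __name__ == "__main__":
    periodic = "01" * 500            # a highly organized binary message
    for n in (1, 2, 4, 8):
        print(n, estimated_compression_rate(periodic, n, alphabet_size=2))
```

For the periodic message the estimate decreases as $n$ grows, reflecting that such a string can be compressed almost completely; choosing $n$ too large relative to the message length makes the frequencies unreliable.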
Remarks
- It is important to realize that the average compression rate of any lossless data compression code achieved on all messages $B$ of a given length $N$ is always nearly equal to 1, i.e., on average there can be no compression. The vast majority of messages generate a nearly uniform frequency measure on the subwords and hence their theoretical compression rate equals 1. Only relatively few, very exceptional messages generate measures of lower entropy and allow for substantial compression. Luckily, most computer files, due to their organized form, are exceptional in this sense and can be compressed.
- The Compression Theorem holds almost everywhere, which means that it may fail for exceptional infinite strings. See Example.
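The first remark can be observed with an off-the-shelf compressor. The sketch below uses Python's standard `zlib` module as a stand-in for a lossless code; it is a practical compressor and not the theoretically best one, so the numbers only illustrate the contrast between a typical (pseudo-random) message and an organized one. The sample data and the function name are made up for the illustration.

```python
import random
import zlib


def zlib_ratio(data: bytes) -> float:
    """Compressed length divided by original length, using zlib at its
    highest compression level."""
    return len(zlib.compress(data, 9)) / len(data)


# A 'typical' message: pseudo-random bytes from a fixed seed.
rng = random.Random(0)
typical = bytes(rng.getrandbits(8) for _ in range(30000))

# An 'exceptional', highly organized message of the same length.
organized = b"abc" * 10000

if __name__ == "__main__":
    print("typical  :", zlib_ratio(typical))
    print("organized:", zlib_ratio(organized))
```

The pseudo-random message does not shrink at all (the ratio stays at about 1), whereas the organized message compresses to a small fraction of its length.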
