entropy example 4
From Scholarpedia
| Tomasz Downarowicz (2007), Scholarpedia, 2(11):3901. | revision #25686 [link to/cite this article] | |||||||||||||||||||
To understand the formula
- (1)
consider a lossless compression algorithm applied to a message
of a finite length
.
To save on the output length, the decoding instructions must be relatively short compared to
. This is easily
achieved in codes which refer to relatively short components of the messages only. For example, the decoding instructions may consist of a complete list of the subwords
of some carefully chosen length
appearing in
and their images
. The images may have different lengths (as short as possible). The
image of
is obtained by cutting
into subwords
of length
and concatenating the images of these subwords:
.
(There are additional issues here: in order for such a code to be invertible, the images
must form a
prefix-free family, so that there is always a unique way of partitioning
back into the images
. But this does not essentially affect the computations.) For best compression results, it is
reasonable to assign the shortest images to the subwords appearing in
with highest frequencies. It can be proved,
with the help of the Shannon-McMillan-Breiman theorem, that such a code realizes the compression rate as in
(1) for nearly all sufficiently long messages
.
For example, consider the binary message of length 30:
On the right,
is shown divided into subwords of length 3. The frequencies of the subwords
in this division are:
The theoretical value (obtained using the formula (1) for
) of the best compression ratio is
A binary prefix-free code giving the shortest images to the most frequent subwords is
The image of
by this code is:
The compression rate achieved on
using this code equals
,
which means that the code is quite efficient on
.
Remark: The image given here does not include the decoding instructions, which comprise simply a record of the above prefix-free code. One has to imagine that
is much longer and then the decoding instructions (of fixed length) do not affect the compression ratio.
