Mutual information

Post-publication activity

Dr. Peter E. Latham, Gatsby Computational Neuroscience Unit, University College London, England
Dr. Yasser Roudi, Gatsby Computational Neuroscience Unit, University College London

Mutual information is one of many quantities that measures how much one random variables tells us about another. It is a dimensionless quantity with (generally) units of bits, and can be thought of as the reduction in uncertainty about one random variable given knowledge of another. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent.

<review> ***Reply: Note to reviewers: our replies are preceded by "***Reply". We are putting them under "review" so they don't show up when others look at this article. Please delete any that you are happy with.</review>

<review> Reviewer A, added after revision: I like this new version much much better and I am close to approving it. I delete most of my old comments. If I left some - that it because I am not always sure which ones are mine and which reviewer's B or C. I added new comments labelled "Reviewer A, added after revision". I made a few minor editions which can be seen in the "compare revisions" mode.</review>

<review> removed 2nd "in general" for grammatical reasons</review>

<review> ***Reply: We agree that the old intro could be massively improved; hopefully the new version is better.</review>

Definition

For two discrete variables \(X\) and \(Y\) whose joint probability distribution is \(P_{XY}(x,y)\), the mutual information between them, denoted \(I(X;Y)\), is given by (Shannon and Weaver, 1949; Cover and Thomas, 1991)

I(X;Y) = \sum_{x,y} P_{XY}(x,y) \log {P_{XY}(x,y) \over P_X(x) P_Y(y)} = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \, . </math>

Here \(P_X(x)\) and \(P_Y(y)\) are the marginals\[P_X(x) = \sum_y P_{XY}(x,y)\] and \(P_Y(y) = \sum_x P_{XY}(x,y)\) and \(E_P\) is the expected value over the distribution \(P\). The focus here is on discrete variables, but most results derived for discrete variables extend very naturally to continuous ones – one simply replaces sums by integrals. One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety, and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details.

<review> It would be nice to have the expressions in terms of expectations as well. For example, \(I(X;Y) = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \), etc. In this way they are connected better to probability, where the whole theory is anchored. I also concur with item 2 above. A good development of information theory for continuous variables can be found in Robert Gray's "Entropy and Information Theory", Springer-Verlag 1990. Although the whole discussion may become very technical in this case. Nevertheless it is worth a mention, just to let people know of the potential difficulties associated with continuous RV.</review>

<review> ***Reply: Done (but see response to previous reviewer).</review>

The units of information depend on the base of the logarithm. If base 2 is used (the most common, and the one used here), information is measured in bits. For other bases, see Table I.

Table I
base	units	conversion
2	bits	1 bit = 1 bit
\(e\)	nats	1 bit = \(\log_e 2 \ (\approx 0.693)\) nats
10	bans	1 bit = \(\log_{10} 2 \ (\approx 0.301)\) bans

<review> I suggest to remove the above conversion table. It belongs to the entry "information" rather than here. </review>

<review> ***Reply: we're not sure what the reviewer means by "the entry 'information'"; none exists on scholarpedia. We decided to leave this in, as it seems to us like useful information to us (it's something we always forget and occaisonally need to know).</review>

<review> Reviewer A, added after revision: This is really a funny misunderstanding. By an "entry" I mean a Scholarpedia entry "information" (if it exists). I think the conversion table should be given there. But OK, I accept that it doesn't hurt to have it also here (within the entry "mutual information".</review>

Interpretation

To understand what \(I(X;Y)\) actually means, we first need to define entropy and conditional entropy.

Qualitatively, entropy is a measure of uncertainty – the higher the entropy, the more uncertain one is about a random variable. This statement was made quantitative by Shannon. He postulated that a measure of uncertainty of a random variable \(X\) should be a continuous function of its probability distribution \(P_{X}(x)\) and should satisfy the following conditions:

<review> The notation \(P_{X}(x)\) is cumbersome when discussing the whole space rather than individual events. I can understand its use in sums where the elements may need to be shown explicitly. When discussing the probability as a whole, I suggest simpliying the notation to \(P_{X}\), meaning the probability measure on the whole space</review>

<review> ***Reply: we disagree with this, but only barely - while we prefer the notation \(P_{X}(x)\), we don't really know what other people think. We'll leave it like it is for now, but if the reviewer feels strongly, we'll change it.</review>

It should be maximal when \(P_{X}(x)\) is uniform, and in this case it should increase with the number of possible values \(X\) can take;

It should remain the same if we reorder the probabilities assigned to different values of \(X\);

The uncertainty about two independent random variables should be the sum of the uncertainties about each of them.

He then showed that the only measure of uncertainty that satisfies all these conditions is the entropy, defined as

H(X)=-\sum_x P_X(x) \log P_X(x) = - E_{P_X} \log P_X </math>

Although not particularly obvious from this equation, \(H(X)\) has a very concrete interpretation: Suppose \(x\) is chosen randomly from the distribution \(P_X(x)\), and someone who knows the distribution \(P_X(x)\) is asked to guess which \(x\) was chosen. If the guesser uses the optimal question-asking strategy, which is to divide the probability in half on each guess by asking questions like "is \(x\) greater than \(x_0\)?", then the average number of yes/no questions it takes to guess \(x\) lies between \(H(X)\) and \(H(X)+1\) (Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variables, given knowledge of the underlying distribution and taking the optimal question-asking strategy. See figure <ref>F1</ref> for examples.

Figure 1: Guessing a random variable using an optimal strategy. In both panels, the random variable can take on the four values 1, 2, 3 or 4, but with different probabilities. The optimal strategy is for each guess to divide the probability in half. a. All variables have probability 1/4. The first question is "is it 1 or 2?" because that question divides the probabilities in half. (Note that this question is not unique; one could just as easily asked, for example, "is it 2 or 4?") This strategy continues until the variable is guessed, which always takes two guess. This is also the entropy of the distribution: H=-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)-(1/4)log(1/4)=2. b. Variables 1-4 have probabilities 1/2, 1/4, 1/8, and 1/8, respectively. The first question, "is it 1?" divides the probability in half. (Again, this question is not uniqe; one could have asked "is it 2, 3 or 4?") "Yes" indicates that the variable has been guessed; "no" requires a second question. That question is "is it 2?", which again divides the probability in half. If the answer is "no", then there are only two variables left, 3 and 4, each occurring with probability 1/2, and the last question is "is it 3". An answer of either yes or no will tell us what the variable was. On average, the number of questions is (1/2)1 + (1/4)2 + (1/8)3 + (1/8)3 = 1 3/4. This is equal to the entropy (using log to the base 2): H=-(1/2)log(1/2)-(1/4)log(1/4)-(1/8)log(1/8)-(1/8)log(1/8)= 1 3/4. Adopted from Nirenberg and Latham (2003)

<review> Reviewer A, added after revision: I am a bit concerned that this article spends so much space on entropy. Now, when the entry "entropy" exists this article overlaps a lot with that one. For example, the above figure and example is almost identical with "entropy example 3" (see http://www.scholarpedia.org/article/Entropy/entropy_example_3). Is it absolutely necassary to place here so similar thing? I accept the article, but I ask the authors to reconsider this issue.</review>

The conditional entropy is the average uncertainty about \(X\) after observing a second random variable \(Y\), and is given by

H(X|Y)=\sum_y P_Y(y) \left[ - \sum_x P_{X|Y}(x|y) \log \left(P_{X|Y}(x|y)\right)\right] = E_{P_Y} \left[ - E_{P_{X|Y}} \log P_{X|Y} \right] </math>

where \(P_{X|Y}(x|y) (\equiv P_{XY}(x,y)/P_Y(y))\) is the conditional probability of \(x\) given \(y\).

<review> The definition above may be improved through the use of expectations as well. If one defines, as in Cover and Thomas, the event-conditioned entropy (or with expectations), then it's easier to intuit the conditional entropy as \( E_{P_Y} H(X|y) \). </review>

<review> ***Reply: done.</review>

<review> The above objections by the reviewer may be addressed by refering to an article on probability in Scholarpedia (if/when one exists). </review>

<review> ***Reply: none exists yet, but when one does we will make a link.</review>

With the definitions of \(H(X)\) and \(H(X|Y)\), Eq. (<ref>info</ref>) can be written

I(X,Y)=H(X)-H(X|Y). </math>

Mutual information is therefore the reduction in uncertainty about variable \(X\), or the expected reduction in the number of yes/no questions needed to guess \(X\) after observing \(Y\). Note that the yes/no question interpretation even applies to continuous variables: although it takes an infinite number of questions to guess a continuous variable, the difference in the number of yes/no questions it takes to guess \(X\) before versus after observing \(Y\) is finite, and is the mutual information. While problems can arise when going from discrete to continuous variables (subtracting infinities is always dangerous), they rarely do in practice; see (Gray, 1990).

<review> The above difference MAY but NEED NOT be finite. If \(X = Y\), the difference is infinite. Generally, one should be much more careful with statements on continuous variables. Moreover, problems with infinite entropy occur not only for continuous variables, but already for discrete variables with countably many states. The "difference of entropies" formula is "safe" only for FINITELY many states. Otherwise it may represent the undefined term \(\infty - \infty\). This should be clearly indicated. (appended by another reviewer) Again, Gray's book treats continuous variables carefully.</review>

<review> ***Reply: we have added a small disclaimer. Given the indended audience -- non-experts -- it seems reasonable to be a little less than rigorous about continuous distributions; the alternative would lead to an article that is way too long. In addition, in our experience, except for pathological cases, the transition from discrete to continuous is, in practice, pretty straightforward.</review>

<review>Reviewer A, added after revision: I still insist on changing the definitive "is finite" to "may be finite".</review>

<review> I think that this figure needs an additional sentence or so in the main text above describing its take-home message. You need enough of a hook to lead people to read/think through the caption. </review>

<review> Another reviewer: I agree that the figure is hard to swallow, and in particular could confuse the basic reader. Perhaps a solution would be to separate out one figure explaining the basic yes/no questions point, and a second (which would be read by the person wanting to go slightly further) going in to the bimodal distribution as an interesting example. This bit does need work. </review>

<review> ***Reply: the figure has been removed and replaced with a new one.</review>

<review>Reviewer A, added after revision: I don't see another figure... but its OK now.</review>

Properties

Mutual information is intimately related to the Kullback-Leibler divergence (Cover and Thomas, 1991), a very natural measure of the "distance" between two distributions. It is defined as follows: for any two distributions \(P(z)\) and \(Q(z)\),

\[ D_{KL} \Big( P(z)||Q(z) \Big) \equiv \sum_z P(z) \log \left[ {P(z) \over Q(z)} \right] \, . \]

<review>Comments on both text and above reviewer notes: Apart from not being symmetric, \(D_{KL}\) also does not satisfy the triangle inequality. So it should not be called a distance with a straight face. I can live with "distance" or divergence, or any of the other things it is usually called (distance not being one of them). I disagree with the reviewer on absolute continuity - it can of course be defined for any two measures, just happens to be infinite if P disagrees with Q on what is a set of measure 0. Also, a finite \(D_{KL}(P_{XY}||P_XP_Y)\), which implies a finite and symmetric \(I(X;Y)\), does not necessarily imply finite entropies, hence some of the problems with the "entropy expansion" of I(X;Y). </review>

<review> ***Reply: sorry; we were a little sloppy. we now put distance in quotes, and include qualifiers. we also believe that, for this audience, "any two distributions" is accurate enough, so long as we allow the KL divergence to be infinity.</review>

Distance is in quotes because the Kullback-Leibler divergence is not a true distance: it is not symmetric, and it does not obey the triangle inequality (Cover and Thomas, 1991). It is not hard to show that \(D_{KL} \Big(P(z)||Q(z) \Big)\) is non-negative, and zero if and only if \(P(z) = Q(z)\)

<review>a.e. (unless the authors restrict themselves to discrete distributions)</review>.

<review> ***Reply: this is true, but for this audience, it's probably enough to stick with the simpler statement.</review>

From Eq. (1), it is easy to see that the mutual information is just the Kullback-Leibler distance between the joint distribution, \(P_{XY}(x,y)\), and the product of the independent ones, \(P_X(x)P_Y(y)\),

\[ I(X;Y) = D_{KL} \Big( P_{XY}(x,y)||P_X(x)P_Y(y) \Big) \, . \]

Thus, another way to think about mutual information is that it is a measure of how close the true joint distribution of \(X\) and \(Y\) is to the independent joint distribution. Because the Kullback-Leibler distance, and thus the mutual information, is zero if and only if \(P_{XY}(x,y) = P_X(x)P_Y(y)\), it follows that the mutual information captures all dependencies between random variables, not just, say, second order ones, as captured by the covariance.

Mutual information is additive for independent variables. More quantitatively, if \(P_{XYWZ}(x,y,w,z) = P_{XY}(x,y) P_{WZ}(w,z)\), then

\[ I(X,W; Y,Z) = I(X;Y) + I(W;Z) \, . \]

This follows easily from the definition of the mutual information. It's also a property that information should have. For example, suppose \(X\) and \(Y\) are the cost of a bottle of wine and how it tastes, and \(W\) and \(Z\) are the height and weight of tree squirrels. If cost provides 1 bit of information about taste and height provides 2 bits of information about weight, then it makes sense that cost+weight should provide 3 bits of information about taste+height. Assuming, of course, that squirrels (and their weights) do not influence wines (and their prices).

<review> Properties (and indeed the whole article) should be listed in order of simplicity according to scholarpedia style - the idea being that if someone wants to casually look up what mutual information "is", they look down a certain distance, if they want to know more details and get to the math, they keep reading. Thus this one should possibly be at the start</review>

<review> ***Reply: we thought about this, but it's not obvious to us where this property would fit in further up. We agree that simplicity should come first, but it's generally not possible to have ordering by headings _and_ ordering by simplicity. Also, although this property is easy to state, it requires lots of definitions to actually prove. And finally, while it's simple, its significance isn't obvious -- and it really isn't such an important property -- until we discuss information transmission. A long-winded way of saying that we'll leave it here. We did, though, add the disclaimer about squirrels and wine being independent.</review>

<review> There are many more important properties of mutual information that could be incouded here. Chain rules come to mind, a natural extension of the additive property above when the distrubitions are not independent. Many inequalities, e.g. as listed in Chapter 17 of Cover and Thomas 2e (Chapter 16 in the 1st edition)</review>

<review> Yes, this article is the place for these things. Look at the Entropy article (detailed) or Don Johnson SNR article (concise but definitive - I see the mutual info article as being longer as there is a lot more to it though) for examples of what this article could be. Substantial effort needs to be put into this section. </review>

<review> ***Reply: we didn't find any of those properties appropriate for a sholarpedia article, which is not supposed to be super-technical. However, we are adding the data-processing inequality. We can't believe we forgot that!</review>

<review>Reviewer A, added after revision: I insist on listing the symmetry property \(I(X;Y)=I(Y;X)\). The proof using entropy is very simple\[I(X;Y) = H(X) - H(X|Y) = H(X) - H(X\vee Y) + H(Y)\] (which is symmetric).</review>

The Data Processing Inequality (DPI) states, loosely, that post-processing cannot increase information. More quantitatively, consider two random variables, \(X\) and \(Y\), whose mutual information is \(I(X,Y)\). Now consider a third random variable, \(Z\), that is a (probabilistic) function of \(Y\) only. (The only qualifier means \(P_{Z|XY}(z|x,y)=P_{Z|Y}(z|y)\), which in turn implies that \(P_{XYZ}(x,y,z)=P_{Z|Y}(z|y)P_{Y|X}(y|x)P_X(x)=P_{X|Y}(x|y)P_{Y|Z}(y|z)P_Z(z)\). This last relationship means that \(X, Y\) and \(Z\) form a Markov Chain.) The DPI states that \(Z\) can't have more information about \(X\) than \(Y\) has about \(X\); that is,

\[ I(X;Z) \le I(X;Y) \, . \]

This inequality, which again is a property information should have, is easy to prove. Combining Eqs. (3) and (4) and using \(\sum_z P_{XYZ}(x,y,z)=P_{XY}(x,y)\) and \(\sum_y P_{XYZ}(x,y,z)=P_{XZ}(x,z)\), we have

\[ I(X;Y)-I(X;Z) = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Y}(x|y) \over P_X(x)} \right] - \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Z}(x|z) \over P_X(x)} \right] = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \left[ {P_{X|Y}(x|y) \over P_{X|Z}(x|z)} \right] . \]

Then, using the Markov property of \(X, Y\) and \(Z\) to replace \(P_{XYZ}(x,y,z)\) with \(P_{X|Y}(x|y)P_{Y|Z}(y|z)P_Z(z)=P_{X|Y}(x|y) P_{YZ}(y,z)\), the above expression becomes

\[ I(X;Y)-I(X;Z) = \sum_{y,z} P_{YZ}(y,z) \sum_x P_{X|Y}(x|y) \log \left[ {P_{X|Y}(x|y) \over P_{X|Z}(x|z)} \right] = \sum_{y,z} P_{YZ}(y,z) D_{KL} \left(P_{X|Y}(x|y)| P_{X|Z}(x|z) \right) \ge 0 . \]

The last expression has a certain amount of intuitive appeal: if \(P_{X|Y}(x|y) = P_{X|Z}(x|z)\) for all \(x,y\) and \(z\), then \(I(X;Y)=I(X;Z)\), and no information is lost in the transformation from \(Y\) to \(Z\). If those two probabilities are not equal, then at least some information is lost.

The channel coding theorem

Mutual Information is just one way among many of measuring how related two random variables are. However, it is a measure ideally suited for analyzing communication channels.

Abstractly, a communication channel can be visualized as a transmission medium which receives an input \(x\) and produces and output \(y\). If the channel is noiseless, the output will be equal to the input. However, in general, the transmission medium is noisy and an input \(x\) is converted to output \(y\) with probability \(P_{Y|X}(y|x)\).

Given a communication channel, we can transmit any message \(\mathbf{s}\) from a set of \(M\) possible messages by performing the following three steps:

To each message \(\mathbf{s}\) we assign a string \(\mathbf{x}(\mathbf{s})=(x_1,\dots,x_n) \) of length \(n\). Each \(\mathbf{x}(s)\) is called a codeword. The deterministic mapping from the set of messages to the set of codewords is called the encoding function.
To transmit \(\mathbf{s}\), we transmit the corresponding string \(\mathbf{x}(\mathbf{s})\) over the channel. This yields an output string \(\mathbf{y}\), also of length \(n\). For the so called memoryless channles, each \(x_i\) in the input string is mapped to an output \(y_i\) with probability \(P_{Y|X}(y|x)\), independent of the other \(x_j\).
The output string \(\mathbf{y}\) is used to recover the transmitted message, using a deterministic decoding function. The decoding function maps each \(\mathbf{y}\) to one symbol \(\mathbf{s'}\).

These three steps are illustrated as

\[ \mathbf{s} \ \mathbf{\xrightarrow{Encoding}} \ x_1 x_2 ... x_n\ \rightarrow \Big[ \ {\rm noisy \ channel} \! : P_{Y|X}(y|x) \ \Big] \rightarrow y_1 y_2 ... y_n\ \mathbf\xrightarrow{Decoding} \ \mathbf{s'} \, . \]

One would like \(\mathbf{s'}\) to be the same as \(\mathbf{s}\), but, because of noise, there is no way to absolutely guarantee this. However, the channel coding theorem (Shannon and Weaver, 1949) states that one can come close, so long as \(M\) is not too large. Moreover - and this is where the link between mutual information and communication channels arises - the maximum number of messages that can be transmitted almost error-free is a function of the mutual information between \(X\) and \(Y\). Specifically, the channel coding theorem states the following:

First, define the channel capacity, \(C\), as the maximum mutual information with respect to the input distribution, \(P_X\),

\[ C=\max_{p_X} I(X;Y) \, . \]

<review>Reviewer A, added after revision: Still, I don't completely understand this concept, and I am afraid, the reader will get lost here. It must be better explained over what set is the maximum. So far, I understand that the "channel" deals with a given pair of variables, X and Y with a given joint distribution P_{X,Y}. I imagine, that the authors are trying to say, that the "channel" is a fixed CONDITIONAL distribution (as a function of x,y) while the marginals are allowed to vary. The remaining part of the example below deals with a fixed pair of variables, so the process of taking the maximum is not illustrated, and hence the reader has no chance to understand what the maximum has to do here! In this example simply C = I(X;Y), am I right? </review>

Then, assume that \(M = 2^{nR}\) messages are encoded in strings of length \(n\). If \(R < C\), the probability of an error (i.e., the probability that \(\mathbf{s} \ne \mathbf{s'}\)) is exponentially small in \(n\). If, on the other hand, \(R \ge C\), then errors are likely. More formally,

For \(R=\frac{\log(M)}{n}<C \), it is possible to transmit messages with a maximum probability of error which is exponentially small in \(n\). Conversely, if \(R>C \), this is not possible.

In the above theorem, \(R\) is called the rate.

<rewiew> I removed the repetition: At first glance, the ability to transmit information over a noisy channel almost error free looks mysterious. </review>

Although at first glance the ability to transmit information over a noisy channel almost error free looks mysterious, it's based on a very simple principle: send only a subset of the possible messages (that is, choose \(M\) such that \(M < 2^{nC}\)). The idea is illustrated in Fig. <ref>F2</ref>. This figures shows a set of input messages, represented by the dots in the left hand panels. Because of noise, a particular input message could map to any one of the messages inside the circles in the right hand panels. If one used all possible messages, as in Fig. <ref>F2</ref>a, and tried to decode - to map \(Y\) back to \(X\) - one would make errors. For example, the red message on the right hand side of Fig. <ref>F2</ref>a could be mapped back to any of three input messages. If, on the other hand, one used only a subset of the messages, as in Fig. <ref>F2</ref>b, then decoding could be done perfectly. Although this main idea is simple to grasp, the power of the channel coding theorem lies in the fact that increasing \(n\) exponentially increases the maximum number of messages that can be transmitted almost error free (\(M \sim 2^{nC}\)).

Figure 2: How to transmit information error free. a. A scheme that does not transmit information error free. The dots in the left panel show (very schematically) all possible messages. The noisy channel maps each of them to any of the messages within the circles in the right panel. If one were to try to decode these messages, errors would be made on a regular basis. That's because multiple messages lie within each circle, so for each message that is received, more than one message could have been sent. For example, the red dot could be decoded as the red one (which is correct), but also as either of the cyan ones (both of which are wrong). b. A scheme that does transmit information error free. A sufficiently small subset of the messages are sent so that there is only one message inside each circles. Decoding can now be done perfectly - the red dot on the right can now be decoded only as the red dot on the left. This is, of course, an idealization; in real systems, there is always some probability of error. But, as shown in the main text, that probability can be made arbitrarily small.

A detailed proof of the channel coding theorem is beyond the scope of this article and we refer the readers to the relevant literature [Shannon and Weaver 1949, Cover and Thomas 1990]. It is, however, possible to see the relation between the rate of transmission and the probability of error quite intuitively, as shown below.

Suppose that we want to transmit \( M \) messages using codewords of length \( n \). If we send a codeword \(\mathbf{x}^\mathrm{true} \equiv (x_1^\mathrm{true}, x_2^\mathrm{true}, ..., x_n^\mathrm{true})\), and it is converted to \(\mathbf{y} \equiv (y_1, y_2, ..., y_n)\) on the output side, there is a very natural way to translate from \(\mathbf{y}\) back to \(\mathbf{x}\): choose the message most likely to have caused it. That is, compute \(p(\mathbf{y}|\mathbf{x})\) - the probability that \(\mathbf{y}\) was produced by codeword \(\mathbf{x}\) - for all codewords, and choose the one that maximizes this probability. We denote the probability of mistakenly decoding a randomly chosen codeword, \(\mathbf{x}\), by \(P_n\). This is defined as

\[ P_n = \mathrm{Prob} \left[ p(\mathbf{y}|\mathbf{x}) > p(\mathbf{y}|\mathbf{x}^\mathrm{true}) \right] \, . \]

Since \(\mathbf{x}\) was chosen randomly in the above expression, it follows that the total probability of an error when decoding messages, denoted \(P_{tot}\), is given by

\[ P_{tot} = 1 - (1-P_n)^{M} \approx 1 - \exp[-M P_n] \, . \]

For \(P_{tot}\) to be small, \(M P_n\) must be much less than one. This provides a bound on the number of messages, \( M \), that can be sent almost error free using codewords of length \( n \)

M \ll 1/P_n \, . </math>

It is a straightforward exercise in large deviation theory (see Chapter 12 of Cover & Thomas, 1991) to show that \(P_n\) decreases exponentially with \( n \). In fact, as motivated in the example below, in the large \(n\) limit, \(P_n\) is given by

\[ P_n = 2^{-nI(X;Y)} \, . \]

If we now let

\[ M = 2^{n[I(X;Y)-\epsilon]} \]

with \(\epsilon\) positive, we have automatically satisfied Eq. (<ref>N_inequality</ref>), in the large \(n\) limit. Thus, this analysis implies that \(2^{nI(X;Y)}\) is an upper bound on the number of messages of length \(n\) can be sent with a probability of error equal to \(2^{-n\epsilon}\). By making \(n\) sufficiently large, the probability of error can be made arbitrarily small.

The connection between mutual information and the number of messages that can be sent is a deep one, but it turns out to be fairly easy to understand, as can be seen with the following example.

Consider a communication channel that transmits 0s and 1s, and transmits them correctly with probability \(q\); that is,

\[ \begin{array}{ll} P_{Y|X}(1,1) & = q \\ P_{Y|X}(1,1) & = q \, . \end{array} \]

For simplicity assume that \(P_X(0)=P_X(1)=1/2\), which in turn implies that \(P_{X|Y}(1,1) = P_{X|Y}(0,0) = q.\) The mutual information for this distribution is given by

\begin{array}{ll} I(X;Y) & = H(X) - H(X|Y) \\ & = -(1/2) \log(1/2) -(1/2) \log(1/2) -(1/2)[-q \log q - (1-q) \log(1-q)] -(1/2)[-q \log q - (1-q) \log(1-q)] \\ & = -\log(1/2) + q \log q + (1-q)\log(1-q) \end{array} </math>

As usual, one constructs a codebook containing only a subset of possible strings of 0s and 1s; that is, for strings of length \(n\), the codebook contain far fewer than \(2^n\) messages. Suppose a particular codeword, \(\mathbf{y} \equiv (y_1, y_2, ..., y_n)\), was received. How would one figure out which message was sent? The idea is to make use of the fact that if a 1 was received then there is a probability \(q\) that a 1 was sent, and similarly for 0. For definiteness, take \(q=0.9\). Then, one simply looks for the codeword that has approximately 90% 1s at positions in the string where the received symbol was 1, and approximately 90% 0s at positions in the string where the received symbol was 0. For example, consider following received message and two candidate codewords,

\[ {\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1}{\color{red}0} \ \ \mathrm{received \ message} \]

\[ {\color{red}0}{\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{blue}1}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{blue}1} {\color{red}0}{\color{green}0}{\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{red}0}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{red}0}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{blue}1}{\color{red}0} \ \ \mathrm{likely \ codeword} \]

\[ {\color{red}0}{\color{green}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}0} {\color{blue}1}{\color{blue}1}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{green}1}{\color{red}0}{\color{green}0} {\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}0}{\color{red}0}{\color{green}1}{\color{red}0}{\color{blue}1}{\color{green}0}{\color{green}0} {\color{red}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{red}0}{\color{green}1}{\color{blue}1} {\color{red}0}{\color{red}0}{\color{green}1}{\color{green}0}{\color{blue}1}{\color{green}1}{\color{blue}1}{\color{blue}1}{\color{green}0}{\color{red}0} \ \ \mathrm{unlikely \ codeword} \]

Incorrect matches are color-coded in green. For the likely codeword, only about 10% of the symbols are green, whereas for the unlikely codeword, about 50% are green.

There are two points to this example. One is that the probability of making an error - of incorrectly decoding a message - is about equal the probability that a randomly chosen message, one in which the probability of a 0 or 1 in each position is 1/2, will have 90% of its 1s and 90% of its zeros in just the right places. The other is that this probability is not hard to calculate: it is the same as the probability of flipping \(n\) coins, and having the first \(n/2\) come up 90% heads and the second \(n/2\) come up 90% tails. This probability (with 0.9 replaced by q) is given by

\[ P_n = \left[(1/2)^{n/2} \frac{(n/2)!}{(qn/2)!((1-q)n/2)!} \right]^2 \, . \]

The first factor, \((1/2)^{n/2}\) is the probability of a particular message occurring; the second enumerates all the ways to arrange the symbols. Using Stirling's formula and taking the large \(n\) limit, this expression can be written

\[ P_n \approx 2^{-n[- \log_2 (1/2) + q \log_2 q + (1-q) \log_2 (1-q)]} = 2^{-nI(X;Y)} \, . \]

where the last equality follows from Eq. (<ref>binary_mi</ref>). This analysis, which required only the use of Stirling's formula (and can be easily extended to the general case), provides a link between information theory and communications channels.

<review> There are a couple of examples that can help improve the intuitive understanding of channel coding. Consider binary strings/channels in both cases.

Take a binary string of length n. Say we want to protect from a single error (bit flip) in this string. In this case we need to send additional information about the position of the flipped bit, which means we need another \( log_2 n\) binary symbols, which encode the binary representation of the error. Of course, each one of the extra symbols may be flipped, but their protection needs another \( log_2 log_2 n\), which is much smaller than the original expansion, that it can be disregarded (or folded in the original). Now, if we want to protect against 2 errors (flipped bits), we would need to provide extra information about the position of the 2 flipped bits, the binary representation of which is \( log_2 (2n-1) \approx log_2 n + 1\) long, that is, just one bit longer than the 1 bit correction. The overall probability of error decreases as the probability of the remaining unprotected patterns (3 or more so far)
Consider the space of all binary strings of length n. You can imagine it as an n-dimensional cube on \([0 1]^n\). Each distinct binary message is represented by a vertex on this cube. When a particular message is selected for transmission, noise can perturbe the message to one of the neighbouring messages - along an edge, or several edges away from the original (Hamming distance can be introduced here, then the perturbation is in the "binary sphere of Hamming radius k"). So in order to avoid confusion, we agree not to send any other messages from within this sphere, and interpret all outcomes from this sphere as its center, the original message.

</review>

<review> On the reviewer's first idea: correcting one fliped entry is a bit more complicated. The sender cannot predict which bit will be flipped, so he cannot append this kind of information at the end of the message. One has to label all vertices using some \(m\) labels so that the 1-ball around each vertex receives an injective labelling (I think this should be possible with \(m=n+1\), but I am not sure), and append the coded label of the message. After one flip we could still figure out which the original vertex was. Of course, the amount of appended information should be roughly the same as the other reviewer computed. The same can be done with balls of larger radius \(k\) and \(m\) of rank \(n^k\). This produces a variant of the second method. </review>

<review> ***Reply: We decided not to use any of these suggestions, since they're kind of complicated. Hopefully what we used isn't even worse.</review>

<review>Reviewer A, added after revision: This example is indded much easier for an unexperienced reader. However, I still do not see the moment, when we CHOOSE a subset of all messages (reducing the cardinality) to make errorless transmission possible. Neither I see the moment when we apply some maximum! </review>

While this fundamental theorem puts bounds on transmitted information, it does not tell us how to reach those bounds. An active area of research is designing codes that come as close as possible to the limit given by mutual information error correcting codes.

References

Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, NY.
S. Nirenberg and P.E. Latham, Decoding neuronal spike trains: how important are correlations? Proc. Natl. Acad. Sci. 100:7348-7353 (2003).
Shannon, C.E. and Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press, Urbana, Illinois.

Mutual information

Contents

Definition

Interpretation

Properties

The channel coding theorem

References

External links

See also

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Focal areas

Activity

Tools