Mutual information
| Peter E. Latham and Yasser Roudi (2009), Scholarpedia, 4(1):1658. | doi:10.4249/scholarpedia.1658 | revision #186917 [link to/cite this article] |
Mutual information is a quantity that measures how much one random variable tells us about another. In general, information is best thought of as a measure of reduction in uncertainty: 1 bit of information reduces the uncertainty about a random variable by a factor of 2, 2 bits by a factor of 4, and, in general, \(n\) bits by a factor of \(2^n\). In particular, mutual information is the reduction in uncertainty about one random variable given knowledge of another.
<review> 1. Throughout the article, what the authors call "uncertainty" is sometimes something proportional to the inverse of probability, sometimes to its logarithm. This may be terribly confusing for a nonexpert reader. Apparently above they skip the logarithm.</review>
<review> 2. The above interpretation of "information" is confusing. In the context of mutual information, 1 bit reduces uncertainty about a random variable by AT MOST a factor 2; if the information is independent there is no reduction at all. From what is written above the reader may deduce that \(n\) bits of ANY information reduces uncertainty about a given random variable by a factor of \(2^n\). What is written in the "dictionary part" of the article should be absolutely correct and admit no misinterpretation.</review>
Contents |
Definition
For two discrete variables \(X\) and \(Y\) whose joint probability distribution is \(P_{XY}(x,y)\), the mutual information between them, denoted \(I(X;Y)\), is given by (Shannon and Weaver, 1949; Cover and Thomas, 1991)
- <math info>
I(X;Y) = \sum_{x,y} P_{XY}(x,y) \log {P_{XY}(x,y) \over P_X(x) P_Y(y)} = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \, . </math>
Here \(P_X(x)\) and \(P_Y(y)\) are the marginals\[P_X(x) = \sum_y P_{XY}(x,y)\] and \(P_Y(y) = \sum_x P_{XY}(x,y)\) and \(E_P\) is the expected value over the distribution \(P\). The focus here is on discrete variables, but most results derived for disrete variables extend very naturally to continuous ones – one simply replaces sums by integrals. One should be aware, though, that the formal replacement of sums by integrals hides a great deal of subtlety, and, for distributions that are not sufficiently smooth, may not even work. See (Gray, 1990) for details.
<review> 1. remove "our" </review> Done.
<review> 2. the passage to continuous variables is not that easy: for the fraction to make sense one needs the Radon-Nikodym derivative (and the fact that the joint distribution is absolutely continuous wrt. product measure).</review> We don't want to be too technical here, so we weakened our claim and referred the reader to Gray.
<review> It would be nice to have the expressions in terms of expectations as well. For example, \(I(X;Y) = E_{P_{XY}} \log{P_{XY} \over P_X P_Y} \), etc. In this way they are connected better to probability, where the whole theory is anchored. I also concur with item 2 above. A good development of information theory for continuous variables can be found in Robert Gray's "Entropy and Information Theory", Springer-Verlag 1990. Although the whole discussion may become very technical in this case. Nevertheless it is worth a mention, just to let people know of the potential difficulties associated with continuous RV.</review> Done (but see response to previous reviewer).
The units of information depend on the base of the logarithm. If base 2 is used (the most common, and the one used here), information is measured in bits. For other bases, see Table I.
| base | units | conversion |
|---|---|---|
| 2 | bits | 1 bit = 1 bit |
| \(e\) | nats | 1 bit = \(\log_e 2 \ (\approx 0.693)\) nats |
| 10 | bans | 1 bit = \(\log_{10} 2 \ (\approx 0.301)\) bans |
<review> I suggest to remove the above conversion table. It belongs to the entry "information" rather than here. </review>
Interpretation
To understand what \(I(X;Y)\) actually means, we first need to define entropy and conditional entropy.
Qualitatively, entropy is a measure of uncertainty – the higher the entropy, the more uncertain one is about a random variable. This statement was made quantitative by Shannon. He postulated that a measure of uncertainty of a random variable \(X\) should be a continuous function of its probability distribution \(P_{X}(x)\) and should satisfy the following conditions: <review> The notation \(P_{X}(x)\) is cumbersome when discussing the whole space rather than individual events. I can understand its use in sums where the elements may need to be shown explicitly. When discussing the probability as a whole, I suggest simpliying the notation to \(P_{X}\), meaning the probability measure on the whole space</review>
- It should be maximal when \(P_{X}(x)\) is uniform, and in this case it should increase with the number of possible values \(X\) can take;
- It should remain the same if we reorder the probabilities assigned to different values \(X\) can take;
- The uncertainty about two independent random variables should be the sum of the uncertainties about each of them.
He then showed that the only measure of uncertainty that satisfies all these conditions is the entropy, defined as
- <math entropy>
H(X)=-\sum_x P_X(x) \log P_X(x)
</math>
<review> 1. Here, following Shannon, the authors treat "uncertainty" WITH the logarithm (see my first comment). </review>
<review> 2. I am sure the meaning of "entropy" and Shannon's postulates will be explained in the due entry, so, in this article this part should be reduced.</review>
Although not particularly obvious from this equation, \(H(X)\) has a very concrete interpretation: if \(X\) is chosen randomly from the distribution \(P_X(x)\), and one is asked to guess which \(X\) was chosen, the average number of yes/no questions it takes lies between \(H(X)\) and \(H(X)+1\) (Cover and Thomas, 1991). This gives quantitative meaning to "uncertainty": it is the number of yes/no questions it takes to guess a random variables.
<review> The above is very unclear. It is not explained what kind of "guesses" are allowed. Even an experienced reader may imagine that the "guesser" tries to guess the value in a series of "iid" trials with, say, the uniform distribution on the states (say, he knows the list of states only). It must be made absolutely clear what "yes/no" questions are, and that the "guesser" KNOWS apriori the distribution of \(X\), if this interpretation of entropy is going to be (as it is now) used repeatedly later in the text. </review>
The conditional entropy is the uncertainty about \(X\) after observing a second random variable \(Y\), and is given by
- <math cond_entropy>
H(X|Y)=\sum_y P_Y(y) \left[ - \sum_x P_X(x|y) \log \left(P_X(x|y)\right)\right]
</math>
where \(P_X(x|y) (\equiv P_{XY}(x,y)/P_Y(y))\) is the conditional probability of \(X\) given a value of \(Y\).
<review> If the authors think the reader needs a table to convert from log base 2 to log base 10, why do they assume he/she will know what "conditional probability" is. The word "simply" is only irritating for someone who gets lost in this place. I suggest to remove "simply" and at least add a link to "conditional probability" or write the formula.</review> Done. No page yet for probability or conditional probability, so no link.
<review> The definition above may be improved through the use of expectations as well. If one defines, as in Cover and Thomas, the event-conditioned entropy \( H(X|y) = \left[ - \sum_x P_X(x|y) \log \left(P_X(x|y)\right)\right] \) (or with expectations), then it's easier to intuit the conditional entropy as \( E_{P_Y} H(X|y) \).
The above objections by the reviewer may be addressed by refering to an article on probability in Scholarpedia (if/when one exists). </review>
With the definitions of \(H(x)\) and \(H(X|Y)\), Eq. (<ref>info</ref>) can be written
- <math info2>
I(X,Y)=H(X)-H(X|Y).
</math>
Mutual information is therefore the reduction in uncertainty about variable \(X\), or the reduction in the number of yes/no questions needed to guess \(X\) after observing \(Y\). Note that the yes/no question interpretation even applies to continuous variables: although it takes an infinite number of questions to guess a continuous variable, the difference in the number of yes/no questions it takes to guess \(X\) before versus after observing \(Y\) is finite, and is the mutual information.
<review> The above difference MAY but NEED NOT be finite. If \(X = Y\), the difference is infinite. Generally, one should be much more careful with statements on continuous variables. Moreover, problems with infinite entropy occur not only for continuous variables, but already for discrete variables with countably many states. The "difference of entropies" formula is "safe" only for FINITELY many states. Otherwise it may represent the undefined term \(\infty - \infty\). This should be clearly indicated. (appended by another reviewer) Again, Gray's book treats continuous variables carefully.</review>
Figure <ref>F1</ref> elaborates on this point. Note especially Fig. <ref>F1</ref>d, which shows that information-as-a-reduction-in-uncertainty works even for bimodal distributions, which are generally hard to characterize with a single number.
<review> I suggest to remove the figure and the text around it. It is relevant to "conditional distribution" rather than here. It shows an example of how the distribution of \(X\) may differ from its conditional distribution given some particular values of \(Y\). Such distributions may in fact differ ARBIRARILY, so any picture of a pair of histograms would be a good example here, which is obvious for anyone who understands joint distribution of a pair of random variables in generality. For an unexperienced reader the picture is completely puzzling and misleading. Mutual information takes into account the average over all states of \(Y\), and ALWAYS reduces entropy. The picture shows the influence of individual values of \(Y\), including the case where the entropy of \(X\) is increased (the "negatively informative case") , which only messes things up. I don't understand the comment concerning not knowing vs. knowing the value \(Y=y_3\). Also, I don't understand what is so special about the bimodal distribution. If the authors insist on keeping the figure, at least make a comment about averaging the conditinal entropy over the values of \(Y\), remove the unlear comment about \(Y=y_3\) and remove the particularity of the bimodal distribution in saying "works even for bimodal distributions..." (but keep the case d, as it provides a good clue of how a "yes/no" question may work). </review>
Properties
- Mutual information is intimately related to the Kullback-Leibler divergence (Cover and Thomas, 1991), a very natural measure of the distance between two distributions. It is defined as follows: for any two distributions \(P(z)\) and \(Q(z)\),
\[ D_{KL} \Big( P(z)||Q(z) \Big) \equiv \sum_z P(z) \log \left[ {P(z) \over Q(z)} \right] \, . \]
<review> Not "any two" only \(P\) must be assumed absolutely continuous of wrt. \(Q\). Also it would be reasonable to remark that this "distance" is NOT symmetric! Nevertheless, mutual information is symmetric, assuming that both variables have finite entropy. I think this property deserves separate listing! </review>
<review>Comments on both text and above reviewer notes: Apart from not being symmetric, \(D_{KL}\) also does not satisfy the triangle inequality. So it should not be called a distance with a straight face. I can live with "distance" or divergence, or any of the other things it is usually called (distance not being one of them). I disagree with the reviewer on absolute continuity - it can of course be defined for any two measures, just happens to be infinite if P disagrees with Q on what is a set of measure 0. Also, a finite \(D_{KL}(P_{XY}||P_XP_Y)\), which implies a finite and symmetric \(I(X;Y)\), does not necessarily imply finite entropies, hence some of the problems with the "entropy expansion" of I(X;Y). </review>
It is not hard to show that \(D_{KL} \Big(P(z)||Q(z) \Big)\) is non-negative, and zero if and only if \(P(z) = Q(z)\) <review>a.e. (unless the authors restric themselves to discrete distributions)</review>. From Eq. (1), it is easy to see that the mutual information is just the Kullback-Leibler distance between the joint distribution, \(P_{XY}(x,y)\), and the product of the independent ones, \(P_X(x)P_Y(y)\),
\[ I(X;Y) = D_{KL} \Big( P_{XY}(x,y)||P_X(x)P_Y(y) \Big) \, . \]
Thus, another way to think about mutual information is that it is the distance between the correlated and uncorrelated distributions of \(X\) and \(Y\). Because the Kullback-Leibler distance, and thus the mutual information, is zero if and only if \(P_{XY}(x,y) = P_X(x)P_Y(y)\), it follows that the mutual information captures all dependencies between random variables, not just, say, second order ones, as captured by the covariance.
<review> "CORRELATED and UNCORRELATED" is exactly what covariance captures. Mutual information (as explained) captures the distance between TRUE joint distribution and the INDEPENDENT joint distribution. So these words should be used at the first place.</review>
- Mutual information is additive for independent variables. More quantitatively, if \(P_{XYWZ}(x,y,w,z) = P_{XY}(x,y) P_{WZ}(w,z)\) , then
\[ I(X,W; Y,Z) = I(X;Y) + I(W;Z) \, . \]
This follows easily from the definition of the mutual information. It's also a property that information should have. For example, suppose \(X\) and \(Y\) are the cost of a bottle of wine and how it tastes, and \(W\) and \(Z\) are the height and weight of tree squirrels. If cost provides 1 bit of information about taste and height provides 2 bits of information about weight, then it makes sense that cost+weight should provide 3 bits of information about taste+height.
<review> If invoke such a naive example, why not do it at the beginning of the article with regard to mutual information. Here, for additivity, one could recall and enhance that first example, already familiar for the reader. Also one needs to comment that squirrels (and their weights) usually do not influence wines (and their prices), (can heavy squirrels damage wineyards more than light ones???) so this is an example of independent pairs of variables.</review>
The significance of this result lies in its implication for noisy communication channels. Suppose messages are encoded in strings of 0s and 1s and sent over phone wires (which, because of noise, randomly convert 0s to 1s and vice versa). If the symbols are independent and each one provides, say \(I_1\) bits, then \(n\) symbols provide \(nI_1\) bits.
<review> I have an impression that the above text is in this place by mistake. It hardly applies to additivity, besides, it appears BEFORE the noisy communication channels are explaned. </review>
<review> There are many more important properties of mutual information that could be incouded here. Chain rules come to mind, a natural extension of the additive property above when the distrubitions are not independent. Many inequalities, e.g. as listed in Chapter 17 of Cover and Thomas 2e (Chapter 16 in the 1st edition)</review>
The channel coding theorem
<review> The section below is the worst written part of the article. I hate to say it, but the authors seem to have saved on time and effort to write something understandable.</review>
Mutual Information is just one way among many of measuring how related two random variables are. However, it is a measure ideally suited for analyzing communication channels. Abstractly, a communication channel can be visualized as a noisy transmission medium, which takes \(X\) as input and produces \(Y\) as output, and does so probabilistically via \(P_Y(y|x)\),
\[
x_1 x_2 ... x_n \rightarrow
\Big[ \ {\rm noisy \ channel} \! : P_Y(y|x) \ \Big] \rightarrow
y_1 y_2 ... y_n
\, .
\]
All communication systems have this basic structure. A key, and inescapable, part of that structure is noise: what comes out is not what goes in.
<review> It seems that in this section we "want" the output to be identical with the input, i.e., we tend to have \(Y=X\). The noise makes the joint distribution diverge from the diagonal. This should be clearly explained. Also, explain the following: the variables \(X\) and \(Y\) are given and fixed. So how can we "improve" the transmission? The coding trick is in fact a complicated manipulation on the initial distributions, and it may be very hard for the reader to understand. (Coding includes changing the finite-dimensional distributions of the signal source, while the reader usually thinks about an iid source.) </review>
This would seem to imply a loss of information. However, in a triumph of intuition and logic, Shannon showed the following (Shannon and Weaver, 1949):
<review> I am also fascinated by Shannon's achievements, but this is an ENCYCLOPEDIA, so, we should definitely refrain from showing too much of an enthusiasm. </review>
<review> On above reviewer's comment - oh, come on, we can be enthusiastic from time to time even in an ENCYCOLPEDIA. Is life worth living if you can't exclaim Shannon lives! from time to time ;-) </review>
<review> OK, I agree. Nonetheless, I still don't like the "triumph of intuition and logic". It sounds a bit exagerrated and suggests that we have witnessed the struggle to obtain this result.</review>
- It is possible to communicate information over a noisy channel almost error free, with a maximum rate set by the mutual information between input signals and output signals.
<review> This statement is completely unclear. At the first glance "almost error free" contradicts "with a maximum rate". If "almost error free", shouldn't the "rate" be something like 99% ? It must be explained what "rate" means here. And it means something quite complicated: if we want to errorfree encode messages (binary strings) of length \(n\) then using even most efficient method (code) we will have to sacrafice a (usually huge) number of message so that the code will "work well" on selected messages only. Now, the number of selected messages \(N_n\) is such that the ratio between \(\log N_n\) and \(\log 2^n\) (\(2^n\) is the number of all possible \(n\)-messages) is roughly equal to the mutual information. Notice that the term "rate" refers to logarithms, not the bare quantities, so it is very confusing to use it without a comment. Or else, we will have to encode our messages and INCREASE their lenghs at the "rate" \(I(X;Y)^{-1}\), then we will be able to errorfree transmit nearly all \(2^n\) original messages (note that there are many more messages of the increased length, but again, we are using only relatively few of them). This is roughly explained in the article below, which is a good idea (but done in a non-satisfactory way). But the statement of the Shannon's theorem itself MUST be presented in a complete and understandable (at least to a mathematician) way. </review>
Although at first glance this looks mysterious, it's based on a very simple principle: send redundant information. Consider, for example, a noisy channel that transmits binary digits – 0s and 1s. Suppose that there is a 10% probability of a bit flip (\(1 \rightarrow 0\) or \(0 \rightarrow 1\)), as shown here: \[ 00110100 ... \rightarrow \Big[ \ {\rm noisy \ channel} \ \Big] \rightarrow 0{\color{red}1}110{\color{red}0}00 ... \]
To increase the probability that a message goes through intact, we could send triplets of symbols: 111 in place of a single 1 and 000 in place of a single 0. The probability that two of the bits flip, thus creating an error, is on the order of \(0.1^2\), or one part in a hundred. Or we could send more 1s and 0s – eleven, say, in which case it would take six bit flips to cause an error. The probability of this happening is on the order of \(0.1^6\), or one part in a million. In general, the longer the blocks, the smaller the probability of error.
There is, of course, a tradeoff: while longer blocks reduce the probability of error, they also reduce the number of messages one can send per symbol.
<review> For an unexperienced reader "number of messages per symbol" is hard to accept. Usually one message takes many symbols, not vice-versa. I'd either say, "increases the effective length of messages" or "reduces the number of messages of given length \(n\) (after coding) that can be sent errorfree". </review>
It is thus desirable to design efficient codes – codes which minimize the probability of error for a fixed block size. One might think that this is a problem with no end, in the sense that if one looked harder it would always be possible to find a more efficient code. This is where Shannon's genius came in:
<review> Again, refrain from judgements and enthusiasm. </review>
he showed that for a block of length \(n\) (e.g., a binary string with \(n\) 0s and 1s), the number of messages that can be sent almost error free, \(\mathcal{N}_n\), is
\[ \mathcal{N}_n = 2^{n[I(X,Y) - \epsilon]} \, . \]
The "almost error free" qualifier means that the probability of error is \(2^{-n\epsilon}\), where \(\epsilon\) is a small positive number that quantifies the maximum error in our estimates of the signal entropies. By making \(n\) sufficiently large, the probability of error can be made arbitrarily small. Note that \(\epsilon\) can never be negative.
The encoding and decoding steps can be added to the communication channel, so it looks like
\[ z_1 z_2 ... z_m \rightarrow \Big[ \ {\rm encode} \ \Big] \rightarrow x_1 x_2 ... x_n \rightarrow \Big[ \ {\rm noisy \ channel} \ \Big] \rightarrow y_1 y_2 ... y_n \rightarrow \Big[ \ {\rm decode} \ \Big] \rightarrow z_1 z_2 ... z_m \, . \]
The key point here is that one can, with very high probability, perfectly recover the original message, \(z_1 z_2 ... z_m\), if the number of symbols after the encoding step, \(n\), is greater than the number of symbols in the original message, \(m\). For the simple example given above, in which \(1 \rightarrow 111\) and \(0 \rightarrow 000\), the encoding/decoding process would look something like
\[ 110 \rightarrow \Big[ \ {\rm encode} \ \Big] \rightarrow 111111000 \rightarrow \Big[ \ {\rm noisy \ channel} \ \Big] \rightarrow 1{\color{red}0}111{\color{red}0}00{\color{red}1} \rightarrow \Big[ \ {\rm decode} \ \Big] \rightarrow 110 \, . \]
<review> This example perfectly misses the essence of Shannon's theorem. It suggests that in order to improve the tranmission's quality one needs to increase the ratio between the coded and original messages to infinity. The same is suggested few lines earler, where the "triplets" code is increased to "elevenths". Such method of reducing errors has nothing to do with Shannon's idea and mutual information. I suggest to give an example where the length really increases by the fraction suited to the 10% noise (it will be instructive to compute the mutual information in such case) and only the length \(n\) is allowed to increase (but not the ratio between the lengths before and after coding). For as simple dependence as in this example there should exist a simple and convincing such code. </review>
<review> There are a couple of examples that can help improve the intuitive understanding of channel coding. Consider binary strings/channels in both cases.
- Take a binary string of length n. Say we want to protect from a single error (bit flip) in this string. In this case we need to send additional information about the position of the flipped bit, which means we need another \( log_2 n\) binary symbols, which encode the binary representation of the error. Of course, each one of the extra symbols may be flipped, but their protection needs another \( log_2 log_2 n\), which is much smaller than the original expansion, that it can be disregarded (or folded in the original). Now, if we want to protect against 2 errors (flipped bits), we would need to provide extra information about the position of the 2 flipped bits, the binary representation of which is \( log_2 (2n-1) \approx log_2 n + 1\) long, that is, just one bit longer than the 1 bit correction. The overall probability of error decreases as the probability of the remaining unprotected patterns (3 or more so far)
- Consider the space of all binary strings of length n. You can imagine it as an n-dimensional cube on \([0 1]^n\). Each distinct binary message is represented by a vertex on this cube. When a particular message is selected for transmission, noise can perturbe the message to one of the neighbouring messages - along an edge, or several edges away from the original (Hamming distance can be introduced here, then the perturbation is in the "binary sphere of Hamming radius k"). So in order to avoid confusion, we agree not to send any other messages from within this sphere, and interpret all outcomes from this sphere as its center, the original message.
</review>
<review> On the reviewer's first idea: correcting one fliped entry is a bit more complicated. The sender cannot predict which bit will be flipped, so he cannot append this kind of information at the end of the message. One has to label all vertices using some \(m\) labels so that the 1-ball around each vertex receives an injective labelling (I think this should be possible with \(m=n+1\), but I am not sure), and append the coded label of the message. After one flip we could still figure out which the original vertex was. Of course, the amount of appended information should be roughly the same as the other reviewer computed. The same can be done with balls of larger radius \(k\) and \(m\) of rank \(n^k\). This produces a variant of the second method. </review>
For a given channel, meaning a fixed response distribution \(P_Y(y|x)\), one can choose \(P_X(x)\) to maximize the mutual information. It can be shown that there is a unique maximum (Cover and Thomas, 1991). That maximum corresponds to the channel capacity.
<review> It took me a long time to figure out what is meant here. The new term "response distribution" simply means "conditional distribution" as a function of \(x, y\). Now, we are allowed to vary the marginal distribution of \(X\) (that will alter both the joint distribution and the other marginal) and we seek one that maximizes the mutual information, right? And there is unique such maximizing marginal. And the channel capacity is that maximal value. This should be rewritten more clearly. Especially, it is crucial to understand that a "channel" is NOT the joint distribution only the "system of conditional ones". </review>
The fundamental theorem of information theory states two things:
<review> Please, cite!</review>
- One can achieve information rates arbitrarily close to the channel capacity.
- It is impossible to transmit information almost error-free at a rate higher than the channel capacity.
While the fundamental theorem puts bounds on transmitted information, it does not tell us how to reach those bounds. An active area of research is designing codes that come as close as possible to the limit given by mutual information error correcting codes.
References
- Cover, T.M. and Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons, New York, NY.
- Shannon, C.E. and Weaver, W. (1949). The mathematical theory of communication. University of Illinois Press, Urbana, Illinois.
- Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, NY.
External links
See also
Mutual Information (Wikipedia)
Error correcting codes (Wikipedia)</review>


