# Fano inequality

Post-publication activity

Curator: Robert Mario Fano

The inequality that became known as the Fano inequality pertains to a model of communications system in which a message selected from a set of $$N$$ possible messages is encoded into an input signal for transmission through a noisy channel and the resulting output signal is decoded into one of the same set of possible messages. Successive input messages are assumed to be statistically independent.

## Origin

Let $$X$$ be the set of input messages, $$Y$$ the corresponding set of output messages and $$P(x_k , y_j) = P(y_j) P(x_k \mid y_j )$$ be the joint probability of message $$x_k$$ being transmitted and message $$y_j$$ being received. An error occurs when the output message differs from the input message, that is when $$j \ne k\ .$$ Let $$P(e)$$ be the probability of this event and $$H(E)= -P(e) \log P(e) - (1- P(e)) \log (1- P(e))$$be the associated binary entropy.

The conditional entropy $$H(X \mid Y) = \sum_{j} P(y_j) H(X \mid y_j )$$ represents the average amount of information lost because of the channel noise and is referred to as the “equivocation”. It is the average value of

$\tag{1} H(X \mid y_j ) = - \sum_{k} P(x_k \mid y_j ) \log P(x_k \mid y_j )$

which is the average amount of information still needed to specify the input message when a particular output message $$y_j$$ is specified.

The Fano inequality is

$\tag{2} H(X \mid Y) \le H(E) + P(e) \log (N-1)$

It first appeared as Eq. 4.35 in the 1953 edition of the lecture notes distributed to M.I.T. students in the graduate subject Statistical Theory of Information, and, later on, as Eq, 6.16 in the final textbook published in 1961 (Fano, 1961).

The inequality resulted from an early attempt to relate the equivocation, the information theory measure of the effect of channel noise, to the probability of error, the traditional practical measure. It was originally based on the following heuristic argument.

The equivocation, because of its very definition, cannot exceed the information provided about the input messages by any procedure capable of correcting output errors. The first step in correcting output errors is to specify for each input message whether the corresponding output message is correct or incorrect. The amount of information provided by this specification is given by the binary entropy $$H(E)\ ,$$ the first term on the right hand side of (2). Then, for each incorrect output message the correct input message must be identified. The number of possible input messages is $$N-1$$ because the output message is known to be incorrect. Thus the amount of information necessary to identify the input message cannot exceed $$\log(N-1)\ ,$$ and the corresponding average amount of information to be provided cannot exceed $$P(e) \log(N-1)\ ,$$ the second term on the right hand side of (2).

## Derivation

The inequality may be formally derived as follows. Let

$P(e | y_j ) = \sum_{k \ne j} P(x_k \mid y_j )$

be the probability of error when $$y_j$$ is the output message, that is, when the input message is any $$x_k$$ with $$k \ne j\ .$$ Then,

$1 - P(e \mid y_j) = P(x_j \mid y_j)$

is the probability of the output message being correct, that is, of the input message being $$x_j\ ,$$ the one corresponding to the given output message $$y_j\ .$$ The corresponding binary entropy is

$H(E \mid y_j) = - P(e \mid y_j) \log P(e \mid y_j) - (1- P(e \mid y_j)) \log (1- P(e \mid y_j))$

The probability of message $$x_k$$ being the correct input message when it is known that the output message $$y_j$$ is incorrect is

$P_c (x_k \mid y_j) = P_{k \ne j} (x_k \mid y_j) / P(e \mid y_j)$

Now, (1) can be rewritten in the form

\begin{align} H(X \mid y_j) & = H(E \mid y_j) + P(e \mid y_j) \log P(e \mid y_j) - P(e \mid y_j) \sum_{k \ne j} P_c (x_k \mid y_j) \log P(x_k \mid y_j)\\ & = H(E \mid y_j) - P(e \mid y_j) \sum_{k \ne j} P_c (x_k \mid y_j) \log P_c(x_k \mid y_j)\\ & = H(E \mid y_j) + P(e \mid y_j) H(X_{k \ne j} \mid y_j) \end{align}

where

$H(X_{k \ne j} \mid y_j) = -\sum_{k \ne j} P_c(x_k \mid y_j) \log P_c(x_k \mid y_j)$

is the entropy of an ensemble of $$N-1$$ elements whose value cannot exceed $$\log (N-1)\ .$$ Thus,

$H \left( X \mid y_j \right) \le H(E \mid y_j) + P(e \mid y_j) \log (N-1)$

which, when averaged over the $$Y$$ ensemble, yields

$H(X \mid Y) \le H(E \mid Y) + P(e) \log (N-1)$

which, in turn, yields (2), since $$H(E) \ge H(E \mid Y)\ .$$