# Viterbi algorithm

Curator: Andrew J. Viterbi

The Viterbi Algorithm produces the maximum likelihood estimate of the successive states of a finite-state machine (FSM) from the sequence of its outputs, which have been corrupted by mutually independent interference terms.

## General Problem and Solution

Figure 1 illustrates a generic FSM consisting of an $$L$$-stage shift register driven by a $$Q$$-ary input sequence $$u\ .$$ Consequently, after each shift the FSM dwells in one of $$Q^L$$ states. Corresponding to each state the signal generator produces an output $$x\ ,$$ which is generally a real vector. The corrupting effect, called a channel in communication applications, transforms $$x$$ into $$y\ ,$$ the observables, which by the nature of their formation constitute a random Markov sequence. Note that the terms of $$x$$ constitute a Markov sequence as well whenever the input sequence $$u$$ to the FSM is random, thus inducing a probability distribution on the state transitions.

Figure 2 is the state diagram for an FSM with parameters $$Q=2$$ (binary) and $$L=2\ .$$ The states then correspond to their register contents: $$S_0=00\ ;$$ $$S_1=01\ ;$$ $$S_2=10\ ;$$ $$S_3=11\ .$$ Clearly only a subset of transitions is permissible. From the sequence of observables $$y\ ,$$ we seek the most likely path of transitions through the states of the diagram. In Figure 2, we also designate the FSM outputs on each branch. Thus the branch from $$S_0$$ to $$S_1$$ is labeled with its output $$x_{01}\ .$$ It is also convenient for the description of the algorithm to view the multi-step evolution of the path through the graph by means of a multi-stage replication of the state diagram, known as a trellis diagram. Figure 3 is the trellis diagram corresponding to the example of Figure 2. At the top of Figure 3 are shown the successive branch observables $$y(1), y(2),{\ldots},y(k),\ldots\ .$$ Henceforth we use the notation $$y(k)$$ to denote the observable(s) for the $$k$$th successive branch. Similarly $$S(k)$$ will denote any state at the $$k$$th successive node level (and we shall dispense with subscripts until necessary).

The goal then is to find the most probable path through the trellis diagram. Provided the successive terms $$u$$ of the input sequence are independent, the state transition probabilities $$Pr[S(k-1){\rightarrow}S(k)]$$ are mutually independent for all $$k$$ as are the conditional output probabilities $$p[y(k)|S(k-1){\rightarrow}S(k)]\ .$$ For any given path from the origin ($$k=0$$) to an arbitrary node $$n\ ,$$ $$S(0), S(1),{\ldots}S(n)\ ,$$ the relative path probability (likelihood function) is given by

$L = \prod_{k=1}^n Pr[S(k-1){\rightarrow}S(k)] p[y(k)|S(k-1){\rightarrow}S(k)].$

For computational purposes it is more convenient to consider its logarithm, which is given by the sum,

$\ln(L) = \sum_{k=1}^n m[y(k); S(k-1),S(k)]$

where

$m[y(k); S(k-1),S(k)] = \ln \{Pr[S(k-1){\rightarrow}S(k)]\} + \ln \{p[y(k)|S(k-1){\rightarrow}S(k)]\},$

which is denoted the Branch Metric between any two states at the $$(k-1)$$th and $$k$$th node levels. (Note that the unallowable transitions have zero probability, and hence their logarithms will be negative infinity, taking them out of competition.) We next define the State Metric, $$M_K (S_i)\ ,$$ of the state $$S_i (K)$$ to be the maximum of the accumulated branch metrics over all paths leading from the origin to the $$i$$th state at the $$K$$th node level. Thus, again inserting subscripts where necessary,

$\begin{array}{lcl} M_K (S_i) &=& Max \{ \sum_{k=1}^{K-1} m[y(k);S(k-1),S(k)] + m[y(K); S(K-1), S_i(K)] \} \\ & &\scriptstyle{All~paths~S(0),S(1){\ldots}S(K-1)} \end{array}$

It then follows that to maximize the above sum over $$K$$ terms, it suffices to maximize the sum over the first $$K-1$$ terms for each state $$S_j (K-1)$$ at the $$(K-1)$$th node and then maximize the sum of this and the $$K$$th term over all states $$S (K-1)\ .$$ Thus,

$\begin{array}{lcl} M_K (S_i) &=& Max \{ M_{K-1} (S_j) + m[y(K); S_j (K-1), S_i(K)]\} \\ & &\scriptstyle{S_j (K-1)} \end{array}$

This recursion is known as the Viterbi Algorithm. It is most easily described in connection with the trellis diagram. If we label each branch (allowable transition between states) by its Branch Metric $$m[]$$ and each state at each node level by its State Metric $$M[]\ ,$$ the State Metrics at node level $$K$$ are obtained from those at level $$K-1$$ by adding to each State Metric at level $$K-1$$ the Branch Metrics of the branches which connect it to states at the $$K$$th level, and for each state at level $$K$$ preserving only the largest sum that arrives at it. If additionally at each level we delete all branches other than the one which produces this maximum, there will remain only one path through the trellis leading from the origin to each state at the $$K$$th level, which is the most probable path reaching it from the origin. In typical (though not all) applications, both the initial state (origin) and the final state are fixed to be $$S_0$$ and thus the algorithm produces the most probable path through the trellis both initiating and ending at $$S_0\ .$$
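The recursion can be sketched in code. The following Python is a minimal illustration, not tied to any particular channel: the caller supplies the allowed transitions of the Markov graph and a `branch_metric` function returning $$m[y(K); S_j(K-1), S_i(K)]\ ;$$ all names are illustrative.

```python
def viterbi(n_states, transitions, branch_metric, observations, start=0, end=0):
    """Most probable state path through a trellis.

    transitions: dict mapping state j to the list of allowed successor states i
    branch_metric(y, j, i): the branch metric m[y; S_j, S_i] (a log-probability)
    """
    NEG_INF = float("-inf")
    # State metrics at node level 0: only the origin state is allowed.
    M = [NEG_INF] * n_states
    M[start] = 0.0
    survivors = [[] for _ in range(n_states)]  # surviving path into each state

    for y in observations:
        M_next = [NEG_INF] * n_states
        surv_next = [None] * n_states
        for j in range(n_states):
            if M[j] == NEG_INF:          # state unreachable so far
                continue
            for i in transitions[j]:
                cand = M[j] + branch_metric(y, j, i)  # add branch metric
                if cand > M_next[i]:                  # keep only the largest sum
                    M_next[i] = cand
                    surv_next[i] = survivors[j] + [i]
        M = M_next
        survivors = [s if s is not None else [] for s in surv_next]

    return [start] + survivors[end], M[end]
```

For each observable, every reachable state extends its surviving path along each allowed branch, and each destination state keeps only the largest arriving sum, exactly as described above.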

## Direct Applications

Numerous applications of this algorithm have appeared over the past several decades. We begin with three applications for which the basic FSM structure of Figure 1 is well defined.

### Convolutional Codes

The earliest application, for which the algorithm was originally proposed in 1967, was the maximum likelihood decoding of convolutionally coded digital sequences transmitted over a noisy channel. Currently the algorithm forms an integral part of the majority of wireless telecommunication systems, involving both satellite and terrestrial mobile transmission. For the convolutional encoder, the signal generator of Figure 1 is a linear matrix of $$s$$ modulo-2 adders, each of which adds together the contents of some subset of the $$L$$ shift register stages, thus forming $$s$$ binary symbols. This addition operation may be performed after each shift of the register, thus producing a rate $$1/s$$ code, or only once every $$r<s$$ shifts to produce a rate $$r/s$$ code. These may be serially transmitted, for example, as binary amplitude modulation ($$x=+1$$ or $$-1$$) of a carrier signal. At the receiver, the demodulator generates an output $$y\ ,$$ which is either a real number or the result of quantizing the latter to one of a finite set of values. The conditional densities $$p(y|x)$$ of the channel outputs are assumed to be mutually independent, corresponding to a memoryless channel. The most commonly treated example is the additive white gaussian noise (AWGN) channel, for which each $$y$$ is the sum of the encoded symbol $$x$$ and a gaussian random noise variable, with all noise variables mutually independent. This channel model is closely approximated by satellite and space communication applications and, with appropriate caution, it can also be applied to terrestrial communication design.

Thus the 4-state ($$L=2$$) state diagram of Figure 2 applies to a convolutional code of rate $$1/s\ .$$ Since only one input bit enters the register each time, each state has exactly two branches exiting it and two entering it, connecting it to two other states in each direction. Of the exiting branches, one corresponds to a zero entering the register and the other to a one. It is generally assumed that all input bits are equally likely to be a zero or a one, so the state transition probabilities are $$P(S_j{\rightarrow}S_i) =1/2$$ for each branch. Hence the first term of the branch metric $$m( )$$ can be omitted since it is the same for each branch. As for the second term of $$m( )\ ,$$ the conditional probability density $$p(y|S_j{\rightarrow}S_i) = p(y|x)\ ,$$ where $$x$$ is the $$s$$-dimensional binary vector generated by the $$s$$ modulo-2 adders for each new input bit, which corresponds to a state transition, while $$y$$ is the random vector of the $$s$$ noise-corrupted channel outputs corresponding to $$x\ .$$ For the AWGN channel, $$ln~p(y|x)$$ is proportional to the inner product of the two vectors $$x$$ and $$y\ .$$
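As a concrete sketch, the encoder below implements a rate-$$1/2$$ code with $$L=2$$ memory stages; the particular tap sets (7 and 5 in octal, a common textbook choice) are an assumption, not something fixed by the article. The accompanying metric function uses the inner-product form of the AWGN branch metric noted above.

```python
def conv_encode(bits, taps=((1, 1, 1), (1, 0, 1))):
    """Rate-1/2 convolutional encoder with L = 2 memory stages.
    The tap sets (7 and 5 octal) are an illustrative, common choice."""
    state = [0, 0]                       # shift-register contents
    out = []
    for u in bits:
        reg = [u] + state                # current input plus register
        for g in taps:                   # one modulo-2 adder per output symbol
            out.append(sum(b * t for b, t in zip(reg, g)) % 2)
        state = [u] + state[:-1]         # shift the register
    return out

def awgn_branch_metric(y, x):
    """For the AWGN channel, ln p(y|x) equals the inner product <y, x>
    up to terms common to all branches, so the inner product serves as
    the branch metric (x uses antipodal symbol values +1/-1)."""
    return sum(yi * xi for yi, xi in zip(y, x))
```

In a decoder, each trellis branch would be labeled with the $$s=2$$ antipodal symbols its transition produces, and `awgn_branch_metric` would score them against the pair of demodulator outputs for that branch.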

To generalize to any rational rate $$r/s<1\ ,$$ $$r$$ input bits enter each time and the register shifts in blocks of $$r\ .$$ The Markov graph changes only in having each state connected to $$2^r$$ other states. Note that if $$r=L$$ any state can be reached from any other state and the Markov graph becomes fully connected. Another generalization is to map each binary vector $$x\ ,$$ not into a vector of $$s$$ binary values, +1 or –1, but into a constellation of points in 2 or more dimensions. An often employed case is quadrature amplitude modulation (QAM). For example, for $$s=4\ ,$$ sixteen points may be mapped into a two-dimensional grid, and the value in each dimension modulates the amplitude of one of the two quadrature components of the sinusoidal carrier. Here $$x$$ is the 2-dimensional vector representing one of the sixteen modulating values and $$y$$ is the corresponding demodulated channel output. Multiple generalizations of this approach abound in the literature and in real applications. In most cases this multidimensional approach is used to conserve bandwidth at the cost of a higher channel signal-to-noise requirement.
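A concrete sketch of the $$s=4$$ constellation mapping follows; the particular bit-to-level assignment (Gray-coded amplitude levels) is an illustrative assumption, not prescribed by the text.

```python
def qam16_map(bits):
    """Map 4 bits to one 16-QAM constellation point (I, Q).
    Each pair of bits selects one of four Gray-coded amplitude levels;
    this particular assignment is illustrative, not prescribed here."""
    level = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}
    return (level[tuple(bits[0:2])], level[tuple(bits[2:4])])
```

Each returned coordinate modulates the amplitude of one of the two quadrature carrier components, so four code bits are conveyed per two-dimensional channel symbol.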

An interesting footnote on this first application is that the Viterbi Algorithm was proposed not so much to develop an efficient maximum likelihood decoder for a convolutional code, but primarily to establish bounds on its error correcting performance.

### MLSE Demodulators for Intersymbol Interference and Multipath-Fading Channels

In the previous application, the convolution operation is employed in order to introduce redundancy for the purpose of increasing transmission reliability. But convolution also occurs naturally in physical channels whose bandwidth constraints linearly distort the transmitted digital signal. Treating the channel as a linear filter, it is well known that the output signal is the convolution of the input signal and the filter’s impulse response. A discrete model of the combination of signal waveform, channel effects and receiver filtering is shown in Figure 4. This combination produces, after sampling at the symbol rate, the discrete convolution $$x_k=\sum_{j=0}^m h_j u_{k-j}\ ,$$ where the $$h_j$$ are real numbers and the random variables $$u_k$$ are the input bits, generally taken to be binary (+A or –A). The $$h_j$$ terms are called the “Intersymbol Interference” coefficients, since except for $$h_0\ ,$$ all the other terms of the sum represent interference by preceding symbols on the given symbol. To the output of the discrete filter $$x_k$$ must be added noise variables $$n_k$$ to account for the channel noise. While generally these noise variables are not mutually independent, they can be made so by employing at the receiver a so-called “whitened matched filter” prior to sampling.
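The discrete convolution above is straightforward to compute; the sketch below assumes the register starts empty (all zeros), so the earliest outputs see fewer interfering terms.

```python
def isi_output(u, h):
    """Discrete ISI filter: x_k = sum_j h_j * u_{k-j}.
    u: input symbols (e.g. +A/-A), h: intersymbol interference taps.
    Terms with k - j < 0 are dropped, i.e. a zero initial register."""
    return [sum(h[j] * u[k - j] for j in range(len(h)) if k - j >= 0)
            for k in range(len(u))]
```

In a maximum likelihood sequence estimator, this function would generate the noiseless branch output for each hypothesized register content, to be compared against the whitened, sampled receiver output.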

A related application with a very similar model is that of multipath-fading channels. Here the taps represent multiple delays in a multipath channel. Their relative spacing may be less than one symbol period, in which case each symbol of the input sequence must be repeated several times. Most important, because of random variations in propagation, the multipath coefficients $$h$$ are now random variables, so the conditional densities of $$y$$ depend on the statistics of both the additive noise and the multiplicative random coefficients.

Comparing this application with the previous one of convolutional codes, we note that the principal difference is that modulo-2 addition is replaced by real addition and there is one output rather than $$s$$ for each branch. Otherwise, the same Markov graph applies, with two branches emanating from each state when the input sequence is binary (+A or –A). Generation of the branch metrics, $$ln~p(y|S_j{\rightarrow}S_i)\ ,$$ for multipath fading is slightly more complex because besides depending on the noise, the $$y$$ variables depend on a linear combination of the $$h$$ random variables, with their signs determined by the register contents involved in the branch transition.

### Partial Response Maximum Likelihood Decoders for Recorded Data

A filter model for the magnetic recording medium and reading process is very similar to the Intersymbol Interference model. The simplest version, known as the Partial Response channel, results, when sampled at the symbol rate, in an output depending on just the difference between two input symbols, $$x_k= u_k-u_{k-2}\ ,$$ with the additive noise samples also being mutually independent. Note that since the inputs $$u$$ are +1 or –1, the outputs $$x$$ (prior to adding noise) are ternary: +2, 0, or –2. This then reduces to the model of Figure 4 with just two non-zero taps, $$h_0= +1$$ and $$h_2= -1\ .$$ Often this is described by the polynomial whose coefficients are the tap values; in this case $$h(D) = 1-D^2\ ,$$ which can be modeled by a 2-stage shift register giving rise to a 4-state Markov graph, as in Figure 2. But actually a simpler model can be used, based on the fact that all outputs for which $$k$$ is odd depend only on odd-indexed inputs, and similarly for even. Thus a 2-state Markov graph suffices for each of the odd and even subsets. When the recording density is increased, the simple Partial Response channel model is replaced by a longer shift register. A generally accepted model has tap coefficients represented by the polynomial $$h(D)= (1-D)(1+D)^N\ ,$$ where $$N\ge1\ .$$ For example, for the case of $$N=2\ ,$$ known as “Extended Partial Response”, an eight-state Markov graph applies.
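The noiseless Partial Response output is easy to sketch; a zero initial register state is assumed here for simplicity, so the first two outputs are not yet ternary.

```python
def partial_response(u):
    """Partial Response channel output x_k = u_k - u_{k-2} (noise-free).
    With antipodal inputs +1/-1 the outputs are ternary (+2, 0, -2)
    once the register is filled; a zero initial state is assumed."""
    return [u[k] - (u[k - 2] if k >= 2 else 0) for k in range(len(u))]
```

Note how outputs at even $$k$$ depend only on even-indexed inputs and odd outputs only on odd-indexed inputs, which is why the two subsequences decouple into separate 2-state graphs.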

## Other Applications: Hidden Markov Models

The Markov nature of the preceding three applications is obvious from the system model. Many more applications of the algorithm appear in the literature of numerous fields ranging from signal processing to genetics. In such cases the term Hidden Markov Model (HMM) is used. Such models are often empirically derived based on experience in the given discipline. The best known and most effective results were obtained for Speech Recognition and for DNA Sequence Alignment. Since background in each field is a prerequisite for full understanding, we only comment briefly on the common requirements for establishing the Markov models by a procedure known as Baum-Welch estimation. Initially a Markov model is postulated, along with estimated state transition probabilities and the probabilities of observables conditioned on state transitions. Then, given a sequence of observations, the likelihood of the observation sequence is computed for the postulated model. At the same time, new estimates of the state transition probabilities and the observable conditional probabilities are developed, and from these a new likelihood function is derived. The procedure is repeated as long as each new likelihood exceeds the previous one. When it ceases to increase, the model is accepted and the most likely state sequence is obtained by using the Viterbi Algorithm.
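One re-estimation step of this procedure can be sketched for a discrete-output HMM. This is a minimal, unoptimized illustration (no numerical scaling for long sequences), and it conditions observables on states rather than on state transitions, a common simplification of the form described above; `A` holds the state transition probabilities, `B` the conditional observable probabilities, and `pi` the initial distribution, all illustrative names.

```python
def forward(A, B, pi, obs):
    """Forward probabilities alpha[t][i] = P(obs[0..t], state_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(n)) * B[i][y]
                      for i in range(n)])
    return alpha

def backward(A, B, obs, n):
    """Backward probabilities beta[t][i] = P(obs[t+1..] | state_t = i)."""
    beta = [[1.0] * n]
    for y in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][y] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; returns updated (A, B, pi) and the
    likelihood of obs under the *input* parameters."""
    n, T = len(pi), len(obs)
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs, n)
    L = sum(alpha[-1])                                   # observation likelihood
    gamma = [[alpha[t][i] * beta[t][i] / L for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / L
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    A2 = [[sum(xi[t][i][j] for t in range(T - 1)) /
           sum(gamma[t][i] for t in range(T - 1)) for j in range(n)]
          for i in range(n)]
    M = len(B[0])
    B2 = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
           sum(gamma[t][i] for t in range(T)) for k in range(M)]
          for i in range(n)]
    return A2, B2, gamma[0], L
```

Iterating this step and stopping when the returned likelihood ceases to increase implements the acceptance rule described above, after which the Viterbi Algorithm is run on the accepted model.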