# Neural Filtering

Curator: James Ting-Ho Lo

A fundamental problem in a large variety of applications is that of processing a stochastic process, called measurement process, to estimate another stochastic process, called signal process. A neural filter is a neural network that is synthesized with simulated data (if models of the signal and measurement processes are available) or experimental data (if not) to perform such recursive processing. No assumptions such as linear dynamics, Gaussian distribution, additive noise, and Markov property are required. A properly trained neural filter with a proper architecture carries the most “informative” statistics in its dynamical state and approximates the optimal filtering performance to any accuracy.

## Background

Optimal filtering originated from the astronomical studies at the end of the eighteenth century. To determine the six parameters of the planet or comet motion using telescopic measurement data, the least-squares method was discovered by Gauss (1963) in 1795 and independently by Legendre (1806). During World War II, Kolmogorov (1963) in 1941 and Wiener (1949) in 1942 developed linear minimum-variance filters for fire control independently. The Kolmogorov-Wiener filter theory provided the foundation for the subsequent development of the celebrated Kalman filter by Kalman (1960). The Kalman filter processes the measurements streaming in continuously, and updates its estimate of the signal recursively upon the arrival of each measurement.

The Kalman filter is derived under the assumption that the signal and measurement processes satisfy linear equations driven by white Gaussian noises, i.e., linear Gauss-Markov models. Since the signal and measurement processes in most, if not all, applications are nonlinear, they are linearized at the predicted value of the signal for applying the Kalman filter equations. This results in the extended Kalman filter (EKF). The importance of the Kalman filter is reflected by thousands of papers that have appeared since 1960 and by the applications of the EKF in system control, signal processing, robotics, seismology, communication, economics, finance, navigation, target tracking, etc.

However, the EKF (and iterated EKF) is sometimes far from optimal and even divergent. This shortcoming of the EKF motivated an enormous amount of work on optimal nonlinear filtering by the analytic approach for over thirty years, which can be found in (Liptser and Shiryayev, 1977; Anderson and Moore, 1979; Kallianpur and Karandikar, 1988; Bucy and Joseph, 1987), and papers and books referred to in them. Starting with a mathematical model, the analytic approach searches for a solution or approximate solution consisting of equations that describe the structures or characterize the variables of the filter. In the process of searching, deductive reasoning is used and many assumptions are made to make some special cases analytically tractable. It is appropriate to mention here that many approximate nonlinear filters have been obtained, which can be viewed as predecessors of the neural filters. Among them are those derived by approximating non-Gaussian densities by Lo (1969, 1972), those obtained on group manifolds by Lo (1977) and Lo and Eshleman (1979a, 1979b), and those obtained by functional series expansion by Mitter and Ocone (1979), Lo and Ng (1983, 1986, 1987), and Lo (1986). These approximate filters are optimal for the given structure and approach the minimum-variance filter as the number of terms increases.

## Problem formulation and the synthetic approach

In a standard formulation of the optimal filtering problem, the signal process $$x_{n}$$ and measurement process $$y_{n}$$ satisfy the equations: $\tag{1} x_{n+1} = f(x_{n},n)+G(x_{n},n)d_{n}$

$\tag{2} y_{n} = h(x_{n},n)+v_{n}$

where $$x_{1}$$ is a Gaussian random vector, $$d_{n}$$ and $$v_{n}$$ are respectively white Gaussian noise processes with zero means, and $$f(x_{n},n)\ ,$$ $$G(x_{n},n)$$ and $$h(x_{n},n)$$ are known functions. All the processes and functions are of compatible dimensions in the above equations. These equations imply such assumptions as the Markov property, Gaussian distribution, and additive measurement noise.
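As an illustration of how realizations of Eq. (1) and Eq. (2) might be generated by computer simulation, the following is a minimal scalar sketch. The particular functions `f`, `G` and `h`, the noise standard deviations, and the helper name `simulate` are assumptions chosen for the example and are not taken from the literature cited here.

```python
import math
import random

def simulate(f, G, h, n_steps, x1, d_std=1.0, v_std=1.0, rng=None):
    """Generate one scalar realization (x_1..x_N, y_1..y_N) of Eqs. (1)-(2)."""
    if rng is None:
        rng = random.Random()
    xs, ys = [], []
    x = x1
    for n in range(1, n_steps + 1):
        y = h(x, n) + rng.gauss(0.0, v_std)            # Eq. (2): measurement
        xs.append(x)
        ys.append(y)
        x = f(x, n) + G(x, n) * rng.gauss(0.0, d_std)  # Eq. (1): signal update
    return xs, ys

# Hypothetical model functions, chosen only for illustration.
def f(x, n):
    return 0.9 * x + 0.2 * math.sin(x)

def G(x, n):
    return 0.5

def h(x, n):
    return math.atan(x)

rng = random.Random(0)
x_initial = rng.gauss(0.0, 1.0)   # x_1 is a Gaussian random variable
xs, ys = simulate(f, G, h, n_steps=50, x1=x_initial, rng=rng)
```

Repeating the call with fresh random draws yields a collection of signal and measurement realizations of the kind the synthetic approach works from.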

In a most general formulation of the optimal filtering problem, the signal and measurement processes are virtually any two processes, $$x_{n}$$ and $$y_{n}\ .$$ In both this general and above standard formulation, the problem is to design and make a discrete-time dynamical system that inputs $$y_{n}$$ and outputs an estimate $$\hat{x}_{n}$$ of $$x_{n}$$ at each time $$n=1,2,\cdots ,N\ ,$$ which minimizes a given estimation error criterion. Here $$N$$ is a positive integer or infinity. The dynamical system is called an optimal filter with respect to the given estimation error criterion. The dynamical state of the optimal filter at a time $$n_{1}$$ must carry the optimal conditional statistics given all the measurements $$y_{n}$$ that have been received up to and including the time $$n_{1}$$ so that at the next time $$n_{1}+1\ ,$$ the optimal filter will receive and process $$y_{n_{1}+1}$$ using the optimal conditional statistics from $$n_{1}\ ,$$ and produce the optimal estimate $$\hat{x}_{n_{1}+1}\ .$$ The most widely used estimation error criterion is the mean squared error criterion, $$E[\left\Vert x_{n}-\hat{x}_{n}\right\Vert ^{2}]\ ,$$ where $$E$$ and $$\left\Vert \cdot \right\Vert$$ denote the expectation and the Euclidean norm respectively. The estimate $$\hat{x}_{n}$$ that minimizes this criterion is called the minimum-variance estimate.

The optimal filtering problem for virtually any signal and measurement processes, $$x_{n}$$ and $$y_{n}\ ,$$ was solved by a synthetic approach using recurrent neural networks as virtually optimal filters by James T. Lo (1992a,1992b, 1994) and James T. Lo and Lei Yu (2004). Instead of deriving equations based on a mathematical model such as Eq. (1) and Eq. (2), the synthetic approach synthesizes data collected by computer simulations (if a good mathematical model is available) or actual experiments (if not) into a recursive filter whose output approximates the optimal estimate of the signal process to any preselected degree of accuracy with respect to a given estimation error criterion.

Other results on the synthetic approach to optimal filtering have also been reported: A neural filtering problem was reduced to a nonlinear programming problem by Parisini and Zoppoli (1994). A noise-free signal process was estimated using a feedforward or finite-memory neural network by Alessandri et al. (1999). Finite-memory filtering using neural networks was reported by Alessandri et al. (1997) and Parisini (1997). Stubberud et al. (1995, 1998) presented an adaptive EKF and EKF implementations using neural networks. An adaptive nonlinear filter, which mimics the Kalman filter's two-step prediction-update scheme, was proposed using neural networks with all their weights adjusted online for adaptation by Parlos et al. (2001).

Radial basis functions (RBFs) were used by Elanayar and Shi (1994) and Haykin et al. (1997) for nonlinear filtering. Haykin et al. (1997) is also a good survey of optimal filtering and a comparison of their RBF filters and the recursive neural filters of Lo (1992, 1994).

## Recurrent multilayer perceptrons

Many neural network paradigms have been used for filtering. However, not all of them achieve the best filtering and computational performances. At present, recurrent multilayer perceptrons (RMLPs) are believed to be the best and are briefly described here. A comprehensive and detailed description of RMLPs, dynamical range transformers, their differentiation methods, etc. for optimal filtering can be found in (Lo and Yu, 1997).

A multilayer perceptron (MLP) is a feedforward neural network in which each processing node computes a weighted sum of its inputs and then transforms the sum by an activation function such as the hyperbolic tangent or the logistic function. A recurrent multilayer perceptron (RMLP) is an MLP with feedbacks of processing nodes' activation levels after a unit-time delay to processing nodes in the same or a lower layer. An RMLP is a dynamical system with exogenous inputs, whose dynamical state at any given time consists of the activation levels fed back from the preceding time.

Two types of RMLP have been used for optimal filtering, namely MLPs with interconnected nodes (MLPWINs) and MLPs with output feedbacks (MLPWOFs). MLPWINs have only feedbacks to the same layer. The only feedbacks in MLPWOFs are those from the output (i.e., last) layer to the input (i.e., first) layer. For optimal filtering, the activation functions must be monotone increasing, and those in the last layer are usually identity functions.
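To make the recursion concrete, the following is a minimal sketch of the forward pass of an MLPWIN with one fully interconnected hidden layer, hyperbolic tangent hidden nodes and identity output nodes. The function names, the weight layout and the random (untrained) weights are assumptions made for illustration only.

```python
import math
import random

def mlpwin_step(y, state, W_in, W_rec, b_hid, W_out, b_out):
    """One time step: returns (estimate, new hidden state).

    y     : current measurement, list of length N_y
    state : hidden activation levels of the preceding time, list of length M
    """
    M = len(state)
    new_state = []
    for j in range(M):
        s = b_hid[j]
        s += sum(W_in[j][k] * y[k] for k in range(len(y)))
        s += sum(W_rec[j][k] * state[k] for k in range(M))  # unit-delay feedback
        new_state.append(math.tanh(s))                      # monotone increasing
    estimate = [b_out[i] + sum(W_out[i][j] * new_state[j] for j in range(M))
                for i in range(len(W_out))]                 # identity output nodes
    return estimate, new_state

def random_params(n_y, n_x, M, rng):
    """Untrained weights, for illustration only."""
    def u():
        return rng.uniform(-1.0, 1.0)
    W_in = [[u() for _ in range(n_y)] for _ in range(M)]
    W_rec = [[u() for _ in range(M)] for _ in range(M)]
    b_hid = [u() for _ in range(M)]
    W_out = [[u() for _ in range(M)] for _ in range(n_x)]
    b_out = [u() for _ in range(n_x)]
    return W_in, W_rec, b_hid, W_out, b_out

def run_filter(ys, params, M):
    """Feed the measurements in one at a time, starting from a zero state."""
    state = [0.0] * M
    estimates = []
    for y in ys:
        est, state = mlpwin_step(y, state, *params)
        estimates.append(est)
    return estimates

params = random_params(n_y=1, n_x=1, M=4, rng=random.Random(0))
estimates = run_filter([[1.0], [0.0], [0.5]], params, M=4)
```

The hidden activation vector `state` is the dynamical state: it is the only channel through which past measurements can influence the current estimate.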

Although MLPWINs are more popular, MLPWOFs have the following advantages: The estimate $$\hat{x}_{n}$$ of the signal $$x_{n}\ ,$$ which is output by the MLPWOF, is a useful statistic to be fed back. If an a priori estimate $$\hat{x}_{0}$$ of the initial signal $$x_{0}$$ is available and needs to be used to initiate the filtering process, then there must be feedback connections from the output $$\hat{x}_{n}$$ to the input layer for the feeding back of $$\hat{x}_{n}\ .$$ Because these feedbacks are influenced in the synthesis (or training) of the MLPWOF, they are called teacher-influenced feedbacks. For optimal filtering, there must be feedbacks other than the teacher-influenced feedbacks. The feedbacks that are not teacher-influenced and thus do not enter the training criterion directly are called free feedbacks. They are needed to carry current “statistics” that summarize information contained in all the past measurements for updating the estimate of the current signal upon the arrival of the next measurement.

If no a priori estimate $$\hat{x}_{0}$$ of the initial signal $$x_{0}$$ is available, MLPWINs usually require fewer processing nodes than MLPWOFs to achieve the same filtering accuracy. This points to the possibility of including both types of feedbacks to get MLPWINOFs (MLPs with interconnected nodes and output feedbacks) for optimal filtering.

Note that MLPs with a tapped-delay line to hold some past measurements as inputs were also used for filtering. However, such MLPs do not have a dynamical state to carry “statistics” of all the past measurements. Therefore, they are not as effective or efficient as RMLPs for filtering especially if the signal or measurement process is not deterministic.

## Fundamental neural filtering theorems

Theorem 1. Let an $$N_{x}$$-dimensional signal process $$x_{n}$$ and an $$N_{y}$$-dimensional measurement process $$y_{n}\ ,$$ $$n=1,\cdots ,N\ ,$$ be defined on a probability space $$(\Omega ,\mathcal{A},P)\ .$$ Assume that the range $$\Lambda :=\{y_{n}\left( s\right) |n=1,\cdots ,N,s\in \Omega \}\subset R^{N_{y}}$$ is compact and the second moments $$E[\left\Vert x_{n}\right\Vert ^{2}]\ ,$$ $$n=1,\cdots ,N\ ,$$ are finite with respect to $$(\Omega ,\mathcal{A},P)\ .$$ Consider an MLPWIN with $$N_{x}$$ output nodes, whose activation functions are linear, and one hidden layer of $$M$$ nodes, which are fully interconnected and whose activation functions are strictly monotone increasing. Let the $$N_{x}$$-dimensional output vector of the MLPWIN at time $$n$$ be denoted by $$\alpha _{n}(M)$$ after it has received $$y_{i}\ ,$$ $$i=1,\ldots ,n,$$ one by one in the given order at its $$N_{y}$$ input terminals, and let the weights of the MLPWIN be denoted by $$w\ .$$ Then $r(M):=\min_{w}\frac{1}{N}\sum_{n=1}^{N}E[\Vert \alpha_{n}(M)-E[x_{n}|y^{n}]\Vert ^{2}]$ is monotone decreasing and converges to 0 as $$M$$ goes to infinity, where $$E[x_{n}|y^{n}]$$ denotes the conditional expectation of $$x_{n}$$ given $$y^{n}=\{y_{i}\ ,$$ $$i=1,\ldots ,n\}\ .$$

Theorem 2. Let an $$N_{x}$$-dimensional signal process $$x_{n}$$ and an $$N_{y}$$-dimensional measurement process $$y_{n}\ ,$$ $$n=1,\cdots,N \ ,$$ be defined on a probability space $$(\Omega ,\mathcal{A},P)\ .$$ Assume that the range $$\Lambda :=\{y_{n}\left( s\right) |n=1,\cdots ,N,s\in \Omega \}\subset R^{N_{y}}$$ is compact and the second moments $$E[\left\Vert x_{n}\right\Vert ^{2}]\ ,$$ $$n=1,\cdots ,N\ ,$$ are finite with respect to $$(\Omega ,\mathcal{A},P)\ .$$

• Consider an MLPWOF with $$M$$ nodes in a single hidden layer, $$N_{y}(N-1)$$ free output feedbacks, no teacher-influenced output feedback, $$N_{y}$$ external input terminals for receiving $$y_{n}\ ,$$ and $$N_{x}$$ teacher-influenced output terminals for outputting an estimate of $$x_{n}$$ at time $$n\ .$$ The activation functions of the $$M$$ hidden nodes are strictly monotone increasing and those of the output nodes are linear. Let the $$N_{x}$$-dimensional teacher-influenced output vector at time $$n$$ be denoted by $$\alpha _{n}(M)$$ after having received $$y_{i}\ ,$$ $$i=1,\ldots ,n\ ,$$ one by one in the given order at the $$N_{y}$$ external input terminals, and let the weights of the MLPWOF be denoted by $$w\ .$$ Then the sequence $r(M):=\min_{w}\frac{1}{N}\sum_{n=1}^{N}E[\left\Vert \alpha _{n}(M)-E[x_{n}|y^{n}]\right\Vert ^{2}]$ is monotone decreasing to 0 as the number $$M$$ of hidden nodes of the MLPWOF approaches infinity, where $$E[x_{n}|y^{n}]$$ denotes the conditional expectation of $$x_{n}$$ given $$y^{n}=\{y_{i}\ ,$$ $$i=1,\ldots ,n\}\ .$$

• If the estimate $$\alpha _{n}(M)$$ from the teacher-influenced output terminals in the above MLPWOF is fed back to teacher-influenced feedback input terminals after a unit-time delay and an a priori estimate $$\hat{x}_{0}$$ of $$x_{0}$$ is available and loaded into the teacher-influenced feedback input terminals as $$\alpha _{0}(M)$$ for time 1, then the sequence $r(M):=\min_{w}\frac{1}{N}\sum_{n=1}^{N}E[\left\Vert \alpha _{n}(M)-E[x_{n}|\hat{x}_{0},y^{n}]\right\Vert ^{2}]$ is also monotone decreasing to 0 as the number $$M$$ of hidden nodes of the MLPWOF goes to infinity, where $$E[x_{n}|\hat{x}_{0},y^{n}]$$ denotes the conditional expectation of $$x_{n}$$ given $$y^{n}=\{y_{i}\ ,$$ $$i=1,\ldots ,n\}$$ and $$x_{0}=\hat{x}_{0}\ .$$

## Dynamical range reducers and extenders

If the ranges of the signal and measurement expand over time, such as in financial time series prediction, satellite orbit determination, aircraft/ship navigation, and target tracking, or are large relative to the filtering resolution or accuracy required, the sizes of the RMLP and the training data set sometimes need to be unreasonably large. To alleviate this synthesis difficulty and enhance the generalization capability of the synthesized neural filter beyond the length of time for which the training data are available, an RMLP with dynamical range reducers or extenders is used as the neural filter.

A dynamical range reducer of a neural filter is a preprocessor of an RMLP that dynamically transforms at least one component of the measurement process and sends the resulting process to at least one input terminal of the RMLP. A basic scheme for dynamically transforming the $$i$$th component $$y_{in}$$ of a measurement process $$y_{n}$$ is to subtract some estimate $$\hat{y}_{in}$$ of $$y_{in}$$ from $$y_{in}$$ at every time point $$n\ .$$ A scheme that generates a causal estimate $$\hat{y}_{in}$$ is called an auxiliary estimator of $$y_{in}\ .$$ The resulting difference, $$y_{in}-\hat{y}_{in}\ ,$$ is used at time $$n$$ as the $$i$$th component of the input vector to the RMLP. A device that comprises an auxiliary estimator to generate an auxiliary estimate $$\hat{y}_{in}\ ,$$ and a subtractor to perform the subtraction, $$y_{in}-\hat{y}_{in}\ ,$$ is called a dynamical range reducer by estimate subtraction. Three types of dynamical range reducers are range reducers by differencing, range reducers by linear prediction and range reducers by model-aided prediction.
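The estimate-subtraction scheme can be sketched for a single measurement component as follows. This sketch assumes that, in a range reducer by differencing, the auxiliary estimate is simply the preceding measurement, $$\hat{y}_{in}=y_{i(n-1)}\ ;$$ the initial auxiliary estimate `y_hat_1` and the function names are hypothetical.

```python
def reduce_by_differencing(ys, y_hat_1=0.0):
    """Return y_n - y_hat_n with y_hat_n = y_{n-1} (assumed) and y_hat_1 given."""
    diffs = []
    prev = y_hat_1
    for y in ys:
        diffs.append(y - prev)   # estimate subtraction
        prev = y                 # auxiliary estimate for the next time point
    return diffs

def reconstruct(diffs, y_hat_1=0.0):
    """Recover the measurements from the differences and the initial estimate."""
    ys, prev = [], y_hat_1
    for d in diffs:
        prev = prev + d
        ys.append(prev)
    return ys

ys = [100.0, 100.5, 101.25, 102.0]        # a slowly drifting measurement
diffs = reduce_by_differencing(ys, y_hat_1=100.0)
# diffs == [0.0, 0.5, 0.75, 0.75]: a much smaller range than that of ys
```

In this sketch the original measurements can be recovered from the differences only when the starting value is also known, which is why the first measurement matters for the equivalence of the two processes.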

The purpose of the auxiliary estimate is only to reduce the input range of the RMLP. Therefore, the auxiliary estimate does not have to be very accurate. However, notice that the difference process, $$y_{in}-\hat{y}_{in}\ ,$$ $$n=1,2,\ldots \ ,$$ is causally equivalent to the information process, $$y_{in}\ ,$$ $$n=1,2,\ldots \ ,$$ only if $$y_{i1}$$ is used jointly with the difference process. If a dynamical range reducer by estimate subtraction is employed for recursive neural filtering, there are three cases to be considered:

• Case 1. The initial signal, $$x_{1}\ ,$$ is a fixed vector: In this case, the RMLP will learn to integrate this vector into the estimates produced by the RMLP during its training.
• Case 2. The signal process is observable from the difference process $$y_{in}-\hat{y}_{in}\ ,$$ $$n=1,2,\ldots \ :$$ In this case, the estimate produced by the RMLP will still converge to the optimal estimate given the original information process. It only takes a little longer.
• Case 3. A good estimate of the initial signal is available: In this case, we use an MLPWOF into which the estimate of the initial signal can be loaded through the teacher-influenced output feedback terminals of the MLPWOF, as mentioned earlier. If the estimate of the initial signal is good enough to determine a good estimate of the first measurement $$y_{1}\ ,$$ the difference process produced by the dynamical range reducer together with the initial signal estimate is roughly causally equivalent to the measurement process, and the filtering performance of the MLPWOF is roughly the same as that of the optimal filter based on the original measurement process $$y_{n}\ .$$

A dynamical range extender of a neural filter is a postprocessor of an RMLP that dynamically transforms at least one component of the output process of the RMLP into the estimate of the signal process (i.e., the filtering result). A basic scheme for dynamically transforming the output range of an output node, say node $$i$$ in layer $$L\ ,$$ of an RMLP is to add some estimate $$\hat{\hat{x}}_{in}$$ of the desired output $$x_{in}$$ for the same output node to the node's actual output $$\beta _{in}^{L}$$ at every time point $$n\ .$$ The resulting sum, $$\beta _{in}^{L}+\hat{\hat{x}}_{in}\ ,$$ is used as the $$i$$th component $$\hat{x}_{in}$$ of the output vector $$\hat{x}_{n}$$ of the neural filter at time $$n\ .$$ Thus, the “actual desired output” for the output node is $$x_{in}-\hat{\hat{x}}_{in}$$ at time $$n\ ,$$ whose range is expected to be smaller than the range of $$x_{in}\ ,$$ provided that the estimate $$\hat{\hat{x}}_{in}$$ is “good”. The estimate $$\hat{\hat{x}}_{in}$$ will be called an auxiliary estimate of $$x_{in}\ ,$$ and a scheme that generates this estimate will be called an auxiliary estimator. A device that comprises such an auxiliary estimator and an adder will be called a dynamical range extender by estimate addition, which is a dynamical transformer of the output process $$\beta _{in}^{L}\ .$$

For an RMLP with a range extender by estimate addition to approximate the optimal filter to any accuracy, a fundamental requirement for the dynamical range extender is that the estimate $$\hat{\hat{x}}_{in}$$ be a function of the measurements, $$y_{i}\ ,$$ $$i=1,2,\cdots ,n\ ,$$ for $$n=1,2,\cdots ,N\ .$$ Five types of such range extenders by estimate addition, whose auxiliary estimators have different levels of estimation accuracy and different levels of computational cost, are range extenders by accumulation, range extenders by Kalman filtering, range extenders by feedforward Kalman filtering, range extenders by linear prediction, and range extenders by feedforward linear estimation. Neural filters with range extenders by Kalman filtering or by feedforward Kalman filtering actually use the RMLP to estimate and make up for the nonlinear part of the optimal estimate of the signal that the Kalman filter fails to include. This has the advantage of keeping the size of the RMLP small and making it easy to improve existing extended Kalman filters in applications.

## Synthesizing neural filters

For synthesizing (i.e., training) an RMLP with or without dynamical range transformers into a neural filter, a training data set is required that consists of realizations of the signal and measurement processes, which we denote by $$\{(x_{n}\left( s\right) ,y_{n}\left( s\right) ),\ n=1,\cdots ,N,\ s\in S\}\ ,$$ where $$N$$ is a time length believed long enough for the realizations to represent all possible dynamics of the signal and measurement processes. It is assumed that the set $$S$$ is a random sample and adequately reflects the joint probability distributions of the signal and measurement processes $$x_{n}$$ and $$y_{n}\ .$$

In general, the target (i.e., desired) outputs used in constructing an error criterion for training a neural network are what the neural network outputs are intended to approximate. If an RMLP is used in a minimum-variance neural filter, the target output at time $$n$$ should then be the minimum-variance estimate of the signal at the same time, which is known to be the conditional expectation $$E[x_{n}|y^{n}]$$ (or $$E[x_{n}|\hat{x}_{0},y^{n}]$$). However, such minimum-variance estimates are difficult, if not impossible, to obtain. Fortunately, it has been shown that the signals themselves can be used instead of their minimum-variance estimates, yielding the following mean-squared error criterion: $\tag{3} Q(w)=\frac{1}{N\left\vert S\right\vert }\sum_{s\in S}\sum_{n=1}^{N}\left\Vert x_{n}\left( s\right) -\alpha _{n}\left( s,w\right) \right\Vert ^{2}$

where $$\alpha _{n}\left( s,w\right)$$ denotes the output of the RMLP with weights $$w$$ and with or without dynamical range transformers after it inputs the measurement realization $$y^{n}\left( s\right) =\{y_{i}\left( s\right) |1\leq i\leq n\}$$ one at a time in the given order.
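One way to see why the signals can stand in for their minimum-variance estimates is the standard orthogonality argument, sketched here under the assumption of finite second moments and for any filter output $$\alpha _{n}$$ that is a function of $$y^{n}$$ only. Expanding the squared norm around the conditional expectation gives

$E[\Vert x_{n}-\alpha _{n}\Vert ^{2}]=E[\Vert x_{n}-E[x_{n}|y^{n}]\Vert ^{2}]+E[\Vert E[x_{n}|y^{n}]-\alpha _{n}\Vert ^{2}]$

because the cross term vanishes:

$E[(x_{n}-E[x_{n}|y^{n}])^{T}(E[x_{n}|y^{n}]-\alpha _{n})]=0$

The second factor in the cross term is a function of $$y^{n}\ ,$$ and the estimation error $$x_{n}-E[x_{n}|y^{n}]$$ is uncorrelated with every such function. The first term on the right of the decomposition does not depend on the weights $$w\ ,$$ so minimizing the expected squared deviation from the signal is equivalent to minimizing the expected squared deviation from the conditional expectation. The criterion $$Q(w)$$ in Eq. (3) is the sample average that approximates the former expectation.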

The weights $$w$$ are determined by minimizing $$Q\left( w\right)$$ with respect to $$w\ .$$ Since $$Q\left( w\right)$$ usually has many poor local minima, avoiding such local minima has been a major problem in the neural network approach and has caused much objection to the approach. Fortunately, the problem, which is referred to as the local-minimum problem, has been solved effectively by the convexification method by Lo (2001, 2006).

The idea is to transform the mean-squared error criterion (3) into the risk-averting error criterion, $\tag{4} J_{\lambda }(w)=\frac{1}{N\left\vert S\right\vert }\sum_{s\in S}\sum_{n=1}^{N}\exp \left( \lambda \left\Vert x_{n}\left( s\right) -\alpha _{n}\left( s,w\right) \right\Vert ^{2}\right)$

It has been proven by Lo (2001) that the convexity region of $$J_{\lambda }(w)$$ expands monotonically as $$\lambda$$ increases, creating tunnels in the weight space through which a local search optimization method (e.g., quasi-Newton or conjugate gradient) can escape poor local minima. The convexification method of training a neural network consists of two phases: the convexification phase and the deconvexification phase. In the convexification phase, starting with a very small number, the value of $$\lambda$$ is increased by a certain small percentage whenever the local search optimization method seems to get bogged down in a local minimum, and this is repeated until a further increase of $$\lambda$$ does not reduce $$J_{\lambda }(w)\ .$$ In the deconvexification phase, the value of $$\lambda$$ is gradually decreased to near zero, and then, starting with the resulting value of $$w\ ,$$ $$Q(w)$$ is used as the error criterion to get its final minimizer. A detailed description of the convexification phase can be found in Lo and Bassu (2001).

## Accommodative property of neural filters

For the analytic approach to optimal filtering, the signal process must comprise all the uncertain processes involved except the noises and must be Markovian. The signal process in the standard formulation, Eq. (1), is such a process. However, in the synthetic approach, the signal process consists of only those processes whose estimates are needed as the output of the filter. For example, suppose we are only interested in getting an estimate of a scalar-valued process $$x_{1n}$$ that is the first component of a $$2$$-dimensional Markov process described by, for $$n=1,2,\cdots ,N-1\ ,$$ $\left[ \begin{array}{c} x_{1\left( n+1\right) } \\ x_{2\left( n+1\right) } \end{array} \right] =\left[ \begin{array}{cc} 0.5 & 1 \\ 1 & -\theta \end{array} \right] \left[ \begin{array}{c} x_{1n} \\ x_{2n} \end{array} \right] +\left[ \begin{array}{c} \xi _{1n} \\ \xi _{2n} \end{array} \right] ,$ where $$\xi _{n}=[\xi _{1n},\xi _{2n}]^{T}$$ is a standard 2-dimensional white Gaussian sequence with mean 0 and covariance $$E[\xi _{n_{1}}\xi _{n_{2}}^{T}]=\delta _{n_{1}n_{2}}I_{2}\ ,$$ $$I_{2}$$ being the $$2\times 2$$ identity matrix and $$\delta _{n_{1}n_{2}}$$ the Kronecker delta, $$\theta$$ is an uncertain environmental parameter, and the initial state $$x_{0}$$ is a Gaussian random vector with mean 0 and covariance $$\Sigma _{0}$$ and statistically independent of $$\xi _{n}$$ for all $$n\ .$$ Assume that the measurement process $$y_{n}$$ is described by $$y_{n}=x_{2n}+\varepsilon _{n}\ ,$$ where $$\varepsilon _{n}$$ is a scalar-valued white Gaussian sequence with mean 0 and variance 1 and statistically independent of the Markov process $$x_{n}\ .$$

In applying the analytic approach to estimating $$x_{1n}\ ,$$ we need to include $$x_{1n}\ ,$$ $$x_{2n}$$ and $$\theta$$ in the signal process. In contrast, in applying the synthetic approach, $$x_{1n}$$ alone constitutes the signal process. The neural filter inputs $$y_{n}$$ and outputs only the estimate $$\hat{x}_{1n}$$ of $$x_{1n}\ .$$ Notice that $$x_{1n}$$ by itself is not a Markov process, and $$\theta$$ is uncertain. Yet, by Theorem 1 or Theorem 2, the estimate $$\hat{x}_{1n}$$ approaches the minimum-variance estimate $$E\left[ x_{1n}|y^{n}\right]$$ (or $$E[x_{1n}|\hat{x}_{0},y^{n}]$$) as time $$n$$ increases. Here $$\theta$$ does not need to be estimated. Since $$y^{n}$$ is the only information about $$x_{1n}$$ at time $$n\ ,$$ $$E\left[ x_{1n}|y^{n}\right]$$ (or $$E[x_{1n}|\hat{x}_{0},y^{n}]$$) is the best estimate with respect to the mean-squared error criterion, Eq. (3). This shows that the neural filter actually adapts to the uncertain environmental parameter $$\theta \ .$$ Notice that the weights of the neural filter are held fixed and not adjusted online for adaptation. This is a very important advantage because the signal is usually not available online for training the neural filter.
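Training data for this example could be generated along the following lines. In this sketch the environmental parameter $$\theta$$ is drawn once per realization from a uniform distribution, the initial covariance $$\Sigma _{0}$$ is taken to be a multiple of the identity, and the recursion is started from $$x_{0}\ ;$$ all three choices, as well as the function names, are assumptions made only for illustration.

```python
import random

def simulate_realization(n_steps, theta, sigma0=1.0, rng=None):
    """One realization of (x_{1n}, y_n), n = 1..n_steps, for a fixed theta."""
    if rng is None:
        rng = random.Random()
    x1 = rng.gauss(0.0, sigma0)   # initial state x_0 (assumed covariance sigma0^2 * I)
    x2 = rng.gauss(0.0, sigma0)
    signal, measurements = [], []
    for _ in range(n_steps):
        # State update of the 2-dimensional Markov process; the right-hand side
        # is evaluated with the old values of x1 and x2.
        x1, x2 = (0.5 * x1 + x2 + rng.gauss(0.0, 1.0),
                  x1 - theta * x2 + rng.gauss(0.0, 1.0))
        signal.append(x1)                                # only x_{1n} is the signal
        measurements.append(x2 + rng.gauss(0.0, 1.0))    # y_n = x_{2n} + eps_n
    return signal, measurements

def make_training_set(n_realizations, n_steps, theta_range=(0.2, 0.8), seed=0):
    """Realizations with theta varying across them; theta itself is not stored."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_realizations):
        theta = rng.uniform(*theta_range)   # hypothetical range of the parameter
        data.append(simulate_realization(n_steps, theta, rng=rng))
    return data

training_set = make_training_set(n_realizations=5, n_steps=10)
```

Neither $$x_{2n}$$ nor $$\theta$$ appears in the stored data: the filter is trained only on pairs of signal and measurement realizations, with the variation of $$\theta$$ reflected in the sample.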

In engineering, the word “adaptive” in terms such as “adaptive controller,” “adaptive filter,” and “adaptive beamformer” usually describes a processor whose parameters are adjusted online to adapt to an uncertain environmental parameter. Therefore, the word “accommodative” is used to describe the capability of the neural filter to adapt without having its weights adjusted online. This accommodative capability of the neural filter was first observed by Lo (1994) and Lo and Yu (1995). It was proven by Lo and Nave (2008) that if $$\theta _{n}$$ is an environmental process observable from the measurement process $$y_{n}\ ,$$ then the neural filter output actually approaches $$E\left[ x_{1n}|y^{n},\theta ^{n}\right]$$ (or $$E[x_{1n}|\hat{x}_{0},y^{n},\theta ^{n}]$$), the estimate that would be obtained if $$\theta ^{n}$$ were precisely given.

## Robust neural filters

Each squared deviation $$\left\Vert x_{n}\left( s\right) -\alpha _{n}\left( s,w\right) \right\Vert ^{2}$$ is exponentiated in the risk-averting error criterion $$J_{\lambda }\left( w\right)$$ in Eq. (4). Larger deviations are therefore emphasized much more than smaller deviations. Minimizing $$J_{\lambda }\left( w\right)$$ therefore has the effect of avoiding large or disastrous deviations, to a degree that depends on the magnitude of the risk-sensitivity index $$\lambda \ .$$ In fact, it is proven in (Lo 1996) and (Lo 2001), respectively, that $$\frac{1}{\lambda }\ln J_{\lambda}\left( w\right)$$ approaches the mean-squared error criterion as $$\lambda$$ goes to zero and approaches the minimax error criterion as $$\lambda$$ goes to infinity: $\lim_{\lambda \rightarrow 0}\frac{1}{\lambda }\ln J_{\lambda }\left( w\right) = Q\left( w\right)$ $\lim_{\lambda \rightarrow \infty }\arg \min_{w}J_{\lambda }\left( w\right) = \arg \inf_{w}\max_{s,n}\left\Vert x_{n}\left( s\right) -\alpha _{n}\left( s,w\right) \right\Vert ^{2}$ These properties show that there is a spectrum of error criteria that induce a spectrum of robustness in the neural filter. The degree of robustness increases as $$\lambda$$ increases from $$0$$ to $$\infty \ .$$
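The two ends of this spectrum can be checked numerically for a fixed set of squared deviations. The sketch below evaluates $$\frac{1}{\lambda }\ln J_{\lambda }$$ with the largest deviation factored out of the exponentials to avoid overflow; the deviation values and function names are hypothetical.

```python
import math

def q_criterion(devs):
    """Mean-squared error criterion (3), given the squared deviations."""
    return sum(devs) / len(devs)

def scaled_log_risk(devs, lam):
    """(1/lam) * ln J_lam for the risk-averting criterion (4).

    Uses J_lam = exp(lam * m) * mean(exp(lam * (d - m))) with m = max(devs),
    so that every exponent is non-positive and cannot overflow.
    """
    m = max(devs)
    s = sum(math.exp(lam * (d - m)) for d in devs)
    return m + math.log(s / len(devs)) / lam

# Hypothetical squared deviations ||x_n(s) - alpha_n(s, w)||^2 for some fixed w,
# including one large outlier.
devs = [0.1, 0.2, 0.05, 4.0]

small = scaled_log_risk(devs, 1e-6)   # close to the mean of devs
large = scaled_log_risk(devs, 1e3)    # close to the largest deviation
```

For small $$\lambda$$ the value is close to the mean of the deviations, and for large $$\lambda$$ it is dominated by the single largest deviation, which is the sense in which larger $$\lambda$$ puts more weight on avoiding the worst case.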

## Neural filters versus particle filters and unscented Kalman filters

Particle filters and unscented Kalman filters, described in (Doucet et al., 2001) and (Julier et al., 2000, 2001) respectively, also perform better than the extended Kalman filter. The unscented Kalman filter is simple, effective and computationally efficient. However, it is a suboptimal filter whose accuracy relative to that of the optimal filter is hard to analyze. Its adaptive version, the dual unscented Kalman filter by Eric Wan, shares similar advantages and shortcomings. The particle filter, whose performance can in theory be made arbitrarily close to that of the optimal filter, has received the most attention. Nevertheless, it performs Monte Carlo online and thus involves an excessive amount of online computation. Moreover, for adaptive filtering, the particle filter has to estimate all the uncertain environmental parameters online, which increases the dimensionality of the process to be estimated and requires many more “particles” (i.e., point masses) to represent the conditional probability density, which in turn requires much more online computation to update these particles. Both the particle filters and unscented Kalman filters are applicable only if mathematical models of the signal and measurement processes are explicitly given.

As neural filters are synthesized from realizations of the signal and measurement processes, they are applicable whether or not the mathematical models are available. The synthesis can be viewed as Monte Carlo performed offline prior to the deployment of the filter. Much like the Kalman filter, the synthesized neural filter is a dynamical system that processes the measurement process to update its own dynamical state and outputs a virtually optimal estimate of the signal process. No Monte Carlo or data synthesis is performed online, and no augmentation of the signal process is necessary. Therefore, neural filters are computationally efficient. As discussed earlier, neural filters with fixed weights have adaptive capability. Neural filters to be developed are expected to be able to adapt to nonobservable environmental parameters. It is also worth mentioning that neural filters are more suitable for VLSI implementation. More discussions on the advantages of adaptive neural filtering over particle filtering can be found in (Feldkamp and Prokhorov, 2003; Lo and Yu, 2004).

In spite of the mentioned advantages of the standard neural filters over the particle and unscented Kalman filters, the standard neural filters have not received as much attention. A main reason is the difficulty in training an RMLP. P. Werbos of the National Science Foundation said (in one of his presentations available on his website pwerbos@nsf.gov): “At least four kinds of problems explain why some people give up prematurely when trying to use recurrent networks.” He listed “bugs, bumpy error surface, shallow plateaus and local minima.” Another reason is perhaps that the particle filters have very rich mathematical structures, while neural filtering is so simple, yet already so effective, giving the impression that there is not much left to do on it.

## Advantages of the synthetic approach

The synthetic approach to optimal filtering is simple, general, systematic and practical. While the conventional analytic approach to optimal filtering derives formulas or equations from a mathematical model of the signal and measurement processes, the synthetic approach synthesizes realizations of those processes into a virtually optimal filter. Such realizations are obtained by computer simulations or actual experiments. The synthetic approach has the following advantages:

• No assumption such as the Markov property, linear dynamics, Gaussian distribution, or additive noise is necessary.
• It applies, even if a mathematical model of the signal or measurement process is not available.
• The resulting neural filter is virtually minimum-variance for its given architecture.
• Neural filters with or without dynamical range transformers are parsimonious approximators of optimal filters.
• Other estimation error criteria than the minimum-variance criterion can be easily used.
• The extended Kalman filter and a recurrent multilayer perceptron can be easily integrated into a neural filter.
• Neural filters are well suited for real-time processing due to their massively parallel nature of computing.
• The simple recursive neural network architecture is suitable for chip implementation.