# User:Ke CHEN/Proposed/Negatively Correlated Ensemble Learning

Ensembles of multiple learning machines, i.e. groups of learners that work together as a committee, are an effective way to improve the generalization of supervised classifiers and have attracted a lot of research interest in the machine learning community. Most existing ensemble methods, such as Bagging, ensembles of features and Random Forests, train ensemble members independently. In this situation, the interaction and cooperation among the individual members in the ensemble may not be fully exploited.

Negative Correlation Learning (NCL) (Liu and Yao, 1999a and 1999b) is a specific ensemble method, which emphasizes interaction and cooperation among individual members in the ensemble. It uses a penalty term in the error function to produce biased individual learners whose errors tend to be negatively correlated. Specifically, NCL introduces a correlation penalty term to the cost function of each individual network so that each neural network minimizes its mean-square-error (MSE) together with the error correlation within the ensemble. This encourages diversity, which is essential for good ensemble performance (Brown et al., 2005). This article summarizes work on NCL, including the problem formulation, the training algorithms, the parameter selection algorithm, the ensemble selection and combination methods and some variants of NCL for specific applications.

## Formulation of NCL

NCL was first proposed by Liu and Yao (1997 and 1999a), where negative correlation learning and evolutionary learning are combined to automatically design neural network (NN) ensembles. This algorithm emphasizes cooperation and specialization among the individual NNs during their design. It introduces a correlation penalty term to the error function of each individual network in the ensemble so that, instead of training the networks independently, the ensemble as a whole can be jointly trained and diversity between members encouraged. Furthermore, all ensemble members can be trained on the same training data set: it is not necessary to divide the training data between the ensemble members in order to achieve a diverse ensemble.

Given the training set \(\{\mathbf{x}_{n},y_{n}\}_{n=1}^{N}\ ,\) NCL combines \(M\) neural networks \(f_{i}(\mathbf{x})\) to constitute the ensemble. \[ f_{ens}(\mathbf{x}_{n})=\frac{1}{M}\sum_{i=1}^{M}f_{i}(\mathbf{x}_{n}). \] To train network \(f_{i}\ ,\) the cost function \(e_{i}\) for network \(i\) is defined by \[\tag{1} e_{i}=\sum_{n=1}^{N}(f_{i}(\mathbf{x}_{n})-y_{n})^{2}+\lambda p_{i}\text{,} \] where \(\lambda \) is a weighting parameter on the penalty term \(p_{i}\ :\) \[\tag{2} p_{i}=\sum_{n=1}^{N}\left\{ (f_{i}(\mathbf{x}_{n})-f_{ens}(\mathbf{x}_{n}))\sum_{j\neq i}\left( f_{j}(\mathbf{x}_{n})-f_{ens}(\mathbf{x}_{n})\right) \right\} =-\sum_{n=1}^{N}\left( f_{i}(\mathbf{x}_{n})-f_{ens}(\mathbf{x}_{n})\right) ^{2}\text{.} \]

The first term on the right-hand side of (1) is the empirical training error of network \(i\ .\) The second term \(p_{i}\) is a correlation penalty function. The purpose of minimizing \(p_{i}\) is to negatively correlate each network's error with the errors of the rest of the ensemble. The \(\lambda \) parameter controls the trade-off between the training error term and the penalty term. With \(\lambda =0\ ,\) each network in the ensemble is trained independently. As \(\lambda \) is increased, more and more emphasis is placed on minimizing the penalty.
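As a minimal sketch of equations (1) and (2), the following Python snippet evaluates the penalty \(p_{i}\) in its product form and the cost \(e_{i}\) for a toy ensemble. The function names and the toy predictions are illustrative assumptions, not part of the original formulation; `preds[i][n]` stands for \(f_{i}(\mathbf{x}_{n})\).

```python
def ensemble_output(preds):
    """Average the member predictions; preds[i][n] plays the role of f_i(x_n)."""
    M = len(preds)
    N = len(preds[0])
    return [sum(preds[i][n] for i in range(M)) / M for n in range(N)]


def penalty(preds, i):
    """Correlation penalty p_i of equation (2), computed in its product form."""
    f_ens = ensemble_output(preds)
    total = 0.0
    for n in range(len(f_ens)):
        # Sum of the deviations of all *other* members from the ensemble output.
        others = sum(preds[j][n] - f_ens[n] for j in range(len(preds)) if j != i)
        total += (preds[i][n] - f_ens[n]) * others
    return total


def ncl_cost(preds, targets, i, lam):
    """Cost e_i of equation (1): squared error plus lam times the penalty."""
    sq_err = sum((f - y) ** 2 for f, y in zip(preds[i], targets))
    return sq_err + lam * penalty(preds, i)


# Toy predictions of M = 3 members on N = 3 points (made-up numbers).
preds = [[1.0, 2.0, 4.0], [2.0, 1.0, 2.0], [0.0, 3.0, 3.0]]
targets = [1.0, 2.0, 3.0]

f_ens = ensemble_output(preds)
p0 = penalty(preds, 0)
# Closed form on the right-hand side of equation (2).
p0_closed = -sum((f - e) ** 2 for f, e in zip(preds[0], f_ens))
print(p0, p0_closed, ncl_cost(preds, targets, 0, 0.5))
```

The two values of the penalty agree because the deviations of all members from \(f_{ens}\) sum to zero at every point, which is what makes the second equality in (2) hold.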

In NCL training, all individual neural networks interact with each other through the penalty terms in their error functions. Each neural network minimizes not only the difference between \(f_{i}(\mathbf{x}_{n})\) and \(y_{n}\ ,\) but also the difference between \(f_{ens}(\mathbf{x}_{n})\) and \(y_{n}\ ,\) so the errors of all other neural networks are taken into account while training a particular neural network.

When \(\lambda =1\ ,\) the cost function \(e_{i}\) for network \(i\) can be represented by \[ e_{i}=\sum_{n=1}^{N}(f_{i}(\mathbf{x}_{n})-y_{n})^{2}-\sum_{n=1}^{N}\left( f_{i}(\mathbf{x}_{n})-f_{ens}(\mathbf{x}_{n})\right) ^{2}\text{,} \] and the average cost function \(E\) for the ensemble is changed to \[ E=\frac{1}{M}\sum_{i=1}^{M}e_{i}=\frac{1}{M}\sum_{n=1}^{N}\sum_{i=1}^{M}\left\{ (f_{i}(\mathbf{x}_{n})-y_{n})^{2}-\left( f_{i}(\mathbf{x}_{n})-f_{ens}(\mathbf{x}_{n})\right) ^{2}\right\} =\sum_{n=1}^{N}(f_{ens}(\mathbf{x}_{n})-y_{n})^{2}\text{.} \]

According to the above equation, the error function of NCL (when \(\lambda =1\) ) is equivalent to training a single estimator \(f_{ens}(\mathbf{x}_{n})\) instead of training each individual network separately.
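The identity for \(E\) can be checked numerically. The sketch below uses random toy predictions (the sizes, seed and variable names are arbitrary assumptions) and compares the average of the individual costs at \(\lambda =1\) with the squared error of the ensemble output.

```python
import random

random.seed(0)
M, N = 4, 5  # number of networks and of training points (arbitrary choices)

targets = [random.uniform(-1, 1) for _ in range(N)]
preds = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
f_ens = [sum(preds[i][n] for i in range(M)) / M for n in range(N)]


def cost_lambda_one(i):
    """Cost e_i with lambda = 1: squared error minus squared deviation from f_ens."""
    return sum(
        (preds[i][n] - targets[n]) ** 2 - (preds[i][n] - f_ens[n]) ** 2
        for n in range(N)
    )


avg_cost = sum(cost_lambda_one(i) for i in range(M)) / M
ens_err = sum((f_ens[n] - targets[n]) ** 2 for n in range(N))
print(abs(avg_cost - ens_err) < 1e-12)
```

The two quantities match up to floating-point rounding for any choice of predictions and targets, since the identity is purely algebraic.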

NCL can be trained by gradient descent on the cost function (1) (Liu and Yao, 1999a). To determine the number of individual neural networks (NNs) in an ensemble automatically and to exploit the interaction between individual NN design and combination, Liu and Yao (1999a) proposed combining NCL with evolutionary ensembles. The idea of this approach is to encourage different individual NNs in the ensemble to learn different parts or aspects of the training data so that the ensemble can better learn the entire training data. This provides an opportunity for different NNs to interact with each other and to specialize.
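The gradient-descent training mentioned above can be sketched for a toy ensemble. This is only an illustration under stated assumptions: each "network" is a one-dimensional linear model \(f_{i}(x)=w_{i}x+b_{i}\) rather than a neural network, and \(f_{ens}\) is treated as a constant when differentiating the penalty, which gives the per-sample gradient \(2(f_{i}-y)-2\lambda (f_{i}-f_{ens})\) with respect to \(f_{i}\). All names, data and hyperparameters are hypothetical.

```python
import random


def predict(params, x):
    """Toy linear 'network' f_i(x) = w * x + b (an assumption, not an NN)."""
    w, b = params
    return w * x + b


def train_ncl(xs, ys, M=3, lam=0.5, lr=0.05, epochs=200, seed=0):
    """Batch gradient descent on the NCL cost of every member simultaneously."""
    rng = random.Random(seed)
    members = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(M)]
    for _ in range(epochs):
        # Predictions are computed once per epoch, so all members are
        # updated against the same ensemble output.
        preds = [[predict(m, x) for x in xs] for m in members]
        f_ens = [sum(p[n] for p in preds) / M for n in range(len(xs))]
        for i, m in enumerate(members):
            grad_w = grad_b = 0.0
            for n, x in enumerate(xs):
                # d e_i / d f_i with f_ens held constant (assumption).
                g = 2 * (preds[i][n] - ys[n]) - 2 * lam * (preds[i][n] - f_ens[n])
                grad_w += g * x
                grad_b += g
            m[0] -= lr * grad_w
            m[1] -= lr * grad_b
    return members


def ensemble_mse(members, xs, ys):
    """Mean squared error of the averaged ensemble output."""
    M = len(members)
    return sum(
        (sum(predict(m, x) for m in members) / M - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2 * x + 1 for x in xs]  # noiseless toy target
members = train_ncl(xs, ys)
print(ensemble_mse(members, xs, ys))
```

Under this assumption the gradient can be rewritten as \(2(1-\lambda )(f_{i}-y)+2\lambda (f_{ens}-y)\ ,\) which shows how \(\lambda \) shifts the emphasis from the individual error to the ensemble error.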

## Constructive training

In the standard NCL training algorithm, the structure of the ensemble, e.g., the number of NNs in the ensemble, and the structure of individual NNs, e.g., the number of hidden nodes, are all designed manually and fixed during the training process, which might not be appropriate for real-world problems. To automatically design ensembles using NCL, Islam et al. (2003) used a constructive algorithm to determine the size of the ensemble and the NN architectures within an ensemble. This method combines ensemble architecture design with cooperative training for individual neural networks in ensembles.

## Ensemble pruning with NCL

Existing ensemble learning algorithms often generate unnecessarily large ensembles. These large ensembles are memory demanding. In addition, it is not always true that the larger the size of an ensemble, the better it is (Yao and Liu, 1998; Zhou et al., 2002). Motivated by these reasons, Chen et al. modelled ensemble pruning with a probabilistic model with a truncated Gaussian prior for both regression and classification problems (Chen et al., 2006 and 2009). The expectation maximization and expectation propagation algorithms were used to infer the combination weights and showed good performance in both generalization error and pruned ensemble size. Chen (2008) also combined these ensemble pruning algorithms with NCL to generate effective ensembles.

## Regularized NCL

NCL was later observed to be prone to overfitting noise (Chen and Yao, 2009), and a regularized negative correlation learning (RNCL) algorithm was proposed to increase its robustness against noisy data. RNCL incorporates a regularization term into the cost function of NCL and employs a Bayesian inference procedure to optimize the regularization parameters.

## Evolutionary multi-objective methods

It is widely believed that the success of ensemble algorithms depends on both the accuracy and diversity among individual estimators/classifiers in the ensemble (Krogh and Vedelsby, 1995). Recent research has demonstrated (Chen and Yao, 2009) that regularization is another important factor for NCL, and evolutionary multi-objective algorithms have been used to balance the trade-off among these terms. DIVACE (Diverse and Accurate Ensemble Learning Algorithm) (Chandra and Yao, 2006a and 2006b) is such an approach, which combines evolving neural network ensembles and a multi-objective algorithm. Chen and Yao (2009) further proposed a multi-objective regularized negative correlation learning (MRNCL) algorithm, which implements RNCL with a multi-objective algorithm. By incorporating the regularization term, the training of an individual neural network in MRNCL involves minimizing three terms: the empirical training error, the correlation penalty and the regularization term.

## Online learning

NCL has been extended to tackle online learning problems (Minku et al., 2009). Two different approaches to using negative correlation in incremental learning, fixed size NCL and growing NCL, were presented and analyzed. Later, a selective NCL (SNCL) algorithm for incremental learning was proposed in Tang et al. (2009). In SNCL, when a new training data set is presented, the previously trained neural network ensemble is cloned and the cloned ensemble is trained with the new training data set. Then, the new ensemble is combined with the previous ensemble and a selection process is applied to prune the whole ensemble to a fixed size. Recently, the impact of diversity on online ensemble learning was investigated in the presence of concept drift (Minku et al., 2010). An empirical relationship between diversity and ensemble performance was revealed and discussed.
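The clone, train, combine and prune steps of SNCL described above can be outlined as follows. This is a rough sketch only: the training routine, the error measure and, in particular, the selection rule (keeping the members with the lowest individual error on the new data) are hypothetical placeholders and are not taken from Tang et al. (2009).

```python
import copy


def sncl_step(ensemble, new_data, train_fn, error_fn, size):
    """One incremental step: clone, train the clone, combine, prune to `size`."""
    cloned = copy.deepcopy(ensemble)   # the previous ensemble stays untouched
    train_fn(cloned, new_data)         # placeholder for NCL training on new data
    combined = ensemble + cloned
    # Hypothetical selection rule: keep the members with the lowest error.
    combined.sort(key=lambda member: error_fn(member, new_data))
    return combined[:size]


# Toy stand-ins: each member is a constant predictor stored as [c].
def toy_train(members, data):
    """Move every constant halfway towards the mean of the new data."""
    target = sum(data) / len(data)
    for m in members:
        m[0] += 0.5 * (target - m[0])


def toy_error(member, data):
    """Mean squared error of a constant predictor on the data."""
    return sum((member[0] - y) ** 2 for y in data) / len(data)


ensemble = [[0.0], [1.0], [2.0]]
new_ensemble = sncl_step(ensemble, [4.0, 4.0], toy_train, toy_error, size=3)
print(new_ensemble)
```

Passing the training and error functions as arguments keeps the outline independent of any particular network model or selection criterion.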

## Imbalanced learning

As machine learning matures into an applied technology, the importance of the class imbalance problem grows and more and more effort is devoted to tackling it. The relationship between diversity and ensemble performance for imbalanced data sets has been investigated and NCL\(_{Cost}\ ,\) a variant of NCL, was proposed for imbalanced data sets (Wang et al., 2009a and 2009b). To improve training speed and performance for classification problems, an AdaBoost-style algorithm, AdaBoost.NC, was studied in Wang et al. (2010), in which an ambiguity term derived theoretically for classification ensembles was used to manage diversity explicitly.

## References

- Brown, G.; Wyatt, J.; Yao, X. and Harris, R. (2005). Diversity Creation Methods: A Survey and Categorisation. *Journal of Information Fusion (Special issue on Diversity in Multiple Classifier Systems)* 6(1): 5–20.
- Chandra, A. and Yao, X. (2006a). Ensemble learning using multi-objective evolutionary algorithms. *Journal of Mathematical Modelling and Algorithms* 5(4): 417–445.
- Chandra, A. and Yao, X. (2006b). Evolving hybrid ensembles of learning machines for better generalisation. *Neurocomputing* 69(7-9): 686–700.
- Chen, H.; Tino, P. and Yao, X. (2006). A probabilistic ensemble pruning algorithm. Workshops on Optimization-based Data Mining Techniques with Applications in Sixth IEEE International Conference on Data Mining, Hong Kong. 878–882.
- Chen, H.; Tino, P. and Yao, X. (2009). Predictive ensemble pruning by expectation propagation. *IEEE Transactions on Knowledge and Data Engineering* 21(7): 999–1013.
- Chen, H. and Yao, X. (2009). Regularized negative correlation learning for neural network ensembles. *IEEE Transactions on Neural Networks* 20(12): 1962–1979.
- Chen, H. and Yao, X. (2009). Multi-objective neural network ensembles based on regularized negative correlation learning. *IEEE Transactions on Knowledge and Data Engineering* 99: 1-1.
- Chen, H. (2008). Diversity and Regularization in Neural Network Ensembles. PhD thesis, School of Computer Science, University of Birmingham.
- Islam, M. M.; Yao, X. and Murase, K. (2003). A constructive algorithm for training cooperative neural network ensembles. *IEEE Transactions on Neural Networks* 14(4): 820–834.
- Krogh, A. and Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, Denver, Colorado, USA. 231–238.
- Liu, Y. and Yao, X. (1997). Negatively correlated neural networks can produce best ensembles. *Australian Journal of Intelligent Information Processing Systems* 4(3/4): 176–185.
- Liu, Y. and Yao, X. (1999a). Ensemble learning via negative correlation. *Neural Networks* 12(10): 1399–1404.
- Liu, Y. and Yao, X. (1999b). Simultaneous training of negatively correlated neural networks in an ensemble. *IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics* 29(6): 716–725.
- Minku, F. L.; Inoue, H. and Yao, X. (2009). Negative correlation in incremental learning. *Natural Computing* 8(2): 289–320.
- Minku, F. L.; White, A. and Yao, X. (2010). The impact of diversity on on-line ensemble learning in the presence of concept drift. *IEEE Transactions on Knowledge and Data Engineering* 22(5): 730–742.
- Tang, K.; Lin, M.; Minku, F. L. and Yao, X. (2009). Selective negative correlation learning approach to incremental learning. *Neurocomputing* 72(13-15): 2796–2805.
- Wang, S.; Chen, H. and Yao, X. (2010). Negative correlation learning for classification ensembles. Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN’10), Barcelona, Spain. 2893–2900.
- Wang, S. and Yao, X. (2009a). Diversity analysis on imbalanced data sets by using ensemble models. Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM’09), Nashville, Tennessee, USA. 324–331.
- Wang, S.; Tang, K. and Yao, X. (2009b). Diversity exploration and negative correlation learning on imbalanced data sets. Proceedings of the 2009 International Joint Conference on Neural Networks (IJCNN’09), Atlanta, Georgia, USA. 3259–3266.
- Yao, X. and Liu, Y. (1998). Making use of population information in evolutionary artificial neural networks. *IEEE Transactions on Systems, Man, and Cybernetics, Part B* 28(3): 417–425.
- Zhou, Z.; Wu, J. and Tang, W. (2002). Ensembling neural networks: many could be better than all. *Artificial Intelligence* 137(1-2): 239–263.

## Further reading

- H. Chen and X. Yao. Ensemble Learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. (In Preparation)

## External links

- Authors' academic homepage: Huanhuan Chen, Xin Yao