Initial section: Just state that “multiple models are combined” rather than “multiple set of models” are combined. Later on, you can point out that each model in the ensemble can itself be an ensemble.
Introduction/model selection: MLPs, DTs, etc., are not actually algorithms but rather are models. One or more learning algorithms are capable of taking training data as input and generating a model.
Replace “true generalization performance” with “performance on unseen data.” Actually, before this, have a brief explanation of how supervised learning is done, whereby you end up defining training data, test data, generalization performance, etc.
Swap the first two paragraphs: the real-life example of polling multiple experts is a good intuitive explanation of why ensembles are useful. Give that before the supervised-learning/statistics-oriented explanation.
Figure 1 needs to depict greater diversity among the base models. I recommend pointing out a set of examples in figure 1 which are misclassified by one out of the three base models, and then point out that the ensemble would classify these examples correctly and correct the mistakes of individual models.
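To make the suggestion concrete, here is a minimal sketch in Python of the situation the figure should show. The labels and the three prediction vectors are made-up illustrative data, not taken from the article: each base model misclassifies a different example, and the majority vote corrects all three mistakes.

```python
from collections import Counter

# Hypothetical true labels for five examples (illustrative data only).
y_true = [0, 1, 1, 0, 1]

# Each base model makes exactly one mistake, on a different example.
predictions = [
    [1, 1, 1, 0, 1],  # model A errs on example 0
    [0, 0, 1, 0, 1],  # model B errs on example 1
    [0, 1, 1, 1, 1],  # model C errs on example 3
]

def majority_vote(votes):
    """Return the most common label among the votes cast for one example."""
    return Counter(votes).most_common(1)[0][0]

# zip(*predictions) regroups the votes example by example.
ensemble = [majority_vote(votes) for votes in zip(*predictions)]

# Number of mistakes made by each base model and by the ensemble.
errors = [sum(p != t for p, t in zip(model, y_true)) for model in predictions]
ensemble_errors = sum(p != t for p, t in zip(ensemble, y_true))
```

Here every base model has one error while the ensemble has none, which is only possible because the errors fall on different examples.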
Divide and Conquer section: Requirement (i) (independent classifier outputs) is a bit strong. For majority voting, you only need enough diversity to ensure that more than half of the base models classify each example correctly.
Spell-check the article; for example, the word “approachm” appears once.
Figure 4: It uses S to represent both the original training set and the bootstrapped training set. Please use a different variable for the bootstrapped training set (e.g., S_t).
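A minimal sketch of the notation I have in mind, in Python. The names `bootstrap_sample` and `replicates`, the seed, and the toy training set are my own illustrative assumptions: `S` stays the original training set throughout, and each base model t gets its own bootstrap replicate S_t.

```python
import random

def bootstrap_sample(S, rng):
    """Draw len(S) items from S uniformly at random, with replacement."""
    return [rng.choice(S) for _ in range(len(S))]

rng = random.Random(0)   # fixed seed, chosen arbitrarily for repeatability
S = list(range(10))      # stand-in for the original training set
T = 3                    # number of base models (illustrative)

# replicates[t] plays the role of S_t; S itself is never overwritten.
replicates = {t: bootstrap_sample(S, rng) for t in range(1, T + 1)}
```

Keeping `S` and `S_t` as separate names makes it clear that every replicate is drawn from the same original set, not from the previous replicate.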
Boosting section: Please state explicitly that the assumption that the training error of a base model is less than 0.5 ends up causing that base classifier to correct at least one mistake made by the previous base model. Also, the ensemble error is a training error bound.
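For reference, a small Python sketch of the bound I am referring to. I am assuming the article uses the standard AdaBoost.M1 result, in which the ensemble's training error is at most the product over rounds of 2*sqrt(eps_t*(1 - eps_t)), where eps_t is the weighted training error of base model t; the function name and the example error values are hypothetical.

```python
import math

def adaboost_training_error_bound(epsilons):
    """Upper bound on ensemble *training* error: prod_t 2*sqrt(eps_t*(1-eps_t)).

    Assumes the standard AdaBoost.M1 bound; each eps_t is the weighted
    training error of the t-th base model.
    """
    bound = 1.0
    for eps in epsilons:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound

# Illustrative weighted errors, all below 0.5.
bound_three_rounds = adaboost_training_error_bound([0.3, 0.4, 0.45])
bound_one_round = adaboost_training_error_bound([0.3])
bound_chance = adaboost_training_error_bound([0.5])
```

Each factor is below 1 exactly when eps_t < 0.5, so the bound shrinks with every such round. It is a statement about training error only and says nothing by itself about performance on unseen data.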
In the Introduction, you should mention that Dietterich suggested some interesting reasons for the use of a multiple classifier system (T.G. Dietterich, Ensemble methods in machine learning, Multiple Classifier Systems, Springer-Verlag, LNCS, Vol. 1857 (2000) 1-15).
In the section on "Ensemble combination rules", you should mention the categorization of fusion rules proposed by Xu et al. Combination functions that follow the fusion strategy can be classified by the type of output produced by the classifiers forming the ensemble. Xu et al. distinguish three types of classifier output:
a) Abstract-level output. Each classifier outputs a unique class label for each input pattern.
b) Rank-level output. Each classifier outputs a list of ranked class labels for each input pattern. The class labels are ranked in order of plausibility of being the correct class label.
c) Measurement-level output. Each classifier outputs a vector of continuous-valued measures that represent estimates of class posterior probabilities, or class-related confidence values that represent the support for the possible classification hypotheses.
(L. Xu, A. Krzyzak, and C.Y. Suen, Methods for combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. on Systems, Man, and Cyb., Vol. 22, No. 3, May/June (1992) 418-435)
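As an illustration of why the output type matters, here is a minimal Python sketch with one commonly used combiner per level: plurality vote, Borda count, and the mean rule. These are my choice of examples and not necessarily the rules analysed by Xu et al.; the class names and all values are made up.

```python
from collections import Counter

classes = ["a", "b", "c"]

# Abstract level: one label per classifier, combined by plurality vote.
labels = ["a", "b", "a"]
abstract_decision = Counter(labels).most_common(1)[0][0]

# Rank level: each classifier ranks all labels, most plausible first.
# Borda count: a label in position i of a ranking earns (n_classes - 1 - i) points.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
borda = {c: 0 for c in classes}
for ranking in rankings:
    for position, c in enumerate(ranking):
        borda[c] += len(classes) - 1 - position
rank_decision = max(borda, key=borda.get)

# Measurement level: each classifier gives a support value per class
# (in the order of `classes`), combined by the mean rule.
supports = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
mean_support = [sum(col) / len(supports) for col in zip(*supports)]
measurement_decision = classes[mean_support.index(max(mean_support))]
```

The three rules need progressively richer classifier outputs, which is the point of the categorization: a mean rule cannot be applied to classifiers that only emit labels.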
You should make the reader aware that two main design approaches have been proposed, which Ho called “coverage optimization” and “decision optimization” methods. Coverage optimization refers to methods that assume a fixed, usually simple, decision combination function and aim to generate a set of mutually complementary classifiers that can be combined to achieve optimal accuracy. Decision optimization methods assume a given set of carefully designed classifiers and aim to select and optimize the combination function. See: T.K. Ho, Complexity of classification problems and comparative advantages of combined classifiers, Springer-Verlag, LNCS, Vol. 1857 (2000) 97-106; F. Roli, G. Giacinto, Design of Multiple Classifier Systems, in H. Bunke and A. Kandel (Eds.) Hybrid Methods in Pattern Recognition, World Scientific Publishing (2002)
In the end, you should make the reader aware that there is no guarantee that a combination of multiple classifiers always performs better than the best individual classifier in the ensemble. Nor can an improvement over the ensemble’s average performance be guaranteed in the general case. Such guarantees can be given only under particular conditions that the classifiers and the combination function have to satisfy; see, for example, the case of linear combiners (G. Fumera, F. Roli, A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6) (2005) 942-956)
I have several suggestions:
1. Several spelling and grammar mistakes remain, some of which do affect the ability to interpret the article.
2. "Nor an improvement on the ensemble’s average performance can only be guaranteed for certain special cases (Fumera 2005)." I think you mean to say that the improvement can be guaranteed in some special cases.
3. Boosting section, near the end. The fact that the boosted ensemble's training error is less than the upper bound epsilon does not mean that it outperforms the best individual base model. It shows only that the ensemble outperforms the worst individual base model.
4. In the AdaBoost algorithm, equation 4, do not give both expressions for the error; they are redundant.
5. Mixtures of Experts is not normally trained with bootstrapping. Also, the mixture of experts does not really use a meta-classifier. MOE uses the errors of the outputs of the base models to train the gating network, but does not train on the outputs themselves the way stacking does. I suggest a separate diagram.
6. Under voting methods, note that having an odd number of classifiers does not help avoid ties unless the number of classes is 2. Even for an even number of classes, if the number of classes is greater than the number of base classifiers, then having an odd number of classifiers does not help. I would just remove the conditions on odd number of classifiers, or just make clear for that first part that you are dealing with the two-class case.
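A quick Python check of point 6. This is a sketch; `has_tie` is a helper I made up for the purpose. With two classes and three voters no tie can occur, but with three classes an odd number of voters can still tie.

```python
from collections import Counter
from itertools import product

def has_tie(votes):
    """True if the two most frequent labels received the same number of votes."""
    counts = Counter(votes).most_common()
    return len(counts) > 1 and counts[0][1] == counts[1][1]

# Two classes, three voters: check every possible vote pattern.
binary_never_ties = not any(has_tie(v) for v in product([0, 1], repeat=3))

# Three classes, odd numbers of voters: ties are still possible.
tie_with_three_voters = has_tie(["a", "b", "c"])
tie_with_five_voters = has_tie(["a", "a", "b", "b", "c"])
```

So the odd-number condition only does its job in the two-class case, and a tie-breaking rule is needed otherwise.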
Majority Voting Error
The article states: "it can be shown the majority voting combination will always lead to a performance improvement". The following example shows that this is not true: P(correct, single) = 0.6, n = 4. P(correct, vote) = P(k > 2) = 0.4752 < 0.6.
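The number can be reproduced with a few lines of Python. This is a sketch that assumes independent classifiers, each correct with the same probability p, a two-class problem, and that a tied vote counts as an error; the function name is my own.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """P(strictly more than n/2 of n independent classifiers are correct)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

acc_four = majority_vote_accuracy(0.6, 4)   # k = 3 or 4 correct out of 4
acc_five = majority_vote_accuracy(0.6, 5)   # k = 3, 4 or 5 correct out of 5
```

With n = 4 the vote is worse than a single classifier (0.4752 < 0.6) because the tied 2-2 outcomes are lost, whereas with n = 5 the same formula gives a value above 0.6. The claim in the article therefore needs its conditions stated explicitly.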