General comments: Overall the entry is very well organized and well written. I enjoyed it very much. The whole field of metalearning is covered by visiting many topics which is appropriate at this stage when many views seem to prevail. The attempt to provide a definition that goes beyond classification algorithms is well done although a bit unconventional. Some sections are a bit “sketchy” and could be easily extended to provide a better understanding of the field. The entry is in general a very good reference for those interested in metalearning.
Detailed comments: The definition of metalearning (first page) talks about changing parameters based on experience. Although later on the idea of "parameter change" is explained to stand for many concepts, it is a bit limited for a general definition. I would suggest broadening the definition of metalearning from the very beginning.
In “Terminology and Definition” it reads: “Any learning algorithm must satisfy delta > 0 in its domain, that is it must improve expected performance”. I disagree with this definition because in general learning occurs any time the algorithm does better than random guessing. If average performance is already high, learning may never help improve average performance. I suggest changing it to “Any learning algorithm aims at satisfying…”
Section “Where does metalearning start?” states that learning and metalearning can be seen as a “flat” learning algorithm. I don’t share the author’s opinion on this. The statement looks subjective and seems to express the author’s view exclusively. I would suggest eliminating the section or providing a more compelling argument in favor of this “flat” view.
In section “Domains” it reads: “By definition, the metalearning algorithm gets the same training data as the base-level learning algorithm”. It is now standard to think about base-learning as learning from a single task, whereas metalearning is about learning from many tasks. I think it is wrong to say that the same training data is “always” used in metalearning. I would suggest making a distinction between contexts; the statement is not true everywhere.
In Section “Performance Measures” it reads: “Here the inequations…” Please rephrase.
Section “Inductive Transfer” is very sketchy. There are many papers in the machine learning literature touching on this subject. Perhaps the authors could extend this section a bit more.
Section “Optimal Universal Inductive Transfer” is controversial and as the authors mention it may not even be considered metalearning at all. My suggestion is to eliminate the section entirely.
Section “Self-modification” is also sketchy. Please add a few more references and extend the section a bit more.
Section “Success-Story Algorithm” assigns a label of “self-modifying policy” to the agent, parameter, learning algorithm, etc. Could there be a more specific naming for these entries?
Section “Multiple learning algorithms” should read: “…we have prior knowledge in the form of a pool…”
Section “Algorithm selection” should read: “…consists of choosing from a pool the best algorithm..”. A few lines below it should read: “The base-algorithm chosen for a new task is the one…”
Last line before “References” should read: “…change its software such that at some point it is not a metalearner anymore.”
Reviewer B: Comments of Reviewer B
1. The article for Scholarpedia provides quite a lot of useful information to those who wish to know more about metalearning. It encompasses all types of metalearning and suggests a common framework (in Section 1).
2. The definition of metalearning provided right at the beginning would probably seem too narrow to a casual reader. It states that metalearning uses experience to change parameters of a learning algorithm. At first sight, this does not seem to cover algorithm selection or combination of algorithms (e.g. bagging). If one reads the rest of the article carefully, it is clear that these cases are covered. Still, one should assume that many readers will not read very carefully, and this should perhaps be taken into account. So, my suggestion is to add a clarification to the definition, such as the following: “Here it is assumed that the learning algorithm may be a rather complex algorithm in general and may incorporate more than one learning method. Hence selection or combination of algorithms, for instance, can be regarded as parameterization of the complex learning algorithm referred to earlier.”
3. The organization of the article is a bit unfortunate. Several formal definitions are given right at the beginning of the article (in Section 1), which may put many readers off continuing to study this issue. This part is of course useful, as it provides a common framework for various cases studied later. My suggestion is that some parts of the article, namely “Section 2 Origins (including Uses)” and “part of Section 3 (the first paragraph which includes references to overview articles)”, be moved in front of the existing Section 1. Then, when you come to the section showing the definitions, motivate the reader by saying that it provides a common framework for what follows.
4. – The printed version of the article I got does not show numbering of sections. I guess this will be corrected.
5. – In individual sections, i.e. 4.1.1 etc., till 4.4.2 and 4.5, the authors include the list of the basic concepts D, DT, φ, etc. and show what these represent in each particular case. This seems fine, but there are two problems with it: (1) when the list is encountered for the first time, nothing is said to introduce it; (2) the lists occupy quite a lot of space. Perhaps the authors could reflect on how this could be compacted (e.g. introduce a table for 4.1.1 till 4.1.3?).
6. The section 4.1.3. Optimal Universal Inductive Transfer is far too long in comparison with the others. This part should be shortened to be in line with the others.
7. One comment regards the subsection 1.5 Performance measures. It would be useful to mention some measures that are commonly used, such as error rate, MSE etc. The authors could mention particularly those that are referred to later in subsections 4.1.1 etc. The part on measures used in unsupervised learning seems a bit odd / vague.
8. Section 4.3. Multiple Learning Algorithms includes three subsections: 4.2.1 Algorithm Selection, 4.2.2 Algorithm Portfolios and 4.2.3 Distributed Learning Algorithm. This does not seem to cover well the area in which many practical results were obtained. I suggest extending it. As I did not succeed in editing the text directly, I include my suggestion below:
Given a task, algorithm selection (Rice, 1976) consists of choosing from the pool a base-algorithm expected to perform best. A predictor is trained on meta-level experience consisting of a mapping from (statistical or information-theoretic) task features to performance. The base-algorithm chosen for a new task is the one with the best predicted performance.
This basic scheme was later extended in various ways. First, the focus shifted from suggesting a single algorithm to providing a ranking of all possible algorithms on the basis of both a given performance measure and costs (i.e. time needed to train the classifier) (Brazdil et al., 1994). In addition, the meta-level task features were extended in various ways. One important group involved so-called sampling landmarks, that is, measures of performance of each alternative classifier on small subsets of data (see e.g. Leite et al., 2005).
Brazdil, P. and Henery, R. (1994): Analysis of Results. In D. Michie, D.J. Spiegelhalter and C.C. Taylor (Eds.), Machine Learning, Neural and Statistical Classification, Ellis Horwood.
Leite R., and Brazdil P. (2005): Predicting a relative performance of classifiers from samples. In ICML ’05, Proc. of Int. Conf. on Machine Learning, ACM Press.
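The algorithm-selection scheme in the suggested text above can be illustrated with a minimal sketch. It is only an assumption-laden toy, not the method of any cited paper: the meta-features (number of examples, class entropy), the algorithm names, the accuracy values and the nearest-neighbour meta-level predictor are all hypothetical, and training cost is not modelled.

```python
from math import dist

# Hypothetical meta-level experience: for each past task, a tuple of
# meta-features (number of examples, class entropy) and the observed
# accuracy of each base-algorithm in the pool. All values are made up.
META_EXPERIENCE = [
    ((100.0, 0.9), {"tree": 0.80, "knn": 0.70}),
    ((120.0, 0.8), {"tree": 0.78, "knn": 0.72}),
    ((5000.0, 0.2), {"tree": 0.65, "knn": 0.85}),
    ((4500.0, 0.3), {"tree": 0.66, "knn": 0.83}),
]


def predict_performance(features, k=2):
    """Predict each algorithm's performance on a new task as the mean of
    its performance on the k past tasks with the closest meta-features.
    Features are left unscaled here, so the first one dominates."""
    nearest = sorted(META_EXPERIENCE, key=lambda rec: dist(rec[0], features))[:k]
    algorithms = nearest[0][1].keys()
    return {a: sum(perf[a] for _, perf in nearest) / k for a in algorithms}


def rank_algorithms(features, k=2):
    """Rank the whole pool by predicted performance, best first."""
    predicted = predict_performance(features, k)
    return sorted(predicted, key=predicted.get, reverse=True)


def select_algorithm(features, k=2):
    """Basic algorithm selection: the top of the predicted ranking."""
    return rank_algorithms(features, k)[0]


if __name__ == "__main__":
    new_task = (110.0, 0.85)
    print(rank_algorithms(new_task))
    print(select_algorithm(new_task))
```

Returning a ranking rather than a single name mirrors the extension mentioned in the suggested text; a cost term or sampling landmarks could be added as further meta-features under the same interface.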
Author Schaul: Authors' reply
We appreciate the constructive comments of both reviewers and have attempted to address all of them, also revising the article structure as suggested. For further improvements, we encourage the reviewers to edit the article directly. In particular, we would appreciate it if reviewer A could point us to any crucial references we may have missed (for the sections on Inductive Transfer and Self-modification).
We did not incorporate the following two comments:
Reviewer A: In “Terminology and Definition” it reads: “Any learning algorithm must satisfy delta > 0 in its domain, that is it must improve expected performance”. I disagree with this definition because in general learning occurs any time the algorithm does better than random guessing. If average performance is already high, learning may never help improve average performance. I suggest changing it to “Any learning algorithm aims at satisfying…”
We disagree: if the agent's expected performance is high from the start without improving any further, we do not speak of learning. If we did, how could we distinguish learning agents from good but fixed agents?
Reviewer A: Section “Success-Story Algorithm” assigns a label of “self-modifying policy” to the agent, parameter, learning algorithm, etc. Could there be a more specific naming for these entries?
This is precisely the point we want to illustrate with this example (also for the Gödel machine): all those roles are filled by the same component.
The paper looks very good right now. Thank you to the authors for such an excellent job. As requested by the authors, I added one more reference on inductive transfer (a recent survey by Pan and Yang 2009 that provides a good summary of recent work).
The paper looks quite good now. You have done a good job! However, some relatively minor problems I have mentioned earlier have been only partially resolved.
1) In the definition of Metalearning on page 1 it is said that: ".. metalearning algorithm uses experience to change parameters of a learning algorithm ..". This permits an interpretation that is too narrow and would exclude, for instance, the selection of ML algorithms with recourse to metalearning. The problem could be easily resolved by changing the text as follows: ".. metalearning algorithm uses experience to change certain aspects of a learning algorithm ..".
2) In section "Terminology and definition" there is the phrase: "Now we define a learning algorithm Lmu .. performance PHI increases." The problem here is again that the interpretation given by the reader to "learning algorithm" may be too narrow. This could be corrected by adding a few words: “Here it is assumed that the learning algorithm may be a rather complex algorithm in general and may incorporate more than one learning method.”
3) I am a bit puzzled why the section 3.1 Inductive Transfer includes 3.1.2 Ensemble Methods, as the latter are normally used with the same data (i.e. there is no transfer from one domain to another). So I would suggest turning it into a separate item. But perhaps I missed something. If I did, maybe you could help the reader and better justify your choice.
4) On the printed version of the article the section numbers appear only on page 1. In the rest of the article the sections are not numbered. I guess this will be corrected.
The rest, as I said, seems fine.
User 4: Regarding meta-genetic programming
As far as I have found in the literature, Schmidhuber was the first to describe meta-genetic programming (in the terminology of the time: meta-evolution of plans). However, the term "meta-genetic programming" was introduced by (Edmonds, 2001) in what appears to be the first implementation of meta-GP. It is true that just as Koza did not reference Cramer, Edmonds did not cite Schmidhuber. Nevertheless, I suggest adding Edmonds as a secondary citation for being the first to produce tangible evidence in support of meta-genetic programming.
Mendeleev predicted the existence of ekasilicon, yet it was later named Germanium by Winkler, the scientist who demonstrated Germanium's existence.
Edmonds, B., 2001. Meta-Genetic Programming: Co-evolving the Operators of Variation. Turkish Journal of Electrical Engineering and Computer Sciences, 9, 13-29.