# Metalearning

 Tom Schaul and Juergen Schmidhuber (2010), Scholarpedia, 5(6):4650. doi:10.4249/scholarpedia.4650 revision #91489 [link to/cite this article]
Post-publication activity

Curator: Tom Schaul

In the context of machine learning, Metalearning is the process of learning to learn. Informally speaking, a metalearning algorithm uses experience to change certain aspects of a learning algorithm, or the learning method itself, such that the modified learner is better than the original learner at learning from additional experience.

## Origins

Metalearning (also known as 'meta-learning' or 'learning to learn'), an old informal concept of cognitive psychology, more recently became a formal concept of machine learning. Among the earliest machine learning approaches to metalearning is a system designed to adjust bias, called STABB (Shift To A Better Bias), introduced by Utgoff (1986), the Variable bias management system by Rendell, Senshu and Tcheng (1987) which selects between different learning algorithms, and meta-genetic programming (Schmidhuber, 1987), to our knowledge the first system that tries to learn entire learning algorithms, through methods of artificial evolution.

In practice, metalearning can be used for automating human design decisions (e.g. parameter tuning) and then automatically revisiting and optimizing those decisions dynamically in the light of new experience obtained during learning. Another application is transfer of knowledge between multiple related tasks. Meta-learning may reduce the efforts for designing domain-specific learning algorithms, and lead to more robust and general learning architectures. Apart from practical applications, research on metalearning algorithms may also lead to new insights into human learning. Some philosophers of science have argued that the scientific method itself is a form of metalearning (Korb, 2004).

The 'Examples' section will provide a number of examples within the spectrum of techniques that can be (and have been) called metalearning algorithms. Additional overviews can be found in a number of surveys, e.g., Thrun and Pratt (1998), Vilalta and Drissi (2002), Giraud-Carrier et al. (2004), Anderson and Oates (2007), Brazdil et al. (2009).

## Terminology and definition

Since its introduction, the term "metalearning" has been used by various researchers in different ways, sometimes resulting in mutually inconsistent definitions. We therefore start by introducing broad terminology and a formal definition in this section.

Consider a domain $$D$$ of possible experiences $$s \in D\ ,$$ each having a probability $$p(s)$$ associated with it. Let $$T$$ be the available training experience at any given moment. Training experience is a subset of $$D\ ,$$ i.e. $$T \in D_{T}\subset\mathcal{P}(D)\ ,$$ where $$\mathcal{P}(D)$$ is the powerset of $$D\ .$$ An agent $$\pi_\theta$$ is parametrized by $$\theta \in \Theta\ .$$ A task associates a performance measure $$\phi : (\Theta, D) \mapsto \Re$$ with the agent's behavior for each experience. We denote by $$\Phi$$ the expected performance of an agent on $$D\ :$$

$\Phi(\theta) = \mathbb{E}_{s\in D}[\phi(\theta,s)]$

Now we define a learning algorithm $$\mathbf{L}_{\mu}: (\Theta, D_{T}) \mapsto \Theta\ ,$$ parametrized by $$\mu \in M\ ,$$ as a function that changes the agent's parameters $$\theta$$ based on training experience, so that its expected performance $$\Phi$$ increases. (Here it is assumed that the learning algorithm may be rather complex algorithm in general and may incorporate more than one learning method.) More formally, we define the learning algorithm's expected performance gain $$\delta$$ to be:

$\delta (\mathbf{L}_{\mu}) = \mathbb{E}_{\theta \in \Theta, T \in D_{T}} \left[\Phi(\mathbf{L}_{\mu}(\theta, T)) - \Phi(\theta)\right]$

Any learning algorithm must satisfy $$\delta > 0$$ in its domain, that is it must improve expected performance. A learning algorithm's modifiable components $$\mu$$ are called its meta-parameters. We define a metalearning algorithm $$\mathbf{ML}: (M, D_{T}) \mapsto M$$ to be a function that changes the meta-parameters of a learning algorithm, based on training experience, so that its expected performance gain $$\delta$$ increases:

$\mathbb{E}_{\mu \in M, T \in D_{T}}\left[\delta (\mathbf{L}_{\mathbf{ML}(\mu,T)})- \delta (\mathbf{L}_{\mu})\right] > 0$

In other words, using $$\mathbf{L}_{\mu'}$$ tends to lead to bigger performance increases than using $$\mathbf{L}_{\mu}\ ,$$ where $$\mu' = \mathbf{ML}(\mu,T)$$ are the updated meta-parameters. Note the symmetry of the two definitions, due to metalearning being a form of learning itself.

### Meta-levels

We do not have to limit ourselves to a single meta-level of learning. In fact, we can easily extend our definition to an arbitrary number of meta-levels. A meta-meta-learning algorithm, for example, adapts the parameters of $$\mathbf{ML}\ .$$ As we will see for cases such as the Gödel machine, the boundaries between the levels can be blurred to the extent that learning, metalearning, meta-meta-learning, etc. are collapsed into a single algorithm.

### Where does metalearning start?

The present definition leaves room for interpretation on how to separate learning from metalearning. While in practice the distinction tends to be clear from context, there is no well-founded justification for how to draw the line in general. In particular, the combination of a learning algorithm and a metalearning algorithm (and similarly, any number of further meta-levels) can always be seen as a single 'flat' learning algorithm, e.g. by defining a new learning algorithm $$\mathbf{L'}(\theta, T) = \mathbf{L}_{\mathbf{ML}(\mu,T)}(\theta, T)\ .$$

Conversely, almost every traditional learning algorithm can be viewed as metalearning. For example, a neural network that first learns to associate seven input patterns with seven corresponding target patterns, and then learns an additional eighth pattern pair association, is sometimes considered as a very rudimentary form of a metalearner.

### Gaining experience

It is commonly considered natural to learn gradually, as the training experience $$T_{t}$$ generally grows with time $$t$$ (e.g., in active learning). It suffices, however, to focus on (meta-)learning at the present moment. In fact, ignoring limitations on memory and computational resources, learning $$\theta_{t}$$ from scratch is functionally equivalent to gradually learning from new experience (making use of $$\theta_{t-1}$$). In both cases, we use the exact same training experience $$T_{t}\ .$$ This overview will mostly omit time indices; all quantities refer to the current time unless mentioned otherwise.

### Domains

The definition of experience $$s$$ was kept broad to match all conceivable kinds of learning. $$s$$ may consist of an input-target-tuple (in supervised learning), of a single datapoint (in unsupervised learning), or of a sequence of observations, actions and rewards (in reinforcement learning). Sometimes $$p(s)$$ is uniform (samples are uniformly drawn from a set), but this is no requirement. For example, in the case of active learning or on-policy reinforcement learning, $$p(s)$$ depends on the policy $$\pi_{\theta}\ .$$

By definition, the metalearning algorithm has access to the cumulative training data of the base-level learning algorithm(s). This may seem counterintuitive, as in practice metalearning algorithms often use high-level information such as complex input features obtained through pre-processing the data, convergence speed, or generalization performance. Since those can be derived from the training experience, however, their derivation is considered part of the metalearning algorithm, for simplicity's sake.

### Performance measures

Performance measures to be optimized through learning are as diverse as the application domains. In supervised learning we usually measure the differences between the agent's outputs and teacher-given target values (e.g. mean squared error, classification accuracy), in reinforcement learning the reward signals from the environment, in unsupervised learning properties of the internal representation the data, such as the mutual information between individual components of the representation. A performance measure may also be based on the amount of training experience or runtime necessary to achieve a certain behavior.

Optimizing several performance measures simultaneously, without combining them into a single number, is called multi-objective learning. In this case, the performance gain $$\delta$$ is a vector, and it is sufficient for learning if performance strictly improves for at least one of the objectives.

### Meta-parameters

The meta-parameters $$\mu$$ are the modifiable components of the learning algorithm $$\mathbf{L}_\mu$$ (just as $$\theta$$ are the modifiable components of $$\pi_{\theta}$$). Metalearning algorithms vary with respect to what information the meta-parameters contain, and how they are represented. Meta-parameters can include the actual code of $$\mathbf{L}\ ,$$ the choice of $$\mathbf{L}$$ from a set of learning algorithms, learning rates, or exploration-exploitation tradeoff. They may also directly encode domain knowledge to guide base-level learning, e.g., in the form of rules, constraints, world models, noise assumptions.

### Prior knowledge

Often learning does not ignore task and domain structure. Instead some form of prior domain knowledge (e.g., by human experts) is introduced. In our framework this is encoded in the hard-wired part of the agent's behavior $$\pi\ ,$$ its initial parameters $$\theta_{0}$$ (or their prior distribution). Learning can also be guided by using an informed prior distribution on $$M$$ (thereby effectively limiting the class of learning algorithms to those considered potentially useful for the task).

## Examples

This section will provide a number of examples within the spectrum of techniques that can be (and have been) called metalearning algorithms. For each of the following types of metalearning we will illustrate how it fits the framework above, mapping its corresponding components onto the symbols of the definition.

### Inductive Transfer

One of the simplest variations of metalearning is called inductive transfer, e.g., Vilalta and Drissi (2002). It reuses previously gained knowledge on new but somehow similar tasks. The learning algorithm and its parameters remain fixed. Here $$\mathbf{L}_{\mu}$$ usually only considers part of the available training experience $$T$$ (e.g. the current task), whereas $$\mathbf{ML}$$ uses all of $$T$$ (e.g. all tasks).

#### Policy partitioning

Also known as parameter sharing, this approach simply partitions the parameters into a task-specific and a general part. The general part is updated by all tasks and transferred from one task to the next, whereas the task-specific part is not transferred. In particular, we speak of recurrent functional decomposition if the learned output can be viewed as a transformation of the input by the convolution of a task specific transformation and a transformation that is identical for all tasks (a form of preprocessing).

For example, when training artificial neural networks, one can either use a partially shared architecture, or transfer the learned network weights directly, as a hopefully useful initialization for a new network supposed to solve the next problem. Caruana (1997) developed this approach under the name of Multitask learning, and applied it to a number of real-world supervised learning tasks.

 $$D$$ input-target mappings $$D_{T}$$ task-specific datasets $$\phi$$ MSE $$\pi_{\theta}$$ neural network $$\theta$$ initial network weights $$\mathbf{L}_{\mu}$$ Backprop $$\mu$$ some of the weights $$\mathbf{ML}$$ Multitask learning

Policy partitioning has been applied to support vector machines, where the weight vector is decomposed into the sum of a global (cross-task) and a local (task-specific component) part, and weight-decay is used on the local part.

### Ensemble methods

Ensemble methods are typically used for classification. They combine the outputs of multiple base-level classifiers. Stacked generalization (Wolpert 1992) either uses various algorithms to produce its component classifiers, or the same algorithm trained on different subsets of the data. Bagging (Breiman, 1996) builds data subsets by sampling. Boosting (Schapire, 1990) iteratively builds new classifiers by reweighing samples, to put higher emphasis on misclassified samples.

 $$D$$ input/class samples $$D_{T}$$ $$\mathcal{P}(D)$$ $$\phi$$ classification errors $$\pi_{\theta}$$ set of base-level classifiers $$\theta$$ parameters of each classifier $$\mathbf{L}_{\mu}$$ supervised learning $$\mu$$ number of classifiers, data subsets with sample weights $$\mathbf{ML}$$ Boosting

### Self-modification

The learning algorithms for the base-level and the meta-level do not need to be different: parameters and meta-parameters are sometimes updated in the same manner. For this kind of metalearning, some of the parameters are just modifiable components of the learning algorithm: we speak of self-modifying or self-adapting parameters or policies.

#### Neural metalearners that learn learning algorithms

Recurrent neural networks (RNNs) have been suggested for metalearning because they can in principle learn to run their own weight change algorithm (Schmidhuber, 1993). Differentiable RNN can be trained by gradient descent. This type of metalearning, however, may involve very large time lags between significant events, which standard RNNs cannot easily deal with. Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), however, overcome this shortcoming, and a successful metalearner (Hochreiter et al., 2001) uses LSTM-RNN to quickly learn a full-fledged learning algorithm for quadratic functions.

 $$D$$ input-target samples $$D_{T}$$ $$\mathcal{P}(D)$$ $$\phi$$ MSE $$\pi_{\theta}$$ RNN $$\theta$$ network activations $$\mathbf{L}_{\mu}$$ RNN $$\mu$$ network weights $$\mathbf{ML}$$ backpropagation through time

#### Success-Story Algorithm

The success-story algorithm (SSA; Schmidhuber et al., 1997) is a metalearning algorithm for improving self-modifying policies in a lifelong reinforcement learning framework. At self-computed "check-points" the system undoes self-modifications that have not been followed by faster average reward intake. Applications include complex maze tasks in partially observable environments.

 $$D$$ observation-action-reward sequences $$D_{T}$$ elements of $$D$$ $$\phi$$ cumulative reward per time $$\pi_{\theta}$$ self-modifying policy $$\theta$$ self-modifying policy $$\mathbf{L}_{\mu}$$ self-modifying policy $$\mu$$ self-modifying policy $$\mathbf{ML}$$ SSA

### Multiple learning algorithms

For many tasks we have prior knowledge in the form of a pool of algorithms that potentially could solve the task. Metalearning can be used to determine which one to use when, or how to combine them in the best way.

#### Algorithm selection

Given a task, algorithm selection (Rice, 1976) consists of choosing from a pool the best base-algorithm. A predictor is trained on meta-level experience consisting of a mapping from (statistical or information-theoretic) task features to performance. The base-algorithm chosen for a new task is the one with the best predicted performance.

This basic scheme was later extended in various ways. First, the focus was not to suggest a single algorithm, but rather a ranking of all possible algorithms on the basis of both a given performance measure, but also costs (i.e. time needed to train the classifier) (Brazdil et al., 1994). In addition to this the meta-level, task features were extended in various ways. One important group involved so called sampling landmarks, that is, measures of performance of each alternative classifier on small subsets of data (see e.g. Leite et al., 2005).

 $$\mu$$ chosen base-algorithm $$\mathbf{ML}$$ predicting base-learner performance from extracted task features

#### Algorithm portfolios

More generally, a metalearner may also choose a subset of available algorithms which are run in parallel until one of them succeeds. This is especially useful for algorithms with a heavy-tailed runtime distribution. In this case the meta-parameters consist of a set of algorithms, plus time shares allocated to each. Note that a portfolio is not necessarily static – the allocated time shares can be adjusted during execution once additional information becomes available (Dynamic Algorithm Portfolios; Gagliolo and Schmidhuber, 2006).

 $$\mu$$ set of chosen base-algorithms with time-shares $$\mathbf{ML}$$ predicting base-learner performance

#### Distributed learning algorithms

In neural networks, synaptic weights can be changed by a single global algorithm (e.g. Backprop) or by local learning rules on the synapse level (e.g. Hebb's rule). In the latter case, metalearning can be used to determine how to use the local learning rule. For example, the local rules may depend on their position in the network structure, and evolution is used to pick good ones – one application is a robot navigation task by Mondada and Floreano (1996).

 $$D$$ observation-action-reward sequences $$D_{T}$$ elements of $$D$$ $$\phi$$ navigation performance $$\pi_{\theta}$$ neural network controller $$\theta$$ network weights $$\mathbf{L}_{\mu}$$ variants of Hebbian learning $$\mu$$ learning rules, learning rate $$\mathbf{ML}$$ evolution

Another example is a recurrent neural network metalearner that is set up such that its initial, learnable weights determine how a local rule for fast Hebb-like weight changes is used during sequence processing to achieve a goal (Schmidhuber, 1993b).

### Algorithm learning

A type of metalearning with very high flexibility emerges when the meta-parameters contain (part of) the actual code of the learning algorithm.

#### Meta-Genetic Programming

Meta-Genetic Programming evolves a genetic programming system using genetic programming (Cramer, 1985) itself. Chromosomes as well as program-modifying programs executing crossover and mutation as well as other more complex types of chromosome changes are allowed to evolve on their own, rather than being pre-programmed by a human programmer. A recursive but terminating algorithm of this type (Schmidhuber, 1987) creates an unbounded number of meta-levels and meta-meta-levels etc.

 $$D$$ code-fitness mappings $$D_{T}$$ $$\mathcal{P}(D)$$ $$\phi$$ fitness function $$\pi_{\theta}$$ code $$\theta$$ code $$\mathbf{L}_{\mu}$$ genetic programming $$\mu$$ code-combining code $$\mathbf{ML}$$ genetic programming

#### Optimal Ordered Problem Solver

The Optimal Ordered Problem Solver (OOPS; Schmidhuber, 2004) incrementally learns to solve one problem after another, re-using experience with previous problems (inductive transfer) in a way that is time-optimal in a certain sense. The initial bias is given by a probability distribution P on programs for a universal computer. OOPS solves the first problem with a general, asymptotically optimal problem solver, namely, non-incremental universal search (Levin 1973, 1984): In the i-th phase (i=1,2,3...) test all programs p with runtime at most $$2^iP(p)$$ until the task is solved. The storage for the first program found to compute a solution to any task becomes non-writable. Programs tested during search for solutions to later tasks may copy non-writable code into separate modifiable storage, to edit it and execute the modified result. To solve the n-th task , OOPS sacrifices half the total search time for testing (through a variant of universal search) programs that have the most recent successful program as a prefix. The other half remains for testing fresh programs with arbitrary beginnings. When searching for a universal solver for all tasks in the sequence we have to time-share the second half (but not the first!) among all tasks 1...n.

Program prefixes may recompute the probability distribution on their suffixes in arbitrary computable ways, thus rewriting the search procedure on the suffixes. This allows for metalearning or metasearching, that is, searching for faster search procedures.

 $$D$$ problem sequences $$D_{T}$$ elements of $$D$$ $$\phi$$ fitness function $$\pi_{\theta}$$ code $$\theta$$ code $$\mathbf{L}_{\mu}$$ extended Levin Search $$\mu$$ successful code for previous problems $$\mathbf{ML}$$ OOPS

### Fully self-referential learners

A learning algorithm is fully self-referential if it can inspect and improve every part of its own code. This blurs the distinction between learning on the base-level, meta-level, meta-meta-level, etc. An example is the Gödel machine, a universal, provably optimal learning machine (Schmidhuber, 2006, 2009). It starts out with a theorem prover and a set of self-referential axioms describing its own hardware and software as well as a user-given utility function. Every part of its code can be rewritten by the code itself, but only if such a change will provably lead to better future performance.

 $$D$$ any $$D_{T}$$ $$\mathcal{P}(D)$$ $$\phi$$ performance axioms $$\pi_{\theta}$$ Gödel machine hardware $$\theta$$ Gödel machine software $$\mathbf{L}_{\mu}$$ Gödel machine software $$\mu$$ Gödel machine software $$\mathbf{ML}$$ Gödel machine software

Note that a Gödel machine may change its software such that at some point it is not a metalearner anymore, once it has discovered provably optimal behavior that does not require additional adaptation.

## References

• Anderson, M.L. & Oates, T., 2007. A review of recent research in metareasoning and metalearning. AI Magazine.
• Brazdil, P.B. et al., 2009. Meta-Learning: Applications to Data Mining, Springer.
• Brazdil, P. & Hennery R., 1994. Analysis of Results. In D. Michie, D.J. Spiegelhalter and C.C. Taylor (Eds.), Machine learning, neural and statistical classification, Ellis Horwood.
• Breiman, L., 1996. Bagging Predictors. Machine Learning, 24(2), 123-140.
• Caruana, R., 1997. Multitask learning. Machine Learning.
• Cramer, N.L., 1985. A Representation for the Adaptive Generation of Simple Sequential Programs. In J. J. Grefenstette Proceedings of an International Conference on Genetic Algorithms and Their Applications. Hillsdale NJ.
• Gagliolo, M. & Schmidhuber, J., 2006. Learning Dynamic Algorithm Portfolios. Annals of Mathematics and Artificial Intelligence, 47(3-4), 295-328.
• Giraud-Carrier, C., Vilalta, R. & Brazdil, P., 2004. Introduction to the special issue on meta-learning. Machine Learning, 54(3), 187–193.
• Hochreiter, S. & Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9, 1735-1780.
• Hochreiter, S., Younger, A.S. & Conwell, P.R., 2001. Learning to Learn Using Gradient Descent. In Proceedings of the International Conference on Artificial Neural Networks (ICANN). pp. 87-94.
• Korb, K., 2004. Introduction: machine learning as philosophy of science. Minds and Machines. A
• Leite R., & Brazdil P., 2005: Predicting a relative performance of classifiers from samples. In Proc. of Int. Conf. on Machine Learning, ACM Press.
• Levin, L.A., 1973. Universal sequential search problems. Problems of Information Transmission, 9, 265-266.
• Levin, L.A., 1984. Randomness Conservation Inequalities: Information and Independence in Mathematical Theories. Information and Control, 61, 15-37.
• Mondada, F. & Floreano, D., 1996. Evolution of homing navigation in a real mobile robot. IEEE Transactions on Systems, Man, and Cybernetics.
• Pan, S.J. & Yang, Q., 2009. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering.
• Rendell, L., Seshu, R. & Tcheng, D., 1987. Layered Concept-learning and Dynamical Variable Bias Management. 10th International Joint Conference on AI.
• Rice, J.R., 1976. The algorithm selection problem. In M. Rubinoff & M. C. Yovits, Advances in Computers, 15, 65-118.
• Schapire, R.E., 1990. The Strength of Weak Learnability. Machine Learning, 5, 197-227.
• Schmidhuber, J., 1987. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook., Institut für Informatik, Technische Universität München.
• Schmidhuber, J., 1993a. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam. pp. 446-451.
• Schmidhuber, J., 1993b. On decreasing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam. pp. 460-463.
• Schmidhuber, J., Zhao, J. & Wiering, M., 1997. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28, 105-130.
• Schmidhuber, J., 2004. Optimal Ordered Problem Solver. Machine Learning, 54, 211-254.
• Schmidhuber, J., 2006. Gödel machines: Fully Self-Referential Optimal Universal Self-Improvers. In B. Goertzel & C. Pennachin, Artificial General Intelligence, pp. 199-226.
• Schmidhuber, J., 2009. Ultimate Cognition à la Gödel. Cognitive Computation, 1, 177-193.
• Thrun, S. & Pratt, L., 1998. Learning to learn.
• Utgoff, P., 1986. Shift of bias for inductive concept learning. In R. Michalski, J. Carbonell, & T. Mitchell Machine Learning. pp. 163-190.
• Vilalta, R. & Drissi, Y., 2002. A Perspective View and Survey of Meta-Learning. Journal of Artificial Intelligence Review, 18(2), 77-95.
• Wolpert, D.H., 1992. Stacked generalization. Neural Networks, 5, 241-259.

Internal references

• Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.
• Wolfram Schultz (2007) Reward. Scholarpedia, 2(3):1652.