# Evolution strategies

**Evolution Strategies (ESs)** are a sub-class of nature-inspired
direct search (and optimization) methods belonging to the class of
Evolutionary Algorithms (EAs) which use mutation, recombination,
and selection applied to a population of individuals containing
candidate solutions in order to evolve iteratively better and better
solutions. Its roots date back to the mid 1960s when P. Bienert,
I. Rechenberg, and H.-P. Schwefel at the Technical University of Berlin,
Germany, developed the first bionics-inspired schemes for evolving
optimal shapes of minimal drag bodies in a wind tunnel using Darwin's
evolution principle.

Evolution Strategies can be applied in all fields of optimization
including continuous, discrete, combinatorial search spaces
\(\mathcal{Y}\)
without and with *constraints* as well as mixed search spaces.
Given the optimization problem
\[
\mathbf{y}^* = \mathrm{argopt}_{{\mathbf{y} \in \mathcal{Y}}} \, f(\mathbf{y}),
\]
the function \(f(\mathbf{y})\) to be optimized, also referred
to as objective (or goal) function, can be presented in mathematical
form, via simulations, or even in terms of measurements obtained from
real objects. The ES can also be applied to a set of objective functions
in context of multi-objective optimization
(see also Multiobjective Evolutionary Algorithms and Multiobjective Search).

## The Canonical ES Versions

The canonical versions of the ES are denoted by
\[
(\mu/\rho, \lambda)\mbox{-ES} \quad \mbox{and} \quad
(\mu/\rho + \lambda)\mbox{-ES},
\]
respectively. Here \(\mu\) denotes the number of parents,
\(\rho \leq \mu\) the mixing number (i.e., the number of parents
involved in the procreation of an offspring), and \(\lambda\)
the number of offspring. The parents are *deterministically* selected
(i.e., deterministic survivor selection) from
the (multi-)set of either the offspring, referred to as *comma-selection*
(\( \mu < \lambda \) must hold), or both the parents and offspring,
referred to as *plus-selection*.
Selection is based on the ranking of the individuals' fitness
\(F(\mathbf{y})\) taking the \( \mu \) best individuals
(also referred to as truncation selection). In general, an
\[
\mbox{ES individual} \quad \mathbf{a} := (\mathbf{y}, \mathbf{s}, F(\mathbf{y}))
\]
comprises the *object parameter vector*
\(\mathbf{y} \in \mathcal{Y}\) to be optimized,
a set of strategy parameters \(\mathbf{s}\ ,\) needed especially in
self-adaptive ESs, and the individual's observed fitness \(F(\mathbf{y})\)
being equivalent to the objective function \(f(\mathbf{y})\ ,\) i.e.,
\(F(\mathbf{y}) \equiv f(\mathbf{y})\) in the simplest case.
The distinction between \(F(\mathbf{y})\) and \(f(\mathbf{y})\)
is necessary, since \(F(\mathbf{y})\) can be the result of a local
search operator which is applied to the \(f(\mathbf{y})\)-function
to be optimized, or it even
can be the outcome of another ES (see Meta-ES, below).
Furthermore, the observed \(F(\mathbf{y})\) can be the result of a
noisy \(f(\mathbf{y})\)-evaluation process.

The conceptual algorithm of the \((\mu/\rho \; \stackrel{+}{,} \;\lambda)\)-ES is given below:

**\((\mu/\rho \; \stackrel{+}{,} \; \lambda)\)-Self-Adaptation-Evolution-Strategy**

- Initialize parent population \(\mathbf{P}_\mu = \{ \mathbf{a}_1, \ldots, \mathbf{a}_{\mu} \}\ .\)
- Generate \(\lambda\) offspring \(\tilde{\mathbf{a}}\) forming the offspring population \(\tilde{\mathbf{P}}_\lambda = \{ \tilde{\mathbf{a}}_1, \ldots, \tilde{\mathbf{a}}_\lambda\}\) where each offspring \(\tilde{\mathbf{a}}\) is generated by:
- Select (randomly) \(\rho\) parents from \(\mathbf{P}_\mu\) (if \(\rho = \mu\) take all parental individuals instead).
- Recombine the \(\rho\) selected parents \(\mathbf{a}\) to form a recombinant individual \(\mathbf{r}\ .\)
- Mutate the strategy parameter set \(\mathbf{s}\) of the recombinant \(\mathbf{r}\ .\)
- Mutate the objective parameter set \(\mathbf{y}\) of the recombinant \(\mathbf{r}\) using the mutated strategy parameter set to control the statistical properties of the object parameter mutation.

- Select new parent population (using deterministic truncation selection) from either
- the offspring population \(\tilde{\mathbf{P}}_\lambda\) (this is referred to as
*comma*-selection, usually denoted as "\((\mu,\lambda)\)-selection"), or - the offspring \(\tilde{\mathbf{P}}_\lambda\)
*and*parent \(\mathbf{P}_\mu\)population (this is referred to as*plus*-selection, usually denoted as "\((\mu + \lambda)\)-selection")

- the offspring population \(\tilde{\mathbf{P}}_\lambda\) (this is referred to as
- Goto 2. until
*termination criterion*fulfilled.

Depending on the search space and objective function \(f(\mathbf{y})\ ,\) the recombination and/or the mutation of the strategy parameters may or may not occur in specific instantiations of the algorithm. For example, a \((\mu/1 + \lambda)\)-ES, or equivalently \((\mu + \lambda)\)-ES, does not use recombination. It draws its new \(\mu\) parents for the next generation from both the old \(\mu\) parents and the \(\lambda\) offspring (generated from these parents) by taking the best \(\mu\) individuals (with respect to the observed \(F(\mathbf{y})\)).

Evolution Strategies of type \((\mu/\rho + 1)\) are also referred
to as *steady state ESs*, i.e., strategies without a generation gap:
They produce only one offspring per generation. After evaluating its
fitness \(F(\mathbf{y})\ ,\) the worst individual is removed from the
population. Strategies of this type are especially useful on parallel
computers when the times for calculating the individuals' fitnesses are
non-constant, thus allowing for asynchronous parallel processing.

As to the *termination condition*, distance measures in the fitness
and or search space as well as the maximum number of generations can be
used.

## Example: (\(\mu/\mu_I, \lambda\))-\(\sigma\)-Self-Adaptation-ES

Mutation and recombination operators in ESs are problem-specifically
designed. As an example, consider the
\((\mu/\mu_I, \lambda)\)-Self-Adaptation-ES
for unconstrained real-valued search spaces \(\mathbb{R}^n\ .\)
An individual is defined as
\(\mathbf{a} = (\mathbf{y}, \sigma, F(\mathbf{y}))\ .\)
The ES generates \(\lambda\) offspring according to
\[
\forall l=1, \ldots, \lambda : \;\; \mathbf{a}_l \leftarrow
\begin{cases}
& \sigma_l \leftarrow \langle \sigma \rangle
\mathrm{e}^{\tau \mathrm{N}_l(0,1)}, \\[2mm]
& \mathbf{y}_l \leftarrow \langle \mathbf{y} \rangle +
\sigma_l \mathbf{N}_l(\mathbf{0}, \mathbf{I}), \\[2mm]
& F_l \leftarrow F(\mathbf{y}_l),
\end{cases}
\qquad\qquad\mbox{(I)}
\]
representing steps 2 and 3 of the conceptual
\((\mu/\rho \; \stackrel{+}{,} \; \lambda)\)-Self-Adaptation-Evolution-Strategy
algorithm (initialization, evolution loop, and termination condition
are not shown).
The \(\mathrm{N}_l(0,1)\) and
\(\mathbf{N}_l(\mathbf{0}, \mathbf{I})\) are (0, 1) normally
distributed random scalars and vectors, respectively, implementing
the mutation operation for the strategy parameter \(\sigma\)
and the \(n\)-dimensional object parameter vector
\(\mathbf{y}.\) Both mutation operations are applied to the
respective recombinants \(\langle \sigma \rangle\) and
\(\langle \mathbf{y} \rangle.\) The mutated strategy parameter
\(\sigma_l\) controls the strength of the object parameter
mutation (in this example, \(\sigma_l\) is simply the
standard deviation of the normally distributed random components).
This mutation is additively applied to the recombinant
\(\langle \mathbf{y} \rangle.\) Changing the mutation strength
\(\sigma\) according to (I),
allows for a *self-tuning* of the mutation strength:
Since the individual's \(\sigma_l\) controls the generation
of the individual's \(\mathbf{y}_l,\) selecting a specific
individual \(\mathbf{a}_l\) according to its fitness
\(F(\mathbf{y}_l)\) results in an inheritance of the
corresponding \(\sigma_l\) value. This process is also
referred to as \(\sigma\)-*self-adaptation*. That is,
self-adaptive ESs can work without any model-based external control rule
for the endogenous strategy parameters. The rate of self-adaptation
depends on the choice of the *learning parameter* \(\tau.\)
Empirical as well as theoretical investigations suggest to choose
it proportional to \(1/\sqrt{n}\ ,\) e.g.,
\(\tau = 1/\sqrt{2n}\ .\)
Recombination in (I) is done by arithmetical averaging
over the \(\mu\) best offspring individuals:
Let \(a\) be a component of the individual \(\mathbf{a}\)
(either being \(\sigma\) or a component of \(\mathbf{y}\))
the corresponding component of the recombinant
\(\langle a \rangle\) is calculated according to
\[
\langle a \rangle := \frac{1}{\mu} \sum_{m=1}^{\mu} a_{m;\lambda},
\qquad\qquad\mbox{(II)}
\]
where "\(m;\lambda\)" denotes the \(m\)-th best
offspring individual (of the last generation). This type of
recombination is referred to as *global intermediate*
recombination and denoted by a subscript \(I\) attached to the
mixing number \(\rho\ .\) Apart from the intermediate
recombination there are other types, e.g. discrete recombination
where the parental components are coordinate-wise randomly transferred
to the recombinant.

Simple implementations of the \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES for Mathematica and Matlab/Octave can be found here.

## Variations on ESs and Operator Design Principles

While the \((\mu/\mu_I,\lambda)\)-ES in Eq. (I) uses
isotropically distributed mutations for the object parameter vector
\(\mathbf{y}\ ,\) more advanced ESs use
*covariance matrix adaptation (CMA)* techniques (CMA-ESs) to allow
for correlated mutations in real-valued search spaces.
Furthermore, hierarchically organized ESs (also referred to as
Meta-ESs or Nested-ESs) as well as predator-prey ESs are in use.
For example, simple Meta-ESs can be defined by
\[
\left[ \mu' / \rho' + \lambda' (\mu/\rho, \lambda)^\gamma
\right]\mbox{-ES}
\]
with \(\lambda'\) subpopulations of
\((\mu/\rho, \lambda)\)-ESs running
independently over a number of generations \(\gamma\)
(isolation time).
Such strategies are used in mixed structure and parameter
optimization tasks and for evolutionary learning of strategy
parameters (e.g., population sizes, mutation parameters) of the
inner evolution loop.

The performance of an ES on a specific problem class depends
crucially on the design of the ES-operators (mutation, recombination,
selection) used and on the manner in which the ES-operators are
adapted during the evolution process (adaptation schemes, e.g.,
\(\sigma\)-self-adaptation, covariance matrix adaptation,
etc.). Ideally they should be designed in such a manner
that they guarantee the *evolvability* of the system throughout
the whole evolution process. Here are some principles and general
guide lines:

- Selection is done by population truncation similar to that what breeders are doing when breeding animals or plants.
- Typical truncation ratios \(\mu/\lambda\) in strategies with
*comma*-selection in continuous search spaces are in the range from 1/7 to 1/2. - Using
*plus*-selection (some kind of elitist selection) in conjunction with variation operators that allow for reaching any point in finite discrete search spaces in finite time guarantees stochastic convergence to the global optimizer. However, since this is a result that holds for infinite running time only, one cannot draw general conclusions regarding the finite time behavior of the ESs. - Using
*plus*-selection is recommended for combinatorial optimization problems. - Evolution in ESs is usually modeled on the
*phenotypic*level. The problem to be optimized is usually presented in its*natural*problem representation trying to respect the principle of*strong causality.*This means that the variation operators (mutation and recombination) should perform search steps in such a manner that small search steps result in small fitness changes and vice versa. - Variation operators (mutation and recombination) should be designed on the basis of the maximum entropy principle.
- Mutation is the main source of variation, i.e., an ES always contains a mutation operator. Mutation should respect the reachability condition: Each (finite distant) point of the search space should be reachable in a finite number of steps. Mutations should be scalable (if possible) in order to allow for self-adaptation. The maximum entropy principle suggests the use of normally distributed mutations in \(\mathbb{R}^n\) search spaces and geometrically distributed mutations in \(\mathbb{Z}^n\) (integer) search spaces.
- Recombination is applied whenever possible and useful. It uses \(\rho=2\) or more parental individuals to generate a single recombinant (the case \(\rho > 2\) is referred to as
*multi-recombination*). The main goal of recombination is the conservation of common components of the parents, i.e., the transfer of (beneficial) similarities to the next generation and to dampen the effect of malicious components of the parents' genes (genetic repair effect).

Note, it is not always possible to respect all design principles in specific applications. Violating some of these principles does not necessarily lead to poorly performing strategies.

## Example: A Simple \((\mu+\lambda)\)-ES for Combinatorial Ordering Problems

Consider an optimization problem
\(\mathbf{y}^* = \mathrm{argopt}_{\mathbf{y}} f(\mathbf{y})\)
where \(\mathbf{y}\) is a vector describing the permutation of
\(n\) components.
For example, \(\mathbf{y} = (1, 3, 9, 2, \ldots)\) describes the ordering
of components, e.g., the ordering of city numbers a salesperson visits
consecutively (Traveling Salesperson Problem), or the ordering of jobs
in a job shop scheduling problem such that the overall cost
\(f(\mathbf{y})\)
are minimal. In ESs, this optimization problem is usually represented in its
*natural problem representation*, i.e., the variation operators act
directly on the ordering \(\mathbf{y}\ .\) An individual is defined as
\(\mathbf{a} = (\mathbf{y}, F(\mathbf{y}))\ .\)
The ES generates \(\lambda\) offspring according to
\[
\forall l=1, \ldots, \lambda : \;\;
\begin{cases}
& m \leftarrow \mbox{rand}\{1, \mu\}, \\[2mm]
& \mathbf{y}_l \leftarrow \mbox{PerMutate}( \mathbf{y}_{m; \, \mu+\lambda}), \\[2mm]
& F_l \leftarrow F(\mathbf{y}_l)
\end{cases}
\]
representing steps 2 and 3 of the conceptual ES algorithm. The ES selects
randomly a parent from the set of the \(\mu\) best individuals from both the
parents and offspring of the last generation (indicated by the
\(\mathbf{y}_{m; \, \mu+\lambda}\) notation). This parent is then
mutated by a random permutation. Simple permutation operators are
displayed in Figure 1.

They represent elementary move steps defining certain search
neighborhoods (number of states that can be reached by one step).
Unlike mutations in continuous search spaces, there exists always a
minimal search step (representing the smallest mutation possible).
The performance of the different permutation operators depends on the
optimization problem to be solved. Trying to ensure the *strong causality principle*
(i.e., small steps should result in small fitness changes)
is one of the main design and selection principle for such operators.

## Covariance Matrix Adaptation Evolution Strategies CMA-ESs

CMA-ESs represent the state-of-the-art in evolutionary optimization
in real-valued \(\mathbb{R}^n\) search spaces. The CMA-ES has
been proposed by A. Gawelczyk, N. Hansen, and A. Ostermeier in the
mid 1990s. Its basic difference to the
\((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES
example is the shape of mutation distribution which is generated
according to a covariance matrix \(\mathbf{C}\) which is
*adapted* during evolution. Thus, the mutations can adapt to the local
shape of the fitness landscape and convergence to the optimum can be
increased considerably. It uses special statistics cumulated over the
generations to control endogenous strategy-specific parameters
(the covariance matrix \(\mathbf{C}\) and the global step
size \(\sigma\)). This is in contrast to the (mutative)
\(\sigma\)-self-adaptation approach considered before.
A simplified (but well working) instantiation of the offspring update
formulas of the \((\mu/\mu_I, \lambda)\)-CMA strategy for
small population sizes \(\lambda\) (small, compared to the
search space dimensionality \(n\)) reads

**\((\mu/\mu_I, \lambda)\)-CMA-ES**

\[\mbox{(L1):} \quad \forall l=1, \ldots, \lambda : \;\; \begin{cases} & \mathbf{w}_l \leftarrow \sigma \sqrt{ \mathbf{C} } \, \mathbf{N}_l(\mathbf{0}, \mathbf{1}),\\[2mm] & \mathbf{y}_l \leftarrow \mathbf{y} + \mathbf{w}_l, \\[2mm] & F_l \leftarrow F(\mathbf{y}_l), \end{cases} \] \[\mbox{(L2):} \quad \mathbf{y} \leftarrow \mathbf{y} + \langle \mathbf{w} \rangle, \] \[\mbox{(L3):} \quad \mathbf{s} \leftarrow \left(1-\frac{1}{\tau}\right)\mathbf{s} + \sqrt{\frac{\mu}{\tau} \left(2-\frac{1}{\tau}\right)} \, \frac{\langle \mathbf{w} \rangle}{\sigma}, \] \[\mbox{(L4):} \quad \mathbf{C} \leftarrow \left(1-\frac{1}{\tau_{\mathrm{c}}}\right)\mathbf{C} + \frac{1}{\tau_{\mathrm{c}}} \mathbf{s} \mathbf{s}^T, \] \[\mbox{(L5):} \quad \mathbf{s}_\sigma \leftarrow \left(1-\frac{1}{\tau_\sigma}\right) \mathbf{s}_\sigma + \sqrt{\frac{\mu}{\tau_\sigma} \left(2-\frac{1}{\tau_\sigma}\right)} \, \langle \mathbf{N}(\mathbf{0}, \mathbf{1}) \rangle , \] \[\mbox{(L6):} \quad \sigma \leftarrow \sigma\exp\left[ \frac{\| \mathbf{s}_{\sigma} \|^2 - n} {2 n \sqrt{n} } \right]. \]

In (L1), \(\lambda\) offspring \(\mathbf{y}_l\) are generated by transforming standard normally distributed random vectors using a transformation matrix \(\sqrt{\mathbf{C}}\) to be obtained, e.g., by Cholesky decomposition of the covariance matrix \(\mathbf{C}\ ,\) i.e., \(\sqrt{\mathbf{C}} \sqrt{\mathbf{C}}^T = \mathbf{C}\ ,\) and the global step size factor \(\sigma\ .\) In (L2) the best \(\mu\) mutations are recombined according to Equation (II), thus, forming the recombinant \(\mathbf{y}\) (center of mass individual) for the next generation. Algorithmically, there is only one recombinant (to be used as the final solution when the algorithm terminates). Vector \(\langle \mathbf{w} \rangle\) connects the recombinants of two consecutive generations, thus, \(\langle \mathbf{w} \rangle/\sigma\) represents the tendency of evolution in the search space. This information is cumulated in the \(\mathbf{s}\) vector (\(\mathbf{s} = \mathbf{0}\ ,\) initially chosen) exponentially decaying with time constant \(\tau=\sqrt{n}\ .\) The direction vector \(\mathbf{s}\) is used to update (rank one update) the covariance matrix \(\mathbf{C}\) (\(\mathbf{C} = \mathbf{1}\ ,\) initially chosen) with time constant \(\tau_{\mathrm{c}} \propto n^2\ .\) The remaining (L5) and (L6) are used to control the global step size \(\sigma\) using the cumulated step size adaptation (CSA) technique with time constant \(\tau_\sigma = \sqrt{n}\) (\(\mathbf{s}_\sigma = \mathbf{0}\ ,\) initially chosen). The recombinant \(\langle \mathbf{N}(\mathbf{0}, \mathbf{1}) \rangle\) is calculated using Equation (II).

Simple implementations of the CMA-ES for Mathematica and Matlab/Octave can be found here.

CMA-ES for large population sizes differ basically in the way how the covariance matrix \(\mathbf{C}\) is updated in (L4). To this end, (L4) is extended by a rank \(\mu\) update proportional to \[ \frac{1}{\mu \sigma^2} \sum_{m=1}^\mu \mathbf{w}_{m;\lambda} (\mathbf{w}_{m;\lambda})^T. \] In order to take full advantage of this update, the time constants \(\tau\ ,\) \(\tau_{\mathrm{c}}\ ,\) and \(\tau_\sigma\) must be chosen accordingly (see Hansen et al. 2003).

## References

- Beyer, H.-G. and Schwefel, H.-P. (2002). Evolution Strategies: A Comprehensive Introduction. In
*Natural Computing,*1(1):3-52.

- Beyer, H.-G. (2001). The Theory of Evolution Strategies. Natural Computing Series. Springer, Berlin 2001.

- Hansen, N. and Ostermeier, A. (2001). Completely Derandomized Self-Adaptation in Evolution Strategies. In
*Evolutionary Computation,*9(1):159-195.

- Hansen, N. and Müller, S.D. and Koumoutsakos, P. (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). In
*Evolutionary Computation,*11(1):1-18.

- Rechenberg, I. (1994). Evolutionsstrategie '94. Frommann-Holzboog Verlag, Stuttgart, (in German).

- Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Wiley, New York, NY.

**Internal references**

- Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.
- Tomasz Downarowicz (2007) Entropy. Scholarpedia, 2(11):3901.
- Rob Schreiber (2007) MATLAB. Scholarpedia, 2(7):2929.
- Frank Hoppensteadt (2006) Predator-prey model. Scholarpedia, 1(10):1563.

## External Links

- Hans-Georg Beyer's Website
- Evolutionary_Algorithms - Terms and Definitions
- Evolution Strategy Related Website of Bionics Chair at the Technical University Berlin
- Nikolaus Hansen's Website with CMA-ES Related Stuff
- H.-P. Schwefel's Two-Phase Nozzle Evolved by an (1+1)-ES
- Following a Moving Optimum by a (15,100)-ES
- A Collection of Evolution Strategy Demonstrations (to be found below)
- Evolution Strategies in Action: Animations Using MS Windows

## See Also

Evolutionary Algorithms, Evolutionary Computation, Evolutionary Programming, Evolutionary Robotics, Evolvable Hardware, Evolving Intelligent Systems, Genetic Algorithms, Genetic Programming