# Evolution strategies

This revision has not been approved and may contain inaccuracies

Post-publication activity

Curator: Hans-Georg Beyer

Evolution Strategies (ES) are a sub-class of nature-inspired direct search (and optimization) methods belonging to the class of Evolutionary Algorithms (EA) which use mutation, recombination, and selection applied to a population of individuals containing candidate solutions in order to evolve iteratively better and better solutions. Its roots date back to the mid 1960s when P. Bienert, I. Rechenberg, and H.-P. Schwefel at the Technical University of Berlin, Germany, developed the first bionics-inspired schemes for evolving optimal shapes of minimal drag bodies in a wind tunnel using Darwin's evolution principle.

Evolution Strategies can be applied in all fields of optimization including continuous, discrete, combinatorial search spaces $$\mathcal{Y}$$ without and with constraints as well as mixed search spaces. Given the optimization problem $\mathbf{y}^* = \mathrm{argopt}_{{\mathbf{y} \in \mathcal{Y}}} \, f(\mathbf{y}),$ the function $$f(\mathbf{y})$$ to be optimized, also referred to as objective (or goal) function, can be presented in mathematical form, via simulations, or even in terms of measurements obtained from real objects.

## The Canonical ES Versions

The canonical versions of the ES are denoted by $(\mu/\rho, \lambda)\mbox{-ES} \quad \mbox{and} \quad (\mu/\rho + \lambda)\mbox{-ES},$ respectively. Here $$\mu$$ denotes the number of parents, $$\rho \leq \mu$$ the mixing number (i.e., the number of parents involved in the procreation of an offspring), and $$\lambda$$ the number of offspring. The parents are deterministically selected from the (multi-)set of either the offspring, referred to as comma-selection ($$\mu < \lambda$$ must hold), or both the parents and offspring, referred to as plus-selection. Selection is based on the ranking of the individuals' fitness $$F(\mathbf{y})$$. In general, an $\mbox{ES individual} \quad \mathbf{a} := (\mathbf{y}, \mathbf{s}, F(\mathbf{y}))$ comprises the object parameter vector $$\mathbf{y} \in \mathcal{Y}$$ to be optimized, a set of strategy parameters $$\mathbf{s}$$, needed especially in self-adaptive ESs, and the individual's observed fitness $$F(\mathbf{y})$$ being equivalent to the objective function $$f(\mathbf{y})$$, i.e., $$F(\mathbf{y}) \equiv f(\mathbf{y})$$ in the simplest case. (The distinction between $$F(\mathbf{y})$$ and $$f(\mathbf{y})$$ is necessary, since $$F(\mathbf{y})$$ can be the result of a local search operator which is applied to the function to be optimized $$f(\mathbf{y})$$. Furthermore, the observed $$F(\mathbf{y})$$ can be the result of a noisy $$f(\mathbf{y})$$-evaluation process). The conceptual algorithm of the $$(\mu/\rho \; \stackrel{+}{,} \;\lambda)$$-ES is given below:

$$(\mu/\rho \; \stackrel{+}{,} \; \lambda)$$-Self-Adaptation-Evolution-Strategy

1. Initialize parent population $$\mathbf{P}_\mu = \{ \mathbf{a}_1, \ldots, \mathbf{a}_{\mu} \}$$.
2. Generate $$\lambda$$ offspring $$\tilde{\mathbf{a}}$$ forming the offspring population $$\tilde{\mathbf{P}}_\lambda = \{ \tilde{\mathbf{a}}_1, \ldots, \tilde{\mathbf{a}}_\lambda\}$$ where each offspring $$\tilde{\mathbf{a}}$$ is generated by:
1. Select (randomly) $$\rho$$ parents from $$\mathbf{P}_\mu$$ (if $$\rho = \mu$$ take all parental individuals instead).
2. Recombine the $$\rho$$ selected parents $$\mathbf{a}$$ to form a recombinant individual $$\mathbf{r}$$.
3. Mutate the strategy parameter set $$\mathbf{s}$$ of the recombinant $$\mathbf{r}$$.
4. Mutate the objective parameter set $$\mathbf{y}$$ of the recombinant $$\mathbf{r}$$ using the mutated strategy parameter set to control the statistical properties of the object parameter mutation.
3. Select new parent population from either
• the offspring population $$\tilde{\mathbf{P}}_\lambda$$ (this is referred to as comma-selection), or
• the offspring $$\tilde{\mathbf{P}}_\lambda$$ and parent $$\mathbf{P}_\mu$$population (this is referred to as plus-selection)
4. Goto 2. until termination criterion fulfilled.

Depending on the search space and objective function $$f(\mathbf{y})$$, the recombination and/or the mutation of the strategy parameters may or may not occur in specific instantiation of the algorithm. For example, a $$(\mu/1 + \lambda)$$-ES, or equivalently $$(\mu + \lambda)$$-ES, does not use recombination. It draws its new $$\mu$$ parents for the next generation from both the old $$\mu$$ parents and the $$\lambda$$ offspring (generated from these parents) by taking the best $$\mu$$ individuals (with respect to the observed $$F(\mathbf{y})$$).

Evolution Strategies of type $$(\mu/\rho + 1)$$ are also referred to as steady state ESs, i.e., strategies without a generation gap: They produce only one offspring per generation. After evaluating its fitness $$F(\mathbf{y})$$, the worst individual is removed from the population. Strategies of this type are especially useful on parallel computers when the times for calculating the individuals' fitnesses are non-constant, thus allowing for asynchronous parallel processing.

As to the termination condition, distance measures in the fitness and or search space as well as the maximum number of generations can be used.

## Example$(\mu/\mu_I, \lambda)$-$$\sigma$$-Self-Adaptation-ES

Mutation and recombination operators in ES are problem-specifically designed. As an example, consider the $$(\mu/\mu_I, \lambda)$$-Self-Adaptation-ES for unconstrained real-valued search spaces $$\mathbb{R}^n$$. An individual is defined as $$\mathbf{a} = (\mathbf{y}, \sigma, F(\mathbf{y}))$$. The ES generates $$\lambda$$ offspring according to $\forall l=1, \ldots, \lambda : \;\; \begin{cases} & \sigma_l \leftarrow \langle \sigma \rangle \mathrm{e}^{\tau \mathrm{N}_l(0,1)}, \\[2mm] & \mathbf{y}_l \leftarrow \langle \mathbf{y} \rangle + \sigma_l \mathbf{N}_l(\mathbf{0}, \mathbf{I}), \end{cases} \qquad\qquad\mbox{(I)}$ representing steps 2 and 3 of the conceptual $$(\mu/\rho \; \stackrel{+}{,} \; \lambda)$$-Self-Adaptation-Evolution-Strategy algorithm (initialization, evolution loop, and termination condition are not shown). The $$\mathrm{N}_l(0,1)$$ and $$\mathbf{N}_l(\mathbf{0}, \mathbf{I})$$ are (0, 1) normally distributed random scalars and vectors, respectively, implementing the mutation operation for the strategy parameter $$\sigma$$ and the $$n$$-dimensional object parameter vector $$\mathbf{y}.$$ Both mutation operations are applied to the respective recombinants $$\langle \sigma \rangle$$ and $$\langle \mathbf{y} \rangle.$$ The mutated strategy parameter $$\sigma_l$$ controls the strength of the object parameter mutation (in this example, $$\sigma_l$$ is simply the standard deviation of the normally distributed random components). This mutation is additively applied to the recombinant $$\langle \mathbf{y} \rangle.$$ Changing the mutation strength $$\sigma$$ according to (I), allows for a self-tuning of the mutation strength: Since the individual's $$\sigma_l$$ controls the generation of the individual's $$\mathbf{y}_l,$$ selecting a specific individual $$\mathbf{a}_l$$ according to its fitness $$F(\mathbf{y}_l)$$ results in an inheritance of the corresponding $$\sigma_l$$ value. This process is also referred to as self-adaptation. That is, self-adaptive ES can work without any model-based external control rule for the endogenous strategy parameters. The rate of self-adaptation depends on the choice of the learning parameter $$\tau.$$ Empirical as well as theoretical investigations suggest to choose it proportional to $$1/\sqrt{n}$$, e.g., $$\tau = 1/\sqrt{2n}$$. Recombination in (I) is done by arithmetical averaging over the $$\mu$$ best offspring individuals: Let $$a$$ be a component of the individual $$\mathbf{a}$$ (either being $$\sigma$$ or a component of $$\mathbf{y}$$) the corresponding component of the recombinant $$\langle a \rangle$$ is calculated according to $\langle a \rangle := \frac{1}{\mu} \sum_{m=1}^{\mu} a_{m;\lambda}, \qquad\qquad\mbox{(II)}$ where "$$m;\lambda$$" denotes the $$m$$th best offspring individual (of the last generation). This type of recombination is referred to as global intermediate recombination and denoted by a subscript $$I$$ attached to the mixing number $$\rho$$. Apart from the intermediate recombination there are other types, e.g. discrete recombination where the parental components are coordinate-wise randomly transfered to the recombinant.

## Variations on ESs and Operator Design Principles

While the $$(\mu/\mu_I,\lambda)$$-ES in Eq. (I) uses isotropically distributed mutations for the object parameter vector $$\mathbf{y}$$, more advanced ES use covariance matrix adaptation (CMA) techniques (CMA-ES) to allow for correlated mutations in real-valued search spaces. Furthermore, hierarchically organized ES (also referred to as Meta-ES or Nested-ES) are in use. For example, simple Meta-ES can be defined by $\left[ \mu' / \rho' + \lambda' (\mu/\rho, \lambda)^\gamma \right]\mbox{-ES}$ with $$\lambda'$$ subpopulations of $$(\mu/\rho, \lambda)$$-ESs running independently over a number of generations $$\gamma$$ (isolation time). Such strategies are used in mixed structure and parameter optimization tasks and for evolutionary learning of strategy parameters (e.g., population sizes, mutation parameters) of the inner evolution loop.

The performance of an ES on a specific problem class depends crucially on the design of the ES-operators (mutation, recombination, selection) used. Ideally they should be designed in such a manner that they guarantee the evolvability of the system throughout the whole evolution process. Here are some principles and general guide lines:

• Selection is done by population truncation similar to that what breeder are doing when breeding animals or plants.
• Typical truncation ratios $$\mu/\lambda$$ in strategies with comma-selection in continuous search spaces are in the range from 1/7 to 1/2.
• Using plus-selection (some kind of elitist selection) in conjunction with variation operators that allow for reaching any point in finite discrete search spaces in finite time guarantees stochastic convergence to the global optimizer. However, since this is a result that holds for infinite running time only, one cannot draw general conclusions regarding the finite time behavior of the ES.
• Using plus-selection is recommended for combinatorial optimization problems.
• Evolution in ES is usually modeled on the phenotypic level. The problem to be optimized is usually presented in its natural problem representation trying to respect the principle of strong causality. This means that the variation operators (mutation and recombination) should perform search steps in such a manner that small search steps result in small fitness changes and vice versa.
• Variation operators (mutation and recombination) should be designed on the basis of the maximum entropy principle.
• Mutation is the main source of variation, i.e., ES always contains a mutation operator. Mutation should respect the reachability condition: Each (finite distant) point of the search space should be reachable in a finite number of steps. Mutations should be scalable (if possible) in order to allow for self-adaptation. The maximum entropy principle suggests the use of normally distributed mutations in $$\mathbb{R}^n$$ search spaces.
• Recombination is applied whenever possible and useful. It uses $$\rho=2$$ or more parental individuals to generate a single recombinant (if the number $$\rho > 2$$ multi-recombination is performed). The main goal of recombination is the conservation of common components of the parents, i.e., the transfer of (beneficial) similarities to the next generation and to dampen the effect of malicious components of the parents' genes (genetic repair effect).

Note, it is not always possible to respect all design principles in specific applications. Violating some of these principles does not necessarily lead to poorly performing strategies.

## Example: A Simple $$(\mu+\lambda)$$-ES for Combinatorial Ordering Problems

Consider an optimization problem $$\mathbf{y}^* = \mathrm{argopt}_{\mathbf{y}} f(\mathbf{y})$$ where $$\mathbf{y}$$ is a vector describing the permutation of $$n$$ components. For example, $$\mathbf{y} = (1, 3, 9, 2, \ldots)$$ describes the ordering of components, e.g., the ordering of city numbers a salesperson visits consecutively (Traveling Salesperson Problem), or the ordering of jobs in a job shop scheduling problem such that the overall cost $$f(\mathbf{y})$$ are minimal. In ESs, this optimization problem is usually represented in its natural problem representation, i.e., the variation operators act directly on the ordering $$\mathbf{y}$$. An individual is defined as $$\mathbf{a} = (\mathbf{y}, F(\mathbf{y}))$$. The ES generates $$\lambda$$ offspring according to $\forall l=1, \ldots, \lambda : \;\; \begin{cases} & m \leftarrow \mbox{rand}\{1, \mu\}, \\[2mm] & \mathbf{y}_l \leftarrow \mbox{PerMutate}( \mathbf{y}_{m; \, \mu+\lambda}), \end{cases}$ representing steps 2 and 3 of the conceptual ES algorithm. The ES selects randomly a parent from the set of the $$\mu$$ best individuals from both the parents and offspring of the last generation (indicated by the $$\mathbf{y}_{m; \, \mu+\lambda}$$ notation). This parent is then mutated by a random permutation. Simple permutation operators are displayed in Figure <ref>fig1</ref>.

Figure 1: Four permutation operators: from left to right and top to bottom: inversion, insertion, 2-exchange, and shifting.

They represent elementary move steps defining certain search neighborhoods (number of states that can be reached by one step). Unlike mutations in continuous search spaces, there exists always a minmal search step (representing the smallest mutation possible). The performance of the different permutation operators depends on the optimization problem to be solved. Trying to ensure the strong causality principle (i.e., small steps should result in small fitness changes) is one of the main design and selection principle for such operators.

## Covariance Matrix Adaptation Evolution Strategies CMA-ES

CMA-ES represents the state-of-the-art in evolutionary optimization in real-valued $$\mathbb{R}^n$$ search spaces. It has been proposed by A. Gawelczyk, N. Hansen, and A. Ostermeier in the late 1990s. Its basic difference to the $$(\mu/\mu_I, \lambda)$$-$$\sigma$$-Self-Adaptation-ES example is the shape of mutation distribution which is generated according to a covariance matrix $$\mathbf{C}$$ which is adapted during evolution. Thus, the mutations can adapt to the local shape of the fitness landsacpe and convergence to the optimum can be increased considerably. It uses special statistics cumulated over the generations to control endogenous strategy-specific parameters (the covariance matrix $$\mathbf{C}$$ and the global step size $$\sigma$$). This is in contrast to the mutative self-adaptation approach considered before. An instantiation of the offspring update formulae of the simplest $$(\mu/\mu_I, \lambda)$$-CMA strategy for small population sizes $$\lambda$$ (small, compared to the search space dimensionality $$n$$) reads

$$(\mu/\mu_I, \lambda)$$-CMA-ES

$\mbox{(L1):} \quad \forall l=1, \ldots, \lambda : \;\; \begin{cases} & \mathbf{w}_l \leftarrow \sigma \sqrt\mathbf{C} \, \mathbf{N}_l(\mathbf{0}, \mathbf{1}),\\[2mm] & \mathbf{y}_l \leftarrow \mathbf{y} + \mathbf{w}_l, \end{cases}$ $\mbox{(L2):} \quad \mathbf{y} \leftarrow \mathbf{y} + \langle \mathbf{w} \rangle,$ $\mbox{(L3):} \quad \mathbf{s} \leftarrow \left(1-\frac{1}{\tau}\right)\mathbf{s} + \sqrt{\frac{\mu}{\tau} \left(2-\frac{1}{\tau}\right)} \, \frac{\langle \mathbf{w} \rangle}{\sigma},$ $\mbox{(L4):} \quad \mathbf{C} \leftarrow \left(1-\frac{1}{\tau_{\mathrm{c}}}\right)\mathbf{C} + \frac{1}{\tau_{\mathrm{c}}} \mathbf{s} \mathbf{s}^T,$ $\mbox{(L5):} \quad \mathbf{s}_\sigma \leftarrow \left(1-\frac{1}{\tau_\sigma}\right) \mathbf{s}_\sigma + \sqrt{\frac{\mu}{\tau_\sigma} \left(2-\frac{1}{\tau_\sigma}\right)} \, \langle \mathbf{N}(\mathbf{0}, \mathbf{1}) \rangle ,$ $\mbox{(L6):} \quad \sigma \leftarrow \sigma\exp\left[ \frac{\| \mathbf{s}_{\sigma} \|^2 - n} {2 n \sqrt{n} } \right].$

In (L1), $$\lambda$$ offspring $$\mathbf{y}_l$$ are generated by transforming standard normally distributed random vectors using a transformation matrix $$\sqrt{\mathbf{C}}$$ to be obtained, e.g., by Cholesky decomposition of the covariance matrix $$\mathbf{C}$$, i.e., $$\sqrt{\mathbf{C}} \sqrt{\mathbf{C}}^T = \mathbf{C}$$, and the global step size factor $$\sigma$$. In (L2) the best $$\mu$$ mutations are recombined according to Equation (II), thus, forming the recombinant $$\mathbf{y}$$ (center of mass individual) for the next generation. Algorithmically, there is only one recombinant (to be used as the final solution when the algorithm terminates). Vector $$\langle \mathbf{w} \rangle$$ connects the recombinants of two consecutive generations, thus, $$\langle \mathbf{w} \rangle/\sigma$$ represents the tendency of evolution in the search space. This information is cumulated in the $$\mathbf{s}$$ vector ($$\mathbf{s} = \mathbf{0}$$, initially chosen) exponentially decaying with time constant $$\tau=\sqrt{n}$$. The direction vector $$\mathbf{s}$$ is used to update (rank one update) the covariance matrix $$\mathbf{C}$$ ($$\mathbf{C} = \mathbf{1}$$, initially chosen) with time constant $$\tau_{\mathrm{c}} \propto n^2$$. The remaining (L5) and (L6) are used to control the global step size $$\sigma$$ using the cumulated step size adaptation (CSA) technique with time constant $$\tau_\sigma = \sqrt{n}$$ ($$\mathbf{s}_\sigma = \mathbf{0}$$, initially chosen). The recombinant $$\langle \mathbf{N}(\mathbf{0}, \mathbf{1}) \rangle$$ is calculated using Equation (II).

CMA-ES for large population sizes differ basically in the way how the covariance matrix $$\mathbf{C}$$ is updated in (L4). To this end, (L4) is extended by a rank $$\mu$$ update proportional to $\frac{1}{\mu \sigma^2} \sum_{m=1}^\mu \mathbf{w}_{m;\lambda} (\mathbf{w}_{m;\lambda})^T.$ In order to take full advantage of this update, the time constants $$\tau$$, $$\tau_{\mathrm{c}}$$, and $$\tau_\sigma$$ must be chosen accordingly (see Hansen et al. 2003).

## References

• Beyer, H.-G. and Schwefel, H.-P. (2002). Evolution Strategies: A Comprehensive Introduction. In Natural Computing, 1(1):3-52.
• Beyer, H.-G. (2001). The Theory of Evolution Strategies. Natural Computing Series. Springer, Berlin 2001.
• Hansen, N. and Ostermeier, A. (2001). Completely Derandomized Self-Adaptation in Evolution Strategies. In Evolutionary Computation, 9(1):159-195.
• Hansen, N. and Müller, S.D. and Koumoutsakos, P. (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). In Evolutionary Computation, 11(1):1-18.
• Rechenberg, I. (1994). Evolutionsstrategie '94. Frommann-Holzboog Verlag, Stuttgart, (in German).
• Schwefel, H.-P. (1995). Evolution and Optimum Seeking. Wiley, New York, NY.