# Genetic algorithms

Post-publication activity

Curator: John H. Holland

Genetic algorithms are based on the classic view of a chromosome as a string of genes. R.A. Fisher used this view to found mathematical genetics, providing mathematical formula specifying the rate at which particular genes would spread through a population (Fisher, 1958). Key elements of Fisher’s formulation are:

• a specified set of alternatives (alleles) for each gene, thereby specifying the allowable strings of genes (the possible chromosomes),
• a generation-by-generation view of evolution where, at each stage, a population of individuals produces a set of offspring that constitutes the next generation,
• a fitness function that assigns to each string of alleles the number of offspring the individual carrying that chromosome will contribute to the next generation, and
• a set of genetic operators, particularly mutation in Fisher’s formulation, that modify the offspring of an individual so that the next generation differs from the current generation.

Natural selection, in this formulation, can be thought of as a procedure for searching through the set of possible individuals, the search space, to find individuals of progressively higher fitness. Even when a natural population consists of a single species – say the current generation of humans – there is considerable variation within that population. These variants constitute samples of the search space.

## Definition

A genetic algorithm (GA) is a generalized, computer-executable version of Fisher’s formulation (Holland J, 1995). The generalizations consist of:

• concern with interaction of genes on a chromosome, rather than assuming alleles act independently of each other, and
• enlargement of the set of genetic operators to include other well-known genetic operators such as crossing-over (recombination) and inversion.

Under the first generalization, the fitness function becomes a complex nonlinear function that usually cannot be usefully approximated by summing up the effects of individual genes. The second generalization puts emphasis on genetic mechanisms, such as crossover, that operate regularly on chromosomes. Crossover takes place in every mating whereas mutation of a given gene typically occurs in less than 1 in a million individuals. Crossover is the central reason that mammals produce offspring exhibiting a mix of their parents’ characteristics – consider the human population for a vivid, familiar example – and crossover makes possible the artificial cross-breeding of selected animals and plants to produce superior varieties. Clearly crossover’s high frequency gives it an important role in evolution, and it has an essential role in the operation of GAs, as outlined below.

Crossover in its simplest form (one-point crossover) is easily defined: Two chromosomes are lined up, a point along the chromosome is randomly selected, then the pieces to the left of that point are exchanged between the chromosomes, producing a pair of offspring chromosomes.

This simple version of crossover is a good approximation to what actually takes places in mating organisms. Under one-point crossover, alleles that are close together on the chromosome are likely to be passed as a unit to one of the offspring, while alleles that are far apart are likely to be separated by crossover with one allele appearing in one offspring and the other allele appearing in the other offspring. For example, given a chromosome with 1001 genes, there is only one chance in a thousand that an adjacent pair of alleles will be separated in the offspring, while alleles at opposite ends of the string will always be separated. In standard genetic terminology, this phenomenon is called linkage. By observing the frequency of separation of allele-determined characteristics in successive generations, linkage can be determined experimentally. Indeed, assuming one-point crossover made gene sequencing possible long before we had any knowledge of DNA – in 1944, using experimentally determined linkage, the Carnegie Institution of Washington published a large volume recording the order of genes in the fruit fly (Lindsley & Grell, 1944).

The genetic algorithm, following Fisher’s formulation, uses the differing fitness of variants in the current generation to propagate improvements to the next generation, but the GA places strong emphasis on the variants produced by crossover. The basic GA subroutine, which produces the next generation from the current one, executes the following steps (where, for simplicity, it is assumed that each individual is specified by a single chromosome):

1. Start with a population of $$N$$ individual strings of alleles (perhaps generated at random).
2. Select two individuals at random from the current population, biasing selection toward individuals with higher fitness.
3. Use crossover (with occasional use of mutation) to produce two new individuals which are assigned to the next generation.
4. Repeat steps (1) and (2) $$N/2$$ times to produce a new generation of N individuals.

Of course, there are many ways to modify these steps, but most characteristics of a GA are already exhibited by this basic program.

## Crossover

Crossover introduces a considerable complication in the study of successive generations: Highly effective individuals in the parent generation (say, an “Einstein”), will not reappear in the next generation, because only a subset of the parent’s alleles is passed on to any given offspring. This raises an important question: What is passed from generation to generation if the parents’ specific chromosomes are never passed on? An answer to this question requires a prediction of the generation-by-generation spread of clusters of alleles, requiring a substantial generalization of Fisher’s fundamental theorem. The schema theorem is one such generalization; it deals with arbitrary allele clusters call schemas. A schema is specified using the symbol * (‘don’t care’) to specify places along the chromosome not belonging to the cluster. For example, if there are two distinct alleles for each position, call them 1 and 0, then the cluster consisting of allele 1 at position 2, allele 0 at position 4, and allele 0 at position 5, is designated by the string *1*00**...*. Let N(s,t) be the number of chromosomes carrying schema s in generation t. The schema theorem specifies the (expected) number $$N(s,t+1)$$ of chromosomes carrying schema $$s$$ in the next generation. A simplified version has the following form$N(s,t+1) = u(s,t)[1-e]N(s,t)$

where $$u(s,t)$$ is the average fitness of the chromosomes carrying schema $$s$$ at time $$t$$ (the observed average fitness), and $$e$$ is the overall probability (usually quite small) that the cluster s will be destroyed (or created) by mutation or crossover. (Note that $$e$$ does become large if the alleles in the cluster are spread over a considerable part of the chromosome). This formula for $$N(s,t+1)$$ can be restated in terms of probabilities (proportions) $$P(s,t)\ ,$$ a more typical form for mathematical genetics, by noting that $$P(s,t) = N(s,t)/N(t)\ ,$$ were $$N(t)$$ is the size of the population at time $$t\ .$$

The schema theorem shows that every cluster of alleles present in the population increases or decreases its proportion in the population at a rate determined by the observed average fitness of its carriers. In particular, schemas consistently associated with strings of above-average fitness spread rapidly through the population. As a result, crossover regularly tries out new combinations of these above-average schemas; they become building blocks for further attempts at solution. Though a GA only directly processes $$N(t)$$ strings in a generation, it effectively processes the much larger number of schemata carried in the population. For example, the number of schemas on a single string of length 40 is $$2 * exp(40) \sim 16,000\ ,$$ so a GA processing a few dozen strings of length 40 effectively processes thousands of schemas. For problems in which schemas capture regularities in the search space, this is a tremendous speedup.

## Uses

Genetic algorithms are routinely used to find good solutions for problems that do not yield to standard techniques such as gradient ascent (“hill-climbing”) or additive approximations (“the whole equals the sum of the parts”) (Mitchell, 2009). Some typical problems on which the GA has been used are control of pipelines, jet engine design, scheduling, protein folding, machine learning, modeling language acquisition and evolution, and modeling complex adaptive systems (such as markets and ecosystems). To use the GA, the search space must be represented as strings over some fixed alphabet, much as biological chromosomes are represented by strings over 4 nucleotides. The strings can represent anything from biological organisms, or rules for processing signals, to agents in a complex adaptive system. The GA is initialized with a population of these strings, which may be simply selected at random from the search space, or the initial population may be “salted” with strings picked out using prior knowledge of the problem. The GA then processes the strings, and successive generations uncover schemas of above-average observed fitness. When above-average, well-linked schemas are regularly associated with improvements, the GA rapidly exploits those improvements.