Global Optimization Algorithms
– Theory and Application –
2nd Ed.
Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Example Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Sigoa – Implementation in Java . . . . . . . . . . . . . . . . . . . . . . . 439
Background (Mathematics, Computer Science, . . . ) . . . 455
Thomas Weise
Version: 2009-06-26
Newest Version: http://www.it-weise.de/


Preface
This e-book is devoted to global optimization algorithms, which are methods for finding optimal
solutions to given problems. It focuses especially on Evolutionary Computation, discussing
evolutionary algorithms, genetic algorithms, Genetic Programming, Learning Classifier Systems,
Evolution Strategy, Differential Evolution, Particle Swarm Optimization, and Ant Colony
Optimization. It also elaborates on other metaheuristics like Simulated Annealing, Extremal
Optimization, Tabu Search, and Random Optimization. This is not a book in the conventional
sense: Because of frequent updates and changes, it is not really intended for sequential reading
but rather as a sort of material collection, encyclopedia, or reference work where you can look
things up, find the correct context, and are provided with fundamentals.
With this book, two major audience groups are addressed:
1. It can help students, since we try to describe the algorithms in an understandable, consistent
way and, maybe even more importantly, include much of the background knowledge needed to
understand them. Thus, you can find summaries on stochastic theory and theoretical computer
science in Part IV on page 455. Additionally, application examples are provided which give an
idea of how problems can be tackled with the different techniques and what results can be
expected.
2. Fellow researchers and PhD students may find the application examples helpful too. For
them, in-depth discussions of the individual methodologies are included, supported by a
large set of useful literature references.
If this book contains something you want to cite or reference in your work, please use the
citation suggestion provided in Chapter D on page 591.
In order to maximize the utility of this electronic book, it contains automatic, clickable links.
They are shaded with dark gray so the book is still b/w printable. You can click on
1. entries in the table of contents,
2. citation references like [916],
3. page references like “95”,
4. references such as “see Figure 2.1 on page 96” to sections, figures, tables, and listings,
and
5. URLs and links like “http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-10-25]”.1
The following scenario is now possible, for example: A student reads the text and finds a
passage that she wants to investigate in depth. She clicks on a citation that seems interesting,
and the corresponding reference is shown. For some of the references which are available online,
links are provided in the reference text. When she clicks on such a link, the Adobe Reader® 2
will open another window and load the corresponding document (or a browser window of a site
that links to the document). After reading it, the student may use the “backwards” button in
the navigation utility to go back to the text initially read in the e-book.
1 URLs are usually annotated with the date we have accessed them, like
http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-10-25]. We can neither guarantee that their
content remains unchanged, nor that these sites stay available. We also assume no responsibility
for anything we linked to.
The contents of this book are divided into four parts. In the first part, different optimization
technologies are introduced and their features are described. Often, small examples are given
in order to ease understanding. In the second part, starting at page 315, we elaborate on
different application examples in detail. In Part III on page 439, the Sigoa framework, one
possible implementation of optimization algorithms in Java, is discussed, and we show how some
of the solutions to the previous problem instances can be realized. Finally, in the last part,
starting at page 455, the background knowledge for the rest of the book is provided. Optimization
is closely related to stochastics, and hence an introduction to this subject can be found
there. Other important background information concerns theoretical computer science and
clustering algorithms.
However, this book is still a work in progress. It is in a very preliminary phase where
major parts are still missing or under construction. Other sections or texts are incomplete
(tagged with TODO). There may as well be errors in the contents or issues may be stated
ambiguously (I do not have proof-readers). Additionally, the sequence of the content is not
very good. Because of frequent updates, small sections may grow and become chapters, be
moved to another place, merged with other sections, and so on. Thus, this book will change
often. I choose to update, correct, and improve this book continuously instead of providing
a new version every half year or so because I think the book is more useful this way:
it provides more information earlier. By doing so, I also risk confusing you with strange
grammar and structure, so if you find something fishy, please let me know so I can correct
and improve it right away.
The updates and improvements will result in new versions of the book, which will regularly
appear on the website http://www.it-weise.de/. The direct download link to the newest
version of this book is http://www.it-weise.de/projects/book.pdf. The LaTeX source
code of this book, including all graphics and the bibliography, is available at
http://www.it-weise.de/projects/bookSource.zip. The source may not always correspond to the
most current version of the book. Compiling it requires multiple runs of BibTeX because of
the nifty way the references are incorporated.
I would be very happy if you provide feedback, report errors or missing things that you have
found, criticize something, or have any additional ideas or suggestions. Do not hesitate to
contact me via my email address tweise@gmx.de.
As a matter of fact, a large number of people helped me to improve this book over time. I
have enumerated the most important contributors in Chapter C – Thank you guys, I really
appreciate your help!
Copyright © 2006-2009 Thomas Weise.
Permission is granted to copy, distribute and/or modify this document under the terms
of the GNU Free Documentation License, Version 1.2 or any later version published by
the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no
Back-Cover Texts. A copy of the license is included in the section entitled GNU Free
Documentation License (FDL). You can find a copy of the GNU Free Documentation
License in appendix Chapter A on page 575.
2 The Adobe Reader® is available for download at http://www.adobe.com/products/reader/
[accessed 2007-08-13].

At many places in this book we refer to Wikipedia [2219], which is a great source of
knowledge. Wikipedia [2219] contains articles and definitions for many of the aspects discussed
in this book. Like this book, it is updated and improved frequently. Therefore, including the
links adds greatly to the book’s utility, in my opinion.
Important Notice
Be aware that this version of this book marks a point of transition from the first edition to
the second one. Major portions of the text of the first edition have not yet been revised and
are, thus, not included in this document. However, I believe that this version corrects many
shortcomings and inconsistencies of the first edition and is better structured.


Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Part I Global Optimization
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.1 A Classification of Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.1.1 Classification According to Method of Operation . . . . . . . . . . . . . . . . . . 22
1.1.2 Classification According to Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2 What is an optimum? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2.1 Single Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.2.2 Multiple Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.2.3 Constraint Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.2.4 Unifying Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.3 The Structure of Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.3.1 Spaces, Sets, and Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
1.3.2 Fitness Landscapes and Global Optimization . . . . . . . . . . . . . . . . . . . . . 47
1.3.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.3.4 Other General Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.4 Problems in Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
1.4.2 Premature Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.4.3 Ruggedness and Weak Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.4.4 Deceptiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
1.4.5 Neutrality and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.4.6 Epistasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
1.4.7 Noise and Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.4.8 Overfitting and Oversimplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.4.9 Dynamically Changing Fitness Landscape . . . . . . . . . . . . . . . . . . . . . . . . 76
1.4.10 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.4.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
1.5 Formae and Search Space/Operator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
1.5.1 Forma Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
1.5.2 Genome Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
1.6 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
1.6.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
1.6.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
1.6.3 Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.6.4 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.6.5 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.1.1 The Basic Principles from Nature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.1.2 The Basic Cycle of Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . 96
2.1.3 The Basic Evolutionary Algorithm Scheme . . . . . . . . . . . . . . . . . . . . . . . 98
2.1.4 From the Viewpoint of Formae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.1.5 Does the natural Paragon Fit? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.1.6 Classification of Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.1.7 Configuration Parameters of evolutionary algorithms . . . . . . . . . . . . . . . 104
2.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.2.3 Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
2.2.4 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.2.5 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2.3 Fitness Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.3.2 Weighted Sum Fitness Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
2.3.3 Pareto Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
2.3.4 Sharing Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.3.5 Variety Preserving Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.3.6 Tournament Fitness Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.4 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.4.2 Truncation Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.4.3 Fitness Proportionate Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
2.4.4 Tournament Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
2.4.5 Ordered Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.4.6 Ranking Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.4.7 VEGA Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.4.8 Clearing and Simple Convergence Prevention (SCP) . . . . . . . . . . . . . . . 134
2.5 Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
2.5.1 NCGA Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.6 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
2.6.1 VEGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.2.3 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.2.4 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.3 Genomes in Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.4 Fixed-Length String Chromosomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.4.1 Creation: Nullary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.4.2 Mutation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.4.3 Permutation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.4.4 Crossover: Binary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
3.5 Variable-Length String Chromosomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.5.1 Creation: Nullary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.5.2 Mutation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.5.3 Crossover: Binary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.6 Schema Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
3.6.1 Schemata and Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
3.6.2 Wildcards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.6.3 Holland’s Schema Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.6.4 Criticism of the Schema Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.6.5 The Building Block Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7 The Messy Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.7.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.7.2 Reproduction Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.7.3 Splice: Binary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.7.4 Overspecification and Underspecification . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.7.5 The Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.8 Genotype-Phenotype Mappings and Artificial Embryogeny . . . . . . . . . . . . . . . . 155
4 Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.2.3 Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.2.4 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.2.5 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.3 (Standard) Tree Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.3.1 Creation: Nullary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.3.2 Mutation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.3.3 Recombination: Binary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.3.4 Permutation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3.5 Editing: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3.6 Encapsulation: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.3.7 Wrapping: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.3.8 Lifting: Unary Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.9 Automatically Defined Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.10 Automatically Defined Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3.11 Node Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.4 Genotype-Phenotype Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.4.1 Cramer’s Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.4.2 Binary Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.4.3 Gene Expression Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.4.4 Edge Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.5 Grammars in Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.5.2 Trivial Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.5.3 Strongly Typed Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.5.4 Early Research in GGGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.5.5 Gads 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.5.6 Grammatical Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.5.7 Gads 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.5.8 Christiansen Grammar Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.5.9 Tree-Adjoining Grammar-guided Genetic Programming . . . . . . . . . . . . 187
4.6 Linear Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.6.2 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.6.3 The Compiling Genetic Programming System . . . . . . . . . . . . . . . . . . . . . 193
4.6.4 Automatic Induction of Machine Code by Genetic Programming . . . . 193
4.6.5 Java Bytecode Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.6.6 Brameier and Banzhaf: LGP with Implicit Intron removal . . . . . . . . . . 194
4.6.7 Homologous Crossover: Binary Reproduction . . . . . . . . . . . . . . . . . . . . . . 195
4.6.8 Page-based LGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.7 Graph-based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.7.1 Parallel Algorithm Discovery and Orchestration . . . . . . . . . . . . . . . . . . . 196
4.7.2 Parallel Distributed Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . 196
4.7.3 Genetic Network Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.7.4 Cartesian Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.8 Epistasis in Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.8.1 Forms of Epistasis in Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . 202
4.8.2 Algorithmic Chemistry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.8.3 Soft Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.8.4 Rule-based Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
4.9 Artificial Life and Artificial Chemistry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.9.1 Push, PushGP, and Pushpop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.9.2 Fraglets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
4.10 Problems Inherent in the Evolution of Algorithms . . . . . . . . . . . . . . . . . . . . . . . 219
4.10.1 Correctness of the Evolved Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.10.2 All-Or-Nothing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.10.3 Non-Functional Features of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5 Evolution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.2.3 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.3 Populations in Evolution Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.3.1 (1 + 1)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.3.2 (µ + 1)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.3.3 (µ + λ)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.3.4 (µ, λ)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.3.5 (µ/ρ, λ)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.3.6 (µ/ρ + λ)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.3.7 (µ′, λ′(µ, λ)γ)-ES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.4 One-Fifth Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.5 Differential Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.5.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
6 Evolutionary Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.2.3 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
7 Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.2.3 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.3 The Basic Idea of Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.3.1 A Small Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.3.2 Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.3.3 Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.3.4 Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.5 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.6 Non-Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.3.7 Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.3.8 The Bucket Brigade Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
7.3.9 Applying the Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4 Families of Learning Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
8 Ant Colony Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.2.2 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.2.3 Journals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.2.4 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
8.2.5 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
8.3 River Formation Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
9.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2.2 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
9.2.3 Conferences, Workshops, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
9.2.4 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
10 Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.3 Multi-Objective Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
10.4 Problems in Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
10.5 Hill Climbing with Random Restarts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.6 GRASP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.6.1 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.7 Raindrop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11 Random Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
11.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
11.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

12 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.2 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.3 Temperature Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.4 Multi-Objective Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
13 Extremal Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1.1 Self-Organized Criticality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1.2 The Bak-Sneppen Model of Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.2 Extremal Optimization and Generalized Extremal Optimization . . . . . . . . . . . 270
13.3 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
13.3.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
14 Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
14.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
14.2.2 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
14.3 Multi-Objective Tabu Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
15 Memetic and Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
15.1 Memetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
15.2 Lamarckian Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
15.3 Baldwin Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
15.4 Summary on Lamarckian and Baldwinian Evolution . . . . . . . . . . . . . . . . . . . . . 280
15.5 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.5.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.5.2 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.5.3 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
16 Downhill Simplex (Nelder and Mead) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.2.2 Online Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
16.2.3 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
16.3 The Downhill Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
16.4 Hybridizing with the Downhill Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
17 State Space Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
17.2 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
17.2.1 Areas Of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
17.2.2 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
17.3 Uninformed Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
17.3.1 Breadth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
17.3.2 Depth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
17.3.3 Depth-limited Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
17.3.4 Iterative Deepening Depth-First Search . . . . . . . . . . . . . . . . . . . . . . . . . . 294
17.3.5 Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
17.4 Informed Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

17.4.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
17.4.2 A⋆ search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
17.4.3 Adaptive Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
18 Parallelization and Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
18.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
18.2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
18.2.1 Client-Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
18.2.2 Island Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
18.2.3 Mixed Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
18.3 Cellular Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
19 Maintaining the Optimal Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
19.1 Updating the Optimal Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
19.2 Obtaining Optimal Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
19.3 Pruning the Optimal Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
19.3.1 Pruning via Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
19.3.2 Adaptive Grid Archiving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Part II Applications
20 Experimental Settings, Measures, and Evaluations . . . . . . . . . . . . . . . . . . . . . 315
20.1 Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
20.1.1 The Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
20.1.2 The Optimization Algorithm Applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
20.1.3 Other Run Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
20.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
20.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
20.3.1 Simple Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
20.3.2 Sophisticated Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
20.4 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
20.4.1 Confidence Intervals or Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 324
20.4.2 Factorial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
21 Benchmarks and Toy Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
21.1 Real Problem Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
21.1.1 Single-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
21.1.2 Multi-Objective Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
21.1.3 Dynamic Fitness Landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
21.2 Binary Problem Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
21.2.1 Kauffman’s NK Fitness Landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
21.2.2 The p-Spin Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
21.2.3 The ND Family of Fitness Landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
21.2.4 The Royal Road . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
21.2.5 OneMax and BinInt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
21.2.6 Long Path Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
21.2.7 Tunable Model for Problematic Phenomena . . . . . . . . . . . . . . . . . . . . . . . 341
21.3 Genetic Programming Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
21.3.1 Artificial Ant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
21.3.2 The Greatest Common Divisor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357

22 Contests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
22.1 DATA-MINING-CUP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
22.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
22.1.2 The 2007 Contest – Using Classifier Systems . . . . . . . . . . . . . . . . . . . . . . 374
22.2 The Web Service Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
22.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
22.2.2 The 2006/2007 Semantic Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
23 Real-World Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
23.1 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
23.1.1 Genetic Programming: Genome for Symbolic Regression . . . . . . . . . . . . 397
23.1.2 Sample Data, Quality, and Estimation Theory . . . . . . . . . . . . . . . . . . . . 398
23.1.3 An Example and the Phenomenon of Overfitting . . . . . . . . . . . . . . . . . . 399
23.1.4 Limits of Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
23.2 Global Optimization of Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
23.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
23.2.2 Synthesizing Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
23.2.3 Paper List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
23.2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
24 Research Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
24.1 Genetic Programming of Distributed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 413
24.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
24.1.2 Evolving Proactive Aggregation Protocols . . . . . . . . . . . . . . . . . . . . . . . . 414
Part III Sigoa – Implementation in Java
25 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
25.1 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
25.1.1 Multi-Objectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
25.1.2 Separation of Specification and Implementation . . . . . . . . . . . . . . . . . . . 440
25.1.3 Separation of Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
25.1.4 Support for Pluggable Simulations and Introspection . . . . . . . . . . . . . . . 441
25.1.5 Distribution utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
25.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
25.3 Subsystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
26 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
26.1 The 2007 DATA-MINING-CUP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
26.1.1 The Phenotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
26.1.2 The Genotype and the Embryogeny . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
26.1.3 The Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
26.1.4 The Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
26.1.5 The Evolution Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
Part IV Background

27 Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
27.1 Set Membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
27.2 Relations between Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
27.3 Special Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
27.4 Operations on Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
27.5 Tuples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
27.6 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
27.7 Binary Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
27.7.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
27.7.2 Order Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
27.7.3 Equivalence Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
28 Stochastic Theory and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
28.1 General Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
28.1.1 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
28.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
28.2.1 Probability as defined by Bernoulli (1713) . . . . . . . . . . . . . . . . . . . . . . . . . 467
28.2.2 The Limiting Frequency Theory of von Mises . . . . . . . . . . . . . . . . . . . . . 468
28.2.3 The Axioms of Kolmogorov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
28.2.4 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
28.2.5 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
28.2.6 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
28.2.7 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
28.2.8 Probability Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
28.3 Stochastic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
28.3.1 Count, Min, Max and Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
28.3.2 Expected Value and Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
28.3.3 Variance and Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
28.3.4 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
28.3.5 Skewness and Kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
28.3.6 Median, Quantiles, and Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
28.3.7 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
28.3.8 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
28.4 Some Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
28.4.1 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
28.4.2 Poisson Distribution π_λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
28.4.3 Binomial Distribution B(n, p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
28.5 Some Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
28.5.1 Continuous Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
28.5.2 Normal Distribution N(µ, σ²) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
28.5.3 Exponential Distribution exp(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
28.5.4 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
28.5.5 Student’s t-Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
28.6 Example – Throwing a Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
28.7 Estimation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
28.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
28.7.2 Likelihood and Maximum Likelihood Estimators . . . . . . . . . . . . . . . . . . 500
28.7.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
28.7.4 Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
28.8 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
28.8.1 Non-Parametric Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
28.9 Generating Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
28.9.1 Generating Pseudorandom Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
28.9.2 Random Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528

28.9.3 Converting Random Numbers to other Distributions . . . . . . . . . . . . . . . 528
28.10 List of Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
28.10.1 Gamma Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
28.10.2 Riemann Zeta Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
29 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
29.1 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
29.1.1 Distance Measures for Strings of Equal Length . . . . . . . . . . . . . . . . . . . . 537
29.1.2 Distance Measures for Real-Valued Vectors . . . . . . . . . . . . . . . . . . . . . . . 537
29.1.3 Elements Representing a Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
29.1.4 Distance Measures Between Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
29.2 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
29.2.1 Cluster Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
29.2.2 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
29.2.3 nth Nearest Neighbor Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
29.2.4 Linkage Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
29.2.5 Leader Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
30 Theoretical Computer Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
30.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
30.1.1 Algorithms and Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
30.1.2 Properties of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
30.1.3 Complexity of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
30.1.4 Randomized Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
30.2 Distributed Systems and Distributed Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 553
30.2.1 Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
30.2.2 Some Architectures of Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . 556
30.3 Grammars and Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
30.3.1 Syntax and Formal Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
30.3.2 Generative Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
30.3.3 Derivation Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
30.3.4 Backus-Naur Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
30.3.5 Extended Backus-Naur Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
30.3.6 Attribute Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
30.3.7 Extended Attribute Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
30.3.8 Adaptive Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
30.3.9 Christiansen Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
30.3.10 Tree-Adjoining Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
30.3.11 S-expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Appendices
A GNU Free Documentation License (FDL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
A.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
A.2 Applicability and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
A.3 Verbatim Copying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
A.4 Copying in Quantity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
A.5 Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
A.6 Combining Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
A.7 Collections of Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
A.8 Aggregation with Independent Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
A.9 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
A.10 Termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579

A.11 Future Revisions of this License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
B GNU Lesser General Public License (LGPL) . . . . . . . . . . . . . . . . . . . . . . . . . . 581
B.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
B.2 Terms and Conditions for Copying, Distribution and Modification . . . . . . . . . 582
B.3 No Warranty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
B.4 How to Apply These Terms to Your New Libraries . . . . . . . . . . . . . . . . . . . . . . . 586
C Credits and Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
D Citation Suggestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
J . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
W . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
List of Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819


Part I
Global Optimization


1 Introduction
One of the most fundamental principles in our world is the search for an optimal state.
It begins in the microcosm where atoms in physics try to form bonds1 in order to minimize
the energy of their electrons [1625]. When molecules form solid bodies during the process of
freezing, they try to assume energy-optimal crystal structures. These processes, of course,
are not driven by any higher intention but purely result from the laws of physics.
The same goes for the biological principle of survival of the fittest [1940] which, together
with the biological evolution [485], leads to better adaptation of the species to their environ-
ment. Here, a local optimum is a well-adapted species that dominates all other animals in
its surroundings. Homo sapiens have reached this level, sharing it with ants, bacteria, flies,
cockroaches, and all sorts of other creepy creatures.
For as long as humankind has existed, we have striven for perfection in many areas. We want
to reach a maximum degree of happiness with the least amount of effort. In our economy,
profit and sales must be maximized and costs should be as low as possible. Therefore,
optimization is one of the oldest sciences, and it even extends into daily life [1519].
If something is important, general, and abstract enough, there is always a mathematical
discipline dealing with it. Global optimization2 is the branch of applied mathematics and nu-
merical analysis that focuses on, well, optimization. The goal of global optimization is to find
the best possible elements x⋆ from a set X according to a set of criteria F = {f1,f2,..,fn}.
These criteria are expressed as mathematical functions3, the so-called objective functions.
Definition 1.1 (Objective Function). An objective function f : X → Y with Y ⊆ R is
a mathematical function which is subject to optimization.
The codomain Y of an objective function as well as its range must be a subset of the real
numbers (Y ⊆ R). The domain X of f is called problem space and can represent any type
of elements like numbers, lists, construction plans, and so on. It is chosen according to the
problem to be solved with the optimization process. Objective functions are not necessarily
mere mathematical expressions, but can be complex algorithms that, for example, involve
multiple simulations. Global optimization comprises all techniques that can be used to find
the best elements x⋆ in X with respect to such criteria f ∈ F.
In the remaining text of this introduction, we will first provide a rough classification
of the different optimization techniques which we will investigate in the further course of
this book (Section 1.1). In Section 1.2, we will outline how these best elements which we
are after can be defined. We will use Section 1.3 to shed some more light onto the meaning
and inter-relation of the symbols already mentioned (f , F , x, x⋆, X, Y , . . . ) and outline
1 http://en.wikipedia.org/wiki/Chemical_bond [accessed 2007-07-12]
2 http://en.wikipedia.org/wiki/Global_optimization [accessed 2007-07-03]
3 The concept of mathematical functions is outlined in set theory in Definition 27.27 on page 462.

the general structure of optimization processes. If optimization was a simple thing to do,
there wouldn’t be a whole branch of mathematics with lots of cunning people dealing with
it. In Section 1.4 we will introduce the major problems that can be encountered during
optimization. We will discuss Formae as a general way of describing properties of possible
solutions in Section 1.5. In this book, we will provide additional hints that point to useful
literature, web links, conferences, and so on for all algorithms which we discuss. The first
of these information records, dealing with global optimization in general, can be found in
Section 1.6.
In the chapters to follow these introductory sections, different approaches to optimization
are discussed, examples for the applications are given, and the mathematical foundation and
background information is provided.
1.1 A Classification of Optimization Algorithms
In this book, we will only be able to discuss a small fraction of the wide variety of global
optimization techniques [1614]. Before digging any deeper into the matter, I will attempt to
provide a classification of these algorithms as overview and discuss some basic use cases.
1.1.1 Classification According to Method of Operation
Figure 1.1 sketches a rough taxonomy of global optimization methods. Generally, optimization
algorithms can be divided into two basic classes: deterministic and probabilistic algorithms.
Deterministic algorithms (see also Definition 30.11 on page 550) are most often used
if a clear relation between the characteristics of the possible solutions and their utility for a
given problem exists. Then, the search space can be explored efficiently using, for example, a
divide and conquer scheme4. If the relation between a solution candidate and its “fitness”
is not so obvious or too complicated, or the dimensionality of the search space is very high,
it becomes harder to solve a problem deterministically. Trying it would possibly result in
exhaustive enumeration of the search space, which is not feasible even for relatively small
problems.
Then, probabilistic algorithms5 come into play. The initial work in this area, which has now
become one of the most important research fields in optimization, was started about 55 years
ago (see [1743, 750, 219], and [287]). An especially relevant family of probabilistic algorithms
are the Monte Carlo6-based approaches. They trade in guaranteed correctness of the solution
for a shorter runtime. This does not mean that the results obtained using them are incorrect
– they may just not be the global optima. On the other hand, a solution a little bit inferior
to the best possible one is better than one which needs $10^{100}$ years to be found...
Heuristics used in global optimization are functions that help decide which one of a set
of possible solutions is to be examined next. On one hand, deterministic algorithms usually
employ heuristics in order to define the processing order of the solution candidates. An
example for such a strategy is informed search, as discussed in Section 17.4 on page 295.
Probabilistic methods, on the other hand, may only consider those elements of the search
space in further computations that have been selected by the heuristic.
Definition 1.2 (Heuristic). A heuristic7 [1407, 1711, 1626] is a part of an optimization
algorithm that uses the information currently gathered by the algorithm to help to decide
which solution candidate should be tested next or how the next individual can be produced.
Heuristics are usually problem class dependent.
4 http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm [accessed 2007-07-09]
5 The common properties of probabilistic algorithms are specified in Definition 30.18 on page 552.
6 See Definition 30.20 on page 552 for an in-depth discussion of the Monte Carlo-type probabilistic
algorithms.
7 http://en.wikipedia.org/wiki/Heuristic_%28computer_science%29 [accessed 2007-07-03]

[Figure 1.1, tree diagram; node labels recovered from the extraction. Deterministic: State Space Search, Branch and Bound, Algebraic Geometry. Probabilistic: Monte Carlo Algorithms, Artificial Intelligence (AI), Soft Computing, Computational Intelligence (CI), (Stochastic) Hill Climbing, Random Optimization, Simulated Annealing (SA), Tabu Search (TS), Parallel Tempering, Stochastic Tunneling, Direct Monte Carlo Sampling, Evolutionary Computation (EC), Memetic Algorithms, Harmonic Search (HS), Evolutionary Algorithms (EA), Genetic Algorithms (GA), Learning Classifier System (LCS), Evolutionary Programming, Evolution Strategy (ES), Differential Evolution (DE), Genetic Programming (GP), Standard Genetic Programming, Linear Genetic Programming, Grammar Guided Genetic Prog., Swarm Intelligence (SI), Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO).]
Figure 1.1: The taxonomy of global optimization algorithms.
Definition 1.3 (Metaheuristic). A metaheuristic8 is a method for solving very general
classes of problems. It combines objective functions or heuristics in an abstract and hopefully
efficient way, usually without utilizing deeper insight into their structure, i. e., by treating
them as black-box-procedures [813, 832, 233].
This combination is often performed stochastically by utilizing statistics obtained from
samples from the search space or based on a model of some natural phenomenon or physical
process. Simulated annealing, for example, decides which solution candidate is evaluated
next according to the Boltzmann probability factor of atom configurations of solidifying
metal melts. Evolutionary algorithms copy the behavior of natural evolution and
treat solution candidates as individuals that compete in a virtual environment. Unified
8 http://en.wikipedia.org/wiki/Metaheuristic [accessed 2007-07-03]

models of metaheuristic optimization procedures have been proposed by Vaessens et al.
[2087, 2088], Rayward-Smith [1710], Osman [1588], and Taillard et al. [1996].
An important class of probabilistic Monte Carlo metaheuristics is Evolutionary Computation9.
It encompasses all algorithms that are based on a set of multiple solution candidates
(called population) which are iteratively refined. This field of optimization is also a class
of Soft Computing10 as well as a part of the artificial intelligence11 area. Some of its most
important members are evolutionary algorithms and Swarm Intelligence, which will be discussed
in-depth in this book. Besides these nature-inspired and evolutionary approaches,
there also exist methods that copy physical processes like the aforementioned Simulated
Annealing, Parallel Tempering, and Raindrop Method, as well as techniques without a direct
real-world role model like Tabu Search and Random Optimization. As a preview of what can
be found in this book, we have marked the techniques that will be discussed with a thicker
border in Figure 1.1.
1.1.2 Classification According to Properties
The taxonomy just introduced classifies the optimization methods according to their algo-
rithmic structure and underlying principles, in other words, from the viewpoint of theory. A
software engineer or a user who wants to solve a problem with such an approach is however
more interested in its “interfacing features” such as speed and precision.
Speed and precision are conflicting objectives, at least in terms of probabilistic algo-
rithms. A general rule of thumb is that you can gain improvements in accuracy of opti-
mization only by investing more time. Scientists in the area of global optimization try to
push this Pareto frontier12 further by inventing new approaches and enhancing or tweaking
existing ones.
Optimization Speed
When it comes to time constraints and hence, the required speed of the optimization algo-
rithm, we can distinguish two main types of optimization use cases.
Definition 1.4 (Online Optimization). Online optimization problems are tasks that need
to be solved quickly, in a time span between ten milliseconds and a few minutes. In order to
find a solution in this short time, optimality is normally traded in for speed gains.
Examples for online optimization are robot localization, load balancing, service composition
for business processes (see for example Section 22.2.1 on page 384), or updating
a factory’s machine job schedule after new orders came in. From the examples, it becomes
clear that online optimization tasks are often carried out repetitively – new orders will, for
instance, continuously arrive in a production facility and need to be scheduled to machines
in a way that minimizes the waiting time of all jobs.
Definition 1.5 (Offline Optimization). In offline optimization problems, time is not so
important and a user is willing to wait maybe even days if she can get an optimal or close-
to-optimal result.
Examples of such problems are design optimization, data mining (see for instance
Section 22.1 on page 373), or creating long-term schedules for transportation crews.
These optimization processes will usually be carried out only once in a long time.
Before doing anything else, one must determine to which of these two classes the
problem to be solved belongs.
9 http://en.wikipedia.org/wiki/Evolutionary_computation [accessed 2007-09-17]
10 http://en.wikipedia.org/wiki/Soft_computing [accessed 2007-09-17]
11 http://en.wikipedia.org/wiki/Artificial_intelligence [accessed 2007-09-17]
12 Pareto frontiers will be discussed in Section 1.2.2 on page 31.

Number of Criteria
Optimization algorithms can be divided into those which try to find the best values of single
objective functions f and those that optimize sets F of target functions. This distinction
between single-objective optimization and multi-objective optimization is discussed in depth
in Section 1.2.2.
1.2 What is an optimum?
We have already said that global optimization is about finding the best possible solutions
for given problems. Thus, it cannot be a bad idea to start out by discussing what it is that
makes a solution optimal 13.
1.2.1 Single Objective Functions
In the case of optimizing a single criterion f , an optimum is either its maximum or minimum,
depending on what we are looking for. If we own a manufacturing plant and have to assign
incoming orders to machines, we will do this in a way that minimizes the time needed
to complete them. On the other hand, we will arrange the purchase of raw material, the
employment of staff, and the placing of commercials in a way that maximizes our profit. In
global optimization, it is a convention that optimization problems are most often defined
as minimizations and if a criterion f is subject to maximization, we simply minimize its
negation (−f).
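The convention of turning a maximization into a minimization can be sketched in a few lines. This is only an illustration over a finite list of candidates; the function `profit` and the candidate values are hypothetical and not taken from the book.

```python
def minimize(f, candidates):
    """Return the candidate with the smallest objective value f(x)."""
    return min(candidates, key=f)


def maximize(f, candidates):
    """Maximize f by minimizing its negation -f."""
    return minimize(lambda x: -f(x), candidates)


def profit(x):
    # Hypothetical profit curve with its peak at x = 3 (assumption for illustration).
    return -(x - 3) ** 2 + 10


candidates = [0, 1, 2, 3, 4, 5]
best = maximize(profit, candidates)
worst = minimize(profit, candidates)
```

Only a minimizer has to be implemented; every maximization problem is routed through it by flipping the sign of the objective.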
Figure 1.2 illustrates such a function f defined over a two-dimensional space X =
(X1, X2). As outlined in this graphic, we distinguish between local and global optima. A
global optimum is an optimum of the whole domain X while a local optimum is an optimum
of only a subset of X.
Definition 1.6 (Local Maximum). A (local) maximum $\hat{x}_l \in X$ of one (objective) function
$f : X \to \mathbb{R}$ is an input element with $f(\hat{x}_l) \geq f(x)$ for all $x$ neighboring $\hat{x}_l$.
If $X \subseteq \mathbb{R}^n$, we can write:
$$\forall \hat{x}_l \; \exists \varepsilon > 0 : f(\hat{x}_l) \geq f(x) \quad \forall x \in X, \; |x - \hat{x}_l| < \varepsilon \tag{1.1}$$
Definition 1.7 (Local Minimum). A (local) minimum $\check{x}_l \in X$ of one (objective) function
$f : X \to \mathbb{R}$ is an input element with $f(\check{x}_l) \leq f(x)$ for all $x$ neighboring $\check{x}_l$.
If $X \subseteq \mathbb{R}^n$, we can write:
$$\forall \check{x}_l \; \exists \varepsilon > 0 : f(\check{x}_l) \leq f(x) \quad \forall x \in X, \; |x - \check{x}_l| < \varepsilon \tag{1.2}$$
Definition 1.8 (Local Optimum). A (local) optimum $x^\star_l \in X$ of one (objective) function
$f : X \to \mathbb{R}$ is either a local maximum or a local minimum.
Definition 1.9 (Global Maximum). A global maximum $\hat{x} \in X$ of one (objective) function
$f : X \to \mathbb{R}$ is an input element with $f(\hat{x}) \geq f(x) \; \forall x \in X$.
Definition 1.10 (Global Minimum). A global minimum $\check{x} \in X$ of one (objective) function
$f : X \to \mathbb{R}$ is an input element with $f(\check{x}) \leq f(x) \; \forall x \in X$.
13 http://en.wikipedia.org/wiki/Maxima_and_minima [accessed 2007-07-03]

Figure 1.2: Global and local optima of a two-dimensional function. [Surface plot of f over X = (X1, X2); labeled points: two local maxima, a global maximum, a local minimum, and the global minimum.]
Definition 1.11 (Global Optimum). A global optimum x⋆ ∈ X of one (objective) func-
tion f : X → R is either a global maximum or a global minimum.
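The difference between local and global minima can be illustrated on a sampled one-dimensional function. The sketch below is a simplification: it assumes that the "neighbors" of a sample are just the adjacent samples in a list, and the sample values are made up for illustration.

```python
def local_minima(values):
    """Indices whose value is <= the values of all existing direct neighbors."""
    result = []
    for i, v in enumerate(values):
        left_ok = i == 0 or v <= values[i - 1]
        right_ok = i == len(values) - 1 or v <= values[i + 1]
        if left_ok and right_ok:
            result.append(i)
    return result


def global_minima(values):
    """Indices whose value is <= every other value in the list."""
    lowest = min(values)
    return [i for i, v in enumerate(values) if v == lowest]


# Hypothetical samples f(x_0), ..., f(x_7) of some objective function.
samples = [4, 2, 3, 5, 1, 6, 1, 7]
local_idx = local_minima(samples)
global_idx = global_minima(samples)
```

In this example every global minimum is also a local one, but index 1 is only a local minimum: it beats its neighbors without being the lowest value overall.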
Even a one-dimensional function $f : X = \mathbb{R} \to \mathbb{R}$ may have more than one global
maximum, multiple global minima, or even both in its domain $X$. Take the cosine function
for example: It has global maxima $\hat{x}_i$ at $\hat{x}_i = 2i\pi$ and global minima $\check{x}_i$ at
$\check{x}_i = (2i + 1)\pi$ for all $i \in \mathbb{Z}$. The correct solution of such an optimization
problem would then be a set $X^\star$ of all optimal inputs in $X$ rather than a single maximum
or minimum. Furthermore, the exact meaning of optimal is problem dependent. In
single-objective optimization, it either means minimum or maximum. In multi-objective
optimization, there exists a variety of approaches to define optima which we will discuss
in-depth in Section 1.2.2.
Definition 1.12 (Optimal Set). The optimal set X⋆ is the set that contains all optimal
elements.
There are normally multiple, often even infinitely many, optimal solutions. Since the memory
of our computers is limited, we can find only a finite (sub-)set of them. We thus distinguish
between the global optimal set X⋆ and the set X⋆ of (seemingly optimal) elements
which an optimizer returns. The tasks of global optimization algorithms are
1. to find solutions that are as good as possible and
2. that are also widely different from each other [534].
The second goal becomes obvious if we assume that we have an objective function
$f : \mathbb{R} \to \mathbb{R}$ which is optimal for all $x \in [0, 10] \Leftrightarrow x \in X^\star$. This interval
contains uncountably many solutions, and an optimization algorithm may yield
$X^\star_1 = \{0, 0.1, 0.11, 0.05, 0.01\}$ or $X^\star_2 = \{0, 2.5, 5, 7.5, 10\}$ as result. Both sets only
represent a small subset of the possible solutions. The second result ($X^\star_2$), however, gives
us a broader view on the optimal set. Even good optimization algorithms do not necessarily
find the real global optima but may only be able to approximate them. In other words,
$X^\star_3 = \{-0.3, 5, 7.5, 11\}$ is also a possible result of the optimization process, although
it contains two sub-optimal elements.
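The three result sets above can be compared with a small sketch. It assumes, as in the text, that exactly the elements of [0, 10] are optimal. The `spread` measure (width of the covered region) is just one hypothetical way to quantify how "widely different" the returned elements are.

```python
def is_optimal(x):
    # Assumption from the text: f is optimal exactly on the interval [0, 10].
    return 0 <= x <= 10


def spread(result):
    """Width of the region covered by the returned elements."""
    return max(result) - min(result)


x1 = [0, 0.1, 0.11, 0.05, 0.01]
x2 = [0, 2.5, 5, 7.5, 10]
x3 = [-0.3, 5, 7.5, 11]

# Elements of the third result that lie outside the optimal interval.
suboptimal = [x for x in x3 if not is_optimal(x)]
spread_1 = spread(x1)
spread_2 = spread(x2)
```

Both x1 and x2 contain only optimal elements, but x2 covers the whole interval while x1 is clustered near 0. For x3, the two elements outside [0, 10] are reported as sub-optimal.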
In Chapter 19 on page 307, we will introduce different algorithms and approaches that
can be used to maintain an optimal set or to select the optimal elements from a given set
during an optimization process.

1.2.2 Multiple Objective Functions
Global optimization techniques are not just used for finding the maxima or minima of single
functions f . In many real-world design or decision making problems, they are rather applied
to sets F consisting of n = |F| objective functions fi, each representing one criterion to be
optimized [537, 360, 716].
$$F = \{f_i : X \to Y_i : 0 < i \leq n, \; Y_i \subseteq \mathbb{R}\} \tag{1.3}$$
Algorithms designed to optimize such sets of objective functions are usually named with
the prefix multi-objective, like multi-objective evolutionary algorithms which are discussed
in Definition 2.2 on page 96.
Examples
Factory Example
Multi-objective optimization often means finding a compromise between conflicting goals. If
we go back to our factory example, we can specify the following objectives that all are
subject to optimization:
1. Minimize the time between an incoming order and the shipment of the corresponding
product.
2. Maximize profit.
3. Minimize costs for advertising, personnel, raw materials etc.
4. Maximize product quality.
5. Minimize negative impact on environment.
The last two objectives seem to clearly contradict the cost minimization. There should also
be some kind of conflicting relation between the personnel costs, the time needed for
production, and the product quality. The exact mutual influences between objectives can
apparently become complicated and are not always obvious.
Artificial Ant Example
Another example for such a situation is the Artificial Ant problem14 where the goal is to
find the most efficient controller for a simulated ant. The efficiency of an ant should not only
be measured by the amount of food it is able to pile. For every food item, the ant needs
to walk to some point. The more food it piles, the longer the distance it needs to walk. If
its behavior is driven by a clever program, it may walk along a shorter route which would
not be discovered by an ant with a clumsy controller. Thus, the distance it has to cover
to find the food or the time it needs to do so may also be considered in the optimization
process. If two control programs produce the same results and one is smaller (i. e., contains
fewer instructions) than the other, the smaller one should be preferred. Like in the factory
example, the optimization goals conflict with each other.
From both of these examples, we can gain another insight: Finding the global optimum
could mean maximizing one function $f_i \in F$ and minimizing another one $f_j \in F$ ($i \neq j$).
Hence, it makes no sense to talk about a global maximum or a global minimum in terms
of multi-objective optimization. We will thus retreat to the notion of the set of optimal
elements $x^\star \in X^\star \subseteq X$.
Since compromises for conflicting criteria can be defined in many ways, there exist mul-
tiple approaches to define what an optimum is. These different definitions, in turn, lead to
different sets X⋆.

Figure 1.3: Two functions $f_1$ and $f_2$ with different maxima $\hat{x}_1$ and $\hat{x}_2$. [Plot of $y = f_1(x)$ and $y = f_2(x)$ over $x \in X_1$.]
Graphical Example 1
We will discuss some of these approaches in the following by using two graphical examples
for illustration purposes. In the first example pictured in Figure 1.3, we want to maximize
two independent objective functions $F_1 = \{f_1, f_2\}$. Both objective functions have the real
numbers $\mathbb{R}$ as problem space $X_1$. The maximum (and thus, the optimum) of $f_1$ is at
$\hat{x}_1$ and the largest value of $f_2$ is at $\hat{x}_2$. In Figure 1.3, we can easily see that $f_1$
and $f_2$ are partly conflicting: Their maxima are at different locations and there even exist
areas where $f_1$ rises while $f_2$ falls and vice versa.
Graphical Example 2
Figure 1.4: Two functions $f_3$ and $f_4$ with different minima $\check{x}_1$, $\check{x}_2$, $\check{x}_3$, and $\check{x}_4$. [Two surface plots of $f_3$ and $f_4$ over the problem space $X_2$ with axes $x_1$ and $x_2$.]
The objective functions $f_1$ and $f_2$ in the first example are mappings of a one-dimensional
problem space $X_1$ to the real numbers that are to be maximized. In the second example
sketched in Figure 1.4, we instead minimize two functions $f_3$ and $f_4$ that map a
two-dimensional problem space $X_2 \subset \mathbb{R}^2$ to the real numbers $\mathbb{R}$. Both functions have
two global minima; $f_3$ attains its lowest values at $\check{x}_1$ and $\check{x}_2$ whereas $f_4$ gets
minimal at $\check{x}_3$ and $\check{x}_4$. It should be noted that
$\check{x}_1 \neq \check{x}_2 \neq \check{x}_3 \neq \check{x}_4$.
14 See Section 21.3.1 on page 354 for more details.

Weighted Sums (Linear Aggregation)
The simplest method to define what is optimal is computing a weighted sum g(x) of all
the functions fi(x) ∈ F.15 Each objective fi is multiplied with a weight wi representing its
importance. Using signed weights also allows us to minimize one objective and to maximize
another. We can, for instance, apply a weight wa = 1 to an objective function fa and the
weight wb = −1 to the criterion fb. By minimizing g(x), we then actually minimize the
first and maximize the second objective function. If we instead maximize g(x), the effect
would be converse and fb would be minimized and fa would be maximized. Either way,
multi-objective problems are reduced to single-objective ones by this method.
$$g(x) = \sum_{i=1}^{n} w_i f_i(x) = \sum_{\forall f_i \in F} w_i f_i(x) \tag{1.4}$$
$$x^\star \in X^\star \Leftrightarrow g(x^\star) \geq g(x) \quad \forall x \in X \tag{1.5}$$
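The weighted sum of Equation 1.4 can be sketched as follows. The two objectives `f_a` and `f_b`, the candidate grid, and the helper name `weighted_sum` are assumptions chosen for illustration only. The example follows the text's case of minimizing g with the weights 1 and −1, whereas Equation 1.5 is written for maximization.

```python
def weighted_sum(objectives, weights):
    """Build g(x) = sum_i w_i * f_i(x) from objective functions and weights."""
    def g(x):
        return sum(w * f(x) for f, w in zip(objectives, weights))
    return g


def f_a(x):
    # Hypothetical objective to be minimized (weight 1).
    return (x - 1) ** 2


def f_b(x):
    # Hypothetical objective to be maximized (weight -1).
    return -(x - 3) ** 2


g = weighted_sum([f_a, f_b], [1, -1])

# Finite grid over [0, 5] standing in for the problem space X.
candidates = [i / 10 for i in range(0, 51)]
best = min(candidates, key=g)
```

Minimizing g here minimizes `f_a` and, through the negative weight, maximizes `f_b`. The result is the compromise x = 2, halfway between the individual optima at 1 and 3.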
Graphical Example 1
Figure 1.5 demonstrates optimization with the weighted sum approach for the example given
in Section 1.2.2. Both weights are set to 1, i.e., $w_1 = w_2 = 1$. If we maximize $g_1(x)$, we
will thus also maximize the functions $f_1$ and $f_2$. This leads to a single optimum
$x^\star = \hat{x}$.
Figure 1.5: Optimization using the weighted sum approach (first example). [Plot of $y_1 = f_1(x)$, $y_2 = f_2(x)$, and $y = g_1(x) = f_1(x) + f_2(x)$ over $x \in X_1$, with the maximum $\hat{x}$ marked.]
Graphical Example 2
The sum of the two-dimensional functions $f_3$ and $f_4$ from the second graphical example
given in Section 1.2.2 is sketched in Figure 1.6. Again we set the weights $w_3$ and $w_4$ to 1.
The sum $g_2$, however, is subject to minimization. The graph of $g_2$ has two especially deep
valleys. At the bottoms of these valleys, the two global minima $\check{x}_5$ and $\check{x}_6$ can be found.
Problems with Weighted Sums
The drawback of this approach is that it cannot properly handle functions that rise or fall
at different speeds16. In Figure 1.7, we have sketched the sum $g(x)$ of the two objective
functions $f_1(x) = -x^2$ and $f_2(x) = e^{x-2}$. When minimizing or maximizing this sum, we
15 This approach applies a linear aggregation function for fitness assignment and is therefore also
often referred to as linear aggregating.
16 See Section 30.1.3 on page 550 or http://en.wikipedia.org/wiki/Asymptotic_notation
[accessed 2007-07-03] for related information.

Figure 1.6: Optimization using the weighted sum approach (second example). [Surface plot of $y = g_2(x) = f_3(x) + f_4(x)$ over the axes $x_1$ and $x_2$, with two marked minima labeled $x_5$ and $x_6$.]
Figure 1.7: A problematic constellation for the weighted sum approach. [Plot of $y_1 = f_1(x)$, $y_2 = f_2(x)$, and $y = g(x) = f_1(x) + f_2(x)$ for $x$ from −5 to 5 and $y$ from −25 to 45.]
will always disregard one of the two functions, depending on the interval chosen. For small
$x$, $f_2$ is negligible compared to $f_1$. For $x > 5$ it begins to outpace $f_1$ which, in turn,
will now become negligible. Such functions cannot be added up properly using constant weights.
Even if we set $w_1$ to the really large number $10^{10}$, $f_1$ will become insignificant for
all $x > 40$, because $\left|\frac{-(40^2) \cdot 10^{10}}{e^{40-2}}\right| \approx 0.0005$. Therefore,
weighted sums are only suitable to optimize functions that at least belong to the same big-O
class (see Section 30.1.3 on page 550). Often, it is not obvious how the objective functions
will fall or rise. How can we, for instance, determine whether the objective maximizing the
food piled by an Artificial Ant rises in comparison to the objective minimizing the distance
walked by the simulated insect? And even if the shape of the objective functions and their
complexity class were clear, the question about how to set the weights w properly still
remains open in most cases [487]. In the same paper, Das and Dennis [487] also show that
with weighted sum approaches, not all elements considered optimal in terms of Pareto
domination will necessarily be found.
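The ratio quoted above can be checked numerically. The sketch uses the two functions from the text, $f_1(x) = -x^2$ and $f_2(x) = e^{x-2}$. The variable names are chosen here for illustration.

```python
import math


def f1(x):
    return -x ** 2


def f2(x):
    return math.exp(x - 2)


w1 = 1e10  # the "really large" weight from the text

# Weighted contribution of f1 relative to f2 at x = 40.
ratio_weighted = w1 * f1(40) / f2(40)

# The same comparison without any weighting.
ratio_plain = f1(40) / f2(40)
```

The weighted ratio has a magnitude of roughly 0.0005 (its sign is negative because $f_1$ is negative). Even a weight of $10^{10}$ therefore leaves $f_1$ contributing well under a thousandth of what $f_2$ contributes at x = 40.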

Pareto Optimization
The mathematical foundations for multi-objective optimization which considers conflicting
criteria in a fair way were laid by Vilfredo Pareto [1615] 110 years ago [1225]. Pareto
optimality17 became an important notion in economics, game theory, engineering, and social
sciences [390, 2219, 1587, 752]. It defines the frontier of solutions that can be reached by
trading off conflicting objectives in an optimal manner. From this front, a decision maker
(be it a human or an algorithm) can finally choose the configurations that, in his opinion,
suit best [715, 716, 375, 1961, 877, 760, 177]. The notion of optimal in the Pareto sense is
strongly based on the definition of domination:
Definition 1.13 (Domination). An element x1 dominates (is preferred to) an element
x2 (x1 ⊢ x2) if x1 is better than x2 in at least one objective function and not worse with
respect to all other objectives. Based on the set F of objective functions f , we can write:
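$$x_1 \vdash x_2 \Leftrightarrow \forall i : 0 < i \leq n \Rightarrow \omega_i f_i(x_1) \leq \omega_i f_i(x_2) \;\wedge\; \exists j : 0 < j \leq n : \omega_j f_j(x_1) < \omega_j f_j(x_2) \tag{1.6}$$
$$\omega_i = \begin{cases} 1 & \text{if } f_i \text{ should be minimized} \\ -1 & \text{if } f_i \text{ should be maximized} \end{cases} \tag{1.7}$$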
Different from the weights in the weighted sum approach, the factors ωi only carry
sign information which allows us to maximize some objectives and to minimize some other
criteria.
The Pareto domination relation defines a strict partial order (see Definition 27.31 on
page 463) on the space of possible objective values. In contrast, the weighted sum approach
imposes a total order by projecting it into the real numbers R.
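The domination relation of Equation 1.6 translates almost literally into code. The following is a minimal sketch that works on already computed objective vectors. The function name `dominates` and the example values (a production time to be minimized and a profit to be maximized) are assumptions for illustration.

```python
def dominates(a, b, omega):
    """True if objective vector a dominates b in the sense of Equation 1.6.

    a, b  : sequences holding the objective values f_i(x1) and f_i(x2)
    omega : sequence of +1 (f_i is minimized) or -1 (f_i is maximized)
    """
    # a must not be worse than b in any objective ...
    not_worse = all(w * fa <= w * fb for fa, fb, w in zip(a, b, omega))
    # ... and strictly better in at least one.
    better = any(w * fa < w * fb for fa, fb, w in zip(a, b, omega))
    return not_worse and better


omega = [1, -1]   # hypothetical: minimize time, maximize profit
a = (2, 10)       # time 2, profit 10
b = (3, 10)       # slower, same profit
c = (1, 5)        # faster, but less profit
```

Here `a` dominates `b`, while `a` and `c` are incomparable: each is better in one objective and worse in the other. No vector dominates itself, which is in line with the relation being a strict partial order.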
Definition 1.14 (Pareto Optimal). An element x⋆ ∈ X is Pareto optimal (and hence,
part of the optimal set X⋆) if it is not dominated by any other element in the problem space
X. In terms of Pareto optimization, X⋆ is called the Pareto set or the Pareto Frontier.
\[
x^\star \in X^\star \Leftrightarrow \nexists x \in X : x \vdash x^\star \tag{1.8}
\]
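For a finite problem space, the Pareto set of Definition 1.14 can be computed by brute force: keep every element that no other element dominates. The sketch below assumes a small integer problem space and two hypothetical objective functions that are both minimized; it is quadratic in the number of candidates and only meant to illustrate the definition.

```python
def dominates(y1, y2, omega):
    """Equation 1.6 on objective vectors; omega[i] is 1 (minimize) or -1 (maximize)."""
    pairs = [(w * a, w * b) for w, a, b in zip(omega, y1, y2)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def pareto_set(candidates, objectives, omega):
    """Return all candidates that are not dominated by any other candidate."""
    values = [tuple(f(x) for f in objectives) for x in candidates]
    return [x for x, y in zip(candidates, values)
            if not any(dominates(other, y, omega) for other in values)]


# Hypothetical example: minimize f1(x) = x^2 and f2(x) = (x - 2)^2 on integers.
X = list(range(-2, 6))
objectives = (lambda x: x ** 2, lambda x: (x - 2) ** 2)
print(pareto_set(X, objectives, omega=(1, 1)))  # [0, 1, 2]
```

Between 0 and 2, improving one objective worsens the other, so these elements cannot dominate each other; every element outside this range is dominated by one inside it.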
Graphical Example 1
In Figure 1.8, we illustrate the impact of the definition of Pareto optimality on our first
example (outlined in Section 1.2.2). We assume again that f1 and f2 should both be maximized
and hence, ω1 = ω2 = −1. The areas shaded in dark gray are Pareto optimal and
thus represent the optimal set X⋆ = [x2, x3] ∪ [x5, x6], which here contains infinitely many
elements18. All other points are dominated, i. e., not optimal.
The points in the area between x1 and x2 (shaded in light gray) are dominated by other
points in the same region or in [x2, x3], since both functions f1 and f2 can be improved by
increasing x. If we start at the leftmost point in X (which is position x1), for instance, we
can go one small step ∆ to the right and will find a point x1 + ∆ dominating x1 because
f1(x1 + ∆) > f1(x1) and f2(x1 + ∆) > f2(x1). We can repeat this procedure and will always
find a new dominating point until we reach x2. The point x2 marks the global maximum of f2, the
point with the highest possible f2 value, which cannot be dominated by any other point in
X by definition (see Equation 1.6).
From here on, f2 will decrease for a while, but f1 keeps rising. If we now go a small step
∆ to the right, we will find a point x2 + ∆ with f2(x2 + ∆) < f2(x2) but also f1(x2 + ∆) >
f1(x2). One objective can only get better if another one degenerates. In order to increase f1,
f2 would be decreased and vice versa and so the new point is not dominated by x2. Although
17 http://en.wikipedia.org/wiki/Pareto_efficiency [accessed 2007-07-03]
18 In practice, of course, our computers can only handle finitely many elements

1 Introduction
Figure 1.8: Optimization using the Pareto Frontier approach. [Plot of y = f1(x) and y = f2(x) over x ∈ X1, with the positions x1 to x6 marked on the x-axis.]
some of the f2(x) values of the other points x ∈ [x1,x2) may be larger than f2(x2 + ∆),
f1(x2 + ∆) > f1(x) holds for all of them. This means that no point in [x1, x2) can dominate
any point in [x2, x4] because f1 keeps rising until x4 is reached.
At x3, however, f2 steeply falls to a very low level, lower than f2(x5). Since the f1
values of the points in [x5, x6] are also higher than those of the points in (x3, x4], all points
in the set [x5, x6] (which also contains the global maximum of f1) dominate those in (x3, x4].
For all the points in the white area between x4 and x5 and after x6, we can derive similar
relations. All of them are also dominated by the non-dominated regions that we have just
discussed.
Graphical Example 2
Another method to visualize the Pareto relationship is outlined in Figure 1.9 for our second
graphical example. For a certain resolution of the problem space X2, we have counted the
number of elements that dominate each element x ∈ X2. The higher this number, the
worse is the element x in terms of Pareto optimization. Hence, those solution candidates
residing in the valleys of Figure 1.9 are better than those which are part of the hills. This
Pareto ranking approach is also used in many optimization algorithms as part of the fitness
assignment scheme (see Section 2.3.3 on page 112, for instance). A non-dominated element
is, as the name says, not dominated by any other solution candidate. These elements are
Pareto optimal and have a domination-count of zero. In Figure 1.9, there are four such areas
X⋆1, X⋆2, X⋆3, and X⋆4.
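The domination count described above can be sketched as follows. The objective vectors are made-up sample data (both objectives minimized), not values read from Figure 1.9.

```python
def dominates(y1, y2, omega):
    """Equation 1.6 on objective vectors; omega[i] is 1 (minimize) or -1 (maximize)."""
    pairs = [(w * a, w * b) for w, a, b in zip(omega, y1, y2)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def domination_counts(values, omega):
    """For each objective vector, count how many other vectors dominate it."""
    return [sum(dominates(other, y, omega) for other in values) for y in values]


# Hypothetical objective vectors, both objectives minimized.
values = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(domination_counts(values, omega=(1, 1)))  # [0, 0, 0, 1, 4]
```

The first three vectors have a count of zero and are therefore non-dominated; (3, 3) is dominated only by (2, 2), and (5, 5) by all four others.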
Figure 1.9: Optimization using the Pareto Frontier approach (second example). [Surface plot of the domination count #dom over x1 and x2, with the optimal areas X⋆1 to X⋆4 marked.]

If we compare Figure 1.9 with the plots of the two functions f3 and f4 in Figure 1.4, we
can see that hills in the domination space occur at positions where both f3 and f4 have high
values. Conversely, regions of the problem space where both functions have small values are
dominated by very few elements.
Besides these examples here, another illustration of the domination relation which may
help understanding Pareto optimization can be found in Section 2.3.3 on page 112 (Figure 2.4
and Table 2.1).
Problems of Pure Pareto Optimization
The complete Pareto optimal set is often not the wanted result of an optimization algorithm.
Usually, we are rather interested in some special areas of the Pareto front only.
Artificial Ant Example
We can again take the Artificial Ant example to visualize this problem. In Section 1.2.2 on
page 27, we have introduced multiple conflicting criteria in this problem.
1. Maximize the amount of food piled.
2. Minimize the distance covered or the time needed to find the food.
3. Minimize the size of the program driving the ant.
Pareto optimization may now yield for example:
1. A program consisting of 100 instructions, allowing the ant to gather 50 food items when
walking a distance of 500 length units.
2. A program consisting of 100 instructions, allowing the ant to gather 60 food items when
walking a distance of 5000 length units.
3. A program consisting of 10 instructions, allowing the ant to gather 1 food item when
walking a distance of 5 length units.
4. A program consisting of 0 instructions, allowing the ant to gather 0 food items when
walking a distance of 0 length units.
The result of the optimization process obviously contains two useless but non-dominated
individuals which occupy space in the population and the non-dominated set. We also invest
processing time in evaluating them, and even worse, they may dominate solutions that are
not optimal but fall into the space behind the interesting part of the Pareto front. Further-
more, memory restrictions usually force us to limit the size of the list of non-dominated
solutions found during the search. When this size limit is reached, some optimization al-
gorithms use a clustering technique to prune the optimal set while maintaining diversity.
On the one hand, this is good since it will preserve a broad scan of the Pareto frontier. On
the other hand, in this case a short but dumb program is of course very different from a longer,
intelligent one. Therefore, it will be kept in the list, and other solutions which differ less from
each other but are more interesting for us will be discarded.
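That all four example programs really are mutually non-dominated can be checked mechanically. In the sketch below, each program is written as an objective vector (food, distance, size) taken from the list above; the labels A to D and the helper `dominates` are assumptions made for illustration.

```python
def dominates(y1, y2, omega):
    """Equation 1.6 on objective vectors; omega[i] is 1 (minimize) or -1 (maximize)."""
    pairs = [(w * a, w * b) for w, a, b in zip(omega, y1, y2)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


# (food gathered, distance covered, program size) of the four programs.
programs = {
    "A": (50, 500, 100),
    "B": (60, 5000, 100),
    "C": (1, 5, 10),
    "D": (0, 0, 0),
}
omega = (-1, 1, 1)  # maximize food, minimize distance, minimize size

non_dominated = [name for name, y in programs.items()
                 if not any(dominates(other, y, omega) for other in programs.values())]
print(non_dominated)  # ['A', 'B', 'C', 'D']
```

Even the empty program D survives, because no other program matches its distance and size of zero.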
Furthermore, non-dominated elements have a higher probability of being explored fur-
ther. This then leads inevitably to the creation of a great proportion of useless offspring. In
the next generation, these useless offspring will need a good share of the processing time to
be evaluated.
Thus, there are several reasons to force the optimization process into a wanted direction.
In Section 22.2.2 on page 390 you can find an illustrative discussion on the drawbacks of
strict Pareto optimization in a practical example (evolving web service compositions).
1.2.3 Constraint Handling
Such a region of interest is one of the reasons for one further extension of the definition of
optimization problems: In many scenarios, p inequality constraints g and q equality constraints
h may be imposed in addition to the objective functions. Then, a solution candidate x is
feasible if and only if gi(x) ≥ 0 ∀i = 1, 2, .., p and hi(x) = 0 ∀i = 1, 2, .., q hold. Obviously, only
a feasible individual can be a solution, i. e., an optimum, for a given optimization problem.
Comprehensive reviews on techniques for such problems have been provided by Michalewicz
[1406], Michalewicz and Schoenauer [1410], Coello Coello [358], and Coello Coello et al. [361]
in the context of Evolutionary Computation.
Death Penalty
Probably the easiest way of dealing with constraints is to simply reject all infeasible solution
candidates right away and not consider them any further in the optimization process.
This death penalty [1406, 1408] can only work in problems where the feasible regions are
very large; where they are not, it will lead the search to stagnate. Also, the
information which could be gained from the infeasible individuals is discarded with them
and not used during the optimization.
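A minimal sketch of the feasibility test and of the death penalty follows. The tolerance `tol` for equality constraints, the random sampler, and the unit-disc constraint are assumptions chosen for illustration; they do not come from the cited works.

```python
import random


def is_feasible(x, inequalities, equalities, tol=1e-9):
    """x is feasible iff every g(x) >= 0 and every h(x) = 0 (within tol)."""
    return (all(g(x) >= 0 for g in inequalities)
            and all(abs(h(x)) <= tol for h in equalities))


def sample_with_death_penalty(sampler, inequalities, equalities, max_tries=10_000):
    """Draw candidates until a feasible one appears; infeasible ones are discarded."""
    for tries in range(1, max_tries + 1):
        x = sampler()
        if is_feasible(x, inequalities, equalities):
            return x, tries
    return None, max_tries  # the search stagnated


rng = random.Random(42)
unit_disc = lambda x: 1.0 - x[0] ** 2 - x[1] ** 2  # g(x) >= 0 inside the disc
sampler = lambda: (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
x, tries = sample_with_death_penalty(sampler, [unit_disc], [])
print(x, tries)
```

Here roughly one fifth of the sampled square is feasible, so a feasible point is found quickly. With a much smaller feasible region, almost every sample would be thrown away, which is the stagnation effect described above.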
Penalty Functions
Maybe one of the most popular approaches for dealing with constraints, especially in the
area of single-objective optimization, goes back to Courant [458], who introduced the idea
of penalty functions in 1943. Here, the constraints are combined with the objective function
f, resulting in a new function f′ which is then actually optimized. The basic idea is that
this combination is done in a way which ensures that an infeasible solution candidate always
has a worse f′-value than a feasible one with the same objective values. In [458], this is
achieved by defining f′ as f′(x) = f(x) + v [h(x)]^2. Various similar approaches exist. Carroll
[345, 346], for instance, chose a penalty function of the form
\[
f'(x) = f(x) + v \sum_{i=1}^{p} [g_i(x)]^{-1}
\]
which ensures that the function g does not become zero or negative.
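Both penalty forms can be written as small function factories. This is a sketch for a minimized objective; the example objective f(x) = x^2, the constraints, and the values of v are assumptions chosen only to make the effect visible.

```python
def courant_penalty(f, h, v):
    """f'(x) = f(x) + v * h(x)^2 for one equality constraint h(x) = 0."""
    return lambda x: f(x) + v * h(x) ** 2


def carroll_penalty(f, gs, v):
    """f'(x) = f(x) + v * sum_i 1 / g_i(x) for inequality constraints g_i(x) > 0."""
    return lambda x: f(x) + v * sum(1.0 / g(x) for g in gs)


f = lambda x: x ** 2  # objective to be minimized

# Equality constraint h(x) = x - 1 = 0.
f_eq = courant_penalty(f, lambda x: x - 1.0, v=100.0)
print(f_eq(1.0))  # 1.0: feasible, no penalty
print(f_eq(0.0))  # 100.0: infeasible, heavily penalized

# Inequality constraint g(x) = x - 1 > 0; the penalty grows near the boundary.
f_ineq = carroll_penalty(f, [lambda x: x - 1.0], v=0.1)
print(f_ineq(2.0) < f_ineq(1.01))  # True
```

In the second form, the penalty term grows without bound as a g_i(x) approaches zero from above, so a search that starts inside the feasible region is pushed away from its border. The form is undefined for g_i(x) = 0.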
There are practically no limits for the ways in which a penalty for infeasibility can be
integrated into the objective functions. Several researchers suggest dynamic penalties which
incorporate the index of the current iteration of the optimizer [1063, 1560] or adaptive
penalties which additionally utilize population statistics [1876, 1877, 875, 159]. Rigorous
discussions on penalty functions have been contributed by Fiacco and McCormick [665] and
Smith and Coit [1901].
Constraints as Additional Objectives
Another idea for handling constraints would be to consider them as new objective functions.
If g(x) ≥ 0 must hold, for instance, we can transform this to a new objective function
f∗(x) = −min {g(x), 0} = max {−g(x), 0} subject to minimization. Clamping at 0 is needed
since there is no use in maximizing g further than 0 and hence, after it has reached 0, the
optimization pressure must be removed. An approach similar to this is Deb’s Goal Programming
method [536, 533].
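One way to realize the idea that the optimization pressure vanishes once the constraint is met is to minimize the constraint violation, which is zero for every feasible x and positive otherwise. The helper name `violation_objective` and the example constraint are assumptions.

```python
def violation_objective(g):
    """Turn the constraint g(x) >= 0 into an objective to be minimized.

    The result is 0 whenever the constraint holds and -g(x) > 0 otherwise.
    """
    return lambda x: max(0.0, -g(x))


g = lambda x: x - 3.0  # the constraint x >= 3 written as g(x) >= 0
f_star = violation_objective(g)

print(f_star(1.0))   # 2.0: violated by 2
print(f_star(3.0))   # 0.0: just feasible
print(f_star(10.0))  # 0.0: no reward for exceeding the bound
```

Because all feasible points share the value 0, this extra objective does not distinguish between them and leaves their ranking to the original objective functions.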
The Method of Inequalities
General inequality constraints can also be processed according to the Method of Inequalities
(MOI) introduced by Zakian [2304, 2305, 2306, 2307, 2308] in his seminal work on
computer-aided control systems design (CACSD) [1814, 2200, 2315]. In the MOI, an area of interest
is specified in the form of a goal range [ři, r̂i] for each objective function fi.
Pohlheim [1651] outlines how this approach can be combined with Pareto optimization:
Based on the inequalities, three categories of solution candidates can be defined and each
element x ∈ X belongs to one of them:
1. It fulfills all of the goals, i. e.,
\[
\check{r}_i \le f_i(x) \le \hat{r}_i \quad \forall i \in [1, |F|] \tag{1.9}
\]
2. It fulfills some (but not all) of the goals, i. e.,
\[
\left(\exists i \in [1, |F|] : \check{r}_i \le f_i(x) \le \hat{r}_i\right) \land \left(\exists j \in [1, |F|] : (f_j(x) < \check{r}_j) \lor (f_j(x) > \hat{r}_j)\right) \tag{1.10}
\]
3. It fulfills none of the goals, i. e.,
\[
(f_i(x) < \check{r}_i) \lor (f_i(x) > \hat{r}_i) \quad \forall i \in [1, |F|] \tag{1.11}
\]
Using these groups, a new comparison mechanism is created:
1. The solution candidates that fulfill all goals are preferred over all other individuals
that fulfill only some or none of the goals.
2. The solution candidates that are not able to fulfill any of the goals succumb to those
which fulfill at least some goals.
3. Only the solutions that are in the same group are compared on the basis of the Pareto
domination relation.
By doing so, the optimization process will be driven into the direction of the interesting
part of the Pareto frontier. Less effort will be spent in creating and evaluating individuals
in parts of the problem space that most probably do not contain any valid solution.
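The three classes and the comparison mechanism built on them can be sketched as follows. The function names, the return convention (negative if the first argument is preferred), and the goal ranges are assumptions for illustration and are not taken from Pohlheim [1651].

```python
def dominates(y1, y2, omega):
    """Equation 1.6 on objective vectors; omega[i] is 1 (minimize) or -1 (maximize)."""
    pairs = [(w * a, w * b) for w, a, b in zip(omega, y1, y2)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def moi_class(y, goals):
    """1: all goal ranges met, 2: some met, 3: none met (Equations 1.9 to 1.11)."""
    met = sum(lo <= v <= hi for v, (lo, hi) in zip(y, goals))
    if met == len(goals):
        return 1
    return 2 if met > 0 else 3


def moi_compare(y1, y2, goals, omega):
    """Negative if y1 is preferred, positive if y2 is preferred, 0 otherwise."""
    c1, c2 = moi_class(y1, goals), moi_class(y2, goals)
    if c1 != c2:
        return -1 if c1 < c2 else 1   # the lower class always wins
    if dominates(y1, y2, omega):      # same class: fall back to Pareto
        return -1
    if dominates(y2, y1, omega):
        return 1
    return 0


goals = [(0.0, 10.0), (0.0, 10.0)]  # hypothetical goal ranges
omega = (1, 1)                      # both objectives minimized
print(moi_compare((5.0, 5.0), (1.0, 20.0), goals, omega))  # -1
print(moi_compare((5.0, 5.0), (4.0, 4.0), goals, omega))   # 1
```

In the first call, neither vector Pareto-dominates the other, yet (5, 5) wins because it meets both goals while (1, 20) misses one.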
Graphical Example 1
In Figure 1.10, we apply the Pareto-based Method of Inequalities to our first graphical
example. We impose the same goal ranges on both objectives, r̂1 = r̂2 and ř1 = ř2. By
doing so, the second non-dominated region from the Pareto example in Figure 1.8 suddenly
becomes infeasible, since f1 rises over r̂1 there. Also, the greater part of the first optimal
area from this example is infeasible because f2 drops under ř2. In the whole domain X of
the optimization problem, only the regions [x1, x2] and [x3, x4] fulfill all the target criteria.
To these elements, Pareto comparisons are applied. It turns out that the elements in [x3, x4]
dominate all the elements in [x1, x2] since they provide higher values of f1 for the same values
of f2. If we scan through [x3, x4] from left to right, we can see that f1 rises while f2 degenerates,
which is why the elements in this area cannot dominate each other and, hence, are all
optimal.
Figure 1.10: Optimization using the Pareto-based Method of Inequalities approach (first
example). [Plot of y = f1(x) and y = f2(x) over x ∈ X1, with the goal range limits of r1 and r2 and the positions x1 to x4 marked.]
Graphical Example 2
In Figure 1.11, we apply the Pareto-based Method of Inequalities to our second graphical
example from Section 1.2.2. We apply two different ranges of interest [ř3, r̂3] and [ř4, r̂4] to
f3 and f4, as sketched in Fig. 1.11.a.

Fig. 1.11.a: The ranges applied to f3 and f4.
Fig. 1.11.b: The Pareto-based Method of Inequalities class division.
Fig. 1.11.c: The Pareto-based Method of Inequalities ranking.
Figure 1.11: Optimization using the Pareto-based Method of Inequalities approach (second
example). [Three panels plotted over x1 and x2: the goal ranges on f3 and f4, the MOI class, and the domination count #dom with the optima x⋆1 to x⋆8 and the set X⋆9 marked.]
As in the second example for Pareto optimization, we want to plot the quality of
the elements in the problem space. Therefore, we first assign a number c ∈ {1, 2, 3} to each
of its elements in Fig. 1.11.b. This number corresponds to the class to which the element
belongs, i. e., 1 means that a solution candidate fulfills all inequalities, for an element of class
2, at least some of the constraints hold, and the elements in class 3 fail all requirements.
Based on this class division, we can then perform a modified Pareto counting where each
element dominates all the elements in higher classes (Fig. 1.11.c). The result is that multiple
single optima x⋆1, x⋆2, x⋆3, etc., and even a set of adjacent, non-dominated elements X⋆9 occur.
These elements are, again, situated at the bottom of the illustrated landscape whereas the
worst solution candidates reside on hill tops.
A good overview on techniques for the Method of Inequalities is given by Whidborne
et al. [2200].
Limitations and Other Methods
Other approaches for incorporating constraints into optimization are Goal Attainment [2233,
714] and Goal Programming19 [377, 376]. Especially interesting in our context are methods
which have been integrated into evolutionary algorithms [2002, 536, 533, 1804, 1651], such
as the popular Goal Attainment approach by Fonseca and Fleming [714] which is similar to
the Pareto-MOI we have adopted from Pohlheim [1651]. Again, an overview on this subject
is given by Coello Coello et al. in [361].
19 http://en.wikipedia.org/wiki/Goal_programming [accessed 2007-07-03]
1.2.4 Unifying Approaches
External Decision Maker
All approaches for defining what optima are and how constraints should be considered are
rather specific and bound to certain mathematical constructs. The more general concept of
an External Decision Maker which (or who) decides which solution candidates prevail has
been introduced by Fonseca and Fleming [715, 716]. One of the ideas behind “externalizing”
the assessment process on what is good and what is bad is that Pareto optimization imposes
only a partial order20 on the solution candidates. In a partial order, elements may exist
which neither succeed nor precede each other. As we have seen in Section 1.2.2, there can,
for instance, be two individuals x1, x2 ∈ X with neither x1 ⊢ x2 nor x2 ⊢ x1. A special
case of this situation is the non-dominated set, the so-called Pareto frontier which we try
to estimate with the optimization process.
Most fitness assignment processes, however, require some sort of total order21, where each
individual is either better or worse than each other (except for the case of identical solution
candidates which are, of course, equal to each other). The fitness assignment algorithms can
create such a total order by themselves. One example for doing this is the Pareto ranking
which we will discuss later in Section 2.3.3 on page 112, where the number of individuals
dominating a solution candidate denotes its fitness.
While this method of ordering is a good default approach capable of directing the search
into the direction of the Pareto frontier and delivering a broad scan of it, it neglects the fact
that the user of the optimization most often is not interested in the whole optimal set but
has preferences, certain regions of interest [717]. This region will then exclude the infeasible
(but Pareto optimal) programs for the Artificial Ant as discussed in Section 1.2.2. What the
user wants is a detailed scan of these areas, which often cannot be delivered by pure Pareto
optimization.
Figure 1.12: An external decision maker providing an evolutionary algorithm with utility
values. [Diagram labels: a priori knowledge, DM (decision maker), utility/cost, EA (an optimizer), objective values (acquired knowledge), results.]
20 A definition of partial order relations is specified in Definition 27.31 on page 463.
21 The concept of total orders is elucidated in Definition 27.32 on page 464.
This is where the External Decision Maker, as an expression of the user’s preferences [712],
comes into play, as illustrated in Figure 1.12. The task of this decision maker is to provide a cost
function u : Y → R (or utility function, if the underlying optimizer is maximizing) which
maps the space of objective values Y (which is usually Rn) to the space of real numbers
R. Since there is a total order defined on the real numbers, this process is another way
of resolving the “incomparability-situation”. The structure of the decision making process
u can freely be defined and may incorporate any of the previously mentioned methods.
u could, for example, be reduced to computing a weighted sum of the objective values, to
performing an implicit Pareto ranking, or to comparing individuals based on pre-specified
goal-vectors. Furthermore, it may even incorporate forms of artificial intelligence, other forms of
multi-criterion Decision Making, and even interaction with the user. This technique allows
focusing the search onto solutions which are not only optimal in the Pareto sense, but also
feasible and interesting from the viewpoint of the user.
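As a minimal sketch of such a decision maker, the cost function u below is reduced to a weighted sum of the objective values. The weights and the objective vectors are made-up values; a real decision maker could be arbitrarily more elaborate.

```python
def make_weighted_cost(weights):
    """Build a cost function u: Y -> R as a weighted sum of objective values."""
    def u(y):
        return sum(w * v for w, v in zip(weights, y))
    return u


# Hypothetical preferences: the second objective matters three times as much.
u = make_weighted_cost((1.0, 3.0))

# Three mutually non-dominated objective vectors (both objectives minimized).
objective_vectors = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
ranked = sorted(objective_vectors, key=u)
print(ranked)  # [(4.0, 1.0), (2.0, 2.0), (1.0, 4.0)]
```

Pareto domination cannot order these three vectors at all, whereas the cost values 7, 8, and 13 place them in a total order that reflects the stated preference.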
Fonseca and Fleming make a clear distinction between fitness and cost values. Cost values
have some meaning outside the optimization process and are based on user preferences.
Fitness values on the other hand are an internal construct of the search with no meaning
outside the optimizer (see Definition 1.35 on page 46 for more details). If External Decision
Makers are applied in evolutionary algorithms or other search paradigms that are based on
fitness measures, these will be computed using the values of the cost function instead of the
objective functions [718, 712, 713].
Prevalence Optimization
We have now discussed various approaches which define optima in terms of multi-objective
optimization and steer the search process into their direction. Let us subsume all of them in
a general approach. From the concept of Pareto optimization to the Method of Inequalities,
the need to compare elements of the problem space in terms of their quality as solutions
for a given problem runs like a red thread through this matter. Even the weighted sum
approach and the External Decision Maker do nothing else than mapping multi-dimensional
vectors to the real numbers in order to make them comparable.
If we compare two solution candidates x1 and x2, either x1 is better than x2, vice versa,
or both are of equal quality. Hence, there are three possible relations between two elements
of the problem space. These relations can be expressed with a comparator function cmpF.
Definition 1.15 (Comparator Function). A comparator function cmp : A² → R maps
all pairs of elements (a1, a2) ∈ A² to the real numbers R according to two complementing
partial orders22 R1 and R2:
\[
R_1(a_1, a_2) \Leftrightarrow \mathrm{cmp}(a_1, a_2) < 0 \quad \forall a_1, a_2 \in A \tag{1.12}
\]
\[
R_2(a_1, a_2) \Leftrightarrow \mathrm{cmp}(a_1, a_2) > 0 \quad \forall a_1, a_2 \in A \tag{1.13}
\]
\[
\lnot R_1(a_1, a_2) \land \lnot R_2(a_1, a_2) \Leftrightarrow \mathrm{cmp}(a_1, a_2) = 0 \quad \forall a_1, a_2 \in A \tag{1.14}
\]
\[
\mathrm{cmp}(a, a) = 0 \quad \forall a \in A \tag{1.15}
\]
R1 (and hence, cmp(a1, a2) < 0) is equivalent to the precedence relation and R2 denotes
succession.
From the three defining equations, many features of cmp can be deduced. It is, for
instance, transitive, i. e., cmp(a1, a2) < 0 ∧ cmp(a2, a3) < 0 ⇒ cmp(a1, a3) < 0. Provided
with the knowledge of the objective functions f ∈ F, such a comparator function cmpF can
be imposed on the problem spaces of our optimization problems:
Definition 1.16 (Prevalence Comparator Function). A prevalence comparator func-
tion cmpF : X2 → R maps all pairs (x1,x2) ∈ X2 of solution candidates to the real numbers
R according to Definition 1.15.
The subscript F in cmpF illustrates that the comparator has access to all the values of
the objective functions in addition to the problem space elements which are its parameters.
As shortcut for this comparator function, we introduce the prevalence notation as follows:
22 Partial orders are introduced in Definition 27.30 on page 463.

Definition 1.17 (Prevalence). An element x1 prevails over an element x2 (x1 ≻ x2) if
the application-dependent prevalence comparator function cmpF(x1, x2) ∈ R returns a value
less than 0.
\[
(x_1 \succ x_2) \Leftrightarrow \mathrm{cmp}_F(x_1, x_2) < 0 \quad \forall x_1, x_2 \in X \tag{1.16}
\]
\[
(x_1 \succ x_2) \land (x_2 \succ x_3) \Rightarrow x_1 \succ x_3 \quad \forall x_1, x_2, x_3 \in X \tag{1.17}
\]
It is easy to see that we can define Pareto domination relations and Method of
Inequalities-based comparisons, as well as the weighted sum combination of objective val-
ues based on this notation. Together with the fitness assignment strategies which will be
introduced later in this book (see Section 2.3 on page 111), it covers many of the most
sophisticated multi-objective techniques that are proposed, for instance, in [715, 1128, 2002].
By replacing the Pareto approach with prevalence comparisons, all the optimization algorithms
(especially many of the evolutionary techniques) relying on domination relations can
be used in their original form while offering the new ability of scanning special regions of
interest of the optimal frontier.
Since the comparator function cmpF and the prevalence relation impose a partial order
on the problem space X like the domination relation does, we can construct the optimal set
in a way very similar to Equation 1.8:
\[
x^\star \in X^\star \Leftrightarrow \nexists x \in X : x \ne x^\star \land x \succ x^\star \tag{1.18}
\]
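For a finite set of candidates, the optimal set under an arbitrary comparator can be built directly from this definition: an element belongs to it when no other element prevails over it. In the sketch below, the candidates are objective vectors and the comparator is a Pareto-style one for two minimized objectives; both are assumptions for illustration.

```python
def optimal_set(candidates, cmp):
    """Keep every x for which no different element o has cmp(o, x) < 0."""
    return [x for x in candidates
            if not any(o != x and cmp(o, x) < 0 for o in candidates)]


def cmp_pareto_min(y1, y2):
    """Example comparator: -1 if y1 dominates y2, 1 if y2 dominates y1, else 0.

    All objectives are assumed to be minimized.
    """
    if all(a <= b for a, b in zip(y1, y2)) and y1 != y2:
        return -1
    if all(b <= a for a, b in zip(y1, y2)) and y1 != y2:
        return 1
    return 0


values = [(1, 4), (2, 2), (4, 1), (3, 3)]
print(optimal_set(values, cmp_pareto_min))  # [(1, 4), (2, 2), (4, 1)]
```

Swapping in a different comparator changes which elements survive without touching `optimal_set`, which is the point of the prevalence notation.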
For illustration purposes, we will exercise the prevalence approach on the examples of the
weighted sum method23 cmpF,weightedS with the weights wi as well as on the domination-based
Pareto optimization24 cmpF,Pareto with the objective directions ωi:
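\[
\mathrm{cmp}_{F,\mathrm{weightedS}}(x_1, x_2) = \sum_{i=1}^{|F|} \left(w_i f_i(x_2) - w_i f_i(x_1)\right) \equiv g(x_2) - g(x_1) \tag{1.19}
\]
\[
\mathrm{cmp}_{F,\mathrm{Pareto}}(x_1, x_2) = \begin{cases} -1 & \text{if } x_1 \vdash x_2 \\ 1 & \text{if } x_2 \vdash x_1 \\ 0 & \text{otherwise} \end{cases} \tag{1.20}
\]
Artificial Ant Example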
With the prevalence comparator, we can also easily solve the problem stated in Section 1.2.2
by no longer encouraging the evolution of useless programs for Artificial Ants while retaining
the benefits of Pareto optimization. The comparator function can simply be defined in such a
way that useless programs are always prevailed over by useful ones. It therefore may incorporate
the knowledge on the importance of the objective functions. Let f1 be the objective function
with an output proportional to the food piled, f2 denote the distance covered in
order to find the food, and f3 be the program length. Equation 1.21 demonstrates
one possible comparator function for the Artificial Ant problem.


\[
\mathrm{cmp}_{F,\mathrm{ant}}(x_1, x_2) = \begin{cases}
-1 & \text{if } (f_1(x_1) > 0 \land f_1(x_2) = 0) \lor (f_2(x_1) > 0 \land f_2(x_2) = 0) \lor (f_3(x_1) > 0 \land f_1(x_2) = 0) \\
1 & \text{if } (f_1(x_2) > 0 \land f_1(x_1) = 0) \lor (f_2(x_2) > 0 \land f_2(x_1) = 0) \lor (f_3(x_2) > 0 \land f_1(x_1) = 0) \\
\mathrm{cmp}_{F,\mathrm{Pareto}}(x_1, x_2) & \text{otherwise}
\end{cases} \tag{1.21}
\]
23 See Equation 1.4 on page 29 for more information on weighted sum optimization.
24 Pareto optimization was defined in Equation 1.6 on page 31.
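The three comparators of Equations 1.19 to 1.21 can be transcribed almost literally. The sketch below works on objective vectors instead of the elements x themselves, and the function names are assumptions. The ant comparator follows Equation 1.21 term by term, with the vector ordered as (f1, f2, f3) = (food, distance, program length).

```python
def dominates(y1, y2, omega):
    """Equation 1.6 on objective vectors; omega[i] is 1 (minimize) or -1 (maximize)."""
    pairs = [(w * a, w * b) for w, a, b in zip(omega, y1, y2)]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def cmp_weighted_sum(y1, y2, weights):
    """Equation 1.19: negative when the weighted sum of y1 exceeds that of y2."""
    return sum(w * b - w * a for w, a, b in zip(weights, y1, y2))


def cmp_pareto(y1, y2, omega):
    """Equation 1.20."""
    if dominates(y1, y2, omega):
        return -1
    if dominates(y2, y1, omega):
        return 1
    return 0


OMEGA_ANT = (-1, 1, 1)  # maximize food, minimize distance and program length


def cmp_ant(y1, y2):
    """Equation 1.21, transcribed term by term."""
    f1a, f2a, f3a = y1
    f1b, f2b, f3b = y2
    if (f1a > 0 and f1b == 0) or (f2a > 0 and f2b == 0) or (f3a > 0 and f1b == 0):
        return -1
    if (f1b > 0 and f1a == 0) or (f2b > 0 and f2a == 0) or (f3b > 0 and f1a == 0):
        return 1
    return cmp_pareto(y1, y2, OMEGA_ANT)


useful = (50, 500, 100)  # 50 food items, distance 500, 100 instructions
empty = (0, 0, 0)        # the program with zero instructions
print(cmp_pareto(useful, empty, OMEGA_ANT))  # 0: incomparable under Pareto
print(cmp_ant(useful, empty))                # -1: the useful program prevails
```

Under plain Pareto comparison the empty program is non-dominated, whereas the ant comparator lets the useful program prevail over it.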

Later in this book, we will discuss some of the most popular optimization strategies. Al-
though they are usually implemented based on Pareto optimization, we will always introduce
them using prevalence.
1.3 The Structure of Optimization
Now that we have discussed what optima are and have seen a crude classification of global
optimization algorithms, let us take a look at the general structure common to all
optimization processes. This structure consists of a number of well-defined spaces and sets
as well as the mappings between them. Based on this structure of optimization, we will
introduce the abstractions fitness landscape, problem landscape, and optimization problem,
which will lead us to a more thorough definition of what optimization is.
1.3.1 Spaces, Sets, and Elements
In this section, we elaborate on the relation between the (possibly different) representations
of solution candidates for search and for evaluation. We will show how these representations
are connected and introduce fitness as a relative utility measure defined on sets of solution
candidates. You will find that the general model introduced here applies to all the global
optimization methods mentioned in this book, often in a simplified manner. One example for
this structure of optimization processes is given in Figure 1.13 by using a genetic algorithm
which encodes the coordinates of points in a plane into bit strings as an illustration.
The Problem Space and the Solutions therein
Whenever we tackle an optimization problem, we first have to define the type of the pos-
sible solutions. For deriving a controller for the Artificial Ant problem, we could choose
programs or artificial neural networks as solution representation. If we are to find the root
of a mathematical function, we would go for real numbers R as solution candidates and when
configuring or customizing a car for a sales offer, all possible solutions are elements of the
power set of all optional features. With this initial restriction to a certain type of results,
we have specified the problem space X.
Definition 1.18 (Problem Space). The problem space X (phenome) of an optimization
problem is the set containing all elements x which could be its solution.
Usually, more than one problem space can be defined for a given optimization problem.
A few lines before, we said that as problem space for finding the root of a mathematical
function, the real numbers R would be fine. On the other hand, we could as well restrict
ourselves to the natural numbers N or widen the search to the whole complex plane C. This
choice has a major impact: On one hand, it determines which solutions we can possibly find.
On the other hand, it also has a subtle influence on the search operations. Between any two
different points in R, for instance, there are infinitely many other numbers, while in N, there
are not.
Following the terminology of genetic algorithms, we often refer to the problem space
synonymously as the phenome. The problem space X is often restricted by
1. logical constraints that rule out elements which cannot be solutions, like programs of
zero length when trying to solve the Artificial Ant problem and
2. practical constraints that prevent us, for instance, from taking all real numbers into
consideration in the minimization process of a real function. On our off-the-shelf CPUs
or with the Java programming language, we can only use 64 bit floating point numbers.
With these 64 bits, it is only possible to express numbers up to a certain precision, and
we cannot have more than roughly 15 to 17 significant decimal digits.
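The practical constraint in item 2 is easy to observe. The sketch below uses Python, whose `float` is a 64 bit IEEE 754 double on common platforms, the same format as Java's `double`.

```python
import sys

# A 64 bit double has a 53 bit significand ...
print(sys.float_info.mant_dig)  # 53
# ... so only 15 decimal digits are guaranteed to survive a round trip.
print(sys.float_info.dig)       # 15

# Consequences for a search in "R": some neighbouring reals collapse.
print(1.0 + sys.float_info.epsilon / 2 == 1.0)  # True
print(0.1 + 0.2 == 0.3)                         # False
```

An optimizer working on such numbers therefore searches a large but finite subset of R, and two mathematically different candidates may be indistinguishable.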

Figure 1.13: Spaces, Sets, and Elements involved in an optimization process. [Diagram with two columns, "The Involved Spaces" and "The Involved Sets/Elements". From bottom to top: the search space G with a population of genotypes (bit strings such as 0110 or 1111), Pop ⊆ G × X; the genotype-phenotype mapping; the problem space X with the population of phenotypes (points (0,0) to (3,3)) and the solution space S ⊆ X; the objective function(s) leading to the objective space Y ⊆ Rn with the objective values F(x ∈ X) ∈ Y; the fitness assignment process leading to the fitness space V ⊆ R+ with the fitness values v(x) ∈ V. Note in the figure: fitness and heuristic values (normally) have only a meaning in the context of a population or a set of solution candidates.]

Definition 1.19 (Solution Candidate). A solution candidate x is an element of the
problem space X of a certain optimization problem.
In the context of evolutionary algorithms, solution candidates are usually called pheno-
types. In this book, we will use both terms synonymously. Somewhere inside the problem
space, the solutions of the optimization problem will be located (if the problem can actually
be solved, that is).
Definition 1.20 (Solution Space). We call the union of all solutions of an optimization
problem its solution space S.
X⋆ ⊆ S ⊆ X
(1.22)
This solution space contains (and can be equal to) the global optimal set X⋆. There may
exist valid solutions x ∈ S which are not elements of X⋆, especially in the context of
constrained optimization (see Section 1.2.3).
The Search Space
Definition 1.21 (Search Space). The search space G of an optimization problem is the
set of all elements g which can be processed by the search operations.
As previously mentioned, the type of the solution candidates depends on the problem
to be solved. Since there are many different applications for optimization, there are many
different forms of problem spaces. It would be cumbersome to develop search operations
time and again for each new problem space we encounter. Such an approach would not only
be error-prone, it would also make it very hard to formulate general laws and to consolidate
findings. Instead, we often reuse well-known search spaces for many different problems. Then,
only a mapping between search and problem space has to be defined (see page 44). Although
this is not always possible, it allows us to use more out-of-the-box software in many cases.
Following the terminology of genetic algorithms, we often refer to the search space synonymously as the
genome25, a term coined by the German biologist Winkler [2241] as a portmanteau of the
words gene26 and chromosome [1267]. The genome is the whole hereditary information of
organisms. This includes both the genes and the non-coding sequences of the Deoxyribonucleic
acid (DNA27), which is illustrated in Figure 1.14. Simply put, the DNA is a string of
base pairs that encodes the phenotypical characteristics of the creature it belongs to.
Figure 1.14: A sketch of a part of a DNA molecule. [Labels: Thymine, Cytosine, Adenine, Guanine, Hydrogen Bond, Phosphate, Deoxyribose (sugar).]
25 http://en.wikipedia.org/wiki/Genome [accessed 2007-07-15]
26 The words gene, genotype, and phenotype have, in turn, been introduced by the Danish biologist
Johannsen [1056]. [2240]
27 http://en.wikipedia.org/wiki/Dna [accessed 2007-07-03]

Definition 1.22 (Genotype). The elements g ∈ G of the search space G of a given
optimization problem are called the genotypes.
The elements of the search space rarely are unstructured aggregations. Instead, they often
consist of distinguishable parts, hierarchical units, or well-typed data structures. The same
goes for the DNA in biology. It consists of genes, segments of nucleic acid, that contain
the information necessary to produce RNA strings in a controlled manner28. A fish, for
instance, may have a gene for the color of its scales. This gene, in turn, could have two
possible “values” called alleles29, determining whether the scales will be brown or gray.
The genetic algorithm community adopted this terminology long ago, and we can use it for
arbitrary search spaces.
Definition 1.23 (Gene).
The distinguishable units of information in a genotype that
encode the phenotypical properties are called genes.
Definition 1.24 (Allele). An allele is a value of a specific gene.
Definition 1.25 (Locus). The locus30 is the position where a specific gene can be found
in a genotype.
Figure 1.15 on page 45 refines the relations of genotypes and phenotypes from the
initial example for the spaces in Figure 1.13 by also marking genes, alleles, and loci. In the
car customizing problem also mentioned earlier, the first gene could identify the color of
the automobile. Its locus would then be 0 and it could have the alleles 00, 01, 10, and 11,
encoding for red, white, green, and blue, for instance. The second gene (at locus 1) with the
alleles 0 or 1 may define whether or not the car comes with climate control, and so on.
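The car example can be sketched in a few lines of Python. This is only an illustration: the three-bit layout (two bits for the color gene at locus 0, one bit for the climate-control gene at locus 1) and the names `COLOR_ALLELES`, `CLIMATE_ALLELES`, `split_genes`, and `decode` are assumptions made here, not part of the book.

```python
# Hypothetical layout for the car example: the gene at locus 0 is two bits wide
# and encodes the color; the gene at locus 1 is one bit and encodes climate control.
COLOR_ALLELES = {"00": "red", "01": "white", "10": "green", "11": "blue"}
CLIMATE_ALLELES = {"0": False, "1": True}


def split_genes(genotype):
    """Split a three-character bit-string genotype into its genes, indexed by locus."""
    return [genotype[0:2], genotype[2:3]]


def decode(genotype):
    """Translate the allele found at each locus into a car feature."""
    genes = split_genes(genotype)
    return {
        "color": COLOR_ALLELES[genes[0]],
        "climate_control": CLIMATE_ALLELES[genes[1]],
    }


car = decode("101")
print(car)  # {'color': 'green', 'climate_control': True}
```

Here the allele 10 at locus 0 selects green and the allele 1 at locus 1 switches climate control on.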
The Search Operations
In some problems, the search space G may be identical to the problem space X. If we go back
to our previous examples, for instance, we will find that there exist a lot of optimization
strategies that work directly on vectors of real numbers. When minimizing a real function, we
could use such an approach (Evolution Strategies, for instance, see Chapter 5 on page 227)
and set G = X = R. Also, the configurations of cars may be represented as bit strings:
Assume that such a configuration consists of k features, each of which can either be included
in or excluded from an offer to the customer. We can then search in the space of binary
strings of this length, G = B^k = {true, false}^k, which is exactly what genetic algorithms
(discussed in Section 3.1 on page 141) do. By using their optimization capabilities, we do not
need to design our own search and selection techniques but can rely on well-researched
standard operations.
Definition 1.26 (Search Operations). The search operations searchOp are used by op-
timization algorithms in order to explore the search space G.
We subsume all search operations which are applied by an optimization algorithm in
order to solve a given problem in the set Op. Search operations can be defined with different
arities31. Equation 1.23, for instance, denotes an n-ary operator, i. e., one with n arguments.
The result of a search operation is one element of the search space.
\[ \mathrm{searchOp} : G^n \rightarrow G \tag{1.23} \]
28 http://en.wikipedia.org/wiki/Gene [accessed 2007-07-03]
29 http://en.wikipedia.org/wiki/Allele [accessed 2007-07-03]
30 http://en.wikipedia.org/wiki/Locus_%28genetics%29 [accessed 2007-07-03]
31 http://en.wikipedia.org/wiki/Arity [accessed 2008-02-15]

Mutation and crossover in genetic algorithms (see Chapter 3) are examples of unary
and binary search operations, whereas Differential Evolution utilizes a ternary operator (see
Section 5.5). Optimization processes are often initialized by creating random genotypes, which
are usually the results of a search operation with zero arity (no parameters).
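To make the notion of arity concrete, the following sketch shows a nullary, a unary, and a binary search operation on bit-string genotypes. The genotype length `K`, the function names, and the specific operators (single bit flip, single-point crossover) are illustrative assumptions; they stand in for the operators discussed later in the book.

```python
import random

K = 8  # assumed genotype length for this sketch


def create(rng):
    """Nullary search operation: create a random genotype."""
    return tuple(rng.randint(0, 1) for _ in range(K))


def mutate(rng, g):
    """Unary search operation: flip one randomly chosen bit."""
    i = rng.randrange(len(g))
    return g[:i] + (1 - g[i],) + g[i + 1:]


def crossover(rng, g1, g2):
    """Binary search operation: single-point crossover of two parents."""
    cut = rng.randrange(1, len(g1))
    return g1[:cut] + g2[cut:]


rng = random.Random(42)
parent_a, parent_b = create(rng), create(rng)
child = mutate(rng, crossover(rng, parent_a, parent_b))
print(parent_a, parent_b, child)
```

Every operation returns exactly one element of the search space, in line with Equation 1.23.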
Search operations often involve random numbers. In such cases, it makes no sense to
reason about their results like ∃g1,g2 ∈ G : g2 = searchOp(g1) ∧ ... Instead, we need to work
with probabilities like ∃g1,g2 ∈ G : P(g2 = searchOp(g1)) > 0 ∧ ... Based on Definition 1.26,
we will use the notation Op(x) for the application of any of the operations searchOp ∈ Op
to the genotype x. With Op^k(x) we denote k successive applications of (possibly different)
search operators. If the parameter x is omitted, i. e., just Op^k is written, this chain has to
start with a search operation with zero arity. In the style of Badea and Stanciu [111] and
Skubch [1897, 1898], we now can define:
Definition 1.27 (Completeness). A set Op of search operations searchOp is complete
if and only if every point g1 in the search space G can be reached from every other point
g2 ∈ G by applying only operations searchOp ∈ Op.
\[ \forall g_1, g_2 \in G \Rightarrow \exists k \in \mathbb{N} : P\left(g_1 = \mathrm{Op}^k(g_2)\right) > 0 \tag{1.24} \]
Definition 1.28 (Weak Completeness). A set Op of search operations searchOp is weakly
complete if and only if every point g in the search space G can be reached by applying only
operations searchOp ∈ Op. A weakly complete set of search operations hence includes at
least one parameterless function.
\[ \forall g \in G \Rightarrow \exists k \in \mathbb{N} : P\left(g = \mathrm{Op}^k\right) > 0 \tag{1.25} \]
If the set of search operations is not complete, there are points in the search space which
cannot be reached. Then, we are probably not able to explore the problem space adequately
and possibly will not find a satisfyingly good solution.
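For a small finite search space, completeness can be checked mechanically. The sketch below assumes that each unary search operation is described by the set of genotypes it can produce with non-zero probability; the helper names `reachable` and `is_complete` and the two example operators are hypothetical and serve only to illustrate Definition 1.27.

```python
from itertools import product


def flip_any_bit(g):
    """All genotypes reachable with positive probability by flipping one bit."""
    return {g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))}


def flip_first_bit(g):
    """A deliberately restricted operator that can only change the first bit."""
    return {(1 - g[0],) + g[1:]}


def reachable(start, ops):
    """Genotypes reachable from start by chaining the given operations."""
    seen, frontier = {start}, [start]
    while frontier:
        g = frontier.pop()
        for op in ops:
            for nxt in op(g):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


def is_complete(space, ops):
    """True if every point of the space can be reached from every other point."""
    return all(reachable(g, ops) == space for g in space)


SPACE = set(product((0, 1), repeat=3))
print(is_complete(SPACE, [flip_any_bit]))    # True
print(is_complete(SPACE, [flip_first_bit]))  # False
```

With the restricted operator, the last two bits can never change, so most of the search space stays out of reach from any given starting point.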
Definition 1.29 (Adjacency (Search Space)). A point g2 is adjacent to a point g1 in
the search space G if it can be reached by applying a single search operation searchOp to
g1. Notice that the adjacency relation is not necessarily symmetric.
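\[ \mathrm{adjacent}(g_2, g_1) = \begin{cases} \text{true} & \text{if } \exists\, \mathrm{searchOp} \in \mathrm{Op} : P(\mathrm{searchOp}(g_1) = g_2) > 0 \\ \text{false} & \text{otherwise} \end{cases} \tag{1.26} \]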
The Connection between Search and Problem Space
If the search space differs from the problem space, a translation between them is furthermore
required. In our car example, we would need to transform the binary strings processed by
the genetic algorithm to objects which represent the corresponding car configurations and
can be processed by the objective functions.
Definition 1.30 (Genotype-Phenotype Mapping). The genotype-phenotype mapping
(GPM, or ontogenic mapping [1619]) gpm : G → X is a left-total32 binary relation which
maps the elements of the search space G to elements in the problem space X.
∀g ∈ G ∃x ∈ X : gpm(g) = x
(1.27)
The only hard criterion we impose on genotype-phenotype mappings in this book is
left-totality, i. e., that they map each element of the search space to at least one solution
candidate. They may be functional relations if they are deterministic. It is, however, also
possible to create mappings which involve random numbers and, hence, cannot be considered
to be functions in the mathematical sense of Section 27.7.1 on page 462. Then, Equation 1.27
would need to be rewritten to Equation 1.28.
∀g ∈ G ∃x ∈ X : P(gpm(g) = x) > 0
(1.28)
32 See Equation 27.51 on page 461 to 5 on page 462 for an outline of the properties of binary
relations.
Figure 1.15: The relation of genome, genes, and the problem space. (The sketch shows the
genome G (search space) with genotypes g ∈ G drawn as bit strings from 0000 to 1111, a
searchOp arrow between genotypes, and the mapping x = gpm(g) into the phenome X (problem
space), whose phenotypes x ∈ X (solution candidates) lie on a grid with both axes running
from 0 to 3. One genotype is annotated: its 1st gene carries the allele "01" at locus 0 and its
2nd gene carries the allele "11" at locus 1.)
Genotype-phenotype mappings should further be surjective [1694], i. e., relate at least
one genotype to each element of the problem space. Otherwise, some solution candidates
can never be found and evaluated by the optimization algorithm and there is no guarantee
whether the solution of a given problem can be discovered or not. If a genotype-phenotype
mapping is injective, which means that it assigns distinct phenotypes to distinct elements
of the search space, we say that it is free from redundancy. There are different forms of
redundancy, some are considered to be harmful for the optimization process, others have
positive influence33. Most often, GPMs are not bijective (since they are neither necessarily
injective nor surjective). Nevertheless, if a genotype-phenotype mapping is bijective, we can
construct an inverse mapping gpm^{-1} : X → G.
\[ \mathrm{gpm}^{-1}(x) = g \Leftrightarrow \mathrm{gpm}(g) = x \quad \forall x \in X, g \in G \tag{1.29} \]
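On finite spaces, surjectivity, injectivity, and the inverse mapping can be checked directly. The following sketch uses made-up spaces and mappings (`gpm_redundant` ignores one bit, `gpm_bijective` uses all bits); all names are assumptions chosen for illustration only.

```python
from itertools import product

G3 = list(product((0, 1), repeat=3))  # search space: all 3-bit strings
G2 = list(product((0, 1), repeat=2))  # search space: all 2-bit strings
X = [0, 1, 2, 3]                      # assumed problem space


def gpm_redundant(g):
    """Hypothetical mapping that decodes two bits and ignores the third."""
    return 2 * g[0] + g[1]


def gpm_bijective(g):
    """Hypothetical mapping that decodes a 2-bit string as an integer."""
    return 2 * g[0] + g[1]


def is_surjective(gpm, G, X):
    """Every element of X is the image of at least one genotype."""
    return {gpm(g) for g in G} == set(X)


def is_injective(gpm, G):
    """Distinct genotypes are mapped to distinct phenotypes (no redundancy)."""
    images = [gpm(g) for g in G]
    return len(images) == len(set(images))


def inverse(gpm, G, X):
    """Build the inverse mapping as a dictionary; only valid for bijective mappings."""
    if not (is_surjective(gpm, G, X) and is_injective(gpm, G)):
        raise ValueError("gpm is not bijective, so no inverse mapping exists")
    return {gpm(g): g for g in G}


print(is_surjective(gpm_redundant, G3, X), is_injective(gpm_redundant, G3))  # True False
print(inverse(gpm_bijective, G2, X)[2])                                      # (1, 0)
```

The first mapping is surjective but redundant, since two genotypes share each phenotype; only the second one admits an inverse.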
Based on the genotype-phenotype mapping, we can also define an adjacency relation for
the problem space, which, of course, is also not necessarily symmetric.
Definition 1.31 (Adjacency (Problem Space)). A point x2 is adjacent to a point x1
in the problem space X if it can be reached by applying a single search operation searchOp
to their corresponding elements in the search space.
\[ \mathrm{adjacent}(x_2, x_1) = \begin{cases} \text{true} & \text{if } \exists g_1, g_2 : x_1 = \mathrm{gpm}(g_1) \wedge x_2 = \mathrm{gpm}(g_2) \wedge \mathrm{adjacent}(g_2, g_1) \\ \text{false} & \text{otherwise} \end{cases} \tag{1.30} \]
By the way, we now have the means to define the term local optimum more clearly. The
original Definition 1.8 only applies to single objective functions, but with the use of the
adjacency relation adjacent, the prevalence criterion ≻, and the connection between the search
space and the problem space gpm, we can clarify it for multiple objectives.
Definition 1.32 (Local Optimum). A (local) optimum x⋆_l ∈ X of a set of objective
functions F is not worse than all points adjacent to it, i. e., no adjacent point prevails over it.
\[ \forall x^\star_l \in X \Rightarrow \forall x \in X : \mathrm{adjacent}(x, x^\star_l) \Rightarrow \neg\left(x \succ x^\star_l\right) \tag{1.31} \]
33 See Section 1.4.5 on page 67 for more information.
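The definition of a local optimum can be tried out on a toy problem. The sketch below assumes a one-dimensional problem space of ten points with made-up objective values, an identity genotype-phenotype mapping, a search operation that moves one step left or right, and single-objective minimization as the prevalence criterion. All names and numbers are illustrative.

```python
F_VALUES = [5, 3, 4, 6, 7, 2, 1, 2, 8, 9]  # assumed objective values f(x) for x = 0..9


def prevails(x1, x2):
    """x1 prevails over x2 if it is strictly better (single-objective minimization)."""
    return F_VALUES[x1] < F_VALUES[x2]


def adjacent_points(x):
    """Points that one search step (one move left or right) can reach from x."""
    return [n for n in (x - 1, x + 1) if 0 <= n < len(F_VALUES)]


def is_local_optimum(x):
    """A point is a local optimum if no adjacent point prevails over it."""
    return not any(prevails(n, x) for n in adjacent_points(x))


local_optima = [x for x in range(len(F_VALUES)) if is_local_optimum(x)]
print(local_optima)  # [1, 6]
```

Point 6 is the global optimum of this toy function, while point 1 is a local optimum only.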

The Objective Space and Optimization Problems
After the appropriate problem space has been defined, the search space has been selected, and
a translation between them (if needed) has been created, we are almost ready to feed the
problem to a global optimization algorithm. The main purpose of such an algorithm obviously
is to find as many elements as possible from the solution space: we are interested in the
solution candidates with the best possible evaluation results. This evaluation is performed by
the set F of n objective functions f ∈ F, each contributing one numerical value describing the
characteristics of a solution candidate x.34
Definition 1.33 (Objective Space). The objective space Y is the space spanned by the
codomains of the objective functions.
\[ F = \{ f_i : X \rightarrow Y_i : 0 < i \le n, Y_i \subseteq \mathbb{R} \} \Rightarrow Y = Y_1 \times Y_2 \times \dots \times Y_n \tag{1.32} \]
The set F maps the elements x of the problem space X to the objective space Y and,
by doing so, gives the optimizer information about their qualities as solutions for a given
problem.
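As a minimal sketch of how F maps a solution candidate into the objective space, consider two made-up objective functions over the real numbers. The functions `f1` and `f2` and the helper `evaluate` are assumptions for illustration only.

```python
def f1(x):
    """First assumed objective: squared distance from 0."""
    return x * x


def f2(x):
    """Second assumed objective: squared distance from 2."""
    return (x - 2.0) ** 2


F = (f1, f2)


def evaluate(x, objectives=F):
    """Map a solution candidate x to its point in the objective space Y = Y1 x Y2."""
    return tuple(f(x) for f in objectives)


print(evaluate(1.0))  # (1.0, 1.0)
print(evaluate(0.0))  # (0.0, 4.0)
```

Each candidate yields one value per objective function, so the result is a vector with n = |F| components.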
Definition 1.34 (Optimization Problem). An optimization problem is defined by a five-
tuple (X, F, G, Op, gpm) specifying the problem space X, the objective functions F , the
search space G, the set of search operations Op, and the genotype-phenotype mapping gpm.
In theory, such an optimization problem can always be solved if Op is complete and the
gpm is surjective.
Generic search and optimization algorithms find optimal elements if provided with an
optimization problem defined in this way. Evolutionary algorithms, which we will discuss
later in this book, are generic in this sense. Other optimization methods, like genetic
algorithms, for example, may be more specialized and work with predefined search spaces and
search operations.
Fitness as a Relative Measure of Utility
When performing a multi-objective optimization, i. e., n = |F| > 1, the elements of Y are
vectors in R^n. In Section 1.2.2 on page 27, we have seen that such vectors cannot always
be compared directly in a consistent way and that we need some (comparative) measure for
what is “good”. In many optimization techniques, especially in evolutionary algorithms, this
measure is used to map the objective space to a subset V of the positive real numbers R+.
For each solution candidate, this single real number represents its fitness as a solution for the
given optimization problem. The process of computing such a fitness value often depends not
only on the absolute objective values of the solution candidate itself but also on those of
the other phenotypes known. It could, for instance, be the position of a solution candidate in
the list of investigated elements sorted according to the Pareto relation. Hence, fitness values
often only have a meaning inside the optimization process [712] and may change over time,
even if the objective values stay constant. In deterministic optimization methods, the value
of a heuristic function which approximates how many modifications we will have to apply
to the element in order to reach a feasible solution can be considered as the fitness.
Definition 1.35 (Fitness). The fitness35 value v(x) ∈ V of an element x of the problem
space X corresponds to its utility as solution or its priority in the subsequent steps of the
optimization process. The space spanned by all possible fitness values V is normally a subset
of the positive real numbers V ⊆ R+.
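The relative nature of fitness can be illustrated with a simple Pareto-based scheme. The sketch below is not the fitness assignment process of any particular algorithm in this book; it merely assumes that all objectives are minimized and uses the number of dominating vectors as fitness (smaller is better). The names `dominates` and `assign_fitness` are hypothetical.

```python
def dominates(y1, y2):
    """Pareto dominance for minimization: no worse in all objectives, better in one."""
    return all(a <= b for a, b in zip(y1, y2)) and any(a < b for a, b in zip(y1, y2))


def assign_fitness(objective_vectors):
    """Fitness of each vector is the number of other vectors that dominate it."""
    return [
        sum(dominates(other, y) for other in objective_vectors)
        for y in objective_vectors
    ]


population = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0)]
print(assign_fitness(population))                  # [0, 0, 1]
print(assign_fitness(population + [(2.5, 2.5)]))   # [0, 0, 2, 1]
```

The fitness of (3.0, 3.0) changes from 1 to 2 when a new element joins the set, although its objective values stay the same.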
34 See also Equation 1.3 on page 27.
35 http://en.wikipedia.org/wiki/Fitness_(genetic_algorithm) [accessed 2008-08-10]

The term fitness has been borrowed from biology36 [1915, 1624] by the evolutionary
algorithms community. When the first applications of genetic algorithms were developed,
the focus was mainly on single-objective optimization. Back then, this single function was
called the fitness function and thus, objective value ≡ fitness value was set. This point of
view is obsolete in principle, yet you will find many contemporary publications that use this
notion. This is partly due to the fact that in simple problems with only one objective
function, the old approach of using the objective values directly as fitness, i. e.,
v(x) = f(x) ∀x ∈ X, can sometimes actually be applied. In multi-objective optimization
processes, this is not possible and fitness assignment processes like those which we are going
to elaborate on in Section 2.3 on page 111 are applied instead.
In the context of this book, fitness is subject to minimization, i. e., elements with smaller
fitness are “better” than those with higher fitness. Although this definition differs from the
biological perception of fitness, it complies with the idea that optimization algorithms are
to find the minima of mathematical functions (if nothing else has been stated).
Further Definitions
In order to ease the discussions of different global optimization algorithms, we furthermore
define the data structure individual. Especially evolutionary algorithms, but also many other
techniques, work on sets of such individuals. Their fitness assignment processes determine
fitness values for the individuals relative to all elements of these populations.
Definition 1.36 (Individual). An individual p is a tuple (p.g, p.x) of an element p.g in
the search space G and the corresponding element p.x = gpm(p.g) in the problem space X.
Besides this basic individual structure, many practical realizations of optimization
algorithms use such a data structure to store additional information like the objective and
fitness values. Then, we will consider individuals as tuples in G × X × Z, where Z is the
space of the additional information stored, for instance Z = Y × V. In the algorithm
definitions later in this book, we will often access the phenotypes p.x without explicitly using
the genotype-phenotype mapping, since the relation of p.x and p.g complies with
Definition 1.36.
Definition 1.37 (Population). A population Pop is a list of individuals used during an
optimization process.
Pop ⊆ G × X : ∀p = (p.g,p.x) ∈ Pop ⇒ p.x = gpm(p.g)
(1.33)
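A possible realization of these two definitions as data structures is sketched below. The class `Individual`, the helper `make_individual`, and the example mapping `gpm` (bit string read as an unsigned integer) are assumptions made for this illustration and are unrelated to the book's own Java implementation.

```python
from dataclasses import dataclass
from typing import Tuple

Genotype = Tuple[int, ...]


def gpm(g: Genotype) -> int:
    """Hypothetical genotype-phenotype mapping: read the bits as an unsigned integer."""
    return int("".join(map(str, g)), 2)


@dataclass(frozen=True)
class Individual:
    g: Genotype  # element of the search space G
    x: int       # corresponding element of the problem space X


def make_individual(g: Genotype) -> Individual:
    """Create an individual whose phenotype is consistent with its genotype."""
    return Individual(g=g, x=gpm(g))


pop = [make_individual(g) for g in [(0, 1, 1), (1, 0, 0), (1, 1, 1)]]
print([p.x for p in pop])  # [3, 4, 7]
```

Constructing individuals only through `make_individual` keeps the invariant p.x = gpm(p.g) of Equation 1.33 for every member of the population.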
As already mentioned, the fitness v(x) of an element x in the problem space X often does
not depend solely on the element itself. Normally, it is rather a relative measure putting the
features of x into the context of a set of solution candidates X. We denote this by writing
v(x, X). It is also possible that the fitness involves the whole individual data, including the
genotypic and phenotypic structures. We can denote this by writing v(p, Pop).
1.3.2 Fitness Landscapes and Global Optimization
A very powerful metaphor in global optimization is the fitness landscape37. Like many other
abstractions in optimization, fitness landscapes have been developed and extensively
researched by evolutionary biologists [2261, 1099, 775, 502]. Basically, they are
visualizations of the relationship between the genotypes or phenotypes in a given population
and their corresponding reproduction probability. The idea of such visualizations goes back to
Wright [2261], who used level contour diagrams in order to outline the effects of selection,
mutation, and crossover on the capabilities of populations to escape locally optimal
configurations. Similar abstractions arise in many other areas [1954], for instance in the
physics of disordered systems like spin-glasses [208, 1402].
36 http://en.wikipedia.org/wiki/Fitness_%28biology%29 [accessed 2008-02-22]
37 http://en.wikipedia.org/wiki/Fitness_landscape [accessed 2007-07-03]
In Chapter 2, we will discuss evolutionary algorithms, which are optimization methods
inspired by natural evolution. The evolutionary algorithm research community has widely
adopted the fitness landscapes as relation between individuals and their objective values
[1431, 623]. Langdon and Poli [1242]38 explain that fitness landscapes can be imagined as
a view on a countryside from far above. The height of each point is then analogous to its
objective value. An optimizer can then be considered as a short-sighted hiker who tries to
find the lowest valley or the highest hilltop. Starting from a random point on the map, she
wants to reach this goal by walking the minimum distance.
As already mentioned, evolutionary algorithms were first developed as single-objective
optimization methods. Then, the objective values were directly used as fitness and the
“reproduction probability”, i. e., the chance of a solution candidate for being subject of
further investigation, was proportional to them. In multi-objective optimization applications
with more sophisticated fitness assignment and selection processes, this simple approach does
not reflect the biological metaphor correctly anymore.
In the context of this book, we therefore deviate from this view. Since
it would possibly be confusing for the reader if we used a different definition for fitness
landscapes than the rest of the world, we introduce the new term problem landscape and
keep using the term fitness landscape in the traditional manner. In Figure 1.19 on page 57,
you can find some examples of fitness landscapes.
Definition 1.38 (Problem Landscape).
The problem landscape Φ : X ×N → [0,1] ⊂ R+ maps all the points x in a problem space
X to the cumulative probability of reaching them until (inclusively) the τth evaluation of a
solution candidate. The problem landscape thus depends on the optimization problem and
on the algorithm applied in order to solve the problem.
\[ \Phi(x, \tau) = P\left(x \text{ has been visited until the } \tau\text{th individual evaluation}\right) \quad \forall x \in X, \tau \in \mathbb{N} \tag{1.34} \]
This definition of problem landscape is very similar to the performance measure defini-
tion used by Wolpert and Macready [2244, 2245] in their No Free Lunch Theorem which
will be discussed in Section 1.4.10 on page 76. In our understanding, problem landscapes
are not only closer to the original meaning of fitness landscapes in biology, they also have
another advantage. According to this definition, all entities involved in an optimization pro-
cess directly influence the problem landscape. The choice of the search operations in the
search space G, the way the initial elements are picked, the genotype-phenotype mapping,
the objective functions, the fitness assignment process, and the way individuals are selected
for further exploration all have an impact on Φ. We can furthermore make the following
assumptions about Φ(x, τ), since it is basically some form of cumulative distribution function
(see Definition 28.18 on page 470).
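\[ \Phi(x, \tau_1) \le \Phi(x, \tau_2) \quad \forall \tau_1 < \tau_2 \wedge x \in X, \tau_1, \tau_2 \in \mathbb{N} \tag{1.35} \]
\[ 0 \le \Phi(x, \tau) \le 1 \quad \forall x \in X, \tau \in \mathbb{N} \tag{1.36} \]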
Referring back to Definition 1.34, we can now also define what optimization algorithms
are.
Definition 1.39 (Optimization Algorithm). An optimization algorithm is a transformation
(X, F, G, Op, gpm) → Φ of an optimization problem (X, F, G, Op, gpm) to a problem
landscape Φ that will find at least one local optimum x⋆_l for each optimization problem
(X, F, G, Op, gpm) with a weakly complete set of search operations Op and a surjective
genotype-phenotype mapping gpm if granted infinite processing time and if such an optimum
exists (see Equation 1.37).
\[ \exists x^\star_l \in X : \lim_{\tau \to \infty} \Phi(x^\star_l, \tau) = 1 \tag{1.37} \]
38 This part of [1242] is also available online at
http://www.cs.ucl.ac.uk/staff/W.Langdon/FOGP/intro_pic/landscape.html [accessed 2008-02-15].
An optimization algorithm is characterized by
1. the way it assigns fitness to the individuals,
2. the way it selects them for further investigation,
3. the way it applies the search operations, and
4. the way it builds and treats its state information.
The first condition in Definition 1.39, the (weak) completeness of Op, is mandatory because
the search space G cannot be explored fully otherwise. If the genotype-phenotype mapping gpm
is not surjective, there exist points in the problem space X which can never be evaluated.
Only if both conditions hold is it guaranteed that an optimization algorithm can find at
least one local optimum.
The best optimization algorithm for a given problem (X, F, G, Op, gpm) is the one with
the highest values of Φ(x⋆, τ ) for the optimal elements x⋆ in the problem space and for the
lowest values of τ . It may be interesting that this train of thought indicates that finding
the best optimization algorithm for a given optimization problem is, itself, a multi-objective
optimization problem.
Definition 1.40 (Global Optimization Algorithm). Global optimization algorithms
are optimization algorithms that employ measures that prevent convergence to local optima
and increase the probability of finding a global optimum.
For a perfect global optimization algorithm (given an optimization problem with weakly
complete search operations and a surjective genotype-phenotype mapping), Equation 1.38
would hold. In reality, it can be considered questionable whether such an algorithm can
actually be built.
\[ \forall x_1, x_2 \in X : x_1 \succ x_2 \Rightarrow \lim_{\tau \to \infty} \Phi(x_1, \tau) > \lim_{\tau \to \infty} \Phi(x_2, \tau) \tag{1.38} \]
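Figure 1.16: An example optimization problem. (Plot of the objective function f(x) over the
problem space X; the marker labels are only partly preserved in this copy.)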
Let us now give a simple example for problem landscapes and how they are influenced by
the optimization algorithm applied to them. Figure 1.16 illustrates one objective function,
defined over a finite subset X of the two-dimensional real plane, which we are going to
optimize. We use the problem space X also as search space G, so we do not need
a genotype-phenotype mapping. For optimization, we will use a very simple hill climbing
algorithm39, which initially randomly creates one solution candidate uniformly distributed
in X. In each iteration, it creates a new solution candidate from the known one using a
unary search operation. The old and the new candidate are compared, and the better one
is kept. Hence, we do not need to differentiate between fitness and objective values. In the
example, better means having lower fitness. In Figure 1.16, we can spot one local optimum
x⋆_l and one global optimum x⋆. Between them, there is a hill, an area of very bad fitness.
The rest of the problem space exhibits a small gradient in the direction of its center. The
optimization algorithm will likely follow this gradient and sooner or later discover x⋆_l or x⋆.
The chances of x⋆_l are higher, since it is closer to the center of X.
With this setting, we have recorded the traces of two experiments with 1.3 million runs
of the optimizer (8000 iterations each). From these records, we can approximate the problem
landscapes very well.
In the first experiment, depicted in Figure 1.17, we used a search operation searchOp1 :
X → X which creates a new solution candidate normally distributed around the old one.
In all experiments, we divided X into a regular lattice. With searchOp2 : X → X, used in the
second experiment, the new solution candidates are direct neighbors of the old ones in this
lattice. The problem landscape Φ produced by this operator is shown in Figure 1.18. Both
operators are complete, since each point in the search space can be reached from each other
point by applying them.
\[ \mathrm{searchOp}_1(x) \equiv \left(x_1 + \mathrm{random}_n(),\; x_2 + \mathrm{random}_n()\right) \tag{1.39} \]
\[ \mathrm{searchOp}_2(x) \equiv \left(x_1 + \mathrm{random}_u(-1, 1),\; x_2 + \mathrm{random}_u(-1, 1)\right) \tag{1.40} \]
In both experiments, the probabilities of the elements of the search space of being
discovered are very low, near to zero, in the first few iterations. To be precise, since our
problem space is a 36 × 36 lattice, this probability is 1/36^2 in the first iteration. Starting
with the tenth or so iteration, small peaks begin to form around the places where the optima
are located. These peaks then grow.
As already mentioned, this idea of problem landscapes and optimization reflects
solely the author’s views. Notice also that it is not always possible to define problem
landscapes for problem spaces which are uncountably infinitely large.
Since the local optimum x⋆_l lies at the center of the large basin and the gradient points
more directly toward it, it has a higher probability of being found than the global optimum
x⋆. The difference between the two search operators tested becomes obvious starting with
approximately the 2000th iteration. In the hill climber with the operator utilizing the normal
distribution, the Φ value of the global optimum begins to rise further and further, finally
surpassing that of the local optimum. Even if the optimizer gets trapped in the local optimum,
it will still eventually discover the global optimum, and if we had run this experiment longer,
the corresponding probability would have converged to 1. The reason for this is that with the
normal distribution, all points in the search space have a non-zero probability of being found
from all other points in the search space. In other words, all elements of the search space are
adjacent.
The operator based on the uniform distribution is only able to create points in the direct
neighborhood of the known points. Hence, if an optimizer gets trapped in the local optimum,
it can never escape. If it arrives at the global optimum, it will never discover the local one.
In Fig. 1.18.l, we can see that Φ(x⋆_l, 8000) ≈ 0.7 and Φ(x⋆, 8000) ≈ 0.3. One of the two
points will be the result of the optimization process.
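The effect described above can be reproduced qualitatively with a much smaller experiment. The following sketch is not the author's setup: it assumes a one-dimensional lattice of 21 points with a local optimum at 5 and the global optimum at 15, only 200 runs, a standard deviation of 3 for the normally distributed step, and counts a point as visited as soon as it has been evaluated. The names `step_neighbor`, `step_normal`, `first_visits`, and `estimate_phi` are hypothetical.

```python
import random

# Assumed toy objective: local optimum at x = 5, global optimum at x = 15, hill at x = 9.
F = [min(abs(x - 5) + 2, abs(x - 15)) for x in range(21)]


def step_neighbor(rng, x):
    """Move to a direct lattice neighbor (clipped at the borders)."""
    return min(max(x + rng.choice((-1, 1)), 0), len(F) - 1)


def step_normal(rng, x):
    """Move by a normally distributed, rounded step (clipped at the borders)."""
    return min(max(x + round(rng.gauss(0.0, 3.0)), 0), len(F) - 1)


def first_visits(rng, op, max_evals):
    """One hill-climbing run; maps each evaluated x to the evaluation index of its first visit."""
    current = rng.randrange(len(F))
    seen = {current: 1}
    for tau in range(2, max_evals + 1):
        candidate = op(rng, current)
        seen.setdefault(candidate, tau)
        if F[candidate] < F[current]:  # keep the better of the two candidates
            current = candidate
    return seen


def estimate_phi(op, x, taus, runs=200, seed=1):
    """Estimate the problem landscape value for x at each tau as a fraction of runs."""
    rng = random.Random(seed)
    hits = [first_visits(rng, op, max(taus)).get(x) for _ in range(runs)]
    return [sum(h is not None and h <= tau for h in hits) / runs for tau in taus]


taus = [1, 10, 100, 1500]
print("neighbor:", estimate_phi(step_neighbor, 15, taus))
print("normal:  ", estimate_phi(step_normal, 15, taus))
```

With the neighbor operator, runs that start on the wrong side of the hill never evaluate the global optimum, so its estimated probability levels off; with the normally distributed step it keeps growing as more evaluations are granted.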
From the example we can draw four conclusions:
1. Optimization algorithms discover good elements with higher probability than elements
with bad characteristics. Well, this is what they should do.
39 Hill climbing algorithms are discussed thoroughly in Chapter 10.

Figure 1.17: The problem landscape of the example problem derived with searchOp1. (Panels
Fig. 1.17.a to Fig. 1.17.l plot Φ(x, τ) over X on a vertical axis from 0 to 0.8 for
τ = 1, 2, 5, 10, 50, 100, 500, 1000, 2000, 4000, 6000, and 8000.)

Figure 1.18: The problem landscape of the example problem derived with searchOp2. (Panels
Fig. 1.18.a to Fig. 1.18.l plot Φ(x, τ) over X on a vertical axis from 0 to 0.8 for
τ = 1, 2, 5, 10, 50, 100, 500, 1000, 2000, 4000, 6000, and 8000.)

2. The success of optimization depends very much on the way the search is conducted.
3. It also depends on the time (or the number of iterations) the optimizer is allowed to use.
4. Hill climbing algorithms are no global optimization algorithms since they have no means
of preventing getting stuck at local optima.
1.3.3 Gradient Descent
Definition 1.41 (Gradient). A gradient40 of a scalar field f : Rn → R is a vector field
which points into the direction of the greatest increase of the scalar field. It is denoted by
∇f or grad(f).
Optimization algorithms depend on some form of gradient in objective or fitness space in
order to find good individuals. In most cases, the problem space X is not a vector space over
the real numbers R, so we cannot directly differentiate the objective functions with the nabla
operator41 ∇F. Generally, samples of the search space are used to approximate the gradient.
If we compare two elements x1 and x2 of the problem space and find x1 ≻ x2, we can assume
that there is some sort of gradient facing downwards from x2 to x1. When descending this
gradient, we can hope to find an x3 with x3 ≻ x1 and finally the global minimum.
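For the special case in which the problem space is a real vector space, approximating the gradient from samples can be sketched with central differences. This is a minimal illustration under that assumption; the step width `h`, the learning rate, the number of steps, and the test function `sphere` are all chosen arbitrarily here.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x from two samples per dimension."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def descend(f, x, rate=0.1, steps=100):
    """Repeatedly move against the approximated gradient."""
    x = list(x)
    for _ in range(steps):
        g = numerical_gradient(f, x)
        x = [xi - rate * gi for xi, gi in zip(x, g)]
    return x


def sphere(x):
    """Assumed test function with its minimum at the origin."""
    return sum(xi * xi for xi in x)


print(numerical_gradient(sphere, [3.0, -4.0]))
print(descend(sphere, [3.0, -4.0]))
```

The optimizers discussed in this book usually cannot rely on such a numeric gradient and instead infer a downward direction from pairwise comparisons of sampled candidates.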
1.3.4 Other General Features
There are some further common semantics and operations that are shared by most op-
timization algorithms. Many of them, for instance, start out by randomly creating some
initial individuals which are then refined iteratively. Optimization processes which are not
allowed to run infinitely have to find out when to terminate. In this section we define and
discuss general abstractions for such commonalities.
Iterations
Global optimization algorithms often iteratively evaluate solution candidates in order to
approach the optima. We distinguish between evaluations τ and iterations t.
Definition 1.42 (Evaluation). The value τ ∈ N0 denotes the number of solution candi-
dates for which the set of objective functions F has been evaluated.
Definition 1.43 (Iteration). An iteration42 refers to one round in a loop of an algorithm.
It is one repetition of a specific sequence of instructions inside an algorithm.
Algorithms are referred to as iterative if most of their work is done by cyclic repetition
of one main loop. In the context of this book, an iterative optimization algorithm starts
with the first step t = 0. The value t ∈ N0 is the index of the iteration currently performed
by the algorithm and t + 1 refers to the following step. One example of an iterative algorithm
is Algorithm 1.1. In some optimization algorithms, like genetic algorithms, for instance,
iterations are referred to as generations.
There often exists a well-defined relation between the number of performed solution
candidate evaluations τ and the index of the current iteration t in an optimization process:
Many global optimization algorithms generate and evaluate a certain number of individuals
per generation.
40 http://en.wikipedia.org/wiki/Gradient [accessed 2007-11-06]
41 http://en.wikipedia.org/wiki/Del [accessed 2008-02-15]
42 http://en.wikipedia.org/wiki/Iteration [accessed 2007-07-03]

Termination Criterion
The termination criterion terminationCriterion() is a function with access to all the infor-
mation accumulated by an optimization process, including the number of performed steps
t, the objective values of the best individuals, and the time elapsed since the start of the
process. With terminationCriterion(), the optimizers determine when they have to halt.
Definition 1.44 (Termination Criterion). When the termination criterion function
terminationCriterion() ∈ {true,false} evaluates to true, the optimization process will
stop and return its results.
Some possible criteria that can be used to decide whether an optimizer should terminate
or not are [1975, 1634, 2325, 2326]:
1. The user may grant the optimization algorithm a maximum computation time. If this
time has been exceeded, the optimizer should stop. Here we should note that the time
needed for single individuals may vary, and so will the times needed for iterations. Hence,
this time threshold cannot always be met exactly.
2. Instead of specifying a time limit, a total number of iterations t̂ or individual evaluations
τ̂ may be specified. Such criteria are most interesting for the researcher, since she often
wants to know whether a qualitatively interesting solution can be found for a given
problem using at most a predefined number of samples from the problem space.
3. An optimization process may be stopped when no improvement in the solution quality
could be detected for a specified number of iterations. Then, the process most probably
has converged to a (hopefully good) solution and will most likely not be able to make
further progress.
4. If we optimize something like a decision maker or classifier based on a sample data set,
we will normally divide this data into a training and a test set. The training set is used
to guide the optimization process whereas the test set is used to verify its results. We can
compare the performance of our solution when fed with the training set to its properties
if fed with the test set. This comparison helps us detect when most probably no further
generalization can be achieved by the optimizer and we should terminate the process.
5. Obviously, we can terminate an optimization process if it has already yielded a suffi-
ciently good solution.
In practical applications, we can apply any combination of the criteria above in order to
determine when to halt. How the termination criterion is tested in an iterative algorithm is
illustrated in Algorithm 1.1.
Algorithm 1.1: Example Iterative Algorithm
Input: [implicit] terminationCriterion(): the termination criterion
Data: t: the iteration counter
begin
  t ← 0
  // initialize the data of the algorithm
  while ¬terminationCriterion() do
    // perform one iteration - here happens the magic
    t ← t + 1
end
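The combination of criteria described above can be sketched in Python. This is only an illustration under assumptions: the class `TerminationCriterion`, the function `iterate`, and all parameter names are hypothetical and do not come from the book. The sketch covers the time budget, the iteration budget, stagnation, and a "good enough" target value for a minimization problem.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TerminationCriterion:
    """Combines several stopping rules (all names are illustrative)."""
    max_seconds: float = float("inf")    # criterion 1: time budget
    max_iterations: int = 10**9          # criterion 2: iteration budget
    max_stagnation: int = 10**9          # criterion 3: steps without improvement
    target: float = float("-inf")        # criterion 5: sufficiently good value
    _start: float = field(default_factory=time.monotonic)
    _best: float = float("inf")
    _last_improvement: int = 0

    def __call__(self, t: int, best_value: float) -> bool:
        """Return True if the optimizer should halt (minimization)."""
        if best_value < self._best:
            self._best = best_value
            self._last_improvement = t
        return (
            time.monotonic() - self._start >= self.max_seconds
            or t >= self.max_iterations
            or t - self._last_improvement >= self.max_stagnation
            or self._best <= self.target
        )


def iterate(step, x0, f, should_stop):
    """Skeleton of Algorithm 1.1: loop until the criterion evaluates to true."""
    t, best = 0, x0
    while not should_stop(t, f(best)):
        candidate = step(best)           # perform one iteration
        if f(candidate) < f(best):
            best = candidate
        t += 1
    return best, t
```

Since the criteria are joined with a logical or, the process halts as soon as any single one of them is met.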

Minimization
Many optimization algorithms have been developed for single-objective optimization in their
original form. Such algorithms may be used for both minimization and maximization. Without
loss of generality, we will present them as minimization processes since this is the most
commonly used notation. An algorithm that maximizes the function f may be transformed
into a minimization algorithm by using −f instead.
Note that using the prevalence comparisons as introduced in Section 1.2.4 on page 38,
multi-objective optimization processes can be transformed into single-objective minimization
processes. Therefore x1 ≻ x2 ⇔ cmpF(x1,x2) < 0.
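The transformation from maximization to minimization can be shown in a few lines of Python. The helper names `as_minimization` and `minimize_over` as well as the example function `profit` are assumptions made for this sketch only.

```python
def as_minimization(f):
    """Wrap a function that should be maximized so a minimizer can use it."""
    return lambda x: -f(x)


def minimize_over(candidates, f):
    """A trivial 'optimizer': return the candidate with the smallest f."""
    return min(candidates, key=f)


def profit(x):
    """Example function to be maximized; its maximum lies at x = 3."""
    return -(x - 3) ** 2 + 5


best = minimize_over(range(-10, 11), as_minimization(profit))
```

The minimizer never needs to know that the original task was a maximization problem.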
Modeling and Simulating
While there are a lot of problems where the objective functions are mathematical expressions
that can directly be computed, there exist problem classes far away from such simple function
optimization that require complex models and simulations.
Definition 1.45 (Model). A model43 is an abstraction or approximation of a system that
allows us to reason and to deduce properties of the system.
Models are often simplifications or idealizations of real-world issues. They are defined by
leaving out facts that probably have only minor impact on the conclusions drawn from
them. In the area of global optimization, we often need two types of abstractions:
1. The models of the potential solutions shape the problem space X. Examples are
a) programs in Genetic Programming, for example for the Artificial Ant problem,
b) construction plans of a skyscraper,
c) distributed algorithms represented as programs for Genetic Programming,
d) construction plans of a turbine,
e) circuit diagrams for logical circuits, and so on.
2. Models of the environment in which we can test and explore the properties of the po-
tential solutions, like
a) a map on which the Artificial Ant will move which is driven by the evolved program,
b) an abstraction from the environment in which the skyscraper will be built, with wind
blowing from several directions,
c) a model of the network in which the evolved distributed algorithms can run,
d) a physical model of air which blows through the turbine,
e) a model of the energy source and the other pins which will be attached to the circuit,
together with the possible voltages on these pins.
Models themselves are rather static structures of descriptions and formulas. Deriving
concrete results (objective values) from them is often complicated. It often makes more
sense to bring the construction plan of a skyscraper to life in a simulation. Then we can test
the influence of various wind strengths and directions on building structure and approximate
the properties which define the objective values.
Definition 1.46 (Simulation). A simulation44 is the computational realization of a model.
Whereas a model describes abstract connections between the properties of a system, a sim-
ulation realizes these connections.
Simulations are executable, live representations of models that can be as meaningful as
real experiments. They allow us to reason if a model makes sense or not and how certain
objects behave in the context of a model.
43 http://en.wikipedia.org/wiki/Model_%28abstract%29 [accessed 2007-07-03]
44 http://en.wikipedia.org/wiki/Simulation [accessed 2007-07-03]

1.4 Problems in Optimization
1.4.1 Introduction
The classification of optimization algorithms in Section 1.1.1 and the table of contents of this
book enumerate a wide variety of optimization algorithms. Yet, the approaches introduced
here represent only a small fraction of the actual number of available methods. It is
justified to ask why there are so many different approaches and why this variety is needed. One
possible answer is simply that there are so many different kinds of optimization tasks.
Each of them puts different obstacles in the way of the optimizers and comes with its own
characteristic difficulties.
In this chapter we want to discuss the most important of these complications, the major
problems that may be encountered during optimization. Some of the subjects in the following
text concern global optimization in general (multi-modality and overfitting, for instance),
others apply especially to nature-inspired approaches like genetic algorithms (epistasis and
neutrality, for example). Neglecting even a single one of them during the design or process of
optimization can render the whole effort invested useless, even if highly efficient optimization
techniques are applied. By giving clear definitions and comprehensive introductions to
these topics, we want to raise the awareness of scientists and practitioners in the industry
and hope to help them to use optimization algorithms more efficiently.
In Figure 1.19, we have sketched a set of different types of fitness landscapes (see Sec-
tion 1.3.2) which we are going to discuss. The objective values in the figure are subject to
minimization and the small bubbles represent solution candidates under investigation. An
arrow from one bubble to another means that the second individual is found by applying
one search operation to the first one.
The Term “Difficult”
Before we go more into detail about what makes these landscapes difficult, we should es-
tablish the term in the context of optimization. The degree of difficulty of solving a certain
problem with a dedicated algorithm is closely related to its computational complexity45, i. e.,
the amount of resources such as time and memory required to do so. The computational com-
plexity depends on the number of input elements needed for applying the algorithm. This
dependency is often expressed in the form of approximate bounds with the Big-O family of
notations introduced by Bachmann [96] and made popular by Landau [1236]. Problems can
further be divided into complexity classes. One of the most difficult complexity classes, owing
to its resource requirements, is NP, the set of all decision problems which are solvable
in polynomial time by non-deterministic Turing machines [773]. Although many attempts
have been made, no algorithm has been found which is able to solve an NP-complete
[773] problem in polynomial time on a deterministic computer. One approach to obtaining
near-optimal solutions for problems in NP in reasonable time is to apply metaheuristic,
randomized optimization procedures.
As already stated, optimization algorithms are guided by objective functions. A function
is difficult from a mathematical perspective in this context if it is not continuous, not
differentiable, or if it has multiple maxima and minima. This understanding of difficulty
comes very close to the intuitive sketches in Figure 1.19.
In many real world applications of metaheuristic optimization, the characteristics of
the objective functions are not known in advance. The problems are usually NP or have
unknown complexity. It is therefore only rarely possible to derive boundaries for the perfor-
mance or the runtime of optimizers in advance, let alone exact estimates with mathematical
precision.
45 see Section 30.1.3 on page 550
Figure 1.19: Different possible properties of fitness landscapes (minimization).
Each panel plots the objective values f(x) over x. Panels: Fig. 1.19.a: Best Case;
Fig. 1.19.b: Low Variation; Fig. 1.19.c: Multimodal (multiple (local) optima);
Fig. 1.19.d: Rugged (no useful gradient information); Fig. 1.19.e: Deceptive (region with
misleading gradient information); Fig. 1.19.f: Neutral (neutral area);
Fig. 1.19.g: Needle-In-A-Haystack (needle (isolated optimum) in a neutral area or area
without much information); Fig. 1.19.h: Nightmare.
Most often, experience, rules of thumb, and empirical results based on models obtained
from related research areas such as biology are the only guides available. In this chapter,
we discuss many such models and rules, providing a better understanding of when the
application of a metaheuristic is feasible and when it is not, as well as indicators on how to
avoid defining problems in a way that makes them difficult.
1.4.2 Premature Convergence
Introduction
An optimization algorithm has converged if it cannot reach new solution candidates anymore
or if it keeps on producing solution candidates from a “small”46 subset of the problem space.
Metaheuristic global optimization algorithms will usually converge at some point in time.
In nature, a similar phenomenon can be observed according to [1196]: The niche preemption
principle states that a niche in a natural environment tends to become dominated by a single
species [1347]. One of the problems in global optimization (and basically, also in nature) is
that it is often not possible to determine whether the best solution currently known is
situated on a local or a global optimum and thus, if convergence is acceptable. In other words,
it is usually not clear whether the optimization process can be stopped, whether it should
concentrate on refining the current optimum, or whether it should examine other parts of
the search space instead. This can, of course, only become cumbersome if there are multiple
(local) optima, i. e., the problem is multimodal as depicted in Fig. 1.19.c.
A mathematical function is multimodal if it has multiple maxima or minima [1863, 2327,
512]. A set of objective functions (or a vector function) F is multimodal if it has multiple
(local or global) optima – depending on the definition of “optimum” in the context of the
corresponding optimization problem.
The Problem
An optimization process has prematurely converged to a local optimum if it is no longer able
to explore other parts of the search space than the area currently being examined and there
exists another region that contains a superior solution [2075, 1824]. Figure 1.20 illustrates
examples for premature convergence.
The existence of multiple global optima itself is not problematic and the discovery of
only a subset of them can still be considered as successful in many cases. The occurrence of
numerous local optima, however, is more complicated.
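A tiny example may make premature convergence concrete. The sketch below is an assumption-laden illustration and not taken from the book: the list `LANDSCAPE` is an invented multimodal objective function over the integers 0 to 10 (minimization), and `hill_climb` is a plain steepest-descent local search whose only search operation moves one position to the left or right.

```python
LANDSCAPE = [5, 4, 3, 4, 5, 4, 3, 2, 1, 0, 1]   # invented objective values f(x)


def f(x):
    """Objective function (minimization) on the positions 0..10."""
    return LANDSCAPE[x]


def neighbors(x):
    """Results of the only search operation: one step left or right."""
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]


def hill_climb(x):
    """Move to the best neighbor until no neighbor is strictly better."""
    while True:
        best = min(neighbors(x), key=f)
        if f(best) >= f(x):
            return x
        x = best


stuck = hill_climb(0)      # ends in the local optimum at x = 2
lucky = hill_climb(5)      # ends in the global optimum at x = 9
```

Started at x = 0, the search converges to the local optimum and can no longer reach the region around x = 9, which contains the superior solution.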
Domino Convergence
The phenomenon of domino convergence has been brought to attention by Rudnick [1773]
who studied it in the context of his BinInt problem [1773, 2036] which is discussed in Sec-
tion 21.2.5. In principle, domino convergence occurs when the solution candidates have fea-
tures which contribute to significantly different degrees to the total fitness. If these features
are encoded in separate genes (or building blocks) in the genotypes, they are likely to be
treated with different priorities, at least in randomized or heuristic optimization methods.
Building blocks with a very strong positive influence on the objective values, for instance,
will quickly be adopted by the optimization process (i. e., “converge”). During this time, the
alleles of genes with a smaller contribution are ignored. They do not come into play until
the optimal alleles of the more “important” blocks have been accumulated. Rudnick [1773]
called this sequential convergence phenomenon domino convergence due to its resemblance
to a row of falling domino stones [2036].
46 according to a suitable metric like numbers of modifications or mutations which need to be
applied to a given solution in order to leave this subset
Figure 1.20: Premature convergence in the objective space.
The panels plot the objective values f(x) over x, with labels marking a local optimum and the
global optimum. Panels: Fig. 1.20.a: Example 1: Maximization; Fig. 1.20.b: Example 2: Minimization.
Let us consider the application of a genetic algorithm in such a scenario. Mutation
operators from time to time destroy building blocks with strong positive influence which
are then reconstructed by the search. If this happens with a high enough frequency, the
optimization process will never get to optimize the less salient blocks because repairing
and rediscovering those with higher importance takes precedence. Thus, the mutation rate
of the EA limits the probability of finding the global optima in such a situation.
In the worst case, the contributions of the less salient genes may almost look like noise and
they are not optimized at all. Such a situation is also an instance of premature convergence,
since the global optimum which would involve optimal configurations of all building blocks
will not be discovered. In this situation, restarting the optimization process will not help
because it will always turn out the same way. Example problems which are often likely to
exhibit domino convergence are the Royal Road and the aforementioned BinInt problem,
which you can find discussed in Section 21.2.4 and Section 21.2.5, respectively.
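The differing salience of genes can be illustrated with a binary-integer objective in the spirit of BinInt. The sketch below assumes the common form in which a bit string is interpreted as an unsigned integer that is to be maximized; the exact definition used in Section 21.2.5 may differ, and the function names are hypothetical.

```python
def bin_int(bits):
    """Value of a bit string read as an unsigned binary integer (maximized)."""
    n = len(bits)
    return sum(b << (n - j - 1) for j, b in enumerate(bits))


def contributions(n):
    """Contribution of each gene to the total value, most salient first."""
    return [1 << (n - j - 1) for j in range(n)]


high_only = bin_int([1, 0, 0, 0])    # only the most salient gene is set
low_only = bin_int([0, 1, 1, 1])     # all less salient genes are set
```

The single most significant gene outweighs all remaining genes together, so selection will tend to fix it first while the alleles of the other genes hardly matter yet.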
One Cause: Loss of Diversity
In biology, diversity is the variety and abundance of organisms at a given place and time
[1598, 1348]. Much of the beauty and efficiency of natural ecosystems is based on a dazzling
array of species interacting in manifold ways. Diversification is also a good strategy utilized
by investors in the economy in order to increase their wealth.
In population-based global optimization algorithms, maintaining a set of diverse solution
candidates is very important as well. Losing diversity means approaching a state where all
the solution candidates under investigation are similar to each other. Another term for this
state is convergence. Discussions about how diversity can be measured have been provided
by Routledge [1771], Cousins [459], Magurran [1348], Morrison and De Jong [1462], Paenke
et al. [1598], and Burke et al. [309, 311].
Preserving diversity is directly linked with maintaining a good balance between exploita-
tion and exploration [1598] and has been studied by researchers from many domains, such
as
1. Genetic Algorithms [1558, 1750, 1751],
2. Evolutionary Algorithms [253, 254, 1262, 1471, 1943, 1892],

3. Genetic Programming [510, 871, 872, 310, 311, 273],
4. Tabu Search [812, 816], and
5. Particle Swarm Optimization [2226].
Exploration vs. Exploitation
The operations which create new solutions from existing ones have a very large impact on
the speed of convergence and the diversity of the populations [637, 1910]. The step size in
Evolution Strategy is a good example of this issue: setting it properly is very important and
leads over to the “exploration versus exploitation” problem [940] which can be observed in
other areas of global optimization as well.47
In the context of optimization, exploration means finding new points in areas of the
search space which have not been investigated before. Since computers have only limited
memory, already evaluated solution candidates usually have to be discarded in order to
accommodate the new ones. Exploration is a metaphor for the procedure which allows search
operations to find novel and maybe better solution structures. Such operators (like mutation
in evolutionary algorithms) have a high chance of creating inferior solutions by destroying
good building blocks but also a small chance of finding totally new, superior traits (which,
however, is not guaranteed at all).
Exploitation, on the other hand, is the process of improving and combining the traits of
the currently known solutions, as done by the crossover operator in evolutionary algorithms,
for instance. Exploitation operations often incorporate small changes into already tested
individuals leading to new, very similar solution candidates or try to merge building blocks
of different, promising individuals. They usually have the disadvantage that other, possibly
better, solutions located in distant areas of the problem space will not be discovered.
Almost all components of optimization strategies can be used either for increasing exploitation
or in favor of exploration. Unary search operations can often be built so that they improve an
existing solution in small steps, which makes them exploitation operators. They can also
be implemented in a way that introduces much randomness into the individuals, effectively
making them exploration operators. Selection operations48 in Evolutionary Computation
choose a set of the most promising solution candidates which will be investigated in the
next iteration of the optimizers. They can either return a small group of best individuals
(exploitation) or a wide range of existing solution candidates (exploration).
Optimization algorithms that favor exploitation over exploration have higher convergence
speed but run the risk of not finding the optimal solution and may get stuck at a local
optimum. Then again, algorithms which perform excessive exploration may never improve
their solution candidates well enough to find the global optimum or it may take them
very long to discover it “by accident”. A good example for this dilemma is the Simulated
Annealing algorithm discussed in Chapter 12 on page 263. It is often modified to a form called
simulated quenching which focuses on exploitation but loses the guaranteed convergence to
the optimum. Generally, optimization algorithms should employ at least one search operation
of explorative character and at least one which is able to exploit good solutions further. There
exists a vast body of research on the trade-off between exploration and exploitation that
optimization algorithms have to face [638, 945, 622, 1494, 49, 538].
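How one and the same unary operator can serve either purpose may be sketched with a Gaussian mutation whose step size is a parameter, loosely following the step size in Evolution Strategy mentioned above. The names `mutate` and `mean_step` and the concrete step sizes are assumptions chosen for illustration.

```python
import random


def mutate(x, sigma, rng):
    """Unary search operation: add normally distributed noise to x."""
    return x + rng.gauss(0.0, sigma)


def mean_step(sigma, samples=1000, seed=0):
    """Average distance between a parent at 0.0 and its offspring."""
    rng = random.Random(seed)
    return sum(abs(mutate(0.0, sigma, rng)) for _ in range(samples)) / samples


exploiting = mean_step(0.01)   # offspring stay close to the parent
exploring = mean_step(1.0)     # offspring are scattered widely
```

A small step size yields offspring very similar to the parent (exploitation), whereas a large one samples distant regions (exploration) at the price of frequently destroying what was good about the parent.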
Countermeasures
There is no general approach which can prevent premature convergence. The probability
that an optimization process gets caught in a local optimum depends on the characteristics
of the problem to be solved and the parameter settings and features of the optimization
algorithms applied [2051, 1775].
47 More or less synonymously to exploitation and exploration, the terms intensification and
diversification have been introduced by Glover [812, 816] in the context of Tabu Search.
48 Selection will be discussed in Section 2.4 on page 121.

A very crude and yet sometimes effective measure is restarting the optimization process
at randomly chosen points in time. One example of this method is GRASP, the Greedy
Randomized Adaptive Search Procedures [663, 652] (see Section 10.6 on page 256), which
continuously restart the process of creating an initial solution and refining it with local search.
Still, such approaches are likely to fail in domino convergence situations. Increasing the
proportion of exploration operations may also reduce the chance of premature convergence.
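The restart idea can be sketched as a local search that is repeated from several starting points while the best result is kept. This is a simplification under assumptions and not GRASP itself: the landscape is invented, the local search is a plain hill climber, and restarts simply draw a new random starting point.

```python
import random

LANDSCAPE = [5, 4, 3, 4, 5, 4, 3, 2, 1, 0, 1]   # invented objective values


def f(x):
    """Objective function (minimization) on the positions 0..10."""
    return LANDSCAPE[x]


def hill_climb(x):
    """Local search: step left or right while this strictly improves f."""
    while True:
        options = [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]
        best = min(options, key=f)
        if f(best) >= f(x):
            return x
        x = best


def restart_search(starts):
    """Run the local search from every start point and keep the best result."""
    return min((hill_climb(s) for s in starts), key=f)


rng = random.Random(42)
random_starts = [rng.randrange(len(LANDSCAPE)) for _ in range(20)]
result = restart_search(random_starts)
```

A single run from x = 0 stays in the local optimum, while an additional start in the right basin of attraction is enough to find the global one. In a domino convergence situation, however, every restart would end the same way.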
In order to extend the duration of the evolution in evolutionary algorithms, many meth-
ods have been devised for steering the search away from areas which have already been
frequently sampled. This can be achieved by integrating density metrics into the fitness
assignment process. The most popular of such approaches are sharing and niching (see
Section 2.3.4). The Strength Pareto Algorithms, which are widely accepted to be highly efficient,
use another idea: they adopt the number of individuals that one solution candidate
dominates as a density measure [2329, 2332]. One very simple method aiming for convergence
prevention is introduced in Section 2.4.8. Using low selection pressure furthermore decreases
the chance of premature convergence but also decreases the speed with which good solutions
are exploited.
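As an illustration of integrating a density metric into fitness assignment, the following sketch implements a classic form of fitness sharing for a maximized raw fitness. The triangular sharing function, the parameter names `sigma` and `alpha`, and the division by the niche count are assumptions based on the commonly cited formulation; the variant in Section 2.3.4 may differ in detail.

```python
def sharing(d, sigma, alpha=1.0):
    """Sharing function: 1 at distance 0, falling to 0 at distance sigma."""
    return 1.0 - (d / sigma) ** alpha if d < sigma else 0.0


def shared_fitness(population, raw_fitness, distance, sigma):
    """Divide each raw fitness (maximization) by the individual's niche count."""
    result = []
    for x in population:
        # The niche count includes x itself, so it is always at least 1.
        niche_count = sum(sharing(distance(x, y), sigma) for y in population)
        result.append(raw_fitness(x) / niche_count)
    return result


population = [0.0, 0.1, 5.0]            # two crowded points, one isolated
shared = shared_fitness(
    population,
    raw_fitness=lambda x: 10.0,         # identical raw fitness for all
    distance=lambda a, b: abs(a - b),
    sigma=1.0,
)
```

Although all three individuals have the same raw fitness, the isolated one keeps its full value while the two crowded ones are penalized, which steers selection away from densely sampled areas.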
Another approach against premature convergence is to introduce the capability of self-
adaptation, allowing the optimization algorithm to change its strategies or to modify its
parameters depending on its current state. Such behaviors, however, are often implemented
not in order to prevent premature convergence but to speed up the optimization process
(which may lead to premature convergence to local optima) [1776, 1777, 1778].
1.4.3 Ruggedness and Weak Causality
The Problem: Ruggedness
Optimization algorithms generally depend on some form of gradient in the objective or
fitness space. The objective functions should be continuous and exhibit low total variation49,
so the optimizer can descend the gradient easily. If the objective functions are unsteady
or fluctuating, i. e., going up and down, it becomes more complicated for the optimization
process to find the right direction in which to proceed. The more rugged a function gets, the harder
it becomes to optimize it. In short, one could say ruggedness is multi-modality plus steep
ascents and descents in the fitness landscape. Examples of rugged landscapes are Kauffman’s
NK fitness landscape (see Section 21.2.1), the p-Spin model discussed in Section 21.2.2,
Bergman and Feldman’s jagged fitness landscape [182], and the sketch in Fig. 1.19.d on
page 57.
One Cause: Weak Causality
During an optimization process, new points in the search space are created by the search
operations. Generally we can assume that the genotypes which are the input of the search
operations correspond to phenotypes which have previously been selected. Usually, the better
or the more promising an individual is, the higher are its chances of being selected for further
investigation. Reversing this statement suggests that individuals which are passed to the
search operations are likely to have a good fitness. Since the fitness of a solution candidate
depends on its properties, it can be assumed that the features of these individuals are not so
bad either. It should thus be possible for the optimizer to introduce slight changes to their
49 http://en.wikipedia.org/wiki/Total_variation [accessed 2008-04-23]

properties in order to find out whether they can be improved any further50. Normally, such
exploitive modifications should also lead to small changes in the objective values and hence,
in the fitness of the solution candidate.
Definition 1.47 (Strong Causality). Strong causality (locality) means that small
changes in the properties of an object also lead to small changes in its behavior [1713,
1714, 1759].
This principle (proposed by Rechenberg [1713, 1714]) should not only hold for the search
spaces and operations designed for optimization, but applies to natural genomes as well. The
offspring resulting from sexual reproduction of two fish, for instance, has a different genotype
than its parents. Yet, it is far more probable that these variations manifest in a unique color
pattern of the scales, for example, instead of leading to a totally different creature.
Apart from this straightforward, informal explanation here, causality has been investi-
gated thoroughly in different fields of optimization, such as Evolution Strategy [1713, 597],
structure evolution [1303, 1302], Genetic Programming [1758, 1759, 1007, 597], genotype-
phenotype mappings [1854], search operators [597], and evolutionary algorithms in general
[1955, 1765, 597].
In fitness landscapes with weak (low) causality, small changes in the solution candidates
often lead to large changes in the objective values, i. e., ruggedness. It then becomes harder
to decide which region of the problem space to explore and the optimizer cannot find reliable
gradient information to follow. A small modification of a very bad solution candidate may
then lead to a new local optimum and the best solution candidate currently known may be
surrounded by points that are inferior to all other tested individuals.
The lower the causality of an optimization problem, the more rugged its fitness landscape
is, which leads to a degeneration of the performance of the optimizer [1168]. This does not
necessarily mean that it is impossible to find good solutions, but it may take very long to
do so.
Fitness Landscape Measures
As measures for the ruggedness of a fitness landscape (or its general difficulty), many
different metrics have been proposed. Wedge and Kell [2164] and Altenberg [45] provide
nice lists of them in their work51, which we summarize here:
• Weinberger [2169] introduced the autocorrelation function and the correlation length of
random walks.
• The correlation of the search operators was used by Manderick et al. [1354] in conjunction
with the autocorrelation.
• Jones and Forrest [1070, 1069] proposed the fitness distance correlation (FDC), the corre-
lation of the fitness of an individual and its distance to the global optimum. This measure
has been extended by researchers such as Clergue et al. [416, 2103].
• The probability that search operations create offspring fitter than their parents, as defined
by Rechenberg [1713] and Beyer [196] (and called evolvability by Altenberg [42]), will be
discussed in Section 1.4.5 on page 65 in depth.
• Simulation dynamics have been researched by Altenberg [42] and Grefenstette [855].
• Another interesting metric is the fitness variance of formae (Radcliffe and Surry [1695])
and schemas (Reeves and Wright [1717]).
• The error threshold method from theoretical biology [625, 1552] has been adopted by Ochoa
et al. [1557] for evolutionary algorithms. It is the “critical mutation rate beyond which
structures obtained by the evolutionary process are destroyed by mutation more fre-
quently than selection can reproduce them” [1557].
50 We have already mentioned this under the subject of exploitation.
51 Especially the one of Wedge and Kell [2164] is beautiful and far more detailed than this summary
here.

• The negative slope coefficient (NSC) by Vanneschi et al. [2104, 2105] may be considered
as an extension of Altenberg’s evolvability measure.
• Davidor [489] uses the epistatic variance as a measure of utility of a certain representation
in genetic algorithms. We discuss the issue of epistasis in Section 1.4.6.
• The genotype-fitness correlation (GFC) of Wedge and Kell [2164] is a new measure for
ruggedness in fitness landscape and has been shown to be a good guide for determining
optimal population sizes in Genetic Programming.
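Of the measures listed above, the fitness distance correlation is particularly easy to compute once the global optimum is known. The sketch below assumes that FDC is the Pearson correlation between objective values and distances to the optimum over a sample of solution candidates; the function names and the toy problem (minimize the number of zeros in a 4-bit string) are invented for illustration.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fitness_distance_correlation(samples, f, distance_to_optimum):
    """Correlation of objective value and distance to the global optimum."""
    return pearson([f(s) for s in samples],
                   [distance_to_optimum(s) for s in samples])


def zeros(bits):
    """Number of zeros; also the Hamming distance to the all-ones string."""
    return bits.count(0)


samples = list(product((0, 1), repeat=4))
fdc_easy = fitness_distance_correlation(samples, zeros, zeros)
fdc_misleading = fitness_distance_correlation(samples, lambda b: -zeros(b), zeros)
```

Under minimization, a value near +1 means that the objective values shrink steadily as the optimum is approached, whereas a value near −1 indicates that they point away from it.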
Autocorrelation and Correlation Length
As an example, let us take a look at the autocorrelation function as well as the correlation
length of random walks [2169]. Here we borrow its definition from Verel et al. [2114]:
Definition 1.48 (Autocorrelation Function). Given a random walk (x_i, x_{i+1}, . . . ), the
autocorrelation function ρ of an objective function f is the autocorrelation function of the
time series (f(x_i), f(x_{i+1}), . . . ).
$$\rho(k, f) = \frac{E[f(x_i)\, f(x_{i+k})] - E[f(x_i)]\, E[f(x_{i+k})]}{D^2[f(x_i)]} \tag{1.41}$$
where E[f(x_i)] and D^2[f(x_i)] are the expected value and the variance of f(x_i).
The correlation length $\tau = -\frac{1}{\log \rho(1, f)}$ measures how the autocorrelation function
decreases and summarizes the ruggedness of the fitness landscape: the larger the correlation
length, the lower the total variation of the landscape. From the works of Kinnear, Jr. [1141]
and Lipsitch [1293], however, we also know that correlation measures do not always
represent the hardness of a problem landscape fully.
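Equation 1.41 and the correlation length can be estimated from the objective values recorded along a walk. The following sketch replaces the expected values by sample means, which is an assumption of this illustration; the function names and the two test series are hypothetical, and the correlation length is only defined when the estimate of ρ(1, f) lies strictly between 0 and 1.

```python
import math
import random


def autocorrelation(series, k):
    """Estimate rho(k, f) of Equation 1.41 from objective values of a walk."""
    n = len(series) - k
    head, tail = series[:n], series[k:]
    mean_head = sum(head) / n
    mean_tail = sum(tail) / n
    covariance = sum(a * b for a, b in zip(head, tail)) / n - mean_head * mean_tail
    mean_all = sum(series) / len(series)
    variance = sum((v - mean_all) ** 2 for v in series) / len(series)
    return covariance / variance


def correlation_length(series):
    """tau = -1 / log(rho(1, f)); requires 0 < rho(1, f) < 1."""
    return -1.0 / math.log(autocorrelation(series, 1))


def random_walk_values(f, x0, step, length, rng):
    """Objective values f(x_i) along a random walk started at x0."""
    values, x = [], x0
    for _ in range(length):
        values.append(f(x))
        x = step(x, rng)
    return values


smooth = [math.sin(0.1 * i) for i in range(1000)]   # slowly varying values
rugged = [(-1) ** i for i in range(1000)]           # sign flips at every step
```

For the slowly varying series, neighboring values are strongly correlated and the correlation length is large; for the alternating series the lag-one correlation is negative and no correlation length can be computed.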
Countermeasures
To the knowledge of the author, no viable method which can directly mitigate the effects of
rugged fitness landscapes exists. In population-based approaches, using large population sizes
and applying methods to increase the diversity can reduce the influence of ruggedness, but
only up to a certain degree. Utilizing Lamarckian evolution [522, 2215] or the Baldwin effect
[123, 929, 930, 2215], i. e., incorporating a local search into the optimization process, may
further help to smoothen out the fitness landscape [864] (see Section 15.2 and Section 15.3,
respectively).
Weak causality is often a home-made problem because it results to some extent from
the choice of the solution representation and search operations. We pointed out that exploration
operations are important for lowering the risk of premature convergence. Exploitation
operators are just as important for refining solutions to a certain degree. In order to
apply optimization algorithms in an efficient manner, it is necessary to find representations
which allow for iterative modifications with bounded influence on the objective values, i. e.,
exploitation. In Section 1.5.2, we present some further rules-of-thumb for search space and
operation design.
1.4.4 Deceptiveness

Introduction
Especially annoying fitness landscapes show deceptiveness (or deceptivity). The gradient of
deceptive objective functions leads the optimizer away from the optima, as illustrated in
Fig. 1.19.e.
The term deceptiveness is mainly used in the genetic algorithm52 community in the
context of the Schema Theorem. Schemas describe certain areas (hyperplanes) in the search
space. If an optimization algorithm has discovered an area with a better average fitness
compared to other regions, it will focus on exploring this region based on the assumption
that highly fit areas are likely to contain the true optimum. Objective functions where this
is not the case are called deceptive [190, 821, 1285]. Examples for deceptiveness are the ND
fitness landscapes outlined in Section 21.2.3, trap functions (see Section 21.2.3), and the
fully deceptive problems given by Goldberg et al. [825, 541].
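A small trap function shows how a gradient can lead away from the optimum. The function below is a simplified trap written for minimization and is an assumption of this sketch, not the definition from Section 21.2.3: the all-ones string is the global optimum with value 0, while everywhere else the objective value decreases toward the all-zeros string.

```python
def trap(bits):
    """Deceptive objective (minimization): optimum at all ones, slope to all zeros."""
    ones = sum(bits)
    return 0 if ones == len(bits) else ones + 1


def bit_flip_descent(f, bits):
    """Steepest descent using single bit flips as the search operation."""
    bits = list(bits)
    while True:
        flipped = [bits[:i] + [1 - bits[i]] + bits[i + 1:] for i in range(len(bits))]
        best = min(flipped, key=f)
        if f(best) >= f(bits):
            return bits
        bits = best


end_point = bit_flip_descent(trap, [1, 1, 0, 0, 0, 0])
```

Following the gradient from this start removes the remaining ones one by one, so the search ends at the all-zeros string, which is as far from the global optimum as possible.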
The Problem
If the information accumulated by an optimizer actually guides it away from the optimum,
search algorithms will perform worse than a random walk or an exhaustive enumeration
method. This issue has been known for a long time [2159, 1433, 1434, 2034] and has been
subsumed under the No Free Lunch Theorem which we will discuss in Section 1.4.10.
Countermeasures
Solving deceptive optimization tasks perfectly involves sampling many individuals with very
bad features and low fitness. This contradicts the basic ideas of metaheuristics and thus,
there are no efficient countermeasures against deceptivity. Using large population sizes, main-
taining a very high diversity, and utilizing linkage learning (see Section 1.4.6) are, maybe,
the only approaches which can provide at least a small chance of finding good solutions.
1.4.5 Neutrality and Redundancy
The Problem: Neutrality
Definition 1.49 (Neutrality). We consider the outcome of the application of a search
operation to an element of the search space as neutral if it yields no change in the objective
values [1718, 149].
It is challenging for optimization algorithms if the best solution candidate currently
known is situated on a plane of the fitness landscape, i. e., all adjacent solution candidates
have the same objective values. As illustrated in Fig. 1.19.f, an optimizer then cannot find
any gradient information and thus, no direction in which to proceed in a systematic manner.
From its point of view, each search operation will yield identical individuals. Furthermore,
optimization algorithms usually maintain a list of the best individuals found, which will then
overflow eventually or require pruning.
The degree of neutrality ν is defined as the fraction of neutral results among all possible
products of the search operations applied to a specific genotype [149]. We can generalize
this measure to areas G ⊆ 𝔾 of the search space 𝔾 by averaging over all their elements. Regions
where ν is close to one are considered as neutral.
\[
\forall g_1 \in \mathbb{G} \Rightarrow \nu(g_1) = \frac{\left|\{g_2 : P(g_2 = Op(g_1)) > 0 \wedge F(gpm(g_2)) = F(gpm(g_1))\}\right|}{\left|\{g_2 : P(g_2 = Op(g_1)) > 0\}\right|} \qquad (1.42)
\]
\[
\forall G \subseteq \mathbb{G} \Rightarrow \nu(G) = \frac{1}{|G|} \sum_{g \in G} \nu(g) \qquad (1.43)
\]
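For small search spaces, Equations 1.42 and 1.43 can be evaluated by plain enumeration. The following Python sketch is only an illustration under assumptions that are not part of the text: genotypes are bit tuples, the search operation Op is a single bit flip, the genotype-phenotype mapping is the identity, and `f` is a hypothetical objective function with a plateau.

```python
from itertools import product

def neighbors(g):
    """All genotypes reachable from g by one application of Op (assumed: one bit flip)."""
    return [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]

def f(g):
    """Hypothetical objective: number of ones, clipped to a plateau at 2."""
    return min(sum(g), 2)

def neutrality(g):
    """Eq. 1.42: fraction of Op results whose objective value equals that of g."""
    ns = neighbors(g)
    return sum(1 for n in ns if f(n) == f(g)) / len(ns)

def region_neutrality(region):
    """Eq. 1.43: mean degree of neutrality over a set of genotypes."""
    return sum(neutrality(g) for g in region) / len(region)

space = list(product((0, 1), repeat=4))
plateau = [g for g in space if sum(g) >= 3]   # deep inside the plateau
print(neutrality((0, 0, 0, 0)), neutrality((1, 1, 0, 0)), region_neutrality(plateau))
```

In this toy landscape, genotypes far below the plateau have no neutral neighbors, those at its edge have some, and those inside it have only neutral neighbors, which is the situation in which an optimizer can no longer find a direction.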
52 We are going to discuss genetic algorithms in Chapter 3 on page 141 and the Schema Theorem
in Section 3.6 on page 150.

Evolvability
Another metaphor in global optimization borrowed from biological systems is evolvability53
[500]. Wagner [2132, 2133] points out that this word has two uses in biology: According
to Kirschner and Gerhart [1144], a biological system is evolvable if it is able to generate
heritable, selectable phenotypic variations. Such properties can then be spread by natural
selection and changed during the course of evolution. In its second sense, a system is evolvable
if it can acquire new characteristics via genetic change that help the organism(s) to survive
and to reproduce. Theories about how the ability of generating adaptive variants has evolved
have been proposed by Riedl [1732], Altenberg [43], Wagner and Altenberg [2134], Bonner
[247], and Conrad [439], amongst others. The idea of evolvability can be adopted for global
optimization as follows:
Definition 1.50 (Evolvability). The evolvability of an optimization process in its current
state defines how likely the search operations will lead to solution candidates with new (and
eventually, better) objective values.
The direct probability of success [1713, 196], i.e., the chance that search operators produce
offspring fitter than their parents, is also sometimes referred to as evolvability in the context
of evolutionary algorithms [45, 42].
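The direct probability of success can be estimated empirically by applying the search operation many times to the same parent. The sketch below is a Monte Carlo estimate under assumed, purely illustrative choices: a list-of-bits genotype, a hypothetical `onemax` objective subject to maximization, and a hypothetical `mutate` operation that flips one uniformly chosen bit.

```python
import random

def onemax(g):
    """Hypothetical objective (to be maximized): number of ones."""
    return sum(g)

def mutate(g, rng):
    """Hypothetical search operation: flip one uniformly chosen bit."""
    i = rng.randrange(len(g))
    return g[:i] + [1 - g[i]] + g[i + 1:]

def success_probability(g, objective, trials=10_000, seed=1):
    """Estimate the chance that an offspring of g is fitter than g."""
    rng = random.Random(seed)
    better = sum(objective(mutate(g, rng)) > objective(g) for _ in range(trials))
    return better / trials

# Three of eight bits are zero, so the estimate should be close to 3/8.
print(success_probability([0, 0, 0, 1, 1, 1, 1, 1], onemax))
```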
Neutrality: Problematic and Beneficial
The link between evolvability and neutrality has been discussed by many researchers [2300,
2133]. The evolvability of neutral parts of a fitness landscape depends on the optimization
algorithm used. It is especially low for hill climbing and similar approaches, since the search
operations cannot directly provide improvements or even changes. The optimization process
then degenerates to a random walk, as illustrated in Fig. 1.19.f on page 57. The work of
Beaudoin et al. [161] on the ND fitness landscapes54 shows that neutrality may “destroy”
useful information such as correlation.
Researchers in molecular evolution, on the other hand, found indications that the major-
ity of mutations in biology have no selective influence [732, 980] and that the transformation
from genotypes to phenotypes is a many-to-one mapping. Wagner [2133] states that neutral-
ity in natural genomes is beneficial if it concerns only a subset of the properties peculiar to
the offspring of a solution candidate while allowing meaningful modifications of the others.
Toussaint and Igel [2050] even go as far as declaring it a necessity for self-adaptation.
The theory of punctuated equilibria55, introduced in biology by Eldredge and Gould
[630, 629], states that species experience long periods of evolutionary inactivity which are
interrupted by sudden, localized, and rapid phenotypic evolutions [118].56 It is assumed that
the populations explore neutral layers57 during the time of stasis until, suddenly, a relevant
change in a genotype leads to a better adapted phenotype [2098] which then reproduces
quickly. Similar phenomena can be observed in, and are utilized by, EAs [426, 1365].
“Uh?”, you may think, “How does this fit together?” The key to differentiating between
“good” and “bad” neutrality is its degree ν in relation to the number of possible solutions
maintained by the optimization algorithms. Smith et al. [1913] have used illustrative ex-
amples similar to Figure 1.21 showing that a certain amount of neutral reproductions can
foster the progress of optimization. In Fig. 1.21.a, basically the same scenario of premature
convergence as in Fig. 1.20.a on page 59 is depicted. The optimizer is drawn to a local opti-
mum from which it cannot escape anymore. Fig. 1.21.b shows that a little shot of neutrality
53 http://en.wikipedia.org/wiki/Evolvability [accessed 2007-07-03]
54 See Section 21.2.3 on page 333 for a detailed elaboration on the ND fitness landscape.
55 http://en.wikipedia.org/wiki/Punctuated_equilibrium [accessed 2008-07-01]
56 A very similar idea is utilized in the Extremal Optimization method discussed in Chapter 13.
57 Or neutral networks, as discussed in Section 1.4.5.

could form a bridge to the global optimum. The optimizer now has a chance to escape the
smaller peak if it is able to find and follow that bridge, i. e., the evolvability of the system
has increased. If this bridge gets wider, as sketched in Fig. 1.21.c, the chance of finding the
global optimum increases as well. Of course, if the bridge gets too wide, the optimization
process may end up in a scenario like in Fig. 1.19.f on page 57 where it cannot find any
direction. Furthermore, in this scenario we expect the neutral bridge to lead to somewhere
useful, which is not necessarily the case in reality.
Figure 1.21: Possible positive influence of neutrality. (Panels: Fig. 1.21.a: Premature Convergence; Fig. 1.21.b: Small Neutral Bridge; Fig. 1.21.c: Wide Neutral Bridge. Labels: local optimum, global optimum.)
Recently, the idea of utilizing the processes of molecular58 and evolutionary59 biology as a
complement to Darwinian evolution for optimization has gained interest [144]. Scientists like Hu
and Banzhaf [967, 968] have begun to study the application of metrics such as the evolution
rate of gene sequences [2281, 2257] to evolutionary algorithms. Here, the degree of neutrality
(synonymous vs. non-synonymous changes) seems to play an important role.
Examples for neutrality in fitness landscapes are the ND family (see Section 21.2.3), the
NKp and NKq models (discussed in Section 21.2.1), and the Royal Road (see Section 21.2.4).
Another common instance of neutrality is bloat in Genetic Programming, which is outlined
in Section 4.10.3 on page 224.
Neutral Networks
From the idea of neutral bridges between different parts of the search space as sketched by
Smith et al. [1913], we can derive the concept of neutral networks.
Definition 1.51 (Neutral Network). Neutral networks are equivalence classes K of el-
ements of the search space G which map to elements of the problem space X with the same
objective values and are connected by chains of applications of the search operators Op [149].
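\[
\forall g_1, g_2 \in \mathbb{G} : g_1 \in K(g_2) \subseteq \mathbb{G} \Leftrightarrow \exists k \in \mathbb{N}_0 : P\left(g_2 = Op^k(g_1)\right) > 0 \wedge F(gpm(g_1)) = F(gpm(g_2)) \qquad (1.44)
\]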
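On a small search space, the neutral network of a genotype can be collected with a breadth-first search. The sketch below is an illustration only and makes several assumptions that are not taken from the text: bit-tuple genotypes, a single-bit-flip operation Op, an identity genotype-phenotype mapping, and a hypothetical objective `f`. It also follows the common reading of Definition 1.51 in which every intermediate genotype on the connecting chain has the same objective value as the start.

```python
from collections import deque

def neighbors(g):
    """All genotypes reachable from g by one application of Op (assumed: one bit flip)."""
    return [g[:i] + (1 - g[i],) + g[i + 1:] for i in range(len(g))]

def f(g):
    """Hypothetical objective: number of ones, clipped to a plateau at 2."""
    return min(sum(g), 2)

def neutral_network(start):
    """Collect all genotypes connected to start by chains of neutral Op steps."""
    seen = {start}
    queue = deque([start])
    while queue:
        g = queue.popleft()
        for n in neighbors(g):
            if n not in seen and f(n) == f(start):
                seen.add(n)
                queue.append(n)
    return seen

print(len(neutral_network((1, 1, 1, 1))), len(neutral_network((1, 0, 0, 0))))
```

Here the plateau forms one large network, whereas a genotype on the slope below it is a network of its own.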
Barnett [149] states that a neutral network has the constant innovation property if
58 http://en.wikipedia.org/wiki/Molecular_biology [accessed 2008-07-20]
59 http://en.wikipedia.org/wiki/Evolutionary_biology [accessed 2008-07-20]

1. the rate of discovery of innovations keeps constant for a reasonably large amount of
applications of the search operations [981], and
2. this rate is comparable with that of an unconstrained random walk.
Networks with this property may prove very helpful if they connect the optima in the fitness
landscape. Stewart [1962] utilizes neutral networks and the idea of punctuated equilibria
in his extrema selection, a genetic algorithm variant that focuses on exploring individuals
which are far away from the centroid of the set of currently investigated solution candidates
(but still have good objective values). Then again, Barnett [148] showed that populations
in genetic algorithms tend to dwell in neutral networks of high dimensions of neutrality
regardless of their objective values, which (obviously) cannot be considered advantageous.
The convergence on neutral networks has furthermore been studied by Bornberg-Bauer
and Chan [251], van Nimwegen et al. [2097, 2096], and Wilke [2225]. Their results show that
the topology of neutral networks strongly determines the distribution of genotypes on them.
Generally, the genotypes are “drawn” to the solutions with the highest degree of neutrality
ν on the neutral network [161].
Redundancy: Problematic and Beneficial
Definition 1.52 (Redundancy). Redundancy in the context of global optimization is a
feature of the genotype-phenotype mapping and means that multiple genotypes map to the
same phenotype, i. e., the genotype-phenotype mapping is not injective.
\[
\exists g_1, g_2 : g_1 \neq g_2 \wedge gpm(g_1) = gpm(g_2) \qquad (1.45)
\]
The role of redundancy in the genome is as controversial as that of neutrality [2168].
There exist many accounts of its positive influence on the optimization process. Shipman
et al. [1871, 1856], for instance, tried to mimic desirable evolutionary properties of RNA
folding [980]. They developed redundant genotype-phenotype mappings using voting (both,
via uniform redundancy and via a non-trivial approach), Turing machine-like binary instructions,
cellular automata, and random Boolean networks [1099]. Except for the trivial voting
mechanism based on uniform redundancy, the mappings induced neutral networks which
proved beneficial for exploring the problem space. Especially the last approach provided par-
ticularly good results [1871, 1856]. Possibly converse effects like epistasis (see Section 1.4.6)
arising from the new genotype-phenotype mappings have not been considered in this study.
Redundancy can have a strong impact on the explorability of the problem space. When
utilizing a one-to-one mapping, the translation of a slightly modified genotype will always
result in a different phenotype. If there exists a many-to-one mapping between genotypes
and phenotypes, the search operations can create offspring genotypes different from the
parent which still translate to the same phenotype. The optimizer may now walk along a
path through this neutral network. If many genotypes along this path can be modified to
different offspring, many new solution candidates can be reached [1871]. One example for
beneficial redundancy is the extradimensional bypass idea discussed in Section 1.5.2.
The experiments of Shipman et al. [1872, 1870] additionally indicate that neutrality
in the genotype-phenotype mapping can have positive effects. In the Cartesian Genetic
Programming method, neutrality is explicitly introduced in order to increase the evolvability
(see Section 4.7.4 on page 201) [2110, 2297].
Yet, Rothlauf [1765] and Shackleton et al. [1856] show that simple uniform redundancy
is not necessarily beneficial for the optimization process and may even slow it down. There
is no use in introducing encodings which, for instance, represent each phenotypic bit with
two bits in the genotype where 00 and 01 map to 0 and 10 and 11 map to 1. Another example
for this issue is given in Fig. 1.31.b on page 86.
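The two-bits-per-phenotypic-bit encoding mentioned above can be written down directly. The following Python sketch assumes bit-tuple genotypes and a single-bit-flip search operation (both are illustrative assumptions, and `gpm`, `flips`, and `reachable` are hypothetical helper names). Under these assumptions, all genotypes of the same phenotype can reach exactly the same phenotypes in one step, so moving along the neutral network opens up nothing new.

```python
from itertools import product

def gpm(genotype):
    """Uniformly redundant mapping: bit pairs 00, 01 -> 0 and 10, 11 -> 1."""
    return tuple(genotype[0::2])

def flips(genotype):
    """Assumed search operation: all single-bit flips of the genotype."""
    return [genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]
            for i in range(len(genotype))]

def reachable(genotype):
    """Phenotypes reachable from the genotype with one search step."""
    return {gpm(g) for g in flips(genotype)}

genotypes = list(product((0, 1), repeat=4))   # encodes two phenotypic bits
phenotypes = {gpm(g) for g in genotypes}
same_reach = all(reachable(a) == reachable(b)
                 for a in genotypes for b in genotypes if gpm(a) == gpm(b))
print(len(genotypes), len(phenotypes), same_reach)
```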

Summary
Unlike ruggedness, which is always bad for optimization algorithms, neutrality has
aspects that may further as well as hinder the process of finding good solutions. Generally
we can state that degrees of neutrality ν very close to 1 degenerate optimization processes
to random walks. Some forms of neutral networks accompanied by low (nonzero) values of
ν can improve the evolvability and hence, increase the chance of finding good solutions.
Adverse forms of neutrality are often caused by bad design of the search space or
genotype-phenotype mapping. Uniform redundancy in the genome should be avoided where
possible and the amount of neutrality in the search space should generally be limited.
Needle-In-A-Haystack
One of the worst cases of fitness landscapes is the needle-in-a-haystack (NIAH) problem
sketched in Fig. 1.19.g on page 57, where the optimum occurs as an isolated spike in a plane. In
other words, small instances of extreme ruggedness combine with a general lack of informa-
tion in the fitness landscape. Such problems are extremely hard to solve and the optimization
processes often will converge prematurely or take very long to find the global optimum. An
example for such fitness landscapes is the all-or-nothing property often inherent to Genetic
Programming of algorithms [2058], as discussed in Section 4.10.2 on page 223.
1.4.6 Epistasis
Introduction
In biology, epistasis60 is defined as a form of interaction between different genes [1640].
The term was coined by Bateson [157] and originally meant that one gene suppresses the
phenotypical expression of another gene. In the context of statistical genetics, epistasis was
initially called “epistacy” by Fisher [677]. According to Lush [1335], the interaction between
genes is epistatic if the effect on the fitness of altering one gene depends on the allelic state of
other genes. This understanding of epistasis comes very close to another biological expression:
Pleiotropy61, which means that a single gene influences multiple phenotypic traits [2227]. In
the area of global optimization, such fine-grained distinctions are usually not made and the
two terms are often used more or less synonymously.
Definition 1.53 (Epistasis). In optimization, epistasis is the dependency of the contribu-
tion of one gene to the value of the objective functions on the allelic state of other genes.
[491, 44, 1503]
We speak of minimal epistasis when every gene is independent of every other gene. Then,
the optimization process equals finding the best value for each gene and can most efficiently
be carried out by a simple greedy search (see Section 17.4.1) [491]. A problem is maximally
epistatic when no proper subset of genes is independent of any other gene [1924, 1503].
Examples of problems with a high degree of epistasis are Kauffman’s NK fitness landscape
[1098, 1100] (Section 21.2.1), the p-Spin model [48] (Section 21.2.2), and the tunable model
of Weise et al. [2185] (Section 21.2.7).
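Definition 1.53 can be checked by brute force on tiny genomes: a gene is involved in epistasis if the change it causes in the objective value differs between genetic backgrounds. The sketch below assumes bit-tuple genotypes and two hypothetical objective functions, `separable` and `interacting`, which are invented for illustration and are not the benchmark problems cited above.

```python
from itertools import product

def effect(f, g, i):
    """Change in f when gene i is switched from 0 to 1 in background g."""
    on = g[:i] + (1,) + g[i + 1:]
    off = g[:i] + (0,) + g[i + 1:]
    return f(on) - f(off)

def has_epistasis(f, n):
    """True if some gene's contribution depends on the state of other genes."""
    space = list(product((0, 1), repeat=n))
    return any(len({effect(f, g, i) for g in space}) > 1 for i in range(n))

def separable(g):
    """Hypothetical objective without gene interactions."""
    return 3 * g[0] + 2 * g[1] - g[2]

def interacting(g):
    """Hypothetical objective in which gene 0 switches gene 1 'off'."""
    return g[1] * (1 - g[0]) + g[2]

print(has_epistasis(separable, 3), has_epistasis(interacting, 3))
```

For the separable function, each gene can be optimized on its own, which is the minimal-epistasis case described above.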
The Problem
As sketched in Figure 1.22, epistasis has a strong influence on many of the previously dis-
cussed problematic features. If one gene can “turn off” or affect the expression of other
60 http://en.wikipedia.org/wiki/Epistasis [accessed 2008-05-31]
61 http://en.wikipedia.org/wiki/Pleiotropy [accessed 2008-03-02]

genes, a modification of this gene will lead to a large change in the features of the pheno-
type. Hence, the causality will be weakened and ruggedness ensues in the fitness landscape.
It also becomes harder to define search operations with exploitive character. Moreover, sub-
sequent changes to the “deactivated” genes may have no influence on the phenotype at all,
which would then increase the degree of neutrality in the search space. Epistasis is mainly an
aspect of the way in which the genome G and the genotype-phenotype mapping are defined.
It should be avoided where possible.
Figure 1.22: The influence of epistasis on the fitness landscape. (Diagram nodes: high epistasis, weak causality, ruggedness, multi-modality, neutrality, needle in a haystack; arrows denote "causes".)
Generally, epistasis and conflicting objectives in multi-objective optimization should be
distinguished from each other. Epistasis as well as pleiotropy is a property of the influence
of the editable elements (the genes) of the genotypes on the phenotypes. Objective functions
can conflict without the involvement of any of these phenomena. We can, for example,
define two objective functions f1(x) = x and f2(x) = −x which are clearly contradicting
regardless of whether they both are subject to maximization or minimization. Nevertheless,
if the solution candidates x and the genotypes are simple real numbers and the genotype-
phenotype mapping is an identity mapping, neither epistatic nor pleiotropic effects can
occur.
Naudts and Verschoren [1504] have shown for the special case of length-two binary string
genomes that deceptiveness does not occur in situations with low epistasis and also that
objective functions with high epistasis are not necessarily deceptive. Another discussion
about different shapes of fitness landscapes under the influence of epistasis is given by
Beerenwinkel et al. [167].
Countermeasures
General
We have shown that epistasis is a root cause for multiple problematic features of optimiza-
tion tasks. General countermeasures against epistasis can be divided into two groups. The
symptoms of epistasis can be mitigated with the same methods which increase the chance of
finding good solutions in the presence of ruggedness or neutrality – using larger populations
and favoring explorative search operations. Epistasis itself is a feature which results from
the choice of the search space structure, the search operations, and the genotype-phenotype
mapping. Avoiding epistatic effects should be a major concern during their design. This can
lead to a great improvement in the quality of the solutions produced by the optimization
process [2181]. Some general rules for search space design are outlined in Section 1.5.2.

Linkage Learning
According to Winter et al. [2242], linkage is “the tendency for alleles of different genes to
be passed together from one generation to the next” in genetics. This usually indicates
that these genes are closely located in the same chromosome. In the context of evolutionary
algorithms, this notion is not useful since identifying spatially close elements inside the
genotypes g ∈ G is trivial. Instead, we are interested in alleles of different genes which have
a joint effect on the fitness [1486, 1485].
Identifying these linked genes, i. e., learning their epistatic interaction, is very helpful for
the optimization process. Such knowledge can be used to protect building blocks62 from being
destroyed by the search operations (such as crossover in genetic algorithms), for instance.
Finding approaches for linkage learning has become an especially popular discipline in the
area of evolutionary algorithms with binary [896, 1486, 1647] and real [546] genomes. Two
important methods from this area are the messy GA (mGA, see Section 3.7) by Goldberg
et al. [825] and the Bayesian Optimization Algorithm (BOA) [1633, 333]. Module acquisition
[66] may be considered as such an effort.
1.4.7 Noise and Robustness
Introduction – Noise
In the context of optimization, three types of noise can be distinguished. The first form is
noise in the training data used as basis for learning (i). In many applications of machine
learning or optimization where a model for a given system is to be learned, data samples
including the input of the system and its measured response are used for training. Some
typical examples of situations where training data is the basis for the objective function
evaluation are
1. the usage of global optimization for building classifiers (for example for predicting buying
behavior using data gathered in a customer survey for training),
2. the usage of simulations for determining the objective values in Genetic Programming
(here, the simulated scenarios correspond to training cases), and
3. the fitting of mathematical functions to (x, y)-data samples (with artificial neural net-
works or symbolic regression, for instance).
Since no measurement device is 100% accurate and there are always random errors, noise is
present in such optimization problems.
Besides inexactnesses and fluctuations in the input data of the optimization process,
perturbations are also likely to occur during the application of its results. This category
subsumes the other two types of noise: perturbations that may arise from (ii) inaccuracies
in the process of realizing the solutions and (iii) environmentally induced perturbations
during the applications of the products.
This issue can be illustrated by using the process of developing the perfect tire for a car
as an example. As input for the optimizer, all sorts of material coefficients and geometric
constants measured from all known types of wheels and rubber could be available. Since
these constants have been measured or calculated from measurements, they include a certain
degree of noise and imprecision (i).
The result of the optimization process will be the best tire construction plan discovered
during its course and it will likely incorporate different materials and structures. We would
hope that the tires created according to the plan will not fall apart if, accidentally, an extra
0.0001% of a specific rubber component is used (ii). During the optimization process, the
behavior of many construction plans will be simulated in order to find out about their
utility. When actually manufactured, the tires should not behave unexpectedly when used
62 See Section 3.6.5 for information on the Building Block Hypothesis.

in scenarios different from those simulated (iii) and should instead be applicable in all driving
situations likely to occur.
The effects of noise in optimization have been studied by various researchers; Miller
and Goldberg [1416, 1415], Lee and Wong [1268], and Gurin and Rastrigin [870] are some
of them. Many global optimization algorithms and theoretical results have been proposed
which can deal with noise. Some of them are, for instance, specialized
1. genetic algorithms [685, 2062, 2060, 1799, 1800, 1146],
2. Evolution Strategies [195, 100, 881], and
3. Particle Swarm Optimization [1606, 884] approaches.
The Problem: Need for Robustness
The goal of global optimization is to find the global optima of the objective functions. While
this is fully true from a theoretical point of view, it may not suffice in practice. Optimization
problems are normally used to find good parameters or designs for components or plans to
be put into action by human beings or machines. As we have already pointed out, there will
always be noise and perturbations in practical realizations of the results of optimization.
There is no process in the world that is 100% accurate and the optimized parameters,
designs, and plans have to tolerate a certain degree of imprecision.
Definition 1.54 (Robustness). A system in engineering or biology is robust if it is able to
function properly in the face of genetic or environmental perturbations [2132].
Therefore, a local optimum (or even a non-optimal element) for which slight disturbances
only lead to gentle performance degenerations is usually favored over a global optimum lo-
cated in a highly rugged area of the fitness landscape [276]. In other words, local optima in
regions of the fitness landscape with strong causality are sometimes better than global op-
tima with weak causality. Of course, the level of this acceptability is application-dependent.
Figure 1.23 illustrates the issue of local optima which are robust vs. global optima which
are not. More examples from the real world are:
1. When optimizing the control parameters of an airplane or a nuclear power plant, the
global optimum is certainly not used if a slight perturbation can have hazardous effects
on the system [2062].
2. Wiesmann et al. [2218, 2217] bring up the topic of manufacturing tolerances in multilayer
optical coatings. It is no use to find optimal configurations if they only perform optimally
when manufactured to a precision which is either impossible or too hard to achieve on
a constant basis.
3. The optimization of the decision process on which roads should be salted as a precaution
for areas with marginal winter climate is an example of the need for dynamic robustness.
The global optimum of this problem is likely to depend on the daily (or even current)
weather forecast and may therefore be constantly changing. Handa et al. [886] point
out that it is practically infeasible to let road workers follow a constantly changing plan
and circumvent this problem by incorporating multiple road temperature settings in the
objective function evaluation.
4. Tsutsui et al. [2062, 2060] found a nice analogy in nature: The phenotypic characteristics
of an individual are described by its genetic code. During the interpretation of this code,
perturbations like abnormal temperature, nutritional imbalances, injuries, illnesses and
so on may occur. If the phenotypic features emerging under these influences have low fit-
ness, the organism cannot survive and procreate. Thus, even a species with good genetic
material will die out if its phenotypic features become too sensitive to perturbations.
Species robust against them, on the other hand, will survive and evolve.

Figure 1.23: A robust local optimum vs. an “unstable” global optimum. (Plot of f(x) over X with the labels: robust local optimum, global optimum.)
Countermeasures
For the special case where the phenome is a real vector space ($X \subseteq \mathbb{R}^n$), several approaches
for dealing with the need for robustness have been developed. Inspired by Taguchi methods63
[1995], possible disturbances are represented by a vector $\delta = (\delta_1, \delta_2, .., \delta_n)^T$, $\delta_i \in \mathbb{R}$,
in the method suggested by Greiner [859, 860]. If the distributions and influences of
the $\delta_i$ are known, the objective function $f(x) : x \in X$ can be rewritten as $\tilde{f}(x, \delta)$
[2218]. In the special case where $\delta$ is normally distributed, this can be simplified to
$\tilde{f}\left((x_1 + \delta_1, x_2 + \delta_2, .., x_n + \delta_n)^T\right)$. It would then make sense to sample the probability
distribution of $\delta$ a number of $t$ times and to use the mean values of $\tilde{f}(x, \delta)$ for each objective
function evaluation during the optimization process. In cases where the optimal value $y^\star$ of the
objective function $f$ is known, Equation 1.46 can be minimized. This approach is also used
in the work of Wiesmann et al. [2217, 2218] and basically turns the optimization algorithm
into something like a maximum likelihood estimator (see Section 28.7.2 and Equation 28.252
on page 502).
\[
f'(x) = \frac{1}{t} \sum_{i=1}^{t} \left(y^\star - \tilde{f}(x, \delta_i)\right)^2 \qquad (1.46)
\]
This method corresponds to using multiple, different training scenarios during the objec-
tive function evaluation in situations where X ⊆ Rn. By adding random noise and artificial
perturbations to the training cases, the chance of obtaining robust solutions which are stable
when applied or realized under noisy conditions can be increased.
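Equation 1.46 can be approximated by sampling the disturbance vector repeatedly. The Python sketch below is a minimal illustration, assuming independent, normally distributed disturbances that are simply added to the solution candidate. The objective `f`, the standard deviation `sigma`, the sample count `t`, and the fixed `seed` are all hypothetical choices made for the example.

```python
import math
import random

def f(x):
    """Hypothetical objective (maximized): narrow global peak at 2, broad local peak at -2."""
    narrow = 1.0 * math.exp(-((x[0] - 2.0) / 0.02) ** 2)
    broad = 0.8 * math.exp(-((x[0] + 2.0) / 2.0) ** 2)
    return max(narrow, broad)

def robust_objective(f, x, y_star, t=200, sigma=0.1, seed=0):
    """Eq. 1.46: mean squared deviation from y_star over t sampled disturbances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(t):
        disturbed = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += (y_star - f(disturbed)) ** 2
    return total / t

# Smaller is better: the broad local optimum tolerates the disturbances,
# the narrow global optimum does not.
print(robust_objective(f, [-2.0], 1.0), robust_objective(f, [2.0], 1.0))
```

With these assumed settings, minimizing f′ prefers the broad local peak over the narrow global one, which mirrors the situation sketched in Figure 1.23.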
1.4.8 Overfitting and Oversimplification
In all scenarios where optimizers evaluate some of the objective values of the solution can-
didates by using training data, two additional phenomena with negative influence can be
observed: overfitting and oversimplification.
Overfitting
The Problem
Definition 1.55 (Overfitting). Overfitting64 is the emergence of an overly complicated
model (solution candidate) in an optimization process resulting from the effort to provide
the best results for as much of the available training data as possible [1805, 1905, 785, 564].
63 http://en.wikipedia.org/wiki/Taguchi_methods [accessed 2008-07-19]
64 http://en.wikipedia.org/wiki/Overfitting [accessed 2007-07-03]

A model (solution candidate) m ∈ X optimized based on a finite set of training data
is considered to be overfitted if a less complicated, alternative model m′ ∈ X exists which
has a smaller error for the set of all possible (maybe even infinitely many), available, or
(theoretically) producible data samples. This model m′ may, however, have a larger error in
the training data.
The phenomenon of overfitting is best known and can often be encountered in the field
of artificial neural networks or in curve fitting65 [2019, 1291, 1265, 1806, 1761]. The latter
means that we have a set A of n training data samples (xi, yi) and want to find a function
f that represents these samples as well as possible, i. e., f (xi) = yi ∀(xi,yi) ∈ A.
If the xi are pairwise distinct, there exists exactly one polynomial66 of degree at most n − 1
that fits to each such training data and goes through all its points.67 Hence, when only polynomial regression is performed,
there is exactly one perfectly fitting function of minimal degree. Nevertheless, there will also
be an infinite number of polynomials with a higher degree than n − 1 that also match the
sample data perfectly. Such results would be considered as overfitted.
In Figure 1.24, we have sketched this problem. The function f1(x) = x shown in
Fig. 1.24.b has been sampled three times, as sketched in Fig. 1.24.a. There exists no other
polynomial of a degree of two or less that fits to these samples than f1. Optimizers, however,
could also find overfitted polynomials of a higher degree such as f2 which also match the
data, as shown in Fig. 1.24.c. Here, f2 plays the role of the overly complicated model m
which will perform as well as the simpler model m′ when tested with the training sets only,
but will fail to deliver good results for all other input data.
Figure 1.24: Overfitting due to complexity. (Panels: Fig. 1.24.a: Three sample points of f1; Fig. 1.24.b: m′ ≡ f1(x) = x; Fig. 1.24.c: m ≡ f2(x).)
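The situation of Figure 1.24 can be reproduced numerically. In the sketch below, the three sample positions and the concrete cubic `f2` are assumptions chosen for illustration; the text does not specify them. Any polynomial of the form x + c(x − x1)(x − x2)(x − x3) passes through the same three samples as f1 but deviates from it everywhere else.

```python
def f1(x):
    """The simple model m': the sampled function itself."""
    return x

def f2(x, c=1.0):
    """A hypothetical degree-3 model m that still passes through all three samples."""
    return x + c * (x - 1.0) * (x - 2.0) * (x - 3.0)

# Assumed training data: three samples of f1.
samples = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

def squared_error(model, data):
    """Sum of squared deviations of the model from the data samples."""
    return sum((model(x) - y) ** 2 for x, y in data)

print(squared_error(f1, samples), squared_error(f2, samples))  # identical on the training data
print(f1(5.0), f2(5.0))                                        # very different elsewhere
```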
A very common cause for overfitting is noise in the sample data. As we have already
pointed out, there exists no measurement device for physical processes which delivers per-
fect results without error. Surveys that represent the opinions of people on a certain topic
or randomized simulations will exhibit variations from the true interdependencies of the ob-
served entities, too. Hence, data samples based on measurements will always contain some
noise.
In Figure 1.25 we have sketched how such noise may lead to overfitted results. Fig. 1.25.a
illustrates a simple physical process obeying some quadratic equation. This process has been
measured using some technical equipment and the 100 noisy samples depicted in Fig. 1.25.b
have been obtained. Fig. 1.25.c shows a function resulting from an optimization that fits
the data perfectly. It could, for instance, be a polynomial of degree 99 that goes right
through all the points and thus, has an error of zero. Although being a perfect match to the
65 We will discuss overfitting in conjunction with Genetic Programming-based symbolic regression
in Section 23.1 on page 397.
66 http://en.wikipedia.org/wiki/Polynomial [accessed 2007-07-03]
67 http://en.wikipedia.org/wiki/Polynomial_interpolation [accessed 2008-03-01]

measurements, this complicated model does not accurately represent the physical law that
produced the sample data and will not deliver precise results for new, different inputs.
Figure 1.25: Fitting noise. (Panels: Fig. 1.25.a: The original physical process; Fig. 1.25.b: The measurement/training data; Fig. 1.25.c: The overfitted result.)
From the examples we can see that the major problem that results from overfitted solu-
tions is the loss of generality.
Definition 1.56 (Generality). A solution of an optimization process is general if it is
not only valid for the sample inputs a1, a2, . . . , an which were used for training during the
optimization process, but also for different inputs a ≠ ai ∀i : 0 < i ≤ n if such inputs a
exist.
Countermeasures
There exist multiple techniques that can be utilized in order to prevent overfitting to a
certain degree. It is most efficient to apply multiple such techniques together in order to
achieve best results.
A very simple approach is to restrict the problem space X in a way that only solutions up
to a given maximum complexity can be found. In terms of function fitting, this could mean
limiting the maximum degree of the polynomials to be tested. Furthermore, the functional
objective functions which solely concentrate on the error of the solution candidates should
be augmented by penalty terms and non-functional objective functions putting pressure in
the direction of small and simple models [564, 1108].
Large sets of sample data, although slowing down the optimization process, may improve
the generalization capabilities of the derived solutions. If arbitrarily many training datasets
or training scenarios can be generated, there are two approaches which work against over-
fitting:
1. The first method is to use a new set of (randomized) scenarios for each evaluation of
each solution candidate. The resulting objective values then may differ largely even if
the same individual is evaluated twice in a row, introducing incoherence and ruggedness
into the fitness landscape.
2. At the beginning of each iteration of the optimizer, a new set of (randomized) scenarios
is generated which is used for all individual evaluations during that iteration. This
method leads to objective values which can be compared without bias. They can be
made even more comparable if the objective functions are always normalized into some
fixed interval, say [0, 1].
In both cases it is helpful to use more than one training sample or scenario per evaluation
and to set the resulting objective value to the average (or, better, the median) of the outcomes.
Otherwise, the fluctuations of the objective values between the iterations will be very large,
making it hard for the optimizers to follow a stable gradient for multiple steps.
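The second method, combined with the median over several scenarios, could look roughly like the
following sketch. The scenario generator, the single-scenario evaluation `noisy_outcome`, and all
parameter values are hypothetical stand-ins chosen for illustration only.

```python
import random
from statistics import median


def noisy_outcome(candidate, scenario):
    """Hypothetical single-scenario evaluation: squared distance to the
    scenario's target value. Stands in for one simulation run."""
    return (candidate - scenario) ** 2


def make_scenarios(rng, count):
    """Draw `count` random scenarios; here simply noisy targets around 3.0."""
    return [3.0 + rng.gauss(0.0, 1.0) for _ in range(count)]


def objective(candidate, scenarios):
    """Median outcome over all scenarios of the current iteration."""
    return median(noisy_outcome(candidate, s) for s in scenarios)


def optimize(iterations=200, scenarios_per_iteration=9, seed=1):
    """Simple hill climber that regenerates the scenarios once per iteration."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(iterations):
        # One fresh scenario set per iteration, shared by all candidates
        # that are compared in this iteration.
        scenarios = make_scenarios(rng, scenarios_per_iteration)
        challenger = best + rng.gauss(0.0, 0.5)
        if objective(challenger, scenarios) <= objective(best, scenarios):
            best = challenger
    return best
```

Because both candidates of an iteration see the same scenarios, their objective values can be
compared directly, while the scenarios still change from iteration to iteration.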
Another simple method to prevent overfitting is to limit the runtime of the optimizers
[1805]. It is commonly assumed that learning processes normally first find relatively general
solutions which subsequently begin to overfit because the noise “is learned”, too.
For the same reason, some algorithms allow decreasing the rate at which the solution
candidates are modified over time. Such a decay of the learning rate makes overfitting less
likely.
Dividing Data into Training and Test Sets
If only one finite set of data samples is available
for training/optimization, it is common practice to separate it into a set of training data
At and a set of test cases Ac. During the optimization process, only the training data is
used. The resulting solutions are tested with the test cases afterwards. If their behavior is
significantly worse when applied to Ac than when applied to At, they are probably overfitted.
The same approach can be used to detect when the optimization process should be
stopped. The best known solution candidates can be checked with the test cases in each
iteration without influencing their objective values which solely depend on the training data.
If their performance on the test cases begins to decrease, there are no benefits in letting the
optimization process continue any further.
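A minimal sketch of this procedure is given below. The split ratio, the `patience` parameter, and
both function names are assumptions made for illustration; the text above only prescribes that
the test cases must not influence the objective values.

```python
import random


def split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and split it into (training, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def stop_iteration(test_errors, patience=3):
    """Return the index of the iteration whose solution should be kept.

    test_errors[i] is the test-set error of the best candidate after
    iteration i. The search is considered finished once the test error
    has not improved for `patience` consecutive iterations.
    """
    best_index = 0
    for i, err in enumerate(test_errors):
        if err < test_errors[best_index]:
            best_index = i
        elif i - best_index >= patience:
            break
    return best_index
```

For the error sequence `[5, 4, 3, 3.5, 3.6, 3.7, 2.0]` and `patience=3`, the function keeps the
solution of iteration 2: the run is stopped at iteration 5, so the later value 2.0 is never seen.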
Oversimplification
The Problem
Oversimplification (also called overgeneralization) is the opposite of overfitting. Whereas
overfitting denotes the emergence of overly complicated solution candidates, oversimplified
solutions are not complicated enough. Although they represent the training samples used
during the optimization process seemingly well, they are rough overgeneralizations which
fail to provide good results for cases not part of the training.
A common cause for oversimplification is sketched in Figure 1.26: The training sets
only represent a fraction of the set of possible inputs. As this is normally the case, one
should always be aware that such an incomplete coverage may fail to represent some of the
dependencies and characteristics of the data, which then may lead to oversimplified solutions.
Another possible reason for oversimplification is that ruggedness, deceptiveness, too much
neutrality, or high epistasis in the fitness landscape may lead to premature convergence and
prevent the optimizer from surpassing a certain quality of the solution candidates. It then
cannot adapt them completely even if the training data perfectly represents the sampled
process. A third possible cause is that a problem space could have been chosen which does
not include the correct solution.
Fig. 1.26.a shows a cubic function. Since it is a polynomial of degree three, four sample
points are needed for its unique identification. Maybe not knowing this, only three samples
have been provided in Fig. 1.26.b. By doing so, some vital characteristics of the function
are lost. Fig. 1.26.c depicts a quadratic function, the polynomial of the lowest degree that
fits these samples exactly. Although it is a perfect match for the training data, this function
does not touch any other point on the original cubic curve and behaves totally differently in
the region of lower parameter values.
However, even if we had included point P in our training data, it would still be possible
that the optimization process would yield Fig. 1.26.c as a result. Having training data that
correctly represents the sampled system does not mean that the optimizer is able to find a
correct solution with perfect fitness – the other, previously discussed problematic phenomena
can prevent it from doing so. Furthermore, if it was not known that the system which was
to be modeled by the optimization process can best be represented by a polynomial of the
third degree, one could have limited the problem space X to polynomials of degree two and
less. Then, the result would likely again be something like Fig. 1.26.c, regardless of how
many training samples are used.
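The effect can be reproduced with a small worked example. The cubic x^3 − x used here is an
assumed stand-in for the "real system" (the function of Figure 1.26 is not given in the text), and
Lagrange interpolation stands in for an optimizer that finds a perfect fit on the training data.

```python
from fractions import Fraction


def cubic(x):
    """The assumed "real system": x^3 - x."""
    return x ** 3 - x


def lagrange(points):
    """Return the lowest-degree polynomial through `points` as a function."""
    def poly(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return poly


# Only three samples: the fitted model is the parabola 3x^2 - 3x.
training = [(x, cubic(x)) for x in (0, 1, 2)]
model = lagrange(training)

# A fourth sample identifies the cubic uniquely.
full_model = lagrange([(x, cubic(x)) for x in (0, 1, 2, 3)])
```

Here `model` matches all three training samples exactly, yet `model(-1)` is 6 while
`cubic(-1)` is 0. The difference between the two functions is x(x − 1)(x − 2), which vanishes only
at the three sample points.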

Figure 1.26: Oversimplification. Panels: (a) the “real system” and the points describing it,
(b) the sampled training data, (c) the oversimplified result.
Countermeasures
In order to counter oversimplification, its causes have to be mitigated. Generally, it is not
possible to have training scenarios which cover the complete input space of the evolved
programs. By using multiple scenarios for each individual evaluation, the chance of missing
important aspects is decreased. These scenarios can be replaced with new, randomly created
ones in each generation, in order to decrease this chance even more. The problem space, i. e.,
the representation of the solution candidates, should further be chosen in a way which
allows constructing a correct solution to the problem defined. Then again, releasing too
many constraints on the solution structure increases the risk of overfitting and thus, careful
proceeding is recommended.
1.4.9 Dynamically Changing Fitness Landscape
It should also be mentioned that there exist problems with dynamically changing fitness
landscapes [282, 1465, 1729, 277, 278]. The task of an optimization algorithm is then to
provide solution candidates with momentarily optimal objective values for each point in
time. Here we have the problem that an optimum in iteration t will possibly not be an
optimum in iteration t + 1 anymore.
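As a toy illustration, consider the following sketch. The moving parabola and the candidate set
are purely hypothetical and are not one of the benchmarks cited in this section.

```python
def objective(x, t):
    """Hypothetical time-dependent objective (minimization): a parabola
    whose minimum moves one unit to the right in every iteration."""
    return (x - t) ** 2


def best_candidate(candidates, t):
    """Candidate with the lowest objective value at time step t."""
    return min(candidates, key=lambda x: objective(x, t))


candidates = [0, 1, 2, 3, 4]
optimum_t1 = best_candidate(candidates, 1)   # optimum in iteration t = 1
optimum_t2 = best_candidate(candidates, 2)   # optimum in iteration t + 1 = 2
```

The candidate that is optimal in iteration 1 is no longer optimal in iteration 2, so an optimizer
has to keep tracking the optimum instead of converging once.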
Problems with dynamic characteristics can, for example, be tackled with special forms
[2280] of
1. evolutionary algorithms [2053, 2224, 279, 280, 1463, 1464, 82],
2. genetic algorithms [817, 1457, 1458, 1459, 1146],
3. Particle Swarm Optimization [343, 344, 1280, 1605, 211],
4. Differential Evolution [1391, 2266], and
5. Ant Colony Optimization [868, 869].
The moving peaks benchmarks by Branke [277, 278] and Morrison and De Jong [1465]
are good examples for dynamically changing fitness landscapes. You can find them discussed
in Section 21.1.3 on page 328.
1.4.10 The No Free Lunch Theorem
By now, we know the most important problems that can be encountered when applying
an optimization algorithm to a given problem. Furthermore, we have seen that it is arguable
what an optimum actually is if multiple criteria are optimized at once. The fact that there
is most likely no optimization method that can outperform all others on all problems can,
thus, easily be accepted. Instead, there exist a variety of optimization methods specialized
in solving different types of problems. There are also algorithms which deliver good results
for many different problem classes, but may be outperformed by highly specialized methods
in each of them. These facts have been formalized by Wolpert and Macready [2244, 2245]
in their No Free Lunch Theorems68 (NFL) for search and optimization algorithms.
Initial Definitions
Wolpert and Macready [2245] consider single-objective optimization and define an optimiza-
tion problem φ(g) ≡ f(gpm(g)) as a mapping of a search space G to the objective space Y.69
Since this definition subsumes the problem space and the genotype-phenotype mapping, only
skipping the possible search operations, it is very similar to our Definition 1.34 on page 46.
They further call a time-ordered set $d_m$ of $m$ distinct visited points in $G \times Y$ a “sample” of
size $m$ and write $d_m \equiv \{(d_m^g(1), d_m^y(1)), (d_m^g(2), d_m^y(2)), \ldots, (d_m^g(m), d_m^y(m))\}$.
$d_m^g(i)$ is the genotype and $d_m^y(i)$ the corresponding objective value visited at time step $i$.
Then, the set $D_m = (G \times Y)^m$ is the space of all possible samples of length $m$ and
$D = \cup_{m \ge 0} D_m$ is the set of all samples of arbitrary size.
An optimization algorithm a can now be considered to be a mapping of the previously
visited points in the search space (i. e., a sample) to the next point to be visited. Formally,
this means $a : D \to G$. Without loss of generality, Wolpert and Macready [2245] only regard
unique visits and thus define $a : d \in D \to g : g \notin d$.
Performance measures Ψ can be defined independently from the optimization algorithms
only based on the values of the objective function visited in the samples $d_m$. If the objective
function is subject to minimization, $\Psi(d_m^y) = \min\{d_m^y(i) : i = 1..m\}$ would be the appropriate
measure.
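These definitions translate almost directly into code. The sketch below is only an illustration:
the finite search space, the toy enumeration algorithm, and the example problem `phi` (with an
identity genotype-phenotype mapping) are all assumptions.

```python
def run(algorithm, phi, m):
    """Build the sample d_m: a time-ordered list of (genotype, value) pairs."""
    sample = []
    for _ in range(m):
        g = algorithm(sample)              # a : D -> G
        assert all(g != visited for visited, _ in sample), "revisit"
        sample.append((g, phi(g)))
    return sample


def performance(sample):
    """Psi for minimization: the best objective value in the sample."""
    return min(y for _, y in sample)


SEARCH_SPACE = range(-3, 4)                # hypothetical finite G


def enumerate_left_to_right(sample):
    """Toy algorithm: visit the first point of G not yet in the sample."""
    visited = {g for g, _ in sample}
    return next(g for g in SEARCH_SPACE if g not in visited)


def phi(g):
    """Hypothetical problem phi(g) = f(gpm(g)) with an identity gpm."""
    return (g - 1) ** 2


d4 = run(enumerate_left_to_right, phi, 4)
```

The algorithm only ever sees the sample collected so far, which is exactly the black-box view
taken in the formal definition.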
Often, only parts of the optimization problem φ are known. If the minima of the objective
function f were already identified beforehand, for instance, its optimization would be useless.
Since the behavior in wide areas of φ is not obvious, it makes sense to define a probability
P (φ) that we are actually dealing with φ and no other problem. Wolpert and Macready
[2245] use the handy example of the travelling salesman problem in order to illustrate this
issue. Each distinct TSP produces a different structure of φ. Yet, we would use the same
optimization algorithm a for all problems of this class without knowing the exact shape of φ.
This corresponds to the assumption that there is a set of very similar optimization problems
which we may encounter here although their exact structure is not known. We act as if there
was a probability distribution over all possible problems which is non-zero for the TSP-alike
ones and zero for all others.
The Theorem
The performance of an algorithm a iterated m times on an optimization problem φ can
then be defined as $P(d_m^y \mid \phi, m, a)$, i. e., the conditional probability of finding a particular
sample $d_m^y$. Notice that this measure is very similar to the value of the problem landscape
Φ(x, τ ) introduced in Definition 1.38 on page 48 which is the cumulative probability that
the optimizer has visited the element x ∈ X until (inclusively) the τth evaluation of the
objective function(s).
Wolpert and Macready [2245] prove that the sum of such probabilities over all possi-
ble optimization problems φ is always identical for all optimization algorithms. For two
optimizers a1 and a2, this means that
68 http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization [accessed 2008-03-
28]
69 Notice that we have partly utilized our own notations here in order to be consistent throughout
the book.

$$\sum_{\forall \phi} P(d_m^y \mid \phi, m, a_1) = \sum_{\forall \phi} P(d_m^y \mid \phi, m, a_2) \tag{1.47}$$
Hence, the average over all φ of $P(d_m^y \mid \phi, m, a)$ is independent of a.
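For a very small search space, this statement can be checked by exhaustive enumeration. The
following sketch assumes a search space with three elements, two possible objective values, and
two deterministic, non-revisiting toy algorithms; for deterministic algorithms each probability
in Equation 1.47 is either 0 or 1, so the sums reduce to counts.

```python
from collections import Counter
from itertools import product

G = (0, 1, 2)          # tiny search space (assumption for illustration)
Y = (0, 1)             # tiny objective space


def fixed_order(sample):
    """a1: visit 0, 1, 2 in this order, ignoring all observed values."""
    visited = [g for g, _ in sample]
    return min(g for g in G if g not in visited)


def adaptive(sample):
    """a2: start at 2; afterwards let the last observed value decide
    whether the smallest or the largest unvisited point comes next."""
    if not sample:
        return 2
    visited = [g for g, _ in sample]
    unvisited = [g for g in G if g not in visited]
    return min(unvisited) if sample[-1][1] == 0 else max(unvisited)


def value_histogram(algorithm, m):
    """Count, over all |Y|^|G| problems phi, how often each sequence of
    objective values d^y_m is produced by `algorithm`."""
    histogram = Counter()
    for values in product(Y, repeat=len(G)):
        phi = dict(zip(G, values))
        sample = []
        for _ in range(m):
            g = algorithm(sample)
            sample.append((g, phi[g]))
        histogram[tuple(y for _, y in sample)] += 1
    return histogram
```

Comparing `value_histogram(fixed_order, m)` with `value_histogram(adaptive, m)` for m = 1, 2, 3
yields identical histograms, although the two algorithms visit the points in different orders.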
Implications
From this theorem, it immediately follows that, in order to outperform a1 in one optimization
problem, a2 will necessarily perform worse in another. Figure 1.27 visualizes this
issue. It shows that general optimization approaches like evolutionary algorithms can solve
a variety of problem classes with reasonable performance. In this figure, we have chosen
a performance measure Φ subject to maximization, i. e., the higher its values, the faster
the problem will be solved. Hill climbing approaches, for instance, will be much faster than
evolutionary algorithms if the objective functions are steady and monotonic, that is, in a
smaller set of optimization tasks. Greedy search methods will perform fast on all problems
with matroid70 structure. Evolutionary algorithms will most often still be able to solve these
problems; it just takes them longer to do so. The performance of hill climbing and greedy
approaches degenerates in other classes of optimization tasks as a trade-off for their high
utility in their “area of expertise”.
Figure 1.27: A visualization of the No Free Lunch Theorem (a very crude sketch). The plot shows
the performance over the set of all possible optimization problems for four curves: random walk
or exhaustive enumeration; a general optimization algorithm (an EA, for instance); specialized
optimization algorithm 1 (a hill climber, for instance); and specialized optimization algorithm 2
(a depth-first search, for instance).
One interpretation of the No Free Lunch Theorem is that it is impossible for any opti-
mization algorithm to outperform random walks or exhaustive enumerations on all possible
problems. For every problem where a given method leads to good results, we can construct
a problem where the same method has exactly the opposite effect (see Section 1.4.4). As
a matter of fact, doing so is even a common practice to find weaknesses of optimization
algorithms and to compare them with each other, see Section 21.2.6, for example.
70 http://en.wikipedia.org/wiki/Matroid [accessed 2008-03-28]

Another interpretation is that every useful optimization algorithm utilizes some form
of problem-specific knowledge. Radcliffe [1696] states that without such knowledge, search
algorithms cannot exceed the performance of simple enumerations. Incorporating knowledge
starts with relying on simple assumptions like “if x is a good solution candidate, then we
can expect other good solution candidates in its vicinity”, i. e., strong causality. The more
(correct) problem specific knowledge is integrated (correctly) into the algorithm structure,
the better will the algorithm perform. On the other hand, knowledge correct for one class
of problems is, quite possibly, misleading for another class. In reality, we use optimizers to
solve a given set of problems and are not interested in their performance when (wrongly)
applied to other classes.
The rough meaning of the NFL is that all black-box optimization methods perform
equally well over the complete set of all optimization problems [1563]. In practice, we do not
want to apply an optimizer to all possible problems but to only some, restricted classes. In
terms of these classes, we can make statements about which optimizer performs better.
Today, there exists a wide range of work on No Free Lunch Theorems for many different
aspects of machine learning. The website http://www.no-free-lunch.org/ (see footnote 71) gives a
good overview of them. Further summaries, extensions, and criticisms have been provided by
Köppen et al. [1173], Droste et al. [602, 601, 599, 600], Oltean [1563], and Igel and Toussaint
[1008, 1009]. Radcliffe and Surry [1694] discuss the NFL in the context of evolutionary
algorithms and the representations used as search spaces. The No Free Lunch Theorem is
furthermore closely related to the Ugly Duckling Theorem72 proposed by Watanabe [2159]
for classification and pattern recognition.
1.4.11 Conclusions
The subject of this introductory chapter was the question about what makes optimization
problems hard, especially for metaheuristic approaches. We have discussed numerous differ-
ent phenomena which can affect the optimization process and lead to disappointing results.
If an optimization process has converged prematurely, it has been trapped in a non-
optimal region of the search space from which it cannot “escape” anymore (Section 1.4.2).
Ruggedness (Section 1.4.3) and deceptiveness (Section 1.4.4) in the fitness landscape, of-
ten caused by epistatic effects (Section 1.4.6), can misguide the search into such a region.
Neutrality and redundancy (Section 1.4.5) can either slow down optimization because the
application of the search operations does not lead to a gain in information or may also con-
tribute positively by creating neutral networks from which the search space can be explored
and local optima can be escaped from. Noise is present in virtually all practical optimization
problems. The solutions that are derived for them should be robust (Section 1.4.7). Also,
they should neither be too general (oversimplification, Section 1.4.8) nor too specifically
aligned only to the training data (overfitting, Section 1.4.8). Furthermore, many practical
problems are multi-objective, i. e., involve the optimization of more than one criterion at
once (partially discussed in Section 1.2.2), or concern objectives which may change over time
(Section 1.4.9).
In the previous section, we discussed the No Free Lunch Theorem and argued that it is
not possible to develop the one optimization algorithm, the problem-solving machine which
can provide us with near-optimal solutions in short time for every possible optimization
task. This must sound very depressing for everybody new to this subject.
Actually, quite the opposite is the case, at least from the point of view of a researcher.
The No Free Lunch Theorem means that there will always be new ideas, new approaches
which will lead to better optimization algorithms to solve a given problem. Instead of being
doomed to obsolescence, it is far more likely that most of the currently known optimization
methods have at least one niche, one area where they are excellent. It also means that it
is very likely that the “puzzle of optimization algorithms” will never be completed. There
will always be a chance that an inspiring moment, an observation in nature, for instance,
may lead to the invention of a new optimization algorithm which performs better in some
problem areas than all currently known ones.
Figure 1.28: The puzzle of optimization algorithms. Pieces shown: EDA; Branch & Bound; Dynamic
Programming; Evolutionary Algorithms (GA, GP, ES, DE, EP, ...); [...] Search; Extremal
Optimization; RFD; Memetic Algorithms; IDDFS; Simulated Annealing; ACO; Hill Climbing; LCS;
Tabu Search; PSO; Downhill Simplex; Random Optimization.
71 accessed: 2008-03-28
72 http://en.wikipedia.org/wiki/Ugly_duckling_theorem [accessed 2008-08-22]
1.5 Formae and Search Space/Operator Design
Most global optimization algorithms share the premise that solutions to problems are either
elements of a somewhat continuous space that can be approximated stepwise or that they can
be composed of smaller modules which have good attributes even when occurring separately.
The design of the search space (or genome) G and the genotype-phenotype mapping
gpm is vital for the success of the optimization process. It determines to what degree these
expected features can be exploited by defining how the properties and the behavior of
the solution candidates are encoded and how the search operations influence them. In this
chapter, we will first discuss a general theory about how properties of individuals can be
defined, classified, and how they are related. We will then outline some general rules for
the design of the genome which are inspired by our previous discussion of the possible
problematic aspects of fitness landscapes.
1.5.1 Forma Analysis
The Schema Theorem has been stated for genetic algorithms by Holland [940] in his seminal
work [940, 512, 945]. In this section, we are going to discuss it in the more general version
from Weicker [2167] as introduced by Radcliffe and Surry [1695] and Surry [1983] in [1692,
1696, 1691, 1691, 1695].
The different individuals p in the population Pop of the search and optimization algo-
rithms are characterized by their properties φ. Whereas the optimizers themselves focus
mainly on the phenotypical properties since these are evaluated by the objective functions,
the properties of the genotypes may be of interest in an analysis of the optimization perfor-
mance.
A rather structural property φ1 of formulas f : R → R in symbolic regression73 would be
whether a formula contains the mathematical expression x+1 or not. We can also declare a behavioral
property φ2 which is true if |f(0) − 1| ≤ 0.1 holds, i. e., if the result of f is close to a value
1 for the input 0, and false otherwise. Assume that the formulas were decoded from a
binary search space $G = B^n$ to the space of trees that represent mathematical expressions by
a genotype-phenotype mapping. A genotypical property then would be if a certain sequence
of bits occurs in the genotype p.g and a phenotypical property is the number of nodes in
the phenotype p.x, for instance. If we try to solve a graph-coloring problem, for example, a
property φ3 ∈ {black,white,gray} could denote the color of a specific vertex q as illustrated
in Figure 1.29.
73 More information on symbolic regression can be found in Section 23.1 on page 397.
Figure 1.29: A graph-coloring-based example for properties and formae. The population Pop ⊆ X
of colored graphs G1 to G8 is divided into the formae Aφ3=black, Aφ3=white, and Aφ3=gray
according to the color of the vertex q.
In general, we can imagine the properties φi to be some sort of functions that map the
individuals to property values. φ1 and φ2 would then both map the space of mathematical
functions to the set B = {true,false} whereas φ3 maps the space of all possible colorings
for the given graph to the set {white,gray,black}. On the basis of the properties φi we can
define equivalence relations74 $\sim_{\phi_i}$:
$$p_1 \sim_{\phi_i} p_2 \Rightarrow \phi_i(p_1) = \phi_i(p_2) \quad \forall p_1, p_2 \in G \times X \tag{1.48}$$
Obviously, for each two solution candidates x1 and x2, either $x_1 \sim_{\phi_i} x_2$ or
$x_1 \not\sim_{\phi_i} x_2$ holds. These relations divide the search space into equivalence classes
$A_{\phi_i=v}$.
Definition 1.57 (Forma). An equivalence class $A_{\phi_i=v}$ that contains all the individuals
sharing the same characteristic v in terms of the property φi is called a forma [1691] or
predicate [2122].
$$A_{\phi_i=v} = \{\forall p \in G \times X : \phi_i(p) = v\} \tag{1.49}$$
$$\forall p_1, p_2 \in A_{\phi_i=v} \Rightarrow p_1 \sim_{\phi_i} p_2 \tag{1.50}$$
The number of formae induced by a property, i. e., the number of its different character-
istics, is called its precision [1691]. The precision of φ1 and φ2 is 2, for φ3 it is 3. We can
define another property φ4 ≡ f(0) denoting the value a mathematical function has for the
input 0. This property would have an uncountably infinite precision.
Two formae $A_{\phi_i=v}$ and $A_{\phi_j=w}$ are said to be compatible, written as
$A_{\phi_i=v} \bowtie A_{\phi_j=w}$, if there can exist at least one individual which is an instance of both.
74 See the definition of equivalence classes in Section 27.7.3 on page 464.

Figure 1.30: Example for formae in symbolic regression. The population Pop ⊆ X contains the
functions f1(x) = x+1, f2(x) = x^2+1.1, f3(x) = x+2, f4(x) = 2(x+1), f5(x) = tan x,
f6(x) = (sin x)(x+1), f7(x) = (cos x)(x+1), and f8(x) = (tan x)+1. It is partitioned by ∼φ1 into
Aφ1=true and Aφ1=false, by ∼φ2 into Aφ2=true and Aφ2=false, and by ∼φ4 into Aφ4=0, Aφ4=1,
Aφ4=1.1, and Aφ4=2.
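$$A_{\phi_i=v} \bowtie A_{\phi_j=w} \Leftrightarrow A_{\phi_i=v} \cap A_{\phi_j=w} \neq \emptyset \tag{1.51}$$
$$A_{\phi_i=v} \bowtie A_{\phi_j=w} \Leftrightarrow \exists\, p \in G \times X : p \in A_{\phi_i=v} \wedge p \in A_{\phi_j=w} \tag{1.52}$$
$$A_{\phi_i=v} \bowtie A_{\phi_i=w} \Rightarrow w = v \tag{1.53}$$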
Of course, two different formae of the same property φi, i. e., two different characteristics
of φi, are always incompatible. In our initial symbolic regression example, hence,
$A_{\phi_1=\text{true}} \not\bowtie A_{\phi_1=\text{false}}$ since it is not possible that a function f contains a
term x + 1 and at the same time does not contain it. All formae of the properties φ1 and φ2, on
the other hand, are compatible: $A_{\phi_1=\text{false}} \bowtie A_{\phi_2=\text{false}}$,
$A_{\phi_1=\text{false}} \bowtie A_{\phi_2=\text{true}}$, $A_{\phi_1=\text{true}} \bowtie A_{\phi_2=\text{false}}$, and
$A_{\phi_1=\text{true}} \bowtie A_{\phi_2=\text{true}}$ all hold. If we take φ4 into consideration, we will find
that some of its formae are compatible with some formae of φ2 and some are not, like
$A_{\phi_2=\text{true}} \bowtie A_{\phi_4=1}$ and $A_{\phi_2=\text{false}} \bowtie A_{\phi_4=2}$, but
$A_{\phi_2=\text{true}} \not\bowtie A_{\phi_4=0}$ and $A_{\phi_2=\text{false}} \not\bowtie A_{\phi_4=0.95}$.
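The properties and formae of this example can be computed explicitly for the eight functions of
Figure 1.30. The sketch below rests on several assumptions: individuals are stored as pairs of an
expression text and a callable, φ1 is approximated by a plain substring test, and a small floating
point tolerance is added to φ2. Note also that a finite population can only witness compatibility
(a common instance exists in it); it cannot prove incompatibility for the whole space.

```python
from math import cos, sin, tan

# The eight functions of Figure 1.30 as (expression text, callable) pairs.
POPULATION = {
    "f1": ("x+1", lambda x: x + 1),
    "f2": ("x**2+1.1", lambda x: x ** 2 + 1.1),
    "f3": ("x+2", lambda x: x + 2),
    "f4": ("2*(x+1)", lambda x: 2 * (x + 1)),
    "f5": ("tan(x)", lambda x: tan(x)),
    "f6": ("sin(x)*(x+1)", lambda x: sin(x) * (x + 1)),
    "f7": ("cos(x)*(x+1)", lambda x: cos(x) * (x + 1)),
    "f8": ("tan(x)+1", lambda x: tan(x) + 1),
}


def phi1(individual):
    """Structural property: does the expression contain the term x+1?"""
    return "x+1" in individual[0]


def phi2(individual):
    """Behavioral property |f(0) - 1| <= 0.1 (small float tolerance added)."""
    return abs(individual[1](0) - 1) <= 0.1 + 1e-9


def phi4(individual):
    """Value of the function for the input 0."""
    return individual[1](0)


def formae(prop):
    """Partition the population into the formae A_{prop=v}."""
    classes = {}
    for name, individual in POPULATION.items():
        classes.setdefault(prop(individual), set()).add(name)
    return classes


def witnessed_compatible(forma_a, forma_b):
    """True if some individual of the population is an instance of both."""
    return bool(forma_a & forma_b)
```

Within this population, φ4 induces four formae (for the values 0, 1, 1.1, and 2), and the forma
with φ2 = true shares members with the forma φ4 = 1 but none with the forma φ4 = 0.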
The discussion of formae and their dependencies stems from the evolutionary algorithm
community and there especially from the supporters of the Building Block Hypothesis. The
idea is that the algorithm first discovers formae which have a good influence on the overall
fitness of the solution candidates. The hope is that there are many compatible ones under
these formae that are then gradually combined in the search process.
In this text we have defined formae and the corresponding terms on the basis of individuals
p which are records that assign an element of the problem space p.x ∈ X to an element
of the search space p.g ∈ G. Generally, we will relax this notation and also discuss formae
directly in the context of the search space G or problem space X, when appropriate.
1.5.2 Genome Design
In software engineering, there are some design patterns75 that describe good practice and
experience values. Utilizing these patterns will help the software engineer to create well-
organized, extensible, and maintainable applications.
Whenever we want to solve a problem with global optimization algorithms, we need to
define the structure of a genome. The individual representation along with the
genotype-phenotype mapping is a vital part of genetic algorithms and has major impact on the
chance of finding good solutions.
75 http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29 [accessed 2007-08-12]
We have already discussed the basic problems that we may encounter during optimiza-
tion. The choice of the search space, the search operations, and the genotype-phenotype
mapping have major impact on the chance of finding good solutions. After formalizing the
ideas of properties and formae, we will now outline some general best practices for the genome
design from different perspectives. These principles can lead to finding better solutions or
higher optimization speed if considered in the design phase [1765, 1525].
Goldberg [821] defines two general design patterns for genotypes in genetic algorithms
which we will state here in the context of the forma analysis [1525]:
1. The representations of the formae in the search space should be as short as possible and
the representations of different, compatible phenotypic formae should not influence each
other.
2. The alphabet of the encoding and the lengths of the different genes should be as small
as possible.
Both rules aim at minimal redundancy in the genomes. We have already mentioned
in Section 1.4.5 on page 67 that uniform redundancy slows down the optimization process.
Especially the second rule focuses on this cause of neutrality by discouraging the use of
unnecessarily large alphabets for encoding in a genetic algorithm. Palmer and Kershenbaum
[1602, 1603] define additional rules for tree-representations in [1602, 1601], which have been
generalized by Nguyen [1525]:
3. A good search space and genotype-phenotype mapping should be able to represent all
phenotypes, i. e., be surjective (see Section 27.7 on page 461).
∀x ∈ X ⇒ ∃g ∈ G : x = gpm(g)
(1.54)
4. The search space G should be unbiased in the sense that all phenotypes are represented
by the same number of genotypes. This property makes it possible to efficiently select an
unbiased start population, giving the optimizer the chance of reaching all parts of the
problem space.
∀x1,x2 ∈ X ⇒ |{g ∈ G : x1 = gpm(g)}| ≈ |{g ∈ G : x2 = gpm(g)}|
(1.55)
5. The genotype-phenotype mapping should always yield valid phenotypes. The meaning
of valid in this context is that if the problem space X is the set of all possible trees,
only trees should be encoded in the genome. If we use $\mathbb{R}^3$ as problem space, no
vectors with fewer or more elements than three should be produced by the genotype-
phenotype mapping. This form of validity does not imply that the individuals are also
correct solutions in terms of the objective functions.
6. The genotype-phenotype mapping should be simple and bijective.
7. The representations in the search space should possess strong causality (locality), i. e.,
small changes in the genotype lead to small changes in the phenotype (see Section 1.4.3).
Optimally, this would mean that:
∀x1,x2 ∈ X,g ∈ G : x1 = gpm(g) ∧ x2 = gpm(searchOp(g)) ⇒ x2 ≈ x1
(1.56)
Ronald [1752] summarizes some further rules [1752, 1525]:
8. The genotypic representation should be aligned to a set of reproduction operators in a
way that good configurations of formae are preserved by the search operations and do
not easily get lost during the exploration of the search space.
9. The representations should minimize epistasis (see Section 1.4.6 on page 68 and the 1st
rule).
10. The problem should be represented at an appropriate level of abstraction.

11. If a direct mapping between genotypes and phenotypes is not possible, a suitable artificial
embryogeny approach should be applied.
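Some of these rules can be checked mechanically for small, finite search spaces. The following
sketch tests rules 3, 4, and 7 by enumeration; the modulo mapping `gpm_modulo` and the bit string
length are hypothetical examples, and the strict equality test for rule 4 is a simplification of
the approximate equality in Equation 1.55.

```python
from collections import Counter
from itertools import product


def genotypes(n):
    """All bit strings of length n as tuples, i.e. G = B^n."""
    return list(product((0, 1), repeat=n))


def to_int(bits):
    """Interpret a bit tuple as an unsigned binary number."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def gpm_modulo(bits):
    """Hypothetical mapping onto X = {0, ..., 4}: binary value modulo 5."""
    return to_int(bits) % 5


def preimage_counts(gpm, search_space):
    """Number of genotypes that encode each phenotype."""
    return Counter(gpm(g) for g in search_space)


def is_surjective(gpm, search_space, problem_space):
    """Rule 3: every phenotype has at least one genotype."""
    return set(problem_space) <= set(gpm(g) for g in search_space)


def is_unbiased(gpm, search_space):
    """Rule 4 in its strict form: all preimage sets have the same size."""
    return len(set(preimage_counts(gpm, search_space).values())) == 1


def worst_single_flip_change(gpm, search_space):
    """Rule 7: largest phenotypic change caused by flipping one bit."""
    worst = 0
    for g in search_space:
        for i in range(len(g)):
            flipped = g[:i] + (1 - g[i],) + g[i + 1:]
            worst = max(worst, abs(gpm(g) - gpm(flipped)))
    return worst
```

For three-bit genotypes, `gpm_modulo` is surjective onto {0, ..., 4} but biased, since the
phenotypes 0, 1, and 2 each have two genotypes while 3 and 4 have only one. The plain binary
mapping `to_int` is unbiased, yet a single bit flip can change the phenotype by 4, which shows
that it has only weak locality.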
Let us now summarize some more conclusions for search spaces based on forma analysis
as stated by Radcliffe [1692] and Weicker [2167].
12. Formae in Genotypic and Phenotypic Space
The optimization algorithms find new elements in the search space G by applying the search
operations searchOp ∈ Op. These operations can only create, modify, or combine genotypical
formae since they usually have no information about the problem space. Most mathematical
models dealing with the propagation of formae like the Building Block Hypothesis and the
Schema Theorem76 thus focus on the search space and show that highly fit genotypical for-
mae will more probably be investigated further than those of low utility. Our goal, however,
is to find highly fit formae in the problem space X. Such properties can only be created,
modified, and combined by the search operations if they correspond to genotypical formae.
A good genotype-phenotype mapping should provide this feature.
It furthermore becomes clear that useful separate properties in phenotypic space can only
be combined by the search operations properly if they are represented by separate formae
in genotypic space too.
13. Compatibility of Formae
Formae of different properties should be compatible. Compatible Formae in phenotypic space
should also be compatible in genotypic space. This leads to a low level of epistasis and hence
will increase the chance of success of the reproduction operations.
14. Inheritance of Formae
The 8th rule mentioned that formae should not get lost during the exploration of the search space.
From a good binary search operation like recombination (crossover) in genetic algorithms,
we can expect that if its two parameters g1 and g2 are members of a forma A, the resulting
element will also be an instance of A.
∀g1,g2 ∈ A ⊆ G ⇒ searchOp(g1,g2) ∈ A
(1.57)
If we furthermore can assume that all instances of all formae A with minimal precision
(A ∈ mini) of an individual are inherited by at least one parent, the binary reproduction
operation is considered as pure.
∀g3 = searchOp(g1,g2) ∈ G, ∀A ∈ mini : g3 ∈ A ⇒ g1 ∈ A ∨ g2 ∈ A
(1.58)
If this is the case, all properties of a genotype g3 which is a combination of two others
g1, g2 can be traced back to at least one of its parents. Otherwise, searchOp also performs an
implicit unary search step, a mutation in genetic algorithms, for instance. Such properties,
although discussed here for binary search operations only, can be extended to arbitrary n-ary
operators.
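Equation 1.57 can be illustrated with schemata over bit strings, which are one common kind of
formae in genetic algorithms. The sketch below assumes genotypes given as strings of '0' and
'1', schemata with '*' as don't-care symbol, and uniform crossover as the binary search
operation; the empirical check with a fixed number of trials is an illustration, not a proof.

```python
import random


def matches(genotype, schema):
    """Is the bit string an instance of the forma described by `schema`?
    '*' is a don't-care position."""
    return all(s in ("*", g) for g, s in zip(genotype, schema))


def uniform_crossover(g1, g2, rng):
    """Take every gene from one of the two parents, chosen at random."""
    return "".join(rng.choice(pair) for pair in zip(g1, g2))


def respects(schema, g1, g2, trials=1000, seed=0):
    """Empirically check Equation 1.57 for one pair of parents."""
    rng = random.Random(seed)
    return all(matches(uniform_crossover(g1, g2, rng), schema)
               for _ in range(trials))
```

Since every gene of the offspring is copied from one of the parents at the same position, this
operator is also pure in the sense of Equation 1.58: wherever both parents agree, the child
agrees as well, so `respects("1*0*", "1101", "1000")` holds.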
76 See Section 3.6 for more information on the Schema Theorem.

15. Combinations of Formae
If genotypes g1, g2, . . . which are instances of different but compatible formae A1 ⊲⊳ A2 ⊲⊳ . . .
are combined by a binary (or n-ary) search operation, the resulting genotype g should be an
instance of both properties, i. e., the combination of compatible formae should be a forma
itself.
$$\forall g_1 \in A_1, g_2 \in A_2, \cdots \Rightarrow \mathrm{searchOp}(g_1, g_2, \ldots) \in A_1 \cap A_2 \cap \ldots \; (\neq \emptyset) \tag{1.59}$$
If this principle holds for many individuals and formae, useful properties can be com-
bined by the optimization step by step, narrowing down the precision of the arising, most
interesting formae more and more. This should lead the search to the most promising regions
of the search space.
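A tiny example shows that whether an operator combines formae depends on how it is applied.
The sketch assumes bit string genotypes, schemata with '*' as don't-care symbol, and one-point
crossover; the parents and schemata are hypothetical.

```python
def matches(genotype, schema):
    """Is the bit string an instance of the forma described by `schema`?"""
    return all(s in ("*", g) for g, s in zip(genotype, schema))


def one_point_crossover(g1, g2, cut):
    """Head of the first parent followed by the tail of the second."""
    return g1[:cut] + g2[cut:]


def combining_cuts(g1, g2, schema1, schema2):
    """Cut points for which the offspring is an instance of both formae."""
    result = []
    for cut in range(1, len(g1)):
        child = one_point_crossover(g1, g2, cut)
        if matches(child, schema1) and matches(child, schema2):
            result.append(cut)
    return result
```

For the parents "1111" (an instance of "11**") and "0000" (an instance of "**00"), only the cut
point 2 produces the offspring "1100", which lies in the intersection of both formae. With the
other cut points, one of the two formae is disrupted.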
16. Reachability of Formae
The set of available search operations Op should include at least one unary search operation
which is able to reach all possible formae. If the binary search operations in Op all are pure,
this unary operator is the only one (apart from creation operations) able to introduce new
formae which are not yet present in the population. Hence, it should be able to find any
given forma.
17. Influence of Formae
One rule which, in my opinion, was missing in the lists given by Radcliffe [1692] and Weicker
[2167] is that the absolute contributions of the single formae to the overall objective values of
a solution candidate should not be too different. Let us divide the phenotypic formae into those
with positive and those with negative or neutral contribution and let us, for simplification
purposes, assume that those with positive contribution can be arbitrarily combined. If one
of the positive formae has a contribution with an absolute value much lower than those of
the other positive formae, we will run into the problem of domino convergence discussed
in Section 1.4.2 on page 58.
Then, the search will first discover the building blocks of higher value. This, itself, is
not a problem. However, as we have already pointed out in Section 1.4.2, if the search is
stochastic and performs exploration steps, chances are that alleles of higher importance get
destroyed during this process and have to be rediscovered. The values of the less salient
formae would then play no role. Thus, the chance of finding them strongly depends on how
frequently the destruction of important formae takes place.
Ideally, we would therefore design the genome and phenome in a way that the different
characteristics of the solution candidate all influence the objective values to a similar degree.
Then, the chance of finding good formae increases.
(18.) Extradimensional Bypass
Minimal-sized genomes are not always the best approach. An interesting aspect of genome
design supporting this claim is inspired by the works of the theoretical biologist Conrad
[436, 438, 440, 437]. According to his extradimensional bypass principle, it is possible to
transform a rugged fitness landscape with isolated peaks into one with connected saddle
points by increasing the dimensionality of the search space [387, 342]. In [440] he states that
the chance of isolated peaks in randomly created fitness landscapes decreases when their
dimensionality grows.
This partly contradicts rules 1 and 2, which state that genomes should be as compact as
possible. Conrad [440] does not suggest that nature includes useless sequences in the genome
but rather genes which allow for

1. new phenotypical characteristics or
2. redundancy providing new degrees of freedom for the evolution of a species.
In some cases, such an increase in freedom more than makes up for the additional “costs”
arising from the enlargement of the search space. The extradimensional bypass can be
considered as an example of positive neutrality (see Section 1.4.5).
[Figure 1.31: Examples for an increase of the dimensionality of a search space G (1d) to G′
(2d). Fig. 1.31.a: Useful increase of dimensionality. Fig. 1.31.b: Useless increase of
dimensionality. Both panels plot f(x) over G and G′ and mark a local and a global optimum.]
In Fig. 1.31.a, an example for the extradimensional bypass (similar to Fig. 6 in [246])
is sketched. The original problem had a one-dimensional search space G corresponding to
the horizontal axis up front. As can be seen in the plane in the foreground, the objective
function had two peaks: a local optimum on the left and a global optimum on the right,
separated by a larger valley. When the optimization process began climbing up the local
optimum, it was very unlikely that it ever could escape this hill and reach the global one.
Increasing the search space to two dimensions (G′), however, opened up a pathway
between them. The two isolated peaks became saddle points on a longer ridge. The global
optimum is now reachable from all points on the local optimum.
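A toy version of this effect can be reproduced in a few lines. The sketch below is only an illustration under invented assumptions: the one-dimensional objective `f`, the values along the added dimension (`ridge`), and the discrete grid are all hypothetical and are not taken from [246] or from Figure 1.31. The objective is maximized, and a simple steepest-ascent hill climber stands in for the optimization process.

```python
def f(x):
    """1-D objective (maximized): local peak at x = 4, global peak at x = 16."""
    return max(5 - 2 * abs(x - 4), 10 - 2 * abs(x - 16), 0)

def ridge(x):
    """Hypothetical values along the added dimension: a slope rising towards x = 16."""
    return 5 + 0.3 * x if x <= 16 else 9.8 - 0.3 * (x - 16)

def f_extended(point):
    """2-D objective on G' = {0..20} x {0, 1}; the slice y = 0 is the original problem."""
    x, y = point
    return f(x) if y == 0 else ridge(x)

def hill_climb(start, objective, neighbors):
    """Steepest-ascent hill climbing; stops at a point without a strictly better neighbor."""
    current = start
    while True:
        best = max(neighbors(current), key=objective)
        if objective(best) <= objective(current):
            return current
        current = best

def neighbors_1d(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 20]

def neighbors_2d(point):
    x, y = point
    moves = [(x - 1, y), (x + 1, y), (x, 1 - y)]
    return [(nx, ny) for nx, ny in moves if 0 <= nx <= 20]

print(hill_climb(4, f, neighbors_1d))                # remains on the local peak
print(hill_climb((4, 0), f_extended, neighbors_2d))  # reaches the global peak via y = 1
```

In the one-dimensional space the climber started at x = 4 cannot leave the local peak. In the extended space it first steps into the added dimension, follows the slope, and finally drops back onto the global optimum of the original problem.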
Generally, increasing the dimension of the search space only makes sense if the added
dimension has a non-trivial influence on the objective functions. Simply adding a useless new
dimension (as done in Fig. 1.31.b) would be an example for some sort of uniform redundancy
from which we already know (see Section 1.4.5) that it is not beneficial. Then again, adding
useful new dimensions may be hard or impossible to achieve in most practical applications.
A good example for this issue is given by Bongard and Paul [246] who used an EA to
evolve a neural network for the motion control of a bipedal robot. They performed runs
where the evolution had control over some morphological aspects and runs where it had
not. The ability to change the leg width of the robots, for instance, comes at the expense
of an increase of the dimensions of the search space. Hence, one would expect that the
optimization would perform worse. Instead, in one series of experiments, the results were
much better with the extended search space. The runs did not converge to one particular
leg shape but to a wide range of different structures. This led to the assumption that the
morphology itself was not so much the target of the optimization but that the ability to change
it transformed the fitness landscape into a structure more navigable by the evolution.
In some other experimental runs of Bongard and Paul [246], this phenomenon could not
be observed, most likely because

1. the robot configuration led to a problem of too high complexity, i. e., ruggedness in the
fitness landscape and/or
2. the increase in dimensionality this time was too large to be compensated by the gain of
evolvability.
Further examples for possible benefits of “gradually complexifying” the search space are
given by Malkin in his doctoral thesis [1351].
1.6 General Information
For each optimization method discussed in this book, you will find such a General
Information section. Here we outline some of the applications of the respective approach,
name the most important conferences, journals, and books as well as link to some online
resources.
1.6.1 Areas Of Application
Some example areas of application of global optimization algorithms are:
Application                                         References
Chemistry, Chemical Engineering                     [204, 1787, 691]
Biochemistry                                        [690]
Constraint Satisfaction Problems (CSP)              [1519]
Multi-Criteria Decision Making (MCDM)               [877, 375]
Biology                                             [691]
Engineering, Structural Optimization, and Design    [209, 691, 1814, 613, 1787, 690, 691, 379]
Economics and Finance                               [613, 691, 1051]
Parameter Estimation                                [690]
Mathematical Problems                               [761]
Optics                                              [132, 2057]
Operations Research                                 [691, 878]
Networking and Communication                        [450], Section 23.2 on page 401
This is just a small sample of the possible applications of global optimization algorithms. It
has neither some sort of order nor a focus on some specific areas. In the general information
sections of the following chapters, you will find many application examples for the algorithm
discussed.
1.6.2 Conferences, Workshops, etc.
Some conferences, workshops, and similar events on global optimization algorithms are:
AAAI: National Conference on Artificial Intelligence
http://www.aaai.org/Conferences/conferences.php [accessed 2007-09-06]
History: 2008: Chicago, Illinois, see [738]
2007: Vancouver, British Columbia, Canada, see [954]

2006: Boston, Massachusetts, USA, see [805]
2005: Pittsburgh, Pennsylvania, USA, see [1359]
2004: San Jose, California, USA, see [1381]
2002: Edmonton, Alberta, Canada, see [547]
2000: Austin, Texas, USA, see [1103]
1999: Orlando, Florida, USA, see [917]
1998: Madison, Wisconsin, USA, see [1472]
1997: Providence, Rhode Island, USA, see [1219, 3]
1996: Portland, Oregon, USA, see [410, 2]
1994: Seattle, WA, USA, see [906]
1993: Washington, DC, USA, see [668]
1992: San Jose, California, USA, see [1986]
1991: Anaheim, California, USA, see [530]
1990: Boston, Massachusetts, USA, see [563]
1988: St. Paul, Minnesota, USA, see [1435]
1987: Seattle, WA, USA, see [723]
1986: Philadelphia, PA, USA, see [1110, 1111]
1984: Austin, TX, USA, see [267]
1983: Washington, DC, USA, see [788]
1982: Pittsburgh, PA, USA, see [2143]
1980: Stanford University, California, USA, see [126]
AISB: Artificial Intelligence and Simulation of Behaviour + Workshop on Evolutionary
Computing
http://www.aisb.org.uk/convention/index.shtml [accessed 2008-09-11]
History: 2008: Aberdeen, UK, see [866]
2007: Newcastle upon Tyne, UK, see [2030]
2006: Bristol, UK, see [2029]
2005: Hatfield, UK, see [2028]
2004: Leeds, UK, see [2027]
2003: Aberystwyth, UK, see [2026]
2002: Imperial College, UK, see [2025]
2001: York, UK, see [2024]
2000: Birmingham, UK, see [2023]
1997: Manchester, UK, see [447]
1996: Brighton, UK, see [695]
1995: Sheffield, UK, see [694]
1994: Leeds, UK, see [693]
HAIS: International Conference on Hybrid Artificial Intelligence Systems
http://gicap.ubu.es/hais2009/ [accessed 2009-03-02]
History: 2009: Salamanca, Spain, see [79]
2008: Burgos, Spain, see [443]
2007: Salamanca, Spain, see [442]
2006: Ribeirão Preto, SP, Brazil, see [117]
HIS: International Conference on Hybrid Intelligent Systems

http://www.softcomputing.net/hybrid.html [accessed 2007-09-01]
History: 2008: Barcelona, Spain, see [2267]
2007: Kaiserslautern, Germany, see [1170]
2006: Auckland, New Zealand, see [993]
2005: Rio de Janeiro, Brazil, see [1510]
2004: Kitakyushu, Japan, see [991]
2003: Melbourne, Australia, see [8]
2002: Santiago, Chile, see [7]
2001: Adelaide, Australia, see [6]
ICNC: International Conference on Advances in Natural Computation
History: 2007: Haikou, China, see [995, 996, 997, 998, 999]
2006: Xi’an, China, see [1052, 1053]
2005: Changsha, China, see [2151, 2152, 2153]
IAAI: Conference on Innovative Applications of Artificial Intelligence
http://www.aaai.org/Conferences/IAAI/iaai.php [accessed 2007-09-06]
History: 2006: Boston, Massachusetts, USA, see [805]
2005: Pittsburgh, Pennsylvania, USA, see [1359]
2004: San Jose, California, USA, see [1381]
2003: Acapulco, México, see [1731]
2002: Edmonton, Alberta, Canada, see [547]
2001: Seattle, Washington, USA, see [932]
2000: Austin, Texas, USA, see [1103]
1999: Orlando, Florida, USA, see [917]
1998: Madison, Wisconsin, USA, see [1472]
1997: Providence, Rhode Island, USA, see [1219]
1996: Portland, Oregon, USA, see [410]
1995: Montreal, Quebec, Canada, see [22]
1994: Seattle, Washington, USA, see [318]
1993: Washington, DC, USA, see [1]
1992: San Jose, California, USA, see [1844]
1991: Anaheim, California, USA, see [1907]
1990: Washington, DC, USA, see [1706]
1989: Stanford University, California, USA, see [1835]
KES: Knowledge-Based Intelligent Information & Engineering Systems
History: 2007: Vietri sul Mare, Italy, see [75, 76, 77]
2006: Bournemouth, UK, see [756, 757, 758]
2005: Melbourne, Australia, see [1129, 1130, 1131, 1132]
2004: Wellington, New Zealand, see [1514, 1515, 1516]
2003: Oxford, UK, see [1599, 1600]
2002: Podere d’Ombriano, Crema, Italy, see [481]
2001: Osaka and Nara, Japan, see [1037]
2000: Brighton, UK, see [962, 963]
1999: Adelaide, South Australia, see [1032]

1998: Adelaide, South Australia, see [1033, 1034, 1035]
1997: Adelaide, South Australia, see [1030, 1031]
MCDM: International Conference on Multiple Criteria Decision Making
http://project.hkkk.fi/MCDM/conf.html [accessed 2007-09-10]
History: 2008: Auckland, New Zealand, see [620]
2006: Chania, Crete, Greece, see [2333]
2004: Whistler, British Columbia, Canada, see [2165]
2002: Semmering, Austria, see [1334]
2000: Ankara, Turkey, see [1167]
1998: Charlottesville, Virginia, USA, see [877]
1997: Cape Town, South Africa, see [1963]
1995: Hagen, Germany, see [645]
1994: Coimbra, Portugal, see [419]
1992: Taipei, Taiwan, see [2069]
1990: Fairfax, USA, see [1916]
1988: Manchester, UK, see [1301]
1986: Kyoto, Japan, see [1500]
1984: Cleveland, Ohio, USA, see [876]
1982: Mons, Belgium, see [893]
1980: Newark, Delaware, USA, see [1467]
1979: Königswinter, Germany, see [644]
1977: Buffalo, New York, USA, see [2328]
1975: Jouy-en-Josas, France, see [2039]
Mendel: International Conference on Soft Computing
http://mendel-conference.org/ [accessed 2007-09-09]
History: 2009: Brno, Czech Republic, see [292]
2008: Brno, Czech Republic, see [291]
2007: Prague, Czech Republic, see [1590]
2006: Brno, Czech Republic, see [293]
2005: Brno, Czech Republic, see [2084]
2004: Brno, Czech Republic, see [2083]
2003: Brno, Czech Republic, see [2082]
2002: Brno, Czech Republic, see [2081]
2001: Brno, Czech Republic, see [2086]
2000: Brno, Czech Republic, see [1591]
1999: Brno, Czech Republic, see [2080]
1998: Brno, Czech Republic, see [2079]
1997: Brno, Czech Republic, see [2078]
1996: Brno, Czech Republic, see [2077]
1995: Brno, Czech Republic, see [2076]
MIC: Metaheuristics International Conference
History: 2007: Montreal, Canada, see [1449]
2005: Vienna, Austria, see [2115]

2003: Kyoto, Japan, see [988]
2001: Porto, Portugal, see [1721]
1999: Angra dos Reis, Brazil, see [1726]
1997: Sophia Antipolis, France, see [2124]
1995: Breckenridge, Colorado, USA, see [1589]
MICAI: Advances in Artificial Intelligence, The Mexican International Conference on Arti-
ficial Intelligence
http://www.micai.org/ [accessed 2008-06-29]
History: 2007: Aguascalientes, México, see [782]
2006: Apizaco, México, see [781, 493]
2005: Monterrey, México, see [783]
2004: Mexico City, México, see [1442]
2002: Mérida, Yucatán, México, see [425]
2000: Acapulco, México, see [325]
WOPPLOT: Workshop on Parallel Processing: Logic, Organization and Technology
History: 1992: Tutzing, Germany (?), see [2068]
1989: Neubiberg and Wildbad Kreuth, Germany, see [164]
1986: Neubiberg, see [163]
1983: Neubiberg, see [162]
In the general information sections of the following chapters, you will find many conferences
and workshops that deal with the respective algorithms discussed, so this is just a small
selection.
1.6.3 Journals
Some journals that deal (at least partially) with global optimization algorithms are:
Journal of Global Optimization, ISSN: 0925-5001 (Print) 1573-2916 (Online), ap-
pears monthly, publisher: Springer Netherlands, http://www.springerlink.com/content/
100288/ [accessed 2007-09-20]
The Journal of the Operational Research Society, ISSN: 0160-5682, appears monthly, ed-
itor(s): John Wilson, Terry Williams, publisher: Palgrave Macmillan, The OR Society,
http://www.palgrave-journals.com/jors/ [accessed 2007-09-16]
IEEE Transactions on Systems, Man, and Cybernetics (SMC), appears Part A/B: bi-monthly,
Part C: quarterly, editor(s): Donald E. Brown (Part A), Diane Cook (Part B),
Vladimir Marik (Part C), publisher: IEEE Press, http://www.ieeesmc.org/ [accessed 2007-09-16]
Journal of Heuristics, ISSN: 1381-1231 (Print), 1572-9397 (Online), appears bi-monthly,
publisher: Springer Netherlands, http://www.springerlink.com/content/102935/ [accessed
2007-09-16]
European Journal of Operational Research (EJOR), ISSN: 0377-2217, appears bi-weekly,
editor(s): Roman Slowinski, Jesus Artalejo, Jean-Charles Billaut, Robert Dyson,
Lorenzo Peccati, publisher: North-Holland, Elsevier, http://www.elsevier.com/wps/
find/journaldescription.cws_home/505543/description [accessed 2007-09-21]
Computers & Operations Research, ISSN: 0305-0548, appears monthly, editor(s):
Stefan Nickel, publisher: Pergamon, Elsevier, http://www.elsevier.com/wps/find/
journaldescription.cws_home/300/description [accessed 2007-09-21]

Applied Statistics, ISSN: 0035-9254, editor(s): Gilmour, Skinner, publisher: Blackwell Pub-
lishing for the Royal Statistical Society, http://www.blackwellpublishing.com/journal.
asp?ref=0035-9254 [accessed 2007-09-16]
Applied Intelligence, ISSN: 0924-669X (Print), 1573-7497 (Online), appears bi-monthly, pub-
lisher: Springer Netherlands, http://www.springerlink.com/content/100236/ [accessed 2007-
09-16]
Artificial Intelligence Review , ISSN: 0269-2821 (Print), 1573-7462 (Online), appears until
2005, publisher: Springer Netherlands, http://www.springerlink.com/content/100240/
[accessed 2007-09-16]
Journal of Artificial Intelligence Research (JAIR), ISSN: 1076-9757, editor(s): Toby Walsh,
http://www.jair.org/ [accessed 2007-09-16]
Knowledge and Information Systems, ISSN: 0219-1377 (Print), 0219-3116 (Online), ap-
pears approx. eight times a year, publisher: Springer London, http://www.springerlink.
com/content/0219-1377 [accessed 2007-09-16] and http://www.springer.com/west/home/
computer/information+systems?SGWID=4-152-70-1136715-0 [accessed 2007-09-16]
SIAM Journal on Optimization (SIOPT), ISSN: 1052-6234 (print) / 1095-7189 (electronic),
appears quarterly, editor(s): Nicholas I. M. Gould, publisher: Society for Industrial and
Applied Mathematics, http://www.siam.org/journals/siopt.php [accessed 2008-06-14]
Applied Soft Computing, ISSN: 1568-4946, appears quarterly, editor(s): R. Roy, publisher:
Elsevier B.V., http://www.sciencedirect.com/science/journal/15684946 [accessed 2008-06-
15]
Advanced Engineering Informatics, ISSN: 1474-0346, appears quarterly, editor(s): J.C. Kunz,
I.F.C. Smith, T. Tomiyama, publisher: Elsevier B.V., http://www.elsevier.com/wps/
find/journaldescription.cws_home/622240/description [accessed 2008-08-01]
Journal of Machine Learning Research (JMLR), ISSN: 1533-7928, 1532-4435, appears 8
times/year, editor(s): Lawrence Saul and Leslie Pack Kaelbling, publisher: Microtome Pub-
lishing, http://jmlr.csail.mit.edu/ [accessed 2008-08-06]
Annals of Operations Research, ISSN: 0254-5330, 1572-9338, appears monthly, editor(s):
Endre Boros, publisher: Springer, http://www.springerlink.com/content/0254-5330 [ac-
cessed 2008-10-27]
International Journal of Applied Metaheuristic Computing (IJAMC), appears starting in
2010, editor(s): Peng-Yeng Yin, publisher: Information Resources Management Association,
http://www.igi-global.com/journals/details.asp?id=33344 [accessed 2009-01-02]
1.6.4 Online Resources
Some general resources on global optimization algorithms that are available online are:
http://www.mat.univie.ac.at/~neum/glopt.html [accessed 2007-09-20]
Last update: up-to-date
Description: Arnold Neumaier’s global optimization website which includes links,
publications, and software.
http://www.soft-computing.de/ [accessed 2008-05-18]
Last update: up-to-date
Description: Yaochu Jin’s site on soft computing including links and conference infos.
http://web.ift.uib.no/~antonych/glob.html [accessed 2007-09-20]
Last update: up-to-date
Description: Web site with many links maintained by Gennady A. Ryzhikov.

http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/
TestGO.htm [accessed 2007-11-06]
Last update: up-to-date
Description: A beautiful collection of test problems for global optimization algorithms
http://www.c2i.ntu.edu.sg/AI+CI/Resources/ [accessed 2008-20-25]
Last update: 2006-11-02
Description: A large collection of links about AI and CI.
1.6.5 Books
Some books about (or including significant information about) global optimization algo-
rithms are:
Pardalos, Thoai, and Horst [1614]: Introduction to Global Optimization
Pardalos and Resende [1613]: Handbook of Applied Optimization
Floudas and Pardalos [691]: Frontiers in Global Optimization
Dzemyda, Saltenis, and Zilinskas [613]: Stochastic and Global Optimization
Gandibleux, Sevaux, Sörensen, and T’kindt [766]: Metaheuristics for Multiobjective
Optimisation
Glover and Kochenberger [813]: Handbook of Metaheuristics

Törn and Žilinskas [2047]: Global Optimization
Chiong [391]: Nature-Inspired Algorithms for Optimisation
Floudas [690]: Deterministic Global Optimization: Theory, Methods and Applications
Chankong and Haimes [375]: Multiobjective Decision Making Theory and Methodology
Steuer [1961]: Multiple Criteria Optimization: Theory, Computation and Application
Haimes, Hall, and Freedman [878]: Multiobjective Optimization in Water Resource Systems
Charnes and Cooper [376]: Management Models and Industrial Applications of Linear Pro-
gramming
Corne, Dorigo, Glover, Dasgupta, Moscato, Poli, and Price [448]: New Ideas in Optimisation
Gonzalez [832]: Handbook of Approximation Algorithms and Metaheuristics
Jain and Kacprzyk [1036]: New Learning Paradigms in Soft Computing
Tiwari, Knowles, Avineri, Dahal, and Roy [2044]: Applications of Soft Computing – Recent
Trends
Chawdry, Roy, and Pant [379]: Soft Computing in Engineering Design and Manufacturing
Siarry and Michalewicz [1875]: Advances in Metaheuristics for Hard Optimization
Onwubolu and Babu [1580]: New Optimization Techniques in Engineering
Pardalos and Du [1612]: Handbook of Combinatorial Optimization
Reeves [1716]: Modern Heuristic Techniques for Combinatorial Problems
Corne, Oates, and Smith [450]: Telecommunications Optimization: Heuristic and Adaptive
Techniques
Kontoghiorghes [1171]: Handbook of Parallel Computing and Statistics
Bui and Alam [299]: Multi-Objective Optimization in Computational Intelligence: Theory
and Practice


2 Evolutionary Algorithms
2.1 Introduction
Definition 2.1 (Evolutionary Algorithm). Evolutionary algorithms1 (EAs) are
population-based metaheuristic optimization algorithms that use biology-inspired
mechanisms like mutation, crossover, natural selection, and survival of the fittest in order
to refine a set of solution candidates iteratively. [99, 104, 105]
The advantage of evolutionary algorithms compared to other optimization methods is
their “black box” character that makes only few assumptions about the underlying objective
functions. Furthermore, the definition of objective functions usually requires less insight into
the structure of the problem space than the manual construction of an admissible heuristic.
EAs therefore perform consistently well in many different problem categories.
2.1.1 The Basic Principles from Nature
In 1859, Darwin [485] published his book “On the Origin of Species”2 in which he identified
the principles of natural selection and survival of the fittest as driving forces behind the
biological evolution. His theory can be condensed into ten observations and deductions
[485, 1375, 2219]:
1. The individuals of a species possess great fertility and produce more offspring than can
grow into adulthood.
2. Under the absence of external influences (like natural disasters, human beings, etc.), the
population size of a species roughly remains constant.
3. Again, if no external influences occur, the food resources are limited but stable over
time.
4. Since the individuals compete for these limited resources, a struggle for survival ensues.
5. Especially in sexually reproducing species, no two individuals are equal.
6. Some of the variations between the individuals will affect their fitness and hence, their
ability to survive.
7. A good fraction of these variations are inheritable.
8. Less fit individuals are less likely to reproduce, whereas the fittest individuals are more
likely to survive and produce offspring.
9. Individuals that survive and reproduce will likely pass on their traits to their offspring.
1 http://en.wikipedia.org/wiki/Artificial_evolution [accessed 2007-07-03]
2 http://en.wikipedia.org/wiki/The_Origin_of_Species [accessed 2007-07-03]

10. A species will slowly change and adapt more and more to a given environment during
this process which may finally even result in new species.
Evolutionary algorithms abstract from this biological process and also introduce a change
in semantics by being goal-driven [2091]. The search space G in evolutionary algorithms is
then an abstraction of the set of all possible DNA strings in nature and its elements g ∈ G
play the role of the natural genotypes. Therefore, we also often refer to G as the genome and
to the elements g ∈ G as genotypes. Like any creature is an instance of its genotype formed
by embryogenesis3, the solution candidates (or phenotypes) x ∈ X in the problem space X
are instances of genotypes formed by the genotype-phenotype mapping: x = gpm(g). Their
fitness is rated according to objective functions which are subject to optimization and drive
the evolution into specific directions.
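To make the roles of G, X, and gpm concrete, here is a minimal sketch. It assumes, purely for illustration, that genotypes are bit strings of length 8 and that the problem space is the real interval [-5, 5]. The bounds and the example objective function are hypothetical choices and not prescribed by the text.

```python
def gpm(genotype, lower=-5.0, upper=5.0):
    """Genotype-phenotype mapping: decode a bit string into a real number in [lower, upper]."""
    as_int = int("".join(str(bit) for bit in genotype), 2)
    max_int = 2 ** len(genotype) - 1
    return lower + (upper - lower) * as_int / max_int

def objective(x):
    """Example objective function f(x) = x^2, subject to minimization."""
    return x * x

g = [1, 0, 0, 0, 0, 0, 0, 0]   # genotype g in G = {0,1}^8
x = gpm(g)                     # phenotype x in X = [-5, 5]
print(x, objective(x))
```

The search operations only ever see the bit string g, whereas the objective function only ever sees the decoded value x.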
2.1.2 The Basic Cycle of Evolutionary Algorithms
We can distinguish between single-objective and multi-objective evolutionary algorithms,
where the latter means that we try to optimize multiple, possibly conflicting criteria. Our
following elaborations will be based on these MOEAs. The general area of Evolutionary
Computation that deals with multi-objective optimization is called EMOO, evolutionary
multi-objective optimization.
Definition 2.2 (MOEA). A multi-objective evolutionary algorithm (MOEA) is able to
perform an optimization of multiple criteria on the basis of artificial evolution [359, 360,
2101, 534, 537, 716, 1471].
[Figure 2.1: The basic cycle of evolutionary algorithms. Initial Population (create an initial
population of random individuals) → Evaluation (compute the objective values of the solution
candidates) → Fitness Assignment (use the objective values to determine fitness values) →
Selection (select the fittest individuals for reproduction) → Reproduction (create new
individuals from the mating pool by crossover and mutation) → back to Evaluation.]
All evolutionary algorithms proceed in principle according to the scheme illustrated in
Figure 2.1:
1. Initially, a population Pop of individuals p with a random genome p.g is created.
2. The values of the objective functions f ∈ F are computed for each solution candidate
p.x in Pop. This evaluation may incorporate complicated simulations and calculations.
3. With the objective functions, the utility of the different features of the solution
candidates has been determined and a fitness value v(p.x) can now be assigned to each of
them. This fitness assignment process can, for instance, incorporate a prevalence
comparator function cmpF which uses the objective values to create an order amongst the
individuals.
3 http://en.wikipedia.org/wiki/Embryogenesis [accessed 2008-03-10]

4. A subsequent selection process filters out the solution candidates with bad fitness and
allows those with good fitness to enter the mating pool with a higher probability. Since
fitness is subject to minimization in the context of this book, the lower the v(p.x)-values
are, the higher is the (relative) utility of the individual to whom they belong.
5. In the reproduction phase, offspring is created by varying or combining the genotypes p.g
of the selected individuals p ∈ Mate by applying the search operations searchOp ∈ Op
(which are called reproduction operations in the context of EAs). These offspring are
then subsequently integrated into the population.
6. If the terminationCriterion() is met, the evolution stops here. Otherwise, the algorithm
continues at step 2.
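The six steps above can be sketched as a short program. This is a simplified single-objective illustration and not the book's Algorithm 2.1: it assumes bit-string genotypes that are identical to the phenotypes, uses the objective value directly as fitness, and swaps in binary tournament selection, single-point crossover, and bit-flip mutation as concrete operators. The function name `simple_ea` and all parameter values are hypothetical.

```python
import random

def simple_ea(n_bits=20, pop_size=30, generations=50, seed=0):
    """Minimal EA on bit strings; minimizes the number of zero bits."""
    rng = random.Random(seed)
    objective = lambda g: n_bits - sum(g)          # f, subject to minimization

    # Step 1: random initial population (genotype = phenotype here)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = min(pop, key=objective)

    for _ in range(generations):                   # step 6: fixed generation budget
        # Steps 2 and 3: evaluate; the objective value serves directly as fitness
        fitness = [objective(g) for g in pop]

        # Step 4: binary tournament selection fills the mating pool
        def tournament():
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if fitness[a] <= fitness[b] else pop[b]
        mate = [tournament() for _ in range(pop_size)]

        # Step 5: single-point crossover followed by bit-flip mutation
        pop = []
        for _ in range(pop_size):
            p1, p2 = rng.choice(mate), rng.choice(mate)
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < 1.0 / n_bits else b for b in child]
            pop.append(child)

        best = min(pop + [best], key=objective)    # remember the best candidate seen
    return best, objective(best)

print(simple_ea())
```

The termination criterion here is simply a fixed number of generations. Keeping track of the best candidate outside the population is a convenience of this sketch, since the plain cycle itself gives no guarantee that good individuals survive.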
In the following few paragraphs, we will discuss how the natural evolution of a species
could proceed and put the artificial evolution of solution candidates in an EA into this
context. When an evolutionary algorithm starts, there exists no information about what is
good or what is bad. Basically, only some random genes p.g = create() are coupled together
as individuals in the initial population Pop(t = 0). I think that back in the Eoarchean4, the
earth age 3.8 billion years ago in which the first single-celled life most probably occurred,
it was much the same.
For simplification purposes, we will assume that the evolution does proceed stepwise
in distinct generations. At the beginning of every generation, nature “instantiates” each
genotype p.g (given as DNA sequence) as a new phenotype p.x = gpm(p.g) – a living
organism – for example a fish. The survival of the genes of the fish depends on how well
it performs in the ocean (F(p.x) = ?), in other words, on how fit it is (v(p.x)). Its fitness,
however, is not only determined by one single feature of the phenotype like its size (= f1).
Although a bigger fish will have better chances to survive, size alone does not help if it is
too slow to catch any prey (= f2). Also its energy consumption f3 should be low so it does
not need to eat all the time. Other factors influencing the fitness positively are formae like
sharp teeth f4 and colors that blend into the environment f5 so it cannot be seen too easily
by sharks. If its camouflage is too good on the other hand, how will it find potential mating
partners (f6 ≁ f5)? And if it is big, it will also have a higher energy consumption f1 ≁ f3.
So there may be conflicts between the desired properties.
To sum it up, we could consider the life of the fish as the evaluation process of its genotype
in an environment where good qualities in one aspect can turn out as drawbacks in other
perspectives. In multi-objective evolutionary algorithms, this is exactly the same and I tried
to demonstrate this by annotating the fish-story with the symbols previously defined in
the global optimization theory sections. For each problem that we want to solve, we can
specify multiple so-called objective functions f ∈ F. An objective function f represents one
feature that we are interested in. Let us assume that we want to evolve a car (a pretty weird
assumption, but let’s stick with it). The genotype p.g ∈ G would be the construction plan
and the phenotype p.x ∈ X the real car, or at least a simulation of it. One objective function
fa would definitely be safety. For the sake of our children and their children, the car should
also be environment-friendly, so that’s our second objective function fb. Furthermore, a
cheap price fc, fast speed fd, and a cool design fe would be good. That makes five objective
functions, of which, for example, the second and the fourth are contradictory (fb ≁ fd).
After the fish genome is instantiated, nature “knows” about its phenotypic properties.
Fitness, however, is always relative; it depends on your environment. I, for example, may
be considered as a fit man in my department (computer science). If I took a stroll to the
department of sports science, that statement would probably not hold anymore. The same
goes for the fish, its fitness depends on the other fish in the population (and its prey and
predators). If one fish p1.x can beat another one p2.x in all categories, i.e., is bigger, stronger,
smarter, and so on, we can clearly consider it as fitter (p1.x≻p2.x ⇒ cmpF(p1.x,p2.x) < 0)
since it will have a better chance to survive. This relation is transitive but only forms a partial
order since a fish that is strong but not very clever and a fish that is clever but not strong
4 http://en.wikipedia.org/wiki/Eoarchean [accessed 2007-07-03]

may have the same probability to reproduce and hence, are not directly comparable5.
Well, OK, we cannot decide if a weak fish p3.x with a clever behavioral pattern is worse or
better than a really strong but less cunning one p4.x (cmpF (p3.x, p4.x) = 0). Both traits are
furthered in the evolutionary process and maybe, one fish of the first kind will sometimes
mate with one of the latter and produce an offspring which is both, intelligent and sporty6.
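The comparison of the fish can be written down as a small comparator. The sketch below assumes Pareto dominance as the prevalence relation and assumes that every objective is subject to minimization. The function name `cmp_pareto` and the example score vectors (smaller meaning better in both traits) are hypothetical.

```python
def cmp_pareto(a, b):
    """Return -1 if a dominates b, 1 if b dominates a, and 0 otherwise (minimization)."""
    a_better = any(x < y for x, y in zip(a, b))
    b_better = any(y < x for x, y in zip(a, b))
    if a_better and not b_better:
        return -1
    if b_better and not a_better:
        return 1
    return 0

p1 = (1, 1)   # good in both traits
p3 = (8, 2)   # weak but clever
p4 = (2, 8)   # strong but less cunning
print(cmp_pareto(p1, p3), cmp_pareto(p3, p4))
```

A result of 0 covers both identical objective vectors and the incomparable case of p3 and p4, which is why the relation yields only a partial order.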
Multi-objective evolutionary algorithms basically apply the same principles in their fit-
ness assignment process “assignFitness”. One of the most popular methods for computing
the fitness is called Pareto ranking7. It does exactly what we’ve just discussed: It first chooses
the individuals that are beaten by no one (we call this non-dominated set) and assigns a
good (scalar) fitness value v(p1.x) to them. Then it looks at the rest of the population and
picks those (P ⊂ Pop) which are not beaten by the remaining individuals and gives them a
slightly worse fitness value v(p.x) > v(p1.x) ∀p ∈ P – and so on, until all solution candidates
have received one scalar fitness.
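The front-by-front procedure just described can be sketched as follows. The sketch assumes minimization of all objectives and uses the integers 0, 1, 2, ... as scalar fitness values, where a smaller value is better. The function names and the sample points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_ranking(objectives):
    """Assign fitness 0 to the non-dominated set, 1 to the next front, and so on."""
    remaining = set(range(len(objectives)))
    fitness = [None] * len(objectives)
    rank = 0
    while remaining:
        # individuals that are not beaten by anyone still remaining
        front = {i for i in remaining
                 if not any(dominates(objectives[j], objectives[i]) for j in remaining)}
        for i in front:
            fitness[i] = rank
        remaining -= front
        rank += 1
    return fitness

points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
print(pareto_ranking(points))  # [0, 0, 0, 1, 2]
```

The first three points are mutually non-dominated and share the best fitness, (3, 3) is beaten only by (2, 2), and (4, 4) is additionally beaten by (3, 3).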
Now, how fit a fish is does not necessarily determine directly if it can produce offspring.
An intelligent fish may be eaten by a shark and a strong one can die from disease. The
fitness8 is only some sort of probability of reproduction. The process of selection is always
stochastic, without guarantees – even a fish that is small, slow, and lacks any sophisticated
behavior might survive and could produce even more offspring than a highly fit one.
The evolutionary algorithms work in exactly the same way – they use a selection algo-
rithm “select” in order to pick the fittest individuals and place them into the mating pool
Mate. The oldest selection scheme is called Roulette wheel 9. In the original version of this
algorithm (intended for fitness maximization), the chance of an individual p to reproduce is
proportional to its fitness v(p.x).
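A minimal sketch of such a fitness-proportionate draw is given below. It assumes fitness maximization with non-negative fitness values, as in the original version of the scheme; the function name roulette_wheel_select and its parameters are hypothetical and chosen for this illustration only.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roulette_wheel_select(population: Sequence[T], fitness: Sequence[float],
                          count: int, rng: random.Random) -> List[T]:
    """Draw count individuals with replacement, each with probability
    proportional to its (non-negative, maximized) fitness."""
    if any(f < 0 for f in fitness):
        raise ValueError("fitness values must be non-negative")
    total = sum(fitness)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    mating_pool = []
    for _ in range(count):
        pointer = rng.uniform(0.0, total)  # position where the wheel stops
        cumulative = 0.0
        chosen = population[-1]  # fallback in case of floating-point round-off
        for individual, f in zip(population, fitness):
            cumulative += f
            if pointer < cumulative:
                chosen = individual
                break
        mating_pool.append(chosen)
    return mating_pool


rng = random.Random(7)
print(roulette_wheel_select(["weak", "average", "strong"], [1.0, 2.0, 7.0], 5, rng))
```

Since the draw is stochastic, even the individual with the lowest fitness has a small chance of entering the mating pool, just like the small and slow fish that still manages to reproduce.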
Last but not least, there is the reproduction phase. Fish reproduce sexually. Whenever a
female fish and a male fish mate, their genes will be recombined by crossover. Furthermore,
mutations may take place. Most often, they affect the characteristics of the resulting larva
only slightly [1730]. Since fit fish produce offspring with higher probability, there is a good
chance that the next generation will contain at least some individuals that have combined
good traits from their parents and perform even better than them.
In evolutionary algorithms, we do not have such a thing as “gender”. Each individual
from the mating pool can potentially be recombined with every other one. In the car example,
this means that we would modify the construction plans by copying the engine of one car
and placing it into the car body of another one. Also, we could alter some features like the
shape of the headlights randomly. This way, we receive new construction plans for new cars.
Our chance that an environment-friendly engine inside a cool-looking car will result in a
car that is more likely to be bought by the customer is good. If we iteratively perform the
reproduction process “reproducePop” time and again, there is a high probability that the
solutions finally found will be close to optimal.
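The two reproduction operations of the car example can be sketched as follows. The class CarPlan, its three fields, and the list HEADLIGHT_SHAPES are hypothetical stand-ins for a real construction plan; the sketch only shows that recombination copies a part from one parent into the other, while mutation changes a single feature at random.

```python
import random
from dataclasses import dataclass, replace

HEADLIGHT_SHAPES = ("round", "square", "oval")  # hypothetical feature values


@dataclass(frozen=True)
class CarPlan:
    engine: str
    body: str
    headlights: str


def recombine(engine_donor: CarPlan, body_donor: CarPlan) -> CarPlan:
    """Copy the engine of one plan into the car body of another one."""
    return replace(body_donor, engine=engine_donor.engine)


def mutate(plan: CarPlan, rng: random.Random) -> CarPlan:
    """Randomly alter one feature, here the shape of the headlights."""
    return replace(plan, headlights=rng.choice(HEADLIGHT_SHAPES))


eco = CarPlan(engine="hybrid", body="van", headlights="round")
cool = CarPlan(engine="v8", body="coupe", headlights="square")
child = recombine(eco, cool)
print(child)  # CarPlan(engine='hybrid', body='coupe', headlights='square')
print(mutate(child, random.Random(0)))
```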
2.1.3 The Basic Evolutionary Algorithm Scheme
After this informal outline of artificial evolution and how we can use it as an optimization
method, let us now specify the basic scheme common to all evolutionary algorithms.
In principle, all EAs are variations and extensions of the basic approach “simpleEA” defined
in Algorithm 2.1, a cycle of evaluation, selection, and reproduction repeated in each iteration
t. Algorithm 2.1 relies on functions and prototypes that we will introduce step by step.
5 Which is a very comforting thought for all computer scientists.
6 I wonder if the girls in the sports department are open to this kind of argumentation?
7 Pareto comparisons are discussed in Section 1.2.2 on page 31 and elaborations on Pareto ranking
can be found in Section 2.3.3.
8 This definition of fitness is not fully compatible with the biological one, see Section 2.1.5 for
more information on that topic.
9 The roulette wheel selection algorithm will be introduced in Section 2.4.3 on page 124.
Algorithm 2.1: X⋆ ←− simpleEA(cmpF, ps)
Input: cmpF: the comparator function which allows us to compare the utility of two
       solution candidates
Input: ps: the population size
Data: t: the generation counter
Data: Pop: the population
Data: Mate: the mating pool
Data: v: the fitness function resulting from the fitness assigning process
Output: X⋆: the set of the best elements found
begin
    t ←− 0
    Pop ←− createPop(ps)
    while ¬terminationCriterion() do
        v ←− assignFitness(Pop, cmpF)
        Mate ←− select(Pop, v, ps)
        t ←− t + 1
        Pop ←− reproducePop(Mate)
    return extractPhenotypes(extractOptimalSet(Pop))
end
1. The function “createPop(ps)”, which will be introduced as Algorithm 2.18 in Section 2.5
on page 137, produces an initial, randomized population consisting of ps individuals in
the first iteration t = 0.
2. The termination criterion “terminationCriterion()” checks whether the evolutionary al-
gorithm should terminate or continue its work, see Section 1.3.4 on page 54.
3. Most evolutionary algorithms assign a scalar fitness v(p.x) to each individual p by com-
paring its vector of objective values F (p.x) to other individuals in the population Pop.
The function v is built by a fitness assignment process “assignFitness”, which we will
discuss in Section 2.3 on page 111 in more detail. During this procedure, the genotype-
phenotype mapping is implicitly carried out as well as simulations needed to compute
the objective functions f ∈ F.
4. A selection algorithm “select” (see Section 2.4 on page 121) then chooses ps interesting
individuals from the population Pop and inserts them into the mating pool Mate.
5. With “reproducePop”, a new population is generated from the individuals inside the
mating pool using mutation and/or recombination. More information on reproduction
can be found in Section 2.5 on page 137 and in Definition 2.13.
6. The functions “extractOptimalSet” and “extractPhenotypes” which you can find in-
troduced in Definition 19.2 on page 308 and Equation 19.1 on page 307 are used to
extract all the non-prevailed individuals p⋆ from the final population and to return their
corresponding phenotypes p⋆.x only.
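To make the cycle of Algorithm 2.1 concrete, the following Python sketch instantiates it for a very simple setting. Everything specific here is an assumption made for illustration and is not prescribed by the algorithm: genotypes are bit strings that serve directly as phenotypes, a single objective is maximized and used directly as fitness, the termination criterion is a fixed number of generations, selection is done by binary tournaments, and reproduction uses single-point crossover with bit-flip mutation. The name simple_ea and its parameters are hypothetical.

```python
import random
from typing import Callable, List

Genotype = List[int]


def simple_ea(objective: Callable[[Genotype], float], n_bits: int, ps: int,
              max_generations: int, rng: random.Random) -> Genotype:
    """Sketch of the simpleEA cycle for one maximized objective (n_bits >= 2)."""
    t = 0
    # createPop: ps random bit strings
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(ps)]
    while t < max_generations:  # terminationCriterion
        # assignFitness: the objective value is used as fitness directly
        v = [objective(p) for p in pop]
        # select: ps binary tournaments fill the mating pool
        mate = []
        for _ in range(ps):
            i, j = rng.randrange(ps), rng.randrange(ps)
            mate.append(pop[i] if v[i] >= v[j] else pop[j])
        t += 1
        # reproducePop: single-point crossover followed by bit-flip mutation
        pop = []
        for _ in range(ps):
            a, b = rng.choice(mate), rng.choice(mate)
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < 1.0 / n_bits else bit
                     for bit in child]
            pop.append(child)
    # extractOptimalSet: with a single objective, one best individual of the
    # final population is returned
    return max(pop, key=objective)


# Hypothetical test problem: maximize the number of ones in the bit string.
best = simple_ea(sum, n_bits=20, ps=30, max_generations=40, rng=random.Random(42))
print(best, sum(best))
```

As in Algorithm 2.1, the result is taken from the final population only, so a very good individual found earlier may be lost again. This is the gap that elitism closes.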
2.1.4 From the Viewpoint of Formae
Let us review our introductory fish example in terms of forma analysis. Fish can, for instance,
be characterized by the properties “clever” and “strong”. Crudely simplified, both properties
may be true or false for a single individual and hence define two formae each. A third
property can be the color, for which many different possible variations exist. Some of them
may be good in terms of camouflage, others may be good in terms of finding mating partners.
Now a fish can be clever and strong at the same time, as well as weak and green. Here, a
living fish allows nature to evaluate the utility of at least three different formae.
This fact was first stated by Holland [940] for genetic algorithms and is termed implicit
parallelism (or intrinsic parallelism). Since then, it has been studied by many different
researchers [858, 853, 188, 2123]. If the search space and the genotype-phenotype mapping
are properly designed, the implicit parallelism in conjunction with the crossover/recombina-
tion operations is one of the reasons why evolutionary algorithms are such a successful class
of optimization algorithms.
2.1.5 Does the Natural Paragon Fit?
At this point it should be mentioned that the direct reference to Darwinian evolution in
evolutionary algorithms is somewhat controversial. Paterson [1619], for example, points out
that “neither GAs [genetic algorithms] nor GP [Genetic Programming] are concerned with
the evolution of new species, nor do they use natural selection.” On the other hand, nobody
would claim that the idea of selection has not been borrowed from nature although many
additions and modifications have been introduced in favor of better algorithmic performance.
The second argument concerning the development of different species depends on definition:
According to Wikipedia [2219], a species is a class of organisms which are very similar in
many aspects such as appearance, physiology, and genetics. In principle, there is some el-
bowroom for us and we may indeed consider even different solutions to a single problem in
evolutionary algorithms as members of a different species – especially if the binary search
operation crossover/recombination applied to their genomes cannot produce another valid
solution candidate.
Another interesting difference was pointed out by Sharpe [1859] who states that natural
evolution “only proceed[s] sufficiently fast to ensure survival” whereas evolutionary algo-
rithms used for engineering need to be fast in order to be feasible and to compete with other
problem solving techniques.
Furthermore, although the concept of fitness10 in nature is controversial [1915], it is
often considered as an a posteriori measurement. It then defines the ratio of the numbers
of occurrences of a genotype in a population after and before selection or the number of
offspring an individual has in relation to the number of offspring of another individual. In
evolutionary algorithms, fitness is an a priori quantity denoting a value that determines
the expected number of instances of a genotype that should survive the selection process.
However, one could conclude that biological fitness is just an approximation of the a priori
quantity arisen due to the hardness (if not impossibility) of directly measuring it.
My personal opinion (which may as well be wrong) is that the citation of Darwin here is
well motivated since there are close parallels between Darwinian evolution and evolutionary
algorithms. Nevertheless, natural and artificial evolution are still two different things and
phenomena observed in either of the two do not necessarily carry over to the other.
2.1.6 Classification of Evolutionary Algorithms
The Family of Evolutionary Algorithms
The family of evolutionary algorithms encompasses five members, as illustrated in Figure 2.2.
We will only enumerate them briefly here. In-depth discussions will follow in the next
chapters.
1. Genetic algorithms (GAs) are introduced in Chapter 3 on page 141. GAs subsume
all evolutionary algorithms which have bit strings as search space G.
2. The set of evolutionary algorithms which explore the space of real vectors X ⊆ Rn is
called Evolution Strategies (ES, see Chapter 5 on page 227).
3. For Genetic Programming (GP), which will be elaborated on in Chapter 4 on
page 157, we can provide two definitions: On one hand, GP includes all evolutionary
algorithms that grow programs, algorithms, and the like. On the other hand, all
EAs that evolve tree-shaped individuals are also instances of Genetic Programming.
10 http://en.wikipedia.org/wiki/Fitness_(biology) [accessed 2008-08-10]
4. Learning Classifier Systems (LCS), discussed in Chapter 7 on page 233, are online
learning approaches that assign output values to given input values. They internally use
a genetic algorithm to find new rules for this mapping.
5. Evolutionary programming (EP, see Chapter 6 on page 231) is an evolutionary
approach that treats the instances of the genome as different species rather than as
individuals. Over the decades, it has more or less merged into Genetic Programming
and the other evolutionary algorithms.
Figure 2.2: The family of evolutionary algorithms (Genetic Algorithms, Evolution Strategy,
Differential Evolution, Evolutionary Programming, Learning Classifier Systems, and Genetic
Programming with the labels GGGP, SGP, and LGP).
The early research [518] in genetic algorithms (see Section 3.1 on page 141), Genetic
Programming (see Section 4.1.1 on page 157), and evolutionary programming (see Section 6.1
on page 231) dates back to the 1950s and 60s. Besides the pioneering work listed in these
sections, at least one other important early contribution should not go unmentioned here: the
Evolutionary Operation (EVOP) approach introduced by Box [260], Box and Draper [261]
in the late 1950s. The idea of EVOP was to apply a continuous and systematic scheme of
small changes in the control variables of a process. The effects of these modifications are
evaluated and the process is slowly shifted into the direction of improvement. This idea
was never realized as a computer algorithm, but Spendley et al. [1941] used it as basis for
their simplex method which then served as progenitor of the downhill simplex algorithm11
of Nelder and Mead [1517] [518, 1276]. Satterthwaite’s REVOP [1815, 1816], a randomized
Evolutionary Operation approach, however, was rejected at this time [518].
We now have classified different evolutionary algorithms according to their semantics,
in other words, corresponding to their special search and problem spaces. All five major
approaches can be realized with the basic scheme defined in Algorithm 2.1. To this simple
structure, there exist many general improvements and extensions. Since these normally do
not concern the search or problem spaces, they also can be applied to all members of the
EA family alike. In the further text of this chapter, we will discuss the major components
of many of today’s most efficient evolutionary algorithms [357]. The distinctive features of
these EAs are:
1. The population size or the number of populations used.
11 We discuss Nelder and Mead [1517]’s downhill simplex optimization method in Chapter 16 on
page 283.
2. The method of selecting the individuals for reproduction.
3. The way the offspring is included into the population(s).
Populations in Evolutionary Algorithms
There exist various ways in which an evolutionary algorithm can process its population.
Especially interesting is how the population Pop(t + 1) of the next iteration is formed as a
combination of the current one Pop(t) and its offspring. If it only contains this offspring,
we speak of extinctive selection [1512, 1869]. Extinctive selection can be compared with
ecosystems of small protozoa12 which reproduce in a fissiparous13 manner. In this case, of
course, the elders will not be present in the next generation. Other comparisons can partly
be drawn to the sexual reproduction of octopi, where the female dies after protecting the
eggs until the larvae hatch, or to the black widow spider where the female devours the male
after the insemination. Especially in the area of genetic algorithms, extinctive strategies are
also known as generational algorithms.
Definition 2.3 (Generational). In evolutionary algorithms that are generational [1677],
the next generation will only contain the offspring of the current one and no parent individ-
uals will be preserved.
Extinctive evolutionary algorithms can further be divided into left and right selection
[2264]. In left extinctive selections, the best individuals are not allowed to reproduce in
order to prevent premature convergence of the optimization process. Conversely, the worst
individuals are not permitted to breed in right extinctive selection schemes in order to reduce
the selective pressure since they would otherwise scatter the fitness too much.
In algorithms that apply a preservative selection scheme, the next population is a combination
of the current population and the offspring [102, 1064, 1762, 2091]. The biological metaphor for
such algorithms is that the lifespan of many organisms exceeds a single generation. Hence,
parent and child individuals compete with each other for survival.
For Evolution Strategy, which you can find discussed in Chapter 5 on page 227, there
exists a notation which can also be used to describe the generation transition in evolutionary
algorithms in general [934, 935, 1841, 102].
1. λ denotes the number of offspring created and
2. µ is the number of parent individuals.
Extinctive selection patterns are denoted as (µ, λ)-strategies and will create λ ≥ µ child
individuals from the µ available genotypes. From these, they only keep the µ best solution
candidates and discard the µ parents as well as the λ − µ worst children.
In a (µ + λ)-strategy, again λ children are generated from µ parents, often with λ > µ.
Then, the parent and offspring populations are united (to a population of the size λ + µ)
and from this union, only the µ best individuals will “survive”. (µ + λ)-strategies are thus
preservative.
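The difference between the two generation transitions can be illustrated with a short Python sketch. The function names survivors_comma and survivors_plus are hypothetical, and it is assumed here that a scalar fitness is available for every individual and that smaller values are better.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def survivors_comma(parents: List[T], offspring: List[T], mu: int,
                    fitness: Callable[[T], float]) -> List[T]:
    """(mu, lambda): all parents are discarded, the mu best children survive."""
    if len(offspring) < mu:
        raise ValueError("(mu, lambda) selection needs lambda >= mu")
    return sorted(offspring, key=fitness)[:mu]


def survivors_plus(parents: List[T], offspring: List[T], mu: int,
                   fitness: Callable[[T], float]) -> List[T]:
    """(mu + lambda): the mu best of parents and children together survive."""
    return sorted(parents + offspring, key=fitness)[:mu]


# Hypothetical example: individuals are numbers and the fitness is the value itself.
parents = [1.0, 4.0]               # mu = 2
offspring = [3.0, 2.0, 5.0, 6.0]   # lambda = 4
print(survivors_comma(parents, offspring, 2, lambda x: x))  # [2.0, 3.0]
print(survivors_plus(parents, offspring, 2, lambda x: x))   # [1.0, 2.0]
```

In the extinctive variant, the best parent (1.0) is lost although none of the children is better. In the preservative variant it survives.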
Steady-state evolutionary algorithms [1746, 499, 1538, 365, 1987, 2211], abbreviated by
SSEA, are preservative evolutionary algorithms with values of λ that are relatively low in
comparison with µ. Usually, λ is chosen in a way that a binary search operator (crossover) is
applied exactly once per generation. Steady-state evolutionary algorithms are often
observed to produce better results than generational EAs. Chafekar et al. [365], for example,
introduce steady-state evolutionary algorithms that are able to outperform generational
NSGA-II (which you can find summarized in ?? on page ??) for some difficult problems.
In experiments of Jones and Soule [1066] (primarily focused on other issues), steady-state
algorithms showed better convergence behavior in a multi-modal landscape. Similar results
12 http://en.wikipedia.org/wiki/Protozoa [accessed 2008-03-12]
13 http://en.wikipedia.org/wiki/Binary_fission [accessed 2008-03-12]
have been reported by Chevreux [389] in the context of molecule design optimization. Dif-
ferent generational selection methods have been compared to the steady-state GENITOR
approach by Goldberg and Deb [822]. On the other hand, with steady-state approaches, we
also run the risk of premature convergence.
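One generation of a steady-state algorithm can be sketched as follows. This is a simplified illustration under assumptions, not one of the cited algorithms: genotypes are real vectors, fitness is minimized, exactly one crossover is applied per generation (λ = 1), and the child replaces the worst individual only if it is not worse. The function name steady_state_step is hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def steady_state_step(pop: List[Vector], fitness: Callable[[Vector], float],
                      rng: random.Random) -> None:
    """Apply one crossover and let the child compete with the worst individual."""
    a, b = rng.sample(pop, 2)
    # Uniform crossover: each gene is taken from one of the two parents.
    child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
    # Small Gaussian mutation of one randomly chosen gene.
    k = rng.randrange(len(child))
    child[k] += rng.gauss(0.0, 0.1)
    # Preservative replacement: all other individuals stay in the population.
    worst = max(range(len(pop)), key=lambda i: fitness(pop[i]))
    if fitness(child) <= fitness(pop[worst]):
        pop[worst] = child


def sphere(x: Vector) -> float:
    """Hypothetical test objective: sum of squares, to be minimized."""
    return sum(v * v for v in x)


rng = random.Random(3)
population = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(10)]
for _ in range(200):
    steady_state_step(population, sphere, rng)
print(min(sphere(p) for p in population))
```

Because only one individual changes per generation, good solutions spread through the population gradually, which is also why the population can lose its diversity and converge prematurely.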
Even in preservative strategies, it is not guaranteed that the best individuals will always
survive. In principle, a (µ + λ) strategy can also mean that from µ + λ individuals, µ are
chosen with a certain selection algorithm. Most selection algorithms are randomized, and even
if such methods pick the best solution candidates with the highest probabilities, they may also
select worse individuals. At this point, it may be interesting to mention that the idea that
larger populations will always lead to better optimization results does not necessarily hold,
as shown by van Nimwegen and Crutchfield [2096].
Definition 2.4 (Elitism). An elitist evolutionary algorithm [512, 1261, 359] ensures that
at least one copy of the best individual(s) of the current generation is propagated on to the
next generation.
The main advantage of elitism is that its convergence is guaranteed, meaning that once
the global optimum has been discovered, the evolutionary algorithm converges to that opti-
mum. On the other hand, the risk of converging to a local optimum is also higher. Elitism
is an additional feature of global optimization algorithms – a special type of preservative
strategy – which is often realized by using a secondary population only containing the
non-prevailed individuals. This population is updated at the end of each iteration. Such
an archive-based elitism can be combined with both, generational and preservative strate-
gies. Algorithm 2.2 specifies the basic scheme of elitist evolutionary algorithms.
Algorithm 2.2: X⋆ ←− elitistEA(cmpF, ps, as)
Input: cmpF: the comparator function which allows us to compare the utility of two
       solution candidates
Input: ps: the population size
Input: as: the archive size
Data: t: the generation counter
Data: Pop: the population
Data: Mate: the mating pool
Data: Arc: the archive with the best individuals found so far
Data: v: the fitness function resulting from the fitness assigning process
Output: X⋆: the set of best solution candidates discovered
begin
    t ←− 0
    Arc ←− ∅
    Pop ←− createPop(ps)
    while ¬terminationCriterion() do
        Arc ←− updateOptimalSetN(Arc, Pop)
        Arc ←− pruneOptimalSet(Arc, as)
        v ←− assignFitness(Pop, Arc, cmpF)
        Mate ←− select(Pop, Arc, v, ps)
        t ←− t + 1
        Pop ←− reproducePop(Mate)
    return extractPhenotypes(extractOptimalSet(Pop ∪ Arc))
end
Let us now outline the new methods and changes introduced in Algorithm 2.2 in short.
1. The archive Arc is the set of best individuals found by the algorithm. Initially, it is
the empty set ∅. Subsequently, it is updated with the function “updateOptimalSetN”
which inserts new, unprevailed elements from the population into it and also removes
individuals from the archive which are superseded by those new optima. Algorithms that
realize such updating are defined in Section 19.1 on page 307.
2. If the optimal set becomes too large – it might theoretically contain uncountably many
individuals – “pruneOptimalSet” reduces it to a proper size, employing techniques like
clustering in order to preserve the element diversity. More about pruning can be found
in Section 19.3 on page 309.
3. You should also notice that both, the fitness assignment and selection processes, of elitist
evolutionary algorithms may take the archive as additional parameter. In principle,
such archive-based algorithms can also be used in non-elitist evolutionary algorithms by
simply replacing the parameter Arc with ∅.
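The archive maintenance in Algorithm 2.2 can be sketched as follows. The sketch rests on several assumptions: the archive stores plain objective vectors instead of full individuals, all objectives are minimized, and the pruning step simply keeps evenly spaced elements after sorting, which is a crude stand-in for the clustering techniques mentioned above. The names update_archive and prune_archive are hypothetical and are not the definitions from Chapter 19.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a prevails over b: no worse everywhere, better at least once."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive: List[Objectives],
                   population: Sequence[Objectives]) -> List[Objectives]:
    """Insert non-prevailed elements and remove archived ones they supersede."""
    result = list(archive)
    for candidate in population:
        if candidate in result or any(dominates(m, candidate) for m in result):
            continue  # duplicate, or beaten by an archived element
        result = [m for m in result if not dominates(candidate, m)]
        result.append(candidate)
    return result


def prune_archive(archive: List[Objectives], max_size: int) -> List[Objectives]:
    """Reduce the archive to max_size elements that are spread over its range."""
    if len(archive) <= max_size:
        return list(archive)
    ordered = sorted(archive)
    if max_size == 1:
        return [ordered[0]]
    step = (len(ordered) - 1) / (max_size - 1)
    return [ordered[round(i * step)] for i in range(max_size)]


arc: List[Objectives] = []
arc = update_archive(arc, [(1, 5), (3, 3), (4, 4), (5, 1)])
print(arc)                    # [(1, 5), (3, 3), (5, 1)]
arc = update_archive(arc, [(2, 2)])
print(arc)                    # [(1, 5), (5, 1), (2, 2)]
print(prune_archive(arc, 2))  # [(1, 5), (5, 1)]
```

In the second update, the new element (2, 2) supersedes (3, 3), which is therefore removed from the archive, while the two extreme elements remain because they are not prevailed.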
2.1.7 Configuration Parameters of Evolutionary Algorithms
Figure 2.3 illustrates the basic configuration parameters of evolutionary algorithms. The
performance and success of an evolutionary optimization approach applied to a problem
given by a set of objective functions F and a problem space X is defined by
Figure 2.3: The configuration parameters of evolutionary algorithms (basic parameters,
archive/pruning, fitness assignment process, selection algorithm, search space and search
operations, genotype-phenotype mapping).
1. its basic parameter settings like the population size ps or the crossover and mutation
rates,
2. whether it uses an archive Arc of the best individuals found and, if so, which pruning
technology is used to prevent it from overflowing,
3. the fitness assignment process “assignFitness” and the selection algorithm “select”,
4. the choice of the search space G and the search operations Op,
5. and the genotype-phenotype mapping connecting the search space and the problem
space.
In Section 20.1, we go more into detail on how to state the configuration of an optimization
algorithm in order to fully describe experiments and to make them reproducible.
2.2 General Information
2.2.1 Areas Of Application
Some example areas of application of evolutionary algorithms are:
Application                                                 References
Function Optimization                                       [1562, 1673]
Multi-Objective Optimization                                [715, 716, 357, 1054, 1804, 537]
Combinatorial Optimization                                  [254, 1762, 1270, 1338]
Engineering, Structural Optimization, and Design            [755, 1412, 1554]
Constraint Satisfaction Problems (CSP)                      [2091, 1054, 716, 1804]
Economics and Finance                                       [388, 1975, 503, 640, 409]
Biology                                                     [2075, 704]
Data Mining and Data Analysis                               [2178, 445, 797, 444]
Mathematical Problems                                       [1094]
Electrical Engineering and Circuit Design                   [488, 2075]
Chemistry, Chemical Engineering                             [1061, 482, 389]
Scheduling                                                  [1360, 374, 1227, 454, 250]
Robotics                                                    [2158]
Image Processing                                            [322, 1532]
Networking and Communication                                [1889, 1890, 453, 1497, 1684, 35]; see Section 23.2 on page 401
Medicine                                                    [411, 1911]
Resource Minimization, Environment Surveillance/Protection  [886]
Military and Defense                                        [1393]
Evolving Behaviors, e.g., for Agents or Game Players        [1705]
For more information see also the application sections of the different members of the evo-
lutionary algorithm family: genetic algorithms in Section 3.2.1 on page 142, Genetic Pro-
gramming in Section 4.2.1 on page 160, Evolution Strategy in Section 5.2.1 on page 227,
evolutionary programming in Section 6.2.1 on page 231, and Learning Classifier Systems
in Section 7.2.1 on page 233.
2.2.2 Conferences, Workshops, etc.
Some conferences, workshops, and similar events on evolutionary algorithms are:
BIOMA: International Conference on Bioinspired Optimization Methods and their Appli-
cations
http://bioma.ijs.si/ [accessed 2007-06-30]
History: 2008: Ljubljana, Slovenia, see [670]
2006: Ljubljana, Slovenia, see [669]
2004: Ljubljana, Slovenia, see [671]
CEC: Congress on Evolutionary Computation
http://ieeexplore.ieee.org/servlet/opac?punumber=7875 [accessed 2007-09-05]
History: 2008: Hong Kong, China, see [1409]
2007: Singapore, see [1005]
2006: Vancouver, BC, Canada, see [2291]
2005: Edinburgh, Scotland, UK, see [449]
2004: Portland, Oregon, USA, see [1004]
2003: Canberra, Australia, see [1803]
2002: Honolulu, HI, USA, see [703]
2001: Seoul, Korea, see [1003]
2000: La Jolla, California, USA, see [1002]
1999: Washington D.C., USA, see [69]
1998: Anchorage, Alaska, USA, see [1001]
1997: Indianapolis, IN, USA, see [106]
1996: Nagoya, Japan, see [1006]
1995: Perth, Australia, see [1000]
1994: Orlando, Florida, USA, see [1411]
Dagstuhl Seminar: Practical Approaches to Multi-Objective Optimization
History: 2006: Dagstuhl, Germany, see [283]
2004: Dagstuhl, Germany, see [281]
EA/AE: Conference on Artificial Evolution (Evolution Artificielle)
History: 2007: Tours, France, see [1441]
2005: Lille, France, see [2000]
2003: Marseilles, France, see [1283]
2001: Le Creusot, France, see [428]
1999: Dunkerque, France, see [711]
1997: Nîmes, France, see [894]
1995: Brest, France, see [41]
1994: Toulouse, France, see [40]
EMO: International Conference on Evolutionary Multi-Criterion Optimization
History: 2007: Matsushima/Sendai, Japan, see [1555]
2005: Guanajuato, México, see [422]
2003: Faro, Portugal, see [719]
2001: Zurich, Switzerland, see [2331]
EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications
to Industrial Problems
History: 2007: Jyväskylä, Finland, see [2072]
2005: Munich, Germany, see [1827]
2003: Barcelona, Spain, see [147]
2001: Athens, Greece, see [803]
1999: Jyväskylä, Finland, see [1413]
1997: Triest, Italy, see [1681]
1995: Las Palmas de Gran Canaria, Spain, see [1059]
EvoCOP: European Conference on Evolutionary Computation in Combinatorial Optimiza-
tion
http://www.evostar.org/ [accessed 2007-09-05]
Co-located with EvoWorkshops and EuroGP.
History: 2009: Tübingen, Germany, see [455]
2008: Naples, Italy, see [2094]
2007: Valencia, Spain, see [456]
2006: Budapest, Hungary, see [843]
2005: Lausanne, Switzerland, see [1700]
2004: Coimbra, Portugal, see [842]
2003: Essex, UK, see [1701]
2002: Kinsale, Ireland, see [321]
2001: Lake Como, Milan, Italy, see [235]
EvoWorkshops: Applications of Evolutionary Computing: EvoCoMnet, EvoFIN, EvoIASP,
EvoINTERACTION, EvoMUSART, EvoPhD, EvoSTOC and EvoTransLog
http://www.evostar.org/ [accessed 2007-08-05]
Co-located with EvoCOP and EuroGP.
History: 2009: Tübingen, Germany, see [802]
2008: Naples, Italy, see [801]
2007: Valencia, Spain, see [800]
2006: Budapest, Hungary, see [1768]
2005: Lausanne, Switzerland, see [1767]
2004: Coimbra, Portugal, see [1702]
2003: Essex, UK, see [1701]
2002: Kinsale, Ireland, see [321]
2001: Lake Como, Milan, Italy, see [235]
2000: Edinburgh, Scotland, UK, see [320]
1999: Göteborg, Sweden, see [1665]
1998: Paris, France, see [976]
FEA: International Workshop on Frontiers in Evolutionary Algorithms
Was part of the Joint Conference on Information Science
History: 2005: Salt Lake City, Utah, USA, see [1794]
2003: Cary, North Carolina, USA, see [639]
2002: Research Triangle Park, North Carolina, USA, see [353]
2000: Atlantic City, NJ, USA, see [2154]
1998: Research Triangle Park, North Carolina, USA, see [2021]
1997: Research Triangle Park, North Carolina, USA, see [1865]
FOCI: IEEE Symposium on Foundations of Computational Intelligence
History: 2007: Honolulu, Hawaii, USA, see [1388]
GECCO: Genetic and Evolutionary Computation Conference
http://www.sigevo.org/ [accessed 2007-08-30]
A recombination of the Annual Genetic Programming Conference (GP, see Section 4.2.2 on
page 161) and the International Conference on Genetic Algorithms (ICGA, see Section 3.2.2
on page 143), also “contains” the International Workshop on Learning Classifier Systems
(IWLCS, see Section 7.2.2 on page 234).
History: 2008: Atlanta, Georgia, USA, see [1117, 409, 1393, 1911, 1705]
2007: London, England, see [2037, 2038]
2006: Seattle, Washington, USA, see [352]
2005: Washington, D.C., USA, see [202, 199, 1764, 1766]
2004: Seattle, Washington, USA, see [544, 545, 1113]
2003: Chicago, Illinois, USA, see [334, 335]
2002: New York, USA, see [1245, 331, 154, 1572, 1326]
2001: San Francisco, California, USA, see [1937, 833]
2000: Las Vegas, Nevada, USA, see [2216, 2210]
1999: Orlando, Florida, USA, see [142, 1584, 1889]
GEM: International Conference on Genetic and Evolutionary Methods
see Section 3.2.2 on page 143
ICANNGA: International Conference on Adaptive and Natural Computing Algorithms
before 2005: International Conference on Artificial Neural Nets and Genetic Algorithms
History: 2007: Warsaw, Poland, see [173, 174]
2005: Coimbra, Portugal, see [1725]
2003: Roanne, France, see [1628]
2001: Prague, Czech Republic, see [1224]
1999: Portoroz, Slovenia, see [576]
1997: Norwich, England, see [1902]
1995: Alès, France, see [1627]
1993: Innsbruck, Austria, see [36]
ICNC: International Conference on Advances in Natural Computation
see Section 1.6.2 on page 89
Mendel: International Conference on Soft Computing
see Section 1.6.2 on page 90
PPSN: International Conference on Parallel Problem Solving from Nature
http://ls11-www.informatik.uni-dortmund.de/PPSN/ [accessed 2007-09-05]
History: 2008: Dortmund, Germany, see [1948]
2006: Reykjavik, Iceland, see [1779]
2004: Birmingham, UK, see [2285]
2002: Granada, Spain, see [867]
2000: Paris, France, see [1830]
1998: Amsterdam, The Netherlands, see [624]
1996: Berlin, Germany, see [2118]
1994: Jerusalem, Israel, see [492]
1992: Brussels, Belgium, see [1357]
1990: Dortmund, Germany, see [1842]
2.2.3 Journals
Some journals that deal (at least partially) with evolutionary algorithms are:
Evolutionary Computation, ISSN: 1063-6560, appears quarterly, editor(s): Marc Schoenauer,
publisher: MIT Press, http://www.mitpressjournals.org/loi/evco [accessed 2007-09-16]
IEEE Transactions on Evolutionary Computation, ISSN: 1089-778X, appears bi-monthly,
editor(s): Xin Yao, publisher: IEEE Computational Intelligence Society,
http://ieee-cis.org/pubs/tec/ [accessed 2007-09-16]
Biological Cybernetics, ISSN: 0340-1200 (Print), 1432-0770 (Online), appears bi-monthly,
publisher: Springer Berlin/Heidelberg, http://www.springerlink.com/content/100465/
[accessed 2007-09-16]
Complex Systems, ISSN: 0891-2513, appears quarterly, editor(s): Stephen Wolfram, publisher:
Complex Systems Publications, Inc., http://www.complex-systems.com/ [accessed 2007-09-16]
Journal of Artificial Intelligence Research (JAIR) (see Section 1.6.3 on page 92)
New Mathematics and Natural Computation (NMNC), ISSN: 1793-0057, appears three times
a year, editor(s): Paul P. Wang, publisher: World Scientific,
http://www.worldscinet.com/nmnc/ [accessed 2007-09-19]
The Journal of the Operational Research Society (see Section 1.6.3 on page 91)
2.2.4 Online Resources
Some general resources on evolutionary algorithms that are available online are:
http://www.lania.mx/~ccoello/EMOO/ [accessed 2007-09-20]
Last update: up-to-date
Description: EMOO Web page: Dr. Coello Coello’s giant bibliography and paper repository
for evolutionary multi-objective optimization.
http://www-isf.maschinenbau.uni-dortmund.de/links/ci_links.html [accessed 2007-10-14]
Last update: up-to-date
Description: Computational Intelligence (CI)-related links and literature, maintained by
Jörn Mehnen
http://www.aip.de/~ast/EvolCompFAQ/ [accessed 2007-09-16]
Last update: 2001-04-01
Description: Frequently Asked Questions of the comp.ai.genetic group by Heitkötter and
Beasley [916].
http://nknucc.nknu.edu.tw/~hcwu/pdf/evolec.pdf [accessed 2007-09-16]
Last update: 2005-02-19
Description: Lecture Notes on Evolutionary Computation by Wu [2264]
http://ls11-www.cs.uni-dortmund.de/people/beyer/EA-glossary/ [accessed 2008-04-10]
Last update: 2002-02-25
Description: Online glossary on terms and definitions in evolutionary algorithms by Beyer
et al. [201]
http://www.illigal.uiuc.edu/web/ [accessed 2008-05-17]
Last update: up-to-date
Description: The Illinois Genetic Algorithms Laboratory (IlliGAL)
http://www.peterindia.net/Algorithms.html [accessed 2008-05-17]
Last update: up-to-date
Description: A large collection of links about evolutionary algorithms, Genetic Programming,
genetic algorithms, etc.
http://www.fmi.uni-stuttgart.de/fk/evolalg/ [accessed 2008-05-17]
Last update: 2003-07-08
Description: The Evolutionary Computation repository of the University of Stuttgart.
http://dis.ijs.si/filipic/ec/ [accessed 2008-05-18]
Last update: 2007-11-09
Description: The Evolutionary Computation repository of the Jožef Stefan Institute in
Slovenia
http://www.red3d.com/cwr/evolve.html [accessed 2008-05-18]
Last update: 2002-07-27
Description: Evolutionary Computation and its application to art and design by Craig
Reynolds
http://surf.de.uu.net/encore/ [accessed 2008-05-18]
Last update: 2004-08-26
Description: ENCORE, the electronic appendix to The Hitch-Hiker’s Guide to Evolutionary
Computation, see [916]
http://www-isf.maschinenbau.uni-dortmund.de/links/ci_links.html [accessed 2008-05-18]
Last update: 2006-09-13
Description: A collection of links to computational intelligence / EAs
http://www.tik.ee.ethz.ch/sop/education/misc/moeaApplet/ [accessed 2008-10-25]
Last update: 2008-06-30
Description: An applet illustrating a multi-objective EA
2.2.5 Books
Some books about (or including significant information about) evolutionary algorithms are:

Bäck [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary
Programming, Genetic Algorithms
Bäck, Fogel, and Michalewicz [104]: Handbook of Evolutionary Computation
Coello Coello, Lamont, and van Veldhuizen [361]: Evolutionary Algorithms for Solving
Multi-Objective Problems
Deb [537]: Multi-Objective Optimization Using Evolutionary Algorithms
Coello Coello and Lamont [424]: Applications of Multi-Objective Evolutionary Algorithms
Eiben and Smith [623]: Introduction to Evolutionary Computing
Dumitrescu, Lazzerini, Jain, and Dumitrescu [608]: Evolutionary Computation
Fogel [696]: Evolutionary Computation: The Fossil Record

Bäck, Fogel, and Michalewicz [107]: Evolutionary Computation 1: Basic Algorithms and
Operators
Bäck, Fogel, and Michalewicz [108]: Evolutionary Computation 2: Advanced Algorithms and
Operators
Bentley [181]: Evolutionary Design by Computers
De Jong [515]: Evolutionary Computation: A Unified Approach
Weicker [2167]: Evolutionäre Algorithmen
Gerdes, Klawonn, and Kruse [789]: Evolutionäre Algorithmen
Nissen [1535]: Einführung in evolutionäre Algorithmen: Optimierung nach dem Vorbild der
Evolution
Yao [2284]: Evolutionary Computation: Theory and Applications
Yu, Davis, Baydar, and Roy [2299]: Evolutionary Computation in Practice
Yang, Ong, and Jin [2280]: Evolutionary Computation in Dynamic and Uncertain Environ-
ments
Morrison [1464]: Designing Evolutionary Algorithms for Dynamic Environments
Branke [280]: Evolutionary Optimization in Dynamic Environments
Nedjah, Alba, and Mourelle [1512]: Parallel Evolutionary Computations
Kosiński [1177]: Advances in Evolutionary Algorithms
Rothlauf [1765]: Representations for Genetic and Evolutionary Algorithms
Banzhaf and Eeckman [137]: Evolution and Biocomputation – Computational Models of Evo-
lution

Fogel and Corne [704]: Evolutionary Computation in Bioinformatics
Johnston [1061]: Applications of Evolutionary Computation in Chemistry
Clark [411]: Evolutionary Algorithms in Molecular Design
Chen [388]: Evolutionary Computation in Economics and Finance
Ghosh and Jain [797]: Evolutionary Computation in Data Mining
Miettinen, Mäkelä, Neittaanmäki, and Periaux [1412]: Evolutionary Algorithms in Engineering
and Computer Science
Fogel [698]: Evolutionary Computation: Principles and Practice for Signal Processing
Ashlock [85]: Evolutionary Computation for Modeling and Optimization
Watanabe and Hashem [2158]: Evolutionary Computations – New Algorithms and their Ap-
plications to Evolutionary Robots
Cagnoni, Lutton, and Olague [322]: Genetic and Evolutionary Computation for Image Pro-
cessing and Analysis
Kramer [1214]: Self-Adaptive Heuristics for Evolutionary Computation
Lobo, Lima, and Michalewicz [1299]: Parameter Setting in Evolutionary Algorithms
Spears [1925]: Evolutionary Algorithms – The Role of Mutation and Recombination
Eiben and Michalewicz [621]: Evolutionary Computation
Jin [1055]: Knowledge Incorporation in Evolutionary Computation
Grosan, Abraham, and Ishibuchi [862]: Hybrid Evolutionary Algorithms
Abraham, Jain, and Goldberg [9]: Evolutionary Multiobjective Optimization
Kallel, Naudts, and Rogers [1083]: Theoretical Aspects of Evolutionary Computing
Ghosh and Tsutsui [798]: Advances in Evolutionary Computing – Theory and Applications
Yang, Shan, and Bui [2279]: Success in Evolutionary Computation
Pereira and Tavares [1635]: Bio-inspired Algorithms for the Vehicle Routing Problem
2.3 Fitness Assignment
2.3.1 Introduction
With the concept of Pareto domination and prevalence comparisons introduced in Section 1.2.2
on page 27 we define a partial order on the elements in the problem space X. In multi-
objective optimization, each solution candidate p.x is characterized by a vector of objective
values F (p.x). Many selection algorithms however cannot work with such vectors and need
scalar fitness values instead. By assigning a single real number v(p.x) (the fitness) to each
solution candidate p.x, also a total order is defined on them.
The fitness assigned to an individual may not just reflect its rank in the population, but
can also incorporate density/niching information. This way, not only the quality of a solution
candidate is considered, but also the overall diversity of the population. This can improve the
chance of finding the global optima as well as the performance of the optimization algorithm
significantly. If many individuals in the population occupy the same rank or do not dominate
each other, for instance, such information will be very helpful.
The fitness v(p.x) thus may not only depend on the solution candidate p.x itself, but on
the whole population Pop of the evolutionary algorithm (and on the archive Arc of optimal
elements, if available). In practical realizations, the fitness values are often stored in a special
member variable in the individual records. Therefore, v(p.x) can be considered as a mapping
that returns the value of such a variable which has previously been stored there by a fitness
assignment process “assignFitness”.
Definition 2.5 (Fitness Assignment). A fitness assignment process “assignFitness” creates
a function v : X → R+ which relates a scalar fitness value to each solution candidate in
the population Pop (Equation 2.1) and, if an archive is available, in the archive Arc (Equation 2.2).
v = assignFitness(Pop, cmpF ) ⇒ v(p.x) ∈ V ⊆ R+ ∀p ∈ Pop
(2.1)
v = assignFitness(Pop, Arc, cmpF ) ⇒ v(p.x) ∈ V ⊆ R+ ∀p ∈ Pop ∪ Arc
(2.2)

In the context of this book, we generally minimize fitness values, i. e., the lower the
fitness of a solution candidate the better. Therefore, many of the fitness assignment processes
based on the prevalence relation will obey Equation 2.3. This equation represents a general
relation – sometimes it is useful to violate it for some individuals in the population, especially
when crowding information is incorporated.
p1.x≻p2.x ⇒ v(p1.x) < v(p2.x) ∀p1,p2 ∈ Pop ∪ Arc
(2.3)
2.3.2 Weighted Sum Fitness Assignment
The most primitive fitness assignment strategy would be assigning a weighted sum of the
objective values. This approach is very static and comes with the same problems as the weighted
sum-based approach for defining what an optimum is, introduced in Section 1.2.2 on page 29.
It makes no use of the prevalence relation. For computing the weighted sum of the different
objective values of a solution candidate, we reuse Equation 1.4 on page 29 from the weighted
sum optimum definition. The weights have to be chosen in a way that ensures that v(p.x) ∈
R+ holds for all individuals p.
v(p.x) = assignFitnessWeightedSum(Pop) ⇔ ∀p ∈ Pop ⇒ v(p.x) = g(p.x)
(2.4)
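The weighted sum assignment can be sketched in a few lines of Python. This is only an illustration under assumptions: we take Equation 1.4 to have the usual form g(x) = w1·f1(x) + ... + wn·fn(x), and the names assign_fitness_weighted_sum, f1, f2, and the four sample points are hypothetical and not part of the book's notation.

```python
from typing import Callable, Sequence


def assign_fitness_weighted_sum(
    pop: Sequence[Sequence[float]],
    objectives: Sequence[Callable[[Sequence[float]], float]],
    weights: Sequence[float],
) -> list[float]:
    """Return v(p.x) = sum_i w_i * f_i(p.x) for every candidate in pop."""
    fitness = []
    for x in pop:
        g = sum(w * f(x) for w, f in zip(weights, objectives))
        if g < 0:
            # The text requires v(p.x) to lie in R+ for all individuals.
            raise ValueError("weights must keep the fitness in R+")
        fitness.append(g)
    return fitness


# Hypothetical example: candidates are points whose coordinates are
# directly the two objective values (both subject to minimization).
f1 = lambda x: x[0]
f2 = lambda x: x[1]
population = [(1, 7), (2, 4), (6, 2), (10, 1)]
fitness = assign_fitness_weighted_sum(population, [f1, f2], weights=[1.0, 1.0])
print(fitness)  # [8.0, 6.0, 8.0, 11.0]
```

All four sample points are mutually non-dominated, yet with equal weights they receive different fitness values. This illustrates why the approach is called static: the weights, not the prevalence relation, decide which trade-off is preferred.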
2.3.3 Pareto Ranking
Another very simple method for computing fitness values is to let them directly reflect the
Pareto domination (or prevalence) relation. Figure 2.4 and Table 2.1 illustrate the Pareto
relations in a population of 15 individuals and their corresponding objective values f1 and
f2, both subject to minimization.

Figure 2.4: An example scenario for Pareto ranking (plot of the 15 individuals over the
objectives f1 and f2, with the Pareto frontier marked).

  x   prevails                          is prevailed by                                Ap. 1   Ap. 2
  1   {5, 6, 8, 9, 14, 15}              ∅                                              1/7     0
  2   {6, 7, 8, 9, 10, 11, 13, 14, 15}  ∅                                              1/10    0
  3   {12, 13, 14, 15}                  ∅                                              1/5     0
  4   ∅                                 ∅                                              1       0
  5   {8, 15}                           {1}                                            1/3     1
  6   {8, 9, 14, 15}                    {1,2}                                          1/5     2
  7   {9, 10, 11, 14, 15}               {2}                                            1/6     1
  8   {15}                              {1,2,5,6}                                      1/2     4
  9   {14, 15}                          {1,2,6,7}                                      1/3     4
 10   {14, 15}                          {2,7}                                          1/3     2
 11   {14, 15}                          {2,7}                                          1/3     2
 12   {13, 14, 15}                      {3}                                            1/4     1
 13   {15}                              {2,3,12}                                       1/2     3
 14   {15}                              {1,2,3,6,7,9,10,11,12}                         1/2     9
 15   ∅                                 {1,2,3,5,6,7,8,9,10,11,12,13,14}               1       13
Table 2.1: The Pareto domination relation of the individuals illustrated in Figure 2.4.

There are two ways for doing this: First, to each individual, we can assign a value inversely
proportional to the number of other individuals it prevails, like
v(p1.x) ≡ 1 / (|∀p2 ∈ Pop : p1.x ≻ p2.x| + 1). We have written such fitness values in the column
“Ap. 1” of Table 2.1 for Pareto optimization, i. e., the special case where the Pareto dominance
relation is used to define prevalence. Individuals that dominate many others will here receive
a lower fitness value than those which are prevailed by many. When taking a look at these
values, the disadvantage of this approach becomes clear: It promotes individuals that reside
in crowded regions of the problem space and underrates those in sparsely explored areas.
By doing so, the fitness assignment process achieves exactly the opposite of what we
want. Instead of exploring the problem space and delivering a wide scan of the frontier
of best possible solution candidates, it will focus all effort on a small set of individuals.
We will only obtain a subset of the best solutions and it is even possible that this fitness
assignment method leads to premature convergence to a local optimum. A good example
for this problem are the four non-prevailed individuals {1,2,3,4} from the Pareto frontier.
The best fitness is assigned to the element 2, followed by individual 1. Although individual
7 is dominated (by 2), its fitness is better than the fitness of the non-dominated element 3.
The solution candidate 4 gets the worst possible fitness 1, since it prevails no other
element. Its chances for reproduction are as low as those of individual 15, which
is dominated by all other elements except 4. Hence, both solution candidates will most
probably not be selected and vanish in the next generation. The loss of solution candidate
4 will greatly decrease the diversity and even increase the focus on the crowded area near 1
and 2.
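The effect described above can be reproduced with a short Python sketch. It assumes plain Pareto dominance for minimization as the prevalence relation and uses the objective values (f1, f2) of the 15 individuals as they are listed in Table 2.2; the names dominates, POP, and fitness_inverse_prevail_count are our own illustrative choices.

```python
def dominates(a, b):
    """Pareto dominance for minimization: a is nowhere worse and somewhere better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


# Objective vectors (f1, f2) of individuals 1..15, taken from Table 2.2.
POP = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
       (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]


def fitness_inverse_prevail_count(pop):
    """First approach: v(p1) = 1 / (number of individuals p1 prevails + 1)."""
    return [1.0 / (sum(dominates(p, q) for q in pop) + 1) for p in pop]


v = fitness_inverse_prevail_count(POP)
# Indices of the three individuals with the best (lowest) fitness, 1-based.
order = sorted(range(len(POP)), key=lambda i: v[i])
print([i + 1 for i in order[:3]])  # [2, 1, 7]
```

The dominated individual 7 ends up ahead of the non-dominated individuals 3 and 4, which is exactly the undesired bias towards the crowded region.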
A much better second approach for fitness assignment is directly based on the domination
(or prevalence) relation and was first proposed by Goldberg [821]. Here, the idea is to
assign to each solution candidate the number of individuals it is prevailed by [1315, 253, 255,
851]. This way, the previously mentioned negative effects will not occur. The column “Ap. 2”
in Table 2.1 shows that all four non-prevailed individuals now have the best possible fitness
0. Hence, the exploration pressure is applied to a much wider area of the Pareto frontier. This
so-called Pareto ranking can be performed by first removing all non-prevailed individuals
from the population and assigning the rank 0 to them. Then, the same is performed with
the rest of the population. The individuals only dominated by those on rank 0 (now non-
dominated) will be removed and get the rank 1. This is repeated until all solution candidates
have a proper fitness assigned to them. Algorithm 2.3 outlines another simple way to perform
Pareto ranking. Since we follow the idea of the freer prevalence comparators instead of Pareto
dominance relations, we will synonymously refer to this approach as Prevalence ranking.
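The layer-wise procedure described in this paragraph (remove the non-prevailed individuals, give them rank 0, repeat with the rest) can be sketched as follows. This is a minimal illustration, assuming Pareto dominance for minimization and the objective values from Table 2.2; the function name rank_by_front_peeling is hypothetical.

```python
def dominates(a, b):
    """Pareto dominance for minimization: a is nowhere worse and somewhere better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


# Objective vectors (f1, f2) of individuals 1..15, taken from Table 2.2.
POP = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
       (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]


def rank_by_front_peeling(pop):
    """Repeatedly strip the currently non-dominated individuals from the population."""
    ranks = [None] * len(pop)
    remaining = set(range(len(pop)))
    rank = 0
    while remaining:
        # Individuals not dominated by anything that is still in the population.
        front = {i for i in remaining
                 if not any(dominates(pop[j], pop[i]) for j in remaining)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


print(rank_by_front_peeling(POP))
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 1, 2, 3, 4]
```

Note that these layer ranks are not identical to the counts in column “Ap. 2” of Table 2.1: individual 15, for example, ends up in layer 4 but is prevailed by 13 individuals. Both variants assign 0 to exactly the non-prevailed individuals.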
Algorithm 2.3: v ←− assignFitnessParetoRank(Pop, cmpF)
  Input: Pop: the population to assign fitness values to
  Input: cmpF: the prevalence comparator defining the prevalence relation
  Data: i, j, cnt: the counter variables
  Output: v: a fitness function reflecting the Prevalence ranking
  begin
    for i ←− len(Pop) − 1 down to 0 do
      cnt ←− 0
      p ←− Pop[i]
      for j ←− len(Pop) − 1 down to 0 do
        // Check whether cmpF(Pop[j].x, p.x) < 0
        if (j ≠ i) ∧ (Pop[j].x ≻ p.x) then cnt ←− cnt + 1
      v(p.x) ←− cnt
    return v
  end

As already mentioned, the fitness values of all non-prevailed elements in our example
(Figure 2.4 and Table 2.1) are equally 0. However, the region around the individuals 1 and 2
has probably already extensively been explored, whereas the surrounding of solution candidate
4 is rather unknown. A better approach of fitness assignment should incorporate such
information and put a bit more pressure into the direction of individual 4, in order to make
the evolutionary algorithm investigate this area more thoroughly.
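Algorithm 2.3 translates almost directly into Python. The sketch below assumes that the prevalence comparator is plain Pareto dominance for minimization and again uses the objective values from Table 2.2; assign_fitness_pareto_rank and its prevails parameter are illustrative names.

```python
def dominates(a, b):
    """Pareto dominance for minimization: a is nowhere worse and somewhere better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


# Objective vectors (f1, f2) of individuals 1..15, taken from Table 2.2.
POP = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
       (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]


def assign_fitness_pareto_rank(pop, prevails=dominates):
    """Fitness of each individual = number of other individuals that prevail it."""
    fitness = []
    for i, p in enumerate(pop):
        cnt = 0
        for j, other in enumerate(pop):
            if j != i and prevails(other, p):
                cnt += 1
        fitness.append(cnt)
    return fitness


ranks = assign_fitness_pareto_rank(POP)
print(ranks)  # [0, 0, 0, 0, 1, 2, 1, 4, 4, 2, 2, 1, 3, 9, 13]
```

The printed list reproduces column “Ap. 2” of Table 2.1. Any other comparator can be plugged in through the prevails argument, as long as it returns True when its first argument prevails the second.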
2.3.4 Sharing Functions
Previously, we have mentioned that the drawback of Pareto ranking is that it does not
incorporate any information about whether the solution candidates in the population reside
closely to each other or in regions of the problem space which are only sparsely covered by
individuals. Sharing, as a method for including such diversity information into the fitness
assignment process, was introduced by Holland [940] and later refined by Deb [532], Goldberg
and Richardson [824], and Deb and Goldberg [539]. [1801, 1417, 1558]
Definition 2.6 (Sharing Function). A sharing function Sh : R+ → R+ is a function
used to relate two individuals p1 and p2 to a value that decreases with their distance14
d = dist(p1, p2) in a way that it is 1 for d = 0 and 0 if the distance exceeds a specified
constant σ.
\[
Sh_{\sigma}(d = dist(p_1, p_2)) =
\begin{cases}
1 & \text{if } d \le 0 \\
Sh_{\sigma}(d) \in [0, 1] & \text{if } 0 < d < \sigma \\
0 & \text{otherwise}
\end{cases}
\tag{2.5}
\]
Sharing functions can be employed in many different ways and are used by a variety
of fitness assignment processes [824, 532]. Typically, the simple triangular function Sh_tri
[959] or one of its either convex (Sh_cvex,p) or concave (Sh_ccav,p) counterparts with the power
p ∈ R+, p > 0 are applied. Besides using different powers of the distance-σ-ratio, another
approach is the exponential sharing method Sh_exp.
14 The concept of distance and a set of different distance measures is defined in Section 29.1 on
page 537.

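\[
\mathrm{Sh\_tri}_{\sigma}(d) =
\begin{cases}
1 - \frac{d}{\sigma} & \text{if } 0 \le d < \sigma \\
0 & \text{otherwise}
\end{cases}
\tag{2.6}
\]
\[
\mathrm{Sh\_cvex}_{\sigma,p}(d) =
\begin{cases}
\left(1 - \frac{d}{\sigma}\right)^{p} & \text{if } 0 \le d < \sigma \\
0 & \text{otherwise}
\end{cases}
\tag{2.7}
\]
\[
\mathrm{Sh\_ccav}_{\sigma,p}(d) =
\begin{cases}
1 - \left(\frac{d}{\sigma}\right)^{p} & \text{if } 0 \le d < \sigma \\
0 & \text{otherwise}
\end{cases}
\tag{2.8}
\]
\[
\mathrm{Sh\_exp}_{\sigma,p}(d) =
\begin{cases}
1 & \text{if } d \le 0 \\
0 & \text{if } d \ge \sigma \\
\frac{e^{-\frac{p d}{\sigma}} - e^{-p}}{1 - e^{-p}} & \text{otherwise}
\end{cases}
\tag{2.9}
\]
For sharing, the distance of the individuals in the search space G as well as their distance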
in the problem space X or the objective space Y may be used. If the solution candidates
are real vectors in R^n, we could use the Euclidean distance of the phenotypes of the
individuals directly, i.e., compute dist_eucl(p1.x, p2.x). In genetic algorithms, where the search
space is the set of all bit strings G = B^n of the length n, another suitable approach would be
to use the Hamming distance15 dist_Ham(p1.g, p2.g) of the genotypes. The work of Deb [532],
however, indicates that phenotypical sharing will often be superior to genotypical sharing.
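The four sharing functions can be written down compactly in Python. The sketch assumes the forms 1 − d/σ (triangular), (1 − d/σ)^p (convex), 1 − (d/σ)^p (concave), and (e^(−pd/σ) − e^(−p)) / (1 − e^(−p)) (exponential) for Equations 2.6 to 2.9; the function names are our own.

```python
import math


def sh_tri(d, sigma):
    """Triangular sharing: falls linearly from 1 at d = 0 to 0 at d = sigma."""
    return 1.0 - d / sigma if 0 <= d < sigma else 0.0


def sh_cvex(d, sigma, p):
    """Convex variant: (1 - d/sigma) raised to the power p."""
    return (1.0 - d / sigma) ** p if 0 <= d < sigma else 0.0


def sh_ccav(d, sigma, p):
    """Concave variant: 1 - (d/sigma) raised to the power p."""
    return 1.0 - (d / sigma) ** p if 0 <= d < sigma else 0.0


def sh_exp(d, sigma, p):
    """Exponential sharing, normalized so that it is 1 at d = 0 and 0 at d = sigma."""
    if d <= 0:
        return 1.0
    if d >= sigma:
        return 0.0
    return (math.exp(-p * d / sigma) - math.exp(-p)) / (1.0 - math.exp(-p))


# All four functions agree at the borders of the interval [0, sigma].
for sh in (lambda d: sh_tri(d, 2.0), lambda d: sh_cvex(d, 2.0, 3),
           lambda d: sh_ccav(d, 2.0, 3), lambda d: sh_exp(d, 2.0, 3)):
    print(sh(0.0), sh(2.0))  # 1.0 0.0 in every case
```

With σ = √2 and p = 16, sh_exp(0.125, ...) evaluates to roughly 0.243, which is the share between individuals 1 and 5 listed in Table 2.3.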
Definition 2.7 (Niche Count). The niche count m(p, P) [535, 1417] of an individual p is
the sum of its sharing values with all individuals in a list P.
\[
m(p, P) = \sum_{i=0}^{len(P)-1} Sh_{\sigma}(dist(p, P[i])) \quad \forall p \in P
\tag{2.10}
\]
The niche count m is always greater than zero, since p ∈ P and, hence, Shσ(dist(p,p)) = 1
is computed and added up at least once. The original sharing approach was developed for
single-objective optimization where only one objective function f was subject to maximiza-
tion. In this case, its value was simply divided by the niche count, punishing solutions in
crowded regions [1417]. The goal of sharing was to distribute the population over a number
of different peaks in the fitness landscape, with each peak receiving a fraction of the population
proportional to its height [959]. The results of dividing the fitness by the niche counts
strongly depend on the height differences of the peaks and thus, on the complexity class16
of f. On f1 ∈ O(x), for instance, the influence of m is much bigger than on an f2 ∈ O(e^x).
By multiplying predetermined fitness values v′ with the niche count m, we can use this
approach for fitness minimization in conjunction with a variety of other fitness
assignment processes, but also inherit its shortcomings:
v(p.x) = v′(p.x) ∗ m(p,Pop), v′ ≡ assignFitness(Pop,cmpF)
(2.11)
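Equations 2.10 and 2.11 can be illustrated with a small Python sketch. It assumes triangular sharing and the Euclidean distance between points given as coordinate tuples; niche_count, shared_fitness, and the three sample points are hypothetical.

```python
import math


def sh_tri(d, sigma):
    """Triangular sharing function, assumed here as the simplest choice."""
    return 1.0 - d / sigma if 0 <= d < sigma else 0.0


def niche_count(p, pop, sigma):
    """Equation 2.10: sum of the sharing values of p with all members of pop."""
    return sum(sh_tri(math.dist(p, q), sigma) for q in pop)


def shared_fitness(base_fitness, pop, sigma):
    """Equation 2.11: multiply a previously assigned fitness v' with the niche count."""
    return [v * niche_count(p, pop, sigma) for v, p in zip(base_fitness, pop)]


# Two points close to each other and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
base = [1.0, 1.0, 1.0]          # identical base fitness, subject to minimization
print(shared_fitness(base, points, sigma=0.5))  # approximately [1.8, 1.8, 1.0]
```

Since p itself is a member of the list, every niche count is at least 1. The two neighbors are penalized with a factor of about 1.8, whereas the isolated point keeps its base fitness.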
Sharing was traditionally combined with fitness proportionate, i. e., roulette wheel
selection17. Oei et al. [1558] have shown that if the sharing function is computed using the
parental individuals of the “old” population and then naïvely combined with the more
sophisticated tournament selection18, the resulting behavior of the evolutionary algorithm may
be chaotic. They suggested to use the partially filled “new” population to circumvent this
problem. The layout of evolutionary algorithms, as defined in this book, bases the fitness
computation on the whole set of “new” individuals and assumes that their objective values
have already been completely determined. In other words, such issues simply do not exist
in multi-objective evolutionary algorithms as introduced here and the chaotic behavior does
not occur.
15 See Definition 29.6 on page 537 for more information on the Hamming distance.
16 See Section 30.1.3 on page 550 for a detailed introduction into complexity and the O-notation.
17 Roulette wheel selection is discussed in Section 2.4.3 on page 124.
18 You can find an outline of tournament selection in Section 2.4.4 on page 127.

For computing the niche count m, O(n²) comparisons are needed. According to Goldberg
et al. [827], sampling the population can be sufficient to approximate m in order to avoid
this quadratic complexity.
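One plausible way to realize such a sampling is sketched below. It is an assumption on our side and not necessarily the scheme of [827]: the niche count is estimated from a random subset of the population and scaled up by the ratio of population size to sample size. The names niche_count_sampled and sample_size are hypothetical.

```python
import math
import random


def sh_tri(d, sigma):
    """Triangular sharing function."""
    return 1.0 - d / sigma if 0 <= d < sigma else 0.0


def niche_count(p, pop, sigma):
    """Exact niche count: one sharing evaluation per population member."""
    return sum(sh_tri(math.dist(p, q), sigma) for q in pop)


def niche_count_sampled(p, pop, sigma, sample_size, rng):
    """Estimate the niche count from sample_size randomly drawn individuals."""
    sample = rng.sample(pop, sample_size)        # drawn without replacement
    scale = len(pop) / sample_size               # extrapolate to the whole population
    return scale * sum(sh_tri(math.dist(p, q), sigma) for q in sample)


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(200)]
exact = niche_count(points[0], points, sigma=0.3)
estimate = niche_count_sampled(points[0], points, 0.3, sample_size=20, rng=rng)
print(exact, estimate)
```

With a fixed sample size, the cost per individual no longer grows with the population size. The price is a noisy estimate which, unlike the exact niche count, may drop below 1 if p itself is not part of the sample.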
2.3.5 Variety Preserving Ranking
Using sharing and the niche counts naïvely leads to more or less unpredictable effects. Of
course, it promotes solutions located in sparsely populated niches, but how much their fitness
will be improved is rather unclear. Using distance measures which are not normalized can
lead to strange effects, too. Imagine two objective functions f1 and f2. If the values of f1
span from 0 to 1 for the individuals in the population whereas those of f2 range from 0 to
10 000, the components of f1 will most often be negligible in the Euclidean distance of two
individuals in the objective space Y. Another problem is that the effect of simple sharing
on the pressure into the direction of the Pareto frontier is not obvious either and depends on
the sharing approach applied. Some methods simply add a niche count to the Pareto rank,
which may cause non-dominated individuals to have worse fitness than other individuals in the
population. Other approaches scale the niche count into the interval [0, 1) before adding it,
which not only ensures that non-dominated individuals have the best fitness but also leaves
the relation between individuals at different ranks intact, which does not further variety
very much.
Variety Preserving Ranking is a fitness assignment approach based on Pareto ranking
using prevalence comparators and sharing. We have developed it in order to mitigate all these
previously mentioned side effects and balance the evolutionary pressure between optimizing
the objective functions and maximizing the variety inside the population. In the following,
we will describe the process of Variety Preserving Ranking-based fitness assignment which
is defined in Algorithm 2.4.
Before this fitness assignment process can begin, all individuals with
infinite objective values must be removed from the population Pop. If such a solution candidate
is optimal, i. e., if it has negative infinitely large objectives in a minimization process,
for instance, it should receive fitness zero, since fitness is subject to minimization. If the
individual is infeasible, on the other hand, its fitness should be set to len(Pop) + √len(Pop) + 1,
which is one larger than every other fitness value that may be assigned by Algorithm 2.4.
In lines 2 to 9, we create a list ranks which we use to efficiently compute the Pareto
rank of every solution candidate in the population. By the way, the word prevalence rank
would be more precise in this case, since we use prevalence comparisons as introduced in
Section 1.2.4. Therefore, Variety Preserving Ranking is not limited to Pareto optimization
but may also incorporate External Decision Makers (Section 1.2.4) or the method of in-
equalities (Section 1.2.3). The highest rank encountered in the population is stored in the
variable maxRank. This value may be zero if the population contains only non-prevailed
elements. The lowest rank will always be zero since the prevalence comparators cmpF define
order relations which are non-circular by definition19. We will use maxRank to determine
the maximum penalty for solutions in an overly crowded region of the search space later on.
From line 10 to 18, we determine the maximum and the minimum values that each
objective function takes on when applied to the individuals in the population. These values
are used to store the inverse of their ranges in the array rangeScales, which we will use to
scale all distances in each dimension (objective) of the individuals into the interval [0, 1].
There are |F| objective functions in F and, hence, the maximum Euclidean distance between
two solution candidates in the (scaled) objective space becomes √|F|. It occurs if all the
distances in the single dimensions are 1.
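The range scaling of lines 10 to 18 and its effect on the distances can be sketched as follows. The code assumes that the objective values of each individual are given directly as a tuple (here the values from Table 2.2); range_scales and scaled_distance are illustrative names.

```python
import math

# Objective vectors (f1, f2) of individuals 1..15, taken from Table 2.2.
POP = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
       (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]


def range_scales(points):
    """Inverse of the range of each objective; 1 if all values are identical."""
    scales = []
    for k in range(len(points[0])):
        lo = min(p[k] for p in points)
        hi = max(p[k] for p in points)
        scales.append(1.0 / (hi - lo) if hi > lo else 1.0)
    return scales


def scaled_distance(a, b, scales):
    """Euclidean distance after scaling every dimension into [0, 1]."""
    return math.sqrt(sum(((x - y) * s) ** 2 for x, y, s in zip(a, b, scales)))


scales = range_scales(POP)
print(scales)                                    # [1/9, 1/8] as floating point numbers
print(scaled_distance(POP[0], POP[1], scales))   # about 0.391
# No pair can be farther apart than sqrt(|F|) = sqrt(2).
print(max(scaled_distance(a, b, scales) for a in POP for b in POP) <= math.sqrt(2))
```

The distance of about 0.391 between the individuals 1 and 2 is the value that also appears in the upper triangle of Table 2.3.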
19 In all order relations imposed on finite sets there is always at least one “smallest” element.
See Section 27.7.2 on page 463 for more information.

Algorithm 2.4: v ←− assignFitnessVarietyPreserving(Pop, cmpF)
Input: Pop: the population
Input: cmpF: the comparator function
Input: [implicit] F: the set of objective functions
Data: . . . : sorry, no space here, we’ll discuss this in the text
Output: v: the fitness function
 1 begin
     /* If needed: Remove all elements with infinite objective values from Pop and assign
        fitness 0 or len(Pop) + √len(Pop) + 1 to them. Then compute the prevalence ranks. */
 2   ranks ←− createList(len(Pop), 0)
 3   maxRank ←− 0
 4   for i ←− len(Pop) − 1 down to 0 do
 5     for j ←− i − 1 down to 0 do
 6       k ←− cmpF(Pop[i].x, Pop[j].x)
 7       if k < 0 then ranks[j] ←− ranks[j] + 1
 8       else if k > 0 then ranks[i] ←− ranks[i] + 1
 9     if ranks[i] > maxRank then maxRank ←− ranks[i]
     // determine the ranges of the objectives
10   mins ←− createList(|F|, +∞)
11   maxs ←− createList(|F|, −∞)
12   foreach p ∈ Pop do
13     for i ←− |F| down to 1 do
14       if fi(p.x) < mins[i−1] then mins[i−1] ←− fi(p.x)
15       if fi(p.x) > maxs[i−1] then maxs[i−1] ←− fi(p.x)
16   rangeScales ←− createList(|F|, 1)
17   for i ←− |F| − 1 down to 0 do
18     if maxs[i] > mins[i] then rangeScales[i] ←− 1/(maxs[i] − mins[i])
     // Base a sharing value on the scaled Euclidean distance of all elements
19   shares ←− createList(len(Pop), 0)
20   minShare ←− +∞
21   maxShare ←− −∞
22   for i ←− len(Pop) − 1 down to 0 do
23     curShare ←− shares[i]
24     for j ←− i − 1 down to 0 do
25       dist ←− 0
26       for k ←− |F| down to 1 do
27         dist ←− dist + [(fk(Pop[i].x) − fk(Pop[j].x)) ∗ rangeScales[k−1]]²
28       s ←− Sh_exp_{√|F|,16}(√dist)
29       curShare ←− curShare + s
30       shares[j] ←− shares[j] + s
31     shares[i] ←− curShare
32     if curShare < minShare then minShare ←− curShare
33     if curShare > maxShare then maxShare ←− curShare
     // Finally, compute the fitness values
34   scale ←− 1/(maxShare − minShare) if maxShare > minShare, 1 otherwise
35   for i ←− len(Pop) − 1 down to 0 do
36     if ranks[i] > 0 then
37       v(Pop[i].x) ←− ranks[i] + √maxRank ∗ scale ∗ (shares[i] − minShare)
38     else v(Pop[i].x) ←− scale ∗ (shares[i] − minShare)
39 end

The most complicated part of the Variety Preserving Ranking algorithm is between
line 19 and 33. Here we compute the scaled distance from every individual to each other
solution candidate in the objective space and use this distance to aggregate share values
(in the array shares). Therefore, again two nested loops are needed (lines 22 and 24). The
distance components of two individuals Pop[i] and Pop[j] are scaled and summed up in a
variable dist in line 27. The Euclidean distance between them is √dist, which we use to
determine a sharing value in line 28. We have decided for exponential sharing with
power 16 and σ = √|F|, as introduced in Equation 2.9 on page 115. For every individual,
we sum up all the shares (see line 30). While doing so, we also determine the minimum and
maximum such total share in the variables minShare and maxShare in lines 32 and 33.
We will use these variables to scale all sharing values again into the interval [0, 1] (line
34), so the individual in the most crowded region always has a total share of 1 and the most
remote individual always has a share of 0. So basically, we now know two things about the
individuals in Pop:
1. their Pareto ranks, stored in the array ranks, giving information about their relative
quality according to the objective values and
2. their sharing values, held in shares, denoting how densely crowded the area around
them is.
With this information, we determine the final fitness values of an individual p as follows:
If p is non-prevailed, i. e., its rank is zero, its fitness is its scaled total share (line 38).
Otherwise, we multiply the square root of the maximum rank, √maxRank, with the scaled
share and add it to its rank (line 37). By doing so, we preserve the supremacy of non-prevailed
individuals in the population but allow them to compete with each other based on
the crowdedness of their location in the objective space. All other solution candidates may
degenerate in rank, but at most by the square root of the worst rank.
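The complete procedure can be condensed into a compact Python sketch. It is an illustration under assumptions, not the reference implementation: prevalence is taken to be Pareto dominance for minimization, the objective values are given directly as tuples (the example values from Table 2.2), individuals with infinite objectives are assumed to have been removed beforehand, and all function names are our own.

```python
import math

# Objective vectors (f1, f2) of individuals 1..15, taken from Table 2.2.
POP = [(1, 7), (2, 4), (6, 2), (10, 1), (1, 8), (2, 7), (3, 5), (2, 9),
       (3, 7), (4, 6), (5, 5), (7, 3), (8, 4), (7, 7), (9, 9)]


def dominates(a, b):
    """Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sh_exp(d, sigma, p):
    """Exponential sharing function, assumed form of Equation 2.9."""
    if d <= 0:
        return 1.0
    if d >= sigma:
        return 0.0
    return (math.exp(-p * d / sigma) - math.exp(-p)) / (1.0 - math.exp(-p))


def assign_fitness_variety_preserving(points):
    n, m = len(points), len(points[0])
    # Prevalence ranks (lines 2 to 9): number of individuals prevailing each one.
    ranks = [sum(dominates(q, p) for q in points) for p in points]
    max_rank = max(ranks)
    # Inverse ranges of the objectives (lines 10 to 18).
    scales = []
    for k in range(m):
        lo = min(p[k] for p in points)
        hi = max(p[k] for p in points)
        scales.append(1.0 / (hi - lo) if hi > lo else 1.0)
    # Total share of every individual (lines 19 to 33); no self-share is added.
    sigma = math.sqrt(m)
    shares = [0.0] * n
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum(((a - b) * w) ** 2
                              for a, b, w in zip(points[i], points[j], scales)))
            s = sh_exp(d, sigma, 16)
            shares[i] += s
            shares[j] += s
    min_share, max_share = min(shares), max(shares)
    scale = 1.0 / (max_share - min_share) if max_share > min_share else 1.0
    # Final fitness (lines 34 to 38).
    fitness = []
    for r, s in zip(ranks, shares):
        scaled = scale * (s - min_share)
        fitness.append(r + math.sqrt(max_rank) * scaled if r > 0 else scaled)
    return fitness


v = assign_fitness_variety_preserving(POP)
print([round(x, 3) for x in v])
```

For the example population, the result should agree with the column v(x) of Table 2.2 up to rounding: the remote individual 4 receives fitness 0, and individual 6 in the most crowded region receives 2 + √13 ≈ 5.606.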
Example
Let us now apply Variety Preserving Ranking to the examples for Pareto ranking from
Section 2.3.3. In Table 2.2, we again list all the solution candidates from Figure 2.4 on
page 112, this time with their objective values obtained with f1 and f2 corresponding to
their coordinates in the diagram. In the third column, you can find the Pareto ranks of the
individuals as they have been listed in Table 2.1 on page 113. The columns share/u and share/s
correspond to the total sharing sums of the individuals, unscaled and scaled into [0, 1].
  x   f1   f2   rank   share/u   share/s     v(x)
  1    1    7      0     0.71      0.779    0.779
  2    2    4      0     0.239     0.246    0.246
  3    6    2      0     0.201     0.202    0.202
  4   10    1      0     0.022     0        0
  5    1    8      1     0.622     0.679    3.446
  6    2    7      2     0.906     1        5.606
  7    3    5      1     0.531     0.576    3.077
  8    2    9      4     0.314     0.33     5.191
  9    3    7      4     0.719     0.789    6.845
 10    4    6      2     0.592     0.645    4.325
 11    5    5      2     0.363     0.386    3.39
 12    7    3      1     0.346     0.366    2.321
 13    8    4      3     0.217     0.221    3.797
 14    7    7      9     0.094     0.081    9.292
 15    9    9     13     0.025     0.004    13.01
Table 2.2: An example for Variety Preserving Ranking based on Figure 2.4.

Figure 2.5: The sharing potential in the Variety Preserving Ranking example (share potential
plotted over the objectives f1 and f2, with the 15 individuals and the Pareto frontier marked).
But first things first; as already mentioned, we know the Pareto ranks of the solution
candidates from Table 2.1, so the next step is to determine the ranges of values the objective
functions take on for the example population. These can again easily be found out from
Figure 2.4. f1 spans from 1 to 10, which leads to rangeScales[0] = 1/9. rangeScales[1] = 1/8
since the maximum of f2 is 9 and its minimum is 1. With this, we now can compute the
(dimensionally scaled) distances amongst the solution candidates in the objective space, the
values of √dist in Algorithm 2.4, as well as the corresponding values of the sharing
function Sh_exp_{√|F|,16}(√dist). We have noted these in Table 2.3, using the upper triangle
of the table for the distances and the lower triangle for the shares.
The value of the sharing function can be imagined as a scalar field, as illustrated in
Figure 2.5. In this case, each individual in the population can be considered as an electron
that will build an electrical field around it resulting in a potential. If two electrons come
close, repulsing forces occur, which is pretty much what we want to achieve with Variety
Preserving Ranking. Unlike the electrical field, the power of the sharing potential falls expo-
nentially, resulting in relatively steep spikes in Figure 2.5 which gives proximity and density
a heavier influence. Electrons in atoms on planets are limited in their movement by other
influences like gravity or nuclear forces, which are often stronger than the electromagnetic
force. In Variety Preserving Ranking, the prevalence rank plays this role – as you can see in
Table 2.2, its influence on the fitness is often dominant.
By summing up the single sharing potentials for each individual in the example, we
obtain the fifth column of Table 2.2, the unscaled share values. Their minimum is around
0.022 and the maximum is 0.906. Therefore, we must subtract 0.022 from each of these values
and multiply the result with 1.131. By doing so, we build the column share/s. Finally, we
can compute the fitness values v(x) according to lines 38 and 37 in Algorithm 2.4.
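As a quick plausibility check, the scaling can be retraced by hand with the rounded entries of Table 2.2 (minimum share 0.022 for individual 4, maximum share 0.906 for individual 6, and maxRank = 13). The following lines are only a worked illustration with rounded inputs, so the last digit may deviate slightly from the table, which was presumably computed with unrounded values:

\[
\mathit{scale} = \frac{1}{0.906 - 0.022} \approx 1.131, \qquad
\mathit{share/s}(1) \approx (0.71 - 0.022) \cdot 1.131 \approx 0.778
\]
\[
v(1) = \mathit{share/s}(1) \approx 0.778, \qquad
v(5) \approx 1 + \sqrt{13} \cdot 0.679 \approx 3.448, \qquad
v(6) = 2 + \sqrt{13} \cdot 1 \approx 5.606
\]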

Upper triangle: distances. Lower triangle: corresponding share values.
  x       1       2       3       4       5       6       7       8       9      10      11      12      13      14      15
  1       -   0.391   0.836    1.25   0.125   0.111   0.334   0.274   0.222   0.356    0.51   0.833   0.863   0.667   0.923
  2   0.012       -    0.51   0.965   0.512   0.375   0.167   0.625   0.391   0.334   0.356   0.569   0.667    0.67   0.998
  3  7.7E-5   0.003       -   0.462   0.933   0.767   0.502   0.981   0.708   0.547   0.391   0.167   0.334   0.635   0.936
  4  6.1E-7  1.8E-5   0.005       -   1.329   1.163   0.925   1.338    1.08   0.914   0.747   0.417   0.436   0.821   1.006
  5   0.243   0.003  2.6E-5  1.8E-7       -   0.167   0.436   0.167   0.255   0.417   0.582   0.914   0.925   0.678   0.898
  6   0.284   0.014  1.7E-4  1.8E-6   0.151       -   0.274    0.25   0.111   0.255   0.417   0.747   0.765   0.556   0.817
  7   0.023   0.151   0.003  2.9E-5   0.007   0.045       -   0.512    0.25   0.167   0.222    0.51   0.569    0.51   0.833
  8   0.045   0.001  1.5E-5  1.5E-7   0.151   0.059   0.003       -   0.274   0.436   0.601   0.933   0.914   0.609   0.778
  9   0.081   0.012  3.3E-4  4.8E-6   0.056   0.284   0.059   0.045       -   0.167   0.334   0.669    0.67   0.444   0.712
 10   0.018   0.023   0.002  3.2E-5   0.009   0.056   0.151   0.007   0.151       -   0.167   0.502    0.51   0.356    0.67
 11   0.003   0.018   0.012  2.1E-4   0.001   0.009   0.081   0.001   0.023   0.151       -   0.334   0.356   0.334   0.669
 12    8E-5   0.002   0.151   0.009  3.2E-5  2.1E-4   0.003  2.6E-5   0.001   0.003   0.023       -   0.167     0.5   0.782
 13  5.7E-5   0.001   0.023   0.007  2.9E-5  1.7E-4   0.002  3.2E-5   0.001   0.003   0.018   0.151       -   0.391   0.635
 14   0.001   0.001   0.001  9.3E-5  4.6E-4   0.002   0.003   0.001   0.007   0.018   0.023   0.003   0.012       -   0.334
 15  2.9E-5  1.2E-5  2.5E-5  1.1E-5  3.9E-5  9.7E-5    8E-5  1.5E-4  3.1E-4   0.001   0.001  1.4E-4   0.001   0.023       -
Table 2.3: The distance and sharing matrix of the example from Table 2.2.
The last column of Table 2.2 lists these results. All non-prevailed individuals have re-
tained a fitness value less than one, lower than those of any other solution candidate in
the population. However, amongst these best individuals, solution candidate 4 is strongly
preferred, since it is located in a very remote location of the objective space. Individual
1 is the least interesting non-dominated one, because it has the densest neighborhood in
Figure 2.4. In this neighborhood, the individuals 5 and 6 with the Pareto ranks 1 and 2 are
located. They are strongly penalized by the sharing process and receive the fitness values
v(5) = 3.446 and v(6) = 5.606. In other words, individual 5 becomes less interesting than
solution candidate 7, which has the same Pareto rank. Individual 6 is now even worse than
individual 8, whose fitness would be worse by two if strict Pareto ranking was applied.
Based on these fitness values, algorithms like Tournament selection (see Section 2.4.4) or
fitness proportionate approaches (discussed in Section 2.4.3) will pick elements in a way that
preserves the pressure into the direction of the Pareto frontier but also leads to a balanced
and sustainable variety in the population. The benefits of this approach have been shown,
for instance, in [1650, 2188].
2.3.6 Tournament Fitness Assignment
In tournament fitness assignment, which is a generalization of the q-level binary tournament
selection introduced by Weicker [2167], the fitness of each individual is computed by letting
it compete q times against r other individuals (with r = 1 as default) and counting the
number of competitions it loses. For a better understanding of the tournament metaphor
see Section 2.4.4 on page 127, where the tournament selection scheme is discussed. Anyway,
the number of losses will approximate its Pareto rank, but is a bit more randomized than
that. If we counted the number of tournaments won instead of the losses, we would
encounter the same problems as in the first idea of Pareto ranking.
TODO add remaining fitness
assignment methods

Algorithm 2.5: v ←− assignFitnessTournament_{q,r}(Pop, cmpF)
Input: q: the number of tournaments per individual
Input: r: the number of other contestants per tournament, normally 1
Input: Pop: the population to assign fitness values to
Input: cmpF: the comparator function providing the prevalence relation ≻
Data: i, j, k, z: counter variables
Data: b: a Boolean variable being true as long as a tournament isn’t lost
Data: p: the individual currently examined
Output: v: the fitness function
begin
  for i ←− len(Pop) − 1 down to 0 do
    z ←− q
    p ←− Pop[i]
    for j ←− q down to 1 do
      b ←− true
      k ←− r
      while (k > 0) ∧ b do
        b ←− ¬(Pop[⌊random_u(0, len(Pop))⌋].x ≻ p.x)
        k ←− k − 1
      if b then z ←− z − 1
    v(p.x) ←− z
  return v
end
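The following Python sketch illustrates the idea of tournament fitness assignment under a few assumptions: the population is a plain list of objective vectors, the prevalence relation is ordinary Pareto dominance for minimization (the helper `dominates` is a hypothetical stand-in for cmpF), and the fitness is returned as a list parallel to the population. It is meant as an illustration of the loss-counting scheme, not as the reference implementation.

```python
import random

def assign_fitness_tournament(pop, prevails, q=10, r=1, rng=random):
    """Fitness of each individual = number of lost tournaments (lower is better)."""
    fitness = []
    for p in pop:
        losses = 0
        for _ in range(q):
            # p loses a tournament as soon as one of up to r random contestants
            # prevails it; any() stops drawing contestants after the first loss.
            if any(prevails(rng.choice(pop), p) for _ in range(r)):
                losses += 1
        fitness.append(losses)
    return fitness

def dominates(a, b):
    """Pareto dominance for minimization (hypothetical stand-in for cmpF)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Three non-dominated points, one point dominated once, one dominated by all others.
pop = [(1, 5), (2, 2), (5, 1), (4, 4), (6, 6)]
fitness = assign_fitness_tournament(pop, dominates, q=50, rng=random.Random(1))
```

Non-dominated individuals can never lose and therefore always receive the fitness 0, while the number of losses of the other individuals grows with the share of the population that prevails them.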
2.4 Selection
2.4.1 Introduction
Definition 2.8 (Selection). In evolutionary algorithms, the selection20 operation Mate =
select(Pop, v, ms) chooses ms individuals according to their fitness values v from the popu-
lation Pop and places them into the mating pool Mate [99, 1242, 232, 1431].
Mate = select(Pop, v, ms) ⇒ ∀p ∈ Mate ⇒ p ∈ Pop
∀p ∈ Pop ⇒ p ∈ G × X
v(p.x) ∈ R+ ∀p ∈ Pop
(len(Mate) ≥ min{len(Pop),ms}) ∧ (len(Mate) ≤ ms)
(2.12)
On the mating pool, the reproduction operations discussed in Section 2.5 on page 137
will subsequently be applied. Selection may behave in a deterministic or in a randomized
manner, depending on the algorithm chosen and its application-dependent implementation.
Furthermore, elitist evolutionary algorithms may incorporate an archive Arc in the selection
process, as sketched in Algorithm 2.2.
Generally, there are two classes of selection algorithms: those with replacement (annotated
with a subscript r) and those without replacement (annotated with a subscript w, see
Equation 2.13) [1809]. In a selection algorithm without replacement, each individual from
the population Pop is taken into consideration for reproduction at most once and therefore
20 http://en.wikipedia.org/wiki/Selection_%28genetic_algorithm%29 [accessed 2007-07-03]

also will occur in the mating pool Mate one time at most. The mating pool returned by
algorithms with replacement can contain the same individual multiple times. Like in nature,
one individual may thus have multiple offspring. Normally, selection algorithms are used in
a variant with replacement. One reason for this is the number of elements to be
placed into the mating pool (corresponding to the parameter ms). If len(Pop) < ms, the
mating pool returned by a method without replacement contains fewer than ms individuals
since it can at most consist of the whole population.
Mate = selectw(Pop, v, ms) ⇒ countOccurences(p,Mate) = 1 ∀p ∈ Mate
(2.13)
The selection algorithms have a major impact on the performance of evolutionary algorithms.
Their behavior has thus been subject to several detailed studies, conducted by, for
instance, Goldberg and Deb [823], Blickle and Thiele [232], and Zhong et al. [2318], just to
name a few.
Usually, fitness assignment processes are carried out before selection and the selection
algorithms base their decisions solely on the fitness v of the individuals. It is possible to rely
on the prevalence relation, i. e., to write select(Pop, cmpF , ms) instead of select(Pop, v, ms),
thus saving the costs of the fitness assignment process. However, this will lead to the same
problems that occurred in the first approach of prevalence-proportional fitness assignment
(see Section 2.3.3 on page 112) and we will therefore not discuss such techniques in this
book.
Many selection algorithms only work with scalar fitness and thus need to rely on a fitness
assignment process in multi-objective optimization. Selection algorithms can be chained –
the resulting mating pool of the first selection may then be used as input for the next one,
maybe even with a secondary fitness assignment process in between. In some applications,
an environmental selection that reduces the number of individuals is performed first and
then a mating selection follows which extracts the individuals which should be used for
reproduction.
Visualization
In the following sections, we will discuss multiple selection algorithms. In order to ease
understanding them, we will visualize the expected number of times S(p) that an individual
p will reach the mating pool Mate for some of the algorithms.
S(p) = E[countOccurences(p, Mate)]
(2.14)
To do so, we use the special case where we have a population Pop of len(Pop) = 1000
individuals, p0..p999, and a target mating pool size of ms = 1000. Each individual pi has
the fitness value v(pi.x), and fitness is subject to minimization. For this fitness, we consider
two cases:
1. As sketched in Fig. 2.6.a, the individual pi has fitness i, i. e., v1(p0.x) = 0, v1(p1.x) =
1, . . . , v1(p999.x) = 999.
2. Individual pi has fitness (i + 1)^3, i. e., v2(p0.x) = 1, v2(p1.x) = 8, . . . , v2(p999.x) =
1 000 000 000, as illustrated in Fig. 2.6.b.
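The two example fitness cases are easy to write down explicitly. The short Python sketch below builds them as lists (the names `v1` and `v2` mirror the two cases, the list representation is an assumption for illustration) and also computes the fitness sums that matter for fitness proportionate schemes.

```python
N = 1000  # population size len(Pop) of the example

# Case 1: individual p_i has fitness i.
v1 = [i for i in range(N)]

# Case 2: individual p_i has fitness (i + 1)^3.
v2 = [(i + 1) ** 3 for i in range(N)]

# Total fitness of each case.
sum_v1 = sum(v1)
sum_v2 = sum(v2)
```

Both cases order the individuals identically; only the numerical spacing of the fitness values differs.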
2.4.2 Truncation Selection
Truncation selection21, also called deterministic selection or threshold selection, returns the
k ≤ ms best elements from the list Pop. These elements are copied as often as needed
until the mating pool size ms is reached. For k, values like len(Pop)/2 or len(Pop)/3 are
normally used. Algorithm 2.6 realizes this scheme by first sorting the population in ascending
order according to the fitness v. Then, it iterates from 0 to ms − 1 and inserts only the
elements with indices from 0 to k − 1 into the mating pool.
21 http://en.wikipedia.org/wiki/Truncation_selection [accessed 2007-07-03]
Figure 2.6: The two example fitness cases. [Two plots of the fitness over the index i. Fig. 2.6.a: Case 1: v1(pi.x) = i. Fig. 2.6.b: Case 2: v2(pi.x) = (i + 1)^3.]
Algorithm 2.6: Mate ←− truncationSelect_k(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: k: cut-off value
Data: i: a counter variable
Output: Mate: the mating pool
begin
  Mate ←− ()
  k ←− min {k, len(Pop)}
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to ms − 1 do
    Mate ←− addListItem(Mate, Pop[i mod k])
  return Mate
end
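Truncation selection can be sketched in a few lines of Python. The sketch assumes that the population and the fitness values are given as two parallel lists and that fitness is subject to minimization; the function name `truncation_select` is chosen for illustration only.

```python
def truncation_select(pop, fitness, ms, k):
    """Fill a mating pool of size ms by cycling through the k best individuals."""
    k = min(k, len(pop))
    # Indices of the individuals in ascending order of fitness (best first).
    order = sorted(range(len(pop)), key=lambda i: fitness[i])
    return [pop[order[i % k]] for i in range(ms)]

# The three best of six individuals each reach the mating pool twice.
mate = truncation_select(list("abcdef"), [5, 3, 0, 4, 1, 2], ms=6, k=3)
```

Because only the order of the fitness values is used, any strictly monotonic transformation of the fitness leaves the resulting mating pool unchanged.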
Truncation selection is usually used in Evolution Strategies with (µ+λ) and (µ, λ) strate-
gies. In general evolutionary algorithms, it should be combined with a fitness assignment
process that incorporates diversity information in order to prevent premature convergence.
Recently, Lässig et al. [1260] have proved that truncation selection is the optimal selection
strategy for crossover, provided that the right value of k is used. In practical applications,
this value is normally not known.
In Figure 2.7, we sketch the expected number of offspring for the individuals from our
examples specified in Section 2.4.1. In this selection scheme, the diagram looks exactly
the same regardless of whether we use fitness configuration 1 or 2, since it is solely based
on the order of the individuals and not on the numerical relation of their fitness. If we set
k = ms = len(Pop), each individual will have one offspring on average. If k = ms/2, the
top 50% of the individuals will have two offspring and the others none. For k = ms/10, only
the best 100 of the 1000 solution candidates will reach the mating pool, but they reproduce
10 times on average.

Figure 2.7: The number of expected offspring in truncation selection. [Plot of S(pi) over the index i, with additional axes for v1(pi.x) and v2(pi.x), for k = 0.100, 0.125, 0.250, 0.500, and 1.000 ∗ |Pop|.]
2.4.3 Fitness Proportionate Selection
Fitness proportionate selection22 has already been applied in the original genetic algorithms
as introduced by Holland [940] and therefore is one of the oldest selection schemes. In fitness
proportionate selection, the probability P(p1) of an individual p1 ∈ Pop to enter the mating
pool is proportional to its fitness v(p1.x) (subject to maximization) compared to the sum of
the fitness of all individuals. This relation in its original form is defined in Equation 2.15
below.
$$P(p_1) = \frac{v(p_1.x)}{\sum_{\forall p_2 \in Pop} v(p_2.x)} \tag{2.15}$$
There exists a variety of approaches which realize such probability distributions [823],
like stochastic remainder selection (Brindle [289], Booker [248]) and stochastic universal
selection (Baker [121], Grefenstette and Baker [858]). The most commonly known method is
the Monte Carlo roulette wheel selection by De Jong [512], where we imagine the individuals
of a population to be placed on a roulette23 wheel as sketched in Fig. 2.8.a. The size of
the area on the wheel standing for a solution candidate is proportional to its fitness. The
wheel is spun, and the individual where it stops is placed into the mating pool Mate. This
procedure is repeated until ms individuals have been selected.
In the context of this book, fitness is subject to minimization. Here, higher fitness values
v(p.x) indicate unfit solution candidates p.x whereas lower fitness denotes high utility.
Furthermore, the fitness values are normalized into a range of [0, sum], because otherwise,
fitness proportionate selection would handle the set of fitness values {0,1,2} differently from
{10,11,12}. Equation 2.19 defines the framework for such a (normalized) fitness proportionate
selection “rouletteWheelSelect”. It is illustrated by example in Fig. 2.8.b and realized
in Algorithm 2.7 as a variant with and in Algorithm 2.8 as a variant without replacement.
Amongst others, Whitley [2211] points out that even fitness normalization as performed here
cannot overcome the drawbacks of fitness proportional selection methods.
22 http://en.wikipedia.org/wiki/Fitness_proportionate_selection [accessed 2008-03-19]
23 http://en.wikipedia.org/wiki/Roulette [accessed 2008-03-20]
Figure 2.8: Examples for the idea of roulette wheel selection. [Two roulette wheels for four solution candidates with f(x1) = 10, f(x2) = 20, f(x3) = 30, and f(x4) = 40. Fig. 2.8.a: Example for fitness maximization. Fig. 2.8.b: Example for normalized fitness minimization.]
$$minV = \min \{ v(p.x) \; \forall p \in Pop \} \tag{2.16}$$
$$maxV = \max \{ v(p.x) \; \forall p \in Pop \} \tag{2.17}$$
$$normV(p.x) = \frac{maxV - v(p.x)}{maxV - minV} \tag{2.18}$$
$$P(p_1) = \frac{normV(p_1.x)}{\sum_{\forall p_2 \in Pop} normV(p_2.x)} \tag{2.19}$$
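The normalization of Equations 2.16 to 2.19 can be checked with a small Python sketch. It assumes the fitness values are given as a plain list; the function name `roulette_probabilities` is hypothetical, and the handling of a population with identical fitness values follows the convention of widening the interval by one in both directions.

```python
def roulette_probabilities(fitness):
    """Selection probabilities for normalized fitness minimization."""
    lo, hi = min(fitness), max(fitness)
    if hi == lo:
        # All individuals are equal: widen the interval to avoid dividing by zero.
        hi, lo = hi + 1, lo - 1
    norm = [(hi - f) / (hi - lo) for f in fitness]   # normV, best individual gets 1
    total = sum(norm)
    return [n / total for n in norm]

# Shifting all fitness values by a constant does not change the probabilities.
p_a = roulette_probabilities([0, 1, 2])
p_b = roulette_probabilities([10, 11, 12])

# The four candidates with f = 10, 20, 30, 40 receive 1/2, 1/3, 1/6, and 0.
p_c = roulette_probabilities([10, 20, 30, 40])
```

Note that the worst individual always receives the probability zero under this normalization, unless all fitness values are identical.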
But what are the drawbacks of fitness proportionate selection methods? To answer this
question, let us visualize the expected results of roulette wheel selection applied to the
special cases stated in Section 2.4.1. Figure 2.9 illustrates the expected number of
occurrences S(pi) of an individual pi if roulette wheel selection is applied. Since ms = 1000,
we draw a single individual from the population Pop one thousand times. Each single choice
is based on the proportion of the individual fitness in the total fitness of all individuals, as
defined in Equation 2.15 and Equation 2.19. Thus, in scenario 1 with the fitness sum
999 ∗ 1000/2 = 499500, the relation S(pi) = ms ∗ i/499500 holds for fitness maximization and
S(pi) = ms ∗ (999 − i)/499500 for minimization. As a result (sketched in Fig. 2.9.a), the
fittest individuals produce (on average) two offspring, whereas the worst solution candidates
will always vanish in this example. For the 2nd scenario with v2(pi.x) = (i + 1)^3, the total
fitness sum is approximately 2.51 · 10^11 and S(pi) = ms ∗ (i + 1)^3/(2.51 · 10^11) holds for
maximization. The resulting expected values depicted in Fig. 2.9.b are significantly different
from those in Fig. 2.9.a. This means that the design of the objective functions (or the fitness
assignment process) has a much stronger influence on the convergence behavior of the
evolutionary algorithm than it has with selection schemes that only rely on the order of the
individuals. This selection method only works well if the fitness of an individual is indeed
something like a proportional measure for the probability that it will produce better offspring.
Thus, roulette wheel selection performs poorly compared to other schemes like
tournament selection [823, 231] or ranking selection [823, 232]. It is mainly included here for
the sake of completeness and because it is easy to understand and suitable for educational
purposes.
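The expected numbers of offspring for the two example scenarios can be computed directly from Equations 2.15 and 2.19. The Python sketch below does this under the assumptions of Section 2.4.1 (1000 individuals, ms = 1000); the helper names `expected_offspring` and `normalize` are hypothetical.

```python
def expected_offspring(weights, ms):
    """S(p_i) = ms * weight_i / sum of all weights."""
    total = sum(weights)
    return [ms * w / total for w in weights]

def normalize(fitness):
    """normV of Equation 2.18 for fitness minimization."""
    lo, hi = min(fitness), max(fitness)
    return [(hi - f) / (hi - lo) for f in fitness]

ms = 1000
v1 = list(range(1000))                    # scenario 1: v1(p_i.x) = i
v2 = [(i + 1) ** 3 for i in range(1000)]  # scenario 2: v2(p_i.x) = (i + 1)^3

s1_max = expected_offspring(v1, ms)             # maximization, raw fitness
s1_min = expected_offspring(normalize(v1), ms)  # minimization, normalized fitness
s2_max = expected_offspring(v2, ms)
s2_min = expected_offspring(normalize(v2), ms)
```

In scenario 1, the best individual has two expected offspring in both directions, whereas in scenario 2 the best individual under maximization has almost four and under minimization only about 1.33. The same ranking of individuals thus leads to clearly different selection pressure.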

Algorithm 2.7: Mate ←− rouletteWheelSelect_r(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Data: i: a counter variable
Data: a: a temporary store for a numerical value
Data: A: the array of fitness values
Data: min, max, sum: the minimum, maximum, and sum of the fitness values
Output: Mate: the mating pool
begin
  Mate ←− ()
  A ←− createList(len(Pop), 0)
  min ←− ∞
  max ←− −∞
  for i ←− 0 up to len(Pop) − 1 do
    a ←− v(Pop[i].x)
    A[i] ←− a
    if a < min then min ←− a
    if a > max then max ←− a
  if max = min then
    max ←− max + 1
    min ←− min − 1
  sum ←− 0
  for i ←− 0 up to len(Pop) − 1 do
    sum ←− sum + (max − A[i])/(max − min)
    A[i] ←− sum
  for i ←− 0 up to ms − 1 do
    a ←− searchItem_as(random_u(0, sum), A)
    if a < 0 then a ←− −a − 1
    Mate ←− addListItem(Mate, Pop[a])
  return Mate
end
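A compact Python sketch of roulette wheel selection with replacement is given below. It assumes parallel lists for the population and the fitness values, builds the array of cumulative normalized fitness values with `itertools.accumulate`, and replaces the binary search of the pseudocode with `bisect.bisect_right` from the standard library. The function name `roulette_wheel_select` is chosen for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_wheel_select(pop, fitness, ms, rng=random):
    """Normalized fitness proportionate selection (minimization), with replacement."""
    lo, hi = min(fitness), max(fitness)
    if hi == lo:
        hi, lo = hi + 1, lo - 1
    # cumulative[i] is the right border of the wheel segment of individual i.
    cumulative = list(accumulate((hi - f) / (hi - lo) for f in fitness))
    total = cumulative[-1]
    mate = []
    for _ in range(ms):
        spin = rng.random() * total
        idx = bisect_right(cumulative, spin)       # first segment whose border exceeds spin
        mate.append(pop[min(idx, len(pop) - 1)])   # guard against rounding at the border
    return mate

mate = roulette_wheel_select(["a", "b", "c"], [0, 1, 2], ms=1000, rng=random.Random(42))
```

In this example, "a" owns two thirds of the wheel, "b" one third, and the worst individual "c" owns a segment of width zero and is therefore never selected.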
Figure 2.9: The number of expected offspring in roulette wheel selection. [Two plots of S(pi) over the index i, each with one curve for fitness maximization and one for fitness minimization. Fig. 2.9.a: v1(pi.x) = i. Fig. 2.9.b: v2(pi.x) = (i + 1)^3.]
Algorithm 2.8: Mate ←− rouletteWheelSelect_w(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Data: i, j: counter variables
Data: a, b: temporary stores for numerical values
Data: A: the array of fitness values
Data: min, max, sum: the minimum, maximum, and sum of the fitness values
Output: Mate: the mating pool
begin
  Mate ←− ()
  A ←− createList(len(Pop), 0)
  min ←− ∞
  max ←− −∞
  for i ←− 0 up to len(Pop) − 1 do
    a ←− v(Pop[i].x)
    A[i] ←− a
    if a < min then min ←− a
    if a > max then max ←− a
  if max = min then
    max ←− max + 1
    min ←− min − 1
  sum ←− 0
  for i ←− 0 up to len(Pop) − 1 do
    sum ←− sum + (max − A[i])/(max − min)
    A[i] ←− sum
  for i ←− 0 up to min {ms, len(Pop)} − 1 do
    a ←− searchItem_as(random_u(0, sum), A)
    if a < 0 then a ←− −a − 1
    if a = 0 then b ←− 0
    else b ←− A[a − 1]
    b ←− A[a] − b
    for j ←− a + 1 up to len(A) − 1 do
      A[j] ←− A[j] − b
    sum ←− sum − b
    Mate ←− addListItem(Mate, Pop[a])
    Pop ←− deleteListItem(Pop, a)
    A ←− deleteListItem(A, a)
  return Mate
end
2.4.4 Tournament Selection
Tournament selection24, proposed by Wetzel [2198] and studied by Brindle [289], is one
of the most popular and effective selection schemes. Its features are well-known and have
been analyzed by a variety of researchers such as Blickle and Thiele [231, 232], Miller and
Goldberg [1416], Lee et al. [1269], Sastry and Goldberg [1809], and Oei et al. [1558]. In
tournament selection, k elements are picked from the population Pop and compared with
each other in a tournament. The winner of this competition then enters the mating pool
Mate. Although it is a simple selection strategy, it is very powerful and therefore used in
many practical applications [55, 316, 1403, 46].
24 http://en.wikipedia.org/wiki/Tournament_selection [accessed 2007-07-03]
As an example, consider a tournament selection (with replacement) with a tournament size
of two [2208]. For each single tournament, the contestants are chosen randomly according to
a uniform distribution and the winners are allowed to enter the mating pool. If we assume
that the mating pool will contain about as many individuals as the population, each
individual will, on average, participate in two tournaments. The best solution candidate of
the population will win all the contests it takes part in and thus, again on average, contributes
approximately two copies to the mating pool. The median individual of the population is
better than 50% of its challengers but will also lose against 50%. Therefore, it will enter the
mating pool roughly one time on average. The worst individual in the population will lose
all its challenges to other solution candidates and can only score if it competes against
itself, which happens with probability (1/ms)^2 per tournament. It will not be able to
reproduce in the average case because ms ∗ (1/ms)^2 = 1/ms < 1 ∀ms > 1.
For visualization purposes, let us go back to our examples from Section 2.4.1 with a
population of 1000 individuals p0..p999 and ms = 1000. Again, we assume that each
individual has a unique fitness value of v1(pi.x) = i or v2(pi.x) = (i + 1)^3, respectively. If we
apply tournament selection with replacement in this special scenario, the expected number
of occurrences S(pi) of an individual pi in the mating pool can be computed according to
Blickle and Thiele [232] as
$$S(p_i) = ms \ast \left[ \left( \frac{1000 - i}{1000} \right)^k - \left( \frac{1000 - i - 1}{1000} \right)^k \right] \tag{2.20}$$
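Equation 2.20 is easy to evaluate numerically. The following Python sketch (with the hypothetical function name `expected_offspring_tournament` and the population size n as an explicit parameter) reproduces the numbers from the example with tournament size two: about two copies for the best individual, about one for the median, and 1/ms for the worst.

```python
def expected_offspring_tournament(i, k, n=1000, ms=1000):
    """Expected number of copies of the individual with rank i (0 = best)."""
    return ms * (((n - i) / n) ** k - ((n - i - 1) / n) ** k)

best = expected_offspring_tournament(0, k=2)      # roughly 2
median = expected_offspring_tournament(500, k=2)  # roughly 1
worst = expected_offspring_tournament(999, k=2)   # 1/ms

# The expected values of all individuals add up to the mating pool size.
total = sum(expected_offspring_tournament(i, k=5) for i in range(1000))
```

The sum over all ranks telescopes to ms for every tournament size k, which is a convenient sanity check for the formula.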
Figure 2.10: The number of expected offspring in tournament selection. [Plot of S(pi) over the index i, with additional axes for v1(pi.x) and v2(pi.x), for the tournament sizes k = 1, 2, 3, 4, 5, and 10.]
The absolute values of the fitness play no role. The only thing that matters is whether
or not the fitness of one individual is higher than the fitness of another one, not the fitness
difference itself. The expected numbers of offspring for the two example cases 1 and 2 from
Section 2.4.1 are therefore the same. Tournament selection thus gets rid of the problems of
fitness proportionate methods. Figure 2.10 depicts these numbers for different tournament
sizes k ∈ {1,2,3,4,5,10}. If k = 1, tournament selection degenerates to randomly picking
individuals and each solution candidate will occur one time in the mating pool on average.
With rising k, the selection pressure increases: individuals with good fitness values create
more and more offspring whereas the chance of worse solution candidates to reproduce
decreases.
Tournament selection with replacement (TSR) is presented in Algorithm 2.9. Tournament
selection without replacement (TSoR) [1269, 18] can be defined in two forms. In the first
variant, specified as Algorithm 2.10, an individual may enter the mating pool at most once.
In Algorithm 2.11, on the other hand, a solution candidate cannot compete against itself.
Algorithm 2.9: Mate ←− tournamentSelect_{r,k}(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] k: the tournament size
Data: a: the index of the tournament winner
Data: i, j: counter variables
Output: Mate: the winners of the tournaments which now form the mating pool
begin
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to ms − 1 do
    a ←− ⌊random_u(0, len(Pop))⌋
    for j ←− 1 up to k − 1 do
      a ←− min {a, ⌊random_u(0, len(Pop))⌋}
    Mate ←− addListItem(Mate, Pop[a])
  return Mate
end
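Tournament selection with replacement can be sketched in Python as follows. Instead of sorting the population first and taking the smallest index, this sketch compares the fitness values of the contestants directly, which yields the same winners; parallel lists for population and fitness, minimization, and the function name `tournament_select` are assumptions made for illustration.

```python
import random

def tournament_select(pop, fitness, ms, k=2, rng=random):
    """Deterministic tournament selection with replacement (fitness minimization)."""
    mate = []
    for _ in range(ms):
        # Draw k contestants uniformly at random; the same index may occur twice.
        contestants = [rng.randrange(len(pop)) for _ in range(k)]
        winner = min(contestants, key=lambda i: fitness[i])
        mate.append(pop[winner])
    return mate

pop = list(range(1000))
v1 = [i for i in pop]             # example case 1
v2 = [(i + 1) ** 3 for i in pop]  # example case 2

mate1 = tournament_select(pop, v1, ms=1000, k=2, rng=random.Random(7))
mate2 = tournament_select(pop, v2, ms=1000, k=2, rng=random.Random(7))
```

With identical random numbers, both fitness cases produce exactly the same mating pool, since only the order of the fitness values enters the comparison.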
Algorithm 2.10: Mate ←− tournamentSelect_{w1,k}(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] k: the tournament size
Data: a: the index of the tournament winner
Data: i, j: counter variables
Output: Mate: the winners of the tournaments which now form the mating pool
begin
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to min {len(Pop), ms} − 1 do
    a ←− ⌊random_u(0, len(Pop))⌋
    for j ←− 1 up to min {len(Pop), k} − 1 do
      a ←− min {a, ⌊random_u(0, len(Pop))⌋}
    Mate ←− addListItem(Mate, Pop[a])
    Pop ←− deleteListItem(Pop, a)
  return Mate
end
Algorithm 2.11: Mate ←− tournamentSelect_{w2,k}(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] k: the tournament size
Data: A: the list of contestants per tournament
Data: a: the tournament winner
Data: i, j: counter variables
Output: Mate: the winners of the tournaments which now form the mating pool
begin
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to ms − 1 do
    A ←− ()
    for j ←− 1 up to min {k, len(Pop)} do
      repeat
        a ←− ⌊random_u(0, len(Pop))⌋
      until searchItem_u(a, A) < 0
      A ←− addListItem(A, a)
    a ←− min A
    Mate ←− addListItem(Mate, Pop[a])
  return Mate
end
The algorithms specified here should more precisely be called deterministic tournament
selection algorithms, since the winner of the k contestants that take part in each
tournament always enters the mating pool. In the non-deterministic variant, this is not
necessarily the case. There, a probability p is defined. The best individual in the tournament
is selected with probability p, the second best with probability p(1 − p), the third best with
probability p(1 − p)^2, and so on. Counting the best individual as i = 0, the ith best
individual in a tournament enters the mating pool with probability p(1 − p)^i.
Algorithm 2.12 realizes this behavior for a tournament selection with replacement. Notice
that it becomes equivalent to Algorithm 2.9 if p is set to 1. Besides the algorithms discussed
here, a set of additional tournament-based selection methods has been introduced by Lee
et al. [1269].

Algorithm 2.12: Mate ←− tournamentSelect^p_{r,k}(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] p: the selection probability, p ∈ [0, 1]
Input: [implicit] k: the tournament size
Data: A: the set of tournament contestants
Data: i, j: counter variables
Output: Mate: the winners of the tournaments which now form the mating pool
begin
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to ms − 1 do
    A ←− ()
    for j ←− 0 up to k − 1 do
      A ←− addListItem(A, ⌊random_u(0, len(Pop))⌋)
    A ←− sortList_a(A, cmp(a1, a2) ≡ (a1 − a2))
    for j ←− 0 up to len(A) − 1 do
      if (random_u() ≤ p) ∨ (j ≥ len(A) − 1) then
        Mate ←− addListItem(Mate, Pop[A[j]])
        j ←− ∞
  return Mate
end
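The non-deterministic variant can be sketched in Python as shown below. As in the pseudocode, the last contestant is accepted unconditionally so that every tournament produces a winner. Parallel lists, fitness minimization, and the function name `tournament_select_nondet` are assumptions of this sketch.

```python
import random

def tournament_select_nondet(pop, fitness, ms, k=2, p=0.8, rng=random):
    """Non-deterministic tournament selection with replacement."""
    mate = []
    for _ in range(ms):
        # Contestants ordered from best to worst fitness.
        contestants = sorted((rng.randrange(len(pop)) for _ in range(k)),
                             key=lambda i: fitness[i])
        for rank, idx in enumerate(contestants):
            # Accept the current contestant with probability p;
            # the last one is always accepted.
            if rng.random() <= p or rank == k - 1:
                mate.append(pop[idx])
                break
    return mate

pop = [0, 1, 2, 3, 4]
fitness = [0.0, 1.0, 2.0, 3.0, 4.0]
# With very large tournaments, p = 1 always returns the best individual
# and p = 0 always falls through to the worst contestant.
always_best = tournament_select_nondet(pop, fitness, ms=20, k=200, p=1.0, rng=random.Random(3))
always_worst = tournament_select_nondet(pop, fitness, ms=20, k=200, p=0.0, rng=random.Random(3))
```

Values of p between these extremes lower the selection pressure compared to the deterministic scheme without changing the tournament size.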
2.4.5 Ordered Selection
Ordered selection is another approach for circumventing the problems of fitness proportion-
ate selection methods. Here, the probability of an individual to be selected is proportional
to (a power of) its position (rank) in the sorted list of all individuals in the population.
The implicit parameter k ∈ R+ of the ordered selection algorithm determines the selection
pressure. It equals the number of expected offspring of the best individual and is thus
very similar to the parameter k of tournament selection. The bigger k gets, the higher is
the probability that individuals which are non-prevailed, i. e., have good objective values,
will be selected.
Algorithm 2.13 demonstrates how ordered selection with replacement works and the
variant without replacement is described in Algorithm 2.14. Basically, it first converts the
parameter k to a power q to which the uniformly drawn random numbers are raised that
are used for indexing the sorted individual list. This can be achieved with Equation 2.21.
$$q = \frac{1}{1 - \frac{\log k}{\log ms}} \tag{2.21}$$
Figure 2.11 illustrates the expected offspring in the application of ordered selection with
k ∈ {1,2,3,4,5}. As in tournament selection, a value of k = 1 degenerates the evolutionary
algorithm to a parallel random walk. Another close similarity to tournament selection
occurs when comparing the exact formulas computing the expected offspring for our
examples. Since the index ⌊random_u()^q ∗ len(Pop)⌋ is at most i with probability
((i + 1)/1000)^(1/q), we get
$$S(p_i) = ms \ast \left[ \left( \frac{i + 1}{1000} \right)^{1/q} - \left( \frac{i}{1000} \right)^{1/q} \right] \tag{2.22}$$
For the best individual p0, this yields S(p0) = ms^(1 − 1/q) = k.
Equation 2.22 looks pretty much like Equation 2.20. The differences between the two
selection methods become obvious when comparing the diagrams Figure 2.11 and Figure 2.10,
which both are independent of the actual fitness values. Tournament selection creates many
copies of the better fraction of the population and almost none of the others. Ordered
selection focuses on an even smaller group of the fittest individuals, but even the worst
solution candidates still have a survival probability not too far from one. In other words,
while tournament selection reproduces a larger group of good individuals and kills most of
the others, ordered selection assigns very high fertility to very few individuals but also
preserves the less fit ones.
Algorithm 2.13: Mate ←− orderedSelect^p_r(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] k: the parameter of the ordering selection
Data: q: the power value to be used for ordering
Data: i: a counter variable
Output: Mate: the mating pool
begin
  q ←− 1/(1 − log k/log ms)
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to ms − 1 do
    Mate ←− addListItem(Mate, Pop[⌊random_u()^q ∗ len(Pop)⌋])
  return Mate
end
Algorithm 2.14: Mate ←− orderedSelect^p_w(Pop, v, ms)
Input: Pop: the list of individuals to select from
Input: v: the fitness values
Input: ms: the number of individuals to be placed into the mating pool Mate
Input: [implicit] k: the parameter of the ordering selection
Data: q: the power value to be used for ordering
Data: i, j: counter variables
Output: Mate: the mating pool
begin
  q ←− 1/(1 − log k/log ms)
  Mate ←− ()
  Pop ←− sortList_a(Pop, v)
  for i ←− 0 up to min {ms, len(Pop)} − 1 do
    j ←− ⌊random_u()^q ∗ len(Pop)⌋
    Mate ←− addListItem(Mate, Pop[j])
    Pop ←− deleteListItem(Pop, j)
  return Mate
end
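The following Python sketch of ordered selection with replacement assumes parallel lists for population and fitness, fitness minimization, and 1 ≤ k < ms with ms > 1 so that the power q of Equation 2.21 is well defined. The function names are hypothetical. Because a uniformly distributed random number raised to the power q falls below (i + 1)/len(Pop) with probability ((i + 1)/len(Pop))^(1/q), the expected offspring computed here use the exponent 1/q.

```python
import math
import random

def ordered_power(k, ms):
    """Power q of Equation 2.21 (assumes ms > 1 and 1 <= k < ms)."""
    return 1.0 / (1.0 - math.log(k) / math.log(ms))

def ordered_select(pop, fitness, ms, k, rng=random):
    """Ordered selection with replacement for fitness minimization."""
    q = ordered_power(k, ms)
    ranked = sorted(range(len(pop)), key=lambda i: fitness[i])  # best first
    mate = []
    for _ in range(ms):
        # u ** q with q >= 1 shifts the uniform number u toward 0, i.e. toward the best ranks.
        idx = min(int(rng.random() ** q * len(pop)), len(pop) - 1)
        mate.append(pop[ranked[idx]])
    return mate

def expected_offspring_ordered(i, k, n=1000, ms=1000):
    """Expected copies of the individual with rank i under the indexing rule above."""
    e = 1.0 / ordered_power(k, ms)
    return ms * (((i + 1) / n) ** e - (i / n) ** e)

pop = list(range(1000))
mate = ordered_select(pop, pop, ms=1000, k=5, rng=random.Random(11))
```

For the best individual, `expected_offspring_ordered(0, k)` evaluates to k, which matches the interpretation of k as the expected number of offspring of the best individual.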

Figure 2.11: The number of expected offspring in ordered selection. [Plot of S(pi) over the index i, with additional axes for v1(pi.x) and v2(pi.x), for k = 1 (q = 1), k = 2 (q = 1.112), k = 3 (q = 1.189), k = 4 (q = 1.251), and k = 5 (q = 1.304).]
2.4.6 Ranking Selection
Ranking selection, introduced by Baker [120] and more thoroughly discussed by Whitley
[2211], Blickle and Thiele [232, 230], and Goldberg and Deb [823], is another approach for
circumventing the problems of fitness proportionate selection methods. In ranking selection
[120, 2211, 858], the probability of an individual to be selected is proportional to its position
(rank) in the sorted list of all individuals in the population. Using the rank smooths out
larger differences of the objective values and emphasizes small ones. Generally, we can regard
the conventional ranking selection method as the application of a fitness assignment process
setting the rank as fitness (which can be achieved with Pareto ranking) and a subsequent
fitness proportional selection.
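A minimal Python sketch of this two-step view is given below. It assumes fitness minimization and uses one simple linear weighting, in which the best of n individuals receives the weight n and the worst the weight 1; this particular weighting is an assumption for illustration and not the only possible choice. The weighted draw uses `random.choices` from the standard library.

```python
import random

def ranking_select(pop, fitness, ms, rng=random):
    """Rank-based selection with replacement: rank as fitness, then a proportional draw."""
    n = len(pop)
    order = sorted(range(n), key=lambda i: fitness[i])  # best individual first
    weights = [n - rank for rank in range(n)]            # linear in the rank (assumed)
    chosen = rng.choices(order, weights=weights, k=ms)
    return [pop[i] for i in chosen]

pop = list(range(1000))
v1 = [i for i in pop]
v2 = [(i + 1) ** 3 for i in pop]
mate1 = ranking_select(pop, v1, 1000, rng=random.Random(5))
mate2 = ranking_select(pop, v2, 1000, rng=random.Random(5))
```

As with tournament selection, both example fitness cases lead to the same mating pool for the same random numbers, because only the ranks are used.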
2.4.7 VEGA Selection
The Vector Evaluated Genetic Algorithm by Schaffer [1821, 1822] applies a special selection
algorithm which does not incorporate any preceding fitness assignment process but works on
the objective values directly. For each of the objective functions fi ∈ F, it selects a subset of
the mating pool Mate of the size ms/|F|. To do so, it applies fitness proportionate selection
which is based on fi instead of a fitness assignment “assignFitness”. The mating pool is then
a mixture of these sub-selections. Richardson et al. [1728] show in [1820] that this selection
scheme is approximately the same as if computing a weighted sum of the fitness values. As
pointed out by Fonseca and Fleming [714], in the general case, this selection method will
sample non-prevailed solution candidates at different frequencies. Schaffer also anticipated
that the population of his GA may split into different species, each particularly strong in
one objective, if the Pareto frontier is concave.

Algorithm 2.15: Mate ←− vegaSelect(Pop, F, ms)
Input: Pop: the list of individuals to select from
Input: F: the objective functions
Input: ms: the number of individuals to be placed into the mating pool Mate
Data: i: a counter variable
Data: j: the size of the current subset of the mating pool
Data: A: a temporary mating pool
Output: Mate: the individuals selected
begin
  Mate ←− ()
  for i ←− 1 up to |F| do
    j ←− ⌊ms/|F|⌋
    if i = 1 then j ←− j + ms mod |F|
    A ←− rouletteWheelSelect_r(Pop, v ≡ f_i, j)
    Mate ←− appendList(Mate, A)
  return Mate
end
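The structure of VEGA selection can be sketched in Python as follows. The sketch assumes that the objective functions are plain Python callables subject to minimization and uses a compact normalized roulette wheel based on `random.choices` as the sub-selection; the helper names `roulette` and `vega_select` are hypothetical.

```python
import random

def roulette(pop, fitness, ms, rng):
    """Normalized fitness proportionate selection (minimization), with replacement."""
    lo, hi = min(fitness), max(fitness)
    if hi == lo:
        hi, lo = hi + 1, lo - 1
    weights = [(hi - f) / (hi - lo) for f in fitness]
    return rng.choices(pop, weights=weights, k=ms)

def vega_select(pop, objectives, ms, rng=random):
    """One roulette wheel sub-selection per objective, concatenated."""
    mate = []
    share, rest = divmod(ms, len(objectives))
    for idx, f in enumerate(objectives):
        size = share + (rest if idx == 0 else 0)  # the first objective takes the remainder
        mate.extend(roulette(pop, [f(p) for p in pop], size, rng))
    return mate

# Two conflicting objectives on points of the line x + y = 10.
pop = [(x, 10 - x) for x in range(11)]
objectives = [lambda p: p[0], lambda p: p[1]]
mate = vega_select(pop, objectives, ms=7, rng=random.Random(2))
```

Here, the first four elements of the mating pool are drawn according to the first objective and the remaining three according to the second one.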
2.4.8 Clearing and Simple Convergence Prevention (SCP)
In our experiments (especially in Genetic Programming and problems with discrete objective
functions) we often use a very simple mechanism to prevent premature convergence (see
Section 1.4.2) which we outline in Algorithm 2.17. In our opinion, this SCP method is
neither a fitness assignment nor a selection algorithm, but we think it fits best into this
section.
The idea is simple: the more similar individuals we have in the population, the more
likely it is that we have converged. We do not know whether we have converged to a global
optimum or to a local one. If we got stuck at a local optimum, we should maybe limit the
fraction of the population which resides at this spot. In case we have found the global
optimum, this approach does not hurt, because in the end, one single point on this optimum
suffices.
Clearing
The first one to apply such an explicit limitation method was Pétrowski [1638, 1639], whose
clearing approach is applied in each generation and works as specified in Algorithm 2.16,
where fitness is subject to minimization. Basically, clearing divides the population of an EA
into several sub-populations according to a distance measure dist applied in the genotypic
(G) or phenotypic space (X) in each generation. The individuals of each sub-population have
at most the distance σ to the fittest individual in this niche. Then, the fitness of all but the
k best individuals in such a sub-population is set to the worst possible value. This effectively
prevents a niche from getting too crowded. Sareni and Krähenbühl [1801] showed that this
method is very promising. Singh and Deb [1892] suggest a modified clearing approach which
shifts individuals that would be cleared farther away and reevaluates their fitness.
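The clearing procedure described above can be sketched in Python as follows. The sketch assumes parallel lists for population and fitness, fitness minimization with infinity as the worst possible value, and a user-supplied distance function; the function name `clearing` and the return of a new fitness list (instead of modifying the fitness function in place) are choices made for illustration.

```python
import math

def clearing(pop, fitness, dist, sigma, k):
    """Return cleared fitness values: at most k winners per niche of radius sigma."""
    order = sorted(range(len(pop)), key=lambda i: fitness[i])  # best first
    cleared = list(fitness)
    for pos, i in enumerate(order):
        if cleared[i] < math.inf:          # i has not been cleared: it leads a niche
            winners = 1
            for j in order[pos + 1:]:
                if cleared[j] < math.inf and dist(pop[i], pop[j]) < sigma:
                    if winners < k:
                        winners += 1       # j is one of the k best in this niche
                    else:
                        cleared[j] = math.inf
    return cleared

# Two niches on the real line, niche capacity 2.
pop = [0.0, 0.1, 0.2, 5.0, 5.1]
fitness = [1.0, 2.0, 3.0, 4.0, 5.0]
result = clearing(pop, fitness, dist=lambda a, b: abs(a - b), sigma=1.0, k=2)
```

In this example, the third individual of the crowded niche around 0 is cleared, while both individuals of the niche around 5 keep their fitness.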
SCP
We modified this approach in two respects: We measure similarity not in form of a distance
in G or X, but in the objective space Y ⊆ R^|F|. All individuals are compared with each
other. If two have exactly the same objective values25, one of them is thrown away with
probability26 cp ∈ [0,1] and does not take part in any further comparisons. This way, we
weed out similar individuals without making any assumptions about G or X and make room
in the population and mating pool for a wider diversity of solution candidates. For cp = 0,
this prevention mechanism is turned off; for cp = 1, all remaining individuals will have
different objective values.
25 The exactly-the-same-criterion makes sense in combinatorial optimization and many Genetic
Programming problems but may easily be replaced with a limit imposed on the Euclidean distance
in real-valued optimization problems, for instance.
Algorithm 2.16: Pop′ ←− clearing(Pop, σ, k)
Input: Pop: the list of individuals to apply clearing to
Input: σ: the clearing radius
Input: k: the niche capacity
Input: [implicit] v: the fitness values
Input: [implicit] dist: a distance measure in the genome or phenome
Data: n: the current number of winners
Data: i, j: counter variables
Output: Pop′: the pruned population
begin
  Pop′ ←− sortList_a(Pop, v)
  for i ←− 0 up to len(Pop′) − 1 do
    if v(Pop′[i].x) < ∞ then
      n ←− 1
      for j ←− i + 1 up to len(Pop′) − 1 do
        if (v(Pop′[j].x) < ∞) ∧ (dist(Pop′[i], Pop′[j]) < σ) then
          if n < k then n ←− n + 1
          else v(Pop′[j].x) ←− ∞
end
Although this approach is very simple, the results of our experiments were often sig-
nificantly better with this convergence prevention method turned on than without it
[1650, 2188]. Additionally, in none of our experiments, the outcomes were influenced nega-
tively by this filter, which makes it even more robust than other methods for convergence
prevention like sharing or variety preserving. Algorithm 2.17, which has to be applied after
the evaluation of the objective values of the individuals in the population and before any
fitness assignment or selection takes place, specifies how our simple mechanism works.
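A minimal Python sketch of this convergence prevention step (Algorithm 2.17) may help to see how little machinery it needs. It assumes that the objective functions are ordinary Python callables applied directly to the solution candidates; caching the objective vectors and the optional rng parameter are conveniences of this sketch, not part of the algorithm.

```python
import random

def convergence_prevention_scp(pop, objectives, cp, rng=random):
    """Prune pop so that duplicates in objective space become rare.

    pop:        list of solution candidates
    objectives: list of objective functions f(x) -> number
    cp:         convergence prevention probability in [0, 1]
    rng:        random source providing random() in [0, 1)
    """
    pruned = []       # individuals kept so far
    pruned_obj = []   # their objective vectors, cached to avoid re-evaluation
    for p in pop:
        y = tuple(f(p) for f in objectives)
        # Walk backwards so deleting index j does not disturb unvisited indices.
        for j in range(len(pruned) - 1, -1, -1):
            if pruned_obj[j] == y and rng.random() < cp:
                del pruned[j]
                del pruned_obj[j]
        pruned.append(p)
        pruned_obj.append(y)
    return pruned
```

For cp = 0 the population is returned unchanged, and for cp = 1 exactly one individual per distinct objective vector survives, namely the one that was encountered last.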
If an individual p occurs n times in the population or if there are n individuals with
exactly the same objective values, Algorithm 2.17 cuts down the expected number of their
occurrences S(p) to
$$S(p) = \sum_{i=1}^{n} (1 - cp)^{i-1} = \sum_{i=0}^{n-1} (1 - cp)^{i} = \frac{(1 - cp)^{n} - 1}{-cp} = \frac{1 - (1 - cp)^{n}}{cp} \tag{2.23}$$
In Figure 2.12, we sketch the expected number of remaining instances of the individual
p after this pruning process if it occurred n times in the population before Algorithm 2.17
was applied.
From Equation 2.23 it follows that even a population of infinite size which has fully
converged to one single value will, in expectation, not contain more than 1/cp copies of
this individual after the simple convergence prevention has been applied (for cp > 0). This
threshold is also visible in Figure 2.12.
$$\lim_{n\to\infty} S(p) = \lim_{n\to\infty} \frac{1 - (1 - cp)^{n}}{cp} = \frac{1 - 0}{cp} = \frac{1}{cp} \tag{2.24}$$
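The closed form of S(p) and its limit can be checked numerically with a few lines of Python. The two function names are chosen for this sketch only; the first evaluates the closed form from Equation 2.23, the second the explicit geometric sum.

```python
def expected_survivors(n, cp):
    """Closed form of S(p) for n duplicates (requires cp > 0)."""
    return (1.0 - (1.0 - cp) ** n) / cp

def expected_survivors_sum(n, cp):
    """The same quantity written as the sum over i = 0 .. n-1."""
    return sum((1.0 - cp) ** i for i in range(n))
```

Evaluating expected_survivors for a large n and, say, cp = 0.2 gives a value very close to 1/cp = 5, in line with Equation 2.24.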
In Pétrowski’s clearing approach [1638], the maximum number of individuals which can
survive in a niche is a fixed constant k and, if fewer than k individuals reside in a niche,
none of them is affected. In SCP, by contrast, the probability cp only specifies an expected
value for the number of individuals allowed in a niche, which may be both exceeded and
undercut. Another difference between the approaches arises from the space in which the
distance is computed.
26 instead of defining a fixed threshold k

Algorithm 2.17: Pop′ ←− convergencePreventionSCP(Pop, cp)
Input: Pop: the list of individuals to apply convergence prevention to
Input: cp: the convergence prevention probability, cp ∈ [0, 1]
Input: [implicit] F: the set of objective functions
Data: i, j: counter variables
Data: p: the individual checked in this generation
Output: Pop′: the pruned population
begin
  Pop′ ←− ()
  for i ←− 0 up to len(Pop) − 1 do
    p ←− Pop[i]
    for j ←− len(Pop′) − 1 down to 0 do
      if f(p.x) = f(Pop′[j].x) ∀f ∈ F then
        if randomu() < cp then
          Pop′ ←− deleteListItem(Pop′, j)
    Pop′ ←− addListItem(Pop′, p)
  return Pop′
end

Figure 2.12: The expected numbers of occurrences for different values of n and cp. (Plot of S(p) over n, with one curve each for cp = 0.1, 0.2, 0.3, 0.5, and 0.7.)
Discussion
Whereas clearing prevents the EA from concentrating too much on a certain area in the
search or problem space, SCP stops it from keeping too many individuals with equal utility.
The former approach works against premature convergence to a certain solution structure
while the latter forces the EA to “keep track” of a trail to solution candidates with worse
fitness which may later evolve to good individuals with traits different from the currently
exploited ones.
Which of the two approaches is better has not yet been tested in comparative experiments
and is part of our future work. At present, we assume that in real-valued
search or problem spaces, clearing should be more suitable, whereas we know from
experiments using only our approach that SCP performs very well in combinatorial problems
[1650, 2188] and Genetic Programming (see Section 21.3.2, for instance).
TODO add remaining selection
algorithms
2.5 Reproduction
An optimization algorithm uses the information gathered up to step t for creating the
solution candidates to be evaluated in step t + 1. There exist different methods to do so.
In evolutionary algorithms, the aggregated information corresponds to the population Pop
and the set of best individuals Arc if such an archive is maintained. The search operations
searchOp ∈ Op used in the evolutionary algorithm family are called reproduction
operations, inspired by the biological procreation mechanisms27 of mother nature [1730]. There
are four basic operations:
1. Creation has no direct natural paragon; it simply creates a new genotype without any
ancestors or heritage. Hence, it can roughly be compared with the emergence of the first
living cells out of a soup of certain chemicals28.
2. Duplication resembles cell division29, resulting in two individuals similar to one
parent.
3. Mutation in evolutionary algorithms corresponds to small, random variations in the
genotype of an individual, exactly like its natural counterpart30.
4. Like in sexual reproduction, recombination31 combines two parental genotypes to a new
genotype including traits from both elders.
In the following, we will discuss these operations in detail and provide general definitions
for them.
Definition 2.9 (Creation). The creation operation “create” is used to produce a new
genotype g ∈ G with a random configuration.
g = create() ⇒ g ∈ G
(2.25)
When an evolutionary algorithm starts, no information about the search space has been
gathered yet. Hence, we cannot use existing solution candidates to derive new ones and
search operations with an arity higher than zero cannot be applied. Creation is thus used
to fill the initial population Pop(t = 0).
Definition 2.10 (Duplication). The duplication operation duplicate : G → G is used to
create an exact copy of an existing genotype g ∈ G.
g = duplicate(g) ∀g ∈ G
(2.26)
27 http://en.wikipedia.org/wiki/Reproduction [accessed 2007-07-03]
28 http://en.wikipedia.org/wiki/Abiogenesis [accessed 2008-03-17]
29 http://en.wikipedia.org/wiki/Cell_division [accessed 2008-03-17]
30 http://en.wikipedia.org/wiki/Mutation [accessed 2007-07-03]
31 http://en.wikipedia.org/wiki/Sexual_reproduction [accessed 2008-03-17]

Duplication is just a placeholder for copying an element of the search space, i. e., it is
what occurs when neither mutation nor recombination is applied. It is useful to increase
the share of a given type of individual in a population.
Definition 2.11 (Mutation). The mutation operation mutate : G → G is used to create a
new genotype gn ∈ G by modifying an existing one. The way this modification is performed
is application-dependent. It may happen in a randomized or in a deterministic fashion.
gn = mutate(g) : g ∈ G ⇒ gn ∈ G
(2.27)
Definition 2.12 (Recombination).
The recombination (or crossover32) operation
recombine : G × G → G is used to create a new genotype gn ∈ G by combining the
features of two existing ones. Depending on the application, this modification may happen
in a randomized or in a deterministic fashion.
gn = recombine(ga, gb) : ga, gb ∈ G ⇒ gn ∈ G
(2.28)
Notice that the term recombination is more general than crossover since it stands for
arbitrary search operations that combine the traits of two individuals. Crossover, however,
is only used if the elements of the search space G are linear representations. Then, it stands for
exchanging parts of these so-called strings.
Now we can define the set OpEA of search operations most commonly applied in evolu-
tionary algorithms as
OpEA = {create,duplicate,mutate,recombine}
(2.29)
All of them can be combined arbitrarily. It is, for instance, not unusual to mutate the results
of a recombination operation, i. e., to perform mutate(recombine(g1, g2)).
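To make the four operations concrete, here is a small Python sketch for the special case of bit-string genotypes. The choice of representation, the single-bit mutation, and the gene-wise random recombination are assumptions made for illustration only; the definitions above leave all of these application-dependent.

```python
import random

def create(n, rng=random):
    """Nullary operation: a new random bit string of length n."""
    return [rng.randint(0, 1) for _ in range(n)]

def duplicate(g):
    """Unary operation: an exact copy of g."""
    return list(g)

def mutate(g, rng=random):
    """Unary operation: copy g and toggle one randomly chosen bit."""
    child = list(g)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def recombine(ga, gb, rng=random):
    """Binary operation: take each gene from one of the two parents."""
    return [rng.choice((a, b)) for a, b in zip(ga, gb)]
```

Because every operation returns a new genotype, they compose freely; mutate(recombine(g1, g2)) is a valid call.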
The four operators are altogether used to reproduce whole populations of individuals.
Definition 2.13 (reproducePop). The population reproduction operation
Pop = reproducePop(Mate) is used to create a new population Pop by applying the reproduction
operations to the mating pool Mate.
Pop = reproducePop(Mate) ⇒ ∀p ∈ Mate ⇒ p ∈ P, ∀p ∈ Pop ⇒ p ∈ P, len(Pop) = len(Mate)
∀p ∈ Pop ⇒ p.g = create() ∨
           p.g = duplicate(pold.g) : pold ∈ Mate ∨
           p.g = mutate(pold.g) : pold ∈ Mate ∨
           p.g = recombine(pold1.g, pold2.g) : pold1, pold2 ∈ Mate
(2.30)
For creating an initial population of the size s, we furthermore define the function
createPop(s) in Algorithm 2.18.
2.5.1 NCGA Reproduction
The Neighborhood Cultivation Genetic Algorithm by Watanabe et al. [2160] discussed in
?? uses a special reproduction method. Recombination is performed only on neighboring
individuals, which leads to child genotypes close to their parents. This so-called neighborhood
cultivation shifts the recombination operator in the direction of exploitation, i. e.,
NCGA uses crossover for investigating the close surroundings of known solution candidates.
The idea is that parents that do not differ much from each other are more likely to be
compatible and thus to produce functional offspring than parents that have nothing in common.
32 http://en.wikipedia.org/wiki/Recombination [accessed 2007-07-03]

Algorithm 2.18: Pop ←− createPop(s)
Input: s: the number of individuals in the new population
Input: [implicit] create: the creation operator
Data: i: a counter variable
Output: Pop: the new population of randomly created individuals (len(Pop) = s)
begin
  Pop ←− ()
  for i ←− 0 up to s − 1 do
    Pop ←− addListItem(Pop, create())
  return Pop
end
Neighborhood cultivation is achieved in Algorithm 2.19 by sorting the mating pool along one
focused objective. Then, the elements situated directly beside each other are recombined.
The focus rotates over the objectives, so that in a three-objective optimization the first
objective is focused at the beginning, then the second, then the third, and after that again
the first. The algorithm shown here receives the additional parameter foc which denotes
the focused objective. Both recombination and mutation are performed with an implicitly
defined probability (r and m, respectively).
Algorithm 2.19: Pop ←− ncgaReproducePop_foc(Mate)
Input: Mate: the mating pool
Input: foc: the objective currently focused
Input: [implicit] recombine, mutate: the recombination and mutation routines
Input: [implicit] r, m: the probabilities of recombination and mutation
Data: i: a counter variable
Output: Pop: the new population with len(Pop) = len(Mate)
begin
  Pop ←− sortList_a(Mate, f_foc)
  for i ←− 0 up to len(Pop) − 1 do
    if (randomu() ≤ r) ∧ (i < len(Pop) − 1) then Pop[i] ←− recombine(Pop[i], Pop[i+1])
    if randomu() ≤ m then Pop[i] ←− mutate(Pop[i])
  return Pop
end
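The neighborhood cultivation step of Algorithm 2.19 translates almost line by line into Python. The sketch below assumes that individuals are represented directly by their genotypes, that the focused objective is passed as a callable f_foc usable as a sort key, and that the recombination and mutation routines are supplied by the caller; these interface choices are illustrative.

```python
import random

def ncga_reproduce_pop(mate, f_foc, recombine, mutate, r, m, rng=random):
    """Reproduce the mating pool by recombining neighbors along one objective.

    mate:      mating pool, a list of genotypes
    f_foc:     the currently focused objective, used as sort key
    recombine: function(g1, g2) -> genotype
    mutate:    function(g) -> genotype
    r, m:      probabilities of recombination and mutation
    """
    # Sorting along one objective places similar individuals next to each other.
    pop = sorted(mate, key=f_foc)
    for i in range(len(pop)):
        if rng.random() <= r and i < len(pop) - 1:
            pop[i] = recombine(pop[i], pop[i + 1])  # mate with the right-hand neighbor
        if rng.random() <= m:
            pop[i] = mutate(pop[i])
    return pop
```

Rotating the focus could then be done by the caller, for example by choosing the objective with index t mod |F| in generation t.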
2.6 Algorithms
Besides the basic evolutionary algorithms introduced in Section 2.1.3 on page 98, there exists
a variety of other, more sophisticated approaches. Many of them deal especially with multi-
objective optimization which imposes new challenges on fitness assignment and selection. In
this section we discuss the most prominent of these evolutionary algorithms.
2.6.1 VEGA
The very first multi-objective genetic algorithm is the Vector Evaluated Genetic Algorithm
(VEGA) created by Schaffer [1821, 1822] in the mid-1980s. The main difference between
VEGA and the basic form of evolutionary algorithms is the modified selection algorithm
which you can find discussed in Section 2.4.7 on page 133. This selection algorithm solely
relies on the objective functions F and does not use any preceding fitness assignment process
nor can it incorporate a prevalence comparison scheme cmpF . However, it has severe weak-
nesses also discussed in Section 2.4.7 and thus cannot be considered as an efficient approach
to multi-objective optimization.
Algorithm 2.20: X⋆ ←− vega(F, ps)
Input: F: the objective functions
Input: ps: the population size
Data: t: the generation counter
Data: Pop: the population
Data: Mate: the mating pool
Data: v: the fitness function resulting from the fitness assigning process
Output: X⋆: the set of the best elements found
begin
  t ←− 0
  Pop ←− createPop(ps)
  while ¬terminationCriterion() do
    Mate ←− vegaSelect(Pop, F, ps)
    t ←− t + 1
    Pop ←− reproducePop(Mate)
  return extractPhenotypes(extractOptimalSet(Pop))
end
TODO add remaining EAs

3 Genetic Algorithms
3.1 Introduction
Genetic algorithms1 (GAs) are a subclass of evolutionary algorithms where the elements
of the search space G are binary strings (G = B∗) or arrays of other elementary types. As
sketched in Figure 3.1, the genotypes are used in the reproduction operations whereas the
values of the objective functions f ∈ F are computed on basis of the phenotypes in the
problem space X which are obtained via the genotype-phenotype mapping “gpm”. [821,
940, 916, 2208]
The roots of genetic algorithms go back to the mid-1950s, when biologists like Barricelli
[150, 151, 152, 153] and the computer scientist Fraser [742] began to apply computer-aided
simulations in order to gain more insight into genetic processes and natural evolution and
selection. Bremermann [287] and Bledsoe [216, 215, 217, 218] used evolutionary approaches
based on binary string genomes for solving inequalities, for function optimization, and for
determining the weights in neural networks in the early 1960s [219]. At the end of that
decade, important research on such search spaces was contributed by Bagley [116] (who
introduced the term genetic algorithm), Rosenberg [1760], Cavicchio, Jr. [354, 355], and
Frantz [741], all based on the ideas of Holland at the University of Michigan. As a result of
Holland’s work [937, 939, 940, 938], genetic algorithms could be formalized as a new approach
for problem solving and finally became widely recognized and popular. Today, there are many
applications in science, economy, and research and development [1681] that can be tackled
with genetic algorithms. Therefore, various forms of genetic algorithms [423] have been
developed. Some genetic algorithms2 like the human-based genetic algorithms3 (HBGA),
for instance, even require human beings for evaluating or selecting the solution candidates
[1884, 1997, 1998, 1178, 883].
It should further be mentioned that, because of the close relation to biology and since ge-
netic algorithms were originally applied to single-objective optimization, the objective func-
tions f here are often referred to as fitness functions. This is a historically grown misnaming
which should not be mixed up with the fitness assignment processes discussed in Section 2.3
on page 111 and the fitness values v used in the context of this book.
1 http://en.wikipedia.org/wiki/Genetic_algorithm [accessed 2007-07-03]
2 http://en.wikipedia.org/wiki/Interactive_genetic_algorithm [accessed 2007-07-03]
3 http://en.wikipedia.org/wiki/HBGA [accessed 2007-07-03]

[Figure 3.1 shows the stages Initial Population (create an initial population of random individuals), Evaluation (compute the objective values of the solution candidates), Fitness Assignment (use the objective values to determine fitness values), Selection (select the fittest individuals for reproduction), and Reproduction (create new individuals from the mating pool by crossover and mutation). The population Pop holds genotypes g, which the GPM maps to phenotypes x that are rated by the objective functions fi.]
Figure 3.1: The basic cycle of genetic algorithms.
3.2 General Information
3.2.1 Areas Of Application
Some example areas of application of genetic algorithms are:
Application                                  References
Scheduling                                   [1275, 417, 1228, 160, 340, 339, 341]
Chemistry, Chemical Engineering              [475, 2269, 474, 476, 531, 2127, 1075, 1401]
Medicine                                     [319, 1900, 2278, 2117]
Data Mining and Data Analysis                [1424, 1089, 834, 1991, 445]
Geometry and Physics                         [366, 367, 966, 1222, 1223]
Economics and Finance                        [2302]
Networking and Communication                 [628, 1861, 1220, 290, 1164, 2324]; see Section 23.2 on page 401
Electrical Engineering and Circuit Design    [1304, 1305, 1306]
Image Processing                             [25]
Combinatorial Optimization                   [1480, 1134, 1754, 2020, 2323, 32]
3.2.2 Conferences, Workshops, etc.
Some conferences and workshops on genetic algorithms are:
EUROGEN: Evolutionary Methods for Design Optimization and Control with Applications
to Industrial Problems
see Section 2.2.2 on page 106
FOGA: Foundations of Genetic Algorithms
http://www.sigevo.org/ [accessed 2007-09-01]
History: 2007: Mexico City, México, see [1960]
2005: Aizu-Wakamatsu City, Japan, see [2259]
2002: Torremolinos, Spain, see [519]
2000: Charlottesville, VA, USA, see [1927]
1998: Madison, WI, USA, see [139]
1996: San Diego, CA, USA, see [172]
1994: Estes Park, Colorado, USA, see [2214]
1992: Vail, Colorado, USA, see [2209]
1990: Bloomington Campus, Indiana, USA, see [1924]
FWGA: Finnish Workshop on Genetic Algorithms and Their Applications
NWGA: Nordic Workshop on Genetic Algorithms
History: 1997: Helsinki, Finland, see [30]
1996: Vaasa, Finland, see [29]
1995: Vaasa, Finland, see [28]
1994: Vaasa, Finland, see [27]
1992: Espoo, Finland, see [26]
GALESIA: International Conference on Genetic Algorithms in Engineering Systems: Inno-
vations and Applications
now part of CEC, see Section 2.2.2 on page 105
History: 1997: Glasgow, UK, see [990]
1995: Sheffield, UK, see [2309]
GECCO: Genetic and Evolutionary Computation Conference
see Section 2.2.2 on page 107
GEM: International Conference on Genetic and Evolutionary Methods
History: 2008: Las Vegas, Nevada, USA, see [81]
2007: Las Vegas, Nevada, USA, see [80]
ICGA: International Conference on Genetic Algorithms
Now part of GECCO, see Section 2.2.2 on page 107
History: 1997: East Lansing, Michigan, USA, see [98]
1995: Pittsburgh, PA, USA, see [636]
1993: Urbana-Champaign, IL, USA, see [730]
1991: San Diego, CA, USA, see [170]
1989: Fairfax, Virginia, USA, see [1820]
1987: Cambridge, MA, USA, see [857]
1985: Pittsburgh, PA, USA, see [856]
ICANNGA: International Conference on Adaptive and Natural Computing Algorithms
see Section 2.2.2 on page 108
Mendel: International Conference on Soft Computing
see Section 1.6.2 on page 90
3.2.3 Online Resources
Some general resources on genetic algorithms that are available online are:
http://www.obitko.com/tutorials/genetic-algorithms/ [accessed 2008-05-17]
Last update: 1998
Description: A very thorough introduction to genetic algorithms by Marek Obitko
http://www.aaai.org/AITopics/html/genalg.html [accessed 2008-05-17]
Last update: up-to-date
Description: The genetic algorithms and Genetic Programming pages of the AAAI
http://www.illigal.uiuc.edu/web/ [accessed 2008-05-17]
Last update: up-to-date
Description: The Illinois Genetic Algorithms Laboratory (IlliGAL)
http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/genetic/top.html [accessed 2008-05-17]
Last update: 1997-08-10
Description: The Genetic Algorithms FAQ.
http://www.rennard.org/alife/english/gavintrgb.html [accessed 2008-05-17]
Last update: 2007-07-10
Description: An introduction to genetic algorithms by Jean-Philippe Rennard.
http://www.optiwater.com/GAsearch/ [accessed 2008-06-08]
Last update: 2003-11-15
Description: GA-Search – The Genetic Algorithms Search Engine
3.2.4 Books
Some books about (or including significant information about) genetic algorithms are:
Goldberg [821]: Genetic Algorithms in Search, Optimization and Machine Learning
Mitchell [1431]: An Introduction to Genetic Algorithms
Davis [495]: Handbook of Genetic Algorithms
Haupt and Haupt [905]: Practical Genetic Algorithms
Gen and Cheng [787]: Genetic Algorithms and Engineering Design
Chambers [368]: Practical Handbook of Genetic Algorithms: Applications
Chambers [369]: Practical Handbook of Genetic Algorithms: New Frontiers
Chambers [370]: Practical Handbook of Genetic Algorithms: Complex Coding Systems
Holland [940]: Adaptation in Natural and Artificial Systems
Gen and Chen [786]: Genetic Algorithms (Engineering Design and Automation)
Cantú-Paz [330]: Efficient and Accurate Parallel Genetic Algorithms
Heistermann [915]: Genetische Algorithmen. Theorie und Praxis evolutionärer Optimierung
Schöneburg, Heinzmann, and Feddersen [1831]: Genetische Algorithmen und Evolutionsstrategien
Gwiazda [873]: Crossover for single-objective numerical optimization problems
Schaefer and Telega [1819]: Foundations of Global Genetic Optimization
Karr and Freeman [1093]: Industrial Applications of Genetic Algorithms
Bäck [99]: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary
Programming, Genetic Algorithms
Davis [494]: Genetic Algorithms and Simulated Annealing
Alba and Dorronsoro [33]: Cellular Genetic Algorithms
3.3 Genomes in Genetic Algorithms
Most of the terminology which we have defined in Section 1.3 and used throughout this
book stems from the GA sector. The search spaces G of genetic algorithms, for instance, are
referred to as genomes and their elements are called genotypes. Genotypes in nature encompass
the whole hereditary information of an organism encoded in the DNA4. The DNA is a string
of base pairs that encodes the phenotypical characteristics of the creature it belongs to. Like
their natural prototypes, the genomes in genetic algorithms are strings, linear sequences of
certain data types [821, 945, 1431]. Because of the linear structure, these genotypes are also
often called chromosomes. In genetic algorithms, we most often use chromosomes which are
strings of one and the same data type, for example bits or real numbers.
Definition 3.1 (String Chromosome). A string chromosome can either be a fixed-length
tuple (Equation 3.1) or a variable-length list (Equation 3.2).
In the first case, the loci i of the genes gi are constant and, hence, the tuples may contain
elements of different types Gi.
G = {∀(g[1],g[2],..,g[n]) : g[i] ∈ Gi ∀i ∈ 1..n}
(3.1)
This is not given in variable-length string genomes. Here, the positions of the genes may
shift when the reproduction operations are applied. Thus, all elements of such genotypes
must have the same type GT .
G = {∀lists g : g[i] ∈ GT ∀0 ≤ i < len(g)}
(3.2)
String chromosomes are normally bit strings, vectors of integer numbers, or vectors of real
numbers. Genetic algorithms with numeric vector genomes in their natural representation,
i. e., where G = X ⊆ Rn are called real-encoded [1107]. Today, more sophisticated methods
for evolving good strings (vectors) of (real) numbers exist (such as Evolution Strategies,
Differential Evolution, or Particle Swarm Optimization) than processing them like binary
strings with the standard reproduction operations of GAs.
Bit string genomes are sometimes complemented with the application of Gray coding5
during the genotype-phenotype mapping. This is done in an effort to preserve locality (see
Section 1.4.3) and ensure that small changes in the genotype will also lead to small changes in
the phenotypes [349]. Collins and Eaton [430] studied different encodings for GAs and found
that their E-code outperforms both Gray and direct binary coding in function optimization.
Messy genomes (see Section 3.7) were introduced to improve locality by linkage learning.
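The effect of Gray coding on locality can be illustrated with the standard reflected binary Gray code. The following Python sketch uses helper names chosen here for illustration; it converts between plain binary and Gray code and measures the Hamming distance between two codes.

```python
def binary_to_gray(b):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Invert binary_to_gray by folding the shifted value back in."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

In direct binary coding, the neighboring values 7 (0111) and 8 (1000) differ in all four bits, so a single-bit mutation cannot move from one to the other. Their Gray codes 0100 and 1100 differ in exactly one bit, which holds for every pair of adjacent integers.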
Genetic algorithms are the original prototype of evolutionary algorithms and therefore
fully adhere to the description given in Section 2.1.2. They provide search operators which
closely copy sexual and asexual reproduction schemes from nature. In such “sexual” search
operations, the genotypes of the two parents will recombine. In asexual reproduction,
mutations are the only changes that occur. It is very common to apply both principles
in conjunction, i. e., to first recombine two elements from the search space and subsequently
make them subject to mutation.
4 You can find an illustration of the DNA in Figure 1.14 on page 42
5 http://en.wikipedia.org/wiki/Gray_coding [accessed 2007-07-03]
In nature, life begins with a single cell which divides6 time and again until a mature
individual is formed7 after the genetic information has been reproduced. The emergence
of a phenotype from its genotypic representation is called embryogenesis in biology and
its counterparts in evolutionary search are the genotype-phenotype mapping and artificial
embryogeny which we will discuss in Section 3.8 on page 155.
Let us shortly recapitulate the structure of the elements g of the search space G. A gene
(see Definition 1.23 on page 43) is the basic informational unit in a genotype g. Depending
on the genome, a gene can be a bit, a real number, or any other structure. In biology, a gene
is a segment of nucleic acid that contains the information necessary to produce a functional
RNA product in a controlled manner. An allele (see Definition 1.24) is a value of a specific
gene in nature and in EAs alike. The locus (see Definition 1.25) is the position where a
specific gene can be found in a chromosome. Besides the functional genes and their alleles,
there are also parts of natural genomes which have no (obvious) function [2161, 819]. The
American biochemist Gilbert [806] coined the term intron8 for such parts. Similar structures
can also be observed in evolutionary algorithms with variable-length encodings.
Definition 3.2 (Intron). Parts of a genotype g ∈ G that do not contribute to the
phenotype x = gpm(g) are referred to as introns.
Biological introns have often been thought of as junk DNA or “old code”, i. e., parts
of the genome that were translated to proteins in the evolutionary past, but now are not used
anymore. Currently though, many researchers assume that introns may not be as useless
as initially assumed [467]. Instead, they seem to provide support for efficient splicing, for
instance. The role of introns in genetic algorithms is just as mysterious. They represent a
form of redundancy, which is known to have positive as well as negative effects, as outlined
in Section 1.4.5 on page 67 and Section 4.10.3.
Figure 3.2 combines Figure 1.15 on page 45 and Figure 1.13 and illustrates the relations
between the aforementioned entities in a bit string genome G = B^4 of the length 4, where two
bits encode one coordinate in a two-dimensional plane. Additional bits could be appended
to the genotypes, for instance because a variable-length representation is used. Such bits
would then occur as introns and would not influence the phenotype in
the example.
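A genotype-phenotype mapping of this kind can be written down in a few lines. The Python sketch below assumes that the first bit of each pair is the more significant one and that any bits beyond the fourth are simply ignored; both choices are made for illustration and are not prescribed by the text.

```python
def gpm(g):
    """Map a bit string to a point (x0, x1); only the first four bits are used.

    Bits 0-1 encode the first coordinate, bits 2-3 the second. Any further
    bits are introns: they are carried along but never read.
    """
    if len(g) < 4:
        raise ValueError("genotype needs at least four bits")
    x0 = 2 * g[0] + g[1]
    x1 = 2 * g[2] + g[3]
    return (x0, x1)
```

Under these assumptions, the genotypes [0, 1, 1, 1] and [0, 1, 1, 1, 0, 1, 0] are mapped to the same phenotype, since the three appended bits have no effect.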
6 http://en.wikipedia.org/wiki/Cell_division [accessed 2007-07-03]
7 Matter of fact, cell division will continue until the individual dies. However, this is not important
here.
8 http://en.wikipedia.org/wiki/Intron [accessed 2007-07-05]

[Figure 3.2 shows a genotype g ∈ G in which the gene at locus 1 carries the allele “11”, together with appended bits that act as introns and have no effect during the gpm. The phenotype x = gpm(g) ∈ X is drawn as a point in a plane whose axes range from 0 to 3.]
Figure 3.2: A four bit string genome G and a fictitious phenotype X.
3.4 Fixed-Length String Chromosomes
Especially widespread in genetic algorithms are search spaces based on fixed-length chro-
mosomes. The properties of their crossover and mutation operations are well known and an
extensive body of research on them is available [821, 945].
3.4.1 Creation: Nullary Reproduction
Creation of fixed-length string individuals simply means creating a new tuple of the structure
defined by the genome and initializing it with random values. In reference to Equation 3.1 on
page 145, we could roughly describe this process with Equation 3.3.
createfl() ≡ (g[1],g[2],..,g[n]) : g[i] = Gi[⌊randomu()∗len(Gi)⌋] ∀i ∈ 1..n
(3.3)
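Equation 3.3 can be read as a recipe: for every locus, pick one allele uniformly at random from the set allowed there. A Python sketch under the assumption that each gene domain Gi is given as a finite sequence of alleles (the parameter name gene_domains is chosen for this example):

```python
import math
import random

def create_fl(gene_domains, rng=random):
    """Create a fixed-length genotype.

    gene_domains[i] lists the alleles allowed at locus i; the allele is picked
    by scaling a uniform random number in [0, 1) to an index.
    """
    return tuple(
        domain[math.floor(rng.random() * len(domain))]
        for domain in gene_domains
    )
```

Since each locus has its own domain, the genes of one tuple may have different types, for example create_fl([(0, 1), ("a", "b", "c"), (1.5,)]).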
3.4.2 Mutation: Unary Reproduction
Mutation is an important method for preserving the diversity of the solution candidates by
introducing small, random changes into them. In fixed-length string chromosomes, this can
be achieved by randomly modifying the value (allele) of a gene, as illustrated in Fig. 3.3.a.
Fig. 3.3.b shows the more general variant of this form of mutation where 0 < n < len(g)
locations in the genotype g are changed at once. In binary coded chromosomes, for example,
these genes would be bits which can simply be toggled. For real-encoded genomes, modifying
an element g_i can be done by replacing it with a number drawn from a normal distribution
with expected value g_i, i. e., g_i^new ∼ N(g_i, σ^2).
Fig. 3.3.a: Single-gene mutation. Fig. 3.3.b: Multi-gene mutation (a). Fig. 3.3.c: Multi-gene mutation (b).
Figure 3.3: Value-altering mutation of string chromosomes.
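Both value-altering mutations can be sketched in Python as follows. The function names, the choice to mutate exactly one gene in the real-valued case, and the use of the standard library's Gaussian generator are assumptions of this example.

```python
import random

def mutate_bits(g, n=1, rng=random):
    """Toggle n distinct, randomly chosen bits of the bit string g."""
    child = list(g)
    for i in rng.sample(range(len(child)), n):
        child[i] = 1 - child[i]
    return child

def mutate_real(g, sigma, rng=random):
    """Replace one randomly chosen gene g[i] by a draw from N(g[i], sigma^2)."""
    child = list(g)
    i = rng.randrange(len(child))
    child[i] = rng.gauss(child[i], sigma)
    return child
```

With n = 1, mutate_bits corresponds to single-gene mutation; larger n gives the multi-gene variant. The parameter sigma controls how far a real-valued gene typically moves.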

3.4.3 Permutation: Unary Reproduction
The permutation operation is an alternative mutation method where the alleles of two genes
are exchanged as sketched in Figure 3.4. This, of course, only makes sense if all genes have
similar data types. Permutation is, for instance, useful when solving problems that involve
finding an optimal sequence of items, like the travelling salesman problem [1263, 78]. Here, a
genotype g could encode the sequence in which the cities are visited. Exchanging two alleles
then equals switching two cities in the route.
Figure 3.4: Permutation applied to a string chromosome.
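As a sketch, the permutation operation amounts to swapping the contents of two randomly chosen loci. The function name permute is chosen for this Python example, which assumes the genotype has at least two genes.

```python
import random

def permute(g, rng=random):
    """Exchange the alleles of two distinct, randomly chosen genes."""
    child = list(g)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child
```

Applied to a tour such as [0, 1, 2, 3, 4, 5], the result is again a valid tour that visits every city exactly once, which a value-altering mutation would not guarantee.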
3.4.4 Crossover: Binary Reproduction
Amongst all evolutionary algorithms, genetic algorithms have the recombination operation
which probably comes closest to the natural paragon. Figure 3.5 outlines the recombination
of two string chromosomes, the so-called crossover, which is performed by swapping parts
of two genotypes.
When performing single-point crossover (SPX9), both parental chromosomes are split
at a randomly determined crossover point. Subsequently, a new child genotype is created
by appending the second part of the second parent to the first part of the first parent as
illustrated in Fig. 3.5.a. In two-point crossover (TPX, sketched in Fig. 3.5.b), both parental
genotypes are split at two points and a new offspring is created by using parts number one
and three from the first, and the middle part from the second parent chromosome. Fig. 3.5.c
depicts the generalized form of this technique: the n-point crossover operation, also called
multi-point crossover (MPX). For fixed-length strings, the crossover points for both parents
are always identical.
Fig. 3.5.a: Single-point Crossover (SPX). Fig. 3.5.b: Two-point Crossover (TPX). Fig. 3.5.c: Multi-point Crossover (MPX).
Figure 3.5: Crossover (recombination) operators for fixed-length string genomes.
9 This abbreviation is also used for simplex crossover, see Section 16.4.
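Since SPX and TPX are special cases of MPX, one routine is enough to sketch all three. The Python example below produces a single child and takes the crossover points as an explicit argument; the helper random_points and all names are assumptions of this sketch.

```python
import random

def multi_point_crossover(ga, gb, points):
    """Build one child by alternating between the parents at each crossover point.

    points: sorted cut positions with 0 < p < len(ga); one point gives SPX,
    two give TPX, more give MPX.
    """
    if len(ga) != len(gb):
        raise ValueError("fixed-length crossover needs parents of equal length")
    child, parents, start = [], (ga, gb), 0
    for k, cut in enumerate(list(points) + [len(ga)]):
        child.extend(parents[k % 2][start:cut])  # segment k comes from parent k mod 2
        start = cut
    return child

def random_points(length, n, rng=random):
    """Draw n distinct crossover points from 1 .. length-1."""
    return sorted(rng.sample(range(1, length), n))
```

For two parents of length 8, multi_point_crossover(ga, gb, random_points(8, 2)) performs a two-point crossover in which the first and third part come from ga and the middle part from gb.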

3.5 Variable-Length String Chromosomes
Variable-length genomes for genetic algorithms were first proposed by Smith in his PhD
thesis [1912]. There, he introduced a new variant of classifier systems10 with the goal of
evolving programs for playing poker [1912, 1688].
3.5.1 Creation: Nullary Reproduction
Variable-length strings can be created by first randomly drawing a length l > 0 and then
creating a list of that length filled with random elements.
3.5.2 Mutation: Unary Reproduction
If the string chromosomes are of variable length, the set of mutation operations introduced
in Section 3.4 can be extended by two additional methods. First, we could insert a couple
of genes with randomly chosen alleles at any given position into a chromosome (Fig. 3.6.a).
Second, this operation can be reversed by deleting elements from the string (Fig. 3.6.b).
It should be noted that both insertion and deletion are also implicitly performed by
crossover. Recombining two identical strings with each other can, for example, lead to the
deletion of genes. The crossover of different strings may turn out as an insertion of new genes
into an individual.
Since the reproduction operations can change the length of a genotype (hence the
name “variable-length”), variable-length strings need to be constructed of elements of the
same type. There is no longer a constant relation between locus and type.
Fig. 3.6.a: Insertion of random genes.
Fig. 3.6.b: Deletion of genes.
Figure 3.6: Search operators for variable-length strings (additional to those from Section 3.4.2
and Section 3.4.3).
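The creation and the two additional mutation operators described above can be sketched as follows. This is only a minimal illustration: the function names, the binary alphabet, the limit of three genes per operation, and the rule that a deletion never empties the string are assumptions made for the example, not part of the original text.

```python
import random

def create(max_len, alphabet=(0, 1), rng=random):
    """Nullary creation: draw a length l > 0, then fill the list with random elements."""
    length = rng.randint(1, max_len)
    return [rng.choice(alphabet) for _ in range(length)]

def mutate_insert(genotype, max_genes=3, alphabet=(0, 1), rng=random):
    """Insert a few genes with randomly chosen alleles at a random position."""
    pos = rng.randint(0, len(genotype))      # any gap, including both ends
    count = rng.randint(1, max_genes)        # assumed upper limit per operation
    new_genes = [rng.choice(alphabet) for _ in range(count)]
    return genotype[:pos] + new_genes + genotype[pos:]

def mutate_delete(genotype, max_genes=3, rng=random):
    """Delete a few consecutive genes; by assumption, at least one gene is kept."""
    if len(genotype) < 2:
        return list(genotype)
    count = rng.randint(1, min(max_genes, len(genotype) - 1))
    pos = rng.randint(0, len(genotype) - count)
    return genotype[:pos] + genotype[pos + count:]
```

Both mutation functions return new lists and leave the parent untouched, which keeps the sketch easy to combine with the crossover operators.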
3.5.3 Crossover: Binary Reproduction
For variable-length string chromosomes, the same crossover operations are available as for
fixed-length strings except that the strings are no longer necessarily split at the same loci.
The lengths of the new strings resulting from such a cut and splice operation may differ
from the lengths of the parents, as sketched in Figure 3.7. A special case of this type of
recombination is the homologous crossover, where only genes at the same loci are exchanged.
This method is discussed thoroughly in Section 4.6.7 on page 195.
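A single-point cut and splice operation on two variable-length strings might look like the sketch below. The function name and the choice to draw the cut points from 1..len (so that no child becomes empty) are assumptions made for this illustration.

```python
import random

def cut_and_splice(parent_a, parent_b, rng=random):
    """Single-point crossover where each parent is split at its own locus."""
    cut_a = rng.randint(1, len(parent_a))   # independent cut point in parent a
    cut_b = rng.randint(1, len(parent_b))   # independent cut point in parent b
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2
```

The combined length of the two children always equals the combined length of the parents, but each child alone may be shorter or longer than either parent.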
10 See Chapter 7 for more information on classifier systems.

Fig. 3.7.a: Single-point crossover. Fig. 3.7.b: Two-point crossover. Fig. 3.7.c: Multi-point crossover.
Figure 3.7: Crossover of variable-length string chromosomes.
3.6 Schema Theorem
The Schema Theorem is a special instance of forma analysis (discussed in Section 1.5.1
on page 80) for genetic algorithms. As a matter of fact, it is older than its generalization and
was first stated by Holland back in 1975 [940, 512, 945]. Here we will first introduce the
basic concepts of schemata, masks, and wildcards before going into detail about the Schema
Theorem itself, its criticism, and the related Building Block Hypothesis.
3.6.1 Schemata and Masks
Assume that the genotypes g in the search space G of genetic algorithms are strings of
a fixed length l over an alphabet11 Σ, i. e., G = Σ^l. Normally, Σ is the binary alphabet
Σ = {true,false} = {0,1}. From forma analysis, we know that properties can be defined
on the genotypic or the phenotypic space. For fixed-length string genomes, we can consider
the values at certain loci as properties of a genotype. There are two basic ways of
defining such properties: masks and don’t care symbols.
Definition 3.3 (Mask). For a fixed-length string genome G = Σ^l, we define the set of all
genotypic masks M_l as the power set12 of the valid loci, M_l = P({1,...,l}) [2167]. Every
mask m_i ∈ M_l defines a property φ_i and an equivalence relation:
\[ g \sim_{\varphi_i} h \;\Leftrightarrow\; g[j] = h[j] \;\; \forall j \in m_i \tag{3.4} \]
The order “order(m_i)” of the mask m_i is the number of loci defined by it:
\[ \text{order}(m_i) = |m_i| \tag{3.5} \]
The defined length δ(m_i) of a mask m_i is the maximum distance between two indices in
the mask:
\[ \delta(m_i) = \max \{ |j - k| \;:\; j, k \in m_i \} \tag{3.6} \]
A mask contains the indices of all elements in a string that are interesting in terms of the
property it defines. Assume we have bit strings of the length l = 3 as genotypes (G = B^3).
Leaving the empty mask aside, the set of valid masks M_3 is then
M_3 = {{1},{2},{3},{1,2},{1,3},{2,3},{1,2,3}}. The
mask m_1 = {1,2}, for example, specifies that the values at the loci 1 and 2 of a genotype
denote the value of a property φ_1 and that the value of the bit at position 3 is irrelevant.
Therefore, it defines four formae A_{φ_1=(0,0)} = {(0,0,0),(0,0,1)}, A_{φ_1=(0,1)} = {(0,1,0),(0,1,1)},
A_{φ_1=(1,0)} = {(1,0,0),(1,0,1)}, and A_{φ_1=(1,1)} = {(1,1,0),(1,1,1)}.
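The mask example can be reproduced with a few lines of code. The sketch below enumerates the non-empty masks for a given length and groups all genotypes into the formae induced by one mask. The function names are chosen for this illustration only.

```python
from itertools import combinations, product

def masks(l):
    """All non-empty masks over the loci 1..l (the power set without the empty set)."""
    loci = range(1, l + 1)
    return [frozenset(c) for k in range(1, l + 1) for c in combinations(loci, k)]

def order(mask):
    """Number of loci defined by the mask (Equation 3.5)."""
    return len(mask)

def defined_length(mask):
    """Maximum distance between two indices in the mask (Equation 3.6)."""
    return max(mask) - min(mask)

def formae(mask, l, alphabet=(0, 1)):
    """Group all genotypes of length l by their values at the loci in the mask."""
    classes = {}
    for g in product(alphabet, repeat=l):
        key = tuple(g[j - 1] for j in sorted(mask))   # loci are 1-based
        classes.setdefault(key, []).append(g)
    return classes
```

For l = 3 and the mask {1, 2}, `formae` yields four classes with two genotypes each, one class per combination of values at the first two loci.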
Definition 3.4 (Schema). A forma defined on a string genome concerning the values of
the characters at specified loci is called a schema [940, 389].
11 Alphabets and related notions are defined in Section 30.3 on page 561.
12 The power set is described in Definition 27.9 on page 458.

3.6.2 Wildcards
The second method of specifying such schemata is to use don’t care symbols (wildcards)
to create “blueprints” H of their member individuals. Therefore, we place the don’t care
symbol * at all irrelevant positions and the characterizing values of the property at the
others.
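\[ \forall j \in 1..l \;\Rightarrow\; H[j] = \begin{cases} g[j] & \text{if } j \in m_i \\ * & \text{otherwise} \end{cases} \tag{3.7} \]
\[ H[j] \in \Sigma \cup \{*\} \;\; \forall j \in 1..l \tag{3.8} \]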
We can now restate the aforementioned schemata as A_{φ_1=(0,0)} ≡ H_1 = (0,0,∗),
A_{φ_1=(0,1)} ≡ H_2 = (0,1,∗), A_{φ_1=(1,0)} ≡ H_3 = (1,0,∗), and A_{φ_1=(1,1)} ≡ H_4 = (1,1,∗). These
schemata mark hyperplanes in the search space G, as illustrated in Figure 3.8 for the
three-bit genome. Schemata correspond to masks, and thus definitions like the defined length and
the order carry over directly to them.
[Figure 3.8 shows the cube spanned by the axes g0, g1, and g2 with the schemata H1 = (0,0,∗), H2 = (0,1,∗), H3 = (1,0,∗), H4 = (1,1,∗), and H5 = (1,∗,∗) marked.]
Figure 3.8: An example for schemata in a three bit genome.
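To make the wildcard notation concrete, the following sketch builds a blueprint from a genotype and a mask (in the spirit of Equation 3.7) and tests whether a genotype is an instance of a schema. The function names and the use of the character "*" as the don't care symbol are assumptions of this example.

```python
WILDCARD = "*"

def blueprint(genotype, mask):
    """Keep the values at the loci in the mask, put a wildcard everywhere else."""
    return tuple(g if j in mask else WILDCARD
                 for j, g in enumerate(genotype, start=1))   # loci are 1-based

def matches(schema, genotype):
    """A genotype is an instance of a schema if all fixed positions agree."""
    return all(h == WILDCARD or h == g for h, g in zip(schema, genotype))

def schema_order(schema):
    """Number of fixed (non-wildcard) positions."""
    return sum(1 for h in schema if h != WILDCARD)

def schema_defined_length(schema):
    """Distance between the first and the last fixed position."""
    fixed = [j for j, h in enumerate(schema) if h != WILDCARD]
    return fixed[-1] - fixed[0] if fixed else 0
```

With these helpers, the schema H3 = (1,0,∗) is obtained from the genotype (1,0,1) and the mask {1,2}, and it matches exactly the two genotypes (1,0,0) and (1,0,1).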
3.6.3 Holland’s Schema Theorem
The Schema Theorem13 was defined by Holland [940] for genetic algorithms which use
fitness-proportionate selection (see Section 2.4.3 on page 124) where fitness is subject to
maximization [512, 945].
\[ \text{countOccurences}(H, Pop)_{t+1} \geq \frac{\text{countOccurences}(H, Pop)_t \cdot v(H)_t}{\overline{v}_t} \, (1 - p) \tag{3.10} \]
where
1. countOccurences(H, Pop)_t is the number of instances of a given schema defined by the
blueprint H in the population Pop of generation t,
13 http://en.wikipedia.org/wiki/Holland%27s_Schema_Theorem [accessed 2007-07-29]

2. v(H)_t is the average fitness of the members of this schema (observed in time step t),
3. \overline{v}_t is the average fitness of the population in time step t, and
4. p is the probability that an instance of the schema will be “destroyed” by a reproduction
operation, i. e., the probability that the offspring of an instance of the schema is not an
instance of the schema.
From this formula, it can be deduced that genetic algorithms will generate an exponentially
rising number of samples for short schemata of above-average fitness. This is because they will
multiply with a certain factor in each generation and only few of them are destroyed by the
reproduction operations. In the special case of single-point crossover (crossover rate cr) and
single-bit mutation (mutation rate mr) in a binary genome of the fixed length l (G = B^l),
the destruction probability p is given in Equation 3.11.
\[ p = cr \, \frac{\delta(H)}{l - 1} + mr \, \frac{\text{order}(H)}{l} \tag{3.11} \]
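Equations 3.10 and 3.11 can be evaluated numerically for a given schema. The following sketch assumes the special case stated above (single-point crossover and single-bit mutation on a binary genome); the function and parameter names are illustrative only.

```python
def destruction_probability(delta, order, l, cr, mr):
    """Equation 3.11: p = cr * delta(H) / (l - 1) + mr * order(H) / l."""
    return cr * delta / (l - 1) + mr * order / l

def expected_lower_bound(count_t, schema_fitness, mean_fitness, p):
    """Right-hand side of Equation 3.10 for one generation."""
    return count_t * schema_fitness / mean_fitness * (1.0 - p)

# A short schema (defined length 1) compared to a long one (defined length 8),
# both of order 2 in a genome of length 10.
p_short = destruction_probability(delta=1, order=2, l=10, cr=0.7, mr=0.1)
p_long = destruction_probability(delta=8, order=2, l=10, cr=0.7, mr=0.1)
bound_short = expected_lower_bound(10, 1.2, 1.0, p_short)
bound_long = expected_lower_bound(10, 1.2, 1.0, p_long)
```

With equal order and fitness, the short schema has the smaller destruction probability and therefore the larger lower bound on its number of instances in the next generation.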
3.6.4 Criticism of the Schema Theorem
The deduction that good schemata will spread exponentially is only a very optimistic as-
sumption and not generally true. If a highly fit schema has many offspring with good fit-
ness, this will also improve the overall fitness of the population. Hence, the probabilities in
Equation 3.10 will shift over time. Generally, the Schema Theorem represents a lower bound
that will only hold for one generation [2208]. Trying to derive predictions for more than one
or two generations using the Schema Theorem as is will lead to deceptive or wrong results
[858, 854].
Furthermore, the population of a genetic algorithm only represents a sample of limited
size of the search space G. This limits the reproduction of the schemata but also makes
statements about probabilities in general more complicated. We only have samples
of the schemata H and cannot be sure that v(H)_t really represents the average fitness of all
the members of the schema (that is why we annotate it with t instead of writing v(H)).
Thus, even reproduction operators which preserve the instances of the schema may lead to
a decrease of v(H)_{t+...} over time. It is also possible that parts of the population have already
converged and other members of a schema will not be explored anymore, so we do not get
further information about its real utility.
Additionally, we cannot know if it is really good if one specific schema spreads fast, even
if it is very fit. Remember that we have already discussed the exploration versus exploitation
topic and the importance of diversity in Section 1.4.2 on page 60.
Another issue is that we implicitly assume that most schemata are compatible and can
be combined, i. e., that there is low interaction between different genes. This is also not
generally valid: Epistatic effects, for instance, can lead to schema incompatibilities. The
expressiveness of masks and blueprints is also limited, and it can be argued that there are
properties which we cannot specify with them. Take the set D3 of numbers divisible by
three for example, D3 = {3,6,9,12,..}. Representing them as binary strings will lead to D3 =
{0011,0110,1001,1100,... } if we have a bit-string genome of the length 4. Obviously, we
cannot capture exactly these genotypes in a schema using the discussed approach. They may, however,
be gathered in a forma. The Schema Theorem, however, cannot hold for such a forma since
the probability p of destruction may be different from instance to instance.
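The claim about D3 can be checked by brute force for the 4-bit genome. The sketch below computes the most specific blueprint that still matches every positive multiple of three and then lists the instances of that blueprint which are not divisible by three. The helper names are assumptions of this example.

```python
def smallest_schema(genotypes):
    """Most specific blueprint that matches every given genotype."""
    return "".join(col[0] if len(set(col)) == 1 else "*"
                   for col in zip(*genotypes))

def instances(schema):
    """Enumerate all bit strings matched by a blueprint."""
    results = [""]
    for symbol in schema:
        options = "01" if symbol == "*" else symbol
        results = [prefix + bit for prefix in results for bit in options]
    return results

# The positive multiples of three that fit into 4 bits: 3, 6, 9, 12, 15.
d3 = [format(n, "04b") for n in range(3, 16, 3)]
schema = smallest_schema(d3)
extras = [g for g in instances(schema) if int(g, 2) % 3 != 0]
```

Since the members of D3 do not agree on any single locus, the only blueprint covering all of them consists of wildcards only, and it therefore also covers every number that is not divisible by three.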
3.6.5 The Building Block Hypothesis
According to Harik [896], the substructure of a genotype which allows it to match a schema
is called a building block. The Building Block Hypothesis (BBH) proposed by Goldberg
[821] and Holland [940] is based on two assumptions:

1. When a genetic algorithm solves a problem, there exist some low-order schemata of low
defined length with above-average fitness (the so-called building blocks).
2. These schemata are combined step by step by the genetic algorithm in order to form
larger and better strings. By using the building blocks instead of testing any possible binary
configuration, genetic algorithms efficiently decrease the complexity of the problem [821].
Although it seems as if the Building Block Hypothesis is supported by the Schema
Theorem, this cannot be verified easily. Experiments that originally were intended to prove
this theory often did not work out as planned [1432] (and also consider the criticisms of the
Schema Theorem mentioned in the previous section). In general, there exists much criticism
of the Building Block Hypothesis and, although it is a very nice model, it cannot yet be
considered as sufficiently proven.
3.7 The Messy Genetic Algorithm
According to the Schema Theorem specified in Equation 3.10 and Equation 3.11, a schema
is likely to spread in the population if it has above-average fitness, is short (i. e., has a low
defined length), and is of low order [116]. Thus, according to Equation 3.11, from two schemata of the
same average fitness and order, the one with the smaller defined length will be propagated
to more offspring, since it is less likely to be destroyed by crossover. Therefore, placing
dependent genes close to each other would be a good search space design approach since it would
allow good building blocks to proliferate faster. These building blocks, however, are not
known at design time; otherwise the problem would already be solved. Hence, it is not
generally possible to devise such a design.
The messy genetic algorithms (mGAs) developed by Goldberg et al. [825] use a coding
scheme which is intended to allow the genetic algorithm to re-arrange genes at runtime.
It can place the genes of a building block spatially close together. This method of linkage
learning may thus increase the probability that these building blocks, i.e., sets of epistatically
linked genes, are preserved during crossover operations, as sketched in Figure 3.9. It thus
mitigates the effects of epistasis as discussed in Section 1.4.6.
[Figure 3.9 sketch: in the original arrangement, the two linked genes are destroyed in 6 out of 9 cases by crossover; after the rearrangement, they are destroyed in 1 out of 9 cases.]
Figure 3.9: Two linked genes and their destruction probability under single-point crossover.
3.7.1 Representation
The idea behind the genomes used in messy GAs goes back to the work of Bagley
[116] from 1967, who first introduced a representation where the ordering of the
genes was not fixed. Instead, for each gene a tuple (φ, γ) with its position (locus)
φ and value (allele) γ was used. For instance, the bit string 000111 can be
represented as g1 = ((0, 0) , (1, 0) , (2, 0) , (3, 1) , (4, 1) , (5, 1)) but also as g2 =
((5, 1) , (1, 0) , (3, 1) , (2, 0) , (0, 0) , (4, 1)), where both genotypes map to the same phenotype,
i. e., gpm(g1) = gpm(g2).

3.7.2 Reproduction Operations
Inversion: Unary Reproduction
The inversion operator reverses the order of genes between two randomly chosen loci
[116, 896]. With this operation, any particular ordering can be produced in a relatively
small number of steps. Figure 3.10 illustrates, for example, how the possible building block
components at the loci 1, 3, 4, and 6 can be brought together in two steps. Nevertheless,
the effects of the inversion operation were rather disappointing [116, 741].
(0,0) (1,0) (2,0) (3,0) (4,1) (5,1) (6,1) (7,1)
first inversion
(0,0) (1,0) (4,1) (3,0) (2,0) (5,1) (6,1) (7,1)
second inversion
(0,0) (5,1) (2,0) (3,0) (4,1) (1,0) (6,1) (7,1)
Figure 3.10: An example for two subsequent applications of the inversion operation [896].
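The two inversions of Figure 3.10 can be replayed with a small helper. In this sketch, the inversion points are passed in explicitly instead of being chosen randomly, and they refer to positions in the genotype (0-based, inclusive); both choices are assumptions made to keep the example deterministic.

```python
def invert(genotype, start, end):
    """Reverse the order of the genes at positions start..end (inclusive, 0-based)."""
    return genotype[:start] + genotype[start:end + 1][::-1] + genotype[end + 1:]

g = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1)]
first = invert(g, 2, 4)       # reverses the genes with loci 2, 3, 4
second = invert(first, 1, 5)  # reverses positions 1..5 of the intermediate result
```

After the second step, the genes with the loci 3, 4, 1, and 6 occupy four neighboring positions, although only the order of the (locus, allele) tuples changed and no allele was modified.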
Cut: Unary Reproduction
The cut operator splits a genotype g into two with the probability p_c = (len(g) − 1) p_K, where
p_K is a bitwise probability and len(g) the length of the genotype [1153]. With p_K = 0.1, the genotype
g1 = ((0, 0) , (1, 0) , (2, 0) , (3, 1) , (4, 1) , (5, 1)) has a cut probability of p_c = (6 − 1) ∗ 0.1 = 0.5.
A cut at position 4 would lead to g3 = ((0, 0) , (1, 0) , (2, 0) , (3, 1)) and g4 = ((4, 1) , (5, 1)).
3.7.3 Splice: Binary Reproduction
The splice operator joins two genotypes with a predefined probability p_s by simply attaching
one to the other [1153]. Splicing g2 = ((5, 1) , (1, 0) , (3, 1) , (2, 0) , (0, 0) , (4, 1)) and g4 =
((4, 1) , (5, 1)), for instance, leads to g5 = ((5, 1) , (1, 0) , (3, 1) , (2, 0) , (0, 0) , (4, 1) , (4, 1) , (5, 1)).
In summary, the application of two cut operations and a subsequent splice operation to two genotypes
has roughly the same effect as a single-point crossover operator on variable-length string
chromosomes (see Section 3.5.3).
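The cut and splice examples can be written down directly. The sketch below leaves out the random decisions (whether to cut at all, where to cut, and whether to splice) and only shows the deterministic part; the function names are illustrative.

```python
def cut_probability(genotype, p_k):
    """p_c = (len(g) - 1) * p_K, with p_K as the bitwise cut probability."""
    return (len(genotype) - 1) * p_k

def cut(genotype, position):
    """Split a genotype into the genes before and from the given position."""
    return genotype[:position], genotype[position:]

def splice(first, second):
    """Attach the second genotype to the end of the first one."""
    return first + second

g1 = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
g2 = [(5, 1), (1, 0), (3, 1), (2, 0), (0, 0), (4, 1)]
g3, g4 = cut(g1, 4)
g5 = splice(g2, g4)
```

Note that g5 contains two genes each for the loci 4 and 5, which is the overspecification case that the genotype-phenotype mapping has to resolve.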
3.7.4 Overspecification and Underspecification
The genotypes in messy GAs have a variable length, and the cut and splice operators can lead
to genotypes being over- or underspecified. If we assume a three-bit genome, the genotype g6 =
((2, 0) , (0, 0) , (2, 1) , (1, 0)) is overspecified since it contains two (in this example, different)
alleles for the third gene (at locus 2). g7 = ((2, 0) , (0, 0)), in turn, is underspecified since it
does not contain any value for the gene in the middle (at locus 1).
Dealing with overspecification is rather simple [1153, 608]: The genes are processed from
left to right during the genotype-phenotype mapping, and the first allele found for a specific
locus wins. In other words, g6 from above codes for 000 and the second value for locus 2 is
discarded. The loci left open during the interpretation of underspecified genotypes are filled with
values from a template string [1153]. If this string was 000, g7 would code for 000, too.
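The two rules (first allele wins, open loci are taken from a template) fit into a short genotype-phenotype mapping. This is a minimal sketch; the name gpm follows the notation used above, while the list-based phenotype is an assumption of the example.

```python
def gpm(genotype, template):
    """Map a messy genotype of (locus, allele) tuples to a fixed-length phenotype."""
    phenotype = list(template)   # underspecified loci keep the template value
    seen = set()
    for locus, allele in genotype:
        if locus not in seen:    # overspecification: the first allele found wins
            seen.add(locus)
            phenotype[locus] = allele
    return phenotype

g6 = [(2, 0), (0, 0), (2, 1), (1, 0)]   # overspecified at locus 2
g7 = [(2, 0), (0, 0)]                   # underspecified at locus 1
```

With the template 000, both g6 and g7 map to 000. With the template 111, g7 would map to 010 instead, since only locus 1 is taken from the template.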

3.7.5 The Process
In a simple genetic algorithm, building blocks are identified and recombined simultaneously,
which leads to a race between recombination and selection [896]. In the messy GA [825, 826],
this race is avoided by separating the evolutionary process into two stages:
1. In the primordial phase, building blocks are identified. In the original conception of
the messy GA, all possible building blocks of a particular order k are generated. Via
selection, the best ones are identified and spread in the population.
2. These building blocks are recombined with the cut and splice operators in the subsequent
juxtapositional phase.
The original mGA needed a bootstrap phase in order to identify the
order-k building blocks, which required identifying the order-(k − 1) blocks first. This
bootstrapping was done by applying the primordial and juxtapositional phases for all orders from
1 to k − 1. This process was later improved by using a probabilistically complete initialization
algorithm [828] instead.
3.8 Genotype-Phenotype Mappings and Artificial Embryogeny
As already stated a dozen times by now, genetic algorithms use string genomes to encode
the phenotypes x that represent the possible solutions. These phenotypes, however, do not
necessarily need to be one-dimensional strings too. Instead, they can be construction plans,
circuit layouts, or trees14. The process of translating genotypes into corresponding pheno-
types is called genotype-phenotype mapping and has been introduced in Definition 1.30 on
page 44.
Embryogenesis is the natural process in which the embryo forms and develops15 and
to which the genotype-phenotype mapping in genetic algorithms and Genetic Programming
corresponds. Most of even the more sophisticated of these mappings are based on an implicit
one-to-one relation in terms of complexity. In the Grammar-guided Genetic Programming
approach Gads16, for example, a single gene encodes (at most) the application of a single
grammatical rule, which in turn unfolds a single node in a tree.
Embryogeny in nature is much more complex. Among other things, the DNA
encodes the structural design information of the human brain. As pointed out by
Manos et al. [1358], there are only about 30 thousand active genes in the human genome
(2800 million amino acids) for over 100 trillion neural connections in our cerebrum. A huge
amount of information is hence decoded from “data” which is of a much lower magnitude.
This is possible because the same genes can be reused in order to repeatedly create the same
pattern. The layout of the light receptors in the eye, for example, is always the same; just
their wiring changes.
Definition 3.5 (Artificial Embryogeny).
We subsume all methods of transforming
a genotype into a phenotype of (much) higher complexity under the subject of artificial
embryogeny [1358, 1957, 192] (also known as computational embryogeny [1221, 259]).
Two different approaches are common in artificial embryogeny: constructing the phe-
notype by using a grammar to translate the genotype and expanding it step by step until
14 See for example Section 4.5.6 on page 181
15 http://en.wikipedia.org/wiki/Embryogenesis [accessed 2007-07-03]
16 See Section 4.5.5 on page 179 for more details.

a terminal state is reached or simulating chemical processes. Both methods may also re-
quire subsequent correction steps that ensure that the produced results are correct, which is
also common in normal genotype-phenotype mappings [2295]. An example for gene reuse is
the genotype-phenotype mapping performed in Grammatical Evolution which is discussed
in Section 4.5.6 on page 182.

4
Genetic Programming
4.1 Introduction
The term Genetic Programming1 (GP) [1196, 916] has two possible meanings. First, it is often
used to subsume all evolutionary algorithms that have tree data structures as genotypes.
Second, it can also be defined as the set of all evolutionary algorithms that breed programs2,
algorithms, and similar constructs. In this chapter, we focus on the latter definition which
still includes discussing tree-shaped genomes.
The conventional well-known input-processing-output model3 from computer science
states that a running instance of a program uses its input information to compute and
return output data. In Genetic Programming, usually some inputs or situations and corre-
sponding output data samples are known or can be produced or simulated. The goal then
is to find a program that connects them or that exhibits some kind of desired behavior
according to the specified situations, as sketched in Figure 4.1.
[Figure 4.1 sketch: input → Process (running Program) → output. Samples of the input and output are known; the process is to be found with genetic programming.]
Figure 4.1: Genetic Programming in the context of the IPO model.
4.1.1 History
The history of Genetic Programming [63] goes back to the early days of computer science.
In 1957, Friedberg [750] left the first footprints in this area by using a learning algorithm
to stepwise improve a program. The program was represented as a sequence of instructions4
for a theoretical computer called Herman [750, 751]. Friedberg did not use an evolutionary,
population-based approach for searching the programs. This may be because the idea of
1 http://en.wikipedia.org/wiki/Genetic_programming [accessed 2007-07-03]
2 We have extensively discussed the topic of algorithms and programs in Section 30.1.1 on page 547.
3 see Section 30.1.1 on page 549
4 Linear Genetic Programming is discussed in Section 4.6 on page 191.

evolutionary algorithms wasn’t fully developed yet5 and also because of the limited compu-
tational capacity of the computers of that era.
Around the same time, Samuel applied machine learning to the game of checkers and, by
doing so, created the world’s first self-learning program. In the future development section
of his 1959 paper [1795], he suggested that effort could be invested in allowing the (checkers)
program to learn scoring polynomials, an activity which would equal symbolic regression.
Yet, in his 1967 follow-up work [1797], he could not report any progress on this issue.
The evolutionary programming approach for evolving finite state machines by Fogel et al.
[708], discussed in Chapter 6 on page 231, dates back to 1966. In order to build predictors,
different forms of mutation (but no crossover) were used for creating offspring from successful
individuals.
Fourteen years later, the next generation of scientists began to look for ways to evolve
programs. New results were reported by Smith [1912] in his PhD thesis in 1980. Forsyth
[733] evolved trees denoting fully bracketed Boolean expressions for classification problems
in 1981 [733, 735, 734].
The mid-1980s were a very productive period for the development of Genetic Program-
ming. Cramer [462] applied a genetic algorithm in order to evolve a program written in a
subset of the programming language PL in 1985.6 This GA used a string of integers as genome
and employed a genotype-phenotype mapping that recursively transformed them into pro-
gram trees. At the same time, the undergraduate student Schmidhuber [1828] also used a
genetic algorithm to evolve programs at the Siemens AG. He re-implemented his approach
in Prolog at the TU Munich in 1987 [562, 1828]. Hicklin [924] and Fujiki [754] implemented
reproduction operations for manipulating the if-then clauses of LISP programs consisting of
single COND-statements. With this approach, Fujiki and Dickinson [753] evolved strategies
for playing the iterated prisoner’s dilemma game. Bickel and Bickel [206] evolved sets of
rules which were represented as trees using tree-based mutation and crossover operators.
Genetic Programming became fully accepted at the end of this productive decade mainly
because of the work of Koza [1183, 1184]. He also studied many benchmark applications of
Genetic Programming, such as learning of Boolean functions [1190, 1185], the Artificial Ant
problem7 [1188, 1187, 1196], and symbolic regression8 [1190, 1196], a method for obtaining
mathematical expressions that match given data samples. Koza formalized (and patented
[1183, 1194]) the idea of employing genomes purely based on tree data structures rather than
string chromosomes as used in genetic algorithms. In symbolic regression, such trees can, for
instance, encode Lisp S-expressions9 where a node stands for a mathematical operation and
its child nodes are the parameters of the operation. Leaf nodes then are terminal symbols
like numbers or variables. This form of Genetic Programming is called Standard Genetic
Programming or SGP, in short. With it, not only mathematical functions but also more
complex programs can be expressed.
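As an illustration of such an expression tree, the sketch below encodes the S-expression (+ (* x x) 1) as nested tuples and evaluates it recursively. The tuple encoding, the small function set, and the variable dictionary are assumptions of this example, not a representation prescribed by the text.

```python
import operator

# Inner nodes are tuples (symbol, child, child, ...); leaves are variable
# names (strings) or numeric constants.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, variables):
    """Recursively evaluate an expression tree for the given variable values."""
    if isinstance(node, tuple):
        symbol, *children = node
        args = [evaluate(child, variables) for child in children]
        return FUNCTIONS[symbol](*args)
    if isinstance(node, str):
        return variables[node]   # terminal symbol: variable
    return node                  # terminal symbol: constant

expr = ("+", ("*", "x", "x"), 1)   # corresponds to (+ (* x x) 1)
```

In symbolic regression, a fitness function would call such an evaluator for every data sample and compare the result with the known output.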
Generally, a tree can represent a rule set [1389, 1390], a mathematical expression, a
decision tree [1193], or even the blueprint of an electrical circuit [1082]. Trees are very
close to the natural structure of algorithms and programs. The syntax of most of the high-
level programming languages, for example, leads to a certain hierarchy of modules and
alternatives. Not only does this form normally constitute a tree – compilers even use tree
representations internally. When reading the source code of a program, they first split it into
tokens10, parse11 these tokens, and finally create an abstract syntax tree12 (AST) [1065, 961].
The internal nodes of ASTs are labeled by operators and the leaf nodes contain the operands
5 Compare with Section 3.1 on page 141.
6 Cramer’s approach is discussed in Section 4.4.1 on page 171.
7 The Artificial Ant is discussed in Section 21.3.1 on page 354 in this book.
8 More information on symbolic regression is presented in Section 23.1 on page 397 in this book.
9 Lisp S-expressions are discussed in Section 30.3.11 on page 571
10 http://en.wikipedia.org/wiki/Lexical_analysis [accessed 2007-07-03]
11 http://en.wikipedia.org/wiki/Parse_tree [accessed 2007-07-03]
12 http://en.wikipedia.org/wiki/Abstract_syntax_tree [accessed 2007-07-03]

Algorithm:
Pop = createPop(s)
Input: s the size of the population to be created
Data: i a counter variable
Output: Pop the new, random population
begin
  Pop ← ()
  i ← s
  while i > 0 do
    Pop ← appendList(Pop, create())
    i ← i − 1
  return Pop
end

Program (Schematic Java, High-Level Language):
List<IIndividual> createPop(s) {
  List<IIndividual> Xpop;
  Xpop = new ArrayList<IIndividual>(s);
  for(int i=s; i>0; i--) {
    Xpop.add(create());
  }
  return Xpop;
}

Abstract Syntax Tree Representation:
{block}
  ← Pop ()
  ← i s
  while
    > i 0
    {block}
      ← Pop appendList(Pop, create())
      ← i (− i 1)
  ret Pop

Figure 4.2: The AST representation of algorithms/programs.
of these operators. In principle, we can illustrate almost every13 program or algorithm as
such an AST (see Figure 4.2).
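The step from source code to an abstract syntax tree can be observed with the parser that ships with Python. The sketch below uses the standard ast module on a Python transcription of the createPop algorithm; the transcription itself (names such as create_pop) is an assumption made for this example.

```python
import ast

SOURCE = """
def create_pop(s, create):
    pop = []
    i = s
    while i > 0:
        pop.append(create())
        i = i - 1
    return pop
"""

tree = ast.parse(SOURCE)           # tokenizes and parses the source text
function = tree.body[0]            # the FunctionDef node of create_pop
statement_kinds = [type(stmt).__name__ for stmt in function.body]
loop = function.body[2]            # the While node with its test and body
loop_body_kinds = [type(stmt).__name__ for stmt in loop.body]
```

The function body consists of two assignments, a while node, and a return node, and the while node in turn holds a comparison as its test and a nested block of statements. This is the same kind of hierarchy as sketched for the algorithm above.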
Tree-based Genetic Programming directly evolves individuals in this form, which also
provides a very intuitive representation for mathematical functions, for which it was initially
used by Koza. Another interesting aspect of the tree genome is that it has no natural
role model. While genetic algorithms match their direct biological metaphor particularly
well, Genetic Programming introduces completely new characteristics and traits. Genetic
Programming is one of the few techniques that are able to learn solutions of potentially
unbounded complexity. It can be considered as more general than genetic algorithms, because
it makes fewer assumptions about the structure of possible solutions. Furthermore, it often
offers white-box solutions that are human-interpretable. Other optimization approaches
like artificial neural networks, for example, generate black-box outputs, which are highly
complicated if not impossible to fully grasp [1382].
13 Excluding such algorithms and programs that contain jumps (the infamous “goto”) that would
produce crossing lines in the flowchart (http://en.wikipedia.org/wiki/Flowchart [accessed 2007-07-03]).

4.2 General Information
4.2.1 Areas Of Application
Some example areas of application of Genetic Programming are:
Application | References
Symbolic Regression and Function Synthesis | [1190, 1196, 87, 2270, 1699, 1196, 17, 528]; Section 23.1
Grammar Induction | [1042, 1394, 465, 1174]
Data Mining and Data Analysis | [1186, 744, 1592, 1593, 242, 445, 1193, 2253, 332]; Section 22.1.2
Electrical Engineering and Circuit Design | [1082, 1182, 1080, 1206, 1205, 1211, 1669, 506]
Medicine | [2055, 270, 243, 956]
Economics and Finance | [1191, 1513, 1674, 1577]
Geometry and Physics | [1307, 2277]
Cellular Automata and Finite State Machines | [58, 59, 508, 509]
Automated Programming | [140, 1242, 1324, 1325, 1317, 1212]
Robotics | [1201, 1202, 1204, 986, 1317, 57, 1576, 986, 1323]
Networking and Communication | [434, 504, 2180, 1257, 1887, 1888]; Section 24.1 on page 413 and Section 23.2 on page 401
Evolving Behaviors, e.g., for Agents or Game Players | [1187, 179, 180, 1688, 1686, 1687, 907, 909, 55, 54, 1933, 67, 1492, 984, 987, 985, 986, 1340, 1341, 1342, 2194, 1323]
Pattern Recognition | [53, 56, 2015, 2014, 2016]
Biochemistry | [1200, 1199]
Machine Learning | [1203, 863]
See also Section 4.4.3 on page 174, Section 4.5.6 on page 184, and Section 4.7.4 on page 201.
4.2.2 Conferences, Workshops, etc.
Some conferences and workshops on Genetic Programming are:
EuroGP: European Conference on Genetic Programming
http://www.evostar.org/ [accessed 2007-09-05]
Co-located with EvoWorkshops and EvoCOP.
History: 2009: Tübingen, see [2106]
2008: Naples, Italy, see [1579]
2007: Valencia, Spain, see [617]
2006: Budapest, Hungary, see [429]

2005: Lausanne, Switzerland, see [1116]
2004: Coimbra, Portugal, see [1115]
2003: Essex, UK, see [1786]
2002: Kinsale, Ireland, see [737]
2001: Lake Como, Italy, see [1423]
2000: Edinburgh, Scotland, UK, see [1666]
1999: Göteborg, Sweden, see [1664]
1998: Paris, France, see [141, 1663]
GECCO: Genetic and Evolutionary Computation Conference
see Section 2.2.2 on page 107
GP: Annual Genetic Programming Conference
Now part of GECCO, see Section 2.2.2 on page 107
History: 1998: Madison, Wisconsin, USA, see [1209, 1198]
1997: Stanford University, CA, USA, see [1208, 1956]
1996: Stanford University, CA, USA, see [1207, 1197]
GPTP: Genetic Programming Theory and Practice Workshop
http://www.cscs.umich.edu/gptp-workshops/ [accessed 2007-09-28]
History: 2007: Ann Arbor, Michigan, USA, see [1945]
2006: Ann Arbor, Michigan, USA, see [1735]
2005: Ann Arbor, Michigan, USA, see [2298]
2004: Ann Arbor, Michigan, USA, see [1583]
2003: Ann Arbor, Michigan, USA, see [1734]
ICANNGA: International Conference on Adaptive and Natural Computing Algorithms
see Section 2.2.2 on page 108
Mendel: International Conference on Soft Computing
see Section 1.6.2 on page 90
4.2.3 Journals
Some journals that deal (at least partially) with Genetic Programming are:
Genetic Programming and Evolvable Machines (GPEM), ISSN: 1389-2576 (Print) 1573-7632
(Online), appears quarterly, editor(s): Wolfgang Banzhaf, publisher: Springer Netherlands,
http://springerlink.metapress.com/content/104755/ [accessed 2007-09-28]
4.2.4 Online Resources
Some general resources on Genetic Programming that are available online are:
http://www.genetic-programming.org/ [accessed 2007-09-20] and http://www.genetic-programming.com/ [accessed 2007-09-20]
Last update: up-to-date
Description: Two portal pages on Genetic Programming websites, both maintained by Koza.
http://www.cs.bham.ac.uk/~wbl/biblio/ [accessed 2007-09-16]
Last update: up-to-date
Description: Langdon’s large Genetic Programming bibliography.
http://www.lulu.com/items/volume_63/2167000/2167025/2/print/book.pdf [accessed 2008-03-26]

Last update: up-to-date
Description: A Field Guide to Genetic Programming, see [1667]
http://www.aaai.org/AITopics/html/genalg.html [accessed 2008-05-17]
Last update: up-to-date
Description: The genetic algorithms and Genetic Programming pages of the AAAI
http://www.cs.ucl.ac.uk/staff/W.Langdon/www_links.html [accessed 2008-05-18]
Last update: 2007-07-28
Description: William Langdon’s Genetic Programming contacts
4.2.5 Books
Some books about (or including significant information about) Genetic Programming are:
Koza [1196]: Genetic Programming, On the Programming of Computers by Means of Natural
Selection
Poli, Langdon, and McPhee [1667]: A Field Guide to Genetic Programming
Koza [1195]: Genetic Programming II: Automatic Discovery of Reusable Programs
Koza, Bennett III, Andre, and Keane [1210]: Genetic Programming III: Darwinian Invention
and Problem Solving
Koza, Keane, Streeter, Mydlowec, Yu, and Lanza [1212]: Genetic Programming IV: Routine
Human-Competitive Machine Intelligence
Langdon and Poli [1242]: Foundations of Genetic Programming
Langdon [1238]: Genetic Programming and Data Structures: Genetic Programming + Data
Structures = Automatic Programming!
Banzhaf, Nordin, Keller, and Francone [140]: Genetic Programming: An Introduction – On
the Automatic Evolution of Computer Programs and Its Applications
Kinnear, Jr. [1140]: Advances in Genetic Programming, Volume 1
Angeline and Kinnear, Jr [61]: Advances in Genetic Programming, Volume 2
Spector, Langdon, O’Reilly, and Angeline [1936]: Advances in Genetic Programming, Volume 3
Brameier and Banzhaf [275]: Linear Genetic Programming
Wong and Leung [2253]: Data Mining Using Grammar Based Genetic Programming and
Applications
Geyer-Schulz [795]: Fuzzy Rule-Based Expert Systems and Genetic Machine Learning
Spector [1932]: Automatic Quantum Computer Programming – A Genetic Programming
Approach
Nedjah, Abraham, and de Macedo Mourelle [1511]: Genetic Systems Programming: Theory
and Experiences
4.3 (Standard) Tree Genomes
Tree-based Genetic Programming (TGP), usually referred to as Standard Genetic Programming
(SGP), is the most widespread Genetic Programming variant, both for historical reasons
and because of its efficiency in many problem domains. In this section, the well-known
reproduction operations applicable to tree genomes are outlined.
4.3.1 Creation: Nullary Reproduction
Before the evolutionary process can begin, we need an initial, randomized population. In
genetic algorithms, we therefore simply created a set of random bit strings. For Genetic
Programming, we do the same with trees instead of such one-dimensional sequences.

Normally, there is a maximum depth d specified that the tree individuals are not allowed
to surpass. Then, the creation operation will return only trees where the path between the
root and the most distant leaf node is not longer than d. There are three different ways for
realizing the “create()” operation (see Definition 2.9 on page 137) for trees which can be
distinguished according to the depth of the produced individuals.
Figure 4.3: Tree creation by the full method.
Figure 4.4: Tree creation by the grow method.
The full method (Figure 4.3) creates trees where each (non-backtracking) path from the
root to the leaf nodes has exactly the length d. The grow method, depicted in Figure 4.4,
creates trees where each (non-backtracking) path from the root to the leaf nodes is not
longer than d but may be shorter. This is achieved by deciding randomly for each node if
it should be a leaf or not when it is attached to the tree. Of course, only leaf nodes can
be attached to nodes of the depth d − 1.
Koza [1196] additionally introduced a mixture method called ramped half-and-half. For
each tree to be created, this algorithm draws a number r uniformly distributed between 2
and d: r = ⌊random_u(2, d + 1)⌋. Now either full or grow is chosen to finally create a tree
with the maximum depth r (in place of d). This method is often preferred since it produces
an especially wide range of different tree depths and shapes and thus provides a great
initial diversity.
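The three creation methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are nested lists of the form [symbol, child, ...], depth is counted in nodes along a path (the root alone has depth 1), the function and terminal sets are hypothetical, and the grow method turns a node into a leaf with probability 0.5, a value the text does not prescribe.

```python
import random

FUNCTIONS = {"+": 2, "*": 2, "neg": 1}   # hypothetical function set: symbol -> arity
TERMINALS = ["a", "b", "1"]              # hypothetical terminal set


def create(depth, method, rng):
    """Create a nested-list tree [symbol, child, ...] with at most `depth` levels."""
    # A leaf is forced on the last level; "grow" may also pick a leaf earlier.
    if depth <= 1 or (method == "grow" and rng.random() < 0.5):
        return [rng.choice(TERMINALS)]
    symbol = rng.choice(sorted(FUNCTIONS))
    return [symbol] + [create(depth - 1, method, rng) for _ in range(FUNCTIONS[symbol])]


def leaf_depths(tree, level=1):
    """Yield the level of every leaf (the root is on level 1)."""
    if len(tree) == 1:
        yield level
    for child in tree[1:]:
        yield from leaf_depths(child, level + 1)


def ramped_half_and_half(d, rng):
    """Draw r uniformly from 2..d, then build a tree with full or grow."""
    r = rng.randint(2, d)
    return create(r, rng.choice(["full", "grow"]), rng)


rng = random.Random(42)
population = [ramped_half_and_half(5, rng) for _ in range(10)]
```

With full, every value produced by leaf_depths equals the requested depth; with grow, the values are only bounded by it.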
4.3.2 Mutation: Unary Reproduction
Tree genotypes may undergo small variations during the reproduction process in the evo-
lutionary algorithm. Such a mutation is usually defined as the random selection of a node
in the tree, removing this node and all of its children, and finally replacing it with another
node [1196]. From this idea, three operators can be derived:
1. replacement of existing nodes with randomly created ones (Fig. 4.5.a),
2. insertions of new nodes or small trees (Fig. 4.5.b), and
3. the deletion of nodes, as illustrated in Fig. 4.5.c.
The effects of insertion and deletion can also be achieved with replacement.
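Since insertion and deletion can be expressed through replacement, a single replacement operator is enough for a first sketch. The code below assumes the nested-list tree form [symbol, child, ...]; the helper names paths, replace_at, and mutate as well as the make_subtree callback are hypothetical and not taken from the book.

```python
import random


def paths(tree, prefix=()):
    """Yield the index path of every node in a nested-list tree."""
    yield prefix
    for i, child in enumerate(tree[1:], start=1):
        yield from paths(child, prefix + (i,))


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the sub-tree at `path` is replaced."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy


def mutate(tree, make_subtree, rng):
    """Pick one node uniformly at random and replace its whole sub-tree."""
    path = rng.choice(list(paths(tree)))
    return replace_at(tree, path, make_subtree(rng))


parent = ["+", ["b"], ["-", ["7"], ["4"]]]
child = mutate(parent, lambda rng: [rng.choice(["a", "1"])], random.Random(0))
```

The parent is left unchanged; replace_at copies only the nodes on the path to the replaced sub-tree.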

Fig. 4.5.a: Sub-tree replacement.
Fig. 4.5.b: Sub-tree insertions.
Fig. 4.5.c: Sub-tree deletion.
Figure 4.5: Possible tree mutation operations.
4.3.3 Recombination: Binary Reproduction
The mating process in nature – the recombination of the genotypes of two individuals –
is also copied in tree-based Genetic Programming. Applying the default sub-tree exchange
recombination operator to two trees means to swap sub-trees between them as illustrated
in Figure 4.6. To do so, one sub-tree is selected randomly from each of the parents; the
two sub-trees are then cut out and reinserted in the partner genotype. Notice that, like in
genetic algorithms, the effects of insertion and deletion operations can also be achieved by
recombination.
Figure 4.6: Tree crossover by exchanging sub-trees.
If a depth restriction is imposed on the genome, both the mutation and the crossover
operations have to respect it: the new trees they create must not exceed the maximum depth.
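A minimal sketch of sub-tree exchange under a depth restriction is given below. It assumes nested-list trees [symbol, child, ...] and depth counted in nodes. Retrying a bounded number of times and falling back to the unchanged parents is one possible way to respect the limit; it is an assumption of this sketch, not a rule stated in the text.

```python
import random


def paths(tree, prefix=()):
    """Yield the index path of every node in a nested-list tree."""
    yield prefix
    for i, child in enumerate(tree[1:], start=1):
        yield from paths(child, prefix + (i,))


def get_at(tree, path):
    """Return the sub-tree found by following `path` from the root."""
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the sub-tree at `path` is replaced."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy


def depth(tree):
    """Number of nodes on the longest path from the root to a leaf."""
    return 1 + max((depth(c) for c in tree[1:]), default=0)


def crossover(mom, dad, max_depth, rng, tries=20):
    """Swap one random sub-tree between the parents, respecting max_depth."""
    for _ in range(tries):
        p1 = rng.choice(list(paths(mom)))
        p2 = rng.choice(list(paths(dad)))
        child1 = replace_at(mom, p1, get_at(dad, p2))
        child2 = replace_at(dad, p2, get_at(mom, p1))
        if depth(child1) <= max_depth and depth(child2) <= max_depth:
            return child1, child2
    return mom, dad  # fallback: no admissible exchange found, keep the parents
```

Because the two sub-trees are swapped, the total number of nodes in both offspring equals that of both parents.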
The intent of using the recombination operation in Genetic Programming is the same
as in genetic algorithms. Over many generations, successful building blocks – for example a
highly fit expression in a mathematical formula – should spread throughout the population
and be combined with good genes of different solution candidates. Yet, recombination in
Standard Genetic Programming can also have a very destructive effect on the individual
fitness [1525, 1544, 140]. Angeline [62] even argues that it performs no better than mutation
and causes bloat [65].

Several techniques have been proposed in order to mitigate these effects. In 1994,
D’Haeseleer [557] obtained modest improvements with his strong context preserving
crossover that permitted only the exchange of sub-trees that occupied the same positions
in the parents. Poli and Langdon [1661, 1662] define the similar single-point crossover for
tree genomes with the same purpose: increasing the probability of exchanging genetic
material which is structurally and functionally akin and thus decreasing the disruptiveness.
A related approach defined by Francone et al. [740] for linear Genetic Programming is
discussed in Section 4.6.7 on page 195.
4.3.4 Permutation: Unary Reproduction
The tree permutation operation illustrated in Figure 4.7 resembles the permutation operation
of string genomes or the inversion used in messy GA (Section 3.7.2, [1196]). Like mutation,
it is used to reproduce one single tree. It first selects an internal node of the parental tree.
The child nodes attached to that node are then shuffled randomly, i. e., permuted. If
the tree represents a mathematical formula and the operation represented by the node is
commutative, this has no direct effect. The main goal is to re-arrange the nodes in highly
fit sub-trees in order to make them less fragile for other operations such as recombination.
The effects of this operation are doubtful and most often it is not applied [1196].
Figure 4.7: Tree permutation – (asexually) shuffling sub-trees.
4.3.5 Editing: Unary Reproduction
Editing trees in Genetic Programming is what simplifying is to mathematical formulas. Take
x = b + (7 − 4) + (1 ∗ a) for instance. This expression clearly can be written in a shorter
way by replacing (7 − 4) with 3 and (1 ∗ a) with a. By doing so, we improve its readability
and also decrease the computational time for concrete values of a and b. Similar measures
can often
be applied to algorithms and program code. Editing a tree as outlined in Figure 4.8 means
to create a new offspring tree which is more efficient but, in terms of functional aspects,
equivalent to its parent. It is thus a very domain-specific operation.
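For arithmetic expressions, editing can be sketched as constant folding plus the removal of neutral factors. The representation below is an assumption made for brevity: a tree is a number, a variable name, or a tuple (operator, left, right). Only the two simplifications from the example are covered.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def edit(tree):
    """Return a functionally equivalent but smaller expression tree."""
    if not isinstance(tree, tuple):
        return tree                      # numbers and variable names stay as they are
    op, left, right = tree[0], edit(tree[1]), edit(tree[2])
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # fold constants, e.g. (7 - 4) becomes 3
    if op == "*" and left == 1:
        return right                     # 1 * a becomes a
    if op == "*" and right == 1:
        return left                      # a * 1 becomes a
    return (op, left, right)


# b + (7 - 4) + (1 * a), grouped as (b + (7 - 4)) + (1 * a)
before = ("+", ("+", "b", ("-", 7, 4)), ("*", 1, "a"))
after = edit(before)
```

A real editing operator would need one rule set per problem domain, which is why the operation is so domain-specific.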
Figure 4.8: Tree editing – (asexual) optimization.
A positive aspect of editing is that it usually reduces the number of nodes in a tree
by removing useless expressions, for instance. This makes it easier for recombination
operations to pick “important” building blocks. At the same time, the expression (7 − 4)
is now less likely to be destroyed by the reproduction processes since it is replaced by the
single terminal node 3.
On the other hand, editing also reduces the diversity in the genome which could degrade
the performance by decreasing the variety of structures available. Another negative aspect
would be if (in our example) a fitter expression was (7 − (4 ∗ a)) and a is a variable close to
1. Then, transforming (7 − 4) into 3 prevents a transition to the fitter expression.
In Koza’s experiments, Genetic Programming with and without editing showed equal
performance, so this operation is not necessarily needed [1196].
4.3.6 Encapsulation: Unary Reproduction
The idea behind the encapsulation operation is to identify potentially useful sub-trees and
to turn them into atomic building blocks as sketched in Figure 4.9. To put it plainly, we
create new terminal symbols that (internally hidden) are trees with multiple nodes. This way,
they will no longer be subject to potential damage by other reproduction operations. The
new terminal may spread throughout the population in the further course of the evolution.
According to Koza, this operation has no substantial effect but may be useful in special
applications like the evolution of artificial neural networks [1196].
Figure 4.9: An example for tree encapsulation.
4.3.7 Wrapping: Unary Reproduction
Applying the wrapping operation means to first select an arbitrary node n in the tree.
Additionally, we create a new non-terminal node m outside of the tree. In m, at least one
child node position is left unoccupied. We then cut n (and all its potential child nodes) from
the original tree and append it to m by plugging it into the free spot. Now we hang m into
the tree position that formerly was occupied by n.
Figure 4.10: An example for tree wrapping.

The purpose of this reproduction method illustrated in Figure 4.10 is to allow modi-
fications of non-terminal nodes that have a high probability of being useful. Simple mu-
tation would, for example, cut n from the tree or replace it with another expression.
This will always change the meaning of the whole sub-tree below n dramatically, like for
example in (b+3) + a −→ (b*3) + a. By wrapping however, a more subtle change like
(b+3) + a −→ ((b+1)+3) + a is possible.
The wrapping operation is introduced by the author – at least, I have not seen another
source where it is used.
4.3.8 Lifting: Unary Reproduction
While wrapping allows nodes to be inserted in non-terminal positions with only a small
change of the tree’s semantics, lifting is able to remove them in the same way. It is the
inverse operation
to wrapping, which becomes obvious when comparing Figure 4.10 and Figure 4.11.
Figure 4.11: An example for tree lifting.
Lifting begins with selecting an arbitrary inner node n of the tree. This node then replaces
its parent node. The parent node, including all of its child nodes (except n), is removed
from the tree. With lifting, a tree that represents the mathematical formula (b + (1 − a)) ∗ 3
can be transformed to b ∗ 3 in a single step. Lifting is used by the author in his experiments
with Genetic Programming (see for example Section 24.1.2 on page 414). I, however, have
not yet found other sources using a similar operation.
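Wrapping and lifting can be sketched as two small path-based helpers. The code assumes nested-list trees [symbol, child, ...]; the node to work on is addressed by an index path chosen by the caller, and the new non-terminal m is passed as a list with None marking its free child position. These conventions are assumptions of the sketch.

```python
def get_at(tree, path):
    """Return the sub-tree found by following `path` from the root."""
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the sub-tree at `path` is replaced."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy


def wrap(tree, path, m, slot):
    """Plug the node n at `path` into the free `slot` of m and hang m in n's place."""
    wrapped = list(m)
    wrapped[slot] = get_at(tree, path)
    return replace_at(tree, path, wrapped)


def lift(tree, path):
    """Let the node at `path` replace its parent (the root has no parent)."""
    if not path:
        raise ValueError("the root cannot be lifted")
    return replace_at(tree, path[:-1], get_at(tree, path))


# (b+3) + a  -->  ((b+1)+3) + a, the wrapping example from Section 4.3.7
original = ["+", ["+", ["b"], ["3"]], ["a"]]
wrapped = wrap(original, (1, 1), ["+", None, ["1"]], 1)

# (b + (1 - a)) * 3  -->  b * 3, the lifting example from the text
lifted = lift(["*", ["+", ["b"], ["-", ["1"], ["a"]]], ["3"]], (1, 1))
```

Lifting the wrapped node again, here lift(wrapped, (1, 1, 1)), restores the original tree, which illustrates that the two operations are inverse to each other.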
4.3.9 Automatically Defined Functions
The concept of automatically defined functions (ADFs) introduced by Koza [1196] provides
some sort of pre-specified modularity for Genetic Programming. Finding a way to evolve
modules and reusable building blocks is one of the key issues in using GP to derive higher-
level abstractions and solutions to more complex problems [66, 67, 1195]. If ADFs are used,
a certain structure is defined for the genome. The root of the tree usually loses its functional
responsibility and now serves only as glue that holds the individual together and has a fixed
number n of children, from which n − 1 are automatically defined functions and one is the
result-generating branch. When evaluating the fitness of an individual, often only this first
branch is taken into consideration whereas the root and the ADFs are ignored. The result-
generating branch, however, may use any of the automatically defined functions to produce
its output.
When ADFs are employed, typically not only their number must be specified beforehand
but also the number of arguments of each of them. How this works can perhaps best be
illustrated by using the example given in Figure 4.12. It stems from function
approximation14, since this is the area where many early examples of the idea of ADFs come
from.
Assume that the goal of GP is to approximate a function g with the one parameter x
and that a genome is used where two functions (f0 and f1) are automatically defined. f0
14 A very common example for function approximation, Genetic Programming-based symbolic re-
gression, is discussed in Section 23.1 on page 397.

Figure 4.12: A concrete example for automatically defined functions.
has a single formal parameter a and f1 has two formal parameters a and b. The genotype
in Figure 4.12 encodes the following mathematical functions:
g(x) = f1(4, f0(x)) ∗ (f0(x) + 3)
f0(a) = a + 7
f1(a, b) = (−a) ∗ b
Hence, g(x) ≡ ((−4) ∗ (x + 7)) ∗ ((x + 7) + 3). The number of children of the function
calls in the result-generating branch must be equal to the number of the parameters of the
corresponding ADF.
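The evaluation of such an individual can be sketched with a tiny interpreter. The data layout is hypothetical: each ADF is stored as a pair (formal parameters, body), trees are numbers, parameter names, or tuples (symbol, argument, ...), and "neg" denotes the unary minus. The check on the argument count mirrors the requirement that a call must have as many children as the ADF has parameters.

```python
def evaluate(node, env, adfs):
    """Evaluate an expression tree in the variable binding `env`."""
    if isinstance(node, (int, float)):
        return node
    if isinstance(node, str):
        return env[node]
    symbol = node[0]
    args = [evaluate(arg, env, adfs) for arg in node[1:]]
    if symbol in adfs:
        params, body = adfs[symbol]
        if len(params) != len(args):
            raise ValueError(f"{symbol} expects {len(params)} arguments")
        # The ADF body only sees its own formal parameters.
        return evaluate(body, dict(zip(params, args)), adfs)
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "neg":
        return -args[0]
    raise ValueError(f"unknown symbol {symbol!r}")


ADFS = {
    "f0": (("a",), ("+", "a", 7)),                  # f0(a) = a + 7
    "f1": (("a", "b"), ("*", ("neg", "a"), "b")),   # f1(a, b) = (-a) * b
}
# result-generating branch: g(x) = f1(4, f0(x)) * (f0(x) + 3)
RESULT_BRANCH = ("*", ("f1", 4, ("f0", "x")), ("+", ("f0", "x"), 3))


def g(x):
    return evaluate(RESULT_BRANCH, {"x": x}, ADFS)
```

For every x, g(x) agrees with the expanded form ((−4) ∗ (x + 7)) ∗ ((x + 7) + 3).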
Although ADFs were first introduced in symbolic regression by Koza [1196], they can
also be applied to a variety of other problems like in the evolution of agent behaviors [1688,
1686, 52, 55], electrical circuit design [1206], or the evolution of robotic behavior [57].
4.3.10 Automatically Defined Macros
Spector’s idea of automatically defined macros (ADMs) complements the ADFs of Koza
[1928, 1929]. Both concepts are very similar and only differ in the way that their parameters
are handled. The parameters in automatically defined functions are always values whereas
automatically defined macros work on code expressions. This difference shows up only when
side-effects come into play.
In Figure 4.13, we have illustrated the pseudo-code of two programs – one with a function
(called ADF) and one with a macro (called ADM). Each program has a variable x which is
initially zero. The function y() has the side-effect that it increments x and returns its new
value. Both the function and the macro return a sum containing their parameter a two
times. The parameter of ADF is evaluated before ADF is invoked. Hence, x is incremented one
time and 1 is passed to ADF, which then returns 2=1+1. The parameter of the macro, however,
is the invocation of y(), not its result. Therefore, the ADM amounts to two calls to y(),
resulting in x being incremented two times and in 3=1+2 being returned.
The ideas of automatically defined macros and automatically defined functions are very
close to each other. Automatically defined macros are likely to be useful in scenarios where
context-sensitive or side-effect-producing operators play important roles [1928, 1929]. In
other scenarios, there is not much difference between the application of ADFs and ADMs.
Finally, it should be mentioned that the concepts of automatically defined functions and
Program with ADF:
  variable x=0
  subroutine y()
  begin
    x++
    return x
  end
  function ADF(param a) ≡ (a+a)
  main_program
  begin
    print(“out: “ + ADF(y))
  end
...roughly resembles
  main_program
  begin
    variable temp1, temp2
    temp1 = y()
    temp2 = temp1 + temp1
    print(“out: “ + temp2)
  end
produced output:
  exec main_program
  -> “out: 2”

Program with ADM:
  variable x=0
  subroutine y()
  begin
    x++
    return x
  end
  macro ADM(param a) ≡ (a+a)
  main_program
  begin
    print(“out: “ + ADM(y))
  end
...roughly resembles
  main_program
  begin
    variable temp
    temp = y() + y()
    print(“out: “ + temp)
  end
produced output:
  exec main_program
  -> “out: 3”

Figure 4.13: Comparison of functions and macros.
macros are not restricted to the standard tree genomes but are also applicable in other
forms of Genetic Programming, such as linear Genetic Programming or PADO.15
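The difference between the two parameter-passing styles can be reproduced in Python. Python has no macros, so the sketch approximates the ADM by passing the unevaluated expression as a callable; this substitution is an assumption made for illustration only.

```python
def make_counter():
    """Create the side-effecting subroutine y() together with its variable x."""
    state = {"x": 0}

    def y():
        state["x"] += 1
        return state["x"]

    return y, state


def adf(a):
    """Function: receives the already computed value of its argument."""
    return a + a


def adm(a):
    """Macro-like: receives the expression itself and evaluates it twice."""
    return a() + a()


y, state = make_counter()
out_adf = adf(y())            # y() runs once before the call: 1 + 1
x_after_adf = state["x"]

y, state = make_counter()
out_adm = adm(y)              # y() runs twice inside the body: 1 + 2
x_after_adm = state["x"]
```

Without the side effect on x, both variants would return the same value.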
4.3.11 Node Selection
In most of the reproduction operations for tree genomes, in mutation as well as in recom-
bination, certain nodes in the trees need to be selected. In order to apply the mutation, we
first need to find the node which is to be altered. For recombination, we need one node in
each parent tree. These nodes are then exchanged. The question of how to select these nodes
may seem more or less irrelevant, but it plays an important role in practice. The literature
most often speaks of “randomly selecting” a node but does not describe how exactly this
should be done.
A good method for doing so could select all nodes c and n in the tree t
with exactly the same probability, as done by the method “uniformSelectNode”, i. e.,
P(uniformSelectNode(t) = c) = P(uniformSelectNode(t) = n) ∀c, n ∈ t.
15 Linear Genetic Programming is discussed in Section 4.6 on page 191 and a summary on PADO
can be found in Section 4.7.1 on page 196.

Therefore, we define the weight nodeWeight(n) of a tree node n to be the total num-
ber of nodes in the sub-tree with n as root, i. e., itself, its children, grandchildren, grand-
grandchildren, etc.
$$\mathrm{nodeWeight}(n) = 1 + \sum_{i=0}^{\mathrm{len}(n.\mathrm{children})-1} \mathrm{nodeWeight}(n.\mathrm{children}[i]) \qquad (4.1)$$
Thus, the nodeWeight of the root of a tree is the number of all nodes in the tree and
the nodeWeight of each of the leaves is exactly 1. In uniformSelectNode, the probability for
a node of being selected in a tree t is thus 1/nodeWeight(t). We can create such a probability
distribution by descending the tree from the root according to Algorithm 4.1.
Algorithm 4.1: n ←− uniformSelectNode(t)
Input: t: the (root of the) tree to select a node from
Data: c: the currently investigated node
Data: c.children: the list of child nodes of c
Data: b, d: two Boolean variables
Data: r: a value uniformly distributed in [0, nodeWeight(c))
Data: i: an index
Output: n: the selected node
begin
  b ←− true
  c ←− t
  while b do
    r ←− ⌊random_u(0, nodeWeight(c))⌋
    if r ≥ nodeWeight(c) − 1 then b ←− false
    else
      i ←− len(c.children) − 1
      while i ≥ 0 do
        r ←− r − nodeWeight(c.children[i])
        if r < 0 then
          c ←− c.children[i]
          i ←− −1
        else
          i ←− i − 1
  return c
end
A tree descent with probabilities different from those defined here will lead to
unbalanced node selection probability distributions. Then, the reproduction operators will
prefer accessing some parts of the trees while very rarely altering the other regions. We could,
for example, descend the tree by starting at the root t and return the current node
with probability 0.5 or recursively go to one of its children (also with 50% probability). Then,
the root t would have a 50 : 50 chance of being the starting point of the reproduction operation.
Its direct children have at most probability 0.5^2/len(t.children) each, their children even
only 0.5^3/(len(t.children) ∗ len(t.children[i].children)), and so on. Hence, the leaves would
almost never take part actively in reproduction. We could also choose other probabilities
which strongly prefer going down to the children of the tree, but then, the nodes near the
root will most likely be left untouched during reproduction. Often, this approach is favored
by selection methods, although leaves in different branches of the tree are not chosen with
the same probabilities if the branches differ in depth. When applying Algorithm 4.1, on the
other hand, there exist no regions in the trees that have lower selection probabilities than
others.
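Algorithm 4.1 translates almost line by line into Python. The Node class below is a hypothetical helper that caches nodeWeight at construction time, so the cached value is only valid as long as the tree is not modified afterwards.

```python
import random


class Node:
    """Tree node that caches its nodeWeight (Equation 4.1) when it is built."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.weight = 1 + sum(child.weight for child in self.children)


def uniform_select_node(t, rng):
    """Select each node of the tree t with probability 1/nodeWeight(t)."""
    c = t
    while True:
        r = rng.randrange(c.weight)      # integer in 0 .. nodeWeight(c) - 1
        if r >= c.weight - 1:
            return c                     # stop here with probability 1/nodeWeight(c)
        # Otherwise descend into a child, chosen proportionally to its weight.
        for child in reversed(c.children):
            r -= child.weight
            if r < 0:
                c = child
                break


root = Node("+", [Node("*", [Node("a"), Node("b")]), Node("-", [Node("c")])])
rng = random.Random(0)
counts = {}
for _ in range(6000):
    label = uniform_select_node(root, rng).label
    counts[label] = counts.get(label, 0) + 1
```

Each of the six nodes is reached with probability 1/6: the chance of descending into a child is its weight divided by the weight of the current node, and these factors cancel along the path.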

4.4 Genotype-Phenotype Mappings
Genotype-phenotype mappings (GPM, see Section 3.8 on page 155) are used in many different
Genetic Programming approaches. Here we give a few examples of them. Many of
the Grammar-guided Genetic Programming approaches discussed in Section 4.5 on page 176
are based on similar mappings.
4.4.1 Cramer’s Genetic Programming
It is interesting to see that the earliest Genetic Programming approaches were based on a
genotype-phenotype mapping. One of them, dating back to 1985, is the method of Cramer
[462]. His goal was to evolve programs in a modified subset of the programming language
PL. Two simple examples for such programs, obtained from his work, are:
;;Set variable V0 to have the value of V1
(:ZERO V0)
(:LOOP V1 (:INC V0))

;;Multiply V3 by V4 and store the result in V5
(:ZERO V5)
(:LOOP V3 (:LOOP V4 (:INC V5)))
Listing 4.1: Two examples for the PL dialect used by Cramer for Genetic Programming
On the basis of a genetic algorithm working on integer strings, he proposed two ideas on
how to convert these strings to valid program trees.
The JB Mapping
The first approach was to divide the integer string into tuples of a fixed length which is large
enough to hold the information required to encode an arbitrary instruction. In the case of
our examples, these are triplets where the first item identifies the operation, and the
following two numbers define its parameters. Superfluous information, like a second
parameter for a unary operation, is ignored.
(0  4 2) → (:BLOCK AS4 AS2)
(1  6 0) → (:LOOP V6 AS0)
(2  1 9) → (:SET V1 V9)
(3 17 8) → (:ZERO V17) ;;the 8 is ignored
(4  0 5) → (:INC V0)   ;;the 5 is ignored
Listing 4.2: An example for the JB Mapping
Here, the symbols of the form Vn and ASn represent variables and auxiliary statements,
respectively. Cramer distinguishes between input variables providing data to a program and
local (body) variables used for computation. Any of them can be chosen as output variable
at the end of the execution. The multiplication program used in Listing 4.1 can now be
encoded as (0 0 1 3 5 8 1 3 2 1 4 3 4 5 9 9 2) which translates to
(0 0 1) ;;main statement        → (:BLOCK AS0 AS1)
(3 5 8) ;;auxiliary statement 0 → (:ZERO V5)
(1 3 2) ;;auxiliary statement 1 → (:LOOP V3 AS2)
(1 4 3) ;;auxiliary statement 2 → (:LOOP V4 AS3)
(4 5 9) ;;auxiliary statement 3 → (:INC V5)
Listing 4.3: Another example for the JB Mapping
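The decoding shown in the two listings can be sketched as follows. The operation codes 0 to 4 and the rule that auxiliary statement ASn is the tuple following the n-th one after the main statement are read off Listings 4.2 and 4.3. Dropping an incomplete trailing tuple and aborting at a fixed nesting depth are assumptions of this sketch, not part of Cramer's description as given here.

```python
# operation code -> (name, kinds of the parameters that are actually used)
OPS = {
    0: ("BLOCK", "AS", "AS"),
    1: ("LOOP", "V", "AS"),
    2: ("SET", "V", "V"),
    3: ("ZERO", "V"),
    4: ("INC", "V"),
}


def jb_decode(genome, max_depth=16):
    """Translate an integer string into a program via the JB mapping."""
    usable = len(genome) - len(genome) % 3          # assumption: drop a partial tuple
    triplets = [genome[i:i + 3] for i in range(0, usable, 3)]

    def expand(index, depth):
        if depth > max_depth or index >= len(triplets):
            raise ValueError("auxiliary statement cannot be resolved")
        op, *args = triplets[index]
        name, *kinds = OPS[op]
        parts = []
        for kind, arg in zip(kinds, args):          # superfluous numbers are ignored
            if kind == "V":
                parts.append(f"V{arg}")
            else:                                   # ASn is stored in tuple n + 1
                parts.append(expand(arg + 1, depth + 1))
        return "(:" + " ".join([name] + parts) + ")"

    return expand(0, 0)


program = jb_decode([0, 0, 1, 3, 5, 8, 1, 3, 2, 1, 4, 3, 4, 5, 9, 9, 2])
```

For the integer string from the text, the auxiliary statements expand to the multiplication program of Listing 4.1 wrapped in a single :BLOCK statement.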
Cramer outlines some of the major problems of this representation, especially the strong
positional epistasis16 – the strong relation of the meaning of an instruction to its position.
This epistasis makes it very hard for the genetic operations to work efficiently, i.e., to prevent
destruction of the genotypes passed to them.
16 We come back to positional epistasis in Section 4.8.1 on page 202.

The TB Mapping
The TB mapping is essentially the same as the JB mapping, but reduces these problems a
bit. Instead of using the auxiliary statement method as done in JB, the expressions in the
TB language are decoded recursively. The string (0 (3 5) (1 3 (1 4 (4 5)))), for instance,
expands to the program tree illustrated in Listing 4.3. Furthermore, Cramer restricts
mutation to the statements near the fringe of the tree, more specifically, to leaf operators
that
do not require statements as arguments and to non-leaf operations with leaf statements as
arguments. Similar restrictions apply to crossover.
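Recursive decoding in the TB style needs no table of auxiliary statements. The sketch below reuses the operation codes of Listing 4.2 and represents the nested string as nested Python tuples; both choices are assumptions made for illustration.

```python
# operation code -> (name, kinds of the parameters that are actually used)
OPS = {
    0: ("BLOCK", "S", "S"),
    1: ("LOOP", "V", "S"),
    2: ("SET", "V", "V"),
    3: ("ZERO", "V"),
    4: ("INC", "V"),
}


def tb_decode(expression):
    """Translate a nested tuple such as (1, 3, (4, 5)) into a program string."""
    op, *args = expression
    name, *kinds = OPS[op]
    parts = []
    for kind, arg in zip(kinds, args):
        # "V" parameters are variable indices, "S" parameters are nested statements
        parts.append(f"V{arg}" if kind == "V" else tb_decode(arg))
    return "(:" + " ".join([name] + parts) + ")"


program = tb_decode((0, (3, 5), (1, 3, (1, 4, (4, 5)))))
```

A statement now carries its sub-statements with it, so moving it to another position no longer changes what it refers to.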
4.4.2 Binary Genetic Programming
With their Binary Genetic Programming (BGP) approach [136], Keller and Banzhaf [1119,
1120, 1121] further explore the utility of explicit genotype-phenotype mappings and neutral
variations in the genotypes. They called the genes in their fixed-length binary string genome
codons analogously to molecular biology where a codon is a triplet of nucleic acids in the
DNA17, encoding one amino acid at most. Each codon corresponds to one symbol in the
target language. The translation of the binary string genotype g into a string representing
an expression in the target language works as follows:
1. x ←− ε
2. Take the next gene (codon) c from g and translate it to the according symbol s.
3. If s is a valid continuation of x, set x ←− x◦s and continue in step 2.
4. Otherwise, compute the set of symbols S that would be valid continuations of x.
5. From this set, extract the set of (valid) symbols S′ which have the minimal Hamming
distance18 to the codon c.
6. From S′ take the symbol s′ which has the minimal codon value and append it to x:
x ←− x◦s′.
After this mapping, x can still be an invalid expression: if g did not contain enough
genes, the phenotype is incomplete, for example x = 3 ∗ 4 − sin(v∗. Such incomplete
sequences are fixed by consecutively appending symbols that lead to a quick end of the
expression according to some heuristic.
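The translation steps and the final repair can be sketched for a deliberately tiny target language. Everything specific here is hypothetical: the 3-bit codon table, the grammar (operands and binary operators must alternate), and the repair heuristic (append the operand a). Only the control flow follows the steps listed above.

```python
# hypothetical 3-bit codon table; several codons may encode the same symbol
CODE = {0: "a", 1: "b", 2: "1", 3: "+", 4: "*", 5: "a", 6: "b", 7: "+"}
OPERANDS = {"a", "b", "1"}
OPERATORS = {"+", "*"}


def valid_next(x):
    """Symbols that may follow x: operands and operators have to alternate."""
    return OPERANDS if len(x) % 2 == 0 else OPERATORS


def hamming(u, v):
    """Number of differing bits between two codons."""
    return bin(u ^ v).count("1")


def bgp_decode(bits):
    """Map a binary string genotype to an expression string."""
    x = []
    for i in range(0, len(bits) - len(bits) % 3, 3):
        codon = int(bits[i:i + 3], 2)
        s = CODE[codon]
        if s not in valid_next(x):
            # Steps 4 to 6: nearest valid codon, ties broken by smallest codon value.
            candidates = [c for c, sym in CODE.items() if sym in valid_next(x)]
            s = CODE[min(candidates, key=lambda c: (hamming(c, codon), c))]
        x.append(s)
    if len(x) % 2 == 0:
        x.append("a")   # repair heuristic: finish an empty or dangling expression
    return " ".join(x)
```

Here bgp_decode("011000") and bgp_decode("001100") both yield "b * a": in the first genotype both codons are invalid at their positions and are corrected, in the second they are valid as they stand.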
The genotype-phenotype mapping of Binary Genetic Programming represents an n : 1
relation: Due to the fact that different codons may be replaced by the same approximation,
multiple genotypes have the same phenotypic representation. This also means that there can
be genetic variations induced by the mutation operation that do not influence the fitness.
Such neutral variations are often considered as a driving force behind (molecular) evolution
[1137, 1138, 973] and are discussed in Section 1.4.5 on page 67 in detail.
From the form of the genome we assume the number of corrections needed in the
genotype-phenotype mapping (especially for larger grammars) will be high. This, in turn,
could lead to very destructive mutation and crossover operations since if one codon is mod-
ified, the semantics of many subsequent codons may be influenced wildly. This issue is also
discussed in Section 4.8.1 on page 204.
4.4.3 Gene Expression Programming
Gene Expression Programming (GEP) by Ferreira [654, 655, 656, 657, 658] introduces an
interesting method for dealing with remaining unsatisfied function arguments at the end
of the expression tree building process. Like BGP, Gene Expression Programming uses a
genotype-phenotype mapping that translates fixed-length string chromosomes into tree phe-
notypes representing programs.
17 See Figure 1.14 on page 42 for more information on the DNA.
18 see Definition 29.6 on page 537

A gene in GEP is composed of a head and a tail [654] which are further divided into
codons, where each codon directly encodes one expression. The codons in the head of a
gene can represent arbitrary expressions whereas the codons in the tail can only stand
for parameterless terms. This makes the tail a reservoir for unresolved arguments of the
expressions in the head.
For each problem, the length h of the head is chosen as a fixed value, and the length of the
tail t is defined according to Equation 4.2, where n is the arity (the number of arguments)
of the function with the most arguments.
t = h(n − 1) + 1    (4.2)
The reason for this formula is that we have h expressions in the head, each of them
taking at most n parameters. An upper bound for the total number of arguments is thus
h ∗ n. From this number, h − 1 are already satisfied since all expressions in the head (except
for the first one) themselves are arguments to expressions instantiated before. This leaves at
most h ∗ n − (h − 1) = h ∗ n − h + 1 = h(n − 1) + 1 unsatisfied parameters. With this simple
measure, incomplete expressions that require additional repair operations in BGP and most
other approaches simply cannot occur.
For instance, consider the grammar for mathematical expressions with the terminal
symbols Σ = {√·, *, /, -, +, a, b} given as example in [654]. It includes two variables, a and b,
as well as five mathematical functions, √·, *, /, +, and -. √· has the arity 1 since it takes
one argument, the other four have arity 2. Hence, n = 2.
Figure 4.14: A GPM example for Gene Expression Programming.
Figure 4.14 illustrates an example gene (with h = 10 and t = h(2 − 1) + 1 = 11) and its
phenotypic representation of this mathematical expression grammar. A phenotype is built
by interpreting the gene as a level-order traversal19 of the nodes of the expression tree. In
19 http://en.wikipedia.org/wiki/Tree_traversal [accessed 2007-07-15]

other words, the first codon of a gene encodes the root r of the expression tree (here +). Then,
all nodes in the first level (i. e., the children of r, here √ · and -) are stored from left to
right, then their children and so on. In the phenotypic representation, we have sketched the
traversal order and numbered the levels. These level numbers are annotated to the gene but
are neither part of the real phenotype nor the genotype. Furthermore, the division of the
gene into head and tail is shown. In the head, the mathematical expressions as well as the
variables may occur, while variables are the sole construction element of the tail.
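The level-order interpretation and the tail length from Equation 4.2 can be sketched together. The gene strings below are made-up examples rather than the gene of Figure 4.14, and the letter Q stands in for the square root; both are assumptions of this sketch.

```python
from collections import deque

ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}   # Q: square root; variables have arity 0


def tail_length(h, n):
    """Equation 4.2: t = h(n - 1) + 1."""
    return h * (n - 1) + 1


def gep_decode(gene):
    """Read a gene in level order; return the tree and the number of codons used."""
    symbols = iter(gene)
    root = [next(symbols)]
    queue = deque([root])
    used = 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols)]      # the next codon fills the next open argument
            used += 1
            node.append(child)
            queue.append(child)
    return root, used


h = 4
gene = "+Q-a" + "bbaab"                  # head of length h, tail of length tail_length(h, 2)
tree, used = gep_decode(gene)
```

Even a head consisting only of binary functions, such as "++++" followed by five variables, consumes exactly h + t = 9 codons, so the tail can never run out of symbols.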
In GEP, multiple genes form one genotype, thus encoding multiple expression trees.
These trees may then be combined to one phenotype by predefined statements. It is easy
to see that binary or integer strings can be used as genome, because the number of allowed
symbols is known in advance.
This fixed mapping is also a disadvantage of Gene Expression Programming in com-
parison with the methods introduced later which have variable input grammars. On the
other hand, there is the advantage that all genotypes can be translated to valid expression
trees without requiring any corrections. Another benefit is that it seems to circumvent –
at least partially – the problem of low causality from which the string-to-tree-GPM based
approaches often suffer. By modularizing the genotypes, potentially harmful influences
of the reproduction operations are confined to single genes while others may stay intact.
(See Section 4.8.1 on page 204 for more details.)
General Information
Areas Of Application
Some example areas of application of Gene Expression Programming are:
Application                                   References
Symbolic Regression and Function Synthesis    [659, 660, 1308]
Data Mining and Data Analysis                 [1389, 1390, 2319, 2334, 2320, 1361]
Electrical Engineering and Circuit Design     [337]
Machine Learning                              [661, 1278]
Geometry and Physics                          [2018, 1127, 364]
Online Resources
Some general resources on Gene Expression Programming that are available online are:
http://www.gene-expression-programming.com/ [accessed 2007-08-19]
Last update: up-to-date
Description: Gene Expression Programming Website. Includes publications, tutorials, and
software.
4.4.4 Edge Encoding
Up until now, we only have considered how string genotypes can be transformed to more
complex structures like trees. Obviously, genotype-phenotype mappings are not limited to
this, but can work on tree genotypes as well. In [1321], Luke and Spector present their
edge encoding approach where the genotypes are trees (or forests) of expressions from a
graph-definition language. During the GPM, these trees are interpreted and construct the
phenotypes, arbitrary directed graphs. Edge encoding is closely related to Gruau’s cellular
encoding [863], which works on nodes instead of edges.

Each function and terminal in edge encoding works on a tuple (a, b) containing two node
identifiers. Such a tuple represents a directed edge from node a to node b. The