Original PDF Flash format evolving-artificial-neural-networks---proceedings-of-the-ieee  


Evolving Artificial Neural Networks Proceedings Of The Ieee

Evolving Artificial Neural Networks
XIN YAO, SENIOR MEMBER, IEEE
Invited Paper
Learning and evolution are two fundamental forms of adapta-
between ANN’s and EA’s for combinatorial optimization
tion. There has been a great interest in combining learning and
will be mentioned but not discussed in detail.
evolution with artificial neural networks (ANN’s) in recent years.
This paper: 1) reviews different combinations between ANN’s and
evolutionary algorithms (EA’s), including using EA’s to evolve

A. Artificial Neural Networks
ANN connection weights, architectures, learning rules, and input
features; 2) discusses different search operators which have been

1) Architectures: An ANN consists of a set of processing
used in various EA’s; and 3) points out possible future research
elements, also known as neurons or nodes, which are
directions. It is shown, through a considerably large literature
interconnected. It can be described as a directed graph in
review, that combinations between ANN’s and EA’s can lead to
which each node
performs a transfer function
of the
significantly better intelligent systems than relying on ANN’s or
form
EA’s alone.
Keywords—Evolutionary computation, intelligent systems, neu-
ral networks.
(1)
I.
INTRODUCTION
where
is the output of the node
is the th input to
Evolutionary artificial neural networks (EANN’s) refer
the node, and
is the connection weight between nodes
to a special class of artificial neural networks (ANN’s) in
and .
is the threshold (or bias) of the node. Usually,
which evolution is another fundamental form of adaptation
is nonlinear, such as a heaviside, sigmoid, or Gaussian
in addition to learning [1]–[5]. Evolutionary algorithms
function.
(EA’s) are used to perform various tasks, such as con-
ANN’s can be divided into feedforward and recurrent
nection weight training, architecture design, learning rule
classes according to their connectivity. An ANN is feed-
adaptation, input feature selection, connection weight ini-
forward if there exists a method which numbers all the
tialization, rule extraction from ANN’s, etc. One distinct
nodes in the network such that there is no connection from
feature of EANN’s is their adaptability to a dynamic
a node with a large number to a node with a smaller number.
environment. In other words, EANN’s can adapt to an en-
All the connections are from nodes with small numbers to
vironment as well as changes in the environment. The two
nodes with larger numbers. An ANN is recurrent if such a
forms of adaptation, i.e., evolution and learning in EANN’s,
numbering method does not exist.
make their adaptation to a dynamic environment much more
In (1), each term in the summation only involves one
effective and efficient. In a broader sense, EANN’s can be
input
. High-order ANN’s are those that contain high-
regarded as a general framework for adaptive systems, i.e.,
order nodes, i.e., nodes in which more than one input
systems that can change their architectures and learning
are involved in some of the terms of the summation. For
rules appropriately without human intervention.
example, a second-order node can be described as
This paper is most concerned with exploring possible
benefits arising from combinations between ANN’s and
EA’s. Emphasis is placed on the design of intelligent
systems based on ANN’s and EA’s. Other combinations
Manuscript received July 10, 1998; revised February 18, 1999. This
work was supported in part by the Australian Research Council through
where all the symbols have similar definitions to those in
its small grant scheme.
(1).
The author is with the School of Computer Science, University
The architecture of an ANN is determined by its topo-
of Birmingham, Edgbaston, Birmingham B15 2TT U.K. (e-mail:
xin@cs.bham.ac.uk).
logical structure, i.e., the overall connectivity and transfer
Publisher Item Identifier S 0018-9219(99)06906-6.
function of each node in the network.
0018–9219/99$10.00 © 1999 IEEE
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999
1423

Fig. 1.
A general framework of EA’s.
2) Learning in ANN’s: Learning in ANN’s is typically
gradient-based search algorithms. They do not depend
accomplished using examples. This is also called “training”
on gradient information and thus are quite suitable for
in ANN’s because the learning is achieved by adjusting the
problems where such information is unavailable or very
connection weights1 in ANN’s iteratively so that trained
costly to obtain or estimate. They can even deal with
(or learned) ANN’s can perform certain tasks. Learning in
problems where no explicit and/or exact objective function
ANN’s can roughly be divided into supervised, unsuper-
is available. These features make them much more robust
vised, and reinforcement learning. Supervised learning is
than many other search algorithms. Fogel [15] and B¨ack
based on direct comparison between the actual output of
et al. [16] give a good introduction to various evolutionary
an ANN and the desired correct output, also known as the
algorithms for optimization.
target output. It is often formulated as the minimization
of an error function such as the total mean square error
C. Evolution in EANN’s
between the actual output and the desired output summed
Evolution has been introduced into ANN’s at roughly
over all available data. A gradient descent-based optimiza-
three different levels: connection weights; architectures;
tion algorithm such as backpropagation (BP) [6] can then
and learning rules. The evolution of connection weights
be used to adjust connection weights in the ANN iteratively
introduces an adaptive and global approach to training,
in order to minimize the error. Reinforcement learning is a
especially in the reinforcement learning and recurrent net-
special case of supervised learning where the exact desired
work learning paradigm where gradient-based training al-
output is unknown. It is based only on the information of
gorithms often experience great difficulties. The evolution
whether or not the actual output is correct. Unsupervised
of architectures enables ANN’s to adapt their topologies
learning is solely based on the correlations among input
to different tasks without human intervention and thus
data. No information on “correct output” is available for
provides an approach to automatic ANN design as both
learning.
ANN connection weights and structures can be evolved.
The essence of a learning algorithm is the learning rule,
The evolution of learning rules can be regarded as a process
i.e., a weight-updating rule which determines how connec-
of “learning to learn” in ANN’s where the adaptation of
tion weights are changed. Examples of popular learning
learning rules is achieved through evolution. It can also be
rules include the delta rule, the Hebbian rule, the anti-
regarded as an adaptive process of automatic discovery of
Hebbian rule, and the competitive learning rule [7].
novel learning rules.
More detailed discussion of ANN’s can be found in [7].
D. Organization of the Article
B. EA’s
The remainder of this paper is organized as follows.
EA’s refer to a class of population-based stochastic
Section II discusses the evolution of connection weights.
search algorithms that are developed from ideas and princi-
The aim is to find a near-optimal set of connection weights
ples of natural evolution. They include evolution strategies
globally for an ANN with a fixed architecture using EA’s.
(ES) [8], [9], evolutionary programming (EP) [10], [11],
Various methods of encoding connection weights and dif-
[12], and genetic algorithms (GA’s) [13], [14]. One im-
ferent search operators used in EA’s will be discussed.
portant feature of all these algorithms is their population-
Comparisons between the evolutionary approach and con-
based search strategy. Individuals in a population compete
ventional training algorithms, such as BP, will be made.
and exchange information with each other in order to
In general, no single algorithm is an overall winner for all
perform certain tasks. A general framework of EA’s can
kinds of networks. The best training algorithm is problem
be described by Fig. 1.
dependent.
EA’s are particularly useful for dealing with large com-
Section III is devoted to the evolution of architectures,
plex problems which generate many local optima. They are
i.e., finding a near-optimal ANN architecture for the tasks
less likely to be trapped in local minima than traditional
at hand. It is known that the architecture of an ANN
1 Thresholds (biases) can be viewed as connection weights with fixed
determines the information processing capability of the
input 01.
ANN. Architecture design has become one of the most
1424
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

Fig. 2.
A typical cycle of the evolution of connection weights.
important tasks in ANN research and application. Two most
a global minimum if the error function is multimodal
important issues in the evolution of architectures, i.e., the
and/or nondifferentiable. A detailed review of BP and other
representation and search operators used in EA’s, will be
learning algorithms can be found in [7], [17], and [25].
addressed in this section. It is shown that evolutionary
One way to overcome gradient-descent-based training
algorithms relying on crossover operators do not perform
algorithms’ shortcomings is to adopt EANN’s, i.e., to for-
very well in searching for a near-optimal ANN architecture.
mulate the training process as the evolution of connection
Reasons and empirical results will be given in this section
weights in the environment determined by the architecture
to explain why this is the case.
and the learning task. EA’s can then be used effectively
If imagining ANN’s connection weights and architectures
in the evolution to find a near-optimal set of connection
as their “hardware,” it is easier to understand the importance
weights globally without computing gradient information.
of the evolution of ANN’s “software”—learning rules.
The fitness of an ANN can be defined according to different
Section IV addresses the evolution of learning rules in
needs. Two important factors which often appear in the
ANN’s and examines the relationship between learning
fitness (or error) function are the error between target and
and evolution, e.g., how learning guides evolution and
actual outputs and the complexity of the ANN. Unlike
how learning itself evolves. It is demonstrated that an
the case in gradient-descent-based training algorithms, the
ANN’s learning ability can be improved through evolution.
fitness (or error) function does not have to be differentiable
Although research on this topic is still in its early stages,
or even continuous since EA’s do not depend on gradient
further studies will no doubt benefit research in ANN’s and
information. Because EA’s can treat large, complex, non-
machine learning as a whole.
differentiable, and multimodal spaces, which are the typical
Section V summarizes some other forms of combinations
case in the real world, considerable research and application
between ANN’s and EA’s. They do not intend to be
has been conducted on the evolution of connection weights
exhaustive, simply indicative. They demonstrate the breadth
[24], [26]–[112].
of possible combinations between ANN’s and EA’s.
The evolutionary approach to weight training in ANN’s
Section VI first describes a general framework of
consists of two major phases. The first phase is to decide
EANN’s in terms of adaptive systems where interactions
the representation of connection weights, i.e., whether in
among three levels of evolution are considered. The
the form of binary strings or not. The second one is
framework provides a common basis for comparing
the evolutionary process simulated by an EA, in which
different EANN models. The section then gives a brief
search operators such as crossover and mutation have to
summary of the paper and concludes with a few remarks.
be decided in conjunction with the representation scheme.
Different representations and search operators can lead to
II.
THE EVOLUTION OF CONNECTION WEIGHTS
quite different training performance. A typical cycle of the
evolution of connection weights is shown in Fig. 2. The
Weight training in ANN’s is usually formulated as min-
evolution stops when the fitness is greater than a predefined
imization of an error function, such as the mean square
value (i.e., the training error is smaller than a certain value)
error between target and actual outputs averaged over all
or the population has converged.
examples, by iteratively adjusting connection weights. Most
training algorithms, such as BP and conjugate gradient
algorithms [7], [17]–[19], are based on gradient descent.
A. Binary Representation
There have been some successful applications of BP in
The canonical genetic algorithm (GA) [13], [14] has
various areas [20]–[22], but BP has drawbacks due to its use
always used binary strings to encode alternative solutions,
of gradient descent [23], [24]. It often gets trapped in a local
often termed chromosomes. Some of the early work in
minimum of the error function and is incapable of finding
evolving ANN connection weights followed this approach
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1425

(a)
(b)
(a)
(b)
Fig. 3.
(a) An ANN with connection weights shown. (b) A
Fig. 4.
(a) An ANN which is equivalent to that given in Fig. 3(a).
binary representation of the weights, assuming that each weight
(b) Its binary representation under the same representation scheme.
is represented by four bits.
many-to-one mapping from the representation (genotype)
[24], [26], [28], [37], [38], [41], [52], [53]. In such a repre-
to the actual ANN (phenotype) since two ANN’s that order
sentation scheme, each connection weight is represented by
their hidden nodes differently in their chromosomes will
a number of bits with certain length. An ANN is encoded by
still be equivalent functionally. For example, ANN’s shown
concatenation of all the connection weights of the network
by Figs. 3(a) and 4(a) are equivalent functionally, but they
in the chromosome.
have different chromosomes as shown by Figs. 3(b) and
A heuristic concerning the order of the concatenation
4(b). In general, any permutation of the hidden nodes
is to put connection weights to the same hidden/output
will produce functionally equivalent ANN’s with differ-
node together. Hidden nodes in ANN’s are in essence
ent chromosome representations. The permutation problem
feature extractors and detectors. Separating inputs to the
makes crossover operator very inefficient and ineffective in
same hidden node far apart in the binary representation
producing good offspring.
would increase the difficulty of constructing useful feature
detectors because they might be destroyed by crossover
B. Real-Number Representation
operators. It is generally very difficult to apply crossover
There have been some debates on the cardinality of
operators in evolving connection weights since they tend
the genotype alphabet. Some have argued that the mini-
to destroy feature detectors found during the evolutionary
mal cardinality, i.e., the binary representation, might not
process.
be the best [48], [114]. Formal analysis of nonstandard
Fig. 3 gives an example of the binary representation of
representations and operators based on the concept of
an ANN whose architecture is predefined. Each connection
equivalent classes [115], [116] has given representations
weight in the ANN is represented by 4 bits, the whole ANN
other than ary strings a more solid theoretical foundation.
is represented by 24 bits where weight 0000 indicates no
Real numbers have been proposed to represent connection
connection between two nodes.
weights directly, i.e., one real number per connection
The advantages of the binary representation lie in its
weight [27], [29], [30], [48], [63]–[65], [74], [95], [96],
simplicity and generality. It is straightforward to apply
[102], [110], [111], [117], [118]. For example, a real-
classical crossover (such as one-point or uniform crossover)
number representation of the ANN given by Fig. 3(a) could
and mutation to binary strings. There is little need to design
be (4.0,10.0,2.0,0.0,7.0,3.0).
complex and tailored search operators. The binary repre-
As connection weights are represented by real numbers,
sentation also facilitates digital hardware implementation
each individual in an evolving population will be a real
of ANN’s since weights have to be represented in terms of
vector. Traditional binary crossover and mutation can no
bits in hardware with limited precision.
longer be used directly. Special search operators have
There are several encoding methods, such as uniform,
to be designed. Montana and Davis [27] defined a large
Gray, exponential, etc., that can be used in the binary
number of tailored genetic operators which incorporated
representation. They encode real values using different
many heuristics about training ANN’s. The idea was to
ranges and precisions given the same number of bits.
retain useful feature detectors formed around hidden nodes
However, a tradeoff between representation precision and
during evolution. Their results showed that the evolutionary
the length of chromosome often has to be made. If too
training approach was much faster than BP for the problems
few bits are used to represent each connection weight,
they considered. Bartlett and Downs [30] also demonstrated
training might fail because some combinations of real-
that the evolutionary approach was faster and had better
valued connection weights cannot be approximated with
scalability than BP.
sufficient accuracy by discrete values. On the other hand,
A natural way to evolve real vectors would be to use
if too many bits are used, chromosomes representing large
EP or ES since they are particularly well-suited for treating
ANN’s will become extremely long and the evolution in
continuous optimization. Unlike GA’s, the primary search
turn will become very inefficient.
operator in EP and ES is mutation. One of the major
One of the problems faced by evolutionary training of
advantages of using mutation-based EA’s is that they can
ANN’s is the permutation problem [32], [113], also known
reduce the negative impact of the permutation problem.
as the competing convention problem. It is caused by the
Hence the evolutionary process can be more efficient. There
1426
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

have been a number of successful examples of applying EP
is unavailable or very costly to obtain or estimate. For
or ES to the evolution of ANN connection weights [29],
example, the evolutionary approach has been used to train
[63]–[65], [67], [68], [95], [96], [102], [106], [111], [117],
recurrent ANN’s [41], [60], [65], [100], [102], [103], [106],
[119], [120]. In these examples, the primary search operator
[117], [126]–[128], higher order ANN’s [52], [53], and
has been Gaussian mutation. Other mutation operators, such
fuzzy ANN’s [76], [77], [129], [130]. Moreover, the same
as Cauchy mutation [121], [122], can also be used. EP
EA can be used to train many different networks regardless
and ES also allow self adaptation of strategy parameters.
of whether they are feedforward, recurrent, or higher order
Evolving connection weights by EP can be implemented
ANN’s. The general applicability of the evolutionary ap-
as follows.
proach saves a lot of human efforts in developing different
training algorithms for different types of ANN’s.
1) Generate an initial population of
individuals at
The evolutionary approach also makes it easier to gener-
random and set
. Each individual is a pair
ate ANN’s with some special characteristics. For example,
of real-valued vectors,
,
the ANN’s complexity can be decreased and its generaliza-
where
’s are connection weight vectors and
’s are
tion increased by including a complexity (regularization)
variance vectors for Gaussian mutations (also known
term in the fitness function. Unlike the case in gradient-
as strategy parameters in self-adaptive EA’s). Each
based training, this term does not need to be differentiable
individual corresponds to an ANN.
or even continuous. Weight sharing and weight decay can
2) Each individual
, creates a
also be incorporated into the fitness function easily.
single offspring
by: for
Evolutionary training can be slow for some problems in
comparison with fast variants of BP [131] and conjugate
(2)
gradient algorithms [19], [132]. However, EA’s are gen-
(3)
erally much less sensitive to initial conditions of training.
They always search for a globally optimal solution, while
where
, and
denote the
a gradient descent algorithm can only find a local optimum
th component of the vectors
and
,
in a neighborhood of the initial solution.
respectively.
denotes a normally distributed
For some problems, evolutionary training can be signif-
one-dimensional random number with mean zero and
icantly faster and more reliable than BP [30], [34], [40],
variance one.
indicates that the random
[63], [83], [89]. Prados [34] described a GA-based training
number is generated anew for each value of . The
algorithm which is “significantly faster than methods that
parameters
and
are commonly set to
use the generalized delta rule (GDR).” For the three tests
and
[15], [123].
in (3) may be
reported in his paper [34], the GA-based training algorithm
replaced by Cauchy mutation [121], [122], [124] for
“took a total of about 3 hours and 20 minutes, and the GDR
faster evolution.
took a total of about 23 hours and 40 minutes.” Bartlett and
3) Determine the fitness of every individual, including
Downs [30] also gave a modified GA which was “an order
all parents and offspring, based on the training error.
of magnitude” faster than BP for the 7-bit parity problem.
Different error functions may be used here.
The modified GA seemed to have better scalability than
4) Conduct pairwise comparison over the union of par-
BP since it was “around twice” as slow as BP for the
ents
and offspring
.
XOR problem but faster than BP for the larger 7-bit parity
For each individual,
opponents are chosen uni-
problem.
formly at random from all the parents and offspring.
Interestingly, quite different results were reported by Ki-
For each comparison, if the individual’s fitness is
tano [133]. He found that the GA–BP method, a technique
no smaller than the opponent’s, it receives a “win.”
that runs a GA first and then BP, “is, at best, equally
Select
individuals out of
and
efficient to faster variants of back propagation in very small
, that have most wins to form the
scale networks, but far less efficient in larger networks.”
next generation. (This tournament selection scheme
The test problems he used included the XOR problem,
may be replaced by other selection schemes, such as
various size encoder/decoder problems, and the two-spiral
[125].)
problem. However, there have been many other papers
5) Stop if the halting criterion is satisfied; otherwise,
which report excellent results using hybrid evolutionary and
and go to Step 2).
gradient descent algorithms [32], [67], [70], [71], [74], [80],
[81], [86], [103], [105], [110]–[112].
C. Comparison Between Evolutionary Training
The discrepancy between two seemingly contradictory
and Gradient-Based Training
results can be attributed at least partly to the different
As indicated at the beginning of Section II, the evolu-
EA’s and BP compared. That is, whether the comparison
tionary training approach is attractive because it can handle
is between a classical binary GA and a fast BP algorithm,
the global search problem better in a vast, complex, mul-
or between a fast EA and a classical BP algorithm. The
timodal, and nondifferentiable surface. It does not depend
discrepancy also shows that there is no clear winner in
on gradient information of the error (or fitness) function
terms of the best training algorithm. The best one is always
and thus is particularly appealing when this information
problem dependent. This is certainly true according to the
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1427

no-free-lunch theorem [134]. In general, hybrid algorithms
tend to perform better than others for a large number of
problems.
D. Hybrid Training
Most EA’s are rather inefficient in fine-tuned local search
although they are good at global search. This is especially
true for GA’s. The efficiency of evolutionary training can
be improved significantly by incorporating a local search
procedure into the evolution, i.e., combining EA’s global
search ability with local search’s ability to fine tune. EA’s
can be used to locate a good region in the space and then
a local search procedure is used to find a near-optimal
Fig. 5.
An illustration of using an EA to find good initial weights
solution in this region. The local search algorithm could
such that a local search algorithm can find the globally optimal
be BP [32], [133] or other random search algorithms [30],
weights easily. wI2 in this figure is an optimal initial weight
[135]. Hybrid training has been used successfully in many
because it can lead to the global optimum wB using a local search
algorithm.
application areas [32], [67], [70], [71], [74], [80], [81], [86],
[103], [105], [110]–[112].
Lee [81] and many others [32], [136]–[138] used GA’s to
a large number of connections and nonlinear nodes may
search for a near-optimal set of initial connection weights
overfit noise in the training data and fail to have good
and then used BP to perform local search from these
generalization ability.
initial weights. Their results showed that the hybrid GA/BP
Up to now, architecture design is still very much a human
approach was more efficient than either the GA or BP al-
expert’s job. It depends heavily on the expert experience
gorithm used alone. If we consider that BP often has to run
and a tedious trial-and-error process. There is no systematic
several times in practice in order to find good connection
way to design a near-optimal architecture for a given task
weights due to its sensitivity to initial conditions, the hybrid
automatically. Research on constructive and destructive al-
training algorithm will be quite competitive. Similar work
gorithms represents an effort toward the automatic design of
on the evolution of initial weights has also been done on
architectures [141]–[148]. Roughly speaking, a constructive
competitive learning neural networks [139] and Kohonen
algorithm starts with a minimal network (network with
networks [140].
minimal number of hidden layers, nodes, and connections)
It is interesting to consider finding good initial weights
and adds new layers, nodes, and connections when nec-
as locating a good region in the weight space. Defining that
essary during training while a destructive algorithm does
basin of attraction of a local minimum as being composed
the opposite, i.e., starts with the maximal network and
of all the points, sets of weights in this case, which can
deletes unnecessary layers, nodes, and connections during
converge to the local minimum through a local search
training. However, as indicated by Angeline et al. [149],
algorithm, then a global minimum can easily be found by
“Such structural hill climbing methods are susceptible to
the local search algorithm if an EA can locate a point, i.e., a
becoming trapped at structural local optima.” In addition,
set of initial weights, in the basin of attraction of the global
they “only investigate restricted topological subsets rather
minimum. Fig. 5 illustrates a simple case where there is
than the complete class of network architectures” [149].
only one connection weight in the ANN. If an EA can find
Design of the optimal architecture for an ANN can be
an initial weight such as
, it would be easy for a local
formulated as a search problem in the architecture space
search algorithm to arrive at the globally optimal weight
where each point represents an architecture. Given some
even though
itself is not as good as
.
performance (optimality) criteria, e.g., lowest training error,
lowest network complexity, etc., about architectures, the
III.
THE EVOLUTION OF ARCHITECTURES
performance level of all architectures forms a discrete
Section II assumed that the architecture of an ANN is
surface in the space. The optimal architecture design is
predefined and fixed during the evolution of connection
equivalent to finding the highest point on this surface. There
weights. This section discusses the design of ANN architec-
are several characteristics of such a surface as indicated by
tures. The architecture of an ANN includes its topological
Miller et al. [150] which make EA’s a better candidate for
structure, i.e., connectivity, and the transfer function of each
searching the surface than those constructive and destruc-
node in the ANN. As indicated in the beginning of this
tive algorithms mentioned above. These characteristics are
paper, architecture design is crucial in the successful ap-
[150]:
plication of ANN’s because the architecture has significant
1) the surface is infinitely large since the number of
impact on a network’s information processing capabilities.
possible nodes and connections is unbounded;
Given a learning task, an ANN with only a few connections
2) the surface is nondifferentiable since changes in the
and linear nodes may not be able to perform the task
number of nodes or connections are discrete and can
at all due to its limited capability, while an ANN with
have a discontinuous effect on EANN’s performance;
1428
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

Fig. 6.
A typical cycle of the evolution of architectures.
3) the surface is complex and noisy since the mapping
term topological structure (topology) in these two sec-
from an architecture to its performance is indirect,
tions. Section III-C discusses the evolution of node transfer
strongly epistatic, and dependent on the evaluation
functions briefly. Then we explain why the simultaneous
method used;
evolution of ANN connection weights and architectures
4) the surface is deceptive since similar architectures
is beneficial and what search operators should be used in
may have quite different performance;
evolving architectures in Section III-D.
5) the surface is multimodal since different architectures
A. The Direct Encoding Scheme
may have similar performance.
Two different approaches have been taken in the direct
Similar to the evolution of connection weights, two major
encoding scheme. The first separates the evolution of archi-
phases involved in the evolution of architectures are the
tectures from that of connection weights [24], [150], [153],
genotype representation scheme of architectures and the EA
[154], [165], [167], [169], [170]. The second approach
used to evolve ANN architectures. One of the key issues
evolves architectures and connection weights simultane-
in encoding ANN architectures is to decide how much
ously [149], [179], [180], [182], [185]–[200]. This section
information about an architecture should be encoded in the
will focus on the first approach. The second approach will
chromosome. At one extreme, all the details, i.e., every
be discussed in Section III-D.
connection and node of an architecture, can be specified
In the first approach, each connection of an architecture
by the chromosome. This kind of representation scheme
is directly specified by its binary representation [24], [150],
is called direct encoding. At the other extreme, only the
[153], [154], [165], [167], [169], [170], [202]. For example,
most important parameters of an architecture, such as the
an
matrix
can represent an ANN
number of hidden layers and hidden nodes in each layer,
architecture with
nodes, where
indicates presence or
are encoded. Other details about the architecture are left to
absence of the connection from node to node . We can use
the training process to decide. This kind of representation
to indicate a connection and
to indicate no
scheme is called indirect encoding. After a representation
connection. In fact,
can represent real-valued connection
scheme has been chosen, the evolution of architectures can
weights from node
to node
so that the architecture and
progress according to the cycle shown in Fig. 6. The cycle
connection weights can be evolved simultaneously [37],
stops when a satisfactory ANN is found.
[42], [45], [165], [166], [169]–[171].
Considerable research on evolving ANN architectures
Each matrix
has a direct one-to-one mapping to
has been carried out in recent years [33], [42], [45],
the corresponding ANN architecture. The binary string
[118], [127], [128], [130], [138], [149]–[225]. Most of
representing an architecture is the concatenation of rows
the research has concentrated on the evolution of ANN
(or columns) of the matrix. Constraints on architectures
topological structures. Relatively little has been done on
being explored can easily be incorporated into such a rep-
the evolution of node transfer functions, let alone the
resentation scheme by imposing constraints on the matrix,
simultaneous evolution of both topological structures and
e.g., a feedforward ANN will have nonzero entries only
node transfer functions. In this paper, we will analyze
in the upper-right triangle of the matrix. Figs. 7 and 8
the genotypical representation scheme of topological struc-
give two examples of the direct encoding scheme of ANN
tures in Sections III-A and III-B. For convenience, the
architectures. It is obvious that such an encoding scheme
term architecture will be used interchangeably with the
can handle both feedforward and recurrent ANN’s.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1429

(a)
(b)
(c)
Fig. 7.
An example of the direct encoding of a feedforward ANN. (a), (b), and (c) show the
architecture, its connectivity matrix, and its binary string representation, respectively. Because only
feedforward architectures are under consideration, the binary string representation only needs to
consider the upper-right triangle of the matrix.
(a)
(b)
(c)
Fig. 8.
An example of the direct encoding of a recurrent ANN. (a), (b), and (c) show the
architecture, its connectivity matrix, and its binary string representation, respectively.
Fig. 7(a) shows a feedforward ANN with two inputs and
is possible if we want to explore the whole connectivity
one output. Its connectivity matrix is given by Fig. 7(b),
space. The EA used to evolve recurrent ANN’s can be the
where entry
indicates the presence or absence of a
same as that used to evolve feedforward ones.
connection from node
to node . For example, the first
The direct encoding scheme as described above is quite
row indicates connections from node 1 to all other nodes.
straightforward to implement. It is very suitable for the
The first two columns are 0’s because there is no connection
precise and fine-tuned search of a compact ANN architec-
from node 1 to itself and no connection to node 2. However,
ture, since a single connection can be added or removed
node 1 is connected to nodes 3 and 4. Hence columns 3
from the ANN easily. It may facilitate rapid generation and
and 4 have 1’s. Converting this connectivity matrix to a
optimization of tightly pruned interesting designs that no
chromosome is straightforward. We can concatenate all the
one has hit upon so far [150].
rows (or columns) and obtain
Another flexibility provided by the evolution of architec-
tures stems from the fitness definition. There is virtually
00 110 00 101 00 001 00 001 00 000.
no limitation such as being differentiable or continuous
Since the ANN is feedforward, we only need to represent
on how the fitness function should be defined at Step 3
the upper triangle of the connectivity matrix in order to
in Fig. 6. The training result pertaining to an architecture
reduce the chromosome length. The reduced chromosome is
such as the error and the training time is often used in
given by Fig. 7(c). An EA can then be employed to evolve
the fitness function. The complexity measurement such as
a population of such chromosomes. In order to evaluate
the number of nodes and connections is also used in the
the fitness of each chromosome, we need to convert a
fitness function. As a matter of fact, many criteria based
chromosome back to an ANN, initialize it with random
on the information theory or statistics [226]–[228] can
weights, and train it. The training error will be used to
readily be introduced into the fitness function without much
measure the fitness. It is worth noting that the ANN in
difficulty. Improvement on ANN’s generalization ability
Fig. 7 has a shortcut connection from the input to output.
can be expected if these criteria are adopted. Schaffer
Such shortcuts pose no problems to the representation and
et al. [153] have presented an experiment which showed
evolution. An EA is capable of exploring all possible
that an ANN designed by the evolutionary approach had
connectivities.
better generalization ability than one trained by BP using
Fig. 8 shows a recurrent ANN. Its representation is
a human-designed architecture.
basically the same as that for feedforward ANN’s. The
One potential problem of the direct encoding scheme
only difference is that no reduction in chromosome length
is scalability. A large ANN would require a very large
1430
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

matrix and thus increase the computation time of the
chromosomes to specify independently the whole nervous
evolution. One way to cut down the size of matrices is
system according to the discoveries of neuroscience.
to use domain knowledge to reduce the search space. For
1) Parametric Representation: ANN architectures may
example, if complete connection is to be used between two
be specified by a set of parameters such as the number
neighboring layers in a feedforward ANN, its architecture
of hidden layers, the number of hidden nodes in each
can be encoded by just the number of hidden layers and
layer, the number of connections between two layers, etc.
nodes in each hidden layer. The length of chromosome can
These parameters can be encoded in various forms in a
be reduced greatly in this case [153], [154]. However, doing
chromosome. Harp et al. [152], [156] used a “blueprint” to
so requires sufficient domain knowledge and expertise,
represent an architecture which consists of one or more
which are difficult to obtain in practice. We also run the
segments representing an area (layer) and its efferent
risk of missing some very good solutions when we restrict
connectivity (projections). The first and last area are
the search space manually.
constrained to be the input and output area, respectively.
The permutation problem as illustrated by Figs. 3 and
Each segment includes two parts of the information: 1)
4 in Section II-A still exists and causes unwanted side
that about the area itself, such as the number of nodes in
effects in the evolution of architectures. Because two func-
the area and the spatial organization of the area, and 2)
tionally equivalent ANN’s which order their hidden nodes
that about the efferent connectivity. It should be noted that
differently have two different genotypical representations,
only the connectivity pattern instead of each connection
the probability of producing a highly fit offspring by
is specified here. The detailed node-to-node connection is
recombining them is often very low. Some researchers
specified by implicit developmental rules, e.g., the network
thus avoided crossover and adopted only mutations in
instantiation software used by Harp et al. [152], [156].
the evolution of architectures [45], [128], [149], [179],
Similar parametric representation methods with different
[185]–[197], [217], [223], although it has been shown that
sets of parameters have also been proposed by others [155],
crossover may be useful and important in increasing the
[159]. An interesting aspect of Harp et al.’s work is their
efficiency of evolution for some problems [48], [113],
combination of learning parameters with architectures in
[212], [229]. Hancock [113] suggested that the permutation
the genotypical representation. The learning parameters can
problem might “not be as severe as had been supposed”
evolve along with architecture parameters. The interaction
with the population size and the selection mechanism he
between the two can be explored through evolution.
used because “The increased number of ways of solving
Although the parametric representation method can re-
the problem outweigh the difficulties of bringing building
duce the length of binary chromosome specifying ANN’s
blocks together.” Thierens [101] proposed a genetic encod-
architectures, EA’s can only search a limited subset of
ing scheme of ANN’s which can avoid the permutation
problem, however, only very limited experimental results
the whole feasible architecture space. For example, if we
were presented. It is worth indicating that most studies on
encode only the number of hidden nodes in the hidden layer,
the permutation problem concentrate on the GA used, e.g.,
we basically assume strictly layered feedforward ANN’s
genetic operators, population sizes, selection mechanisms,
with a single hidden layer. We will have to assume two
etc. While it is necessary to investigate the algorithm, it is
neighboring layers are fully connected as well. In general,
equally important to study the genotypical representation
the parametric representation method will be most suitable
scheme, since the performance surface defined in the be-
when we know what kind of architectures we are trying
ginning of Section III is determined by the representation.
to find.
More research is needed to further understand the impact of
2) Developmental Rule Representation: A quite different
the permutation problem on the evolution of architectures.
indirect encoding method from the above is to encode de-
velopmental rules, which are used to construct architectures,
B. The Indirect Encoding Scheme
in chromosomes [151], [168], [184], [205], [230], [232].
In order to reduce the length of the genotypical represen-
The shift from the direct optimization of architectures to
tation of architectures, the indirect encoding scheme has
the optimization of developmental rules has brought some
been used by many researchers [151], [152], [155], [156],
benefits, such as more compact genotypical representation,
[159], [160], [168], [184], [205], [208], [211], [230]–[232]
to the evolution of architectures. The destructive effect of
where only some characteristics of an architecture are
crossover will also be lessened since the developmental rule
encoded in the chromosome. The details about each con-
representation is capable of preserving promising building
nection in an ANN is either predefined according to prior
blocks found so far [151]. But this method also has some
knowledge or specified by a set of deterministic devel-
problems [233].
opmental rules. The indirect encoding scheme can pro-
A developmental rule is usually described by a recursive
duce more compact genotypical representation of ANN
equation [230] or a generation rule similar to a production
architectures, but it may not be very good at finding
rule in a production system with a left-hand side (LHS) and
a compact ANN with good generalization ability. Some
a right-hand side (RHS) [151]. The connectivity pattern of
[151], [230], [231] have argued that the indirect encoding
the architecture in the form of a matrix is constructed from a
scheme is biologically more plausible than the direct one,
basis, i.e., a single-element matrix, by repetitively applying
because it is impossible for genetic information encoded in
suitable developmental rules to nonterminal elements in
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1431

Fig. 9.
Examples of some developmental rules used to construct a connectivity matrix. S is the
initial element (or state).
the current matrix until the matrix contains only terminal2
elements which indicate the presence or absence of a
connection, that is, until a connectivity pattern is fully
specified.
Some examples of developmental rule are given in Fig. 9.
(a)
(b)
(c)
Each developmental rule consists of a LHS which is a
nonterminal element and a RHS which is a 2
2 matrix
with either terminal or nonterminal elements. A typical step
of constructing a connection matrix is to find rules whose
LHS’s appear in the current matrix and replace all the
appearance with respective RHS’s. For example, given a
set of rules as described by Fig. 9, where
is the starting
symbol (state), one step application of these rules will
produce the matrix
(d)
(e)
by replacing
with
. If we apply these rules again,
we can generate the following matrix:
Fig. 10.
Development of an EANN architecture using the rules
given in Fig. 9: (a) the initial state; (b) step 1; (c) step 2; (d) step 3
when all the entries in the matrix are terminal elements, i.e., either
1 or 0; and (e) the architecture. The nodes in the architecture are
numbered from one to eight. Isolated nodes are not shown.
Note that nodes 2, 4, and 6 do not appear in the ANN
by replacing
with
with
with
,
because they are not connected to any other nodes.
and
with
. Another step of applying these rules
The example described by Figs. 9 and 10 illustrates how
would lead us to the matrix
an ANN architecture can be defined given a set of rules.
The question now is how to get such a set of rules to
construct an ANN. One answer is to evolve them. We can
encode the whole rule set as an individual [151] (the so-
called Pitt approach) or encode each rule as an individual
[184] (the so-called Michigan approach). Each rule may
be represented by four allele positions corresponding to
four elements in the RHS of the rule. The LHS can
be represented implicitly by the rule’s position in the
chromosome. Each position in a chromosome can take
by replacing
with
with
with
, and
one of many different values, depending on how many
with
. Since the above matrix consists of ones and
nonterminal elements (symbols) we use in the rule set.
zeros only (i.e., terminals only), there will be no further
For example, the nonterminals may range from “ ” to
application of developmental rules. The above matrix will
“ ” and “ ” to “ .” The 16 rules with “ ” to “ ” on the
be an ANN’s connection matrix.
LHS and 2
2 matrices with only ones and zeros on the
Fig. 10 summarizes the previous three rule-rewriting
RHS are predefined and do not participate in evolution in
steps and the final ANN generated according to Fig. 10(d).
order to guarantee that different connectivity patterns can
2 In this paper, a terminal element is either 1 (existence of a connection)
be reached. Since there are 26 different rules, whose LHS
or 0 (nonexistence of a connection) and a nonterminal element is a symbol
is “ ,” “ ,”
,“ ,” respectively, a chromosome encoding
other than 1 and 0. These definitions are slightly different from those used
by others [151].
all of them would need 26
4
104 alleles, four per rule.
1432
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

The LHS of a rule is implicitly determined by its position
current architecture cannot reduce the training error below
in the chromosome. For example, the rule set in Fig. 9 can
certain threshold. Each hidden layer is constructed auto-
be represented by the following chromosome:
matically through an evolutionary process which employs
ABCDaaaaiiiaiaacaeae
the GA with fitness sharing. Fitness sharing encourages the
where the first four elements indicate the RHS of rule “ ”,
formation of different feature detectors (hidden nodes) in
the second four indicate the RHS of rule “ ,” etc.
the population. The number of hidden nodes in each hidden
Some good results from the developmental rule repre-
layer can vary.
sentation method have been reported [151] using various
One limitation of this approach [236] is that it could
size encoder/decoder problems. However, the method has
only deal with strictly layered feedforward ANN’s. Another
some limitations. It often needs to predefine the number of
limitation is that there are usually several hidden nodes in
rewriting steps. It does not allow recursive rules. It is not
the same species which have very similar functionality,
very good at evolving detailed connectivity patterns among
i.e., which are basically the same feature detector in a
individual nodes. A compact genotypical representation
population. Such redundancy needs to be removed by an
does not imply a compact phenotypical representation, i.e.,
additional clean-up algorithm.
a compact ANN architecture. Recent work by Siddiqi and
Smith and Cribbs [181], [237] also used an individual
Lucas [233] shows that the direct encoding scheme can be
to represent a hidden node rather than the whole ANN.
at least as good as the developmental rule method. They
Their approach can only deal with strictly three-layered
have reimplemented Kitano’s system and discovered that
feedforward ANN’s.
the performance difference between the direct and indirect
encoding schemes was not caused by the encoding scheme
C. The Evolution of Node Transfer Functions
itself, but by how sparsely connected the initial ANN
The discussion on the evolution of architectures so far
architectures were in the initial population [151]. The direct
only deals with the topological structure of an architecture.
encoding scheme achieved the same performance as that
The transfer function of each node in the architecture has
achieved by the developmental rule representation when
been assumed to be fixed and predefined by human experts,
the initial conditions were the same [233].
yet the transfer function has been shown to be an important
The developmental rule representation method normally
part of an ANN architecture and have significant impact on
separates the evolution of architectures from that of con-
ANN’s performance [238]–[240]. The transfer function is
nection weights. This creates some problems for evolution.
often assumed to be the same for all the nodes in an ANN,
Section III-D will discuss these in more detail.
at least for all the nodes in the same layer.
Mjolsness et al. [230] described a similar rule encoding
Stork et al. [241] were, to our best knowledge, the first to
method where rules are represented by recursive equa-
apply EA’s to the evolution of both topological structures
tions which specify the growth of connection matrices.
and node transfer functions even though only simple ANN’s
Coefficients of these recursive equations, represented by
with seven nodes were considered. The transfer function
decomposition matrices, are encoded in genotypes and op-
was specified in the structural genes in their genotypic
timized by simulated annealing instead of EA’s. Connection
representation. It was much more complex than the usual
weights are optimized along with connectivity by simulated
sigmoid function because they tried to model a biological
annealing since each entry of a connection matrix can have
neuron in the tailflip circuitry of crayfish.
a real-valued weight. One advantage of using simulated
White and Ligomenides [171] adopted a simpler ap-
annealing instead of GA’s in the evolution is the avoidance
proach to the evolution of both topological structures and
of the destructive effect of crossover. Wilson [234] also
node transfer functions. For each individual (i.e., ANN)
used simulated annealing in ANN architecture design.
in the initial population, 80% nodes in the ANN used
3) Fractal Representation: Merrill and Port [231] pro-
the sigmoid transfer function and 20% nodes used the
posed another method for encoding architectures which
Gaussian transfer function. The evolution was used to
is based on the use of fractal subsets of the plane. They
decide the optimal mixture between these two transfer
argued that the fractal representation of architectures was
functions automatically. The sigmoid and Gaussian transfer
biologically more plausible than the developmental rule
function themselves were not evolvable. No parameters of
representation. They used three real-valued parameters, i.e.,
the two functions were evolved.
an edge code, an input coefficient, and an output coefficient
Liu and Yao [191] used EP to evolve ANN’s with
to specify each node in an architecture. In a sense, this
both sigmoidal and Gaussian nodes. Rather than fixing
encoding method is closer to the direct encoding scheme
the total number of nodes and evolve mixture of different
rather to the indirect one. Fast simulated annealing [235]
nodes, their algorithm allowed growth and shrinking of the
was used in the evolution.
whole ANN by adding or deleting a node (either sigmoidal
4) Other Representations: A very different approach to
or Gaussian). The type of node added or deleted was
the evolution of architectures has been proposed by Ander-
determined at random. Good performance was reported for
sen and Tsoi [236]. Their approach is unique in that each
some benchmark problems [191]. Hwang et al. [225] went
individual in a population represents a hidden node rather
one step further. They evolved ANN topology, node transfer
than the whole architecture. An architecture is built layer
function, as well as connection weights for projection neural
by layer, i.e., hidden layers are added one by one if the
networks.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1433

Sebald and Chellapilla [242] used the evolution of node
[182], [185]–[200], [230], [232]. In this case, each individ-
transfer function as an example to show the importance of
ual in a population is a fully specified ANN with complete
evolving representations. Representation and search are the
weight information. Since there is a one-to-one mapping
two key issues in problem solving. Co-evolving solutions
between a genotype and its phenotype, fitness evaluation
and their representations may be an effective way to tackle
is accurate.
some difficult problems where little human expertise is
One issue in evolving ANN’s is the choice of search
available.
operators used in EA’s. Both crossover-based and mutation-
based EA’s have been used. However, use of crossover
appears to contradict the basic ideas behind ANN’s, because
D. Simultaneous Evolution of Architectures
crossover works best when there exist “building blocks” but
and Connection Weights
it is unclear what a building block might be in an ANN since
The evolutionary approaches discussed so far in design-
ANN’s emphasize distributed (knowledge) representation
ing ANN architectures evolve architectures only, without
[244]. The knowledge in an ANN is distributed among all
any connection weights. Connection weights have to be
the weights in the ANN. Recombining one part of an ANN
learned after a near-optimal architecture is found. This is es-
with another part of another ANN is likely to destroy both
pecially true if one uses the indirect encoding scheme, such
ANN’s.
as the developmental rule method. One major problem with
However, if ANN’s do not use a distributed represen-
the evolution of architectures without connection weights
tation but rather a localized one, such as radial basis
is noisy fitness evaluation [194]. In other words, fitness
function (RBF) networks or nearest-neighbor multilayer
evaluation as described in step 3 of Fig. 6 is very inaccurate
perceptrons, crossover might be a very useful operator.
and noisy because a phenotype’s (i.e., an ANN with a
There has been some work in this area where good results
full set of weights) fitness was used to approximate its
were reported [119], [120], [245]–[253]. In general, ANN’s
genotype’s (i.e., an ANN without any weight information)
using distributed representation are more compact and have
fitness. There are two major sources of noise [194].
better generalization capability for most practical problems.
1) The first source is the random initialization of the
Yao and Liu [193], [194] developed an automatic system,
weights. Different random initial weights may pro-
EPNet, based on EP for simultaneous evolution of ANN
duce different training results. Hence, the same geno-
architectures and connection weights. EPNet does not use
type may have quite different fitness due to different
any crossover operators for the reason given above. It relies
random initial weights used in training.
on a number of mutation operators to modify architectures
2) The second source is the training algorithm. Different
and weights. Behavioral (i.e., functional) evolution, rather
training algorithms may produce different training
genetic evolution, is emphasized in EPNet. A number of
results even from the same set of initial weights. This
techniques were adopted to maintain the behavioral link
is especially true for multimodal error functions. For
between a parent and its offspring [190]. Fig. 11 shows the
example, BP may reduce an ANN’s error to 0.05
main structure of EPNet.
through training, but an EA could reduce the error
EPNet uses rank-based selection [125] and five muta-
to 0.001 due to its global search capability.
tions: hybrid training; node deletion; connection deletion;
connection addition; and node addition [188], [194], [254].
Such noise may mislead evolution because the fact that
Hybrid training is the only mutation in EPNet which mod-
the fitness of a phenotype generated from genotype
is
ifies ANN’s weights. It is based on a modified BP (MBP)
higher than that generated from genotype
does not mean
algorithm with an adaptive learning rate and simulated
that
truly has higher quality than
. In order to reduce
annealing. The other four mutations are used to grow and
such noise, an architecture usually has to be trained many
prune hidden nodes and connections.
times from different random initial weights. The average
The number of epochs used by MBP to train each ANN’s
result is then used to estimate the genotype’s mean fitness.
in a population is defined by two user-specified parameters.
This method increases the computation time for fitness
There is no guarantee that an ANN will converge to even
evaluation dramatically. It is one of the major reasons why
a local optimum after those epochs. Hence this training
only small ANN’s were evolved in previous studies.
process is called partial training. It is used to bridge the
In essence, the noise identified in this paper is caused by
behavioral gap between a parent and its offspring.
the one-to-many mapping from genotypes to phenotypes.
The five mutations are attempted sequentially. If one
Angeline et al. [149] and Fogel [12], [243] have provided a
mutation leads to a better offspring, it is regarded as
more general discussion on the mapping between genotypes
successful. No further mutation will be applied. Otherwise
and phenotypes. It is clear that the evolution of architectures
the next mutation is attempted. The motivation behind
without any weight information has difficulties in evaluat-
ordering mutations is to encourage the evolution of compact
ing fitness accurately. As a result, the evolution would be
ANN’s without sacrificing generalization. A validation set
very inefficient.
is used in EPNet to measure the fitness of an individual.
One way to alleviate this problem is to evolve ANN
EPNet has been tested extensively on a number of bench-
architectures and connection weights simultaneously [37],
mark problems and achieved excellent results, including
[42], [45], [149], [165], [166], [169]–[172], [179], [180],
parity problems of size from four to eight, the two-spiral
1434
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

Fig. 11.
The main structure of EPNet.
problem, the breast cancer problem, the diabetes problem,
all ANN’s. In fact, what is needed from an ANN is its
the heart disease problem, the thyroid problem, the Aus-
ability to adjust its learning rule adaptively according to its
tralian credit card problem, the Mackey–Glass time series
architecture and the task to be performed. In other words,
prediction problem, etc. Very compact ANN’s with good
an ANN should learn its learning rule dynamically rather
generalization ability have been evolved [185]–[195]. There
than have it designed and fixed manually. Since evolution
are some other different EP-based systems for designing
is one of the most fundamental forms of adaptation, it is
ANN’s [128], [129], [217], [223], but none has been tested
not surprising that the evolution of learning rules has been
on as many different benchmark problems.
introduced into ANN’s in order to learn their learning rules.
The relationship between evolution and learning is
IV. THE EVOLUTION OF LEARNING RULES
extremely complex. Various models have been proposed
An ANN training algorithm may have different perfor-
[257]–[271], but most of them deal with the issue of
mance when applied to different architectures. The design
how learning can guide evolution [257]–[260] and the
of training algorithms, more fundamentally the learning
relationship between the evolution of architectures and
rules used to adjust connection weights, depends on the
that of connection weights [261]–[263]. Research into
type of architectures under investigation. Different variants
the evolution of learning rules is still in its early stages
of the Hebbian learning rule have been proposed to deal
[264]–[267], [269], [270]. This research is important not
with different architectures. However, designing an optimal
only in providing an automatic way of optimizing learning
learning rule becomes very difficult when there is little
rules and in modeling the relationship between learning
prior knowledge about the ANN’s architecture, which is
and evolution, but also in modeling the creative process
often the case in practice. It is desirable to develop an
since newly evolved learning rules can deal with a complex
automatic and systematic way to adapt the learning rule to
and dynamic environment. This research will help us to
an architecture and the task to be performed. Designing a
understand better how creativity can emerge in artificial
learning rule manually often implies that some assumptions,
systems, like ANN’s, and how to model the creative process
which are not necessarily true in practice, have to be made.
in biological systems. A typical cycle of the evolution of
For example, the widely accepted Hebbian learning rule
learning rules can be described by Fig. 12. The iteration
has recently been shown to be outperformed by a new rule
stops when the population converges or a predefined
proposed by Artola et al. [255] in many cases [256]. The
maximum number of iterations has been reached.
new rule can learn more patterns than the optimal Hebbian
Similar to the reason explained in Section III-D, the
rule and can learn exceptions as well as regularities. It is,
fitness evaluation of each individual, i.e., the encoded
however, still difficult to say that this rule is optimal for
learning rule, is very noisy because we use phenotype’s
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1435

Fig. 12.
A typical cycle of the evolution of learning rules.
fitness (i.e., an ANN’s training result) to approximate
Unlike the evolution of connection weights and architec-
genotype’s fitness (i.e., a learning rule’s fitness). Such
tures which only deal with static objects in an ANN, i.e.,
approximation may be inaccurate. Some techniques have
weights and architectures, the evolution of learning rules
been used to alleviate this problem, e.g., using a weighted
has to work on the dynamic behavior of an ANN. The key
average of the training results from ANN’s with different
issue here is how to encode the dynamic behavior of a
initial connection weights in the fitness function. If the
learning rule into static chromosomes. Trying to develop a
ANN architecture is predefined and fixed, the evolved
universal representation scheme which can specify any kind
learning rule would be optimized toward this architecture. If
of dynamic behaviors is clearly impractical, let alone the
a near-optimal learning rule for different ANN architectures
prohibitive long computation time required to search such
is to be evolved, the fitness evaluation should be based on
a learning rule space. Constraints have to be set on the
the average training result from different ANN architectures
type of dynamic behaviors, i.e., the basic form of learning
in order to avoid overfitting a particular architecture.
rules being evolved in order to reduce the representation
complexity and the search space.
A. The Evolution of Algorithmic Parameters
Two basic assumptions which have often been made on
learning rules are: 1) weight updating depends only on
The adaptive adjustment of BP parameters (such as
local information such as the activation of the input node,
the learning rate and momentum) through evolution could
the activation of the output node, the current connection
be considered as the first attempt of the evolution of
weight, etc., and 2) the learning rule is the same for all
learning rules [32], [152], [272]. Harp et al. [152] encoded
connections in an ANN. A learning rule is assumed to be a
BP’s parameters in chromosomes together with ANN’s
linear function of these local variables and their products.
architecture. This evolutionary approach is different from
That is, a learning rule can be described by the function [5]
the nonevolutionary one such as offered by Jacobs [273]
because the simultaneous evolution of both algorithmic
parameters and architectures facilitates exploration of in-
(4)
teractions between the learning algorithm and architectures
such that a near-optimal combination of BP with an archi-
tecture can be found.
where
is time,
is the weight change,
Other researchers [32], [139], [213], [272] also used an
are local variables, and the ’s are real-valued coefficients
evolutionary process to find parameters for BP but ANN’s
which will be determined by evolution. In other words,
architecture was predefined. The parameters evolved in this
the evolution of learning rules in this case is equivalent to
case tend to be optimized toward the architecture rather
the evolution of real-valued vectors of
’s. Different
’s
than being generally applicable to learning. There are a
determine different learning rules. Due to a large number
number of BP algorithms with an adaptive learning rate
of possible terms in (4), which would make evolution
and momentum where a nonevolutionary approach is used.
very slow and impractical, only a few terms have been
Further comparison between the two approaches would be
used in practice according to some biological or heuristic
quite useful.
knowledge.
There are three major issues involved in the evolution
B. The Evolution of Learning Rules
of learning rules: 1) determination of a subset of terms
described in (4); 2) representation of their real-valued
The evolution of algorithmic parameters is certainly
coefficients as chromosomes; and 3) the EA used to evolve
interesting but it hardly touches the fundamental part of
these chromosomes. Chalmers [264] defined a learning rule
a training algorithm, i.e., its learning rule or weight up-
as a linear combination of four local variables and their
dating rule. Adapting a learning rule through evolution is
six pairwise products. No third- or fourth-order3 terms
expected to enhance ANN’s adaptivity greatly in a dynamic
environment.
3 The order is defined as the number of variables in a product.
1436
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

were used. Ten coefficients and a scale parameter were
be very high, i.e., many different architectures and learning
encoded in a binary string via exponential encoding. The
tasks have to be used in the fitness evaluation.
architecture used in the fitness evaluation was fixed because
only single-layer ANN’s were considered and the number
V.
OTHER COMBINATIONS BETWEEN ANN’S AND EA’S
of inputs and outputs were fixed by the learning task at
hand. After 1000 generations, starting from a population of
A. The Evolution of Input Features
randomly generated learning rules, the evolution discovered
the well-known delta rule [7], [274] and some of its vari-
For many practical problems, the possible inputs to an
ants. These experiments, although simple and preliminary,
ANN can be quite large. There may be some redundancy
demonstrated the potential of evolution in discovering novel
among different inputs. A large number of inputs to an
learning rules from a set of randomly generated rules.
ANN increase its size and thus require more training data
However, constraints set on learning rules could prevent
and longer training times in order to achieve a reasonable
some from being evolved such as those which include third-
generalization ability. Preprocessing is often needed to
or fourth-order terms.
reduce the number of inputs to an ANN. Various dimension
Similar experiments on the evolution of learning rules
reduction techniques, including the principal component
were also carried out by others [265], [266], [267], [269],
analysis, have been used for this purpose.
[270]. Fontanari and Meir [267] used Chalmers’ approach
The problem of finding a near-optimal set of input
to evolve learning rules for binary perceptrons. They also
features to an ANN can be formulated as a search problem.
considered four local variables but only seven terms were
Given a large set of potential inputs, we want to find a near-
adopted in their learning rules, which included one first-
optimal subset which has the fewest number of features
order, three second-order, and three third-order terms in (4).
but the performance of the ANN using this subset is no
Baxter [269] took one step further than just the evolution
worse than that of the ANN using the whole input set. EA’s
of learning rules. He tried to evolve complete ANN’s
have been used to perform such a search effectively [267],
(including connection weights, architectures, and learning
[277]–[287]. Very good results, i.e., better performance
rules) in a single level of evolution. It is clear that the search
with fewer inputs, have been reported from these studies.
space of possible ANN’s would be enormous if constraints
In the evolution of input features, each individual in the
population represents a subset of all possible inputs. This
were not set on the connection weights, architectures, and
can be implemented using a binary chromosome whose
learning rules. In his experiments, only ANN’s with binary
length is the same as the total number of input features.
threshold nodes were considered, so the weights could only
Each bit in the chromosome corresponds to a feature. “1”
be
1 or
1. The number of nodes in ANN’s was fixed.
indicates presence of a feature, while “0” indicates absence
The learning rule only considered two Boolean variables.
of the feature. The evaluation of an individual is carried out
Although Baxter’s experiments were rather simple, they
by training an ANN with these inputs and using the result
confirmed that complex behaviors could be learned and
to calculate its fitness value. The ANN architecture is often
the ANN’s learning ability could be improved through
fixed. Such evaluation is very noisy, however, because of
evolution [269].
the reason explained in Section III-D.
Bengio et al.’s approach [265], [266] is slightly different
Not only does the evolution of input features provide a
from Chalmers’ in the sense that gradient descent algo-
way to discover important features from all possible inputs
rithms and simulated annealing, rather than EA’s, were
automatically, it can also be used to discover new training
used to find near-optimal
’s. In their experiments, four
examples. Zhang and Veenker [288] described an active
local variables and one zeroth-order, three first-order, and
learning paradigm where a training algorithm based on
three second-order terms in (4) were used.
EA’s can self-select training examples. Cho and Cha [289]
Research related to the evolution of learning rules in-
proposed another algorithm for evolving training sets by
cludes Parisi et al.’s work on “econets,” although they
adding virtual samples.
did not evolve learning rules explicitly [260], [275]. They
emphasized the crucial role of the environment in which
the evolution occured while only using some simple neural
B. ANN as Fitness Estimator
networks. The issue of environmental diversity is closely
EA’s have been used with success to optimize various
related to the noisy fitness evaluation as pointed out in
control parameters [290]–[292]. However, it is very time
Section III-D and at the beginning of Section IV. There
consuming and costly to obtain fitness values for some
are two possible sources of noise. The first is the decoding
control problems as it is impractical to run a real system
process (morphogenesis) of chromosomes. The second is
for each combination of control parameters. In order to
introduced when a decoded learning rule is evaluated by
get around this problem and make evolution more efficient,
using it to train ANN’s. The environmental diversity is
fitness values are often approximated rather than computed
essential in obtaining a good approximation to the fitness
exactly. ANN’s are often used to model and approximate
of the decoded learning rule and thus in reducing the noise
a real control system due to their good generalization
from the second source. If a general learning rule which
abilities. The input to such ANN’s will be a set of control
is applicable to a wide range of ANN architectures and
parameters. The output will be the control system output
learning tasks is needed, the environmental diversity has to
from which an evaluation of the whole system can easily be
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1437

obtained. When an EA is used to search for a near-optimal
experiments which show that EA’s can be used to evolve
set of control parameters, the ANN will be used in fitness
ANN ensembles [192], [193], [302]–[305].
evaluation rather than the real control system [293]–[297].
This combination of ANN’s and EA’s has a couple of
D. Others
advantages in evolving control systems. First, the time-
consuming fitness evaluation based on real control systems
There are some other novel combinations between
is replaced by fast fitness evaluation based on ANN’s. Sec-
ANN’s and EA’s. For example, Zitar and Hassoun [306]
ond, this combination provides safer evolution of control
used EA’s to extract rules in a reinforcement learning
systems. EA’s are stochastic algorithms. It is possible that
system and then used them to train ANN’s. Sziranyi
some poor control parameters may be generated in the
[99] and Pal and Bhandari [98] used EA’s to tune circuit
evolutionary process. These parameters could damage a
parameters and templates in cellular ANN’s. Olmez [97]
real control system. If we use ANN’s to estimate fitness,
used EA’s to optimize a modified restricted Coulomb
we do not need to use the real system and thus can avoid
energy (RCE) ANN. Imada and Araki [307] used EA’s
damages to the real system. However, how successful this
to evolve connection weights for Hopfield ANN’s. Many
combination approach will be depends largely on how well
others used EA’s and ANN’s for combinatorial or global
ANN’s learn and generalize.
(numerical) optimization in order to combine EA’s global
search capability with ANN’s fast convergence to local
optima [308]–[318].
C. Evolving ANN Ensembles
Learning is often formulated as an optimization problem
VI. CONCLUDING REMARKS
in the machine learning field. However, learning is different
Although evolution has been introduced into ANN’s at
from optimization in practice because we want the learned
various levels, they can roughly be divided into three: the
system to have best generalization, which is different from
evolution of connection weights, architectures, and learning
minimizing an error function on a training data set. The
rules. This section first describes a general framework for
ANN with the minimum error on a training data set may
ANN’s and then draws some conclusions.
not have best generalization unless there is an equivalence
between generalization and the error on the training data.
Unfortunately, measuring generalization quantitatively and
A. A General Framework for EANN’s
accurately is almost impossible in practice [298] although
A general framework for EANN’s can be described
there are many theories and criteria on generalization,
by Fig. 13 [3]–[5]. The evolution of connection weights
such as the minimum description length (MDL) [299],
proceeds at the lowest level on the fastest time scale in an
Akaike information criteria (AIC) [300], and minimum
environment determined by an architecture, a learning rule,
message length (MML) [301]. In practice, these criteria
and learning tasks. There are, however, two alternatives to
are often used to define better error functions in the hope
decide the level of the evolution of architectures and that of
that minimizing the functions will maximize generalization.
learning rules: either the evolution of architectures is at the
While these functions often lead to better generalization of
highest level and that of learning rules at the lower level
learned systems, there is no guarantee.
or vice versa. The lower the level of evolution, the faster
EA’s are often used to maximize a fitness function or
the time scale it is on.
minimize an error function, and thus they face the same
From the engineering perspective, the decision on the
problem as described above: maximizing a fitness function
level of evolution depends on what kind of prior knowledge
is different from maximizing generalization. The EA is
is available. If there is more prior knowledge about an
actually used as an optimization, not learning, algorithm.
ANN’s architecture than that about their learning rules, or
While little can be done for traditional nonpopulation-based
if a particular class of architectures is pursued, it is better
learning, there are opportunities for improving population-
to put the evolution of architectures at the highest level
based learning, e.g., evolutionary learning.
because such knowledge can be encoded in an architecture’s
Since the maximum fitness may not be equivalent to best
genotypic representation to reduce the (architecture) search
generalization in evolutionary learning, the best individual
space and the lower-level evolution of learning rules can
with the maximum fitness in a population may not be
be biased toward this type of architectures. On the other
the one we want. Other individuals in the population may
hand, the evolution of learning rules should be at the
contain some useful information that will help to improve
highest level if there is more prior knowledge about them
generalization of learned systems. It is thus beneficial
available or if there is a special interest in certain type of
to make use of the whole population rather than any
learning rules. Unfortunately, there is usually little prior
single individual. A population always contains at least
knowledge available about either architectures or learning
as much information as any single individual. Hence,
rules in practice except for some very vague statements
combining different individuals in the population to form
[319]. In this case, it is more appropriate to put the evolution
an integrated system is expected to produce better results.
of architectures at the highest level since the optimality
Such a population of ANN’s is called an ANN ensemble
of a learning rule makes more sense when evaluated in
in this section. There have been some very successful
an environment including the architecture to which the
1438
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

characteristics of the design problem given in the begin-
ning of Section III. The direct encoding scheme of ANN
architectures is very good at fine tuning and generating
a compact architecture. The indirect encoding scheme is
suitable for finding a particular type of ANN architecture
quickly. Separating the evolution of architectures and that
of connection weights can make fitness evaluation inaccu-
rate and mislead evolution. Simultaneous evolution of ANN
architectures and connection weights generally produces
better results. It is argued that crossover produces more
harm than benefit in evolving ANN’s using distributed
representation (e.g., multilayer perceptrons) because it de-
stroys knowledge learned and distributed among different
connections easily. Crossover would be more suitable for
localized ANN’s, such as RBF networks.
Evolution can also be used to allow an ANN to adapt its
learning rule to its environment. In a sense, the evolution
provides ANN’s with the ability of learning to learn. It also
helps to model the relationship between learning and evo-
lution. Preliminary experiments have shown that efficient
learning rules can be evolved from randomly generated
rules. Current research on the evolution of learning rules
normally assumes that learning rules can be specified by
(4). While constraints on learning rules are necessary to
Fig. 13.
A general framework for EANN’s.
reduce the search space in the evolution, they might prevent
some interesting learning rules from being discovered.
Global search procedures such as EA’s are usually com-
learning rule is applied. Fig. 13 summarizes different levels
putationally expensive. It would be better not to employ
of evolution in ANN’s.
EA’s at all three levels of evolution. It is, however, benefi-
Fig. 13 can also be viewed as a general framework of
cial to introduce global search at some levels of evolution,
adaptive systems if we do not restrict ourselves to EA’s
especially when there is little prior knowledge available
and three levels. Simulated annealing, gradient descent,
at that level and the performance of the ANN is required
and even exhaustive search could be considered as special
to be high, because the trial-and-error and other heuristic
cases of EA’s. For example, the traditional BP network
methods are very ineffective in such circumstances.
can be considered as a special case of our general frame-
With the increasing power of parallel computers, the
work with one-shot (only-one-candidate) search used in
evolution of large ANN’s becomes feasible. Not only can
the evolution of architectures and learning rules and BP
such evolution discover possible new ANN architectures
used in the evolution of connection weights. In fact, the
and learning rules, but it also offers a way to model the
general framework provides a basis for comparing various
creative process as a result of ANN’s adaptation to a
specific EANN models according to the search procedures
dynamic environment.
they used at three different levels since it defines a three-
dimensional space where zero represents one-shot search
and
represents exhaustive search along each axis. Each
ACKNOWLEDGMENT
EANN model corresponds to a point in this space.
The author is grateful to D. B. Fogel and an anonymous
referee for their constructive comment on earlier versions
of this paper.
B. Conclusions
Evolution can be introduced into ANN’s at many differ-
REFERENCES
ent levels. The evolution of connection weights provides a
[1] X. Yao, “Evolution of connectionist networks,” in Preprints
global approach to connection weight training, especially
Int. Symp. AI, Reasoning & Creativity, Queensland, Australia,
when gradient information of the error function is difficult
Griffith Univ., 1991, pp. 49–52.
or costly to obtain. Due to the simplicity and generality
[2]
, “A review of evolutionary artificial neural networks,” Int.
J. Intell. Syst., vol. 8, no. 4, pp. 539–567, 1993.
of the evolution and the fact that gradient-based training
[3]
, “Evolutionary artificial neural networks,” Int. J. Neural
algorithms often have to be run multiple times in order
Syst., vol. 4, no. 3, pp. 203–222, 1993.
[4]
, “The evolution of connectionist networks,” in Artificial
to avoid being trapped in a poor local optimum, the
Intelligence and Creativity, T. Dartnall, Ed.
Dordrecht, The
evolutionary approach is quite competitive.
Netherlands: Kluwer, pp. 233–243, 1994.
Evolution can be used to find a near-optimal ANN
[5]
, “Evolutionary artificial neural networks,” in Encyclope-
dia of Computer Science and Technology, vol. 33, A. Kent and
architecture automatically. This has several advantages over
J. G. Williams, Eds.
New York: Marcel Dekker, 1995, pp.
heuristic methods of architecture design because of the
137–170.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1439

[6] G. E. Hinton, “Connectionist learning procedures,” Artificial
Computing’89, D. J. Evans, G. R. Joubert, and F. J. Peters, Eds.
Intell., vol. 40, no. 1–3, pp. 185–234, Sept. 1989.
Amsterdam, The Netherlands: Elsevier, 1989, pp. 275–280.
[7] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of
[32] R. K. Belew, J. McInerney, and N. N. Schraudolph, “Evolving
Neural Computation.
Reading, MA: Addison-Wesley, 1991.
networks: Using genetic algorithm with connectionist learning,”
[8] H.-P. Schwefel, Numerical Optimization of Computer Models.
Comput. Sci. Eng. Dep. (C-014), , Univ. of California, San
Chichester, U.K.: Wiley, 1981.
Diego, Tech. Rep. CS90-174 (revised), Feb. 1991.
[9] H.-P. Schwefel, Evolution and Optimum Seeking.
New York:
[33] N. J. Radcliffe, “Genetic neural networks on MIMD comput-
Wiley, 1995.
ers (compressed edition),” Ph.D. dissertation, Dep. Theoretical
[10] L. J. Fogel, A. J. Owens, and M. J. Walsh, Artificial Intelligence
Phys., Univ. Edinburgh, U.K., 1990.
Through Simulated Evolution.
New York: Wiley, 1966.
[34] D. L. Prados, “Training multilayered neural networks by replac-
[11] D. B. Fogel, System Identification Through Simulated Evolu-
ing the least fit hidden neurons,” in Proc. IEEE SOUTHEAST-
tion: A Machine Learning Approach to Modeling.
Needham
CON’92, vol. 2, pp. 634–637.
Heights, MA: Ginn, 1991.
[35] H. de Garis, “Steerable genNets: The genetic programming
[12]
, Evolutionary Computation: Toward a New Philosophy of
of steerable behaviors in genNets,” in Toward a Practice of
Machine Intelligence.
New York: IEEE Press, 1995.
Autonomous Systems: Proc. 1st Europ. Conf. Artificial Life, F.
[13] J. H. Holland, Adaptation in Natural and Artificial Systems.
J. Varela and P. Bourgine, Eds.
Cambridge, MA: MIT Press,
Ann Arbor, MI: Univ. Michigan Press, 1975.
1991, pp. 272–281.
[14] D. E. Goldberg, Genetic Algorithms in Search, Optimization,
[36]
, “Using the genetic algorithm to train time dependent
and Machine Learning.
Reading, MA: Addison-Wesley, 1989.
behaviors in neural networks,” in Proc. 1st Int. Workshop Mul-
[15] D. B. Fogel, “An introduction to simulated evolutionary opti-
tistrategy Learning (MSL-91), R. S. Michalski and G. Tecuci,
mization,” IEEE Trans. Neural Networks, vol. 5, pp. 3–14, Jan.
Eds.
Fairfax, VA: Center for Artificial Intelligence, 1991, pp.
1994.
273–280.
[16] T. B¨ack, U. Hammel, and H.-P. Schwefel, “Evolutionary com-
[37] M. Srinivas and L. M. Patnaik, “Learning neural network
putation: Comments on the history and current state,” IEEE
weights using genetic algorithms—Improving performance by
Trans. Evolutionary Computation, vol. 1, pp. 3–17, Apr. 1997.
search-space reduction,” in Proc. 1991 IEEE Int. Joint Conf.
[17] D. R. Hush and B. G. Horne, “Progress in supervised neural
Neural Networks (IJCNN’91 Singapore), vol. 3, pp. 2331–2336.
networks,” IEEE Signal Processing Mag., vol. 10, pp. 8–39,
[38] H. de Garis, “GenNets: Genetically programmed neural
Jan. 1993.
nets—Using the genetic algorithm to train neural nets whose
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learn-
inputs and/or outputs vary in time,” in Proc. 1991 IEEE Int.
ing internal representations by error propagation,” in Parallel
Joint Conf. Neural Networks (IJCNN’91 Singapore), vol. 2, pp.
Distributed Processing: Explorations in the Microstructures of
1391–1396.
Cognition, vol. I, D. E. Rumelhart and J. L. McClelland, Eds.
[39] A. Homaifar and S. Guan, “Training weights of neural networks
Cambridge, MA: MIT Press, 1986, pp. 318–362.
by genetic algorithms and messy genetic algorithms,” in Proc.
[19] M. F. Møller, “A scaled conjugate gradient algorithm for
2nd IASTED Int. Symp. Expert Systems and Neural Networks,
fast supervised learning,” Neural Networks, vol. 6, no. 4, pp.
M. H. Hamza, Ed.
Anaheim, CA: Acta, 1990, pp. 74–77.
525–533, 1993.
[40] D. L. Prados, “New learning algorithm for training multilay-
[20] K. J. Lang, A. H. Waibel, and G. E. Hinton, “A time-delay
ered neural networks that uses genetic-algorithm techniques,”
neural network architecture for isolated word recognition,”
Electron. Lett., vol. 28, pp. 1560–1561, July 1992.
Neural Networks, vol. 3, no. 1, pp. 33–43, 1990.
[41] A. P. Wieland, “Evolving neural network controllers for un-
[21] S. S. Fels and G. E. Hinton, “Glove-talk: A neural network
stable systems,” in Proc. 1991 IEEE Int. Joint Conf. Neural
interface between a data-glove and a speech synthesizer,” IEEE
Networks (IJCNN’91 Seattle), vol. 2, pp. 667–673.
Trans. Neural Networks, vol. 4, pp. 2–8, Jan. 1993.
[42] J. R. Koza and J. P. Rice, “Genetic generation of both the
[22] S. Knerr, L. Personnaz, and G. Dreyfus, “Handwritten digit
weights and architecture for a neural network,” in Proc. 1991
recognition by neural networks with single-layer training,”
IEEE Int. Joint Conf. Neural Networks (IJCNN’91 Seattle), vol.
IEEE Trans. Neural Networks, vol. 3, pp. 962–968, Nov. 1992.
2, pp. 397–404.
[23] R. S. Sutton, “Two problems with backpropagation and other
[43] S. Dominic, R. Das, D. Whitley, and C. Anderson, “Genetic
steepest-descent learning procedures for networks,” in Proc.
reinforcement learning for neural networks,” in Proc. 1991
8th Annual Conf. Cognitive Science Society.
Hillsdale, NJ:
IEEE Int. Joint Conf. Neural Networks (IJCNN’91 Seattle), vol.
Erlbaum, 1986, pp. 823–831.
2, pp. 71–76.
[24] D. Whitley, T. Starkweather, and C. Bogart, “Genetic al-
[44] F. A. Dill and B. C. Deer, “An exploration of genetic algorithms
gorithms and neural networks: Optimizing connections and
for the selection of connection weights in dynamical neural net-
connectivity,” Parallel Comput., vol. 14, no. 3, pp. 347–361,
works,” in Proc. IEEE 1991 National Aerospace and Electronics
1990.
Conf. NAECON 1991, vol. 3, pp. 1111–1115.
[25] Y. Chauvin and D. E. Rumelhart, Eds., Backpropagation: The-
[45] S. Bornholdt and D. Graudenz, “General asymmetric neural
ory, Architectures, and Applications.
Hillsdale, NJ: Erlbaum,
networks and structure design by genetic algorithms,” Neural
1995.
Networks, vol. 5, no. 2, pp. 327–334, 1992.
[26] D. Whitley, “The GENITOR algorithm and selective pressure:
[46] B. Maricic, “Genetically programmed neural network for
Why rank-based allocation of reproductive trials is best,” in
solving pole-balancing problem,” in Artificial Neural Networks:
Proc. 3rd Int. Conf. Genetic Algorithms and Their Applications,
Proc. Int. Conf. Artificial Neural Networks—ICANN-91, vol.
J. D. Schaffer, Ed.
San Mateo, CA: Morgan Kaufmann, 1989,
2, T. Kohonen, K. M¨akisara, O. Simula, and J. Kangas,
pp. 116–121.
Eds.
Amsterdam, The Netherlands: North-Holland, 1991,
[27] D. Montana and L. Davis, “Training feedforward neural net-
pp. 1273–1276.
works using genetic algorithms,” in Proc. 11th Int. Joint Conf.
[47] J. J. Spofford and K. J. Hintz, “Evolving sequential machines
Artificial Intelligence.
San Mateo, CA: Morgan Kaufmann,
in amorphous neural networks,” in Artificial Neural Networks:
1989, pp. 762–767.
Proc. Int. Conf. Artificial Neural Networks—ICANN-91, vol. 1,
[28] T. P. Caudell and C. P. Dolan, “Parametric connectivity:
T. Kohonen, K. M¨akisara, O. Simula, and J. Kangas, Eds.
Am-
Training of constrained networks using genetic algorithms,” in
sterdam, The Netherlands: North-Holland, 1991, pp. 973–978.
Proc. 3rd Int. Conf. Genetic Algorithms and Their Applications,
[48] F. Menczer and D. Parisi, “Evidence of hyperplanes in the
J. D. Schaffer, Ed.
San Mateo, CA: Morgan Kaufmann, 1989,
genetic learning of neural networks,” Biological Cybern., vol.
pp. 370–374.
66, pp. 283–289, 1992.
[29] D. B. Fogel, L. J. Fogel, and V. W. Porto, “Evolving neural
[49] Y. Ichikawa and T. Sawa, “Neural network application for direct
networks,” Biological Cybern., vol. 63, no. 6, pp. 487–493,
feedback controllers,” IEEE Trans. Neural Networks, vol. 3, pp.
1990.
224–231, Mar. 1992.
[30] P. Bartlett and T. Downs, “Training a neural network with
[50] A. W. O’Neil, “Genetic based training of two-layer, optoelec-
a genetic algorithm,” Dep. Elect. Eng., Univ. Queensland,
tronic neural network,” Electron. Lett., vol. 28, pp. 47–48, Jan.
Australia, Tech. Rep., Jan. 1990.
1992.
[31] J. Heistermann and H. Eckardt, “Parallel algorithms for learning
[51] J. Heistermann, “A mixed genetic approach to the optimization
in neural networks with evolution strategy,” in Proc. Parallel
of neural controllers,” in Proc. CompEuro’92, P. Dewilde and J.
1440
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

Vandewallele, Eds.
Los Alamitos, CA: IEEE Computer Soc.,
[71] S. L. Hung and H. Adeli, “Parallel genetic/neural network
1992, pp. 459–464.
learning algorithm for MIMD shared memory machines,” IEEE
[52] D. J. Janson and J. F. Frenzel, “Application of genetic algo-
Trans. Neural Networks, vol. 5, pp. 900–909, Nov. 1994.
rithms to the training of higher order neural networks,” J. Syst.
[72] T. Maifeld and G. Sheble, “Short-term load forecasting by a
Eng., vol. 2, pp. 272–276, 1992.
neural network and a refined genetic algorithm,” Electric Power
[53] D. J. Janson and J. F. Frenzel, “Training product unit neural
Syst. Res., vol. 31, no. 3, pp. 147–152, 1994.
networks with genetic algorithms,” IEEE Expert, vol. 8, pp.
[73] B. Deo, A. Datta, B. Kukreja, R. Rastogi, and K. Deb, “Opti-
26–33, May 1993.
mization of back propagation algorithm and GAS-assisted ANN
[54] A. V. Scherf and L. D. Voelz, “Training neural networks with
models for hot metal desulphurization,” Steel Res., vol. 65, no.
genetic algorithms for target detection,” in Proc. SPIE, Conf.
12, pp. 528–533, 1994.
Science of Artificial Neural Networks, vol. 1710, pt. 1, Orlando,
[74] A. J. Skinner and J. Q. Broughton, “Neural networks in compu-
FL, 1992, pp. 734–741.
tational materials science: Training algorithms,” Modeling and
[55] E. DeRouin and J. Brown, “Alternative learning methods for
Simulation in Materials Sci. Eng., vol. 3, no. 3, pp. 371–390,
training neural network classifiers,” in Proc. SPIE, Conf. Sci-
1995.
ence of Artificial Neural Networks, pt. 1, vol. 1710, Orlando,
[75] Y. Chen, M. Narita, and T. Yamada, “Nuclear reactor diagnostic
FL, 1992, pp. 474–483.
system using genetic algorithm (GA)-trained neural networks,”
[56] M. A. Lewis, A. H. Fagg, and A. Solidum, “Genetic pro-
Elect. Eng. Japan (English Translation of Denki Gakkai Ron-
gramming approach to the construction of a neural network
bunshi), vol. 115, no. 5, pp. 88–99, 1995.
for control of a walking robot,” in Proc. 1992 IEEE Int. Conf.
[76] M. Lehotsky, V. Olej, and J. Chmurny, “Pattern recognition
Robotics and Automation, vol. 3.
Los Alamitos, CA: IEEE
based on the fuzzy neural networks and their learning by
Computer Soc., 1992, pp. 2618–2623.
modified genetic algorithms,” Neural Network World, vol. 5,
[57] H. B. Penfold, O. F. Diessel, and M. W. Bentink, “A genetic
no. 1, pp. 91–97, 1995.
breeding algorithm which exhibits self-organizing in neural
[77] N. Wang, D. Yao, and F. Zhang, “Genetic-based fuzzy net
networks,” in Proc. IASTED Int. Symp. Artificial Intelligence
controller and its application,” Advances in Modeling and Anal.
Application and Neural Networks—AINN’90, M. H. Hamza, Ed.
B, vol. 38, nos. 1–2, pp. 49–58, 1997.
Anaheim, CA: ACTA, 1990, pp. 293–296.
[78] P. Osmera, “Optimization of neural networks by genetic al-
[58] M. L. Gargano, R. A. Marose, and L. von Kleeck, “An
gorithms,” Neural Network World, vol. 5, no. 6, pp. 965–976,
application of artificial neural networks and genetic algorithms
1995.
to personnel selection in the financial industry,” in Proc. 1st Int.
[79] S. Jain, P.-Y. Peng, A. Tzes, and F. Khorrami, “Neural network
Conf. Artificial Intelligence on Wall Street.
Los Alamitos, CA:
design with genetic learning for control of a single link flexible
IEEE Computer Soc., 1991, pp. 257–262.
manipulator,” J. Intell. Robot. Syst.: Theory Applicat., vol. 15,
[59] J. G. Elias, “Genetic generation of connection patterns for a dy-
no. 2, pp. 135–151, 1996.
namic artificial neural network,” in Proc. Int. Workshop Combi-
[80] M. A. Taha and A. S. Hanna, “Evolutionary neural network
nations of Genetic Algorithms and Neural Networks (COGANN-
model for the selection of pavement maintenance strategy,”
92), D. Whitley and J. D. Schaffer, Eds.
Los Alamitos, CA:
Transportation Res. Rec., vol. 1497, pp. 70–76, May 1995.
IEEE Computer Soc., 1992, pp. 38–54.
[81] S.-W. Lee, “Off-line recognition of totally unconstrained hand-
[60] R. D. Beer and J. C. Gallagher, “Evolving dynamical neural
written numerals using multilayer cluster neural network,” IEEE
networks for adaptive behavior,” Adaptive Behavior, vol. 1, no.
Trans. Pattern Anal. Machine Intell., vol. 18, pp. 648–652, June
1, pp. 91–122, 1992.
1996.
[61] L. Meeden, “Incremental approach to developing intelligent
[82] S.-K. Lee and D. Jang, “Translation, rotation and scale invariant
neural network controllers for robots,” IEEE Trans. Syst., Man,
pattern recognition using spectral analysis and hybrid genetic-
Cybern. B, vol. 26, pp. 474–485, Mar. 1996.
neural-fuzzy networks,” Comput. Ind. Eng., vol. 30, no. 3, pp.
[62] S. Baluja, “Evolution of an artificial neural network based
511–522, 1996.
autonomous land vehicle controller,” IEEE Trans. Syst., Man,
[83] J. V. Hansen and R. D. Meservy, “Learning experiments with
Cybern. B, vol. 26, pp. 450–463, Mar. 1996.
genetic optimization of a generalized regression neural net-
[63] V. W. Porto, D. B. Fogel, and L. J. Fogel, “Alternative neural
work,” Decision Support Syst., vol. 18, nos. 3–4, pp. 317–325,
network training methods,” IEEE Expert, vol. 10, pp. 16–22,
1996.
Mar. 1995.
[84] S.-J. Huang and C.-L. Huang, “Genetic-based multilayered per-
[64] S. Yao, C. J. Wei, and Z. Y. He, “Evolving wavelet neural
ceptron for Taiwan power system short-term load forecasting,”
networks for function approximation,” Electron. Lett., vol. 32,
Electric Power Syst. Res., vol. 38, no. 1, pp. 69–74, 1996.
no. 4, pp. 360–361, 1996.
[85]
, “Application of genetic-based neural networks to ther-
[65] G. W. Greenwood, “Training partially recurrent neural net-
mal unit commitment,” IEEE Trans. Power Syst., vol. 12, pp.
works using evolutionary strategies,” IEEE Trans. Speech Audio
654–660, Feb. 1997.
Processing, vol. 5, pp. 192–194, Feb. 1997.
[86] Y. M. Chen and R. M. O’Connell, “Active power line con-
[66] A. Chilingarian, S. Ter-Antonyan, and A. Vardanyan, “Com-
ditioner with a neural network control,” IEEE Trans. Ind.
parison of Bayesian and neural techniques in problems of
Applicat., vol. 33, pp. 1131–1136, July/Aug. 1997.
classification to multiple categories,” Nuclear Instrum. Methods
[87] C. R. Chen and C. H. Chu, “Concurrent training algorithm for
in Phys. Res., Section A: Accelerators, Spectrometers, Detectors
supervised learning in artificial neural networks,” J. Inform. Sci.
and Associated Equipment, vol. 389, nos. 1–2, pp. 230–232,
Eng., vol. 13, no. 2, pp. 267–291, 1997.
1997.
[88] J. Jarmulak, P. Spronck, and E. J. H. Kerckhoffs, “Neural
[67] A. P. Topchy and O. A. Lebedko, “Neural network training by
networks in process control: Model-based and reinforcement
means of cooperative evolutionary search,” Nuclear Instrum.
trained controllers,” Comput. Electron. Agriculture, vol. 18, nos.
Methods in Phys. Res., Section A: Accelerators, Spectrometers,
2–3, pp. 149–166, 1997.
Detectors and Associated Equipment, vol. 389, nos. 1–2, pp.
[89] R. S. Sexton, R. E. Dorsey, and J. D. Johnson, “Toward global
240–241, 1997.
optimization of neural networks: A comparison of the genetic
[68] R. Berlich and M. Kunze, “Comparison between the perfor-
algorithm and backpropagation,” Decision Support Syst., vol.
mance of feed forward neural networks and the supervised
22, no. 2, pp. 171–185, 1998.
growing neural gas algorithm,” Nuclear Instrum. Methods in
[90] M. Kamo, T. Kubo, and Y. Iwasa, “Neural network for female
Phys. Res., Section A: Accelerators, Spectrometers, Detectors
mate preference, trained by a genetic algorithm,” Philosophical
and Associated Equipment, vol. 389, nos. 1–2, pp. 274–277,
Trans. Roy. Soc., vol. 353, no. 1367, p. 399, 1998.
1997.
[91] J. Zhou and D. L. Civco, “Using genetic learning neural
[69] Y. Ikuno, H. Kawabata, Y. Hirao, M. Hirata, T. Nagahara, and
networks for spatial decision making in GIS,” PE & RS:
Y. Inagaki, “Application of an improved genetic algorithm to
Photogrammetric Eng. Remote Sensing, vol. 62, no. 11, pp.
the learning of neural networks,” Solid State Commun., vol. 91,
1287–1295, 1996.
no. 3, pp. 731–735, 1994.
[92] B. Yoon, D. J. Holmes, and G. Langholz, “Efficient genetic
[70] W. Kinnebrock, “Accelerating the standard backpropagation
algorithms for training layered feedforward neural networks,”
method using a genetic approach,” Neurocomput., vol. 6, nos.
Inform. Sci., vol. 76, nos. 1–2, pp. 67–85, 1994.
5–6, pp. 583–588, 1994.
[93] P. G. Korning, “Training neural networks by means of genetic
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1441

algorithms working on very long chromosomes,” Int. J. Neural
[115] N. J. Radcliffe, “Equivalence class analysis of genetic algo-
Syst., vol. 6, no. 3, pp. 299–316, 1995.
rithms,” Complex Syst., vol. 5, no. 2, pp. 183–205, 1991.
[94] S. Park, L.-J. Park, and C. Park, “A neuro-genetic controller for
[116] D. B. Fogel and A. Ghozeil, “A note on representations and
nonminimum phase systems,” IEEE Trans. Neural Networks,
variation operators,” IEEE Trans. Evolutionary Computation,
vol. 6, pp. 1297–1300, Sept. 1995.
vol. 1, pp. 159–161, July 1997.
[95] D. B. Fogel, E. C. Wasson, and V. W. Porto, “A step toward
[117] N. Saravanan and D. B. Fogel, “Evolving neural control sys-
computer-assisted mammography using evolutionary program-
tems,” IEEE Expert, vol. 10, pp. 23–27, Mar. 1995.
ming and neural networks,” Cancer Lett., vol. 119, no. 1, p.
[118] K. S. Tang, C. Y. Chan, K. F. Man, and S. Kwong, “Genetic
93, 1997.
structure for NN topology and weights optimization,” in Proc.
[96] D. B. Fogel, E. C. Wasson, and E. M. Boughton, “Evolving
1st IEE/IEEE Int. Conf. Genetic Algorithms in Engineering Sys-
neural networks for detecting breast cancer,” Cancer Lett., vol.
tems: Innovations and Applications (GALESIA’95), Stevenage,
96, no. 1, pp. 49–53, 1995.
U.K., Inst. Elect. Eng. Conf. Pub. 414, pp. 250–255.
[97] T. Olmez, “Classification of EEG waveforms by using RCE
[119] M. Sarkar and B. Yegnanarayana, “Evolutionary programming-
neural networks and genetic algorithms,” Electron. Lett., vol.
based probabilistic neural networks construction technique,” in
33, no. 18, pp. 1561–1562, 1997.
Proc. 1997 IEEE Int. Conf. Neural Networks, Part 1 (of 4), pp.
[98] S. Pal and D. Bhandari, “Genetic algorithms with fuzzy fitness
456–461.
function for object extraction using cellular networks,” Fuzzy
[120] P. J. Angeline, “Evolving basis functions with dynamic recep-
Sets and Syst., vol. 65, nos. 2–3, pp. 129–139, 1994.
tive fields,” in Proc. 1997 IEEE Int. Conf. Systems, Man, and
[99] T. Sziranyi, “Robustness of cellular neural networks in image
Cybernetics, Part 5 (of 5), pp. 4109–4114.
deblurring and texture segmentation,” Int. J. Circuit Theory
[121] X. Yao and Y. Liu, “Fast evolutionary programming,” in Evo-
Applicat., pt 2, vol. 24, no. 3, pp. 381–396, 1996.
lutionary Programming V: Proc. 5th Annu. Conf. Evolutionary
[100] P. Adamidis and V. Petridis, “Co-operating populations with
Programming, L. J. Fogel, P. J. Angeline, and T. B¨ack, Eds.
different evolution behaviors,” in Proc. 1996 IEEE Int. Conf.
Cambridge, MA: MIT Press, 1996, pp. 451–460.
Evolutionary Computation, ICEC’96, pp. 188–191.
[122] X. Yao, G. Lin, and Y. Liu, “An analysis of evolutionary
[101] D. Thierens, “Non-redundant genetic coding of neural net-
algorithms based on neighborhood and step sizes,” in Evolu-
works,” in Proc. 1996 IEEE Int. Conf. Evolutionary Compu-
tionary Programming VI: Proc. 6th Annu. Conf. Evolutionary
tation, ICEC’96, pp. 571–575.
Programming, vol. 1213 of Lecture Notes in Computer Science,
[102] R. G. Hutchins, “Identifying nonlinear dynamic systems using
P. J. Angeline, R. G. Reynolds, J. R. McDonnell, and R.
neural nets and evolutionary programming,” in Proc. 28th
Eberhart, Eds.
Berlin, Germany: Springer-Verlag, 1997, pp.
Asilomar Conf. Signals, Systems & Computers. Part 2 (of 2).
297–307.
Los Alamitos, CA: IEEE Computer Soc., 1994, pp. 887–891.
[123] T. B¨ack and H.-P. Schwefel, “An overview of evolutionary
[103] J. Lei, G. He, and J.-P. Jiang, “State estimation of the CSTR
algorithms for parameter optimization,” Evolutionary Compu-
system based on a recurrent neural network trained by HGA’s,”
tation, vol. 1, no. 1, pp. 1–23, 1993.
in Proc. 1997 IEEE Int. Conf. Neural Networks. Part 2 (of 4),
[124] X. Yao and Y. Liu, “Fast evolution strategies,” Control Cybern.,
pp. 779–782.
vol. 26, no. 3, pp. 467–496, 1997.
[104] D. Popvic and K. C. S. Murty, “Retaining diversity of search
[125] X. Yao, “An empirical study of genetic operators in genetic
point distribution through a breeder genetic algorithm for neu-
algorithms,” Microprocessing and Microprogramming, vol. 38,
ral network learning,” in Proc. 1997 IEEE Int. Conf. Neural
no. 1–5, pp. 707–714, Sept. 1993.
Networks. Part 1 (of 4), pp. 495–498.
[126] J. Torreele, “Temporal processing with recurrent networks:
[105] A. Likartsis, I. Vlachavas, and L. H. Tsoukalas, “New hybrid
An evolutionary approach,” in Proc. 4th Int. Conf. Genetic
neural-genetic methodology for improving learning,” in Proc.
Algorithms, R. K. Belew and L. B. Booker, Eds.
San Mateo,
9th IEEE Int. Conf. Tools with Artificial Intelligence, pp. 32–36.
CA: Morgan Kaufmann, 1991, pp. 555–561.
[106] A. Schultz and H. Wechsler, “Data fusion in neural networks
[127] M. Mandischer, “Evolving recurrent neural networks with non-
via computational evolution,” in Proc. 1994 IEEE Int. Conf.
binary encoding,” in Proc. 1995 IEEE Int. Conf. Evolutionary
Neural Networks. Part 5 (of 7), pp. 3044–3049.
Computation, Part 2 (of 2), pp. 584–589.
[107] M. Koeppen, M. Teunis, and B. Nickolay, “Neural network
[128] F. Heimes, G. Zalesski, W. L. Jr., and M. Oshima, “Traditional
that uses evolutionary learning,” in Proc. 1997 IEEE Int. Conf.
and evolved dynamic neural networks for aircraft simulation,”
Evolutionary Computation, ICEC’97, pp. 635–639.
in Proc. 1997 IEEE Int. Conf. Systems, Man, and Cybernetics,
[108] J. Aguilar and A. Colmenares, “Recognition algorithm using
Part 3 (of 5), pp. 1995–2000.
evolutionary learning on the random neural networks,” in Proc.
[129] K.-H. Wu, C.-H. Chen, and J.-D. Lee, “Cache-genetic-based
1997 IEEE Int. Conf. Neural Networks. Part 2 (of 4), pp.
modular fuzzy neural network for robot path planning,” in Proc.
1023–1028.
1996 IEEE Int. Conf. Systems, Man and Cybernetics, Part 4 (of
[109] R. Hochman, T. M. Khoshgoftaar, E. B. Allen, and J. P.
4), pp. 3089–3094, 1996.
Hudepohl, “Evolutionary neural networks: A robust approach
[130] T. Ichimura, T. Takano, and E. Tazaki, “Reasoning and learning
to software reliability problems,” in Proc. 1997 8th Int. Symp.
method for fuzzy rules using neural networks with adaptive
Software Reliability Engineering.
Los Alamitos, CA: IEEE
structured genetic algorithm,” in Proc. 1995 IEEE Int. Conf.
Computer Soc., 1997, pp. 13–26.
Systems, Man and Cybernetics, Part 4 (of 5), pp. 3269–3274.
[110] W. Yan, Z. Zhu, and R. Hu, “Hybrid genetic/BP algorithm and
[131] S. E. Fahlman, “Faster-learning variations on back-propagation:
its application for radar target classification,” in Proc. 1997
An empirical study,” in Proc. 1988 Connectionist Models Sum-
IEEE National Aerospace and Electronics Conf., NAECON. Part
mer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski,
2 (of 2), pp. 981–984.
Eds.
San Mateo, CA: Morgan Kaufmann, 1988, pp. 38–51.
[111] J.-M. Yang, C.-Y. Kao, and J.-T. Horng, “Evolving neural
[132] E. M. Johansson, F. U. Dowla, and D. M. Goodman, “Backprop-
induction regular language using combined evolutionary
agation learning for multilayer feed-forward neural networks
algorithms,” in Proc. 1996 1st Joint Conf. Intelligent Sys-
using the conjugate gradient method,” Int. J. Neural Syst., vol.
tems/ISAI/IFIS, pp. 162–169.
2, no. 4, pp. 291–301, 1991.
[112] P. Zhang, Y. Sankai, and M. Ohta, “Hybrid adaptive learning
[133] H. Kitano, “Empirical studies on the speed of convergence of
control of nonlinear system,” in Proc. 1995 American Control
neural network training using genetic algorithms,” in Proc. 8th
Conf. Part 4 (of 6), pp. 2744–2748.
Nat. Conf. AI (AAAI-90).
Cambridge, MA: MIT Press, 1990,
[113] P. J. B. Hancock, “Genetic algorithms and permutation prob-
pp. 789–795.
lems: A comparison of recombination operators for neural net
[134] D. H. Wolpert and W. G. Macready, “No free lunch theorems
structure specification,” in Proc. Int. Workshop Combinations
for optimization,” IEEE Trans. Evolutionary Computation, vol.
of Genetic Algorithms and Neural Networks (COGANN-92), D.
1, pp. 67–82, Apr. 1997.
Whitley and J. D. Schaffer, Eds.
Los Alamitos, CA: IEEE
[135] X. Yao, “Optimization by genetic annealing,” in Proc. 2nd
Computer Soc., 1992, pp. 108–122.
Australian Conf. Neural Networks, Sydney, Australia, 1991, pp.
[114] J. Antonisse, “A new interpretation of schema notation that
94–97.
overturns the binary encoding constraint,” in Proc. 3rd Int. Conf.
[136] S. Omatu and S. Deris, “Stabilization of inverted pendulum
Genetic Algorithms and Their Applications, J. D. Schaffer, Ed.
by the genetic algorithm,” in Proc. 1996 IEEE Conf. Emerging
San Mateo, CA: Morgan Kaufmann, 1989, pp. 86–91.
Technologies and Factory Automation, ETFA’96. Part 1 (of 2),
1442
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

pp. 282–287.
1987, pp. 63–69.
[137] S. Omatu and M. Yoshioka, “Self-tuning neuro-PID control and
[159] P. J. B. Hancock, “GANNET: Design of a neural net for face
applications,” in Proc. 1997 IEEE Int. Conf. Systems, Man, and
recognition by genetic algorithm,” Center for Cognitive and
Cybernetics, Part 3 (of 5), pp. 1985–1989.
Computational Neuroscience, Dep. Comput. Sci. Psychology,
[138] I. Erkmen and A. Ozdogan, “Short term load forecasting using
Stirling Univ., Stirling, U.K., Tech. Rep. CCCN-6, Aug. 1990.
genetically optimized neural network cascaded with a modified
[160] C. P. Dolan and M. G. Dyer, “Toward the evolution of sym-
Kohonen clustering process,” in Proc. 1997 IEEE Int. Symp.
bols,” in Proc. 2nd Int. Conf. Genetic Algorithms and Their
Intelligent Control, pp. 107–112.
Applications.
Hillsdale, NJ: Erlbaum, 1987, pp. 123–131.
[139] J. J. Merelo, M. Pat´on, A. Ca˜nas, A. Prieto, and F. Mor´an,
[161] B. Fullmer and R. Miikkulainen, “Using marker-based genetic
“Optimization of a competitive learning neural network by
encoding of neural networks to evolve finite-state behavior,”
genetic algorithms,” in Proc. Int. Workshop Artificial Neural
in Toward a Practice of Autonomous Systems: Proc. 1st Eu-
Networks (IWANN’93), Lecture Notes in Computer Science, vol.
rop. Conf. Artificial Life, F. J. Varela and P. Bourgine, Eds.
686.
Berlin, Germany: Springer-Verlag, 1993, pp. 185–192.
Cambridge, MA: MIT Press, 1991, pp. 255–262.
[140] D. D. Wang and J. Xu, “Fault detection based on evolving LVQ
[162] M. Zaus and R. Megnet, “Fusion-technology and the design of
neural networks,” in Proc. 1996 IEEE Int. Conf. Systems, Man
evolutionary machines for neural networks,” in Artificial Neural
and Cybernetics, vol. 1, pp. 255–260.
Networks: Proc. Int. Conf. Artificial Neural Networks—ICANN-
[141] S. E. Fahlman and C. Lebiere, “The cascade-correlation learning
91, vol. 2, T. Kohonen, K. M¨akisara, O. Simula, and J. Kangas,
architecture,” in Advances in Neural Information Processing
Eds.
Amsterdam, The Netherlands: North-Holland, 1991, pp.
Systems 2, D. S. Touretzky, Ed.
San Mateo, CA: Morgan
1165–1168.
Kaufmann, 1990, pp. 524–532.
[163] N. Dodd, “Optimization of neural-network structure using ge-
[142] M. Frean, “The upstart algorithm: A method for constructing
netic techniques,” in Proc. Conf. Applications of Artificial In-
and training feedforward neural networks,” Neural Computa-
telligence in Engineering VI, G. Rzevski and R. A. Adey, Eds.
tion, vol. 2, no. 2, pp. 198–209, 1990.
London, U.K.: Elsevier, 1991, pp. 939–944.
[143] M. C. Mozer and P. Smolensky, “Skeletonization: A technique
[164] S. J. Marshall and R. F. Harrison, “Optimization and training of
for trimming the fat from a network via relevance assessment,”
feedforward neural networks by genetic algorithms,” in Proc.
Connection Sci., vol. 1, no. 1, pp. 3–26, 1989.
2nd IEE Int. Conf. Artificial Neural Networks.
London, U.K.:
[144] J. Sietsma and R. J. F. Dow, “Creating artificial neural networks
Inst. Elect. Eng. Press, 1991, pp. 39–43.
that generalize,” Neural Networks, vol. 4, no. 1, pp. 67–79,
[165] S. Oliker, M. Furst, and O. Maimon, “A distributed genetic
1991.
algorithm for neural network design and training,” Complex
[145] Y. Hirose, K. Yamashita, and S. Hijiya, “Back-propagation
Syst., vol. 6, no. 5, pp. 459–477, 1992.
algorithm which varies the number of hidden units,” Neural
[166] L. Mart´i, “Genetically generated neural networks I: Repre-
Networks, vol. 4, no. 1, pp. 61–66, 1991.
sentational effects,” in Proc. Int. Joint Conf. Neural Networks
[146] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain dam-
(IJCNN’92 Baltimore), vol. IV, pp. 537–542.
age,” in Advances in Neural Information Processing Systems
[167] W. Schiffmann, M. Joost, and R. Werner, “Synthesis and per-
2, D. S. Touretzky, Ed.
San Mateo, CA: Morgan Kaufmann,
formance analysis of multilayer neural network architectures,”
1990, pp. 598–605.
Inst. Phys., Koblenz, Univ. Koblenz, Tech. Rep. 16/1992, 1992.
[147] A. Roy, L. S. Kim, and S. Mukhopadhyay, “A polynomial
[168] H.-M. Voigt, J. Born, and I. Santib´a˜nez-Koref, “Evolutionary
time algorithm for the construction and training of a class of
structuring of artificial neural networks,” Bionics and Evolution
multilayer perceptrons,” Neural Networks, vol. 6, no. 4, pp.
Techniques Lab., Tech. Univ. Berlin, Germany, Tech. Rep.,
535–545, 1993.
1993.
[148] J.-N. Hwang, S.-S. You, S.-R. Lay, and I.-C. Jou, “What’s
[169] F. J. Mar´in and F. Sandoval, “Genetic synthesis of discrete-
wrong with a cascaded correlation learning network: A pro-
time recurrent neural network,” in Proc. Int. Workshop Artificial
jection pursuit learning perspective,” Dep. Elect. Eng., Univ.
Neural Networks (IWANN’93), Lecture Notes in Computer Sci-
Washington, Seattle, Tech. Rep. FT-10, 1993.
ence, vol. 686.
Berlin, Germany: Springer-Verlag, 1993, pp.
[149] P. J. Angeline, G. M. Sauders, and J. B. Pollack, “An evolution-
179–184.
ary algorithm that constructs recurrent neural networks,” IEEE
[170] E. Alba, J. F. Aldana, and J. M. Troya, “Fully automatic ANN
Trans. Neural Networks, vol. 5, pp. 54–65, Jan. 1994.
design: A genetic approach,” in Proc. Int. Workshop Artificial
[150] G. F. Miller, P. M. Todd, and S. U. Hegde, “Designing neural
Neural Networks (IWANN’93), Lecture Notes in Computer Sci-
networks using genetic algorithms,” in Proc. 3rd Int. Conf.
ence, vol. 686.
Berlin, Germany: Springer-Verlag, 1993, pp.
Genetic Algorithms and Their Applications, J. D. Schaffer, Ed.
399–404.
San Mateo, CA: Morgan Kaufmann, 1989, pp. 379–384.
[171] D. White and P. Ligomenides, “GANNet: A genetic algorithm
[151] H. Kitano, “Designing neural networks using genetic algorithms
for optimizing topology and weights in neural network design,”
with graph generation system,” Complex Syst., vol. 4, no. 4, pp.
in Proc. Int. Workshop Artificial Neural Networks (IWANN’93),
461–476, 1990.
Lecture Notes in Computer Science, vol. 686.
Berlin, Ger-
[152] S. A. Harp, T. Samad, and A. Guha, “Toward the genetic
many: Springer-Verlag, 1993, pp. 322–327.
synthesis of neural networks,” in Proc. 3rd Int. Conf. Genetic
[172] H. Andersen, “A constructive algorithm for a multilayer per-
Algorithms and Their Applications, J. D. Schaffer, Ed.
San
ceptron based on co-operative population concepts in genetic
Mateo, CA: Morgan Kaufmann, 1989, pp. 360–369.
algorithms,” Master’s thesis, Dep. Elect. Comput. Eng., Univ.
[153] J. D. Schaffer, R. A. Caruana, and L. J. Eshelman, “Using
Queensland, Brisbane, Australia, Sept. 1993.
genetic search to exploit the emergent behavior of neural
[173] D. H. Gariglio and A. J. Helget, “Identification and control of a
networks,” Phys. D, vol. 42, pp. 244–248, 1990.
simulated distillation plant using connectionist and evolutionary
[154] S. W. Wilson, “Perceptron redux: Emergence of structure,”
techniques,” Simulation, vol. 63, no. 6, pp. 393–404, 1994.
Phys. D, vol. 42, pp. 249–256, 1990.
[174] L. A. Belfore, II, and A.-R. A. Arkadan, “Modeling faulted
[155] N. Dodd, D. Macfarlane, and C. Marland, “Optimization of
switched reluctance motors using evolutionary neural net-
artificial neural network structure using genetic techniques
works,” IEEE Trans. Ind. Electron., vol. 44, pp. 226–233, Feb.
implemented on multiple transputers,” in Proc. Transputing’91,
1997.
P. Welch, D. Stiles, T. L. Kunii, and A. Bakkers, Eds.
Ams-
[175] A. Dobnikar, “Evolutionary design of application-specific neu-
terdam, The Netherlands: IOS, 1991, pp. 687–700.
ral networks: A genetic approach,” Neural Network World, vol.
[156] S. A. Harp, T. Samad, and A. Guha, “Designing application-
5, no. 1, pp. 41–50, 1995.
specific neural networks using the genetic algorithm,” in Ad-
[176] Y. Sato and T. Furuya, “Coevolution in recurrent neural net-
vances in Neural Information Processing Systems 2, D. S.
works using genetic algorithms,” Syst. Comput. Japan, vol. 27,
Touretzky, Ed.
San Mateo, CA: Morgan Kaufmann, 1990, pp.
no. 5, pp. 64–73, 1996.
447–454.
[177] F. Mondada and D. Floreano, “Evolution of neural control struc-
[157] W. B. Dress, “Darwinian optimization of synthetic neural
tures: Some experiments on mobile robots,” Robot. Autonomous
systems,” in Proc. 1st IEEE Int. Conf. Neural Networks, vol.
Syst., vol. 16, nos. 2–4, pp. 183–195, 1995.
3, 1987, pp. 769–775.
[178] H. Ishigami, T. Fukuda, and F. Arai, “Structure optimization of
[158] A. Bergman and M. Kerszberg, “Breeding intelligent automata,”
fuzzy neural network by genetic algorithm,” Fuzzy Sets Syst.,
in Proc. of the 1st IEEE Int. Conf. Neural Networks, vol. 3,
vol. 71, no. 3, pp. 257–264, 1995.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1443

[179] J. Fang and Y. Xi, “Neural network design based on evolu-
Neural Networks, vol. 7, no. 2, pp. 341–351, 1994.
tionary programming,” Artificial Intell. Eng., vol. 11, no. 2, pp.
[202] B. Kothari, B. Paya, and I. Esat, “Machinery fault diagnostics
155–161, 1997.
using direct encoding graph syntax for optimizing artificial
[180] C. Nikolopoulos and P. Fellrath, “Hybrid expert system for
neural network structure,” in Proc. 1996 3rd Biennial Joint
investment advising,” Expert Syst., vol. 11, no. 4, pp. 245–248,
Conf. Engineering Systems Design and Analysis, ESDA, Part
1994.
7 (of 9).
New York: ASME, 1996, pp. 205–210.
[181] R. E. Smith and I. H. B. Cribbs, “Combined biological
[203] C. R. Chow and C. H. Chu, “Configuration of multilayered
paradigms: A neural, genetics-based autonomous systems
feedforward networks by an evolutionary process,” in Proc.
strategy,” Robot. Autonomous Syst., vol. 22, no. 1, pp. 65–74,
37th Midwest Symp. Circuits and Systems, Part 1 (of 2), 1994,
1997.
pp. 531–534.
[182] V. Maniezzo, “Genetic evolution of the topology and weight
[204] S. Dreiseitl and W. Jacak, “Genetic algorithm based neural
distribution of neural networks,” IEEE Trans. Neural Networks,
networks for dynamical system modeling,” in Proc. 1995 IEEE
vol. 5, pp. 39–53, Jan. 1994.
Int. Conf. Evolutionary Computation, Part 2 (of 2), pp. 602–607.
[183] J. Sklansky and M. Vriesenga, “Genetic selection and neu-
[205] E. Vonk, L. C. Jain, and R. Johnson, “Using genetic algorithms
ral modeling of piecewise linear classifiers,” Int. J. Pattern
with grammar encoding to generate neural networks,” in Proc.
Recognition Artificial Intell., vol. 10, no. 5, pp. 587–612, 1996.
1995 IEEE Int. Conf. Neural Networks, Part 4 (of 6), pp.
[184] X. Yao and Y. Shi, “A preliminary study on designing artificial
1928–1931.
neural networks using co-evolution,” in Proc. IEEE Singapore
[206] S.-B. Cho and K. Shimohara, “Modular neural networks
Int. Conf. Intelligent Control and Instrumentation, Singapore,
evolved by genetic programming,” in Proc. 1996 IEEE Int.
June 1995, pp. 149–154.
Conf. Evolutionary Computation, ICEC’96, pp. 681–684.
[185] X. Yao and Y. Liu, “EPNet for chaotic time-series prediction,”
[207] D. H. F. Yip, E. L. Hines, and W. W. H. Yu, “Application of
in Select. Papers 1st Asia-Pacific Conf. Simulated Evolution and
artificial neural networks in sales forecasting,” in Proc. 1997
Learning (SEAL’96), vol. 1285 of Lecture Notes in Artificial
IEEE Int. Conf. Neural Networks, Part 4 (of 4), pp. 2121–2124.
Intelligence, X. Yao, J.-H. Kim, and T. Furuhashi, Eds.
Berlin,
[208] C. A. Perez and C. A. Holzmann, “Improvements on hand-
Germany: Springer-Verlag, 1997, pp. 146–156.
written digit recognition by genetic selection of neural network
[186] Y. Liu and X. Yao, “A population-based learning algorithm
topology and by augmented training,” in Proc. 1997 IEEE
which learns both architectures and weights of neural net-
Int. Conf. Systems, Man, and Cybernetics, Part 2 (of 5), pp.
works,” Chinese J. Advanced Software Res., vol. 3, no. 1, pp.
1487–1491.
54–65, 1996.
[209] K. Shirakawa and N. Okubo, “Genetic determination of large-
[187] X. Yao and Y. Liu, “Toward designing artificial neural networks
signal HEMT model,” in Proc. 1997 27th Europ. Microwave
by evolution,” Appl. Math. Computation, vol. 91, no. 1, pp.
Conf, Part 1 (of 2), Turnbridge Wells, U.K., pp. 432–436.
83–90, 1998.
[210] N. Snoad and T. Bossomaier, “MONSTER—The ghost in the
[188]
, “Evolutionary artificial neural networks that learn and
connection machine: Modularity of neural systems in theoretical
generalize well,” in Proc. 1996 IEEE Int. Conf. Neural Net-
evolutionary research,” in Proc. 1995 ACM/IEEE Supercomput-
works, Washington, DC, June 3–6, 1996, pp. 159–164.
ing Conf.
Los Alamitos, CA: IEEE Computer Soc., 1995, pp.
[189]
, “Evolving artificial neural networks for medical appli-
578–601.
cations,” in Proc. 1995 Australia-Korea Joint Workshop Evolu-
[211] T. Ragg and S. Gutjahr, “Automatic determination of optimal
tionary Computation, KAIST, Taejon, Korea, Sept. 1995, pp.
network topologies based on information theory and evolution,”
1–16.
in Proc. 1997 23rd EUROMICRO Conf..
Los Alamitos, CA:
[190] X. Yao, “The importance of maintaining behavioral link be-
IEEE Computer Soc., 1997, pp. 549–555.
tween parents and offspring,” in Proc. 1997 IEEE Int. Conf.
[212] S. D. Likothanassis, E. Georgopoulos, and D. Fotakis,
Evolutionary Computation (ICEC’97), Indianapolis, IN, Apr.
“Optimizing the structure of neural networks using evolution
1997, pp. 629–633.
techniques,” in Proc. 5th Int. Conf. Application of High-
[191] Y. Liu and X. Yao, “Evolutionary design of artificial neu-
Performance Computers in Engineering.
Ashurst,
U.K.:
ral networks with different nodes,” in Proc. 1996 IEEE Int.
Computational Mechanics, 1997, pp. 157–168.
Conf. Evolutionary Computation (ICEC’96), Nagoya, Japan, pp.
[213] D. Patel, “Using genetic algorithms to construct a network for
670–675.
financial prediction,” in Proc. SPIE: Applications of Artificial
[192] X. Yao and Y. Liu, “Ensemble structure of evolutionary artifi-
Neural Networks in Image Processing, Bellingham, WA, 1996,
cial neural networks,” in Proc. 1996 IEEE Int. Conf. Evolution-
pp. 204–213.
ary Computation (ICEC’96), Nagoya, Japan, pp. 659–664.
[214] D. E. Moriarty and R. Miikkulainen, “Improving game-tree
[193]
, “Making use of population information in evolutionary
search with evolutionary neural networks,” in Proc. 1st IEEE
artificial neural networks,” IEEE Trans. Syst., Man, Cyber. B,
Conf. Evolutionary Computation. Part 1 (of 2), 1994, pp.
vol. 28, pp. 417–425, Mar. 1998.
496–501.
[194]
, “A new evolutionary system for evolving artificial neural
[215] S. G. Romaniuk, “Applying crossover operators to automatic
networks,” IEEE Trans. Neural Networks, vol. 8, pp. 694–713,
neural network construction,” in Proc. 1st IEEE Conf. Evolu-
May 1997.
tionary Computation. Part 2 (of 2), 1994, pp. 750–752.
[195]
, “Evolving artificial neural networks through evolution-
[216]
, “Toward minimal network architectures with evolution-
ary programming,” in Evolutionary Programming V: Proc. 5th
ary growth networks,” in Proc. 1994 IEEE Int. Conf. Neural
Annu. Conf. Evolutionary Programming, L. J. Fogel, P. J.
Networks, Part 3 (of 7), pp. 1710–1712.
Angeline, and T. B¨ack, Eds.
Cambridge, MA: MIT Press,
[217] C.-H. Lee and J. H. Kim, “Evolutionary ordered neural network
1996, pp. 257–266.
with a linked-list encoding scheme,” in Proc. 1996 IEEE Int.
[196] J. R. McDonnell and D. E. Waagen, “Evolving cascade-
Conf. Evolutionary Computation, ICEC’96, pp. 665–669.
correlation networks for time-series forecasting,” Int. J.
[218] A. I. Esparcia-Alcazar and K. C. Sharman, “Genetic pro-
Artificial Intell. Tools, vol. 3, no. 3, pp. 327–338, 1995.
gramming techniques that evolve recurrent neural network
[197] J. R. McDonnell and D. Waagen, “Evolving recurrent percep-
architectures for signal processing,” in Proc. 1996 IEEE Work-
trons for time-series modeling,” IEEE Trans. Neural Networks,
shop Neural Networks for Signal Processing, pp. 139–148.
vol. 5, pp. 24–38, Jan. 1994.
[219] Z. Q. Bo, H. Y. Li, R. K. Aggarwal, A. T. Johns, and P. J.
[198] J. C. F. Pujol and R. Poli, “Evolving the topology and the
Moore, “Non-communication protection of transmission line
weights of neural networks using a dual representation,” Appl.
based on genetic evolved neural network,” in Proc. 1997 6th
Intell., vol. 8, no. 1, pp. 73–84, 1998.
Int. Conf. Developments in Power System Protection, Stevenage,
[199] N. Richards, D. E. Moriarty, and R. Miikkulainen, “Evolving
U.K., Inst. Elect. Eng. Conf. Pub. 434, 1997, pp. 291–294.
neural networks to play Go,” Appl. Intell., vol. 8, no. 1, pp.
[220] Z. Q. Bo, H. Y. Li, R. K. Aggarwal, and A. T. Johns,
85–96, 1998.
“Current transients based faulted phase selection technique
[200] D. Moriarty and R. Miikkulainen, “Discovering complex othello
using a genetic algorithm evolved neural network,” in Proc.
strategies through evolutionary neural networks,” Connection
1997 32nd Universities Power Engineering Conf., UPEC’97,
Sci., vol. 7, nos. 3/4, pp. 195–210, 1995.
Part 2 (of 2).
Iraklio, Greece: Technological Educational Inst.,
[201] R. Smalz and M. Conrad, “Combining evolution with credit
1997, pp. 959–962.
apportionment: A new learning algorithm for neural nets,”
[221] Y. H. Song, A. T. Johns, Q. Y. Xuan, and J. Y. Liu, “Genetic
1444
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

algorithm based neural networks applied to fault classification
A. E. Eiben, Eds.
Berlin, Germany: Springer-Verlag, 1998,
for EHV transmission lines with a UPFC,” in Proc. 1997 6th
pp. 271–280.
Int. Conf. Developments in Power System Protection, Stevenage,
[243] D. B. Fogel, “Phenotypes, genotypes, and operators in evolu-
U.K., Inst. Elect. Eng. Conf. Pub. 434, 1997, pp. 278–281.
tionary computation,” in Proc. 1995 IEEE Int. Conf. Evolution-
[222] Q. Zhao, “EditEr: A combination of IEA and CEA,” in Proc.
ary Computation (ICEC’95), Perth, Australia, pp. 193–198.
1997 IEEE Int. Conf. Evolutionary Computation, ICEC’97, pp.
[244] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Dis-
641–645.
tributed Processing: Explorations in the Microstructures of Cog-
[223] M. Sarkar and B. Yegnanarayana, “Feedforward neural net-
nition.
Cambridge, MA: MIT Press, 1986.
works configuration using evolutionary programming,” in Proc.
[245] S. Jockusch and H. Ritter, “Self-organizing maps: Local com-
1997 IEEE Int. Conf. Neural Networks, Part 1 (of 4), pp.
petition and evolutionary optimization,” Neural Networks, vol.
438–443.
7, no. 8, pp. 1229–1239, 1994.
[224] S. B. Cho, “Combining modular neural networks developed
[246] Q. Zhao and T. Higuchi, “Evolutionary learning of nearest-
by evolutionary algorithm,” in Proc. 1997 IEEE Int. Conf.
neighbor MLP,” IEEE Trans. Neural Networks, vol. 7, pp.
Evolutionary Computation, ICEC’97, pp. 647–650.
762–767, May 1996.
[225] M. W. Hwang, J. Y. Choi, and J. Park, “Evolutionary projection
[247] Q. Zhao, “Stable on-line evolutionary learning of NN-MLP,”
neural networks,” in Proc. 1997 IEEE Int. Conf. Evolutionary
IEEE Trans. Neural Networks, vol. 8, pp. 1371–1378, Nov.
Computation, ICEC’97, pp. 667–671.
1997.
[226] M. Bichsel and P. Seitz, “Minimum class entropy: A maximum
[248] Q. Zhao and T. Higuchi, “Efficient learning of NN-MLP based
information approach to layered networks,” Neural Networks,
on individual evolutionary algorithm,” Neurocomput., vol. 13,
vol. 2, no. 2, pp. 133–141, 1989.
nos. 2–4, pp. 201–215, 1996.
[227] D. B. Fogel, “An information criterion for optimal neural
[249] B. A. Whitehead and T. D. Choate, “Cooperative-competitive
network selection,” IEEE Trans. Neural Networks, vol. 2, pp.
genetic evolution of radial basis function centers with widths
490–497, Sept. 1991.
for time series prediction,” IEEE Trans. Neural Networks, vol.
[228] J. Utans and J. Moody, “Selecting neural network architectures
7, pp. 869–880, July 1996.
via the prediction risk: Application to corporate bond rating
[250] B. A. Whitehead, “Genetic evolution of radial basis function
prediction,” in Proc. 1st Int. Conf. AI Applications on Wall
coverage using orthogonal niches,” IEEE Trans. Neural Net-
Street.
Los Alamitos, CA: IEEE Computer Soc., 1991, pp.
works, vol. 7, pp. 1525–1528, Nov. 1996.
35–41.
[251] S. A. Billings and G. L. Zheng, “Radial basis function network
[229] W. M. Spears and V. Anand, “A study of crossover operators
configuration using genetic algorithms,” Neural Networks, vol.
in genetic programming,” in Proc. 6th Int. Symp. Methodologies
8, no. 6, pp. 877–890, 1995.
for Intelligent Systems (ISMIS’91), Z. W. Ras and M. Ze-
[252] B. A. Whitehead and T. D. Choate, “Evolving space-filling
mankova, Eds.
Berlin, Germany: Springer-Verlag, 1991, pp.
curves to distribute radial basis functions over an input space,”
409–418.
IEEE Trans. Neural Networks, vol. 5, pp. 15–23, Jan. 1994.
[230] E. Mjolsness, D. H. Sharp, and B. K. Alpert, “Scaling, machine
[253] E. P. Maillard and D. Gueriot, “RBF neural network, basis
learning, and genetic neural nets,” Advances in Applied Math.,
functions and genetic algorithms,” in Proc. 1997 IEEE Int. Conf.
vol. 10, pp. 137–163, 1989.
Neural Networks. Part 4 (of 4), pp. 2187–2190.
[231] J. W. L. Merrill and R. F. Port, “Fractally configured neural
[254] Y. Liu and X. Yao, “A population-based learning algorithm
networks,” Neural Networks, vol. 4, no. 1, pp. 53–60, 1991.
which learns both architectures and weights of neural net-
[232] F. Gruau, “Genetic synthesis of boolean neural networks with
works,” in Proc. ICYCS’95 Workshop Soft Computing, Beijing,
a cell rewriting developmental process,” in Proc. Int. Work-
China, 1995, pp. 29–38.
shop Combinations of Genetic Algorithms and Neural Networks
[255] A. Artola, S. Broecher, and W. Singer, “Different voltage-
(COGANN-92), D. Whitley and J. D. Schaffer, Eds.
Los
dependent thresholds for inducing long-term depression and
Alamitos, CA: IEEE Computer Soc., 1992, pp. 55–74.
long-term potentiation in slices of rat visual cortex,” Nature,
[233] A. A. Siddiqi and S. M. Lucas, “A comparison of matrix
vol. 347, no. 6288, pp. 69–72, Sept. 1990.
rewriting versus direct encoding for evolving neural networks,”
[256] P. J. B. Hancock, L. S. Smith, and W. A. Phillips, “A bio-
in Proc. 1998 IEEE Int. Conf. Evolutionary Computation, pp.
logically supported error-correcting learning rule,” in Proc. Int.
392–397.
Conf. Artificial Neural Networks—ICANN-91, vol. 1, T. Koho-
[234] S. S. Wilson, “Teaching network connectivity using simulated
nen, K. M¨akisara, O. Simula, and J. Kangas, Eds.
Amsterdam,
annealing on a massively parallel processor,” Proc. IEEE, vol.
The Netherlands: North-Holland, 1991, pp. 531–536.
79, pp. 559–566, Apr. 1991.
[257] J. M. Smith, “When learning guides evolution,” Nature, vol.
[235] H. H. Szu and R. L. Hartley, “Nonconvex optimization by fast
329, no. 6142, pp. 761–762, Oct. 1987.
simulated annealing,” Proc. IEEE, vol. 75, pp. 1538–1540, Nov.
[258] G. E. Hinton and S. J. Nowlan, “How learning can guide
1987.
evolution,” Complex Syst., vol. 1, no. 3, pp. 495–502, 1987.
[236] H. C. Andersen and A. C. Tsoi, “A constructive algorithm for
[259] R. K. Belew, “Evolution, learning and culture: Computational
the training of a multilayer perceptron based on the genetic
metaphors for adaptive algorithms,” Comput. Sci. Eng. Dep. (C-
algorithm,” Complex Syst., vol. 7, no. 4, pp. 249–268, 1993.
014), Univ. of California, San Diego, Tech. Rep. #CS89-156,
[237] R. E. Smith and H. B. Cribbs III, “Is a learning classifier system
Sept. 1989.
a type of neural network,” Evolutionary Computation, vol. 2,
[260] S. Nolfi, J. L. Elman, and D. Parisi, “Learning and evolution in
no. 1, pp. 19–36, Spring 1994.
neural networks,” Center Res. Language, Univ. California, San
[238] G. Mani, “Learning by gradient descent in function space,”
Diego, July Tech. Rep. CRT-9019, 1990.
in Proc. IEEE Int. Conf. System, Man, and Cybernetics, Los
[261] H. M¨uhlenbein and J. Kindermann, “The dynamics of evolution
Angeles, CA, 1990, pp. 242–247.
and learning—Toward genetic neural networks,” in Connection-
[239] D. R. Lovell and A. C. Tsoi, The Performance of the Neocogni-
ism in Perspective, R. Pfeifer et al., Eds.
Amsterdam, The
tron with Various S-Cell and C-Cell Transfer Functions, Intell.
Netherlands: Elsevier, pp. 173–198, 1989.
Machines Lab., Dep. Elect. Eng., Univ. Queensland, Tech. Rep.,
[262] H. M¨uhlenbein, “Adaptation in open systems: Learning and
Apr. 1992.
evolution,” in Workshop Konnektionismus, J. Kindermann and
[240] B. DasGupta and G. Schnitger, “Efficient approximation with
C. Lischka, Eds.
Germany: GMD, 1988, pp. 122–130.
neural networks: A comparison of gate functions,” Dep. Com-
[263] J. Paredis, “The evolution of behavior: Some experiments,” in
put. Sci., Pennsylvania State Univ., University Park, Tech. Rep.,
Proc. 1st Int. Conf. Simulation of Adaptive Behavior: From Ani-
1992.
mals to Animats, J. Meyer and S. W. Wilson, Eds.
Cambridge,
[241] D. G. Stork, S. Walker, M. Burns, and B. Jackson, “Preadap-
MA: MIT Press, 1991.
tation in neural circuits,” in Proc. Int. Joint Conf. Neural
[264] D. J. Chalmers, “The evolution of learning: An experiment in
Networks, vol. I, Washington, DC, 1990, pp. 202–205.
genetic connectionism,” in Proc. 1990 Connectionist Models
[242] A. V. Sebald and K. Chellapilla, “On making problems evolu-
Summer School, D. S. Touretzky, J. L. Elman, and G. E. Hinton,
tionarily friendly, part I: Evolving the most convenient repre-
Eds.
San Mateo, CA: Morgan Kaufmann, 1990, pp. 81–90.
sentations,” in Evolutionary Programming VII: Proc. 7th Annu.
[265] Y. Bengio and S. Bengio, “Learning a synaptic learning
Conf. Evolutionary Programming, vol. 1447 of Lecture Notes in
rule,” D´ep. Informatique et de Recherche Op´erationelle, Univ.
Computer Science, V. W. Porto, N. Saravanan, D. Waagen, and
Montr´eal, Canada, Tech. Rep. 751, Nov. 1990.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1445

[266] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei, “On the
Networks. Part 4 (of 4), pp. 2460–2463.
optimization of a synaptic learning rule,” in Preprints Conf.
[287] A. D. Brown and H. C. Card, “Evolutionary artificial neural net-
Optimality in Artificial and Biological Neural Networks, Univ.
works,” in Proc. 1997 Canadian Conf. Electrical and Computer
of Texas, Dallas, Feb. 6–8, 1992.
Engineering, CCECE’97. Part 1 (of 2), pp. 313–317.
[267] J. F. Fontanari and R. Meir, “Evolving a learning algorithm for
[288] B.-T. Zhang and G. Veenker, “Neural networks that teach
the binary perceptron,” Network, vol. 2, no. 4, pp. 353–359,
themselves through genetic discovery of novel examples,” in
Nov. 1991.
Proc. 1991 IEEE Int. Joint Conf. Neural Networks (IJCNN’91
[268] D. H. Ackley and M. S. Littman, “Interactions between learning
Singapore), vol. 1, pp. 690–695.
and evolution,” in Artificial Life II, SFI Studies in the Sciences of
[289] S. Cho and K. Cha, “Evolution of neural network training set
Complexity, Vol. X, C. G. Langton, C. Taylor, J. D. Farmer, and
through addition of virtual samples,” in Proc. 1996 IEEE Int.
S. Rasmussen, Eds.
Reading, MA: Addison-Wesley, 1991, pp.
Conf. Evolutionary Computation, ICEC’96, pp. 685–688.
487–509.
[290] Z. Michalewicz, C. Z. Janikow, and J. B. Krawczyk, “A modi-
[269] J. Baxter, “The evolution of learning algorithms for artificial
fied genetic algorithm for optimal control problems,” Comput.
neural networks,” in Complex Systems, D. Green and T. Bosso-
Math. Applicat., vol. 23, no. 12, pp. 83–94, 1992.
maier, Eds.
Amsterdam, The Netherlands: IOS, pp. 313–326,
[291] D. B. Fogel, “Applying evolutionary programming to selected
1992.
control problems,” Comput. Math. Applicat., vol. 27, no. 11,
[270] D. Crosher, “The artificial evolution of a generalized class of
pp. 89–104, 1994.
adaptive processes,” in Preprints AI’93 Workshop Evolutionary
[292] J. Bobbin and X. Yao, “Solving optimal control problems with a
Computation, Nov. 1993, pp. 18–36.
cost on changing control by evolutionary algorithms,” in Proc.
[271] P. Turney, D. Whitley, and R. Anderson, Eds., Evolutionary
1997 IEEE Int. Conf. Evolutionary Computation (ICEC’97),
Computation (Special Issue on the Baldwin Effect), vol. 4, no.
Indianapolis, IN, pp. 331–336.
3, pp. 213–329, 1996.
[293] T. Morimoto, J. de Baerdemaeker, and Y. Hashimoto, “Intelli-
[272] H. B. Kim, S. H. Jung, T. G. Kim, and K. H. Park, “Fast learning
gent approach for optimal control of fruit-storage process using
method for back-propagation neural network by evolutionary
neural networks and genetic algorithms,” Comput. Electron.
adaptation of learning rates,” Neurocomput., vol. 11, no. 1, pp.
Agriculture, vol. 18, nos. 2–3, pp. 205–224, 1997.
101–106, 1996.
[294] T. Morimoto, J. Suzuki, and Y. Hashimoto, “Optimization of
[273] R. A. Jacobs, “Increased rates of convergence through learning
a fuzzy controller for fruit storage using neural networks and
rate adaptation,” Neural Networks, vol. 1, no. 3, pp. 295–307,
genetic algorithms,” Eng. Applicat. Artificial Intell., vol. 10, no.
1988.
5, pp. 453–461, 1997.
[274] B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in
[295] N. Noguchi and H. Terao, “Path planning of an agricultural mo-
1960 IRE WESTCON Convention Rec., pp. 96–104, 1960.
bile robot by neural network and genetic algorithm,” Comput.
[275] D. Parasi, F. Cecconi, and S. Nolfi, “Econets: Neural networks
Electron. Agriculture, vol. 18, nos. 2–3, pp. 187–204, 1997.
that learn in an environment,” Network, vol. 1, no. 2, pp.
[296] L.-D. Chou and J.-L. C. Wu, “Bandwidth allocation of virtual
149–168, Apr. 1990.
paths using neural-network-based genetic algorithms,” Proc.
[276] M. N. Narayanan and S. B. Lucas, “A genetic algorithm to
Inst. Elect. Eng. Commun., vol. 145, no. 1, pp. 33–39, 1998.
improve a neural network to predict a patient’s response to
[297]
, “Parameter adjustment using neural-network-based ge-
Warfarin,” Methods Inform. Med., vol. 32, pp. 55–58, 1993.
netic algorithms for guaranteed QOS in atm networks,” IEICE
[277] Z. Guo and R. E. Uhrig, “Using genetic algorithms to select in-
Trans. Commun., vol. E78-B, no. 4, pp. 572–579, 1995.
puts for neural networks,” in Proc. Int. Workshop Combinations
[298] D. H. Wolpert, “A mathematical theory of generalization,”
of Genetic Algorithms and Neural Networks (COGANN-92), D.
Complex Syst., vol. 4, no. 2, pp. 151–249, 1990.
Whitley and J. D. Schaffer, Eds.
Los Alamitos, CA: IEEE
[299] J. Rissanen, “Modeling by shortest data description,” Automat-
Computer Soc., 1992, pp. 223–234.
ica, vol. 14, no. 5, pp. 465–471, Sept. 1978.
[278] F. Z. Brill, D. E. Brown, and W. N. Martin, “Fast genetic
[300] H. Akaike, “A new look at the statistical model identification,”
selection of features for neural network classifiers,” IEEE Trans.
IEEE Trans. Automat. Contr., vol. AC-19, pp. 716–723, Dec.
Neural Networks, vol. 3, pp. 324–328, Mar. 1992.
1974.
[279] L. S. Hsu and Z. B. Wu, “Input pattern encoding through gen-
[301] C. S. Wallace and J. D. Patrick, “Coding decision trees,” Dep.
eralized adaptive search,” in Proc. Int. Workshop Combinations
Comput. Sci., Monash University, Clayton, Victoria, Australia,
of Genetic Algorithms and Neural Networks (COGANN-92), D.
Tech. Rep. 91/153, Aug. 1991.
Whitley and J. D. Schaffer, Eds.
Los Alamitos, CA: IEEE
[302] X. Yao, Y. Liu, and P. Darwen, “How to make best use
Computer Soc., 1992, pp. 235–247.
of evolutionary learning,” in Complex Systems: From Local
[280] E. J. Chang and R. P. Lippmann, “Using genetic algorithms
Interactions to Global Phenomena, R. Stocker, H. Jelinek, and
to improve pattern classification performance,” in Advances in
B. Durnota, Eds.
Amsterdam, The Netherlands: IOS, 1996,
Neural Information Processing Systems (3), R. P. Lippmann, J.
pp. 229–242.
E. Moody, and D. S. Touretzky, Eds.
San Mateo, CA: Morgan
[303] P. G. Harrald and M. Kamstra, “Evolving artificial neural
Kaufmann, 1991, pp. 797–803.
networks to combine financial forecasts,” IEEE Trans. Evolu-
[281] J. Hornyak and L. Monostori, “Feature extraction technique for
tionary Computation, vol. 1, pp. 40–51, Apr. 1997.
ANN-based financial forecasting,” Neural Network World, vol.
[304] Y. Liu and X. Yao, “Toward designing neural network ensem-
7, nos. 4–5, pp. 543–552, 1997.
bles by evolution,” in Parallel Problem Solving from Nature
[282] S. M. Yamany, K. J. Khiani, and A. A. Farag, “Applications of
(PPSN) V, vol. 1498 of Lecture Notes in Computer Science, A.
neural networks and genetic algorithms in the classification of
E. Eiben, T. B¨ack, M. Schoenauer, and H.-P. Schwefel, Eds.
enothelial cells,” Pattern Recognition Lett., vol. 18, nos. 11–13,
Berlin, Germany: Springer-Verlag, 1998, pp. 623–632.
pp. 1205–1210, 1997.
[305] S. Wesolkowski and K. Hassanein, “Comparative study of
[283] B. Back, T. Laitinen, and K. Sere, “Neural networks and genetic
combination schemes for an ensemble of digit recognition
algorithms for bankruptcy predictions,” Expert Syst. Applicat.:
neural networks,” in Proc. 1997 IEEE Int. Conf. Systems, Man,
Int. J., vol. 11, no. 4, pp. 407–413, 1996.
and Cybernetics. Part 4 (of 5), pp. 3534–3539, 1997.
[284] F. Dellaert and J. Vandewalle, “Automatic design of cellular
[306] R. A. Zitar and M. H. Hassoun, “Neurocontrollers trained with
neural networks by means of genetic algorithms: Finding a
rules extracted by a genetic assisted reinforcement learning
feature detector,” in Proc. IEEE Int. Workshop Cellular Neural
system,” IEEE Trans. Neural Networks, vol. 6, pp. 859–878,
Networks and Their Applications, 1994, pp. 189–194.
July 1995.
[285] P. R. Weller, R. Summers, and A. C. Thompson, “Using
[307] A. Imada and K. Araki, “Application of an evolution strategy to
a genetic algorithm to evolve an optimum input set for a
the Hopfield model of associative memory,” in Proc. 1997 IEEE
predictive neural network,” in Proc. 1st IEE/IEEE Int. Conf.
Int. Conf. Evolutionary Computation, ICEC’97, pp. 679–683.
Genetic Algorithms in Engineering Systems: Innovations and
[308] P. P. C. Yip and Y.-H. Pao, “Combinatorial optimization with
Applications (GALESIA’95), Stevenage, U.K., Inst. Elect. Eng.
use of guided evolutionary simulated annealing,” IEEE Trans.
Conf. Pub. 414, 1995, pp. 256–258.
Neural Networks, vol. 6, no. 2, pp. 290–295, 1995.
[286] M. A. Kupinski and M. L. Maryellen, “Feature selection and
[309] D. W. Coit and A. E. Smith, “Solving the redundancy allocation
classifiers for the computerized detection of mass lesions in
problem using a combined neural network/genetic algorithm
digital mammography,” in Proc. 1997 IEEE Int. Conf. Neural
approach,” Comput. Oper. Res., vol. 23, no. 6, pp. 515–526,
1446
PROCEEDINGS OF THE IEEE, VOL. 87, NO. 9, SEPTEMBER 1999

1996.
[319] G. Weiss, “Combining neural and evolutionary learning: As-
[310] M. Kishimoto, K. Sakasai, K. Ara, T. Fujita, and Y. Suzuki,
pects and approaches,” Institut f¨ur Informatik, Technische Univ.
“Reconstruction of plasma current profile of tokamaks using
M¨unchen, Germany, Tech. Rep. FKI-132-90, May 1990.
combinatorial optimization techniques,” IEEE Trans. Plasma
Sci.
, vol. 24, pp. 528–538, Apr. 1996.
[311] L. L. Rogers, F. U. Dowla, and V. M. Johnson, “Optical field-
scale groundwater remediation using neural networks and the
genetic algorithms,” Environmental Sci. Technol., vol. 29, no.
Xin Yao (Senior Member, IEEE) received the
5, pp. 1145–1156, 1995.
[312] J. S. Huang and H.-C. Liu, “Object recognition using genetic
B.Sc. degree from the University of Science
algorithms with a Hopfield’s neural model,” Expert Syst. Appli-
and Technology of China (USTC), Hefei, Anhui,
cat., vol. 13, no. 3, pp. 191–199, 1997.
China, in 1982, the M.Sc. degree from the
[313] J.-V. Potvin, D. Dube, and C. Robillard, “Hybrid approach to
North China Institute of Computing Technolo-
vehicle routing using neural networks and genetic algorithms,”
gies (NCI), Beijing, in 1985, and the Ph.D.
Appl. Intell., vol. 6, no. 3, pp. 241–252, 1996.
degree from USTC in 1990.
[314] C. H. Dagli and P. Poshyanonda, “New approaches to nesting
He is a Professor of Computer Science at
rectangular patterns,” J. Intell. Manufact., vol. 8, no. 3, pp.
the School of Computer Science, University of
177–190, 1997.
Birmingham, U.K. He was an Associate Pro-
[315] H. C. Lee and C. H. Dagli, “Parallel genetic-neuro scheduler
fessor in the School of Computer Science, Uni-
for job-shop scheduling problems,” Int. J. Production Econ.,
versity College, the University of New South Wales, Australian Defence
vol. 51, nos. 1–2, pp. 115–122, 1997.
Force Academy (ADFA), Canberra, before joining the University of
[316] S. Sette, L. Boullart, and P. Kiekens, “Optimizing the
Birmingham. He held postdoctoral fellowships in the Australian National
fiber-to-yarn production process with a combined neural
University (ANU) and the Commonwealth Scientific and Industrial Re-
network/genetic algorithm approach,” Textile Res. J., vol. 67,
search Organization (CSIRO) in 1990–1992.
no. 2, pp. 84–92, 1997.
Dr. Yao was the Program Committee Cochair of CEC’99, ICCIMA’99,
[317] S.-S. Han and G. S. May, “Using neural network process models
IEEE ICEC’98, SEAL’98, IEEE ICEC’97, and SEAL’96. He is an
to perform PECVD silicon dioxide recipe synthesis via genetic
Associate Editor of IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION
algorithms,” IEEE Trans. Semiconduct. Manufact., vol. 10, pp.
and Knowledge and Information Systems: An International Journal. He
279–287, May 1997.
is a member of the editorial board of the Journal of Cognitive Systems
[318] N. Funabiki, J. Kitamichi, and S. Nishikawa, “Evolutionary
Research, a member of the IEEE NNC Technical Committee on Evolu-
neural network algorithm for max cut problems,” in Proc. 1997
tionary Computation, and the Second Vice-President of the Evolutionary
IEEE Int. Conf. Neural Networks, Part 2 (of 4), pp. 1260–1265.
Programming Society.
YAO: EVOLVING ARTIFICIAL NEURAL NETWORKS
1447