Scientific Discovery
using
Genetic Programming
Maarten Keijzer
LYNGBY 2001
IMM-PHD-xx
IMM
Abstract
Genetic Programming is capable of automatically inducing symbolic computer programs on the basis of a set of examples or their performance in a simulation. Mathematical expressions are a well-defined subset of symbolic computer programs and are also suitable for optimization using the genetic programming paradigm. The induction of mathematical expressions based on data is called symbolic regression.
In this work, genetic programming is extended to not just fit the data, i.e., get the numbers right, but also to get the dimensions right. For this, units of measurement are used. The main contribution of this work can be summarized as:
The symbolic expressions produced by genetic programming can be made suitable for analysis and interpretation by using units of measurement to guide or restrict the search.
To achieve this, the following has been accomplished:
• A standard genetic programming system is modified to be able to induce expressions that more-or-less abide by type constraints. This system is used to implement a preferential bias towards dimensionally correct solutions.
• A novel genetic programming system is introduced that is able to induce expressions in languages that need context-sensitive constraints. It is demonstrated that this system can be used to implement a declarative bias towards
1. the exclusion of certain syntactical constructs;
2. the induction of expressions that use units of measurement;
3. the induction of expressions that use matrix algebra;
4. the induction of expressions that are numerically stable and correct.
• A case study using four real-world problems in the induction of dimensionally correct empirical equations on data using the two different methods is presented to illustrate the use and limitations of these methods in a framework of scientific discovery.
Preface
This thesis has been submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy. The work documented in this thesis has been carried out both at DHI - Water & Environment and the Department for Mathematical Modelling, Section for Digital Signal Processing at the Technical University of Denmark. The work was supervised by Professor Lars Kai Hansen of the DTU and Dr. Vladan Babović of DHI - Water & Environment.
During the Ph.D. study a number of conference papers and journal papers have been written.
Accepted Journal Papers and Book Chapters
• Maarten Keijzer and Vladan Babović. Declarative and preferential bias in GP-based scientific discovery. Genetic Programming and Evolvable Machines, to appear 2002.
• Vladan Babović and Maarten Keijzer. On the introduction of declarative bias in knowledge discovery computer systems. In P. Goodwin, editor, New paradigms in river and estuarine management. Kluwer, 2001.
• Vladan Babović and Maarten Keijzer. Genetic programming as a model induction engine. Journal of Hydroinformatics, 2(1):35-61, 2000.
• Vladan Babović and Maarten Keijzer. Forecasting of river discharges in the presence of chaos and noise. In J. Marsalek, editor, Coping with Floods: Lessons Learned from Recent Experiences, Kluwer, 1999.
• Vladan Babović, Jean Philip Drecourt, Maarten Keijzer and Peter Friis Hansen. Modelling of water supply assets: a data mining approach. Urban Water, to appear 2002.
Conference Papers
• Maarten Keijzer, Vladan Babović, Conor Ryan, Michael O'Neill, and Mike Cattolico. Adaptive logic programming. In Lee Spector et al., eds, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001.
• Maarten Keijzer, Conor Ryan, Michael O'Neill, Mike Cattolico, and Vladan Babović. Ripple crossover in genetic programming. In Julian Miller et al., Genetic Programming, Proceedings of EuroGP, 2001.
• Maarten Keijzer and Vladan Babović. Genetic programming within a framework of computer-aided discovery of scientific knowledge. In Darrell Whitley et al., Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), 2000.
• Maarten Keijzer and Vladan Babović. Genetic programming, ensemble methods and the bias/variance tradeoff - introductory investigations. In Riccardo Poli et al., Genetic Programming, Proceedings of EuroGP'2000, 2000.
• Maarten Keijzer and Vladan Babović. Dimensionally aware genetic programming. In Wolfgang Banzhaf et al., Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, 1999.
• Maarten Keijzer, J.J. Merelo, G. Romero, M. Schoenauer. Evolving Objects: a general purpose evolutionary computation library. In Pierre Collet, EA-01, Evolution Artificielle, 5th International Conference on Evolutionary Algorithms, 2001.
• Maarten Keijzer and Vladan Babović. Error correction of a deterministic model in Venice lagoon by local linear models. In Modelli complessi e metodi computazionali intensivi per la stima e la previsione, 1999.
• Michael O'Neill, Conor Ryan, Maarten Keijzer and Mike Cattolico. Crossover in Grammatical Evolution: The Search Continues. In Julian Miller et al., Genetic Programming, Proceedings of EuroGP, 2001.
• Kim Jørgensen, Berry Elfering, Maarten Keijzer, and Vladan Babović. Analysis of long term morphological changes: A data mining approach. In Proceedings of the International Conference on Coastal Engineering, Australia, 2000.
• Vladan Babović, Maarten Keijzer, and Magnus Stefansson. Optimal embedding using evolutionary algorithms. In Proceedings of the Fourth International Conference on Hydroinformatics, Iowa City, USA, 2000.
• Vladan Babović, Maarten Keijzer, and Marek Bundzel. From global to local modelling: A case study in error correction of deterministic models. In Proceedings of the Fourth International Conference on Hydroinformatics, Iowa City, USA, 2000.
• Vladan Babović, Maarten Keijzer, David R. Aquilera, and Joe Harrington. An evolutionary approach to knowledge induction: Genetic programming in hydraulic engineering. In Proceedings of the World Water & Environmental Resources Congress, 2001.
• Vladan Babović and Maarten Keijzer. A Gaussian process model applied to the prediction of water levels in Venice lagoon. In Proceedings of the XXIX Congress of the International Association for Hydraulic Research, 2001.
• Vladan Babović and Maarten Keijzer. An evolutionary algorithm approach to the induction of differential equations. In Proceedings of the Fourth International Conference on Hydroinformatics, 2000.
• Vladan Babović and Maarten Keijzer. Computer supported knowledge discovery - A case study in flow resistance induced by vegetation. In Proceedings of the XXVIII Congress of the International Association for Hydraulic Research, 1999.
• Vladan Babović and Maarten Keijzer. Data to knowledge - the new scientific paradigm. In D. Savić and G. Walters, editors, Water Industry Systems, 1999.
Submitted Journal Papers
• Maarten Keijzer and Vladan Babović. Knowledge fusion in data driven modeling. Machine Learning.
• Vladan Babović and Maarten Keijzer. Rainfall runoff modelling based on genetic programming. Nordic Hydrology.
Acknowledgements
First and foremost I would like to thank Vladan, not only for convincing me to try to obtain a Ph.D. in Denmark by joining him in his Talent project, but also for his insistent enthusiasm and his many valuable contributions to this work. Although as a Ph.D. thesis, this work is necessarily authored by me alone, most of the views that are expressed in this work have been jointly developed.
Lars Kai and his group at the DTU have been very helpful. Although right from the start I've taken an almost diametrically opposite path from the group's research by concentrating on the use of symbolic expressions rather than 'sound' numerical procedure, these 'numerics' did have a profound influence on the work. I have learned a lot from the group.
Conor Ryan, Michael O'Neill and Mike Cattolico deserve mentioning for the many intense and considerably less intense discussions we held during the various conferences and workshops in the past three years. One of the tangible results of these discussions is the ALP system, which is based on Michael and Conor's 'Grammatical Evolution' system. I hope we can continue to cooperate in the future.
The Ph.D. and this thesis were funded by the Danish Research Council under Talent Project 9800463 entitled "Data to Knowledge - D2K". This funding is greatly appreciated.
Deventer, May 1, 2002
Maarten Keijzer
Contents
Abstract
Resume (Abstract in Danish)
Preface
1 Introduction
2 Genetic Programming
2.1 Evolution at work: Genetic & Evolutionary Computation
2.2 Standard Genetic Programming
2.2.1 The Primitives
2.2.2 Initialization
2.2.3 Variation
2.2.4 Measuring Performance and Wrapping
2.2.5 Auxiliary parameters and variables
2.3 Multi-Objective Optimization
2.4 Implementation Issues
2.5 Summary
3 Symbolic Regression
3.1 The Concentration of Suspended Sediment
3.2 Symbolic Regression on the Sediment Transport Problem
3.3 Summary
4 Induction of Empirical Equations
4.1 Units of Measurement as a Type System
4.2 Language, Bias and Search
4.3 Typing in Genetic Programming
4.4 Expressiveness of Type Systems
4.5 Typed Variation Operators
4.5.1 Broken ergodicity
4.5.2 Loss of diversity
4.6 Summary
5 Dimensionally Aware Genetic Programming
5.1 Coerced Genetic Programming
5.1.1 Calculating the Coercion Error for the uom system
5.1.2 Wrapping
5.2 Example: Sediment Transport
5.3 Summary
6 An Adaptive Logic Programming System
6.1 Logic Programming
6.2 An Adaptive Logic Programming System
6.2.1 Representation and the Mapping Process
6.2.2 Backtracking
6.2.3 Initialization
6.2.4 Performance Evaluation
6.2.5 Special Predicates
6.3 Summary
6.4 Related Work
6.5 ALP, ILP and CLP
6.6 Summary
7 Applications for the ALP System
7.1 Applications
7.1.1 A Sensible Ant on the Santa Fe Trail
7.1.2 Interval Arithmetic
7.1.3 Units of Measurement
7.1.4 Matrix Algebra
7.2 Discussion
7.3 The Art of Genetic Programming
8 Experiments in Scientific Discovery
8.1 Problem 1: Settling Velocity of Sand Particles
8.2 Problem 2: Settling Velocity of Faecal Pellets
8.3 Problem 3: Concentration of sediment near bed
8.4 Problem 4: Roughness induced by flexible vegetation
8.5 Experimental Setup
8.6 Quantitative Results
8.6.1 Bias/Variance Analysis
8.6.2 Settling velocity of sand particles
8.6.3 Settling velocity of faecal pellets
8.6.4 Concentration of suspended sediment near bed
8.6.5 Additional roughness induced by vegetation
8.6.6 Summary of the quantitative analysis
8.7 Qualitative Results
8.7.1 Interpretability of unconstrained expressions
8.7.2 Settling velocity of sand particles
8.7.3 Settling velocity of faecal pellets
8.7.4 Concentration of suspended sediment near bed
8.7.5 Additional roughness induced by vegetation
8.7.6 Summary and scope of GP-based scientific discovery
8.8 Discussion
9 Conclusion
List of Tables
List of Figures
Bibliography
Chapter 1
Introduction
Physical concepts are free creations of the human mind, and are not, however it may seem, uniquely determined by the external world.
- Albert Einstein and Leopold Infeld, 1938
The formation of modern science occurred approximately in the period between the late 15th and the late 18th century. The new foundations were based on the utilization of a physical experiment and the application of a mathematical apparatus in order to describe these experiments. The works of Brahe, Kepler, Newton, Leibniz, Euler and Lagrange personify this approach. Prior to these developments, scientific work primarily consisted of collecting the observables, or recording the 'readings of the book of nature itself'.
This scientific approach is traditionally characterized by two stages: a first one in which a set of observations of the physical system is collected, and a second one in which an inductive assertion about the behaviour of the system, a hypothesis, is generated. Observations present specific knowledge, whereas hypotheses represent a generalization of these data which implies or describes observations. One may argue that through this process of hypothesis generation, one fundamentally economizes thought, as more compact ways of describing observations are proposed.
Although this view of the dispassionate scientist observing facts and producing equations is popular, it is not all there is to say about the process of scientific discovery. In the years that led to Kepler's famous laws of planetary motion, he introduced and abandoned various informal models of the solar system. These models initially took the form of a collection of embedded spheres (Holland et al., 1986, pp. 323-325). It was only when he abandoned the idea of planets moving in circular orbits around the sun and replaced it with ellipses that he was able to postulate his laws. Kepler is not unique in this; the process of the formulation of scientific law or theory usually takes place in the context of a mental model of the phenomenon under study: using the right concept to explain the equation provides additional justification for these equations. Finding a proper conceptualization of the problem is as much a feat of scientific discovery as the formulation of a mathematical description or explanation of a phenomenon.
Today, in the beginning of the 21st century, we are experiencing yet another change in the scientific process as just outlined. This latest scientific approach is one