
Lloyd Allison

Coding Ockham's Razor

Faculty of Information Technology
Monash University
Melbourne, Victoria, Australia

ISBN 978-3-319-76432-0
ISBN 978-3-319-76433-7 (eBook)
https://doi.org/10.1007/978-3-319-76433-7
Library of Congress Control Number: 2018936916
© Springer International Publishing AG, part of Springer Nature 2018

Preface

The minimum message length (MML) principle was devised by Chris Wallace (1933–2004) and David Boulton in the late 1960s [12, 93], initially to solve the unsupervised mixture modelling problem: an important problem, a mathematical analysis, and a working computer program (Snob) that gives useful results in many different areas of science, a "complete" research project.

The Foundation Chair of Computer Science at Monash University, Chris is also particularly remembered for his work on the "Wallace multiplier" [85, 86], pseudo-random number generators [14, 89], and operating systems [6, 99].

MML was developed [91, 92] in practical and theoretical directions and was applied to many inference problems by Chris, co-workers, postgraduates, and postdocs. One of my personal favourite applications is Jon Patrick's modelling of megalithic stone circles [65, 66].

I first heard about MML over lunch one day, which led to applying it to biological sequence alignment [3] and related problems [15], and eventually, after many twists and turns, to protein structural alignment [17] and protein folding patterns [83].

Unfortunately, much MML-based research that led to new inductive inference programs resulted in little shared software componentry. A new program tended to be written largely from scratch by a postgrad, postdoc or other researcher and did not contribute to any software library of shared parts. As such, the programs embody reimplementations of standard parts. This phenomenon is not due to any special property of MML and actually seems to be quite common in research, but it is rather ironic because, with the complexity of models and the fit to data being measured in the same units, MML is well suited to writing components that can be reused and supplied as parameters to other inductive inference software.

The first MML book is the one written by Chris Wallace [92] but published posthumously; it is the reference work for MML theory. This other MML book is an attempt to do a combined MML and Software Engineering analysis of inductive inference software. Sound programming skills are needed to write new application programs for inductive inference problems. Some mathematical skills, particularly in calculus and linear algebra, are needed to do a new MML analysis of one's favourite statistical model.

Melbourne, Victoria, Australia
Lloyd Allison

Contents

Preface
1 Introduction
  1.1 Explanation versus Prediction
  1.2 Models
    1.2.1 Implementation
  1.3 Estimators
  1.4 Information
  1.5 MML
  1.6 MML87
    1.6.1 Single Continuous Parameter θ
    1.6.2 Multiple Continuous Parameters
  1.7 Outline
2 Discrete
  2.1 Uniform
  2.2 Two State
  2.3 Multi State
  2.4 Adaptive
3 Integers
  3.1 Universal Codes
  3.2 Wallace Tree Code
  3.3 A Warning
  3.4 Geometric
  3.5 Poisson
  3.6 Shifting
4 Continuous
  4.1 Uniform
  4.2 Exponential
  4.3 Normal
  4.4 Normal Given μ
  4.5 Laplace
  4.6 Comparing Models
5 Function-Models
  5.1 Konstant Function-Model, K
  5.2 Multinomial
  5.3 Intervals
  5.4 Conditional Probability Table (CPT)
6 Multivariate
  6.1 Independent
  6.2 Dependent
  6.3 Data Operations
7 Mixture Models
  7.1 The Message
  7.2 Search
  7.3 Implementation
  7.4 An Example Mixture Model
  7.5 The 27 Club
8 Function-Models 2
  8.1 Naive Bayes
  8.2 An Example
  8.3 Note: Not So Naive
  8.4 Classification Trees
  8.5 The Message
  8.6 Search
  8.7 Implementation
  8.8 An Example
  8.9 Missing Values
9 Vectors
  9.1 D-Dimensions, R^D
    9.1.1 Implementation
    9.1.2 Norm and Direction
  9.2 Simplex and ‖x‖_1 = 1
    9.2.1 Uniform
    9.2.2 Implementation
    9.2.3 Dirichlet
  9.3 Directions in R^D
    9.3.1 Uniform
    9.3.2 von Mises–Fisher (vMF) Distribution
10 Linear Regression
  10.1 Single Variable
    10.1.1 Implementation
  10.2 Single Dependence
    10.2.1 Unknown Single Dependence
  10.3 Multiple Input Variables
    10.3.1 Implementation
11 Graphs
  11.1 Ordered or Unordered Graphs
  11.2 Adjacency Matrix v. Adjacency Lists
  11.3 Models of Graphs
  11.4 Gilbert, Erdős and Rényi Models
  11.5 Gilbert, Erdős and Rényi Adaptive
  11.6 Skewed Degree Model
  11.7 Motif Models
  11.8 A Simple Motif Model
  11.9 An Adaptive Motif Model
  11.10 Comparisons
  11.11 Biological Networks
12 Bits and Pieces
  12.1 Priors
  12.2 Parameterisation
  12.3 Estimators
  12.4 Data Accuracy of Measurement (AoM)
  12.5 Small or Big Data
  12.6 Data Compression Techniques
  12.7 Getting Started
  12.8 Testing and Evaluating Results
    12.8.1 Evaluating Function-Models
  12.9 Programming Languages
  12.10 logSum
  12.11 Data-Sets
13 An Implementation
  13.1 Support
  13.2 Values
  13.3 Utilities
  13.4 Maths
  13.5 Vectors
  13.6 Graphs
  13.7 Models
  13.8 Example Programs
Glossary
Bibliography
Index

Chapter 1
Introduction

This book is about inductive inference using the minimum message length (MML) principle and a computer. It is accompanied by a library of software (Chap. 13) to help an applications programmer, student or researcher in the fields of data analysis or machine learning to write computer programs of this kind.

Inductive inference is the business of learning something general from a given data-set. The thing that is learnt might be a parameter estimate, probability distribution, hypothesis, statistical model, explanation or theory, and the data-set is sometimes called examples, observations, evidence or training-data. Over-fitting is a well-known danger in which a complex model is liable to fit data better than a simpler model yet may not be justified: the improvement in fit may not be worth the added complexity of the model. MML [92, 93] relies on Bayes's theorem [8] and Shannon's information theory [76]. It quantifies a model's complexity and allows it to be traded off against the model's fit to data. It is a realisation of Ockham's razor [78] (or Occam's razor).

Inductive, "characterized by the inference of general laws from particular instances."
Inference, "a conclusion reached on the basis of evidence and reasoning."¹

¹ OED.

An inference is an answer to some question about a process that gives rise to data and that we are interested in. The question assumes some aspects of the process that are taken to be common knowledge, some aspects that are unknown but of no interest, and some aspects that are unknown but of interest and that are to be inferred. It specifies the form of an answer: what unknowns or parameters have to be filled in. It is important to be clear about exactly what question is being asked and about the form of an answer, what it includes and does not include.

An inference can be something modest such as an estimate of a parameter of a probability distribution, or something grand such as a theory of fundamental particles in physics, say. People tend to use the terms estimate, distribution, hypothesis, model, explanation, and theory in rough order of increasing grandeur, although in reality there are no hard divisions between them. Regardless, an inference may well be wrong in that it is possible for it to be invalidated by future data, although it may be good enough to be useful until that day [13].

Q: What is the difference between a hypothesis and a theory?
A: Think of a hypothesis as a card. A theory is a house made of hypotheses.²

² Marilyn vos Savant.

For hypotheses H, H_0, H_1, ... and data D, common terminology (Fig. 1.1) includes prior and posterior, which refer to before and after seeing the data D. If X is a random variable that may take one of the values x_1, x_2, ..., we often write pr(x_i) for pr(X = x_i).

  pr(H)     prior probability of H
  pr(H|D)   posterior probability of H given D
  pr(D|H)   probability of D conditional on (given) H,
            also the likelihood (of H (given D))

Fig. 1.1 Standard definitions
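The complexity-versus-fit trade-off can be made concrete with a small numerical sketch. The following Python fragment is illustrative only, not part of the book's accompanying library (Chap. 13): it compares the two-part message for a "fair coin" hypothesis, which has nothing to state, against one for a "biased coin" whose estimated bias must itself be transmitted. Charging (1/2) log2(n) bits for stating the bias is a crude stand-in for the optimal parameter precision that MML87 derives (Sect. 1.6), and the function names are mine.

    import math

    def data_bits(k, n, p):
        # Second part of the message: k heads in n tosses encoded
        # under a Bernoulli(p) model; length is the -log2 likelihood.
        return -(k * math.log2(p) + (n - k) * math.log2(1.0 - p))

    n, k = 100, 65  # observed: 65 heads in 100 tosses

    # Fair coin: p = 1/2 is common knowledge, so the first part is free.
    fair = data_bits(k, n, 0.5)

    # Biased coin: the first part states the estimate p = k/n to limited
    # precision; (1/2) log2(n) bits is a rough stand-in for the MML87
    # precision (Sect. 1.6). Assumes 0 < k < n.
    biased = 0.5 * math.log2(n) + data_bits(k, n, k / n)

    print(f"fair  : {fair:6.1f} bits")    # 100.0 bits
    print(f"biased: {biased:6.1f} bits")  # ~96.7 bits, so biased is preferred

With k = 55 the extra parameter no longer pays for itself (about 102.6 bits against 100.0): because the complexity of the model and its fit to the data are measured in the same units, the two hypotheses can be compared directly.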
Bayes's theorem [8], in a form [52, 81] that is relevant here, states that the probability of hypothesis H and data D equals the probability of H times the probability of D given H:

  pr(H & D) = pr(H) · pr(D|H) = pr(D) · pr(H|D)    (1.1)

pr(H) is given by a prior probability distribution over hypotheses and represents general ideas about what might typically be inferred from real data before the data have been sighted. pr(H|D) is given by a posterior distribution over hypotheses and is to be calculated. pr(D) and pr(D|H) are given by distributions over data; the former is often not known in practice and the latter follows from the specification of a model. It follows from Bayes that the posterior odds ratio of two hypotheses H_1 and H_2 is

  pr(H_1|D) / pr(H_2|D) = (pr(H_1) · pr(D|H_1)) / (pr(H_2) · pr(D|H_2))
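Note that pr(D) cancels in this ratio, so the posterior odds can be computed without knowing it; and since a message length is a negative log probability, the difference between two hypotheses' total message lengths is the log of their posterior odds. A minimal Python sketch, with hypothetical numbers, again not the book's library code:

    import math

    def posterior_odds(prior_1, like_1, prior_2, like_2):
        # pr(H_1|D) / pr(H_2|D) = (pr(H_1) pr(D|H_1)) / (pr(H_2) pr(D|H_2));
        # pr(D) cancels, so it need not be known.
        return (prior_1 * like_1) / (prior_2 * like_2)

    # H_1 is a priori half as likely as H_2 but fits the data four times better.
    odds = posterior_odds(0.2, 0.08, 0.4, 0.02)
    print(odds)             # 2.0, in favour of H_1
    print(math.log2(odds))  # 1.0: H_1's two-part message is about one bit shorter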
