
Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data

559 Pages·2014·102.536 MB·English


PRINCETON SERIES IN MODERN OBSERVATIONAL ASTRONOMY
David N. Spergel, SERIES EDITOR

Written by some of the world’s leading astronomers, the Princeton Series in Modern Observational Astronomy addresses the needs and interests of current and future professional astronomers. International in scope, the series includes cutting-edge monographs and textbooks on topics generally falling under the categories of wavelength, observational techniques and instrumentation, and observational objects from a multiwavelength perspective.

Statistics, Data Mining, and Machine Learning in Astronomy
A PRACTICAL PYTHON GUIDE FOR THE ANALYSIS OF SURVEY DATA

Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas, and Alexander Gray

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Copyright © 2014 by Princeton University Press
Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 6 Oxford Street, Woodstock, Oxfordshire OX20 1TW
press.princeton.edu
All Rights Reserved
ISBN 978-0-691-15168-7
Library of Congress Control Number: 2013951369
British Library Cataloging-in-Publication Data is available
This book has been composed in Minion Pro w/ Universe light condensed for display
Printed on acid-free paper ∞
Typeset by S R Nova Pvt Ltd, Bangalore, India
Printed in the United States of America

Contents

Preface

I Introduction

1 About the Book and Supporting Material
    1.1 What Do Data Mining, Machine Learning, and Knowledge Discovery Mean?
    1.2 What is This Book About?
    1.3 An Incomplete Survey of the Relevant Literature
    1.4 Introduction to the Python Language and the Git Code Management Tool
    1.5 Description of Surveys and Data Sets Used in Examples
    1.6 Plotting and Visualizing the Data in This Book
    1.7 How to Efficiently Use This Book
    References

2 Fast Computation on Massive Data Sets
    2.1 Data Types and Data Management Systems
    2.2 Analysis of Algorithmic Efficiency
    2.3 Seven Types of Computational Problem
    2.4 Seven Strategies for Speeding Things Up
    2.5 Case Studies: Speedup Strategies in Practice
    References

II Statistical Frameworks and Exploratory Data Analysis

3 Probability and Statistical Distributions
    3.1 Brief Overview of Probability and Random Variables
    3.2 Descriptive Statistics
    3.3 Common Univariate Distribution Functions
    3.4 The Central Limit Theorem
    3.5 Bivariate and Multivariate Distribution Functions
    3.6 Correlation Coefficients
    3.7 Random Number Generation for Arbitrary Distributions
    References

4 Classical Statistical Inference
    4.1 Classical vs. Bayesian Statistical Inference
    4.2 Maximum Likelihood Estimation (MLE)
    4.3 The Goodness of Fit and Model Selection
    4.4 ML Applied to Gaussian Mixtures: The Expectation Maximization Algorithm
    4.5 Confidence Estimates: the Bootstrap and the Jackknife
    4.6 Hypothesis Testing
    4.7 Comparison of Distributions
    4.8 Nonparametric Modeling and Histograms
    4.9 Selection Effects and Luminosity Function Estimation
    4.10 Summary
    References

5 Bayesian Statistical Inference
    5.1 Introduction to the Bayesian Method
    5.2 Bayesian Priors
    5.3 Bayesian Parameter Uncertainty Quantification
    5.4 Bayesian Model Selection
    5.5 Nonuniform Priors: Eddington, Malmquist, and Lutz–Kelker Biases
    5.6 Simple Examples of Bayesian Analysis: Parameter Estimation
    5.7 Simple Examples of Bayesian Analysis: Model Selection
    5.8 Numerical Methods for Complex Problems (MCMC)
    5.9 Summary of Pros and Cons for Classical and Bayesian methods
    References

III Data Mining and Machine Learning

6 Searching for Structure in Point Data
    6.1 Nonparametric Density Estimation
    6.2 Nearest-Neighbor Density Estimation
    6.3 Parametric Density Estimation
    6.4 Finding Clusters in Data
    6.5 Correlation Functions
    6.6 Which Density Estimation and Clustering Algorithms Should I Use?
    References

7 Dimensionality and Its Reduction
    7.1 The Curse of Dimensionality
    7.2 The Data Sets Used in This Chapter
    7.3 Principal Component Analysis
    7.4 Nonnegative Matrix Factorization
    7.5 Manifold Learning
    7.6 Independent Component Analysis and Projection Pursuit
    7.7 Which Dimensionality Reduction Technique Should I Use?
    References

8 Regression and Model Fitting
    8.1 Formulation of the Regression Problem
    8.2 Regression for Linear Models
    8.3 Regularization and Penalizing the Likelihood
    8.4 Principal Component Regression
    8.5 Kernel Regression
    8.6 Locally Linear Regression
    8.7 Nonlinear Regression
    8.8 Uncertainties in the Data
    8.9 Regression that is Robust to Outliers
    8.10 Gaussian Process Regression
    8.11 Overfitting, Underfitting, and Cross-Validation
    8.12 Which Regression Method Should I Use?
    References

9 Classification
    9.1 Data Sets Used in This Chapter
    9.2 Assigning Categories: Classification
    9.3 Generative Classification
    9.4 K-Nearest-Neighbor Classifier
    9.5 Discriminative Classification
    9.6 Support Vector Machines
    9.7 Decision Trees
    9.8 Evaluating Classifiers: ROC Curves
    9.9 Which Classifier Should I Use?
    References

10 Time Series Analysis
    10.1 Main Concepts for Time Series Analysis
    10.2 Modeling Toolkit for Time Series Analysis
    10.3 Analysis of Periodic Time Series
    10.4 Temporally Localized Signals
    10.5 Analysis of Stochastic Processes
    10.6 Which Method Should I Use for Time Series Analysis?
    References

IV Appendices

A An Introduction to Scientific Computing with Python
    A.1 A Brief History of Python
    A.2 The SciPy Universe
    A.3 Getting Started with Python
    A.4 IPython: The Basics of Interactive Computing
    A.5 Introduction to NumPy
    A.6 Visualization with Matplotlib
    A.7 Overview of Useful NumPy/SciPy Modules
    A.8 Efficient Coding with Python and NumPy
    A.9 Wrapping Existing Code in Python
    A.10 Other Resources

B AstroML: Machine Learning for Astronomy
    B.1 Introduction
    B.2 Dependencies
    B.3 Tools Included in AstroML v0.1

C Astronomical Flux Measurements and Magnitudes
    C.1 The Definition of the Specific Flux
    C.2 Wavelength Window Function for Astronomical Measurements
    C.3 The Astronomical Magnitude Systems

D SQL Query for Downloading SDSS Data

E Approximating the Fourier Transform with the FFT
    References

Visual Figure Index

Index

Preface

Astronomy and astrophysics are witnessing dramatic increases in data volume as detectors, telescopes, and computers become ever more powerful. During the last decade, sky surveys across the electromagnetic spectrum have collected hundreds of terabytes of astronomical data for hundreds of millions of sources. Over the next decade, data volumes will enter the petabyte domain, and provide accurate measurements for billions of sources. Astronomy and physics students are not traditionally trained to handle such voluminous and complex data sets. Furthermore, standard analysis methods employed in astronomy often lag far behind the rapid progress in statistics and computer science. The main purpose of this book is to help minimize the time it takes a student to become an effective researcher.

This book provides the interface between astronomical data analysis problems and modern statistical methods. It is aimed at physical and data-centric scientists who have an understanding of the science drivers for analyzing large data sets but may not be aware of developments in statistical techniques over the last decade. The book targets researchers who want to use existing methods for the analysis of large data sets, rather than those interested in the development of new methods. Theoretical discussions are limited to the minimum required to understand the algorithms. Nevertheless, extensive and detailed references to relevant specialist literature are interspersed throughout the book.

We present an example-driven compendium of modern statistical and data mining methods, together with carefully chosen examples based on real modern data sets, and of current astronomical applications that will illustrate each method introduced in the book. The book is loosely organized by practical analysis problems, and offers a comparative analysis of different techniques, including discussions of the advantages and shortcomings of each method, and their scaling with the sample size. The exposition of the material is supported by appropriate publicly available Python code (available from the book website, rather than fully printed here) and data to enable a reader to reproduce all the figures and examples, evaluate the techniques, and adapt them to their own field of interest. To some extent, this book is an analog of the well-known Numerical Recipes book, but aimed at the analysis of massive astronomical data sets, with more emphasis on modern tools for data mining and machine learning, and with freely available code.

From the start, we desired to create a book which, in the spirit of reproducible research, would allow readers to easily replicate the analysis behind every example and figure. We believe this feature will make the book uniquely valuable as a practical guide. We chose to implement this using Python, a powerful and flexible programming language that is quickly becoming a standard in astronomy (a number of next-generation large astronomical surveys and projects use Python, e.g., JVLA, ALMA, LSST). The Python code base associated with this book, called AstroML, is maintained as a live web repository (GitHub), and is intended to be a growing collection of well-documented and well-tested tools for astronomical research. Any astronomical researcher who is currently developing
