Mathematics for Machine Learning

Foreword

Machine learning is the latest in a long line of attempts to capture human knowledge and reasoning into a form that is suitable for constructing machines and engineering automated systems. As machine learning becomes more ubiquitous and its software packages become easier to use, it is natural and desirable that the low-level technical details are abstracted away and hidden from the practitioner. However, this brings with it the danger that a practitioner becomes unaware of the design decisions and, hence, the limits of machine learning algorithms. The enthusiastic practitioner who is interested to learn more about the magic behind successful machine learning algorithms currently faces a daunting set of pre-requisite knowledge:

• Programming languages and data analysis tools
• Large-scale computation and the associated frameworks
• Mathematics and statistics and how machine learning builds on it

At universities, introductory courses on machine learning tend to spend early parts of the course covering some of these pre-requisites. For historical reasons, courses in machine learning tend to be taught in the computer science department, where students are often trained in the first two areas of knowledge, but not so much in mathematics and statistics. Current machine learning textbooks try to squeeze in one or two chapters of background mathematics, either at the beginning of the book or as appendices. This book brings the mathematical foundations of basic machine learning concepts to the fore and collects the information in a single place.

Why Another Book on Machine Learning?

Machine learning builds upon the language of mathematics to express concepts that seem intuitively obvious but which are surprisingly difficult to formalize. Once properly formalized, we can then use the tools of mathematics to derive the consequences of our design choices. This allows us to gain insights into the task we are solving and also the nature of intelligence. One common complaint of students of mathematics around the globe is that the topics covered seem to have little relevance to practical problems. We believe that machine learning is an obvious and direct motivation for people to learn mathematics.

"Math is linked in the popular mind with phobia and anxiety. You'd think we're discussing spiders." (Strogatz, 2014)

This book is intended to be a guidebook to the vast mathematical literature that forms the foundations of modern machine learning. We motivate the need for mathematical concepts by directly pointing out their usefulness in the context of fundamental machine learning problems. In the interest of keeping the book short, many details and more advanced concepts have been left out. Equipped with the basic concepts presented here, and how they fit into the larger context of machine learning, the reader can find numerous resources for further study, which we provide at the end of the respective chapters. For readers with a mathematical background, this book provides a brief but precisely stated glimpse of machine learning.
In contrast to other books that focus on methods and models of machine learning (MacKay, 2003b; Bishop, 2006; Alpaydin, 2010; Rogers and Girolami, 2016; Murphy, 2012; Barber, 2012; Shalev-Shwartz and Ben-David, 2014) or on programmatic aspects of machine learning (Müller and Guido, 2016; Raschka and Mirjalili, 2017; Chollet and Allaire, 2018), we provide only four representative examples of machine learning algorithms. Instead, we focus on the mathematical concepts behind the models themselves, with the intent of illuminating their abstract beauty. We hope that all readers will be able to gain a deeper understanding of the basic questions in machine learning and connect practical questions arising from the use of machine learning with fundamental choices in the mathematical model.

Who is the Target Audience?

As applications of machine learning become widespread in society, we believe that everybody should have some understanding of its underlying principles. This book is written in an academic mathematical style, which enables us to be precise about the concepts behind machine learning. We encourage readers unfamiliar with this seemingly terse style to persevere and to keep the goals of each topic in mind. We sprinkle comments and remarks throughout the text, in the hope that they provide useful guidance with respect to the big picture. The book assumes the reader to have mathematical knowledge commonly covered in high-school mathematics and physics. For example, the reader should have seen derivatives and integrals before, as well as geometric vectors in two or three dimensions. Starting from there, we generalize these concepts. Therefore, the target audience of the book includes undergraduate university students, evening learners, and people who participate in online machine learning courses.

In analogy to music, there are three types of interaction that people have with machine learning:

Astute Listener
The democratization of machine learning by the provision of open-source software, online tutorials, and cloud-based tools allows users not to worry about the nitty-gritty details of pipelines. Users can focus on extracting insights from data using off-the-shelf tools. This enables non-tech-savvy domain experts to benefit from machine learning. This is similar to listening to music: the user is able to choose and discern between different types of machine learning, and benefits from it. More experienced users are like music critics, asking important questions about the application of machine learning in society, such as ethics, fairness, and privacy of the individual. We hope that this book provides a framework for thinking about the certification and risk management of machine learning systems, and allows such users to apply their domain expertise to build better machine learning systems.

Experienced Artist
Skilled practitioners of machine learning are able to plug and play different tools and libraries into an analysis pipeline. The stereotypical practitioner would be a data scientist or engineer who understands machine learning interfaces and their use cases, and is able to perform wonderful feats of prediction from data. This is similar to virtuosos playing music, where highly skilled practitioners can bring existing instruments to life and bring enjoyment to their audience.
Using the mathematics presented here as a primer, practitioners would be able to understand the benefits and limits of their favorite method, and to extend and generalize existing machine learning algorithms. We hope that this book provides the impetus for more rigorous and principled development of machine learning methods.

Fledgling Composer
As machine learning is applied to new domains, developers of machine learning need to develop new methods and extend existing algorithms. They are often researchers who need to understand the mathematical basis of machine learning and uncover relationships between different tasks. This is similar to composers of music who, within the rules and structure of musical theory, create new and amazing pieces. We hope this book provides a high-level overview of other technical books for people who want to become composers of machine learning. There is a great need in society for new researchers who are able to propose and explore novel approaches for attacking the many challenges of learning from data.

1 Introduction and Motivation

1.1 Finding Words for Intuitions

Machine learning is about designing algorithms that learn from data. The goal is to find good models that generalize well to future data. The challenge is that the concepts and words are slippery, and a particular component of the machine learning system can be abstracted to different mathematical concepts. For example, the word "algorithm" is used in at least two different senses in the context of machine learning. In the first sense, we use the phrase "machine learning algorithm" to mean a system that makes predictions based on input data. We refer to these algorithms as predictors. In the second sense, we use the exact same phrase "machine learning algorithm" to mean a system that adapts some internal parameters of the predictor so that it performs well on future unseen input data. Here we refer to this adaptation as training a predictor.

The first part of this book describes the mathematical concepts and foundations needed to talk about the three main components of a machine learning system: data, models, and learning. We will briefly outline these components here, and we will revisit them again in Chapter 8 once we have the mathematical language under our belt. Adding to the challenge is the fact that the same English word can mean different mathematical concepts, and we can only work out the precise meaning via the context. We already remarked on the overloaded use of the word "algorithm", and the reader will be faced with other such phrases. We advise the reader to use the idea of "type checking" from computer science and apply it to machine learning concepts. Type checking allows the reader to sanity-check whether the equation that they are considering contains inputs and outputs of the correct type, and whether they are mixing different types of objects.

While not all data is numerical, it is often useful to consider data in a number format. In this book, we assume that the data has already been appropriately converted into a numerical representation suitable for reading into a computer program. In this book, we think of data as vectors.
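To make the "data as vectors" and "type checking" ideas concrete, here is a minimal Python sketch of our own (not from the book): a data point is stored as a NumPy array, and type hints on a toy linear predictor let the reader, or a static checker, verify that inputs and outputs have the expected types. The names `predict`, `x`, and `w`, and the linear form itself, are purely illustrative assumptions.

```python
import numpy as np

# One data point as a vector of numbers, e.g. [height_cm, weight_kg, age_years].
x = np.array([172.0, 68.5, 29.0])

def predict(x: np.ndarray, w: np.ndarray) -> float:
    """Toy linear predictor: maps a feature vector to a real-valued output.

    The annotations play the role of "type checking": the inputs are vectors
    (arrays of numbers) of matching shape, and the output is a single real number.
    """
    assert x.shape == w.shape, "feature and parameter vectors must have the same shape"
    return float(x @ w)

w = np.array([0.01, 0.05, -0.2])  # internal parameters of the predictor
print(predict(x, w))              # a single number, as the return type promises
```

Training, in the second sense of "machine learning algorithm" above, would be whatever procedure adjusts `w` so that such predictions remain good on unseen data.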
As another illustration of how subtle words are, there are three different ways to think about vectors: a vector as an array of numbers (a computer science view), a vector as an arrow with a direction and magnitude (a physics view), and a vector as an object that obeys addition and scaling (a mathematical view).

What is a model? Models are simplified versions of reality, which capture aspects of the real world that are relevant to the task. Users of the model need to understand what the model does not capture, and hence obtain an appreciation of its limitations. Applying models without knowing their limitations is like driving a vehicle without knowing whether it can turn left or not. Machine learning algorithms adapt to data, and therefore their behavior will change as they learn. Applying machine learning models without knowing their limitations is like sitting in a self-driving vehicle without knowing whether it has encountered enough left turns during its training phase. In this book, we use the word "model" to distinguish between two schools of thought about the construction of machine learning predictors: the probabilistic view and the optimization view. The reader is referred to Domingos (2012) for a more general introduction to the five schools of machine learning.

We now come to the crux of the matter, the learning component of machine learning. Assume we have a way to represent data as vectors and that we have an appropriate model. We are interested in training our model based on data so that it performs well on unseen data. Predicting well on data that we have already seen (training data) may only mean that we found a good way to memorize the data. However, this may not generalize well to unseen data, and in practical applications we often need to expose our machine learning system to situations that it has not encountered before. We use numerical methods to find good parameters that "fit" the model to data, and most training methods can be thought of as an approach analogous to climbing a hill to reach its peak. The peak of the hill corresponds to a maximization of some desired performance measure. The challenge is to design algorithms that learn from past data but generalize well.

Let us summarize the main concepts of machine learning:

• We use domain knowledge to represent data as vectors.
• We choose an appropriate model, using either the probabilistic or the optimization view.
• We learn from past data by using numerical optimization methods, with the aim that the model performs well on unseen data.

1.2 Two Ways to Read this Book

We can consider two strategies for understanding the mathematics for machine learning:

• Building up the concepts from foundational to more advanced. This is often the preferred approach in more technical fields, such as mathematics. This strategy has the advantage that the reader at all times is
able to rely on their previously learned definitions, and there are no murky hand-wavy arguments that the reader needs to take on faith. Unfortunately, for a practitioner many of the foundational concepts are not particularly interesting by themselves, and the lack of motivation means that most foundational definitions are quickly forgotten.
• Drilling down from practical needs to more basic requirements. This goal-driven approach has the advantage that the reader knows at all times why they need to work on a particular concept, and there is a clear path of required knowledge. The downside of this strategy is that the knowledge is built on shaky foundations, and the reader has to remember a set of words for which they do not have any way of understanding.

This book is split into two parts, where Part I lays the mathematical foundations and Part II applies the concepts from Part I to a set of basic machine learning problems, which form four pillars of machine learning, as illustrated in Figure 1.1.

[Figure 1.1: The foundations and four pillars of machine learning. Pillars: Regression, Dimensionality Reduction, Density Estimation, Classification; foundations: Linear Algebra, Analytic Geometry, Matrix Decomposition, Vector Calculus, Probability & Distributions, Optimization.]

Part I is about Mathematics

We represent numerical data as vectors and represent a table of such data as a matrix. The study of vectors and matrices is called linear algebra, which we introduce in Chapter 2. The collection of vectors as a matrix is also described there. Given two vectors, representing two objects in the real world, we want to be able to make statements about their similarity. The idea is that vectors that are similar should be predicted to have similar outputs by our machine learning algorithm (our predictor). To formalize the idea of similarity between vectors, we need to introduce operations that take two vectors as input and return a numerical value representing their similarity. This construction of similarity and distances is called analytic geometry and is discussed in Chapter 3. In Chapter 4, we introduce some fundamental concepts about matrices and matrix decomposition. It turns out that operations on matrices are extremely useful in machine learning, and we use them for representing data as well as for modeling.

We often consider data to be noisy observations of some true underlying signal, and hope that by applying machine learning we can identify the signal from the noise. This requires us to have a language for quantifying what noise means. We often would also like to have predictors that allow us to express some sort of uncertainty, e.g., to quantify the confidence we have about the value of the prediction for a particular test data point. Quantification of uncertainty is the realm of probability theory and is covered in Chapter 6. Instead of considering a predictor as a single function, we could consider predictors to be probabilistic models, i.e., models describing the distribution of possible functions.

To apply hill-climbing approaches for training machine learning models, we need to formalize the concept of a gradient, which tells us the direction in which to search for a solution. This idea of the direction to search is formalized by calculus, which we present in Chapter 5.
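As a toy preview of this hill-climbing picture (our own sketch, not an example from the book), the snippet below uses the derivative of a simple one-dimensional "hill" f(θ) = −(θ − 3)² as the search direction and repeatedly steps uphill; the function, starting point, and step size are arbitrary choices made for illustration.

```python
def f(theta: float) -> float:
    # A one-dimensional "hill" whose peak is at theta = 3.
    return -(theta - 3.0) ** 2

def grad_f(theta: float) -> float:
    # Derivative of f; its sign tells us which direction is uphill.
    return -2.0 * (theta - 3.0)

theta = 0.0        # arbitrary starting point
step_size = 0.1    # how far to move along each search direction
for _ in range(100):
    theta += step_size * grad_f(theta)  # take a small step uphill

print(theta)  # close to 3.0, the maximizer of f
```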
How to use a sequence of these search directions to find the top of the hill is called optimization, which we introduce in Chapter 7.

It turns out that the mathematics for discrete categorical data is different from the mathematics for continuous real numbers. Most of machine learning assumes continuous variables, and except for Chapter 6 the other chapters in Part I of the book only discuss continuous variables. However, for many application domains, data is categorical in nature, and naturally there are machine learning problems that consider categorical variables. For example, we may wish to model sex (male/female). Since we assume that our data is numerical, we encode sex as the numbers −1 and +1 for male and female, respectively. However, it is worth keeping in mind when modeling that sex is a categorical variable, and the actual difference in value between the two numbers should not have any meaning in the model. This distinction between continuous and categorical variables gives rise to different machine learning approaches.

Part II is about Machine Learning

The second part of the book introduces four pillars of machine learning as listed in Table 1.1. The rows in the table distinguish between problems where the variable of interest is continuous or categorical. We illustrate how the mathematical concepts introduced in the first part of the book can be used to design machine learning algorithms. In Chapter 8, we restate the three components of machine learning (data, models, and parameter estimation) in a mathematical fashion. In addition, we provide some guidelines for building experimental setups that guard against overly optimistic evaluations of machine learning systems. Recall that the goal is to build a predictor that performs well on future data.

Table 1.1 The four pillars of machine learning

                                 Supervised                     Unsupervised
  Continuous latent variables    Regression (Chapter 9)         Dimensionality reduction (Chapter 10)
  Categorical latent variables   Classification (Chapter 12)    Density estimation (Chapter 11)

The terms "supervised" and "unsupervised" (the columns in Table 1.1) learning refer to the question of whether or not we provide the learning algorithm with labels during training. An example use case of supervised learning is when we build a classifier to decide whether a tissue biopsy is cancerous. For training, we provide the machine learning algorithm with a set of images and a corresponding set of annotations by pathologists. This expert annotation is called a label in machine learning, and for many supervised learning tasks it is obtained at great cost or effort. After the classifier is trained, we show it an image from a new biopsy and hope that it can accurately predict whether the tissue is cancerous. An example use case of unsupervised learning (using the same cancer biopsy problem) is if we want to visualize the properties of the tissue around which we have found cancerous cells. We could choose two particular features of these images and plot them in a scatter plot. Alternatively, we could use all the features and find a two-dimensional representation that approximates all the features, and plot this instead. Since this type of machine learning task does not provide a label during training, it is called unsupervised learning.
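A minimal sketch of this distinction, assuming NumPy and scikit-learn are available (the data here is random and stands in for the feature matrix described above, not for real biopsy images): the supervised method receives the labels y during training, while the unsupervised method sees only the features X.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 data points with 5 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels, e.g. cancerous (1) or not (0)

# Supervised learning: the labels y are provided during training.
classifier = LogisticRegression().fit(X, y)
print(classifier.predict(X[:3]))          # predicted labels for three data points

# Unsupervised learning: only X is used. We compute a two-dimensional
# representation that approximates all five features, suitable for a scatter plot.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                         # (100, 2)
```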
The second part of the book provides a brief overview of two fundamental supervised (regression and classification) and unsupervised (dimensionality reduction and density estimation) machine learning problems.

Of course there are more than two ways to read this book. Most readers learn using a combination of top-down and bottom-up approaches, sometimes building up basic mathematical skills before attempting more complex concepts, but also choosing topics based on applications of machine learning. Chapters in Part I mostly build upon the previous ones, but the reader is encouraged to skip to a chapter that covers a particular gap in the reader's knowledge and work backwards if necessary. Chapters in Part II are loosely coupled and are intended to be read in any order. There are many pointers forward and backward between the two parts of the book to assist the reader in finding their way.

1.3 Exercises and Feedback

We provide some exercises in Part I, which can be done mostly by pen and paper. For Part II, we provide programming tutorials (Jupyter notebooks) to explore some properties of the machine learning algorithms we discuss in this book.

We appreciate that Cambridge University Press strongly supports our aim to democratize education and learning by making this book freely available for download at

https://mml-book.com

where you can also find the tutorials, errata, and additional materials. You can also report mistakes and provide feedback using the URL above.

2 Linear Algebra

When formalizing intuitive concepts, a common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is known as an algebra.

Linear algebra is the study of vectors and certain rules to manipulate vectors. The vectors many of us know from school are called "geometric vectors", which are usually denoted by a small arrow above the letter, e.g., $\vec{x}$ and $\vec{y}$. In this book, we discuss more general concepts of vectors and use a bold letter to represent them, e.g., $\boldsymbol{x}$ and $\boldsymbol{y}$.

In general, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. Any object that satisfies these two properties can be considered a vector. Here are some examples of such vector objects:

1. Geometric vectors. This example of a vector may be familiar from school. Geometric vectors are directed segments, which can be drawn; see Figure 2.1(a). Two geometric vectors $\vec{x}$, $\vec{y}$ can be added, such that $\vec{x} + \vec{y} = \vec{z}$ is another geometric vector. Furthermore, multiplication by a scalar, $\lambda\vec{x}$ with $\lambda \in \mathbb{R}$, is also a geometric vector. In fact, it is the original vector scaled by $\lambda$. Therefore, geometric vectors are instances of the vector concepts introduced above.

2. Polynomials are also vectors; see Figure 2.1(b): two polynomials can be added together, which results in another polynomial, and they can be multiplied by a scalar $\lambda \in \mathbb{R}$, and the result is a polynomial as well. Therefore, polynomials are (rather unusual) instances of vectors.

[Figure 2.1: Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.]
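To connect these two examples with the computer-science view of a vector as an array of numbers, here is a small sketch of ours (assuming NumPy is installed) that adds and scales both geometric vectors and polynomials, with polynomials represented by their coefficient arrays:

```python
import numpy as np
from numpy.polynomial import Polynomial

# (a) Geometric vectors in the plane, stored as arrays of numbers.
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
lam = 2.5
print(x + y)    # addition yields another vector: [4. 1.]
print(lam * x)  # scaling yields another vector: [2.5 5. ]

# (b) Polynomials, represented by their coefficients (constant term first).
p = Polynomial([1.0, 0.0, 2.0])  # 1 + 2 x^2
q = Polynomial([0.0, 3.0])       # 3 x
print(p + q)    # the sum 1 + 3 x + 2 x^2 is again a polynomial
print(lam * p)  # the scaled polynomial 2.5 + 5 x^2

# Both kinds of objects satisfy the two defining properties:
# they are closed under addition and under multiplication by a scalar.
```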
