MICHAEL CLARK
CENTER FOR SOCIAL RESEARCH
UNIVERSITY OF NOTRE DAME

AN INTRODUCTION TO MACHINE LEARNING WITH APPLICATIONS IN R

Contents

Preface
Introduction: Explanation & Prediction
Some Terminology
Tools You Already Have
The Standard Linear Model
Logistic Regression
Expansions of Those Tools
Generalized Linear Models
Generalized Additive Models
The Loss Function
Continuous Outcomes
Squared Error
Absolute Error
Negative Log-likelihood
R Example
Categorical Outcomes
Misclassification
Binomial log-likelihood
Exponential
Hinge Loss
Regularization
R Example
Bias-Variance Tradeoff
Bias & Variance
The Tradeoff
Diagnosing Bias-Variance Issues & Possible Solutions
Worst Case Scenario
High Variance
High Bias
Cross-Validation
Adding Another Validation Set
K-fold Cross-Validation
Leave-one-out Cross-Validation
Bootstrap
Other Stuff
Model Assessment & Selection
Beyond Classification Accuracy: Other Measures of Performance
Process Overview
Data Preparation
Define Data and Data Partitions
Feature Scaling
Feature Engineering
Discretization
Model Selection
Model Assessment
Opening the Black Box
The Dataset
R Implementation
Feature Selection & The Data Partition
k-nearest Neighbors
Strengths & Weaknesses
Neural Nets
Strengths & Weaknesses
Trees & Forests
Strengths & Weaknesses
Support Vector Machines
Strengths & Weaknesses
Other
Unsupervised Learning
Clustering
Latent Variable Models
Graphical Structure
Imputation
Ensembles
Bagging
Boosting
Stacking
Feature Selection & Importance
Textual Analysis
More Stuff
Summary
Cautionary Notes
Some Guidelines
Conclusion
Brief Glossary of Common Terms

Preface

The purpose of this document is to provide a conceptual introduction
to statistical or machine learning (ML) techniques for those that might not normally be exposed to such approaches during their required typical statistical training (note 1). Machine learning (note 2) can be described as a form of statistics, often even utilizing well-known and familiar techniques, that has a bit of a different focus than traditional analytical practice in the social sciences and other disciplines. The key notion is that flexible, automatic approaches are used to detect patterns within the data, with a primary focus on making predictions on future data.

Note 1: I generally have in mind social science researchers but hopefully keep things general enough for other disciplines.
Note 2: Also referred to as statistical learning, statistical engineering, data science or data mining in other contexts.

If one surveys the number of techniques available in ML without context, it will surely be overwhelming in terms of the sheer number of those approaches and also the various tweaks and variations of them. However, the specifics of the techniques are not as important as more general concepts that would be applicable in most every ML setting, and indeed, many traditional ones as well. While there will be examples using the R statistical environment and descriptions of a few specific approaches, the focus here is more on ideas than application (note 3) and kept at the conceptual level as much as possible. However, some applied examples of more common techniques will be provided in detail.

Note 3: Indeed, there is evidence that with large enough samples many techniques converge to similar performance.

As for prerequisite knowledge, I will assume a basic familiarity with regression analyses typically presented to those in applied disciplines, particularly those of the social sciences. Regarding programming, one should be at least somewhat familiar with using R and RStudio, and either of my introductions here and here will be plenty. Note that I won't do as much explaining of the R code as in those introductions, and in some cases I will be more concerned with getting to a result than clearly detailing the path to it.
Armed with such introductory knowledge as can be found in those documents, if there are parts of R code that are unclear one would have the tools to investigate and discover for themselves the details, which results in more learning anyway.

The latest version of this document is dated April 16, 2013 (original March 2013).

Introduction: Explanation & Prediction

For any particular analysis conducted, emphasis can be placed on understanding the underlying mechanisms which have specific theoretical underpinnings, versus a focus that dwells more on performance and, more to the point, future performance. These are not mutually exclusive goals in the least, and probably most studies contain a little of both in some form or fashion. I will refer to the former emphasis as that of explanation, and the latter that of prediction.

In studies with a more explanatory focus, traditionally analysis concerns a single data set. For example, one assumes a data generating distribution for the response, and one evaluates the overall fit of a single model to the data at hand, e.g. in terms of R-squared, and statistical significance for the various predictors in the model. One assesses how well the model lines up with the theory that led to the analysis, and modifies it accordingly, if need be, for future studies to consider. Some studies may look at predictions for specific, possibly hypothetical values of the predictors, or examine the particular nature of individual predictors' effects. In many cases, only a single model is considered. In general though, little attempt is made to explicitly understand how well the model will do with future data, but we hope to have gained greater insight as to the underlying mechanisms guiding the response of interest. Following Breiman (2001), this would be more akin to the data modeling culture.

For the other type of study focused on prediction, newer techniques are available that are far more focused on performance, not only for the current data under examination but for future data the selected model might be applied to. While still possible, relative predictor importance is less of an issue, and oftentimes there may be no particular theory to drive the analysis.
There may be thousands of input variables, such that no simple summary would likely be possible anyway. However, many of the techniques applied in such analyses are quite powerful, and steps are taken to ensure better results for new data. Again referencing Breiman (2001), this perspective is more of the algorithmic modeling culture.

While the two approaches are not exclusive, I present two extreme views of the situation:

To paraphrase provocatively, 'machine learning is statistics minus any checking of models and assumptions'.
~ Brian Ripley, 2004

... the focus in the statistical community on data models has:
- Led to irrelevant theory and questionable scientific conclusions.
- Kept statisticians from using more suitable algorithmic models.
- Prevented statisticians from working on exciting new problems.
~ Leo Breiman, 2001

Respective departments of computer science and statistics now overlap more than ever as more relaxed views seem to prevail today, but there are potential drawbacks to placing too much emphasis on either approach historically associated with them. Models that 'just work' have the potential to be dangerous if they are little understood. Situations in which much time is spent sorting out details for an ill-fitting model suffer the converse problem: some (though often perhaps very little, actually) understanding with little pragmatism. While this paper will focus on more algorithmic approaches, guidance will be provided with an eye toward their use in situations where the typical data modeling approach would be applied, thereby hopefully shedding some light on a path toward obtaining the best of both worlds.

Some Terminology

For those used to statistical concepts such as dependent variables, clustering, and predictors, etc., you will have to get used to some differences in terminology (note 4) such as targets, unsupervised learning, and inputs etc. This doesn't take too much, even if it is somewhat annoying when one is first starting out. I won't be too beholden to either in this paper, and it should be clear from the context what's being referred to. Initially I will start off mostly with non-ML terms and note in brackets the ML version to help the orientation along.

Note 4: See this for a comparison.

Tools You Already Have

One thing that is important to keep in mind as you begin is that standard techniques are still available, although we might tweak them or do more with them. So having a basic background in statistics is all that is required to get started with machine learning.

The Standard Linear Model

All introductory statistics courses will cover linear regression in great detail, and it certainly can serve as a starting point here. We can describe it as follows in matrix notation:

$$y \sim N(\mu, \sigma^2)$$
$$\mu = X\beta$$

where y is a normally distributed vector of responses [target] with mean µ and constant variance σ². X is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias (note 5)], and β is the vector of coefficients [weights] corresponding to the intercept and predictors in the model.

Note 5: Yes, you will see 'bias' refer to an intercept, and also mean something entirely different in our discussion of bias vs. variance.

What might be given less focus in applied courses however is how often it won't be the best tool for the job or even applicable in the form it is presented. Because of this many applied researchers are still hammering screws with it, even as the explosion of statistical techniques of the past quarter century has rendered obsolete many current introductory statistical texts that are written for disciplines. Even so, the concepts one gains in learning the standard linear model are generalizable, and even a few modifications of it, while still maintaining the basic design, can render it still very effective in situations where it is appropriate.

Typically in fitting [learning] a model we tend to talk about R-squared and statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares (note 6) with an eye towards its reduction and model comparison. We will not have a situation in which we are only considering one model fit, and so must find one that reduces the sum of the squared errors but without unnecessary complexity and overfitting, concepts we'll return to later.

Note 6: $\sum (y - f(x))^2$, where f(x) is a function of the model predictors, and in this context a linear combination of them (Xβ).

Furthermore, we will be much more concerned with the model fit on new data [generalization].

Logistic Regression

Logistic regression is often used where the response is categorical in nature, usually with a binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but you could end up with nonsensical predictions that fall outside the 0-1 range regarding the probability of the event occurring, to go along with other shortcomings. Furthermore, it is no more effort nor is any understanding lost in using a logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.

Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such (note 7). Like the standard linear model, just a few modifications can enable one to use it to provide better performance, particularly with new data. The gist is, it is not the case that we have to abandon familiar tools in the move toward a machine learning perspective.

Note 7: It is generally a bad idea to discretize continuous variables, especially the dependent variable. However contextual issues, e.g. disease diagnosis, might warrant it.

Expansions of Those Tools

Generalized Linear Models

To begin, logistic regression is a generalized linear model assuming a binomial distribution for the response and with a logit link function as follows:

$$y \sim \mathrm{Bin}(\mu, \mathrm{size} = 1)$$
$$\eta = g(\mu)$$
$$\eta = X\beta$$

This is the same presentation format as seen with the standard linear model presented before, except now we have a link function g(.) and so are dealing with a transformed response. In the case of the standard linear model, the distribution assumed is the gaussian and the link function is the identity link, i.e. no transformation is made. The link function used will depend on the analysis performed, and while there is choice in the matter, the distributions used have a typical, or canonical link function (note 8).
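
As a rough illustration of the points above, the following R sketch fits the standard linear model both with lm and as a gaussian GLM with identity link, then compares a logistic regression with a linear probability model. The data are simulated and every object name (x, y_cont, y_bin, fit_lm and so on) is an assumption made for this example, not something taken from the original paper.

```r
# Simulated data; all names and parameter values here are illustrative assumptions.
set.seed(1)
n <- 200
x <- rnorm(n)
y_cont <- 1 + 2 * x + rnorm(n)                                  # continuous response [target]
y_bin  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))    # binary response [label]

# The standard linear model is a GLM with gaussian family and identity link,
# so both fits should give the same coefficients up to numerical precision.
fit_lm  <- lm(y_cont ~ x)
fit_glm <- glm(y_cont ~ x, family = gaussian(link = "identity"))
coef_gap <- max(abs(coef(fit_lm) - coef(fit_glm)))

# Logistic regression: binomial family with the logit link.
fit_logit <- glm(y_bin ~ x, family = binomial(link = "logit"))
p_logit <- predict(fit_logit, type = "response")   # predicted probabilities

# Linear probability model: the standard linear model applied to a 0/1 response.
# Its fitted values are not constrained to the 0-1 range.
fit_lpm <- lm(y_bin ~ x)
p_lpm <- fitted(fit_lpm)

coef_gap
range(p_logit)
range(p_lpm)
```

Comparing the two ranges shows the contrast described in the text: the logistic predictions always stay strictly between 0 and 1, while nothing prevents the linear probability model's fitted values from falling outside that interval.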
Note 8: As another example, for the Poisson distribution, the typical link function would be log(µ).

Generalized linear models expand the standard linear model, which is a special case of generalized linear model, beyond the gaussian distribution for the response, and allow for better fitting models of categorical, count, and skewed response variables. We also have a counterpart to the residual sum of squares, though we'll now refer to it as the deviance.

Generalized Additive Models

Additive models extend the generalized linear model to incorporate nonlinear relationships of predictors to the response. We might note it as follows:

$$y \sim \mathrm{family}(\mu, \ldots)$$
$$\eta = g(\mu)$$
$$\eta = X\beta + f(X)$$

So we have the generalized linear model but also smooth functions f(X) of one or more predictors. More detail can be found in Wood (2006) and I provide an introduction here.

Things do start to get fuzzy with GAMs. It becomes more difficult to obtain statistical inference for the smoothed terms in the model, and the nonlinearity does not always lend itself to easy interpretation. However, really this just means that we have a little more work to get the desired level of understanding. GAMs can be seen as a segue toward more black box/algorithmic techniques. Compared to some of those techniques in machine learning, GAMs are notably more interpretable, though perhaps less so than GLMs. Also, part of the estimation process includes regularization and validation in determining the nature of the smooth function, topics to which we will return later.

The Loss Function

Given a set of predictor variables X and some response y, we look for some function f(X) to make predictions of y from those input variables. We also need a function to penalize errors in prediction: a loss function, L(Y, f(X)). With a chosen loss function, we then find the model which will minimize loss, generally speaking. We will start with the familiar and note a couple of others that might be used.

Continuous Outcomes

Squared Error

The classic loss function for linear models with continuous response is the squared error loss function, or the residual sum of squares.

$$L(Y, f(X)) = \sum (y - f(X))^2$$

Absolute Error

For an approach more robust to extreme observations, we might choose absolute rather than squared error as follows. In this case, predictions are a conditional median rather than a conditional mean.

$$L(Y, f(X)) = \sum |y - f(X)|$$

Negative Log-likelihood

We can also think of our usual likelihood methods learned in a standard applied statistics course as incorporating a loss function that is the negative log-likelihood pertaining to the model of interest. If we assume a normal distribution for the response we can note the loss function (dropping an additive constant that does not depend on the parameters) as:

$$L(Y, f(X)) = n \ln \sigma + \sum \frac{1}{2\sigma^2} (y - f(X))^2$$

In this case it would converge to the same answer as the squared error/least squares solution.
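
The three loss functions above can be compared directly by minimizing each one numerically. The sketch below is only an illustration under stated assumptions: the data are simulated, f(X) is taken to be the linear predictor Xβ, base R's optim is used as a generic minimizer, and all names (sq_loss, abs_loss, nll, b_sq and so on) are hypothetical. It is not the R example from the original paper.

```r
# Simulated data; names and parameter values are illustrative assumptions.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
X <- cbind(1, x)                       # model matrix: column of 1s for the intercept

# Loss functions with f(X) = X %*% beta.
sq_loss  <- function(beta) sum((y - X %*% beta)^2)     # squared error
abs_loss <- function(beta) sum(abs(y - X %*% beta))    # absolute error

# Gaussian negative log-likelihood without the additive constant.
# par = c(intercept, slope, log(sigma)); the log scale keeps sigma positive.
nll <- function(par) {
  beta  <- par[1:2]
  sigma <- exp(par[3])
  n * log(sigma) + sum((y - X %*% beta)^2) / (2 * sigma^2)
}

b_ols <- unname(coef(lm(y ~ x)))       # least squares reference solution
start <- c(mean(y), 0)

b_sq  <- optim(start, sq_loss, method = "BFGS")$par
b_nll <- optim(c(start, log(sd(y))), nll, method = "BFGS")$par[1:2]
# Absolute loss is not smooth, so use the derivative-free Nelder-Mead default,
# started from the least squares estimates.
b_abs <- optim(b_ols, abs_loss)$par

round(rbind(ols = b_ols, squared = b_sq, nll = b_nll, absolute = b_abs), 3)
```

The squared error and negative log-likelihood rows should agree with the least squares estimates up to optimizer tolerance, while the absolute error row targets a conditional median and so will generally differ somewhat.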
