Analyzing Linguistic Data
A practical introduction to statistics

R. H. Baayen
Interfaculty research unit for language and speech, University of Nijmegen, & Max Planck Institute for Psycholinguistics, Nijmegen
e-mail: [email protected]

Cambridge University Press

TO JORN, CORINE, THERA AND TINEKE

Contents

1 An introduction to R
  1.1 R as a calculator
  1.2 Getting data into and out of R
  1.3 Accessing information in data frames
  1.4 Operations on data frames
    1.4.1 Sorting a data frame by one or more columns
    1.4.2 Changing information in a data frame
    1.4.3 Extracting contingency tables from data frames
    1.4.4 Calculations on data frames
  1.5 Session management
  1.6 Exercises
2 Graphical data exploration
  2.1 Random variables
  2.2 Visualizing single random variables
  2.3 Visualizing two or more variables
  2.4 Trellis graphics
  2.5 Exercises
3 Probability distributions
  3.1 Distributions
  3.2 Discrete distributions
  3.3 Continuous distributions
    3.3.1 The normal distribution
    3.3.2 The t, F and χ2 distributions
  3.4 Exercises
4 Basic statistical methods
  4.1 Tests for single vectors
    4.1.1 Distribution tests
    4.1.2 Tests for the mean
  4.2 Tests for two independent vectors
    4.2.1 Are the distributions the same?
    4.2.2 Are the means the same?
    4.2.3 Are the variances the same?
  4.3 Paired vectors
    4.3.1 Are the means or medians the same?
    4.3.2 Functional relations: linear regression
    4.3.3 What does the joint density look like?
  4.4 A numerical vector and a factor: analysis of variance
    4.4.1 Two numerical vectors and a factor: analysis of covariance
  4.5 Two vectors with counts
  4.6 A note on statistical significance
  4.7 Exercises
5 Clustering and classification
  5.1 Clustering
    5.1.1 Tables with measurements: Principal components analysis
    5.1.2 Tables with measurements: Factor analysis
    5.1.3 Tables with counts: Correspondence Analysis
    5.1.4 Tables with distances: Multi-dimensional scaling
    5.1.5 Tables with distances: Hierarchical cluster analysis
  5.2 Classification
    5.2.1 Classification trees
    5.2.2 Discriminant analysis
    5.2.3 Support vector machines
  5.3 Exercises
6 Regression modeling
  6.1 Introduction
  6.2 Ordinary least squares regression
    6.2.1 Nonlinearities
    6.2.2 Collinearity
    6.2.3 Model criticism
    6.2.4 Validation
  6.3 Generalized linear models
    6.3.1 Logistic Regression
    6.3.2 Ordinal logistic regression
  6.4 Regression with breakpoints
  6.5 Models for lexical richness
  6.6 General considerations
  6.7 Exercises
7 Mixed Models
  7.1 Modeling data with fixed and random effects
  7.2 A comparison with traditional analyses
    7.2.1 Mixed-effect models and Quasi-F
    7.2.2 Mixed-effect models and Latin Square designs
    7.2.3 Regression with subjects and items
  7.3 Shrinkage in mixed-effect models
  7.4 Generalized linear mixed models
  7.5 Case studies
    7.5.1 Primed lexical decision latencies for Dutch neologisms
    7.5.2 Self-paced reading latencies for Dutch neologisms
    7.5.3 Visual lexical decision latencies of Dutch eight-year olds
    7.5.4 Mixed-effect models in corpus linguistics
  7.6 Exercises
A Solutions to the exercises
B Overview of R functions

This book provides an introduction to the statistical analysis of quantitative data for researchers studying aspects of language and language processing. The statistical analysis of quantitative data is often seen as an onerous task that one would rather leave to others. Statistical packages tend to be used as a kind of oracle, from which you elicit a verdict as to whether you have one or more significant effects in your data. In order to elicit a response from the oracle, one has to click one's way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p-values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle.

The approach to data analysis to which this book provides a guide is fundamentally different in several ways. First of all, we will make use of a radically different tool for doing statistics, the interactive programming environment known as R. R is an open source implementation of the (object-oriented) S language for statistical analysis originally developed at Bell Laboratories. It is the platform par excellence for research and development in computational statistics. It can be downloaded from the COMPREHENSIVE R ARCHIVE NETWORK (CRAN) at http://cran.r-project.org or one of the many mirror sites. Learning to work with R is in many ways similar to learning a new language. Once you have mastered its grammar, and once you have acquired some basic vocabulary, you will also have begun to acquire a new way of thinking about data analysis that is essential for understanding the structure in your data. The design of R is especially elegant in that it has a consistent uniform syntax for specifying statistical models, no matter which type of model is being fitted.

What is essential about working with R, and this brings us to the second difference in our approach, is that we will depend heavily on visualization. R has outstanding graphical facilities, which generally provide far more insight into the data than long lists of statistics that depend on often questionable simplifying assumptions. That is, this book provides an introduction to exploratory data analysis. Moreover, we will work incrementally and interactively. The process of understanding the structure in your data is almost always an iterative process involving graphical inspection, model building, graphical inspection, updating and adjusting the model, etc. The flexibility of R is crucial for making this iterative process of coming to grips with your data both easy and in fact quite enjoyable.

A third, at first sight heretical aspect of this book is that I have avoided all formal mathematics. The focus of this introduction is on explaining the key concepts and on providing guidelines for the proper use of statistical techniques. A useful metaphor is learning to drive a car. In order to drive a car, you need to know the position and function of tools such as the steering wheel and the brake. You also need to know that you should not drive with the handbrake on. And you need to know the traffic rules. Without these three kinds of knowledge, driving a car is extremely dangerous.
What you do not need to know is how to construct a combustion engine, or how to drill for oil and refine it so that you can use it to fuel that combustion engine. The aim of this book is to provide you with a driving licence for exploratory data analysis. There is one caveat here. To stretch the metaphor to its limit: with R, you are receiving driving lessons in an all-powerful car, a combination of a racing car, a lorry, a personal vehicle, and a limousine. Consequently, you have to be a responsible driver, which means that you will find that you will need many additional driving lessons beyond those offered in this book. Moreover, it never hurts to consult professional drivers: statisticians with a solid background in mathematical statistics who know the ins and outs of the tools and techniques, and their advantages and disadvantages. Other introductions that the reader may want to consider are Dalgaard [2002], Verzani [2005], and Crawley [2002]. The present book is written for readers with little or no programming experience. Readers interested in the R language itself should consult Becker et al. [1988] and Venables and Ripley [2002].

The approach I have taken in this course is to work with real data sets rather than with small artificial examples. Real data are often messy, and it is important to know how to proceed when the data display all kinds of problems that standard introductory textbooks hardly ever mention. Unless stated otherwise, data sets discussed in this book are available in the languageR package, which is available at the CRAN archives. The reader is encouraged to work through the examples with the actual data, to get a feeling of what the data look like and how to work with R's functions. To save typing, you can copy and paste the R code of the examples in this book into the R console (see the file examples.txt in languageR's scripts directory). The languageR package also makes available a series of functions. These convenience functions, some of which are still being developed, bear the extension .fnc to distinguish them from the well-tested functions of R and its standard packages.

An important reason for using R is that it is a carefully designed programming environment that allows you, in a very flexible way, to write your own code, or modify existing code, to tailor R to your specific needs. To see why this is useful, consider a researcher studying similarities in meaning and form for a large number of words. Suppose that a separate model needs to be fitted for each of 1000 words to the data of the other 999 words. If you are used to thinking about statistical questions as paths through cascaded menus, you will discard such an analysis as impractical almost immediately. When you work in R, you simply write the code for one word, and then cycle it through all other words. Researchers are often unnecessarily limited in the questions they explore because they are thinking in a menu-driven language instead of in an interactive programming language like R. This is an area where language determines thought.

If you are new to working with a programming language, you will find that you will have to get used to getting your commands for R exactly right. R offers command line editing facilities, and you can also page through earlier commands with the up and down arrows of your keyboard. It is often useful to open a simple text editor (emacs, gvim, notepad), to prepare your commands in, and to copy and paste these commands into the R window. Especially more complex commands tend to be used more than once, and it is often much easier to make copies in the editor and modify these, than to try to edit multiple-line commands in the R window itself. Output from R that is worth remembering can be pasted back into the editor, which in this way comes to retain a detailed history of both your commands and of the relevant results. You might think that using a graphical user interface would work more quickly, in which case you may want to consider using the commercial software S-PLUS, which offers such an interface. However, as pointed out by Crawley [2002], "If you enjoy wasting time, you can pull down the menus and click in the dialog boxes to your heart's content. However, this takes about 5 to 10 times as long as writing in the command line. Life is short. Use the command line." (p. 11)

There are several ways in which you can use this book. If you use this book as an introduction to statistics, it is important to work through the examples, not only by reading them through, but by trying them out in R. Each chapter also comes with a set of problems, with worked-out solutions in Appendix A. If you use this book to learn how to apply in R particular techniques that you are already familiar with, then the quickest way to proceed is to study the structure of the relevant data files used to illustrate the technique. Once you have understood how the data are to be organized, you can load the data into R and try out the example. And once you have got this working, it should not be difficult to try out the same technique on your own data.

This book is organized as follows: Chapter 1 is an introduction to the basics of R. It explains how to load data into R, and how to work with data from the command line. Chapter 2 introduces a number of important visualization techniques. Chapter 3 discusses probability distributions, and Chapter 4 provides a guide to standard statistical tests for single random variables as well as for two random variables. Chapter 5 discusses methods for clustering and classification. Chapter 6 discusses regression modeling strategies, and Chapter 7 introduces mixed-effect models, the models required for analyzing data sets with nested or crossed repeated measures.

I am indebted to Carmel O'Shannessy for allowing me to use her data on Warlpiri, to Kors Perdijk for sharing his work on the reading skills of young children, to Joan Bresnan for her data on the dative alternation in English, to Maria Spassova for her data on Spanish authorial hands, to Karen Keune for her materials on the social and geographical variation in the Netherlands and Flanders, to Laura de Vaan for her experiments on Dutch derivational neologisms, to Mirjam Ernestus for her phonological data on final devoicing, to Wieke Tabak for her data on etymological age, to Jen Hay for the rating data sets, and to Michael Dunn for his data on the phylogenetic classification of Papuan and Oceanic languages. Many students and colleagues have helped me with their comments and suggestions for improvement. I would like to mention by name Joan Bresnan, Mirjam Ernestus, Jen Hay, Reinhold Kliegl, Victor Kuperman, Petar Milin, Ingo Plag, Stuart Robinson, Hedderik van Rijn, Eva Smolka, and Fiona Tweedie. I am especially indebted to Douglas Bates for his detailed comments on Chapter 7, his advice for improving the languageR package, his help with the code for temporary ancillary functions for mixed-effects modeling, and the insights offered on mixed-effects modeling. In fact, I would like to thank Doug here for all the work he has put into developing the lme4 package, which I believe is the most exciting tool discussed in this book for analyzing linguistic experimental data. Last but not least, I am grateful to Tineke for her friendship and support.

Chapter 1
An introduction to R

In order to learn to work with R, you have to learn to speak its language, the S language, developed originally at Bell Laboratories [Becker et al., 1988]. The grammar of this programming language is beautiful and easy to learn. It is important to master its basics, as this grammar is designed to guide you towards the appropriate way of thinking about your data and how you might want to carry out your analysis.

When you begin to use R on an Apple Macintosh or a Windows PC, you will start R either through a menu guiding you to applications, or by clicking on R's icon. As a result, a graphical user interface is started up, whose central part is a window with a prompt (>), the place where you type your commands. On UNIX or LINUX, the same window is obtained by opening a terminal and typing R at its prompt.

The sequence of commands in a given R session and the objects created are stored in files named .Rhistory and .RData when you quit R and respond positively to the question whether you want to save your workspace. If you do so, then your results will be available to you the next time you start up R. If you are using a graphical user interface, this .RData file will be located by default in the folder where R has been installed. In UNIX and LINUX, the .RData file will be created in the same directory where R was started up.

You will often want to use R for different projects, located in different directories on your computer. On UNIX and LINUX systems, simply open a terminal in the desired directory, and start R. When using a graphical user interface, you have to use the File drop-down menu. In order to change to another directory, select Change dir. You will also have to load the .RData and .Rhistory using the options Load Workspace and Load History.

Once R is up and running, you need to install a series of packages, including the package that comes with this book, languageR. This is accomplished with the following instruction, to be typed at the R prompt:

install.packages(c("rpart", "chron", "Hmisc", "Design", "Matrix",
    "lme4", "coda", "e1071", "zipfR", "ape", "languageR"),
    repos = "http://cran.r-project.org")

Packages are installed in a folder named library, which itself is located in R's home directory. On my system, R's home is /home/harald/R-2.4.0, so packages are found in /home/harald/R-2.4.0/library, and the code of the main examples in this book is located in /home/harald/R-2.4.0/library/languageR/scripts.

I recommend creating a file named .Rprofile in your home directory. This file should contain the line

library(languageR)

telling R that upon startup it should attach languageR. All data sets and functions defined in languageR, and almost all packages that we will need, will be automatically available. Alternatively, you can type library(languageR) at the R prompt yourself after you have started R. All examples in this book assume that the languageR package has been attached.

The way to learn a language is to start speaking it. The way to learn R, and the S language that it is built on, is to start using it. Reading through the examples in this chapter is not enough to become a confident user of R. For this, you need to actually try out the examples by typing them at the R prompt. You have to be very precise in your commands, which requires a discipline that you will only master if you learn from experience, from your mistakes and typos. Don't be put off if R complains about your initial attempts to use it; just carefully compare what you typed, letter by letter and bracket by bracket, with the code in the examples.

If you type a command that extends over separate lines, the standard prompt > will change into the special continuation prompt +. If you think your command is completed, but still have a continuation prompt, there is something wrong with your syntax. To cancel the command, use either the escape key, or type CONTROL-C. Appendix B provides an overview of operators and functions, grouped by topic, that the reader may find useful as a complement to the example-by-example approach followed in the main text of this introduction.

1.1 R as a calculator

Once you have an R window, you can use R simply as a calculator. To add 1 and 2, type

> 1 + 2

and hit the RETURN (ENTER) key, and R will display

[1] 3

The [1] preceding the answer indicates that 3 is the first element of the answer. In this example, it is also the only element. Other examples of arithmetic operations are

> 2 * 3     # multiplication
[1] 6
> 6 / 3     # division
[1] 2
> 2 ^ 3     # power
[1] 8
> 9 ^ 0.5   # square root
[1] 3

The hash mark # indicates that the text to its right is a comment that should be ignored by R. Operators can be stacked, in which case it may be necessary to make explicit by means of parentheses the order in which the operations have to be carried out.

> 9 ^ 0.5 ^ 3
[1] 1.316074
> (9 ^ 0.5) ^ 3
[1] 27
> 9 ^ (0.5 ^ 3)
[1] 1.316074

Note that the evaluation of exponentiation proceeds from right to left, rather than from left to right. Use parentheses whenever you are not absolutely sure about the order in which R evaluates stacked operators.

The results of calculations can be saved and referenced by VARIABLES. For instance, we can store the result of adding 1 and 2 in a variable named x. There are three ways in which we can assign the result of our addition to x. We can use the equal sign as assignment operator,

> x = 1 + 2
> x
[1] 3

or we can use a left arrow (composed of < and -) or a right arrow (composed of - and >), as follows:

> x <- 1 + 2
> 1 + 2 -> x

The right arrow is especially useful in case you typed a long expression and only then decide that you would like to save its output rather than have it displayed on your screen. Instead of having to go back to the beginning of the line, you can continue typing and use the right arrow as assignment operator. We can modify the value of x, for instance, by increasing its value by one.

> x = x + 1

Here we take x, add one, and assign the result (4) back to x. Without this explicit assignment, the value of x remains unchanged:

> x = 3
> x + 1   # result is displayed, not assigned to x
[1] 4
> x       # so x is unchanged
[1] 3

We can work with variables in the same way that we work with numbers.

> 4 ^ 3
[1] 64
> x = 4
> y = 3
> x ^ y
[1] 64

The more common mathematical operations are carried out with operators such as +, -, and *. For a range of standard operations, as well as for more complex mathematical calculations, a wide range of functions is available. Functions are commands that take some input, do something with that input, and return the result to the user. Above, we calculated the square root of 9 with the help of the ^ operator. Another way of obtaining the same result is by means of the sqrt() function.

> sqrt(9)
[1] 3

The argument of the square root function, 9, is enclosed between parentheses.

1.2 Getting data into and out of R

Bresnan et al. [2007] studied the dative alternation in English in the three-million-word Switchboard collection of recorded telephone conversations and in the Treebank Wall Street Journal collection of news and financial reportage. In English, the recipient can be realized either as an NP (Mary gave John the book) or as a PP (Mary gave the book to John). Bresnan and colleagues were interested in predicting the realization of the recipient (as NP or PP) from a wide range of potential explanatory variables, such as the animacy, the length in words, and the pronominality of the theme and the recipient. A subset of their data collected from the treebank is available as the data set verbs. (Bresnan and colleagues studied many more variables; the full data set is available as dative, and we will study it in detail in later chapters.) You should have attached the languageR package at this point, otherwise verbs will not be available to you.

We display the first 10 rows of the verbs data with the help of the function head(). (Readers familiar with programming languages like C and Python should note that R numbering begins with 1 rather than with zero.)

> head(verbs, n = 10)
   RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
1                NP  feed      animate      inanimate     2.6390573
2                NP  give      animate      inanimate     1.0986123
3                NP  give      animate      inanimate     2.5649494
4                NP  give      animate      inanimate     1.6094379
5                NP offer      animate      inanimate     1.0986123
6                NP  give      animate      inanimate     1.3862944
7                NP   pay      animate      inanimate     1.3862944
8                NP bring      animate      inanimate     0.0000000
9                NP teach      animate      inanimate     2.3978953
10               NP  give      animate      inanimate     0.6931472

When the option n is left unspecified, the first 6 rows will be displayed by default. Tables such as the one exemplified by verbs are referred to in R as DATA FRAMES. Each line in this data frame represents a clause with a recipient, and specifies whether this recipient was realized as an NP or as a PP. Each line also lists the verb used, the animacy of the recipient, the animacy of the theme, and the logarithm of the length of the theme. Note that each elementary observation (here the realization of the recipient as NP or PP in a given clause) has its own line in the input file. This is referred to as the LONG DATA FORMAT, where long highlights that no attempt is made to store the data more economically.

It is good practice to spell out the elements in the columns of a data frame with sensible names. For instance, the first line with data specifies that the recipient was realized as an NP for the verb to feed, that the recipient was animate, and that the theme was inanimate. The length of the theme was 14, as shown when we undo the logarithmic transformation with its inverse, the exponential function exp():

> exp(2.6390573)
[1] 14
> log(14)
[1] 2.639057

A data frame such as verbs can be saved outside R as an independent file with write.table(), enclosing the name of the file (including its path) between quotes.

> write.table(verbs, file = "/home/harald/dativeS.txt")    # linux
> write.table(verbs, file = "/users/harald/dativeS.txt")   # MacOSX
> write.table(verbs, file = "c:stats/dativeS.txt")         # Windows

Users of Windows should note the use of the forward slash for path specification. Alternatively, on MacOSX or Windows, the function file.choose() may be used, replacing the file name, in which case a dialog box is provided.

External data in this tabular format can be loaded into R with read.table(). We tell this function that the file we just made has an initial line, its header, that specifies the column names.

> verbs = read.table("/home/harald/dativeS.txt", header = TRUE)

R handles various other data formats as well, including sas.get() (which converts SAS data sets), read.csv() (which handles comma-separated spreadsheet data), and read.spss() (for reading SPSS data files).

Data sets and functions in R come with extensive documentation, including examples. This documentation is accessed by means of the help() function. Many examples in the documentation can also be executed with the example() function.

> help(verbs)
> example(verbs)

1.3 Accessing information in data frames

When working with data frames, we often need to select or manipulate subsets of rows and columns. Rows and columns are selected by means of a mechanism referred to as subscripting. In its simplest form, subscripting can be achieved simply by specifying the row and column numbers between square brackets, separated by a comma. For instance, to extract the length of the theme for the first line in the data frame verbs, we type

> verbs[1, 5]
[1] 2.639057

Whatever precedes the comma is interpreted as a restriction on the rows, and whatever follows the comma is a restriction on the columns. In this example, the restrictions are so narrow that only one element is selected, the one element that satisfies the restrictions that it should be on row 1 and in column 5. The other extreme is no restrictions whatsoever, as when we type the name of the data frame at the prompt, which is equivalent to typing

> verbs[ , ]   # this will display all 903 rows of verbs!

When we leave the slot before the comma empty, we impose no restrictions on the rows:

> verbs[ , 5]   # show the elements of column 5
[1] 2.6390573 1.0986123 2.5649494 1.6094379 1.0986123
[6] 1.3862944 1.3862944 0.0000000 2.3978953 0.6931472
...

As there are 903 rows in verbs, the request to display the fifth column results in an ordered sequence of 903 elements. In what follows, we refer to such an ordered sequence as a vector. Thanks to the numbers in square brackets in the output, we can easily see that 0.00 is the eighth element of the vector. Column vectors can also be extracted with the $ operator preceding the name of the relevant column:

> verbs$LengthOfTheme   # same as verbs[, 5]

When we specify a row number but leave the slot after the comma empty, we impose no restrictions on the columns, and therefore obtain a row vector instead of a column vector:

> verbs[1, ]   # show the elements of row 1
  RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
1               NP feed      animate      inanimate      2.639057

Note that the elements of this row vector are displayed together with the column names. Row and column vectors can be extracted from a data frame and assigned to separate variables:

> row1 = verbs[1, ]
> col5 = verbs[ , 5]
> head(col5, n = 5)
[1] 2.6390573 1.0986123 2.5649494 1.6094379 1.0986123

Individual elements can be accessed from these vectors by the same subscripting mechanism, but simplified to just one index between the square brackets:

> row1[1]
  RealizationOfRec
1               NP
> col5[1]
[1] 2.639057

Because the row vector has names, we can also address its elements by name, properly enclosed between double quotes:

> row1["RealizationOfRec"]
  RealizationOfRec
1               NP

You now know how to extract single elements, rows and columns from data frames, and how to access individual elements from vectors. However, we often need to access more than one row or more than one column simultaneously. R makes this possible by placing vectors before or after the comma when subscripting the data frame, instead of single elements. (For R, single elements are actually vectors with only one element.) Therefore, it is useful to know how to create your own vectors from scratch. The simplest way of creating a vector is to combine elements with the concatenation operator c(). In the following example, we select some arbitrary row numbers.

> rs = c(638, 799, 390, 569, 567)
> rs
[1] 638 799 390 569 567

We can now use this vector of numbers to select precisely those rows from verbs that have the row numbers specified in rs. We do so by inserting rs before the comma.

> verbs[rs, ]
    RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
638               PP  pay      animate      inanimate     0.6931472
799               PP sell      animate      inanimate     1.3862944
390               NP lend      animate        animate     0.6931472
569               PP sell      animate      inanimate     1.6094379
567               PP send    inanimate      inanimate     1.3862944

Note that the appropriate rows of verbs appear in exactly the same order as specified in rs.

The concatenation operator c() is not the only function for creating vectors. Of the many other possibilities, the colon operator should be mentioned here. This operator brings into existence sequences of increasing or decreasing numbers with a step size of one:

> 1 : 5
[1] 1 2 3 4 5
> 5 : 1
[1] 5 4 3 2 1

In order to select from verbs the rows specified by rs and the first three columns, we specify the row condition before the comma and the column condition after the comma:

> verbs[rs, 1:3]
    RealizationOfRec Verb AnimacyOfRec
638               PP  pay      animate
799               PP sell      animate
390               NP lend      animate
569               PP sell      animate
567               PP send    inanimate

Alternatively, we could have specified a vector of column names instead of column numbers.

> verbs[rs, c("RealizationOfRec", "Verb", "AnimacyOfRec")]

Note once more that when strings are brought together into a vector, they must be enclosed between quotes.

Thus far, we have selected rows by explicitly specifying their row numbers. Often, we do not have this information available. For instance, suppose we are interested in those observations for which the AnimacyOfTheme has the value animate. We do not know the row numbers of these observations. Fortunately, we do not need them either, because we can impose a condition on the rows of the data frame such that only those rows will be selected that meet that condition. The condition that we want to impose is that the value in the column of AnimacyOfTheme is animate. Since this is a condition on rows, it precedes the comma.

> verbs[verbs$AnimacyOfTheme == "animate", ]
    RealizationOfRec  Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
58                NP  give      animate        animate     1.0986123
100               NP  give      animate        animate     2.8903718
143               NP  give    inanimate        animate     2.6390573
390               NP  lend      animate        animate     0.6931472
506               NP  give      animate        animate     1.9459101
736               PP trade      animate        animate     1.6094379

This is equivalent to

> subset(verbs, AnimacyOfTheme == "animate")

It is important to note that the equality in the condition is expressed with a double equal sign. This is because the single equal sign is the assignment operator. The following example illustrates a more complex condition with the logical operator AND (&) (the logical operator for OR is |).

> verbs[verbs$AnimacyOfTheme == "animate" & verbs$LengthOfTheme > 2, ]
    RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
100               NP give      animate        animate      2.890372
143               NP give    inanimate        animate      2.639057

Row and column names of a data frame can be extracted with the functions rownames() and colnames().

> head(rownames(verbs))
[1] "1" "2" "3" "4" "5" "6"
> colnames(verbs)
[1] "RealizationOfRec" "Verb" "AnimacyOfRec" "AnimacyOfTheme"
[5] "LengthOfTheme"

The vector of column names is a string vector. Perhaps surprisingly, the vector of row names is also a string vector. To see why this is useful, we assign the subtable of verbs obtained by subscripting the rows with the rs vector to a separate object that we name verbs.rs.

> verbs.rs = verbs[rs, ]

We can extract the first line not only by row number,

> verbs.rs[1, ]
    RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
638               PP  pay      animate      inanimate     0.6931472

but also by row name,

> verbs.rs["638", ]   # same output

The row name is a string that reminds us of the original row number in the data frame from which verbs.rs was extracted:

> verbs[638, ]   # same output again

Let's finally extract a column that does not consist of numbers, such as the column specifying the animacy of the recipient.

> verbs.rs$AnimacyOfRec
[1] animate animate animate animate inanimate
Levels: animate inanimate

Two things are noteworthy. First, the words animate and inanimate are not enclosed between quotes. Second, the last line of the output mentions that there are two LEVELS: animate and inanimate. Whereas the row and column names are vectors of strings, non-numerical columns in a data frame are automatically converted by R into FACTORS. In statistics, a factor is a non-numerical predictor or response. Its values are referred to as its levels. Here, the factor AnimacyOfRec has as only possible values animate and inanimate, hence it has only two levels. Most statistical techniques don't work with string vectors, but with factors. This is the reason why R automatically converts non-numerical columns into factors. If you really want to work with a string vector instead of a factor, you have to do the back-conversion yourself with the function as.character():

> verbs.rs$AnimacyOfRec = as.character(verbs.rs$AnimacyOfRec)
> verbs.rs$AnimacyOfRec
[1] "animate" "animate" "animate" "animate" "inanimate"

Now the elements of the vector are strings, and as such properly enclosed between quotes. We can undo this conversion with as.factor().

> verbs.rs$AnimacyOfRec = as.factor(verbs.rs$AnimacyOfRec)

If we repeat these steps, but with a smaller subset of the data in which AnimacyOfRec is only realized as animate,

> verbs.rs2 = verbs[c(638, 390), ]
> verbs.rs2
    RealizationOfRec Verb AnimacyOfRec AnimacyOfTheme LengthOfTheme
638               PP  pay      animate      inanimate     0.6931472
390               NP lend      animate        animate     0.6931472

we observe that the original two levels of AnimacyOfRec are remembered:

> verbs.rs2$AnimacyOfRec
[1] animate animate
Levels: animate inanimate

In order to get rid of the uninstantiated factor level, we convert AnimacyOfRec to a character vector, and then convert it back to a factor:

> as.factor(as.character(verbs.rs2$AnimacyOfRec))
[1] animate animate
Levels: animate

An alternative with the same result is

> verbs.rs2$AnimacyOfRec[drop=TRUE]
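The behaviour of uninstantiated factor levels can also be tried out without the languageR package. What follows is only a minimal sketch on a small made-up factor: the object names (animacy, animate.only, and so on) and the values are hypothetical and are not taken from the verbs data. The sketch additionally uses droplevels(), a function of current versions of base R that is not mentioned in the text. Readers working with R 4.0.0 or later should also be aware that read.table() and data.frame() no longer convert strings to factors by default, so that the conversion may have to be requested explicitly.

```r
# Made-up factor standing in for a column such as verbs$AnimacyOfRec
# (hypothetical values, not taken from the verbs data set).
animacy <- factor(c("animate", "inanimate", "animate", "animate"))

# Subscripting keeps the full set of levels, including the level
# that is no longer instantiated in the selection.
animate.only <- animacy[animacy == "animate"]
levels(animate.only)

# Three ways of dropping the uninstantiated level.
via.character  <- as.factor(as.character(animate.only))  # round trip via strings
via.drop       <- animate.only[drop = TRUE]              # subscripting with drop
via.droplevels <- droplevels(animate.only)               # base R helper (assumed available)

levels(via.character)
levels(via.drop)
levels(via.droplevels)
```

All three routes are expected to leave a factor with the single level animate; which one to use is largely a matter of taste, although droplevels() can also be applied to a whole data frame at once.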
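The subscripting patterns introduced in Section 1.3 can be rehearsed on a small made-up data frame before applying them to verbs. The following is a sketch under stated assumptions: the data frame toy, its column values, and the object names are hypothetical illustrations, and stringsAsFactors = TRUE is set explicitly so that the string columns become factors as described in the text, whatever the version of R.

```r
# Hypothetical toy data frame with the same kind of columns as verbs.
toy <- data.frame(
  Verb           = c("give", "pay", "send", "give"),
  AnimacyOfTheme = c("inanimate", "animate", "inanimate", "animate"),
  LengthOfTheme  = log(c(14, 3, 2, 18)),
  stringsAsFactors = TRUE
)

# A single element: row 1, column 3.
first.length <- toy[1, 3]

# A condition on the rows precedes the comma.
is.animate   <- toy$AnimacyOfTheme == "animate"
animate.rows <- toy[is.animate, ]

# Conditions can be combined with the logical operator AND (&).
long.animate <- toy[is.animate & toy$LengthOfTheme > 2, ]

# subset() expresses the same selection without repeating the data frame name.
same <- subset(toy, AnimacyOfTheme == "animate" & LengthOfTheme > 2)

# Row names are strings that record the original row numbers.
rownames(long.animate)
```

Because the row names survive the selection, a row of long.animate can afterwards be addressed by its original number written as a string, just as verbs.rs["638", ] was above.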