Quantitative and Computational Methods for the Social Sciences

Unsupervised Machine Learning for Clustering in Political and Social Research

Philip D. Waggoner

ISSN 2398-4023 (online)
ISSN 2514-3794 (print)

Elements in Quantitative and Computational Methods for the Social Sciences
edited by
R. Michael Alvarez, California Institute of Technology
Nathaniel Beck, New York University

UNSUPERVISED MACHINE LEARNING FOR CLUSTERING IN POLITICAL AND SOCIAL RESEARCH
Philip D. Waggoner, University of Chicago

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108793384
DOI: 10.1017/9781108883955

© Philip D. Waggoner 2020

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2020

A catalogue record for this publication is available from the British Library.

ISBN 978-1-108-79338-4 Paperback
ISSN 2398-4023 (online)
ISSN 2514-3794 (print)

Additional resources for this publication at www.cambridge.org/waggoner

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Unsupervised Machine Learning for Clustering in Political and Social Research

Elements in Quantitative and Computational Methods for the Social Sciences
DOI: 10.1017/9781108883955
First published online: December 2020

Philip D. Waggoner, University of Chicago
Author for correspondence: Philip D. Waggoner, [email protected]

Abstract: In the age of data-driven problem-solving, the ability to apply cutting-edge computational tools for explaining substantive phenomena in a digestible way to a wide audience is an increasingly valuable skill. Such skills are no less important in political and social research. Yet, application of quantitative methods often assumes an understanding of the data, structure, patterns, and concepts that directly influence the broader research program. It is often the case that researchers may not be entirely aware of the precise structure and nature of their data, or of what to expect of their data when approaching analysis. Further, in teaching social science research methods, it is often overlooked that the process of exploring data is a key stage in applied research, one that precedes predictive modeling and hypothesis testing. These tasks, though, require knowledge of appropriate methods for exploring and understanding data in the service of discerning patterns, which contribute to the development of theories and testable expectations. This Element seeks to fill this gap by offering researchers and instructors an introduction to clustering, which is a prominent class of unsupervised machine learning for exploring, mining, and understanding data. I detail several widely used clustering techniques, and pair each with R code and real data to facilitate interaction with the concepts. Three unsupervised clustering algorithms are introduced: agglomerative hierarchical clustering, k-means clustering, and Gaussian mixture models. I conclude by offering a high-level look at three advanced methods: fuzzy C-means, DBSCAN, and partitioning around medoids clustering. The goal is to bring applied researchers into the world of unsupervised machine learning, both theoretically as well as practically. All code examples will leverage the cloud computing platform Code Ocean to guide readers through implementation of these algorithms.
Keywords: clustering, unsupervised machine learning, computational social science, R

© Philip D. Waggoner 2020
ISBNs: 9781108793384 (PB), 9781108883955 (OC)
ISSNs: 2398-4023 (online), 2514-3794 (print)

Contents

1 Introduction
2 Setting the Stage for Clustering
3 Agglomerative Hierarchical Clustering
4 K-means Clustering
5 Gaussian Mixture Models
6 Advanced Methods
7 Conclusion
References

1 Introduction

When people think of machine learning, visions of complex neural networks, support vector machines, or random decision forests tend to come to mind. While these are indeed common machine learning methods, there is another widely used, but distinct class of machine learning: clustering. Clustering, which is more aptly situated in unsupervised machine learning, allows researchers to explore, learn, and summarize large amounts of data in an efficient way. Before diving into clustering and its application in political and social research, consider first the distinction between supervised and unsupervised machine learning, to better appreciate precisely how clustering works and why it is a valuable approach to exploratory data analysis (EDA), and to political and social research more specifically.

Two key components are central to unsupervised machine learning. First, in unsupervised learning, the researcher works with unlabeled data, meaning classes are not predetermined or specified, and thus there is no expected outcome to be predicted, nor are the data used to predict some other outcome. Yet, in a broad research program, researchers can and often do use unsupervised learning techniques, such as clustering, to label data (also called feature extraction in machine learning research), and then feed this output to a supervised classifier, for example. But on its own, unsupervised machine learning works almost exclusively with unlabeled data. This relates to the second key component, which is that unsupervised learning precludes any meaningful engagement by the researcher during the modeling process.
This is in contrast to supervised learning, where model parameters are tuned and training data are set by the researcher, often for predictive purposes. With such preprocessed data and a clear outcome in mind, model fit and output are relatively easily diagnosed and inspected. As such, in the unsupervised learning world, there is no outcome or dependent variable being predicted, nor are there parameters to be tuned to produce stronger statistical models used for inference. Rather, the researcher feeds unlabeled data to a learning algorithm and allows patterns to emerge, typically based on similarity among observations (within-group homogeneity) and dissimilarity between groupings of observations (between-group heterogeneity). Such an endeavor is especially valuable in EDA, where a researcher is interested in recovering underlying, nonrandom structure in data or feature space, while also simplifying and summarizing large amounts of data in an intuitive, digestible way, but with minimal assumptions or interference with the algorithm.

Though there is a trade-off between exploration (unsupervised) and confirmation (supervised), it is important to note that each is valuable in its respective sphere, and they can even strengthen each other (Tukey, 1980).
When a researcher is concerned with fitting a model to data to minimize some prediction error rate or build a maximally accurate learner, supervised techniques may be more appropriate. Yet, when that researcher is more concerned with exploring and summarizing data, perhaps as a step in the broader research program, then unsupervised techniques may be preferred. Indeed, in unsupervised machine learning, the patterns that emerge are meant to be simplifications of more complex, underlying patterns that naturally exist in the data. Whether these patterns accurately reflect "real life" or preconceptions of substantive phenomena is often left to the researcher to decide. In addition to the validation techniques discussed later in the Element, there are further ways to verify and validate results of unsupervised learners, such as comparison across multiple, different algorithms, or the application of domain expertise. Thus, the idea here is that emergent patterns should be evaluated on the basis of domain expertise, as well as checked against other methods and algorithms, as any researcher would do in more common "robustness checks." While this Element stays away from normative prescriptions for which approach to data exploration and analysis is "better" (as better so often does not exist), the goal at present is to introduce a new, additional way of thinking about, exploring, and understanding patterns in data.
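To make the idea of letting patterns emerge concrete, consider a minimal R sketch. This is a hypothetical illustration, not one of the Element's own examples: the built-in iris measurements stand in for any unlabeled numeric data (the species labels are dropped so the algorithm sees no classes), and k-means is asked to find groupings, with within-cluster and between-cluster sums of squares quantifying within-group homogeneity and between-group heterogeneity.

```r
# Drop the labels so the data are truly "unlabeled"; standardize the features
unlabeled <- scale(iris[, 1:4])

set.seed(1234)  # k-means starts from random centers, so fix the seed

# Ask the algorithm for 3 groups; no outcome is being predicted
fit <- kmeans(unlabeled, centers = 3, nstart = 25)

# Within-group homogeneity: total within-cluster sum of squares (smaller = tighter)
fit$tot.withinss

# Between-group heterogeneity: between-cluster sum of squares (larger = more separated)
fit$betweenss

# The emergent groupings, which the researcher then interprets
table(fit$cluster)
```

Note that the choice of three centers is itself an assumption the researcher must interrogate, which is exactly the kind of iterative evaluation discussed in this Element.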
The focus of this Element, then, is on clustering, which is one of the most common forms of unsupervised machine learning. Clustering algorithms vary widely and are exceedingly valuable for making sense of often large, unwieldy, unlabeled, and unstructured data by detecting and mapping degrees of similarity between objects in some feature space. Though a key device in simplifying the complexity of data to reveal underlying structure, selecting and fitting clustering algorithms can be complicated for a couple of reasons. First, in clustering there is typically no single "right" or "one-size-fits-all" algorithm for a question or problem. The selection of an algorithm depends on a variety of factors such as the size and structure of the data, the goals of the researcher, the level of domain expertise, the transformation of the data, how observations are treated, and so on. As such, clustering is often a process of selecting, fitting, evaluating, repeating, and then comparing across different specifications of a single algorithm or between multiple algorithms using the same data. This process will become much clearer as the Element progresses.

Next, and relatedly, consider performance evaluation. In regression, for example, researchers are often interested in how well the model fit the data and whether it performed as expected (e.g., did learner X sufficiently minimize the prediction error rate compared to learner Y?). But recall, in unsupervised learning there are no parameters to be estimated, nor are there clear expectations for emergent patterns as there are in supervised learning. As a result, evaluation of unsupervised algorithmic performance is rarely straightforward. To go about this, which is unpacked at length throughout this Element, the researcher should always compare across several specifications and algorithms, as well as apply domain expertise related to the problem and data in question. Such a holistic approach to data analysis should be characteristic of all research programs, whether supervised or unsupervised.
For example, suppose a researcher is interested in exploring and learning about American voting behavior. The researcher, armed with a large, unlabeled dataset of election returns, may begin with hierarchical clustering to see whether groupings exist among voters in the country. Visual output like a dendrogram (which is discussed later at greater length) may reveal that two broad camps of voters tend to be clustered together. As a starting place, the researcher may suspect these two clusters of voters represent the two major choices of American national elections: Republicans and Democrats. Regardless of the quality of the assumption, the researcher may still be unsure of precisely how and why these clusters among American voters exist. The researcher may progress to specifying a more advanced algorithm requiring a little more information, such as the CLARA (clustering large applications) algorithm, assuming two groups exist based on the hierarchical dendrogram from the first stage. If the results corroborate similar patterns revealing two broad groups, then the researcher has a better sense that the data may indeed represent some consequential groupings among the voting population, which could be political parties. However, if there are less clearly defined clusters from the CLARA iteration when two clusters were assumed (based on the first hierarchical clustering stage), then the researcher may want to update the algorithm to hunt for three or four clusters instead of two. In addition to visual corroboration, the researcher can then leverage common methods of internal validation such as the Dunn index, average silhouette width, or connectivity (all of which are discussed more later) across the different clustering algorithms to understand whether these algorithms and iterations are pointing to similar groupings in the American voting population.

Note that this example began with no clear expectations of patterns that should emerge against which a researcher could compare some estimated output to some expected output (as would be the case in supervised learning).
Rather, the researcher simply fed unlabeled voting data to several algorithms, and observed (and then compared) the emergent patterns. The rinsing and repeating associated with unsupervised learning is central to understanding precisely how unsupervised learning can be effectively leveraged to learn about and explore data in a principled manner. Then, once the researcher has learned the data structure and patterns, the remainder of the research program can be adjusted to develop testable expectations, estimate relationships, and generate inferences. In approaching clustering, there should be heavy emphasis on the learning part. The researcher begins with some base-level interest in the topic and is armed with a rich, yet unlabeled dataset. Prior to making specific predictions, given the lack of clearly defined expectations of how the data are structured, the researcher could specify and compare a number of different algorithms to "let the data speak" freely. Note that these goals of exploring dimensionality and reducing spatial complexity are also present in other methods more commonly employed in the social sciences, such as latent variable modeling, multidimensional scaling, and correspondence analysis. Though similar in goals, however, as will be discussed throughout this Element, unsupervised machine learning approaches problems of complexity and dimension reduction from a fundamentally different place. For example, often no effort is made during the modeling process to infer any meaning or definition of the emergent clusters; rather, the goal is most often pure discovery. Inference should happen at a different stage. This and other distinctions are frequently revisited throughout the Element.

In order to effectively introduce unfamiliar readers to unsupervised machine learning, it makes most sense to start with the most commonly leveraged form of unsupervised learning: clustering.
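The multistage workflow in the voting example above can be sketched in R. The snippet below is a hypothetical illustration only: the built-in USArrests data stand in for a large, unlabeled dataset of election returns; hclust() from base R provides the hierarchical stage, and clara() and silhouette() from the cluster package provide the partitioning and internal-validation stages.

```r
library(cluster)  # provides clara() and silhouette()

# Unlabeled data, standardized so no feature dominates the distance calculations
x <- scale(USArrests)
d <- dist(x)  # Euclidean distances between observations

# Stage 1: agglomerative hierarchical clustering; the dendrogram's broad
# branches suggest a candidate number of clusters
hc <- hclust(d, method = "ward.D2")
plot(hc)

# Stage 2: corroborate with a partitioning algorithm (CLARA), assuming k = 2
# based on the dendrogram from the first stage
cl2 <- clara(x, k = 2)

# Stage 3: internal validation -- compare average silhouette widths across
# candidate cluster counts; if k = 2 is not clearly best, update and repeat
for (k in 2:4) {
  fit <- clara(x, k = k)
  sil <- silhouette(fit$clustering, d)
  cat("k =", k, "average silhouette width:", mean(sil[, "sil_width"]), "\n")
}
```

If the k = 2 solution yields the widest average silhouette, the two-branch reading of the dendrogram gains support; otherwise, the researcher updates the assumed number of clusters and repeats, exactly as described above.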
Within clustering, there are many approaches, as well as even more algorithms within these approaches, with many more being developed each year to deal with new and unique problems and patterns in data (e.g., mean shift clustering). As no single work could ever effectively cover all clustering approaches and algorithms, let alone the entirety of unsupervised learning, I begin with and detail the three most widely used (and taught) clustering algorithms: first, agglomerative hierarchical clustering; second, k-means clustering ("hard" partitioning); and third, Gaussian mixture models (model-based "soft" partitioning). And for readers interested in going beyond these three approaches, I conclude with a section detailing more complex and recent advances in clustering, though at a higher level: fuzzy C-means clustering; density-based clustering (the DBSCAN algorithm); and partitioning around medoids clustering (i.e., "k-medoids" via the PAM algorithm).1

1 Readers should note that the approach to clustering presented in this Element is situated within an unsupervised learning framework, where we are interested in exploring and partitioning unlabeled data. This is distinct from model-based clustering, which estimates parameters and forms clusters in a probabilistic fashion. Though a model-based technique (Gaussian mixture models) is covered later in this Element for introductory purposes, readers interested in a more technical treatment of model-based clustering should consider the recent book by Bouveyron et al. (2019).