GRAPHICAL MODELS WITH STRUCTURED FACTORS, NEURAL FACTORS, AND APPROXIMATION-AWARE TRAINING

by
Matthew R. Gormley

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, MD
October 2015

© 2015 Matthew R. Gormley
All Rights Reserved

Abstract

This thesis broadens the space of rich yet practical models for structured prediction. We introduce a general framework for modeling with four ingredients: (1) latent variables, (2) structural constraints, (3) learned (neural) feature representations of the inputs, and (4) training that takes the approximations made during inference into account. The thesis builds up to this framework through an empirical study of three NLP tasks: semantic role labeling, relation extraction, and dependency parsing—obtaining state-of-the-art results on the former two. We apply the resulting graphical models with structured and neural factors, and approximation-aware learning, to jointly model part-of-speech tags, a syntactic dependency parse, and semantic roles in a low-resource setting where the syntax is unobserved. We present an alternative view of these models as neural networks with a topology inspired by inference on graphical models that encode our intuitions about the data.

Keywords: Machine learning, natural language processing, structured prediction, graphical models, approximate inference, semantic role labeling, relation extraction, dependency parsing.

Thesis Committee: († advisors)
Jason Eisner† (Professor, Computer Science, Johns Hopkins University)
Mark Dredze† (Assistant Research Professor, Computer Science, Johns Hopkins University)
Benjamin Van Durme (Assistant Research Professor, Computer Science, Johns Hopkins University)
Slav Petrov (Staff Research Scientist, Google)

Acknowledgements

First, I thank my co-advisors, Jason Eisner and Mark Dredze, who always managed to align their advice exactly when it mattered and challenge me through their opposition on everything else. Jason exemplified for me how to think like a visionary and to broaden my sights as a researcher, even as we continually delved deeper into our work. Mark taught me to take a step back from my research and ambitions. He showed me how to be an empiricist, a pragmatist, and a scientist. Together, Mark and Jason demonstrated all the best qualities of advisors, teachers, and mentors. I hope that some of it rubbed off on me.

Thanks to my committee members, Benjamin Van Durme and Slav Petrov, alongside Jason Eisner and Mark Dredze. Ben frequently took on the role of both publisher and critic for my work: he advertised the real-world applications of my research and challenged me to consider the linguistic underpinnings. At every step along the way, Slav's forward-looking questions were predictive of the details that would require the most attention.

Many faculty at Johns Hopkins impacted me through teaching, conversations, and mentoring. In particular, I would like to thank those who made my experience at the Center for Language and Speech Processing (CLSP) and the Human Language Technology Center of Excellence (HLTCOE) so rich: Sanjeev Khudanpur, Matt Post, Adam Lopez, Jim Mayfield, Mary Harper, David Yarowsky, Chris Callison-Burch, and Suchi Saria. Working with other researchers taught me a lot: thanks to Spence Green, Dan Bikel, Jakob Uszkoreit, Ashish Venugopal, and Percy Liang. Email exchanges with Jason Naradowsky, André Martins, Alexander Rush, and Valentin Spitkovsky were key for replicating prior work. Thanks to David Smith, Zhifei Li, and Veselin Stoyanov, who did work that was so complementary that we couldn't resist putting it all together.
The staff at the CLSP, the HLTCOE, and the CS Department made everything a breeze, from high performance computing to finding a classroom at the last minute—special thanks to Max Thomas and Craig Harman because good code drives research.

My fellow students and postdocs made this thesis possible. My collaborations with Mo Yu and Meg Mitchell deserve particular note. Mo taught me how to use every trick in the book, and then invent three more. Meg put up with and encouraged my incessant over-engineering that eventually led to Pacaya. To my lab mates, I can't say thank you enough: Nick Andrews, Tim Vieira, Frank Ferraro, Travis Wolfe, Jason Smith, Adam Teichart, Dingquan Wang, Veselin Stoyanov, Sharon Li, Justin Snyder, Rebecca Knowles, Nathanial Wes Filardo, Michael Paul, Nanyun Peng, Markus Dreyer, Carolina Parada, Ann Irvine, Courtney Napoles, Darcey Riley, Ryan Cotterell, Tongfei Chen, Xuchen Yao, Pushpendre Rastogi, Brian Kjersten, and Ehsan Variani.

I am indebted to my friends in Baltimore. Thanks to: Andrew for listening to my research ramblings over lunch; Alan and Nick for much needed sports for the sake of rest; the Bettles, the Kuks, and the Lofti for mealshare and more; everyone who babysat; Merv and the New Song Men's Bible Study for giving me perspective.

Thanks to nutat and nunan for teaching me to seek first Jesus in all that I do. Thanks to my anäbixel, baluk, and ch'utin mial for introducing me to K'iche' and giving me a second home in Guatemala. Esther, thanks—you were all the motivation I needed to finish.

To my wife, Candice Gormley: you deserve the most thanks of all. You got us through every success and failure of my Ph.D. Most importantly, you showed me how to act justly, love mercy, and walk humbly with our God.

Contents

Abstract  ii
Acknowledgements  iii
Contents  viii
List of Tables  x
List of Figures  xi

1 Introduction  1
   1.1 Motivation and Prior Work  1
      1.1.1 Why do we want to build rich (joint) models?  2
      1.1.2 Inference with Structural Constraints  3
      1.1.3 Learning under approximations  4
      1.1.4 What about Neural Networks?  4
   1.2 Proposed Solution  5
   1.3 Contributions and Thesis Statement  6
   1.4 Organization of This Dissertation  7
   1.5 Preface and Other Publications  8

2 Background  10
   2.1 Preliminaries  10
      2.1.1 A Simple Recipe for Machine Learning  10
   2.2 Neural Networks and Backpropagation  11
      2.2.1 Topologies  12
      2.2.2 Backpropagation  13
      2.2.3 Numerical Differentiation  15
   2.3 Graphical Models  15
      2.3.1 Factor Graphs  15
      2.3.2 Minimum Bayes Risk Decoding  17
      2.3.3 Approximate Inference  17
         2.3.3.1 Belief Propagation  18
         2.3.3.2 Loopy Belief Propagation  19
         2.3.3.3 Bethe Free Energy  20
         2.3.3.4 Structured Belief Propagation  20
      2.3.4 Training Objectives  22
         2.3.4.1 Conditional Log-likelihood  23
         2.3.4.2 CLL with Latent Variables  23
         2.3.4.3 Empirical Risk Minimization  24
         2.3.4.4 Empirical Risk Minimization Under Approximations  25
   2.4 Continuous Optimization  25
      2.4.1 Online Learning and Regularized Regret  26
      2.4.2 Online Learning Algorithms  28
         2.4.2.1 Stochastic Gradient Descent  28
         2.4.2.2 Mirror Descent  28
         2.4.2.3 Composite Objective Mirror Descent  28
         2.4.2.4 AdaGrad  29
         2.4.2.5 Parallelization over Mini-batches  30

3 Latent Variables and Structured Factors  31
   3.1 Introduction  32
   3.2 Approaches  33
      3.2.1 Pipeline Model with Unsupervised Syntax  33
         3.2.1.1 Brown Clusters  34
         3.2.1.2 Unsupervised Grammar Induction  34
         3.2.1.3 Semantic Dependency Model  35
      3.2.2 Pipeline Model with Distantly-Supervised Syntax  35
         3.2.2.1 Constrained Grammar Induction  36
      3.2.3 Joint Syntactic and Semantic Parsing Model  38
      3.2.4 Pipeline Model with Supervised Syntax (Skyline)  39
      3.2.5 Features for CRF Models  39
         3.2.5.1 Template Creation from Properties of Ordered Positions  41
         3.2.5.2 Additional Features  43
      3.2.6 Feature Selection  43
   3.3 Related Work  44
   3.4 Experimental Setup  46
      3.4.1 Data  46
      3.4.2 Feature Template Sets  47
   3.5 Results  47
      3.5.1 CoNLL-2009: High-resource SRL  48
      3.5.2 CoNLL-2009: Low-Resource SRL  49
      3.5.3 CoNLL-2008, -2005 without a Treebank  50
      3.5.4 Analysis of Grammar Induction  53
   3.6 Summary  54

4 Neural and Log-linear Factors  56
   4.1 Introduction  56
   4.2 Relation Extraction  58
   4.3 Background: Compositional Embedding Model  60
      4.3.1 Combining Features with Embeddings  60
      4.3.2 The Log-Bilinear Model  61
      4.3.3 Discussion of the Compositional Model  62
   4.4 A Log-linear Model  62
   4.5 Hybrid Model  63
   4.6 Main Experiments  65
      4.6.1 Experimental Settings  65
      4.6.2 Results  67
   4.7 Additional ACE 2005 Experiments  71
      4.7.1 Experimental Settings  71
      4.7.2 Results  71
   4.8 Related Work  73
   4.9 Summary  74

5 Approximation-aware Learning for Structured Belief Propagation  75
   5.1 Introduction  76
   5.2 Dependency Parsing by Belief Propagation  77
   5.3 Approximation-aware Learning  80
   5.4 Differentiable Objective Functions  82
      5.4.1 Annealed Risk  82
      5.4.2 L2 Distance  83
      5.4.3 Layer-wise Training  84
      5.4.4 Bethe Likelihood  84
   5.5 Gradients by Backpropagation  84
      5.5.1 Backpropagation of Decoder/Loss  84
      5.5.2 Backpropagation through Structured BP  85
      5.5.3 BP and Backpropagation with PTREE  85
      5.5.4 Backprop of Hypergraph Inside-Outside  87
   5.6 Other Learning Settings  88
   5.7 Experiments  89
      5.7.1 Setup  89
      5.7.2 Results  91
   5.8 Discussion  93
   5.9 Summary  95

6 Graphical Models with Structured and Neural Factors and Approximation-aware Learning  96
   6.1 Introduction  96
   6.2 Model  98
   6.3 Inference  99
   6.4 Decoding  101
   6.5 Learning  101
      6.5.1 Approximation-Unaware Training  101
      6.5.2 Approximation-Aware Training  102
   6.6 Experiments  102
      6.6.1 Experimental Setup  103
      6.6.2 Results  104
      6.6.3 Error Analysis  107
   6.7 Summary  109

7 Conclusions  112
   7.1 Summary of the Thesis  112
   7.2 Future Work  113
      7.2.1 Other Structured Factors and Applications  113
      7.2.2 Pruning-aware Learning  113
      7.2.3 Hyperparameters: Optimizing or Avoiding  114
      7.2.4 Multi-task Learning for Domain Adaptation  114

A Pacaya: A General Toolkit for Graphical Models, Hypergraphs, and Neural Networks  116
   A.1 Code Layout  116
   A.2 Feature Sets from Prior Work  117
   A.3 Design  118
      A.3.1 Differences from Existing Libraries  118
      A.3.2 Numerical Stability and Efficient Semirings in Java  119
      A.3.3 Comments on Engineering the System  119
         A.3.3.1 Experiment 1: Inside-Outside Algorithm  120
         A.3.3.2 Experiment 2: Parallel Belief Propagation  121

B Bethe Likelihood  123

Bibliography  139

Vita  140

List of Tables

2.1 Brief Summary of Notation  10
3.1 Feature templates for semantic role labeling  40
3.2 Feature templates selected by information gain for SRL.  45
3.3 Test F1 of supervised SRL and sense disambiguation with gold (oracle) syntax, averaged over the CoNLL-2009 languages. See Table 3.6(a) for per-language results.  48
3.4 Test F1 of supervised SRL and sense disambiguation with supervised syntax, averaged over CoNLL-2009 languages. See Table 3.6(b) for per-language results.  49
3.5 Test F1 of supervised SRL and sense disambiguation with no supervision for syntax, averaged over CoNLL-2009 languages. See Table 3.6(c) for per-language results.  50
3.6 Performance of joint and pipelined models for semantic role labeling in high-resource and low-resource settings on CoNLL-2009.  51
3.7 Performance of semantic role labelers with decreasing annotated resources.  52
3.8 Performance of semantic role labelers in matched and mismatched train/test settings on CoNLL 2005/2008.  53
3.9 Performance of grammar induction on CoNLL-2009.  54
3.10 Performance of grammar induction on the Penn Treebank.  54
4.1 Example relations from ACE 2005.  57
4.2 Feature sets used in FCM.  65
4.3 Named entity tags for ACE 2005 and SemEval 2010.  68
4.4 Performance of relation extractors on ACE 2005 out-of-domain test sets.  69
4.5 Performance of relation extractors on SemEval 2010 Task 8.  70
4.6 Performance of relation extractors on ACE 2005 out-of-domain test sets for the low-resource setting.  72
5.1 Belief propagation unrolled through time.  86
5.2 Impact of exact vs. approximate inference on a dependency parser.  93
5.3 Full performance results of dependency parser on 19 languages from CoNLL-2006/2007.  94
6.1 Additive experiment for five languages from CoNLL-2009.  105
6.2 Performance of approximation-aware learning on semantic role labeling.  107
6.3 SRL performance on four models for error analysis.  108
6.4 Performance of SRL across role labels.  111
7.1 Example sentences from newswire and Twitter domains.  114
A.1 Speed comparison of inside algorithm implementations  121
A.2 Speed comparison of BP implementations  121