Empirical Gaussian priors for cross-lingual transfer learning

Anders Søgaard
Center for Language Technology, University of Copenhagen
Njalsgade 140, DK-2300 Copenhagen
[email protected]

Abstract

Sequence model learning algorithms typically maximize log-likelihood minus the norm of the model (or minimize Hamming loss + norm). In cross-lingual part-of-speech (POS) tagging, our target language training data consists of sequences of sentences with word-by-word labels projected from translations in k languages for which we have labeled data, via word alignments. Our training data is therefore very noisy, and if Rademacher complexity is high, learning algorithms are prone to overfit. Norm-based regularization assumes a constant-width, zero-mean prior. We instead propose to use the k source language models to estimate the parameters of a Gaussian prior for learning new POS taggers. This leads to significantly better performance in multi-source transfer set-ups. We also present a drop-out version that injects (empirical) Gaussian noise during online learning. Finally, we note that using empirical Gaussian priors leads to much lower Rademacher complexity, and is superior to optimally weighted model interpolation.

1 Cross-lingual transfer learning of sequence models

The people of the world speak about 6,900 different languages. Open-source off-the-shelf natural language processing (NLP) toolboxes like OpenNLP (https://opennlp.apache.org/) and CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) cover only 6-7 languages, and we have sufficient labeled training data for inducing models for about 20-30 languages. In other words, supervised sequence learning algorithms are not sufficient to induce POS models for but a small minority of the world's languages.

What can we do for all the languages for which no training data is available? Unsupervised POS induction algorithms have methodological problems (in-sample evaluation, community-wide hyper-parameter tuning, etc.), and their performance is prohibitive for downstream applications. Some work on unsupervised POS tagging has assumed other resources such as tag dictionaries [Li et al., 2012], but such resources are also only available for a limited number of languages. In our experiments, we assume that no training data or tag dictionaries are available. Our only assumption is a bit of text translated into multiple languages, specifically, fragments of the Bible. We will use Bible data for annotation projection, as well as for learning cross-lingual word embeddings (§3).

Unsupervised learning with typologically informed priors [Naseem et al., 2010] is an interesting approach to unsupervised POS induction that is more applicable to low-resource languages. Our work is related to this work, but we learn informed priors rather than stipulate them, and we combine these priors with annotation projection (learning from noisy labels) rather than unsupervised learning.

Annotation projection refers to transferring annotation from one or more source languages to the target language (for which no labeled data is otherwise available), typically through word alignments. In our experiments below, we use an unsupervised word alignment algorithm to align 15×12 language pairs. For 15 languages, we have predicted POS tags for each word in our multi-parallel corpus. For each word in one of our 12 target language training datasets, we thus have up to 15 votes for each word token, possibly weighted by the confidence of the word alignment algorithm. In this paper, we simply use the majority votes.
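As an illustration (not part of the original implementation), the majority-vote projection step could be sketched as follows; the per-language alignment and tag data structures are our own assumptions:

```python
from collections import Counter

def project_majority_tags(alignments, source_tags):
    """Project POS tags onto a target sentence by majority vote.

    alignments  : list of lists; alignments[k][t] is the source token position
                  aligned to target position t for source language k (None = unaligned).
    source_tags : list of lists; source_tags[k][j] is the predicted POS tag of
                  source token j in source language k.
    Returns one majority-vote tag (or None) per target token.
    """
    n_target_tokens = len(alignments[0])
    projected = []
    for t in range(n_target_tokens):
        votes = Counter()
        for k, alignment in enumerate(alignments):
            s = alignment[t]
            if s is not None:
                votes[source_tags[k][s]] += 1  # unweighted vote, as in the paper
        projected.append(votes.most_common(1)[0][0] if votes else None)
    return projected
```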
This is the set-up assumed throughout this paper (see §3 for more details):

Low-resource cross-lingual POS tagging  We have at our disposal k (= 15) source language models and a multi-parallel corpus (the Bible) that we can use to project annotation from the k source languages to new target languages for which no labeled data is available. If we use k > 1 source languages, we refer to this as multi-source cross-lingual transfer; if we only use a single source language, we refer to this as single-source cross-lingual transfer. In this paper, we only consider multi-source cross-lingual transfer learning.

Since the training data sets for our target languages (the annotation projections) are very noisy, the risk of over-fitting is extremely high. We are therefore interested in learning algorithms that efficiently limit the Rademacher complexity of the learning problem, i.e., the chance of fitting to random noise. In other words, we want a model with higher integrated bias and lower integrated variance [Geman et al., 1992]. Our approach, using empirical Gaussian priors, is introduced in §2, including a drop-out version of the regularizer. §3 describes our experiments. In §4, we provide some observations, namely that using empirical Gaussian priors reduces (i) Rademacher complexity and (ii) integrated variance, and (iii) that using empirical Gaussian priors is superior to optimally weighted model interpolation.

2 Empirical Gaussian priors

We will apply empirical Gaussian priors to linear-chain conditional random fields (CRFs; Lafferty et al. [2001]) and averaged structured perceptrons [Collins, 2002]. Linear-chain CRFs are trained by maximising the conditional log-likelihood of labeled sequences $LL(\mathbf{w}, \mathcal{D}) = \sum_{\langle x, y \rangle \in \mathcal{D}} \log P(y \mid x)$ with $\mathbf{w} \in \mathbb{R}^m$ and $\mathcal{D}$ a dataset consisting of sequences of discrete input symbols $x = x_1, \ldots, x_n$ associated with sequences of discrete labels $y = y_1, \ldots, y_n$. $L_k$-regularized CRFs maximize $LL(\mathbf{w}, \mathcal{D}) - |\mathbf{w}|_k$ with typically $k \in \{0, 1, 2, \infty\}$, which all introduce constant-width, zero-mean regularizers. We refer to $L_2$-regularized CRFs as L2-CRF. $L_k$ regularizers are parametric priors where the only parameter is the width of the bounding shape. The $L_2$-regularizer is a Gaussian prior with zero mean, for example. The regularised log-likelihood with a Gaussian prior is $LL(\mathbf{w}, \mathcal{D}) - \frac{1}{2}\sum_j^m \left(\frac{\lambda_j - \mu_j}{\sigma_j}\right)^2$. For practical reasons, the hyper-parameters $\mu_j$ and $\sigma_j$ are typically assumed to be constant for all values of $j$. This also holds for recent work on parametric noise injection, e.g., Søgaard [2013]. If these parameters are assumed to be constant, the above objective becomes equivalent to L2-regularization. However, one can also try to learn these parameters. In empirical Bayes [Casella, 1985], the parameters are learned from $\mathcal{D}$ itself. Smith and Osborne [2005] suggest learning the parameters from a validation set. In our set-up, we do not assume that we can learn the priors from training data (which is noisy) or validation data (which is generally not available in cross-lingual learning scenarios). Instead, we estimate these parameters directly from source language models.

When we estimate Gaussian priors from source language models, we learn which features are invariant across languages and which are not. We thereby introduce an ellipsoid regularizer whose centre is the average source model. In our experiments, we consider both the case where variance is assumed to be constant, which we call L2-regularization with priors (L2-PRIOR), and the case where both variances and means are learned, which we call empirical Gaussian priors (EMPGAUSS).
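A minimal sketch of how the prior parameters could be estimated from the k source weight vectors and plugged into the penalty term above. NumPy, the function names, and the zero-variance fallback constant are assumptions made for illustration, not the paper's implementation:

```python
import numpy as np

def estimate_gaussian_prior(source_models, fallback_var=1.0):
    """Estimate per-parameter means and variances from k source weight vectors.

    source_models : array of shape (k, m), one weight vector per source language.
    Returns (mu_hat, var_hat), each of shape (m,). Parameters with zero observed
    variance fall back to a shared constant, mirroring the single-source case.
    """
    source_models = np.asarray(source_models, dtype=float)
    mu_hat = source_models.mean(axis=0)
    var_hat = source_models.var(axis=0)
    var_hat[var_hat == 0.0] = fallback_var
    return mu_hat, var_hat

def empirical_gaussian_penalty(weights, mu_hat, var_hat):
    """Penalty -0.5 * sum_j ((lambda_j - mu_j)^2 / sigma_j^2) added to the log-likelihood."""
    return -0.5 * np.sum((weights - mu_hat) ** 2 / var_hat)
```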
L2-PRIOR is the L2-CRF objective with $\sigma_j^2 = C$, with $C$ a regularization parameter, and $\mu_j = \hat{\mu}_j$ the average value of the corresponding parameter in the observed source models. EMPGAUSS replaces the above objective with $LL(\lambda) + \sum_j \log \frac{1}{\sigma_j \sqrt{2\pi}} e^{-\frac{(\lambda_j - \mu_j)^2}{2\sigma_j^2}}$, which, assuming model parameters are mutually independent, is the same as jointly optimising model probability and likelihood of the data. Note that minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior, and in the same way, this is equivalent to minimising the above objective with empirically estimated parameters $\hat{\mu}_j$ and $\hat{\sigma}_j$. In other words, empirical Gaussian priors are bounding ellipsoids on the hypothesis space with learned widths and centres. Also, note that in single-source cross-lingual transfer learning, the observed variance is zero, and we therefore replace it with a regularization parameter $C$ shared with the baseline. In the single-source set-up, L2-PRIOR is thus equivalent to EMPGAUSS. We use L-BFGS to maximize our baseline L2-regularized objectives, as well as our empirical Gaussian prior objectives.

Practical observations  (i) Using empirical Gaussian priors does not assume identical feature representations in the source and target models. Model parameters whose features were unseen in the source languages can naturally be assigned Gaussians with parameters $\langle \mu = 0, \sigma = \sigma_{av} \rangle$, where $\sigma_{av}$ is the average variance in the estimated Gaussians. In our experiments, we rely on simple feature representations that are identical for all languages. (ii) Also, consider the obvious extension of using empirical Gaussian priors in the multi-source set-up, where we regularize the target to stay in one of several bounding ellipsoids rather than the one given by the full set of source models. These ellipsoids could come from typologically different groups of source languages or from individual source languages (and then have constant width). While this is technically a non-convex regularizer, we can simply run one model per source group and choose the one with the best fit to the data. We do not explore this direction further in this paper.

2.1 Empirical Gaussian noise injection

We also introduce a drop-out variant of empirical Gaussian priors. Our point of departure is the averaged structured perceptron. We implement empirical Gaussian noise injection with Gaussians $\langle (\mu_1, \sigma_1), \ldots, (\mu_m, \sigma_m) \rangle$ for $m$ features as follows. We initialise our model parameters with the means $\mu_j$. For every instance we pass over, we draw a corruption vector $g$ of random values $v_i$ from the corresponding Gaussians $(1, \sigma_i)$. We inject the noise in $g$ by taking pairwise multiplications of $g$ and our feature representations of the input sequence with the relevant label sequences. Note that this drop-out algorithm is parameter-free, but of course we could easily add a hyper-parameter controlling the degree of regularization. We give the algorithm in Algorithm 1.

Algorithm 1  Averaged structured perceptron with empirical Gaussian noise
1: $T = \{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$ with $x_i = \langle v_1, \ldots \rangle$ and $v_k = \langle f_1, \ldots, f_m \rangle$, $w_0 = \langle w_1 : \hat{\mu}_1, \ldots, w_m : \hat{\mu}_m \rangle$
2: for $i \leq I \times |T|$ do
3:   for $j \leq n$ do
4:     $g \leftarrow \mathrm{sample}(\mathcal{N}(1, \sigma_1), \ldots, \mathcal{N}(1, \sigma_m))$
5:     $\hat{y} \leftarrow \arg\max_y w_i \cdot g$
6:     $w_{i+1} \leftarrow w_i + \Phi(x_j, y_j) \cdot g - \Phi(x_j, \hat{y}) \cdot g$
7:   end for
8: end for
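Below is a minimal Python sketch of Algorithm 1. It assumes exhaustive scoring of a list of candidate label sequences in place of the Viterbi decoding one would use in practice; the function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def averaged_perceptron_with_noise(data, mu_hat, sigma_hat, feature_fn, candidates,
                                   epochs=5, seed=0):
    """Sketch of Algorithm 1: averaged structured perceptron with Gaussian noise.

    data       : list of (x, y) pairs, where y is the gold label sequence (a tuple).
    mu_hat     : (m,) empirical means; the weights are initialised here.
    sigma_hat  : (m,) empirical standard deviations used to sample the noise.
    feature_fn : feature_fn(x, y) -> (m,) joint feature vector Phi(x, y).
    candidates : list of candidate label sequences (stand-in for Viterbi search).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(mu_hat, dtype=float).copy()   # start at the average source model
    w_sum = np.zeros_like(w)
    steps = 0
    for _ in range(epochs):
        for x, y in data:
            g = rng.normal(1.0, sigma_hat)       # corruption vector g ~ N(1, sigma_j)
            # Decode with element-wise noisy weights (line 5 of Algorithm 1).
            y_hat = max(candidates, key=lambda c: np.dot(w * g, feature_fn(x, c)))
            # Noisy additive update (line 6 of Algorithm 1); zero when y_hat == y.
            w = w + (feature_fn(x, y) - feature_fn(x, y_hat)) * g
            w_sum += w
            steps += 1
    return w_sum / steps                          # averaged weights
```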
3 Cross-lingual POS Experiments

Data  In our multi-source cross-lingual transfer learning set-up, we rely on 15 source language models to estimate our priors. We use a subset of the data in [Agic et al., 2015].

Annotation projection  We learn IBM-2 word alignment models from the Bible using EM to project annotation from the 15 source languages to our 10 target languages. We assign each word in each target language the majority-vote tag after projecting from all source languages.

Features  We use a simple feature template considering only orthographic features and cross-lingual word embeddings. The orthographic features include whether the current word contains capital letters, hyphens, or numbers. The embeddings are 40-dimensional distributional vectors capturing information about the distribution of words in a multi-parallel corpus. We learned these embeddings using an improvement over the technique suggested in Søgaard et al. [2015]. Søgaard et al. [2015] suggest a remarkably simple approach to learning distributional representations of words that transfer across languages. In parallel document collections or parallel corpora, we can represent the meaning of words by a vector encoding in which documents or sentences each word occurs. This is known as inverted indexing in database theory. We encode the meaning of words by binary vectors encoding their presence in biblical verses and then apply SVD to reduce these vectors to 40 dimensions. The dimensionality was chosen for comparability with other publicly available bilingual embeddings. While this approach assumes fewer available resources, published results suggest that such representations are superior to previous work [Søgaard et al., 2015]. We improve on this approach by shifting and row normalisation. We tuned the parameters on Danish development data.

Baselines and systems  Our first baseline is an L2-regularized CRF learned using L-BFGS. Our batch CRF systems are L2-PRIOR and EMPGAUSS. Our second baseline is an online averaged structured perceptron with L2 weight decay, learned using additive updates. We augment the averaged structured perceptron with empirical Gaussian noise injection (Algorithm 1), leading to EMPGAUSSNOISE.

Parameters  In L2-PRIOR, we set the variance to be the same for all parameters, namely equivalent to the regularization parameter in our L2-regularized baseline. The parameter was optimized on Danish development data.

Results  We report the results of several systems: our CRF models (L2-CRF, L2-PRIOR, MULTI-L2-PRIOR and EMPGAUSS) as well as our online models (L2-PERC and EMPGAUSSNOISE). We present the macro-average performances across our 10 target languages below. We compute significance using Wilcoxon over datasets following Demsar [2006] and mark p < 0.01 by **.

  L2-CRF    L2-PRIOR    EMPGAUSS    L2-PERC    EMPGAUSSNOISE
  76.1      80.31**     81.02**     75.04      80.54**

4 Observations

We make the following additional observations: (i) Following the procedure in Zhu et al. [2009], we can compute the Rademacher complexity of our models, i.e., their ability to learn noise in the labels (to overfit). Sampling POS tags randomly from a uniform distribution, chance complexity is 0.083. With small sample sizes, L2-CRFs actually begin to learn patterns, with Rademacher complexity rising to 0.086, whereas both L2-PRIOR and EMPGAUSS never learn a better fit than chance. (ii) Geman et al. [1992] present a simple approach to explicitly studying bias-variance trade-offs during learning. They draw subsamples of $l < m$ training data points $D_1, \ldots, D_k$ and use a validation dataset of $m'$ data points to define the integrated variance of our methods. Again, we see that using empirical Gaussian priors leads to less integrated variance. (iii) An empirical Gaussian prior effectively limits us to hypotheses in $\mathcal{H}$ in an ellipsoid around the average source model. When inference is exact, and our loss function is convex, we learn the model with the smallest loss on the training data within this ellipsoid. Model interpolation of (some weighting of) the average source model and the unregularized target model can potentially result in the same model, but since model interpolation is limited to the hyperplane connecting the two models, the probability of this happening is infinitely small ($1/\infty$). Since for any effective regularization parameter value (such that the regularized model is different from the unregularized model) the empirical Gaussian prior can be expected to have the same Rademacher complexity as model interpolation, we conclude that using empirical Gaussian priors is superior to model interpolation (and to data concatenation).
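To make observation (i) concrete, here is a minimal sketch of the label-randomization check, using a plain token-level classifier as a stand-in for the CRF; the data shapes, the use of scikit-learn, and the trial count are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_to_random_labels(features, n_tags=12, n_trials=10, seed=0):
    """Rough proxy for Rademacher complexity: how well a model fits random POS tags.

    features : array of shape (n_tokens, n_features).
    Returns the mean training accuracy on uniformly random labels minus chance
    (1 / n_tags); values near zero mean the learner does not fit noise.
    """
    rng = np.random.default_rng(seed)
    chance = 1.0 / n_tags
    gains = []
    for _ in range(n_trials):
        y = rng.integers(0, n_tags, size=len(features))   # random tags, uniform
        clf = LogisticRegression(max_iter=200).fit(features, y)
        gains.append(clf.score(features, y) - chance)     # fit above chance
    return float(np.mean(gains))
```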
References

Zeljko Agic, Dirk Hovy, and Anders Søgaard. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In ACL, 2015.

George Casella. An introduction to empirical Bayes data analysis. American Statistician, 39:83-87, 1985.

Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.

Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.

Stuart Geman, Elie Bienenstock, and Rene Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Shen Li, João Graça, and Ben Taskar. Wiki-ly supervised part-of-speech tagging. In EMNLP, 2012.

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. Using universal linguistic knowledge to guide grammar induction. In EMNLP, 2010.

Andrew Smith and Miles Osborne. Regularisation techniques for conditional random fields: Parameterised versus parameter-free. In IJCNLP, 2005.

Anders Søgaard. Zipfian corruptions for robust POS tagging. In NAACL, 2013.

Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. Inverted indexing for cross-lingual NLP. In ACL, 2015.

Jerry Zhu, Timothy Rogers, and Bryan Gibson. Human Rademacher complexity. In NIPS, 2009.
