DOLDA - A REGULARIZED SUPERVISED TOPIC MODEL FOR HIGH-DIMENSIONAL MULTI-CLASS REGRESSION

MÅNS MAGNUSSON, LEIF JONSSON AND MATTIAS VILLANI

ABSTRACT. Generating user interpretable multi-class predictions in data rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle both many classes and many covariates. To handle many classes we use the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al., 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al., 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model's predictive accuracy on two datasets and demonstrate DOLDA's advantage in interpreting the generated predictions.

Keywords and phrases: Text Classification, Latent Dirichlet Allocation, Horseshoe Prior, Diagonal Orthant Probit Model, Interpretable models.

Magnusson: Division of Statistics and Machine Learning, Dept. of Computer and Information Science, SE-58183 Linköping, Sweden. E-mail: [email protected]. Jonsson: Ericsson AB and Dept. of Computer and Information Science, SE-16480 Stockholm, Sweden. E-mail: [email protected]. Villani: Division of Statistics and Machine Learning, Dept. of Computer and Information Science, SE-58183 Linköping, Sweden. E-mail: [email protected].

1. INTRODUCTION

During the last decades more and more textual data have become available, creating a growing need to statistically analyze large amounts of text. The hugely popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probability model where each document is summarized by a set of latent semantic themes, often called topics; formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is therefore a compressed latent representation of the documents. LDA is a mixed membership model where each document is a mixture of topics and each word (token) in a document belongs to a single topic. The basic LDA model is unsupervised, i.e. the topics are learned solely from the words in the documents, without access to document labels.
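As a concrete illustration of the generative view just described, the following minimal Python sketch draws a toy corpus from the basic LDA model. The dimensions, hyperparameter values and variable names are arbitrary illustrations, not settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy dimensions and symmetric Dirichlet hyperparameters (illustrative only).
V, K, D, N_d = 1000, 10, 100, 50
alpha, beta = 0.1, 0.01

# A topic phi_k is a probability distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)            # (K, V)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))           # topic mixture of document d
    z = rng.choice(K, size=N_d, p=theta_d)               # one topic indicator per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # each token drawn from its topic
    corpus.append((z, w))
```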
In many situations there is also other information we would like to incorporate when modeling a corpus of documents. A common example is when we have labeled documents, such as ratings of movies together with movie descriptions, illness categories in medical journals, or the location of the identified bug together with bug reports. In these situations, one can use a so-called supervised topic model to find the semantic structure in the documents that is related to the class of interest. One of the first approaches to supervised topic models was proposed by Mcauliffe and Blei (2008). The authors propose a supervised topic model based on the generalized linear model framework, thereby making it possible to link binary, count and continuous response variables to topics that are inferred jointly with the regression/classification effects. In this approach the semantic content of a text, in the form of topics, predicts the response variable y. This approach is often referred to as downstream supervised topic modeling, contrary to an upstream supervised approach where the label y governs how the topics are formed, see e.g. Ramage et al. (2009).

Many downstream supervised topic models have been studied, mainly in the machine learning literature. Mcauliffe and Blei (2008) focus on downstream supervision using generalized linear regression models. Jiang et al. (2012) propose a supervised topic model using a max-margin approach to classification, and Zhu et al. (2013) propose a logistic supervised topic model using data augmentation with Polya-Gamma variates. Perotte et al. (2011) use a hierarchical binary probit model to model a hierarchical label structure in the form of a binary tree.

Most of the proposed supervised topic models have been motivated by the search for good classification models, and the focus has naturally been on predictive performance. However, the predictive performance of most supervised topic models is only slightly better than that of a Support Vector Machine (SVM) with covariates based on word frequencies (Jameel et al., 2015). While predictive performance is certainly important, the real attraction of supervised topic models comes from their ability to learn semantically relevant topics and to use those topics to produce accurate interpretable predictions of documents or textual data. The interpretability of a model is an often neglected feature, but is crucial in real world applications. As an example, Parnin and Orso (2011) show that bug fault localization systems are quickly disregarded when the users cannot understand how the system has reached its predictive conclusion. Compared to other text classification systems, topic models are well suited for interpretable predictions since topics are abstract entities that humans can grasp. The problems of interpretability in multi-class supervised topic models can be divided into three main areas.

First, most supervised topic models use a logit or probit approach where the model is identified by the use of a reference category, to which the effect of any covariate is compared. This defeats one of the main purposes of supervised topic models since it complicates the interpretability of the model.

Second, to handle multi-class categorization a topic should be able to affect multiple classes, and some topics may not influence any class at all. In most supervised topic modeling approaches (such as Jiang et al. 2012; Zhu et al. 2013; Jameel et al. 2015) the multi-class problem is solved using binary classifiers in a "one-vs-all" classification approach. This approach works well in the situation of evenly distributed classes, but may not work well for skewed class distributions (Rubin et al., 2012). A one-vs-all approach also makes it more difficult to interpret the model. Estimating one model per class makes it impossible to see which classes are affected by the same topic and which topics do not predict any label. In these situations we would like to have one topic model to interpret. One-vs-all prediction is also costly from an estimation point of view since we need to estimate one model per class (Zheng et al., 2015).

Third, there can be situations with hundreds of classes and hundreds of topics (see Jonsson et al. (2016) for an example). Without regularization or variable selection we would end up with a model with too many parameters to interpret and very uncertain parameter estimates. In a good predictive supervised topic model one would like to find a small set of topics that are strong determinants of a single document class label. This is especially relevant when the number of observations in different classes is skewed, a problem common in real world situations (Rubin et al., 2012). In the rarer classes we would like to induce more shrinkage, while in classes with more data we would like less shrinkage.
Multi-class regression is a non-trivial problem in Bayesian modeling. Historically, the multinomial probit model has been preferred due to the data augmentation approach proposed by Albert and Chib (1993). Augmenting the sampler using latent variables leads to straightforward Gibbs sampling with conditionally-conjugate updates of the regression coefficients. The Albert-Chib sampler often tends to mix slowly, and the same holds for improved samplers such as the parameter expansion approach in Imai and van Dyk (2005). More recently, a similar data augmentation approach using Polya-Gamma variables was proposed for the Bayesian logistic regression model by Polson et al. (2013). This approach preserves the conditional conjugacy in the case of a normal prior for the regression coefficients and has been the foundation for the supervised topic model in Zhu et al. (2013).

In this paper we explore a new approach to supervised topic models that produces accurate multi-class predictions from semantically interpretable topics using a fully Bayesian approach, hence solving all three of the problems mentioned above. The model combines LDA with the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al., 2013) for multi-class classification and an efficient Horseshoe prior that achieves sparsity and interpretability by aggressive shrinkage (Carvalho et al., 2010). The new Diagonal Orthant Latent Dirichlet Allocation model (DOLDA; "dolda" is Swedish for hidden or latent) is demonstrated to have competitive predictive performance while producing interpretable multi-class predictions from semantically relevant topics.

2. DIAGONAL ORTHANT LATENT DIRICHLET ALLOCATION

2.1. Handling the challenges for high-dimensional interpretable supervised topic models. To solve the first and second challenge identified in the Introduction, reference classes and multi-class models, we propose to use the Diagonal Orthant (DO) probit model in Johndrow et al. (2013) as an alternative to the multinomial probit and logit models. Johndrow et al. (2013) propose a Gibbs sampler for the model and show that it mixes well. One of the benefits of the DO model is that all classes can be independently modeled using binary probit models when conditioning on the latent variable, thereby removing the need for a reference class. The parameters of the model can be interpreted as the effect of a covariate on the marginal probability of a specific class, which makes this model especially attractive when it comes to interpreting the inferred topics. The model also handles multiple classes in an efficient way that makes it possible to estimate a multi-class linear model in parallel over the classes.

TABLE 1. DOLDA model notation.

  𝒱          The set of word types (the vocabulary)
  V           The size of the vocabulary, i.e. V = |𝒱|
  v           A word type
  𝒦          The set of topics
  K           The number of topics, i.e. K = |𝒦|
  𝒟          The set of observations/documents
  D           The number of observations/documents, i.e. D = |𝒟|
  ℒ          The set of labels/categories
  L           The number of labels/categories
  P           The number of non-textual covariates/features
  N           The total number of tokens
  N_d         The number of tokens in document d
  w_{n,d}     Token n in document d
  w_d         The vector of tokens in document d: 1 × N_d
  y_d         The label of document d
  X           The covariate/feature matrix (including intercept): D × P
  x_d         The covariates/features of document d
  Φ           The matrix of word-topic probabilities: K × V
  φ_k         The word probabilities of topic k: 1 × V
  β           The prior for Φ: K × V
  Θ           The document-topic proportions: D × K
  θ_d         The topic probabilities of document d
  α           The prior for Θ: D × K
  z_{n,d}     The topic indicator of token n in document d
  z̄           The proportions of topic indicators by document: D × K
  M           The counts of topic indicators per document and topic: D × K
  n^{(w)}     The counts of topic-word type indicators: K × V
  a           The matrix of latent Gaussian variables: D × L
  η           The coefficient matrix of labels and covariates: (K + P) × L
  η_0         The prior for η: L × (K + P)
The third problem is that the semantic meanings of all topics do not necessarily have an effect on the label of interest; one topic may affect one or more classes, and some topics may just be noise that we do not want to use in the supervision. In the situation with many topics and many classes we will also have a very large number of parameters to analyze. The Horseshoe prior in Carvalho et al. (2010) was specifically designed to filter out signals from massive noise. This prior uses a local-global shrinkage approach to shrink some (or most) coefficients to zero while allowing sparse signals to be estimated without any shrinkage. This approach has shown good performance in linear regression settings (Castillo et al., 2015), which makes it straightforward to incorporate other covariates into our model, something that is rarely done in the area of supervised topic models. Different global shrinkage parameters are used for the different classes to handle the problem of unbalanced numbers of observations across classes. This makes it possible to shrink more when there is less data for a given class and to shrink less in classes with more observations.

2.2. Generative model. The generative model is described below. See also the graphical description of the model in Figure 2.1 and the summary of the notation in Table 1.

(1) For each topic k = 1, ..., K:
    (a) Draw a distribution over words φ_k ∼ Dir_V(β_k).
(2) For each label l ∈ ℒ:
    (a) Draw a global shrinkage parameter τ_l ∼ C⁺(0, 1).
    (b) Draw a local shrinkage parameter for the pth covariate λ_{l,p} ∼ C⁺(0, 1).
    (c) Draw coefficients η_{l,p} ∼ N(0, τ_l² λ_{l,p}²) (the intercept is estimated using a normal prior).
(3) For each observation/document d = 1, ..., D:
    (a) Draw topic proportions θ_d | α ∼ Dir_K(α).
    (b) For n = 1, ..., N_d:
        (i) Draw a topic assignment z_{n,d} | θ_d ∼ Categorical(θ_d).
        (ii) Draw a word w_{n,d} | z_{n,d}, Φ ∼ Categorical(φ_{z_{n,d}}).
    (c) Draw the label y_d ∼ Categorical(p_d), where

p_d = \left[ \sum_{l=1}^{L} F^{\mathcal{N}(0,1)}\!\left((\bar z, x)_d^\top \eta_l\right) \right]^{-1} \left( F^{\mathcal{N}(0,1)}\!\left((\bar z, x)_d^\top \eta_1\right), \ldots, F^{\mathcal{N}(0,1)}\!\left((\bar z, x)_d^\top \eta_L\right) \right)

and F^{\mathcal{N}(0,1)}(\cdot) is the univariate standard normal CDF (Johndrow et al., 2013).

FIGURE 2.1. The Diagonal Orthant probit supervised topic model (DOLDA).
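To make step (3c) concrete, the following sketch computes the DO probit class probabilities p_d for a single document directly from the formula above. The function and variable names are our own illustrations, not code from the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def do_probit_probs(eta: np.ndarray, zbar_x_d: np.ndarray) -> np.ndarray:
    """Class probabilities p_d of the DO probit model for one document.

    eta:      (L, K + P) array, one coefficient row per label.
    zbar_x_d: (K + P,) vector (z_bar, x)_d of topic proportions and covariates.
    """
    linpred = eta @ zbar_x_d        # (L,) linear predictors (z_bar, x)_d^T eta_l
    f = norm.cdf(linpred)           # F^{N(0,1)} applied elementwise
    return f / f.sum()              # normalization [sum_l F(...)]^{-1}
```

A label is then drawn as y_d ∼ Categorical(p_d), e.g. rng.choice(L, p=do_probit_probs(eta, features)).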
3. INFERENCE

3.1. The MCMC algorithm. Markov Chain Monte Carlo (MCMC) is used to estimate the model parameters. We use a different global shrinkage parameter τ_l for each class, motivated by the fact that different classes can have different numbers of observations. This gives the following sampler for inference in DOLDA. Below, \tilde X denotes the D × (K + P) matrix with rows (\bar z, x)_d.

(1) Sample the latent variables a_{d,l} ∼ \mathcal{N}_{+}\big((\bar z, x)_d^\top \eta_l, 1\big) for l = y_d and a_{d,l} ∼ \mathcal{N}_{-}\big((\bar z, x)_d^\top \eta_l, 1\big) for l ≠ y_d, where \mathcal{N}_{+} and \mathcal{N}_{-} are the positive and negative truncated normal distributions, truncated at 0.

(2) Sample all the regression coefficients as in an ordinary Bayesian linear regression per class label l:

\eta_l \sim \mathrm{MVN}\!\left(\mu_l, \left(\tilde X^\top \tilde X + \tau_l^{-2}\Lambda_l^{-1}\right)^{-1}\right), \qquad \mu_l = \left(\tilde X^\top \tilde X + \tau_l^{-2}\Lambda_l^{-1}\right)^{-1} \tilde X^\top a_l,

where Λ_l is a diagonal matrix with the squared local shrinkage parameters λ_{l,p}², one per parameter in η_l.

(3) Sample the global shrinkage parameter τ_l at iteration j using the following two-step slice sampler:

u \sim \mathcal{U}\!\left(0, \left[1 + \tau_{l,(j-1)}^{-2}\right]^{-1}\right)

\tau_{l,j}^{-2} \sim \mathcal{G}\!\left(\frac{P+1}{2},\; \frac{1}{2}\sum_{p=1}^{P}\left(\frac{\eta_{l,p}}{\lambda_{l,p}}\right)^{2}\right) \cdot I\!\left(\tau_{l,j}^{-2} < \frac{1-u}{u}\right),

where I indicates the truncation region of the truncated Gamma distribution.

(4) Sample each local shrinkage parameter λ_{p,l} as

u \sim \mathcal{U}\!\left(0, \left[1 + \lambda_{p,l,(j-1)}^{-2}\right]^{-1}\right)

\lambda_{p,l,j}^{-2} \sim \mathcal{E}\!\left(\frac{1}{2}\left(\frac{\eta_{l,p}}{\tau_l}\right)^{2}\right) \cdot I\!\left(\lambda_{p,l,j}^{-2} < \frac{1-u}{u}\right),

where \mathcal{E} denotes the exponential distribution.

(5) Sample the topic indicators z:

p(z_{i,d} = k \mid w_i, z_{\neg i}, \eta, a) \propto \phi_{v,k} \cdot \left(n^{(d)}_{d,k,\neg i} + \alpha\right) \cdot \exp\!\left(-\frac{1}{2}\sum_{l=1}^{L}\left[-2\frac{\eta_{l,k}}{N_d}\left(a_{d,l} - (\bar z_d^{\neg i}, x_d)\,\eta_l^\top\right) + \left(\frac{\eta_{l,k}}{N_d}\right)^{2}\right]\right),

where n^{(d)} is a D × K count matrix containing the sufficient statistics for Θ.

(6) Sample the topic-vocabulary distributions Φ:

\phi_k \sim \mathrm{Dir}\!\left(\beta_k + n^{(w)}_k\right),

where n^{(w)} is a K × V count matrix containing the sufficient statistics for Φ.

3.2. Efficient parallel sampling of z. To improve the speed of the sampler we cache the calculations done in the supervised part of the topic indicator sampler and parallelize the sampler. Some commonly used text corpora have several hundreds of millions of topic indicators, so efficient sampling of the z is absolutely crucial in practical applications. The basic sampler for z can be slow due to the serial nature of the collapsed sampler and the fact that the supervised part of p(z_{i,d}) involves a dot product.

The supervised part for document d can be expressed as exp(g^{\neg i}_{d,k}), where

g_{d,k}^{\neg i} = -\frac{1}{2}\sum_{l=1}^{L}\left[-2\frac{\eta_{l,k}}{N_d}\left(a_{d,l} - (\bar z_d^{\neg i}, x_d)\,\eta_l^\top\right) + \left(\frac{\eta_{l,k}}{N_d}\right)^{2}\right].

By realizing that sampling a topic indicator just means updating a small part of this expression, we can derive the relationship

g_{d,k} = g_{d,k}^{\neg i} - \frac{1}{N_d^{2}} \sum_{l=1}^{L} \eta_{l,k}\,\eta_{l,z_i},

where the expression \sum_{l} \eta_{l,k}\,\eta_{l,z_i} can be calculated once per iteration in η and stored in a two-dimensional array of size K². We can then use this relationship to update the supervision after sampling each topic indicator, calculating g^{\neg i}_{d,k} on the fly from the previous supervised contribution g^{\neg(i-1)}_{d,k} as

g_{d,k}^{\neg i} = g_{d,k}^{\neg(i-1)} + \frac{1}{N_d^{2}}\left[\sum_{l=1}^{L}\eta_{l,k}\,\eta_{l,z_i} - \sum_{l=1}^{L}\eta_{l,k}\,\eta_{l,z_{i-1}}\right].

Caching g^{\neg i}_{d,k} leads to an order of magnitude speedup for a model with 100 topics; a code sketch of this bookkeeping is given at the end of this subsection.

To further improve performance we parallelize the sampler, using the fact that documents are conditionally independent given Φ. By sampling Φ, instead of marginalizing it out, we reduce the efficiency of the MCMC somewhat, but we still converge to the true posterior, and the gain from parallelization is usually far greater than the loss in efficiency (Magnusson et al., 2015). In summary, we have the following sampler for z_{i,d}:

p(z_{i,d} = k \mid \cdot) \propto \phi_{k,v} \cdot \left(n^{(d)}_{d,k,\neg i} + \alpha\right) \cdot \exp\!\left(g^{\neg i}_{d,k}\right),

which can be sampled in parallel over the documents; the elements in Φ can be sampled in parallel over topics. The code is publicly available at https://github.com/lejon/DiagonalOrthantLDA.
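The bookkeeping above can be sketched as follows. This is a minimal NumPy illustration of the two displayed relations under an assumed array layout (the topic block of η stored as an L × K array); it is not the authors' Java implementation and deliberately omits the rest of the Gibbs sweep.

```python
import numpy as np

def eta_cross_table(eta_topics: np.ndarray) -> np.ndarray:
    """E[k, j] = sum_l eta[l, k] * eta[l, j], computed once per iteration in eta.

    eta_topics: (L, K) topic block of the coefficient matrix.
    """
    return eta_topics.T @ eta_topics                    # (K, K) cache

def shift_excluded_token(g_d, E, N_d, z_i, z_prev):
    """Move the cached supervised part g_d (length K) from excluding token
    i-1 to excluding token i, mirroring the displayed relation:
    g^{-i}_{d,.} = g^{-(i-1)}_{d,.} + (E[:, z_i] - E[:, z_prev]) / N_d^2."""
    return g_d + (E[:, z_i] - E[:, z_prev]) / N_d**2
```

The topic indicator for the current token is then drawn with probabilities proportional to φ_{k,v} (n^{(d)}_{d,k,¬i} + α) exp(g^{¬i}_{d,k}), as in the summary equation above.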
It is also straightforward to use the recently proposed cyclical Metropolis-Hastings proposals in Zheng et al. (2015) for inference in DOLDA. The additional sampling of λ_{p,l} and τ_l in our model can be done in O(K + P) and hence does not affect the overall complexity of the sampler. But, as shown in Magnusson et al. (2015), it is not obvious that a reduction in sampling complexity results in faster sampling when MCMC efficiency is taken into account.

3.3. Evaluation of convergence and prediction. We evaluate the convergence of the MCMC algorithm by monitoring the log-likelihood over the iterations:

\log L(w, y) = \sum_{d=1}^{D} \log\!\left[\left(1 - \mathrm{cdf}_{\mathcal N}\!\left(-(\bar z_d, x_d)\,\eta_{y_d}^\top\right)\right) \prod_{l \neq y_d} \mathrm{cdf}_{\mathcal N}\!\left(-(\bar z_d, x_d)\,\eta_l^\top\right)\right] + \log p(w),

where \log p(w) is the LDA marginal log-likelihood.

To make predictions for a new document we first need to sample the topic indicators of the document from

p(z_i = k \mid w_{\mathrm{new}}, \Phi) \propto \bar\phi_{k,v} \cdot \left(M_{d,k} + \alpha\right),

where \bar\phi_{k,v} is the mean over the last part of the posterior draws of Φ. We use the posterior mean based on the last iterations, instead of integrating out Φ, to avoid potential problems with label switching. However, we have not seen any indications of label switching after convergence in our experiments, probably because the datasets used for document predictions are usually quite large. The topic indicators are sampled for the predicted document using the fast PC-LDA sampler in Magnusson et al. (2015). The mean of the sampled topic indicator vector for the predicted document, \bar z_{\mathrm{new}}, is then used for class prediction:

y_{\mathrm{pred}} = \arg\max_l \; (\bar z_{\mathrm{new}}, x_{\mathrm{new}})^\top \eta_l.

This is a maximum a posteriori estimate, but it is straightforward to calculate the whole predictive distribution for the label. A minimal sketch of this prediction rule is given below.
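The sketch assumes the topic indicators for the new document have already been sampled (e.g. with the PC-LDA sampler) and averaged into z̄_new; the names are illustrative.

```python
import numpy as np

def predict_label(eta: np.ndarray, zbar_new: np.ndarray, x_new: np.ndarray) -> int:
    """MAP label prediction: argmax_l of (zbar_new, x_new)^T eta_l.

    eta:      (L, K + P) posterior summary (e.g. mean) of the coefficients.
    zbar_new: (K,) mean of the sampled topic indicators for the new document.
    x_new:    (P,) non-textual covariates, including the intercept.
    """
    features = np.concatenate([zbar_new, x_new])
    return int(np.argmax(eta @ features))
```

The full predictive distribution is obtained by instead normalizing the class scores over posterior draws of η rather than taking the argmax of a single summary.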
4. EXPERIMENTS

We collected a dataset containing the 8648 highest rated movies at IMDb.com. We use both the textual description as well as information about producers and directors to classify a given movie to a genre. We also analyze the classical 20 Newsgroups dataset to compare the accuracy with state-of-the-art supervised models. Our companion paper (Jonsson et al., 2016) applies the DOLDA model developed here to bug localization in a large scale software engineering context, using a dataset with 15 000 bug reports, each belonging to one of 118 classes. We evaluate the proposed topic model with regard to accuracy and the distribution of the regression coefficients. The experiments were performed on 2 sockets with 8-core Intel Xeon E5-2660 Sandy Bridge processors at 2.2 GHz and 32 GB of DDR3 1600 memory at the National Supercomputer Centre (NSC) at Linköping University.

TABLE 2. Datasets used in the experiments.

  Dataset         Classes (L)   Vocabulary (V)   Documents (D)   Tokens (N)
  IMDb            20            7530             8648            307569
  20 Newsgroups   20            23941            15077           2008897

4.1. Data and priors. The datasets are tokenized and a standard stop list of English words is removed, as well as the rarest word types, which make up 1% of the total tokens; we only include genres with at least 10 movies. In all experiments we use relatively vague priors, setting α = β = 0.01 for the LDA part of the model and c = 100 for the prior variance of the η coefficients in the normal model and for the intercept coefficient when using the Horseshoe prior. The accuracy experiment for IMDb was conducted using 5-fold cross-validation, and for the 20 Newsgroups corpus we used the same train and test sets as in Zhu et al. (2012) to enable direct comparisons of accuracy. In the interpretability analysis of the IMDb dataset no cross-validation was conducted; instead the whole dataset was used for estimation.

4.2. Results.

20 Newsgroups. Figure 4.1 displays the accuracy on the hold-out test set for the 20 Newsgroups dataset for different numbers of topics. The accuracy of our model is slightly lower than MedLDA and SVM using only textual features, but higher than both the classical supervised multi-class LDA and ordinary LDA combined with an SVM.

FIGURE 4.1. Accuracy of MedLDA, taken from Zhu et al. 2012 (left), and accuracy of DOLDA for the 20 Newsgroups test set (right).

We can also see from Figure 4.1 that DOLDA with the topics estimated jointly with the supervision part outperforms a two-step approach of first estimating LDA and then using the DO probit model with the pre-estimated mean topic indicators as covariates. This is true for both the Horseshoe prior and the normal prior, but the difference is just a few percent in accuracy.

The advantage of DOLDA is that it produces interpretable predictions with semantically relevant topics. It is therefore reassuring that DOLDA can compete in accuracy with other, less interpretable, models, even when the model is dramatically simplified by aggressive Horseshoe shrinkage for interpretation purposes. Our next dataset illustrates the interpretational strength of DOLDA. See also our companion paper (Jonsson et al., 2016) in the
software engineering literature for further demonstrations of DOLDA's ability to produce interpretable predictions in industrial applications without sacrificing prediction accuracy.

IMDb. Figure 4.2 displays the accuracy on the IMDb dataset as a function of the number of topics. The estimated DOLDA model here also contains several other discrete covariates, such as the film's director and producer. The accuracy of the more aggressive Horseshoe prior is better than that of the normal prior for all topic sizes. A supervised approach, with topics and supervision inferred jointly, again outperforms a two-step approach.

FIGURE 4.2. Accuracy for DOLDA on the IMDb data with normal and Horseshoe priors, and using a two-step approach with the Horseshoe prior.

The Horseshoe prior gives somewhat higher accuracy than the normal prior, and incorporating the Horseshoe prior lets us handle many additional covariates, since the shrinkage prior acts as a type of variable selection.

To illustrate the interpretation of DOLDA we fit a new model using only topics as covariates. Note first in Figure 4.3 how the Horseshoe prior is able to distinguish between so-called signal topics and noise topics; the Horseshoe prior aggressively shrinks a large fraction of the regression coefficients toward zero. This is achieved without the need to set any hyper-parameters.

FIGURE 4.3. Coefficients for the IMDb dataset with 80 topics using the normal prior (left) and the Horseshoe prior (right).

The Horseshoe shrinkage makes it easy to identify the topics that affect a given class. This is illustrated for the Romance genre in the IMDb dataset in Figure 4.4. This genre consists of relatively few observations (only 39 movies), and the Horseshoe prior therefore shrinks most coefficients to zero, keeping only one large signal topic that happens to have a negative effect on the Romance genre. The normal prior, on the other hand, gives a much denser, and therefore much less interpretable, solution.

FIGURE 4.4. Coefficients for the genre Romance in the IMDb dataset with 80 topics using the Horseshoe prior (upper) and a normal prior (lower).

Digging deeper into the interpretation of what triggers a Romance genre prediction, Table 3 shows the 10 top words for Topic 39. From this table it is clear that the signal topic identified using the Horseshoe prior is some sort of "crime" topic that is negatively correlated with the Romance genre, something that makes intuitive sense. The crime topic is clearly expected to be positively related to the Crime genre, and Figure 4.5 shows that this is indeed the case.
