DOLDA - A REGULARIZED SUPERVISED TOPIC MODEL FOR HIGH-DIMENSIONAL MULTI-CLASS REGRESSION

MÅNS MAGNUSSON, LEIF JONSSON AND MATTIAS VILLANI
ABSTRACT. Generating user interpretable multi-class predictions in data rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle both many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al., 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al., 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model's predictive accuracy on two datasets and demonstrate DOLDA's advantage in interpreting the generated predictions.
1. INTRODUCTION
During the last decades more and more textual data have become available, creating a growing need to statistically analyze large amounts of textual data. The hugely popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probability model where each document is summarized by a set of latent semantic themes, often called topics; formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is therefore a compressed latent representation of the documents. LDA is a mixed membership model where each document is a mixture of topics, and each word (token) in a document belongs to a single topic. The basic LDA model is unsupervised, i.e. the topics are learned solely from the words in the documents without access to document labels.
In many situations there is also other information we would like to incorporate when modeling a corpus of documents. A common example is when we have labeled documents, such as ratings of movies together with a movie description, illness categories in medical journals, or the location of the identified bug together with bug reports. In these situations, one can use a so called supervised topic model to find the semantic structure in the documents that is related to the class of interest. One of the first approaches to supervised topic models was proposed by Mcauliffe and Blei (2008). The authors propose a supervised topic model based on the generalized linear model framework, thereby making it possible to link binary,
Keywords and phrases: Text Classification, Latent Dirichlet Allocation, Horseshoe Prior, Diagonal Orthant Probit Model, Interpretable models.
Magnusson: Division of Statistics and Machine Learning, Dept. of Computer and Information Science, SE-58183 Linköping, Sweden. E-mail: mans.magnusson@liu.se. Jonsson: Ericsson AB and Dept. of Computer and Information Science, SE-16480 Stockholm, Sweden. E-mail: leif.jonsson@ericsson.com. Villani: Division of Statistics and Machine Learning, Dept. of Computer and Information Science, SE-58183 Linköping, Sweden. E-mail: mattias.villani@liu.se.
count and continuous response variables to topics that are inferred jointly with the regression/classification effects. In this approach the semantic content of a text, in the form of topics, predicts the response variable y. This approach is often referred to as downstream supervised topic modeling, contrary to an upstream supervised approach where the label y governs how the topics are formed, see e.g. Ramage et al. (2009).
Many downstream supervised topic models have been studied, mainly in the machine learning literature. Mcauliffe and Blei (2008) focus on downstream supervision using generalized linear regression models. Jiang et al. (2012) propose a supervised topic model using a max-margin approach to classification, and Zhu et al. (2013) propose a logistic supervised topic model using data augmentation with Polya-Gamma variates. Perotte et al. (2011) use a hierarchical binary probit model to model a hierarchical label structure in the form of a binary tree.
Most of the proposed supervised topic models have been motivated by trying to find good classification models, and the focus has naturally been on predictive performance. However, the predictive performance of most supervised topic models is only slightly better than that of a Support Vector Machine (SVM) with covariates based on word frequencies (Jameel et al., 2015). While predictive performance is certainly important, the real attraction of supervised topic models comes from their ability to learn semantically relevant topics and to use those topics to produce accurate interpretable predictions of documents or textual data. The interpretability of a model is an often neglected feature, but it is crucial in real world applications. As an example, Parnin and Orso (2011) show that bug fault localization systems are quickly disregarded when the users cannot understand how the system has reached its predictive conclusion. Compared to other text classification systems, topic models are very well suited for interpretable predictions since topics are abstract entities that humans can grasp. The problems of interpretability in multi-class supervised topic models can be divided into three main areas.
First, most supervised topic models use a logit or probit approach where the model is identified by the use of a reference category, to which the effect of any covariate is compared. This defeats one of the main purposes of supervised topic models since it complicates the interpretability of the models.
Second, to handle multi-class categorization a topic should be able to affect multiple classes, and some topics may not influence any class at all. In most supervised topic modeling approaches (such as Jiang et al. 2012; Zhu et al. 2013; Jameel et al. 2015) the multi-class problem is solved using binary classifiers in a "one-vs-all" classification approach. This approach works well in the situation of evenly distributed classes, but may not work well for skewed class distributions (Rubin et al., 2012). A one-vs-all approach also makes it more difficult to interpret the model. Estimating one model per class makes it impossible to see which classes are affected by the same topic and which topics do not predict any label. In these situations we would like to have one topic model to interpret. One-vs-all prediction is also costly from an estimation point of view since we need to estimate one model per class (Zheng et al., 2015).
Third, there can be situations with hundreds of classes and hundreds of topics (see Jonsson et al. (2016) for an example). Without regularization or variable selection we would end up with a model with too many parameters to interpret and very uncertain parameter estimates. In a good predictive supervised topic model one would like to find a small set of topics that are strong determinants of a single document class label. This is especially relevant when the number of observations in different classes is skewed, a problem common in real world situations (Rubin et al., 2012). In the rarer classes we would like to induce more shrinkage, while in classes with more data we would like to have less shrinkage in the model.
Multi-class regression is a non-trivial problem in Bayesian modeling. Historically, the multinomial probit model has been preferred due to the data augmentation approach proposed by Albert and Chib (1993). Augmenting the sampler using latent variables leads to straightforward Gibbs sampling with conditionally-conjugate updates of the regression coefficients. The Albert-Chib sampler often tends to mix slowly, and the same holds for improved samplers such as the parameter expansion approach in Imai and van Dyk (2005). Recently, a similar data augmentation approach using Polya-Gamma variables was proposed for the Bayesian logistic regression model by Polson et al. (2013). This approach preserves conditional conjugacy in the case of a normal prior for the regression coefficients and has been the foundation for the supervised topic model in Zhu et al. (2013).
In this paper we explore a new approach to supervised topic models that produces accurate multi-class predictions from semantically interpretable topics using a fully Bayesian approach, hence solving all three of the above mentioned problems. The model combines LDA with the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al., 2013) for multi-class classification and an efficient Horseshoe prior that achieves sparsity and interpretation by aggressive shrinkage (Carvalho et al., 2010). The new Diagonal Orthant Latent Dirichlet Allocation (DOLDA)¹ model is demonstrated to have competitive predictive performance while producing interpretable multi-class predictions from semantically relevant topics.

¹ DOLDA is Swedish for hidden or latent.
2. DIAGONAL ORTHANT LATENT DIRICHLET ALLOCATION
2.1. Handling the challenges for high-dimensional interpretable supervised topic models. To solve the first and second challenge identified in the Introduction, reference classes and multi-class models, we propose to use the Diagonal Orthant (DO) probit model in Johndrow et al. (2013) as an alternative to the multinomial probit and logit models. Johndrow et al. (2013) propose a Gibbs sampler for the model and show that it mixes well. One of the benefits of the DO model is that all classes can be independently modeled using binary probit models when conditioning on the latent variable, thereby removing the need for a reference class. The parameters of the model can be interpreted as the effect of the covariate on the marginal probability of a specific class, which makes this model especially attractive when it comes to interpreting the inferred topics. This model also includes multiple classes in an efficient way that makes it possible to estimate a multi-class linear model in parallel over the classes.
Symbol      Description
𝒱           The set of word types (the vocabulary)
V           The size of the vocabulary, i.e. V = |𝒱|
v           Word type
𝒦           The set of topics
K           The number of topics, i.e. K = |𝒦|
L           The number of labels/categories
ℒ           The set of labels/categories
D           The number of observations/documents, i.e. D = |𝒟|
𝒟           The set of observations/documents
P           The number of non-textual covariates/features
N           The total number of tokens
N_d         The number of tokens in document d
N           # of observed topic-word type indicators (count matrix): K × V
Φ           The matrix with word-topic probabilities: K × V
φ_k         The word probabilities for topic k: 1 × V
β           The prior for Φ: K × V
Θ           Document-topic proportions: D × K
θ_d         Topic probabilities for document d
α           The prior for Θ: D × K
M           # of topic indicators in each document by topic: D × K
a           Matrix of latent Gaussian variables: D × L
η           Coefficient matrix for each label and covariate: (K + P) × L
η_0         Prior for η: L × (K + P)
z_{n,d}     Topic indicator for token n in document d
z̄           Proportion of topic indicators by document: D × K
w_{n,d}     Token n in document d
w_d         Vector of tokens in document d: 1 × N_d
y_d         Label for document d
X           Covariate/feature matrix (including intercept): D × P
x_d         Covariates/features for document d

TABLE 1. DOLDA model notation.
The third problem of modeling supervised topic models is that the semantic meanings of all topics do not necessarily have an effect on our label of interest; one topic may have an effect on one or more classes, and some topics may just be noise that we do not want to use in the supervision. In the situation with many topics and many classes we will also have a very large number of parameters to analyze. The Horseshoe prior in Carvalho et al. (2010) was specifically designed to filter out signals from massive noise. This prior uses a local-global shrinkage approach to shrink some (or most) coefficients to zero while allowing sparse signals to be estimated without any shrinkage. This approach has shown good performance in linear regression type situations (Castillo et al., 2015), something that makes it straightforward to incorporate other covariates into our model, which is rarely done in the area of supervised topic models. Different global shrinkage parameters are used for the different classes to handle the problem of unbalanced numbers of observations in different classes. This makes it possible to shrink more when there is less data for a given class and shrink less in classes with more observations.
2.2. Generative model. The generative model is described below; see also a graphical description of the model in Figure 2.1. A summary of the notation is given in Table 1.
FIGURE 2.1. The Diagonal Orthant probit supervised topic model (DOLDA).

(1) For each topic $k = 1, \ldots, K$:
    (a) Draw a distribution over words $\phi_k \sim \mathrm{Dir}_V(\gamma_k)$.
(2) For each label $l \in \mathcal{L}$:
    (a) Draw a global shrinkage parameter $\tau_l \sim C^+(0,1)$.
    (b) Draw local shrinkage parameters for the $p$th covariate $\lambda_{l,p} \sim C^+(0,1)$.
    (c) Draw coefficients² $\eta_{l,p} \sim \mathcal{N}(0, \tau_l^2 \lambda_{l,p}^2)$.
(3) For each observation/document $d = 1, \ldots, D$:
    (a) Draw topic proportions $\theta_d \mid \alpha \sim \mathrm{Dir}_K(\alpha)$.
    (b) For $n = 1, \ldots, N_d$:
        (i) Draw a topic assignment $z_{n,d} \mid \theta_d \sim \mathrm{Categorical}(\theta_d)$.
        (ii) Draw a word $w_{n,d} \mid z_{n,d}, \phi_{z_{n,d}} \sim \mathrm{Categorical}(\phi_{z_{n,d}})$.
    (c) Draw the label $y_d \sim \mathrm{Categorical}(p_d)$, where

$$p_d = \Big[\sum_{l=1}^{L} F\big((\bar z, x)_d^\top \eta_l\big)\Big]^{-1} \Big(F\big((\bar z, x)_d^\top \eta_1\big), \ldots, F\big((\bar z, x)_d^\top \eta_L\big)\Big)$$

and $F(\cdot)$ is the univariate standard normal CDF (Johndrow et al., 2013).

² The intercept is estimated using a normal prior.
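To make the generative process concrete, the following is a minimal Python/numpy sketch of steps (1)-(3) under hypothetical sizes. All variable names, the fixed document length, and the use of numpy/scipy are our own choices; this is an illustration, not the authors' implementation (their code, linked in Section 3.2, is a separate codebase).

```python
# A hypothetical simulation of the DOLDA generative process (Section 2.2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, V, L, P, D, N_d = 10, 1000, 5, 3, 100, 50   # assumed sizes
alpha, gamma = 0.01, 0.01

# (1) Topic-word distributions phi_k ~ Dir_V(gamma).
Phi = rng.dirichlet(np.full(V, gamma), size=K)            # K x V

# (2) Horseshoe-regularized regression coefficients per label.
tau = np.abs(rng.standard_cauchy(L))                      # tau_l ~ C+(0,1)
lam = np.abs(rng.standard_cauchy((L, K + P)))             # lambda_{l,p} ~ C+(0,1)
eta = rng.normal(0.0, tau[:, None] * lam)                 # eta_{l,p} ~ N(0, tau_l^2 lambda_{l,p}^2)

X = rng.normal(size=(D, P))                               # non-textual covariates
docs, labels = [], []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))              # (3a) theta_d ~ Dir_K(alpha)
    z = rng.choice(K, size=N_d, p=theta)                  # (3b-i) topic indicators
    w = np.array([rng.choice(V, p=Phi[k]) for k in z])    # (3b-ii) words
    zbar = np.bincount(z, minlength=K) / N_d              # topic proportions zbar_d
    feats = np.concatenate([zbar, X[d]])                  # (zbar, x)_d
    F = norm.cdf(feats @ eta.T)                           # F((zbar, x)_d' eta_l), length L
    y = rng.choice(L, p=F / F.sum())                      # (3c) y_d ~ Categorical(p_d)
    docs.append(w); labels.append(y)
```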
3. INFERENCE
3.1. The MCMC algorithm. Markov Chain Monte Carlo (MCMC) is used to estimate the model parameters. We use different global shrinkage parameters $\tau_l$ for each class, motivated by the fact that the different classes can have different numbers of observations. This gives the following sampler for inference in DOLDA.
(1) Sample the latent variables $a_{d,l} \sim \mathcal{N}_+\big((\bar z, x)_d^\top \eta_l, 1\big)$ for $l = y_d$ and $a_{d,l} \sim \mathcal{N}_-\big((\bar z, x)_d^\top \eta_l, 1\big)$ for $l \neq y_d$, where $\mathcal{N}_+$ and $\mathcal{N}_-$ are the positive and negative truncated normal distributions, truncated at 0.

(2) Sample all the regression coefficients as in an ordinary Bayesian linear regression per class label $l$:

$$\eta_l \sim \mathcal{MVN}\big(\mu_l, \Sigma_l\big), \qquad \Sigma_l = \big(W^\top W + (\tau_l^2 \Lambda_l)^{-1}\big)^{-1}, \qquad \mu_l = \Sigma_l W^\top a_l,$$

where $W = (\bar Z, X)$ is the $D \times (K+P)$ matrix of topic proportions and covariates and $\Lambda_l$ is the diagonal matrix with the local shrinkage parameters $\lambda_{l,p}^2$ for each parameter in $\eta_l$.

(3) Sample the global shrinkage parameter $\tau_l$ at iteration $j$ using the following two-step slice sampler:

$$u \sim \mathcal{U}\Big(0, \big[1 + 1/\tau_{l,(j-1)}^2\big]^{-1}\Big)$$

$$\frac{1}{\tau_{l,j}^2} \sim \mathcal{G}\Big(\frac{P+1}{2},\; \frac{1}{2}\sum_{p=1}^{P}\big(\eta_{l,p}/\lambda_{l,p}\big)^2\Big)\, I\Big[\frac{1}{\tau_{l,j}^2} < \frac{1-u}{u}\Big]$$

where $I$ indicates the truncation region for the truncated Gamma distribution.

(4) Sample each local shrinkage parameter $\lambda_{p,l}$:

$$u \sim \mathcal{U}\Big(0, \big[1 + 1/\lambda_{p,l,(j-1)}^2\big]^{-1}\Big)$$

$$\frac{1}{\lambda_{p,l,j}^2} \sim \mathcal{E}\Big(\frac{1}{2}\big(\eta_{l,p}/\tau_l\big)^2\Big)\, I\Big[\frac{1}{\lambda_{p,l,j}^2} < \frac{1-u}{u}\Big]$$

where $\mathcal{E}$ denotes the (truncated) exponential distribution.

(5) Sample the topic indicators $z$:

$$p(z_{i,d} = k \mid w_i, z_{\neg i}, \eta, a) \propto \phi_{v,k} \cdot \big(n^{(d)}_{d,k,\neg i} + \alpha\big) \cdot \exp\Big(-\frac{1}{2}\sum_{l=1}^{L}\Big[-\frac{2\eta_{l,k}}{N_d}\big(a_{d,l} - (\bar z_d^{\neg i}, x_d)\,\eta_l^\top\big) + \Big(\frac{\eta_{l,k}}{N_d}\Big)^2\Big]\Big)$$

where $n^{(d)}$ is a $D \times K$ count matrix containing the sufficient statistics for $\Theta$.

(6) Sample the topic-vocabulary distributions

$$\phi_k \sim \mathrm{Dir}\big(\beta_k + n^{(w)}_k\big)$$

where $n^{(w)}$ is a $K \times V$ count matrix containing the sufficient statistics for $\Phi$.
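For concreteness, below is a hedged numpy sketch of steps 1-4 for a single label $l$. The function name, the design matrix name W, the summation over all $K+P$ coefficients in step 3, and the inverse-CDF treatment of the truncated Gamma and exponential draws are our own choices; this is a sketch of the scheme above, not the paper's (Java) implementation.

```python
# Hypothetical one-label Gibbs update for DOLDA's regression part (steps 1-4).
import numpy as np
from scipy.stats import truncnorm, gamma

def gibbs_label_update(W, y, l, eta_l, tau2, lam2, rng):
    """W: D x (K+P) design matrix (zbar, x); y: integer labels; eta_l, lam2:
    length K+P; tau2: scalar. Names and structure are ours, a sketch only."""
    D, Kp = W.shape
    m = W @ eta_l
    # Step 1: a_{d,l} ~ N+(m_d, 1) if y_d = l, else N-(m_d, 1), truncated at 0.
    lo = np.where(y == l, -m, -np.inf)
    hi = np.where(y == l, np.inf, -m)
    a = m + truncnorm.rvs(lo, hi, random_state=rng)

    # Step 2: eta_l | a ~ MVN(mu, Sigma) with Sigma = (W'W + (tau2*Lambda)^-1)^-1,
    # the standard conditional under the prior eta_p ~ N(0, tau2 * lam2_p).
    prec = W.T @ W + np.diag(1.0 / (tau2 * lam2))
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ (W.T @ a)
    eta_l = rng.multivariate_normal(mu, Sigma)

    # Step 3: global shrinkage via the two-step slice sampler, with xi = 1/tau^2.
    u = rng.uniform(0.0, 1.0 / (1.0 + 1.0 / tau2))
    shape, rate = (Kp + 1) / 2.0, 0.5 * np.sum(eta_l**2 / lam2)
    ub = (1.0 - u) / u                                   # truncation: xi < (1-u)/u
    v = rng.uniform(0.0, gamma.cdf(ub, shape, scale=1.0 / rate))
    tau2 = 1.0 / gamma.ppf(v, shape, scale=1.0 / rate)   # truncated Gamma draw

    # Step 4: local shrinkage, xi_p = 1/lambda_p^2 ~ truncated Exponential.
    u = rng.uniform(0.0, 1.0 / (1.0 + 1.0 / lam2))       # element-wise slices
    rates = 0.5 * eta_l**2 / tau2
    ubs = (1.0 - u) / u
    v = rng.uniform(0.0, 1.0 - np.exp(-rates * ubs))     # inverse-CDF on (0, ub)
    lam2 = rates / -np.log1p(-v)                         # lambda_p^2 = 1/xi_p
    return a, eta_l, tau2, lam2
```

Because the DO probit factorizes over classes given the latent variables, this update can be run independently, and hence in parallel, for each label.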
3.2. Efficient parallel sampling of z. To improve the speed of the sampler we cache the calculations done in the supervised part of the topic indicator sampler and parallelize the sampler. Some commonly used text corpora have several hundreds of millions of topic indicators, so efficient sampling of the z is absolutely crucial in practical applications. The basic sampler for z can be slow due to the serial nature of the collapsed sampler and the fact that the supervised part of $p(z_{i,d})$ involves a dot product.

The supervised part of document $d$ can be expressed as $\exp\big(g^{\neg i}_{d,k}\big)$, where

$$g^{\neg i}_{d,k} = -\frac{1}{2}\sum_{l=1}^{L}\Big[-\frac{2\eta_{l,k}}{N_d}\big(a_{d,l} - (\bar z_d^{\neg i}, x_d)\,\eta_l^\top\big) + \Big(\frac{\eta_{l,k}}{N_d}\Big)^2\Big].$$

By realizing that sampling a topic indicator just means updating a small part of this expression we can derive the relationship

$$g_{d,k} = g^{\neg i}_{d,k} - \frac{1}{N_d^2}\sum_{l=1}^{L}\eta_{l,k}\,\eta_{l,z_i},$$

where the expression $\sum_l \eta_{l,k}\eta_{l,z_i}$ can be calculated once per iteration in $\eta$ and stored in a two-dimensional array of size $K^2$. We can then use the above relationship to update the supervision after sampling each topic indicator by calculating $g^{\neg i}_{d,k}$ "on the fly" from the previous supervised contribution $g^{\neg(i-1)}_{d,k}$ in the following way:

$$g^{\neg i}_{d,k} = g^{\neg(i-1)}_{d,k} + \frac{1}{N_d^2}\Big[\sum_{l=1}^{L}\eta_{l,k}\,\eta_{l,z_i} - \sum_{l=1}^{L}\eta_{l,k}\,\eta_{l,z_{(i-1)}}\Big].$$

Caching $g^{\neg i}_{d,k}$ leads to an order of magnitude speedup for a model with 100 topics.
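The caching scheme fits in a few lines. The following sketch (names are ours) precomputes the K x K cross-product table once per iteration and applies the O(K) update above after each topic indicator change:

```python
# Sketch of the cached update of the supervised term g_{d,k} (names ours).
import numpy as np

def precompute_eta_cross(eta):
    """eta: L x K topic coefficients -> K x K table of sum_l eta_{l,k} eta_{l,k'}."""
    return eta.T @ eta

def update_g(g_d, eta_cross, z_i, z_prev, N_d):
    """Update the cached vector g_{d,.} (length K) when the excluded-token
    pointer moves from token i-1 (now assigned topic z_prev) to token i
    (currently assigned topic z_i), as in the equation above."""
    g_d += (eta_cross[:, z_i] - eta_cross[:, z_prev]) / N_d**2
    return g_d
```

The table costs O(K²L) to build but is reused for every token in the corpus, which is where the order-of-magnitude speedup comes from.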
To further improve the performance we parallelize the sampler, using the fact that documents are conditionally independent given $\Phi$. By sampling $\Phi$, instead of marginalizing it out, we reduce the efficiency of the MCMC somewhat, but we will still converge to the true posterior, and the gain from parallelization is usually far greater than the loss in efficiency (Magnusson et al., 2015).

In summary, we have the following sampler for $z_{i,d}$:

$$p(z_{i,d} = k \mid \cdot) \propto \phi_{k,v} \cdot \big(n^{(d)}_{d,k,\neg i} + \alpha\big) \cdot \exp\big(g^{\neg i}_{d,k}\big),$$

which can be sampled in parallel over the documents, and the elements in $\Phi$ can be sampled in parallel over topics. The code is publicly available at https://github.com/lejon/DiagonalOrthantLDA.
It is also straightforward to use the recently proposed cyclical Metropolis-Hastings proposals in Zheng et al. (2015) for inference in DOLDA. The additional sampling of $\lambda_{p,l}$ and $\tau_l$ in our model can be done in $O(K+P)$ and hence does not affect the overall complexity of the sampler. But, as shown in Magnusson et al. (2015), it is not obvious that the reduction in sampling complexity will result in faster sampling when MCMC efficiency is taken into account.
3.3. Evaluation of convergence and prediction. We evaluate the convergence of the MCMC algorithm by monitoring the log-likelihood over the iterations:

$$\log \mathcal{L}(w, y) = \sum_{d=1}^{D} \log\Big[\Big(1 - F\big(-(\bar z, x)_d\,\eta_{y_d}^\top\big)\Big)\prod_{l \neq y_d} F\big(-(\bar z, x)_d\,\eta_l^\top\big)\Big] + \underbrace{\log p(w)}_{\text{LDA marginal LL}},$$

where $F$ is the standard normal CDF: the first factor is the probability that the latent variable of the true label is positive, and the product is the probability that all other latent variables are negative.
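As an illustration, a numerically stable way to compute the DO probit part of this monitoring quantity is sketched below (the LDA marginal term log p(w) is tracked separately); function and variable names are ours, not the paper's.

```python
# Hedged sketch of the DO probit log-likelihood used for convergence monitoring.
# Uses log(1 - F(-x)) = log F(x) for numerical stability.
import numpy as np
from scipy.stats import norm

def do_probit_loglik(feats, eta, y):
    """feats: D x (K+P) rows (zbar, x)_d; eta: L x (K+P); y: integer labels."""
    lin = feats @ eta.T                       # D x L linear predictors
    log_neg = norm.logcdf(-lin)               # log P(a_{d,l} < 0)
    rows = np.arange(len(y))
    ll = log_neg.sum(axis=1)                  # start with all labels 'negative'
    ll += norm.logcdf(lin[rows, y]) - log_neg[rows, y]   # correct the true label
    return ll.sum()
```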
To make predictions for a new document we first need to sample the topic indicators of the given document from

$$p(z_i = k \mid w_{\text{new}}, \Phi) \propto \bar\phi_{k,v} \cdot (M_{d,k} + \alpha),$$

where $\bar\phi_{k,v}$ is the mean of the last part of the posterior draws of $\Phi$. We use the posterior mean based on the last iterations instead of integrating out $\Phi$ to avoid potential problems with label switching. However, we have not seen any indications of label switching after convergence in our experiments, probably because the datasets used for document predictions are usually quite large. The topic indicators are sampled for the predicted document using the fast PC-LDA sampler in Magnusson et al. (2015). The mean of the sampled topic indicator vector for the predicted document, $\bar z_{\text{new}}$, is then used for class prediction:

$$y_{\text{pred}} = \arg\max_l\, (\bar z_{\text{new}}, x_{\text{new}})^\top \eta_l.$$

This is a maximum a posteriori estimate, but it is straightforward to calculate the whole predictive distribution for the label.
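A minimal sketch of this prediction step, assuming $\bar z_{\text{new}}$ has already been sampled and averaged as described above (names are ours):

```python
# Sketch of MAP prediction and the full predictive distribution (names ours).
import numpy as np
from scipy.stats import norm

def predict(zbar_new, x_new, eta):
    """zbar_new: length-K topic proportions; x_new: length-P covariates;
    eta: L x (K+P) posterior mean (or a posterior draw) of the coefficients."""
    feats = np.concatenate([zbar_new, x_new])   # (zbar, x)_new
    return int(np.argmax(eta @ feats))          # MAP label

def predictive_distribution(zbar_new, x_new, eta):
    """Normalize the marginal probit CDFs, as in the p_d formula of Section 2.2."""
    feats = np.concatenate([zbar_new, x_new])
    F = norm.cdf(eta @ feats)
    return F / F.sum()
```

Since the normal CDF is monotone, taking the arg max of the linear predictors is equivalent to taking the arg max of the normalized DO probit probabilities.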
Dataset          Classes (L)   Vocabulary (V)   Documents (D)   Tokens (N)
IMDb             20            7530             8648            307569
20 Newsgroups    20            23941            15077           2008897

TABLE 2. Datasets used in the experiments.
4. EXPERIMENTS
We collected a dataset containing the 8648 highest rated movies at IMDb.com. We use both the textual description as well as information about producers and directors to classify a given movie to a genre. We also analyze the classical 20 Newsgroups dataset to compare the accuracy with state-of-the-art supervised models. Our companion paper (Jonsson et al., 2016) applies the DOLDA model developed here to bug localization in a large scale software engineering context using a dataset with 15 000 bug reports, each belonging to one of 118 classes. We evaluate the proposed topic model with regard to accuracy and the distribution of regression coefficients. The experiments were performed on 2 sockets with 8-core Intel Xeon E5-2660 Sandy Bridge processors at 2.2 GHz and 32 GB DDR3 1600 memory at the National Supercomputer Centre (NSC) at Linköping University.
4.1. Data and priors. The datasets are tokenized and a standard stop list of English words is removed, as well as the rarest word types, which together make up 1% of the total tokens; we only include genres with at least 10 movies.

In all experiments we used a relatively vague prior setting $\alpha = \beta = 0.01$ for the LDA part of the model and $c = 100$ for the prior variance of the $\eta$ coefficients in the normal model, and for the intercept coefficient when using the Horseshoe prior. The accuracy experiment for IMDb was conducted using 5-fold cross-validation, and the 20 Newsgroups corpus used the same train and test split as in Zhu et al. (2012) to enable direct comparisons of accuracy. In the interpretability analysis of the IMDb dataset no cross-validation was conducted; instead the whole dataset was used for estimation.
4.2. Results.
20 Newsgroups. Figure 4.1 displays the accuracy on the hold-out test set for the 20 Newsgroups dataset for different numbers of topics. The accuracy of our model is slightly lower than MedLDA and SVM using only textual features, but higher than both the classical supervised multi-class LDA and ordinary LDA combined with an SVM.

We can also see from Figure 4.1 that DOLDA with the topics estimated jointly with the supervision part outperforms a two-step approach of first estimating LDA and then using the DO probit model with the pre-estimated mean topic indicators as covariates. This is true for both the Horseshoe prior and the normal prior, but the difference is just a few percent in accuracy.
The advantage of DOLDA is that it produces interpretable predictions with semantically relevant topics. It is therefore reassuring that DOLDA can compete in accuracy with other less interpretable models, even when the model is dramatically simplified by aggressive Horseshoe shrinkage for interpretation purposes. Our next dataset illustrates the interpretational strength of DOLDA. See also our companion paper (Jonsson et al., 2016) in the software engineering literature for further demonstrations of DOLDA's ability to produce interpretable predictions in industrial applications without sacrificing prediction accuracy.

FIGURE 4.1. Accuracy of MedLDA, taken from Zhu et al. (2012) (left), and accuracy of DOLDA on the 20 Newsgroups test set (right).

IMDb. Figure 4.2 displays the accuracy on the IMDb dataset as a function of the number of topics. The estimated DOLDA model also contains several other discrete covariates, such as the film's director and producer. The accuracy of the more aggressive Horseshoe prior is better than that of the normal prior for all topic sizes. A supervised approach, with topics and supervision inferred jointly, is again outperforming a two-step approach.

FIGURE 4.2. Accuracy for DOLDA on the IMDb data with the normal and Horseshoe priors, and using a two-step approach with the Horseshoe prior.

The Horseshoe prior gives somewhat higher accuracy than the normal prior, and incorporating the Horseshoe prior lets us handle many additional covariates since the shrinkage prior also acts as a form of variable selection.
To illustrate the interpretation of DOLDA we fit a new model using only topics as covariates. Note first in Figure 4.3 how the Horseshoe prior is able to distinguish between so-called signal topics and noise topics; the Horseshoe prior aggressively shrinks a large fraction of the regression coefficients toward zero. This is achieved without the need to set any hyper-parameters.

FIGURE 4.3. Coefficients for the IMDb dataset with 80 topics using the normal prior (left) and the Horseshoe prior (right).

FIGURE 4.4. Coefficients for the genre Romance in the IMDb dataset with 80 topics using the Horseshoe prior (upper) and a normal prior (below).
The Horseshoe shrinkage makes it easy to identify the topics that affect a given class. This is illustrated for the Romance genre in the IMDb dataset in Figure 4.4. This genre consists of relatively few observations (only 39 movies), and the Horseshoe prior therefore shrinks most coefficients to zero, keeping only one large signal topic that happens to have a negative effect on the Romance genre. The normal prior on the other hand gives a much more dense, and therefore much less interpretable, solution.
Digging deeper into the interpretation of what triggers a Romance genre prediction, Table 3 shows the 10 top words for Topic 39. From this table it is clear that the signal topic identified using the Horseshoe prior is some sort of "crime" topic that is negatively correlated with the Romance genre, something that makes intuitive sense. The crime topic is clearly expected to be positively related to the Crime genre, and Figure 4.5 shows that this is indeed the case.