Multi-Conditional Learning: Generative/Discriminative Training
for Clustering and Classification

Andrew McCallum, Chris Pal, Greg Druck and Xuerui Wang
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
{mccallum,pal,gdruck,xuerui}@cs.umass.edu
Abstract

This paper presents multi-conditional learning (MCL), a training criterion based on a product of multiple conditional likelihoods. When combining the traditional conditional probability of "label given input" with a generative probability of "input given label," the latter acts as a surprisingly effective regularizer. When applied to models with latent variables, MCL combines the structure-discovery capabilities of generative topic models, such as latent Dirichlet allocation and the exponential family harmonium, with the accuracy and robustness of discriminative classifiers, such as logistic regression and conditional random fields. We present results on several standard text data sets showing significant reductions in classification error due to MCL regularization, and substantial gains in precision and recall due to the latent structure discovered under MCL.

Introduction

Conditional-probability training, in the form of maximum entropy classifiers (Berger et al., 1996) and conditional random fields (CRFs) (Lafferty et al., 2001; Sutton & McCallum, 2006), has had a dramatic and growing impact on natural language processing, information retrieval, computer vision, bioinformatics, and other related fields. However, discriminative models tend to overfit the training data, and a prior on parameters typically provides limited relief. In fact, it has been shown that in some cases generative naïve Bayes classifiers provide higher accuracy than conditional maximum entropy classifiers (Ng & Jordan, 2002). We thus consider alternative training criteria with reduced reliance on parameter priors, which also combine generative and discriminative learning.

This paper presents multi-conditional learning, a family of parameter estimation objective functions based on a product of multiple conditional likelihoods. In one configuration of this approach, the objective function is the (weighted) product of the "discriminative" probability of label given input and the "generative" probability of the input given label. The former aims to find a good decision boundary, the latter aims to model the density of the input, and the single set of parameters in our naïve-Bayes-structured model thus strives for both. All regularizers provide some additional constraints on parameter estimation. Our experimental results on a variety of standard text data sets show that this density-estimation constraint is a more effective regularizer than "shrinkage toward zero," which is the basis of traditional regularizers such as the Gaussian prior, reducing error by nearly 50% in some cases. As well as improving accuracy, the inclusion of a density estimation criterion helps improve confidence prediction.

In addition to simple conditional models, there has been growing interest in conditionally-trained models with latent variables (Jebara & Pentland, 1998; McCallum et al., 2005; Quattoni et al., 2004). Simultaneously there is immense interest in generative "topic models," such as latent Dirichlet allocation and its progeny, as well as their undirected analogues, including the harmonium models (Welling et al., 2005; Xing et al., 2005; Smolensky, 1986).

In this paper we also demonstrate multi-conditional learning applied to latent-variable models. MCL discovers a latent space projection that captures not only the co-occurrence of features in the input (as in generative models), but also provides the ability to accurately predict designated outputs (as in discriminative models). We find that MCL is more robust than the conditional criterion alone, while also being more purposeful than generative latent variable models. On the document retrieval task introduced in Welling et al. (2005), we find that MCL more than doubles precision and recall in comparison with the generative harmonium.

In latent variable models, MCL can be seen as a form of semi-supervised clustering, with the flexibility to operate on relational, structured, CRF-like models in a principled way. MCL here aims to combine the strengths of CRFs (handling auto-correlation and non-independent input features in making predictions) with the strengths of topic models (discovering co-occurrence patterns and useful latent projections). This paper sets the stage for various interesting future work in multi-conditional learning. Many configurations of multi-conditional learning are possible, including ones with more than two conditional probabilities. For example, transfer learning could naturally be configured as the product of conditional probabilities for the labels of each task, with some latent variables and parameters shared. Semi-supervised learning could be configured as the product of conditional probabilities for predicting the label, as well as predicting each input given the others. These configurations are the subject of ongoing work.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Multi-Conditional Learning and MRFs

In the following exposition we first present the general framework of multi-conditional learning. We then derive the equations used for multi-conditional learning in several structured Markov Random Field (MRF) models. We introduce discrete hidden (sub-class) variables into naïve MRF models, creating multi-conditional mixtures, and discuss how multi-conditional methods are derived. We then construct binary word occurrence models coupled with hidden continuous variables, as in the exponential family harmonium, demonstrating the advantages of multi-conditional learning for these models also.

The MCL Framework

Consider a data set consisting of i = 1, ..., N instances. We will construct probabilistic models consisting of discrete observed random variables {x}, discrete hidden variables {z} and continuous hidden variables z. Denote an outcome of a random variable as \tilde{x}. Define j = 1, ..., N_s pairs of disjoint subsets of observations \{\tilde{x}_A\}_{ij} and \{\tilde{x}_B\}_{ij}, where our indices denote the ith instance of the variables in subset j. We will construct a multi-conditional objective by taking the product of different conditional probabilities involving these subsets, and we will use \alpha_j to weight the contributions of the different conditionals. Using these definitions, the optimal parameter settings under our multi-conditional criterion are given by

\arg\max_{\theta} \prod_{i,j} \Big( \sum_{\{z\}_{ij}} \int P\big(\{\tilde{x}_A\}_{ij}, \{z\}, z \mid \{\tilde{x}_B\}_{ij}; \theta\big) \, dz_{ij} \Big)^{\alpha_j},   (1)

where we derive these marginal conditional likelihoods from a single underlying joint probability model with parameters \theta. Our underlying joint probability model may itself be normalized locally, globally or using some combination of the two.

For the experiments in this paper we will partition observed variables into a set of "labels" y and a set of "features" x. We define two pairs of subsets: \{x_A, x_B\}_1 = \{y, x\} and \{x_A, x_B\}_2 = \{x, y\}. We then construct multi-conditional objective functions L_{MC} with the following form

L_{MC} = \log\big( P(y|x)^{\alpha} P(x|y)^{\beta} \big) = \alpha L_{y|x}(\theta) + \beta L_{x|y}(\theta).   (2)

In this configuration one can think of our objective as having a generative component P(x|y) and a discriminative component P(y|x). Another attractive definition using two pairs is: \{x_A, x_B\}_1 = \{y, x\} and \{x_A, x_B\}_2 = \{x, \emptyset\}, giving rise to objectives of the following form

L = \log\big( P(y|x)^{\alpha} P(x)^{\beta} \big),   (3)

which represents a way of restructuring a joint likelihood to concentrate modeling power on a conditional distribution of interest. This objective is similar to the approach advocated in Minka (2005).
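To make the objective in (2) concrete, the following sketch evaluates a weighted multi-conditional log likelihood for a classifier with a single weight matrix, as in the naïve MRF of the next subsection. It is a minimal illustration, not code from the paper: the names (mcl_objective, theta, the count matrix X) and the bag-of-words representation are assumptions made for this example.

```python
import numpy as np
from scipy.special import logsumexp

def mcl_objective(theta, b, X, y, alpha, beta):
    """Weighted multi-conditional log likelihood
    alpha * log P(y|x) + beta * log P(x|y)  (cf. eq. 2),
    for a naive MRF with one weight per (word, class) pair.

    theta : (V, C) word-class weights
    b     : (C,)   class bias terms
    X     : (N, V) word count matrix
    y     : (N,)   integer class labels
    """
    scores = X @ theta + b                      # (N, C) unnormalized log P(y, x)
    # Discriminative term: log P(y_n | x_n) = score minus logsumexp over classes.
    log_p_y_given_x = scores[np.arange(len(y)), y] - logsumexp(scores, axis=1)
    # Generative term: each word event is a draw from a per-class multinomial,
    # log P(w | y=c) = theta[w, c] - logsumexp(theta[:, c])  (cf. eq. 11).
    log_word_probs = theta - logsumexp(theta, axis=0, keepdims=True)  # (V, C)
    log_p_x_given_y = np.sum(X * log_word_probs[:, y].T, axis=1)
    return alpha * log_p_y_given_x.sum() + beta * log_p_x_given_y.sum()
```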
Naïve MRFs for Documents

The graphical descriptions of the naïve Bayes model for text documents (Nigam et al., 2000) and the multinomial logistic regression or maximum entropy (Berger et al., 1996) model can be written with similar naïve graphical structures. Here we consider naïve MRFs, which can also be represented by a similar graphical structure but define a joint distribution in terms of unnormalized potential functions.

Figure 1: (Left) A factor graph (Kschischang et al., 2001) for a naïve MRF. (Right) A factor graph for a mixture of naïve MRFs. In these models each word occurrence is a draw from a discrete random variable; there are M_n random variables in document n.

Consider data D = \{(\tilde{y}_n, \tilde{x}_{j,n}); n = 1, ..., N, j = 1, ..., M_n\}, where there are N instances and within each instance there are M_n realizations of discrete random variables {x}. We will use y_n to denote a single discrete random variable for a class label. Model parameters are denoted by \theta. For a collection of N documents we thus have M_n word events for each document. The joint distribution of the data can be modeled using a set of naïve MRFs, one for each observation, such that

P(x_1, ..., x_{M_n}, y \mid \theta) = \frac{1}{Z} \phi(y|\theta_y) \prod_{j=1}^{M_n} \phi(x_j, y|\theta_{x,y})   (4)

where

Z = \sum_y \sum_{x_1} \cdots \sum_{x_{M_n}} \phi(y|\theta_y) \prod_{j=1}^{M_n} \phi(x_j, y|\theta_{x,y}).   (5)

If we define potential functions \phi(\cdot) to consist of exponentiated linear functions of multinomial variables (sparse vectors with a single 1 in one of the dimensions), y for labels and w_j for each word, a naïve MRF can be written as

P(y, \{w\}) = \frac{1}{Z} \exp\Big( y^T \theta_y + y^T \theta_{x,y}^T \sum_{j=1}^{M_n} w_j \Big).   (6)

To simplify our presentation, consider now combining our multinomial word variables {w} such that x = [\sum_{j=1}^{M_n} w_j; 1]. One can also combine \theta_y and \theta_{x,y} into \theta such that

P(y, x) = \frac{1}{Z} \exp(y^T \theta^T x).   (7)

Under this model, to optimize L_{MC} from (2) we have

P(y|x) = \frac{\exp(y^T \theta^T x)}{\sum_y \exp(y^T \theta^T x)}   and   P(x|y) = \frac{\exp(y^T \theta^T x)}{Z(y)}   (8)

where

Z(y) = \sum_{w_1} \cdots \sum_{w_{M_n}} \prod_{j=1}^{M_n} \exp(y^T \theta_{x,y}^T w_j) \exp(y^T \theta_y).   (9)

The gradients of the log conditional likelihoods contained in our objective can then be computed using:

\nabla L_{y|x}(\theta) = \sum_{n=1}^{N} \Big( x_n y_n^T - \frac{\sum_y \exp(y^T \theta^T x_n)\, x_n y^T}{\sum_y \exp(y^T \theta^T x_n)} \Big) = N \Big( \langle x y^T \rangle_{\tilde{P}(x,y)} - \big\langle \langle x y^T \rangle_{P(y|x)} \big\rangle_{\tilde{P}(x)} \Big)   (10)

where \langle \cdot \rangle_{P(x)} denotes the expectation with respect to distribution P(x) and we use \tilde{P}(x) to denote the empirical distribution of the data, the distribution obtained by placing a delta function on each data point and normalized by N. To compute \nabla L_{x|y}(\theta_{x,y}), we observe that

P(x|y) = \prod_{j=1}^{M_n} P(w_j|y) = \prod_{j=1}^{M_n} \Big( \frac{\exp(y^T \theta_{x,y}^T w_j)}{\sum_{w_j} \exp(y^T \theta_{x,y}^T w_j)} \Big),   (11)

and therefore

\nabla L_{x|y}(\theta_{x,y}) = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big( \tilde{w}_{j,n} \tilde{y}_n^T - \sum_{w_{j,n}} w_{j,n} \tilde{y}_n^T P(w_{j,n}|\tilde{y}_n) \Big).   (12)
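As a sanity check on equations (10) and (12), the sketch below computes both gradients for the count-vector parameterization used above. The names (grad_y_given_x, the count matrix X) are illustrative assumptions; the class-bias parameters \theta_y are folded into the appended constant feature of x = [\sum_j w_j; 1], and the generative gradient ignores that constant row, consistent with (11) and (12).

```python
import numpy as np
from scipy.special import softmax

def mcl_gradients(theta, X, Y):
    """Gradients of the two conditional log likelihoods in eq. (2).

    theta : (V + 1, C) weights; the last row plays the role of theta_y
            because each x_n = [sum_j w_j ; 1] carries a constant feature.
    X     : (N, V + 1) count vectors with the appended constant 1
    Y     : (N, C)     one-hot label vectors
    Returns (grad of L_{y|x}, grad of L_{x|y}), both shaped like theta.
    """
    # Eq. (10): empirical feature counts minus expectations under P(y|x).
    p_y_given_x = softmax(X @ theta, axis=1)             # (N, C)
    grad_y_given_x = X.T @ (Y - p_y_given_x)             # (V+1, C)

    # Eq. (12): per-class word distributions, ignoring the constant feature.
    word_theta = theta[:-1, :]                            # (V, C)
    p_w_given_y = softmax(word_theta, axis=0)             # (V, C), columns sum to 1
    counts = X[:, :-1]                                     # (N, V) word counts
    n_words = counts.sum(axis=1, keepdims=True)            # M_n per document
    # Empirical word-label co-occurrences minus expected counts under P(w|y).
    grad_x_given_y = np.zeros_like(theta)
    grad_x_given_y[:-1, :] = counts.T @ Y - p_w_given_y * (Y * n_words).sum(axis=0)
    return grad_y_given_x, grad_x_given_y
```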
Mixtures of Naïve MRFs

We can extend the basic naïve MRF model shown in Figure 1 (Left) by adding a hidden subclass variable as illustrated (Right). In a mixture of naïve MRFs the joint distribution of the data for each observation can be modeled using

P(\{x\}, y, z \mid \theta) = \frac{1}{Z} \phi(y|\theta_y) \phi(y, z|\theta_{y,z}) \prod_{j=1}^{M_n} \phi(x_j, z|\theta_{x,z}),   (13)

where the \phi(y, z|\theta_{y,z}) potential encodes a sparse compatibility function relating labels or classes to a subset of states of the hidden discrete variable z.

To optimize a mixture of naïve MRFs, we use the expected gradient algorithm (Salakhutdinov et al., 2003). In this model we can compute the gradient of the complete log likelihood, and this gradient decomposes with respect to our expectation such that the following computation can be efficiently performed:

\nabla L_{x|y}(\theta) = \frac{\partial}{\partial \theta} \ln P(\{x\}|y; \theta) = \sum_z P(z|\{x\}, y; \theta) \frac{\partial}{\partial \theta} \ln P(\{x\}, z|y; \theta).   (14)

For example, the gradients for the "weights" \lambda_{x_e,z_s} comprising the elements of the potential function parameters \theta_{x,z} are computed from

\frac{\partial L_{x|y}(\theta)}{\partial \lambda_{x_e,z_s}} = \sum_{n=1}^{N} \sum_{j=1}^{M_n} \Big[ \sum_{z_n} P(z_n|\{\tilde{x}\}_n, \tilde{y}_n; \theta) f_{x_e,z_s}(\tilde{x}_{j,n}, z_n) - \sum_{z_n} \sum_{\{x\}_n} P(\{x\}_n, z_n|\tilde{y}_n; \theta) f_{x_e,z_s}(x_{j,n}, z_n) \Big],   (15)

where f_{x_e,z_s}(x, z) are binary feature functions evaluating to one when the state of x = x_e and the state of z = z_s. The updates for the potential function parameters using L_{y|x} take a form similar to the standard "maximum entropy" gradient computations, augmented with a hidden variable. We term mixture models trained by multi-conditional learning multi-conditional mixtures (MCM).
Harmonium Structured Models

A harmonium model (Smolensky, 1986) is a two layer Markov Random Field (MRF) consisting of observed variables and hidden variables. Like all MRFs, the model we present here will be defined in terms of a globally normalized product of (unnormalized) potential functions defined upon subsets of variables. A harmonium can also be described as a type of restricted Boltzmann machine (Hinton, 2002). In the following we present a new type of exponential family multi-attribute harmonium, extending the models used in Welling et al. (2005) and the dual-wing harmonium work of Xing et al. (2005).

Our exponential family harmonium structured model can be written as

P(x, z|\Theta) = \exp\Big\{ \sum_i \theta_i^T f_i(x_i) + \sum_j \theta_j^T f_j(z_j) + \sum_i \sum_j \theta_{ij}^T f_{ij}(x_i, z_j) - A(\Theta) \Big\},   (16)

where z is a vector of continuous valued hidden variables, x is a vector of observations, \theta_i represents parameter vectors (or weights), \theta_{ij} represents a parameter vector on a cross product of states, f_i denotes feature functions, \Theta = \{\theta_{ij}, \theta_i, \theta_j\} is the set of all parameters and A is the log-partition function or normalization constant. A harmonium model factorizes the third term of (16) into \theta_{ij}^T f_{ij}(x_i, z_j) = f_i(x_i)^T W_{ij}^T f_j(z_j), where W_{ij}^T is a parameter matrix with dimensions a x b, i.e., with rows equal to the number of states of f_i(x_i) and columns equal to the number of states of f_j(z_j). In the models we construct here we will use binary word occurrence vectors that have dimension M_v, the size of our vocabulary. This is in contrast to our models in the previous section, where we had a different number of discrete word events M_n for each document n. We will denote one of the observed input variables x_d as a discrete label, denoted as y in Figure 2.

Figure 2 illustrates a multi-attribute harmonium model as a factor graph. A harmonium represents the factorization of a joint distribution for observed and hidden variables using a globally normalized product of local functions. In our experiments here we shall use the harmonium's factorization structure to define an MRF, and we will then define sets of marginal conditional distributions of some observed variables given others that are of particular interest so as to form our multi-conditional objective.

Figure 2: A factor graph for a multi-attribute harmonium model or two layer MRF. Hidden (topic) variables z_1, ..., z_n connect to observed variables consisting of binary word occurrences x_1, ..., x_{M_v} and a domain label y.

Importantly, using a globally normalized joint distribution with this construction it is also possible to derive two consistent conditional models, one for hidden variables given observed variables and one for observed variables given hidden variables (Welling et al., 2005). The conditional distributions defined by these models can also be used to implement sampling schemes for various probabilities in the underlying joint model. However, it is important to remember that the original model parameterization is not defined in terms of these conditional distributions. In our experiments below we use a joint model with a form defined by (16) with W^T = [W_b^T W_d^T] such that the (exponential family) conditional distributions consistent with the joint model are

P(z_n|\tilde{x}) = N(z_n; \hat{\mu}, I),   \hat{\mu} = \mu + W^T \tilde{x}   (17)
P(x_b|\tilde{z}) = B(x_b; \hat{\theta}_b),   \hat{\theta}_b = \theta_b + W_b \tilde{z}   (18)
P(x_d|\tilde{z}) = D(x_d; \hat{\theta}_d),   \hat{\theta}_d = \theta_d + W_d \tilde{z},   (19)

where N(), B() and D() represent Normal, Bernoulli and Discrete distributions respectively.
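The conditionals (17)-(19) make block Gibbs sampling in this model cheap: all hidden units can be sampled at once given the observations, and vice versa. The sketch below shows one such alternating step; the variable names and the logistic/softmax parameterization of the Bernoulli and Discrete conditionals are assumptions made for the illustration.

```python
import numpy as np
from scipy.special import expit, softmax

rng = np.random.default_rng(0)

def gibbs_step(x_b, x_d, W_b, W_d, mu, theta_b, theta_d):
    """One block Gibbs sweep using the conditionals (17)-(19).

    x_b : (M_v,) binary word occurrences, x_d : (C,) one-hot domain label.
    W_b : (M_v, K), W_d : (C, K) weight blocks; mu : (K,) hidden means.
    """
    # Eq. (17): hidden units are Gaussian with identity covariance.
    mu_hat = mu + W_b.T @ x_b + W_d.T @ x_d
    z = rng.normal(mu_hat, 1.0)
    # Eq. (18): each word occurrence is Bernoulli with logistic mean.
    p_b = expit(theta_b + W_b @ z)
    x_b_new = rng.binomial(1, p_b)
    # Eq. (19): the label is a single draw from a Discrete (softmax) distribution.
    p_d = softmax(theta_d + W_d @ z)
    x_d_new = np.eye(len(p_d))[rng.choice(len(p_d), p=p_d)]
    return z, x_b_new, x_d_new
```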
The following equation can be used to represent the marginal distribution of x,

P(x|\theta, \Lambda) = \exp\{\theta^T x + x^T \Lambda x - A(\theta, \Lambda)\},   (20)

where \Lambda = \frac{1}{2} W W^T and \theta combines \theta_d and \theta_b. The labels for this model are the discrete random variable (i.e. y = x_d) and the features are the binary variables.

In an exponential family model with exponential function F(x; \theta), it is easy to verify that the gradient of the log marginal likelihood L of the observed data x can be expressed as

\frac{\partial L(\theta; x)}{\partial \theta} = N \Big[ \Big\langle \frac{\partial F(x; \theta)}{\partial \theta} \Big\rangle_{\tilde{P}(x)} - \Big\langle \frac{\partial F(x; \theta)}{\partial \theta} \Big\rangle_{P(x; \theta)} \Big],   (21)

where \langle \cdot \rangle_{\tilde{P}(x)} denotes the expectation under the empirical distribution, \langle \cdot \rangle_{P(x)} is an expectation under the model's marginal distribution, and N is the number of data elements. We can thus compute the gradient of the log-likelihood with respect to the weight matrix W using

\frac{\partial L}{\partial W^T} = \frac{1}{N_d} \sum_{i=1}^{N_d} W^T \tilde{x}_i \tilde{x}_i^T - \frac{1}{N_s} \sum_{j=1}^{N_s} W^T \tilde{x}_{i,(j)} \tilde{x}_{i,(j)}^T,   (22)

where N_d is the number of vectors of observed data, \tilde{x}_{i,(j)} are samples indexed by j, and N_s is the number of samples used per data vector, computed using Gibbs sampling with conditionals (17), (18) and (19). In our experiments here we have found it possible to use either one or a small number of Markov Chain Monte Carlo (MCMC) (Andrieu et al., 2003) steps initialized from the data vector (the contrastive divergence approach (Hinton, 2002)). Standard MCMC approximations for expectations are also possible. We use straightforward gradient-based optimization for model parameters with a learning rate and a momentum term. Finally, for conditional likelihood and multi-conditional likelihood based learning, gradient values can be obtained from

\frac{\partial L_{MC}}{\partial \theta} = N \Big[ (\alpha + \beta) \Big\langle \frac{\partial F(x_b, x_d; \theta)}{\partial \theta} \Big\rangle_{\tilde{P}(x_b, x_d)} - \alpha \Big\langle \Big\langle \frac{\partial F(x_b, x_d; \theta)}{\partial \theta} \Big\rangle_{P(x_d|x_b; \theta)} \Big\rangle_{\tilde{P}(x_b)} - \beta \Big\langle \Big\langle \frac{\partial F(x_b, x_d; \theta)}{\partial \theta} \Big\rangle_{P(x_b|x_d; \theta)} \Big\rangle_{\tilde{P}(x_d)} \Big].   (23)
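One way to turn (23) into a contrastive-divergence-style update for the word weight block W_b is sketched below, assuming one reconstruction step per term. The clamping scheme (clamping x_b when estimating the P(x_d|x_b) expectation and vice versa), the single shared hidden-unit sample, and all function names are illustrative assumptions consistent with the description above, not the paper's implementation.

```python
import numpy as np
from scipy.special import expit, softmax

rng = np.random.default_rng(0)

def grad_F_Wb(x_b, x_d, W_b, W_d, mu):
    # dF/dW_b for one example: outer product of the word vector with the
    # hidden-unit mean mu_hat = mu + W_b^T x_b + W_d^T x_d (cf. eq. 17).
    mu_hat = mu + W_b.T @ x_b + W_d.T @ x_d
    return np.outer(x_b, mu_hat)

def mcl_grad_Wb(x_b, x_d, W_b, W_d, mu, theta_b, theta_d, alpha, beta):
    """One-sample estimate of eq. (23) for the word weight block W_b."""
    # Positive phase: both observation blocks clamped to the data.
    pos = (alpha + beta) * grad_F_Wb(x_b, x_d, W_b, W_d, mu)

    # Single hidden sample initialized from the data vector (CD-1 style).
    z = rng.normal(mu + W_b.T @ x_b + W_d.T @ x_d, 1.0)

    # P(x_d | x_b) term: keep the words clamped, resample the label once.
    p_d = softmax(theta_d + W_d @ z)
    x_d_samp = np.eye(len(p_d))[rng.choice(len(p_d), p=p_d)]
    neg_d = alpha * grad_F_Wb(x_b, x_d_samp, W_b, W_d, mu)

    # P(x_b | x_d) term: keep the label clamped, resample the words once.
    p_b = expit(theta_b + W_b @ z)
    x_b_samp = rng.binomial(1, p_b)
    neg_b = beta * grad_F_Wb(x_b_samp, x_d, W_b, W_d, mu)

    return pos - neg_d - neg_b
```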
Relationships to Other Work

Theoretical and empirical results in Ng and Jordan (2002) have supported the notion that, while a discriminative model may have a lower asymptotic error (with more data), the error rate of classifications based on an analogous generative model can often approach an asymptotically higher error rate faster. Hybrid methods combining generative and discriminative training are appealing in that they have the potential to draw upon the strengths of both approaches. For example, in Raina et al. (2003), a high dimensional subset of parameters is trained under a joint likelihood objective while another, smaller subset of parameters is trained under a conditional likelihood objective. In contrast, in our approach all parameters are optimized under a number of conditional objectives.

In Corduneanu and Jaakkola (2003), a method characterized as information regularization is formulated for using information about the marginal density of unlabeled data to constrain an otherwise free conditional distribution. Their approach can be thought of as a method for penalizing decision boundaries that occur in areas of high marginal density. In terms of the regularization perspective, our multi-conditional approach uses additional or auxiliary conditional distributions derived from an underlying joint probability model as regularizers. Furthermore, our approach is defined within the context of an underlying joint model. It is our belief that these additional conditional distributions in our objective function can serve as a regularizer for the conditional distributions we primarily care about, the probability of labels. As such, we weight the conditional distributions differently in our objective.

With equal weighting of conditionals and an appropriate definition of subsets of variables, the method can be seen as a type of pseudo-likelihood (Besag, 1975). However, our goals are quite different, in that we are not trying to approximate a joint likelihood, but rather we wish to explicitly optimize for the conditional distributions in our objective.

The mixtures of naïve MRFs we present resemble the multiple mixture components per class approach used in Nigam et al. (2000). The conditional distributions arising for our labels given our data are also related to mixtures of experts (Jordan & Jacobs, 1994), conditional mixture models (Jebara & Pentland, 1998), simple mixtures of maximum entropy models (Pavlov et al., 2002), and mixtures of conditional random fields (McCallum et al., 2005; Quattoni et al., 2004). The continuous latent variable model we present here is similar to the dual wing harmonium or two layer random field presented in Xing et al. (2005) for mining text and images. In that approach a lower dimensional representation of image and text data is obtained by optimizing the joint likelihood of a harmonium model.

Experimental Results

In this section, we present experimental results using multi-conditional objective functions in the context of the models described. First, we apply naïve Markov random fields to document classification and show that the multi-conditional training provides better regularization than the traditional Gaussian prior. Next, we demonstrate mixture forms of the model on both real and synthetic data, including an example of topic discovery. Finally, we show that in harmonium-structured models, the multi-conditional objective provides a quantitatively better latent space.

Naïve MRFs and MCL as Regularization

We use the objective function \alpha L_{y|x}(\theta) + \beta L_{x|y}(\theta) in naïve MRFs and compare to the generative naïve Bayes model and the discriminative maximum entropy model for document classification. We present extensive experiments with common text data sets, which are briefly described below.

• 20 Newsgroups is a corpus of approximately 20,000 newsgroup messages. We use the entire corpus (abbreviated as news), as well as two subsets (talk and comp).

• The industry sector corpus is a collection of corporate web pages split into about 70 categories. We use the entire corpus (sector), as well as three subsets: healthcare (health), financial (finan), and technology (tech).

• The movie review corpus (movie) is a collection of user movie reviews from the Internet Movie Database, compiled by Bo Pang at Cornell University. We used the polarity dataset (v2.0), where the task is to classify the sentiment of each review as positive or negative.

• The sraa data set consists of 73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation.

• The Web Knowledge Base (webkb) data set consists of web pages from four universities that are classified into faculty, student, course, and project (we discard the categories of staff, department, and other).

We determine \alpha and \beta, the weights of each component of our objective function, and the Gaussian prior variance \sigma^2 using cross validation, as sketched below. Specifically, we use 10-fold cross-validation, with 5 folds used for choosing these parameters and 5 folds used for testing. The models tend to be quite sensitive to the values of \alpha and \beta. Additionally, because there is no longer a guarantee of convexity, thoughtful initialization of parameters is sometimes required. In future work, we hope to more thoroughly understand and control for these engineering issues.
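The hyperparameter selection just described can be organized as a grid search over (\alpha, \beta, \sigma^2) on the tuning folds. The sketch below is an illustrative assumption of that procedure, not the exact protocol used in the paper: the grid values, the train_mcl and accuracy helpers, and the fold bookkeeping are hypothetical.

```python
import itertools
import numpy as np

def select_hyperparameters(folds, train_mcl, accuracy):
    """Pick (alpha, beta, sigma2) by cross validation on the tuning folds.

    folds     : list of (X_train, y_train, X_val, y_val) tuples (the 5 tuning folds)
    train_mcl : callable(X, y, alpha, beta, sigma2) -> model   (assumed helper)
    accuracy  : callable(model, X, y) -> float                 (assumed helper)
    """
    alphas = betas = [0.1, 0.5, 1.0, 2.0]          # illustrative grid values
    sigma2s = [0.1, 1.0, 10.0]
    best, best_score = None, -np.inf
    for alpha, beta, sigma2 in itertools.product(alphas, betas, sigma2s):
        scores = [accuracy(train_mcl(Xt, yt, alpha, beta, sigma2), Xv, yv)
                  for Xt, yt, Xv, yv in folds]
        if np.mean(scores) > best_score:
            best, best_score = (alpha, beta, sigma2), np.mean(scores)
    return best
```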
During preprocessing, we remove words that only occur once in each corpus, as well as stop words, HTML, and email message headers. We also test with small-vocabulary versions of each data set in which the vocabulary size is reduced to 2000 using information gain.

The results are presented in Table 1. The parenthesized values are the standard deviations of the test accuracy across the cross validation folds. On 15 of 20 data sets, we show improvements over both maximum entropy and naïve Bayes. Although the differences in accuracy are small in some cases, the overall trend across data sets illustrates the potential of MCL for regularization. In fact, the difference between the mean accuracy for maximum entropy and MCL is larger than the difference between the mean accuracies of naïve Bayes and maximum entropy. Across all data sets, the mean MCL accuracy is significantly greater than the mean accuracies of naive Bayes (p = 0.001) and maximum entropy (p = 0.0002) under a one-tailed paired t-test.

We also found that on 10 of the 15 data sets on which we also calculated the area under the accuracy/coverage curve, MCL provided better confidence estimates.

Table 1: Document classification accuracies for naive Bayes, maximum entropy, and MCL.

Data            Naive Bayes   MaxEnt        MCL
news            85.3 (0.61)   82.9 (0.82)   85.9 (0.89)
news (2000)     76.4 (0.88)   77.4 (0.81)   77.7 (0.48)
comp            85.1 (1.78)   83.7 (0.68)   83.4 (0.94)
comp (2000)     81.8 (1.36)   82.2 (0.75)   84.0 (1.05)
talk            84.6 (1.02)   82.3 (1.43)   83.7 (1.27)
talk (2000)     83.7 (2.17)   81.6 (2.27)   84.3 (1.21)
sector          75.6 (2.05)   88.0 (1.13)   87.4 (0.84)
sector (2000)   73.9 (0.78)   82.0 (1.03)   83.2 (1.56)
tech            91.0 (1.33)   91.8 (2.24)   93.1 (1.69)
tech (2000)     92.9 (2.46)   91.4 (2.03)   94.5 (1.81)
finan           92.3 (2.36)   89.2 (1.52)   91.5 (2.57)
finan (2000)    87.3 (3.31)   89.6 (1.82)   94.6 (1.79)
health          93.5 (4.36)   94.0 (3.74)   95.5 (4.00)
health (2000)   95.0 (5.00)   91.0 (3.39)   95.5 (4.30)
movie           78.6 (1.20)   82.6 (2.96)   82.7 (2.50)
movie (2000)    90.9 (1.98)   88.8 (1.96)   94.0 (1.05)
sraa            95.9 (0.15)   96.1 (0.23)   96.7 (0.09)
sraa (2000)     93.7 (0.20)   94.7 (0.13)   95.0 (0.21)
webkb           87.9 (2.14)   92.4 (0.84)   92.4 (1.04)
webkb (2000)    84.7 (1.20)   92.4 (1.07)   92.7 (1.40)
mean            86.5 (6.73)   87.7 (5.39)   89.4 (5.76)

Mixtures of Naïve MRFs

In order to demonstrate the ability of multi-conditional mixtures to successfully classify data that is not linearly separable, we perform the following synthetic data experiments. Four class labels are each associated with four 4-dimensional Gaussians, having means and variances uniformly sampled between 0 and 100. Positions of data points generated from the Gaussians are rounded to integer values. For some samples of the Gaussian means and variances, e.g. an XOR configuration, a significant portion of the data would be misclassified by the best linear separator. MCMs, however, can learn and combine multiple linear decision boundaries. An MCM with two hidden subclasses per class attains an accuracy of 75%, whereas naïve Bayes, maximum entropy, and non-mixture multi-conditional naïve MRFs have accuracies of 54%, 52%, and 56%, respectively. With explicitly-constructed XOR positioning, MCM attains 99%, while the others yield less than 50%.
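For concreteness, a data-generation sketch along the lines just described is shown below; the per-class sample count and the random seed are illustrative assumptions, as the paper does not specify them.

```python
import numpy as np

def make_synthetic(n_per_class=500, n_classes=4, n_subclasses=4, dim=4, seed=0):
    """Four classes, each a mixture of four 4-D Gaussians with means and
    variances drawn uniformly in [0, 100]; points are rounded to integers."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_classes):
        means = rng.uniform(0, 100, size=(n_subclasses, dim))
        variances = rng.uniform(0, 100, size=(n_subclasses, dim))
        for _ in range(n_per_class):
            k = rng.integers(n_subclasses)
            X.append(rng.normal(means[k], np.sqrt(variances[k])))
            y.append(c)
    return np.rint(np.array(X)).astype(int), np.array(y)
```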
Running these MCMs on the talk data set yields "topics" similar to latent Dirichlet allocation (LDA) (Blei et al., 2003), except that parameter estimation is driven to discover topics that not only re-generate the words, but also help predict the class label (thus MCM can also be understood as a "semi-supervised" topic model). Furthermore, MCM topics are defined not only by positive word associations, but also by prominent negative word associations. The words with most positive and negative \theta_{x,z} are shown in Table 2.

Table 2: Two MCM-discovered "topics" associated with the politics.guns label in a run on the talk data set. On the left, discussion about gun control in Texas. The negatively-weighted words are prominent in other classes, including politics.misc. On the right, discussion about the gun rights of David Koresh when federal agents stormed their compound in Waco, TX. Aspects of the Davidian cult, however, were discussed in religion.misc.

Topic 1 (gun control)          Topic 2 (Waco incident)
guns           1.27            nra            1.63
texas          1.19            assault        1.52
gun            1.18            waco           1.21
enforcement    1.14            compound       1.19
...            ...             ...            ...
president     -0.83            employer      -0.90
peace         -0.85            cult          -0.94
years         -0.88            terrorists    -1.02
feds          -1.17            matthew       -1.15

Lower-variance Conditional Mixture Estimation

Consider data generated from two classes, each with four sub-classes drawn from 2-D isotropic Gaussians (similar to the example in Jebara and Pentland (2000)). The data are illustrated by red ◦'s and blue ×'s in Figure 3. Using joint, conditional, and multi-conditional likelihood, we fit mixture models with two (diagonal covariance, i.e. naïve) subclasses using conditional expected gradient optimization (Salakhutdinov et al., 2003). The figure depicts the parameters of the best models found under our objectives using ellipses of constant probability under the model.

Figure 3: (Left) Joint likelihood optimization. (Middle) One of the many near optimal solutions found by conditional likelihood optimization. (Right) An optimal solution found by our multi-conditional objective.

From this illustrative example, we see that the parameters estimated by joint likelihood would completely fail to classify ◦ versus × given location. In contrast, the conditional objective focuses completely on the decision boundary; however, across 30 random initializations, this produced parameters with very high variance and little interpretability. Our multi-conditional objective, however, optimizes for both class label prediction and class-conditioned density, yielding good classification accuracy and sensible, low-variance parameter estimates.

Multi-Conditional Harmoniums

We are interested in the quality of the latent representations obtained when optimizing multi-attribute harmonium structured models under standard (joint) maximum likelihood (ML), conditional likelihood (CL) and multi-conditional likelihood (MCL) objectives. We use a similar testing strategy to Welling et al. (2005), but focus on comparing the different latent spaces obtained with the various optimization objectives. As in Welling et al. (2005), we used the reduced 20 newsgroups data set prepared in MATLAB by Sam Roweis. In this data set, 16,242 documents are represented by binary occurrences over a 100-word vocabulary and are labeled as one of four domains.

To evaluate the quality of our latent space, we retrieve documents that have the same domain label as a test document based on their cosine coefficient in the latent space when observing only binary occurrences. We randomly split the data into a training set of 12,000 documents and a test set of 4,242 documents. We use a joint model with a corresponding full rank multi-variate Bernoulli conditional for binary word occurrences and a discrete conditional for domains. Figure 4 shows the precision-recall results. ML-1 is our model with no domain label information. ML-2 is optimized with domain label information. CL is optimized to predict domains from words, and MCL is optimized to predict both words from domains and domains from words. From Figure 4 we see that the latent space captured by the model is more relevant for domain classification when the model is optimized under the CL and MCL objectives. MCL more than doubles the precision at reasonable levels of recall, and more than doubles the recall at reasonable levels of precision.

Figure 4: Precision-recall curves for the "20 newsgroups" data using ML, CL and MCL with 20 latent variables. Random guessing is a horizontal line at 0.25.
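The retrieval protocol described above reduces to projecting each document into the latent space with the hidden-unit means of (17) and ranking training documents by cosine similarity. The sketch below shows that computation under stated assumptions (a learned W and mu, rows of X as binary occurrence vectors); the precision-at-k summary and all names are illustrative, not the paper's evaluation code.

```python
import numpy as np

def latent_retrieval(W, mu, X_train, y_train, X_test, y_test, k=10):
    """Rank training documents for each test document by cosine similarity
    of latent projections mu + W^T x (cf. eq. 17) and report precision at k.

    W : (M_v, K) learned weights, mu : (K,), X_* : binary occurrence matrices.
    """
    def project(X):
        Z = mu + X @ W                                    # hidden-unit means
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Z_train, Z_test = project(X_train), project(X_test)
    sims = Z_test @ Z_train.T                             # cosine coefficients
    hits = 0
    for i in range(len(X_test)):
        top = np.argsort(-sims[i])[:k]
        hits += np.sum(y_train[top] == y_test[i])
    return hits / (k * len(X_test))                       # precision at k
```

Sweeping k over the full range of retrieved documents traces out precision-recall curves of the kind summarized in Figure 4.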
Discussion and Conclusions

We have presented multi-conditional learning in the context of naïve MRFs, mixtures of naïve MRFs and harmonium-structured models. For naïve MRFs, we show that multi-conditional learning provides improved regularization, and flexible, robust mixtures. In the context of harmonium-structured models, our experiments show that multi-conditional contrastive-divergence-based optimization procedures can lead to latent document spaces of superior quality.

Multi-conditional learning is well suited for multi-task and semi-supervised learning, since multiple prediction tasks are easily and naturally defined in the MCL framework. In recent work by Ando and Zhang (2005), semi-supervised and multi-task learning methods are combined. Their approach involves auxiliary prediction problems defined for unlabeled data such that model structures arising from these tasks are also useful for another classification problem of particular interest; it involves finding the principal components of the parameter space for auxiliary tasks. One can similarly use the MCL approach to define auxiliary conditional distributions among features. In this way MCL is a natural framework for semi-supervised learning. We are presently exploring MCL in these multi-task and semi-supervised settings.

Acknowledgements

This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grant #IIS-0326249, and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010.

References

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6, 1817–1853.

Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.

Berger, A. L., Pietra, S. A. D., & Pietra, V. J. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–72.

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24, 179–195.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Corduneanu, A., & Jaakkola, T. (2003). On information regularization. Proceedings of Uncertainty in Artificial Intelligence.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Jebara, T., & Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. In Neural Information Processing Systems (NIPS), 11.

Jebara, T., & Pentland, A. (2000). On reversing Jensen's inequality. NIPS 13.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.

Kschischang, F. R., Frey, B., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 498–519.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. ICML, 282–289.

McCallum, A., Bellare, K., & Pereira, F. (2005). A conditional random field for discriminatively-trained finite-state string edit distance. Conference on Uncertainty in AI (UAI).

Minka, T. (2005). Discriminative models, not discriminative training. MSR-TR-2005-144.

Ng, A. Y., & Jordan, M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. NIPS 14.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Pavlov, D., Popescul, A., Pennock, D., & Ungar, L. (2002). Mixtures of conditional maximum entropy models. NEC Research Institute Technical Report NECI.

Quattoni, A., Collins, M., & Darrell, T. (2004). Conditional random fields for object recognition. NIPS 17, 1097–1104.

Raina, R., Shen, Y., Ng, A. Y., & McCallum, A. (2003). Classification with hybrid generative/conditional models. NIPS.

Salakhutdinov, R., Roweis, S., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. Proc. ICML.

Smolensky, P. (1986). Information processing in dynamical systems: foundations of harmony theory. In D. Rumelhart and J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, 194–281. MIT Press.

Sutton, C., & McCallum, A. (2006). An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to statistical relational learning. MIT Press. To appear.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. NIPS, 1481–1488.

Xing, E., Yan, R., & Hauptmann, A. G. (2005). Mining associated text and images with dual-wing harmoniums. Proc. Uncertainty in Artificial Intelligence.