Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke
University of Amsterdam, Amsterdam, The Netherlands
{h.azarbonyad,dehghani,tom.kenter,maartenmarx,kamps,derijke}@uva.nl

Abstract. A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.

1 Introduction

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal [3] or to determine the interestingness of a document [2]. An influential formalization of diversity has been introduced in biology [17]. It decomposes diversity in terms of elements that belong to categories within a population [20] and formalizes the diversity of a population d as the expected distance between two randomly selected elements of the population:

    div(d) = \sum_{i=1}^{T} \sum_{j=1}^{T} p_i p_j \delta(i,j),    (1)

where p_i and p_j are the proportions of categories i and j in the population and δ(i,j) is the distance between i and j. Bache et al. [3] have adapted this notion of diversity to quantify the topical diversity of a text document. Words are considered elements, topics are categories, and a document is a population. When using topic modeling for measuring the topical diversity of a text document d, Bache et al. [3] model elements based on the probability of a word w given d, P(w|d), categories based on the probability of w given topic t, P(w|t), and populations based on the probability of t given d, P(t|d).

In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., P(w|d) is sparse). Second, each topic is assumed to contain only some topic-specific related words (i.e., P(w|t) is sparse). Finally, each document is assumed to deal with a few topics only (i.e., P(t|d) is sparse). When approximated using currently available methods, P(w|t) and P(t|d) are often dense rather than sparse [13, 19, 21]. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low-quality P(t|d) distributions.

We propose a hierarchical re-estimation process for making the distributions P(w|d), P(w|t), and P(t|d) more sparse. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation we use the concept of parsimony [9] to extract only the essential parameters of each distribution.
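To make Equation 1 concrete, here is a minimal sketch of how Rao's coefficient can be computed from a document's topic proportions P(t|d) and a distance between topics. Using a normalized angular distance for δ follows §4 of this paper, but the exact normalization shown, and all function and variable names, are our assumptions for illustration only.

```python
import numpy as np

def angular_distance(p, q):
    """Assumed normalized angular distance between two topic-word
    distributions: arccos of cosine similarity, scaled to [0, 1]
    (probability vectors are non-negative, so arccos is in [0, pi/2])."""
    cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(cos, 0.0, 1.0)) / (np.pi / 2)

def rao_diversity(p_t_d, topic_word):
    """Rao's coefficient (Eq. 1): the expected distance between two
    topics drawn independently from P(t|d).

    p_t_d:      length-T vector, P(t|d) for one document
    topic_word: T x V matrix, row t holds P(w|t)
    """
    T = len(p_t_d)
    return sum(p_t_d[i] * p_t_d[j] * angular_distance(topic_word[i], topic_word[j])
               for i in range(T) for j in range(T))
```

A document whose probability mass sits on a few nearby topics scores close to 0; mass spread over distant topics pushes the score toward 1.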
Our main contributions are: (1) We propose a hierarchical re-estimation process for topic models to address two main problems in estimating the topical diversity of text documents, using a biologically inspired definition of diversity. (2) We study the efficacy of each level of re-estimation, and improve the accuracy of estimating topical diversity, outperforming the current state of the art [3] on a publicly available dataset commonly used for evaluating document diversity [1].

2 Related work

Our hierarchical re-estimation method for measuring topical diversity relates to measuring text diversity, improving the quality of topic models, model parsimonization, and evaluating topic models.

Text diversity and interestingness. Recent studies measure the topical diversity of documents [2, 3, 8] by means of Latent Dirichlet Allocation (LDA) [4]. The main diversity measure in this work is Rao's measure [17] (Equation 1), in which the diversity of a text document is proportional to the number of dissimilar topics it covers. While we also use Rao's measure, we hypothesize that pure LDA is not good enough for modeling text diversity and propose a re-estimation process for adapting topic models for measuring topical diversity.

Improving the quality of topic models. The two most important issues with topic models are the generality problem and the impurity problem [5, 13, 19, 21]. Many approaches have been proposed to address the generality problem [21-23]. The main difference with our work is that previous work does not yield sparse topic representations or topic word distributions. Soleimani and Miller [19] propose parsimonious topic models (PTM) to address the generality and impurity problems. PTM achieves state-of-the-art results compared to existing topic models. Unlike [19], we do not modify the training procedure of LDA but propose a method to refine the topic models.

Model parsimonization. In language model parsimonization, the language model of a document is considered to be a mixture of a general background model and a document-specific language model [6, 7, 9, 26]. The goal is to extract the document-specific part and remove the general words. We employ parsimonization for re-estimating topic models. The main assumption in [9] is that the language model of a document is a mixture of its specific language model and a general language model:

    P(w|d) = \lambda P(w|\tilde{\theta}_d) + (1-\lambda) P(w|\theta_C),    (2)

where w is a term, d a document, \tilde{\theta}_d the document-specific language model of d, \theta_C the language model of the collection C, and \lambda a mixing parameter. The main goal is to estimate P(w|\tilde{\theta}_d) for each document. This is done in an iterative manner using the EM algorithm. The initial parameters of the language model are the parameters of the standard language model, estimated using maximum likelihood: P(w|\tilde{\theta}_d) = tf_{w,d} / \sum_{w'} tf_{w',d}, where tf_{w,d} is the frequency of w in d. The following steps are computed iteratively:

E-step:
    e_w = tf_{w,d} \cdot \frac{\lambda P(w|\tilde{\theta}_d)}{\lambda P(w|\tilde{\theta}_d) + (1-\lambda) P(w|\theta_C)},    (3)

M-step:
    P(w|\tilde{\theta}_d) = \frac{e_w}{\sum_{w'} e_{w'}},    (4)

where \tilde{\theta}_d is the parsimonized language model of document d, C is the background collection, P(w|\theta_C) is estimated using maximum likelihood estimation, and \lambda is a parameter that controls the level of parsimonization. A low value of \lambda results in a more parsimonized model, while \lambda = 1 yields a model without parsimonization. The EM process stops after a fixed number of iterations or after convergence.

Evaluating topic models. We evaluate the effectiveness of our re-estimated models by measuring the topical diversity of text documents. In addition, in §6, we analyze the effectiveness of our re-estimation approach in terms of purity in document clustering and document classification tasks. For classification, following [10, 16, 19], we model topics as document features with values P(t|d). For clustering, each topic is considered a cluster and each document is assigned to its most probable topic [16, 24, 25].
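The EM procedure in Equations 3 and 4 is mechanical enough to sketch directly. The following is a minimal illustration, not the authors' code; the pruning threshold mirrors the 0.0001 value reported in §4, while the function name, fixed iteration count, and placement of the pruning step inside the loop are our choices.

```python
import numpy as np

def parsimonize(tf, p_background, lam, n_iter=50, threshold=1e-4):
    """Parsimonize one distribution against a background model (Eqs. 3-4).

    tf:           length-V vector of term frequencies tf_{w,d}
    p_background: length-V vector, collection model P(w|theta_C)
    lam:          mixing parameter lambda; lower = more parsimonious
    """
    p_doc = tf / tf.sum()                               # ML initialization
    for _ in range(n_iter):
        # E-step (Eq. 3): expected counts credited to the specific model
        mix = lam * p_doc + (1.0 - lam) * p_background
        e = tf * lam * p_doc / np.maximum(mix, 1e-12)
        # M-step (Eq. 4): renormalize the expected counts
        p_doc = e / e.sum()
        # prune near-zero terms, as described in Sec. 4, and renormalize
        p_doc[p_doc < threshold] = 0.0
        p_doc /= p_doc.sum()
    return p_doc
```

Because the same mixture-and-prune loop is reused at all three levels of HiTR, this one function reappears in the sketches below.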
3 Measuring topical diversity of documents

To measure the topical diversity of text documents, we propose HiTR (hierarchical topic model re-estimation). HiTR can be applied to any topic modeling approach that models documents as distributions over topics and topics as distributions over words.

The input to HiTR is a corpus of text documents. The output is a probability distribution over topics for each document in the corpus. HiTR has three levels of re-estimation: (1) document re-estimation (DR) re-estimates the language model per document, P(w|d); (2) topic re-estimation (TR) re-estimates the language model per topic, P(w|t); and (3) topic assignment re-estimation (TAR) re-estimates the distribution over topics per document, P(t|d). Based on applying or not applying re-estimation at the different levels, there are seven possible re-estimation approaches; see Fig. 1. HiTR refers to the model that uses all three re-estimation techniques, i.e., TM+DR+TR+TAR. Next, we describe each of the re-estimation steps in more detail.

[Fig. 1: Different topic re-estimation approaches. TM is a topic modeling approach like, e.g., LDA. DR is document re-estimation, TR is topic re-estimation, and TAR is topic assignment re-estimation.]

Document re-estimation (DR) re-estimates P(w|d). Here, we remove unnecessary information from documents before training topic models. This is comparable to pre-processing steps, such as removing stopwords and high- and low-frequency words, that are typically carried out prior to applying topic models [4, 11, 15, 16]. Proper pre-processing of documents, however, takes a lot of effort and involves tuning many parameters. Document re-estimation, in contrast, removes impure elements (general words) from documents automatically. If general words are absent from documents, we expect that the trained topic models will not contain general topics. After document re-estimation, we can train any standard topic model on the re-estimated documents.

Document re-estimation uses the parsimonization method described in §2. The re-estimated model P(w|\tilde{\theta}_d) in (4) is used as the language model of document d, and after removing unnecessary words from d, the frequencies of the remaining words (words with P(w|\tilde{\theta}_d) > 0) are re-estimated for d using the following equation:

    tf(w,d) = \lfloor P(w|\tilde{\theta}_d) \cdot |d| \rfloor,

where |d| is the document length in words. Topic modeling is then applied on the re-estimated document-word frequency matrix.
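Under the same assumptions, DR reduces to one call to the parsimonize function sketched in §2, followed by the floor operation that maps the re-estimated model back to integer counts. The default λ below is only an example taken from the range that §5.4 reports works best for DR (0.4 ≤ λ ≤ 0.45); it is not prescribed by the method itself.

```python
import numpy as np

def document_reestimation(tf, p_background, lam=0.45):
    """DR: parsimonize P(w|d), then re-estimate term frequencies as
    tf(w, d) = floor(P(w|theta~_d) * |d|), with |d| the document length."""
    doc_len = tf.sum()
    p_specific = parsimonize(tf, p_background, lam)  # sketched in Sec. 2
    return np.floor(p_specific * doc_len).astype(int)
```

Any standard topic model (e.g., LDA) is then trained on the matrix of re-estimated frequencies.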
Topic re-estimation (TR) re-estimates P(w|t) by removing general words. The re-estimated distributions are used to assign topics to documents. The goal of this step is to increase the purity of topics by removing general words that have not yet been removed by DR. The two main advantages of the increased purity of topics are that (1) it improves human interpretation of topics, and (2) it leads to more document-specific topic assignments, which is essential for measuring the topical diversity of documents.

Our main assumption is that each topic's language model is a mixture of its topic-specific language model and the language model of the background collection. TR extracts a topic-specific language model for each topic and removes the part that can be explained by the background model. We initialize \tilde{\theta}_t and \theta_T as follows:

    P(w|\tilde{\theta}_t) = P(w|\theta_t^{TM}),    P(w|\theta_T) = \frac{\sum_{t \in T} P(w|\theta_t^{TM})}{\sum_{w' \in V} \sum_{t' \in T} P(w'|\theta_{t'}^{TM})},

where t is a topic, \tilde{\theta}_t is the topic-specific language model of t, \theta_T is the background language model of T (the collection of all topics), and P(w|\theta_t^{TM}) is the probability of w belonging to topic t as estimated by a topic model TM. Given these initial estimates, the steps of TR are similar to the steps of parsimonization, except that in the E-step we estimate tf_{w,t}, the frequency of w in t, by P(w|\theta_t^{TM}).

Topic assignment re-estimation (TAR) re-estimates P(t|d). In topic modeling, most topics are usually assigned with a non-zero probability to most documents. For a document that is in reality about only a few topics, such a topic assignment is incorrect and overestimates its diversity. TAR addresses the general topics problem and achieves more document-specific topic assignments. To re-estimate topic assignments, a topic model is first trained on the document collection. This model is used to assign topics to documents based on the proportion of words they have in common. We then model the distribution over topics per document as a mixture of its document-specific topic distribution and the topic distribution of the entire collection. We initialize P(t|\tilde{\theta}_d) and P(t|\theta_C) as follows:

    P(t|\tilde{\theta}_d) = P(t|\theta_d^{TM}),    P(t|\theta_C) = \frac{\sum_{d \in C} P(t|\theta_d^{TM})}{\sum_{t' \in T} \sum_{d' \in C} P(t'|\theta_{d'}^{TM})}.

Here, t is a topic, d a document, P(t|\tilde{\theta}_d) the document-specific topic distribution, P(t|\theta_C) the distribution of topics in the entire collection C, and P(t|\theta_d^{TM}) the probability of assigning topic t to document d as estimated by a topic model TM. The remaining steps of TAR follow those of parsimonization, the difference being that in the E-step we estimate f_{t,d} using P(t|\theta_d^{TM}).
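TR and TAR run the same EM loop as §2; only the inputs change. As a hedged illustration, the TAR sketch below feeds P(t|θ_d^TM) in place of term frequencies, exactly as the text describes; TR would analogously parsimonize each P(w|θ_t^TM) against the background over all topics. The matrix layout and the default λ (from the 0.02-0.05 range found best in §5.4) are our assumptions.

```python
import numpy as np

def topic_assignment_reestimation(theta, lam=0.05):
    """TAR: re-estimate P(t|d) for every document.

    theta: D x T matrix; row d holds P(t|theta_d^TM) from a trained
           topic model. The background P(t|theta_C) is the normalized
           column sum, matching the initialization given in the text.
    """
    p_background = theta.sum(axis=0)
    p_background /= p_background.sum()
    # In the E-step the "frequency" f_{t,d} is P(t|theta_d^TM) itself,
    # so each row is parsimonized against the collection-wide topic model.
    return np.vstack([parsimonize(row, p_background, lam) for row in theta])
```

A low λ here means aggressive re-estimation, which §5.4 finds appropriate because most of the noise sits in the P(t|d) distributions.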
4 Experimental setup

Our main research question is: (RQ1) How effective is HiTR in measuring the topical diversity of documents? How does it compare to the state of the art in addressing the general and impure topics problems?

To address RQ1 we run our models on a binary classification task. We generate a synthetic dataset of documents with high and low topical diversity (the process is detailed below), and the task for every model is to predict whether a document belongs to the high or low diversity class. We employ HiTR to re-estimate topic models and use the re-estimated models for measuring the topical diversity of documents. To gain deeper insight into how HiTR performs, we conduct a separate analysis of the last two levels of re-estimation, TR and TAR:¹ (RQ2.1) Does TR increase the purity of topics? If so, how does using the more pure topics influence performance on the topical diversity task? (RQ2.2) How does TAR affect the sparsity of document-topic assignments? And what is the effect of re-estimated document-topic assignments on the topical diversity task?

To answer RQ2.1, we first evaluate the performance of TR on the topical diversity task and compare it to DR and TAR. To answer RQ2.2, we first evaluate TAR together with LDA on the topical diversity task and analyze its effect on the performance of LDA, to study how successful TAR is in removing general topics from documents.

¹ As the DR level of re-estimation directly employs the parsimonious language modeling techniques of [9], we omit it from our in-depth analysis.

Dataset, pre-processing, evaluation metrics, and parameters. Following [3], we generate 500 documents with a high value of diversity and 500 documents with a low value of diversity. We select over 300,000 articles published between 2012 and 2015 from PubMed [1]. For generating documents with a high value of diversity, we first select 20 journals and create 10 pairs of journals. Each pair contains two journals that are relatively unrelated to each other (we use the pairs of journals selected in [3]). For each pair of journals A and B we select 50 articles to create 50 probability distributions over topics: we randomly select one article from A and one from B and generate a document by averaging the selected articles' bags of topic counts. Thus, for each pair of journals we generate 50 documents with a high diversity value. Also, for each of the chosen 20 journals, we repeat the procedure but, instead of choosing articles from different journals, we select them from the same journal, to generate 25 non-diverse documents.

For pre-processing documents, we remove stopwords included in the standard stop word list from Python's NLTK package. In addition, we remove the 100 most frequent words in the collection and words with fewer than 5 occurrences.

Measuring topical diversity. After re-estimating word distributions in documents, topics, and document topic distributions using HiTR, we use the final distributions over topics per document for measuring topical diversity. Diversity of texts is computed with Rao's coefficient [3] using Equation 1. We use the normalized angular distance for δ when measuring the distance between topics, since it is a proper distance function [2].

To measure the performance of topic models on the topical diversity task, we use ROC curves and report AUC values [3]. We also measure the coherence of the extracted topics; this measure indicates the purity of P(w|t) distributions, where a high value of coherence implies high purity within topics. We estimate coherence using normalized pointwise mutual information between the top N words within a topic [11, 16]. As the reference corpus for computing word occurrences, we use the English Wikipedia.²

² We use a dump of June 2, 2015, containing 15.6 million articles.

The topic modeling approach used in our experiments with HiTR is LDA. Following [3, 18, 19] we set the number of topics to 100. We set the two hyperparameters to α = 1/T and β = 0.01, where T is the number of topics, following [16]. In the re-estimation process, at each step of the EM algorithm, we set the threshold for removing unnecessary components from the model to 0.0001 and remove terms with an estimated probability less than this threshold from the language models, as in [9].

We perform 10-fold cross-validation, using 8 folds as training data, 1 fold to tune the parameters (λ for DR, TR, and TAR), and 1 fold for testing. Our baseline for the topical diversity task is the method proposed in [3], which uses LDA. We also compare our results to PTM [19], which we use instead of LDA for measuring topical diversity. PTM is the best available topic modeling approach, and the current state of the art.

For statistical significance testing, we compare our methods to PTM using paired two-tailed t-tests with Bonferroni correction. To account for multiple testing, we consider an improvement significant if p ≤ α/m, where m is the number of conducted comparisons and α is the desired significance level. We set α = 0.05. In §5, ▲ and ▼ indicate that the corresponding method performs significantly better or worse than PTM, respectively.
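For completeness, a minimal sketch of the coherence measure: average normalized PMI over pairs of the top-N topic words, with probabilities looked up from a reference corpus (the paper uses English Wikipedia). The dictionary-based interface, the convention of scoring never-co-occurring pairs as -1, and any scaling relative to the values reported in Table 2 are our assumptions; the papers cited as [11, 16] define the exact variant used.

```python
import itertools
import math

def npmi_coherence(top_words, p_word, p_pair):
    """Average NPMI over all pairs of a topic's top-N words.

    p_word: dict mapping word -> marginal probability in the reference corpus
    p_pair: dict mapping (w1, w2) -> joint (co-occurrence) probability
    """
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        joint = p_pair.get((w1, w2), 0.0)
        if joint == 0.0:
            scores.append(-1.0)   # assumed convention for unseen pairs
        else:
            pmi = math.log(joint / (p_word[w1] * p_word[w2]))
            scores.append(pmi / -math.log(joint))
    return sum(scores) / len(scores)
```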
5 Results

In this section, we report on the performance of HiTR on the topical diversity task. Additionally, we analyze the effectiveness of the individual re-estimation approaches.

5.1 Topical diversity results

Fig. 2 plots the performance of our topic models across different levels of re-estimation, and of the models we compare to, on the PubMed dataset. We plot ROC curves and compute AUC values. To plot the ROC curves we use the diversity scores calculated for the generated pseudo-documents with diversity labels. HiTR improves the performance of LDA by 17% and of PTM by 5% in terms of AUC.

[Fig. 2: ROC curves for topic models on the topical diversity task on the PubMed dataset. AUC values: Random 0.50, LDA 0.70, LDA+TR 0.73, LDA+DR 0.75, LDA+TAR 0.77, PTM 0.78, HiTR 0.82. The improvement of HiTR over PTM is statistically significant (p < 0.05) in terms of AUC.]

From Fig. 2, two observations can be made. First, HiTR benefits from the three re-estimation approaches it encapsulates by successfully improving the quality of the estimated diversity scores. Second, the performance of LDA+TAR, which addresses the generality problem, is higher than the performance of LDA+TR, which addresses impurity. General topics have a stronger negative effect on measuring topical diversity than impure topics. Also, LDA+DR outperforms LDA+TR. So, removing impurity from P(t|d) distributions is the most effective approach in the topical diversity task, and removing impurity from P(w|d) distributions is more effective than removing impurity from P(w|t) distributions.

Table 1 illustrates the difference between LDA and HiTR with the topics assigned by the two methods to a non-diverse document that is combined from two documents from the same journal, entitled "Molecular Neuroscience: Challenges Ahead" and "Reward Networks in the Brain as Captured by Connectivity Measures," using the procedure described in §4. As only a very basic stopword list was applied, words like "also" and "one" still appear. We expect a low diversity value for the combined document. However, using Rao's diversity measure, the topical diversity of this document based on the LDA topics is 0.97. This is due to the fact that there are three document-specific topics (topics 1, 2, and 4) and four general topics. Topics 1 and 2 are very similar and their δ is 0.13. The other, more general topics have high δ values; the average δ value between pairs of topics is as high as 0.38. For the same document, HiTR only assigns three document-specific topics, and they are more pure and coherent. The average δ value between pairs of topics assigned by HiTR is 0.19. The diversity value of this document using HiTR is 0.16, which indicates that this document is non-diverse. Hence, HiTR is more effective than the other approaches in measuring the topical diversity of documents; it successfully removes generality from P(t|d).

Table 1: Topic assignments for a non-diverse document using LDA and HiTR. Only topics with P(t|d) > 0.05 are shown.

       LDA                                                    HiTR
Topic  P(t|d)  Top 5 words                                    P(t|d)  Top 5 words
1      0.21    brain, anterior, neurons, cortex, neuronal     0.68    brain, neuronal, neurons, neurological, nerve
2      0.14    channel, neuron, membrane, receptor, current   0.23    channel, synaptic, neuron, receptor, membrane
3      0.10    use, information, also, new, one               0.09    network, nodes, cluster, community, interaction
4      0.08    network, nodes, cluster, functional, node
5      0.08    using, method, used, image, algorithm
6      0.08    time, study, days, period, baseline
7      0.07    data, values, number, average, used

5.2 Topic re-estimation results

To answer RQ2.1, we focus on topic re-estimation (TR). Since TR tries to remove impurity from topics, we expect it to increase the coherence of the topics by removing unnecessary words from them.
We measure the purity of topics based on the coherence of words in P(w|t) distributions. Table 2 shows the coherence of topics according to different topic modeling approaches, in terms of average mutual information.

Table 2: Topic model coherence in terms of average normalized mutual information between the top 10 words in the topics on the PubMed dataset. ▲ indicates a statistically significant improvement over PTM.

LDA    PTM    LDA+TR    LDA+DR+TR
8.17   9.89   9.46      10.29▲

TR significantly increases the coherence of topics by removing the impure parts from topics. The coherence of PTM is higher than that of TR. However, when we first apply DR, then train LDA, and finally apply TR, the coherence of the extracted topics is significantly higher than the coherence of topics extracted by PTM. We conclude that TR is effective in removing impurity from topics. Moreover, DR also contributes to making topics more pure.

5.3 Topic assignment re-estimation results

To answer RQ2.2, we focus on TAR (topic assignment re-estimation). We are interested in seeing how HiTR deals with general topics. We sum the probability of assigning a topic to a document over all documents: for each topic t, we compute \sum_{d \in C} P(t|d), where C is the document collection. Fig. 3 shows the distribution of probability mass before and after applying TAR; topics are sorted based on the topic assignment probability of LDA. LDA assigns a vast proportion of the probability mass to a relatively small number of topics, mostly general topics that are assigned to most documents. We expect that many topics are represented in some documents, while relatively few topics will be relevant to all documents. After applying TAR, the distribution is less skewed and the probability mass is more evenly distributed.

[Fig. 3: The total probability of assigning topics to the documents in the PubMed dataset, estimated using LDA and LDA+TAR. The two areas are equal to the number of documents (N ≈ 300K).]

There are topics that have a high \sum_d P(t|d) value in LDA's topic assignments and still have a high \sum_d P(t|d) value after applying TAR; we mark them as "non-general topics" in Fig. 3. Table 3, column 2 shows the top five words for these topics. TAR is able to find these three non-general topics, and their assignment probabilities to documents in the P(t|d) distributions are not changed as much as those of the actual general topics. Thus, TAR removes general topics from documents and increases the probability of document-specific topics for each document. To further investigate whether TAR really removes general topics, Table 3, column 3 shows the top five words for the first 10 topics in Fig. 3, excluding the "non-general topics." These seven topics have the highest decrease in \sum_d P(t|d) values due to TAR. Clearly, they contain general words and are not informative. Fig. 3 shows that after applying TAR, the \sum_d P(t|d) values have decreased dramatically for these topics, without creating new general topics.

Table 3: Top five words for the topics detected by TAR as general topics and non-general topics.

Topic  Non-general topics                                     General topics
1      health, services, public, countries, data              use, information, also, new, one
2      surgery, surgical, postoperative, patient, performed   ci, study, analysis, data, variables
3      cells, cell, treatment, experiments, used              time, study, days, period, baseline
4                                                             group, control, significantly, compared, groups
5                                                             study, group, subject, groups, significant
6                                                             may, also, effects, however, would
7                                                             data, values, number, average, used

5.4 Parameter analysis

Next, we analyze the effect of the λ parameter on the performance of DR, TR, and TAR. Fig. 4 displays the performance at the different levels of re-estimation. With λ = 1, no re-estimation occurs, and all methods equal LDA. We see that DR peaks at moderate values of λ (0.4 ≤ λ ≤ 0.45). This reflects that documents contain a moderate amount of general information and that DR is able to successfully deal with it.
For λ ≥ 0.8, the performance of DR and LDA is the same; for these values of λ, DR does not increase the quality of LDA. Also, the best performance of TR is achieved with high values of λ (0.65 ≤ λ ≤ 0.75). From this observation we conclude that topics typically need only a small amount of re-estimation. With this slight re-estimation, TR is able to improve the quality of LDA. However, for λ ≥ 0.75 the accuracy of TR degrades. Lastly, TAR achieves its best performance with low values of λ (0.02 ≤ λ ≤ 0.05). Hence, most of the noise is in the P(t|d) distributions, and aggressive re-estimation allows TAR to remove most of it.

[Fig. 4: The effect of the λ parameter on the performance of topic models in the topical diversity task on the PubMed dataset (AUC as a function of λ for LDA, DR, TR, and TAR).]

6 Analysis

In this section, we want to gain additional insight into HiTR and its effects on topic computation. The purity of topic assignments based on P(t|d) distributions has the highest effect on the quality of estimated diversity scores. Thus, we investigate how pure the topic assignments estimated by HiTR are. To this end, we compare document clustering and classification results based on the topics assigned by HiTR, LDA, and PTM. For clustering, following [16], we consider each topic as a cluster. Each document d is assigned to the topic that has the highest probability value in P(t|d). For classification, we use all topics assigned to the document and consider P(t|d) as features for a supervised classification algorithm; we use SVM. We view high accuracy in clustering or classification as an indicator of high purity of topic distributions. Our focus is not on achieving top clustering or classification performance: these tasks are a means to assess the purity of topic distributions obtained with different topic models.

Datasets and metrics. We use RCV1 [12], 20-NewsGroups,³ and Ohsumed.⁴ RCV1 contains 806,791 documents with category labels for 126 categories. For clustering and classification of documents, we use the 55 categories in the second level of the hierarchy. 20-NewsGroups contains ~20,000 documents (20 categories, around 1,000 documents per category). Ohsumed contains 50,216 documents grouped into 23 categories. For measuring the purity of clusters we use purity and normalized mutual information (NMI) [14]. We use 10-fold cross-validation and the same pre-processing as in §4.

³ Available at http://www.ai.mit.edu/people/˜jrennie/20Newsgroups/
⁴ Available at http://disi.unitn.it/moschitti/corpora.htm

Purity results. The top part of Table 4 shows results on the document clustering task. As we can see, the topic distributions extracted using HiTR score higher than the ones extracted using LDA and PTM in terms of both purity and NMI. This shows the ability of HiTR to make P(t|d) more pure. The two-level re-estimated topic models achieve higher purity values than their respective one-level counterparts, except the combination of DR and TR, which indicates that re-estimation at each level contributes to the purity of P(t|d). The combination of TR and DR is not effective in increasing purity over its one-level counterparts on most of the datasets, indicating that TR and DR address similar issues. But when each of them is combined with TAR, the purity of the topic distributions increases, implying that DR/TR and TAR address complementary issues.

The bottom part of Table 4 shows results on the document classification task. HiTR is more accurate in estimating P(t|d); its accuracy is higher than that of the other topic models. The higher values on the classification task, compared to the clustering task, indicate that the most probable topic does not necessarily contain all information about the content of a document. If a document is about more than one topic, the classifier utilizes all P(t|d) information and performs better. Therefore, the higher accuracy of HiTR on this task is an indicator of its ability to assign document-specific topics to documents.
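The two evaluation protocols of this section are easy to sketch. Below, clustering assigns each document to its most probable topic and scores purity and NMI against gold labels, and classification feeds the full P(t|d) vector to an SVM with 10-fold cross-validation. The use of scikit-learn and of a linear SVM are our choices, since the paper does not specify an implementation.

```python
import numpy as np
from sklearn import metrics, svm
from sklearn.model_selection import cross_val_score

def clustering_scores(theta, labels):
    """Cluster by most probable topic; report purity and NMI.

    theta:  D x T matrix of P(t|d)
    labels: length-D NumPy array of integer gold category labels
    """
    clusters = theta.argmax(axis=1)
    # purity: each cluster is credited with its majority gold label
    purity = sum(np.bincount(labels[clusters == c]).max()
                 for c in np.unique(clusters)) / len(labels)
    nmi = metrics.normalized_mutual_info_score(labels, clusters)
    return purity, nmi

def classification_accuracy(theta, labels):
    """10-fold cross-validated accuracy of an SVM on P(t|d) features."""
    return cross_val_score(svm.LinearSVC(), theta, labels, cv=10).mean()
```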
7 Conclusions

We have proposed Hierarchical Topic model Re-estimation (HiTR), an approach for measuring the topical diversity of text documents. It addresses two main issues with topic models, topic generality and topic impurity, which negatively affect topical diversity scores in three ways. First, the existence of document-unspecific words within P(w|d) (the distribution of words within documents) yields general topics and impure topics. Second, the existence of topic-unspecific words within P(w|t) (the distribution of words within topics) yields impure topics. Third, the existence of document-unspecific topics within P(t|d) (the distribution of topics within documents) yields general topics. We have proposed three approaches for removing unnecessary or even harmful information from these probability distributions, which we combine in our HiTR method.

Estimated diversity scores for documents using HiTR are more accurate than those obtained using the current state-of-the-art topic modeling method PTM, or a general purpose topic model such as LDA. HiTR outperforms PTM because it adapts topic models for the topical diversity task. The quality of topic models for measuring topical …
