Discriminative Neural Topic Models

Gaurav Pandey and Ambedkar Dukkipati
Department of Computer Science and Automation
Indian Institute of Science, Bangalore, India
{gaurav.pandey, ad}@csa.iisc.ernet.in

arXiv:1701.06796v2 [cs.LG] 28 Feb 2017

Abstract

We propose a neural network based approach for learning topics from text and image datasets. The model makes no assumptions about the conditional distribution of the observed features given the latent topics. This allows us to perform topic modelling efficiently, using sentences of documents and patches of images as observed features rather than limiting ourselves to words. Moreover, the proposed approach is online, and hence can be used for streaming data. Furthermore, since the approach utilizes neural networks, it can be implemented on GPUs with ease, and hence it is very scalable.

1 Introduction

Mixed membership modeling is the task of assigning the observed features of an object to latent classes. Each object is assumed to be a mixture over the latent classes, and these classes are shared among the objects. This is distinct from mixture modeling, where all the features of the object are assumed to be sampled from a single latent class.

Traditional approaches to mixed membership modeling assume a parametric form for the distribution of a latent class over the features, and for the distribution of an object over these classes. For instance, in the case of documents, the observed features are the words and the latent classes are the topics. A topic is assumed to have a multinomial distribution over the words, while a document is assumed to have a multinomial distribution over the topics [6]. This statistical model is also referred to as probabilistic latent semantic indexing (pLSI). One can further define Dirichlet priors on the parameters of the multinomial distributions to obtain a fully Bayesian treatment of topic modeling [1, 9]. This model, commonly referred to as latent Dirichlet allocation (LDA), is widely used for finding topics in document collections and for inferring population structures from genotype data.

Both LDA and pLSI are generative models, whereby the conditional distribution of the observed features given the latent classes is explicitly modeled. This requires explicit knowledge about the parametric form of these distributions. The number of unique words in a language is finite, and hence it is possible to make these distributions as general as possible by choosing them to be multinomial. However, as the set of all unique observed features becomes comparable to the size of the collection, this approach becomes less and less meaningful.

Hence, while generative topic models are useful for modeling words in a document collection, they can't be directly used for modeling sentences and paragraphs in documents, or pixels of an image. An intermediate preprocessing step is required that transforms all the observed features in the collection to a relatively small set of unique 'codewords' [3]. Since this intermediate step is independent of topic modeling, it can result in suboptimal features. Finding good intermediate representations suitable for topic modeling can be as challenging as finding the topics themselves.

While the observed features can be extremely complicated to model (for instance, the pixels in an image), the conditional distribution of the latent classes given the observed features is always multinomial. The parameters of the multinomial distribution are functions of the observed features. In this paper, we propose a model for learning the latent classes of an object without explicitly modeling the observed features. This allows us to make minimal assumptions about the distributions of the observed features. In particular, we propose discriminative neural topic models (DNTM), whereby the conditional distribution of the latent classes or topics given the observed features is directly modeled using neural networks.
The model can be trained end-to-end using backpropagation. To our knowledge, this is the first approach to topic modeling that doesn't explicitly model the distribution of observed features given the latent classes.

In order to establish that DNTM indeed learns meaningful topics, we use it for assigning words to topics for well-known text corpora. The learnt topics are used in secondary tasks such as clustering. Despite the fact that LDA makes minimal assumptions for modeling words in document collections, our model is able to outperform LDA even for these datasets. Note that these experiments only serve as a sanity check confirming that the proposed model indeed learns meaningful topics, and is competitive with LDA even for tasks where LDA makes minimal assumptions.

We also use DNTM for learning topics on the CIFAR-10 dataset in an unsupervised manner. We use a convolutional neural network to convert the image into spatial features, which correspond to the words of a document. The CNN is trained by backpropagating the gradient from the topic model; that is, we train the CNN to learn features that correspond to meaningful topics. In order to prevent the model from overfitting, we use adversarial training by coupling it with a generator [4]. The model is trained to perform poorly on generated images, while the generator is trained to generate images that perform well on the topic modelling task.

2 Discriminative Neural Topic Models

Let X^{(1)}, ..., X^{(n)} be the observed features for n objects, and let Z^{(1)}, ..., Z^{(n)} be the corresponding latent classes. Let K be the number of latent classes.

2.1 Modeling the words in a document

In the case of documents, X_j^{(i)} represents the embedding vector of the j-th word in the i-th document X^{(i)} (the words are taken in an arbitrary order), and Z_j^{(i)} represents the corresponding topic or latent class. In particular, there exists an embedding vector for every word in the vocabulary. The embeddings are initialized randomly and are trained by backpropagating the gradient of the topic model.

The length of the i-th document is given by m_i. We assume that, given the words of the document, the topics are independently distributed, that is,

    P_\theta(Z^{(i)} \mid X^{(i)}) = \prod_{j=1}^{m_i} P_\theta(Z_j^{(i)} \mid X^{(i)}).    (1)

Here Z_j^{(i)}, 1 \le j \le m_i, are multinomial random variables whose parameters are functions of X^{(i)}. An explicit parametric form for this distribution is specified in the experiments section. In the rest of the paper, the dependence of the distributions on \theta is assumed without being explicitly stated.

Since we wish to associate each word X_j^{(i)} with a single topic Z_j^{(i)} with high confidence, we minimize the entropy of P(Z_j^{(i)} \mid X^{(i)}), 1 \le j \le m_i, that is,

    H(Z_j^{(i)} \mid X^{(i)}) = - \sum_{k=1}^{K} P(Z_j^{(i)} = k \mid X^{(i)}) \log P(Z_j^{(i)} = k \mid X^{(i)}).    (2)

Minimizing the entropy ensures that the conditional distribution over the topics for a word is highly peaked at very few topics. This is also referred to as the cluster assumption, and has been used for clustering [7] and semi-supervised learning [5, 2], where it enjoys considerable success.

Next, we need to ensure that the words in a single document have similar distributions over the topics. This corresponds to minimizing the variance between the probability distributions over topics for the words of the document. We achieve this by minimizing the KL-divergence between the conditional distribution of topics given the words, P(Z_j^{(i)} \mid X^{(i)}), and the corresponding average over the words, that is,

    KL(P_{Z_j^{(i)}} \,\|\, P_d ; X^{(i)}) = \sum_{k=1}^{K} P(Z_j^{(i)} = k \mid X^{(i)}) \log \frac{P(Z_j^{(i)} = k \mid X^{(i)})}{P_d(k \mid X^{(i)})},    (3)

where P_d(k \mid X^{(i)}) is the distribution over the topics for document d. If we assume that each observed word in the document occurs with equal probability, then the document distribution over the topics is given by P_d(k \mid X^{(i)}) = \frac{1}{m_i} \sum_{j=1}^{m_i} P(Z_j^{(i)} = k \mid X^{(i)}).
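As a minimal sketch of Eqs. (1)-(3), the code below computes the per-word topic posteriors, the per-word entropy term and the KL term to the document average. It is an illustration only: the paper specifies the parametric form later, so the softmax-over-a-linear-layer parameterization, the assumption that each word's topic distribution depends on X^{(i)} only through that word's embedding, the PyTorch framework, and all sizes are our placeholders.

```python
# Illustrative sketch of Eqs. (1)-(3); parameterization and sizes are assumptions.
import torch
import torch.nn.functional as F

vocab_size, embed_dim, num_topics = 8165, 128, 20        # placeholder sizes

embeddings = torch.nn.Embedding(vocab_size, embed_dim)   # word embeddings X_j
topic_layer = torch.nn.Linear(embed_dim, num_topics)     # word -> topic logits

def per_word_terms(word_ids):
    """word_ids: LongTensor of shape (m_i,) holding the word ids of one document."""
    x = embeddings(word_ids)                        # (m_i, embed_dim)
    log_p = F.log_softmax(topic_layer(x), dim=-1)   # log P(Z_j = k | X), cf. Eq. (1)
    p = log_p.exp()

    # Eq. (2): entropy of each word's topic distribution.
    entropy = -(p * log_p).sum(dim=-1)              # (m_i,)

    # Document distribution P_d(k | X): average of the per-word distributions.
    p_d = p.mean(dim=0, keepdim=True)               # (1, num_topics)

    # Eq. (3): KL between each word's distribution and the document average.
    kl = (p * (log_p - p_d.log())).sum(dim=-1)      # (m_i,)
    return p, entropy, kl
```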
Finally, in order to ensure that all the documents don't get assigned the same topic, we add a term encouraging balance among the topics. Towards this end, we define

    \bar{P}(k) = \frac{1}{n} \sum_{i=1}^{n} P_d(k \mid X^{(i)}).    (4)

We maximize the entropy of the above distribution to enforce a uniform prior on the topics. Ways to relax this assumption are discussed later in the paper.

Combining the above three criteria, the objective function for the j-th word of the i-th document is given by

    F_{ij}(\theta) = H(Z_j^{(i)} \mid X^{(i)}) + KL(P_{Z_j^{(i)}} \,\|\, P_d ; X^{(i)}) - H(\bar{P}),    (5)

where \theta denotes the parameters of the distribution P(Z_j^{(i)} \mid X^{(i)}). The above term is minimized for all the words of all the documents in the corpus, that is, we minimize \sum_{i=1}^{n} \sum_{j=1}^{m_i} F_{ij}(\theta).

2.2 Regularizing the model

In order to ensure that the model doesn't overfit the training data, we use negative sampling. In particular, for text documents, we randomly sample words from the vocabulary to create a fake document. Next, we force the model to perform poorly on the fake document as follows (a minimal sketch combining the objective of Eq. (5) with this regularizer is given after the list):

1. The distributions over topics for the words in a fake document should be highly uncertain. This corresponds to maximizing the entropy of the corresponding topic distributions, that is, H(Z_j^{(i)} \mid X^{(i)}) is maximized for 1 \le i \le n, 1 \le j \le m_i.

2. The distribution over topics for the words in a fake document should have high variance. This corresponds to maximizing the KL-divergence of the topic distributions from the average for the words in the fake document, that is, KL(P_{Z_j^{(i)}} \,\|\, P_d ; X^{(i)}) is maximized.
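As a minimal sketch, assuming the hypothetical per_word_terms helper from the previous listing, the code below sums the per-word terms of Eq. (5) over a mini-batch of real documents and subtracts the same terms for randomly sampled fake documents, as described above. Estimating the balance term -H(\bar{P}) of Eq. (4) over the mini-batch rather than the whole corpus, and the equal weighting of all terms, are our assumptions.

```python
# Illustrative training signal: Eq. (5) plus the negative-sampling regularizer.
import torch

def batch_loss(real_docs, vocab_size):
    """real_docs: list of LongTensors, each holding the word ids of one document."""
    total, doc_topic_dists = 0.0, []
    for word_ids in real_docs:
        # Real document: minimize per-word entropy (Eq. 2) and KL to the average (Eq. 3).
        p, entropy, kl = per_word_terms(word_ids)
        total = total + entropy.sum() + kl.sum()
        doc_topic_dists.append(p.mean(dim=0))            # P_d(. | X^(i))

        # Fake document of the same length built from random vocabulary words:
        # the same two terms are maximized, hence the minus sign (Section 2.2).
        fake_ids = torch.randint(0, vocab_size, (len(word_ids),))
        _, entropy_f, kl_f = per_word_terms(fake_ids)
        total = total - entropy_f.sum() - kl_f.sum()

    # Balance term of Eqs. (4)-(5): -H(P_bar), here estimated over the mini-batch
    # (an approximation; Eq. (4) averages over all documents in the corpus).
    p_bar = torch.stack(doc_topic_dists).mean(dim=0)
    total = total + (p_bar * p_bar.log()).sum()
    return total
```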
2.3 Document Clustering

First, we verify that the model indeed performs topic modelling. Towards that end, we apply our model to learn topics on 20 Newsgroups, one of the most widely used datasets for text categorization. In order to ensure reproducibility, we use a pre-processed version of the dataset.¹ The pre-processed 20 Newsgroups dataset consists of 11,293 documents for training and 7,528 documents for testing, with a vocabulary size of 8,165 stemmed words. The data is almost evenly divided between the 20 classes. We discard documents with fewer than 2 words.

For clustering tasks, we combine the training and test sets. We use two metrics to evaluate the effectiveness of the proposed topic model for clustering: purity and normalized mutual information (NMI).²

We compare the proposed model against several standard algorithms for clustering and topic modelling. These include K-means, normalized cuts [10], probabilistic latent semantic indexing (pLSI) and latent Dirichlet allocation (LDA). For LDA, pLSI and the proposed model DNTM, the topics correspond to clusters, and a document is assigned to the cluster/topic with the highest posterior probability for the given document. This approach has been shown to be more effective than clustering the topic representations of documents [8].

¹ The pre-processed dataset is available at https://sites.google.com/site/renatocorrea02/textcategorizationdatasets
² Details about the two metrics can be obtained at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

Table 1: Clustering results on text datasets
(a) 20 newsgroup dataset

Method           Purity   NMI
K-means          34.3%    32.4%
Normalized Cut   23.1%    21.7%
pLSI             57.7%    56.5%
LDA              54.7%    55.3%
DNTM             56.5%    56.7%

Table 2: Top 10 stemmed words from 10 randomly selected topics learnt on the 20 newsgroup dataset by DNTM. Note that each topic is coherent, that is, the words correspond to a specific class of 20 Newsgroups.

Topic 1    'line' 'power' 'sound' 'tape' 'radio' 'cabl' 'switch' 'light' 'phone' 'work'
Topic 2    'effect' 'drug' 'medic' 'doctor' 'food' 'articl' 'health' 'research' 'school'
Topic 3    'drive' 'card' 'mac' 'driver' 'problem' 'disk' 'monitor' 'appl' 'video' 'work'
Topic 4    'game' 'team' 'plai' 'fan' 'player' 'win' 'hockei' 'score' 'season' 'playoff'
Topic 5    'kill' 'world' 'war' 'muslim' 'death' 'armenian' 'attack' 'peopl' 'histori' 'citi'
Topic 6    'car' 'engin' 'bike' 'dod' 'ride' 'articl' 'road' 'bmw' 'mile' 'front'
Topic 7    'state' 'govern' 'israel' 'american' 'isra' 'right' 'nation' 'clinton' 'presid' 'polit'
Topic 8    'file' 'program' 'softwar' 'graphic' 'imag' 'color' 'code' 'version' 'format' 'ftp'
Topic 9    'sale' 'price' 'bui' 'sell' 'offer' 'interest' 'compani' 'book' 'ship' 'cost'
Topic 10   'god' 'christian' 'jesu' 'love' 'church' 'bibl' 'faith' 'sin' 'christ' 'word'

The clustering results for 20 Newsgroups are given in Table 1(a). As can be seen from the results, there is a distinct improvement in performance as one moves from standard clustering algorithms to topic-modelling algorithms such as pLSI, LDA and DNTM. Moreover, the performance obtained using DNTM is on par with the performance using LDA and pLSI.

2.4 The topics

Although the proposed model DNTM only learns the distribution of the topics given the words, we can obtain the distribution of the words given the topics as follows. First, we compute the joint probability of the topic t and the word w occurring together:

    \bar{P}(t, w) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} P(Z_j^{(i)} = t \mid X_j^{(i)} = w) \, P(X_j^{(i)} = w).

Here, P(X_j^{(i)} = w) = 1 if the j-th word of the i-th document is w, and 0 otherwise. Informally, the above equation simply computes the probability of observing the topic t every time the word w occurs, and takes the average of all these probabilities. Finally, to compute the posterior, we normalize the above distribution:

    \bar{P}(w \mid t) = \frac{\bar{P}(w, t)}{\sum_{w' \in \text{vocabulary}} \bar{P}(w', t)}.    (6)

Using the above method, we obtain the distribution over the words for each topic learnt by DNTM. The most probable words for each of ten randomly selected topics are listed in Table 2. One can observe that the topics learnt by the model are coherent, that is, the words corresponding to a single topic occur commonly in a single class.
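As a minimal sketch of this procedure, the code below accumulates \bar{P}(t, w) by averaging the topic posterior of every occurrence of each word and then normalizes over the vocabulary as in Eq. (6) to read off the most probable words per topic. It reuses the hypothetical per_word_terms helper and the placeholder sizes introduced earlier.

```python
# Illustrative computation of P_bar(t, w) and the per-topic word posterior of Eq. (6).
import torch

def top_words_per_topic(documents, vocab_size, num_topics, top_k=10):
    """documents: list of LongTensors, each holding the word ids of one document."""
    joint = torch.zeros(num_topics, vocab_size)          # accumulates P_bar(t, w)
    with torch.no_grad():
        for word_ids in documents:
            p, _, _ = per_word_terms(word_ids)           # (m_i, num_topics)
            # Add each occurrence's topic posterior, averaged within the document.
            joint.index_add_(1, word_ids, p.t() / len(word_ids))
    joint /= len(documents)

    # Eq. (6): normalize over words to get P_bar(w | t) for every topic t.
    p_w_given_t = joint / (joint.sum(dim=1, keepdim=True) + 1e-12)
    return p_w_given_t.topk(top_k, dim=1).indices        # (num_topics, top_k) word ids
```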
3 Topic modelling in images

In order to extend the proposed model to images, one needs to define 'words' and 'documents' for images. The most common approach for obtaining words from images involves the extraction of features (SIFT, HOG, etc.) from images. These features are then clustered using K-means, and the corresponding quantized features are used as words [? ? ?] for topic modelling. Learning of features happens independently of topic modelling, thereby resulting in suboptimal features for topic modelling. Another approach, considered in [?], involves training a deep Boltzmann machine on the pixels of an image. The samples from the deepest latent layer of the DBM are then used as words for training a hierarchical Bayesian model.

In this work, however, we learn the words from the pixels of an image by employing a convolutional network. In particular, let X^{(i)} represent the i-th image, let Y_j^{(i)} represent the j-th word of the i-th image, and let Z_j^{(i)} represent the corresponding topic. In order to convert X^{(i)} into words Y_j^{(i)}, we feed X^{(i)} to a convolutional neural network. The output of the CNN is treated as the words of the image. For instance, if the output of the CNN consists of 100 feature maps of size 8x8, we treat them as 64 words, where each word is represented using 100 dimensions. The topic distribution of a word is obtained by applying a 1x1 convolution to the word representation, and then applying a softmax to the output. Note that the words Y_j^{(i)} are deterministic functions of X^{(i)}, and hence allow the backpropagation of the gradient from Y_j^{(i)} to the parameters of the CNN.

Now that we have defined the words in an image, we need to define the concept of a document. The definition of a document depends on the underlying task. In this paper, we consider two possible definitions of documents. In the first scenario, we are interested in clustering similar images together. For this task, we consider that all the words in an image belong to the same document. This is the usual assumption in topic modelling. The words and the document are then fed to the proposed model, exactly as for text documents.

3.1 Treating images as documents

We use the proposed model to learn topics on the CIFAR-10 dataset. The dataset consists of 50,000 training images and 10,000 test images divided into 10 classes. This dataset is quite challenging, since there is high variability within each class, even though the individual images are only 32x32 pixels.

We use convolutional neural networks for obtaining the features from the images. In particular, for the 32x32 CIFAR-10 images, we apply 4 layers of strided convolution and ReLU nonlinearity to obtain 64 (8x8) features per image. These features function as words and are fed as input to the DNTM. Note that the CNN is trained only by the DNTM, thereby coupling feature extraction and topic modelling; a minimal sketch of this pipeline is given at the end of this subsection.

We train the DNTM to extract 100 topics from the CIFAR-10 dataset. Nine of those topics are shown in Figure 1. In particular, for each topic, we have listed the 24 most probable images. One can observe that the topics are qualitatively coherent.

[Figure 1: Some of the topics learnt by DNTM on CIFAR-10. For each topic, the 24 images with the highest probability for the given topic are shown; panels (a)-(i) correspond to Topics 1-9.]

We evaluate the purity of the learned topics by retrieving the 100 most probable CIFAR-10 images for each topic. The most common label among the 100 images associated with a topic is then assigned to the topic. The purity of a given topic is the fraction of the retrieved images that have the same label as the one assigned to the topic. Using the proposed model, we obtain a mean purity of 49.3% for the learned topics.
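As a minimal sketch of the image pipeline described above, the code below maps a 32x32 image to an 8x8 grid of 100-dimensional 'words' with a four-layer convolutional network and obtains a topic distribution per word via a 1x1 convolution and softmax. The exact strides, kernel sizes and channel widths are not specified in the text and are our assumptions; the adversarial generator mentioned in the introduction is omitted.

```python
# Illustrative image-to-words pipeline of Section 3 (architecture details assumed).
import torch
import torch.nn as nn

class ImageTopicModel(nn.Module):
    def __init__(self, num_topics=100, word_dim=100):
        super().__init__()
        # Four convolution + ReLU blocks mapping 3x32x32 -> word_dim x 8 x 8;
        # stride pattern and channel widths are our assumptions.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),    # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 16 -> 8
            nn.Conv2d(64, word_dim, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        # 1x1 convolution producing topic logits for each of the 8x8 = 64 words.
        self.topic_head = nn.Conv2d(word_dim, num_topics, kernel_size=1)

    def forward(self, images):
        words = self.features(images)                # (B, word_dim, 8, 8)
        logits = self.topic_head(words)              # (B, num_topics, 8, 8)
        # Flatten the spatial grid: each image becomes a document of 64 words.
        p = torch.softmax(logits.flatten(2), dim=1)  # (B, num_topics, 64)
        return p.transpose(1, 2)                     # (B, 64, num_topics)
```

The 1x1 convolution plays the same role as the linear topic layer used for text: the same map is applied to each of the 64 spatial words, and the resulting per-word topic distributions can be plugged into the entropy, KL and balance terms of Section 2.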
References

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[2] Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In AISTATS, pages 57-64, 2005.
[3] Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 524-531. IEEE, 2005.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[5] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529-536, 2004.
[6] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57. ACM, 1999.
[7] Andreas Krause, Pietro Perona, and Ryan G. Gomes. Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems, pages 775-783, 2010.
[8] Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14(2):178-203, 2011.
[9] Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945-959, 2000.
[10] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
