
arXiv:1301.5686v2 [cs.CL] 26 Jan 2013

Transfer Topic Modeling with Ease and Scalability

Jeon-Hyung Kang, Jun Ma, Yan Liu
Computer Science Department, University of Southern California
Los Angeles, CA 90089
jeonhyuk,junma,[email protected]

Abstract

The increasing volume of short texts generated on social media sites, such as Twitter or Facebook, creates a great demand for effective and efficient topic modeling approaches. While latent Dirichlet allocation (LDA) can be applied, it is not optimal due to its weakness in handling short texts with fast-changing topics and scalability concerns. In this paper, we propose a transfer learning approach that utilizes abundant labeled documents from other domains (such as Yahoo! News or Wikipedia) to improve topic modeling, with better model fitting and result interpretation. Specifically, we develop the Transfer Hierarchical LDA (thLDA) model, which incorporates the label information from other domains via informative priors. In addition, we develop a parallel implementation of our model for large-scale applications. We demonstrate the effectiveness of our thLDA model on both a microblogging dataset and standard text collections including the AP and RCV1 datasets.

1 Introduction

Social media websites, such as Twitter and Facebook, have become a novel real-time channel for people to share information on a broad range of subjects. With millions of messages and updates posted daily, there is an obvious need for effective ways to organize the data tsunami. Latent Dirichlet Allocation (LDA), a Bayesian hierarchical model of the text generation process [5], has been shown to be very powerful in modeling text corpora. However, several major characteristics distinguish social media data from traditional text corpora and raise great challenges for the LDA model. First, each post or tweet is limited to a certain number of characters, and as a result abbreviated syntax is often introduced. Second, the texts are very noisy, with broad topics and repetitive, less meaningful content. Third, the input data typically arrives in high-volume streams.

It is known that the topics output by LDA completely depend on the word distributions in the training documents. Therefore, LDA results on blog and microblogging data are naturally poor, with clusters of words that co-occur in many documents without actual semantic meaning [18, 24]. Intuitively, the generative process of the LDA model can be guided by document labels so that the learned hidden topics become more meaningful. Discriminative training of LDA has therefore been explored in [4, 12, 21]; in particular, [20] applied labeled LDA (lLDA) to analyze Twitter data by using hashtags as labels. This approach addresses our challenges to a certain extent. Even though supervised LDA gives more comprehensible topics than LDA, it inherits the general limitation of the LDA framework that each document is represented by a single topic distribution. This makes comparing documents difficult when we have sparse, noisy documents with constantly changing topics. Furthermore, it is very difficult to obtain labels (e.g. hashtags) for continuously growing text collections like social media, not to mention that on Twitter many hashtags refer to very broad topics (e.g. "#internet", "#sales") and can therefore even be misleading when used to guide the topic models.
In [3, 2], the authors proposed the hierarchical LDA (hLDA) model, which generates topic hierarchies from an infinite number of topics represented by a topic tree, where each document is assigned a path from the tree root to one of the tree leaves. hLDA has the capability to encode semantic topic hierarchies, but when fed with noisy and sparse data such as user-generated short tweet messages, it is not robust and its results lack meaningful interpretation [18, 24].

Recently, hLDA has been applied in [9] to summarize multiple documents: the authors built a two-level learning model using hLDA to discover sentences similar to given summary sentences based on the hidden topic distributions of the documents. The distance-dependent CRP [6] studied several types of decay (window decay, exponential decay, and logistic decay), where customers' table assignments depend on external distances between them. Our paper aims to bridge the gap between short, noisy texts and their actual generation process without additional labeling effort. At the same time, we develop parallel algorithms to speed up inference so that our model is applicable to large-scale applications.

In this paper, we propose a simple solution, the transfer hierarchical LDA (thLDA) model. The basic idea is to extract human knowledge on topics from a source domain corpus, in the form of representative words whose meaning is consistent across contexts or media, and encode it as priors for learning the topic hierarchies of the target domain corpus with the hLDA model. To build the source domain corpus, thLDA utilizes related labeled documents from other sources (such as Yahoo! News or socially tagged web pages), and to encode the labels, we modify the nested Chinese Restaurant Process (nCRP) to guide the inference of latent topics in the target domain. We base our model on hierarchical LDA (hLDA) [3, 2] mainly because hLDA has the natural capability to encode semantic topic hierarchies with document clusters. In addition, a recent study suggests that the hierarchical Dirichlet process provides an effective explanation model for human transfer learning [8].

The rest of the paper is organized as follows: In Section 2, we describe the proposed methodology in detail and discuss its relationship with existing work. In Section 3, we demonstrate the effectiveness of our model, and we summarize the work and discuss future directions in Section 4.

2 Related Work

2.1 Topic Models

Latent Dirichlet Allocation (LDA) [5] has gained popularity for automatically extracting a representation of a corpus. LDA is a completely unsupervised model that views documents as mixtures of probabilistic topics, represented as a K-dimensional random variable θ. In the generative story, each document is generated by first picking a topic distribution θ from the Dirichlet prior and then using the document's topic distribution θ to sample latent topic variables z_i. LDA makes the assumption that each word is generated from one topic, where z_i is a latent variable indicating the hidden topic assignment for word w_i. The probability of choosing a word w_i under topic z_i, p(w_i | z_i, β), depends on different documents. LDA is not designed for labeled corpora, so it has been extended in several ways to incorporate a supervised label set into its learning process. In [21], Ramage et al. introduced Labeled LDA, a model that uses multi-labeled corpora to address the credit assignment problem. Unlike LDA, Labeled LDA constrains the topics of each document to a given label set. Instead of using a symmetric Dirichlet distribution with a single hyperparameter α as the prior on the topic distribution θ_(d), it restricts θ_(d) to only the topics that correspond to the observed labels Λ_(d). In [3, 2], the authors proposed a stochastic process in which Bayesian inference is no longer restricted to finite-dimensional spaces.
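To make the generative story in the preceding paragraph concrete, here is a minimal sketch in Python/NumPy that draws a small synthetic corpus from an LDA model; the corpus size, vocabulary size, and hyperparameter values are illustrative assumptions, not settings used in the paper.

import numpy as np

rng = np.random.default_rng(0)
K, V, D = 5, 1000, 10            # topics, vocabulary size, documents (illustrative values)
alpha, eta = 0.1, 0.01           # symmetric Dirichlet hyperparameters (illustrative values)

# One multinomial distribution over the vocabulary per topic (the beta matrix).
beta = rng.dirichlet(np.full(V, eta), size=K)

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))                 # document-level topic distribution
    n_words = rng.poisson(20) + 1                            # document length (illustrative choice)
    z = rng.choice(K, size=n_words, p=theta)                 # latent topic assignment z_i per word
    words = [int(rng.choice(V, p=beta[z_i])) for z_i in z]   # word w_i drawn from topic z_i
    corpus.append(words)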
Unlike LDA, hLDA does not fix the number of topics in advance and allows arbitrary breadth and depth of the topic hierarchy. The topics in the hLDA model are represented by a topic tree, and each document is assigned a path from the tree root to one of the tree leaves. Each document is generated by first sampling a path along the topic tree, and then sampling a topic z_i among all the topic nodes on that path for each word w_i. The authors proposed a nested Chinese restaurant process prior on path selection:

p(\text{occupied table } i \mid \text{previous customers}) = \frac{n_i}{\gamma + n - 1}    (1)

p(\text{unoccupied table} \mid \text{previous customers}) = \frac{\gamma}{\gamma + n - 1}    (2)

The nested Chinese restaurant process implies that the first customer sits at the first table and the nth customer sits at a table i drawn from the above equations. When a path of depth d is sampled, there are d topic nodes along the path, and the document samples the topic z_i among all topic nodes on the path for each word w_i based on the GEM distribution. The experimental results for hLDA show that this document generation story can indeed encode semantic topic hierarchies.
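As a small illustration of equations (1) and (2), the hypothetical helper below samples a table assignment for the next customer at a single level; the nested CRP repeats this process over the children of the currently selected node at each level of the tree.

import numpy as np

def crp_next_table(table_counts, gamma, rng):
    """Sample a table for the next customer from equations (1) and (2):
    p(occupied table i) = n_i / (gamma + n - 1), p(new table) = gamma / (gamma + n - 1),
    where n - 1 = sum(table_counts) is the number of customers already seated."""
    counts = np.asarray(table_counts, dtype=float)
    probs = np.append(counts, gamma) / (gamma + counts.sum())
    return int(rng.choice(len(probs), p=probs))   # index == len(table_counts) means a new table

rng = np.random.default_rng(0)
table = crp_next_table([3, 1], gamma=1.0, rng=rng)   # two occupied tables with 3 and 1 customers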
2.2 Transfer Learning

Transfer learning has been extensively studied in the past decade to leverage data (either labeled or unlabeled) from one task (or domain) to help another [17]. In summary, there are two types of transfer learning problems: shared label space and shared feature space. For a shared label space [1], the main objective is to transfer label information between observations from different distributions (i.e. domain adaptation) or to uncover the relations between multiple labels for better prediction (i.e. multi-task learning); for a shared feature space, one of the most representative works is self-taught learning [19], which uses sparse coding [13] to construct higher-level features from abundant unlabeled data to help improve the performance of a classification task with a limited number of labeled examples.

As an unsupervised generative model, LDA has the advantage of modeling the generative procedure of the whole dataset by establishing the relationships between documents and their associated hidden topics, and between the hidden topics and the concrete words, in an unsupervised way. Intuitively, transfer learning on this generative model can be realized in two ways: one is utilizing document labels from the other domain (with the assumption that the target domain and source domain share the same label space) so that the learned hidden topics can be much more meaningful, and the other is utilizing documents from other domains to enrich the content so that we can learn a more robust shared latent space.

In [4, 12, 21], the authors proposed discriminative LDA, which adds supervised information to the original LDA model and guides the generative process of documents using labels. These methods clearly demonstrate the advantages of discriminative training of generative models. However, this is different from transfer learning, since they simply utilize labeled documents in the same domain to help build the generative model. The same motivation can be applied to transfer learning, in which the supervised information is used to guide the generation of a common latent semantic space shared by both the source domain and the target domain. Transferring information from the source to the target domain is extremely desirable in social media analysis, where the target domain features are very sparse, with many missing features. Based on the shared common latent semantic space, the missing features can be recovered to some extent, which helps to better represent these examples.

3 Transfer Topic Models

Content analysis of social media data is a challenging problem due to its unique language characteristics, which prevent standard text mining tools from being used to their full potential. Several models have been developed to overcome this barrier by aggregating many messages [11], applying temporal attributes [15, 22, 23], examining entities [15], or applying manual annotation to guide topic generation [20]. The main motivation of our work is that previous unsupervised approaches for analyzing social media data fail to achieve high accuracy due to noise and sparseness, while supervised approaches require annotated data, which often demands a significant amount of human effort. Even though hierarchical LDA has the natural capability to encode semantic topic hierarchies with clusters of similar documents based on those hierarchies, it still cannot provide robust results for noisy and sparse data (Fig 2(a)) because of the exchangeability assumption in the Dirichlet process mixture [6]. Exchangeability is essential for the CRP mixture to be equivalent to a DP mixture; thus customers are exchangeable. However, this assumption is not reasonable when applied to a microblogging dataset. Based on our experiments, when the data set is noisy and sparse, unrelated words tend to be clustered together because of co-occurrences (Fig 2(a)).

Figure 1: (a) The Yahoo! News home page and the categories of news. (b) The graphical model representation of transfer hierarchical LDA with a modified nested CRP prior.

3.1 Knowledge Extraction from Source Domain

Consider the task of modeling short and sparse documents based on a specific source domain structure. For example, a user who is interested in browsing documents within particular categories of topics might prefer to see clusters of other documents based on those categories. By clustering the target domain according to the user's topic hierarchy categories in the source domain, we can produce better document clusters and topic hierarchies by leveraging the context information from the source domain.

User-generated categories can be found in various source domains: Twitter lists, Flickr collections and sets, Del.icio.us hierarchies, and Wikipedia or news categories. We transfer source domain knowledge to target domain documents by assigning each document a prior path, a sequence of assigned topic nodes from the root to a leaf node. The prior paths of the documents can be used to identify whether two target documents are similar, so that our model can group similar documents into clusters while keeping different documents separate based on the labels. To label each document's prior path on the source domain hierarchy, we generate word vectors for the nodes in the source domain hierarchy. Each label is generated by measuring the similarity between the source domain topic hierarchy and the document. There are many ways to measure similarity between two vectors: cosine similarity, Euclidean distance, the Jaccard index, and so on. For simplicity, in this paper we label our prior knowledge of a target document by computing the cosine similarity between the target document and a node in the source domain hierarchy. We start from the root node of the hierarchy and keep assigning only the most similar topic node at each level, considering only the child nodes of the currently assigned topic node as candidates for the next level; a sketch of this procedure is given below.
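A minimal sketch of the greedy path-labeling procedure just described, assuming a hypothetical node representation with a tf-idf word vector and a list of children (the paper does not prescribe a particular data structure):

from dataclasses import dataclass, field
import numpy as np

@dataclass
class CategoryNode:                       # hypothetical node of the source-domain hierarchy
    word_vec: np.ndarray                  # tf-idf weighted word vector of the category
    children: list = field(default_factory=list)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def label_prior_path(doc_vec, root):
    """Greedily descend the hierarchy, keeping only the most similar child of the
    currently assigned node at each level, and return the prior path root -> leaf."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda c: cosine(doc_vec, c.word_vec))
        path.append(node)
    return path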
3.2 Transfer Hierarchical LDA

We incorporate the label hierarchies into the model by changing the prior over paths in hLDA so that path selection favors the paths in the existing label hierarchies. Similar to the original hLDA, thLDA models each document as a mixture of topics on its path and generates each word from one of those topics. In the original hLDA model, the path prior is the nested Chinese Restaurant Process (nCRP), where the probability of choosing a topic in a given layer depends on the number of documents already assigned to each node in that layer. That is, nodes with more assigned documents have a higher probability of attracting new documents. thLDA, however, incorporates supervision by modifying the path prior using the following equations:

p(\text{table } i \text{ without prior} \mid \text{previous customers}) = \frac{n_i}{\gamma + k\lambda + n}    (3)

p(\text{table } i \text{ with prior} \mid \text{previous customers}) = \frac{n_i + \lambda}{\gamma + k\lambda + n}    (4)

p(\text{unoccupied table} \mid \text{previous customers}) = \frac{\gamma}{\gamma + k\lambda + n}    (5)

where k is the total number of labels for the document, d is the total number of documents, and λ is a weighted indicator variable that controls the strength of the prior added to the original nested Chinese restaurant process. The graphical model of the new model is shown in Fig 1(b). Whether a customer sits at a specific table is related not only to how many customers are currently sitting at table i (n_i) and how often a new customer chooses a new table (γ), but also to how close the current customer is to the customers at table i (λ). Note that in this work we use the same λ for all topics for simplicity; however, a table-specific λ or a different prior for each table can be applied in a more sophisticated model.

thLDA is able to transfer source domain knowledge into the topic hierarchy by making an assignment that depends not only on how many documents are assigned to topic i (n_i) but also on how close the current topic is to the documents in topic i based on the source domain knowledge (λ). For topics unseen in the source domain, we have no knowledge to transfer, so we assign a probability proportional to the number of documents already assigned to topic i.
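The modified table-assignment probabilities of equations (3)-(5) can be sketched as follows; the way the prior-path membership is passed in, and all variable names, are illustrative assumptions.

import numpy as np

def modified_ncrp_probs(table_counts, on_prior_path, gamma, lam, k):
    """Table probabilities at one level, following equations (3)-(5).
    table_counts[i]  : n_i, number of documents already assigned to table i
    on_prior_path[i] : True if table i lies on the document's prior path (+lambda bonus)
    k                : total number of labels for the document"""
    n = float(np.sum(table_counts))
    denom = gamma + k * lam + n
    occupied = [(n_i + (lam if hit else 0.0)) / denom
                for n_i, hit in zip(table_counts, on_prior_path)]
    probs = np.array(occupied + [gamma / denom])      # last entry: unoccupied (new) table
    return probs / probs.sum()                        # renormalize defensively before sampling

rng = np.random.default_rng(0)
p = modified_ncrp_probs([4, 2, 1], [True, False, False], gamma=1.0, lam=0.5, k=1)
table = int(rng.choice(len(p), p=p))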
Unlike approaches that transfer knowledge only for labeled entities or only from given source domain knowledge, our model learns from both unlabeled and labeled data based on the different prior probability equations (3), (4), and (5).

The modified nested Chinese restaurant process can be imagined through the following scenario. As in [3, 2], suppose that there are an infinite number of infinite-table Chinese restaurants in a city and only one headquarters restaurant. On each table in every restaurant there is a note that refers to another restaurant, and each restaurant is referred to exactly once. So, starting from the headquarters restaurant, all other restaurants are connected as a branching tree. One can think of each table as a door to another restaurant, unless the current restaurant is a leaf-node restaurant of the tree. Thus, starting from the root restaurant, one can reach a final destination table in a leaf-node restaurant, and customers in the same restaurant share the same path. When a new customer arrives, instead of following the original nested Chinese restaurant process, we put a higher weight on the tables where similar customers are seated.

In the transfer hierarchical LDA model, the documents in a corpus are assumed to be drawn from the following generative process:

(1) For each table k in the infinite tree:
    (a) Generate β_(k) ∼ Dirichlet(η)
(2) For each document d:
    (a) Generate c_(d) ∼ modified nCRP(γ, λ)
    (b) Generate θ_(d) | {m, π} ∼ GEM(m, π)
    (c) For each word n:
        i. Choose level z_{d,n} | θ_d ∼ Discrete(θ_d)
        ii. Choose word w_{d,n} | {z_{d,n}, c_d, β} ∼ Discrete(β_{c_d}[z_{d,n}])

The variable notation is as follows: z_{d,n} is the topic (level) assignment of the nth word in the dth document over the L topics, and w_{d,n} is the nth word in the dth document. In Fig 1(b), T represents a collection of an infinite number of L-level paths drawn from the modified nested CRP. c_{m,l} represents the restaurant corresponding to the lth topic distribution in the mth document, and the distribution of c_{m,l} is defined by the modified nested CRP conditioned on all the previous c_{n,l} with n < m. We assume that each table in a restaurant is assigned a parameter vector β_k, a multinomial topic distribution over the vocabulary for topic k, drawn from a Dirichlet prior η. We also assume that the words are generated from a mixture model with a document-specific random mixing distribution. A document is drawn by first choosing an L-level path through the modified nested CRP and then drawing the words from the L topics associated with the restaurants along that path. Λ_(d) = (l_1, ..., l_k) denotes the binary label presence indicators, and Φ_k is the label prior for label k.
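The generative process above can be rendered as the following sketch; the truncated stick-breaking draw stands in for GEM(m, π), the path is taken as given rather than drawn from the modified nCRP, and all hyperparameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
L, V = 3, 1000                        # tree depth and vocabulary size (illustrative values)
eta, m, pi = 0.01, 0.5, 10.0          # topic smoothing and GEM(m, pi) parameters (illustrative values)

def gem_weights(m, pi, length, rng):
    """Truncated stick-breaking draw approximating GEM(m, pi) over `length` levels."""
    sticks = rng.beta(m * pi, (1.0 - m) * pi, size=length)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    w = sticks * remaining
    return w / w.sum()                # renormalize the mass lost to truncation

def generate_document(path_topics, n_words, rng):
    """path_topics: the L topic-word distributions beta along the sampled path c_d."""
    theta = gem_weights(m, pi, L, rng)                       # per-document level distribution
    levels = rng.choice(L, size=n_words, p=theta)            # z_{d,n}
    return [int(rng.choice(V, p=path_topics[z])) for z in levels]   # w_{d,n}

# One document along a hypothetical path whose L topics are drawn from Dirichlet(eta):
path_topics = rng.dirichlet(np.full(V, eta), size=L)
doc = generate_document(path_topics, n_words=25, rng=rng)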
3.3 Gibbs Sampling for Inference

The inference procedure in our thLDA model is similar to that of hLDA, except for the modification of the path prior. We use a Gibbs sampling scheme: for each document in the corpus, we first sample a path c_d in the topic tree based on the path assignments of the rest of the documents c_{−d} in the corpus:

p(c_d \mid w, c_{-d}, z, \eta, \gamma, \lambda) \propto p(c_d \mid c_{-d}, \gamma, \lambda)\, p(w_d \mid c, w_{-d}, z, \eta)    (6)

Here p(c_d | c_{−d}, γ, λ) is the prior probability on paths implied by the modified nested CRP, and p(w_d | c, w_{−d}, z, η) is the probability of the data given a particular path.

The second step in the Gibbs sampling inference, for each document, is to sample the level assignment of each word in the document given the sampled path:

p(z_{d,n} \mid z_{-(d,n)}, c, w, m, \pi, \eta) \propto p(z_{d,n} \mid z_{d,-n}, m, \pi)\, p(w_{d,n} \mid z, c, w_{-(d,n)}, \eta)    (7)

The first term is the distribution over levels from the GEM distribution, and the second term is the word emission probability given the topic assignment for each word. The only thing we need to change in the inference scheme is the path prior probability in the path sampling step.

3.4 Parallel Inference Algorithm

We developed a parallel approximate inference algorithm for thLDA on P independent processors to improve learning efficiency. In other words, we split the data into P parts and run thLDA on each processor, performing Gibbs sampling on the partial data. However, the Gibbs sampler requires that each sampling step be conditioned on the rest of the sampling states, hence we introduce a tree merge stage to help the P Gibbs samplers share their sampling states periodically during the P independent inference processes.

First, given the current global state of the CRP, we sample the topic assignment for word n in document d on processor p:

p(z_{d,n,p} \mid z_{-(d,n,p)}, c, w, m, \pi, \eta) \propto p(z_{d,n,p} \mid z_{d,-n,p}, m, \pi)\, p(w_{d,n,p} \mid z, c, w_{-(d,n,p)}, \eta)    (8)

Here z_{−(d,n,p)} is the vector of topic allocations on processor p excluding z_{(d,n,p)}, and w_{−(d,n,p)} is the set of words on processor p excluding the nth word of document d, w_{(d,n,p)}. Note that on a separate processor, we need to use the total vocabulary size and the number of words that have been assigned to each topic in the global state of the CRP. We merge the P topic assignment count tables into a single set of counts after each Gibbs iteration so that the global sampling state is shared among the P processes [16].

Second, on the P separate processors, we sample the path selection, i.e. the conditional distribution for c_{d,p} given w and the c variables of all documents other than d:

p(c_{d,p} \mid w, c_{-d,p}, z, \eta, \gamma, \lambda) \propto p(c_{d,p} \mid c_{-d,p}, \gamma, \lambda)\, p(w_{d,p} \mid c, w_{-d,p}, z, \eta)    (9)

The conditional distributions for both the prior and the likelihood of the data given a particular choice of c_{d,p} are computed locally. Note that computing the second term requires the global state of the CRP: the documents' path assignments and the number of instances of each word that have been assigned to the topic indexed by c_d in the tree.

To merge the topic trees into the global state of the CRP, we first define the similarity between two topics as the cosine similarity of the two topics' word distribution vectors:

\text{similarity}(\beta_i, \beta_j) = \frac{\beta_i \cdot \beta_j}{\lVert\beta_i\rVert\, \lVert\beta_j\rVert}    (10)

Given the P infinite trees of Chinese restaurants, we pick one tree as the base tree and recursively merge the topics of the remaining P−1 trees into the base tree in a top-down manner. For each topic node t_i being merged, we find the most similar node t_j in the base tree such that t_i and t_j have the same parent node.
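A sketch of one step of the top-down tree merge based on the cosine similarity of equation (10); the node representation is a hypothetical simplification, and the decision of when to merge into an existing node versus add a new node is reduced to its simplest form here.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class TopicNode:                          # hypothetical simplified topic-tree node
    beta: np.ndarray                      # word-count (or distribution) vector of the topic
    n_docs: int = 0
    children: list = field(default_factory=list)

def cos_sim(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def merge_node(base_parent, incoming):
    """Merge `incoming` into the most similar child of `base_parent` (a simplified form of
    the same-parent constraint); if the parent has no children yet, attach it as a new node."""
    if base_parent.children:
        best = max(base_parent.children, key=lambda c: cos_sim(c.beta, incoming.beta))
        best.beta = best.beta + incoming.beta          # pool the word statistics
        best.n_docs += incoming.n_docs
        return best
    base_parent.children.append(incoming)
    return incoming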
By applying prior knowledge foreach topic node i do in both supervised and unsupervised ways, we can find most similar topic node j ∈ q apply thLDA to learn deeper level of topic hierarchy if parent(i)==parent(j) than the given depth of source domain prior hierar- merge topics i to topic j chy. In the following experiment, we will show the else performance of thLDA a combination of supervised add topic node i to tree q priorknowledgeuptoacertainlevelandunsupervised end below that level. end end end 4 Experiment Results Algorithm 1: The parallel inference algorithm In our experiments, we used one source domain and three target domain text data sets to show the effec- tiveness of our transfer hierarchical LDA model. We used two well known text collections: the Associated Press (AP) Data set[10] and Reuters Corpus Volume 1(RCV1) Data set[14] and one sparse and noisy Mi- croblog data from Twitter for target domains and Yahoo! News categories for source domain. 7 (a) hLDA (b) thLDA Figure 2: The topic hierarchy learned from (A) hLDA and (B) thLDA as well as example tweets assigned to the topic path. 4.1 Dataset Description tive log likelihood. RCV1[14] is an archive of over 800,000 manually We used the web crawler to fetch news titles on 8 categorized newswire stories provided by Reuters, categoriesin Yahoo! News: science, business, health, Ltd. distributed as a set of on-line appendices to sports,politics,world,technologyandentertainment, aJMLRarticle. Italsoincludes126categories, asso- as shown in Fig 1 (a). We parsed and stemmed ciated industries and regions. For this work a subset news titles and computed the tf-idf score to gener- of RCV1 data set is used. Sample RCV1 data set ateweightedwordvectorforeachtopiccategory. We has D = 55,606 documents with a vocabulary size computed each topic category score using the top 50 V = 8,625 unique terms. We randomly divided it tf-idf weighted word vector and picked one optimal into 44,486 observed documents and 11,120 held-out category per level as a label for each target docu- documents for experiments. ment. We have crawled the Twitter data for two Text Retrieval Conference AP (TREC-AP) [10] weeks and obtained around 58,000 user profiles and contains Associated Press news stories from 1988 to 2,000,000 tweets. Twitter users use some structure 1990. The original data includes over 200,000 docu- conventions such as a user-to-message relation (i.e. ments with 20 categories. The sample AP data set initial tweet authors, via, cc, by), type of message from[3],whichissampledfromasubsetoftheTREC (i.e. Broadcast, conversation, or retweet messages), AP corpus contains D= 2,246 documents with a vo- type of resources (i.e. URLs, hashtags, keywords) to cabulary size V = 10,473 unique terms. We divided overcome the 140 character limit. To capture trend- documents into 1,796 observed documents and 450 ing topics, many applications analyze twitter data held-out documents to measure the held-out predic- and group similar trending topics using structure in- 8 formation(i.e. hashtagorurl)andshowsalistoftop has key words focusing on smart phones (Android, Ntrendingtopics. Howeveraccordingto[7],only5% iPhone, andApple)andthe5thnodesin3rdcolumn of tweets contain a hashtag with 41% of these con- that covers online multimedia resources. However, taining a URL, and also only 22% of tweets include the2ndleveltopicnodesarelessinformativeandthe aURL.Inthiswork,insteadofusingstructureinfor- relationship between child nodes in the 3rd level and mation (i.e. 
4 Experiment Results

In our experiments, we used one source domain and three target domain text data sets to show the effectiveness of our transfer hierarchical LDA model. We used two well-known text collections, the Associated Press (AP) data set [10] and the Reuters Corpus Volume 1 (RCV1) data set [14], together with sparse and noisy microblog data from Twitter as target domains, and Yahoo! News categories as the source domain.

Figure 2: The topic hierarchy learned from (a) hLDA and (b) thLDA, as well as example tweets assigned to the topic paths.

4.1 Dataset Description

RCV1 [14] is an archive of over 800,000 manually categorized newswire stories provided by Reuters, Ltd., distributed as a set of on-line appendices to a JMLR article. It also includes 126 categories, associated industries, and regions. For this work, a subset of the RCV1 data set is used. The sampled RCV1 data set has D = 55,606 documents with a vocabulary of V = 8,625 unique terms. We randomly divided it into 44,486 observed documents and 11,120 held-out documents for the experiments.

The Text Retrieval Conference AP (TREC-AP) collection [10] contains Associated Press news stories from 1988 to 1990. The original data includes over 200,000 documents in 20 categories. The sampled AP data set from [3], drawn from a subset of the TREC-AP corpus, contains D = 2,246 documents with a vocabulary of V = 10,473 unique terms. We divided the documents into 1,796 observed documents and 450 held-out documents to measure the held-out predictive log likelihood.

We used a web crawler to fetch news titles in 8 categories on Yahoo! News: science, business, health, sports, politics, world, technology, and entertainment, as shown in Fig 1(a). We parsed and stemmed the news titles and computed tf-idf scores to generate a weighted word vector for each topic category. We computed each topic category score using the top 50 tf-idf weighted words and picked one optimal category per level as a label for each target document. We crawled Twitter data for two weeks and obtained around 58,000 user profiles and 2,000,000 tweets. Twitter users adopt structural conventions such as user-to-message relations (i.e. initial tweet authors, via, cc, by), message types (i.e. broadcast, conversation, or retweet messages), and resource types (i.e. URLs, hashtags, keywords) to overcome the 140-character limit. To capture trending topics, many applications analyze Twitter data, group similar trending topics using structural information (i.e. hashtags or URLs), and show a list of the top N trending topics. However, according to [7], only 5% of tweets contain a hashtag, with 41% of those also containing a URL, and only 22% of tweets include a URL. In this work, instead of using the structural information (i.e. hashtags or URLs) of a tweet, we use only its words. We removed structural information such as initial tweet authors, via, cc, by, and URLs, and stemmed each word to its base form using the Porter stemmer. For the experiment, we randomly sampled and used D = 21,364 documents with a vocabulary of V = 31,181 unique terms. We randomly divided them into 16,023 observed documents and 5,341 held-out documents.

4.2 Performance Comparison on Topic Modeling

LDA and Labeled LDA. We implemented LDA and lLDA using the standard collapsed Gibbs sampling method. To compare the learned topics of the supervised and unsupervised LDA topic models, we ran standard LDA with 9, 20, and 50 topics and lLDA with 9 topics: the 8 top-level Yahoo! News categories plus 1 free topic. The results show two main observations: first, multiple topics from standard LDA mapped to the popular topics (i.e. technology, entertainment, and sports) in lLDA; second, not all lLDA topics were discovered by standard LDA (i.e. topics such as politics, world, and science were not discovered).

Hierarchical LDA. We used HLDA-C [3, 2] with a fixed-depth tree and a stick-breaking prior on the depth weights. The topic hierarchy generated by hLDA is shown in Fig 2(a). Being an unsupervised model, hLDA gives results that depend entirely on term co-occurrence in the documents. hLDA produces a topic hierarchy that is not easily understood by human readers, because each tweet contains only a small number of terms, so co-occurrence is very sparse and less informative than in long documents. Nodes in the topic hierarchy do capture some clusters of words from the input documents, such as the 2nd topic in the 3rd column of Fig 2(a), which has keywords focused on smart phones (Android, iPhone, and Apple), and the 5th node in the 3rd column, which covers online multimedia resources. However, the 2nd-level topic nodes are less informative, and the relationship between child nodes in the 3rd level and their parent nodes in the 2nd level is less semantically meaningful. Topics belonging to the same parent in level 3 usually do not relate to each other in our runs. Ideally, this approach works for documents that are long, dense in word distribution, and overlapping; hLDA gives less interpretable results on noisy and sparse data.

Transfer Hierarchical LDA. We implemented thLDA by modifying HLDA-C. Fig 2(b) shows two important advantages of our model: first, topic nodes are more interpretable thanks to the transferred source domain knowledge; second, all the child topics have a strong semantic relationship with their parent. For example, under the topic "world", the first child "iraq war wikileak death" relates to the Iraq war, the second child "police kill injury drug" relates to crime, and, more interestingly, the 3rd child topic "chile rescue miner" denotes the recent event of the 33 trapped miners in Chile. Note that we only impose prior knowledge on the 2nd level; all 3rd-level topics emerge automatically from the data set. Under the "science" topic, the 1st child topic "space Nasa moon" is about astronomy and the 2nd child topic "BP oil gulf spill environment" is about the recent Gulf oil spill. Furthermore, the example tweets assigned to the topic nodes show a strong association with the tweet clustering.

To quantitatively measure the performance of thLDA, we used the predictive held-out likelihood. We divided the corpus into observed and held-out sets and approximated the conditional probability of the held-out set given the training set. To make a fair comparison, we applied the same hyperparameters shared by all three models, while using a different hyperparameter η to obtain a similar number of topics for the hLDA and thLDA models. Following [2], we used outer samples, taking samples 100 iterations apart with a burn-in of 200 samples. We collected 800 samples of the latent variables given the held-out documents and computed their conditional probability given the outer sample with the harmonic mean.
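The harmonic-mean combination of the held-out samples mentioned above can be sketched as follows; this is a generic estimator operating on per-sample log likelihoods, not code from the paper, and the input values are illustrative.

import numpy as np

def harmonic_mean_log_likelihood(sample_log_liks):
    """Combine S per-sample held-out log likelihoods log p(w_heldout | sample s)
    with the harmonic-mean estimator p ~= S / sum_s (1 / p_s), computed in log space."""
    lls = np.asarray(sample_log_liks, dtype=float)
    m = np.max(-lls)
    log_mean_inverse = m + np.log(np.mean(np.exp(-lls - m)))   # log of (1/S) * sum_s exp(-ll_s)
    return -log_mean_inverse

estimate = harmonic_mean_log_likelihood([-1050.2, -1048.7, -1052.9])   # illustrative values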
Figure 3: The held-out predictive log likelihood comparison for the LDA, hLDA, and thLDA models on three data sets: (a) Twitter, (b) AP, (c) RCV1.

Fig 3 shows the held-out likelihood for LDA, hLDA, and thLDA on the Twitter, AP, and RCV1 corpora. In the figure, we applied a set of fixed topic cardinalities for LDA and a fixed hierarchy depth of 3 for hLDA and thLDA. We see that thLDA always provides better predictive performance than LDA or hLDA in all three cases (Fig 3). Interestingly, thLDA provides significantly better performance on (a) the Twitter data set, where hLDA performs worse than LDA. As Blei et al. [3, 2] showed, with large numbers of topics LDA will eventually dominate hLDA and thLDA in predictive performance; however, thLDA performs better with reasonable numbers of topics.

For a manual evaluation of tweet assignments to the learned topics, we randomly selected 100 tweets and manually annotated their correctness. For LDA we pick the highest assigned topic, and for hLDA, thLDA, and parallel thLDA the third-level topic node is used; their accuracies were 41%, 46%, 71%, and 56%, respectively. Table 1 shows example tweets with their assigned topics from LDA, hLDA, and thLDA.

4.3 Performance Comparison on Scalability

We evaluated our approximate parallel learning method on the Twitter data set with 5,000 tweets. The log likelihood of the training data during the Gibbs sampling iterations is shown in Fig 4(a). In all cases, the Gibbs sampler converges to a distribution with similar log likelihood. Fig 4(b) shows the speedup from the parallel inference method. In addition to the overhead of the topic tree merging stage, the system also suffers from the overhead of state loading and saving, which is similar across settings since it occurs every time we need to update the global tree. Because of these overheads, the system performs better when the merging interval is large. When we merge topic trees every 50 Gibbs iterations, the speedup with 4 processes is 2.36 times over 1 process, but when we merge topic trees every 200 Gibbs iterations, the speedup is 3.25 times. The overhead can be roughly seen by extending the lines in Fig 4(b) to their intersection with the y-axis. Since the complexity of the merging stage is linear in the number of topic trees being merged, we see a greater overhead in the 4-process experiment. However, the complexity of the merging algorithm does not depend on the size of the dataset, which means the merging overhead becomes negligible on huge datasets.

5 Conclusion

In this paper, we proposed a transfer learning approach for effective and efficient topic modeling analysis of social media data. More specifically, we developed the transfer hierarchical LDA model, an exten-
