Alison Smith, Tak Yeon Lee, Forough Poursabzi-Sangdeh, Jordan Boyd-Graber, Kevin Seppi, Niklas Elmqvist, and Leah Findlater. Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Labels. Transactions of the Association for Computational Linguistics, 2017.

@article{Smith:Lee:Poursabzi-Sangdeh:Boyd-Graber:Seppi:Elmqvist:Findlater-2017,
  Author = {Alison Smith and Tak Yeon Lee and Forough Poursabzi-Sangdeh and Jordan Boyd-Graber and Kevin Seppi and Niklas Elmqvist and Leah Findlater},
  Title = {Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Labels},
  Journal = {Transactions of the Association for Computational Linguistics},
  Volume = {5},
  Pages = {1--15},
  Year = {2017},
  Url = {docs/2017_tacl_eval_tm_viz.pdf},
}

Links: Journal: https://transacl.org/ojs/index.php/tacl/article/view/887
Downloaded from http://cs.colorado.edu/~jbg/docs/2017_tacl_eval_tm_viz.pdf

Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Topic Labels

Alison Smith*, Tak Yeon Lee*, Forough Poursabzi-Sangdeh†, Jordan Boyd-Graber†, Niklas Elmqvist*, Leah Findlater*
*University of Maryland, College Park, MD
†University of Colorado, Boulder, CO
{amsmit,tylee}@cs.umd.edu
{forough.poursabzisangdeh,jordan.boyd.graber}@colorado.edu
{elm,leahkf}@cs.umd.edu

Transactions of the Association for Computational Linguistics, vol. 5, pp. 1–16, 2017. Action Editor: Timothy Baldwin. Submission batch: 2/2016; Revision batch: 6/2016; Published 1/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Probabilistic topic models are important tools for indexing, summarizing, and analyzing large document collections by their themes. However, promoting end-user understanding of topics remains an open research problem. We compare labels generated by users given four topic visualization techniques—word lists, word lists with bars, word clouds, and network graphs—against each other and against automatically generated labels. Our basis of comparison is participant ratings of how well labels describe documents from the topic. Our study has two phases: a labeling phase where participants label visualized topics and a validation phase where different participants select which labels best describe the topics' documents. Although all visualizations produce similar quality labels, simple visualizations such as word lists allow participants to quickly understand topics, while complex visualizations take longer but expose multi-word expressions that simpler visualizations obscure. Automatic labels lag behind user-created labels, but our dataset of manually labeled topics highlights linguistic patterns (e.g., hypernyms, phrases) that can be used to improve automatic topic labeling algorithms.

1 Comprehensible Topic Models Needed

A central challenge of the "big data" era is to help users make sense of large text collections (Hotho et al., 2005). A common approach to summarizing the main themes in a corpus is to use topic models (Blei, 2012), which are data-driven statistical models that identify words that appear together in similar documents. These sets of words or "topics" evince internal coherence and can help guide users to relevant documents. For instance, an FBI investigator sifting through the released Hillary Clinton e-mails may see a topic with the words "Benghazi", "Libya", "Blumenthal", and "success", spurring the investigator to dig deeper to find further evidence of inappropriate communication with longtime friend Sidney Blumenthal regarding Benghazi.

A key challenge for topic modeling, however, is how to promote end-user understanding of individual topics and the overall model. Most existing topic presentations use simple word lists (Chaney and Blei, 2012; Eisenstein et al., 2012). Although a variety of alternative topic visualization techniques exist (Sievert and Shirley, 2014; Yi et al., 2005), there has been no systematic assessment to compare them. Beyond exploring different visualization techniques, another means of making topics easier for users to understand is to provide descriptive labels to complement a topic's set of words (Aletras et al., 2014). Unfortunately, manual labeling is slow and, while automatic labeling approaches exist (Lau et al., 2010; Mei et al., 2007; Lau et al., 2011), their effectiveness is not guaranteed for all tasks.

To better understand these problems, we use labeling to evaluate topic model visualizations. Our study compares the impact of four commonly used topic visualization techniques on the labels that users create when interpreting a topic (Figure 1): word lists, word lists with bars, word clouds, and network graphs. On Amazon Mechanical Turk, one set of users viewed a series of individual topic visualizations and provided a label to describe each topic, while a second set of users assessed the quality of those labels alongside automatically generated ones.[1] Better labels imply that the topic visualization provides users a more accurate interpretation (labeling) of the topic.

The four visualization techniques have inherent trade-offs. Perhaps unsurprisingly, there is no meaningful difference in the quality of the labels produced from the four visualization techniques. However, simple visualizations (word list and word cloud) support a quick, first-glance understanding of topics, while more complex visualizations (network graph) take longer but reveal relationships between words. Also, user-created labels are better received than algorithmically generated labels, but more detailed analysis uncovers features specific to high-quality labels (e.g., tendency towards abstraction, inclusion of phrases) and the types of topics for which automatic labeling works. These findings motivate future automatic labeling algorithms.

[1] Data available at https://github.com/alisonmsmith/Papers/tree/master/TopicRepresentations.
2 Background

Presenting the full text of a document corpus is often impractical. For truly large and complex text corpora, abstractions, such as topic models, are necessary. Here we review probabilistic topic modeling and topic model interfaces.

2.1 Probabilistic Topic Modeling

Topic modeling algorithms produce statistical models that discover key themes in documents (Blei, 2012). Many specific algorithms exist; in this work we use Latent Dirichlet Allocation (Blei et al., 2003, LDA) as it is commonly employed. LDA is an unsupervised statistical topic modeling algorithm that considers each document to be a "bag of words" and can scale to large corpora (Zhai et al., 2012; Hoffman et al., 2013; Smola and Narayanamurthy, 2010). Assuming that each document is an admixture of topics, inference discovers each topic's distribution over words and each document's distribution over topics that best explain the corpus. The set of topics provides a high-level overview of the corpus, and individual topics can link back to the original documents to support directed exploration. The topic distributions can also be used to present other documents related to a given document.
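To make the two distributions concrete, the sketch below trains a small LDA model with the gensim library. This is an illustration under our own assumptions, not the paper's setup (the authors use Mallet; see Section 3.1); the toy corpus is invented. It exposes each topic's distribution over words, p(w|z), and each document's distribution over topics, p(z|d).

    # Illustrative only: a tiny LDA run with gensim, standing in for the
    # paper's Mallet setup, to expose p(w|z) and p(z|d).
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        ["bush", "iraq", "war", "military", "administration"],
        ["music", "band", "songs", "album", "concert"],
        ["food", "restaurant", "wine", "chef", "menu"],
    ]  # toy stand-in for a real corpus

    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]

    # Hyperparameters mirror the paper's footnote (alpha = 0.1, beta = 0.01),
    # except for the toy topic count (the paper uses n = 50 topics).
    lda = LdaModel(bows, id2word=dictionary, num_topics=3,
                   alpha=0.1, eta=0.01, random_state=7)

    print(lda.show_topic(0, topn=5))        # p(w|z): a topic's top words
    print(lda.get_document_topics(bows[0],  # p(z|d): a document's topic mixture
                                  minimum_probability=0.0))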
Clustering is hard because there are multiple reasonable objectives that are impossible to satisfy simultaneously (Kleinberg, 2003). Topic modeling evaluation has focused on perplexity, which measures how well a model can predict words in unseen documents (Wallach et al., 2009b; Jelinek et al., 1977). However, Chang et al. (2009) argue that evaluations optimizing for perplexity encourage complexity at the cost of human interpretability. Newman et al. (2010a) build on this insight, noting that "one indicator of usefulness is the ease by which one could think of a short label to describe the topic." Unlike previous interpretability studies, here we examine the connection between a topic's visual representation (not just its content) and its interpretability.

Recent work has focused on automatic generation of labels for topics. Lau et al. (2011) use Wikipedia articles to automatically label topics. The assumption is that for each topic there will be a Wikipedia article title that offers a good representation of the topic. Aletras et al. (2014) use a graph-based approach to better rank candidate labels. They generate a graph from the words in candidate articles and use PageRank to find a representative label. In Section 3 we use an adapted version of the method presented by Lau et al. (2011) as a representative automatic labeling algorithm.

2.2 Topic Model Visualizations

The topic visualization techniques in our study—word list, word list with bars, word cloud, and network graph—commonly appear in topic modeling tools. Here, we provide an overview of tools that display an entire topic model or models to the user, while more detail on the individual topic visualization techniques can be found in Section 3.2.

Topical Guide (Gardner et al., 2010), TopicViz (Eisenstein et al., 2012), and the Topic Model Visualization Engine (Chaney and Blei, 2012) are tools that support corpus understanding and directed browsing through topic models. They display the model overview as an aggregate of underlying topic visualizations. For example, Topical Guide uses horizontal word lists when displaying an overview of an entire topic model but uses a word cloud of the top 100 words for a topic when displaying only a single topic. TopicViz and the Topic Model Visualization Engine both represent topics with vertical word lists; the latter also uses set notation.

Other tools provide additional information within topic model overviews, such as the relationship between topics or temporal changes in the model. However, they still require the user to understand individual topics. LDAVis (Sievert and Shirley, 2014) includes information about the relationship between topics in the model. Multi-dimensional scaling projects the model's topics as circles onto a two-dimensional plane based on their inter-topic distances; the circles are sized by their overall prevalence. The individual topics, however, are then visualized on demand using a word list with bars. Smith et al. (2014) visualize a topic model using a nested network graph layout called group-in-a-box (Rodrigues et al., 2011, GIB). The individual topics are displayed using a network graph visualization, and related topics are displayed within a treemap (Shneiderman, 1992) layout. The result is a visualization where related words cluster within topics and related topics cluster in the overall layout.

TopicFlow (Smith et al., 2015) visualizes how a model changes over time using a Sankey diagram (Riehmann et al., 2005). The individual topics are represented both as word lists in the model overview and as word lists with bars when viewing a single topic or comparing between two topics. Argviz (Nguyen et al., 2013) captures temporal shifts in topics during a debate or a conversation. The individual topics are presented as word lists in the model overview and using word lists with bars for the selected topics. Klein et al. (2015) use a dust-and-magnet visualization (Yi et al., 2005) to visualize the force of topics on newspaper issues. The temporal trajectories of several newspapers are displayed as dust trails in the visualization. The individual topics are displayed as word clouds.

In contrast to these visualizations, which support viewing the underlying topics on demand, Termite (Chuang et al., 2012) uses a tabular layout of words and topics to provide an overview of the model to compare across topics. It organizes the model into clusters of related topics based on word overlap. This clustered representation is both space-efficient and speeds corpus understanding.

Despite the breadth of topic model visualizations, a small set of individual topic representations is ubiquitous: word list, word list with bars, word cloud, and network graph. In the following sections, we compare these topic visualization techniques.

Figure 1: Examples of the twelve experimental conditions, each a different visualization of the same topic about the George W. Bush presidential administration and the Iraq War. Rows represent cardinality, or number of topic words shown (five, ten, twenty). Columns represent visualization techniques. For word list and word list with bars, topic words are ordered by their probability for the topic. Word list with bars also includes horizontal bars to represent topic-term probabilities. In the word cloud, words are randomly placed but are sized according to topic-term probabilities. The network graph uses a force-directed layout algorithm to co-locate words that frequently appear together in the corpus.
3 Method: Comparing Visualizations

We conduct a controlled online study to compare the four commonly used visualization techniques identified in Section 2: word list, word list with bars, word cloud, and network graph. We also compare effectiveness with the number of topic words shown, that is, the cardinality of the visualization: five, ten, or twenty topic words.

3.1 Dataset

We select a corpus that does not assume domain expertise: 7,156 New York Times articles from January 2007 (Sandhaus, 2008). We model the corpus using an LDA (Blei et al., 2003) implementation in Mallet (Yao et al., 2009) with domain-specific stopwords and standard hyperparameter settings.[2] Our simple setup is by design: our goal is to emulate the "off the shelf" behavior of conventional topic modeling tools used by novice users. Instead of improving the quality of the model using asymmetric priors (Wallach et al., 2009a) or bigrams (Boyd-Graber et al., 2014), our topic model has topics of variable quality, allowing us to explore the relationship between topic quality and our task measures.

Automatic labels are generated from representative Wikipedia article titles using a technique similar to Lau et al. (2011). We first index Wikipedia using Apache Lucene.[3] To label a topic, we query Wikipedia with the top twenty topic words to retrieve fifty articles. These articles' titles comprise our candidate set of labels. We then represent each article using its TF-IDF vector and calculate the centroid (average TF-IDF) of the retrieved articles. To rank and choose the most representative of the set, we calculate the cosine similarity between the centroid TF-IDF vector and the TF-IDF vector of each of the articles. We choose the title of the article with the maximum cosine similarity to the centroid. Unlike Lau et al. (2011), we do not include the topic words or Wikipedia title n-grams derived from our label set, as these labels are typically not the best candidates. Although other automatic labeling techniques exist, we choose this one as it is representative of general techniques.

[2] n = 50, α = 0.1, β = 0.01
[3] http://lucene.apache.org/
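As a rough sketch of the centroid-ranking step just described (retrieval from the Lucene index is omitted, and scikit-learn's TfidfVectorizer stands in for whatever TF-IDF implementation the authors used):

    # Sketch of the label-ranking step, assuming the fifty candidate
    # Wikipedia articles have already been retrieved. Not the authors' code.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def label_topic(candidate_titles, candidate_texts):
        """Return the title of the article closest to the TF-IDF centroid."""
        tfidf = TfidfVectorizer().fit_transform(candidate_texts)
        centroid = np.asarray(tfidf.mean(axis=0))          # average TF-IDF
        sims = cosine_similarity(tfidf, centroid).ravel()  # similarity to it
        return candidate_titles[int(sims.argmax())]

    # Hypothetical candidates for a {music, band, songs} topic:
    titles = ["Music", "Rock band", "Song"]
    texts = ["music is an art form and cultural activity",
             "a rock band performs rock music with guitars and drums",
             "a song is a musical composition intended to be sung"]
    print(label_topic(titles, texts))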
3.2 Visualizations

As discussed in Section 2, our study compares four of the most common topic visualization techniques. To produce a meaningful comparison, the space given to each visualization is held constant: 400 × 250 pixels. Figure 1 shows each visualization for the three cardinalities (or number of words displayed) for the same topic.

Word List. The most straightforward topic representation is a list of the top n words in the topic, ranked by their probability. In practice, topic word lists have many variations. They can be represented horizontally (Gardner et al., 2010; Smith et al., 2015) or vertically (Eisenstein et al., 2012; Chaney and Blei, 2012), with or without commas separating the individual words, or using set notation (Chaney and Blei, 2012). Nguyen et al. (2013) add the weights to the word list by sizing the words based on their probability for the topic, which blurs the boundary with word clouds; however, this approach is not common. We use a horizontal list of equally sized words ordered by the probability p(w|z) for the word w in the topic z. For space efficiency, we organize our word list in two columns and add item numbers to make the ordering explicit.

Word List with Bars. Combining bar graphs with word lists yields a visual representation that conveys not only the ordering but also the absolute value of the weights associated with the words. We use a similar implementation to Smith et al. (2015) to add horizontal bars to the word list for a topic z, where the length of each bar represents the probability p(w|z) for each word w.
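A minimal rendering of the word-list-with-bars idea, with invented words and probabilities standing in for p(w|z) (matplotlib is our choice here, not necessarily the study's implementation):

    # Sketch: word list with bars; bar length encodes p(w|z).
    # Words and probabilities are invented for illustration.
    import matplotlib.pyplot as plt

    words = ["bush", "iraq", "war", "military", "president"]
    probs = [0.062, 0.051, 0.048, 0.031, 0.027]   # stand-ins for p(w|z)

    fig, ax = plt.subplots(figsize=(4.0, 2.5))    # roughly 400 x 250 pixels
    ypos = range(len(words))[::-1]                # highest-probability word on top
    ax.barh(ypos, probs, color="steelblue")
    ax.set_yticks(ypos)
    ax.set_yticklabels(words)
    ax.set_xlabel("p(w|z)")
    plt.tight_layout()
    plt.show()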
Word Cloud. The word cloud (or tag cloud) is one of the most popular and well-known text visualization techniques and is a common visualization for topics. Many options exist for word cloud layout, color scheme, and font size (Mueller, 2012). Existing work on layouts is split between those that size words by their frequency or probability for the topic (Ramage et al., 2010) and those that size by the rank order of the word (Barth et al., 2014). We use a combination of these techniques where the word's font size is initially set proportional to its probability in a topic p(w|z). However, when the word is too large to fit in the canvas, the size is gradually decreased (Barth et al., 2014). We use a gray scale to visually distinguish words and display all words horizontally to improve readability.

Network Graph. Our most complex topic visualization is a network graph. We use a similar network graph implementation to Smith et al. (2014), which represents each topic as a node-link diagram, where words are circular nodes with edges drawn between commonly co-occurring words. Each word's radius is scaled by the probability p(w|z) for the word w in a topic z. While Smith et al. (2014) draw edges based on document-level co-occurrence, we instead use edges to pull together phrases, so they are drawn between words w1 and w2 based on bigram count, specifically if log(count(w1, w2)) > k, with k = 0.1.[4] Edge width and color are applied uniformly to further reduce complexity in the graph. The network graph is displayed using a force-directed graph layout algorithm (Fruchterman and Reingold, 1991) where all nodes repel each other but links attract connected nodes. A small sketch of this construction follows below.

[4] From k ∈ {0.01, 0.05, 0.1, 0.5}, we chose k = 0.1 as the best trade-off between complexity and provided information.
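The sketch builds such a graph with networkx. Only the edge rule log(count(w1, w2)) > k with k = 0.1 and the Fruchterman-Reingold layout follow the description above; the counts and probabilities are invented.

    # Sketch: network-graph topic view with networkx. Counts and
    # probabilities are made up; only the edge rule and layout follow
    # the paper's description.
    import math
    import networkx as nx

    topic_words = {"bush": 0.06, "iraq": 0.05, "war": 0.05,
                   "white": 0.02, "house": 0.02}   # p(w|z) stand-ins
    bigram_counts = {("white", "house"): 420,
                     ("iraq", "war"): 310,
                     ("bush", "house"): 1}         # toy corpus bigram counts

    K = 0.1
    G = nx.Graph()
    for w, p in topic_words.items():
        G.add_node(w, radius=p)                 # node radius scaled by p(w|z)
    for (w1, w2), count in bigram_counts.items():
        if math.log(count) > K:                 # log(1) = 0 fails the test
            G.add_edge(w1, w2)

    pos = nx.spring_layout(G, seed=7)           # force-directed placement
    print(sorted(G.edges()))                    # bush-house edge filtered out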
3.3 Cardinality

Although every word has some probability for every topic, p(w|z), visualizations typically display only the top n words. The cardinality may interact with the effectiveness of the different visualization techniques (e.g., more complicated visualizations may degrade with more words). We use n ∈ {5, 10, 20}.

3.4 Task and Procedure

The study includes two phases with different users. In Labeling (Phase I), users describe a topic given a specific visualization, and we measure speed and self-reported confidence in completing the task. In Validation (Phase II), users select the best and worst among a set of Phase I descriptions and an automatically generated description for how well they represent the original topics' documents.

Phase I: Labeling. For each labeling task, users see a topic visualization, provide a short label (up to three words), then give a longer sentence to describe the topic, and finally use a five-point Likert scale to rate their confidence that the label and sentence represent the topic well. We also track the time to perform the task. Figure 2 shows an example of a labeling task using the network graph visualization technique with ten words.

Figure 2: The labeling task for the network graph and ten words. Users create a short label and full sentence describing the topic and rate their confidence that the label and sentence represent the topic well.

Labeling tasks are randomly grouped into human intelligence tasks (HITs) on Mechanical Turk[5] such that each HIT includes five tasks from the same visualization technique.[6]

[5] All users are in the US or Canada, have more than fifty previously approved HITs, and have an approval rating greater than 90%.
[6] We did not restrict users from performing multiple HITs, which may have exposed them to multiple visualization techniques. Users completed on average 1.5 HITs.

Phase II: Validation. In the validation phase, a new set of users assesses the quality of the labels and sentences created in Phase I by evaluating them against documents associated with the given topic. It is important to evaluate the topic labels in context; a label that superficially looks good is useless if it is not representative of the underlying documents in the corpus. Algorithmically generated labels (not sentences) are also included. Figure 3 shows an example of the validation task.

Figure 3: The validation task shows the titles of the top ten documents and five potential labels for a topic. Users are asked to pick the best and worst labels. Four labels were created by Phase I users after viewing different visualizations of the topic, while the fifth was generated by the algorithm. The labels are shown in random order.

The user-generated labels and sentences are evaluated separately. For each task, the user sees the titles of the top ten documents associated with a topic and a randomized set of labels or sentences, one elicited from each of the four visualization techniques within a given cardinality. The set of labels also includes an algorithmically generated label. We ask the user to select the "best" and "worst" of the labels or sentences based on how well they describe the documents. Documents are associated to topics based on the probability of the topic, z, given the document, d: p(z|d) (see the sketch at the end of this section). Only the title of each document is initially shown to the user, with an option to "show article" (or view the first 400 characters of the document).

All labels are lowercased to enforce uniformity. We merge identical labels so users do not see duplicates. If a merged label receives a "best" or "worst" vote, the vote is split equally across all of the original instances (i.e., across multiple visualization techniques with that label). Finally, we track task completion time.

Each user completes four randomly selected validation tasks as part of a HIT, with the constraint that each task must be from a different topic. We also use ground truth seeding for quality control: each HIT includes one additional test task that has a purposefully bad label generated by concatenating three random dictionary words. If the user does not pick the bad label as the "worst", we discard all data in that HIT.
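Ranking documents for the validation task only needs the document-topic matrix; a few illustrative lines, assuming a doc_topics array of p(z|d) values from any LDA implementation (the data here is random stand-in material, not the study's model output):

    # Sketch: top documents for a topic by p(z|d).
    import numpy as np

    rng = np.random.default_rng(0)
    doc_topics = rng.dirichlet([0.1] * 5, size=100)   # toy: 100 docs, 5 topics

    def top_documents(doc_topics, topic, n=10):
        """Indices of the n documents with the highest p(z=topic|d)."""
        return np.argsort(doc_topics[:, topic])[::-1][:n]

    print(top_documents(doc_topics, topic=2))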
3.5 Study Design and Data Collection

For Phase I, we use a factorial design with factors of Visualization (levels: word list, word list with bars, word cloud, and network graph) and Cardinality (levels: 5, 10, and 20), yielding twelve conditions. For each of the fifty topics in the model and each of the twelve conditions, at least five users perform the labeling task, describing the topic with a label and sentence, resulting in a minimum of 3,000 label and sentence pairs. Each HIT includes five of these labeling tasks, for a minimum of 600 HITs. The users are paid $0.30 per HIT.

For Phase II, we compare descriptions across the four visualization techniques (and automatically generated labels), but only within a given cardinality level rather than across cardinalities. We collected 3,212 label and sentence pairs from 589 users during Phase I. For validation in Phase II, we use the first five labels and sentences collected for each condition, for a total of 3,000 labels and sentences. These are shown in sets of four (labels or sentences) during Phase II, yielding a total of 1,500 (3,000/4 + 3,000/4) tasks. Each HIT contains four validation tasks and one ground truth seeding task, for a total of 375 HITs. To increase robustness, we validate twice for a total of 750 HITs, without allowing any two labels or sentences to be compared twice. The users get $0.50 per HIT.

4 Results

We analyze labeling time and self-reported confidence for the labeling task (Phase I) before reporting on the label quality assessments (Phase II). We then analyze linguistic qualities of the labels, which should motivate future work in automatic label generation.

We first provide an example of user-generated labels and sentences: the user labels for the topic shown in Figure 1 include government, iraq war, politics, bush administration, and war on terror. Examples of sentences include "President Bush's military plan in Iraq" and "World news involving the US president and Iraq".[7]

To interpret the results, it is useful to also understand the quality of the generated topics, which varies throughout the model and may impact a user's ability to generate good labels. We measure topic quality using topic coherence, an automatic measure that correlates with how much sense a topic makes to a user (Lau et al., 2014).[8] The average topic coherence for the model is 0.09 (SD = 0.05). Figure 4 shows the three best (top) and three worst topics (bottom) according to their observed coherence: the coherence metric distinguishes obvious topics from inscrutable ones. Section 4.3 shows that users created lower quality labels for low coherence topics.

Figure 4: Word list with bar visualizations of the three best (top) and worst (bottom) topics according to their coherence score, which is shown to the right of the topic number. The average topic coherence is 0.09 (SD = 0.05). [Panels, high coherence: (a) Topic 25 (coh. = 0.21), (b) Topic 26 (0.21), (c) Topic 3 (0.20); low coherence: (d) Topic 9 (0.01), (e) Topic 16 (0.01), (f) Topic 23 (0.02).]

[7] The complete set of labels and sentences are available at https://github.com/alisonmsmith/Papers/tree/master/TopicRepresentations.
[8] We use a reference corpus of 23 million Wikipedia articles for computing the normalized pointwise mutual information needed for computing the observed coherence.
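The observed coherence of Lau et al. (2014) averages normalized PMI (NPMI) over pairs of a topic's top words, estimated from the reference corpus. A compact sketch with stubbed probabilities (the real computation would derive them from occurrence and co-occurrence counts over Wikipedia):

    # Sketch: observed coherence as mean NPMI over top-word pairs
    # (Lau et al., 2014). Probabilities are stubbed, not corpus-derived.
    import math
    from itertools import combinations

    def npmi(w1, w2, p_word, p_pair, eps=1e-12):
        joint = p_pair.get((w1, w2), eps)
        pmi = math.log(joint / (p_word[w1] * p_word[w2]))
        return pmi / -math.log(joint)          # normalize to [-1, 1]

    def observed_coherence(top_words, p_word, p_pair):
        pairs = list(combinations(top_words, 2))
        return sum(npmi(w1, w2, p_word, p_pair) for w1, w2 in pairs) / len(pairs)

    # Invented probabilities for a {music, band, songs} topic:
    p_word = {"music": 0.01, "band": 0.005, "songs": 0.004}
    p_pair = {("music", "band"): 0.002, ("music", "songs"): 0.001,
              ("band", "songs"): 0.0008}
    print(round(observed_coherence(["music", "band", "songs"], p_word, p_pair), 3))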
Table 1: Overview of the labeling phase: number of tasks completed, the average and standard deviation (in parentheses) for time spent per task in seconds, and the average and standard deviation for self-reported confidence on a 5-point Likert scale for each of the twelve conditions.

Technique             Word List               Word List w/ Bars       Word Cloud              Network Graph
Cardinality           5      10     20        5      10     20        5      10     20        5      10     20
# tasks completed     264    268    268       264    280    260       268    268    268       267    274    263
Avg time (SD)         53.0   53.2   52.1      58.4   58.7   60.7      52.7   49.4   68.4      55.0   55.6   77.9
                      (44.3) (46.6) (53.3)    (75.1) (51.1) (57.9)    (47.4) (37.4) (85.4)    (50.7) (56.0) (71.9)
Avg confidence (SD)   3.7    3.7    3.6       3.6    3.6    3.7       3.5    3.6    3.6       3.4    3.6    3.7
                      (0.9)  (0.9)  (0.9)     (0.9)  (0.8)  (0.8)     (1.0)  (0.9)  (0.9)     (1.1)  (0.8)  (0.8)

4.1 Labeling Time

More complex visualization techniques take longer to label (Table 1 and Figure 5). The labeling tasks took on average 57.9 seconds (SD = 58.5) to complete, and a two-way ANOVA (visualization technique × cardinality) reveals significant main effects for both the visualization technique[9] and the cardinality,[10] as well as a significant interaction effect.[11]

For lower cardinality, the labeling time across visualization techniques is similar, but there are notable differences for higher cardinality. Posthoc pairwise comparisons based on the interaction effect (with Bonferroni adjustment) found no significant differences between visualizations with five words and only one significant difference for ten words (word list with bars was slower than word cloud, p < .05). For twenty words, however, the network graph was significantly slower, at an average of 77.9s (SD = 72.0), than the other three visualizations (p < .05). This effect is likely due to the network graph becoming increasingly dense with more nodes (Figure 1, bottom right). In contrast, the relatively simple word list visualization was significantly faster with twenty words than the three other visualizations (p < .05), taking only 52.1s on average (SD = 53.4). Word list with bars and word cloud were not significantly different from each other.

Figure 5: Average time for the labeling task, across visualizations and cardinalities, ordered from left to right by visual complexity. For 20 words, network graph was significantly slower and word list was significantly faster than the other visualization techniques. Error bars show standard error. [Axes: time in seconds (40-80) for each technique at cardinalities 5, 10, and 20 words.]

As a secondary analysis, we examine the relationship between elapsed time and the observed coherence for each topic. Topics with high coherence scores, for example, may be faster to label, because they are easier to interpret. However, the small negative correlation between time and coherence (Figure 6, top) was not significant (r(48) = −.13, p = .364).

[9] F(3, 3199) = 10.58, p < .001, ηp² = .01
[10] F(2, 3199) = 14.60, p < .001, ηp² = .01
[11] F(6, 3199) = 4.59, p < .001, ηp² = .01

4.2 Self-Reported Labeling Confidence

For each labeling task, users rate their confidence that their labels and sentences describe the topic well on a scale from 1 (least confident) to 5 (most confident). The average confidence across all conditions was 3.6 (SD = 0.9). Kruskal-Wallis tests show a significant impact of visualization technique on confidence with five and ten words, but not twenty.[12] While average confidence ratings across all conditions only range from 3.4 to 3.7, perceived confidence with network graph suffers when the visualization has too few words (Table 1).

As a secondary analysis, we compare the self-reported confidence with observed coherence for each topic (Figure 6, bottom). Increased user confidence with more coherent topics is supported by a moderate positive correlation between topic coherence and confidence (r(48) = .32, p = .026). This result provides further evidence that topic coherence is an effective measurement of topic interpretability.

Figure 6: Relationship between observed coherence and labeling time (top) and observed coherence and self-reported confidence (bottom) for each topic. The positive correlation (Slope = 1.64 and R² = 0.10) for confidence is significant.

[12] Five words: χ²(3) = 12.62, p = .006. Ten words: χ²(3) = 7.94, p = .047. We used nonparametric tests because the data is ordinal and we cannot guarantee that all differences between points on the scale are equal.
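The per-cardinality Kruskal-Wallis test in Section 4.2 is straightforward to reproduce with scipy; the confidence ratings below are fabricated placeholders, not study data:

    # Sketch: Kruskal-Wallis test of visualization technique on confidence
    # ratings at one cardinality (Section 4.2). Ratings are fabricated.
    from scipy.stats import kruskal

    word_list      = [4, 3, 4, 5, 3, 4]
    list_with_bars = [3, 4, 3, 4, 4, 3]
    word_cloud     = [3, 3, 4, 3, 5, 3]
    network_graph  = [2, 3, 3, 2, 4, 3]

    stat, p = kruskal(word_list, list_with_bars, word_cloud, network_graph)
    print(f"H = {stat:.2f}, p = {p:.3f}")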
{ } Topic 23 (Figure 4, f)— death, family, board — { } are two of the least coherent topics. The automatic enceandconfidence(r =.32, p=.026). Thisre- labels refusalofwork and deathofmichaeljackson 48 sult provides further evidence that topic coherence yielded the most “worst” votes and fewest “best” isaneffectivemeasurementoftopicinterpretability. votes. To further demonstrate this relationship, we ex- 4.3 OtherUsers’RatingofLabelQuality tractedfromthe50topicsthetopandbottomquar- Other users’ perceived quality of topic labels is the tilesof13topicseach14 basedontheirobservedco- best real-world measure of quality (as described in herence scores. Figure 8 shows a comparison of Section 3.4). Overall, the visualization techniques the “best” and “worst” votes for the topic labels for had similar quality labels, but automatically gener- these quartiles, including user-generated and auto- ated labels do not fare well. Automatic labels get matically generated labels. For the top quartile, the far fewer “best” votes and far more “worst” votes number of “best” votes per technique ranged from thanuser-generatedlabelsproducedfromanyofthe 61 for automatic labels to 96 for the network graph fourvisualizationtechniques(Figure7). Chi-square visualization. Therangeforthebottomquartilewas tests on the distribution of “best” votes for labels larger, from only 45 “best” votes for automatic la- foreachcardinalityshowthatthevisualizationmat- belsto99forwordlistwithbars. Theautomaticla- ters.13 Posthoc analysis using pairwise Chi-square bels, in particular, received a large relative increase in“best”voteswhencomparingthebottomquartile 13Five words: χ2 = 16.47, p = .002. Ten words: 4,N=500 χ2 =14.62,p=.006.Twentywords:χ2 =22.83, 14Wecouldnotgetexactquartiles,becausewehave50top- 4,N=500 4,N=500 p<.001. ics,soweroundeduptoinclude13topicsineachquartile. 9
