DTIC ADA453536: Story Link Detection and New Event Detection are Asymmetric PDF

4 Pages·0.1 MB·English

Story Link Detection and New Event Detection are Asymmetric

Francine Chen, Ayman Farahat, Thorsten Brants
PARC, 3333 Coyote Hill Rd, Palo Alto, CA 94304
[email protected], [email protected], [email protected]

Report Documentation Page (Standard Form 298): report date 2005; performing organization: Palo Alto Research Center (PARC), Palo Alto, CA 94304; approved for public release, distribution unlimited; security classification: unclassified.

Abstract

Story link detection has been regarded as a core technology for other Topic Detection and Tracking tasks such as new event detection. In this paper we analyze story link detection and new event detection in a retrieval framework and examine the effect of a number of techniques, including part-of-speech tagging, new similarity measures, and an expanded stop list, on the performance of the two detection tasks. We present experimental results showing that the utility of the techniques differs between the two tasks, which is consistent with our analysis.

1 Introduction

Topic Detection and Tracking (TDT) research is sponsored by the DARPA TIDES program. The research has five tasks related to organizing streams of data such as newswire and broadcast news (Wayne, 2000). A link detection (LNK) system detects whether two stories are "linked", that is, whether they discuss the same event. A story about a plane crash and another story about the funeral of the crash victims are considered to be linked. In contrast, a story about hurricane Andrew and a story about hurricane Agnes are not linked because they are two different events. A new event detection (NED) system detects when a story discusses a previously unseen event. Link detection is considered to be a core technology for new event detection and the other tasks.

Several groups are performing research on the TDT tasks of link detection and new event detection (e.g., Carbonell et al., 2001; Allan et al., 2000). In this paper, we compare the link detection and new event detection tasks in an information retrieval framework, examining the criteria for improving a NED system based on a LNK system, and give specific directions for improving each system separately. We also investigate the utility of a number of techniques for improving the systems.

2 Common Processing and Models

The Link Detection and New Event Detection systems that we developed for TDT2002 share many processing steps. These include preprocessing to tokenize the data, recognize abbreviations, normalize abbreviations, remove stop-words, replace spelled-out numbers by digits, add part-of-speech tags, replace the tokens by their stems, and then generate term-frequency vectors. Document frequency counts are incrementally updated as new sources of stories are presented to the system. Additionally, separate source-specific counts are used, so that, for example, the term frequencies for the New York Times are computed separately from stories from CNN. The source-specific, incremental document frequency counts are used to compute a TF-IDF term vector for each story. Stories are compared using either the cosine distance

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_t w(d_1, t)\, w(d_2, t)}{\sqrt{\sum_t w(d_1, t)^2 \, \sum_t w(d_2, t)^2}}$$

or the Hellinger distance

$$\mathrm{sim}(d_1, d_2) = \sum_t \sqrt{\frac{w(d_1, t)}{\sum_{t'} w(d_1, t')} \cdot \frac{w(d_2, t)}{\sum_{t'} w(d_2, t')}}$$

for terms $t$ in documents $d_1$ and $d_2$. To help compensate for stylistic differences between various sources (e.g., newspaper vs. broadcast news), translation errors, and automatic speech recognition errors (Allan et al., 1999), we subtract the average observed similarity values, in similar spirit to the use of thresholds conditioned on the sources (Carbonell et al., 2001).

3 New Event Detection

In order to decide whether a new document $d$ describes a new event, it is compared to all previous documents and the document $d^*$ with the highest similarity is identified. If the score $\mathrm{score}(d) = 1 - \mathrm{sim}(d, d^*)$ exceeds a threshold $\theta_s$, then there is no sufficiently similar previous document, and $d$ is classified as a new event.

4 Link Detection

In order to decide whether a pair of stories $d_1$ and $d_2$ are linked, we compute the similarity between the two documents using the cosine and Hellinger metrics. The similarity metrics are combined using a support vector machine, and the margin is used as a confidence measure that is thresholded.

5 Evaluation Metric

TDT system evaluation is based on the number of false alarms and misses produced by a system. In link detection, the system should detect linked story pairs; in new event detection, the system should detect new stories. A detection cost

$$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot P_{non\text{-}target} \quad (1)$$

is computed, where the costs $C_{miss}$ and $C_{FA}$ are set to 1 and 0.1, respectively. $P_{miss}$ and $P_{FA}$ are the computed miss and false alarm probabilities. $P_{target}$ and $P_{non\text{-}target}$ are the a priori target and non-target probabilities, set to 0.02 and 0.98, respectively. The detection cost is normalized by dividing by $\min(C_{miss} \cdot P_{target},\; C_{FA} \cdot P_{non\text{-}target})$ so that a perfect system scores 0 and a random baseline scores 1. Equal weight is given to each topic by accumulating error probabilities separately for each topic and then averaging. The minimum detection cost is the decision cost when the decision threshold is set to the optimal confidence score.

6 Differences between LNK and NED

The conditions for false alarms and misses are reversed for the LNK and NED tasks. In the LNK task, incorrectly flagging two stories as being on the same event is considered a false alarm. In contrast, in the NED task, incorrectly flagging two stories as being on the same event will cause a true first story to be missed. Conversely, in the LNK task, incorrectly labeling two stories that are on the same event as not linked is a miss, but in the NED task, incorrectly labeling two stories on the same event as not linked may result in a false alarm.

In this section, we analyze the utility of a number of techniques for the LNK and NED tasks in an information retrieval framework. The detection cost in Eqn. 1 assigns a higher cost to false alarms, since $C_{miss} \cdot P_{target} = 0.02$ and $C_{FA} \cdot P_{non\text{-}target} = 0.098$. A LNK system should minimize false alarms by identifying only linked stories, which results in high precision for LNK. In contrast, a NED system will minimize false alarms by identifying all stories that are linked, which translates to high recall for LNK. Based on this observation, we investigated a number of precision- and recall-enhancing techniques for the LNK and NED systems, namely part-of-speech tagging, an expanded stoplist, and normalizing abbreviations and transforming spelled-out numbers into numerals. We also investigated the use of different similarity measures.

6.1 Similarity Measures

The systems developed for TDT primarily use cosine similarity as the similarity measure. In work on text segmentation (Brants et al., 2002), better performance was observed with the Hellinger measure. Table 1 shows that for LNK, the system based on cosine similarity performed better; in contrast, for NED, the system based on Hellinger similarity performed better.

Table 1: Effect of different similarity measures on topic-weighted minimum normalized detection costs on the TDT 2002 dry run data.

System   Cosine   Hellinger   Change (%)
LNK      0.3180   0.3777      -0.0597 (-18.8)
NED      0.7059   0.5873      +0.1186 (+16.3)

The LNK task requires high precision, which corresponds to a large separation between the on-topic and off-topic distributions, as shown for the cosine metric in Figure 1. The NED task requires high recall (low CDF values for on-topic). Figure 2, which is based on pairs that contain the current story and its most similar story in the story history, shows a greater separation in this region with the Hellinger metric. For example, at 10% recall, the Hellinger metric has a 71% false alarm rate as compared to 75% for the cosine metric.

[Figure 1: CDF for cosine and Hellinger similarity on the LNK task for on-topic and off-topic pairs. Plot title: "LNK - Hellinger vs. Cosine"; x-axis: score; y-axis: CDF.]

[Figure 2: CDF for cosine and Hellinger similarity on the NED task for on-topic and off-topic pairs. Plot title: "NED - Hellinger vs. Cosine"; x-axis: Similarity; y-axis: CDF(Similarity).]

6.2 Part-of-Speech (PoS) Tagging

To reduce confusion among some word senses, we tagged the terms as one of five categories: adjective, noun, proper noun, verb, or other, and then combined the stem and part-of-speech to create a "tagged term". For example, 'N train' represents the term 'train' when used as a noun. The LNK and NED systems were tested using the tagged terms. Table 2 shows that PoS tagging has opposite effects on LNK and NED.

Table 2: Effect of using part-of-speech on minimum normalized detection costs on the TDT 2002 dry run data.

System   No PoS   PoS      Change (%)
LNK      0.3180   0.3334   -0.0154 (-4.8%)
NED      0.6403   0.5873   +0.0530 (+8.3%)

6.3 Stop Words

The broadcast news documents in the TDT collection have been transcribed using Automatic Speech Recognition (ASR). There are systematic differences between ASR and manually transcribed text. For example, "30" will be spelled out as "thirty" and "CNN" is represented as three separate tokens "C", "N", and "N". To handle these differences, an "ASR stoplist" was created by identifying terms with statistically different distributions in a parallel corpus of manually and automatically transcribed documents, the TDT2 corpus. Table 3 shows that use of an ASR stoplist improves the topic-weighted minimum detection costs for LNK but not for NED.

We also performed "enhanced preprocessing" to normalize abbreviations and transform spelled-out numbers into numerals, which improves both precision and recall. Table 3 shows that enhanced preprocessing performs worse than the ASR stoplist for Link Detection, but yields the best results for New Event Detection.

Table 3: Effect of using an "ASR stoplist" and "enhanced preprocessing" for handling ASR differences on the TDT 2001 evaluation data.

ASRstop   No      Yes              No
Preproc   Std     Std              Enh
LNK       0.312   0.299 (+4.4%)    0.301 (+3.3%)
NED       0.606   0.641 (-5.5%)    0.587 (+3.1%)

7 Summary and Conclusions

We have presented a comparison of story link detection and new event detection in a retrieval framework, showing that the two tasks are asymmetric in the optimization of precision and recall. We performed experiments comparing the effect of several techniques on the performance of LNK and NED systems. Although many of the processing techniques used by our systems are the same, the results of our experiments indicate that some techniques affect the performance of LNK and NED systems differently. These differences may be due in part to the asymmetry in the tasks and the corresponding differences in whether improving precision or recall for the link task is more important.

8 Acknowledgments

We thank James Allan of UMass for suggesting that precision and recall may partially explain the asymmetry of LNK and NED.

References

James Allan, Hubert Jin, Martin Rajman, Charles Wayne, Dan Gildea, Victor Lavrenko, Rose Hoberman, and David Caputo. 1999. Topic-based novelty detection. Summer workshop final report, Center for Language and Speech Processing, Johns Hopkins University.

James Allan, Victor Lavrenko, and Hubert Jin. 2000. First story detection in TDT is hard. In CIKM, pages 374–381.

Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In CIKM, pages 211–218, McLean, VA.

Jaime Carbonell, Yiming Yang, Ralf Brown, Chun Jin, and Jian Zhang. 2001. CMU TDT report. Slides at the TDT-2001 meeting, CMU.

Charles Wayne. 2000. Multilingual topic detection and tracking: Successful research enabled by corpora and evaluation. In LREC, pages 1487–1494, Athens, Greece.
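The similarity measures of Section 2, the new-event score of Section 3, and the normalized detection cost of Eqn. 1 can be sketched as follows. This is a minimal illustration, not the authors' system: it assumes each story has already been reduced to a dict mapping terms to non-negative TF-IDF weights, all function names are hypothetical, and it leaves out the source-specific similarity normalization and the support vector machine used for link detection.

```python
import math


def cosine_sim(w1, w2):
    """Cosine similarity between two term-weight dicts (term -> weight)."""
    dot = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    norm = math.sqrt(sum(w * w for w in w1.values())
                     * sum(w * w for w in w2.values()))
    return dot / norm if norm > 0 else 0.0


def hellinger_sim(w1, w2):
    """Hellinger similarity: each vector is normalized to sum to 1, then
    the square roots of the products of shared-term weights are summed.
    Assumes non-negative weights."""
    s1, s2 = sum(w1.values()), sum(w2.values())
    if s1 <= 0 or s2 <= 0:
        return 0.0
    return sum(math.sqrt((w / s1) * (w2.get(t, 0.0) / s2))
               for t, w in w1.items())


def ned_score(doc, history, sim=hellinger_sim):
    """score(d) = 1 - sim(d, d*), where d* is the most similar earlier story.
    With an empty history the story is maximally novel (assumed score 1)."""
    if not history:
        return 1.0
    return 1.0 - max(sim(doc, prev) for prev in history)


def normalized_detection_cost(p_miss, p_fa,
                              c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Eqn. 1 divided by min(C_miss * P_target, C_FA * P_non-target)."""
    p_non_target = 1.0 - p_target
    cost = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    return cost / min(c_miss * p_target, c_fa * p_non_target)


if __name__ == "__main__":
    # Toy stories with made-up weights, for illustration only.
    crash = {"plane": 2.0, "crash": 3.0, "victim": 1.0}
    funeral = {"crash": 1.0, "victim": 2.0, "funeral": 3.0}
    storm = {"hurricane": 3.0, "coast": 1.0}
    print(cosine_sim(crash, funeral), hellinger_sim(crash, funeral))
    print(ned_score(storm, [crash, funeral]))
    print(normalized_detection_cost(p_miss=0.2, p_fa=0.05))
```

A story would be declared a new event when `ned_score` exceeds the chosen threshold. With the default costs, a system that never flags anything (all misses, no false alarms) gets a normalized cost of 1, which matches the random-baseline normalization described in Section 5.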
