COMPARING EVALUATION METRICS FOR SENTENCE BOUNDARY DETECTION

Yang Liu (1), Elizabeth Shriberg (2,3)

(1) University of Texas at Dallas, Dept. of Computer Science, Richardson, TX, U.S.A.
(2) SRI International, Menlo Park, CA, U.S.A.
(3) International Computer Science Institute, Berkeley, CA, U.S.A.

ABSTRACT

In recent NIST evaluations on sentence boundary detection, a single error metric was used to describe performance. Additional metrics, however, are available for such tasks, in which a word stream is partitioned into subunits. This paper compares alternative evaluation metrics, including the NIST error rate, classification error rate per word boundary, precision and recall, ROC curves, DET curves, precision-recall curves, and area under the curves, and discusses advantages and disadvantages of each. Unlike many studies in machine learning, we use real data for a real task. We find benefit from using curves in addition to a single metric. Furthermore, we find that data skew has an impact on metrics, and that differences among different system outputs are more visible in precision-recall curves. Results are expected to help us better understand evaluation metrics that should be generalizable to similar language processing tasks.

Index Terms: sentence boundary detection, precision, recall, ROC curve

1. INTRODUCTION

Sentence boundary detection has received increasing attention in recent years, as a way to enrich speech recognition output for better readability and improved downstream natural language processing [1, 2, 3]. Automatic sentence boundary detection itself was part of the recent NIST rich transcription evaluations (see http://www.nist.gov/speech/tests/rt/rt2004/fall/ for more information on NIST evaluations). In addition, studies have been conducted to evaluate the impact of sentence segmentation on subsequent tasks, including speech translation, parsing, and speech summarization [2, 4, 5].

In the NIST evaluations, system performance for sentence boundary detection has been evaluated using an error rate (total number of inserted and deleted boundaries, divided by the number of reference boundaries). Research studies ([2, 3]) have also begun to look at the ROC curve, DET curve, and F-measure. Of course, since the ultimate goal is to aid downstream language processing tasks, a proper way to evaluate sentence boundary detection would be to look at the impact on the downstream tasks. In fact, in [2] it was shown that the optimal segmentation for parsing is indeed different from that obtained when optimizing for sentence boundary detection accuracy using the aforementioned NIST metric. Other application-based studies include the impact of sentence boundary detection for machine translation [4] and for summarization [5].

This paper examines various evaluation metrics and discusses their advantages and disadvantages. We focus on the boundary detection task itself, rather than its impact on downstream applications. In addition, we evaluate the effect of different priors of the event of interest (i.e., sentence boundaries) by using different corpora. Highly skewed priors are inherent to this and related tasks, since boundary events are typically rare compared to nonboundaries. Unlike most studies in machine learning, this work focuses on a real language processing task. The study is expected to help us better understand evaluation metrics that will be generalizable to many similar language processing tasks involving segmentation of the speech stream into subunits, for example, topic, story, or dialog act.
2. METRICS

The task of sentence boundary detection in speech is to determine the location of boundaries given a word sequence (typically from a speech recognizer) and the associated audio signal. In this study, we use reference transcriptions, to avoid confounding the discussion with the effects of recognition errors themselves; the metrics for recognition output apply similarly after aligning the recognition output with the reference transcriptions. We can represent the task as two-way classification or detection, that is, label each interword boundary as either "sentence" or "no-sentence". Table 1 shows the notation we use for the confusion matrix result. For a given task, the total number of samples is tp + fn + fp + tn, and the total number of positive samples is tp + fn.

                     system true    system false
    reference true        tp             fn
    reference false       fp             tn

Table 1. A confusion matrix for the system output. "True" means positive examples, that is, sentence boundaries in this task.

2.1. Metrics Description

Various metrics have been used for evaluating sentence boundary detection or similar tasks in individual studies. For example, in [6, 7], metrics are developed that treat the sentences as units and measure whether the reference and hypothesized sentences match exactly. Slot error rate [8] was introduced first for information extraction tasks, and later used for sentence boundary detection. Kappa statistics have often been used to evaluate human annotation consistency, and can also be used to evaluate system performance, that is, treating system output as a 'human' annotation. Other metrics in the general classification literature, such as cost curves [9], have not been widely used for evaluating sentence boundary detection.
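As a concrete illustration of how the Table 1 counts arise, the following minimal Python sketch accumulates them from parallel reference and system labels over interword boundaries (the function and variable names are ours, not from the paper or from the NIST tooling):

```python
def confusion_counts(reference, system):
    """Accumulate the Table 1 counts over interword boundary labels.

    reference, system: sequences of booleans, one entry per interword
    boundary; True marks a sentence boundary (a positive example).
    """
    tp = fn = fp = tn = 0
    for ref, hyp in zip(reference, system):
        if ref and hyp:
            tp += 1   # boundary correctly detected
        elif ref and not hyp:
            fn += 1   # missed boundary (deletion error)
        elif not ref and hyp:
            fp += 1   # spurious boundary (insertion error)
        else:
            tn += 1   # non-boundary correctly left unmarked
    return tp, fn, fp, tn
```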
In this paper we focus on the following metrics:

- NIST metric. The NIST error rate is the sum of the insertion and deletion errors per the number of reference sentence boundaries. Using the notation in Table 1, this becomes

      NIST error rate = (fn + fp) / (tp + fn)

  Note that the NIST evaluation tool mdeval (available from http://www.nist.gov/speech/tests/rt/rt2004/fall/tools/) allows boundaries within a small window to match up, in order to take into account different alignments from speech recognizers. We ignore that detail in this study and simply treat the task as straightforward classification.

- Classification error rate. If this task is represented as a classification task for each interword boundary point, then the classification error rate (CER) is

      CER = (fn + fp) / (tp + fn + fp + tn)

- Precision and recall. These metrics are widely used in information retrieval, and are defined as follows:

      precision = tp / (tp + fp)
      recall = tp / (tp + fn)

  A single metric is often used to reflect both precision and recall, and their tradeoff:

      F-measure = (2 × precision × recall) / (precision + recall)

- ROC curve. Receiver operating characteristic (ROC) curves are used for decision making in many detection tasks. They show the relationship between true positives (= tp / (tp + fn)) and false positives (= fp / (fp + tn)) as the decision threshold varies.

- PR curve. The precision-recall (PR) curve shows what happens to precision and recall as the decision threshold is varied.

- DET curve. A detection error tradeoff (DET) curve plots the miss rate (= fn / (tp + fn)) versus the false alarms (i.e., false positives), using the normal deviate scale [10]. It is widely used in the task of speaker recognition, but less used for other classification problems.

- AUC. The curves above provide a good view of a system's performance at different decision points. However, a single number is often preferred when comparing two curves or two models. The area under the curve (AUC) is used for this purpose. It is often used for both ROC and PR curves, but less so for DET curves (for DET curves, single metrics such as the EER (equal error rate) and DCF (detection cost function) are often used in speaker recognition).
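All of the single metrics above, and the operating points that make up the curves, follow from the Table 1 counts. The sketch below (our own illustration, building on the confusion_counts helper shown earlier) computes the single metrics for one operating point and sweeps a threshold over the per-boundary posterior probabilities to obtain one confusion matrix per threshold, from which ROC, PR, and DET points follow:

```python
def single_metrics(tp, fn, fp, tn):
    """Single-number metrics from Section 2.1, given the Table 1 counts."""
    nist_error = (fn + fp) / (tp + fn)        # errors per reference boundary
    cer = (fn + fp) / (tp + fn + fp + tn)     # errors per interword boundary
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return nist_error, cer, precision, recall, f_measure


def operating_points(posteriors, reference, thresholds):
    """Sweep a decision threshold over boundary posterior probabilities;
    each threshold yields one confusion matrix, i.e., one point on the
    ROC, PR, and DET curves."""
    points = []
    for t in thresholds:
        system = [p >= t for p in posteriors]
        points.append(confusion_counts(reference, system))
    return points
```

A fixed threshold (0.5 in the experiments reported below) gives the single operating point summarized by the scalar metrics; sweeping the threshold traces out the curves.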
2.2. Relationship

For a given task, the number of positive samples (i.e., np = tp + fn) and the total number of samples (i.e., tp + fn + fp + tn) are fixed. Therefore, precision and recall uniquely determine the confusion matrix, and hence the NIST error rate and the classification error rate. Each of the two error rates uniquely determines the other, since their denominators are proportional. However, from the two error rates (without detailed information about insertion or deletion errors), we cannot infer precision and recall.

The ROC and PR curves are one-to-one mapping curves. Each point on one curve uniquely determines the confusion matrix, and thus the corresponding point on the other curve. For the ROC and PR curves, it has been shown that if a curve is dominant in one space, then it is also dominant in the other [11]. Such a relationship also holds for the ROC and DET curves. This is straightforward from the definition of these curves: true positive rate versus false positive rate in ROC curves, and miss rate (i.e., 1 - true positive rate) versus false positive rate on the normal deviate scale in DET curves. Since the normal deviate transform is a monotonic function, changing the axis to a normal deviate scale preserves the property of being dominant.
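To make the first claim concrete: with np = tp + fn and the total sample count fixed, precision and recall pin down all four counts, and hence both error rates; the converse fails because the error rates expose only the sum fn + fp, not the insertion/deletion split. A small illustrative sketch (our own, not from the paper):

```python
def counts_from_precision_recall(precision, recall, n_pos, n_total):
    """Recover the Table 1 counts from precision and recall, given the
    fixed number of positive samples n_pos and total samples n_total."""
    tp = recall * n_pos           # recall = tp / (tp + fn), tp + fn = n_pos
    fn = n_pos - tp
    fp = tp / precision - tp      # precision = tp / (tp + fp)
    tn = n_total - tp - fn - fp
    return tp, fn, fp, tn
```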
3. ANALYSIS ON RT04 DATASET

3.1. Sentence Boundary Detection Task Setup

To study the behavior of different metrics on real data, we evaluated sentence boundary detection for a state-of-the-art system on two different corpora. We used the RT04 NIST evaluation data, conversational telephone speech (CTS) and broadcast news speech (BN). The total number of words in the test set is about 4.5K in BN and 3.5K in CTS. The prior probability of a sentence boundary is different across the two corpora, about 14% for CTS and 8% for BN (in the EARS program, each sentence-like unit was called an "SU" [12]). Comparing the two corpora allows us to investigate the effect of data skew differences on the metrics.

System output is based on the ICSI+SRI+UW sentence boundary detection system [3]. Five different system outputs are used in this study: decision tree classifiers using prosodic features only, a 4-gram language model (LM), a hidden Markov model (HMM) that combines prosody and language model, a maximum entropy (Maxent) model using prosodic and textual information, and the combination of HMM and Maxent (details of the modeling approaches can be found in [3]). For all these approaches, there is a posterior probability generated for each interword boundary, which we use to plot the curves or to set the decision threshold for a single metric.

3.2. Results and Discussion

Table 2 shows different single performance measures for sentence boundary detection for CTS and BN. A threshold of 0.5 is used to generate the hard decision for each boundary point. We used the recognizer forced alignment output (slightly different from the original transcripts) as the input to sentence boundary detection. Reference boundaries were obtained by matching the original sentence boundaries to the alignment output.

    BN                   Prosody   LM      HMM     Maxent  HMM+Maxent
    NIST error rate (%)   73.86    74.31   52.58   50.21   47.87
    CER (%)                6.10     6.14    4.34    4.15    3.96
    Precision              0.751    0.751   0.821   0.822   0.845
    Recall                 0.391    0.384   0.606   0.635   0.639
    F-measure              0.514    0.508   0.698   0.717   0.727
    ROC AUC                0.893    0.941   0.978   0.975   0.981
    PR AUC                 0.601    0.652   0.804   0.815   0.832

    CTS                  Prosody   LM      HMM     Maxent  HMM+Maxent
    NIST error rate (%)   53.94    40.22   29.42   28.38   27.78
    CER (%)                7.76     5.79    4.23    4.08    4.00
    Precision              0.864    0.842   0.876   0.894   0.896
    Recall                 0.547    0.736   0.823   0.812   0.817
    F-measure              0.670    0.785   0.848   0.851   0.855
    ROC AUC                0.928    0.969   0.985   0.985   0.987
    PR AUC                 0.791    0.878   0.929   0.934   0.938

Table 2. Different performance measures for sentence boundary detection in CTS and BN. The decision threshold is 0.5.

Figure 1 shows the ROC, PR, and DET curves for the five system outputs, for both CTS and BN. The points shown in the PR curves correspond to using 0.5 as the decision threshold (i.e., the results shown in Table 2). The points for HMM, Maxent, and their combination are close to each other, and thus are not individually labeled with arrows.

[Figure 1. ROC, PR, and DET curves for CTS and BN from five different systems: Prosody, LM, HMM, Maxent, and the combination of HMM and Maxent.]

In Table 2, for almost all the cases (except recall on CTS), the combination of HMM and Maxent achieves the best performance. The curves also show that generally HMM, Maxent, and their combination are close to each other, and much better than the other two curves for prosody and LM, for both CTS and BN. However, in this study, our goal is not to determine the best model to optimize a single performance metric. We are more interested in looking at different system outputs and how they behave with respect to the evaluation metrics.
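Curves like those in Figure 1 can be produced directly from the per-boundary posteriors and reference labels. The sketch below is one plausible way to do so; it uses scikit-learn, matplotlib, and SciPy, which the paper does not mention, and the probit (inverse normal) transform supplies the normal deviate axes of the DET plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.metrics import roc_curve, precision_recall_curve

def plot_curves(reference, posteriors, label):
    """Plot ROC, PR, and DET curves for one system output.

    reference: 0/1 labels per interword boundary (1 = sentence boundary);
    posteriors: the system's boundary posterior probabilities.
    """
    fpr, tpr, _ = roc_curve(reference, posteriors)
    prec, rec, _ = precision_recall_curve(reference, posteriors)
    miss = 1.0 - tpr  # miss rate = fn / (tp + fn)

    fig, (ax_roc, ax_pr, ax_det) = plt.subplots(1, 3, figsize=(12, 4))
    ax_roc.plot(fpr, tpr, label=label)
    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC")
    ax_pr.plot(rec, prec, label=label)
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="PR")
    # DET: both axes on the normal deviate (probit) scale [10]
    eps = 1e-6
    ax_det.plot(norm.ppf(np.clip(fpr, eps, 1 - eps)),
                norm.ppf(np.clip(miss, eps, 1 - eps)), label=label)
    ax_det.set(xlabel="False alarm probability (normal deviate)",
               ylabel="Miss probability (normal deviate)", title="DET")
    for ax in (ax_roc, ax_pr, ax_det):
        ax.legend()
    fig.tight_layout()
    plt.show()
```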
3.2.1. Domain and metrics

BN and CTS have different speaking styles and class distributions (priors of sentence boundaries), and thus comparisons across the two domains using a single metric may not be informative. For example, the CER is similar across the two domains (for HMM, Maxent, and their combination), but to some extent that is because of the higher degree of skew for BN than for CTS. Using other metrics such as the NIST error rate and precision/recall can better reflect inherent performance differences. As expected, for imbalanced datasets where the majority is negative examples, ROC curves show weakness in distinguishing different classifiers or comparing across tasks, since the large number of negative examples (i.e., tn + fp) often results in small differences between the false positive rates. The AUC for the ROC curves is quite high for both BN and CTS, whereas in the PR space the difference between BN and CTS is more noticeable. The PR curves and the associated AUC values are much worse in BN than in CTS. For the imbalanced data, PR curves often have advantages in exposing the difference between algorithms, as shown in Figure 1. DET curves also better illustrate the difference between the curves across the two corpora (e.g., the slopes of the curves).
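The skew effect described here can be reproduced in isolation: if the score distributions for boundaries and non-boundaries are held fixed and only the boundary prior is changed (roughly 8% for BN versus 14% for CTS), ROC AUC barely moves while the PR-space summary drops for the more skewed case. The sketch below is our own illustration, not an experiment from the paper, and uses average precision as a stand-in for PR AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def simulated_aucs(prior, n=50000):
    """Fixed class-conditional score distributions; only the prior varies."""
    labels = (rng.random(n) < prior).astype(int)
    scores = np.where(labels == 1,
                      rng.normal(1.5, 1.0, n),   # scores for boundaries
                      rng.normal(0.0, 1.0, n))   # scores for non-boundaries
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

for prior in (0.08, 0.14):  # roughly the BN and CTS boundary priors
    roc_auc, pr_auc = simulated_aucs(prior)
    print(f"prior={prior:.2f}  ROC AUC={roc_auc:.3f}  avg. precision={pr_auc:.3f}")
```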
3.2.2. Interaction among domains, models, and metrics

There is some difference between models across the two domains. For BN, using only the prosody model performs similarly to or slightly better than the LM alone, in terms of error rate, precision, and recall. However, the AUC values for the prosody model are worse than those for the LM, for both ROC and PR curves. As shown in the PR curves, in the region around the decision threshold (and also the region to the left, i.e., with lower recall), the prosody curve is better than the LM, but not in other regions. Therefore, using the curves helps to determine which model or system output is better for the region of interest. For BN, the PR curves for the prosody model and the LM cross in the middle, but this is not so for CTS, where the LM alone achieves better performance than prosody alone on most of the metrics (except precision, as shown in Table 2).

3.2.3. Single metrics versus curves

Table 2 shows that the different measurements for this sentence boundary task are highly correlated for one corpus: an algorithm that is better than another on one single metric is often better on many of them. However, one single metric does not provide all the information, since it is the measure for one particular chosen decision point. As described earlier, the NIST error rate and CER cannot determine the confusion matrix, or precision and recall, as they combine insertion and deletion errors (although that information can be made available). For downstream processing, if a different decision region is preferable, using the curves easily exposes information such as which model performs better in that region. For example, [2] shows that the optimal point for parsing is different from that chosen to optimize the single NIST error rate (intuitively, shorter utterances are more appropriate for parsing).

For the PR, ROC, and DET curves, from the discussion in Section 2, we know that the dominance of a curve in one space implies dominance in the other spaces. Additionally, if the curve for one algorithm is dominant over another, then its AUC is greater. However, a better AUC does not imply that the curve is dominant. Similarly, the AUC comparison for the PR and ROC curves can differ. For example, comparing HMM and Maxent on both corpora, Maxent has a better AUC in the PR space (not by a large margin), but not in the ROC space, as shown in Table 2.

In many cases, curves for different algorithms cross each other; therefore, it is not easy to conclude that one classifier outperforms the other. The decision is often based on downstream applications (e.g., improving readability, or providing input to machine translation or information extraction). In this situation, using the curves together with single-value measurements is a better approach. For visualization, PR curves expose information better than ROC curves, especially for imbalanced datasets. DET curves are also easier to visualize than ROC curves, and more effectively show differences between algorithms.

4. CONCLUSIONS

We used a real-world spoken language processing task to compare different performance metrics for sentence boundary detection. While this study is based on a particular sentence boundary detection system and posterior probability distribution, the focus was on a general comparison of alternative metrics, rather than on performance specifics. We examined single metrics, including the NIST error rate, classification error rate, precision, recall, and AUC, as well as decision curves (ROC, PR, and DET). Single metrics provide limited information; decision curves illustrate which model is best for a specific region and should be preferable for downstream language processing. Furthermore, data skew has an impact on metrics. For an imbalanced dataset, PR curves generally provide better visualization than do ROC curves for viewing differences among algorithms. Finally, while the analysis in this paper is based on sentence boundary detection, the nature of this task is similar to many other language processing applications (e.g., story segmentation). Hence, findings should be generalizable to other similar tasks.

5. ACKNOWLEDGMENTS

We thank Mary Harper, Andreas Stolcke, Mari Ostendorf, Dustin Hillard, and Barbara Peskin for the joint work on the sentence boundary detection system used, and for discussion of performance evaluation. This work is supported by DARPA under Contract No. HR0011-06-C-0023. Approved for public release; distribution unlimited.

6. REFERENCES

[1] D. Jones, F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. Reynolds, and M. Zissman, "Measuring the readability of automatic speech-to-text transcripts," in Proc. of Eurospeech, 2003, pp. 1585-1588.
[2] M. Harper, B. Dorr, B. Roark, J. Hale, Z. Shafran, Y. Liu, M. Lease, M. Snover, L. Young, R. Stewart, and A. Krasnyanskaya, "Final report: parsing speech and structural event detection," http://www.clsp.jhu.edu/ws2005/groups/eventdetect/documents/finalreport.pdf, 2005.
[3] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14(5), pp. 1526-1540, 2006.
[4] C. Zong and F. Ren, "Chinese utterance segmentation in spoken language translation," in Proc. of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, 2003.
[5] J. Mrozinski, E. Whittaker, P. Chatain, and S. Furui, "Automatic sentence segmentation of speech for automatic summarization," in Proc. of ICASSP, 2006.
[6] J. Ang, Y. Liu, and E. Shriberg, "Automatic dialog act segmentation and classification in multiparty meetings," in Proc. of ICASSP, 2005.
[7] M. Zimmermann, Y. Liu, E. Shriberg, and A. Stolcke, "Toward joint segmentation and classification of dialog acts in multiparty meetings," in Proc. of MLMI Workshop, 2005.
[8] J. Makhoul, F. Kubala, and R. Schwartz, "Performance measures for information extraction," in Proc. of the DARPA Broadcast News Workshop, 1999.
[9] C. Drummond and R. Holte, "Explicitly representing expected cost: An alternative to ROC representation," in Proc. of SIGKDD, 2000.
[10] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proc. of Eurospeech, 1997.
[11] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proc. of ICML, 2006.
[12] S. Strassel, Simple Metadata Annotation Specification V6.2, Linguistic Data Consortium, 2004.
