Series ISSN 1947-4040
Series Editor: Graeme Hirst, University of Toronto
Statistical Significance Testing for Natural Language Processing
Rotem Dror, Technion – Israel Institute of Technology
Lotem Peled-Cohen, Technion – Israel Institute of Technology
Segev Shlomov, Technion – Israel Institute of Technology
Roi Reichart, Technion – Israel Institute of Technology
Data-driven experimental analysis has become the main evaluation tool of Natural Language Processing (NLP)
algorithms. In fact, in the last decade, it has become rare to see an NLP paper, particularly one that proposes a
new algorithm, that does not include extensive experimental analysis, and the number of involved tasks, datasets,
domains, and languages is constantly growing. This emphasis on empirical results highlights the role of statistical
significance testing in NLP research: if we, as a community, rely on empirical evaluation to validate our hypotheses
and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.

The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Our guiding
assumption throughout the book is that the basic question NLP researchers and engineers deal with is whether
or not one algorithm can be considered better than another one. This question drives the field forward, as it allows
constant progress in developing better technology for language processing challenges. In practice, researchers
and engineers would like to draw the right conclusion from a limited set of experiments, and this conclusion
should hold for other experiments with datasets they do not have at their disposal or that they cannot perform
due to limited time and resources. The book hence discusses the opportunities and challenges in using statistical
significance testing in NLP, from the point of view of experimental comparison between two algorithms. We
cover topics such as choosing an appropriate significance test for the major NLP tasks, dealing with the unique
aspects of significance testing for non-convex deep neural networks, accounting for a large number of comparisons
between two NLP algorithms in a statistically valid manner (multiple hypothesis testing), and, finally, the unique
challenges posed by the nature of the data and practices of the field.
ABOUT SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis lectures
provide concise original presentations of important research and
development topics, published quickly in digital and print formats. For
more information, visit our website: http://store.morganclaypool.com
Statistical Significance Testing
for Natural Language Processing
Synthesis Lectures on Human
Language Technologies
Editor
Graeme Hirst, University of Toronto
Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University
of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural
language processing, computational linguistics, information retrieval, and spoken language
understanding. Emphasis is on important new techniques, on new applications, and on topics that
combine two or more HLT subfields.
Statistical Significance Testing for Natural Language Processing
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
2020
Deep Learning Approaches to Text Production
Shashi Narayan and Claire Gardent
2020
Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from
Semantics to Pragmatics
Emily M. Bender and Alex Lascarides
2019
Cross-Lingual Word Embeddings
Anders Søgaard, Ivan Vulić, Sebastian Ruder, and Manaal Faruqui
2019
Bayesian Analysis in Natural Language Processing, Second Edition
Shay Cohen
2019
Argumentation Mining
Manfred Stede and Jodi Schneider
2018
Quality Estimation for Machine Translation
Lucia Specia, Carolina Scarton, and Gustavo Henrique Paetzold
2018
Natural Language Processing for Social Media, Second Edition
Atefeh Farzindar and Diana Inkpen
2017
Automatic Text Simplification
Horacio Saggion
2017
Neural Network Methods for Natural Language Processing
Yoav Goldberg
2017
Syntax-based Statistical Machine Translation
Philip Williams, Rico Sennrich, Matt Post, and Philipp Koehn
2016
Domain-Sensitive Temporal Tagging
Jannik Strötgen and Michael Gertz
2016
Linked Lexical Knowledge Bases: Foundations and Applications
Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek
2016
Bayesian Analysis in Natural Language Processing
Shay Cohen
2016
Metaphor: A Computational Perspective
Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov
2016
Grammatical Inference for Computational Linguistics
Jeffrey Heinz, Colin de la Higuera, and Menno van Zaanen
2015
Automatic Detection of Verbal Deception
Eileen Fitzpatrick, Joan Bachenko, and Tommaso Fornaciari
2015
Natural Language Processing for Social Media
Atefeh Farzindar and Diana Inkpen
2015
Semantic Similarity from Natural Language and Ontology Analysis
Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain
2015
Learning to Rank for Information Retrieval and Natural Language Processing, Second
Edition
Hang Li
2014
Ontology-Based Interpretation of Natural Language
Philipp Cimiano, Christina Unger, and John McCrae
2014
Automated Grammatical Error Detection for Language Learners, Second Edition
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault
2014
Web Corpus Construction
Roland Schäfer and Felix Bildhauer
2013
Recognizing Textual Entailment: Models and Applications
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto
2013
Linguistic Fundamentals for Natural Language Processing: 100 Essentials from
Morphology and Syntax
Emily M. Bender
2013
Semi-Supervised Learning and Domain Adaptation in Natural Language Processing
Anders Søgaard
2013
Semantic Relations Between Nominals
Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz
2013
Computational Modeling of Narrative
Inderjeet Mani
2012
Natural Language Processing for Historical Texts
Michael Piotrowski
2012
Sentiment Analysis and Opinion Mining
Bing Liu
2012
Discourse Processing
Manfred Stede
2011
Bitext Alignment
Jörg Tiedemann
2011
Linguistic Structure Prediction
Noah A. Smith
2011
Learning to Rank for Information Retrieval and Natural Language Processing
Hang Li
2011
Computational Modeling of Human Language Acquisition
Afra Alishahi
2010
Introduction to Arabic Natural Language Processing
Nizar Y. Habash
2010
Cross-Language Information Retrieval
Jian-Yun Nie
2010
Automated Grammatical Error Detection for Language Learners
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault
2010
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
2010
Semantic Role Labeling
Martha Palmer, Daniel Gildea, and Nianwen Xue
2010
Spoken Dialogue Systems
Kristiina Jokinen and Michael McTear
2009
Introduction to Chinese Natural Language Processing
Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang
2009
Introduction to Linguistic Annotation and Text Analytics
Graham Wilcock
2009
Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009
Statistical Language Models for Information Retrieval
ChengXiang Zhai
2008
Copyright © 2020 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations
in printed reviews, without the prior permission of the publisher.
Statistical Significance Testing for Natural Language Processing
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
www.morganclaypool.com
ISBN: 9781681737959 paperback
ISBN: 9781681737966 ebook
ISBN: 9781681738307 epub
ISBN: 9781681737973 hardcover
DOI 10.2200/S00994ED1V01Y202002HLT045
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Lecture #45
Series Editor: Graeme Hirst, University of Toronto
Series ISSN
Print 1947-4040 Electronic 1947-4059