Table of Contents

The Information Retrieval Series

Tetsuya Sakai
Laboratory Experiments in Information Retrieval
Sample Sizes, Effect Sizes, and Statistical Power

The Information Retrieval Series, Volume 40
Series Editors: ChengXiang Zhai, Maarten de Rijke
Editorial Board: Nicholas J. Belkin, Charles Clarke, Diane Kelly, Fabrizio Sebastiani
More information about this series at http://www.springer.com/series/6128

Tetsuya Sakai
Waseda University
Tokyo, Japan

ISSN 1387-5264 The Information Retrieval Series
ISBN 978-981-13-1198-7
ISBN 978-981-13-1199-4 (eBook)
https://doi.org/10.1007/978-981-13-1199-4
Library of Congress Control Number: 2018951773
© Springer Nature Singapore Pte Ltd. 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Classical statistical significance tests are useful to some extent in information retrieval (IR) evaluation, if we can regard the data at hand (e.g., a set of search queries or “topics” from a test collection) as a random sample from some population. In such a situation, we usually want to discuss population parameters (e.g., population means) based on what we have observed (e.g., sample means and variances). However, significance tests are sometimes misinterpreted and/or misused. In the first half of this book (Chaps. 1, 2, 3, 4, and 5), I first review parametric significance tests for comparing system means (i.e., t-tests and ANOVAs) and show how easily they can be conducted using Microsoft Excel or R. I also discuss a few multiple comparison procedures for researchers who are interested in comparing every system pair, including a randomised version of the Tukey HSD test. I then discuss known limitations of classical significance testing and provide practical guidelines for reporting research results regarding comparison of means. These chapters (minus Chap. 1 Sect. 1.3 where noncentral distributions are discussed) should be suitable for undergraduate students in computer science.

Chapters 6 and 7 (plus the aforementioned Chap. 1 Sect. 1.3), in which statistical power is discussed, are probably suitable for graduate students and researchers who regularly conduct laboratory experiments. In Chap. 6, I introduce topic set size design, which leverages the statistical power and confidence interval techniques as described in Nagata (see below), to enable test collection builders to determine an appropriate number of topics to create. My Excel tools for topic set size design based on t-tests, one-way ANOVA and confidence intervals are easy to use. In Chap. 7, I describe power-analysis-based methods for determining an appropriate sample size for a new experiment based on a similar experiment done in the past.
My R tools for power analysis with t-tests and ANOVA were adopted from those developed by Toyoda (see below), and rely on R libraries including pwr. I provide case studies from IR for both Excel-based topic set size design and R-based power analysis.

My main claims in this book are as follows. Researchers should conduct statistically well-designed experiments so that the right amount of effort is spent for seeking the truth; they should report experimental results appropriately so that their collective efforts will actually add up. I discuss IR as the primary target of research, but the techniques discussed in this book are applicable to related fields such as natural language processing and recommendation, as long as data at hand can be regarded as a random sample from some population.

This book occasionally cites work from IR literature in the 1960s and 1970s. Some of them are available for download from the SIGIR Museum.[1]

My five Excel tools for topic set size design (described in Chap. 6) are based on a Japanese book by Professor Yasushi Nagata of Waseda University. He kindly answered my numerous questions about statistical significance testing and sample size design. My five R scripts for power analysis (described in Chap. 7) are based on the original scripts written by Professor Hideki Toyoda, also of Waseda University. He kindly allowed me to modify his scripts and distribute them for research purposes. I thank these two professors for publishing fantastic Japanese books on these topics, from which I have learnt a lot.

I am also very grateful to Dr. Shunsuke Horii of Waseda University for carefully checking my manuscript and giving me very useful comments, and to my anonymous reviewer who gave me excellent suggestions from the perspective of an IR evaluation expert. I would also like to thank Mio Sugino of Springer for her patience and support and my PhD student Zhaohao Zeng for checking the manuscript, including the code provided in this book.
Finally, I thank my family (Miho, Rio, and Sunny-Uni-Hermione) for understanding the life and responsibilities of an academic and supporting me all the time.

Tokyo, Japan
June 2018
Tetsuya Sakai

[1] http://sigir.org/resources/museum/

Contents

1 Preliminaries
  1.1 Principles of Significance Testing
    1.1.1 Sample Means and Population Means
    1.1.2 Hypotheses, Test Statistics, and P-Values
    1.1.3 α, β, and Statistical Power
  1.2 Well-Known Probability Distributions
    1.2.1 Law of Large Numbers
    1.2.2 Normal Distribution and the Central Limit Theorem
    1.2.3 χ2 Distribution
    1.2.4 t Distribution
    1.2.5 F Distribution
  1.3 Less Well-Known Probability Distributions
    1.3.1 Noncentral t Distribution
    1.3.2 Noncentral χ2 Distribution
    1.3.3 Noncentral F Distributions
  References

2 t-Tests
  2.1 Introduction
  2.2 Paired t-Test
  2.3 Two-Sample t-Test
  2.4 Welch’s Two-Sample t-Test
  2.5 Which Two-Sample t-Test?
  2.6 Conducting a t-Test with Excel
  2.7 Conducting a t-Test with R
  2.8 Confidence Intervals for the Difference in Population Means
  References

3 Analysis of Variance
  3.1 One-Way ANOVA
    3.1.1 One-Way ANOVA with Equal Group Sizes
    3.1.2 One-Way ANOVA with Unequal Group Sizes
  3.2 Two-Way ANOVA Without Replication
  3.3 Two-Way ANOVA with Replication
  3.4 Conducting an ANOVA with Excel
  3.5 Conducting an ANOVA with R
  3.6 Confidence Intervals for System Means
  References

4 Multiple Comparison Procedures
  4.1 Introduction
  4.2 Familywise Error Rate
  4.3 Bonferroni Correction
    4.3.1 Principles and Limitations of the Bonferroni Correction
    4.3.2 Bonferroni Correction with R
  4.4 Tukey HSD Test
    4.4.1 Tukey HSD with Unequal Group Sizes
    4.4.2 Tukey HSD with Equal Group Sizes
    4.4.3 Tukey HSD with Paired Observations
    4.4.4 Simultaneous Confidence Intervals
    4.4.5 Tukey HSD with R
  4.5 Randomisation Test and Its Tukey HSD Version
    4.5.1 Randomisation Test for Two Systems
    4.5.2 Randomised Tukey HSD Test
  References

5 The Correct Ways to Use Significance Tests
  5.1 Limitations of Significance Tests
    5.1.1 Criticisms from the Literature
    5.1.2 Three Problems, Among Others
  5.2 Effect Sizes
    5.2.1 Effect Sizes for t-Tests
    5.2.2 Effect Sizes for Tukey HSD and RTHSD
    5.2.3 Effect Sizes for ANOVA
  5.3 How to Report Your Results
    5.3.1 Comparing Two Systems
    5.3.2 Comparing More Than Two Systems
  References

6 Topic Set Size Design Using Excel
  6.1 Overview of Topic Set Size Design
  6.2 Topic Set Size Design with the Paired t-Test
    6.2.1 How to Use the Paired t-Test-Based Tool
    6.2.2 How the Paired t-Test-Based Topic Set Size Design Works
  6.3 Topic Set Size Design with the Two-Sample t-Test
    6.3.1 How to Use the Two-Sample t-Test-Based Tool
    6.3.2 How the Two-Sample t-Test-Based Topic Set Size Design Works
  6.4 Topic Set Size Design with One-Way ANOVA
    6.4.1 How to Use the ANOVA-Based Tool
    6.4.2 How the ANOVA-Based Topic Set Size Design Works
  6.5 Topic Set Size Design with Confidence Intervals for Paired Data
    6.5.1 How to Use the Paired-Data CI-Based Tool
    6.5.2 How the Paired-Data CI-Based Tool Works
  6.6 Topic Set Size Design with Confidence Intervals for Unpaired Data
    6.6.1 How to Use the Two-Sample CI-Based Tool
    6.6.2 How the Two-Sample CI-Based Tool Works
  6.7 Estimating Population Variances
  6.8 Comparing the Different Topic Set Size Design Methods
    6.8.1 Paired and Two-Sample t-Tests vs. One-Way ANOVA
    6.8.2 CI-Based Topic Set Size Design: Paired vs. Unpaired Data
    6.8.3 One-Way ANOVA vs. Confidence Intervals
  References

7 Power Analysis Using R
  7.1 Introduction
  7.2 Overview of the R Scripts for Power Analysis
  7.3 Power Analysis with the Paired t-Test
  7.4 Power Analysis with the Two-Sample t-Test
  7.5 Power Analysis with One-Way ANOVA
  7.6 Power Analysis with Two-Way ANOVA Without Replication
  7.7 Power Analysis with Two-Way ANOVA with Replication
  7.8 Summary
  References

8 Conclusions
  8.1 A Quick Summary of the Book
  8.2 Statistical Reform in IR?
  References

Index

Description:

Covering aspects from principles and limitations of statistical significance tests to topic set size design and power analysis, this book guides readers to statistically well-designed experiments. Although classical statistical significance tests are to some extent useful in information retrieval (I

Laboratory Experiments in Information Retrieval - Sample Sizes, Effect Sizes, and Statistical Power PDF

157 Pages·2018·5.05 MB·English

by Tetsuya Sakai
