Learning to Quantify PDF

145 Pages·2023·4.953 MB·English

Preview Learning to Quantify

The Information Retrieval Series

Andrea Esuli, Alessandro Fabris, Alejandro Moreo, Fabrizio Sebastiani
Learning to Quantify

The Information Retrieval Series, Volume 47

Series Editors
ChengXiang Zhai, University of Illinois, Urbana, IL, USA
Maarten de Rijke, University of Amsterdam, The Netherlands and Ahold Delhaize, Zaandam, The Netherlands

Editorial Board Members
Nicholas J. Belkin, Rutgers University, New Brunswick, NJ, USA
Charles Clarke, University of Waterloo, Waterloo, ON, Canada
Diane Kelly, University of Tennessee at Knoxville, Knoxville, TN, USA
Fabrizio Sebastiani, Consiglio Nazionale delle Ricerche, Pisa, Italy

Information Retrieval (IR) deals with access to and search in mostly unstructured information, in text, audio, and/or video, either from one large file or spread over separate and diverse sources, in static storage devices as well as on streaming data. It is part of both computer and information science, and uses techniques from e.g. mathematics, statistics, machine learning, database management, or computational linguistics. Information Retrieval is often at the core of networked applications, web-based data management, or large-scale data analysis.

The Information Retrieval Series presents monographs, edited collections, and advanced textbooks on topics of interest for researchers in academia and industry alike. Its focus is on the timely publication of state-of-the-art results at the forefront of research and on theoretical foundations necessary to develop a deeper understanding of methods and approaches.

This series is abstracted/indexed in EI Compendex and Scopus.
Andrea Esuli • Alessandro Fabris • Alejandro Moreo • Fabrizio Sebastiani
Learning to Quantify

Andrea Esuli
Istituto di Scienza e Tecnologie dell’Informazione
Consiglio Nazionale delle Ricerche
Pisa, Italy

Alessandro Fabris
Dipartimento di Ingegneria dell’Informazione
Università di Padova
Padova, Italy

Alejandro Moreo
Istituto di Scienza e Tecnologie dell’Informazione
Consiglio Nazionale delle Ricerche
Pisa, Italy

Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell’Informazione
Consiglio Nazionale delle Ricerche
Pisa, Italy

This work was supported by Istituto di Scienza e Tecnologie dell’Informazione

ISSN 1871-7500, ISSN 2730-6836 (electronic)
The Information Retrieval Series
ISBN 978-3-031-20466-1, ISBN 978-3-031-20467-8 (eBook)
https://doi.org/10.1007/978-3-031-20467-8

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Policy makers or computer scientists may be interested in finding the needle in the haystack (...), but social scientists are more commonly interested in characterizing the haystack.
(Daniel J. Hopkins and Gary King, 2010)

Preface

In a number of applications involving classification, the final goal is not determining which class (or classes) individual unlabelled instances belong to, but estimating the prevalence (or “relative frequency”, or “prior probability”) of each class in the unlabelled data. In recent years it has been pointed out that, in these cases, it would make sense to directly optimise machine learning algorithms for this goal, rather than (somehow indirectly) just optimising the classifier’s ability to label individual instances. The task of training estimators of class prevalence via supervised learning is known as learning to quantify, or, more simply, quantification. It is by now well known that performing quantification by classifying each unlabelled instance via a standard classifier and then counting the instances that have been assigned to the class (the Classify and Count method) usually leads to biased estimators of class prevalence, i.e., to poor quantification accuracy; as a result, methods (and evaluation measures) that address quantification as a task in its own right have been developed. This book covers the main applications of quantification, the main methods that have been developed for learning to quantify, the measures that have been adopted for evaluating it, and the challenges that still need to be addressed by future research.
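The bias of Classify and Count mentioned above can be illustrated with a minimal Python sketch. It is not taken from the book: the function names, the simulated classifier, and the numbers (a true positive rate of 0.8, a false positive rate of 0.2, a true prevalence of 0.1) are all hypothetical, and the sketch assumes a binary classifier whose error rates stay the same on the unlabelled data. Under that assumption the expected Classify and Count estimate is prevalence × tpr + (1 − prevalence) × fpr, which generally differs from the true prevalence.

```python
import random


def classify_and_count(predictions):
    """Classify and Count: the fraction of items predicted as positive."""
    return sum(predictions) / len(predictions)


def expected_cc(prevalence, tpr, fpr):
    """Expected Classify and Count estimate for a binary classifier,
    assuming fixed true/false positive rates (an illustrative assumption)."""
    return prevalence * tpr + (1 - prevalence) * fpr


def simulate_predictions(n, prevalence, tpr, fpr, rng):
    """Simulate hard predictions of a hypothetical classifier on n items,
    of which round(n * prevalence) are truly positive."""
    n_pos = round(n * prevalence)
    predictions = []
    for i in range(n):
        # Positives are flagged with probability tpr, negatives with fpr.
        p_flag = tpr if i < n_pos else fpr
        predictions.append(1 if rng.random() < p_flag else 0)
    return predictions


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    true_prevalence = 0.1   # hypothetical value
    preds = simulate_predictions(20000, true_prevalence, tpr=0.8, fpr=0.2, rng=rng)
    print(f"true prevalence:      {true_prevalence:.3f}")
    print(f"expected CC estimate: {expected_cc(true_prevalence, 0.8, 0.2):.3f}")
    print(f"simulated CC estimate: {classify_and_count(preds):.3f}")
```

With these hypothetical numbers the expected estimate is about 0.26 against a true prevalence of 0.1: the classifier is right on 80% of the items, yet the count it produces is far off because false positives among the many negatives outweigh the missed positives.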
The book is divided into seven chapters. Chapter 1 sets the stage for the rest of the book by introducing fundamental notions such as class distributions, their estimation, and dataset shift, by arguing for the suboptimality of using classification techniques for performing this estimation, and by discussing why learning to quantify has evolved as a task of its own, rather than remaining a by-product of classification. Chapter 2 provides the motivation for what is to come by describing the applications that quantification has been put to, ranging from improving classification accuracy in domain adaptation, to measuring and improving the fairness of classification systems with respect to a sensitive attribute, to supporting research and development in the social sciences, in political science, epidemiology, market research, and others. In Chapter 3 we move on to discuss the experimental evaluation of quantification systems; we look at evaluation measures for the various types of quantification systems (binary, single-label multiclass, multi-label multiclass, ordinal), but also at evaluation protocols for quantification, which essentially consist in ways to extract multiple testing samples for use in quantification evaluation from a single classification test set. Chapter 4 is possibly the central chapter of the book, and looks at the various supervised learning methods for learning to quantify that have been proposed over the years, be they of an aggregative nature (i.e., methods that require the classification of all individual unlabelled items as an intermediate step) or of a non-aggregative nature (i.e., methods in which no classification of individual items is performed). In Chapter 5 we look at a number of “advanced” (or niche) topics in quantification, including quantification for ordinal data, cross-lingual quantification of textual items, quantification for networked data, and quantification for streaming data. Chapter 6 looks at other aspects of the “quantification landscape” that have not been covered in the previous chapters, and discusses the evolution of quantification research, from its beginnings to the most recent quantification-based “shared tasks”, the landscape of quantification-based, publicly available software libraries, and other tasks in data science that present important similarities with quantification. Chapter 6 also presents the results of experiments, which we have carried out ourselves, in which we compare many of the methods discussed in Chapter 4 on a common testing infrastructure. Chapter 7 concludes the book, pointing to potential future developments in the quantification arena.

The book is mostly addressed to researchers in data science who might want to come up to speed with the state of the art in learning to quantify, but it can also be useful to researchers and scientists who operate in other disciplines and who apply techniques from data science to their own application domains. Indeed, it is our experience that many potential users of quantification techniques (who operate in the fields touched upon in Chapter 2, and possibly in others too) do not use them, thus settling for suboptimal “classify and count” techniques, for the simple fact that they are not aware of their existence, and of the existence of quantification as a task of its own; it is also those potential users that we hope will be inspired by this book.

We thus hope that the availability of a book that surveys all aspects of the quantification workflow and presents them in a hopefully accessible form will increase the interest in this subject on the part of researchers and practitioners alike, and will contribute to making quantification better known to potential users of this technology and to researchers interested in advancing the field.
Pisa, Italy: Andrea Esuli
Padova, Italy: Alessandro Fabris
Pisa, Italy: Alejandro Moreo
Pisa, Italy: Fabrizio Sebastiani

Acknowledgments

The work of Andrea Esuli, Alejandro Moreo, and Fabrizio Sebastiani has been supported by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData.it and FAIR projects, funded by the Italian Ministry of University and Research under the NextGenerationEU program. The authors’ opinions do not necessarily reflect those of the European Commission. The work by Alessandro Fabris was supported by MIUR (Italian Ministry for University and Research) under the “Departments of Excellence” initiative (Law 232/2016).

Contents

1 The Case for Quantification
  1.1 Class Distributions and Their Estimation
  1.2 The Suboptimality of Classify and Count
  1.3 Notational Conventions
  1.4 Quantification Problems
  1.5 Dataset Shift and Quantification
    1.5.1 Types of Dataset Shift and Their Relation to Quantification
  1.6 Quantification and Bias Mitigation
  1.7 Structure of This Book
2 Applications of Quantification
  2.1 Improving Classification Accuracy
    2.1.1 Word Sense Disambiguation
  2.2 Fairness
    2.2.1 Improving Fairness
    2.2.2 Measuring Fairness
  2.3 Sentiment Analysis
  2.4 Social and Political Sciences
  2.5 Market Research
  2.6 Epidemiology
  2.7 Ecological Modelling
  2.8 Resource Allocation
3 Evaluation of Quantification Algorithms
  3.1 Measures for Evaluating SLQ, BQ, and MLQ
    3.1.1 Properties of Evaluation Measures for SLQ, BQ, and MLQ
    3.1.2 Bias
    3.1.3 Absolute Error and its Variants
    3.1.4 Relative Absolute Error and its Variants
