Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2013-08-12
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments
Robbie A. Haertel
Brigham Young University - Provo
Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Computer Sciences Commons
BYU ScholarsArchive Citation
Haertel, Robbie A., "Practical Cost-Conscious Active Learning for Data Annotation in Annotator-Initiated
Environments" (2013). Theses and Dissertations. 4242.
https://scholarsarchive.byu.edu/etd/4242
This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for
inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more
information, please contact scholarsarchive@byu.edu, ellen_amatangelo@byu.edu.
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments

Robbie A. Haertel

A dissertation submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Eric Karl Ringger, Chair
Kevin Darrell Seppi
Christophe Gerard Giraud-Carrier
Michael D. Jones
Kent Eldon Seamons

Department of Computer Science
Brigham Young University
August 2013

Copyright © 2013 Robbie A. Haertel
All Rights Reserved
ABSTRACT
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments

Robbie A. Haertel
Department of Computer Science, BYU
Doctor of Philosophy
Many projects exist whose purpose is to augment raw data with annotations that increase the usefulness of the data. The number of these projects is rapidly growing, and in the age of "big data" the amount of data to be annotated is likewise growing within each project. One common use of such data is in supervised machine learning, which requires labeled data to train a predictive model. Annotation is often a very expensive proposition, particularly for structured data. The purpose of this dissertation is to explore methods of reducing the cost of creating such data sets, including annotated text corpora.
We focus on active learning to address the annotation problem. Active learning employs models trained using machine learning to identify instances in the data that are most informative and least costly. We introduce novel techniques for adapting vanilla active learning to situations wherein data instances are of varying benefit and cost, annotators request work "on-demand," and there are multiple, fallible annotators of differing levels of accuracy and cost. In order to account for data instances of varying cost, we build a model of cost from real annotation data based on a user study. We also introduce a novel cost-conscious active learning algorithm, which we call return-on-investment, that selects for annotation the instances that contain the most benefit per unit cost. To address the issue of annotators that request instances "on-demand," we develop a parallel, "no-wait" framework that performs computation while the annotator is annotating. As a result, annotators need not wait for the computer to determine the best instance for them to annotate, a common problem with existing approaches. Finally, we introduce a Bayesian model designed to simultaneously infer ground-truth annotations from noisy annotations, infer each individual annotator's accuracy, and predict its own accuracy on unseen data, without the use of a held-out set. We extend ROI-based active learning and our annotation framework to handle multiple annotators using this model. As a whole, our work shows that the techniques introduced in this dissertation reduce the cost of annotation in scenarios that are more true-to-life than those of previous research.
Keywords: active learning, cost-sensitive learning, machine learning, return-on-investment, Bayesian models, parallel active learning, natural language processing, part-of-speech tagging
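As a rough illustration of the return-on-investment criterion summarized above, the following is a minimal Python sketch of ROI-based instance selection. The function name roi_select and the toy benefit and cost estimators are assumptions introduced only for illustration; they are not the estimators developed in this dissertation, which derive benefit from model-based scores and cost from a model fit to user-study data.

from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def roi_select(candidates: Iterable[T],
               benefit: Callable[[T], float],
               cost: Callable[[T], float]) -> Optional[T]:
    """Pick the candidate with the highest estimated benefit per unit cost."""
    best, best_score = None, float("-inf")
    for x in candidates:
        c = cost(x)
        # Guard against zero or negative cost estimates.
        score = benefit(x) / c if c > 0 else float("-inf")
        if score > best_score:
            best, best_score = x, score
    return best

# Toy usage: cost approximated by sentence length, benefit held constant.
sentences = ["The dog barks .", "Colorless green ideas sleep furiously ."]
next_to_annotate = roi_select(sentences,
                              benefit=lambda s: 1.0,
                              cost=lambda s: float(len(s.split())))

With a constant benefit estimate, ROI reduces to choosing the cheapest instance; in practice, a model-derived benefit estimate (e.g., uncertainty) is what differentiates the candidates.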
ACKNOWLEDGMENTS
Nanos gigantum humeris insidentes, "Dwarves standing on the shoulders of giants,"1 is an old metaphor typically used to refer to the fact that new research is always built upon a much larger body of existing research. While this is certainly the case, another interpretation exists. Namely, a scientist is unable to perform his research without the enormous assistance, aid, and support of many others. The following are a sampling of some of the giants that carried me while I was working towards my degree; my apologies in advance to any of those that I have not mentioned by name: know that I am appreciative of all those who have assisted in any way.
First and foremost, I would like to thank Dr. Eric Ringger, my advisor. I am ever grateful that, in his first year at Brigham Young University, he took a chance on a student of Linguistics interested in natural language processing by inviting me to do a doctorate under his tutelage. He has procured funding for me and has taught me more than I could have imagined through his classes, our research, and other interactions. More importantly, he has taught me how to perform research so that I may continue to learn and discover new things. He has always been very supportive, inspiring, positive, uplifting, and, most of all, patient, even when I am sure I did not make these things easy. He has also deftly handled the administrative aspects of my degree.
I am also very grateful for Dr. Kevin Seppi, my second committee member. He has gone well beyond the call of duty of a second committee member, and in my mind he is really a co-advisor. I have learned volumes from his classes and our interactions. Unlike the stereotypical professor, Dr. Seppi was always in the trenches: like Dr. Ringger, he was always either present physically or available by phone and email late into the evenings of paper deadlines. I am thankful for his support and encouragement.
Likewise, my full Ph.D. committee has been very supportive. They supported me in my decisions to take time off for internships and also helped ensure I finished my dissertation after leaving my studies to work full time.
1 Translation taken from Wikipedia [97], which contains an interesting discussion about the history and use of the phrase.
My other co-authors have been incredibly helpful and insightful, including, but not limited to: Dr. James Carroll, Paul Felt, George Busby, Peter McClanahan, Marc Carmen, and Dr. Deryle Lonsdale. Although we have not (yet) been co-authors, I am also very grateful for meaningful and stimulating discussions with Dr. Daniel Walker IV.
I very much enjoyed the discussions I had with my colleagues at conferences and workshops, which undoubtedly shaped my views of active learning. In particular, I would like to thank Dr. Burr Settles for his kind feedback on a draft of Chapter 10 and for sharing his thoughts and ideas about all things active learning. I also had several fruitful conversations with Dr. Katrin Tomanek on the subject of cost-conscious active learning. In addition, I am thankful for her assistance as co-organizer of the Active Learning Workshop for NLP, 2010. Others who influenced me through our conversations and interactions include Dr. Michael Bloodgood and Dr. Kevin Small.
Completion of my degree would not have been possible without financial support and other resources. The Computer Science Department has been very generous in this regard, providing funding for my research for all but one year. For that year, I am grateful to Microsoft, who kindly provided me with a Mentor Grant. Brigham Young University also provides computing resources free of charge via the Fulton Supercomputing Lab, without which none of this research would have been completed. Of course, I am also extremely grateful to Google for the experience gained on both of my paid internships; Robert Gardner, Max Lin, and Gideon Mann were fabulous hosts who helped me reach my potential and complete successful internships. During this last year, while working on my dissertation as a full-time employee of Google, management has been very accommodating in allowing me to finish; it truly was a priority for those I work with and those above me that I finish. While at Google, I have completed some parts of the dissertation using company-provided equipment.
Last, but certainly not least, I would like to thank my family. I am most grateful for the support and sacrifice proffered me by the love of my life, my beautiful and dear wife of nearly eleven years, Meri. She has made incredible sacrifices and has willingly taken upon herself extra burdens at home to allow me time to finish my dissertation. This dissertation simply would not have been possible without her, for which I will be eternally grateful. This degree is as much hers as it is mine, and I am so blessed to be married to her and to walk the journey of life by her side and with her help.
While I have been a graduate student, all four of my children, Jared, Alex, Nathan, and my little princess Caroline, have been born. They, too, have been very patient with me through this process. Finally, I would like to thank my parents. They have provided continuous love and support for their "eternal student."
Table of Contents
1 Introduction 1

2 A Survey of Practical, Cost-Conscious Active Learning 4
  2.1 Supervised Machine Learning 4
  2.2 Formal Definitions 7
    2.2.1 Supervised Machine Learning 7
    2.2.2 Active Learning 8
  2.3 Transductive vs. Inductive Active Learning 11
  2.4 Evaluation 12
    2.4.1 Benefit 13
    2.4.2 Cost 13
    2.4.3 Cost in Simulation 14
  2.5 Scoring Functions 15
    2.5.1 EVSI Scoring Functions 15
    2.5.2 Return-on-Investment Scoring Function 17
    2.5.3 Other Cost-Sensitive Scoring Functions 19
  2.6 Cost and Benefit Functions 19
    2.6.1 Decision Theory Benefit Functions 19
    2.6.2 Heuristic Benefit Functions 21
    2.6.3 Cost 22
  2.7 Characteristics of Real Annotators 23
  2.8 Learner-Initiated vs. Annotator-Initiated Active Learning 24

3 Roadmap 26

4 Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation 30
  4.1 Introduction 30
  4.2 Part of Speech Tagging 31
  4.3 Active Learning 33
    4.3.1 Active Learning in the Language Context 33
    4.3.2 Query by Committee 36
    4.3.3 Query by Uncertainty 37
    4.3.4 Adaptations of QBU 38
  4.4 Experimental Results 39
    4.4.1 Setup 39
    4.4.2 Data Sets 40
    4.4.3 General Results 41
    4.4.4 QBC Results 44
    4.4.5 QBU Results 44
    4.4.6 Results on the BNC 44
    4.4.7 Another Perspective 45
  4.5 Conclusions 46
  4.6 Errata 47

5 Assessing the Costs of Machine-Assisted Corpus Annotation Through a User Study 48
  5.1 Introduction 49
  5.2 Experimental Design 51
    5.2.1 Conditions 51
    5.2.2 Control Variables 51
    5.2.3 Session Size 52
    5.2.4 Data Selection 52
    5.2.5 User Interface 54
    5.2.6 Subjects 56
  5.3 Descriptive Statistics 56
  5.4 Hourly Cost Models 58
  5.5 Future Work 60
  5.6 Addendum 61

6 Assessing the Costs of Sampling Methods in Active Learning for Annotation 62
  6.1 Introduction 62
  6.2 Benefit and Cost in Active Learning 63
  6.3 Evaluation Methodology and Results 66
  6.4 Normalized Methods 67
  6.5 Conclusions 68

7 Return on Investment for Active Learning 69
  7.1 Introduction 69
  7.2 The Role of Cost and Benefit 70
  7.3 Background and Decision Theoretic Framework for Active Learning 71
  7.4 Return on Investment 74
  7.5 Methodology 75
    7.5.1 Utility Estimation 75
    7.5.2 Cost Estimation 76
    7.5.3 Experimental Setup 76
  7.6 Results 78
    7.6.1 Cost Estimators 78
    7.6.2 Utility Estimators 80
  7.7 Conclusions and Future Work 81
  7.8 Errata 82

8 An Analytic and Empirical Evaluation of Return-on-Investment-Based Active Learning 84
  8.1 Introduction 84
  8.2 Related Work 85
  8.3 Theoretical Analysis of ROI 86
  8.4 Methods 92
    8.4.1 Cost Simulation 93
    8.4.2 Cost Estimation 93
    8.4.3 Benefit Estimation 94
    8.4.4 Practical Active Learning 95
  8.5 From Theory to Practice: To What Degree Are the Conditions Met? 96
  8.6 Active Learning Results and Discussion 100
  8.7 Conclusions and Future Work 102

9 Parallel Active Learning: Eliminating Wait Time with Minimal Staleness 104
  9.1 Introduction 105
  9.2 From Zero Staleness to Zero Wait 107
    9.2.1 Zero Staleness 108
    9.2.2 Traditional Batch 109
    9.2.3 Allowing Old Scores 110
    9.2.4 Eliminating Wait Time 110
  9.3 Experimental Design 112
  9.4 Results 114
  9.5 Conclusions and Future Work 119
  9.6 Addendum: Strengthening the Case for the Parallel Framework 120
    9.6.1 Analysis of Effects of Relative Time Spent Annotating on Performance 120
    9.6.2 Empirical Comparison 123
    9.6.3 Conclusions 123