Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2013-08-12
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments
Robbie A. Haertel
Brigham Young University - Provo
Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Computer Sciences Commons
BYU ScholarsArchive Citation
Haertel, Robbie A., "Practical Cost-Conscious Active Learning for Data Annotation in Annotator-Initiated
Environments" (2013). Theses and Dissertations. 4242.
https://scholarsarchive.byu.edu/etd/4242
This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for
inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more
information, please contact scholarsarchive@byu.edu, ellen_amatangelo@byu.edu.
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments

Robbie A. Haertel

A dissertation submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Eric Karl Ringger, Chair
Kevin Darrell Seppi
Christophe Gerard Giraud-Carrier
Michael D. Jones
Kent Eldon Seamons

Department of Computer Science
Brigham Young University
August 2013

Copyright © 2013 Robbie A. Haertel
All Rights Reserved
ABSTRACT
Practical Cost-Conscious Active Learning for Data Annotation in
Annotator-Initiated Environments

Robbie A. Haertel
Department of Computer Science, BYU
Doctor of Philosophy
Many projects exist whose purpose is to augment raw data with annotations that increase the usefulness of the data. The number of these projects is rapidly growing, and in the age of "big data" the amount of data to be annotated is likewise growing within each project. One common use of such data is in supervised machine learning, which requires labeled data to train a predictive model. Annotation is often a very expensive proposition, particularly for structured data. The purpose of this dissertation is to explore methods of reducing the cost of creating such data sets, including annotated text corpora.
We focus on active learning to address the annotation problem. Active learning employs models trained using machine learning to identify instances in the data that are most informative and least costly. We introduce novel techniques for adapting vanilla active learning to situations wherein data instances are of varying benefit and cost, annotators request work "on-demand," and there are multiple, fallible annotators of differing levels of accuracy and cost. In order to account for data instances of varying cost, we build a model of cost from real annotation data based on a user study. We also introduce a novel cost-conscious active learning algorithm, which we call return-on-investment, that selects for annotation the instances that contain the most benefit per unit cost. To address the issue of annotators that request instances "on-demand," we develop a parallel, "no-wait" framework that performs computation while the annotator is annotating. As a result, annotators need not wait for the computer to determine the best instance for them to annotate, a common problem with existing approaches. Finally, we introduce a Bayesian model designed to simultaneously infer ground-truth annotations from noisy annotations, infer each individual annotator's accuracy, and predict its own accuracy on unseen data, without the use of a held-out set. We extend ROI-based active learning and our annotation framework to handle multiple annotators using this model. As a whole, our work shows that the techniques introduced in this dissertation reduce the cost of annotation in scenarios that are more true-to-life than those of previous research.
Keywords: active learning, cost-sensitive learning, machine learning, return-on-investment, Bayesian models, parallel active learning, natural language processing, part-of-speech tagging
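As a rough illustration of the return-on-investment criterion summarized above, the following is a minimal Python sketch of ROI-based instance selection. The function name roi_select and the toy benefit and cost estimators are assumptions introduced only for illustration; they are not the estimators developed in this dissertation, which derive benefit from model-based scores and cost from a model fit to user-study data.

from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def roi_select(candidates: Iterable[T],
               benefit: Callable[[T], float],
               cost: Callable[[T], float]) -> Optional[T]:
    """Pick the candidate with the highest estimated benefit per unit cost."""
    best, best_score = None, float("-inf")
    for x in candidates:
        c = cost(x)
        # Guard against zero or negative cost estimates.
        score = benefit(x) / c if c > 0 else float("-inf")
        if score > best_score:
            best, best_score = x, score
    return best

# Toy usage: cost approximated by sentence length, benefit held constant.
sentences = ["The dog barks .", "Colorless green ideas sleep furiously ."]
next_to_annotate = roi_select(sentences,
                              benefit=lambda s: 1.0,
                              cost=lambda s: float(len(s.split())))

With a constant benefit estimate, ROI reduces to choosing the cheapest instance; in practice, a model-derived benefit estimate (e.g., uncertainty) is what differentiates the candidates.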
ACKNOWLEDGMENTS
Nanos gigantum humeris insidentes, "Dwarves standing on the shoulders of giants,"1 is an old metaphor typically used to refer to the fact that new research is always built upon a much larger body of existing research. While this is certainly the case, another interpretation exists. Namely, a scientist is unable to perform his research without the enormous assistance, aid, and support of many others. The following are a sampling of some of the giants that carried me while I was working towards my degree; my apologies in advance to any of those that I have not mentioned by name: know that I am appreciative of all those who have assisted in any way.
First and foremost, I would like to thank Dr. Eric Ringger, my advisor. I am ever grateful that, in his first year at Brigham Young University, he took a chance on a student of Linguistics interested in natural language processing by inviting me to do a doctorate under his tutelage. He has procured funding for me and has taught me more than I could have imagined through his classes, our research, and other interactions. More importantly, he has taught me how to perform research so that I may continue to learn and discover new things. He has always been very supportive, inspiring, positive, uplifting, and, most of all, patient, even when I am sure I did not make these things easy. He has also deftly handled the administrative aspects of my degree.
I am also very grateful for Dr. Kevin Seppi, my second committee member. He has gone well beyond the call of duty of a second committee member, and in my mind he is really a co-advisor. I have learned volumes from his classes and our interactions. Unlike the stereotypical professor, Dr. Seppi was always in the trenches: like Dr. Ringger, he was always either present physically or available by phone and email late into the evenings of paper deadlines. I am thankful for his support and encouragement.
Likewise, my full Ph.D. committee has been very supportive. They supported me in my decisions to take time off for internships and also helped ensure I finished my dissertation after leaving my studies to work full time.
1 Translation taken from Wikipedia [97], which contains an interesting discussion about the history and use of the phrase.
My other co-authors have been incredibly helpful and insightful, including, but not limited to: Dr. James Carroll, Paul Felt, George Busby, Peter McClanahan, Marc Carmen, and Dr. Deryle Lonsdale. Although we have not (yet) been co-authors, I am also very grateful for meaningful and stimulating discussions with Dr. Daniel Walker IV.
I very much enjoyed the discussions I had with my colleagues at conferences and workshops, which undoubtedly shaped my views of active learning. In particular, I would like to thank Dr. Burr Settles for his kind feedback on a draft of Chapter 10 and for sharing his thoughts and ideas about all things active learning. I also had several fruitful conversations with Dr. Katrin Tomanek on the subject of cost-conscious active learning. In addition, I am thankful for her assistance as co-organizer of the Active Learning Workshop for NLP, 2010. Others who influenced me through our conversations and interactions include Dr. Michael Bloodgood and Dr. Kevin Small.
Completion of my degree would not have been possible without financial support and other resources. The Computer Science Department has been very generous in this regard, providing funding for my research for all but one year. For that year, I am grateful to Microsoft, who kindly provided me with a Mentor Grant. Brigham Young University also provides computing resources free of charge via the Fulton Supercomputing Lab, without which none of this research would have been completed. Of course, I am also extremely grateful to Google for the experience gained on both of my paid internships; Robert Gardner, Max Lin, and Gideon Mann were fabulous hosts who helped me reach my potential and complete successful internships. During this last year, while working on my dissertation as a full-time employee of Google, management has been very accommodating in allowing me to finish; it truly was a priority for those I work with and those above me that I finish. While at Google, I have completed some parts of the dissertation using company-provided equipment.
Last, but certainly not least, I would like to thank my family. I am most grateful for the support and sacrifice proffered me by the love of my life, my beautiful and dear wife of nearly eleven years, Meri. She has made incredible sacrifices and has willingly taken upon herself extra burdens at home to allow me time to finish my dissertation. This dissertation simply would not have been possible without her, for which I will be eternally grateful. This degree is as much hers as it is mine, and I am so blessed to be married to her and to walk the journey of life by her side and with her help.
While I have been a graduate student, all four of my children, Jared, Alex, Nathan, and my little princess Caroline, have been born. They, too, have been very patient with me through this process. Finally, I would like to thank my parents. They have provided continuous love and support for their "eternal student."
Table of Contents
1 Introduction 1

2 A Survey of Practical, Cost-Conscious Active Learning 4
  2.1 Supervised Machine Learning 4
  2.2 Formal Definitions 7
    2.2.1 Supervised Machine Learning 7
    2.2.2 Active Learning 8
  2.3 Transductive vs. Inductive Active Learning 11
  2.4 Evaluation 12
    2.4.1 Benefit 13
    2.4.2 Cost 13
    2.4.3 Cost in Simulation 14
  2.5 Scoring Functions 15
    2.5.1 EVSI Scoring Functions 15
    2.5.2 Return-on-Investment Scoring Function 17
    2.5.3 Other Cost-Sensitive Scoring Functions 19
  2.6 Cost and Benefit Functions 19
    2.6.1 Decision Theory Benefit Functions 19
    2.6.2 Heuristic Benefit Functions 21
    2.6.3 Cost 22
  2.7 Characteristics of Real Annotators 23
  2.8 Learner-Initiated vs. Annotator-Initiated Active Learning 24

3 Roadmap 26

4 Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation 30
  4.1 Introduction 30
  4.2 Part of Speech Tagging 31
  4.3 Active Learning 33
    4.3.1 Active Learning in the Language Context 33
    4.3.2 Query by Committee 36
    4.3.3 Query by Uncertainty 37
    4.3.4 Adaptations of QBU 38
  4.4 Experimental Results 39
    4.4.1 Setup 39
    4.4.2 Data Sets 40
    4.4.3 General Results 41
    4.4.4 QBC Results 44
    4.4.5 QBU Results 44
    4.4.6 Results on the BNC 44
    4.4.7 Another Perspective 45
  4.5 Conclusions 46
  4.6 Errata 47

5 Assessing the Costs of Machine-Assisted Corpus Annotation Through a User Study 48
  5.1 Introduction 49
  5.2 Experimental Design 51
    5.2.1 Conditions 51
    5.2.2 Control Variables 51
    5.2.3 Session Size 52
    5.2.4 Data Selection 52
    5.2.5 User Interface 54
    5.2.6 Subjects 56
  5.3 Descriptive Statistics 56
  5.4 Hourly Cost Models 58
  5.5 Future Work 60
  5.6 Addendum 61

6 Assessing the Costs of Sampling Methods in Active Learning for Annotation 62
  6.1 Introduction 62
  6.2 Benefit and Cost in Active Learning 63
  6.3 Evaluation Methodology and Results 66
  6.4 Normalized Methods 67
  6.5 Conclusions 68

7 Return on Investment for Active Learning 69
  7.1 Introduction 69
  7.2 The Role of Cost and Benefit 70
  7.3 Background and Decision Theoretic Framework for Active Learning 71
  7.4 Return on Investment 74
  7.5 Methodology 75
    7.5.1 Utility Estimation 75
    7.5.2 Cost Estimation 76
    7.5.3 Experimental Setup 76
  7.6 Results 78
    7.6.1 Cost Estimators 78
    7.6.2 Utility Estimators 80
  7.7 Conclusions and Future Work 81
  7.8 Errata 82

8 An Analytic and Empirical Evaluation of Return-on-Investment-Based Active Learning 84
  8.1 Introduction 84
  8.2 Related Work 85
  8.3 Theoretical Analysis of ROI 86
  8.4 Methods 92
    8.4.1 Cost Simulation 93
    8.4.2 Cost Estimation 93
    8.4.3 Benefit Estimation 94
    8.4.4 Practical Active Learning 95
  8.5 From Theory to Practice: To What Degree Are the Conditions Met? 96
  8.6 Active Learning Results and Discussion 100
  8.7 Conclusions and Future Work 102

9 Parallel Active Learning: Eliminating Wait Time with Minimal Staleness 104
  9.1 Introduction 105
  9.2 From Zero Staleness to Zero Wait 107
    9.2.1 Zero Staleness 108
    9.2.2 Traditional Batch 109
    9.2.3 Allowing Old Scores 110
    9.2.4 Eliminating Wait Time 110
  9.3 Experimental Design 112
  9.4 Results 114
  9.5 Conclusions and Future Work 119
  9.6 Addendum: Strengthening the Case for the Parallel Framework 120
    9.6.1 Analysis of Effects of Relative Time Spent Annotating on Performance 120
    9.6.2 Empirical Comparison 123
    9.6.3 Conclusions 123