How to Select a Good Training-data Subset for Transcription: Submodular Active Selection for Sequences

Hui Lin, Jeff Bilmes
Department of Electrical Engineering, University of Washington, Seattle, WA 98195
{hlin,bilmes}@ee.washington.edu

This work was supported by an ONR MURI grant (No. N000140510388).

Abstract

Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We employ submodular active selection on a Fisher-kernel based graph over un-transcribed utterances. The selection is theoretically guaranteed to be near-optimal. Moreover, our approach is able to bootstrap without requiring any initial transcribed data, whereas traditional approaches rely heavily on the quality of an initial model trained on some labeled data. Our experiments on phone recognition show that our approach significantly outperforms both average-case random selection and uncertainty sampling.

Index Terms: transcription, labeling, submodularity, submodular selection, active learning, sequence labeling, phone recognition, speech recognition

1. Introduction

In automatic speech recognition and many other language applications, unlabeled data are abundant but labels (e.g., transcriptions) are expensive and time-consuming to acquire. For example, large amounts of speech data can easily be obtained via telephone calls and via modern voice-based applications such as Microsoft's Tellme and Google's voice search. Ideally, it would be possible to label all of this data for use as a training set in a speech recognition system, as aptly conveyed by the well-known phrase "there is no data like more data." Unfortunately, this would be impractical given the ever increasing amount of available unlabeled data.
Accurate phonetic transcription of speech utterances requires phonetic training, and even then it may take a month to annotate one hour of speech [1], not to mention the difficulty of transcribing at the articulatory level. Partly due to this, such low-level transcription efforts have been sidelined by the community in favor of word-level transcriptions. But even word-level transcriptions are time consuming (about 10 times real time), especially for conversational spontaneous speech. This problem is particularly acute for underrepresented languages or dialects with few speakers, where linguistic experts are even harder to find.

In this paper, we address the following question: given limited resources (time and/or budget), how can we optimally select a training-data subset for transcription such that the resulting system has optimal performance? In fact, this is a well-known problem and goes by the name of batch active learning, where a subset of data that is most informative and representative of the whole is selected for labeling. Often, examples are queried in a greedy fashion according to an informativeness measure used to evaluate all examples in the pool. Two popular strategies for measuring informativeness are uncertainty sampling and the query-by-committee approach. Uncertainty sampling [2] is the simplest and most commonly used strategy. In this framework, an initial system is typically trained using a small set of labeled examples. Then, the system examines the rest of the unlabeled examples and queries the examples it is most uncertain about. The measurement of uncertainty can either be entropy [3, 4, 5] or a confidence score [6, 7, 8, 3]. Query-by-committee [9, 10, 11] also starts with labeled data. A set of distinct models is trained as committee members. Each committee member is then allowed to vote on the labelings of the unlabeled examples. The most informative example is taken as the one the committee most disagrees about.

It has been shown that both uncertainty sampling and query-by-committee may fail when they tend to query outliers, which is the main motivating factor for other strategies like estimated error reduction [12]. The problem is that outliers might have high uncertainty (or a committee might find them controversial), but they are not good surrogates for "typical" samples. Indeed, an ideal selection strategy should choose a subset of samples that, when considered together, constitute in some form a good representation of the entire training data set. Methods such as [13, 14, 15, 3] address this problem, and all of them have been shown to be superior to methods that do not consider representativeness measures. Our approach herein also belongs to this category. In particular, we use a Fisher kernel (Section 4) to build a graph over the unlabeled sample sequences, and optimize submodular functions (to be defined) over the graph to find the most representative subset. Note that our Fisher kernel is over an unsupervised generative model, which enables us to bootstrap our active learning approach without needing any initial labeled data, yet we achieve good performance (see Section 5), perhaps because of the approximate optimality of our submodular procedures. This approach bodes well for under-represented languages for which an initial labeled set might be unavailable.

Despite extensive prior studies of active learning, there is relatively little work on active learning for sequence labeling. Several methods have been proposed, most of which are based either on uncertainty sampling or query-by-committee. In [11, 16, 6], confidence scores from a speech recognizer are used to indicate the informativeness of speech utterances. The active learning methods in [17] select the most uncertain examples based on an EM-style algorithm for learning HMMs from partially labeled data. In [18], several objective functions and algorithms are introduced for active learning in HMMs. Several new query strategies for probabilistic sequence models are introduced in [3], and an empirical analysis is conducted on a variety of benchmark datasets. Our approach can be distinguished from these methods in that we select the most representative subset in a submodular framework, where submodularity theoretically guarantees that the selection problem can be solved efficiently and near-optimally (see Section 2, Theorem 1 and Theorem 2). Submodularity has already been successfully used in active learning tasks. Robust submodular observation selection is explored in [19]. In [15], the authors relate Fisher information matrices to submodular functions so that the optimization can be done efficiently and effectively. To the best of our knowledge, our approach is the first work that incorporates submodularity for active learning in sequence labeling tasks such as speech recognition.

2. Background

2.1. Submodularity

Consider a set function z : 2^V → R, which maps subsets S ⊆ V of a finite set V to real numbers. Intuitively, V is the set of all unlabeled utterances, and the function z(·) scores the quality of any chosen subset. z(·) is called submodular [20] if for any S, T ⊆ V,

    z(S ∪ T) + z(S ∩ T) ≤ z(S) + z(T)    (1)

An equivalent condition for submodularity is the property of diminishing returns: for any R ⊆ S ⊆ V and s ∈ V \ S,

    z(S ∪ {s}) − z(S) ≤ z(R ∪ {s}) − z(R)    (2)

Intuitively, this means that adding an element s to the smaller set R helps at least as much as adding it to the superset S. Submodularity is the discrete analog of convexity [20]. Just as convexity makes continuous functions more amenable to optimization, submodularity plays an essential role in combinatorial optimization. Common submodular functions appear in many important settings, including graph cut [21], set covering [22], and facility location problems [23].
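To make definitions (1) and (2) concrete, the short Python check below evaluates a coverage-style set function, a standard example of a submodular function, and verifies the diminishing-returns inequality (2) for one choice of R ⊆ S and s. The sets used here are invented purely for illustration and are not from this work.

```python
# Minimal illustration of diminishing returns (inequality (2)) using a
# coverage function z(S) = |union of the areas covered by elements of S|,
# which is a classic submodular function. The sets below are made up.

def coverage(S, areas):
    """z(S): number of distinct items covered by the subsets indexed by S."""
    covered = set()
    for i in S:
        covered |= areas[i]
    return len(covered)

areas = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

R = {1}       # smaller set
S = {1, 2}    # superset of R
s = 3         # element to add

gain_S = coverage(S | {s}, areas) - coverage(S, areas)   # 2: adds {"d", "e"}
gain_R = coverage(R | {s}, areas) - coverage(R, areas)   # 3: adds {"c", "d", "e"}
assert gain_S <= gain_R   # inequality (2): adding s to R helps at least as much
print(gain_R, gain_S)
```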
2.2. Submodular Selection

We want to select a good subset S of the training data V that maximizes some objective function, such that the size of S is no larger than K (our budget). That is, we wish to compute

    max { z(S) : S ⊆ V, |S| ≤ K }    (3)

While NP-hard, this problem can be approximately solved using a simple greedy forward-selection algorithm. The algorithm starts with S = ∅ and iteratively adds the element s ∈ V \ S that maximally increases the objective function value, i.e.,

    s* = argmax_{s ∈ V\S} z(S ∪ {s})    (4)

until |S| = K. (A short code sketch of this greedy loop is given at the end of this section.) When z(·) is a nondecreasing and normalized submodular set function, this simple greedy algorithm performs near-optimally, as guaranteed by the following theorems.

Theorem 1 (Nemhauser et al., 1978 [24]). If the submodular function z(·) satisfies: i) nondecreasing: for all S1 ⊆ S2 ⊆ V, z(S1) ≤ z(S2); and ii) normalized: z(∅) = 0, then the set S_G* obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the optimal value, i.e.,

    z(S_G*) ≥ (1 − 1/e) · max_{S ⊆ V : |S| ≤ K} z(S)

The greedy algorithm, moreover, is likely the best we can do in polynomial time, unless P = NP.

Theorem 2 (Feige, 1998 [22]). Unless P = NP, there is no polynomial-time algorithm that guarantees a solution S* with

    z(S*) ≥ (1 − 1/e + ε) · max_{S ⊆ V : |S| ≤ K} z(S)   for any ε > 0    (5)
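Before specializing to particular objectives in Section 3, the following minimal Python sketch spells out the greedy forward selection of (3)-(4) for an arbitrary set function z. The function name and interface are illustrative assumptions rather than code from this work, and the sketch recomputes z from scratch at each step instead of using the incremental bookkeeping of Algorithm 1 in Section 3.

```python
# Plain greedy forward selection for problem (3)-(4).
# z: a set function taking a Python set of item indices and returning a real
#    number (assumed nondecreasing, normalized, and submodular so that the
#    (1 - 1/e) guarantee of Theorem 1 applies).
# V: the ground set of items (a set); K: the selection budget.
def greedy_select(z, V, K):
    S = set()
    while len(S) < K:
        base = z(S)
        best_item, best_gain = None, float("-inf")
        for s in V - S:
            gain = z(S | {s}) - base      # marginal gain of adding s
            if gain > best_gain:
                best_item, best_gain = s, gain
        S.add(best_item)
    return S
```

The facility location and graph cut objectives defined in Section 3.1 can be passed directly as z; Theorem 1's guarantee applies whenever z is nondecreasing, normalized, and submodular.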
3. Submodular Selection

Batch active learning problems are often cast as data subset selection, where the active learner asks for the labels of a subset of the data whose size is within budget and that is most likely to yield the most accurate classifier. Problem (3) can also be viewed as a data selection problem. Suppose we have a set of unlabeled training examples V = {1, 2, ..., N}, where certain pairs (i, j) are similar and the similarity of i and j is measured by a nonnegative value w_{i,j}. We can represent the unlabeled data using a graph G = (V, E), with a nonnegative weight w_{i,j} associated with each edge (i, j). The data selection problem is to find a subset S that is most representative of the whole set V, given the constraint |S| ≤ K. To measure how "representative" S is of the whole set V, we introduce several submodular set functions.

3.1. Submodular Set Functions

Our first objective is the uncapacitated facility location function [23]:

    Facility location:  z_1(S) = Σ_{i ∈ V} max_{j ∈ S} w_{i,j}    (6)

It measures the similarity of S to the whole set V. We can also measure the similarity of S to the remainder, i.e., the graph cut function:

    Graph cut:  z_2(S) = Σ_{i ∈ V\S} Σ_{j ∈ S} w_{i,j}    (7)

Both of these functions are submodular, as can be seen by verifying inequality (2) (proof omitted due to space limitations).

In order to apply Theorem 1, the objective function should also satisfy the nondecreasing property. Obviously, the facility location objective function is nondecreasing. For the graph cut objective, the increment of adding k to S is

    z_2(S ∪ {k}) − z_2(S) = Σ_{i ∈ V\S} w_{i,k} − Σ_{j ∈ S ∪ {k}} w_{k,j}

which is not always nonnegative. Fortunately, the proof of Theorem 1 does not use the monotone property for all possible sets [24][19, page 58]. The graph cut objective can still meet the conditions of Theorem 1 if |S| ≪ |V|, which is usually the case in applications where we have a large amount of data but only limited resources for labeling.

With the above objectives, we can use the greedy algorithm to solve the data selection problem efficiently and near-optimally. The greedy algorithm for submodular data selection with the facility location objective is described in Algorithm 1, where ρ_i = max_{j ∈ S} w_{i,j} is maintained incrementally to speed up the evaluation of marginal gains. The algorithm for the graph-cut objective is similar and is omitted to conserve space.

Algorithm 1: Greedy algorithm for the facility location objective
1: Input: G = (V, E) with weight w_{i,j} on edge (i, j); K: the number of examples to be selected
2: Initialization: S = ∅; ρ_i = 0 for i = 1, ..., N, where N = |V|
3: while |S| < K do
4:   k* = argmax_{k ∈ V\S} Σ_{i ∈ V, (i,k) ∈ E} (max{ρ_i, w_{i,k}} − ρ_i)
5:   S = S ∪ {k*}
6:   for all i ∈ V do
7:     ρ_i = max{ρ_i, w_{i,k*}}
8:   end for
9: end while
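For reference, a compact NumPy rendering of Algorithm 1 might look as follows. It assumes a dense, nonnegative similarity matrix w (i.e., every pair (i, j) is treated as an edge), which is a simplification made here for illustration rather than something specified above; a sparse edge list would be the natural substitute for large corpora.

```python
import numpy as np

def facility_location_greedy(w, K):
    """Greedy selection for the facility location objective (Algorithm 1).

    w : (N, N) array of nonnegative similarities w[i, j].
    K : selection budget.
    Returns the list of selected indices.
    """
    N = w.shape[0]
    S = []
    rho = np.zeros(N)                    # rho[i] = max_{j in S} w[i, j]
    selected = np.zeros(N, dtype=bool)
    while len(S) < K:
        # Marginal gain of candidate k: sum_i (max(rho[i], w[i, k]) - rho[i]).
        gains = np.maximum(w, rho[:, None]).sum(axis=0) - rho.sum()
        gains[selected] = -np.inf        # do not re-select
        k = int(np.argmax(gains))
        S.append(k)
        selected[k] = True
        rho = np.maximum(rho, w[:, k])   # update rho for the new "facility"
    return S

# Toy usage with a random symmetric similarity matrix (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((8, 8))
w = (X + X.T) / 2
print(facility_location_greedy(w, K=3))
```

With the ρ bookkeeping, each iteration of the loop costs about O(N^2) on a dense graph, instead of re-evaluating the full facility location objective for every candidate.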
4. Fisher Kernel

We express the pairwise "similarity" between utterances i and j in terms of a kernel function κ(i, j), so that w_{i,j} = κ(i, j). Since the examples are sequences with possibly different lengths, we use the Fisher kernel [25], which is applicable to variable-length sequences. Consider a generative model (e.g., a hidden Markov model, or more generally, a dynamic Bayesian network (DBN)) with parameters θ that models the generation process of the sequences. Denote X_i = (x_{i,1}, ..., x_{i,T_i}) as the features of a sequence with length T_i. Then a fixed-length vector, known as the Fisher score, can be extracted as:

    U_i = ∂/∂θ log p(X_i | θ)    (8)

Each component of U_i is a derivative of the log-likelihood score for the sequence X_i with respect to a particular parameter; the Fisher score is thus a vector whose length equals the number of parameters θ. The computation of the gradients in Eq. 8 in the context of DBNs is described in detail in [26].

Given Fisher scores, different sequences with different lengths may be represented by fixed-length vectors, so we can easily define several Fisher kernel functions to measure pairwise similarity, e.g., cosine similarity, radial-basis function (RBF) kernel similarity, or, as shown below, the negative ℓ1 similarity:

    Negative ℓ1 norm:  κ(i, j) = −||U_i − U_j||_1    (9)

The generative model that is used to generate the Fisher score may contain several types of parameters (i.e., discrete conditional probability tables and continuous Gaussian parameters), and the values associated with different types of parameters may have quite different numeric dynamic ranges. In order to reduce the heterogeneity within the Fisher score vector, all our experiments apply the following global variance normalization to produce the final Fisher score vectors U_i′:

    U_i′ = (diag(Σ))^(−1/2) · (U_i − Ū)    (10)

where Ū = (1/N) Σ_{i=1}^{N} U_i and Σ = (1/N) Σ_{i=1}^{N} (U_i − Ū)^T (U_i − Ū).
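As an illustration of Eqs. (8)-(10), the sketch below computes Fisher scores for a toy diagonal-covariance GMM, with gradients taken only with respect to the means, then applies the global variance normalization of (10) and the negative ℓ1 kernel of (9). The model choice, the array shapes, and the treatment of frames as independent are simplifying assumptions made here; they stand in for, and are not equivalent to, the HMM/DBN Fisher scores computed with gmtkKernel in this work.

```python
import numpy as np

def gmm_fisher_score(X, means, variances, weights):
    """Fisher score (Eq. 8) of a sequence X (T x D) w.r.t. the GMM means.

    means, variances : (M, D) arrays; weights : (M,) mixture weights.
    Returns the flattened (M * D,) gradient of log p(X | theta) w.r.t. the means.
    """
    # Per-frame, per-component log densities of a diagonal-covariance GMM.
    diff = X[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_comp = (-0.5 * np.sum(diff**2 / variances, axis=2)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                + np.log(weights))                               # (T, M)
    # Posterior responsibilities gamma[t, m].
    gamma = np.exp(log_comp - log_comp.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # d log p(X) / d mu_m = sum_t gamma[t, m] * (x_t - mu_m) / var_m.
    grad = np.sum(gamma[:, :, None] * diff / variances, axis=0)  # (M, D)
    return grad.ravel()

def normalized_scores(U):
    """Global variance normalization of Eq. (10); U is (N, P)."""
    return (U - U.mean(axis=0)) / np.sqrt(U.var(axis=0) + 1e-12)

def neg_l1_kernel(U):
    """Pairwise negative l1 similarities of Eq. (9)."""
    return -np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2)

# Toy usage: a random GMM and three random-length "utterances" (illustrative only).
rng = np.random.default_rng(0)
M, D = 4, 3
means = rng.normal(size=(M, D))
variances = np.ones((M, D))
weights = np.full(M, 1.0 / M)
seqs = [rng.normal(size=(T, D)) for T in (20, 35, 50)]
U = np.stack([gmm_fisher_score(X, means, variances, weights) for X in seqs])
w = neg_l1_kernel(normalized_scores(U))
print(w)
```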
5. Experiments

We evaluated our methods on a phone recognition task using the TIMIT corpus. Random selection was used as a baseline. Specifically, we randomly take p% of the TIMIT training set, where p = 2.5, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90. For each subset, a 3-state context-independent (CI) hidden Markov model (HMM), implemented as a DBN, was trained for each of the 48 phones. The number of Gaussian components in the Gaussian mixture model (GMM) was optimized according to the amount of training data available. The 48 phones were then mapped down to 39 phones for scoring purposes, following standard practice [27]. Recognition was performed using standard Viterbi search without a phone language model (a language model was not used both to emphasize acoustic model performance and because this speeds up experimental turnaround time by avoiding tedious language model scaling and penalty parameter tuning when large random selection experiments are performed). 100 trials of random selection experiments were performed for each of the percentages above. The average phone error rates (PER) were calculated and used as the baseline. The standard deviation was around 0.01 for small p and about 0.005 for larger p. Apart from the data selection strategy, the uncertainty sampling and submodular selection experiments followed exactly the same setup as random selection.

Uncertainty sampling and submodular selection were evaluated under two scenarios. The first scenario we considered is when there is no initial model available. In this scenario, uncertainty sampling would typically randomly select a small portion of the unlabeled data to label, and then train an initial model using these randomly selected data. We did the following: a) randomly select α% of the training data, acquire the labels, and train an initial model; b) use the learned model to predict the unlabeled data and select the M most uncertain samples for labeling; c) retrain the model using all labeled data; if the number of labeled samples reaches the target amount, stop, else go to step b). We used α = 1 and M = 100 in the experiments, and the average per-frame log-likelihood was used as the uncertainty measurement; a sketch of this loop appears below.
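The following Python sketch restates steps a)-c) above as a generic loop. The interface (the train_fn and uncertainty_fn callables and the parameter names) is an illustrative framing introduced here, not an API from this work; in the setup above, the initial random portion is α = 1%, M = 100 utterances are queried per round, and uncertainty is the average per-frame log-likelihood under the current recognizer.

```python
import random

def uncertainty_sampling_selection(pool, target, train_fn, uncertainty_fn,
                                   init_frac=0.01, batch_size=100, seed=0):
    """Batch uncertainty sampling following steps a)-c).

    pool           : list of unlabeled utterance ids.
    target         : total number of utterances to select (the budget).
    train_fn       : callable(labeled_ids) -> model; retrains on the labeled set.
    uncertainty_fn : callable(model, utt_id) -> score; lower means more
                     uncertain (e.g., average per-frame log-likelihood).
    init_frac      : fraction of the pool labeled at random to bootstrap (step a).
    batch_size     : number of utterances queried per round (step b).
    """
    rng = random.Random(seed)
    labeled = rng.sample(pool, max(1, int(init_frac * len(pool))))   # step a)
    labeled_set = set(labeled)
    unlabeled = [u for u in pool if u not in labeled_set]
    model = train_fn(labeled)
    while len(labeled) < target and unlabeled:
        unlabeled.sort(key=lambda u: uncertainty_fn(model, u))       # most uncertain first
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]  # step b)
        labeled.extend(batch)
        model = train_fn(labeled)                                    # step c)
    return labeled, model
```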
For our submodular selection method, HMMs with 16-component GMMs were obtained by unsupervised training using all the unlabeled data. This model was used as the generative model for the Fisher score, using gmtkKernel, a GMTK [28] DBN implementation of Fisher kernels. The negative ℓ1 norm was used to construct the graph (we also tested other measures, which gave similar results).

Figure 1: Relative improvements over the average phone error rate of random selection (facility location objective with ℓ1, graph cut objective with ℓ1, and uncertainty sampling, versus the percentage of the training data selected). No initial model scenario.

The relative PER improvements over the average of the 100 random experiments are shown in Figure 1. As we can see, uncertainty sampling achieves improvements over random sampling in general, but when the target percentage is small (i.e., 2.5% and 5%), which is usually the case in real-world applications, it performs similarly to random selection, since the model used for the uncertainty measurement is of low quality. On the other hand, submodular data selection outperforms both random selection and uncertainty sampling, especially when the percentage is small. This implies that even a model trained without any labeling information works quite well for our approach. In other words, the submodular data selection approach proposed here is quite robust to the scenario where no initial "boot" model is available.

Our second scenario is when an initial model is available to help the data selection. Such a model should have reasonable quality. In our experiments, we assume a very high quality initial model to strongly contrast with our first scenario: an initial model with 16-component GMM-HMMs was trained on all the labeled TIMIT data, which was then used in the uncertainty sampling approach, and also in the submodular selection method as the generative model.

Figure 2: Relative improvements over the average phone error rate of random selection (facility location objective with ℓ1, graph cut objective with ℓ1, and uncertainty sampling, versus the percentage of the training data selected). With initial model scenario.

The results are shown in Figure 2. With a better quality initial model, uncertainty sampling performs better when selecting small percentages of the data but not necessarily with more data (presumably due to its selection of unrepresentative outliers). Submodular data selection also performs better in general with a better quality initial model. In particular, more than 12% relative improvement over random selection is achieved when selecting 2.5% of the data. And again, submodular selection outperforms both random sampling and uncertainty sampling. Also, notice that there are only relatively minor performance drops in our approach when shifting from a supervised trained initial model to an unsupervised trained initial model, illustrating yet again that submodular selection seems robust to the quality of the initial model.

6. References

[1] L. Lamel, R. Kassel, and S. Seneff, "Speech database development: Design and analysis of the acoustic-phonetic corpus," in Speech Input/Output Assessment and Speech Databases. ISCA, 1989.
[2] D. Lewis and W. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag New York, Inc., New York, NY, USA, 1994, pp. 3–12.
[3] B. Settles and M. Craven, "An analysis of active learning strategies for sequence labeling tasks," in EMNLP, 2008. [Online]. Available: http://pages.cs.wisc.edu/~bsettles/pub/settles.emnlp08.pdf
[4] B. Varadarajan, D. Yu, L. Deng, and A. Acero, "Maximizing global entropy reduction for active learning in speech recognition," in ICASSP, 2009.
[5] Y. Wu, R. Zhang, and A. Rudnicky, "Data selection for speech recognition," in ASRU, Dec. 2007, pp. 562–565.
[6] D. Hakkani-Tür and A. Gorin, "Active learning for automatic speech recognition," in Proceedings of the ICASSP, 2002, pp. 3904–3907.
[7] A. Culotta and A. McCallum, "Reducing labeling effort for structured prediction tasks," in Proceedings of the National Conference on Artificial Intelligence (AAAI), 2005.
[8] D. Hakkani-Tür, G. Riccardi, and G. Tur, "An active approach to spoken language processing," ACM Trans. Speech Lang. Process., vol. 3, no. 3, pp. 1–31, 2006.
[9] D. Cohn, L. Atlas, and R. Ladner, "Improving generalization with active learning," Machine Learning, vol. 15, no. 2, pp. 201–221, 1994.
[10] I. Dagan and S. Engelson, "Committee-based sampling for training probabilistic classifiers," in ICML. Morgan Kaufmann, 1995, pp. 150–157.
[11] G. Tur, R. Schapire, and D. Hakkani-Tür, "Active learning for spoken language understanding," in ICASSP, vol. 1, 2003.
[12] N. Roy and A. McCallum, "Toward optimal active learning through sampling estimation of error reduction," in ICML, 2001, pp. 441–448.
[13] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka, "Selective sampling for example-based word sense disambiguation," Computational Linguistics, vol. 24, no. 4, pp. 573–597, 1998.
[14] H. Nguyen and A. Smeulders, "Active learning using pre-clustering," in ICML. ACM, New York, NY, USA, 2004.
[15] S. Hoi, R. Jin, J. Zhu, and M. Lyu, "Batch mode active learning and its application to medical image classification," in ICML. ACM, New York, NY, USA, 2006, pp. 417–424.
[16] G. Tur, D. Hakkani-Tür, and R. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
[17] T. Scheffer, C. Decomain, and S. Wrobel, "Active hidden Markov models for information extraction," in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. Springer-Verlag, London, UK, 2001, pp. 309–318.
[18] B. Anderson and A. Moore, "Active learning for hidden Markov models: Objective functions and algorithms," in Machine Learning - International Workshop, vol. 22, 2005, p. 9.
[19] A. Krause, "Optimizing sensing: Theory and applications," Ph.D. dissertation, Carnegie Mellon University, 2008.
[20] L. Lovász, "Submodular functions and convexity," in Mathematical Programming: The State of the Art (eds. A. Bachem, M. Grötschel, and B. Korte), Springer, pp. 235–257, 1983.
[21] M. Goemans and D. Williamson, "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming," Journal of the ACM (JACM), vol. 42, no. 6, pp. 1115–1145, 1995.
[22] U. Feige, "A threshold of ln n for approximating set cover," Journal of the ACM (JACM), vol. 45, no. 4, pp. 634–652, 1998.
[23] G. Cornuejols, M. Fisher, and G. Nemhauser, "On the uncapacitated location problem," in Studies in Integer Programming: Proceedings of the Institute of Operations Research Workshop, Sponsored by IBM, University of Bonn, Germany, Sept. 8-12, 1975, vol. 1. North Holland, 1977, pp. 163–177.
[24] G. Nemhauser, L. Wolsey, and M. Fisher, "An analysis of approximations for maximizing submodular set functions I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
[25] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," Advances in Neural Information Processing Systems, pp. 487–493, 1999.
[26] J. Bilmes, "Fisher kernels for DBNs," University of Washington, Tech. Rep., 2008.
[27] K. Lee and H. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[28] J. Bilmes and C. Bartels, "Graphical model architectures for speech recognition," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 89–100, September 2005.