How to Select a Good Training-data Subset for Transcription:
Submodular Active Selection for Sequences

Hui Lin, Jeff Bilmes
Department of Electrical Engineering, University of Washington, Seattle, WA 98195
{hlin,bilmes}@ee.washington.edu
Abstract

Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We employ submodular active selection on a Fisher-kernel based graph over un-transcribed utterances. The selection is theoretically guaranteed to be near-optimal. Moreover, our approach is able to bootstrap without requiring any initial transcribed data, whereas traditional approaches rely heavily on the quality of an initial model trained on some labeled data. Our experiments on phone recognition show that our approach outperforms both average-case random selection and uncertainty sampling significantly.

Index Terms: Transcription, labeling, submodularity, submodular selection, active learning, sequence labeling, phone recognition, speech recognition

(This work was supported by an ONR MURI grant, No. N000140510388.)

1. Introduction

In automatic speech recognition and many other language applications, unlabeled data are abundant but labels (e.g., transcriptions) are expensive and time-consuming to acquire. For example, large amounts of speech data can easily be obtained via telephone calls and via modern voice-based applications such as Microsoft's Tellme and Google's voice search. Ideally, it would be possible to label all of this data for use as a training set in a speech recognition system, as aptly conveyed by the well-known phrase "there is no data like more data." Unfortunately, this would be impractical given the ever-increasing amount of available unlabeled data. Accurate phonetic transcription of speech utterances requires phonetic training, and even then it may take a month to annotate one hour of speech [1], not to mention the difficulty of transcribing at the articulatory level. Partly due to this, such low-level transcription efforts have been sidelined by the community in favor of word-level transcriptions. But even word-level transcriptions are time-consuming (about 10 times real time), especially for conversational spontaneous speech. This problem is particularly acute for underrepresented languages or dialects with few speakers, where linguistic experts are even harder to find.

In this paper, we address the following question: given limited resources (time and/or budget), how can we optimally select a training-data subset for transcription such that the resulting system has optimal performance? In fact, this is a well-known problem and goes by the name of batch active learning, where a subset of data that is most informative and representative of the whole is selected for labeling. Often, examples are queried in a greedy fashion according to an informativeness measure used to evaluate all examples in the pool. Two popular strategies for measuring informativeness are uncertainty sampling and the query-by-committee approach. Uncertainty sampling [2] is the simplest and most commonly used strategy. In this framework, an initial system is typically trained using a small set of labeled examples. The system then examines the rest of the unlabeled examples and queries the examples that it is most uncertain about. The measurement of uncertainty can be either entropy [3, 4, 5] or a confidence score [6, 7, 8, 3]. Query-by-committee [9, 10, 11] also starts with labeled data. A set of distinct models is trained as committee members, and each member is then allowed to vote on the labellings of the unlabeled examples. The most informative example is taken as the one the committee most disagrees about.

It has been shown that both uncertainty sampling and query-by-committee may fail when they tend to query outliers, which is the main motivating factor for other strategies such as estimated error reduction [12]. The problem is that outliers might have high uncertainty (or a committee might find them controversial), but they are not good surrogates for "typical" samples. Indeed, an ideal selection strategy should choose a subset of samples that, when considered together, constitutes in some form a good representation of the entire training data set. Methods such as [13, 14, 15, 3] address this problem, and all of them have been shown to be superior to methods that do not consider representativeness measures. Our approach herein also belongs to this category. In particular, we use a Fisher kernel (Section 4) to build a graph over the unlabeled sample sequences, and optimize submodular functions (to be defined) over the graph to find the most representative subset. Note that our Fisher kernel is computed over an unsupervised generative model, which enables us to bootstrap our active learning approach without needing any initial labeled data, yet we achieve good performance (see Section 5), perhaps because of the approximate optimality of our submodular procedures. This approach portends well for underrepresented languages for which an initial labeled set might be unavailable.

Despite extensive pre-existing studies of active learning, there is relatively little work on active learning for sequence labeling. Several methods have been proposed, most of which are based either on uncertainty sampling or query-by-committee. In [11, 16, 6], confidence scores from a speech recognizer are used to indicate the informativeness of speech utterances. The active learning methods in [17] select the most uncertain examples based on an EM-style algorithm for learning HMMs from partially labeled data. In [18], several objective functions and algorithms are introduced for active learning in HMMs. Several new query strategies for probabilistic sequence models are introduced in [3], where an empirical analysis is conducted on a variety of benchmark data sets. Our approach can be distinguished from these methods in that we select the most representative subset in a submodular framework, where submodularity theoretically guarantees that the selection problem can be solved efficiently and near-optimally (see Section 2, Theorems 1 and 2). Submodularity has already been used successfully in active learning tasks: robust submodular observation selection is explored in [19], and in [15] the authors relate Fisher information matrices to submodular functions so that the optimization can be done efficiently and effectively. To the best of our knowledge, our approach is the first work that incorporates submodularity into active learning for sequence labeling tasks such as speech recognition.
2. Background

2.1. Submodularity

Consider a set function z : 2^V → R, which maps subsets S ⊆ V of a finite set V to real numbers. Intuitively, V is the set of all unlabeled utterances, and the function z(·) scores the quality of any chosen subset. z(·) is called submodular [20] if for any S, T ⊆ V,

    z(S ∪ T) + z(S ∩ T) ≤ z(S) + z(T)    (1)

An equivalent condition for submodularity is the property of diminishing returns: for any R ⊆ S ⊆ V and s ∈ V,

    z(S ∪ {s}) − z(S) ≤ z(R ∪ {s}) − z(R)    (2)

Intuitively, this means that adding an element s to the smaller set R helps at least as much as adding it to the superset S. Submodularity is the discrete analog of convexity [20]. Just as convexity makes continuous functions more amenable to optimization, submodularity plays an essential role in combinatorial optimization. Common submodular functions appear in many important settings, including graph cut [21], set covering [22], and facility location problems [23].

2.2. Submodular Selection

We want to select a good subset S of the training data V that maximizes some objective function, such that the size of S is no larger than K (our budget). That is, we wish to compute

    max { z(S) : S ⊆ V, |S| ≤ K }    (3)

While NP-hard, this problem can be approximately solved using a simple greedy forward-selection algorithm. The algorithm starts with S = ∅ and iteratively adds the element s ∈ V \ S that maximally increases the objective function value, i.e.,

    s* = argmax_{s ∈ V \ S} z(S ∪ {s})    (4)

until |S| = K. When z(·) is a nondecreasing and normalized submodular set function, this simple greedy algorithm performs near-optimally, as guaranteed by the following theorems.

Theorem 1 (Nemhauser et al., 1978 [24]). If the submodular function z(·) is i) nondecreasing, i.e., z(S1) ≤ z(S2) for all S1 ⊆ S2 ⊆ V, and ii) normalized, i.e., z(∅) = 0, then the set S_G obtained by the greedy algorithm is no worse than a constant fraction (1 − 1/e) away from the optimal value, i.e.,

    z(S_G) ≥ (1 − 1/e) max_{S ⊆ V : |S| ≤ K} z(S)

The greedy algorithm, moreover, is likely to be the best we can do in polynomial time, unless P = NP.

Theorem 2 (Feige, 1998 [22]). Unless P = NP, there is no polynomial-time algorithm that guarantees a solution S* with

    z(S*) ≥ (1 − 1/e + ε) max_{|S| ≤ K} z(S),  ε > 0    (5)
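To make the procedure concrete, here is a minimal Python sketch of the greedy forward selection in Eq. (4). This is our illustration, not the authors' implementation; it assumes the objective z is supplied as a callable on frozensets and is normalized, nondecreasing, and submodular, so that Theorem 1 applies.

```python
# Minimal sketch of the greedy forward selection in Eq. (4):
# repeatedly add the element with the largest marginal gain until
# the budget K is reached. `z` is assumed to be a normalized,
# nondecreasing, submodular set function.

def greedy_select(V, z, K):
    """V: iterable of item ids; z: callable mapping a frozenset to a float;
    K: budget. Returns the greedily selected subset as a list."""
    S = []
    remaining = set(V)
    current = z(frozenset())          # = 0 for a normalized objective
    while len(S) < K and remaining:
        # pick the element with the largest marginal gain z(S + {s}) - z(S)
        best = max(remaining, key=lambda s: z(frozenset(S) | {s}) - current)
        S.append(best)
        remaining.remove(best)
        current = z(frozenset(S))
    return S
```

This naive version re-evaluates z for every remaining element in every iteration; Algorithm 1 in Section 3 specializes it to the facility location objective so that marginal gains can be maintained incrementally.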
3. Submodular Selection

Batch active learning problems are often cast as a data subset selection problem, where the active learner may ask for the labels of a subset of the data whose size is within budget and that is most likely to yield the most accurate classifier. Problem (3) can also be viewed as such a data selection problem. Suppose we have a set of unlabeled training examples V = {1, 2, ..., N}, where certain pairs (i, j) are similar and the similarity of i and j is measured by a nonnegative value w_{i,j}. We can then represent the unlabeled data as a graph G = (V, E), with the nonnegative weight w_{i,j} associated with edge (i, j). The data selection problem is to find a subset S that is most representative of the whole set V, given the constraint |S| ≤ K. To measure how "representative" S is of the whole set V, we introduce several submodular set functions.

3.1. Submodular Set Functions

Our first objective is the uncapacitated facility location function [23]:

    Facility location:  z_1(S) = Σ_{i ∈ V} max_{j ∈ S} w_{i,j}    (6)

It measures the similarity of S to the whole set V. We can also measure the similarity of S to the remainder V \ S, i.e., the graph cut function:

    Graph cut:  z_2(S) = Σ_{i ∈ V \ S} Σ_{j ∈ S} w_{i,j}    (7)

Both of these functions are submodular, as can be seen by verifying inequality (2) (proof omitted due to space limitations). In order to apply Theorem 1, the objective function should also satisfy the nondecreasing property. Obviously, the facility location objective is nondecreasing. For the graph cut objective, the increment from adding k to S is

    z_2(S ∪ {k}) − z_2(S) = Σ_{i ∈ V \ S} w_{i,k} − Σ_{j ∈ S ∪ {k}} w_{k,j}

which is not always nonnegative. Fortunately, the proof of Theorem 1 does not use the monotone property for all possible sets [24][19, page 58]. The graph cut objective can also meet the conditions of Theorem 1 if |S| ≪ |V|, which is usually the case in applications where we have a large amount of data but only limited resources for labeling.
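For illustration, the two objectives in Eqs. (6) and (7) can be written down directly given a similarity matrix. The sketch below is ours (not from the paper); it assumes a dense, symmetric, nonnegative NumPy matrix W with W[i, j] = w_{i,j}, and includes a small numeric spot check of the diminishing-returns inequality (2).

```python
# Sketch of the objectives in Eqs. (6)-(7) for a dense symmetric
# similarity matrix W with nonnegative entries, plus a toy check of
# the diminishing-returns property (2).
import numpy as np

def facility_location(W, S):
    """z1(S) = sum_i max_{j in S} w_{i,j}; zero for the empty set."""
    S = list(S)
    if not S:
        return 0.0
    return W[:, S].max(axis=1).sum()

def graph_cut(W, S):
    """z2(S) = sum_{i not in S} sum_{j in S} w_{i,j}."""
    S = list(S)
    in_S = set(S)
    complement = [i for i in range(W.shape[0]) if i not in in_S]
    if not S or not complement:
        return 0.0
    return W[np.ix_(complement, S)].sum()

# Diminishing-returns check: the gain of adding `s` to R (a subset of S)
# should be at least the gain of adding it to S, as in Eq. (2).
W = np.random.rand(6, 6); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
R, S, s = {0}, {0, 1, 2}, 4
for z in (facility_location, graph_cut):
    gain_R = z(W, R | {s}) - z(W, R)
    gain_S = z(W, S | {s}) - z(W, S)
    assert gain_R >= gain_S - 1e-12
```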
With the above objectives, we can use the greedy algorithm to solve the data selection problem efficiently and near-optimally. The greedy algorithm for submodular data selection with the facility location objective is described in Algorithm 1, where ρ_i = max_{j ∈ S} w_{i,j} is maintained incrementally so that the marginal gain of a candidate k is simply Σ_i (max{ρ_i, w_{i,k}} − ρ_i); this speeds up the running of the algorithm. The algorithm for the graph-cut objective is similar and is omitted to conserve space.
Algorithm 1: Greedy algorithm for the facility location objective
1: Input: G = (V, E) with weight w_{i,j} on edge (i, j); K: the number of examples to be selected
2: Initialization: S = ∅; ρ_i = 0 for i = 1, ..., N, where N = |V|
3: while |S| < K do
4:   k* = argmax_{k ∈ V \ S} Σ_{i ∈ V, (i,k) ∈ E} (max{ρ_i, w_{i,k}} − ρ_i)
5:   S = S ∪ {k*}
6:   for all i ∈ V do
7:     ρ_i = max{ρ_i, w_{i,k*}}
8:   end for
9: end while
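A compact Python rendering of Algorithm 1 (ours, not the authors' code) might look as follows; it assumes the graph is dense and given as a nonnegative similarity matrix W rather than an explicit edge list.

```python
# Sketch of Algorithm 1 for a dense nonnegative similarity matrix W,
# where W[i, k] = w_{i,k}. rho[i] caches max_{j in S} w_{i,j}, so the
# marginal gain of candidate k is sum_i (max(rho[i], W[i, k]) - rho[i]).
import numpy as np

def greedy_facility_location(W, K):
    N = W.shape[0]
    S, rho = [], np.zeros(N)
    while len(S) < K:
        selected = set(S)
        candidates = [k for k in range(N) if k not in selected]
        # gain[k] = sum_i (max(rho_i, w_{i,k}) - rho_i)
        gains = [np.maximum(rho, W[:, k]).sum() - rho.sum() for k in candidates]
        k_star = candidates[int(np.argmax(gains))]
        S.append(k_star)
        rho = np.maximum(rho, W[:, k_star])   # update the cached maxima
    return S
```

With a sparse graph, the inner sum only needs to touch the edges (i, k) ∈ E incident to each candidate k, as written in line 4 of Algorithm 1.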
4. Fisher Kernel

We express the pairwise "similarity" between utterances i and j in terms of a kernel function κ(i, j), so that w_{i,j} = κ(i, j). Since the examples are sequences with possibly different lengths, we use the Fisher kernel [25], which is applicable to variable-length sequences. Consider a generative model (e.g., a hidden Markov model or, more generally, a dynamic Bayesian network (DBN)) with parameters θ that models the generation process of the sequences. Denote by X_i = (x_{i,1}, ..., x_{i,T_i}) the features of sequence i with length T_i. Then a fixed-length vector, known as the Fisher score, can be extracted as:

    U_i = ∂/∂θ log p(X_i | θ)    (8)

Each component of U_i is the derivative of the log-likelihood score for the sequence X_i with respect to a particular parameter; the Fisher score is thus a vector having the same length as the number of parameters θ. The computation of the gradients in Eq. (8) in the context of DBNs is described in detail in [26].

Given Fisher scores, sequences with different lengths may be represented by fixed-length vectors, so we can easily define several Fisher kernel functions to measure pairwise similarity, e.g., cosine similarity, radial-basis-function (RBF) kernel similarity, or, as shown below, the negative ℓ1 similarity:

    Negative ℓ1 norm:  κ(i, j) = −||U_i − U_j||_1    (9)

The generative model used to produce the Fisher score may contain several types of parameters (i.e., discrete conditional probability tables and continuous Gaussian parameters), and the values associated with different types of parameters may have quite different numeric dynamic ranges. In order to reduce the heterogeneity within the Fisher score vector, all our experiments apply the following global variance normalization to produce the final Fisher score vectors U'_i:

    U'_i = (diag(Σ))^(−1/2) · (U_i − Ū)    (10)

where Ū = (1/N) Σ_{i=1}^N U_i and Σ = (1/N) Σ_{i=1}^N (U_i − Ū)^T (U_i − Ū).
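To illustrate Eqs. (8)-(10), the sketch below computes a Fisher score for a toy diagonal-covariance GMM (with respect to the Gaussian means only, treating frames as independent), applies the global variance normalization, and evaluates the negative ℓ1 kernel. This is our simplified stand-in: the paper computes full DBN Fisher scores with gmtkKernel in GMTK [28], not this toy model.

```python
# Toy illustration of Eqs. (8)-(10): Fisher score of a diagonal-covariance
# GMM w.r.t. the Gaussian means only, global variance normalization, and
# the negative l1 kernel. A simplified stand-in, not the paper's pipeline.
import numpy as np
from scipy.stats import multivariate_normal

def fisher_score(X, weights, means, variances):
    """X: (T, D) frames of one utterance. Returns a vector of length K*D:
    d/d(mu_k) log p(X) = sum_t gamma_k(x_t) * (x_t - mu_k) / var_k."""
    K, D = means.shape
    # per-frame, per-component likelihoods -> responsibilities gamma
    comp = np.stack([weights[k] * multivariate_normal.pdf(X, means[k], np.diag(variances[k]))
                     for k in range(K)], axis=1)                       # (T, K)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    return np.concatenate([(gamma[:, [k]] * (X - means[k]) / variances[k]).sum(axis=0)
                           for k in range(K)])                          # (K*D,)

def normalize_scores(U):
    """Eq. (10): global variance normalization of the stacked scores U, shape (N, P)."""
    U_bar = U.mean(axis=0)
    sigma = U.var(axis=0) + 1e-12
    return (U - U_bar) / np.sqrt(sigma)

def neg_l1_kernel(Ui, Uj):
    """Eq. (9): kappa(i, j) = -||U_i - U_j||_1."""
    return -np.abs(Ui - Uj).sum()
```

The resulting kernel values between normalized scores can then be used as the graph weights w_{i,j} that feed the selection objectives of Section 3.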
5. Experiments

We evaluated our methods on a phone recognition task using the TIMIT corpus. Random selection was used as a baseline. Specifically, we randomly take p% of the TIMIT training set, where p = 2.5, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90. For each subset, a 3-state context-independent (CI) hidden Markov model (HMM), implemented as a DBN, was trained for each of the 48 phones. The number of Gaussian components in the Gaussian mixture model (GMM) was optimized according to the amount of training data available. The 48 phones were then mapped down to 39 phones for scoring purposes, following standard practice [27]. Recognition was performed using standard Viterbi search without a phone language model (a language model was not used here to emphasize acoustic modeling performance, and because this speeds up experimental turnaround time by avoiding tedious language model scaling and penalty parameter tuning when large random selection experiments are performed). 100 trials of random selection were performed for each of the percentages above. The average phone error rates (PER) were calculated and used as the baseline. The standard deviation was around 0.01 for small p and about 0.005 for larger p. Apart from the data selection strategy, the uncertainty sampling and submodular selection experiments followed exactly the same setup as random selection.

Uncertainty sampling and submodular selection were evaluated under two scenarios. The first scenario is when no initial model is available. In this scenario, uncertainty sampling would typically randomly select a small portion of the unlabeled data to label, and then train an initial model using these randomly selected data. We did the following: a) randomly select α% of the training data, acquire the labels, and train an initial model; b) use the learned model to predict the unlabeled data and select the M most uncertain samples for labelling; c) retrain the model using all labeled data; if the number of labeled samples reaches the target amount, stop, else go to step b). We used α = 1 and M = 100 in the experiments, and the average per-frame log-likelihood was used as the uncertainty measurement.
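For clarity, the baseline loop in steps a)-c) can be summarized with the following sketch; train, avg_frame_loglik, and acquire_labels are hypothetical placeholders standing in for the actual HMM training, scoring, and transcription steps, not functions from any real toolkit.

```python
# Schematic of the uncertainty-sampling baseline (steps a-c above).
# `train`, `avg_frame_loglik`, and `acquire_labels` are hypothetical
# helpers standing in for the actual HMM training / scoring pipeline.
import random

def uncertainty_sampling(pool, target_size, alpha=0.01, M=100):
    seed = random.sample(pool, max(1, int(alpha * len(pool))))   # step a)
    labeled = {utt: acquire_labels(utt) for utt in seed}
    unlabeled = [utt for utt in pool if utt not in labeled]
    model = train(labeled)
    while len(labeled) < target_size:
        # step b): lowest average per-frame log-likelihood = most uncertain
        unlabeled.sort(key=lambda utt: avg_frame_loglik(model, utt))
        for utt in unlabeled[:M]:
            labeled[utt] = acquire_labels(utt)
        unlabeled = unlabeled[M:]
        model = train(labeled)                                    # step c)
    return labeled, model
```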
For our submodular selection method, HMMs with 16-component GMMs were obtained by unsupervised training using all the unlabeled data. This model was used as the generative model for the Fisher score, computed using gmtkKernel, a GMTK [28] DBN implementation of Fisher kernels. The negative ℓ1 norm was used to construct the graph (we also tested other measures, which gave similar results). The relative PER improvements over the average of the 100 random experiments are shown in Figure 1.

[Figure 1: Relative improvements over the average phone error rate of random selection (no-initial-model scenario). Relative PER improvement is plotted against the percentage of the training data selected, for the facility location objective with ℓ1, the graph cut objective with ℓ1, and uncertainty sampling.]

As Figure 1 shows, uncertainty sampling achieves improvements over random selection in general, but when the target percentage is small (i.e., 2.5% and 5%), which is usually the case in real-world applications, it performs similarly to random selection since the model used for the uncertainty measurement is of low quality. On the other hand, submodular data selection outperforms both random selection and uncertainty sampling, especially when the percentage is small. This implies that even a model trained without any labeling information works quite well for our approach. In other words, the submodular data selection approach proposed here is quite robust to the scenario where no initial "boot" model is available.

Our second scenario is when an initial model is available to help the data selection. Such a model should have reasonable quality. In our experiments, we assume a very high quality initial model to strongly contrast with our first scenario: an initial model with 16-component GMM-HMMs was trained on all the labeled TIMIT data, and was then used in the uncertainty sampling approach and also as the generative model in the submodular selection method. The results are shown in Figure 2.

[Figure 2: Relative improvements over the average phone error rate of random selection (with-initial-model scenario). Relative PER improvement is plotted against the percentage of the training data selected, for the facility location objective with ℓ1, the graph cut objective with ℓ1, and uncertainty sampling.]

With a better quality initial model, uncertainty sampling performs better when selecting small percentages of the data, but not necessarily with more data (presumably due to its selection of unrepresentative outliers). Submodular data selection also performs better in general with a better quality initial model. In particular, more than 12% relative improvement over random selection is achieved when selecting 2.5% of the data. And again, submodular selection outperforms both random selection and uncertainty sampling. Also, notice that there are only relatively minor performance drops in our approach when shifting from a supervised trained initial model to an unsupervised trained initial model, illustrating yet again that submodular selection seems robust to the quality of the initial model.
6. References

[1] L. Lamel, R. Kassel, and S. Seneff, "Speech database development: Design and analysis of the acoustic-phonetic corpus," in Speech Input/Output Assessment and Speech Databases. ISCA, 1989.
[2] D. Lewis and W. Gale, "A sequential algorithm for training text classifiers," in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag New York, Inc., New York, NY, USA, 1994, pp. 3–12.
[3] B. Settles and M. Craven, "An analysis of active learning strategies for sequence labeling tasks," in EMNLP, 2008. [Online]. Available: http://pages.cs.wisc.edu/~bsettles/pub/settles.emnlp08.pdf
[4] B. Varadarajan, D. Yu, L. Deng, and A. Acero, "Maximizing global entropy reduction for active learning in speech recognition," in ICASSP, 2009.
[5] Y. Wu, R. Zhang, and A. Rudnicky, "Data selection for speech recognition," in ASRU, Dec. 2007, pp. 562–565.
[6] D. Hakkani-Tür and A. Gorin, "Active learning for automatic speech recognition," in Proceedings of the ICASSP, 2002, pp. 3904–3907.
[7] A. Culotta and A. McCallum, "Reducing labeling effort for structured prediction tasks," in Proceedings of the National Conference on Artificial Intelligence (AAAI), 2005.
[8] D. Hakkani-Tür, G. Riccardi, and G. Tur, "An active approach to spoken language processing," ACM Transactions on Speech and Language Processing, vol. 3, no. 3, pp. 1–31, 2006.
[9] D. Cohn, L. Atlas, and R. Ladner, "Improving generalization with active learning," Machine Learning, vol. 15, no. 2, pp. 201–221, 1994.
[10] I. Dagan and S. Engelson, "Committee-based sampling for training probabilistic classifiers," in ICML. Morgan Kaufmann, 1995, pp. 150–157.
[11] G. Tur, R. Schapire, and D. Hakkani-Tür, "Active learning for spoken language understanding," in ICASSP, vol. 1, 2003.
[12] N. Roy and A. McCallum, "Toward optimal active learning through sampling estimation of error reduction," in ICML, 2001, pp. 441–448.
[13] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka, "Selective sampling for example-based word sense disambiguation," Computational Linguistics, vol. 24, no. 4, pp. 573–597, 1998.
[14] H. Nguyen and A. Smeulders, "Active learning using pre-clustering," in ICML. ACM, New York, NY, USA, 2004.
[15] S. Hoi, R. Jin, J. Zhu, and M. Lyu, "Batch mode active learning and its application to medical image classification," in ICML. ACM, New York, NY, USA, 2006, pp. 417–424.
[16] G. Tur, D. Hakkani-Tür, and R. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
[17] T. Scheffer, C. Decomain, and S. Wrobel, "Active hidden Markov models for information extraction," in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. Springer-Verlag, London, UK, 2001, pp. 309–318.
[18] B. Anderson and A. Moore, "Active learning for hidden Markov models: Objective functions and algorithms," in Machine Learning - International Workshop, vol. 22, 2005, p. 9.
[19] A. R. Krause, "Optimizing sensing: Theory and applications," Ph.D. dissertation, Carnegie Mellon University, 2008.
[20] L. Lovász, "Submodular functions and convexity," in Mathematical Programming - The State of the Art, A. Bachem, M. Grötschel, and B. Korte, Eds. Springer, 1983, pp. 235–257.
[21] M. Goemans and D. Williamson, "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming," Journal of the ACM (JACM), vol. 42, no. 6, pp. 1115–1145, 1995.
[22] U. Feige, "A threshold of ln n for approximating set cover," Journal of the ACM (JACM), vol. 45, no. 4, pp. 634–652, 1998.
[23] G. Cornuejols, M. Fisher, and G. Nemhauser, "On the uncapacitated location problem," in Studies in Integer Programming: Proceedings of the Institute of Operations Research Workshop, Sponsored by IBM, University of Bonn, Germany, Sept. 8-12, 1975, vol. 1. North Holland, 1977, pp. 163–177.
[24] G. Nemhauser, L. Wolsey, and M. Fisher, "An analysis of approximations for maximizing submodular set functions I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
[25] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," Advances in Neural Information Processing Systems, pp. 487–493, 1999.
[26] J. Bilmes, "Fisher kernels for DBNs," University of Washington, Tech. Rep., 2008.
[27] K. Lee and H. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[28] J. Bilmes and C. Bartels, "Graphical model architectures for speech recognition," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 89–100, September 2005.