Copyright 2017 IEEE. Published in the IEEE 2017 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017), scheduled for 5-9 March 2017 in New Orleans, Louisiana, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 908-562-3966.

END-TO-END ASR-FREE KEYWORD SEARCH FROM SPEECH

Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury

IBM Watson, IBM T. J. Watson Research Center, Yorktown Heights, New York

The authors thank Dogan Can and Shrikanth Narayanan of the Signal Analysis and Interpretation Lab, University of Southern California for useful discussions. This paper uses the IARPA-Babel404b-v1.0a full language pack. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

ABSTRACT

End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive because they do not depend on alignments between the input acoustics and the output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.

Index Terms— End-to-end systems, neural networks, keyword search, automatic speech recognition

1. INTRODUCTION

Deep neural networks (DNNs) have pushed the state of the art for automatic speech recognition (ASR) systems [1]. This has led to significant performance improvements on several well-known ASR benchmarks such as Switchboard [2, 3]. End-to-end (E2E) or fully-neural architectures have become an alternative to the hybrid hidden Markov model (HMM)-DNN architecture. These include the connectionist temporal classification [4, 5] loss-based recurrent neural networks (RNNs) and attention-based RNNs [6, 7].

Automatic speech recognition is often not the end goal of real-world speech information processing systems. Instead, an important end goal is information retrieval, in particular keyword search (KWS), which involves retrieving the speech utterances containing a user-specified text query from a large database. Conventional KWS from speech uses an ASR system as a front-end that converts the speech database into a finite-state transducer (FST) index containing all hypothesized word sequences with their associated confidence scores and time stamps [8]. The user-specified text query is then composed with this FST index to find putative keyword locations and confidence scores.

Training a good ASR system is time-consuming and requires a substantial amount of transcribed audio data. The main novelty of this paper is an end-to-end ASR-free KWS system motivated by the recent success of E2E systems for ASR. Our fully-neural E2E KWS system lacks both an ASR system and an FST index, which makes it fast to train. We train our system in only 2 hours with minimal supervision and do not need fully transcribed training audio. Our system performs significantly better than chance and respectably compared to a state-of-the-art hybrid HMM-DNN ASR-based system which takes over 72 hours to train.

The next section gives an overview of related prior work. Section 2 introduces our E2E ASR-free system by discussing its three constituent sub-systems - an RNN-based acoustic encoder-decoder, a convolutional neural network (CNN)-RNN character language model (LM), and a feed-forward KWS network. Section 3 discusses the experimental setup, training of the E2E KWS system, and analysis of the results. Section 4 gives directions for future work.

1.1. Prior Work

Most prior works relevant to this paper have focused on query-by-example (QbyE) retrieval of speech from a database. The user provides a speech utterance of the query to be searched, in contrast to the text query used in the KWS setup of this paper. Dynamic time warping (DTW) of acoustic features extracted from the speech query and the speech utterances in the database is a classic technique in such QbyE systems [9, 10]. The cost of the DTW alignment serves as a matching score for retrieval.

Chen, Parada, and Sainath [11] present a system for QbyE where the audio database contains examples of certain key-phrases, such as "hello genie". The last k state vectors from the final hidden layer of an RNN acoustic model give a fixed-dimensional representation for both the speech query and each utterance in the speech database. The KWS detector then computes cosine similarity between the speech query and utterance representations. Levin et al. [12] compare several non-neural-network techniques for computing fixed-dimensional representations of speech segments for QbyE, including principal component analysis and Laplacian eigenmaps. Chung et al. [13] also present a similar QbyE system using an acoustic RNN encoder-decoder network. Kamper, Wang, and Livescu [14] use a Siamese convolutional neural network (CNN) for obtaining acoustic embeddings of the audio query and the utterances in the database, and train this network by minimizing a triplet hinge loss.

Our paper is closely related to the recent work of Palaz, Synnaeve, and Collobert [15], where the authors propose a CNN-based ASR system trained on a bag of words in the speech utterance. The CNN emits a time sequence of posterior distributions over the vocabulary, which is then aggregated over time to produce an estimate of the bag of words. However, we note that their system is trained on a fixed vocabulary of 1000 words, whereas our proposed system is open-vocabulary and does not use word identity. In addition, the training examples used in their approach carry stronger supervision of word identity than the much weaker supervision used in our proposed model. The next section presents the architecture of our E2E ASR-free KWS system.

2. END-TO-END ASR-FREE KWS ARCHITECTURE

Our E2E ASR-free KWS system is philosophically similar to a conventional hybrid ASR-based KWS system, with three sub-systems that model the acoustics, language, and keyword search. However, there are several differences in the structures of these sub-systems and their training.

2.1. RNN Acoustic Auto-encoder

Motivated by prior work [11, 13] on computing fixed-dimensional representations from variable-length acoustic feature vector sequences, we use an RNN-based auto-encoder as shown in Figure 1. The encoder processes T acoustic feature vectors (x_1, ..., x_T) with a uni-directional RNN with gated recurrent unit (GRU) [16] hidden units unrolled over T time steps. A fully-connected layer with weight matrix W and D rectified linear units (ReLUs), denoted by g, then processes the hidden state vector h^e_T from time step T. A decoder GRU-RNN takes the resulting D-dimensional acoustic representation g(W h^e_T) as input at each time step t ∈ {1, ..., T} to reconstruct the original sequence of T acoustic feature vectors.

Fig. 1. This figure shows an RNN acoustic auto-encoder for a T-length input sequence of acoustic feature vectors through a D-dimensional encoded representation g(W h^e_T), where g denotes a ReLU activation function.

We use the D-dimensional output g(W h^e_T) of the RNN encoder as the vector representation of the input acoustic feature vector sequence. This "acoustic model" does not use any transcribed speech data for training, in contrast with the acoustic model of a conventional ASR system. This is in line with our overall goal of making the entire KWS system ASR-free and training it with less supervision. The next section describes the language model used to produce a fixed-dimensional representation of the text query.
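To make Figure 1 concrete, the following is a minimal sketch of the acoustic encoder-decoder in current Keras. The original system was implemented with Keras [19] on a Theano [20] backend, so the layer names and defaults below are illustrative rather than the authors' code. The 80-dimensional features, 300-unit GRUs, 300-dimensional embedding, and 1500-frame unrolling follow Section 3; note that Section 3.2 describes the embedding layer as linear, while Figure 1 uses a ReLU.

```python
from tensorflow import keras
from tensorflow.keras import layers

FEAT_DIM = 80      # multilingual acoustic features (Section 3.1)
MAX_FRAMES = 1500  # 15 s of 10 ms frames (Section 3.2)
EMBED_DIM = 300    # D, the fixed-dimensional acoustic representation

# Encoder: uni-directional GRU over the padded feature sequence; the mask
# makes the GRU return its state at the last real (non-padded) frame, h^e_T.
feats = keras.Input(shape=(MAX_FRAMES, FEAT_DIM), name="acoustic_features")
masked = layers.Masking(mask_value=0.0)(feats)
h_T = layers.GRU(300, name="encoder_gru")(masked)
acoustic_emb = layers.Dense(EMBED_DIM, activation="relu",
                            name="acoustic_embedding")(h_T)   # g(W h^e_T)

# Decoder: the embedding is fed as the input at every time step of a second
# GRU, which regresses the original acoustic feature vectors.
repeated = layers.RepeatVector(MAX_FRAMES)(acoustic_emb)
decoded = layers.GRU(300, return_sequences=True, name="decoder_gru")(repeated)
recon = layers.TimeDistributed(layers.Dense(FEAT_DIM),
                               name="reconstruction")(decoded)

autoencoder = keras.Model(feats, recon)
# Mean-squared reconstruction error with Adam (Section 3.2); padded frames
# would additionally be excluded from the loss, e.g. via per-frame weights.
autoencoder.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

acoustic_encoder = keras.Model(feats, acoustic_emb)  # reused in Section 2.3
```

The `acoustic_encoder` sub-model is the part that is kept and reused by the KWS network of Section 2.3.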
2.2. CNN-RNN Character LM

We use the CNN-based character RNN-LM architecture from Kim et al. [17] for deriving query embeddings. Figure 2 shows this LM for the simple case of 2 convolutional masks. We map the input sequence of N characters (c_1, ..., c_N) to a matrix of d-dimensional character embeddings via a look-up table. Next, M d × w convolutional masks operate on the resulting d × N embedding matrix to produce a set of M N-dimensional vectors, one per mask. We then perform max-pooling over time on each of these vectors to obtain a scalar per mask and an M-dimensional embedding vector. This embedding vector then feeds into a GRU-RNN that predicts one out of K characters at each time step.

Fig. 2. This figure shows a character CNN-RNN LM for encoding text queries. We show two convolutional masks for simplicity.

One key difference between our LM and the one from Kim et al. [17] is that we train our LM to predict a sequence of characters instead of words. The next section presents our overall KWS system that uses the learned acoustic and query embeddings to predict whether the query occurs in the utterance or not.
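A matching sketch of the query encoder and character LM of Figure 2, again in current Keras. The 39-grapheme inventory, 50-dimensional character embeddings, 300 masks of width 3, and 256-unit decoder GRU follow Section 3.3; the tanh convolution activation follows Kim et al. [17] and, like the "same" padding and the handling of short sequences, is an assumption rather than a detail stated in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_GRAPHEMES = 39  # K, the grapheme inventory (Section 3.3)
MAX_CHARS = 50      # transcript chunks and queries are at most 50 graphemes
CHAR_DIM = 50       # d, per-grapheme embedding size
NUM_MASKS = 300     # M convolutional masks
MASK_WIDTH = 3      # w

chars = keras.Input(shape=(MAX_CHARS,), dtype="int32", name="grapheme_ids")
# Look-up table of d-dimensional character embeddings (the d x N matrix).
emb = layers.Embedding(NUM_GRAPHEMES, CHAR_DIM)(chars)
# M masks of size d x w, i.e. a 1-D convolution with M filters of width w,
# giving one N-dimensional response per mask.
conv = layers.Conv1D(NUM_MASKS, MASK_WIDTH, padding="same",
                     activation="tanh")(emb)
# Max-pooling over time: one scalar per mask -> M-dimensional query embedding.
query_emb = layers.GlobalMaxPooling1D(name="query_embedding")(conv)

# Character LM decoder: the embedding is fed at every step of a GRU whose
# softmax layer predicts one of the K graphemes at each time step.
dec_in = layers.RepeatVector(MAX_CHARS)(query_emb)
dec = layers.GRU(256, return_sequences=True)(dec_in)
char_probs = layers.TimeDistributed(
    layers.Dense(NUM_GRAPHEMES, activation="softmax"))(dec)

char_lm = keras.Model(chars, char_probs)
char_lm.compile(optimizer=keras.optimizers.Adam(1e-3),
                loss="sparse_categorical_crossentropy")

query_encoder = keras.Model(chars, query_emb)  # reused in Section 2.3
```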
2.3. Overall E2E ASR-free KWS System

The final block in the overall KWS system is a neural network that takes the speech utterance and text query embeddings as input and predicts whether the query occurs in the utterance or not. This is in contrast to previous works on QbyE in speech, where both the speech utterance and the speech query lie in the same acoustic representation space and cosine similarity is enough to match the two. Figure 3 shows the overall E2E KWS system. We extract the encoders from both the acoustic RNN auto-encoder and the CNN-RNN character LM, and feed their outputs into a feed-forward neural network. In contrast to conventional ASR-based approaches to KWS from speech, the E2E system in Figure 3 is also jointly trainable after the utterance and query encoders have been pre-trained. The next section presents our data preparation, experiments, results, and analysis.

Fig. 3. This figure shows the overall E2E KWS system. The finite-dimensional embeddings from the acoustic and query encoders feed into the KWS neural network that predicts if the query occurs in the utterance or not.
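Continuing the two sketches above, the KWS network itself is small: the two 300-dimensional encoder outputs are concatenated and passed through one ReLU hidden layer and a two-way softmax, with 50% dropout as in Section 3.4. The sketch reuses the hypothetical `acoustic_encoder` and `query_encoder` sub-models and constants defined earlier and is, again, only an illustration of the described architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Inputs mirror those of the two encoders defined in the previous sketches.
utt_in = keras.Input(shape=(MAX_FRAMES, FEAT_DIM), name="utterance")
qry_in = keras.Input(shape=(MAX_CHARS,), dtype="int32", name="query")

# 300-dim acoustic embedding + 300-dim query embedding -> 600-dim vector.
joint = layers.Concatenate()([acoustic_encoder(utt_in), query_encoder(qry_in)])
joint = layers.Dropout(0.5)(joint)
hidden = layers.Dense(256, activation="relu")(joint)
hidden = layers.Dropout(0.5)(hidden)
# Two softmax neurons: query present in / absent from the utterance.
decision = layers.Dense(2, activation="softmax", name="kws_decision")(hidden)

kws_net = keras.Model([utt_in, qry_in], decision)
kws_net.compile(optimizer=keras.optimizers.Adam(1e-3),
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
```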
3. EXPERIMENTS, RESULTS, AND ANALYSIS

3.1. Data Description and Preparation

We used the Georgian full language pack from Option Period 3 of the IARPA Babel program for keyword search from low-resource languages. The training data contains 40 hours of transcribed audio corresponding to 45k utterances and a graphemic pronunciation dictionary containing 35k words. We use the 15-hour development audio, 2k in-vocabulary (IV) keywords, and 400 out-of-vocabulary (OOV) keywords for testing. The keywords include both single-word and multi-word queries. Multilingual acoustic features have been more successful in the Babel program than conventional features such as Mel-frequency cepstral coefficients. We use an 80-dimensional multilingual acoustic front-end [18] trained on all 24 Babel languages from the Base Period to Option Period 3, excluding Georgian.

The training of the acoustic auto-encoder RNN and the CNN-RNN character LM does not require special selection of training examples. However, the final KWS neural network requires a set of positive and negative examples to train. We wanted to keep the training of this network independent of the list of test queries. Hence we constructed positive examples by taking all words in the 35k vocabulary and finding a maximum of 100 utterances that contain each word. We then selected an equal number of negative examples by picking utterances that do not contain the particular vocabulary word. We ensured good coverage of the acoustic training data by constraining each acoustic utterance to be picked in at most 5 positive and negative examples. We also excluded acoustic utterances that did not contain speech. This resulted in 62k positive and 62k negative examples each for training the KWS network. We processed the development audio and test queries in similar fashion, and ended up with 7.5k positive and negative examples each for testing. We implemented the system in Keras [19] and Theano [20].
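The example-mining procedure above can be summarized in a few lines of plain Python. The sketch below is a hypothetical helper, not the authors' preparation script: `utterances` maps utterance ids to the set of words in their transcripts (with non-speech utterances already removed), and the per-utterance cap is interpreted here as at most 5 positive and at most 5 negative uses.

```python
import random
from collections import Counter

def build_kws_examples(utterances, vocabulary, max_per_word=100, max_uses=5):
    """Return (utterance_id, word, label) triples for KWS training,
    following Section 3.1: up to `max_per_word` positive utterances per
    vocabulary word, an equal number of negatives, and each utterance
    reused in at most `max_uses` positive and `max_uses` negative examples."""
    pos_uses, neg_uses = Counter(), Counter()
    examples = []
    all_ids = list(utterances)
    for word in vocabulary:
        # Positive examples: utterances that contain the word.
        pos_ids = [u for u in all_ids
                   if word in utterances[u] and pos_uses[u] < max_uses]
        pos_ids = pos_ids[:max_per_word]
        # Negative examples: an equal number of utterances without the word.
        neg_pool = [u for u in all_ids
                    if word not in utterances[u] and neg_uses[u] < max_uses]
        neg_ids = random.sample(neg_pool, min(len(pos_ids), len(neg_pool)))
        for u in pos_ids:
            pos_uses[u] += 1
            examples.append((u, word, 1))
        for u in neg_ids:
            neg_uses[u] += 1
            examples.append((u, word, 0))
    return examples
```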
3.2. Acoustic Auto-encoder

We used 300 hidden GRU neurons in the acoustic encoder and decoder RNNs. We sorted the acoustic feature sequences in the training data set in increasing order of length, since it improved training convergence. We unrolled both RNNs for 15 seconds, or 1500 time steps, which is the length of the longest acoustic utterance in the training data set. We padded all acoustic feature sequences to a length of 1500 time steps and excluded these extra frames from the loss function and gradient computation. We used a linear dense layer of size 300 to compute the embedding from the final hidden state vector of the encoder RNN. We trained the acoustic auto-encoder by minimizing the mean-squared reconstruction error of the input sequence of acoustic feature vectors using the Adam optimization algorithm [21] with a mini-batch size of 40 utterances and a learning rate of 1 × 10^-3. We used the "newbob" annealing schedule, halving the learning rate whenever the validation set loss did not decrease sufficiently. The top plot of Figure 4 shows the progress of the training set loss as training proceeds. We find that the loss drops significantly in the initial part of training and is nearly constant after 15k utterances. Further epochs through the training data did not yield significant improvements in loss. The next section discusses details about the training of the CNN-RNN character LM.

Fig. 4. This figure shows the training loss of the RNN acoustic encoder-decoder (top plot) and the CNN-RNN character LM (bottom plot) as training proceeds.
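The "newbob" schedule can be approximated in Keras with a reduce-on-plateau callback that halves the learning rate when the validation loss stops improving. The patience and threshold below, as well as the `train_feats`/`val_feats` arrays of padded feature sequences, are illustrative placeholders rather than the authors' settings.

```python
from tensorflow import keras

# Halve the learning rate whenever the validation loss fails to improve
# by a small margin -- a rough stand-in for the "newbob" schedule.
newbob = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                            patience=1, min_delta=1e-4)

# The auto-encoder reconstructs its own (padded) input; excluding the padded
# frames from the loss, as described above, would additionally require
# per-frame (temporal) sample weights and is omitted here for brevity.
autoencoder.fit(train_feats, train_feats,
                batch_size=40,
                epochs=5,
                validation_data=(val_feats, val_feats),
                callbacks=[newbob])
```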
3.3. CNN-RNN Character LM

We used all the acoustic transcripts and converted them to sentences over 39 unique graphemes. We broke sentences longer than 50 graphemes into smaller chunks for preparing mini-batches, since the maximum length of a query is 23 graphemes. We used 50-dimensional embeddings for each grapheme and 300 convolutional masks of size 50 × 3, which resulted in a 300-dimensional embedding for each input sequence. The decoder RNN used 256 GRUs and a softmax layer of 39 neurons at each time step. We minimized the cross-entropy of the output grapheme sequence, and used Adam with a mini-batch size of 256 sequences, a learning rate of 1 × 10^-3, and the newbob annealing schedule. The bottom plot in Figure 4 shows the progress of the training cross-entropy. Unlike the acoustic RNN auto-encoder, we trained this network for a few epochs.

3.4. KWS Neural Network

After training the acoustic RNN auto-encoder and the CNN-RNN character LM, we removed the decoders from both models, concatenated the encoder outputs (resulting in a 600-dimensional vector), and fed them into a fully-connected feed-forward neural network with one ReLU hidden layer of size 256. The output layer contained two softmax neurons that detect whether the input query occurred in the acoustic utterance or not. We applied 50% dropout to all layers of this network since it improved classification performance. We used Adam to train this network by minimizing the cross-entropy loss with a batch size of 128 examples, a learning rate of 1 × 10^-3, and the newbob learning rate annealing schedule. We first back-propagated the errors through this KWS network only for a few epochs, and then through the entire network. The latter did not have a significant impact on KWS accuracy. Figure 5 shows the training loss and classification accuracy of this KWS neural network over several epochs. We observe that the network gradually reaches an above-chance classification accuracy of approximately 56%.

Fig. 5. This figure shows the training loss and classification accuracy of the KWS neural network as training proceeds.
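A sketch of this two-stage procedure, reusing the hypothetical `kws_net`, encoders, and `newbob` callback from the earlier sketches; `kws_x` and `kws_y` stand for the 62k positive and 62k negative examples of Section 3.1, and the epoch counts are illustrative.

```python
from tensorflow import keras

# Stage 1: update only the KWS layers on top of the frozen, pre-trained
# encoders.
acoustic_encoder.trainable = False
query_encoder.trainable = False
kws_net.compile(optimizer=keras.optimizers.Adam(1e-3),
                loss="sparse_categorical_crossentropy", metrics=["accuracy"])
kws_net.fit(kws_x, kws_y, batch_size=128, epochs=3, validation_split=0.1,
            callbacks=[newbob])

# Stage 2: unfreeze the encoders and back-propagate through the entire
# network (recompiling is required for the change to take effect).
acoustic_encoder.trainable = True
query_encoder.trainable = True
kws_net.compile(optimizer=keras.optimizers.Adam(1e-3),
                loss="sparse_categorical_crossentropy", metrics=["accuracy"])
kws_net.fit(kws_x, kws_y, batch_size=128, epochs=3, validation_split=0.1,
            callbacks=[newbob])
```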
We then tested the KWS network on the test set of 2k in-vocabulary (IV) queries, 400 out-of-vocabulary (OOV) queries, and 15 hours of development audio. To get the topline performance, we also trained a hybrid DNN-HMM system for Georgian. This ASR system used 6000 context-dependent states in the HMM and a five-layer deep neural network with 1024 neurons in each layer. We trained this network first using the frame-wise cross-entropy criterion and then using Hessian-free sequence minimum Bayes risk (sMBR) training [22, 23]. We used two LMs - a bigram word LM trained on a vocabulary of 35k words and a 4-gram grapheme LM trained over 39 graphemes. The word error rate (WER) of this hybrid ASR system is 41.9%. We then performed KWS over the 1-best transcript obtained by Viterbi decoding of the development audio, instead of the full-blown lattice-based KWS, for simplicity and a fair comparison to the E2E KWS approach.

Table 2 shows the classification accuracies of the DNN-HMM ASR system and the proposed E2E ASR-free KWS system. We obtain a classification accuracy of 55.6% on IV and 57.7% on OOV queries, which is significantly above chance. As expected, the IV performance is lower than that of the hybrid ASR system using the 2-gram word LM. But it is interesting to note that the E2E ASR-free system and the hybrid system using the 4-gram grapheme LM have closer accuracies, especially for OOV queries, where the E2E KWS system performs better by 2.2% absolute. This result is encouraging, since the hybrid system uses word-level transcriptions for training the acoustic model and 36 times more training time than the E2E ASR-free KWS system.

Table 2. This table compares the KWS accuracy of the E2E KWS and DNN-HMM hybrid ASR systems for IV and OOV queries.
Query Type →                  IV     OOV
DNN-HMM (2gm word LM)         76.7   50.0 (chance)
DNN-HMM (4gm grapheme LM)     70.7   55.5
E2E ASR-free                  55.6   57.7

We performed further analysis of the dependence of KWS performance on query length. Table 1 shows the classification accuracy as a function of the number of graphemes in the query. We observe that both the ASR-based and E2E KWS systems have difficulty detecting short queries. In the case of the E2E system, this is because it is difficult to derive a reliable representation for short queries due to the lack of context. A key advantage of the E2E KWS system is that it takes 36 times less time to train than the DNN-HMM system.

Table 1. This table compares the KWS accuracy of the E2E KWS and DNN-HMM hybrid ASR systems for different IV query lengths.
Query Length →                ≤3    4     5     6     7     8     9     10    11    12    13    14    ≥15
DNN-HMM (2gm word LM)         69.8  72.5  74.6  77.9  77.3  78.8  76.7  80.0  78.7  78.9  74.5  77.1  78.6
DNN-HMM (4gm grapheme LM)     70.6  74.7  71.1  72.8  71.9  70.1  66.4  68.4  67.3  65.2  65.6  65.4  65.3
E2E ASR-free                  51.8  56.4  56.5  55.6  55.3  55.1  55.7  52.1  53.5  58.4  55.8  56.7  60.0

4. CONCLUSIONS AND FUTURE WORK

This paper presented a novel end-to-end ASR-free approach to text query-based KWS from speech. This is in contrast to ASR-based approaches and to previous works on query-by-example retrieval of audio. The proposed system trains with minimal supervision, without any transcription of the acoustic data. The system uses an RNN acoustic auto-encoder, a CNN-RNN character LM, and a KWS neural network that decides whether the input text query occurs in the acoustic utterance. We show that the system performs respectably on a Georgian keyword search task from the Babel program, and trains 36 times faster than a conventional DNN-HMM hybrid ASR system. Future work should focus on closing the performance gap with the hybrid ASR system and estimating the times of the detected keywords.

5. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

[2] G. Saon, T. Sercu, S. Rennie, and H. J. Kuo, "The IBM 2016 English Conversational Telephone Speech Recognition System," in Proc. Interspeech, 2016.

[3] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 Conversational Speech Recognition System," arXiv:1609.03528, 2016.

[4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369-376.

[5] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU. IEEE, 2015, pp. 167-174.

[6] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[7] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4945-4949.

[8] D. Can and M. Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338-2347, 2011.

[9] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in Proc. ASRU. IEEE, 2009, pp. 421-426.

[10] Y. Zhang and J. R. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in Proc. ASRU. IEEE, 2009, pp. 398-403.

[11] G. Chen, C. Parada, and T. N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in Proc. ICASSP. IEEE, 2015, pp. 5236-5240.

[12] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in Proc. ASRU. IEEE, 2013, pp. 410-415.

[13] Y. Chung, C. Wu, C. Shen, H. Lee, and L. Lee, "Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in Proc. Interspeech, 2016.

[14] H. Kamper, W. Wang, and K. Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Proc. ICASSP. IEEE, 2016, pp. 4950-4954.

[15] D. Palaz, G. Synnaeve, and R. Collobert, "Jointly learning to locate and classify words using convolutional networks," in Proc. Interspeech, 2016.

[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.

[17] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.

[18] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, et al., "Multilingual representations for low resource speech recognition and keyword search," in Proc. ASRU. IEEE, 2015, pp. 259-266.

[19] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.

[20] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.

[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[22] J. Martens, "Deep learning via Hessian-free optimization," in Proc. ICML, 2010, pp. 735-742.

[23] B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in Proc. Interspeech, 2012.