Copyright 2017 IEEE. Published in the IEEE 2017 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017), scheduled for 5-9 March 2017 in New Orleans, Louisiana, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: +Intl. 908-562-3966.

arXiv:1701.04313v1 [cs.CL] 13 Jan 2017

END-TO-END ASR-FREE KEYWORD SEARCH FROM SPEECH

Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury

IBM Watson, IBM T.J. Watson Research Center, Yorktown Heights, New York

ABSTRACT

End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequences during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.

Index Terms — End-to-end systems, neural networks, keyword search, automatic speech recognition

1. INTRODUCTION

Deep neural networks (DNNs) have pushed the state-of-the-art for automatic speech recognition (ASR) systems [1].
This has led to significant performance improvements on several well-known ASR benchmarks such as Switchboard [2, 3]. End-to-end (E2E) or fully-neural architectures have become an alternative to the hybrid hidden Markov model (HMM)-DNN architecture. These include the connectionist temporal classification [4, 5] loss-based recurrent neural network (RNN) and attention-based RNNs [6, 7].

Automatic speech recognition is often not the end goal of real-world speech information processing systems. Instead, an important end goal is information retrieval, in particular keyword search (KWS), which involves retrieving the speech utterances containing a user-specified text query from a large database. Conventional KWS from speech uses an ASR system as a front-end that converts the speech database into a finite-state transducer (FST) index containing all possible hypothesized word sequences with their associated confidence scores and time stamps [8]. The user-specified text query is then composed with this FST index to find putative keyword locations and confidence scores.

Training a good ASR system is time-consuming and requires a substantial amount of transcribed audio data. The main novelty of this paper is an end-to-end ASR-free KWS system motivated by the recent success of E2E systems for ASR. Our fully-neural E2E KWS system lacks both an ASR system and an FST index, which makes it fast to train. We train our system in only 2 hours with minimal supervision and do not need fully transcribed training audio. Our system performs significantly better than chance and respectably compared to a state-of-the-art hybrid HMM-DNN ASR-based system which takes over 72 hours to train.

The next section gives an overview of related prior work. Section 2 introduces our E2E ASR-free system by discussing its three constituent sub-systems - an RNN-based acoustic encoder-decoder, a convolutional neural network (CNN)-RNN character language model (LM), and a feed-forward KWS network. Section 3 discusses the experimental setup, training of the E2E KWS system, and analysis of the results. Section 4 gives directions for future work.

(Footnote: The authors thank Dogan Can and Shrikanth Narayanan of the Signal Analysis and Interpretation Lab, University of Southern California, for useful discussions. This paper uses the IARPA-Babel404b-v1.0a full language pack. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.)

1.1. Prior Work

Most prior works relevant to this paper have focused on query-by-example (QbyE) retrieval of speech from a database. The user provides a speech utterance of the query to be searched, in contrast to a text query used in the KWS setup of this paper. Dynamic time warping (DTW) of acoustic features extracted from the speech query and speech utterances from the database is a classic technique in such QbyE systems [9, 10]. The cost of DTW alignment serves as a matching score for retrieval.

Chen, Parada, and Sainath [11] present a system for QbyE where the audio database contains examples of certain key-phrases, such as "hello genie". The last k state vectors from the final hidden layer of an RNN acoustic model give a fixed-dimensional representation for both the speech query and each utterance in the speech database. The KWS detector then computes cosine similarity between the speech query and utterance representations. Levin et al. [12] compare several non-neural network-based techniques for computing fixed-dimensional representations of speech segments for QbyE, including principal component analysis and Laplacian eigenmaps. Chung et al. [13] also present a similar QbyE system using an acoustic RNN encoder-decoder network. Kamper, Wang, and Livescu [14] use a Siamese convolutional neural network (CNN) for obtaining acoustic embeddings of the audio query and utterances from the database, and train this network by minimizing a triplet hinge loss.

Our paper is closely related to the recent work of Palaz, Synnaeve, and Collobert [15], where the authors propose a CNN-based ASR system trained on a bag of words in the speech utterance. The CNN emits a time sequence of posterior distributions over the vocabulary, which is then aggregated over time to produce an estimate of the bag of words. However, we note that their system is trained on a fixed vocabulary of 1000 words, compared with our proposed open-vocabulary system that does not use word identity. In addition, the training examples used in their approach carry a stronger supervision of word identity compared with the much weaker supervision in our proposed model. The next section presents the architecture of our E2E ASR-free KWS system.
2. END-TO-END ASR-FREE KWS ARCHITECTURE

Our E2E ASR-free KWS system is philosophically similar to a conventional hybrid ASR-based KWS system, with three sub-systems that model the acoustics, language, and keyword search. However, there are several differences in the structures of these sub-systems and their training.

2.1. RNN Acoustic Auto-encoder

Motivated by prior work [11, 13] on computing fixed-dimensional representations from a variable-length acoustic feature vector sequence, we use an RNN-based auto-encoder as shown in Figure 1. The encoder processes T acoustic feature vectors (x_1, ..., x_T) with a uni-directional RNN with gated recurrent unit (GRU) [16] hidden units unrolled over T time steps. A fully-connected layer with weight matrix W and D rectified linear units (ReLUs) denoted by g then processes the hidden state vector h^e_T from time step T. A decoder GRU-RNN takes the resulting D-dimensional acoustic representation g(W h^e_T) as input at each time step t ∈ {1, ..., T} to reconstruct the original sequence of T acoustic feature vectors.

Fig. 1. This figure shows an RNN acoustic auto-encoder for a T-length input sequence of acoustic feature vectors through a D-dimensional encoded representation g(W h^e_T), where g denotes a ReLU activation function.

We use the D-dimensional output g(W h^e_T) of the RNN encoder as the vector representation of the input acoustic feature vector sequence. This "acoustic model" does not use any transcribed speech data to train, in contrast with an acoustic model in a conventional ASR system. This is in line with our overall goal of making the entire KWS system ASR-free and training it with less supervision. The next section describes the language model used to produce a fixed-dimensional representation of the text query.
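As a concrete illustration of the auto-encoder just described, the sketch below builds it with a modern Keras-style API. The dimensions (80-dimensional features, 300 GRU units, D = 300, sequences padded to 1500 frames) come from Sections 3.1-3.2 of this paper, but the exact layer configuration is an illustrative assumption, not the authors' original Keras/Theano code.

```python
# Minimal sketch of the RNN acoustic auto-encoder (Sec. 2.1), assuming a
# modern tf.keras API. For brevity it does not replicate the paper's
# exclusion of padded frames from the reconstruction loss (e.g. via
# per-frame sample weights).
from tensorflow import keras
from tensorflow.keras import layers

T, feat_dim, D = 1500, 80, 300

inputs = keras.Input(shape=(T, feat_dim))
# Masking lets the encoder skip zero-padded frames when updating its state.
masked = layers.Masking(mask_value=0.0)(inputs)
# Uni-directional GRU encoder; only the final hidden state h^e_T is kept.
h_T = layers.GRU(300, return_sequences=False)(masked)
# Fully-connected layer g(W h^e_T) producing the D-dimensional embedding.
embedding = layers.Dense(D, activation="relu", name="acoustic_embedding")(h_T)
# The decoder GRU receives the same embedding at every one of the T steps.
repeated = layers.RepeatVector(T)(embedding)
decoded = layers.GRU(300, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(feat_dim))(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                    loss="mse")

# The encoder half is what the KWS network reuses later.
encoder = keras.Model(inputs, embedding)
```

After training, only `encoder` is kept; its 300-dimensional output serves as the utterance representation fed to the KWS network in Section 2.3.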
2.2. CNN-RNN Character LM

We use the CNN-based character RNN-LM architecture from Kim et al. [17] for deriving query embeddings. Figure 2 shows this LM for the simple case of 2 convolutional masks. We map the input sequence of N characters (c_1, ..., c_N) to a matrix of d-dimensional character embeddings via a look-up table. Next, M d×w convolutional masks operate on the resulting d×N embedding matrix to produce a set of M N-dimensional vectors, one per mask. We then perform max-pooling over time on each of these vectors to obtain a scalar per mask, giving an M-dimensional embedding vector. This embedding vector then feeds into a GRU-RNN that predicts one out of K characters at each time step.

Fig. 2. This figure shows a character CNN-RNN LM for encoding text queries. We show two convolutional masks for simplicity.

One key difference between our LM and the one from Kim et al. [17] is that we train our LM to predict a sequence of characters instead of words. The next section presents our overall KWS system that uses the learned acoustic and query embeddings to predict whether the query occurs in the utterance or not.
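The query encoder and its character-level decoder can be sketched in the same style. The sizes below follow Section 3.3 (39 graphemes, 50-dimensional character embeddings, 300 convolutional masks of width 3, a 256-unit GRU decoder, input chunks of up to 50 graphemes); the padding index, the tanh activation on the convolution, and the vocabulary indexing are illustrative assumptions rather than details confirmed by the paper.

```python
# Minimal sketch of the CNN-RNN character LM (Sec. 2.2 / 3.3), assuming a
# modern tf.keras API. Index 0 is reserved here as a hypothetical padding
# symbol, which is why the embedding and softmax use K_chars + 1 entries.
from tensorflow import keras
from tensorflow.keras import layers

K_chars, d, M, w, N = 39, 50, 300, 3, 50  # graphemes, emb dim, masks, width, max length

query_in = keras.Input(shape=(N,), dtype="int32")
# Character look-up table: N indices -> an N x d embedding matrix.
char_emb = layers.Embedding(input_dim=K_chars + 1, output_dim=d)(query_in)
# M convolutional masks of size d x w applied along the character axis.
conv = layers.Conv1D(filters=M, kernel_size=w, padding="same",
                     activation="tanh")(char_emb)
# Max-pooling over time gives one scalar per mask: an M-dim query embedding.
query_emb = layers.GlobalMaxPooling1D(name="query_embedding")(conv)

# Decoder GRU-RNN: the query embedding is fed at every step and the network
# is trained to predict the grapheme sequence with a softmax at each step.
repeated = layers.RepeatVector(N)(query_emb)
dec = layers.GRU(256, return_sequences=True)(repeated)
char_probs = layers.TimeDistributed(
    layers.Dense(K_chars + 1, activation="softmax"))(dec)

char_lm = keras.Model(query_in, char_probs)
char_lm.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                loss="sparse_categorical_crossentropy")

# The encoder half is what the KWS network reuses later.
query_encoder = keras.Model(query_in, query_emb)
```

As with the acoustic auto-encoder, only the encoder half (`query_encoder`) is kept after training; the 300-dimensional max-pooled vector is the text query representation.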
2.3. Overall E2E ASR-free KWS System

The final block in the overall KWS system is a neural network that takes the speech utterance and text query embeddings as input and predicts whether the query occurs in the utterance or not. This is in contrast to previous works on QbyE in speech, where both the speech utterance and the speech query lie in the same acoustic representation space and cosine similarity is enough to match the two. Figure 3 shows the overall E2E KWS system. We extract the encoders from both the acoustic RNN auto-encoder and the CNN-RNN character LM, and feed them into a feed-forward neural network. In contrast to conventional ASR-based approaches to KWS from speech, the E2E system in Figure 3 is also jointly trainable after the utterance and query encoders have been pre-trained. The next section presents our data preparation, experiments, results, and analysis.

Fig. 3. This figure shows the overall E2E KWS system. The finite-dimensional embeddings from the acoustic and query encoders feed into the KWS neural network that predicts if the query occurs in the utterance or not.

3. EXPERIMENTS, RESULTS, AND ANALYSIS

3.1. Data Description and Preparation

We used the Georgian full language pack from Option Period 3 of the IARPA Babel program for keyword search from low-resource languages. The training data contains 40 hours of transcribed audio corresponding to 45k utterances and a graphemic pronunciation dictionary containing 35k words. We use the 15-hour development audio, 2k in-vocabulary (IV) keywords, and 400 out-of-vocabulary (OOV) keywords for testing. The keywords include both single-word and multi-word queries. Multilingual acoustic features have been successful in the Babel program compared to conventional features such as Mel frequency cepstral coefficients. We use an 80-dimensional multilingual acoustic front-end [18] trained on all 24 Babel languages from the Base Period to Option Period 3, excluding Georgian.

Training the acoustic auto-encoder RNN and the CNN-RNN character LM does not require special selection of training examples. However, the final KWS neural network requires a set of positive and negative examples to train. We wanted to keep the training of this network independent of the list of test queries. Hence we constructed positive examples by taking all words in the 35k vocabulary and finding a maximum of 100 utterances that contain each word. We then selected an equal number of negative examples by picking utterances that do not contain the particular vocabulary word. We ensured good coverage of the acoustic training data by constraining each acoustic utterance to be picked in only a maximum of 5 positive and negative examples. We also excluded acoustic utterances that did not contain speech. This resulted in 62k positive and 62k negative examples each for training the KWS network. We processed the development audio and test queries in similar fashion, and ended up with 7.5k positive and negative examples each for testing. We implemented the system in Keras [19] and Theano [20].
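To make the example-construction procedure concrete, the sketch below is one possible reading of it in plain Python. The inputs `vocabulary` and `transcripts` (utterance id mapped to its word sequence) are hypothetical names, and the cap of 5 is interpreted here as applying separately to positive and negative uses of an utterance; the paper does not specify these details, so this is only an illustration of the sampling logic, not the authors' data-preparation script.

```python
# Illustrative sketch of the positive/negative example construction in Sec. 3.1.
# Caps follow the paper: at most 100 utterances per word, and each utterance
# reused in at most 5 positive and 5 negative examples (an assumed reading).
import random
from collections import defaultdict

MAX_UTTS_PER_WORD = 100
MAX_USES_PER_UTT = 5

def build_kws_examples(vocabulary, transcripts, seed=0):
    rng = random.Random(seed)
    pos_uses = defaultdict(int)   # times an utterance appears as a positive
    neg_uses = defaultdict(int)   # times an utterance appears as a negative
    examples = []                 # (utterance_id, word, label)
    all_utts = list(transcripts)

    for word in vocabulary:
        # Positive candidates: utterances containing the word, not over-used.
        positives = [u for u in all_utts
                     if word in transcripts[u]
                     and pos_uses[u] < MAX_USES_PER_UTT][:MAX_UTTS_PER_WORD]
        # Negative candidates: utterances without the word, not over-used.
        negatives_pool = [u for u in all_utts
                          if word not in transcripts[u]
                          and neg_uses[u] < MAX_USES_PER_UTT]
        negatives = rng.sample(negatives_pool,
                               min(len(positives), len(negatives_pool)))

        for u in positives:
            pos_uses[u] += 1
            examples.append((u, word, 1))
        for u in negatives:
            neg_uses[u] += 1
            examples.append((u, word, 0))
    return examples
```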
3.2. Acoustic Auto-encoder

We used 300 hidden GRU neurons in the acoustic encoder and decoder RNNs. We sorted the acoustic feature sequences in the training data set in increasing order of length, since this improved training convergence. We unrolled both RNNs for 15 seconds, or 1500 time steps, which is the length of the longest acoustic utterance in the training data set. We padded all acoustic feature sequences to a length of 1500 time steps and excluded these extra frames from the loss function and gradient computation. We used a linear dense layer of size 300 to compute the embedding from the final hidden state vector of the encoder RNN. We trained the acoustic auto-encoder by minimizing the mean-squared reconstruction error of the input sequence of acoustic feature vectors using the Adam optimization algorithm [21] with a mini-batch size of 40 utterances and a learning rate of 1×10^-3. We used the "newbob" annealing schedule, reducing the learning rate by half whenever the validation set loss did not decrease sufficiently. The top plot of Figure 4 shows the progress of the training set loss as training proceeds. We find that the loss drops significantly in the initial part of training and is nearly constant after 15k utterances. Further epochs through the training data did not yield significant improvements in loss. The next section discusses details about the training of the CNN-RNN character LM.

Fig. 4. This figure shows the training loss of the RNN acoustic encoder-decoder (top plot) and the CNN-RNN character LM (bottom plot) as training proceeds.

3.3. CNN-RNN Character LM

We used all the acoustic transcripts and converted them to sentences over 39 unique graphemes. We broke sentences longer than 50 graphemes into smaller chunks for preparing mini-batches, since the maximum length of a query is 23 graphemes. We used 50-dimensional embeddings for each grapheme and 300 convolutional masks of size 50×3, which resulted in a 300-dimensional embedding for each input sequence. The decoder RNN used 256 GRUs and a softmax layer of 39 neurons at each time step. We minimized the cross-entropy of the output grapheme sequence, and used Adam with a mini-batch size of 256 sequences, a learning rate of 1×10^-3, and the newbob annealing schedule. The bottom plot in Figure 4 shows the progress of the training cross-entropy. Unlike the acoustic RNN auto-encoder, we trained this network for a few epochs.

3.4. KWS Neural Network

After training the acoustic RNN auto-encoder and CNN-RNN character LM, we removed the decoders from both models, concatenated the encoder outputs (resulting in a 600-dimensional vector), and fed them into a fully-connected feed-forward neural network with one ReLU hidden layer of size 256. The output layer contained two softmax neurons that detect whether the input query occurred in the acoustic utterance or not. We applied 50% dropout to all layers of this network since it improved classification performance. We used Adam to train this network by minimizing the cross-entropy loss with a batch size of 128 examples, a learning rate of 1×10^-3, and the newbob learning-rate annealing schedule.
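Putting the pieces together, the sketch below wires the two pre-trained encoders from the earlier sketches (`encoder` and `query_encoder`) into the KWS classifier of Section 3.4. The `ReduceLROnPlateau` callback stands in for the "newbob" schedule that halves the learning rate when the validation loss stops improving; the placement of dropout and the commented training call are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative tf.keras sketch of the KWS classifier head (Sec. 3.4), reusing
# the `encoder` and `query_encoder` models defined in the earlier sketches.
from tensorflow import keras
from tensorflow.keras import layers

acoustic_in = keras.Input(shape=(1500, 80))
query_in = keras.Input(shape=(50,), dtype="int32")

# Concatenate the 300-dim acoustic and 300-dim query embeddings -> 600 dims.
joint = layers.Concatenate()([encoder(acoustic_in), query_encoder(query_in)])
joint = layers.Dropout(0.5)(joint)
hidden = layers.Dense(256, activation="relu")(joint)
hidden = layers.Dropout(0.5)(hidden)
# Two softmax outputs: query present vs. query absent.
decision = layers.Dense(2, activation="softmax")(hidden)

kws_net = keras.Model([acoustic_in, query_in], decision)
kws_net.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

# A "newbob"-like schedule: halve the learning rate when validation loss stalls.
newbob_like = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                factor=0.5, patience=1)

# First train only the KWS head with the encoders frozen, then optionally
# unfreeze them and fine-tune the whole network, as described in the paper.
encoder.trainable = False
query_encoder.trainable = False
# kws_net.fit([acoustic_feats, query_chars], labels, batch_size=128,
#             validation_split=0.1, epochs=10, callbacks=[newbob_like])
```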
We first back-propagated the errors through this KWS network only, for a few epochs, and then through the entire network. The latter did not have a significant impact on KWS accuracy. Figure 5 shows the training loss and classification accuracy of this KWS neural network over several epochs. We observe that the network gradually reaches an above-chance classification accuracy of approximately 56%.

Fig. 5. This figure shows the training loss and classification accuracy of the KWS neural network as training proceeds.

We then tested the KWS network on the test set of 2k in-vocabulary (IV) queries, 400 out-of-vocabulary (OOV) queries, and 15 hours of development audio. To get the topline performance, we also trained a hybrid DNN-HMM system for Georgian. This ASR system used 6000 context-dependent states in the HMM and a five-layer deep neural network with 1024 neurons in each layer. We trained this network first using the frame-wise cross-entropy criterion and then using Hessian-free sequence minimum Bayes risk (sMBR) training [22, 23]. We used two LMs - a bigram word LM trained on a vocabulary of 35k words and a 4-gram grapheme LM trained over 39 graphemes. The word error rate (WER) of this hybrid ASR system is 41.9%. We then performed KWS over the 1-best transcript obtained by Viterbi decoding of the development audio, instead of the full-blown lattice-based KWS, for simplicity and a fair comparison to the E2E KWS approach.

Table 2. This table compares the KWS accuracy of the E2E KWS and DNN-HMM hybrid ASR systems for IV and OOV queries.

Query Type →                    IV     OOV
DNN-HMM (2gm word LM)           76.7   50.0 (chance)
DNN-HMM (4gm grapheme LM)       70.7   55.5
E2E ASR-free                    55.6   57.7

Table 2 shows the classification accuracies of the DNN-HMM ASR system and the proposed E2E ASR-free KWS system. We obtain a classification accuracy of 55.6% on IV and 57.7% on OOV queries, which is significantly above chance. As expected, the IV performance is lower than that of the hybrid ASR system using the 2-gm word LM. But it is interesting to note that the E2E ASR-free system and the hybrid system using the 4-gm grapheme LM have closer accuracies, especially for OOV queries, where the E2E KWS system performs better by 2.2% absolute. This result is encouraging, since the hybrid system uses word-level transcriptions for training the acoustic model and 36 times more training time than the E2E ASR-free KWS system. We performed further analysis of the dependence of KWS performance on query length. Table 1 shows the classification accuracy as a function of the number of graphemes in the query. We observe that both the ASR-based and E2E KWS systems have difficulty detecting short queries. In the case of the E2E system, this is because it is difficult to derive a reliable representation for short queries due to the lack of context. A key advantage of the E2E KWS system is that it takes 36 times less time to train than the DNN-HMM system.

Table 1. This table compares the KWS accuracy of the E2E KWS and DNN-HMM hybrid ASR systems for different IV query lengths.

Query Length →                 ≤3    4     5     6     7     8     9     10    11    12    13    14    ≥15
DNN-HMM (2gm word LM)          69.8  72.5  74.6  77.9  77.3  78.8  76.7  80.0  78.7  78.9  74.5  77.1  78.6
DNN-HMM (4gm grapheme LM)      70.6  74.7  71.1  72.8  71.9  70.1  66.4  68.4  67.3  65.2  65.6  65.4  65.3
E2E ASR-free                   51.8  56.4  56.5  55.6  55.3  55.1  55.7  52.1  53.5  58.4  55.8  56.7  60.0

4. CONCLUSIONS AND FUTURE WORK

This paper presented a novel end-to-end ASR-free approach to text query-based KWS from speech. This is in contrast to ASR-based approaches and to previous works on query-by-example retrieval of audio. The proposed system trains with minimal supervision, without any transcription of the acoustic data. The system uses an RNN acoustic auto-encoder, a CNN-RNN character LM, and a KWS neural network that decides whether the input text query occurs in the acoustic utterance. We show that the system performs respectably on a Georgian keyword search task from the Babel program, and trains 36 times faster than a conventional DNN-HMM hybrid ASR system. Future work should focus on closing the performance gap with the hybrid ASR system and on estimating the times of the detected keywords.

5. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[2] G. Saon, T. Sercu, S. Rennie, and H. J. Kuo, "The IBM 2016 English Conversational Telephone Speech Recognition System," in Proc. Interspeech, 2016.
[3] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 Conversational Speech Recognition System," arXiv:1609.03528, 2016.
[4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369-376.
[5] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU. IEEE, 2015, pp. 167-174.
[6] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[7] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4945-4949.
[8] D. Can and M. Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338-2347, 2011.
[9] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in Proc. ASRU. IEEE, 2009, pp. 421-426.
[10] Y. Zhang and J. R. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in Proc. ASRU. IEEE, 2009, pp. 398-403.
[11] G. Chen, C. Parada, and T. N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in Proc. ICASSP. IEEE, 2015, pp. 5236-5240.
[12] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in Proc. ASRU. IEEE, 2013, pp. 410-415.
[13] Y. Chung, C. Wu, C. Shen, H. Lee, and L. Lee, "Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in Proc. Interspeech, 2016.
[14] H. Kamper, W. Wang, and K. Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Proc. ICASSP. IEEE, 2016, pp. 4950-4954.
[15] D. Palaz, G. Synnaeve, and R. Collobert, "Jointly learning to locate and classify words using convolutional networks," in Proc. Interspeech, 2016.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[18] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, et al., "Multilingual representations for low resource speech recognition and keyword search," in Proc. ASRU. IEEE, 2015, pp. 259-266.
[19] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[20] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] J. Martens, "Deep learning via Hessian-free optimization," in Proc. ICML, 2010, pp. 735-742.
[23] B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in Proc. Interspeech, 2012.
