CHARACTER-LEVEL INCREMENTAL SPEECH RECOGNITION WITH RECURRENT NEURAL NETWORKS

Kyuyeon Hwang and Wonyong Sung
Department of Electrical and Computer Engineering
Seoul National University
1, Gwanak-ro, Gwanak-gu, Seoul, 08826 Korea
[email protected]; [email protected]

arXiv:1601.06581v2 [cs.CL] 28 Jan 2016

(This work was supported in part by the Brain Korea 21 Plus Project and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A1A10056051).)

ABSTRACT

In real-time speech recognition applications, latency is an important issue. We have developed a character-level incremental speech recognition (ISR) system that responds quickly even during the speech, where the hypotheses are gradually improved while the speaking proceeds. The algorithm employs a speech-to-character unidirectional recurrent neural network (RNN), which is end-to-end trained with connectionist temporal classification (CTC), and an RNN-based character-level language model (LM). The output values of the CTC-trained RNN are character-level probabilities, which are processed by beam search decoding. The RNN LM augments the decoding by providing long-term dependency information. We propose tree-based online beam search with additional depth-pruning, which enables the system to process infinitely long input speech with low latency. This system not only responds quickly to speech but can also dictate out-of-vocabulary (OOV) words according to their pronunciation. The proposed model achieves a word error rate (WER) of 8.90% on the Wall Street Journal (WSJ) Nov'92 20K evaluation set when trained on the WSJ SI-284 training set.

Index Terms: Incremental speech recognition, character-level, recurrent neural networks, connectionist temporal classification, beam search

1. INTRODUCTION

Incremental speech recognition (ISR) allows a speech-based interaction system to react quickly while the utterance is being spoken. Unlike offline sentence-wise automatic speech recognition (ASR), where the decoding result is available only after the user finishes speaking, ISR returns N-best decoding results with small latency during speech. These N-best results, or hypotheses, gradually improve as the system receives more speech data. Since ISR is usually employed for immediate reaction to speech, word stability [1, 2] and incremental lattice generation [3] have been important topics.

In this paper, we introduce an end-to-end character-level ISR system with two unidirectional recurrent neural networks (RNNs). An acoustic RNN roughly dictates the input speech, and an RNN-based language model is employed to augment the dictation result through decoding. Compared to a conventional word-level backend for a speech recognition system, the character-level ASR is capable of dictating out-of-vocabulary (OOV) words based on their pronunciation. Also, our model is trained directly from speech and text corpora and does not require an external word dictionary or senone modeling.

There have been efforts to deal with OOV words in conventional HMM-based ASR systems. In [4], graphemes are employed as basic units instead of phonemes. Also, a sub-lexical language model is proposed in [5] for detecting previously unseen words.

RNN-based character-level end-to-end ASR systems were studied in [6, 7, 8, 9, 10]. However, they lack the capability of dictating OOV words since the decoding is performed with word-level LMs. Recently, a lexicon-free end-to-end ASR system was introduced in [11], where a character-level RNN LM is employed. We further improve this approach by employing prefix-tree-based online beam search with additional depth-pruning for ISR.

The character-level ISR system proposed in this paper is composed of an acoustic RNN and an RNN LM. The acoustic RNN is end-to-end trained with connectionist temporal classification (CTC) [12] using the Wall Street Journal (WSJ) speech corpus [13]. The output of the acoustic RNN is the probability of characters, which are decoded with character-level beam search to generate N-best hypotheses. To improve the performance, a character-level RNN LM is employed to augment the beam search. Also, we propose depth-pruning for efficient tree-based beam search. The RNN LM is separately trained with a large text corpus that is also included in the WSJ corpus. Unlike for word-level language modeling, conventional statistical LMs such as n-gram back-off models cannot be used, because a much longer history window is required for character-level prediction. Both the acoustic RNN and the RNN LM have deep unidirectional long short-term memory (LSTM) network structures [14, 15]. For continuous ISR on infinitely long input speech, they are trained with virtually infinite training data streams that are generated by randomly concatenating training sequences.

The proposed model is evaluated on a single test sequence that is generated by concatenating all test utterances in WSJ eval92 (Nov'92 20k evaluation set) without any external reset of RNN states at the utterance boundaries. The ISR performance is examined by varying the beam width and depth. Generally, a wider beam increases the accuracy. Under the same beam width, there is a trade-off between accuracy and stability (or latency), where the balance between them can be adjusted by the beam depth.

Fig. 1. Example of character-level random text generation with the RNN LM. Sample output: "THREE ISSUES ADVANCED MICRO OF AMERICA THE ONLY WAY TO DIVERSIFY INTO TREATING MODERN ARMIES LOOKING AHEAD TO MR. LEYSEN WITH AN INTOLERABLE POP CUT WHEN AN ALL POWERFUL STUDENT SEEKS ITS CORE DRIVING UPJOHN STOVES AMERICAN EXPRESS HASN'T YET SWORED PARTICULARLY WITH THE RESTRUCTURING IS A COMMITMENT TO BUY POTENTIAL BUYERS IN THE OPEN MARKET"

Fig. 2. Beam search tree consisting of label nodes. The CTC blank label is not included. (The figure shows a prefix tree over characters, e.g. the prefix "THI" branching into "THIN", "THIS", and "THIC", with the RNN LM input history along the path from the root.)

2. MODELS

2.1. Acoustic model

The acoustic model is a deep RNN trained with CTC [12]. The network consists of two LSTM layers with 768 cells each.
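The acoustic RNN above emits per-frame character probabilities. As a minimal illustration of how such outputs map to text, the sketch below performs greedy best-path CTC decoding (argmax per frame, collapse repeats, drop blanks); this is a simplification for exposition only, not the beam search decoder the paper actually uses, and the label set and probabilities are invented:

```python
# Minimal sketch (not the paper's decoder): greedy best-path CTC decoding.
# Per-frame character probabilities are reduced to text by taking the argmax
# label at each frame, collapsing consecutive repeats, and dropping blanks.
# The label set and probability values below are illustrative assumptions.

BLANK = "-"

def best_path_decode(frame_probs, labels):
    """frame_probs: list of per-frame probability lists over `labels`."""
    # 1. Pick the most probable label at each frame.
    path = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # 2. Collapse consecutive repeats, then remove blank labels.
    collapsed = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in collapsed if c != BLANK)

labels = [BLANK, "H", "I"]
frame_probs = [
    [0.1, 0.8, 0.1],  # argmax: H
    [0.1, 0.7, 0.2],  # argmax: H (repeat, collapsed)
    [0.8, 0.1, 0.1],  # argmax: blank
    [0.1, 0.1, 0.8],  # argmax: I
]
print(best_path_decode(frame_probs, labels))  # HI
```

Greedy decoding picks only the single best path, which is why the paper instead sums over all paths and searches with a beam.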
The network has a total of 12.2 M trainable parameters. The model is similar to the one in the previous work on end-to-end speech recognition with RNNs [6] except for a few major differences. In our case, the RNN is trained by online CTC [16] with very long training sequences that are generated by randomly concatenating several utterances. There is no need to reset the RNN states at the utterance boundaries. This is necessary for ISR systems that run continuously with an infinite input audio stream. Also, our model has a unidirectional structure, since the bidirectional networks that are usually employed for end-to-end speech recognition are not suitable for low-latency recognition. This is because the backward layers in bidirectional networks cannot be computed before the input utterance is finished.

The input of the network is a 40-dimensional log mel-frequency filter bank feature vector with energy and their delta and double-delta values, resulting in a 123-dimensional vector. The feature vectors are extracted every 10 ms with a 25 ms Hamming window. The input vectors are element-wise standardized based on the statistics obtained from the training set. The output is a 31-dimensional vector that consists of the probabilities of the 26 upper-case alphabet characters, 3 special characters, the end-of-sentence (EOS) symbol, and the CTC blank label.

The networks are trained with stochastic gradient descent (SGD) with 8 parallel input streams on a GPU [17]. The networks are unrolled 2048 times and weight updates are performed every 1024 forward steps. The network performance is evaluated every 10 M training frames. The evaluation is performed on a total of 2 M frames from the development set. The learning rate starts from 1e-5 and is reduced by a factor of 10 whenever the WER on the development set is not improved for 6 consecutive evaluations. The training ends when the learning rate drops below 1e-7.

We trained the networks on two training sets. The first one is the standard WSJ SI-284 set, and the second one, SI-ALL, is the set of all speaker-independent training utterances in the WSJ corpus. Note that the utterances with verbalized punctuation are removed from both training sets. Also, odd transcriptions are filtered out, which makes the final SI-284 and SI-ALL sets contain roughly 71 and 167 hours of speech, respectively. The WSJ dev93 (Nov'93 20k development set) and eval92 (Nov'92 20k evaluation set) sets are used as the development set and the evaluation set, respectively.

2.2. Language model

An RNN language model (LM) [18] is employed for the proposed ISR system, since conventional statistical LMs such as n-gram back-off models are not suitable for character-level prediction; they cannot make use of very long history windows. Specifically, the RNN LM has a deep LSTM network structure with two LSTM layers where each of them has 512 memory cells, resulting in a total of 3.2 M parameters.

The input of the RNN LM is a 30-dimensional vector, where the current label (character) is one-hot encoded. The output is also a 30-dimensional vector, which represents the probabilities of the next labels. Although the RNN LM is trained to predict the next character given only the current character, the past character history is internally stored inside the RNN and used for the prediction. It is well known that RNN LMs can remember contexts for very long time steps.

As for the acoustic RNN, the RNN LM is trained on a very long text stream that is generated by attaching randomly picked sentences and inserting EOS labels between the sentences. The RNN LM is trained with the AdaDelta [19] based SGD method for accelerated training and better annealing. The WSJ LM training text with non-verbalized punctuation, which contains about 215 M characters, is used for training the RNN LM. A randomly selected 1% of the corpus is reserved for evaluation, on which the final bits-per-character (BPC) of the RNN LM is 1.167 (character-level perplexity of 2.245).

Random sentences can be generated following the method described in [20]. Briefly, the next label is randomly picked following the probabilities of the current output of the RNN LM and fed back to the RNN in the next step. By iterating these steps, text can be sequentially generated as shown in Figure 1. From the example, it is clear that the RNN LM has learned the linguistic structures as well as the spellings of words that frequently appear.

3. CHARACTER-LEVEL BEAM SEARCH

3.1. Tree-based CTC beam search

Let L be the set of labels without the CTC blank label. The label sequence z is a sequence of labels in L. The length of the label sequence z is less than or equal to the number of input frames. The objective of the beam search decoding is to find the label sequence that has the maximum posterior probability given the input features from time 1 to t generated by the acoustic RNNs, that is,

    z_max = argmax_z P(z | x_{1:t}),    (1)

where x_{1:t} is the input features from time 1 to t.

However, the CTC-trained RNN output has one more blank label. Let L' be the set of labels (or CTC states) with the additional CTC blank label, and let the path π_t^(i) be a sequence of labels in L' from time 1 to t. The length of the path π_t^(i) is the same as t. By the definition of CTC, every π can be reduced to the corresponding z. For example, the π "aab-c--a" corresponds to the z "abca", where "-" is the blank label.

There can be many paths π_t^(i) that reduce to the same z. Let F(·) be the function that maps a path to the corresponding label sequence, that is, F(π_t^(i)) = z; then the posterior probability in (1) becomes

    P(z | x_{1:t}) = Σ_{∀i | F(π_t^(i)) = z} P(π_t^(i) | x_{1:t}).    (2)

Therefore, if two different paths π_t^(j) and π_t^(k) in the decoding network are mapped to the same z, they can be merged by summing their probabilities.

For the beam search, we first represent the lattice with a tree-based structure so that each node has one of the labels in L, as depicted in Figure 2. Then, backtracking from any node generates a unique label sequence z. To deal with CTC state transitions, we need a state-based network that is represented with the CTC states, L'. As shown in Figure 3, this can easily be done by expanding each tree node, whose label is in L, into two CTC states: one with the corresponding label in L', followed by one with the blank CTC label. Since the label-level (L) search network is based on a tree structure, two different state-level (L') paths with different label sequences never meet each other. This simplifies the problem, since there is no interaction between two different sequence labelings (hypotheses) and (2) is the only equation that we should be concerned with.

Fig. 3. CTC state transition between two label nodes. If the two nodes have the same label, then a transition between the same CTC states is not allowed.

As proposed in [8, 11], external language models can be integrated by modifying the posterior probability term in (1) into

    log P(z | x_{1:t}) = log P_CTC(z | x_{1:t}) + α log P_LM(z) + β |z|,    (3)

where α is the LM weight and β is the insertion bonus. This modification can be applied by adding the additional terms with α and β to the log probability of the destination state when a state transition between two different label nodes occurs.

The probability of the next label is computed using the RNN LM when a new active label node is added to the beam search tree. For this, the RNN LM context (hidden activations) is copied from the parent node to the child node, and the RNN LM processes the new label of the child node with the copied context. Therefore, each active node has its own RNN LM context.

3.2. Pruning

Pruning of the search tree is performed by the standard beam search approach. That is, at each frame, only the active nodes with the top N hypotheses and their ancestor nodes remain alive after the pruning with beam width N. However, this standard pruning, or width-pruning, cannot prevent the tree from growing indefinitely, especially when the input speech is very long. This gradually degrades the efficiency of beam search on recent nodes, since more and more hypotheses are wasted maintaining the old part of the lattice that is already out of the context range of the RNN LMs.

To remedy this issue, we propose an additional pruning method called depth-pruning. The procedure is as follows. First, find the M-th ancestor of the node with the best hypothesis, where M is the beam depth. Then, this ancestor node becomes the new root node. The pruning is performed by removing the nodes that are not descendants of the new root node. In this way, the beam can be better utilized for recent hypotheses rather than older ones. Figure 4 shows an example of depth-pruning with a beam depth of 2. Note that the depth of some nodes can be larger than the beam depth. In the following experiments, depth-pruning is performed every 20 frames.

Fig. 4. Example of depth-pruning with a beam depth of 2. The pruning is performed by selecting a new root node so that the new depth of the best hypothesis node becomes the beam depth. The shaded nodes indicate the original active nodes. Also, the path of the best hypothesis is drawn with thick strokes.

4. EXPERIMENTS

The proposed ISR system is evaluated on a single 42-minute speech stream that is formed by concatenating all 333 utterances in the evaluation set, eval92 (WSJ Nov'92 20k evaluation set). We use α = 2.0 and β = 1.5 for the system trained with SI-284, and α = 1.5 and β = 2.0 for the other one trained with SI-ALL.

The effects of beam depth and width on the final WER are examined in Figure 5.

Fig. 5. WER of the proposed online decoding on the evaluation set with respect to the beam depth (WER in percent vs. beam depth in characters). Experiments are conducted with two acoustic RNNs trained on SI-284 and SI-ALL, and beam search is performed with beam widths (BW) of 128 and 512.
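The path reduction F(·) and the path merging of Eq. (2) can be sketched in a few lines. In this illustration (the path strings other than the paper's "aab-c--a" example, and all probability values, are invented), paths over L' are collapsed by removing consecutive repeats and blanks, and the probabilities of paths sharing the same z are summed:

```python
# Sketch of the CTC reduction F(.) and the path merging of Eq. (2).
# Paths over L' (labels plus the blank "-") collapse to label sequences;
# paths mapping to the same z have their probabilities summed.
# The path probabilities below are invented for illustration.

from collections import defaultdict

def F(path, blank="-"):
    """Collapse a CTC path into its label sequence z."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# Three distinct paths that all reduce to the same z = "abca".
paths = {"aab-c--a": 0.10, "a-b-c--a": 0.05, "ab--ca--": 0.02}
posterior = defaultdict(float)
for path, prob in paths.items():
    posterior[F(path)] += prob   # Eq. (2): sum over paths with F(pi) = z

print(F("aab-c--a"))                 # abca
print(round(posterior["abca"], 2))   # 0.17
```

Note that the repeat-collapsing step is why a repeated character (e.g. "LL") must be separated by a blank in the path, which is exactly the same-label transition restriction shown in Figure 3.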
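The combined score of Eq. (3) can be illustrated numerically. The α and β defaults below are the values the experiments use for the SI-284 system; the hypothesis probabilities themselves are invented for illustration:

```python
import math

# Eq. (3): log P = log P_CTC + alpha * log P_LM + beta * |z|.
# alpha/beta defaults follow the paper's SI-284 setting; the CTC and LM
# probabilities in the example are hypothetical.
def combined_log_prob(log_p_ctc, log_p_lm, length, alpha=2.0, beta=1.5):
    return log_p_ctc + alpha * log_p_lm + beta * length

# Two competing 4-character hypotheses with invented scores.
h1 = combined_log_prob(math.log(0.020), math.log(0.30), 4)  # likely per LM
h2 = combined_log_prob(math.log(0.025), math.log(0.05), 4)  # unlikely per LM
print(h1 > h2)  # True: the LM term overturns the raw CTC ranking
```

The insertion bonus β|z| counteracts the tendency of the LM term to favor short hypotheses; here both hypotheses have the same length, so only the CTC and LM terms matter.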
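The depth-pruning procedure of Section 3.2 can be sketched on a toy tree (the tree layout, node labels, and beam depth here are invented; the real decoder also carries per-node hypothesis scores and RNN LM contexts, which are omitted):

```python
# Sketch of depth-pruning (Sec. 3.2) on a toy prefix tree. Each node keeps
# a parent pointer; pruning keeps only the descendants of the M-th ancestor
# of the best hypothesis node, which becomes the new root.

class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent

def ancestor(node, m):
    """Walk up m parent links (stopping early at the root)."""
    for _ in range(m):
        if node.parent is not None:
            node = node.parent
    return node

def depth_prune(nodes, best, m):
    """Return (new_root, surviving nodes). Survivors are descendants of
    the new root, including the root itself."""
    root = ancestor(best, m)
    def survives(n):
        while n is not None:
            if n is root:
                return True
            n = n.parent
        return False
    return root, [n for n in nodes if survives(n)]

# Toy tree: T -> H -> I -> S plus a stale sibling branch T -> A.
t = Node("T"); h = Node("H", t); i = Node("I", h); s = Node("S", i)
a = Node("A", t)
root, alive = depth_prune([t, h, i, s, a], best=s, m=2)
print(root.label)                      # H
print(sorted(n.label for n in alive))  # ['H', 'I', 'S']
```

After pruning, the best node "S" sits exactly at depth 2 (the beam depth) below the new root "H", and the stale branch "A" is discarded, matching the behavior depicted in Figure 4.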
Thegapbetweenthebeamwidthof128and512 100:HE’STHE 150:HE’STHEONLYGU 200:HE’STHEONLYGUYWHOCOULDS 250:HE’STHEONLYGUYWHOCOULDSHOWUPINTHE 300:...INTHEPLAZAI 350:...INTHEPLAZAINROCKR 400:...INTHEPLAZAINDRAWRATEOFSEVE 450:...INTHEPLAZAINDRAWRATEOFSEVENTYFIVETHO 500:...INTHEPLAZAANDDRAWCROWDOFSEVENTYFIVETHOUSANDPEO 550:...INTHEPLAZAANDDRAWCROWDOFSEVENTYFIVETHOUSANDPEOPLES 600:...INTHEPLAZAANDDRAWCROWDOFSEVENTYFIVETHOUSANDPEOPLESAYSONELA 650:...INTHEPLAZAANDDRAWCROWDOFSEVENTYFIVETHOUSANDPEOPLESAYSONELATINDIPLOM 700:...INTHEPLAZAANDDRAWCROWDOFSEVENTYFIVETHOUSANDPEOPLESAYSONELATINDIPLOMAT Groundtruth:HE’STHEONLYGUYWHOCOULDSHOWUPINTHEPLAZAANDDRAW ACROWDOFSEVENTYFIVETHOUSANDPEOPLESAYSONELATINDIPLOMAT Fig.6.ExampleofISRpartialresults.Thebesthypothesisisshownatevery50frames(500ms).Theword“ROCK”iscorrectedto“DRAW” afterhearing“RATE”and“INDRAWRATE”to“ANDDRAWCROWD”whilehearing“PEOPLE”. The proposed ISR system is compared with other end-to-end Table1. CER/ WERinpercent on theevaluation set withonline word-levelspeechrecognitionsystemsinTable2.Theothersystems depth-pruning and offlinesentence-wise decoding. The error rates perform sentence-wise offline decoding with bidirectional RNNs. arereportedwithtwoacoustic RNNstrainedon SI-284(71 hrs) ThebestresultwasachievedbyMiaoetal. [9]withaCTC-trained andSI-ALL(167hrs). deepbidirectionalLSTMnetworkandaretrainedtrigramLMwith Method Beamwidth SI-284 SI-ALL extendedvocabulary. Thesystemswiththeoriginaltrigrammodel providedwiththeWSJcorpusperformworsethanourISRsystem Online(noLM) 512 10.96/38.37 9.66/35.44 with character-level RNN LM. On the other hand, our system is Online 128 4.25/9.87 3.56/8.56 beaten bytheother ones withextended trigrammodels. However, Online 512 3.80/8.90 3.39/8.06 moreprecisecomparisonofthedecodingstagesshouldbedoneby Sentence-wise 128 4.46/10.30 3.63/8.84 employingthesameCTCmodel. Sentence-wise 512 4.04/9.45 3.38/8.28 Figure 6 shows theincremental speech recognition result with theproposedISRsystem. 
Thebesthypothesisisreportedevery50 frames(500ms).Itisshownthatthepastbestresultcanbecorrected Table2. Comparison of WERswithother end-to-end speech rec- bymakinguseoftheadditionalspeechinput.Forexample,theword ognizersintheliterature. Forreference, WERsof phoneme based “ROCK”ischangedto“DRAW”intheframe450bylisteningthe GMM/DNN-HMM systems are also reported. All systems are word “RATE”.Moreover, thecorrection of “IN DRAW RATE”to trainedwithSI-284andevaluatedoneval92. “ANDDRAWCROWD”duringhearingtheword“PEOPLE”inthe System Model WER frame 500 is a good evidence that long term context can also be ProposedISR Uni.CTC+Char.RNNLM 8.90% considered. GravesandJaitly[6] CTC+Trigram(extended) 8.7% Miaoetal.[9] CTC+Trigram(extended) 7.34% Miaoetal.[9] CTC+Trigram 9.07% 5. CONCLUDINGREMARKS Hannunetal.[8] CTC+Bigram 14.1% Bahdanauetal.[10] Encoder-decoder+Trigram 11.3% Acharacter-levelincrementalspeechrecognizerisproposedandan- Woodlandetal.[21] GMM-HMM+Trigram 9.46% alyzedthroughoutthepaper.TheproposedsystemcombinesaCTC- Miaoetal.[9] DNN-HMM+Trigram 7.14% trained RNN with a character-level RNN LM through tree-based beam search decoding. For online decoding with very long input speech, depth-pruning is proposed to prevent indefinite growth of is roughly 0.5% to 1% WER. However, there was littledifference the search tree. When the proposed model is trained with WSJ whenthebeamwidthincreasesfrom512to2048inourpreliminary SI-284,8.90%WERcanbeachievedontheverylongspeechthat experiments. The best performing beam depths are 50 and 30 for isformedbyconcatenatingallutterancesintheWSJeval92eval- the SI-284 and SI-ALL systems, respectively. This means the uation set. The incremental recognition result shows the evidence SI-ALL system can recognize speech more immediately than the thatcharacter-level RNNLMcanlearndependencies betweentwo SI-284system. 
Weconsiderthisisbecausetheacousticmodelof words even when they are five words apart, which are hard to be theSI-ALLsystemcanembedstrongerlanguagemodelduetoin- caughtusingconventionaln-gramback-offlanguagemodels. creasedtrainingdata,andcanmakedecisionmorepreciselywithout Notethattheproposedsystemonlyrequiresspeechandtextcor- relyingontheexternal language model much. Thecharacter error pusfortraining. Externallexiconorsenonemodelingisnotneeded rate(CER)andWERarereportedinTable1withtheoptimalbeam fortraining,whichisahugeadvantage.Moreover,itisexpectedthat depths.Forcomparison,wealsoreportsentence-wiseofflinedecod- OOVwordsorinfrequentwordssuchasnamesofplacesorpeople ingresultswithoutdepth-pruning. canbedictatedastheyarepronounced. 6. REFERENCES [15] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” [1] EthanOSelfridge, IkerArizmendi, PeterAHeeman, andJa- inAutomaticSpeechRecognitionandUnderstanding(ASRU), sonDWilliams,“Stabilityandaccuracyinincrementalspeech 2013IEEEWorkshopon.IEEE,2013,pp.273–278. recognition,”inProceedingsoftheSIGDIAL2011Conference. [16] KyuyeonHwangandWonyongSung, “Onlinesequencetrain- AssociationforComputationalLinguistics,2011,pp.110–119. ing of recurrent neural networks withconnectionist temporal [2] Ian McGraw and Alexander Gruenstein, “Estimating word- classification,” arXivpreprintarXiv:1511.06841,2015. stabilityduringincrementalspeechrecognition,”Training,vol. [17] Kyuyeon Hwang and Wonyong Sung, “Single stream paral- 17,no.27,327,pp.6–4,2011. lelization of generalized LSTM-like RNNs on a GPU,” in [3] GerhardSagerer,HeikeRautenstrauch,GernotAFink,Bernd Acoustics, Speech and Signal Processing (ICASSP), 2015 Hildebrandt,AJusek,andFranzKummert, “Incrementalgen- IEEE International Conference on. IEEE, 2015, pp. 1047– erationofwordgraphs.,” inICSLP.Citeseer,1996. 1051. 
[4] Mirjam Killer, Sebastian Stu¨ker, and Tanja Schultz, [18] Toma´sˇ Mikolov, Stefan Kombrink, Luka´sˇ Burget, Jan Honza “Grapheme based speech recognition.,” in INTERSPEECH, Cˇernocky`, andSanjeevKhudanpur, “Extensionsof recurrent 2003. neural network language model,” in Acoustics, Speech and SignalProcessing(ICASSP),2011IEEEInternationalConfer- [5] Maximilian Bisani and Hermann Ney, “Open vocabulary enceon.IEEE,2011,pp.5528–5531. speech recognition with flat hybrid models.,” in INTER- SPEECH,2005,pp.725–728. [19] Matthew D Zeiler, “ADADELTA: An adaptive learning rate method,” arXivpreprintarXiv:1212.5701,2012. [6] AlexGravesandNavdeepJaitly, “Towardsend-to-endspeech recognition with recurrent neural networks,” in Proceedings [20] IlyaSutskever,JamesMartens,andGeoffreyEHinton, “Gen- of the 31st International Conference on Machine Learning erating text with recurrent neural networks,” in Proceedings (ICML-14),2014,pp.1764–1772. of the 28th International Conference on Machine Learning (ICML-11),2011,pp.1017–1024. [7] AwniHannun,CarlCase,JaredCasper,BryanCatanzaro,Greg Diamos,ErichElsen,RyanPrenger,SanjeevSatheesh,Shubho [21] Phillip C Woodland, Julian J Odell, Valtcho Valtchev, and Sengupta,AdamCoates,etal., “DeepSpeech: Scalingupend- SteveJ Young, “Largevocabulary continuous speech recog- to-endspeech recognition,” arXivpreprint arXiv:1412.5567, nitionusingHTK,” inAcoustics,Speech,andSignalProcess- 2014. ing, 1994. ICASSP-94., 1994 IEEEInternational Conference on.IEEE,1994,vol.2,pp.II–125. [8] Awni Y Hannun, Andrew L Maas, Daniel Jurafsky, and An- drew Y Ng, “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs,” arXiv preprintarXiv:1408.2873,2014. [9] Yajie Miao, Mohammad Gowayyed, and Florian Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” arXiv preprint arXiv:1507.08240,2015. 
[10] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Phile- mon Brakel, and Yoshua Bengio, “End-to-end attention- based large vocabulary speech recognition,” arXiv preprint arXiv:1508.04395,2015. [11] AndrewLMaas,ZiangXie,DanJurafsky,andAndrewYNg, “Lexicon-free conversational speech recognition with neural networks,” in NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Compu- tationalLinguistics: HumanLanguage Technologies, Denver, Colorado,USA,May31-June5,2015,2015,pp.345–354. [12] AlexGraves,SantiagoFerna´ndez,FaustinoGomez,andJu¨rgen Schmidhuber, “Connectionist temporal classification: la- bellingunsegmentedsequencedatawithrecurrentneuralnet- works,” in Proceedings of the 23rd international conference onMachinelearning.ACM,2006,pp.369–376. [13] DouglasBPaulandJanetMBaker, “ThedesignfortheWall StreetJournal-basedCSRcorpus,”inProceedingsofthework- shoponSpeechandNaturalLanguage.AssociationforCom- putationalLinguistics,1992,pp.357–362. [14] SeppHochreiter and Ju¨rgen Schmidhuber, “Long short-term memory,” Neuralcomputation, vol.9,no.8,pp.1735–1780, 1997.