Evaluating the Performance of a Speech Recognition based System

Vinod Kumar Pandey, Sunil Kumar Kopparapu
TCS Innovation Labs - Mumbai, Tata Consultancy Services,
Pokharan Road 2, Thane 400 601, Maharashtra, India.
{vinod.pande, sunilkumar.kopparapu}@tcs.com

arXiv:1601.02543v1 [cs.CL] 11 Jan 2016

Abstract. Speech based solutions have taken center stage with growth in the services industry, where there is a need to cater to a very large number of people from all strata of society. While natural language speech interfaces are the talk of the research community, in practice menu based speech solutions thrive. Typically, in a menu based speech solution the user is required to respond by speaking from a closed set of words when prompted by the system. A sequence of human speech responses to the IVR prompts results in the completion of a transaction. A transaction is deemed successful if the speech solution correctly recognizes all the utterances spoken by the user whenever prompted by the system. The usual mechanism to evaluate the performance of a speech solution is to conduct an extensive test of the system by putting it to actual use and then evaluating the performance by analyzing the logs for successful transactions. This kind of evaluation can lead to dissatisfied test users, especially if the performance of the system results in a poor transaction completion rate. To negate this, the Wizard of Oz approach is adopted during evaluation of a speech system. Overall, this kind of evaluation is an expensive proposition, both in terms of time and cost. In this paper, we propose a method to evaluate the performance of a speech solution without actually putting it to use by people. We first describe the methodology and then show experimentally that it can be used to identify the performance bottlenecks of a speech solution even before the system is actually used, thus saving evaluation time and expense.
Keywords: Speech solution evaluation, Speech recognition, Pre-launch recognition performance measure.

1 Introduction

There are several commercial menu based ASR systems available around the world for a significant number of languages, and interestingly, speech solutions based on these ASR systems are being used with good success in the Western part of the globe [5], [3], [6], [2]. Typically, a menu based ASR system restricts the user to speaking from a pre-defined closed set of words for enabling a transaction. Before commercial deployment of a speech solution it is imperative to have a quantitative measure of the performance of the speech solution, which is primarily based on the speech recognition accuracy of the speech engine used. Generally, the recognition performance of any speech recognition based solution is quantitatively evaluated by putting it to actual use by the people who are the intended users and then analyzing the logs to identify successful and unsuccessful transactions. This evaluation is then used to identify any further improvements in the speech recognition based solution to better the overall transaction completion rate. This process of evaluation is both time consuming and expensive. For evaluation one needs to identify a set of users, identify the set of actual usage situations, and perform the test. It is also important that the set of users are able to use the system with ease, meaning that even in the test conditions the performance of the system should be good. While this cannot usually be guaranteed, this aspect of keeping the user experience good makes it necessary to employ a Wizard of Oz (WoZ) approach.
Typically this requires a human agent in the loop during the actual speech transaction, where the human agent corrects any mis-recognition by actually listening to the conversation between the human user and the machine, without the user knowing that there is a human agent in the loop. The use of WoZ is another expense in testing a speech solution. All this makes testing a speech solution an expensive and time consuming procedure. In this paper, we describe a method to evaluate the performance of a speech solution without actual people using the system, as is usually done. We then show how this method was adopted to evaluate a speech recognition based solution as a case study. This is the main contribution of the paper. The rest of the paper is organized as follows. The method for evaluation without testing is described in Section 2. In Section 3 we present a case study and conclude in Section 4.

2 Evaluation without Testing

Fig. 1 shows the schematic of a typical menu based speech solution having 3 nodes. At each node there is a set of words that the user is expected to speak and the system is supposed to recognize. In this particular schematic, at the entry node the user can speak any of the n words, namely W_1 or W_2 or ... or W_n; n is usually called the perplexity of the node in the speech literature. The larger the n, the greater the perplexity and the higher the confusion, and hence the lower the recognition accuracy. In most commercial speech solutions the perplexity is kept very low, typically a couple of words. Once the word at the entry node has been recognized (say word W_k has been recognized), the system moves on to the second node, where the active list of words to be recognized could be one of W_k1, W_k2, W_k3, ..., W_kp if the perplexity at the kth node is p. This is carried on to the third node. A transaction is termed successful if and only if the recognition at each of the three nodes is correct. For example, typically in a banking speech solution the entry node could expect someone to speak among /credit card/*, /savings account/, /current account/, /loan product/, /demat/, and /mutual fund transfer/, which has a perplexity of 6.

Fig. 1. Schematic of a typical menu based ASR system (W_n is a spoken word).

Once a person speaks, say, /savings account/ and it is recognized correctly by the system, at the second node it could be /account balance/ or /cheque/ or /last 5 transactions/ (perplexity 3), and at the third node (say, on recognition of /cheque/) it could be /new cheque book request/, /cheque status/, and /stop cheque request/ (perplexity 3).

* We will use / / to indicate the spoken word. For example, /W_1/ represents the spoken equivalent of the written word W_1.

Note 1. Though we will not dwell on this, it is important to note that an error in recognition at the entry node is more expensive than a recognition error at a lower node.

Based on the call flow and the domain, the system can have several nodes for completion of a transaction. Typical menu based speech solutions strive for 3 - 5 levels of nodes to keep the system usable. In any speech based solution (see Fig. 2), the spoken utterance is first hypothesized into a sequence of phonemes using the acoustic models. Since the phoneme recognition accuracy is low, instead of choosing one phoneme the system identifies the l-best (typically l = 3) matching phonemes. This phone lattice is then matched with all the expected words (language model) at that node to find the best match. For a node with perplexity n, the constructed phoneme lattice of the spoken utterance is compared with the phoneme sequence representation of all the n words (through the lexicon, which is one of the key components of a speech recognition system).

Fig. 2. A typical speech recognition system (acoustic models, lexicon and language model feeding the recognizer, which produces text output). In a menu based system the language model is typically the set of words that need to be recognized at a given node.
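The matching step described above can be sketched in code. The sketch below is a simplification, not the engine's actual algorithm: it compares a single hypothesized phoneme sequence (rather than an l-best phone lattice) against the lexicon entry of each active word using a plain unit-cost Levenshtein distance, and declares the closest word recognized. The two lexicon entries are the Sahi/Galat phoneme strings from the case study later in the paper.

```python
# Sketch of the node-level matching step: a hypothesized phoneme
# sequence is compared against the lexicon entry of every active word
# and the closest word is declared recognized. Simplification: a real
# engine scores an l-best phone lattice, not a single sequence.

def levenshtein(a, b):
    """Unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def recognize(hypothesis, lexicon):
    """Return the active word whose phoneme representation is closest
    to the hypothesized phoneme sequence."""
    return min(lexicon, key=lambda word: levenshtein(hypothesis, lexicon[word]))

# Lexicon entries for /Sahi/ and /Galat/ (phoneme strings from the case study).
lexicon = {
    "sahi":  "S AA HH I".split(),
    "galat": "G L AX tT".split(),
}

# A noisy hypothesis for /sahi/ with the last phoneme mis-recognized:
print(recognize(["S", "AA", "HH", "II"], lexicon))  # -> sahi
```

In a deployed system the per-word score would be a likelihood from the decoder rather than an edit distance, but the decision rule (closest active word wins) is the same.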
The hypothesized phone lattice is declared to be one of the n words depending on the closeness of the phoneme lattice to the phoneme representation of the n words.

We hypothesize that we can identify the performance of a menu based speech system by identifying the possible confusion among all the words that are active at a given node.

Note 2. If active words at a given node are phonetically similar, it becomes difficult for the speech recognition system to distinguish them, which in turn leads to recognition errors.

We used the Levenshtein distance [1], [4], a well known measure, to analyze and identify the confusion among the active words at a given node. This analysis gives a list of all the sets of words that have a high degree of confusability among them; this understanding can then be used to (a) restructure the set of active words at that node and/or (b) train the words that can be confused using a larger corpus of speech data. This equips the speech recognition engine to better distinguish the confusable words. This analysis was actually carried out for a speech solution developed for the Indian Railway Inquiry System, to identify bottlenecks in the system before its actual launch.

3 Case Study

A schematic of a speech based railway information system, developed for the Hindi language, is shown in Fig. 3. The system enables the user to get information on five different services, namely, (a) Arrival of a given train at a given station, (b) Departure of a given train at a given station, (c) Ticket availability on a given date in a given train between two stations and class, (d) Fare in a given class in a given train between two stations, and (e) PNR status.

Fig. 3. Call flow of the Indian Railway Inquiry System (W_n is a spoken word).

At the first recognition node (node-1), there are one or more active words corresponding to each of these services. For example, for selecting the service Fare, the user can speak any of /kiraya jankari/, /kiraya/, /fare/.
Similarly, for selecting the service Ticket availability, the user can speak /upalabdhata jankari/ or /ticket availability/ or /upalabdhata/.

Note 3. Generally the perplexity at a node is greater than or equal to the number of words that need to be recognized at that node.

In this manner each of the services can have multiple words or phrases that mean the same thing, and the speaker can utter any of these words to refer to that service. The number of possible different ways in which a service can be called (d_i), summed over all 5 services, gives the perplexity N at that node, namely,

    N = Σ_{i=1}^{5} d_i.    (1)

The speech recognition engine matches the phoneme lattice of the spoken utterance with all the N words which are active. The active word (one among the N words) with the highest likelihood score is the recognized word. In order to avoid low likelihood recognitions, a threshold is set so that even the best likelihood word is returned only if its likelihood score is greater than the predefined threshold. Completion of a service requires recognitions at several nodes, with a different perplexity at each node. Clearly, depending on the type of service that the user wants to use, the user has to go through a different number of recognition nodes. For example, to complete the Arrival service the user is required to pass through 3 recognition nodes, namely (a) selection of a service, (b) selection of a train name, and (c) selection of the railway station. While the perplexity (the words that are active) at the service selection node is fixed, the perplexity at the station selection node depends on the selection of the train name at an earlier node. For example, if the selected train stops at 23 stations, then the perplexity at the station selection node will be ≥ 23. For the confusability analysis at each node, we have used the Levenshtein distance [4], or the edit distance as it is known in the computer science literature.
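Eq. (1) can be made concrete with the active words of node 1 (listed in Table 1). The grouping of the 14 active words into the 5 services below is an assumption for illustration (in particular, the ambiguous entry "arrival departure" is counted under Arrival here); the perplexity is simply the sum of the synonym counts.

```python
# Eq. (1) in code: the perplexity N at the service-selection node is the
# sum over the 5 services of d_i, the number of ways service i can be
# invoked. Grouping of Table 1's active words by service is assumed for
# illustration ("arrival departure" is counted under Arrival).
services = {
    "fare":                ["kiraya jankari", "kiraya", "fare"],
    "arrival":             ["aagaman jankari", "aagaman", "arrival",
                            "arrival departure"],
    "ticket availability": ["upalabdhata jankari", "ticket availability",
                            "upalabdhata"],
    "departure":           ["prasthan", "departure"],
    "pnr status":          ["p n r jankari", "p n r"],
}

N = sum(len(synonyms) for synonyms in services.values())
print(N)  # -> 14, the number of active words at node 1
```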
We found that the utterances /Sahi/ and /Galat/ have 100% recognition. The word Sahi is represented in the lexicon by the string of phonemes S AA HH I, and the word Galat is represented by the phoneme sequence G L AX tT. We computed the edit distance between these two words, Sahi and Galat, and used that distance measure as the threshold (say T) that is able to differentiate any two words. So if the distance between any two active words at a given recognition node is lower than the threshold T, there is a greater chance that those two active words could get confused (one word could be recognized as the other, which is within a distance of T). There are ways in which these possible mis-recognitions can be avoided. The easiest way is to make sure that these two words are not both active at a given recognition node.

Table 1. List of active words at node 1.

W. No.  Active Word           Phonetic
W_1     kiraya jankari        K I R AA Y AA J AA tN K AA R I
W_2     kiraya                K I R AA Y AA
W_3     fare                  F AY R
W_4     aagaman jankari       AA G AX M AX tN J AA tN K AA R I
W_5     aagaman               AA G AX M AX tN
W_6     arrival departure     AX R AA I V AX L dD I P AA R CH AX R
W_7     upalabdhata jankari   U P AX L AX B tDH AX tT AA J AA tN K AA R I
W_8     ticket availability   tT I K EY tT AX V AY L AX B I L I tT Y
W_9     upalabdhata           U P AX L AX B tD AX tTH AA
W_10    arrival               AX R AA I V AX L
W_11    prasthan              P R AX S tTH AA tN
W_12    departure             dD I P AA R CH AX R
W_13    p n r jankari         P I EY tN AA R J AA tN K AA R I
W_14    p n r                 P I AX tN AA R

Table 1 shows the list of active words at node 1 when the speech application was initially designed, and Table 2 shows the edit distances between all the active words at the service node given in Fig. 3. The distance between the words Sahi and Galat was found to be 5.7, which was set as the threshold, namely T = 5.7. This threshold value was used to identify confusable active words. Clearly, as seen in the Table, the distance between the word pairs (fare, pnr) and (pnr, prasthan) is 5.8 and 5.2 respectively, which is very close to the threshold value of 5.7.
This means there is a high possibility that /fare/ may get recognized as pnr and vice-versa. One can conclude from the analysis of the active words that fare and pnr cannot coexist as active words at the same node.

Table 2. Edit distances between the active words at node 1 of the Railway Inquiry System.

      W_1  W_2  W_3  W_4  W_5  W_6  W_7  W_8  W_9  W_10 W_11 W_12 W_13 W_14
W_1   0    8.4  14.2 8.5  14.1 15.3 11   20   17.1 13   13   13.2 6.2  11.2
W_2   8.4  0    7.2  13.8 8.5  14.7 17.8 15.7 11   7.2  7.8  7.7  11.2 6.2
W_3   14.2 7.2  0    14.2 7.2  14.8 18.2 15.8 11.2 8.2  8.2  7.8  14.2 5.8
W_4   8.5  13.8 14.2 0    8.4  15.9 9.6  19.7 14.3 13   13   14.7 8.2  11.2
W_5   14.1 8.5  7.2  8.4  0    13.2 16.7 15.7 9.7  7.2  6.7  8.2  14.1 7.1
W_6   15.3 14.7 14.8 15.9 13.2 0    18.7 18.9 15.5 9.4  13.7 7    17.3 11.8
W_7   11   17.8 18.2 9.7  16.7 18.7 0    20   11.2 17.1 15.6 17.8 10.2 13.8
W_8   20   15.7 15.8 19.7 15.7 18.9 20   0    14.5 15.8 17.5 16.5 18.6 15.7
W_9   17.1 11   11.2 14.3 9.7  15.5 11.2 14.5 0    10   9.2  11.1 16.3 9.7
W_10  13   7.2  8.2  13.1 7.2  9.4  17.1 15.8 10   0    8.5  8    13   7.8
W_11  13.1 7.8  8.2  13.1 6.7  13.7 15.7 17.5 9.2  8.5  0    8.4  11.7 5.2
W_12  13.2 7.7  7.8  14.7 8.2  7    17.8 16.5 11.1 8.1  8.4  0    12.1 6.8
W_13  6.2  11.2 14.2 8.2  14.1 17.3 10.2 18.6 16.3 13.1 11.7 12.1 0    9.8
W_14  11.2 6.2  5.8  11.2 7.1  11.8 13.8 15.7 9.6  7.8  5.2  6.8  9.8  0

The result of the analysis was to remove the active words fare and pnr at that node. When the speech system was actually tested by giving it speech samples, 17 out of 20 instances of /p n r/ were recognized as fare and vice-versa. Similarly, 19 out of 20 instances of /p n r/ were misrecognized as prasthan and vice versa. This confusion is expected, as can be seen from the edit distance analysis of the active words in Table 2. The modified active word list (with fare and pnr removed) increased the recognition accuracy at the service node (Fig. 3) by as much as 90%.
A similar analysis was carried out at the other recognition nodes, and the active word lists were suitably modified to avoid possible confusion between active word pairs. This analysis and modification of the list of active words at each node resulted in a significant improvement in the transaction completion rate. We will present more experimental results in the final paper.

4 Conclusion

In this paper we proposed a methodology to identify words that could lead to confusion at any given node of a speech recognition based system. We used the edit distance as the metric to identify possible confusion between the active words. We showed that this metric can be used effectively to enhance the performance of a speech solution without actually putting it to a test with people. There is a significant saving, in terms of being able to identify recognition bottlenecks in a menu based speech solution through this analysis, because it does not require actual people testing the system. This methodology was adopted to restructure the set of active words at each node for better speech recognition in an actual menu based speech recognition system that caters to the masses.

References

1. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (2001)
2. Kim, C., Stern, R.: Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4574-4577 (2010)
3. Lu, X., Matsuda, S., Unoki, M., Nakamura, S.: Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition. Speech Communication 52, 1-11 (2010)
4. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33, 31-88 (2001)
5. Sun, Y., Gemmeke, J., Cranen, B., Bosch, L., Boves, L.: Using a DBN to integrate sparse classification and GMM-based ASR. Proceedings of Interspeech 2010 (2010)
6. Zhao, Y., Juang, B.: A comparative study of noise estimation algorithms for VTS-based robust speech recognition. Proceedings of Interspeech 2010 (2010)
