ebook img

DTIC ADA460991: On Combining Language Models: Oracle Approach PDF

5 Pages·0.09 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC ADA460991: On Combining Language Models: Oracle Approach

On Combining Language Models : Oracle Approach (cid:3) Kadri Hacioglu and Wayne Ward Center for Spoken Language Research University of Colorado at Boulder hacioglu,whw @cslr.colorado.edu f g ABSTRACT report a significant perplexity improvement with a moderate in- creaseinword/semanticaccuracy, at -bestlist(rescoring)level, In this paper, we address the problem of combining several lan- usingadialog-contextdependent,semNanticallymotivatedgrammar guage models (LMs). Wefind thatsimple interpolation methods, basedlanguagemodel. like log-linear and linear interpolation, improve the performance Statisticallanguagemodelingisa”learningfromdata”problem. butfallshortoftheperformanceofanoracle.Theoracleknowsthe Thegenericstepstobefollowedforlanguagemodelingare referencewordstringandselectsthewordstringwiththebestper- formance(typically,wordorsemanticerrorrate)fromalistofword preparationoftrainingdata strings, whereeachwordstringhasbeenobtained byusing adif- (cid:15) ferentLM.Actually,theoracleactslikeadynamiccombinerwith selectionofamodeltype harddecisionsusingthereference.Weprovideexperimentalresults (cid:15) specificationofthemodelstructure thatclearlyshowtheneedforadynamiclanguagemodelcombina- (cid:15) tiontoimprovetheperformancefurther.Wesuggestamethodthat estimationofmodelparameters mimicsthebehavioroftheoracleusinganeuralnetworkorade- (cid:15) cisiontree. ThemethodamountstotaggingLMswithconfidence Thetrainingdatashouldconsistoflargeamountsoftext,which measuresandpickingthebesthypothesiscorrespondingtotheLM is hardly satisfied in new applications. In those cases, complex withthebestconfidence. modelsfittothetrainingdata. Ontheotherhand, simplemodels cannotcapturetheactualstructure. IntheBayes’(sequence)de- 1. INTRODUCTION cisionframeworkofspeechrecognition/understanding weheavily constrainthemodelstructuretocomeupwithatractableandprac- Statisticallanguagemodels(LMs)areessentialinspeechrecog- ticalLM.Forinstance,inaclass/word -gramLMthedependency nitionand understanding systemsforhighword andsemantic ac- ofawordisoftenrestrictedtotheclasnsthatitbelongsandthede- curacy,nottomentionrobustnessandportability.Severallanguage pendencyofaclassislimitedto -1previousclasses. Theestima- modelshavebeenproposedandstudiedduringthepasttwodecades tionofthemodelparameters,whnicharecommonlytheprobabili- [8]. Althoughithasturnedouttobearatherdifficulttasktobeat ties,isanotherimportantissueinlanguagemodeling.Besidesdata the (almost) standard class/word n-grams (typically or ), sparseness,theestimationalgorithms(e.g.EMalgorithm)mightbe therehasbeenagreatdealofinterestingrammarbansed=la2ngua3ge responsiblefortheestimatedprobabilitiestobefarfromoptimal. models[1].Apromisingapproachforlimiteddomainapplications Theaforementioned problemsoflearninghavedifferenteffects istheuseofsemanticallymotivatedphraselevelstochasticcontext ondifferentLMtypes.Therefore,itiswisetodesignLMsbasedon freegrammars(SCFGs)toparseasentenceintoasequenceofse- differentparadigmsandcombinetheminsomeoptimalsense.The mantictagswhicharefurthermodeledusing -grams[2,9,10,3]. simplest combination method is the so called linear interpolation ThemainmotivationbehindthegrammarbasnedLMsistheinabil- [4]. Recently, the linear interpolation in the logarithmic domain ityof -gramstomodellonger-distanceconstraintsinalanguage. hasbeeninvestigatedin[6].Perplexityresultsonacoupleoftasks Withtnheadvent offairlyfastcomputers and efficientparsing and haveshownthatthelog-linearinterpolationisbetterthanthelinear searchschemesseveralresearchershavefocusedonincorporating interpolation. Theoretically, a far more powerful method for LM relatively complex language models into speech recognition and combination is the maximum entropy approach [7]. However, it understanding systemsatdifferentlevels. Forexample,in[3],we has not been widely used in practice, since it is computationally demanding. (cid:3)TheworkissupportedbyDARPAthroughSPAWARundergrant Inthisresearch,weconsidertwoLMs: #N66001-00-2-8906. class-based3-gramLM(baseline). (cid:15) dialogdependentsemanticgrammarbased3-gramLM[3]. (cid:15) AfterN-bestlistrescoringexperimentswithlinearandlog-linear interpolation, we realized that the performance in terms of word andsemanticaccuraciesfallconsiderablyshortoftheperformance ofanoracle. Weexplaintheset-upfortheoracleexperimentand point out that the oracle is a dynamic LM combiner. To fill the performancegap,wesuggestamethodthatcanmimictheoracle. Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED 2001 2. REPORT TYPE 00-00-2001 to 00-00-2001 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER On Combining Language Models: Oracle Approach 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION University of Colorado,Center for Spoken Language REPORT NUMBER Research,Boulder,CO,80303 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE 4 unclassified unclassified unclassified Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18 s IWANTTOFLYFROM MIAMIFLORIDATOSYDNEYAUS- dialog goal dialog context T<R>ALIAONOCTOBERFIFTH /s s [iwant][depart loc][arrive<loc>][date] /s G S < > < > s I DON’T TO FLY FROM MIAMI FLORIDA TO SYDNEY concept A<F>TERAREAONOCTOBERFIFTH /s generator s [Pronoun][Contraction][depart lo<c][>arrive loc][after][Noun][date] </s> C < > word Figure2:Examplesofparsingintoconceptsandfillerclasses sequence generator W The concept model is conditioned on the dialog context. Al- phoneme thoughthereareseveralwaystodefineadialogcontext,weselect sequence thelastquestionpromptedbythesystemasthedialogcontext.Itis generator simpleandyetstronglypredictiveandconstraining. P Theconceptsareclassesofphraseswiththesamemeaning. Put waveform differently,aconceptclassisasetofallphrasesthatmaybeused generator toexpressthatconcept(e.g. [i want],[arrive loc]). Thoseconcept classesareaugmentedwithsingleword,multiplewordandasmall A numberofbroad(andunambigious)partofspeech(POS)classes. speech Incaseswheretheparserfails,webreakthephraseintoasequence ofwordsandtagthemusingthissetof”filler”classes. Twoexam- Figure1:Aspeechproductionmodel plesinFigure2clearlyillustratethescheme. Thestructureoftheconceptsequencesiscapturedbyan -gram LM.Wetrain aseperate language model for each dialog cnontext. Giventhecontext and ,theconcept se- The paper is organized as follows. Section 2 presents the lan- quenceprobabilitieSsarecaClcu=latce0dc1as(cid:1)(cid:1)((cid:1)focrK;cK+)1 guage models considered in this study. In Section 3, we briefly n=3 explain combining of LMs using linear and log-linear interpola- tion. Section4explainsthesetupfortheoracleexperiment. Ex- perimentalresultsarereportedinSection5. Thefutureworkand P(C=S)=P(c1=<s>;S)P(c2=<s>;c1;S) conclusionsaregiveninthelastsection. K+1 2. LANGUAGEMODELS Y P(ck=ck(cid:0)2;ck(cid:0)1;S) k=3 Inlanguagemodeling,thegoalistofindtheprobabilitydistribu- where and are for the sentence-begin and sentence-end tionofwordsequences,i.e. ,where . symbolcs0,respeccKtiv+e1ly. We first describe a model foPr(sWen)tence genWera=tiown1i;nwa2:d(cid:1)ia(cid:1)l(cid:1)o;gw[L5] Each concept class is written as a CFG and compiled into a on which our grammar LM is based. The model is illustrated in stochasticrecursivetransitionnetwork(SRTN).Theproductionrules Figure 1. Here, the user has a specific goal that does not change define complete paths beginning from the start-node through the throughout the dialog. According tothegoal and thedialog con- end-node in these nets. The probability of a complete path tra- texttheuserfirstpicksasetofconceptswithrespectivevaluesand versedthroughoneormoreSRTNsinitiatedbythetop-levelSRTN thenusephrasegeneratorsassociatedwithconceptstogeneratethe associated with the concept is the probability of the phrase given wordsequence.Thewordsequenceisnextmappedintoasequence thatconcept. Thisprobabilityiscalculatedasthemultiplicationof ofphonesandconvertedintoaspeechsignalbytheuser’svocalap- allarcprobabilitiesthatdefinesthepath.Thatis, paratuswhichwefinallyobserveasasequenceofacousticfeature vectors. K Assumingthat P(W=C) = Qi=1P(si=ci) K Mi thedialogcontext isgiven, = Qi=1Qj=1P(rj=ci) (cid:15) S where isasubstringin ( isindependentof buttheconceptsequence ,i.e. )andsi arethWepr=odwuc1t;iown2:r:uwleLst=hats1c;o:n:sst2r;uscKt .KTh(cid:20)e (cid:15) W S , C cLonceptra1n;rd2r;u:l:e:rsMeqiuencesareassumedtobeuniqueinthesiabove P(W=C;S)=P(W=C) equations. Theparserusesheuristicstocomplywiththisassump- (W,C)pairisunique(possiblewitheitherViterbiapproxima- tion. (cid:15) tionorunambigiousassociationbetweenCandW), SCFG and n-gram probabilities are learned from a text corpus onecaneasilyshowthat isgivenby bysimplecountingandsmoothing.Oursemanticgrammarshavea P(W) lowdegreeofambiguityandthereforedonotrequirecomputation- allyintensivestochastictrainingandparsingtechniques. (1) The class based LM can be considered as a very special case P(W)=P(W=C)P(C=S) ofourgrammarbasedmodel. Concepts(orclasses)arerestricted In(1)weidentifytwomodels: to those that represent a list of semantically similar words, like Conceptmodel: [city name],[day of week],[month day]andsoforth.So,instead (cid:15) P(C=S) ofruleprobabilitieswehavegiventheclassthewordprobabilities, Syntacticmodel: .Forsimplicity,eachwordbelongstoatmostoneclass. (cid:15) P(W=C) P(wi=cj) Wc reference class based f c LM Ic class based best W W N-best list LM c neural network select max f g N-best list oracle best W grammar based Ig grammar based LM W LM g Wg f g f c I S S dialog context dialog context training data Figure3:Thesetupfororacleexperiments Figure 4: The LMcombining system based on the oracle ap- proach. 3. LINEAR AND LOG-LINEAR INTERPO- besthypothesisafterrescoring. Theoraclecompareseachhypoth- LATION esistothereferenceandpicktheonewiththebestword(orseman- AssumingthatwehaveMlanguagemodels, , tic)accuracy. the combined LM obtained using the linear inPtie(rWpol)a;tiio=n1(a;t2s;e(cid:1)n(cid:1)-(cid:1) ;M Fortrainingpurposes,wecreatetheinputfeaturevectorbyaug- tencelevel)isgivenby mentingfeaturesfromeachrescoringmodule( )andthedia- logcontext( ). TheoutputvectoristheLMinfdgi;cfactor fromthe M (2) oracle. TheeSlementthatcorrespondstotheLMwiththeIbestfinal P(W)=X(cid:21)iPi(W) hypothesisisunityandtherestarezeros. Aftertrainingtheoracle i=1 combiner (here, weassume a neural network), we setour system where arepositiveinterpolationweightsthatsumuptounity. asshowninFigure4. Theinputtotheneuralnetwork(NN)isthe The(cid:21)loig-linear interpolation suggests an LM, again at sentence augmentedfeaturevector. TheoutputoftheNNistheLMindica- level,givenby torprobably withfuzzy values. So, wefirstpickthemax output, andthen,weselectandoutputtherespectivewordstring. M (3) P(W)= 1 YPi(W)(cid:21)i 5. EXPERIMENTAL RESULTS Z((cid:21))i=1 ThemodelsweredevelopedandtestedinthecontextoftheCU where isthenormalization factorand itisafunction ofthe Communicator dialog system which is used for telephone-based interpoZla(ti(cid:21)o)nweights. Thelinearityinlogarithmicdomainisobvi- flight, hotelandrentalcarreservations [11]. Thetextcorpus was ousifwetakethelogarithmofbothsides. Inthesequel,weomit dividedintotwopartsastrainingandtestsetswith15220and1220 the normalization term,as itscomputation isvery expensive. We sentences, respectively. The test set was further divided into two hope thatitsimpacton theperformance isnotsignificant. Yet,it parts. Each part, in turn, was used to optimize language and in- preventsusfromreportingperplexityresults. terpolation weights to be used for the other part in a ”jacknife paradigm”. The results were reported as the average of the two 4. THEORACLEAPPROACH results. The average sentence length of the corpus was 4 words (end-of-sentence was treated as a word). Weidentified 20 dialog Theset-upfororacleexperimentsisillustratedinFigure3. The contextsandlabeledeachsentencewiththeassociateddialogcon- purposeofthisset-upistwofold.First,weuseittoevaluatetheor- text. acleperformance.Second,weuseittopreparedataforthetraining Wetrained a dialog independent (DI)class based LM and dia- ofastochasticdecisionmodel.Forthesakeofsimplicity,weshow logdependent(DD)grammarbasedLM.InallLMs issetto3. theset-upfortwoLMsanddoexperimentsaccordingly. Nonethe- Itmust benoted thattheDIclass-based LMserved anstheLMof less,theset-upcanbeextendedtoanarbitrarynumberofLMs. thebaselinesystem with921 unigramsincluding 19classes. The The language models are used for N-best list rescoring. The total number of the distinct words in the lexicon was 1681. The N-best list is generated by a speech recognizer using a relatively grammar-based LM had 199 concept and filler classes that com- simplerLM(here,aclass-basedtrigramLM).Theframeworkfor pletelycover thelexicon. Inrescoringexperiments wesettheN- N-bestlistrescoringisthefollowingMAPdecision: bestlistsizeto10. Wethinkthatthechoiceof isareson- (cid:3) argmax (4) abletradeoffbetweenperformanceandcomplexNity=. 10 W = pAP(W=CW)P(CW=S) Theperplexity resultsarepresented inTable 1. Theperplexity W 2LN ofthegrammar-basedLMis36.8%betterthanthebaselineclass- where istheacousticprobabilityfromthefirstpass, isthe basedLM. uniquepcAonceptsequence associatedwith ,and deCnWotesthe Wedidexperimentsusing -bestlistsfromthebaselinerecog- N-best list. Each rescoring module supplWies the oLraNcle with their nizer. Wefirstdeterminedthe10bestpossibleperformanceinWER goalwiththemainfocusontheselectionoffeatures. Atthemo- Table1:Perplexityresults ment,thenumberofconcepts,thenumberoffillerclassesandthe LM Perplexity numberof3-gramhitsinasentence(allnormalizedbythelength DIclass3-gram 22.0 of the sentence) and the behavior of n-grams in a context are the DDSCFG3-gram 13.9 featuresthatweconsidertouse.Also,ithasbeenobservedthatthe performanceoftheoracleisstillfarfromthebestpossibleperfor- mance. Thisispartly duetothe verysmallnumber of LMsused intherescoring,partlyduetotheoracle’sharddecisioncombining offered by -best lists. This is done by picking the hypothesis strategy and partly due to the static combination with the acous- withthelow1e0stWERfromeachlist.Thisgivesanupperboundfor ticmodel. Theworkisinprogress towards thegoaloffillingthe the performance gain possible from rescoring -best lists . The performancegap. rescoringresultsintermsofabsoluteandrelativ1e0improvementsin WER and semantic error rate (SER)along with the best possible 7. REFERENCES improvement are reported in Table 2. Itshould be noted that the optimizations aremade using WER.Theslightdrop inSERwith [1] J.K.Baker.Trainablegrammarsforspeechrecognition.In interpolation might be due to that. Actually this is good for text SpeechCommunicationsforth97thMeetingofthe transcription but not for a dialog system. We believe that the re- AcousticalSocietyofAmerica,pages31–35,June1979. sults willreverse ifwe replace the optimization using WERwith [2] J.GillettandW.Ward.Alanguagemodelcombining theoptimizationusingSER. trigramsandstochasticcontext-freegrammars.In5-th InternationalConferenceonSpokenLanguageProcessing, pages2319–2322,Sydney,Australia,1998. Table2:TheWERandSERresultsofthe10-bestlistrescoring [3] K.HaciogluandW.Ward.Dialog-contextdependent with different LMs: the baseline WER is 25.9% and SER is languagemodelscombiningn-gramsandstochastic 23.7% context-freegrammars.InsubmittedtoInternational ConferenceofAcoustics,Speech,andSignalProcessing, Method WER SER Salt-Lake,Utah,,2001. ClassbasedLMalone 0.0% 0.0% [4] F.JelinekandR.Mercer.Interpolatedestimationofmarkov GrammarbasedLMalone 1.4(5.4)% 1.4(5.9)% sourceparametersfromsparsedata.PatternRecognitionin Linearinterpolation 1.6(6.2)% 1.3(5.5)% Practice,23:381,1980. Log-linearinterpolation 1.7(6.6)% 1.2(5.1)% [5] A.Keller,B.Rueber,F.Seide,andB.Tran.PADIS-an Oracle 3.0(11.6)% 2.7(11.4)% automatictelephoneswitchboardanddirectoryinformation Best 6.4(24.1)% 5.5(23.2)% system.SpeechCommunication,23:95–111,1997. [6] D.Klakow.Log-linearinterpolationoflanguagemodels.In 5-thInternationalConferenceonSpokenLanguage Theperformancegapbetweentheoracleandinterpolationmeth- Processing,pages1695–1699,Sydney,Australia,1998. odspromotesthesysteminFigure4. Weexpectthat,basedonthe universal approximation theory, a neural network with consistent [7] R.Rosenfeld.Amaximumentropyapproachtoadaptive features,sufficientlylargetrainingdataandpropertrainingwould languagemodeling.ComputerSpeechandLanguage, approximate fairly well the behavior of the oracle. On the other (10):187–228,1996. hand, theperformancegapbetweentheoracleandthebestpossi- [8] R.Rosenfeld.Twodecadesofstatisticallanguagemodeling: ble performance from 10-best lists suggests the use of more than Wheredowegofromhere?ProceedingsoftheIEEE, twolanguage models anddynamic combination withtheacoustic 88(8):1270–1278,August2000. model. [9] B.Souvignier,A.Keller,B.Rueber,H.Schramm,and F.Seide.Thethoughtfulelephant:Strategiesforspoken 6. CONCLUSIONS dialogsystems.IEEETransactionsonSpeechandAudio Processing,8(1):51–62,January2000. Wehavepresentedourrecentworkonlanguagemodelcombin- [10] Y.Wang,M.Mahajan,andX.Huang.Aunifiedcontext-free ing. Wehave shown that although a simpleinterpolation of LMs grammarandn-grammodelforspokenlanguageprocessing. improves theperformance, itfailstoreachtheperformance ofan InInternationalConferenceofAcoustics,Speech,andSignal oracle.WehaveproposedamethodforLMcombinationthatmim- Processing,pages1639–1642,Istanbul,Turkey,2000. icsthebehavioroftheoracle. Althoughourworkisnotcomplete [11] W.WardandB.Pellom.TheCUcommunicatorsystem.In without a neural network that mimics the oracle, we argue that IEEEWorkshoponAutomaticSpeechRecognitionand the universal approximation theory ensures the success of such a Understanding,Keystone,Colorado,1999. method.However,extensiveexperimentsarerequiredtoreachthe

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.