Speaker Adaptation of Language Models for Automatic Dialog Act Segmentation of Meetings

Jáchym Kolář (1), Yang Liu (2), Elizabeth Shriberg (3,4)

(1) University of West Bohemia, Department of Cybernetics, Pilsen, Czech Republic
(2) University of Texas at Dallas, Department of Computer Science, Richardson, TX, USA
(3) SRI International, USA
(4) International Computer Science Institute, USA

[email protected], [email protected], [email protected]

Abstract

Dialog act (DA) segmentation in meeting speech is important for meeting understanding. In this paper, we explore speaker adaptation of hidden event language models (LMs) for DA segmentation using the ICSI Meeting Corpus. Speaker adaptation is performed using a linear combination of the generic speaker-independent LM and an LM trained on only the data from individual speakers. We test the method on 20 frequent speakers, on both reference word transcripts and the output of automatic speech recognition. Results indicate improvements for 17 speakers on reference transcripts, and for 15 speakers on automatic transcripts. Overall, the speaker-adapted LM yields a statistically significant improvement over the baseline LM for both test conditions.

Index Terms: language model adaptation, speaker adaptation, meetings, dialog act segmentation, sentence segmentation

1. Introduction

Segmentation of speech into sentence-like units is a crucial first step in spoken language processing, since many downstream language processing techniques (e.g., parsing, automatic summarization, information extraction and retrieval, machine translation) are typically trained on well-formatted input (such as written text). Several different approaches have been employed for sentence segmentation, including hidden Markov models (HMMs), multilayer perceptrons, maximum entropy, conditional random fields, AdaBoosting, and support vector machines [1, 2, 3, 4, 5, 6]. Many of these approaches rely on both acoustic (prosodic) and lexical information.

Studies on sentence segmentation have been conducted in
different domains, including broadcast news, conversational telephone speech, and meetings. Our task in this paper is to automatically segment meeting recordings into sentences, or, for this domain, dialog acts (DAs). In many real-life meeting applications, the speakers are often known beforehand and recorded on a separate channel. Moreover, many meetings have recurring participants, presenting the opportunity for adapting models to the individual talkers. Speaker adaptation methods were first successfully used in the cepstral domain for speech recognition [7, 8]. In [9], we evaluated speaker-dependent prosody models for DA segmentation in meetings. However, it was left unanswered whether speaker-dependent language models (LMs) would also benefit DA segmentation. In this paper we aim to address this question; our goal is to adapt the LM to capture speakers' idiosyncratic lexical patterns associated with DA boundaries.

General LM adaptation has been studied rather extensively in speech recognition and other language processing tasks, using both supervised [10, 11] and unsupervised [12, 13, 14] approaches. A useful survey of LM adaptation techniques is given in [15]. The typical approach is to use the test words to determine the topic of the test document, and then use the topic-specific LM or its combination with the generic LM. For the sentence segmentation task, [16] successfully used both acoustic and lexical data from another domain (with a slightly different definition of sentence boundaries) to aid automatic sentence segmentation in meetings.

Although topic- and domain-based LM adaptation approaches have received significant attention in the literature, much less is known about LM adaptation for individual talkers. Akita and Kawahara [17] showed improved recognition performance using LM speaker adaptation by scaling the n-gram probabilities with unigram probabilities estimated via probabilistic latent semantic analysis. Tur and Stolcke [18] demonstrated that unsupervised within-speaker LM adaptation significantly reduced word error rate in meeting recognition. Unlike previous work in DA segmentation (which typically focuses on features or modeling approaches in a speaker-independent fashion) or LM adaptation (mostly topic-based adaptation for speech recognition), in this study we investigate whether speaker adaptation of LMs may help in automatic DA segmentation of meetings.

The remainder of the paper is organized as follows. We describe our approach in Section 2. The experimental setup, results, and discussion are presented in Section 3; Section 4 provides conclusions.

2. Method

2.1. Hidden Event Language Model

For a given word sequence w_1 w_2 ... w_i ... w_n, the task of DA segmentation is to determine which inter-word boundaries correspond to a DA boundary. We label each inter-word boundary as either a within-unit boundary or a boundary between DAs. For example, in the utterance "yes we should be done by noon", there are two dialog acts: "yes" (an answer), and "we should be done by noon" (a statement). Each ends in a segmentation boundary.

We use a hidden event LM [19] to automatically detect DA boundaries in the unstructured word sequence. The hidden event LM describes the joint distribution of words and DA boundaries, P_{LM}(W, B). The model is trained by explicitly including the DA boundary as a token in the vocabulary of a word-based n-gram LM. During testing, the hidden event LM uses the pair of word and DA boundary as the hidden state in an HMM, with the words as observations, and performs Viterbi or forward-backward decoding to find the DA boundaries given the word sequence. We used trigram LMs with modified Kneser-Ney smoothing [20], as implemented in the SRILM toolkit [21].
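For concreteness, the following is a minimal sketch of this decoding step. It uses a toy bigram formulation rather than the trigram models employed in the paper, and the `logprob` callable and the "<DA>" token name are our own illustrative assumptions, not SRILM interfaces.

```python
import math

DA = "<DA>"  # hypothetical name for the boundary token in the LM vocabulary

def viterbi_da_segment(words, logprob):
    """Most likely placement of DA boundaries under a hidden-event bigram LM.

    logprob(history, token) -> log P(token | history); tokens are words or
    the DA boundary token, which is part of the LM vocabulary.
    Returns one boolean per word: True if a DA boundary follows that word.
    """
    n = len(words)
    NEG = float("-inf")
    # best[i][s]: best log-prob up to word i, with s = "boundary after word i"
    best = [{False: NEG, True: NEG} for _ in range(n)]
    back = [{False: False, True: False} for _ in range(n)]
    first = logprob("<s>", words[0])
    best[0][False] = first
    best[0][True] = first + logprob(words[0], DA)
    for i in range(1, n):
        for prev in (False, True):
            # under a bigram LM, the history is the boundary token itself
            # if one was hypothesized, otherwise the previous word
            hist = DA if prev else words[i - 1]
            stem = best[i - 1][prev] + logprob(hist, words[i])
            for s in (False, True):
                score = stem + (logprob(words[i], DA) if s else 0.0)
                if score > best[i][s]:
                    best[i][s] = score
                    back[i][s] = prev
    # backtrace; the utterance-final boundary is forced
    bounds = [True]
    for i in range(n - 1, 0, -1):
        bounds.append(back[i][bounds[-1]])
    return list(reversed(bounds))
```

Forward-backward decoding would instead sum over paths to obtain a posterior probability for each boundary, allowing the decision threshold to be tuned.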
2.2. Speaker Adaptation Approach

To adapt the generic speaker-independent LM to a particular speaker, we use an interpolation approach. The speaker-adapted model SA is obtained as a linear combination of the speaker-independent model SI and a speaker-dependent model SD as follows:

P_{SA}(t_i \mid h_i; \lambda) = \lambda \, P_{SI}(t_i \mid h_i) + (1 - \lambda) \, P_{SD}(t_i \mid h_i)    (1)

where t_i denotes a token (word or DA boundary) and h_i its history of n-1 tokens in an n-gram LM. λ is a weighting factor that is empirically optimized on held-out data. We compare different approaches to estimating the λs, as described in Section 3. Note that the SD data is already contained in the SI training data; the interpolation therefore does not help reduce the out-of-vocabulary rate, but rather gives a larger weight to n-grams observed in the data from a particular speaker and is thus expected to be better suited to that speaker.
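To make Eq. (1) and the weight optimization concrete, the sketch below interpolates the two models at the token level and picks λ by a simple grid search over held-out likelihood. The function names and the grid are our own assumptions; in practice the same effect is typically obtained through an LM toolkit's built-in model mixing.

```python
import math

def p_adapted(token, hist, p_si, p_sd, lam):
    """Eq. (1): interpolate speaker-independent (SI) and speaker-dependent
    (SD) n-gram probabilities, with weight lam on the SI model."""
    return lam * p_si(token, hist) + (1.0 - lam) * p_sd(token, hist)

def choose_lambda(heldout, p_si, p_sd, grid=None):
    """Pick the weight maximizing held-out log-likelihood.

    heldout: list of (token, history) pairs, where a history is the
    previous n-1 tokens (words or the DA boundary token). Assumes both
    models assign nonzero smoothed probability to every pair.
    """
    grid = grid or [i / 20.0 for i in range(1, 20)]  # 0.05 .. 0.95
    def loglik(lam):
        return sum(math.log(p_adapted(t, h, p_si, p_sd, lam))
                   for t, h in heldout)
    return max(grid, key=loglik)
```

In the jackknife scheme described in Section 3.1, `choose_lambda` would be run on one half of a speaker's test data and the resulting weight applied to the other half, and vice versa.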
3. Results and Discussion

3.1. Data and Experimental Setup

The ICSI Meeting Corpus [22] contains approximately 72 hours of multichannel conversational speech data and associated human transcripts, manually annotated for DAs [23]. We selected the top 20 speakers in terms of total words. Each speaker's data was split into a training set (~70% of the data) and a test set (~30%), with the caveat that a speaker's recording in any particular meeting appeared in only one of the sets. Because of data sparsity, especially for the less frequent speakers, we did not use a separate development set, but rather jackknifed the test set in our experiments.

The total training set for the speaker-independent models (comprising the training portions of the 20 analyzed speakers, as well as all data from 32 other, less frequent speakers) contains 567k words. Data set sizes for individual speakers are shown in Tables 1 and 2; the size of the training sets used for speaker adaptation (referred to as "adaptation sets") ranges from 5.2k to 115.2k words. We use the official corpus speaker IDs. The first letter of the ID denotes the sex of the speaker ("f" or "m"); the second letter indicates whether the speaker is a native ("e") or nonnative ("n") speaker of English.

We use two different test conditions: reference transcripts (REF) and speech recognition output (Speech-To-Text, STT). Recognition results were obtained using the state-of-the-art SRI CTS system [24], which was trained using no acoustic data or transcripts from the analyzed meeting corpus. To represent a fully automatic system, we also used automatic speech/nonspeech segmentation. Word error rates for this difficult data are still quite high; the STT system performed at 38.2% (on the whole corpus). To generate the "reference" DA boundaries for the STT words, we aligned the reference setup to the recognition output with the constraint that two aligned words could not occur further apart than a fixed time threshold.

A robust estimation of the interpolation weight λ in Eq. (1) may be a problem because of data sparsity. In the jackknife approach, one half of a speaker's test data is used to estimate λ for the other half, and vice versa. We test two methods for estimating the λs. First, the λs are estimated individually for each speaker. Second, we set λ to the average value of the interpolation weights across all speakers. Note that in the latter approach we still use only data from the first halves of the individual test sets to estimate λ for the second halves, and vice versa. This approach eliminates having two significantly different values of λ for a single speaker, which did occur for some of the 20 analyzed speakers. It indicated that for those speakers there was a mismatch between the two halves of the test data used in the jackknife, and thus the weights were not optimized properly for the test set.

3.2. Evaluation Metrics

We measure DA segmentation performance using a "boundary error rate" [1]:

BER = \frac{I + M}{N_W} [%]    (2)

where I denotes the number of false DA boundary insertions, M the number of misses, and N_W the number of words in the test set. In addition, we report overall results using another common metric, the NIST error rate. For DA segmentation, it is defined as the number of misclassified boundaries divided by the total number of DA boundaries in the reference. It can be expressed as

NIST = \frac{I + M}{N_{DA}} [%]    (3)

where N_{DA} denotes the number of DA boundaries in the test set. Note that this error rate may be even higher than 100%.
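The two metrics count the same insertion and miss errors but normalize differently, which a short sketch makes explicit (the per-boundary boolean lists are a hypothetical representation of the reference and hypothesized segmentations):

```python
def segmentation_error_rates(ref_bounds, hyp_bounds, n_words):
    """Compute BER (Eq. 2) and the NIST error rate (Eq. 3).

    ref_bounds, hyp_bounds: parallel lists with one boolean per
    inter-word boundary, True where a DA boundary occurs.
    """
    insertions = sum(h and not r for r, h in zip(ref_bounds, hyp_bounds))
    misses = sum(r and not h for r, h in zip(ref_bounds, hyp_bounds))
    errors = insertions + misses
    ber = 100.0 * errors / n_words           # normalized by words, Eq. (2)
    nist = 100.0 * errors / sum(ref_bounds)  # normalized by DA boundaries, Eq. (3)
    return ber, nist
```

Because the number of DA boundaries is much smaller than the number of words, the NIST rate is always the larger of the two, and it exceeds 100% as soon as the error count exceeds the number of reference boundaries.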
3.3. Results for Individual Speakers

Table 1 shows a comparison of DA segmentation performance for the baseline speaker-independent LM and the speaker-adapted LMs for individual speakers, using reference transcripts. The speakers in the table are sorted according to the total number of words they have in the corpus. The numbers of words in the adaptation and test sets are also listed. The results indicate that for 17 of the 20 speakers, performance improved using both individual and global weights; two other speakers improved for only one of the two interpolation methods. However, the degree of the improvement varies across speakers. For 8 talkers, the improvement was statistically significant at p ≤ 0.05 using a Sign test for both methods. For 4 others it was significant for only one of the methods.

Table 1: Boundary error rates (BER) [%] in REFerence conditions for individual speakers. ID = speaker ID; #Adapt and #Test denote the number of words in the adaptation and test sets for each speaker; Baseline = speaker-independent model performance; AdInW = adaptation with individual weights; AdGlW = adaptation with global weights. IDs of speakers who improved using both methods are shown in boldface; * and ** indicate that the improvement is significant by a Sign test at p ≤ 0.05 for one or both methods, respectively.

ID         #Adapt    #Test    Baseline   AdInW    AdGlW
me013**    115.2k    51.2k    6.75       6.55     6.52
me011*     50.6k     24.8k    7.40       7.38     7.25
fe008**    50.6k     22.6k    7.51       7.12     7.16
fe016*     32.0k     15.4k    7.35       7.22     7.18
mn015**    31.9k     14.7k    8.05       7.75     7.80
me018*     31.8k     14.7k    6.64       6.43     6.45
me010**    26.1k     12.6k    7.24       6.96     6.84
mn007*     21.0k     10.1k    7.59       7.36     7.31
mn017**    21.0k     7.1k     7.02       6.44     6.44
mn082      13.3k     4.2k     6.33       6.28     6.21
mn052      10.7k     3.8k     7.33       7.28     7.28
mn021**    9.6k      4.1k     6.68       5.41     5.65
me003      9.3k      3.6k     8.78       8.45     8.56
mn005**    7.7k      3.1k     7.83       7.01     6.92
me045      8.1k      2.4k     8.90       8.94     8.90
me025      7.7k      2.4k     8.06       8.02     7.85
me006      6.9k      1.5k     9.53       10.32    9.47
me026      5.2k      2.5k     5.80       5.76     5.80
me012**    5.3k      2.1k     6.85       6.29     6.29
fn002      5.9k      1.5k     10.92      10.79    11.33

Table 2 reports the corresponding results for the STT condition. Note that the test set sizes differ from those in the previous table, since the number of words in STT output is usually smaller than in the corresponding reference. The results show that 15 speakers improved using both interpolation methods, while 4 other speakers improved for only one of the methods. Again, for 8 talkers the improvement was significant at p ≤ 0.05 for both methods, and for 4 others the improvement was significant for one method.

Table 2: Boundary error rates (BER) [%] in Speech-To-Text conditions for individual speakers. Column labels are as in Table 1. IDs of speakers who improved using both methods are shown in boldface; * and ** indicate that the improvement is significant by a Sign test at p ≤ 0.05 for one or both methods, respectively.

ID         #Adapt    #Test    Baseline   AdInW    AdGlW
me013**    115.2k    43.4k    8.29       8.16     8.18
me011**    50.6k     22.9k    8.81       8.59     8.51
fe008*     50.6k     19.5k    9.19       9.05     8.89
fe016      32.0k     13.9k    8.42       8.40     8.31
mn015**    31.9k     13.7k    10.16      9.90     9.84
me018**    31.8k     13.3k    8.12       7.83     7.90
me010**    26.1k     11.3k    8.39       7.96     7.91
mn007**    21.0k     8.4k     11.28      10.73    10.78
mn017**    21.0k     6.0k     8.92       8.01     7.84
mn082      13.3k     3.7k     10.37      10.45    10.10
mn052*     10.7k     3.5k     10.87      10.49    10.17
mn021*     9.6k      4.1k     8.20       7.83     7.73
me003      9.3k      3.2k     9.36       9.36     9.33
mn005**    7.7k      3.0k     11.47      8.94     10.42
me045      8.1k      2.1k     11.08      11.42    11.27
me025      7.7k      1.6k     14.36      14.23    14.36
me006*     6.9k      1.3k     10.94      10.01    10.40
me026      5.2k      2.3k     7.35       7.01     6.88
me012      5.3k      1.9k     8.68       8.15     8.31
fn002      5.9k      1.4k     13.40      13.40    12.89

An interesting observation is that for both testing conditions, the relative error reduction achieved by speaker adaptation is not correlated with the amount of adaptation data. This finding suggests that speakers differ inherently in how similar they are to the generic speaker-independent LM. Some talkers probably differ more and thus show more gain, even with less data.

3.4. Overall Results

An overall comparison of the performance of the baseline speaker-independent and speaker-adapted LMs is presented in Table 3. The test set contains 203k words for the REF condition and 180k words for the STT condition. These results show that for both conditions, speaker-adapted LMs, with either global or individual interpolation weights, outperform the baseline. The overall improvements from LM speaker adaptation are statistically significant for both conditions at p < 10^{-15}, using a Sign test. Of the two weighting options, global interpolation results in better performance; however, the difference between the two approaches is significant only at p < 0.1.

In the speech-to-text condition, we also tried interpolating the speaker-independent model trained on reference transcriptions with a speaker-dependent model trained on the recognizer output. The idea was to allow the model to also adapt to error patterns typical of an individual talker. However, this adaptation performed less well than using reference transcriptions as the training data, which indicates that, at least with the amount of data available for our experiments, it is preferable to adapt LMs using clean data. In consequence, it also suggests that prospective unsupervised approaches to LM speaker adaptation will perform less well than the supervised approach.

Table 3: Overall boundary error rates (BER) [%] and NIST error rates [%] in REFerence and STT conditions.

Test conditions         REF             STT
Metric                  BER     NIST    BER     NIST
Baseline                7.30    48.56   9.06    68.65
Individual weights      7.02    46.74   8.79    66.61
Global weights          6.99    46.56   8.76    66.38
Adapt with STT data     N/A     N/A     8.97    68.03
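The per-speaker and overall Sign tests reported above can be approximated with a standard paired formulation; the sketch below is our own reconstruction (assuming SciPy is available; the authors' exact pairing of decisions may differ) that discards ties and applies a two-sided binomial test.

```python
from scipy.stats import binomtest

def sign_test(baseline_correct, adapted_correct):
    """Two-sided Sign test over paired per-boundary decisions.

    Each argument lists, per inter-word boundary, whether the respective
    model classified it correctly. Ties (both right or both wrong) are
    discarded, as is usual for the Sign test.
    """
    wins = sum(a and not b for b, a in zip(baseline_correct, adapted_correct))
    losses = sum(b and not a for b, a in zip(baseline_correct, adapted_correct))
    n = wins + losses
    return binomtest(wins, n, p=0.5).pvalue if n else 1.0
```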
4. Conclusions

We have explored speaker adaptation of hidden event language models for automatic DA segmentation of multiparty meetings. The speaker adaptation is based on a linear combination of the generic speaker-independent and speaker-dependent LMs. We evaluated the method on 20 frequent speakers with a wide range of total words available for LM adaptation. We examined the approach using both reference word transcripts and recognizer outputs. We found improvements for 17 speakers using reference transcripts, and for 15 speakers using automatic transcripts. Overall, we achieved a statistically significant improvement over the baseline LM for both testing conditions. The improvement was achieved even for some speakers who had only a relatively small amount of data available for adaptation. We conclude that speaker adaptation of LMs aids DA segmentation, and that future work should investigate the potential of speaker-specific modeling for other tasks. Other important areas for future extensions include the integration of lexical with prosodic or even multimodal information, and the exploration of unsupervised approaches.

5. Acknowledgments

This material is based upon work supported by the Ministry of Education of the Czech Republic, projects No. 1M0567 and ME909, and by the U.S. Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD030010. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or the Department of the Interior-National Business Center (DOI-NBC).

6. References

[1] Shriberg, E. et al.: "Prosody-based Automatic Segmentation of Speech into Sentences and Topics," Speech Communication, vol. 32, no. 1-2, pp. 127-154, 2000.
[2] Warnke, V. et al.: "Integrated Dialog Act Segmentation and Classification Using Prosodic Features and Language Models," in Proc. EUROSPEECH '97, pp. 207-210, Rhodes, Greece, 1997.
[3] Huang, J., Zweig, G.: "Maximum Entropy Model for Punctuation Annotation from Speech," in Proc. ICSLP 2002, Denver, CO, 2002.
[4] Liu, Y. et al.: "Using Conditional Random Fields for Sentence Boundary Detection in Speech," in Proc. ACL, Ann Arbor, MI, 2005.
[5] Kolář, J., Shriberg, E., Liu, Y.: "Using Prosody for Automatic Sentence Segmentation of Multi-Party Meetings," in Text, Speech and Dialogue (TSD) 2006, Lecture Notes in Artificial Intelligence (LNAI), vol. 4188, pp. 629-636, Springer-Verlag, Berlin Heidelberg, 2006.
[6] Akita, Y. et al.: "Sentence Boundary Detection of Spontaneous Japanese Using Statistical Language Model and Support Vector Machines," in Proc. INTERSPEECH-ICSLP, Pittsburgh, PA, 2006.
[7] Gauvain, J. L., Lee, C. H.: "Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291-298, 1994.
[8] Gales, M. J. F.: "Maximum Likelihood Linear Transformations for HMM-based Speech Recognition," Computer Speech and Language, vol. 12, pp. 75-98, 1998.
[9] Kolář, J., Shriberg, E., Liu, Y.: "On Speaker-Specific Prosodic Models for Automatic Dialog Act Segmentation of Multi-Party Meetings," in Proc. INTERSPEECH-ICSLP 2006, Pittsburgh, PA, 2006.
[10] Kneser, R., Peters, J., Klakow, D.: "Language Model Adaptation Using Dynamic Marginals," in Proc. EUROSPEECH, Rhodes, Greece, 1997.
[11] Seymore, K., Rosenfeld, R.: "Using Story Topics for Language Model Adaptation," in Proc. EUROSPEECH, Rhodes, Greece, 1997.
[12] Gretter, R., Riccardi, G.: "On-line Learning of Language Models with Word Error Probability Distributions," in Proc. ICASSP, Salt Lake City, UT, 2001.
[13] Niesler, T., Willett, D.: "Unsupervised Language Model Adaptation for Lecture Speech Transcription," in Proc. ICSLP, Denver, CO, 2002.
[14] Nanjo, H., Kawahara, T.: "Unsupervised Language Model Adaptation for Lecture Speech Recognition," in Proc. ISCA and IEEE SSPR, Tokyo, Japan, 2003.
[15] Bellegarda, J. R.: "Statistical Language Model Adaptation: Review and Perspectives," Speech Communication, vol. 42, pp. 93-108, 2004.
[16] Cuendet, S., Hakkani-Tür, D., Tur, G.: "Model Adaptation for Sentence Segmentation from Speech," in Proc. IEEE/ACL Workshop on Spoken Language Technology (SLT), Aruba, 2006.
[17] Akita, Y., Kawahara, T.: "Language Model Adaptation Based on PLSA of Topics and Speakers," in Proc. INTERSPEECH-ICSLP 2004, Jeju, Korea, 2004.
[18] Tur, G., Stolcke, A.: "Unsupervised Language Model Adaptation for Meeting Recognition," in Proc. IEEE ICASSP 2007, Honolulu, HI, 2007.
[19] Stolcke, A. et al.: "Automatic Detection of Sentence Boundaries and Disfluencies Based on Recognized Words," in Proc. ICSLP, pp. 2247-2250, Sydney, 1998.
[20] Chen, S. F., Goodman, J.: "An Empirical Study of Smoothing Techniques for Language Modeling," Tech. Rep. TR-10-98, CS Group, Harvard University, 1998.
[21] Stolcke, A.: "SRILM - An Extensible Language Modeling Toolkit," in Proc. ICSLP, Denver, CO, 2002.
[22] Janin, A. et al.: "The ICSI Meeting Corpus," in Proc. ICASSP 2003, Hong Kong, 2003.
[23] Dhillon, R. et al.: "Meeting Recorder Project: Dialog Act Labeling Guide," ICSI Technical Report TR-04-02, International Computer Science Institute, Berkeley, CA, 2004.
[24] Zhu, Q. et al.: "Using MLP Features in SRI's Conversational Speech Recognition System," in Proc. INTERSPEECH 2005, pp. 2141-2144, Lisboa, 2005.