ITERATED CLASS-SPECIFIC SUBSPACES FOR SPEAKER-DEPENDENT PHONEME CLASSIFICATION

Paul M. Baggenstoss
Naval Undersea Warfare Center
Newport RI, 02841, USA
phone: (+001) 401-832-8240, email: [email protected]
web: www.npt.nuwc.navy.mil/csf

This work was supported by Office of Naval Research ONR 321US.

ABSTRACT

Features based on the MEL cepstrum have long dominated probabilistic methods in automatic speech recognition (ASR). This feature set has evolved to maximize general ASR performance within a Bayesian classifier framework using a common feature space. Now, however, with the advent of the PDF projection theorem (PPT) and the class-specific method (CSM), it is possible to design features separately for each phoneme and compare log-likelihood values fairly across various feature sets. In this paper, class-dependent features are found by optimizing a set of frequency-band functions for projection of the spectral vectors, analogous to the MEL frequency band functions, individually for each class. Using this method, we show significant improvement over standard MEL cepstrum methods in speaker- and phoneme-specific recognition.

1. INTRODUCTION

The MEL cepstrum features [1] and their derivatives have long been the staple of automatic speech recognition (ASR) systems. One may write the MEL cepstrum features as

    z = DCT(log(A'y)),   (1)

where the vector y is the length-(N/2+1) spectral vector, the magnitude-squared DFT output, and the columns of A are the MEL band functions [1]. The logarithm and the discrete cosine transform (DCT) are invertible functions. There is no dimension reduction or information loss, so they may be considered a feature conditioning step which results in more Gaussian-like and independent features. Thus, we may concentrate our attention on the matrix multiplication

    w = A'y.   (2)

The key operation here is dimension reduction by linear projection onto a lower-dimensional space. Now, with the introduction of the class-specific method (CSM) and the PDF projection theorem (PPT) [2], one is free to explore class-dependent features within the rigid framework of Bayesian classification. Some work has been done in class-dependent features [3], [4]; however, existing approaches are only able to use different features through the use of compensation factors to make likelihood comparisons fair. Such approaches work only if the class-dependent feature transformations are restricted to certain limited sets. Both methods fall short of the potential of the PPT, which places no restriction on the type of feature transformations available to each phoneme. Under CSM, the "common feature space" is the time-series (raw data) itself. Feature PDFs, evaluated on different feature spaces, are projected back to the raw data space where the likelihood comparison is done. Besides its generality, the CSM paradigm has many additional advantages. For example, there is a quantitative class-dependent measure to optimize that allows the design of the class-dependent features in isolation, without regard to the other classes.

2. CLASS-SPECIFIC APPROACH

When applying CSM, one must find class-dependent signal processing to produce features that characterize each class or sub-class. We seek an automatic means of optimizing the matrix A for a given subclass. We first review CSM.

2.1 Class-Specific Method (CSM)

Let there be M classes among which we would like to classify. The class-specific classifier, based on the PPT, is given by

    argmax_m p_p(x|H_m),

where p_p(x|H_m) is the projected PDF (projected from the feature space to the raw data space). The projected PDF is given by

    p_p(x|H_m) = J(x, A_m, H_{0,m}) p^(z_m|H_m),   (3)

where p^(z_m|H_m) is the feature PDF estimate (estimated from training data) and the J-function is given by

    J(x, A_m, H_{0,m}) = p(x|H_{0,m}) / p(z_m|H_{0,m}),   (4)

where the H_{0,m} are class-dependent reference hypotheses. The class-dependent features z_m are computed from the spectral vector y through the class-dependent subspace matrices A_m, as

    z_m = C(A_m' y),   (5)

where C is the feature conditioning transformation. Note that the J-function is a fixed function of x, precisely defined by the feature transformation from x to z_m and the reference hypothesis H_{0,m}. It is the "correction term" that allows feature PDFs from various feature spaces to be compared fairly, because the resulting likelihood function is a PDF on the raw data space x. The J-function is a generalization of the determinant of the Jacobian matrix in the case of a 1:1 transformation.
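As a concrete numerical illustration of (3) and (4), consider the simplified case of a linear feature z = A'x with orthonormal columns and a zero-mean, unit-variance white Gaussian reference hypothesis H_0. (The paper's actual processing chain in section 3.3 involves several more stages; the function names below are ours.) Under these assumptions the J-function has a closed form, since p(x|H_0) and p(z|H_0) are both standard Gaussians:

```python
import numpy as np

def log_j_orthonormal(x, A):
    """log J(x, A, H0) for z = A'x with orthonormal columns of A, where
    H0 is zero-mean, unit-variance white Gaussian noise.  Then
    p(x|H0) = N(x; 0, I_N) and p(z|H0) = N(z; 0, I_P), so
    log J = -0.5*(||x||^2 - ||z||^2) - 0.5*(N - P)*log(2*pi)."""
    N, P = A.shape
    z = A.T @ x
    return -0.5 * (x @ x - z @ z) - 0.5 * (N - P) * np.log(2 * np.pi)

def projected_log_pdf(x, A, feat_logpdf):
    """log p_p(x|H_m) = log J + log p^(z|H_m): equation (3) in the log domain.
    feat_logpdf is the trained feature log-PDF estimate for class m."""
    return log_j_orthonormal(x, A) + feat_logpdf(A.T @ x)
```

Because the J-function restores the dimensions discarded by the projection, the values returned by `projected_log_pdf` are comparable across classes even when each class uses a different A and a different feature dimension.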
The PPT guarantees that p_p(x|H_m) given by (3) is a PDF, so it integrates to 1 over x regardless of the reference hypothesis H_{0,m} or the feature transformation producing z_m from x. It is up to the designer to choose H_{0,m} and A_m to make p_p(x|H_m) as good an estimate of p(x|H_m) as possible. The designer is guided by the principle that if z_m is a sufficient statistic for H_m vs. H_{0,m}, then p_p(x|H_m) will equal p(x|H_m) (provided p^(z_m|H_m) is a good estimate). We can also think of it as a way of embedding a low-dimensional PDF within a high-dimensional PDF.

We have good reason, as we shall see, to use a common reference hypothesis H_0, which simplifies the classifier to

    argmax_m J_m(x, A_m, H_0) p^(z_m|H_m),   (6)

where the J-function J_m(x) now depends only on A_m. Note that in contrast to other class-dependent schemes using pairwise or tree tests, CSM is a Bayesian classifier and has the promise of providing a "drop-in" replacement for the MEL-cepstrum-based feature processors in existing ASR systems.

2.2 Finding a class-specific subspace

We are interested in adapting the matrix A to an individual class. We propose the strategy of selecting A_m to maximize the total log-likelihood of the training data using the projected PDF. Let

    L(x^1, x^2, ..., x^K; A_m) = sum_{i=1}^K log p_p(x^i|H_m),   (7)

where K is the number of training vectors. If we expand p_p(x|H_m),

    p_p(x|H_m) = [ p(x|H_0) / p(z_m|H_0) ] p^(z_m|H_m),

where H_0 is the independent Gaussian noise hypothesis, we see that the term p(x|H_0) is independent of A_m. Thus, to maximize L, we need to maximize the average value of

    log p^(z_m|H_m) - log p(z_m|H_0).   (8)

Our approach is to assume that the first term in (8) is only weakly dependent on A_m and to concentrate on the second term. Given the simplicity of the reference hypothesis H_0, the second term p(z_m|H_0) can be known, either in analytic form or in an accurate analytic approximation [5]. Thus, it is easy to analyze its behavior as A_m changes. We have obtained the first derivatives of log p(z_m|H_0) with respect to each element of A_m. We proceed, then, by ignoring the term p^(z_m|H_m) and maximizing the function

    Q(x^1, x^2, ..., x^K; A_m) = - sum_{i=1}^K log p(z_m^i|H_0).   (9)

The change in p^(z_m|H_m) can be minimized as A_m is changed by insisting on an orthonormal form for A_m. Thus, by maximizing Q (9) under the restriction that A_m is orthonormal, we approximately maximize L. We apply the following constraints to A_m:

- Orthonormality. The columns of A_m are an orthonormal set of vectors. We use orthonormality under the inner product

      <x,y> = sum_{i=0}^{N/2} e_i x_i y_i,

  where e_i has the value 2 except for the end bins (0 and N/2), where it has value 1. Orthonormality under this inner product means that the spectral vectors will be orthonormal if extended to the full N bins. Use of orthonormality helps to stabilize the term p^(z_m|H_m) as A_m is varied.

- Energy sufficiency. The energy sufficiency constraint means that the total energy in x,

      E = sum_{i=1}^N x_i^2,

  can be derived from the features. Energy sufficiency is important in the context of floating reference hypotheses [2]. In order that the classifier result is scale invariant, we need energy sufficiency. With energy sufficiency, the term

      p(x|H_0) / p(z_m|H_0)

  will be independent of the variance used on the H_0 reference hypothesis. Note that E = e_1'y/N, where e_1 = [1,2,2,...,2,1]', which is composed of the number of degrees of freedom in each frequency bin. Thus, energy sufficiency means that the column space of A_m needs to contain the vector e_1.

2.2.1 Class-specific iterated subspace (CSIS)

Since we would like the feature set created by projecting onto the columns of A to characterize the statistical variations within the class, a natural first step is to use principal component analysis (PCA). To do this, we arrange the spectral vectors from the training set into a matrix

    X = [y^1 y^2 ... y^K],

where K is the number of training vectors. To meet the energy sufficiency constraint, we fix the first column of A to be the normalized e_1,

    e~_1 = e_1 / ||e_1||.

To find the best linear subspace orthogonal to e_1, we first orthogonalize the columns of X to e_1: X_n = X - e~_1(e~_1'X). Let U be the largest P singular vectors of X_n, or equivalently the largest P eigenvectors of X_n X_n'. We then set A = [e~_1 U]. We then proceed to maximize (9) using an iterative approach. We use the term class-specific iterated subspace (CSIS) to refer to the columns of A_m obtained in this way.
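The PCA initialization of section 2.2.1 can be sketched in a few lines. This is a simplified version assuming ordinary Euclidean orthonormality rather than the weighted inner product of section 2.2, and the function name and array layout are our own:

```python
import numpy as np

def init_subspace(Y, P):
    """Initialize A = [e1~, U] from spectral training vectors.
    Y: (n_bins, K) matrix of spectral vectors, one column per training vector.
    P: number of PCA basis functions beyond the fixed energy column."""
    n_bins = Y.shape[0]
    # Energy vector e1 = [1, 2, 2, ..., 2, 1]': degrees of freedom per bin
    e1 = np.full(n_bins, 2.0)
    e1[0] = e1[-1] = 1.0
    e1_t = e1 / np.linalg.norm(e1)               # normalized first column
    # Orthogonalize the training vectors against e1
    Yn = Y - np.outer(e1_t, e1_t @ Y)
    # Largest-P left singular vectors of the orthogonalized data (PCA)
    U, _, _ = np.linalg.svd(Yn, full_matrices=False)
    return np.column_stack([e1_t, U[:, :P]])
```

Because the singular vectors of Yn lie in the subspace orthogonal to e~_1, the resulting A has orthonormal columns and satisfies the energy sufficiency constraint by construction; the subsequent iterative maximization of (9) would start from this A.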
3. EXPERIMENTAL APPROACH

3.1 Data Set

We used the TIMIT [6] data set as a source of phonemes, drawing all of our data from the "training" portion. TIMIT consists of sampled time-series (in 16 kHz .wav files) of scripted sentences read by a wide variety of speakers, and includes index tables that point to the start and stop samples of each spoken phoneme in the text. There are 61 phonemes in the database, each having a 1- to 4-character code. We use the term data class to represent the collection of all the phonemes of a given type from a given speaker. The average number of samples (utterances) of a given speaker/phoneme combination is about 10, and ranges from 1 up to about 30 for some of the most common phonemes. We used speaker/phoneme combinations with no fewer than 10 samples.

3.2 Cross-Validation

In all of our classification experiments, the utterances of a given speaker/phoneme were divided into two sets, even (samples 2,4,6,...) and odd (samples 1,3,5,...). We conducted two sub-experiments, training on even and testing on odd, then training on odd and testing on even. We reported the sum of the classification counts from the two experiments.

3.3 Processing

We now describe the processing for the features of the MEL frequency cepstral coefficient (MFCC) classifier and CSIS. In order to concentrate on the basic dimension reduction step (equation 2), the simplest possible processing and PDF modeling was used. Each step in the processing is described below, in the order in which it is applied.

3.3.1 Resampling

We pre-processed all TIMIT .wav files by resampling from 16 kHz down to 12 kHz. Phoneme endpoints were correspondingly converted and used to select data from the 12 kHz time-series.

3.3.2 Truncation

The phoneme data was truncated to a multiple of 384 samples by truncating off the end. Those phoneme events that were below 384 samples at 12 kHz were dropped. Doing this allowed us to use FFT sizes of 48, 64, 96, 128, or 192 samples, which are all factors of 384.

3.3.3 FFT processing

We computed non-overlapped, unshaded (rectangular window function) FFTs, resulting in a sequence of magnitude-squared FFT spectral vectors of length N/2+1, where N is the FFT size. The number of FFTs in the sequence depended on how many non-overlapped FFTs fit within the truncated phoneme utterance.

3.3.4 Spectral normalization

Spectral vectors were normalized after FFT processing. For non-speaker-dependent (MEL cepstrum) features, the spectral vectors were normalized by the average spectrum of all available data. For CSIS (speaker-dependent) features, the spectral values for each speaker/phoneme combination were normalized by the average spectrum for that speaker/phoneme. In classification experiments the average spectrum was computed from the training data to avoid issues of data separation.

3.3.5 Subspace Projection (Matrix Multiplication)

Next, the spectral vectors, denoted by y, were projected onto a lower-dimensional subspace by a matrix as in (2), resulting in feature vectors, denoted by w. For MFCC, the columns of A were MEL frequency band functions. The number of columns in matrix A was N_c + 2, including the zero and Nyquist half-bands. For CSIS, A was an orthonormal matrix determined from the optimization algorithm. For CSIS, the number of columns of A was P+1, where P is the number of basis functions in addition to the first column e~_1.

3.3.6 Feature Conditioning

From a statistical point of view, feature conditioning has no effect on the information content of the features. It does, however, make probability density function (PDF) estimation easier if the resulting features are approximately independent and Gaussian. For MFCC, the features were conditioned by taking the log and DCT as in (1). For CSIS, features were conditioned first by dividing features 2 through P+1 by the first feature. This effectively normalizes the features, since the first feature, being a projection onto e_1, is a power estimate for the segment. Lastly, the log of the first feature is taken. Mathematically, for CSIS we have

    w = A'y,
    z_1 = log(w_1),
    z_i = w_i / w_1,   i = 2, 3, ..., P+1.

3.3.7 J-function calculation

J-function contributions must be included for the FFT magnitude-squared, spectral normalization, matrix multiplication, and feature conditioning steps. See [7] for details of these class-specific modules.
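The CSIS projection and conditioning of sections 3.3.5 and 3.3.6 amount to only a few operations per spectral vector. A minimal sketch (the function name is ours; we assume w_1 > 0, which holds since the first feature is a power estimate):

```python
import numpy as np

def csis_features(y, A):
    """Project a spectral vector y onto the subspace and condition:
    w = A'y;  z_1 = log(w_1);  z_i = w_i / w_1 for i = 2..P+1
    (sections 3.3.5 and 3.3.6)."""
    w = A.T @ y                  # subspace projection, as in (2)
    z = np.empty_like(w)
    z[0] = np.log(w[0])          # log of the power estimate
    z[1:] = w[1:] / w[0]         # normalize remaining features by power
    return z
```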
3.3.8 PDF modeling and Classification

We used a simple multivariate Gaussian PDF model, or equivalently a Gaussian mixture model (GMM) with a single mixture component. We assume independence between the members of the sequence within a given utterance, thus disregarding the time ordering. The log-likelihood value of a sample was obtained by evaluating the total log-likelihood of the feature sequence from the phoneme utterance. The reason we used such simplified processing and PDF models was to concentrate our discussion on the features themselves. Classification was accomplished by maximization of log-likelihood across class models. For CSIS, we added the log J-function value to the log-likelihood value of the GMM [2], implementing (6) in the log domain.
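The scoring of section 3.3.8 can be sketched as follows: a single-component Gaussian scored independently over the frames of an utterance, with the log J-function value added before the cross-class comparison. This is illustrative only; `log_j_list` stands in for the summed per-stage J-function contributions of section 3.3.7, and all names are ours:

```python
import numpy as np

def gaussian_loglik(Z, mean, cov):
    """Total log-likelihood of feature frames Z (K x D) under a single
    multivariate Gaussian, with frames treated as independent."""
    D = Z.shape[1]
    diff = Z - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sum(-0.5 * (D * np.log(2 * np.pi) + logdet + quad))

def classify(Z_list, models, log_j_list):
    """Pick argmax_m [ log J_m + log p^(z_m|H_m) ]: equation (6) in the
    log domain.  Z_list[m] is the class-m feature sequence for this
    utterance; models[m] is (mean, cov); log_j_list[m] is log J_m."""
    scores = [gaussian_loglik(Z, mu, cov) + lj
              for Z, (mu, cov), lj in zip(Z_list, models, log_j_list)]
    return int(np.argmax(scores))
```

Note that each class scores its own feature sequence Z_list[m]; it is the added log J-function term that makes these differently-dimensioned scores comparable.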
4. EXPERIMENTAL RESULTS

4.1 Data Description

We selected fourteen phonemes for our experiments. For each phoneme, we chose a set of from four to seven individual speakers of the same sex. We selected phoneme/speaker combinations that had large numbers of utterances per speaker, with a minimum of ten utterances per speaker. Thus, each phoneme set consisted of about 60 utterances. Phoneme sets were arranged into seven pairs for use in two-phoneme individual-speaker experiments.

4.2 Basis Function optimization

4.2.1 Validation of Assumptions

An important experiment to perform is to validate the assumption used in section 2.2, that maximizing L (equation 7) can be achieved by maximizing Q in equation (9). Although space does not permit presenting the results, we have obtained overwhelming evidence that the second term in (8) does in fact dominate.

4.2.2 Choice of FFT size and model order

The CSIS approach is parameterized by two parameters, the FFT size N and the model order P. The MFCC method is parameterized by the FFT size N and the number of MEL bands N_c. We chose to use the same value of N for MFCC and CSIS. This ensured that the only significant difference between MFCC and CSIS would be the ability to choose the matrix A_m as a function of class, thanks to the PPT. Feature conditioning is also different, but is not expected to contribute greatly to performance differences. For fair comparison, we selected the FFT size to maximize the performance of MFCC, which turned out to be N=96. For MFCC, we always used the optimum N_c=10.

For CSIS, we are left with deciding on the model order P. Refer to figure 1, in which we see the total log-likelihood L of speaker MGRL0, phoneme "N", as a function of P. Even-odd cross-validation is used (section 3.2). Note that the likelihood increases up to P=5, then exhibits a steep decline. This suggests that a dimension-5 subspace is optimal to represent this speaker/phoneme combination. For the individual speaker experiments, we chose the model order for each speaker/phoneme combination in the same way.

Figure 1: Total log-likelihood (with even-odd cross-validation) as a function of P for speaker MGRL0, phoneme "N", with CSIS.

To address the phoneme-class experiments we will need to expand the data to include all speakers of a given phoneme. We expanded the data to all male speakers of "N" and attained a peak at P=8. This indicates that an increase in subspace dimension is required. In phoneme-class experiments, we used a constant value of P=8 for all phoneme classes.

4.3 Classification Experiments

We conducted seven individual speaker experiments, each involving two phonemes (see section 4.1).

4.3.1 Performance metrics

Because in each experiment we used a number of individual speakers of each phoneme, it is possible to measure both inter-speaker errors (speaker identity errors) and inter-phoneme errors. We define the following performance metrics:

1. E_c is the confusion matrix error metric, which is the sum of the off-diagonal elements of the confusion matrix. Thus, it is a measure of speaker identity errors without regard to the phoneme.

2. Inter-phoneme error E_ip counts the number of inter-phoneme errors.

All of our experiments used strict separation between training and testing data (section 3.2). In all cases, data was separated into even and odd events (utterances). First all models were trained on odd events (events 1,3,5, etc.) and tested on even events (2,4,6, etc.); then all models were trained on even events and tested on odd events. The errors were added to obtain the aggregate error count.

4.3.2 Two-Phoneme Experiments

The two-phoneme experiments were designed to test the ability to distinguish speakers of a given phoneme as well as classify two phonemes in a limited multi-speaker environment. In each of the seven two-phoneme experiments, we tested both CSIS and MFCC under two conditions. In single-speaker (SS) classifier training, we separately trained a model on each speaker/phoneme combination. In phoneme-class (PC) classifier training, we grouped all speakers of a given phoneme into a single phoneme class. For the SS classifiers, we measured E_c, which included all errors, and E_ip, which counts only inter-phoneme errors. For the PC classifiers, we could only measure E_ip.

To provide the most meaningful performance comparison, we optimized the performance of MFCC by finding the best combination of parameters N and N_c over all seven experiments. Metric E_ip was at a minimum at N_c=10, N=96. For E_c, it was close to the minimum at the same parameter setting. Thus, we chose N_c=10, N=96 as the benchmark for comparison.

The seven experiments tested phonemes "IY" versus "EH", "AE" versus "EH", "R" versus "L", "AX" versus "AXR", "IX" versus "IH", "N" versus "M", and "DCL" versus "TCL". Between four and seven speakers per phoneme were used, with an average of about 12 utterances per speaker/phoneme combination. The results are plotted in figure 2. First, CSIS-SS, with P chosen separately for each class, performed generally better than CSIS-SS(5), which uses a model order fixed at P=5. This indicates that individually optimized model order is better. The fact that the model orders were determined individually, without regard to other classes, is an important observation. In comparison to MFCC-SS, CSIS-SS achieved a lower E_c in all experiments. As a means of comparison, MFCC produced higher values of E_c by 14, 22, 38, 22, 7.5, 59, 12 and 12 percent, an average of 25.5 percent higher. Using the E_ip error metric, for which we have no space to report detailed results, there was not much difference between CSIS-SS and MFCC-SS. For multi-speaker training, MFCC-PC was consistently better than CSIS-PC.
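The metrics of section 4.3.1 reduce to simple confusion-matrix bookkeeping over the (speaker, phoneme) classes. A sketch, with names of our own choosing:

```python
def error_metrics(true_ids, pred_ids, phoneme_of):
    """E_c: total misclassifications, i.e. the sum of the off-diagonal
    entries of the confusion matrix over (speaker, phoneme) classes.
    E_ip: errors where the predicted class is a different phoneme.
    phoneme_of maps a class id to its phoneme label."""
    e_c = sum(t != p for t, p in zip(true_ids, pred_ids))
    e_ip = sum(phoneme_of[t] != phoneme_of[p]
               for t, p in zip(true_ids, pred_ids))
    return e_c, e_ip
```

By construction E_ip <= E_c, since every inter-phoneme error is also an off-diagonal confusion count.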
The experimentnumber “EH” , “R” versus “L” , “AX” versus “AXR”, “IX” versus isontheXaxisinthefollowingorder: fmem0fceg0mkag0 “IH”, “N” versus“M” , “DCL” versus“TCL”. CSIS-SS(5) fapb0 mcxm0 mmea0 fdaw0 mgrl0 mkdd0 msat1 mbma1 indicatesCSISwithmodelorderfixedatP=5. mprk0 fklh0 mjma0 mbth0 mbcg0 mmlm0, which is in or- derofincreasingMFCCerror. CSIS(5)indicatesCSISwith modelorderfixedatP=5. thanMFCC exceptin two experimentswhereit doesworse and one where it is the same. The total number of errors to clusterspeakersinto like-soundinggroups, which can be across the seventeen experiments was 435 for MFCC. We representedbyseparatelow-dimensionalCSISmodels. first tried CSIS with model order fixed at P=5 (indicated as CSIS(5) in the figure) and acheived total errors of 385. WethenselectedPindividuallybymaximizingthetotallog- REFERENCES likelihood(section 4.2.2)and acheived 361 errors, a reduc- [1] J. W. Picone, “Signal modeling techniques in speech tion of 6.5 percent and an improvementof 20 percent over MFCC. This is significant because in addition to matrix A recognition,” Proceedings of the IEEE, vol. 81, no. 9, pp.1215–1247,1993. beingafunctionofclass,thefeaturedimensionisalsoafunc- tionofclass. [2] P.M.Baggenstoss,“ThePDFprojectiontheoremandthe class-specific method,” IEEE Trans Signal Processing, pp.672–685,March2003. 4.4 DiscussionofResults [3] H. Watanabe and S. Katagiri, “Hmm speech recognizer We can drawsomemeaningfulconclusionsfromtheexper- based on discriminative metric design,” in Proc. 1997 iments. First, we see that both in discriminating phonemes IEEE International Conference on Acoustics, Speech, of a given speaker and in discriminating speakers of a andSignalProcessing(ICASSP97),vol.4,1997. given phoneme, CSIS is clearly better than MFCC. On the [4] M. K. Omar and M. Hasegawa-Johnson, “Strong-sense otherhand,MFCCisgenerallybetterinspeaker-independent class-dependent features for statistical recognition,” in phonemediscrimination. 
The reason may lie in the shrink- Proc. IEEE Workshop on Statistical Signal Processing, ingofthelinearsubspaceaswerestrictourselvestoasingle vol.4,pp.490–493,2003. speaker/singlephoneme.Whenthesubspaceislimited,CSIS maybeabletofindabetterstatisticalmodelofthedistribu- [5] S.M.Kay,A.H.Nuttall,andP.M.Baggenstoss,“Mul- tiuon. A second piece of evidence that supports this is the tidimensional probability density function approxima- factthat the highestimprovementof CSIS-SS overMFCC- tion for detection, classification and model orderselec- SSwasobtainedintheexperiment“N-vs-M”whichisoneof tion,” IEEE Trans. Signal Processing, pp. 2240–2252, themostdifficultproblemsinASR,anindicationthatCSIS Oct2001. produces a better PDF estimate at the center of the distri- [6] J. S. Garofolo, “Timit acoustic-phonetic continuous butions. Thus, when classes are more close to each other, speechcorpus,”LinguisticDataConsortium,1993. i.e. overlapped,thebetterPDFestimatewillbemoreimpor- [7] P.M.Baggenstoss,“Theclass-specificclassifier: Avoid- tant, becausethe optimaldecision boundaryis givenby the ing the curse of dimensionality (tutorial),” IEEE truelikelihoodratio. However,sinceMFCChasevolvedfor AerospaceandElectronicSystemsMagazine,specialTu- phonemediscrimination,itperformsbetterthanCSISinthe torialaddendum,vol.19,pp.37–52,January2002. inter-phomemeareas. Whentwophonemesareverysimilar, discriminationoccurs“nearthepeak”whereCSISperforms better. Future work should determine how can the strengths of bothCSISandMFCCbebestutilized.Theevidencewepro- videdsuggeststhatthe mostpromisingapproachforapply- ingCSIStomulti-speakerexperimentsmaylieintheability