EURASIP Journal on Audio, Speech, and Music Processing

Perceptual Models for Speech, Audio, and Music Processing

Guest Editors: Jont B. Allen, Wai-Yip Geoffrey Chan, and Stephen Voran

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2007 of “EURASIP Journal on Audio, Speech, and Music Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief
D. O’Shaughnessy, University of Quebec, Canada

Associate Editors
Adel M. Alimi, Tunisia
Jont B. Allen, USA
Xavier Amatriain, USA
Gérard Bailly, France
Martin Bouchard, Canada
Douglas S. Brungart, USA
Wai-Yip Geoffrey Chan, Canada
Dan Chazan, Israel
Mark Clements, USA
Christophe D’Alessandro, France
Roger Dannenberg, USA
Li Deng, USA
T. Eriksson, Sweden
Horacio Franco, USA
Q.-J. Fu, USA
Woon-Seng Gan, Singapore
Jim Glass, USA
Steven Greenberg, USA
R. Capobianco Guido, Brazil
R. Heusdens, The Netherlands
James Kates, USA
Tatsuya Kawahara, Japan
Yves Laprie, France
Lin-Shan Lee, Taiwan
Dominic Massaro, USA
Ben Milner, UK
Climent Nadeu, Spain
Elmar Noth, Germany
Hiroshi G. Okuno, Japan
Joe Picone, USA
Gerhard Rigoll, Germany
M. Sandler, UK
Thippur V. Sreenivas, India
Yannis Stylianou, Greece
S. Voran, USA
D. Wang, USA

Contents

Perceptual Models for Speech, Audio, and Music Processing, Jont B. Allen, Wai-Yip Geoffrey Chan, and Stephen Voran
Volume 2007, Article ID 12687, 2 pages

Practical Gammatone-Like Filters for Auditory Processing, A. G. Katsiamis, E. M. Drakakis, and R. F. Lyon
Volume 2007, Article ID 63685, 15 pages

An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition, Bhiksha Raj, Lorenzo Turicchia, Bent Schmidt-Nielsen, and Rahul Sarpeshkar
Volume 2007, Article ID 65420, 13 pages
Wideband Speech Recovery Using Psychoacoustic Criteria, Visar Berisha and Andreas Spanias
Volume 2007, Article ID 16816, 18 pages

Denoising in the Domain of Spectrotemporal Modulations, Nima Mesgarani and Shihab Shamma
Volume 2007, Article ID 42357, 8 pages

Perceptual Continuity and Naturalness of Expressive Strength in Singing Voices Based on Speech Morphing, Tomoko Yonezawa, Noriko Suzuki, Shinji Abe, Kenji Mase, and Kiyoshi Kogure
Volume 2007, Article ID 23807, 9 pages

Electrophysiological Study of Algorithmically Processed Metric/Rhythmic Variations in Language and Music, Sølvi Ystad, Cyrille Magne, Snorre Farner, Gregory Pallone, Mitsuko Aramaki, Mireille Besson, and Richard Kronland-Martinet
Volume 2007, Article ID 30194, 13 pages

The Effect of Listener Accent Background on Accent Perception and Comprehension, Ayako Ikeno and John H. L. Hansen
Volume 2007, Article ID 76030, 8 pages

Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2007, Article ID 12687, 2 pages
doi:10.1155/2007/12687

Editorial
Perceptual Models for Speech, Audio, and Music Processing

Jont B. Allen,1 Wai-Yip Geoffrey Chan,2 and Stephen Voran3

1 Beckman Institute, University of Illinois, 405 North Mathews Avenue, Urbana, IL 61801, USA
2 Electrical and Computer Engineering Department, Queen’s University, 99 University Avenue, Kingston, ON, Canada K7L 3N6
3 Institute for Telecommunication Sciences, 325 Broadway, Boulder, CO 80305, USA

Received 22 November 2007; Accepted 22 November 2007

Copyright © 2007 Jont B. Allen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

New understandings of human auditory perception have recently contributed to advances in numerous areas related to audio, speech, and music processing. These include coding, speech and speaker recognition, synthesis, signal separation, signal enhancement, automatic content identification and retrieval, and quality estimation. Researchers continue to seek more detailed, accurate, and robust characterizations of human auditory perception, from the periphery to the auditory cortex, and in some cases whole brain inventories.

This special issue on Perceptual Models for Speech, Audio, and Music Processing contains seven papers that exemplify the breadth and depth of current work in perceptual modeling and its applications.

The issue opens with “Practical gammatone-like filters for auditory processing” by A. G. Katsiamis et al. which contains a nice review on how to make cochlear-like filters using classical signal processing methods. As described in the paper, the human cochlea is nonlinear. The nonlinearity in the cochlea is believed to control for dynamic range issues, perhaps due to the small dynamic range of neurons. Having a time domain version of the cochlea with a built-in nonlinearity is an important tool in many signal processing applications. This paper shows one way this might be accomplished using a cascade of second-order sections. While we do not know how the human cochlea accomplishes this task of nonlinear filtering, the technique described here is one reasonable method for solving this very difficult problem.

B. Raj et al. apply perceptual modeling to the automatic speech recognition problem in “An FFT-based companding front end for noise-robust automatic speech recognition.” These authors describe efficient FFT-based processing that mimics two-tone suppression, which is a key attribute of simultaneous masking. This processing involves a bank of relatively wide filters, followed by a compressive nonlinearity, then relatively narrow filters, and finally an expansion stage. The net result is that strong spectral components tend to reduce the level of weaker neighboring spectral components, and this is a form of spectral peak enhancement. The authors apply this work as a preprocessor for a mel-cepstrum HMM-based automatic speech recognition algorithm and they demonstrate improved performance for a variety of low-SNR background noise conditions.

“Wideband speech recovery using psychoacoustic criteria” describes how a perceptual loudness criterion can be used advantageously in wideband speech coding. Authors V. Berisha and A. Spanias propose enhancing a narrowband speech coder by sending a few samples of the high band (4–8 kHz) spectral envelope, and these samples are selected according to a loudness criterion. They apply this perception-based technique to the standardized narrowband adaptive multirate (AMR-NB) speech coder and evaluate the results through subjective testing. One test compares this bandwidth extended AMR-NB speech (total bit rate 9.1 kbps) to conventional AMR-NB speech (total bit rate of 10.2 kbps). In spite of the lower total bit rate, listeners show a clear preference for the bandwidth extended speech.

Next is “Denoising in the domain of spectrotemporal modulations” where N. Mesgarani and S. Shamma examine the effectiveness of denoising speech signals using a spectrotemporal modulation decomposition proposed earlier by Chi, Ru, and Shamma. The decomposition is performed over two stages. First, the “early auditory system” maps the input speech signal to an auditory spectrogram. Then, the “central auditory system” decomposes the spectrogram into spectral and temporal modulations. N. Mesgarani and S. Shamma demonstrate that speech and different types of noise are well separated in the spectrotemporal modulation domain. Their denoising experiment, based on Wiener filtering in the modulation domain, shows their scheme to provide distinctively better speech quality than a conventional Wiener filtering scheme when the noise is stationary.

In “Perceptual continuity and naturalness of expressive strength in singing voice based on speech morphing,” T. Yonezawa et al. address the synthesis of expression in a singing voice with a specific focus on creating natural, continuous transitions between expressive strengths. They employ a speech morphing algorithm and subjective tests to identify a nonlinear morphing pattern that results in a nearly linear progression of perceived strength of expression. In additional subjective testing the authors verify that this perceived linear progression does indeed equate to a natural sound.

Next comes a very unusual article titled “Electrophysiological study of algorithmically processed metric/rhythmic variations in language and music.” Here S. Ystad et al. use event-related potentials (ERPs), which are small voltages recorded from the skin of the scalp, to study questions of meter, rhythm, semantics, and harmony in language and music. The key potential is called N400, which is known to relate to speech perception. They find that “language ERP analyses indicate that semantically incongruous words are processed independently of the subject’s attention.” This argues for automatic semantic processing. For the case of music they find that their ERP analyses show that “rhythmic incongruities are processed independently of attention.” Again, this argues for an “automatic processing of rhythm.”

Finally, A. Ikeno and J. H. L. Hansen consider a different form of perception in “The effect of listener accent background on accent perception and comprehension.” Their paper describes experiments where three classes of English speakers (US, British, and nonnative) perform an accent classification task and a transcription task using English speech recordings that include three different regional accents (Belfast, Cambridge, and Cardiff). For both tasks, significant effects are seen for listener accent background, speaker accent type, and the interaction of these two factors as well. In light of this and other experimental results, they conclude that accent perception must involve both speech perception and language processing.
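As a rough illustration of the companding chain summarized above for the paper by B. Raj et al. (wide filters, compression, narrow filters, expansion), the following Python sketch applies the same four steps directly to a magnitude spectrum. It is not the authors' algorithm: the function name `compand_spectrum`, the rectangular wide window, the single-bin narrow filter, and the exponent n = 0.5 are all assumptions chosen only to make the suppression effect visible.

```python
import math


def compand_spectrum(mag, half_width=3, n=0.5):
    """Toy spectral companding (illustrative assumption, not the published algorithm).

    mag        -- list of non-negative magnitude-spectrum values, one per bin
    half_width -- half-width, in bins, of the hypothetical "wide" analysis window
    n          -- compression exponent, 0 < n < 1; expansion uses 1/n
    """
    out = []
    for k, m in enumerate(mag):
        # Wide filter: root-sum-square envelope over the bins around k.
        lo = max(0, k - half_width)
        hi = min(len(mag), k + half_width + 1)
        wide_env = math.sqrt(sum(v * v for v in mag[lo:hi]))
        if m == 0.0 or wide_env == 0.0:
            out.append(0.0)
            continue
        # Compression: the gain is set by the wide-band envelope, so a strong
        # neighbor lowers the gain applied to a weak component.
        compressed = m * wide_env ** (n - 1.0)
        # Narrow filter (here a single bin) followed by expansion with 1/n.
        out.append(compressed ** (1.0 / n))
    return out


# Example input: a strong peak, a weak neighbor two bins away, and an isolated tone.
spectrum = [0.0] * 40
spectrum[10], spectrum[12], spectrum[30] = 1.0, 0.1, 0.5
companded = compand_spectrum(spectrum)
```

In this sketch an isolated component passes through essentially unchanged, because compression and expansion cancel, while the weak component next to the strong peak is attenuated relative to it. That is the spectral peak enhancement behavior the editorial describes, under the simplifying assumptions stated above.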
We hope that this diverse collection of works serves to inform readers about current successes and also to inspire them in further efforts to model the various attributes of human auditory perception, or apply such models to the important open problems in speech, audio, and music processing.

ACKNOWLEDGMENTS

The guest editors extend thanks to all of the authors and reviewers who have made this special issue possible.

Jont B. Allen
Wai-Yip Geoffrey Chan
Stephen Voran

Hindawi Publishing Corporation
EURASIP Journal on Audio, Speech, and Music Processing
Volume 2007, Article ID 63685, 15 pages
doi:10.1155/2007/63685

Research Article
Practical Gammatone-Like Filters for Auditory Processing

A. G. Katsiamis,1 E. M. Drakakis,1 and R. F. Lyon2

1 Department of Bioengineering, The Sir Leon Bagrit Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
2 Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Received 10 October 2006; Accepted 27 August 2007

Recommended by Jont B. Allen

This paper deals with continuous-time filter transfer functions that resemble tuning curves at a particular set of places on the basilar membrane of the biological cochlea and that are suitable for practical VLSI implementations. The resulting filters can be used in a filterbank architecture to realize cochlea implants or auditory processors of increased biorealism. To put the reader into context, the paper starts with a short review on the gammatone filter and then exposes two of its variants, namely, the differentiated all-pole gammatone filter (DAPGF) and one-zero gammatone filter (OZGF), filter responses that provide a robust foundation for modeling cochlea transfer functions. The DAPGF and OZGF responses are attractive because they exhibit certain characteristics suitable for modeling a variety of auditory data: level-dependent gain, linear tail for frequencies well below the center frequency, asymmetry, and so forth. In addition, their form suggests their implementation by means of cascades of N identical two-pole systems, which renders them excellent candidates for efficient analog or digital VLSI realizations. We provide results that shed light on their characteristics and attributes and which can also serve as “design curves” for fitting these responses to frequency-domain physiological data. The DAPGF and OZGF responses are essentially a “missing link” between physiological, electrical, and mechanical models for auditory filtering.

Copyright © 2007 A. G. Katsiamis et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

For more than twenty years, the VLSI community has been performing extensive research to comprehend, model, and design in silicon naturally encountered biological auditory systems and more specifically the inner ear or cochlea. This ongoing effort aims not only at the implementation of the ultimate artificial auditory processor (or implant), but also to aid our understanding of the underlying engineering principles that nature has applied through years of evolution. Furthermore, parts of the engineering community believe that mimicking certain biological systems at architectural and/or operational level should in principle yield systems that share nature’s power-efficient computational ability [1]. Of course, engineers bearing in mind what can be practically realized must identify what should and what should not be blindly replicated in such a “bioinspired” artificial system. Just as it does not make sense to create flapping airplane wings only to mimic birds’ flying, it seems equally meaningful to argue that not all operations of a cochlea can or should be replicated in silicon in an exact manner. Abstractive operational or architectural simplifications dictated by logic and the available technology have been crucial for the successful implementation of useful hearing-type machines.

A cochlea processor can be designed in accordance with two well-understood and extensively analyzed architectures: the parallel filterbank and the traveling-wave filter cascade. A multitude of characteristic examples representative of both architectures have been reported [2–6]. Both architectures essentially perform the same task; they analyze the incoming spectrum by splitting the input (audio) signal into subsequent frequency bands exactly as done by the biological cochlea. Moreover, transduction, nonlinear compression, and amplification can be incorporated in both to effectively model inner- and outer-hair-cell (IHC and OHC, respectively) operation, yielding responses similar to the ones observed from the biological cochleae. Figure 1 illustrates how basilar membrane (BM) filtering is modeled in both architectures.

2. MOTIVATION: ANALOG VERSUS DIGITAL

Hearing is a perceptive task and nature has developed an efficient strategy in accomplishing it: the adaptive traveling-wave amplifier structure. Bioinspired analog circuitry is capable of mimicking the dynamics of the biological prototype with ultra-low power consumption on the order of tens of μW (comparable to the consumption of the biological cochlea). Comparative calculations would show that opting for a custom digital implementation of the same dynamics would still cost us considerably more in terms of both silicon area and power consumption [7]; power consumption savings of at least two orders of magnitude and silicon area savings of at least three can be expected should ultra-low power analog circuitry be used effectively. This is due to the fact that, in contrast to the power-hungry digital approaches, where a single operation is performed out of a series of switched-on or -off transistors, the individual devices are treated as analog computational primitives; operational tasks are performed in a continuous-time analog way by direct exploitation of the physics of the elementary device. Hence, the energy per unit computation is lower and power efficiency is increased. However, for high-precision simulation, digital is certainly more energy-efficient [8].

Apart from that, realizing filter transfer functions in the digital domain does not impose severe constraints and tradeoffs on the designer apart from stability issues. For example, in [9], a novel application of a filtering design technique that can be used to fit measured auditory tuning curves was proposed. Auditory filters were obtained by minimizing the squared difference, on a logarithmic scale, between the measured amplitude of the nerve tuning curve and the magnitude response of the digital IIR filter. Even though this approach will shed some light on the kind of filtering the real cochlea is performing, such computational techniques are not suited for analog realizations.

Moreover, different analog design synthesis techniques (switched-capacitor, Gm-C, log-domain, etc.) yield different practical implementations and impose different constraints on the designer. For example, it is well known that realizing finite transmission zeros in a filter’s transfer function using the log-domain circuit technique is a challenging task [10].

As such, and with the filterbank architecture in mind, finding filter transfer functions that have the potential for an efficient analog implementation while grasping most of the biological cochlea’s operational attributes is the focus of this and our ongoing work. It goes without saying that the design of these filters in digital hardware (or even software) will be a much simpler task than in analog.

3. COCHLEA NONLINEARITY: BM RESPONSES

The cochlea is known to be a nonlinear, causal, active system. It is active since it contains a battery (the difference in ionic concentration between scala vestibuli, tympani, and media, called the endocochlear potential, acts as a silent power supply for the hair cells in the organ of Corti) and nonlinear as evidenced by a multitude of physiological characteristics such as generating otoacoustic emissions.

In 1948, Thomas Gold (22 May 1920–22 June 2004), a distinguished cosmologist, geophysicist, and original thinker with major contributions to theories of biophysics, the origin of the universe, the nature of pulsars, the physics of the magnetosphere, the extraterrestrial origins of life on earth, and much more, argued that there must be an active, undamping mechanism in the cochlea, and he proposed that the cochlea had the same positive feedback mechanism that radio engineers applied in the 1920s and 1930s to enhance the selectivity of radio receivers [11, 12]. Gold had done army-time work on radars and as such he applied his signal-processing knowledge to explain how the ear works. He knew that to preserve signal-to-noise ratio, a signal had to be amplified before the detector. “Surely nature cannot be as stupid as to go and put a nerve fiber, the detector, right at the front end of the sensitivity of the system,” Gold said. Gold had his idea back in 1946, while a graduate astrophysics student at Cambridge University, England. He spotted a flaw in the classical theory of hearing (the sympathetic resonance model) developed by Hermann von Helmholtz [13] almost a century before. Helmholtz’s theory assumed that the inner ear consists of a set of “strings,” each of which vibrates at a different frequency. Gold, however, realized that friction would prevent resonance from building up and that some active process is needed to counteract the friction. He argued that the cochlea is “regenerative,” adding energy to the very signal that it is trying to detect. Gold’s theories also daringly challenged von Békésy’s large-scale traveling-wave cochlea models [14] and he was also the first to predict and study otoacoustic emissions. Ignored for over 30 years, his research was rediscovered by a British engineer by the name of David Kemp, who in 1979 proposed the “active” cochlea model [15]. Kemp suggested that the cochlea’s gain adaptation and sharp tuning were due to the OHC operation in the organ of Corti.

Early physiological experiments (Steinberg and Gardner 1937 [16]) showed that the loss of nonlinear compression in the cochlea leads to loudness recruitment.1 Moreover, it can be shown that the dynamic range of IHC (the cochlea’s transducers) is about 60 dB, rendering them inadequate to process the achieved 120 dB of input dynamic range without signal compression. It is by now widely accepted that the 6 orders of magnitude of input acoustic dynamic range supported by the human ear are due to OHC-mediated compression.

1 Loudness recruitment occurs in some ears that have high-frequency hearing loss due to a diseased or damaged cochlea. Recruitment is the rapid growth of loudness of certain sounds that are near the same frequency of a person’s hearing loss.

Evidence for the cochlea nonlinearity was first given by Rhode. In his papers [17, 18], he demonstrated BM measurements yielding cochlea transfer functions for different input sound intensities. He observed that the BM displacement (or velocity) varied highly nonlinearly with input level. More specifically, for every four dB of input sound pressure level (SPL) increase, the BM displacement (or velocity) as measured at a specific BM place changed only by one dB. This compressive nonlinearity was frequency-dependent and took place only near the most sensitive frequency region, the peak of the tuning curve. For other frequencies, the system behaved linearly; that is, one dB change in input SPL yielded one dB of output change for frequencies away from the center frequency. In addition, for high input SPL, the