2496 IEEETRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.15,NO.8,NOVEMBER2007 Bandwidth Extension for Hierarchical Speech and Audio Coding in ITU-T Rec. G.729.1 Bernd Geiser, Student Member, IEEE, Peter Jax, Member, IEEE, PeterVary, Senior Member, IEEE, HervéTaddei,Member,IEEE, Stefan Schandl, MartinGartner, Cyril Guillaumé,and StéphaneRagot,Member,IEEE Abstract—Recommendation G.729.1 is a new ITU-T standard installedinfrastructure.Thisinteroperabilitycanbeguaranteed whichwasapprovedinMay2006.Thisrecommendationdescribes byimplementingahigh-qualityspeechandaudiocodingalgo- ahierarchicalspeechandaudiocodingalgorithmbuiltontopofa rithm“ontop”ofwidelydeployed“legacy”narrowbandstan- narrowbandcorecodec.Onechallengeinthecodecdesignisthe dards.Agatewaytothe“legacy”partofthenetworkcansimply generationofawidebandsignalwithaverylimitedadditionalbit rate (less than 2 kb/s). In this paper, we describe the respective discardtheadditional“high-quality”bitsandretainthebitsre- codec layer, which extends the transmitted acoustic bandwidth latedtothe“legacy”codec.Thus,thebitstreaminteroperability fromthenarrowbandfrequencyrange(50Hz–4kHz)tothewide- isensured.Hence,notranscodingisrequired,andnoadditional band frequencyrange(50Hz–7kHz). Theunderlying algorithm algorithmicdelayhastobeintroduced. uses a fairly coarse parametric description of the temporal and Furthermore,apartfrombandwidthscalability,bitratescal- spectralenergyenvelopesofthehighfrequencyband (4–7 kHz). This parameter set is quantized with a bit rate of 1.65 kb/s. At ability is a desirable feature for future VoIP infrastructure, es- thedecoderside,thehigh-frequencycomponentsareregenerated peciallyinhighlyheterogeneousnetworks.Incontrasttoother by appropriately shaping a synthetically generated “excitation existing multimode coding standards like the adaptive multi- signal.”Apartfromthealgorithmicdescriptionandadiscussion, ratewideband(AMR-WB)codec[2],wherethebitrateisnet- we state a complexity evaluation as well as some listening test work-controlledandselectedattheencoderside,ahierarchical results. coder generates a layered bitstream format where each addi- IndexTerms—Bandwidthextension,hierarchicalbitstreamor- tional layer successively improves the audio fidelity at the re- ganization,widebandspeechcoding. ceiving terminal. The bit rate scalability is obtained by appro- priatelytruncatingthehierarchicalbitstream,anoperationthat I. INTRODUCTION onlyintroducesnegligiblecomplexityandrequiresnofeedback channeltotheencoder.Inotherwords,asynchronousrateadap- WITH THE steadily increasing number of Voice-over-IP tationbecomesdispensable. (VoIP)customers,atrendofthetelecommunicationnet- The hierarchical coding approach opens up a wide range of workinfrastructuretowardspacket-switchedservicescanbeob- newapplications.Afewexamplesaregivenhere.Withahier- served.Theavailablebitrateseasilyallowthetransmissionof archicalcodec,thenetworkoperatormay,ifdesired,reducethe tolltohigh-qualitywideband(50Hz–7kHz)speechandaudio network load at every single node by reducing the transmitted signals. bit rate. Furthermore, serving users with different connections However,thecrucialfactorforthelarge-scaledeploymentof and/or terminal equipments within a telephone or video con- widebandcodingtechniquesinpacket-switchednetworksisthe ferencebecomespossiblewithoutmajortranscodingoverhead. interoperabilitywithexistingstandardsaswellaswithalready Someuserswillonlyunderstandthecorebitstreamofthehier- archicalcodecbutotherscandecodeasignalofhigherquality. Ifthecodecisdesignedsuchthatthecomputationalcomplexity ManuscriptreceivedNovember14,2006;revisedJuly10,2007.Theassociate editorcoordinatingthereviewofthismanuscriptandapprovingitforpublica- decreaseswithadecreaseofthebitrate,hierarchicalcodingmay tionwasDr.Hong-GooKang. becomeinterestingforsavingbatterylifeinmobiledevices.Fi- B.GeiserandP.VaryarewiththeInstituteofCommunicationSystemsand nally,for storageapplications,userscould, forexample,listen DataProcessing(IND),RWTHAachenUniversity,52056Aachen,Germany (e-mail:[email protected];[email protected]). to their voice mail box from different kinds of terminals and P. Jax was with the Institute of Communication Systems and Data Pro- still receive messages with the best possible quality. Besides, cessing (IND), RWTH Aachen University, 52056 Aachen, Germany. He is when the mail box gets full, it is very easy to reduce the re- nowwithThomsonCorporateResearch,D-30625Hannover,Germany(e-mail: [email protected]). quiredstoragecapacitybyjustcuttingoutsomepartsfromthe H. Taddei was with the Siemens AG, 80333 Munich, Germany. He is bitstreamofthestoredmessages. now with Nokia Siemens Networks, 81739 Munich, Germany (e-mail: Ahierarchicalcodecasdescribedabovehasbeendeveloped [email protected]). S. Schandl is with Siemens AG, 1210 Vienna, Austria, (e-mail: within the scope of the ITU-T under the acronym G.729EV [email protected]). (EV stands for “embedded variable bit rate”). It has recently M.GartnerwaswithSiemensAG,80333Munich,Germany.Heisnowwith beenstandardizedandpublishedasITU-TRec.G.729.1[1]and, theSoftwareDevelopmentDepartment,IntertexDataAB,SE-17444Sundby- berg,Sweden(e-mail:[email protected]). equivalently, as Annex J to Rec. G.729. A good overview of C. Guillaumé and S. Ragot are with France Télécom R&D/TECH/SSTP, G.729.1isprovidedin[3].Thename“G.729.1”hasbeengiven 22307 Lannion Cedex, France (e-mail: [email protected]; togivethecoderabettervisibility.Withinthiscoder,thehierar- [email protected]). DigitalObjectIdentifier10.1109/TASL.2007.907330 chicalcodingconceptisrealizedbasedonaG.729CS-ACELP 1558-7916/$25.00©2007IEEE GEISERetal.:BANDWIDTHEXTENSIONFORHIERARCHICALSPEECHANDAUDIOCODINGINITU-TREC.G.729.1 2497 (ConjugateStructureAlgebraicCodeEcitedLinearPrediction) [4] compatible codec acting asthe corelayer ina hierarchical framework. Several additional bitstream layers gracefully in- crease the speech or audio quality; one of them is the “time domain bandwidth extension” (TDBWE) layer which the re- mainderofthisarticlewillfocuson. Fig.1. HierarchicalbitstreamorganizationoftheG.729.1coder.Thebracketed numbersdenotebitsper20-ms“superframe.” A. BandwidthExtensioninSpeechCodecs Whentakingacloserlookattoday’slow-tomid-ratewide- bandspeechoraudiocodingstandards,itcanbeobservedthat envelopes is transmitted in the “TDBWE layer” of the hierar- the decoder side synthesis of the high-frequency band (or ex- chical bitstream. The bit rate used for parameter quantization tensionband)isoftenbasedonarathersimplesignalmodelthe is1.65kb/s.Onthedecoderside,first,anartificial“excitation parametersofwhichareencodedwithaverylowbitrate. signal”withaconsistentpitchstructureisproducedbasedonthe In classic speech coding, the well-known autoregressive parametersofthe8and12kb/scodeclayers.Then,itstimeand speech production model is exploited and an appropriate frequency envelopes are consecutively shaped by gain manip- coding of the residual is implemented (cf. [5]). In contrast, ulationsandfilteringoperationstomatchthetransmittedpara- modern wideband codecs often omit the encoding of the metricdescription. residual for their respective extension band components, i.e., Note that the algorithm has been given the accentuating at- thispartoftheresidualhastobeartificiallyregeneratedbythe tribute“TimeDomain”inordertodifferentiateitfromthetrans- decoder.Thus,thetransmittedparametersetcanbekeptrather form domain processing in the time domain aliasing cancella- limited. Usually, just some coarse signal characteristics are tion (TDAC) part of the codec (cf. Section II). A preliminary describedtherein.Inmoresophisticated(andcomplex)coding versionoftheTDBWEalgorithmispartoftheG.729EVcandi- schemes,theextensionbandcanevenbesynthesizedbyreusing datecodecdescribedin[18].Therespectivealgorithmicdetails informationfromlowerfrequencycomponents[6].Thiscanbe havebeenpublishedin[19]and[20]. interpreted as artificial bandwidth extension (e.g., [7] and [8]) B. ArticleOverview whichissupportedbyasmallamountofsideinformation. Forinstance,inthedecoderoftheAMR-WBcodec[2],the This paper is structured as follows. Section II gives a com- extensionbandcomponents(6.4–7kHz)areregeneratedusing prehensiveoverviewoftheG.729.1codec,whereasSectionIII linearpredictivecodingtechniques.Asyntheticwhitenoiseex- goesintothealgorithmicdetailsoftheTDBWEscheme.Adis- citationsignalisspectrallyshapedbyanall-polesynthesisfilter cussion(SectionIV)andanevaluationincludingsomeofficial withacharacteristicthatisextrapolatedfromthelowbandsyn- ITU-Tlisteningtestresults(SectionV)aswellasadditionalin- thesisfilter.Thegainofthenoisyexcitationiseitherestimated ternallisteningtestresultsfinallyleadtotheconclusion. or,forthehighestAMR-WBcodecmode,containedinthebit- stream.TheAMR-WBconcepthasbeensignificantlyextended II. ITU-TREC.G.729.1:CODECOVERVIEW intheAMR-WB+codec[9].Here,theextensionbandismuch ThespeechandaudiocoderdescribedinITU-TRec.G.729.1 larger(e.g.,4–8kHzifthesamplingfrequencyis16kHz),and [1] is a scalable wideband extension to the CS-ACELP nar- moresideinformation(synthesisfiltercoefficientsandcorrec- rowbandcodecfromITU-TRec.G.729[4].Itisscalableboth tiongainfactors)istransmittedtosupportthebandwidthexten- with respect to decoded signal bandwidth and bit rate. This is sion in the decoder. Another related approach is found in the achievedbymeans ofa hierarchicalbitstreamorganization as EnhancedaacPluscodec[10]whichsplitsthewidebandspeech illustratedinFig.1. or audio signal into frequency subbands by means of a com- The first codec layer (the “core layer”) corresponds to a bit plex-valued64-channelfilterbank.Forthehigh-frequencyfilter rateof8kb/s.Therespectivepartofthebitstreamiscompliant bankchannels,parametriccodingofthesubbandsignalcompo- with G.729, which makes G.729.1 fully interoperable with nentsisemployedusingseveraldetectorsandestimatorstocon- G.729 at 8 kb/s. A second layer enhances the narrowband trolthebitstreamcontents.Thiscollectionofparametriccoding speechqualitywitha“cascaded”codeexcitedlinearprediction toolsistermed“spectralbandreplication”[11],[12]. (CELP) coder stage which provides an additional fixed code- Apart from the standardized solutions, several other pro- book contribution. This stage consumes a bit rate of 3.9 kb/s posalsforspeechandaudiocodingalgorithmswithasimplified (78bits/20ms)plusanadditionalrateof0.1kb/s(2bits/20ms) extension band model and/or bandwidth extension techniques which is useful to the frame erasure concealment (FEC) al- arefoundintheliterature,e.g.,[13]–[17]. gorithm in the decoder. Layer 3 of the bitstream comprises The Time Domain Bandwidth Extension scheme which we some moreFEC bits (0.35 kb/s)and the TDBWE information introduce in this paper also follows these paradigms. In the (1.65 kb/s), which is used to synthesize the high frequency G.729.1codec,thenarrowbandfrequencyrange(50Hz–4kHz) components. Thus, starting at Layer 3 (net rate: 14 kb/s), isencodedbyatwo-stagenarrowbandcodecusingabitrateof the G.729.1 can produce a wideband signal. Layer 4 adds kb/s kb/s,whereastheextensionband(4–7kHz)is some final FEC information (0.25 kb/s) and the first TDAC synthesizedinthedecoderusingabandwidthextensionscheme. bits (1.75 kb/s). The remaining eight TDAC layers contribute Therefore,acoarseparametricdescriptionoftherespectivefre- 2kb/seach.TheTDACisamodifieddiscretecosinetransform quency components in terms of temporal and spectral energy (MDCT) domain [21] predictive transform coder which can 2498 IEEETRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.15,NO.8,NOVEMBER2007 Fig.3. G.729.1decoderoverview(w/oFEChandling)—solidlines:timedo- mainsignalflow,dashedlines:parameters. plingfrequencyoftheembeddedG.729hasbeenmade;hence, Fig. 2. G.729.1 encoder overview—solid lines: time domain signal flow, the produced bitstream at 8 kb/s is fully understandable by a dashedlines:parameters. legacyG.729 decoder. The high-band (HB) signal is analyzed every20msbytheTDBWEblock.Then,theTDACstage,using 40-mswindowswith50%overlap(20-msframeadvance),en- successively refine the wideband speech or audio quality. In codes the residual error in the low band and the preprocessed total,theG.729.1bitstreamcomprises12layers,corresponding inputsignal inthehighband.Theresidualerrorinthe to12hierarchicallyorganizedcodecmodes.Itshighestbitrate low band corresponds to the subtraction between the suitably is 32kb/s. aligned original low band signal and the reconstructed output from the localdecoder ofthe embedded CELP codec. Finally, A. Encoder someinformationwhichisbeneficialforframeerasureconceal- A high-level signal flow chart of the G.729.1 encoder is mentisaddedbytheFECencoder.ThisisimportantasG.729.1 shown in Fig. 2. Its input is a wideband audio signal which ismainlytargetedforapplicationsinpacket-switchednetworks. is sampled at kHz.1 This signal is segmented into so-calledsuperframesof20-mslength B. Decoder with Fig.3depictstheG.729.1decodersignalflow.Duetothehi- where the superframe index is omitted for notational conve- erarchicalcodingconcept,itsoperationdependsontheamount nience. These superframes comprise two frames of length 10 ofbitswhichhavebeenreceivedforthecurrentsuperframe,i.e., ms. Each of these frames, again, comprises two subframes of onthecurrentlyreceivedbitrate .For and12kb/s,only length5ms.Theglobalprocessingisdoneonthebasisof20-ms the CELP branch of the decoder is active and, after post-pro- superframes,i.e.,G.729.1usesa20-msframing. cessing and QMF synthesis, a narrowband signal at wideband TheG.729.1coderisbasedonasplitbandstructuresimilar sampling frequency kHz2 is output as . As toITU-TRec.G.722[22].Thus,thewidebandinput soon as the third bitstream layer is available ( kb/s, cf. is split into two subband signals of 4-kHz bandwidth each by Fig.1),theTDBWEdecoderisactivatedandthehigh-bandsyn- means of a quadrature mirror filter (QMF) filter bank (e.g., thesis isproduced.Thus,afterspectralmirroringand [5]), before further processing on a decimated time scale QMFsynthesis,awidebandoutputsignal isavailable. kHz is carried out. These two subband signals are Starting at kb/s, the TDAC decoder refines the wide- preprocessedbysuitableellipticinfiniteimpulseresponse(IIR) bandsignal.Therefore,itslowbandoutputisaddedtothede- filterstoremoveunwantedfrequencycomponents.Inaddition, coded CELP signal. In the high-frequency band the TDBWE the high band is spectrally mirrored by a multiplication with signalisreplacedbytheTDACsubbandsthatcouldbeproduced to obtain a more natural signal representation. The at the received bit rate . Alternatively, i.e., for nonreceived resultingsubbandsignals(ortherespectivesuperframes)are TDAC subbands, the TDBWE synthesis is scaled according to the TDAC spectral envelope. Since the MDCT transform with whichisusedintheTDACcoderusesanadditionallook-ahead of 20 ms, i.e., a relatively large 40-ms signal window is ex- and are further processed by the G.729.1 en- ploited therein, pre- and post-echo artifacts may be produced codingblocks.Thelowband(LB)signalisencodedona10-ms depending on the signal and on the quantizer employed. Con- frame basis utilizing the embedded G.729 compatible CELP sequently,appropriateprocessingblocksforpre-andpost-echo codec and an additional cascaded CELP stage. No modifica- reductionareintroducedtotacklesuchsituations. tionofthebitallocationinthebitstream,framesize,andsam- 2Again,Rec.G.729.1alsodefinesamodefornarrowbandoutput,i.e.,are- 1Rec.G.729.1alsodefinesamodefornarrowbandinput,i.e.,f =8kHz. ducedoutputsamplingfrequencyf =8kHzisusedandtheQMFsynthesis Inthiscase,thebandsplitfiltering,i.e.,theQMFanalysis,isomitted. isomitted. GEISERetal.:BANDWIDTHEXTENSIONFORHIERARCHICALSPEECHANDAUDIOCODINGINITU-TREC.G.729.1 2499 Fig.4. TDBWEencoder:parameterextractionandquantization. Fig. 5. Example of the time envelope representation. The shown envelope valuesarecomputedby 2 =10,thequantizedenvelopeisillustratedusinga III. TDBWE negativesign. Thissectionprovidesanin-depthdescriptionoftheTDBWE algorithm from the G.729.1 coder. The TDBWE encoder op- erates on the downsampled kHz and preprocessed (low-pass with kHz) high-band signal . Note that,owingtothedownsamplingandprefiltering,thehigh-band signal comprises frequencies between 0 and 3 kHz. These frequencies correspond to the original high-band range of4–7kHz. In the following, Sections III-A and B introduce the imple- mented parameter extraction and quantization methods while SectionsIII-C–Gdescribethedetailsofthedecoderalgorithms. Fig.6. Windowfunctionw (n)forthefrequencyenvelopecomputation. The output of the TDBWE decoder is the downsampled high- bandsynthesissignal . slightly asymmetric analysis window . This window is A. ParameterExtraction 128-taps(16-ms)longandisconstructedfromtherisingslope of a 144-tap Hann window, followed by the falling slope of a TheTDBWEencoderdepictedinFig.4extractsaparametric 113-tapHannwindow(cf.Fig.6) description of the high-band input signal . This para- metric description comprises a time envelope and a frequency envelope. Their computation is described subsequently. The (2) quantization scheme, also contained in Fig. 4, is detailed in SectionIII-B. Thewindowisconstructedsuchthatthefrequencyenvelope 1) TimeEnvelopeComputation: The20-msinputspeechsu- computation has a look-ahead of 16 samples (or 2 ms) and a perframe with is subdivided into look-backof32samples(or4ms). 16 segments of length 1.25 ms each, i.e., each segment com- Towindowthecurrentsuperframe,themaximumof is prisestensamples.The16timeenvelopeparameters with centeredonthesecond10-msframeofthecurrentsuperframe. arenowcomputedaslogarithmicsubframeen- Thewindowedsignalwith isthusgivenby ergies (3) (1) The frequency envelope parameters for the first part of the superframe are not computed. Instead, they are interpolated The binary logarithm has been chosen to at the decoder side between the transmitted parameters from ease an implementation in fixed-point arithmetic. Actually, it the current and from the previous superframe [see (35) in is reused in the TDAC module of G.729.1 and facilitates en- SectionIII-F]. ergyquantizationwitha“natural”stepsizeof dB.Thetime The windowed signal is now transformed via a envelope segment length of 1.25 ms has been chosen to con- discrete Fourier transform (DFT) of length 64. With this DFT ciselyrepresentthetemporalenergycharacteristicsofplosives length, the even bins of the full length 128-tap DFT are com- andtransientsinspeechsignals. putedasfollows: Anexampletimeenvelopewiththerespectivequantizedrep- resentation(seeSectionIII-B)isshowninFig.5. (4) 2) Frequency Envelope Computation: Here, the high-band frequency envelope is computed in terms of 12 subband en- ergies. For the computation of the respective parameters where .Equation(4)isimplementedusingthe with the signal is windowed by a fast Fourier transform (FFT) algorithm. Finally, the frequency 2500 IEEETRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.15,NO.8,NOVEMBER2007 TABLEI BITALLOCATIONFORTDBWEPARAMETERQUANTIZATION (8) is quantized with 5 bits using uniform 3-dB steps in the logarithmicdomain.Thisprocedureyieldsthequantizedvalue whichisnowsubtractedfromtheparameterset and (9) By this subtraction, the obtained values become independent from the overall signal level. Note that the parameter in factcorrespondstoageometricmeanofthesubframeenergies. However, no significant quality difference could be observed whenusingthearithmeticmeaninstead. Then,themeanremovedtimeenvelopeparametersetisgath- eredintwovectorsofdimension8 (10) whereasthefrequencyenvelopeparametersetformsthreevec- torsofdimension4 Fig.7. TDBWEdecoder:Overview. envelope parameter set is calculated as logarithmic weighted (11) subbandenergiesfor12evenlyspacedandequallywideover- lappingDFTdomainsubbandswithindex Finally,vectorquantizationbasedonpretrainedquantizationta- bles(codebooks)withthebitallocationfromTableIisapplied. The individual codebooks for , , , , and (5) havebeenobtainedbymodifyinggeneralizedLloyd–Maxcen- troidssuchthatacertainpairwisedistancebetweenthecentroids isguaranteed.Therefore,thecentroidsarerequantizedusinga Notethatthefrequencybinswithindices25–31arenotconsid- rectangulargridwithastepsizeof6dBinthelogarithmicdo- ered since they represent frequencies above 3 kHz. In (5), the main. The vectors and are quantized using the same frequencydomainweightingwindow isgivenas codebooktoreducestoragerequirements. (6) C. SignalSynthesis—Overview Thehigh-bandsignalsynthesisisperformedbytheTDBWE The thsubband starts at the DFT bin of index and spans a decoderandisbasedonthequantizedparametersetintroduced bandwidthofthreeDFTbins.Thiscorrespondstothephysical intheprecedingsection.Thereceivedparametersetisdecoded, subband division of which the respective subband boundaries andthedecodedmeanvalue isaddedinordertoobtainthe are givenby quantizedtimeandfrequencyenvelopes and (12) (7) kHz Additionally,theTDBWEdecoderusescertainparametersfrom Thephysicalbandwidthis Hzforeachsubbandbut theembeddedCELPlayers. thefirstone,whichamountsto Hz. Fig. 7 illustrates the concept of the TDBWE decoder: The decoded parameters are used to appropriately shape an artifi- B. Quantization ciallygeneratedexcitationsignal(SectionIII-D).Therefore,the ThequantizationoftheTDBWEparameterset(consistingof timeenvelopeofthegeneratedexcitationsignalisshapedasde- with and with ) is scribedinSectionIII-E,whereasthedesiredspectralcharacter- done via mean-removed split vector quantization (cf. Fig. 4). isticsarerestoredusingthefrequencyenvelopeshapingmecha- Therefore, we first calculate a mean time envelope value nismfromSectionIII-F.Finally,theTDBWEalgorithmimple- persuperframe mentsapostprocessingprocedure(seeSectionIII-G). GEISERetal.:BANDWIDTHEXTENSIONFORHIERARCHICALSPEECHANDAUDIOCODINGINITU-TREC.G.729.1 2501 (cid:127) the energy of the fixed codebook contributions from the coreandcascadeCELPlayers,computedaccordingto (13) where is the codevector from the fixed codebook of thecorelayerCELPcodecwithitsassociatedgainfactor ,while and aretherespectiveparametersfrom thecascadeCELPlayer; (cid:127) andtheenergyoftheembeddedCELPadaptivecodebook contributionwhichisgivenby Fig.8. TDBWEdecoder:Excitationsignalgeneration. (14) D. SignalSynthesis—ExcitationSignalGeneration with the vector from the adaptive codebook of the corelayerCELPcodecanditsassociatedgainfactor . Giventheseparametersfromthelowerbitstreamlayers,the TheTDBWEalgorithmaimsataconcisereproductionofthe excitationsignalgenerationisstructuredasfollows: high-frequencybandofspeechsignals.Usingtheparametricde- 1) estimationoftwogains and forthevoicedandun- scription from Section III-A, the coarse signal characteristics, voicedcontributionstotheexcitationsignal ; i.e.,thetimeandfrequencyenvelopes,canbereproduced.How- 2) pitchlagpostprocessing; ever,theremaybesignificantdifferencesinthefinestructureof 3) productionofthevoicedcontribution; thereconstructedspeech,dependingonthechoiceoftheexci- 4) productionoftheunvoicedcontribution; tation signal. 5) low-passfiltering. Thus, this “excitation signal,” which is the starting point Wespecifytheseindividualstepsinthefollowing. of the entire signal synthesis, has to fulfill the important 1) Estimation of Gains for the Voiced and Unvoiced requirement that its fine structure, especially the spectral Contributions: First, to get an initial estimate of the “har- fine structure, should closely resemble the fine structure of monics-to-noise” ratio, an instantaneous energy ratio of the the actual high-band speech signal. Assuming an idealized adaptive codebook and fixed codebook (including the cascade quasi-stationary speech production model, the excitation CELP fixed codebook) contributions is computed for each shouldmeetthefollowingcriteria. subframe (cid:127) Theexcitationsignalshouldingeneralbespectrallyflat. (cid:127) For voiced sounds, the excitation should contain har- (15) monics of the fundamental speech frequency , i.e., spectralpeaksatintegermultiplesof . In order to reduce the adaptive-to-fixed codebook power ratio (cid:127) Forunvoicedsounds,theexcitationmaybewhitenoise. in case of unvoiced sounds, a ”Wiener filter” characteristic is (cid:127) Mixed voiced/unvoiced sounds with an arbitrary “har- applied to monics-to-noise”energyratioshouldbepossible. (cid:127) The voiced contribution should not be dominant for high (16) frequenciesinordertoavoidso-called“overvoicing.”Typ- ically,theexcitationisnoisyforfrequenciesabove5–6kHz Thisleadstomoreconsistentunvoicedsounds.Finally,thegains (inthewidebandrange). forthevoicedandunvoicedcontributionsto canbede- The implemented excitation signal generator, depicted in termined. Therefore, an intermediate voiced gain is calcu- Fig. 8, replicates such behavior. Based on parameters from lated the embedded CELP and cascade CELP layers of the G.729.1 coder,weproduceahigh-bandexcitationsignalasaweighted (17) mixtureofnoise(unvoiced)andperiodic(voiced)components. Thelatterareproducedbyanoverlap-addofspectrallyshaped and suitably spaced glottal pulses. Thus, the natural speech With a gliding average of length 2, is slightly smoothed to characteristicisrepresentedratheraccurately. obtainthefinalvoicedgain Specifically,theTDBWEexcitationsignal isgener- atedona5-mssubframebasis.Therefore,thefollowingCELP (18) parameters which are transmitted in Layers 1 and 2 of the G.729.1bitstreamarereused: where is the intermediate voiced gain according to (17) (cid:127) theintegerpitchlag oftheembeddedCELPcodec; from the preceding subframe. The averaging of the squared (cid:127) therespectivefractionalpitchlag ; values favors a fast increase of in case of an unvoiced to 2502 IEEETRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.15,NO.8,NOVEMBER2007 voiced transition. To satisfy the constraint , the unvoicedgainisnowgivenby (19) 2) PitchLagPostprocessing: Theproductionofaconsistent pitch structure within the excitation signal requires a good estimate of the fundamental speech frequency of the speech production process or of its inverse, the pitch lag . WithinLayer1ofthebitstream,theintegerandfractionalpitch lagvalues and (cf.[4])areavailableforthefour5-ms subframesofthecurrentsuperframe.Foreachsubframe,thees- timationof isbasedontheseparameters.Theaimoftheen- coder-side pitch searchprocedure inthe CELP layer is tofind Fig.9. Pulseshapelookuptableforthevoicedcontributiontothesynthetic thepitchlagwhichminimizesthepowerofthelongtermpre- excitationsignals^ (n). diction (LTP) residual signal. That is, the LTP pitch lag is not necessarilyidenticalwith ,whichisarequirementforthecon- Note that this gliding average leads to a virtual precision en- cisereproductionofvoicedspeechcomponents.Themosttyp- hancementfromaresolutionof1/3to1/6ofasample.Finally, icaldeviationsarepitch-doublingandpitch-halvingerrors,i.e., thepostprocessedpitchlag isdecomposedintoitsintegerand the frequency corresponding to the LTP lag is half or double fractionalparts thatoftheoriginalfundamentalspeechfrequency.Inparticular, pitch-doubling(-tripling,etc.)errorshavetobestrictlyavoided and (25) here.Hence,thefollowingpostprocessingoftheLTPlaginfor- mationisused. 3) ProductionoftheVoicedContribution: Thevoicedcom- First,theLTPpitchlagforanoversampledtime-scaleisre- ponents oftheexcitationsignalare,accordingtothe constructedfrom and .Becausethefractionalresolu- discussion above, represented as shaped and weighted glottal tionofthepitchlagintheG.729.1CELPlayerisaspreciseas pulses.Inthefollowing,thesepulsesareindexedbytheglobal 1/3ofasample,theoversampledlagamountsto . “counter” . Hence, the voiced contribution is pro- Then an additional factor of 2 is considered such that an en- ducedbyoverlap-addofsingle-pulsecontributions hancedresolution[see(24)]canberepresented (26) (20) The (integer) factor between the currently observed LTP lag andthepostprocessedpitchlagoftheprecedingsubframe where is the gain factor for each pulse, is the pulse [see(23)]iscalculatedby3 position, and is the pulse shape. Thereby, the selection of the pulse shape depends on the “fractional pulse position” (21) .Thesefourparametersarederivedinthefollowing. Thepostprocessedpitchlagparameters and de- If the factor falls into the range , a relative error is terminethepulsespacingandthusthepulsepositionsaccording evaluated to (22) (27) If the magnitude of this relative error is below a threshold of , it is assumed that the current LTP lag is the result where is the (integer) position of the current pulse and ofabeginningpitch-doubling(-tripling,-quadruplication)error isthe(integer)positionofthepreviouspulse.Thefrac- phase.Thus,thepitchlagiscorrectedbydivisionbytheinteger tionalpartofthepulseposition factor ,therebyproducingacontinuouspitchlagbehaviorwith respecttothepreviouspitchlags (28) if (23) otherwise. servesasanindexforthepulseshapeselection.Theprototype pulseshapeswith and aretaken Then,amoderatelow-passfilter,realizedasamovingaverage fromalookuptablewhichisplottedinFig.9. withtwotaps,isappliedto Thepulseshapes arefilteredandresampledversionsof awideband(16-kHz)pulsefroma“typical”voicedspeechseg- (24) ment.Thesegmentwasselectedforitsspecificspectralcharac- 3bxcdenotesthehighestintegernumbernotgreaterthanx. teristicswhichavoidan“overvoicing”oftheexcitation(cf.dis- GEISERetal.:BANDWIDTHEXTENSIONFORHIERARCHICALSPEECHANDAUDIOCODINGINITU-TREC.G.729.1 2503 Fig.10. (a)Lowerbandspeechsignals (n)andparametersfortheexcitationsignalgeneration:Voicedgaing (solidline),unvoicedgaing (dashedline), : andpostprocessedpitchlagt =t +t =6.Theexamplespeechfragmentrepresentsanunvoiced/voiced/unvoicedtransition.(b)Exampleofthehigh-band excitationsignalgeneration:Voicedandunvoicedcontributionsaswellasthefinal(low-passfiltered)excitationsignal.Theparametersforthesignalfragment from(a)areused. cussionbelow).Sinceasamplingfrequencyof8kHzandares- where .Theimplementationoftherandomgen- olutionof1/6ofasampleistargetedforthegivenapplication, eratorisidenticalwiththerandomgeneratorusedintheG.729 theselectedpulsehasbeenupsampledto48kHzfirst.Thesix codec.Itproducesasignalofunitvariance. final pulse shapes have then been obtained by applying 5) Low-Pass Filtering: Having the voiced and unvoiced thefollowingoperations: contributions and , the final excita- (cid:127) low-pass filtering and decimation by a factor of 3 (with tion signal is obtained by low-pass filtering of threedifferentsubsamplingoffsets); . The 3-kHz low-pass filter is iden- (cid:127) high-pass filtering and decimation by a factor of 2 (with tical with the preprocessing low-pass filter for the high-band twodifferentsubsamplingoffsets); signalasshowninFig.2. (cid:127) spectralmirroring,i.e.,multiplicationby . To illustrate the excitation generation algorithm, Fig. 10(a) Note that spectral mirroring may give two different showstheparameters , ,and which resultsdependingonthestartingpositionofthepulse(evenor areobtainedfromthelow-bandspeechsignalsegment oddsampleindex).Thisfactisaccountedforinthepulsegain shownintheexample.Inparticular,itcanbeobservedthatthe calculation[cf.firstfactorin(29)]. pitchcontourevolvesverysmoothlyduringthe voicedperiod. Thegainfactors fortheindividualpulsesare,apartfrom The individual contributions to the excitation signal, the pro- thepositiondependentsigninversion,derivedfromthevoiced ductionofwhichisbasedontheseparameters,arevisualizedin gainparameter andfromthepitchlagparameters Fig.10(b). E. SignalSynthesis—TimeEnvelopeShaping even (29) The shaping of the time envelope of the excitation signal Here,thesquarerootensuresthatthevaryingpulsespacingdoes utilizesthereceivedanddecodedtimeenvelopeparam- nothaveanimpactontheresultingsignalenergy.Thefunction eters with toobtainasignal with even returns1iftheargumentisanevenintegernumberand atimeenvelopewhichis—exceptforquantizationnoise—iden- 0otherwise. ticaltothetimeenvelopeoftheencodersidehigh-bandsignal Withthedesigndescribedabove,thefullsubsampleresolu- .Thisisachievedbysimplescalarmultiplication tion of the pitch lag information can be utilized by a simple pulse shape selection. Further, the pulse shapes exhibit a cer- (31) tainspectralshapingwhichensuressmoothlyattenuatedhigher frequency components of the voiced excitation. This avoids a where . high-frequency “overvoicing.” Additionally, compared to unit Inordertodeterminethegainfunction ,theexcitation pulses,theappliedpulseshapesresultinastronglyreducedcrest signal issegmentedandanalyzedinthesamemanneras factoroftheexcitationsignalwhichleadstoanimprovedsub- describedinSectionIII-A1fortheparameterextractioninthe jectivequality. encoder.Theobtainedanalysisresultsare,again,timeenvelope 4) ProductionoftheUnvoicedContribution: Theunvoiced parameters with . They describe the ob- contribution isproducedusingthescaledoutputofa servedtimeenvelopeof .Thenapreliminarygainfactor whitenoisegenerator canbecalculatedvia random (30) (32) 2504 IEEETRANSACTIONSONAUDIO,SPEECH,ANDLANGUAGEPROCESSING,VOL.15,NO.8,NOVEMBER2007 Fig.11. “Flat-top”Hannwindowforthetimeenvelopeshaping. Fig.12. Filterbankdesignforthefrequencyenvelopeshaping. Now,foreachsignalsegmentthesegainfactorsareinterpolated usinga“flat-top”Hannwindow and (36) These gains are used to control the channels of a filter bank (33) equalizer. The individual channels are defined by their band- passfilterimpulseresponses ( and ) and by a complementary high-pass contribution . Thereby, and constitute linear phase which is plotted in Fig. 11. The interpolated version of finite-impulseresponse(FIR)filterswithagroupdelayof2ms is finally computed as shown in (34) at the bottom of (16 samples) each. Note that this delay exactly matches the the page, where the gain factor is taken from the last look-aheadwhichisintroducedbytheencodersideparameter 1.25-mssegmentoftheprecedingsuperframe. extraction(SectionIII-A2).Thefilterbankequalizerisdesigned The effect of the multiplicative signal shaping operation in suchthatitsindividualchannelbandwidthsmatchthesubband (31) is that the spectrum components of the excitation signal divisionwhichisgivenby(7). are modified by a cyclic convolution with the Fourier Inparticular,thedesignisbasedonaKaiser-typeprototype transform of the gain function . To limit this impact on low-passfilter[23] thespectrumcomponentstothelowestpossibleamount,thein- terpolationwindow isdesignedsuchthat exhibits sufficientlow-passcharacteristics. (37) F. SignalSynthesis—FrequencyEnvelopeShaping of length 33, i.e., , where . In (37), The received frequency envelope parameters with isthemodifiedBesselfunctionofthefirstkind.Theshape werecomputedontheencodersideonthelast10 parameter hasbeenchosenas4andthenormalizationfactor msofthe20-mssuperframe.Thefirst10-msframeiscoveredby issetto inordertoachieveaunityfrequencyre- parameterinterpolationbetweenthecurrentparameterset sponseatneutralfilterbankequalizergains.Giventheprototype andtheparameterset fromtheprecedingsuperframe low-pass ,theindividualfilterbankchannels’impulse responses arenowderivedbymodulationsthereof (35) (38) Analogously to the time envelope shaping, the input signal (time-shaped excitation signal) to the frequency envelope with and .Thecomplementary shaping is analyzed according to the description from high-pass isdefinedby SectionIII-A2.Thisisdonetwicepersuperframe,i.e.,forthe first10-msframe aswellasforthesecond10-msframe (39) within the current superframe. The procedure yields two observed frequency envelope parameter sets with andframeindex .Now,acorrection for .Thereby, isonefor andzero gainfactor persubbandofindex isdeterminedforthe otherwise.Therespectivefrequencyresponsesforthefilterbank first andforthesecond frame designaredepictedinFig.12. (34) GEISERetal.:BANDWIDTHEXTENSIONFORHIERARCHICALSPEECHANDAUDIOCODINGINITU-TREC.G.729.1 2505 possible here. Hence, to attenuate these artifacts, an adaptive amplitude compression is applied to . Each sample of withinthe th1.25-mssegmentiscomparedtothede- coded and suitably aligned time envelope and the amplitudeof iscompressedinordertoattenuatelarge deviationsfromthisenvelope.Thiscanbeinterpretedasaselec- tivecompensation of the temporal smearing that is introduced bythefrequencyenvelopeshaping.Inparticular,thesignalcom- pressionisspecifiedasfollows: Fig.13. Adaptiveamplitudecompressionfunction. (43) Torealizethefrequencyenvelopeshaping,twoFIRfiltersare constructedforeachsuperframe Thecompressionfunctionfrom(43)isdepictedinFig.13. (40) IV. DISCUSSION with and . These two filters, im- Recapitulating, our approach to high-band speech synthesis plementedintheirnontransposedform[23],areappliedtothe differs significantly from existing schemes as applied in the signal inordertoobtaintheshapedsignal .For AMR-WBorAMR-WB+codecs.Incontrasttosuchbandwidth thefirstframe(i.e., )thisgives extensionmethods,theTDBWEdoesnottransmitready-to-be- usedgainfactorsandfiltercoefficientsassideinformationbut (41) only desired time and frequency envelopes. Gain factors and filter coefficients are computed at the receiver. These compu- Likewise,forthesecondframe tations take the actual envelopes of the excitation signal into account. Hence, our method is robust against potential devia- tionsintheexcitationsignalwhichmayoccurduringandafter (42) framelosses.Theseparatedanalysis,transmission,andshaping oftimeandfrequencyenvelopesmakeitpossibletoachievea Filteringoperationslike(41)and(42)maydegradethesignal’s goodresolutioninbothtimeandfrequencydomain.Thisleads time envelope. The temporal energy distribution is potentially toagoodreproductionofbothstationarysoundsaswellastran- “smeared”overanintervalwhichcorrespondstothelengthof sientsignals.Forspeechsignals,especiallythereproductionof thefrequencyenvelopeshapingfilter(i.e.,33tapsor4.125ms). stop consonants and plosives benefits from the improved time However, the filter bank design ensures that this time spread resolution. is constrained and the signal’s time envelope is virtually pre- Afurtherdifferenceisthatwedonotuseanylinearpredic- served. Measurements prove that for about 95% of all frames tivecoding (LPC) techniques to carry out the frequency enve- morethan90%oftheenergyoftheimpulseresponses lopeshaping.Insteadofaconventionalall-poleLPCsynthesis is concentrated within an interval of 1.375 ms. This length filter, we use a linear phase finite-impulse response filter. We roughly corresponds to the time envelope’s resolution. For havefoundthatinthiscase,theamountofringingartifactslike the remainder of the frames, at least 70% of the impulse clicks and crackles that stem from strongly time-variant filter responses’ energy is concentrated within this interval. Viewed coefficients is much lower if the respective filtering operation fromaspectralperspective,therelativelywideandoverlapping isappliedtoasyntheticallygeneratedexcitationsignal.Minor frequency responses of the filter bank channels—shown in residualartifactsaretreatedbyanadaptivepostprocessingpro- Fig.12—guaranteethe preservationofthe timeenvelope.The cedure. Within the whole TDBWE algorithm, we have taken actualspeechqualitygainthatisobtainedwiththeimplemented special care to produce smooth transitions in the time as well timeenvelopeshapingisobjectivelymeasuredinSectionV-B. asinthefrequencydomain. The TDBWE scheme is also a very modular and flexible G. SignalSynthesis—AdaptiveAmplitudeCompression conceptassingleblocksinthereceivercaneasilybeexchanged Asopposedtocommonspeechcodecsthatprovideatrueen- and improved without need to alter the encoder side or the codingoftheirrespectiveresidualsignal(e.g.,CELP),thereis bitstream format. Different decoders can be supported which nostrictcouplingbetweentheTDBWEexcitationandthepara- reconstruct the wideband signal with different precision, de- metricTDBWEsignaldescription.Therefore,someresidualar- pending on the available computational power. Furthermore, tifacts(clicks)maybepresentinthesynthesizedsignal . the received time and frequency envelope parameters cannot In a CELP codec, for instance, such situations can be handled only be used for bandwidth extension purposes. In fact they by the explicit encoding of the residual. However, this is not may also support subsequent signal enhancement schemes
Description: