Progress in Speech Synthesis
Jan P.H. van Santen Richard W. Sproat
Joseph P. Olive Julia Hirschberg
Editors
Progress in Speech Synthesis
With 158 Illustrations
Springer
Jan P.H. van Santen
Bell Laboratories
Room 2D-452
600 Mountain Avenue
Murray Hill, NJ 07974-0636 USA

Richard W. Sproat
Bell Laboratories
Room 2D-451
600 Mountain Avenue
Murray Hill, NJ 07974-0636 USA

Joseph P. Olive
Bell Laboratories
Room 2D-447
600 Mountain Avenue
Murray Hill, NJ 07974-0636 USA

Julia Hirschberg
AT&T Research
Room 2C-409
600 Mountain Avenue
Murray Hill, NJ 07974-0636 USA

Library of Congress Cataloging-in-Publication Data
Progress in speech synthesis / Jan P.H. van Santen ... [et al.], editors.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4612-7328-8 ISBN 978-1-4612-1894-4 (eBook)
DOI 10.1007/978-1-4612-1894-4
1. Speech synthesis. I. Santen, Jan P.H. van.
TK7882.S65S6785 1996
006.5'4-dc20 96-10596
Printed on acid-free paper.
© 1997 Springer Science+Business Media New York
Originally published by Springer-Verlag New York, Inc. in 1997
Softcover reprint of the hardcover 1st edition 1997
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher, Springer Science+Business Media, LLC
except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even
if the former are not especially identified, is not to be taken as a sign that such names, as
understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely
by anyone.
Production managed by Natalie Johnson; manufacturing supervised by Jacqui Ashri.
Camera-ready copy prepared using LaTeX.
Printed and bound by Edwards Brothers, Inc., Ann Arbor, MI.
Additional material to this book can be downloaded from http://extras.springer.com.
9 8 7 6 5 4 3 2
ISBN 978-1-4612-7328-8
Preface
Text-to-speech synthesis involves the computation of a speech signal from input
text. Accomplishing this requires a system that consists of an astonishing range
of components, from abstract linguistic analysis of discourse structure to speech
coding.
Several implications flow from this fact. First, text-to-speech synthesis is
inherently multidisciplinary, as is reflected by the authors of this book, whose
backgrounds include engineering, multiple areas of linguistics, computer science,
mathematical psychology, and acoustics. Second, progress in these research areas
is uneven because the problems faced vary so widely in difficulty. For example,
producing flawless pitch accent assignment for complex, multi-paragraph textual
input is extremely difficult, whereas producing decent segmental durations given
that all else has been computed correctly is not very difficult. Third, the only way to
summarize research in all areas relevant for TTS is in the form of a multi-authored
book: no single person, or even small group of persons, has a sufficiently broad
scope.
The most important goal of this book is, of course, to provide an overview of
these research areas by having invited key players in each area to contribute to
this book. This will give the reader a complete picture of what the challenges and
solutions are and in what directions researchers are moving.
But an important second goal is to allow the reader to judge what all this work
adds up to: the scientific sophistication may be impressive, but does it really
produce good synthetic speech? The book attempts to answer this question in two
ways. First, we have asked authors to include results from subjective evaluations in
their chapters whenever possible. There is also a special section on perception and
evaluation in the book. Second, the book contains a CD-ROM disk with samples
of several of the synthesizers discussed. Both the successes and the failures are of
interest; the latter in particular, because it is unlikely that samples are included
demonstrating major flaws in a system. We thank Christian Benoit for suggesting
this idea.
A brief note on the history of this volume: In 1990, the First ESCA Workshop on
Text-to-Speech Synthesis was organized in Autrans, France. The organizers of this
workshop, Gerard Bailly and Christian Benoit, felt that there was a need for a book
containing longer versions of papers from the workshop proceedings, resulting in
Talking Machines. In 1994, the editors of the current volume organized the Second
ESCA/IEEE/AAAI Workshop on Text-to-Speech Synthesis and likewise decided
that a book was necessary that would present the work reported in the proceedings
in a more complete, updated, and polished form. To ensure the highest possible
quality of the chapters for the current volume, we asked members of the scientific
committee of the workshop to select workshop papers for possible inclusion. The
editors added their selections, too. We then invited those authors whose work
received an unambiguous endorsement from this process to contribute to the book.
Finally, we want to thank the many people who have contributed: the scientific
committee members who helped with the selection process; fourteen anonymous
reviewers; Bernd Moebius for work on stubborn figures; Alice Greenwood for
editorial assistance; Mike Tanenblatt and Juergen Schroeter for processing the speech
and video files; David Yarowsky for providing the index; Thomas von Foerster and
Kenneth Dreyhaupt at Springer-Verlag for expediting the process; Cathy Hopkins
for administrative assistance; and Bell Laboratories for its encouragement of this
work.
Jan P. H. van Santen
Richard W. Sproat
Joseph P. Olive
Julia Hirschberg
Murray Hill, New Jersey
October 1995
Contents
Preface v
Contributors xvii
I Signal Processing and Source Modeling 1
1 Section Introduction. Recent Approaches to Modeling the Glottal
Source for TTS 3
Dan Kahn, Marian J. Macchi
1.1 Modeling the Glottal Source: Introduction 3
1.2 Alternatives to Monopulse Excitation 4
1.3 A Guide to the Chapters 5
1.4 Summary 6
2 Synthesizing Allophonic Glottalization 9
Janet B. Pierrehumbert, Stefan Frisch
2.1 Introduction 9
2.2 Experimental Data 10
2.3 Synthesis Experiments 12
2.4 Contribution of Individual Source Parameters 20
2.5 Discussion 20
2.6 Summary 24
3 Text-to-Speech Synthesis with Dynamic Control of Source
Parameters 27
Luis C. Oliveira
3.1 Introduction 27
3.2 Source Model 27
3.3 Analysis Procedure 30
3.4 Analysis Results 33
3.5 Conclusions 36
4 Modification of the Aperiodic Component of Speech Signals for
Synthesis 41
Gael Richard, Christophe R. d'Alessandro
4.1 Introduction 41
4.2 Speech Signal Decomposition 43
4.3 Aperiodic Component Analysis and Synthesis 47
4.4 Evaluation 50
4.5 Speech Modifications 51
4.6 Discussion and Conclusion 54
5 On the Use of a Sinusoidal Model for Speech Synthesis in
Text-to-Speech 57
Miguel Angel Rodriguez Crespo, Pilar Sanz Velasco,
Luis Monzon Serrano, Jose Gregorio Escalada Sardina
5.1 Introduction 57
5.2 Overview of the Sinusoidal Model 59
5.3 Sinusoidal Analysis 60
5.4 Sinusoidal Synthesis 60
5.5 Simplification of the General Model 61
5.6 Parameters of the Simplified Sinusoidal Model 64
5.7 Fundamental Frequency and Duration Modifications 65
5.8 Analysis and Resynthesis Experiments 66
5.9 Conclusions 69
II Linguistic Analysis 71
6 Section Introduction. The Analysis of Text in Text-to-Speech
Synthesis 73
Richard W. Sproat
7 Language-Independent Data-Oriented Grapheme-to-Phoneme
Conversion 77
Walter M. P. Daelemans, Antal P. J. van den Bosch
7.1 Introduction 77
7.2 Design of the System 79
7.3 Related Approaches 85
7.4 Evaluation 86
7.5 Conclusion 88
8 All-Prosodic Speech Synthesis 91
Arthur Dirksen, John S. Coleman
8.1 Introduction 91
8.2 Architecture 93
8.3 Polysyllabic Words 100
8.4 Connected Speech 104
8.5 Summary 106
9 A Model of Timing for Nonsegmental Phonological Structure 109
John Local, Richard Ogden
9.1 Introduction 109
9.2 Syllable Linkage and Its Phonetic Interpretation in YorkTalk 110
9.3 The Description and Modeling of Rhythm 112
9.4 Comparison of the Output of YorkTalk with
Natural Speech and Synthesis 117
9.5 Conclusion 119
10 A Complete Linguistic Analysis for an Italian
Text-to-Speech System 123
Giuliano Ferri, Piero Pierucci, Donatella Sanzone
10.1 Introduction 123
10.2 The Morphologic Analysis 125
10.3 The Phonetic Transcription 130
10.4 The Morpho-Syntactic Analysis 132
10.5 Performance Assessment 137
10.6 Conclusion 137
11 Discourse Structural Constraints on Accent in Narrative 139
Christine H. Nakatani
11.1 Introduction 139
11.2 The Narrative Study 140
11.3 A Discourse-Based Interpretation of Accent Function 142
11.4 Discourse Functions of Accent 146
11.5 Discussion 150
11.6 Conclusion 153
12 Homograph Disambiguation in Text-to-Speech Synthesis 157
David Yarowsky
12.1 Problem Description 157
12.2 Previous Approaches 158
12.3 Algorithm 159
12.4 Decision Lists for Ambiguity Classes 165
12.5 Evaluation 168
12.6 Discussion and Conclusions 169
III Articulatory Synthesis and Visual Speech 173
13 Section Introduction. Talking Heads in Speech Synthesis 175
Dominic W. Massaro, Michael M. Cohen
14 Section Introduction. Articulatory Synthesis and Visual Speech 179
Juergen Schroeter
14.1 Bridging the Gap Between Speech Science and Speech
Applications 179
15 Speech Models and Speech Synthesis 185
Mary E. Beckman
15.1 Theme and Some Examples 185
15.2 A Decade and a Half of Intonation Synthesis 187
15.3 Models of Time 197
15.4 Valediction 202
16 A Framework for Synthesis of Segments Based on
Pseudoarticulatory Parameters 211
Corine A. Bickley, Kenneth N. Stevens, David R. Williams
16.1 Background and Introduction 211
16.2 Control Parameters and Mapping Relations 212
16.3 Examples of Synthesis from HL Parameters 215
16.4 Toward Rules for Synthesis 217
17 Biomechanical and Physiologically Based Speech Modeling 221
Reiner F. Wilhelms-Tricarico, Joseph S. Perkell
17.1 Introduction 221
17.2 Articulatory Synthesizers 222
17.3 A Finite Element Tongue Model 222
17.4 The Controller 227
17.5 Conclusions 232
18 Analysis-Synthesis and Intelligibility of a Talking Face 235
Bertrand Le Goff, Thierry Guiard-Marigny, Christian Benoit
18.1 Introduction 235
18.2 The Parametric Models 236
18.3 Video Analysis 236
18.4 Real-Time Analysis-Synthesis 237
18.5 Intelligibility of the Models 238
18.6 Conclusion 244
19 3D Models of the Lips and Jaw for Visual Speech Synthesis 247
Thierry Guiard-Marigny, Ali Adjoudani, Christian Benoit
19.1 Introduction 247
19.2 The 2D Model of the Lips 248