Table of Contents

PRONUNCIATION MODELING IN SPEECH SYNTHESIS

Corey Andrew Miller

A DISSERTATION in Linguistics

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

1998

Mark Liberman, Supervisor of Dissertation
George Cardona, Graduate Group Chairperson

© COPYRIGHT Corey Andrew Miller 1998

DEDICATION

To Jonathan Connett.

ACKNOWLEDGMENTS

I am very pleased to have had the encouragement and support of a committee of three linguists for whom I have the greatest respect and admiration: Mark Liberman, William Labov and Eugene Buckley. Each of them made my transition back to Penn pleasant after what seemed like a long absence. It was a great pleasure to have Mark Randolph both as an external reader and as a colleague at Motorola. Mark’s work at MIT a decade ago has served as an inspiration to me.

Orhan Karaali made this dissertation possible in this millennium. As my manager for over two years at Motorola, Orhan insisted on making my dissertation a priority at work. Harry Bliss provided his voice to this project and our whole group is very grateful for his patience and cooperation. My colleagues at Motorola listened to my ideas and provided technical and theoretical assistance at every turn: Noel Massey, Jerry Corrigan, William Thompson, Andrew Mackie, Erica Zeinfeld, Otto Schnurr, Lea Adams, Michael Murdock, Joseph Goldberg and Lynette Melnar. My entire management was supportive in this effort: Ira Gerson, Kevin Kloker and James Mikulski. D. J. Stockley made the pursuit of intellectual property a positive learning experience.

My colleagues at my former workplace, Franklin Electronic Publishers, provided a great environment as I wet my feet in the industrial world: Will Dowling, David Justice and Mike Wolff. I thank Matt Lennig and Kristin Precoda for introducing me to speech technology and software development at BNR.
My linguist friends and colleagues at Penn and other schools provided support and encouragement while I lived in Philadelphia and afterwards: Christine Zeller, Tom Veatch, Peter Slomanson, Fabiola Varela-García, Mary O’Malley, Bill Reynolds, Hadass Sheffer, Stephanie Strassel, Nadia Biassou, Lisa Lavoie, Hikyoung Lee, Alex Dimitriadis, and Jason Eisner. I thank Brian Sietsema, Pat Keating, Sean Fulop and Howard Nusbaum for providing valuable advice and information.

My mother always believed that this dissertation would become a reality, and to her I owe inexpressible gratitude for providing encouragement at the more difficult moments of my graduate career. The thought of the rest of my family including my father, Stephanie, Ken, Jordan, Avery, Lee, Elihu, and my late grandparents has always kept me going. Thank you Jonathan for paying the bills, arranging exotic vacations and, most of all, hanging in there.

Finally, I would like to thank the Summer Institute of Linguistics for their excellent phonetic fonts which are freely distributed on the World Wide Web.

ABSTRACT

PRONUNCIATION MODELING IN SPEECH SYNTHESIS

Corey Andrew Miller

Mark Liberman

This dissertation investigates the area of pronunciation modeling in speech synthesis. By pronunciation modeling, we mean architectures and principles for generating high-quality human-like pronunciations. The term pronunciation modeling has previously been applied in the context of speech recognition (e.g. Byrne et al. 1997). In that context, it describes theories and procedures for handling the pronunciation variation that naturally occurs across speakers. In contrast, our work is in the domain of text-to-speech synthesis, which, as we will show, requires modeling the pronunciation variation of an individual whose speech the synthesizer is attempting to model.
We will explain our methodology for learning and reproducing pronunciation variation on an individual basis, and show how most crucial features of such variation can be easily generated using the architecture we describe. Throughout the course of this exposition, we highlight contributions to linguistic theory that such a thorough analysis of individual variation provides.

We describe the postlexical module of an English text-to-speech synthesizer. This module is responsible for transforming underlying lexical pronunciations from a lexical database into contextually appropriate surface postlexical pronunciations. This transformation is achieved by machine learning of a corpus of hand-labeled postlexical pronunciations that have been aligned with lexical pronunciations. The machine learning is conducted by a neural network, whose architecture and data encoding we describe. A thorough analysis of the performance of the postlexical module is offered, with attention to the relative success of the neural network at learning a wide range of postlexical phenomena. We examine the extent to which a symbolic approach to allophony is warranted, and provide an acoustic analysis that attempts to provide an answer to this question. Assessments of the success of currently existing theories of phonetics, phonology and their interface are offered, based on the experience of generating a complete postlexical phonology of English for use in synthetic speech.

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1. What is speech synthesis pronunciation modeling?
  1.2. Computational phonology and speech technology
    1.2.1. Linguistic aspects of speech synthesis
    1.2.2. Review of prior work in postlexical modeling
    1.2.3. Review of neural networks in phonology
  1.3. General phonological issues
    1.3.1. Phonetics-phonology interface
    1.3.2. Lexical phonology and the interface between syntax, morphology and phonology
  1.4. Sociolinguistics/Variation
  1.5. Overview of dissertation
Chapter 2. Rationale for modeling postlexical variation
  2.1. Evaluation of synthetic speech
    2.1.1. Intelligibility
    2.1.2. Comprehensibility
    2.1.3. Acceptability
    2.1.4. Naturalness
    2.1.5. Case studies involving postlexical variation
  2.2. Approximation to training data
  2.3. Cross-dialectal comprehension
  2.4. Benefits of variability
Chapter 3. Data sources
  3.1. Lexical database
    3.1.1. Characteristics of source dictionaries
    3.1.2. Transcription consistency and simplification
  3.2. Labeled speech corpus
    3.2.1. Syntactic and prosodic labeling
    3.2.2. Levels of labeling
Chapter 4. Experimental approach to comparing gradient vs. discrete aspects of postlexical variation
  4.1. Acoustic neural network
  4.2. Experimental procedure
    4.2.1. Experiment on [phonetic symbols unrecoverable]
    4.2.2. Experiment on [phonetic symbols unrecoverable]
    4.2.3. Experiment on [phonetic symbols unrecoverable]
    4.2.4. Experiment on /o/ and /i/
  4.3. Conclusions from acoustic analyses of allophony
Chapter 5. Methods for learning segmental postlexical variation
  5.1. Creation of postlexical training materials
    5.1.1. Alignment of lexical and postlexical phones
    5.1.2. Creation of postlexical training database
  5.2. Characterization of learning problem
  5.3. Neural network architecture
  5.4. Data encoding
    5.4.1. Features for lexical phones
    5.4.2. Stress
    5.4.3. Syntactic and prosodic information
    5.4.4. Windowing
Chapter 6. Results
  6.1. Analysis of neural network
  6.2. General phonological analysis
  6.3. General error analysis
  6.4. Allophony
    6.4.1. Vowel fronting
    6.4.2. Glottalization of vowels
    6.4.3. Coronal allophones
  6.5. Vowel reduction in function words
  6.6. Dialect
Chapter 7. Conclusion
Appendix
References

Description:

This dissertation investigates the area of pronunciation modeling in speech synthesis. Text-to-speech synthesis requires modeling the pronunciation variation of an individual whose speech the synthesizer is attempting to model. Finally, we present a section on phonetic studies that have analyzed the postlexical variation in ...


249 Pages·2004·1.01 MB·English
