DEVELOPMENTS IN SPEECH SYNTHESIS

Mark Tatham
Department of Language and Linguistics, University of Essex, UK

Katherine Morton
Formerly University of Essex, UK

Copyright ©2005 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-470-85538-X (HB)

Typeset in 10/12pt Times by Graphicraft, Limited, Hong Kong, China.
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at least two trees are planted for each one used for paper production.

Contents

Acknowledgements
Introduction
  How Good is Synthetic Speech?
  Improvements Beyond Intelligibility
  Continuous Adaptation
  Data Structure Characterisation
  Shared Input Properties
  Intelligibility: Some Beliefs and Some Myths
  Naturalness
  Variability
  The Introduction of Style
  Expressive Content
  Final Introductory Remarks

Part I Current Work
1 High-Level and Low-Level Synthesis
  1.1 Differentiating Between Low-Level and High-Level Synthesis
  1.2 Two Types of Text
  1.3 The Context of High-Level Synthesis
  1.4 Textual Rendering
2 Low-Level Synthesisers: Current Status
  2.1 The Range of Low-Level Synthesisers Available
    2.1.1 Articulatory Synthesis
    2.1.2 Formant Synthesis
    2.1.3 Concatenative Synthesis
      Units for Concatenative Synthesis
      Representation of Speech in the Database
      Unit Selection Systems: the Data-Driven Approach
      Unit Joining
      Cost Evaluation in Unit Selection Systems
      Prosody and Concatenative Systems
      Prosody Implementation in Unit Concatenation Systems
    2.1.4 Hybrid System Approaches to Speech Synthesis
3 Text-To-Speech
  3.1 Methods
  3.2 The Syntactic Parse
4 Different Low-Level Synthesisers: What Can Be Expected?
  4.1 The Competing Types
  4.2 The Theoretical Limits
  4.3 Upcoming Approaches
5 Low-Level Synthesis Potential
  5.1 The Input to Low-Level Synthesis
  5.2 Text Marking
    5.2.1 Unmarked Text
    5.2.2 Marked Text: the Basics
    5.2.3 Waveforms and Segment Boundaries
    5.2.4 Marking Boundaries on Waveforms: the Alignment Problem
    5.2.5 Labelling the Database: Segments
    5.2.6 Labelling the Database: Endpointing and Alignment

Part II A New Direction for Speech Synthesis
6 A View of Naturalness
  6.1 The Naturalness Concept
  6.2 Switchable Databases for Concatenative Synthesis
  6.3 Prosodic Modifications
7 Physical Parameters and Abstract Information Channels
  7.1 Limitations in the Theory and Scope of Speech Synthesis
    7.1.1 Distinguishing Between Physical and Cognitive Processes
    7.1.2 Relationship Between Physical and Cognitive Objects
    7.1.3 Implications
  7.2 Intonation Contours from the Original Database
  7.3 Boundaries in Intonation
8 Variability and System Integrity
  8.1 Accent Variation
  8.2 Voicing
  8.3 The Festival System
  8.4 Syllable Duration
  8.5 Changes of Approach in Speech Synthesis
9 Automatic Speech Recognition
  9.1 Advantages of the Statistical Approach
  9.2 Disadvantages of the Statistical Approach
  9.3 Unit Selection Synthesis Compared with Automatic Speech Recognition

Part III High-Level Control
10 The Need for High-Level Control
  10.1 What is High-Level Control?
  10.2 Generalisation in Linguistics
  10.3 Units in the Signal
  10.4 Achievements of a Separate High-Level Control
  10.5 Advantages of Identifying High-Level Control
11 The Input to High-Level Control
  11.1 Segmental Linguistic Input
  11.2 The Underlying Linguistics Model
  11.3 Prosody
  11.4 Expression
12 Problems for Automatic Text Markup
  12.1 The Markup and the Data
  12.2 Generality on the Static Plane
  12.3 Variability in the Database–or Not
  12.4 Multiple Databases and Perception
  12.5 Selecting Within a Marked Database

Part IV Areas for Improvement
13 Filling Gaps
  13.1 General Prosody
  13.2 Prosody: Expression
  13.3 The Segmental Level: Accents and Register
  13.4 Improvements to be Expected from Filling the Gaps
14 Using Different Units
  14.1 Trade-Offs Between Units
  14.2 Linguistically Motivated Units
  14.3 A-Linguistic Units
  14.4 Concatenation
  14.5 Improved Naturalness Using Large Units
15 Waveform Concatenation Systems: Naturalness and Large Databases
  15.1 The Beginnings of Useful Automated Markup Systems
  15.2 How Much Detail in the Markup?
  15.3 Prosodic Markup and Segmental Consequences
    15.3.1 Method 1: Prosody Normalisation
    15.3.2 Method 2: Prosody Extraction
  15.4 Summary of Database Markup and Content
16 Unit Selection Systems
  16.1 The Supporting Theory for Synthesis
  16.2 Terms
  16.3 The Database Paradigm and the Limits of Synthesis
  16.4 Variability in the Database
  16.5 Types of Database
  16.6 Database Size and Searchability at Low-Level
    16.6.1 Database Size
    16.6.2 Database Searchability

Part V Markup
17 VoiceXML
  17.1 Introduction
  17.2 VoiceXML and XML
  17.3 VoiceXML: Functionality
  17.4 Principal VoiceXML Elements
  17.5 Tapping the Autonomy of the Attached Synthesis System
18 Speech Synthesis Markup Language (SSML)
  18.1 Introduction
  18.2 Original W3C Design Criteria for SSML
      Consistency
      Interoperability
      Generality
      Internationalisation
      Generation and Readability
      Implementability
  18.3 Extensibility
  18.4 Processing the SSML Document
    18.4.1 XML Parse
    18.4.2 Structure Analysis
    18.4.3 Text Normalisation
    18.4.4 Text-To-Phoneme Conversion
    18.4.5 Prosody Analysis
    18.4.6 Waveform Production
  18.5 Main SSML Elements and Their Attributes
    18.5.1 Document Structure, Text Processing and Pronunciation
    18.5.2 Prosody and Style
    18.5.3 Other Elements
    18.5.4 Comment
19 SABLE
20 The Need for Prosodic Markup
  20.1 What is Prosody?
  20.2 Incorporating Prosodic Markup
  20.3 How Markup Works
  20.4 Distinguishing Layout from Content
  20.5 Uses of Markup
  20.6 Basic Control of Prosody
  20.7 Intrinsic and Extrinsic Structure and Salience
  20.8 Automatic Markup to Enhance Orthography: Interoperability with the Synthesiser
  20.9 Hierarchical Application of Markup
  20.10 Markup and Perception
  20.11 Markup: the Way Ahead?
  20.12 Mark What and How?
    20.12.1 Automatic Annotation of Databases for Limited Domain Systems
    20.12.2 Database Markup with the Minimum of Phonology
  20.13 Abstract Versus Physical Prosody

Part VI Strengthening the High-Level Model
21 Speech
  21.1 Introductory Note
  21.2 Speech Production
  21.3 Relevance to Acoustics
  21.4 Summary
  21.5 Information for Synthesis: Limitations
22 Basic Concepts
  22.1 How does Speaking Occur?
  22.2 Underlying Basic Disciplines: Contributions from Linguistics
    22.2.1 Linguistic Information and Speech
    22.2.2 Specialist Use of the Terms ‘Phonology’ and ‘Phonetics’
    22.2.3 Rendering the Plan
    22.2.4 Types of Model Underlying Speech Synthesis
      The Static Model
      The Dynamic Model
23 Underlying Basic Disciplines: Expression Studies
  23.1 Biology and Cognitive Psychology
  23.2 Modelling Biological and Cognitive Events
  23.3 Basic Assumptions in Our Proposed Approach
  23.4 Biological Events
  23.5 Cognitive Events
  23.6 Indexing Expression in XML
  23.7 Summary
24 Labelling Expressive/Emotive Content
  24.1 Data Collection
  24.2 Sources of Variability
  24.3 Summary
25 The Proposed Model
  25.1 Organisation of the Model
  25.2 The Two Stages of the Model
  25.3 Conditions and Restrictions on XML
  25.4 Summary
26 Types of Model
  26.1 Category Models
  26.2 Process Models

Part VII Expanded Static and Dynamic Modelling
27 The Underlying Linguistics System
  27.1 Dynamic Planes
  27.2 Computational Dynamic Phonology for Synthesis
  27.3 Computational Dynamic Phonetics for Synthesis
  27.4 Adding How, What and Notions of Time