Towards Adaptive Spoken Dialog Systems

Alexander Schmitt • Wolfgang Minker
Alexander Schmitt
Institute of Communications Engineering
University of Ulm
Ulm, Germany

Wolfgang Minker
Institute of Communications Engineering
University of Ulm
Ulm, Germany
ISBN 978-1-4614-4592-0        ISBN 978-1-4614-4593-7 (eBook)
DOI 10.1007/978-1-4614-4593-7
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012944961
© Springer Science+Business Media, New York 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This book investigates stochastic methods for the automatic detection of critical
dialog situations in Spoken Dialog Systems (SDS) and implements data-driven
modeling and prediction techniques. The goal of this approach is to enable robust and user-friendly interaction with next-generation SDS.
Advances in spoken language technology have led to an increasing deployment of SDS in the field, such as speech-enabled personal assistants on our smartphones. Limitations of current technology and the great complexity of natural language interaction between humans and machines continue to cause communication problems. Users experience this particularly in everyday interactions with telephone-based SDS in call centers. Low speech recognition performance, false expectations toward the SDS, and sometimes poor dialog design lead to frustration and dialog breakdowns.
In this book, we present a data-driven online monitoring approach that enables future SDS to automatically recognize negative dialog patterns. To this end, we implement novel statistical and machine learning-based approaches. Using this knowledge about emerging problems in the interaction, future dialog systems will be able to change dialog flows dynamically and ultimately resolve those problems. Unlike rule-based approaches, the presented statistical procedures are more flexible, more portable, and more accurate.
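To make the monitoring idea concrete, consider the following minimal sketch in Python using scikit-learn; the features, toy training data, and threshold are purely illustrative assumptions, not the actual models developed in this book:

    # Hypothetical sketch: turn-level online monitoring of an SDS dialog.
    # Features and training data are illustrative, not taken from this book.
    from sklearn.linear_model import LogisticRegression

    # Each row: [ASR confidence, no-match count so far, re-prompt count so far]
    X_train = [
        [0.92, 0, 0],
        [0.85, 1, 0],
        [0.40, 2, 1],
        [0.30, 3, 2],
    ]
    y_train = [0, 0, 1, 1]  # 0 = unproblematic exchange, 1 = critical exchange

    model = LogisticRegression().fit(X_train, y_train)

    def monitor_turn(features):
        """Return the estimated probability that the dialog is becoming critical."""
        return model.predict_proba([features])[0][1]

    # After each system-user exchange, the dialog manager could adapt its
    # strategy once the estimated risk exceeds a threshold.
    if monitor_turn([0.35, 2, 1]) > 0.5:
        print("switch to a more conservative dialog strategy")

A rule-based monitor would hard-code such thresholds by hand; the statistical variant learns them from annotated dialogs, which is what makes it more portable across applications.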
After an introduction to spoken dialog technology, the book describes the foundations of machine learning and pattern recognition, which serve as the basis for the presented approaches. Related work in the fields of emotion recognition and data-driven evaluation of SDS (e.g., the PARADISE approach) is presented and discussed.
The main part of the book begins with a major symptom closely connected to poor communication and critical dialog situations: negative user emotions and their detection. Related work in the field is frequently based on artificial datasets of acted and enacted speech, whose results may not transfer to real-life applications. We investigate how frequently users show negative emotions in real-life SDS and develop novel approaches to speech-based emotion recognition. We follow a multilayer approach to model and detect emotions by using
classifiers based on acoustic, linguistic, and contextual features. We show that acoustic models outperform linguistic models in recognizing emotions in real-life SDS. Furthermore, we examine the relationship between the interaction flow and the occurrence of emotions. We show that the interaction flow of a human–machine dialog influences the user's emotional state, which allows us to support emotion recognition with parameters describing the previous interaction. For our studies, we exclusively employ non-acted recordings from real users.
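As a rough illustration of this multilayer setup, the per-layer classifier outputs could be combined by a weighted late fusion; this is a hypothetical sketch with made-up scores and weights, not the fusion strategy evaluated in Chapter 4:

    # Hypothetical late fusion across the three classifier layers described
    # above; scores and weights are made up for illustration.
    def fuse(acoustic_score, linguistic_score, context_score,
             weights=(0.5, 0.2, 0.3)):
        """Weighted late fusion of per-layer 'user is angry' probabilities."""
        scores = (acoustic_score, linguistic_score, context_score)
        return sum(w * s for w, s in zip(weights, scores))

    # Example: acoustics strongly suggest anger, the spoken words are neutral,
    # and the preceding interaction (e.g., repeated no-matches) supports it.
    p_angry = fuse(0.85, 0.50, 0.70)
    print(f"fused P(angry) = {p_angry:.2f}")  # 0.74 with these weights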
Not all users react emotionally when problems arise in the interaction with an SDS. In the second step, we therefore present novel statistical methods that allow spotting problems within a dialog based on interaction patterns. The presented Interaction Quality paradigm demonstrates how a continuous quality assessment of ongoing spoken human–machine interaction may be achieved. This expert-based approach represents an objective quality view on an interaction. The degree to which this paradigm mirrors subjective user satisfaction is assessed in a laboratory study with 46 users. To our knowledge, this study is the first to assess user satisfaction during SDS interactions. We show that subjective satisfaction correlates with objective quality assessments. Interaction parameters influencing user satisfaction are statistically determined and discussed. Furthermore, we present approaches that will enable future dialog systems to predict the dialog outcome during an interaction. The latter models allow dialog systems in call center applications to promptly escalate endangered calls to call center agents who may help out. Problems specific to the estimation of the dialog outcome are assessed and solved. An open-source workbench supporting the development and evaluation of such statistical models, together with a parameter set for quantifying SDS interactions, rounds off the book. The proposed approaches will increase user-friendliness and robustness in future SDS. All models have been evaluated on several large datasets of commercial and non-commercial SDS and have thus been tested for practical use.
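The escalation idea behind these outcome prediction models can be sketched as follows; the thresholds and the notion of model confidence are illustrative placeholders for the uncertainty handling discussed in Chapter 6:

    # Hypothetical sketch of call escalation based on a predicted task outcome.
    # Threshold values are illustrative only.
    ESCALATION_THRESHOLD = 0.8   # predicted risk of task failure
    CONFIDENCE_THRESHOLD = 0.6   # minimum model certainty required to act

    def should_escalate(p_failure, model_confidence):
        """Decide whether an ongoing call should be routed to a human agent."""
        return (p_failure >= ESCALATION_THRESHOLD
                and model_confidence >= CONFIDENCE_THRESHOLD)

    if should_escalate(p_failure=0.87, model_confidence=0.72):
        print("escalating call to a human agent")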
Acknowledgments
This project would not have been possible without the support of many people.
The authors express their deepest gratitude to David Sündermann, Jackson Liscombe, and Roberto Pieraccini from SpeechCycle Inc. (USA) for continually supporting us with corpora still "hot from recording", for their helpful advice, and for their warm welcomes during our stays in New York.
We are notably grateful to Tim Polzehl (Deutsche Telekom Laboratories, Berlin) and Florian Metze (Carnegie Mellon University, Pittsburgh) for our productive collaboration in the field of emotion recognition during the past years. We would further like to thank the team at Deutsche Telekom Laboratories, in particular Sebastian Möller, Klaus-Peter Engelbrecht, and Christine Kühnel, for their friendly and collegial cooperation and the exchange of valuable ideas at numerous speech conferences.
We are indebted to our research colleagues, students, and the technical staff at
the Dialog Systems group at the University of Ulm. In particular, we owe a debt of
gratitude to Tobias Heinroth, Stefan Ultes, Benjamin Schatz, Shu Ding, Uli
Tschaffon, and Carolin Hank as well as Nada Sharaf and Sherief Mowafey from
the German University in Cairo (Egypt) for their fruitful discussions and the
tireless exchange of new ideas.
Thanks to Allison Michael from Springer for his assistance during the publishing process.
Finally, we thank our families for their encouragement.
Contents
1 Introduction
  1.1 Spoken Dialog Systems
    1.1.1 Automatic Speech Recognition
    1.1.2 Semantic Analysis
    1.1.3 Dialog Management
    1.1.4 Language Generation and Text-to-Speech Synthesis
  1.2 Towards Adaptive Spoken Dialog Systems

2 Background and Related Research
  2.1 Machine Learning: Algorithms and Performance Metrics
    2.1.1 Supervised Learning
    2.1.2 Performance Metrics
  2.2 Emotion Recognition
    2.2.1 Theories of Emotion and Categorization
    2.2.2 Emotional Speech
    2.2.3 Emotional Labeling
    2.2.4 Paralinguistic and Linguistic Features for Emotion Recognition
    2.2.5 Related Work in Speech-Based Emotion Recognition
  2.3 Approaches for Estimating System Quality, User Satisfaction and Task Success
    2.3.1 Offline Estimation of Quality on Dialog- and System-Level
    2.3.2 Online Estimation of Task Success and User Satisfaction
  2.4 Summary and Discussion

3 Interaction Modeling and Platform Development
  3.1 Raw Data
  3.2 Parameterization and Annotation
    3.2.1 Interaction Parameters
    3.2.2 Emotion Annotation
  3.3 A Workbench for Supporting the Development of Statistical Models for Online Monitoring
    3.3.1 Requirements Toward a Software Tool
    3.3.2 The Workbench
    3.3.3 Data Management
    3.3.4 Evaluating Statistical Prediction Models for Online Monitoring
  3.4 Summary and Discussion

4 Novel Strategies for Emotion Recognition
  4.1 Speech-Based Emotion Recognition
    4.1.1 Paralinguistic Emotion Recognition
    4.1.2 Linguistic Emotion Recognition
  4.2 Dialog-Based Emotion Recognition
    4.2.1 Interaction and Context-Related Emotion Recognition
    4.2.2 Emotional History
    4.2.3 Emotion Recognition in Deployment Scenarios
  4.3 Evaluation
    4.3.1 Corpora
    4.3.2 Human Performance
    4.3.3 Speech-Based Emotion Recognition
    4.3.4 Dialog-Based Emotion Recognition
    4.3.5 Fusion Strategy
    4.3.6 Emotion Recognition in Deployment Scenarios
  4.4 Summary and Discussion

5 Novel Approaches to Pattern-Based Interaction Quality Modeling
  5.1 Interaction Quality Versus User Satisfaction
  5.2 Expert Annotation of Interaction Quality
    5.2.1 Annotation Example
    5.2.2 Rating Statistics and Determination of the Final IQ Score
  5.3 A User Satisfaction Study Under Laboratory Conditions
    5.3.1 Lab Study Setup
    5.3.2 Participants
    5.3.3 Comparison of User Satisfaction and Interaction Quality
  5.4 Input Variables
  5.5 Modeling Interaction Quality and User Satisfaction
  5.6 Evaluation
    5.6.1 Performance Metrics
    5.6.2 Feature Set Composition
    5.6.3 Feature Selection
    5.6.4 Assessing the Model Performance
    5.6.5 Impact of Optimization Through Feature Selection
    5.6.6 Cross-Target Prediction and Portability
    5.6.7 Causalities and Correlations Between Interaction Parameters and IQ/US
  5.7 Summary and Discussion

6 Statistically Modeling and Predicting Task Success
  6.1 Linear Modeling
  6.2 Window Modeling
  6.3 SRI and Salience
  6.4 Coping with Model Uncertainty
  6.5 Evaluation
    6.5.1 Overall Performance
    6.5.2 Linear Versus Window Modeling
    6.5.3 Class-Specific Performance
    6.5.4 SRI-Related Features
    6.5.5 Model Uncertainty
  6.6 Summary and Discussion

7 Conclusion and Future Directions
  7.1 Overall Summary and Synthesis of the Results
    7.1.1 Corpora Creation, Basic Modeling, and Tool Development
    7.1.2 Emotion Recognition
    7.1.3 Interaction Quality
    7.1.4 Task Success Prediction
  7.2 Suggestions for Future Work
  7.3 Outlook

Appendix A: Interaction Parameters for Exchange Level Modeling

Appendix B: Detailed Results for Emotion Recognition, Interaction Quality, and Task Success

References

Index