c Targeted Help for Spoken Dialogue Systems: intelligent feedback improves naive users’ perfomance Beth Ann Hockey Oliver Lemon Research Institute for Advanced School of Informatics, Computer Science (RIACS), University of Edinburgh, NASA Ames Research Center, 2 Buccleugh Place Moffet Field, CA 94035 Edinburgh EH8 9LW, UK [email protected] [email protected] Ellen Campana Laura Hiatt Gregory Aist Department of Brain Center for the Study of Language RIACS and Cognitive Sciences and Information (CSLI) NASA Ames Research Center, University of Rochester Stanford University Moffet Field, CA 94035 Rochester, NY 14627 2 10 Panama St , [email protected] ecampana@bcs .r ochester . edu Stanford, CA 94305 [email protected] Jim Hieronymous Alexander Gruenstein John Dowding RIACS BeVocal, Inc. RIACS NASA Ames Research Center, 685 Clyde Avenue NASA Ames Research Center, Moffet Field, CA 94035 Mountain View, CA 94043 Moffet Field, CA 94035 [email protected] [email protected] [email protected] Abstract periments on a spoken dialogue system for command and control of a simulated We present experimental evidence that rob0 tic helicopter. providing naive users of a spoken dia- logue system with immediate help mes- l Introduction sages related to their out-of-coverage ut- terances improves their success in using Targeted Help makes use of user utterances ‘that ‘ - the system. A grammar-based recog- are out-of-coverage of the main dialogue sys- nizer and a Statistical Language Model tem recognizer to provide the user with immedi- (SLM) recognizer are run simultane- ate feedback, tailored to what the user said, for ously. If the grammar-based recognizer cases in which the system was not able to un- suceeds, the less accurate SLM recog- derstand their utterance. These messages can be nizer hypothesis is not used. When the much more informative than responding to the grammar-based recognizer fails and the user with some variant of “Sorry I didn’t under- SLM recognizer produces a recognition stand”, which is the behaviour of most current di- hypothesis, this result is used by the Tar- alogue systems. To implement Targeted Help we geted Help agent to give the user feed- use two recognizers: the Primary Recognizer is back on what was recognized, a diag- constructed with grammar-based language model nosis of what was problematic about the and the Secondary Recognizer used by the Tar- utterance, and a related in-coverage ex- geted Help module is constructed with a Statis- ample. The in-coverage example is in- tical Language Model (SLM). As part of a spo- tended to encourage alignment between ken dialogue system, grammar based recogniz- user inputs and the language model of ers tuned to a domain perfom very well, in fact the system. We report on controlled ex- better than comparable Statistical Language Mod- els (SLMs) for in-coverage utterances (Knight et low to interpret ongoing events, which may be re- al., 2001). However, in practice users will some- ported (via NL generation) to the operator. The times produce utterances that are out of coverage. robot can carry out various activities such as fly- Ths is particularly true of non-expert users, who ing to a location, fighting fires, following a ve- do not understand the limitations and capabilities hicle, and landing. Interaction in WITAS thus of the system, and consequently produce a much involves joint-activities between an autonomous lower percentage of in-coverage utteraces than ex- system and a human operator. These are activ- pert users. The Targeted Help strategy for achiev- ities which the autonomous system cannot com- ing good performance with a dialogue system is to plete alone, but which require some human inter- use a grammar-based language model and assist vention (e.g. search for a vehicle). These activi- users in becoming expert as quickly as possible. ties are specified by the user during dialogue, or This approach takes advantage of the strengths of can be initiated by the UAV. In any case, a major both types of language models by using the gram- component of the dialogue, and a way of maintain- mar based model for in-coverage utterances and ing its coherence, is tracking the state of current the SLM as part of the Targeted Help system for or planned activities of the robot. This system is out-of-coverage utterances. sufficiently complex to serve as a good testbed for In this paper we report on controlled experi- Targeted Help. ments, testing the effectiveness of an implementa- tion of Targeted Help in a mixed initiative dialogue 2.2 The Targeted Help Module system to control a simulated robotic helicopter. The Targeted Help Module is a separate compo- nent that can be added to an appropriately struc- 2 System Description tured dialogue system with minimal changes to ac- 2.1 The WITAS Dialogue System comodate the specifics of the domain. This mod- Targeted Help was deployed and tested as part ular design makes it quite portable, and a version of the WITAS dialogue system’, a command and of this agent is in fact being used in a second com-’ control and mixed-initiative dialogue system for mand and control dialogue system (Hockey et al., interacting with a simulated robotic helicopter or 2002; ?). It is argued in (?) that “low-level” UAV (Unmanned Aerial Vehicle) (?). The dia- processing components such as the Targeted Help logue system is implemented as a suite of agents module are an important focus for future dialogue communicating though the SRI Open Agent Ar- system research. Figure 1 shows the structure of chitecture (0-4A) (Martin et d., 1998). The the Targeted Help component and its relationship agents include: Nuance Communications Recog- to the rest of the dialogue system. nizer (Nuance, 2002); the Gemini parser and gen- The goal of the Targeted Help system is to han- erator (Dowding et al., 1993) (both using a gram- dle utterances that cannot be processed by the mar designed for the UAV application); Festival usual components of the dialogue system, and to text-to-speech synthesizer (Systems, 2001); a GUI align the user’s inputs with the coverage of the sys- which displays a map of the area of operation tem as much as possible. To perform this function and shows the UAV’s location; the Dialogue Man- the Targeted Help component must be able to de- ager (?); the Robot Control and Report compo- termine which utterances to handle, and then con- nent, which translates commands and queries bi- struct help messages related to those utterances, directionally between the dialogue interface and which are then passed to a speech synthesizer. The the UAV. The Dialogue Manager interleaves mul- module consists of three parts: tiple planning and execution dialogue threads (?). While the helicopter is airborne, an on-board 0 the Secondary Recognizer, active vision system will interpret the scene be- the Targeted Help Activator, 0 ‘See http://www.ida.liu.se/ext/witas and http://www-csli.stanford.edu/semlab/ witas the Targeted Help Agent. Prknary speech recognizer Speech Synthesis Figure 1: Architecture The Targeted Help Activator takes input from What the system heard: a report of the backup both the main grammar-based recognizer and the SLM recognition hypothesis. backup category-based SLM recognizer. It uses What the problem was: a description of the this input to determine when the Targeted Help problem with the user’s utterance (e.g. the component should produce a message. The Acti- system doesn’t know a word); and vator’s behavior is as follows for the four possible combinations of reco,gizer outcomes: What you might say instead: A similar in- coverage example. 1. Both recognizers get a recognition hypothe- sis: In constructing both the diagnostic of the prob- Targeted Help remains inactive; normal dia- lem with the utterance, and the in-coverage exam- logue system proccessing proceeds ple, we are faced with the question of whether the information from the secondary recognizer is suf- 2. Main recognizer gets a recognition hypothe- ficient to produce useful help messages. Since this sis and secondary recognizer rejects: domain is relatively novel, there is not very much Targeted Help remains inactive; normal dia- data for training the SLM and the performance re- logue system proccessing proceeds flects this. We have designed a rule based system 3. Main recognizer rejects, secondary recog- that looks for patterns in the recognition hypothe- nizer gets a recognition hypothesis: sis that seem to be detected adequately even with Targeted Help is activated incomplete or inaccurate recognition. Diagnostics are of three major types: 4. Both recognizers reject: Targeted Help is not activated, default system 0 endpointing errors, failure message is produced unknown vocabulary, 0 Once Targeted Help is activated, the Targeted subcategorization mistakes. Help Agent constructs a message based on the 0 recognition hypothesis from the secondary SLM We found from an analysis of transcripts that recognizer. These messages are composed of one these three types of errors accounted for the ma- or more of the following pieces: jority of failed utterances. Endpointing errors are cases of one or the other end of an utterance being ity that we suspect is important is matching the cut off. For example, when the user says “search utterance dalogue-move type (e.g. wh-question, for the red car” but the system hears “for the red yesho-question, command) otherwise the user is car”. We use information from the dialogue sys- likely to be misled into thinking that a particular tem’s parsing grammar (which has identical cover- type of dialogue-move is impossible in the system. age to its speech recognizer) to determine whether Looking for in-coverage words is fairly robust. the initial word recognized for an utterance is a Even when the user produces an out-of-coverage valid initial word. If not, the utterance is diag- utterance they are likely to produce some in- nosed as a case of the user pressing the push-to- coverage words. The Targeted Help agent looks talk button too late and the system reports that to for within-domain words in the recognition hy- the user. Out-of-vocabulary items that can be iden- pothesis from the secondary SLM recognizer. This tified by Targeted Help are those that are in the gives us a set of target words from which to SLM’s vocabulary but are out of coverage for the match the example to the dialogue-move type of grammar based recognizer and so cannot be pro- the user’s utterance: wh-question, yn-question, an- cessed by the dialogue system. For these items swer, or command. Targeted Help produces a message of the form Furthermore, for commands (which are a large “the system doesn’t understand the word X’. percentage of the utterances) we use the in- Saying “Zoom in on the red car” when the sys- coverage words to produce a targeted in-coverage tem only has intransitive “zoom in” is an exam- example that is interpretable by the system. These ple of a subcategorization error. In these cases the examples are intended to demonstrate how in- word is in-vocabulary hit has been used in a way vocabulary words from the backup recognizer hy- that is out-of-grammar. To diagnose subcatego- pothesis could be successfully used in commu- rization errors we consult the recognitiodparsing nicating with the system. For example, if the grammar for subcategorization information on in- user says something like “fly over to the hospi- vocabulary verbs in the secondary recognizer hy- tal”, where “over” is out-of-coverage, and the fall- pothesis, then check what else was recognized to back recognizer detected the words “fly” and “hos- detennine if the right arguments are there. For pital”, the Targeted Help agent could provide an these types of errors the system produces a mes- in-coverage example like “fly to the hospital”. For sage such as “the system doesn’t understand the the other less frequent utterance types we have one word X used with the red car”. These diagnostics in coverage example per type. The system cur- are one significant difference from the approach rently uses a look-up table but we hope to incor- used in (Gorrell et al., 2002). The simple classi- porate generation work which would support gen- fier approach used in that work to select example eration of these examples on the fly from a list of sentences would not support these types of diag- in-coverage words (?). nostics. 3 Design of Experiments In constructing examples that are similar to the user’s utterance one issue is in what sense they In order to assess the effectiveness of the targeted should be similar. One aspect we have looked help provided by our system, we compared the at is using in-coverage words from the user’s ut- performance of two groups of users, one that re- terance. It is likely to help naive users learn ceived targeted help, and one that did not. Twenty the coverage of the system if the examples give members of the Stanford University community them valid uses of in-coverage words they pro- were randomly assigned to one of the two groups. duced in their utterance. By using words from the There were both male and female subjects, the ma- user’s utterance the system provides both confir- jority of subjects were in their twenties and none mation that those words are in coverage and an in- of the subjects had prior experience with spoken coverage pattern to imitate. We believe that this dialogue systems. The structure of the interaction leads to greater linguistic alignment between the with the system was the same for both groups. user and the system. Another aspect of similar- They were given minimal written instruction on how to use the system before the interaction be- marked in the first task than in the fifth task gan. They were then asked to use the system to (LARGER EARLY EFFECT). complete five tasks, in whch they directed a heli- 4 Experimental Results copter to move within a city environment to com- plete various task oriented goals whch were dif- We found clear evidence that targeted help im- ferent for four of the five tasks. For each task the proves performance in this environment, as mea- goals were given immediately prior to the start of sured by both the frequency with which th- user p the interaction, in language the system could not simply explicitly gave up on a task, and the time process to prevent users from simply reading the to complete the remaining tasks. In this section we goal aloud to the system. A given task ended when present the statistical analyses of the experiment. one of the following criteria was met: For the following analyses two subjects, both in the No Help condition, were excluded from the 1. the task was accurately completed and the analyses because they gave up on every task, leav- user indicated to the system that he or she had ing 9 users in each of the two help conditions. Ex- finished, ceptions are noted. We begin by examining the percentage of trials 2. the user believed that the task was completed in which users explicitly gave up on a task before and indicated this to the system when in fact it was completed. We compared the percentage of the task was not accurately completed, or trials in which tie user ciicked the “give up” but- 3. the user gave up. ton in both tasks for users in both help conditions. As predicted, a i-within (Task), i-between (Help The first and last of the sequence of five tasks condition) subjects ANOVA revealed a main effect were the critical trials that were used to assess per- of the help condition (F1(1,16)=6.000, p<.05). formance. Both of the tasks had goals of the form Users who received targeted help were less likely “locate an x and then land at the y” The experi- to give up than those who did not receive help, par- ment was conducted in a single session. An exper- ticularly during the first task (11% vs. 27%). If imenter was present throughout, but when asked we include the two subjects in the No Help con- she refused to provide any feedback or hints about dition who gave up on every task the difference is how to interact with the system. even more striking. For the first task only 1I % of As stated above, the critical difference between the users who received help gave up, compared to the two groups of users was the feedback they re- 45% of the users who did not receive help. The ceived during interaction with the system. When pattern holds up even if we include the three in- the users in the No Help condition produced out- tervening filler trials along with the experimental of-coverage utterances the system responded only trials, as demonstrated by a paired t-test item anal- with a text display of the message “not recog- ysis (t(4) = 7.330, p<.05). Those who received nized”. In contrast, when users in the Help condi- help were less likely to explicitly give up even on tion produced ogt-of-coverage utterances they re- this wider variety of tasks. ceived in-depth feedback such as: “The system We next examine the time it took users to com- heardjy betweerr the hospital and the sclzool, ~ n - plete the individual tasks. Here it is necessary to fortunately it doesn’t understand JEy when used be clear about what is meant by “completion.” It with the words between the hospital and the is more ambiguous than it may seem. Each task school. You could try sayingJEy to the hospital.” had several sub-goals, and it was even difficult We hypothesized that: 1) providing Targeted to objectively evaluate whether a single sub goal Help would improve users’ ability to complete had been met. For instance, the goal of the first tasks (HIGHER TASK COMPLETION); and 2) time task was to find a red car near the warehouse and to complete tasks would be reduced for users re- then land the helicopter. Users tended to indicate ceiving Targeted Help (REDUCEDT IME). We that they had finished as soon as they saw the red also anticipated that both effects would be more car, failing to land the helicopter as the instruc- tions specified. Another common source of ambi- Lenient Criterion Analysis guity was when the user saw the car on the map but never brought it up in the dialogue, simply 3IQ landmg the helicopter and clicking “finished.” The problem with this is that there is no way of know- ing whether the user actually saw the car before cliclung finish, and there was no explicit record that they were aware of its presence. For all tri- t : als the experimenter evaluated the task comple- tion, recording what was done and what was left undone. According to the experimenter, in most cases of potential ambiguity the basic goal was Figure 2: Time to complete task under Lenient completed. In a few instances, however, the user Criterion for completion indicated belief that the task had been completed when it obviously had not. An example of this is the following: The goal specified was to find a red than the last one (365.5 seconds vs. 220.4 sec- car near the warehouse and then land. The user onds), and the difference between the help and no flew the helicopter to the police station, and then help conditions was more marked on the first task clicked “finished,” ending the task. Wc dealt with than on the iast one (150.2 seconds vs. 94 sec- the ambiguity problem by analyzing the time to onds). Figure 2 shows these results. complehr, dat2 separately acccrdicg to tvm dif- One criticism of this analysis is that it may in- ferent inclusion criteria. In both cases the pattern clude trials in which the task objectives were not was the same: Users who received help took less accurately completed before the subject clicked time to complete tasks than those who did not, the “finished”. We wished to avoid experimenter sub- first task took longer to complete than the last one, jectivity with respect to task completion, so we and the difference between the help and no help conducted another analysis using the strictest in- conditions was more marked on the first task than clusion criterion the experimental design allowed. on the last one. In this analysis we included only those trials in In the first analysis we included all trials in which all task objectives were completed and which the user clicked the “finished” button, re- could be verified using the transcripts. This meant gardless of their actual performance. Subjects who that for all of the trials we included, the goal entity failed to complete one of the two critical tasks was explicitly mentioned ir, the dialogue. Accord- (tasks 1 and 5) were excluded from the analysis. ing to this criterion only 44% of users in the Help We used a 1-within (Task), 1-between (Help con- condition and 18% of users in the No Help con- dition) subjects ANOVA. For task 1, 89% of the dition completed the first task. Similarly, 89% of trials in the Help condition and 55% of the trials users in the Help condition and 40% of users in the in the No Help were considered ”completed.” For No Help condition accurately completed the task. task 5, 100% of the trials in the Help condition and Although this analysis is conducted on sparse data, 80% of the trials in the No Eieip condition were it provides strong supporting evidence for the data considered “completed.” The analysis revealed a pattern observed in the more lenient analysis. marginally significant main effect of the help con- We examined the time it took to complete tasks dition (Fl(1,ll) = 3.809, p<.l), a main effect of according to the strict criterion, excluding all other task (F1,~1=62.545p, <.001) and a help condition trials. The ANOVA analysis was identical to the by task interaction (F1(1,11)=10.203, p < .OS). previous one. It, too, revealed a main effect of The effects were in the predicted direction. Users help condition (F1(1,3) = 15.438, p<.05), a main who received help took less time to complete tasks effect of task (F1,~=83.512,p < .Ol), and a help than those who did not (290.4 seconds vs. 440.6 condition by task interaction (F1( 1,3)=20.335, p seconds), the first task took longer to complete < .OS). Again the effects were in the predicted di- . These findings are remarkable because they Strict Criterion Analysis demonstrate that it is possible to construct ef- 51)s ---__ --____-- fective Targeted Help messages even from fairly 1511 1 low quality secondary recognition. Moreover, the , I 400 study suggests that such an approach can improve - . 1 the speed of training for naive users, and may re- I m{iP DXir llrlp sult in lasting improvements in the quality of their understanding. 6 Future Work This work suggests many interesting directions for Figure 3: Time to complete task under Strict Cn- further research. One area of investigation is the tenon for completion contribution of various factors in the effectiveness of the Targeted Help message for example: rection. Users who received help took less time to What benefit is due to the online nature of the complete tasks than those who did not (226.2 sec- 0 help? onds vs. 377.5 seconds), the first task took longer to coqlete than the last one (379.9 secoyds vs. What benefit is due to the information con- 223.75), and the difference between the help and 0 tent? no helm cocditims was mcre mzked en *e first task than on the last one (190.4 seconds vs. 112.3 What is the relative contribution of the vari- 0 seconds). These results are shown in Figure 3. ous parts of the Targeted Help message to the improvement in user performance. 5 Conclusions - Is the diagnostic alone more or less ef- We have shown that users benefit from having on- fective than the example alone? line Targeted Help. Naive users who received Targeted Help messages were less likely to give - How much does getting the back up rec- up and significantly faster to complete tasks than ognizer hypothesis help the user? users who did not. Overall, those who did not - What is the most effective combination receive help gave up on 39% of the trials, while of these components? those who received our Targeted Help only gave up on 6% of the trials. With respect to time, Another interesting direction is to look at effec- when we considered all trials in which the user tiveness across different types of applications. The indicated that the goal had been completed (re- fact that we found positive results in this domain gardless of performance), those users who did not and that (Gorrell et al., 2002) also found a variant receive our Targeted Help took 53% longer than of Targeted Help useful on a quite different do- those who did. Under stricter inclusion criteria, main suggests that the approach could be generally whch required the users to explicitly mention the useful for a variety of types of dialogue systems. goal and accurately complete the task, the differ- We are currently looking at porting our Targeted ence was even more pronounced. Those users who Help agent to additional domains. did not receive help took 67.0% longer to com- plete the tasks than those who received our Tar- Acknowledgements geted Help. In both help conditions, performance improved over the course of the experimental ses- This work was partially funded by the Wallenberg sion. However, the advantage conferred by help Foundation’s WITAS project, Linkoping Univer- merely diminished and did not disappear during sity, Sweden. the session. References J. Dowding, M. Gawon, D. Appelt, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: A nat- ural language system for spoken language under- standing. In Proceedings of the Thirty-First Annual Meeting of the Association for Computational Lin- guistics. G. Gorrell, I. Lewin, and M. Rayner. 2002. Adding intelligent help to mixed-initiative spoken dialogue systems. In Proceedings of the Seventh Intema- tional Conference on Spoken Language Processing (ZCSLP),D enver, CO. B.A. Hockey, G. Aist, J. Dowding, and J. Hieronymus. 2002. Targeted help and dialogue about plans. In Proceedings of the 40th Annual Meeting of the Asso- ciation for Computational Linguistics (demo track), Philadelphia, PA. S. Knight, G. Gorrell, M. Rayner, D. Milward, R. Koel- ing, and I. Lewin. 200 1. Comparing grammar-based and robust approaches to speech understanding: a case study. In Proceedings of Eurospeech 2001, pages 1779-1782, Aalborg, Denmark. D. Martin, A. Cheyer, and D. Moran. 1998. Build- ing distributed software systems with the open agent architecture. In Proceedings of the Third Intema- tional Conference on the Practical Application of In- telligent Agents and Multi-Agent Technology, Black- pool, Lancashire, UK. Nuance, 2002. http:Nwww.nuance.com. As of 1 Feb 2002. The Festival Speech Synthesis Systems, 2001. http://www.cstr.ed.ac.uk/projects/festival. As of 28 February 2001.