Voice Interaction Design. Crafting the New Conversational Speech Systems PDF

592 Pages·2004·11.089 MB·English


Preface

We need to be able to work at a level of abstraction concrete enough to provide leverage within the task-artifact cycle, yet abstract enough to cumulate and develop as a theory base.
--John M. Carroll

There are new machines abroad, talking machines. They are very impressive devices, especially to people who understand the magnitude of the problems involved. But the making of machines that talk is dominated by concerns for recognition rates, vocabulary size, and processing capacity, with comparative inattention to the interaction. There has not been, up to this point, sufficient focus on what these machines should actually say and listen. The simple premise of this book is that if the effort for the human-factors aspects matched the recognition and processing aspects of these devices, then we would all be happier with our talking machines.

You

I wrote this book to serve a wide range of speech-related interests, and you are the best judge, by far, of whether it serves yours. But I can help, I hope, by sketching out the sort of people I developed the book to engage, to assist, and even occasionally to entertain.

If you are a speech-system interaction designer, or an interaction designer aspiring to work with speech systems, you are the primary audience, especially if you work with (or aspire to work with) systems that have a conversational style. I have laid out for you a soup-to-nuts overview of the linguistic, pragmatic, and conversational principles that undergird speech interaction, and coupled that with a detailed map of the process for developing a voice interaction, cramming examples in at every turn and steadily outlining the necessary instruments.

If you are a multimodal designer, for whom speech is not the sole interaction channel, but one of several possibilities, I wrote this book for you as well.
The speech aspects of multimodal design work are different from those of speech-only interaction design--more complicated in some ways, simpler in others--but they depend just as heavily on the fundamentals of speech, conversation, and interface development explained throughout this book.

If you are part of the human-computer interaction community generally, a practitioner or researcher who is broadly interested in the various terrains interfacing between humans and machines--their respective constraints and affordances--you are also in my audience. Speech-system designers read good interaction-design books regularly, most of which are predominantly about graphic interfaces, not because we design such systems (though some do), but just for ideas, and information, and new perspectives. Now that there is a body of voice interaction books appearing, graphic-design people and human-computer folks generally will surely profit from reading them--for ideas, information, and perspectives--and I have kept such readers in mind while developing this book. Most pertinently, the conversational focus I adopt relentlessly in this book is especially rich for human-computer professionals, all of whom understand that the field is defined at almost every level as a type of conversation between human and computer agents.

If you are a manager of speech projects, you will also benefit from this book, and reading it can help you identify the necessary personnel for such projects, provide them with the appropriate resources, and help you understand the specific voice interaction elements your project demands. Similarly, if you are a client of a speech company, Voice Interaction Design will help you to know the questions to ask, the features to request, the data you can gather to guide and facilitate such projects, and the sorts of interactions you can expect the system to support.
Me

I am an academic, with input/output in linguistics, literature, rhetoric, and technical communication. I am a teacher, with input/output in linguistics, rhetoric, human factors, usability, documentation, and information design. I am a corporate hack, with input/output in quality assurance, usability, and graphic and voice interaction development. If this sounds like unstable and opportunistic careering around a group of loosely related fields and professions, you're right. It is.

But opportunistic careering has its virtues. On the academic front, it has given me routine exposure, from a broad range of perspectives, to the best that has been thought and said about the issues and problems of language--the most intimate and gregarious and human-defining singularity we share. On the teaching front, it has given me routine exposure to keen and challenging minds who don't settle for easy answers, unless, as sometimes proves to be the case, the answers really are easy. On the corporate front, it has brought me into contact with innovative, dedicated, practical minded folk, and their remarkable, obtuse machines. My work on all three fronts is integrated into this book.

This Book

Voice Interaction Design is not a Book of Pristine Theory, in the coherent-system-of-explanation-and-prediction sense. There is no Unified Theory of Voice Interaction here. It is also not a Book of Messy Practice, in the first-we-tried-this-and-then-we-tried-that sense. There is no Story of Speech Application X here. Voice Interaction Design is a book of theorized practice, a book that brings a diverse body of theory to a growing body of practice and elaborates the ways the first can inform the development of the second.

The body of theory, while not elegant, is robust. It comes from very smart people, who have thought very hard about the features, patterns, and problems of communication for thousands of years.
I have brought the results of all this thought into new relationships and done some relabeling, grafting and pruning. The merit of the theory in this book is not mine, though; it belongs to my sources.

The body of practice, while not unitary, is also robust. The data is good. It comes from decades of work, with an especially furious burst of activity in the nineties through the turn of the century--work by smart and hardworking people, who strive to design and build machines that speak and, when spoken to, respond helpfully and congenially. Its substantial merit, too, belongs to my sources.

The body of theory in this book is a potpourri, assembled for utility from the results, speculations, and research programs of a variety of overlapping and interpenetrating fields: human-computer interaction, conversational analysis, philosophy of language, cognitive psychology, social psychology, computational linguistics, rhetoric, technical communication, chatterbot theory, and interface design. There is, in short, nothing sacred about what has been assembled here. Frankly, it is profane, in the old-fashioned sense of the term. It is outside the temple, all the temples. I am proceeding without allegiance to, or reverence for, any particular body of doctrine. My methodological model, the model of theorized practice, is the crafts.

The applied research that has fed this book is considerably more unified than the theory. The sources are diverse: conference proceedings, books, technical articles, academic papers, observation, expert advice, hallway chit-chat, and long hours stuck in automated telephony hell. But the motivations behind all of these sources orbit around one overwhelming theme: how to get these stubborn, brilliant, ubiquitous instruments--computers--to behave cooperatively through the medium of speech.
The Research

A huge amount of the theory and practice I draw on, especially involving the human-computer interaction, telephony, and computer speech communities, I collected via the Web, which occasions both gratitude (to my sources) and an apology (to my readers). I am tremendously grateful to the researchers who have made their work so available on the Internet, and more generally to the many sources of opinion and information that populate this vital, virtual world. This is an explosion of open-source information that in many ways recalls the development of public science in the 17th century out of secretive, information-hoarding practices like alchemy. While the quality can be highly variable, much of this available research is superb, and the movement to make it so widely available is almost entirely salutary.

But--here's the apology--I'm sorry that I have not always been able to document those sources as well as I would like. In particular, many of the documents that I have worked with either have unreliable pagination (those published in conference proceedings but seen by me only as downloaded PDF or PS files), or no pagination at all (those published only in HTML or XML); some sites disappeared after I used them, so their URLs became useless for citation purposes; and some have multiple addresses, so that picking a single authoritative URL to cite became a bit of a guessing game. In general, I have prioritized the citations according to authors, titles, and original venues, trusting that anyone with a search engine will be able to track them down without too much trouble. There likely remain both linkrot and vagaries in the citations, however, and I know how aggravating it is to see a quotation tagged by something like "(Derf, 1996)," and discover that the source is 36 pages long in an unsearchable file format, if locatable at all.
But I've done the best that I can, and the alternative in many cases was either eliminating many sources altogether, or dedicating my remaining years to pursuing final, ultimate authoritative citations for them all, while the manuscript for this book lay moldering in my computer.

I have also minimized the number of footnotes in the text. There are still a handful in almost every chapter, offloading concerns from the (already brimming) main text, or addressing counter-arguments and counter-themes to my assorted proclamations. Part of being an academic, of course, is being obsessive. Earlier versions of this book were encrusted with notes full of qualifications, ramifications, and preoccupations. All but a narrow group of the readers found the footnotes mostly a bother. I have relented, hacking them back savagely. To readers as anal-retentive as myself, I apologize. For everyone else, I hope the book has increased fluidity.

Dialogue Inclusions

In general, I have played somewhat fast and loose with the snippets of dialogue I incorporate in this book from other sources--capitalizing, punctuating, eliminating distractions--always in the hopes of clarifying what is going on, or of minimizing transcription weirdnesses. For instance, a great many speech researchers use English orthography for reporting their dialogue data (and even their scenario data) without using its standard conventions, or using them very irregularly--not capitalizing names of days or months, for instance, but capitalizing proper names and first-person singular pronouns. I'm not sure at all what the motivations for this practice are, though it may be related to the widespread belief among some researchers that spoken language is "ungrammatical" (they're wrong; it's not). In any case, my apologies to any authors who take offense. I am confident that I haven't introduced any distortions, and rarely does any point hinge on one of my adjustments.
I am not doing this editing out of a sense of grammatical correctness, to be a smart-ass, or for aesthetics, but simply to clarify parts of the dialogue (use/mention distinctions, for instance, and self-quotations), and to eliminate textual suggestions that spoken language is deficient. In general, it is the human/human dialogues that are most heavily edited in this book (removing much of the transcription machinery of conversation analysts in particular); with human/machine dialogues, my alterations are mostly a matter of capitalization, punctuation, agent-labeling, and line breaks.

Some Acknowledgments

One pleasant evening at the beginning of this century, I was having dinner at La Toscana, with Rick and Linda Serafini, and Rick was holding forth on investment opportunities with a company that designed software for talking to your appliances. A week later, Douglas Wright, former president of the University of Waterloo and all-round technology bon vivant, called to tell me about a company that he was working with that was developing speech-recognition products. He invited me to drop by, on the off-chance that someone who worked in linguistics might have some advice for them about the finer points of language. I accepted. Ipso novo--thank you Rick and Doug--I had a whole new research interest.

Many, many people since then have helped me to bring this book together. I am especially grateful to Scott Brave, Jennifer Lai, Clifford Nass, and Nicole Yankelovich for sharing their unpublished work with me. Lai and Yankelovich also read and reviewed versions of this book, as did Daryle Gardner-Bonneau, Martha Lindeman, Chris Schmandt, and, most generously, Ellen Isaacs, whose early-draft review deserves special praise--a meticulously close reading that was a model of collegial helpfulness.
Other members of the dialogue and speech-system communities have also been kind--providing tips, suggestions, advice, responses to inquiries, and general goodwill, starting chronologically and decisively with Victor Lee. Ian Mccallum, Sunny Mendes, Richard Rosinski, Greg Sanders, Laura Sasaki, Phil Shin, Steve Shepherd, Flora Shiu, and Rakesh Tailor were also generous with their time and expertise. Not all of these folks agree with what I have to say here, and some of them disagree quite violently with my overall perspective or with some specific claims. But the importance of their feedback to the quality and integrity of this book is immeasurable.

My colleagues at the University of Waterloo have been supportive and creative allies--especially, at a critical junction, Harry Logan, Andrew McMurry, and Glenn Stillar--for which I thank them heartily. I have also had the benefit of several discussions with Chrysanne DiMarco. Fakhri Karray, along with Otman Basir of the University of Guelph, have been particularly generous in allowing me to work with and explore the technology that they are developing.

My contact with students at the University of Waterloo has been the most sustained source of joy I have had in my professional life. I thank all of them collectively, but a small subset needs to be singled out for helping keep me engaged and challenged about voice-interaction design: Heather Calder, Gabriel Chan, Dewlyn D'Mellow, David Flett, Opal Gamble, David Gillis, Kim Honeyford, John Jong-Suk Lee, Teresa Winky Mak, Sheila McConnell, Kim McMullen, Sarah Mohr, Maria Andrusiak Morland, Amy Oulette, Robert Shanks, Mike Truscello, Aliya Walji, Phil Wang, Michelle Wilier, and Karl Wierzbicki. Zarsheesh Divecha and Kateryna Zolotkova helped with the bibliography.
Thank you all, with special mention to Dewlyn, for her bank project and for many lively discussions; to Gabriel for sending me useful materials; to the stellar '04 class (Heather, David, Kim, and Sarah); and to Mike.

Diane Cerra, Belinda Breyer, Mona Buehler, and Daniel Stone have been wonderfully supportive in the iterative creation of this book, and Matt Wagner gave me some very useful advice early on. The experience working with Morgan Kaufmann has been all that I could have hoped for.

Antepenultimately, I am deeply obliged to Galen, for inspiration, insatiable curiosity, and intense dedication; for writing and illustrating nine books to my one, five of them in one week; and for protecting me from "the forces of evil" that sought to interfere with my writing time. The badges, too, were a big help.

Penultimately, I am grateful also to Oriana, for dancing, singing, loving, sowing joy, slipping notes and pictures into the study, getting down off my stacks of books when I asked, and for, sometimes, not pounding on the study door.

Ultimately, I thank Indira, always Indira, for everything but the mornings.

Chapter 1
Introduction

The user interface to an interactive product such as software can be defined as the languages through which the user and the product communicate with each other.
--Deborah Mayhew

Interfaces

User interfaces (hereafter, mostly just "interfaces" with various specifying adjectives, like "voice" and "graphic") are the media by which people interact with digital systems. Although it is an abstract and quite shallow image (in particular, an interface is part of the system, not outside it, and strictly speaking, so is the user), it is nevertheless useful to picture the interface in the terms of Figure 1.1: A user performs some physical actions according to an established protocol, effecting input, which triggers electronic operations. Almost always, these operations include indications for the user about the effect of her actions; that is, feedback.
Voice interfaces are interactive media in which the input is primarily or exclusively speech, and so is the feedback. They are new phenomena, especially in the conversational style that dominates the approach of this book, and their design draws on wide bodies of language research, from linguistics, philosophy, sociology, and psychology. The results of this research, in turn, need to be interpreted, then applied, through the disciplines of computer science, telephony, and interaction design.

Voice interfaces are so new, indeed, that their properties and possibilities are still being worked out, daily. So new, that there is still only a partial awareness among designers that those properties and possibilities need to be investigated in a way that is independent of specific systems and technologies. So new, that--even though there is widespread recognition that building usable speech systems overwhelmingly implicates the field of human-computer interaction--the word interface is not often used by the very people who are developing them. "Spoken Language Systems" is more common, along with a variety of terms featuring "dialogue"--like "Automatic Telephone Dialogues" and "Spoken Dialogue Systems." These labels reflect much the same attitude that characterized the early years of graphic interface development, which was subsumed under general application development and often designed by the same software engineers who designed the system mechanics.

[Figure 1.1: A rough schematic of user interfaces. Labeled elements: User, User Interface, Digital System, Input, Feedback.]

Voice interfaces are among the growing range of interfaces populating the modern landscape, from the rapid, virtually invisible interfaces of digital calculators to the awkward on-screen interfaces of digital televisions, most of them exploiting multiple modes (chiefly, sound, vision, and touch).
But three specific interface categories are significant for the evolution and design of voice interfaces. Two are familiar from computer interaction: the nearly archaic command-line interface and the ubiquitous graphic interface. The third comes from telephony: the keypad interface, which provides interaction with messaging applications, automated reception systems, telephone banking, and the like. It is in these domains, where keypad interfaces have dominated for a decade or more, that voice interfaces are beginning steadily to appear. An inevitable and near-total eclipse of keypad interfaces is around the next bend (how far ahead that bend is depends on economic and technological developments; but it is the next bend).

Since all three of these interface types are implicated in voice-interaction design, and since I will be drawing analogies from them throughout the book, we'll look at them briefly in turn, and also glance at the related area of multimodal interfaces, where voice may come to play an increasingly important role.

Command-line Interfaces

149. Implementation restriction: The "stringrange" on-condition cannot be enabled when the "substr" pseudo-variable is used. Note that this restriction does not apply to the "substr" built-in function.
--A PL/I Multics error diagnostic

Command-line interfaces were usually just called "Man-Machine Interfaces" (MMIs) in their period of dominance, the 1970s. The name reflected both the gender imbalance that characterized computer development and the fact that they were the only game in town. As in Figure 1.2, they work on a linguistic paradigm (strings of alphanumeric "words" arranged in a determinate syntax).
[Figure 1.2: A rough schematic of command-line user interfaces. Labeled elements: User, User Interface, Digital System, Input, Feedback.]

Users issue typed commands and receive textual feedback, resulting in interactions of the following sort:

User    $MESSAGESYSTEM RETRIEVE
MTS     Mailbox EYVQ: 1 new, 4 old messages

That particular specimen is the command users once issued to MTS (the Michigan Terminal System) to get a list of email, followed by MTS's response.

Command-line interfaces like MTS or the better-known Multics (Multiplexed Information and Computing Service) could be tremendously efficient, compressing a whole range of operations into one sleek line of text, so long as the user knew their narrow vocabularies and rigid syntaxes. But these interfaces incorporated very little sense of the user, and were brutally unforgiving. In particular, there was comparatively negligible attention to the clarity of the feedback. The system either did what you told it to, with minimal confirmation of its actions, or it generated an error message, often cryptic in the extreme, to account for why it wasn't doing what you thought you told it to do. Here is a from-the-trenches characterization of the user's relation to the system, during the heyday of command-line systems:

The user is often placed in the position of an absolute master over an awesomely powerful slave, who speaks a strange and painfully awkward tongue, whose obedience is immediate and complete but woefully thoughtless, without regard to the potential destruction of its master's things, rigid to the point of being psychotic, lacking sense, memory, compassion, and--worst of all--obvious consistency. (Miller and Thomas, 1977: 172)

They weren't completely lacking in goodwill, however, and some subsystems even showed a measure of personality.
I was once fortunate enough to get, upon issuing the following command, the next-following response:

Me      $MESSAGESYSTEM RETREIVE NEW
MTS     Didn't your momma ever tell you: I before E, except after C?

The last bastions of this era are MS-DOS (Microsoft Disk Operating System), which has been overlain by various incarnations and generations of Windows, and Unix (named by way of an arcane, nose-thumbing pun on Multics), which undergirds (among other graphic interfaces) Linux and Mac OS X. While command-line interfaces still have active user populations in some specialized communities, and while the graphic overlays sometimes have to be peeled back to the command-line level when things go wrong, or when higher efficiencies are needed, they are powerful yet lumbering dinosaurs; graphic interfaces have clearly inherited the earth.
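
The narrow vocabulary and rigid syntax of command-line interfaces can be made concrete with a small sketch. The Python toy below is not a reconstruction of MTS or Multics: the vocabulary table, the matching rules, the `interpret` function, and the numbered error strings are all invented for illustration, and only the command spellings echo the MTS exchanges quoted in this section. It accepts exact words in a fixed order and answers with either a bare confirmation or a terse error.

```python
# Toy command-line interpreter. NOT a reconstruction of MTS or Multics:
# the vocabulary, the grammar, and the error text are invented for illustration.

# Hypothetical vocabulary: each command word maps to the operands it accepts.
VOCABULARY = {
    "$MESSAGESYSTEM": {"RETRIEVE", "SEND"},
}


def interpret(line: str) -> str:
    """Return terse feedback for one typed command line."""
    words = line.split()
    if not words:
        return "ERROR 001: NULL COMMAND"
    command, operands = words[0], words[1:]
    # Matching is exact and case-sensitive: no synonyms, no spelling help.
    if command not in VOCABULARY:
        return f"ERROR 002: UNKNOWN COMMAND {command}"
    if not operands:
        return f"ERROR 003: {command} REQUIRES OPERAND"
    if operands[0] not in VOCABULARY[command]:
        return f"ERROR 004: INVALID OPERAND {operands[0]}"
    # Success earns only a minimal confirmation.
    return f"{command} {operands[0]}: DONE"


if __name__ == "__main__":
    for typed in (
        "$MESSAGESYSTEM RETRIEVE",
        "$MESSAGESYSTEM RETREIVE",
        "$messagesystem retrieve",
    ):
        print(f"{typed!r:32} -> {interpret(typed)}")
```

A single transposed letter or a lower-case word is enough to turn a valid command into a numbered error, which is the unforgiving behavior the passage describes. A conversational design would instead try to recover the user's intent from the near miss.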
