DTIC ADA457057: Characterizing and Processing Robot-Directed Speech

Characterizing and Processing Robot-Directed Speech

Paulina Varchavskaia, Paul Fitzpatrick, Cynthia Breazeal
AI Lab, MIT, Cambridge, USA
paulina, paulfitz, [email protected]

[DTIC Report Documentation Page (Standard Form 298): report date 2001; Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, The Stata Center, Building 32, Cambridge, MA 02139; approved for public release, distribution unlimited; 9 pages, unclassified; the original document contains color images.]

Abstract. Humanoid robots are more suited to being generalists rather than specialists. Hence when designing a speech interface, we need to retain that generality. But speech recognition is most successful in strongly circumscribed domains. We examine whether some useful properties of infant-directed speech can be evoked by a robot, and how the robot's vocabulary can be adapted. (need 200 words)

Keywords: speech recognition, ASR, word-spotting

1 Introduction

A natural-language interface is a desirable component of a humanoid robot. In the ideal, it allows for natural hands-free communication with the robot without necessitating any special skills on the human user's part. In practice, we must trade off the flexibility of the interface against its robustness. Contemporary speech understanding systems rely on strong domain constraints to achieve high recognition accuracy [23]. This paper makes an initial exploration of how ASR techniques may be applied to the domain of robot-directed speech with flexibility that matches the expectations raised by the robot's humanoid form.

A crucial factor for the suitability of current speech recognition technology to a domain is the expected perplexity of sentences drawn from that domain. Perplexity is a measure of the average branching factor within the space of possible word sequences, and so generally grows with the size of the vocabulary. For example, the basic vocabulary used for most weather-related queries may be quite small, whereas for dictation it may be much larger and with a much less constrained grammar. In the first case speech recognition can be applied successfully for a large user population across noisy telephone lines [22], whereas in the second a good-quality headset and extensive user training are required in practice. It is important to determine where robot-directed speech lies on this spectrum. This will depend on the nature of the task to which the robot is being applied, and the character of the robot itself. For this paper, we will consider the case of Kismet [5], an "infant-like" robot whose form and behavior are designed to elicit nurturing responses from humans. Among other effects, the youthful character of the robot is expected to confine discourse to the here-and-now.

Sections 4 and 5 look at these effects in more detail. Sections 6 and 7 look at methods that don't rely on such cooperative forms of speech. We expect both mechanisms to play a role in practical language modeling for a general-purpose humanoid robot.

2 Background

Recent developments in speech research on robots have followed two basic approaches. The first approach builds on techniques developed for command-and-control style interfaces. These systems employ the standard strategy found in ASR research of limiting the recognizable vocabulary to a particular predetermined domain or task, so as to ensure a manageable vocabulary size. For instance, the ROBITA robot [16] interprets command utterances and queries related to its function and creators, using a fixed vocabulary of 1,000 words. Within a fixed domain, fast performance with few errors becomes possible, at the expense of any ability to interpret out-of-domain utterances. But in many cases this is perfectly acceptable, since there is no sensible response available for such utterances even if they were modeled.

A second approach adopted by some roboticists [19,17] is to allow adjustable (mainly growing) vocabularies. This introduces a great deal of complexity, but has the potential to lead to more open, general-purpose systems. Vocabulary extension is achieved through a label acquisition mechanism based on a learning algorithm, which may be supervised or unsupervised. This approach was taken in particular in the development of CELL [19], Cross-channel Early Language Learning, where a robotic platform called Toco the Toucan is developed and a model of early human language acquisition is implemented on it. CELL is embodied in an active vision camera placed on a four-degree-of-freedom motorized arm and augmented with expressive features to make it appear like a parrot. The system acquires lexical units from the following scenario: a human teacher places an object in front of the robot and describes it. The visual system extracts color and shape properties of the object, and CELL learns on-line a lexicon of color and shape terms grounded in the representations of objects. The terms learned need not pertain to color or shape exclusively - CELL has the potential to learn any words, the problem being that of deciding which lexical items to associate with which semantic categories. In CELL, associations between linguistic and contextual channels are chosen on the basis of maximum mutual information. Also in [17], a Pioneer-1 mobile robot was programmed with a system to cluster its sensory experiences using an unsupervised learning algorithm. In this way the robot extends its vocabulary by associating sets of sensory features with the spoken labels that are most frequently uttered in their presence.

3 Our approach

We share the goal of automatically acquiring new vocabulary, but our methods are different: we wish to do so using conventional speech recognition systems, staying within the ASR paradigm as much as possible.

The rest of this paper examines two broad classes of robot-directed speech which are relevant to vocabulary extension. In the first case, we consider speech that is cooperative - speech that, intentionally or not, has properties that allow us to cast the machine learning problem as supervised. This is examined in Section 4. In the second case, we consider neutral speech for which this is not the case, and the problem is essentially unsupervised. This is examined in Section 6.

4 Supervised extension

As mentioned earlier, natural speech may be the most convenient medium for a human to interact with a humanoid robot, including in the case of communicating vocabulary extensions. The latter case is a specialized kind of interaction, where the human plays the role of a teacher, with the accompanying modifications in the discourse with which the robot is addressed. When interacting with a young-appearing robot such as Kismet in particular, we can expect that the speech input may have specialized characteristics similar to those of infant-directed speech (IDS). This section examines some of the properties of IDS so they may inform our expectations of the nature of the Kismet-directed speech signal.

4.1 Properties of infant-directed speech

In this paper we examined the following two questions regarding the nature of infant-directed speech:
– Does it include a substantial proportion of single-word utterances? Presenting words in isolation would solve the difficult word-segmentation problem.
– How often, if at all, is it clearly enunciated and slowed down in an unnatural way? Overarticulated speech may be helpful to infants, but may be detrimental to artificial speech recognizers.

Isolated words - Whether isolated words in parental speech help infants learn has been a matter of some debate. It has been shown that infant-directed utterances are usually short with longer pauses between words (e.g., research cited in [21]), but also that they do not necessarily contain a significant proportion of isolated words [1]. Another study [6] presents evidence that isolated words are a reliable feature of infant-directed speech, and that infants' early word acquisition may be facilitated by their presence. In particular, the authors find that the frequency of exposure to a word in isolation is a better predictor of whether the word will be learned than the total frequency of exposure. This suggests that isolated words may be easier for infants to process and learn. Equally important for us, however, is the evidence for a substantial presence of isolated words in IDS: 9% found in [6] and 20% reported in [1]. If Kismet achieves its purpose of eliciting nurturing behavior from humans, then maybe we can expect a similar proportion of Kismet-directed speech to consist of single-word utterances. This hypothesis will undergo a preliminary evaluation here.

Enunciated speech and "vocal shaping" - The tendency of humans to slow down and overarticulate their utterances when they meet with misunderstanding has been reported as a problem in the ASR community [12]. Such enunciated speech considerably degrades the performance of speech recognition systems which were trained on natural speech only. If we find that human caretakers tend to address Kismet with overarticulated speech, its presence becomes a problem to be addressed by the robot's perceptual system.

In infant-directed speech, we might expect overarticulation to occur in an instructional context, when a caretaker deliberately introduces the infant to a new word or corrects a mispronunciation. Another possible strategy is that of "shaping" the infant's pronunciation by selecting and repeating the mispronounced part of the word until a satisfactory result is reached. There is anecdotal evidence that parents may employ such a strategy.

4.2 Preliminary exploration

To facilitate some preliminary exploration of this area, experiments were conducted in which subjects were instructed to try to teach a robot words. While the response of the robot was not the focus of these experiments, a very basic vocabulary extension was constructed to encourage users to persist in their efforts. The system consisted of a simple command-and-control style grammar. Sentences that began with phrases such as "say", "can you say", "try" etc. were treated as requests for the robot to repeat the phonetic sequence that followed them.
If, after the robot repeated a sequence, a positive phrase such as "yes" or "good robot" was used, the sequence would be entered in the vocabulary. If instead the human's next utterance was similar enough to the first, it was assumed to be a correction and the robot would repeat it. Because of the relatively low accuracy of phoneme-level recognition, such corrections are the rule rather than the exception.

Maybe also: Kismet's "production" system (so sophisticated).

5 "Supervised" results - preliminary analysis of the input signal

The purpose of this preliminary study is to suggest new ways of improving the speech interface on the robot based on a better knowledge of the properties of speech directed at this particular robot. This section presents the results of the study and a discussion of the method used, with directions for similar future research.

We have analyzed video recordings of 13 children aged from 5 to 10 (?) years old interacting with the robot. Each session lasted approximately 20 minutes. In two of the sessions, two children are playing with the robot at the same time. In the rest of the sessions, only one child is present with the robot. The recordings were originally made for Sherry Turkle's research on children's perception of technology and identity.

5.1 Preliminary data analysis

We have looked to establish, in particular, whether any of the following strategies are present in Kismet-directed speech:
– single-word utterances (words spoken in isolation)
– enunciated speech
– vocal shaping
– vocal mimicry of Kismet's babble

A total of 831 utterances were transcribed from the 13 sessions of children playing with the robot. Of these, 303 utterances, or 36.5%, consisted of a single word said in isolation. 27.4% of transcribed utterances (228) contained enunciated speech. An utterance was counted as "enunciated speech" whenever deliberate pauses between words or syllables within a word, and vowel lengthening, were used. The count therefore includes the very frequent examples where a subject would ask the robot to repeat a word, e.g. "Kismet, can you say: GREEN?". In such examples, GREEN would be the only enunciated part of the utterance, but the whole question was counted as containing enunciated speech. In the whole body of data we have discovered only 6 plausible instances (0.7%) of vocal shaping. Finally, there were 23 cases of children imitating the babbling sounds that Kismet made, which accounts for 2.8% of the transcribed utterances.

These coarse figures mask the finding that there was a very wide distribution of strategies among different subjects. In the following, deviations from the mean are mentioned to give an idea of the wide range of the data. They are not meaningful otherwise, since we have not observed any Gaussian distributions. The total number of utterances varied from session to session in the range between 19 and 169, with a mean of 64 (standard deviation of 44, based on a sample of 13) utterances per session. The percentage of single-word utterances had a distribution among subjects with a mean at 34.8 and a standard deviation of 21.1. These numbers come from counts of single-word utterances including instances of greetings, such as "Hello!", and attention-bidding using the name of the robot, i.e. "Kismet!". The statistics change if we exclude such instances from the counts, as can be seen in Table 1.

subject    #utterances  #single-word   %      #greetings  #"kismet"   % (excl. greetings
                        utterances                        utterances  and robot's name)
1          94           65             69.2   0           30          37.2
2          19           9              47.4   1           2           31.6
3          128          69             54.0   11          46          9.3
4          37           17             46.0   2           7           21.6
5          26           9              34.7   3           0           23.1
6          61           14             23.0   9           0           8.2
7          34           2              5.9    1           0           2.9
8          73           43             58.9   0           0           58.9
9          169          39             23.1   8           9           13.0
10         32           17             53.1   0           2           46.9
11         56           7              12.5   3           1           5.4
12         33           5              15.2   5           0           0.0
13         69           7              10.1   3           0           5.8
total      831          303                   46          97
mean                                   34.8                           20.3
deviation                              21.1                           18.5

Table 1. Percentage of isolated words in Kismet-directed speech.

Specifically, if we exclude both greetings and the robot's name from counts of single-word utterances, we get a distribution centered around 20.3% with a standard deviation of 18.5%. This still accounts for a substantial proportion of all recorded Kismet-directed speech. Examining the other distributions, we find that the mean proportion of enunciated speech is 25.6% with a deviation of 20.4%. The percentage of vocal imitation of Kismet's babble has a mean of 2.0% and a deviation of 4.0%, which is again a very large variation. The same pattern holds for the proportion of vocal shaping utterances: a mean of 0.6% with a standard deviation of 1.1%. Thus, children in this dataset used varied strategies to communicate with the robot, and there does not seem to be enough evidence to suggest that the strategies of vocal shaping and imitation play an important part in it.

5.2 Discussion

The results presented above seem encouraging. However, before we draw any meaningful conclusions from the analysis, we must realize that in this instance, the process of gathering the data and the method of analysis had several shortcomings. The data itself, as was mentioned earlier, came from recordings of interactions set up for the purposes of an unrelated sociological study of children. (AM I SUPPOSED TO CITE SHERRY? BUT HOW?)

The interaction sessions were not set up as controlled experiments, and do not necessarily represent spontaneous Kismet-directed speech. In particular, on all occasions but one, at some point during the interaction children were instructed to make use of the currently implemented command-and-control system to get the robot to repeat words after them. In some cases, once that happened, the subject was so concerned with getting the robot to repeat a word that anything else simply disappeared from the interaction. On three occasions, the subjects were instructed to use the "say" keyword as soon as they sat in front of the robot. When subjects are so clearly focused on a teaching scenario, we can expect the proportion of isolated words, for instance, to be unnaturally high.

Note also that as of now, we have no measure of accuracy of the transcriptions, which were done by hand by one transcriber, from audio that sometimes had poor quality. Given the focus of the analysis, only Kismet-directed speech was noted from each interaction, excluding any conversations that the child may have had with other humans who were present during the session. Deciding which utterances to transcribe was clearly another judgment call that we cannot validate here yet. Finally, since the speech was transcribed by hand, we cannot claim a scientific definition of an utterance (e.g., by pause duration) but must rely on one person's judgment call again.

However, this preliminary analysis shows promise in that we have found many instances of isolated words in Kismet-directed speech, suggesting that Kismet's environment may indeed be scaffolded for word learning. We have also found that a substantial proportion of speech was enunciated. This would present problems for the speech recognizer, but at the same time opens new possibilities. For an improved word-learning interface, it may be possible to discriminate between natural and enunciated speech to detect instances of pronunciation teaching (this approach was taken in the ASR community, for example in [12]). On the other hand, the strategy of vocal shaping was not clearly present in the interactions, and there were few cases of mimicry. More (and better) research would determine how reliable or not these features of Kismet-directed speech may be.

We plan in the future to conduct much more controlled studies to explore further the nature of the speech input to the robot. The setup will involve filming 20 minutes of interaction between an adult subject and Kismet. The subjects will be told that they are controls in an experiment with children and that they should play freely with the robot while we record the interaction. All care will be taken not to constrain the subjects' speech patterns artificially by asking them to teach Kismet words. The data will be sequenced and transcribed independently by two people, so the transcripts may be compared and analyzed for agreement and error. An utterance will be defined as continuous speech between two pauses which last for longer than a threshold. We will be looking to address in depth the question of how word-teaching scenarios are different from other kinds of interaction with the robot, by examining the prosody, vowel and pause lengthening (enunciation) and repetitions in the speech input. We will also be interested in finding out whether Kismet-directed speech has a high proportion of topics related to the robot's immediate environment, for the purposes of attaching meaning to the words that the robot learns.

6 Unsupervised vocabulary extension

This section develops a technique to bootstrap from an initial vocabulary (perhaps introduced by the methods described in Section 4) by building a model of unrecognized parts of utterances. The purpose of this model is both to improve recognition accuracy on the initial vocabulary and to automatically identify candidates for vocabulary extension. This work draws on research in word spotting and speech recognition. In word spotting, utterances are modeled as a relatively small number of keywords floating on a sea of unknown words. In speech recognition, an occasional unknown word may punctuate utterances that are otherwise assumed to be completely in-vocabulary. Despite this difference in viewpoint, in some circumstances implementations of the two may become very similar. When transcribed utterances are available for a domain, word spotting benefits from the more detailed background model this can support [13].
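The pause-based utterance definition planned in Section 5.2 (continuous speech between two pauses longer than a threshold) is straightforward to pin down. A minimal sketch, assuming speech segments have already been detected with timestamps (the threshold value and names are our own illustration):

```python
# Sketch of the pause-based utterance definition from Section 5.2:
# an utterance is continuous speech between two pauses longer than a
# threshold. Input: (start, end) times in seconds for detected speech
# segments. Illustrative only, not the authors' implementation.

PAUSE_THRESHOLD = 0.5  # seconds; an assumed value, tunable in practice

def segment_utterances(speech_segments, threshold=PAUSE_THRESHOLD):
    """Merge speech segments separated by pauses shorter than `threshold`."""
    utterances = []
    for start, end in sorted(speech_segments):
        if utterances and start - utterances[-1][1] < threshold:
            # short pause: continue the current utterance
            utterances[-1] = (utterances[-1][0], end)
        else:
            # long pause: start a new utterance
            utterances.append((start, end))
    return utterances
```

For example, segments at (0.0, 1.2), (1.4, 2.0), (3.1, 3.8) seconds yield two utterances: the first two segments merge across the 0.2 s pause, while the 1.1 s pause starts a new utterance.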
The manner in which the background is modeled in these cases is reminiscent of speech recognition. For example, a large vocabulary with good coverage may be extracted from the corpus, so that relatively few words in an utterance remain unmodeled. In this case, the situation is qualitatively similar to OOV (out-of-vocabulary) modeling in a conventional speech recognizer, except that the vocabulary is strictly divided into "filler" and "keyword".

We will bootstrap from a relatively weak background model for word-spotting, where OOV words dominate, to a much stronger model where many more word or phrase clusters have been "moved to the foreground" and explicitly modeled. With this increase in vocabulary comes an increase in the potency of language modeling, boosting performance on the original vocabulary.

The remainder of this section shows how a conventional speech recognizer can be convinced to cluster frequently occurring acoustic patterns, without requiring the existence of transcribed data.

A speech recognizer with a phone-based OOV model is able to recover an approximate phonetic representation for words or word sequences that are not in its vocabulary. If commonly occurring phone sequences can be located, then adding them to the vocabulary will allow the language model to capture their co-occurrence with words in the original vocabulary, potentially boosting recognition performance. This suggests building a "clustering engine" that scans the output of the speech recognizer, correlates OOV phonetic sequences across all the utterances, and updates the vocabulary with any frequent, robust phone sequences it finds. While this is feasible, the kinds of judgments the clustering engine needs to make about acoustic similarity and alignment are exactly those at which the speech recognizer is most adept. This section describes a way to convince the speech recognizer to perform clustering almost for free, eliminating the need for an external module to make acoustic judgments.

[Figure 1 shows the iterative clustering procedure as a loop: run recognizer -> hypothesized transcript and N-best hypotheses -> extract OOV fragments / identify rarely-used additions / identify competition -> add to lexicon / remove from lexicon / update lexicon and baseforms -> update language model -> run recognizer again.]

Fig. 1. The iterative clustering procedure.

The clustering procedure is shown in Figure 1. An n-gram-based language model is initialized randomly, or trained up using whatever data is available - for example, a small collection of transcribed utterances. Unrecognized words are explicitly represented using a phone-based OOV model, described in the next section. The recognizer is then run on a large set of untranscribed data. The phonetic and word-level outputs of the recognizer are compared so that occurrences of OOV words are assigned a phonetic transcription. A randomly cropped subset of these are tentatively entered into the vocabulary, without any attempt yet to evaluate their significance (e.g. whether they occur frequently, whether they are dangerously similar to a keyword, etc.). The hypotheses made by the recognizer are used to retrain the language model, making sure to give the newly added vocabulary items some probability in the model. Then the recognizer runs using the new language model and the process iterates. The recognizer's output can be used to evaluate the worth of the new "vocabulary" entries. The following sections detail how to eliminate vocabulary items the recognizer finds little use for, and how to detect and resolve competition between similar items.

Extracting OOV phone sequences - The recognizer is that developed by the SLS group at MIT [8]. The recognizer used the OOV model developed by Bazzi in [3]. This model can match an arbitrary sequence of phones, and has a phone bigram to capture phonotactic constraints. The OOV model is placed in parallel with the models for the words in the vocabulary. A cost parameter can control how much the OOV model is used at the expense of the in-vocabulary models. This value was fixed at zero throughout the experiments described in this paper, since it was more convenient to control usage at the level of the language model. The bigram used in this project is exactly the one used in [3], with no training for the particular domain.

Recovering phonemic representations - It is useful to convert the extracted phone sequences to phonemes if they are to be added as baseforms in the lexicon. Although the sequences could be kept in their original form by creating a dummy set of units for the baseforms that are passed verbatim by the phonological rules, converting to phonemes adds some small amount of generalization over allophones to the sequence's pronunciation, and reduces the number of competing forms that have to be dealt with later (see Section 0). I make the conversion in a naive way, classifying single or paired phonetic units into a set of equivalence classes that correspond to phonemes. For example, taps and cleanly enunciated stops are mapped to the same phoneme, with explicit closures being dropped. Although the procedure does not capture some contextual effects, it achieves perfectly adequate performance (see Section 0). Phoneme sequences are given an arbitrary name and added to the list of vocabulary and baseforms.
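The naive equivalence-class conversion just described might be sketched as follows. The class table here is a tiny illustrative fragment of our own invention, not the actual mapping used with the SLS recognizer's phone set:

```python
# Sketch of the naive phone-to-phoneme conversion described above:
# single or paired phonetic units are mapped into equivalence classes,
# e.g. taps and cleanly enunciated stops collapse to one phoneme and
# explicit closures are dropped. The table is illustrative only.

PHONE_CLASSES = {
    "dx": "t",        # tap realized as the /t/ phoneme
    "tcl t": "t",     # closure + release collapsed, closure dropped
    "dcl d": "d",
    "em": "m",        # syllabic variant folded into the plain consonant
}

def to_phonemes(phones):
    """Map a phone sequence to phoneme classes, pairing closures first."""
    out, i = [], 0
    while i < len(phones):
        pair = " ".join(phones[i:i + 2])
        if pair in PHONE_CLASSES:          # paired unit (e.g. closure+stop)
            out.append(PHONE_CLASSES[pair])
            i += 2
        else:                              # single unit, default to itself
            out.append(PHONE_CLASSES.get(phones[i], phones[i]))
            i += 1
    return out
```

Under this toy table, the phone string "tcl t iy dx ax" maps to the phoneme string "t iy t ax".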
To ensure that the language model assigns some probability to these new vocabulary items the next time the recognizer runs, a collection of randomly generated sentences is added to the outputs of the recognizer used in retraining.

Dealing with rarely-used additions - If a phoneme sequence introduced into the vocabulary is actually a common sound sequence in the acoustic data, then the recognizer will pick it up and use it. Otherwise, it just will not appear very often in hypotheses. After each iteration a histogram of phoneme sequence occurrences in the output of the recognizer is generated, and those below a threshold are cut.

Dealing with competing additions - Very often, two or more very similar phoneme sequences will be added to the vocabulary. If the sounds they represent are in fact commonly occurring, both are likely to prosper and be used more or less interchangeably by the recognizer. This is unfortunate for language modeling purposes, since their statistics will not be pooled and so will be less robust. Happily, the output of the recognizer makes such situations very easy to detect. In particular, this kind of confusion can be uncovered through analysis of the N-best utterance hypotheses.

If we imagine a set of N-best hypotheses aligned and stacked vertically, then competition is indicated if two vocabulary items exhibit both of these properties:
– Horizontally repulsive - if one of the items appears in a single hypothesis, the other will not appear in its vicinity.
– Vertically attractive - the items frequently occur in the same part of a collection of hypotheses for a particular utterance.

Since the utterances in this domain are generally short and simple, it did not prove necessary to rigorously align the hypotheses. Instead, items were considered to be aligned based simply on the vocabulary items preceding and succeeding them. It is important to measure both the attractive and repulsive conditions, to distinguish competition from vocabulary items that are simply likely or unlikely to occur in close proximity.

Accumulating statistics about the above two properties across all utterances gives a reliable measure of whether two vocabulary items are essentially acoustically equivalent to the recognizer. If they are, they can be merged or pruned so that the statistics maintained by the language model will be well trained. For clear-cut cases, the competing items are merged as alternatives in the baseform entry for a single vocabulary unit. (A better alternative might have been to use class n-grams and put the items into the same class, but this works fine.) For less clear-cut cases, one item is simply deleted.

Here is an example of this process in operation. In this example, "phone" is a keyword present in the initial vocabulary. These are the 10-best hypotheses for the given utterance:

"what is the phone number for victor zue"

<oov> phone (n ah m b er) (m ih t er z) (y uw)
<oov> phone (n ah m b er) (m ih t er z) (z y uw)
<oov> phone (n ah m b er) (m ih t er z) (uw)
<oov> phone (n ah m b er) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (z y uw)
<oov> phone (ah m b er f) (m ih t er z) (y uw)
<oov> (ax f aa n ah) (m b er f ax r) (m ih t er z) (z y uw)
<oov> (ax f aa n ah) (m b er f ax r) (m ih t er z) (y uw)
<oov> phone (ah m b er f) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (uw)

The "<oov>" symbol corresponds to an out-of-vocabulary sequence. The phone sequences within parentheses are uses of items added to the vocabulary in a prior iteration of the algorithm. From this single utterance, we acquire evidence that:

– The entry for (ax f aa n ah) may be competing with the keyword "phone". If this holds up statistically across all the utterances, the entry will be destroyed. The keyword vocabulary is given special status, since keywords represent a link to the outside world that should not be modified.
– (n ah m b er), (m b er f ax r) and (ah m b er f) may be competing. They are compared against each other because all of them are followed by the same sequence (m ih t er z) and many of them are preceded by the same word "phone".
– (y uw), (z y uw), and (uw) may be competing.

All of these will be patched up for the next iteration. This use of the N-best utterance hypotheses is reminiscent of their application to computing a measure of recognition confidence in [11].

Testing for convergence - For any iterative procedure, it is important to know when to stop. If we have transcribed data, we can track the keyword error rate on that data and halt when the increment in performance is sufficiently small. If there is no transcribed data, then we cannot directly measure the error rate. We can however bound the rate at which it is changing by comparing keyword locations in the output of the recognizer between iterations. If few keywords are shifting location, then the error rate cannot be changing above a certain bound. We can therefore place a convergence criterion on this bound rather than on the actual keyword error rate. It is important to measure just changes in keyword locations, and not changes in vocabulary items added by clustering. Items that do not occur often tend to be destroyed and rediscovered continuously, making comparisons difficult.

7 Experiments in unsupervised vocabulary extension

The unsupervised procedure described in the previous section is intended both to improve recognition accuracy on the initial vocabulary, and to identify candidates for vocabulary extension. This section describes experiments that demonstrate to what degree these goals were achieved. To facilitate comparison with other ASR systems, results are quoted for a fairly typical domain called LCSInfo [9], developed by the SLS group at MIT. This domain consists of queries about personnel - their addresses, phone numbers etc.

7.1 Experiment 1: Qualitative Results

This section describes the candidate vocabulary discovered by the clustering procedure. Numerical, performance-related results are reported in the next section.

Results given here are from a clustering session with an initial vocabulary of five keywords (email, phone, room, office, address), run on a set of 1566 utterances. Transcriptions for the utterances were available but not used by the clustering procedure. Here are the top 10 clusters discovered on this very typical run, ranked by decreasing frequency of occurrence:

1. /n ah m b er/
2. /w eh r ih z/
3. /w ah t ih z/
4. /t eh l m iy/
5. /k ix n y uw/
6. /p l iy z/
7. /ae ng k y uw/
8. /n ow/
9. /hh aw ax b aw/
10. /g r uw p/

These clusters are used consistently by the recognizer in places corresponding to: "number, where is, what is, tell me, can you, please, thank you, no, how about, group," respectively in the transcription. The first, /n ah m b er/, is very frequent because of "phone number", "room number", and "office number". Once it appears as a cluster, the language model is immediately able to improve recognition performance on those keywords. Other high-frequency clusters correspond to common first names (Karen, Michael).

Every now and then a "parasite" appears, such as /dh ax f ow n/ (from an instance of "the phone" that the recognizer fails to spot) or /iy n eh l/ (from "email"). These have the potential to interfere with the detection of the keywords they resemble acoustically. But as soon as they have any success, they are detected and eliminated as described in Section [sect]. It is possible that if a parasite doesn't get greedy, and for example limits itself to one person's pronunciation of a keyword, it will not be detected, although I didn't see any examples of this happening.

Many simple sentences can be modeled completely after clustering, without need to fall back on the generic OOV phone model. For example, the utterances:

What is Victor Zue's room number
Please connect me to Leigh Deacon

are recognized as:

(w ah t ih z) (ih t er z uw) room (n ah m b er)
(p l iy z) (k ix n eh k) (m iy t uw) (l iy d iy) (k ix n)

all of which are entries in the vocabulary and so contribute to the language model. All the discovered vocabulary items are assigned one or more baseforms as described in Section [sect]. For example, the nasal in /n ah m b er/ is sometimes recognized, sometimes not, so both pronunciations are added to a single baseform.

7.2 Experiment 2: Quantitative Results

For experiments involving small vocabularies, it is appropriate to measure performance in terms of Keyword Error Rate (KER). I take this to be:

    KER = ((F + M) / T) * 100        (1)

with:
F: number of false or poorly localized detections
M: number of missed detections
T: true number of keyword occurrences in the data

A detection is only counted as such if it occurs at the right time. Specifically, the midpoint of the hypothesized time interval must lie within the true time interval the keyword occupies.
I take forced alignments of the 5 k ix n y uw 10 g r uw p These clusters are used consistently by the recog- testsetasgroundtruth.Thismeansthatfortestingitis nizer in places corresponding to: ”number, where is, better to omit utterances with artifacts and words out- what is, tell me, can you, please, thank you, no, sidethefullvocabulary,sothattheforcedalignmentis how about, group,” respectively in the transcription. likelytobesufficientlyprecise. The first, /n ah m b er/, is very frequent because of Theexperimentsherearedesignedtoidentifywhen ”phonenumber”,”roomnumber”,and”officenumber”. clusteringleadstoreducederrorratesonakeywordvo- Once it appearsas a cluster the languagemodelis im- cabulary.Sincetheformofclusteringaddressedinthis mediately able to improve recognitionperformanceon paperisfundamentallyaboutextendingthevocabulary, 8 Varchavskaiaetal 100 7.3 Experiment3:KismetDomain Baseline performance 90 Performance after clustering An exploratory experiment was carried out for data 80 drawn from robot-directed speech collected for the K- ismet robot. This data comes from an earlier series of 70 %) recordingsessions[7]ratherthantheonesdescribedin word error rate ( 456000 Ss“aoelkcietainyot”nwa4po.prdeEasarrsaluymchroenasgsult“thskeiastmroepetpt”er,no“mcnlious”sint,eg“rss–o.rBrsuyetm”,tha“nirstoiwbcaooltrl”ky, key isinaverypreliminarystage. 30 20 8 DiscussionandConclusions 10 00 10 20 30 40 50 60 70 80 90 100 Paperisacollectionofstuff,notaunifiedwhole.Work coverage (%) in progress. Hard hat area. Divers alarums and excur- sions.Exeuntallpursuedbythefuries. Fig.2.Keyworderrorrateofbaselinerecognizerandcluster- ingrecognizerastotalcoveragevaries. Thispaperdoesnotaddressthecrucialissueofbind- ing vocabularyto meaning.One line of research under way is to use transient, task-dependentvocabulariesto communicate the temporal structure of processes. 
An- other line of research looks more generally at how a wewouldexpectittobeuselessifthevocabularyisal- robot can establist a shared basis for communication ready large enough to give good coverage. We would with human through learning expressive verbal behav- expectittoofferthegreatestimprovementwhenthevo- iorsaswellasacquiringthehumans’existinglinguistic cabularyissmallest.Tomeasuretheeffectofcoverage, labels. the full vocabulary was made smaller and smaller by Problem of affective speech – too much darned incrementally removing the most infrequent words. A prosody.Goodthingaboutprosody:itmayhelpdistin- set of keywords were chosen and kept constant and in guish a word teaching scenario from normalconversa- thevocabularyacrossalltheexperimentssotheresults tion. The robot could operate in different modes then wouldnotbeconfoundedbypropertiesofthekeywords (mentionedinsection5.2). themselves(forexample,themostcommonword”the” Ultimateresearchinterests:Howcanarobotestab- would make a very bad keyword since it is often un- lishasharedbasisforcommunication?Informedbyin- stressedandlooselypronounced).Thesamesetofkey- fantlanguageresearch.Establishingamechanismfora wordswereusedasintheprevioussection. robot to vocalize its behavioral and internal state in a consistentmannerunderstandabletohumans. Clustering is again performed without making any THISCAMEFROM4.1: useoftranscripts.Totrulyeliminateanydependenceon thetranscripts,anacousticmodeltrainedonlyonadif- ferent dataset was used. This reduced performancebut Vocalimitationandreferentialmapping Parentstend madeiteasiertointerprettheresults. 
to interpret their children’s first utterances very gener- Figure2showsaplotoferrorratesonthetestdata ouslyandoftenattributemeaningandintentwherethere asthesizeofthevocabularyisvariedtoprovidediffer- maybenone[4].Ithasbeenshown,however,thatsuch entdegreesofcoverage.Themoststrikingresultisthat a strategy may indeed help infants coordinate mean- theclusteringmechanismreducesthesensitivityofper- ingandsoundandlearntoexpressthemselvesverbally. formanceto dropsin coverage.In thisscenario,theer- Pepperberg[18]formalizedtheconceptintoateaching ror rateachievedwith thefullvocabulary(whichgives techniquecalledreferentialmapping.Thestrategyisfor 84.5% coverage on the training data) is 33.3%. When theteachertotreatthepupil’sspontaneousutterancesas thecoverageislow,theclusteredsolutionerrorratere- meaningful,and act upon them. This, it is shown, will mainsunder50%-inrelativeterms,theerrorincreases encouragethe pupil to associate the utterance with the byatmostahalfofitsbestvalue.Straightapplicationof meaning that the teacher originally gave it, so the stu- alanguagemodelgiveserrorratesthatmorethandouble dentwillusethesamevocalizationagaininthefutureto ortrebletheerrorrate. makeasimilarrequestorstatement.Thetechniquewas successfullyusedinaidingthedevelopmentofchildren Asareferencepoint,thekeyworderrorrateusinga withspecialneeds. languagemodeltrainedwiththefullvocabularyonthe For the purposes of the research reported here, we fullsetoftranscriptionswithanacousticmodeltrained arenotconcernedwiththemeaningofwordsyet.How- onallavailabledatagivesan8.3%KER. ever,oneofthepurposesofvocabularyextensionsisto Robot-DirectedSpeech 9 buildasharedbasisformeaningfulcommunicationbe- [10] A.L. Gorin, D. Petrovksa-Delacrtaz, G. Riccardi, and tweenthehumanandtherobot,andreferentialmapping J.H. Wright. 
Learning spoken language without tran- maybeoneofthepromisinglinesofdevelopment.We scriptions.InProc.IEEEAutomaticSpeechRecognition arethereforeinterestedinfindingouthowoftenhumans andUnderstandingWorkshop,Colorado,1999. [11] T.J.HazenandI.Bazzi. Acomparisonandcombination spontaneouslytreatKismet’s utterancesasmeaningful. ofmethodsforoovworddetectionandwordconfidence WHYDOITHINKTHATOnewayofdoingthisisto scoring.InProc.InternationalConferenceonAcoustics, lookathowoftentheyareimitatedbythehumans.? SaltLakeCity,Utah,May2001. [12] J.Hirschberg,D.Litman,andM.Swerts. Prosodiccues Acknowledgements torecognitionerrors. InASRU. [13] P.Jeanrenaud,K.Ng,M.Siu,J.R.Rohlicek,andH.Gish. TheauthorswouldliketothankSherryTurkle,JenAu- Phonetic-basedwordspotter:Variousconfigurationsand dley,AnitaChan,TamaraKnutsen,BeckyHurwitz,and applicationtoevent spotting. InProc.EUROSPEECH, the MIT Initiativeon TechnologyandSelf, for making 1993. [14] P.W.Jusczyk.TheDiscoveryofSpokenLanguage.Cam- availablethevideorecordingsthatwereanalyzedinthis bridge:MITPress,1997. paper. [15] P.W.Jusczyk and R.N.Aslin. Infants’ detection of the Partsofthisworkrelyheavilyonspeechrecognition soundpatternsofwordsinfluentspeech. CognitivePsy- toolsandcorporadevelopedbytheSLSgroupatMIT. chology,29:1–23,1995. FundsforthisprojectwereprovidedbyDARPA as [16] Y.MatsusakaandT.Kobayashi. Humaninterfaceofhu- partofthe”NaturalTaskingofRobotsBasedonHuman manoid robot realizinggroup communication inreal s- InteractionCues”projectundercontractnumberDABT pace. InProc.SecondInternationalSymposiumonHu- 63-00-C-10102,andbytheNipponTelegraphandTele- manoidRobots,pages188–193,1999. phone Corporationas part of the NTT/MIT Collabora- [17] T. Oates, Z. Eyler-Walker, and P. Cohen. Toward nat- tionAgreement. ural language interfaces for robotic agents: Grounding linguisticmeaninginsensors. InProceedingsofthe4th InternationalConferenceonAutonomousAgents,pages References 227–228,2000. [18] I.Pepperberg. 
Referentialmapping:Atechniqueforat- [1] R.N. Aslin, J.Z.Woodward, N.P.LaMendola, and T.G. taching functional significance to the innovative utter- Bever. Models of word segmentation in fluent mater- ancesofanafricangreyparrot. AppliedPsycholinguis- nal speech to infants. In J.L. Morgan and K. Demuth, tics,11:23–44,1990. editors,SignaltoSyntax:BootstrappingFromSpeechto [19] D.K. Roy. Learning Words from Sights and Sounds: Grammar inEarlyAcquisition.LawrenceErlbaumAs- A Computational Model. PhD thesis, MIT, September sociates:Mahwah,NJ,1996. 1999. [2] E.G.Bardand A.H.Anderson. Theunintelligibilityof [20] C.Trevarthen. Communicationandcooperationinear- speechtochildren:effectsofreferentavailability. Jour- ly infancy: a description of primary intersubjectivity. nalofChildLanguage,21:623–648,1994. In M. Bullowa, editor, Before Speech: The beginning [3] I. Bazzi and J.R. Glass. Modeling out-of-vocabulary ofinterpersonal communication. Cambridge University words for robust speech recognition. In Proc. 6th In- Press,1979. ternationalConferenceonSpokenLanguageProcessing, [21] J.F.Werker,V.L.Lloyd,J.E.Pegg,andL.Polka. Putting Beijing,China,October2000. thebabyinthebootstraps:Towardamorecompleteun- [4] P.Bloom. HowChildrenLearntheMeaningofWords. derstandingoftheroleoftheinputininfantspeechpro- Cambridge:MITPress,2000. cessing.InSignaltoSyntax:BootstrappingFromSpeech [5] C.Breazeal. SociableMachines:ExpressiveSocialEx- toGrammarinEarlyAcquisition. changeBetweenHumansandRobots. PhDthesis,MIT [22] V. Zue, J. Glass, J. Plifroni, C. Pao, and T.J. Hazen. DepartmentofElectricalEngineeringandComputerSci- Jupiter: A telephone-based conversation interface for ence,2000. weatherinformation. IEEETransactionsonSpeechand [6] M.R. Brent and J.M. Siskind. The role of exposure to AudioProcessing,8:100–112,2000. isolatedwordsinearlyvocabularydevelopment. Cogni- [23] V. Zue and J.R. Glass. Conversational interfaces: Ad- tion,81:B33–B44,2001. 
vancesandchallenges.ProceedingsoftheIEEE,Special [7] C.BreazealandL.Aryananda. Recognitionofaffective IssueonSpokenLanguage Processing,Vol.88,August communicativeintentinrobot-directedspeech. InPro- 2000. ceedingsofHumanoids2000,Cambridge,MA,Septem- ber2000. [8] J. Glass, J. Chang, and M. McCandless. A proba- bilisticframeworkforfeature-basedspeechrecognition. InProc.InternationalConferenceonSpokenLanguage Processing,pages2277–2280,1996. [9] J. Glass and E. Weinstein. Speechbuilder: Facilitating spokendialoguesystemsdevelopment. In7thEuropean ConferenceonSpeechCommunicationandTechnology, Aalborg,Denmark,September2001.


