Characterizing and Processing Robot-Directed Speech

Paulina Varchavskaia, Paul Fitzpatrick, Cynthia Breazeal
AI Lab, MIT, Cambridge, USA
paulina,paulfitz,cynthia@ai.mit.edu

Abstract. Humanoid robots are more suited to being generalists rather than specialists. Hence when designing a speech interface, we need to retain that generality. But speech recognition is most successful in strongly circumscribed domains. We examine whether some useful properties of infant-directed speech can be evoked by a robot, and how the robot's vocabulary can be adapted.

Keywords: speech recognition, ASR, word-spotting

1 Introduction

A natural-language interface is a desirable component of a humanoid robot. Ideally, it allows natural hands-free communication with the robot without requiring any special skills on the human user's part. In practice, we must trade off the flexibility of the interface against its robustness. Contemporary speech understanding systems rely on strong domain constraints to achieve high recognition accuracy [23]. This paper makes an initial exploration of how ASR techniques may be applied to the domain of robot-directed speech with flexibility that matches the expectations raised by the robot's humanoid form.

A crucial factor for the suitability of current speech recognition technology to a domain is the expected perplexity of sentences drawn from that domain. Perplexity is a measure of the average branching factor within the space of possible word sequences, and so generally grows with the size of the vocabulary. For example, the basic vocabulary used for most weather-related queries may be quite small, whereas for dictation it may be much larger, with a much less constrained grammar. In the first case speech recognition can be applied successfully for a large user population across noisy telephone lines [22], whereas in the second a good-quality headset and extensive user training are required in practice. It is important to determine where robot-directed speech lies on this spectrum. This will depend on the nature of the task to which the robot is being applied, and on the character of the robot itself. For this paper, we will consider the case of Kismet [5], an "infant-like" robot whose form and behavior are designed to elicit nurturing responses from humans. Among other effects, the youthful character of the robot is expected to confine discourse to the here-and-now. Sections 4 and 5 look at these effects in more detail. Sections 6 and 7 look at methods that do not rely on such cooperative forms of speech. We expect both mechanisms to play a role in practical language modeling for a general-purpose humanoid robot.

2 Background

Recent developments in speech research on robots have followed two basic approaches. The first approach builds on techniques developed for command-and-control style interfaces. These systems employ the standard strategy found in ASR research of limiting the recognizable vocabulary to a particular predetermined domain or task, so as to ensure a manageable vocabulary size. For instance, the ROBITA robot [16] interprets command utterances and queries related to its function and creators, using a fixed vocabulary of 1,000 words. Within a fixed domain, fast performance with few errors becomes possible, at the expense of any ability to interpret out-of-domain utterances. But in many cases this is perfectly acceptable, since there is no sensible response available for such utterances even if they were modeled.

A second approach adopted by some roboticists [19,17] is to allow adjustable (mainly growing) vocabularies. This introduces a great deal of complexity, but has the potential to lead to more open, general-purpose systems. Vocabulary extension is achieved through a label acquisition mechanism based on a learning algorithm, which may be supervised or unsupervised. This approach was taken in particular in the development of CELL [19], Cross-channel Early Language Learning, where a robotic platform called Toco the Toucan is developed and a model of early human language acquisition is implemented on it. CELL is embodied in an active vision camera placed on a four-degree-of-freedom motorized arm and augmented with expressive features to make it appear like a parrot. The system acquires lexical units from the following scenario: a human teacher places an object in front of the robot and describes it. The visual system extracts color and shape properties of the object, and CELL learns on-line a lexicon of color and shape terms grounded in the representations of objects. The terms learned need not pertain to color or shape exclusively - CELL has the potential to learn any words, the problem being that of deciding which lexical items to associate with which semantic categories. In CELL, associations between linguistic and contextual channels are chosen on the basis of maximum mutual information. Also in [17], a Pioneer-1 mobile robot was programmed with a system to cluster its sensory experiences using an unsupervised learning algorithm. In this way the robot extends its vocabulary by associating sets of sensory features with the spoken labels that are most frequently uttered in their presence.

3 Our approach

We share the goal of automatically acquiring new vocabulary, but our methods are different: we wish to stay within the paradigm of conventional ASR systems as much as possible. The rest of this paper examines two broad classes of robot-directed speech which are relevant to vocabulary extension. In the first case, we consider speech that is cooperative - speech that, intentionally or not, has properties that allow us to cast the machine learning problem as supervised. This is examined in Section 4. In the second case, we consider neutral speech for which this is not the case, and the problem is essentially unsupervised. This is examined in Section 6.

4 Supervised extension

As mentioned earlier, natural speech may be the most convenient medium for a human to interact with a humanoid robot, including in the case of communicating vocabulary extensions. The latter case is a specialized kind of interaction, where the human plays the role of a teacher, with the accompanying modifications in the discourse with which the robot is addressed. When interacting with a young-appearing robot such as Kismet in particular, we can expect that the speech input may have specialized characteristics similar to those of infant-directed speech (IDS). This section examines some of the properties of IDS so that they may inform our expectations of the nature of the Kismet-directed speech signal.

4.1 Properties of infant-directed speech

In this paper we examined the following two questions regarding the nature of infant-directed speech:

– Does it include a substantial proportion of single-word utterances? Presenting words in isolation would solve the difficult word-segmentation problem.
– How often, if at all, is it clearly enunciated and slowed down in an unnatural way? Overarticulated speech may be helpful to infants, but may be detrimental to artificial speech recognizers.

Isolated words. Whether isolated words in parental speech help infants learn has been a matter of some debate. It has been shown that infant-directed utterances are usually short with longer pauses between words (e.g., research cited in [21]), but also that they do not necessarily contain a significant proportion of isolated words [1]. Another study [6] presents evidence that isolated words are a reliable feature of infant-directed speech, and that infants' early word acquisition may be facilitated by their presence. In particular, the authors find that the frequency of exposure to a word in isolation is a better predictor of whether the word will be learned than the total frequency of exposure. This suggests that isolated words may be easier for infants to process and learn. Equally importantly for us, however, is the evidence for a substantial presence of isolated words in IDS: 9% found in [6] and 20% reported in [1]. If Kismet achieves its purpose of eliciting nurturing behavior from humans, then we can expect a similar proportion of Kismet-directed speech to consist of single-word utterances. This hypothesis will undergo a preliminary evaluation here.

Enunciated speech and "vocal shaping". The tendency of humans to slow down and overarticulate their utterances when they meet with misunderstanding has been reported as a problem in the ASR community [12]. Such enunciated speech considerably degrades the performance of speech recognition systems that were trained on natural speech only. If we find that human caretakers tend to address Kismet with overarticulated speech, its presence becomes a problem to be addressed by the robot's perceptual system.

In infant-directed speech, we might expect overarticulation to occur in an instructional context, when a caretaker deliberately introduces the infant to a new word or corrects a mispronunciation. Another possible strategy is that of "shaping" the infant's pronunciation by selecting and repeating the mispronounced part of the word until a satisfactory result is reached. There is anecdotal evidence that parents may employ such a strategy.

4.2 Preliminary exploration

To facilitate some preliminary exploration of this area, experiments were conducted in which subjects were instructed to try to teach a robot words. While the response of the robot was not the focus of these experiments, a very basic vocabulary extension system was constructed to encourage users to persist in their efforts. The system consisted of a simple command-and-control style grammar. Sentences that began with phrases such as "say", "can you say", "try", etc. were treated as requests for the robot to repeat the phonetic sequence that followed them. If, after the robot repeated a sequence, a positive
phrase such as "yes" or "good robot" was used, the sequence would be entered in the vocabulary. If instead the human's next utterance was similar enough to the first, it was assumed to be a correction and the robot would repeat it. Because of the relatively low accuracy of phoneme-level recognition, such corrections are the rule rather than the exception.

5 "Supervised" results - preliminary analysis of the input signal

The purpose of this preliminary study is to suggest new ways of improving the speech interface on the robot, based on a better knowledge of the properties of speech directed at this particular robot. This section presents the results of the study and a discussion of the method used, with directions for similar future research.

We have analyzed video recordings of 13 children aged from 5 to 10 (?) years old interacting with the robot. Each session lasted approximately 20 minutes. In two of the sessions, two children are playing with the robot at the same time. In the rest of the sessions, only one child is present with the robot. The recordings were originally made for Sherry Turkle's research on children's perception of technology and identity.

5.1 Preliminary data analysis

We have looked to establish, in particular, whether any of the following strategies are present in Kismet-directed speech:

– single-word utterances (words spoken in isolation)
– enunciated speech
– vocal shaping
– vocal mimicry of Kismet's babble

A total of 831 utterances were transcribed from the 13 sessions of children playing with the robot. Of these, 303 utterances, or 36.5%, consisted of a single word said in isolation. 27.4% of transcribed utterances (228) contained enunciated speech. An utterance was counted as "enunciated speech" whenever deliberate pauses between words or syllables within a word, and vowel lengthening, were used. The count therefore includes the very frequent examples where a subject would ask the robot to repeat a word, e.g. "Kismet, can you say: GREEN?". In such examples, GREEN would be the only enunciated part of the utterance, but the whole question was counted as containing enunciated speech. In the whole body of data we have discovered only 6 plausible instances (0.7%) of vocal shaping. Finally, there were 23 cases of children imitating the babbling sounds that Kismet made, which accounts for 2.8% of the transcribed utterances.

These coarse figures mask the finding that there was a very wide distribution of strategies among different subjects. In the following, deviations from the mean are mentioned to give an idea of the wide range of the data. They are not meaningful otherwise, since we have not observed any Gaussian distributions. The total number of utterances varied from session to session in the range between 19 and 169, with a mean of 64 (standard deviation of 44, based on a sample of 13) utterances per session. The percentage of single-word utterances had a distribution among subjects with a mean at 34.8% and a standard deviation of 21.1%. These numbers come from counts of single-word utterances including instances of greetings, such as "Hello!", and attention-bidding using the name of the robot, i.e. "Kismet!". The statistics change if we exclude such instances from the counts, as can be seen in Table 1.

Specifically, if we exclude both greetings and the robot's name from counts of single-word utterances, we get a distribution centered around 20.3% with a standard deviation of 18.5%. This still accounts for a substantial proportion of all recorded Kismet-directed speech. Examining the other distributions, we find that the mean proportion of enunciated speech is 25.6% with a deviation of 20.4%. The percentage of vocal imitation of Kismet's babble has a mean of 2.0% and a deviation of 4.0%, which is again a very large variation. The same pattern holds for the proportion of vocal shaping utterances: a mean of 0.6% with a standard deviation of 1.1%. Thus, children in this dataset used varied strategies to communicate with the robot, and there does not seem to be enough evidence to suggest that the strategies of vocal shaping and imitation play an important part in it.

5.2 Discussion

The results presented above seem encouraging. However, before we draw any meaningful conclusions from the analysis, we must realize that in this instance, the process of gathering the data and the method of analysis had several shortcomings. The data itself, as was mentioned earlier, came from recordings of interactions set up for the purposes of an unrelated sociological study of children.

The interaction sessions were not set up as controlled experiments, and do not necessarily represent spontaneous Kismet-directed speech. In particular, on all occasions but one, at some point during the interaction, children were instructed to make use of the currently implemented command-and-control system to get the robot to repeat words after them. In some cases, once that happened, the subject was so concerned with getting the robot to repeat a word that anything else simply disappeared from the interaction. On three occasions, the subjects were instructed to use the "say" keyword
subject   #utterances   #single-word   %      #single-word   #kismet      % (excl. greetings
                        utterances            greetings      utterances   and robot's name)
1         94            65             69.2   0              30           37.2
2         19            9              47.4   1              2            31.6
3         128           69             54.0   11             46           9.3
4         37            17             46.0   2              7            21.6
5         26            9              34.7   3              0            23.1
6         61            14             23.0   9              0            8.2
7         34            2              5.9    1              0            2.9
8         73            43             58.9   0              0            58.9
9         169           39             23.1   8              9            13.0
10        32            17             53.1   0              2            46.9
11        56            7              12.5   3              1            5.4
12        33            5              15.2   5              0            0.0
13        69            7              10.1   3              0            5.8
total     831           303                   46             97
mean                                   34.8                               20.3
deviation                              21.1                               18.5

Table 1. Percentage of isolated words in Kismet-directed speech

as soon as they sat in front of the robot. When subjects are so clearly focused on a teaching scenario, we can expect the proportion of isolated words, for instance, to be unnaturally high.

Note also that as of now, we have no measure of accuracy of the transcriptions, which were done by hand by one transcriber, from audio that sometimes had poor quality. Given the focus of the analysis, only Kismet-directed speech was noted from each interaction, excluding any conversations that the child may have had with other humans who were present during the session. Deciding which utterances to transcribe was clearly another judgement call that we cannot validate here yet. Finally, since the speech was transcribed by hand, we cannot claim a scientific definition of an utterance (e.g., by pause duration) but must rely on one person's judgement call again.

However, this preliminary analysis shows promise in that we have found many instances of isolated words in Kismet-directed speech, suggesting that Kismet's environment may indeed be scaffolded for word learning. We have also found that a substantial proportion of speech was enunciated. This would present problems for the speech recognizer, but at the same time opens new possibilities. For an improved word-learning interface, it may be possible to discriminate between natural and enunciated speech to detect instances of pronunciation teaching (this approach was taken in the ASR community, for example in [12]). On the other hand, the strategy of vocal shaping was not clearly present in the interactions, and there were few cases of mimicry. More (and better) research would determine how reliable or not these features of Kismet-directed speech may be.

We plan in the future to conduct much more controlled studies to explore further the nature of the speech input to the robot. The setup will involve filming 20 minutes of interaction between an adult subject and Kismet. The subjects will be told that they are controls in an experiment with children and that they should play freely with the robot while we record the interaction. All care will be taken not to constrain the subjects' speech patterns artificially by asking them to teach Kismet words. The data will be sequenced and transcribed independently by two people, so the transcripts may be compared and analyzed for agreement and error. An utterance will be defined as continuous speech between two pauses which last for longer than a threshold. We will be looking to address in depth the question of how word teaching scenarios are different from other kinds of interaction with the robot, by examining the prosody, vowel and pause lengthening (enunciation) and repetitions in the speech input. We will also be interested in finding out whether Kismet-directed speech has a high proportion of topics related to the robot's immediate environment, for the purposes of attaching meaning to the words that the robot learns.

6 Unsupervised vocabulary extension

This section develops a technique to bootstrap from an initial vocabulary (perhaps introduced by the methods described in Section 4) by building a model of unrecognized parts of utterances. The purpose of this model is both to improve recognition accuracy on the initial vocabulary and to automatically identify candidates for vocabulary extension. This work draws on research in word spotting and speech recognition. In word spotting, utterances are modeled as a relatively small number of keywords floating on a sea of unknown words. In speech recognition, an occasional unknown word may punctuate utterances that are otherwise assumed to be completely in-vocabulary. Despite this difference in viewpoint, in some circumstances implementations of
the two may become very similar. When transcribed utterances are available for a domain, word spotting benefits from the more detailed background model this can support [13]. The manner in which the background is modeled in these cases is reminiscent of speech recognition. For example, a large vocabulary with good coverage may be extracted from the corpus, so that relatively few words in an utterance remain unmodeled. In this case, the situation is qualitatively similar to OOV (out-of-vocabulary) modeling in a conventional speech recognizer, except that the vocabulary is strictly divided into "filler" and "keyword".

We will bootstrap from a relatively weak background model for word-spotting, where OOV words dominate, to a much stronger model where many more word or phrase clusters have been "moved to the foreground" and explicitly modeled. With this increase in vocabulary comes an increase in the potency of language modeling, boosting performance on the original vocabulary.

The remainder of this section shows how a conventional speech recognizer can be convinced to cluster frequently occurring acoustic patterns, without requiring the existence of transcribed data.

A speech recognizer with a phone-based OOV model is able to recover an approximate phonetic representation for words or word sequences that are not in its vocabulary. If commonly occurring phone sequences can be located, then adding them to the vocabulary will allow the language model to capture their co-occurrence with words in the original vocabulary, potentially boosting recognition performance. This suggests building a "clustering engine" that scans the output of the speech recognizer, correlates OOV phonetic sequences across all the utterances, and updates the vocabulary with any frequent, robust phone sequences it finds. While this is feasible, the kinds of judgments the clustering engine needs to make about acoustic similarity and alignment are exactly those at which the speech recognizer is most adept. This section describes a way to convince the speech recognizer to perform clustering almost for free, eliminating the need for an external module to make acoustic judgments.

The clustering procedure is shown in Figure 1. An ngram-based language model is initialized randomly, or trained up using whatever data is available - for example, a small collection of transcribed utterances. Unrecognized words are explicitly represented using a phone-based OOV model, described in the next section. The recognizer is then run on a large set of untranscribed data. The phonetic and word level outputs of the recognizer are compared so that occurrences of OOV words are assigned a phonetic transcription. A randomly cropped subset of these are tentatively entered into the vocabulary, without any attempt yet to evaluate their significance (e.g. whether they occur frequently, whether they are dangerously similar to a keyword, etc.). The hypotheses made by the recognizer are used to retrain the language model, making sure to give the newly added vocabulary items some probability in the model. Then the recognizer runs using the new language model and the process iterates. The recognizer's output can be used to evaluate the worth of the new "vocabulary" entries. The following sections detail how to eliminate vocabulary items the recognizer finds little use for, and how to detect and resolve competition between similar items.

[Fig. 1. The iterative clustering procedure. Flowchart: run recognizer -> hypothesized transcript and N-best hypotheses -> extract OOV fragments, identify rarely-used additions, identify competition -> add to lexicon, remove from lexicon, update lexicon baseforms -> update language model -> run recognizer again.]

Extracting OOV phone sequences. The recognizer is the one developed by the SLS group at MIT [8]. It uses the OOV model developed by Bazzi in [3]. This model can match an arbitrary sequence of phones, and has a phone bigram to capture phonotactic constraints. The OOV model is placed in parallel with the models for the words in the vocabulary. A cost parameter can control how much the OOV model is used at the expense of the in-vocabulary models. This value was fixed at zero throughout the experiments described in this paper, since it was more convenient to control usage at the level of the language model. The bigram used in this project is exactly the one used in [3], with no training for the particular domain.

Recovering phonemic representations. It is useful to convert the extracted phone sequences to phonemes if they are to be added as baseforms in the lexicon. Although the sequences could be kept in their original form by creating a dummy set of units for the baseforms that are passed verbatim by the phonological rules, converting to phonemes adds some small amount of generalization over allophones to the sequence's pronunciation, and reduces the number of competing forms that have to be dealt with later (see Section 0). We make the conversion in a naive way, classifying single or paired phonetic units into a set of equivalence classes that correspond to phonemes. For example, taps and cleanly enunciated stops are mapped to the same phoneme, with explicit closures being dropped. Although the procedure does not capture some contextual effects, it achieves perfectly adequate performance (see Section 0).

Phoneme sequences are given an arbitrary name and added to the list of vocabulary and baseforms. To ensure that the language model assigns some probability to these new vocabulary items the next time the recognizer runs, a collection of randomly generated sentences is added to the output of the recognizer used in retraining.

Dealing with rarely-used additions. If a phoneme sequence introduced into the vocabulary is actually a common sound sequence in the acoustic data, then the recognizer will pick it up and use it. Otherwise, it just will not appear very often in hypotheses. After each iteration, a histogram of phoneme sequence occurrences in the output of the recognizer is generated, and those below a threshold are cut.

Dealing with competing additions. Very often, two or more very similar phoneme sequences will be added to the vocabulary. If the sounds they represent are in fact commonly occurring, both are likely to prosper and be used more or less interchangeably by the recognizer. This is unfortunate for language modeling purposes, since their statistics will not be pooled and so will be less robust. Happily, the output of the recognizer makes such situations very easy to detect. In particular, this kind of confusion can be uncovered through analysis of the N-best utterance hypotheses.

If we imagine a set of N-best hypotheses aligned and stacked vertically, then competition is indicated if two vocabulary items exhibit both of these properties:

– Horizontally repulsive - if one of the items appears in a single hypothesis, the other will not appear in its vicinity.
– Vertically attractive - the items frequently occur in the same part of a collection of hypotheses for a particular utterance.

Since the utterances in this domain are generally short and simple, it did not prove necessary to rigorously align the hypotheses. Instead, items were considered to be aligned based simply on the vocabulary items preceding and succeeding them. It is important to measure both the attractive and repulsive conditions, to distinguish competition from vocabulary items that are simply likely or unlikely to occur in close proximity.

Accumulating statistics about the above two properties across all utterances gives a reliable measure of whether two vocabulary items are essentially acoustically equivalent to the recognizer. If they are, they can be merged or pruned so that the statistics maintained by the language model will be well trained. For clear-cut cases, the competing items are merged as alternatives in the baseform entry for a single vocabulary unit. A better alternative might have been to use class n-grams and put the items into the same class, but this works fine. For less clear-cut cases, one item is simply deleted.

Here is an example of this process in operation. In this example, "phone" is a keyword present in the initial vocabulary. These are the 10-best hypotheses for the utterance "what is the phone number for victor zue":

<oov> phone (n ah m b er) (m ih t er z) (y uw)
<oov> phone (n ah m b er) (m ih t er z) (z y uw)
<oov> phone (n ah m b er) (m ih t er z) (uw)
<oov> phone (n ah m b er) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (z y uw)
<oov> phone (ah m b er f) (m ih t er z) (y uw)
<oov> (ax f aa n ah) (m b er f ax r) (m ih t er z) (z y uw)
<oov> (ax f aa n ah) (m b er f ax r) (m ih t er z) (y uw)
<oov> phone (ah m b er f) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (uw)

The "<oov>" symbol corresponds to an out-of-vocabulary sequence. The phone sequences within parentheses are uses of items added to the vocabulary in a prior iteration of the algorithm. From this single utterance, we acquire evidence that:

– The entry for (ax f aa n ah) may be competing with the keyword "phone". If this holds up statistically across all the utterances, the entry will be destroyed. The keyword vocabulary is given special status, since keywords represent a link to the outside world that should not be modified.
– (n ah m b er), (m b er f ax r) and (ah m b er f) may be competing. They are compared against each other because all of them are followed by the same sequence (m ih t er z) and many of them are preceded by the same word "phone".
– (y uw), (z y uw), and (uw) may be competing.

All of these will be patched up for the next iteration. This use of the N-best utterance hypotheses is reminiscent of their application to computing a measure of recognition confidence in [11].
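The competition test described above lends itself to a direct implementation. The following is a minimal sketch, assuming each hypothesis has already been reduced to a sequence of vocabulary-item names (keywords or cluster labels); the function name `competing_pairs`, the context-keyed alignment, and the `min_attraction` threshold are illustrative choices for this sketch, not details of the actual system.

```python
from collections import defaultdict
from itertools import combinations

def competing_pairs(nbest_lists, min_attraction=2):
    """Flag vocabulary items that look acoustically interchangeable.

    Two items are suspected of competing when they are "vertically
    attractive" (they occupy the same slot, judged by the item that
    precedes or follows them, in different hypotheses for the same
    utterance) and "horizontally repulsive" (they never appear
    together inside a single hypothesis).
    """
    attract = defaultdict(int)  # same slot, different hypotheses
    repulse = defaultdict(int)  # co-occurrence within one hypothesis
    for nbest in nbest_lists:
        slots = defaultdict(set)
        for hyp in nbest:
            padded = ["<s>"] + list(hyp) + ["</s>"]
            for prev, item, nxt in zip(padded, padded[1:], padded[2:]):
                # Crude alignment, as in the paper: group items by
                # what precedes or follows them, rather than by a
                # rigorous alignment of the hypotheses.
                slots[("prev", prev)].add(item)
                slots[("next", nxt)].add(item)
            for a, b in combinations(sorted(set(hyp)), 2):
                repulse[(a, b)] += 1
        for items in slots.values():
            for a, b in combinations(sorted(items), 2):
                attract[(a, b)] += 1
    # Competing pairs attract vertically and never co-occur.
    return sorted(p for p, n in attract.items()
                  if n >= min_attraction and repulse[p] == 0)
```

On a simplified version of the example utterance above, such a test pairs (ax f aa n ah) with the keyword "phone" (shared left context) and (y uw) with (z y uw), while leaving items that genuinely co-occur within single hypotheses, such as "phone" and (n ah m b er), unflagged.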
Robot-DirectedSpeech 7
Testing for convergence For any iterative procedure, those keywords. Other high-frequency clusters corre-
it is important to know when to stop. If we have tran- spondtocommonfirstnames(Karen,Michael).
scribeddata,wecantrackthekeyworderrorrateonthat Everynowandthena”parasite”appearssuchas/d-
dataandhaltwhentheincrementinperformanceissuf- h ax f ow n/ (from an instance of ”the phone” that the
ficientlysmall. recognizer fails to spot) or /iy n eh l/ (from ”email”).
Ifthereisnotranscribeddata,thenwecannotdirect- These havethe potentialto interferewith the detection
ly measure the error rate. We can however bound the rate at which it is changing by comparing keyword locations in the output of the recognizer between iterations. If few keywords are shifting location, then the error rate cannot be changing above a certain bound. We can therefore place a convergence criterion on this bound rather than on the actual keyword error rate. It is important to measure only changes in keyword locations, and not changes in vocabulary items added by clustering; items that do not occur often tend to be destroyed and rediscovered continuously, making comparisons difficult.

7 Experiments in unsupervised vocabulary extension

The unsupervised procedure described in the previous section is intended both to improve recognition accuracy on the initial vocabulary and to identify candidates for vocabulary extension. This section describes experiments that demonstrate to what degree these goals were achieved. To facilitate comparison with other ASR systems, results are quoted for a fairly typical domain called LCSInfo [9], developed by the SLS group at MIT. This domain consists of queries about personnel – their addresses, phone numbers, etc.

7.1 Experiment 1: Qualitative Results

This section describes the candidate vocabulary discovered by the clustering procedure. Numerical, performance-related results are reported in the next section.

Results given here are from a clustering session with an initial vocabulary of five keywords (email, phone, room, office, address), run on a set of 1566 utterances. Transcriptions for the utterances were available but not used by the clustering procedure. Here are the top 10 clusters discovered on this very typical run, ranked by decreasing frequency of occurrence:

1 n ah m b er    6 p l iy z
2 w eh r ih z    7 ae ng k y uw
3 w ah t ih z    8 n ow
4 t eh l m iy    9 hh aw ax b aw
5 k ix n y uw   10 g r uw p

These clusters are used consistently by the recognizer in places corresponding to "number, where is, what is, tell me, can you, please, thank you, no, how about, group" respectively in the transcription. The first, /n ah m b er/, is very frequent because of "phone number", "room number", and "office number". Once it appears as a cluster, the language model is immediately able to improve recognition performance on the phrases that contain it.

Some clusters act as parasites, succeeding at the expense of the keywords they resemble acoustically. But as soon as they have any success, they are detected and eliminated as described in Section [sect]. It is possible that a parasite which does not get greedy – for example, one that limits itself to a single person's pronunciation of a keyword – might escape detection, although we did not see any examples of this happening.

Many simple sentences can be modeled completely after clustering, without need to fall back on the generic OOV phone model. For example, the utterances:

What is Victor Zue's room number
Please connect me to Leigh Deacon

are recognized as:

(w ah t ih z) (ih t er z uw) room (n ah m b er)
(p l iy z) (k ix n eh k) (m iy t uw) (l iy d iy) (k ix n)

all of which are entries in the vocabulary and so contribute to the language model. All the discovered vocabulary items are assigned one or more baseforms as described in Section [sect]. For example, the nasal in /n ah m b er/ is sometimes recognized, sometimes not, so both pronunciations are added to a single baseform.

7.2 Experiment 2: Quantitative Results

For experiments involving small vocabularies, it is appropriate to measure performance in terms of Keyword Error Rate (KER). We take this to be:

KER = (F + M) / T × 100    (1)

with:
F: number of false or poorly localized detections
M: number of missed detections
T: true number of keyword occurrences in the data

A detection is only counted as such if it occurs at the right time. Specifically, the midpoint of the hypothesized time interval must lie within the true time interval the keyword occupies. We take forced alignments of the test set as ground truth. This means that for testing it is better to omit utterances with artifacts and words outside the full vocabulary, so that the forced alignment is likely to be sufficiently precise.

The experiments here are designed to identify when clustering leads to reduced error rates on a keyword vocabulary. Since the form of clustering addressed in this paper is fundamentally about extending the vocabulary,
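Equation (1) and the midpoint criterion are straightforward to state in code; a brief sketch (function names are ours, not from the paper):

```python
def is_correct_detection(hyp_start, hyp_end, true_start, true_end):
    """A detection counts only if the midpoint of the hypothesized
    time interval lies within the true interval (all in seconds)."""
    midpoint = (hyp_start + hyp_end) / 2.0
    return true_start <= midpoint <= true_end

def keyword_error_rate(false_detections, missed_detections, true_occurrences):
    """KER = (F + M) / T * 100, following Equation (1)."""
    return 100.0 * (false_detections + missed_detections) / true_occurrences
```

For example (made-up numbers), 3 false and 2 missed detections against 50 true occurrences give a KER of 10%.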
we would expect it to be useless if the vocabulary is already large enough to give good coverage, and to offer the greatest improvement when the vocabulary is smallest. To measure the effect of coverage, the full vocabulary was made smaller and smaller by incrementally removing the most infrequent words. A set of keywords was chosen and kept constant, and in the vocabulary, across all the experiments so that the results would not be confounded by properties of the keywords themselves (for example, the most common word "the" would make a very bad keyword since it is often unstressed and loosely pronounced). The same set of keywords was used as in the previous section.

Clustering is again performed without making any use of transcripts. To truly eliminate any dependence on the transcripts, an acoustic model trained only on a different dataset was used. This reduced performance but made the results easier to interpret.

Figure 2 shows a plot of error rates on the test data as the size of the vocabulary is varied to provide different degrees of coverage. The most striking result is that the clustering mechanism reduces the sensitivity of performance to drops in coverage. In this scenario, the error rate achieved with the full vocabulary (which gives 84.5% coverage on the training data) is 33.3%. When the coverage is low, the clustered solution's error rate remains under 50% – in relative terms, the error increases by at most half of its best value. A straight application of a language model gives error rates that more than double or treble this value.

[Figure 2: plot of keyword error rate (%) against coverage (%), comparing baseline performance with performance after clustering.]
Fig. 2. Keyword error rate of baseline recognizer and clustering recognizer as total coverage varies.

As a reference point, the keyword error rate using a language model trained with the full vocabulary on the full set of transcriptions, with an acoustic model trained on all available data, is 8.3%.

7.3 Experiment 3: Kismet Domain

An exploratory experiment was carried out for data drawn from robot-directed speech collected for the Kismet robot. This data comes from an earlier series of recording sessions [7] rather than the ones described in Section 4. Early results match the same pattern of clusters – salient words such as "kismet", "no", "sorry", and "robot" appear prominently – but this work is in a very preliminary stage.

8 Discussion and Conclusions

The work reported here is still in progress, and its components have not yet been integrated into a unified whole.

This paper does not address the crucial issue of binding vocabulary to meaning. One line of research under way is to use transient, task-dependent vocabularies to communicate the temporal structure of processes. Another looks more generally at how a robot can establish a shared basis for communication with humans, through learning expressive verbal behaviors as well as acquiring the humans' existing linguistic labels.

Affective speech poses a problem for recognition: its exaggerated prosody hurts accuracy. Prosody also has a benefit, however: it may help distinguish a word-teaching scenario from normal conversation, so that the robot could operate in different modes (as mentioned in Section 5.2).

Our ultimate research interest, informed by infant language research, is how a robot can establish a shared basis for communication, including a mechanism for the robot to vocalize its behavioral and internal state in a manner consistently understandable to humans.

Vocal imitation and referential mapping

Parents tend to interpret their children's first utterances very generously and often attribute meaning and intent where there may be none [4]. It has been shown, however, that such a strategy may indeed help infants coordinate meaning and sound and learn to express themselves verbally. Pepperberg [18] formalized the concept into a teaching technique called referential mapping. The strategy is for the teacher to treat the pupil's spontaneous utterances as meaningful, and act upon them. This, it is shown, will encourage the pupil to associate the utterance with the meaning that the teacher originally gave it, so the student will use the same vocalization again in the future to make a similar request or statement. The technique was successfully used in aiding the development of children with special needs.

For the purposes of the research reported here, we are not yet concerned with the meaning of words. However, one of the purposes of vocabulary extension is to
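The vocabulary-shrinking procedure can be sketched as follows (our own illustrative code; the paper does not give details such as the number or size of the reduction steps):

```python
from collections import Counter

def nested_vocabularies(word_counts, keywords, steps):
    """Shrink the vocabulary by repeatedly dropping the least frequent
    non-keyword words; the keyword set stays in every vocabulary.
    word_counts: Counter mapping word -> training-data frequency."""
    vocab = set(word_counts)
    removable = sorted((w for w in vocab if w not in keywords),
                       key=lambda w: word_counts[w])  # least frequent first
    vocabularies = [set(vocab)]
    chunk = max(1, len(removable) // steps)
    for i in range(0, len(removable), chunk):
        vocab -= set(removable[i:i + chunk])
        vocabularies.append(set(vocab))
    return vocabularies
```

Keeping the keyword set fixed across every vocabulary size ensures that the measured error rates reflect coverage alone, not changing keyword properties.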
build a shared basis for meaningful communication between the human and the robot, and referential mapping may be one of the promising lines of development. We are therefore interested in finding out how often humans spontaneously treat Kismet's utterances as meaningful. One way of measuring this is to look at how often those utterances are imitated by the humans.

Acknowledgements

The authors would like to thank Sherry Turkle, Jen Audley, Anita Chan, Tamara Knutsen, Becky Hurwitz, and the MIT Initiative on Technology and Self, for making available the video recordings that were analyzed in this paper.

Parts of this work rely heavily on speech recognition tools and corpora developed by the SLS group at MIT.

Funds for this project were provided by DARPA as part of the "Natural Tasking of Robots Based on Human Interaction Cues" project under contract number DABT 63-00-C-10102, and by the Nippon Telegraph and Telephone Corporation as part of the NTT/MIT Collaboration Agreement.

References

[1] R.N. Aslin, J.Z. Woodward, N.P. LaMendola, and T.G. Bever. Models of word segmentation in fluent maternal speech to infants. In J.L. Morgan and K. Demuth, editors, Signal to Syntax: Bootstrapping From Speech to Grammar in Early Acquisition. Lawrence Erlbaum Associates: Mahwah, NJ, 1996.
[2] E.G. Bard and A.H. Anderson. The unintelligibility of speech to children: effects of referent availability. Journal of Child Language, 21:623–648, 1994.
[3] I. Bazzi and J.R. Glass. Modeling out-of-vocabulary words for robust speech recognition. In Proc. 6th International Conference on Spoken Language Processing, Beijing, China, October 2000.
[4] P. Bloom. How Children Learn the Meaning of Words. Cambridge: MIT Press, 2000.
[5] C. Breazeal. Sociable Machines: Expressive Social Exchange Between Humans and Robots. PhD thesis, MIT Department of Electrical Engineering and Computer Science, 2000.
[6] M.R. Brent and J.M. Siskind. The role of exposure to isolated words in early vocabulary development. Cognition, 81:B33–B44, 2001.
[7] C. Breazeal and L. Aryananda. Recognition of affective communicative intent in robot-directed speech. In Proceedings of Humanoids 2000, Cambridge, MA, September 2000.
[8] J. Glass, J. Chang, and M. McCandless. A probabilistic framework for feature-based speech recognition. In Proc. International Conference on Spoken Language Processing, pages 2277–2280, 1996.
[9] J. Glass and E. Weinstein. SpeechBuilder: Facilitating spoken dialogue systems development. In Proc. 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, September 2001.
[10] A.L. Gorin, D. Petrovska-Delacretaz, G. Riccardi, and J.H. Wright. Learning spoken language without transcriptions. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Colorado, 1999.
[11] T.J. Hazen and I. Bazzi. A comparison and combination of methods for OOV word detection and word confidence scoring. In Proc. International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, Utah, May 2001.
[12] J. Hirschberg, D. Litman, and M. Swerts. Prosodic cues to recognition errors. In Proc. ASRU.
[13] P. Jeanrenaud, K. Ng, M. Siu, J.R. Rohlicek, and H. Gish. Phonetic-based word spotter: Various configurations and application to event spotting. In Proc. EUROSPEECH, 1993.
[14] P.W. Jusczyk. The Discovery of Spoken Language. Cambridge: MIT Press, 1997.
[15] P.W. Jusczyk and R.N. Aslin. Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29:1–23, 1995.
[16] Y. Matsusaka and T. Kobayashi. Human interface of humanoid robot realizing group communication in real space. In Proc. Second International Symposium on Humanoid Robots, pages 188–193, 1999.
[17] T. Oates, Z. Eyler-Walker, and P. Cohen. Toward natural language interfaces for robotic agents: Grounding linguistic meaning in sensors. In Proceedings of the 4th International Conference on Autonomous Agents, pages 227–228, 2000.
[18] I. Pepperberg. Referential mapping: A technique for attaching functional significance to the innovative utterances of an African Grey parrot. Applied Psycholinguistics, 11:23–44, 1990.
[19] D.K. Roy. Learning Words from Sights and Sounds: A Computational Model. PhD thesis, MIT, September 1999.
[20] C. Trevarthen. Communication and cooperation in early infancy: a description of primary intersubjectivity. In M. Bullowa, editor, Before Speech: The Beginning of Interpersonal Communication. Cambridge University Press, 1979.
[21] J.F. Werker, V.L. Lloyd, J.E. Pegg, and L. Polka. Putting the baby in the bootstraps: Toward a more complete understanding of the role of the input in infant speech processing. In Signal to Syntax: Bootstrapping From Speech to Grammar in Early Acquisition.
[22] V. Zue, J. Glass, J. Polifroni, C. Pao, and T.J. Hazen. JUPITER: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8:100–112, 2000.
[23] V. Zue and J.R. Glass. Conversational interfaces: Advances and challenges. Proceedings of the IEEE, Special Issue on Spoken Language Processing, Vol. 88, August 2000.