Published as a conference paper at IEEE SLT 2016

A MULTICHANNEL CONVOLUTIONAL NEURAL NETWORK FOR CROSS-LANGUAGE DIALOG STATE TRACKING

Hongjie Shi, Takashi Ushio, Mitsuru Endo, Katsuyoshi Yamagami and Noriaki Horii

Interactive AI Research Group, Panasonic Corporation, Osaka, Japan

arXiv:1701.06247v1 [cs.CL] 23 Jan 2017

Copyright 2016 IEEE. Published in the 2016 IEEE Workshop on Spoken Language Technology (SLT 2016).

ABSTRACT

The fifth Dialog State Tracking Challenge (DSTC5) introduces a new cross-language dialog state tracking scenario, where participants are asked to build their trackers on an English training corpus while being evaluated on an unlabeled Chinese corpus. Although computer-generated translations of both the English and Chinese corpora are provided in the dataset, these translations contain errors, and careless use of them can easily hurt the performance of the resulting trackers. To address this problem, we propose a multichannel Convolutional Neural Network (CNN) architecture in which we treat English and Chinese as different input channels of one single CNN model. In the DSTC5 evaluation, we found that this multichannel architecture effectively improves robustness against translation errors. Additionally, our method for DSTC5 is purely machine-learning based and requires no prior knowledge about the target language. We consider this a desirable property for building a tracker in the cross-language context, as not every developer will be familiar with both languages.

Index Terms— Convolutional neural networks, multichannel architecture, dialog state tracking, dialog systems

1. INTRODUCTION

Dialog state tracking is one of the key sub-tasks of dialog management. Its goal is to transform human utterances into a slot-value representation (the dialog state) that is easy for a computer to process, and to track the information that appears in the dialog. To provide a common testbed for this task, the series of Dialog State Tracking Challenges (DSTC) was initiated [1]. This challenge has already been held four times, during which it provided a very valuable shared resource for research in this field and helped to improve the state of the art. Since the fourth challenge (DSTC4, 2015), the target of dialog state tracking has shifted from human-machine dialog to human-human dialog, which significantly increases the difficulty of the task because of the variety and ambiguity of human-human dialog. One lesson we learned from DSTC4 is the difficulty of building a high-performance tracker for human-human dialog with a very limited training corpus, no matter whether using machine learning or hand-crafted rule-based approaches [2, 3]. This is a very unfavorable situation, because building a hand-annotated training corpus is expensive, time-consuming and requires human experts, not to mention the collection of a new corpus for each language other than English whenever a tracker for a new language is needed.

DSTC5 proposed a new challenge based on the rapidly advancing machine translation (MT) technology: with MT, one may be able to adapt an already built tracker to a new language with only a limited training or development corpus in that language. We find this idea very attractive, because not only can it reduce the cost of adapting to a new language, but it also offers the possibility of building a tracker from a cross-language corpus. For example, this can be very useful for developing tourist information systems, because one may have corpora collected from speakers of different languages (i.e. tourists from different countries): for each language the amount of corpus may be very limited, but together it can be large enough for good training. On the other hand, although machine translation has made great progress recently, translation quality is still not satisfactory [4]. A conventional monolingual tracker trained on computer-generated translations may lead to an imperfect model, and it can only accept translations from other languages as input, which will also degrade performance.

To address these problems, we propose a model that can be trained on different languages at the same time and that uses both the original utterances and their translations as input sources for dialog state tracking. In this way, we avoid building the tracker only on computer-generated translations, and we maximize the use of all available input languages to increase robustness to translation errors.

This paper is organized as follows. Sect. 2 briefly describes the dataset and the dialog state tracking problem; Sect. 3 presents an overview of our method and explains our multichannel CNN model in detail; Sect. 4 presents the evaluation results with analysis and discussion; Sect. 5 concludes our work and proposes future improvements.

2. DATASET AND PROBLEM DESCRIPTION

The fifth Dialog State Tracking Challenge (DSTC5) uses the whole dataset (including the train/dev/test sets) of DSTC4 as its training dataset. This dataset contains 35 dialog sessions on tourist information for Singapore, collected from English speakers. Besides the training dataset, a development set including 2 dialog sessions collected from Chinese speakers is provided for testing and tuning the trackers' cross-language performance before the final evaluation. Both the training and development sets are labelled with dialog state tags and come with 5-best hypotheses of English or Chinese translations produced by a machine translation system. In the evaluation phase of the challenge, a test set including 8 unlabeled Chinese dialogs was distributed to each participant, and all prediction results submitted by the participants were evaluated against the true labels. The test dataset also includes 5-best English translations generated by the same machine translation system as for the training/development datasets.

The dialog state in this challenge is defined by the same ontology used in DSTC4, which contains 5 topic branches with different slot sets (Table 2). These topic-slot combinations indicate the most important information mentioned in each topic; for example, the 'CUISINE' slot under the topic 'FOOD' refers to cuisine types, while the 'STATION' slot under the topic 'TRANSPORTATION' refers to train stations. In total there are 30 such topic-slot combinations, and all possible values for each topic-slot are given as a list in the ontology. The main task of DSTC5 is to predict the proper value(s) for each slot given the current utterance and its topic, together with all dialog history prior to that turn (e.g. Table 1).

Table 1. Example of a transcription and dialog state label for a sub-dialog segment in the topic of 'Accommodation'.

Transcription:
  Guide: Let's try this one, okay?
  Tourist: Okay.
  Guide: It's Inn Crowd Backpackers Hostel in Singapore. If you take a dorm bed per person only twenty dollars. If you take a room, it's two single beds at fifty nine dollars.
  Tourist: Um. Wow, that's good.
  Guide: Yah, the prices are based on per person per bed or dorm. But this one is room. So it should be fifty nine for the two room. So you're actually paying about ten dollars more per person only.
  Tourist: Oh okay. That's - the price is reasonable actually. It's good.

Dialog state label:
  INFO: Pricerange
  PLACE: Inn Crowd Backpackers Hostel

3. METHOD

In DSTC4 we proposed a method based on the convolutional neural networks originally proposed by Kim [5]. With this method we were able to achieve the best performance for tracking the INFO slot. The CNN model we used there was modified from the original by adding a multi-topic convolutional layer structure, so that it can better handle the information presented in different dialog topics. This model is characterized by its high performance on limited training data, because it can be trained across various topics. More details about this multi-topic model can be found in [2].

In DSTC5 the training data is 75% larger than in DSTC4, so the situation of limited training data is improved. In order to focus on the new cross-language problem and to keep our method simple, instead of using the more complex multi-topic model we proposed last time, we trained an individual CNN model for each slot-topic combination. That is, for example, the 'INFO' slot in the topic 'FOOD' and the same 'INFO' slot in the topic 'SHOPPING' are trained as two independent models. This is the major difference from last time, when we trained one single model for all topics. With this new scheme, we can set the hyperparameters of every model, for each slot/topic, to exactly the same values, so that our method is scalable, universally applicable and easy to tune. Fig. 1 is a simple diagram illustrating our method.

[Figure 1: Overview of our method (for the 'FOOD' topic): the training corpus of each topic is used to train one multichannel CNN model per slot, and each model predicts the values of its slot from that slot's value list in the ontology.]
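The per slot-topic training scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the ontology excerpt, the hyperparameter values and the `train_model` stub are hypothetical placeholders.

```python
# Sketch of the per slot-topic scheme: one independent model is trained for
# every (topic, slot) combination, and every model shares the same
# hyperparameter settings. The ontology excerpt and values are illustrative.

ONTOLOGY = {
    "FOOD": ["INFO", "CUISINE", "PLACE"],
    "SHOPPING": ["INFO", "PLACE"],
}

SHARED_HYPERPARAMS = {"n_filters": 100, "dropout": 0.4}  # identical for all models

def train_model(topic, slot, hyperparams):
    """Stand-in for training one CNN classifier for a single slot-topic pair."""
    return {"topic": topic, "slot": slot, **hyperparams}

models = {
    (topic, slot): train_model(topic, slot, SHARED_HYPERPARAMS)
    for topic, slots in ONTOLOGY.items()
    for slot in slots
}

# The 'INFO' slot under 'FOOD' and under 'SHOPPING' are two independent models:
assert models[("FOOD", "INFO")] is not models[("SHOPPING", "INFO")]
print(len(models))  # prints: 5
```

Because every model uses identical hyperparameters, tuning one setting tunes the whole collection, which is what makes the scheme easy to scale to all 30 slot-topic pairs.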
Table 2. List of slots for each topic.

Topic          | SLOT
Food           | INFO, CUISINE, TYPE OF PLACE, DRINK, PLACE, MEALTIME, DISH, NEIGHBOURHOOD
Attraction     | INFO, TYPE OF PLACE, ACTIVITY, PLACE, TIME, NEIGHBOURHOOD
Shopping       | INFO, TYPE OF PLACE, PLACE, NEIGHBOURHOOD, TIME
Transportation | INFO, FROM, TO, STATION, LINE, TYPE, TICKET
Accommodation  | INFO, TYPE OF PLACE, PLACE, NEIGHBOURHOOD

3.1. Motivation

The biggest challenge of DSTC5 is that the training and test corpora are originally collected in different languages. Since computer-generated Chinese and English translations are provided in the training and test datasets respectively, one straightforward approach is to train a model on the English corpus and use it on the English translations in the test data. Alternatively, a model trained on the Chinese translations in the training dataset can be used on the Chinese utterances in the test data. However, both methods waste the originally collected utterances in either the training or the test data. In order to fully utilize the corpus resources in both English and Chinese, we propose the following multichannel model, which can be regarded as a combination of both the English and Chinese models.

3.2. Model architecture

Our model is inspired by the multichannel convolutional neural networks commonly used in image processing [6]. Instead of the RGB channels used for color images, we assign each input channel to a different language source. In this model, the input of each channel is a two-dimensional matrix, each row of which is the embedding vector of the corresponding word:

    s = [ —w1— ; —w2— ; ... ; —wn— ],    (1)

where wi ∈ R^k is the embedding vector of the i-th word in the input text. This two-dimensional array s is a matrix representation of the input text. We used three different word embeddings in our model, two for Chinese and one for English; the details of these embeddings are explained in Sect. 3.3.

For each channel, a feature map h ∈ R^(n-d+1) is obtained by convolving a trainable filter m ∈ R^(d×k) with the embedding matrix s ∈ R^(n×k):

    h = f(m ∗ s + b),    (2)

where f is a non-linear activation function (we used the rectified linear unit, ReLU), ∗ is the convolution operator and b = (b, ..., b) ∈ R^(n-d+1) is a bias term. The maximum value of this feature map, ĥ = max{h}, is then selected by the max-pooling layer. This is the process by which one filter extracts the single most important feature from the input matrix. In this model, multiple filters are used in each channel to extract multiple features. These features form the pooling layer, which is passed to a fully connected layer for prediction.

The idea of the multichannel model is to connect the extracted features from the different channels before the final output, so that the model can use the richer information obtained from all channels. The fully connected layer in the multichannel model follows the equation:

    y = S( w · (ĥ_ch1 ⊕ ĥ_ch2 ⊕ ĥ_ch3) + b ),    (3)

where S is the sigmoid function, ⊕ is the concatenation operator and ĥ_chn = (ĥ1, ..., ĥm) is the penultimate layer of the n-th channel.

Notice that in the original paper of Kim, a multichannel architecture has also been proposed. The main difference between our model and theirs is that we use a different set of filters for each channel, while in their model the same filter set is applied to all channels. The reason for this modification is that the word embeddings for different languages can vary greatly; for example, the same (or nearly the same) embedding vector in different language models may correspond to irrelevant words with very different meanings. Using different sets of filters ensures that proper features can be extracted in each channel no matter how the word embedding varies among languages.

[Figure 2: Multichannel CNN model architecture for three input channels: input layer (embedding matrix), convolutional layer, max-pooling layer, fully-connected sigmoid layer.]

[Figure 3: Pre-processing of the input utterances. Training input (English): English words and their Chinese translations are fed to the English word, Chinese word and Chinese character embedding models to form channels ch1-ch3 of one input. Evaluation input (Chinese): Chinese utterances and their English translations are processed the same way.]

3.3. Embedding models

word2vec [7] is one of the most common methods for producing word embeddings. In DSTC5, we applied this method and trained three different models on different training corpora. The details of these models are listed below:

1. English word model: a 200-dimension word2vec model trained on English Wikipedia, with all text split by spaces and all letters lowercased. This model contains 253854 English words.

2. Chinese word model: a 200-dimension word2vec model trained on Chinese Wikipedia, with all text split at word boundaries using the 'jieba' module (https://github.com/fxsjy/jieba). This model contains 457806 Chinese words and 53743 English words that appear in the Chinese Wikipedia.

3. Chinese character model: a 200-dimension word2vec model trained on Chinese Wikipedia, with all text split into single Chinese characters. This model contains 12145 Chinese characters and 53743 English words that appear in the Chinese Wikipedia.

The reason we trained two models for Chinese is that identifying word boundaries in Chinese is not a trivial task. In Chinese, the smallest meaningful element (a word) varies from a single character to several concatenated characters, and Chinese word segmentation usually involves parsing the sentence; even the state-of-the-art methods cannot achieve perfect accuracy. For this reason the Chinese word model may contain incorrect vocabulary entries and is not capable of handling unseen character combinations. The Chinese character model, on the other hand, does not rely on word segmentation, so it is free of segmentation errors and can easily deal with unseen words. However, since the Chinese character model ignores word boundaries, the resulting embedding vectors may not reflect the precise meaning of each word.

4. RESULTS

The results of the proposed method, along with the scores of the other teams, are shown in Table 3. Our multichannel CNN model achieves the best score among all 9 teams: the result of entry #3 outperforms the second-best team by 50% (0.0956/0.0635) in Accuracy and by 15% (0.4519/0.3945) in F-measure under the sub-dialog evaluation. Our five submitted entries are the results of 5 different hyperparameter settings determined by a rough grid search (a guide for setting these hyperparameters can be found in [8]); the settings are summarized in Table 4. Comparing these results with each other, one can easily tell that among these hyperparameters the dropout rate is a key factor. Dropout is known as a technique for reducing overfitting in neural networks [9], and in our case reducing the dropout rate always improves Precision while degrading Recall. One explanation for this is that an over-fitted model only outputs labels for data very similar to the training data, and therefore its generalization to unseen data decreases. On the other hand, further decreasing the dropout rate does not improve the overall performance; those results and parameter settings are also shown in the table as 'Additional Expt. #5 & #6'.

Table 3. Evaluation results (sub-dialog level) on the DSTC5 test dataset, using 'Schedule 2' as described in the challenge handbook [1].

Tracker             | Accuracy | Precision | Recall | F-measure
Multichannel #3     | 0.0956   | 0.5643    | 0.3769 | 0.4519
Multichannel #4     | 0.0872   | 0.5427    | 0.3842 | 0.4499
Multichannel #0     | 0.0964   | 0.5217    | 0.3849 | 0.4430
Multichannel #1     | 0.0712   | 0.4340    | 0.4196 | 0.4267
Multichannel #2     | 0.0681   | 0.4216    | 0.4303 | 0.4259
Team 4 - entry 2    | 0.0635   | 0.3768    | 0.4140 | 0.3945
Team 1 - entry 4    | 0.0612   | 0.3811    | 0.3548 | 0.3675
Team 6 - entry 1    | 0.0383   | 0.4063    | 0.3124 | 0.3532
Team 5 - entry 0    | 0.0520   | 0.3637    | 0.3044 | 0.3314
baseline 2          | 0.0222   | 0.1979    | 0.1774 | 0.1871
Additional Expt. #5 | 0.0949   | 0.5786    | 0.3689 | 0.4505
Additional Expt. #6 | 0.0888   | 0.5677    | 0.3712 | 0.4489

Table 4. Main hyperparameter settings for each entry.

Entry # in Table 3                             | #3    | #4   | #0    | #1    | #2    | #5    | #6
Dropout rate                                   | 0.4   | 0.5  | 0.6   | 0.8   | 0.8   | 0.2   | 0
Number of filters m ∈ R^(1×k) per channel      | 1000  | 600  | 1000  | 1400  | 1000  | 1000  | 1000
Number of filters m ∈ R^(2×k) per channel      | 1000  | 600  | 1000  | 1400  | 1000  | 1000  | 1000
Learning rate                                  | 0.001 (all entries)
Weight decay coefficient (L2 regularization)   | 0.0005 (all entries)
Training data                                  | Training corpus + 1-best Chinese translation & Development corpus + 1-best English translation (all entries)
Training epochs                                | 100 (all entries)
Test input                                     | Test corpus + 1-best English translation (all entries)
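For concreteness, the forward computation of Sect. 3.2 (Eqs. (1)-(3)) can be sketched in plain numpy as below. This is an illustrative reimplementation under assumed shapes, not the authors' code: the filter counts, the embedding dimension and all weights are random placeholders, and the bias in Eq. (2) is simplified to a constant.

```python
import numpy as np

# Per channel (Eq. 2): slide each (d, k) filter over the (n, k) embedding
# matrix, apply ReLU, then max-pool each feature map to a single value.
def channel_features(s, filters, b=0.0):
    n, _ = s.shape
    feats = []
    for m in filters:
        d = m.shape[0]
        h = np.array([np.sum(m * s[i:i + d]) + b for i in range(n - d + 1)])
        h = np.maximum(h, 0.0)   # ReLU activation f
        feats.append(h.max())    # max-pooling over positions
    return np.array(feats)

# Eq. 3: concatenate the pooled features of all channels and apply a
# sigmoid output layer. Each channel has its own filter set, unlike
# Kim's shared-filter multichannel variant.
def multichannel_predict(channels, filter_sets, w, b_out=0.0):
    pooled = np.concatenate(
        [channel_features(s, f) for s, f in zip(channels, filter_sets)]
    )
    z = w @ pooled + b_out
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid S

rng = np.random.default_rng(0)
k = 8                                                  # embedding dimension
chans = [rng.normal(size=(10, k)) for _ in range(3)]   # e.g. en / zh-word / zh-char
fsets = [[rng.normal(size=(2, k)) for _ in range(4)] for _ in range(3)]
w = rng.normal(size=3 * 4)                             # 4 pooled features x 3 channels
y = multichannel_predict(chans, fsets, w)
print(0.0 < y < 1.0)  # prints: True (a probability for one slot value)
```

In the actual model one such sigmoid output is produced per candidate value of the slot, so a full implementation would use a weight matrix rather than a single weight vector.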
Table 5. Example of predicted labels by different models (reconstructed; the original Chinese utterances are not recoverable from this copy and are marked as placeholders).

Transcription (translation):
  Guide: [Chinese utterance] (you can walk to the botanical garden where we go for a walk.)
  Guide: [Chinese utterance] (or the flowers.)
  Tourist: [Chinese utterance] (er.)
  Tourist: [Chinese utterance] (er, plants)

Chinese word model:      INFO: Exhibit; PLACE: Singapore Botanic Gardens, National Orchid Garden
Chinese character model: INFO: Exhibit; PLACE: Singapore Botanic Gardens, Bukit Timah Nature Reserve, National Orchid Garden
English word model:      PLACE: Singapore Botanic Gardens, Bukit Timah Nature Reserve, Orchard Road, HortPark, National Orchid Garden; ACTIVITY: Walking
Model combination:       INFO: Exhibit; PLACE: Singapore Botanic Gardens, National Orchid Garden
Multichannel model (= True label): PLACE: Singapore Botanic Gardens; ACTIVITY: Walking

4.1. Multichannel model & single channel model & model combination

To investigate how much the proposed multichannel architecture contributes to these results, we compared the performance of the multichannel model against ordinary single channel CNN models. For this comparison, we trained three different monolingual single channel CNN models, one for each of the embedding models described in Section 3.3. These models used the same parameter settings as 'Multichannel #3' in Table 4, and were trained only on the 1-best machine translation results. Fig. 4 shows the comparison: the Chinese character model achieves the best overall accuracy among the single channel models, while the multichannel model outperforms all three single channel models.

In an earlier DSTC, a simple model combination technique was used to further improve predictive performance, where the final output is computed by averaging the scores output by the different models [10]. We also applied this method to combine the outputs of all three single channel models, and the result is likewise shown in Fig. 4. This simple model combination does not perform as well as the multichannel model, but considering its simplicity, we still consider it a good alternative for improving performance.

[Figure 4: Overall predictive accuracy of different models: Chinese character model, Chinese word model, English word model, model combination, multichannel model.]

4.2. Discussion

We think the above results can be partially explained from the point of view of ensemble learning. In a multichannel model, each channel provides a different view of the data, and an example is described using different feature sets that provide different, complementary information about the instance. The fully connected layer in the multichannel model further provides an optimization to use this information for the prediction, and therefore the resulting model can in principle better deal with the translation errors that appear in different channels.

Table 5 is one example that demonstrates this idea. In this particular sub-dialog segment, none of the 3 single channel models is able to output the correct labels, while the multichannel model gives the correct prediction. As seen in this example, the model combination behaves like simple voting: it only picks up the labels that are supported by a majority of the single channel models. The multichannel model, on the other hand, is able to selectively choose which language source to trust more for each particular slot or value. As a result, the label 'Walking' is correctly predicted despite appearing only in the English model's output, while the 'Exhibit' label is correctly rejected even though it is supported by two of the three single channel models.

However, the real situation is more complex. When we look at the overall predictive accuracy for each slot (Fig. 5), we find that the performance of each model varies across slots. We consider this to be due to ambiguity caused by machine translation, which varies with the subject. For example, among time expressions in English, 96% of occurrences of the word "evening" and 43% of occurrences of the word "night" are translated into the same Chinese word "wanshang". Although this Chinese word does carry both meanings, "evening" and "night", there are more precise Chinese terms for each. This English-to-Chinese translation ambiguity immediately increases the difficulty of identifying the values EVENING and NIGHT in Chinese, which leads to the poor performance of the Chinese models on the TIME slot.

[Figure 5: Comparison of accuracy for each slot in the topic of 'Attraction'.]

Another problem is that translation quality often differs when the translation direction is reversed, due to differences in inflection, word order and grammar [11]. Since the training corpus only contains one translation direction (English to Chinese), the multichannel model is by no means optimized for the reverse direction. This may cause the multichannel model to be biased toward certain channels, and it can explain why in certain slots the model combination, which treats each channel equally, works better. A more sophisticated way to train our multichannel model would be to first train it with one translation direction and then fine-tune it with the other. Unfortunately this is difficult in DSTC5, because the development dataset that could be used for fine-tuning is too limited.

5. CONCLUSION

We proposed a multichannel convolutional neural network in which we treat multiple languages as different input channels. This multichannel model is found to be robust against translation errors and outperforms any of the single channel models. Furthermore, our method does not require prior knowledge about new languages, and can therefore be easily applied to available corpus resources in different languages. This not only reduces the cost of adapting to a new language, but also offers the possibility of building multilingual dialog state trackers with large-scale cross-language corpora.

In this work we applied three different embedding models, while there is one more we did not try: an English character model. Several character-aware language models have been proposed recently which are superior in dealing with subword information, rare words and misspellings [12, 13]. We believe that integrating them into the multichannel model is a promising research direction.

On the other hand, since our method is purely machine-learning based, it cannot handle labels unseen in the training data. This is a very important issue, especially for a large ontology, because of the difficulty of obtaining a large training corpus that covers all concepts. To overcome this disadvantage, future work should include combining machine learning with other approaches, such as hand-crafted rules, data augmentation and so on.

6. REFERENCES

[1] Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason Williams, Matthew Henderson, and Koichiro Yoshino, "The Fifth Dialog State Tracking Challenge," in Proceedings of the 2016 IEEE Workshop on Spoken Language Technology (SLT), 2016.

[2] Hongjie Shi, Takashi Ushio, Mitsuru Endo, Katsuyoshi Yamagami, and Noriaki Horii, "Convolutional neural networks for multi-topic dialog state tracking," in Proceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS), 2016.

[3] Franck Dernoncourt, Ji Young Lee, Trung H. Bui, and Hung H. Bui, "Robust dialog state tracking for large ontologies," arXiv preprint arXiv:1605.02130, 2016.

[4] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.

[5] Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.

[7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[8] Ye Zhang and Byron Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," arXiv preprint arXiv:1510.03820, 2015.

[9] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.

[10] Matthew Henderson, Blaise Thomson, and Steve Young, "Word-based dialog state tracking with recurrent neural networks," in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 292-299.

[11] David Vilar, Jia Xu, Luis Fernando d'Haro, and Hermann Ney, "Error analysis of statistical machine translation output," in Proceedings of LREC, 2006, pp. 697-702.

[12] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.

[13] Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.
