Emotion Recognition From Speech With Recurrent Neural Networks

Vladimir Chernykh (MIPT, IITP, Skoltech, Moscow)
Grigoriy Sterling (IITP, Moscow)
Pavel Prihodko (IITP, Moscow)

arXiv:1701.08071v1 [cs.CL] 27 Jan 2017

Abstract

In this paper the task of emotion recognition from speech is considered. The proposed approach uses a deep recurrent neural network trained on a sequence of acoustic features calculated over small speech intervals. At the same time, a probabilistic CTC loss function makes it possible to handle long utterances containing both emotional and unemotional parts. The effectiveness of the approach is shown in two ways: by comparison with recent advances in the field, and by measuring human performance on the same task, which was also done by the authors.

1 Introduction

In recent years human-computer interaction has become more and more interesting for data scientists. The goal of most research is to make the conversation between human and computer more natural. Obviously, there are two necessary points to achieve this goal: to make humans understand computers better, and conversely. One can observe great success in speech recognition (speech-to-text, STT). Nowadays machines can understand almost all human speech, and the related services are used everywhere, from Siri-like applications in smartphones to voice control services. In short, a computer can understand what has been said. But this is not all the information carried by a voice: it also contains information about who has spoken and how. The second question motivates us to learn how to recognize emotions from speech. Moreover, text-to-speech (TTS) services face the same challenging problem: machines can produce only intonation-less speech.

1.1 Emotion recognition problem description

This paper is dedicated to emotional speech processing. The main goal is to learn how to classify speech by the emotional state of the speaker. Unfortunately, there is no strict definition of what an emotion is. Moreover, as will be shown further, different people classify emotional speech differently. The second difficulty concerns the temporal properties of emotions. Often almost the whole utterance carries no emotion (the speaker is in a neutral state), and the emotionality is contained only in a few words or phonemes of the utterance.

The emotion recognition problem can be reformulated in mathematical terms as a classification task. In brief, a function from the space of utterances to the set of emotional states has to be constructed. In this space the decision rule separates utterances with one emotion from the others. But what if people evaluate utterances differently? If we assume that the emotional states of utterances are drawn from an unknown probability distribution, then the model should predict and process these probabilities in the most sensible way.

1.2 Relation to prior works

At the moment the emotion recognition pipeline looks as follows:

• Choose an emotional speech corpus
• Divide the continuous audio signal into emotional utterances
• Calculate features for each utterance
• Select and train a classification model
• Develop a set of evaluation metrics and validate the constructed model

There is a whole bunch of methods for each of the aforementioned steps. However, there is no panacea for the general problem setting of emotion recognition. One needs to modify each step of the list in accordance with the specific properties of the task. The most effective approach takes into account many aspects such as the problem statement, goals, dataset structure, evaluation metrics, etc.

In this paper we consider utterance-level classification for speaker-free emotion recognition from natural speech.
We decided to choose the IEMOCAP database as a speech corpus because it satisfies all the properties of the task described above and there are several well-developed approaches on this emotional corpus to compare with.

Before 2014 research did not take into account the temporal properties of speech and used only statistics aggregated over the utterances. The first attempt to consider a feature sequence was made by Han et al. in [1], where they used a DNN-based system with subsequent aggregation over time using statistical functions. They showed some advantages of this approach, but the accuracy was insufficient.

In the next year Lee and the same Microsoft team [2] trained a long short-term memory (LSTM) recurrent neural network on a feature sequence and achieved about 60% accuracy on the IEMOCAP database. They used a special loss function that makes it possible to take into account the fact that the emotionality in an utterance exists only in a few frames. Until now it has been the most effective workflow on the IEMOCAP database. The thing is, however, that in their paper Lee et al. sieved the initial database in a way that differs from ours. Firstly, they considered another set of emotions and, secondly, they used only the improvisation sessions (see section 2 for details). Testing our approach under the same conditions we got almost the same quality: 56% mean class accuracy and 59% overall accuracy (see section 5.3 for definitions). One more drawback of their work is the lack of code and the unclearness of important details of the algorithm, which makes it hard to reproduce.

2 Data description

Regardless of the model type, the training procedure requires a labelled emotional corpus. There are quite a few databases, and their overview can be found in [3, 4]. We carried out all experiments with the audio data from The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [5].

It consists of about 12 hours of audio-visual data from 10 actors. All recordings have the structure of a dialogue between a man and a woman, either scripted or improvised on a given topic. After collecting these audio-visual signals the authors divided the dialogues into small utterances of length mainly from 3 to 15 seconds (see fig. 1a for details) and then gave them to assessors to evaluate. Utterances are broken into two "streams" (male and female voices) and thus they sometimes intersect (see fig. 2b). From 3 to 4 assessors were asked to evaluate each utterance based on both the audio and the video stream. The evaluation form contained 10 options (neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, other). Here we take only 4 of them for analysis: anger, excitement, neutral and sadness. Figure 1b shows the distribution of the considered emotions among the utterances.

An emotion was assigned to an utterance only if at least half of the experts were consistent in their choice, and obviously it is not always like that. In 30% of the utterances more than a half of the experts gave different labels, and no emotion was assigned at all. Moreover, among the remaining utterances significantly less than a half were evaluated consistently by all experts (fig. 2a). This fact illustrates that emotion is a subjective notion and there is no way to classify emotions precisely when even humans cannot do that. In other words, a model cannot learn to recognize emotions as such, but it can learn how experts label emotional utterances.

Figure 1: Data overview. (a) Utterance duration distribution. (b) Emotional labels distribution.
Figure 2: Markup details. (a) Expert consistency. (b) Percentage of intersections.

2.1 Technical details

Usually digital sound is acquired as an analogue signal from a recording device converted to digital form. The audio wave is discretized and turned into a time series with a given discretization frequency called the sample rate. In the IEMOCAP case, one second of sound at a 16 kHz sample rate is represented by 16000 integers.

The most popular approach to reducing this number is to separate the input signal into intersecting intervals (frames) of 0.1-1 seconds length and calculate acoustic features over them. This sequence of feature vectors represents the input sound in a lower-dimensional space with enough precision, although some information is inevitably lost. The other approach is to work with the raw signal discretized at a lower frame rate. It is more accurate but needs significantly more computational power.
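To make the framing concrete, below is a minimal NumPy sketch of splitting a discretized signal into intersecting frames. The window and step values (0.2 s and 0.1 s) are the ones used later in section 2.2; the function name and implementation are ours, not taken from the authors' code.

```python
import numpy as np

def split_into_frames(signal, sample_rate=16000, window=0.2, step=0.1):
    """Split a 1-D signal into intersecting frames of `window` seconds taken every `step` seconds."""
    frame_len = int(window * sample_rate)   # 3200 samples for 0.2 s at 16 kHz
    hop = int(step * sample_rate)           # 1600 samples for 0.1 s at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

# With these settings a 3-second signal at 16 kHz yields 29 frames of 3200 samples each.
frames = split_into_frames(np.zeros(3 * 16000))
print(frames.shape)  # (29, 3200)
```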
2.2 Features used

One of the sticking points in emotion recognition is which features should be used. Almost all possible features can be divided into 3 main classes:

• Acoustic features. They are calculated from the input signal and describe the wave properties of speech. Sometimes pitch-based features like chromagram powers are also included in the acoustic set, but in some works they are considered separately or put into the prosodic group.
• Prosodic features. Generally, they try to measure peculiarities of speech such as pauses between words, prosodies and loudness. Unfortunately, speech details such as prosodic features strongly depend on the speaker, and their use in speaker-free emotion recognition is debatable. In some works prosodic features improve the total efficiency, but we do not use them for the aforementioned reason.
• Linguistic features. The third kind of frequently used features is based on semantic information extracted from speech. We also do not use them because exact transcriptions are usually unavailable.

So in this paper we consider only acoustic features that measure the wave properties of a signal. Along with Fourier frequencies and energy-based features, Mel-frequency cepstral coefficients (MFCC) are usually used. The Fourier transform of the current time interval of the input signal can be considered as another signal, and MFCC features measure its frequencies and powers.

Finally, 34 features are used. They include 12 MFCC, chromagram-based features and spectrum properties like flux and roll-off. For all speech intervals we calculate the features in a 0.2 second window moving with a 0.1 second step. For example, a 4 seconds signal results in 78 vectors of dimension 34.
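The paper does not state which toolkit was used to compute these features, so the sketch below is only an approximation of the described feature set using librosa (our assumption): MFCCs, chroma and a few spectral statistics per 0.2 s window with a 0.1 s step. The exact composition of the authors' 34-dimensional vector may differ; this particular combination yields 28 values per frame.

```python
import numpy as np
import librosa

def utterance_features(path, sr=16000, window=0.2, step=0.1):
    """Return a (T, d) matrix of frame-level acoustic features for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(window * sr)   # 0.2 s analysis window
    hop = int(step * sr)       # 0.1 s step between windows
    feats = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft, hop_length=hop),
        librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop),
        librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop),
        librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop),
        librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop),
        librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop),
    ])
    return feats.T             # one row per 0.2 s frame

# X = utterance_features("utterance.wav")  # -> array of shape (T, 28) with these settings
```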
3 Problem statement

The emotion recognition task can be formulated in mathematical terms as a multi-class classification problem. This reformulation allows us to apply a wide range of well-established methods.

Let $D = \{(X_i, z_i)\}_{i=1}^{n}$ be the training set, where $z_i \in Z = E^* = \{0, \dots, k-1\}^*$ is the true sequence of labels and $X_i \in \mathcal{X} = (\mathbb{R}^f)^*$ is the corresponding multidimensional feature sequence. It is worth mentioning that the lengths of these sequences, $|z_i| = U_i$ and $|X_i| = T_i$, may not be the same in the general case; the only constraint is the condition $U_i \le T_i$.

Also, let us divide the dataset into train and validation sets, $D = D_l \cup D_v$, with sizes $N_l$ and $N_v$ correspondingly; $I_l$ and $I_v$ denote the corresponding split of indices.

Further we introduce the set of classification functions $F = \{f : \mathcal{X} \mapsto Z\}$ in which we want to find the best model. In the case of a neural network with a fixed architecture it is possible to associate the set of functions $F$ with the network weight space $W$, and thus a function $f$ and a vector of weights $w$ are interchangeable.

The next step is to introduce the loss function. As in almost all classification tasks, we are interested in the accuracy error measure. But due to its non-convexity we are unable to optimize it directly, and thus we use a convex approximation denoted by $L(z, y)$, where $y$ is the answer of the model.

Assume that there is a "perfect" classifier $f^*$. Our task is to find the $f^*_F$ that approximates $f^*$ in the best possible way. It can be done using the risk minimization approach:

$$f^*_F = \arg\min_{f \in F} \mathbb{E}_{(X,z)} \left[ L(z, f(X)) \right].$$

But the real data distribution in the $\mathcal{X} \times Z$ space is unknown, and thus it is impossible to compute this expected value properly. Therefore it is common to minimize the empirical risk instead:

$$\hat{f} = \arg\min_{f \in F} \frac{1}{N_l} \sum_{D_l} L(z_i, f(X_i)) = \arg\min_{w \in W} Q(w, D_l, L).$$

Thereby we obtain $\hat{f}$, an approximation of $f^*_F$ built from the available data. A common way to reduce overfitting here is to monitor the validation error from time to time during the optimization process and to stop when it starts growing.

4 Recognition methods

Throughout this paper we use a few recurrent neural network models for emotion detection in human speech. The general structure stays more or less the same: an input layer, a few LSTM layers stacked on each other, dense classification layers and an output softmax layer. One can refer to the experiment section 5 for a more detailed description of the architectures. Below is an overview of the two main approaches to network training used in this paper.

4.1 One-label approach

The one-label approach implies that every utterance has only one emotional label notwithstanding its length. It is the obvious approach for datasets labelled with the utterance annotation technique [5].

Figure 3: One-label approach pipeline

In detail, let all label sequences have length 1, $U_i = |z_i| = 1$. Then the vector $z_i$ becomes a scalar.

The recurrent neural network can be thought of as a mapping from the input feature space $\mathcal{X}$ to the probability distribution over the emotional class labels, parameterized by the model weights: $y = N_w(X) \in [0; 1]^k$, where $y$ is the output of the softmax layer.

The loss function here is the categorical cross-entropy. Below is the objective function to minimize based on this loss:

$$Q(w, D_l, L) = -\frac{1}{N_l} \sum_{i \in I_l} \sum_{c=0}^{k-1} z_{i,c} \log y_{i,c},$$

where the class labels $z_i$ are one-hot encoded and $y_i = N_w(X_i)$.

The final classification is made by means of the argmax rule over the outputs of the trained model:

$$\hat{f}(X) = \arg\max_{c \in E} y_c.$$

Figure 3 shows the whole workflow of the described approach.

Until recently this approach, with different types of models (e.g. HMM or Gaussian mixtures), has been considered the state of the art.

The drawback of such an approach is obvious from its name: it converts a sequence of arbitrary length into one label. One more potential but nonetheless significant problem with the one-label approach arises in on-the-fly operation. In the online setting there is a continuous audio stream without the assessors' split into utterances, so the ability to predict a sequence of emotions per utterance is crucial.
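As an illustration of the one-label setup described above, here is a minimal Keras sketch of an LSTM classifier trained with categorical cross-entropy on padded feature sequences. The framework choice, layer sizes and padding scheme are our assumptions for illustration, not the exact architecture from appendix A.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 4    # anger, excitement, neutral, sadness
NUM_FEATURES = 34  # frame-level acoustic features

model = keras.Sequential([
    layers.Masking(mask_value=0.0, input_shape=(None, NUM_FEATURES)),  # ignore zero-padded frames
    layers.LSTM(64, return_sequences=True, activation="tanh"),
    layers.LSTM(64, activation="tanh"),      # last hidden state summarizes the utterance
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# X: (n_utterances, max_frames, NUM_FEATURES) zero-padded sequences; y: one-hot utterance labels.
# model.fit(X, y, validation_split=0.2, epochs=20, batch_size=32)  # ~20 epochs, cf. section 5.2.1
# The argmax rule: predictions = np.argmax(model.predict(X_test), axis=1)
```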
The most recent advance in the field of natural language processing is the use of so-called end-to-end or sequence-to-sequence models, which are described next in section 4.2.

4.2 CTC approach

The Connectionist Temporal Classification (CTC) approach is one among many end-to-end prediction methods. The main advantage of CTC is that it chooses the most probable label sequence with regard to the various ways of aligning it with the initial sequence. The probability of a particular labelling is accumulated from the probabilities of each of its alignments.

More formally, let $D_l$ still be the training set, but unlike the one-label approach the answer can now be a sequence.

Further we expand the label set $E$ and introduce one more label, NULL. This label can be interpreted as a lack of emotion. Technically it is reflected in one more unit in the last softmax layer of the neural net. So the final label set is $L = E \cup \{NULL\}$.

As in the one-label approach, let us consider the work of the model as a mapping, but here the codomain is also a sequence space: $Y = N_w(X) \in [0;1]^{(k+1) \times T}$. Here $y^t_c$ is the output of the softmax layer and represents the estimate of the probability of observing class $c$ at moment $t$.

Figure 4: CTC approach pipeline

For every input $X$ let us define a path $\pi$: an arbitrary sequence from $L^*$ of length $T$. Then the conditional probability of the path is

$$p(\pi \mid X) = \prod_{t=1}^{T} y^t_{\pi_t}.$$

The problem is that paths contain NULL class labels, which are unacceptable in the final output, so first we need to get rid of the NULLs. Therefore a mapping $M : L^T \mapsto E^{\le T}$ is introduced. It consists of two stages:

1. Delete all consecutive repeated labels;
2. Delete all NULLs.

One can consider the following application: $M(-aa-b-b--ccc) = M(abb---bc-) = abbc$. From this example it is obvious that $M$ is a surjective mapping. By means of it we transform paths into labellings. The method for computing the probability of a labelling is simple:

$$p(l \mid X) = \sum_{\pi \in M^{-1}(l)} p(\pi \mid X).$$

One could notice that the direct calculation of $p(l \mid X)$ requires summation over all corresponding paths, which is very exhausting: there are $(k+1)^T$ possible paths. Thus Graves et al. in [6] derived an efficient forward-backward dynamic programming algorithm for it. The initial idea was taken from Rabiner's HMM decoding algorithm [7].

Finally, the objective function to minimize in this approach is based on the maximum likelihood principle. We try to maximize the probabilities of all correct answers at the same time:

$$Q(w, D_l) = -\sum_{i \in I_l} \log p(z_i \mid X_i).$$

The neural network here plays the role of an estimator of the probability measure $p$, and the more it trains the more precise probability estimates it gives. To enable neural network training with standard gradient-based methods, Graves et al. [6] also suggested a differentiation technique naturally embedded into the dynamic programming algorithm.

The final classification rule is the one that maximizes the probability of the answer:

$$\hat{f}(X) = \arg\max_{l \in E^{\le T}} p(l \mid X).$$

Figure 4 depicts the pipeline for the CTC method in a way similar to the one-label approach.
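To make the CTC machinery tangible, the sketch below implements the collapse mapping $M$ exactly as defined above and computes $p(l \mid X)$ by brute-force enumeration of paths on a toy example. This is purely illustrative and written under our own conventions (NULL encoded as index 0); it is not the authors' implementation, and real training replaces the enumeration with the forward-backward algorithm.

```python
import itertools
import numpy as np

NULL = 0  # index of the extra "no emotion" label; emotion classes are 1..k

def collapse(path):
    """The mapping M: drop consecutive repeats, then drop NULLs."""
    deduped = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return tuple(c for c in deduped if c != NULL)

def labelling_probability(y, labelling):
    """Brute-force p(l|X): sum of path probabilities over all paths collapsing to `labelling`.

    y is a (T, k+1) matrix of per-frame softmax outputs. The enumeration costs (k+1)^T and is
    only feasible for toy inputs; CTC computes the same sum with dynamic programming."""
    T, n_labels = y.shape
    total = 0.0
    for path in itertools.product(range(n_labels), repeat=T):
        if collapse(path) == tuple(labelling):
            total += np.prod([y[t, c] for t, c in enumerate(path)])
    return total

# Toy example: 3 frames, 2 emotions + NULL.
y = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.1, 0.7]])
print(collapse((1, 1, 0, 2, 2)))       # -> (1, 2)
print(labelling_probability(y, (1,)))  # probability that the whole utterance labelling is "emotion 1"
```

In practice this summation is never done explicitly: deep learning frameworks expose the forward-backward computation as a ready-made loss (for example, tf.nn.ctc_loss in TensorFlow), which is what makes training on full utterances feasible.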
5 Emotion recognition experiments

In a series of experiments we investigated different models and approaches to the emotion recognition task. All the code can be found in the github repository [8].

As mentioned in section 2, the data structure is the following: dialogues are broken into utterances by human assessors. The utterance is the least structural unit carrying an emotion tag in the original labelling. But utterances vary considerably in length, so we decided to split them into overlapping frames of 0.2 seconds duration with an overlap of 0.1 second. The problem is that frames have no labels, while it is obvious that not all frames of an angry utterance can themselves be considered angry frames.

Further in this paper, unless otherwise specified, the test set consists of 20% of the points randomly picked from the dataset. Also, two neural network structures are mainly used:

• Simple LSTM. It consists of two consecutive LSTM layers with hyperbolic tangent activations followed by two classification dense layers.
• Bidirectional LSTM. The network contains two BLSTM layers, each of which splits into two parts: half of the nodes are usual LSTM blocks and the other half are BLSTM units.

More detailed diagrams are shown in figure 10 in appendix A.

5.1 Framewise classification

The first method that was applied is framewise classification. The core of this method is to try to classify each frame separately. Under the specified conditions we came up with the following workflow (see the code sketch at the end of this subsection):

• Take the two loudest frames from each utterance. Loudness in this context is a synonym for spectral power.
• Assign these frames the emotion of the utterance.
• Train the frame classification model on the derived dataset.

Here we make the assumption that the emotion of an utterance is contained not in all of its frames but only in the loudest ones. Experiments show that 2 frames is the optimal number. As the classification model we used the Random Forest classifier from the scikit-learn package [9].

Figure 5: Framewise classification

In figure 5 one can observe the results of this method for random utterances from the test set. At a glance it looks somewhat reasonable on short utterances; on the longer ones it becomes sawtooth-like and unstable.

The next step is to classify utterances based on their artificial labellings. A simple majority voting algorithm gives about 44% accuracy. Taking into account that the neutral class makes up about 36% of the dataset, this does not look very good. Moreover, the error distribution in that case looks unnatural in comparison with the human one obtained by us: 70% of the answers are neutral, which implicitly confirms our assumption that most of the frames in the utterances do not carry any emotion.

Applying the Simple LSTM network does not give any significant improvement, and the BLSTM network drastically overfits; we could not escape this effect without a significant loss in quality.

Summarizing, frame pre-classification does not show itself brilliantly in this task. We believe that the main reason is the lack of markup on the frame level: choosing two frames from an utterance is too inaccurate a method for labelling samples. At the same time, we can state that the hypothesis about the neutral character of most of the frames holds.
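A compact sketch of the framewise workflow described in this subsection is given below, assuming the per-frame features and utterance labels are already computed. The loudness proxy (per-frame spectral power passed in as an array) and all helper names are our assumptions; the Random Forest classifier from scikit-learn is the one named in the text, and the authors' repository [8] remains the reference implementation.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def loudest_frames(frame_features, frame_powers, n=2):
    """Pick the n frames with the highest spectral power within one utterance."""
    idx = np.argsort(frame_powers)[-n:]
    return frame_features[idx]

def build_frame_dataset(utterances, labels):
    """Label the 2 loudest frames of each utterance with the utterance emotion."""
    X, y = [], []
    for (feats, powers), label in zip(utterances, labels):
        for frame in loudest_frames(feats, powers):
            X.append(frame)
            y.append(label)
    return np.array(X), np.array(y)

def classify_utterance(clf, frame_features):
    """Majority vote over per-frame predictions gives the utterance-level emotion."""
    votes = clf.predict(frame_features)
    return Counter(votes).most_common(1)[0][0]

# utterances: list of (frame_features [T, 34], frame_powers [T]); labels: utterance emotions.
# X_train, y_train = build_frame_dataset(train_utterances, train_labels)
# clf = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)
# predicted = [classify_utterance(clf, feats) for feats, _ in test_utterances]
```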
5.2 Utterance-level classification

The idea of the following experiment is to use the frame features themselves as an input to the neural network model. There is a rationale behind it, because with the framewise approach we do not get any additional information from the intermediate RFC stage. The key idea in utterance-level classification is that the RNN can learn sufficient representations from the input feature stream itself and make the final classification based on them.

Here we experiment with the two approaches described in section 4. A 5-fold cross-validation technique is used to evaluate and compare them.

5.2.1 One-label approach

The BLSTM network outperforms the Simple LSTM network within this approach. Figure 6 shows the performance analysis.

Figure 6: BLSTM performance in the one-label approach. (a) Loss evolution. (b) 5-fold CV results.

As one can see from the plots in figure 6a, 20-22 epochs are enough for training, and after that the model starts to overfit. The accuracy evolution curves can be found in figure 11a in appendix A. The bar chart in figure 6b shows the results of cross-validation; the number of training epochs was chosen according to the previously obtained value of 20.

5.2.2 CTC approach

Performance indicators for the CTC approach are given in figure 7 and in figure 11b in appendix A.

Figure 7: BLSTM performance in the CTC approach. (a) Loss evolution. (b) 5-fold CV results.

It is worth mentioning that the BLSTM with the CTC loss requires approximately twice as many epochs to train properly compared to the cross-entropy loss. An interesting and important feature to notice is that the network with the CTC loss does not overfit.

5.3 Comparison

As we are considering a multi-class classification task with slightly unbalanced classes, it is essential to look not only at the overall accuracy but at the class accuracies too. Thus we introduce two types of metrics:

• Overall (weighted) accuracy. It is the accuracy on the entire test set.
• Mean class (unweighted) accuracy. It is the average of the classification accuracies over the classes.

Table 1 shows the aforementioned metrics for each method used throughout the paper. The last line of the table corresponds to the human assessors who were asked to relabel this dataset. All the details about how and why it was done are given in sections 5.4 and 6.

Table 1: Methods accuracies comparison

    Method      Overall accuracy    Mean class accuracy
    Framewise   45%                 41%
    One-label   51%                 49%
    CTC         54%                 54%
    Human       69%                 70%

5.4 Error structure

We also decided to investigate the CTC model further, because Graves in [6] reports a huge gap in quality over classical models, while here the gain is about 3-5%. It is still significant but not a breakthrough. For that reason the error structure of our most successful model was studied.

First of all, it is useful to look at the distribution of predictions in comparison with the "ground truth" from the experts. This distribution is shown in figure 8a. Busso in his work [5] mentions that the audio signal plays the main role in sadness recognition, while anger and excitement are better detected via the video signal which accompanies the audio during the assessors' work. This hypothesis seems to hold for our model: the sadness recognition percentage is much higher than the others. It may also be interesting to compare it with the human error structure shown in figure 9a.

Also, as was said in section 2, the dataset is not homogeneous in the sense of the reliability of the expert answers. It is interesting to see (fig. 8b) how the model predictions depend on the expert confidence degree. For that purpose we first differentiate the utterances by the confidence degree of the experts. On the x-axis one can see the number of experts whose answer differs from the final emotion assigned to the utterance. In each cell of the table there is the model error percentage for the particular emotion corresponding to the y coordinate and the particular confidence level corresponding to the x coordinate.

Figure 8: BLSTM CTC error structure. (a) Answers distribution. (b) Misclassification rates.

In fact this matrix gives curious information: if we take into account only those utterances on which the experts were consistent, we get 65% accuracy. It sounds much more promising than 54%, and this transition can be treated as legitimate because emotion is something very subjective. Recognizing a truck or a flower in a picture can be evaluated objectively, and in tasks like that computers have already outperformed humans [10]. But emotion is something different, which cannot be defined properly. Thus if even humans cannot be sure of the right understanding of an emotion, our model cannot guarantee it either.

One more interesting thing to notice here is not only the quantity of inconsistent answers but also the answers themselves. As discussed previously, some experts give answers that differ from the final emotion of the utterance. These answers can be an arbitrary emotion, and here we select only the four emotions considered before. Thereby, the first row of table 2 shows, for utterances treated as having the emotion named in the column header, the percentage of inconsistent answers that fall into the considered four emotions. The second row shows the percentage of model answers that coincide with the inconsistent expert in this case.

Table 2: Residual accuracy

                        Angry    Excitement    Neutral    Sadness
    Considered ratio    17%      22%           36%        39%
    Model accuracy      51%      73%           71%        74%

In other words, table 2 shows how frequently the errors of our model coincide with the human divergence in emotion assessment.

All these facts and ideas lead us to the final and still ongoing part of the research, described in section 6.

6 Markup investigation

Observing the inconsistency of experts and the other problems of the markup described in sections 5.4 and 2, we came to the idea that it would be interesting and useful to see how humans perform on this task.

This question is not new and has previously arisen in the literature. Altrov in [11] collected a corpus of Estonian speech and asked people of different nationalities to evaluate it. Almost all nationalities (Latvians, Italians, Finns, Swedes, Danes, Norwegians, Russians) were close to Estonians geographically and culturally. Nevertheless, Estonians performed much better than any other nationality, showing about 69% mean class accuracy. All other people performed 10-15% worse, and the only emotion that they recognized well was sadness.

In our research we developed a simple interface (fig. 9b) for relabelling the speech corpus to see how well humans can solve this task. Figure 9a shows the results of the experiments. Both types of accuracies considered before are about 70% (tab. 1). These numbers confirm the idea that emotion is a notion treated differently by different people. And having this 70% as a milestone, we can say that our model performs very well.
