Deep Memory Networks for Attitude Identification

Cheng Li (School of Information, University of Michigan, Ann Arbor, [email protected]), Xiaoxiao Guo (Computer Science and Eng., University of Michigan, Ann Arbor, [email protected]), Qiaozhu Mei (School of Information, University of Michigan, Ann Arbor, [email protected])

ABSTRACT

We consider the task of identifying attitudes towards a given set of entities from text. Conventionally, this task is decomposed into two separate subtasks: target detection, which identifies whether each entity is mentioned in the text, either explicitly or implicitly, and polarity classification, which classifies the exact sentiment towards an identified entity (the target) into positive, negative, or neutral.

Instead, we show that attitude identification can be solved with an end-to-end machine learning architecture, in which the two subtasks are interleaved by a deep memory network. In this way, signals produced in target detection provide clues for polarity classification, and reversely, the predicted polarity provides feedback to the identification of targets. Moreover, the treatments for the set of targets also influence each other: the learned representations may share the same semantics for some targets but vary for others. The proposed deep memory network, the AttNet, outperforms methods that do not consider the interactions between the subtasks or those among the targets, including conventional machine learning methods and the state-of-the-art deep learning models.

1. INTRODUCTION

In many scenarios, it is critical to identify people's attitudes¹ towards a set of entities. Examples include companies who want to know customers' opinions about their products, governments who are concerned with public reactions about policy changes, and financial analysts who identify daily news that could potentially influence the prices of securities. In a more general case, attitudes towards all entities in a knowledge base may be tracked over time for various in-depth analyses.

¹ "The way you think and feel about someone or something," as defined by Merriam-Webster. http://www.merriam-webster.com/dictionary/attitude

Different from a sentiment, which might not have a target (e.g., "I feel happy"), or an opinion, which might not have a polarity (e.g., "we should do more exercise"), an attitude can be roughly considered as a sentiment polarity towards a particular entity (e.g., "WSDM is a great conference"). Therefore, the task of attitude identification has been conventionally decomposed into two separate subtasks: target detection, which identifies whether an entity is mentioned in the text, either explicitly or implicitly, and polarity classification, which classifies the exact sentiment towards the identified target, usually into three categories: positive, negative, and neutral.

Solving the two subtasks back-to-back is by no means unreasonable, but it may not be optimal. Specifically, intrinsic interactions between the two subtasks may be neglected in such a modularized pipeline. Indeed, signals identified in the first subtask, both the words that refer to the target and the positions of these words, could provide useful information for the polarity of sentiments. For example, the identified target in the sentence "this Tiramisu cake is ___" indicates a high probability that a sentimental word would appear in the blank and is highly likely to be related to flavor or price. On the other hand, sentimental expressions identified in the second subtask and their positions could in turn provide feedback to the first task and signal the existence of a target. For example, the positive sentiment in "the new Keynote is user friendly" provides good evidence that "Keynote" is a software (the target) instead of a speech (not the target). In addition, models learned for certain targets and their sentiments may share some important dimensions with each other while differing on other dimensions. For example, two targets food and service may share many sentimental expressions, but the sentence "we have been waiting for food for one hour" is clearly about the service instead of the food. Failure to utilize these interactions (both between tasks and among targets) may compromise the performance of both subtasks.

Recent developments of deep learning have provided the opportunity of a better alternative to modularized pipelines, in which machine learning and natural language processing tasks can be solved in an end-to-end manner. With a carefully designed multi-layer neural network, learning errors back-propagate from upper layers to lower layers, which enables deep interactions between the learning of multi-grained representations of the data or multiple subtasks. Indeed, deep learning has recently been applied to target-specific sentiment analysis (mostly the second subtask of attitude identification) and achieved promising performance, where a given target is assumed to have appeared exactly once in a piece of text and the task is to determine the polarity of this text [38, 45, 36]. A deep network structure learns the dependency between the words in the context and the target word.

In another related topic known as multi-aspect sentiment analysis, where the goal is to learn the fine-grained sentiments on different aspects of a target, some methods have attempted to model aspects and sentiments jointly. Aspects are often assumed to be mentioned explicitly in text, so that the related entities can be extracted through supervised sequence labeling methods [21, 19, 46]; aspects mentioned implicitly can be extracted as fuzzy representations through unsupervised methods such as topic models [22, 41, 32]. While unsupervised methods suffer from low accuracy, it is usually difficult for supervised methods, like support vector machines (SVMs) [17], to interleave aspect extraction and sentiment classification.
In this paper, we show that the accuracy of attitude identification can be significantly improved through effectively modeling the interactions between subtasks and among targets. The problem can be solved with an end-to-end machine learning architecture, where the two subtasks are interleaved by a deep memory network. The proposed model, called the AttNet, also allows different targets to interact with each other, by sharing a common semantic space and simultaneously keeping their own space, making it possible for all targets to be learned in a unified model. The proposed deep memory network outperforms models that do not consider the subtask or target interactions, including conventional supervised learning methods and state-of-the-art deep learning models.

The rest of the paper is organized as follows. Section 2 summarizes the related literature. In Section 3, we describe how the deep neural network is designed to incorporate the interaction both between subtasks and among targets. We present the design and the results of empirical experiments in Section 4 and Section 5, and then conclude the paper in Section 6.

2. RELATED WORK

Sentiment analysis has been a very active area of research [27, 30]. While sentiment in general does not need to have a specific target, the notion of attitude is usually concerned with a sentiment towards a target entity (someone or something). As one category of sentiment analysis, there is much existing work related to attitude identification, which generally takes place in three domains: multi-aspect sentiment analysis in product reviews, stance classification in online debates, and target-dependent sentiment classification in social media posts. Below we categorize existing work by the problem settings, e.g., whether the target is required to be explicitly mentioned.

Explicitly tagged targets. There has been a body of work that classifies the sentiment towards a particular target that is explicitly mentioned and tagged in text, mostly applied to social media text such as Tweets. Due to the short length of Tweets, many models assume that targets appear exactly once in every post. Jiang et al. [14] developed seven rule-based target-dependent features, which are fed to an SVM classifier. Dong et al. [6] proposed an adaptive recursive neural network that propagates sentiment signals from sentiment-bearing words to specific targets on a dependency tree. Vo et al. [38] split a Tweet into a left context and a right context according to a given target, and used pre-trained word embeddings and neural pooling functions to extract features. Zhang et al. [45] extended this idea by using gated recursive neural networks.

Model structures for target-dependent sentiment classification heavily rely on the assumption that the target appears in the text explicitly, and exactly once. These models could degenerate when a target is implicitly mentioned or mentioned multiple times. Additionally, they do not consider the interactions between the subtasks (target detection and sentiment classification) or among the targets.

Given target, one per instance. In the problem of stance classification, the target, mentioned explicitly or implicitly, is given but not tagged in a piece of text. The task is only to classify the sentiment polarity towards that target. Most methods train a specific classifier for each target and report performance separately per target. Many researchers focus on the domain of online debates. They utilized various features based on n-grams, part of speech, syntactic rules, and dialogic relations between posts [40, 10, 7, 31]. The workshop SemEval-2016 presented a task on detecting stance from tweets [24], where an additional category is added for the given target, indicating the absence of sentiment towards the target. Mohammad et al. [25] beat all teams by building an SVM classifier for each target.

As stance classification deals with only one given target per instance, it fails to consider the interaction between target detection and sentiment classification. Furthermore, the interplay among targets is ignored by training a separate model per target.

Explicit targets, not tagged. In the domain of product reviews, a specific aspect of a product could be considered as a target of attitudes. When the targets appear in a review but are not explicitly tagged, they need to be extracted first. Most work focuses on extracting explicitly mentioned aspects. Hu et al. [12] extracted product aspects via association mining, and expanded seed opinion terms by using synonyms and antonyms in WordNet. When supervised learning approaches are taken, both tasks of aspect extraction and polarity classification can be cast as a binary classification problem [17], or as a sequence labeling task and solved using sequence learning models such as conditional random fields (CRFs) [21, 19] or hidden Markov models (HMMs) [46].

Implicit targets. There are studies that attempt to address the situation when aspects could be implicitly mentioned. Unsupervised learning approaches like topic modeling treat aspects as topics, so that topics and sentiment polarity can be jointly modeled [22, 41, 32]. The workshop of SemEval-2015 announced a task of aspect based sentiment analysis [28], which separates aspect identification and polarity classification into two subtasks. For aspect identification, top teams cast aspect category extraction as a multi-class classification problem with features based on n-grams, parse trees, and word clusters.

Although aspect identification and polarity classification are modeled jointly here, it is hard to train unsupervised methods in an end-to-end way and directly optimize the task performance.

Deep learning for sentiment analysis. In the general domain of sentiment analysis, there has been an increasing amount of attention on deep learning approaches. In particular, Bespalov et al. [1] used Latent Semantic Analysis to initialize the word embedding, representing each document as the linear combination of n-gram vectors. Glorot et al. [9] applied Denoising Autoencoders for domain adaptation in sentiment classification. A set of models have been proposed to learn the compositionality of phrases based on the representation of children in the syntactic tree [33, 34, 11]. These methods require parse trees as input for each document. However, parsing does not work well on user generated contents, e.g., tweets [8]. Liu et al. [20] used recurrent neural networks to extract explicit aspects in reviews. The paper most relevant to ours is Tang et al. [36], which applied Memory Networks [35] to the task of multi-aspect sentiment analysis. Aspects are given as inputs, assuming that they have already been annotated in the text. Their memory network beat all LSTM-based networks but did not outperform SVM with hand-crafted features.

Compared to the existing approaches, our work develops a novel deep learning architecture that emphasizes the interplay between target detection and polarity classification, and the interaction among multiple targets. These targets can be explicitly or implicitly mentioned in a piece of text and do not need to be tagged a priori.

3. ATTNET FOR ATTITUDE IDENTIFICATION

We propose an end-to-end neural network model to interleave the target detection task and the polarity classification task. The target detection task is to determine whether a specific target occurs in a given context either explicitly or implicitly. The polarity classification task is to decide the attitude of the given context towards the specific target if the target occurs in the context. Formally, a target detection classifier is a function mapping pairs of targets and contexts into binary labels, (context, target) → {present, absent}. A polarity classifier is a function mapping pairs of targets and contexts into three attitude labels, (context, target) → {positive, negative, neutral}. For example, given a context, "if everyone has guns, there would be just mess", and a target, gun control, the correct label is present for the target detection task and positive for polarity classification.
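To make this input/output contract concrete, here is a minimal sketch of the two classifier interfaces; the enum and function names are ours, chosen to mirror the label sets above, not identifiers from the paper.

```python
# A minimal sketch of the two task interfaces defined above.
# The enum and function names are illustrative, not the authors'.
from enum import Enum

class Presence(Enum):
    PRESENT = "present"
    ABSENT = "absent"

class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"

def detect_target(context: str, target: str) -> Presence:
    """TD: (context, target) -> {present, absent}."""
    ...

def classify_polarity(context: str, target: str) -> Polarity:
    """PC: (context, target) -> {positive, negative, neutral}."""
    ...
```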
Our model builds on the insight that the target detection task and the polarity classification task are deeply coupled in several ways.

• The polarity classification depends on the target detection because the polarity is meaningful only if the target occurs in the context. Reversely, the polarity classification task provides indirect supervision signals for the target detection task. For example, if the attitude label positive is provided for a context-target pair, the target must have occurred in the context following the definition. Such indirect supervision signals are useful especially when the target only occurs in the context implicitly.

• The signal words in the target detection and the polarity classification task are usually position-related: the signal words to determine the polarity are usually the surrounding words of the signal words to detect the target. Moreover, when a context has multiple targets, the signal words usually cluster for different targets [12, 30].

• Different targets interact in both the target detection task and the polarity classification task. Intuitively, some context words could mean the same for many targets, while some context words could mean differently for different targets.

Specifically, our model introduces several techniques building on the interaction between the target detection task and the polarity classification task accordingly.

• The output of the target detection is concatenated as part of the input of the polarity classification task to allow polarity classification to be conditioned on target detection. Polarity classification labels are also used to train the target detection classifier by back-propagating the errors of the polarity classification to the target detection end-to-end.

• The attention of polarity classification over context words is preconditioned by the attention of target detection. The polarity classification task benefits from such precondition especially when there are multiple targets in the context.

• Target-specific projection matrices are introduced to allow some context words to have similar representations among targets and other context words to have distinct representations. These matrices are all learned in an end-to-end fashion.

We propose a deep memory network model, called the AttNet, which implements the above motivation and ideas. In the rest of this section, we give a brief introduction to the memory network model, followed by a description of a single layer version of the model. Then we extend the expressiveness and capability of the model by stacking multiple layers.

3.1 Background: Memory Networks

As one of the recent developments of deep learning, memory networks [35] have been successfully applied to language modeling, question answering, and aspect-level sentiment analysis [36], where they generate superior performance over alternative deep learning methods, e.g., LSTM.

Given a context (or document, e.g., "we have been waiting for food for one hour") and a target (e.g., service), a memory network layer converts the context into a vector representation by computing a weighted sum of context word vector representations. The weight is a score that measures the relevance between the context word and the target (e.g., a higher score between the words waiting and service). The vector representation of the context is then passed to a classifier for target detection or polarity classification. An attractive property is that all parameters, including the target embeddings, context word embeddings and scores, are end-to-end trainable without additional supervision signals.

AttNet improves the original memory network models for attitude identification by (1) interleaving the target detection and polarity classification subtasks and (2) introducing target-specific projection matrices in representation learning, without violating the end-to-end trainability.

3.2 Single Layer AttNet

We begin by describing AttNet in the single layer case, shown in Figure 1. Hereafter for simplicity, we refer to the task of target detection as TD, and polarity classification as PC.

(1) Target Embedding. Each query target is represented as a one-hot vector, $q \in \mathbb{R}^{N_{target}}$, where $N_{target}$ is the number of targets. All targets share a target embedding matrix $B \in \mathbb{R}^{d \times N_{target}}$, where $d$ is the embedding dimensionality. The matrix $B$ converts a target into its embedding vector $u = Bq$, which is used as the input for the TD task.

(2) Input Representation and Attention for TD. We compute match scores between the context (or document) and the target for content-based addressing. The context is first converted into a sequence of one-hot vectors, $\{x_i \in \mathbb{R}^{N_{voc}}\}$, where $x_i$ is the one-hot vector for the $i$-th word in the context and $N_{voc}$ is the number of words in the dictionary. The entire set of $\{x_i\}$ are then embedded into a set of input representation vectors $\{m_i^t\}$ by:

$$m_i^t = V_q^t A^t x_i,$$

where $A^t \in \mathbb{R}^{d \times N_{voc}}$ is the word embedding matrix shared across targets, superscript $t$ stands for the TD task, and $V_q^t \in \mathbb{R}^{d \times d}$ is a target-specific projection matrix for target $q$, which allows context words $x_i$ to share some semantic dimensions for some targets while varying for others.

In the embedding space, we compute the match scores between the target input representation $u$ and each context word representation $m_i^t$ by taking the inner product followed by a softmax, $a_i^t = \mathrm{SoftMax}(u^\top m_i^t)$, where $\mathrm{SoftMax}(w_i) = \exp(w_i)/\sum_j \exp(w_j)$. In this way, $a^t$ is a soft attention (or probability) vector defined over the context words.
Thepo- target-specificprojectionmatrixfortargetq,whichallowscontext larityclassificationtaskbenefitsfromsuchpreconditiones- wordsx tosharesomesemanticdimensionsforsometargetswhile i peciallywhentherearemultipletargetsinthecontext. varyforothers. Intheembeddingspace,wecomputethematchscoresbetween • Target-specific projection matrices are introduced to allow thetargetinputrepresentationuandeachcontextwordrepresenta- some context words to have similar representations among tionmt bytakingtheinnerproductfollowedbyasoftmax,at = targetsandothercontextwordstohavedistinctrepresenta- SoftMaix(u(cid:124)mt),whereSoftMax(w ) = exp(w )/(cid:80) exp(wi ). tions. Thesematricesarealllearnedinanend-to-endfash- i i i j j Inthisway,atisasoftattention(orprobability)vectordefinedover ion. thecontextwords. Weproposeadeepmemorynetworkmodel, calledtheAttNet, which implements the above motivation and ideas. In the rest of (3)OutputRepresentationforTD. Adifferentembeddingma- this section, we give a brief introduction to the memory network trix,Ct ∈ Rd×Nvoc,isintroducedforflexibilityincomputingthe model, followed by a description of a single layer version of the outputrepresentationofcontextwordsby: model. Then we extend the expressiveness and capability of the modelbystackingmultiplelayers. ct =VtCtx i q i target identification polarity classification 4 7 4 𝑜𝑡 𝑧 5 3 Weighted Sum 3 tuptu 𝑐𝑖𝑡 Embedding𝐶𝑡 Embed5ding 𝐴𝑝 𝑚𝑖𝑝 tupni o 3 5 4 context {𝑥𝑖} SoftMax etta noitn 𝑎𝑖𝑡 2 𝑢𝑡 𝑢𝑝 𝑎𝑖𝑝 5 noitn 7 etta SoftMax 2 6 tupni 𝑚𝑖𝑡 Embedding 𝐴𝑡 Embedding 𝐶𝑝 𝑐𝑖𝑝 tuptuo 2 6 InnerProduct 2 1 Weighted Sum 𝑢 1 6 target 𝑞 1 𝑜𝑝 7 7 Figure1:AsinglelayerversionofAttNet.Keysubmodulesarenumberedandcorrespondinglydetailedinthetext. The response output vector ot is then a sum over the outputs ct, where0≤λ<1controlstheimportanceofthesecondterm,and i weightedbytheattentionvectorfromtheinput:ot =(cid:80) atct. f isamovingaveragefunctionwhichshiftsattentionsfromwords i i i ofhighvaluestotheirsurroundingneighbors.Theoutputvectoris (4) Interleaving TD and PC. In the single layer case, the sum op =(cid:80) bpcp. i i i oftheoutputvectorot andthetargetqueryembeddinguisthen passedtothePCtask,z=ot+u. (7)PredictionforTDandPC. Topredictwhetheratargetpresents, thesumoftheoutputvectoroftargetclassificationot andthetar- (5)InputRepresentationandAttentionforPC. Similartothe getqueryvectoruispassedthroughaweightmatrixWt ∈R2×d TDtask,weconverttheentiresetof{x }intoinputrepresentation (2isthenumberofclasses: present, absent)andasoftmaxoper- i vectors{mp}by: atortoproducethepredictedlabel,avectorofclassprobabilities: i yt =SoftMax(Wt(ot+u)). mpi =VqpApxi, Similarly,thesumoftheoutputvectorsop ofPCanditsinput whereAp ∈ Rd×Nvoc istheinputembeddingmatrixforPC.We vectorzisthenpassedthroughaweightmatrixWp ∈ R3×d and useseparateembeddingmatricesAtandApforTDandPC,asthe a softmax operator to produce the predicted attitude label vector, yp =SoftMax(Wp(op+z)). wordscouldhavedifferentsemanticsinthetwotasks. Forsimilar reasons, we use different projection matrices Vt and Vp for the q q twotasks. Giventhepolarityinputrepresentation{mp},wealsocompute 3.3 MultipleLayerAttNet i thesoftattentionoverthecontextwordsforpolarityidentification, Wenowextendourmodeltostackedmultiplelayers. Figure2 ap =SoftMax(z(cid:124)mp). i i showsathreelayerversionofourmodel.Thelayersarestackedin thefollowingway: (6)OutputRepresentationforPC. Thereisalsoonecorrespond- ingoutputvectorcpi inPCforeachxi: Functionality of Each Layer. 
For TD, the input to the (k+1)- cp =VpCpx , thlayeristhesumoftheoutputotkandtheinputukfromthek-th i q i layer,followedbyasigmoidnonlinearity:u =σ(Htu +ot), k+1 k k whereCp ∈ Rd×Nvoc isthepolarityoutputembeddingmatrix. It whereσ(x) = 1/(1+exp(x))isthesigmoidfunctionandHt is has been observed that sentiment-baring words are often close to alearnablelinearmappingmatrixsharedacrosslayers.ForthePC thetarget[12,30]. Basedonthisobservation,attentions,orposi- task,theinputtothefirstlayeristhetransformedsumfromthelast tionsofimportantwordsthatidentifythetargetinthefirstmodule, layeroftheTDmodule,z1 =σ(HtuKt +otKt),whereKtisthe couldprovidepriorknowledgetolearntheattentionofthesecond number of stacked layers in the TD task. Thus the prediction of module.Thereforewecomputethefinalattentionvectorasafunc- polaritywoulddependontheoutputoftheTDtaskandreversely tionoforiginalattentionsofbothtasks: theTDtaskwouldbenefitfromindirectsupervisionfromthePC taskbybackwardpropagationoferrors.SimilarlyforPC,theinput bp =(1−λ)ap+λf(at), (1) tothe(k+1)-thlayeristhesumoftheoutputop andtheinputz k k target identification polarity classification sets,andweshowthesuperiorperformanceofourmodel.Wealso experiment with variants of AttNet as credit assignments for the σ nonlineoaruittpyut 𝑜3𝑡 oinuptuptu 𝑧t 1𝑜1𝑝 σ𝑧1 keycomponentsinourmodel. 𝑢3 input 𝑢3 projection 𝑢𝑡 ninopnulitn 𝑧e2arity σ nonlineoaurtitpyut 𝑜2𝑡 context {𝑥𝑖} output 𝑜2𝑝 σ𝑧2 4.1 DataSets 𝑢2σ nonlineainriptuyt 𝑢2 ninopnulitn 𝑧e3arity 𝑧3 We examine AttNet on three domains that are related to atti- output 𝑜1𝑡 projection 𝑢𝑝 output 𝑜3𝑝 σ tude classification: online debates (Debates), multi-aspect senti- nonlinearity 𝑢1 input 𝑢1 ment analysis on product review (Reviews), and stance in tweets (Tweets). target 𝑞 Debates.ThisdatasetisfromtheInternetArgumentCorpusver- sion22. ThedatasetconsistsofpoliticaldebatesonthreeInternet Figure2:Athreelayerversionofourmodel. BoththeTDand forums3.Intheseforums,apersoncaninitiateadebatebyposting PCmoduleshavethreestackedlayers. atopicandtakingpositionssuchasfavor vs. against. Examples oftopicsareguncontrol,deathpenaltyandabortion. Otherusers participateinthesedebatesbypostingtheirargumentsforoneof fromthek-thlayer, followedbyasigmoidnonlinearity: zk+1 = thesides. σ(Hpzk+opk). Tweets.ThisdatasetcomesfromataskoftheworkshopSemEval- 2016ondetectingstancefromtweets[24]. Targetsaremostlyre- AttentionforPC. Inthesinglelayercase,theattentionforPCis latedtoideology,e.g.,atheismandfeministmovement4. basedonthatoftheTDmodule.Whenlayersarestacked,alllayers Review.Thisdatasetincludesreviewsofrestaurantsandlaptops ofthefirstmodulecollectivelyidentifyimportantattentionwords fromSemEval2014[29]and2015[28],wheresubtasksofidenti- todetectthetarget. Thereforewecomputetheaveragedattention fyingaspectsandclassifyingsentimentsareprovided. Wemerge vector across all layers in the TD module a¯t = 1 (cid:80)Kt at. two years’ data to enlarge the data set, and only include aspects Kt k=1 k Accordingly for k-th layer of the PC module, the final attention thatareannotatedinbothyears. vector is bp = (1 − λ)ap + λf(a¯t), and the output vector is Toguaranteeenoughtrainingandtestinstances,forallthedata op =(cid:80) bpk cp . k setswefilterouttargetsmentionedinlessthan100documents.The k i k,i k,i originaltrain-testsplitisusedifprovided,otherwisewerandomly TyingEmbeddingandProjectionMatrices. Theembeddingma- sample10%dataintotestset. 
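As a sketch, the objective can be written directly from this definition; the array names and shapes below are ours, and the probabilities are assumed to come from a forward pass like the one sketched in Section 3.2. The indicator simply switches off the polarity term whenever the ground-truth target label is absent.

```python
# A minimal sketch of the multi-task objective in Section 3.4.
# Names, shapes, and the class index convention are ours.
import numpy as np

PRESENT = 0  # assumed index of the "present" class in y_t

def attnet_loss(y_t, y_p, g_t, g_p):
    """Cross-entropy summed over all (context i, target j) pairs.

    y_t: (N, M, 2) TD class probabilities
    y_p: (N, M, 3) PC class probabilities
    g_t: (N, M)    ground-truth TD labels (0 = present, 1 = absent)
    g_p: (N, M)    ground-truth PC labels (ignored where target absent)
    """
    n, m = g_t.shape
    loss = 0.0
    for i in range(n):
        for j in range(m):
            loss -= np.log(y_t[i, j, g_t[i, j]])
            if g_t[i, j] == PRESENT:  # the indicator 1_{g^t_ij}
                loss -= np.log(y_p[i, j, g_p[i, j]])
    return loss
```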
4. EXPERIMENT SETUP

In the experiments, we compare AttNet to conventional approaches and alternative deep learning approaches on three real world data sets, and we show the superior performance of our model. We also experiment with variants of AttNet as credit assignments for the key components in our model.

4.1 Data Sets

We examine AttNet on three domains that are related to attitude classification: online debates (Debates), multi-aspect sentiment analysis on product reviews (Review), and stance in tweets (Tweets).

Debates. This dataset is from the Internet Argument Corpus version 2². The dataset consists of political debates on three Internet forums³. In these forums, a person can initiate a debate by posting a topic and taking positions such as favor vs. against. Examples of topics are gun control, death penalty and abortion. Other users participate in these debates by posting their arguments for one of the sides.

Tweets. This dataset comes from a task of the workshop SemEval-2016 on detecting stance from tweets [24]. Targets are mostly related to ideology, e.g., atheism and feminist movement⁴.

Review. This dataset includes reviews of restaurants and laptops from SemEval 2014 [29] and 2015 [28], where subtasks of identifying aspects and classifying sentiments are provided. We merge the two years' data to enlarge the data set, and only include aspects that are annotated in both years.

To guarantee enough training and test instances, for all the data sets we filter out targets mentioned in less than 100 documents. The original train-test split is used if provided; otherwise we randomly sample 10% of the data into the test set. We further randomly sample 10% of the training data for validation. Text pre-processing includes stopword removal and tokenization by the CMU Twitter NLP tool [8]. The details of the data sets are shown in Table 1.

² https://nlds.soe.ucsc.edu/iac2
³ 4forums (http://www.4forums.com/political/), ConvinceMe (http://www.convinceme.net/) and CreateDebate (http://www.createdebate.com/)
⁴ Since there are less than 10 tweets with neutral stance, we only consider positive and negative attitude by discarding these neutral tweets.

Table 1: Statistics of each dataset.

Dataset | Set   | #docs | #pos  | #neg  | #neutral | #absent
Debates | train | 24352 | 13891 | 10711 | 0        | 0
Debates | val   | 2706  | 1530  | 1203  | 0        | 0
Debates | test  | 3064  | 1740  | 1371  | 0        | 0
Tweets  | train | 2614  | 682   | 1253  | 0        | 679
Tweets  | val   | 291   | 71    | 142   | 0        | 78
Tweets  | test  | 1249  | 304   | 715   | 0        | 230
Review  | train | 5485  | 2184  | 1222  | 210      | 2336
Review  | val   | 610   | 260   | 121   | 17       | 277
Review  | test  | 1446  | 496   | 455   | 60       | 634

#pos means the number of documents with positive sentiment for each target. If one document contains positive sentiment towards two targets, it will be counted twice. #absent counts the number of documents without any attitude towards any target.

4.2 Metrics

For our problem, each data set has multiple targets, and each target can be classified into one of the outcomes: absent (does not exist), neutral, positive, and negative. If we treat each outcome of each target as one category, we can adopt common metrics for multi-class classification. Since most targets do not appear in most instances, we have a highly skewed class distribution, where measures like accuracy are not good choices [3].

Apart from precision, recall and AUC, we also use the macro-average F-measure [44]. Let $\rho_i$ and $\pi_i$ be recall and precision respectively for a particular category $i$: $\rho_i = \frac{TP_i}{TP_i + FN_i}$, $\pi_i = \frac{TP_i}{TP_i + FP_i}$, where $TP_i$, $FP_i$, $FN_i$ are the numbers of true positives, false positives, and false negatives for category $i$. Given $\rho_i$ and $\pi_i$, the F-score of category $i$ is computed as $F_i = \frac{2 \pi_i \rho_i}{\pi_i + \rho_i}$. The macro-average F-score is obtained by taking the average over all categories. The final precision and recall are also averaged over individual categories. There is another, micro-averaged F-measure, which is equivalent to accuracy. Therefore, we do not include it.
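For concreteness, a small sketch of these macro-averaged metrics, where each (target, outcome) pair has been encoded as an integer category id; the toy labels at the bottom are ours, not data from the paper.

```python
# A small sketch of the macro-averaged precision, recall, and F-score
# of Section 4.2. Category ids are assumed to already encode
# (target, outcome) pairs.
import numpy as np

def macro_prf(y_true, y_pred, n_categories):
    ps, rs, fs = [], [], []
    for c in range(n_categories):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    # Average over categories (macro); micro-averaging would instead
    # pool the counts, which reduces to accuracy here.
    return np.mean(ps), np.mean(rs), np.mean(fs)

y_true = np.array([0, 1, 2, 3, 1, 0])  # toy ground-truth categories
y_pred = np.array([0, 1, 1, 3, 2, 0])  # toy predictions
print(macro_prf(y_true, y_pred, n_categories=4))
```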
4.3 Baselines

We compare baseline methods from two large categories: conventional methods and alternative deep learning methods.

Each baseline method has various configurations, based on whether: (1) it trains a single model or two separate models for the TD and PC subtasks, and (2) it trains one universal model for all targets or separate models for different targets. To distinguish different configurations, we append -sgl when using a single model for the two subtasks, and -sep when using separate models for each subtask. Taking SVM as an example, SVM-sgl directly classifies targets into four classes: absent, neutral, positive, and negative. In contrast, SVM-sep first classifies each target into two classes: absent, present, and uses a second model to classify polarity: neutral, positive, and negative. Moreover, we append -ind when individual targets are trained on separate models, or -all when one model is trained for all targets.

4.3.1 Conventional baselines

SVM+features. SVM using a set of hand-crafted features has achieved the state-of-the-art performance in stance classification of the SemEval 2016 task [25], online debates [10], and aspect-based sentiment analysis [36]. SVM has also demonstrated superior performance in document-level sentiment analysis compared with conditional random field methods [42]. Therefore we include all features from these methods that are general across domains, and use a linear kernel SVM implemented by LIBSVM [2] for classification. We list the set of features (a rough sketch of a few of them appears at the end of this subsection):

Document info: basic counting features of a document, including the number of characters, the number of words, the average words per document and the average word length.

N-grams: word unigrams, bigrams, and trigrams. We insert symbols that represent the start and end of a document to capture cue words [40].

Sentiment: the number of positive and negative words counted from the NRC Emotion Lexicon [26], the Hu and Liu Lexicon [12], and the MPQA Subjectivity Lexicon [43].

Target: presence of the target phrase in the text. Furthermore, if the target is present, we generate a set of target-dependent features according to [14]. To get a sense of these features, for the target iPhone in the text "I love iPhone", a feature love_arg will be generated.

POS: the number of occurrences of each part-of-speech tag (POS).

Syntactic dependency: a set of triples obtained by the Stanford dependency parser [5]. More specifically, the triple is of the form (rel, $w_i$, $w_j$), where rel represents the grammatical relation between words $w_i$ and $w_j$, e.g., is subject of.

Generalized dependency: the first word of the dependency triple is "backed off" to its part-of-speech tag [39]. Additionally, words that appear in sentiment lexicons are replaced by positive or negative polarity equivalents [39].

Embedding: the element-wise averages of the word vectors for all the words in a document. We use three types of word embeddings. Two of them are from studies on target-dependent sentiment classification [38, 45], which are the skip-gram embeddings of Mikolov et al. [23] and the sentiment-driven embeddings of Tang et al. [37]. The first type of embedding is trained on 5 million unlabeled tweets that contain emoticons, which guarantees that more sentiment related tweets are included. The second type of embedding is of 50 dimensions, which is publicly available⁵. The third type of embedding is also 50-dimensional, released by Collobert et al. [4] and trained on English Wikipedia⁶.

Word cluster: the number of occurrences of word clusters for all words in text. We perform K-means clustering on the word vectors.

Apart from two standard SVM model configurations, SVM-sep-ind and SVM-sgl-ind, we also compare with a hybrid model SVM-cmb-ind, whose prediction is absent if SVM-sep-ind says so, and otherwise it follows the decisions of SVM-sgl-ind.⁷

⁵ http://ir.hit.edu.cn/~dytang/
⁶ http://ronan.collobert.com/senna/
⁷ SVM-sgl-all and SVM-sep-all have performance degeneration due to the interference of different targets. We do not include their results for simplicity.
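Below is a rough sketch of how a few of the simpler features above (document counts, boundary-marked n-grams, lexicon hits, target presence) could be assembled; the tokenizer and the two tiny lexicons are stand-ins, and the full target-dependent rules of [14] as well as the POS, dependency, embedding, and cluster features are omitted.

```python
# A toy sketch of a subset of the hand-crafted SVM features above.
# The whitespace tokenizer and the tiny lexicons are stand-ins only.
from collections import Counter

POS_LEX = {"great", "love", "friendly"}   # stand-in positive lexicon
NEG_LEX = {"mess", "bad", "slow"}         # stand-in negative lexicon

def features(doc: str, target: str) -> dict:
    # Boundary symbols help capture cue words at the start/end [40].
    toks = ["<s>"] + doc.lower().split() + ["</s>"]
    feats = Counter()
    feats["n_chars"] = len(doc)               # document info
    feats["n_words"] = len(toks) - 2
    for n in (1, 2, 3):                       # uni/bi/trigram counts
        for i in range(len(toks) - n + 1):
            feats["ng=" + " ".join(toks[i:i + n])] += 1
    feats["n_pos_lex"] = sum(t in POS_LEX for t in toks)   # sentiment
    feats["n_neg_lex"] = sum(t in NEG_LEX for t in toks)
    feats["target_present"] = int(target.lower() in toks)  # target
    return dict(feats)

print(features("I love iPhone", "iPhone")["target_present"])  # 1
```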
4.3.2 Deep Learning Baselines

BiLSTM, MultiBiLSTM and Memnet. We also compare to the bidirectional LSTM (BiLSTM) model, the state-of-the-art in target-dependent sentiment classification [45]. Their variant of the BiLSTM model assumes that the given target always appears exactly once, and can be tagged in text by starting and ending offsets. When such an assumption fails, their model is equivalent to standard BiLSTM. We include the standard multi-layered bidirectional LSTM (MultiBiLSTM) [13] as an extension. Recently, Tang et al. [36] applied memory networks (Memnet) to multi-aspect sentiment analysis. Their results show the memory network performs comparably with feature-based SVM and outperforms all LSTM-related methods in their tasks.

CNN and ParaVec. We include related deep learning techniques beyond the sentiment analysis domain, such as convolutional neural networks (CNN) [15] and ParaVec [18]. ParaVec requires a huge amount of training data to reach decent performance. We enhance the performance of the ParaVec model by training over the merged training set of all data sets, plus the 5 million unlabeled tweets mentioned above.

Parser-dependent deep learning methods have also been applied to sentiment analysis [33, 34, 11]. These models are limited in our attitude identification problem for two reasons. First, they often work well with phrase-level sentiment labels, but only document-level sentiment labels are provided in our problems. Second, their parsers do not extend to user generated content, such as Tweets and Debates [8]. Our preliminary results show these methods work poorly on our problems and we do not include their results for simplicity.

For all deep learning methods, we report their -sep-all and -sgl-all versions. Unlike SVM, deep methods perform quite well when using a single model for all targets, by casting the problem as a multi-task multi-class classification. Though not scalable, for the strongest baselines (BiLSTM and MultiBiLSTM), we in addition train a separate model for each target. Since -sep-ind works better than -sgl-ind, we only report the former one. The variants of memory networks are detailed below.

4.4 Variants of AttNet

To assign the credit of key components in our model, we construct a competing model AttNet-. Unlike our proposed model, for AttNet- the target-specific projection matrices $V_q^p$ and $V_q^t$ are replaced by the identity matrix and are fixed during training. Thus the AttNet- model interleaves the target detection and polarity classification subtasks, but does not consider the interactions among targets. We refer to our proposed model as AttNet, which allows the projection matrices to be learned during training, and thus word semantics could vary across targets.

For AttNet-, we report two settings in our experiments: AttNet-all and AttNet-ind. The former makes all targets share the same embedding, while the latter separates the embedding space completely for each target, i.e., targets are trained on separate models.
Table 2: Hyper-parameters for our method AttNet.

Hyper-parameter      | Tweets | Review | Debates
L1 coeff             | 1e-6   | 1e-4   | 1e-6
L2 coeff             | 1e-4   | 1e-8   | 1e-8
init learning rate   | 0.05   | 0.01   | 0.005
#layers (target)     | 4      | 4      | 3
#layers (sentiment)  | 4      | 8      | 6
prior attention λ    | 0.5    | 0.1    | 0.5

The embedding size is set to 100 for all data sets. The sliding window size of the moving average function in Equation 1 is set to 3. #layers (target) is the number of memory layers for target detection, and #layers (sentiment) is the number for sentiment classification. Prior attention λ is the weight for prior attention in Equation 1.

4.5 Training Details

All hyper-parameters are tuned to obtain the best F-score on the validation set. The candidate embedding size set is {50, 100, 200, 300} for LSTM-related methods, SVM and CNN. The candidate number of clusters for K-means is {50, 100, 150}. The candidate relaxing parameter C for the SVM model is {2⁷, 2⁶, ..., 2⁻³}. The CNN model has three convolutional filter sizes, their filter size candidates are {{1,2,3}, {2,3,4}, {3,4,5}, {2,4,6}}, and the candidate number of filters is {50, 100, 200, 300}. For ParaVec, we experiment with both the skip-gram model and the bag-of-words model, and select the hidden layer size from {2⁶, 2⁷, ..., 2¹⁰}.

We explored three weight initialization methods of word embeddings for LSTM-related and CNN baselines: (1) sampling weights from a zero-mean Gaussian with 0.1 standard deviation; (2) initializing from the pre-trained embedding matrix; and (3) using a fixed pre-trained embedding matrix.

Memory network models, including our model, are initialized by sampling weights from a zero-mean Gaussian with unit standard deviation. The candidate number of memory layers is {2, 3, ..., 9}. The prior attention parameter λ of our model is selected from {0, 0.1, 0.5, 0.9, 1}. The capacity of memory, which has limited impact on the performance, is restricted to 100 words without further tuning. A null symbol is used to pad all documents to this fixed size. To reduce the model complexity, the projection matrices are initialized in the way that each column is a one-hot vector.

Deep learning models are optimized by Adam [16]. The initial learning rate is selected from {0.1, 0.05, 0.01, 0.005, 0.001}, and the L1 and L2 coefficients of the regularizers are selected from {10⁻², 10⁻⁴, ..., 10⁻¹⁰}. The hyper-parameters of our model AttNet for the different data sets are listed in Table 2.
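The tuning procedure amounts to a grid search over these candidate sets, keeping the combination with the best validation F-score. A generic sketch follows, with a stand-in training function; the grid lists only a few of the candidate sets named above.

```python
# A generic sketch of the model selection in Section 4.5: pick the
# hyper-parameter combination with the best validation F-score.
from itertools import product

grid = {
    "lr": [0.1, 0.05, 0.01, 0.005, 0.001],
    "l1": [10 ** -k for k in range(2, 11, 2)],
    "n_layers": list(range(2, 10)),
    "lam": [0, 0.1, 0.5, 0.9, 1],
}

def train_and_validate(**params) -> float:
    # Stand-in: train AttNet with `params` and return the validation
    # F-score. A dummy value is returned so the loop runs end to end.
    return 0.0

best_f, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    f = train_and_validate(**params)
    if f > best_f:
        best_f, best_params = f, params
```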
5. EXPERIMENT RESULTS

5.1 Overall Performance

The overall performance of all competing methods over the data sets is shown in Table 3⁸. Evaluating with F-score and AUC, we make the following observations. Our method AttNet outperforms all competing methods significantly. This empirically confirms that interleaving the target detection and polarity classification subtasks, combined with target-specific representations, could benefit attitude identification.

The variants of our model, AttNet-all and AttNet-ind, have already gained significant improvement over the strongest baselines on all data sets. More importantly, the two methods significantly outperform the Memnet-sep-all and Memnet-sgl-all baselines, which do not interleave the subtasks. Such empirical findings cast light on that interleaving the subtasks indeed improves the attitude identification performance. In contrast, separating the two subtasks of attitude identification could lead to performance degeneration.

Our model AttNet also outperforms its variants, AttNet-all and AttNet-ind, on all data sets. The performance advantage of AttNet comes from the adoption of target-specific projection matrices in representation learning, since these matrices are the only differences between AttNet and AttNet-. Even though the improvement from adopting target-specific projection matrices is not as marked as from the technique of interleaving the subtasks, the improvement is still significant. This result confirms that attitude identification could benefit from learned representations that share the same semantics for many targets but vary for some targets.

By examining the precision and recall results, we find that the superior performance of our model is mainly from the significant improvement of recall, though both precision and recall are improved significantly on the Debates dataset.

⁸ The performance of all methods on the Review dataset is lower than the other two because the Review dataset handles three polarities while the others only need to handle two polarities, as shown in Table 1.

Table 3: Performance of competing methods: AttNet achieves top performance.

Method              | Tweets: F / AUC / P / R                     | Review: F / AUC / P / R                     | Debates: F / AUC / P / R
SVM-sep-ind         | 59.93 / 69.20 / 68.70 / 55.69               | 38.43 / 57.99 / 51.22 / 36.83               | 58.30 / 72.10 / 64.48 / 57.81
SVM-sgl-ind         | 57.44*** / 66.64*** / 69.87 / 52.45***      | 36.06** / 56.84** / 50.79 / 34.07**         | 59.75 / 72.25 / 66.39 / 57.67
SVM-cmb-ind         | 57.09*** / 66.46*** / 69.84 / 52.25***      | 35.71*** / 56.61*** / 50.73 / 33.78***      | 59.86 / 71.68 / 66.28 / 56.48
ParaVec-sep-all     | 53.17*** / 62.88*** / 56.75*** / 48.29***   | 34.02*** / 55.26*** / 38.04*** / 30.47***   | 56.32** / 68.12*** / 59.09*** / 49.41***
ParaVec-sgl-all     | 54.15*** / 63.41*** / 57.52*** / 48.76***   | 34.26*** / 55.31*** / 38.26*** / 30.89***   | 55.35*** / 67.48*** / 59.46*** / 49.82***
CNN-sep-all         | 58.05* / 70.10 / 62.43*** / 56.19           | 37.15* / 57.55 / 43.73*** / 33.24**         | 57.38 / 70.70** / 61.81** / 52.94***
CNN-sgl-all         | 58.69 / 70.71* / 61.83*** / 56.64           | 35.45*** / 56.29* / 44.65*** / 32.83**      | 56.23** / 69.92** / 60.75** / 52.29***
BiLSTM-sep-all      | 61.16 / 71.26** / 63.45* / 59.87***         | 40.78* / 61.01*** / 42.54*** / 39.01**      | 59.83* / 71.91 / 65.94* / 57.65
BiLSTM-sgl-all      | 60.86 / 71.02** / 62.58* / 59.61***         | 39.68 / 60.84** / 41.88*** / 38.81*         | 58.66 / 72.01 / 64.87 / 57.89
BiLSTM-sep-ind      | 59.49 / 71.92** / 61.44*** / 57.86          | 40.42 / 62.25*** / 42.68*** / 39.78**       | 58.75 / 72.83 / 64.73 / 57.95
MultiBiLSTM-sep-all | 60.53 / 71.51** / 64.81* / 57.76            | 40.47 / 60.71** / 44.89*** / 37.67*         | 59.24* / 72.24 / 64.75 / 58.43
MultiBiLSTM-sgl-all | 60.59 / 71.32** / 64.27* / 57.97            | 39.38 / 59.68* / 43.22*** / 37.92*          | 58.98 / 71.18 / 63.46 / 57.26
MultiBiLSTM-sep-ind | 59.71 / 71.16** / 63.72* / 57.94            | 40.81 / 61.27** / 44.76*** / 38.02*         | 58.36 / 72.93 / 64.15 / 57.14
Memnet-sep-all      | 59.44 / 71.68** / 63.22* / 59.80***         | 41.75** / 61.82*** / 45.61*** / 39.25**     | 60.42** / 73.84* / 65.37 / 58.92*
Memnet-sgl-all      | 60.69 / 71.80** / 63.48* / 59.97***         | 41.65** / 61.53*** / 45.23*** / 39.13**     | 59.67* / 73.31** / 64.27 / 58.83
Proposed methods
AttNet-all          | … / … / … / …                               | … / … / … / …                               | … / … / … / …
AttNet-ind          | … / … / … / …                               | … / … / … / …                               | … / … / … / …
AttNet              | … / … / … / …                               | … / … / … / …                               | … / … / … / …

* (**, ***) indicate that one method is statistically significantly better or worse than SVM-sep-ind (which is in general the best configuration among all SVM models) according to a t-test [44] at the significance level of 0.05 (0.01, 0.001). ◊◊ (◊◊◊) indicate AttNet- outperforms the better one between Memnet-sep-all and Memnet-sgl-all at the significance level of 0.01 (0.001). ▽▽ (▽▽▽) indicate AttNet outperforms the better one between AttNet-all and AttNet-ind at the significance level of 0.01 (0.001).

5.2 Performance on Subtasks

We have established that our models outperform competing methods on all data sets. In order to further assign the credits of the improvement of our methods, we evaluate our models on the two subtasks, target detection and polarity classification, with results given in Table 4 and Table 5 respectively. Since different configurations of the same method work similarly, we only present the results where separate models are trained for each task. It can be seen from Table 4 that the target detection task is relatively easy, as all methods can achieve quite high scores. This also means that it is hard to improve any further on this task. In terms of precision and recall, SVM performs quite well on the precision metric, especially for the Review dataset, while most deep learning methods focus more on enhancing recall. When considering both precision and recall, most deep learning methods are still better, as the F-score shows.

Table 4: Performance on target detection for -sep models. Each cell shows F-score, followed by precision and recall in parentheses.

Method      | Tweets                     | Review                     | Debates
SVM         | 79.74 (89.12, 75.00)       | 67.84 (84.13, 63.52)       | 84.59 (91.47, 81.00)
ParaVec     | 75.63** (82.67***, 71.76***) | 63.74*** (67.41***, 57.03***) | 76.16*** (80.32***, 71.80***)
CNN         | 80.62 (87.57**, 77.41*)    | 65.34** (73.23***, 59.16**) | 80.21** (93.26*, 73.48**)
BiLSTM      | 81.50** (86.48*, 81.68**)  | 70.41** (76.39***, 69.11***) | 85.05 (92.03, 82.05)
MultiBiLSTM | 81.53** (87.69, 78.82**)   | 69.26** (75.38***, 68.40***) | 85.33 (92.82, 83.77*)
Memnet      | 81.29** (87.82, 79.36**)   | 70.71** (75.52***, 68.59***) | 86.29* (92.31, 83.26)
AttNet-all  | 82.58** (…, …)             | 71.84*** (…, …)            | 89.02***◊◊ (…, …)
AttNet-ind  | … (…, …)                   | … (…, …)                   | … (…, …)
AttNet      | … (88.76, 82.24***▽▽)      | … (76.38***, 71.09***)     | … (92.05, 87.12***)

Table 5: Performance on polarity classification for -sep models. Each cell shows F-score, followed by precision and recall in parentheses.

Method      | Tweets                     | Review                     | Debates
SVM         | 44.45 (64.50, 34.28)       | 21.37 (53.25, 14.66)       | 42.52 (57.37, 38.29)
ParaVec     | 39.20*** (56.91**, 26.46***) | 17.69*** (31.05***, 9.25***) | 30.59*** (56.15, 20.88***)
CNN         | 42.95 (58.20**, 35.71)     | 19.42* (39.91**, 11.34**)  | 35.15*** (49.82**, 25.40***)
BiLSTM      | 46.18** (61.35*, 41.79***) | 25.25** (41.06***, 19.40**) | 42.69 (54.58, 35.59)
MultiBiLSTM | 46.26** (60.56**, 39.36**) | 24.06** (47.15**, 17.52**) | 41.87 (49.69*, 37.26)
Memnet      | 47.81** (61.20**, 40.52**) | 25.47** (46.34**, 19.81**) | 44.61 (54.40*, 38.14)
AttNet-all  | … (…, …)                   | … (…, …)                   | … (…, …)
AttNet-ind  | … (…, …)                   | … (…, …)                   | … (…, …)
AttNet      | 52.23***▽▽ (65.54, 44.57***▽▽) | 35.34***▽▽ (59.99***▽▽, 27.32***▽▽) | 55.93***▽▽▽ (71.53**▽▽, 50.25***▽▽)

The second task is only evaluated on documents with ground-truth sentiments towards particular targets, with F-scores averaged over all targets and three sentiment classes: positive, negative, and neutral. We make several notes for this evaluation. (1) In order to achieve a high score in the second task, it is still important to classify correctly the presence of a target. (2) In general the scores for all methods in the second task are low, because the classifier might predict a target as absent even though the ground-truth class can only be drawn from three sentiment classes. (3) It is possible for a method to outperform SVM on both tasks, while still obtaining close results when the two tasks are evaluated together.
This results from our method of evaluation on the second task, where a document is included only when it expresses sentiment towards a particular target.

Based on the results from Table 5, we can see that the percentage of improvement over SVM is much higher than that of the first task. Intuitively, the sentiment task requires better modeling of the non-linear interaction between the target and the context, while for the target detection task, the presence of certain signal words might be enough.

5.3 Training Time Analysis

In order to measure the training speed of each model, we train all deep learning methods on a server with a single TITAN X GPU. SVM is trained on the same server with a 2.40 GHz CPU and 120G RAM. All methods are trained sequentially without parallelization.

SVM can finish training in less than one hour, but its required training time increases linearly as the number of targets increases. For all deep learning methods, the number of epochs required for training is in general very close, around 20 epochs averaged over all data sets.

Comparing the training time per epoch, ParaVec and CNN are much faster than the other methods (less than 5 seconds/epoch). Despite the training efficiency, their effectiveness is a problem. When all targets share a single model, LSTM has a speed of 200 seconds/epoch, while standard memory networks have a speed of 150 seconds/epoch. Memory networks in many tasks, e.g., language modeling, are much faster than LSTM, due to the expensive recursive operation of LSTM. However, in our problem setting, each target has to be forwarded one by one for every document, lowering the efficiency of memory networks. When individual targets are trained on separate LSTMs, LSTMs require far more training time (1000 seconds/epoch).

AttNet consumes 200 seconds per epoch. Compared to standard memory networks, AttNet produces some additional overhead by introducing the interaction between subtasks, and by adding a projection matrix. But this overhead is small.

The efficiency of deep learning methods could be improved by parallelization. Since there is already much work on this topic, which could increase the speed without sacrificing effectiveness, we do not go further in this direction.

Summary: empirical experiments demonstrated that the proposed deep memory network, AttNet, and its variants outperform conventional supervised learning methods. This is promising but perhaps not surprising given the success of deep learning in general. It is encouraging to notice that AttNet also improves over the state-of-the-art deep learning architectures. This improvement is statistically significant, and can be observed for both subtasks and for attitude identification as a whole. The improvement of effectiveness does not compromise learning efficiency.
5.4 Visualization of Attention

In order to better understand the behavior of our models, we compare the attention weights given by our model AttNet and the competing method Memnet.

[Figure 3: Visualization of learned attention. Red patches highlighting the top half of the text indicate the model's attention weight in the target detection task, while green ones highlighting the bottom half show the polarity classification task. Darker colors indicate higher attentions. "Truth: service+" means that the ground-truth sentiment towards service is positive, while "Predict + given ambience" gives the predicted positive sentiment given the query target ambience.

(a) Attention given by AttNet:
1. "It is admittedly to have them for policy. if everyone have guns there would be just mess." (Truth: gun control+. Predict + given gun control.)
2. "Highly impressed from the decor to the food to the great night!" (Truth: service+, ambience+, food+. Predict + given ambience.)
3. "When we inquired about ports - the waitress listed off several but did not know taste variations or cost." (Truth: service-. Predict absent given drink.)

(b) Attention given by Memnet, on the same three sentences:
1. (Predict - given gun control.)
2. (Predict absent given ambience.)
3. (Predict - given drink.)]

Figure 3(a) and (b) list some examples of word attentions generated by the different models for the same set of sentences in the test set. In the first sentence, both "them" and "guns" are found as targets by AttNet, while words like "mess" and "policy" are found as sentiment words. Though Memnet correctly identifies the existence of the attitude towards gun control, it fails to find important words to classify the polarity of the sentiment. This suggests the importance of interleaving the two tasks: successfully identifying mentioned targets could offer clues about the finding of sentiment words for the second task.

The second sentence is from a review of a restaurant, where ambience is used as the query target. We can see that the target detection module of AttNet captures the word "decor", which signals the presence of the target ambience. The polarity classification module then focuses on extracting sentiment words associated with the target. However, the baseline Memnet captures both "decor" and "food" in the first task, mistakenly considering that all sentiments are only describing food rather than the ambience. Consequently, it judges that there is no attitude towards ambience. This example shows us the benefit of using the projection matrices to consider the interaction and distinction between targets. Otherwise the model might easily be confused about which entity the sentiments are expressed towards.

From the third sentence, we can see how our model AttNet determines that the query target drink does not exist. The first module highlights words like "ports" (a wine name) and "waitress", and the second module extracts the negative sentiment "not know", which is usually used to describe people rather than drinks. Memnet has almost the same attention distribution as AttNet, but still fails to produce the correct prediction. Similar to the second case, projection matrices are important for models to figure out the common phrases used to describe different sets of entities.

6. CONCLUSION

Attitude identification, a key problem of modern natural language processing, is concerned with detecting one or more target entities from text and then classifying the sentiment polarity towards them. This problem is conventionally approached by separately solving the two subtasks and usually separately treating each target, which fails to leverage the interplay between the two subtasks and the interaction among the target entities. Our study demonstrates that modeling these interactions in a carefully designed, end-to-end deep memory network significantly improves the accuracy of the two subtasks, target detection and polarity classification, and attitude identification as a whole. Empirical experiments prove that this novel model outperforms models that do not consider the interactions between the two subtasks or among the targets, including conventional methods and the state-of-the-art deep learning models.

This work opens the exploration of interactions among subtasks and among contexts (in our case, targets) for sentiment analysis using an end-to-end deep learning architecture. Such an approach can be easily extended to handle other related problems in this domain, such as opinion summarization, multi-aspect sentiment analysis, and emotion classification. Designing specific network architectures to model deeper dependencies among targets is another intriguing future direction.

Acknowledgment

This work is partially supported by the National Science Foundation under grant numbers IIS-1054199 and SES-1131500.
7. REFERENCES

[1] D. Bespalov, B. Bai, Y. Qi, and A. Shokoufandeh. Sentiment classification based on supervised latent n-gram analysis. In Proc. of CIKM, 2011.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2011.
[3] N. V. Chawla. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook. 2005.
[4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
[5] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proc. of LREC, 2006.
[6] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, and K. Xu. Adaptive recursive neural network for target-dependent twitter sentiment classification. In ACL, 2014.
[7] A. Faulkner. Automated classification of stance in student essays: An approach using stance target information and the wikipedia link-based measure. In FLAIRS, 2014.
[8] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments. In Proc. of ACL, 2011.
[9] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. of ICML, 2011.
[10] K. S. Hasan and V. Ng. Stance classification of ideological debates: Data, models, features, and constraints. In IJCNLP, 2013.
[11] K. M. Hermann and P. Blunsom. The role of syntax in vector space models of compositional semantics. In ACL, 2013.
[12] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proc. of SIGKDD, 2004.
[13] O. Irsoy and C. Cardie. Deep recursive neural networks for compositionality in language. In NIPS, 2014.
[14] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent twitter sentiment classification. In Proc. of ACL, 2011.
[15] Y. Kim. Convolutional neural networks for sentence classification. 2014.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[17] N. Kobayashi, R. Iida, K. Inui, and Y. Matsumoto. Opinion mining on the web by extracting subject-aspect-evaluation relations. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006.
[18] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
[19] F. Li, C. Han, M. Huang, X. Zhu, Y.-J. Xia, S. Zhang, and H. Yu. Structure-aware review mining and summarization. In Proc. of ACL, 2010.
[20] P. Liu, S. Joty, and H. Meng. Fine-grained opinion mining with recurrent neural networks and word embeddings. In EMNLP, 2015.
[21] D. Marcheggiani, O. Täckström, A. Esuli, and F. Sebastiani. Hierarchical multi-label conditional random fields for aspect-oriented opinion mining. In ECIR, 2014.
[22] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proc. of WWW, 2007.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[24] S. M. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry. SemEval-2016 task 6: Detecting stance in tweets. In Proc. of SemEval, 2016.
[25] S. M. Mohammad, P. Sobhani, and S. Kiritchenko. Stance and sentiment in tweets. arXiv preprint arXiv:1605.01655, 2016.
[26] S. M. Mohammad and P. D. Turney. Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon. In Proc. of NAACL, 2010.
[27] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008.
[28] M. Pontiki, D. Galanis, H. Papageorgiou, S. Manandhar, and I. Androutsopoulos. SemEval-2015 task 12: Aspect based sentiment analysis. In Proc. of SemEval, 2015.
[29] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar. SemEval-2014 task 4: Aspect based sentiment analysis. In Proc. of SemEval, 2014.
[30] A. Popescu and M. Pennacchiotti. Dancing with the stars, nba games, politics: An exploration of twitter users' response to events. In Proc. of ICWSM, 2011.
[31] A. Rajadesingan and H. Liu. Identifying users with opposing opinions in twitter debates. In Social Computing, Behavioral-Cultural Modeling and Prediction. 2014.
[32] C. Sauper and R. Barzilay. Automatic aggregation by joint modeling of aspects and values. JAIR, 2013.
[33] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proc. of EMNLP-CoNLL, 2012.
[34] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP, 2013.
[35] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.
[36] D. Tang, B. Qin, and T. Liu. Aspect level sentiment classification with deep memory network. In EMNLP, 2016.
[37] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In ACL, 2014.
[38] D.-T. Vo and Y. Zhang. Target-dependent twitter sentiment classification with rich automatic features. In IJCAI, 2015.
[39] M. A. Walker, P. Anand, R. Abbott, and R. Grant. Stance classification using dialogic properties of persuasion. In Proc. of NAACL, 2012.
[40] M. A. Walker, P. Anand, R. Abbott, J. E. F. Tree, C. Martell, and J. King. That is your evidence?: Classifying stance in online political debate. Decision Support Systems, 2012.
[41] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis without aspect keyword supervision. In Proc. of SIGKDD, 2011.
[42] S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proc. of ACL, 2012.
[43] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proc. of HLT/EMNLP, 2005.
[44] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proc. of SIGIR, 1999.
[45] M. Zhang, Y. Zhang, and D.-T. Vo. Gated neural networks for targeted sentiment analysis. In Proc. of AAAI, 2016.
[46] C. Zirn, M. Niepert, H. Stuckenschmidt, and M. Strube. Fine-grained sentiment analysis with structural features. In IJCNLP, 2011.