Attention and Memory-Augmented Networks for Dual-View Sequential Learning

Yong He, Cheng Wang, Nan Li, Zhenyu Zeng
{sanyuan.hy, youmiao.wc, kido.ln, zhenyu.zzy}@alibaba-inc.com
Alibaba Group
Hangzhou, China
ABSTRACT

In recent years, sequential learning has been of great interest due to the advance of deep learning, with applications in time-series forecasting, natural language processing, and speech recognition. Recurrent neural networks (RNNs) have achieved superior performance in single-view and synchronous multi-view sequential learning compared to traditional machine learning models. However, the method remains less explored in asynchronous multi-view sequential learning, and the unaligned nature of multiple sequences poses a great challenge to learning the inter-view interactions. We develop an AMANet (Attention and Memory-Augmented Networks) architecture by integrating both attention and memory to solve the asynchronous multi-view learning problem in general, and we focus on experiments with dual-view sequences in this paper. Self-attention and inter-attention are employed to capture intra-view interaction and inter-view interaction, respectively. History attention memory is designed to store the historical information of a specific object, which serves as local knowledge storage. Dynamic external memory is used to store global knowledge for each view. We evaluate our model on three tasks: medication recommendation from a patient's medical records, diagnosis-related group (DRG) classification from a hospital record, and invoice fraud detection through a company's taxation behaviors. The results demonstrate that our model outperforms all baselines and other state-of-the-art models in all tasks. Moreover, the ablation study of our model indicates that the inter-attention mechanism plays a key role in the model, and it can boost the predictive power by effectively capturing the inter-view interactions from asynchronous views.

CCS CONCEPTS

• Computing methodologies → Neural networks; Supervised learning by classification; • Information systems → Clustering and classification; • Applied computing → Health care information systems.

KEYWORDS

dual-view sequential learning, inter-attention, intra-attention, dynamic external memory, history attention memory, classification, medication recommendation, DRG classification, invoice fraud detection

ACM Reference Format:
Yong He, Cheng Wang, Nan Li, Zhenyu Zeng. 2020. Attention and Memory-Augmented Networks for Dual-View Sequential Learning. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394486.3403055

1 INTRODUCTION

Multi-view data, in which different views describe distinct perspectives, is very common in many real-world applications. Various learning algorithms have been proposed to incorporate the complementary information of multi-view datasets in order to understand a complex system and make precise data-driven predictions [26]. Due to the advance of the recurrent neural network (RNN) architecture, multiple sequential events can be integrated into multi-view learning. Such a form of learning is known as multi-view sequential learning. Many models are designed for synchronous sequences, in which all views share the same time step or can be aligned, while few models focus on asynchronous sequences, whose sequences are of various granularity and dynamic lengths. Electronic health record (EHR) learning is a notable example of multi-view sequential learning. A patient's medical records consist of multiple views, such as diagnosis, procedure, and medication, and are asynchronous on a fine-grained time scale.

The key challenge of multi-view sequential learning is to capture both the intra-view interactions within a single view and the inter-view interactions across multiple views. RNNs, such as long short-term memory (LSTM) [7] and gated recurrent unit (GRU) [3], have been widely used to learn the intra-view interactions in single-view sequential learning problems. However, the sequential nature of RNNs makes parallelization challenging. The transformer, which dispenses with recurrence and uses the attention mechanism to model the dependencies in a sequence, can achieve state-of-the-art performance in sequential modeling and allows for parallelization [18]. The techniques to learn the inter-view interactions can be classified into early, late, and hybrid fusion approaches [2]. In the early fusion method, the features of multiple views are concatenated into a single feature before learning. The late fusion method first learns separate models from individual views and then combines them through averaging or voting to make the final prediction. A hybrid approach attempts to take advantage of both of the above methods. However, the prerequisite of an early or hybrid approach is the alignment of the sequences, which cannot be fulfilled in an asynchronous setting.
To address the above challenges in asynchronous multi-view learning, we develop AMANet (Attention and Memory-Augmented Networks) based on attention and memory mechanisms. Our model consists of three major components for each view: a neural controller, a history attention memory, and a dynamic external memory. The neural controller encodes the input sequence to capture the intra-view interactions, relying on the self-attention mechanism. The inter-attention mechanism is introduced to learn the inter-view interactions. The self-attention (or intra-attention) mechanism can associate different positions within a single sequence; similarly, the inter-attention relates positions across two sequences. The vectors from both self-attention and inter-attention are then concatenated together as the encoding vector of the sequence. The history attention memory stores all previous encoding vectors of the same object. The dynamic external memory is shared by all objects in training and stores the common knowledge from the data. Lastly, the encoding vector from the controller, the read vector from the dynamic external memory, and the historical attention vector from the history attention memory are concatenated together for the prediction task.

We have conducted experiments on three tasks: medication recommendation from a patient's medical records, including previous and current diagnoses and medical procedures; DRG classification from a current hospital record with diagnosis and procedure sequences; and invoice fraud detection from a company's monthly tax declaration and invoice sequences. In all tasks, our model is able to outperform baselines and other state-of-the-art models. The rest of the paper is organized as follows. We review related works and techniques in section 2 and introduce the formulation and architecture of our model in section 3. We evaluate our model on the three tasks introduced above in section 4, where we also perform ablation studies to investigate the importance of each module in our model. In section 5, we conclude our work and highlight future directions.

2 RELATED WORKS

Multi-view learning is a rapidly growing area of machine learning, and recent trends have been reviewed thoroughly [2, 21, 26]. Several deep learning architectures have been developed to extend multi-view learning to sequential datasets, such as multi-view LSTM [15], memory fusion network (MFN) [24], and dual memory neural computer [12]. Here we emphasize the works with attention mechanisms and memory augmentation methods.

The attention mechanism was initially introduced in machine translation to jointly translate and align words [1]. It has become an integral part of RNN-based sequential modeling to abstract the most relevant information of a sequence. The attention mechanism also offers interpretability of deep neural networks, which is one of its most intriguing features. In healthcare studies, the attention mechanism is adopted in many neural network architectures, such as Dipole [14], RETAIN [4], and RAIM [22], to achieve better predictive power and more effective result interpretation. For example, RAIM integrates continuous monitoring data and discrete clinical events using guided multi-channel attention for two prediction tasks. It has shown excellent performance in prediction and meaningful interpretation in quantitative analysis.

Recently, the transformer, solely based on the self-attention mechanism, has drawn great attention due to its computational efficiency without sacrificing performance compared to RNNs [18]. For instance, the SAnD (Simple Attend and Diagnose) model adapts the self-attention mechanism as in the transformer for multiple diagnosis tasks [17]. In our work, the self-attention layer from the transformer is used for intra-view learning. Moreover, we propose an inter-attention layer by modifying the self-attention layer to capture inter-view interactions. The attention mechanism is also used to retrieve the historical attention vector from the history attention memory.

Memory-augmented neural networks (MANN), such as memory networks [20] and differentiable neural computers (DNC) [6], use external memory as a dynamic knowledge base to store long-term dependencies. The external memory can be read and written with attentional processes and has been widely used. Graph augmented memory networks (GAMENet) [16] integrates MANN with graph convolutional networks (GCN) [11] to embed multiple knowledge graphs. DMNC [12] is a first attempt to build a MANN for asynchronous dual-view sequential learning. GAMENet and DMNC have been employed for the medication recommendation task and serve as baselines in this work. In our work, the dynamic external memory from DNC is used to store the common knowledge of each view.

3 METHOD

3.1 Problem Formulation

In this work, we use dual-view sequences to introduce our model. The goal of dual-view sequential learning is to predict the target $y$ given two input sequences $X^1$ and $X^2$. Let $S_i = (X_i^1, X_i^2, y_i)$ be the sample with index $i$, where $X_i^1 = \{x_{i1}^1, x_{i2}^1, \ldots, x_{iL_i^1}^1\}$, $X_i^2 = \{x_{i1}^2, x_{i2}^2, \ldots, x_{iL_i^2}^2\}$, and $y_i$ can be a scalar or a vector depending on the prediction task. $L_i^1$ and $L_i^2$ are the lengths of $X_i^1$ and $X_i^2$, respectively. The length of an input sequence is not only view-type dependent but also sample dependent. The $x_{ik}^1$ and $x_{ik}^2$ in the sequences are the initial representations of the input.

In the medication recommendation task, we predict the medication set given previous and current diagnoses and procedures. The current diagnoses and procedures of a patient can be formulated as two sequences for dual-view sequential learning. In the MIMIC-III dataset [9], the diagnoses, medical procedures, and medications are represented as standard medical codes with code space sizes $d_{x^1}$, $d_{x^2}$, and $d_y$, respectively. Therefore, we use one-hot vectors $x_{ik}^1 \in \{0,1\}^{d_{x^1}}$, $x_{ik}^2 \in \{0,1\}^{d_{x^2}}$, and $y_i \in \{0,1\}^{d_y}$ to represent diagnoses, procedures, and medications, respectively. The previous medical records are incorporated with the history attention memory module. This is a multi-label classification task with $d_y$ binary labels.

In the DRG classification task, we predict the DRG according to the diagnosis sequence and procedure sequence from the patient's medical record of the current visit. This is a multi-class classification task with $d_y$ categories.

In the invoice fraud detection task, we identify companies involved in invoice fraud through their historical invoice and taxation declaration behaviors. The daily invoice and monthly taxation declaration values are formulated as two asynchronous sequences with different granularity. This is a binary classification problem.
3.2 Network Architecture

Our model is named Attention and Memory-Augmented Networks (AMANet)¹, and Figure 1 shows AMANet in the dual-view setting. In this section, we describe the modules of our model in detail.

¹The pseudo-code is in the appendix, and the code is at https://github.com/non-name-2020/AMANet.

Token Embedding. Given two input sequences $X_i^1$ and $X_i^2$, for each token in the sequence, the one-hot vector representations are formulated as $M_i^1 \in \{0,1\}^{L_i^1 \times d_{x^1}}$ and $M_i^2 \in \{0,1\}^{L_i^2 \times d_{x^2}}$, respectively. We derive the embedding of the input matrix as

$$TE_i^1 = M_i^1 W_E^1 \quad (1)$$
$$TE_i^2 = M_i^2 W_E^2 \quad (2)$$

where $W_E^1 \in \mathbb{R}^{d_{x^1} \times d}$ and $W_E^2 \in \mathbb{R}^{d_{x^2} \times d}$ are the embedding matrices and $d$ is the dimension of the embedding.

Positional Encoding. Since the neural controller uses solely self-attention instead of RNNs, there is no recurrence and the positional information is lost. In order to incorporate the order of the sequence, positional encoding is introduced to integrate with the input embeddings [18]. We use the same positional encoding as in the transformer [18]:

$$PE_{(pos,\,2j)} = \sin\left(\frac{pos}{10000^{2j/d}}\right) \quad (3)$$
$$PE_{(pos,\,2j+1)} = \cos\left(\frac{pos}{10000^{2j/d}}\right) \quad (4)$$

where $pos$ is the position index in the sequence, and $2j$ and $2j+1$ are the even and odd indexes of the embedding vector, respectively. The sequence embedding vector is calculated by summing the token embedding and the positional encoding:

$$E_i^1 = TE_i^1 + PE_i^1 \in \mathbb{R}^{L_i^1 \times d} \quad (5)$$
$$E_i^2 = TE_i^2 + PE_i^2 \in \mathbb{R}^{L_i^2 \times d} \quad (6)$$
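A minimal NumPy sketch of Equations (3)-(6) (our own illustration of the transformer-style encoding; the variable names and the random stand-in token embedding are assumptions, not the released implementation):

```python
import numpy as np

def positional_encoding(length: int, d: int) -> np.ndarray:
    """Sinusoidal encoding: PE[pos, 2j] = sin(pos / 10000^(2j/d)), Eqs. (3)-(4)."""
    pos = np.arange(length)[:, None]            # (L, 1) position indexes
    j2 = np.arange(0, d, 2)                     # even dimensions 2j
    angles = pos / np.power(10000.0, j2 / d)    # (L, d/2)
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angles)                # even indexes
    pe[:, 1::2] = np.cos(angles)                # odd indexes
    return pe

# Eqs. (5)-(6): sequence embedding = token embedding + positional encoding
L, d = 7, 100
TE = np.random.randn(L, d)            # stand-in for M W_E from Eqs. (1)-(2)
E = TE + positional_encoding(L, d)    # (L, d)
```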
Multi-Head Self-Attention. The neural controller performs encoding on the embedding vector with the self-attention mechanism. In general, an attention function maps a query and a set of key-value pairs to an output, where all of them are vectors. The output is computed as the weighted sum of the values, where the weight for each value is the inner product of the query and the keys with a softmax normalization. In self-attention, the queries $Q$, keys $K$, and values $V$ are linear projections from the same input embedding matrix [18]. For each view, the self-attention function is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (7)$$

where $d_k$ is the dimension of $Q$ and $K$. The inner product of $Q$ and $K$ is scaled by $\frac{1}{\sqrt{d_k}}$ in the transformer to avoid small gradients due to large values in the dot product.

Here we use multi-head attention to encode the input sequence. The input embedding vector is projected linearly into $h$ subspaces with a set of matrices $\{Q_m, K_m, V_m\}$ for each subspace, where the queries $Q_m = E W_m^Q$, $W_m^Q \in \mathbb{R}^{d \times d_k}$, keys $K_m = E W_m^K$, $W_m^K \in \mathbb{R}^{d \times d_k}$, and values $V_m = E W_m^V$, $W_m^V \in \mathbb{R}^{d \times d_v}$. Each attention head is then computed by applying the attention function on the projected matrices in each subspace. Finally, all attention heads are concatenated together and projected to be the intra-encoding vectors:

$$E_{intra} = \mathrm{MultiHead}(E) = \mathrm{Concat}(head_1, \ldots, head_h) W_{intra} \quad (8)$$

where $head_m = \mathrm{Attention}(Q_m, K_m, V_m)$, $W_{intra} \in \mathbb{R}^{h d_v \times d_{intra}}$, and $m$ is the index of the head. We set $d_k = d_v = d_{intra}/h$. Given two input sequence embeddings $E^1$ and $E^2$, the intra-encoding vectors are calculated as $E^1_{intra} = \mathrm{MultiHead}(E^1)$ and $E^2_{intra} = \mathrm{MultiHead}(E^2)$. The intra-encoding vectors will be used for the dynamic external memory.

The self-attention module is also known as intra-attention. Its purpose is to capture the relationships between different elements in the same sequence.
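The following NumPy sketch mirrors Equations (7) and (8) under assumed shapes (masking, dropout, and parameter initialization are omitted; it is an illustration, not the released code):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (7)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V

def multi_head(E_q, E_kv, Wq, Wk, Wv, Wo):
    """Multi-head attention, Eqs. (8)/(10): h heads, concatenated and projected.
    Wq/Wk/Wv are per-head projection lists; Wo plays the role of W_intra/W_inter."""
    heads = [attention(E_q @ wq, E_kv @ wk, E_kv @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# self-attention: queries, keys, and values all come from the same embedding E
d, dk, h, L = 100, 16, 4, 7
rnd = np.random.randn
Wq = [rnd(d, dk) for _ in range(h)]
Wk = [rnd(d, dk) for _ in range(h)]
Wv = [rnd(d, dk) for _ in range(h)]
Wo = rnd(h * dk, 64)
E = rnd(L, d)
E_intra = multi_head(E, E, Wq, Wk, Wv, Wo)       # (L, 64)
```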
Dynamic External Memory. The intra-encoding vector from the neural controller is mapped to the interface vector $\xi$ [6], followed by a global average pooling operator:

$$\xi = \mathrm{pool}(E_{intra} W_\xi) \quad (9)$$

where $W_\xi$ is a trainable matrix and pool is the global average pooling operator. The interface vector $\xi$ is subdivided into several sections for reading from and writing to the memory matrix $M$. We use the same approach as in DNC [6], and more details can be found in the original work.

In the original paper, each token in the input sequence reads and writes memory once. The difference in our work is that we read and write memory at sample granularity. We think this method is more in line with the update frequency of global knowledge, and it is faster.

Each view has its own memory matrix, as in Figure 1. Two read vectors $r^1$ and $r^2$ are retrieved from the two memory matrices $M^1$ and $M^2$, respectively, and both read vectors are used for the classification layer.

We use the dynamic external memory module to store the global knowledge extracted from the dataset, and all samples can read from and write to it. We believe that domain knowledge exists in the dataset; therefore, we adapt this module for domain knowledge learning.
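Equation (9) reduces to a pooled linear projection; a minimal sketch is given below (the interface size 38 is an arbitrary stand-in, and the DNC read/write machinery driven by $\xi$ follows [6] and is not reproduced here):

```python
import numpy as np

def interface_vector(E_intra: np.ndarray, W_xi: np.ndarray) -> np.ndarray:
    """Eq. (9): project to the DNC interface space, then global average pooling
    over positions, yielding one interface vector per sample."""
    return (E_intra @ W_xi).mean(axis=0)

xi = interface_vector(np.random.randn(7, 64), np.random.randn(64, 38))
# xi is then split into read keys, write key, gates, etc., as in DNC [6]
```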
Inter-Attention. In self-attention, the query, key, and value are projected from the same input embedding to capture the intra-view interactions. In inter-attention, the query is projected from one input embedding, while the key and value are projected from the other input embedding. For instance, given two input sequence embeddings $E^1$ and $E^2$, the inter-encoding vectors for $E^1$ from $E^2$ are calculated as

$$E^1_{inter} = \mathrm{MultiHead}(E^1, E^2) = \mathrm{Concat}(head_1, \ldots, head_h) W_{inter} \quad (10)$$

where $head_m = \mathrm{Attention}(Q^1_m, K^2_m, V^2_m)$, $Q^1_m = E^1 W_m^{Q^1}$, $K^2_m = E^2 W_m^{K^2}$, and $V^2_m = E^2 W_m^{V^2}$. The inter-encoding vectors for $E^2$ from $E^1$ can be calculated as $E^2_{inter} = \mathrm{MultiHead}(E^2, E^1)$.

The main purpose of this module is to capture the relationships between elements from different sequences, thereby enhancing the representation ability.
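Continuing the multi_head sketch above, inter-attention changes only where the projections come from: the query from one view, the key and value from the other (in practice each direction would have its own projection matrices; reusing them here keeps the illustration short):

```python
# Eq. (10): inter-encoding for view 1 attended over view 2, and vice versa;
# asynchronous lengths are fine because only K and V lengths must match each other
E1, E2 = rnd(7, d), rnd(4, d)
E1_inter = multi_head(E1, E2, Wq, Wk, Wv, Wo)    # (7, 64)
E2_inter = multi_head(E2, E1, Wq, Wk, Wv, Wo)    # (4, 64)
```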
Figure 1: Attention and Memory-Augmented Networks for Dual Sequences. In the history attention memory module, $t$ denotes the time index of an object's sample at time $t$. If an object has only one sample, there is no history attention memory.
Encoding Vector. The final encoding vectors of a view in dual-view sequential data are calculated by applying the global average pooling operator on the intra-encoding and inter-encoding vectors, respectively, and then concatenating them together:

$$e^1 = \mathrm{Concat}(\mathrm{pool}(E^1_{intra}), \mathrm{pool}(E^1_{inter})) \quad (11)$$
$$e^2 = \mathrm{Concat}(\mathrm{pool}(E^2_{intra}), \mathrm{pool}(E^2_{inter})) \quad (12)$$

where pool is the global average pooling operator. The final encoding vectors $e^1$ and $e^2$ flow into two components: 1) they are directly used for the classification layer; 2) they are saved to the history attention memory.

History Attention Memory. The history attention memory is designed to capture an object's historical contribution to the current time step $t$. It stores the historical representations, i.e., the encoding vectors, of an object. Given the history attention memory $H_k = \{e_{k1}, e_{k2}, \ldots, e_{k(t-1)}\}$ for the $k$-th object with all previous time steps in a single view, the attended historical representation $h_k$ is calculated as the weighted sum of the historical embedding vectors:

$$h_k = \sum_{i=1}^{t-1} \alpha_{ki} e_{ki} \quad (13)$$

The weights $\alpha_k = \{\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{k(t-1)}\}$ are obtained by

$$\alpha_{ki} = \mathrm{softmax}(\mu_{ki}) = \frac{e^{\mu_{ki}^T w_c}}{\sum_j e^{\mu_{kj}^T w_c}} \quad (14)$$

where $\mu_{ki} = \tanh(W_h e_{ki} + b_h)$, $i \in \{1, 2, \ldots, t-1\}$, and $w_c$ is the trainable history context vector [23]. Each view has its own history attention memory, as in Figure 1. Two historical attention vectors $h^1$ and $h^2$ are retrieved from the two history attention memories, respectively, and they are used for the classification layer.

The history attention memory module is designed for tasks where an object has history samples. Its purpose is to store the historical representation information of the object, which we call local knowledge. In the medication recommendation task, for instance, the medication of the current visit is related not only to the current diagnosis and procedure sequences but also to medical records from previous visits.
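A NumPy sketch of Equations (13)-(14), attending over the stored encodings of one object (the shapes and names are illustrative assumptions):

```python
import numpy as np

def history_attention(H, W_h, b_h, w_c):
    """H: (t-1, de) matrix of stored encodings e_k1..e_k(t-1) for one object.
    Returns h_k, the attention-weighted sum of the history, Eqs. (13)-(14)."""
    mu = np.tanh(H @ W_h + b_h)          # (t-1, dh), mu_ki = tanh(W_h e_ki + b_h)
    scores = mu @ w_c                    # (t-1,) relevance to the context vector
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax attention weights, Eq. (14)
    return alpha @ H                     # (de,) weighted history, Eq. (13)

de, dh = 128, 64
H = np.random.randn(5, de)               # encodings of five previous visits
h_k = history_attention(H, np.random.randn(de, dh),
                        np.random.randn(dh), np.random.randn(dh))
```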
Classification Layer. The encoding vectors, read vectors, and historical attention vectors from the two views are concatenated together as the input for the final classification:

$$\hat{y} = \sigma(\mathrm{Concat}(e^1, e^2, r^1, r^2, h^1, h^2)) \quad (15)$$

where $\sigma$ is the sigmoid function for binary classification and multi-label classification, and the softmax function for multi-class classification.
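The classification layer is a single projection over the concatenation; a minimal sketch of Eq. (15) follows (the dense projection W_out is our addition for illustration, matching the classify step in the appendix pseudo-code, since the concatenation must be mapped to $d_y$ outputs before $\sigma$):

```python
import numpy as np

def classify(e1, e2, r1, r2, h1, h2, W_out, multi_class=False):
    """Eq. (15): concatenate the six vectors and project to d_y outputs."""
    z = np.concatenate([e1, e2, r1, r2, h1, h2]) @ W_out
    if multi_class:
        p = np.exp(z - z.max())
        return p / p.sum()                    # softmax for multi-class
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid for binary / multi-label

d_y = 151                                     # e.g., medication label space
parts = [np.random.randn(128) for _ in range(6)]
y_hat = classify(*parts, W_out=np.random.randn(6 * 128, d_y))
```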
3.3 Loss Function

Cross-entropy loss is widely used for classification. However, class imbalance can lead to inefficiency in training and affect the performance of the model. To deal with class imbalance, we use focal loss [13], which is a modified version of cross entropy, as our loss function. Given the true class $y_i \in \{0,1\}$ and the predicted probability $\hat{y}_i \in [0,1]$, the focal loss for binary classification is defined as

$$L_i = -\alpha y_i (1-\hat{y}_i)^\beta \log \hat{y}_i - (1-\alpha)(1-y_i)\, \hat{y}_i^\beta \log(1-\hat{y}_i) \quad (16)$$

where $\alpha$ is the weighting factor to balance the importance of positive and negative samples, and $\beta$ is the focusing parameter to down-weight the well-classified samples. Focal loss therefore emphasizes training on hard samples by reducing the loss contribution from easy samples. The focal loss for multi-label classification is

$$L_i = -\sum_{j=1}^{d_y} \left[ \alpha y_{ij} (1-\hat{y}_{ij})^\beta \log \hat{y}_{ij} + (1-\alpha)(1-y_{ij})\, \hat{y}_{ij}^\beta \log(1-\hat{y}_{ij}) \right] \quad (17)$$

where $d_y$ is the total number of labels, $y_i = \{y_{i1}, y_{i2}, \ldots, y_{id_y}\}$, $y_{ij} \in \{0,1\}$, $\hat{y}_i = \{\hat{y}_{i1}, \hat{y}_{i2}, \ldots, \hat{y}_{id_y}\}$, and $\hat{y}_{ij} \in [0,1]$. We set $\alpha = 0.6$ and $\beta = 2$ in this study, as recommended by [13].

Since each category $c_i$ would need a specific weight $\alpha_i$, it is complicated to apply the focal loss to multi-class classification. Therefore, the original cross-entropy loss is used for multi-class classification.
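A NumPy sketch of the multi-label focal loss in Eq. (17) with the recommended $\alpha = 0.6$ and $\beta = 2$ (the clipping for numerical stability is our addition):

```python
import numpy as np

def focal_loss(y, y_hat, alpha=0.6, beta=2.0, eps=1e-7):
    """Multi-label focal loss, Eq. (17); y and y_hat are (d_y,) arrays."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)    # avoid log(0)
    pos = alpha * y * (1.0 - y_hat) ** beta * np.log(y_hat)
    neg = (1.0 - alpha) * (1.0 - y) * y_hat ** beta * np.log(1.0 - y_hat)
    return -(pos + neg).sum()

# a well-classified positive contributes far less than a hard one
print(focal_loss(np.array([1.0]), np.array([0.95])))   # ~0.0001
print(focal_loss(np.array([1.0]), np.array([0.30])))   # ~0.35
```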
4 EXPERIMENTS

In this section, we first compare the AMANet model to baselines and other state-of-the-art models on three tasks: medication recommendation, diagnosis-related group (DRG) classification, and invoice fraud detection. Then, we report the evaluation of each module of our model in the ablation study.

4.1 Medication Recommendation

The increasing volume of EHRs provides a great opportunity to improve the efficiency and quality of healthcare by developing predictive models. One critical task is to predict medications based on the medical records [8]. In this paper, we formulate medication recommendation as a multi-label classification problem that recommends a medicine combination based on diagnoses and medical procedures. The diagnoses and medical procedures are ordered within a hospital visit. Therefore, we treat the diagnoses and procedures in the current visit as two sequential views and apply AMANet to predict the medication combinations.

Dataset. We use the medical data from the MIMIC-III database (https://mimic.physionet.org/), which comprises rich medical information relating to patients within intensive care units (ICUs) [9]. We follow a procedure similar to GAMENet to process the medical codes in this experiment [16]. The NDC drug codes in MIMIC-III are mapped to third-level ATC codes as prediction labels. The statistics of the dataset are summarized in Table 1.

Table 1: Statistics of MIMIC-III Dataset

                                    Count
patients                            6350
visits                              15032
diagnosis categories                1958
procedure categories                1430
medication categories               151

                                    Mean    Max   Min
count of visits                     2.37    29    1
length of diagnosis sequence        13.63   39    1
length of procedure sequence        4.54    32    1
count of medication combinations    19.86   55    1

Baselines. We compare our model with the following baseline and state-of-the-art algorithms.

• Nearest: This algorithm uses the medication combinations from the previous visit as the current visit's prediction.
• Logistic regression (LR): Logistic regression with L2 regularization is evaluated as a baseline model. The input is the multi-hot vector of diagnoses and procedures from the current visit.
• LEAP: LEAP (Learn to Prescribe) [25] uses a recurrent decoder to model label dependency and content-based attention to model the label-to-instance mapping for medicine recommendation.
• RETAIN: RETAIN (Reverse Time Attention) [4] is an interpretable model with a reverse time attention mechanism that mimics a physician's practice; recent visits are therefore likely to receive high attention.
• DMNC: DMNC [12] uses a dual memory neural computer for the medication recommendation task.
• GAMENet: GAMENet [16] recommends medications through a graph augmented memory module, whose memory bank is based on drug usage and drug-drug interactions (DDI), and a dynamic memory based on the patient's history. We performed hyperparameter searches for GAMENet, for example over the learning rate and embedding size, and our result for GAMENet is better than the performance reported in [16] for the medication recommendation task.

Evaluation Metrics. In order to measure model performance, we use Jaccard similarity, F1-score, and the area under the Precision-Recall curve (PRAUC) as the evaluation metrics, all of which are macro-averaged. We randomly divide the dataset into training, validation, and test sets with ratio 4:1:1 and report the performance on the test set.

Implementation Details. For our method, both the token embedding size and the position embedding size are 100. Both the self-attention size and the inter-attention size are 64, with 4 heads. The dynamic external memory is a 256×64 matrix with 4 read heads. The $\alpha$ and $\beta$ of the focal loss are 0.6 and 2, respectively.
We train the model using the Adam [10] optimizer at learning rate 0.0001 and stop at 15 epochs. The other models are evaluated in the same way as in GAMENet [16]. All models are trained on a macOS Mojave system with a 2.2 GHz Intel Core i7 and 16 GB of memory.

Results. Table 2 summarizes our results, and we can see that the AMANet model outperforms the other models on all metrics. For Jaccard similarity and F1 score, our model achieves about a 0.04-0.05 improvement over the next-best model (GAMENet without DDI). AMANet also scores about 2% better than the next-best model, LR, on the PRAUC metric. When the IA, DEM, and HAM components are removed individually, there are different levels of performance decline. Removing IA leads to the largest performance drop among the three components. We can conclude from these comparative experiments that each component is useful, especially the IA component.

Table 2: Experiments on Medication Recommendation

Method                   Jaccard   F1       PRAUC
Nearest                  0.2548    0.3398   0.3187
LR                       0.4579    0.6177   0.7602
DMNC                     0.4231    0.5849   0.6418
LEAP                     0.4365    0.5997   0.6367
RETAIN                   0.4647    0.6269   0.7336
GAMENet                  0.4551    0.6172   0.7275
GAMENet (w/o DDI)        0.4796    0.6390   0.7484
AMANet (w/o IA)          0.5067    0.6641   0.7629
AMANet (w/o DEM, HAM)    0.5136    0.6706   0.7710
AMANet (w/o DEM)         0.5201    0.6757   0.7716
AMANet (w/o HAM)         0.5200    0.6761   0.7746
AMANet                   0.5259    0.6809   0.7772
4.2 DRG Classification

Diagnosis-related groups (DRGs) have been one of the most popular prospective payment systems in many countries. The basic idea of the DRG system is that patients admitted to a hospital are classified into a limited number of DRGs. Patients in each DRG share similar clinical and demographic patterns and are likely to consume similar amounts of hospital resources. The hospitals are reimbursed a set fee for the treatment in a single DRG category, regardless of the actual cost for an individual patient. Therefore, DRGs can put pressure on hospitals to control costs, reduce the length of stay, and increase the number of admitted inpatients. The DRG for each patient is assigned at the time of discharge [19]. A previous study has shown that accurately classifying a patient's DRG at the early stage of a hospital visit can allow the hospital to effectively allocate limited hospital resources [5]. In this study, we extract the diagnosis sequence and procedure sequence from EHRs as input and apply AMANet to predict the DRG. Since DRGs are determined solely by the current visit, we use AMANet without the history attention memory module in this task.

Dataset. The dataset for this task is from a Grade Three Class B hospital in China. It contains 21356 hospital records from 2019. Each hospital record consists of two sequences: a diagnosis sequence and a procedure sequence. The total number of DRG categories is 270. The statistics of the dataset are summarized in Table 3.

Table 3: Statistics of DRG Dataset

                                Count
records                         21356
diagnosis categories            3052
procedure categories            1177
DRG categories                  270

                                Mean   Max   Min
length of diagnosis sequence    3.85   11    1
length of procedure sequence    1.3    5     1

Baselines. LEAP and GAMENet are not suitable for this task, because LEAP outputs sequences and GAMENet is developed for the medication recommendation purpose. We list the baseline models for this task as follows:

• LR, RETAIN, and DMNC are used as baseline algorithms, similar to the medication recommendation task.
• Dual-LSTM: We add a Dual-LSTM model, which encodes the two input sequences using separate LSTMs and then combines the two encoding vectors for the final classification layer.

Evaluation Metrics. We use accuracy, F1-score, and PRAUC as the evaluation metrics for this task. The dataset is randomly divided into training, validation, and test sets with a 4:1:1 ratio, while the ratio of samples from each category is kept the same across the sets. We report the performance on the test set.

Implementation Details. The hyperparameters of our model are the same as in the medication recommendation task except for the learning rate and the number of epochs. We train the model at learning rate 0.0005 and stop at 10 epochs. All models are trained on a macOS Mojave system with a 2.2 GHz Intel Core i7 and 16 GB of memory.

Results. Table 4 indicates that AMANet performs the best compared to the other four models on all three metrics. Compared to DMNC and Dual-LSTM, which are solely based on late fusion, our model achieves about 0.02 better performance for accuracy and F1 score, and 0.04 for PRAUC. We can observe a significant performance drop when removing the IA module, which indicates the importance of IA in the AMANet model for capturing the inter-view interactions and improving the representation ability.

Table 4: Experiments on DRG Classification

Method             Accuracy   F1       PRAUC
LR                 0.7743     0.7458   0.6271
RETAIN             0.7631     0.7483   0.6048
DMNC               0.8473     0.8407   0.7324
Dual-LSTM          0.8525     0.8433   0.7305
AMANet (w/o IA)    0.7497     0.7416   0.6116
AMANet (w/o DEM)   0.8623     0.8612   0.7832
AMANet             0.8712     0.8690   0.7975
4.3 Invoice Fraud Detection

The value-added tax (VAT) is a general tax based on the value added to goods and services in the majority of countries. It is charged as a percentage of the current sale price, deducting the tax paid at the preceding stage. In the VAT collection process, a company provides the invoices from its selling for the current tax calculation and the invoices of its buying from suppliers for the deduction. Some companies are established specifically for invoice fraud, illegally providing selling invoices to other companies without any underlying business, so that the companies buying the invoices can get a better VAT deduction. Identifying companies involved in invoice fraud is a common task for the taxation agency. Here we formulate the invoice fraud detection task as a binary classification problem: identifying companies involved in invoice fraud through historical invoice and tax declaration information. We construct the daily invoice and monthly tax declaration information as two sequential views and apply AMANet to identify companies with abnormal behaviors. Since there is only one sample per company, we use AMANet without the history attention memory module in this task.

Dataset. The dataset is from a taxation agency in China. It contains 8700 companies in total, and 1/3 of them are involved in invoice fraud. For each company, two time-ordered sequences, i.e., daily invoice values and monthly tax declaration values, are used as input. The statistics of the dataset are summarized in Table 5. The invoice values and tax declaration values are continuous, so we discretize the values of each view into 2000 categories using a quantile-based discretization function. Then we use one-hot encoding to represent the invoice and tax declaration values.
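The discretization step can be reproduced with a standard quantile binning function, for example pandas.qcut (a sketch of the preprocessing described above, with synthetic values, not the production pipeline):

```python
import numpy as np
import pandas as pd

# continuous daily invoice values -> 2000 quantile-based categories
values = pd.Series(np.random.lognormal(mean=10.0, sigma=1.0, size=100_000))
codes = pd.qcut(values, q=2000, labels=False, duplicates="drop")
# each integer code in [0, 2000) is then one-hot encoded as a sequence token
print(codes.min(), codes.max())
```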
Table 5: Statistics of Invoice Fraud Dataset

                                  Count
companies                         8700
invoice fraud companies           2900
invoice categories                2000
declaration categories            2000

                                  Mean    Max    Min
length of invoice sequence        56.05   1256   1
length of declaration sequence    27.08   166    1

Baselines. The baseline models for this task are the same as in the DRG classification task, except that the setting changes from multi-class classification to binary classification.

Evaluation Metrics. These are the same as in the DRG classification task.

Implementation Details. These are the same as in the DRG classification task.

Results. Table 6 indicates that AMANet performs the best compared to the other four models on all three metrics. Compared to Dual-LSTM and DMNC, our model achieves about 3% better performance consistently. When the IA component is removed, the performance is not as good as that of DMNC and Dual-LSTM. This demonstrates that the self-attention and inter-attention mechanisms, coupled with memory, can capture the intra-view and inter-view interactions effectively.

Table 6: Experiments on Invoice Fraud Detection

Method             Accuracy   F1       PRAUC
LR                 0.7724     0.7222   0.8839
RETAIN             0.8628     0.8152   0.8910
DMNC               0.8669     0.8323   0.9005
Dual-LSTM          0.8655     0.8352   0.9220
AMANet (w/o IA)    0.8324     0.7805   0.9030
AMANet (w/o DEM)   0.8924     0.8556   0.9467
AMANet             0.8938     0.8623   0.9503

4.4 Hyperparameters

In this subsection, we study the effect of the hyperparameters on our model. Due to limited space, we use the medication recommendation task as an example. We focus on four hyperparameter sets: the embedding size $d$; the learning rate $LR$; the number of heads $NH$ and attention size $AS$; and the $\alpha$ and $\beta$ of the focal loss. We vary the values in each hyperparameter set with the other sets fixed. The results for the different hyperparameter settings are shown in Table 7. Training with focal loss yields more than a 0.01 improvement over cross-entropy loss (i.e., $\alpha = 0.5$ and $\beta = 0$) in Jaccard similarity and F1 score. The differences among the other hyperparameter sets are smaller than 0.006 on all three metrics, which is not significant. Therefore, our proposed model is not sensitive to most of the hyperparameters, and adopting focal loss in training can boost the result significantly.
Table 7: Experiments on Varying Hyperparameters

Hyperparameters              Jaccard   F1       PRAUC
Varying $d$ ($LR=0.0001$, $AS=64$, $NH=4$, $\alpha=0.6$, $\beta=2$)
$d=64$                       0.5249    0.6802   0.7780
$d=100$                      0.5259    0.6809   0.7772
$d=128$                      0.5242    0.6797   0.7759
Varying $LR$ ($d=100$, $AS=64$, $NH=4$, $\alpha=0.6$, $\beta=2$)
$LR=0.0001$                  0.5259    0.6809   0.7772
$LR=0.0002$                  0.5220    0.6774   0.7760
$LR=0.0005$                  0.5255    0.6804   0.7761
$LR=0.001$                   0.5235    0.6788   0.7747
Varying $AS$, $NH$ ($LR=0.0001$, $d=100$, $\alpha=0.6$, $\beta=2$)
$AS=32$, $NH=8$              0.5205    0.6763   0.7755
$AS=32$, $NH=4$              0.5207    0.6764   0.7749
$AS=64$, $NH=4$              0.5259    0.6809   0.7772
$AS=64$, $NH=2$              0.5239    0.6793   0.7779
Varying $\alpha$, $\beta$ ($LR=0.0001$, $d=100$, $AS=64$, $NH=4$)
$\alpha=0.5$, $\beta=0$      0.5056    0.6626   0.7753
$\alpha=0.6$, $\beta=0$      0.5183    0.6741   0.7756
$\alpha=0.6$, $\beta=1.5$    0.5248    0.6798   0.7791
$\alpha=0.6$, $\beta=2.0$    0.5259    0.6809   0.7772
$\alpha=0.6$, $\beta=2.5$    0.5198    0.6759   0.7722
$\alpha=0.6$, $\beta=3.0$    0.5208    0.6766   0.7721
$\alpha=0.7$, $\beta=2.5$    0.5218    0.6777   0.7763
$\alpha=0.7$, $\beta=3.0$    0.5241    0.6797   0.7763

4.5 Ablation Study

We perform ablation studies on our model by removing specific modules one at a time to explore their relative importance. The three modules evaluated in this section are inter-attention (IA), dynamic external memory (DEM), and history attention memory (HAM). As we can see from Table 2, Table 4, and Table 6, the performance degradation in all ablation experiments indicates that all three modules of our model are necessary.

Inter-Attention. We design the inter-attention to capture the inter-view interactions across views by relating one input embedding to another input embedding and aligning different sequences. The inter-view encoding for sequence A from sequence B can be viewed as softly aligning sequence A to sequence B. First, a relevance matrix is calculated between A and B, where each value in the matrix represents the relevance between one position in A and another position in B. Then the encoding vector of each position in A is computed as the relevance-weighted summation over all positions in B. Removing inter-attention leads to the largest performance drop across all tasks and all three metrics compared to the memory modules, which indicates the crucial role of inter-attention in our model.

In the medication recommendation task, all three metrics drop by around 0.02 when the IA component is excluded. The removal of IA leads to a dramatic performance drop of over 0.12 in DRG classification. In the invoice fraud detection task, using the AMANet model without IA, the accuracy, F1 score, and PRAUC drop by 0.061, 0.082, and 0.047, respectively. The empirical results from the three tasks suggest that integrating information from two views is beneficial for outcome prediction in certain tasks and that our designed inter-attention mechanism can capture the inter-view interactions effectively in an asynchronous setting.

Moreover, we observe improved medical code representations with the inter-attention module. We demonstrate this finding with an example from the DRG classification task in Figure 2. In this example, a patient with heart disease is treated with a sequence of medical procedures. All procedure codes and the diagnosis codes d1, d2, and d4 are heart disease related. Procedure code p3 is a diagnostic operation, which differs from the other, therapeutic operations here. In addition, the hepatic cyst (d7) is not treated, and it should not be related to any procedure codes. Figure 2(a) is well aligned with the medical knowledge above. However, the medical code representations from AMANet without the IA module are not consistent with the medical knowledge, as shown in Figure 2(b). Therefore, we can conclude that the inter-attention module contributes to the representation learning across code sets.

Figure 2: The cosine similarity between a diagnosis sequence (vertical axis) and a procedure sequence (horizontal axis). Each block is colored by the similarity value. Positive cosine similarity values are colored in red, while negative similarity values are colored in white. (a) AMANet with IA module; (b) AMANet without IA module.

Memory. We incorporate two memory modules in our model to store global and local information, using dynamic external memory and history attention memory, respectively. In the medication recommendation task, the dynamic external memory and history attention memory contribute similarly to the overall prediction on all three metrics. In all three tasks, the performance is reduced when a memory component is removed. Both modules are necessary based on the empirical experimental results.

5 CONCLUSIONS

In this work, we propose a novel architecture, AMANet, based on attention and memory for dual asynchronous sequential learning. The new model learns the intra-view interactions and inter-view interactions through the self-attention and inter-attention mechanisms, respectively. We use dynamic external memory to represent the common knowledge from the data. We also introduce a history attention memory module to store the historical representations of the same object. We evaluate our model on the medication recommendation task, the DRG classification task, and invoice fraud detection, with superior performance compared to other baseline and state-of-the-art models. The ablation study emphasizes the importance of modeling inter-view interactions and shows that the inter-attention module can model the inter-view interactions effectively. The AMANet framework is formulated using dual-view sequences in this study, but it can be applied to multi-view sequences directly. In the future, we will employ this framework to study multi-view sequential learning problems.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful comments. This work is supported by the Department of Data Intelligence Products of Alibaba Cloud in Alibaba Group.
REFERENCES
[1] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).
[2] T. Baltrušaitis, C. Ahuja, and L. P. Morency. 2019. Multimodal machine learning: A survey and taxonomy. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 41(2). 423–443.
[3] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[4] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. 2016. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems (NeurIPS). 3504–3512.
[5] Daniel Gartner, Rainer Kolisch, Daniel B Neill, and Rema Padman. 2015. Machine learning approaches for early DRG classification and resource allocation. INFORMS Journal on Computing 27, 4 (2015), 718–734.
[6] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, and et al. 2016. Hybrid computing using a neural network with dynamic external memory. In Nature, Vol. 538(7626). 471.
[7] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. In Neural Computation, Vol. 9(8). 1735–1780.
[8] B. Jin, H. Yang, L. Sun, C. Liu, Y. Qu, and J. Tong. 2018. A Treatment Engine by Predicting Next-Period Prescriptions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1608–1616.
[9] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, and et al. 2016. MIMIC-III, a freely accessible critical care database. In Scientific Data, Vol. (3). 160035.
[10] D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[11] T. N. Kipf and M. Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR).
[12] H. Le, T. Tran, and S. Venkatesh. 2018. Dual memory neural computer for asynchronous two-view sequential learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1637–1645.
[13] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2980–2988.
[14] F. Ma, R. Chitta, J. Zhou, Q. You, T. Sun, and J. Gao. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1903–1911.
[15] S. S. Rajagopalan, L. P. Morency, T. Baltrusaitis, and R. Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision (ECCV). 338–353.
[16] J. Shang, C. Xiao, T. Ma, H. Li, and J. Sun. 2019. Graph augmented memory networks for recommending medication combination. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). 1126–1133.
[17] H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias. 2018. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 4091–4098.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, and et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). 5998–6008.
[19] Zhaoxin Wang, Rui Liu, Ping Li, and Chenghua Jiang. 2014. Exploring the transition to DRGs in developing countries: a case study in Shanghai, China. Pakistan Journal of Medical Sciences 30, 2 (2014), 250.
[20] J. Weston, S. Chopra, and A. Bordes. 2015. Memory networks. In 3rd International Conference on Learning Representations (ICLR).
[21] C. Xu, D. Tao, and C. Xu. 2013. A survey on multi-view learning. In ArXiv. arXiv:1304.5634.
[22] Y. Xu, S. Biswal, S. R. Deshpande, K. O. Maher, and J. Sun. 2018. RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 2565–2573.
[23] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 1480–1489.
[24] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. P. Morency. 2018. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 5634–5641.
[25] Y. Zhang, R. Chen, J. Tang, W. F. Stewart, and J. Sun. 2017. LEAP: Learning to prescribe effective and safe treatment combinations for multimorbidity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1315–1324.
[26] J. Zhao, X. Xie, X. Xu, and S. Sun. 2017. Multi-view learning overview: Recent progress and new challenges. In Information Fusion, Vol. 38. 43–54.
A PSEUDO-CODE

The pseudo-code of our model is given as Algorithm 1 and Algorithm 2. Algorithm 1 is applicable to the medication recommendation task; Algorithm 2 is applicable to the DRG classification and invoice fraud detection tasks. The code is open-sourced at https://github.com/non-name-2020/AMANet. The dataset for the medication recommendation task is available at https://mimic.physionet.org. We will also open the datasets of the DRG classification and invoice fraud detection tasks.

Algorithm 1 Training AMANet

Input: $LR$, $d$, $AS$, $NH$, $\alpha$, $\beta$, $epoch$, sequences $X^1$, sequences $X^2$, and labels $y$
Output: Jaccard, F1, PRAUC
1: procedure training
2:   Initialize $W_E^{\{1,2\}}$ (two token embedding matrices), $W_\xi^{\{1,2\}}$ (two DEM interface matrices), $W_{DEM}^{\{1,2\}}$ (two DEM content matrices), $W_h^{\{1,2\}}$, $b_h^{\{1,2\}}$ (two HAM matrices and two HAM vectors), $w_c^{\{1,2\}}$ (two history context vectors)
3:   for $i = 1 \to epoch$ do
4:     for $object_i$ in training set do
5:       for $visit_j$ in $object_i$'s visits do
6:         $X_{ij}^1, X_{ij}^2, y_{ij} \leftarrow visit_j$
7:         $M_{ij}^{\{1,2\}} \leftarrow$ one-hot encoding of $X_{ij}^{\{1,2\}}$
8:         $TE_{ij}^{\{1,2\}} \leftarrow M_{ij}^{\{1,2\}} \times W_E^{\{1,2\}}$
9:         $PE_{ij}^{\{1,2\}} \leftarrow$ positional encoding of $X_{ij}^{\{1,2\}}$
10:        $E_{ij}^{\{1,2\}} \leftarrow TE_{ij}^{\{1,2\}} + PE_{ij}^{\{1,2\}} \in \mathbb{R}^{L_{ij}^{\{1,2\}} \times d}$
11:        $E_{intra,ij}^{\{1,2\}} \leftarrow \mathrm{MultiHead}(E_{ij}^{\{1,2\}})$
12:        $\xi_{ij}^{\{1,2\}} \leftarrow \mathrm{pool}(E_{intra,ij}^{\{1,2\}} \times W_\xi^{\{1,2\}})$
13:        update $W_{DEM}^{\{1,2\}}$ by $\xi_{ij}^{\{1,2\}}$
14:        $r_{ij}^{\{1,2\}}$ read from $W_{DEM}^{\{1,2\}}$
15:        $E_{inter,ij}^1 \leftarrow \mathrm{MultiHead}(E_{ij}^1, E_{ij}^2)$
16:        $E_{inter,ij}^2 \leftarrow \mathrm{MultiHead}(E_{ij}^2, E_{ij}^1)$
17:        $e_{ij}^{\{1,2\}} \leftarrow \mathrm{concat}(\mathrm{pool}(E_{intra,ij}^{\{1,2\}}), \mathrm{pool}(E_{inter,ij}^{\{1,2\}}))$
18:        save $e_{ij}^{\{1,2\}}$ to HAM
19:        $h_{ij}^{\{1,2\}} \leftarrow \sum_{k=1}^{j-1} \alpha_{ik}^{\{1,2\}} e_{ik}^{\{1,2\}}$
20:        where $\alpha_{ik}^{\{1,2\}} = e^{\mu_{ik}^{\{1,2\}T} w_c^{\{1,2\}}} / \sum_l e^{\mu_{il}^{\{1,2\}T} w_c^{\{1,2\}}}$,
21:        $\mu_{ik}^{\{1,2\}} = \tanh(W_h^{\{1,2\}} e_{ik}^{\{1,2\}} + b_h^{\{1,2\}})$,
22:        $k \in 1 \to j-1$
23:        $\hat{y}_{ij} \leftarrow \mathrm{classify}(\mathrm{concat}(e_{ij}^1, e_{ij}^2, r_{ij}^1, r_{ij}^2, h_{ij}^1, h_{ij}^2))$
24:        compute loss
25:        update the network parameters based on the loss using backpropagation
26:      end for
27:    end for
28:    compute Jaccard, F1, PRAUC on validation set
29:  end for
30:  best_model $\leftarrow \mathrm{argmax}_{model}$\{F1\}
31:  return best_model
32: end procedure
33: procedure evaluation
34:   compute Jaccard, F1, PRAUC on test set
35:   return Jaccard, F1, PRAUC
36: end procedure

Algorithm 2 Training AMANet without HAM

Input: $LR$, $d$, $AS$, $NH$, $\alpha$, $\beta$, $epoch$, sequences $X^1$, sequences $X^2$, and labels $y$
Output: Accuracy, F1, PRAUC
1: procedure training
2:   Initialize $W_E^{\{1,2\}}$ (two token embedding matrices), $W_\xi^{\{1,2\}}$ (two DEM interface matrices), $W_{DEM}^{\{1,2\}}$ (two DEM content matrices)
3:   for $i = 1 \to epoch$ do
4:     for $object_i$ in training set do
5:       $X_i^{\{1,2\}}, y_i \leftarrow object_i$
6:       $M_i^{\{1,2\}} \leftarrow$ one-hot encoding of $X_i^{\{1,2\}}$
7:       $TE_i^{\{1,2\}} \leftarrow M_i^{\{1,2\}} \times W_E^{\{1,2\}}$
8:       $PE_i^{\{1,2\}} \leftarrow$ positional encoding of $X_i^{\{1,2\}}$
9:       $E_i^{\{1,2\}} \leftarrow TE_i^{\{1,2\}} + PE_i^{\{1,2\}} \in \mathbb{R}^{L_i^{\{1,2\}} \times d}$
10:      $E_{intra,i}^{\{1,2\}} \leftarrow \mathrm{MultiHead}(E_i^{\{1,2\}})$
11:      $\xi_i^{\{1,2\}} \leftarrow \mathrm{pool}(E_{intra,i}^{\{1,2\}} \times W_\xi^{\{1,2\}})$
12:      update $W_{DEM}^{\{1,2\}}$ by $\xi_i^{\{1,2\}}$
13:      $r_i^{\{1,2\}}$ read from $W_{DEM}^{\{1,2\}}$
14:      $E_{inter,i}^1 \leftarrow \mathrm{MultiHead}(E_i^1, E_i^2)$
15:      $E_{inter,i}^2 \leftarrow \mathrm{MultiHead}(E_i^2, E_i^1)$
16:      $e_i^{\{1,2\}} \leftarrow \mathrm{concat}(\mathrm{pool}(E_{intra,i}^{\{1,2\}}), \mathrm{pool}(E_{inter,i}^{\{1,2\}}))$
17:      $\hat{y}_i \leftarrow \mathrm{classify}(\mathrm{concat}(e_i^1, e_i^2, r_i^1, r_i^2))$
18:      compute loss
19:      update the network parameters based on the loss using backpropagation
20:    end for
21:    compute Accuracy, F1, PRAUC on validation set
22:  end for
23:  best_model $\leftarrow \mathrm{argmax}_{model}$\{F1\}
24:  return best_model
25: end procedure
26: procedure evaluation
27:   compute Accuracy, F1, PRAUC on test set
28:   return Accuracy, F1, PRAUC
29: end procedure