Attention and Memory-Augmented Networks for Dual-View Sequential Learning

Yong He, Cheng Wang, Nan Li, Zhenyu Zeng
{sanyuan.hy, youmiao.wc, kido.ln, zhenyu.zzy}@alibaba-inc.com
Alibaba Group
Hangzhou, China
ABSTRACT

In recent years, sequential learning has been of great interest due to the advance of deep learning, with applications in time-series forecasting, natural language processing, and speech recognition. Recurrent neural networks (RNNs) have achieved superior performance in single-view and synchronous multi-view sequential learning compared to traditional machine learning models. However, the method remains less explored in asynchronous multi-view sequential learning, and the unaligned nature of multiple sequences poses a great challenge to learning the inter-view interactions. We develop an AMANet (Attention and Memory-Augmented Networks) architecture by integrating both attention and memory to solve the asynchronous multi-view learning problem in general, and we focus on experiments with dual-view sequences in this paper. Self-attention and inter-attention are employed to capture intra-view interaction and inter-view interaction, respectively. History attention memory is designed to store the historical information of a specific object, which serves as local knowledge storage. Dynamic external memory is used to store global knowledge for each view. We evaluate our model on three tasks: medication recommendation from a patient's medical records, diagnosis-related group (DRG) classification from a hospital record, and invoice fraud detection through a company's taxation behaviors. The results demonstrate that our model outperforms all baselines and other state-of-the-art models in all tasks. Moreover, the ablation study of our model indicates that the inter-attention mechanism plays a key role in the model, and it can boost the predictive power by effectively capturing the inter-view interactions from asynchronous views.

CCS CONCEPTS

• Computing methodologies → Neural networks; Supervised learning by classification; • Information systems → Clustering and classification; • Applied computing → Health care information systems.

KEYWORDS

dual-view sequential learning, inter-attention, intra-attention, dynamic external memory, history attention memory, classification, medication recommendation, DRG classification, invoice fraud detection

ACM Reference Format:
Yong He, Cheng Wang, Nan Li, Zhenyu Zeng. 2020. Attention and Memory-Augmented Networks for Dual-View Sequential Learning. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '20), August 23–27, 2020, Virtual Event, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394486.3403055

1 INTRODUCTION

Multi-view data, in which different views describe distinct perspectives, is very common in many real-world applications. Various learning algorithms have been proposed to incorporate the complementary information of multi-view datasets in order to understand a complex system and make precise data-driven predictions [26]. Due to the advance of the recurrent neural network (RNN) architecture, multiple sequential events can be integrated into multi-view learning. Such a form of learning is known as multi-view sequential learning. Many models are designed for synchronous sequences, in which all views share the same time step or can be aligned, while few models focus on asynchronous sequences, whose sequences are of various granularity and dynamic lengths. Electronic health record (EHR) learning is a notable example of multi-view sequential learning. A patient's medical records consist of multiple views, such as diagnosis, procedure, and medication, and are asynchronous on a fine-grained time scale.

The key challenge of multi-view sequential learning is to capture both the intra-view interactions within a single view and the inter-view interactions across multiple views. RNNs, such as long short-term memory (LSTM) [7] and gated recurrent unit (GRU) [3], have been widely used to learn the intra-view interactions in single-view sequential learning problems. However, the sequential nature of RNNs makes parallelization challenging. The transformer, which dispenses with recurrence and uses the attention mechanism to model the dependencies in a sequence, can achieve state-of-the-art performance in sequential modeling and allows for parallelization [18]. The techniques to learn the inter-view interactions can be classified into early, late, and hybrid fusion approaches [2]. In the early fusion method, the features of multiple views are concatenated into a single feature before learning. The late fusion method first learns separate models from individual views and then combines them through averaging or voting to make the final prediction. A hybrid approach attempts to take advantage of both of the above methods. However, the prerequisite of an early or hybrid approach is the alignment of the sequences, which cannot be fulfilled in an asynchronous setting.
To address the above challenges in asynchronous multi-view learning, we develop AMANet (Attention and Memory-Augmented Networks) based on attention and memory mechanisms. Our model consists of three major components for each view: a neural controller, a history attention memory, and a dynamic external memory. The neural controller encodes the input sequence to capture the intra-view interactions, relying on the self-attention mechanism. The inter-attention mechanism is introduced to learn the inter-view interactions. The self-attention (or intra-attention) mechanism can associate different positions within a single sequence; similarly, the inter-attention relates positions across two sequences. The vectors from both self-attention and inter-attention are then concatenated together as the encoding vector of the sequence. The history attention memory stores all previous encoding vectors of the same object. The dynamic external memory is shared by all objects in training and stores the common knowledge from the data. Lastly, the encoding vector from the controller, the read vector from the dynamic external memory, and the historical attention vector from the history attention memory are concatenated together for the prediction task.

We have conducted experiments on three tasks: medication recommendation from a patient's medical records, including previous and current diagnoses and medical procedures; DRG classification from a current hospital record with diagnosis and procedure sequences; and invoice fraud detection from a company's monthly tax declaration and invoice sequences. In all tasks, our model is able to outperform baselines and other state-of-the-art models. The rest of the paper is organized as follows. We review related works and techniques in section 2 and introduce the formulation and architecture of our model in section 3. We evaluate our model on the three tasks introduced above in section 4, where we also perform ablation studies to investigate the importance of each module in our model. In section 5, we conclude our work and highlight future directions.

2 RELATED WORKS

Multi-view learning is a rapidly growing area of machine learning, and recent trends have been reviewed thoroughly [2, 21, 26]. Several deep learning architectures have been developed to extend multi-view learning to sequential datasets, such as multi-view LSTM [15], memory fusion network (MFN) [24], and dual memory neural computer [12]. Here we emphasize the works with attention mechanisms and memory augmentation methods.

The attention mechanism was initially introduced in machine translation to jointly translate and align words [1]. It has become an integral part of RNN-based sequential modeling to abstract the most relevant information of a sequence. The attention mechanism also offers interpretability of deep neural networks, which is one of its most intriguing features. In healthcare studies, the attention mechanism is adopted in many neural network architectures, such as Dipole [14], RETAIN [4], and RAIM [22], to achieve better predictive power and more effective result interpretation. For example, RAIM integrates continuous monitoring data and discrete clinical events using guided multi-channel attention for two prediction tasks. It has shown excellent performance in prediction and meaningful interpretation in quantitative analysis.

Recently, the transformer, solely based on the self-attention mechanism, has drawn great attention due to its computational efficiency without sacrificing performance compared to RNNs [18]. For instance, the SAnD (Simple Attend and Diagnose) model adapts the self-attention mechanism as in the transformer for multiple diagnosis tasks [17]. In our work, the self-attention layer from the transformer is used for intra-view learning. Moreover, we propose an inter-attention layer by modifying the self-attention layer to capture inter-view interactions. The attention mechanism is also used to retrieve the historical attention vector from the history attention memory.

Memory-augmented neural networks (MANN), such as memory networks [20] and differentiable neural computers (DNC) [6], use external memory as a dynamic knowledge base to store long-term dependencies. The external memory can be read and written with attentional processes and has been widely used. Graph augmented memory networks (GAMENet) [16] integrates MANN with graph convolutional networks (GCN) [11] to embed multiple knowledge graphs. DMNC [12] is a first attempt to build a MANN for asynchronous dual-view sequential learning. GAMENet and DMNC have been employed for the medication recommendation task and serve as baselines in this work. In our work, the dynamic external memory from DNC is used to store the common knowledge of each view.

3 METHOD

3.1 Problem Formulation

In this work, we use dual-view sequences to introduce our model. The goal of dual-view sequential learning is to predict the target $y$ given two input sequences $X^1$ and $X^2$. Let $S_i = (X_i^1, X_i^2, y_i)$ be the sample with index $i$, where $X_i^1 = \{x_{i1}^1, x_{i2}^1, \ldots, x_{iL_i^1}^1\}$, $X_i^2 = \{x_{i1}^2, x_{i2}^2, \ldots, x_{iL_i^2}^2\}$, and $y_i$ can be a scalar or a vector depending on the prediction task. $L_i^1$ and $L_i^2$ are the lengths of $X_i^1$ and $X_i^2$, respectively. The length of an input sequence is not only view-type dependent but also sample dependent. The $x_{ik}^1$ and $x_{ik}^2$ in the sequences are the initial representations of the input.

In the medication recommendation task, we predict the medication set given previous and current diagnoses and procedures. The current diagnoses and procedures of a patient can be formulated as two sequences for dual-view sequential learning. In the MIMIC-III dataset [9], the diagnoses, medical procedures, and medications are represented as standard medical codes with code space sizes $d_{x^1}$, $d_{x^2}$, and $d_y$, respectively. Therefore, we use one-hot vectors $x_{ik}^1 \in \{0,1\}^{d_{x^1}}$, $x_{ik}^2 \in \{0,1\}^{d_{x^2}}$, and $y_i \in \{0,1\}^{d_y}$ to represent diagnoses, procedures, and medications, respectively. The previous medical records are incorporated with the history attention memory module. This is a multi-label classification task with $d_y$ binary labels.

In the DRG classification task, we predict the DRG according to the diagnosis sequence and procedure sequence from the patient's medical record of the current visit. This is a multi-class classification task with $d_y$ categories.

In the invoice fraud detection task, we identify companies involved in invoice fraud through their historical invoice and taxation declaration behaviors. The daily invoice and monthly taxation declaration values are formulated as two asynchronous sequences with different granularity. This is a binary classification problem.
3.2 Network Architecture

Our model is named Attention and Memory-Augmented Networks (AMANet)¹, and Figure 1 shows AMANet in the dual-view setting. In this section, we describe the modules of our model in detail.

¹The pseudo-code is in the appendix, and the code is at https://github.com/non-name-2020/AMANet.

Token Embedding. Given two input sequences $X_i^1$ and $X_i^2$, for each token in the sequence, the one-hot vector representations are formulated as $M_i^1 \in \{0,1\}^{L_i^1 \times d_{x^1}}$ and $M_i^2 \in \{0,1\}^{L_i^2 \times d_{x^2}}$, respectively. We derive the embedding of the input matrix as

$$TE_i^1 = M_i^1 W_E^1 \quad (1)$$
$$TE_i^2 = M_i^2 W_E^2 \quad (2)$$

where $W_E^1 \in \mathbb{R}^{d_{x^1} \times d}$ and $W_E^2 \in \mathbb{R}^{d_{x^2} \times d}$ are the embedding matrices and $d$ is the dimension of the embedding.

Positional Encoding. Since the neural controller uses solely self-attention instead of RNNs, there is no recurrence and the positional information is lost. In order to incorporate the order of the sequence, positional encoding is introduced to integrate with the input embeddings [18]. We use the same positional encoding as in the transformer [18]:

$$PE_{(pos,\,2j)} = \sin\left(\frac{pos}{10000^{2j/d}}\right) \quad (3)$$
$$PE_{(pos,\,2j+1)} = \cos\left(\frac{pos}{10000^{2j/d}}\right) \quad (4)$$

where $pos$ is the position index in the sequence, and $2j$ and $2j+1$ are the even and odd indexes of the embedding vector, respectively. The sequence embedding vector is calculated by summing the token embedding and the positional encoding:

$$E_i^1 = TE_i^1 + PE_i^1 \in \mathbb{R}^{L_i^1 \times d} \quad (5)$$
$$E_i^2 = TE_i^2 + PE_i^2 \in \mathbb{R}^{L_i^2 \times d} \quad (6)$$
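A minimal NumPy sketch of Equations (3)-(6) (our own illustration of the transformer-style encoding; the variable names and the random stand-in token embedding are assumptions, not the released implementation):

```python
import numpy as np

def positional_encoding(length: int, d: int) -> np.ndarray:
    """Sinusoidal encoding: PE[pos, 2j] = sin(pos / 10000^(2j/d)), Eqs. (3)-(4)."""
    pos = np.arange(length)[:, None]            # (L, 1) position indexes
    j2 = np.arange(0, d, 2)                     # even dimensions 2j
    angles = pos / np.power(10000.0, j2 / d)    # (L, d/2)
    pe = np.zeros((length, d))
    pe[:, 0::2] = np.sin(angles)                # even indexes
    pe[:, 1::2] = np.cos(angles)                # odd indexes
    return pe

# Eqs. (5)-(6): sequence embedding = token embedding + positional encoding
L, d = 7, 100
TE = np.random.randn(L, d)            # stand-in for M W_E from Eqs. (1)-(2)
E = TE + positional_encoding(L, d)    # (L, d)
```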
Multi-Head Self-Attention. The neural controller performs encoding on the embedding vector with the self-attention mechanism. In general, an attention function maps a query and a set of key-value pairs to an output, where all of them are vectors. The output is computed as the weighted sum of the values, where the weight for each value is the inner product of the query and the keys with a softmax normalization. In self-attention, the queries $Q$, keys $K$, and values $V$ are linear projections from the same input embedding matrix [18]. For each view, the self-attention function is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (7)$$

where $d_k$ is the dimension of $Q$ and $K$. The inner product of $Q$ and $K$ is scaled by $\frac{1}{\sqrt{d_k}}$ in the transformer to avoid small gradients due to large values in the dot product.

Here we use multi-head attention to encode the input sequence. The input embedding vector is projected linearly into $h$ subspaces with a set of matrices $\{Q_m, K_m, V_m\}$ for each subspace, where the queries $Q_m = E W_m^Q$, $W_m^Q \in \mathbb{R}^{d \times d_k}$, keys $K_m = E W_m^K$, $W_m^K \in \mathbb{R}^{d \times d_k}$, and values $V_m = E W_m^V$, $W_m^V \in \mathbb{R}^{d \times d_v}$. Each attention head is then computed by applying the attention function on the projected matrices in each subspace. Finally, all attention heads are concatenated together and projected to be the intra-encoding vectors:

$$E_{intra} = \mathrm{MultiHead}(E) = \mathrm{Concat}(head_1, \ldots, head_h) W_{intra} \quad (8)$$

where $head_m = \mathrm{Attention}(Q_m, K_m, V_m)$, $W_{intra} \in \mathbb{R}^{h d_v \times d_{intra}}$, and $m$ is the index of the head. We set $d_k = d_v = d_{intra}/h$. Given two input sequence embeddings $E^1$ and $E^2$, the intra-encoding vectors are calculated as $E^1_{intra} = \mathrm{MultiHead}(E^1)$ and $E^2_{intra} = \mathrm{MultiHead}(E^2)$. The intra-encoding vectors will be used for the dynamic external memory.

The self-attention module is also known as intra-attention. Its purpose is to capture the relationships between different elements in the same sequence.
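The following NumPy sketch mirrors Equations (7) and (8) under assumed shapes (masking, dropout, and parameter initialization are omitted; it is an illustration, not the released code):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (7)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V

def multi_head(E_q, E_kv, Wq, Wk, Wv, Wo):
    """Multi-head attention, Eqs. (8)/(10): h heads, concatenated and projected.
    Wq/Wk/Wv are per-head projection lists; Wo plays the role of W_intra/W_inter."""
    heads = [attention(E_q @ wq, E_kv @ wk, E_kv @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# self-attention: queries, keys, and values all come from the same embedding E
d, dk, h, L = 100, 16, 4, 7
rnd = np.random.randn
Wq = [rnd(d, dk) for _ in range(h)]
Wk = [rnd(d, dk) for _ in range(h)]
Wv = [rnd(d, dk) for _ in range(h)]
Wo = rnd(h * dk, 64)
E = rnd(L, d)
E_intra = multi_head(E, E, Wq, Wk, Wv, Wo)       # (L, 64)
```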
Dynamic External Memory. The intra-encoding vector from the neural controller is mapped to the interface vector $\xi$ [6], followed by a global average pooling operator:

$$\xi = \mathrm{pool}(E_{intra} W_\xi) \quad (9)$$

where $W_\xi$ is a trainable matrix and pool is the global average pooling operator. The interface vector $\xi$ is subdivided into several sections for reading from and writing to the memory matrix $M$. We use the same approach as in DNC [6], and more details can be found in the original work.

In the original paper, each token in the input sequence reads and writes memory once. The difference in our work is that we read and write memory at sample granularity. We think this method is more in line with the update frequency of global knowledge, and it is faster.

Each view has its own memory matrix, as in Figure 1. Two read vectors $r^1$ and $r^2$ are retrieved from the two memory matrices $M^1$ and $M^2$, respectively, and both read vectors are used for the classification layer.

We use the dynamic external memory module to store the global knowledge extracted from the dataset, and all samples can read from and write to it. We believe that domain knowledge exists in the dataset; therefore, we adapt this module for domain knowledge learning.
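Equation (9) reduces to a pooled linear projection; a minimal sketch is given below (the interface size 38 is an arbitrary stand-in, and the DNC read/write machinery driven by $\xi$ follows [6] and is not reproduced here):

```python
import numpy as np

def interface_vector(E_intra: np.ndarray, W_xi: np.ndarray) -> np.ndarray:
    """Eq. (9): project to the DNC interface space, then global average pooling
    over positions, yielding one interface vector per sample."""
    return (E_intra @ W_xi).mean(axis=0)

xi = interface_vector(np.random.randn(7, 64), np.random.randn(64, 38))
# xi is then split into read keys, write key, gates, etc., as in DNC [6]
```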
Inter-Attention. In self-attention, the query, key, and value are projected from the same input embedding to capture the intra-view interactions. In inter-attention, the query is projected from one input embedding, while the key and value are projected from the other input embedding. For instance, given two input sequence embeddings $E^1$ and $E^2$, the inter-encoding vectors for $E^1$ from $E^2$ are calculated as

$$E^1_{inter} = \mathrm{MultiHead}(E^1, E^2) = \mathrm{Concat}(head_1, \ldots, head_h) W_{inter} \quad (10)$$

where $head_m = \mathrm{Attention}(Q^1_m, K^2_m, V^2_m)$, $Q^1_m = E^1 W_m^{Q^1}$, $K^2_m = E^2 W_m^{K^2}$, and $V^2_m = E^2 W_m^{V^2}$. The inter-encoding vectors for $E^2$ from $E^1$ can be calculated as $E^2_{inter} = \mathrm{MultiHead}(E^2, E^1)$.

The main purpose of this module is to capture the relationships between elements from different sequences, thereby enhancing the representation ability.
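Continuing the multi_head sketch above, inter-attention changes only where the projections come from: the query from one view, the key and value from the other (in practice each direction would have its own projection matrices; reusing them here keeps the illustration short):

```python
# Eq. (10): inter-encoding for view 1 attended over view 2, and vice versa;
# asynchronous lengths are fine because only K and V lengths must match each other
E1, E2 = rnd(7, d), rnd(4, d)
E1_inter = multi_head(E1, E2, Wq, Wk, Wv, Wo)    # (7, 64)
E2_inter = multi_head(E2, E1, Wq, Wk, Wv, Wo)    # (4, 64)
```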
Figure 1: Attention and Memory-Augmented Networks for Dual Sequences. In the history attention memory module, $t$ denotes the time index of an object's sample at time $t$. If an object has only one sample, there is no history attention memory.
Encoding Vector. The final encoding vectors of a view in dual-view sequential data are calculated by applying the global average pooling operator on the intra-encoding and inter-encoding vectors, respectively, and then concatenating them together:

$$e^1 = \mathrm{Concat}(\mathrm{pool}(E^1_{intra}), \mathrm{pool}(E^1_{inter})) \quad (11)$$
$$e^2 = \mathrm{Concat}(\mathrm{pool}(E^2_{intra}), \mathrm{pool}(E^2_{inter})) \quad (12)$$

where pool is the global average pooling operator. The final encoding vectors $e^1$ and $e^2$ flow into two components: 1) they are directly used for the classification layer; 2) they are saved to the history attention memory.

History Attention Memory. The history attention memory is designed to capture an object's historical contribution to the current time step $t$. It stores the historical representations, i.e., the encoding vectors, of an object. Given the history attention memory $H_k = \{e_{k1}, e_{k2}, \ldots, e_{k(t-1)}\}$ for the $k$-th object with all previous time steps in a single view, the attended historical representation $h_k$ is calculated as the weighted sum of the historical embedding vectors:

$$h_k = \sum_{i=1}^{t-1} \alpha_{ki} e_{ki} \quad (13)$$

The weights $\alpha_k = \{\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{k(t-1)}\}$ are obtained by

$$\alpha_{ki} = \mathrm{softmax}(\mu_{ki}) = \frac{e^{\mu_{ki}^T w_c}}{\sum_j e^{\mu_{kj}^T w_c}} \quad (14)$$

where $\mu_{ki} = \tanh(W_h e_{ki} + b_h)$, $i \in \{1, 2, \ldots, t-1\}$, and $w_c$ is the trainable history context vector [23]. Each view has its own history attention memory, as in Figure 1. Two historical attention vectors $h^1$ and $h^2$ are retrieved from the two history attention memories, respectively, and they are used for the classification layer.

The history attention memory module is designed for tasks where an object has history samples. Its purpose is to store the historical representation information of the object, which we call local knowledge. In the medication recommendation task, for instance, the medication of the current visit is related not only to the current diagnosis and procedure sequences but also to medical records from previous visits.
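A NumPy sketch of Equations (13)-(14), attending over the stored encodings of one object (the shapes and names are illustrative assumptions):

```python
import numpy as np

def history_attention(H, W_h, b_h, w_c):
    """H: (t-1, de) matrix of stored encodings e_k1..e_k(t-1) for one object.
    Returns h_k, the attention-weighted sum of the history, Eqs. (13)-(14)."""
    mu = np.tanh(H @ W_h + b_h)          # (t-1, dh), mu_ki = tanh(W_h e_ki + b_h)
    scores = mu @ w_c                    # (t-1,) relevance to the context vector
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax attention weights, Eq. (14)
    return alpha @ H                     # (de,) weighted history, Eq. (13)

de, dh = 128, 64
H = np.random.randn(5, de)               # encodings of five previous visits
h_k = history_attention(H, np.random.randn(de, dh),
                        np.random.randn(dh), np.random.randn(dh))
```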
Classification Layer. The encoding vectors, read vectors, and historical attention vectors from the two views are concatenated together as the input for the final classification:

$$\hat{y} = \sigma(\mathrm{Concat}(e^1, e^2, r^1, r^2, h^1, h^2)) \quad (15)$$

where $\sigma$ is the sigmoid function for binary classification and multi-label classification, and the softmax function for multi-class classification.
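The classification layer is a single projection over the concatenation; a minimal sketch of Eq. (15) follows (the dense projection W_out is our addition for illustration, matching the classify step in the appendix pseudo-code, since the concatenation must be mapped to $d_y$ outputs before $\sigma$):

```python
import numpy as np

def classify(e1, e2, r1, r2, h1, h2, W_out, multi_class=False):
    """Eq. (15): concatenate the six vectors and project to d_y outputs."""
    z = np.concatenate([e1, e2, r1, r2, h1, h2]) @ W_out
    if multi_class:
        p = np.exp(z - z.max())
        return p / p.sum()                    # softmax for multi-class
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid for binary / multi-label

d_y = 151                                     # e.g., medication label space
parts = [np.random.randn(128) for _ in range(6)]
y_hat = classify(*parts, W_out=np.random.randn(6 * 128, d_y))
```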
3.3 Loss Function

Cross-entropy loss is widely used for classification. However, class imbalance can lead to inefficiency in training and affect the performance of the model. To deal with class imbalance, we use focal loss [13], which is a modified version of cross entropy, as our loss function. Given the true class $y_i \in \{0,1\}$ and the predicted probability $\hat{y}_i \in [0,1]$, the focal loss for binary classification is defined as

$$L_i = -\alpha y_i (1-\hat{y}_i)^\beta \log \hat{y}_i - (1-\alpha)(1-y_i)\, \hat{y}_i^\beta \log(1-\hat{y}_i) \quad (16)$$

where $\alpha$ is the weighting factor to balance the importance of positive and negative samples, and $\beta$ is the focusing parameter to down-weight the well-classified samples. Focal loss therefore emphasizes training on hard samples by reducing the loss contribution from easy samples. The focal loss for multi-label classification is

$$L_i = -\sum_{j=1}^{d_y} \left[ \alpha y_{ij} (1-\hat{y}_{ij})^\beta \log \hat{y}_{ij} + (1-\alpha)(1-y_{ij})\, \hat{y}_{ij}^\beta \log(1-\hat{y}_{ij}) \right] \quad (17)$$

where $d_y$ is the total number of labels, $y_i = \{y_{i1}, y_{i2}, \ldots, y_{id_y}\}$, $y_{ij} \in \{0,1\}$, $\hat{y}_i = \{\hat{y}_{i1}, \hat{y}_{i2}, \ldots, \hat{y}_{id_y}\}$, and $\hat{y}_{ij} \in [0,1]$. We set $\alpha = 0.6$ and $\beta = 2$ in this study, as recommended by [13].

Since each category $c_i$ would need a specific weight $\alpha_i$, it is complicated to apply the focal loss to multi-class classification. Therefore, the original cross-entropy loss is used for multi-class classification.
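A NumPy sketch of the multi-label focal loss in Eq. (17) with the recommended $\alpha = 0.6$ and $\beta = 2$ (the clipping for numerical stability is our addition):

```python
import numpy as np

def focal_loss(y, y_hat, alpha=0.6, beta=2.0, eps=1e-7):
    """Multi-label focal loss, Eq. (17); y and y_hat are (d_y,) arrays."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)    # avoid log(0)
    pos = alpha * y * (1.0 - y_hat) ** beta * np.log(y_hat)
    neg = (1.0 - alpha) * (1.0 - y) * y_hat ** beta * np.log(1.0 - y_hat)
    return -(pos + neg).sum()

# a well-classified positive contributes far less than a hard one
print(focal_loss(np.array([1.0]), np.array([0.95])))   # ~0.0001
print(focal_loss(np.array([1.0]), np.array([0.30])))   # ~0.35
```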
4 EXPERIMENTS

In this section, we first compare the AMANet model to baselines and other state-of-the-art models on three tasks: medication recommendation, diagnosis-related group (DRG) classification, and invoice fraud detection. Then, we report the evaluation of each module of our model in the ablation study.

4.1 Medication Recommendation

The increasing volume of EHRs provides a great opportunity to improve the efficiency and quality of healthcare by developing predictive models. One critical task is to predict medications based on the medical records [8]. In this paper, we formulate medication recommendation as a multi-label classification problem that recommends a medicine combination based on diagnoses and medical procedures. The diagnoses and medical procedures are ordered within a hospital visit. Therefore, we treat the diagnoses and procedures in the current visit as two sequential views and apply AMANet to predict the medication combinations.

Dataset. We use the medical data from the MIMIC-III database (https://mimic.physionet.org/), which comprises rich medical information relating to patients within intensive care units (ICUs) [9]. We follow a procedure similar to GAMENet to process the medical codes in this experiment [16]. The NDC drug codes in MIMIC-III are mapped to third-level ATC codes as prediction labels. The statistics of the dataset are summarized in Table 1.

Table 1: Statistics of MIMIC-III Dataset

                                    Count
patients                            6350
visits                              15032
diagnosis categories                1958
procedure categories                1430
medication categories               151

                                    Mean    Max   Min
count of visits                     2.37    29    1
length of diagnosis sequence        13.63   39    1
length of procedure sequence        4.54    32    1
count of medication combinations    19.86   55    1

Baselines. We compare our model with the following baseline and state-of-the-art algorithms.

• Nearest: This algorithm uses the medication combinations from the previous visit as the current visit's prediction.
• Logistic regression (LR): Logistic regression with L2 regularization is evaluated as a baseline model. The input is the multi-hot vector of diagnoses and procedures from the current visit.
• LEAP: LEAP (Learn to Prescribe) [25] uses a recurrent decoder to model label dependency and content-based attention to model the label-to-instance mapping for medicine recommendation.
• RETAIN: RETAIN (Reverse Time Attention) [4] is an interpretable model with a reverse time attention mechanism that mimics a physician's practice; recent visits are therefore likely to receive high attention.
• DMNC: DMNC [12] uses a dual memory neural computer for the medication recommendation task.
• GAMENet: GAMENet [16] recommends medications through a graph augmented memory module, whose memory bank is based on drug usage and drug-drug interactions (DDI), and a dynamic memory based on the patient's history. We performed hyperparameter searches for GAMENet, for example over the learning rate and embedding size, and our result for GAMENet is better than the performance reported in [16] for the medication recommendation task.

Evaluation Metrics. In order to measure model performance, we use Jaccard similarity, F1-score, and the area under the Precision-Recall curve (PRAUC) as the evaluation metrics, all of which are macro-averaged. We randomly divide the dataset into training, validation, and test sets with ratio 4:1:1 and report the performance on the test set.

Implementation Details. For our method, both the token embedding size and the position embedding size are 100. Both the self-attention size and the inter-attention size are 64, with 4 heads. The dynamic external memory is a 256×64 matrix with 4 read heads. The $\alpha$ and $\beta$ of the focal loss are 0.6 and 2, respectively.
We train the model using the Adam [10] optimizer at learning rate 0.0001 and stop at 15 epochs. The other models are evaluated in the same way as in GAMENet [16]. All models are trained on a macOS Mojave system with a 2.2 GHz Intel Core i7 and 16 GB of memory.

Results. Table 2 summarizes our results, and we can see that the AMANet model outperforms the other models on all metrics. For Jaccard similarity and F1 score, our model achieves about a 0.04-0.05 improvement over the next-best model (GAMENet without DDI). AMANet also scores about 2% better than the next-best model, LR, on the PRAUC metric. When the IA, DEM, and HAM components are removed individually, there are different levels of performance decline. Removing IA leads to the largest performance drop among the three components. We can conclude from these comparative experiments that each component is useful, especially the IA component.

Table 2: Experiments on Medication Recommendation

Method                   Jaccard   F1       PRAUC
Nearest                  0.2548    0.3398   0.3187
LR                       0.4579    0.6177   0.7602
DMNC                     0.4231    0.5849   0.6418
LEAP                     0.4365    0.5997   0.6367
RETAIN                   0.4647    0.6269   0.7336
GAMENet                  0.4551    0.6172   0.7275
GAMENet (w/o DDI)        0.4796    0.6390   0.7484
AMANet (w/o IA)          0.5067    0.6641   0.7629
AMANet (w/o DEM, HAM)    0.5136    0.6706   0.7710
AMANet (w/o DEM)         0.5201    0.6757   0.7716
AMANet (w/o HAM)         0.5200    0.6761   0.7746
AMANet                   0.5259    0.6809   0.7772
4.2 DRG Classification

Diagnosis-related groups (DRGs) have been one of the most popular prospective payment systems in many countries. The basic idea of the DRG system is that patients admitted to a hospital are classified into a limited number of DRGs. Patients in each DRG share similar clinical and demographic patterns and are likely to consume similar amounts of hospital resources. The hospitals are reimbursed a set fee for the treatment in a single DRG category, regardless of the actual cost for an individual patient. Therefore, DRGs can put pressure on hospitals to control costs, reduce the length of stay, and increase the number of admitted inpatients. The DRG for each patient is assigned at the time of discharge [19]. A previous study has shown that accurately classifying a patient's DRG at the early stage of a hospital visit can allow the hospital to effectively allocate limited hospital resources [5]. In this study, we extract the diagnosis sequence and procedure sequence from EHRs as input and apply AMANet to predict the DRG. Since DRGs are determined solely by the current visit, we use AMANet without the history attention memory module in this task.

Dataset. The dataset for this task is from a Grade Three Class B hospital in China. It contains 21356 hospital records from 2019. Each hospital record consists of two sequences: a diagnosis sequence and a procedure sequence. The total number of DRG categories is 270. The statistics of the dataset are summarized in Table 3.

Table 3: Statistics of DRG Dataset

                                Count
records                         21356
diagnosis categories            3052
procedure categories            1177
DRG categories                  270

                                Mean   Max   Min
length of diagnosis sequence    3.85   11    1
length of procedure sequence    1.3    5     1

Baselines. LEAP and GAMENet are not suitable for this task, because LEAP outputs sequences and GAMENet is developed for the medication recommendation purpose. We list the baseline models for this task as follows:

• LR, RETAIN, and DMNC are used as baseline algorithms, similar to the medication recommendation task.
• Dual-LSTM: We add a Dual-LSTM model, which encodes the two input sequences using separate LSTMs and then combines the two encoding vectors for the final classification layer.

Evaluation Metrics. We use accuracy, F1-score, and PRAUC as the evaluation metrics for this task. The dataset is randomly divided into training, validation, and test sets with a 4:1:1 ratio, while the ratio of samples from each category is kept the same across the sets. We report the performance on the test set.

Implementation Details. The hyperparameters of our model are the same as in the medication recommendation task except for the learning rate and the number of epochs. We train the model at learning rate 0.0005 and stop at 10 epochs. All models are trained on a macOS Mojave system with a 2.2 GHz Intel Core i7 and 16 GB of memory.

Results. Table 4 indicates that AMANet performs the best compared to the other four models on all three metrics. Compared to DMNC and Dual-LSTM, which are solely based on late fusion, our model achieves about 0.02 better performance for accuracy and F1 score, and 0.04 for PRAUC. We can observe a significant performance drop when removing the IA module, which indicates the importance of IA in the AMANet model for capturing the inter-view interactions and improving the representation ability.

Table 4: Experiments on DRG Classification

Method             Accuracy   F1       PRAUC
LR                 0.7743     0.7458   0.6271
RETAIN             0.7631     0.7483   0.6048
DMNC               0.8473     0.8407   0.7324
Dual-LSTM          0.8525     0.8433   0.7305
AMANet (w/o IA)    0.7497     0.7416   0.6116
AMANet (w/o DEM)   0.8623     0.8612   0.7832
AMANet             0.8712     0.8690   0.7975
4.3 Invoice Fraud Detection

The value-added tax (VAT) is a general tax based on the value added to goods and services in the majority of countries. It is charged as a percentage of the current sale price, deducting the tax paid at the preceding stage. In the VAT collection process, a company provides the invoices from its selling for the current tax calculation and the invoices of its buying from suppliers for the deduction. Some companies are established specifically for invoice fraud, illegally providing selling invoices to other companies without any underlying business, so that the companies buying the invoices can get a better VAT deduction. Identifying companies involved in invoice fraud is a common task for the taxation agency. Here we formulate the invoice fraud detection task as a binary classification problem: identifying companies involved in invoice fraud through historical invoice and tax declaration information. We construct the daily invoice and monthly tax declaration information as two sequential views and apply AMANet to identify companies with abnormal behaviors. Since there is only one sample per company, we use AMANet without the history attention memory module in this task.

Dataset. The dataset is from a taxation agency in China. It contains 8700 companies in total, and 1/3 of them are involved in invoice fraud. For each company, two time-ordered sequences, i.e., daily invoice values and monthly tax declaration values, are used as input. The statistics of the dataset are summarized in Table 5. The invoice values and tax declaration values are continuous, so we discretize the values of each view into 2000 categories using a quantile-based discretization function. Then we use one-hot encoding to represent the invoice and tax declaration values.
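The discretization step can be reproduced with a standard quantile binning function, for example pandas.qcut (a sketch of the preprocessing described above, with synthetic values, not the production pipeline):

```python
import numpy as np
import pandas as pd

# continuous daily invoice values -> 2000 quantile-based categories
values = pd.Series(np.random.lognormal(mean=10.0, sigma=1.0, size=100_000))
codes = pd.qcut(values, q=2000, labels=False, duplicates="drop")
# each integer code in [0, 2000) is then one-hot encoded as a sequence token
print(codes.min(), codes.max())
```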
Table 5: Statistics of Invoice Fraud Dataset

                                  Count
companies                         8700
invoice fraud companies           2900
invoice categories                2000
declaration categories            2000

                                  Mean    Max    Min
length of invoice sequence        56.05   1256   1
length of declaration sequence    27.08   166    1

Baselines. The baseline models for this task are the same as in the DRG classification task, except that the setting changes from multi-class classification to binary classification.

Evaluation Metrics. These are the same as in the DRG classification task.

Implementation Details. These are the same as in the DRG classification task.

Results. Table 6 indicates that AMANet performs the best compared to the other four models on all three metrics. Compared to Dual-LSTM and DMNC, our model achieves about 3% better performance consistently. When the IA component is removed, the performance is not as good as that of DMNC and Dual-LSTM. This demonstrates that the self-attention and inter-attention mechanisms, coupled with memory, can capture the intra-view and inter-view interactions effectively.

Table 6: Experiments on Invoice Fraud Detection

Method             Accuracy   F1       PRAUC
LR                 0.7724     0.7222   0.8839
RETAIN             0.8628     0.8152   0.8910
DMNC               0.8669     0.8323   0.9005
Dual-LSTM          0.8655     0.8352   0.9220
AMANet (w/o IA)    0.8324     0.7805   0.9030
AMANet (w/o DEM)   0.8924     0.8556   0.9467
AMANet             0.8938     0.8623   0.9503

4.4 Hyperparameters

In this subsection, we study the effect of the hyperparameters on our model. Due to limited space, we use the medication recommendation task as an example. We focus on four hyperparameter sets: the embedding size $d$; the learning rate $LR$; the number of heads $NH$ and attention size $AS$; and the $\alpha$ and $\beta$ of the focal loss. We vary the values in each hyperparameter set with the other sets fixed. The results for the different hyperparameter settings are shown in Table 7. Training with focal loss yields more than a 0.01 improvement over cross-entropy loss (i.e., $\alpha = 0.5$ and $\beta = 0$) in Jaccard similarity and F1 score. The differences among the other hyperparameter sets are smaller than 0.006 on all three metrics, which is not significant. Therefore, our proposed model is not sensitive to most of the hyperparameters, and adopting focal loss in training can boost the result significantly.
Table 7: Experiments on Varying Hyperparameters

Hyperparameters              Jaccard   F1       PRAUC
Varying $d$ ($LR=0.0001$, $AS=64$, $NH=4$, $\alpha=0.6$, $\beta=2$)
$d=64$                       0.5249    0.6802   0.7780
$d=100$                      0.5259    0.6809   0.7772
$d=128$                      0.5242    0.6797   0.7759
Varying $LR$ ($d=100$, $AS=64$, $NH=4$, $\alpha=0.6$, $\beta=2$)
$LR=0.0001$                  0.5259    0.6809   0.7772
$LR=0.0002$                  0.5220    0.6774   0.7760
$LR=0.0005$                  0.5255    0.6804   0.7761
$LR=0.001$                   0.5235    0.6788   0.7747
Varying $AS$, $NH$ ($LR=0.0001$, $d=100$, $\alpha=0.6$, $\beta=2$)
$AS=32$, $NH=8$              0.5205    0.6763   0.7755
$AS=32$, $NH=4$              0.5207    0.6764   0.7749
$AS=64$, $NH=4$              0.5259    0.6809   0.7772
$AS=64$, $NH=2$              0.5239    0.6793   0.7779
Varying $\alpha$, $\beta$ ($LR=0.0001$, $d=100$, $AS=64$, $NH=4$)
$\alpha=0.5$, $\beta=0$      0.5056    0.6626   0.7753
$\alpha=0.6$, $\beta=0$      0.5183    0.6741   0.7756
$\alpha=0.6$, $\beta=1.5$    0.5248    0.6798   0.7791
$\alpha=0.6$, $\beta=2.0$    0.5259    0.6809   0.7772
$\alpha=0.6$, $\beta=2.5$    0.5198    0.6759   0.7722
$\alpha=0.6$, $\beta=3.0$    0.5208    0.6766   0.7721
$\alpha=0.7$, $\beta=2.5$    0.5218    0.6777   0.7763
$\alpha=0.7$, $\beta=3.0$    0.5241    0.6797   0.7763

4.5 Ablation Study

We perform ablation studies on our model by removing specific modules one at a time to explore their relative importance. The three modules evaluated in this section are inter-attention (IA), dynamic external memory (DEM), and history attention memory (HAM). As we can see from Table 2, Table 4, and Table 6, the performance degradation in all ablation experiments indicates that all three modules of our model are necessary.

Inter-Attention. We design the inter-attention to capture the inter-view interactions across views by relating one input embedding to another input embedding and aligning different sequences. The inter-view encoding for sequence A from sequence B can be viewed as softly aligning sequence A to sequence B. First, a relevance matrix is calculated between A and B, where each value in the matrix represents the relevance between one position in A and another position in B. Then the encoding vector of each position in A is computed as the relevance-weighted summation over all positions in B. Removing inter-attention leads to the largest performance drop across all tasks and all three metrics compared to the memory modules, which indicates the crucial role of inter-attention in our model.

In the medication recommendation task, all three metrics drop by around 0.02 when the IA component is excluded. The removal of IA leads to a dramatic performance drop of over 0.12 in DRG classification. In the invoice fraud detection task, using the AMANet model without IA, the accuracy, F1 score, and PRAUC drop by 0.061, 0.082, and 0.047, respectively. The empirical results from the three tasks suggest that integrating information from two views is beneficial for outcome prediction in certain tasks and that our designed inter-attention mechanism can capture the inter-view interactions effectively in an asynchronous setting.

Moreover, we observe improved medical code representations with the inter-attention module. We demonstrate this finding with an example from the DRG classification task in Figure 2. In this example, a patient with heart disease is treated with a sequence of medical procedures. All procedure codes and the diagnosis codes d1, d2, and d4 are heart disease related. Procedure code p3 is a diagnostic operation, which differs from the other, therapeutic operations here. In addition, the hepatic cyst (d7) is not treated, and it should not be related to any procedure codes. Figure 2(a) is well aligned with the medical knowledge above. However, the medical code representations from AMANet without the IA module are not consistent with the medical knowledge, as shown in Figure 2(b). Therefore, we can conclude that the inter-attention module contributes to the representation learning across code sets.

Figure 2: The cosine similarity between a diagnosis sequence (vertical axis) and a procedure sequence (horizontal axis). Each block is colored by the similarity value. Positive cosine similarity values are colored in red, while negative similarity values are colored in white. (a) AMANet with IA module; (b) AMANet without IA module.

Memory. We incorporate two memory modules in our model to store global and local information, using dynamic external memory and history attention memory, respectively. In the medication recommendation task, the dynamic external memory and history attention memory contribute similarly to the overall prediction on all three metrics. In all three tasks, the performance is reduced when a memory component is removed. Both modules are necessary based on the empirical experimental results.

5 CONCLUSIONS

In this work, we propose a novel architecture, AMANet, based on attention and memory for dual asynchronous sequential learning. The new model learns the intra-view interactions and inter-view interactions through the self-attention and inter-attention mechanisms, respectively. We use dynamic external memory to represent the common knowledge from the data. We also introduce a history attention memory module to store the historical representations of the same object. We evaluate our model on the medication recommendation task, the DRG classification task, and invoice fraud detection, with superior performance compared to other baseline and state-of-the-art models. The ablation study emphasizes the importance of modeling inter-view interactions and shows that the inter-attention module can model the inter-view interactions effectively. The AMANet framework is formulated using dual-view sequences in this study, but it can be applied to multi-view sequences directly. In the future, we will employ this framework to study multi-view sequential learning problems.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful comments. This work is supported by the Department of Data Intelligence Products of Alibaba Cloud in Alibaba Group.
REFERENCES
[1] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).
[2] T. Baltrušaitis, C. Ahuja, and L. P. Morency. 2019. Multimodal machine learning: A survey and taxonomy. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 41(2). 423–443.
[3] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[4] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart. 2016. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems (NeurIPS). 3504–3512.
[5] Daniel Gartner, Rainer Kolisch, Daniel B Neill, and Rema Padman. 2015. Machine learning approaches for early DRG classification and resource allocation. INFORMS Journal on Computing 27, 4 (2015), 718–734.
[6] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, and et al. 2016. Hybrid computing using a neural network with dynamic external memory. In Nature, Vol. 538(7626). 471.
[7] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. In Neural Computation, Vol. 9(8). 1735–1780.
[8] B. Jin, H. Yang, L. Sun, C. Liu, Y. Qu, and J. Tong. 2018. A Treatment Engine by Predicting Next-Period Prescriptions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1608–1616.
[9] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, and et al. 2016. MIMIC-III, a freely accessible critical care database. In Scientific Data, Vol. (3). 160035.
[10] D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[11] T. N. Kipf and M. Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR).
[12] H. Le, T. Tran, and S. Venkatesh. 2018. Dual memory neural computer for asynchronous two-view sequential learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1637–1645.
[13] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2980–2988.
[14] F. Ma, R. Chitta, J. Zhou, Q. You, T. Sun, and J. Gao. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1903–1911.
[15] S. S. Rajagopalan, L. P. Morency, T. Baltrusaitis, and R. Goecke. 2016. Extending long short-term memory for multi-view structured learning. In European Conference on Computer Vision (ECCV). 338–353.
[16] J. Shang, C. Xiao, T. Ma, H. Li, and J. Sun. 2019. Graph augmented memory networks for recommending medication combination. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). 1126–1133.
[17] H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias. 2018. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 4091–4098.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, and et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). 5998–6008.
[19] Zhaoxin Wang, Rui Liu, Ping Li, and Chenghua Jiang. 2014. Exploring the transition to DRGs in developing countries: a case study in Shanghai, China. Pakistan Journal of Medical Sciences 30, 2 (2014), 250.
[20] J. Weston, S. Chopra, and A. Bordes. 2015. Memory networks. In 3rd International Conference on Learning Representations (ICLR).
[21] C. Xu, D. Tao, and C. Xu. 2013. A survey on multi-view learning. In ArXiv. arXiv:1304.5634.
[22] Y. Xu, S. Biswal, S. R. Deshpande, K. O. Maher, and J. Sun. 2018. RAIM: Recurrent Attentive and Intensive Model of Multimodal Patient Monitoring Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 2565–2573.
[23] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 1480–1489.
[24] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L. P. Morency. 2018. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 5634–5641.
[25] Y. Zhang, R. Chen, J. Tang, W. F. Stewart, and J. Sun. 2017. LEAP: Learning to prescribe effective and safe treatment combinations for multimorbidity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1315–1324.
[26] J. Zhao, X. Xie, X. Xu, and S. Sun. 2017. Multi-view learning overview: Recent progress and new challenges. In Information Fusion, Vol. 38. 43–54.
A PSEUDO-CODE

The pseudo-code of our model is given as Algorithm 1 and Algorithm 2. Algorithm 1 is applicable to the medication recommendation task; Algorithm 2 is applicable to the DRG classification and invoice fraud detection tasks. The code is open-sourced at https://github.com/non-name-2020/AMANet. The dataset for the medication recommendation task is available at https://mimic.physionet.org. We will also open the datasets of the DRG classification and invoice fraud detection tasks.

Algorithm 1 Training AMANet

Input: $LR$, $d$, $AS$, $NH$, $\alpha$, $\beta$, $epoch$, sequences $X^1$, sequences $X^2$, and labels $y$
Output: Jaccard, F1, PRAUC
1: procedure training
2:   Initialize $W_E^{\{1,2\}}$ (two token embedding matrices), $W_\xi^{\{1,2\}}$ (two DEM interface matrices), $W_{DEM}^{\{1,2\}}$ (two DEM content matrices), $W_h^{\{1,2\}}$, $b_h^{\{1,2\}}$ (two HAM matrices and two HAM vectors), $w_c^{\{1,2\}}$ (two history context vectors)
3:   for $i = 1 \to epoch$ do
4:     for $object_i$ in training set do
5:       for $visit_j$ in $object_i$'s visits do
6:         $X_{ij}^1, X_{ij}^2, y_{ij} \leftarrow visit_j$
7:         $M_{ij}^{\{1,2\}} \leftarrow$ one-hot encoding of $X_{ij}^{\{1,2\}}$
8:         $TE_{ij}^{\{1,2\}} \leftarrow M_{ij}^{\{1,2\}} \times W_E^{\{1,2\}}$
9:         $PE_{ij}^{\{1,2\}} \leftarrow$ positional encoding of $X_{ij}^{\{1,2\}}$
10:        $E_{ij}^{\{1,2\}} \leftarrow TE_{ij}^{\{1,2\}} + PE_{ij}^{\{1,2\}} \in \mathbb{R}^{L_{ij}^{\{1,2\}} \times d}$
11:        $E_{intra,ij}^{\{1,2\}} \leftarrow \mathrm{MultiHead}(E_{ij}^{\{1,2\}})$
12:        $\xi_{ij}^{\{1,2\}} \leftarrow \mathrm{pool}(E_{intra,ij}^{\{1,2\}} \times W_\xi^{\{1,2\}})$
13:        update $W_{DEM}^{\{1,2\}}$ by $\xi_{ij}^{\{1,2\}}$
14:        $r_{ij}^{\{1,2\}}$ read from $W_{DEM}^{\{1,2\}}$
15:        $E_{inter,ij}^1 \leftarrow \mathrm{MultiHead}(E_{ij}^1, E_{ij}^2)$
16:        $E_{inter,ij}^2 \leftarrow \mathrm{MultiHead}(E_{ij}^2, E_{ij}^1)$
17:        $e_{ij}^{\{1,2\}} \leftarrow \mathrm{concat}(\mathrm{pool}(E_{intra,ij}^{\{1,2\}}), \mathrm{pool}(E_{inter,ij}^{\{1,2\}}))$
18:        save $e_{ij}^{\{1,2\}}$ to HAM
19:        $h_{ij}^{\{1,2\}} \leftarrow \sum_{k=1}^{j-1} \alpha_{ik}^{\{1,2\}} e_{ik}^{\{1,2\}}$
20:        where $\alpha_{ik}^{\{1,2\}} = e^{\mu_{ik}^{\{1,2\}T} w_c^{\{1,2\}}} / \sum_l e^{\mu_{il}^{\{1,2\}T} w_c^{\{1,2\}}}$,
21:        $\mu_{ik}^{\{1,2\}} = \tanh(W_h^{\{1,2\}} e_{ik}^{\{1,2\}} + b_h^{\{1,2\}})$,
22:        $k \in 1 \to j-1$
23:        $\hat{y}_{ij} \leftarrow \mathrm{classify}(\mathrm{concat}(e_{ij}^1, e_{ij}^2, r_{ij}^1, r_{ij}^2, h_{ij}^1, h_{ij}^2))$
24:        compute loss
25:        update the network parameters based on the loss using backpropagation
26:      end for
27:    end for
28:    compute Jaccard, F1, PRAUC on validation set
29:  end for
30:  best_model $\leftarrow \mathrm{argmax}_{model}$\{F1\}
31:  return best_model
32: end procedure
33: procedure evaluation
34:   compute Jaccard, F1, PRAUC on test set
35:   return Jaccard, F1, PRAUC
36: end procedure

Algorithm 2 Training AMANet without HAM

Input: $LR$, $d$, $AS$, $NH$, $\alpha$, $\beta$, $epoch$, sequences $X^1$, sequences $X^2$, and labels $y$
Output: Accuracy, F1, PRAUC
1: procedure training
2:   Initialize $W_E^{\{1,2\}}$ (two token embedding matrices), $W_\xi^{\{1,2\}}$ (two DEM interface matrices), $W_{DEM}^{\{1,2\}}$ (two DEM content matrices)
3:   for $i = 1 \to epoch$ do
4:     for $object_i$ in training set do
5:       $X_i^{\{1,2\}}, y_i \leftarrow object_i$
6:       $M_i^{\{1,2\}} \leftarrow$ one-hot encoding of $X_i^{\{1,2\}}$
7:       $TE_i^{\{1,2\}} \leftarrow M_i^{\{1,2\}} \times W_E^{\{1,2\}}$
8:       $PE_i^{\{1,2\}} \leftarrow$ positional encoding of $X_i^{\{1,2\}}$
9:       $E_i^{\{1,2\}} \leftarrow TE_i^{\{1,2\}} + PE_i^{\{1,2\}} \in \mathbb{R}^{L_i^{\{1,2\}} \times d}$
10:      $E_{intra,i}^{\{1,2\}} \leftarrow \mathrm{MultiHead}(E_i^{\{1,2\}})$
11:      $\xi_i^{\{1,2\}} \leftarrow \mathrm{pool}(E_{intra,i}^{\{1,2\}} \times W_\xi^{\{1,2\}})$
12:      update $W_{DEM}^{\{1,2\}}$ by $\xi_i^{\{1,2\}}$
13:      $r_i^{\{1,2\}}$ read from $W_{DEM}^{\{1,2\}}$
14:      $E_{inter,i}^1 \leftarrow \mathrm{MultiHead}(E_i^1, E_i^2)$
15:      $E_{inter,i}^2 \leftarrow \mathrm{MultiHead}(E_i^2, E_i^1)$
16:      $e_i^{\{1,2\}} \leftarrow \mathrm{concat}(\mathrm{pool}(E_{intra,i}^{\{1,2\}}), \mathrm{pool}(E_{inter,i}^{\{1,2\}}))$
17:      $\hat{y}_i \leftarrow \mathrm{classify}(\mathrm{concat}(e_i^1, e_i^2, r_i^1, r_i^2))$
18:      compute loss
19:      update the network parameters based on the loss using backpropagation
20:    end for
21:    compute Accuracy, F1, PRAUC on validation set
22:  end for
23:  best_model $\leftarrow \mathrm{argmax}_{model}$\{F1\}
24:  return best_model
25: end procedure
26: procedure evaluation
27:   compute Accuracy, F1, PRAUC on test set
28:   return Accuracy, F1, PRAUC
29: end procedure