Learning what to look in chest X-rays with a recurrent visual attention model

Petros-Pavlos Ypsilantis and Giovanni Montana
Department of Biomedical Engineering, King's College London
London, SE1 7EH
[email protected]  [email protected]

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

X-rays are commonly performed imaging tests that use small amounts of radiation to produce pictures of the organs, tissues, and bones of the body. X-rays of the chest are used to detect abnormalities or diseases of the airways, blood vessels, bones, heart, and lungs. In this work we present a stochastic attention-based model that is capable of learning what regions within a chest X-ray scan should be visually explored in order to conclude that the scan contains a specific radiological abnormality. The proposed model is a recurrent neural network (RNN) that learns to sequentially sample the entire X-ray and to focus only on informative areas that are likely to contain the relevant information. We report on experiments carried out with more than 100,000 X-rays containing enlarged hearts or medical devices. The model has been trained using reinforcement learning methods to learn task-specific policies.

1 Introduction

Chest X-rays (CXR) are the most commonly used diagnostic exams for chest-related diseases. They use a very small dose of ionizing radiation to produce pictures of the inside of the chest. CXR scans help radiologists to diagnose or monitor treatment for conditions such as pneumonia, heart failure, emphysema, lung cancer, positioning of medical devices, as well as fluid and air collection around the lungs. An expert radiologist is typically able to detect radiological abnormalities by looking in the 'right places' and making quick comparisons to normal standards. For example, for the detection of an enlarged heart, or cardiomegaly, the size of the heart is assessed in relation to the total thoracic width. Given that chest X-rays are routinely used to detect several abnormalities or diseases, a careful interpretation of a scan requires expertise and time resources that are not always available, especially since large numbers of CXR exams need to be reported daily. This leads to diagnostic errors that, for some pathologies, have been estimated to be in the range of 23% [10].

Our ultimate objective is to develop a fully-automated system that learns to identify radiological abnormalities using only large volumes of labelled historical exams. We are motivated by recent work on attention-based models, which have been used for digit classification [6], sequential prediction of street view house numbers [1] and fine-grained categorization [7]. However, we are not aware of applications of such attention models to the challenging task of chest X-ray interpretation. Here we report on the initial performance of a recurrent attention model (RAM), similar to the model originally presented in [6], and trained end-to-end on a very large number of historical X-ray exams.

2 Dataset

For this study we collected and prepared a dataset consisting of 745,480 X-ray plain films of the chest along with their corresponding radiological reports. All the exams were extracted from the historical archives of Guy's and St Thomas' Hospital in London (UK), and covered more than a decade, from 2005 to 2016. Each scan was labelled according to the clinical findings that were originally reported by the consultant radiologist and recorded in an electronic clinical report. The labelling task was automated using a natural language processing (NLP) system that implements a combination of machine learning and rule-based algorithms for clinical entity recognition, negation detection and entity classification.
An early version of the system used a bidirectional long short-term memory (LSTM) model for modelling the radiological language and detecting clinical findings and their negations [2].

For the purpose of this study, we only used scans labelled as normal (i.e. those with no reported abnormalities), those reported as having an enlarged heart (i.e. a large cardiac silhouette), and those reported as containing a medical device (e.g. a pacemaker). The numbers of scans within these three categories were 57,595, 31,528 and 22,804, respectively. We were interested in the detection of enlarged hearts and medical devices; a separate model was trained for each task and tested on 6,000 and 4,000 randomly selected exams, respectively. All the remaining images were used for both training and validation. In all our experiments we scaled the images down to 256×256 pixels.

3 Recurrent attention model (RAM)

The RAM model implemented here is similar to the one originally proposed in [6]. Mimicking the human visual attention mechanism, this model learns to focus on and process only a certain region of an image that is relevant to the classification task. In this section we provide a brief overview of the model and describe how our implementation differs from the original architecture. We refer the reader to [6] for further details on the training algorithm.

Glimpse layer: At each time $t$, the model does not have full access to the input image but instead receives a partial observation, or "glimpse", denoted by $x_t$. The glimpse consists of two image patches of different size centred at the same location $l_t$, each one capturing a different context around $l_t$. Both patches are matched in size and passed as input to an encoder, as illustrated in Figure 1.

Encoder: The encoder implemented here differs from the one used in [6]. In our application we have a complex visual environment featuring high variability in both luminance and object complexity. This is due to the large variability in patients' anatomy as well as in image acquisition, since the X-ray scans were acquired using more than 20 different X-ray devices. The goal of the encoder is to compress the information of the glimpse by extracting a robust representation. To achieve this, each image of the glimpse is passed through a stack of two convolutional autoencoders with max-pooling [5]. Each convolutional autoencoder in the stack is pre-trained separately from the RAM model. During training, at each time $t$ the glimpse representation is concatenated with the location representation and passed as input to a fully connected (FC) layer. The output of the FC layer is denoted by $b_t$ and is passed as input to the core RAM model, as seen in Figure 1.

Core RAM: At each time step $t$, the output vector $b_t$ and the previous hidden representation $h_{t-1}$ are passed as input to the LSTM layer. The locator receives the hidden representation $h_t$ from the LSTM unit and passes it on to a FC layer, resulting in a vector $o_t \in \mathbb{R}^2$ (see Figure 1). The locator then decides the position of the next glimpse by sampling $l_{t+1} \sim \mathcal{N}(o_t, \Sigma)$, i.e. from a normal distribution with mean $o_t$ and diagonal covariance matrix $\Sigma$. The location $l_{t+1}$ represents the x-y coordinates of the glimpse at time step $t+1$. At the very first step, we initiate the algorithm at the centre of the image, and we always use a fixed variance $\sigma^2$.

Figure 1: RAM. At each time step $t$ the Core RAM samples a location $l_{t+1}$ indicating where to attend next. The location $l_{t+1}$ is used to extract the glimpse (red frames of different size). The image patches are down-scaled and passed through the encoder. The representation from the encoder and the previous hidden state $h_{t-1}$ of the Core RAM are passed as inputs to the LSTM at step $t$. The locator receives as input the hidden state $h_t$ of the current LSTM and then samples the location coordinates of the glimpse for the next step $t+1$. This process continues recursively until step $n$, where the output of the LSTM, $h_n$, is used to classify the input image.
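To make the glimpse mechanism concrete, the following is a minimal sketch of the two-patch extraction described above, written in Python with PyTorch. The fine patch side, the normalised coordinate convention and the function name extract_glimpse are illustrative assumptions on our part, not details taken from the original implementation.

import torch
import torch.nn.functional as F

def extract_glimpse(image, loc, size=32):
    """Crop two concentric patches around `loc` and match their sizes.

    image: (1, H, W) tensor holding a single-channel chest X-ray.
    loc:   (2,) tensor with x-y coordinates in [-1, 1] (image centre = 0).
    size:  side of the fine patch; the coarse patch is twice as large and
           is down-scaled back to `size` so that both patches match.
    """
    _, H, W = image.shape
    # Map normalised coordinates to pixel indices.
    cx = int((loc[0].item() + 1.0) / 2.0 * (W - 1))
    cy = int((loc[1].item() + 1.0) / 2.0 * (H - 1))

    patches = []
    for s in (size, 2 * size):               # fine patch, then coarse context
        half = s // 2
        # Pad so that crops near the image border stay within range.
        padded = F.pad(image.unsqueeze(0), (half, half, half, half))
        patch = padded[:, :, cy:cy + s, cx:cx + s]
        if s != size:                        # down-scale the coarse patch
            patch = F.interpolate(patch, size=(size, size),
                                  mode="bilinear", align_corners=False)
        patches.append(patch)
    # The two matched patches are stacked along the channel dimension.
    return torch.cat(patches, dim=1)         # shape (1, 2, size, size)

Each of the two matched patches would then be compressed by its pre-trained stack of convolutional autoencoders, and the resulting representation, concatenated with a representation of $l_t$ and passed through the FC layer, would yield the vector $b_t$.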
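The Core RAM recurrence can be sketched in the same spirit. The snippet below, a sketch under assumed layer sizes, updates the LSTM state, maps $h_t$ to the mean $o_t$ through a FC layer, and samples $l_{t+1} \sim \mathcal{N}(o_t, \sigma^2 I)$; the tanh squashing of the mean is our own assumption, used here only to keep the sampled locations within the image.

import torch
import torch.nn as nn

class CoreRAM(nn.Module):
    """One step of the core recurrence (illustrative layer sizes)."""

    def __init__(self, glimpse_dim=256, hidden_dim=256, sigma=0.1):
        super().__init__()
        self.rnn = nn.LSTMCell(glimpse_dim, hidden_dim)  # core recurrence
        self.locator = nn.Linear(hidden_dim, 2)          # h_t -> o_t in R^2
        self.sigma = sigma                               # fixed std deviation

    def step(self, b_t, state):
        """Update the hidden state and sample the next location l_{t+1}."""
        h_t, c_t = self.rnn(b_t, state)                  # LSTM update
        o_t = torch.tanh(self.locator(h_t))              # mean kept in [-1, 1]
        dist = torch.distributions.Normal(o_t, self.sigma)
        l_next = dist.sample()                   # l_{t+1} ~ N(o_t, sigma^2 I)
        log_prob = dist.log_prob(l_next).sum(dim=-1)     # kept for REINFORCE
        return l_next, log_prob, (h_t, c_t)

At the final step $n$, the hidden state $h_n$ would be passed to a classification layer, as described in the caption of Figure 1; the accumulated log-probabilities of the sampled locations are what the REINFORCE training rule of [6] operates on.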
4 Results

Table 1 summarizes the classification performance of the RAM model alongside that of state-of-the-art convolutional neural networks trained and tested on the same dataset. RAM, using 5 million parameters, reaches 90.6% and 91.0% accuracy for the detection of medical devices and enlarged hearts, respectively. For the same tasks, Inception-v3 [9] achieves the highest accuracy with 91.3% and 91.4%, but uses 4 times more parameters than the RAM model.

Table 1: Accuracy (%) on the classification between images with a normal radiological appearance and images with an enlarged heart or medical devices.

Model                 Enlarged Heart    Medical Devices    Number of Parameters
VGG [8]                    89.1              86.2              ∼68 million
ResNet-18 [3]              89.5              88.4              ∼11 million
Inception-v3 [9]           91.4              91.3              ∼21 million
AlexNet [4]                89.7              87.4              ∼70 million
RAM                        91.0              90.6               ∼5 million
baseline ResNet-18         90.0              87.2               ∼5 million
baseline VGG               88.9              86.3               ∼5 million
baseline AlexNet           90.0              86.3               ∼5 million

In Figure 2 we illustrate the performance on the validation set, and highlight the locations attended by the model when trying to detect medical devices. Initially the model explores randomly selected portions of the image, and its classification performance remains low. After a certain number of epochs, the model discovers that the most informative parts of a chest X-ray are those containing the lungs and spine, and selectively prefers those regions in subsequent paths. This is a reasonable policy since most of the medical devices to be found in chest X-rays, such as pacemakers and tubes, are located in those areas.

Figure 3 (A) shows the locations most frequently attended by the RAM model when looking for medical devices. From this figure it is evident that the learnt policy explores only the relevant areas where these devices can generally be found. Two examples of paths followed by the algorithm after learning the policy are illustrated in Figures 3 (B) and (C). In these examples, starting from the centre of the image, the algorithm moves closer to a region that is likely to contain a pacemaker, which is then correctly identified. The circle and triangle points (in red) indicate the coordinates of the first and last glimpse of the learnt policy, respectively.

Analogously, Figure 4 (A) highlights the locations frequently explored when trying to discriminate between normal and enlarged hearts. Here it can be observed how the model learns to focus on the cardiac area. Two samples of the learned policy are illustrated in Figures 4 (B) and (C). The trajectories followed here demonstrate how the policy has learned that exploring the extremities of the heart is required in order to conclude whether the heart is enlarged or not.

Figure 2: Top: Accuracy (%) on the validation set during the training of the model. Bottom: Image locations that the model attends during validation. The grid cells correspond to image areas of size 25×25 pixels. We split the total number of epochs into chunks of 500 epochs. For each chunk the model is validated 100 times and the locations from each validation are summarized in the corresponding image. Image regions with low transparency correspond to locations that the model visits with high frequency.

Figure 3: (A) Image locations attended by the RAM model for the detection of medical devices. (B) and (C) are two different samples of the learnt policy on test images.

Figure 4: (A) Image locations attended by the RAM model for the detection of enlarged hearts. (B) and (C) are two different samples of the learnt policy on test images.
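The policies visualised in Figures 2, 3 and 4 emerge from the reinforcement learning training of the locator: the sampling of $l_{t+1}$ is not differentiable, so the location policy is trained with a REINFORCE-style policy gradient, as in [6]. The following is a minimal sketch of such a hybrid objective, assuming a terminal reward of 1 for a correct classification, a learned baseline for variance reduction, and equal weighting of the three terms; the function name ram_loss and these particular choices are illustrative rather than the exact configuration used here.

import torch

def ram_loss(class_logits, labels, log_probs, baseline):
    """Combine the classification loss with the REINFORCE term.

    class_logits: (B, num_classes) output of the final classification layer.
    labels:       (B,) ground-truth labels.
    log_probs:    (B, T) log-probabilities of the sampled locations l_1..l_T.
    baseline:     (B, T) learned value estimates used to reduce variance.
    """
    # Terminal reward: 1 if the final prediction is correct, 0 otherwise.
    reward = (class_logits.argmax(dim=1) == labels).float()   # (B,)
    reward = reward.unsqueeze(1).expand_as(log_probs)         # (B, T)

    # REINFORCE: raise the log-probability of location sequences that
    # led to a correct classification (advantage = reward - baseline).
    advantage = reward - baseline.detach()
    reinforce_loss = -(log_probs * advantage).mean()

    # The classifier is trained with ordinary cross-entropy, while the
    # baseline regresses towards the observed reward.
    ce_loss = torch.nn.functional.cross_entropy(class_logits, labels)
    baseline_loss = torch.nn.functional.mse_loss(baseline, reward)
    return ce_loss + reinforce_loss + baseline_loss

In practice the baseline would typically be a small FC head on the hidden state, trained jointly with the rest of the model.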
5 Conclusion and Perspectives

In this work we have investigated whether a visual attention mechanism, the RAM model, is capable of learning how to interpret chest X-ray scans. Our experiments show that the model not only has the potential to achieve classification performance comparable to state-of-the-art convolutional architectures while using far fewer parameters, but also learns to identify the specific portions of the images that are likely to contain the anatomical information required to reach correct conclusions. The relevant areas are explored according to policies that seem appropriate for each task. Current work is being directed towards enabling the model to learn each policy as quickly and precisely as possible, using full-scale images and a much larger number of clinically important radiological classes.

References

[1] Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. In: ICLR (2015)

[2] Cornegruta, S., Bakewell, R., Withey, S., Montana, G.: Modelling radiological language with bidirectional long short-term memory networks. In: EMNLP (2016)

[3] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv:1512.03385v1 (2015)

[4] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114 (2012)

[5] Masci, J., Meier, U., Ciresan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: ICANN (2011)

[6] Mnih, V., et al.: Recurrent models of visual attention. In: NIPS (2014)

[7] Sermanet, P., Frome, A., Real, E.: Attention for fine-grained categorization. arXiv:1412.7054v3 (2015)

[8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)

[9] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. arXiv:1512.00567v3 (2015)

[10] Tudor, G., Finlay, D., Taub, N.: An assessment of inter-observer agreement and accuracy when reporting plain radiographs. Clinical Radiology 52(3), 235–238 (1997)