Lurking Malice in the Cloud: Understanding and Detecting Cloud Repository as a Malicious Service

Xiaojing Liao¹, Sumayah Alrwais², Kan Yuan², Luyi Xing², XiaoFeng Wang², Shuang Hao³, Raheem Beyah¹
¹Georgia Institute of Technology, ²Indiana University Bloomington, ³University of California, Santa Barbara
{xliao,rbeyah}@gatech.edu, {salrwais,kanyuan,luyixing,xw7}@indiana.edu, [email protected]

Abstract

The popularity of cloud hosting services also brings in new security challenges: it has been reported that these services are increasingly utilized by miscreants for their malicious online activities. Mitigating this emerging threat, posed by such "bad repositories" (simply Bar), is challenging due to a hosting strategy that differs from that of traditional hosting services, the lack of direct observations of the repositories by those outside the cloud, the reluctance of the cloud provider to scan its customers' repositories without their consent, and the unique evasion strategies employed by the adversary. In this paper, we took the first step toward understanding and detecting this emerging threat. Using a small set of "seeds" (i.e., confirmed Bars), we identified a set of collective features from the websites they serve (e.g., attempts to hide Bars), which uniquely characterize the Bars. These features were utilized to build a scanner that detected over 600 Bars on leading cloud platforms like Amazon and Google, and 150K sites using them, including popular ones like groupon.com. Highlights of our study include the pivotal roles played by these repositories in malicious infrastructures, as well as other important discoveries, such as how the adversary exploited legitimate cloud repositories and why the adversary uses Bars in the first place, which have never been reported before. These findings bring such malicious services into the spotlight and contribute to a better understanding, and ultimately the elimination, of this new threat.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CCS'16, October 24-28, 2016, Vienna, Austria
© 2016 ACM. ISBN 978-1-4503-4139-4/16/10...$15.00
DOI: http://dx.doi.org/10.1145/2976749.2978349

1. INTRODUCTION

Cloud hosting services today serve over a billion users worldwide, providing them with stable, low-cost, reliable, high-speed and globally available resource access. For example, Amazon Simple Storage Service (S3) is reported to store over 2 trillion objects for web and image hosting, system backup, etc. In addition to storing data, these services are moving toward a more active role in supporting their customers' computing missions, through sharing the repositories (a.k.a. bucket for Google Cloud [?]) hosting various dynamic content and programming tools. A prominent example is Google's Hosted Libraries [?], a content distribution network (CDN) for disseminating the most popular, open-source JavaScript resources, which web developers can easily incorporate into their websites through a simple code snippet. In addition to benign users, the popularity of these services has also attracted cybercriminals. Compared with dedicated underground hosting services, repositories on legitimate commercial clouds are more reliable and harder to blacklist. They are also much cheaper: for example, it is reported that 15 GB on the dark net is sold at $15 per month [?], while the same amount is offered for free by Google to every Google Drive user. Indeed, it has been reported [?] that malware distributors are increasingly using commercial clouds to process and deploy malicious content.

Understanding bad cloud repositories: challenges. Although there have been indications of cloud hosting misuse, understanding how such services are abused is challenging. Service providers, who are bound by their privacy commitments and ethical concerns, tend to avoid inspecting the content of their customers' repositories in the absence of proper consent. Even when the providers are willing to do so, determining whether a repository involves malicious content is by no means trivial: nuts and bolts for malicious activities could appear perfectly innocent before they are assembled into an attack machine; examples include image files for Spam and Phishing, as shown in Figure 1. Actually, even for a repository confirmed to serve malicious content like malware, today's cloud providers tend to remove only that specific content, instead of terminating the whole account, to avoid collateral damage (e.g., compromised legitimate repositories). Exploring the issue becomes even more difficult for a third party, who does not have the ability to directly observe the repositories and can only access them through the websites or sources that utilize the storage services. Further adding to the complexity of finding such a repository is the diverse roles it may play in attack infrastructures (e.g., serving malware for one attack and serving Phishing content for another), due to the mixed content a single repository may host: e.g., malware together with Phishing images. As a result, existing techniques (e.g., those for detecting dedicated malicious services [?][?]) cannot be directly applied to capture the repository, simply because their original targets often contain more homogeneous content (e.g., just malware) and contribute to different campaigns in the same way. So far, little has been done to understand the scope and magnitude of malicious or compromised repositories on legitimate clouds (called Bad Repository or simply Bar in our research) and the technical details about their services to the adversary, not to mention any effort to mitigate the threat they pose.

Figure 1: Example of deceptive images in Amazon S3 bucket cicloudfront used for malvertising. The image was shown at the bottom of a web page as an update notification to lure visitors to download malware.

Finding "Bars" online. In this paper, we present the first systematic study on the abuses of cloud repositories on legitimate cloud platforms as a malicious service, which was found to be highly pervasive, acting as a backbone for large-scale malicious web campaigns (Section 4). Our study was bootstrapped by a set of "seeds": 100 confirmed malicious or compromised buckets [?], each of which is a cloud resource repository with stored objects (often of different types) organized under a unique identification key. These buckets were collected from Spam messages or the malicious URLs cached by a popular malware scanner. Comparing them with those known to be legitimate, we found that despite the various roles each bucket plays in different types of attacks (due to the diversity in the content it serves), the websites connecting to those buckets still exhibit prominent common features (see Section 3.1): in particular, the presence of "gatekeeper" sites that cover the Bars (a valuable asset for the adversary), remarkably homogeneous redirection behavior (i.e., fetching repository resources indirectly through other sites' references) and sometimes similar content organizations, due to the same attack payload the compromised sites upload from their backend (i.e., the Bars), or the templates the bucket provides to the adversary for quick deployment of her attack sites. By comparison, a legitimate bucket (e.g., a reputable jQuery repository) tends to be directly accessed by websites with highly diverse content.

Based on this observation, we developed BarFinder, a scanner that automatically detects Bars through inspecting the topological relations between websites and the cloud bucket they use, in an attempt to capture Bars based on the external features of the websites they serve. More specifically, for all the sites connecting to a repository, our approach correlates the domains and URLs (particularly those related to cloud repositories) across their redirection chains and content features across their DOM structures to identify the presence of gatekeepers and evading behavior, and also to measure the diversity of their content organization. A set of new collective features generated in this way, including bucket usage similarity, connection ratio, landing similarity and others (Section 3.1), are further utilized by a classifier to find suspicious buckets. Running the scanner over all the data collected by the Common Crawl [?], which indexed five billion web pages, for those associated with all major cloud storage providers (including Amazon S3, Cloudfront, Google Drive, etc.), we found around 1 million sites utilizing 6,885 repositories hosted on these clouds. Among them, BarFinder identified 694 malicious or compromised repositories, involving millions of files, with a precision of 95% and a coverage of 90% against our ground-truth set.

Our discoveries. Looking into the Bars identified by our scanner, we are surprised by the scope and the magnitude of the threat. These buckets are hosted by the most reputable cloud service providers. For example, 13.7% of Amazon S3 repositories and 5.5% of Google repositories that we inspected turned out to be either compromised or completely malicious¹. Among those compromised are popular cloud repositories such as Groupon's official bucket. Altogether, 472 such legitimate repositories were considered to be contaminated, due to a misconfiguration flaw never reported before, which allows arbitrary content to be uploaded and existing data to be modified without proper authorization. The impact of these Bars is significant, infecting 1,306 legitimate websites, including Alexa top 300 sites like groupon.com, Alexa top 5,000 sites like space.com, etc. We reported our findings to Amazon and leading organizations affected by the infections. Groupon has already confirmed the compromise we discovered and awarded us for our help.

When it comes to malicious buckets, our study brings to light new insights into this new wave of repository-based cyber-attacks, including the importance of Bars to malicious web activities and the challenges in defending against this new threat. More specifically, we found that on average, one Bar serves 152 malicious or compromised sites. In one of the large campaigns discovered in our research, the Bar cloudfront_file.enjin.com hosts a malicious script that was injected into at least 1,020 websites (Section 4.1). These Bars sit right at the center of the attack infrastructure, supporting and coordinating other malicious actors' operations at different stages of a campaign. Interestingly, we found that they could be strategically placed on different cloud platforms, making them hard to block (due to the popularity of their hosting clouds like Google) and to detect (scattered across different providers), and easy to share across multiple campaigns. As an example, the Potentially Unwanted Programs (PUP) campaign we found first loads a redirection script from a Bar on Akamaihd (the world's largest CDN platform) to lead the victim to the attack website, then fetches Phishing pictures from an Amazon S3 Bar, and finally delivers the malware stored on Cloudfront to the target systems (Section 4.4). In the presence of such meticulously planned attacks, the cloud service providers apparently are inadequately prepared, possibly due to the privacy constraints in touching their customers' repositories. We found that many Bars remained active during our study and survived for much longer than the malicious content hosted on websites (Section 4.3). Further complicating the mission of Bar identification are other evasion techniques the adversary employs, including code obfuscation and the use of redirection chains and cloaking techniques to avoid exposing malicious payloads to a malware scanner.

Contributions. The contributions of the paper are as follows:

• New understanding. We performed the first systematic study on cloud repositories as a malicious service, an emerging security threat. For the first time, our study reveals the scope and magnitude of the threat and its significant impact, particularly on the infrastructures of illicit web activities. These findings bring to the spotlight this important yet understudied problem and lead to a better understanding of the techniques the adversary employs and their weaknesses. This will contribute to better defense against and ultimate elimination of the threat.

• New technique. Based on our understanding of bad cloud repositories, we take a first step toward automatically detecting them. The technique we developed relies on the topological relationship between a cloud repository and the websites it serves, which is difficult to change and effective at capturing malicious or compromised buckets. Our evaluation over a large number of popular websites demonstrates the potential of the technique, which could be utilized by both cloud providers and third parties to identify the threats posed by Bars.

Roadmap. The rest of the paper is organized as follows: Section 2 provides the background information and adversary model for our research; Section 3 describes our findings from the ground-truth dataset and the design and implementation of BarFinder; Section 4 provides the details of the discoveries made in our large-scale measurement study; Section 5 discusses the limitations of our work and possible future research; Section 6 compares our work with related prior research and Section 7 concludes the paper.

2. BACKGROUND

Cloud hosting. Cloud hosting is a type of infrastructure as a service
(IaaS), which is rented by the cloud user to host her web assets (e.g., HTML, JavaScript, CSS, and image files). These web assets are organized into cloud repositories referred to as buckets, which are identified by unique², user-assigned keys that are mapped as sub-domains. For example, the subdomain aws-publicdatasets.s3.amazonaws.com identifies Amazon S3 as the cloud platform and aws-publicdatasets as the user's cloud bucket and repository. Such a name assignment is labeled as s3.amazonaws.com_aws-publicdatasets throughout this paper. Also, each bucket is protected by an access control list configured by the user to authorize requests for her resources.

¹ We have manually examined and confirmed all those instances.
² The terms repositories and buckets are used interchangeably throughout this paper.

In recent years, we have seen an increase in the popularity of these services. A key feature of cloud hosting is built-in site publishing [?], where the web assets in the bucket can be served directly to users via file names in a relative path in the bucket (i.e., cloud URL). For instance, JavaScript files hosted in the cloud bucket can be directly run in the browser. Also, pay-as-you-go hosting is well received as an economic and flexible computing solution. As an example, Google Drive today offers a free web hosting service with 15GB of storage, and an additional 100GB for $1.99/month, and GoDaddy's web hosting starts at $1/month for 100GB.

Besides such front-end websites, mainstream cloud providers today (Amazon S3, Microsoft Azure, Google Drive, etc.) all allow their customers to store different kinds of web content and other resources in their cloud buckets, serving as back-end repositories that can be easily accessed by front-end applications (like the website) and shared across different parties. Figure 2 illustrates an example, in which the resource owner creates a bucket on the cloud hosting platform and uploads a script there (①); this resource (i.e., the script) is made public through a cloud URL, which can be embedded into any website (②); whenever the site is visited (③), requests will be generated for fetching the script (④) and delivering it to the visitor's browser (⑤). The bucket in the example is typical of a service repository, whose resources can be fetched and updated through a cloud URL: for example, the visitor statistics of a website can be collected through a link (s3.amazonaws.com/trk.cetrk.com/t.js), which downloads a tracking script from s3.amazonaws.com_trk.cetrk.com, a bucket owned by the tracking website Crazy Egg. This is different from a "self-serving" bucket, whose resources can only be accessed by the bucket owner's sites. Note that our study focuses on abuses of this type of cloud repository, regardless of the additional functionalities it may have (e.g., CDNs, DDoS protection, etc.), since these functionalities do not affect the way the repositories are used by either legitimate or malicious parties.

Figure 2: Overview of the cloud hosting process.

Adversary model. In our research, we consider an adversary who tries to use cloud buckets on legitimate cloud platforms as service repositories for illicit activities. For this purpose, the attacker could build her own malicious bucket or compromise legitimate ones, and store various attack vectors there, including Spam, Phishing, malware, click-hijacking and others. These buckets are connected to front-end websites, which could be malicious, compromised or legitimate ones contaminated only by the Bar.

3. FINDING BARS ONLINE

In this section, we elaborate on our analysis of a set of known Bars (the seed set) and the features identified for differentiating benign repositories and Bars. These features are utilized in our research to build a simple web scanner, BarFinder, for detecting other malicious or compromised high-profile, previously-unknown repositories and the malicious campaigns in which they serve.

3.1 Features of Bad Repositories

Our study is based on a small set of confirmed good and bad repositories and their related domains, which we analyzed to find out how Bars (bad repositories) differ from legitimate repositories. In the absence of direct access to these buckets, good or bad, all we can do is infer their legitimacy from who uses them and how they are used (by different domains), that is, from the features of the domains and their interactivities on the redirection paths leading to the cloud repository. Of particular interest here is a set of collective properties identified from the resource fetching chains (a.k.a., redirection chains) for serving the content of Bars, which are hard for the adversary to change, compared with the content features of individual Bars. Below, we elaborate on the way such data was collected and the salient features discovered in our research, which describe how the adversary attempts to hide Bars or use them to cover other attack assets, a redirection pattern never observed on legitimate repositories.

Data collection. To build the seed set, we collected a set of confirmed malicious or compromised buckets (called Badset) and legitimate buckets (called Goodset) as well as their related domains, as illustrated in Table 1.

• Badset. We utilized two feeds as the ground truth for gathering bad cloud buckets: the Spamtrap feed and the CleanMX feed [?]. The former comes from a Spam honeypot we constructed [?] that receives around 10K Spam emails per day, from which we extracted the cloud URLs promoted by the emails, which may include spam resources such as HTML, images, and scripts. The latter includes the historical data of CleanMX, a popular domain scanning engine, from which cloud-related URLs were collected. For both feeds, we further validated the URLs with VirusTotal [?] and manual inspections (e.g., looking for Phishing content) to ensure that they were indeed bad (to avoid contaminating the dataset with legitimate buckets used in malicious activities). Using the collected set of malicious cloud URLs from both feeds, we extracted their repositories, which led to 100 confirmed Bars.

• Goodset. The good buckets were gathered from the Alexa top 3K websites, which are considered to be mostly clean. To this end, we visited each website using a crawler (as a Firefox add-on) to record the HTTP traffic triggered by the visit, including network requests, responses, browser events, etc. From the collected traffic, we extracted the HTTP cloud request URLs corresponding to 300 cloud buckets hosted on 20 leading cloud hosting services like Amazon S3, Google Drive, etc. (see Appendix Table 7 for the complete list). Note that even though some of them provide CDN service or DDoS protection, they all provide hosting services that act as cloud repositories.

• Bucket-served sites and their HTTP traffic. We collected HTTP traffic using the crawler mentioned above to visit a list of websites using buckets for feature extraction. Rather than blindly crawling the web to find those sites, we adopted a more targeted strategy by crawling the sites found to contain links to the cloud in the past. We built the site list with the help of Common Crawl [?], a public big data project that crawls about 5 billion webpages each month through a large-scale Hadoop-based crawler and maintains lists of the crawled websites and their embedded links. Searching the Common Crawl [?] dataset, collected in February 2015, for the websites loading content from the 400 clean and malicious buckets identified above, we found 141,149 websites, which were used by our crawler.

Table 1: Summary results of the seed dataset.

| | # of buckets | # of linked websites | # of average linked websites | # of redirection paths |
|---|---|---|---|---|
| Badset | 100 | 12,468 | 133 | 468,480 |
| Goodset | 300 | 128,681 | 864 | 2,659,304 |

Topological features. We first inspected the topology of the redirection infrastructure associated with a specific bucket. Such an infrastructure is a collection of redirection paths, with each node being a Fully Qualified Domain Name (FQDN). On each path, the bucket is either a node when it directly participates in a redirection (e.g., its cloud URL delivers a redirection script to the visitor's browser) or simply a passive repository providing resources like pictures to other domains. Figure 3 illustrates examples of redirection paths leading to two real-world repositories, one for a legitimate bucket cloudfront.net_d24n15hnbwhuhn and the other for a Bar s3.amazonaws.com_cicloudfront.

Figure 3: Example of the redirection infrastructure leading to the legitimate bucket cloudfront.net_d24n15hnbwhuhn (a) and the Bar s3.amazonaws.com_cicloudfront (b), which are in RED color.

A key observation from our study is that the redirection infrastructure leading to a Bar tends to include features for protecting the Bar from being detected by web scanners, presumably due to the fact that the repository is often considered to be a valuable asset for the adversary. Specifically, we found that typically, there are a few gatekeeper nodes sitting in front of a Bar, serving as an intermediary to proxy the attempts to get resources from the Bar. Examples of the gatekeepers include fp125.mediaoptout.com and its downstream nodes in Figure 3(b). On the topology of such an infrastructure, these gatekeepers are the hubs receiving a lot of resource-access connections from entry sites (the first node on a redirection path, see Figure 3). Also interestingly, our research shows that some gatekeepers can access the Bar through multiple paths. For example, in Figure 3(b), krd.semantichelper.com can either go straight to s3.amazonaws.com_cicloudfront or take a detour through p306.atemada.com. This structure could be caused by the cloaking of the gatekeeper for hiding the Bar, or constructed to maintain access to the repository even when nodes (like 1.semantichelper.com) are down (detected, cleaned, etc.). Note that such a protection structure does not exist on the paths to a benign repository (Figure 3(a)): normally, the resources hosted in a repository (e.g., jQuery) are directly fetched by the website using it, without going through any redirection; even in the presence of redirections, there will not be any gatekeeper, not to mention attempts to cloak or build a backup path.

To identify this unique "protection" structure, we utilize two collective features: bucket usage similarity (BUS), which captures the topology involving hubs (gatekeepers), and connection ratio (CR), which measures the interactivities across different redirection paths (which point to the existence of cloaking behavior or attempts to maintain back-up paths to the Bar). Specifically, consider a redirection graph $G = (V, E)$ (as illustrated in Figure 3), where $V$ is the set of nodes (the FQDNs involved in a redirection) and $E$ is a set of edges from one node to the next one on individual paths: $E = \{e_{i,j} \mid \text{node } i \text{ precedes node } j \text{ on a path}\}$. The BUS is measured by $1 - \frac{i}{s}$, where $i$ is the number of immediate predecessor nodes to a repository (the domains connecting to the repository) and $s$ is the total number of entries of the repository's redirection graph. To find the CR, we first remove the bucket $b$ and all the edges to which it is attached (if they exist) to get another graph $G' = G - G_b$, where $G_b = (\{b\}, E_b)$ and $E_b = \{e_{b,j}\}$. Note that each graph $G'$ is associated with one bucket. Then, from $G'$, we find the number of connected components $n$ and calculate $CR = 1 - \frac{n}{|V|}$ (see Figure 3 for an example).

Both collective features were found to be discriminative in our research. Figures 4(a) and 4(b) compare the cumulative distributions (CDF) of the ratios between the Bad and Good sets. As we can see from the figures, Bars tend to have higher ratios than benign ones: the average BUS is 0.87 for the Bars and 0.79 for the legitimate repositories, and the CR is 0.85 for the bad repositories and 0.67 for the good ones. As mentioned earlier, this is caused by the fact that a small set of gatekeeper nodes are often placed there for protecting the Bars, while the redirection chains towards the good repositories are much more direct and independent: different organizations typically do not go through an intermediary to indirectly access a public repository like jQuery, and even within the same organization, use of such a resource is often direct. Although there can be exceptions, our measurement study shows that in general, the structural differences between malicious and legitimate repositories are stark.

Also, we found that occasionally, a Bar itself may serve as a gatekeeper, running scripts to hide more valuable attack assets, such as the attack server or other malicious landing sites. When this happens, the Bar almost always leads to a small set of successors on redirection paths (e.g., attack servers, landing sites). This is very different from the redirection performed by the script from a benign repository, for example, cloudfront.net_d24n15hnbwhuhn. In such cases, the targets of redirections are often very diverse. Based on this observation, we further measure the landing similarity, $LS = 1 - \frac{l}{s}$, where $l$ is the number of unique last nodes on the redirection paths associated with a repository. Again, as illustrated in Figure 4(c), our study shows that redirection paths involving Bars share fewer end nodes than legitimate ones, and therefore, the related redirection graphs (for Bars) have a higher landing similarity (0.94 vs 0.88).

Figure 4: Bars show smaller topological diversity. (a) Cumulative distribution of bucket usage similarity per cloud bucket. (b) Cumulative distribution of connection ratio per cloud bucket. (c) Cumulative distribution of landing similarity per cloud bucket.

Content and network features. In addition to their distinctive topological features, we found that the nodes on the redirection paths attached to a Bar often exhibit remarkable homogeneity in their content and network properties. In particular, the websites directly connecting to the repository typically use a small set of templates (like WordPress) to build their web pages, include similar DOM positions for script injection, carry similar IP addresses or even have the same content management system (CMS) vulnerabilities, etc. These properties turn out to be very diverse among those utilizing a legitimate cloud repository. For example, all websites linking to a Google Drive Bar have their malicious cloud URL (for injecting a script) placed at the bottom of the DOM of each website. In another example, we found that the front-end sites using a Cloudfront Bar actually all include a vulnerable JCE Joomla extension.

To better understand the diversity of such websites, we try to compare them according to a set of content and network properties. In our research, we utilized the properties extracted by WhatWeb [?], a popular web page scanner. WhatWeb is designed to identify the web technologies deployed, including those related to web content and communication: e.g., CMS, blogging platforms, statistic/analytics packages, JavaScript libraries, social media plugins, etc. For example, from the content

    <link rel="search"
          type="application/opensearchdescription+xml"
          href="https://wordpress.com/opensearch.xml"
          title="WordPress.com" />

we obtain the property $p$ as a key-value pair $p = (k, v) = (\text{wordpress}, \text{opensearch})$, which indicates that the website uses the wordpress plugin opensearch.

From our seed dataset, the scanner automatically extracted 372 keys of 1,596,379 properties, and then we clustered the keys into 15 classes such as Analytics and tracking, CMS and plugin, Meta-data information, etc., following the categories used by BuiltWith, a web technology search engine [?]. Some examples of these properties are presented in Table 2. In addition to these properties extracted by WhatWeb, we added further properties to characterize cloud URLs, including the position of the URL, the order in which different buckets appear in the web content and the number of cloud platforms used in a page.

Table 2: Examples of content and network features.

| Category | Feature | Example |
|---|---|---|
| Content | CMS platform information and their plugin | (wordpress, all in one SEO pack) |
| Content | Meta-data information | (metagenerator, drupal 7) |
| Content | Cloud URL information | (position, bottom) |
| Content | Advertising | (adsense, asynchronous) |
| Content | Analytics and tracking | (Google-Analytics, UA-2410076-31) |
| Content | JavaScript library | (JQuery, 1.9.1) |
| Content | Widget | (addthis, welcome bar) |
| Content | Doc Info technologies | (open graph protocol, null) |
| Network | Identity | (IP, 216.58.216.78) |
| Network | Cookie | (Cookie, harbor.session) |
| Network | Server framework version | (Apache, 2.4.12) |
| Network | Custom HTTP header | (X-hacker, If you're..) |

Based on these properties, again we utilized a topological metric to measure the overall similarity across sites. Specifically, the relations among all the sites (connecting to the same bucket) in the same category (Analytics and tracking, CMS and plugin, etc.) are modeled as a graph $G' = (V', E', P)$, where $V'$ is the set of the websites, which are characterized by a collection of properties $P$, and $E'$ is the set of edges: $E' = \{e_{i,j} \mid \text{websites } i \text{ and } j \text{ share } p \in P\}$, that is, an edge links two sites that have a common property. Over this graph, the site similarity is calculated as $SiS = 1 - \frac{n}{|V'|}$. Here $n$ is the number of connected components in the graph.

In our research, we computed SiS across all the categories summarized from the seed dataset, and compared those with Bars against those with the legitimate buckets. Again, the sites using Bars are found to share many more properties and therefore achieve a much higher similarity value than those linking to a good bucket. This is likely caused by mass-production of malicious sites using the same resources (templates, pictures, etc.) provided by a Bar or utilization of the same exploit tool stored in a Bar for compromising sites with the same vulnerabilities. Therefore, such similarity is inherent to the attack strategies and can be hard to change.

3.2 BarFinder

Design. The design of BarFinder includes a web crawler, a feature analyzer, and a detector. The crawler automatically scans the web for cloud buckets (embedded in web content) and then clusters websites according to the buckets they use. From each cluster, the analyzer constructs a redirection graph and a content graph as described earlier (Section 3.1), on which it further calculates the values for a set of collective features including disconnection ratio

Table 3: F-score of features.

| Feature | Label | Metric | F-score |
|---|---|---|---|
| Connection ratio | D | $1 - \frac{n}{|V|}$ | 0.084 |
| Bucket usage similarity | B | $1 - \frac{i}{s}$ | 0.076 |
| Landing similarity | L | $1 - \frac{l}{s}$ | 0.072 |
| CMS information | S1 | $1 - \frac{n}{|V'|}$ | 0.037 |
| Meta-data information | S2 | $1 - \frac{n}{|V'|}$ | 0.033 |
| Analytics and tracking | S3 | $1 - \frac{n}{|V'|}$ | 0.032 |
| Cloud URL information | S4 | $1 - \frac{n}{|V'|}$ | 0.031 |
| Widget | S5 | $1 - \frac{n}{|V'|}$ | 0.024 |

Table 4: Performance comparison under five-fold cross validation.

| Classifier | Precision | Recall |
|---|---|---|
| SVM | 0.94 | 0.89 |
| Decision Tree | 0.9 | 0.83 |
| Logistic Regression | 0.91 | 0.87 |
| Naive Bayes | 0.9 | 0.79 |
| Random Forest | 0.85 | 0.82 |
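The metrics listed in Table 3 can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the paths, host names and property sets below are hypothetical, $s$ is assumed to be the number of recorded redirection paths for the bucket, and connected components are counted on the undirected version of the graph, since the text does not spell out either detail.

```python
from collections import defaultdict


def count_components(nodes, adjacency):
    """Count connected components among `nodes`, treating edges as undirected."""
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbour in adjacency.get(current, ()):
                if neighbour in nodes and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def bucket_usage_similarity(paths, bucket):
    """BUS = 1 - i/s; s is assumed to be the number of redirection paths."""
    predecessors = {
        path[k - 1]
        for path in paths
        for k in range(1, len(path))
        if path[k] == bucket
    }
    return 1 - len(predecessors) / len(paths)


def connection_ratio(paths, bucket):
    """CR = 1 - n/|V|, with n the component count of G' (G without the bucket)."""
    nodes = {node for path in paths for node in path}
    adjacency = defaultdict(set)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            if bucket in (src, dst):
                continue  # edges attached to the bucket are removed with it
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    n = count_components(nodes - {bucket}, adjacency)
    return 1 - n / len(nodes)


def landing_similarity(paths):
    """LS = 1 - l/s, with l the number of distinct last nodes on the paths."""
    return 1 - len({path[-1] for path in paths}) / len(paths)


def site_similarity(site_properties):
    """SiS = 1 - n/|V'|; two sites are linked when they share a property."""
    sites_by_property = defaultdict(list)
    for site, properties in site_properties.items():
        for prop in properties:
            sites_by_property[prop].append(site)
    adjacency = defaultdict(set)
    for sites in sites_by_property.values():
        for other in sites[1:]:  # linking to the first site preserves connectivity
            adjacency[sites[0]].add(other)
            adjacency[other].add(sites[0])
    n = count_components(set(site_properties), adjacency)
    return 1 - n / len(site_properties)


# Hypothetical Bar: entry sites funnel through one gatekeeper, with a detour.
BAR = "bar-bucket.example"
bar_paths = [
    ["site-a.example", "gate.example", BAR],
    ["site-b.example", "gate.example", BAR],
    ["site-c.example", "gate.example", "detour.example", BAR],
    ["site-d.example", "gate.example", BAR],
]
# Hypothetical benign bucket: direct access, diverse landing targets.
GOOD = "good-bucket.example"
good_paths = [[f"site-{k}.example", GOOD, f"landing-{k}.example"] for k in "abcd"]

bar_scores = (
    bucket_usage_similarity(bar_paths, BAR),
    connection_ratio(bar_paths, BAR),
    landing_similarity(bar_paths),
)
good_scores = (
    bucket_usage_similarity(good_paths, GOOD),
    connection_ratio(good_paths, GOOD),
    landing_similarity(good_paths),
)

# Hypothetical (key, value) properties for one category of three sites.
cms_sites = {
    "site-a.example": {("cms", "wordpress")},
    "site-b.example": {("cms", "wordpress"), ("plugin", "opensearch")},
    "site-c.example": {("cms", "joomla")},
}
sis = site_similarity(cms_sites)
print(bar_scores, good_scores, sis)
```

On this toy input the gatekeeper structure pushes all three topological scores up for the hypothetical Bar (BUS 0.5, CR about 0.86, LS 0.75) relative to the directly accessed bucket (0, about 0.11 and 0), which is the direction of the difference reported for the seed set; the absolute values carry no meaning beyond the example.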
(D), bucket usage similarity (B), landing similarity (L) and a series of content property/network property similarities ($S_1 \cdots S_n$) for $n$ web-technology categories (e.g., analytics and tracking, CMS and plugin, meta-data information, etc.). The output of this feature analysis is then passed to the detector, which maintains a model (trained on the seed dataset) to determine whether a bucket is malicious, based on its collective features.

Specifically, the crawler visits each website, inspecting its content, triggering events, recording the redirection paths it observes and parsing the URLs encountered using the patterns of known cloud platforms to recognize cloud buckets. For example, a repository on Amazon S3 is accessed through a URL formatted as w+.s3{−w+}[?].amazonaws.com, and Amazon CloudFront produces resource URLs in the form of w+.cloudfront.net. In our research, 20 cloud platforms were examined to identify the buckets they host. At the feature-analysis stage, for each bucket, BarFinder inspects all its redirection paths, converts every node into an FQDN to compute their topological features, and then connects different nodes according to their content and network properties to find their site similarities, as described in Section 3.1.

Next, each cloud bucket $i$ is uniquely characterized by a vector $\langle D_i, B_i, L_i, S_{i,1} \cdots S_{i,n} \rangle$, with each element a collective feature. Individual features have different power in differentiating good and bad buckets, which we measured using the F-Score [?] (see Table 3). Note that a feature with a large score can better classify these vectors than one with a small value. Therefore, a binary classifier with a model for weighing the features and other parameters can be used to classify the vector set and determine whether individual buckets are legitimate or not. Such a model is learned from the seed dataset. In our research, we utilized a Support Vector Machine (SVM) as the classifier, which showed the best performance among the classification algorithms we compared (see Table 4). Its classification model is built upon the F-Scores for the collective features (D, B, etc.) and a threshold set according to the false positive and false negative rates we expect to achieve. For each bucket classified, the SVM can also report the confidence of the classification.

Implementation. This simple design was implemented in our study as a prototype system. The web crawler was built as a Firefox add-on. In total, 20 such crawlers were deployed. We further developed a tool in Python to recover cloud URLs from the web content gathered by Common Crawl. The feature analyzer includes around 500 lines of Python code for processing the data collected by the crawler and computing the collective features (Section 3.1). Each feature in the vector is normalized using the L1 norm before being passed to the SVM classifier. In our system, we incorporated the SVM provided by the scikit-learn open-source machine learning library [?].

3.3 Evaluation

Here we report our evaluation of BarFinder on both the ground truth and the Unknown sets. All the experiments were conducted within an Amazon EC2 C4.8xlarge instance equipped with an Intel Xeon E5-2666 with 36 vCPUs and 60 GiB of memory.

Evaluation on the seed set. We tested the effectiveness of BarFinder over our ground-truth dataset (i.e., the seed set) through standard five-fold cross validation: that is, 4/5 of the data was used for training the SVM and the remaining 1/5 for evaluating the accuracy of Bar detection. Specifically, we randomly chose 80 Bars (out of 100) from the Badset and 240 (out of 300) legitimate buckets from the Goodset, together with the related websites (out of 141,149). These data were first processed by our prototype to adjust the weights and other parameters for its model. Then we tested the model on the remaining dataset (20 Bars, 60 legitimate buckets). The process was then repeated 5 times. BarFinder achieved both a low false discovery rate (FDR: 1 − precision) and a high recall in detection: only 5.6% of reported Bars turned out to be legitimate (i.e., a false positive rate of 1.6%), and over 89.3% of the Bars were detected. We further show the Area Under Curve (AUC) of the Receiver Operating Characteristics (ROC) graph, which comes very close to 1 (0.96), demonstrating the good balance we strike between the FDR and the coverage. This preliminary analysis shows that the collective features of the sites connecting to cloud repositories are promising in detecting Bars.

Evaluation on the Unknown set. We now use BarFinder to scan an unknown set. This unknown set contains HTTP traffic collected using a crawler as described in Section 3.1 to visit a list of websites. This list of websites is also extracted from Common Crawl [?] by searching for websites that have loaded some content in the past from the cloud platforms listed in Table 7. As a result, the unknown dataset contained HTTP traffic generated from dynamically visiting 1M websites loading content from 20 cloud platforms and 6,885 cloud buckets.

To validate our evaluation results, we employ a methodology that combines anti-virus (AV) scanning, blacklist checking, and manual analysis. Specifically, for the Bars flagged by our system, we first scan their cloud URLs with VirusTotal for malware and check them against the list of suspicious cloud URLs collected from our Spamtrap honeypot for Spam, Phishing, blackhat Search Engine Optimization (SEO), etc. In the case of VirusTotal, a URL is considered to be suspicious if at least two scanners raise the alarm. All such suspicious URLs (from either VirusTotal or the Spamtrap list) are cross-checked against the blacklist of CleanMX. Only those also found there are reported to be true positives. Once a URL is confirmed malicious, its corresponding bucket is labeled as bad. Buckets that were flagged by BarFinder but remained unlabeled are further validated manually.

In the experiment, BarFinder reported a total of 730 Bars, about 10.6% of the 6,885 buckets. Among them, the AV scanning and blacklist verification confirmed that 502 buckets were indeed bad. The remaining 228 were manually analyzed through, e.g., inspecting the resources in the buckets for phishing or scam content, and running scripts in a VM to capture binary code downloads. This validation further confirmed 192 Bars. The FDR was found to be at most 5% (assuming those not confirmed to be legitimate, i.e., 730 − 502 − 192 = 36 buckets, or about 4.9% of the 730 reported), in line with the finding from the seed set.

4. MEASUREMENT AND DISCOVERIES

Based on the discoveries made by BarFinder, we further conducted a measurement study to better understand the fundamental issues about Bar-based malicious services, particularly how the

(i.e., front-end websites), through which they are further attached to 6,513,519 redirection paths involving 166,772 domains. Figure 5 illustrates the number of Bars we found on different cloud platforms. Among them, Amazon S3 is the most popular one in our dataset, hosting the most Bars (45%), followed by CloudFront (Amazon's CDN) with 25.1% and Akamaihd with 9.3%. Note that of these 20 clouds, seven provide free storage services (e.g., 15GB of free space on Google Drive, 5GB on Amazon S3), and therefore easily become ideal platforms for low-budget miscreants to distribute their illicit content. Also, eleven of them support HTTPS, over which malicious activities are difficult to catch with existing signature-based intrusion detection systems like snort and Shadow [?][?]. Interestingly, on some of the most prominent platforms, the miscreants are found to take advantage of the cloud providers' reputations to make their Phishing campaigns look more credible: for example, we found that the adversary continuously spoofed Gmail's login page on Google Drive, and the software download page for Amazon Fire TV in an Amazon S3 bucket.

Figure 5: Top 10 cloud platforms with most Bars, compared with their total number of cloud buckets in our dataset.

Figure 6 shows the distribution of Bars' front-end websites across 81 countries, as determined by the geolocations of the sites. The number of Bars' front-end sites in each country is ranked and shown with different levels of darkness in the figure. We observe that most of these front-ends are located in the United States (14%), followed by Germany (7%) and the United Kingdom (5%).

Figure 6: Impact of Bars' front-end websites around the globe.
cloudrepositorieshelpfacilitatemaliciousactivities,howtheadver- saryexploitedlegitimatecloudbucketsandwhytheadversaryuses Roleinattackinfrastructures. Actually,mostnodesonamali- Barsinthefirstplace.Ourresearchshowsthatontheinfrastructure, ciousinfrastructurearethemaliciouswebsiteswithnewlyregistered Barsplayapivotalrole,comparedwiththecontentkeptonother domainsandthosethatarecompromised.Tobetterunderstandthe maliciousorcompromisedsites,possiblybecausetheyarehostedon criticalrolesofBars,wecomparedthosenodeswiththebadcloud popularcloudservices,andthereforehardtoblacklistandalsoeasy buckets. Specifically,wefirstidentifiedbothtypesofnodesfrom toshareacrossdifferentcampaigns.Also,inamaliciouscampaign, theredirectionpathsandthenanalyzedthenumberofuniquepaths theadversarymaytakeadvantageofmultipleBars,atdifferentat- eachmemberineithercategoryisassociatedwithandtheposition tackstages,toconstructacomplicatedinfrastructurethatsupports ofthememberonthepath.Figure7(a)presentsthecumulativedis- hermission(Section4.1).Moreimportantly,wediscoveredthatthe tributionofthepathsgoingthroughaBarandthatofacompromised adversaryeffectivelyexploitedmisconfiguredlegitimatebucketsto ormalicioussite.Asseeninthefigure,comparedwithothernodes infectalargenumberoftheirfront-endwebservices(Section4.2), ontheinfrastructure,Barsclearlysitonmuchmorepaths(47.4on andthecloudprovidershavenotdonemuchtocounteractthethreat, averagevs.8.6),indicatingtheirimportance. oftenleavingBarsthereforalongtime(Section4.3),possiblydueto Further,Figure7(b)showsthehistogramofpositiondistributions theprivacyconstraintsandlimitedmeanstodetectindividualcom- (again,Barsvs.badsites).TheobservationisthatmoreBars(41%, ponentsofamaliciousactivity. Suchobservations,togetherwith 11%)showupatthebeginningsandtheendsofthepathsthanbad thechallengeinblockingBars,offerinsightsintothemotivationfor websites(22%,5%),whichdemonstratesthattheyoftenactasfirst- movingtowardthisnewtrendofrepository-basedattacks. 
hopredirectorsorattack-payloadrepositories.Forexample,inour three-month-longmonitoringofthecampaignbasedontheSpyware 4.1 Bar-basedMaliciousWebInfrastructure distribution Bar akamaihd.net_rvar-a, we found that besides the Landscape. Asmentionedearlier,BarFinderreported730suspi- Bar,320newly-registeredwebsitesparticipatedintheattack;here ciousrepositoriesfrom6885cloudbucketsover20cloudplatforms. theBaractedverymuchlikeadispatcher: providingJavaScript Amongthem,weutilized694confirmedBars(throughAV/blacklist thatidentifiedthevictim’sgeolocationandthenusinganiframeto scanningormanualvalidation,seeSection3.3)forthemeasurement redirecthertoaselectedbadsite. study. TheseBarswerefoundtodirectlyserve156,608domains Content sharing. Our research reveals that Bars have been ex- 1547 Table5:Top10mostpopularBars. Rank Cloudbucket #offront-endsites Avgpathlen Popularity 1 s3.amazonaws.com_content.sitezoogle.com 4,429 2.9 2.8% 2 cloudfront.net_d3n8a8pro7vhmx 1,829 3.3 1.4% 3 s3.amazonaws.com_assets.ngin.com 1,643 3.2 1.2% 4 s3.amazonaws.com_publisher_configurations.shareaholic 1,434 2.7 0.9% 5 cloudfront.net_d2e48ltfsb5exy 1,340 4.0 0.9% 6 cloudfront.net_d1t3gia0in9tdj 1,297 3.2 0.9% 7 cloudfront.net_d2i2wahzwrm1n5 1,249 2.5 0.8% 8 cloudfront.net_d202m5krfqbpi5 1,062 2.8 0.8% 9 s3.amazonaws.com_files.enjin.com 1,020 7.1 0.7% 10 akamaihd.net_cdncache3-a 976 6.4 0.6% (a) Cumulative distribution of degrees(b) Percentage of Bars in each posi-(c) Cumulative distribution of number persites. tionofredirectionpath(Ignoringthoseofin-degreesperBar. traceswithlengthof2). Figure7:Barsplaycriticalrolesinattackinfrastructures. tensivelysharedamongmaliciousorcompromisedwebsites,also reusedthesamemaliciouscontentsfortheseBars.Specifically,we across different positions on malicious redirection chains. Fig- found that 28 content sharing Bars on Akamaihd have the same ure7(c)illustratesthecumulativedistributionofBars’in-degreesin formatintheirnames. 
Attackersutilizedawordbankbasedsub- theirindividualredirectiongraphs:thatis,thenumberofthesites domaingenerationalgorithms[?],whichconcatenatesfixedterms utilizingtheseBars. Onaverage,eachBarshowsupon252sites andaseriesofdomainnames(removedot),thentruncatesthestring and12%ofthemareusedbymorethan200websites.Table5lists ifitslengthisover13,e.g.,apismarterpoweru-a(truncatedfrom the10mostpopularBarswefound.Amongthem,eight,including smarterpowerunite.com).ThecommonpatternsofBarsindicatethe s3.amazonaws.com_content.sitezoogle.com,s3.amazonaws.com_ potentialofdevelopinganaccuratedetectionprocedure. publisher_configurations.shareaholic,etc.,hostservicesforwebsite Correlation.Wefurtherstudiedtherelationshipsbetweendifferent generation, blackhat SEO or Spam. Particularly, akamaihd.net_ Bars, fetched by the same websites. From our dataset, 11,442 cdncache3-aturnsouttobeadistributorofAdware,whosescripts (3.5%) websites are found to access at least two Bars. Among areloadedintothevictim’sbrowsertoredirectittoothersitesfor them, 8,283 were served as front-end websites, and 3,159 other downloadingdifferentAdware. Also,wefoundthatanotherBar sitesonredirectionchains. Also,60.9%ofthesesiteslinktothe s3.amazonaws.com_files.enjin.comhostsexploitsutilizedby1,020 repositoriesonthesamecloudplatformsand39.1%usethoseon badsites.FindingBarscanhelptoeffectivelydetectmoresiteswith differentplatforms. Insomecases, twobucketsareoftenused maliciouscontents. together.Forexample,wefoundthataclick-hijackingprogramwas Anotherinterestingobservationisthatmaliciouscontentisalso separatedintothecodepartandtheconfigurationpart:theformeris extensivelysharedacrossdifferentBars.Tounderstandsuchcontent keptonCloudFrontwhilethelatterisonAkamaihd;thetwobuckets reuse,wegroupedthemaliciousprogramsretrievedfromdifferent alwaysshowuptogetheronredirectionchains.Suchaseparation Barsbasedonthesimilarityoftheircodeintermsofeditdistance. seems to be done deliberately, in an attempt to evade detection. 
Specifically,weremovedthespaceswithintheprogramsandran AlsowesawthatBarscarryingthesameattackvectorsareoften thePythonlibraryscipy.cluster.hierarchy.linkage[?] tohierarchi- usedtogether,whichareapparentlydeliberatelyputtheretoserve callyclusterthem(nowintheformofstrings)accordingtotheir partiesofthesameinterests: asanotherexample,acompromised Jaroscores[?]. Inthisway,wewereabletodiscoverthreetypes websitewasobservedtoaccessfourdifferentBarsondifferentcloud ofcontentsharing:intra-bucketsharing,cross-bucketsharing,and platforms,redirectingitsvisitorstodifferentplacesfordownloading cross-platform sharing. Specifically, within the Amazon bucket Adwaretothevisitor’ssystem. OurfindingsshowthatBarsare akamaihd.net_asrv-a,wefoundthatmanyofitscloudURLsare widelydeployedinattacksandserveinacomplexinfrastructure. intheformofhttp://asrv-a.akamaihd.net/sd/[num]/[num].js. The JavaScriptcodeturnsouttobeallidentical,exceptthateachscript 4.2 BucketPollution redirectsthevisitortoadifferentwebsite. Thesimilarcodealso Pollutedrepositories. Tofindpollutedbuckets,wesearchedthe appearsinanotherAmazonbucketakamaihd.net_cdncache-a.As Alexatop20KwebsitesfortheBarsinourdatasetand276Barswere another example, we discovered the same malicious JavaScript found. WhenalegitimatesitelinkstoaBar,thereasonmightbe (JS.ExploitBlacole.zm)fromtheBarsonCloudFrontandQiniudnre- eitherthewebsiteortherepositoryishacked.Differentiatingthese spectively,evenunderthesamepath(i.e.,media/system/js/modal.js). twosituationswithcertaintyishard,andinsomecases,itmaynot Moreover,wefoundthatattackersusedsub-domaingenerational- bepossible.Allwecoulddoistogetanideaabouttheprevalenceof gorithmtoautomaticallygeneratesub-domainforBars,thenfurther suchbucketpollution,basedontheintuitionthatifawebsiteisless 1548 (a) Cumulative distribution of Alexa(b) Cumulative distribution of Alexa(c) Cumulativedistributionoftrafficin- globalranksperBars’front-endsites. bouncerateperBar’sfront-endsites. creaserateperBar’sfront-endsites. 
Figure8:Alexaglobalrank,bouncerateandtrafficincreaserateofBar’sfront-endwebsites. GET /?delimiter=/ HTTP/1.1 adversarycaneasilybuildhisownHTTPheader,fillinginhisown Host: (bucket-name).s3.amazonaws.com S3key,asillustratedinFigure9,togainaccesstothemisconfigured Accept-Encoding: identity repository.Inourresearch,weverifiedthatallsuchoperationscan content-length: 0 beperformedonanyrepositorieswiththeconfigurationflaw,which Authorization: AWS (access key):(secret key) suggeststhatsiteoperatorsneedtotakemorecautionwhensetting theconfigurationrules. Figure9:Constructedrequestheader. Tounderstandtheimpactofthisproblem,wedevelopedasimple webtestingtool,whichcheckedabucket’sconfigurationusingour own S3 key. By scanning all 6,885 repositories (including both vulnerable,thenitislesslikelytobecompromised.Tothisend,we Barsandlegitimatebuckets),wediscoveredthat472arevulnerable, ranWhatWeb,apowerfulwebvulnerabilityscanner,onthesesites whichwereassociatedwith1,306front-endwebsites. TheAlexa andfound134Bar’sfront-endwebsitescontainvariousflaws,such globalranksandthebounceratesoftheirfront-endwebsitesare asusingCMSinvulnerableversion(e.g.wordpress3.9),vulnerable illustratedinFigure8(a)andFigure8(b).63%ofthemhavebounce plugins(e.g.,JCEExtension2.0.10)andvulnerablesoftware(e.g., ratesfrom20%to60%;9sitesarerankedwithinAlexatop5000 Apache 2.2). The remaining 142 Bar’s front-end websites look (e.g.,groupon.com,space.com). prettysolidinwebprotectionandthereforeitislikelythattheBars Focusingonthe104badbucketswiththeflaws,wefurthermanu- theyincludewerepolluted. Thissetofpotentiallycompromised allysampled50andconfirmedthatthesebucketswereindeedlegiti- buckets takes 19% of all the Bars flagged by BarFinder. These mate,includinghigh-profileoneslikes3.amazonaws.com_groupon. 
buckets,togetherwiththeadditional30randomlysampledfromthe Further,lookingintothethesebuckets’fileuploadingtime(retrieved set,wentthroughamanualanalysis,whichshowsthatindeedthey fromthebucketsthroughtheflaw),wefoundthatinsomecases, werelegitimatebucketscontaminatedwithmaliciouscontent. theattackhasbeenthereforsixyears. ParticularlytheAmazon Misconfigurationandimpact.Itisevenmorechallengingtodeter- buckets3.amazonaws.com_groupon,Groupon’sofficialbucket,was minehowthesebucketswerecompromised,whichcouldbecaused apparentlycompromisedfivetimesbetween2012and2015(see byexploitingeitherthecloudplatformvulnerabilitiesorthebucket Section 4.4 for details), according to the changes to the bucket misconfigurations.Withoutanextensivetestonthecloudplatform weobservedfromthebuckethistoricaldatasetwecollectedfrom andtherepositories,whichrequiresatleastdirectaccesstothem,a archive.org. WealsoestimatedthevolumeoftraffictothoseBar- comprehensivestudyontheissueisimpossible.Nevertheless,we relatedsitesusingaPassiveDNSdataset [?],whichcontainsDNS wereabletoidentifyamisconfigurationproblemwidelyexistingin lookupsrecordedbytheSecurityInformationExchange.Figure8(c) popularbuckets.Thisflawhasneverbeenreportedbeforebutwas illustratesthetrafficofthewebsitesduringthetimeperiodwhen likelyknowntotheundergroundcommunityandhasalreadybeen theirbucketswerecompromised,whichwasincreasedsignificantly utilizedtoexploittheserepositories. Wereportedtheflawstothe comparedwithwhatthosesitesreceivedbeforetheircompromise, vendorsandtheyconfirmedourfinding. indicatingthattheylikelyreceivedalotofvisits. Thisprovides Specifically,onAmazonS3,onecanconfiguretheaccesspolicies evidencethattheimpactofsuchcompromisedbucketsisindeed forherbuckettodefineswhichAWSaccountsorgroupsaregranted significant. accessandthetypeofaccess(i.e.,list,upload/modify,deleteand 4.3 LifetimeandEvasion download): this can be done through specifying access control list on the AWS Management Console. 
Once this happens, the In the presence of the severe threat from Bars, we found that cloud verifies the content of the authorization field within cloudproviders’responses,however,arefarfromadequate.Thisis the client’s HTTP request header before the requested access is highlightedbytherelativelylonglifetimesofmaliciousrepositories allowed to go through. However, we found that by default, the weobserved. policyisnotinplace,andinthiscase,thecloudonlycheckswhether Lifetime. To understand the duration of Bars’ impacts, we con- theauthorizationkey(i.e., accesskeyandsecretkey)belongsto tinuouslycrawledthefront-endbadsiteseveryfivedaystocheck an S3 user, not the authorized party for this specific bucket: in whethertheywerestillusingthesamesetofBars,andalsomali- otherwords,anyone,aslongassheisalegitimateuseroftheS3, ciouscloudURLstofindoutwhethertherepositorieswerestillalive. hastherighttoupload/modify,deleteandlisttheresourcesinthe Figure10(a)illustratesthedistributionsofsuchbadrepositories’ bucket and download the content. Note that this does not mean lifespanswithinthosefront-endsitesandoncloudplatforms.As thatthebucketcanbedirectlytouchedthroughthebrowser,since canbeseeninthefigure,onaverage,thereferencesoftheseBarson itdoesnotputanythingintotheauthorizationfield.However,the thewebsiteswereremovedmuchfasterthantheircloudURLsand 1549 Table6: ComparisonofBars’lifetimeunderdifferentevasion techniques. #of Avg.life Evasiontechnique #ofBars front-end span sites Contentseparation 10 743 25-30days Contentchange 10 1045 >30days Redirectcloaking 10 1220 10-15days Obfuscation 10 1032 10-15days None 10 984 5-10days (a) DistributionsofBars’life(b) Percentage of Bars re- spansonfront-endsitesandmoved within 5 days in top oncloudplatforms. 5cloudplatformswithmost toBarswithinfront-endwebsiteswereobfuscatedinsomecases, Bars. apparently,forthepurposeofprotectingtherepositories. Further,ourstudyshowsthatthesetechniqueswerealsoutilized Figure10:LifetimeofBars. 
togethertomakeidentificationofBarsevenharder. Specifically, wemanuallychoose10Barswitheachevasiontechnique(40in total),combinedwith10Barswithoutevasiontechnique,andthen ultimatelytheiraccountsonthecloudplatforms. Apparently,the comparetheirlifespans.Itisclearthatevasiontechniquesdoallow cloudprovidersarelessaggressive,relativetothewebsiteowners, Barstohidelonger,asillustratedinTable6. inaddressingBar-relatedinfections. InFigure10(b),wefurther compareBars’lifespansondifferentplatforms:interestingly,with 4.4 CaseStudies morebadbucketsonitsservers,AmazonAWSactedmorepromptly thanotherclouds;Google,however,movedmuchslower:forexam- Inthissection,wediscusstwoprominentexamples. ple,onGoogleDrive,arepositoryhostingmalware-servingpages, PUP campaign. Our study reveals a malicious web campaign googledrive.com_0B8D1eUrPT_z3OVpBTVJ3LUg2UEk,stayed dubbed Potentially Unwanted Programs (PUP) distribution: the thereforover150days,longerthantheaveragedurationofother attackredirectsthevictimtoanattackpage,whichshowsherfake exploitservers(non-cloud)reportedbythepriorwork[?][?] (2.5 systemdiagnosisresultsorpatchrequirementsthroughtheimages hours).Theobservationindicatesthatcloudprovidershavenoticed fetchedfromaBar,inanattempttocheatthevictimintodownload- suchproblem,butalikelylackofeffectivemethodstoidentifyand ing“unwantedprograms”suchasSpyware,Adwareoravirus.This cleanBars. campaignwasfirstdiscoveredinourdataset.Altogether,atleast11 Evasion. Suchalonglifetimecouldberelatedtoaspectrumof Barsfrom3differentcloudplatformsand772websites(nothosted evadingtechniquestheadversarydeploystoprotecthiscloudassets, onthecloud)wereinvolvedin. whicharedescribedasfollows: Throughanalyzingtheredirectiontracesofthecampaign, we foundthattwoAkamaiBars,akamaid.net_cdncache3-aandakamaihd_ • Content separation. 
Apparently, the adversary tends to break asrv-a,frequentlyinjectscriptsintocompromisedwebsites,which hisattackassetsintopiecesandstorethematdifferentplaces.As serveasfirst-hopredirectorstomoveavisitordowntheredirection mentionedearlier,wefoundthatmalware’scodeandconfiguration chainbeforehittingmaliciouslandingpages(thatservemalicious fileswereplacedindifferentbuckets. Also,wediscoveredinthis content). Interestingly, allthefollow-upredirectorsarecompro- studythatthereare32Barsthathostnothingbutimagesusedin misedormaliciouswebsitesthatarenothostedonthecloud.The various attacks, Phishing and Fake AV campaigns in particular. scriptsintheBarswerefoundtochangeovertime,redirectingthe Sincetheimagesthemselvesdonotcontainanymaliciouscode, visitor to different next-hop sites (also redirectors). On average, theserepositoriestypicallystayonthecloudsforalongtime,>30 thelifespanofsuchsitesisonly120hours,buttheBarwasstill daysonaverage. alivewhenwesubmittedthispaper. Suchredirectionsendatat •Contentchange.Anotherinterestingobservationisthatthemali- least216maliciouslandingsites,whichallretrievedeceptiveim- ciouscontentwithinBarschangesovertime,inanattempttoavoid agesfromanAmazonS3buckets3.amazonaws.com_cicloudfront beinglinkedtoblacklistedmaliciouswebsites.Specifically,looking (aBarneverreportedbeforeandisstillalive). Anexampleisa intothehistoryofthecontent(fromarchive.org)retrievedfrom system update warning, as shown in Figure 1. From the reposi- the Bar through the same cloud URL, we found that part of the tory, wecollected134images, includingthoseforfreesoftware content(e.g.,thedestinationofaredirection)changesfrequently, installation,updatesonallmainstreamOSes,browsersandsome movingquicklyawayfromknownmalicioussites. 
popularapplications.Ifsheclicksanddownloadstheprogrampro- •Redirectcloaking.Likemaliciousorcompromisedwebsites,Bars motedonthesite,thecodewillbefetchedfrommultipleBars,such arealsofoundtoleveragecloakingtechniques(renderingdifferent ass3.amazonaws.com_wbt_mediawherethePUPputsaBitcoin contentbasedonthevisitor’cookie,IP,useragent,etc.) toavoid mineronthevictim’ssystem,andcloudfront.net_d12mrm7igk59vq, detection.However,differentfromwebsites,cloudhostingservices whoseprogrammodifiesChrome’ssecuritysetting. typically do not support server-side scripting. As a result, Bars GrouponBar. WediscoveredthatamisconfiguredAmazonS3 havetorunthecloakingcodeontheclient(browser)side,which bucket s3.amazonaws.com_groupon belongs to Groupon (Alexa makestheevasionlessstealthy.Tomakeupforthisweakness,the globalrank265),aglobale-commercemarketplaceserving48.1 adversarywasobservedtoplaceredirectionwebsitesinfrontof millioncustomersworldwide.Thebucketwasusedastheresource Bars,runningcloakingtechniquestheretohidetherepositories. repository for Groupon’s official website (i.e., groupon.com) as •Obfuscation.Wefoundthattheattackpayloadsintherepositories wellasitsmarketingsites(12websitesobservedinourdataset). were often obfuscated. Various kinds of obfuscation techniques Whentrackingitshistoricalcontentfromarchive.org,weweresur- were found from simply Base64 encoding to online obfuscation prisedtoseethattheGrouponS3buckethasbeencompromisedat tools(e.g.,api.myobfuscate.com).Actually,eventhelinkstorefer leasteighttimesinthepastfiveyears(e.g.,2015/08/06,2014/12/18, 1550
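The bucket-recognition step of Section 3.2, in which URLs are matched against per-platform patterns, can be sketched as follows. This is a minimal illustration and not BarFinder's code: the three regular expressions are assumptions written for this example (the paper's own S3 pattern is only partly legible in the extracted text), and `bucket_id` is a hypothetical helper that emits identifiers in the `platform_bucket` style seen in Table 5.

```python
import re
from urllib.parse import urlparse

# Assumed host patterns for three of the platforms named in the text.
# They are illustrative only; the expressions BarFinder uses are not given.
S3_HOST = re.compile(
    r"^(?:(?P<bucket>[\w.-]+)\.)?s3(?:[.-][\w-]+)?\.amazonaws\.com$"
)
CLOUDFRONT_HOST = re.compile(r"^(?P<bucket>\w+)\.cloudfront\.net$")
AKAMAIHD_HOST = re.compile(r"^(?P<bucket>[\w-]+)\.akamaihd\.net$")


def bucket_id(url):
    """Return an identifier such as 's3.amazonaws.com_groupon', or None."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    match = S3_HOST.match(host)
    if match:
        bucket = match.group("bucket")
        if bucket is None:
            # Path-style S3 URL: the bucket is the first path segment.
            segments = [s for s in parsed.path.split("/") if s]
            if not segments:
                return None
            bucket = segments[0]
        return "s3.amazonaws.com_" + bucket

    for pattern, platform in (
        (CLOUDFRONT_HOST, "cloudfront.net"),
        (AKAMAIHD_HOST, "akamaihd.net"),
    ):
        match = pattern.match(host)
        if match:
            return platform + "_" + match.group("bucket")
    return None
```

For instance, under these assumed patterns a URL of the form http://asrv-a.akamaihd.net/sd/[num]/[num].js maps to akamaihd.net_asrv-a, the identifier used in Section 4.1.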
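The "at most 5%" false discovery rate reported for the Unknown set in Section 3.3 follows directly from the counts given there. The short check below reproduces that arithmetic; the function name is a hypothetical helper, and the only inputs are the figures stated in the text (730 flagged, 502 confirmed by AV and blacklist checks, 192 confirmed manually, 6,885 buckets scanned).

```python
def false_discovery_rate(reported, confirmed):
    """FDR = 1 - precision = (reported - confirmed) / reported."""
    if reported <= 0:
        raise ValueError("reported must be positive")
    return (reported - confirmed) / reported


reported = 730            # buckets flagged by BarFinder
confirmed_auto = 502      # confirmed by AV scanning and blacklist checks
confirmed_manual = 192    # confirmed by manual analysis
total_buckets = 6885      # buckets in the Unknown set

confirmed = confirmed_auto + confirmed_manual
# Upper bound: every unconfirmed bucket is counted as a false discovery.
fdr_upper_bound = false_discovery_rate(reported, confirmed)
flagged_share = reported / total_buckets

print(f"confirmed Bars: {confirmed}")
print(f"FDR upper bound: {fdr_upper_bound:.1%}")
print(f"share of buckets flagged: {flagged_share:.1%}")
```

The bound treats all 36 unconfirmed buckets as legitimate, which is the same worst-case assumption the text makes.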
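The wordbank-based naming scheme described under "Content sharing" in Section 4.1 can be illustrated with a small sketch. It assumes, based only on the single example in the text, that the fixed term is a prefix ("api"), that the 13-character limit applies to the dot-free domain part, and that "-a" is a constant suffix; none of these details is confirmed beyond that one example, and `wordbank_name` is a hypothetical helper.

```python
MAX_DOMAIN_CHARS = 13  # truncation length stated in the text


def wordbank_name(term, domain, suffix="-a"):
    """Build a candidate Bar name from a fixed term and a domain name.

    Assumed scheme: strip the dots from the domain, keep at most the
    first 13 characters, then add the term in front and the suffix behind.
    """
    stem = domain.lower().replace(".", "")
    return term + stem[:MAX_DOMAIN_CHARS] + suffix


def matches_wordbank(bucket_name, terms, domains):
    """Return True if bucket_name can be generated from the given wordbank."""
    return any(
        wordbank_name(term, domain) == bucket_name
        for term in terms
        for domain in domains
    )


print(wordbank_name("api", "smarterpowerunite.com"))
```

A check like `matches_wordbank` hints at how the shared name format could feed a detection rule, although the paper does not describe one.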
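The content-reuse analysis in Section 4.1 strips whitespace from the retrieved programs and clusters them by Jaro score. The sketch below shows the idea with the standard library only. It swaps in a simple threshold-based single-linkage grouping for the scipy.cluster.hierarchy.linkage call named in the text; the 0.9 threshold and the sample scripts are assumptions made for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def cluster_scripts(scripts, threshold=0.9):
    """Group script indices whose whitespace-free text is Jaro-similar."""
    normalized = ["".join(s.split()) for s in scripts]
    parent = list(range(len(scripts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(scripts)):
        for j in range(i + 1, len(scripts)):
            if jaro(normalized[i], normalized[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(scripts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Two redirect scripts that differ only in their destination host end up in one group, which mirrors the intra-bucket sharing observed for akamaihd.net_asrv-a; the pairwise comparison is quadratic, so this sketch suits small samples only.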
Description: