Network Analysis of Recurring YouTube Spam Campaigns

Derek O'Callaghan, Martin Harrigan, Joe Carthy, Pádraig Cunningham
School of Computer Science & Informatics, University College Dublin
{derek.ocallaghan,martin.harrigan,joe.carthy,padraig.cunningham}@ucd.ie

arXiv:1201.3783v1 [cs.SI] 18 Jan 2012

Abstract

As the popularity of content sharing websites such as YouTube and Flickr has increased, they have become targets for spam, phishing and the distribution of malware. On YouTube, the facility for users to post comments can be used by spam campaigns to direct unsuspecting users to bogus e-commerce websites. In this paper, we demonstrate how such campaigns can be tracked over time using network motif profiling, i.e. by tracking counts of indicative network motifs. By considering all motifs of up to five nodes, we identify discriminating motifs that reveal two distinctly different spam campaign strategies. One of these strategies uses a small number of spam user accounts to comment on a large number of videos, whereas a larger number of accounts is used with the other. We present an evaluation that uses motif profiling to track two active campaigns matching these strategies, and identify some of the associated user accounts.

Introduction

The usage and popularity of content sharing websites continue to rise each year. For example, the number of Flickr uploads has risen to a total of six billion images, having increased annually by 20% over the past five years¹. Similarly, YouTube now receives more than three billion views per day, with forty-eight hours of video being uploaded every minute; increases of 50% and 100% respectively over the previous year². Unfortunately, such increases have also resulted in these sites becoming more lucrative targets for spammers hoping to attract unsuspecting users to malicious websites, where a variety of threats such as scams (phishing, e-commerce) and malware can be found. This is a particular problem for YouTube given its facility to host discussions in the form of video comments (Sureka 2011). Opportunities exist for the abuse of this feature with the availability of bots³ that can be used to post spam comments in large volumes.

¹ http://news.softpedia.com/news/Flickr-Boasts-6-Billion-Photo-Uploads-215380.shtml
² http://youtube-global.blogspot.com/2011/05/thanks-youtube-community-for-two-big.html
³ http://youtubebot.com/

Our investigation has found that bot-posted spam comments are often associated with orchestrated campaigns that can remain active for long periods of time, where the primary targets are popular videos. Such campaigns tend to employ a variety of detection evasion techniques, such as variants of the same fundamental message content, perhaps with different website domains, and an ever-evolving set of fake user accounts. An initial manual analysis of data gathered from YouTube revealed activity from a number of campaigns, two of which can be seen in Figure 1. The results presented in this paper confirm the presence of these campaigns, along with their recurring nature.

Figure 1: Strategies of two spam campaigns targeting YouTube in 2011: small number of accounts each commenting on many videos (left), and larger number of accounts each commenting on few videos (right). Blue nodes are videos, red nodes are accounts marked as spam, beige nodes are spam accounts not marked accordingly.

As an alternative to traditional approaches that attempt to detect spam on an individual (comment) level (e.g. domain blacklists), this paper presents an evaluation of the detection of these recurring campaigns using network analysis, based on networks derived from the comments posted by users to videos. This approach uses the concept of network motif profiling (Milo et al. 2002; 2004; Wu, Harrigan, and Cunningham 2011), where motif counts from the derived networks are tracked over time. Given that different campaign strategies can exist (see Figure 1), the objective is to discover certain discriminating motifs that can be used to identify particular strategies and the associated users as they periodically recur.

This paper begins with a description of related work in the domain. The collection of contemporary YouTube data, comprising comments posted to the most popular videos over a period of time, is then discussed. Next, the methodology used by the detection approach is described in detail, from derivation of the comment-based networks to the subsequent network motif profile generation. The results of an experiment for a seventy-two hour period are then presented. These results demonstrate the use of certain discriminating motifs to identify some of the users associated with two separate campaigns we have discovered within this time period. Further analysis of the campaign websites is also provided. Finally, the overall conclusions are discussed, and some suggestions for future work are made.

Related Work

Structural and spam analysis

The network structure of YouTube has been analysed in a number of separate studies. Paolillo et al. (2008) investigated the social structure with the generation of a user network based on the friendship relationship, focusing on the degree distribution. They found that YouTube is similar to other online social networks with respect to degree distribution, and that a social core exists between authors (uploaders) of videos. An alternative network based on related videos was analysed by Cheng et al. (2008). Given that the resulting networks were not strongly connected, attention was reserved for the largest strongly connected components. These components were found to exhibit small-world characteristics (Watts and Strogatz 1998), with large clustering coefficients and short characteristic path lengths, indicating the presence of dense clusters of related videos.

Benevenuto et al. (2008a) created a directed network based on videos and their associated responses. Similarly, they found that using the largest strongly connected components was more desirable due to the large clustering coefficients involved. This was a precursor to subsequent work concerned with the detection of spammers and content promoters within YouTube (Benevenuto et al. 2008b; 2009). Features from the video response networks (e.g. clustering coefficient, reciprocity) were used as part of a larger set to classify users accordingly. Other YouTube spam investigations include the recent work of Sureka (2011), based on the detection of spam within comments posted to videos. A number of features were derived to analyse the overall activity of users, rather than focusing on individual comment detection.

An extensive body of work has been dedicated to the analysis of spam within other online social networking sites. For example, Mishne et al. (2005) suggested an approach for the detection of link spam within blog comments using the comparison of language models. Gao et al. (2010) investigated the proliferation of spam within Facebook "wall" messages, with the detection of spam clusters using networks based on message similarity. This particular study demonstrated the bursty (recurring) and distributed aspects of botnet-driven spam campaigns, as discussed by Xie et al. (2008). The shortcomings of URL blacklists for the prevention of spam on Twitter were highlighted by Grier et al. (2010), where it was found that blacklist update delays of up to twenty days can occur. This is a particular problem with the use of shortened URLs, the nature of which was recently analysed by Chhabra et al. (2011).

Network motif analysis

Network motifs (Milo et al. 2002; Shen-Orr et al. 2002) are structural patterns in the form of interconnected n-node subgraphs that are considered to be inherent in many varieties of network, such as biological, technological and sociological networks. They are often used for the comparison of said networks, and can also indicate certain network characteristics. In particular, the work of Milo et al. (2004) proposed the use of significance profiles based on the motif counts found within networks to enable the comparison of local structure between networks of different sizes. In this case, the generation of an ensemble of random networks was required for each significance profile. An alternative to this approach (Wu, Harrigan, and Cunningham 2011) involved the use of motif profiles that did not entail random network generation. Instead, profiles were created on an egocentric basis for the purpose of characterising individual egos, encompassing the motif counts from the entirety of egocentric networks within a particular network.

The domain of spam detection has also profited from the use of network motifs or subgraphs. Within a network built from email addresses (Boykin and Roychowdhury 2005), a low clustering coefficient (based on the number of triangle structures within a network) may indicate the presence of spam addresses, with regular addresses generally forming close-knit communities, i.e. a relatively higher number of triangles. Becchetti et al. (2008) made use of the number of triangles and clustering coefficient as features in the detection of web spam. These two features were found to rank highly within an overall feature set. Motifs of size three (triads) have also been used to detect spam comments in networks generated from blog interaction (Kamaliha et al. 2008). It was found that certain motifs were likely to indicate the presence of spam, based on comparison with corresponding random network ensembles.

Separately, network motifs have also been used to characterize network traffic (Allan, Turkett, and Fulp 2009). A network was created for each application (e.g. HTTP, P2P applications), and nodes within the network were classified using corresponding motif profiles.

Data Collection

Following the lead of earlier related YouTube research, a data set was collected in order to permit the investigation of contemporary spam comment activity. An extensive crawl of the YouTube network was performed by other researchers (Paolillo 2008; Benevenuto et al. 2009). In our case, we opted for a specific selection of the available data given that spam comments in YouTube tend to be directed towards a subset of the entire video set, i.e. more popular videos generally have a higher probability of attracting attention from spammers, thus ensuring a larger audience. This characteristic has also been seen on other online social networks such as Twitter (Benevenuto et al. 2010).

Another issue to be considered is the accessibility of certain YouTube data attributes. The recent activity of a user profile contains a number of potential attributes for use in the derivation of representative networks, such as comments posted to videos, and subscriptions added to other users. Similarly, the list of subscribers for a particular user would also be useful. However, access to these attributes can often be restricted, meaning that reliance on such data may lead to inaccuracies during subsequent experiments. In contrast, comments (and the users who posted them) found on a public video's page are always accessible. Given these issues, we decided to use only data to which access was not restricted, namely the comments posted to videos along with the associated user accounts.

Retrieval process

The data has been retrieved using the YouTube Data API⁴. This API provides access to video and user profile information. There are some limits associated with using the API, of which further details are provided below. Apart from video and user information, access is also provided to standard feeds such as Most Viewed videos, Top Rated videos etc. The fact that these feeds are periodically updated (usually daily) facilitates our objective of analysing recurring spam campaigns, as it enables the retrieval of popular videos (i.e. those attracting spam comments) on a continual basis. Therefore, the retrieval process is executed periodically as follows:

1. Retrieve the current video list from the most viewed standard feed for the US region (the API limits this to a maximum of 100 videos).
2. For each video in the list:
   (a) If this video has not appeared in an earlier feed list, retrieve its meta-data such as upload time, description etc.
   (b) Retrieve the comments and associated meta-data for the last twenty-four hours, or those posted since the last retrieval time (if more recent). The API limits the returned comments to a maximum of 1,000.
3. In order to track the comment activity on particular videos appearing intermittently in the most viewed feed, comments are also retrieved for those videos not in the current feed list that appeared in an earlier list from the previous forty-eight hours.

⁴ http://code.google.com/apis/youtube/getting_started.html#data_api

Data set properties

Data retrieval began on October 31st, 2011, and details of the videos and comments as of January 17th, 2012 can be found in Table 1⁵.

Videos                        6,407
Total comments            6,431,471
Comments marked as spam     481,334
Total users               2,860,264
Spam comment users          177,542

Table 1: Data set properties

⁵ The data set is available at http://mlg.ucd.ie/yt

An interesting feature of the API is the spam hint property provided within the video comment meta-data. This is set to true if a comment has previously been marked as spam, either by the spam filter or manually with the "Flag for spam" button available with each comment posted on a video's page. However, this property cannot be considered reliable due to its occasional inaccuracy, where innocent comments can be marked as spam, while obvious spam comments are not marked as such. This will be demonstrated later in the results discussion. Similar evidence of this property's unreliability was also encountered in earlier work (Sureka 2011).

Although the comment spam hint is used for approximate annotation of the data (Table 1), it is not relied upon as a label for the purposes of this evaluation. Other research in this area (Benevenuto et al. 2009) performed manual label annotation of YouTube data for use in subsequent classification experiments. An accurate annotation process will be considered in future work.

Methodology

Comment processing and network generation

Our methodology requires the generation of a network to represent the comment posting activity of users to a set of videos. Initially, comments made during a specified time interval are selected from the data set discussed in the previous section. However, a number of pre-processing steps must be executed before an appropriate network, similar to those in Figure 1, can be generated.

Spammers try to obfuscate the text of comments from a particular campaign in order to bypass detection by any filters. Obfuscation techniques include the use of varying amounts of additional characters (e.g. whitespace, Unicode newlines, etc.) within the comment text, or different textual formations (e.g. additional words, misspellings) of the same fundamental message. Some examples of these can be seen in the next section.

To counteract these efforts, each comment is converted to a set of tokens. During this process, stopwords are removed, along with any non-Latin-based words as the focus of this evaluation is English-language spam comments. Punctuation characters are also removed, and letters are converted to lowercase. A modified comment text is then generated from the concatenation of the generated tokens. As initial analysis found that spam comments can often be longer than regular comments, any texts shorter than a minimum length (currently 25 characters) are removed at this point. Although the campaign strategies under discussion here are concerned with attracting users to remote sites through the inclusion of URLs in comment text, comments without URLs are currently retained. This ensures the option of analysing other types of spam campaigns, such as those encouraging channel views, i.e. promoters (Benevenuto et al. 2009), along with the behaviour of regular users.

A network can then be generated from the remaining modified comment texts. This network consists of two categories of node, users and videos. An undirected edge is created between a user and a video if at least one comment has been posted by the user on the video, where the edge weight represents the number of comments in question. For the moment, the weight is merely recorded but is not subsequently used when counting motifs within the network. To capture the relationship between the users involved in a particular spam campaign, undirected and unweighted edges are created between user nodes based on the similarity of their associated comments. Each modified (tokenized) comment text is converted to a set of hashes using the Rabin-Karp rolling hash method (Karp and Rabin 1987), with a sliding window length of 3. A pairwise distance matrix, based on Jaccard distance, can then be generated from these comment hash sets. For each pairwise comment distance below a threshold (currently 0.6), an edge is created between the corresponding users if one does not already exist.

Afterwards, any users whose set of adjacent nodes consists solely of a single video node are removed. Since these users have commented on only one video, and are in all likelihood not related to any other users, they are not considered to be part of any spam campaign. The resulting network tends to consist of one or more large connected components, with a number of considerably smaller connected components based on videos with a relatively minor amount of comment activity. Finally, an approximate labelling of the user nodes is performed, where users are labelled as spam users if they posted at least one comment whose spam hint property is set to true. All remaining users are labelled as regular users. Although this can lead to label inaccuracies, the results in the next section demonstrate that such inaccuracies are perceivable.

Network motif profiles

Once the network has been generated, a set of egocentric networks can be extracted. In this context, given that the focus is on user activity, an ego is a user node, where its egocentric network is the induced k-neighbourhood network consisting of those user and video nodes whose distance from the ego is at most k (currently 2). Motifs from size three to five within the egocentric networks are then enumerated using FANMOD (Wernicke and Rasche 2006). A set of motif counts is maintained for each ego, where a count is incremented for each motif instance found by FANMOD that contains the ego.

A network motif count profile is then created for each ego. As the number of possible motifs can be relatively large (particularly if directed and/or weighted edges are considered), the length of this profile will vary for each network generated from a selection of comment data, rather than relying upon a profile with a (large) fixed length. For a particular generated network, the profiles will contain an entry for each of the unique motifs found in the entirety of its constituent egocentric networks. Any motifs not found for a particular ego will have a corresponding value of zero in the associated motif profile.

As mentioned previously, the work of Milo et al. (2004) proposed the generation of a significance profile, where the significance of a particular motif was calculated based on its count in a network along with that generated by an ensemble of corresponding random networks. These profiles then permitted the subsequent comparison of different networks. In this work, the egocentric networks are compared with each other, and the generation of random ensembles is not performed. An alternative ratio profile $rp$ (Wu, Harrigan, and Cunningham 2011) is created for each ego, where the ratio value for a particular motif is based on the counts from all of the egocentric networks, i.e.:

$$rp_i = \frac{nmp_i - \overline{nmp}_i}{nmp_i + \overline{nmp}_i + \epsilon} \qquad (1)$$

Here, $nmp_i$ is the count of the $i$th motif in the ego's motif profile, $\overline{nmp}_i$ is the average count of this motif for all motif profiles, and $\epsilon$ is a small integer that ensures that the ratio is not misleadingly large when the motif occurs in only a few egocentric networks. To adjust for scaling, a normalized ratio profile $nrp$ is then created for each ratio profile $rp$ with:

$$nrp_i = \frac{rp_i}{\left(\sum rp_i^2\right)^{1/2}} \qquad (2)$$

The generated set of normalized ratio profiles usually contains correlations between the motifs. Principal components analysis (PCA) is used to adjust for these, acting as a dimensionality reduction technique in the process. We can visualize the first two principal components as a starting point for our analysis. This is discussed in the next section.

Experiments and Results

For the purpose of this evaluation, the experiments were focused upon tracking two particular spam campaigns that we discovered within the data set. The campaign strategies can be seen in Figure 1, i.e. a small number of accounts each commenting on many videos (Campaign 1), and a larger number of accounts each commenting on few videos (Campaign 2). A period of seventy-two hours was chosen where these campaigns were active, starting on November 14th, 2011 and ending on November 17th, 2011.

In order to track the campaign activity over time, this period was split into twelve windows of six hours each. For each of these windows, a network of user and video nodes was derived using the process described in the previous section. A normalized ratio profile was generated for each ego (user), based on the motif counts of the corresponding egocentric network. Principal components analysis was then performed on these profiles to produce 2-dimensional spatializations of the user nodes, using the first two components. These spatializations act as the starting point for the analysis of activity within a set of time windows.

[Figure 2 plot annotations: Campaign 1, Campaign 2, Other spam users]

Figure 2: Spatialization of the first two principal components of the normalized ratio profiles for Windows 10 and 11 (red nodes are users with comments marked as spam, beige nodes are all other users). Both spam campaigns are highlighted.

Visualization and initial analysis

Having inspected all twelve six-hour windows, two windows containing activity from both campaigns have been selected for detailed analysis here. These are from November 17th, 2011; Window 10 (04:19:32 to 10:19:32) and Window 11 (10:19:32 to 16:19:32). Their derived network details can be found in Table 2.

Window   Video nodes   User nodes (spam)   Edges
10       263           295 (107)           907
11       296           523 (137)           1627

Table 2: Network details for Windows 10 and 11

A spatialization of the first two principal components of the normalized ratio profiles for these windows can be found in Figure 2. Users posting at least one comment marked as spam (using the spam hint property) are in red, all other users are in beige. The points corresponding to the spam campaign users have been highlighted accordingly. From the spatializations, it can be seen that in both windows:

1. The vast majority of users appear as overlapping points in larger clusters (on the right and left respectively).
2. There is a clear distinction between the two different campaign strategies, as these points are plotted separately (both from regular users and each other).
3. The inaccuracy of the spam hint comment property is demonstrated as the Campaign 2 clusters contain users not coloured in red, i.e. none of their comments were marked as spam (further details of these users can be seen in Figure 4). Similarly, the reverse is true with the larger clusters of regular users.

Apart from the highlighted campaign clusters, other spam nodes in the spatializations have been correctly marked as such. For example, the five users that are separated from the normal cluster in Window 10 ("Other spam users") appear to be isolated spam accounts having similar behaviour to the Campaign 1 strategy, but on a smaller scale. This also applies to the single separated user in Window 11. They are not considered further during this evaluation as they are not part of a larger campaign.

Further analysis of Campaign 1 revealed that two and three users were active in Windows 10 and 11 respectively (five separate users), posting the following comments:

Three most cool things in the World for me before
1))))) Jordan–the superstar
2))))) 66cheap. com–the cheapest shopping site
3))))) the iphone–best connector
NOW THERE'S ONE MORE, IT'S THE VIDEO ABOVE!!!!!!!!!!

Three Best things in the World for me now: ): ): ): ): )
1. Lily——My boyfriend
2. 55cheap. com–the cheapest shopping site
3. the video above—-the most ironical and interesting video I think:]:]:]:]:] :]:] :]:]

Although there are certain differences between these comments, they are clearly from the same campaign. This behaviour is also seen in both windows with Campaign 2, featuring a larger number of users, although there are fewer occurrences of identical comments. Nevertheless, a similarity is noticeable, for example:

Don't miss this guys, the CEO of apple is releasing ipads on Thursday: osapple.co.nr

dont miss out. November 17 - new apple ceo is shipping out old ipad and iphones
Not a lie. Go to this webpage to see what I mean: bit.ly\vatABm

Both of these comments are made by the same user in different windows. However, while the first comment was accurately marked as spam, the second was not. An assumption here could be that the URL in the first comment is on a spam blacklist, while the shortened URL in the second enables such a list to be bypassed. Similar shortcomings are discussed in earlier work (Chhabra et al. 2011).

Discriminating motifs

An inspection of the individual motif counts found that certain motifs have relatively higher counts for users involved in the spam campaigns than those found for regular users. These motifs may be considered indicative of different campaign strategies, and a subset can be found in Table 3.

[Table 3 motif diagrams, in two columns: Campaign 1, Campaign 2]

Table 3: A subset of discriminating motifs for different spam campaign strategies (user nodes are beige, video nodes are blue).

These discriminating motifs would appear to correlate with the existing knowledge of the campaign strategies. Campaign 1 consists of a small number of users commenting on a large number of videos, and so it would be expected that motifs containing only one user node with a large number of video node neighbours have higher counts for the users involved, as is the case here. The motifs considered indicative of Campaign 2 are more subtle, in that the number of user and video nodes is similar, with both user and video nodes present in the set of neighbours for a particular user. However, all three highlight the fact that users appear to be more likely to be connected to other users rather than videos. This makes sense given that with this campaign, a larger number of users tend to comment on a small number of videos each, and the potential for connectivity between users is higher given the similarity of their comments. These motifs would also appear to indicate that users in the campaign don't comment on the same videos, as no two users share a video node neighbour.

Figure 3 contains plots for the counts of two of these motifs for each of the six-hour windows. The counts were normalised using the edge count for the corresponding window networks followed by min-max normalization. The fluctuation in counts across the windows appears to track the recurring periodic activity of these campaigns, as confirmed by separate analysis of the data set. This would appear to corroborate the bursty nature of spam campaigns (Xie et al. 2008; Gao et al. 2010).

[Figure 3 plots: x-axis Windows (1 to 12), y-axis Normalized Count (0.0 to 1.0)]

Figure 3: Tracking the recurring activity of Campaign 1 (left) and Campaign 2 (right) for all six-hour windows from 14th November 2011 to 17th November 2011, using a single discriminating motif for each campaign.

Finally, Figure 4 plots the user counts in descending order for these two motifs in Window 11. With the Campaign 1 motif (left), the first four users are involved and have considerably higher counts than the remaining users. There are also differences in counts between the campaign users themselves, indicating the most active users in this window. All users plotted for the Campaign 2 motif (right) are indeed involved. Three of the Campaign 2 users were coloured differently to the others, highlighting the fact that none of their comments were accurately marked as spam. These same three users can be seen in the right spatialization in Figure 2.
qqmsable AslanDeadly zhihan33 DustySapphires buzai5 FrigidSermak rs 573729573 rs MatharMetalhead e e Us <other_user_3> Us KaldarRavage <other_user_2> BabyPeeky <other_user_1> GrimGargoyles <other_user_0> IndigoRazz 0 200000 400000 600000 800000 10000001200000 0 200 400 600 800 1000 Motif counts Motif counts Figure4:UsersassociatedwithCampaign1(left)andCampaign2(right),havingthehighestcountsforasinglediscriminating motif for each campaign from Window 11 (17th November 2011 10:19:32 to 16:19:32). Note how three of the users in Campaign2arecoloureddifferently,i.e. noneoftheircommentsweremarkedasspam. Usersnotinvolvedinthecampaigns havebeenanonymized. CampaignAnalysis hasbeeninoperationsince2010attheveryleast. Itappears Following the inspection of the discriminating motifs, the that the About page details contain further inconsistencies, websitesanddomainsassociatedwiththecommentsposted e.g., 55goods.com states that “In 2009, 78.8% of our an- bythecampaign1userswerethenanalyzed. Thefollowing nual revenue was from the international market...”, while domainswerefoundinthedatasetincommentsbeginning 55cheap.com, allegedly in operation for “18 years” since with“ThreeBestthings”and“Threemostcoolthings”(as 1993 contains the same statement with merely a change in seen in the example comments listed in the previous sec- year: “In2010, 78,8%ofourannualrevenuewasfromthe tion),andcanbecategorizedasfollows: internationalmarket...”. Atotalof24differentuseraccountswereusedtosendthe 1. National Football League (NFL) merchandise: 2006jer- associatedcommentsfoundinthedataset. Althoughsome seys.com,66cheap.com,shopofnfl.com. oftheseaccountshavebeensuspendedbyYouTube,others 2. Footwear: 21boots.com. remain active. The campaign appears to rotate the existing accountsforcommentposting,andnewaccountsarecreated 3. Widerrangeofmerchandise(e.g. clothing, accessories): on a continual basis. The four accounts for this campaign 36shopping.com,55cheap.com,55goods.com. 
listed in Figure 4 are currently active as of January 2012. Of these four, the oldest account was created in August 2011, while the most recent was created in October 2011.

It is quite clear that all of these sites are related given the high similarity between them, e.g. various index page titles containing the text "The Cheapest Shopping Site", identical payment options and the same contact email address. There are also some inconsistencies in the HTML content, for example, some 66cheap.com pages refer to shopofnfl.com and jerseysofnfl.com, and 55cheap.com pages refer to 36shopping.com. Suspicious claims are also made, such as "SHOPOFNFL.COM was the online shop of NFL". At first glance, 21boots.com looks different to the others, but further investigation reveals similarities such as the payment options. The domains appear to have been registered by the same person⁶. As 55cheap.com has been previously identified as a known scam website⁷, it is safe to assume that all of these sites should be treated as such.

Further analysis of 55goods.com shows it to be an older site, as its About page alleges that it has been in operation for "17 years" since 1993. This would suggest that this scam

⁶ http://whois.domaintools.com
⁷ http://answers.yahoo.com/question/index?qid=20110426143143AArdrbK

Conclusions and Future Work

YouTube spam campaigns typically involve a number of spam bot user accounts controlled by a single spammer targeting popular videos with similar comments over time. We have shown that dynamic network analysis methods are effective for identifying the recurring nature of different spam campaign strategies, along with the associated user accounts. We have used a characterization of YouTube users in terms of motifs in the comment network to highlight the users in question. While the YouTube comment scenario could be characterized as a network in a number of ways, we use a network representation comprising user and video nodes, user-video edges representing comments and user-user edges representing comment similarity.

The discriminating power of these motif-based characterizations can be seen in the PCA-based spatialization in Figure 2. It is also clear from Figure 3 that histograms of certain discriminating motifs show the level of activity in the two campaign strategies over time. Furthermore, counts of these motifs in the egocentric networks of users highlight the associated accounts (Figure 4).

Future Work

For future experiments, it will be necessary to annotate the data set with spam/non-spam labels, or perhaps a more extensive annotation that considers the associated campaign strategies. Feature selection of a subset of motifs could then be performed along with subsequent user classification. The use of a subset of motifs is attractive, as it would remove the current requirement to count all motif instances found in the user egocentric networks, which can be a lengthy process.

Acknowledgements

This work is supported by 2Centre, the EU funded Cybercrime Centres of Excellence Network and Science Foundation Ireland under grant 08/SRC/I140: Clique: Graph and Network Analysis Cluster.
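The network representation described in the conclusions (user and video nodes, user-video edges for comments, user-user edges for comment similarity) can be illustrated with a minimal Python sketch. Several parts of it are assumptions made only for illustration and are not taken from the paper: the Jaccard word-set similarity and its threshold, the sample comments, and all function names. The single triangle count stands in for a full motif profile, which in the paper covers all motifs of up to five nodes.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two comments (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def build_network(comments, threshold=0.6):
    """Build edge sets from (user, video, text) tuples.

    user_video: one edge per (user, video) pair with at least one comment.
    user_user:  one undirected edge per pair of distinct users who posted
                comments whose similarity reaches the (hypothetical) threshold.
    """
    user_video = {(user, video) for user, video, _ in comments}
    user_user = set()
    for (u1, _, t1), (u2, _, t2) in combinations(comments, 2):
        if u1 != u2 and jaccard(t1, t2) >= threshold:
            user_user.add(frozenset((u1, u2)))
    return user_video, user_user


def motif_profile(user, user_video, user_user):
    """Count a few small patterns in one user's egocentric network."""
    videos = {v for u, v in user_video if u == user}
    similar = {next(iter(pair - {user})) for pair in user_user if user in pair}
    # Triangle: this user and a similar user both commented on the same video.
    triangles = sum(
        1 for other in similar for v in videos if (other, v) in user_video
    )
    return {
        "videos": len(videos),
        "similar_users": len(similar),
        "triangles": triangles,
    }


# Hypothetical sample data: two accounts posting near-identical comments
# and one ordinary commenter.
comments = [
    ("bot_a", "v1", "cheap jerseys at example shop"),
    ("bot_a", "v2", "cheap jerseys at example shop"),
    ("bot_a", "v3", "cheap jerseys at example shop now"),
    ("bot_b", "v1", "cheap jerseys at example shop"),
    ("fan", "v1", "great song, love the video"),
]
user_video, user_user = build_network(comments)
for name in ("bot_a", "bot_b", "fan"):
    print(name, motif_profile(name, user_video, user_user))
```

In this toy data the two bot accounts are linked by a similarity edge and share a video, so each has a non-zero triangle count, while the ordinary commenter has none. The pairwise comparison is quadratic in the number of comments, so a real data set would need a cheaper similarity step.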
