


It's Always April Fools' Day! On the Difficulty of Social Network Misinformation Classification via Propagation Features

Mauro Conti, Daniele Lain, Riccardo Lazzeretti, Giulio Lovisotto
University of Padua

Walter Quattrociocchi
IMT School for Advanced Studies, Lucca

arXiv:1701.04221v1 [cs.SI] 16 Jan 2017

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Given the huge impact that Online Social Networks (OSN) have had on the way people get informed and form their opinions, they have become an attractive playground for malicious entities that want to spread misinformation and leverage its effect. In fact, misinformation easily spreads on OSN and is a huge threat for modern society, possibly influencing the outcome of elections, or even putting people's lives at risk (e.g., by spreading "anti-vaccines" misinformation). Therefore, it is of paramount importance for our society to have some sort of "validation" of information spreading through OSN. The need for wide-scale validation would greatly benefit from automatic tools.

In this paper, we show that it is difficult to carry out an automatic classification of misinformation considering only structural properties of content propagation cascades. We focus on structural properties because they would be inherently difficult to manipulate with the aim of circumventing classification systems. To support our claim, we carry out an extensive evaluation on Facebook posts belonging to conspiracy theories (as representative of misinformation), and scientific news (representative of fact-checked content). Our findings show that conspiracy content actually reverberates in a way which is hard to distinguish from the way scientific content does: for the classification mechanisms we investigated, the classification $F_1$-score never exceeds 0.65 during content propagation stages, and is still less than 0.7 even after propagation is complete.

1 Introduction

An increasing number of people get informed on Online Social Networks (OSN) (Newman, N. and Levy, D.A.L. and Nielsen, R.K. 2015). However, as OSN allow every user to post content, which propagates among users through viral processes, these platforms became attractive targets for misinformation creators. Moreover, the hyperconnected world and increasing complexity of reality create a scenario in which viral processes on OSN are driven by confirmation bias, eliciting the proliferation of unsubstantiated rumors and hoaxes all the way up to conspiracy theories (Bessi et al. 2015b; 2015a). News stories undergo the same popularity dynamics as other forms of online content (such as selfies and cat pictures) (Newman, N. and Levy, D.A.L. and Nielsen, R.K. 2015). It is no surprise, then, that the Oxford Dictionary in 2016 elected Post-truth as word of the year. The definition reads:

"Relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief".

Several studies pointed out the effects of social influence online (Centola 2010; Fowler and Christakis 2010; Quattrociocchi, Caldarelli, and Scala 2014; Salganik, Dodds, and Watts 2006). Results reported in (Kramer, Guillory, and Hancock 2014) indicate that emotions expressed by others on Facebook influence our own emotions, providing experimental evidence of massive-scale contagion via social networks. As a result of disintermediated access to information and of algorithms used in content promotion, communication has become increasingly personalized, both in the way messages are framed and in how they are shared across social networks. Selective exposure has been shown to favor the emergence of echo chambers: polarized groups of like-minded people where users reinforce their world view with information adhering to their system of beliefs (Del Vicario et al. 2016). Confirmation bias, indeed, plays a pivotal role in informational cascades (Quattrociocchi, Scala, and Sunstein 2016; Bessi et al. 2015b; Zollo et al. 2015a; Sunstein 2002). Recent works (Bessi et al. 2015a; Zollo et al. 2015a) showed that attempts to debunk false information are largely ineffective. In particular, discussion degenerates when the two polarized communities interact with one another (Zollo et al. 2015b). OSN users therefore tend to select information that is consistent with their beliefs (even if it contains false claims) and propagate it to like-minded friends, and to ignore information dissenting from their beliefs. This confirms that misinformation is a huge threat for modern society. Not only can it put people's lives at risk, as in the case of "anti-vaccines" misinformation, it is also starting to be used against political opponents, such as in the US, during the 2016 electoral campaign (Silverman et al. 2016) and during Election Day (Rogers, K. and Bromwich, J.E. 2016).

Rising attention to the spreading of fake news and unsubstantiated rumors led researchers to investigate many of their aspects, from the characterization of conversation threads (Backstrom et al. 2013), to the detection of bursty topics on microblogging platforms (Diao et al. 2012), to the disclosure of the mechanisms behind information diffusion for different kinds of content (Romero, Meeder, and Kleinberg 2011). Spreading of misinformation also motivated major corporations like Google and Facebook to provide solutions to the problem (First Draft News Coalition 2016). However, given the amount of content posted every day on OSN (e.g., Facebook reports 1.18 billion daily active users in September 2016 (Facebook 2016)), effective fact-checking would greatly benefit from automatic classification tools that would possibly not require human intervention. Moreover, classification of fake news and misinformation should ideally use properties that misinformation creators cannot manipulate. Considering in the classification, for example, the trustworthiness of news domains, or the topic of content, could lead to an arms race with creators of false news. Instead, the topology of propagation cascades, and the patterns of users' interaction with content, are outside of the domain of our "adversaries", and are much more difficult to manipulate.

In this work, we investigate detection of viral processes by comparing diffusion of posts from scientific and conspiracy pages on the Italian Facebook network. The former diffuse scientific knowledge, where details about the sources (such as authors and funding programs) are easy to access. We therefore select their posts as representative of fact-checked content. The latter aim at diffusing what is neglected by "manipulated" mainstream media. Specifically, conspiracy theories tend to reduce the complexity of reality by explaining significant social or political aspects as plots conceived by powerful individuals or organizations. Since these kinds of arguments can sometimes involve the rejection of science, alternative explanations are invoked to replace the scientific evidence. For instance, people who reject the link between HIV and AIDS generally believe that AIDS was created by the US Government to control the African American population. We therefore select their posts as representative of misinformation.

Contribution. The contributions of this paper are the following:

• We show that automatic fact-checking with classification techniques employing only structural features of content propagation cascades (features that are robust to an attacker's manipulation) does not appear to yield usable results. Given the grain of our data, we design classifiers that leverage topological properties of content propagation cascades, and properties of the evolution over time of users' interactions with content. One set of classifiers operates with evolution properties to classify content during its early propagation stage; another set of classifiers operates with more features to classify content that has already propagated. Indeed, being able to issue warnings about possible fake news as early as possible, and to retroactively flag such news, can be useful in the fight against OSN misinformation.

• We evaluate our classifiers on a well-known dataset of Facebook posts from Italian pages. We use posts belonging to conspiracy theories as representative of misinformation, and posts belonging to scientific news as representative of fact-checked content. Our findings highlight the complexity of creating automatic solutions to misinformation classification. Indeed, structural features of content propagation do not allow us to obtain notable improvements over a random guess baseline: the $F_1$-score for the classification mechanisms we investigated is always lower than 0.65 during content propagation stages, and is still less than 0.7 even after propagation is complete.

Outline. Section 2 overviews related work in news and misinformation propagation in OSN. Section 3 formally models and defines content propagation in Facebook. Section 4 presents our experiment design and the features we extract from content propagation cascades. Then, Section 5 evaluates our classification and, finally, Section 6 critically discusses our results, summarizes the paper and delineates future work.

2 Related Work

Several studies addressed the spreading of rumors and behaviors on online social networks, challenging both their structural properties and their effects on social dynamics (Moreno, Nekovee, and Pacheco 2004; Doerr, Fouz, and Friedrich 2012; Seo, Mohapatra, and Abdelzaher 2012; Borge-Holthoefer et al. 2013; Cozzo et al. 2013; Borge-Holthoefer, Rivero, and Moreno 2012). In (Ugander et al. 2012), the authors find that the probability of contagion is tightly controlled by the number of connected components in an individual's contact neighborhood, rather than by the actual size of the neighborhood. In (Centola and Macy 2007) researchers show that, although long ties are relevant for spreading information about an innovation or social movement, they are not sufficient with respect to the social reinforcement necessary to act on that information. A key factor in identifying true contagion in social networks is to distinguish between peer-to-peer influence and homophily: in the first case, a node influences or causes outcomes to its neighbors, whereas in the second one, dyadic similarities between nodes create correlated outcome patterns among neighbors that could mimic viral contagions even without direct causal influence (McPherson, Smith-Lovin, and Cook 2001). The study presented in (Aiello et al. 2012) reveals that there is a substantial level of topical similarity among users who are close to each other in the social network, suggesting that users with similar interests are more likely to be friends. In (Aral, Muchnik, and Sundararajan 2009) the authors develop an estimation framework to distinguish influence and homophily effects in dynamic networks and find that homophily explains more than 50% of the perceived behavioral contagion. In (Bakshy et al. 2012) the analysis addresses the role of social networks and exposure to friends' activities in information resharing on Facebook. Having isolated contagion from other confounding effects such as homophily, the authors claim that there is a considerably higher chance to share content when users are exposed to friends' resharing. All these contributions strive to understand the inner mechanism of rumor spreading and to eventually predict massive diffusion processes, i.e., cascades. Cascade recurrence and prediction have been studied in (Cheng et al. 2014) and (Cheng et al. 2016).

3 Methods

In this section, we first report and describe the employed dataset (Section 3.1). We then give some necessary background knowledge, and present our reference model and definition of content propagation mechanisms of Facebook (Section 3.2).

3.1 Dataset

We employ a well-known dataset of posts shared by Italian Facebook users (Bessi et al. 2015a). This dataset contains posts published by 73 public Facebook pages: 34 pages that publish scientific content (e.g., press releases of peer-reviewed articles), and 39 pages that publish conspiracy theories-related content (e.g., new world order, chemtrails). Additionally, the dataset contains information about the interaction of users with these posts, and users' ego-networks (i.e., the list of users that are their friends, when such list is public). For a set of posts, the dataset also provides information about the propagation cascade of such content, generated by users' reshares, and subsequent reshares from their friends. This propagation from one user to other users can happen multiple times, forming a cascade of resharing. Using information from the dataset, we extracted 112141 non-empty propagation cascades, 89491 for conspiracy and 22650 for science, respectively. We underline that the dataset is obtained by using the Facebook Graph API, and contains only public information. Hence, timestamps of reshares and comments are available, but timestamps of like interactions are not.

3.2 Background and Definitions

We now present our reference formal representation of Facebook's friendship graph, and the potential propagation graph generated by content posted on the social network.

Facebook Friendship Graph. We model Facebook relationships as a graph $G\langle V, E\rangle$, that we call the Facebook friendship graph, where $V$ is the set of nodes that represent entities, namely user accounts and page accounts. We assume two main differences between these two types of entity: (i) pages can post new content on the OSN, while users can only interact with such content by liking, commenting, and resharing it; and (ii) users can establish friendship relationships with other users, while pages cannot. Indeed, two users $v_1, v_2 \in V$ are connected by an edge $e(v_1, v_2) \in E$ if they are friends on Facebook. Pages are not connected by any edge, as they do not have proper friendship relationships. This model is a simplification of how Facebook actually works, because users can post new content, and pages and users are linked by like relationships. Similarly, users can follow other users, without having any friend relationship with them. However, for this work, we do not focus on new content generated by users. Moreover, the dataset we use lacks information about the like and follow relationships, which we therefore cannot consider.

Potential Propagation Graph. Before formally modeling the spreading of content on Facebook, we describe some fundamental concepts of the OSN. We recall that, in our model, only pages can post new content on the social network. Henceforth, we refer to the page that originally posted some content as the seed page. Users, instead, find new content by looking at their timeline, where they see recent content posted by the pages they like, and content that their friends recently interacted with. They can then interact with such content by resharing, liking, and commenting on it. However, all of these interactions can happen directly on the original content, or on some types of interactions of users' friends. For example, a user observing a comment or reshare of a post by one of his friends can decide to interact directly with the original post, or with his friend's interaction itself. In the first case, the user's interaction looks exactly the same and it is impossible to understand whether the content was found thanks to his friend's interaction, or directly on the seed page. In fact, differently from related work (Friggeri et al. 2014), the dataset we employ does not provide this type of information. We therefore take a conservative approach, saying that the content potentially propagated to the user both from the seed page and from any of his friends that interacted with the content, without any distinction. On the other hand, in the second case, it becomes clear that the user found the content thanks to his friend. We therefore say that the content propagated to the user from his friend.

We now formalize the above observations. Let $P$ be the set of contents posted on Facebook by seed pages. For each post $p \in P$, created at some time $t$ by a seed page, at a generic subsequent point in time $t+\delta$ we define a potential propagation graph $G^p_{t+\delta}\langle V^p_{t+\delta}, E^p_{t+\delta}\rangle$, where $V^p_{t+\delta}$ is the set composed of the seed page and of the users that interacted with $p$ during the time interval $[t, t+\delta]$. The final potential propagation graph $G^p\langle V^p, E^p\rangle$ is the graph formed considering all interactions with $p$ in the timespan of the analysis (as new interactions with old content are always possible on Facebook). Two nodes $v_1, v_2 \in V^p_{t+\delta}$ are connected by an undirected potential propagation edge $e^p(v_1, v_2) \in E^p_{t+\delta}$ if either (i) $v_1$ or $v_2$ already interacted with $p$ and $\exists e \mid e(v_1, v_2) \in E$ (that is, $v_1$ and $v_2$ are friends on Facebook), or (ii) $v_1$ or $v_2$ is the seed page. Therefore, an edge $e^p(v_1, v_2) \in E^p_{t+\delta}$ indicates that the content $p$ potentially propagated either from $v_1$ to $v_2$, or from $v_2$ to $v_1$. We associate with each edge $e \in E^p_{t+\delta}$ two different properties:

1. Time: the time when the interaction between $v_1$ and $v_2$ took place;
2. Type: either "like", "comment", "reshare", or "friendship", depending on the type of interaction between $v_1$ and $v_2$.

Hereafter in the paper, we will refer to these properties for an edge $e \in E^p_{t+\delta}$ with e.property_name (for example, e.time).

Figure 1 depicts an example of this propagation model. We represent a simple Facebook friendship graph in Figure 1a, where edges represent friendship relations. Nodes $v_1, \ldots, v_4$ are OSN users, while node $s$ represents a public page posting content on the platform. We suppose $v_2$ and $v_4$ interacted with a given content posted by $s$, and $v_3$ reshared the post from $v_2$. Node $v_1$ did not interact with the content, hence it is not present in the potential propagation graph of the post, which we represent in Figure 1b. There, with edges $(s, v_2), (s, v_4), (v_2, v_4)$ we represent possible propagation paths of the content: either $v_2$ or $v_4$ could have seen the content from the seed page, or from a previous interaction of their friend. Node $v_3$ only has an edge $(v_2, v_3)$, as we know that the interaction happened thanks to $v_2$, and therefore the content propagated from this node. Additionally, we highlight that some possible propagation edges, represented with a dashed line, correspond to friendship edges of the friendship graph.

Figure 1: Sample Facebook friendship graph (a), and potential propagation graph of some content (b). Dashed lines represent potential propagation edges that also correspond to friendship edges of the friendship graph. Node $v_1$ does not interact with the content, and is therefore not present in the potential propagation graph.

4 Experiments

In this section, we describe the details of the experiment and of the analysis performed on the dataset. We first describe the experimental design (Section 4.1). We then thoroughly report and motivate the different features that we extracted from propagation cascades (Section 4.2).

4.1 Experimental Design

The aim of our experiments is to show that it is difficult to discriminate between conspiracy theories and fact-checked scientific news by using only the propagation graph of the post, which would be difficult for misinformation creators to manipulate. In particular, we evaluate this at two specific moments of content propagation:

1. Early Stage, meaning that classification of the type of post happens as early as possible during its propagation phase;
2. Final Stage, meaning that classification happens when the post has already stopped propagating (within the considered timespan of the employed dataset), and its cascade is complete.

These two scenarios are of particular interest, from both a research and a practical point of view. Indeed, being able to issue automated warnings as early as possible about potential misinformation could help in reducing its spread (Friggeri et al. 2014). On the other hand, still being able to perform such a classification when the post has already propagated might serve as a warning in order to prevent its further diffusion. Moreover, retroactively flagging old posts as potentially fake could help train users to discriminate between fact-checked and dubious information, a major direction in the fight against misinformation (First Draft News Coalition 2016).

To investigate these two scenarios, we set up two different experiments, modeled as binary classification tasks, where the two classes are conspiracy and science. We describe them in the following.

Early Stage Classification. In this scenario, we have access only to users' interactions with a post up to a certain time $t+\delta$, after it has been created at time $t$ (e.g., up to two hours after creation). With this scenario, we simulate how OSNs such as Facebook could try to continuously detect misinformation content during its propagation. Therefore, for each post $p$, given its creation time $t$, we extract features from the partial propagation graph $G^p_{t+\delta}$, and from the evolution of properties of the propagation graph before the current time $t+\delta$ (discussed further in Section 4.2). We use 30-minute steps, as this proved to be a good tradeoff between the granularity of the analysis and the number of intervals to consider. We stop at two days (2880 minutes, or 96 time steps) after the publication time of a post, because we observed that most of the interactions happen in this period (> 95%). Finally, we compare the performance of different well-known classifiers, namely Random Forests (Ho 1995), Linear Discriminant Analysis (Izenman 2013), and Multi-Layer Perceptron (Haykin 1998), for each $\delta \in \{30, 60, \ldots, 2880\}$, in a cross-validation scheme.

Final Stage Classification. In this scenario, we suppose that a post has already stopped its propagation, and no new interactions with it have been observed for a long time. We describe the final propagation graph $G^p$ of post $p$ with a set of high-level and topological features, which we describe in more detail in Section 4.2. We then compare the performance of different classifiers, in a cross-validation scheme.

4.2 Feature Extraction

To extract information from the different propagation graphs $G^p_t$ and $G^p$, we identify three possible categories of features: (i) high-level properties of the content propagation, (ii) topological properties of the propagation graphs, (iii) evolution properties of the content propagation. We use feature sets (i), (ii), and (iii), considering the evolution after two days, in the Final Stage Classification scenario. We only use feature set (iii) in the Early Stage Classification scenario, because the other features characterize only $G^p$, but not $G^p_t$. We summarize these different features in Table 1, along with their formal definitions, and describe them in the following.

High Level Properties. These features represent high-level properties of the complete propagation cascade represented by $G^p\langle V^p, E^p\rangle$.

Some features represent very general properties related to the virality of the content. These general properties are the lifetime of the cascade, measured as the distance in minutes from the first to the last captured interaction with the content; the size of the cascade in terms of number of nodes (users who interacted with the content); the number of total interactions; and the time required for the cascade to reach 90% of its total interactions (referred to as 90% interactions time).

Other features derived from high-level properties attempt to capture different possible types of interaction with the content. Friendships ratio is defined as the proportion of edges whose type is "friendship" over the total number of edges and represents the number of times, in proportion, that the post potentially propagated among friends, rather than directly from the seed node. Indeed, if no friends of users interact with some content, its potential propagation graph only contains edges with the seed page. Interactions ratio, instead, represents the average exposure to interactions from friends of users with the content. Since vertices are interacting users, and edges are potential interactions, higher values of this metric mean lower exposure (little interaction with the content by one's friends). These features are motivated by the observation that users may consume different types of content differently, with some types of content being interacted with directly at the source, and other types of content relying more on word-of-mouth propagation (Romero, Meeder, and Kleinberg 2011).

Topological Properties. We also select as features some well-known properties of the topological structure of graphs. These properties are commonly used to learn information about graph structures, and have been applied in solving problems such as link analysis and prediction (Al Hasan et al. 2006), and especially in cascade and virality prediction (Hong, Dan, and Davison 2011; Cheng et al. 2014). The average vertex degree feature represents the average number of possible propagation edges for the content at any given hop. Higher values of this metric indicate the presence of interacting users greatly exposed to the content, or able to influence many of their social friends. The global clustering coefficient (Holland and Leinhardt 1971), a measure of the density of connections of graphs, is another indication of whether the possible propagation paths are generated from interactions between friends, or directly with the seed page. Assortativity coefficient, defined as the degree correlation between pairs of linked nodes (Newman 2002), can measure how friends influence each other in interacting with content on the social network. Average path length (Fronczak, Fronczak, and Hołyst 2004), also known as Wiener index, gives us indications of the virality of the content, in terms of distance of propagation from the seed page. Long cascading news, reshared many times from interacting friends, will exhibit a longer average path length than news whose interactions happened mostly from the seed node. Finally, the diameter of a graph, defined as the longest shortest path between any pair of nodes of the graph, indicates the spreading distance of posts.

Evolution Properties. These features represent evolution properties over time of the propagation of post $p$, from its creation time $t$ to a subsequent point $t+\delta$. To compute these features, we construct the propagation graphs at different time steps $G^p_{t+30}, \ldots, G^p_{t+\delta}$. We then calculate the value of three of our high-level features for each graph at each time step: (i) Friendships Ratio, (ii) Size, and (iii) Interactions Ratio. For each of these high-level features, we obtain a time series $v_1, v_2, \ldots, v_{\delta/30}$, on which we compute a set of well-known statistical measures that represent the evolution over time of the series (Wiens, Horvitz, and Guttag 2012). Namely, these statistical measures are the mean, standard deviation, linear weighted mean and quadratic weighted mean, average absolute change, and maximum of the series. These measures capture the evolution of the time series up to the current time and, importantly, do not require us to know the whole time series, an important property for our Early Stage classification experiment. We decided to derive time series only for the three high-level features listed above. Indeed, we argue that the other features either have no temporal properties (e.g., lifetime, time to reach 90% interactions), evolve in similar or predictable ways (e.g., diameter), or describe behaviors that are already captured by the selected time series. Moreover, evolution properties of time series would be especially used in Early Stage classification. Social networks need to perform feature extraction in this scenario at every time step: features derived from topological properties are too computationally expensive (e.g., diameter calculation runs in more than quadratic complexity w.r.t. the number of vertices) to be continuously calculated for each new post.

5 Results

In this section we present the results of our experiments. In particular, we first discuss evaluation metrics and baseline (Section 5.1). We then report the results on the Early Stage Classification scenario (Section 5.2), and the results on the Final Stage Classification scenario (Section 5.3).

5.1 Evaluation Metrics

As usual in binary classification, the classification baseline is the performance of a random classifier on the data: without any information regarding the propagation graph, it guesses either science or conspiracy with equal probability, as a coin flip. The goal of our experiments is to show that structural features do not help more sophisticated models in improving the baseline performance.

Table 1: Description and formal definition of features we use on our Final Stage and Early Stage classification experiments.

High Level Properties
  Size: Number of edges of the graph, $|E^p|$
  Friendships Ratio: Proportion of edges whose type is "friendship", $|\{e \in E^p \mid e.type = \text{"friendship"}\}| / |E^p|$
  Interactions Ratio: Number of vertices over the number of edges, $|V^p| / |E^p|$
  Lifetime: Time passed between post creation time and the last interaction time, $\max_{e \in E^p} e.time - t$
  90% Interactions Time: Time required for the content to reach 90% of its number of interactions

Topological Properties
  Average Vertex Degree: Average possible propagation paths of the post, $2 \cdot |E^p| / |V^p|$
  Clustering Coefficient: Ratio of connected triplets of nodes
  Assortativity Coefficient: Degree correlation between pairs of linked nodes
  Average Path Length: Average length of the shortest paths
  Diameter: Length of the longest shortest path between any pair of vertices

Evolution Properties
  Mean: $\frac{1}{n} \sum_{i=1}^{n} v_i$
  Linear Weighted Mean: $\frac{2}{n(n+1)} \sum_{i=1}^{n} i \cdot v_i$
  Quadratic Weighted Mean: $\frac{6}{n(n+1)(2n+1)} \sum_{i=1}^{n} i^2 \cdot v_i$
  Standard Deviation: $\sqrt{\frac{\sum_{i=1}^{n} (v_i - \bar{v})^2}{n}}$, where $\bar{v}$ is the mean of $v$
  Average Absolute Change: $\frac{1}{n} \sum_{i=1}^{n-1} |v_i - v_{i+1}|$
  Maximum: $\max_i v_i$
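As an illustration of how the evolution features of Table 1 could be derived, the following is a minimal sketch that rebuilds the Figure 1 example, takes snapshots of the potential propagation graph at 30-minute steps, and computes the six statistics for each of the three tracked series. The tuple layout of `interactions`, the timestamps, the friendship set, the helper names, and the rule that assigns the "friendship" type to edges between friends are all assumptions made for this example; this is not the authors' implementation.

```python
SEED = "s"  # hypothetical identifier of the seed page

def build_edges(interactions, friends, horizon):
    """Potential propagation edges (u, v, minute, type) up to `horizon` minutes."""
    edges, seen = [], set()
    for user, minute, kind, via in sorted(interactions, key=lambda x: x[1]):
        if minute > horizon:
            break
        if via is not None:
            # Interaction on a friend's interaction: the source is known.
            edges.append((via, user, minute, kind))
        else:
            # Source unknown: seed page plus every friend who already interacted.
            edges.append((SEED, user, minute, kind))
            edges += [(other, user, minute, "friendship")
                      for other in seen if frozenset((other, user)) in friends]
        seen.add(user)
    return edges

def high_level(edges):
    """Size, Friendships Ratio and Interactions Ratio as defined in Table 1."""
    size = len(edges)
    if size == 0:
        return {"size": 0, "friendships_ratio": 0.0, "interactions_ratio": 0.0}
    nodes = {n for u, v, _, _ in edges for n in (u, v)}
    friendship = sum(1 for e in edges if e[3] == "friendship")
    return {"size": size,
            "friendships_ratio": friendship / size,
            "interactions_ratio": len(nodes) / size}

def evolution_stats(v):
    """The six statistics of Table 1 for a time series v_1, ..., v_n."""
    n = len(v)
    mean = sum(v) / n
    idx = range(1, n + 1)
    return {
        "mean": mean,
        "linear_weighted_mean": 2 * sum(i * x for i, x in zip(idx, v)) / (n * (n + 1)),
        "quadratic_weighted_mean":
            6 * sum(i * i * x for i, x in zip(idx, v)) / (n * (n + 1) * (2 * n + 1)),
        "std": (sum((x - mean) ** 2 for x in v) / n) ** 0.5,
        "avg_abs_change": sum(abs(a - b) for a, b in zip(v, v[1:])) / n,
        "max": max(v),
    }

# Toy data mirroring Figure 1: (user, minute after creation, type, friend acted upon).
interactions = [("v2", 10, "like", None),
                ("v4", 40, "comment", None),
                ("v3", 70, "reshare", "v2")]
friends = {frozenset(("v1", "v2")), frozenset(("v2", "v4")), frozenset(("v2", "v3"))}

snapshots = [high_level(build_edges(interactions, friends, h)) for h in (30, 60, 90)]
features = {}
for name in ("size", "friendships_ratio", "interactions_ratio"):
    series = [snap[name] for snap in snapshots]
    for stat, value in evolution_stats(series).items():
        features[f"{name}_{stat}"] = value
```

With three series and six statistics each, the sketch yields 18 values per post, matching the number of features reported for the Early Stage scenario. Because every statistic only uses the series up to the current step, the same code can be rerun at each new 30-minute snapshot without knowing the final cascade.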
Unfortunately, our dataset is highly imbalanced (composed of 89491 news items for conspiracy, and only 22650 for science). With such imbalance, standard evaluation metrics (such as precision, accuracy, and recall) can be misleading, because they do not account for the uneven class frequencies. Even computing averages of these metrics using weights based on the class frequencies does not fit our intention of consistently comparing our results with the fixed baseline.

To deal with this imbalance, we performed two distinct experiments: (i) we consider only metrics that take imbalance into account and use the full dataset, and (ii) we consider meaningful metrics on a balanced dataset, obtained by undersampling the original dataset.

To perform (i), as metrics we use the Area Under the Receiver Operating Characteristic Curve (AUC) (Hanley and McNeil 1982), and Cohen's Kappa (Cohen 1968) (scaled into the interval [0, 1]). Indeed, the value of these metrics for a random classifier is exactly 0.5, which we use as baseline. In this way, we can use the full dataset and still be able to compare our results with the baseline.

To perform (ii), we undersampled the most frequent class (i.e., conspiracy). We therefore extracted exactly 22650 conspiracy samples from the full dataset, and created a subsample with perfectly balanced classes. To account for possible biases caused by the undersampling, we repeated the process several times, and averaged the outcomes. Using perfectly balanced datasets, we can evaluate precision, recall, accuracy, and $F_1$-score values of our classifiers. Indeed, the value of these metrics for a random classifier on balanced data is exactly 0.5, which we use as baseline.

Hereafter, if not differently specified, the metrics that require a positive class (such as precision, recall, and $F_1$-score) use conspiracy as the positive class, and science as the negative one.

5.2 Early Stage Classification

To evaluate this scenario, we recall that we used a total of 18 features. We then followed the methodology explained in Section 4.1, and evaluated three separate classifiers (namely, Linear Discriminant (LD), Random Forest (RF), Multi-Layer Perceptron (MLP)) at every 30-minute time step. As discussed, we first evaluated AUC and Cohen's Kappa of these classifiers on the full, unbalanced dataset. We then evaluated precision, recall, and accuracy of these classifiers on undersampled balanced datasets.

Figure 2 reports the AUC and the Cohen's Kappa metrics, on a 5-fold cross-validation scheme, at different time steps after the original post is published. We can see that evolution properties of the propagation graph do not significantly help our classifiers. Indeed, the curves are close to the random classification baseline (dotted horizontal line), and far from perfect classification (i.e., 1.0). Cohen's Kappa lies almost exactly on the random classification baseline, suggesting that the classification performance is almost random. Using AUC, it is possible to set good classification thresholds, slightly improving the performance. MLP performs slightly better than RF and LD, and we can observe that results do not change significantly after a few hours of analysis. However, the outcome remains unsatisfying (i.e., AUC under 0.6 for each classifier at each time step).

Figure 2: Early Stage Classification scenario, AUC and Cohen's Kappa, as a function of elapsed time.

Figure 3 reports the $F_1$-score obtained on the undersampled balanced dataset, on a 5-fold cross-validation scheme, then averaged over ten repetitions of the undersampling procedure. Figure 3 further confirms the findings highlighted in Figure 2: classifiers are not able to leverage the evolution properties to discriminate between science and conspiracy. In fact the $F_1$-score curve, after a slight increment during the first 10 hours, stabilizes relatively close to the baseline, and remains below 0.65 throughout the whole 48-hour timespan.

Figure 3: Early Stage Classification scenario, classification $F_1$-score, as a function of elapsed time.

5.3 Final Stage Classification

To evaluate this scenario, we recall that we used a total of 28 features with three distinct classifiers: Linear Discriminant (LD), Random Forest (RF), Multi-Layer Perceptron (MLP). Again, we first evaluated the unbalanced dataset, and then evaluated the undersampled balanced datasets.

Figure 4 reports average AUC and Cohen's Kappa on a 5-fold cross-validation scheme, on the full dataset. The dotted horizontal line shows the baseline performance of random classification. Indeed, we observe that our classifiers do not significantly improve on the baseline, with the metrics remaining below 0.75. LD performance is poorer than RF and MLP, probably due to the simplicity of the classification tool.

Figure 4: Final Stage Classification scenario, AUC and Cohen's Kappa.

In Figure 5 we report the ROC curve, on a 5-fold cross-validation scheme, on the full dataset. From Figure 5 we can observe that no classifier can reach a good tradeoff between true positive rate and false positive rate. Indeed, the curves are relatively close to the baseline (diagonal dotted line), meaning that, as the decision threshold changes, many samples are misclassified.

Figure 5: Final Stage Classification scenario, ROC curves. For visual clarity, we linearly interpolated the points in the curves, and plotted markers at each 0.02 step in the false positive rate.

Table 2 reports several metrics computed on the undersampled balanced dataset. Results are averaged on a 5-fold cross-validation scheme, and then over ten repetitions of the undersampling procedure. These metrics show that even if the classifiers are able to improve on the baseline slightly, they cannot reach good performance.

Classifier   Precision   Recall   Accuracy   F1-score
LD           0.578       0.523    0.570      0.549
[...]

[...] classification. These results highlight the necessity of including content-related features, or polarization metrics, in future analysis (i.e., whether particular users and their echo chambers are more polarized towards one type of content). Unfortunately, misinformation creators can easily control content-related features, in order to avoid algorithmic detection. Moreover, user polarization can be clearly understood from past users' behaviors, but it takes time to understand the polarization of new users. Hence, automatic detection of fake news remains an open challenge.

Future Work. The employed dataset has some limitations, bound to the Facebook API: (i) it only contains Facebook public information; (ii) it does not contain the timestamp of likes, one of the most common interactions; and (iii) it does not always provide information about whether interaction with content happened because of interactions of the user's friends. We would like to analyze finer-grained data that takes these factors into account, because it could lead to improved results.

In our analysis, we identified a set of features and used them in well-known classifiers. Our experiments were extensive, but not complete. However, we expect that the use of different models would not provide significant improvement. This claim needs to be further validated, possibly also using more recent datasets.

In the future, we also plan to test different methods for misinformation classification, based on user polarization and content-related features, to investigate whether this information could complement propagation properties, and overcome the difficulty of this problem.
RF 0.654 0.742 0.675 0.695 MLP 0.659 0.688 0.665 0.672 References Table2: FinalStageClassificationscenario,performanceof Aiello,L.;Barrat,A.;Schifanella,R.;Cattuto,C.;Markines, theclassifiers. B.; and Menczer, F. 2012. Friendship prediction and ho- mophily in social media. ACM Transactions on the Web (TWEB)6(2):1–33. 6 Conclusions Al Hasan, M.; Chaoji, V.; Salem, S.; and Zaki, M. 2006. Earlydetectionofmisinformationplaysacrucialroleinso- Linkpredictionusingsupervisedlearning. InInternational Conference on Data Mining (SDM): Workshop on Link cial networks. In this paper, we analyzed the difficulty of discerning conspiracy posts from scientific posts on Face- Analysis,Counter-TerrorismandSecurity. book. Wefocusedonusingonlystructuralfeaturesofcon- Aral, S.; Muchnik, L.; andSundararajan, A. 2009. Distin- tentpropagation,becausetheycannotbeeasilymanipulated guishinginfluence-basedcontagionfromhomophily-driven bymisinformationcreators. Ourresultsshowthatmisinfor- diffusionindynamicnetworks. ProceedingsoftheNational mationclassificationduringitsearlypropagationstagewith AcademyofSciences(PNAS)106(51):21544–21549. thesefeaturesisunsuccessful,suggestingthatthespreading Backstrom, L.; Kleinberg, J.; Lee, L.; and Danescu- dynamics captured by our features are independent on the Niculescu-Mizil,C. 2013. Characterizingandcuratingcon- typeofcontent. Furthermore,evenconsideringthecascade versation threads: Expansion, focus, volume, re-entry. In attheendofcontentpropagationdoesnothelp: alsointhis International Conference on Web Search and Data Mining case,theimprovementprovidedbyaclassifieroverrandom (WSDM). ACM. coinflipsisnegligible. Bakshy, E.; Rosenn, I.; Marlow, C.; andAdamic, L. 2012. OurfindingssuggestthatinFacebookusersinteractwith Theroleofsocialnetworksininformationdiffusion. InIn- different types of content in similar ways, reinforcing the ternational Conference on World Wide Web (WWW), 519– hypothesis of echo chambers (Del Vicario et al. 2016). In- 528. ACM. 
side these chambers, strongly polarized by topic (Bessi et al. 2015b), content propagation exhibits very similar struc- Bessi,A.;Coletto,M.;Davidescu,G.;Scala,A.;Caldarelli, turalproperties,thatarethereforelessusefulincontentclas- G.; and Quattrociocchi, W. 2015a. Science vs conspiracy: Collective narratives in the age of misinformation. PLOS Fronczak, A.; Fronczak, P.; and Hołyst, J. 2004. Aver- ONE10(2):1–17. age path length in random networks. Physical Review E Bessi, A.; Petroni, F.; DelVicario, M.; Zollo, F.; Anagnos- 70(5):056110. topoulos, A.; Scala, A.; Caldarelli, G.; and Quattrociocchi, Hanley, J., andMcNeil, B. 1982. Themeaninganduseof W.2015b.Viralmisinformation:Theroleofhomophilyand theareaunderareceiveroperatingcharacteristic(roc)curve. polarization. In International Conference on World Wide Radiology143(1):29–36. Web(WWW)Companion. ACM. Haykin,S.1998.NeuralNetworks:AComprehensiveFoun- Borge-Holthoefer, J.; Meloni, S.; Gonc¸alves, B.; and dation. PrenticeHallPTR,2ndedition. Moreno, Y. 2013. Emergence of influential spreaders Ho,T.1995.Randomdecisionforests.InInternationalCon- in modified rumor models. Journal of Statistical Physics ference on Document Analysis and Recognition (ICDAR), 151(1-2):383–393. volume1,278–282. IEEE. Borge-Holthoefer, J.; Rivero, A.; and Moreno, Y. 2012. Holland, P., and Leinhardt, S. 1971. Transitivity in Struc- Locating privileged spreaders on an online social network. tural Models of Small Groups. Small Group Research PhysicalReviewE85(6):066123. 2(2):107–124. Centola,D.,andMacy,M. 2007. Complexcontagionsand Hong,L.; Dan,O.; andDavison,B. 2011. Predictingpop- the weakness of long ties. American Journal of Sociology ularmessagesintwitter. InInternationalConferenceCom- 113(3):702–734. paniononWorldWideWeb(WWW). ACM. Centola,D.2010.Thespreadofbehaviorinanonlinesocial Izenman,A. 2013. Lineardiscriminantanalysis. InModern networkexperiment. Science329(5996):1194–1197. MultivariateStatisticalTechniques.Springer. 237–280. 
Cheng,J.;Adamic,L.;Dow,P.;Kleinberg,J.;andLeskovec, Kramer,A.;Guillory,J.;andHancock,J. 2014. Experimen- J. 2014. Cancascadesbepredicted? InInternationalCon- tal evidence of massive-scale emotional contagion through ferenceonWorldWideWeb(WWW). ACM. social networks. Proceedings of the National Academy of Cheng,J.;Adamic,L.;Kleinberg,J.;andLeskovec,J.2016. Sciences(PNAS)111(24):8788–8790. Do cascades recur? In International Conference on World WideWeb(WWW),671–681. ACM. McPherson,M.;Smith-Lovin,L.;andCook,F. 2001. Birds ofafeather: Homophilyinsocialnetworks. AnnualReview Cohen,J. 1968. Weightedkappa: Nominalscaleagreement ofSociology415–444. provisionforscaleddisagreementorpartialcredit. Psycho- logicalBulletin70(4):213. Moreno,Y.;Nekovee,M.;andPacheco,A.2004.Dynamics of rumor spreading in complex networks. Physical Review Cozzo, E.; Banos, R.; Meloni, S.; and Moreno, Y. E69(6):066130. 2013.Contact-basedsocialcontagioninmultiplexnetworks. PhysicalReviewE88(5):050801. Newman, N. and Levy, D.A.L. and Nielsen, R.K. 2015. Reuters Institute digital news report. http://www. DelVicario,M.; Bessi,A.; Zollo,F.; Petroni,F.; Scala,A.; digitalnewsreport.org/survey/2015/. [On- Caldarelli, G.; Stanley, H.; and Quattrociocchi, W. 2016. line;accessed10-January-2017]. Thespreadingofmisinformationonline. Proceedingsofthe NationalAcademyofSciences(PNAS)113(3):554–559. Newman,M. 2002. Assortativemixinginnetworks. Physi- calReviewLetters89(20):208701. Diao, Q.; Jiang, J.; Zhu, F.; and Lim, E. 2012. Finding bursty topics from microblogs. In Annual Meeting of the Quattrociocchi, W.; Caldarelli, G.; and Scala, A. 2014. Association for Computational Linguistics: Long Papers - Opiniondynamicsoninteractingnetworks: Mediacompeti- Volume 1 (ACL), 536–544. Association for Computational tionandsocialinfluence. ScientificReports4. Linguistics. Quattrociocchi, W.; Scala, A.; and Sunstein, C. 2016. Doerr, B.; Fouz, M.; and Friedrich, T. 2012. Why rumors Echo chambers on facebook. 
Available at SSRN: https: spread so quickly in social networks. Communications of //ssrn.com/abstract=2795110. theACM55(6):70–75. Rogers, K. and Bromwich, J.E. 2016. The hoaxes, Facebook. 2016. Company info. http://newsroom. fake news and misinformation we saw on election day. fb.com/company-info/. [Online; accessed 10- http://www.nytimes.com/2016/11/09/us/ January-2017]. politics/debunk-fake-news-election-day. First Draft News Coalition. 2016. About First Draft. html. [Online;accessed10-January-2017]. https://firstdraftnews.com/about/. [Online; Romero, D.; Meeder, B.; and Kleinberg, J. 2011. Dif- accessed10-January-2017]. ferences in the mechanics of information diffusion across Fowler, J., and Christakis, N. 2010. Cooperative behav- topics: Idioms, political hashtags, and complex contagion ior cascadesin human social networks. Proceedings ofthe ontwitter. InInternationalConferenceonWorldWideWeb NationalAcademyofSciences(PNAS)107(12):5334–5338. (WWW),695–704. ACM. Friggeri, A.; Adamic, L.; Eckles, D.; and Cheng, J. 2014. Salganik,M.;Dodds,P.;andWatts,D. 2006. Experimental Rumor cascades. In AAAI Conference on Weblogs and So- study of inequality and unpredictability in an artificial cul- cialMedia(ICWSM). turalmarket. Science311(5762):854–856. Seo,E.;Mohapatra,P.;andAbdelzaher,T. 2012. Identify- oftheNationalAcademyofSciences(PNAS)109(16):5962– ingrumorsandtheirsourcesinsocialnetworks.InSPIEDe- 5966. fense, Security, and Sensing, 83891I–83891I. International Wiens,J.;Horvitz,E.;andGuttag,J.2012.Patientriskstrat- SocietyforOpticsandPhotonics. ificationforhospital-associatedc.diffasatime-seriesclas- Silverman, C.; Strapagiel, L.; Shaban, H.; Hall, sificationtask. AdvancesinNeuralInformationProcessing E.; and Singer-Vine, J. 2016. Hyperpartisan Systems(NIPS)467–475. Facebook pages are publishing false and mislead- Zollo, F.; Bessi, A.; Del Vicario, M.; Scala, A.; Cal- ing information at an alarming rate. 
https: darelli, G.; Shekhtman, L.; Havlin, S.; and Quattrocioc- //www.buzzfeed.com/craigsilverman/ chi, W. 2015a. Debunking in a world of tribes. arXiv. partisan-fb-pages-analysis. [Online; accessed preprint,http://arxiv.org/abs/1510.04267. 10-January-2017]. Zollo,F.;Novak,P.K.;DelVicario,M.;Bessi,A.;Mozeticˇ, Sunstein,C. 2002. Thelawofgrouppolarization. Journal I.;Scala,A.;Caldarelli,G.;andQuattrociocchi,W. 2015b. ofPoliticalPhilosophy10(2):175–195. Emotional dynamics in the age of misinformation. PLOS ONE10:1–22. Ugander, J.; Backstrom, L.; Marlow, C.; and Kleinberg, J. 2012. Structuraldiversityinsocialcontagion. Proceedings
