Data Harvesting 2.0: from the Visible to the Invisible Web ClaudeCastelluccia Ste´phaneGrumbach LukaszOlejnik INRIA INRIA INRIA [email protected] [email protected] [email protected] Abstract It might at first sight seem irrelevant to know which cor- porations are handling and concentrating data, and in what Personal data are fuelling a fast emerging industry which countriestheyarebased,butithastremendousconsequences transform them into added value. Harvesting these data is related in particular to the jurisdiction that applies, not to therefore of the outermost importance for the economy. In mentiontheeconomicimpact,directorindirect.Somecom- this paper, we study the flows of personal data at a global plexfiscalissueshavealreadybeenraisedinEuropeonthe level, and distinguish countries based on their capacity to economyofthedata,whichatthisstageismostlyinvisible harvestdata.Weestablishacartographyofinternationaldata toEuropeanstates1. channelsonthevisibleandinvisibleWeb.ThevisibleWebis Personaldatahavebecomeanessentialresourceforthenew composedofthesitesthatareavailabletothegeneralpublic economy of the information society, much like iron ore or and are typically indexed by search engines. The invisible crudeoilwerefortheindustrialeconomy.Developingtools Webreferstotags,Webbugs,pixelsandbeaconsthatappear toharvestpersonaldataisthereforeofstrategicimportance onWebsitestotrackandprofileusers. to catch up with the digital revolution. Harvesting data is It is well known that the US dominate the visible Web mostlydonebysystems,suchassearchengines,socialnet- with more than 70% of the top 100 sites in the world. We works,clouds,etc.wherepeopleprovidetheirpersonaldata show that this domination is even stronger on the invisible in exchange for a service, which most of the time is acces- Web.Thelargestproportionoftrackersinmostcountriesare sible essentially for free 2. More precisely, the private data indeed from the US. Apart from the US, two countries ex- givenbyusers,constitutethemeansbywhichuserspayfor hibitanoriginalstrategy.China,whichdominatesitsvisible services.Itisthereforenotonlyaresource,butcanbeseenin Web with a majority of local sites, but surprisingly these asenseasavirtualcurrency.Harvestingdatacanbedoneas sites still contain a majority of US trackers. Russia, which wellinmoresubtleways,onthe”invisibleWeb”,bytrack- alsodominatesitsvisibleWeb,andistheonlycountrywith ingpeople,whichiscarriedonmostlybythirdpartyentities. morelocaltrackersthanUSones. The main objective of this paper is to analyze the current 1. Introduction worldsituation,andperformageographicalanalysisofdata SunTzufamouslywroteinthe6thcenturyBC:”Ifyouknow harvestingonboththevisibleandtheinvisibleWeb.Thevis- yourenemiesandknowyourself,youwillnotbeimperilled ible Web is composed of the sites that are available to the in a hundred battles; if you do not know your enemies but generalpublicandaretypicallyindexedbysearchengines. do know yourself, you will win one and lose one; if you TheinvisibleWebreferstotags,Webbugs,pixelsandbea- do not know your enemies nor yourself, you will be imper- consthatappearonWebsitestotrackandprofileusers. illed in every single battle” [1]. This statement might be- comemorerelevantthaneverwiththeexplosionofinforma- Thispaperaimsatansweringthetwofollowingquestions: tion now available from the Internet and Web 2.0 systems. Some countries are widely collecting data on the visible or 1. What are the most influential Web systems? In which the invisible Web about their citizens as well as sometimes countries are they based? What are their regional influ- about citizens of other countries. Some other countries on ence? the other hand are relying mostly on foreign systems, thus lettingalargeamountoftheirdatabehandledoutsidetheir 1Somecountries,suchasFrance[19],actuallyintendtotaxthetracking borders.Thisdiscrepancybetweenthesetwoapproachesto and data-mining providers. The premise behind this assumption is that consumerspayforservicewiththeirprivatedataandthisgoodisexchanged the management of personal data could result in informa- withoutissuingatax. tionasymmetriesbetweentheplayers,whetherindustrialor 2Notethatmostpeoplepayabout1000dollars/year(ISP,3G,...)toaccess governmental,intheircapacitytoaccessstrategicdata[24]. theInternet.Itisarguablewhethertheseservicesarereallyfree. 2. WhicharethebiggesttrackersontheInternet?Inwhich cently certain countries, notably the UK, passed laws re- countries are they based? What are their regional influ- quiring Websites to gain their users consent for the cookie ence? usage [14]. Examples include the site of the BBC, recently extendedwithaninformationpop-up.Usersareexpectedto In other words, this paper aims at analyzing the cyber- consentthoughtheymostlymechanicallydoso,withoutany strategies of different countries in terms of data harvesting specialconsiderationorunderstanding. onboththevisibleandtheinvisibleWeb. Itisimportanttonotetheexistenceofothertrackingmech- Interestingly, the answers to the previous questions reveal anisms. Of particular importance, in addition to Evercook- very different patterns, not necessarily correlated with the ies[11],arenon-standardbrowserfingerprintingtechniques penetrationrate.Numerousstudieshavemeasuredthelevel such as browser configurations [4], history [18], host iden- ofpenetrationoftheinformationsocietysuchastheWorld tifiers [26], pixels [16], allowing tracking of the browser EconomicForum[3].Recently,theWebFoundation3,ledby acrossvarioussites.Thediscussionoftherisksandprotec- Tim Berners-Lee, launched its Web Index, which as previ- tions against potential tracking by social buttons (i.e. Face- ousstudiesrankstopofthelistNorthAmericaandNorthern book’s Like) is covered in the work of [22]. According to Europe, as well as some countries of Asia. The index mea- [18], most popular sites can be leveraged to basic history- sures three key attributes of the Web: ”Web readiness”, for based fingerprinting and using this intuition, we assumed thecommunicationsinfrastructure;”Webuse”,forthepop- thiscanalsobethecaseofthemostimportantsitesfortrack- ulationonlineandthecontentsavailable;and”theimpactof ing. theWeb”,foritseconomic,socialandpoliticalimpact.The New risks resulted in new reactions. Quite recently, a new Web Foundation gives a high weight to the political open- and bold initiative attempts at limiting the tracking on the ness. China ranks 29th out of 61 countries in this index, a Internet. This initiative, known as Do Not Track [15], pro- ratherlowrank,whichrevealsthoughanincreasedWebuse, motes the consensual and easy solution of opting out from butarelativelyslowevolutionofWebcontent. tracking on the Web. It is technically achieved by a simple Thereisonemeasurethatislittletakenintoaccountinthese additionofananotherHTTPheaderinthebrowser’srequest: rankings, namely the local development of the Web 2.0 in- DNT[25].AccordingtoMozilla,morethan11%ofFirefox dustry,whichoffersanindicatorofthepotentialstrategicac- usershaveactivatedtheDNT.TheDNTinitiativeleadtodif- tivityonBigDatainthecountry4.TheUSwouldberanked ficultnegotiationswiththeadvertisementindustryintheUS atthetoppositionforsuchacriteria.Theyhaveindeedde- [9]. veloped the strongest industry worldwide, with most of the Trackingcanalsobelimitedbyusingdedicatedbrowserex- first online social systems accessed in the world, such as tensions, which can block unwanted tracker scripts and/or Google, Facebook, YouTube, Yahoo!, Wikipedia, Windows ads.Weselectedtwoofthemostprominentones,Ghostery Live,Twitter,Amazon,tonamethemostpopular.Withthese and AdBlock Plus, and compared them to present actual corporations,USAharvestsprivatedataofpeopleallaround metrics of performance. For a global analysis, we refer to the World that can be analysed for an unpredictable set of [12],whichshowshowtrackinghaschangedwithtimeand purposes,withconsiderableeconomicimpact.Ofcoursethe acquisitions of various companies by others. Our work fo- UShaveadominantpositioninanumberofstrategicsectors cusesonaglobalapproach,incontrastwith[18],wherethe oftheinformationsociety,includingtheoperatingsystems, situation of individuals was addressed. Geographic differ- the browsers, or the clouds that support the systems of the encesintheTwitternetworkhavebeenstudiedaswell[13]. Web. Whilesomeanalogiestoourresearchcanbefound,trackers informationarefundamentallydifferentfrominformationon Technically speaking, tracking is made possible by the de- Twitter’susers. sign of the HTTP protocol [17] itself. Third-party scripts Paper organisation: The paper is organized as follows. In can indeed be used for tracking purposes. Every inclusion the next section, we present the top harvesting sites of the ofathird-partyscriptonafirst-party,i.e.onthevisitedsite, visible Web, and their geographical influence. Section 3 is requires the browser to execute a request to this third-party devotedbothtothetrackeesandthetrackersoftheinvisible server, download the script and execute it into the user’s Web. In Section 4, we analyse the correlations between the browser. differentharvestingtechniques. Browser cookies [2] are traditionally used to maintain a browser-state of a Web user. They allow to tie a given browser with the internal profile in a third-party database. This motivates to limit the use of cookies on the Web. Re- 2. ThevisibleWeb 3http://www.Webfoundation.org/ Our investigation focuses on the top Websites of 55 coun- 4Inthesequel,weconsiderthemostpopularWebsites,globallyorinone tries,whichhavebeenidentifiedusingthestatisticsprovided specificcountrybasedonAlexa’sranking.AlexaisasubsidiaryofAmazon. alexa.com by Alexa. For simplicity, in the sequel we use the country code top-level domain in CcTLD format5. Alexa maintains with the american systems. Both China and Russia have lists of over 500 sites per country, but we restricted our at- developedaverypowerfulindustrywhichharvestsmostof tentiontothe25to100mostpopularsitesintheselists. the data produced by their citizens. In China most of the first 50 sites are Chinese [10]. As shown in the infography 2.1 Topsitesbycountry produced by Ogilvy8, there is no area of the social media Data on the Web 2.0 are produced by users everywhere in where a Chinese company cannot be found. Moreover, in theWorld,buttheyareaccumulatedbycorporations,which some areas, several very large systems coexist, while only formostusersworldwidearenotintheirowncountry.Afirst one dominates in the US, not to mention the rest of the measureofthisphenomenoncanbeestimatedbymeasuring World.Itisthecaseformicrobloggingplatformforinstance, foreachcountrythepercentageofnationalWebcorporations where Sina Weibo, and Tencent Weibo coexist with both amongthetop25sitesusedinthatcountry6.Theresultsare around 300 million users, and both ahead of Twitter. The striking.Table1presentsforafewrepresentativecountries ratio of local sites in Russia is lower than in China. Most thepercentageofnationalWebcorporationsamongthetop ofthetopsitesoftheUShavepredominantpositions(e.g., 25ineachcountry. Facebook) in Russia, while they are blocked in China. The relativesizeofthetwocountriesthoughimpactonthesizeof CC Nat.ratio ForeignSites theirfirstsystems,butbothalsohavetheirrespectivesphere US 100% noforeignsite ofinfluenceabroad. CN 92% onlyforeignsite:Google SouthKoreahasanextremelyinterestingpatternofdiversity. RU 68% mostlyamericansites Amongthetop25sites,thereareonly24%ofsiteswhichare JP 36% mostlyamericansites national,whilethereare36%ofbothAmericanandChinese KR 24% halfAmerican sites,aremarkablesituation.Amongolianportal(zaluu)also halfChinese belongtothelist. FR 36% Onlyamericansites NG 24% mostlyamericansites Letusconsidernowthetop100sites.Whenlookingbeyond thetop25sites,theratioofnationalsitesincreases,inpar- Table1:PercentageofnationalWebsitesamongtop25 ticularwithmostofthelocalnewspapersandmagazines,as wellassomeservicessuchasbankinginstitutions. In the US, there are no foreign sites among the top 25 Websites.Forallothercountriesweconsidered,apartfrom China and Russia, the ratio of national sites amounts at best to around a third of the Websites. Both in Japan and France,only36%ofthetop25Websitesarenational,butthis number hides very different realities in the two countries. First,whileinFranceall64%offoreignsitesareAmerican, Figure1:Ratio forsitesinAlexalists,byorigin. inJapan,thereismorediversity.TwoChinesesites(search us engine Baidu, instant messaging QQ) and one Korean site Figure1showsforaselectedlistofcountries,theproportion (search portal Naver) belong to the top 25 sites in Japan. ofsitesfromtheUS(inblack),thecountryitself(inred),as Second, and more importantly, the French sites are mostly well as the two others locally most active foreign countries sites7, such as newspapers, which do not gather as much (inblue),inthetop100sites.Theresultsarenormalizedto personal data, while in Japan, national sites include very thenumberofsitesfromtheUS,thatistheycorrespondtoa dataintensiveones,suchasWebportals,e-commerce,blogs, ratio,wheretheper-countrycountisdividedbythenumber etc. Similar patterns would be found for other European oftheUSbasedsites. countries. Italy for instance has only 28% of national sites. Differentpatternscanbeobserved.ForEuropeancountries, InAfrica,Nigeria,hasonly24%ofnationalsites,mostlyin Hungary, Poland, Norway and France, the number of local thePress. sites is about twice the number of US sites in the top 100. Chinaistheonlycountrywhichhasdevelopedsystemswith Russiahasasimilarpattern.Chinaontheotherhandmain- a number of users, in the hundreds of million, comparable tainsmorethan80%oflocalsitesinthetop100list.Nigeria showsaverydifferentsituationwithaverysmallproportion 5AT, AU, BE, BR, CA, CH, CN, CY, CZ, DE, DK, DZ, EC, ES, oflocalsitesinthetop100. FI, FR, GB, GR, HK, HN, HU, IE, IL, IN, IT, JP, KR, KW, KZ, LY, MA, MY, NG, NL, NO, NZ, OM, PA, PE, PL, PT, PY, NotethatthepercentageofusersthatvisitspecificWebsites QA, RO, RU, SA, SE, SG, SI, SN, TH, TN, US, UY, VN. decreases very fast from the top sites in a list to the subse- 6Unlessotherwisespecified,thenumberspresentedinthefollowingtables areextractedfromAlexa’srankingasofmidseptember2012. 8China social media equivalents: a new info- 7National sites among the top 25 in France: leboncoin, Orange, Free, graphic http://www.asiadigitalmap.com/2011/02/ commentcamarche,lemonde,lequipe,lefigaro,pagesjaunes,sfr china-social-media-equivalents-a-new-infographic/ quent sites. The first ten sites generally concentrate an im- Facebook.com and Youtube.com, we show the number of portant share of the total traffic. Countries whose ratio of countriesinwhichthesiteisamongthetop2or4sites(as websites is low in the first segment and increases then for indicatedinthetable). the top 100, have in general a relatively small share of the It is important to note that the traffic decreases very fast in globaltraffic. thetop50list.Google.com,whichoccupiesthetopposition for instance drives more than 40% of global users (not to 2.2 Topsitesglobally mention the local sites such as google.de, google.com.hk, WenowconsidertheglobalimpactofWebcorporations. etc.),andisamongthefirst4sitesinmorethan30countries, Table2presentstheproportionsoftop50sitesintheworld while Sohu.com, which occupies the 50th position, drives thatareownedbycompaniesinagivencountry. about 2% of users and is among the top ten sites only in TheUShavemorethantwothirdsofthetop50sitesworld- China. wide. These sites have a real prominence worldwide as we Rank Website %users #countries haveseenontheprevioustable.Theonlytwocountriesthat 1 Google.com 43.75 30(1st-4th) have more than one site in this club, are China and Rus- 2 Facebook.com 42.65 34(1st-2nd) sia, which have both developed their own industry for fun- 3 YouTube.com 33.43 35(1st-4th) damental tools such as search engines, blogs, e-commerce, 4 Yahoo.com 20.11 29(1st-10th) etc. Three countries, in the European sphere, have one site 5 Baidu.com 12.19 3(1st-10th) amongthetop50. 8 Amazon.com 8.26 15(1st-10th) Here again, China occupies a remarkable position after the 18 Yandex.com 2.97 6(1st-10th) US,whichhavetheabsolutesupremacy.Chinahaseightof the first fifty sites worldwide according to Alexa’s ranking. 32 Tumblr.com 2.57 0(1st-10th) IfthesizeoftheChinesepopulationimpactsofcourseonthe 50 Sohu.com 1.84 1(1st-10th) numberofusersoftheChinesesystems,andthereforeulti- Table3:Percentageofglobalusersandtopcountries matelyontherankingofthesesystems,themostimportant reasonfortheirsuccessistheassociationofaclearpolitical ambition, a strong appetence for social networking, and a dynamicindustry.Thesizeofthepopulationisbynomeans 2.3 Searchengines an explanation by itself. India for instance has only a few Letusconsidermorecarefullyparticularsegments,suchas national sites among its top 25 sites, which are almost all thesearchengine,whichplaysanessentialroleintheway American. people access knowledge. Here again, distinct patterns can Unlike their American counterparts, Chinese systems have befound.Table4presentsthetopsearchenginesforafew currently most of their users in China. Most of them are countries. of course widely used in Hong Kong and Taiwan as well, CC 1stSE share 2ndSE share while some are also used in South Korea, and in Russia suchasTaobaopopularfore-commerce.Theirinternational US Google 65% Bing/Yahoo 15% ambitionwillmostprobablygrowinthecomingyears. CN Baidu 73% Qihoo/Sogou 8-9% JP Yahoo!Japan 51% Google 36% CC Ratio Topsiteswiththeir(rank) RU Yandex 60% Google 25% US 72% Google(1);Facebook(2);YouTube(3); UK Google 91% Bing 5% Yahoo(4);... FR Google 92% Bing 3% CN 16% Baidu(5);QQ(8);Taobao(13);Sina(17); CZ Google 53% Seznam 37% 163.com(28);Soso(29);Sinaweibo(31); Sohu(43) Table4:Thetoptwosearchenginesbycountry RU 6% Yandex(21);kontakte(30);Mail.ru(33) IL 2% Babylon(22) The US have developed major search engines. The three UK 2% BBC(46) which dominate the American market, Google, Yahoo and NL 2% AVG(47) Bing,areamongthemostpopularworldwide.Googlehasa dominant position with 65% market share, while Bing and Table2:TheTop50Websitesworldwidebycountry Yahoo have both 15% share in USA. Globally Google has 65%marketshare,Baidu,8.2%marketshare,Yahoo,4.9% Table 3 shows the percentage of global users who visited a market share, Yandex, 2.8% market share, and Microsoft, site in the last three months. It also displays the number of 2.5%marketshare9. countriesinwhichthesiteappearsinthe10topsiteslocally. 9http://searchengineland.com/google-worlds-most- Note that for the most popular sites, namely Google.com, popular-search-engine-148089 China10 and Russia11 are in a very similar situation, where otherAsiancountriessuchasJapanandKoreaforinstance the dominant search engine is the local one, Google be- shows their appetence for local systems. The strong focus ing the next most widely used engine. Baidu has a rela- inWesternmediaonthecensorshipimposedontheInternet tivelybiggershareinChinathanYandexinRussia.Morere- hasoftenledtounderestimatethestrategyofChinatowards cently,GooglelostsharesinChina,withthesuddenraiseof ITandtheinformationsociety,andoverestimatetheimpor- twootherlocalsearchengines,whichareapproaching10% tanceofthecontrolofthecontent. sharesoftheChinesemarket,Qihoo360,andSogou. In Europe12, there are no local search engines with strong 3. TheinvisibleWeb positions,andthemarketisdominatedbyGoogle,whichhas While Section 2 deals with the cartography of the visible aquasimonopolisticposition.OnlySeznamhasareasonable Web, this section analyses what is often referred to as the shareforczech,butwhichisnowdecreasingwithrespectto invisible Web. The invisible Web refers to tags, Web bugs, Google’sshare. pixels and beacons that appear on Websites to track and 2.4 Socialnetworks profileusers.Wefirstpresentthemethodologyusedtotrack the trackers. We then consider the invisible Web from the Otherdomainsoftheinformationsocietysuchassocialnet- trackeespointofview,weaimtoshowhowInternetusersare works would lead to very similar conclusion, with Face- tracked across the world. Finally, we analyze the trackers, book largely dominating in Europe, while alternative sys- andconsiderwhetherthetrackersaredistributeduniformly tems have been developed in Asia. The size of the Chinese ontheplanet,orwhethersomecountriesaredominatingthe social networks deserve some attention. The ranking of the trackingbusiness. GlobalWebIndexbasedonthepercentageofglobalInternet usersarestriking.Table5showsthat6outof10ofthemost 3.1 Trackingthetrackers widelyusedsocialsystems13areChinese. In order to establish a cartography of global third-party re- CC Corporation share source utilization on the Web, we used PlanetLab’s infras- US Facebook 41% tructure [20]. PlanetLab connects many servers in different countries.Inourexperiment,37proxyserversfromdistinct US Google+ 21% countrieshavebeenused14. CN Qzone 19% To retrieve information on trackers, we created dynamic CN SinaWeibo 18% tunnelstotherelevantPlanetLab’sservers.Subsequently,all CN TencentWeibo 16% sites from the respective lists were visited and the trackers US Twitter 16% detectedonthesesitesweresavedforfurtheranalysis.This CN Renren 11% processwasautomatedwiththeuseofaWebDrivertogether CN Kaixin 8% withaFirefoxbrowser,equippedwithmodifiedpluginsand US LinkedIn 7% Flash enabled. All our data has been obtained between the CN 51.com 6% endofoctoberandthebeginningofdecember2012. Forourtestswehavechosentwopopulartoolsenablingthe Table5:PercentageofglobalInternetusers detection and blocking of third-party resources: Ghostery andAdBlockPlus(ABP).Theybothworkinasimilarman- China has developed a large industry on the net, with es- nerrequiringthescanningofthevisitedWebsite,andsearch- sentially all the usual services initially proposed by mostly ing for an offending resource or connection. If a resource American companies, such as online search engines, social is found to be present on the respective list of blocked re- networks, news, business, instant messaging, etc. Chinese sources (filter lists), this may either be reported to the user companieshavetakenadvantageintheirdevelopmentofthe orblockedbytheplugin. difficulties to access their foreign counterparts from Main- Ghostery. Ghostery is a popular extension which detects land China, but they would most certainly have succeeded trackers and display their names (such as “DoubleClick”, withoutthecensorshipofforeignsites.Thediversityinsome “Omniture”) [6]. Ghostery analyzes the requests made by 10http://www.chinainternetwatch.com/1444/ a browser and compares them against a database of known china-search-engine-market-share-by-revenue-q1-2012/ trackers. It is important to note that Ghostery maintains a 11http://www.bloomberg.com/news/2012-04-02/ listofconfirmed trackers:atrackerisnotonlyaddedtothe yandex-internet-search-share-gains-google database, but also included on the project’s Webpage (e.g. -steady-liveinternet.html 12http://theeword.co.uk/seo-manchester/google_tops_july_ http://www.ghostery.com/apps/omniture),withtheir 2012_search_engine_market.html respective privacy policy. In our experiment, we saved the 13http://globalWebindex.net/thinking/ social-platform-report-series-september-2012 14Ifaserverfromaspecificcountrywasunavailable,weusedalocalIP -facebook-on-track-to-hit-2bn/ fromInria. namesofthetrackersfoundforeveryvisitedsiteforfurther analysis. AdBlock Plus (ABP). For comparison purposes, we also used the AdBlock Plus extension [21], in the same envi- ronment as described previously. AdBlock is able to block third-partyresources,mostlyunwantedadvertisements,and it maintains a dedicated trackers list: EasyPrivacy15, which includes many known trackers as well as Web bugs, which allows AdBlock Plus to block these resources. We use this laterlistinouranalysis. Although these two tools are different, they provide results that are consistent. A more detailled comparaison of these toolsandresultsareprovidedinAppendixA. Country of origin. Determining the country of origin of Webcorporationsisachallengingproblem.Onesolutionis toidentifythelocationoftheirheadquarters,butthisisnot Figure2:Averagenumberoftrackers(Ghostery) alwaysrelevant.ItisalsopossibletouseWhoisdatabasesto identifythelocationofthesiteholdingtheirdomainname. However,domainregistrarsaresometimelocatedelsewhere than reported in Whois databases. A third possibility could are accessed from a local IP, and the trackers are detected be,inthecaseofABPforinstance,tousethetrackers’do- usingGhosterydataforthe55selectedcountries.Theresults mainnameresolutionandperformgeolocationusingappro- clearlyshowthatusersaretrackeddifferentlyontheInternet priate databases. None of these methods are fully satisfac- accordingtotheircountry.Thedifferentcolorsonthefigure tory.Weproposeinsteadanalternativetechnique,whichcan refertothedifferentcontinents16.AsshowninAppendixB, beautomated. these results do not depend in fact on the visitor locations. ForABP,aWebresourceisassociatedtotrackers.Thetop- Inotherwords,inmostcases,agivensitetracksitsvisitors level domain name can be extracted, a domain name reso- independentlyofthevisitor’sIPaddress. lution can be performed, and subsequently a lookup using While results do not show big differences between conti- geolocationdatabase,toidentifyitslocation.Thisapproach nents, some countries seem to be much more tracked than can correctly identify the country of origin in most of the others.Forexample,USInternetusersaretracked5times cases. For Ghostery, we performed a geolocation on the morethanChineseusers17. URLs in the trackers description from the Website http: //www.ghostery.com/apps/, where all the trackers are described by Ghostery themselves. The geolocation has beendoneusingPythonGeoIPandgeoiplookupcommand- line utility tools, which query geolocation databases. The results were then cross-checked manually. For example http://www.ghostery.com/apps/digilant mentions Digilant company’s Website www.digilant.com, which resolves to United States. The ”Privacy Contact“ tab on Ghostery’s Website confirms that Diligant headquarters are locatedinBoston,USA. 3.2 Thetrackees Wefirstconsidertheaveragenumberoftrackerspersite foreachofthe55countriesconsidered.Theaverageiscom- puted by connecting to the top 100 most popular sites of a country, summing the number of trackers on each of them, Figure3:Averagenumberoftrackers(Ghostery,mobile) anddividingthefinalresultbythenumberofretrievedsites. Figure 2 displays the average number of trackers (Y axis) 16Europeisingreen,Asiainblack,NorthAmericaindarkblue,Africain for each considered country (X axis). The top 100 sites lightblue,SouthAmericainpurple,Australiainwhite 17WedefineasaUS(resp.Chinese)user,auserthatvisitstheAlexa100 15copiedon30/11/2012 top,i.e.the100mostpopular,sitesintheUS(resp.inChina) Figure3displaysthesametypeofresultsasFigure2,with thebrowser’s User-Agentstring setto amobile device.Al- thoughtheresultsaredifferent,withasmalleraveragenum- beroftrackers,thetrendsaresimilar.Somecountriesarestill muchmoretrackedthanothers.Thisdifferenceislikelydue to the fact that many Websites redirect the mobile browser toaspecialversionofthesite,tailoredtowardsdeviceswith asmallerdisplay.Thesesites,asitseems,includelessthird- partyresources,probablytospeeduploading.Furthermore manypopularsiteshaveadedicatedmobileapplicationany- way,sothepotentiallossesfromtrackingand/oradvertising canbebalancedwiththeuseofin-appadvertisements. Figure5:Distributionoftrackersforselectedcountries much larger in the US than in other countries, which auto- matically increases the average number of trackers per site intheUS. Finally,weconsideredthedistributionofdifferenttracker types. Ghostery divides its detected scripts into five cate- gories[7],whichwerecallbelow: 1. Ad:advertisementsprovidedbythead-networks; 2. Tracker: scripts which actually perform tracking (often usingverysophisticatedbehavioralanalysis); Figure4:Averagenumberoftrackers(ABP) 3. Analytics: utility scripts for Website creators allowing them to discover various statistical details about their Figure 4 exhibits results from experiments similar to those visitors; of Figure 2, but while using ABP instead of Ghostery. The 4. Widget: small Web applications such as clocks, weather absolutenumbersareslightlysmaller,butthetrendsarevery tables,andothers.OtherexamplesincludeFacebookSo- similar.AppendixApresentsmoreexperimentsthatanalyse cialPlugins,Google+1,etc.; theconsistencyoftheresultsobtainedbyABPandGhostery. 5. Privacy:typicallyascriptdisclosingprivacypoliciesand We then considered the distribution of trackers for se- practicesrelatedtoads,suchasEvidonNotice18. lectedcountries,whichisshownonFigure5.Thedistribu- Figure 6 displays the distribution of trackers for the five tion isobtained by counting, fora given country,all occur- categoriesof[7]forthe55countries,basedonGhostery.It rencesofaparticulartracker,e.g.,DoubleClick.Thesenum- isinterestingtonotethatthedistributionofeachtypeseems bers are then ordered from the largest to the smallest, and tobequitesimilarindifferentcountries.TheAdtrackersare plotted. The first value of a given curve shows the number themostcommon,followedbytheanalyticsones. ofoccurrencesofthemostpopulartracker,thesecondvalue showsthenumberofoccurrencesofthesecondmostpopular 3.3 Thetrackers tracker,andsoon.TheanalysisisbasedonGhostery. Wenextconsiderthegeographicoriginofthetrackers,and The results clearly show that in China there are only 10 thewaytrackersproliferatedependinguponthecountrythey different trackers on the top 100 Chinese sites, and these come from. We first start by analyzing the origin of the trackers are not very active. Moreover, the most popular, trackers on the top 100 sites of the global list, that is the CNZZ, only appears in 10 of the top 100 Chinese sites. In listofthe100mostpopularWebsitesworldwide. contrastothercountries,suchastheUS,havealargenumber ofdifferenttrackers(about90fortheUS),andsomeofthese Table6showsthedistributionofthedetectedtrackersusing trackers,forexampleGoogleAnalytics,areverypopularin Ghostery. We computed the number of trackers, T, on the alargenumberofsites. Does it mean that the US market is more fragmented? In- 18http://www.evidon.com/about/ deed, it seems that the number of tracking companies is 19ROW:RestOfWorld Nigeria,andHungary,andanalyzetheoriginofthetrackers ofthetop100sitesforeachofthesecountries20. CC Tracker(countryoforigin) Count GoogleAnalytics(US) 53 LiveInternet(RU) 51 RU Yandex.Metrics(RU) 31 TNS(GB) 30 Rambler(RU) 27 GoogleAnalytics(US) 20 MarkMonitor(US) 16 CN CNZZ(CN) 10 ScoreCardResearchBeacon(US) 6 DoubleClick(US) 5 GoogleAnalytics(US) 39 Figure6:Distributionoftrackersbytypes DoubleClick(US) 32 US ScoreCardResearchBeacon(US) 32 CC Share Omniture(US) 21 US 87% FacebookConnect(US) 14 CN 3% GoogleAnalytics(US) 62 FacebookSocialPlugins(US) 39 GB 3% NG DoubleClick(US) 32 RU 2% ROW19 4.2% FacebookConnect(US) 30 GoogleAdsense(US) 23 Table6:Distributionoftrackersontop100globalsites GoogleAnalytics(US) 70 Median(HU) 48 HU Gemius(PL) 47 Adverticum(HU) 37 top100sitesworldwide.Wethencountedforeachcountry FacebookSocialPlugins(US) 31 i,thenumberoftrackersoforigini,C ,andthencomputed, i foreachcountry,thepercentagep =C /T. i i Table8:Top5trackersforselectedcountries The results show a clear domination of the US, with more than80%coverageofthetop100sitesworldwide,withonly afewcountries,suchasChina,GB,andRussia,whichhave Table8displaysthetop5mostpopulartrackingcompanies asmallpercentageoftrackers. onthe100mostpopularWebsitesofthese6selectedcoun- tries. Once again, the results clearly indicate that the track- Tracker(countryoforigin) Count ing business is largely dominated by US companies. Only GoogleAnalytics(US) 25 fewcountriesseemtoresistthisdominationsuchasRussia, DoubleClick(US) 20 China,andHungary.Interestingly,Russiaistheonlycoun- ScoreCardResearchBeacon(US) 19 trywhosetrackersaremostlylocal. FacebookSocialPlugins(US) 11 We next consider the origin of the most prominent trackers Omniture(US) 10 in the following eight countries: USA, Hungary, Poland, Norway,France,Russia,China,andNigeria.Weperformed Table7:Top5trackersontop100globalsites thefollowinganalysis.Wefirstcollectedallthetrackerson thetop100sitesofacountry,classifyingthemaccordingto Table7displaysthetop5mostpopulartrackingcorporations their origin, and counting their occurrences. Since the raw on the top 100 most popular sites worldwide. The results numbers,evenaverages,maynotbethemostinformativein clearly confirm that the tracking business is largely domi- thisanalysis,wedecidedtoplottheinformationwithrespect natedbyUScompanies. to the detected number of US-based trackers to ease the presentation.Morespecifically,wecomputedforalltrackers Letuspushfurthertheanalysisoftheoriginofthetrackers, ofanoriginN detectedonthetop100sitesofacountryC, andconsideranin-depthAnalysisforSelectedCountries. For the sake of readability, we decided to display here the 20Theanalysisofall55countrieslistedatthebeginningofthispaperis results for only 5 countries, namely Russia, China, USA, presentedinAppendixC. 4. SitesvsTrackersAnalysis In this section, we compare the harvesting activity on the visibleandtheinvisibleWeb.Ourobjectiveistounderstand whether the predominance of the US is larger on the visi- bleorontheinvisibleWeb.Inordertoachievethisgoal,we compute and compare the proportion of US sites (resp. US Figure7:Ratio fortrackersbyorigin(Ghostery) trackers),withtheproportionoflocalsites(resp.localtrack- us ers)foreachofthetop100sitesofeachofthe55considered countries. Theseproportions,P andP ,arecomputedasfollows: US local #UStrackers(resp.sites)incountryC P (C)= US #Alltrackers(resp.sites)incountryC P (C) is the number of US trackers (resp. US sites) di- US videdbythetotalnumberoftrackers(resp.sites)inthetop Figure8:Ratio fortrackersbyorigin(ABP) 100sitesincountryC.Quantitatively,itshowshowtrackers us ofadominatingcountry(e.g.US)tracktheworld. #Localtrackers(resp.sites)incountryC thefollowingratio: Plocal(C)= #Alltrackers(resp.sites)incountryC #trackersoforiginN P (C) is the number of local trackers (resp. local sites) Ratio (N)= local us #UStrackers dividedbythetotalnumberoftrackers(resp.sites)inthetop 100sitesincountryC.Quantitatively,itshowsthestrength TheUShasbeenchosenasareferencebecauseoftheglobal oflocaltrackers. prevalenceofthetrackersofthisorigin.Indeed,UStrackers arepresentonalmosteverysites. Figures 7 and 8 show for each of these selected countries theoriginofthetop4mostprominenttrackingcountriesfor both detection methods, Ghostery and ABP. In parenthesis for each country, the ratio of US trackers, and the ratio of local trackers. These figures clearly show that the US havealmostalwaysthelargestratiooftrackers.Thesecond largest tracker country, is apart from some exceptions, the country itself. Austria constitutes for instance an exception tothisrule,withGermanyassecondtrackerwitharatioof 0.24. China and even more so Russia constitute the two excep- tions,withexceptionallystrongratiosoflocaltrackersover US trackers. In fact, China has a ratio of 0.78, and Russia aratioof2.3,thatindicatesthatRussiansitescontainmore RussiantrackersthanUSones. Figure9:RatioUSvslocalsitesandtrackers Wealsoobservedsomevariationinthenumberandthedis- tributionofcountriesinvolvedintrackinginagivencountry, The scatterplot for P and P of all the 55 countries US local beyondtheUSandthecountryitself.ForexampleinFrance, consideredforsitesandtrackersareshownonFigure9.Ev- there are third-party resource providers from 11 countries, ery black square (resp. red circle) point corresponds to a whereasinChinaonly6countriesare”represented“. specific country C, with coordinates P (C) (X axis) and US Inaddition,weobservedthatthesetrackersoftencomefrom P (C) (Y axis), for sites (resp. trackers). The trackers local neighboringcountries.Forexample,Danishsitesoftencon- have been obtained using Ghostery, as for previous mea- taintrackersfromSwedenorFinland.Thesameobservation sures. applies to Austria as we noticed already, as well as other TheresultsclearlyshowthattheUSareevenmorepresent countriesinCentralEurope,suchasHungary,Slovakiaand on the invisible than on the visible Web. Most of the theCzechRepublic.Therefore,withtheexceptionoftheUS, tracker points (in red) are located on the right lower corner mosttrackersareregional. oftheplot.ThisindicatesthatthepercentageofUStrackers is large in most considered countries, while the percentage CC P P US local oflocaltrackersisusuallysmaller,whilethedistributionof RU 0.39/0.36 0.49/0.52 theproportionofthesitesismorebalancedbetweentheUS CN 0.69/0.68 0.29/0.31 andlocalsites. HU 0.57/0.53 0.21/0.23 PL 0.65/0.63 0.17/0.17 NO 0.68/0.62 0.03/0.04 FR 0.71/0.67 0.19/0.22 NG 0.9/0.86 0/0 Table10:P andP fortrackers US local sites are excluded. The results contrast with the results of the previous table, and clearly show that the percentage of US trackers is larger that the percentage of local trackers, exceptforRussia. Finally it is interesting to compare Russia and China with respecttotheproportionoflocalsitesandtrackers. Figure10:SitesvsTrackers(Ghostery) This finding is further confirmed with the results shown on Figure 10, which presents the scatterplots for sites against trackers:therelationshipsbetweenUSsites(blacktriangles) Figure11:SitesandtrackersinRussia (resp.localsites(redstars))andthecorrespondingtrackers forGhostery.Theredstarsareonthelowerpartofthefigure, corresponding to a small percentage of local trackers for Figure 11 shows how Russia manages to have both more most considered countries even with a large proportion of russian sites and trackers at home, although somehow with local sites. Whereas, the black triangles are on the higher thesameorderofmagnitudesasUSsitesandtrackers,while partofthefigure,correspondingtoalargepercentageofUS Figure 12 shows that China has mostly Chinese sites, but trackers. mostlyUStrackers. Letusnowconsideragainanin-depthAnalysisofSelected Countries.WeconsidertheUSandlocalproportionofsites and trackers for seven selected countries, namely Russia, China,Hungary,Poland,Norway,FranceandNigeria. CC P P US local RU 0.23 0.54 Figure12:SitesandtrackersinChina CN 0.12 0.86 HU 0.26 0.56 PL 0.21 0.69 5. Conclusion NO 0.37 0.51 Westudiedtheglobaldistributionandproliferationofthird- FR 0.34 0.61 party resources on the most popular sites in various coun- NG 0.72 0.04 tries. Our research reveals very different strategies and/or capacities to harvest local as well as global data, both on Table9:P andP forsites US local thevisibleandtheinvisibleWeb. • InEurope,mostcountrieshavetwiceasmanylocalsites Table9displaystherespectivepercentagesofUSandlocal as US sites amongst their top 100 most popular sites, sites.Itshowsthatlocalsitesareoftendominant,exceptfor although the top sites are mostly US, and the trackers somedevelopingcountriessuchasNigeria. are mostly US as well, followed by local and regional Table10displaystherespectivepercentagesofUSandlocal trackers, thus leading to an important flow of data from trackers. The second value shows the results when the US EuropetotheUS.
Description: