
The Twelfth Workshop on the Economics of Information Security, Georgetown University, Washington, D.C., June 11-12, 2013

Data Harvesting 2.0: from the Visible to the Invisible Web

Claude Castelluccia (INRIA), Stéphane Grumbach (INRIA), Lukasz Olejnik (INRIA)

Abstract

Personal data are fuelling a fast emerging industry that transforms them into added value. Harvesting these data is therefore of the utmost importance for the economy. In this paper, we study the flows of personal data at a global level, and distinguish countries based on their capacity to harvest data. We establish a cartography of international data channels on the visible and invisible Web. The visible Web is composed of the sites that are available to the general public and are typically indexed by search engines. The invisible Web refers to tags, Web bugs, pixels and beacons that appear on Websites to track and profile users.

It is well known that the US dominates the visible Web, with more than 70% of the top 100 sites in the world. We show that this domination is even stronger on the invisible Web. The largest proportion of trackers in most countries is indeed from the US. Apart from the US, two countries exhibit an original strategy. China dominates its visible Web with a majority of local sites, but surprisingly these sites still contain a majority of US trackers. Russia also dominates its visible Web, and is the only country with more local trackers than US ones.

1. Introduction

Sun Tzu famously wrote in the 6th century BC: "If you know your enemies and know yourself, you will not be imperilled in a hundred battles; if you do not know your enemies but do know yourself, you will win one and lose one; if you do not know your enemies nor yourself, you will be imperilled in every single battle" [1]. This statement might become more relevant than ever with the explosion of information now available from the Internet and Web 2.0 systems. Some countries are widely collecting data on the visible and/or the invisible Web about their citizens, and sometimes about citizens of other countries. Other countries rely mostly on foreign systems, thus letting a large amount of their data be handled outside their borders. The discrepancy between these two approaches to the management of personal data could result in information asymmetries between the players, whether industrial or governmental, in their capacity to access strategic data [23]. It might seem irrelevant to know which corporations are handling and concentrating data, and in what countries they are based, but it has tremendous consequences, in particular for the jurisdiction that applies, not to mention the economic impact, direct or indirect. Some complex fiscal issues have already been raised in Europe on the economy of the data, which at this stage is mostly invisible to European states [fn. 1].

Footnote 1: Some countries, such as France [18], actually intend to tax the tracking and data-mining providers. The premise behind this assumption is that consumers pay for service with their private data and this good is exchanged without issuing a tax.

Personal data have become an essential resource for the new economy of the information society, much like iron ore or crude oil were for the industrial economy. Developing tools to harvest personal data is therefore of strategic importance to catch up with the digital revolution. Harvesting data is mostly done by systems, such as search engines, social networks, clouds, etc., where people provide their personal data in exchange for a service, which most of the time is accessible essentially for free [fn. 2]. More precisely, the private data given by users constitute the means by which users pay for services. They are therefore not only a resource, but can be seen in a sense as a virtual currency. Harvesting data can also be done in more subtle ways, on the "invisible Web", by tracking people, which is carried out mostly by third-party entities.

Footnote 2: Note that most people pay about 1000 dollars/year (ISP, cellular data subscription, ...) to access the Internet. It is arguable whether these services are really free.

The main objective of this paper is to analyze the current world situation, and perform a geographical analysis of data harvesting on both the visible and the invisible Web. The visible Web is composed of the sites that are available to the general public and are typically indexed by search engines. The invisible Web refers to tags, Web bugs, pixels and beacons that appear on Websites to track and profile users.

This paper aims at answering the two following questions:

1. What are the most influential Web systems? In which countries are they based? What is their regional influence?
2. Which are the biggest trackers on the Internet? In which countries are they based? What is their regional influence?

In other words, this paper aims at analyzing the cyber-strategies of different countries in terms of data harvesting on both the visible and the invisible Web.

Interestingly, the answers to the previous questions reveal very different patterns, not necessarily correlated with the penetration rate. Numerous studies have measured the level of penetration of the information society, such as the World Economic Forum [?]. Recently, the Web Foundation [fn. 3], led by Tim Berners-Lee, launched its Web Index, which, like previous studies, ranks North America and Northern Europe at the top of the list, as well as some countries of Asia. The index measures three key attributes of the Web: "Web readiness", for the communications infrastructure; "Web use", for the population online and the contents available; and "the impact of the Web", for its economic, social and political impact. The Web Foundation gives a high weight to political openness. China ranks 29th out of 61 countries in this index, a rather low rank that reflects increased Web use but a relatively slow evolution of Web content.

Footnote 3: http://www.Webfoundation.org/

There is one measure that is little taken into account in these rankings, namely the local development of the Web 2.0 industry, which offers an indicator of the potential strategic activity on Big Data in the country [fn. 4]. The US would be ranked at the top position for such a criterion. It has indeed developed the strongest industry worldwide, with most of the first online social systems accessed in the world, such as Google, Facebook, YouTube, Yahoo!, Wikipedia, Windows Live, Twitter and Amazon, to name the most popular. With these corporations, the US harvests private data of people all around the world that can be analysed for an unpredictable set of purposes, with considerable economic impact. Of course the US has a dominant position in a number of strategic sectors of the information society, including the operating systems, the browsers, and the clouds that support the systems of the Web.

Footnote 4: In the sequel, we consider the most popular Websites, globally or in one specific country, based on Alexa's ranking. Alexa is a subsidiary of Amazon. alexa.com

Technically speaking, tracking is made possible by the design of the HTTP protocol [16] itself. Third-party scripts can indeed be used for tracking purposes. Every inclusion of a third-party script on a visited site requires the browser to send a request to this third-party server, download the script and execute it in the user's browser.

Browser cookies [2] are traditionally used to maintain a browser state of a Web user. They allow a given browser to be tied to the internal profile in a third-party database. This motivates limiting the use of cookies on the Web. Recently certain countries, notably the UK, passed laws requiring Websites to gain their users' consent for cookie usage [13]. Examples include the site of the BBC, recently extended with an information pop-up. Users are expected to consent, though they mostly do so mechanically, without any special consideration or understanding.

It is important to note the existence of other tracking mechanisms. Of particular importance, in addition to Evercookies [10], are non-standard browser fingerprinting techniques based on browser configurations [3], history [17], host identifiers [25] and pixels [15], which allow tracking of the browser across various sites. The risks of, and protections against, potential tracking by social buttons (e.g. Facebook's Like) are discussed in [21]. According to [17], most popular sites can be leveraged for basic history-based fingerprinting; using this intuition, we assumed this can also be the case for the most important sites for tracking.

New risks resulted in new reactions. Quite recently, a new and bold initiative has attempted to limit tracking on the Internet. This initiative, known as Do Not Track [14], promotes the consensual and easy solution of opting out from tracking on the Web. It is technically achieved by the simple addition of another HTTP header to the browser's request: DNT [24]. According to Mozilla, more than 11% of Firefox users have activated DNT. The DNT initiative led to difficult negotiations with the advertisement industry in the US [8].

Tracking can also be limited by using dedicated browser extensions, which can block unwanted tracker scripts and/or ads. We selected two of the most prominent ones, Ghostery and AdBlock Plus, and compared them to present actual metrics of performance. For a global analysis, we refer to [11], which shows how tracking has changed with time and with acquisitions of various companies by others. Our work focuses on a global approach, in contrast with [17], where the situation of individuals was addressed. Geographic differences in the Twitter network have been studied as well [12]. While some analogies to our research can be found, information about trackers is fundamentally different from information on Twitter's users.

Paper organisation: The paper is organized as follows. In the next section, we present the top harvesting sites of the visible Web, and their geographical influence. Section 3 is devoted both to the trackees and the trackers of the invisible Web. In Section 4, we analyse the correlations between the different harvesting techniques.

2. The visible Web

Our investigation focuses on the top Websites of 55 countries, which have been identified using the statistics provided by Alexa. For simplicity, in the sequel we use the country code top-level domain in ccTLD format [fn. 5]. Alexa maintains lists of over 500 sites per country, but we restricted our attention to the 25 to 100 most popular sites in these lists.

Footnote 5: AT, AU, BE, BR, CA, CH, CN, CY, CZ, DE, DK, DZ, EC, ES, FI, FR, GB, GR, HK, HN, HU, IE, IL, IN, IT, JP, KR, KW, KZ, LY, MA, MY, NG, NL, NO, NZ, OM, PA, PE, PL, PT, PY, QA, RO, RU, SA, SE, SG, SI, SN, TH, TN, US, UY, VN.

2.1 Top sites by country

Data on the Web 2.0 are produced by users everywhere in the world, but they are accumulated by corporations which, for most users worldwide, are not in their own country. A first measure of this phenomenon can be obtained by computing, for each country, the percentage of national Web corporations among the top 25 sites used in that country [fn. 6]. The results are striking. Table 1 presents this percentage for a few representative countries.

Footnote 6: Unless otherwise specified, the numbers presented in the following tables are extracted from Alexa's ranking as of mid-September 2012.

CC    Nat. ratio    Foreign sites
US    100%          no foreign site
CN    92%           only foreign site: Google
RU    68%           mostly American sites
JP    36%           mostly American sites
KR    24%           half American, half Chinese
FR    36%           only American sites
NG    24%           mostly American sites

Table 1: Percentage of national Websites among top 25

In the US, there are no foreign sites among the top 25 Websites. For all other countries we considered, apart from China and Russia, the ratio of national sites amounts at best to around a third of the Websites. In both Japan and France, only 36% of the top 25 Websites are national, but this number hides very different realities in the two countries. First, while in France all 64% of foreign sites are American, in Japan there is more diversity. Two Chinese sites (search engine Baidu, instant messaging QQ) and one Korean site (search portal Naver) belong to the top 25 sites in Japan. Second, and more importantly, the French sites are mostly sites [fn. 7], such as newspapers, which do not gather as much personal data, while in Japan national sites include very data-intensive ones, such as Web portals, e-commerce, blogs, etc. Similar patterns would be found for other European countries. Italy, for instance, has only 28% of national sites. In Africa, Nigeria has only 24% of national sites, mostly in the press.

Footnote 7: National sites among the top 25 in France: leboncoin, Orange, Free, commentcamarche, lemonde, lequipe, lefigaro, pagesjaunes, sfr

China is the only country which has developed systems with a number of users, in the hundreds of millions, comparable with American systems. Both China and Russia have developed a very powerful industry which harvests most of the data produced by their citizens. In China most of the first 50 sites are Chinese [9]. As shown in the infographic produced by Ogilvy [fn. 8], there is no area of social media where a Chinese company cannot be found. Moreover, in some areas, several very large systems coexist, while only one dominates in the US, not to mention the rest of the world. This is the case for microblogging platforms, for instance, where Sina Weibo and Tencent Weibo coexist, both with around 300 million users and both ahead of Twitter. The ratio of local sites in Russia is lower than in China. Most of the top sites of the US have predominant positions (e.g., Facebook) in Russia, while they are blocked in China. The relative size of the two countries affects the size of their first systems, but both also have their respective sphere of influence abroad.

Footnote 8: China social media equivalents: a new infographic, http://www.asiadigitalmap.com/2011/02/china-social-media-equivalents-a-new-infographic/

South Korea has an extremely interesting pattern of diversity. Among the top 25 sites, only 24% are national, while there are 36% of both American and Chinese sites, a remarkable situation. A Mongolian portal (zaluu) also belongs to the list.

Let us now consider the top 100 sites. When looking beyond the top 25 sites, the ratio of national sites increases, in particular with most of the local newspapers and magazines, as well as some services such as banking institutions.

Figure 1: $\mathrm{Ratio}_{US}$ for sites in Alexa lists, by origin.

Figure 1 shows, for a selected list of countries, the proportion of sites in the top 100 from the US (in black), the country itself (in red), and the two other locally most active foreign countries (in blue). The results are normalized to the number of sites from the US; that is, they correspond to a ratio where the per-country count is divided by the number of US-based sites.

Different patterns can be observed. For the European countries Hungary, Poland, Norway and France, the number of local sites is about twice the number of US sites in the top 100. Russia has a similar pattern. China, on the other hand, maintains more than 80% of local sites in the top 100 list. Nigeria shows a very different situation, with a very small proportion of local sites in the top 100.

Note that the percentage of users that visit specific Websites decreases very fast from the top sites in a list to the subsequent sites. The first ten sites generally concentrate an important share of the total traffic. Countries whose ratio of national Websites is low in the first segment and only increases over the top 100 have in general a relatively small share of the global traffic.

2.2 Top sites globally

We now consider the global impact of Web corporations.

Table 2 presents the proportions of top 50 sites in the world that are owned by companies in a given country.

The US has more than two thirds of the top 50 sites worldwide. These sites have a real prominence worldwide, as we have seen in the previous table. The only two countries that have more than one site in this club are China and Russia, which have both developed their own industry for fundamental tools such as search engines, blogs, e-commerce, etc. Three countries, in the European sphere, have one site among the top 50.

Here again, China occupies a remarkable position after the US, which has the absolute supremacy. China has eight of the first fifty sites worldwide according to Alexa's ranking. While the size of the Chinese population of course affects the number of users of the Chinese systems, and therefore ultimately the ranking of these systems, the most important reason for their success is the association of a clear political ambition, a strong appetite for social networking, and a dynamic industry. The size of the population is by no means an explanation by itself. India, for instance, has only a few national sites among its top 25 sites, which are almost all American.

Unlike their American counterparts, Chinese systems currently have most of their users in China. Most of them are of course also widely used in Hong Kong and Taiwan. Some, such as Taobao, a popular online shopping site, are also used in South Korea and in Russia. Their international ambition will most probably grow in the coming years.

CC    Ratio    Top sites with their (rank)
US    72%      Google (1); Facebook (2); YouTube (3); Yahoo (4); ...
CN    16%      Baidu (5); QQ (8); Taobao (13); Sina (17); 163.com (28); Soso (29); Sina weibo (31); Sohu (43)
RU    6%       Yandex (21); kontakte (30); Mail.ru (33)
IL    2%       Babylon (22)
UK    2%       BBC (46)
NL    2%       AVG (47)

Table 2: The Top 50 Websites worldwide by country

Table 3 shows the percentage of global users who visited a site in the last three months. It also displays the number of countries in which the site appears in the 10 top sites locally. Note that for the most popular sites, namely Google.com, Facebook.com and YouTube.com, we show the number of countries in which the site is among the top 2 or 4 sites (as indicated in the table).

It is important to note that the traffic decreases very fast in the top 50 list. Google.com, which occupies the top position, drives more than 40% of global users (not to mention the local sites such as google.de, google.com.hk, etc.) and is among the first 4 sites in more than 30 countries, while Sohu.com, which occupies the 50th position, drives about 2% of users and is among the top ten sites only in China.

Rank    Website         % users    # countries
1       Google.com      43.75      30 (1st-4th)
2       Facebook.com    42.65      34 (1st-2nd)
3       YouTube.com     33.43      35 (1st-4th)
4       Yahoo.com       20.11      29 (1st-10th)
5       Baidu.com       12.19      3 (1st-10th)
8       Amazon.com      8.26       15 (1st-10th)
18      Yandex.com      2.97       6 (1st-10th)
32      Tumblr.com      2.57       0 (1st-10th)
50      Sohu.com        1.84       1 (1st-10th)

Table 3: Percentage of global users and top countries

2.3 Search engines

Let us consider more carefully particular segments, such as search engines, which play an essential role in the way people access knowledge. Here again, distinct patterns can be found. Table 4 presents the top search engines for a few countries.

CC    1st SE          share    2nd SE         share
US    Google          65%      Bing/Yahoo     15%
CN    Baidu           73%      Qihoo/Sogou    8-9%
JP    Yahoo! Japan    51%      Google         36%
RU    Yandex          60%      Google         25%
UK    Google          91%      Bing           5%
FR    Google          92%      Bing           3%
CZ    Google          53%      Seznam         37%

Table 4: The top two search engines by country

The US has developed major search engines. The three which dominate the American market, Google, Yahoo and Bing, are among the most popular worldwide. Google has a dominant position with 65% market share, while Bing and Yahoo both have a 15% share in the US. Globally, Google has 65% market share, Baidu 8.2%, Yahoo 4.9%, Yandex 2.8%, and Microsoft 2.5% [fn. 9].

Footnote 9: http://searchengineland.com/google-worlds-most-popular-search-engine-148089

China [fn. 10] and Russia [fn. 11] are in a very similar situation, where the dominant search engine is the local one, Google being the next most widely used engine. Baidu has a relatively bigger share in China than Yandex in Russia. More recently, Google lost shares in China, with the sudden rise of two other local search engines, Qihoo 360 and Sogou, which are each approaching a 10% share of the Chinese market.

Footnote 10: http://www.chinainternetwatch.com/1444/china-search-engine-market-share-by-revenue-q1-2012/
Footnote 11: http://www.bloomberg.com/news/2012-04-02/yandex-internet-search-share-gains-google-steady-liveinternet.html

In Europe [fn. 12], there are no local search engines with strong positions, and the market is dominated by Google, which has a quasi-monopolistic position. Only Seznam has a reasonable share, in the Czech Republic, but it is now decreasing with respect to Google's share.

Footnote 12: http://theeword.co.uk/seo-manchester/google_tops_july_2012_search_engine_market.html

2.4 Social networks

Other domains of the information society, such as social networks, would lead to very similar conclusions, with Facebook largely dominating in Europe, while alternative systems have been developed in Asia. The size of the Chinese social networks deserves some attention. The ranking of the GlobalWebIndex, based on the percentage of global Internet users, is striking. Table 5 shows that 6 out of 10 of the most widely used social systems [fn. 13] are Chinese.

Footnote 13: http://globalWebindex.net/thinking/social-platform-report-series-september-2012-facebook-on-track-to-hit-2bn/

CC    Corporation      share
US    Facebook         41%
US    Google+          21%
CN    Qzone            19%
CN    Sina Weibo       18%
CN    Tencent Weibo    16%
US    Twitter          16%
CN    Renren           11%
CN    Kaixin           8%
US    LinkedIn         7%
CN    51.com           6%

Table 5: Percentage of global Internet users

China has developed a large industry on the net, with essentially all the usual services initially proposed by mostly American companies, such as online search engines, social networks, news, business, instant messaging, etc. Chinese companies have, in their development, taken advantage of the difficulty of accessing their foreign counterparts from Mainland China, but they would most certainly have succeeded without the censorship of foreign sites. The diversity in some other Asian countries, such as Japan and Korea, shows their appetite for local systems. The strong focus in Western media on the censorship imposed on the Internet has often led observers to underestimate the strategy of China towards IT and the information society, and to overestimate the importance of the control of the content.

3. The invisible Web

While Section 2 deals with the cartography of the visible Web, this section analyses what is often referred to as the invisible Web. The invisible Web refers to tags, Web bugs, pixels and beacons that appear on Websites to track and profile users. We first present the methodology used to track the trackers. We then consider the invisible Web from the trackees' point of view, aiming to show how Internet users are tracked across the world. Finally, we analyze the trackers, and consider whether they are distributed uniformly on the planet, or whether some countries dominate the tracking business.

3.1 Tracking the trackers

In order to establish a cartography of global third-party resource utilization on the Web, we used PlanetLab's infrastructure [19]. PlanetLab connects many servers in different countries. In our experiment, 37 proxy servers from distinct countries have been used [fn. 14].

Footnote 14: If a server from a specific country was unavailable, we used a local IP from Inria.

To retrieve information on trackers, we created dynamic tunnels to the relevant PlanetLab servers. Subsequently, all sites from the respective lists were visited and the trackers detected on these sites were saved for further analysis. This process was automated with the use of a WebDriver together with a Firefox browser, equipped with modified plugins and with Flash enabled. All our data were obtained between the end of October and the beginning of December 2012.

For our tests we have chosen two popular tools enabling the detection and blocking of third-party resources: Ghostery and AdBlock Plus (ABP). They both work in a similar manner, scanning the visited Website and searching for an offending resource or connection. If a resource is found to be present on the respective list of blocked resources (filter lists), this may either be reported to the user or blocked by the plugin.

Ghostery. Ghostery is a popular extension which detects trackers and displays their names (such as "DoubleClick", "Omniture") [5]. Ghostery analyzes the requests made by a browser and compares them against a database of known trackers. It is important to note that Ghostery maintains a list of confirmed trackers: a tracker is not only added to the database, but also included on the project's Webpage (e.g. http://www.ghostery.com/apps/omniture), with its respective privacy policy. In our experiment, we saved the names of the trackers found for every visited site for further analysis.

AdBlock Plus (ABP). For comparison purposes, we also used the AdBlock Plus extension [20], in the same environment as described previously. AdBlock Plus is able to block third-party resources, mostly unwanted advertisements, and maintains a dedicated trackers list, EasyPrivacy [fn. 15], which includes many known trackers as well as Web bugs and allows AdBlock Plus to block these resources. We use this latter list in our analysis.

Footnote 15: copied on 30/11/2012

Although these two tools are different, they provide results that are consistent. A more detailed comparison of these tools and results is provided in Appendix A.

Country of origin. Determining the country of origin of Web corporations is a challenging problem. One solution is to identify the location of their headquarters, but this is not always relevant. It is also possible to use Whois databases to identify the location of the site holding their domain name. However, domain registrars are sometimes located elsewhere than reported in Whois databases.

We instead propose a technique that can be automated.

ABP associates with each tracker a Web resource, for example http://edge.quantserve.com/quant.js. We extracted the registered domain name of each tracker, e.g. quantserve.com. We then resolved the domain name into an IP address and used a geolocation database to identify its location. This approach can correctly identify the country of origin in most cases.

The Ghostery website contains a description of each tracker (see http://www.ghostery.com/apps/). This description contains the URL of the tracker's company website and potentially the postal address of the company. We used the company website's URL to geolocate it. The results were then cross-checked manually. For example, http://www.ghostery.com/apps/digilant mentions the Digilant company's Website www.digilant.com, which resolves to the United States. The "Privacy Contact" tab on Ghostery's Website confirms that Digilant's headquarters are indeed located in Boston, USA.

In both cases, the geolocation has been done using the Python GeoIP and geoiplookup command-line utility tools, which query geolocation databases. These tools take an IP address or a URL as input, and output the location of the corresponding website.

3.2 The trackees

We first consider the average number of trackers per site for each of the 55 countries considered. The average is computed by connecting to the top 100 most popular sites of a country, summing the number of trackers on each of them, and dividing the final result by the number of retrieved sites.

Figure 2: Average number of trackers (Ghostery)

Figure 2 displays the average number of trackers (Y axis) for each considered country (X axis). The top 100 sites are accessed from a local IP, and the trackers are detected using Ghostery data for the 55 selected countries. The results clearly show that users are tracked differently on the Internet according to their country. The different colors in the figure refer to the different continents [fn. 16]. As shown in Appendix B, these results do not in fact depend on the visitor locations. In other words, in most cases, a given site tracks its visitors independently of the visitor's IP address.

Footnote 16: Europe is in green, Asia in black, North America in dark blue, Africa in light blue, South America in purple, Australia in white

While results do not show big differences between continents, some countries seem to be much more tracked than others. For example, US Internet users are tracked 5 times more than Chinese users [fn. 17].

Footnote 17: We define as a US (resp. Chinese) user a user that visits the Alexa top 100, i.e. the 100 most popular sites in the US (resp. in China)

Figure 3: Average number of trackers (Ghostery, mobile)

Figure 3 displays the same type of results as Figure 2, with the browser's User-Agent string set to a mobile device. Although the results are different, with a smaller average number of trackers, the trends are similar. Some countries are still much more tracked than others. The smaller numbers are likely due to the fact that many Websites redirect the mobile browser to a special version of the site, tailored towards devices with a smaller display. These sites, it seems, include fewer third-party resources, probably to speed up loading. Furthermore, many popular sites have a dedicated mobile application anyway, so the potential losses from tracking and/or advertising can be balanced by the use of in-app advertisements.

Figure 4: Average number of trackers (ABP)

Figure 4 exhibits results from experiments similar to those of Figure 2, but using ABP instead of Ghostery. The absolute numbers are slightly smaller, but the trends are very similar. Appendix A presents more experiments that analyse the consistency of the results obtained by ABP and Ghostery.

Figure 5: Distribution of trackers for selected countries

We then considered the distribution of trackers for selected countries, which is shown in Figure 5. The distribution is obtained by counting, for a given country, all occurrences of a particular tracker, e.g., DoubleClick. These numbers are then ordered from the largest to the smallest, and plotted. The first value of a given curve shows the number of occurrences of the most popular tracker, the second value shows the number of occurrences of the second most popular tracker, and so on. The analysis is based on Ghostery.

The results clearly show that in China there are only 10 different trackers on the top 100 Chinese sites, and these trackers are not very active. Moreover, the most popular, CNZZ, only appears in 10 of the top 100 Chinese sites. In contrast, other countries, such as the US, have a large number of different trackers (about 90 for the US), and some of these trackers, for example Google Analytics, are very popular on a large number of sites.

Does it mean that the US market is more fragmented? Indeed, it seems that the number of tracking companies is much larger in the US than in other countries, which naturally increases the average number of trackers per site in the US.

Finally, we considered the distribution of different tracker types. Ghostery divides its detected scripts into five categories [6], which we recall below:

1. Ad: advertisements provided by the ad-networks;
2. Tracker: scripts which actually perform tracking (often using very sophisticated behavioral analysis);
3. Analytics: utility scripts for Website creators allowing them to discover various statistical details about their visitors;
4. Widget: small Web applications such as clocks, weather tables, and others. Other examples include Facebook Social Plugins, Google +1, etc.;
5. Privacy: typically a script disclosing privacy policies and practices related to ads, such as Evidon Notice [fn. 18].

Footnote 18: http://www.evidon.com/about/

Figure 6: Distribution of trackers by types

Figure 6 displays the distribution of trackers over the five categories of [6] for the 55 countries, based on Ghostery. It is interesting to note that the distribution of each type seems to be quite similar in different countries. The Ad trackers are the most common, followed by the Analytics ones.

3.3 The trackers

We next consider the geographic origin of the trackers, and the way trackers proliferate depending upon the country they come from. We start by analyzing the origin of the trackers on the top 100 sites of the global list, that is, the list of the 100 most popular Websites worldwide.

Table 6 shows the distribution of the trackers detected using Ghostery. We computed the number of trackers, $T$, on the top 100 sites worldwide. We then counted, for each country $i$, the number of trackers of origin $i$, $C_i$, and then computed, for each country, the percentage $p_i = C_i / T$.

CC     Share
US     87%
CN     3%
GB     3%
RU     2%
ROW    4.2%

Table 6: Distribution of trackers on top 100 global sites (ROW: Rest Of World)

The results show a clear domination of the US, with more than 80% coverage of the top 100 sites worldwide; only a few other countries, such as China, GB, and Russia, have a small percentage of trackers.

Tracker (country of origin)        Count
Google Analytics (US)              25
DoubleClick (US)                   20
ScoreCardResearch Beacon (US)      19
Facebook Social Plugins (US)       11
Omniture (US)                      10

Table 7: Top 5 trackers on top 100 global sites

Table 7 displays the top 5 most popular tracking corporations on the top 100 most popular sites worldwide. The results clearly confirm that the tracking business is largely dominated by US companies.

Let us push the analysis of the origin of the trackers further, and consider an in-depth analysis for selected countries. For the sake of readability, we decided to display here the results for only 5 countries, namely Russia, China, USA, Nigeria, and Hungary, and analyze the origin of the trackers of the top 100 sites for each of these countries [fn. 20].

Footnote 20: The analysis of all 55 countries listed at the beginning of this paper is presented in Appendix C.

CC    Tracker (country of origin)        Count
RU    Google Analytics (US)              53
RU    LiveInternet (RU)                  51
RU    Yandex.Metrics (RU)                31
RU    TNS (GB)                           30
RU    Rambler (RU)                       27
CN    Google Analytics (US)              20
CN    MarkMonitor (US)                   16
CN    CNZZ (CN)                          10
CN    ScoreCardResearch Beacon (US)      6
CN    DoubleClick (US)                   5
US    Google Analytics (US)              39
US    DoubleClick (US)                   32
US    ScoreCardResearch Beacon (US)      32
US    Omniture (US)                      21
US    Facebook Connect (US)              14
NG    Google Analytics (US)              62
NG    Facebook Social Plugins (US)       39
NG    DoubleClick (US)                   32
NG    Facebook Connect (US)              30
NG    Google Adsense (US)                23
HU    Google Analytics (US)              70
HU    Median (HU)                        48
HU    Gemius (PL)                        47
HU    Adverticum (HU)                    37
HU    Facebook Social Plugins (US)       31

Table 8: Top 5 trackers for selected countries

Table 8 displays the top 5 most popular tracking companies on the 100 most popular Websites of these 5 selected countries. Once again, the results clearly indicate that the tracking business is largely dominated by US companies. Only a few countries seem to resist this domination, such as Russia, China, and Hungary. Interestingly, Russia is the only country whose trackers are mostly local.

We next consider the origin of the most prominent trackers in the following eight countries: USA, Hungary, Poland, Norway, France, Russia, China, and Nigeria. We performed the following analysis. We first collected all the trackers on the top 100 sites of a country, classifying them according to their origin, and counting their occurrences. Since the raw numbers, even averages, may not be the most informative in this analysis, we decided to plot the information relative to the detected number of US-based trackers to ease the presentation. More specifically, we computed for all trackers of an origin $N$ detected on the top 100 sites of a country $C$ the following ratio:

$$\mathrm{Ratio}_{US}(N) = \frac{\#\,\text{trackers of origin } N}{\#\,\text{US trackers}}$$

The US has been chosen as a reference because of the global prevalence of the trackers of this origin. Indeed, US trackers are present on almost every site.

Figure 7: $\mathrm{Ratio}_{US}$ for trackers by origin (Ghostery)

Figure 8: $\mathrm{Ratio}_{US}$ for trackers by origin (ABP)

Figures 7 and 8 show, for each of these selected countries, the origin of the top 4 most prominent tracking countries for both detection methods, Ghostery and ABP. The ratio of US trackers and the ratio of local trackers are given in parentheses for each country. These figures clearly show that the US almost always has the largest ratio of trackers. The second largest tracker country is, apart from some exceptions, the country itself. Austria, for instance, constitutes an exception to this rule, with Germany as second tracker with a ratio of 0.24.

China and, even more so, Russia constitute the two exceptions, with remarkably strong ratios of local trackers over US trackers. In fact, China has a ratio of 0.78, and Russia a ratio of 2.3, which indicates that Russian sites contain more Russian trackers than US ones.

We also observed some variation in the number and the distribution of countries involved in tracking in a given country, beyond the US and the country itself. For example, in France there are third-party resource providers from 11 countries, whereas in China only 6 countries are "represented".

In addition, we observed that these trackers often come from neighboring countries. For example, Danish sites often contain trackers from Sweden or Finland. The same observation applies to Austria, as we noticed already, as well as to other countries in Central Europe, such as Hungary, Slovakia and the Czech Republic. Therefore, with the exception of the US, most trackers are regional.

4. Sites vs Trackers Analysis

In this section, we compare the harvesting activity on the visible and the invisible Web. Our objective is to understand whether the predominance of the US is larger on the visible or on the invisible Web. In order to achieve this goal, we compute and compare the proportion of US sites (resp. US trackers) with the proportion of local sites (resp. local trackers) for the top 100 sites of each of the 55 considered countries.

These proportions, $P_{US}$ and $P_{local}$, are computed as follows:

$$P_{US}(C) = \frac{\#\,\text{US trackers (resp. sites) in country } C}{\#\,\text{all trackers (resp. sites) in country } C}$$

$P_{US}(C)$ is the number of US trackers (resp. US sites) divided by the total number of trackers (resp. sites) in the top 100 sites in country $C$. Quantitatively, it shows how trackers of a dominating country (e.g. the US) track the world.

$$P_{local}(C) = \frac{\#\,\text{local trackers (resp. sites) in country } C}{\#\,\text{all trackers (resp. sites) in country } C}$$

$P_{local}(C)$ is the number of local trackers (resp. local sites) divided by the total number of trackers (resp. sites) in the top 100 sites in country $C$. Quantitatively, it shows the strength of local trackers.

Figure 9: Ratio US vs local sites and trackers

The scatterplot of $P_{US}$ and $P_{local}$ for all 55 countries considered, for sites and trackers, is shown in Figure 9. Every black square (resp. red circle) corresponds to a specific country $C$, with coordinates $P_{US}(C)$ (X axis) and $P_{local}(C)$ (Y axis), for sites (resp. trackers). The trackers have been obtained using Ghostery, as for previous measures.

The results clearly show that the US is even more present on the invisible than on the visible Web. Most of the tracker points (in red) are located in the lower right corner of the plot. This indicates that the percentage of US trackers is large in most considered countries, while the percentage of local trackers is usually smaller, whereas the distribution of the proportion of sites is more balanced between US and local sites.

CC    P_US         P_local
RU    0.39/0.36    0.49/0.52
CN    0.69/0.68    0.29/0.31
HU    0.57/0.53    0.21/0.23
PL    0.65/0.63    0.17/0.17
NO    0.68/0.62    0.03/0.04
FR    0.71/0.67    0.19/0.22
NG    0.9/0.86     0/0

Table 10: $P_{US}$ and $P_{local}$ for trackers

[...] sites are excluded. The results contrast with the results of the previous table, and clearly show that the percentage of US trackers is larger than the percentage of local trackers, except for Russia.
Finally it is interesting to compare Russia and China with respecttotheproportionoflocalsitesandtrackers. Figure10:SitesvsTrackers(Ghostery) This finding is further confirmed with the results shown on Figure 10, which presents the scatterplots for sites against trackers:therelationshipsbetweenUSsites(blacktriangles) Figure11:SitesandtrackersinRussia (resp.localsites(redstars))andthecorrespondingtrackers forGhostery.Theredstarsareonthelowerpartofthefigure, corresponding to a small percentage of local trackers for Figure 11 shows how Russia manages to have both more most considered countries even with a large proportion of russian sites and trackers at home, although somehow with local sites. Whereas, the black triangles are on the higher thesameorderofmagnitudesasUSsitesandtrackers,while partofthefigure,correspondingtoalargepercentageofUS Figure 12 shows that China has mostly Chinese sites, but trackers. mostlyUStrackers. Letusnowconsideragainanin-depthAnalysisofSelected Countries.WeconsidertheUSandlocalproportionofsites and trackers for seven selected countries, namely Russia, China,Hungary,Poland,Norway,FranceandNigeria. CC PUS Plocal RU 0.23 0.54 Figure12:SitesandtrackersinChina CN 0.12 0.86 HU 0.26 0.56 PL 0.21 0.69 5. Conclusion NO 0.37 0.51 Westudiedtheglobaldistributionandproliferationofthird- FR 0.34 0.61 party resources on the most popular sites in various coun- NG 0.72 0.04 tries. Our research reveals very different strategies and/or capacities to harvest local as well as global data, both on Table9:PUS andPlocal forsites thevisibleandtheinvisibleWeb. • InEurope,mostcountrieshavetwiceasmanylocalsites Table9displaystherespectivepercentagesofUSandlocal as US sites amongst their top 100 most popular sites, sites.Itshowsthatlocalsitesareoftendominant,exceptfor although the top sites are mostly US, and the trackers somedevelopingcountriessuchasNigeria. 
are mostly US as well, followed by local and regional Table10displaystherespectivepercentagesofUSandlocal trackers, thus leading to an important flow of data from trackers. The second value shows the results when the US EuropetotheUS.
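The three quantities used in the analysis above, the share $p_i = C_i / T$, the ratio $\mathrm{Ratio}_{us}(N)$, and the pair $P_{US}(C)$, $P_{local}(C)$, are all simple counts over the origins of the detected trackers. The following is a minimal sketch, assuming each detection has already been reduced to a two-letter country code of origin. The function names and the sample data are hypothetical and do not come from the paper or its tooling.

```python
from collections import Counter


def origin_shares(origins):
    """Share p_i = C_i / T for every origin i in a list of tracker origins."""
    counts = Counter(origins)      # C_i for each origin i
    total = sum(counts.values())   # T, the total number of detections
    return {origin: count / total for origin, count in counts.items()}


def ratio_us(origins):
    """Ratio_us(N) = (# trackers of origin N) / (# US trackers)."""
    counts = Counter(origins)
    us_count = counts.get("US", 0)
    if us_count == 0:
        # The ratio is undefined when no US tracker was detected.
        raise ValueError("no US trackers detected; ratio is undefined")
    return {origin: count / us_count for origin, count in counts.items()}


def p_us_and_local(origins, country):
    """Return (P_US(C), P_local(C)) for the detections of one country C."""
    shares = origin_shares(origins)
    return shares.get("US", 0.0), shares.get(country, 0.0)


if __name__ == "__main__":
    # Invented sample: origins of 10 tracker detections on one country's sites.
    sample = ["US"] * 6 + ["RU"] * 3 + ["GB"]
    print(origin_shares(sample))
    print(ratio_us(sample))
    print(p_us_and_local(sample, "RU"))
```

The same functions apply to sites instead of trackers by passing the countries of origin of the sites. Note that for the US itself, local and US origins coincide, so both returned proportions are equal.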
