A Study of End-to-End Web Access Failures Venkata N. Padmanabhan† Sriram Ramabhadran§ Sharad Agarwal† Jitendra Padhye† †Microsoft Research §UC San Diego ABSTRACT fail in differentways (e.g., at theDNS,TCP, orHTTP lev- els) andforavarietyofreasons(e.g., problemswiththeac- We present a studyof end-to-endweb access failures in the cesslink,local DNS,WANconnectivity,website,etc.). Our Internet. Partofourcharacterization offailuresisbasedon objective here is to develop techniques for and to present directlyobservableend-to-endinformation. Wealsopresent a characterization of such end-to-end failures. The client novelanalysesthatrevealaspectsofend-to-endfailuresthat vantage point reveals a more complete picture of end-to- would behardtodiscern otherwise. First,wecombineend- end failures than monitoring of any individual component to-endfailureobservationsacrossalargenumberofclientsto would,albeitonlyforthe(limited)setofclientsthatweare classify failures as server-related or client-related. Second, in a position to monitor. So we believe that our work com- we correlate failures attributed to a client or server with plements prior work focused on the server-side view (e.g., BGP churn for the corresponding IP address prefix(es), to [8, 9, 21]), which provides an incomplete picture of a much shed light on the end-to-endimpact of BGP instability. broader set of clients. It also complements work focused on Ourstudyisbasedonfailureobservationsduringamonth- a detailed analysis of individual components of end-to-end long experiment involving 134 client hosts (across Planet- communication(e.g.,DNS[18,22,23],BGP[19,12,14,26]) Lab, commercial dialup and broadband ISPs, and a corpo- or traceroute-based fault analysis of theIP path [32, 16]. ratenetwork)repeatedlyaccessing80websites. Wefindthat We start with analysis that is based on information di- themedianfailurerateofwebaccessesisabout1.5%,which rectly available from the individual web accesses observed is non-negligible. About 34-42% of the web access failures at each client. In particular, we classify web access fail- are due to DNS problems, primarily due to the inability of ures as being due to DNS (inability to resolve the website the client to connect to its local DNS server. The majority name),TCP(inabilitytodoaTCPtransferfromtheserver of the remaining failures are due to TCP connection estab- to client), or HTTP (inability of the server to return the lishment failures. Also, by correlating failure observations requested content). DNS and TCP failures can further be across clientsand servers,wefindthat server-sideproblems classified into unresponsive local DNS server, TCP connec- are thedominant cause of TCP connection failures. tion establishment failure, etc. We then turn to gaining a deeper understanding of the CategoriesandSubject Descriptors natureofend-to-endfailuresbytappingintoinformationbe- yond that obtained from observing individual web accesses C.2 [Computer-Communication Networks]: Misc. at each client. Wepresent two novel analyses. First,wecombinefailureobservationsmadeacrossclients GeneralTerms and web servers to identify the extent of correlation in the failurepatterns. 
Suchcorrelation,orthelackthereof,isused Measurement, Reliability to infer the likelihood of an end-to-endfailure being dueto a client-side problem (i.e., affecting a significant fraction of Keywords aclient’scommunicationwithvariousservers),aserver-side problem (i.e., affecting a significant fraction of a server’s Web access, web failure, TCP, DNS,HTTP, BGP communication with various clients), or otherwise. Such a determinationwouldbehardtomakebasedjustonindivid- 1. INTRODUCTION ual web accesses. Wepresent a study of end-to-end failures of web accesses Second, we determine the extent to which client-side or in theInternet. Aweb access consists of a client download- server-sideproblemscoincidewithnetworkroutinginstabil- ingoneormoreobjectsfromawebserver. Theaccesscould ity at the inter-domain level. We identify the latter based on BGP churn in the corresponding IP prefixes. Our goal is to understand the relationship between network routing problems and end-to-endfailures. Wepresentmeasurementsfrom amonth-longexperiment Permission tomake digital orhardcopies ofall orpartofthis workfor conducted in Jan 2005, in which a set of 134 client hosts personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare repeatedlyaccessedadiversesetof80websites. Theclients notmadeordistributedforprofitorcommercialadvantageandthatcopies weredistributedgeographically(althoughthemajoritywere bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to intheU.S.)andacrossPlanetLab,theMSNdialupnetwork, republish,topostonserversortoredistributetolists,requirespriorspecific multipleresidentialbroadbandnetworks,andtheworldwide permissionand/orafee. corporate network of a major corporation. (We are making Copyright2006ACM1-59593-456-1/06/0012...$5.00. our measurement data available online [2].) theobjectofinterestusingHTTP.Werefertothedownload Wefindthattheoverallfailurerateofwebaccessesislow of each individual web object as a separate transaction. but non-negligible. The median failure rate across clients A transaction fails when any of these steps fails. These is 1.47% and that across servers is 1.63%, representing less steps proceed in order and the client can tell which, if any, thantwo9sofavailability. Sowebelievethatitisimportant has failed. Thus there are three basic categories of failures tounderstandthenatureofthesefailures. Herearesomeof directlyobservableattheclient,eachofwhichcanbefurther the keyfindings from our analysis of these failures: categorized into sub-classes: • 34-42% of web access failures are due to DNS prob- 1. DNS:Thewebsitenamecannotberesolved. Thiscan lems,about74-83%ofwhicharecausedbytheclient’s bedue toseveral observable reasons: inability to connect toits local DNSserver. a) Local DNS Server (LDNS) timeout: LDNSis unreachable,becauseitisdownorbecauseofnetwork • The remaining failures (57-64%) are almost all due connectivity problems between it and theclient. to TCP connection failures. The majority of these b) Non-LDNS timeout: LDNS is responsive, but are TCP connection establishment failures (i.e., failed the lookup still times out, because of an unreachable SYN handshake). authoritative server elsewhere in theDNS hierarchy. • Our correlation analysis reveals (somewhat surpris- c) Error response: Anerrorisreturnedbecausethe ingly)thatserver-sideproblemsarethedominantcause name could not be resolved (e.g., NXDOMAIN). of TCP connection failures. This is primarily because 2. 
TCP: Name resolution is successful, but either the client connectivity problems manifest themselves as TCPsession couldnotbeestablished orunexpectedly DNS resolution failures, precluding even a TCP con- terminated. Wecan observethe following : nection establishment attempt. a) No Connection: The client cannot connect to • Although the incidence of failure across websites is the server, i.e., the TCP SYN handshake fails, either highlyskewed,70%ofthewebsitesinourstudyexperi- becauseof anetworkconnectivityproblem orbecause encedatleastoneserver-sidefailureepisode,whichwe theserver is down. defineas a failure rate of ≥5% overa 1-hour period. b) No response: Theclient establishes aconnection andsendsitsrequest,butdoesnotreceivearesponse. • Severe BGP instability for a client or server’s IP pre- A server overload or failure of the server application fix often implies significant failures for the client or cancausethis. Whiletheremaybenetworkconnectiv- server’send-to-endcommunication. However,suchse- ity issues, thesuccess of theSYN handshakemakesit vere BGP instability is rare and does not account for less likely that therewas a total connectivity failure. thevast majority of end-to-endfailures. c) Partial response: The client receives only part of the server’s response before the connection termi- Besidestheseandotherspecificfindings,webelievethata natesprematurely,eitherbecauseofaserverfailureor keymethodologicalcontributionofourworkisincorrelating because of a server/network problem that makes the failureobservationsacrossendhoststoanalyzethenatureof connection so slow that the client times out and ter- end-to-end failures. We believe that this approach is novel minates theconnection. andprovidesausefulcomplement topriortraceroute-based approaches(e.g., [32, 16]),especially insettingswherefire- 3. HTTP:TheTCPtransferissuccessful,buttheserver walls or otherfilters impede traceroute functionality. does not supply the desired content and instead re- Therestofthispaperisorganized asfollows. Wepresent turns an HTTP error (e.g., file not found). Since our analysis framework in Section 2, and our experimental HTTP failures are rare in our study (under 1-2% of setup and methodology in Section 3. This sets the stage all failures), we do not categorize them further here. forthepresentationofourresultsandanalysesinSection4. Wediscusstheimplications ofourfindingsinSection5and 2.2 CorrelatingAcrossClients&Servers related work in Section 6. Weconclude in Section 7. Localobservationsatanindividualclientofitscommuni- cation with a particular server may not always indicate the 2. FAILUREANALYSISFRAMEWORK natureoftheproblemthatiscausingthefailure. Forexam- In this section, we present a framework for client-based ple, in the case of a “no connection” failure, it is not clear characterization and analysis of end-to-end web access fail- whether the cause is a connectivity problem at the client ures. Part of our characterization is based on information end,aserverfailure, oraproblemintheinteriorofthenet- that is directly available from the individual web accesses work. Client-based traceroute is impeded by firewalls (as observed at each client. We also consider inferences that withthecorporateclientsinourstudy),doesnotrevealfail- can be drawn by combining end-to-end observations across ures in the server→client direction, and is often incomplete clients and using indications of network routing instability. even when end-to-endweb communication is successful. 
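The taxonomy in Section 2.1 can be applied mechanically to the fields recorded for each transaction. The sketch below is a minimal illustration of that classification, not the authors' actual tooling; the record fields (dns_status, syn_acked, bytes_received, etc.) are hypothetical names for the observations described above.

```python
# Hypothetical per-transaction record; field names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transaction:
    dns_status: str            # "ok", "ldns_timeout", "other_timeout", "error"
    syn_acked: Optional[bool]  # did the TCP SYN handshake complete? (None if DNS failed)
    bytes_received: int        # response bytes received before the connection ended
    response_complete: bool    # did the full HTTP response arrive?
    http_status: Optional[int] # HTTP status code, if any

def classify_failure(t: Transaction) -> Optional[str]:
    """Return a failure label per Section 2.1, or None if the transaction succeeded."""
    # 1. DNS-level failures: the website name could not be resolved.
    if t.dns_status == "ldns_timeout":
        return "DNS/LDNS-timeout"          # local DNS server unreachable
    if t.dns_status == "other_timeout":
        return "DNS/non-LDNS-timeout"      # LDNS responsive, lookup still timed out
    if t.dns_status == "error":
        return "DNS/error-response"        # e.g., NXDOMAIN or SERVFAIL
    # 2. TCP-level failures: name resolved, but the transfer did not complete.
    if not t.syn_acked:
        return "TCP/no-connection"         # SYN handshake failed
    if t.bytes_received == 0 and not t.response_complete:
        return "TCP/no-response"           # connected and sent request, got nothing back
    if not t.response_complete:
        return "TCP/partial-response"      # connection ended before the full response
    # 3. HTTP-level failures: transfer succeeded but the server returned an error.
    if t.http_status is not None and t.http_status >= 400:
        return "HTTP/error"
    return None
```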
Note that “failure” does not imply a total inability to Instead,wedisambiguatethelikelycauseoffailurebycor- communicate,butratherjustnoticeablyabnormalbehavior relatingfailureobservationsacrossclientsandservers. First, (e.g., a failure rate of 15% is abnormally higher than the we identify failure episodes, which are periods with an ab- normalfailurerate1%). Weusetheterms“failure”,“fault” normallyhighfailurerateforaclientoraserver(compared and “problem” interchangeably. to the system-wide “normal” behavior, as discussed in Sec- tion 4.4). Second, by combining failure observations across 2.1 FailureofIndividualTransactions clients and servers, we categorize failure episodes as: We begin by discussing the categorization of failures of individualweb accesses, or transactions. Aweb transaction 1. Server-side: Ifaserverisexperiencinganabnormally consistsofaclientresolvingawebservernametoanIPad- highaggregatefailurerateinitscommunicationacross dress,establishingaTCPconnectiontoit,anddownloading many clients, we term the corresponding period as a server-side failure episode for this particular server. the entire transaction. This data is available to the client Note that the underlying cause could be a network without requiring additional network communication. In problem that affects accesses to theserverfrom many addition,wehavetheclientinvokeaDNSlookupusingiter- clients ratherthana problem at theserveritself (e.g., ativequeriestoresolvethewebsite’sname(startingwiththe BGP instability in the corresponding prefix). LDNSserver,andthenworkingdownfromtherootservers). Notethattheend-hostvantagepointiscrucialtobuilda For websites with multiple replicas, a server-side fail- fullpictureofend-to-endfailures. Monitoringtrafficfrom a ureepisodecouldaffectallreplicas(totalreplicafail- different vantage point — say within the network or at the ure episode) or affect only a subset of the replicas server—runstheriskofmissingcertainend-to-endfailures (partial replica failure episode). Note that “total” (e.g.,DNSlookuporTCPSYNfailuresduetoalocalfault). and “partial” only refer to the spatial extent of the failure episode across thereplicas, not tototal orpar- 3.2 Clients tialfailureofaccessestothewebsite. So,forinstance, an abnormally high failure rateof 20% thataffects all Weused the 4 sets of clients listed in Table 1: replicas of a website would still be termed as a total PlanetLab(PL):Wepicked95PlanetLabnodesacross64 replica failure episode. sites. Having multiple nodes at many of the sites enabled ustoidentifyfailures thatwerelikelytobeclient-site-wide. 2. Client-side: If a client sees an abnormally high ag- All nodes ran Linuxkernel version 2.6.8. gregate failure rate in its communication across many servers, we term thecorresponding period as a client- Dialup (DU):Wehad5clients, alllocated inSeattle,dial side failure episode for this client. into26PoPsspreadacross9U.S.citiesintheMSNnetwork. The PoPs we pickedin each city were operated bydifferent 3. Client-server-specific: Ifaspecificclient-serverpair providers, from whom MSN buys service. The clients di- is experiencing an abnormally high failure rate, but aled intothevariousPoPsinrandomorderandthendown- neithertheclientnortheserverisexperiencinganab- loaded the URLs from the designated set also in random normally high failure rate in aggregate, then we term order. Thus although we only had 5 dialup clients, we ef- thecorrespondingperiodasaclient-server-specificfail- fectively had 26 “virtual” clients, each of which connected ure episode. 
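The episode categories of Section 2.2 reduce to a simple attribution rule: a failed access is labeled according to which of its associated entities (client, server, proxy, client-server pair) had an abnormally high failure rate during that episode. The following sketch assumes the set of abnormal (entity, episode) pairs has already been computed using the threshold discussed later in Section 4.4.3; the data layout is illustrative.

```python
# Illustrative sketch of the Section 2.2 attribution; "abnormal" is assumed to
# already contain the (kind, entity, episode) triples flagged as abnormally high.

def attribute_failure(client, server, proxy, episode, abnormal):
    """Label one failed access by the entities that were abnormal in its episode."""
    client_bad = ("client", client, episode) in abnormal
    server_bad = ("server", server, episode) in abnormal
    proxy_bad  = proxy is not None and ("proxy", proxy, episode) in abnormal
    pair_bad   = ("pair", (client, server), episode) in abnormal

    labels = []
    if server_bad:
        labels.append("server-side")
    if client_bad:
        labels.append("client-side")
    if proxy_bad:
        labels.append("proxy-related")
    # Client-server-specific: the pair is abnormal, but neither endpoint is.
    if pair_bad and not (client_bad or server_bad):
        labels.append("client-server-specific")
    # Anything else is an intermittent or transient failure.
    return labels or ["other"]
```

Note that, as discussed above, these labels are suggestive of where the problem lies rather than definitive root causes, and a single failure can receive more than one label.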
to the Internet via a different path and hence provided a different perspective on the wide-area network. All nodes 4. Proxy-related: Iftheclient’saccessesthroughapar- ran Microsoft Windows XP. ticular proxy exhibit an abnormally high failure rate, we label the corresponding period as a proxy-related CorpNet (CN): We had 5 nodes, labeled SEA1, SEA2, failureepisode. Noteifalloftheaccessesofco-located SF,UK,and CHN,across 4 locations on Microsoft’s corpo- clients go through the same proxy, then it would be rate network. All external web requests from each of these hardtotellapartaproxy-specificproblemfromasep- 5 nodes were per-force routed via separate HTTP proxy arate client-side problem. caches. The proxy was located at the local site in all cases except for CHN, where it was located in Japan. In addi- 5. Other: Besides thefailure episodes noteabove,there tion, we had another node in Seattle (SEAEXT) that was may be intermittent or transient failures that are not located outsidethecorporate firewall/proxy butshared the significant enough in intensity to be registered as ab- sameWANconnectivityasSEA1andSEA2. TheCNnodes normalforaclient,server,proxy,orclient-serverpair. ran various flavors of Microsoft Windows (2000, XP, 2003). Note that while the above categorization may be sugges- Broadband (BB):Wehad7residentialbroadbandclients tive of the location of the problem, it does not indicate the (5 DSL and 2 cable modem) spread across 4 ISP networks rootcausewithcertainty. Also,thecategorizationisnotmu- (Roadrunner, SBC/Yahoo, Speakeasy, and Verizon) in 4 tually exclusive. For example, a server-side failure episode U.S.cities. Theaccesslinkspeedforthesehostswas768/128 could overlap in time with a client-side failure episode. So (down/up) Kbpsor higher. communicationbetweenthecorrespondingclient-serverpair Our choice of a diverse set of clients (in terms of geo- would be affected byboth. graphiclocationandthenatureandspeedofconnectivity)is WedeferthediscussionofouranalysisofBGPinstability motivatedbythedesiretoobtainabroaderunderstandingof and its impact on end-to-endfailures to Section 4.6. InternetbehaviorthancanbeobtainedfromthePlanetLab nodes alone, which are predominantly located at academic 3. SETUP ANDMETHODOLOGY sites [10]. Although we had a total of 95+5+6+7=113 client machines, the DU clients dialing in to 26 PoPs effec- 3.1 Overview tively gave us a total of 134 clients. Weranourexperimentduringaone-monthperiod: Jan1– Whiledialup mightbeonthewane, itremainsanimpor- Feb1,2005. Duringthisperiod,eachclienthostrepeatedly tant network access technology with a significant presence accessed a set of URLs, accessing each URL about 4 times (e.g., [6] indicates that 30% of U.S. home users were on di- per hour. We randomize the sequence of accesses to avoid alup as of June 2006). Also, for the purposes of our study, systematic bias. To limit network load, we download only thedialupclientsprovidevisibilityintofailuresobservedon the top-level “index” file for each web page. The download paths through commercial ISPs, and many of these failures attempt isterminated(and declared ashavingfailed) ifthe are likely to beindependent of thelow speed of dialup. underlying TCP connection idles (i.e., makes no progress) Finally,sincewewereconstrained tolocate allthedialup for 60 seconds; note that the download could take longer clientsinSeattle,thereistheconcernthattheextralatency provided it does not idle for 60 seconds. 
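Note that the 60-second rule is an idle timeout rather than a cap on total download time: a download is abandoned only if no bytes arrive for 60 consecutive seconds. A minimal sketch of these semantics using only the standard socket library is shown below (the experiment itself used wget, whose read-timeout option behaves similarly); the HTTP handling is deliberately simplistic.

```python
import socket

def fetch_with_idle_timeout(host, path="/", idle_timeout=60.0, port=80):
    """Download an HTTP object, giving up only if the connection makes
    no progress (receives no bytes) for idle_timeout seconds."""
    sock = socket.create_connection((host, port), timeout=idle_timeout)
    sock.settimeout(idle_timeout)            # applies to each recv(), i.e., an idle timer
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    sock.sendall(request)
    chunks = []
    try:
        while True:
            data = sock.recv(4096)
            if not data:                     # server closed the connection normally
                break
            chunks.append(data)              # progress was made; the idle timer restarts
    except socket.timeout:
        raise RuntimeError(f"download failed: connection idle for {idle_timeout}s")
    finally:
        sock.close()
    return b"".join(chunks)
```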
incurredindialingintoremotePoPsmightskewtheperfor- For each download, we record several pieces of informa- mance numbers. However, given our focus on failure rates tion: theDNSlookuptime(orfailureindication),thedown- rather than absolute performance numbers, this was not a load time (or failure indication), and apacket-leveltrace of significant concern. Category PlanetLab (PL) Dialup(DU) CorpNet (CN) Broadband(BB) #Clients 95 5(26PoPs) 5(+1) 7(5DSL,2Cable) US-EDU(50),US-ORG(19), Boston(ILQ),Chicago(ILQ),Houston(ILQ), SanFrancsico(1), Pittsburgh(1),San Details US-COM(4),US-NET(5), NewYork(IQU),Pittsburgh(ILQ),SanDiego(ILQ), Seattle(2+1), Diego(2),Seattle(3), Europe(13),Asia(4) SanFrancisco(ILQ),Seattle(ILQ),Wash. DC(IL) UK(1),China(1) SanFrancisco(1) Table 1: Clients used in ourexperiment. The DUclient providersareICG(I), Level3(L), Qwest(Q), and UUNet(U). US-EDU (8): berkeley.edu, washington.edu, cmu.edu, umn.edu, TherearealsospecialstepsfortheDUandCNclients. To caltech.edu,nmt.edu,ufl.edu,mit.edu minimize the overhead of dialing out in the case of the DU US-POPULAR (22): amazon.com, microsoft.com, ebay.com, clients,wedialintoaPoPatrandomanddownloadallthe mapquest.com, cnn.com, cnnsi.com, webmd.com, espn.go.com, URLs (in random order) at a stretch, before switching to a sportsline.com, expedia.com, orbitz.com, imdb.com, google.com, differentPoP.FortheCNclients,weconfigurewgettoissue yahoo.com, games.yahoo.com, weather.yahoo.com, msn.com, pass- requestswiththe“no-cache”cache-request-directive[15]set, port.net,aol.com,nytimes.com,lycos.com,cnet.com toensurethattheresponseisreceivedfromtheoriginserver. US-MISC (15): latimes.com, nfl.com, pbs.org, cisco.com, ju- We do so to avoid having the proxy cache mask failures niper.net, ibm.com, fastclick.com, advertising.com, slashdot.org, beyondtheproxy. However,sincetheproxyratherthanthe un.org,craigslist.org,state.gov,nih.gov,nasa.gov,mp.com CNclient doesnameresolution,andthereisnowayforthe INTL-EDU (10): iitb.ac.in, iitm.ac.in, technion.ac.il, client to force the DNS cache at the proxy to be flushed, cs.technion.ac.il, ucl.ac.uk, cs.ucl.ac.uk, cam.ac.uk, inria.fr, hku.hk, some DNSfailures may bemasked from the client. nus.edu.sg We did not gather packet-level traces on the BB clients. INTL-POPULAR (15): amazon.co.uk, amazon.co.jp, bbc.co.uk, Sincethesewereusers’homecomputers,therewereprivacy muenchen.de, terra.com, alibaba.com, wanadoo.fr, sohu.com, andstoragerequirementconcerns. Also,whilewedidgather sina.com.hk, cosmos.com.mx, msn.com.tw, msn.co.in, google.co.uk, packet-level traces for the CN machines, these were not in- google.co.jp,sina.com.cn terestingsincetheyonlyrevealedthedynamicsofTCPcon- INTL-MISC (10): lufthansa.com, english.pravda.ru, rediff.com, nections to thelocal proxy. samachar.com,chinabroadcast.cn, nttdocomo.co.jp, sony.co.jp, brazzil.com,royal.gov.uk,direct.gov.uk 3.5 Post-Processing Fromtherawdatarecordedforeachdownload,weobtain Table 2: The list of 80 web sites that were targets of our anindication ofthesuccess/failure ofboththeDNSlookup andthedownload,theDNSlookuptime,thedownloadtime, download experiment. For the sake of brevity, we have left andthefailurecode,ifany,reportedbywget. Westorethis outthe“www” prefix formostofthese hostnames. Weonly informationinaperformancerecord,togetherwiththeclient downloadedthe top-level “index”file ateach site. name, URL,server IP address, and time. 
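The four-step procedure of Section 3.4 maps directly onto off-the-shelf tools. The sketch below strings them together with subprocess calls; the tcpdump, wget, and dig invocations use standard flags, but the cache-flush command (shown with a modern systemd-resolved placeholder), the output paths, and the record layout are assumptions rather than the authors' scripts. The no-cache request directive was needed only for the proxied CN clients.

```python
import random
import subprocess
import time

def measure_once(url, hostname, trace_file):
    """One download measurement for a single URL (a sketch; see caveats above)."""
    # Start the packet capture first so it covers the entire transaction.
    capture = subprocess.Popen(["tcpdump", "-w", trace_file, "host", hostname])
    time.sleep(1)

    # Step 1: flush the local DNS cache (platform-dependent placeholder command).
    subprocess.run(["sudo", "systemd-resolve", "--flush-caches"], check=False)

    # Step 2: download only the top-level "index" file, with a 60s read (idle) timeout.
    wget = subprocess.run(
        ["wget", "--no-cache", "--read-timeout=60", "-O", "/dev/null", url],
        capture_output=True)

    # Step 3: an iterative lookup, working down from the root servers.
    dig = subprocess.run(["dig", "+trace", hostname], capture_output=True)

    capture.terminate()
    # A per-download performance record, per Section 3.5.
    return {"url": url, "time": time.time(),
            "wget_rc": wget.returncode, "dig_rc": dig.returncode}

def measure_iteration(urls):
    """One measurement iteration accesses the target URLs in random order."""
    random.shuffle(urls)
    return [measure_once(u, u.split("/")[2], f"trace-{i}.pcap")
            for i, u in enumerate(urls)]
```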
We also post-process the tcpdump/windump packet traces 3.3 WebSites to determine (a) the cause of connection failure (i.e., no connection,noresponse,orpartialresponse,asdiscussedin Wepickedasetof80websitesasthetargetforourdown- Section2.1),and(b)packetlosscount(inferredfrompacket load experiments. As indicated in Table 2, we tried to en- retransmissions). suresignificantdiversityamongthewebsitesintermsofthe geographic location, popularity,etc. (Popularity was deter- 3.6 BGPData minedbased ontheAlexalist [1].) Someofthesesiteswere replicated or served via CDNs. To determine in which period a client or server expe- The number of websites chosen was constrained by the rienced failures that coincided with inter-domain network frequency with which each client could perform downloads, routinginstability,weexaminepubliclyavailableBGProut- without generating excessive network traffic or triggering ingdata,fromtheRouteviewsproject[4]. WeuseBGPup- alarms. Inourexperiment,eachclientaccessedeachwebsite datesstoredintheMRTformatfromthemonthofJanuary approximately 4 times an hour, which translates to 80 ∗ 2005 from the Routeviews2, EQIX, WIDE, LINX and ISC 4=320 downloads per hour from each client (although the servers. In total, these 5 servers have 73 peering sessions number of TCP connections attempted was higher because with a variety of ASes, including several large ISPs such as of HTTP redirects and also retries by our wget client). AT&T, Sprint, and UUNet. The 203 client and replica IP Notethatfortheremainderofthispaper,weusetheterm addresses that we consider 1 are covered by 137 BGP pre- “server”torefertothewebsitelistedinaURL,andtheterm fixes 2, 132 of which are announced from at least 71 peer- “replica” to refer to a specific server IPaddress. ing sessions (or neighbors). The remaining 5 prefixes have verypoorconnectivityandcanbereachedfromlessthan13 3.4 DownloadProcedure neighbors. We processed this MRT update data to obtain thenumberofBGProutewithdrawals andnumberof BGP We used off-the-shelf tools to do our measurements. In routeannouncementsheardforeachclientorserverprefixin each measurement iteration, the URLs were sorted in ran- each1-hourepisode. Wealsocalculatedhowmanyofthe73 dom order. The procedurefor each download was: 1. Flush thelocal DNS cache. 1 WeexcludeIPaddressesthatsaw toofewconnections,e.g.,server 2. Use wget to download theURL (“index”file only). IPsthatdidnotqualifytobeareplica,asdiscussedinSection4.5. 2 Of the 203 client and replica addresses, 153 can be reached from 3. Use iterative dig to traverse theDNS hierarchy. only 1prefix, while theremaining50are coveredby 2 prefixes. We considerbothprefixesinthelattercase,tocoverthescenariowhere 4. Usetcpdump or windump torecord apacket-leveltrace of the more specific prefix has been withdrawn, or worse, filtered by the entiretransaction. someASesduetoaprefixlengthfilter. Category Trans. Failed Conn. Failed itisashighas10-20%forsomeclientsandservers;e.g.,the Trans. Conn. 95th%-tile of client failure rate is 10%. PL 16,605,281 458,692(2.8%) 21,163,180 539,787(2.6%) Figure 1 plots the mean transaction failure rate for each BB 2,307,855 30,023(1.3%) 2,849,889 19,408(0.7%) DU 381,556 2,622(0.7%) 471,931 2,343(0.5%) category of clients (shown as underlined numbers). 
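The packet-trace post-processing of Section 3.5 amounts to checking how far each connection got and counting retransmitted segments. A rough sketch using the dpkt library is shown below; it assumes a single connection of interest per trace and ignores details such as IP fragmentation and sequence-number wraparound.

```python
import socket
import dpkt

def analyze_trace(pcap_path, server_ip):
    """Classify a failed connection as no-connection / no-response / partial-response
    and crudely count retransmissions (repeated server sequence numbers)."""
    syn_acked = False
    response_bytes = 0
    seen_seqs = set()
    retransmissions = 0

    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            if not isinstance(eth.data, dpkt.ip.IP):
                continue
            ip = eth.data
            if not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            from_server = socket.inet_ntoa(ip.src) == server_ip

            if from_server and (tcp.flags & dpkt.tcp.TH_SYN) and (tcp.flags & dpkt.tcp.TH_ACK):
                syn_acked = True                 # SYN handshake completed
            if from_server and len(tcp.data) > 0:
                response_bytes += len(tcp.data)  # bytes of HTTP response observed
                if tcp.seq in seen_seqs:
                    retransmissions += 1         # likely retransmission
                seen_seqs.add(tcp.seq)

    if not syn_acked:
        cause = "no connection"
    elif response_bytes == 0:
        cause = "no response"
    else:
        cause = "partial response"  # meaningful only if the transfer is known to have failed
    return cause, retransmissions
```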
It is in- CN 1,236,544 10,473(0.8%) N/A N/A terestingtonotethatthemeanfailurerateislowest(0.69%) fortheDUclientsandsignificantlyhigher(2.76%)forthePL Table 3: Overall transaction and connection counts bro- clients, despite the latter being connected to much higher- speed academic and research networks. We confirmed with ken down by client type, with failure rates in parentheses. the MSN operators that they do not employ any caching Connection counts are unavailablefor CN because these are proxies,transparentorotherwise, thatmightshield theDU masked by theproxy. clients from wide-area network failures. The difference in failure rates may be because the DU clients connect via a commercialdialupservice(whichpresumablystrivestopro- peering sessions advertised at least 1 announcement for the vide a good quality of service) whereas the PL clients are relevant prefix,and how many participated in withdrawals. part of the experimental PlanetLab network (which suffers Acaveatwiththismeasurementdataisthatitcanexhibit from problems such as the permanent failures discussed in false routeupdatesduetothecollection infrastructure. For Section 4.4.2). example,ifoneoftheRouteviewsserversisrebootedorses- sion is reset, each prefix will have additional updates that donotreflectachangeduetoanactualBGProutingevent. 3 2.76% We follow prior work in “cleaning” our BGP data [31, 5], UHTnkTnPown basically by estimating and disregarding the volume of up- 2.5 Connection DNS datessuspectedtobeduetoresetsaffectingjusttheRoute- %) 2 views servers. For each 1 hour period, if more than 60,000 e ( 57% at uniqueprefixes(i.e.,atleasthalftheroutingtable)received e R1.5 1.29% athnenoauvnercaegmeennutsm,bweeroafssuunmiqeuaenreesigethboocrcsutrhreadt.eaWchepcarelcfiuxlartee- Failur 1 63% 0.69% 0.85% ceived an announcement from and subtract that from the 0.5 42% 64% 100% count of announcements and count of neighbors participat- 36% 34% 0 ing in announcements from all prefixes during that period. PL BB DU CN We perform thesame calculation for withdrawals. Client Type 4. EXPERIMENTAL RESULTS Figure 1: The transaction failure rate, broken down by failure type (in italics) and category of clients. The overall We start in Section 4.1 by presenting the statistics of failure rateforeach client category appearsunderlined. transaction-level failures, broken down by client category (PL,DU,CN,BB)andbyfailuretype(DNS,TCP,HTTP). In Section 4.2 and Section 4.3, we present a more detailed 4.1.2 BreakdownofTransactionFailures breakdownofDNSfailuresandTCPconnectionfailures,re- spectively, as observed at individual clients. The nature of Figure 1 also plots the breakdown of transaction failures DNS failures (e.g., whether the local name server is reach- by type, for each category of clients. (We are unable to able and responsive) is directly observable at clients. How- provideabreakdownforCNclients,sincetheseconnectvia ever,thenatureofTCPconnectionfailures isoftennot;for proxies that mask the true nature of failures.) The failure instance,whentheTCPSYNfromaclientgoesunanswered, types are the ones presented in Section 2.1: DNS, TCP, itisnotevidentwheretheproblemis. SoinSections4.4and and HTTP. We find that for all categories of clients, TCP 4.5, we turn to correlating TCP connection failure observa- connection-level failures dominate, accounting for 57-64% tions across clients and servers with a view to classifying of all transaction failures. DNS failures account for most failures as client-side or server-side. 
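The BGP-update cleaning heuristic of Section 3.6 can be expressed as a per-hour adjustment: if announcements are seen for more than about 60,000 distinct prefixes in an hour, treat the hour as containing a collector session reset and discount the estimated per-prefix volume it contributed. The sketch below illustrates this on pre-binned counts; the input layout is an assumption, not the authors' pipeline.

```python
RESET_PREFIX_THRESHOLD = 60_000   # roughly half the routing table in Jan 2005

def clean_hour(per_prefix):
    """per_prefix: {prefix: {"announcements": n_msgs, "ann_neighbors": n_peers}}
    for one 1-hour bin. Returns an adjusted copy if the hour looks like a
    collector session reset rather than a real routing event."""
    announced = [p for p, c in per_prefix.items() if c["announcements"] > 0]
    if len(announced) <= RESET_PREFIX_THRESHOLD:
        return per_prefix                      # nothing suspicious in this hour
    # Estimate the volume the reset contributed (the average number of neighbors
    # each announced prefix heard from) and subtract it from both counts.
    avg_nbrs = sum(per_prefix[p]["ann_neighbors"] for p in announced) / len(announced)
    return {p: {"announcements": max(0, c["announcements"] - avg_nbrs),
                "ann_neighbors": max(0, c["ann_neighbors"] - avg_nbrs)}
            for p, c in per_prefix.items()}
```

The same adjustment is applied to the withdrawal counts.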
We consider the im- of the rest (34-42%). The significant chunkof DNS failures pact of BGP instability in Section 4.6, and finally turn to underscores the importance of the end-host view. Obser- investigating proxy-specificfailures in Section 4.7. vations made from a different vantage point (e.g., a client Table3summarizestheoveralltransactionandconnection site’sDMZortheserver)would,ingeneral,notrevealthese failurestatistics. NotethatthenumberofTCPconnections failures. is typicallylarger thanthenumberof transactions, because HTTP-level failures account for under 2% of the trans- ofHTTPredirectsandretriesbyourwgetclient. Also,there action failures in all cases. We only accessed the top-level were times when individual client machines were down and “index”fileateachwebsite,whichispresumablymoreavail- so were not making web accesses. able than the average object on the website. ThelowincidenceofHTTP-levelfailuresinourstudycon- 4.1 Transaction FailureAnalysis trastswiththefindingin[16]thatHTTP-levelfailures con- Inthissection,wepresentoverallfailurestatisticsforweb stitutedabout25%ofallfailures. Thereareacoupleofrea- transactions overthemonth-longdata set. Atransaction is sonsforthisdiscrepancy. First,theoverallfailureratein[16] an invocation of wget to download a URL. waslowerbecauseallclientstherewereonawell-connected university network and because DNS failures were appar- 4.1.1 OverallTransactionFailureRate ently not considered. So HTTP-level failures constituted a We compute the overall failure rate for each client over largerfractionofthefailuresin[16]. Second,themajorityof all its transactions with all servers, and likewise for each theHTTP-levelfailuresin[16]werepartialresponses,which server. The median failure rate over the one-month period weclassifyas“partialresponse”TCPconnectionfailures,as acrossclientsis1.47%andacrossserversis1.63%. However, noted in Section 2.1. Category Failure LDNS Non-LDNS Error 100 count timeout timeout All LDNS Timeout DBPBULTable119408179:19962B8reak8d3.o3w%n77o76f..70D%%N9.7S%failure22s724...030%%% Percentage of Failures 468000 Non-LDNS TimEerorourt Cumulative 20 4.1.3 PacketLossandTransactionFailures 0 Several previous studies have considered packet loss rate 0 10 20 30 40 50 60 70 80 of TCP connections [24, 33]. However, we only find weak Server Count correlation between packet loss rate and the failure rate of Figure 2: The cumulative contribution of website domain end-to-end transactions in our data set (the coefficient of names to the DNS failurecountin variouscategories. correlation is 0.19). We believe this is because: (a) trans- actions can fail for reasons that have little to do with the end-to-end server-client path (e.g., DNS failures, as shown mains. Theerrorsaffect alargefraction oftheclients, indi- in Figure 1), (b) a transaction can succeed despite (possi- cating that these are not client-side problems. bly severe) packet loss, and (c) estimating packet loss rate In summary, DNS failures account for a significant frac- using TCP traffic is prone to bias, since failed connections tion (34-42%) of transaction failures. Local DNS timeouts that transfer no data (which are, in fact, quite significant, accountforadominantfractionoftheDNSfailures,pointing as discussed in Section 4.3) are hard to account for. to client-side problems (including LDNS failures and last- Thus we believe that it is important to study the fail- mile connectivity problems) as thedominant cause. uresofend-to-endtransactions ratherthanonlypacketloss rate. 
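The weak relationship reported above is an ordinary Pearson correlation between per-sample packet loss rate and transaction failure rate. For concreteness, a minimal computation of that statistic is shown below over hypothetical per-(client, hour) samples; the coefficient reported over the real data set is 0.19.

```python
import numpy as np

# Hypothetical per-(client, hour) samples: estimated packet loss rate of the
# client's TCP traffic, and the fraction of its transactions that failed.
loss_rate    = np.array([0.00, 0.01, 0.05, 0.02, 0.00, 0.10])
failure_rate = np.array([0.01, 0.00, 0.03, 0.08, 0.02, 0.04])

# Pearson correlation coefficient; a value near 0 indicates that packet loss
# is a poor predictor of end-to-end transaction failure.
r = np.corrcoef(loss_rate, failure_rate)[0, 1]
print(f"correlation coefficient: {r:.2f}")
```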
Inthefollowingsections,weanalyzethetwodominant 4.3 TCPConnection FailureAnalysis causesoftransactionfailures—connectionfailuresandDNS In this section, we present a categorization of TCP con- failures — separately. The reason for analyzing these sepa- nectionfailures,whichcompriseasignificantchunkoftrans- rately is that DNS resolution and TCP/HTTP connections action failures (Figure 1). TCP connection failures are cat- typically involve distinct Internet components and possibly egorized as “no connection” (TCP SYN exchange failed), distinct network paths. “noresponse”(serverdidnotreturnanybytesofresponse), or“partialresponse”(theserverreturnedapartialresponse, 4.2 DNS FailureAnalysis buttheconnectionwasterminatedprematurely). Thebreak- We present a breakdown of DNS failures based on the down is shown in Figure 3. We see that “no connection” iterativedigrequestthatfollows eachwgetaccess. Wecon- failures dominate in the case of PL (79%) and DU (63%), siderallcaseswhereDNSresolutionfailedforwget. Inover andaresignificantinthecaseofBB(41%). Theprevalence 94% of these cases, the interative dig also fails; the small of“noconnection”failuresreinforcesthepointmadeinSec- discrepancy is dueto transient failures. tion 4.1.3 oftheunsuitabilityof TCPpacketloss rateasan We see from Table 4 that LDNS timeouts are the domi- indicatoroftransactionfailures. Itishardtoincorporatein- nant cause of DNS failures for PL, either due to the LDNS formation fromfailed SYNexchangesintoanoverallpacket beingdownorbecauseofconnectivityproblems. Non-LDNS loss rate metric. timeouts and DNS errors are less common. Unfortunately, duetodatacollectionissueswiththeDUandtheBBclients (which account for a much smaller number of DNS failures 3 than PL), we are not in a position to fully break down the No / Partial Response twime ehoauvteffaoirluBreBs faolrsothsehsoewcslietnhtast. LHDowNeSvetrim,tehoeutpsaratriealddoamta- Rate (%) 2.25 PNNaoor CRtiaoel nsRpneoesncptsiooennse iccnlliiaTeennnhttt,es-s.aiddcoecmofuainnilatuinrnteg,cwfaothreegt7oh3re.y9r%aofloaLfsDtt-hmNeSilDetNicmoSnenofauecitltsuivrreietspyrfopesrreontbhtlseemsae Average Failure 01..155 79% 41% 63% attheclientoranunreachableorofflineLDNSserver. Being 0 so, wewouldexpectLDNStimeoutstoaffecttheresolution PL BB DU Client Type of all website names roughly equally. In Figure 2, we plot thecumulativecontributionofwebsitedomainnamestothe overallDNSfailurecountaswellastheindividualcategories Figure 3: Breakdown ofTCP connection failures. The CN offailures (thelatteronlyforPL).Thesteady slopesofthe clients are excluded since they connect via a proxy, which curves for all DNS failures and the dominant category of masks the nature of its wide-area TCP connection failures. LDNS timeouts indicate that indeed these failures do not The category marked “no/partial response” corresponds to discriminate across website names. cases where we lacked the tcpdump traces needed to disam- However, the distribution is more skewed across server biguatebetween the two cases. domain names in the case of the less common non-LDNS timeoutfailuresandDNSerrors(thetwocurvesatthebot- TCP connection failures arise either because the server tom right). For instance, 57% of the DNS errors occur is down (or overloaded) or there is a network connectivity for www.brazzil.com and 30% for www.espn.com. These problem between the client and server. 
To shed more light are SERVFAIL and NXDOMAINerrors, pointing to buggy on the nature of such failures, we now turn to correlating or incorrectly configured authoritative servers for these do- failure observations across clients and servers. 4.4 Correlation AnalysisofTCPFailures 4.4.3 IdentifyingFailureEpisodes As discussed in Section 2.2, we can obtain greater in- We need to define the episode duration, i.e. the period sight into the nature of failures by correlating failure ob- over which failure rates for the various entities (the clients servations across clients and servers. Specifically, we can and servers, in particular) are computed. There are two determine whether failures are associated with a client-side conflictingconsiderationshere. Wewouldliketheperiodto failureepisode,server-sidefailureepisode,etc. (Section2.2). beshort enoughtohelpidentifyfailures thatlast say justa Ourgoalinthissectionistoapplysuchcorrelation analysis fewminutes. Forexample,a10-minuteserveroutagemight to TCP connection failures. stand out on a 1-hour timescale butmight beburied in the noise on a 1-day timescale. On the other hand, the period should be long enough for us to have a sufficient number 4.4.1 ClassifyingFailures of samples to be able to compute a meaningful failure rate We use a simple blame attribution procedure to classify (given thenumberof clients and servers, and thefrequency failures. The idea is to associate a set of “entities” with of accesses in our experiment). To balance both these con- eachwebaccess: theclient,theserver,theclient-serverpair, siderations, wepick 1houras theepisodeduration. Weare the proxy (if any), etc. By aggregating over accesses made thusassuredafewhundredaccessesperclientandperserver across all clients and servers, we compute the failure rate in each episode while also being able to identify relatively associated with each entity (i.e., each client, server, etc.). short-lived failures. The choice of 1 hour as the episode We computethese failure rates separately for each episode. duration also places minimal requirements on the degree of Given a failed web access, we check to see if any of the synchronizationneededacrosstheobservationsmadeatdif- associated entities has an abnormally high failure rate (de- ferent clients. fined precisely in Section 4.4.3) associated with it for the On the flip side, the 1-hour episode duration means re- corresponding episode. If so, we ascribe the failure to the duced resolution compared to intensive probing (for exam- corresponding entity/entities. ple, [16] probed each path every 15 seconds or more fre- In the remainder of this section, we focus on client-side quently). So,forinstance,separatefailureevents(eachlast- andserver-sidefailureepisodesandpresentadetailedanaly- ing say 5 minutes) within the 1-hour period would not be sis for these. We briefly discuss proxy-related failures in distinguished. This is the price we pay for keeping the ac- Section 4.7. But first we consider client-server pairs that cess rate of clients and the burden imposed on servers low experienced near-permanent failure. (well below the level at which our measurements might be noticed and/or elicit complaints). 4.4.2 Client-ServerPairswith“Permanent”Failures Next,to decide whether an episode qualifies as a “failure episode” for an entity, we need to determine whether the Thedistributionoffailuresratesacrossclient-serverpairs failure rate for that entity is “abnormally high”. 
Rather is highly skewed: the median is 0.55% but certain pairs ex- thansetanarbitrarythresholdonthefailurerate,wemake perience a transaction failure rate close to 100% over the thisdeterminationbycomparingwiththesystem-widenor- entire month. 38 out of the 134*80 = 10720 client-server malbehavior. Abnormalperiodsforclientsareidentifiedby pairs (i.e., about 0.4% of the pairs) experienced a failure comparing with all clientsandabnormal periods forservers rateofover90%throughthemonth;in34ofthese38cases, are identified by comparing with all servers. The underly- thefailureratewasinfactover99.6%. Themajorityofthese ing assumption is that the system as a whole is mostly in 38 cases of near-permanent failures involved PL clients and thenormalstate(lowfailurerateornofailuresatall),with the websites www.msn.com.tw (10 cases), www.sina.com.cn abnormal behavior (high failure rate) being the exception. (9 cases), and www.sohu.com (8 cases). Sinceweconsiderfailure ratesover1-hourperiods,thenor- Thesenear-permanentfailuresappeartobeduetoarange mal state could well correspond to a non-zero failure rate. ofcauses. InthecaseofthePLclientsatnorthwestern.edu accessing the www.mp3.com server, the client starts down- loading data but soon encounters TCP checksum errors. 1 However, this problem does not affect other clients when 0.9 they access this server or the clients at northwestern.edu 0.8 wuiwnnhhAitiehtcnnheoiu.trahimrateecby,cbeeparosoctoshcsefetscwseltsieloebo.nbstotihosrtteeg(hrse).wbsgeewa.nr,wsvcet.eodhsruoisinns.netaCea.rhtcinohnmpeaa...rcc-Tnopmraea,rncemdeprawfonwluew.tn.cetshdo,foahneiuylsu.uncr.oeoestmd,u, Fraction of Episodes 00000.....34567 revealmuchsinceitisincompleteandreachesjust asfaras 0.2 a traceroutefrom aclient that is abletocommunicate with 0.1 Grouped by Server Grouped by Client this server. It is possible that certain websites are being 0 0 10 20 30 40 50 blocked at particular client sites or that accesses from the Percentage Failure Rate affected client sites are being blocked at certain websites. Wedeferamoredetailedinvestigationofthesenear-permanent Figure 4: CDF of the overall failure rate over 1-hour failures(whichwouldlikelyinvolvetalkingtotheconcerned episodes acrossclients andservers. IT staff) to future work. To avoid skewing the client-side and server-side failure analysis presented next, we exclude Toidentifysystem-widenormalbehavior,weconsiderthe these 38 client-server pairs that could (almost) never com- distributionoffailurerates,separatelyforclientsandservers, municate. These account for 50.7% of all TCP connection overthe31*24=744episodesinourmonth-longstudy(Fig- failuresbutonly13%oftransactionfailures,thehighercon- ure 4). For each 1-hour episode and each client, we com- tribution to connection failures being due to wget retries. pute the failure rate for all of a client’s connections across With these failures removed, the connection failure rate of all servers, yielding the “client” CDF shown in the figure. PL clients falls to 1.2%. Likewise for servers. We then identify the distinct knee in Classification Server-side Client-side Both Other numberofclientstoexperiencefailurestothatserver,whereas f=5% 48.0% 9.9% 4.4% 37.7% aclientmachinebeingturnedoff(whetherbecauseofprob- f=10% 41.5% 6.7% 0.7% 51.1% lems in our client set or in general because of user actions) Table 5: Classification offailures fortwo settings off. would not contribute to access failures because the client wouldnotbemakinganyaccessesduringthecorresponding period. 
Thisisjustaswellsinceuserswouldonlycareabout failuresthathappenwhentheytrytoaccessserversandnot each CDF that separates the low failure rates (the “nor- ones that might happen while theirmachine is turned off. mal” range) to the left from the wide range of significantly 4.4.5 DistributionofServer-sideFailures higherfailurerates(the“abnormal”range)totheright. The episode failure rate, f, at the knee is then used to identify Weconsiderthetemporalandspatialdistributionofserver- the failure episodes, i.e., episodes with an abnormally high sidefailureepisodes. Wepresentdataforthef =5%thresh- failure rate. old; the results for f =10% are qualitatively similar. In the analysis that follows, we experiment with two set- Thetotalnumberof1-hourepisodesthatwereclassifiedas tings of thethreshold f — 5% and 10% — the latter being server-side failure episodes was 2732. Ifwe coalesce consec- moreconservative. Althoughthesethresholdsmightappear utive failure episodes for a server, the number of coalesced to be too low, note that a failure rate of 5% or 10% over a server-sidefailureepisodesis473. Thisyieldsanaverageco- 1-hour episode is actually quite significant for a client or a alesced failure episode duration of 5.78 hours. (Recall from server,somethingthatusersarelikelytonoticeorcomplain Section 4.4.3 thatafailure episodemeansanabnormal fail- about. Moreover, while the failure rate might have been urerate(atleastf =5%inthepresentcase,notnecessarily veryhighoverashortperiod,theratewouldbelowerwhen total failure.) However, the distribution of this duration is averaged over a full hour. highly skewed. The median is 1-hour (the minimum tem- poral unit used in our analysis) but certain servers suffered 4.4.4 Client-sidevs. Server-sideFailures from very long failure episodes (e.g., 448 hours at a stretch Weusetheprocedureoutlined inSection 4.4.1 toclassify in the case of www.sina.com.cn and 230 hours at a stretch failures as client-side or server-side. We use the threshold, in thecase of www.iitb.ac.in). f, to determine whether a failure between a client, C, and Thedistributionofserver-sidefailureepisodesacrossweb- aserver,S,coincideswithafailureepisodeforC and/orS. sitesisalsoskewed. Table6(column2)showsthelargenum- If it is a failure episode only for C, we classify thefailure berof1-hourfailure episodes sufferedbyasmall numberof as “client-side”. Likewise, if it is a failure episode only for servers. Forexample,www.sina.com.cnandwww.iitb.ac.in S, we classify the failure as “server-side”. If it is a failure sufferedserver-sidefailureepisodesalmostthroughthemonth- episode for both C and S, we classify the failure as be- long experiment. Despite the skewed distribution, a large ing “both” client-side and server-side. If it is not a failure numberofwebsitessufferedserver-sidefailureepisodes. Dur- episode for eitherC orS, weclassify thefailure as “other”, ing the course of our month-long experiment, 56 out of the which corresponds to intermittent failures or client-server- 80 websites were affected by at least one server-side failure specific failures (other than the permanent ones from Sec- episode and 39 were affected by multiple failure episodes. tion 4.4.2, which we haveexcluded here). So a large fraction of the servers do experience periods of Table 5 shows the breakdown of failures. Using the two significantfailure,eventhoughtheoverallfailurerateislow. 
settings of the threshold f — 5% and 10% — we were able While many of the failure-prone servers were non-U.S.- to classify 62.3% and 48.9%, respectively, of the failures. It based (a point we discuss further in Section 4.4.6), some is as expected that more failures would fall in the “other” U.S.-based servers also experienced a significant number of category when the more conservative threshold of f =10% server-side failure episodes. is used to flag failure episodes. We use the f =5% setting 4.4.6 IndirectValidation in later sections, unless stated otherwise. We see that a significant fraction of failures was classi- It is difficult to directly validate our inferences of server- fied as “other”, because the failures were intermittent and side and client-side failures, since we have little visibility notsignificantenoughtoregisterassignificantforeitherthe into the network. Instead, we provide indirect evidence to client ortheserverona1-hourtimescale. Oftheremaining support our inferences. Wedo thisin two ways. (i.e., classfiable) failures, the “server-side” category dom- First, we consider how widespread the impact of server- inates the “client-side” one. In other words, at the level side failures episodes is, i.e., what fraction of clients is af- of TCP connections, failures are much more likely due to fected in such episodes. Wewould expect a server-side fail- server-side problems (including network problems close to ure to impact a significant fraction of the clients, and like- theserver)thanclient-sideproblems. Thefractionoffailures wise expect a client-side failure to affect transactions to a that areclassified as“both”issmall, reflectingtheunlikeli- significant fraction of the servers. The results we present nessoftherebeingbothserver-sideandclient-sideproblems below (see #1) confirm this. during thesame 1-hour period. Second, we consider co-located clients (e.g., those on the Thedominanceof server-sidefailures at thelevelof TCP sameuniversitycampus)anddeterminethedegreetowhich connections might seem surprising, given the presumption their client-side failure episodes are correlated. We would thatlargewebsitesarebetterengineeredandmanagedthan expectasignificantdegreeofcorrelation,sincemanyfailures clientnetworks. However,thereareacoupleofreasonswhy (thoughnotall)mightaffectconnectivityatthelevelofthe this is so. First, connectivity problems or disconnection at subnet or even the entire campus. The results we present the client end are likely to cause DNS resolution failure below (see #2) confirm this. and hence preclude even the initiation of a TCP connec- #1: Spread of Server-side Failures tion. So these would contribute to the DNS failure count Weconsiderhowwidespreadtheimpactofserver-sidefail- (Section 4.2), not the TCP connection failure count. ure episodes is. Ideally, we would like to answer this ques- Second,aservermachinegoingofflinewouldcausealarge tionbylookingathowwidespreadtheimpactiswithineach Server #server-sidefailureepisodes Spread Co-located Random Non-U.S.-based pairs pairs sina.com.cn 764 78.4% #clientpairs 35 35 iitb.ac.in 759 85.1% #Pairswithsimilarity>75% 2 0 sohu.com 243 72.4% #Pairswithsimilarityin50-75% 6 0 brazzil.com 97 85.1% #Pairswithsimilarityin25-50% 10 1 cs.technion.ac.il 95 94.0% #Pairswithsimilarity<25%&>0% 10 7 technion.ac.il 90 92.5% #Pairswithsimilarity=0% 7 27 chinabroadcast.cn 89 73.9% ucl.ac.uk 55 95.5% Table 7: The measure of the similarity in the client-side U.S.-based failureepisodesexperienced bypairsofco-locatedclientsvs. 
craigslist.org 166 70.9% nih.gov 35 60.4% randompairs ofclients. mit.edu 23 91.8% Table 6: The list of most failure-prone servers and the Clientpair #client-side Similarity “spread” quantifyinghowwidespread the impact ofthe cor- failureepisodes intheunion respondingserver-side failures is. planet{1,2}.pittsburgh.intel-research.net 387 98.2% csplanetlab{1,3}.kaist.ac.kr 5 60.0% failure episode. However, this is difficult to do because of csplanetlab{3,4}.kaist.ac.kr 7 57.1% sampling limitations. csplanetlab{4,1}.kaist.ac.kr 6 50.0% planetlab{1,2}.comet.columbia.edu 196 3.6% There are two sampling problems. A server-side problem planetlab{2,3}.comet.columbia.edu 278 52.2% couldcause100%failureforallclientaccessesduringashort planetlab{3,1}.comet.columbia.edu 155 5.2% interval, say 10 minutes long. However, there would be no record of failure for clients that happened not to access the Table 8: Examples of co-located clients and the similarity server in question during this short period. On the other in the client-side failureepisodes thatthey experience. hand, a server-side problem could last the entire hour but affectsayonly20%ofthetransactionsatrandom. Whilethe the (small numberof) clients located relatively close to the underlyingproblemmightbeonethatdoesnotdiscriminate non-U.S.-based server were also affected at the same time between clients accessing the server, there is a chance that thattheU.S.-basedclientswere(e.g.,clientsinKoreaexpe- someclientsgetluckyinthesensethatnoneoftheiraccesses rienced problems accessing sina.com.cn and clients in the totheserverfailduringthehour. So,ingeneral,wearenot U.K. experienced problems accessing ucl.ac.uk). inapositiontodefinitivelyestablishwhichclientscouldhave #2: Correlation Between Co-Located Clients been affected by theserver-side problem. Weconsidertheextenttowhichclient-sidefailureepisodes In view of this difficulty,we only look at how widespread arecorrelated across co-located clients. Foreach pairof co- the impact of server-side failure episodes is over the entire located clients, we first determined the subset of episodes month-long period.3 That is, for each server, S, we con- that were (separately) marked as a client-side episode for siderallthefailuresascribedtoit(i.e.,toserver-sidefailure each client in the pair. We compute the similarity measure episodes at S). We are interested in how large the set of for the pair of clients as the ratio of the size of the inter- clientsaffectedbytheseserver-sidefailureepisodesatSover section set (i.e., the client-side failure episodes in common) themonth-longexperimentis. Wequantifythis“spread”by to the size of the union (i.e., episodes that are marked as a computing the fraction of all clients needed to account for client-side failure episode for either or both clients). the failures ascribed to S. We identify 35 pairs of co-located clients in our data set, Table6liststhespreadforthemostfailure-proneservers. most of them being PL clients co-located on the same aca- Wefindthatthespreadisgenerallyover70%andoftenover demicnetwork. Butwealsohadtwopairsofco-located BB 80%. (Note that this is much higher than the threshold of nodes, one pair each on the Roadrunner cable network in f =5%usedtoidentifyindividualfailureepisodes,withthe San Diego and the Verizon DSL network in Seattle. caveat in footnote 3.) 
This indicates that the failures that Table 7 shows the similarity measures across the client we flag as server-side typically do impact a significant frac- 35 pairs of co-located clients, and also 35 randomly-paired tionoftheclients,aswewouldhaveexpected. Thisholdsfor clients, for comparison. We see that a little over half of theU.S.-basedserversaswellasthenon-U.S.-basedservers. the co-located client pairs had a similarity of at least 25%, ThisservestoindirectlyvalidatetheinferencesmadeinSec- including about a quarter that had a similarity of at least tion 4.4.4. 50%. In contrast, only one of the randomly paired clients WemakeoneotherobservationregardingTable6. Many, had a similarity greater than 25%. Also, relatively fewer thoughnotall, ofthemostfailure-prone serversarelocated co-located pairs (20%) exhibited zero similarity, about 80% outside the U.S. Given that our client set is dominated by oftherandompairsexhibitedzerosimilarity. Thisindicates U.S.-based clients, it is hard to distinguish a network con- that co-located clients exhibit significantly greater sharing nectivityproblembetweentheU.S.andtherestoftheworld of client-side failure episodes than randomly-paired clients. fromanactualserver-sidefailureatanon-U.S.-basedserver The overwhelming majority of co-located client pairs with that affects a large fraction of the clients. In general, we low or zero similarity experienced a very small number of do not have enough (or any) clients located close to many client-side failure episodes through the month-long experi- of the non-U.S.-based servers to be able to tell if such “lo- ment(just1-2episodes,insomecases). Anymismatch(i.e., cal” clients were also affected by the apparent server-side lackofsharing)intheserarefailureepisodeswouldresultin failure. However, in some cases we were able to verify that a low similarity measure. In general, a low degree of similarity in the client-side 3 Ofcourse,thisisnotaperfectmeasureeithersincemultipledistinct failureepisodesforco-locatedclientscouldalsoarisebecause server-side problems during different episodes through the month thefailurewastrulyclient-specific. Oneoftheexampleswe couldhaveaffecteddifferentsubsetsofclients. Theoverall“spread” consider next illustrates this point. acrossclientsmightbelarge,butthespreadduringindividualfailure episodescouldstillbesmall. Table 8 lists a few examples of co-located clients that we studied. Eachsetofclientsexhibitsverydifferentbehavior. 30 seconds to 15 minutes for a route withdrawal and sub- The two nodes at Intel see a very large numberof client- sequent route announcement to converge [19]. Thus for a side failure episodes (387) between them. Moreover, there 1-hour period of TCP connection attempts to/from an af- isaveryhigh degreeof similarity (98.2%) across thefailure fected prefix, we expect failure rates in the range of 1% to episodesexperiencedbythesetwoco-locatedclients. Onthe 25% for temporary outages. otherhand,the3nodesatKAISTexperienceonlyahandful of client-side failure episodes, about 50-60% of which are shaTrheedcbaysethoefmth.e Columbia clients is remarkably different. Cnt 600 Two of the nodes — #2 and #3 — experience 247 and TCP 200 192 failureepisodes, respectively,withasimilarity of52.2% 0 betweenthem. However,thebehaviorofthethirdColumbia 1104500000 1105000000 1105500000 1106000000 1106500000 1107000000 node(#1)isverydifferent. 
Itonlyexperiences12client-side Time failureepisode,resultinginalowsimilarity(3.6%and5.2%) 30 wofiItcnho-srlueosmcpametceatdrtyco,liwethnetefisonstdhhaetrrheadttw2ao5l%Citotolluermmmbooirareentoohfdatenhse.hiarlfcltihenetp-saiidres TCP Fails 1020 failure episodes. Most of the rest were pairs that saw very 0 1104500000 1105000000 1105500000 1106000000 1106500000 1107000000 few client-side failure episodes, making any similarity com- putation noisy. Time 4.5We rRepeepalticthaetecdorrWelaetbiosnitaensalysis at the granularity of Max Streak Len 51015 serverreplicastosub-classifytheserver-sidefailureepisodes 0 as either total or partial replica failures. As noted in Sec- 1104500000 1105000000 1105500000 1106000000 1106500000 1107000000 tion 2.2, total failures affect all replicas of a website, while Time partial failures affect only a subset of thereplicas. eartitWnegmepaidtlleedndtibisfytyinatcnhtyeIcsPleietnaodtfdwrrheespilsleeicsdaostwofnowlrohaaidcshinergcvoecnronnSetcetnbityonfcrsoonmwseiSdre-. W Updts 50150 0 To make our analysis meaningful, only IP addresses that 1104500000 1105000000 1105500000 1106000000 1106500000 1107000000 account for at least 10% ofthetotal numberof connections to S are considered to be replicas. As a result, out of the Time 80 websites used in our experiments, 6 had zero replicas, 60 42 had exactly one replica, and 32 had multiple replicas. Ngbrs 40 The 6 websites with zero replicas are basically those served W 20 by CDNs like Akamai, where the numberof distinct IP ad- 0 dressesisverylarge,sothatnoneoftheIPaddressesqualify 1104500000 1105000000 1105500000 1106000000 1106500000 1107000000 to be counted as a replica perour definition above. Time We found that 62% of the failure episodes marked as Figure 5: TCP failures and BGP activity for server-side belonged to the 32 servers that had multiple nodea.howard.edu, in 1-hour bins. The blank period around replicas. Of these episodes, an overwhelming majority of 1106500000 corresponds to the client itself being down. 85% were total replica failures, which means that all repli- casofthewebsitewereexperiencingmorethanthethreshold Toillustratethedata,wepresentaparticularclientnodea. failure rate during that episode. This is somewhat surpris- howard.edu in Figure 5. The top three graphs show con- ing, but moredetailed analysis shows that almost all of the nection data, while thebottom two show BGP data for the total replica failures are due to websites whose replicas ap- client’sprefix. ThetopgraphshowsthenumberofTCPcon- peartobeonthesamesubnet(same/24prefix),andhence nectionattemptsmadeineachhouracrosstheentirecollec- are proneto correlated failures. tion period. We typically attempt about 800 TCP connec- tionsperhour,butdelaysduetolowthroughputorwaiting 4.6 BGPAnalysis fortimeoutscanreducethenumberofattemptsinaperiod. Wenowconsidertherelationshipbetweenend-to-endcon- Thesecondgraph showshowmanyoftheseattemptsfailed nection failures and inter-domain routing instability. We to receive a TCP SYNACK (i.e., “no connection” failure). look for BGP instability in the prefix(es) corresponding to OnehypothesisweentertainisthatduringaBGPfailure each client and server, and consider how these relate to event, remote access attempts will fail consecutively until client-side or server-side failure episodes. Clearly, BGP in- theclient prefixis reachable from all ASes. 
Thusifwe con- stability that originates close to a client or server has the siderthelongestconsecutivestreakofaccessfailuresineach potential to negatively impact the client or server’s wide- 1-hour episode, we should expect it to correspond to BGP area communication more than one that originates further events. However, the caveat with this hypothesis is that it away. Wedonotconsiderclient-server-specificfailures,since assumes BGP convergence takes just as long for all ASes eachclientaccessedeachserveronlyasmallnumberoftimes trying to respond to the client. In reality, some ASes will in each 1-hourepisode. convergeontheclientprefixfasterthanothersdependingon Ifweassumethatthemostcommoncauseofsuchfailures the AS topology [20], and thus some intermediate accesses is temporary, such as a router reboot or session reset, we may succeed while others fail. The third graph in Figure 5 would expect the outage duration to be at least as long as showsthelengthofthelongestconsecutivestreakoffailures BGP convergence times. This will likely be in the range of in each period.