
Beyond Pilots: Keeping Rural Wireless Networks Alive

14 Pages·2008·0.43 MB·English


Sonesh Surana∗, Rabin Patra∗, Sergiu Nedevschi∗, Manuel Ramos†, Lakshminarayanan Subramanian‡, Yahel Ben-David§, Eric Brewer∗¶

∗ University of California, Berkeley
† University of the Philippines
‡ New York University
§ AirJaldi, Dharamsala, India
¶ Intel Research, Berkeley

Abstract

Very few computer systems that have been deployed in rural developing regions manage to stay operationally sustainable over the long term; most systems do not go beyond the pilot phase. The reasons for this failure vary: components fail often due to poor power quality, fault diagnosis is hard to achieve in the absence of local expertise and reliable connectivity for remote experts, and fault prediction is non-existent. Any solution addressing these issues must be extremely low-cost for rural viability.

We take a broad systemic view of the problem, document the operational challenges in detail, and present low-cost and sustainable solutions for several aspects of the system including monitoring, power, backchannels, recovery mechanisms, and software. Our work in the last three years has led to the deployment and scaling of two rural wireless networks: (1) the Aravind telemedicine network in southern India supports video-conferencing for 3000 rural patients per month, and is targeting 500,000 patient examinations per year, and (2) the AirJaldi network in northern India provides Internet access and VoIP services to 10,000 rural users.

1 Introduction

The penetration of computer systems in the rural developing world has been abysmally low. Several efforts around the world that have tried to deploy low-cost computers, kiosks and other types of systems have struggled to remain viable, and almost none are able to remain operational over the long haul. The reasons for these failures vary, but at the core is an under-appreciation of the many obstacles that limit the transition from a successful pilot to a truly sustainable system. In addition to financial obstacles, these include problems with power and equipment, environmental issues (e.g. heat, dust, lightning), and an ongoing need for trained local staff, as trained staff move on to better jobs.

Researchers (ourselves included) tend to focus on the sexy parts of a deployment, such as higher performance or a highly visible pilot. However, real impact requires a sustained presence, and thus operational challenges must be viewed as a first-class research topic. Analogous to research on high availability, we must understand the actual causes of operational problems and take a broad systemic view to address these problems well.

In this paper, we describe our experiences over the last three years in deploying and maintaining two rural wireless systems based on point-to-point WiFi links. Our prior work on WiFi-based Long Distance Networks (WiLDNet) [26] developed a low-cost high-bandwidth long-distance solution, and it has since been deployed successfully in several developing regions. We present real-world validation of the links, but the primary contribution here is the exploration of the operational challenges of two rural networks: a telemedicine network at the Aravind Eye Hospital [3] in southern India and the AirJaldi [1] community network in northern India.

We have had to overcome major challenges in both networks: (1) components fail easily due to low quality power, (2) fault diagnosis is hard because of non-expert local staff and limited connectivity for remote experts, and (3) remoteness of node locations makes frequent maintenance difficult; thus fault anticipation becomes critical. All of these problems can be fixed by having higher operating budgets that can afford highly trained staff, stable power sources, and robust high-end equipment. But the real challenge is to find solutions that are sustainable and low-cost at all levels of the system. To this end, our main contributions are (1) documenting and categorizing the underlying causes of failure for the benefit of researchers undertaking rural deployments in the future, and (2) developing low-cost solutions for these failures.

In overcoming these challenges we have learned three important lessons that we argue apply to IT development projects more broadly. First, designers must build systems that reduce the need for highly trained staff. Second, simple redesign of standard components can go a long way in enabling maintenance at lower costs. And third, the real cost of power is not the grid cost, but is the cost of overcoming poor power quality problems. By applying these lessons to several aspects of our system including monitoring, power, backchannels, recovery mechanisms, and deployed software, we have made real progress in keeping these rural networks alive.

The Aravind network now uses WiLDNet to interconnect rural vision centers with their main hospitals for patient-doctor video-conferencing. Currently 9 vision centers cater to 3000 patients per month. Thus far, 30,000 rural patients have been examined and 3000 have had significant vision improvement. As all vision centers are now running with no operational assistance from our team, the hospital considers this network sustainable and is targeting a total of 50 centers in the next 2 years. Similarly, AirJaldi is also financially sustainable and currently provides Internet access and VoIP services to over 10,000 users in rural mountainous terrain.

In the next section we validate the sufficiency of real-world WiLD performance, and outline the challenges to operational sustainability. Section 3 provides some background for the Aravind and AirJaldi networks. In Section 4, we document many of our experiences with system failures, and then in Section 5 present the design of all levels of our system that address these issues. Related work is discussed in Section 6, and in Section 7 we summarize three important lessons for rural deployments.

2 Motivation

In this section, we confirm high-throughput performance of WiLDNet links in real-world deployments, and then outline the operational challenges that remain obstacles to sustained impact.

2.1 Real-World Link Performance

Existing work [16, 26, 29, 33, 34] on rural networking has focused on making WiFi-based long-distance point-to-point links feasible. The primary goal has been high performance, typically expressed as high throughput and low packet loss. In prior work, we have studied channel-induced and protocol-induced losses in long-distance settings [33], and have addressed these problems by creating WiLDNet: a TDMA-based MAC with adaptive loss-recovery mechanisms [26]. We have shown a 2–5 fold increase in TCP/UDP throughput (along with significantly reduced loss rates) in comparison to the best throughput achievable by the standard 802.11 MAC. We had shown these improvements on real medium-distance links and emulated long-distance links.

In this paper we confirm the emulated results with data from several real long-distance links in developing regions. Working with Ermanno Pietrosemoli of Fundación Escuela Latinoamericano de Redes (EsLaRed), we were able to achieve a total of 6 Mbps bidirectional TCP throughput (3 Mbps each way simultaneously) over a single-hop 382 km WiLDNet link between Pico Aguila and Platillon in Venezuela. To the best of our knowledge, this is currently the longest distance at which a stable high-throughput WiFi link has been achieved without active amplification or custom antenna design. Each site used a 2.4 GHz 30-dBi reflector grid antenna with 5.3° beam-width and a 400 mW Ubiquiti SR2 radio card with the Atheros AR5213 chipset.

Figure 1 presents results from running WiLDNet on real links from our various deployments in Aravind (India), Venezuela, Ghana, and our local testbed in the Bay Area. We match the performance of WiLDNet over emulated links and greatly exceed the performance of the standard WiFi MAC protocol at long distances.

[Figure 1 plot: throughput (Mbps, 0 to 10) versus distance (km, 10 to 400); series for the WiLDNet emulator and for WiFi and WiLDNet on the Aravind, testbed, and Venezuela links.]

Figure 1: Comparison of TCP throughput for WiLDNet (squares) and standard WiFi MAC (triangles) from links in Aravind, Venezuela, Ghana (the 65 km link), and our local testbed in the Bay Area. Most urban links in Aravind had up to 5–10% loss, and so WiLDNet did not show substantial improvement over standard WiFi. However, WiLDNet's advantage increases with distance. Each measurement is for a TCP flow of 60 s, 802.11b PHY, 11 Mbps.

Thus we find that we are no longer limited by performance over long distances in rural networks. Instead, based on our experiences in deploying and maintaining networks in the two rural regions of India for the last three years, we argue that operational challenges are now the primary obstacle to successful deployments.

2.2 Challenges in Rural Areas

Addressing these challenges requires looking at all levels of the system, starting from the power supply and base hardware, up through the software and user interface, all the way to training and remote management. Although remote management, reliable power and training of staff are hard in general, these problems are exacerbated in rural areas for several specific reasons [35].

First, local staff tend to start with limited knowledge about wireless networking and IT systems. This limits their diagnostic capabilities and results in inadvertent misuse and misconfiguration of equipment. Thus management tools need to help with diagnosis and must be educational in nature. The effectiveness of training is limited by the high turnover of IT staff, so education must be an ongoing process.

Second, the chances of hardware failures are higher because of poor power quality and harsh environments (e.g.
exposure to lightning, heat, humidity, or dust). Although we do not have conclusive data about the failure rate of equipment for power reasons in rural areas, we have lost far more routers and adapters for power reasons in rural India than we have lost in our Bay Area testbed. This calls for a solution that provides stable and high quality power to equipment in the field.

Third, many locations with wireless nodes, especially relays, are quite remote, and therefore it is important to avoid unnecessary visits to remote locations. We need to enable preventive maintenance during scheduled visits. For example, evidence of a gradual degradation in signal strength at a remote router could indicate that a cable needs to be replaced or antennas need to be realigned in the course of a normal visit.

Fourth, the wireless deployment may often not be accessible remotely or through the Internet. The failure of a single link might make parts of the network unreachable, even if the nodes themselves are functional. This makes it very hard for remote experts or even local administrators to resolve or even diagnose the problem.

3 Background

Over the last three years we have deployed two rural wireless networks in India. One is at the Aravind Eye Hospital in south India where we link doctors at the centrally located Theni hospital to village clinics, known as vision centers, via point-to-point WiLD links. Patients video-conference over the links with the doctors for consultations. The other is in Dharamsala in north India and is called the AirJaldi network. This network is primarily a mesh with a few long distance directional links that provides VoIP and Internet access to local organizations. Both networks have faced largely similar operational challenges, but with some important differences.

3.1 The Aravind Network

The Aravind network at Theni consists of five vision centers connected to the main hospital in Theni (Figure 2). The network has a total of 11 wireless routers (6 endpoints, 5 relay nodes) and uses 9 point-to-point links. The links range from just 1 km (Theni - Vijerani) to 15 km (Vijerani - Andipatti). Six of the wireless nodes are installed on towers, heights of which range from 24–42 m; the others use short poles on rooftops or existing tall structures, such as the chimney of a power plant on the premises of a textile factory. Recently, Aravind has expanded this model to their hospitals in Madurai and Tirunelveli where they have added two vision centers. The network is currently financially viable and a further expansion to 50 clinics around 5 hospitals is being planned to provide 500,000 annual eye examinations.

Figure 2: Aravind Telemedicine Network. Theni hospital is connected to 5 vision centers. The other nodes are all relays.

Hardware: The wireless nodes are 266 MHz x86 single board computers. These routers have up to 3 Atheros 802.11 a/b/g radio cards (200–400 mW). The longer links use 24 dBi directional antennas. The routers consume about 4.5 W when idle and only 9.5 W when transmitting at full bandwidth from 2 radios; 7 W is the average power consumption for a node. They run a stripped-down version of Linux 2.4.26 stored on a 512 MB CF card, and include our software for WiLDNet, monitoring, logging, and remote management.

The routers are placed in small and lightweight waterproof enclosures, and are mounted externally, close to the antennas, to minimize signal losses. They are powered via power-over-ethernet (PoE); a single ethernet cable from the ground to the router is sufficient. We use uninterruptible power supplies (UPS) to provide clean power, although we discuss solar power in Section 5.2.

Applications: The primary application is video-conferencing. We currently use software from Marratech [22]. Although most sessions are between doctors and patients, we also use the video conferencing for remote training of staff at vision centers. Typical throughput on the links ranges between 5–7 Mbps with channel loss less than 2%. But 256 Kbps in each direction is sufficient for very good quality video conferencing. Our network is thus overprovisioned, and we also use the network to transmit 4–5 MB-sized retinal images. The hospital has a VSAT link to the Internet, but most applications require only intranet access within the network (except for remote management).

3.2 The AirJaldi Network

The AirJaldi network provides Internet access and VoIP telephony services to about 10,000 users within a radius of 70 km in rural mountainous terrain characterized by extreme weather. The network has 8 long distance directional links ranging from 10 km to 41 km with 10 endpoints (Figure 3). In addition, the network also has over a hundred low-cost modified consumer access points that use a wide variety of outdoor antennas. Three of the nodes are solar-powered relay stations at remote elevated places with climbable towers. All other antennas are installed on low-cost masts less than 5 m in height; the masts are typically water pipes on the rooftops of subscribers.

Figure 3: AirJaldi Network. There are 8 long distance links with directional antennas with 10 endpoints.

Hardware: Most of the routers are modified consumer devices, either Linksys WRT54GL or units from Buffalo Technologies, and cost less than US$50. They are housed inside locally designed and built weatherproof enclosures, and are mounted externally to minimize signal losses. The antennas, power supplies and batteries are all manufactured locally in India. The router boards are built around a 200 MHz MIPS processor with 16 MB of RAM, 4 MB of on-board flash memory, and a low power Broadcom 802.11b/g radio. We run OpenWRT on these routers, and use open source software for mesh routing, encryption, authentication, QoS, remote management and logging. For long distance links and remote relay stations we use slightly higher-end devices such as the PCEngines WRAP boards, MikroTik routerboards, and Ubiquiti LS2s, all with Atheros-based radios.

Applications: The Internet uplink of AirJaldi consists of 5 ADSL lines ranging from 144 Kbps to 2 Mbps for a total of about 7 Mbps downlink and 1 Mbps uplink bandwidth. The longest link from TCV to Ashapuri (41 km) achieves a throughput of about 4–5 Mbps at 2–5% packet loss, while the link from TCV to Gopalpur (21 km) only gets about 500–700 Kbps at 10–15% loss due to the absence of clear line of sight. This bandwidth is sufficient for applications such as Internet access and VoIP that cater primarily to the needs of the Tibetan community-in-exile surrounding Dharamsala, namely schools, hospitals, monasteries and other non-profit organizations. AirJaldi only provides connectivity to fixed installations and does not offer wireless access to roaming users or mobile devices. A cost-sharing model is used among all network subscribers to recover the operational costs. The network is currently financially sustainable and is growing rapidly.

4 Operational Experiences

We have experienced several operational challenges in both networks that have led to significant downtimes, increased maintenance costs, and lower performance (e.g., increased packet loss). Initially we were involved in all aspects of network planning, configuration, deployment, and maintenance of the networks. Our specific end goal has been to ultimately transfer responsibility to our rural partners, primarily to ensure local buy-in and long-term operational sustainability. This process has not been easy. Our initial approach was to monitor these networks over the Internet and to provide some support for local management, sometimes administering the network directly (bypassing the local staff whenever required). But enabling remote management has been more challenging than expected because of severe connectivity problems (Section 5.3).

This aspect, combined with the desire to enable local operational sustainability, has led us to design the system with more emphasis on support for local management, a particularly challenging problem given limited local experience. One way in which we have ensured that education remains an ongoing process is by creating a three-tier management hierarchy, in which local IT vendors (called integrators) with some expertise in networking were hired to form a mid-level of support between local staff and ourselves. With this tiered approach, the rural staff has gradually learned to handle many issues; the IT vendors still handle some, most notably installation, while our role has reduced from operational responsibility to just shipping equipment. In the last year we have not installed any links ourselves even though both networks have grown. We review this transition in our conclusion.

Although we were prepared to expect problems such as poor connectivity, power outages, and misunderstandings around proper equipment usage, the actual extent of these problems has been very surprising, requiring a significant custom design of the system at all levels to address these issues effectively. As a result, the reduced downtimes and lower maintenance costs have resulted in both networks being sustainable enough to pay for their own equipment and towers. Before moving on to the design of our system, we first document three major factors for operational outages; each factor is a result of a combination of the challenges presented in Section 2.2.

4.1 Components Are More Inclined to Fail

Operating conditions at Aravind and AirJaldi have greatly contributed to a substantial decrease in the robustness of system components that would otherwise work quite reliably. One major culprit has been the lack of stable and quality power. Although issues such as frequent power outages in rural areas are well known, we were surprised by the degree of power quality problems in rural villages even when power is available. Before addressing the power issues (Section 5.2), not a single day went by without failures related to low power quality in either network. Any effort that is focused on rural deployments must necessarily fix the power issues. Therefore we describe the quality of rural power in detail, particularly because it has not been previously documented.

Low Power Quality: Figure 4 shows data on spikes from a power logger placed in two different rural villages in southern India for 6 weeks. We group the spikes based on their magnitude in volts; negative voltage means the polarity was reversed. We see many spikes above 500 V, often with reversed polarity, and some even reaching 1000 V! Clearly such spikes can damage equipment (burned power supplies), and they have affected us greatly. We have also seen extended sags below 70 V and swells above 350 V (normal voltage in India is 220–240 V). Although the off-the-shelf power supplies we use function well at a wide range of input voltages (80 V–240 V), they are not immune to such widely ranging fluctuations. Also, locations far away from transformers are subject to more frequent and extreme power fluctuations. Our first approach was to use UPS and battery backups. However, affordable UPS systems are only of the "standby" type where they let grid power flow through untouched; this passes the spikes and surges through to the equipment except during grid outages when the battery starts discharging and is expected to provide stable power.

[Figure 4 plot: spike counts (x axis, 0 to 400) for the villages Veerampattinam and Villianur, binned by spike size in volts (y axis): 500 to 1000, 400 to 500, 300 to 400, -300 to -400, -400 to -500, -500 to -1000.]

Figure 4: Histogram of power spikes from two rural villages. The bins (y axis) are the size range of the spike in volts, while the x axis is the count. Negative bins imply reversed polarity.

Failures from Bad Power Quality: We have experienced a wide range of failures from bad power. First, spikes and surges have damaged our power supplies and router boards. In the AirJaldi network, we have lost at least 50 power supplies, about 30 ethernet ports and 5 boards to power surges, while in the Aravind network, we have lost 4 boards, at least 5 power supplies and some ethernet ports as well.

Second, voltage sags have caused brown outs. Low voltages leave routers in a wedged state, unable to boot completely. The on-board hardware watchdog, whose job is to reboot the router, is also often rendered useless because of the low voltages, thus leaving the router in a hung state indefinitely. Third, fluctuating voltages cause frequent reboots, which corrupt and occasionally damage the CF cards through writes during the reboots.

As a typical example, the router at SBS in Aravind rebooted at least 1700 times in a period of 12 months (Figure 5), roughly 5 times per day, going up to 10 times for some days. In contrast, another router at Aravind deployed on top of the chimney of a power plant, from where it derives reasonably stable power, has shown uptimes of several months at a stretch. In practice, we have observed that routers with more frequent reboots are more likely to get their flash memory corrupted over time. We had at least 3 such cases at nodes co-located with the vision centers (Figure 5), which experienced more reboots since staff at these locations shut down and boot up the routers every day. Finally, frequently fluctuating voltage also prevents optimal charging of the battery backup and halves its overall lifetime.

[Figure 5 plot: number of reboots (x axis, 0 to 2000) per node: Periyakulam, Bodi, Chinamanoor, Andipatti, Ambasam, SBS, Laxmipuram, Vijerani, Chimney, Theni2, Theni1; nodes are marked as having an independent or a dependent power supply.]

Figure 5: Number of reboots estimated per node in the Aravind network for about one year of operation. Nodes with power supplies dependent on the vision center are turned on or off every day. Nodes with independent power supplies are typically relay nodes or hospital nodes.

Lack of quality power increases not only downtime but also maintenance costs. Traveling to remote relay locations just to reboot the node or replace the flash memory is expensive and sometimes has taken us several days, especially in Dharamsala where the terrain is rough.

Other Power-related Problems: In Dharamsala, one of the stormiest locations in India, lightning strikes have often damaged our radios. We have learned the hard way that whenever we deployed a mix of omni and directional antennas, the radios connected to the omni antennas were much more likely to get damaged during lightning storms compared to the radios connected to directional antennas. It turned out that omni-directional antennas attract lightning more as they are usually mounted on top of masts and have a sharper tip, while directional antennas are typically mounted below the maximum height of the mast. To mitigate this problem, we install omni antennas about 50 cm below the top of the mast. However, this creates dead zones behind the mast where the signal from the antenna is blocked. To reduce these dead zones, we sometimes use an arm to extend the omni antenna away from the mast. After lowering the omni antennas, we have not lost any radios during storms.

4.2 Fault Diagnosis is Difficult

Accurate diagnosis of the problem can greatly reduce response time and thus downtime. The most common description of a fault by our rural partners is that the "link is down." There are a wide variety of reasons for network outages and it is not always easy to diagnose the root cause. The lack of appropriate tools for inexperienced staff, combined with unreliable connectivity which hinders detailed monitoring, prevents accurate diagnosis.

For example, a remote host might be running properly, yet be unreachable when an intermediate wireless link goes down. The non-functional link makes it impossible to query the remote host for diagnosis. In fact, there have been many instances where rural staff have traveled to the remote site with great difficulty only to realize that it was a regular power shutdown from the grid (in which case nothing could be done anyway), or that it was a software problem which could have been fixed if there were an alternate backchannel to the router. Accurate diagnosis of such problems can save considerable time and effort, and prevent unnecessary travel. Furthermore, our own ability to help the local staff by logging in remotely to diagnose the problem is limited by connectivity. For instance, we use the VSAT link at Theni (in the Aravind network) to aid the local staff in monitoring and managing the network, but the VSAT backchannel has worked for only 65% of the time in the last one year.

Sometimes local misunderstandings of equipment usage make it even harder to diagnose problems. For example, as shown in Figure 6, an elevator shaft was constructed right in front of the directional antenna at Aravind Theni hospital, completely obstructing the line of sight to the remote end. Whenever we remotely logged in to the Theni end of the link from Berkeley, everything seemed fine except that we could not communicate with the remote end. We had no other network access to the remote host so local staff kept physically checking the remote end, but did not (ourselves included) think of checking the roof at Theni. The resulting downtime lasted for two months until we flew there and saw the problem!

Figure 6: The Theni to Vijerani link in the Aravind network was completely obstructed by a newly constructed elevator shaft. This problem was not resolved until we visited Theni after 2 months.

Packet Loss due to Interference: In the AirJaldi network, a decrease in VoIP performance was reported for a particular link at very regular intervals. However, without any additional information to diagnose the problem, no action could be taken and this behavior persisted for three months. Finally, after some detailed monitoring by us (and not the rural staff), we saw a regular pattern of packet loss between 8am and 9am every day except Sundays. But scanning the channels showed no external WiFi interference. We were finally able to attribute the problem to a poorly installed water pump that was acting like a powerful spark generator, interfering with wireless signals in the vicinity. Without packet loss information, both the rural staff and we would have had a lot of trouble solving this problem.

Signal Strength Decrease: In the Theni-Ambasam link in the Aravind network (Figure 2), we noticed a drop in signal strength of about 10 dB that persisted for about a month. Without further information it was hard to tell whether the antennas were misaligned, or the pigtail connectors were damaged, or the radio cards were no longer working well. In the end, several different attempts were made by local staff over multiple trips; the radio cards, the connectors and even the antennas were replaced, and the signal strength bumped back up without it being fully clear what finally helped!

Network Partition: We experienced network partitions many times, but for several different reasons. For example, at Aravind, staff misconfigured the routing and added static routes while dynamic routing was already enabled. This created a routing loop partitioning the network. In another instance of operator error, the default gateway of one of the routers was wrongly configured. There were also a few instances when operators changed the IP addresses of the endpoints of a link incorrectly, such that the link was non-functional even though it showed up as being associated. And as mentioned earlier, the construction of the elevator shaft left the network partitioned for two months.

"Fixing" by users: A recurring problem is that well-meaning rural staff often attempt to fix problems locally when the actual root cause is not local. For example, at AirJaldi we have seen that when an upstream ISP goes down, rural staff tend to change local settings in the hope of fixing the problem. These attempts typically create new problems, such as misconfiguration, and in a few cases have even resulted in damage to equipment. In all these cases, the network remained non-functional (but now for a different reason) even after the ISP resumed normal connectivity. Thus we need mechanisms to indicate when a link is having problems at the remote end, so as to prevent local attempts at repair.

The general theme is that no matter what the fault, if the link appears to be down with no additional information or connectivity into the wireless node, it is hard for even experienced administrators to resolve the problem.

4.3 Anticipating Faults is Hard

Some of the node locations in our networks, especially relays, are quite remote. Site maintenance visits are expensive, time consuming, and require careful planning around the availability of staff, tools, and other spare equipment. Therefore, visits are generally scheduled well in advance, typically once every six months. In this scenario, it is especially important to be able to anticipate failures so that they can be addressed during the scheduled visits, or if a catastrophic failure is expected, then a convincing case can be made for an unscheduled visit for timely action. But without an appropriate monitoring and reporting system that includes backchannels, it is difficult to prepare for impending faults.

Battery Uptime: At both Aravind and AirJaldi we use battery backups. Loss of grid power at the nodes causes their batteries to start discharging. It is generally not known when the batteries will finally run out. If this information is somehow provided to the staff, they can prevent downtime of the link by taking corrective measures such as replacement of the battery in time. Such feedback would also suggest if the problem were regional (as other routers would also suffer loss of grid power) or site-specific such as a circuit breaker trip.

Predicting Battery Lifetime: Battery life is limited by the number of deep cycle operations that are permitted. This lifetime degrades sharply because of fluctuating voltages seen in our deployments that do not charge the battery optimally. At Aravind, batteries rated with a lifetime of two years last for roughly three to six months. Information about remaining battery life can also enable prevention of catastrophic failures.

Predicting Disk Failure: We have observed that with frequent reboots over time, the disk partition used to store system logs accumulates bad ext2 blocks. Unless we run fsck periodically to recover the bad blocks, the partition becomes completely unusable very soon. We have also seen that many flash disks show hardware errors, and it is important to keep track of disk errors and replace them before they cause routers to completely fail.

5 System Architecture Design

In this section, we present five aspects of our system: monitoring, power, backchannels, independent recovery mechanisms, and software. Each has been designed to specifically address our goals of increasing component robustness, enabling fault diagnosis, and supporting fault prediction. For each aspect, wherever appropriate, we also discuss tradeoffs affecting our design choices. Table 1 indicates which aspects of our system design are important for reducing the impact of some of the common faults presented in the previous section.

Problem description                            System Aspects
Component Failures
  Unreliable power supply                      P
  Bad power causing burnt boards and PoEs      P
  CF card corruption: disk full errors         M, P, S
  Omni antennas damaged by lightning           P
Fault Diagnosis
  Packet loss from interference                M
  Decrease in signal strength                  M
  Network partitions                           M, B
  Self fixing by users                         S
  Routing misconfiguration by users            M, B, S
  Failed remote upgrade                        B, R
  Remote reboot after router crash             B, R, S
  Spyware, viruses eating bandwidth            M, S
Anticipating Faults is hard
  Finding battery uptime/status                M, B, P
  Predict CF disk replacement                  M

Table 1: List of some types of faults that we see in both Aravind and AirJaldi. For each fault, we indicate which aspects of the system, as we have designed it, help mitigate the fault. The different aspects are Monitoring (M), Power (P), Backchannel (B), Independent Recovery Mechanisms (R) and Software (S). The information on faults has been collated from logs and incident reports maintained by the local administrators and remote experts respectively.

5.1 Monitoring

All aspects of system management require some level of monitoring. During the initial deployment at Aravind, we faced two main challenges in designing a monitoring system. First, the Aravind network at Theni only allowed us to initiate connections from within the network. Second, local staff was not familiar with Linux or with configuration of standard monitoring software such as Nagios [10].

This led us to build a push-based monitoring mechanism that we call "PhoneHome" in which each wireless router pushes status updates upstream to our US-based server. We chose this method over the general pull-based architecture in which a daemon running on a local server polls all the routers. The pull-based approach would require constant maintenance via re-configuration of a local server every time a new router is added to the network. In contrast, the push-based approach enabled us to configure the routers only once, at installation, by specifying the HTTP proxy to be used.

The Aravind network features two remote connectivity options, both of which are slow and unreliable (Section 5.3): (1) a direct CDMA network connection on a laptop at the central hospital node, and (2) a VSAT connection to another hospital, which has a DSL connection to the Internet. PhoneHome is installed on each of the wireless routers. All the routers periodically post various parameters to our US server website. Server-side daemons analyze this data and plot visual trends.

We collect node and link-level information and end-to-end measurements. The comprehensive list of the measured parameters is presented in Table 2. Most of these parameters can be measured passively, without interfering with normal network operation. However, several of these measurements, such as maximum link or path throughput, require active testing. Some of these tests can be performed periodically (e.g. pinging every network host), and some of them are done on demand (e.g. finding the throughput achievable on a particular link at a given time).

Scope    Type     Measured Parameter
Node     Passive  CPU, disk and memory utilization, interrupts, voltage,
                  temperature, reboot logs (number & cause), kernel
                  messages, solar controller periodic data
Node     Active   disk sanity check
Link     Passive  traffic: traffic volume (# bytes, packets)
                  wireless: signal strength, noise level, # control
                  packets, # retransmissions, # dropped packets
                  interference: # of stations overheard & packet count
                  from each, # corrupted packets
Link     Active   liveness, packet loss, maximum link bandwidth
System   Passive  route changes, pairwise traffic volume & type
System   Active   pairwise end-to-end delay & max throughput

Table 2: Parameters collected by PhoneHome.

We also use the PhoneHome mechanism for remote management. Every time PhoneHome connects to our US server, it opens a reverse SSH tunnel back into the wireless node, enabling interactive SSH access to the Aravind machines. As the VSAT connection only allows access over an HTTP proxy, we are required to run SSH on top of HTTP, and configure PhoneHome with the proxy. In case of a direct connection to the Internet, no such configuration is required. Another option (employed in the remote management of AirJaldi) is to use the OpenVPN software to open VPN tunnels between network routers and remote servers.

PhoneHome proved to be helpful in understanding failures and in diagnosing and predicting many faults. First, it helped maintain network reachability information, alerting the local staff when the network was down and action needed to be taken to recover. Earlier, only a phone call from a rural clinic could alert the local administrator, and depending on the awareness of the staff at the rural clinic, this call would not always happen.

Second, kernel logs transferred using PhoneHome helped us diagnose several interesting problems. For example, in certain instances routers configured with two network interfaces reported only one interface as being active. Pairing this information with power data, we realized that a low voltage supply can prevent two radio interfaces from functioning simultaneously. In another instance, kernel logs and system messages allowed us to examine flash disk error messages and predict when disk partitions needed repartitioning or replacement.

Third, by examining the posted routing table and interface parameters, we were able to diagnose routing misconfigurations or badly assigned IP addresses.

Fourth, continuous monitoring of wireless link parameters helped us narrow the scope of the problems in many cases. Figure 7 shows the signal strength variation in some of our network links. While the majority of these links show fairly stable signal strength, some of them show important variation over time. For example, a sudden 10 dB signal drop on the link between Ambasam and Theni indicated some kind of a drastic event such as a possible antenna misalignment that needed an immediate visit. On the other hand, a steady decline in signal strength on the Bodi link indicated a gradual degradation of a connector [...]

[Figure 7 plot: SNR (dB, 0 to 40) versus time (days, 0 to 540) for the Chimney, Laxmipuram, Ambasam, and Bodi links.]

Figure 7: Signal strength (shown in dB) variation for all links.

[...] incidents we have experienced have been related to low power quality. Thus, designing to increase component reliability in the face of bad power is the most important task. We have developed two separate approaches to address the effects of low power quality. The first is a Low Voltage Disconnect (LVD) solution, which prevents both routers from getting wedged at low voltages and also over-discharge of batteries. The second is a low-cost power controller that supplies stable power to the equipment by combining input from solar panels, batteries, and even the grid.
Each point isaverage of measurement over 2 days. The Am- LowVoltageDisconnect(LVD):Over-dischargeofbat- basamlinkshowsatemporarydropinSNRof10dBforabout teries can reduce their lifetime significantly. Owing to 40days.WhiletheBodilinkisgraduallydegradingasitsSNR the poor quality of grid power, all AirJaldi routers are hasdroppedby4dBoverthelastyear,theChimneylink’sSNR onbatterybackup.LVDcircuits,builtintobatterycharg- hasremainedconstant. ers, prevent over-discharge of batteries by disconnect- ortheRFcabletotheantenna,andrequiredaneventual ingthe load(router)whenthe batteryvoltagedropsbe- visit. lowathreshold.Asabeneficialside-effect,theyprevent therouterfrombeingpoweredbyalow-voltagesource, Tradeoffs:We contrastthiswith monitoringat AirJaldi which may cause it to hang. Off-the-shelf LVDs oscil- where we use various off-the-shelf tools such as Na- lated frequently, bringing the load up and down, and gios[10]andSmokePing[13]tocollectnode,link,and eventually damaging the board and flash memory. Ev- network level parameters. Informationis stored at a lo- ery week, there were roughly fifty reboot incidents per cal data server in Dharamsala and then copied to a US routerduetohangscausedbylowvoltage.However,we server for detailed analysis. Various graphing toolkits designedanewLVDcircuit[24]withnooscillationand suchasMRTG[25]areusedtovisualizetrendsandde- better delay; since then the hangs per week per router tectanomalies. havereducedtonearzerointheDharamsalanetwork. The differencein approachcomparedtoAravindisin part due to the higher experience of the AirJaldi staff, Power Controller: We have developed a andinpartduetothebetterconnectivitywehavetoAir- microcontroller-basedsolarpowerchargecontroller[31] Jaldi. 
The advantageof having local serverspolling for that provides a stable input of 18 V to the routers and informationisthattheycanbeconfiguredbylocalstaff intelligently manages the charging and discharging to look for relevant problems, but such an approach is of the battery pack. It has several features such as maximumpowerpointtracking,lowvoltagedisconnect, beneficial only if local staff are experienced enough to takeadvantageofthesefeatures. tricklechargingandveryimportantly,supportforremote After three years of operation, the local Aravind staff managementvia ethernet. The setup is trivial as it sup- (someofwhomwelostduetoturnoveraftertheygained pliespowertotherouterusingPoE.Thiscombinationis more experience through our training) are more famil- novelforitspriceofaround$70. iar with system configuration,and show less apprehen- We use TVS diodes to absorb spikes and surges and sion in taking the initiative and maintaining the system a robustvoltage regulatorto get clean 18V power from ontheirown.Therefore,we arenowbeginningto usea wide ranginginputconditions.Figure 8 shows the flow ofcurrentthroughtheboardovera60-hourperiod.First, pull-basedmodel. Ingeneral,webelievethatduringtheinitialphaseofa we note that power is always available to the router. networkdeployment,minimalconfigurationpush-based Whenenoughsunlightisavailable,thesolarpanelpow- mechanisms are more appropriate for data collection. erstherouterandchargesthebattery.Duringperiodsof However,afterbuildingenoughlocalexpertise,themon- no sun, the battery takes over powering the router. The itoringsystemshouldbemigratedtowardsamoreflexi- frequentswingsobservedontheleftpartofthegraphare blepull-basedapproach. typical for a cloudy day. The graphs also demonstrate how the batteryis continuallychargedwhen sunlightis 5.2 Power available.Wehavemeasureda15%moreefficientpower Powerqualityandavailabilityhasbeenourbiggestcon- drawfromthe panels,and also expectthatwe can dou- cern at both Aravind and AirJaldi. 
Low-quality power blebatterylife.Usingthecontroller,wehavenotlostany damages the networking equipment (boards and power routersfrombadpower,butithasbeenonly8monthsof adapters)andsometimesalsobatteries.Over90%ofthe testing. 4000 NetworkBackchannel: Atthe AravindThenihospital, wealreadyhadsomeformofbackchannelintotheTheni networkthroughVSAT.WeusePhoneHometoopenan SSHtunnelovertheVSATlinkthroughanHTTPproxy s p at the Aravind Madurai hospital. We configure Phone- m a 0 Home to post monitoring data to our US-based server Milli every 3 hours and also to open a reverse SSH tunnel throughwhichwecanlogbackinforadministrationpur- Load Current poses. Out of the 2300 posts expected from the router Battery Current at Theniover143 days (2 posts every 3 days), we only Solar Current -4000 received 1510 of them, or about 65%. So this particu- 0 6 12 18 24 30 36 42 48 54 larbackchannelwasnotveryreliableinpractice,some- Time (hours) timesnotworkingforlongstretchesoftime.Asaresult, Figure8:Currentflowover60hours.Theloadstaysevenat weusedthesolitaryhospitallaptoptoconnectdirectlyto 7W, whilethe solar panel and battery shift their relativegen- the Internetusing a 1xRTT CDMA cardto improvethe eration over time. The battery current is negative when it is availabilityofabackchannelintothenetwork.However, charging. this laptop was used for several other purposes (shared hardware is a common feature in rural areas) and was Thecontrollerreportssolarpanel,loadandbatterysta- mostly unavailable.Furthermore,in many instances the tusinformationthatcanbeusedforremotediagnosisand networkbackchannelwas notenoughasthe localwire- somepredictionofbatteryuptimeandlifetime.Asecond lessnetworkwoulditselfbepartitioned. version of the controller, currently under development, Node Backchannel: At AirJaldi, we built a node willaddthefeaturetotakegrid-suppliedpowerasinput. 
backchannelmechanismusingGPRS.InIndiaatthemo- This has two major advantages: the same setup can be ment, GPRS connectivity costs roughly $10 per month usedto stabilize gridpowerlocally,andgridpowercan forunlimiteddurationandbandwidth.WeusedaNetgear alsobeusedtochargethebatteriesinadditiontothesolar WGT634U router, interfaced through its USB 2.0 port power. with a mobile phone. The router runs PPP over GPRS Tradeoffs:Therealcostofpowerinruralareasisnotjust and sets up an OpenVPN tunnel to a remote server. To the raw grid electricity costs, but the cost of overcom- enable remote diagnosis using this link, the backchan- ing power availability and quality issues through UPS, nelrouterisconnectedtothemainwirelessrouterusing battery-backups, and and chargers. The recurring costs ethernet and optional serial consoles. The backchannel can be quite high, and therefore solar power, although router can also power-cycle the wireless router using a stillexpensive,becomesmorecompetitvethanexpected solid-staterelayconnectedtooneofitsGPIOpins. as it can produce clean power directly. Currently we Thisapproachhastwoadvantages.First,thecellphone choosetousesolarforveryremotelocations.Atlessre- networkis completelyindependentof the wireless link. mote and critical sites, we tend to use “dumb” analog Second, even thoughthe mobile phone is chargedfrom chargerstoreducecostsevenfurther. the same powersource, it hasits own battery which al- 5.3 Backchannels lowsaccessviaGPRS evenif themain powersourceis down. However, for the Netgear router, we needed ad- AwidevarietyofproblemsatAravindandAirJaldihave ditional battery backup which adds to the maintenance caused link downtimes, leaving remote nodes discon- complexity.Oneapproachtosimplifythissetupforcon- nected.Thefailureofasinglelinkmakespartofthenet- soleaccesswouldbetouseaLinuxGPRSphonebutwe work unreachablealthoughthe nodesthemselvesmight havenottriedityet. be functional. 
In many cases, if we had alternate ac- Tradeoffs:Our experiencewith the GPRS backchannel cesstothenodes,thefixeswouldhavebeensimplesuch intermsofprovidingrealutilityforsystemmanagement ascorrectingaroutermisconfiguration,orrebootingthe hasbeenmixed.Manycommonproblemscanbesolved router remotely. It is important to have out-of-bandac- by alternative means in simpler ways. In cases of in- cessorabackchanneltothenodesthatisseparatefrom correct configuration of routers, we can imagine using the primary wireless path to it. Backchannel access is theGPRS backchannelto fixproblems.ButatAravind, alsousefulincaseswherethebatteryisdischargingbut when misconfigurationsresulted in routing outages, we therouterisalreadydownforotherreasons.Information used cascaded hop-by-hop logins to move through the aboutthebatterystatusfromthechargecontrollerviathe network,althoughthisdependedonatleasttheendpoint backchannelwouldstillbehelpful.Wehavetriedseveral IP addresses to be set correctly. However, we can also approachestobackchannelsinbothnetworks.
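The delivery figures quoted for the Theni VSAT backchannel in Section 5.3 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes two posts are attempted in every 3-hour reporting interval, which is the reading that matches the roughly 2300 expected posts over 143 days.

```python
# Sanity check of the backchannel delivery figures quoted in Section 5.3.
# Assumption (ours, for illustration): two posts are attempted in every
# 3-hour reporting interval.

HOURS_PER_DAY = 24


def expected_posts(days: int, interval_hours: int, posts_per_interval: int) -> int:
    """Number of posts a router should attempt over the observation window."""
    intervals = (days * HOURS_PER_DAY) // interval_hours
    return intervals * posts_per_interval


def delivery_ratio(received: int, expected: int) -> float:
    """Fraction of expected posts that actually reached the server."""
    return received / expected


computed_expected = expected_posts(days=143, interval_hours=3, posts_per_interval=2)
reported_expected = 2300  # rounded figure quoted in the text
received = 1510           # posts that reached the server

ratio_vs_reported = delivery_ratio(received, reported_expected)
ratio_vs_computed = delivery_ratio(received, computed_expected)

print(f"expected posts (computed): {computed_expected}")
print(f"delivery vs. reported 2300: {ratio_vs_reported:.1%}")
print(f"delivery vs. computed:      {ratio_vs_computed:.1%}")
```

Under this assumption the computed expectation is close to the quoted 2300, and either denominator gives a delivery ratio of roughly two thirds, in line with the "about 65%" stated in the text.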
