Network Algorithmics
An Interdisciplinary Approach to Designing Fast Networked Devices
Second Edition

George Varghese
UCLA Department of Computer Science
Los Angeles, CA, United States

Jun Xu
School of Computer Science
Georgia Institute of Technology
Atlanta, GA, United States

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2022 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-809927-8

For information on all Morgan Kaufmann and Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Editorial Project Manager: Lindsay C. Lawrence
Production Project Manager: Manchu Mohan
Cover Designer: Matthew Limbert
Typeset by VTeX

For Aju and
Tim and Andrew,
who made all this possible...

Contents

Preface to the second edition
Preface
15 principles used to overcome network bottlenecks

PART 1 THE RULES OF THE GAME

CHAPTER 1 Introducing network algorithmics
  1.1 The problem: network bottlenecks
    1.1.1 Endnode bottlenecks
    1.1.2 Router bottlenecks
  1.2 The techniques: network algorithmics
    1.2.1 Warm-up example: scenting an evil packet
    1.2.2 Strawman solution
    1.2.3 Thinking algorithmically
    1.2.4 Refining the algorithm: exploiting hardware
    1.2.5 Cleaning up
    1.2.6 Characteristics of network algorithmics
  1.3 Exercise

CHAPTER 2 Network implementation models
  2.1 Protocols
    2.1.1 Transport and routing protocols
    2.1.2 Abstract protocol model
    2.1.3 Performance environment and measures
  2.2 Hardware
    2.2.1 Combinatorial logic
    2.2.2 Timing and power
    2.2.3 Raising the abstraction level of hardware design
    2.2.4 Memories
    2.2.5 Memory subsystem design techniques
    2.2.6 Component-level design
    2.2.7 Final hardware lessons
  2.3 Network device architectures
    2.3.1 Endnode architecture
    2.3.2 Router architecture
  2.4 Operating systems
    2.4.1 Uninterrupted computation via processes
    2.4.2 Infinite memory via virtual memory
    2.4.3 Simple I/O via system calls
  2.5 Summary
  2.6 Exercises

CHAPTER 3 Fifteen implementation principles
  3.1 Motivating the use of principles: updating ternary content-addressable memories
  3.2 Algorithms versus algorithmics
  3.3 Fifteen implementation principles: categorization and description
    3.3.1 Systems principles
    3.3.2 Principles for modularity with efficiency
    3.3.3 Principles for speeding up routines
  3.4 Design versus implementation principles
  3.5 Caveats
    3.5.1 Eight cautionary questions
  3.6 Summary
  3.7 Exercises

CHAPTER 4 Principles in action
  4.1 Buffer validation of application device channels
  4.2 Scheduler for asynchronous transfer mode flow control
  4.3 Route computation using Dijkstra’s algorithm
  4.4 Ethernet monitor using bridge hardware
  4.5 Demultiplexing in the x-kernel
  4.6 Tries with node compression
  4.7 Packet filtering in routers
  4.8 Avoiding fragmentation of LSPs
  4.9 Policing traffic patterns
  4.10 Identifying a resource hog
  4.11 Getting rid of the TCP open connection list
  4.12 Acknowledgment withholding
  4.13 Incrementally reading a large database
  4.14 Binary search of long identifiers
  4.15 Video conferencing via asynchronous transfer mode

PART 2 PLAYING WITH ENDNODES

CHAPTER 5 Copying data
  5.1 Why data copies
  5.2 Reducing copying via local restructuring
    5.2.1 Exploiting adaptor memory
    5.2.2 Using copy-on-write
    5.2.3 Fbufs: optimizing page remapping
    5.2.4 Transparently emulating copy semantics
    5.2.5 Are zero copies used today?
  5.3 Avoiding copying using remote DMA
    5.3.1 Avoiding copying in a cluster
    5.3.2 Modern-day incarnations of RDMA
  5.4 Broadening to file systems
    5.4.1 Shared memory
    5.4.2 IO-lite: a unified view of buffering
    5.4.3 Avoiding file system copies via I/O splicing
  5.5 Broadening beyond copies
  5.6 Broadening beyond data manipulations
    5.6.1 Using caches effectively
    5.6.2 Direct memory access versus programmed I/O
  5.7 Conclusions
  5.8 Exercises

CHAPTER 6 Transferring control
  6.1 Why control overhead?
  6.2 Avoiding scheduling overhead in networking code
    6.2.1 Making user-level protocol implementations real
  6.3 Avoiding context-switching overhead in applications
    6.3.1 Process per client
    6.3.2 Thread per client
    6.3.3 Event-driven scheduler
    6.3.4 Event-driven server with helper processes
    6.3.5 Task-based structuring
  6.4 Scalable I/O Notification
    6.4.1 A server mystery
    6.4.2 Problems with implementations of select()
    6.4.3 Analysis of select()
    6.4.4 Speeding up select() without changing the API
    6.4.5 Speeding up select() by changing the API
  6.5 Avoiding system calls or Kernel Bypass
    6.5.1 The virtual interface architecture proposal
    6.5.2 Data Plane Development Kit (DPDK)
    6.5.3 Single Root I/O Virtualization (SR-IOV)
  6.6 Radical Restructuring of Operating Systems
  6.7 Reducing interrupts
    6.7.1 Avoiding receiver livelock
  6.8 Conclusions
  6.9 Exercises

CHAPTER 7 Maintaining timers
  7.1 Why timers?
  7.2 Model and performance measures
  7.3 Simplest timer schemes
  7.4 Timing wheels
  7.5 Hashed wheels
  7.6 Hierarchical wheels
  7.7 BSD implementation