Table Of Content

Novel Abstractions for Data Center Network Management

By Aaron Gember-Jacobson

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON

2016

Date of final oral examination: 04/29/2016

The dissertation is approved by the following members of the Final Oral Committee:
Aditya Akella, Associate Professor, Computer Sciences
Paul Barford, Professor, Computer Sciences
Parameswaran Ramanathan, Professor, Electrical and Computer Engineering
Jennifer Rexford, Professor, Computer Science, Princeton University
Michael Swift, Associate Professor, Computer Sciences

© Copyright by Aaron Gember-Jacobson 2016
All Rights Reserved

To Emily and Henry.

acknowledgments

First and foremost I would like to thank my advisor, Aditya Akella, for his dedication and support. Many of my achievements as a graduate student would not have been possible without his mentoring.

I would also like to thank my peers and collaborators who have contributed their time and insights to the work contained in this thesis: Robert Grandl, Shan-Hsiang Shen, Junaid Khalid, Xiujun Li, Ratul Mahajan, Chaithan Prakash, Raajay Viswanathan, and Wenfei Wu. Similar thanks go to those I have worked with on other projects throughout my PhD, including: Archie Abhashkumar, Ashok Anand, Theo Benson, Ramon Caceres, Shoban Chandrabose, Mark Coatsworth, Sourav Das, Chris Dragga, Alexis Fisher, Xiaoyang Gao, Keqiang He, Shachar Itzhaky, Aleksandr Karbyshev, Anand Krishnamurthy, Roney Michael, Jeff Pang, Tom Ristenpart, Mooly Sagiv, Vyas Sekar, Saul St. John, Asaf Valadarsky, Alex Varshavsky, and Liang Wang.

I appreciate the valuable feedback provided by my prelim and thesis committee members: Paul Barford, Somesh Jha, Parmesh Ramanathan, Jen Rexford, and Mike Swift.

Finally, I am extremely grateful for my wife Emily, who has provided continual encouragement and support, and my son Henry, who brings joy to my life.
Emily and Henry, along with my parents (Lynn and Jim), grandparents (Eloise, Bob, Carol, and Ken), and other family, have all been an important part of my journey.

contents

Contents
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Identifying Problematic Network Management Practices
  1.2 Abstractions for Guaranteeing Network Functionality and Performance
  1.3 Contributions

2 Identifying Problematic Practices Using MPA
  2.1 Inferring Management Practices
  2.2 Characterization of Management Practices
  2.3 Identifying Statistical Dependencies
  2.4 Identifying Causal Relationships
  2.5 Predicting Network Health
  2.6 Limitations
  2.7 Related Work
  2.8 Summary

3 Control Plane Checking Using ARC
  3.1 Important Attributes of ARC
  3.2 Challenges in Generating a Network’s ARC
  3.3 Extended Topology Graphs
  3.4 Computing ETG Edge Weights
  3.5 Using ARC to Check Invariants
  3.6 Implementation & Evaluation
  3.7 Related Work
  3.8 Summary

4 Maintaining Middlebox Functionality and Performance Using OpenNF
  4.1 Goals and Requirements
  4.2 OpenNF Architecture
  4.3 Middlebox API
  4.4 Controller API: Move Operation
  4.5 Controller API: Copy and Share Operations
  4.6 Control Applications
  4.7 Implementation
  4.8 Evaluation
  4.9 Related Work
  4.10 Summary

5 Conclusion and Future Work
  5.1 Contributions and Impact
  5.2 Future Work
  5.3 Closing Remarks

A Proving ARC is Comprehensive and Precise
  A.1 Comprehensiveness
  A.2 Precision

B Proving OpenNF’s Move Operation is Loss-free and Order-preserving

Bibliography

list of tables

2.1 Management practice metrics
2.2 Size of datasets
2.3 Top 10 management practices related to network health according to average monthly MI
2.4 Top 10 pairs of statistically dependent management practices according to CMI
2.5 Matching based on propensity scores
2.6 Statistical significance of outcomes
2.7 Causal analysis results for the first and second bin for the top 10 statistically dependent management practices
2.8 Causal analysis results for upper bins for the top 10 statistically dependent management practices
2.9 Accuracy of future health predictions
3.1 Invariants of common interest
3.2 Control plane constructs modeled in ARC
3.3 Control plane constructs in the OSP’s networks
4.1 Possible causes of middlebox failures
4.2 Effects of different guarantees
4.3 Additional code to implement OpenNF’s middlebox API

list of figures

1.1 Results of network operator survey
1.2 Steps in management practice analytics
1.3 State-of-the-art methods for verifying data center networks
1.4 Example network with a single OSPF instance and its ARC
1.5 A scenario requiring scale-out and special handling of middlebox state to avoid performance and functional failures
2.1 Impact of grouping threshold on the number of change events
2.2 Characterization of design practices
2.3 Characterization of configuration changes
2.4 Characterization of configuration change events
2.5 Ticket counts based on management practices
2.6 Relationship between number of models and number of device types
2.7 Visual equivalence of confounding practice distributions
2.8 Accuracy of 5-class models
2.9 Health class distribution
2.10 Decision trees (only a portion is shown)
3.1 Example network with a single OSPF instance and its ARC
3.2 Example feature-rich network and its ETG
3.3 Control plane where the costs assigned to redistribute routes are incongruent with processes’ ADs
3.4 Part of the interface-based ARC for the example control plane in Figure 3.1a
3.5 Scale of the OSP’s networks
3.6 Time required to generate ARC for the OSP’s networks
3.7 Size of the ETGs for the OSP’s networks
3.8 Time required to check key invariants
4.1 A scenario requiring scale-out and special handling of middlebox state to avoid performance and functional failures
4.2 OpenNF architecture
4.3 Middlebox state taxonomy, with state from the Squid caching proxy as an example
4.4 Assumed topologies for move operation
4.5 Order-preserving problem in Split/Merge
4.6 Pseudo-code executed by the controller for a loss-free and order-preserving move
4.7 Application for maintaining high availability
4.8 Application for maintaining predictable performance
4.9 Setup for injecting packets from events into dstInst’s input stream
4.10 Efficiency of move
4.11 Improvements in move time with P2P transfers
4.12 Impact of packet rate and number of per-flow states on pipelined move with and without a loss-free guarantee
4.13 Improvements in latency overhead
4.14 Efficiency of state export and import
4.15 Performance of concurrent loss-free move operations

abstract

Data center failures have become increasingly problematic due to the plethora of critical web and storage services hosted in today’s data centers. Frequently, the problem lies in the data center network, which is prone to both functional and performance failures caused by hardware or software faults, misconfiguration, overload, or other issues with links and devices. Preventing such failures is challenging, because data center network operators lack a formal understanding of how their design and operational decisions impact the frequency of network problems. Furthermore, current frameworks for verifying and maintaining the functionality and performance of data center networks are incomplete and/or inefficient.
Consequently, this thesis explores how to analyze an organization’s network management practices and efficiently guarantee that a data center network functions correctly and offers reasonable performance amidst changes in infrastructure, configuration, and workload.

We first present the design of a management plane analytics (MPA) framework that uncovers the relationships between network management practices and the frequency of network problems. By applying MPA to over 850 data center networks operated by a large online service provider, we identify several practices that strongly impact the frequency of problems in these networks, including the number of control plane configuration changes and the number of device types (i.e., the presence of middleboxes).

Armed with this information, we explore how to design abstractions that aid in ensuring the correct and performant operation of a data center’s control plane and middleboxes. We introduce an abstract representation for control planes that efficiently models a data center network’s forwarding behavior under all possible link/device failure scenarios. This allows us to verify important functional invariants (e.g., traffic between subnets S1 and S2 always traverses a middlebox) three to five orders of magnitude faster than current verification tools. Additionally, we introduce a middlebox state management framework that allows network operators to realize a “one-big-middlebox” abstraction and avoid middlebox-induced functional and performance failures in the presence of hardware/software faults or overload.


Novel Abstractions for Data Center Network Management By Aaron Gember-Jacobson A ... PDF

160 Pages·2016·3.74 MB·English
