Ensuring dependability and improving performance of transactional systems deployed on multi-core architectures Mohamed Mohamedin Preliminary Examination Proposal submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Binoy Ravindran, Chair Leyla Nazhandali Mohamed Rizk Paul Plassmann Robert P. Broadwater Roberto Palmieri November 3, 2014 Blacksburg, Virginia Keywords: Transaction Memory, Hardware Transaction Memory (HTM), Best-efforts HTM, Transactions Partitioning, Transactions Scheduling, Transient Faults, Soft-errors, Fault-tolerance Copyright 2014, Mohamed Mohamedin Ensuring dependability and improving performance of transactional systems deployed on multi-core architectures Mohamed Mohamedin (ABSTRACT) The industrial shift from single core processors to multi-core ones introduced many chal- lenges. Among them, a program cannot get a free performance boost by just upgrading to a new hardware because new chips include more processing units but at the same (or compara- ble)clockspeedasthepreviousone. Inordertoeffectivelyexploitthenewavailablehardware and thus gain performance, a program should maximize parallelism. Unfortunately, parallel programming poses several challenges, especially when synchronization is involved because parallel threads need to access the same shared data. Locks are the standard synchronization mechanism but gaining performance using locks is difficult for a non-expert programmers. A new, easier, synchronization abstraction is required and Transactional Memory (TM) is the potential candidate. TM is a new programming paradigm that simplifies the implementation of synchronization. TheprogrammerjustdefinesatomicpartsofthecodeandtheunderlyingTMsystemhandles the required synchronization. In the past decade, TM researchers worked extensively to improve TM-based systems. Most of the work has been dedicated to Software TM (STM) as it does not requires special transactional hardware supports. Very recently (in the past two years),thosehardwaresupporthavebecomecommerciallyavailableincommodityprocessors. Hardware TM (HTM) provides the best performance of any TM-based systems, but current HTM systems are best-effort, thus transactions are not guaranteed to commit. In fact, HTM transactions are limited in size and time as well as prone to livelock at high contention levels. Another challenge facing multi-core architectures is dependability. In the hardware design, factors such as increasing number of components on the same chip, reducing transistor’s size, and reducing the operating voltage cooperate to increase the rate of soft-errors. Soft- errors are transient faults that can randomly flip registers and memory (likely cache) bits. Multi-core architectures meet all aforementioned factors, thus they are prone to soft-errors. In this dissertation, we tackle the performance and dependability challenges of multi-core architectures by providing two solutions for increasing performance using HTM (i.e., Part- htm and Octonauts), and two solutions for solving soft-errors (i.e., Shield and SoftX). Part-htm is an innovative Hybrid TM system that alleviates HTM resource limitation. A large transaction is partitioned into smaller partitions that fit in HTM. Thus, large transac- tionscanachievehighperformancebyexploitingHTMspeed. Transactionpartitioningposes many correctness challenges which are handled by Part-htm. Results in benchmarks where transactions are mostly aborted for resource failures confirmed Part-htm effectiveness. In those benchmarks Part-htm is the best, gaining up to 74%. Octonauts tackles the live-lock problem of HTM at high contention level. HTM lacks of advanced contention management (CM) policies. Octonauts is an HTM-aware scheduler thatorchestratesconflictingtransactions. Itusesaprioriknowledgeoftransactions’working- set to prevent the activation of conflicting transactions, simultaneously. Octonauts also accommodates both HTM and STM with minimal overhead by exploiting adaptivity. Based on the transaction’s size, time, and irrevocable calls (e.g., system call) Octonauts selects the best path among HTM, STM, or global locking. Results show performance improvement iii when Octonauts is deployed in comparison with pure HTM with falling back to global locking. Shield is a fault-tolerant system for transactional application using state-machine replica- tion (SMR). It is optimized for centralized multi-core architectures specially message-passing based ones. It includes an efficient network layer for ordering transactional requests with op- timistic delivery support. The other main component of Shield is the concurrency control. It supports concurrent execution of transactional requests while forcing the network layer order. Shield is also able to speculatively start transactions as soon as the request is opti- mistically delivered. Shield’s results show limited overhead in comparison with non-fault- tolerant transactional systems and improvement over other non-optimized SMR approaches. SoftX is a low-invasive protocol for supporting execution of transactional applications rely- ing on speculative processing and dedicated committer threads. Upon starting a transaction, SoftX forks a number of threads running the same transaction, independently. The commit phase is handled by dedicated threads for optimizing synchronization’s overhead. Results reveal better performance than classical replication-based fault-tolerant systems and limited overhead with respect to non fault-tolerant protocols. In comparison with Shield, SoftX doesnotrequirescomputationalresourcepartitioningnorrequestsorderingbutithassmaller fault coverage. Our first major post-preliminary research goal is to develop a contention manager for HTM that works based on online information without a priori knowledge of the transaction’s ac- cesses. Dropping this knowledge increases the CM accuracy and allows more concurrency. The challenge posed by this goal is to allow the sharing of online information without in- troducing additional aborts due to the nature of HTM conflict resolution. Our second goal is to add advanced nesting capabilities (i.e., closed and open nesting) to HTM. HTM sup- ports basic flat nesting only and it is limited to 7 levels only. Advanced nesting is known to improve the performance. For example, in closed nesting, aborting a child transaction does not abort the parent transaction. Our third goal is to build a comprehensive framework for Hybrid TM algorithms. In fact, analyzing the current literature of Hybrid TM algorithms, we can identify several common building blocks used to build those algorithm. Our proposed framework makes the implementation of hybrid TM algorithms easy. It aims for helping the research community in developing more hybrid TM algorithms, deploying extensive compar- ison studies and allowing the composability of different hybrid TM algorithms according to the transaction’s characteristics. This work is supported in part by US National Science Foundation under grants CNS 0915895, CNS 1116190, CNS 1130180, and CNS 1217385. iv Contents 1 Introduction 1 1.1 Summary of Current Research Contributions . . . . . . . . . . . . . . . . . . 3 1.1.1 Improving Multi-core Performance Exploiting HTM . . . . . . . . . . 3 1.1.2 Ensuing multi-core Dependability . . . . . . . . . . . . . . . . . . . . 5 1.2 Summary of Proposed Post-Prelim Work . . . . . . . . . . . . . . . . . . . . 7 1.3 Proposal Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Related Work 9 2.1 Performance Improvement Using HTM . . . . . . . . . . . . . . . . . . . . . 9 2.2 Transactional Memory Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Dependability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Background 15 3.1 Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Transactional Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 TM Design Classification . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Intel’s HTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4 Soft-errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Fault tolerance techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.1 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.5.2 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.3 Assertions/Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 v 3.5.4 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4 Part-HTM 23 4.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.1.1 Intel’s HTM Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.2 Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.3 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3.1 Protocol Meta-data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3.2 Begin Operation and Transaction’s Partitioning . . . . . . . . . . . . 31 4.3.3 Transactional Read Operation . . . . . . . . . . . . . . . . . . . . . . 32 4.3.4 Transactional Write Operation . . . . . . . . . . . . . . . . . . . . . . 32 4.3.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.6 Commit Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.3.7 Aborting a Transaction . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.4 Compatibility with other HTM processors . . . . . . . . . . . . . . . . . . . 36 4.5 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5 Octonauts 43 5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.3 Algorithm Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3.1 Reducing Conflicts via Scheduling . . . . . . . . . . . . . . . . . . . . 46 5.3.2 HTM-aware Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.3.3 Transactions Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.3.4 Adaptive Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.4.1 Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.4.2 TPC-C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 vi 6 Shield 54 6.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2 Is Byzantine Fault Tolerance the Solution? . . . . . . . . . . . . . . . . . . . 57 6.3 System Model and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.4 Fault Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.5 Shield Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.6 Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.6.1 Tolerating CPU-tfaults . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.7 Replica Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.8.1 Tilera TILE-Gx family . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.8.2 x86 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 7 SoftX 77 7.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 7.2 Assumptions and Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . 80 7.3 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7.4.1 Shared-bus architecture . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.4.2 Message-passing architecture . . . . . . . . . . . . . . . . . . . . . . . 90 8 Conclusions 91 8.1 Proposed Post-Prelim Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 8.1.1 HTM Advanced Online Contention Manager . . . . . . . . . . . . . . 92 8.1.2 HTM Advanced Nesting Support . . . . . . . . . . . . . . . . . . . . 93 8.1.3 Hybrid TM Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 94 vii List of Figures 3.1 Relation between number of components on the same chip and soft-errors rate. (Figure from [11]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Relation between transistor size and soft-errors rate. (Figure from [16]) . . . 20 4.1 Part-htm’s Basic Idea. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Comparison between HTM, STM, and Part-htm. . . . . . . . . . . . . . . 29 4.3 Part-htm’s pseudo-code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 Acquiring write-locks (a). Detecting intermediate reads or potential over- writes (b). Releasing write-locks (c). . . . . . . . . . . . . . . . . . . . . . . 34 4.5 Throughput using N-Reads M-Writes benchmark and disjoint accesses. . . . 39 4.6 Throughput using Linked-List. . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.7 Speed-upoverserial(single-thread)executionusingapplicationsoftheSTAMP Benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.1 Scheduling transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2 Readers-Writers ticketing technique. . . . . . . . . . . . . . . . . . . . . . . . 47 5.3 HTM-STM communication. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.4 Throughput using Bank benchmark. . . . . . . . . . . . . . . . . . . . . . . 50 5.5 Execution time using Bank benchmark (lower is better). . . . . . . . . . . . 51 5.6 Throughput using TPC-C benchmark. . . . . . . . . . . . . . . . . . . . . . 52 5.7 Execution time using TPC-C benchmark. . . . . . . . . . . . . . . . . . . . . 52 6.1 System, replica and software architecture. . . . . . . . . . . . . . . . . . . . 56 6.2 Network Layer Protocol (Application side). . . . . . . . . . . . . . . . . . . . 63 viii 6.3 Network Layer Protocol (Replica side). . . . . . . . . . . . . . . . . . . . . . 64 6.4 Transaction’s read procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.5 Transaction’s write procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.6 Performance of network layer on Tilera. . . . . . . . . . . . . . . . . . . . . . 69 6.7 Transactional throughput of Shield on Tilera. . . . . . . . . . . . . . . . . 72 6.8 Vacation execution time on Tilera. . . . . . . . . . . . . . . . . . . . . . . . 73 6.9 Network layer throughput on x86. . . . . . . . . . . . . . . . . . . . . . . . . 74 6.10 Transactional throughput of Shield on x86. . . . . . . . . . . . . . . . . . . 75 6.11 Vacation execution time on x86. . . . . . . . . . . . . . . . . . . . . . . . . . 76 7.1 Processing transaction with SoftX. . . . . . . . . . . . . . . . . . . . . . . 83 7.2 Transactional throughput of SoftX on x86. . . . . . . . . . . . . . . . . . . 85 7.3 Transactional throughput of SoftX on Tilera. . . . . . . . . . . . . . . . . . 87 8.1 The proposed object’s circular buffer meta data and abort notifiers. . . . . . 93 ix List of Tables 4.1 Statistics’comparisonbetweenHTM-GL(A)andPart-htm(B)usingLabyrinth application and 4 threads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 x
Description: