
C-Level Programming of Parallel Coprocessor Accelerators PDF

293 Pages·2010·9.83 MB·English


“C-Level” Programming of Parallel Coprocessor Accelerators

Benjamin Ylvisaker

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Washington, 2010.
Program Authorized to Offer Degree: Computer Science and Engineering

University of Washington Graduate School

This is to certify that I have examined this copy of a doctoral dissertation by Benjamin Ylvisaker and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Co-Chairs of the Supervisory Committee: William H.C. Ebeling, Scott Hauck
Reading Committee: William H.C. Ebeling, Scott Hauck, Daniel Grossman
Date:

University of Washington

Abstract

“C-Level” Programming of Parallel Coprocessor Accelerators

Benjamin Ylvisaker

Co-Chairs of the Supervisory Committee:
Professor William H.C. Ebeling, Computer Science and Engineering
Professor Scott Hauck, Electrical Engineering

We believe that FPGA-like parallel coprocessor accelerators can be programmed efficiently at the “C level” of abstraction. In order to support this claim we define an abstract architectural model of accelerators that conveys the kind of high-level behavior and performance characteristics that the von Neumann model conveys to programmers of conventional processors. Using the model as a guide we define a programming language and compilation strategy that:

1. do not impose programming style restrictions that are not inherent in the model,
2. do not introduce serious inefficiencies, and
3. are performance portable across implementations of the model.

In this dissertation I describe C-level programming of accelerators broadly, and make three particular contributions to the programmability of accelerators.
• Enhanced loop flattening is a new method for translating loop nests with arbitrary static control flow into a form that can be efficiently pipelined with conventional algorithms designed for simple loops. This method advances the goal of supporting a wide set of programming styles with reasonable efficiency.

• Parallel accelerators have statically managed resources, like local memories, that vary widely in capacity from one implementation to the next. In order to get close to peak performance, applications must be tuned to the specific resources available in a given implementation, and empirical auto-tuning is an attractive way to do that. I propose and evaluate a new probabilistic auto-tuning method that elegantly handles situations where many possible configurations of the application fail to work at all because they exceed some architectural resource limit.

• For many applications, achieving good performance on parallel accelerators requires deep loop pipelining, which requires dramatically reordering the individual operations in the application. Local dependencies between operations can be respected by compilers relatively easily, but non-local dependencies force implementations to choose between conservatively not reordering operations (which might kill performance), proving that reordering preserves the meaning of the program (which is impossible in the general case), or making unsound transformations (which programmers generally dislike). I propose a mostly sequential operational semantics for C-level streaming languages targeted at parallel accelerators that offers enough flexibility to the implementation to achieve good performance, deviates from conventional program-order semantics in fairly modest and understandable ways, and provides tools with which the programmer can control the reordering performed by the implementation.
These innovations are evaluated in the context of Macah, a new C-like language devel- oped in the Mosaic group at the University of Washington. For validation we use a number of compute-intensive benchmarks developed by members of the Mosaic group and other contributors. TABLE OF CONTENTS Page List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Chapter 1: The Parallel Coprocessor Accelerator Ecosystem . . . . . . . . . . . . 1 1.1 Parallel coprocessor accelerators . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 What accelerators are good for . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 How engineers program accelerators today . . . . . . . . . . . . . . . . . . . . 5 1.4 How researchers think engineers should program accelerators . . . . . . . . . 5 1.5 Contributions of this dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 11 Chapter 2: An Abstract Model for Parallel Coprocessor Accelerators . . . . . . . . 14 2.1 A proposed model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Implementations of the HMP model . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 Algorithm analysis and design . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Chapter 3: Macah and the Mosaic Toolchain . . . . . . . . . . . . . . . . . . . . . 34 3.1 Macah and the HMP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Example application: motion estimation . . . . . . . . . . . . . . . . . . . . . 38 3.3 Motion estimation in Macah . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Implementing Macah: Mosaic toolchain overview . . . . . . . . . . . . . . . . 59 3.5 Compiling Macah I: front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.6 Compiling Macah II: back-end . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7 Applications . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Chapter 4: Enhanced Loop Flattening . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2 Enhanced loop flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.3 Enhanced loop flattening implementation . . . . . . . . . . . . . . . . . . . . 91 i 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Chapter 5: A Short Survey of Tuning . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.2 Improving mostly conventional compilers. . . . . . . . . . . . . . . . . . . . . 124 5.3 The auto-tuner approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.4 General purpose auto-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 5.5 Tuning for coprocessor accelerators . . . . . . . . . . . . . . . . . . . . . . . . 145 Chapter 6: Auto-Tuning for Accelerators . . . . . . . . . . . . . . . . . . . . . . . 148 6.1 Overview of the tuning knobs method . . . . . . . . . . . . . . . . . . . . . . 151 6.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.3 Context for accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 6.4 The prominent alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.5 How it works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 6.6 Probabilistic regression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 
165 6.7 Derived features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6.8 Complete basic tuning knob algorithm . . . . . . . . . . . . . . . . . . . . . . 172 6.9 Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 6.10 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 6.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 Chapter 7: Relaxed Operational Semantics for Dynamic Streaming Languages . . 192 7.1 Summary of Results for Non-Language Semanticists . . . . . . . . . . . . . . 195 7.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 7.3 Unbounded stream buffer semantics . . . . . . . . . . . . . . . . . . . . . . . 202 7.4 Blocking and polling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 7.5 Bounded stream buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 7.6 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 Chapter 8: Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . 225 8.1 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 8.2 How far have we come? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 ii

