Aalto University
School of Science
Degree Programme in Computer Science and Engineering

Elastic Control of Parallelism in Concurrent Systems

Master's Thesis
May 13, 2014

Ari Sundholm


Aalto University, School of Science                ABSTRACT OF MASTER'S THESIS

Author: Ari Sundholm
Title of thesis: Elastic Control of Parallelism in Concurrent Systems
Date: May 13, 2014
Pages: 63
Degree programme: Degree Programme in Computer Science and Engineering
Code of professorship: T-106
Supervisor: Professor Heikki Saikkonen
Advisor(s): Vesa Hirvisalo, D.Sc. (Tech)

Using heterogeneous systems with accelerators, especially graphics processing units (GPUs), for high-performance computing is an emerging field. Dynamic variation of the resource demands of a parallel computation task can lead to significant underutilization of the devices, as the remaining resources in an accelerator may not allow additional tasks to run.

Computing on an accelerator, a graphics processing unit in particular, is rather rigid. Computations are run by a single user, one at a time, to completion, and there is very limited support for running multiple tasks concurrently. Additionally, a computational task may require more resources than a device can offer. These issues necessitate elastic computing, where the resource demands of a task can be altered to make the task fit the remaining resources. Elastic computing is particularly useful in video encoding and decoding, as the various primitives with differing resource requirements and computational dimensions can be elastically offloaded to accelerators.

This thesis presents CUDA Wrapper, an API-level virtualization layer prototype for GPUs, implemented on top of CUDA, for dynamically and elastically running computational kernels on multiple GPUs while exploiting task-level parallelism on both an inter- and intra-device level. The prototype assumes that elastic kernels are used. This allows for changing the computational dimensions of kernel launches to make them fit the available resources. The GPUs are accessed through a resource pool which abstracts the GPUs away, allowing the application to be completely agnostic to the number of GPUs available.

The approach is shown to work rather well. CUDA Wrapper accurately estimates the resource usage of the GPUs and achieves reasonable occupancy levels with an HEVC-inspired workload.

Keywords: accelerators, graphics processing units, virtualization, HEVC
Language: English


Aalto University, School of Science                ABSTRACT OF MASTER'S THESIS (IN FINNISH)

Author: Ari Sundholm
Title of thesis: Elastinen rinnakkaisuuden hallinta rinnakkaisissa järjestelmissä
Date: May 13, 2014
Pages: 63
Degree programme: Degree Programme in Computer Science and Engineering
Code of professorship: T-106
Supervisor: Professor Heikki Saikkonen
Advisor(s): Vesa Hirvisalo, D.Sc. (Tech)

Using heterogeneous systems that contain accelerators, especially graphics processing units, for high-performance computing is an emerging field. Dynamic variation in the resource demands of a parallel computation task can lead to significant underutilization of the devices, because the remaining resources of an accelerator do not necessarily allow additional computation tasks to be run.

Computing on an accelerator, a graphics processing unit in particular, is rather rigid. Computations are run by a single user, one at a time, from start to finish, and support for running multiple tasks concurrently is very limited. In addition, a computational task may require more resources than the device can offer.
These issues make elastic computing necessary, where the resource demands of a computation task can be altered to fit it into the remaining resources. Elastic computing is particularly useful in video encoding and decoding, as the many kinds of primitives with differing resource requirements can be offloaded to accelerators for computation.

This thesis presents CUDA Wrapper, an API-level virtualization layer prototype implemented on top of CUDA for running computational kernels dynamically and elastically on multiple GPUs, exploiting task-level parallelism both between devices and within a single device. The prototype assumes that the kernels are elastic. This allows the computational dimensions of kernel launches to be changed to fit them into the available resources. The GPUs are accessed through a resource pool which abstracts the accelerators away, allowing the application to be completely agnostic to the number of accelerators.

This approach is shown to work rather well. CUDA Wrapper accurately estimates the resource usage of the GPUs and achieves reasonable occupancy levels with an HEVC-inspired workload.

Keywords: accelerators, graphics processing units, virtualization, HEVC
Language: English


Acknowledgements

I would like to use this space to give my thanks to a number of individuals and non-individuals. First, I thank Professor Heikki Saikkonen, who has been kind enough to act as the supervisor for this thesis. Second, I thank the advisor of this thesis, D.Sc. (Tech) Vesa Hirvisalo, who has seen me through the whole process of writing the thesis with seemingly endless patience for my inquisitiveness. Without him, I could not have succeeded as I did. Third, I thank Eero Talvitie for all of his helpful suggestions and for sparing his precious time to read the endless revisions to this thesis. Fourth, I thank TEKES for funding the ParallaX project under which this thesis was written. Fifth, I thank the late Helsinki University of Technology and Aalto University for giving me an excellent education which allowed me to gain the knowledge and skills required to be able to write this thesis. Sixth, I thank our industrial partners. Seventh, I thank everyone else who has helped me with suggestions or supported me otherwise during the process of writing this thesis.


Contents

1 Introduction
2 Background
  2.1 Parallel computing
    2.1.1 Overview
    2.1.2 Programming models
  2.2 Heterogeneous computing
    2.2.1 Overview
    2.2.2 Interconnects
    2.2.3 Accelerators
  2.3 Software stack
    2.3.1 Programming languages
    2.3.2 Compilers
    2.3.3 Runtime systems
    2.3.4 Operating systems
3 GPU virtualization
  3.1 Virtualization in general
  3.2 GViM
  3.3 vCUDA
  3.4 rCUDA
  3.5 gVirtuS
  3.6 VirtualCL
4 Tools
  4.1 NVIDIA GPUs and CUDA
  4.2 Memory in CUDA
  4.3 Concurrency mechanisms in CUDA
  4.4 CUPTI
  4.5 HEVC
    4.5.1 The basics
    4.5.2 Stream structure
    4.5.3 Between frames
5 CUDA Wrapper
  5.1 Launch queue
  5.2 Resource pool
  5.3 Monitor
  5.4 Dispatcher
  5.5 Application programming interface
  5.6 Example
6 Experimental setup
  6.1 Measurement setup
  6.2 Target system
  6.3 Workload
  6.4 Experiments
  6.5 Metrics
7 Results
  7.1 Scaling
  7.2 Occupancy
  7.3 Latencies
  7.4 Power
8 Conclusions
References


1 Introduction

This thesis studies how parallel computations are performed on multiple devices called accelerators alongside general-purpose CPUs. These kinds of systems are called heterogeneous and computing with such systems is called heterogeneous computing. Exploiting heterogeneous systems in high performance computing is an emerging field.

The resource demands of parallel computations can vary dynamically from task to task. This can lead to significant underutilization of the devices, as the resource demands of a task may exceed the remaining resources, leaving part of the device idle because the device cannot accommodate the task. These problems compound when multiple accelerators are used, as is the case in this thesis, as selections must be made regarding which device to run a task on. The background on parallel and heterogeneous computing, as well as the software stack necessary for them, is covered in section 2.

Computing on accelerators, graphics processing units (GPU) in particular, is rather rigid.
One computational task is run at a time to completion, with no pre-emption. Typically only one user at a time is supported, that is, the usage model is single-user. There is generally very limited support for running multiple tasks concurrently on a single device, potentially partitioning the device's resources and allowing asymmetric use. It may even be that a given computational task requires more resources than the device contains in total. In short, there is a need for elastic computing on the devices. In elastic computing, the resource demands of a task can be altered to fit the remaining resources, possibly at the cost of execution time. In the case of GPUs, elastic kernels were proposed by Pai et al. (2013) as a way of implementing elasticity for computational tasks. They implemented source-to-source transformations of computational tasks, producing elastic kernels, which they showed to be superior to other approaches.

In particular, elastic computing is useful in video encoding and decoding. There are various primitives, part of the encoding or decoding process, which have differing resource requirements and varying computational dimensions. The primitives can be offloaded to one or more accelerators, which then execute them elastically.

In this thesis, an API-level prototype for allowing elastic computing on multiple GPUs is presented. It solves many of the problems mentioned above by allowing the computational dimensions of computational tasks to be modified to make them fit into the available resources, even if there are already tasks executing on the device. For simplicity, all tasks are assumed to be expressed as elastic kernels, as the transformation has been shown to be viable. The prototype is implemented on top of NVIDIA's Compute Unified Device Architecture (CUDA), which allows tasks to run concurrently as long as there are sufficient resources available to do so. This allows for partitioning the resources on devices and running tasks of wildly different resource requirements and computational dimensions. The prototype has been implemented to allow for validation of the concept through experiments and measurements, rather than building a complete system. It relies on virtualization, which is covered in section 3.

Two major tools are used in the thesis. First, CUDA is used to construct the prototype. Second, the High Efficiency Video Coding (HEVC) standard is used as a basis for constructing a workload to test the prototype. CUDA and HEVC are covered in section 4.

The contribution of this thesis is threefold. First, an API-level prototype, called CUDA Wrapper, has been implemented. The design and implementation of the prototype are detailed in section 5. Second, an HEVC-inspired workload has been implemented to test the prototype. Third, experiments have been performed and measurements made to determine the behavior of the prototype.

The CUDA Wrapper prototype is a virtualization layer using the CUDA Runtime and Driver APIs, abstracting away the local GPUs. The GPUs are kept hidden from the application developer. The virtualization layer manages kernel launches, which are automatically and elastically reconfigured to fit the available resources on the GPUs. It automatically chooses which GPU to launch a kernel on.

Four different experiments are run on CUDA Wrapper. The experimental setup is detailed in section 6, the results are presented in section 7, and the thesis is concluded in section 8.
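To make the elastic-kernel idea concrete, the following minimal sketch shows a kernel written with a grid-stride loop, so that a wrapper layer can shrink or grow the launched grid to match free resources without changing the result. This is not code from the CUDA Wrapper prototype; elastic_launch and blocks_that_fit are hypothetical names chosen for illustration, and the prototype's actual interface is described in section 5.5.

    #include <cuda_runtime.h>

    // An "elastic" kernel: the amount of work (n elements) is decoupled from
    // the launch configuration, so the grid can be resized to fit what is free.
    __global__ void scale(float *data, float factor, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x) {      // grid-stride loop
            data[i] *= factor;
        }
    }

    // Hypothetical wrapper-side launch: a virtualization layer would pick the
    // device and a grid size that fits the currently free resources, then
    // enqueue the kernel on a stream of its choosing.
    void elastic_launch(float *d_data, float factor, int n,
                        int blocks_that_fit, cudaStream_t stream) {
        const int threads = 256;
        // Fewer blocks than ceil(n / threads) may be launched; the grid-stride
        // loop still covers all n elements, merely doing more work per block.
        scale<<<blocks_that_fit, threads, 0, stream>>>(d_data, factor, n);
    }

The point of the sketch is only that an elastic kernel tolerates arbitrary launch dimensions, possibly at the cost of execution time.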
The results show that CUDA Wrapper scales linearly with the total size of the workload, is able to reach occupancy levels exceeding 0.75, and accurately estimates the resource usage of the GPUs. It is also shown that CUDA's energy and power management has an issue, causing it to consume power when it should not.


2 Background

Parallel computing is an increasingly important branch of computing in the face of physical limits that effectively disallow traditional ways of increasing the performance of hardware. Parallel computing on homogeneous systems is nowadays rather well established. Heterogeneous computing, however, brings with it a number of special considerations. The software stack, including programming languages, compilers, runtime systems and operating systems, is in a central role in actually implementing software that performs parallel computations. This section introduces parallel computing, heterogeneous computing and the software stack involved in both.

2.1 Parallel computing

2.1.1 Overview

Parallel computing is the use of hardware- or software-level parallelism to perform tasks on sets of data. Two broad categories of parallel computing may be identified (Hennessy and Patterson, 2012, pp. 9–10). The first one is data parallelism, where multiple items within a set of data are processed simultaneously. The second one is task parallelism, where independent tasks are performed in parallel on a set or sets of data. One of the ways, listed by Hennessy and Patterson, in which computer hardware can build on these two categories is instruction-level parallelism, which uses data parallelism to perform operations on sets of data in parallel using a variety of hardware techniques.

One of the most important techniques is pipelining (Hennessy and Patterson, 2012, pp. C-2–C-81), which allows the execution of multiple instructions to be temporally overlapped, as there is inherent parallelism between the different phases of executing an instruction. Without pipelining, the instruction throughput of a processor would be significantly less than one instruction per clock cycle. By executing the different phases of instruction execution of different instructions in parallel, in optimal conditions, with no pipeline stalls, a single-pipeline processor can have a throughput of one instruction per clock cycle. If a processor is superscalar (yet another important technique implementing instruction-level parallelism), it can reach multiple instructions per clock cycle. As pipelining is not visible to the programmer, it usually does not complicate software development.

Vector architectures and graphics processing units (GPUs, see section 2.2.3) are another example of hardware exploiting data parallelism. In vector processors, the same operation can be performed on multiple data items in parallel. A GPU can be considered a set of vector processors, as the same operation is performed on a data set in parallel with a huge number of processing units. The scalability of vector architectures is limited, however, as it is infeasible to increase the vector size too much, because vectorization becomes the more difficult the larger the vector size gets. This applies to both manual vectorization done by the programmer and automatic vectorization done by the compiler. As an example, Intel's Xeon Phi has a vector width of 512 bits (Intel, 2013), which is at the limits of feasibility.
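As an illustration of the two categories (not an example taken from the thesis), the sketch below expresses data parallelism as a CUDA kernel that applies one operation to many elements at once, and task parallelism as two independent kernels submitted to separate CUDA streams (streams are discussed in section 4.3), so that they may run side by side when the device has room for both.

    #include <cuda_runtime.h>

    // Data parallelism: the same operation applied to many data items,
    // one element per thread.
    __global__ void add_one(int *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1;
    }

    __global__ void negate(int *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = -v[i];
    }

    // Task parallelism: two independent tasks issued to separate streams,
    // which may execute concurrently if the device has resources for both.
    void run_independent_tasks(int *d_a, int *d_b, int n) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        add_one<<<blocks, threads, 0, s1>>>(d_a, n);
        negate<<<blocks, threads, 0, s2>>>(d_b, n);
        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }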
The scalability of a single processor core is limited, as it has turned out to be difficult to increase the clock speed beyond 4 GHz without prohibitive power and cooling requirements. A way to increase performance without increasing the clock speed is to exploit parallelism both at the task and data level by having multiple processing units working on the data at the same time. Modern processors have multiple cores and fall into one of two categories: multicore processors and manycore processors. Multicore processors have a number of cores in the range of 2–16, while manycore processors typically have several dozens, hundreds or even thousands of cores. Multicore and manycore processors are typically used together in a heterogeneous system (see section 2.2), that is, a system where there are different kinds of processors. The cores in a multicore processor typically have a consistent view of memory through shared cache memories, which have some level of coherence through snooping, for example. In contrast, in manycore processors the consistency and coherence are typically very limited or nonexistent and require explicit synchronization. This thesis focuses on manycore processors.

When speaking of parallelism, the related concept of concurrency often comes up. These two concepts are different. While parallelism concentrates on performance in a single computational task, concurrency concentrates on correctness among multiple computational tasks.

2.1.2 Programming models

There are a number of programming models that enable parallel programming. The most primitive one, which many of the other models rely on, is based on support for threads provided by the platform. A thread is essentially a lightweight process executing a given task. A thread can be owned by a process, and different threads owned by the same process often share the same address space. This means multiple threads may read and write common memory locations, which may cause problems if done without control. To allow such control, the platform also usually provides primitives to set up memory barriers, which allow another layer of support, namely, locks. Locks allow mutual exclusion of threads and thus the definition of critical sections, that is, sections of code that must only be executed by one thread at a time. Typically, a critical section is thread-safe, as the shared resource protected by the associated lock is accessed in a safe manner.

Furthermore, the existence of locks allows the definition of semaphores, monitors and condition variables. A semaphore is essentially an integer that is incremented and decremented in a thread-safe manner using a lock. Semaphores can be thought of as counters of available shared resources, for instance, the number of free slots within a queue of bounded size. A monitor is a thread-safe object with data and functions, and it often has condition variables associated with it. A condition variable is an object which stores threads waiting, or "sleeping", on the condition it represents, and it allows one or all of those threads to be "woken up" from their wait. For instance, a queue of bounded size may have condition variables for the conditions "not empty" and "not full"; such a queue is sketched below.

Some hardware also has support for vector operations, as explained in the previous section. The operations can be used directly by the programmer, or the compiler (see section 2.3.2) can use them automatically to vectorize program code to exploit data parallelism.
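The bounded queue mentioned above can be sketched with standard C++ mutex and condition variable primitives as follows. This is an illustrative sketch, not code from the thesis, and the class and member names are chosen freely.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <utility>

    // A bounded, thread-safe queue. The mutex defines the critical sections and
    // the two condition variables represent the conditions "not full" and
    // "not empty" described above.
    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        void push(T item) {
            std::unique_lock<std::mutex> lock(mutex_);
            // Sleep until there is a free slot, i.e. the queue is "not full".
            not_full_.wait(lock, [&] { return items_.size() < capacity_; });
            items_.push(std::move(item));
            not_empty_.notify_one();   // wake one consumer waiting for data
        }

        T pop() {
            std::unique_lock<std::mutex> lock(mutex_);
            // Sleep until there is an item, i.e. the queue is "not empty".
            not_empty_.wait(lock, [&] { return !items_.empty(); });
            T item = std::move(items_.front());
            items_.pop();
            not_full_.notify_one();    // wake one producer waiting for a slot
            return item;
        }

    private:
        std::size_t capacity_;
        std::queue<T> items_;
        std::mutex mutex_;
        std::condition_variable not_empty_;
        std::condition_variable not_full_;
    };

A producer calling push() blocks while the queue is full and a consumer calling pop() blocks while it is empty, which matches the behavior a semaphore-based formulation with counters for free and used slots would give.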
