Enabling Development of OpenCL Applications on FPGA platforms

Kavya S Shagrithaya

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering

Peter M. Athanas, Chair
Patrick R. Schaumont
Paul E. Plassmann

August 6, 2012
Blacksburg, Virginia

Keywords: OpenCL, AutoESL, FPGA, Convey, HPC

Copyright 2012, Kavya S Shagrithaya

(ABSTRACT)

FPGAs can potentially deliver tremendous acceleration in high-performance server and embedded computing applications. Whether used to augment a processor or as a stand-alone device, these reconfigurable architectures are being deployed in a large number of implementations owing to the massive amounts of parallelism offered. At the same time, a significant challenge to their widespread acceptance is the laborious effort required to program these devices. The increased development time, the level of experience needed by developers, fewer turns per day and the difficulty of iterating quickly over designs affect the time-to-market for many solutions. High-level synthesis aims at increasing the productivity of FPGAs and bringing them within the reach of software developers and domain experts.

OpenCL is a specification introduced for parallel programming across platforms. Applications written in OpenCL consist of two parts: a host program for initialization and management, and kernels that define the compute-intensive tasks. In this thesis, a compilation flow to generate customized application-specific hardware descriptions from OpenCL computation kernels is presented. The flow uses the Xilinx AutoESL tool to obtain the design specification for compute cores. The provided architecture integrates the cores with memory and host interfaces.
The host program in the application is compiled and executed to demonstrate a proof-of-concept implementation towards achieving an end-to-end flow that provides abstraction of hardware at the front-end.

Acknowledgements

Firstly, I am sincerely thankful to my advisor, Dr. Peter Athanas, for providing me an opportunity to be a part of the Configurable Computing Lab. He has been a source of inspiration throughout and has been very encouraging at all times. I would like to express my deep gratitude for his continual support and invaluable guidance, without which I could not have progressed in my research. I have learnt a lot from him, in academics and beyond.

I would like to thank Dr. Patrick Schaumont and Dr. Paul Plassmann for serving as members of my committee. The courses that I took under Dr. Schaumont were a great learning experience and I immensely enjoyed his style of teaching.

My parents and my brother have been there for me, through thick and thin. I am very grateful to them for encouraging me towards pursuing graduate studies and supporting me throughout. Special thanks to my father, my role model, for all his words of wisdom.

The environment in the lab has always been conducive to learning things the enjoyable way. I would like to thank my lab mates: Tony, for his advice on all work-related matters; Karl, for helping me ramp up in the initial days; Krzysztof, for his help and guidance on this project; Kaiwen, for his inputs towards this work; and the rest of the CCM family, David, Ritesh, Ramakrishna, Keyrun, Xin Xin, Wen Wei, Andrew, Richard, Shaver, Jason, Ryan, Teresa, Abhay and Jacob, for making this a truly enriching experience.

My heartfelt thanks to Kaushik and Abhinav for their support in curricular and non-curricular aspects; Iccha, Vishwas, Vikas, Shivkumar, Navish, Atashi and Mahesh for the fun times off work; and all my other friends at Blacksburg who have made these two years very memorable for me.

I am thankful to God for everything.
Contents

1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Thesis Organization
2 Background
    2.1 High-Level Synthesis
    2.2 Overview of OpenCL
        2.2.1 OpenCL models
            Platform Model
            Execution Model
            Memory Model
            Programming Model
        2.2.2 OpenCL Framework
        2.2.3 OpenCL C Programming Language
    2.3 Related work
    2.4 Summary
3 Approach and Implementation
    3.1 Approach
        3.1.1 An Introduction to AutoESL
    3.2 Implementation Specifics
        3.2.1 OpenCL to AutoESL Source-to-Source Translation
        3.2.2 AutoESL Synthesis
        3.2.3 Integration and Mapping on Convey
        3.2.4 Host Library
    3.3 Summary
4 Results
    4.1 Case Study: Vector Addition
        4.1.1 Performance and Resource Utilization
    4.2 Case Study: Matrix Multiplication
    4.3 Comparison of Methodologies
    4.4 Challenges in Using OpenCL
    4.5 Summary
5 Conclusion
    5.1 Future Work
Bibliography
Appendix A: Dot File Description for Vector Addition
Appendix B: OpenCL API Implementations for Kernel Execution

List of Figures

2.1 Few high-level synthesis tools and their source inputs
2.2 The platform model for OpenCL architecture [12]
2.3 An example of an NDRange index space where N = 2
2.4 OpenCL memory regions and their relation to the execution model
3.1 Overview of the design flow
3.2 Dot format visualization for the vector addition kernel
3.3 System architecture on Convey
4.1 Execution flow for the vector addition application
4.2 Performance results for different vector sizes
4.3 Resource Utilization for vector addition application for a single AE
4.4 Matrix multiplication operation

List of Tables

4.1 Performance results for vector addition application for vectors of size 1024
4.2 Area Estimates for a single core from AutoESL report
4.3 Comparison with SOpenCL and OpenRCL implementations
4.4 Comparison with Altera’s OpenCL for FPGAs program

Chapter 1

Introduction

An increasing number of application areas today are embracing high performance computing (HPC) solutions for their processing needs. With frequency scaling having run its course, conventional processors have given way to various accelerator technologies to meet the computational demands. These include Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), heterogeneous multicore processors like Cell, and hybrid architectures like the Convey HC-1. Different classes of applications employ different targets best suited for the problem at hand. Each of the accelerator technologies possesses advantages over the others for different tasks, leading to possibilities of a heterogeneous mix of architectures [1–3]. Reconfigurable systems like FPGAs provide promising opportunities for acceleration in many fields due to their inherent flexibility and massive parallel computation capabilities.

The availability of a multitude of hardware devices requires comprehensive approaches to a solution, which include knowledge of the underlying architecture along with methods of designing the algorithm. This translates to an increase in implementation effort, steep learning curves and architecture-aware programming at the developer’s end. In 2004, DARPA launched a project named High Productivity Computing Systems (HPCS) that aimed at providing economically viable high productivity systems [4]. The project proposed using “time to solution” as a metric that includes the time to develop a solution as well as the time taken to execute it. CPUs and GPUs are programmed in high-level languages like C/C++ and CUDA. This level of abstraction enables faster deployment of solutions. In the case of FPGAs, implementations require tedious hardware design and debugging, which greatly impacts the development time.

1.1 Motivation

Enormous acceleration can be delivered by FPGAs owing to the flexibility and parallelism provided by the fine-grained architecture.
However, being able to fully harness this potential presents challenges in terms of programmability. Implementation of applications on FPGAs involves cumbersome RTL programming and manual optimizations. Besides domain expertise and software design skills, developers are required to understand intricate details of hardware design, including timing closure, state machine control and cycle-accurate implementations. As a result, the effort to drastically reduce execution time translates into a significant increase in the time to develop the solution.

Jeff Chase et al. implemented a tensor-based real-time optical flow algorithm on FPGA and GPU platforms. A fixed-point version was mapped on a Virtex II Pro XC2VP30 FPGA in the Xilinx XUP V2P board, and a single-precision floating-point implementation was mapped on an Nvidia GeForce 8800 GTX GPU, with appropriate design optimizations for each platform. The two implementations were compared on various parameters including performance, power, and productivity. The work reported comparable performance on the FPGA and the GPU, while the power consumption was significantly lower on the FPGA. The FPGA implementation, however, took more than 12 times longer to develop than the GPU implementation.

There has been significant research and industry-related work on providing high-level programming solutions for FPGAs to reduce these design efforts. These solutions, however, do not succeed in abstracting all implementation details, since some knowledge of hardware concepts is still required to design the software solution appropriately for maximum performance.

OpenCL (Open Computing Language) is a cross-platform standard for building parallel applications. The OpenCL programming model, albeit skewed towards GPU architectures, provides a standard framework that enhances portability across devices, essentially rendering it platform independent.
It enables the developer to focus on the design of the application while abstracting the fine details of implementation. Being able to develop applications in OpenCL for FPGA platforms would result in faster implementation times and shorter time-to-market.

1.2 Contributions

Supporting OpenCL applications on a new platform requires a compiler to translate the kernels to the device-specific executable, a library for the OpenCL API, and a driver that manages the communications between the host and the devices. This thesis presents a proof-of-concept system that enables OpenCL application development on FPGAs. An efficient method for compiling OpenCL kernel tasks to hardware is demonstrated, without having to build a compiler from scratch. Existing tools are leveraged, as required, in the process. A system architecture is presented for the target device with automated generation of interfaces for the kernel. Also, a library that supports a subset of the OpenCL host API is provided.
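
To make the split between compute kernels and the host-side library more concrete, the following is a minimal software sketch of the idea, written in Python purely for illustration. It is not the flow developed in this thesis and does not use the OpenCL API: the names `vector_add_kernel`, `enqueue_nd_range` and `run_vector_add` are hypothetical, and the sequential loop stands in for work-items that a real device would execute in parallel.

```python
# Hypothetical illustration only: a pure-Python emulation of the OpenCL
# host/kernel split. None of these names come from the thesis or from
# the OpenCL API.


def vector_add_kernel(global_id, a, b, c):
    """Kernel body: the work performed by a single work-item."""
    c[global_id] = a[global_id] + b[global_id]


def enqueue_nd_range(kernel, global_size, *args):
    """Host-side stand-in for launching a kernel over a 1-D index space.

    A real device would run the work-items in parallel; this loop simply
    visits each index in turn.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


def run_vector_add(a, b):
    """Host program: allocate the output, launch the kernel, return the result."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    c = [0] * len(a)
    enqueue_nd_range(vector_add_kernel, len(a), a, b, c)
    return c


if __name__ == "__main__":
    print(run_vector_add([1, 2, 3, 4], [10, 20, 30, 40]))
```

In this sketch the kernel only describes what one index does, while the host function owns allocation and launch. That mirrors, in a very simplified form, the roles that the kernel compiler and the host library play in the system outlined above.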