Multicore Architecture Prototyping on Reconfigurable Devices

Oriol Arcas Abella

Advisors:
Dr. Adrián Cristal Kestelman
Dr. Osman S. Ünsal
Dr. Nehir Sönmez

Departament d'Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona - March 2016

A dissertation submitted in fulfillment of the requirements for the degree of Doctor of Philosophy / Doctor per la UPC
February 2016

To my family, for making it possible; to Nehir, friend and colleague, the worker bee; and to Carlota, for supporting me until the end.

Acknowledgements

This dissertation¹ condenses the research and efforts I made in the past years. However, what I value most is not the knowledge I accumulated, but the people I met and the experiences I lived during this journey.

In the first place, Nehir Sönmez, friend and neighbour, the other worker bee, who introduced me to the FPGA world. I also want to thank my advisors, Dr. Osman S. Ünsal and Dr. Adrián Cristal, for always looking out for my best interest and for their endless support; and Eduard Aiguadé, my former advisor, for his wise advice. This thesis would not be possible without the direct collaboration of Gökhan Sayilar, Philipp Kirchhofer and Abhinav Agarwal. In addition, during these years I had the luck to meet and learn from many excellent researchers: Prof. Mateo Valero, Dr. Satnam Singh, Prof. Roberto Hexsel, Prof. Arvind and Prof. Mikel Luján.

My colleagues at the BSC helped me constantly and made our daily routine enjoyable, especially the passionate lunch-time debates.
At the risk of forgetting somebody, I want to thank Vasilis Karakostas, Adrià Armejach, Saša Tomić, Srđan Stipić, Ferad Zyulkyarov, Vesna Smiljković, Gülay Yalçın, Oscar Palomar, Javier Arias, Nikola Marković, Daniel Nemirovsky, Otto Pflucker, Azam Seyedi, Vladimir Gajinov, Vladimir Subotić, Behzad Salami, Omer Subasi, Tassadaq Hussain, Cristian Perfumo, Ege Akpinar, Pablo Prieto, Rubén Titos and Paul Carpenter.

One of the periods that I will remember from this thesis was my internship at the CSG at MIT, kindly hosted by Prof. Arvind. He and his group made me feel at home, especially Myron King, Abhinav Agarwal, Asif Khan, Muralidaran Vijayaraghavan and Sang Woo Jun. I also want to thank Prof. Mikel Luján and his team from the University of Manchester, Mohsen Ghasempour, Geoffrey Ndu, John Mawer and Javier Navaridas, for being excellent hosts.

Finally, I want to thank my parents and my brother for always supporting my career, and Carlota, for always enduring my career.

¹ The cover image “Rubik’s Cube” by Wikipedia user Booyabazooka is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license: https://commons.wikimedia.org/wiki/File:Rubik’s_cube.svg

Abstract

In recent decades, several performance walls have been hit. The memory wall and the power wall are limiting the performance scaling of digital microprocessors. Homogeneous multicores rely on thread-level parallelism, which is challenging to exploit. New heterogeneous architectures promise higher performance per watt, but software simulators have limited capacity to research them.

In this thesis we investigate the advantages of Field-Programmable Gate Array (FPGA) devices for multicore research. We developed three prototypes, implementing up to 24 cores in a single FPGA, showing their superior performance and precision compared to software simulators. Moreover, our prototypes perform full-system emulation and are fully modifiable.
We use our prototypes to implement novel architectural extensions such as Transactional Memory (TM). This use case allowed us to explore the different needs that computer architects may have, and how to address them on FPGAs. We developed several techniques for profiling, debugging and verification at each stage of the design process. These solutions may bridge the gap between FPGA-based hardware design and computer architects. In particular, we place special emphasis on non-obtrusive techniques, so that the precision of the emulation is not affected.

Based on the current trends and the sustained growth in the high-level synthesis community, we expect FPGAs to become an integral part of computer architecture design in the coming years.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Field-Programmable Gate Arrays
  1.2 FPGA-based Computer Architecture Prototyping
  1.3 Simulation and Accuracy
  1.4 Thesis Motivation
  1.5 Thesis Contributions
    1.5.1 Implementing an Open-Source Multicore: Beefarm and TMbox
    1.5.2 High-level Methodologies, Debugging and Verification: Bluebox
  1.6 Acknowledgements
  1.7 Thesis Organization

2 Background
  2.1 The Rise of Multicore Architectures
  2.2 The Limitations of Software Architectural Simulators
  2.3 FPGA-based Simulators and Emulators
  2.4 The Limitations of FPGA-based Prototyping

3 An Open-Source FPGA-based Multicore
  3.1 Introduction
  3.2 Transactional Memory
  3.3 The Beefarm System
    3.3.1 The Plasma Core
    3.3.2 The BEE3 Platform
    3.3.3 The Honeycomb core: Extending Plasma
    3.3.4 The Beefarm System Architecture
    3.3.5 FPGA resource usage
    3.3.6 The Beefarm Software
    3.3.7 Investigating TM on the Beefarm
  3.4 Comparison with SW Simulators
    3.4.1 Methodology
    3.4.2 Beefarm Multicore Performance with STM Benchmarks
  3.5 Efficient Implementations for Floating-point Support
  3.6 The Experience and Trade-offs in Hardware Emulation
  3.7 Conclusions

4 Hybrid Transactional Memory on FPGA
  4.1 Introduction
  4.2 The TMbox Architecture
    4.2.1 Interconnection
  4.3 Hybrid TM Support for TMbox
    4.3.1 Instruction Set Architecture Extensions
    4.3.2 Bus Extensions
    4.3.3 Cache Extensions
  4.4 Experimental Evaluation
    4.4.1 Architectural Benefits and Drawbacks
    4.4.2 Experimental Results
  4.5 Conclusions

5 Profiling and Visualization on FPGA
  5.1 Introduction
  5.2 Design Objectives
    5.2.1 TMbox Architecture