Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs PDF

134 Pages·2016·4.32 MB·English


Brigham Young University
BYU ScholarsArchive

Theses and Dissertations

2012-01-22

Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs

Christopher Michael Lavin
Brigham Young University - Provo

Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Electrical and Computer Engineering Commons

BYU ScholarsArchive Citation
Lavin, Christopher Michael, "Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs" (2012). Theses and Dissertations. 2933.
https://scholarsarchive.byu.edu/etd/2933

This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected], [email protected].

Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs

Christopher Michael Lavin

A dissertation submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Brent E. Nelson, Chair
Brad L. Hutchings
David A. Penry
Michael D. Rice
Michael J. Wirthlin

Department of Electrical and Computer Engineering
Brigham Young University
April 2012

Copyright © 2012 Christopher Michael Lavin
All Rights Reserved

ABSTRACT

Field programmable gate arrays (FPGAs) offer an attractive compute platform because of their highly parallel and customizable nature in addition to the potential of being reconfigurable to almost any desired circuit.
However, compilation time (the time it takes to convert user design input into a functional implementation on the FPGA) has been a growing problem and is stifling designer productivity.

This dissertation presents a new approach to FPGA compilation that more closely follows the software compilation model than that of the application specific integrated circuit (ASIC). Instead of re-compiling every module in the design for each invocation of the compilation flow, pre-compiled modules are used that can be “linked” in the final stage of compilation. These pre-compiled modules are called hard macros and contain the necessary physical information to ultimately implement a module or building block of a design. By assembling hard macros together, a complete and fully functional implementation can be created within seconds.

This dissertation describes the process of creating a rapid compilation flow based on hard macros for Xilinx FPGAs. First, RapidSmith, an open source framework that enabled the creation of custom CAD tools for this work, is presented. Second, HMFlow, the hard macro-based rapid compilation flow, is described and presented as tuned to compile Xilinx FPGA designs as fast as possible. Finally, several modifications to HMFlow are made such that it produces circuits with clock rates that run at more than 75% of Xilinx-produced implementations while compiling more than 30× faster than the Xilinx tools.

Keywords: FPGA, rapid prototyping, design flow, hard macros, Xilinx, XDL, RapidSmith, HMFlow, open source, placer, router

ACKNOWLEDGMENTS

I would first like to thank my wife Ashley and daughter Katelyn for their patience and support in allowing me to finish this work. Several long days, nights and Saturdays were necessary for this dissertation to meet completion and their long suffering and love provided me a great motivation to finish.
I am also grateful for my parents and their love and support by providing me the great start in life to allow me to reach this achievement.

I would like to thank Dr. Brent Nelson for taking me under his wing five and a half years ago. I am grateful for the patience he had to allow me to find out on my own what I should research for this dissertation. The extra time Dr. Nelson made for me to discuss my ideas or challenges and his example helped shape my talents and skills to help me become the engineer I am today.

I would also like to thank Dr. Brad Hutchings for his significant contributions to this work and my development as an engineer. Dr. Hutchings went out of his way to provide time and support as an unofficial co-advisor to this work. His insights were invaluable and added significantly to this dissertation and helped me grow as a graduate student.

Thanks to Dr. Michael Rice for the opportunity to work with him and the Telemetry Lab on the Space-Time Coding project. That opportunity turned out to be a rich experience that laid the groundwork for several of the other accomplishments I have made as a graduate student. Thanks also to Dr. Michael Wirthlin for all the support and time he took to help me in various endeavors. Thanks to Dr. David Penry and the entire committee for their valuable insight on this dissertation.

There were also several students whose example and work helped me significantly as a graduate student that I would like to thank: Joseph Palmer, Nathan Rollins, Jon-Paul Anderson, Brian Pratt, Marc Padilla, Jaren Lamprecht, Philip Lundrigan, Subhra Ghosh, Brad White, Jonathon Taylor, Josh Monson and all the other students in the Configurable Computing Lab. Special thanks also to Neil Steiner and Matt French at USC-ISI East, Dr. Peter Athanas and the Virginia Tech Configurable Computing Lab as well as the entire Gremlin project for their insight and ideas that helped me more fully understand FPGAs and their architecture.
This research was supported by the I/UCRC Program of the National Science Foundation under Grant No. 0801876 through the NSF Center for High-Performance Reconfigurable Computing (CHREC).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  1.1 Motivation
  1.2 Preview of Approach
  1.3 Contributions of this Work

Chapter 2 Background and Related Work
  2.1 FPGA Architecture
    2.1.1 FPGA Primitives
    2.1.2 Chip Layout and Routing Interconnect
  2.2 Conventional FPGA Compilation Flow
  2.3 Related Work in Accelerating FPGA Compilation
    2.3.1 Using Pre-compiled Cores to Accelerate FPGA Compilation
    2.3.2 Accelerating Placement Techniques
    2.3.3 Routability-driven Routing
    2.3.4 Summary and Overview of this Work

Chapter 3 RapidSmith: An Open Source Platform for Creating FPGA CAD Tools for Xilinx FPGAs
  3.1 Introduction
  3.2 Related Work
    3.2.1 Torc
  3.3 XDL: The Xilinx Design Language
    3.3.1 Detailed FPGA Descriptions in XDLRC Reports
    3.3.2 Designs in XDL
  3.4 RapidSmith: A Framework to Leverage XDL and Provide a Platform to Create FPGA CAD Tools
    3.4.1 Xilinx FPGA Database Files in RapidSmith
    3.4.2 Augmented XDLRC Information in RapidSmith
    3.4.3 XDL Design Representation in RapidSmith
    3.4.4 Impact of RapidSmith

Chapter 4 HMFlow 2010: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping
  4.1 Preliminary Work
    4.1.1 Selection of a Compiled Circuit Representation
    4.1.2 Experiments Validating Hard Macro Potential
    4.1.3 Hard Macros and Quality of Results
    4.1.4 Hard Macros and Placement Time
    4.1.5 Conclusions on Preliminary Hard Macro Experiments
  4.2 HMFlow 2010: A Rapid Prototyping Compilation Flow
    4.2.1 Xilinx System Generator
    4.2.2 Simulink Design Parsing
    4.2.3 Hard Macro Cache and Mapping
    4.2.4 Hard Macro Creation
    4.2.5 XDL Design Stitcher
    4.2.6 Hard Macro Placer
    4.2.7 Detailed Design Router
  4.3 Results
    4.3.1 Benchmark Designs
    4.3.2 RapidSmith Router Performance
    4.3.3 Hard Macro Placer Algorithms
    4.3.4 HMFlow Performance
  4.4 Conclusion

Chapter 5 HMFlow 2011: Accelerating FPGA Compilation and Maintaining High Performance Implementations
  5.1 Preliminary Work
    5.1.1 Experiments
    5.1.2 Conclusions on Preliminary Work
  5.2 Comparison of HMFlow Using Large Hard Macros vs. Small Hard Macros
    5.2.1 Upgrading HMFlow to Support Large Hard Macros
    5.2.2 Upgrading HMFlow to Support Virtex 5 FPGAs
    5.2.3 Large Hard Macro Benchmark Designs for HMFlow
    5.2.4 Comparisons of Large and Small Hard Macro-based Designs
    5.2.5 Comparisons of Large Hard Macros with HMFlow vs. Xilinx
  5.3 Modification to HMFlow for High Quality Implementations
    5.3.1 Hard Macro Simulated Annealing Placer
    5.3.2 Register Re-placement
    5.3.3 Router Improvements
  5.4 Performance Analysis
    5.4.1 Result Measurement Fairness
    5.4.2 Results of Three HMFlow 2011 Improvements
    5.4.3 Results of Optimizing HMFlow 2011 Improvements
  5.5 Techniques for Reducing Variance
    5.5.1 T1: Move Acceptance a Function of Hard Macro Port Count
    5.5.2 T2: Small Hard Macro Re-placement
    5.5.3 T3: Cost Function Includes Longest Wire
    5.5.4 Configuration Comparison of Techniques
  5.6 Runtime Results
  5.7 Conclusions

Chapter 6 The Big Picture: The Compilation Time vs. Circuit Quality Tradeoff
  6.1 Motivation
  6.2 Implications for FPGA Designers
  6.3 Contributions
  6.4 Conclusions
  6.5 Future Work

REFERENCES

LIST OF TABLES

2.1 Virtex 4 Routing Interconnect Types
2.2 Virtex 5 Routing Interconnect Types
3.1 RapidSmith Device Files Performance
4.1 Baseline Runtimes for each Test Design
4.2 Performance of each Test Design Using Hard Macros
4.3 Comparison of Baseline vs. Hard Macro Designs
4.4 Benchmark Design Characteristics
4.5 Fine-grained Hard Macro Compile Times
4.6 Router Performance Comparison: Xilinx vs. RapidSmith
4.7 Hard Macro Placer Algorithm Comparison
4.8 Runtime Performance of HMFlow and Comparison to Xilinx Flow
5.1 Width/Height Area Group Aspect Ratio Configurations
5.2 Slice Counts for Large Hard Macro Benchmark Virtex 5 Designs
5.3 Coarse-grained Hard Macro Compile Times
5.4 Runtime Comparison of HMFlow with Large Hard Macros vs. Xilinx
5.5 Clock Rate Comparison of HMFlow with Large Hard Macros vs. Xilinx
5.6 All HMFlow 2011 Improvement Configurations Tested
5.7 HMFlow 2011 Benchmark Clock Rates of Single (Default) Run (in MHz)
5.8 HMFlow 2011 Benchmark Clock Rates of Average of 100 Runs (in MHz)
5.9 HMFlow 2011 Benchmark Clock Rates of Best of 100 Runs (in MHz)
5.10 Variance of HMFlow 2011 (C7) and Xilinx in 100 Compilation Runs
5.11 Average Variance and Frequency using Variance-reducing Techniques
5.12 Compilation Runtime for Several HMFlow 2011 Configurations vs. Xilinx
5.13 Clock Rate Summary (in MHz) for HMFlow 2011 Configurations vs. Xilinx

LIST OF FIGURES

2.1 General Logic Abstractions in a Xilinx Virtex 5 FPGA
2.2 Common Xilinx FPGA (Virtex 5) Architecture Layout
2.3 Conventional FPGA Compilation Flow (Xilinx)
3.1 RapidSmith and XDL Interacting at Different Points within the Xilinx Tool Flow
3.2 RapidSmith Abstractions for (a) Devices and (b) Designs
3.3 Screenshots of Graphical Tools Provided with RapidSmith to Browse (a) Devices or (b) Designs
4.1 Hard Macro Creation Flow
4.2 Compilation Flow for Experiment #1 (Conventional Xilinx Flow)
4.3 Compilation Flow for Experiment #2 (Modified Xilinx Flow)
4.4 Block Diagram of Multiplier Tree Design
4.5 Block Diagram of HMFlow
4.6 Screenshot of an Example System Generator Design
4.7 Front-end Flow for an HMFlow Hard Macro
4.8 (a) A FIR filter Design Compiled with the Xilinx Tools (b) The Same Filter Design Compiled with an Area Constraint to Create a Hard Macro
4.9 A Histogram of Hard Macro Sizes of All Hard Macros in the Benchmarks
4.10 Graph Showing Percentage of Routed Connections in Each Benchmark Before the Design is Sent to the Router
4.11 (a) Average Runtime Distribution of HMFlow (b) HMFlow Runtime as a Percentage of Total Time to Run HMFlow and Create an NCD File
4.12 A Comparison Plot of the Benchmark Circuits Maximum Clock Rates When Implemented with HMFlow and the Xilinx Tools
5.1 General Pattern of Hard Macro Placement on FPGA Fabric
5.2 Delay of a Path Within a 21×21 Bit LUT-multiplier Hard Macro Placed in a Grid of 400 Locations on a Virtex 4 SX35 FPGA
5.3 A More Severely Impacted Path Caused by a Hard Macro Straddling the Center Clock Tree Spine of the FPGA
5.4 A PicoBlaze Hard Macro Placed at 3700 Locations on a Virtex 5 FPGA
5.5 (a) Illustrates a FIR Filter Implemented in System Generator (b) Shows the FIR Filter Converted to a Subsystem to be Turned into a Hard Macro
5.6 (a) Comparison of Runtime for Large and Small Hard Macro Versions of 3 Benchmarks on HMFlow 2010a (b) Comparison of Clock Rate for Large and Small Hard Macro Versions of 3 Benchmarks on HMFlow 2010a
5.7 The Number of Existing Routed Connections in the Large Hard Macro Benchmark as a Percentage of Total Connections
5.8 Block Diagrams of HMFlow 2010a and HMFlow 2011

