Techniques for Timing Closure on High-Speed Field Programmable Gate Arrays

by

Deshanand P. Singh

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2002 by Deshanand P. Singh

Abstract

Techniques for Timing Closure on High-Speed Field Programmable Gate Arrays
Deshanand P. Singh
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2002

The Field Programmable Gate Array (FPGA) has become a popular implementation medium for digital circuits due to its ability to be configured to realize a variety of different circuits. Although the configurable nature of FPGAs is very attractive, circuits implemented in FPGAs are almost an order of magnitude slower than their ASIC counterparts. Thus it has become increasingly difficult for users to realize realistic timing constraints for FPGA implementations. This is usually referred to as the “timing closure” problem.

This thesis investigates methods for achieving timing closure for FPGA-based designs. Two main techniques are studied in this dissertation. The first studies the effects of creating arbitrary mappings between the logical and physical hierarchy of a design. This method has shown an average of 12% speed improvement when used to map critical sections of logic to fast physical regions on the target device. The second technique tightly integrates netlist optimizations with the placement and routing steps of the FPGA CAD flow. The circuit is restructured with a suite of timing-driven optimizations to better cope with the routing delays inherent in FPGAs. The suite of optimizations includes sequential retiming, Shannon’s decomposition theorem, and clock skew optimization. Each technique is applicable, depending on circuit characteristics, or they may all be used in concert. These restructuring techniques have shown an average speedup of up to 25%.
Acknowledgements

Thanks first to Professor Stephen Brown, who from the start had great confidence that I would do something useful. I hope that this dissertation somehow justifies his blind faith. He presented me with the opportunity and the challenge of solving “real-world” problems that are often abstracted beyond recognition in academic works. For this I am both grateful and tired.

I would like to thank Altera for providing me with great insight into FPGA architectures and CAD. In addition, they gave me an opportunity to learn from a wide variety of the brightest people around. Specifically, I would like to thank Terry Borer, Steven Caranci, Zvonko Vranesic, and Paul McHardy for their insights relating to my work.

Finally, thanks to my parents who patiently supported me through my many years of study. I think I’m finished for now. In this case, the journey surely was more important than the destination, as I’ve learned as much from the failures as from the successes.

Contents

1 Introduction ... 1
  1.1 Field Programmable Gate Arrays ... 1
  1.2 Motivation ... 2
  1.3 Research Approach ... 4
  1.4 Dissertation Organization ... 5
  1.5 Contributions of this Dissertation ... 6
2 Background and Related Work ... 7
  2.1 FPGA Architectures ... 7
    2.1.1 Logic Elements ... 7
    2.1.2 Clustered Logic Blocks ... 8
    2.1.3 Programmable Interconnect ... 10
  2.2 CAD for FPGAs ... 13
    2.2.1 Timing Analysis ... 14
    2.2.2 Synthesis and Technology Mapping ... 16
    2.2.3 Clustering and Placement ... 16
    2.2.4 Floorplanning ... 19
    2.2.5 Routing ... 20
  2.3 Circuit Restructuring Optimizations ... 20
    2.3.1 Shannon’s Decomposition ... 20
    2.3.2 Sequential Retiming ... 22
    2.3.3 Clock Skew Optimization ... 25
  2.4 Early Work ... 26
    2.4.1 New Architectural Features ... 26
    2.4.2 Netlist Generation ... 29
    2.4.3 New Architectural Constraints ... 31
3 Placement Grouping Constraints ... 33
  3.1 Introduction ... 33
  3.2 APEX Architecture ... 34
  3.3 Placement with Simulated Annealing ... 37
  3.4 Timing Cost Function ... 41
  3.5 Move Schedule ... 44
    3.5.1 Floating Region Move Types ... 44
    3.5.2 Number of Moves ... 48
  3.6 Automatic Sizing ... 50
  3.7 Results ... 54
  3.8 Summary ... 61
4 Circuit Restructuring and Placement ... 62
  4.1 Introduction ... 62
  4.2 Physical Optimizations ... 65
  4.3 Summary ... 68
5 Incremental Placement ... 69
  5.1 Problem Definition ... 69
  5.2 Architectural Constraints ... 70
  5.3 The ICP Algorithm ... 72
    5.3.1 Cost Function ... 73
    5.3.2 Cluster Legality Cost ... 74
    5.3.3 Timing Cost ... 75
    5.3.4 Wirelength Cost ... 78
    5.3.5 Move Proposals ... 79
  5.4 Directed Hill-climbing ... 81
    5.4.1 Violation Shifting ... 84
  5.5 Implementation Issues ... 85
  5.6 Results ... 86
  5.7 Summary and Concluding Remarks ... 89
6 Layout-Level Retiming ... 92
  6.1 Introduction ... 92
  6.2 Critical Cycles and Cycle Slack ... 93
  6.3 Retiming Aware Cost Function ... 98
  6.4 Minimally Placement Disruptive Retiming ... 98
    6.4.1 Costing Logic Duplication ... 99
    6.4.2 Costing Pipelined Routes and Fanin Registers ... 102
    6.4.3 Solving the Minimum Cost Problem ... 103
  6.5 Experimental Results ... 104
  6.6 Architectural Implications ... 107
  6.7 Summary ... 110
7 Post-Placement Decomposition ... 111
  7.1 Shannon’s Decomposition Theorem ... 111
  7.2 Definitions ... 112
  7.3 Applying Shannon’s Decomposition ... 113
  7.4 Selecting Logic to Decompose and Duplicate ... 116
  7.5 Partial Collapsing ... 119
  7.6 Controlling Duplication ... 119
  7.7 Integration with Incremental Placement ... 121
  7.8 Iterative Application ... 122
  7.9 Results ... 123
  7.10 Verification ... 126
  7.11 Summary ... 127
8 Constrained Clock Shifting ... 128
  8.1 Introduction ... 128
  8.2 Synchronous Operation with Clock Skew ... 129
  8.3 Application to FPGAs ... 130
  8.4 Timing Uncertainties ... 131
  8.5 FPGA Clock Shift Decision Problem ... 133
    8.5.1 Decision Problem Definition ... 133
    8.5.2 Solving the Decision Problem ... 134
  8.6 FPGA Clock Shift Optimization Problem ... 136
    8.6.1 Binary Search for Best Clock Period ... 136
    8.6.2 Finding The Optimal Skew Set ... 137
  8.7 Experimental Results ... 138
  8.8 Summary ... 139
9 Conclusions and Future Work ... 141
  9.1 Summary of Contributions ... 141
  9.2 Future Work ... 142
A Data Visualization ... 146
Bibliography ... 151

List of Figures

1.1 Simplified FPGA architecture ... 1
1.2 Routing switches implement a connection ... 3
1.3 Placement constraints reduce critical path delay ... 3
1.4 Circuit restructuring ... 4
2.1 Logic Element ... 8
2.2 Clustered Logic Block ... 9
2.3 Segmented Interconnect ... 10
2.4 Implementing a connection ... 11
2.5 Routing switch implementation ... 11
2.6 Longline Interconnect ... 12
2.7 Nearest neighbor connections ... 13
2.8 FPGA CAD Flow ... 14
2.9 Critical Path analysis ... 15
2.10 FPGA Placement ... 17
2.11 Design Floorplan ... 19
2.12 Bin assignment ... 20
2.13 Application of Shannon’s Theorem ... 21
2.14 Sequential Retiming ... 22
2.15 Retiming circuit representation ... 23
2.16 Clock Shifting Example ... 25
2.17 Clock Shifting Waveform ... 25
2.18 Registered Routing Switches in a Segmented Architecture ... 27
2.19 Extra Input Register ... 27
2.20 The need for a LUT-fanin register ... 28
2.21 Alternative to LUT-fanin register ... 29
2.22 Post Place and Route Netlist ... 29
2.23 Netlist Representation ... 30
3.1 Logical to physical mapping ... 34
3.2 Simplified view of the APEX architecture ... 35
3.3 Annealing-based placement ... 38
3.4 Trapped in a local minimum ... 39
3.5 Residual overlap ... 40
3.6 Region-based annealing flow ... 41
3.7 Estimated Minimum Slack vs True Minimum Slack ... 42
3.8 Gains for the new Timing Cost function ... 43
3.9 Restricted region mobility ... 45
3.10 Worst-case effects of non-restricted mobility ... 45
3.11 Region swaps ... 46
3.12 Move selection ... 47
3.13 Placement subproblems ... 48
3.14 Rule-based region sizing ... 50
3.15 Regions with memory circuitry ... 52
3.16 Fixed parent constrains size of the child ... 52
3.17 Fixed child constrains size of the parent ... 53
3.18 Simplified automatic-sizing algorithm ... 54
3.19 Effects of algorithmic decisions ... 59
3.20 Effect of the annealing exponent on Fmax ... 60
3.21 Effect of the annealing exponent on Runtime ... 60
4.1 Three step approach ... 62
4.2 Layout driven retiming ... 63
4.3 Prediction and Mis-Prediction rates ... 65
4.4 Depth vs Criticality ... 67
5.1 Simple clustered logic block ... 71
5.2 Iterative Improvement Algorithm ... 73
5.3 Oscillations in Fmax ... 75
5.4 Effect of the Damping Cost ... 76
5.5 Slack based range window ... 77
5.6 Local Congestion Estimation ... 78
5.7 Fanin, Fanout and Sibling relationships ... 80
5.8 Critical vector ... 81
5.9 Trapped in a Local Minima ... 81
5.10 Basin Filling ... 82
5.11 Thrashing ... 83
5.12 Violation Shifting ... 84
5.13 Top Level ICP Algorithm ... 85
5.14 Updating the Overuse Coefficients ... 86
5.15 ICP Runtime ... 89
6.1 Critical Cycles ... 94
6.2 Near Critical Paths ... 94
6.3 Cycle Rate Netlist ... 95
6.4 Cycle Rate Graph ... 96
6.5 Finding the Cycle Slack ... 97
6.6 Intersecting Paths ... 98
6.7 Simplified FPGA Logic Block ... 99
6.8 Costing Logic Duplication ... 100
6.9 Costing Pipeline Routes and Fanin Registers ... 101
6.10 Effect of the relative cost ordering ... 103