Parallelizing Simulated Annealing Placement for GPGPU

by

Alexander Choong

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2010 by Alexander Choong


Abstract

Parallelizing Simulated Annealing Placement for GPGPU
Alexander Choong
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2010

Field Programmable Gate Array (FPGA) devices are increasing in capacity at an exponential rate, and thus there is an increasingly strong demand to accelerate simulated annealing placement. Graphics Processing Units (GPUs) offer a unique opportunity to accelerate simulated annealing placement on a manycore architecture using only commodity hardware. GPUs are optimized for applications that can tolerate single-thread latency, which allows them to provide high throughput across many threads. However, simulated annealing is not embarrassingly parallel, so single-thread latency must be minimized to improve run time. It is therefore questionable whether GPUs can achieve any speedup over a sequential implementation. In this thesis, a novel subset-based simulated annealing placement framework is proposed which specifically targets the GPU architecture. A highly optimized framework is implemented which, on average, achieves an order of magnitude speedup with less than 1% degradation in wirelength and no loss in timing quality on realistic architectures.
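For readers unfamiliar with the underlying algorithm, the sketch below illustrates a conventional sequential simulated annealing placement loop: blocks are swapped at random, and each swap is accepted or rejected by the Metropolis criterion against a half-perimeter wirelength (HPWL) cost. This is a minimal sketch under assumed names (Block, Net, net_hpwl, anneal), not code from this thesis; a production placer would also update net bounding boxes incrementally rather than recomputing the total cost after every move.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

struct Block { int x, y; };          // placement location of one logic block
using Net = std::vector<int>;        // indices of the blocks connected by a net

// HPWL of one net: half-perimeter of the bounding box enclosing its blocks.
double net_hpwl(const std::vector<Block>& p, const Net& net) {
    int xmin = p[net[0]].x, xmax = xmin, ymin = p[net[0]].y, ymax = ymin;
    for (int b : net) {
        xmin = std::min(xmin, p[b].x); xmax = std::max(xmax, p[b].x);
        ymin = std::min(ymin, p[b].y); ymax = std::max(ymax, p[b].y);
    }
    return (xmax - xmin) + (ymax - ymin);
}

double total_hpwl(const std::vector<Block>& p, const std::vector<Net>& nets) {
    double cost = 0.0;
    for (const Net& n : nets) cost += net_hpwl(p, n);
    return cost;
}

// Sequential annealing loop: propose block swaps and accept them with the
// Metropolis rule, cooling the temperature geometrically between rounds.
void anneal(std::vector<Block>& place, const std::vector<Net>& nets,
            double temp, double cool = 0.95, int moves_per_temp = 1000) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, static_cast<int>(place.size()) - 1);
    double cost = total_hpwl(place, nets);
    while (temp > 1e-3) {
        for (int m = 0; m < moves_per_temp; ++m) {
            int a = pick(rng), b = pick(rng);
            std::swap(place[a], place[b]);              // propose: exchange locations
            double new_cost = total_hpwl(place, nets);  // (a real placer updates this incrementally)
            double dc = new_cost - cost;
            if (dc <= 0.0 || uni(rng) < std::exp(-dc / temp))
                cost = new_cost;                        // accept the move
            else
                std::swap(place[a], place[b]);          // reject: undo the swap
        }
        temp *= cool;                                   // geometric cooling schedule
    }
}

int main() {
    std::vector<Block> place = {{0, 0}, {5, 0}, {0, 5}, {5, 5}};  // four blocks on a toy grid
    std::vector<Net> nets = {{0, 1, 2}, {1, 3}};                  // two toy nets
    anneal(place, nets, /*temp=*/10.0);
    std::cout << "final HPWL: " << total_hpwl(place, nets) << "\n";
}
```

Because every accepted swap changes the cost seen by all subsequent moves, this loop is inherently sequential; that dependency is the sense in which simulated annealing is not embarrassingly parallel, and it is what the subset-based framework described in Chapter 3 is designed to address.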
Acknowledgements

Professor Jianwen Zhu has been an insightful and patient advisor over the course of this thesis. The experience with him has certainly been enlightening and unforgettable.

I would like to show my appreciation for the time, kindness and assistance I received from Andrew, Edward, Eugene, Hannah, Kelvin, Linda, Rami and Shikuan, especially Andrew, Hannah and Rami for showing me the ropes. This research was generously funded by NSERC.

Thanks and acknowledgement must be given to Professor Jonathan Rose and Professor Jason Anderson for their insightful advice, their valuable time and their kind words. I would also like to thank them, as well as Professor Teng Joon Lim, for being on my committee.

To my dear friends: Chuck, David, Dharmendra, Diego, Kaveh, Nick, Wendy, Xun and Zefu. I am indebted to you for the support and advice you have given me, as well as your swift and heartfelt aid whenever I needed help. My years in graduate school were made so much more pleasant because of you. A special thanks to Diego, Wendy and Zefu for helping me to revise this thesis.

Most of all, I must and very eagerly acknowledge the love, patience, and support of my family. Without them, I would not have been able to complete this thesis. At the moment, words fail to describe the vast and immense appreciation I have for everything they have given me.


For shallow draughts intoxicate the brain
And drinking largely sobers us again.
Fired at first sight with what the Muse imparts,
In fearless youth we tempt the heights of arts,
While from the bounded level of our mind,
Short views we take, nor see the lengths behind;
But more advanced, behold with strange surprise
New distant scenes of endless science rise!

- Alexander Pope, An Essay on Criticism (1709)


Contents

List of Tables  viii
List of Figures  x
List of Algorithms  xi

1 Introduction  1
  1.1 Motivation  1
  1.2 Problem Statement  3
  1.3 Contributions  4
  1.4 Thesis Overview  4

2 Background  5
  2.1 FPGA Placement Problem  5
  2.2 Simulated Annealing Placement  8
  2.3 Previous Work  10
  2.4 GPU Parallel Architecture  15
    2.4.1 Execution Model  16
    2.4.2 Hiding Memory Latency  17
    2.4.3 Branch Divergence  18
  2.5 Summary  20

3 Subset-based Simulated Annealing Placement Framework  21
  3.1 Challenges for Simulated Annealing Placement using GPGPU  21
    3.1.1 Memory Latency  22
    3.1.2 Branch Divergence  22
    3.1.3 Consistency, Convergence and Scalability  23
  3.2 Resolving Challenges  23
  3.3 Subset-based Simulated Annealing Framework  24
    3.3.1 Move Biasing  25
  3.4 Subset Generation  26
  3.5 Parallel Moves on GPGPU  28
  3.6 Improving Run Time  31
    3.6.1 Subset Generation on CPU  31
    3.6.2 Subset Generation Optimizations  31
    3.6.3 Parallel Annealing Optimizations  37
  3.7 Summary  37

4 Wirelength-Driven and Timing-Driven Metrics  38
  4.1 HPWL Metric and Pre-Bounding Box  38
    4.1.1 Pre-Bounding Box Optimization  42
  4.2 Challenges with Timing-Driven Placement using GPGPU  43
    4.2.1 Challenge with VPR's Metric  43
    4.2.2 Challenge with Net-Weighting Metric  46
    4.2.3 Resolving Challenges  47
    4.2.4 Investigating Sum Operator  47
    4.2.5 Investigating and Resolving Cases with High Fanout  49
  4.3 Summary  53

5 Evaluation and Analysis  54
  5.1 Evaluation Methodology  54
    5.1.1 Benchmarks  54
    5.1.2 Sequential Simulated Annealing Placer  55
    5.1.3 Hardware Setup  55
  5.2 Parameters for GPGPU Framework  57
    5.2.1 Summary of Parameter Selection  68
  5.3 Results  68
    5.3.1 Wirelength-Driven Placement  70
    5.3.2 Timing-Driven Placement  77
  5.4 Analysis of Properties  82
    5.4.1 Determinism  82
    5.4.2 Error Tolerance  83
    5.4.3 Scalability  85
  5.5 Summary  90

6 Conclusion and Future Work  95

Bibliography  96


List of Tables

2.1 Mapping between threads, CUDA blocks and grids to hardware resources  16
4.1 Parameters and Shared Memory Usage  44
4.2 Shared Memory Usage for Each Cluster Size  46
4.3 Quality of Results for Sum Operator  48
4.4 Quality of Results for Max Operator  50
5.1 Stitched ITC99 Benchmark Sizes  56
5.2 Impact of Pre-Bounding Box Optimization  69
5.3 Parameters Used  70
5.4 Wirelength-Driven Results  71
5.5 Wirelength-Driven Results for Sequential Version  72
5.6 Average Time Per Move for CPU and Netlist Size  75
5.7 Average Time Per Kernel for GPU and Netlist Size  76
5.8 Timing-Driven Results  78
5.9 Post-Routing Results  79
5.10 Timing-Driven Results for Sequential Version  80
5.11 Post-Routing Results for Sequential Version  81
5.12 Wirelength-Driven Results with No Concurrent GPU and CPU Execution  84
5.13 Comparing Specification of the GTX280 to GTX480  87
5.14 Parameters Used for GTX480  87
5.15 Wirelength-Driven Results for GTX480  89
5.16 Placement-Estimated Results with 1.5x More Moves  91
5.17 Post-Routing Results with 1.5x More Moves  92
5.18 Placement-Estimated Results  93
5.19 Post-Routing Results  94


List of Figures

1.1 FPGA Size vs. CPU and GPU Performance  2
2.1 HPWL for a Net  6
2.2 Non-Interleaved and Interleaved Memory Requests  18
2.3 Example of Branch Divergence  19
3.1 Distribution of Threads for Each Stage of Parallel Annealing  32
3.2 Non-streaming and Streamed Memory Access Patterns  34
3.3 Overview of Reuse  36
4.1 Pre-bounding box for a net of 4 blocks with two blocks in the subset  42
4.2 Problematic High Fanout Case  49
5.1 Impact of Number of Subsets  59
5.2 Impact of Subset Size  60
5.3 Impact of Number of Moves  61
5.4 Impact of High Temperature Reuse  63
5.5 Impact of Low Temperature Reuse  64
5.6 Impact of Number of Subset Groups Stored  66
5.7 Impact of Queue Size  67
5.8 Trend in Speedup and Number of Blocks for Wirelength-Driven GPGPU Placer  74
5.9 Trend in Speedup and Number of Blocks for Timing-Driven GPGPU Placer  82
