Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs) by Hue-Sung Kim A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Computer Engineering Major Professors: Arun K. Somani and Akhilesh Tyagi Iowa State University Ames, Iowa 2001 Copyright (cid:13)c Hue-Sung Kim, 2001. All rights reserved. ii Graduate College Iowa State University This is to certify that the Doctoral dissertation of Hue-Sung Kim has met the dissertation requirements of Iowa State University Co-major Professor Co-major Professor For the Major Program For the Graduate College iii DEDICATION I would like to dedicate this thesis to my wife, Eunjung, and to my daughter, Haerin. Without their incredible support I would not have been able to complete this work. They encouraged me to follow the right direction at a moment when I was discouraged by failure. This work was completed not only by my own enthusiasm, but also by their endless attention and love. In addition, I would like to thank my friends and parents for their emotional support during the writing of this work. iv TABLE OF CONTENTS ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 General-purpose computing vs. reconfigurable computing . . . . . . . . . . . . 1 1.2 Adaptive balanced computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Motivation and approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 CHAPTER 2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Integration of processor and reconfigurable logic. . . . . . . . . . . . . . . . . . 10 2.2.1 Garp Architecture: Reconfigurable logic in a processor . . . . . . . . . . 10 2.2.2 DPGA-coupled microprocessor . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 ConCISe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Memory systems with computations . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Active pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.2 FlexRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.3 Reconfigurable caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.4 Cache tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 SIMD extension in microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Intel Pentium MMX/SSE ISA Extension for Multimedia . . . . . . . . . 16 2.4.2 AltiVec in PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Cache memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5.1 Cache memory architecture and design . . . . . . . . . . . . . . . . . . . 19 v 2.5.2 Cache memory characteristics in media applications . . . . . . . . . . . 22 2.6 Superscalar microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.7 Trends of future microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . 23 CHAPTER 3. PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . 26 CHAPTER 4. RECONFIGURABLE FUNCTIONAL CACHE (RFC) . . . 29 4.1 Multi-bit output LUTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Constant coefficient multipliers using multi-bit output LUTs . . . . . . . . . . 32 4.3 Organization and operation of a reconfigurable cache module . . . . . . . . . . 33 4.4 Access time for cache operations . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5 Configuration and scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5.1 Configuration of a computing unit . . . . . . . . . . . . . . . . . . . . . 40 4.5.2 Initial/partial reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.3 Scheduling and controlling data flow . . . . . . . . . . . . . . . . . . . . 42 4.5.4 Number and size of LUTs in RFC . . . . . . . . . . . . . . . . . . . . . 43 CHAPTER 5. ABC MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . 45 5.1 Overview of microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2 Microarchitecture with RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2.1 Partitioned cache design . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.2.2 Cache organization with reconfigurable functional caches. . . . . . . . . 48 5.2.3 Instructions to utilize RFC . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2.4 Mechanism for the computation in RFC . . . . . . . . . . . . . . . . . . 52 5.2.5 Compiling requirements for the specialized computations . . . . . . . . 59 CHAPTER 6. EXAMPLES OF RFC . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Functions to be mapped to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1.1 Convolution (FIR filter) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1.2 DCT/IDCT (MPEG encoding/decoding) . . . . . . . . . . . . . . . . . 63 6.1.3 Reconfigurable cache merged with multi-context configurations . . . . . 69 6.2 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 vi 6.3 RFC with different cache organizations . . . . . . . . . . . . . . . . . . . . . . . 76 6.4 Execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4.2 DCT/IDCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.4.3 Multi-context reconfigurable functional cache . . . . . . . . . . . . . . . 84 CHAPTER 7. SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7.1 Computing units configured using RFC . . . . . . . . . . . . . . . . . . . . . . 86 7.2 Experimental methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7.2.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7.2.2 Simulator and parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 88 7.3 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.4 Microprocessor with RFC vs. without RFC . . . . . . . . . . . . . . . . . . . . 104 CHAPTER 8. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 vii LIST OF TABLES Table 4.1 Comparison of access time for an 8KB cache with 128bit-wide block . 38 Table 4.2 Parameters to determine the number and size of LUTs . . . . . . . . . 44 Table 6.1 Area comparison of FIR Filters and RFC overhead for FIR. . . . . . . 72 Table 6.2 Area comparison of DCT/IDCT chips and RC overhead for DCT/IDCT 72 Table 6.3 Area comparison of Multiplier-Accumulator’s and RC overhead for one MAC stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Table 6.4 Area overhead of the combined reconfigurable cache . . . . . . . . . . . 75 Table 6.5 Parameters for the RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Table 6.6 ComparisonofexecutiontimeofConvolutionbetweenSPARCandRFC (µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Table 6.7 Comparison of execution time of the DCT/IDCT between SPARC and RFC (µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 7.1 Benchmarks used in this thesis . . . . . . . . . . . . . . . . . . . . . . 87 Table 7.2 The base processor parameters with RFC and without RFC in Sim- plescalar simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 7.3 Number of cycles for the configuration in benchmarks . . . . . . . . . . 98 Table 7.4 Configuration cycles with different cache organizations (32K - 4way) . 100 Table 7.5 Number of data accesses and misses to level-1 data cache . . . . . . . . 101 viii LIST OF FIGURES Figure 2.1 FAGA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.2 An n input Look-Up Table (LUT) with one-bit output . . . . . . . . . 9 Figure 2.3 4-LUT with 4-context DRAM . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.4 Conventional Cache memory structure . . . . . . . . . . . . . . . . . . 21 Figure 2.5 A typical superscalar microprocessor . . . . . . . . . . . . . . . . . . . 23 Figure 4.1 Multi-output LUTs : (a) A 2-bit adder ; (b) A 2x2 or a 4x2 constant coefficient multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Figure 4.2 8bit adder using (a) two 9-LUTs ; (b) two 8-LUTs; (c) four 4-LUTs . . 31 Figure 4.3 A 16×16 multiplier : (a) top level of 16×16 multiplier ; (b) hierarchical structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 4.4 Cache architecture in the reconfigurable module . . . . . . . . . . . . . 35 Figure 4.5 Parallel decode cache architecture (Base array cache) for faster cache access time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Figure 5.1 Overview of a processor with multiple reconfigurable cache modules . . 45 Figure 5.2 Partitioned cache for multiple modules . . . . . . . . . . . . . . . . . . 47 Figure 5.3 CacheorganizationsandaddressmappingwithRFCs(a)4(b)16cache modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Figure 5.4 rfc instructions for loading and storing “word” type of data . . . . . . 51 Figure 5.5 State transition for the RFC status . . . . . . . . . . . . . . . . . . . . 53 Figure 5.6 (a) Overview of I/O buffers organization; (b) I/O buffers to dedicated to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 ix Figure 5.7 A basic frame code using RFC as specialized computing units . . . . . 60 Figure 6.1 (a) One stage of Convolution; (b) Array of LUTs for one stage of Con- volution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 6.2 (a) DCT/IDCT processing element; (b) Array of LUTs for DCT/IDCT processing element with the input registers . . . . . . . . . . . . . . . . 66 Figure 6.3 Data flow of the computation process for 8×8 2-D DCT transform . . 68 Figure 6.4 AnarrayofLUTsforthecombinedRFCwithConvolutionandDCT/IDCT 69 Figure 6.5 A possible layout of RFC for one stage of the Convolution . . . . . . . 70 Figure 6.6 A filter (16mult-32acc) with (a) 128 sets and 64 bytes/line (8KB); (b) 256 sets and 32 bytes/line (8KB); (c) 256 sets and 64 bytes/line (16KB) 77 Figure 6.7 Ratio of execution time of RFC and GPP for Convolution: (a) with- out memory flush; (b) with memory flush before converting into the computing unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Figure 6.8 Ratio of execution time of RFC and GPP for DCT/IDCT with and without ‘flush time’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Figure 7.1 Normalized execution cycles in the base processor w/o and w/ RFC . . 92 Figure 7.2 Speed-up of benchmarks using RFCs (overall and core computation) . 93 Figure 7.3 Normalized execution cycles with various cache organizations . . . . . 95 Figure 7.4 Normalized execution cycles with various cache organizations . . . . . 96 Figure 7.5 Normalized execution cycles for mpeg2enc with Full Dynamic Associa- tivity (FDA) and Partial Dynamic Associativity (PDA). . . . . . . . . 97 Figure 7.6 Miss rate with Full Dynamic Associativity (FDA) and Partial Dynamic Associativity (PDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Figure 7.7 Normalized execution cycles w/o RFC and w/ RFC built in 4-way as- sociative cache (FDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Figure 7.8 Miss rate for level-1 data cache memory . . . . . . . . . . . . . . . . . 101 x Figure 7.9 Memory bandwidth required per CPU cycle with a 32K - 4way set associative cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Figure 7.10 Miss rate with 3-way and 4-way set associative cache for SPEC-95 (a)32KB ; (b)128KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Description: