ebook img

Towards adaptive balanced computing (ABC) PDF

129 Pages·2001·0.61 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Towards adaptive balanced computing (ABC)

Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs) by Hue-Sung Kim A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Computer Engineering Major Professors: Arun K. Somani and Akhilesh Tyagi Iowa State University Ames, Iowa 2001 Copyright (cid:13)c Hue-Sung Kim, 2001. All rights reserved. ii Graduate College Iowa State University This is to certify that the Doctoral dissertation of Hue-Sung Kim has met the dissertation requirements of Iowa State University Co-major Professor Co-major Professor For the Major Program For the Graduate College iii DEDICATION I would like to dedicate this thesis to my wife, Eunjung, and to my daughter, Haerin. Without their incredible support I would not have been able to complete this work. They encouraged me to follow the right direction at a moment when I was discouraged by failure. This work was completed not only by my own enthusiasm, but also by their endless attention and love. In addition, I would like to thank my friends and parents for their emotional support during the writing of this work. iv TABLE OF CONTENTS ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 General-purpose computing vs. reconfigurable computing . . . . . . . . . . . . 1 1.2 Adaptive balanced computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Motivation and approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 CHAPTER 2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Integration of processor and reconfigurable logic. . . . . . . . . . . . . . . . . . 10 2.2.1 Garp Architecture: Reconfigurable logic in a processor . . . . . . . . . . 10 2.2.2 DPGA-coupled microprocessor . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 ConCISe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Memory systems with computations . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Active pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.2 FlexRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.3 Reconfigurable caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.4 Cache tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 SIMD extension in microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Intel Pentium MMX/SSE ISA Extension for Multimedia . . . . . . . . . 16 2.4.2 AltiVec in PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Cache memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5.1 Cache memory architecture and design . . . . . . . . . . . . . . . . . . . 19 v 2.5.2 Cache memory characteristics in media applications . . . . . . . . . . . 22 2.6 Superscalar microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.7 Trends of future microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . 23 CHAPTER 3. PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . 26 CHAPTER 4. RECONFIGURABLE FUNCTIONAL CACHE (RFC) . . . 29 4.1 Multi-bit output LUTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Constant coefficient multipliers using multi-bit output LUTs . . . . . . . . . . 32 4.3 Organization and operation of a reconfigurable cache module . . . . . . . . . . 33 4.4 Access time for cache operations . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5 Configuration and scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5.1 Configuration of a computing unit . . . . . . . . . . . . . . . . . . . . . 40 4.5.2 Initial/partial reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5.3 Scheduling and controlling data flow . . . . . . . . . . . . . . . . . . . . 42 4.5.4 Number and size of LUTs in RFC . . . . . . . . . . . . . . . . . . . . . 43 CHAPTER 5. ABC MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . 45 5.1 Overview of microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2 Microarchitecture with RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2.1 Partitioned cache design . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.2.2 Cache organization with reconfigurable functional caches. . . . . . . . . 48 5.2.3 Instructions to utilize RFC . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2.4 Mechanism for the computation in RFC . . . . . . . . . . . . . . . . . . 52 5.2.5 Compiling requirements for the specialized computations . . . . . . . . 59 CHAPTER 6. EXAMPLES OF RFC . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Functions to be mapped to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1.1 Convolution (FIR filter) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1.2 DCT/IDCT (MPEG encoding/decoding) . . . . . . . . . . . . . . . . . 63 6.1.3 Reconfigurable cache merged with multi-context configurations . . . . . 69 6.2 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 vi 6.3 RFC with different cache organizations . . . . . . . . . . . . . . . . . . . . . . . 76 6.4 Execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4.2 DCT/IDCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.4.3 Multi-context reconfigurable functional cache . . . . . . . . . . . . . . . 84 CHAPTER 7. SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7.1 Computing units configured using RFC . . . . . . . . . . . . . . . . . . . . . . 86 7.2 Experimental methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7.2.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 7.2.2 Simulator and parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 88 7.3 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.4 Microprocessor with RFC vs. without RFC . . . . . . . . . . . . . . . . . . . . 104 CHAPTER 8. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 vii LIST OF TABLES Table 4.1 Comparison of access time for an 8KB cache with 128bit-wide block . 38 Table 4.2 Parameters to determine the number and size of LUTs . . . . . . . . . 44 Table 6.1 Area comparison of FIR Filters and RFC overhead for FIR. . . . . . . 72 Table 6.2 Area comparison of DCT/IDCT chips and RC overhead for DCT/IDCT 72 Table 6.3 Area comparison of Multiplier-Accumulator’s and RC overhead for one MAC stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Table 6.4 Area overhead of the combined reconfigurable cache . . . . . . . . . . . 75 Table 6.5 Parameters for the RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Table 6.6 ComparisonofexecutiontimeofConvolutionbetweenSPARCandRFC (µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Table 6.7 Comparison of execution time of the DCT/IDCT between SPARC and RFC (µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 7.1 Benchmarks used in this thesis . . . . . . . . . . . . . . . . . . . . . . 87 Table 7.2 The base processor parameters with RFC and without RFC in Sim- plescalar simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Table 7.3 Number of cycles for the configuration in benchmarks . . . . . . . . . . 98 Table 7.4 Configuration cycles with different cache organizations (32K - 4way) . 100 Table 7.5 Number of data accesses and misses to level-1 data cache . . . . . . . . 101 viii LIST OF FIGURES Figure 2.1 FAGA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.2 An n input Look-Up Table (LUT) with one-bit output . . . . . . . . . 9 Figure 2.3 4-LUT with 4-context DRAM . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.4 Conventional Cache memory structure . . . . . . . . . . . . . . . . . . 21 Figure 2.5 A typical superscalar microprocessor . . . . . . . . . . . . . . . . . . . 23 Figure 4.1 Multi-output LUTs : (a) A 2-bit adder ; (b) A 2x2 or a 4x2 constant coefficient multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Figure 4.2 8bit adder using (a) two 9-LUTs ; (b) two 8-LUTs; (c) four 4-LUTs . . 31 Figure 4.3 A 16×16 multiplier : (a) top level of 16×16 multiplier ; (b) hierarchical structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 4.4 Cache architecture in the reconfigurable module . . . . . . . . . . . . . 35 Figure 4.5 Parallel decode cache architecture (Base array cache) for faster cache access time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Figure 5.1 Overview of a processor with multiple reconfigurable cache modules . . 45 Figure 5.2 Partitioned cache for multiple modules . . . . . . . . . . . . . . . . . . 47 Figure 5.3 CacheorganizationsandaddressmappingwithRFCs(a)4(b)16cache modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Figure 5.4 rfc instructions for loading and storing “word” type of data . . . . . . 51 Figure 5.5 State transition for the RFC status . . . . . . . . . . . . . . . . . . . . 53 Figure 5.6 (a) Overview of I/O buffers organization; (b) I/O buffers to dedicated to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 ix Figure 5.7 A basic frame code using RFC as specialized computing units . . . . . 60 Figure 6.1 (a) One stage of Convolution; (b) Array of LUTs for one stage of Con- volution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 6.2 (a) DCT/IDCT processing element; (b) Array of LUTs for DCT/IDCT processing element with the input registers . . . . . . . . . . . . . . . . 66 Figure 6.3 Data flow of the computation process for 8×8 2-D DCT transform . . 68 Figure 6.4 AnarrayofLUTsforthecombinedRFCwithConvolutionandDCT/IDCT 69 Figure 6.5 A possible layout of RFC for one stage of the Convolution . . . . . . . 70 Figure 6.6 A filter (16mult-32acc) with (a) 128 sets and 64 bytes/line (8KB); (b) 256 sets and 32 bytes/line (8KB); (c) 256 sets and 64 bytes/line (16KB) 77 Figure 6.7 Ratio of execution time of RFC and GPP for Convolution: (a) with- out memory flush; (b) with memory flush before converting into the computing unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Figure 6.8 Ratio of execution time of RFC and GPP for DCT/IDCT with and without ‘flush time’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Figure 7.1 Normalized execution cycles in the base processor w/o and w/ RFC . . 92 Figure 7.2 Speed-up of benchmarks using RFCs (overall and core computation) . 93 Figure 7.3 Normalized execution cycles with various cache organizations . . . . . 95 Figure 7.4 Normalized execution cycles with various cache organizations . . . . . 96 Figure 7.5 Normalized execution cycles for mpeg2enc with Full Dynamic Associa- tivity (FDA) and Partial Dynamic Associativity (PDA). . . . . . . . . 97 Figure 7.6 Miss rate with Full Dynamic Associativity (FDA) and Partial Dynamic Associativity (PDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Figure 7.7 Normalized execution cycles w/o RFC and w/ RFC built in 4-way as- sociative cache (FDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Figure 7.8 Miss rate for level-1 data cache memory . . . . . . . . . . . . . . . . . 101 x Figure 7.9 Memory bandwidth required per CPU cycle with a 32K - 4way set associative cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Figure 7.10 Miss rate with 3-way and 4-way set associative cache for SPEC-95 (a)32KB ; (b)128KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Description:
4.2 Constant coefficient multipliers using multi-bit output LUTs . reconfiguration is about 50-60% of the memory cell cache array-only area .. computational memory system models attach logic to a conventional To initiate tasks in FlexRAM, the host processor should send a signal to the small RISC.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.