Table Of ContentTowards adaptive balanced computing (ABC) using reconfigurable functional
caches (RFCs)
by
Hue-Sung Kim
A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Major: Computer Engineering
Major Professors: Arun K. Somani and Akhilesh Tyagi
Iowa State University
Ames, Iowa
2001
Copyright (cid:13)c Hue-Sung Kim, 2001. All rights reserved.
ii
Graduate College
Iowa State University
This is to certify that the Doctoral dissertation of
Hue-Sung Kim
has met the dissertation requirements of Iowa State University
Co-major Professor
Co-major Professor
For the Major Program
For the Graduate College
iii
DEDICATION
I would like to dedicate this thesis to my wife, Eunjung, and to my daughter, Haerin.
Without their incredible support I would not have been able to complete this work. They
encouraged me to follow the right direction at a moment when I was discouraged by failure.
This work was completed not only by my own enthusiasm, but also by their endless attention
and love. In addition, I would like to thank my friends and parents for their emotional support
during the writing of this work.
iv
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 General-purpose computing vs. reconfigurable computing . . . . . . . . . . . . 1
1.2 Adaptive balanced computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation and approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
CHAPTER 2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Integration of processor and reconfigurable logic. . . . . . . . . . . . . . . . . . 10
2.2.1 Garp Architecture: Reconfigurable logic in a processor . . . . . . . . . . 10
2.2.2 DPGA-coupled microprocessor . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 ConCISe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Memory systems with computations . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Active pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 FlexRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Reconfigurable caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 Cache tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 SIMD extension in microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Intel Pentium MMX/SSE ISA Extension for Multimedia . . . . . . . . . 16
2.4.2 AltiVec in PowerPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Cache memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Cache memory architecture and design . . . . . . . . . . . . . . . . . . . 19
v
2.5.2 Cache memory characteristics in media applications . . . . . . . . . . . 22
2.6 Superscalar microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Trends of future microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . 23
CHAPTER 3. PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . 26
CHAPTER 4. RECONFIGURABLE FUNCTIONAL CACHE (RFC) . . . 29
4.1 Multi-bit output LUTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Constant coefficient multipliers using multi-bit output LUTs . . . . . . . . . . 32
4.3 Organization and operation of a reconfigurable cache module . . . . . . . . . . 33
4.4 Access time for cache operations . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Configuration and scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.1 Configuration of a computing unit . . . . . . . . . . . . . . . . . . . . . 40
4.5.2 Initial/partial reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.3 Scheduling and controlling data flow . . . . . . . . . . . . . . . . . . . . 42
4.5.4 Number and size of LUTs in RFC . . . . . . . . . . . . . . . . . . . . . 43
CHAPTER 5. ABC MICROPROCESSOR . . . . . . . . . . . . . . . . . . . . 45
5.1 Overview of microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Microarchitecture with RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 Partitioned cache design . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.2 Cache organization with reconfigurable functional caches. . . . . . . . . 48
5.2.3 Instructions to utilize RFC . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.4 Mechanism for the computation in RFC . . . . . . . . . . . . . . . . . . 52
5.2.5 Compiling requirements for the specialized computations . . . . . . . . 59
CHAPTER 6. EXAMPLES OF RFC . . . . . . . . . . . . . . . . . . . . . . . 61
6.1 Functions to be mapped to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1.1 Convolution (FIR filter) . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1.2 DCT/IDCT (MPEG encoding/decoding) . . . . . . . . . . . . . . . . . 63
6.1.3 Reconfigurable cache merged with multi-context configurations . . . . . 69
6.2 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
vi
6.3 RFC with different cache organizations . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4.2 DCT/IDCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4.3 Multi-context reconfigurable functional cache . . . . . . . . . . . . . . . 84
CHAPTER 7. SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1 Computing units configured using RFC . . . . . . . . . . . . . . . . . . . . . . 86
7.2 Experimental methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2.2 Simulator and parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.4 Microprocessor with RFC vs. without RFC . . . . . . . . . . . . . . . . . . . . 104
CHAPTER 8. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
vii
LIST OF TABLES
Table 4.1 Comparison of access time for an 8KB cache with 128bit-wide block . 38
Table 4.2 Parameters to determine the number and size of LUTs . . . . . . . . . 44
Table 6.1 Area comparison of FIR Filters and RFC overhead for FIR. . . . . . . 72
Table 6.2 Area comparison of DCT/IDCT chips and RC overhead for DCT/IDCT 72
Table 6.3 Area comparison of Multiplier-Accumulator’s and RC overhead for one
MAC stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Table 6.4 Area overhead of the combined reconfigurable cache . . . . . . . . . . . 75
Table 6.5 Parameters for the RFCs . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Table 6.6 ComparisonofexecutiontimeofConvolutionbetweenSPARCandRFC
(µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Table 6.7 Comparison of execution time of the DCT/IDCT between SPARC and
RFC (µsec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Table 7.1 Benchmarks used in this thesis . . . . . . . . . . . . . . . . . . . . . . 87
Table 7.2 The base processor parameters with RFC and without RFC in Sim-
plescalar simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 7.3 Number of cycles for the configuration in benchmarks . . . . . . . . . . 98
Table 7.4 Configuration cycles with different cache organizations (32K - 4way) . 100
Table 7.5 Number of data accesses and misses to level-1 data cache . . . . . . . . 101
viii
LIST OF FIGURES
Figure 2.1 FAGA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 2.2 An n input Look-Up Table (LUT) with one-bit output . . . . . . . . . 9
Figure 2.3 4-LUT with 4-context DRAM . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 2.4 Conventional Cache memory structure . . . . . . . . . . . . . . . . . . 21
Figure 2.5 A typical superscalar microprocessor . . . . . . . . . . . . . . . . . . . 23
Figure 4.1 Multi-output LUTs : (a) A 2-bit adder ; (b) A 2x2 or a 4x2 constant
coefficient multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 4.2 8bit adder using (a) two 9-LUTs ; (b) two 8-LUTs; (c) four 4-LUTs . . 31
Figure 4.3 A 16×16 multiplier : (a) top level of 16×16 multiplier ; (b) hierarchical
structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 4.4 Cache architecture in the reconfigurable module . . . . . . . . . . . . . 35
Figure 4.5 Parallel decode cache architecture (Base array cache) for faster cache
access time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 5.1 Overview of a processor with multiple reconfigurable cache modules . . 45
Figure 5.2 Partitioned cache for multiple modules . . . . . . . . . . . . . . . . . . 47
Figure 5.3 CacheorganizationsandaddressmappingwithRFCs(a)4(b)16cache
modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Figure 5.4 rfc instructions for loading and storing “word” type of data . . . . . . 51
Figure 5.5 State transition for the RFC status . . . . . . . . . . . . . . . . . . . . 53
Figure 5.6 (a) Overview of I/O buffers organization; (b) I/O buffers to dedicated
to RFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
ix
Figure 5.7 A basic frame code using RFC as specialized computing units . . . . . 60
Figure 6.1 (a) One stage of Convolution; (b) Array of LUTs for one stage of Con-
volution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 6.2 (a) DCT/IDCT processing element; (b) Array of LUTs for DCT/IDCT
processing element with the input registers . . . . . . . . . . . . . . . . 66
Figure 6.3 Data flow of the computation process for 8×8 2-D DCT transform . . 68
Figure 6.4 AnarrayofLUTsforthecombinedRFCwithConvolutionandDCT/IDCT 69
Figure 6.5 A possible layout of RFC for one stage of the Convolution . . . . . . . 70
Figure 6.6 A filter (16mult-32acc) with (a) 128 sets and 64 bytes/line (8KB); (b)
256 sets and 32 bytes/line (8KB); (c) 256 sets and 64 bytes/line (16KB) 77
Figure 6.7 Ratio of execution time of RFC and GPP for Convolution: (a) with-
out memory flush; (b) with memory flush before converting into the
computing unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 6.8 Ratio of execution time of RFC and GPP for DCT/IDCT with and
without ‘flush time’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Figure 7.1 Normalized execution cycles in the base processor w/o and w/ RFC . . 92
Figure 7.2 Speed-up of benchmarks using RFCs (overall and core computation) . 93
Figure 7.3 Normalized execution cycles with various cache organizations . . . . . 95
Figure 7.4 Normalized execution cycles with various cache organizations . . . . . 96
Figure 7.5 Normalized execution cycles for mpeg2enc with Full Dynamic Associa-
tivity (FDA) and Partial Dynamic Associativity (PDA). . . . . . . . . 97
Figure 7.6 Miss rate with Full Dynamic Associativity (FDA) and Partial Dynamic
Associativity (PDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Figure 7.7 Normalized execution cycles w/o RFC and w/ RFC built in 4-way as-
sociative cache (FDA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Figure 7.8 Miss rate for level-1 data cache memory . . . . . . . . . . . . . . . . . 101
x
Figure 7.9 Memory bandwidth required per CPU cycle with a 32K - 4way set
associative cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Figure 7.10 Miss rate with 3-way and 4-way set associative cache for SPEC-95
(a)32KB ; (b)128KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Description:4.2 Constant coefficient multipliers using multi-bit output LUTs . reconfiguration is about 50-60% of the memory cell cache array-only area .. computational memory system models attach logic to a conventional To initiate tasks in FlexRAM, the host processor should send a signal to the small RISC.