ABSTRACT

Title of dissertation: HEAP DATA ALLOCATION TO SCRATCH-PAD MEMORY IN EMBEDDED SYSTEMS

Angel Dominguez, Doctor of Philosophy, 2007

Dissertation directed by: Professor Rajeev K. Barua, Department of Electrical and Computer Engineering

This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees compared with a cache and by its significantly lower overheads in access time, energy consumption, area and overall runtime.

Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects, which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data.

In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move heap data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack and heap variables can share the same scratch-pad.
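The contrast drawn above between software caching and compiler-directed placement can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, the direct-mapped tag scheme, the memory sizes and the access pattern are all hypothetical and are not taken from the thesis. It shows a software cache paying one tag check per access, while a compiler-directed scratch-pad pays only for explicit copies at compiler-chosen points and then reads by direct offset.

```python
# Toy model only: all names, sizes and the access pattern are illustrative
# assumptions, not the thesis's implementation.

class SoftwareCache:
    """Run-time software caching: every access translates the address and checks a tag."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines   # one tag per scratch-pad line
        self.data = [0] * num_lines
        self.tag_checks = 0              # counts the per-access overhead

    def load(self, addr, dram):
        index = addr % len(self.tags)    # software address translation
        self.tag_checks += 1
        if self.tags[index] != addr:     # miss: fetch the word from DRAM
            self.tags[index] = addr
            self.data[index] = dram[addr]
        return self.data[index]


class CompilerDirectedSPM:
    """Compiler-directed placement: data moves only at compiler-chosen program points."""

    def __init__(self, size):
        self.spm = [0] * size
        self.transfers = 0
        self.tag_checks = 0              # stays 0: there are no tags to check

    def copy_in(self, dram, base, length):
        # Stands in for transfer code the compiler would insert at a region boundary.
        self.spm[:length] = dram[base:base + length]
        self.transfers += 1

    def copy_out(self, dram, base, length):
        dram[base:base + length] = self.spm[:length]
        self.transfers += 1

    def load(self, offset):
        return self.spm[offset]          # direct access, no translation


# A hot loop that repeatedly touches eight words of (hypothetical) heap data.
dram = list(range(100, 164))
hot = [16 + (i % 8) for i in range(1000)]

cache = SoftwareCache(num_lines=8)
sum_cache = sum(cache.load(addr, dram) for addr in hot)

spm = CompilerDirectedSPM(size=8)
spm.copy_in(dram, base=16, length=8)     # one transfer before the loop
sum_spm = sum(spm.load(addr - 16) for addr in hot)
```

Both variants read the same values, but the software cache performs a tag check on each of the 1000 accesses, whereas the compiler-directed version performs a single transfer and no tag checks. The model ignores latency and energy entirely; it is meant only to make points (ii) and (iii) concrete.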
When compared to placing all dynamic data variables in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3%, and the average power consumption by 26.7%, for the same size of scratch-pad fixed at 5% of total data size. Significant savings in runtime and energy were also observed when compared against cached memory organizations, showing our method's success with SPM placement of dynamic data under constrained memory sizes.

HEAP DATA ALLOCATION TO SCRATCH-PAD MEMORY IN EMBEDDED SYSTEMS

by Angel Dominguez

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2007

Advisory Committee:
Professor Rajeev K. Barua, Chair/Advisor
Professor Manoj Franklin
Professor Shuvra S. Bhattacharyya
Professor Peter Petrov
Professor Chau-Wen Tseng

© Copyright by Angel Dominguez 2007

Table of Contents

List of Figures
1 Introduction
  1.1 Organization of Thesis
2 Embedded Systems and Software Development
  2.1 Embedded Systems
  2.2 Intel StrongARM Microprocessor
  2.3 Embedded Software Development
  2.4 C Language Compilers
  2.5 Heap Data Allocation
  2.6 Recursive Functions
3 Previous Work on SPM allocation
  3.1 Overview of Related Research
  3.2 Static SPM Allocation Methods
  3.3 Dynamic SPM Allocation Techniques
  3.4 Existing Methods For Dynamic Program Data
  3.5 Heap-to-Stack Conversion Techniques
  3.6 Memory Hierarchy Research
  3.7 Dynamic Memory Manager Research
  3.8 Other Related Methods
4 Dynamic allocation of static program data
  4.1 Overview for static program allocation
  4.2 The Dynamic Program Region Graph
  4.3 Allocation Method for Code, Stack and Global Objects
  4.4 Algorithm Modifications
  4.5 Layout and Code Generation
  4.6 Summary of results
5 Dynamic program data
  5.1 Understanding dynamic data in software
  5.2 Obstacles to optimizing software with dynamic data
  5.3 Creating the DPRG with dynamic data
6 Compiler allocation of dynamic data
  6.1 Overview of our SPM allocation method for dynamic data
  6.2 Preparing the DPRG for allocation
  6.3 Calculating Heap Bin Allocation Sizes
  6.4 Overview of the iterative portion
  6.5 Transfer Minimizations
  6.6 Heap Safety Transformations
  6.7 Memory Layout Technique for Address Assignment
  6.8 Feedback Driven Transformations
  6.9 Termination of Iterative steps
  6.10 Code generation for optimized binaries
7 Robust dynamic data handling
  7.1 General Optimizations
  7.2 Recursive function stack handling
  7.3 Compile-time Unknown-Size Heap Objects
  7.4 Profile Sensitivity
8 Methodology
  8.1 Target Hardware Platform
  8.2 Software Platform Requirements
  8.3 Compiler Implementation
  8.4 Simulation Platform
  8.5 Benchmark Overview
  8.6 Benchmark Classes
  8.7 Benchmark Suite
9 Results
  9.1 Dynamic Heap Allocation Results
    9.1.1 Runtime and energy gain
    9.1.2 Transfer Method Comparison
    9.1.3 Reduction in Heap DRAM Accesses
    9.1.4 Effect of varying SPM size
  9.2 Unknown-size Heap Allocation
  9.3 Recursive Function Allocation
  9.4 Comparison with caches
  9.5 Profile Sensitivity
    9.5.1 Non-Profile Input Variation
  9.6 Code Allocation
10 Conclusion
  10.1 Primary Heap Allocation Results
  10.2 Cache Comparison Results
  10.3 Profile Sensitivity Results
Bibliography

List of Figures

1.1 Example of heap allocation using our method
2.1 Diagram of typical desktop computer
2.2 Diagram of typical embedded computer
2.3 Memory types common to embedded platforms
2.4 Comparison between popular embedded memory types
2.5 Diagram of the Intel StrongARM embedded CPU
2.6 Compilation of an application from source files
2.7 Compiler view of program memory
2.8 Sample memory layout for an embedded application
2.9 Heap manager example
2.10 Stack growth of a recursive function
4.1 DPRG created for a sample program
4.2 Algorithm for dynamic allocation of static program data
4.3 DPRG enhanced with code regions
5.1 Memory map for a typical ARM application
5.2 Example of a recursive data structure
5.3 Example program fragment
5.4 DPRG showing a heap allocation site
5.5 DPRG for a sample function with heap data
6.1 Algorithm for dynamic allocation of heap data
6.2 Calculating heap bin sizes for allocation
6.3 Allocation scenario for an example program
7.1 DPRG of a recursive function
7.2 Binary tree showing access frequency
7.3 Sample Program containing unknown-size heap allocation
7.4 Sample Function containing unknown-size heap allocation
8.1 GCC compiler flow for an application
8.2 Main stages of our allocation algorithm
8.3 Benchmark Suite Information - Part 1
8.4 Benchmark Suite Information - Part 2
8.5 Benchmark Suite Information - Part 3
9.1 Runtime gain from our method for the default scenario
9.2 Energy savings from our method for the default scenario
9.3 Runtime results from using different transfer methods
9.4 Power consumption using different transfer methods
9.5 Percentage of heap accesses going to DRAM after allocation
9.6 Effects of varying DRAM latency on runtime gain (Part 1)
9.7 Effects of varying DRAM latency on runtime gain (Part 2)
9.8 Effect of varying SPM size on runtime gain using our approach
9.9 Normalized runtime for unknown-size benchmark set
9.10 Normalized energy consumption for unknown-size benchmark set
9.11 Normalized runtime for unknown-size benchmark set when varying SPM size
9.12 Normalized energy usage for unknown-size benchmark set when varying SPM size
9.13 Normalized runtime for recursive benchmark set
9.14 Reduction in energy consumption for recursive benchmark set
9.15 Normalized runtime for recursive benchmark set at 25% SPM
9.16 Normalized energy usage for recursive benchmark set at 25% SPM
9.17 Average normalized runtimes for benchmarks using combinations of SPM and cache
9.18 Averaged normalized energy usage for benchmarks using combinations of SPM and cache
9.19 Normalized runtimes for benchmarks using the second benchmark input set
9.20 Normalized energy usage for benchmarks using the second benchmark input set
9.21 Average runtime gain for benchmarks using both benchmark input sets
9.22 Average energy savings for benchmarks using both benchmark input sets
9.23 Normalized runtime for benchmarks showing profile input sensitivity
9.24 Normalized energy usage for benchmarks showing profile input sensitivity
9.25 Improvement in runtime from profile averaging passes
9.26 Improvement in energy usage from profile averaging passes
9.27 Runtime gain when code as well as global, stack and heap data are allocated
9.28 Energy savings when code as well as global, stack and heap data are allocated
10.1 Normalized runtime for recursive applications when SPM size varies (Part 1)
10.2 Normalized runtime for recursive applications when SPM size varies (Part 2)
10.3 Normalized energy usage for recursive applications when SPM size varies (Part 1)
10.4 Normalized energy usage for recursive applications when SPM size varies (Part 2)
10.5 Details on cache experiments with original JEC apps
10.6 Details on cache experiments with known-size heap benchmarks
10.7 Details on cache experiments with unknown-size heap benchmarks
10.8 Details on cache experiments with recursive benchmarks
10.9 Details on cache experiments with original JEC benchmarks
10.10 Details on cache experiments with known-size heap benchmarks
10.11 Details on cache experiments with unknown-size heap benchmarks
10.12 Details on cache experiments with recursive benchmarks
10.13 Details on runtime gains from profile sensitivity experiments
10.14 Details on energy savings from profile sensitivity experiments
10.15 Details on runtime gains from profile sensitivity experiments when inputs are applied in reverse
10.16 Details on energy savings from profile sensitivity experiments when inputs are applied in reverse