TEMPORAL MEMORY STREAMING

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF CARNEGIE MELLON UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

By Thomas F. Wenisch
August 2007

© Copyright 2007 by Thomas F. Wenisch
All Rights Reserved

Abstract

While device scaling has led to continued processor performance improvement, scaling trends in DRAM technology have favored improving density over access latency. As a result, processors in modern servers spend much of execution time stalled on long-latency memory accesses. The conventional approach to latency tolerance, enlarging the on-chip cache hierarchy as transistor budgets scale, is providing diminishing returns because today's multi-megabyte caches already capture available locality. Commercial server applications present a particular challenge for memory system design because current prefetching/streaming approaches are often ineffective on the irregular data structures and dependent miss chains characteristic of these applications. To further improve server performance, architects must design mechanisms that issue memory requests earlier and with greater parallelism in the face of complex access patterns.

Despite their complexity, commercial applications nonetheless execute repetitive code sequences, which give rise to recurring data structure traversals. As a result, memory addresses are temporally-correlated: addresses accessed near one another in time often recur together. By recording temporally-correlated cache miss addresses and using the recorded information to predict future misses, irregular yet repetitive miss patterns can be predicted.
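As a rough illustration of the record-and-predict idea in the preceding paragraph, the sketch below replays a miss-address trace through a toy software predictor. It is not the hardware design developed in this thesis: the function name `count_predicted_misses`, the `depth` parameter, and the unbounded log and prediction set are all assumptions made for the example.

```python
from typing import Dict, List, Set


def count_predicted_misses(misses: List[int], depth: int = 2) -> int:
    """Count misses that a toy temporal predictor would have predicted.

    Hypothetical sketch: every miss is appended to an ordered log. When an
    address misses again, the `depth` addresses that followed its previous
    occurrence in the log are marked as predicted.
    """
    log: List[int] = []             # miss addresses in the order they occurred
    last_pos: Dict[int, int] = {}   # address -> index of its latest log entry
    predicted: Set[int] = set()     # addresses currently predicted to miss
    covered = 0

    for addr in misses:
        if addr in predicted:
            covered += 1
            predicted.discard(addr)
        pos = last_pos.get(addr)
        if pos is not None:
            # Predict the addresses that followed the previous occurrence.
            predicted.update(log[pos + 1 : pos + 1 + depth])
        last_pos[addr] = len(log)
        log.append(addr)
    return covered


if __name__ == "__main__":
    # An irregular (non-strided) sequence that recurs after an unrelated miss.
    trace = [0x40, 0x9C, 0x13, 0x77, 0xF0, 0x40, 0x9C, 0x13, 0x77]
    print(count_predicted_misses(trace))
```

In this trace the addresses follow no stride pattern, yet the second traversal is predictable because it repeats the first. The first miss of the repeated sequence is still incurred, since it is what triggers the lookup.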
To exploit temporal address correlation, we propose Temporal Memory Streaming, a memory system design paradigm where hardware mechanisms observe repetitive miss sequences at runtime and use recorded sequences to stream data from memory in advance of individual requests.

Acknowledgements

Foremost, I wish to recognize Babak Falsafi, my thesis advisor, who has taught me to think, write, speak, and do research. I owe him many beers for the efforts he has made to prepare me for an academic career. I thank Mark Hill, James Hoe, and Todd Mowry for serving on my thesis committee and for their advice in seeking a career in academia. To Anastasia Ailamaki and Andreas Moshovos, my thanks for their collaboration and guidance in my research.

There are many students at Carnegie Mellon University without whom I could not have completed this work. Se-Hyun Yang and Chris Gniady taught me how to be a successful graduate student. Without Roland Wunderlich and his work on measurement methodology, none of the results in this thesis would be possible. Nikos Hardavellas and Jangwoo Kim invested years of their lives in the preparation and tuning of our workloads. Mike Ferdman, Stephen Somogyi, Jared Smolens, and Brian Gold spent countless hours locating obscure bugs in our tools. I thank all those who endured my “Template Tomfoolery” with good humor.

I wish also to recognize my friends and teachers at the University of Rhode Island. Augustus Uht set me on the path that led me to computer architecture, graduate school, and Carnegie Mellon. John Grandin made my year abroad in Germany possible. My many friends in the Gaming Club always challenged me to keep my mind sharp and head clear (and eyes peeled for incoming frisbees).

I am forever indebted to my parents, Theresia and Fritz Wenisch, for their unconditional love and support and for teaching me the importance of education; and to my siblings, Magdalena Andres and Michael Wenisch, for the inspiring example they set.
I wish to thank Gerald and Janice Neylon for their encouragement and hospitality during the many visits to Rhode Island that Shannon and I (and our cats) have made.

I wish to acknowledge the good people of the ISOMAC corporation for the wonderful espresso that their machines produce; Matt Koeske for his high standards and discriminating taste that allowed only the best beans to reach our shelves; the students of A-level who ensured my oft-lost mug would always find its way home; and the faculty of A-level whose financial support guaranteed my coffee cup would remain forever bottomless.

Finally, I dedicate this thesis to my wife, Shannon Wenisch, for the patience, support, and love she has given me all these years.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables

Chapter 1 Introduction
  1.1 Temporal address correlation
  1.2 Temporal Memory Streaming
  1.3 Scope of study
  1.4 Contributions

Chapter 2 Temporal Address Correlation
  2.1 Motivating examples
    2.1.1 Example one: B+-tree range scans
    2.1.2 Example two: Solaris thread scheduler
  2.2 Quantifying temporal address correlation
    2.2.1 A formal definition for repetitive streams
    2.2.2 The SEQUITUR hierarchical compression algorithm
    2.2.3 Methodology
    2.2.4 TMS opportunity
    2.2.5 Stream lookup
  2.3 Stream Characterization
    2.3.1 Stream length
    2.3.2 Stream reuse
    2.3.3 Stride patterns and miss repetition
  2.4 Sources of repetitive streams
    2.4.1 Methodology
    2.4.2 Results
  2.5 Summary
    2.5.1 Projected coverage

Chapter 3 TMS-DSM
  3.1 Streaming in DSM
  3.2 Hardware design
    3.2.1 Logging misses
    3.2.2 Finding & forwarding streams
    3.2.3 The stream engine
  3.3 Evaluation methodology
  3.4 Results
    3.4.1 Streaming effectiveness
    3.4.2 End-of-stream detection
    3.4.3 Sensitivity to stream queues
    3.4.4 Sensitivity to SVB size
    3.4.5 CMOB capacity requirements
    3.4.6 Bandwidth overhead
    3.4.7 Streaming timeliness
    3.4.8 Performance impact
    3.4.9 Synergy with stride prediction
  3.5 Summary

Chapter 4 TMS-CMP
  4.1 Challenges & opportunities in the CMP
    4.1.1 Stream lookup for main memory streaming
    4.1.2 Intra-chip streaming
  4.2 Main memory streaming
    4.2.1 Design
    4.2.2 Methodology
    4.2.3 Results
  4.3 Intra-chip streaming
    4.3.1 Design
    4.3.2 Results
  4.4 Summary

Chapter 5 Related Work

Chapter 6 Conclusions
  6.1 Future work

Bibliography