Algorithmic and Software System Support to Accelerate Data Processing in CPU-GPU Hybrid Computing Environments

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Kaibo Wang
Graduate Program in Computer Science and Engineering

The Ohio State University
2015

Dissertation Committee:
Xiaodong Zhang, Advisor
P. Sadayappan
Christopher Stewart

Copyright by
Kaibo Wang
2015

Abstract

Massively data-parallel processors, Graphics Processing Units (GPUs) in particular, have recently entered the mainstream of general-purpose computing as powerful hardware accelerators for a wide range of applications, including databases, medical informatics, and big data analytics. However, despite their performance benefits and cost effectiveness, the utilization of GPUs in production systems remains limited. A major reason behind this situation is the slow development of a supportive GPU software ecosystem. More specifically, (1) CPU-optimized algorithms for some critical computation problems have irregular memory access patterns with intensive control flows, and cannot be easily ported to GPUs to take full advantage of their fine-grained, massively data-parallel architecture; (2) commodity computing environments are inherently concurrent and require coordinated resource sharing to maximize throughput, while existing systems are still mainly designed for dedicated usage of GPU resources.

In this Ph.D. dissertation, we develop efficient software solutions to support the adoption of massively data-parallel processors in general-purpose commodity computing systems. Our research mainly focuses on the following areas.

First, to make a strong case for GPUs as indispensable accelerators, we apply GPUs to significantly improve the performance of spatial data cross-comparison in digital pathology analysis. Instead of trying to port existing CPU-based algorithms to GPUs, we design a new algorithm and fully optimize it to exploit the GPU’s hardware architecture for high performance.

Second, we propose operating system support for automatic device memory management to improve the usability and performance of GPUs in shared general-purpose computing environments. Several effective optimization techniques are employed to ensure the efficient usage of GPU device memory space and to achieve high throughput.

Finally, we develop resource management facilities in GPU database systems to support concurrent analytical query processing. By allowing multiple queries to execute simultaneously, the resource utilization of GPUs can be greatly improved. It also enables GPU databases to be utilized in important application areas where multiple user queries need to make continuous progress simultaneously.

Dedication

To my family and dearest friends.

Acknowledgments

This journey could not have been so memorable without the company of many wonderful people. My advisor opened the door and led me step by step to the essential goal of doing impactful research. Xiaoning Ding and Rubao Lee were great mentors whose deep knowledge and consistent support smoothed the roughest roads. Yuan Yuan, Kai Zhang, and Yin Huai walked shoulder to shoulder with me and lent the strongest arms whenever help was needed. My girlfriend was always supportive and never complained about my academic life, which was often too busy to give her enough attention. Thank you all!

Vita

2006: B.S. Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
2009: M.S. Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
2010 to present: Ph.D. Computer Science and Engineering, The Ohio State University

Publications

1. Kai Zhang, Kaibo Wang, Yuan Yuan, Rubao Lee, Lei Guo, Xiaodong Zhang. Mega-KV: A Case for GPUs to Maximize the Throughput of In-Memory Key-Value Stores. Proc. VLDB Endow., 8(11):1226-1237, 2015.
2. Kaibo Wang, Kai Zhang, Yuan Yuan, Siyuan Ma, Rubao Lee, Xiaoning Ding, Xiaodong Zhang. Concurrent Analytical Query Processing with GPUs. Proc. VLDB Endow., 7(11):1011-1022, 2014.
3. Kaibo Wang, Xiaoning Ding, Rubao Lee, Shinpei Kato, Xiaodong Zhang. GDM: Device Memory Management for GPGPU Computing. In Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2014), pages 533-545, 2014.
4. Kaibo Wang, Yin Huai, Rubao Lee, Fusheng Wang, Xiaodong Zhang, Joel H. Saltz. Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems. Proc. VLDB Endow., 5(11):1543-1554, 2012.
5. Xiaoning Ding, Kaibo Wang, Phillip B. Gibbons, Xiaodong Zhang. BWS: Balanced Work Stealing for Time-Sharing Multicores. In Proceedings of the 7th European Conference on Computer Systems (EuroSys 2012), pages 365-378, 2012.
6. Xiaoning Ding, Kaibo Wang, Xiaodong Zhang. SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Caches from Thrashing in Multicores. In Proceedings of the 6th European Conference on Computer Systems (EuroSys 2011), pages 243-256, 2011.
7. Xiaoning Ding, Kaibo Wang, Xiaodong Zhang. ULCC: A User-Level Facility for Optimizing Shared Cache Performance on Multicores. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP 2011), pages 103-112, 2011.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Publications
Fields of Study
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
  2.1 Introduction
  2.2 Problem Identification
    2.2.1 Background: Spatial Cross-Comparison
    2.2.2 Existing Solutions with SDBMSs
    2.2.3 Performance Profiling of SDBMS Solution
  2.3 The PixelBox Algorithm
    2.3.1 Pixelization of Polygon Pairs
    2.3.2 Reduction of Computing Intensity
    2.3.3 Optimized Algorithm Implementation
    2.3.4 Related Discussions
  2.4 System Framework
    2.4.1 The Pipelined Structure
    2.4.2 Dynamic Task Migration
  2.5 Experiments
    2.5.1 Experiment Methodology
    2.5.2 Performance of the PixelBox Algorithm
    2.5.3 Effectiveness of Optimization Techniques
    2.5.4 Parameter Sensitivity of PixelBox
    2.5.5 Performance of the Pipelined Framework
    2.5.6 Effectiveness of Dynamic Task Migration
    2.5.7 Performance Evaluation with All Data Sets
  2.6 Related Work
  2.7 Conclusions