Query Optimization and Execution for Multi-Dimensional OLAP Ahmad Taleb A Thesis in The Department of Computer Science and Software Engineering Presented in Partial Fullfilment of the Requirements for the Degree of Doctor of Philosophy (Computer Science) at Concordia University Montreal, Quebec, Canada April, 2011 (cid:13)c Ahmad Taleb, 2011 CONCORDIA UNIVERSITY School of Graduate Studies This is to certify that the thesis prepared By: Ahmad Taleb Entitled: Query Optimization and Execution for Multi-Dimensional OLAP and submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science) complies with the regulations of the University and meets the accepted standards with respect to originality and quality. signed by the examining committee: A. Agarwal Chair D. Lemire External Examiner L. Wang External to Program T. Fancott Examiner V. Haarslev Examiner T. Eavis Thesis Supervisor Approved By Chair of Department or Graduate Program Director Spring 2011 Dean of Faculty ii ABSTRACT Query Optimization and Execution for Multi-Dimensional OLAP Ahmad Taleb Online Analytical Processing (OLAP) is a database paradigm that supports the rich analysis of multi-dimensional data. While current OLAP tools are primarily constructed as extensions to conventional relational databases, the unique modeling and processing requirements of OLAP systems often make for a relatively awkward fit with RDBM systems in general, and their embedded string-based query languages in particular. In this thesis, we discuss the design, implementation, and evaluation of a robust multi-dimensional OLAP server. In fact, we focus on several distinct but related themes. To begin, we investigate the integration of an open source embedded storage engine with our own OLAP-specific indexing and access methods. We then present a comprehensive OLAP query algebra that ultimately allows developers to create expressive OLAP queries in native client languages such as Java. By utilizing a formal algebraic model, we are able to support an intuitive Object Oriented query API, as well as a powerful query optimization and execution engine. The thesis describes both the optimization methodology and the related algorithms for the efficient execution of the associated query plans. The end result of our research is a comprehensive OLAP DBMS prototype that clearly demonstrates new opportunities forimprovingtheaccessibility, functionality,andperformanceofcurrentOLAPdatabase management systems. iii ACKNOWLEDGEMENTS It is really an honour and a privilege to express my gratitude and indebtedness to my supervisor, Todd Eavis for his priceless support, encouragement and inspiration. His rich experience, wealth of knowledge, and critical and creative thinking has given me direction and insight in pursuing this research. Special thanks are extended to my uncle Dr. Nasser Taleb for his kindness, support and encouragement. He does not realize how much I have learned from him. I am really glad that I have an uncle like him in my life. Also, I would like to take this opportunity to personally thank all of my colleagues and friends who have been so supportive and giving of their time especially Yassine Chakir, Yasser Mahmood, Ahmad Al-Tamimi, Nour Eddine Caidi, Mohamad Solihat, Lami Abu Sharikh and Waqas. Finally, I treasure the invaluable support and encouragement of my dear mother and father. I also owe my loving thanks to my wife and my children. They have endured a lot from my research and traveling. If not for their support, understanding and encouragement it would not be possible to complete this research. iv Table of Contents List of Tables xii List of Figures xiii 1 Introduction 1 1.1 Research overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Basic Background Material 7 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Defining OLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 OLAP: A Functional Definition . . . . . . . . . . . . . . . . . 8 2.3 The Data Warehouse and Data Cube . . . . . . . . . . . . . . . . . . 11 2.3.1 Data Warehouse Architecture . . . . . . . . . . . . . . . . . . 11 2.3.2 The Data Cube . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.1 Sidera ROLAP Architecture . . . . . . . . . . . . . . . . . . . 15 2.4.2 Sidera Backend Architecture . . . . . . . . . . . . . . . . . . . 16 2.4.2.1 Cube Indexing . . . . . . . . . . . . . . . . . . . . . 16 2.4.2.2 Hierarchical Attribute Representation . . . . . . . . 18 2.4.2.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . . 19 v 2.4.2.4 Backend Query Engine Model . . . . . . . . . . . . . 22 2.5 Berkeley DB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5.1 Architecture of Berkeley DB . . . . . . . . . . . . . . . . . . . 25 2.5.2 Berkeley DB databases and environments . . . . . . . . . . . . 26 2.6 XML DOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3 Efficient OLAP Storage Engine 30 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Encoding Dimension Tables and the Fact Table . . . . . . . . . . . . 41 3.4.1 Encoding Dimension Tables . . . . . . . . . . . . . . . . . . . 43 3.4.2 Encoding the Fact Table . . . . . . . . . . . . . . . . . . . . 47 3.4.3 Dimension Table Storage . . . . . . . . . . . . . . . . . . . . 50 3.4.3.1 Hierarchical Attributes . . . . . . . . . . . . . . . . . 51 3.4.3.2 Non-Hierarchical Attributes . . . . . . . . . . . . . . 54 3.5 Cube Indexing Integration . . . . . . . . . . . . . . . . . . . . . . . . 58 3.5.1 Berkeley R-tree Model . . . . . . . . . . . . . . . . . . . . . . 60 3.6 Integration of Berkeley DB with other Sidera Components . . . . . . 67 3.7 Backend Processing Logic . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7.1 The Query Resolution Algorithms . . . . . . . . . . . . . . . . 70 3.7.2 Query Transformations . . . . . . . . . . . . . . . . . . . . . . 72 3.7.3 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.7.4 Post Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.8.1 The Test Environment . . . . . . . . . . . . . . . . . . . . . . 78 vi 3.8.2 Non-hierarchical Attributes: FastBit Bitmap versus Berkeley DB B-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.8.3 Single Node Experimental Evaluation . . . . . . . . . . . . . . 81 3.8.3.1 Index Construction . . . . . . . . . . . . . . . . . . . 83 3.8.3.2 OLAP Query Resolution . . . . . . . . . . . . . . . . 85 3.8.4 Parallel Experimental Evaluation . . . . . . . . . . . . . . . . 90 3.9 Review of Research Objectives . . . . . . . . . . . . . . . . . . . . . . 92 3.10 Final Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4 OLAP Query Language 97 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.4 Preliminary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.4.1 Sidera OLAP Model . . . . . . . . . . . . . . . . . . . . . . . 105 4.4.2 Object Oriented OLAP Query . . . . . . . . . . . . . . . . . . 110 4.5 OLAP Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.5.1 OLAP Algebra Operations . . . . . . . . . . . . . . . . . . . . 115 4.5.1.1 SELECTION Operator . . . . . . . . . . . . . . . . . 115 4.5.1.2 PROJECTION Operator . . . . . . . . . . . . . . . 117 4.5.1.3 Set operations on Data Cubes . . . . . . . . . . . . . 119 4.5.1.4 CHANGE LEVEL Operator . . . . . . . . . . . . . . 121 4.5.1.5 CHANGE BASE Operator . . . . . . . . . . . . . . 122 4.5.1.6 PIVOT Operation . . . . . . . . . . . . . . . . . . . 124 4.5.1.7 DRILL ACROSS Operator . . . . . . . . . . . . . . 124 4.5.2 Algebra Simplifications . . . . . . . . . . . . . . . . . . . . . . 127 vii 4.6 OLAP Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.7 OLAP Algebraic laws for Improving OLAP Expression Trees . . . . 131 4.7.1 Laws involving SELECTION . . . . . . . . . . . . . . . . . . . 136 4.7.1.1 Better Joins . . . . . . . . . . . . . . . . . . . . . . . 137 4.7.1.2 Combining conditions . . . . . . . . . . . . . . . . . 140 4.7.1.3 Pushing laws . . . . . . . . . . . . . . . . . . . . . . 141 4.7.1.4 Pulling laws . . . . . . . . . . . . . . . . . . . . . . . 143 4.7.2 Laws involving PROJECTION . . . . . . . . . . . . . . . . . 145 4.7.2.1 Decomposition Law . . . . . . . . . . . . . . . . . . 148 4.7.3 Laws for CHANGE LEVEL and CHANGE BASE . . . . . . . 149 4.7.3.1 Removal Law . . . . . . . . . . . . . . . . . . . . . . 149 4.7.3.2 Pushing and Pulling Laws . . . . . . . . . . . . . . . 150 4.7.3.3 Merging Law . . . . . . . . . . . . . . . . . . . . . . 151 4.7.4 Commutative, Associative and Trivial Laws . . . . . . . . . . 153 4.8 OLAP Metadata Storage . . . . . . . . . . . . . . . . . . . . . . . . . 154 4.8.1 OLAP Metadata Grammar . . . . . . . . . . . . . . . . . . . 155 4.8.1.1 Simple OLAP Schema . . . . . . . . . . . . . . . . . 160 4.9 Review of Research Objectives . . . . . . . . . . . . . . . . . . . . . . 161 4.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5 Multi-Dimensional OLAP Query Processor 165 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 5.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 5.4 Parsing OLAP XML Queries . . . . . . . . . . . . . . . . . . . . . . . 175 5.4.1 DOM Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.4.2 Sidera Parser: DOM graph to Parse Tree . . . . . . . . . . . . 177 viii 5.5 The Pre-processor: Semantic checking . . . . . . . . . . . . . . . . . . 179 5.6 From Parse Trees to Logical Query Plans . . . . . . . . . . . . . . . . 182 5.6.1 Conversion to Initial OLAP Logical Query Plan . . . . . . . . 183 5.7 Improving the OLAP Logical Query Plan . . . . . . . . . . . . . . . . 186 5.8 OLAP Physical Query Plan Generation . . . . . . . . . . . . . . . . . 193 5.8.1 Estimating Sizes of intermediate Cubes . . . . . . . . . . . . . 195 5.8.1.1 Estimating the Size of a PROJECTION . . . . . . . 196 5.8.1.2 Estimating the Size of a SELECTION . . . . . . . . 198 5.8.1.3 EstimatingtheSizeofUNION,INTERSECTION,DIFFERENCE199 5.8.1.4 Estimating the Size of a CHANGE LEVEL . . . . . 200 5.8.1.5 Reducing the Cost of the Logical Query Plan . . . . 201 5.8.2 Choosing an order for binary operators . . . . . . . . . . . . . 203 5.8.3 Implementations for OLAP algebraic operators . . . . . . . . 207 5.8.3.1 Choosing a SELECTION Method . . . . . . . . . . . 207 5.8.3.2 Choosing a PROJECTION Method . . . . . . . . . . 210 5.8.3.3 Choosing Binary Operations Methods . . . . . . . . 211 5.8.3.4 ChoosingaMethodforCHANGE LEVELandCHANGE BASE212 5.8.4 Pipelining OLAP operations . . . . . . . . . . . . . . . . . . . 212 5.8.4.1 Pipelining SELECTION . . . . . . . . . . . . . . . . 213 5.8.4.2 Pipelining PROJECTION . . . . . . . . . . . . . . . 213 5.8.4.3 Pipelining Binary Operations . . . . . . . . . . . . . 215 5.8.4.4 Pipelining CHANGE LEVEL . . . . . . . . . . . . . 217 5.8.5 Physical Query Plan Notation . . . . . . . . . . . . . . . . . . 217 5.8.5.1 Access method . . . . . . . . . . . . . . . . . . . . . 217 5.8.5.2 Physical Operators for SELECTION . . . . . . . . . 218 5.8.5.3 Physical Operators for PROJECTION . . . . . . . . 219 5.8.5.4 Physical Operators for Binary Operations . . . . . . 220 ix 5.8.5.5 Physical Operator for CHANGE LEVEL . . . . . . . 221 5.8.5.6 Additional Physical Operators (Get realValues) . . . 221 5.8.5.7 Final Physical Query Plan . . . . . . . . . . . . . . . 222 5.9 OLAP Query Execution . . . . . . . . . . . . . . . . . . . . . . . . . 224 5.9.1 In-memory Hash Table Representation . . . . . . . . . . . . . 226 5.9.2 Index Based SELECTION Algorithm . . . . . . . . . . . . . . 229 5.9.2.1 Cost of the SELECTION operation . . . . . . . . . . 236 5.9.3 Algorithms for the PROJECTION operator . . . . . . . . . . 238 5.9.4 Hash Table Based Algorithms for Binary Operations . . . . . 243 5.9.4.1 UNION . . . . . . . . . . . . . . . . . . . . . . . . . 244 5.9.4.2 INTERSECTION . . . . . . . . . . . . . . . . . . . . 245 5.9.4.3 DIFFERENCE . . . . . . . . . . . . . . . . . . . . . 245 5.9.4.4 DRILL ACROSS . . . . . . . . . . . . . . . . . . . . 245 5.9.4.5 Cost of Binary operations . . . . . . . . . . . . . . . 246 5.9.5 Algorithms for CHANGE LEVEL Operation . . . . . . . . . . 246 5.10 The Sidera Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 5.11 Result Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 5.12 Review of Research Objectives . . . . . . . . . . . . . . . . . . . . . . 259 5.13 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 6 Experimental Results 264 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264 6.2 The Test Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 265 6.3 Current Query Processor Implementation . . . . . . . . . . . . . . . . 269 6.4 Single Node Experimental Evaluation . . . . . . . . . . . . . . . . . . 271 6.4.1 Query Engine with and without LAW 5 . . . . . . . . . . . . 271 x
Description: