Disk-Based Algorithms for Big Data

Christopher G. Healey
North Carolina State University
Raleigh, North Carolina

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper
Version Date: 20160916

International Standard Book Number-13: 978-1-138-19618-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

To Michelle
To my sister, the artist
To my parents
And especially, to DBelle and K2

Contents

List of Tables
List of Figures
Preface

Chapter 1 Physical Disk Storage
  1.1 PHYSICAL HARD DISK
  1.2 CLUSTERS
    1.2.1 Block Allocation
  1.3 ACCESS COST
  1.4 LOGICAL TO PHYSICAL
  1.5 BUFFER MANAGEMENT

Chapter 2 File Management
  2.1 LOGICAL COMPONENTS
    2.1.1 Positioning Components
  2.2 IDENTIFYING RECORDS
    2.2.1 Secondary Keys
  2.3 SEQUENTIAL ACCESS
    2.3.1 Improvements
  2.4 DIRECT ACCESS
    2.4.1 Binary Search
  2.5 FILE MANAGEMENT
    2.5.1 Record Deletion
    2.5.2 Fixed-Length Deletion
    2.5.3 Variable-Length Deletion
  2.6 FILE INDEXING
    2.6.1 Simple Indices
    2.6.2 Index Management
    2.6.3 Large Index Files
    2.6.4 Secondary Key Index
    2.6.5 Secondary Key Index Improvements

Chapter 3 Sorting
  3.1 HEAPSORT
  3.2 MERGESORT
  3.3 TIMSORT

Chapter 4 Searching
  4.1 LINEAR SEARCH
  4.2 BINARY SEARCH
  4.3 BINARY SEARCH TREE
  4.4 k-d TREE
    4.4.1 k-d Tree Index
    4.4.2 Search
    4.4.3 Performance
  4.5 HASHING
    4.5.1 Collisions
    4.5.2 Hash Functions
    4.5.3 Hash Value Distributions
    4.5.4 Estimating Collisions
    4.5.5 Managing Collisions
    4.5.6 Progressive Overflow
    4.5.7 Multirecord Buckets

Chapter 5 Disk-Based Sorting
  5.1 DISK-BASED MERGESORT
    5.1.1 Basic Mergesort
    5.1.2 Timing
    5.1.3 Scalability
  5.2 INCREASED MEMORY
  5.3 MORE HARD DRIVES
  5.4 MULTISTEP MERGE
  5.5 INCREASED RUN LENGTHS
    5.5.1 Replacement Selection
    5.5.2 Average Run Size
    5.5.3 Cost
    5.5.4 Dual Hard Drives

Chapter 6 Disk-Based Searching
  6.1 IMPROVED BINARY SEARCH
    6.1.1 Self-Correcting BSTs
    6.1.2 Paged BSTs
  6.2 B-TREE
    6.2.1 Search
    6.2.2 Insertion
    6.2.3 Deletion
  6.3 B* TREE
  6.4 B+ TREE
    6.4.1 Prefix Keys
  6.5 EXTENDIBLE HASHING
    6.5.1 Trie
    6.5.2 Radix Tree
  6.6 HASH TRIES
    6.6.1 Trie Insertion
    6.6.2 Bucket Insertion
    6.6.3 Full Trie
    6.6.4 Trie Size
    6.6.5 Trie Deletion
    6.6.6 Trie Performance

Chapter 7 Storage Technology
  7.1 OPTICAL DRIVES
    7.1.1 Compact Disc
    7.1.2 Digital Versatile Disc
    7.1.3 Blu-ray Disc