Predictive Analytics, Data Mining and Big Data
Myths, Misconceptions and Methods

Steven Finlay

© Steven Finlay 2014
Softcover reprint of the hardcover 1st edition 2014 978-1-137-37927-6

All rights reserved. No reproduction, copy or transmission of this publication may be made without written permission.

No portion of this publication may be reproduced, copied or transmitted save with written permission or in accordance with the provisions of the Copyright, Designs and Patents Act 1988, or under the terms of any licence permitting limited copying issued by the Copyright Licensing Agency, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

Any person who does any unauthorized act in relation to this publication may be liable to criminal prosecution and civil claims for damages.

The author has asserted his right to be identified as the author of this work in accordance with the Copyright, Designs and Patents Act 1988.

First published 2014 by PALGRAVE MACMILLAN

Palgrave Macmillan in the UK is an imprint of Macmillan Publishers Limited, registered in England, company number 785998, of Houndmills, Basingstoke, Hampshire RG21 6XS.

Palgrave Macmillan in the US is a division of St Martin’s Press LLC, 175 Fifth Avenue, New York, NY 10010.

Palgrave Macmillan is the global academic imprint of the above companies and has companies and representatives throughout the world.

Palgrave® and Macmillan® are registered trademarks in the United States, the United Kingdom, Europe and other countries.

ISBN 978-1-349-47868-2
ISBN 978-1-137-37928-3 (eBook)
DOI 10.1057/9781137379283

This book is printed on paper suitable for recycling and made from fully managed and sustained forest sources.
Logging, pulping and manufacturing processes are expected to conform to the environmental regulations of the country of origin.

A catalogue record for this book is available from the British Library.

A catalog record for this book is available from the Library of Congress.

Typeset by MPS Limited, Chennai, India.

To Ruby and Samantha

Contents

Figures and Tables
Acknowledgments
1 Introduction
  1.1 What are data mining and predictive analytics?
  1.2 How good are models at predicting behavior?
  1.3 What are the benefits of predictive models?
  1.4 Applications of predictive analytics
  1.5 Reaping the benefits, avoiding the pitfalls
  1.6 What is Big Data?
  1.7 How much value does Big Data add?
  1.8 The rest of the book
2 Using Predictive Models
  2.1 What are your objectives?
  2.2 Decision making
  2.3 The next challenge
  2.4 Discussion
  2.5 Override rules (business rules)
3 Analytics, Organization and Culture
  3.1 Embedded analytics
  3.2 Learning from failure
  3.3 A lack of motivation
  3.4 A slight misunderstanding
  3.5 Predictive, but not precise
  3.6 Great expectations
  3.7 Understanding cultural resistance to predictive analytics
  3.8 The impact of predictive analytics
  3.9 Combining model-based predictions and human judgment
4 The Value of Data
  4.1 What type of data is predictive of behavior?
  4.2 Added value is what’s important
  4.3 Where does the data to build predictive models come from?
  4.4 The right data at the right time
  4.5 How much data do I need to build a predictive model?
5 Ethics and Legislation
  5.1 A brief introduction to ethics
  5.2 Ethics in practice
  5.3 The relevance of ethics in a Big Data world
  5.4 Privacy and data ownership
  5.5 Data security
  5.6 Anonymity
  5.7 Decision making
6 Types of Predictive Models
  6.1 Linear models
  6.2 Decision trees (classification and regression trees)
  6.3 (Artificial) neural networks
  6.4 Support vector machines (SVMs)
  6.5 Clustering
  6.6 Expert systems (knowledge-based systems)
  6.7 What type of model is best?
  6.8 Ensemble (fusion or combination) systems
  6.9 How much benefit can I expect to get from using an ensemble?
  6.10 The prospects for better types of predictive models in the future
7 The Predictive Analytics Process
  7.1 Project initiation
  7.2 Project requirements
  7.3 Is predictive analytics the right tool for the job?
  7.4 Model building and business evaluation
  7.5 Implementation
  7.6 Monitoring and redevelopment
  7.7 How long should a predictive analytics project take?
8 How to Build a Predictive Model
  8.1 Exploring the data landscape
  8.2 Sampling and shaping the development sample
  8.3 Data preparation (data cleaning)
  8.4 Creating derived data
  8.5 Understanding the data
  8.6 Preliminary variable selection (data reduction)
  8.7 Pre-processing (data transformation)
  8.8 Model construction (modeling)
  8.9 Validation
  8.10 Selling models into the business
  8.11 The rise of the regulator
9 Text Mining and Social Network Analysis
  9.1 Text mining
  9.2 Using text analytics to create predictor variables
  9.3 Within document predictors
  9.4 Sentiment analysis
  9.5 Across document predictors
  9.6 Social network analysis
  9.7 Mapping a social network
10 Hardware, Software and All that Jazz
  10.1 Relational databases
  10.2 Hadoop
  10.3 The limitations of Hadoop
  10.4 Do I need a Big Data solution to do predictive analytics?
  10.5 Software for predictive analytics
Appendix A. Glossary of Terms
Appendix B. Further Sources of Information
Appendix C. Lift Charts and Gain Charts
Notes
Index
Description: