Charles Fox
Data Science for Transport
A Self-Study Guide with Computer Exercises

Springer Textbooks in Earth Sciences, Geography and Environment

The Springer Textbooks series publishes a broad portfolio of textbooks on Earth Sciences, Geography and Environmental Science. Springer textbooks provide comprehensive introductions as well as in-depth knowledge for advanced studies. A clear, reader-friendly layout and features such as end-of-chapter summaries, work examples, exercises, and glossaries help the reader to access the subject. Springer textbooks are essential for students, researchers and applied scientists.

More information about this series at http://www.springer.com/series/15201

Charles Fox
Institute for Transport Studies
University of Leeds
Leeds, UK

ISSN 2510-1307
ISSN 2510-1315 (electronic)
ISBN 978-3-319-72952-7
ISBN 978-3-319-72953-4 (eBook)
https://doi.org/10.1007/978-3-319-72953-4
Library of Congress Control Number: 2017962979

© Springer International Publishing AG 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cover image: Simon and Simon photography, www.simonandsimonphoto.co.uk

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

There has never been a greater need to understand what is going on in our networks, whether it be highways, public transport, train, or other sustainable modes. Road users demand a greater level of information about the roads and services they are using, and there is an expectation that they will be able to access this information in real time and in a mobile environment. This creates significant challenges for local highway authorities in particular as they have traditionally not been early technology adopters, for good reasons, as they need to be sure that taxpayers’ money is used appropriately and is not wasted. Derbyshire County Council have identified the importance, opportunities, and advantages that transport data in its many forms can provide as well as helping to provide the basis for effective decision making and preparing for a “smarter” future.
A key area of activity is Highways Asset Management, where timely information is being used to provide evidence to support the effective management of the highway asset, the highest value asset that the County Council operates. An example is the use of near real-time information to assess traffic conditions, while thinking how we can share this with the road user to proactively manage the network, which in turn may provide the opportunity to generate more capacity in our highway networks. A key aspect in being able to achieve this is a fundamental understanding of the data at our disposal, its advantages and limitations, and how we can most effectively process, manipulate, and communicate the information.

Information is key to all of our activities and can provide exciting opportunities to make considerable savings and efficiencies if appropriate techniques are adopted, the appropriate investment is made in technology, and very importantly, the right skills are developed to make sense of all the streams of information at our disposal. This book provides the foundations to get to grips with a myriad of data in its many forms, and the opportunities for greater insights and collaborations to be developed when bringing diverse data sets together. These are fundamental skills in the burgeoning field of data science, if we are to deliver on the “Big Data, Internet of Things, Smart Cities” agenda that is a current focus in the transport arena.

Derbyshire County Council, UK
Neill Bennett
Senior Project Officer
Transportation Data and Analysis

Preface

This book is intended for professionals working in Transportation who wish to add Data Science skills to their work, and for current or potential future students of graduate or advanced undergraduate Transport Studies building skills to work in the profession.
It is based closely on a module of the same name which forms a core part of the MSc in Mathematical Transport Modeling course at the Institute for Transport Studies, University of Leeds, which trains many of the world’s leading transport professionals. The live module was designed with the help of many leading transport consultancies in response to their real-world skills shortages in the area. It is taught to small groups of students in a highly interactive, discursive way and is focused around a single team project which applies all the tools presented to deliver a real system component for Derbyshire County Council each year. Each chapter of this book presents theory followed by computer exercises running on self-contained companion software, based on aspects of these projects.

The companion software provides a preinstalled transport data science stack including database, spatial (GIS) data, machine learning, Bayesian, and big data tools. No programming knowledge is assumed, and a basic overview of the Python language is contained within this book. This book is neither a complete guide to programming nor a technical manual for the tools described. Instead, it presents enough understanding and starting points to give the reader confidence to look up details in online software reference manuals or to participate in Internet communities such as stackoverflow.com to find details for themselves. This book is also intended to provide an overview of the field to transport managers who do not want to program themselves but may need to manage or partner with programmers. To enable this, computer examples are usually split into a separate section at the end of each chapter, which may be omitted by non-programmers.

This book follows roughly the structure of a typical Transport Data Science consulting engagement. Data Science consulting is a rapidly evolving field but has recently begun to stabilize into a reasonably standard high-level process. This begins with the data scientist finding the right questions to ask, including consideration of ethical issues around these questions.
Often, a client does not know exactly what they need or want, or what is technically possible. At this stage, the data scientist must both gain an understanding of the business case requirements and convey to the client what is possible. This includes a professional responsibility to represent the field accurately and free from the hype which currently plagues many popular presentations.

Next, suitable data must be obtained. This may be existing data owned by the client or available from outside, or in some cases will be collected in newly commissioned studies. Often, the data were collected for some other purpose and need work to make them usable for a new purpose. Sometimes, they will be in a format such as a series of Web pages which are intended for human rather than automated reading. Again, ethical issues around privacy and data ownership are important here, as well as technical aspects of linking together data from different sources, formats, and systems. Typically, this will involve “cleaning” of data, both to fix problems with individual sources such as missing or incorrect data and to ensure compatibility between data from different sources.

Some of the topics we will cover are:

• The relevance and limitations of data-centric analysis applied to transport problems, compared with other types of modeling and forecasting,
• Reusing and reprocessing transport data for Data Science,
• Ontological issues in classical database design,
• Statistics and machine learning analytics,
• Spatial data and GISs for transport,
• Visualization and transport map making,
• “Big data” computation,
• Non-classical “NoSQL” database ontologies,
• Professional and ethical issues of big data in transport.
For those following the programming details, some of the computational tools covered are:

• Postgres: How to design and create a relational database to house “raw” data,
• SQL: How to interrogate such data and provide appropriate data for transport modeling,
• Python: Basic programming skills and interfaces to Data Science tools,
• PostGIS and GeoPandas: spatial data extensions for SQL and Python for transport data,
• scikit-learn, GPy, and PyMC3: machine learning libraries,
• Hadoop and Spark: Big data systems.

The term “big data” is often used by the media as a synonym for Data Science. Both terms lack widely agreed formal definitions, but we consider them to be distinct here as follows.

“Data Science” as distinct from just “science” emphasizes the reuse of existing data for new purposes. Specifically, regular “science” is based on causal inference, in which some of the data are caused to have certain values by the scientist. This enables causal inferences to be made as a result: Causation is put into the system and can therefore be read out of the system. In contrast, Data Science typically makes use of passively observed data, which does not enable the same form of causal reasoning to take place.

“Big data” in this book means data which require the use of parallel computation to process. The actual numerical size (e.g., measured in bytes) required to pass this threshold varies depending on the power and price of current computing hardware, the nature of the data, and the type of processing required. For example, searching for the shortest road in a transport data set might be possible on a single computer, but computing an optimal route around a set of cities using the same data is a harder computational task which may require parallel processing and thus be considered to be a “big data” task.
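The contrast between the two tasks in this example can be sketched in a few lines of Python. This is a minimal illustration, not code from the companion software: the road lengths are made-up values, every pair of cities is assumed to be directly connected, and the helper names (`shortest_road`, `road_length`, `optimal_tour`) are hypothetical. It shows only why the second task is harder, not how to parallelize it.

```python
import itertools
import math

# Hypothetical road lengths in km (made-up values for illustration only).
# Assumption: every pair of cities is joined by exactly one road.
roads = {
    ("A", "B"): 12.0,
    ("A", "C"): 20.0,
    ("A", "D"): 9.0,
    ("B", "C"): 7.5,
    ("B", "D"): 15.0,
    ("C", "D"): 11.0,
}


def shortest_road(roads):
    """Return the (road, length) pair with the smallest length: one linear scan."""
    return min(roads.items(), key=lambda item: item[1])


def road_length(roads, a, b):
    """Look up an undirected road length regardless of endpoint order."""
    return roads[(a, b)] if (a, b) in roads else roads[(b, a)]


def optimal_tour(roads, cities):
    """Brute-force shortest round trip visiting every city exactly once.

    Tries every ordering of the cities after the first, so the number of
    candidate tours grows factorially with the number of cities.
    """
    start, rest = cities[0], cities[1:]
    best_tour, best_length = None, math.inf
    for order in itertools.permutations(rest):
        tour = (start, *order, start)
        length = sum(road_length(roads, a, b) for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length


road, length = shortest_road(roads)
tour, tour_length = optimal_tour(roads, ["A", "B", "C", "D"])
print("Shortest road:", road, length)
print("Optimal tour:", tour, tour_length)
print("Orderings to try for 20 cities:", math.factorial(19))
```

The shortest-road search looks at each road once, so doubling the data roughly doubles the work. The brute-force tour search looks at every ordering of the cities, which is why the same data set can be small for one question and call for far more computing power for another.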
Before the “big data” movement, classical database practice strongly emphasized strict design via ontologies, or formal (“neat”) descriptions created in consultation with the client, specifying what is in the world that is to be represented by the data. Some “big data” proponents have argued that this is no longer the case and that the new era of data is about “scruffy” representations which can be reinterpreted by different users for different purposes. While this book covers both sides of this debate, it considers that there is still much merit in classical ontological ideas. It is possible that they may get “deconstructed” by the big data movement, but also that they will be “reconstructed” or rebuilt on new foundations created by that movement. In particular, this book covers the classical SQL database language and some recent attempts which rebuild it on top of “big data” tools. It considers that classical ontology is still relevant even in “messy” environments, as individual analysts must still use the same concepts in their individual interpretations of the data as a classical database administrator would use across the whole database.

Some critically important topics for real-world Data Science are rather less glamorous than the machine learning and visualization work usually associated with it in public. Data cleaning, database design, and software testing necessarily form a large part of related work. This book does not shy away from them but discusses them in detail commensurate with their importance. Where possible, it tries to liven up these potentially dry topics by linking them to relevant ideas from Computer Science (Chomsky languages and data cleaning), Philosophy (ontology), and History (of the modern Calendar for date formatting). Some of these connections are a little tenuous but are intended to add interest and aid memory in these areas.
Readers following the mathematical details are assumed to be familiar with first-year undergraduate applied maths such as vectors, matrices, eigenvectors, numerical parameter optimization, calculus, differential equations, Gaussian distributions, Bayes’ rule, and covariance matrices. All these are covered in maths methods books such as the excellent:

• Riley, Hobson and Bence. Mathematical Methods for Physics and Engineering (3rd edition): A Comprehensive Guide. Cambridge University Press 2006.

If you are studying Transport Data Science to build your career, try searching at www.cwjobs.co.uk and other job sites for “data scientist” job descriptions and salaries to learn what types of roles are currently in demand. A fun exercise is to apply your new Transport Data Science skills to automatically process such jobs data; for example, you could try scraping salary, skills, and locations from the Web pages and drawing maps showing where certain skills are in highest demand. Then, link them to estate agent and transport data to visualize relationships between the salaries, house prices, and commute times to find your optimal job and location.

Please check this book’s web site for updates or if you are having difficulty with any of the software exercises or installations. Readers who enjoy self-studying from this book are encouraged to join the live Leeds course afterward, where they will meet many like-minded colleagues, work on the real project together, and build a strong personal network for the rest of their Transport careers.
Many thanks go to Neill Bennett and Dean Findlay at Derbyshire County Council for their help with content and application ideas, as well as for their forward-looking policy of open data which allows many of the examples in this book. From ITS Leeds: Richard Connors for help with content, structure, and style; Oscar Giles and Fanta Camara for checking and testing; my students Lawrence Duncan, Aseem Awad, Vasiliki Agathangelou, and Ryuhei Kondo and teaching assistant Panagiotis Spyridakos for their feedback; Greg Marsden, Richard Batley, and Natasha Merat for providing a stimulating environment for teaching and writing. Robin Lovelace and Ian Philips at Leeds and John Quinn at the United Nations Global Pulse team for GIS ideas. Subhajit Basu at Leeds’ School of Law for ethical and legal ideas. Darren Hoyland at Canonical for help with OpenStack; Mark Taylor at Amazon for help with cloud services; Steve Roberts at Oxford for teaching me machine learning; Stephen NP Smith at Algometrics for teaching me big data before it was a “thing”; Andrew Veglio at Vantage Investment Advisory for teaching me “agricultural” SQL style; Thomas Hain at Sheffield for SGE; Peter Billington at Telematics Technology for the M25 project; Richard Holton for showing me the road from Athens to Thebes; Jim Stone at Sheffield for help with lucid pedagogy; Adam Binch at Ibex Automation for deep learning tests.

To Jenny, if you are reading this I hope it is after I’ve read yours.

Leeds, UK
Charles Fox