Computational and Statistical Methods
for Analysing Big Data with Applications
Shen Liu
The School of Mathematical Sciences and the ARC Centre
of Excellence for Mathematical & Statistical Frontiers,
Queensland University of Technology, Australia
James McGree
The School of Mathematical Sciences and the ARC Centre
of Excellence for Mathematical & Statistical Frontiers,
Queensland University of Technology, Australia
Zongyuan Ge
Cyphy Robotics Lab, Queensland University
of Technology, Australia
Yang Xie
The Graduate School of Biomedical Engineering,
the University of New South Wales, Australia
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
125 London Wall, EC2Y 5AS.
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
Copyright © 2016 Elsevier Ltd. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-803732-4
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.
For information on all Academic Press publications visit our website at http://store.elsevier.com/
List of Figures
Figure 5.1 A 3D plot of the drill-hole data.
Figure 6.1 95% credible intervals for effect sizes for the Mortgage default example at each iteration of the sequential design process.
Figure 6.2 Optimal design versus selected design points for (a) credit score, (b) house age, (c) years employed and (d) credit card debt for the Mortgage default example.
Figure 6.3 Posterior model probabilities at each iteration of the sequential design process for the two models in the candidate set for the airline example.
Figure 6.4 Ds-optimal designs versus the designs extracted from the airline dataset for (a) departure hour and (b) distance from the origin to the destination.
Figure 7.1 Standard process for large-scale data analysis, proposed by Peng, Leek, and Caffo (2015).
Figure 7.2 Data completeness measured using the method described in Section 7.2.1.3. The numbers indicate percentage of completeness.
Figure 7.3 Customer demographics, admission and procedure claim data for year 2010, along with 2011 DIH, were used to train the model. Later, at the prediction stage, customer demographics, hospital admission and procedure claim data for year 2011 were used to predict the number of DIH in 2012.
Figure 7.4 Average days in hospital per person by age for each of the 3 years of HCF data.
Figure 7.5 Scatter plots of bagged regression tree results for customers born before year 1948 (those aged 63 years or older when the model was trained in 2011).
Figure 7.6 Distribution of the top 200 features among the four feature subsets: (a) in the whole population; (b) in subjects born in or after 1948; (c) in subjects born before 1948; (d) in the 1+ days group; (e) the percentage of the four subsets with respect to the full feature set of 915 features.
Figure 8.1 Sandgate Road and its surroundings.
Figure 8.2 Individual travel times over Links A and B on 12 November 2013.
Figure 8.3 Clusters of road users and travel time estimates, Link A.
Figure 8.4 Clusters of road users and travel time estimates, Link B.
Figure 8.5 Travel time estimates of the clustering methods and spot speed data, Link A.
List of Tables
Table 6.1 Results from analysing the full dataset for year 2000 for the Mortgage default example
Table 6.2 The levels of each covariate available for selection in the sequential design process for the mortgage example from Drovandi et al. (2015)
Table 6.3 Results from analysing the extracted dataset from the initial learning phase and the sequential design process for the Mortgage default example
Table 6.4 The levels of each covariate available for selection in the sequential design process for the airline example from Drovandi et al. (2015)
Table 7.1 A summary of big data analysis in healthcare
Table 7.2 Summary of the composition of the feature set. The four feature subset classes are: (1) demographic features; (2) medical features extracted from clinical information; (3) prior cost/DIH features; and (4) other miscellaneous features
Table 7.3 Performance measures
Table 7.4 Performance metrics of the proposed method, evaluated on different populations
Table 7.5 Performance metrics of predictions using feature category subsets only
Table 7.6 An example of interesting ICD-10 primary diagnosis features
Table 8.1 Proportions of road users that have the same grouping over the two links
Acknowledgment
The authors would like to thank the School of Mathematical Sciences and the ARC
Centre of Excellence for Mathematical & Statistical Frontiers at Queensland
University of Technology, the Australian Centre for Robotic Vision, and the
Graduate School of Biomedical Engineering at the University of New South Wales
for their support in the development of this book.
The authors are grateful to Ms Yike Gao for designing the front cover of this
book.
1
Introduction
The history of humans storing and analysing data dates back to about 20,000 years
ago when tally sticks were used to record and document numbers. Palaeolithic
tribespeople used to notch sticks or bones to keep track of supplies or trading activities, while the notches could be compared to carry out calculations, enabling them
to make predictions, such as how long their food supplies would last (Marr, 2015).
As one can imagine, data storage or analysis in ancient times was very limited.
However, after a long journey of evolution, people are now able to collect and
process huge amounts of data, such as business transactions, sensor signals, search
engine queries, multimedia materials and social network activities. As a 2011
McKinsey report (Manyika et al., 2011) stated, the amount of information in our
society has been exploding; consequently, analysing large datasets, the so-called big data, will become a key basis of competition, underpinning new waves of productivity growth, innovation and consumer surplus.
1.1 What is big data?
Big data is not a new phenomenon, but one that is part of a long evolution of data
collection and analysis. Among numerous definitions of big data that have been
introduced over the last decade, the one provided by Mayer-Schönberger and Cukier (2013) appears to be the most comprehensive:
• Big data is “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”
In the community of analytics, it is widely accepted that big data can be conceptualized by the following three dimensions (Laney, 2001):
• Volume
• Velocity
• Variety
1.1.1 Volume
Volume refers to the vast amounts of data being generated and recorded. Despite
the fact that big data and large datasets are different concepts, to most people big
data implies an enormous volume of numbers, images, videos or text. Nowadays,
the amount of information being produced and processed is increasing tremendously, as the following facts illustrate:
• 3.4 million emails are sent every second;
• 570 new websites are created every minute;
• More than 3.5 billion search queries are processed by Google every day;
• On Facebook, 30 billion pieces of content are shared every day;
• Every two days we create as much information as we did from the beginning of time until 2003;
• In 2007, the number of bits of data stored in the digital universe is thought to have exceeded the number of stars in the physical universe;
• The total amount of data being captured and stored by industry doubles every 1.2 years;
• Over 90% of all the data in the world were created in the past 2 years.
As claimed by Laney (2001), increases in data volume are usually handled by
utilizing additional online storage. However, the relative value of each data point
decreases proportionately as the amount of data increases. As a result, attempts
have been made to profile data sources so that redundancies can be identified and
eliminated. Moreover, statistical sampling can be performed to reduce the size of
the dataset to be analysed.
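To make this concrete, the following sketch draws a simple random sample from a large file in chunks, so that the full dataset never needs to fit in memory at once; the file name transactions.csv and the 1% sampling fraction are hypothetical choices for illustration, not prescriptions from this book.

```python
import pandas as pd

SAMPLE_FRACTION = 0.01  # hypothetical sampling rate
CHUNK_SIZE = 100_000    # rows read per chunk

# Read the (hypothetical) large CSV in chunks and keep a 1% simple
# random sample of each chunk, reducing the dataset to be analysed.
sampled_chunks = []
for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_SIZE):
    sampled_chunks.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42))

sample = pd.concat(sampled_chunks, ignore_index=True)
print(f"Sampled {len(sample)} rows for analysis")
```

Chunked sampling of this kind trades a small loss of precision for a dataset that standard statistical tools can handle comfortably.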
1.1.2 Velocity
Velocity refers to the pace of data streaming, that is, the speed at which data are
generated, recorded and communicated. Laney (2001) stated that the boom of e-commerce has increased point-of-interaction speed and, consequently, the pace of the data used to support interactions. According to the International Data Corporation
(https://www.idc.com/), the global annual rate of data production is expected to
reach 5.6 zettabytes in the year 2015, double the figure for 2012. It is
expected that by 2020 the amount of digital information in existence will have
grown to 40 zettabytes. To cope with the high velocity, people need to access, pro-
cess, comprehend and act on data much faster and more effectively than ever
before. The major issue related to velocity is that data are being generated continuously. Traditionally, there was a large time gap between data collection and data analysis; in the era of big data such a gap is problematic, as a large portion of the data may be wasted during the delay. In the presence of high velocity, data collection and analysis need to be carried out as an integrated process. Initially, research interests were directed towards the large-volume characteristic, whereas companies are now investing in big data technologies that allow data to be analysed while they are being generated.
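As a minimal sketch of this integrated, streaming style of analysis, the code below updates summary statistics as each observation arrives, so results are available at any moment while data are still being generated; the simulated sensor stream and the use of Welford's online algorithm are our illustrative choices, not methods prescribed in this book.

```python
import random

class RunningStats:
    """Track mean and variance incrementally (Welford's online algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical sensor stream: each reading is analysed on arrival,
# so collection and analysis form a single integrated process.
stats = RunningStats()
for _ in range(10_000):
    stats.update(random.gauss(mu=5.0, sigma=2.0))
print(f"n={stats.n}, mean={stats.mean:.3f}, variance={stats.variance:.3f}")
```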
1.1.3 Variety
Variety refers to the heterogeneity of data sources and formats. Since there are
numerous ways to collect information, it is now common to encounter various types
of data coming from different sources. Before the year 2000, the most common
format of data was spreadsheets, where data are structured and neatly fit into
tables or relational databases. However, in the 2010s most data are unstructured,
extracted from photos, video/audio documents, text documents, sensors, transaction
records, etc. The heterogeneity of data sources and formats makes such datasets too complex to store and analyse using traditional methods, and significant effort is required to tackle the challenges that this variety entails.
As stated by the Australian Government (http://www.finance.gov.au/sites/
default/files/APS-Better-Practice-Guide-for-Big-Data.pdf), traditional data analysis
takes a dataset from a data warehouse, which is clean and complete, with gaps filled and outliers removed. Analysis is carried out after the data are collected and stored in a storage medium such as an enterprise data warehouse. In contrast, big data analysis uses a wider variety of available data relevant to the analytics problem. The data are usually messy, consisting of different types of structured and unstructured content. There are complex coupling relationships in big data from syntactic, semantic, social, cultural, economic, organizational and other aspects. Rather than interrogating the data with predetermined questions, analysts explore them to discover insights, such as relevant variables and relationships worth investigating further.
1.1.4 Another two V’s
It is worth noting that in addition to Laney’s three Vs, Veracity and Value have
been frequently mentioned in the big data literature (Marr, 2015). Veracity refers to the trustworthiness of the data, that is, the extent to which data are free of bias, noise and abnormality. Efforts should be made to keep data tidy and clean, and methods should be developed to prevent dirty data from being recorded. On the other hand, value refers to the amount of useful knowledge that can be extracted from data. Big data can deliver value in a broad range of fields, such as computer vision (Chapter 4 of this book), geosciences (Chapter 5), finance (Chapter 6), civil aviation (Chapter 6), health care (Chapters 7 and 8) and transportation (Chapter 8). In fact, the applications of big data are endless, and everyone benefits from them now that we have entered a data-rich era.
1.2 What is this book about?
Big data involves a collection of techniques that can help in extracting useful infor-
mation from data. Aiming at this objective, we develop and implement advanced
statistical and computational methodologies for use in various high-impact areas where big data are being collected.
In Chapter 2, classification methods will be discussed, which have been exten-
sively implemented for analysing big data in various fields such as customer seg-
mentation, fraud detection, computer vision, speech recognition and medical
diagnosis. In brief, classification can be viewed as a labelling process for new
observations, aiming at determining to which of a set of categories an unlabelled object belongs. Fundamentals of classification will be introduced first, followed by a discussion of several classification methods that have been popular in