
Meta-analytics. Consensus approaches and system patterns for data analysis

327 Pages·2019·3.717 MB·English

Preview Meta-analytics. Consensus approaches and system patterns for data analysis

Meta-Analytics
Consensus Approaches and System Patterns for Data Analysis
Steven Simske

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2019 Steven Simske. Published by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-814623-1

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jonathan Simpson
Acquisition Editor: Glyn Jones
Editorial Project Manager: Aleksandra Packowska
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Matthew Limbert
Typeset by SPi Global, India

This book is dedicated to Tess, my partner for 30 years and my best friend in life.

Acknowledgments

No man is an island, and a book is definitely a human archipelago. I owe so much to so many for this book being completed and hopefully of high relevance to the reader. I am especially happy with the advancements in clustering and classification that show in here, along with a wide variety of analytic approaches based on great work in disparate fields of science. If I have seen anything well here, to paraphrase the late, great Newton, it is because I am understanding on the shoulders of giants.

Thanks to the team at Elsevier for their prodding, probing, professionalism, and promptness. In particular, I’d like to thank Brian Guerin, Glyn Jones, Sabrina Webber, Peter Llewellyn, and Aleksandra Packowska for their important roles in seeing this book through its more than 2-year incubation and birth.

Thanks to many, many encouraging colleagues and friends from universities, from HP Inc., and from so many groups and activities here in Fort Collins. Hundreds of people who’ve made my life better during the writing of this book may not all be named here, but rest assured that you are appreciated! Without having had the chance to participate in so many different activities and professions over the years, I would never have been able to see the connections between them.

Thanks to all the great folks at Colorado State University, which I made my professional home at the beginning of the writing phase of this book.
In particular, thanks to the systems engineering staff and faculty (featuring Jim Adams, Ann Batchelor, Mike Borky, Ingrid Bridge, Jim Cale, Mary Gomez, Greg “Bo” Marzolf, Erika Miller, Ron Sega, and Tom Bradley) for providing me with a home and classroom suitable for elaboration of key parts of the book, not to mention their support and friendship, which seem the rule at CSU.

Special thanks indeed to my Irish trio of great friends: Paul Ellingstad, Mick Keyes, and Gary Moloney. Their wisdom, friendship, kindhearted cynicism, energy, and inability to lose their optimism in the face of the grittiness of reality have always been a wind in my sails. Special thanks also to my non-Irish support team of friends and intellectual guides: Reed Ayers, Dave Barry, Gary Dispoto, Matt Gaubatz, Ellis Gayles, Stephen Pollard, Tom Schmeister, Steve Siatczynski, Dave Wright, and Bob Ulichney. Thank you, brothers!

Some of our best friends come from professional organizations. From ACM DocEng, I have lifelong friends in Steve Bagley, the Balinskys, Dave and Julie Brailsford, Alexandra Bonnici, Tamir Hassan, Rafael Lins, Cerstin Mahlow, Ethan Munson, Michael Piotrowski, and so many more. Thank you all! From IS&T, Suzanne Grinnan and staff (Jenny O’Brien, Diana Gonzalez, Roberta Morehouse, Donna Smith, and Marion Zoretich chief among them), Alan Hodgson, Robin Jenkin, Susan Farnand, and many others have helped guide my research and professional career with friendship and advice.

A friend and IS&T colleague who I’ve worked with for 10 years played a huge role in this book. Thanks, and then, more thanks goes to Marie Vans for proofreading this entire book from start to finish. If errors remain, they are of course my evil spawn, but thanks to Marie; an unholy horde of heuristic horrors has already been eliminated. Marie, thank you so much! Having someone as talented as you are in the research area of this book go through it with a fine-tooth comb was wonderful.
Finally, this book is dedicated to Tess, my life partner for 30 years. I cannot thank you enough for your patience, encouragement, and occasional hard reset. Along with Tess, I can trust my two amazing sons (Kieran and Dallen) and my great friend, Doug Heins, to keep me on track, in life and in learning, which is really the same. Your talents, feedback, investment, and love of learning are not just inspiring; they are the breath inspired. Thank you!

Steve Simske
Fort Collins, CO
18 November 2019

CHAPTER 1
Introduction, overview, and applications

It is a capital mistake to theorize before one has data
Arthur Conan Doyle (1887)

Numquam ponenda est pluralitas sine necessitate
William of Ockham, Duns Scotus, et al. (c. 1300)

E pluribus unum
US Motto

1.1 Introduction

We live in a world in which more data have been collected in the past 2–3 years than were collected in the entire history of the world before then. Based on the trends of the past few years, we’ll be saying this for a while. Why is this the case? The confluence of nearly limitless storage and processing power has, quite simply, made it far easier to generate and preserve data. The most relevant question is, perhaps, not whether this will continue, but rather how much of the data will be used for anything more than filling up storage space.

The machine intelligence community is, of course, interested in turning these data into information and has had tremendous success to date, albeit in somewhat specific and/or constrained situations. Recent advancements in hardware, from raw processing power and nearly limitless storage capacity, to the architectural revolution that graphics processing units (GPUs) bring, to parallel and distributed computation, have allowed software developers and algorithm developers to encode processes that were unthinkable with the hardware of even a decade ago. Deep learning and in particular convolutional neural networks, together with dataflow programming, allow for an ease of rolling out sophisticated machine learning algorithms and processes that is unprecedented, with the entire field having by all means a bright future.
Taking the power of hybrid architectures as a starting point, analytic approaches can be upgraded to benefit from all components when employing a plurality of analytics. This book is about how simple building blocks of analytics can be used in aggregate to provide systems that are readily optimized for accuracy, robustness, cost, scalability, modularity, reusability, and other design concerns. This book covers the basics of analytics; builds on them to create a set of meta-analytic approaches; and provides straightforward analytics algorithms, processes, and designs that will bring a neophyte up to speed while augmenting the arsenal of an analytics authority. The goal of the book is to make analytics enjoyable, efficient, and comprehensible to the entire gamut of data scientists, in what is surely an age of data science.

Meta-Analytics. https://doi.org/10.1016/B978-0-12-814623-1.00001-0

1.2 Why is this book important?

First and foremost, this book is meant to be accessible to anyone interested in data science. Data already permeate every science, technology, engineering, and mathematics (STEM) endeavor, and the expectations to generate relevant and copious data in any process, service, or product will only continue to grow in the years to come. A book helping a STEM professional pick up the art of data analysis from the ground up, providing both fundamentals and a roadmap for the future, is needed.

The book is aimed at supplying an extensive set of patterns for data scientists to use to “hit the ground running” on any machine-learning-based data analysis task and virtually ensures that at least one approach will lead to better overall system behavior (accuracy, cost, robustness, performance, etc.)
than by using traditional analytic approaches only. Because the book is “meta-”analytics, it also must cover general analytics well enough for the reader to engage with and comprehend the hybrid approaches, or “meta-”approaches. As such, the book aims to allow a relative novice to analytics to move to an elevated level of competency and “fluency” relatively quickly. It is also intended to challenge the data scientist to think more broadly and more thoroughly than they might be otherwise motivated.

The target audience, therefore, consists of data scientists in all sectors: academia, industry, government, and NGO. Because of the importance of statistical methods, data normalization, data visualization, and machine intelligence to the types of data science included in this book, the book has relevance to machine translation, robotics, biological and social sciences, medical and health-care informatics, economics, business, and finance. The analytic approaches covered herein can be applied to predictive algorithms for everyone from police departments (crime prediction) to sport analysts. The book is readily amenable to a graduate class on systems engineering, analytics, or data science, in addition to a course on machine intelligence. A subset of the book could be used for an advanced undergraduate class in intelligent systems.
Predictive analytics have long held a fascination for people. Seeing the future has been associated with divinity, with magic, with the occult, or simply (and more in keeping with Occam’s razor) with enhanced intelligence. But is Occam’s razor, or the law of parsimony, applicable in the age of data science? It is no longer necessarily the best advice to say “Numquam ponenda est pluralitas sine necessitate,” or “plurality is never to be posited without necessity,” unless, of course, one uses “goodness of fit to a model,” “output of sensitivity analysis,” or “least-squares estimation,” among other quantitative artifacts, as proxies for “necessity.” The concept of predictive analytics, used at the galactic level and extending many thousands of years into the future, is the basis of the Foundation trilogy by Isaac Asimov, written in the middle of the 20th century. Futurist (or should we say mathematician?) Hari Seldon particularized the science of psychohistory, which presumably incorporated an extremely multivariate analysis intended to remove as much uncertainty from the future as possible for those privy to his output. Perhaps, the only prediction he was unable to make was the randomness of the personality of the “Mule,” an überintelligent, übermanipulative leader of the future. However, his ability to estimate the future in probabilistic terms led to the (correct) prediction of the collapse of the Galactic Empire and so included a manual to abbreviate the millennia of chaos expected to follow. In other words, he may have foreseen not the “specific randomness” of the Mule, but constructed his psychohistory to be optimally robust to the unforeseen. That is, Hari Seldon performed “preflight sensitivity analysis” of his predictive model. Kudos to Asimov for anticipating the value of analytics in the future. But even more so, kudos for anticipating that the law of parsimony would be insufficient to address the needs of a predictive analytic system to be insensitive to such “unpredictable” random artifacts (people, places, and things). The need to provide for the simplest model reasonable (that is, the law of model parsimony) remains.
However, it is evident that hybrid systems, affording simplicity where possible but able to handle much more complexity where appropriate, are more robust than either extreme and ultimately will remain relevant longer in real-world applications.

This book is, consequently, important precisely because of the value provided by both the Williams of Ockham and the Hari Seldon. The real world is dynamic and ever-changing, and predictive models must be preadapted to change in the assumptions that underpin them, including but not limited to the drift in data from that used to train the model; changes in the “measurement system” including sampling, filtering, transduction, and compression; and changes in the interactions between the system being modeled and measured and the larger environment around it. I hope that the approaches revisited, introduced, and/or elaborated in this book will aid data scientists in their tasks while also bringing non-data scientists to sufficient data “fluency” to be able to interact intelligently with the world of data. One thing is certain: unlike Hari Seldon’s Galactic Empire, the world of data is not about to crumble. It is getting stronger, for good and for bad, every day.

1.3 Organization of the book

This, the first, is the critical chapter for the entire book and takes on a disproportionate length compared with the other chapters intentionally, as this book is meant to stand on its own, allowing the student, data enthusiast, and even data professional to use it as a single source to proceed from unstructured data to fully tagged, clustered, and classified data. This chapter also provides background on the statistics, machine learning, and artificial intelligence needed for analytics and meta-analytics.

Additional chapters, then, elaborate further on what analytics provide. In Chapter 2, the value of training data is thoroughly investigated, and the assumptions around the long-standing training, validation, and testing process are revisited. In Chapter 3, experimental design, from bias and normalization to the treatment of
