
Graphics of Large Datasets: Visualizing a Million

275 Pages·23.987 MB·English

Preview Graphics of Large Datasets: Visualizing a Million

Statistics and Computing

Series Editors: J. Chambers, D. Hand, W. Härdle

Brusco/Stahl: Branch-and-Bound Applications in Combinatorial Data Analysis.
Dalgaard: Introductory Statistics with R.
Gentle: Elements of Computational Statistics.
Gentle: Numerical Linear Algebra for Applications in Statistics.
Gentle: Random Number Generation and Monte Carlo Methods, 2nd Edition.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical Computing Environment.
Hoermann/Leydold/Derflinger: Automatic Nonuniform Random Variate Generation.
Krause/Olson: The Basics of S-PLUS, 4th Edition.
Lange: Numerical Analysis for Statisticians.
Lemmon/Schafer: Developing Statistical Software in Fortran 95.
Loader: Local Regression and Likelihood.
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to Signal Processing.
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D.
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS.
Venables/Ripley: Modern Applied Statistics with S, 4th Edition.
Venables/Ripley: S Programming.
Wilkinson: The Grammar of Graphics, 2nd Edition.

Antony Unwin, Martin Theus, Heike Hofmann

Graphics of Large Datasets: Visualizing a Million

Antony Unwin
Department of Computer Oriented Statistics and Data Analysis
University of Augsburg
Augsburg 86135
Germany

Martin Theus
Department of Computer Oriented Statistics and Data Analysis
University of Augsburg
Augsburg 86135
Germany

Heike Hofmann
Department of Statistics
Iowa State University
Ames, IA 50011-1210

Series Editors:

J. Chambers
Bell Labs, Lucent Technologies
600 Mountain Ave.
Murray Hill, NJ 07974
USA

D. Hand
Department of Mathematics
South Kensington Campus
Imperial College, London
London SW7 2AZ
United Kingdom

W. Härdle
Institut für Statistik und Ökonometrie
Humboldt-Universität zu Berlin
Spandauer Str. 1
D-10178 Berlin, Germany

Library of Congress Control Number: 2006922760
ISBN-10: 0-387-32906-4
ISBN-13: 978-0387-32906-2
Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in Singapore (KYO)
9 8 7 6 5 4 3 2 1
springer.com

Preface

Analysing data is fun. It is fascinating to find out about different topics, and each dataset brings new challenges. Whether you are looking at times of goals scored in Bundesliga soccer games, ultrasound measurements of babies in the womb, or company bankruptcy data, there is always something new to be learnt. Graphic displays play a major role in data analysis, and all the main authors of this book have spent a lot of research time studying graphics and ways to improve and enhance them. We have also spent a lot of time actually analysing data. The two go together.

One of the major problems over the years has been representing data from large datasets. Initially, computers could barely read in a large dataset, so whether you could display it sensibly or not was not relevant. Gradually, computers have been able to cope with larger and larger datasets, and some weaknesses of standard data graphics have become apparent. In tackling these problems, we have become aware that there is knowledge around as to how to display large datasets, but it is not readily available, certainly not in one place.
We hope that our book will help others interested in visualizing large datasets to find out more easily what has been done and to contribute themselves. More especially, we hope it will help data analysts in analysing their data.

This book grew out of discussions at a visualization workshop held in Augsburg in 2002. The main authors of each guest chapter (Dianne Cook, Ed Wegman, Graham Wills, Simon Urbanek, Steve Marron) were all there, and we are grateful to them both for agreeing to contribute to our book and for many insightful discussions at the workshop and on other occasions.

Many people have contributed to our understanding and knowledge of graphics and data analysis. Discussions at conferences, debates via email, and, especially, debates about how to analyse particular datasets have all left lasting impressions. Experimenting with software, our own and that of many others, has also taught us a great deal. We would like to thank in no particular order Günther Sawitzki, Fred Eicker, Paul Velleman, Luke Tierney, Lee Wilkinson, Peter Dirschedl, Rick Becker, Allan Wilks, Andreas Buja, Debbie Swayne, John Sall, Bob Rodriguez, Rick Wicklin, Sandy Weisberg, Bill Eddy, Wolfgang Härdle, Junji Nakano, Moon Yul Huh, JJ Lee, Adi Wilhelm, Daniella DiBenedetto, Paul Campbell, Carlo Lauro, Roberta Siciliano, Al Inselberg, Peter Huber, Daryl Pregibon, Steve Eick, Audris Mockus, Michael Friendly, Sigbert Klinke, Matthias Nagel, Rüdiger Ostermann, Axel Benner, Friedrich Pukelsheim, Christian Röttger, Thomas Klein, Annerose Zeis, Marion Jackson, Enda Treacy, David Harmon, Robert Spence, Berwin Turlach, Bill Venables, Brian Ripley, and too many people associated with R to name individually. Thanks are also due to former colleagues in the Statistics Department at Trinity College Dublin for provocative exchanges both over coffee and long into the night at the Irish Statistics conferences (graphical discussions, though not necessarily about graphics).

Some people read parts (or all!) of the book and made pertinent, helpful comments, but we also benefitted from encouraging and constructive criticism from Springer’s anonymous referees. For help with the proofreading we are indebted to Lindy Wolsleben, Estelle Feldman, Veronika Schuster, and Sandra Schulz. Being students, Veronika and Sandra were properly respectful and careful. Lindy and Estelle called a spade a spade, especially when we had called it a spode. Our thanks to all of them; authors need both kinds of help. Needless to say (but we will say it anyway), any remaining errors are our fault. John Kimmel has been a supportive editor and it is always a pleasure to chat with him at the Springer stand at meetings. (We hope that the sales figures for our book will be good enough that he will still be prepared to talk to us in future.) We would also like to thank the six anonymous reviewers who gave significant input at various stages of the project. It takes a good editor like John to find these people.

Graphics research depends on good, if not exceptional, software to turn visualization ideas into practice. In 1987, one of us (AU) got a small grant from Apple Computers to write graphics software for teaching. (It is hard to believe, but I actually wrote some code myself initially. When I saw how good the students employed on the project were, I vowed never to write any code again.) In Dublin, it was the students Michael Lloyd, Graham Wills, and Eoin McDonnell who led the way. In Augsburg, where the Impressionists’ software packages have been developed, we have benefitted from the skills of George Hawkins, Stefan Lauer, Heike Hofmann, Bernd Siegl, Christian Ordelt, Sylvia Winkler, Simon Urbanek, Klaus Bernt, Claus-Michael Müller, René Keller, Markus Helbig, Tobias Wichtrey, Alex Gouberman, Sergei Potapov, and Alex Gribov.
In 1999, John Chambers generously donated his ACM award money to the creation of a special prize for the development of statistical software by students. Recognition of the important contribution software makes to progress in research was long overdue, and we are delighted that three of our students have already won the prize.

Data analysis is a practical science and much of our knowledge and ideas stems from project collaborations with colleagues in universities and in firms. It would be impossible to talk about the problems arising from large datasets, without actually working on problems like these. We are grateful to our project partners for their cooperation and for actually sharing their data with us. It is all too easy to forget just how much work and effort is involved in collecting the data in a dataset.

Finally, as academics, we are grateful to our universities for supporting and encouraging our research, to the Deutsche Forschungsgemeinschaft for some project support, and, especially, to the Volkswagen Stiftung, whose initial funding led to the founding of the Department of Computer Oriented Statistics and Data Analysis in Augsburg, which brought us all together.

Apple is a registered trademark of Apple Computer, Inc., AT&T is a registered trademark of AT&T Corp., Data Desk is a registered trademark of Data Description, Inc., IBM is a registered trademark of International Business Machines Corporation, Inxight TableLens is a registered trademark of Inxight Software, Inc., JMP, SAS and SAS-Insight are registered trademarks of SAS Institute Inc. and/or its affiliates, OpenGL is a trademark of Silicon Graphics, Inc., PostScript is a registered trademark of Adobe Systems Incorporated, S-PLUS is a registered trademark of the Insightful Corporation, SPSS is a registered trademark of SPSS Inc., UNIX is a registered trademark of The Open Group, Windows is a registered trademark of Microsoft Corporation. Other third-party trademarks belong to their respective owners.

Contents

1 Introduction
  1.1 Introduction
  1.2 Data Visualization
  1.3 Research Literature
  1.4 How Large Is a Large Dataset?
  1.5 The Effects of Largeness
    1.5.1 Storage
    1.5.2 Quality
    1.5.3 Complexity
    1.5.4 Speed
    1.5.5 Analyses
    1.5.6 Displays
    1.5.7 Graphical Formats
  1.6 What Is in This Book
  1.7 Software
  1.8 What Is on the Website
    1.8.1 Files and Code for Figures
    1.8.2 Links to Software
    1.8.3 Datasets
  1.9 Contributing Authors

Part I Basics

2 Statistical Graphics
  2.1 Introduction
  2.2 Plots for Categorical Data
    2.2.1 Barcharts and Spineplots for Univariate Categorical Data
    2.2.2 Mosaic Plots for Multi-dimensional Categorical Data
  2.3 Plots for Continuous Data
    2.3.1 Dotplots, Boxplots, and Histograms
    2.3.2 Scatterplots, Parallel Coordinates, and the Grand Tour
  2.4 Data on Mixed Scales
  2.5 Maps
  2.6 Contour Plots and Image Maps
  2.7 Time Series Plots
  2.8 Structure Plots

3 Scaling Up Graphics
  3.1 Introduction
  3.2 Upscaling as a General Problem in Statistics
  3.3 Area Plots
    3.3.1 Histograms
    3.3.2 Barcharts
    3.3.3 Mosaic Plots
  3.4 Point Plots
    3.4.1 Boxplots
    3.4.2 Scatterplots
    3.4.3 Parallel Coordinates
  3.5 From Areas to Points and Back
    3.5.1 α-Blending and Tonal Highlighting
  3.6 Modifying Plots
  3.7 Summary

4 Interacting with Graphics
  4.1 Introduction
  4.2 Interaction
  4.3 Interaction and Data Displays
    4.3.1 Querying
    4.3.2 Selection and Linking
    4.3.3 Selection Sequences
    4.3.4 Varying Plot Characteristics
    4.3.5 Interfaces and Interaction
    4.3.6 Degrees of Linking
    4.3.7 Warnings and Redmarking
  4.4 Interaction and Large Datasets
    4.4.1 Querying
    4.4.2 Selection, Linking, and Highlighting
    4.4.3 Varying Plot Characteristics for Large Datasets
  4.5 New Interactive Tasks
    4.5.1 Subsetting
    4.5.2 Aggregation and Recoding
    4.5.3 Transformations
    4.5.4 Weighting
