
Handbook of Statistics 24: Data Mining and Data Visualization PDF

660 Pages·2005·12.872 MB·English

Preview Handbook of Statistics 24: Data Mining and Data Visualization

Handbook of Statistics, Volume 24
General Editor: C.R. Rao

Data Mining and Data Visualization

Edited by

C.R. Rao
Center for Multivariate Analysis, Department of Statistics, The Pennsylvania State University, University Park, PA, USA

E.J. Wegman
Center for Computational Statistics, George Mason University, Fairfax, VA, USA

J.L. Solka
Naval Surface Warfare Center, DD, Dahlgren, VA, USA

2005

Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo

ELSEVIER B.V., Radarweg 29, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc., 525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
ELSEVIER Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd, 84 Theobalds Road, London WC1X 8RR, UK

© 2005 Elsevier B.V. All rights reserved.

This work is protected under copyright by Elsevier B.V., and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use.

Permissions may be sought directly from Elsevier’s Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, e-mail: permissions@elsevier.com. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions).

In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400; fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter.

Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier’s Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

First edition 2005

Library of Congress Cataloging in Publication Data
A catalog record is available from the Library of Congress.

British Library Cataloguing in Publication Data
A catalogue record is available from the British Library.

ISBN: 0-444-51141-5
ISSN: 0169-7161

The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.

Preface

It has long been a philosophical theme that statisticians ought to be data centric as opposed to methodology centric.
Throughout the history of the statistical discipline, the most innovative methodological advances have come when brilliant individuals have wrestled with new data structures. Inferential statistics, linear models, sequential analysis, nonparametric statistics, robust statistical methods, and exploratory data analysis have all come about by a focus on a puzzling new data structure. The computer revolution has brought forth a myriad of new data structures for researchers to contend with, including massive datasets, high-dimensional datasets, opportunistically collected datasets, image data, text data, genomic and proteomic data, and a host of other data challenges that could not be dealt with without modern computing resources.

This volume presents a collection of chapters that focus on data; in our words, it is data-centric. Data mining and data visualization are both attempts to handle nonstandard statistical data, that is, data that do not satisfy traditional assumptions of independence, stationarity, identical distribution, or parametric formulations. We believe it is desirable for statisticians to embrace such data and bring innovative perspectives to these emerging data types.

This volume is conceptually divided into three sections. The first focuses on aspects of data mining, the second on statistical and related analytical methods applicable to data mining, and the third on data visualization methods appropriate to data mining. In Chapter 1, Wegman and Solka present an overview of data mining including both statistical and computer science-based perspectives. We call attention to their description of the emerging field of massive streaming datasets. Kaufman and Michalski approach data mining from a machine learning perspective and emphasize computational intelligence and knowledge mining. Marchette describes exciting methods for mining computer security data with the important application to cybersecurity. Martinez turns our attention to mining of text data and some approaches to feature extraction from text data. Solka et al. also focus on text mining, applying these methods to cross-corpus discovery. They describe methods and software for discovering subtle but significant associations between two corpora covering disparate fields. Finally, Duric et al. round out the data mining methods with a discussion of information hiding known as steganography.

The second section, on statistical methods and related methods applicable to data mining, begins with Rao’s description of methods applicable to dimension reduction and graphical representation. Hand presents an overview of methods of statistical pattern recognition, while Scott and Sain present an update of Scott’s seminal 1992 book on multivariate density estimation. Hubert et al., in turn, describe the difficult problem of analytically determining multivariate outliers and their impact on robustness. Sutton describes recent developments in classification and regression trees, especially the concepts of bagging and boosting. Marchette et al. describe some new computationally effective classification tools, and, finally, Said gives an overview of genetic algorithms.

The final section on data visualization begins with a description of rotations (grand tour methods) for high-dimensional visualization by Buja et al. This is followed by Carr’s description of templates and software for showing statistical summaries, perhaps the most novel of current approaches to visual data presentation. Wilhelm describes in depth a framework for interactive statistical graphics. Finally, Chen describes a computer scientist’s approach to data visualization coupled with virtual reality.

The editors sincerely hope that this combination of philosophical approaches and technical descriptions stimulates, and perhaps even irritates, our readers to encourage them to think deeply and with innovation about these emerging data structures and develop even better approaches to enrich our discipline.

C.R. Rao
E.J. Wegman
J.L. Solka

Table of contents

Preface
Contributors

Ch. 1. Statistical Data Mining
Edward J. Wegman and Jeffrey L. Solka
  1. Introduction
  2. Computational complexity
  3. The computer science roots of data mining
  4. Data preparation
  5. Databases
  6. Statistical methods for data mining
  7. Visual data mining
  8. Streaming data
  9. A final word
  Acknowledgements
  References

Ch. 2. From Data Mining to Knowledge Mining
Kenneth A. Kaufman and Ryszard S. Michalski
  1. Introduction
  2. Knowledge generation operators
  3. Strong patterns vs. complete and consistent rules
  4. Ruleset visualization via concept association graphs
  5. Integration of knowledge generation operators
  6. Summary
  Acknowledgements
  References

Ch. 3. Mining Computer Security Data
David J. Marchette
  1. Introduction
  2. Basic TCP/IP
  3. The threat
  4. Network monitoring
  5. TCP sessions
  6. Signatures versus anomalies
  7. User profiling
  8. Program profiling
  9. Conclusions
  References

Ch. 4. Data Mining of Text Files
Angel R. Martinez
  1. Introduction and background
  2. Natural language processing at the word and sentence level
  3. Approaches beyond the word and sentence level
  4. Summary
  References

Ch. 5. Text Data Mining with Minimal Spanning Trees
Jeffrey L. Solka, Avory C. Bryant and Edward J. Wegman
  Introduction
  1. Approach
  2. Results
  3. Conclusions
  Acknowledgements
  References

Ch. 6. Information Hiding: Steganography and Steganalysis
Zoran Duric, Michael Jacobs and Sushil Jajodia
  1. Introduction
  2. Image formats
  3. Steganography
  4. Steganalysis
  5. Relationship of steganography to watermarking
  6. Literature survey
  7. Conclusions
  References

Ch. 7. Canonical Variate Analysis and Related Methods for Reduction of Dimensionality and Graphical Representation
C. Radhakrishna Rao
  1. Introduction
  2. Canonical coordinates
  3. Principal component analysis
