
New Developments in Unsupervised Outlier Detection: Algorithms and Applications PDF

287 Pages·2021·4.645 MB·English

Preview New Developments in Unsupervised Outlier Detection: Algorithms and Applications

Xiaochun Wang · Xiali Wang · Mitch Wilkes

New Developments in Unsupervised Outlier Detection
Algorithms and Applications

Xiaochun Wang
School of Software Engineering
Xi’an Jiaotong University
Xi’an, Shaanxi, China

Xiali Wang
School of Information Engineering
Chang’an University
Xi’an, Shaanxi, China

Mitch Wilkes
Department of Electrical Engineering and Computer Science
Vanderbilt University
Nashville, TN, USA

ISBN 978-981-15-9518-9
ISBN 978-981-15-9519-6 (eBook)
https://doi.org/10.1007/978-981-15-9519-6

Jointly published with Xi’an Jiaotong University Press
The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Xi’an Jiaotong University Press.

© Xi’an Jiaotong University Press 2021

This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

An active research topic in data mining, outlier detection aims to discover observations in a dataset that deviate from other observations so much as to arouse suspicion that they were generated by a different mechanism, and it is of utmost importance in many application domains. Unsupervised outlier detection plays a crucial role in outlier detection research and poses enormous theoretical and applied challenges to advanced data mining technology using unsupervised learning techniques. This monograph addresses unsupervised outlier detection in a local setting of k-nearest neighborhood. Unlike traditional distribution-based outlier detection techniques, k-nearest neighbor-based outlier detection approaches, typified by distance-based and density-based outlier detection methods, have become more and more popular. However, these methods are very sensitive to the value of k, may produce different rankings for the top outliers, and it is doubtful in general whether they work well for high-dimensional datasets. To partially circumvent these problems, the algorithms of choice proposed for unsupervised outlier detection in the current research combine k-nearest neighbor-based outlier detection methods and genetic clustering algorithms.

Distance-based outliers and density-based outliers denote two different kinds of definitions for outlier detection algorithms. Distance-based outlier detection methods can identify more globally oriented outliers, while density-based outlier detection methods can identify more locally distributed outliers. In this book, several new global outlier factors and new local outlier factors are proposed, and efficient and effective outlier detection algorithms are developed upon them that are easy to implement and can deliver performance competitive with existing solutions.
Having been exploited in outlier detection research for years, distance-based and density-based outlier detection methods work, in theory, by calculating the k-nearest neighbors of each data point, computing an outlier score for each point, ranking all the objects according to their scores, and finally returning the data points with the largest scores as outliers. However, there is no reason to assume that the highest-scoring points are necessarily outliers. To take this aspect into account, several outlier indicators are introduced to judge whether distance-based and density-based outliers exist or not. In this way, outliers can be not only detected but also discriminated from boundary points.

It is generally agreed that learning, either supervised or unsupervised, can provide the best possible specification of known classes and offer inference for outlier detection by a dissimilarity threshold from the nominal feature space. Novel object detection can take a step further by investigating whether these outliers form new dense clusters in both the feature space and the image space. By defining a novel object to be a pattern group that has not been seen before in the feature space and the image space, a nonconventional approach is proposed for multiple novel object detection applications.

Time series often contain outliers and structural changes. These unexpected events are of the utmost importance in fraud detection, as they may pinpoint suspicious activities. The presence of such unusual activities can easily mislead conventional time series analysis and yield erroneous conclusions. Traditionally, time series data are first divided into small chunks, and k-nearest neighbor-based outlier detection approaches are then applied to monitor behavior over time. However, time series data are so large that they cannot be scanned multiple times. Further, because they are produced continuously, new data keep arriving. To cope with the speed at which they arrive, a simple statistical parameter-based anomaly method is proposed for environmental time series data fraud detection applications.
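As an illustration of the four steps the foreword attributes to k-nearest neighbor-based methods (find the k-nearest neighbors, compute a score, rank the points, return the top scorers), here is a minimal sketch. It is not the authors' algorithm. It assumes Euclidean distance and takes the distance to the k-th nearest neighbor as the outlier score, which is one common distance-based choice; the function names knn_outlier_scores and top_n_outliers are hypothetical.

```python
import numpy as np


def knn_outlier_scores(X, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Assumption: Euclidean distance and a brute-force O(n^2) distance
    matrix, which is only practical for small datasets.
    """
    X = np.asarray(X, dtype=float)
    if not 1 <= k < len(X):
        raise ValueError("k must satisfy 1 <= k < number of points")
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # After partitioning, column k-1 holds the k-th smallest distance per row.
    return np.partition(dist, k - 1, axis=1)[:, k - 1]


def top_n_outliers(X, k, n):
    """Rank points by score (largest first) and return the top n indices."""
    scores = knn_outlier_scores(X, k)
    order = np.argsort(-scores)
    return order[:n], scores


if __name__ == "__main__":
    # Hypothetical data: a tight cluster plus one distant point.
    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 0.1, size=(20, 2))
    data = np.vstack([cluster, [[5.0, 5.0]]])
    indices, scores = top_n_outliers(data, k=3, n=1)
    print(indices, scores[indices])
```

Note that the sketch always returns n points, whether or not any of them is a genuine outlier. The result also depends on the chosen k, which is the sensitivity the foreword mentions.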
The chapters cover such topics as distance-based outlier detection, density-based outlier detection, clustering-based outlier detection, and the applications of these techniques to boundary point detection, novel object detection, and fraud detection in environmental time series data. Overall, the book features a perspective on bridging the gap between k-nearest neighbor-based outlier detection and clustering-based outlier detection, laying the groundwork for future advances in unsupervised outlier detection research. I hope new developments in unsupervised outlier detection algorithms and applications will serve as an invaluable reference for outlier detection researchers for years to come.

Xi’an, China, May 2020
Xubang Shen
Chinese National Academician

Preface

Data mining represents a complex of technologies that are rooted in many disciplines (mathematics, statistics, computer science, physics, engineering, biology, etc.) and that have diverse applications in a large variety of domains (business, health care, science and engineering, etc.). Basically, data mining can be seen as the science of exploring large datasets to extract implicit, previously unknown, and potentially useful information. Recently, outlier detection as a research area in data mining has advanced dramatically. A multitude of data mining techniques has been developed with impact on unsupervised outlier detection areas. Our aim in writing this book is to provide a friendly and comprehensive guide for those interested in exploring this fascinating domain. In other words, the purpose of this book is to provide easy access to the recent contributions to unsupervised outlier detection theory and to assess its impact on the field and its implications for theory and practice. It is also intended to be used as an introductory text for advanced undergraduate-level or graduate-level courses in computer science, engineering, or other fields. In this regard, the book is intended to be largely self-contained, although it is assumed that the potential reader has a fairly good knowledge of mathematics, statistics, and computer science.

The book is organized as follows.
The first part of this book aims to review the state-of-the-art unsupervised techniques used in outlier detection. The material presented in the second part of this book is an extended version of several selected conference articles and represents some of the most recent important advancements in the field of unsupervised outlier detection. In the third part of this book, outlier detection techniques are applied to practical applications.

More specifically, the first part consists of two chapters. In Chap. 1, an overview of the book chapters and a summary of contributions are presented. First, the research issues on unsupervised outlier detection are explained. An overview of the book then follows. Finally, contributions are highlighted. In Chap. 2, some well-known unsupervised outlier detection techniques and models are reviewed. This chapter begins with an overview of some of the many facets of outlier analysis. Then, it investigates some standard outlier detection approaches. Finally, the problem of evaluating the performance of different outlier detection models is discussed.

The second part consists of five chapters, which provide an ever-growing list of unsupervised outlier detection models. In Chap. 3, a divisive hierarchical clustering algorithm is explored as a solution for fast distance-based outlier detection problems. In Chap. 4, a new k-nearest neighbor centroid-based outlier detection method is proposed for both distance-based and density-based outlier detection tasks. In Chap. 5, we present a new fast minimum spanning tree-inspired algorithm for outlier detection tasks. In Chap. 6, an efficient spectral clustering-based outlier detection algorithm is proposed to extract information from data in such a way that distribution-based outlier detection techniques can be employed for multi-dimensional data. In Chap. 7, an outlier indicator is proposed to enhance outlier detection in which the selection of appropriate parameters is less difficult but more meaningful. The performances evaluated on some standard datasets demonstrate the effectiveness and efficiency of these methods.

The third part of this book is concerned with the applications of outlier detection techniques in real-life problems. Following the techniques discussed in the second part, we devote one chapter, Chap. 8, to a boundary point detection problem, another, Chap. 9, to a novel object detection problem, and a third, Chap. 10, to a time series fraud detection problem. An extensive bibliography is included, which is intended to provide the reader with useful information covering all the topics approached in this book.

Last, but certainly not least, it is our hope that graduate students, young and senior researchers, and professionals from both academia and industry will find the book useful for understanding and reviewing current approaches in unsupervised outlier detection research.

Xi’an, China: Xiaochun Wang
Xi’an, China: Xiali Wang
Nashville, USA: Mitch Wilkes
June 2020

Acknowledgements

First and foremost, the authors would like to thank the National Natural Science Foundation of China for its valuable support of this work under award 61473220 and the Natural Science Foundation of Shaanxi Province, China, for its valuable support of this work under award 2020JM-046. Without this support, this work would not have been possible.

The authors gratefully acknowledge the contribution of many people. First of all, they would like to take this opportunity to acknowledge the work of the graduate students of the School of Software Engineering at Xi’an Jiaotong University, Yiqin Chen, Yongqiang Ma, Yuan Wang, and Jia Li, for their diligence and quality work throughout these projects. More specifically, Y. Chen developed a k-nearest neighbor centroid-based outlier detection algorithm and applied it to boundary point detection. Y. Ma developed a miniMST-based outlier detection algorithm. Y. Wang proposed a spectral clustering-based outlier detection algorithm. J. Li accomplished all the outlier detection experiments for spectral clustering-based outlier detection on real multi-dimensional datasets. The authors would also like to thank Yuan Bao of Xi’an Jiaotong University Press for her timely suggestions and encouragement with the preparation of the manuscript.

Finally, the authors wish to express their deep gratitude to their families for their assistance in many ways for the successful completion of this book.

Contents

Part I Introduction

1 Overview and Contributions
  1.1 Introduction
  1.2 Research Issues on Unsupervised Outlier Detection
  1.3 Overview of the Book
  1.4 Contributions
  1.5 Conclusions

2 Developments in Unsupervised Outlier Detection Research
  2.1 Introduction
    2.1.1 A Brief Overview of the Early Developments in Outlier Analysis
  2.2 Some Standard Unsupervised Outlier Detection Approaches
    2.2.1 Probabilistic Model-Based Outlier Detection Approach
    2.2.2 Clustering-Based Outlier Detection Approaches
    2.2.3 Distance-Based Outlier Detection Approaches
    2.2.4 Density-Based Outlier Detection Approaches
    2.2.5 Outlier Detection for Time Series
  2.3 Performance Evaluation Metrics of Outlier Detection Approaches
    2.3.1 Precision, Recall and Rank Power
  2.4 Conclusions
  References
