Table Of ContentExploiting the Power of
Group Differences
Using Patterns to Solve Data Analysis Problems
Synthesis Lectures on Data
Mining and Knowledge
Discovery
Editors
JiaweiHan,UniversityofIllinoisatUrbana-Champaign
LiseGetoor,UniversityofCalifornia,SantaCruz
WeiWang,UniversityofCalifornia,LosAngeles
JohannesGehrke,CornellUniversity
RobertGrossman,UniversityofChicago
SynthesisLecturesonDataMiningandKnowledgeDiscoveryiseditedbyJiaweiHan,Lise
Getoor,WeiWang,JohannesGehrke,andRobertGrossman.Theseriespublishes50-to150-page
publicationsontopicspertainingtodatamining,webmining,textmining,andknowledge
discovery,includingtutorialsandcasestudies.Potentialtopicsinclude:dataminingalgorithms,
innovativedataminingapplications,dataminingsystems,miningtext,webandsemi-structured
data,highperformanceandparallel/distributeddatamining,dataminingstandards,datamining
andknowledgediscoveryframeworkandprocess,dataminingfoundations,miningdatastreams
andsensordata,miningmulti-mediadata,miningsocialnetworksandgraphdata,miningspatial
andtemporaldata,pre-processingandpost-processingindatamining,robustandscalable
statisticalmethods,security,privacy,andadversarialdatamining,visualdatamining,visual
analytics,anddatavisualization.
ExploitingthePowerofGroupDifferences:UsingPatternstoSolveDataAnalysis
Problems
GuozhuDong
2019
MiningStructuresofFactualKnowledgefromText
XiangRenandJiaweiHan
2018
IndividualandCollectiveGraphMining:Principles,Algorithms,andApplications
DanaiKoutraandChristosFaloutsos
2017
PhraseMiningfromMassiveTextandItsApplications
JialuLiu,JingboShang,andJiaweiHan
2017
iii
ExploratoryCausalAnalysiswithTimeSeriesData
JamesM.McCracken
2016
MiningHumanMobilityinLocation-BasedSocialNetworks
HuijiGaoandHuanLiu
2015
MiningLatentEntityStructures
ChiWangandJiaweiHan
2015
ProbabilisticApproachestoRecommendations
NicolaBarbieri,GiuseppeManco,andEttoreRitacco
2014
OutlierDetectionforTemporalData
ManishGupta,JingGao,CharuAggarwal,andJiaweiHan
2014
ProvenanceDatainSocialMedia
GeoffreyBarbier,ZhuoFeng,PritamGundecha,andHuanLiu
2013
GraphMining:Laws,Tools,andCaseStudies
D.ChakrabartiandC.Faloutsos
2012
MiningHeterogeneousInformationNetworks:PrinciplesandMethodologies
YizhouSunandJiaweiHan
2012
PrivacyinSocialNetworks
ElenaZheleva,EvimariaTerzi,andLiseGetoor
2012
CommunityDetectionandMininginSocialMedia
LeiTangandHuanLiu
2010
EnsembleMethodsinDataMining:ImprovingAccuracyThroughCombining
Predictions
GiovanniSeniandJohnF.Elder
2010
ModelingandDataMininginBlogosphere
NitinAgarwalandHuanLiu
2009
Copyright©2019byMorgan&Claypool
Allrightsreserved.Nopartofthispublicationmaybereproduced,storedinaretrievalsystem,ortransmittedin
anyformorbyanymeans—electronic,mechanical,photocopy,recording,oranyotherexceptforbriefquotations
inprintedreviews,withoutthepriorpermissionofthepublisher.
ExploitingthePowerofGroupDifferences:UsingPatternstoSolveDataAnalysisProblems
GuozhuDong
www.morganclaypool.com
ISBN:9781681735023 paperback
ISBN:9781681735030 ebook
ISBN:9781681735047 hardcover
DOI10.2200/S00897ED1V01Y201901DMK016
APublicationintheMorgan&ClaypoolPublishersseries
SYNTHESISLECTURESONDATAMININGANDKNOWLEDGEDISCOVERY
Lecture#16
SeriesEditors:JiaweiHan,UniversityofIllinoisatUrbana-Champaign
LiseGetoor,UniversityofCalifornia,SantaCruz
WeiWang,UniversityofCalifornia,LosAngeles
JohannesGehrke,CornellUniversity
RobertGrossman,UniversityofChicago
SeriesISSN
Print2151-0067 Electronic2151-0075
Exploiting the Power of
Group Differences
Using Patterns to Solve Data Analysis Problems
Guozhu Dong
WrightStateUniversity
SYNTHESISLECTURESONDATAMININGANDKNOWLEDGE
DISCOVERY#16
M
&C Morgan &cLaypool publishers
ABSTRACT
This book presents pattern-based problem-solving methods for a variety of machine learning
and data analysis problems. The methods are all based on techniques that exploit the power of
groupdifferences.Theymakeuseofgroupdifferencesrepresentedusingemergingpatterns(aka
contrastpatterns),whicharepatternsthatmatchsignificantlydifferentnumbersofinstancesin
different data groups. A large number of applications outside of the computing discipline are
alsoincluded.
Emerging patterns (EPs) are useful in many ways. EPs can be used as features, as sim-
ple classifiers, as subpopulation signatures/characterizations, and as triggering conditions for
alerts. EPs can be used in gene ranking for complex diseases since they capture multi-factor
interactions.ThelengthofEPscanbeusedtodetectanomalies,outliers,andnovelties.Emerg-
ing/contrast pattern-based methods for clustering analysis and outlier detection do not need
distancemetrics,avoidingpitfallsofthelatterinexploratoryanalysisofhighdimensionaldata.
EP-basedclassifierscanachievegoodaccuracyevenwhenthetrainingdatasetsaretiny,making
themusefulforexploratorycompoundselectionindrugdesign.EPscanserveasopportunities
inopportunity-focusedboostingandareusefulforconstructingpowerfulconditionalensembles.
EP-basedmethodsoftenproduceinterpretablemodelsandresults.Ingeneral,EPsareusefulfor
classification,clustering,outlierdetection,generankingforcomplexdiseases,predictionmodel
analysisandimprovement,andsoon.
EPs are useful for many tasks because they represent group differences, which have ex-
traordinarypower.Moreover,EPsrepresentmulti-factorinteractions,whoseeffectivehandling
isofvitalimportanceandisamajorchallengeinmanydisciplines.
Based on the results presented in this book, one can clearly say that patterns are useful,
especiallywhentheyarelinkedtoissuesofinterest.
Webelievethatmanyeffectivewaystoexploitgroupdifferences’powerstillremaintobe
discovered.Hopefullythisbookwillinspirereaderstodiscoversuchnewways,besidesshowing
themexistingways,tosolvevariouschallengingproblems.
KEYWORDS
data mining, machine learning, data analytic, classification, regression, cluster-
ing, anomaly detection, outlier detection, intrusion detection, compound selec-
tion,complexdiseaseanalysis,extremeinstanceselection,factorranking,prediction
model analysis, group difference analysis, feature, multifactor interaction, diverse
relationship,heterogeneity,boosting,ensemble,associationrule,emergingpattern,
contrastpattern,frequentpattern,distancemetric,interpretability
vii
To my wife for her love and support!
– Guozhu Dong
ix
Contents
Acknowledgments ................................................. xv
1 IntroductionandOverview ...........................................1
1.1 ImportanceofGroupDifferences .................................... 1
1.2 SummaryofChapters ............................................. 2
1.2.1 ReadingOrderoftheChapters ................................ 5
1.3 KnownUsesofGroupDifferencesviaEmergingPatterns................. 5
1.4 UniquePropertiesofEmergingPatternBasedMethods .................. 6
1.5 ScenariosWhereEmergingPatternsAreEspeciallyUseful................ 7
1.6 RelatedTopicsNotCoveredinThisBook ............................. 8
2 GeneralPreliminaries ...............................................9
2.1 Attributes,Features,andVariables ................................... 9
2.2 DataInstancesandDatasets ........................................ 9
2.3 AttributeBinningandDiscretization ................................ 10
2.4 Patterns,MatchingDatasets,Supports,andFrequentPatterns............ 12
2.5 EquivalenceClasses,ClosedPatterns,MinimalGenerators,andBorders.... 13
2.6 IllustratingExamples ............................................. 13
3 EmergingPatternsandaFlexibleMiningAlgorithm.....................15
3.1 SettingforGroupDifferenceAnalysis ............................... 15
3.2 BasicsofEmergingPatterns ....................................... 15
3.3 BorderDiff:ASimple,FlexibleEmergingPatternMiningAlgorithm ........ 18
3.4 WhatEmergingPatternsCanRepresent ............................. 20
3.5 ComparisonwithAssociationRules,Confidence,andOddsRatio......... 21
3.6 PointerstoSectionsIllustratingUsesofEmergingPatterns .............. 23
3.7 TraditionalAnalysisofGroupDifferences ............................ 24
3.8 DiscussionofRelatedIssues ....................................... 25