APPROVAL SHEET

Title of Thesis: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Name of Candidate: Kanishka Bhaduri
Doctor of Philosophy, 2008

Thesis and Abstract Approved:
Dr. Hillol Kargupta
Associate Professor
Department of Computer Science and Electrical Engineering

Date Approved:

Curriculum Vitae

Name: Kanishka Bhaduri.
Permanent Address: 211-F Atholgate Lane, Baltimore MD-21229.
Degree and date to be conferred: Doctor of Philosophy, 2008.
Date of Birth: April 24, 1980.
Place of Birth: Kolkata, India.
Secondary Education: South Point High School, Kolkata, India, 1997.

Collegiate institutions attended:
University of Maryland Baltimore County, Maryland, USA,
• Doctor of Philosophy, 2008.
University of Miami, USA,
• 2003–2004.
Jadavpur University, Kolkata, India,
• Bachelor of Engineering, Computer Science and Engineering, 2003.

Major: Computer Science.

Professional publications:

Refereed Journals
1. K. Bhaduri, H. Kargupta. Distributed Multivariate Regression in Peer-to-Peer Networks. Statistical Analysis and Data Mining (SAM) Special Issue on Best of SDM'08. (submitted) 2008.
2. K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining (SAM) accepted for publication. 2008.
3. K. Das, K. Bhaduri, K. Liu, H. Kargupta. Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering. Volume 20, Issue 4, pp. 475-488. April 2008.
4. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Explorations. Volume 8, Issue 2, pp. 11-20. December 2006.
5. R. Wolff, K. Bhaduri, H. Kargupta. A Generic Local Algorithm with Applications for Data Mining in Large Distributed Systems. IEEE Transactions on Knowledge and Data Engineering (TKDE) (submitted). 2007.

Book Chapter
1. K. Bhaduri, K. Das, K. SivaKumar, H. Kargupta, R. Wolff, R. Chen. Algorithms for Distributed Data Stream Mining.
A chapter in Data Streams: Models and Algorithms, Charu Aggarwal (Editor), Springer. pp. 309-332. 2006.

Refereed Conference Proceedings
1. K. Bhaduri, H. Kargupta. An Efficient Local Algorithm for Distributed Multivariate Regression in Peer-to-Peer Networks. Accepted for publication at the 2008 SIAM International Data Mining Conference. (Best of SDM'08)
2. R. Wolff, K. Bhaduri, H. Kargupta. Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems. SIAM International Conference in Data Mining, Bethesda, Maryland, USA. pp. 430-441. 2006.

Refereed Workshop Proceedings
1. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD workshop on web usage and analysis (WebKDD). Philadelphia, Pennsylvania, USA. 2006. (Selected as the most interesting paper from the WebKDD workshop)
2. K. Bhaduri, K. Das, H. Kargupta. Peer-to-Peer Data Mining. Autonomous Intelligent Systems: Agents and Data Mining. V. Gorodetsky, C. Zhang, V. Skormin, L. Cao (Editors), LNAI 4476, Springer. pp. 1-10. 2007.

Invited
1. S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. Distributed Data Mining in Peer-to-Peer Networks. IEEE Internet Computing special issue on Distributed Data Mining. Volume 10, Number 4, pp. 18-26. 2006.

Professional positions held:
Research Assistant. (05/2004–03/2008).
• Distributed Adaptive Discovery and Computation Lab, Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County (UMBC).
Software Internship. (05/2007–08/2007).
• Symantec Corporation, Columbia, Maryland.
Teaching Assistant. (08/2003–05/2004).
• Department of Computer Science, University of Miami, Florida.

ABSTRACT

Title of Dissertation: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Kanishka Bhaduri, Doctor of Philosophy, 2008

Thesis directed by: Dr.
Hillol Kargupta
Associate Professor
Department of Computer Science and Electrical Engineering

Peer-to-peer (P2P) systems such as Gnutella, Napster, e-Mule, Kazaa, and Freenet are becoming increasingly popular for many applications that go beyond downloading music files without paying for them. Examples include P2P systems for network storage, web caching, searching and indexing of relevant documents, and distributed network-threat analysis. These environments are rich in data, and this data, if mined, can provide a valuable source of information. Mining the web cache of users, for example, may often give information about their browsing patterns, leading to efficient searching, resource utilization, query routing and more. However, most off-the-shelf data analysis techniques are designed for centralized applications where the entire data is stored in a single location. These techniques do not work in a highly decentralized, distributed environment such as a P2P network. To solve this problem, we need distributed data mining algorithms that are fundamentally local, scalable, decentralized, asynchronous and anytime.

This research proposes DeFraLC: a Deterministic Framework for Local Computation of functions defined on data distributed in large scale (peer to peer) systems. Computing global data models in such environments can be very expensive. Moving all or some of the data to a central location does not work because of the high cost of centralization. The cost increases even more under a dynamic scenario where the peers' data and the network topology change arbitrarily. In this dissertation we have focused on developing algorithms for deterministic function computation in large scale P2P environments. Our algorithmic framework is local, which means that a peer can compute a function based on the information of only a handful of nearby neighbors, and the communication overhead of the algorithm is upper bounded by some constant, independent of the size of the system.
As a consequence, several messages can be pruned, leading to excellent scalability of our algorithms.

The first algorithm that we have developed, PeGMA (Peer-to-Peer Generic Monitoring Algorithm), is capable of computing complex functions defined on the average of the horizontally distributed data. This generic algorithm is extremely accurate, highly scalable and can seamlessly adapt to changes in the data or the network. Following PeGMA, several interesting algorithms can be developed, such as L2 norm monitoring of distributed data, which is a very powerful primitive. Using a two step feedback loop, a number of data mining algorithms have been proposed. The first step uses the local algorithm to raise a flag whenever the current data does not fit the function. The second step uses a feedback loop to sample data from the network and build a new function. The correctness of the local algorithm guarantees that once the computation terminates each peer has the same result as in a centralized scenario. We propose solutions for P2P k-means monitoring, eigen monitoring and multivariate regression in P2P environments. Furthermore, we have shown how a complex data mining algorithm such as decision tree induction can be developed for P2P environments. Finally, we have implemented all of the algorithms in a Distributed Data Mining Toolkit (DDMT) [44] developed at the DIADIC research lab at UMBC. Our extensive experimental results show that the proposed algorithms are accurate, efficient and highly scalable.

EFFICIENT LOCAL ALGORITHMS FOR DISTRIBUTED DATA MINING IN LARGE SCALE PEER TO PEER ENVIRONMENTS: A DETERMINISTIC APPROACH

by
Kanishka Bhaduri

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2008

© Copyright Kanishka Bhaduri 2008

