
Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments (PDF)

236 Pages·2010·2.44 MB·English
by Kanishka Bhaduri

Preview: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments

APPROVAL SHEET

Title of Thesis: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Name of Candidate: Kanishka Bhaduri, Doctor of Philosophy, 2008

Thesis and Abstract Approved: Dr. Hillol Kargupta, Associate Professor, Department of Computer Science and Electrical Engineering

Date Approved:

Curriculum Vitae

Name: Kanishka Bhaduri.
Permanent Address: 211-F Atholgate Lane, Baltimore MD-21229.
Degree and date to be conferred: Doctor of Philosophy, 2008.
Date of Birth: April 24, 1980.
Place of Birth: Kolkata, India.
Secondary Education: South Point High School, Kolkata, India, 1997.

Collegiate institutions attended:
University of Maryland Baltimore County, Maryland, USA
• Doctor of Philosophy, 2008.
University of Miami, USA
• 2003–2004.
Jadavpur University, Kolkata, India
• Bachelor of Engineering, Computer Science and Engineering, 2003.

Major: Computer Science.

Professional publications:

Refereed Journals
1. K. Bhaduri, H. Kargupta. Distributed Multivariate Regression in Peer-to-Peer Networks. Statistical Analysis and Data Mining (SAM) Special Issue on Best of SDM'08. (submitted) 2008.
2. K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining (SAM) accepted for publication. 2008.
3. K. Das, K. Bhaduri, K. Liu, H. Kargupta. Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network. IEEE Transactions on Knowledge and Data Engineering. Volume 20, Issue 4, pp. 475-488. April 2008.
4. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD Explorations. Volume 8, Issue 2, pp. 11-20. December 2006.
5. R. Wolff, K. Bhaduri, H. Kargupta. A Generic Local Algorithm with Applications for Data Mining in Large Distributed Systems. IEEE Transactions on Knowledge and Data Engineering (TKDE) (submitted). 2007.

Book Chapter
1. K. Bhaduri, K. Das, K. SivaKumar, H. Kargupta, R. Wolff, R. Chen. Algorithms for Distributed Data Stream Mining. A chapter in Data Streams: Models and Algorithms, Charu Aggarwal (Editor), Springer. pp. 309-332. 2006.

Refereed Conference Proceedings
1. K. Bhaduri, H. Kargupta. An Efficient Local Algorithm for Distributed Multivariate Regression in Peer-to-Peer Networks. Accepted for publication at the 2008 SIAM International Data Mining Conference. (Best of SDM'08)
2. R. Wolff, K. Bhaduri, H. Kargupta. Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems. SIAM International Conference in Data Mining, Bethesda, Maryland, USA. pp. 430-441. 2006.

Refereed Workshop Proceedings
1. K. Liu, K. Bhaduri, K. Das, P. Nguyen, H. Kargupta. Client-side Web Mining for Community Formation in Peer-to-Peer Environments. SIGKDD workshop on web usage and analysis (WebKDD). Philadelphia, Pennsylvania, USA. 2006. (Selected as the most interesting paper from the WebKDD workshop)
2. K. Bhaduri, K. Das, H. Kargupta. Peer-to-Peer Data Mining. Autonomous Intelligent Systems: Agents and Data Mining. V. Gorodetsky, C. Zhang, V. Skormin, L. Cao (Editors), LNAI 4476, Springer. pp. 1-10. 2007.

Invited
1. S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. Distributed Data Mining in Peer-to-Peer Networks. IEEE Internet Computing special issue on Distributed Data Mining. Volume 10, Number 4, pp. 18-26. 2006.

Professional positions held:
Research Assistant. (05/2004–03/2008).
• Distributed Adaptive Discovery and Computation Lab, Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County (UMBC).
Software Internship. (05/2007–08/2007).
• Symantec Corporation, Columbia, Maryland.
Teaching Assistant. (08/2003–05/2004).
• Department of Computer Science, University of Miami, Florida.

ABSTRACT

Title of Dissertation: Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to Peer Environments: A Deterministic Approach

Kanishka Bhaduri, Doctor of Philosophy, 2008

Thesis directed by: Dr. Hillol Kargupta, Associate Professor, Department of Computer Science and Electrical Engineering

Peer-to-peer (P2P) systems such as Gnutella, Napster, e-Mule, Kazaa, and Freenet are becoming increasingly popular for many applications that go beyond downloading music files without paying for them. Examples include P2P systems for network storage, web caching, searching and indexing of relevant documents, and distributed network-threat analysis. These environments are rich in data, and this data, if mined, can provide a valuable source of information. Mining the web cache of users, for example, may often give information about their browsing patterns, leading to efficient searching, resource utilization, query routing and more. However, most of the off-the-shelf data analysis techniques are designed for centralized applications where the entire data is stored in a single location. These techniques do not work in a highly decentralized, distributed environment such as a P2P network. We need distributed data mining algorithms that are fundamentally local, scalable, decentralized, asynchronous and anytime to solve this problem.

This research proposes DeFraLC: a Deterministic Framework for Local Computation of functions defined on data distributed in large scale (peer to peer) systems. Computing global data models in such environments can be very expensive. Moving all or some of the data to a central location does not work because of the high cost involved in centralization. The cost increases even more under a dynamic scenario where the peers' data and the network topology change arbitrarily. In this dissertation we have focused on developing algorithms for deterministic function-computation in large scale P2P environments. Our algorithmic framework is local, which means that a peer can compute a function based on the information of only a handful of nearby neighbors, and the communication overhead of the algorithm is upper bounded by some constant, independent of the size of the system. As a consequence, several messages can be pruned, leading to excellent scalability of our algorithms.

The first algorithm that we have developed, PeGMA (Peer-to-Peer Generic Monitoring Algorithm), is capable of computing complex functions defined on the average of the horizontally distributed data. This generic algorithm is extremely accurate, highly scalable and can seamlessly adapt to changes in the data or the network. Following PeGMA, several interesting algorithms can be developed, such as the L2 norm monitoring of distributed data, which is a very powerful primitive. Using a two step feedback loop, a number of data mining algorithms have been proposed. The first step uses the local algorithm to raise a flag whenever the current data does not fit the function. The second step uses a feedback loop to sample data from the network and build a new function. The correctness of the local algorithm guarantees that once the computation terminates each peer has the same result as in a centralized scenario. We propose solutions for P2P k-means monitoring, eigen monitoring and multivariate regression in P2P environments. Furthermore, we have shown how a complex data mining algorithm such as decision tree induction can be developed for P2P environments. Finally, we have implemented all of the algorithms in a Distributed Data Mining Toolkit (DDMT) [44] developed at the DIADIC research lab at UMBC. Our extensive experimental results show that the proposed algorithms are accurate, efficient and highly scalable.

EFFICIENT LOCAL ALGORITHMS FOR DISTRIBUTED DATA MINING IN LARGE SCALE PEER TO PEER ENVIRONMENTS: A DETERMINISTIC APPROACH

by Kanishka Bhaduri

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2008

© Copyright Kanishka Bhaduri 2008
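The two step feedback loop described in the abstract (flag when the data no longer fits the current function, then sample data and build a new one) can be illustrated with a small toy sketch. This is not the dissertation's PeGMA algorithm: the check below computes the global average directly in one place, whereas the thesis performs it with a local algorithm that only talks to nearby neighbors. The "model" is assumed to be a plain mean vector, and every name here (`needs_rebuild`, `rebuild_model`, `monitor_and_update`, `epsilon`, `sample_size`) is hypothetical.

```python
import math
import random


def l2_norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def needs_rebuild(peer_data, model, epsilon):
    """Step 1 (monitoring): flag when the average data vector lies farther
    than epsilon (in L2 norm) from the current model.
    ASSUMPTION: computed centrally here; the thesis does this locally."""
    all_points = [p for points in peer_data for p in points]
    avg = mean_vector(all_points)
    diff = [a - m for a, m in zip(avg, model)]
    return l2_norm(diff) > epsilon


def rebuild_model(peer_data, sample_size, rng):
    """Step 2 (feedback): sample points from the peers and fit a new model.
    ASSUMPTION: the model is simply the mean of the sample."""
    all_points = [p for points in peer_data for p in points]
    sample = rng.sample(all_points, min(sample_size, len(all_points)))
    return mean_vector(sample)


def monitor_and_update(peer_data, model, epsilon, sample_size, rng):
    """Run one round of the loop; returns (model, rebuilt_flag)."""
    if needs_rebuild(peer_data, model, epsilon):
        return rebuild_model(peer_data, sample_size, rng), True
    return model, False


if __name__ == "__main__":
    rng = random.Random(0)
    # Two peers, each holding two 2-dimensional points (horizontal partitioning).
    peers = [[[0.0, 0.0], [0.2, -0.2]], [[0.1, 0.1], [-0.1, 0.1]]]
    model = [0.0, 0.0]
    model, rebuilt = monitor_and_update(peers, model, 0.5, 4, rng)
    print(rebuilt, model)
    peers[0] = [[2.0, 2.0], [2.2, 1.8]]  # the first peer's data drifts
    model, rebuilt = monitor_and_update(peers, model, 0.5, 4, rng)
    print(rebuilt, model)
```

In the demo, the first round leaves the model alone because the data average stays within `epsilon` of it; after the first peer's data drifts, the flag is raised and the model is refitted from a sample.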

