Models and Methods for Privacy-Preserving Data Publishing and Analysis
Johannes Gehrke
Department of Computer Science
http://www.cs.cornell.edu/johannes
SIGKDD 2006 Tutorial, August 2006

An Abundance of Data
• Supermarket scanners
• Credit card transactions
• Direct mail response
• Call center records
• ATM machines
• Web server logs
• Customer web site trails
• Podcasts
• Blogs
• Closed caption
• Scientific experiments
• Sensors
• Cameras
• Interactions in social networks
• Newswires
• Speech-to-text translation
• Email
• Print, film, optical, and magnetic storage: 5 exabytes (EB) of new information in 2002, doubled in the last three years [How Much Information 2003, UC Berkeley]

Driving Factors: A LARGE Hardware Revolution
[Figure credit: Intel Corporation]

Driving Factors: A small Hardware Revolution
• Experts on ants estimate that there are 10^16 to 10^17 ants on earth. In the year 1997, we produced one transistor per ant.
[Gordon Moore]

[Figure: panels labeled 1962, 1972, 1982, 2002]

Pulsars
• Pulsars are rotating stars
• Of interest are
  • Millisecond pulsars
  • Compact binaries
• Example:
  • Hulse-Taylor binary
  • Used to infer gravitational waves in support of Einstein’s General Theory of Relativity
  • Nobel Prize in Physics in 1993

Project Requirements
• Data
  • 14 TB every 2 weeks
  • Shipped on USB-2 disk drives
  • Need to archive raw data 5+ years
  • Need to make data products available to the astronomy research community
• Processing
  • Extremely processor intensive
  • Find new pulsars, and other interesting phenomena
[Calimlim, Cordes, Demers, Gehrke, Lifka; http://arecibo.tc.cornell.edu]

Driving Factors: Analysis Capabilities
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

Driving Factors: Connectivity and Bandwidth
• Metcalfe’s law (a network’s usefulness grows with the square of the number of users)
• Gilder’s law (bandwidth doubles every 6 months)

Concerns About Privacy
Recent example: “Last week AOL did another stupid thing, but at least it was in the name of science….” [Annalee Newitz, AlterNet, August 15, 2006]

A Face Is Exposed for AOL Searcher No. 4417749
[New York Times, August 9, 2006]
… No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.” And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her. …

A Face Is Exposed for AOL Searcher No. 4417749 (Contd.)
[New York Times, August 9, 2006]
Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. “We all have a right to privacy,” she said. “Nobody should have found this all out.”
http://data.aolsearchlogs.com

The Setup
[Figure: a Server with database DB; Customer 1, Customer 2, Customer 3, …, Customer N hold records r_1, r_2, r_3, …, r_N]

Model I: Untrusted Data Collector
[Figure: Company A with database DB; Customer 1, Customer 2, Customer 3, …, Customer N hold records r_1, r_2, r_3, …, r_N. Task: find aggregate properties of {r_1, r_2, …, r_N}]

Minimal Information Sharing
• Ideally, we want an algorithm that discloses only the query result, and only to the requesting party. (In practice, we need some extra disclosure.)
• How do we design algorithms that compute queries while preserving data privacy?
• How do we measure privacy (this extra disclosure)?

Types of Disclosure
Tolerated disclosure:
• Statistically private: too fuzzy or unlikely (knowledge as distribution: this tutorial!)
• Computationally private: hard to use (cryptographic protocols)

Model II: Trusted Data Collector
[Figure: labels Government, Company A, and database DB; Customer 1, Customer 2, Customer 3, …, Customer N hold records r_1, r_2, r_3, …, r_N. Task: publish properties of {r_1, r_2, …, r_N}]

Disclosure Limitations
• Ideally, we want a solution that discloses as much statistical information as possible while preserving privacy of the individuals who contributed data.
• How do we design algorithms that allow the “largest” set of queries that can be disclosed while preserving data privacy?
• How do we measure disclosure?

This Tutorial: Statistical Methods
• Privacy-preserving data analysis
• Privacy-preserving data publishing
Goal:
• Rather than talk about everything superficially, but nothing in-depth, make hard choices
Caveats:
• Not a comprehensive survey

What is Left Out?
• Work on secure multi-party computation (secure join, secure intersection, homomorphic encryption, certificate revocation, etc.)
• Architectural and language issues (Hippocratic databases, P3P, etc.)
• Disclosure control (statistical databases, auditing, database queries, etc.)
• Privacy through distributed data mining
• Resources
  • See excellent tutorials by Rakesh Agrawal and Chris Clifton; keynote talk by Srikant Ramakrishnan at this conference.

Tutorial Outline
• Untrusted data collector
• Trusted data collector

Privacy Preserving Data Mining
[Figure: a Company with database D builds a data mining model over {t_1, t_2, …, t_N}, the records held by Customer 1, Customer 2, Customer 3, …, Customer N]

The Model
[Figure: three users each hold a record; a copy of each record arrives at the Server]
• Alice: J.S. Bach, painting, nasa.gov, …
• Bob: B. Spears, baseball, cnn.com, …
• Chris: B. Marley, camping, linux.org, …

The Model (Contd.)
[Figure: the Server builds a Data Mining Model from the received records; the model is then put to Usage]

The Model (Contd.)
[Figure: the Server receives altered copies of the records, applies Statistics Recovery, builds a Data Mining Model, and puts it to Usage]
• Alice holds J.S. Bach, painting, nasa.gov, …; the Server receives Metallica, painting, nasa.gov, …
• Bob holds B. Spears, baseball, cnn.com, …; the Server receives B. Spears, soccer, bbc.co.uk, …
• Chris holds B. Marley, camping, linux.org, …; the Server receives B. Marley, camping, microsoft.com, …

The Problem
• How to randomize data such that
  • we can build a good data mining model (utility)
  • while preserving privacy at the record level (privacy)?
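
One way to make the randomize-then-recover idea concrete is classic randomized response on a single yes/no attribute. The sketch below is an illustration under stated assumptions, not necessarily the scheme this tutorial goes on to present: the function names randomize and estimate_fraction, the keep probability p = 0.75, and the synthetic population are all hypothetical. Each customer reports the true bit with probability p and the flipped bit otherwise; the server sees only the noisy reports and inverts the known noise to estimate the aggregate.

```python
import random


def randomize(bit, p, rng):
    """Client side: report the true bit with probability p, else its flip."""
    return bit if rng.random() < p else 1 - bit


def estimate_fraction(reports, p):
    """Server side: estimate the true fraction f of 1s from noisy reports.

    The expected observed fraction is p*f + (1-p)*(1-f) = (1-p) + f*(2p-1),
    so solving for f gives the estimator below.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information; cannot invert")
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


# Hypothetical population: 30% of 10,000 customers have the attribute.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_bits = [1] * 3000 + [0] * 7000
p = 0.75  # assumed keep probability

reports = [randomize(b, p, rng) for b in true_bits]
estimate = estimate_fraction(reports, p)
print(f"true fraction: 0.30, estimated from randomized reports: {estimate:.3f}")
```

No single report reveals a customer's true value with certainty, yet the aggregate is recoverable. Moving p toward 0.5 strengthens record-level privacy and increases the variance of the estimate, which is the utility versus privacy tension named in the problem statement.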
