Models and Methods for
Privacy-Preserving
Data Publishing and Analysis
Johannes Gehrke
Department of Computer Science
http://www.cs.cornell.edu/johannes
An Abundance of Data
• Supermarket scanners
• Credit card transactions
• Direct mail response
• Call center records
• ATM machines
• Web server logs
• Customer web site trails
• Podcasts
• Blogs
• Scientific experiments
• Sensors
• Cameras
• Interactions in social networks
• Newswires
• Speech-to-text translation
• Email
• Closed caption
• Print, film, optical, and magnetic storage: 5 exabytes (EB) of
new information in 2002, about double the amount of three years earlier
[How much Information 2003, UC Berkeley]
SIGKDD 2006 Tutorial, August 2006
Driving Factors: A LARGE Hardware
Revolution
[Intel Corporation]
Driving Factors: A small Hardware Revolution
• Experts on ants estimate that there are 10^16 to
10^17 ants on earth. In the year 1997, we
produced one transistor per ant.
[Gordon Moore]
[Figure labels: 1962, 1972, 1982, 2002]
Pulsars
• Pulsars are rotating neutron stars
• Of interest are
  • Millisecond pulsars
  • Compact binaries
• Example:
  • Hulse-Taylor binary
  • Used to infer gravitational waves in support
    of Einstein’s General Theory of Relativity
  • Nobel Prize in Physics in 1993
Project Requirements
• Data
  • 14 TB every 2 weeks
  • Shipped on USB-2 disk drives
  • Need to archive raw data 5+ years
  • Need to make data products available to the astronomy
    research community
• Processing
  • Extremely processor intensive
  • Find new pulsars and other interesting phenomena
[Calimlim, Cordes, Demers, Gehrke, Lifka;
http://arecibo.tc.cornell.edu]
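As a rough back-of-the-envelope check of the archival requirement above, the sketch below estimates the raw archive size. It assumes a constant 14 TB per shipment, exactly 26 two-week periods per year, and a 5-year horizon (the lower bound of "5+ years"); the constant and function names are illustrative, not taken from the project.

```python
TB_PER_SHIPMENT = 14           # from the slide: 14 TB every 2 weeks
SHIPMENTS_PER_YEAR = 52 // 2   # assumption: 26 two-week periods per year
ARCHIVE_YEARS = 5              # assumption: lower bound of "5+ years"


def raw_archive_tb(years=ARCHIVE_YEARS):
    """Total raw data (in TB) accumulated over the given number of years."""
    return TB_PER_SHIPMENT * SHIPMENTS_PER_YEAR * years


print(raw_archive_tb())  # 1820 TB, i.e. roughly 1.8 PB of raw data
```

Under these assumptions the raw archive alone approaches 2 PB, before counting any derived data products.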
Driving Factors: Analysis Capabilities
Data mining is the exploration and analysis
of large quantities of data in order to
discover valid, novel, potentially useful,
and ultimately understandable patterns in
data.
Example pattern (Census Bureau Data):
If (relationship = husband), then (gender = male). 99.6%
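The 99.6% figure presumably is the rule's confidence: the fraction of records matching the "if" part that also match the "then" part. A minimal sketch of that computation follows; the toy records are invented for illustration (they are not Census Bureau data), and `rule_confidence` is a hypothetical helper name.

```python
def rule_confidence(records, antecedent, consequent):
    """Fraction of records matching `antecedent` that also match `consequent`.

    Both conditions are dicts of attribute -> required value.
    Returns None when no record matches the antecedent.
    """
    matching = [r for r in records
                if all(r.get(k) == v for k, v in antecedent.items())]
    if not matching:
        return None
    hits = sum(all(r.get(k) == v for k, v in consequent.items())
               for r in matching)
    return hits / len(matching)


# Invented toy data; one record contradicts the rule.
toy_records = [
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "female"},
    {"relationship": "wife", "gender": "female"},
]

confidence = rule_confidence(toy_records,
                             {"relationship": "husband"},
                             {"gender": "male"})
print(confidence)  # 0.75: three of the four "husband" records are "male"
```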
Driving Factors: Connectivity and Bandwidth
• Metcalfe’s law (the usefulness of a network
grows with the square of the number of
users)
• Gilder’s law (bandwidth doubles every 6
months)
Concerns About Privacy
Recent example:
“Last week AOL did another stupid thing,
but at least it was in the name of
science….”
[Annalee Newitz, AlterNet, August 15, 2006]
A Face Is Exposed for AOL Searcher No.
4417749 [New York Times, August 9, 2006]
…
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb fingers” to
“60 single men” to “dog that urinates on everything.”
And search by search, click by click, the identity of AOL
user No. 4417749 became easier to discern. There are
queries for “landscapers in Lilburn, Ga,” several people
with the last name Arnold and “homes sold in shadow
lake subdivision gwinnett county georgia.”
It did not take much investigating to follow that data trail
to Thelma Arnold, a 62-year-old widow who lives in
Lilburn, Ga., frequently researches her friends’ medical
ailments and loves her three dogs. “Those are my
searches,” she said, after a reporter read part of the list
to her.
…
A Face Is Exposed for AOL Searcher No.
4417749 [New York Times, August 9, 2006]
Ms. Arnold says she loves online
research, but the disclosure of her
searches has left her disillusioned. In
response, she plans to drop her AOL
subscription. “We all have a right to
privacy,” she said. “Nobody should
have found this all out.”
http://data.aolsearchlogs.com
The Setup
[Diagram: Customers 1, 2, 3, …, N hold records r1, r2, r3, …, rN and connect to a server with a database (DB).]
Model I: Untrusted Data Collector
[Diagram: Customers 1, 2, 3, …, N send records r1, r2, r3, …, rN to the database (DB) of a company (Company A), which wants to find aggregate properties of {r1, r2, …, rN}.]
Minimal Information Sharing
• Ideally, we want an algorithm that discloses
only the query result, and only to the requesting
party. (In practice, we need some extra
disclosure.)
• How do we design algorithms that compute
queries while preserving data privacy?
• How do we measure privacy (this extra
disclosure)?
Types of Disclosure
Tolerated disclosure comes in two kinds:
• Statistically private: too fuzzy or unlikely
  (knowledge as distribution: this tutorial!)
• Computationally private: hard to use
  (cryptographic protocols)
Model II: Trusted Data Collector
[Diagram: Customers 1, 2, 3, …, N send records r1, r2, r3, …, rN to the database (DB) of a trusted collector (slide labels: Company A, Government), which wants to publish properties of {r1, r2, …, rN}.]
Disclosure Limitations
• Ideally, we want a solution that discloses as
much statistical information as possible while
preserving privacy of the individuals who
contributed data.
• How do we design algorithms that allow the
“largest” set of queries that can be disclosed
while preserving data privacy?
• How do we measure disclosure?
This Tutorial: Statistical Methods
• Privacy-preserving data analysis
• Privacy-preserving data publishing
Goal:
• Rather than cover everything superficially
and nothing in depth, make hard choices
Caveats:
• Not a comprehensive survey
What is Left Out?
• Work on secure multi-party computation (secure join,
secure intersection, homomorphic encryption, certificate
revocation, etc.)
• Architectural and language issues (Hippocratic
databases, P3P, etc.)
• Disclosure control (statistical databases, auditing,
database queries, etc.)
• Privacy through distributed data mining
• Resources
  • See excellent tutorials by Rakesh Agrawal and Chris Clifton;
    keynote talk by Srikant Ramakrishnan at this conference.
Tutorial Outline
• Untrusted data collector
• Trusted data collector
Privacy Preserving Data Mining
[Diagram: Customers 1, 2, 3, …, N send records t1, t2, t3, …, tN to a company's database (DB). Goal: build a data mining model over {t1, t2, …, tN}.]
The Model
[Diagram: three users, each holding a record, and a server.]
• Alice: J.S. Bach, painting, nasa.gov, …
• Bob: B. Spears, baseball, cnn.com, …
• Chris: B. Marley, camping, linux.org, …
The Model
[Diagram: each user sends a copy of their record to the server.]
• Alice sends: J.S. Bach, painting, nasa.gov, …
• Bob sends: B. Spears, baseball, cnn.com, …
• Chris sends: B. Marley, camping, linux.org, …
The Model (Contd.)
[Diagram: each user sends a copy of their record to the server; the server builds a data mining model from the received records, and the model is then put to use.]
• Alice sends: J.S. Bach, painting, nasa.gov, …
• Bob sends: B. Spears, baseball, cnn.com, …
• Chris sends: B. Marley, camping, linux.org, …
• Server: Data Mining Model, then Usage
The Model (Contd.)
[Diagram: each user sends a randomized version of their record to the server.]
• Alice: J.S. Bach, painting, nasa.gov, … is sent as Metallica, painting, nasa.gov, …
• Bob: B. Spears, baseball, cnn.com, … is sent as B. Spears, soccer, bbc.co.uk, …
• Chris: B. Marley, camping, linux.org, … is sent as B. Marley, camping, microsoft.com, …
• Server: Statistics Recovery, then Data Mining Model, then Usage
The Problem
• How to randomize data such that
  • we can build a good data mining model
    (utility)
  • while preserving privacy at the record level
    (privacy)?
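One classic way to see that both goals can be met at once is Warner-style randomized response on a single yes/no attribute. The sketch below is an illustration of that technique under stated assumptions, not necessarily the randomization scheme this tutorial goes on to present: each user reports the true bit with probability p and the flipped bit otherwise, and the server recovers the aggregate fraction from the noisy reports. The function names, the value p = 0.8, and the synthetic population are all assumptions made for the example.

```python
import random


def randomize(bit, p, rng):
    """Client side: report the true bit with probability p, else the flipped bit."""
    return bit if rng.random() < p else 1 - bit


def estimate_fraction(reports, p):
    """Server side: unbiased estimate of the true fraction of 1s.

    Since E[observed] = p * f + (1 - p) * (1 - f), solving for f gives
    f = (observed - (1 - p)) / (2 * p - 1).
    """
    if p == 0.5:
        raise ValueError("p = 0.5 carries no information about the data")
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)


rng = random.Random(0)                      # fixed seed for repeatability
true_bits = [1] * 3000 + [0] * 7000         # synthetic population, 30% ones
p = 0.8                                     # assumed retention probability
reports = [randomize(b, p, rng) for b in true_bits]
estimate = estimate_fraction(reports, p)
print(round(estimate, 3))                   # should land close to 0.3
```

Any single report is wrong with probability 1 - p, so the server cannot be sure of an individual's true value (privacy), while the aggregate estimate stays accurate for large N (utility). Moving p toward 0.5 strengthens privacy but increases the variance of the estimate.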