Privacy Via Pseudorandom Sketches
Privacy via Pseudorandom Sketches
[Extended Abstract]
Nina Mishra
Mark Sandler
Computer Science Department
Computer Science Department
University of Virginia
Cornell University
∗
†
nmishra@cs.virginia.edu
sandler@cs.cornell.edu
ABSTRACT
1.
INTRODUCTION
Imagine a collection of individuals who each possess private
Privacy is a paramount concern in applications where sen-
data that they do not wish to share with a third party. This
sitive personal information is used for the purpose of dis-
paper considers how individuals may represent and publish
covering patterns. This data is often gathered by a central
their own data so as to simultaneously preserve their privacy
third party either with the assurance that each individual’s
and to ensure that it is possible to extract large-scale statis-
privacy will be maintained or with the goal of selling this
tical behavior from the original unperturbed data. Existing
information to interested buyers. In the former case, pri-
techniques for perturbing data are limited by the number of
vate data is not necessarily kept private as there have been
users required to obtain approximate answers to queries, the
many instances where organizations have either accidentally
richness of preserved statistical behavior, the privacy guar-
or intentionally violated their privacy agreements in the face
antees given and/or the amount of data that each individual
of mergers, bankruptcy or theft [14, 15]. In the latter case,
must publish.
private data is not even intended to be kept private. For ex-
This paper introduces a new technique to describe parts of
ample, Acxiom [16] is a company that sells private data to
an individual’s data that is based on pseudorandom sketches.
companies who hope to improve target marketing, customer
The sketches guarantee that each individual’s privacy is
retention, etc.
provably maintained assuming one of the strongest defi-
Since it is evident that our data is not private when trusted
nitions of privacy that we are aware of: given unlimited
to another party, this paper develops an idea suggested in [1]
computational power and arbitrary partial knowledge, the
– how can privacy be put back in the hands of individuals?
attacker can not learn any additional private information
Our view is that individuals maintain all of their private
from the published sketches. However, sketches from mul-
data and what we investigate are methods by which indi-
tiple users that describe a subset of attributes can be used
viduals can release perturbed versions of their personal data
to estimate the fraction of users that satisfy any conjunc-
so that privacy is preserved and so that large-scale statisti-
tion over the full set of negated or unnegated attributes
cal patterns present in the original unperturbed data can be
provided that there are enough users. We show that the
approximately recovered from the released perturbed set.
error of approximation is independent of the number of at-
Some solutions to the non-trusted party problem have
tributes involved and only depends on the number of users
been proposed. One solution, known as randomized response
available. An additional benefit is that the size of the sketch
advocated by Warner in the 1960s [22], amounts essentially
is minuscule:
log log O(M ) bits, where M is the number
to flipping bits in the private data. In this context, when
of users. Finally, we show how sketches can be combined
an individual is surveyed about a sensitive matter, such as
to answer more complex queries. An interesting property
whether they are HIV+, the respondent flips the answer to
of our approach is that despite using cryptographic primi-
their question with probability p or answers truthfully with
tives, our privacy guarantees do not rely on any unproven
probability 1 − p. If p is slightly less than 1/2 then this sim-
cryptographic conjectures.
ple bit-flipping procedure can be shown to preserve privacy
and, with some simple analysis, if there are enough users
then one can derive a reasonably good estimate of the actual
∗Work done partly at HP Labs and Stanford University.
fraction of people who answered the question yes. However,
Research supported in part by NSF grant EIA-0137661.
this procedure, while powerful, has the disadvantage that if
†Part of this work was done at HP labs.
a user has a relatively sparse private vector then the result-
ing perturbed vector may be quite dense, and furthermore,
the number of users required to compute more complicated
queries grows very fast.
Permission to make digital or hard copies of all or part of this work for
A more recent solution suggested by Evfimievski et al [10,
personal or classroom use is granted without fee provided that copies are
11] develops an improved randomized procedure that has
not made or distributed for profit or commercial advantage and that copies
similar privacy guarantees as bit flipping, but also produces
bear this notice and the full citation on the first page. To copy otherwise, to
a compressed version of the dense flipped vector. However,
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
their result is highly optimized for databases where each user
PODS’06, June 26–28, 2006, Chicago, Illinois, USA.
has a small number of items in their transaction (e.g. only
Copyright 2006 ACM 1-59593-318-2/06/0003 ...$5.00.
a small number of bits are set to one). Further, no analysis
of a small number of AND queries.
is given of the number of users needed to obtain accurate
estimates of the frequency of an itemset as the size of the
Related Work. In recent years several different approaches
itemset grows. It appears (to us) that both the complexity
to privacy have emerged. They can be roughly divided into
of the algorithm and the number of users needed grows ex-
three categories by the extent to which the trusted third
ponentially with the size of the itemset. Thus, the method
party is used in maintaining privacy. More specifically, they
has only limited utility when users have a large number of
can be partitioned based on whether the trusted third party
bits set to 1, e.g., poll data or non-binary data – where nor-
is (a) used to answer queries about the private data or (b)
mally the underlying binary representation might contain a
used only to create an initial sanitization of the private data
large fraction of bits set to one.
or (c) not needed at all.
A generalization of randomized response to non-binary at-
In the first approach, the idea is to perturb the answer
tributes was undertaken by Agrawal et al [3]. In this work,
to every query [5, 7, 9], or to precisely answer some queries
bit flipping was generalized to non-binary attributes via re-
but deny the answer to others [17, 18, 8] so as to ensure
tention replacement – each user keeps their true value with
privacy. The second kind of work (that assumes a centralized
fixed probability, or replaces their true value with noise. Ar-
third party who publishes a sanitized version of the data)
bitrary queries involving a fixed number of attributes can be
includes different flavors of anonymity [19, 2, 21, 20] and
answered with this technique. However, their privacy def-
data perturbation [6]. Note that in this case, the trusted
inition potentially allows an attacker with prior knowledge
server is needed only during the initial stage. After the data
to learn a user’s value with high confidence. For example,
is released it immediately becomes public and no third party
if an attacker knows that someone’s private value is either
is needed to answer queries.
1, 1, 2, 2, 3, 3 or 4, 4, 5, 5, 6, 6 then seeing the perturbed
The most general approach, which is most relevant to our
1, 9, 8, 2, 3, 5 virtually reveals to the attacker the exact pri-
work does not assume the existence of a trusted party at
vate data of the individual since the perturbation has a much
all. Information provided by the user immediately becomes
higher probability of being generated from the first private
public and available for everybody’s use. The main idea here
value than the second.
is to randomize or clean the user’s data [10, 11, 3] so that
personal data is virtually hidden and yet it is possible to
Our Contributions. To describe our results, we begin by
gather relevant statistics from the data of many individuals.
defining privacy. If a user holds some private data and re-
To the best of our knowledge, no previous results could both
leases some value s then we say that a user’s privacy is
operate on non-binary data and guarantee strong privacy.
preserved if for any two possible private values d and d ,
Note that any result in the last two categories immediately
the attacker cannot distinguish between whether the orig-
implies a similar result in the first one. In particular, we
inal private data was d or d , i.e., P r(s|d ) ≈ P r(s|d ).
believe that this paper provides some useful insights for the
A more formal definition and discussion can be found in
framework described in [5]. We elaborate more on this in
Section 2.
Section 5.
With regard to utility, the basic query that we strive to
accurately answer is the AND query: Suppose that each
2.
PRELIMINARIES
user holds a bit vector in d ∈ {0, 1}q over the variables
Let d denote a user’s private data which we sometimes
x1, . . . , xq. Given a subset of literals, where each element is
refer to as a user’s profile. In addition we assume that each
either an attribute xi or its negation xi, the output is the
user holds a unique public identifier id - which does not
fraction of users that satisfy the conjunction. One example
contain any private information (for example it could be
of a query is what fraction of individuals are HIV+ and do
a timestamp of user registration in the system). The pair
not have AIDS.
(id, d) fully describe an individual. We denote the set of all
Our main contribution is a novel method where users
pairs (id, d) by D.
may “publish” their data using succinct representations -
Our goal is to provide users with some mechanism for
sketches.
Each sketch describes a subset of a user’s at-
generating a sketch s about their profile so that privacy is
tributes which when collected from multiple users can be
preserved. Informally, we say that the sanitized information
combined in order to approximate the fraction of users who
preserves privacy if for the private data d and d of any two
satisfy an arbitrary conjunction on the entire subset. The
individuals, Pr [s|d ] ≈ Pr [s|d ]. Thus the sanitized data
key difference from [10] and [22] is that the error in esti-
s does not help the attacker distinguish between the case
mating the answer to an AND query is independent of the
when the user’s private data is d and d . This definition is
number of attributes involved. This is in contrast to pre-
identical to [10]’s notion of γ-amplification.
vious work, where the error appears to grow exponentially
in the number of attributes in the query.1 The size of the
Definition 1. The sanitization s is -private if for any
sketch is tiny:
log log O(M )
where M is the number of
two values of private data d and d
users. We prove that sketches preserve privacy in the sense
Pr [User published s|User’s private value is d ]
just described.
≤ (1 + )
While we limit our analysis to only
Pr [User published s|User’s private value is d ]
AND queries, we
demonstrate that these queries are quite powerful in the
(1)
sense that many other queries (such as means, higher mo-
Note that this definition guarantees very strong privacy
ments and interval queries) can be expressed as a collection
since it says that the sanitized data is almost equally likely to
be generated from any possible value of the underlying user
1However as opposed to [10] a single sketch can only be used
data. Thus, no matter how much the attacker knows about
to answer queries on a fixed subset of bits
a user in advance, after the sanitized sketch is released, very
little new information can be learned. There are also other
size) subset of bits. The idea is that any subset of interest
definitions of privacy, we present a brief comparison of them
B can be sketched, so that we can answer the query |I(B, v)|
in Appendix A.
for an arbitrary value v. In other words, each sketch over
To better understand the definition of -privacy we give
a set of k attributes gives us the ability to answer 2k AND
a simple example. Consider the situation where each user
queries (over the full set of k attributes, where each attribute
holds a single bit that is either 0 or 1. Then for v ∈ {0, 1},
appears either negated or unnegated). The sketching tech-
we want
nique turns out to be very useful in mining non-binary data
P r(˜
x
where for each attribute there are only a few subsets that
i = v|xi = 0) ≤ (1 + )
(2)
P r(˜
x
need to be sketched. Sketches can also be combined together
i = v|xi = 1)
to produce answers to more complex queries.
Observe that if each individual flips their bit with proba-
Sketching can be viewed as an analog of hashing but with
bility exactly 1/2 then we have absolute privacy, but abso-
better privacy protection. Indeed, if each user hashes their
lutely no utility. To understand why, observe that each user
value on a subset of bits B, then the hash value can be used
publishes 1 with probability 1/2 and 0 with probability 1/2
to answer the query I(B, v), by computing the number of
independent of their original unperturbed value. We do not
users who hold the hash value of vector v. However, even
even need a user to exhibit this kind of random behavior –
though the hash function is non-reversible, it might violate
all we need is a fair coin. Consequently, flipping with prob-
privacy. Indeed, if Bob knows that Alice’s private value can
ability 1/2 does not provide any utility. However, if each
be only one out of 100 known possible values, then once he
individual flips their bit with probability p just a tinge un-
sees the hash value, by applying the hash function to each
der 1/2, i.e., p = 1/2 −
then we can simultaneously ensure
potential value, he can deduce the original value (with very
privacy and estimate the fraction of ‘1’s in the unperturbed
high probability). Sketches are devoid of this property – see-
data. The privacy proof is folklore and can be found in [22].
ing a sketch tells almost nothing (in a precise mathematical
Furthermore, the fraction of ones r can be estimated by solv-
sense) about the private value which generated it.
ing for r in the following equation where ˜
r is the fraction of
ones in the perturbed data: E(˜
r) = (1−p)r +p(1−r). (And,
Intuition. Consider a subset of bits B. We want to be able
E(˜
r) can be estimated using the Chernoff bound – refer to
to estimate the fraction of users who have their subset of bits
the proof of Lemma 4.1 for details.)
B equal to a particular value v. Suppose that we are not
concerned about the efficiency of our representation. How
AND queries. AND queries are a natural generalization of
can we estimate this fraction while hiding the real values?
estimating the frequency of an item-set. Given a subset of
Consider a user u and imagine that for each possible value
bits B = {b1, . . . , bk} and their values (v1 . . . vk), an AND
v he publishes a p-perturbed vector indicating whether the
query returns the fraction of users that satisfy a query of
user’s real value is equal to v or not. In other words, for
the form V(db = v
i
i), In other words, given a set B and
a k bit long subset B, we represent the user’s data by a
a binary string v = {v1, . . . , vk}, we want to estimate how
2k bit long vector, which contains zeros everywhere, except
many users satisfy dB = v, where dB denotes a substring
at the position v – the bit corresponding to the user’s true
induced by set B. This generalization (from the monotone
value. At the end we perturb each bit with probability p and
version considered in frequent itemset mining) enables effi-
publish this perturbed vector. An example of this procedure
cient computation of a broad range of other queries - not
is presented in Figure 3.
necessarily on binary data. We let I(B, v) denote the to-
Note that the published sequence is almost equiprobable
tal number of users whose profile d satisfies the constraint
to be generated from any possible underlying value, since for
dB = v. One way to compute answers to AND queries is
any possible user values v and v
the original 2k indicator
via a system of linear equations similar to the one intro-
vector differs in only 2 bits (corresponding to v and v ).
duced in [10]. However the error introduced seems to grow
Now, to learn how often the value v occurs across many
exponentially in the number of bits involved and thus only
users, we just look up the column corresponding to the value
appears to be useful for answering short, monotone AND
v, and there for each user we have a perturbed indicator of
queries.
whether the value is v or 0. To obtain an estimate, we use
the analysis for single bits discussed in Section 2.
Assumptions. Within a particular user’s profile, we allow
Evidently, publishing 2k bits for each subset B is not very
arbitrary dependencies between private values and our pri-
practical. Fortunately this can be avoided. Indeed, the pro-
vacy guarantees hold independently of the amount of de-
cess used to generate these 2k bits is extremely simple – all
pendencies present. On the other hand, we do assume that
bits except one are generated using a p-biased coin, and the
each user’s profile is independent of every other user’s pro-
bit corresponding to the actual value is generated using a
file. Intuitively, this assumption seems necessary for any
(1 − p)-biased coin. This sequence can be simulated using
input perturbation scheme, since otherwise a user’s privacy
pseudorandom functions which we only define intuitively.
can be compromised even if he does not publish anything –
The precise definition is beyond the scope of this paper, and
since another user’s input can compromise his value. How-
can be found in [12].
ever, in practice, inter-user dependencies do exist and we
For the purposes of this section we just assume that we
leave open the question of how/whether this assumption can
have an oracle which, when provided a key s, generates a
be lifted.
random function fs, such that for a random key and for any
value v, Pr [fs(v) = 1] = p with all values being mutually
3.
SKETCHES
independent. We emphasize here that each value is chosen
In this section we develop an approach that will allow us
only once for each key and each value. Thus for a fixed key s
to answer
the function f
AND queries defined on a large (or really, any
s is deterministic. However before we evaluate
All Possible Private Values:
000
001
010
011
100
101
110
111
User Indicator Vector
0
0
0
0
1
0
0
0
User Published Vector
1
0
1
1
1
0
1
0
Figure 1: A very private (but very inefficient) publishing method. The user holds a private 3-bit value=‘100’
which is first represented as 23 = 8-bit indicator vector (with a ‘1’ in the position corresponding to ‘100’).
This vector is then perturbed and published.
fs(v) we have no way of knowing what the value is going
Algorithm 1 Sketch(id, d, B)
to be. Suppose now, that a user holds value u. Then, we
Input: Pseudorandom p-biased function H, security pa-
want to modify the process of key selection so that the value
rameter p, user data (id, d), subset B ⊆ [1..|d|].
of fs(u) would be more biased towards 1. E.g. if we just
Output: A sketch s for dB.
output random s then fs(u) = 1 with probability p, whereas
1: Choose s uniformly at random without replacement
we want it to be 1 − p.
from 1 to 2 where
is given in Lemma 3.1
Intuitively, a user can seek out the necessary bias by skew-
2: if H(id, B, dB, s) = 1 then
ing the distribution of keys. More specifically, a user selects
3:
Publish s and stop.
a random key and then rejects it with non-zero probability if
4: else
fs(u) = 0, and accepts it with probability 1 if fs(u) = 1. We
5:
With probability
p2
publish s and stop.
(1−p)2
call the key generated by this process a sketch (of a subset
6:
Otherwise continue to step 1.
B). We emphasize here, that the outcome of this process is
7: end if
no longer a key chosen uniformly at random - but is skewed
8: If all values of s are exhausted then report failure and
so that fs(u) = 1 is more likely.
stop.
Note that in order to keep independence, the oracle must
generate functions independently for each user and each bit
subset. To achieve this, we use a single random function
which takes these values as input parameters. Indeed, sup-
conjunctions.
pose the function H takes as input a unique user identifier
id, a bit subset B, a value v and a key s (the one used as
Analysis of the algorithm. The sketching algorithm is
input to the oracle above). The value of such a function at
given in Algorithm 1. There are several questions that need
a given input is chosen to be 1 or 0 using a p-biased coin.
to be addressed. First, we need to estimate the required
Different user identifiers will ensure that each user receives
length of a sketch and the running time of Algorithm 1.
a random function that is independent of everybody else’s
Second, we prove that the sketch preserves privacy. And the
function. But we do still need an efficient way to represent
last question, that we defer until the next section, is what
such a function. Fortunately, there are standard algorithms
queries can be approximately answered with the sketch.
for doing this. For example, any collision free secure hash
Let us begin with a more formal introduction to pseudo-
(such as SHA family [13] or WHIRLPOOL [4]) is an exam-
random functions. Imagine the space of all functions that
ple of such a function2.
map n bit strings to n bit strings. Now imagine a uniform
distribution over this space of functions – this is a truly ran-
Observe that a function which returns a uniform value
dom function ensemble. Pseudorandom function ensembles
can be transformed to mimic a p-biased coin flip using a
are ensembles that cannot be distinguished from truly ran-
simple algorithm. Indeed, suppose we have a collision-free
dom function ensembles by any efficient algorithm that can
function H : {0, 1}∗
{0, 1}λ, and we would like to obtain
probe values of the functions at arguments it selects. Put
a binary function H : {0, 1}∗
{0, 1} such that for random
another way, there is no algorithm that can distinguish be-
x, H(x) = 1 with probability p. To achieve this, we write
tween a function drawn from a pseudorandom ensemble and
p in binary form, p = Pt
p
a truly random ensemble when given the ability to examine
i=1
i2−i.
We assume that p can
be written using only λ bits3. Then for a given input x let
the function at various points.
H(x) = v1 . . . vλ. We report 1 if Pλ
v
p
It is a common conjecture in cryptography that pseudo-
i=1
i2−i ≤ Pλ
i=1
i2−i
and 0 otherwise.
random functions exist. However, their existence has not
Note that the original randomized response is a special
been formally proven. In our approach, as we show, the
case of our technique where we sketch each bit individually.
existence of pseudorandom functions is somewhat less cru-
However the main advantage of our approach is in sketch-
cial since privacy is unaffected by their existence, and re-
ing large subsets of bits.
For example, a sketch of 3 bits
garding utility, it is highly unlikely that non-adversarially
gives us the ability to estimate the number of users that
chosen queries will ever expose non-randomness of our per-
satisfy any conjunction of length 3 over those 3 bits (where
turbation. Indeed, if such queries did exist, then we would
each bit is either negated or unnegated) – there are 23 such
be able to differentiate between a ‘currently-thought-to-be’
pseudorandom function and a random one – and that would
2
be a breakthrough for modern cryptography. As for the cur-
It is also possible to generate a new function for each
database individually using standard constructions of [12]
rent state of the art, if the length of the generator key is at
3Standard hash functions have length 128-512 bits, which
least 300 bits4, it is unfeasible to build an algorithm whose
is much larger than the typical precision used to represent
real values. Furthermore if the need arises, the length of the
4Here we are referring to the key used to define the global
output of a hash function can be increased [12].
pseudorandom function for the entire database, not the
answers on a pseudorandom function will differ from those
Theorem 3.2
(Privacy with one released sketch).
it would produce on a truly random function.
For a subset B, if a user u releases a sketch s according to
We assume that there is a public pseudorandom function
Algorithm 1 then for any possible values of their profile d
H, which upon receiving a random binary string returns 1
and d
we have
with probability p and 0 otherwise. It is useful to think
Pr [User publishes sketch s|User profile is d ]
„ 1 − p «4
about a pseudorandom function as a black box such that
≤
for every set of parameters for which we have not yet eval-
Pr [User publishes sketch s|User profile is d ]
p
uated our function, the value is generated randomly on the
where the probability is taken only over the outcomes of the
fly. Indeed, such an interpretation is possible since from the
user’s private coin flips and not over the outcomes of a pub-
point of view of a user the values are computationally in-
lic pseudorandom function.
distinguishable from those that are chosen at random. This
substitution allows us to compute probability distributions
“
”2
Proof.
Let r =
p
. Since we will only be considering
and prove tail bounds over outcomes of pseudorandom func-
1−p
a sketch of a single subset for B and for a single user with
tions even though the actual answer is of course entirely
identifier id, we use f (d, s) as shorthand for H(id, B, d
deterministic.
B , s).
To prove our result, we analyze the algorithm’s behavior
The first question we address is how many bits does a
given that the user profile is d.
We say that the key s
user need to represent a sketch so that Algorithm 1 fails
evaluates to z (on the user profile d) if f (d, s) = z. As in
with very small probability. It turns out that the number of
the previous result, we assume that the sketch has length
bits we need is doubly logarithmic in the number users and
bits thus taking L = 2 possible values. Also we say that a
the failure probability.
key s is considered if, over the course of running Algorithm
Lemma 3.1
(Minimal length of the sketch). If the
1, s is sampled. Let Yds denote the probability that the
log M
algorithm given a profile d considers a key s. Note that if
length of a sketch is at least
= log
τ
, and there
| log 1−p2|
the key s evaluates to 1, then considering it is equivalent to
are at most M users in the system then the probability that
publishing it. Thus, the probability Xds that the algorithm
the publishing algorithm fails for any user is less than τ .
publishes key s given that the profile is d is then bounded
by
Proof. Consider a user with identifier id, a subset B, and
suppose that the user’s value on B is v. For each key value
rYds ≤ Xds ≤ Yds,
that we have considered, the algorithm stops with probabil-
ity at least p2. Now, in order for our algorithm to fail, it
since the algorithm publishes a considered key with prob-
should be the case that it did not stop on any of them. Thus
ability at least r, and it never publishes a non-considered
the probability of failure is at most
key. Now we show that there exists Ymin and Ymax such
that Ymin/Ymax ≥ r and for any possible user profile d and
log τ
|
M
|
τ
key s:
(1 − p2)2 ≤ (1 − p2) log 1−p2 ≤
.
M
Ymin ≤ Yds ≤ Ymax.
where we use the fact that a -bit long binary value encodes
2 different values. Using the union bound, we get the de-
If we could show that then we have
sired result.
Ymin
rYmin ≤ Pr [User (id, d) publishes sketch s] ≤ Ymax ≤
r
Note that the length of the sketch grows very slowly com-
(3)
pared to the number of users and τ and grows independently
independent of the value d, and the lemma would follow.
of the length of the data we are sketching. For example if
Notice that the behavior of the algorithm is invariant with
p > 1/4, then a 10 bit sketch is sufficient for any foreseeable
respect to permutations of the key evaluations (because we
practical use.
sample uniformly). Thus, for any two profiles d and d
In terms of the algorithm’s running time, note that the
that have the same number of keys that evaluate to 1, and
double logarithmic bound on the sketch length implies a
keys s and s
that evaluate to the same value on d and
logarithmic bound on the number of iterations. Since the
d
respectively, we have that Yd s = Yd s . Therefore,
algorithm only tries every sketch at most once, there will be
Yds can be written just as a function of how many keys
at most log M/τ
iterations in the worst case. Furthermore
evaluate to 1 and the value f (d, s). We can thus put all
log(1−p2)
it can be easily shown that since the algorithm terminates on
such keys and profiles into equivalence classes based on the
each iteration with probability at least
p2
the expected
value w = f (d, s) and q = |{s : f (d, s) = 1}|. Thus, we just
(1−p)2
need to bound the probability Z(q)
w
that any such seed s is
running time is much less – and in fact is less than (1−p)2
p2
considered (i.e., given that the profile is d such that f (s, d)
iterations. In what follows, for clarity purposes all of our
evaluates to w and given that q out of 2 seeds on d evaluate
results will be conditioned on the fact that our algorithm
to “1”).
does not fail.
Recall that there are L = 2 possible keys. The values of
Now we prove the main privacy result that any sketch
Z(L) and Z(0) are undefined. For the remaining values 0 ≤
is almost equally likely to be generated from any private
0
1
q < 2 − 1, we have Z(q) = Z(q+1). The latter holds because
profile.
As we mentioned before, our privacy guarantees
0
1
a key is considered before we evaluate its value, therefore
are independent of the quality of the public pseudorandom
the probability that the algorithm considers a key which
generator, e.g. even an adversarial choice of the values of H
evaluates to “0” is equal to the probability of considering a
does not compromise a user’s privacy.
key which evaluates to “1”, given that the remaining 2 −
short keys used to generate sketches
1 evaluations stay the same. Thus, it is sufficient to only
consider Z(q). Henceforth, we omit the lower index and use
1
Lemma 3.4
(Correctness of the algorithm). For a
simply Z(q).
user (id, d), if the algorithm does not fail on a subset of bits
We now show that Z(q) is monotone in q. Recall that
B, it outputs a sketch s such that
the algorithm always terminates if it considers a key that
Pr [H(id, B, d
evaluates to “1”. Also, the algorithm has a fixed probabil-
B , s) = 1] = 1 − p
ity of performing another iteration if it considers a key that
and for all v = dB,
evaluates to “0”. Therefore Z(q) ≤ Z(q−1) since changing
the evaluation of a single key to “0” increases the probabil-
Pr [H(id, B, v, s) = 1] = p,
ity of having one more iteration (and hence any key which
evaluates to “1” is more likely to be considered). Therefore,
where the probability is taken over all possible outcomes of
the algorithm and all evaluations of H.
Ymin = Z(L) ≤ Z(q) ≤ Z(1) = Ymax.
If all keys evaluate to one then the algorithm always ter-
Proof.
To prove the second part it is sufficient to notice
minates on the first iteration. Therefore, each key has equal
that our choice of s is independent from any B = d
probability of being considered and hence Z(L) = 1 = Y
B , thus
L
min.
Pr [H(id, B, v, s) = 1] = p. To prove the first part we again
Now we only need to compute Z(1) - the probability that
use f (s) as a shorthand for H(id, B, v, s). Let Tt denote the
the algorithm chooses “1”, if all keys but one evaluate to
event that the algorithm terminates on iteration t and F de-
zero. To achieve this, we compute the probability Vi that
note the event that the algorithm reports failure. Recall
the algorithm chooses the single key that evaluates to “1”
that we sample our key values without replacement. Thus,
at iteration i. We have:
on each iteration, f is evaluated on a different key. There-
h
i
V
Qi−1
fore, we can use the trick where all values of H are assumed
i =
(1 −
1
)(1 − r) ×
1
= (1−r)i ×
j=0
L−j
L−i
L−i
to be generated on the fly. Now we show that for any t,
h
i
×
L−i
× L−i+1 × · · · × L−1 = (1−r)i
L−
Pr [f (s
i+1
L−i+2
L
L
t) = 1|Tt] = 1 − p, where st is the key generated on
the iteration t. Indeed, if the algorithm reaches iteration t,
where the first term is the probability that the algorithm
then it terminates with probability p +
p2
. Therefore
does not terminate on the first i − 1 iterations, and the
(1−p)
second term is the probability that it terminates during the
p
Pr [f (s
= 1 − p,
ith iteration, by choosing the key which evaluates to one.
t) = 1|Tt] = p + p2
1−p
Therefore we have
Therefore
L
L
1
1
Z(1) = X V
X
i =
(1 − r)i ≤
L
rL
Pr [f (s) = 1]
=
P2
(Pr [f (s
t=1
t) = 1|Tt] Pr [Tt])
i=0
i=0
=
(1 − p) P2
Pr [T
t=1
t] = (1 − p)(1 − Pr [F ])
The result follows.
Since by assumption the algorithm succeeds, Pr [F ] = 0 and
Note that in the proof we only used the assumption that
the result follows.
the user has access to a random generator to select keys. We
did not use the pseudorandomness assumption on H(·).
4.
UTILITY
Corollary 3.3
(Privacy with many sketches). If a
In this section we show that if we sketch a subset B, then
user u releases l sketches according to Algorithm 1 then for
we can answer an arbitrary query of the form I(B, v) and
any possible values of their entire profile d and d
we have
bound the amount of noise introduced. Note that since we
now operate with pseudorandom functions (instead of the
„
p
«4l
Pr [s
„ 1 − p «4l
≤
1 . . . sl|d ] ≤
truly random ones), the analysis of utility is slightly sub-
1 − p
Pr [s1 . . . sl|d ]
p
tle. In particular, no result can be proven unless we assume
the existence of pseudorandom functions. To simplify the
where Pr [s1 . . . sl|d] denotes the probability of publishing
exposition we prove our utility guarantees assuming that
sketches s1 . . . sl, given that the user’s true profile is d. In
values of H(·) are chosen at random each time the function
particular, if 1 − ε ≤ p ≤ 1 , then
2
16l
2
is computed on a new set of parameters. Then, assuming
Pr [s
pseudorandomness of chosen H(·), we conclude that if the
1 . . . sl|d ] ≤ 1 + ε
generating key of H(·) is long enough5, then the querying
Pr [s1 . . . sl|d ]
algorithm will produce the same result with probability only
.
negligibly different from the one for a purely random func-
tion. (If not, then we have constructed an algorithm that
Proof. Since conditioned on a profile, each sketch is gen-
can differentiate between a random and a pseudorandom
erated independently, the first part follows.
The second
function – which is unlikely.)
part follows from the behavior of the exponent of the form
We begin by giving an algorithm that can be used to an-
(1 + ε/q)q ≈ (1 + ε).
swer AND queries over sketched subsets.
We next show that the sketch published by Algorithm 1
Now we show that this algorithm produces an answer
indeed defines a function which is p−biased towards 1 on the
which is not too far away from the true answer.
user’s real value, and p-biased towards 0 on all other values.
5Here we are referring to the global key that defines the
This lemma will be used to in the utility results in the next
function for the entire database – not the keys that each user
section. Notice that the pseudorandomness assumption is
selects when publishing sketches. With the current state of
needed in this lemma.
the art, 300 bits are more than sufficient.
Algorithm 2 AND Query
bits set to one. This would solve the problem with sketches
Input: Pseudorandom function H, database of sketches
(where we use the values of the pseudorandom function as
S(id, B), subset of bits involved in the query B and
perturbed bits).
querying value v
Without loss of generality assume that the user profile d
Output: Approximate fraction of users who satisfy the
has k bits. Now, each original value d, has a fixed probabil-
query d
ity of being perturbed into any other profile ˜
d.
B = v
1: Compute
fraction
˜
r
of
users
who
satisfy
H(id, B, v, S(id, B)) = 1.
w[d → ˜
d] = p d−˜
d 1 (1 − p)k− d−˜d 1
2: Report r = ˜
r−p .
1−2p
where d − ˜
d 1 is the standard Hamming distance between
boolean vectors. Since we observe the frequencies of all per-
turbed profiles, at least theoretically we can write a system
Lemma 4.1
(Quality guarantee for algorithm 2).
of linear equations to solve for the frequencies of the actual
Assuming that H is a pseudorandom function, the proba-
values. But the system has size 2k and is not feasible to solve
bility that the answer r produced by Algorithm 2, is dif-
in most cases. A similar approach was employed in Agrawal
ferent from the true answer r by more than ε is at most
et al [3]. There however they considered non-binary data
exp[− ε2(1−2p)2M ].
and justifiably argued that in many applications k is very
4
Equivalently, if p is bounded away from 1/2, then for any
small.
δ with probability 1 − δ, the error introduced into an answer
In our case, a better solution exists6. Since each bit is
q
is at most O(
log 1/δ ).
perturbed with the same probability, the system is very ho-
M
mogeneous and the size of the system of equations can be
Proof. By lemma 3.4 we have
reduced from 2k to k.
Suppose B is a subset of bits, and without loss of gener-
E [˜
r] = (1 − p)r + p(1 − r),
ality suppose we are interested in how many users have all
thus r = E[˜
r]−p , hence it is sufficient to show that with
these bits equal to one (since our perturbation is symmetric,
1−2p
we can immediately generalize it to any other value). Let
probability exp[− ε2(1−2p)2M ], ˜
r differs from its expectation
4
v[l → l ] denote the probability that users who originally
by a factor of ε(1 − 2p). This immediately follows from the
had l bits in B set to 1 will have l bits set to 1, after each
Chernoff bound. Without loss of generality we can assume
bit was perturbed. We evaluate all the different ways that
that E [˜
r] ≥ 1 since we can always count the number of ‘0’
2
we can obtain l bits set to one if we originally had l bits
entries instead. Thus, by the Chernoff bound we have:
set to one. If we switch h bits which were set to 1, then
γ2E [˜
r] M
γ2M
in order to obtain l bit in the resulting value, we need to
Pr ˆ˛˛˜
r − E [˜
r]˛˛ ≥ γE [˜
r]˜ ≤ exp[−
] ≤ exp[−
]
switch l − (l − h) bits which were originally set to 0. In
2
4
order to avoid switching negative number of bits we must
substituting γ = ε(1 − 2p) we have the desired result. The
have max(0, l − l ) ≤ h ≤ min(l, k − l ). For each h, there are
second part follows.
„ l « „
k − l
«
4.1
Combining sketches
h
l − l + h
Given multiple sketches, we can also answer queries which
involve unions of the subsets that those sketches describe.
different ways to achieve this, and the probability of each
Suppose that each user sketches possibly overlapping subsets
choice happening is
B1, . . . , Bq. We can estimate how many users satisfy an
„
«2h
arbitrary
p
AND query defined on the union B = B1 ∪· · ·∪Bq
phpl −l+h(1 − p)k−(l −l+2h) = pl −l(1 − p)k−(l −l)
and a value v ∈ {0, 1}|B|.
1 − p
This can be solved via a system of linear equations sim-
Thus we have:
ilar to the one introduced in [10]. The crucial difference
from [10], is that now we can count frequencies of large item-
v[l → l ]
=
pl −l(1 − p)k−(l −l)×
sets which can be combined together using their technique
min(k−l ,l)
„
p
«2h „ l « „
k − l
«
for single bits.
×
X
,
Let v
1 − p
h
l − l + h
1 . . . vq denote the projection of v into the subsets
h=max(l−l ,0)
B1, . . . , Bq. Let a sketch for user u of a subset i be denoted
(4)
sui. Then for each user (id, d) and index i, we have a per-
Let xl denote the number of users whose original profile
turbed virtual bit (as defined by a function H(id, B, vi, sui))
matches l bits. We are interested in estimating xk. Similarly
indicating whether their true profile matches the query (Bi, vi).
let yl denote the fraction of users whose perturbed profile
Thus, to answer the query about (B, v) we just need to es-
matched l bits. Then we have
timate how many users have all unperturbed bits equal to
“1”. Analogously, by estimating how many users have these
E [y] = V x
bits equal to “0”, we learn how many users do not satisfy
where V
any query of the form I(v
l l = v[l → l ], and y, x denotes a vector comprised
i, Bi) – which could be used to es-
of y
timate how many users satisfy a disjunction of conjunctions.
0 . . . yl and x0 . . . xl respectively. Then we have:
Observe that the problem of combining sketches for pseu-
x = V −1E [y]
dorandom functions reduces to the following simplified prob-
lem: Given k bits from each user, where each is changed with
6A similar solution was proposed by Evfimievski et al [10],
probability p, estimate how many users originally have all k
for a slightly different perturbation scheme
and in particular xk = V −1E [y]. E [y] is hidden, however
in the underlying binary representation.
l
we can use an observed value of y as an approximation for
Similarly, we can compute the average product of two in-
E [y]. If the condition number of V is a constant C, and the
teger attributes a and b:
system has ν users, then with high probability the error of
X X
estimation of x would be O( C
√ ).
S = X aubu = X
22k−i−j auibuj
ν
u∈U
u
i
j
X
4.2
Computable queries
= X
22k−(i+j) X auibuj
i
j
u∈U
In this section we describe several types of queries which
= P P 22k−(i+j)I(A
can be computed using only a constant number of
i ∪ Bj , 11)
AND
i
j
queries. Our description has an empirical flavor – we do
where the last transition follows from the fact that auibij is
not provide a formal notion of what kind of queries we can
‘1’ if and only if both aui and bij are both ‘1’. Therefore
compute but rather show how to compute different queries.
P
a
u
uibuj corresponds to the AND query “how many users
We believe that these empirical observations are an impor-
have their bits ai and j set to one”. That is, the product
tant first step towards understanding the full power of AND
can be written as k2 2-bit queries.7 Notice that here we do
queries. Also, we present an example of a query which is not
not have to sketch each pair AiBj , but we can just use indi-
immediately expressible with a constant number of AND
vidual sketches and then use the technique from Section 4.1
queries, and yet is efficiently computable.
to “glue” them together.
Thus sketches have the potential to support a richer query
language.
Interval queries. We now consider queries of the form
“how many users satisfy au ≤ c?” for constant c. In the
Boolean queries. As we already mentioned AND queries
case when a denotes the user’s salary, this corresponds to a
correspond to the easiest type of boolean query. For exam-
query of the form “How many users have salary less than c?”.
ple, if each user holds boolean values d1 . . . dt then the AND
As usual, we consider c in binary notation: c = c1c2 . . . ck.
query on bits i1 . . . ik corresponds to estimating the fraction
How do we say that number x = x1x2 . . . xk is less or equal
of users that satisfy an arbitrary conjunction of length k
than c? x ≤ c if and only if there exists i, such that for
over these boolean values where each element in the con-
j < i xj = cj and xi < ci. For any x, at most one such i
junction is either di or ¯
d . Additionally using the system
l
il
exists. Note that since xi and ci are either 0 or 1, in order to
of equations similar to the one in 4.1, one can estimate the
satisfy the last inequality we must have xi = 0; ci = 1. But
fraction of users that satisfy exactly l out of k bits in the
then, for each i, this is a single AND query. Thus in order
query.
to count all possible au ≤ c, for every i, such that ci = 1,
More generally, one can estimate the fraction of users that
we need to pose a query about the first k − i bits, such that
satisfy a given decision tree. Each path in the decision tree
au1au2 . . . aui = c1 . . . ci−10, or using I notation
corresponds to a single AND query and any user satisfies at
most one path of the decision tree. Thus the total fraction
k−1
X
of users who satisfy a decision tree is simply the sum of the
|{u : au ≤ c}| =
ciI(Ai, c1 . . . ci−10).
fraction of users that satisfy each path (
i=0
AND query).
Notice that the number of queries we need to ask is equal
Computing Means. For the rest of this section, we assume
to the number of ‘1’s in the binary representation of c, with
that each profile holds several k-bit integer attributes a, b,
the upper bound being the length of the integer.
c, etc, that are stored in binary form in the user’s profile d.
However if we try to use a similar approach to answer the
The value of an attribute for user u is denoted by au. Let A
query au + bu < c, it turns out that it requires exponential
denote the subset of bits used to store the value of attribute
(in k) number of AND queries. By using a slightly different
a in the user profile. Furthermore, let Ai denote the subset
approach, presented in Section 4.2, we can express the query
which contains the i highest bits of a. Let Ai denote the
with a smaller number of AND queries.
index of the ith highest bit. If user u has profile d then dA
is au written in binary notation and dA is the value of the
Combining queries together. Suppose we are interested
i
i-th highest bit. Also to avoid multiple subscripts we will
in estimating the number of users who for two constants c
denote dA by a
and d satisfy a
i
ui.
u = c and bu < d. The number of users who
We begin with the simplest type of query: computing the
satisfy au = c, can be estimated by a single AND query:
sum (or average) S = P
a
I(
u∈U
u.
We expand the binary
A, c1 . . . ck).
The number of users who satisfy the sec-
representation of au
ond constraint can be estimated by posing k bit queries
I(Bi, d1 . . . di), for each 1 ≤ i ≤ k. Now, to compute the
k−i
number of users that satisfy both conditions we just need to
a
X
u =
aui2k−i.
(5)
compute k queries of the form: I(A ∪ Bi, c1 . . . ckd1 . . . di).
i=1
Analogously, we show how to compute the average value
Then we have:
of bu such that au ≤ c:
k−1
k
k
k
S = X X a
X
X
X
X
ui2k−i =
2k−i X aui =
2k−iI(Ai, 1)
2k−iI(Aj ∪ Bi, c1 . . . cj−101)
u∈U i=0
i=1
u∈U
i=1
j=1,...k,cj =1,...,k i=1
Note that after we rearrange the sum order the inner sum
7The number of queries can be reduced if we ignore terms
becomes a single-AND query on bit Ai. Thus if each bit
which contribute much less than the expected noise in the
gets released, it is sufficient to release the sketch of each bit
answer.
Similarly, one can combine constraints on different variables
Using this simple approach, we can overcome a negative
into queries about users who satisfy all those constraints.
result of Dinur and Nissim[7] and Blum et al [5] which sug-
gests that linear noise must be added in order to protect
Query not directly expressible as a few AND queries. In
from an attacker with unlimited computational power. In
√
this section we present an example of a query that cannot
our case we only add9 O( M ) noise in all except a neg-
be easily translated into a small number of AND queries.
ligible fraction of the queries. So while technically in the
Yet via variable substitution we can efficiently answer the
worst case, the added noise could be linear, the chance that
query. For the purpose of this section, we assume that each
the worst case happens decreases exponentially as the size
bit of the database is simply p-perturbed – or equivalently
of the database increases. For example, in a system with M
we sketch every single bit.
users, the chance we encounter such a bad query is 2−ω(M].
We wish to determine “How many users satisfy: a
This result is tight in the following sense. It is impossible
u + bu <
√
2r?” where we assume that a
to devise a system that would add noise o( M ) in all but
u = au1au2 . . . auk and bu =
b
a negligibly small number of queries. This follows from one
u1bu2 . . . buk are k-bit integers. As per our assumption, we
have ˜
a
of the results of the same paper[7].
ui and ˜
bui which are p-perturbations of each bit of
a
From a practical point of view, one might implement both
u and bu respectively. In order to satisfy the constraint we
must have
input and output perturbation in their system, and then
offer two types of access (for example paid and free). The
aui = bui = 0, for all i ≥ r.
paid mode would correspond to output perturbation (for
Now, if a
example, the SULQ framework [5]) and would only add a
u(r−1) = bu(r−1) = 0, then the constraint is au-
√
tomatically satisfied - no matter what the other bits are
small amount of noise E ≤
M to the system. However,
a
the total number of queries answered in this mode is limited
u + bu < 2t.
If on the contrary au(r−1) = bu(r−1) = 1
then the constraint is automatically violated.
Finally if
(by the minimum of E2 and the total number of users in the
a
database). Once the limit of queries is exhausted the system
u(r−1) + bu(r−1) = 1, then we have to check the r − 2 bit,
where the same rule applies. Thus, for each j ≤ r − 1, we
will stop answering those queries. Even before the system
need to compute how many users satisfy:
exhausts paid queries, it can be used in the second mode,
√
where it adds noise O( M )), but the database can answer
aui + bui = 1, for i > j and auj = buj = 0;
(6)
an unlimited number of queries. Note that the amount of
So we need to ask all
noise that the system adds is about the same, as SULQ adds
AND queries where for each i, exactly
one a
in the situation where it is tuned to answer as many queries
ui and bui is ‘1’ and another is ‘0’ – thus asking an
exponential number of them. To avoid this, we introduce a
as possible. In a way, our scheme closes the gap on how
binary variable q
much noise should be added when we allow only a finite or
ui = [(aui + bui) = 1]. Given a p-perturbed
infinite number of queries.
version ˜
aui and ˜
bui, note that ˜
qui = ˜
aui ⊕ ˜
bui are 2p(1 − p)-
perturbed variants of qui - since the evaluation changes if
and only if exactly one of aui and bui get perturbed. But
6.
CONCLUSIONS AND FUTURE DIREC-
now we can use all our machinery on the “virtual” bits as
TIONS
well. In particular we can compute how many users satisfy
q
We presented a technique that enables approximate com-
ui = 1 for all i > j.
Thus for any i, the number of users
satisfying (6) can now be efficiently computed.
putation of various queries on private user data which has
With some extra effort, the query can be generalized to
very strong privacy guarantees. Our results are based on
a
sketching - a novel technique that allows every user to pub-
u + bu < c, where c is an arbitrary constant, but we omit
that discussion.
lish a sketch of their data in such a way that it is impossible
to gain significant confidence about the true value – even
5.
OUTPUT PERTURBATION
for an attacker with arbitrary knowledge. There are still
many open questions. First, is it possible to formally de-
In this section, we give some useful insight into how our
scribe what kind of queries can be formulated using only a
input perturbation technique can be used to solve the output
limited number of AND queries? An ideal solution would
perturbation problem with better privacy guarantees, albeit
be a query language, such that any query in this language
less utility.
would require only linear, (or polynomial) in the length of
We consider a system that contains a database with pri-
the query, number of AND queries.
vate information and that answers queries about the content
Also a natural generalization of sketching bit subsets is
of this database. The goal is to perturb answers in such a
sketching arbitrary functions of a user’s profile. The same
way that it is impossible to gather any personal information
privacy guarantees apply, but the main question is whether
about any individual user data stored in the database. To
we can significantly expand the range of queries we can an-
use our scheme, the system administrator devises the set
swer.
of bit subsets that need to be sketched and computes the
We presented a worst cases analysis of the privacy if the
sketches on each row of the database. Then, for each query,
user publishes a fixed number of sketches. It would also be
the system computes the answer using only the sketches (and
interesting to see if these results could be improved by tak-
not the actual user data). The only thing the attacker can
potentially learn are the sketches themselves. And we al-
tacker can infer explicitly are the evaluations of the pseudo-
ready know that the sketches do not leak information about
random function. This fact simplifies the proof of Theorem
the user’s real data.8
3.2 and slightly improves the privacy guarantee.
9Here and elsewhere in this section we follow [5] and assume
8Even learning the values of sketches is challenging in the
that each user contributes at most 1 to the answer - of course
case of a trusted party since the only information the at-
the result could be scaled to any arbitrary value
ing into account independence between different parts of the
[17] K. Kenthapadi, N. Mishra, and K. Nissim.
data, and/or relaxing the privacy definition. In particular, if
Simulatable auditing. In PODS, pages 118–127, 2005.
one is willing to relax the privacy guarantees from determin-
[18] J. Kleinberg, C. Papadimitriou, and P. Raghavan.
istic to a negligibly small probability of leak then the result
Auditing boolean attributes. Journal of Computer and
of Theorem 3.3 might be improved to allow more sketches
System Sciences, 6:244–253, 2003.
while giving essentially the same privacy guarantees.
[19] A. Machanavajjhala, J. Gehrke, D. Kifer, and
While we were able to demonstrate that many large-scale
M. Venkitasubramaniam. l-diversity: Privacy beyond
queries can be accurately estimated with pseudorandom sketches
k-anonymity. In Proc. of ICDE, 2006.
(with sufficiently many users), we still do not have a general
[20] P. Samarati and L. Sweeney. Protecting privacy when
characterization of what classes of queries can and cannot
disclosing information: k-anonymity and its
be answered with sketches. We leave open the question of
enforcement through generalization and suppression.
how to derive stronger utility results with pseudorandom
In IEEE Symposium on Research in Security and
sketches.
Privacy, 1998.
[21] L. Sweeney. k-anonymity: a model for protecting
7.
ACKNOWLEDGMENTS
privacy. Int. J. on Uncertainty, Fuzziness and
We thank Kobbi Nissim and Jim Rowson for fruitful dis-
Knowledge-Based Systems, 2002.
cussions.
[22] S. Warner. Randomized response: A survey technique
for eliminating error answer bias. J. of Am. Stat. Ass.,
8.
REFERENCES
1965.
[1] G. Aggarwal, M. Bawa, P. Ganesan,
H. Garcia-Molina, K. Kenthapadi, N.Mishra,
APPENDIX
R. Motwani, U. Srivastava, J. Widom D. Thomas, and
A.
DIFFERENT PRIVACY DEFINITIONS
Y. Xu. Enabling privacy for the paranoids (vision
paper). In Proc. of VLDB, 2004.
The definition of privacy that is most similar to ours is
(ε, δ, T )-privacy introduced in [7, 5]. Essentially, a pertur-
[2] G. Aggarwal, T. Feder, K.Kenthapadi, R. Motwani,
bation satisfies this definition if each user is guaranteed to
R. Panigrahy, D. Thomas, and An Zhu. Anonimizing
have ε-privacy with probability at least 1 − δ given that the
tables. In ICDT, 2005.
attacker answers at most T queries. ε-privacy is equivalent
[3] R. Agrawal, R. Srikant, and D. Thomas. Privacy
to (ε, 0, ∞)-privacy of [5]. Note that even though ε-privacy
preserving olap. In SIGMOD, pages 251–262, 2005.
is a stronger definition, the (ε, δ, T )-privacy was used in a
[4] P.S.L.M. Barreto and V. Rijmen. The whirlpool
much less general context than ours (that is, the third party
hashing function. In NESSIE Workshop, 2000.
answers queries and data is never released). But the benefit
[5] A. Blum, C. Dwork, F. McSherry, and K. Nissim.
is that it answered a much richer set of queries .
Practical privacy: the sulq framework. In PODS,
Other definitions include ρ1-to-ρ2 privacy breach[10, 11]
pages 128–138, 2005.
which occurs when the prior probability of any predicate of
[6] S. Chawla, C. Dwork, F. McSherry, A. Smith, and
a user’s data d is at most ρ1 while the posterior probability
H. Wee. Toward privacy in public databases. In
of Q(d) given sanitized information s is at least ρ2. Typical
Theory of Cryptography Conference, 2005.
values for ρ1 and ρ2 are in the range 10 − 90%. It can be
[7] I. Dinur and K. Nissim. Revealing information while
shown [10] that -privacy implies ρ1-to-ρ2 privacy, but not
preserving privacy. In PODS, pages 202–210, 2003.
vice versa. In fact, -privacy is a much stronger definition.
[8] D. Dobkin, A. Jones, and R. Lipton. Secure databases:
In particular, it bounds the relative change of the posterior
protection against user influence. ACM Trans.
to prior probability, but not the absolute change. For ex-
Database Syst., 4(1):97–106, 1979.
ample, let ρ2 = 50% and let the prior probability that a
[9] C. Dwork and K. Nissim. Privacy-preserving
user is HIV+ be 0.001%. Then if an attacker learns that
datamining on vertically partitioned databases. In
the posterior probability that a user is HIV + is 49%, it is
CRYPTO, 2004.
not considered a privacy breach – even though the attacker
[10] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting
learned an enormous amount about the user.
privacy breaches in privacy preserving data mining. In
We emphasize here that this problem is not alleviated by
PODS, pages 211–222, 2003.
any single choice of ρ1 and ρ2. Note, however, that the per-
[11] A. Evfimievski, R. Srikant, R. Agarwal, and
turbation scheme described in [10, 11] does actually satisfy
J. Gehrke. Privacy preserving mining of association
our privacy definition.
rules. Inf. Syst., 29(4):343–364, 2004.
A similar but even weaker definition called (s, ρ1, ρ2)-privacy
is suggested in [3]. In particular ρ
[12] O. Goldreich. Foundations of Cryptography, Volume
1 to ρ2-privacy implies
(s, ρ
II. Cambridge University Press, 2004.
1, ρ2)-privacy, but not vice versa.
Furthermore in con-
trast with [10], the method described in [3] can not be ex-
[13] http://csrc.nist.gov/encryption/tkhash.html. Secure
tended to our stronger privacy definition. The main idea
hash standard. FIPS PUB 180-2.
in [3] is to keep each attribute with relatively high prob-
[14] http://www.macworld.com. Hackers breach lexisnexis,
ability, and replace it with noise otherwise. Unfortunately,
graph info on 32,000 people. 2005.
this method is susceptible to partial knowledge attack as
[15] http://www.washingtonpost.com. Northwest airlines
explained in the introduction.
faces privacy suits. 2004.
[16] http://www.wired.com. Acxiom opts out of opt-out.
2003.