
DTIC ADA464602: Privacy-Preserving Collaborative Data Mining PDF



Report Documentation Page (Standard Form 298, Rev. 8-98, prescribed by ANSI Std Z39-18; Form Approved OMB No. 0704-0188)
1. Report date: 2003
3. Dates covered: 00-00-2003 to 00-00-2003
4. Title and subtitle: Privacy-preserving Collaborative Data Mining
7. Performing organization name and address: Naval Research Laboratory, Center for High Assurance Computer Systems, 4555 Overlook Avenue, SW, Washington, DC, 20375
12. Distribution/availability statement: Approved for public release; distribution unlimited
16. Security classification of report, abstract, and this page: unclassified
18. Number of pages: 8

Privacy-preserving Collaborative Data Mining

Zhijun Zhan (1) and LiWu Chang (2)
(1) Department of Electrical Engineering and Computer Science, Syracuse University, USA
(2) Center for High Assurance Computer Systems, Naval Research Laboratory, USA
Email: [email protected] and [email protected]

Abstract

Privacy-preserving data mining is an important issue in the areas of data mining and security. In this paper, we study how to conduct association rule mining, one of the core data mining techniques, on private data in the following scenario: multiple parties, each having a private data set, want to jointly conduct association rule mining without disclosing their private data to other parties. Because of the interactive nature among parties, developing a secure framework to achieve such a computation is both challenging and desirable. In this paper, we present a secure framework for multiple parties to conduct privacy-preserving association rule mining.

Key Words: privacy, security, association rule mining, secure multi-party computation.

1 INTRODUCTION

Business successes are no longer the result of an individual toiling in isolation; rather, successes are dependent upon collaboration, team efforts, and partnership. In the modern business world, collaboration becomes especially important because of the mutual benefit it brings. Sometimes, such a collaboration even occurs among competitors, or among companies that have conflicts of interest, but the collaborators are aware that the benefit brought by such a collaboration will give them an advantage over other competitors. For this kind of collaboration, data privacy becomes extremely important: all the parties of the collaboration promise to provide their private data to the collaboration, but none of them wants the others or any third party to learn much about their private data.

This paper studies a very specific collaboration that becomes more and more prevalent in the business world: collaborative data mining. Data mining is a technology that emerges as a means for identifying patterns and trends from a large quantity of data. The goal of our studies is to develop technologies that enable multiple parties to conduct data mining collaboratively without disclosing their private data.

In recent times, the explosion in the availability of various kinds of data has triggered tremendous opportunities for collaboration, in particular collaboration in data mining. The following are some realistic scenarios:

1. Multiple competing supermarkets, each having an extra large set of data records of its customers' buying behaviors, want to conduct data mining on their joint data set for mutual benefit. Since these companies are competitors in the market, they do not want to disclose too much about their customers' information to each other, but they know the results obtained from this collaboration could bring them an advantage over other competitors.

2. Several pharmaceutical companies, each having invested a significant amount of money conducting experiments related to human genes with the goal of discovering meaningful patterns among the genes. To reduce the cost, the companies decide to join forces, but none wants to disclose too much information about its raw data because they are only interested in this collaboration; by disclosing the raw data, a company essentially enables other parties to make discoveries that the company does not want to share with others.

To use the existing data mining algorithms, all parties need to send their data to a trusted central place (such as a super-computing center) to conduct the mining. However, in situations with privacy concerns, the parties may not trust anyone. We call this type of problem the Privacy-preserving Collaborative Data Mining (PCDM) problem. For each data mining problem, there is a corresponding PCDM problem. Fig. 1 shows how a traditional data mining problem could be transformed to a PCDM problem. This paper only focuses on the heterogeneous collaboration (Fig. 1(C)). Heterogeneous collaboration means that each party has different sets of attributes; homogeneous collaboration means that each party has the same sets of attributes.

[Figure 1. Privacy Preserving Non-collaborative and Collaborative Data Mining. Panels: (A) Non-collaborative Data Mining (Data, Data Mining, Results); (B) Homogeneous Collaboration (D1, D2, D3, Data Mining, Results); (C) Heterogeneous Collaboration (D1, D2, D3, Data Mining, Results).]

Generic solutions for any kind of secure collaborative computing exist in the literature [5]. These solutions are the results of the studies of the Secure Multi-party Computation problem [10, 5], which is a more general form of secure collaborative computing. However, none of the proposed generic solutions is practical; they are not scalable and cannot handle large-scale data sets because of the prohibitive extra cost of protecting data privacy. Therefore, practical solutions need to be developed. This need underlies the rationale for our research.

Data mining includes a number of different tasks, such as association rule mining, classification, and clustering. This paper studies the association rule mining problem. The goal of association rule mining is to discover meaningful association rules among the attributes of a large quantity of data. For example, let us consider the database of a medical study, with each attribute representing a symptom found in a patient. A discovered association rule pattern could be "70% of patients who are drug injection takers also have hepatitis". This information can be useful for disease control, medical research, etc. Based on the existing association rule mining technologies, we study the Mining Association Rules On Private Data (MAP) problem, defined as follows: multiple parties want to conduct association rule mining on a data set that consists of all the parties' private data, and no party is willing to disclose its raw data to other parties.

The existing research on association rule mining [8] provides the basis for collaborative association rule mining. However, none of those methods satisfies the security requirements of MAP or can be trivially modified to satisfy them. With the increasing need for privacy-preserving data mining, more and more people are interested in finding solutions to the MAP problem. Vaidya and Clifton proposed a solution [9] for two parties to conduct privacy-preserving association rule mining. However, for the general case where more than two parties are involved, the MAP problem presents a much greater challenge.

The paper is organized as follows: The related work is discussed in Section 2. We describe the association rule mining procedure in Section 3. We then formally define our proposed secure protocol in Section 4. In Section 5, we conduct security and communication analysis. We give our conclusion in Section 6.

2 RELATED WORK

Secure Multi-party Computation

Briefly, a Secure Multi-party Computation (SMC) problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's input and output. The SMC problem was introduced by Yao [10]. It has been proved that for any function, there is a secure multi-party computation solution [5]. The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, although appealing in its generality and simplicity, is highly impractical.

Privacy-preservation Data Mining

In the early work on the privacy-preserving data mining problem, Lindell and Pinkas [7] propose a solution to the privacy-preserving classification problem using the oblivious transfer protocol, a powerful tool developed through the secure multi-party computation studies. Another approach for solving the privacy-preserving classification problem was proposed by Agrawal and Srikant [1]. In their approach, each individual data item is perturbed and the distribution of all the data is reconstructed at an aggregate level. The technique works for those data mining algorithms that use the probability distributions rather than individual records. In [9], a solution to the association mining problem for the case of two parties was proposed. In [3], a procedure is provided to build a classifier on private data, where a semi-trusted party was employed to improve the performance of communication and computation. In this paper, we also adopt the model of the semi-trusted party because of the effectiveness and usefulness it brings, and present a secure protocol allowing computation to be carried out by the parties.

3 MINING ASSOCIATION RULES ON PRIVATE DATA

Since its introduction in 1993 [8], association rule mining has received a great deal of attention. It is still one of the most popular pattern-discovery methods in the field of knowledge discovery. Briefly, an association rule is an expression $X \Rightarrow Y$, where $X$ and $Y$ are sets of items. The meaning of such rules is as follows: given a database $D$ of records, $X \Rightarrow Y$ means that whenever a record $R$ contains $X$ then $R$ also contains $Y$ with certain confidence. The rule confidence is defined as the percentage of records containing both $X$ and $Y$ with regard to the overall number of records containing $X$. The fraction of records $R$ supporting an item $X$ with respect to database $D$ is called the support of $X$.

3.1 Problem Definition

We consider the scenario where multiple parties, each having a private data set (denoted by $D_1, D_2, \cdots,$ and $D_n$ respectively), want to collaboratively conduct association rule mining on the union of their data sets. Because they are concerned about their data's privacy, no party is willing to disclose its raw data set to others. Without loss of generality, we make the following assumptions on the data sets. The assumptions can be achieved by pre-processing the data sets $D_1, D_2, \cdots,$ and $D_n$, and such pre-processing does not require one party to send its data set to other parties. (In this paper, we consider applications where the identifier of each data record is recorded. In contrast, for transactions such as supermarket buying, customers' IDs may not be needed. The IDs and the names of attributes are known to all parties during the joint computation. A data record used in the joint association rule mining has the same ID in different databases.)

1. $D_1, D_2, \cdots$ and $D_n$ are binary data sets, namely they only contain 0's and 1's, where $n$ is the total number of parties. (Our method is applicable to attributes that are of non-binary value. An attribute of non-binary value will be converted to a binary representation. Detailed implementation includes discretizing and categorizing attributes that are of continuous or ordinal values.)

2. $D_1, D_2, \cdots$ and $D_n$ contain the same number of records. Let $N$ denote the total number of records for each data set.

3. The identities of the $i$th (for $i \in [1, N]$) record in $D_1, D_2, \cdots$ and $D_n$ are the same.

Mining Association Rules On Private Data problem: Party 1 has a private data set $D_1$, party 2 has a private data set $D_2$, $\cdots$ and party $n$ has a private data set $D_n$. The data set $[D_1 \cup D_2 \cup \cdots \cup D_n]$ is the union of $D_1, D_2, \cdots$ and $D_n$ (by vertically putting $D_1, D_2, \cdots$ and $D_n$ together so that the concatenation of the $i$th row in $D_1, D_2, \cdots$ and $D_n$ becomes the $i$th row in $[D_1 \cup D_2 \cup \cdots \cup D_n]$). The $n$ parties want to conduct association rule mining on $[D_1 \cup D_2 \cup \cdots \cup D_n]$ and to find the association rules with support and confidence greater than the given thresholds. We say an association rule (e.g., $x_i \Rightarrow y_j$) has confidence $c\%$ in the data set $[D_1 \cup D_2 \cup \cdots \cup D_n]$ if in $[D_1 \cup D_2 \cup \cdots \cup D_n]$ $c\%$ of the records which contain $x_i$ also contain $y_j$ (namely, $c\% = P(y_j \mid x_i)$). We say that the association rule has support $s\%$ in $[D_1 \cup D_2 \cup \cdots \cup D_n]$ if $s\%$ of the records in $[D_1 \cup D_2 \cup \cdots \cup D_n]$ contain both $x_i$ and $y_j$ (namely, $s\% = P(x_i \cap y_j)$).

3.2 Association Rule Mining Procedure

The following is the procedure for mining association rules on $[D_1 \cup D_2 \cup \cdots \cup D_n]$.

1. $L_1$ = large 1-itemsets
2. for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++) do begin
3.    $C_k$ = apriori-gen($L_{k-1}$)
4.    for all candidates $c \in C_k$ do begin
5.       Compute c.count (we will show how to compute it in Section 3.3)
6.    end
7.    $L_k = \{c \in C_k \mid c.count \geq$ min-sup$\}$
8. end
9. Return $L = \cup_k L_k$

The procedure apriori-gen is described in the following (please also see [6] for details).

apriori-gen($L_{k-1}$: large $(k-1)$-itemsets)
1. for each itemset $l_1 \in L_{k-1}$ do begin
2.    for each itemset $l_2 \in L_{k-1}$ do begin
3.       if $((l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \cdots \wedge (l_1[k-2] = l_2[k-2]) \wedge (l_1[k-1] < l_2[k-1]))$ {
4.          then $c = l_1$ join $l_2$
5.          for each $(k-1)$-subset $s$ of $c$ do begin
6.             if $s \notin L_{k-1}$
7.                then delete $c$
8.                else add $c$ to $C_k$
9.          end
10.      }
11.   end
12. end
13. return $C_k$

3.3 How to compute c.count

If all the candidates belong to the same party, then c.count, which refers to the frequency counts for candidates, can be computed by this party. If the candidates belong to different parties, they then construct vectors for their own attributes and apply our number product protocol, which will be discussed in Section 4, to obtain c.count. We use an example to illustrate how to compute c.count among three parties. Party 1, party 2 and party 3 construct vectors $X$, $Y$ and $Z$ for their own attributes respectively. To obtain c.count, they need to compute $\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i]$, where $N$ is the total number of values in each vector. For instance, if the vectors are as depicted in Fig. 2, then $\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i] = \sum_{i=1}^{5} X[i] \cdot Y[i] \cdot Z[i] = 3$. We provide an efficient protocol in Section 4 for the parties to compute this value without revealing their private data to each other.

Record | Drug Injection Taker (Alice) | Hepatitis (Bob) | Credit Rating (Carol) | X | Y | Z
1 | Yes | Have | Bad | 1 | 1 | 1
2 | No | Have | Good | 0 | 1 | 0
3 | Yes | Have | Bad | 1 | 1 | 1
4 | Yes | Have | Bad | 1 | 1 | 1
5 | Yes | Haven't | Good | 1 | 0 | 0

Figure 2. Raw Data For Alice, Bob and Carol (with the constructed vectors X, Y and Z)

4 BUILDING BLOCK

How two or more parties jointly compute c.count without revealing their raw data to each other is the challenge that we want to address. The number product protocol described in this section is the main technical tool used to compute it. We will describe an efficient solution of the number product protocol based on a commodity server, a semi-trusted party.

Our building blocks are two protocols. The first protocol is for two parties to conduct the multiplication operation. This protocol differs from [3] in that we consider the product of numbers instead of vectors. Since the vector product computation can only be applied to two vectors, it cannot deal with the computation involved in multiple parties, where more than two vectors may participate in the computation. The second protocol, with the first protocol as the basis, is designed for the secure multi-party product operation.

4.1 Introducing The Commodity Server

For performance reasons, we use an extra server, the commodity server [2], in our protocol. The parties could send requests to the commodity server and receive data (called commodities) from the server, but the commodities must be independent of the parties' private data. The purpose of the commodities is to help the parties conduct the desired computations.

The commodity server is semi-trusted in the following senses: (1) It should not be trusted; therefore it should not be possible for it to derive the private information of the data from the parties; it should not learn the computation result either. (2) It should not collude with all the parties. (3) It follows the protocol correctly. Because of these characteristics, we say that it is a semi-trusted party. In the real world, finding such a semi-trusted party is much easier than finding a trusted party.

As we will see from our solutions, the commodity server does not participate in the actual computation among the parties; it only supplies commodities that are independent of the parties' private data. Therefore, the server can generate independent data off-line beforehand, and sell them as commodities to the participating parties (hence the name "commodity server").

4.2 Secure Number Product Protocol

Let's first consider the case of two parties where $n = 2$ (more general cases where $n \geq 3$ will be discussed later). Alice has a vector $X$ and Bob has a vector $Y$. Both vectors have $N$ elements. Alice and Bob want to compute the product between $X$ and $Y$ such that Alice gets $\sum_{i=1}^{N} U_x[i]$ and Bob gets $\sum_{i=1}^{N} U_y[i]$, where $\sum_{i=1}^{N} U_x[i] + \sum_{i=1}^{N} U_y[i] = \sum_{i=1}^{N} X[i] \cdot Y[i] = X \cdot Y$. $U_y[i]$ and $U_x[i]$ are random numbers. Namely, the scalar product of $X$ and $Y$ is divided into two secret pieces, with one piece going to Alice and the other going to Bob. We assume that random numbers are generated from the integer domain.

4.2.1 Secure Two-party Product Protocol

Protocol 1. (Secure Two-party Product Protocol)

1. The Commodity Server generates two random numbers $R_x[1]$ and $R_y[1]$, and lets $r_x[1] + r_y[1] = R_x[1] \cdot R_y[1]$, where $r_x[1]$ (or $r_y[1]$) is a randomly generated number. Then the server sends $(R_x[1], r_x[1])$ to Alice, and $(R_y[1], r_y[1])$ to Bob.

2. Alice sends $\hat{X}[1] = X[1] + R_x[1]$ to Bob.

3. Bob sends $\hat{Y}[1] = Y[1] + R_y[1]$ to Alice.

4. Bob generates a random number $U_y[1]$, and computes $\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])$, then sends the result to Alice.

5. Alice computes $(\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])) - (R_x[1] \cdot \hat{Y}[1]) + r_x[1] = X[1] \cdot Y[1] - U_y[1] + (r_y[1] - R_x[1] \cdot R_y[1] + r_x[1]) = X[1] \cdot Y[1] - U_y[1] = U_x[1]$.

6. Repeat steps 1-5 to compute $X[i] \cdot Y[i]$ for $i \in [2, N]$. Alice then gets $\sum_{i=1}^{N} U_x[i]$ and Bob gets $\sum_{i=1}^{N} U_y[i]$.

The bit-wise communication cost of this protocol is $7 \ast M \ast N$, where $M$ is the maximum number of bits for the values involved in our protocol. The cost is approximately 7 times the optimal cost of a two-party scalar product (the optimal cost of a scalar product is defined as the cost of conducting the product of $X$ and $Y$ without the privacy constraints, namely one party simply sends its data in plain to the other party). The cost can be decreased to $3 \ast M \ast N$ if the commodity server just sends seeds to Alice and Bob, since the seeds can be used to generate a set of random numbers.

4.2.2 Secure Multi-party Product Protocol

We have discussed our protocol of secure number product for two parties. Next, we will consider the protocol for securely computing the number product for multiple parties. For simplicity, we only describe the protocol when $n = 3$. The protocols for the cases when $n > 3$ can be similarly derived. Our concern is that a similarly derived solution may not be efficient when the number of parties is large, and a more efficient solution is still under research. Without loss of generality, let Alice have a private vector $X$ and a randomly generated vector $R_x$; let Bob have a private vector $Y$ and a randomly generated vector $R_y$, with $R_y[i] = R_y'[i] + R_y''[i]$ for $i \in [1, N]$; and let Carol have a private vector $Z$ and a randomly generated vector $R_z$. First, we let the parties hide these private numbers by using their respective random numbers, then conduct the product for the multiple numbers.

$$
\begin{aligned}
T[1] &= (X[1] + R_x[1]) \ast (Y[1] + R_y[1]) \ast (Z[1] + R_z[1]) \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1]R_y[1]R_z[1] \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1](R_y'[1] + R_y''[1])R_z[1] \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1]R_y'[1]R_z[1] + R_x[1]R_y''[1]R_z[1] \\
&= X[1]Y[1]Z[1] + (X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])Z[1] \\
&\quad + (X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y'[1])R_z[1] + R_x[1]R_y''[1]R_z[1] \\
&= T_0[1] + T_1[1] + T_2[1] + T_3[1],
\end{aligned}
$$

where

$$
\begin{aligned}
T_0[1] &= X[1]Y[1]Z[1], \\
T_1[1] &= (X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])Z[1], \\
T_2[1] &= (X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y'[1])R_z[1], \\
T_3[1] &= R_x[1]R_y''[1]R_z[1].
\end{aligned}
$$

$T_0[1]$ is what we want to obtain. To compute $T_0[1]$, we need to know $T[1]$, $T_1[1]$, $T_2[1]$ and $T_3[1]$. In this protocol, we let Alice get $T[1]$, Bob get $T_3[1]$ and Carol get $T_1[1]$ and $T_2[1]$. Bob separates $R_y[1]$ into $R_y'[1]$ and $R_y''[1]$. If he fails to do so, then his data might be disclosed during the computation of these terms.

To compute $T_1[1]$ and $T_2[1]$, Alice and Bob can use Protocol 1 to compute $X[1]R_y[1]$, $R_x[1]Y[1]$, $R_x[1]R_y[1]$, $X[1]Y[1]$ and $R_x[1]R_y'[1]$. Thus, according to Protocol 1, Alice gets $U_x[1]$, $U_x[2]$, $U_x[3]$, $U_x[4]$ and $U_x[5]$ and Bob gets $U_y[1]$, $U_y[2]$, $U_y[3]$, $U_y[4]$ and $U_y[5]$ (here the indices 1 to 5 label the five products rather than vector positions). Then they compute $(X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])$ and $(X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y'[1])$ and send the results to Carol, who can then compute $T_1[1]$ and $T_2[1]$. Note that $X[1]R_y[1] = U_x[1] + U_y[1]$, $R_x[1]Y[1] = U_x[2] + U_y[2]$, $R_x[1]R_y[1] = U_x[3] + U_y[3]$, $X[1]Y[1] = U_x[4] + U_y[4]$ and $R_x[1]R_y'[1] = U_x[5] + U_y[5]$.

To compute $T_3[1]$, Alice and Carol use Protocol 1 to compute $R_x[1]R_z[1]$, then send the results to Bob, who can then compute $T_3[1]$.

To compute $T[1]$, Bob sends $Y[1] + R_y[1]$ to Alice and Carol sends $Z[1] + R_z[1]$ to Alice.

Repeat the above process to compute $T[i]$, $T_1[i]$, $T_2[i]$, $T_3[i]$ and $T_0[i]$ for $i \in [2, N]$. Then, Alice has $\sum_{i=1}^{N} T[i]$, Bob has $\sum_{i=1}^{N} T_3[i]$ and Carol has $\sum_{i=1}^{N} T_1[i]$ and $\sum_{i=1}^{N} T_2[i]$. Finally, we achieve the goal and obtain $\sum_{i=1}^{N} X[i]Y[i]Z[i] = \sum_{i=1}^{N} T_0[i] = \sum_{i=1}^{N} T[i] - \sum_{i=1}^{N} T_1[i] - \sum_{i=1}^{N} T_2[i] - \sum_{i=1}^{N} T_3[i]$.

Protocol 2. (Secure Multi-party Product Protocol)

Step I: Random Number Generation

1. Alice generates a random number $R_x[1]$.

2. Bob generates two random numbers $R_y'[1]$ and $R_y''[1]$.

3. Carol generates a random number $R_z[1]$.

Step II: Collaborative Computing

Sub-step 1: To Compute $T[1]$

1. Carol computes $Z[1] + R_z[1]$ and sends it to Alice.

2. Bob computes $Y[1] + R_y[1]$ and sends it to Alice.

3. Alice computes $T[1] = (X[1] + R_x[1]) \ast (Y[1] + R_y[1]) \ast (Z[1] + R_z[1])$.

Sub-step 2: To Compute $T_1[1]$ and $T_2[1]$

1. Alice and Bob use Protocol 1 to compute $X[1] \cdot R_y[1]$, $R_x[1] \cdot Y[1]$, $R_x[1] \cdot R_y[1]$, $X[1] \cdot Y[1]$ and $R_x[1] \cdot R_y'[1]$. Then Alice obtains $U_x[1]$, $U_x[2]$, $U_x[3]$, $U_x[4]$ and $U_x[5]$ and Bob obtains $U_y[1]$, $U_y[2]$, $U_y[3]$, $U_y[4]$ and $U_y[5]$.

2. Alice sends $(U_x[1] + U_x[2] + U_x[3])$ and $(U_x[4] + U_x[1] + U_x[2] + U_x[5])$ to Carol.

3. Bob sends $(U_y[1] + U_y[2] + U_y[3])$ and $(U_y[4] + U_y[1] + U_y[2] + U_y[5])$ to Carol.

4. Carol computes $T_1[1] = (X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R_y[1]) \cdot Z[1]$, and $T_2[1] = (X[1] \cdot Y[1] + X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R_y'[1]) \cdot R_z[1]$.

Sub-step 3: To Compute $T_3[1]$

1. Alice and Carol use Protocol 1 to compute $R_x[1] \cdot R_z[1]$ and send the values they obtained from the protocol to Bob.

2. Bob computes $T_3[1] = R_y''[1] \cdot R_x[1] \cdot R_z[1]$.

Step III: Repeating

1. Repeat Step I and Step II to compute $T[i]$, $T_1[i]$, $T_2[i]$ and $T_3[i]$ for $i \in [2, N]$.

2. Alice then gets $[a] = \sum_{i=1}^{N} T[i] = \sum_{i=1}^{N} (X[i] + R_x[i]) \ast (Y[i] + R_y[i]) \ast (Z[i] + R_z[i])$.

3. Bob gets $[b] = \sum_{i=1}^{N} T_3[i] = \sum_{i=1}^{N} R_x[i] \cdot R_y''[i] \cdot R_z[i]$.

4. Carol gets $[c] = \sum_{i=1}^{N} T_1[i] = \sum_{i=1}^{N} (X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i]) \cdot Z[i]$, and $[d] = \sum_{i=1}^{N} T_2[i] = \sum_{i=1}^{N} (X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y'[i]) \cdot R_z[i]$.

Note that $\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i] = \sum_{i=1}^{N} T_0[i] = [a] - [b] - [c] - [d]$.

Theorem 1. Protocol 1 is secure such that Alice cannot learn $Y$ and Bob cannot learn $X$ either.

Proof. The number $\hat{X}[i] = X[i] + R_x[i]$ is all that Bob gets. Because of the randomness and the secrecy of $R_x[i]$, Bob cannot find out $X[i]$. According to the protocol, Alice gets (1) $\hat{Y}[i] = Y[i] + R_y[i]$, (2) $Z[i] = \hat{X}[i] \cdot Y[i] + (r_y[i] - U_y[i])$, and (3) $r_x[i]$, $R_x[i]$, where $r_x[i] + r_y[i] = R_x[i] \cdot R_y[i]$ (here $Z[i]$ denotes the value Alice receives in step 4, not Carol's vector). We will show that for any arbitrary $Y'[i]$, there exist $r_y'[i]$, $R_y'[i]$ and $U_y'[i]$ that satisfy the above equations. Assume $Y'[i]$ is an arbitrary number. Let $R_y'[i] = \hat{Y}[i] - Y'[i]$, $r_y'[i] = R_x[i] \cdot R_y'[i] - r_x[i]$, and $U_y'[i] = \hat{X}[i] \cdot Y'[i] + r_y'[i] - Z[i]$. Therefore, Alice has (1) $\hat{Y}[i] = Y'[i] + R_y'[i]$, (2) $Z[i] = \hat{X}[i] \cdot Y'[i] + (r_y'[i] - U_y'[i])$ and (3) $r_x[i]$, $R_x[i]$, where $r_x[i] + r_y'[i] = R_x[i] \cdot R_y'[i]$. Thus, from what Alice learns, there exist infinitely many possible values for $Y[i]$. Therefore, Alice cannot know $Y$ and neither can Bob know $X$.

Theorem 2. Protocol 2 is secure such that Alice cannot learn $Y$ and $Z$, Bob cannot learn $X$ and $Z$, and Carol cannot learn $X$ and $Y$.

Proof. According to the protocol, Alice obtains (1) $(Y[i] + R_y[i])$, and (2) $(Z[i] + R_z[i])$. Bob gets $R_x[i] \cdot R_z[i]$. Carol gets (1) $(X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i])$ and (2) $(X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y'[i])$. $R_x[i]$, $R_y[i]$ $(= R_y'[i] + R_y''[i])$ and $R_z[i]$ are arbitrary random numbers. Hence, from what Alice learns, there exist infinitely many possible values for $Y[i]$ and $Z[i]$. From what Bob learns, there also exist infinitely many possible values for $Z[i]$. From what Carol learns, there still exist infinitely many possible values for $X[i]$ and $Y[i]$. Therefore, Alice cannot learn $Y$ and $Z$, Bob cannot learn $X$ and $Z$, and Carol cannot learn $X$ and $Y$ either.

5 ANALYSIS

5.1 Security analysis

5.1.1 Why Choose A Large Domain

In our protocols, all the random numbers are generated from a very large domain (e.g., the integer domain). If the random numbers are generated from a small domain, then one party might get some information about the other parties' private data. For example, in Protocol 1, if the elements of $Y$ are in the domain $[0, 300]$, and we also know the random numbers are generated from $[0, 500]$, then if an element of the vector $Y[i] + R_y[i]$ is 650, we know the original element in the vector $Y$ is at least 150.

5.1.2 Malicious Model Analysis

In this paper, our algorithm is based on the semi-honest model, where all the parties behave honestly and cooperatively during the protocol execution. However, in practice, one of the parties (e.g., Bob) may be malicious in that it wants to gain true values of other parties' data by purposely manipulating its own data before executing the protocols. For example, in Protocol 1 Bob wants to know whether $X[i] = 1$ for some $i$. He may make up a set of numbers with all but the $i$th value being set to 0 (i.e., $Y[i] = 1$ and $Y[j] = 0$ for $j \neq i$). According to Protocol 1, if Bob obtains $\sum_{i=1}^{N} U_x[i] + \sum_{i=1}^{N} U_y[i]$, indicating the total number of counts for both $X$ and $Y$ being 1, then Bob can know that $X[i]$ is 0 if the above result is 0 and $X[i]$ is 1 if the above result is 1. To deal with this problem, we may randomly select a party to hold the frequency counts. For example, let's consider the scenario of three parties. Without loss of generality, we assume Bob is a malicious party. The chance that Bob gets chosen to hold the frequency counts is $\frac{1}{3}$. We consider the following two cases. (Assume that the samples in a sample space are equally likely.)

1. Make a correct guess of both Alice's and Carol's values.
If Bob is not chosen to hold the frequency counts, he then chooses to guess randomly and the probability for him to make a correct guess is $\frac{1}{4}$. In case Bob is chosen, if the product result is 1 (with the probability of $\frac{1}{4}$), he then concludes that both Alice and Carol have value 1; if the product result is 0 (with the probability of $\frac{3}{4}$), he would have a chance of $\frac{1}{3}$ to make a correct guess. Therefore, we have $\frac{2}{3} \ast \frac{1}{4} + \frac{1}{3}\left(\frac{1}{4} + \frac{3}{4} \ast \frac{1}{3}\right) \approx 33\%$. Note that the chance for Bob to make a correct guess, without his data being purposely manipulated, is 25%.

2. Make a correct guess for only one party's (e.g., Alice's) value.
If Bob is not chosen to hold the frequency counts, the chance that his guess is correct is $\frac{1}{2}$. In case Bob is chosen, if the product result is 1, he then knows Alice's value with certainty; if the result is 0, there are two possibilities that need to be considered: (1) if Alice's value is 0, then the chance that his guess is correct is $\frac{2}{3}$; (2) if Alice's value is 1, then the chance that his guess is correct is $\frac{1}{3}$. Therefore, we have $\frac{2}{3} \ast \frac{1}{2} + \frac{1}{3}\left(\frac{1}{4} + \frac{3}{4}\left(\frac{2}{3} \ast \frac{2}{3} + \frac{1}{3} \ast \frac{1}{3}\right)\right) \approx 56\%$. However, if Bob chooses to make a random guess, he then has a 50% chance of being correct.

It can be shown that the ratio for Bob to make a correct guess with/without manipulating his data in case 1 is $(n+1)/n$ and in case 2 approaches 1 at an exponential rate in $n$ ($\approx 2^{-(n-1)}$), where $n$ is the number of parties. The probability for a malicious party to make a correct guess about other parties' values decreases significantly as the number of parties increases.

5.1.3 How to deal with information disclosure by the inference from the results

Assume the association rule $DrugInjection \Rightarrow Hepatitis$ is what we get from the collaborative association rule mining, and this rule has a 99% confidence level (i.e., $P(Hepatitis \mid DrugInjection) = 0.99$). Now given a data item item1 with Alice:(Drug-Injection), Alice can figure out Bob's data (i.e., Bob:(Hepatitis) is in item1) with confidence 99% (but not vice versa). Such an inference problem exists whenever the number of items in the association rule is small and its confidence measure is high. To deal with the information disclosure through inference, we may require the parties to randomize their data as in [4] with some probabilities before conducting the association rule mining.

5.1.4 How to deal with the repeat use of protocol

A malicious party (e.g., Bob) may ask to run the protocol multiple times with a different set of values each time by manipulating his $U_y[\cdot]$ value. If other parties respond with honest answers, then this malicious party may have a chance to obtain actual values of other parties. To avoid this type of disclosure, constraints must be imposed on the number of repetitions.

5.2 Communication Analysis

There are three sources which contribute to the total bit-wise communication cost for the above protocols: (1) the number of rounds of communication to compute the number product for a single value (denoted by NumRod); (2) the maximum number of bits for the values involved in the protocols (denoted by $M$); (3) the number of times ($N$) that the protocols are applied. The total cost can be expressed by $NumRod \ast M \ast N$, where NumRod and $M$ are constants for each protocol. Therefore the communication cost is $O(N)$. $N$ is a large number when the number of parties is big, since $N$ increases exponentially as the number of parties expands.

6 CONCLUDING REMARKS

In this paper, we consider the problem of privacy-preserving collaborative data mining with inputs of binary data sets. In particular, we study how multiple parties can jointly conduct association rule mining on private data. We provided an efficient association rule mining procedure to carry out such a computation. To securely collect the necessary statistical measures from the data of multiple parties, we have developed a secure protocol, namely the number product protocol, for multiple parties to jointly conduct their desired computations. We also discussed the malicious model and approached it by distributing the measure of frequency counts to different parties, and suggested the use of the randomization method to reduce the inference of data disclosure.

In our future work, we will extend our method to deal with non-binary data sets. We will also apply our technique to other data mining computations, such as privacy-preserving clustering.

References

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD on Management of Data, pages 439-450, Dallas, TX USA, May 15 - 18 2000.
[2] D. Beaver. Commodity-based cryptography (extended abstract). In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, El Paso, TX USA, May 4-6 1997.
[3] W. Du and Z. Zhan. Building decision tree classifier on private data. In Workshop on Privacy, Security, and Data Mining at The 2002 IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan, December 9 2002.
[4] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In Proceedings of The 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27 2003.
[5] O. Goldreich. Secure multi-party computation (working draft). http://www.wisdom.weizmann.ac.il/home/oded/public html/foc.html, 1998.
[6] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[7] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - Crypto2000, Lecture Notes in Computer Science, volume 1880, 2000.
[8] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference on Management of Data SIGMOD93, pages 207-216, May 1993.
[9] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26 2002.
[10] A.C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.
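As an illustration only, the following Python sketch simulates Protocol 1 and Protocol 2 (Section 4.2) inside a single process and applies them to the example vectors of Fig. 2. It is not a secure implementation: every party's values live in one program, and the messages are ordinary local variables. The function names (`two_party_product`, `three_party_product`, `secure_count`), the mask range `RANGE`, and the use of Python's `secrets` module are assumptions made for this sketch and do not come from the paper. In Sub-step 1 the sketch assumes that Alice receives both masked values, as in the prose description of the protocol.

```python
import secrets

RANGE = 2**64  # assumed "large domain" for the random masks (cf. Section 5.1.1)

def rand_int():
    """Draw a random integer from the symmetric range [-RANGE, RANGE)."""
    return secrets.randbelow(2 * RANGE) - RANGE

def two_party_product(x, y):
    """Simulate Protocol 1 for one pair of numbers.

    Returns shares (u_x, u_y) with u_x + u_y == x * y.
    """
    # Step 1: the commodity server deals correlated randomness.
    R_x, R_y = rand_int(), rand_int()
    r_x = rand_int()
    r_y = R_x * R_y - r_x              # so that r_x + r_y == R_x * R_y
    # Steps 2-3: the parties exchange masked inputs.
    x_hat = x + R_x                    # Alice -> Bob
    y_hat = y + R_y                    # Bob -> Alice
    # Step 4: Bob picks his share and sends a masked value to Alice.
    u_y = rand_int()
    msg = x_hat * y + (r_y - u_y)
    # Step 5: Alice removes her masks and obtains her share.
    u_x = msg - R_x * y_hat + r_x
    return u_x, u_y

def three_party_product(x, y, z):
    """Simulate Protocol 2 for one triple.

    Returns (a, b, c, d) = (T, T3, T1, T2) with a - b - c - d == x * y * z.
    """
    # Step I: random number generation.
    R_x = rand_int()                           # Alice
    R_y1, R_y2 = rand_int(), rand_int()        # Bob: R_y = R_y' + R_y''
    R_y = R_y1 + R_y2
    R_z = rand_int()                           # Carol
    # Sub-step 1: Alice computes T from the masked values she receives.
    a = (x + R_x) * (y + R_y) * (z + R_z)
    # Sub-step 2: five runs of Protocol 1 between Alice and Bob.
    pairs = [(x, R_y), (R_x, y), (R_x, R_y), (x, y), (R_x, R_y1)]
    shares = [two_party_product(p, q) for p, q in pairs]
    ux = [s[0] for s in shares]
    uy = [s[1] for s in shares]
    # Carol adds the partial sums sent by Alice and by Bob.
    s1 = (ux[0] + ux[1] + ux[2]) + (uy[0] + uy[1] + uy[2])
    s2 = (ux[3] + ux[0] + ux[1] + ux[4]) + (uy[3] + uy[0] + uy[1] + uy[4])
    c = s1 * z                                 # T1
    d = s2 * R_z                               # T2
    # Sub-step 3: Alice and Carol share R_x * R_z and pass the shares to Bob.
    v_alice, v_carol = two_party_product(R_x, R_z)
    b = R_y2 * (v_alice + v_carol)             # T3
    return a, b, c, d

def secure_count(X, Y, Z):
    """Accumulate [a], [b], [c], [d] over all records and combine them."""
    totals = [0, 0, 0, 0]
    for x, y, z in zip(X, Y, Z):
        for k, t in enumerate(three_party_product(x, y, z)):
            totals[k] += t
    a, b, c, d = totals
    return a - b - c - d

# Vectors constructed from the raw data of Fig. 2.
X = [1, 0, 1, 1, 1]   # Alice: drug injection taker
Y = [1, 1, 1, 1, 0]   # Bob: hepatitis
Z = [1, 0, 1, 1, 0]   # Carol: bad credit rating

print(secure_count(X, Y, Z))  # 3
```

Because Python integers have arbitrary precision, the random masks cancel exactly and the combined value equals the plain count of records in which all three attributes are 1, whichever masks are drawn.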
