DTIC ADA465138: Statistical Sensitive Data Protection and Inference Prevention with Decision Tree Methods

Statistical Sensitive Data Protection and Inference Prevention with Decision Tree Methods

LiWu Chang*

* Correspondence: Mail code 5540, Naval Research Laboratory, Washington, DC. [email protected]

Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188). Report date: 2003; dates covered: 00-00-2003 to 00-00-2003. Performing organization: Naval Research Laboratory, Code 5540, 4555 Overlook Avenue, SW, Washington, DC 20375. Distribution/availability statement: Approved for public release; distribution unlimited. Security classification (report, abstract, this page): unclassified. Number of pages: 3.

Abstract

We present a new approach for protecting sensitive data in a relational table (columns: attributes; rows: records). If sensitive data can be inferred by unauthorized users with non-sensitive data, we have the inference problem. We consider inference as correct classification and approach it with decision tree methods. As in our previous work, sensitive data are viewed as classes of the test data and non-sensitive data are the remaining attribute values. In general, however, sensitive data may not be associated with one attribute (i.e., the class), but are distributed among many attributes. We present a generalized decision tree method for distributed sensitive data. This method takes in turn each attribute as the class and analyzes the corresponding classification error. Attribute values that maximize an integrated error measure are selected for modification. Our analysis shows that modified attribute values can be restored and hence, sensitive data are not securely protected. This result implies that modified values must themselves be subjected to protection. We present methods for this ramified protection problem and also discuss other statistical attacks.

1 Introduction

Information sharing and data disclosure have led to unprecedented demand for effective sensitive data protection methods. If sensitive data can be inferred by unauthorized users from non-sensitive data, we have the inference problem. In this paper, we apply the decision tree method ([5]) to inference prevention and sensitive data protection. The decision tree method conveniently provides a more localized description of data records. We assume that the data set is in the form of a relational table (columns: attributes; rows: records) and contains two parts: one is the training set and the other is the test set (Table 1). The decision tree method was first discussed in [1], where sensitive data were represented as values of the class label (i.e., the class attribute) of the test data. However, sensitive data may not be restricted to one particular attribute, and the class label in many data sets is not specified. We consider the case where sensitive data are distributed over the entire data set. We extend the decision tree method to handle distributed sensitive data.

2 Inference Problem

We consider a simple two-level security protocol which has High and Low users. The High users (e.g., the database manager) view the entire database, and the Low users share the High view with the exception of any confidential data. When data are shared, High releases some of the non-sensitive data to Low. In the pre-processing of data release, sensitive data are replaced by "?"s. It is well known that to prevent inference, removal of sensitive data alone is insufficient and modification of some non-sensitive data (e.g., blocking) is necessary (e.g., [3]).

Inference prevention proceeds as follows ([2]). High generates rules from the available data set, and then determines whether there is inference based on those rules. If the inference is excessive, then it implements a protection plan to lessen the inference (i.e., decides to modify by deleting certain data from the database as it appears to Low). The output of our inference model is the data set that can be released to Low. Our goal is to make modifications as parsimoniously as possible and thus avoid imposing unnecessary changes which lessen functionality.

Table 1: Relational Table for Evaluation. Aj denotes the jth attribute and the "?" denotes an unknown value, a piece of confidential datum, or a previously modified value.

  key       A1  A2  ..  Ak  ..  AM  class label
  training  ..  ?   ..  ..  ..  ..  ..
  data      ?   ..  ..  ?   ..  ?   ..
            ..  ..  ..  ..  ..  ..  ..
  testing   ..  ..  ..  ?   ..  ..  ?
            ..  ?   ..  ..  ..  ..  ?

3 Decision Tree Method

The class label attribute in conventional decision tree methods is deterministic. To deal with inference in the presence of distributed sensitive data, any attribute may be considered as the class label (the original class label thereby becomes an ordinary attribute).

As in our previous work ([1]), sensitive data are viewed as classes of the test data and non-sensitive data are the remaining attribute values. We consider inference as correct classification: the lower the correct classification, the higher the security of the data will be. To prevent inference, we increase (decrease) the classification error (correct classification) by modifying the set of attribute values of non-sensitive data that yields the largest increase (decrease) in classification error (correct classification). We formalize the requirements in the next section.

4 Metric

Data modification is most likely to incur degradation of data performance. Important metrics in data modification are the effectiveness measure of sensitive data protection (E) and the measure of the loss of functionality (F) in a data set. In terms of the decision tree method, the effectiveness measure (w.r.t. the current class label) is determined by the classification error of the test data (i.e., the confidential data), while the measure of loss of functionality is a function of the classification error of the training data (i.e., the to-be-released data).

Suppose the jth attribute is posted as the class label. Let the measure of protection effectiveness with respect to the jth attribute be denoted as Ej and the measure of the loss of functionality be denoted as Fj. The overall measures E and F for the entire database are functions (e.g., weighted averages) of the Ejs and Fjs. The measure of the loss of functionality F usually has an upper bound of a given threshold tau (i.e., F <= tau) that represents the maximum level of information loss that users are willing to tolerate. With the definitions of E and F in mind, our optimization goal is to

  Minimize E, while keeping F <= tau;

i.e., we optimize E subject to the constraint on F (the optimization criterion). Note that the effect of protection is evaluated from High's perspective, while the database functionality is evaluated from Low's view.

5 Modification Control

In theory, the optimal set of attribute values for modification can be determined by exhaustively evaluating every possible batch of attribute values of non-confidential data and selecting the batch that scores the highest with respect to a given optimization criterion. Such an exhaustive search is impractical when the volume of the data is large. Instead of evaluating each attribute in turn, we prioritize the attributes and evaluate the one that yields the highest threat. During modification, we visit the attribute that has the largest number of sensitive data records and the lowest classification error, i.e., the highest inference threat. (Prioritization needs to be carried out for each run of modification.) Let the total number of confidential attribute values be denoted as S, and the classification error of the test data with respect to the jth attribute be Crj. We select from all the M attributes the one that maximizes the product of the number of associated confidential data records and the complement (1 - Crj) of the classification error:

  MAX_{j=1..M} (1 - Crj) (TEj / S)

where TEj is the number of test data associated with the jth attribute. Handling attributes that are of high inference threat first allows us to achieve the necessary level of protection more effectively.

6 Example

Table 2 is a small sample taken from a Submarine Design database(1) and represents the initial Low view. The information "diesels of S. Cruz" and "range of Preveze" is assumed to be sensitive. From the available information (D), one can infer these two pieces of sensitive data with probability 1, i.e., Pr("diesels of S. Cruz" = medium | D) = 1 and Pr("range of Preveze" = short | D) = 1. The inference is excessive. After downgrading, the modified Low view is shown in Table 3, where modified attribute values (i.e., "?"s) are in bold-face. The probabilities of these two pieces of sensitive data become Pr("diesels of S. Cruz" = medium | D) = 0.33 and Pr("range of Preveze" = short | D) = 0.33, indicating equal likelihood, and the result is desirable.

(1) The Submarine Design database has 98 records and 9 attributes and was collected from Jane's Naval Weapon Systems.

Table 2: initial Low database

  name     diesels  range   depth
  Agosta   low      short   medium
  Foxtrot  medium   medium  medium
  Seawolf  high     long    deep
  S. Cruz  ?        medium  medium
  Preveze  low      ?       medium

Table 3: modified Low database

  name     diesels  range   depth
  Agosta   low      short   medium
  Foxtrot  medium   medium  medium
  Seawolf  high     long    deep
  S. Cruz  ?        ?       medium
  Preveze  ?        ?       medium

7 Restoration Attacks

If an adversary knows the strategy of inference prevention, then (s)he may be able to restore the modified attribute values (referred to as the restoration attack). In this case, the sensitive data are not correctly protected.

As discussed, sensitive data associated with an attribute are deemed the classes of the test data when this attribute is posted as the class label. For a decision tree, the root node (attribute) is more likely to be selected for modification than other nodes (attributes), because the root node is the most informative attribute (in terms of Shannon's entropy measure) with respect to the class label. Among many, we consider one type of restoration attack in which an adversary computes the most informative attribute w.r.t. a class label, posts it as the new class label, and estimates the associated hidden values. For example, consider the "voting" data ([4]). In this data set, the original class label is "party" and the corresponding most informative attribute is "physician fee freeze". It can be shown that the previously hidden attribute values (e.g., "physician fee freeze") are restored by using some other attributes (e.g., "El Salvador aid"). ("El Salvador aid" is the most informative attribute to "physician fee freeze".) To prevent the possible restoration of modified values, we repeat the process of attribute value hiding by making previously modified non-sensitive data sensitive until the restoration risk drops below a specified threshold. (Of course, this threshold is incorporated in F.) The need for repeated hiding is referred to as the ramification problem of data inference ([2]).

8 Future Work

We will study the effects of inference prevention based on different modification methods, investigate different types of statistical attack, and extend the current model to distributed environments.

References

[1] Chang, L. & Moskowitz, I. (1998) "Parsimonious Downgrading and Decision Trees Applied to the Inference Problem," NSP Workshop, pp. 82-89.
[2] Chang, L. & Moskowitz, I. (2000) "An Integrated Framework for Database Inference and Privacy Protection," Data and Applications Security (eds. Thuraisingham, van de Riet, Dittrich & Tari), Kluwer, pp. 161-172.
[3] Doyle, P., Lane, J., Zayatz, L. & Theeuwes, J. (2001) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, Elsevier Science.
[4] http://kdd.ics.uci.edu/
[5] Quinlan, R. (1992) C4.5, Morgan Kaufmann.
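The High-to-Low release procedure of Section 2, constrained by the functionality bound of Section 4, can be sketched as a loop that keeps blocking cells until inference is no longer excessive. This is only a minimal sketch under stated assumptions: the names `release`, `threat`, `func_loss`, and `pick_cell` are hypothetical placeholders, not functions defined in the paper.

```python
# Sketch of High's release loop (Sections 2 and 4): repeatedly block
# the highest-threat non-sensitive cell until the inference threat is
# acceptable, while functionality loss F stays within threshold tau.
# threat, func_loss, and pick_cell are abstract placeholders chosen
# for this illustration.

def release(view, threat, func_loss, pick_cell, excess, tau):
    """Return the modified Low view; "?" marks a blocked value."""
    while threat(view) > excess and func_loss(view) <= tau:
        i, j = pick_cell(view)    # highest-threat visible cell
        view[i][j] = "?"          # blocking: replace with unknown
    return view

# Toy run: threat = number of visible cells, no functionality loss.
view = [["a", "b"], ["c", "d"]]
visible = lambda v: sum(1 for row in v for x in row if x != "?")
first = lambda v: next((i, j) for i, row in enumerate(v)
                       for j, x in enumerate(row) if x != "?")
print(release(view, visible, lambda v: 0, first, 2, 0))  # [['?', '?'], ['c', 'd']]
```

In a real run, `pick_cell` would be the attribute-prioritization step of Section 5 and `threat` the classification-based inference measure.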
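The attribute-prioritization criterion of Section 5, MAX_j (1 - Crj)(TEj / S), can be sketched as follows. The function name `select_attribute` and the numeric values below are hypothetical illustrations; only the scoring formula comes from the paper.

```python
# Sketch of the Section 5 criterion: pick the attribute j maximizing
# (1 - Cr_j) * (TE_j / S) -- low classification error combined with
# many associated confidential test records means high inference
# threat. All attribute names and numbers below are hypothetical.

def select_attribute(cr, te):
    """cr[j]: classification error with attribute j posted as the class
    label; te[j]: number of test (confidential) records tied to j."""
    s = sum(te.values())  # S: total number of confidential attribute values
    return max(cr, key=lambda j: (1.0 - cr[j]) * (te[j] / s))

cr = {"diesels": 0.2, "range": 0.4, "depth": 0.5}   # hypothetical Cr_j
te = {"diesels": 3, "range": 2, "depth": 1}          # hypothetical TE_j
print(select_attribute(cr, te))  # diesels: (1-0.2)*(3/6) = 0.4 is largest
```

As the paper notes, this selection must be recomputed after every modification run, since both Crj and the set of remaining confidential cells change.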
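The probability-1 inference of Section 6 can be reproduced with a simple matching estimator: count the complete records that agree with the target row on its visible attributes. This estimator is our illustration; the paper's High uses rule generation ([2]) rather than this exact procedure, and `infer` is a hypothetical helper.

```python
# Sketch of the Section 6 inference: Low estimates a hidden ("?")
# value by counting complete records that agree with the target row
# on its visible attributes. A single match yields probability 1.

def infer(table, target, hidden_attr):
    """Return {value: probability} for the target's hidden attribute."""
    visible = [a for a, v in target.items() if v != "?" and a != hidden_attr]
    counts = {}
    for row in table:
        if row[hidden_attr] == "?":
            continue
        if all(row[a] == target[a] for a in visible):
            counts[row[hidden_attr]] = counts.get(row[hidden_attr], 0) + 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}

# Complete rows of the initial Low view (Table 2).
low_view = [
    {"diesels": "low",    "range": "short",  "depth": "medium"},  # Agosta
    {"diesels": "medium", "range": "medium", "depth": "medium"},  # Foxtrot
    {"diesels": "high",   "range": "long",   "depth": "deep"},    # Seawolf
]
s_cruz = {"diesels": "?", "range": "medium", "depth": "medium"}
print(infer(low_view, s_cruz, "diesels"))  # {'medium': 1.0}
```

On Table 2, S. Cruz's visible (range, depth) pair matches only Foxtrot, so Low recovers diesels = medium with certainty; after the Table 3 downgrading removes those matches, the estimate degrades toward a uniform guess over the three diesel values (0.33 each).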
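The restoration attack of Section 7 can be sketched with an information-gain computation: post the column containing hidden values as the class label, find the most informative remaining attribute (by Shannon entropy), and predict each hidden value from the peers that share that attribute's value. The toy rows below are hypothetical, not the cited "voting" data, and `restore` is an assumed helper name.

```python
# Sketch of the Section 7 restoration attack: the adversary treats the
# column with hidden ("?") values as the class label, picks the most
# informative other attribute by information gain, and fills in each
# hidden value by majority vote among rows sharing that attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, cls):
    known = [r for r in rows if r[cls] != "?"]
    base = entropy([r[cls] for r in known])
    cond = 0.0
    for v in set(r[attr] for r in known):
        part = [r[cls] for r in known if r[attr] == v]
        cond += len(part) / len(known) * entropy(part)
    return base - cond

def restore(rows, cls):
    attrs = [a for a in rows[0] if a != cls]
    best = max(attrs, key=lambda a: info_gain(rows, a, cls))
    for r in rows:
        if r[cls] == "?":
            peers = [p[cls] for p in rows if p[cls] != "?" and p[best] == r[best]]
            r[cls] = Counter(peers).most_common(1)[0][0]
    return best, rows

# Hypothetical rows: attribute "a" perfectly predicts the class "c".
rows = [
    {"a": "x", "b": "p", "c": "yes"},
    {"a": "x", "b": "q", "c": "yes"},
    {"a": "y", "b": "p", "c": "no"},
    {"a": "y", "b": "q", "c": "no"},
    {"a": "x", "b": "q", "c": "?"},   # hidden (blocked) value
]
best, restored = restore(rows, "c")
print(best, restored[4]["c"])  # a yes
```

This mirrors the "voting" example: just as "El Salvador aid" restores hidden "physician fee freeze" values, any highly informative surviving attribute can undo a single round of blocking, which is why the paper repeats the hiding process until the restoration risk falls below threshold.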
