A Survey of Quantification of Privacy Preserving Data Mining Algorithms

Elisa Bertino
Department of Computer Science, Purdue University
bertino@cs.purdue.edu

Dan Lin
Department of Computer Science, Purdue University
lindan@cs.purdue.edu

Wei Jiang
Department of Computer Science, Purdue University
wjiang@cs.purdue.edu
Abstract
The aim of privacy preserving data mining (PPDM) algorithms is to extract relevant knowledge from large amounts of data while at the same time protecting sensitive information. An important aspect in the design of such algorithms is the identification of suitable evaluation criteria and the development of related benchmarks. Recent research in the area has devoted much effort to determining a trade-off between the right to privacy and the need for knowledge discovery. It is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Therefore, it is crucial to provide a comprehensive view of a set of metrics related to existing privacy preserving algorithms, so that we can gain insights on how to design more effective measurements and PPDM algorithms. In this chapter, we review and summarize existing criteria and metrics for evaluating privacy preserving techniques.

Keywords: Privacy metric
Privacy-Preserving Data Mining: Models and Algorithms
8.1 Introduction
Privacy is one of the most important properties that an information system must satisfy. For this reason, several efforts have been devoted to incorporating privacy preserving techniques into data mining algorithms, in order to prevent the disclosure of sensitive information during the knowledge discovery process. The existing privacy preserving data mining techniques can be classified according to the following five dimensions [32]: (i) data distribution (centralized or distributed); (ii) the modification applied to the data (encryption, perturbation, generalization, and so on) in order to sanitize them; (iii) the data mining algorithm which the privacy preservation technique is designed for; (iv) the data type (single data items or complex data correlations) that needs to be protected from disclosure; (v) the approach adopted for preserving privacy (heuristic or cryptography-based approaches). While heuristic-based techniques are mainly conceived for centralized datasets, cryptography-based algorithms are designed for protecting privacy in a distributed scenario by using encryption techniques. Recently proposed heuristic-based algorithms aim at hiding sensitive raw data by applying perturbation techniques based on probability distributions. Moreover, several heuristic-based approaches for hiding both raw and aggregated data through a hiding technique (k-anonymization, adding noise, data swapping, generalization and sampling) have been developed, first in the context of association rule mining and classification and, more recently, for clustering techniques.

Given the number of different privacy preserving data mining (PPDM) techniques that have been developed in recent years, there is an emerging need to move toward standardization in this new research area, as discussed by Oliveira and Zaiane [23]. One step toward this essential process is to provide a quantification approach for PPDM algorithms, making it possible to evaluate and compare such algorithms. However, due to the variety of characteristics of PPDM algorithms, it is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Rather, an algorithm may perform better than another on specific criteria, such as privacy level or data quality. Therefore, it is important to provide users with a comprehensive set of privacy preserving related metrics which will enable them to select the most appropriate privacy preserving technique for the data at hand, with respect to the specific parameters they are interested in optimizing [6].

For a better understanding of PPDM related metrics, we next identify a proper set of criteria and the related benchmarks for evaluating PPDM algorithms. We then adopt these criteria to categorize the metrics. First, we need to be clear about the concept of "privacy" and the general goals of a PPDM algorithm. In our society the term privacy is overloaded and can, in general, assume a wide range of different meanings. For example, in the
context of the HIPAA¹ Privacy Rule, privacy means the individual's ability to control who has access to personal health care information. From the organizations' point of view, privacy involves the definition of policies stating which information is collected, how it is used, and how customers are informed and involved in this process. Moreover, there are many other definitions of privacy that are generally related to the particular environment in which privacy has to be guaranteed. What we need is a more generic definition, one that can be instantiated in different environments and situations. From a philosophical point of view, Schoeman [26] and Walters [33] identify three possible definitions of privacy:

- Privacy as the right of a person to determine which personal information about himself/herself may be communicated to others.
- Privacy as the control over access to information about oneself.
- Privacy as limited access to a person and to all the features related to the person.

In these three definitions, what is interesting from our point of view is the concept of "controlled information release". From this idea, we argue that a definition of privacy more closely related to our target could be the following: "The right of an individual to be secure from unauthorized disclosure of information about oneself that is contained in an electronic repository". Tuning the definition further, we consider privacy as "the right of an entity to be secure from unauthorized disclosure of sensitive information that is contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository". This last generalization is needed because privacy does not concern individual data items alone. As in [23], we consider two main scenarios.

The first is the case of a medical database where there is the need to provide information about diseases while preserving the patient identity. Another scenario is the classical "market basket" database, where the transactions related to different client purchases are stored, and from which it is possible to extract information in the form of association rules like "If a client buys a product X, he/she will also purchase product Z with y% probability". The first is an example where individual privacy has to be ensured by protecting from unauthorized disclosure sensitive information in the form of specific data items related to specific individuals. The second one, instead, emphasizes how not only the raw data contained in a database must be protected, but also, in some cases, the high-level information that can be derived from nonsensitive raw data needs
¹Health Insurance Portability and Accountability Act
to be protected. Such a scenario justifies the final generalization of our privacy definition. In the light of these considerations, it is now easy to define the main goals a PPDM algorithm should enforce:

1. A PPDM algorithm should prevent the discovery of sensitive information.
2. It should be resistant to the various data mining techniques.
3. It should not compromise the access and the use of nonsensitive data.
4. It should not have an exponential computational complexity.

Correspondingly, we identify the following set of criteria based on which a PPDM algorithm can be evaluated:

- Privacy level offered by a privacy preserving technique, which indicates how closely the sensitive information that has been hidden can still be estimated;
- Hiding failure, that is, the portion of sensitive information that is not hidden by the application of a privacy preservation technique;
- Data quality after the application of a privacy preserving technique, considered both as the quality of the data themselves and the quality of the data mining results after the hiding strategy is applied;
- Complexity, that is, the ability of a privacy preserving algorithm to execute with good performance in terms of all the resources consumed by the algorithm.

In the rest of the chapter, we first present details of each criterion through an analysis of existing PPDM techniques. We then discuss how to select a proper metric under a specified condition. Finally, we summarize this chapter and outline future research directions.
8.2 Metrics for Quantifying Privacy Level
Before presenting the different metrics related to privacy level, we need to take into account two aspects: (i) sensitive or private information contained in the original dataset; and (ii) private information that can be discovered from the data mining results. We refer to the first as data privacy and to the latter as result privacy.
8.2.1 Data Privacy
In general, the quantification used to measure data privacy is the degree of uncertainty according to which the original private data can be inferred. The
higher the degree of uncertainty achieved by a PPDM algorithm, the better the data privacy is protected by this PPDM algorithm. For various types of PPDM algorithms, the degree of uncertainty is estimated in different ways. According to the adopted techniques, PPDM algorithms can be classified into two main categories: heuristic-based approaches and cryptography-based approaches. Heuristic-based approaches mainly include four sub-categories: additive noise, multiplicative noise, k-anonymization, and statistical disclosure control based approaches. In what follows, we survey representative works of each category of PPDM algorithms and review the metrics used by them.

Additive-Noise-based Perturbation Techniques. The basic idea of the additive-noise-based perturbation technique is to add random noise to the actual data. In [2], Agrawal and Srikant use an additive-noise-based technique to perturb data. They then estimate the probability distribution of the original numeric data values in order to build a decision tree classifier from the perturbed training data. They introduce a quantitative measure to evaluate the amount of privacy offered by a method, and evaluate the proposed method against this measure. Privacy is measured by evaluating how closely the original values of a modified attribute can be determined. In particular, if the perturbed value of an attribute can be estimated, with a confidence c, to belong to an interval [a, b], then the privacy is estimated by (b − a) with confidence c. However, this metric does not work well because it does not take into account the distribution of the original data along with the perturbed data. Therefore, a metric that considers all the informative content of data available to the user is needed. Agrawal and Aggarwal [1] address this problem by introducing a new privacy metric based on the concept of information entropy. More specifically, they propose an Expectation Maximization (EM) based algorithm for distribution reconstruction, which converges to the maximum likelihood estimate of the original distribution on the perturbed data. Their measurement of privacy considers the fact that both the perturbed individual record and the reconstructed distribution are available to the user, as well as the perturbing distribution, as specified in [10]. This metric defines the average conditional privacy of an attribute A, given other information modeled with a random variable B, as 2^h(A|B), where h(A|B) is the conditional differential entropy of A given B, representing a measure of the uncertainty inherent in the value of A given the value of B.
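To make the interval-based measure concrete, the following sketch perturbs values with uniform additive noise and reports the (b − a) interval width. The noise width `alpha` and the sample data are our own illustrative assumptions, not values taken from [2].

```python
import random

ALPHA = 20.0  # illustrative noise width, not a value from the paper

def perturb(values, alpha=ALPHA):
    """Additive-noise perturbation: add uniform noise from [-alpha/2, +alpha/2]."""
    return [v + random.uniform(-alpha / 2, alpha / 2) for v in values]

def privacy_interval(perturbed_value, alpha=ALPHA):
    """With uniform noise of width alpha, the original value is known, with
    100% confidence, only to lie in [w - alpha/2, w + alpha/2]; the
    Agrawal-Srikant measure is then the interval width b - a."""
    a, b = perturbed_value - alpha / 2, perturbed_value + alpha / 2
    return b - a

random.seed(0)
noisy = perturb([23, 35, 47, 52, 61])
print(privacy_interval(noisy[0]))  # ~20.0: privacy of width alpha at full confidence
```

Note that, exactly as the text observes, this measure is the same for every record regardless of how the original data are distributed, which is what motivates the entropy-based refinement of [1].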
Another additive-noise-based perturbation technique is by Rizvi and Haritsa [24]. They propose a distortion method to pre-process the data before executing the mining process. Their privacy measure deals with the probability with which the user's distorted entries can be reconstructed. Their goal is to ensure privacy at the level of individual entries in each customer tuple. In other words, the authors estimate the probability that a given 1 or 0 in the true matrix
representing the transactional database can be reconstructed, even though for many applications the 1 and 0 values do not need the same level of privacy.

Evfimievski et al. [11] propose a framework for mining association rules from transactions consisting of categorical items, where the data has been randomized to preserve the privacy of individual transactions, while ensuring at the same time that only true associations are mined. They also provide a formal definition of privacy breaches, and a class of randomization operators that are much more effective in limiting breaches than uniform randomization. According to Definition 4 from [11], an itemset A results in a privacy breach of level ρ if the probability that an item in A belongs to a non-randomized transaction, given that A is included in a randomized transaction, is greater than or equal to ρ. In some scenarios, being confident that an item is not present in the original transaction may also be considered a privacy breach. In order to evaluate privacy breaches, the approach taken by Evfimievski et al. is to count the occurrences of an itemset in a randomized transaction and of its sub-itemsets in the corresponding non-randomized transaction. Out of all sub-itemsets of an itemset, the item causing the worst privacy breach is chosen. Then, for each combination of transaction size and itemset size, the worst and the average value of this breach level are computed over all frequent itemsets. The itemset size giving the worst value for each of these two values is selected.
Finally, we introduce a universal measure of data privacy level, proposed by Bertino et al. in [6]. The measure is developed based on [1]. The basic concept used by this measure is information entropy, which is defined by Shannon [27]: let X be a random variable which takes on a finite set of values according to a probability distribution p(x). Then, the entropy of this probability distribution is defined as follows:

h(X) = −∑_x p(x) log₂(p(x))    (8.1)

or, in the continuous case:

h(X) = −∫ f(x) log₂(f(x)) dx    (8.2)

where f(x) denotes the density function of the continuous random variable X. Information entropy is a measure of how much "choice" is involved in the selection of an event, or of how uncertain we are of its outcome. It can be used for quantifying the amount of information associated with a set of data. The concept of "information associated with data" can be useful in the evaluation of the privacy achieved by a PPDM algorithm. Because the entropy represents the information content of a datum, the entropy after data sanitization should be higher than the entropy before the sanitization. Moreover, the entropy can be regarded as an evaluation of the uncertainty in forecasting an event, which in our context is the right value of a datum. Consequently, the level
of privacy inherent in an attribute X, given some information modeled by Y, is defined as follows:

Π(X|Y) = 2^(−∫ f_{X,Y}(x,y) log₂ f_{X|Y=y}(x) dx dy)    (8.3)

The privacy level defined in equation 8.3 is very general. In order to use it in the different PPDM contexts, it needs to be refined in relation to some characteristics, such as the type of transactions, the type of aggregation, and the PPDM methods. In [6], an example of instantiating the entropy concept to evaluate the privacy level in the context of association rules is presented.

However, it is worth noting that the value of the privacy level depends not only on the PPDM algorithm used, but also on the knowledge that an attacker has about the data before the use of data mining techniques, and on the relevance of this knowledge in the data reconstruction operation. This problem is underlined, for example, in [29, 30]. In [6], this aspect is not considered, but it is possible to introduce assumptions on the attacker's knowledge by properly modeling Y.
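As a minimal illustration of the entropy-based measure, the sketch below instantiates the unconditional, discrete case of equation 8.3, i.e. Π(X) = 2^h(X) estimated from samples; the sample data are invented.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy h(X) = -sum p(x) log2 p(x), estimated from samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def privacy_level(samples):
    """Privacy level 2^h(X): the size of the 'effective' value set an
    attacker must guess among.  Higher means better-protected data."""
    return 2 ** entropy(samples)

uniform = [0, 1, 2, 3] * 25          # 4 equally likely attribute values
skewed  = [0] * 97 + [1, 2, 3]       # almost always 0
print(privacy_level(uniform))        # 4.0 -> like guessing among 4 values
print(privacy_level(skewed))         # close to 1 -> little real uncertainty
```

The contrast between the two datasets shows exactly why this measure improves on the interval width of [2]: both take values in {0, 1, 2, 3}, but the skewed one offers an attacker far less uncertainty.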
Multiplicative-Noise-based Perturbation Techniques. According to [16], additive random noise can be filtered out using certain signal processing techniques with very high accuracy. To avoid this problem, random projection-based multiplicative perturbation techniques have been proposed in [19]. Instead of adding some random values to the actual data, random matrices are used to project the set of original data points to a randomly chosen lower-dimensional space. However, the transformed data still preserve many statistical aggregates of the original data, so that certain data mining tasks (e.g., computing the inner product matrix, linear classification, K-means clustering and computing Euclidean distances) can be performed on the transformed data in a distributed environment (where data are either vertically or horizontally partitioned) with small errors.

In addition, this approach provides a high degree of privacy for the original data. As analyzed in the paper, even if the random matrix (i.e., the multiplicative noise) is disclosed, it is impossible to find the exact values of the original dataset; only an approximation of the original data can be found. The variance of the approximated data is used as the privacy measure.
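The projection idea can be sketched as follows. This is a toy, pure-Python rendering of random projection, not the actual construction of [19]; the dimension `k` is chosen large only so that the demo is visibly accurate.

```python
import random

def random_projection(data, k, seed=0):
    """Project d-dimensional rows onto a random k-dimensional subspace.
    With i.i.d. N(0, 1) entries, inner products are preserved in
    expectation after scaling by 1/k (Johnson-Lindenstrauss flavour)."""
    rng = random.Random(seed)
    d = len(data[0])
    R = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    return [[sum(R[i][j] * row[j] for j in range(d)) for i in range(k)]
            for row in data]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 0.0, 1.0, 1.0]
k = 2000
px, py = random_projection([x, y], k)
print(inner(x, y))            # 9.0 (true value)
print(inner(px, py) / k)      # approximately 9.0, from perturbed data alone
```

A party holding only `px` and `py` can thus approximate inner products, and hence Euclidean distances, without ever seeing `x` or `y` exactly.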
Oliveira and Zaiane [22] also adopt a multiplicative-noise-based perturbation technique to perform clustering analysis while ensuring privacy preservation at the same time. They introduce a family of geometric data transformation methods in which a noise vector is applied to distort confidential numerical attributes. The privacy ensured by such techniques is measured as the variance difference between the actual and the perturbed values. This measure is given by Var(X − Y), where X represents a single original attribute and Y
the distorted attribute. This measure can be made scale invariant with respect to the variance of X by expressing security as Sec = Var(X − Y)/Var(X).

k-Anonymization Techniques. The concept of k-anonymization is introduced by Samarati and Sweeney in [25, 28]. A database is k-anonymous with respect to quasi-identifier attributes (a set of attributes that can be used with certain external information to identify a specific individual) if there exist at least k transactions in the database having the same values according to the quasi-identifier attributes. In practice, in order to protect a sensitive dataset T, before releasing T to the public, T is converted into a new dataset T* that guarantees the k-anonymity property for a sensitive attribute, by performing some value generalizations on the quasi-identifier attributes. Therefore, the degree of uncertainty of the sensitive attribute is at least 1/k.
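A minimal check of the k-anonymity property might look as follows; the records and the generalized values (`479**`, `2*`, etc.) are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(table, quasi_ids, k):
    """A table is k-anonymous w.r.t. the quasi-identifier attributes if
    every combination of quasi-identifier values occurs in >= k rows."""
    groups = Counter(tuple(row[a] for a in quasi_ids) for row in table)
    return all(count >= k for count in groups.values())

# Toy records: ZIP code and age are quasi-identifiers, disease is sensitive.
rows = [
    {"zip": "479**", "age": "2*", "disease": "flu"},
    {"zip": "479**", "age": "2*", "disease": "cold"},
    {"zip": "478**", "age": "3*", "disease": "flu"},
    {"zip": "478**", "age": "3*", "disease": "ulcer"},
]
print(is_k_anonymous(rows, ["zip", "age"], 2))  # True
print(is_k_anonymous(rows, ["zip", "age"], 3))  # False
```

With 2-anonymity holding, an attacker who knows someone's ZIP code and age can narrow that person down to a group of at least 2 rows, matching the 1/k uncertainty bound above.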
Statistical-Disclosure-Control-based Techniques. In the context of statistical disclosure control, a large number of methods have been developed to preserve individual privacy when releasing aggregated statistics on data. To anonymize the released statistics against data items such as person, household and business, which can be used to identify an individual, not only the features described by the statistics but also related publicly available information need to be considered [35]. In [7], a description of the most relevant perturbation methods proposed so far is presented. Among the methods specifically designed for continuous data, the following masking techniques are described: additive noise, data distortion by probability distribution, resampling, microaggregation, rank swapping, etc. For categorical data, both perturbative and non-perturbative methods are presented. The top-coding and bottom-coding techniques are both applied to ordinal categorical variables; they recode, respectively, the first/last p values of a variable into a new category. The global-recoding technique, instead, recodes the p lowest frequency categories into a single one.

The privacy level of such methods is assessed by using the disclosure risk, that is, the risk that a piece of information be linked to a specific individual. There are several approaches to measuring the disclosure risk. One approach is based on the computation of distance-based record linkage. An intruder is assumed to try to link the masked dataset with the external dataset using the key variables. The distance between records in the original and the masked datasets is computed. A record in the masked dataset is labelled as "linked" or "linked to 2nd nearest" if the nearest or 2nd nearest record in the original dataset turns out to be the corresponding original record. The disclosure risk is then computed as the percentage of "linked" and "linked to 2nd nearest" records. The second approach is based on the computation of probabilistic record linkage. The linear sum assignment model is used to 'pair' records in the original
file and the masked file. The percentage of correctly paired records is a measure of disclosure risk. Another approach computes rank intervals for the records in the masked dataset. The proportion of original values that fall into the interval centered around their corresponding masked value is a measure of disclosure risk.
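The first, distance-based record linkage measure can be sketched as follows; this version counts only exact nearest-neighbour links (not "linked to 2nd nearest"), and the numeric records are invented.

```python
def linkage_risk(original, masked):
    """Distance-based record linkage: for each masked record, find the
    nearest original record; count it 'linked' if that nearest record is
    the true corresponding one.  The percentage linked estimates the
    disclosure risk of the masking method."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    linked = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: dist(m, original[j]))
        if nearest == i:
            linked += 1
    return 100.0 * linked / len(masked)

orig = [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0), (90.0, 9.0)]
light = [(11.0, 1.1), (19.0, 2.2), (31.0, 2.9), (88.0, 9.3)]  # light masking
heavy = [(50.0, 5.0)] * 4                                      # heavy masking
print(linkage_risk(orig, light))  # 100.0 -> every record re-identified
print(linkage_risk(orig, heavy))  # 25.0  -> only one accidental match
```

The two masked datasets illustrate the trade-off this chapter keeps returning to: the lightly masked data carry high disclosure risk but high quality, the heavily masked data the reverse.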
Cryptography-based Techniques. Cryptography-based techniques usually guarantee a very high level of data privacy. In [14], Kantarcioglu and Clifton address the problem of secure mining of association rules over horizontally partitioned data, using cryptographic techniques to minimize the information shared. Their solution is based on the assumption that each party first encrypts its own itemsets using commutative encryption, and then the already encrypted itemsets of every other party. Later on, an initiating party transmits its frequency count, plus a random value, to its neighbor, which adds its own frequency count and passes it on to the other parties. Finally, a secure comparison takes place between the final and initiating parties to determine whether the final result is greater than the threshold plus the random value.
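The counting phase of this protocol can be sketched in a much simplified form that replaces commutative encryption and the final secure comparison with a plain random mask; all names and values here are our own illustration, not the actual protocol of [14].

```python
import random

def secure_sum_exceeds(local_counts, threshold, modulus=10**6):
    """Simplified secure-sum flavour of the counting round: the initiating
    site masks its support count with a random value; each site adds its
    own count modulo a large modulus; the initiator finally unmasks the
    total and compares with the threshold.  No site ever sees another
    site's individual count, only a masked running sum."""
    rng = random.Random(42)
    mask = rng.randrange(modulus)
    running = (mask + local_counts[0]) % modulus   # initiator passes this on
    for c in local_counts[1:]:                     # each site adds its count
        running = (running + c) % modulus
    total = (running - mask) % modulus             # only the initiator unmasks
    return total >= threshold

# Three sites hold private support counts 14, 9 and 30 for some itemset.
print(secure_sum_exceeds([14, 9, 30], threshold=50))  # True  (53 >= 50)
print(secure_sum_exceeds([14, 9, 30], threshold=60))  # False (53 < 60)
```

In the real protocol, the last step is itself a secure two-party comparison, so even the global total is not revealed, only the yes/no answer.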
Another cryptography-based approach is described in [31]. This approach addresses the problem of association rule mining in vertically partitioned data. In other words, its aim is to determine the item frequency when transactions are split across different sites, without revealing the contents of individual transactions. The security of the protocol for computing the scalar product is analyzed. Though cryptography-based techniques can protect data privacy well, they may not be considered good with respect to other metrics, such as the efficiency metrics that will be discussed in later sections.
8.2.2 Result Privacy
So far, we have seen privacy metrics related to the data mining process. Many data mining tasks produce aggregate results, such as Bayesian classifiers. Although it is possible to protect sensitive data when a classifier is constructed, can this classifier be used to infer sensitive data values? In other words, do data mining results violate privacy? This issue has been analyzed, and a framework is proposed in [15] to test whether a classifier C creates an inference channel that could be adopted to infer sensitive data values.

The framework considers three types of data: public data (P), accessible to everyone including the adversary; private/sensitive data (S), which must be protected and is unknown to the adversary; and unknown data (U), not known to the adversary, whose release might nonetheless cause a privacy violation. The framework assumes that S depends only on P and U, and that the adversary has at most t data samples of the form (p_i, s_i). The approach to determining whether an inference channel exists comprises two steps. First, a classifier C1 is built on the t data samples. To evaluate the impact of C, another classifier C2 is built based
on the same t data samples plus the classifier C. If the accuracy of C2 is significantly better than that of C1, we can say that C provides an inference channel for S.

Classifier accuracy is measured based on the Bayesian classification error. Suppose we have a dataset {x₁, ..., x_n}, and we want to classify x_i into m classes labelled {1, ..., m}. Given a classifier C:

C: x_i → C(x_i) ∈ {1, ..., m},   i = 1, ..., n

the classifier accuracy for C is defined as:

∑_{j=1}^{m} Pr(C(x_i) = j | z = j) Pr(z = j)

where z is the actual class label of x_i. Since cryptography-based PPDM techniques usually produce the same results as those mined from the original dataset, analyzing privacy implications from the mining results is particularly important for this class of techniques.
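The two-step test can be sketched as follows. The data, the trivial majority-vote learner, and the deliberately leaky released classifier `C` are all illustrative assumptions, chosen so that the inference channel is obvious; they do not come from [15].

```python
import random
from collections import Counter, defaultdict

def train(samples):
    """Per-feature-tuple majority vote -- a minimal stand-in learner."""
    buckets = defaultdict(Counter)
    for features, label in samples:
        buckets[features][label] += 1
    default = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
    def classify(features):
        if features in buckets:
            return buckets[features].most_common(1)[0][0]
        return default
    return classify

def accuracy(clf, samples):
    return sum(clf(f) == lbl for f, lbl in samples) / len(samples)

rng = random.Random(7)
# Each record carries public data p and unknown data u; here the
# sensitive value s equals u, so whoever learns u learns s.
records = [(rng.randrange(2), rng.randrange(2)) for _ in range(200)]
C = lambda p, u: u  # released classifier whose output happens to track u

C1 = train([((p,), u) for p, u in records])           # t samples: public data only
C2 = train([((p, C(p, u)), u) for p, u in records])   # same samples plus C's output

a1 = accuracy(C1, [((p,), u) for p, u in records])
a2 = accuracy(C2, [((p, C(p, u)), u) for p, u in records])
print(a1 < a2)  # True: C2 beats C1, so C opens an inference channel for S
```

Because `C2` is significantly more accurate than `C1`, the framework would flag the released classifier `C` as violating privacy even though no raw sensitive value was ever published.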
8.3 Metrics for Quantifying Hiding Failure
The percentage of sensitive information that is still discovered after the data has been sanitized gives an estimate of the hiding failure parameter. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure; thus, they hide all the patterns considered sensitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Thus, some PPDM algorithms have recently been developed which allow one to choose the amount of sensitive data that should be hidden, in order to find a balance between privacy and knowledge discovery. For example, in [21], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are discovered from the sanitized database. It is measured as follows:

HF = #R_P(D′) / #R_P(D)    (8.4)

where #R_P(D) and #R_P(D′) denote the number of restrictive patterns discovered from the original database D and the sanitized database D′, respectively. Ideally, HF should be 0. In their framework, they give a specification of a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which allows one to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern.
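Equation 8.4 is straightforward to compute once the patterns have been mined. The sketch below restricts patterns to single items for brevity, and the databases are invented.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Items with support >= min_support (single-item patterns only)."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= min_support}

def hiding_failure(D, D_prime, restrictive, min_support):
    """HF = #R_P(D') / #R_P(D) (equation 8.4): the fraction of restrictive
    patterns that are STILL discoverable after sanitization."""
    before = frequent_items(D, min_support) & restrictive
    after = frequent_items(D_prime, min_support) & restrictive
    return len(after) / len(before) if before else 0.0

D = [{"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
D_prime = [{"b"}, {"c"}, set(), {"b", "c"}]    # every 'a' occurrence removed
print(hiding_failure(D, D_prime, {"a"}, 0.5))  # 0.0 -> perfect hiding
print(hiding_failure(D, D, {"a"}, 0.5))        # 1.0 -> nothing was hidden
```

In a fuller implementation the same ratio would be taken over restrictive itemsets or association rules rather than single items.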
Moreover, as pointed out in [32], it is important not to forget that intruders and data terrorists will try to compromise information by using various data mining algorithms. Therefore, a PPDM algorithm developed against a particular data mining technique that assures privacy of information may not attain similar protection against all possible data mining algorithms. In order to provide a complete evaluation of a PPDM algorithm, we need to measure its hiding failure against data mining techniques which are different from the technique that the PPDM algorithm has been designed for. The evaluation requires the consideration of a class of data mining algorithms which are significant for our test. Alternatively, a formal framework can be developed such that, upon testing a PPDM algorithm against pre-selected datasets, we can transitively prove privacy assurance for the whole class of PPDM algorithms.
8.4 Metrics for Quantifying Data Quality
The main feature of most PPDM algorithms is that they usually modify the database through the insertion of false information or through the blocking of data values in order to hide sensitive information. Such perturbation techniques decrease data quality. It is obvious that the more changes are made to the database, the less the database reflects the domain of interest. Therefore, data quality metrics are very important in the evaluation of PPDM techniques. Since data is often sold for profit, or shared with others in the hope of leading to innovation, data quality should have an acceptable level, also according to the intended data usage. If data quality is too degraded, the released database is useless for the purpose of knowledge extraction.

In existing work, several data quality metrics have been proposed that are either generic or data-use-specific. However, currently there is no metric that is widely accepted by the research community. Here we try to identify a set of possible measures that can be used to evaluate different aspects of data quality. In evaluating the data quality after the privacy preserving process, it can be useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results. The quality of the data themselves can be considered as a general measure evaluating the state of the individual items contained in the database after the enforcement of a privacy preserving technique. The quality of the data mining results evaluates the alteration in the information that is extracted from the database after the privacy preservation process, on the basis of the intended data use.
8.4.1 Quality of the Data Resulting from the PPDM Process

The main problem with data quality is that its evaluation is relative [18], in that it usually depends on the context in which data are used. In particular, there
are some aspects related to data quality evaluation that are heavily dependent not only on the PPDM algorithm, but also on the structure of the database, and on the meaning and relevance of the information stored in the database with respect to a well-defined context. In the scientific literature, data quality is generally considered a multi-dimensional concept that in certain contexts involves both objective and subjective parameters [3, 34]. Among the various possible parameters, the following ones are usually considered the most relevant:

- Accuracy: it measures the proximity of a sanitized value to the original value.
- Completeness: it evaluates the degree of missing data in the sanitized database.
- Consistency: it is related to the internal constraints, that is, the relationships that must hold among different fields of a data item or among data items in a database.

Accuracy. Accuracy is closely related to the information loss resulting from the hiding strategy: the lower the information loss, the better the data quality. This measure largely depends on the specific class of PPDM algorithms. In what follows, we discuss how different approaches measure accuracy.
As for heuristic-based techniques, we distinguish the following cases based on the modification technique that is performed for the hiding process. If the algorithm adopts a perturbation or a blocking technique to hide both raw and aggregated data, the information loss can be measured in terms of the dissimilarity between the original dataset D and the sanitized one D′. In [21], Oliveira and Zaiane propose three different methods to measure the dissimilarity between the original and sanitized databases. The first method is based on the difference between the frequency histograms of the original and the sanitized databases. The second method is based on computing the difference between the sizes of the sanitized database and the original one. The third method is based on a comparison between the contents of the two databases. A more detailed analysis of the definition of dissimilarity is presented by Bertino et al. in [6]. They suggest using the following formula in the case of transactional dataset perturbation:

Diss(D, D′) = ∑_{i=1}^{n} |f_D(i) − f_{D′}(i)| / ∑_{i=1}^{n} f_D(i)    (8.5)

where i is a data item in the original database D and f_D(i) is its frequency within the database, whereas i′ is the given data item after the application of
a privacy preservation technique and f_{D′}(i) is its new frequency within the transformed database D′. As we can see, the information loss is defined as the ratio between the sum of the absolute errors made in computing the frequencies of the items in the sanitized database and the sum of all the frequencies of items in the original database. Formula 8.5 can also be used for the PPDM algorithms which adopt a blocking technique, that is, which insert into the dataset uncertainty about some sensitive data items or their correlations. The frequency of an item i belonging to the sanitized dataset D′ is then given by the mean value between the minimum frequency of the data item i, computed by considering all the blocking values associated with it equal to zero, and the maximum frequency, obtained by considering all the question marks equal to one.
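A direct rendering of equation 8.5 over item frequency tables (the frequencies below are invented):

```python
def dissimilarity(freq_D, freq_D_prime):
    """Diss(D, D') of equation 8.5: the sum of absolute errors in the item
    frequencies of the sanitized database, normalized by the total
    frequency mass of the original database."""
    num = sum(abs(freq_D[i] - freq_D_prime.get(i, 0)) for i in freq_D)
    den = sum(freq_D.values())
    return num / den

# Item frequencies before and after sanitization (toy numbers).
f_D  = {"bread": 10, "milk": 8, "beer": 6}
f_Dp = {"bread": 10, "milk": 7, "beer": 2}   # 'beer' heavily suppressed
print(dissimilarity(f_D, f_Dp))  # (0 + 1 + 4) / 24 = 5/24, about 0.208
```

For a blocking-based algorithm, `f_Dp` would instead hold the mean of each item's minimum and maximum frequencies, as described above.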
In the case of data swapping, the information loss caused by a heuristic-based algorithm can be evaluated by a parameter measuring the data confusion introduced by the value swappings. If there is no correlation among the different database records, the data confusion can be estimated by the percentage of value replacements executed in order to hide specific information.
For the multiplicative-noise-based approaches [19], the quality of the perturbed data depends on the size of the random projection matrix. In general, the error bound of the inner product matrix produced by this perturbation technique is 0 on average, and the variance is bounded by the inverse of the dimensionality of the reduced space. In other words, when the dimensionality of the random projection matrix is close to that of the original data, the result of computing the inner product matrix based on the transformed or projected data is also close to the actual value. Since the inner product is closely related to many distance-based metrics (e.g., Euclidean distance, cosine angle of two vectors, correlation coefficient of two vectors, etc.), the analysis of the error bound has a direct impact on the mining results if the data mining tasks adopt such distance-based metrics.

If the data modification consists of aggregating some data values, the information loss is given by the loss of detail in the data. Intuitively, in this case, in order to perform the hiding operation, the PPDM algorithms use some type of "generalization or aggregation scheme" that can be ideally modeled as a tree scheme. Each cell modification applied during the sanitization phase using the generalization tree introduces a data perturbation that reduces the general accuracy of the database. As in the case of the k-anonymity algorithm presented in [28], we can use the following formula. Given a database T with NA fields and N transactions, if we identify as generalization scheme a domain generalization hierarchy GT with a depth h, it is possible to measure the information loss (IL) of a sanitized database T* as:

IL(T*) = ( ∑_{i=1}^{NA} ∑_{j=1}^{N} h / |GT_{A_i}| ) / (|T| · |NA|)    (8.6)
where h/|GT_{A_i}| represents the detail loss for each sanitized cell. For hiding techniques based on a sampling approach, the quality is obviously related to the size of the considered sample and, more generally, to its features.
There are some other precision metrics specifically designed for k-anonymization approaches. One of the earliest data quality metrics is based on the height of generalization hierarchies [25]. The height is the number of times the original data value has been generalized. This metric assumes that a generalization of the data represents an information loss on the original data value. Therefore, data should be generalized in as few steps as possible to preserve maximum utility. However, this metric does not take into account that not all generalization steps are equal in terms of information loss.
Later, Iyengar [13] proposes a general loss metric (LM). Suppose T is a data table with n attributes. The LM metric can be thought of as the average information loss over all data cells of a given dataset, defined as follows:

LM(T*) = ( ∑_{i=1}^{n} ∑_{j=1}^{|T|} (f(T*[i][j]) − 1) / (g(A_i) − 1) ) / (|T| · n)    (8.7)

In equation 8.7, T* is the anonymized table of T, f is a function that, given a data cell value T*[i][j], returns the number of distinct values that can be generalized to T*[i][j], and g is a function that, given an attribute A_i, returns the number of distinct values of A_i.
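Equation 8.7 can be rendered as follows; the toy hierarchy, the `covers` table and the generalized table are invented for illustration.

```python
def loss_metric(T_star, f, g, attrs):
    """Iyengar's LM (equation 8.7): each cell contributes (f - 1)/(g - 1),
    where f counts the leaf values the generalized cell covers and g is
    the attribute's domain size; the result is averaged over all cells."""
    total = 0.0
    for row in T_star:
        for a in attrs:
            total += (f(a, row[a]) - 1) / (g(a) - 1)
    return total / (len(T_star) * len(attrs))

# Toy hierarchy for one attribute 'age' with domain {21, 22, 23, 24}:
# '2*' covers all 4 leaf values, an exact age covers exactly 1.
covers = {"2*": 4, "21": 1, "22": 1, "23": 1, "24": 1}
f = lambda attr, value: covers[value]
g = lambda attr: 4
table = [{"age": "2*"}, {"age": "2*"}, {"age": "23"}, {"age": "24"}]
print(loss_metric(table, f, g, ["age"]))  # (1 + 1 + 0 + 0) / 4 = 0.5
```

An unmodified cell contributes 0 and a fully suppressed cell contributes 1, so LM ranges from 0 (no loss) to 1 (total loss), unlike the raw hierarchy-height metric above.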
The next metric, the classification metric (CM), is introduced by Iyengar [13] to optimize a k-anonymous dataset for training a classifier. It is defined as the sum of the individual penalties for each row in the table, normalized by the total number of rows N:

CM(T*) = ( ∑_{all rows} penalty(row r) ) / N    (8.8)

The penalty value of row r is 1, i.e., row r is penalized, if it is suppressed or if its class label is not the majority class label of its group. Otherwise, the penalty value of row r is 0. This metric is particularly useful when we want to build a classifier over anonymous data.
Another interesting metric is the discernibility metric (DM) proposed by Bayardo and Agrawal [4]. The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it. Let t be a tuple from the original table T, and let G_{T*}(t) be the set of tuples in the anonymized table T* that are indistinguishable from t, i.e., the set of tuples in T* equivalent to the anonymized value of t. Then DM is defined as follows:

DM(T*) = ∑_{t∈T} |G_{T*}(t)|    (8.9)
Notethatifatuplethasbeensuppressed,thesizeofGT∗(t)isthesameasthesizeofT∗.Inmanysituation,suppressionsareconsideredtobemostex-pensiveinthesenseofinformationloss.Thus,tomaximizedatautility,tuplesuppressionshouldbeavoidedwheneverpossible.
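A sketch of DM under the assumption that equivalence classes are simply groups of identical anonymized tuples, and that suppressed tuples are tracked separately:

```python
# Sketch of the discernibility metric DM (Eq. 8.9): each tuple is charged
# the size of its equivalence class, so a class of size s contributes s^2;
# a suppressed tuple is charged the whole table size |T*|.
from collections import Counter

def discernibility_metric(anon_rows, suppressed_count=0):
    """anon_rows: anonymized (non-suppressed) tuples as hashable values."""
    sizes = Counter(anon_rows)                          # |G_T*(t)| per class
    n = len(anon_rows) + suppressed_count               # total table size
    cost = sum(size * size for size in sizes.values())  # each of `size` tuples pays `size`
    cost += suppressed_count * n                        # suppressed tuples pay |T*|
    return cost

rows = [("479**", "M"), ("479**", "M"), ("478**", "F")]
print(discernibility_metric(rows))  # 2*2 + 1*1 = 5
```

The quadratic growth per class is why DM rewards many small equivalence classes over a few large ones, and why suppression (charged at |T*| per tuple) dominates the cost.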
For any given metric M, if M(T_a) > M(T_b), we say T_a has a higher information loss, or is less precise, than T_b. In other words, the data quality of T_a is worse than that of T_b. Is this true for all metrics? What is a good metric? It is not easy to answer these kinds of questions. As shown in [20], CM works better than LM in classification applications, while LM is better for association rule mining. It is apparent that to judge how good a particular metric is, we need to associate our judgement with specific applications (e.g., classification, mining association rules).
The CM metric and the information gain privacy loss ratio [5, 28] are more interesting measures of utility because they consider the possible application for the data. Nevertheless, it is unclear what to do if we want to build classifiers on various attributes. In addition, these two metrics only work well if the data are intended to be used for building classifiers. Is there a utility metric that works well for various applications? Having this in mind, Kifer [17] proposes a utility measure related to the Kullback-Leibler divergence. In theory, using this measure, better anonymous datasets (for different applications) can be produced. Researchers have measured the utility of the resulting anonymous datasets, and preliminary results show that this metric works well in practical applications.

For the statistical-based perturbation techniques, which aim to hide the values of a confidential attribute, the information loss is basically the lack of precision in estimating the original distribution function of the given attribute. As defined in [1], the information loss incurred when reconstructing an estimate \hat{f}_X(x) of the density function f_X(x) of the attribute X is measured by computing the following value:

I(f_X, \hat{f}_X) = \frac{1}{2} E\left[ \int_{\Omega_X} \big| f_X(x) - \hat{f}_X(x) \big| \, dx \right]    (8.10)

that is, half of the expected value of the L_1 norm between f_X(x) and \hat{f}_X(x), which are the density distributions respectively before and after the application of the privacy preserving technique.
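Equation 8.10 can be approximated numerically; the two densities below are toy assumptions chosen for illustration (in a real setting, the estimate would come from the reconstruction algorithm of [1]):

```python
# Numerical sketch of Eq. 8.10: half the L1 distance between the original
# density f_X and a reconstructed estimate, via a midpoint rule on [a, b].

def information_loss(f_orig, f_est, a, b, steps=10000):
    """0.5 * integral_a^b |f_orig(x) - f_est(x)| dx."""
    dx = (b - a) / steps
    total = sum(abs(f_orig(a + (i + 0.5) * dx) - f_est(a + (i + 0.5) * dx))
                for i in range(steps))
    return 0.5 * total * dx

uniform = lambda x: 1.0          # original density on [0, 1]
estimate = lambda x: 2.0 * x     # a poor reconstruction (also integrates to 1)
loss = information_loss(uniform, estimate, 0.0, 1.0)
print(loss)  # close to the analytic value 0.25; 0 would mean a perfect reconstruction
```

Because both densities integrate to 1, the L_1 distance lies in [0, 2], so the factor 1/2 normalizes the loss to the range [0, 1].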
When considering the cryptography-based techniques, which are typically employed in distributed environments, we can observe that they do not use any kind of perturbation technique for the purpose of privacy preservation. Instead, they use cryptographic techniques to assure data privacy at each site by limiting the information shared by all the sites. Therefore, the quality of the data stored at each site is not compromised at all.
Completeness and Consistency. While accuracy is a relatively general parameter, in that it can be measured without strong assumptions on the dataset analyzed, completeness is not so general. For example, in some PPDM strategies, e.g. blocking, the completeness evaluation is not significant. On the other hand, consistency requires determining all the relationships that are relevant for a given dataset.
In [5], Bertino et al. propose a set of evaluation parameters including completeness and consistency evaluation. Unlike other techniques, their approach takes into account two more important aspects: the relevance of data and the structure of the database. They provide a formal description that can be used to magnify the aggregate information of interest for a target database, and the relevance of data quality properties of each aggregate information and of each attribute involved in the aggregate information. Specifically, the completeness lack (denoted as CML) is measured as follows:

CML = \sum_{i=0}^{n} (DMG.N_i.CV \times DMG.N_i.CW)    (8.11)

In equation 8.11, DMG is an oriented graph where each node N_i is an attribute class, CV is the completeness value, and CW is the corresponding weight.
The consistency lack (denoted as CSL) is given by the number of constraint violations occurring in all the sanitized transactions, multiplied by the weight associated with each constraint:

CSL = \sum_{i=0}^{n} (DMG.SC_i.csv \times DMG.SC_i.cw) + \sum_{j=0}^{m} (DMG.CC_j.csv \times DMG.CC_j.cw)    (8.12)

In equation 8.12, csv indicates the number of violations, cw is the weight of the constraint, SC_i denotes a simple constraint class, and CC_j denotes a complex constraint class.
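Both CML and CSL reduce to weighted sums over (value, weight) pairs once the DMG is flattened; the flattening and the numbers below are assumptions made purely for illustration:

```python
# Sketch of the completeness/consistency lack measures (Eqs. 8.11-8.12).
# Each DMG node or constraint class is reduced here to a (value, weight)
# pair: completeness value per attribute-class node for CML, violation
# count per constraint class for CSL.

def weighted_lack(items):
    """Sum of value * weight over (value, weight) pairs."""
    return sum(v * w for v, w in items)

# CML: completeness values and weights for two attribute-class nodes N_i.
cml = weighted_lack([(0.2, 1.0), (0.5, 0.5)])
# CSL: simple-constraint (SC) violations plus complex-constraint (CC) violations.
csl = weighted_lack([(3, 0.4)]) + weighted_lack([(1, 1.0)])
print(cml, csl)  # approximately 0.45 and 2.2
```

The weights let an evaluator emphasize the attribute classes and constraints that matter most for the target application, rather than treating all violations equally.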
8.4.2 Quality of the Data Mining Results
In some situations, it can be useful, and also more relevant, to evaluate the quality of the data mining results after the sanitization process. This kind of metric is strictly related to the use the data are intended for. Data can be analyzed in order to mine information in terms of associations among single data items, or to classify existing data with the goal of finding an accurate classification of new data items, and so on. Based on the intended data use, the information loss is measured with a specific metric, depending each time on the particular type of knowledge model one aims to extract.
If the intended data usage is data clustering, the information loss can be measured by the percentage of legitimate data points that are not well-classified after the sanitization process. As in [22], a misclassification error ME is defined to measure this information loss:

ME = \frac{1}{N} \sum_{i=1}^{k} \big( |Cluster_i(D)| - |Cluster_i(D')| \big)    (8.13)

where N represents the number of points in the original dataset, k is the number of clusters under analysis, and |Cluster_i(D)| and |Cluster_i(D')| represent the number of legitimate data points of the i-th cluster in the original dataset D and the sanitized dataset D', respectively. Since a privacy preserving technique usually modifies data for the sanitization purpose, the parameters involved in the clustering analysis are almost inevitably affected. In order to achieve high clustering quality, it is very important to keep the clustering results as consistent as possible before and after the application of a data hiding technique.

When quantifying information loss in the context of other data usages, it is useful to distinguish between: lost information, representing the percentage of non-sensitive patterns (i.e., association or classification rules) which are hidden as a side-effect of the hiding process; and artifactual information, representing the percentage of artifactual patterns created by the adopted privacy preserving technique. For example, in [21], Oliveira and Zaiane define two metrics, misses cost and artifactual pattern, which correspond to lost information and artifactual information, respectively. In particular, misses cost measures the percentage of non-restrictive patterns that are hidden after the sanitization process. This happens when some non-restrictive patterns lose support in the database due to the sanitization process. The misses cost (MC) is computed as follows:
MC = \frac{\#_{\sim R_P}(D) - \#_{\sim R_P}(D')}{\#_{\sim R_P}(D)}    (8.14)

where \#_{\sim R_P}(D) and \#_{\sim R_P}(D') denote the number of non-restrictive patterns discovered from the original database D and the sanitized database D', respectively. In the best case, MC should be 0%. Notice that there is a compromise between the misses cost and the hiding failure in their approach: the more restrictive patterns they hide, the more legitimate patterns they miss. The other metric, artifactual pattern (AP), is measured in terms of the percentage of the discovered patterns that are artifacts. The formula is:
AP = \frac{|P'| - |P \cap P'|}{|P'|}    (8.15)

where |X| denotes the cardinality of X, and P and P' are the sets of patterns discovered from the original database D and the sanitized database D', respectively. According to their experiments, their approach does not produce any artifactual patterns, i.e., AP is always 0.
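The three result-quality measures above (ME, MC and AP) can be sketched together; the set-based encoding of patterns, and the reading of \#_{\sim R_P}(D') as the legitimate patterns still discoverable in D', are assumptions made for this example:

```python
# Sketches of the misclassification error ME (Eq. 8.13), misses cost MC
# (Eq. 8.14) and artifactual patterns AP (Eq. 8.15). Cluster sizes are
# plain lists; patterns are Python sets of rule identifiers.

def misclassification_error(sizes_orig, sizes_sanitized, n_points):
    """(1/N) * sum over clusters of (|Cluster_i(D)| - |Cluster_i(D')|)."""
    return sum(o - s for o, s in zip(sizes_orig, sizes_sanitized)) / n_points

def misses_cost(patterns_orig, patterns_sanitized):
    """Fraction of legitimate (non-restrictive) patterns lost in D'."""
    surviving = len(patterns_orig & patterns_sanitized)
    return (len(patterns_orig) - surviving) / len(patterns_orig)

def artifactual_patterns(patterns_orig, patterns_sanitized):
    """Fraction of patterns discovered in D' that are ghosts (not in D)."""
    ghosts = len(patterns_sanitized - patterns_orig)
    return ghosts / len(patterns_sanitized)

P  = {"a->b", "b->c", "c->d"}      # non-restrictive patterns in D
P2 = {"a->b", "d->e"}              # patterns discovered in the sanitized D'
print(misclassification_error([10, 8], [9, 7], 18))  # 2 of 18 points lost
print(misses_cost(P, P2))          # 2 of 3 legitimate patterns hidden
print(artifactual_patterns(P, P2)) # 1 ghost among 2 discovered patterns -> 0.5
```

All three measures are 0 in the ideal case, so they can be read uniformly as "fraction of mining quality sacrificed to the sanitization".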
In the case of association rules, the lost information can be modeled as the set of non-sensitive rules, referred to as lost rules, that are accidentally hidden by the privacy preservation technique; the artifactual information, instead, represents the set of new rules, also known as ghost rules, that can be extracted from the database after the application of a sanitization technique. Similarly, if the aim of the mining task is data classification, e.g. by means of decision tree induction, both the lost and artifactual information can be quantified by means of the corresponding lost and ghost association rules derived from the classification tree. These measures allow one to evaluate the high-level information that is extracted from a database, in the form of the widely-used inference rules, before and after the application of a PPDM algorithm.
It is worth noting that for most cryptography-based PPDM algorithms, the data mining results are the same as those produced from unsanitized data.
8.5 Complexity Metrics
The complexity metric measures the efficiency and scalability of a PPDM algorithm. Efficiency indicates whether the algorithm can be executed with good performance, which is generally assessed in terms of space and time. Space requirements are assessed according to the amount of memory that must be allocated in order to implement the given algorithm.
For the evaluation of time requirements, there are several approaches. The first approach is to evaluate the CPU time. For example, in [21], the authors first keep constant both the size of the database and the set of restrictive patterns, and then increase the size of the input data to measure the CPU time taken by their algorithm. An alternative approach is to evaluate the time requirements in terms of computational cost. In this case, it is obvious that an algorithm with polynomial complexity is more efficient than one with exponential complexity. Sometimes, the time requirements can even be evaluated by counting the average number of operations executed by a PPDM algorithm. In [14], for instance, the performance is measured in terms of the number of encryption and decryption operations required by the specific algorithm. The last two measures, i.e. the computational cost and the average number of operations, do not provide an absolute measure, but they can be used to perform a fast comparison among different algorithms.
In the case of distributed algorithms, especially cryptography-based algorithms (e.g. [14, 31]), the time requirements can be evaluated in terms of the communication cost incurred during the exchange of information among the sites in the secure computation. Specifically, in [14], the communication cost is expressed as the number of messages exchanged among the sites that are required by the protocol for securely counting the frequency of each rule.
Scalability is another important aspect in assessing the performance of a PPDM algorithm. In particular, scalability describes the efficiency trends as data sizes increase. This parameter concerns the growth of both performance and storage requirements, as well as of the communication costs incurred by a distributed technique, as data sizes increase.
Due to continuous advances in hardware technology, large amounts of data can now be easily stored. Databases and data warehouses today store and manage amounts of data that are increasingly large. For this reason, a PPDM algorithm has to be designed and implemented with the capability of handling huge datasets that may still keep growing. The slower the decrease in the efficiency of a PPDM algorithm for increasing data dimensions, the better its scalability. Therefore, the scalability measure is very important in determining practical PPDM techniques.
8.6 How to Select a Proper Metric
In the previous sections, we have discussed various types of metrics. An important question here is: "which one among the presented metrics is the most relevant for a given privacy preserving technique?"
Dwork and Nissim [9] make some interesting observations about this question. In particular, according to them, in the case of statistical databases privacy is paramount, whereas in the case of distributed databases for which privacy is ensured by a secure multiparty computation technique, functionality is of primary importance. Since a real database usually contains a large number of records, the performance guaranteed by a PPDM algorithm, in terms of time and communication requirements, is a non-negligible factor, as is its trend when the database size increases. The data quality guaranteed by a PPDM algorithm is, on the other hand, very important when ensuring privacy protection without damaging the usability of the data for authorized users.
From the above observations, we can see that a trade-off metric may help us to state a unique value measuring the effectiveness of a PPDM algorithm. In [7], the score of a masking method provides a measure of the trade-off between disclosure risk and information loss. It is defined as an average between the ranks of the disclosure risk and information loss measures, giving the same importance to both metrics. In [8], an R-U confidentiality map is described that traces the impact on disclosure risk R and data utility U of changes in the parameters of a disclosure limitation method which adopts an additive noise technique. We believe that an index assigning the same importance to both the data quality and the degree of privacy ensured by a PPDM algorithm is quite restrictive, because in some contexts one of these parameters can be more relevant than the other. Moreover, in our opinion the other parameters, even less relevant ones, should also be taken into account. The efficiency and scalability
measures, for instance, could be discriminating factors in choosing among a set of PPDM algorithms that ensure similar degrees of privacy and data utility. A weighted mean could thus be a good measure for evaluating, by means of a unique value, the quality of a PPDM algorithm.
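Such a weighted mean might be sketched as follows; the criterion names, the normalization of each criterion to [0, 1] (1 = best), and the weights are all assumptions chosen for illustration:

```python
# Sketch of a weighted-mean composite score for a PPDM algorithm: each
# evaluation criterion is normalized to [0, 1] and combined with weights
# chosen to reflect the target application's priorities.

def weighted_score(scores, weights):
    """Weighted mean of normalized criterion scores."""
    total_w = sum(weights.values())
    return sum(scores[c] * weights[c] for c in scores) / total_w

scores  = {"privacy": 0.9, "data_quality": 0.6, "efficiency": 0.8, "scalability": 0.7}
weights = {"privacy": 0.4, "data_quality": 0.4, "efficiency": 0.1, "scalability": 0.1}
print(weighted_score(scores, weights))  # approximately 0.75
```

Shifting weight toward privacy or data quality, as the discussion above suggests, changes which of two otherwise comparable algorithms wins, which is exactly why a fixed fifty-fifty index is too restrictive.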
8.7 Conclusion and Research Directions
In this chapter, we have surveyed different approaches used in evaluating the effectiveness of privacy preserving data mining algorithms. A set of criteria has been identified: privacy level, hiding failure, data quality and complexity. As none of the existing PPDM algorithms can outperform all the others with respect to all the criteria, we discussed the importance of certain metrics for each specific type of PPDM algorithm, and also pointed out the goals of a good metric.
There are several future research directions along the way of quantifying a PPDM algorithm and its underlying application or data mining task. One is to develop a comprehensive framework according to which various PPDM algorithms can be evaluated and compared. It is also important to design good metrics that can better reflect the properties of a PPDM algorithm, and to develop benchmark databases for testing all types of PPDM algorithms.
References
[1] Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 247–255. ACM (2001)
[2] Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439–450. ACM (2000)
[3] Ballou, D., Pazer, H.: Modelling data and process quality in multi-input, multi-output information systems. Management Science 31(2), 150–162 (1985)
[4] Bayardo, R., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proc. of the 21st Int'l Conf. on Data Engineering (2005)
[5] Bertino, E., Fovino, I.N.: Information driven evaluation of data hiding algorithms. In: 7th International Conference on Data Warehousing and Knowledge Discovery, pp. 418–427 (2005)
[6] Bertino, E., Fovino, I.N., Provenza, L.P.: A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery 11(2), 121–154 (2005)
[7] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: L. Zayatz, P. Doyle, J. Theeuwes, J. Lane (eds.) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113–134. North-Holland (2002)
[8] Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Disclosure risk vs. data utility: The R-U confidentiality map. Tech. Rep. 121, National Institute of Statistical Sciences (2001)
[9] Dwork, C., Nissim, K.: Privacy preserving data mining in vertically partitioned databases. In: CRYPTO 2004, vol. 3152, pp. 528–544 (2004)
[10] Evfimievski, A.: Randomization in privacy preserving data mining. SIGKDD Explorations Newsletter 4(2), 43–48 (2002)
[11] Evfimievski, A., Srikant, R., Agrawal, R., Gehrke, J.: Privacy preserving mining of association rules. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–228. ACM Press (2002)
[12] Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005). Tokyo, Japan (2005)
[13] Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proc. of the Eighth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 279–288 (2002)
[14] Kantarcioglu, M., Clifton, C.: Privacy preserving distributed mining of association rules on horizontally partitioned data. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24–31 (2002)
[15] Kantarcıoğlu, M., Jin, J., Clifton, C.: When do data mining results violate privacy? In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 599–604. Seattle, WA (2004)
[16] Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03). Melbourne, Florida (2003)
[17] Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 217–228. ACM Press, Chicago, IL, USA (2006)
[18] Kumar Tayi, G., Ballou, D.P.: Examining data quality. Communications of the ACM 41(2), 54–57 (1998)
[19] Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering 18(1), 92–106 (2006)
[20] Nergiz, M.E., Clifton, C.: Thoughts on k-anonymization. In: The Second International Workshop on Privacy Data Management, held in conjunction with the 22nd International Conference on Data Engineering. Atlanta, Georgia (2006)
[21] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving frequent itemset mining. In: IEEE ICDM Workshop on Privacy, Security and Data Mining, vol. 14, pp. 43–54 (2002)
[22] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases (SBBD 2003), pp. 304–318 (2003)
[23] Oliveira, S.R.M., Zaiane, O.R.: Toward standardization in privacy preserving data mining. In: ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7–17 (2004)
[24] Rizvi, S., Haritsa, J.R.: Maintaining data privacy in association rule mining. In: 28th International Conference on Very Large Databases, pp. 682–693 (2002)
[25] Samarati, P.: Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13(6), 1010–1027 (2001)
[26] Schoeman, F.D.: Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press (1984)
[27] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
[28] Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 571–588 (2002)
[29] Trottini, M.: A decision-theoretic approach to data disclosure problems. Research in Official Statistics 4, 7–22 (2001)
[30] Trottini, M.: Decision models for data disclosure limitation. Ph.D. thesis, Carnegie Mellon University (2003)
[31] Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639–644. ACM Press (2002)
[32] Verykios, V.S., Bertino, E., Nai Fovino, I., Parasiliti, L., Saygin, Y., Theodoridis, Y.: State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1), 50–57 (2004)
[33] Walters, G.J.: Human Rights in an Information Age: A Philosophical Analysis, chap. 5. University of Toronto Press (2001)
[34] Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12(4), 5–34 (1996)
[35] Willenborg, L., De Waal, T.: Elements of Statistical Disclosure Control, Lecture Notes in Statistics, vol. 155. Springer (2001)