A Survey of Quantification of Privacy Preserving Data Mining Algorithms

Elisa Bertino
Department of Computer Science, Purdue University
bertino@cs.purdue.edu

Dan Lin
Department of Computer Science, Purdue University
lindan@cs.purdue.edu

Wei Jiang
Department of Computer Science, Purdue University
wjiang@cs.purdue.edu
Abstract
The aim of privacy preserving data mining (PPDM) algorithms is to extract relevant knowledge from large amounts of data while at the same time protecting sensitive information. An important aspect in the design of such algorithms is the identification of suitable evaluation criteria and the development of related benchmarks. Recent research in the area has devoted much effort to determining a trade-off between the right to privacy and the need for knowledge discovery. It is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Therefore, it is crucial to provide a comprehensive view of a set of metrics related to existing privacy preserving algorithms, so that we can gain insights on how to design more effective measurements and PPDM algorithms. In this chapter, we review and summarize existing criteria and metrics for evaluating privacy preserving techniques.

Keywords: Privacy metric
Privacy-Preserving Data Mining: Models and Algorithms
8.1 Introduction
Privacy is one of the most important properties that an information system must satisfy. For this reason, several efforts have been devoted to incorporating privacy preserving techniques into data mining algorithms, in order to prevent the disclosure of sensitive information during the knowledge discovery process. The existing privacy preserving data mining techniques can be classified according to the following five dimensions [32]: (i) data distribution (centralized or distributed); (ii) the modification applied to the data (encryption, perturbation, generalization, and so on) in order to sanitize them; (iii) the data mining algorithm which the privacy preservation technique is designed for; (iv) the data type (single data items or complex data correlations) that needs to be protected from disclosure; (v) the approach adopted for preserving privacy (heuristic or cryptography-based approaches). While heuristic-based techniques are mainly conceived for centralized datasets, cryptography-based algorithms are designed for protecting privacy in a distributed scenario by using encryption techniques. Recently proposed heuristic-based algorithms aim at hiding sensitive raw data by applying perturbation techniques based on probability distributions. Moreover, several heuristic-based approaches for hiding both raw and aggregated data through a hiding technique (k-anonymization, adding noise, data swapping, generalization and sampling) have been developed, first in the context of association rule mining and classification and, more recently, for clustering techniques.

Given the number of different privacy preserving data mining (PPDM) techniques that have been developed in recent years, there is an emerging need to move toward standardization in this new research area, as discussed by Oliveira and Zaiane [23]. One step toward this essential process is to provide a quantification approach for PPDM algorithms, making it possible to evaluate and compare such algorithms. However, due to the variety of characteristics of PPDM algorithms, it is often the case that no privacy preserving algorithm exists that outperforms all the others on all possible criteria. Rather, an algorithm may perform better than another on specific criteria, such as privacy level or data quality. Therefore, it is important to provide users with a comprehensive set of privacy preserving related metrics which will enable them to select the most appropriate privacy preserving technique for the data at hand, with respect to the specific parameters they are interested in optimizing [6].

For a better understanding of PPDM related metrics, we next identify a proper set of criteria and the related benchmarks for evaluating PPDM algorithms. We then adopt these criteria to categorize the metrics. First, we need to be clear about the concept of "privacy" and the general goals of a PPDM algorithm. In our society the term privacy is overloaded and can, in general, assume a wide range of different meanings. For example, in the
context of the HIPAA¹ Privacy Rule, privacy means the individual's ability to control who has access to personal health care information. From the organizations' point of view, privacy involves the definition of policies stating which information is collected, how it is used, and how customers are informed and involved in this process. Moreover, there are many other definitions of privacy that are generally related to the particular environment in which privacy has to be guaranteed. What we need is a more generic definition, one that can be instantiated in different environments and situations. From a philosophical point of view, Schoeman [26] and Walters [33] identify three possible definitions of privacy:

- Privacy as the right of a person to determine which personal information about himself/herself may be communicated to others.
- Privacy as the control over access to information about oneself.
- Privacy as limited access to a person and to all the features related to the person.

In these three definitions, what is interesting from our point of view is the concept of "controlled information release". From this idea, we argue that a definition of privacy more closely related to our target could be the following: "The right of an individual to be secure from unauthorized disclosure of information about oneself that is contained in an electronic repository". Tuning the definition further, we consider privacy as "the right of an entity to be secure from unauthorized disclosure of sensitive information that is contained in an electronic repository or that can be derived as aggregate and complex information from data stored in an electronic repository". This last generalization is needed because privacy does not concern individual data items alone. As in [23], we consider two main scenarios.

The first is the case of a medical database where there is the need to provide information about diseases while preserving the patient identity. Another scenario is the classical "market basket" database, where the transactions related to different client purchases are stored, and from which it is possible to extract information in the form of association rules like "If a client buys a product X, he/she will also purchase product Z with y% probability". The first is an example where individual privacy has to be ensured by protecting from unauthorized disclosure sensitive information in the form of specific data items related to specific individuals. The second one, instead, emphasizes how not only the raw data contained in a database must be protected, but also, in some cases, the high-level information that can be derived from nonsensitive raw data needs
¹Health Insurance Portability and Accountability Act
to be protected. Such a scenario justifies the final generalization of our privacy definition. In the light of these considerations, it is now easy to define the main goals a PPDM algorithm should enforce:

1. A PPDM algorithm should prevent the discovery of sensitive information.
2. It should be resistant to the various data mining techniques.
3. It should not compromise the access and the use of nonsensitive data.
4. It should not have an exponential computational complexity.

Correspondingly, we identify the following set of criteria based on which a PPDM algorithm can be evaluated:

- Privacy level offered by a privacy preserving technique, which indicates how closely the sensitive information that has been hidden can still be estimated;
- Hiding failure, that is, the portion of sensitive information that is not hidden by the application of a privacy preservation technique;
- Data quality after the application of a privacy preserving technique, considered both as the quality of the data themselves and the quality of the data mining results after the hiding strategy is applied;
- Complexity, that is, the ability of a privacy preserving algorithm to execute with good performance in terms of all the resources consumed by the algorithm.

In the rest of the chapter, we first present details of each criterion through an analysis of existing PPDM techniques. We then discuss how to select a proper metric under a specified condition. Finally, we summarize this chapter and outline future research directions.
8.2 Metrics for Quantifying Privacy Level
Before presenting the different metrics related to privacy level, we need to take into account two aspects: (i) sensitive or private information contained in the original dataset; and (ii) private information that can be discovered from the data mining results. We refer to the first as data privacy and to the latter as result privacy.
8.2.1 Data Privacy
In general, the quantification used to measure data privacy is the degree of uncertainty according to which the original private data can be inferred. The
higher the degree of uncertainty achieved by a PPDM algorithm, the better the data privacy is protected by this PPDM algorithm. For various types of PPDM algorithms, the degree of uncertainty is estimated in different ways. According to the adopted techniques, PPDM algorithms can be classified into two main categories: heuristic-based approaches and cryptography-based approaches. Heuristic-based approaches mainly include four sub-categories: additive noise, multiplicative noise, k-anonymization, and statistical disclosure control based approaches. In what follows, we survey representative works of each category of PPDM algorithms and review the metrics used by them.

Additive-Noise-based Perturbation Techniques. The basic idea of the additive-noise-based perturbation technique is to add random noise to the actual data. In [2], Agrawal and Srikant use an additive-noise-based technique to perturb data. They then estimate the probability distribution of the original numeric data values in order to build a decision tree classifier from the perturbed training data. They introduce a quantitative measure to evaluate the amount of privacy offered by a method, and evaluate the proposed method against this measure. Privacy is measured by evaluating how closely the original values of a modified attribute can be determined. In particular, if the perturbed value of an attribute can be estimated, with a confidence c, to belong to an interval [a, b], then the privacy is estimated by (b − a) with confidence c. However, this metric does not work well because it does not take into account the distribution of the original data along with the perturbed data. Therefore, a metric that considers all the informative content of data available to the user is needed. Agrawal and Aggarwal [1] address this problem by introducing a new privacy metric based on the concept of information entropy. More specifically, they propose an Expectation Maximization (EM) based algorithm for distribution reconstruction, which converges to the maximum likelihood estimate of the original distribution on the perturbed data. Their measurement of privacy considers the fact that both the perturbed individual record and the reconstructed distribution are available to the user, as well as the perturbing distribution, as specified in [10]. This metric defines the average conditional privacy of an attribute A, given other information modeled with a random variable B, as 2^h(A|B), where h(A|B) is the conditional differential entropy of A given B, representing a measure of the uncertainty inherent in the value of A given the value of B.
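To make the interval-based measure concrete, the following sketch perturbs values with uniform additive noise and reports the (b − a) interval width. The noise width `alpha` and the sample data are our own illustrative assumptions, not values taken from [2].

```python
import random

ALPHA = 20.0  # illustrative noise width, not a value from the paper

def perturb(values, alpha=ALPHA):
    """Additive-noise perturbation: add uniform noise from [-alpha/2, +alpha/2]."""
    return [v + random.uniform(-alpha / 2, alpha / 2) for v in values]

def privacy_interval(perturbed_value, alpha=ALPHA):
    """With uniform noise of width alpha, the original value is known, with
    100% confidence, only to lie in [w - alpha/2, w + alpha/2]; the
    Agrawal-Srikant measure is then the interval width b - a."""
    a, b = perturbed_value - alpha / 2, perturbed_value + alpha / 2
    return b - a

random.seed(0)
noisy = perturb([23, 35, 47, 52, 61])
print(privacy_interval(noisy[0]))  # ~20.0: privacy of width alpha at full confidence
```

Note that, exactly as the text observes, this measure is the same for every record regardless of how the original data are distributed, which is what motivates the entropy-based refinement of [1].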
Another additive-noise-based perturbation technique is by Rizvi and Haritsa [24]. They propose a distortion method to pre-process the data before executing the mining process. Their privacy measure deals with the probability with which the user's distorted entries can be reconstructed. Their goal is to ensure privacy at the level of individual entries in each customer tuple. In other words, the authors estimate the probability that a given 1 or 0 in the true matrix
representing the transactional database can be reconstructed, even though for many applications the 1 and 0 values do not need the same level of privacy.

Evfimievski et al. [11] propose a framework for mining association rules from transactions consisting of categorical items, where the data has been randomized to preserve the privacy of individual transactions, while ensuring at the same time that only true associations are mined. They also provide a formal definition of privacy breaches, and a class of randomization operators that are much more effective in limiting breaches than uniform randomization. According to Definition 4 from [11], an itemset A results in a privacy breach of level ρ if the probability that an item in A belongs to a non-randomized transaction, given that A is included in a randomized transaction, is greater than or equal to ρ. In some scenarios, being confident that an item is not present in the original transaction may also be considered a privacy breach. In order to evaluate privacy breaches, the approach taken by Evfimievski et al. is to count the occurrences of an itemset in a randomized transaction and of its sub-itemsets in the corresponding non-randomized transaction. Out of all sub-itemsets of an itemset, the item causing the worst privacy breach is chosen. Then, for each combination of transaction size and itemset size, the worst and the average value of this breach level are computed over all frequent itemsets. The itemset size giving the worst value for each of these two values is selected.
Finally, we introduce a universal measure of data privacy level, proposed by Bertino et al. in [6]. The measure is developed based on [1]. The basic concept used by this measure is information entropy, which is defined by Shannon [27]: let X be a random variable which takes on a finite set of values according to a probability distribution p(x). Then, the entropy of this probability distribution is defined as follows:

h(X) = −∑_x p(x) log₂(p(x))    (8.1)

or, in the continuous case:

h(X) = −∫ f(x) log₂(f(x)) dx    (8.2)

where f(x) denotes the density function of the continuous random variable X. Information entropy is a measure of how much "choice" is involved in the selection of an event, or of how uncertain we are of its outcome. It can be used for quantifying the amount of information associated with a set of data. The concept of "information associated with data" can be useful in the evaluation of the privacy achieved by a PPDM algorithm. Because the entropy represents the information content of a datum, the entropy after data sanitization should be higher than the entropy before the sanitization. Moreover, the entropy can be regarded as an evaluation of the uncertainty in forecasting an event, which in our context is the right value of a datum. Consequently, the level
of privacy inherent in an attribute X, given some information modeled by Y, is defined as follows:

Π(X|Y) = 2^(−∫ f_{X,Y}(x,y) log₂ f_{X|Y=y}(x) dx dy)    (8.3)

The privacy level defined in equation 8.3 is very general. In order to use it in the different PPDM contexts, it needs to be refined in relation to some characteristics, such as the type of transactions, the type of aggregation, and the PPDM methods. In [6], an example of instantiating the entropy concept to evaluate the privacy level in the context of association rules is presented.

However, it is worth noting that the value of the privacy level depends not only on the PPDM algorithm used, but also on the knowledge that an attacker has about the data before the use of data mining techniques, and on the relevance of this knowledge in the data reconstruction operation. This problem is underlined, for example, in [29, 30]. In [6], this aspect is not considered, but it is possible to introduce assumptions on the attacker's knowledge by properly modeling Y.
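As a minimal illustration of the entropy-based measure, the sketch below instantiates the unconditional, discrete case of equation 8.3, i.e. Π(X) = 2^h(X) estimated from samples; the sample data are invented.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy h(X) = -sum p(x) log2 p(x), estimated from samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def privacy_level(samples):
    """Privacy level 2^h(X): the size of the 'effective' value set an
    attacker must guess among.  Higher means better-protected data."""
    return 2 ** entropy(samples)

uniform = [0, 1, 2, 3] * 25          # 4 equally likely attribute values
skewed  = [0] * 97 + [1, 2, 3]       # almost always 0
print(privacy_level(uniform))        # 4.0 -> like guessing among 4 values
print(privacy_level(skewed))         # close to 1 -> little real uncertainty
```

The contrast between the two datasets shows exactly why this measure improves on the interval width of [2]: both take values in {0, 1, 2, 3}, but the skewed one offers an attacker far less uncertainty.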
Multiplicative-Noise-based Perturbation Techniques. According to [16], additive random noise can be filtered out using certain signal processing techniques with very high accuracy. To avoid this problem, random projection-based multiplicative perturbation techniques have been proposed in [19]. Instead of adding some random values to the actual data, random matrices are used to project the set of original data points to a randomly chosen lower-dimensional space. However, the transformed data still preserve many statistical aggregates of the original data, so that certain data mining tasks (e.g., computing the inner product matrix, linear classification, K-means clustering and computing Euclidean distances) can be performed on the transformed data in a distributed environment (where data are either vertically or horizontally partitioned) with small errors.

In addition, this approach provides a high degree of privacy for the original data. As analyzed in the paper, even if the random matrix (i.e., the multiplicative noise) is disclosed, it is impossible to find the exact values of the original dataset; only an approximation of the original data can be found. The variance of the approximated data is used as the privacy measure.
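The projection idea can be sketched as follows. This is a toy, pure-Python rendering of random projection, not the actual construction of [19]; the dimension `k` is chosen large only so that the demo is visibly accurate.

```python
import random

def random_projection(data, k, seed=0):
    """Project d-dimensional rows onto a random k-dimensional subspace.
    With i.i.d. N(0, 1) entries, inner products are preserved in
    expectation after scaling by 1/k (Johnson-Lindenstrauss flavour)."""
    rng = random.Random(seed)
    d = len(data[0])
    R = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    return [[sum(R[i][j] * row[j] for j in range(d)) for i in range(k)]
            for row in data]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 0.0, 1.0, 1.0]
k = 2000
px, py = random_projection([x, y], k)
print(inner(x, y))            # 9.0 (true value)
print(inner(px, py) / k)      # approximately 9.0, from perturbed data alone
```

A party holding only `px` and `py` can thus approximate inner products, and hence Euclidean distances, without ever seeing `x` or `y` exactly.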
Oliveira and Zaiane [22] also adopt a multiplicative-noise-based perturbation technique to perform clustering analysis while ensuring privacy preservation at the same time. They introduce a family of geometric data transformation methods in which a noise vector is applied to distort confidential numerical attributes. The privacy ensured by such techniques is measured as the variance difference between the actual and the perturbed values. This measure is given by Var(X − Y), where X represents a single original attribute and Y
the distorted attribute. This measure can be made scale invariant with respect to the variance of X by expressing security as Sec = Var(X − Y)/Var(X).

k-Anonymization Techniques. The concept of k-anonymization is introduced by Samarati and Sweeney in [25, 28]. A database is k-anonymous with respect to quasi-identifier attributes (a set of attributes that can be used with certain external information to identify a specific individual) if there exist at least k transactions in the database having the same values according to the quasi-identifier attributes. In practice, in order to protect a sensitive dataset T, before releasing T to the public, T is converted into a new dataset T* that guarantees the k-anonymity property for a sensitive attribute, by performing some value generalizations on the quasi-identifier attributes. Therefore, the degree of uncertainty of the sensitive attribute is at least 1/k.
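A minimal check of the k-anonymity property might look as follows; the records and the generalized values (`479**`, `2*`, etc.) are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(table, quasi_ids, k):
    """A table is k-anonymous w.r.t. the quasi-identifier attributes if
    every combination of quasi-identifier values occurs in >= k rows."""
    groups = Counter(tuple(row[a] for a in quasi_ids) for row in table)
    return all(count >= k for count in groups.values())

# Toy records: ZIP code and age are quasi-identifiers, disease is sensitive.
rows = [
    {"zip": "479**", "age": "2*", "disease": "flu"},
    {"zip": "479**", "age": "2*", "disease": "cold"},
    {"zip": "478**", "age": "3*", "disease": "flu"},
    {"zip": "478**", "age": "3*", "disease": "ulcer"},
]
print(is_k_anonymous(rows, ["zip", "age"], 2))  # True
print(is_k_anonymous(rows, ["zip", "age"], 3))  # False
```

With 2-anonymity holding, an attacker who knows someone's ZIP code and age can narrow that person down to a group of at least 2 rows, matching the 1/k uncertainty bound above.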
Statistical-Disclosure-Control-based Techniques. In the context of statistical disclosure control, a large number of methods have been developed to preserve individual privacy when releasing aggregated statistics on data. To anonymize the released statistics against data items such as person, household and business, which can be used to identify an individual, not only the features described by the statistics but also related publicly available information need to be considered [35]. In [7], a description of the most relevant perturbation methods proposed so far is presented. Among the methods specifically designed for continuous data, the following masking techniques are described: additive noise, data distortion by probability distribution, resampling, microaggregation, rank swapping, etc. For categorical data, both perturbative and non-perturbative methods are presented. The top-coding and bottom-coding techniques are both applied to ordinal categorical variables; they recode, respectively, the first/last p values of a variable into a new category. The global-recoding technique, instead, recodes the p lowest frequency categories into a single one.

The privacy level of such methods is assessed by using the disclosure risk, that is, the risk that a piece of information be linked to a specific individual. There are several approaches to measuring the disclosure risk. One approach is based on the computation of distance-based record linkage. An intruder is assumed to try to link the masked dataset with the external dataset using the key variables. The distance between records in the original and the masked datasets is computed. A record in the masked dataset is labelled as "linked" or "linked to 2nd nearest" if the nearest or 2nd nearest record in the original dataset turns out to be the corresponding original record. The disclosure risk is then computed as the percentage of "linked" and "linked to 2nd nearest" records. The second approach is based on the computation of probabilistic record linkage. The linear sum assignment model is used to 'pair' records in the original
file and the masked file. The percentage of correctly paired records is a measure of disclosure risk. Another approach computes rank intervals for the records in the masked dataset. The proportion of original values that fall into the interval centered around their corresponding masked value is a measure of disclosure risk.
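The first, distance-based record linkage measure can be sketched as follows; this version counts only exact nearest-neighbour links (not "linked to 2nd nearest"), and the numeric records are invented.

```python
def linkage_risk(original, masked):
    """Distance-based record linkage: for each masked record, find the
    nearest original record; count it 'linked' if that nearest record is
    the true corresponding one.  The percentage linked estimates the
    disclosure risk of the masking method."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    linked = 0
    for i, m in enumerate(masked):
        nearest = min(range(len(original)), key=lambda j: dist(m, original[j]))
        if nearest == i:
            linked += 1
    return 100.0 * linked / len(masked)

orig = [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0), (90.0, 9.0)]
light = [(11.0, 1.1), (19.0, 2.2), (31.0, 2.9), (88.0, 9.3)]  # light masking
heavy = [(50.0, 5.0)] * 4                                      # heavy masking
print(linkage_risk(orig, light))  # 100.0 -> every record re-identified
print(linkage_risk(orig, heavy))  # 25.0  -> only one accidental match
```

The two masked datasets illustrate the trade-off this chapter keeps returning to: the lightly masked data carry high disclosure risk but high quality, the heavily masked data the reverse.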
Cryptography-based Techniques. Cryptography-based techniques usually guarantee a very high level of data privacy. In [14], Kantarcioglu and Clifton address the problem of secure mining of association rules over horizontally partitioned data, using cryptographic techniques to minimize the information shared. Their solution is based on the assumption that each party first encrypts its own itemsets using commutative encryption, and then the already encrypted itemsets of every other party. Later on, an initiating party transmits its frequency count, plus a random value, to its neighbor, which adds its own frequency count and passes it on to the other parties. Finally, a secure comparison takes place between the final and initiating parties to determine whether the final result is greater than the threshold plus the random value.
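The counting phase of this protocol can be sketched in a much simplified form that replaces commutative encryption and the final secure comparison with a plain random mask; all names and values here are our own illustration, not the actual protocol of [14].

```python
import random

def secure_sum_exceeds(local_counts, threshold, modulus=10**6):
    """Simplified secure-sum flavour of the counting round: the initiating
    site masks its support count with a random value; each site adds its
    own count modulo a large modulus; the initiator finally unmasks the
    total and compares with the threshold.  No site ever sees another
    site's individual count, only a masked running sum."""
    rng = random.Random(42)
    mask = rng.randrange(modulus)
    running = (mask + local_counts[0]) % modulus   # initiator passes this on
    for c in local_counts[1:]:                     # each site adds its count
        running = (running + c) % modulus
    total = (running - mask) % modulus             # only the initiator unmasks
    return total >= threshold

# Three sites hold private support counts 14, 9 and 30 for some itemset.
print(secure_sum_exceeds([14, 9, 30], threshold=50))  # True  (53 >= 50)
print(secure_sum_exceeds([14, 9, 30], threshold=60))  # False (53 < 60)
```

In the real protocol, the last step is itself a secure two-party comparison, so even the global total is not revealed, only the yes/no answer.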
Another cryptography-based approach is described in [31]. This approach addresses the problem of association rule mining in vertically partitioned data. In other words, its aim is to determine the item frequency when transactions are split across different sites, without revealing the contents of individual transactions. The security of the protocol for computing the scalar product is analyzed. Though cryptography-based techniques can protect data privacy well, they may not be considered good with respect to other metrics, such as the efficiency metrics that will be discussed in later sections.
8.2.2 Result Privacy
So far, we have seen privacy metrics related to the data mining process. Many data mining tasks produce aggregate results, such as Bayesian classifiers. Although it is possible to protect sensitive data when a classifier is constructed, can this classifier be used to infer sensitive data values? In other words, do data mining results violate privacy? This issue has been analyzed, and a framework is proposed in [15] to test whether a classifier C creates an inference channel that could be adopted to infer sensitive data values.

The framework considers three types of data: public data (P), accessible to everyone including the adversary; private/sensitive data (S), which must be protected and is unknown to the adversary; and unknown data (U), not known to the adversary, whose release might nonetheless cause a privacy violation. The framework assumes that S depends only on P and U, and that the adversary has at most t data samples of the form (p_i, s_i). The approach to determining whether an inference channel exists comprises two steps. First, a classifier C1 is built on the t data samples. To evaluate the impact of C, another classifier C2 is built based
on the same t data samples plus the classifier C. If the accuracy of C2 is significantly better than that of C1, we can say that C provides an inference channel for S.

Classifier accuracy is measured based on the Bayesian classification error. Suppose we have a dataset {x₁, ..., x_n}, and we want to classify x_i into m classes labelled {1, ..., m}. Given a classifier C:

C: x_i → C(x_i) ∈ {1, ..., m},   i = 1, ..., n

the classifier accuracy for C is defined as:

∑_{j=1}^{m} Pr(C(x_i) = j | z = j) Pr(z = j)

where z is the actual class label of x_i. Since cryptography-based PPDM techniques usually produce the same results as those mined from the original dataset, analyzing privacy implications from the mining results is particularly important for this class of techniques.
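The two-step test can be sketched as follows. The data, the trivial majority-vote learner, and the deliberately leaky released classifier `C` are all illustrative assumptions, chosen so that the inference channel is obvious; they do not come from [15].

```python
import random
from collections import Counter, defaultdict

def train(samples):
    """Per-feature-tuple majority vote -- a minimal stand-in learner."""
    buckets = defaultdict(Counter)
    for features, label in samples:
        buckets[features][label] += 1
    default = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
    def classify(features):
        if features in buckets:
            return buckets[features].most_common(1)[0][0]
        return default
    return classify

def accuracy(clf, samples):
    return sum(clf(f) == lbl for f, lbl in samples) / len(samples)

rng = random.Random(7)
# Each record carries public data p and unknown data u; here the
# sensitive value s equals u, so whoever learns u learns s.
records = [(rng.randrange(2), rng.randrange(2)) for _ in range(200)]
C = lambda p, u: u  # released classifier whose output happens to track u

C1 = train([((p,), u) for p, u in records])           # t samples: public data only
C2 = train([((p, C(p, u)), u) for p, u in records])   # same samples plus C's output

a1 = accuracy(C1, [((p,), u) for p, u in records])
a2 = accuracy(C2, [((p, C(p, u)), u) for p, u in records])
print(a1 < a2)  # True: C2 beats C1, so C opens an inference channel for S
```

Because `C2` is significantly more accurate than `C1`, the framework would flag the released classifier `C` as violating privacy even though no raw sensitive value was ever published.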
8.3 Metrics for Quantifying Hiding Failure
The percentage of sensitive information that is still discovered after the data has been sanitized gives an estimate of the hiding failure parameter. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure; thus, they hide all the patterns considered sensitive. However, it is well known that the more sensitive information we hide, the more non-sensitive information we miss. Thus, some PPDM algorithms have recently been developed which allow one to choose the amount of sensitive data that should be hidden, in order to find a balance between privacy and knowledge discovery. For example, in [21], Oliveira and Zaiane define the hiding failure (HF) as the percentage of restrictive patterns that are discovered from the sanitized database. It is measured as follows:

HF = #R_P(D′) / #R_P(D)    (8.4)

where #R_P(D) and #R_P(D′) denote the number of restrictive patterns discovered from the original database D and the sanitized database D′, respectively. Ideally, HF should be 0. In their framework, they give a specification of a disclosure threshold φ, representing the percentage of sensitive transactions that are not sanitized, which allows one to find a balance between the hiding failure and the number of misses. Note that φ does not control the hiding failure directly, but indirectly, by controlling the proportion of sensitive transactions to be sanitized for each restrictive pattern.
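Equation 8.4 is straightforward to compute once the patterns have been mined. The sketch below restricts patterns to single items for brevity, and the databases are invented.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Items with support >= min_support (single-item patterns only)."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= min_support}

def hiding_failure(D, D_prime, restrictive, min_support):
    """HF = #R_P(D') / #R_P(D) (equation 8.4): the fraction of restrictive
    patterns that are STILL discoverable after sanitization."""
    before = frequent_items(D, min_support) & restrictive
    after = frequent_items(D_prime, min_support) & restrictive
    return len(after) / len(before) if before else 0.0

D = [{"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
D_prime = [{"b"}, {"c"}, set(), {"b", "c"}]    # every 'a' occurrence removed
print(hiding_failure(D, D_prime, {"a"}, 0.5))  # 0.0 -> perfect hiding
print(hiding_failure(D, D, {"a"}, 0.5))        # 1.0 -> nothing was hidden
```

In a fuller implementation the same ratio would be taken over restrictive itemsets or association rules rather than single items.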
Moreover, as pointed out in [32], it is important not to forget that intruders and data terrorists will try to compromise information by using various data mining algorithms. Therefore, a PPDM algorithm developed against a particular data mining technique that assures privacy of information may not attain similar protection against all possible data mining algorithms. In order to provide a complete evaluation of a PPDM algorithm, we need to measure its hiding failure against data mining techniques which are different from the technique that the PPDM algorithm has been designed for. The evaluation requires the consideration of a class of data mining algorithms which are significant for our test. Alternatively, a formal framework can be developed such that, upon testing a PPDM algorithm against pre-selected datasets, we can transitively prove privacy assurance for the whole class of PPDM algorithms.
8.4 Metrics for Quantifying Data Quality
The main feature of most PPDM algorithms is that they usually modify the database through the insertion of false information or through the blocking of data values in order to hide sensitive information. Such perturbation techniques decrease data quality. It is obvious that the more changes are made to the database, the less the database reflects the domain of interest. Therefore, data quality metrics are very important in the evaluation of PPDM techniques. Since data is often sold for profit, or shared with others in the hope of leading to innovation, data quality should have an acceptable level, also according to the intended data usage. If data quality is too degraded, the released database is useless for the purpose of knowledge extraction.

In existing work, several data quality metrics have been proposed that are either generic or data-use-specific. However, currently there is no metric that is widely accepted by the research community. Here we try to identify a set of possible measures that can be used to evaluate different aspects of data quality. In evaluating the data quality after the privacy preserving process, it can be useful to assess both the quality of the data resulting from the PPDM process and the quality of the data mining results. The quality of the data themselves can be considered as a general measure evaluating the state of the individual items contained in the database after the enforcement of a privacy preserving technique. The quality of the data mining results evaluates the alteration in the information that is extracted from the database after the privacy preservation process, on the basis of the intended data use.
8.4.1 Quality of the Data Resulting from the PPDM Process

The main problem with data quality is that its evaluation is relative [18], in that it usually depends on the context in which data are used. In particular, there
are some aspects related to data quality evaluation that are heavily dependent not only on the PPDM algorithm, but also on the structure of the database, and on the meaning and relevance of the information stored in the database with respect to a well-defined context. In the scientific literature, data quality is generally considered a multi-dimensional concept that in certain contexts involves both objective and subjective parameters [3, 34]. Among the various possible parameters, the following ones are usually considered the most relevant:

- Accuracy: it measures the proximity of a sanitized value to the original value.
- Completeness: it evaluates the degree of missing data in the sanitized database.
- Consistency: it is related to the internal constraints, that is, the relationships that must hold among different fields of a data item or among data items in a database.

Accuracy. Accuracy is closely related to the information loss resulting from the hiding strategy: the lower the information loss, the better the data quality. This measure largely depends on the specific class of PPDM algorithms. In what follows, we discuss how different approaches measure accuracy.
As for heuristic-based techniques, we distinguish the following cases based on the modification technique that is performed for the hiding process. If the algorithm adopts a perturbation or a blocking technique to hide both raw and aggregated data, the information loss can be measured in terms of the dissimilarity between the original dataset D and the sanitized one D′. In [21], Oliveira and Zaiane propose three different methods to measure the dissimilarity between the original and sanitized databases. The first method is based on the difference between the frequency histograms of the original and the sanitized databases. The second method is based on computing the difference between the sizes of the sanitized database and the original one. The third method is based on a comparison between the contents of the two databases. A more detailed analysis of the definition of dissimilarity is presented by Bertino et al. in [6]. They suggest using the following formula in the case of transactional dataset perturbation:

Diss(D, D′) = ∑_{i=1}^{n} |f_D(i) − f_{D′}(i)| / ∑_{i=1}^{n} f_D(i)    (8.5)

where i is a data item in the original database D and f_D(i) is its frequency within the database, whereas i′ is the given data item after the application of
a privacy preservation technique and f_{D′}(i) is its new frequency within the transformed database D′. As we can see, the information loss is defined as the ratio between the sum of the absolute errors made in computing the frequencies of the items in the sanitized database and the sum of all the frequencies of items in the original database. Formula 8.5 can also be used for the PPDM algorithms which adopt a blocking technique, that is, which insert into the dataset uncertainty about some sensitive data items or their correlations. The frequency of an item i belonging to the sanitized dataset D′ is then given by the mean value between the minimum frequency of the data item i, computed by considering all the blocking values associated with it equal to zero, and the maximum frequency, obtained by considering all the question marks equal to one.
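A direct rendering of equation 8.5 over item frequency tables (the frequencies below are invented):

```python
def dissimilarity(freq_D, freq_D_prime):
    """Diss(D, D') of equation 8.5: the sum of absolute errors in the item
    frequencies of the sanitized database, normalized by the total
    frequency mass of the original database."""
    num = sum(abs(freq_D[i] - freq_D_prime.get(i, 0)) for i in freq_D)
    den = sum(freq_D.values())
    return num / den

# Item frequencies before and after sanitization (toy numbers).
f_D  = {"bread": 10, "milk": 8, "beer": 6}
f_Dp = {"bread": 10, "milk": 7, "beer": 2}   # 'beer' heavily suppressed
print(dissimilarity(f_D, f_Dp))  # (0 + 1 + 4) / 24 = 5/24, about 0.208
```

For a blocking-based algorithm, `f_Dp` would instead hold the mean of each item's minimum and maximum frequencies, as described above.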
In the case of data swapping, the information loss caused by a heuristic-based algorithm can be evaluated by a parameter measuring the data confusion introduced by the value swappings. If there is no correlation among the different database records, the data confusion can be estimated by the percentage of value replacements executed in order to hide specific information.
For the multiplicative-noise-based approaches [19], the quality of the perturbed data depends on the size of the random projection matrix. In general, the error bound of the inner product matrix produced by this perturbation technique is 0 on average, and the variance is bounded by the inverse of the dimensionality of the reduced space. In other words, when the dimensionality of the random projection matrix is close to that of the original data, the result of computing the inner product matrix based on the transformed or projected data is also close to the actual value. Since the inner product is closely related to many distance-based metrics (e.g., Euclidean distance, cosine angle of two vectors, correlation coefficient of two vectors, etc.), the analysis of the error bound has a direct impact on the mining results if the data mining tasks adopt such distance-based metrics.

If the data modification consists of aggregating some data values, the information loss is given by the loss of detail in the data. Intuitively, in this case, in order to perform the hiding operation, the PPDM algorithms use some type of "generalization or aggregation scheme" that can be ideally modeled as a tree scheme. Each cell modification applied during the sanitization phase using the generalization tree introduces a data perturbation that reduces the general accuracy of the database. As in the case of the k-anonymity algorithm presented in [28], we can use the following formula. Given a database T with NA fields and N transactions, if we identify as generalization scheme a domain generalization hierarchy GT with a depth h, it is possible to measure the information loss (IL) of a sanitized database T* as:

IL(T*) = ( ∑_{i=1}^{NA} ∑_{j=1}^{N} h / |GT_{A_i}| ) / (|T| · |NA|)    (8.6)
where h/|GT_{A_i}| represents the detail loss for each sanitized cell. For hiding techniques based on a sampling approach, the quality is obviously related to the size of the considered sample and, more generally, to its features.
There are some other precision metrics specifically designed for k-anonymization approaches. One of the earliest data quality metrics is based on the height of generalization hierarchies [25]. The height is the number of times the original data value has been generalized. This metric assumes that a generalization of the data represents an information loss on the original data value. Therefore, data should be generalized in as few steps as possible to preserve maximum utility. However, this metric does not take into account that not all generalization steps are equal in terms of information loss.
Later, Iyengar [13] proposes a general loss metric (LM). Suppose T is a data table with n attributes. The LM metric can be thought of as the average information loss over all data cells of a given dataset, defined as follows:

LM(T*) = ( ∑_{i=1}^{n} ∑_{j=1}^{|T|} (f(T*[i][j]) − 1) / (g(A_i) − 1) ) / (|T| · n)    (8.7)

In equation 8.7, T* is the anonymized table of T, f is a function that, given a data cell value T*[i][j], returns the number of distinct values that can be generalized to T*[i][j], and g is a function that, given an attribute A_i, returns the number of distinct values of A_i.
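Equation 8.7 can be rendered as follows; the toy hierarchy, the `covers` table and the generalized table are invented for illustration.

```python
def loss_metric(T_star, f, g, attrs):
    """Iyengar's LM (equation 8.7): each cell contributes (f - 1)/(g - 1),
    where f counts the leaf values the generalized cell covers and g is
    the attribute's domain size; the result is averaged over all cells."""
    total = 0.0
    for row in T_star:
        for a in attrs:
            total += (f(a, row[a]) - 1) / (g(a) - 1)
    return total / (len(T_star) * len(attrs))

# Toy hierarchy for one attribute 'age' with domain {21, 22, 23, 24}:
# '2*' covers all 4 leaf values, an exact age covers exactly 1.
covers = {"2*": 4, "21": 1, "22": 1, "23": 1, "24": 1}
f = lambda attr, value: covers[value]
g = lambda attr: 4
table = [{"age": "2*"}, {"age": "2*"}, {"age": "23"}, {"age": "24"}]
print(loss_metric(table, f, g, ["age"]))  # (1 + 1 + 0 + 0) / 4 = 0.5
```

An unmodified cell contributes 0 and a fully suppressed cell contributes 1, so LM ranges from 0 (no loss) to 1 (total loss), unlike the raw hierarchy-height metric above.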
The next metric, the classification metric (CM), is introduced by Iyengar [13] to optimize a k-anonymous dataset for training a classifier. It is defined as the sum of the individual penalties for each row in the table, normalized by the total number of rows N:

CM(T*) = ( ∑_{all rows} penalty(row r) ) / N    (8.8)

The penalty value of row r is 1, i.e., row r is penalized, if it is suppressed or if its class label is not the majority class label of its group. Otherwise, the penalty value of row r is 0. This metric is particularly useful when we want to build a classifier over anonymous data.
Another interesting metric is the discernibility metric (DM) proposed by Bayardo and Agrawal [4]. The discernibility metric assigns a penalty to each tuple based on how many tuples in the transformed dataset are indistinguishable from it. Let t be a tuple from the original table T, and let G_{T*}(t) be the set of tuples in the anonymized table T* that are indistinguishable from t, i.e., the set of tuples in T* equivalent to the anonymized value of t. Then DM is defined as follows:

DM(T*) = ∑_{t∈T} |G_{T*}(t)|    (8.9)
Notethatifatuplethasbeensuppressed,thesizeofGT∗(t)isthesameasthesizeofT∗.Inmanysituation,suppressionsareconsideredtobemostex-pensiveinthesenseofinformationloss.Thus,tomaximizedatautility,tuplesuppressionshouldbeavoidedwheneverpossible.
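A sketch of DM under the assumption that equivalence classes are simply groups of identical anonymized tuples, and that suppressed tuples are tracked separately:

```python
# Sketch of the discernibility metric DM (Eq. 8.9): each tuple is charged
# the size of its equivalence class, so a class of size s contributes s^2;
# a suppressed tuple is charged the whole table size |T*|.
from collections import Counter

def discernibility_metric(anon_rows, suppressed_count=0):
    """anon_rows: anonymized (non-suppressed) tuples as hashable values."""
    sizes = Counter(anon_rows)                          # |G_T*(t)| per class
    n = len(anon_rows) + suppressed_count               # total table size
    cost = sum(size * size for size in sizes.values())  # each of `size` tuples pays `size`
    cost += suppressed_count * n                        # suppressed tuples pay |T*|
    return cost

rows = [("479**", "M"), ("479**", "M"), ("478**", "F")]
print(discernibility_metric(rows))  # 2*2 + 1*1 = 5
```

The quadratic growth per class is why DM rewards many small equivalence classes over a few large ones, and why suppression (charged at |T*| per tuple) dominates the cost.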
For any given metric M, if M(T_a) > M(T_b), we say T_a has a higher information loss, or is less precise, than T_b. In other words, the data quality of T_a is worse than that of T_b. Is this true for all metrics? What is a good metric? It is not easy to answer these kinds of questions. As shown in [20], CM works better than LM in classification applications, while LM is better for association rule mining. It is apparent that to judge how good a particular metric is, we need to associate our judgement with specific applications (e.g., classification, mining association rules).
The CM metric and the information gain privacy loss ratio [5, 28] are more interesting measures of utility because they consider the possible application for the data. Nevertheless, it is unclear what to do if we want to build classifiers on various attributes. In addition, these two metrics only work well if the data are intended to be used for building classifiers. Is there a utility metric that works well for various applications? Having this in mind, Kifer [17] proposes a utility measure related to the Kullback-Leibler divergence. In theory, using this measure, better anonymous datasets (for different applications) can be produced. Researchers have measured the utility of the resulting anonymous datasets, and preliminary results show that this metric works well in practical applications.

For the statistical-based perturbation techniques, which aim to hide the values of a confidential attribute, the information loss is basically the lack of precision in estimating the original distribution function of the given attribute. As defined in [1], the information loss incurred when reconstructing an estimate \hat{f}_X(x) of the density function f_X(x) of the attribute X is measured by computing the following value:

I(f_X, \hat{f}_X) = \frac{1}{2} E\left[ \int_{\Omega_X} \big| f_X(x) - \hat{f}_X(x) \big| \, dx \right]    (8.10)

that is, half of the expected value of the L_1 norm between f_X(x) and \hat{f}_X(x), which are the density distributions respectively before and after the application of the privacy preserving technique.
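Equation 8.10 can be approximated numerically; the two densities below are toy assumptions chosen for illustration (in a real setting, the estimate would come from the reconstruction algorithm of [1]):

```python
# Numerical sketch of Eq. 8.10: half the L1 distance between the original
# density f_X and a reconstructed estimate, via a midpoint rule on [a, b].

def information_loss(f_orig, f_est, a, b, steps=10000):
    """0.5 * integral_a^b |f_orig(x) - f_est(x)| dx."""
    dx = (b - a) / steps
    total = sum(abs(f_orig(a + (i + 0.5) * dx) - f_est(a + (i + 0.5) * dx))
                for i in range(steps))
    return 0.5 * total * dx

uniform = lambda x: 1.0          # original density on [0, 1]
estimate = lambda x: 2.0 * x     # a poor reconstruction (also integrates to 1)
loss = information_loss(uniform, estimate, 0.0, 1.0)
print(loss)  # close to the analytic value 0.25; 0 would mean a perfect reconstruction
```

Because both densities integrate to 1, the L_1 distance lies in [0, 2], so the factor 1/2 normalizes the loss to the range [0, 1].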
When considering the cryptography-based techniques, which are typically employed in distributed environments, we can observe that they do not use any kind of perturbation technique for the purpose of privacy preservation. Instead, they use cryptographic techniques to assure data privacy at each site by limiting the information shared by all the sites. Therefore, the quality of the data stored at each site is not compromised at all.
Completeness and Consistency. While accuracy is a relatively general parameter, in that it can be measured without strong assumptions on the dataset analyzed, completeness is not so general. For example, in some PPDM strategies, e.g. blocking, the completeness evaluation is not significant. On the other hand, consistency requires determining all the relationships that are relevant for a given dataset.
In [5], Bertino et al. propose a set of evaluation parameters including completeness and consistency evaluation. Unlike other techniques, their approach takes into account two more important aspects: the relevance of data and the structure of the database. They provide a formal description that can be used to magnify the aggregate information of interest for a target database, and the relevance of data quality properties of each aggregate information and of each attribute involved in the aggregate information. Specifically, the completeness lack (denoted as CML) is measured as follows:

CML = \sum_{i=0}^{n} (DMG.N_i.CV \times DMG.N_i.CW)    (8.11)

In equation 8.11, DMG is an oriented graph where each node N_i is an attribute class, CV is the completeness value, and CW is the corresponding weight.
The consistency lack (denoted as CSL) is given by the number of constraint violations occurring in all the sanitized transactions, multiplied by the weight associated with each constraint:

CSL = \sum_{i=0}^{n} (DMG.SC_i.csv \times DMG.SC_i.cw) + \sum_{j=0}^{m} (DMG.CC_j.csv \times DMG.CC_j.cw)    (8.12)

In equation 8.12, csv indicates the number of violations, cw is the weight of the constraint, SC_i denotes a simple constraint class, and CC_j denotes a complex constraint class.
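Both CML and CSL reduce to weighted sums over (value, weight) pairs once the DMG is flattened; the flattening and the numbers below are assumptions made purely for illustration:

```python
# Sketch of the completeness/consistency lack measures (Eqs. 8.11-8.12).
# Each DMG node or constraint class is reduced here to a (value, weight)
# pair: completeness value per attribute-class node for CML, violation
# count per constraint class for CSL.

def weighted_lack(items):
    """Sum of value * weight over (value, weight) pairs."""
    return sum(v * w for v, w in items)

# CML: completeness values and weights for two attribute-class nodes N_i.
cml = weighted_lack([(0.2, 1.0), (0.5, 0.5)])
# CSL: simple-constraint (SC) violations plus complex-constraint (CC) violations.
csl = weighted_lack([(3, 0.4)]) + weighted_lack([(1, 1.0)])
print(cml, csl)  # approximately 0.45 and 2.2
```

The weights let an evaluator emphasize the attribute classes and constraints that matter most for the target application, rather than treating all violations equally.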
8.4.2 Quality of the Data Mining Results
In some situations, it can be useful, and also more relevant, to evaluate the quality of the data mining results after the sanitization process. This kind of metric is strictly related to the use the data are intended for. Data can be analyzed in order to mine information in terms of associations among single data items, or to classify existing data with the goal of finding an accurate classification of new data items, and so on. Based on the intended data use, the information loss is measured with a specific metric, depending each time on the particular type of knowledge model one aims to extract.
If the intended data usage is data clustering, the information loss can be measured by the percentage of legitimate data points that are not well-classified after the sanitization process. As in [22], a misclassification error ME is defined to measure this information loss:

ME = \frac{1}{N} \sum_{i=1}^{k} \big( |Cluster_i(D)| - |Cluster_i(D')| \big)    (8.13)

where N represents the number of points in the original dataset, k is the number of clusters under analysis, and |Cluster_i(D)| and |Cluster_i(D')| represent the number of legitimate data points of the i-th cluster in the original dataset D and the sanitized dataset D', respectively. Since a privacy preserving technique usually modifies data for the sanitization purpose, the parameters involved in the clustering analysis are almost inevitably affected. In order to achieve high clustering quality, it is very important to keep the clustering results as consistent as possible before and after the application of a data hiding technique.

When quantifying information loss in the context of other data usages, it is useful to distinguish between: lost information, representing the percentage of non-sensitive patterns (i.e., association or classification rules) which are hidden as a side-effect of the hiding process; and artifactual information, representing the percentage of artifactual patterns created by the adopted privacy preserving technique. For example, in [21], Oliveira and Zaiane define two metrics, misses cost and artifactual pattern, which correspond to lost information and artifactual information, respectively. In particular, misses cost measures the percentage of non-restrictive patterns that are hidden after the sanitization process. This happens when some non-restrictive patterns lose support in the database due to the sanitization process. The misses cost (MC) is computed as follows:
MC = \frac{\#_{\sim R_P}(D) - \#_{\sim R_P}(D')}{\#_{\sim R_P}(D)}    (8.14)

where \#_{\sim R_P}(D) and \#_{\sim R_P}(D') denote the number of non-restrictive patterns discovered from the original database D and the sanitized database D', respectively. In the best case, MC should be 0%. Notice that there is a compromise between the misses cost and the hiding failure in their approach: the more restrictive patterns they hide, the more legitimate patterns they miss. The other metric, artifactual pattern (AP), is measured in terms of the percentage of the discovered patterns that are artifacts. The formula is:
AP = \frac{|P'| - |P \cap P'|}{|P'|}    (8.15)

where |X| denotes the cardinality of X, and P and P' are the sets of patterns discovered from the original database D and the sanitized database D', respectively. According to their experiments, their approach does not produce any artifactual patterns, i.e., AP is always 0.
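The three result-quality measures above (ME, MC and AP) can be sketched together; the set-based encoding of patterns, and the reading of \#_{\sim R_P}(D') as the legitimate patterns still discoverable in D', are assumptions made for this example:

```python
# Sketches of the misclassification error ME (Eq. 8.13), misses cost MC
# (Eq. 8.14) and artifactual patterns AP (Eq. 8.15). Cluster sizes are
# plain lists; patterns are Python sets of rule identifiers.

def misclassification_error(sizes_orig, sizes_sanitized, n_points):
    """(1/N) * sum over clusters of (|Cluster_i(D)| - |Cluster_i(D')|)."""
    return sum(o - s for o, s in zip(sizes_orig, sizes_sanitized)) / n_points

def misses_cost(patterns_orig, patterns_sanitized):
    """Fraction of legitimate (non-restrictive) patterns lost in D'."""
    surviving = len(patterns_orig & patterns_sanitized)
    return (len(patterns_orig) - surviving) / len(patterns_orig)

def artifactual_patterns(patterns_orig, patterns_sanitized):
    """Fraction of patterns discovered in D' that are ghosts (not in D)."""
    ghosts = len(patterns_sanitized - patterns_orig)
    return ghosts / len(patterns_sanitized)

P  = {"a->b", "b->c", "c->d"}      # non-restrictive patterns in D
P2 = {"a->b", "d->e"}              # patterns discovered in the sanitized D'
print(misclassification_error([10, 8], [9, 7], 18))  # 2 of 18 points lost
print(misses_cost(P, P2))          # 2 of 3 legitimate patterns hidden
print(artifactual_patterns(P, P2)) # 1 ghost among 2 discovered patterns -> 0.5
```

All three measures are 0 in the ideal case, so they can be read uniformly as "fraction of mining quality sacrificed to the sanitization".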
In the case of association rules, the lost information can be modeled as the set of non-sensitive rules, referred to as lost rules, that are accidentally hidden by the privacy preservation technique; the artifactual information, instead, represents the set of new rules, also known as ghost rules, that can be extracted from the database after the application of a sanitization technique. Similarly, if the aim of the mining task is data classification, e.g. by means of decision tree induction, both the lost and artifactual information can be quantified by means of the corresponding lost and ghost association rules derived from the classification tree. These measures allow one to evaluate the high-level information that is extracted from a database, in the form of the widely-used inference rules, before and after the application of a PPDM algorithm.
It is worth noting that for most cryptography-based PPDM algorithms, the data mining results are the same as those produced from unsanitized data.
8.5 Complexity Metrics
The complexity metric measures the efficiency and scalability of a PPDM algorithm. Efficiency indicates whether the algorithm can be executed with good performance, which is generally assessed in terms of space and time. Space requirements are assessed according to the amount of memory that must be allocated in order to implement the given algorithm.
For the evaluation of time requirements, there are several approaches. The first approach is to evaluate the CPU time. For example, in [21], the authors first keep constant both the size of the database and the set of restrictive patterns, and then increase the size of the input data to measure the CPU time taken by their algorithm. An alternative approach is to evaluate the time requirements in terms of computational cost. In this case, it is obvious that an algorithm with polynomial complexity is more efficient than one with exponential complexity. Sometimes, the time requirements can even be evaluated by counting the average number of operations executed by a PPDM algorithm. In [14], for instance, the performance is measured in terms of the number of encryption and decryption operations required by the specific algorithm. The last two measures, i.e. the computational cost and the average number of operations, do not provide an absolute measure, but they can be used to perform a fast comparison among different algorithms.
In the case of distributed algorithms, especially cryptography-based algorithms (e.g. [14, 31]), the time requirements can be evaluated in terms of the communication cost incurred during the exchange of information among the sites in the secure computation. Specifically, in [14], the communication cost is expressed as the number of messages exchanged among the sites that are required by the protocol for securely counting the frequency of each rule.
Scalability is another important aspect in assessing the performance of a PPDM algorithm. In particular, scalability describes the efficiency trends as data sizes increase. This parameter concerns the growth of both performance and storage requirements, as well as of the communication costs incurred by a distributed technique, as data sizes increase.
Due to continuous advances in hardware technology, large amounts of data can now be easily stored. Databases and data warehouses today store and manage amounts of data that are increasingly large. For this reason, a PPDM algorithm has to be designed and implemented with the capability of handling huge datasets that may still keep growing. The slower the decrease in the efficiency of a PPDM algorithm for increasing data dimensions, the better its scalability. Therefore, the scalability measure is very important in determining practical PPDM techniques.
8.6 How to Select a Proper Metric
In the previous sections, we have discussed various types of metrics. An important question here is: "which one among the presented metrics is the most relevant for a given privacy preserving technique?"
Dwork and Nissim [9] make some interesting observations about this question. In particular, according to them, in the case of statistical databases privacy is paramount, whereas in the case of distributed databases for which privacy is ensured by a secure multiparty computation technique, functionality is of primary importance. Since a real database usually contains a large number of records, the performance guaranteed by a PPDM algorithm, in terms of time and communication requirements, is a non-negligible factor, as is its trend when the database size increases. The data quality guaranteed by a PPDM algorithm is, on the other hand, very important when ensuring privacy protection without damaging the usability of the data for authorized users.
From the above observations, we can see that a trade-off metric may help us to state a unique value measuring the effectiveness of a PPDM algorithm. In [7], the score of a masking method provides a measure of the trade-off between disclosure risk and information loss. It is defined as an average between the ranks of the disclosure risk and information loss measures, giving the same importance to both metrics. In [8], an R-U confidentiality map is described that traces the impact on disclosure risk R and data utility U of changes in the parameters of a disclosure limitation method which adopts an additive noise technique. We believe that an index assigning the same importance to both the data quality and the degree of privacy ensured by a PPDM algorithm is quite restrictive, because in some contexts one of these parameters can be more relevant than the other. Moreover, in our opinion the other parameters, even less relevant ones, should also be taken into account. The efficiency and scalability
measures, for instance, could be discriminating factors in choosing among a set of PPDM algorithms that ensure similar degrees of privacy and data utility. A weighted mean could thus be a good measure for evaluating, by means of a unique value, the quality of a PPDM algorithm.
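Such a weighted mean might be sketched as follows; the criterion names, the normalization of each criterion to [0, 1] (1 = best), and the weights are all assumptions chosen for illustration:

```python
# Sketch of a weighted-mean composite score for a PPDM algorithm: each
# evaluation criterion is normalized to [0, 1] and combined with weights
# chosen to reflect the target application's priorities.

def weighted_score(scores, weights):
    """Weighted mean of normalized criterion scores."""
    total_w = sum(weights.values())
    return sum(scores[c] * weights[c] for c in scores) / total_w

scores  = {"privacy": 0.9, "data_quality": 0.6, "efficiency": 0.8, "scalability": 0.7}
weights = {"privacy": 0.4, "data_quality": 0.4, "efficiency": 0.1, "scalability": 0.1}
print(weighted_score(scores, weights))  # approximately 0.75
```

Shifting weight toward privacy or data quality, as the discussion above suggests, changes which of two otherwise comparable algorithms wins, which is exactly why a fixed fifty-fifty index is too restrictive.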
8.7 Conclusion and Research Directions
In this chapter, we have surveyed different approaches used in evaluating the effectiveness of privacy preserving data mining algorithms. A set of criteria has been identified: privacy level, hiding failure, data quality and complexity. As none of the existing PPDM algorithms can outperform all the others with respect to all the criteria, we discussed the importance of certain metrics for each specific type of PPDM algorithm, and also pointed out the goals of a good metric.
There are several future research directions along the way of quantifying a PPDM algorithm and its underlying application or data mining task. One is to develop a comprehensive framework according to which various PPDM algorithms can be evaluated and compared. It is also important to design good metrics that can better reflect the properties of a PPDM algorithm, and to develop benchmark databases for testing all types of PPDM algorithms.
References
[1] Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 247–255. ACM (2001)
[2] Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439–450. ACM (2000)
[3] Ballou, D., Pazer, H.: Modelling data and process quality in multi-input, multi-output information systems. Management Science 31(2), 150–162 (1985)
[4] Bayardo, R., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proc. of the 21st Int'l Conf. on Data Engineering (2005)
[5] Bertino, E., Fovino, I.N.: Information driven evaluation of data hiding algorithms. In: 7th International Conference on Data Warehousing and Knowledge Discovery, pp. 418–427 (2005)
[6] Bertino, E., Fovino, I.N., Provenza, L.P.: A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery 11(2), 121–154 (2005)
[7] Domingo-Ferrer, J., Torra, V.: A quantitative comparison of disclosure control methods for microdata. In: L. Zayatz, P. Doyle, J. Theeuwes, J. Lane (eds.) Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pp. 113–134. North-Holland (2002)
[8] Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Disclosure risk vs. data utility: The R-U confidentiality map. Tech. Rep. 121, National Institute of Statistical Sciences (2001)
[9] Dwork, C., Nissim, K.: Privacy preserving data mining in vertically partitioned databases. In: CRYPTO 2004, vol. 3152, pp. 528–544 (2004)
[10] Evfimievski, A.: Randomization in privacy preserving data mining. SIGKDD Explorations Newsletter 4(2), 43–48 (2002)
[11] Evfimievski, A., Srikant, R., Agrawal, R., Gehrke, J.: Privacy preserving mining of association rules. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–228. ACM Press (2002)
[12] Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005). Tokyo, Japan (2005)
[13] Iyengar, V.: Transforming data to satisfy privacy constraints. In: Proc. of the Eighth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 279–288 (2002)
[14] Kantarcioglu, M., Clifton, C.: Privacy preserving distributed mining of association rules on horizontally partitioned data. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 24–31 (2002)
[15] Kantarcıoğlu, M., Jin, J., Clifton, C.: When do data mining results violate privacy? In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 599–604. Seattle, WA (2004)
[16] Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03). Melbourne, Florida (2003)
[17] Kifer, D., Gehrke, J.: Injecting utility into anonymized datasets. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 217–228. ACM Press, Chicago, IL, USA (2006)
[18] Kumar Tayi, G., Ballou, D.P.: Examining data quality. Communications of the ACM 41(2), 54–57 (1998)
[19] Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering 18(1), 92–106 (2006)
[20] Nergiz, M.E., Clifton, C.: Thoughts on k-anonymization. In: The Second International Workshop on Privacy Data Management, held in conjunction with the 22nd International Conference on Data Engineering. Atlanta, Georgia (2006)
[21] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving frequent itemset mining. In: IEEE ICDM Workshop on Privacy, Security and Data Mining, vol. 14, pp. 43–54 (2002)
[22] Oliveira, S.R.M., Zaiane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases (SBBD 2003), pp. 304–318 (2003)
[23] Oliveira, S.R.M., Zaiane, O.R.: Toward standardization in privacy preserving data mining. In: ACM SIGKDD 3rd Workshop on Data Mining Standards, pp. 7–17 (2004)
[24] Rizvi, S., Haritsa, J.R.: Maintaining data privacy in association rule mining. In: 28th International Conference on Very Large Databases, pp. 682–693 (2002)
[25] Samarati, P.: Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering 13(6), 1010–1027 (2001)
[26] Schoeman, F.D.: Philosophical Dimensions of Privacy: An Anthology. Cambridge University Press (1984)
[27] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
[28] Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 571–588 (2002)
[29] Trottini, M.: A decision-theoretic approach to data disclosure problems. Research in Official Statistics 4, 7–22 (2001)
[30] Trottini, M.: Decision models for data disclosure limitation. Ph.D. thesis, Carnegie Mellon University (2003)
[31] Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639–644. ACM Press (2002)
[32] Verykios, V.S., Bertino, E., Nai Fovino, I., Parasiliti, L., Saygin, Y., Theodoridis, Y.: State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1), 50–57 (2004)
[33] Walters, G.J.: Human Rights in an Information Age: A Philosophical Analysis, chap. 5. University of Toronto Press (2001)
[34] Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12(4), 5–34 (1996)
[35] Willenborg, L., De Waal, T.: Elements of Statistical Disclosure Control, Lecture Notes in Statistics, vol. 155. Springer (2001)