rrw - a robust and reversible watermarking technique for ... - a robust and reversible... · 1 rrw...

14
1 RRW - A Robust and Reversible Watermarking Technique for Relational Data Saman Iftikhar, M. Kamran and Zahid Anwar Abstract—Advancement in information technology is playing an increasing role in the use of information systems comprising relational databases. These databases are used effectively in collaborative environments for information extraction; consequently, they are vulnerable to security threats concerning ownership rights and data tampering. Watermarking is advocated to enforce ownership rights over shared relational data and for providing a means for tackling data tampering. When ownership rights are enforced using watermarking, the underlying data undergoes certain modifications; as a result of which, the data quality gets compromised. Reversible watermarking is employed to ensure data quality along-with data recovery. However, such techniques are usually not robust against malicious attacks and do not provide any mechanism to selectively watermark a particular attribute by taking into account its role in knowledge discovery. Therefore, reversible watermarking is required that ensures; (i) watermark encoding and decoding by accounting for the role of all the features in knowledge discovery; and, (ii) original data recovery in the presence of active malicious attacks. In this paper, a robust and semi-blind reversible watermarking (RRW) technique for numerical relational data has been proposed that addresses the above objectives. Experimental studies prove the effectiveness of RRW against malicious attacks and show that the proposed technique outperforms existing ones. Index Terms—Reversible Watermarking, Genetic Algorithm, Data Recovery, Data Quality, Robustness, Numerical Data. 1 I NTRODUCTION In the digital world of today, data is excessively being generated due to the increasing use of the Internet and Cloud Computing [1]. Data is stored in different digital formats such as images, audio, video, natural language texts and relational data. Relational data in particular is shared extensively by the owners with research commu- nities and in virtual data storage locations in the Cloud. The purpose is to work in a collaborative environment and make data openly available so that it is useful for knowledge extraction and decision making. Take the case of Walmart- a large multinational retail corporation that has made its sales database available openly over the Internet so that it may be used for the purposes of identifying market trends through data mining [2]. However these openly available datasets make attractive targets for attacks. For example there are documented attack incidents where data containing personal infor- mation related to customers using certain Walmart video services [3] was stolen. According to a survey related to the security of outsourced customer data [4], it is reported that 46% of organizations do not consider secu- rity and privacy issues while sharing their confidential Saman Iftikhar is with the Department of Computing, School of Electrical Engineering and Computer Sciences, National University of Sciences and Technology, Islamabad, Pakistan. E-mail: [email protected] M. Kamran is with the Department of Computer Science, COMSATS Institute of Information Technology, Wah Cantt, Pakistan. E-mail: [email protected] Zahid Anwar is with the Department of Computing, School of Electrical Engineering and Computer Sciences, National University of Sciences and Technology, Islamabad, Pakistan. E-mail: [email protected] data. Therefore, 64% organizations have to face data loss repeatedly. Similarly, data breaches in the health care and medical domain are increasing alarmingly [5]. Therefore it is imperative, that in shared environments such as that of the Cloud, security threats that arise from un-trusted parties and relational databases 1 need to be addressed along with the enforcement of ownership rights on behalf of their owners. Watermarking techniques have historically been used to ensure security in terms of ownership protection and tamper proofing for a wide variety of data formats. This includes images, audio, video [6], [7], [8], [9], [10], natural language processing software [11], relational databases [12], [13] and more. Reversible watermarking techniques can ensure data recovery along with own- ership protection. Fingerprinting, data hashing, serial codes are some other techniques used for ownership protection [14]. Fingerprints also called transactional watermarks, are used to monitor and identify digital ownership by watermarking all the copies of contents with different watermarks for different recipients. Pri- marily this type of digital watermarking tries to identify the source of data leakage by tracing a guilty agent. In hashing, digital contents can be saved by performing a one-way hash function whereby the data contents do not change. If the hash of the original and tampered data is the same, data authenticity can be verified but ownership cannot be proved easily. Serial or classifica- tion codes are used for filtering of inappropriate contents over the Internet and are mainly applicable to images, 1. The terms, relational databases, data and datasets are used inter- changeably in this paper. IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

Upload: others

Post on 14-Sep-2019

11 views

Category:

Documents


0 download

TRANSCRIPT

1

RRW - A Robust and Reversible WatermarkingTechnique for Relational Data

Saman Iftikhar, M. Kamran and Zahid Anwar

Abstract—Advancement in information technology is playing an increasing role in the use of information systems comprising relationaldatabases. These databases are used effectively in collaborative environments for information extraction; consequently, they arevulnerable to security threats concerning ownership rights and data tampering. Watermarking is advocated to enforce ownershiprights over shared relational data and for providing a means for tackling data tampering. When ownership rights are enforced usingwatermarking, the underlying data undergoes certain modifications; as a result of which, the data quality gets compromised. Reversiblewatermarking is employed to ensure data quality along-with data recovery. However, such techniques are usually not robust againstmalicious attacks and do not provide any mechanism to selectively watermark a particular attribute by taking into account its role inknowledge discovery. Therefore, reversible watermarking is required that ensures; (i) watermark encoding and decoding by accountingfor the role of all the features in knowledge discovery; and, (ii) original data recovery in the presence of active malicious attacks. Inthis paper, a robust and semi-blind reversible watermarking (RRW) technique for numerical relational data has been proposed thataddresses the above objectives. Experimental studies prove the effectiveness of RRW against malicious attacks and show that theproposed technique outperforms existing ones.

Index Terms—Reversible Watermarking, Genetic Algorithm, Data Recovery, Data Quality, Robustness, Numerical Data.

F

1 INTRODUCTIONIn the digital world of today, data is excessively beinggenerated due to the increasing use of the Internet andCloud Computing [1]. Data is stored in different digitalformats such as images, audio, video, natural languagetexts and relational data. Relational data in particular isshared extensively by the owners with research commu-nities and in virtual data storage locations in the Cloud.The purpose is to work in a collaborative environmentand make data openly available so that it is useful forknowledge extraction and decision making. Take thecase of Walmart- a large multinational retail corporationthat has made its sales database available openly overthe Internet so that it may be used for the purposesof identifying market trends through data mining [2].However these openly available datasets make attractivetargets for attacks. For example there are documentedattack incidents where data containing personal infor-mation related to customers using certain Walmart videoservices [3] was stolen. According to a survey relatedto the security of outsourced customer data [4], it isreported that 46% of organizations do not consider secu-rity and privacy issues while sharing their confidential

• Saman Iftikhar is with the Department of Computing, School of ElectricalEngineering and Computer Sciences, National University of Sciences andTechnology, Islamabad, Pakistan.E-mail: [email protected]

• M. Kamran is with the Department of Computer Science, COMSATSInstitute of Information Technology, Wah Cantt, Pakistan.E-mail: [email protected]

• Zahid Anwar is with the Department of Computing, School of ElectricalEngineering and Computer Sciences, National University of Sciences andTechnology, Islamabad, Pakistan.E-mail: [email protected]

data. Therefore, 64% organizations have to face dataloss repeatedly. Similarly, data breaches in the healthcare and medical domain are increasing alarmingly [5].Therefore it is imperative, that in shared environmentssuch as that of the Cloud, security threats that arisefrom un-trusted parties and relational databases1 need tobe addressed along with the enforcement of ownershiprights on behalf of their owners.

Watermarking techniques have historically been usedto ensure security in terms of ownership protection andtamper proofing for a wide variety of data formats.This includes images, audio, video [6], [7], [8], [9], [10],natural language processing software [11], relationaldatabases [12], [13] and more. Reversible watermarkingtechniques can ensure data recovery along with own-ership protection. Fingerprinting, data hashing, serialcodes are some other techniques used for ownershipprotection [14]. Fingerprints also called transactionalwatermarks, are used to monitor and identify digitalownership by watermarking all the copies of contentswith different watermarks for different recipients. Pri-marily this type of digital watermarking tries to identifythe source of data leakage by tracing a guilty agent. Inhashing, digital contents can be saved by performing aone-way hash function whereby the data contents donot change. If the hash of the original and tampereddata is the same, data authenticity can be verified butownership cannot be proved easily. Serial or classifica-tion codes are used for filtering of inappropriate contentsover the Internet and are mainly applicable to images,

1. The terms, relational databases, data and datasets are used inter-changeably in this paper.

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

2

audio and video. Watermarking has the property thatit can provide ownership protection over the digitalcontent by marking the data with a watermark unique tothe owner. The embedded watermark can subsequentlybe used for proving and claiming ownership.

Digital watermarking of multimedia content is morecommonly known. Particularly image watermarking -a derivative of Steganography is an age-old practiceallowing covert transmission of messages from one partyto another by exploiting redundancy in common imageformats. However the basic process of multimedia wa-termarking is very different from that used to watermarkrelational databases because of a fundamental differencein the properties of the data. Multimedia data is highlycorrelated and continuous whereas relational data isindependent and discrete. With the advent of moderncopyright protection and information hiding techniques,database watermarking can be used to enforce owner-ship rights of relational data. However a major drawbackof these techniques is that they modify the data to a verylarge extent which often results in the loss of data quality.There is a strong need to preserve the data quality inwatermarked data so that it is of sufficiently high qualityand fit for use in decision making as well as in planningprocesses in different application domains. Data qualitycan be defined as the appropriateness of data for itsintended applications.

Reversible watermarking tries to overcome the prob-lem of data quality degradation by allowing recoveryof original data along with the embedded watermarkinformation. This paper proposes one such reversiblewatermarking technique that keeps the data useful forknowledge discovery. Data modifications are allowed tosuch extent that the quality of the data before embeddingwatermark information and after extracting, is accept-able for knowledge extraction process. Consequently,knowledge discovery becomes successful in decisionsupport systems where high quality data recovery isimperative. Reversible watermarking techniques are al-ready available in literature; however, to the best of ourknowledge, no work has been conducted on overcomingthe problems of reversible watermarking techniques inthe presence of malicious attacks. Achieving robustness(attack resilience) in the presence of reversibility (abilityto recover the watermark and the original data) is achallenging task. These two features may be potentiallyconflicting because a reversible watermark string alsomakes it an easier target for attack. Therefore, we tryto find the most appropriate watermark bandwidth thatensures maximum watermark robustness without signif-icant loss of information that may result by watermark-ing. To this end, we model the bandwidth optimizationas a constraints optimization problem.

Constrained Optimization (CO) [15], [16] allows one tooptimize a single or multiple objectives with respect tocertain variables that are bounded by some constraints.Our motivated problem is also a CO problem, wherebythe research communities want to share databases over

the public Internet or a Cloud environment for theirknowledge discovery processes. Ownership rights ofthese databases need to protected from malicious recipi-ents; in the presence of data quality constraint. Recent re-search studies enunciate that computational intelligencetechniques, such as Genetic Algorithm (GA) and ParticleSwarm Optimization (PSO) are a promising branch ofevolutionary computation that model hard constrainedoptimization problems [17] using biological inspiredcomputing algorithms. Digital Watermarking can also bemodeled as an optimization problem as demonstrated bysome recent research works [18] and [19] that use PSOfor watermarking different data formats and the resultsare quite encouraging. GA - an optimization algorithmis employed in the robust and reversible watermarkingtechnique (RRW) proposed in this paper to achieve anoptimal solution that is feasible for the problem at handand does not violate the defined constraints. An optimalwatermark value is created through the GA and insertedinto the selected feature of the relational database in sucha way that the data quality remains intact.

Our focus is to develop an information model througha statistical measure that identifies such features thatdo not have a significant effect on the decision makingprocess. Mutual Information [20], [21], a well knowninformation theory (concept), statistically measures theamount of information that one feature contains aboutthe other features in a database. In RRW, mutual infor-mation is used to select a suitable (candidate) featurefrom the database for watermarking. According to litera-ture, existing reversible watermarking techniques, do nottake into account the mutual information measure fordetermining relative importance of features. In RRW, theknowledge of mutual information for every candidatefeature is also employed to compute the watermarkinformation. Thus, it is ensured that the data quality willnot be affected. Consequently, RRW provides a robustsolution for data recovery that is reversible and resilientagainst heavy attacks.

RRW mainly comprises a (1) data preprocessing phase,(2) watermark encoding phase, (3) attacker channel,(4) watermark decoding phase and (5) data recoveryphase. In data preprocessing phase, secret parametersare defined and strategies are used to analyze and rankfeatures to watermark. An optimum watermark string iscreated in this phase by employing GA - an optimizationscheme - that ensures reversibility without data qualityloss. In the watermark encoding phase, the watermarkinformation is embedded in the selected feature(s). Twoparameters, β the optimized value from the GA and ηra change matrix are used in the watermark encodingand decoding phases. Finally, the watermarked data forintended recipients is generated. The attacker channelcomprises subset alteration, subset deletion and sub-set insertion attacks generated by the adversary. Thesemalicious attacks modify the original data and try todegrade its quality. In the watermark decoding phase theembedded watermark is decoded from the suspicious

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

3

data. In order to achieve this the preprocessing stepis performed again, and decoding strategies (featureselection on the basis of MI, β the optimized valuefrom the GA and ηr the change matrix) are used torecover the watermark. Semi-blind nature of RRW isused mainly for data reversibility in case of heavy attacks(attacks that may target large number of tuples). Originaldata is recovered in data recovery phase, through postprocessing steps for error correction and recovery.

The major contributions of our work are: (1) the designof an intelligent reversible watermarking technique forrelational data that ensures data recovery without com-promising data quality, and (2) a robust data recoveryscheme that is resilient against subset alteration, subsetdeletion and subset insertion attacks. RRW detects thewatermark fully and recovers the original data. Therobustness of RRW is evaluated through attack analysis,considering the attacker channel. It is worth mentioningthat sound watermarking techniques exploit redundancyin the data to embed the watermark in a manner that itdoes not impact the overall size. For example considerthe case of image watermarking that embeds informationin the least significant bit (LSB) of pixel values which isimperceptible as well as it does not effect the overallimage size. This is desirable because changes in the sizeof the original data can compromise the presence orabsence of a watermark. Similarly one other contributionof our technique is that it avoids detection by keepingthe size of the original relational data unaltered.

The subsequent sections of the paper are structured asfollows: In Section 2, a brief overview of related researchis provided by emphasizing the different directions ofour work. In Section 3, the proposed scheme is describedin detail. In Section 4, experiments and results are dis-cussed concisely through a short case study of medicaldata. Finally, the paper is concluded with a discussionto future research in Section 5.

2 RELATED WORK

The first irreversible watermarking technique for rela-tional databases was proposed by Agrawal et al. in[12]. Similarly, the first reversible watermarking schemefor relational databases was proposed in [22]. In thistechnique, histogram expansion is used for reversiblewatermarking of relational database. Zhang et al. pro-posed a method of distribution of error between twoevenly distributed variables and selected some initialnonzero digits of errors to form histograms. Histogramexpansion technique is used to reversibly watermark theselected nonzero initial digits of errors. This technique iskeeps track of overhead information to authenticate dataquality. However, this technique is not robust againstheavy attacks (attacks that may target large number oftuples).

Difference Expansion Watermarking techniques(DEW), [23], [24], [25] exploit methods of arithmeticoperations on numeric features and perform

transformations. The watermark information is normallyembedded in the LSB of features of relational databasesto minimize distortions. Whereas, in RRW, a GA basedoptimum value is embedded in the selected featureof the dataset with the objective of preserving thedata quality while minimizing the data distortions asa result of watermark embedding. Another reversiblewatermarking technique proposed in [26] is based ondifference expansion and support vector regression(SVR) prediction to protect the database from beingtampered. The intention behind the design of thesetechniques is to provide ownership proof. Suchtechniques are vulnerable to modification attacks asany change in the expanded value will fail to detectwatermark information and the original data.

Genetic Algorithm based on Difference Expansion wa-termarking (GADEW) technique is used in a proposedrobust and reversible solution for relational databases[27]. GADEW improves upon the drawbacks mentionedabove by minimizing distortions in the data, increasingwatermark capacity and lowering false positive rate. Tothis end, a GA is employed to increase watermark capac-ity and minimize introduced distortion. This is becausethe watermark capacity increases with the increase innumber of features and the GA runs on more featuresto search the optimum one for watermarking. However,watermark capacity decreases with the increase in wa-termarked tuples. GADEW used the distortion measures(AWD and TWD) to control distortions in the resultantdata. In this context, the robustness of GADEW canbe compromised when AWD and TWD are given highvalues.

Prediction-error expansion watermarking techniques(PEEW) like [28], [29], [30], [31], [32], [33] incorporatea predictor as apposed to a difference operator to selectcandidate pixels or features for embedding of watermarkinformation. The PEEW proposed technique by Farfouraet al. is fragile against malicious attacks as the water-mark information is embedded in the fractional part ofnumeric features only. In this particular scenario, thescheme works because the intention of the attacker isto preserve the usefulness of the data; otherwise, he caneasily compromise the fractional part. RRW is robust, asthe watermark information is embedded in the values ofnumeric features, to make the scheme resilient againstsuch attacks.

In [34], the authors proposed a robust, blind, resilientand reversible, image based watermarking scheme forlarge scale databases. The bit string of an image is usedas a watermark where one bit from the bit string isembedded in all tuples of a single partition and thesame process is repeated for the rest of the partitions.This technique demonstrates a remarkable decrease inwatermark detection rate during various types of heavyattacks, and the database tuples get highly distorted. InRRW, a GA is used to generate a parameter that controlsthe data distortions to make sure that the data qualityremains intact after watermarking. Moreover, the semi-

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

4

blind nature of the technique allows robustness againstheavy attacks and also for regeneration of the originaldataset after watermark decoding.

Gupta and Pieprzyks’ [23], proposed reversible wa-termarking technique introduces distortions as a resultof the embedding process. Changes in the data arecontrolled by placing certain bounds on LSB. On thecontrary, to limit the distortions, the data outside thelimited bounds is left un-watermarked. As a result,the watermark robustness gets compromised. However,RRW has no such limitations.

The reversible watermarking techniques DEW,GADEW and PEEW, proposed in [23], [27], [28]respectively, are not robust and reversible against heavyattacks. Features are selected in these techniques forwatermarking without considering their importance inknowledge discovery. RRW is robust and reversibleand copes with the above mentioned problems anddata quality is preserved by taking into account theimportance of the features in knowledge discovery.

In RRW, all the tuples of the selected feature canbe marked thanks to the selection of a low distortionwatermark; therefore, the attacker will have to attackall the tuples to corrupt the watermark to mitigate theeffect of the majority voting scheme. Attacking all thetuples is not a viable option for the attacker because hehas no knowledge of the original data or the usabilityconstraints and that would completely compromise itsusefulness. Moreover, since RRW can afford to embedwatermark bits in all or a large fraction of the tuples ofthe selected feature; it achieves high robustness againstheavy attacks. However, marking all tuples is not arequirement. RRW is configurable in that the data ownercan choose a fraction for watermarking if it is required.RRW outperforms existing state of the art reversiblewatermarking techniques including DEW, GADEW andPEEW. These techniques embed the watermark in par-titions of the data to ensure minimum distortion; there-fore, recover original data with degraded data qualityand lack robustness. RRW has overcome drawbacksof these techniques and is also resilient against heavyattacks.

3 RRW ARCHITECTUREThis section discusses RRW for reversible watermarkingof relational databases that improves data recovery ratio.The main architecture of RRW is presented in figure1. RRW includes the following four major phases: (1)watermark preprocessing; (2) watermark encoding; (3)watermark decoding; and (4) data recovery.

The watermark preprocessing phase computes differ-ent parameters for calculation of an optimal watermark.These parameters are used for watermark encoding anddecoding.

The main focus of watermark encoding phase is toembed watermark information in such a way that itdoes not affect the data quality. During watermark em-bedding, data gets modified according to the available

Fig. 1. Main Architecture of RRW

bandwidth (or capacity) of the watermark information.The bandwidth of the watermark should be sufficientlylarge to ensure robustness but not so large that it de-stroys the data quality. The data owner decides theamount of data modification such that the quality isnot compromised for a particular database applicationbefore-hand and therefore defines usability constraintsλ to introduce tolerable distortion into the data. Thedatasets used in our research, are taken from UCI ma-chine learning repository, including, Cleveland HeartDisease dataset [35], MAGIC Gamma Telescope dataset[36] and PAMAP2 Physical Activity Monitoring dataset[37]. It is to note that these datasets were used forevaluation purposes and that RRW is not dependent onany particular dataset or feature. Numerical features canbe taken under consideration from any dataset and asuitable feature is determined to embed watermark onthe basis of mutual information.

After watermarking, the data is released to the in-tended recipients over a communication channel thatis assumed to be insecure and termed as the ”attackerchannel” in this research domain. The data may undergoseveral malicious attacks in the attacker channel. Theefficiency and effectiveness of RRW is described throughrobustness analysis determined by its response to subsetinsertion, alteration and deletion attacks.

The Watermark decoding phase recovers watermarkinformation effectively for detection of the embeddedwatermark. Data recovery phase mainly comprises theimportant task of successful recovery of the original data.For a quick reference, Table 1 lists the notations used inthis paper.

In subsequent sections, different phases of RRW arediscussed.

3.1 Watermark Preprocessing Phase

In the preprocessing phase, two important tasks areaccomplished: (1) selection of a suitable feature for

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

5

TABLE 1Notations Used in the Paper

Symbol Description Symbol DescriptionD Original database b The watermark bitDW Watermarked database min(aw) The minimum value a of a feature after watermarkingR Total number of tuples/rows/records in a table (or dataset) r A tuple in the database tableηr Detected amount of percentage change in encoding ∇ A matrix containing percent change in data valuesmax(a) The maximum value a of a feature ζ The watermark encoder used for watermark encodingA a feature/column/attribute selected for watermarking (D) D

′W A watermarked database after the malicious attacks

min(a) The minimum value of a feature max(aw) The maximum value of a feature a after watermarkingl The length of the watermark ηdr Detected amount of change in the value of a feature after

an attack on the watermark bit bw Watermark bits η∆r The difference between the changes detected in the value

of a feature during the encoding and decoding processAW The watermarked feature WD Decoded watermarkF Total number of features in the database DW D watermarked by the proposed schemedtW detected watermark bit Dr Recovered DataMIO Mutual information of original data MIW Mutual information of watermarked dataAccO Classification accuracy of original data AccW Classification accuracy of watermarked dataSnO Sensitivity of original data SnW Sensitivity of watermarked dataSpO Specificity of original data SpW Specificity of watermarked data∆MI Change in value of mutual information a value of feature Aβ An optimized value to watermark a feature λ The usability constraints defined by the data ownerζ The watermark decoder used for watermark decoding RRW proposed Robust and Reversible Watermarking techniqueµD Mean of the original data of RRW σD Variance of the original data of RRWµDW

Mean of the watermarked data of RRW σDWVariance of the watermarked data of RRW

µd Mean of the original data of PEEW Technique σd Variance of the original data of PEEW Techniqueµdw Mean of the watermarked data of PEEW Technique σdw Variance of the watermarked data of PEEW Technique

watermark embedding; (2) calculation of an optimalwatermark with the help of an optimization technique.

3.1.1 Feature Analysis and SelectionFor developing a decisive information model of variousfeatures of the dataset, all the features are ranked ac-cording to their importance in information extraction,subject to their mutual dependence on other features.For this purpose, mutual information (MI), is exploited,that is an important statistical measure for computationof mutual dependence of two random variables. Mutualinformation of every feature with all other features iscalculated by using equation 1.

MI(A,B) =∑a

∑b

PAB(a, b)logPAB(a, b)

PA(a)PB(b). (1)

Where MI(A,B) measures the degree of correlation offeatures by measuring the marginal probability distribu-tions as PA(a), PB(b) and the joint probability distribu-tion PAB(a, b).

Then MI of one feature with all other features iscomputed using the relation:

MIfi = (MIfij ). (2)

Where i, j = 1,2 ..., ft with i 6= j, and ft is the totalnumber of features.

The value of MI of each feature is then used to rankthe features. The attacker can try and predict the featurewith the lowest MI in an attempt to guess which featurehas been watermarked. To deceive the attacker for thisparticular scenario, a secret threshold can be used forselecting the feature for watermark embedding. In thiscontext, the data owner can define a secret threshold

based on MI of all the features in the database. Thefeature(s) having MI lower than that threshold can beselected for watermarking. The attacker will not attackthe features having large MI as in that case the usabilityof the data will be compromised. Therefore, he will beforced to attack the feature(s) with lower MI withoutconcrete knowledge (due to the use of secret threshold)of which features have been watermarked.

3.1.2 Watermark Creation through Genetic AlgorithmFor the creation of optimal watermark information, thatneeds to be embedded in the original data, we use anevolutionary technique; Genetic Algorithm (GA). GAis a population-based computational model, basicallyinspired from genetic evolution [38]. GA evolves a poten-tial solution to an optimization problem by searching thepossible solution space. In the search of optimal solution,the GA follows an iterative mechanism to evolve apopulation of chromosomes. The GA preserves essentialinformation through the application of basic geneticoperations to these chromosomes that include: selection,crossover, mutation and replacement. The GA evaluatesthe quality of each candidate chromosome by employinga fitness function. The evolutionary mechanism of theGA continues through a number of generations, untilsome termination criteria is met.

During watermark creation phase, we employed thefollowing major steps of the GA for getting optimalwatermark information:

1) Initial random population of binary strings calledchromosomes is generated. Gene values of eachchromosome represents l-bit watermark string asshown in Fig. 2.

2) Fitness of each chromosome is evaluated by em-ploying a constrained optimized fitness function

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

6

discussed in Section 3.1.2.1.3) Tournament selection mechanism is applied to get

the most appropriate individuals as parent chro-mosomes.

4) Genetic operations of crossover and mutation areperformed on parent chromosomes to create off-springs. A single point crossover operator is ap-plied to evolve high quality individuals, inheritingparental characteristics, by exchanging informationbetween two or more chromosomes. A uniformmutation operator is applied to bring diversityin population through small random changes ingene values of binary chromosomes. The valuesof crossover fraction and mutation rate are setempirically and are given in Table 4.

5) Elitism strategy is applied to hire two individualswith best fitness value; as elites to the next gener-ation without genetic changes.

6) Remaining population of the next generation iscreated by replacing less fit individuals of theprevious generation with the most fit newly createdoff-springs.

7) Step 2 to 6 are repeated until MIO and MIW reachapproximately equal values for a certain numberof generations.

8) Both, optimal watermark information string andbest fitness value (β) is returned after the fulfill-ment of the termination criteria.

Fig. 2. The Structure of GA Chromosome Used in theProposed RRW technique for Watermark Creation

3.1.2.1 Constrained Optimized Fitness Function:In the proposed scheme, the GA is populated with a con-strained fitness function to acquire an optimal change indata that will ensure data quality while embedding thewatermark. The gene values of chromosomes are usedto decide which action will be performed on selectedfeature by using a value of β. Initially, some suitablevalue of β is set to embed a bearable change in featurevalues.

The optimal fitness value obtained through GA isbasically the change - β - to be embedded in the originaldata that needs to be watermarked. The purpose ofgetting an optimum value of β is to justify the amountof change that a feature value can withhold withoutcompromising the data quality. The fitness function canbe mathematically expressed as:

maxβA, where AεF, ∀Asubject to λ

(3)

Data quality is ensured by imposing the followingusability constraints λ on original and watermarked datagiven in equation 4.

µD − µDW= 0

σD − σDW= 0

MIW −MIO = 0

(4)

where MIO, µD and σD denote mutual information,mean and variance of the original data and MIW , µDW

and σDWrepresent mutual information, mean and vari-

ance of the watermarked data respectively. The min-imum and maximum values of the selected featureshould also be the same both before and after watermarkembedding as denoted in equation 5.

min(a) = min(aw)

max(a) = max(aw)(5)

In the watermark encoding phase, the optimized valueof β is embedded in the particular feature that is selectedto be watermarked on the basis of the selection criteriadiscussed in Section 3.1.1. The gene values of the bestchromosome are responsible for deciding which math-ematical operation will be performed on the selectedfeature by using the optimized value of β. Subsequently,this optimal watermark information is decoded from thespecified dataset successfully and also needs to be de-coded during the decoding process by the data recipient.

3.2 Watermark Encoding Phase

Watermark information calculation is formulated as aCO problem to meet the data quality constraint of thedata owner. A GA is used to create optimal watermarkinformation that includes: (1) Optimal chromosomalstring (watermark string of length l); and (2) β value. β isa parameter that is computed using GA and representsa tolerable amount of change to embed in the featurevalues. Once the optimum value of β for each candidatefeature A is found, it is saved for use during watermarkencoding and decoding. A watermark (bit string) oflength l and an optimum value β is used to manipulatethe data provided it satisfies the usability constraints λ.The value β is added into every tuple of the selectedfeature A when a given bit is 0; otherwise, its value issubtracted from the value of the feature.

It is ensured that the mutual information of a featureremains unchanged, when the watermark is insertedinto the database. The watermark is inserted into everytuple for the selected feature of the dataset. The dataowner can select any number of features for watermarkembedding based upon a secret threshold and MI of thefeature(s).

After finding the optimum value of β, a parameter ηris calculated according to the equation 6, that representsthe percentage change in the watermark encoding. Thisparameter is calculated for a tuple r as:

ηr = Dr ∗ ζ (6)

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

7

Since the length of the watermark is l; ηr is calculatedand β is inserted l times in the database. The lengthl of the watermark should be carefully chosen. If it istoo small, it will make the watermark fragile againstattacks, and if it is too large, it might compromise thedata quality because the data gets altered for every bitof the watermark. In RRW, the data gets altered foreach watermark bit in every tuple. After a number ofempirical studies, a length of 16 bits was selected. Itis also worth mentioning here that RRW allows a userdefined number for watermark bit length.

The watermark encoding algorithm starts the embed-ding process with the most significant bit MSB of thewatermark. For this purpose the algorithm works withone tuple at a time. If the MSB of the watermark is 1,the new value of Dr denoted by DWr is calculated usingequation 7 as follows.

DWr= Dr − β (7)

In order to embed the second MSB of the watermark, thealgorithm is again employed using the same procedure,but the updated value Dr of the feature (that has nowbecome DWr

) is used for calculating new values of ηrand DWr

. If the algorithm encounters a watermark bitthat is 0 then the new value of DWr

is calculated usingequation 8.

DWr= Dr + β (8)

The above procedure is repeated until all tuples of thedataset have been watermarked. Algorithm 1 describesthe steps involved in the watermark encoding phase.

Algorithm 1 Watermark EncodingInput: D, w,βOutput: DW , ∇

for w = 1 to l do//loop will iterate for all watermark bits w from 1 to length l of thewatermarkfor r = 1 to R do

//loop will iterate for all tuples of the dataif br,w == 0 then

// the case when the watermark bit is 0changes are calculated by using equation 6data is watermarked by using equation 8insert ηr into ∇

end ifif br,w == 1 then

// the case when the watermark bit is 1changes are calculated by using equation 6data is watermarked by using equation 7insert ηr into ∇

end ifend for

end forreturn DW ,∇

3.3 Watermark Decoding PhaseIn the watermark decoding process,the first step is tolocate the features which have been marked. The processof optimization through GA is not required during thisphase. We use a watermark decoder ζ, which calculatesthe amount of change in the value of a feature thatdoes not affect its data quality. The watermark decoder

decodes the watermark by working with one bit at atime. In the decoding phase, ηdr is calculated usingequation 9 and represents the percent change detected inthe watermarked data. The value of ηdr , ηr and η∆r

is cal-culated using the values of tuple r and therefore mightbe different for every r. The parameter η∆r

is computedby calculating the difference between the original datachange amount ηr and the watermark detected changeamount ηdr using equation 10.

ηdr = D′

W ∗ ζ (9)

η∆r = ηdr − ηr (10)

The watermark decoding algorithm is presented inAlgorithm 2. The decoding phase mainly consists of twosteps: Step 1. For every candidate feature A of all thetuples in D

W , the watermark bits are detected startingfrom the LSB (least significant bit) and moving towardsthe MSB (most significant bit). The bits are detectedin the reverse order compared with the bits encodingorder because it is easy to detect the effect of the lastencoded bit of the watermark. This process is carriedout using the change matrix ηr. Step 2. The bits are thendecoded according to the percentage change values ofwatermarked data. If η∆r ≤ 0, the detected watermarkbit will be 1. If η∆r

> 0 and η∆r≤ 1, the detected

watermark bit will be 0. The final watermark informationis retrieved through a majority voting scheme usingequation 11.

WD ⇐ mode(dtW (1, 2, ..., l)) (11)

Algorithm 2 Watermark DecodingInput: DW or D

′W , ∇, l

Output: WD

for r = 1 to R do//loop will iterate for all tuples of the datafor b = l to 1 do

//loop will iterate for all watermark bits b from 1 to length l of thewatermarkηdr ⇐ D

′W (r) ∗ ζ

η∆r ⇐ ηdr − ηrif η∆r ≤ 0 then

detected watermark bit (dtW) is 1else if η∆r > 0 and η∆r ≤ 1 then

detected watermark bit (dtW) is 0end if

end forend forWD ⇐ mode(dtW (1, 2, ..., l))return WD

3.4 Data Recovery Phase

After detecting the watermark string, some post process-ing steps are carried out for error correction and datarecovery. The optimized value of β computed throughthe GA is used for regeneration of original data. Thedata recovery algorithm is presented in algorithm 3. The

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

8

value of a numeric feature is recovered using equations12 and 13.

Dr = D′

Wr+ β (12)

Dr = D′

Wr− β (13)

Algorithm 3 Data RecoveryInput: DW or D

′W , b

Output: Dr

for r = 1 to R do//loop will iterate for all tuples of the datafor b = l to 1 do

//loop will iterate for all watermark bits b from 1 to length l of thewatermarkif dtW (r, b) == 1 then

// 0 or 1 watermark bit is detected from every tuple rdata is recovered by using equation 12

elsedata is recovered by using equation 13

end ifend for

end forreturn Dr

4 RESULTS AND DISCUSSION

Experiments are conducted on Intel Core i3 with CPUof 2.40GHz and RAM of 2GB. For brevity, heart diseasemedical dataset, [35] containing more than 300 tuples isselected. RRW was evaluated for: (1) investigating effecton the data quality of the underlying data; (2) robust-ness against malicious attacks; and (3) restoration of theoriginal data. The data recovery, watermark detectionaccuracy and effect of RRW on data quality are evaluatedusing the case study of a heart disease medical dataset.A small set of tuples from the same dataset are also usedas an example to illustrate the entire procedure step bystep.

Robustness of RRW is demonstrated through an ex-tensive attack analysis. Our results have shown 100%accuracy in both watermark detection and data recovery.The experiments performed, demonstrate data recoveryin best case as well as in worst case scenarios whereMallory tries to insert, alter and delete 10%, 20%, 30%,40%, 50% and up-to 100% of the data and the resultsare plotted in the graphs below. After such attacks,RRW recovered 100%, 65.02% and 50.50% tuples with50% insertion, alteration and deletion attacks on datarespectively. In another experiment, 37% of tuples havebeen recovered with 100% distortion in data. The effectof RRW on the statistical measures reported in the GAfitness function as in equation 4 is analyzed with theoriginal as well as watermarked data. Results are com-pared with effect on the statistical measures of the PEEWtechnique before and after watermarking. RRW is alsocompared with DEW, GADEW and PEEW techniquesfor watermark detection accuracy with subset insertion,subset alteration and subset deletion attacks. In all thesescenarios, RRW technique has shown better results.

The computational time of RRW is (l ∗ R ∗ A) wherel is the watermark length, R is the total number of

tuples in the dataset and A is the feature selectedfor watermarking. The number of tuples are usuallymuch larger as compared to the number of features indatabases and the watermark length l; so, A << R andl << R. Therefore, for large databases, (R termed as′n′) the time complexity of RRW for watermark insertionand detection is O(n) (it is worth mentioning here thatthe time for computing the GA based optimal valueis not included in this calculation because it is partof pre-processing and same is the case for GADEW).For datasets involving large number of features or largenumber of tuples, the data owner may use a separatemachine, with high computation power, for watermark-ing the datasets. This is a small price to pay for animproved sense of privacy against data theft (false claimof ownership can be tackled by watermark encoding anddecoding). Computational time of PEEW is also O(n)as this technique also needs to process all the tuples.DEW marks less number of tuples based on whetherthey meet a certain threshold; however, it still needs toprocess all tuples leading to a time complexity of O(n).For GADEW, the time complexity is (C ∗G∗R∗A) whereC is the number of chromosomes, G is the number ofgenerations involved in calculating the total cost, R isthe number of tuples and A is the number of features.In general this is again O(n).

4.1 A Case Study of Medical Data Recovery4.1.1 Phase 1: PreprocessingConsider the medical dataset of heart disease classifi-cation problem; for brevity, here only two features (A1and A2) and one feature (A14 - Class Labels) with fourtuples are shown in Table 2. The class label is dividedinto five classes, where 0 corresponds to absence of heartdisease and 1,2,3,4 corresponds to four different typesof heart diseases on the basis of cp (chest pain) typenaming: typical angina, atypical angina, non-anginalpain, asymptomatic respectively. In order to select thecandidate feature for watermarking, mutual informationof the whole dataset is calculated. The MIO and MIWof the watermarked feature A1 has been provided in theTable 3 (A1 was selected as the candidate feature forwatermarking after calculating the MI of all features).Its statistical relationship (MI) with itself has not beenshown in the table.

TABLE 2D (Original Data) for watermarking

A1 A2 ... Class Labels63.0 M ... 167.0 F ... 037.0 M ... 341.0 M ... 4

An optimum value of β = 0.29 is calculated by usingan optimization scheme. This value might be differentfor different databases. We have performed sufficientnumber of experiments to find the most reliable set

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

9

TABLE 3Mutual Information of Selected Feature before and after

watermarking

Sr no Name of features MIO MIW ∆MI

A1 Age - - -A2 Sex 0.1175 0.1175 0A3 cp 0.3919 0.3919 0A4 trestbps 1.9705 1.9705 0A5 chol 3.8996 3.8996 0A6 fbs 0.1113 0.1113 0A7 restecg 0.1763 0.1763 0A8 thalach 3.0698 3.0698 0A9 exang 0.1396 0.1396 0A10 oldpeak 0.4373 0.4373 0A11 slope 0.1925 0.1925 0A12 ca 0.4259 0.4259 0A13 thal 0.2191 0.2191 0A14 Class Labels 0.3631 0.3631 0

of parameters for the genetic algorithm. The detailedset of GA parameters found reliable are given in table4. The preprocessing phase, takes approximately 1.7milliseconds for computing an optimum value and awatermark string for the entire medical dataset [35].Figure 3 demonstrates the time incurred in this phasefor increasing dataset sizes (number of tuples) where lwas kept fixed at 16-bits and the number of features were14.

TABLE 4Genetic Algorithm parameters

GA Parameter ValueNo. of Generations 100Population Size 50Chromosome Length 3Selection Mechanism Tournament Selection

Tournament Size = 5Crossover Type: Single Point

Fraction: 0.7Mutation Type: Uniform

Rate: 0.1Elitism Count 2

100

102

104

1060

5

10

15

20

Number of Tuples

Co

mp

uta

tio

n t

ime

(ms)

Fig. 3. Computation Time for Genetic Algorithm

4.1.2 Phase 2: Watermark EncodingWe report an example where the owner inserts a 3-bits long watermark 110 into the dataset and ζ = 10%change is incorporated into the tuple values. We use thishigh value of ζ just to make the difference between theoriginal and the marked data clear for the readers ofthis paper; however, in real practice, relatively smallervalues of ζ are used. The values of ηr to be embeddedinto the values Dr of feature A1 are calculated using

equation 6. The watermark encoding algorithm startsthe embedding process with the most significant bit(MSB) of the watermark. For this purpose, the algorithmworks with one tuple at a time. Suppose the MSB ofthe watermark is 1; therefore, the new value of Dr

(denoted by DWr) will be calculated using equation 7.

In order to embed the second MSB of the watermark,the algorithm is again run using the same procedure,but the updated value Dr of the feature A1 is used forcalculating new values of ηr and DWr

. If the algorithmencounters a bit of the watermark that is 0 then the newvalue DWr

of A1 will be calculated using equation 8.The above procedure is repeated until all tuples of thedataset are watermarked. Four tuples from the originaldata are shown in table 2 for brevity. The watermarkencoding process is explained in tables 5 - 7. These threetables have represented the manipulation of a 3-bits longwatermark string (110), on all four values of feature A1,one-by-one.

TABLE 5Watermark Encoding with Watermark Bits 110, Started

from First Bit (1)

A1 Class Labels ηr62.71 1 6.366.71 0 6.736.71 3 3.740.71 4 4.1

TABLE 6Data Encoded with 2nd Bit (1)

A1 Class Labels ηr62.42 1 6.27166.42 0 6.67136.42 3 3.67140.42 4 4.071

TABLE 7Data Encoded with the 3rd Bit (0) (DW - Watermarked

Data)

A1 Class Labels ηr62.71 1 6.24266.71 0 6.64236.71 3 3.64240.71 4 4.042

4.1.3 Phase 3: Watermark Decoding

In the decoding phase - A1 - the watermarked feature hasbeen selected after calculating MI. The embedded bits aredecoded in reverse order in which they are encoded. Thevalues of ηr are used, that are saved by the data ownerduring the watermark encoding stage in a matrix, tocalculate the value of ηdr for DWr by calculating equation9. The percentage change in data of 0.1 is used bothfor encoding and decoding. The value of ηr is used tocompute its difference from ηdr using equation 10.

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

10

The value of η∆ris compared with the value of

watermark decoding operator (having values between0 and 1) to decode the last embedded watermark bit.After decoding the last embedded bit for all the tuples,majority vote is taken and if the output of the majorityvoting process for that bit is 1 then 1 is saved as thedetected bit and the values of feature A1 in all thetuples is updated using equation 12. If the output of themajority voting is 0 then 0 is saved as the detected bitand the values of feature A1 in all the tuples is updatedusing equation 13.

Watermark decoding is performed in the reverse orderof the watermark embedding process; the last embeddedbit is decoded first and so on. Therefore, in the exampleof watermark decoding the watermark bits are decodedafter applying the majority voting scheme on every tupleof the dataset. In our case these are 110, which is thesame as the embedded watermark bits. The decodingprocess is explained in tables 8 - 10. These three tableshave shown the detection of a 3-bits long watermarkstring (110), from all four values of feature A1, one-by-one.

TABLE 8Majority Bit = 0, Detected Watermark Bit = 0

A1 Class Labels ηdr ηr η∆r Detected Bits62.42 1 6.271 6.242 0.029 066.42 0 6.671 6.642 0.029 036.42 3 3.671 3.642 0.029 040.42 4 4.071 4.042 0.029 0

TABLE 9Majority Bit = 1, Detected Watermark Bit = 1

A1 Class Labels ηdr ηr η∆r Detected Bits62.71 1 6.242 6.271 -0.029 166.71 0 6.642 6.671 -0.029 136.71 3 3.642 3.671 -0.029 140.71 4 4.042 4.071 -0.029 1

TABLE 10Majority Bit = 1, Detected Watermark Bit = 1

A1 Class Labels ηdr ηr η∆r Detected Bits63 1 6.271 6.3 -0.029 167 0 6.671 6.7 -0.029 137 3 2.671 3.7 -0.029 141 4 4.071 4.1 -0.029 1

4.1.4 Phase 4: Data RecoveryThe original data is recovered along with watermark bitsfrom watermarked data A1w. The recovered data is thesame as shown in table 2.

4.2 Robustness AnalysisIn this experimental study, the watermarked data issubjected to different types of attacks. Suppose, Alice isthe owner of the data and she embeds a watermark W in

the tuples which results in DW with the intent of owner-ship protection. Bob, the recipient, requires that the dataquality should remain intact for information extractionprocess. Alice is now willing to share watermarked dataDW with Bob. In the meantime, Mallory (the intruder)wants to corrupt the watermarked data from the datasetso that Bob is unable to detect the watermark and usethe data for knowledge extraction process.

In this study, the following assumptions are con-sidered: (1) Mallory is unable to access original data;and (2) Mallory does not have access to secret param-eters like candidate feature of the dataset, percentagedata change ζ, watermark length l, and the watermarkencoder/decoder β. These assumptions will make thetask of Mallory very difficult and she has to corruptthe watermark in such a manner that the data qualityremains intact. Therefore, in a worst case scenario, evenif she is able to successfully corrupt the original data,the proposed scheme is robust enough for its successfulrecovery. The watermark is inserted in all the tuples ofthe selected feature in the dataset, thus the watermark(and hence the original data) can be recovered fromthe remaining tuples if Mallory is somehow able tosuccessfully corrupt the watermark in some tuples.

In the robustness study, experiments were performedwith three types of attacks: (1) Insertion; (2) Deletion;and (3) Alteration. The original data recovery is reportedfor RRW only and watermark detection rate is comparedwith DEW [23], PEEW [28], and GADEW [27] techniqueswith different fractions of tuples. RRW is highly robustas compared to PEEW, GADEW and DEW techniquesin the analysis of the three types of above mentionedattacks. In RRW all the tuples of the selected featurewere watermarked to achieve robustness. GADEW wa-termarks a large number of tuples in two particularfeatures as compared to DEW and thus improves robust-ness in scenarios of attribute-wise and tuple-wise multi-faceted attacks. However, the dependency of robustnessof GADEW on AWD and TWD requires a trade-off be-tween the watermark robustness and the data distortionintroduced during watermark embedding. As a result,the decoding accuracy of GADEW degrades if datadistortions are minimized. In particular, its performanceis not satisfactory in case of heavy attacks because water-marked tuples are affected by such attacks; consequently,it becomes difficult to decode the watermark correctlyfrom the remaining un-affected tuples. On the contrary,RRW watermarks all the tuples of the selected featureand still provides a way of controlling data distortionswhile ensuring data quality after watermark embedding.

4.2.1 Insertion Attacks

In this particular type of attack, the insertion of newtuples by Mallory did not damage the data quality andwatermark information, because, she is not disturbingthe original tuples. Mallory may insert a number of αduplicate tuples or randomly generated fake tuples into

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

11

the database. RRW is resilient against these types ofattacks.

4.2.1.1 Data Recovery: When Mallory tries to in-sert 50% tuples within the range of values of the wa-termarked feature, 100% data is recovered as shown infigure 4. The reason for successful data recovery is themarking of all the tuples (majority vote) and use of ηr.

0 20 40 60 80 10099

99.5

100

100.5

101

Inserted Tuples (%)

Dat

a R

eco

very

(%

)

Fig. 4. Data Recovery with RRW after Insertion Attacks

4.2.1.2 Watermark Detection: RRW is comparedwith the most recent techniques namely DEW, GADEWand PEEW for analyzing its watermark decoding accu-racy. Watermark detection accuracy is 100% in RRW, 89%to 94% in DEW, 100% GADEW and 98% to 100% inPEEW technique as shown in figure 5.

0 20 40 60 80 10085

90

95

100

Inserted Tuples (%)

Dec

tect

ion

Acc

ura

cy (

%)

RRWGADEWDEWPEEW

Fig. 5. A Comparison of Watermark Decoding Accuracyof RRW with DEW, GADEW and PEEW after InsertionAttacks

4.2.2 Deletion AttacksIn this type of attack, Mallory might delete α subset oftuples from the watermarked database. In the decodingphase, the watermark information and original data isrecovered from the rest of the data. If Mallory deletesα tuples from the dataset and mutual information ofthe features in the database is changed, the data qualityof the features gets compromised and as a result theknowledge extraction process makes wrong decisions.Since, the attacker wants to disturb the data usefulness,α tuples will be deleted from the database such that thedata quality is unaffected. Consequently, the ranking offeatures is not disturbed after such attacks and the dataremains useful.

4.2.2.1 Data Recovery: The original data is recov-ered with more than 50% accuracy in case 50% data wasdeleted as shown in figure 6. The reason for the successof the proposed scheme is the ability of being able toembed a low distortion watermark in all the tuples andapplying a majority voting scheme.

0 20 40 60 80 1000

50

100

Deleted Tuples (%)

Dat

a R

eco

very

(%

)

Fig. 6. Data Recovery with RRW after Deletion Attacks

4.2.2.2 Watermark Detection: Our experimentsshowed that when more than 80% of the data wasdeleted, the watermark was detected with 100% accu-racy. We compared RRW with well known reversiblewatermarking techniques for detecting the watermarkinformation after such attacks. RRW has shown 100%accuracy when up to 90% tuples were deleted whileDEW, GADEW and PEEW were less accurate when alarger number of tuples were attacked. The results ofthis study are reported in figure 7.

0 20 40 60 80 10040

60

80

100

Deleted Tuples (%)

Dec

tect

ion

Acc

ura

cy (

%)

RRWGADEWDEWPEEW

Fig. 7. A Comparison of RRW against DEW, GADEW andPEEW for Resilience to Deletion Attacks

4.2.3 Alteration Attacks

In such attacks, Mallory can modify the value of thewatermarked feature within a certain range. Mallory canmake random or fixed alterations within the range ofminimum min(a) to maximum values max(a). WhenMallory alters α tuples, the watermark decoder helpsthe decoding process to successfully recover the originaldata and the watermark from unaltered tuples and someof the altered tuples if data usability is not affected afterthe attacks.

4.2.3.1 Data Recovery: RRW demonstrated morethan 65% of data recovery when half of the tuples getaltered (see figure 8).

0 20 40 60 80 10040

60

80

100

Altered Tuples (%)

Dat

a R

eco

very

(%

)

Fig. 8. Data Recovery with RRW after Alteration Attacks

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

12

4.2.3.2 Watermark Detection: Again, in the pres-ence of such attacks, RRW provided 100% accuracy with90% of attacked tuples whereas DEW, GADEW andPEEW gave less accuracy as shown in figure 9.

0 20 40 60 80 10070

80

90

100

Altered Tuples (%)

Dec

tect

ion

Acc

ura

cy (

%)

RRWGADEWDEWPEEW

Fig. 9. Watermark decoding accuracy with DEW,GADEW and PEEW after Alteration Attacks

To summarize, a comparative study of RRW withother state of the art techniques is given in table 11. It isobserved that RRW provides 100% watermark detectioneven with 90% attacked tuples while other techniques donot give such good results. As far as the data recoveryis concerned, even in the presence of heavy attacks mostof the data was recovered with high degree of accuracy.

4.3 Effect on Statistical and Classification MeasuresThe effect of RRW on data quality is evaluated withusability constraints defined in the fitness function inequation 3. For this purpose, the calculation of mutualinformation of the original data MIO and the water-marked data MIW is carried out and is shown in table 3.The proposed reversible watermarking (RRW) techniqueis also compared with PEEW technique by computingother statistical measures constrained in fitness functionsuch as mean and variance (see equation 4). Our re-ported mean and variance of the original selected featureis 54.439 and 81.697 and, the mean and variance of thewatermarked feature is 55.019 and 81.697 respectively.The reported value of variance is the same before andafter watermarking the data, while we can notice adifference of 0.58 in mean value which is negligible ascompared to that of PEEW [28] (bold in the table). Thevalues for mean and variance of some features takenfrom datasets used by RRW and PEEW before or afterapplying watermarking techniques are given in table 12.

Moreover, the effect of RRW on different classificationresults of the data, before and after watermarking thedataset is evaluated through a data mining application.Various classifiers are used for this purpose; such as,Support Vector Machine (SVM) with linear and RBFkernel functions [39], K-Nearest Neighbor (KNN), LinearDiscriminant Analysis (LDA), and Naive Bayes classifierto classify the heart disease data [35]. It is found that theclassification measures - including accuracy, sensitivityand specificity - remain almost same before and afterapplying RRW. The data is classified into 5 classes onthe basis of cardiac symptoms where 0 corresponds toabsence of heart disease and 1,2,3,4 correspond to fourdifferent types of heart diseases on the basis of the type

of chest pain including: typical angina, atypical angina,non-anginal pain, asymptomatic respectively. The clas-sification results of the data obtained through variousclassifiers before and after applying RRW are reportedin table 13.

5 CONCLUSIONS

Irreversible watermarking techniques make changes inthe data to such an extent that data quality gets compro-mised. Reversible watermarking techniques are used tocater to such scenarios because they are able to recoveroriginal data from watermarked data and ensure dataquality to some extent. However, these techniques arenot robust against malicious attacks – particularly thosetechniques that target some selected tuples for water-marking. In this paper, a novel robust and reversibletechnique for watermarking numerical data of relationaldatabases is presented. The main contribution of thiswork is that it allows recovery of a large portion ofthe data even after being subjected to malicious attacks.RRW is also evaluated through attack analysis wherethe watermark is detected with maximum decodingaccuracy in different scenarios. A number of experimentshave been conducted with different number of tuplesattacked. The results of the experimental study showthat, even if an intruder deletes, adds or alters up to 50%of tuples, RRW is able to recover both the embeddedwatermark and the original data. RRW is comparedwith recently proposed state-of-the-art techniques suchas DEW, GADEW and PEEW to demonstrate that RRWoutperforms all of them on different performance merits.

One of our future concerns is to watermark shareddatabases in distributed environments where differentmembers share their data in various proportions. We alsoplan to extend RRW for non-numeric data stores.

REFERENCES

[1] Y.-C. Liu, Y.-T. Ma, H.-S. Zhang, D.-Y. Li, and G.-S. Chen, “Amethod for trust management in cloud computing: Data coloringby cloud watermarking,” International Journal of Automation andComputing, vol. 8, no. 3, pp. 280–285, 2011.

[2] “Walmart to start sharing its sales data,”http://www.nypost.com/p/news/business/walmart-opens-up, last updated: 09:55 AM on February 4, 2012, last accessed:July, 20 2013.

[3] “Identity theft watch,” http://www.scambook.com/blog/2013/04/identity-theft-watch-customer-passwords-stolen-from-walmart-vudu-video-service/, last updated: April 11, 2013,last accessed: July, 20 2013.

[4] “Securing outsourced consumer data,”http://www.databreaches.net/securing-outsourced-consumer-data/, last updated: February 26, 2013, last accessed: July, 202013.

[5] “As patients’ records go digital,theft and hacking problems grow,”http://www.kaiserhealthnews.org/Stories/2012/June/04/electronic-health-records-theft-hacking.aspx, last updated: June03, 2012, last accessed: July, 20 2013.

[6] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, “Secure spreadspectrum watermarking for multimedia,” Image Processing, IEEETransactions on, vol. 6, no. 12, pp. 1673–1687, 1997.

[7] I. Cox, M. Miller, J. Bloom, and M. Miller, Digital watermarking.Morgan Kaufmann, 2001.

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

13

TABLE 11Robustness Analysis of RRW with GADEW, DEW, PEEW

RRW GADEW DEW PEEWData distortions Negligible Less Less LessResilience against insertion attacks 100% 89% to 94% 100% 98% to 100%Resilience against alteration attacks 100% 74% to 100% 89% to 94% 95% to 100%Resilience against deletion attacks 100% 50% to 90% 89% to 94% 76% to 100%Watermark detection with or without attacks 100% 89% 94% 95%Original data recovery without attacks 100% 89% 94% 98%Original data recovery with 50% attacks 65% 37% 40% 50%

TABLE 12Effect of RRW and PEEW on Various Statistical Measures (Mean µD, Variance σD of RRW, and Mean µd, Variance

σd of PEEW)

Features µD σD µDWσDW

µd σd µdw σdwA1 54.4389 81.6974 55.019 81.697 134.92 28787.73 134.91 28787.89A2 0.6799 0.2184 0.6799 0.2184 299.64 141878.7 299.6426 141879.09A3 3.1584 0.9218 3.1584 0.9218 2.54 0.0312 2.5442 0.03294A4 131.6898 309.7511 131.6898 309.75 54.3 478.2 54.3024 478.339A5 246.6931 2.68e+03 246.6931 2.6e+03 156.92 31006.42 156.92244 31006.526A6 0.1485 0.1269 0.1485 0.1269 35.49 204.2 35.4297 204.326A7 0.9901 0.9900 0.9901 0.9900 4.54 0.030 4.54305 3.030882A8 149.6073 523.2658 149.6073 523.2658 38.9 1201.8 38.920716 1201.8555

TABLE 13Effect of RRW on Various Classification Measures ((Accuracy Acc, Sensitivity Sn and Specificity Sp)

Classifier Classes AccO AccW SnO SnW SpO SpW

Linear SVM

0 83.44 83.82 88.12 86.87 77.86 78.961 81.84 81.85 0 0 100 1002 88.24 88.24 0 0 100 1003 88.56 88.26 2.50 5.83 100 1004 95.64 95.64 0 0 100 100

RBF SVM

0 85.17 85.23 90 90.62 79.56 79.011 81.83 81.83 0 0 100 1002 88.23 88.25 0 0 100 1003 88.24 88.23 0 0 100 1004 95.64 95.65 0 0 100 100

NB

0 84.22 83.23 88.13 86.88 78.68 78.521 57.81 57.93 53.33 52.33 60.10 60.832 73.46 73.04 82.50 81.67 72.45 72.143 76.65 75.73 89.17 82.50 74.78 74.094 77.49 76.39 80 85 76.81 76.02

KNN

0 57.21 59.97 64.37 66.87 48.79 51.871 69.69 68.31 9.66 8.67 83.05 81.452 78.40 78.53 16.66 19.17 86.5 86.283 83.85 83.23 18.33 18.33 92.71 93.134 91.92 92.92 15 10 95.42 95.42

LDA

0 83.76 83.76 86.25 86.87 79.67 80.331 59.96 60.59 59.00 57.33 62.93 62.632 71.03 71.08 78.33 64.16 71.42 72.143 76.51 75.73 60.83 61.66 75.97 76.304 80.84 81.85 50 50 81.70 81.99

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015

14

[8] P. W. Wong, “A public key watermark for image verification andauthentication,” in Image Processing, 1998. ICIP 98. Proceedings.1998 International Conference on, vol. 1. IEEE, 1998, pp. 455–459.

[9] P. W. Wong and N. Memon, “Secret and public key imagewatermarking schemes for image authentication and ownershipverification,” Image Processing, IEEE Transactions on, vol. 10, no. 10,pp. 1593–1601, 2001.

[10] F. A. Petitcolas, “Watermarking schemes evaluation,” Signal Pro-cessing Magazine, IEEE, vol. 17, no. 5, pp. 58–64, 2000.

[11] J. T. Brassil, S. Low, and N. F. Maxemchuk, “Copyright protectionfor the electronic distribution of text documents,” Proceedings ofthe IEEE, vol. 87, no. 7, pp. 1181–1196, 1999.

[12] R. Agrawal and J. Kiernan, “Watermarking relational databases,”in Proceedings of the 28th international conference on Very Large DataBases. VLDB Endowment, 2002, pp. 155–166.

[13] R. Sion, M. Atallah, and S. Prabhakar, “Rights protection for cat-egorical data,” Knowledge and Data Engineering, IEEE Transactionson, vol. 17, no. 7, pp. 912–926, 2005.

[14] S. Subramanya and B. K. Yi, “Digital rights management,” Poten-tials, IEEE, vol. 25, no. 2, pp. 31–34, 2006.

[15] P. E. Gill, W. Murray, and M. A. Saunders, “Snopt: An sqpalgorithm for large-scale constrained optimization,” SIAM review,vol. 47, no. 1, pp. 99–131, 2005.

[16] K. E. Parsopoulos and M. N. Vrahatis, “Particle swarm optimiza-tion method for constrained optimization problems,” IntelligentTechnologies–Theory and Application: New Trends in Intelligent Tech-nologies, vol. 76, pp. 214–220, 2002.

[17] R. Hassan, B. Cohanim, O. De Weck, and G. Venter, “A Com-parison Of Particle Swarm Optimization And The Genetic Algo-rithm,” in 46 th AIAA/ASME/ASCE/AHS/ASC Structures, StructuralDynamics, and Materials Conference. American Institute of Aero-nautics and Astronautics, 2005, pp. 1–13.

[18] Y.-R. Wang, W.-H. Lin, and L. Yang, “An intelligent watermarkingmethod based on particle swarm optimization,” Expert Systemswith Applications, vol. 38, no. 7, pp. 8024 – 8029, 2011.

[19] M. Kamran and M. Farooq, “An information-preserving water-marking scheme for right protection of emr systems,” Knowledgeand Data Engineering, IEEE Transactions on, vol. 24, no. 11, pp.1950–1962, 2012.

[20] T. M. Cover, J. A. Thomas, and J. Kieffer, “Elements of informationtheory,” SIAM Review, vol. 36, no. 3, pp. 509–510, 1994.

[21] T. M. Cover and J. A. Thomas, Elements of information theory.Wiley-interscience, 2012.

[22] Y. Zhang, B. Yang, and X.-M. Niu, “Reversible watermarking forrelational database authentication,” Journal of Computers, vol. 17,no. 2, pp. 59–66, 2006.

[23] G. Gupta and J. Pieprzyk, “Reversible and blind database wa-termarking using difference expansion,” in Proceedings of the 1stinternational conference on Forensic applications and techniques intelecommunications, information, and multimedia and workshop. ICST(Institute for Computer Sciences, Social-Informatics and Telecom-munications Engineering), 2008, p. 24.

[24] A. M. Alattar, “Reversible watermark using difference expansionof triplets,” in Image Processing, 2003. ICIP 2003. Proceedings. 2003International Conference on, vol. 1. IEEE, 2003, pp. I–501.

[25] G. Gupta and J. Pieprzyk, “Database relation watermarking re-silient against secondary watermarking attacks,” in InformationSystems Security. Springer, 2009, pp. 222–236.

[26] J.-N. Chang and H.-C. Wu, “Reversible fragile database water-marking technology using difference expansion based on svrprediction,” in Computer, Consumer and Control (IS3C), 2012 In-ternational Symposium on. IEEE, 2012, pp. 690–693.

[27] K. Jawad and A. Khan, “Genetic algorithm and difference ex-pansion based reversible watermarking for relational databases,”Journal of Systems and Software, 2013.

[28] M. E. Farfoura and S.-J. Horng, “A novel blind reversible methodfor watermarking relational databases,” in Parallel and DistributedProcessing with Applications (ISPA), 2010 International Symposiumon. IEEE, 2010, pp. 563–569.

[29] D. M. Thodi and J. J. Rodriguez, “Prediction-error based reversiblewatermarking,” in Image Processing, 2004. ICIP’04. 2004 Interna-tional Conference on, vol. 3. IEEE, 2004, pp. 1549–1552.

[30] D. M. Thodi and J. J. Rodriguez, “Reversible watermarking byprediction-error expansion,” in Image Analysis and Interpretation,2004. 6th IEEE Southwest Symposium on. IEEE, 2004, pp. 21–25.

[31] D. M. Thodi and J. J. Rodriguez, “Expansion embedding tech-niques for reversible watermarking,” Image Processing, IEEE Trans-actions on, vol. 16, no. 3, pp. 721–730, 2007.

[32] M. E. Farfoura, S.-J. Horng, J.-L. Lai, R.-S. Run, R.-J. Chen,and M. K. Khan, “A blind reversible method for watermarkingrelational databases based on a time-stamping protocol,” ExpertSystems with Applications, vol. 39, no. 3, pp. 3185–3196, 2012.

[33] X. Li, B. Yang, and T. Zeng, “Efficient reversible watermarkingbased on adaptive prediction-error expansion and pixel selec-tion,” Image Processing, IEEE Transactions on, vol. 20, no. 12, pp.3524–3533, 2011.

[34] E. Sonnleitner, “A robust watermarking approach for largedatabases,” in Satellite Telecommunications (ESTEL), 2012 IEEE FirstAESS European Conference on. IEEE, 2012, pp. 1–6.

[35] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan, “Biasedminimax probability machine for medical diagnosis,” in Proceed-ings of the 8th International Symposium on Artificial Intelligence andMathematics (AIM04), 2004.

[36] K. Bache and M. Lichman, “UCI machine learning repository,”2013. [Online]. Available: http://archive.ics.uci.edu/ml

[37] A. Reiss and D. Stricker, “Introducing a new benchmarked datasetfor activity monitoring,” in Wearable Computers (ISWC), 2012 16thInternational Symposium on. IEEE, 2012, pp. 108–109.

[38] M. Mitchell, “An introduction to genetic algorithms mit press,”Cambridge, Massachusetts. London, England, 1996.

[39] S. Bhatia, P. Prakash, and G. Pillai, “Svm based decision supportsystem for heart disease classification with integer-coded geneticalgorithm to select critical features,” in Proceedings of the WorldCongress on Engineering and Computer Science, WCECS, 2008, pp.22–24.

Saman Iftikhar received her M.S degree inInformation Technology in 2008 from NationalUniversity of Sciences and Technology (NUST),Islamabad, Pakistan. She has worked as Re-search Assistant in Health Life Horizon (HLH)project funded by Information and Communi-cation Technology Research and Development(ICT R&D) from 2008 to 2011 in NUST. Herresearch interests include information security,distributed computing, semantic web and cloudcomputing.

M. Kamran received the M.S. and Ph.D. degreesin computer science, in 2008 and 2012, re-spectively, from National University of Computerand Emerging Sciences (NUCES), Islamabad,Pakistan. Currently he is working as AssistantProfessor in Computer Science Department ofCOMSATS Institute of Information Technology,Wah Cantt, Pakistan. His research interests in-clude machine learning, evolutionary computa-tions techniques, data security, health informat-ics, big data analytics and decision support sys-

tems.

Zahid Anwar received his Ph.D. and M.S. de-grees in Computer Sciences in 2008 and 2005respectively from the University of Illinois atUrbana-Champaign, USA. Zahid has worked asa software engineer and researcher at IBM,New York, USA, Intel, Oregon, USA, Motorola,Chicago, USA and the National Center for Su-percomputing Applications (NCSA), Urbana, Illi-nois, USA on various projects related to informa-tion security and operating system design. Zahidholds post-doctorate experience from Concordia

University, Montreal, Canada. Currently he is a faculty member atthe School of Electrical Engineering and Computer Science, NationalUniversity of Sciences and Technology (NUST), Islamabad, Pakistan.

IEEE Transactions on Knowledge and Data Engineering val X, No : XX , 2015