
CHAPTER 2

LITERATURE REVIEW

In the previous chapter, the need for copyright protection and the evolution of digital watermarking techniques were presented, along with the motivation for and the structure of this thesis. This chapter reviews the existing watermarking techniques applied to audio. Before the literature review and the current state of the art in audio watermarking are given, the preliminaries of audio and a brief description of the properties exploited during audio watermarking are presented.

As discussed earlier, watermarking audio is far more challenging than watermarking an image or video. One of the main reasons is the wide dynamic range of the audio signal compared with other media. In addition, the Human Auditory System (HAS) is far more complex and is very sensitive to small changes in the magnitude of audio samples. The HAS perceives sounds over a wide range of frequencies, from tens of hertz to about 20 kHz, and over a range of powers spanning about twelve orders of magnitude. The HAS is also highly sensitive to additive Gaussian noise, which implies that even a small disturbance at some frequency can be audible. Moreover, the sensitivity of the ear is not the same over the whole frequency range: it is greatest between about 1 kHz and 5 kHz and decreases towards both lower and higher frequencies (Section 2.1). Also, weaker, low-amplitude sounds are masked by stronger, high-amplitude sounds when both are heard simultaneously. These principles are captured by the psychoacoustic model, which is presented next.

2.1 Psychoacoustics

The human hearing range is about 20 Hz to about 20 kHz, but hearing is most sensitive to frequencies between 1 kHz and 5 kHz. This is why two sounds at different frequencies but with the same sound pressure level are perceived as differing in loudness. Sounds at frequencies above 20 kHz are ultrasonic. The dynamic range, i.e. the ratio of the maximum audible sound amplitude to the quietest audible sound amplitude, is of the order of 120 dB [20], although people begin to feel uncomfortable above 90 dB. The decibel expresses a ratio of intensities on a logarithmic scale. The reference point is 0 dB, which is also the threshold of human hearing, i.e. the quietest sound that can be heard at a frequency of 1 kHz. This relationship is plotted as equal-loudness relations, represented by the Fletcher-Munson curves and the equal-loudness contours of ISO 226:2003.
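The decibel arithmetic behind the 120 dB figure can be checked with two small helper functions. This is a minimal sketch assuming the usual conventions: 10 log10 for intensity (power) ratios and 20 log10 for amplitude (pressure) ratios, since intensity grows with the square of amplitude.

```python
import math

def intensity_ratio_to_db(ratio):
    """Level difference in dB for a ratio of sound intensities (powers)."""
    return 10.0 * math.log10(ratio)

def amplitude_ratio_to_db(ratio):
    """Level difference in dB for a ratio of sound pressure amplitudes."""
    # Intensity is proportional to amplitude squared, hence the factor 20.
    return 20.0 * math.log10(ratio)

# A 120 dB dynamic range is an intensity ratio of 10**12,
# or equivalently an amplitude ratio of 10**6.
print(intensity_ratio_to_db(1e12))  # about 120 dB
print(amplitude_ratio_to_db(1e6))   # about 120 dB
```

The same helpers show why 90 dB, the onset of discomfort, is still a factor of 10**3 in intensity below the top of the dynamic range.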

Figure 2.1: Threshold of audibility for SPL & frequency [20]

The dotted line in the figure shows the absolute threshold of hearing. The loudness curves display the relationship between perceived loudness (in phons) and stimulus level (sound pressure level, SPL, in dB) as a function of frequency. For example, the lowest loudness curve in the figure shows the SPL a pure tone needs at different frequencies to be perceived with a loudness of 10 phons; by definition of the phon, this corresponds to 10 dB SPL at 1 kHz, whereas the threshold of hearing at 1 kHz lies at about 0 dB. If we compare a lower frequency, say 100 Hz, with 1 kHz on this 10-phon curve, we find that the required SPL is about 30 dB and 10 dB respectively. This shows that our ear is less sensitive to lower frequencies. Conversely, less stimulus is required in the frequency range of 2.5 kHz to 4 kHz; in fact, the ear canal amplifies sounds in this range of frequencies.

The threshold of hearing in dB SPL for a frequency f in Hz is given by [20]

$$ \text{Threshold}(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\, e^{-0.6\left(\frac{f}{1000} - 3.3\right)^{2}} + 10^{-3}\left(\frac{f}{1000}\right)^{4} \qquad (2.1) $$
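Equation (2.1) is easy to evaluate numerically. The sketch below uses the commonly published constant 3.64 in the first term and takes f in Hz; the formula is only defined for f > 0. Evaluating it shows the threshold dipping below 0 dB SPL around 3 to 4 kHz and rising steeply towards both ends of the hearing range, in line with the description of Figure 2.1 above.

```python
import math

def threshold_in_quiet_db(f_hz):
    """Absolute threshold of hearing in dB SPL, Eq. (2.1), with f_hz in Hz."""
    if f_hz <= 0:
        raise ValueError("frequency must be positive")
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

for f in (100, 1000, 3300, 10000, 16000):
    print(f"{f:>6} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")
```

In a watermarking context, any embedded component whose level stays below this curve at its frequency is, to a first approximation, inaudible even without a masker.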

2.1.1 Frequency Masking

Masking principles govern how one frequency component interferes with another at a given loudness level. Masking also determines the level of noise that can be tolerated before it becomes audible over the actual music. Lossy compression encoders such as those of the Moving Picture Experts Group (MPEG) or Dolby Digital remove the sounds that are masked, thus reducing the total amount of information. The general behaviour of masking can be summarized as follows:

A lower tone can effectively mask a higher tone, i.e. in the presence of a low-frequency tone, a higher-frequency tone may be inaudible.

The range of frequencies that can be masked depends on the power of the masking tone: the more powerful the masking tone, the wider the range of frequencies it affects.

Frequency masking curves depict the masking of nearby frequencies by a louder masking tone. The principle is used when an audio signal has to be encoded with fewer bits. If the audio signal can be decomposed into different frequency components, then for frequencies that are partially masked, only the audible part needs to be considered when setting the quantization noise thresholds. These properties of the HAS are exploited for embedding the watermark.
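To make the asymmetry of frequency masking concrete, the sketch below uses the Schroeder spreading function, expressed in dB over a Bark distance. This is a standard model from the perceptual coding literature that the chapter does not itself specify; it is used here only as an assumed stand-in for the masking curves described above. The masking threshold is approximated as the masker level plus the spreading value minus a fixed offset, and the 10 dB offset is a hypothetical value.

```python
import math

def spreading_db(dz):
    """Schroeder spreading function in dB; dz = maskee Bark - masker Bark."""
    t = dz + 0.474
    return 15.81 + 7.5 * t - 17.5 * math.sqrt(1.0 + t * t)

def is_masked(masker_bark, masker_db, maskee_bark, maskee_db, offset_db=10.0):
    """True if the maskee lies below the masking threshold of the masker.

    offset_db is a hypothetical masker-to-threshold offset chosen for illustration.
    """
    threshold = masker_db + spreading_db(maskee_bark - masker_bark) - offset_db
    return maskee_db < threshold

# The same 44 dB tone, one Bark above versus one Bark below a 60 dB masker at 8 Bark:
print(is_masked(8.0, 60.0, 9.0, 44.0))  # True: masked above the masker
print(is_masked(8.0, 60.0, 7.0, 44.0))  # False: audible below the masker
```

The spreading function decays more slowly towards higher Bark values, which is the quantitative form of the statement that a lower tone masks a higher tone more effectively than the reverse.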

As hearing has limited, frequency-dependent resolution, perceptual models such as the psychoacoustic model simulate the HAS as a bank of overlapping band-pass filters, each working on a particular frequency band. These filters are called critical band filters. The following table presents the frequency ranges of the critical bands used to model the HAS with a bank of 25 filters. Every band is treated as a separate entity within the frequency spectrum. The filter bandwidth is almost constant at about 100 Hz up to a frequency of 500 Hz, while at higher frequencies it increases with the centre frequency of the band [20], reaching several kilohertz in the highest bands. A frequency component in one critical band can be masked by another component within the same critical band or in other critical bands. The former is called intra-band masking and the latter inter-band masking.

Table 2.1: Critical bands and their bandwidths [20]

Band | Lower Frequency Bound (Hz) | Centre Frequency (Hz) | Higher Frequency Bound (Hz) | Bandwidth (Hz)
1 | - | 50 | 100 | -
2 | 100 | 150 | 200 | 100
3 | 200 | 250 | 300 | 100
4 | 300 | 350 | 400 | 100
5 | 400 | 450 | 510 | 110
6 | 510 | 570 | 630 | 120
7 | 630 | 700 | 770 | 140
8 | 770 | 840 | 920 | 150
9 | 920 | 1000 | 1080 | 160
10 | 1080 | 1170 | 1270 | 190
11 | 1270 | 1370 | 1480 | 210
12 | 1480 | 1600 | 1720 | 240
13 | 1720 | 1850 | 2000 | 280
14 | 2000 | 2150 | 2320 | 320
15 | 2320 | 2500 | 2700 | 380
16 | 2700 | 2900 | 3150 | 450
17 | 3150 | 3400 | 3700 | 550
18 | 3700 | 4000 | 4400 | 700
19 | 4400 | 4800 | 5300 | 900
20 | 5300 | 5800 | 6400 | 1100
21 | 6400 | 7000 | 7700 | 1300
22 | 7700 | 8500 | 9500 | 1800
23 | 9500 | 10500 | 12000 | 2500
24 | 12000 | 13500 | 15500 | 3500
25 | 15500 | 18775 | 22050 | 6550
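Table 2.1 can be used directly as a lookup. The sketch below maps a frequency to its critical band number using the upper frequency bounds listed in the table; treating 0 Hz as the lower edge of band 1, which the table does not give, is an assumption.

```python
from bisect import bisect_right

# Upper frequency bounds (Hz) of the 25 critical bands in Table 2.1.
UPPER_BOUNDS_HZ = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                   1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                   6400, 7700, 9500, 12000, 15500, 22050]

def critical_band(f_hz):
    """Return the 1-based critical band number of Table 2.1 containing f_hz."""
    if not 0 <= f_hz <= UPPER_BOUNDS_HZ[-1]:
        raise ValueError("frequency outside the range covered by Table 2.1")
    # A frequency equal to an upper bound belongs to the next band, because each
    # upper bound in the table is also the lower bound of the following band.
    return min(bisect_right(UPPER_BOUNDS_HZ, f_hz) + 1, len(UPPER_BOUNDS_HZ))

for f in (60, 1000, 4000, 20000):
    print(f, "Hz -> band", critical_band(f))
```

A lookup like this is what allows each band to be "treated as a separate entity", for example when a masking threshold is computed per band.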

However, implementing critical band filters that operate exactly over their bandwidths is difficult and almost infeasible. Instead, at the expense of the resolution required within a given critical band, efficient filter banks are implemented using quadrature mirror filter (QMF) pairs, cosine-modulated filter banks, etc. Since the range of frequencies affected by masking is broader at higher frequencies, a new frequency scale on which all critical bands have almost equal width has been derived. This unit is called the Bark, after Heinrich Barkhausen, and one Bark corresponds to the width of one critical band for any masking frequency. It is given as [20]

$$ b(f) = \begin{cases} f/100, & f < 500 \\ 9 + 4\log_2(f/1000), & f \ge 500 \end{cases} \qquad (2.2) $$

where f is in Hz. Another formula for the Bark scale is given as [20]

$$ b = 13.0\arctan(0.76 f) + 3.5\arctan\left(\frac{f^2}{56.25}\right) \qquad (2.3) $$

where f is in kHz. Conversely, the frequency corresponding to a given Bark value can be obtained as [20]

$$ F = \left[\frac{e^{0.219 b}}{352} + 0.1\right] b - 0.032\, e^{-0.15 (b - 5)^2} \qquad (2.4) $$

with F in kHz. Also, the critical bandwidth corresponding to a given centre frequency is approximated as [20]

$$ \Delta f = 25 + 75\left[1 + 1.4 f^2\right]^{0.69} \qquad (2.5) $$

where f is in kHz and Δf is in Hz.

2.1.2 Temporal Masking

Temporal masking refers to the time sensitivity of hearing: a masking tone continues to mask nearby frequencies for a short time after it has been turned off, and the minimum delay over which this happens characterizes temporal masking. The masking effect after a louder tone is referred to as post-masking, while the masking effect before a louder tone is referred to as pre-masking. In general, the louder the test tone, the less time our hearing takes to recover from the masking tone.
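Equations (2.2) to (2.5) can be evaluated directly, as in the sketch below. It takes f in Hz for Eq. (2.2) and in kHz for Eqs. (2.3) to (2.5), the units under which the formulas agree with Table 2.1: at 1 kHz, Eq. (2.2) gives 9 Bark, Eq. (2.3) about 8.5 Bark, and Eq. (2.5) a bandwidth of about 162 Hz, close to the 160 Hz listed for band 9.

```python
import math

def bark_piecewise(f_hz):
    """Eq. (2.2): critical band number in Bark, f_hz in Hz."""
    if f_hz < 500:
        return f_hz / 100.0
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)

def bark_arctan(f_khz):
    """Eq. (2.3): Bark value, f_khz in kHz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan(f_khz ** 2 / 56.25)

def bark_to_khz(b):
    """Eq. (2.4): approximate inverse of Eq. (2.3), result in kHz."""
    return (math.exp(0.219 * b) / 352.0 + 0.1) * b - 0.032 * math.exp(-0.15 * (b - 5.0) ** 2)

def critical_bandwidth_hz(f_khz):
    """Eq. (2.5): critical bandwidth in Hz around the centre frequency f_khz (kHz)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f_hz in (100, 500, 1000, 4000):
    f_khz = f_hz / 1000.0
    print(f"{f_hz:>5} Hz: Eq2.2={bark_piecewise(f_hz):5.2f} Bark, "
          f"Eq2.3={bark_arctan(f_khz):5.2f} Bark, "
          f"bandwidth={critical_bandwidth_hz(f_khz):6.1f} Hz")
```

Eq. (2.4) is only an approximate inverse: it round-trips 1 kHz to within about 2 Hz, but the mismatch grows at higher frequencies, so it should not be relied on for exact conversions.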

The following figure shows temporal masking for different sound pressure levels.

Figure 2.2: Temporal masking [21]

In audio watermarking, the deficiencies of the HAS are exploited while embedding the watermark. The psychoacoustic model expresses these deficiencies as masking thresholds in time as well as in frequency. In addition, the insensitivity of the human ear to the absolute phase of the audio signal is used. The imperceptibility requirement of audio watermarking demands that embedding the watermark should not produce any perceptible difference in the resulting audio, but embedding only in perceptually insignificant parts makes the watermark vulnerable to audio manipulation attacks. Robust schemes should therefore ideally use the perceptually significant parts for embedding, which produces artifacts in the resulting audio. These artifacts increase as the size of the watermark, i.e. the payload, increases. Hence there is always a trade-off among imperceptibility, robustness and payload. Many authors have presented these three requirements of audio watermarking as the vertices of a triangle, as shown in Figure 2.3.

Figure 2.3: Three requirements of audio watermarking (a triangle with vertices Imperceptibility, Robustness and Payload)

The problem of digital watermarking can thus be viewed as an optimization problem that tries to satisfy all three requirements in the best possible balance. Meeting all the requirements simultaneously is a major challenge, and the problem can therefore still be regarded as open. The next section classifies audio watermarking techniques based on the embedding domain, application requirements, etc.
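The trade-off between imperceptibility and embedding strength can be illustrated with the signal-to-noise ratio (SNR), the imperceptibility metric that also appears in the phase coding discussion below. In this sketch, a synthetic Gaussian cover (a stand-in for real audio) receives an additive ±1 pattern scaled by a hypothetical strength; the pattern, the strengths and the cover are all assumptions made for illustration.

```python
import math
import random

def snr_db(original, watermarked):
    """SNR of the watermarked signal relative to the original, in dB."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((y - x) ** 2 for x, y in zip(original, watermarked))
    return 10.0 * math.log10(signal_power / noise_power)

rng = random.Random(7)
cover = [rng.gauss(0.0, 1.0) for _ in range(10_000)]            # synthetic stand-in for audio
pattern = [rng.choice((-1.0, 1.0)) for _ in range(len(cover))]  # hypothetical +-1 watermark pattern

for strength in (0.01, 0.05, 0.2):
    marked = [x + strength * p for x, p in zip(cover, pattern)]
    print(f"strength={strength:<5} SNR={snr_db(cover, marked):5.1f} dB")
```

With a unit-variance cover, the SNR is close to -20 log10(strength): each tenfold increase in strength, which makes an additive watermark easier to detect after attacks, costs about 20 dB of SNR.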

2.2 Audio Watermarking Literature Survey

As with watermarking of any other media, audio watermarking can be performed in the time domain as well as in a transform domain. The categorization of audio watermarking schemes is also similar to that of image and video watermarking: schemes can be classified by the domain in which the watermark is embedded (time, a transform, or the compressed domain), by whether the original cover object is required at the time of detection or extraction, by the application requirement, etc. The following sections classify audio watermarking using these criteria:

On the basis of the application requirement.

On the basis of source requirement.

On the basis of robustness.

On the basis of the embedding strategy.

These categories are summarized pictorially in Figure 2.4.

Figure 2.4: Categorization of audio watermarking Schemes

Further, LSB modification, LSB substitution, patchwork and some spread spectrum schemes can be placed in the time domain. In transform-domain and compressed-domain schemes, the audio is first transformed into another domain, such as the frequency or cepstral domain, and watermarking is then performed. These schemes differ mainly in how they use the transformed coefficients, and they may use embedding strategies that come under spread spectrum, patchwork, etc. Hence, popular transform-domain and compressed-domain schemes are reviewed separately. The following subsections describe the categorization of audio watermarking in detail.

2.2.1 Source Requirement

Depending on the scheme, the original cover audio may or may not be required to detect or extract the watermark from a disputed copy of the audio. Accordingly, watermarking schemes are categorized into three types, namely uninformed or blind, semi-blind, and informed or non-blind watermarking schemes, which are described as follows.

2.2.1.1 Uninformed or Blind watermarking schemes:

These are watermarking schemes in which the original media is not required for extracting the watermark [21]-[29]. They rely on the extracted watermark, which is compared with the original watermark. These schemes are used in practical scenarios because it is assumed that the original cover audio is generally not available at the time of detection or extraction. Blind schemes are, however, difficult to implement.

2.2.1.2 Semi-blind watermarking schemes: These schemes require some side information for extracting or decoding the watermark [30]-[32]. The information required from the original cover audio can be the highest singular value (SV) of every audio segment, the value of the highest DCT coefficient, the highest DWT coefficient, etc. These schemes lie between the two extremes: one requires the original cover audio, while the other requires no information from it at all.

2.2.1.3 Informed or non-blind (non-oblivious) watermarking schemes: These are watermarking techniques that require the complete original media for extracting or decoding the watermark, i.e. the copyright information, etc. [33]-[39].

Among the above techniques, blind watermarking is the most popular among researchers, because the other two are impractical for many applications.

2.2.2 Robustness Requirement

This categorization is based on the requirements of the application for which watermarking is used rather than on the watermarking requirement itself. Based on the robustness (i.e. the ability to resist or counter attacks) that different applications require of the watermarking system, watermarking schemes are categorized into:

2.2.2.1 Fragile: In fragile watermarking, the watermark deteriorates as soon as a small modification is made to the watermarked audio. Thus, this type of scheme is suitable for audio authentication or for detecting tampering with the audio [40]-[48]. These schemes require that the watermark show no resistance to even small changes in the watermarked audio, whether through analog-to-digital conversion or compression.

2.2.2.2 Semi-fragile: These watermarking systems aim to provide robustness against common signal processing operations such as analog-to-digital and digital-to-analog conversion [49], [50]. The constraint is somewhat relaxed here. Applications include embedding information for broadcast monitoring, covert communication, etc.

2.2.2.3 Robust: In robust watermarking systems, the watermark remains intact even after intentional or unintentional attacks. Robust watermarking schemes are further controlled through a private or a public key. They can also be invertible or non-invertible, depending on whether the original cover audio can be reproduced from the watermarked audio. The main applications are copyright protection, source detection, destination detection, etc. [21]-[29], [51]-[55].

The invertibility of watermarking schemes can be explained through the following properties, derived from the embedding, detection and comparator functions.

Let $E$ be the embedding algorithm, $D$ the detection/extraction algorithm, $C_\sigma$ the comparator function, $A$ the original cover audio, $A'$ the watermarked audio, $R$ the recovered (possibly attacked) audio, $W$ the watermark data and $W'$ the extracted watermark data. Then:

$$ E(A, W) = A' $$

$$ D(A, A') = W' \quad \text{or} \quad D(R) = W' $$

and the comparator with threshold $\sigma$ is

$$ C_\sigma(W, W') = \begin{cases} 1, & \text{if } \rho(W, W') \ge \sigma \\ 0, & \text{if } \rho(W, W') < \sigma \end{cases} $$

where $\rho(W, W')$ denotes the similarity between the original and the extracted watermark.

A watermarking scheme $(E, D, C_\sigma)$ is invertible if:

an inverse mapping $E^{-1}$ exists such that $E^{-1}(A') = (\tilde{A}, \tilde{W})$ and $E(\tilde{A}, \tilde{W}) = A'$;

$E^{-1}$ is computationally feasible;

$\tilde{W}$ is an allowed watermark;

$A'$ and $\tilde{A}$ are perceptually similar; and

the comparator output $C_\sigma(D(\tilde{A}, A'), \tilde{W}) = 1$.

Here $\tilde{A}$ and $\tilde{W}$ are the forged cover audio and the forged watermark produced by the inverse mapping. Otherwise, the watermarking scheme is non-invertible. A watermarking scheme $(E, D, C_\sigma)$ is quasi-invertible if:

the properties of invertible schemes are met, except that re-embedding need not reproduce $A'$ exactly, i.e. $E(\tilde{A}, \tilde{W}) = \hat{A}'$ with $\hat{A}' \neq A'$; and

$\hat{A}'$ and $A'$ are perceptually similar.

Otherwise, the watermarking scheme is non-quasi-invertible. A non-invertible scheme can be quasi-invertible, and non-quasi-invertibility implies non-invertibility.
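The comparator $C_\sigma$ compares a similarity score between the original watermark W and the extracted watermark W′ against a threshold σ. The chapter does not fix the similarity measure, so the sketch below assumes normalized correlation of the bits mapped to ±1; the threshold σ = 0.75 is likewise a hypothetical value.

```python
def similarity(w, w_extracted):
    """Normalized correlation of two bit sequences mapped to +-1 (an assumed similarity measure)."""
    if len(w) != len(w_extracted) or not w:
        raise ValueError("watermarks must be non-empty and of equal length")
    s = sum((2 * a - 1) * (2 * b - 1) for a, b in zip(w, w_extracted))
    return s / len(w)

def comparator(w, w_extracted, sigma=0.75):
    """C_sigma: 1 if the similarity reaches the threshold sigma, else 0."""
    return 1 if similarity(w, w_extracted) >= sigma else 0

w = [1, 0, 1, 1, 0, 0, 1, 0]
print(comparator(w, w))                         # 1: identical watermark
print(comparator(w, [1, 0, 1, 1, 0, 0, 1, 1]))  # 1: one bit error, similarity 0.75
print(comparator(w, [0, 1, 0, 0, 1, 1, 0, 1]))  # 0: every bit flipped, similarity -1
```

Raising σ makes false positives rarer but rejects more watermarks that were degraded by attacks, so σ encodes the same robustness trade-off discussed earlier.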

2.2.3 Application

Based on the application for which audio watermarking is used, it is further categorized as follows:

2.2.3.1 Source Based

In source-based watermarking schemes, the watermark consists of the copyright information. The copyright information of the owner is embedded into the original cover audio. In the event of a dispute, the watermark, i.e. the copyright information, is extracted and the claim to ownership of the audio is established. It is worth noting that, because ownership has to be established, the embedded watermark must be robust. If the watermark is not robust, even the original owner will not be able to extract his/her copyright information from the watermarked audio, and it will be difficult to establish ownership.

2.2.3.2 Destination Based

In destination-based schemes, the source of piracy needs to be traced rather than the owner of the original creation. Suppose that the owner A of an audio file Myaudio.wav wants to sell it to persons B and C, and also wants to ensure that neither B nor C distributes it further. Preventing redistribution is difficult, but A can at least ensure that, if an illegal copy of the audio is found with some person Z, A is able to trace the source of piracy. A can watermark the original cover audio with two unique watermarks to create two watermarked copies, one each for B and C, with each watermark corresponding to B or C respectively. If the watermark extracted from the illegal copy corresponds to B or C, it is established that that person was involved in the illegal distribution of the copyrighted audio. These schemes also require that a fresh copy of the audio cannot be produced by combining the different watermarked copies. These schemes have great financial implications, as they are used for audio fingerprinting. For these schemes, too, the watermark should be robust.
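A toy version of the destination-based tracing described above: each buyer receives a copy carrying a distinct pseudo-random bit sequence, and the bits extracted from a leaked copy are matched against every buyer's sequence. Seeding by buyer name, the 64-bit length and the 0.8 agreement threshold are hypothetical choices; the embedding and extraction steps themselves are assumed to have already happened, and collusion resistance is not addressed.

```python
import random

def make_fingerprint(buyer_id, length=64):
    """Hypothetical per-buyer watermark: pseudo-random bits seeded by the buyer id."""
    rng = random.Random(buyer_id)
    return [rng.randint(0, 1) for _ in range(length)]

def bit_agreement(a, b):
    """Fraction of positions where the two bit sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def trace_source(extracted, fingerprints, sigma=0.8):
    """Return the buyer whose fingerprint best matches, or None if below sigma."""
    best = max(fingerprints, key=lambda buyer: bit_agreement(extracted, fingerprints[buyer]))
    return best if bit_agreement(extracted, fingerprints[best]) >= sigma else None

fingerprints = {name: make_fingerprint(name) for name in ("B", "C")}
leaked = list(fingerprints["C"])
for i in range(0, 64, 16):  # simulate a few bit errors caused by attacks
    leaked[i] ^= 1
print(trace_source(leaked, fingerprints))  # C
```

Because unrelated 64-bit sequences agree on only about half of their bits, a copy that still matches one buyer on most bits points clearly to that buyer, which is also why the watermark must survive attacks.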

2.2.4 Embedding Strategy

Based on the strategy used to embed the watermark, watermarking schemes are broadly categorized into low-bit coding (LSB modification or substitution), phase coding and modulation, spread spectrum, echo hiding, patchwork-based, and transform-domain or compressed-domain schemes. In time-domain watermarking schemes, the watermark is embedded directly into the original cover audio. In transform-domain schemes, the original cover audio is first transformed into the frequency domain or a joint time-frequency domain before the watermark is embedded. Common transforms used are the DCT, DWT, FFT, DFT, SVD, etc.

2.2.4.1 LSB modification or substitution schemes:

The oldest audio watermarking techniques reported in the literature are based on least significant bit (LSB) modification or substitution [56]-[61]. The principle behind this approach is that if the LSBs of individual samples or of groups of samples are modified according to the watermark bits, the difference between the watermarked and the original audio will be minimal. These schemes have a high payload and low computational complexity, but they offer low robustness even to common signal processing operations such as analog-to-digital conversion and vice versa, filtering, etc. In substitution schemes, the LSBs of individual samples are replaced by the bits of the watermark.

For example, if x is an audio segment of length l in which a watermark of length l is to be embedded, the substitution can be visualized as in Figure 2.5.

Figure 2.5: LSB substitution scheme (16-bit samples; the LSB of each sample is replaced by one watermark bit)

All LSBs are replaced by the watermarking bits.
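A minimal sketch of LSB substitution, assuming 16-bit PCM samples held as Python integers: each watermark bit overwrites bit 0 of one sample, so no sample changes by more than one quantization step, and the watermark is read back without the cover audio.

```python
def lsb_embed(samples, bits):
    """Replace the LSB of each 16-bit PCM sample with one watermark bit."""
    if len(bits) > len(samples):
        raise ValueError("watermark longer than the audio segment")
    out = list(samples)
    for i, bit in enumerate(bits):
        # Clearing and then setting bit 0 changes the sample by at most 1 LSB.
        out[i] = (out[i] & ~1) | bit
    return out

def lsb_extract(samples, n_bits):
    """Read the watermark back from the LSBs (blind: no cover needed)."""
    return [s & 1 for s in samples[:n_bits]]

cover = [100, -3, 7, 0]
marked = lsb_embed(cover, [1, 0, 1, 1])
print(marked)                  # [101, -4, 7, 1]
print(lsb_extract(marked, 4))  # [1, 0, 1, 1]
```

The payload here is one bit per sample, and the fragility is equally clear: any processing that perturbs sample values by even one step, such as requantization or filtering, scrambles the extracted bits.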

In modification-based schemes, the LSBs are modified according to a predefined rule, and each segment is used to embed one watermark bit. Let X1, X2, X3, ..., Xn be n segments of k samples each, used to embed n bits; the scheme is depicted in Figure 2.6.

Figure 2.6: LSB modification scheme (segments X1, X2, ..., Xn of 16-bit samples)

In these schemes, the parity of the LSBs of each segment is modified to embed the watermark bits. Nedeljko et al. [56] proposed an improved grey scale quantization for improving imperceptibility. The paper is mainly oriented towards reducing distortion; robustness against attacks is not discussed at all. In another method, proposed by Nedeljko [58], the watermark is embedded in the 6th LSB without much distortion and with appreciable robustness. The payload of this scheme is 176 bps. Mazdak et al. [60] proposed a method to embed single as well as multiple watermark bits in individual samples. A substitution-based scheme is used to embed the watermark in the 4th and 5th LSBs. The method further aims at reducing the amplitude difference, so that imperceptibility is improved in addition to increasing the payload of the watermarking system: all bits except the watermark bit can be changed in order to minimize the difference. Cao [61] used LSB substitution on the non-silent samples of stereo audio.
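The parity rule described above can be sketched as follows, assuming integer PCM samples and k samples per segment: if the parity of the LSBs in a segment does not already equal the watermark bit, one LSB in that segment is flipped. Flipping the first sample's LSB is a hypothetical choice made for brevity; a practical scheme might instead pick the sample whose change is least audible.

```python
def parity_embed(samples, bits, k):
    """Embed one bit per segment of k samples by forcing the parity of its LSBs."""
    if len(bits) * k > len(samples):
        raise ValueError("not enough samples for the requested payload")
    out = list(samples)
    for j, bit in enumerate(bits):
        segment = range(j * k, (j + 1) * k)
        parity = sum(out[i] & 1 for i in segment) % 2
        if parity != bit:
            # Flip the LSB of the first sample in the segment (hypothetical choice).
            out[j * k] ^= 1
    return out

def parity_extract(samples, n_bits, k):
    """Recover each bit as the parity of the LSBs of its segment."""
    return [sum(samples[i] & 1 for i in range(j * k, (j + 1) * k)) % 2
            for j in range(n_bits)]

marked = parity_embed([2, 4, 6, 8, 1, 3, 5, 7], [1, 0], k=4)
print(marked)                          # [3, 4, 6, 8, 1, 3, 5, 7]
print(parity_extract(marked, 2, k=4))  # [1, 0]
```

Compared with plain substitution, at most one sample per segment changes, at the cost of dividing the payload by k.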

Current research on LSB-based techniques focuses on shifting the embedding bit layer towards the most significant bits (MSB) [59]-[61]. Some proposed methods even increase the payload from 1 bit per sample to 2 bits per sample, i.e. for a single channel sampled at 44.1 kHz the payload becomes 88,200 bits per second (bps).

2.2.4.2 Phase coding and modulation:

These are simple schemes in which the phase of the audio signal is modified according to the watermark bits. They take advantage of the inability of the HAS to detect absolute phase and small phase differences [62]-[68]. These schemes exhibit a large signal-to-noise ratio (SNR), a metric used to measure the imperceptibility of watermarked media. The phase of the first segment is tuned according to the watermark bits, and the subsequent segments preserve their relative phase differences. The main disadvantage of these schemes is their very low payload. Another disadvantage is that the watermark data is localized in the first block, so it can be removed very easily. Security is not considered at all. Among phase-based schemes, phase modulation is popular because it carries a relatively higher payload. In phase modulation techniques, the phase of an audio segment is modulated by passing the segment through all-pass filters [67], [68]. Two all-pass filters with different poles and zeros are used to embed a 0 or a 1 watermark bit, and the watermark bits are extracted from the locations of the poles and zeros. The different phase modulation schemes differ in how they extract the locations of the poles and zeros and in the transfer functions used for the all-pass filters. Their disadvantage is reduced imperceptibility, as all segments are used to embed watermark bits. Since the phase is not retained after most attacks, most schemes based on phase coding or modulation show low robustness to attacks.
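A minimal sketch of the phase coding idea described above, on a synthetic signal: bits are coded as phases of +π/2 (bit 0) or -π/2 (bit 1) in DFT bins 1 to n of the first segment, and each later segment keeps its original magnitude spectrum and its original phase difference relative to the previous segment. The ±π/2 mapping, the bin choice, the segment length and the naive O(N²) DFT are illustrative assumptions; a practical implementation would use an FFT and smooth the phase transitions.

```python
import cmath
import math
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def phase_embed(signal, bits, seg_len):
    """Code bits as -pi/2 (bit 1) or +pi/2 (bit 0) phases in bins 1..len(bits) of segment 0."""
    n_seg = len(signal) // seg_len
    if n_seg == 0 or len(bits) >= seg_len // 2:
        raise ValueError("signal too short or too many bits for this segment length")
    specs = [dft(signal[m * seg_len:(m + 1) * seg_len]) for m in range(n_seg)]
    mags = [[abs(c) for c in s] for s in specs]
    phases = [[cmath.phase(c) for c in s] for s in specs]
    # Phase differences between consecutive segments, preserved after embedding.
    deltas = [[phases[m][k] - phases[m - 1][k] for k in range(seg_len)] for m in range(1, n_seg)]
    new_phases = [list(phases[0])]
    for i, bit in enumerate(bits):
        k = i + 1
        new_phases[0][k] = -math.pi / 2 if bit else math.pi / 2
        new_phases[0][seg_len - k] = -new_phases[0][k]  # conjugate symmetry keeps the output real
    for m in range(1, n_seg):
        new_phases.append([new_phases[m - 1][k] + deltas[m - 1][k] for k in range(seg_len)])
    out = []
    for m in range(n_seg):
        out.extend(idft([cmath.rect(mags[m][k], new_phases[m][k]) for k in range(seg_len)]))
    return out + list(signal[n_seg * seg_len:])

def phase_extract(signal, n_bits, seg_len):
    """Blind decoding: the sign of each coded phase in the first segment gives the bit."""
    spectrum = dft(signal[:seg_len])
    return [1 if cmath.phase(spectrum[k]) < 0 else 0 for k in range(1, n_bits + 1)]

rng = random.Random(3)
audio = [rng.uniform(-1.0, 1.0) for _ in range(96)]  # synthetic stand-in for audio
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = phase_embed(audio, bits, seg_len=32)
print(phase_extract(marked, len(bits), seg_len=32))
```

The sketch also makes the weaknesses noted above visible: all the information sits in the first segment, and any processing that alters phase there destroys it.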

2.2.4.3 Spread Spectrum (SS) Schemes:

These schemes were popular in the early days of watermarking. The principle behind them is to encode the watermark data by spreading it over the entire spectrum of the segmented signal, so that the distortion is kept at the lowest level [54]-[55], [69]-[77]. They exploit the insensitivity of the HAS to small changes in amplitude. The watermark can be embedded directly in the time domain as well as in a transform domain. The traditional spread spectrum scheme can be modeled through an embedding module and a detection module, as given in the following figures.

Figure 2.7: Embedding Module for SS schemes (blocks: Secret Key, Pseudo Random Number, Chip Sequence, Watermark, Cover Audio, Watermarked Audio)

The input audio segment is treated as a sequence of independent and identically distributed (IID) Gaussian samples with zero mean and variance $\sigma_x^2$. The i-th watermarked sample is given by

$$ y_i = x_i + k\, p_i\, w_i \qquad (2.6) $$

where $x_i$ is the i-th cover sample, $p_i$ is the i-th element of the pseudo-random chip sequence generated from the secret key, $w_i$ is the watermark bit (mapped to ±1) carried by the segment containing sample i, and $k$ is the embedding strength. The value of k controls the robustness of the watermarking system: a higher value of k means higher robustness, but increasing k also reduces imperceptibility. So there is a trade-off between imperceptibility and robustness, and k should be set carefully to meet the requirements of both good imperceptibility and good robustness. The distortion introduced is given by $|y - x|$.

For extraction of the watermark, a decoder computes the correlation between the watermarked audio signal and the pseudo-random sequence generated with the same key that was used at the time of embedding.

Figure 2.8: Detection Module for SS schemes (blocks: Watermarked Audio, Secret Key, Pseudo Random Number, Chip Sequence, Correlation Detector, Extracted Watermark)

These schemes offer a modest payload and show appreciable robustness to attacks, as well as secure behaviour. Their disadvantage is interference from the original signal, because of which imperceptibility is often compromised. In addition, even in a closed-loop scenario, i.e. without any attack, complete retrieval of the watermark data is not guaranteed.
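A minimal time-domain sketch of Eq. (2.6), written as y_i = x_i + k p_i w_i, together with the correlation detector. It assumes a synthetic IID Gaussian cover with unit variance, ±1 chips drawn from Python's seeded PRNG as the key-dependent pseudo-random sequence, one watermark bit per segment, and a hypothetical strength k = 0.5. With 1024 chips per bit, the host-interference term in the correlation has a standard deviation of about 32, while the watermark term is 0.5 × 1024 = 512, so decoding is reliable here; smaller k or shorter segments let the host interference dominate.

```python
import random

def chip_sequence(key, length):
    """Pseudo-random +-1 chips generated from a secret key (hypothetical PRNG choice)."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def ss_embed(cover, bits, key, seg_len, k=0.5):
    """y_i = x_i + k * p_i * w_i, with one watermark bit (as +-1) per segment."""
    if seg_len * len(bits) > len(cover):
        raise ValueError("cover too short for the requested payload")
    chips = chip_sequence(key, seg_len * len(bits))
    marked = list(cover)
    for j, bit in enumerate(bits):
        w = 1.0 if bit else -1.0
        for i in range(j * seg_len, (j + 1) * seg_len):
            marked[i] += k * chips[i] * w
    return marked

def ss_detect(marked, n_bits, key, seg_len):
    """Correlation detector: the sign of sum(y_i * p_i) over a segment gives the bit."""
    chips = chip_sequence(key, seg_len * n_bits)
    bits = []
    for j in range(n_bits):
        corr = sum(marked[i] * chips[i] for i in range(j * seg_len, (j + 1) * seg_len))
        bits.append(1 if corr > 0 else 0)
    return bits

rng = random.Random(42)
cover = [rng.gauss(0.0, 1.0) for _ in range(4 * 1024)]  # IID Gaussian stand-in, sigma_x = 1
bits = [1, 0, 0, 1]
marked = ss_embed(cover, bits, key=2024, seg_len=1024)
print(ss_detect(marked, len(bits), key=2024, seg_len=1024))  # expected [1, 0, 0, 1]
```

The correlation sum contains the cover term sum(x_i p_i), which is the host-signal interference that the improved spread spectrum scheme discussed next is designed to cancel.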

To improve the quality of the watermarked audio, Malvar et al. [55] proposed an improved spread spectrum scheme, popularly called the improved spread spectrum (ISS) watermarking scheme, which removes the host signal as a source of interference. Although spread spectrum schemes are popular in the literature, the watermarked audio they produce is still not fully imperceptible. This is because of the chip sequence: even if the power of the chip sequence is reduced, the watermark remains perceptible. These schemes are also vulnerable to watermark estimation attacks and are not suitable for multiple-watermark embedding applications. Spread spectrum schemes are also used in the transform domain, where suitable transform coefficients are used to embed the watermark. The challenge is to modify the transform coefficients in such a way that audio imperceptibility and robustness remain intact, which has not yet been attained, as modifying even a few coefficients reduces the audio quality.

Further, as these schemes use a detection-based strategy, they are not suitable for real-time applications.

2.2.4.4 Echo hiding based schemes:

The basic idea of using an echo to embed watermark bits was given by Bender in 1996 [78]. In echo hiding schemes, the watermark bits are added through echoes with different delays [78]-[84]: two different echo delays represent bit 0 and bit 1. The HAS is subject to temporal as well as frequency masking: if two signals differ in amplitude and are close in time, the higher-amplitude signal masks the presence of the lower-amplitude signal. This deficiency of the HAS is exploited in echo-based watermarking methods. The two delays are generally of the order of a thousandth of a second. The watermark bit is added through two different echo kernels, and the strength of the echo is controlled by a scalar. Echo-based watermarking schemes show good imperceptibility for smaller values of the scalar, as the echo is tuned according to the psychoacoustic model of the HAS, but robustness is then reduced; higher values of the scalar reduce imperceptibility. In addition, the original audio signal is not required for extracting the watermark, which makes these techniques blind and suitable for practical applications: only the delays corresponding to bit 0 and bit 1 are required. In the early schemes, the watermark was embedded in the form of a single large echo, which resulted in poor imperceptibility [78]. Later echo hiding schemes improved imperceptibility by using multiple small echoes with different delays to represent the watermark [81]-[89].

Further, to enhance imperceptibility, Chen et al. [81] used positive and negative echoes. Kim [86] used forward and backward echo kernels for watermarking, and his scheme achieves a higher detection rate than the previous schemes. In the early stage no security was involved in these schemes, but later schemes use frequency hopping, scrambling, etc. to introduce security. Echo hiding methods differ in the echo kernel used and in the number of echoes. A larger number of echoes with the same strength increases robustness for the same level of imperceptibility. The payload of echo-based watermarking methods depends on the psychoacoustic shaping.
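The sketch below illustrates echo hiding on a synthetic white-noise signal: each segment receives a single echo whose delay (d0 or d1 samples) encodes the bit, and the decoder compares the segment's autocorrelation at the two candidate delays. Bender's original decoder uses the cepstrum; the autocorrelation comparison here is a simpler stand-in that works for a white cover. The delays (about 1.1 ms and 1.7 ms at 44.1 kHz), the echo scalar alpha and the segment length are hypothetical values, and segment boundaries are handled crudely, without the mixing and smoothing a real scheme needs.

```python
import random

def echo_embed(cover, bits, seg_len, d0=50, d1=75, alpha=0.5):
    """Add one echo per segment: delay d0 codes bit 0 and delay d1 codes bit 1."""
    out = list(cover)
    for j, bit in enumerate(bits):
        start = j * seg_len
        d = d1 if bit else d0
        # The echo only uses samples from the same segment.
        for n in range(start + d, start + seg_len):
            out[n] = cover[n] + alpha * cover[n - d]
    return out

def echo_detect(marked, n_bits, seg_len, d0=50, d1=75):
    """Blind decoding: compare the segment autocorrelation at the two candidate delays."""
    bits = []
    for j in range(n_bits):
        seg = marked[j * seg_len:(j + 1) * seg_len]
        r0 = sum(seg[n] * seg[n - d0] for n in range(d0, seg_len))
        r1 = sum(seg[n] * seg[n - d1] for n in range(d1, seg_len))
        bits.append(1 if r1 > r0 else 0)
    return bits

rng = random.Random(11)
cover = [rng.gauss(0.0, 1.0) for _ in range(4 * 2048)]  # white-noise stand-in for audio
bits = [0, 1, 1, 0]
marked = echo_embed(cover, bits, seg_len=2048)
print(echo_detect(marked, len(bits), seg_len=2048))
```

Lowering alpha makes the echo less audible but shrinks the autocorrelation peak relative to the cover's own fluctuations, which is the imperceptibility versus robustness trade-off described above.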

2.2.4.5 Patchwork based watermarking Schemes:

The principle behind these schemes is to select two segments, or patches, with the same statistical properties (mean, etc.) and then modify the samples of the two patches in opposite directions to embed a watermark bit [90]-[94]. At detection time, the expected value of the difference in the means (or other statistics) reveals the watermark bit. Bender [90] gave the core idea of patchwork schemes and applied it to image watermarking. Arnold [91] extended the scheme to audio watermarking and modified the original by applying it in the transform domain instead of the spatial domain used by Bender. Further, he used a multiplicative approach instead of the additive approach used by Bender for modifying the samples. Successful detection requires the difference between the patch statistics to be large relative to their variance, which implicitly requires the patches to be long. Subsequent research by Kalantari et al. embeds the watermark bits without increasing the variance; this is called the modified patchwork algorithm [92]. In this method, the wavelet transform is used, and only those audio segments whose patches fit the suitability criteria, i.e. similar statistical properties, are selected; segments that do not meet the criteria are rejected. However, Kalantari did not give a procedure for identifying such audio segments at the detection side.

Natgunanathan [94] proposed patchwork-based schemes for mono as well as stereo audio. For mono audio, each audio segment is divided into two sub-segments and the DCT is applied to them. Frame pairs are constituted from the coefficients in a given frequency range. Frame pairs are selected through a selection criterion, and embedding is done by modifying the DCT coefficients; the selection is also controlled by a security key. Watermarking of stereo audio exploits the property that the channels of stereo audio are similar.

Since patchwork methods assume that the statistical properties of the two patches selected for embedding are the same, which is not true in practice, these schemes suffer from false detection. The payload also varies, as it depends on the number of patch pairs with comparable means, etc.
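A minimal additive patchwork sketch in the spirit of Bender's scheme described above: for each bit, a secret key selects two disjoint random patches A and B of samples; patch A is shifted by +delta and patch B by -delta for bit 1 (the reverse for bit 0), and the detector reads the sign of the difference of the patch means. The patch length, delta and the Gaussian stand-in cover are hypothetical; Arnold's transform-domain multiplicative variant and the selection criteria of [92] and [94] are not reproduced here.

```python
import random

def patch_indices(key, n_samples, patch_len, n_bits):
    """Disjoint pseudo-random patches (A_j, B_j) for each bit, selected by a secret key."""
    if 2 * patch_len * n_bits > n_samples:
        raise ValueError("not enough samples for the requested patches")
    rng = random.Random(key)
    idx = rng.sample(range(n_samples), 2 * patch_len * n_bits)
    patches = []
    for j in range(n_bits):
        chunk = idx[2 * patch_len * j: 2 * patch_len * (j + 1)]
        patches.append((chunk[:patch_len], chunk[patch_len:]))
    return patches

def patchwork_embed(cover, bits, key, patch_len=1000, delta=0.5):
    out = list(cover)
    for bit, (a, b) in zip(bits, patch_indices(key, len(cover), patch_len, len(bits))):
        s = delta if bit else -delta
        for i in a:
            out[i] += s  # move patch A one way ...
        for i in b:
            out[i] -= s  # ... and patch B the opposite way
    return out

def patchwork_detect(marked, n_bits, key, patch_len=1000):
    """Blind decoding from the sign of mean(A) - mean(B) for each bit's patches."""
    bits = []
    for a, b in patch_indices(key, len(marked), patch_len, n_bits):
        diff = sum(marked[i] for i in a) / len(a) - sum(marked[i] for i in b) / len(b)
        bits.append(1 if diff > 0 else 0)
    return bits

rng = random.Random(8)
cover = [rng.gauss(0.0, 1.0) for _ in range(8000)]  # Gaussian stand-in for audio samples
bits = [1, 0, 1]
marked = patchwork_embed(cover, bits, key=99)
print(patchwork_detect(marked, len(bits), key=99))
```

With 1000 samples per patch, the natural difference of the patch means has a standard deviation of about 0.045 while the embedded shift is ±1.0; shorter patches shrink this margin, which is why the patches must be long.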

2.2.4.6 Transform domain watermarking schemes:

The two modules of transform domain audio watermarking schemes, i.e. the embedding module and the extraction/detection module, are represented pictorially in Figures 2.9 and 2.10.

Figure 2.9: Transform Domain Embedding Module

[Figure 2.9 blocks: Original Cover Audio → Transformation (DCT/FFT/DFT) → Select regions of embedding → Watermark embedded in the selected transform domain regions → Inverse Transformation (IDCT/IFFT/IDFT) → Watermarked Audio]

Transform domain audio watermarking schemes differ mainly in the transform used, the type of transformed coefficients used (low frequency, high frequency, mid-band frequency, etc.), the method for finding appropriate coefficients, the number of coefficients used, and the strategy for embedding the watermark. Selecting low frequency coefficients for watermark embedding gives robustness but reduces imperceptibility. Conversely, selecting high frequency coefficients gives good imperceptibility, but the watermark does not survive even low pass filtering and re-quantization.
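The two modules of Figures 2.9 and 2.10 can be sketched as below. This is a generic illustration under stated assumptions, not a scheme from the literature. The DCT is the transform, and the "selected region" is a fixed mid band of each 512-sample frame. Embedding adds a key-seeded ±1 pattern scaled by `strength`. Extraction correlates the same region with the pattern and applies a threshold.

```python
import numpy as np

FRAME = 512
BAND = slice(64, 192)  # hypothetical "selected region": a mid band of each frame's DCT


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: forward is C @ x, inverse is C.T @ X."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def key_pattern(key: int) -> np.ndarray:
    """Secret-key pseudo-random +/-1 sequence, one entry per selected coefficient."""
    return np.random.default_rng(key).choice([-1.0, 1.0], size=BAND.stop - BAND.start)


def embed(audio: np.ndarray, key: int, strength: float) -> np.ndarray:
    c, w = dct_matrix(FRAME), key_pattern(key)
    out = audio.astype(float)                # astype returns a copy
    for s in range(0, len(out) - FRAME + 1, FRAME):
        coeffs = c @ out[s:s + FRAME]        # transformation (DCT)
        coeffs[BAND] += strength * w         # embed in the selected region
        out[s:s + FRAME] = c.T @ coeffs      # inverse transformation (IDCT)
    return out


def detect(audio: np.ndarray, key: int, threshold: float):
    """Correlate the selected region with the key pattern and compare to a threshold."""
    c, w = dct_matrix(FRAME), key_pattern(key)
    scores = [np.mean((c @ audio[s:s + FRAME])[BAND] * w)
              for s in range(0, len(audio) - FRAME + 1, FRAME)]
    score = float(np.mean(scores))
    return score > threshold, score
```

Moving `BAND` toward the lowest or the highest indices is where the robustness/imperceptibility tradeoff described above enters this sketch.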


Figure 2.10: Transform Domain Extraction Module

[Figure 2.10 blocks: Watermarked Audio (possibly attacked) → Transformation (DCT/FFT/DFT) → Selected regions of embedding in the transform domain → Watermark Extraction (with Secret key) → Extracted watermark (possibly distorted) → Threshold decision (yes/no)]

Although the audio signal is used in the transform domain in some of the techniques discussed in the previous categories, we place these schemes in a separate category. Typical transforms used in audio watermarking are the Discrete Cosine Transform (DCT) [28],[95]-[100],[51]-[53], Discrete Wavelet Transform (DWT) [28],[101]-[104], Discrete Fourier Transform (DFT) [32],[46],[94],[105], Discrete Sine Transform (DST), Fractional Fourier Transform (FRFT) [106], Fast Fourier Transform (FFT) [107]-[108], Singular Value Decomposition (SVD) [109]-[114], cepstrum [45],[80], etc. Here we limit the discussion to the DCT and the SVD, which are the transforms used in this research work. The DCT is an important transform that made its mark in image watermarking, and its lower complexity compared to other transforms distinguishes it from them. Schemes that embed data in the DCT domain differ in principle in the number of coefficients taken for embedding, the type of coefficients (low, high or middle frequency; AC or DC), the embedding methodology, and the way coefficients are chosen so as to produce minimum distortion and maximum robustness.

Z. Zhou [51] proposed a robust DCT based scheme in which the watermark bits are embedded by quantizing the DCT coefficients. W. Youngqi [52] proposed a DCT based audio watermarking scheme using a synchronization signal embedded in the low frequency components. The scheme of H. Xiong et al. [53] uses DCT coefficients for embedding, but selects them in a non-uniform manner to enhance security. Xia Zhang et al. [95] used a double DCT method in one of their schemes, intended for transmission of the audio through an air channel: embedding is done on the coefficients obtained by applying the DCT to the low frequency coefficients of the first-level DCT, and the authors claim moderate robustness against attacks. Their second scheme uses the low frequency DCT coefficients for watermark embedding, with a Barker code as the synchronization code for robustness against synchronization attacks. Q. Gou et al. [98],[99] also give a DCT based scheme that they claim is robust, especially against analog-to-digital and digital-to-analog conversion in air channel transmission applications. Chang proposed a DCT domain technique that modifies the low frequency DCT coefficients for watermark embedding and also uses a Barker code for synchronization. K. Ren et al. [115] proposed a DCT and DWT based algorithm that uses a color image as the watermark and claim a higher payload than contemporary DCT based techniques; the watermark bits are scrambled with the Arnold transform and embedded in the low frequency DWT coefficients. Suresh et al. [116] used a DCT and SVD based approach for embedding, extended it to DWT and SVD, compared the two approaches, and claim the DCT based method to be more robust.
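Embedding by quantizing DCT coefficients, as in the scheme of Zhou [51], is commonly realized with quantization index modulation (QIM). The source does not say which quantization rule [51] uses. The sketch below is a generic scalar QIM on chosen DCT coefficients of one frame, and the step Δ (`delta`) and coefficient positions are illustrative assumptions.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: forward is C @ x, inverse is C.T @ X."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def qim(c: float, bit: int, delta: float) -> float:
    """Nearest point of the lattice for `bit`: multiples of delta (bit 0), shifted by delta/2 (bit 1)."""
    offset = 0.0 if bit == 0 else delta / 2
    return float(delta * np.round((c - offset) / delta) + offset)


def qim_decode(c: float, delta: float) -> int:
    """Pick the bit whose lattice point lies closest to c."""
    return 0 if abs(c - qim(c, 0, delta)) <= abs(c - qim(c, 1, delta)) else 1


def embed_bits(frame, bits, positions, delta):
    c = dct_matrix(len(frame))
    coeffs = c @ frame
    for bit, p in zip(bits, positions):
        coeffs[p] = qim(coeffs[p], bit, delta)   # quantize the chosen coefficient
    return c.T @ coeffs


def extract_bits(frame, positions, delta):
    coeffs = dct_matrix(len(frame)) @ frame
    return [qim_decode(coeffs[p], delta) for p in positions]
```

The DCT matrix is orthonormal. Any added noise e with ‖e‖₂ < Δ/4 therefore moves each coefficient by less than Δ/4 and cannot flip a bit. A larger Δ buys robustness at the cost of more distortion.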

The DCT based schemes proposed in the literature use either the low frequency components or the DC coefficients for watermark embedding, and do not consider the energy of the individual blocks. This is a problem, since low energy blocks are susceptible to being removed altogether by an attack without disturbing the cover audio. Also, little work has been done on analyzing the effect of common signal processing operations, including mp3 conversion, on the different DCT coefficients and DCT blocks.
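Taking block energy into account, as suggested above, can be sketched as a simple block ranking. The top-`fraction` rule and the block length are assumptions for illustration. This section only argues that block energy should be considered.

```python
import numpy as np


def block_energies(x: np.ndarray, block_len: int) -> np.ndarray:
    """Energy of each non-overlapping block. For an orthonormal DCT this equals the
    energy of the block's DCT coefficients (Parseval), so no transform is needed here."""
    n = len(x) // block_len
    return np.sum(x[: n * block_len].reshape(n, block_len) ** 2, axis=1)


def select_blocks(x: np.ndarray, block_len: int, fraction: float = 0.5) -> np.ndarray:
    """Indices of the highest-energy blocks (hypothetical rule: keep the top `fraction`)."""
    e = block_energies(x, block_len)
    k = max(1, int(round(fraction * len(e))))
    return np.sort(np.argsort(e)[::-1][:k])


# Quiet, loud, quiet, louder: only the two loud blocks are kept for embedding.
x = np.concatenate([0.01 * np.ones(256), np.ones(256), 0.01 * np.ones(256), 2 * np.ones(256)])
print(select_blocks(x, 256))   # [1 3]
```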


Like the DCT, the SVD found its place in audio watermarking after it was successfully applied to image watermarking. Since the SVD operates on matrices, the audio is first rearranged into a two-dimensional matrix before the SVD is applied.

The SVD of an $N \times N$ matrix A is defined by [111]-[114]

$A = U S V^{T}$ (2.7)

The SVD factorizes a matrix into three matrices: the orthogonal matrices U and V and the diagonal matrix S. All entries of S are zero except those on the diagonal, which are non-negative and arranged in descending order. The non-zero diagonal entries of S are called the singular values.

The columns of U are called the left singular vectors and the columns of V the right singular vectors of A. The columns of U and V are orthonormal eigenvectors of $AA^{T}$ and $A^{T}A$, respectively, and the singular values in S are the square roots of the eigenvalues of $AA^{T}$ (equivalently, of $A^{T}A$), in descending order.
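The definitions above can be checked numerically with NumPy. Note that `numpy.linalg.svd` returns V^T rather than V. The sketch uses 1024 random samples as a stand-in for an audio frame and rearranges them row-wise into a 32 × 32 matrix. How audio is mapped onto a matrix is a design choice that varies between schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.standard_normal(1024)   # stand-in for 1024 audio samples
A = audio.reshape(32, 32)           # rearrange the samples into an N x N matrix (row-wise, assumed)

U, s, Vt = np.linalg.svd(A)         # s holds the singular values (the diagonal of S)
S = np.diag(s)

assert np.allclose(U @ S @ Vt, A)                      # A = U S V^T, equation (2.7)
assert np.allclose(U.T @ U, np.eye(32))                # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(32))              # V is orthogonal
assert np.all(np.diff(s) <= 0)                         # singular values in descending order
eig = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]       # eigenvalues of A A^T, descending
assert np.allclose(np.sqrt(np.clip(eig, 0, None)), s)  # singular values = sqrt(eigenvalues)
```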

SVD is used mostly for image watermarking, and only a few researchers have applied it to audio watermarking. SVD based watermarking techniques fall into two groups: non-blind techniques, which use the original signal for watermark extraction/detection, and blind techniques, which do not require the original.

The problem with non-blind schemes is that information about the original signal has to be carried until the authentication process is complete. In semi-blind schemes, partial information about the original signal has to be carried. In some SVD based schemes, both the unitary matrices and the singular value matrix must be kept until the watermark is extracted. In most SVD based techniques, the watermark is embedded by manipulating the singular values according to the watermark bit. Wang and Healy [114] used a reduced singular value decomposition method that uses the unitary matrix for watermark bit embedding. Some schemes also combine the DWT and SVD for watermark embedding and extraction.
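A minimal blind variant of singular value embedding is sketched below. It assumes 256-sample frames reshaped into 16 × 16 matrices. One bit per frame is carried by quantizing the largest singular value with step Δ (`delta`). This is one common pattern chosen for illustration. It is not the reduced-SVD method of Wang and Healy [114], which works on the unitary matrix.

```python
import numpy as np

SHAPE = (16, 16)  # assumed: 256 samples per frame, rearranged row-wise into a 16 x 16 matrix


def embed_bit_sv(frame: np.ndarray, bit: int, delta: float) -> np.ndarray:
    """Quantize the largest singular value onto the lattice for `bit` (QIM on sigma_1)."""
    U, s, Vt = np.linalg.svd(frame.reshape(SHAPE))
    offset = 0.0 if bit == 0 else delta / 2
    s[0] = delta * np.round((s[0] - offset) / delta) + offset
    # Assumes sigma_1 stays the largest value; otherwise the blind extractor reads the wrong SV.
    return (U @ np.diag(s) @ Vt).reshape(-1)


def extract_bit_sv(frame: np.ndarray, delta: float) -> int:
    """Blind extraction: needs only delta, not the original audio or its U, S, V."""
    s1 = np.linalg.svd(frame.reshape(SHAPE), compute_uv=False)[0]
    r = (s1 / delta) % 1.0                 # position of sigma_1 within one quantization cell
    return 0 if min(r, 1.0 - r) <= abs(r - 0.5) else 1
```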

The problem with watermarking schemes using the SVD is that the SVD matrices themselves are directly exposed to attack. No security key is involved to hide which singular values are used for watermark embedding. Since small changes in the singular values do not affect the perceptibility of the audio signal, an intruder can manipulate the same singular values. The requirement that an intruder/attacker who knows the watermarking scheme should still not have access to the watermarking locations is therefore not met by schemes using the SVD. The use of singular values for watermark embedding rests on two observations. First, a slight change in the singular values does not disturb the transparency of the image or audio. Second, the singular values do not change prominently when the image or audio is subjected to common signal processing operations. SVD based audio watermarking algorithms exploit this property to add the watermark information to the singular values of the diagonal matrix S, or to the columns of the unitary matrices. This is done in such a way that imperceptibility/inaudibility is preserved and the robustness requirements of effective digital audio watermarking are met. SVD based methods differ in which singular values (SVs) they use and in the methodology by which embedding is done using the SVs.
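Both halves of this argument can be seen numerically. By Weyl's inequality, no singular value of A + E differs from the corresponding singular value of A by more than the spectral norm ‖E‖₂. A small processing distortion therefore barely moves the SVs. The same computation needs no key, so an attacker obtains exactly the singular values an embedder would modify. The sketch assumes random stand-in data and uses a small Gaussian perturbation in place of a real signal processing operation.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((32, 32))            # frame rearranged as a matrix (stand-in data)
E = 1e-3 * rng.standard_normal((32, 32))     # small additive "processing" distortion (assumed)

s_clean = np.linalg.svd(A, compute_uv=False)
s_noisy = np.linalg.svd(A + E, compute_uv=False)
shift = np.abs(s_noisy - s_clean)
bound = np.linalg.norm(E, 2)                 # spectral norm (largest singular value) of E

# Weyl's inequality: no singular value moves by more than ||E||_2.
assert np.all(shift <= bound + 1e-12)
print(f"largest SV shift {shift.max():.2e} <= ||E||_2 = {bound:.2e}")
```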

2.2.5 Compressed Domain Schemes

In compressed domain techniques, watermarking is done on compressed audio. Since audio is mostly posted on the internet in compressed form, more and more researchers are attracted towards watermarking compressed audio. Qiao [120] proposed two audio watermarking schemes for MPEG encoded audio. The first scheme uses the header of the MPEG audio file to embed the watermark. The second uses the MPEG encoded samples, of which only a few are used for embedding to satisfy the imperceptibility requirement. The disadvantage of this approach is its weakness against re-quantization and noise addition attacks. D. K. Koukopoulos [121] also proposed a blind digital watermarking scheme for MPEG audio layer 3 (mp3) files, in which the watermark is embedded directly in the compressed audio by manipulating the scale factors. The scheme is claimed to overcome the disadvantage of schemes operating on PCM coded audio, which are vulnerable to compression/recompression attacks. Rade Petrovic [122] proposed a scheme in which multiple copies of the audio are produced prior to AAC compression, and a single watermark bit is embedded in each copy. After perceptual compression using the MDCT, a multiplexer selects the compression unit according to the code.

Neuber et al. [123] watermarked the AAC MPEG-2 bit stream, using Huffman coding and de-multiplexers to obtain the frequency information. The problem with this approach is that imperceptibility is not guaranteed, because precise perceptual information is unavailable at the embedding side. Cheng et al. [124] also proposed a scheme for AAC audio in which the watermark is embedded directly into the quantization indices, and an enhanced spread spectrum based scheme is used to improve payload and robustness. The watermarked audio was robust against compression attacks, but experiments checking robustness against other signal processing attacks were not conducted or reported.

2.2.6 Miscellaneous Schemes

Although a number of methods have been proposed and implemented, none of them attempts to build a watermarking system that can be used for any generalized application. The literature makes clear that much has been done to improve the perceptual quality of the watermarked signal and the robustness of the watermark. For watermark detection, a few schemes use Support Vector Machines (SVM) [125]; the principle is to identify the watermark bit correctly through the training and testing phases of classification. S. D. Larbi et al. [18] used audio watermarking as a tool for making a signal stationary over a short duration, which can serve as a pre-processing step for many applications. Over a very short duration, approximately 20 ms, an audio signal is treated as stationary, which can be used for analysis. Simulation results with two kinds of signals, test signals and audio signals, show a significant stationarity enhancement of short segments containing transient attacks. Since transient attacks are more prominent in music signals, the enhancement is largely limited to them. Since the watermarking process is not intended for copyright protection, the robustness of the watermark was not checked against signal processing. Nakashima et al. [17] proposed a new application area in which audio watermarking helps deter camcorder piracy of movies in the movie hall itself. The system consists of a position estimation system, which itself contains a watermarking system, that tries to find the exact position of the pirate, i.e. the seat number of the pirate in the movie hall. Future work lies in generating the watermark from the biometric features of the person to whom the multimedia data originally belongs. Also, Time Scale Modification is considered the worst attack as far as audio watermarking is concerned. Much of the current work therefore deals with synchronization, in addition to improving imperceptibility and robustness against other attacks.

2.3 Identified Issues

On the basis of the literature review, it can be stated that the main issue in audio watermarking, as in watermarking schemes for any other type of cover object, is to make the watermarked object (which carries extra embedded information) robust to attacks while maintaining imperceptibility. This issue becomes more serious in audio watermarking because of the sensitivity of the HAS. The requirements of audio watermarking contradict each other: robustness requires the watermark to be embedded in the prominent portion of the audio so that it cannot be removed by attacks, but this reduces imperceptibility. Also, as the watermark embedding density (i.e. the payload, expressed in bits per second (bps)) increases, imperceptibility decreases. An optimal tradeoff therefore has to be maintained among imperceptibility, robustness and payload, and this remains an open problem. Some additional issues identified are as follows.
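Payload and imperceptibility can be quantified as in the sketch below. Payload follows directly from the definition above (bits per second). The signal-to-noise ratio (SNR) is assumed here as the imperceptibility measure, since this section does not fix one. The 44.1 kHz rate and 512-sample block are illustrative.

```python
import numpy as np


def payload_bps(sample_rate: float, samples_per_bit: int) -> float:
    """Embedding density when each watermark bit occupies `samples_per_bit` samples."""
    return sample_rate / samples_per_bit


def snr_db(original: np.ndarray, watermarked: np.ndarray) -> float:
    """SNR of the watermarked audio, treating the embedding change as noise."""
    noise = watermarked - original
    return float(10 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2)))


print(payload_bps(44100, 512))          # 86.1328125 bps for one bit per 512-sample block
x = np.ones(1000)
print(round(snr_db(x, x + 0.01), 2))    # 40.0
print(round(snr_db(x, x + 0.02), 2))    # 33.98: doubling the embedding change costs about 6 dB
```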

Issue 1: Although DCT based watermarking schemes have low embedding complexity, using the low frequency coefficients or the DC coefficients as the watermarking locations leads to lower imperceptibility. Attention needs to be given to the use of selected frequency coefficients and to a better embedding strategy that balances imperceptibility and robustness. A watermark embedded in a single coefficient may not survive attacks, whereas a group of coefficients used for data embedding has a higher probability of being robust. These schemes also need improvement to carry variable payloads, for adjustability, with security imposed.
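The claim that a group of coefficients is more robust than a single one can be illustrated with repetition and majority voting. This is one simple realization assumed for illustration: QIM per coefficient, five coefficients, and step Δ = 0.4. It is not necessarily the strategy adopted in this thesis.

```python
import numpy as np


def qim(c: float, bit: int, delta: float) -> float:
    """Nearest point of the lattice for `bit` (multiples of delta, shifted by delta/2 for bit 1)."""
    offset = 0.0 if bit == 0 else delta / 2
    return float(delta * np.round((c - offset) / delta) + offset)


def decode_one(c: float, delta: float) -> int:
    return 0 if abs(c - qim(c, 0, delta)) <= abs(c - qim(c, 1, delta)) else 1


def embed_group(coeffs, bit: int, delta: float) -> np.ndarray:
    """Carry the same bit in every coefficient of the group (repetition)."""
    return np.array([qim(c, bit, delta) for c in coeffs])


def decode_group(coeffs, delta: float) -> int:
    votes = [decode_one(c, delta) for c in coeffs]
    return int(sum(votes) > len(votes) / 2)          # majority vote over the group


group = np.random.default_rng(6).standard_normal(5)  # stand-in for 5 selected DCT coefficients
marked = embed_group(group, 1, delta=0.4)
attacked = marked.copy()
attacked[2] += 0.2                                   # attack shifts one coefficient by delta/2
print(decode_one(attacked[2], 0.4), decode_group(attacked, 0.4))   # 0 1
```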

Issue 2: Uninformed destination based watermarking schemes, which are mostly used for audio fingerprinting, require multiple copies of the audio to be watermarked with different watermarks. They also require a higher payload capacity in order to carry multiple pieces of information, i.e. owner information, distributor information, etc. However, estimation attacks try to remove the watermark by analyzing multiple watermarked copies. There is a need to develop improved uninformed destination based watermarking schemes with high payload. Such schemes should simultaneously combat estimation of the watermark, unintentional mp3 compression, and direct manipulation of the watermarked samples or coefficients used for embedding copyright information within an audio.
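Estimation attacks threaten multi-copy schemes, and a toy averaging attack shows why. The assumptions are deliberately simple: additive ±1 watermarks of equal strength, independent across copies, and a colluder who averages K = 8 copies. Averaging leaves only the mean of the K watermarks in the attacked audio, so each individual recipient's mark is considerably weakened.

```python
import numpy as np

rng = np.random.default_rng(4)
K, N = 8, 8192
host = rng.standard_normal(N)                                        # stand-in host audio
marks = [0.05 * rng.choice([-1.0, 1.0], size=N) for _ in range(K)]  # one watermark per recipient
copies = [host + w for w in marks]

# Averaging (estimation/collusion) attack: combine K differently watermarked copies.
attacked = np.mean(copies, axis=0)
residual = attacked - host            # what is left of the watermarks: the mean of the K marks

single = np.linalg.norm(marks[0])
left = np.linalg.norm(residual)
print(f"residual watermark norm ratio: {left / single:.3f}")  # roughly 1/sqrt(K) for independent marks
```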

Issue 3: Very little work has been reported on analyzing the effect of mp3 compression on watermarked audio. Schemes in which watermarking is done on already compressed audio are prone to format change attacks. A watermarking scheme is needed that can be applied to uncompressed audio and is robust to mp3 compression. It should be based on a study of the effect of mp3 compression on the individual audio blocks used to embed the watermark bits.

Issue 4: Watermarking schemes in the literature use arbitrary images or pseudo random numbers as the watermark. Patenting or claiming copyright on the arbitrary images, or on the key(s) used to generate the pseudo random numbers that mainly serve as watermarks, is difficult. Such schemes therefore become unacceptable when the watermark itself becomes public property, i.e. can be used by anyone. The watermark itself may also be used to mislead ownership: to defame a person, an arbitrary watermark used by an owner can be used by a malicious person to watermark other people's creations. When the same watermark is used by many individuals, claiming copyright on that watermark, and ultimately on the cover media, becomes very difficult. Little attention has been paid to the need for a unique watermark, preferably one generated from biometric features, to combat such ambiguous situations and the defamation of an individual.
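The watermark generation module itself is not described at this point. The following is only a hypothetical sketch of the idea. It assumes a biometric feature vector is already available (feature extraction is not shown). The features are coarsely rounded and hashed with SHA-256 together with an owner identifier. A plain cryptographic hash changes completely if any rounded feature changes. A practical module would therefore need an error-tolerant construction, which this sketch ignores.

```python
import hashlib

import numpy as np


def watermark_from_features(features, owner_id: str, n_bits: int = 64) -> np.ndarray:
    """Hypothetical: derive owner-specific watermark bits from a biometric feature vector."""
    quantized = np.round(np.asarray(features, dtype=np.float64), 2)  # coarse rounding (assumed)
    digest = hashlib.sha256(owner_id.encode("utf-8") + quantized.tobytes()).digest()  # 256 bits
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
    return bits[:n_bits]
```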

2.4 Thesis Objectives

Based on the literature review and the identified issues, along with the main issue of watermarking, the thesis objectives are oriented towards improving uninformed source and destination based watermarking schemes with respect to imperceptibility, robustness, security and payload. The proposed watermarking schemes are source based, i.e. for ownership detection, as well as destination based, i.e. for pirate detection. The DCT and SVD transforms are used for embedding, as they are well accepted in the watermarking domain. Robustness against common signal processing attacks, along with the compression attack, is a must, since audio is provided in networked environments with minimum bandwidth in compressed form. A module that generates a unique watermark for owner authentication and for tracing the pirate also appears to be an utmost requirement.

The objectives of the thesis are summarized below.

Objective 1: The first objective is to develop improved uninformed audio watermarking schemes using selected frequency DCT coefficients. These schemes should carry a variable payload with good imperceptibility and be robust to compression attacks in addition to common signal processing attacks. This objective covers Issue 1.

Objective 2: The SVs obtained from the SVD inherently show some robustness to attacks, and small changes in the SVs do not make a perceptible change in the cover audio. The second objective is to develop improved uninformed, secure audio watermarking schemes using the SVD that can carry a high payload and are robust to compression attacks and to direct manipulation of the SVs. This addresses Issue 2.

Objective 3: Robustness against compression attacks, especially the mp3 attack, requires blocks within an audio on which compression has the least effect and which are also robust to other common signal processing attacks. So, the third objective is to identify such blocks by analyzing the effect of mp3 compression at different compression rates. It also involves developing an embedding strategy on the individual blocks that improves robustness. This resolves Issue 3.

Objective 4: To deal with Issue 4, a unique watermark generation module is required that is capable of producing a unique watermark. The unique watermark should be able to combat the problems that arise from common, ambiguous watermarks.

2.5 Summary

The literature survey shows that watermarking is a present-day need. The majority of watermarking techniques work in the transform domain, and many techniques are still being proposed on different frequency transforms. For watermark insertion, psychoacoustic models that account for temporal as well as frequency masking are used, which increases the complexity of the watermarking technique. Many techniques embed the watermark directly in the LSBs of the samples using substitution. LSB techniques have a high payload but low robustness, especially against re-sampling, simple filtering and re-quantization attacks. The recent trend in LSB substitution and embedding techniques is to shift the watermark bit layer from the LSBs towards the MSBs. This is done without introducing distortion, or while introducing only distortion that is imperceptible to human ears. Still, LSB techniques need to be modified to make them robust against attacks. User specific data needs to be embedded as the watermark. A lot of research is going on to make watermarked audio robust to compression, mainly the mp3 attack.
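LSB substitution, and its weakness against re-quantization, can be shown on a few 16-bit samples. This is a minimal sketch assuming int16 PCM samples and one bit per sample. `requantize_to_8bit` is a hypothetical stand-in for a re-quantization attack.

```python
import numpy as np


def lsb_embed(samples: np.ndarray, bits) -> np.ndarray:
    """Substitute the least significant bit of the first len(bits) int16 samples."""
    out = samples.copy()
    b = np.asarray(bits, dtype=np.int16)
    out[: len(b)] = (out[: len(b)] & ~np.int16(1)) | b   # clear the LSB, then set it to the bit
    return out


def lsb_extract(samples: np.ndarray, n: int) -> list:
    return (samples[:n] & 1).tolist()


def requantize_to_8bit(samples: np.ndarray) -> np.ndarray:
    """Drop the 8 lowest bits, a crude 16-bit -> 8-bit -> 16-bit re-quantization."""
    return (samples >> 8) << 8


audio = np.array([100, -3, 7, 0, 32767, -32768], dtype=np.int16)
bits = [1, 0, 0, 1, 0, 1]
marked = lsb_embed(audio, bits)
print(lsb_extract(marked, 6))                      # [1, 0, 0, 1, 0, 1]
print(lsb_extract(requantize_to_8bit(marked), 6))  # [0, 0, 0, 0, 0, 0]
```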

Schemes that use any type of information, such as an image or a pseudo random number, as copyright information become unacceptable when the watermark itself becomes public property. Patenting copyright on an arbitrary image or on the key of a pseudo random number generator is difficult.

SVD, DST, DWT and FRFT are all used to compact energy into a few transform coefficients, but not all transforms work equally well for audio watermarking. The DCT also compacts the signal energy into fewer transformed coefficients and has proved to be a good transform for both energy compaction and audio watermarking. The low frequency DCT coefficients carry an appreciable amount of energy, and changes to them lead to greater distortion. Modifying the high frequency components does not produce much distortion, but robustness is questionable: a simple filtering operation may remove the watermark data completely. So there is a tradeoff between robustness and imperceptibility in selecting the coefficients. The mid-band frequency coefficients may show robustness to common signal processing operations such as analog-to-digital conversion, digital-to-analog conversion, re-sampling, re-quantization, etc. However, even distorting these coefficients to a small extent for watermark embedding may make a perceptible change in the resulting audio.
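Energy compaction and the low/high frequency tradeoff can be illustrated with a synthetic signal built from two DCT basis waveforms. This assumption is chosen so the answer is exact. Nearly all energy lands in one low-frequency coefficient. The weak high-frequency component at coefficient 200 is the kind of location that favors imperceptibility. It would be wiped out by any low-pass filter with a cutoff below that coefficient.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix: forward is C @ x, inverse is C.T @ X."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


N = 256
n = np.arange(N)


def cosine(k: int) -> np.ndarray:
    """Unnormalized k-th DCT-II basis waveform."""
    return np.cos(np.pi * k * (2 * n + 1) / (2 * N))


# Strong low-frequency content plus a weak high-frequency component.
x = cosine(3) + 0.05 * cosine(200)
X = dct_matrix(N) @ x

energy = X ** 2
assert np.isclose(energy.sum(), np.sum(x ** 2))      # orthonormal DCT preserves energy
low_share = energy[:32].sum() / energy.sum()         # share in the lowest 1/8 of coefficients
print(f"{low_share:.4f}")                            # 0.9975
```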
