CHAPTER 2
LITERATURE REVIEW
The previous chapter presented the need for copyright protection, the evolution of digital
watermarking techniques, and the motivation and structure of the thesis. This chapter reviews
the existing watermarking techniques applied to audio. Before the literature review and the
current state of the art in audio watermarking, the preliminaries of audio and a brief account of
the properties that are exploited during audio watermarking are presented.
As discussed earlier, watermarking of audio is far more challenging than watermarking an
image or video. One of the main reasons is the wide dynamic range of the audio signal compared
to the others. In addition, the Human Auditory System (HAS) is far more complex and is very
sensitive to small changes in the magnitude of audio samples. The HAS perceives sounds over a
wide range of frequencies, from the order of hertz to the order of kilohertz. In terms of power,
this range spans many orders of magnitude (powers of 10). The sensitivity of the HAS to additive
Gaussian noise is high as well, which implies that a small disturbance at some frequency can be
audible to the ear. The sensitivity of the human ear is also not the same across the frequency
range: as discussed in Section 2.1, it is highest between about 1 kHz and 5 kHz and keeps on
decreasing as the frequency becomes higher. Also, weaker, low amplitude sounds are masked by
stronger, high amplitude sounds when both are heard simultaneously. These principles are captured
by the psychoacoustic model, which is presented next.
2.1 Psychoacoustics
The human hearing range is about 20 Hz to about 20 kHz, but the ear is most sensitive to
frequencies between 1 kHz and 5 kHz. This is the reason why two sounds at different frequencies
but with the same sound pressure level are perceived as differing in loudness. Sound at
frequencies above 20 kHz is ultrasonic. The dynamic range, i.e. the ratio of the maximum audible
sound amplitude to the quietest audible sound amplitude, is of the order of 120 dB [20], but
people begin to become uncomfortable above 90 dB. The decibel unit represents the ratio of
intensities on a logarithmic scale. The reference point is 0 dB, which is also the threshold of
human hearing, i.e. the quietest sound that can be heard at a frequency of 1 kHz. The relationship
between frequency and perceived loudness is plotted through equal-loudness relations, represented
by the Fletcher-Munson curves and the equal-loudness contours of ISO 226:2003.
Figure 2.1: Threshold of audibility for SPL & frequency [20]
The dotted line in the figure shows the absolute threshold of hearing. The loudness curves
display the relationship between perceived loudness (in phons) and the stimulus sound
volume (sound pressure level in dB) as a function of frequency. For example, the lowest curve
in the figure corresponds to a perceived loudness of 10 dB across the different frequencies. At
1 kHz, a pure tone is just audible at 0 dB. If we compare the stimulus required on the 10 dB
curve at a lower frequency, say 100 Hz, with that required at 1 kHz, we find that the required
sound pressure levels are 30 dB and 10 dB respectively. This shows that our ear is less sensitive
to lower frequencies. The stimulus required in the frequency range of 2.5 kHz to 4 kHz is also
lower; in fact, the ear canal amplifies sound in this range of frequencies.
The threshold of hearing in dB for a frequency f in Hz is given by [20]
$$\mathrm{Threshold}(f) = 3.63\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 10^{-3}\,(f/1000)^{4} \tag{2.1}$$
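As a minimal sketch, equation (2.1) can be evaluated numerically to see how the threshold varies with frequency. The function name `hearing_threshold_db` is introduced here only for illustration, and the frequency is assumed to be supplied in Hz, since the formula itself divides by 1000.

```python
import math


def hearing_threshold_db(f_hz: float) -> float:
    """Threshold of hearing in dB at frequency f_hz (in Hz), per equation (2.1)."""
    k = f_hz / 1000.0  # the formula operates on the frequency expressed in kHz
    return (3.63 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


# Evaluate the curve at a few illustrative frequencies (chosen arbitrarily).
for f in (100, 1000, 3300, 10000):
    print(f"{f:>6} Hz -> {hearing_threshold_db(f):6.2f} dB")
```

The values fall from 100 Hz towards the 3 kHz region and rise again at 10 kHz, which agrees with the statement that the ear is most sensitive in the low kilohertz range.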
2.1.1 Frequency Masking
The interference of one frequency with another at a given loudness level is governed by
masking principles. Masking also determines the level of noise that can be tolerated before the
actual music can no longer be heard. Lossy compression encodings such as Moving Picture Experts
Group (MPEG) audio or Dolby Digital remove the sounds that are masked, thus reducing the total
size of the information. The general situation regarding masking is as follows:
• A lower tone can effectively mask a higher tone, i.e. in the presence of a low frequency tone
the higher frequency tone is inaudible.
• The range of frequencies that can be masked depends on the power of the masking
tone. The greater the power of the masking tone, the wider its impact on the frequencies.
Frequency masking curves depict the masking of nearby frequencies by a louder
masking tone. The principle is used when the audio signal has to be encoded with fewer bits.
If the audio signal can be decomposed into different frequency components, then for
the frequencies that will be partially masked, only the audible part needs to be used for setting
the quantization noise thresholds. These properties of the HAS are exploited for embedding the
watermark.
As hearing has limited, frequency dependent resolution, perceptual models such as the
psychoacoustic model simulate the HAS as a bank of overlapping band pass filters, with every
filter working on a particular frequency band. These filters are called critical band filters. The
following table presents the frequency ranges of the critical bands used to model the HAS
through a 25 band filter bank. Every individual band is treated as a separate entity within the
frequency spectrum. The filter bandwidth is almost constant at about 100 Hz up to a frequency of
500 Hz, while for higher frequencies it increases with the centre frequency of the band [20]. The
bandwidth goes as high as 4 kHz for higher frequencies. A frequency component or sound in one
critical band can be masked by another frequency component within the same critical band or in
other critical bands. The former is called intra-band masking and the latter is known as
inter-band masking.
Table 2.1: Critical band and their bandwidths [20]
Band   Lower Frequency   Centre Frequency   Higher Frequency   Bandwidth
       Bound (Hz)        (Hz)               Bound (Hz)         (Hz)
1      -                 50                 100                -
2      100               150                200                100
3      200               250                300                100
4      300               350                400                100
5      400               450                510                110
6      510               570                630                120
7      630               700                770                140
8      770               840                920                150
9      920               1000               1080               160
10     1080              1170               1270               190
11     1270              1370               1480               210
12     1480              1600               1720               240
13     1720              1850               2000               280
14     2000              2150               2320               320
15     2320              2500               2700               380
16     2700              2900               3150               450
17     3150              3400               3700               550
18     3700              4000               4400               700
19     4400              4800               5300               900
20     5300              5800               6400               1100
21     6400              7000               7700               1300
22     7700              8500               9500               1800
23     9500              10500              12000              2500
24     12000             13500              15500              3500
25     15500             18775              22050              6550
However, implementing critical band filters that work exactly within their frequency bandwidths
is difficult and almost infeasible. Instead, at the expense of the resolution required in a
given critical band, efficient filter bank implementations are built from quadrature mirror filter
(QMF) pairs, cosine modulated filter banks etc. Since the range of frequencies affected by
masking is broader for higher frequencies, a new frequency scale on which all critical bands have
almost equal width is derived. The unit of this scale is called the Bark, after Heinrich
Barkhausen. One Bark corresponds to the width of one critical band for any masking frequency, and
the critical band number is given as [20]
$$\text{Critical band number (Bark)} = \begin{cases} f/100, & f < 500 \\ 9 + 4\log_{2}(f/1000), & f \ge 500 \end{cases} \tag{2.2}$$
where f is in Hz.
Another formula for the Bark scale is given as [20]
$$b = 13.0\,\arctan(0.76\,f) + 3.5\,\arctan(f^{2}/56.25) \tag{2.3}$$
Conversely, the frequency corresponding to a given Bark value can be obtained as follows [20]
$$f = \left[\frac{\exp(0.219\,b)}{352} + 0.1\right] b - 0.032\,\exp\!\left[-0.15\,(b-5)^{2}\right] \tag{2.4}$$
Also, the critical bandwidth corresponding to a given centre frequency is approximated as [20]
$$df = 25 + 75\left[1 + 1.4\,f^{2}\right]^{0.69} \tag{2.5}$$
where f is in kHz and df is in Hz.
2.1.2 Temporal Masking
The time sensitivity of hearing, i.e. the time for which a masking tone continues to mask the
nearby frequencies after being turned off, is termed temporal masking. The masking effect
after a louder tone is referred to as post masking, while the masking effect prior to a louder
tone is referred to as pre masking. In general, the louder the test tone, the less time it takes
for our hearing to get over hearing the masking tone.
The following figure shows the temporal masking for different sound pressure levels.
Figure 2.2: Temporal masking [21]
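Returning to the Bark scale of Section 2.1.1, the conversions of equations (2.2), (2.3) and (2.5) can be sketched as below. The function names are introduced only for this illustration. Equation (2.2) is taken with f in Hz, equation (2.5) with f in kHz as stated in the text, and equation (2.3) is assumed here to use kHz as well.

```python
import math


def bark_piecewise(f_hz: float) -> float:
    """Critical band number from equation (2.2); f_hz is in Hz."""
    if f_hz < 500:
        return f_hz / 100.0
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)


def bark_arctan(f_khz: float) -> float:
    """Critical band number from equation (2.3); f_khz assumed to be in kHz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan(f_khz ** 2 / 56.25)


def critical_bandwidth_hz(f_khz: float) -> float:
    """Critical bandwidth in Hz from equation (2.5); f_khz is in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


# Compare the two Bark formulas and the bandwidth at a few centre frequencies.
for f_hz in (100, 500, 1000, 4000):
    print(f_hz, round(bark_piecewise(f_hz), 2),
          round(bark_arctan(f_hz / 1000.0), 2),
          round(critical_bandwidth_hz(f_hz / 1000.0), 1))
```

At 1 kHz the bandwidth formula gives roughly 162 Hz, close to the 160 Hz listed for the band centred at 1000 Hz in Table 2.1.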
In audio watermarking, the deficiencies of the HAS are exploited while embedding the watermark.
The psychoacoustic model describes these deficiencies in the form of masking thresholds in time
as well as in frequency. In addition, the insensitivity of the human ear to the absolute phase of
the audio signal is used. The imperceptibility requirement of audio watermarking demands that
embedding the watermark should not produce any perceptible difference in the resultant audio, but
hiding the watermark only in imperceptible components makes it vulnerable to audio manipulation
attacks. For robustness, watermarking schemes should therefore ideally use the perceptually
significant parts for embedding the watermark, which in turn produces artifacts in the
resultant audio. These artifacts increase as the size of the watermark, i.e. the payload,
increases. So there is always a tradeoff between imperceptibility, robustness and payload. Many
authors have presented these three requirements of audio watermarking as the vertices of a
triangle, as shown in figure 2.3.
Figure 2.3: Three requirements of audio watermarking (imperceptibility, robustness and payload)
The problem of digital watermarking can thus be viewed as an optimization problem that tries
to meet all three requirements in an optimized way. It is a big challenge to meet all the
requirements simultaneously; therefore, it can still be seen as an open problem. The next
section gives the different classifications of audio watermarking techniques based on the
embedding domain, application requirement etc.
2.2 Audio Watermarking Literature Survey
Like watermarking on any other media, audio watermarking can be done in the time
domain as well as in a transform domain. The categorization of audio watermarking
schemes is also similar to that of image and video watermarking: it may be based upon the domain
in which the watermark is embedded (time, transform or compressed domain), the
requirement of the original cover object at the time of detection or extraction, the application
requirement etc. The following sections classify audio watermarking schemes using the following
criteria:
• On the basis of the application requirement.
• On the basis of source requirement.
• On the basis of robustness.
• On the basis of the embedding strategy.
This categorization is represented pictorially in figure 2.4.
Figure 2.4: Categorization of audio watermarking Schemes
Further, LSB modification, LSB substitution, patchwork, some spread spectrum
schemes etc. can be placed in the time domain. In the transform domain and compressed domain
schemes, the audio is first transformed into some other domain, such as the frequency or cepstral
domain, and then watermarking is done. These schemes vary mainly in how the
transformed coefficients are used, and they may use embedding strategies that come under spread
spectrum, patchwork etc. We have therefore reviewed some of the popular transform domain
and compressed domain schemes separately. The following subsections give the audio
watermarking categorization in detail.
2.2.1 Source Requirement
The original cover audio may or may not be required for detecting or extracting the watermark
from the disputed copy of the audio. Depending upon this requirement, the watermarking schemes
are categorized into three types, namely uninformed or blind, semi-blind, and informed or
non-blind watermarking schemes, which are described as follows.
2.2.1.1 Uninformed or Blind watermarking schemes:
These are the watermarking schemes in which the original media is not required for the
extraction of the watermark [21]-[29]. They depend only upon the extracted watermarks, which
are compared with the original watermarks. These schemes are used in practical scenarios, as it is
assumed that the original cover audio is generally not available at the time of detection or
extraction. The blind schemes are difficult to implement.
2.2.1.2 Semi-blind watermarking schemes: These are the schemes which require some
information from the original cover audio for the extraction or decoding of the watermark
[30]-[32]. The information required can be, for example, the highest singular value (SV) of every
audio segment, or the values of the highest DCT coefficient, the highest DWT coefficient etc.
These schemes can be considered to lie between the two extremes: schemes that require
the original cover audio and schemes that require no information at all from the
original cover audio.
2.2.1.3 Informed or Non-Blind (non-oblivious) watermarking schemes: These are the
watermarking techniques which require the complete original media for the extraction or
decoding of the watermark, i.e. the copyright information etc. [33]-[39].
Among the above mentioned techniques, blind watermarking is the most popular
among researchers because the other two are impractical for many applications.
2.2.2 Robustness Requirement
This categorization is based on the requirement of the application for which watermarking is used
rather than on the watermarking requirement itself. On the basis of the robustness (i.e. the
ability to resist or counter attacks) of the watermarking system required for different
applications, the watermarking schemes are categorized into
2.2.2.1 Fragile: In fragile watermarking, the watermark deteriorates as soon as a small
modification is made to the watermarked audio. This type of watermarking scheme is therefore
suitable for audio authentication or for detecting tampering of the audio [40]-[48]. These
schemes require that the watermark should not show any resistance to even small changes in the
watermarked audio, whether through analog to digital conversion or compression.
2.2.2.2 Semi Fragile: These watermarking systems aim at giving robustness against common
signal processing operations such as analog to digital and digital to analog conversion etc.
[49][50]. The fragility constraint is slightly relaxed here. The applications include embedding
information for broadcast monitoring, covert communication etc.
2.2.2.3 Robust: In robust watermarking systems the watermark remains intact even after
intentional or unintentional attacks. Further, robust watermarking schemes are controlled
through a private or a public key. Robust watermarking schemes can also be invertible or
non-invertible, depending upon whether or not the original cover audio can be reproduced from the
watermarked audio. The main applications are copyright protection, source detection,
destination detection etc. [21]-[29], [51]-[55].
The invertibility of watermarking schemes can be explained through the following properties
derived from the encoding, decoding and comparator functions.
If E is the embedding algorithm, D is the detection/extraction algorithm, C is the comparator
function, A is the original cover audio, A' is the watermarked audio, R is the recovered attacked
audio, W is the watermark data and W' is the extracted watermark data, then:
$E(A, W) = A'$
$D(A', A) = W'$ or $D(R) = W'$
Comparator $C_\sigma$, where $c(W, W')$ denotes the similarity measured between W and W' and
$\sigma$ is the threshold:
$C_\sigma(W, W') = 1$ if $c(W, W') \ge \sigma$
$C_\sigma(W, W') = 0$ if $c(W, W') < \sigma$
A watermarking scheme $(E, D, C_\sigma)$ is invertible if:
• An inverse mapping $E^{-1}$ exists such that $E^{-1}(A') = (\tilde{A}', W')$ and $E(\tilde{A}', W') = A'$;
• $E^{-1}$ is computationally feasible;
• $W'$ is an allowed watermark;
• $A'$ and $\tilde{A}'$ are perceptually similar; and
• the comparator output $C_\sigma(D(A', \tilde{A}'), W') = 1$.
Otherwise the watermarking scheme is non-invertible. A watermarking scheme $(E, D, C_\sigma)$ is
quasi-invertible if:
• the properties for invertible watermarking schemes are met;
• $E(\tilde{A}', W') = \tilde{A}' \neq A'$; and
• $\tilde{A}'$ and $A'$ are perceptually similar.
Otherwise, the watermarking scheme is non-quasi-invertible. A non-invertible scheme can be
quasi-invertible, and non-quasi-invertibility implies non-invertibility.
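The thresholded comparator used in the invertibility definitions above can be sketched as follows. The similarity measure is not fixed by the text, so this sketch assumes, purely for illustration, that it is the fraction of matching bits between the original and the extracted watermark; the function name and the default threshold of 0.9 are likewise hypothetical.

```python
def comparator(w, w_extracted, sigma=0.9):
    """Return 1 if the similarity of the two watermarks reaches sigma, else 0.

    Similarity is taken here to be the fraction of matching bits, which is an
    assumption of this sketch and not a measure prescribed by the text.
    """
    if len(w) != len(w_extracted) or not w:
        raise ValueError("watermarks must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(w, w_extracted))
    return 1 if matches / len(w) >= sigma else 0


original = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
slightly_damaged = original[:-1] + [1 - original[-1]]  # one bit flipped
print(comparator(original, original))          # identical watermarks
print(comparator(original, slightly_damaged))  # 9 of 10 bits match
```

Other similarity measures, such as normalized correlation, could be substituted without changing the thresholding step.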
2.2.3 Application
On the basis of the application for which the audio watermarking is used it is further categorized
into the following:
2.2.3.1 Source Based
In source based watermarking schemes the watermark comprises the copyright information.
The copyright information corresponding to the owner is embedded in the original cover audio.
At the time of a dispute, the watermark, i.e. the copyright information, is extracted and the
claim to ownership of the audio is established. Needless to say, because the
claim to ownership is to be established, the embedded watermark must be robust. If the
watermark is not robust, even the original owner will not be able to extract his/her copyright
information from the watermarked audio, and it will be difficult to establish ownership.
2.2.3.2 Destination Based
In destination based schemes, the source of piracy is to be traced rather than the owner
of the original creation. Let us say the owner A of an audio file Myaudio.wav wants to sell the
audio to persons B and C, and also wants to ensure that neither B nor C is able to distribute it
further. Preventing redistribution of the audio is difficult, but what A can at least ensure is
that if an illegal copy of the audio is found with a person Z, A is able to trace the source of
piracy. A can watermark the original cover audio with two unique watermarks to create two
watermarked audios, one each for B and C. Each of the watermarks corresponds to B or C. So, if
the watermark extracted from the illegal audio corresponds to either B or C, it will be
established that this person was involved in the illegal distribution of the copyrighted audio.
These schemes also require that it should not be possible to produce a fresh audio from the
different watermarked copies of the audio. These schemes also have great financial implications,
as they are used for audio fingerprinting. For these schemes, too, the watermark should be robust.
2.2.4 Embedding Strategy
On the basis of the strategy used for embedding the watermark, the watermarking schemes are
broadly categorized into low bit coding (LSB modification or substitution), phase coding and
modulation, spread spectrum, echo hiding, patchwork based, and transform domain or compressed
domain schemes. In time domain watermarking schemes, the watermark is embedded directly in the
original cover audio. In transform domain schemes the original cover audio is first
transformed into either the frequency domain or a joint time and frequency domain before
embedding the watermark. The common transforms used are DCT, DWT, FFT, DFT, SVD etc.
2.2.4.1 LSB modification or substitution schemes:
The oldest audio watermarking techniques reported in the literature are based on least
significant bit (LSB) modification or substitution [56]-[61]. The principle behind these
schemes is that if the LSBs of an individual sample or a group of samples are modified according
to the watermarking bits, the difference between the watermarked and the original audio will be
minimal. These schemes have a high payload and low computational complexity, but low robustness
even to common signal processing operations such as analog to digital conversion and vice versa,
filtering etc. In substitution schemes, the LSBs of the individual samples are replaced by the
bits of the watermark.
For example, if x is a segment of audio of length l in which a watermark of length l is to
be embedded, then the substitution can be visualized through figure 2.5.
Figure 2.5: LSB substitution scheme (samples of 16 bit resolution; the LSB of each sample is replaced by a watermark bit)
All LSBs are replaced by the watermarking bits.
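The substitution step described above can be sketched in a few lines. This is only an illustration under simple assumptions: samples are plain Python integers standing in for 16 bit PCM values, one watermark bit is placed in each sample, and the function names and sample values are invented for the example.

```python
def embed_lsb(samples, bits):
    """Replace the LSB of each sample with the corresponding watermark bit."""
    if len(bits) > len(samples):
        raise ValueError("watermark is longer than the audio segment")
    marked = list(samples)
    for i, bit in enumerate(bits):
        # Clear the least significant bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | (bit & 1)
    return marked


def extract_lsb(samples, n_bits):
    """Read the watermark back from the LSBs of the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


# Illustrative 16 bit sample values and watermark bits (not taken from real audio).
segment = [1200, -347, 15, 0, 32767, -32768, 511, 42, -1, 8]
watermark = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

marked = embed_lsb(segment, watermark)
print(extract_lsb(marked, len(watermark)) == watermark)
print(max(abs(a - b) for a, b in zip(segment, marked)))  # largest sample change
```

Each sample changes by at most one quantization step, which is why the distortion is minimal, and also why any requantization or filtering destroys the watermark.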
In modification based schemes, the LSBs are modified according to some
predefined rule, and each individual segment is used to embed one watermarking bit.
Let X1, X2, X3, ..., Xn be the n segments, with k samples each, which are to be
used for embedding n bits; the scheme is shown pictorially in figure 2.6.
Figure 2.6: LSB modification scheme (segments X1, X2, ..., Xn of 16 bit samples)
In these schemes, basically the parity of the LSBs of each segment is modified to
embed the watermarking bit. Nedeljko et al. [56] proposed improved grey scale quantization for
improving the imperceptibility. The paper is mainly oriented towards reducing the distortion;
robustness against attacks is not discussed at all. In one of the methods proposed by Nedeljko
[58], the watermark is embedded in the 6th LSB without much distortion and with appreciable
robustness. The payload of the scheme is 176 bps. Mazdak et al. [60] proposed a method to
embed single as well as multiple watermark bits in an individual sample. A substitution based
scheme is used to embed the watermark at the 4th and 5th LSBs. Further, the method aims at
reducing the amplitude difference so that the imperceptibility is improved in addition to
increasing the payload of the watermarking system. All the bits except the watermarking bit can
be changed in order to minimize the difference. Cao [61] used LSB substitution on the non-silent
samples of stereo audio.
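The parity based modification mentioned above can be sketched as follows. The text only says that a predefined rule modifies the parity of the LSBs of each segment, so the concrete rule used here (flip the LSB of the last sample of a segment when its parity does not match the watermark bit) is an assumption made for illustration, as are the function names and sample values.

```python
from functools import reduce


def segment_parity(segment):
    """XOR of the least significant bits of all samples in the segment."""
    return reduce(lambda acc, s: acc ^ (s & 1), segment, 0)


def embed_parity(samples, bits, k):
    """Embed one bit per segment of k samples by adjusting the LSB parity."""
    if len(bits) * k > len(samples):
        raise ValueError("not enough samples for the given watermark")
    marked = list(samples)
    for j, bit in enumerate(bits):
        start = j * k
        if segment_parity(marked[start:start + k]) != bit:
            marked[start + k - 1] ^= 1  # hypothetical rule: flip one LSB
    return marked


def extract_parity(samples, n_bits, k):
    """Recover the bits as the LSB parity of each segment."""
    return [segment_parity(samples[j * k:(j + 1) * k]) for j in range(n_bits)]


segment_len = 4
audio = [1200, -347, 15, 0, 32767, -32768, 511, 42, -1, 8, 77, 300]
bits = [1, 0, 1]
marked_audio = embed_parity(audio, bits, segment_len)
print(extract_parity(marked_audio, len(bits), segment_len) == bits)
```

With this rule at most one sample per segment is altered, so the payload drops to one bit per k samples in exchange for fewer modified samples.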
Current research on LSB based techniques focuses on shifting the bit layer of embedding towards
the most significant bits (MSB) [59]-[61]. Some proposed methods even increase the
payload of the scheme from 1 bit per sample to 2 bits per sample, i.e. for a sampling rate of
44.1 kHz the payload becomes 88200 bits per second (bps).
2.2.4.2 Phase coding and modulation:
These are simple schemes in which the phase of the audio signal is modified according
to the watermarking bit. The schemes take advantage of the inability of the HAS to detect
absolute phase and small phase differences [62]-[68]. These schemes exhibit a large SNR, which is
a metric used for the imperceptibility of the watermarked media. The phase of the first segment is
tuned according to the watermark bit, and the following segments preserve the relative phase
difference. One disadvantage of these schemes is the very low payload. The other disadvantage
is the localization of the watermark data in the first block, so that it can be removed very
easily; no security is imposed at all. Phase modulation is popular among phase coding schemes and
carries a relatively higher payload. In the phase modulation techniques, the phase of an audio
segment is modulated by passing the segment through all pass filters [67] [68]. Two all pass
filters with different poles and zeroes are used for embedding a 0 or 1 watermark bit. The
watermark bits are extracted through the locations of the poles and zeroes. The different phase
modulation schemes differ in how the locations of the poles and zeroes are extracted and in the
transfer functions used for the all pass filters. They have the disadvantage of lower
imperceptibility, as all the segments are used to embed watermark bits. Since the phase is not
retained after most attacks, most of the schemes based on phase coding or modulation show low
robustness to attacks.
2.2.4.3 Spread Spectrum (SS) Schemes:
These schemes were popular in the early days of watermarking. The principle behind them is
to encode the watermark data by spreading it over the entire spectrum of the segmented signal such
that the distortion is kept at the lowest level [54]-[55], [69]-[77]. They exploit the
insensitivity of the HAS to small changes in amplitude. The embedding of the watermark can be
done directly in the time domain as well as in a transform domain. The traditional spread spectrum
scheme can be modeled through an embedding module and a detection module, as given in the
following figures.
Figure 2.7: Embedding Module for SS schemes (blocks: secret key, pseudo random number, chip sequence, watermark, cover audio, watermarked audio)
The input audio segment x is treated as a sequence of numbers that are independent and
identically distributed (IID) Gaussian, with zero mean and variance $\sigma_x^2$.
The ith watermarked sample is represented as
$$y_i = x_i + k\,p_i\,w_i \tag{2.6}$$
where $x_i$ is the ith cover sample, $p_i$ is the chip sequence, $w_i$ is the watermark and k is
the embedding strength. The value of k controls the robustness of the watermarking system. A
higher value of k means higher robustness, but increasing k also reduces the imperceptibility. So
there is a tradeoff between imperceptibility and robustness, and the value of k should be set
carefully to meet the requirement of good imperceptibility and good robustness. The distortion
introduced is given by |y – x|.
For extraction of the watermark, decoders are used which compute the correlation
between the watermarked audio signal and the pseudo random sequence generated through the same
key that was used at the time of embedding.
Figure 2.8: Detection Module for SS schemes
These schemes bear a modest payload and show appreciable robustness to attacks
as well as secure behavior. The disadvantage is the interference from the original signal, because
of which the imperceptibility is at stake most of the time. In addition, even in the closed loop
case, i.e. without any attack, complete retrieval of the watermark data is not guaranteed.
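The embedding and correlation detection described above can be sketched as follows. This is a simplified illustration, not the method of any cited paper: it assumes one watermark bit per segment mapped to +1 or -1, a chip sequence of +1/-1 values drawn from a generator seeded with the secret key, an additive rule of the form marked = cover + k * bit * chip, and a synthetic Gaussian cover signal. All names and parameter values are hypothetical.

```python
import random


def chip_sequence(key, n):
    """Pseudo random +1/-1 chips derived from the secret key."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(n)]


def ss_embed(cover, bit, key, k=0.5):
    """Add the key dependent chip sequence, signed by the bit and scaled by k."""
    chips = chip_sequence(key, len(cover))
    w = 1 if bit else -1
    return [x + k * w * p for x, p in zip(cover, chips)]


def ss_detect(marked, key):
    """Correlate with the chip sequence; the sign of the result gives the bit."""
    chips = chip_sequence(key, len(marked))
    corr = sum(y * p for y, p in zip(marked, chips)) / len(marked)
    return 1 if corr > 0 else 0


# Synthetic zero mean, unit variance Gaussian cover segment (an assumption).
host_rng = random.Random(0)
cover = [host_rng.gauss(0.0, 1.0) for _ in range(4000)]

print(ss_detect(ss_embed(cover, 1, key=1234), key=1234))
print(ss_detect(ss_embed(cover, 0, key=1234), key=1234))
```

The cover signal itself contributes a random term to the correlation, which is the host interference mentioned in the text; here it is kept small only by using a long segment and a fairly large k.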
To improve the quality of the watermarked audio, Malvar et al. [55] proposed a scheme, popularly
called the improved spread spectrum watermarking scheme, which removes the signal as a source of
interference. Although spread spectrum schemes are popular in the literature, the watermarked
audio produced through them is still not imperceptible. This is because of the chip sequence:
even if the power of the chip sequence is reduced, the watermark is not imperceptible. These
schemes are vulnerable to watermark estimation attacks and are also not suitable for applications
that embed multiple watermarks. Spread spectrum schemes are also utilized in the transform
domain, in which suitable transform coefficients are used to embed the watermark. The challenge
lies in modifying the transform coefficients in such a way that audio imperceptibility and
robustness both remain intact, which has not yet been attained, as modifying even a few
coefficients reduces the audio quality.
Further, as these schemes use a detection based strategy, they are not suitable for real
time applications.
2.2.4.4 Echo hiding based schemes:
The basic idea of how an echo can be used to embed a watermark bit was given by Bender in
1996 [78]. In echo hiding based schemes, the watermark bits are added through echoes with
different delays [78]-[84]. Two different echo delays represent bit 0 and bit 1. The HAS is
subject to temporal as well as frequency masking: if two signals differ in amplitude and are close
in time, then the higher amplitude signal masks the presence of the lower amplitude signal. This
deficiency of the HAS is exploited in echo based watermarking methods. The two delays are
generally of the order of a thousandth of a second. The watermark bit is added through two
different echo kernels, and the strength of the echo is controlled by a scalar. Echo based
watermarking schemes show good imperceptibility for smaller values of the scalar, as the echo is
tuned according to the psychoacoustic model of the HAS, but robustness is reduced. Higher values
of the scalar reduce the imperceptibility. In addition, the original audio signal is not required
for extraction of the watermark, which makes these blind techniques suitable for practical
applications; only the delays corresponding to bit 0 and bit 1 are required. In the early
schemes, the watermark was embedded in the form of a single large echo, which results in low
imperceptibility [78]. In the later echo hiding schemes, the imperceptibility is improved by
representing the watermark through multiple small echoes with different delays [81]-[89].
Further, to enhance imperceptibility, Chen et al. [81] used positive and negative echoes.
Kim [86] used forward and backward echo kernels for watermarking, and his scheme has a higher
detection rate than the previous schemes. Also, in the early stage there was no security involved
in these schemes, but later schemes were developed that use frequency hopping, scrambling etc.
to introduce security. The echo hiding methods differ in the echo kernel used and the number
of echoes used. A larger number of echoes with the same strength increases the robustness for the
same level of imperceptibility. The payload of the echo based watermarking methods depends
upon the psychoacoustic shaping.
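The single echo idea can be sketched as below. Several simplifications are assumed and none of them comes from the reviewed schemes: the segment is synthetic white Gaussian noise rather than real audio, the two delays (50 and 100 samples, roughly 1.1 ms and 2.3 ms at 44.1 kHz) and the echo strength are arbitrary, and the detector simply compares the autocorrelation of the segment at the two candidate delays as a stand-in for whatever analysis a real decoder would use.

```python
import random


def add_echo(signal, delay, alpha):
    """Return y[n] = x[n] + alpha * x[n - delay] (single echo kernel)."""
    return [x + (alpha * signal[n - delay] if n >= delay else 0.0)
            for n, x in enumerate(signal)]


def embed_echo_bit(segment, bit, d0=50, d1=100, alpha=0.4):
    """Embed one bit by choosing the echo delay: d0 for bit 0, d1 for bit 1."""
    return add_echo(segment, d1 if bit else d0, alpha)


def autocorr(signal, lag):
    """Unnormalized autocorrelation of the signal at the given lag."""
    return sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))


def detect_echo_bit(segment, d0=50, d1=100):
    """Blind detection: pick the delay with the stronger autocorrelation."""
    return 1 if autocorr(segment, d1) > autocorr(segment, d0) else 0


# Synthetic noise-like segment used only to exercise the sketch.
rng = random.Random(7)
segment = [rng.gauss(0.0, 1.0) for _ in range(4000)]

print(detect_echo_bit(embed_echo_bit(segment, 0)))
print(detect_echo_bit(embed_echo_bit(segment, 1)))
```

Detection needs only the two delays, not the original segment, which illustrates why echo hiding is counted among the blind techniques. Real audio is strongly correlated in time, so a practical detector has to work harder than this comparison.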
2.2.4.5 Patchwork based watermarking Schemes:
The principle behind these schemes is to select two segments or patches with the same statistical
properties, such as the mean, and then modify the samples of the two patches in opposite
directions to embed a watermark bit [90]-[94]. At the time of detection, the expected value of
the difference in the mean etc. reveals the watermarking bit. Bender [90] gave the core idea of
patchwork schemes and applied it to image watermarking. Arnold [91] extended the scheme to audio
watermarking and modified the original by applying it in a transform domain instead of the spatial
domain used by Bender. Further, he used a multiplicative approach, as against the additive
approach used by Bender, for modifying the samples. Successful detection demands a
large variance among the patches, which implicitly requires the length of the patches to be large.
Subsequent research by Kalantari et al., called the modified patchwork algorithm [92], embeds the
watermark bits without increasing the variance. In this method, the wavelet transform
is used and only those audio segments are selected whose patches fit the suitability criterion,
i.e. similar statistical properties. The audio segments that did not meet the criterion were
rejected. Kalantari did not give the procedure for selecting such audio segments at the detection
side. Natgunanathan [94] proposed patchwork based schemes for mono as well as stereo audio. For
mono audio, the audio segments are divided into two sub-segments and DCT is applied to them.
The frame pairs are constituted from the coefficients of a given frequency range. Frame pairs
are selected through a selection criterion, and embedding is done by modifying the DCT
coefficients. The selection is also controlled by a security key. Watermarking of stereo
audio is carried out by exploiting the property that the channels of stereo audio are similar.
Since the patchwork methods are based on the assumption that the statistical properties of
the two patches selected for watermark embedding are the same, which is not true in practice,
these schemes suffer from false detection. The payload varies, as in these schemes it depends upon
the number of patch pairs with comparable mean etc.
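The basic additive patchwork principle can be sketched as follows. This is a time domain illustration under stated assumptions, not the algorithm of any cited paper: the two patches are disjoint index sets drawn with a key seeded generator, one bit is embedded by shifting the two patches by a fixed amount d in opposite directions, and the cover is synthetic Gaussian noise. The function names and parameter values are hypothetical.

```python
import random


def pick_patches(n_samples, patch_len, key):
    """Choose two disjoint, key dependent sets of sample indices."""
    idx = random.Random(key).sample(range(n_samples), 2 * patch_len)
    return idx[:patch_len], idx[patch_len:]


def patchwork_embed(samples, bit, key, patch_len, d=0.3):
    """Shift patch A by +d and patch B by -d for bit 1 (reversed for bit 0)."""
    patch_a, patch_b = pick_patches(len(samples), patch_len, key)
    step = d if bit else -d
    marked = list(samples)
    for i in patch_a:
        marked[i] += step
    for i in patch_b:
        marked[i] -= step
    return marked


def patchwork_detect(samples, key, patch_len):
    """Decide the bit from the sign of the difference of the patch means."""
    patch_a, patch_b = pick_patches(len(samples), patch_len, key)
    mean_a = sum(samples[i] for i in patch_a) / patch_len
    mean_b = sum(samples[i] for i in patch_b) / patch_len
    return 1 if mean_a - mean_b > 0 else 0


# Synthetic cover segment, used only to exercise the sketch.
rng = random.Random(3)
cover_audio = [rng.gauss(0.0, 1.0) for _ in range(4000)]

print(patchwork_detect(patchwork_embed(cover_audio, 1, key=99, patch_len=1000), 99, 1000))
print(patchwork_detect(patchwork_embed(cover_audio, 0, key=99, patch_len=1000), 99, 1000))
```

Detection is reliable here only because the patches are long and their means are nearly equal before embedding; when that assumption fails, the sign of the difference can be wrong, which is the false detection problem noted above.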
2.2.4.6 Transform domain watermarking schemes:
The two modules i.e. the embedding module and the extraction/detection module for transform
domain audio watermarking schemes can be pictorially represented using figure 2.9 and 2.10.
Figure 2.9: Transform Domain Embedding Module
The transform domain audio watermarking schemes mainly differ in the transform used for
watermarking, the type of transform coefficients (low frequency, high frequency, mid-band
frequency, etc.), the method for finding appropriate coefficients, the number of coefficients used
for watermarking, the embedding strategy, etc. Selecting the low frequency coefficients for
watermark embedding gives robustness but reduces imperceptibility. Similarly, selecting the high
frequency coefficients gives good imperceptibility, but robustness is not achieved even against
low pass filtering and re-quantization.
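The robustness side of this tradeoff can be seen in a small sketch. It assumes an orthonormal DCT-II written out directly, an additive change to a single coefficient, and an idealized low pass "attack" that zeroes every coefficient above a cutoff. The function names, the frame, and the parameter values are hypothetical and are not taken from any of the cited schemes. Imperceptibility is not modeled here.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def add_mark(frame, index, strength):
    """Add `strength` to one DCT coefficient and return the time-domain frame."""
    c = dct(frame)
    c[index] += strength
    return idct(c)

def low_pass(frame, cutoff):
    """Idealized low pass filter: zero all DCT coefficients at or above cutoff."""
    c = [v if k < cutoff else 0.0 for k, v in enumerate(dct(frame))]
    return idct(c)

def mark_residual(reference, received, index):
    """Non-blind check: how much of the added strength is still present."""
    return dct(received)[index] - dct(reference)[index]

if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 3 * i / 64) for i in range(64)]
    reference = low_pass(frame, 32)
    low_marked = low_pass(add_mark(frame, 4, 0.5), 32)    # low frequency mark
    high_marked = low_pass(add_mark(frame, 50, 0.5), 32)  # high frequency mark
    print(mark_residual(reference, low_marked, 4))
    print(mark_residual(reference, high_marked, 50))
```

Under these assumptions the low frequency modification passes through the filter unchanged, while the high frequency modification is removed completely.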
[Figure 2.9 block labels: Original Cover Audio; Transformation (DCT/FFT/DFT); Transform Domain;
Select Regions of Embedding; Selected Transform Domain Regions; Watermark; Inverse Transformation
(IDCT/IFFT/IDFT); Watermarked Audio]
Figure 2.10: Transform Domain Extraction Module
[Figure 2.10 block labels: Watermarked Audio (possibly attacked); Transformation (DCT/FFT/DFT);
Transform Domain; Selected Regions of Embedding; Selected Transform Domain Regions; Secret Key;
Watermark Extraction; Threshold; Decision (yes/no); Extracted Watermark (possibly distorted)]
Although the audio signal is used in the transform domain in some of the techniques discussed
in the previous categories, we place these schemes under a separate category. The transforms
typically used in audio watermarking are the Discrete Cosine Transform (DCT) [28],[95]-[100],
[51]-[53], Discrete Wavelet Transform (DWT) [28],[101]-[104], Discrete Fourier Transform (DFT)
[32],[46],[94],[105], Discrete Sine Transform (DST), Fractional Fourier Transform (FRFT) [106],
Fast Fourier Transform (FFT) [107]-[108], Singular Value Decomposition (SVD) [109]-[114],
cepstrum [45],[80], etc. Here we limit the discussion to DCT and SVD, which are the transforms
used in this research work. DCT is an important transform that has made its mark in image
watermarking. Its lower complexity compared to other transforms distinguishes it from them.
The different schemes which embed the data in the DCT domain differ in principle in the number
of coefficients taken for embedding, the type of coefficients (low, high or middle frequency;
AC or DC), and the methodology used for embedding and for finding the coefficients that should
produce minimum distortion and maximum robustness. Z. Zhou [51] proposed a robust DCT-based scheme
where the watermark bits are embedded by quantization of the DCT coefficients. W. Youngqi
[52] proposed a DCT-based audio watermarking scheme using a synchronization signal embedded in
the low frequency components. The scheme of H. Xiong et al. [53] uses DCT coefficients for
embedding, but the coefficients are selected in a non-uniform manner to enhance security. In one
of their schemes, Xia Zhang et al. [95] used a double DCT method for transmission of the audio
through an air channel. The embedding is done on the coefficients obtained by applying the DCT
to the low frequency coefficients of the first-level DCT. The authors claim to have achieved
moderate robustness against attacks. Their second scheme uses the low frequency DCT coefficients
for watermark embedding, with a Barker code as the synchronization code for robustness against
synchronization attacks. Q. Gou et al. [98],[99] also give a DCT-based scheme which they claim
to be robust, especially against analog-to-digital and digital-to-analog conversion, for air
channel transmission applications. Chang proposed a DCT domain technique which modifies the low
frequency DCT coefficients for watermark embedding and also uses a Barker code for
synchronization. K. Ren et al. [115] proposed a DCT and DWT based algorithm which uses a color
image as the watermark. They claim that the scheme has a higher payload than contemporary
DCT-based techniques. The watermark bits are scrambled with the Arnold transform and embedded in
the low frequency DWT coefficients. Suresh et al. [116] used a DCT and SVD based approach for
embedding and extended it to DWT and SVD. They compared the two approaches and claim the
DCT-based method to be more robust.
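Several of the schemes above rely on a Barker code for synchronization. The following sketch shows why such a code is convenient: its autocorrelation has a single sharp peak, so the start of the watermark can be located by sliding correlation even after samples are cropped. The sketch assumes that a sequence of +1/-1 symbols has already been recovered from the DCT coefficients. That extraction step, the function names, and the example stream are hypothetical.

```python
# The length-13 Barker code, written as +1/-1 symbols.
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]

def correlate_at(stream, code, offset):
    """Correlation of the code with the stream window starting at offset."""
    return sum(stream[offset + i] * code[i] for i in range(len(code)))

def find_sync(stream, code=BARKER_13):
    """Return (offset, score) of the window that best matches the code."""
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(stream) - len(code) + 1):
        score = correlate_at(stream, code, offset)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

if __name__ == "__main__":
    # Hypothetical recovered symbols: filler, sync code, then payload symbols.
    stream = [-1] * 7 + BARKER_13 + [-1, 1, -1, -1, 1]
    print(find_sync(stream))       # sync position in the intact stream
    print(find_sync(stream[3:]))   # sync position after cropping 3 symbols
```

Once the sync position is known, the payload symbols that follow it can be read at the correct alignment.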
The DCT-based schemes proposed in the literature use either the low frequency
components or the DC coefficients for watermark embedding, and do not consider the
energy of the individual blocks. This is a problem, since low energy blocks are
susceptible to being removed altogether by an attack without disturbing the
cover audio. Also, little work has been done on analyzing the effect of common signal
processing operations, including mp3 conversion, on the different DCT coefficients and the DCT
blocks.
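One simple way to take block energy into account is to skip blocks whose energy falls below a threshold, so that watermark bits are not placed where an attacker could discard the block unnoticed. The sketch below is an illustration under assumed names and an assumed threshold, not the selection rule of any cited scheme.

```python
def block_energy(block):
    """Sum of squared sample values in a block."""
    return sum(s * s for s in block)

def split_blocks(samples, block_len):
    """Non-overlapping blocks; a trailing partial block is dropped."""
    return [samples[i:i + block_len]
            for i in range(0, len(samples) - block_len + 1, block_len)]

def select_blocks(samples, block_len, min_energy):
    """Indices of blocks with enough energy to carry a watermark bit."""
    return [i for i, b in enumerate(split_blocks(samples, block_len))
            if block_energy(b) >= min_energy]

if __name__ == "__main__":
    # Hypothetical signal: silence, loud, near-silence, moderate, short tail.
    audio = [0.0] * 8 + [0.5] * 8 + [0.01] * 8 + [0.3] * 8 + [0.2] * 3
    print(select_blocks(audio, block_len=8, min_energy=0.1))
```

The detector must be able to repeat the same selection on the received audio, so in practice the threshold needs a margin against the energy changes that attacks introduce.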
Similar to DCT, the SVD also found its place in audio watermarking
after it was successfully applied to image watermarking. The SVD is applied to matrices, so
before applying the SVD the audio is arranged into a two dimensional matrix.
The SVD of an N × N matrix A is defined by the operation [111]-[114]
$$\mathrm{SVD}(A) = U S V^{T} \qquad (2.7)$$
The SVD operation factors a matrix into three matrices U, S and V, where U and V are orthogonal
and S is a diagonal matrix, i.e. all its entries are zero except those on the diagonal. The
diagonal elements are non-negative and arranged in descending order. The non-zero diagonal
entries of S are called singular values.
The columns of U are called the left singular vectors while the columns of V are called the
right singular vectors of A. The columns of U and V are orthonormal eigenvectors of $AA^{T}$
and $A^{T}A$ respectively. The singular values on the diagonal of S are the square roots of the
eigenvalues of $AA^{T}$ (equivalently, of $A^{T}A$), in descending order.
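The relation between the singular values and the eigenvalues follows directly from Eq. (2.7), using only the orthogonality of U and V ($U^{T}U = V^{T}V = I$) and the fact that S is diagonal ($S^{T} = S$):

```latex
\begin{aligned}
A A^{T} &= (U S V^{T})(U S V^{T})^{T} = U S \,(V^{T} V)\, S^{T} U^{T} = U S^{2} U^{T}, \\
A^{T} A &= (U S V^{T})^{T}(U S V^{T}) = V S^{T} (U^{T} U)\, S V^{T} = V S^{2} V^{T}.
\end{aligned}
```

Both products are therefore diagonalized by an orthogonal matrix, with eigenvalues $\lambda_i = \sigma_i^{2}$, so each singular value is $\sigma_i = \sqrt{\lambda_i}$.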
SVD is used mostly for image watermarking, and comparatively few works have applied it to audio
watermarking. The SVD-based watermarking techniques fall into two groups: non-blind techniques,
which use the original signal for watermark extraction/detection, and blind techniques, which do
not require the original.
The problem with non-blind watermarking schemes is that information about the original signal
must be retained until the authentication process is complete. In semi-blind schemes, partial
information about the original signal must be retained. In some SVD-based watermarking schemes,
the unitary matrices as well as the singular value matrix must be retained until the watermark is
extracted. In most SVD-based techniques, the watermark is embedded by manipulating the singular
values in accordance with the watermark bit. Wang and Healy [114] used a reduced singular value
decomposition method in which the unitary matrix is used for watermark bit embedding. Some
watermarking schemes also use a combination of DWT and SVD for watermark embedding and
extraction.
The problem with watermarking schemes that use the singular value matrix is that
this matrix is itself directly exposed to attack. No security key is involved to hide which
singular values are used for watermark embedding. Since small changes in the singular values do
not affect the perceptual quality of the audio signal, an intruder can manipulate the same
singular values. The requirement that an intruder/attacker who knows the watermarking scheme
should still not have access to the watermarking locations is therefore not met by SVD-based
schemes. The use of singular values for watermark embedding rests on two facts: a slight change
in the singular values does not disturb the transparency of the image or audio, and the singular
values do not change prominently when the image or audio is subjected to common signal
processing operations. SVD-based audio watermarking algorithms exploit these properties to
add the watermark information to the singular values of the diagonal matrix S, or to the columns
of the unitary matrices, in such a way that imperceptibility/inaudibility is preserved and the
robustness requirements of effective digital audio watermarking are met. The SVD-based
methods differ in which SVs are used and in how the embedding is done using them.
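As a concrete illustration of embedding in a singular value, the sketch below quantizes the largest singular value of a small matrix so that the parity of its quantization index carries one bit. It is a minimal example under stated assumptions and does not reproduce any cited scheme: the audio frame is assumed to be already arranged as a matrix, a plain power iteration stands in for a full SVD routine, and the function names, the step size and the example matrix are hypothetical. It also assumes the largest singular value is well separated from the second one.

```python
import math

def top_singular_triplet(a, iters=200):
    """Largest singular value with its left/right vectors (power iteration on A^T A)."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        av = [sum(a[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        w = [sum(a[r][c] * av[r] for r in range(rows)) for c in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(a[r][c] * v[c] for c in range(cols)) for r in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v

def quantize_to_bit(value, bit, step):
    """Move value to the centre of the nearest step-wide bin whose index parity is bit."""
    ratio = value / step
    q = math.floor(ratio)
    if q % 2 != bit:
        q += 1 if ratio - q >= 0.5 else -1
    return (q + 0.5) * step

def embed_bit(a, bit, step):
    """Embed one bit by re-quantizing the largest singular value."""
    sigma, u, v = top_singular_triplet(a)
    d = quantize_to_bit(sigma, bit, step) - sigma
    # Rank-one update: shifts the largest singular value by d, leaves the rest.
    return [[a[r][c] + d * u[r] * v[c] for c in range(len(v))]
            for r in range(len(u))]

def extract_bit(a, step):
    """Blind extraction: parity of the quantization index of the largest singular value."""
    sigma, _, _ = top_singular_triplet(a)
    return math.floor(sigma / step) % 2

if __name__ == "__main__":
    frame = [[4.0, 3.0, 2.0, 1.0], [3.0, 3.5, 2.0, 1.5],
             [2.0, 2.5, 3.0, 1.0], [1.0, 1.5, 1.0, 2.0]]
    marked = embed_bit(frame, 1, step=0.05)
    print(extract_bit(marked, step=0.05))
```

Nothing in this sketch depends on a secret key: anyone who knows the step size can re-quantize the same singular value, which is the weakness discussed above.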
2.2.5 Compressed Domain Schemes
In compressed domain techniques, the watermarking is done on compressed audio. Since
audio is mostly posted on the internet in compressed form, more and more researchers are
attracted towards watermarking compressed audio. Qiao [120] proposed two audio
watermarking schemes for MPEG encoded audio. In the first scheme, the header of the MPEG
audio file is used to embed the watermark. In the second scheme, the MPEG encoded samples are
used for watermark embedding. To meet the imperceptibility requirement, only a few encoded
samples are used for embedding. The disadvantage of the approach is its weakness against
re-quantization and noise addition attacks. D. K. Koukopoulos [121] also proposed a blind digital
watermarking scheme for MPEG audio layer 3 (mp3) files. The watermarking is done directly on
the compressed audio by manipulating the scale factors. The scheme is claimed to overcome the
disadvantage of schemes operating on PCM coded audio, which are vulnerable to
compression/recompression attacks. Rade Petrovic [122] proposed a scheme in which, prior to
compression using AAC, multiple copies of the audio are produced and a single watermark bit is
embedded in each copy. After perceptual compression using MDCT, a multiplexer selects the
compression units according to the code.
Neuber et al. [123] proposed a scheme for the MPEG-2 AAC bit stream. Huffman coding and
de-multiplexers are used to obtain the frequency information. The problem with the approach is
that imperceptibility is not guaranteed, because precise perceptual information is unavailable
at the embedding side. Cheng et al. [124] also proposed a scheme for AAC audio in
which the watermark is embedded directly in the quantization indices. Further, an enhanced
spread spectrum based scheme is used to improve the payload and robustness. The watermarked
audio was robust against compression attacks, but experiments checking robustness
against other signal processing attacks were not conducted or reported.
2.2.6 Miscellaneous Schemes
Although a number of methods have been proposed and implemented, none of them attempts to build
a watermarking system that can be used for any general application. From the literature it is
clear that a lot has been done to improve the perceptual quality of the watermarked signal and the
robustness of the watermark. For detection of the watermark, a few watermarking schemes use
Support Vector Machines (SVM) [125]. The principle of these schemes is to correctly
identify the watermark bit through the training and testing phases of classification. S. D. Larbi
et al. [18] used audio watermarking as a tool for making a signal stationary over a short
duration, which can be used as a pre-processing step for many applications. Over a very short
duration, approximately 20 ms, an audio signal is treated as stationary for analysis.
Simulation results with two kinds of signals, test signals and audio signals, show a significant
stationarity enhancement of short segments containing transient attacks. Since transient attacks
are more prominent in music signals, the enhancement is mostly limited to them. Since the
watermarking process is not intended for copyright protection, the robustness of the watermark
is not checked against signal processing. Nakashima et al. [17] proposed a new application area where audio
watermarking is used to help deter camcorder piracy of movies from the movie hall itself. The
system includes a position estimation system, which in turn contains a watermarking system, that
tries to find the exact position of the pirate, i.e. the seat number of the pirate in the movie
hall. Future work lies in generating the watermark from the biometric features of the person to
whom the multimedia data originally belongs. Also, Time Scale Modification is considered to be
the worst attack as far as audio watermarking is concerned, so much of the current work deals
with synchronization in addition to improving the imperceptibility and robustness against other
attacks.
2.3 Identified Issues
On the basis of the literature review, it can be stated that the main issue with audio
watermarking, and with all watermarking schemes using other types of cover object, is to make the
watermarked object (which is embedded with extra information) robust to attacks while
maintaining imperceptibility. This issue becomes more serious in audio watermarking
because of the sensitivity of the HAS. The requirements of audio watermarking conflict with
each other: robustness requires the watermark to be embedded in the prominent portion of the
audio so that it cannot be removed through attacks, but this reduces the imperceptibility.
Also, as the watermark embedding density (i.e. payload, expressed in bits per second
(bps)) increases, the imperceptibility decreases. An optimal tradeoff among imperceptibility,
robustness and payload is therefore required, and achieving it is still an
open problem. Some additional issues are identified as follows.
Issue 1: Although the DCT-based watermarking schemes have low embedding complexity, the use of
the low frequency coefficients or the DC coefficients as watermarking locations leads
to lower imperceptibility. Attention needs to be given to the use of selected frequency
coefficients and a better embedding strategy to provide a good balance between imperceptibility
and robustness. Embedding the watermark in a single coefficient may not be robust against
attacks, but a group of coefficients used for data embedding has a higher probability of being
robust. Also, these watermarking schemes need to be improved to carry variable
payloads, for adjustability, with imposed security.
Issue 2: Uninformed destination based watermarking schemes, which are mostly used for audio
fingerprinting, require multiple copies of the audio to be watermarked using different
watermarks. Additionally, they require a higher payload capacity so as to carry multiple pieces
of information, i.e. owner information, distributor information, etc. However, estimation
attacks try to remove the watermark by analyzing multiple watermarked copies. There is a need to
develop improved uninformed destination based watermarking schemes with high payload which can
simultaneously withstand estimation of the watermark, unintentional mp3 compression and direct
manipulation of the watermarked samples or coefficients used for embedding copyright information
within an audio.
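A simple instance of an attack that uses multiple watermarked copies is plain averaging. The sketch below assumes additive +1/-1 fingerprints scaled by `alpha` and a non-blind correlation detector. These choices, and all names, are assumptions made for illustration, not the attack model of any particular cited work.

```python
import random

def make_fingerprint(length, key):
    """Key-seeded +1/-1 sequence identifying one recipient."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(host, fingerprint, alpha):
    """Additive embedding with strength alpha."""
    return [h + alpha * f for h, f in zip(host, fingerprint)]

def average_copies(copies):
    """Attack: sample-wise average of several differently marked copies."""
    n = len(copies)
    return [sum(vals) / n for vals in zip(*copies)]

def correlation(signal, host, fingerprint):
    """Non-blind detector: mean of (signal - host) * fingerprint."""
    return sum((s - h) * f
               for s, h, f in zip(signal, host, fingerprint)) / len(host)

if __name__ == "__main__":
    rng = random.Random(7)
    host = [rng.gauss(0.0, 0.1) for _ in range(4000)]
    prints = [make_fingerprint(len(host), key) for key in range(5)]
    copies = [embed(host, p, alpha=0.01) for p in prints]
    attacked = average_copies(copies)
    print(correlation(copies[0], host, prints[0]))  # single copy
    print(correlation(attacked, host, prints[0]))   # after averaging 5 copies
```

With K copies, each fingerprint's contribution in the averaged signal is scaled by roughly 1/K, so a detector threshold tuned for a single copy can miss every contributing recipient.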
Issue 3: Very little work is reported on analyzing the effect of mp3 compression on
watermarked audio. Watermarking schemes in which watermarking is done on already
compressed audio are prone to format change attacks. A watermarking scheme that can be
applied to uncompressed audio and is robust to mp3 compression needs to be developed, based on
a study of the effect of mp3 compression on the individual audio blocks used to embed
watermark bits.
Issue 4: The watermarking schemes in the literature mainly use arbitrary images or pseudo-random
numbers as the watermark. Establishing copyright over an arbitrary image, or over the key(s)
used to generate a pseudo-random number, is difficult. Thus, the schemes become
unacceptable when the watermark itself becomes public property, i.e. can be used by anyone.
There can also be situations in which the watermark itself is used to mislead about ownership:
to defame a person, an arbitrary watermark used by an owner can be used by a malicious person to
watermark others' creations. A watermark used by one person can also be used by many other
individuals, and claiming copyright on such a watermark, and ultimately on the cover media,
becomes very difficult. Little attention has been paid to the need for a unique watermark,
preferably one generated from biometric features, to guard against ambiguous situations and
defamation of an individual.
2.4 Thesis Objectives
Based on the literature review and the issues identified, along with the main issue of
watermarking, the thesis objective is oriented towards improving uninformed source based and
destination based watermarking schemes with respect to imperceptibility, robustness, security and
payload. The proposed watermarking schemes are source based, i.e. for ownership detection, as
well as destination based, i.e. for pirate detection. The DCT and SVD transforms are used for
embedding as they are well accepted in the watermarking domain. Robustness against common signal
processing attacks as well as compression attacks is a must, as audio is provided over
networked environments with minimum bandwidth in compressed form. A module to
generate a unique watermark for owner authentication and tracing of the pirate also appears to be
an essential requirement.
The objectives of the thesis are summarized below.
Objective 1: The first objective is to develop improved uninformed audio watermarking
schemes using selected frequency DCT coefficients, which are capable of carrying variable payload
with good imperceptibility and are robust to compression attacks in addition to the common signal
processing attacks. This objective covers issue 1.
Objective 2: The SVs obtained from the SVD inherently show some robustness to attacks, and a
small change in the SVs does not make a perceptible change in the cover
audio. The second objective is to develop improved uninformed, secure audio
watermarking schemes using SVD which are capable of carrying high payload and are robust to
compression attacks and direct manipulation of the SVs. This addresses issue 2.
Objective 3: For robustness against compression attacks, especially the mp3 attack, blocks
within the audio are needed on which compression has the least effect, along with robustness to
other common signal processing attacks. So, the third objective is to identify such blocks by
analyzing the effect of mp3 compression at different compression rates, and to develop an
embedding strategy on the individual blocks to improve robustness. This resolves issue 3.
Objective 4: To deal with issue 4, a watermark generation module is required which is
capable of producing a unique watermark. The unique watermark should be able to counter
the problems which arise from common, ambiguous watermarks.
2.5 Summary
From the literature survey, watermarking proves to be a present-day need. It is clear that the
majority of watermarking techniques operate in the transform
domain, and many techniques are still being proposed on different frequency transforms. For
watermark insertion, a psychoacoustic model is used which accounts for temporal as well as
frequency masking, and it increases the complexity of the watermarking technique. Many
techniques have been proposed which embed the watermark directly in the LSBs of the samples using
substitution. The LSB techniques have high payload but less robustness, especially against
re-sampling, simple filtering and re-quantization attacks. The recent trend in LSB substitution
and embedding techniques is to shift the watermark bit layer from the LSBs towards the MSBs with
no distortion, or with distortion that is imperceptible to human ears. There is still a
need to modify LSB techniques to make them robust against attacks. User-specific data needs to be
embedded as the watermark. A lot of research is going on to make
watermarked audio robust to compression, mainly the mp3 attack.
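The fragility of LSB substitution against re-quantization can be seen in a few lines. The sketch assumes 16-bit PCM samples held as Python integers and models re-quantization crudely by discarding the lowest bits. The function names and sample values are illustrative only.

```python
def embed_lsb(samples, bits):
    """Replace the least significant bit of the first len(bits) samples."""
    out = list(samples)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | b
    return out

def extract_lsb(samples, count):
    """Read back the least significant bit of the first `count` samples."""
    return [s & 1 for s in samples[:count]]

def requantize(samples, drop_bits):
    """Crude re-quantization: discard the lowest drop_bits bits of each sample."""
    return [(s >> drop_bits) << drop_bits for s in samples]

if __name__ == "__main__":
    pcm = [1000, -2001, 15, -16, 32767, -32768]  # hypothetical 16-bit samples
    bits = [1, 0, 1, 1, 0, 1]
    marked = embed_lsb(pcm, bits)
    print(extract_lsb(marked, len(bits)))                 # recovered payload
    print(extract_lsb(requantize(marked, 4), len(bits)))  # after re-quantization
```

Each sample changes by at most one quantization level, which explains the high payload and low distortion, but dropping even a single low-order bit removes the payload entirely.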
Schemes using arbitrary information such as an image or a pseudo-random number
as copyright information become unacceptable when the watermark itself becomes public
property. Establishing copyright over an arbitrary image or the key of a pseudo-random number is
difficult.
SVD, DST, DWT and FRFT are all used to compact the energy into a few transform
coefficients, but not all transforms work equally well for audio watermarking.
The DCT also exhibits the property of compacting the signal energy into fewer
transform coefficients, and has proved to be a good transform as far as energy compaction and
audio watermarking are concerned. The low frequency DCT coefficients carry an appreciable amount
of energy, and changes in them lead to greater distortion.
Modification of the high frequency components does not produce much distortion, but the
robustness is questionable: a simple filtering operation may lead to the complete removal of
the watermark data. So there is a tradeoff between robustness and imperceptibility in the
selection of coefficients. The mid-band frequency coefficients may show robustness to common
signal processing operations such as analog-to-digital conversion, digital-to-analog conversion,
re-sampling, re-quantization, etc. Also, distorting these coefficients to a small extent for
watermark embedding may make a perceptible change in the resultant audio.