Need of Color Histogram
7/27/2019 Need of Color Histogram
http://slidepdf.com/reader/full/need-of-color-histogram 1/14
Key-frame extraction is a widely used method for video summarization. The key-frames are the characteristic frames of the video and represent meaningful information about its contents. The extracted key-frames can be arranged chronologically to generate a storyboard. In video archiving systems, the key-frames can be used for indexing in such a way that the content-based indexing and retrieval techniques developed for image retrieval can be applied to video retrieval.
The extracted key-frames must summarize the characteristics of the video, and the image characteristics of a video can be tracked by all the key-frames in time sequence. A common methodology for key-frame extraction is to compare consecutive frames based on some low-level Frame Difference Measures (FDMs). The frame difference is measured, and if this difference exceeds a certain threshold, the frame is selected as a key-frame; otherwise it is discarded. Low-level features commonly used for extraction include the color histogram, shape histogram, motion information and edge histogram.
We then apply the Discrete Wavelet Transform to the frame and decompose it into four sub-images. The frame difference between the current frame and the last extracted key-frame is computed using the color histogram, a shape feature descriptor and a texture feature. The obtained frame difference is then compared with a threshold; if the difference satisfies the threshold condition, the current frame is selected as a key-frame. By repeating this procedure for all frames, we extract the key-frames.
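The generic threshold-based extraction loop described above can be sketched as follows. This is a minimal illustration, not the method of this work: the `mad` measure (mean absolute pixel difference on flat gray-level lists) is a toy stand-in for the actual FDMs discussed later, and the threshold value is arbitrary.

```python
def extract_keyframes(frames, fdm, threshold):
    """Compare each frame against the last extracted key-frame using a
    Frame Difference Measure (FDM); keep the frame when the difference
    exceeds the threshold, otherwise discard it."""
    if not frames:
        return []
    keyframes = [0]  # the first frame is always taken as a key-frame
    for i in range(1, len(frames)):
        if fdm(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes

def mad(a, b):
    # Toy FDM: mean absolute difference between two equal-length pixel lists.
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

video = [[10, 10, 10], [11, 10, 10], [90, 90, 90], [91, 90, 89]]
print(extract_keyframes(video, mad, threshold=20))  # [0, 2]
```

Frame 2 differs sharply from the last key-frame (frame 0), so it is kept; frames 1 and 3 are near-duplicates of their predecessors and are discarded.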
The main advantage of applying the wavelet transform to edge detection in a frame is the possibility of choosing the size of the details to be detected. When processing a 2-D frame, the wavelet analysis is performed separately for the horizontal, vertical and diagonal directions, so horizontal, vertical and diagonal coefficients are obtained separately. The 2-D discrete wavelet transform (DWT) decomposes the frame into sub-images: three details and one approximation. The approximation looks similar to the input image but is only 1/4 of the original size. The 2-D DWT is an extension of the 1-D DWT in both the horizontal and vertical directions, and successive decomposition is performed only on the low-pass output. The resulting sub-images from an octave (a single iteration of the DWT) are labeled, according to the filters used to generate them, as A (the approximation, or smoothed version of the original frame, which contains most of its information), H (preserves the horizontal edge details), V (preserves the vertical edge details), and D (preserves the diagonal details, which are greatly influenced by noise).
A frame is thus decomposed into Approximation (A), Horizontal (H), Vertical (V) and Diagonal (D) details, and two levels of decomposition are performed. After that, quantization is applied to the decomposed frame, where different quantization may be applied to different components, thus maximizing the amount of needed detail while ignoring less-wanted detail. This is done by thresholding, where some coefficient values for pixels in frames are 'thrown out' (set to zero) or some smoothing effect is applied to the image matrix.
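One octave of the 2-D Haar DWT described above can be sketched in pure Python on a small grid. This is an illustrative implementation, not the one used in the system (a real pipeline would use a wavelet library), and the A/H/V/D naming follows one common sub-band convention.

```python
def haar_1d(seq):
    # One level of the 1-D Haar transform: pairwise averages (low-pass)
    # followed by pairwise half-differences (high-pass).
    avg = [(seq[2 * i] + seq[2 * i + 1]) / 2 for i in range(len(seq) // 2)]
    dif = [(seq[2 * i] - seq[2 * i + 1]) / 2 for i in range(len(seq) // 2)]
    return avg + dif

def haar_2d(img):
    # One octave of the 2-D DWT: transform every row, then every column,
    # then split the result into the A, H, V and D sub-images.
    rows = [haar_1d(r) for r in img]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    out = [list(r) for r in zip(*cols)]          # transpose back
    n = len(img) // 2
    A = [r[:n] for r in out[:n]]   # approximation (low-low), 1/4 size
    V = [r[n:] for r in out[:n]]   # responds to vertical edges
    H = [r[:n] for r in out[n:]]   # responds to horizontal edges
    D = [r[n:] for r in out[n:]]   # diagonal details
    return A, H, V, D

A, H, V, D = haar_2d([[1, 2], [3, 4]])
print(A, H, V, D)  # [[2.5]] [[-1.0]] [[-0.5]] [[0.0]]
```

Successive decomposition would apply `haar_2d` again to A only, as the text describes.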
Need of color histogram
Color histograms are commonly used for key-frame extraction in frame-difference-based techniques. Each frame obtained from the video is added to the collection and analyzed to compute a color histogram, which shows the proportion of pixels of each color within the frame. The histogram difference is then calculated by comparing the histograms of successive frames using a distance measure. This value is used to identify key-frames by comparison with a threshold (T), which is automatically computed as the average frame difference over all extracted frames. Frames below the threshold (T) are discarded, and frames above it are taken as key-frames.
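The automatic-threshold scheme just described can be sketched as follows. For brevity this toy example treats a frame as a flat list of single-channel intensities and uses the L1 histogram difference; the bin count and frame data are illustrative.

```python
def color_histogram(frame, bins=4, max_val=256):
    # Proportion of pixels falling in each intensity range of the frame.
    hist = [0] * bins
    for p in frame:
        hist[p * bins // max_val] += 1
    return [h / len(frame) for h in hist]

def histogram_keyframes(frames):
    # Successive-frame histogram differences; the threshold T is set
    # automatically to the average difference, as described above.
    hists = [color_histogram(f) for f in frames]
    diffs = [sum(abs(a - b) for a, b in zip(hists[i], hists[i + 1]))
             for i in range(len(hists) - 1)]
    T = sum(diffs) / len(diffs)
    # frame i+1 is a key-frame when its difference from frame i exceeds T
    return [i + 1 for i, d in enumerate(diffs) if d > T]

frames = [[0, 0, 10, 10], [0, 5, 10, 15],
          [200, 200, 210, 210], [200, 205, 210, 215]]
print(histogram_keyframes(frames))  # [2]
```

Only frame 2, where the content jumps from dark to bright, exceeds the average difference and is kept.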
The color feature is one of the most widely used visual features for distinguishing frames. It is relatively robust to background complication and independent of image size and orientation. In frame extraction, the color histogram is the most commonly used color feature representation; statistically, it denotes the joint probability of the intensities of the three color channels.
The use of color in video processing is motivated by two principal factors:
(1) Color is a powerful descriptor that often simplifies object identification and extraction from a frame.
(2) Humans can discern thousands of color shades and intensities, compared to only about two dozen shades of gray. This second factor is particularly important in manual (i.e., performed by a human) image analysis.
The main purpose of the RGB color model is for the sensing, representation and display
of images in electronic systems such as televisions and computers, though it has also been used
in conventional photography. Before the electronic age, the RGB color model already had a solid
theory behind it based on human perception of colors.
We can represent the RGB model using a unit cube, as shown in Figure 3.2. Each point in the cube (or vector from the origin) represents a specific color. This model is well suited to setting the electron guns of a CRT. Note that for complementary colors the sum of the values equals white light (1,1,1). For example:
Red (1,0,0) + cyan (0,1,1) = white (1,1,1)
Green (0,1,0) + magenta (1,0,1) = white (1,1,1)
Blue (0,0,1) + yellow (1,1,0) = white (1,1,1)
SHAPE FEATURE EXTRACTION
In video summarization, depending on the application, some techniques require the shape representation to be invariant to translation, rotation and scaling, while others do not. There are many important feature components for describing the dissimilarity between frames, such as color, texture, shape and spatial relationship. Among these, shape contains the most attractive visual information for human perception. Compared to other features such as texture and color, shape representation is much more effective in semantically characterizing the contents of a frame.
In general, shape representation can be divided into two categories: boundary-based and region-based. The former uses only the outer boundary of the shape, while the latter uses the entire shape region. The most successful representatives of these two categories are transform coefficients and moment invariants.
An important step before shape extraction is edge point detection. Edges define the boundaries between regions in a frame, which helps with segmentation and object recognition. Edge detection is fundamental to low-level image processing, and good edges are necessary for higher-level processing. Traditional edge-detection algorithms such as gradient-based edge detectors, the Laplacian of Gaussian (LoG), zero crossing, and the Canny edge detector each suffer from particular limitations.
The construction of shape descriptors is even more complicated when invariance, with
respect to a number of possible transformations, such as scaling, shifting, and rotation, is
required.
For a given frame, an edge map is obtained using wavelet decomposition, and a shape feature vector is computed from moment invariants. Feature vectors are computed for all successive frames in the same way. The Canberra distance measure is used to calculate the distance between feature vectors, and a threshold (T) is set to discard similar frames for efficient key-frame extraction.
The similarity between two frames is obtained by evaluating the Canberra distance between the feature vectors of every nth frame and the successive (n+1)th frame in the video. Similar frames are discarded based on the distance between the obtained feature vectors. The Canberra distance is given by the equation:

CDk = Σ(i=1..n) |xi − yik| / (|xi| + |yik|)

where CD is the Canberra distance, x and y are the feature vectors of the nth and (n+1)th frames respectively, and n is the length of the feature vector (here, 28). Also k = 1 to m, where m is the number of frames in the given video. Frames are indexed based on the distance between every nth frame and (n+1)th frame, and similar frames are displayed in ranking order.
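The Canberra distance above can be implemented directly. One practical detail, assumed here rather than stated in the text, is skipping terms where both components are zero to avoid division by zero:

```python
def canberra(x, y):
    # CD = sum_i |x_i - y_i| / (|x_i| + |y_i|); a term where both
    # components are zero contributes nothing (avoids 0/0).
    total = 0.0
    for xi, yi in zip(x, y):
        denom = abs(xi) + abs(yi)
        if denom:
            total += abs(xi - yi) / denom
    return total

# Identical feature vectors give distance 0; each differing term
# contributes at most 1, so similar frames score low.
print(canberra([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(canberra([1.0, 2.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
```

In the described system the vectors would be the 28-element moment-invariant features of successive frames.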
TEXTURE FEATURE EXTRACTION
Texture is characterized by the spatial distribution of gray levels in a neighborhood. Though texture is widely used and intuitively obvious, it has no precise definition due to its wide variability. Frame textures can be artificially created or found in natural scenes captured in a frame, and they are one cue that can help in summarizing videos (video processing) or classifying images.
The two-dimensional discrete wavelet transform (DWT) is an effective tool for analyzing the frames of a video and capturing localized frame details in both the spatial and frequency domains. The DWT is efficiently implemented using the Haar wavelet, which applies iterative linear filtering and critical down-sampling to the original frame, yielding three high-frequency directional sub-bands at each scale level in addition to one low-frequency sub-band, usually known as the image approximation. The directional sub-bands are sparse sub-images exhibiting image details in the horizontal, vertical and diagonal orientations.
In the decomposition process, the input image undergoes a first-level decomposition to generate three detail sub-bands (H1, V1 and D1) and one image approximation (A1). At the second level, the approximation image (A1) undergoes the same process to produce a second scale level of image details (V2, H2 and D2) and a new image approximation (A2), yielding the 2-level DWT. At the third level, the approximation image (A2) undergoes the same process to produce a third scale level of image details (V3, H3 and D3) and a new image approximation (A3), yielding the 3-level DWT.
APPROACH
Key-frames are extracted based on the discrete wavelet transform. Using the wavelet transform, the texture frame is decomposed into four sub-images: the low-low, low-high, high-low and high-high sub-bands. To compute the wavelet features, in the first step the Haar wavelet is calculated for the whole frame, producing four sub-band images at each scale. The wavelet transform is a multi-resolution technique, which can be implemented as a pyramid or tree structure and is similar to sub-band decomposition. This process is continued until the third-level decomposition. The energy of all third-level decomposed frames is calculated using the energy level algorithm. Using the Euclidean distance, dissimilar frames are identified as key-frames: the greater the distance, the more dissimilar the frames.
The frames in the video can be discriminated by their textures. To invoke the system, the user must provide the first texture frame as the key-frame. The system then compares the frames with the similar visual attributes of the successive frame. This method covers feature extraction, the representation of textural descriptions, and the similarity-matching algorithm.
TEXTURE FEATURE CALCULATION
The texture frame is decomposed into four sub-images: the low-low, low-high, high-low and high-high sub-bands, i.e., Approximation (A), Horizontal (H), Vertical (V) and Diagonal (D) respectively. This process is continued until the third-level decomposition. The energy of all decomposed frames is calculated using the energy level algorithm, and the similarity between successive frames is obtained by comparing Euclidean distances.
ENERGY LEVEL ALGORITHM
1. Decompose the image into four sub-images.
2. Calculate the energy of all decomposed images at the same scale, using:

E = (1/(M·N)) Σi Σj |X(i,j)|

where M and N are the dimensions of the frame and X(i,j) is the coefficient at the ith row and jth column of the frame map.
3. Repeat from step 1 for the low-low sub-band image, until the third level is reached.
Using the above algorithm, the energy levels of the sub-bands are calculated, and the low-low sub-image is decomposed further. This is repeated twice to reach the third-level decomposition. These energy levels are stored for use in the Euclidean distance algorithm.
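The energy computation for a single sub-band can be sketched as follows. The original equation was lost in extraction, so the mean-absolute-value definition (a common choice consistent with the M, N and X(i,j) description above) is assumed here:

```python
def energy(band):
    # E = (1/(M*N)) * sum of |X(i, j)| over the sub-band coefficients
    # (mean absolute value; an assumed but common wavelet-energy form).
    M, N = len(band), len(band[0])
    return sum(abs(x) for row in band for x in row) / (M * N)

# Energies of the sub-bands at one decomposition level contribute to the
# texture feature vector; the low-low band would then be decomposed
# again for levels two and three.
A = [[4, 4], [4, 4]]
H = [[1, -1], [0, 0]]
print(energy(A), energy(H))  # 4.0 0.5
```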
5.7.2 EUCLIDEAN DISTANCE CALCULATION
7/27/2019 Need of Color Histogram
http://slidepdf.com/reader/full/need-of-color-histogram 7/14
The Euclidean distance between the vectors X and Y is given by

D = √(Σ(X − Y)²)

Using the above formula, the Euclidean distance is calculated between the nth frame and every (n+1)th frame in the database. This process is repeated until all frames have been compared with their adjacent frames. After completing the Euclidean distance algorithm, an array of Euclidean distances is obtained and then sorted.
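The distance computation and the sorted-distance array can be sketched directly; the energy vectors below are made-up placeholders for the stored sub-band energies:

```python
import math

def euclidean(x, y):
    # D = sqrt(sum((X - Y)^2)) between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Distance between each nth frame's energy vector and its successor's,
# collected into an array and sorted, as described above.
energies = [[1.0, 2.0], [1.0, 2.0], [4.0, 6.0]]
dists = [euclidean(energies[i], energies[i + 1])
         for i in range(len(energies) - 1)]
print(sorted(dists))  # [0.0, 5.0]
```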
5.8 Steps for key-frame extraction using texture feature extraction
1. Convert the input video into frames.
2. Decompose the frames using the Discrete Wavelet Transform.
3. Calculate the energy of the frame using the energy level algorithm.
4. Calculate the distance between frames using the Euclidean distance formula.
5. Compare the distance with the threshold and extract the key-frames.
FUZZY COMPREHENSIVE EVALUATION
METHODOLOGY
The evaluation of a key-frame extraction mechanism is inherently subjective. Moreover, the factors affecting a human user's decision to declare a frame a key-frame are not determined. This makes the problem of evaluating summaries well suited to fuzzy analysis. The purpose of fuzzy sets and fuzzy logic is to deal with problems involving knowledge expressed in vague linguistic terms. In a fuzzy set, each element of the universe of discourse is awarded a degree of membership using a membership function, which associates a grade with each linguistic variable. We use four linguistic terms to represent the quality of a video summary: Very Good (VG), Good (G), Average (A) and Poor (P). The Accuracy and Error Rates of the training videos are fuzzified using triangular membership functions to associate a degree of goodness or badness with the user summaries.
Next, Fuzzy Comprehensive Evaluation (FCE) is applied separately to compute the weight for each FDM. FCE is a well-known method that comprehensively judges the membership grade of the items to be evaluated based on several factors. We therefore evaluate each FDM on the factors of Accuracy and Error Rate by applying FCE.
In general, FCE has the following requirements:
1. A Factor Set U = {u1, u2, ..., um} composed of m different factors. These factors influence the evaluation of objects.
2. An Evaluation Set V = {v1, v2, ..., vn} composed of n types of remarks.
3. A Weight Set W = {w1, w2, ..., wm}, where Σ(i=1..m) wi = 1 and wi ≥ 0. Each member of this set represents the weight coefficient of the corresponding factor in the factor set U.
4. A fuzzy transformation Γf that transforms the factor set U into the evaluation set V.
5. A fuzzy relation R on U × V, defined as:

R = (nij)m×n

The membership degree of the subject to remark vj from the viewpoint of factor ui is given by:

nij = uR(ui, vj)

where nij ∈ [0, 1], i = 1, ..., m; j = 1, ..., n.
Based on the sets U, V, W and the fuzzy relation R, the evaluation is computed.
Algorithm for key-frame extraction based on color feature extraction
Step 1: All the frames are extracted from the input sports video.
Step 2: Consider the first frame as a key-frame.
Step 3: The current key-frame is converted into an RGB frame.
Step 4: The RGB frame is converted into an HSV frame.
Step 5: Select the subsequent HSV frame from the extracted frames and apply the discrete wavelet transform. The frame is divided into A, H, V and D components.
Step 5.1: The frame is decomposed until the 2nd level, finally obtaining 7 sub-component frames.
Step 5.2: Apply quantization to the subsequent frame.
Step 5.3: Apply normalization to the frame obtained above.
Step 6: Histogram creation
Step 6.1: Histogram creation: The normalization values of each section are averaged. The histogram values are measured for two channel values, i.e., hue and saturation.
Step 6.2: Concatenation: The histogram values of the HSV frame are concatenated.
Step 7: Distance calculation
Step 7.1: Distance calculation: The distance between consecutive frame histograms is calculated using the Sum of Absolute Differences (SAD) formula.
Step 7.2: The Sum of Absolute Differences (SAD) is calculated using the formula:

SAD(fq, ft) = Σ |fq[i] − ft[i]|

Step 8: SAD is compared with the threshold value to detect key-frames. Frames with SAD higher than the threshold are treated as key-frames.
Step 9: To detect key-frames based on the color histogram difference measure over the entire video, repeat steps 3 to 8.
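Steps 6 and 7 above (hue/saturation histogram creation, concatenation, and SAD) can be sketched as follows. The bin count, channel ranges and sample values are illustrative assumptions, and the wavelet, quantization and normalization stages are omitted for brevity:

```python
def hs_histogram(hue, sat, bins=8):
    # Concatenated hue/saturation feature for one HSV frame: quantize
    # each channel into `bins` bins, normalize to proportions, then
    # concatenate (steps 6.1 and 6.2). Hue assumed in [0, 360),
    # saturation in [0, 1].
    def hist(channel, max_val):
        h = [0] * bins
        for v in channel:
            h[min(int(v * bins / max_val), bins - 1)] += 1
        return [c / len(channel) for c in h]
    return hist(hue, 360.0) + hist(sat, 1.0)

def sad(fq, ft):
    # SAD(fq, ft) = sum_i |fq[i] - ft[i]|  (step 7.2)
    return sum(abs(a - b) for a, b in zip(fq, ft))

red_frame = hs_histogram([0.0, 0.0], [0.0, 0.0])
magenta_frame = hs_histogram([350.0, 350.0], [0.9, 0.9])
print(sad(red_frame, red_frame), sad(red_frame, magenta_frame))  # 0.0 4.0
```

A SAD of 0 means identical histograms; larger values flag a candidate key-frame when compared against the threshold of step 8.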
Algorithm for extracting key-frames based on shape feature extraction
Step 1: All the frames are extracted from the input sports video.
Step 2: Consider the first frame as a key-frame.
Step 3: Select the next subsequent frame from the extracted frames.
Step 4: Edge map creation: Four edge maps are obtained for each frame by multiplying the four masks with the approximation component.
Step 4.1: Get the first edge map by applying k1σ to the h, v and d components, combining them, and then multiplying the resulting mask with the approximation component.
Step 4.2: Obtain the second edge map similarly to the first, using k2σ.
Step 4.3: Get the third edge map by applying both k1σ and k2σ to the approximation component.
Step 4.4: Obtain the fourth edge map by finding the highest-intensity pixels (both positive and negative values) among the h, v and d components and multiplying with the approximation component.
Step 5: Shape representation: Seven moment invariants are computed for each edge map. As there are four edge maps for each image, the feature vector for an image consists of 28 features.
φ1 = η20 + η02
φ2 = (η20 − η02)² + 4η11²
φ3 = (η30 − 3η12)² + (3η21 − η03)²
φ4 = (η30 + η12)² + (η21 + η03)²
φ5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
φ6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)
φ7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η12 − η30)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
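The first two of these invariants can be computed directly from normalized central moments ηpq. A minimal pure-Python sketch on a grayscale grid (nested lists), shown here only to illustrate the translation invariance that motivates their use:

```python
def hu12(img):
    # First two Hu moment invariants of a 2-D intensity grid:
    # phi1 = eta20 + eta02, phi2 = (eta20 - eta02)^2 + 4*eta11^2,
    # where eta_pq are normalized central moments.
    pts = [(i, j, v) for i, row in enumerate(img) for j, v in enumerate(row)]
    m00 = sum(v for _, _, v in pts)           # total mass
    xb = sum(i * v for i, _, v in pts) / m00  # centroid row
    yb = sum(j * v for _, j, v in pts) / m00  # centroid column
    def eta(p, q):
        mu = sum((i - xb) ** p * (j - yb) ** q * v for i, j, v in pts)
        return mu / m00 ** (1 + (p + q) / 2)  # scale normalization
    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

# The invariants are unchanged when the same shape is translated:
a = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
b = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
print(hu12(a) == hu12(b))  # True
```

In the described system, all seven invariants would be computed for each of the four edge maps, giving the 28-element feature vector.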
Step 6: Shape matching: The Canberra distance is used as the distance measure for similarity between two feature vectors.
Step 7: CD is compared with the threshold value to detect key-frames. Frames with CD higher than the threshold are treated as key-frames.
Step 8: To detect key-frames based on the shape difference measure over the entire video, repeat steps 3 to 7.
Algorithm for key-frame extraction based on texture feature extraction
Step 1: All the frames are extracted from the input sports video.
Step 2: Consider the first frame as a key-frame.
Step 3: Select the next subsequent frame from the extracted frames, convert it from RGB to a gray image, then divide the frame into Approximation (A), Horizontal (H), Vertical (V) and Diagonal (D) components.
Step 4: Energy calculation: Calculate the energy of all decomposed frames at the same scale, using:

E = (1/(M·N)) Σi Σj |X(i,j)|

where M and N are the dimensions of the image, and X(i,j) is the pixel coefficient at the ith row and jth column of the frame map.
Step 5: Repeat from step 4 for the low-low sub-band image, until the third level is reached.
Step 6: The Euclidean distance is calculated between the two frames:

D = √(Σ(X − Y)²)

Step 7: ED is compared with the threshold value to detect key-frames. Frames with ED higher than the threshold are treated as key-frames.
Step 8: To detect key-frames based on the texture feature measure over the entire video, repeat steps 3 to 7.
Calculation for fuzzy comprehensive evaluation
We modeled our problem in the FCE scenario as follows:
1. The accuracy rate (CUSA) and error rate (CUSE) are the criteria used to evaluate the efficacy of a specific measure; thus they make up the factor set U.
2. The evaluation set V includes the scale factors for CUSA and CUSE: Very Good (VG), Good (G), Average (A) and Poor (P). The values of these scale factors are determined by fuzzifying the accuracy and error rates of the training videos and then averaging the degrees of membership of each set.
3. For a better-quality video summary, the value of CUSA must be high and the value of CUSE must be low. Therefore, both factors are equally important and are assigned equal weights, giving the weight set W = (0.5, 0.5).
4. The fuzzy transformation Γf is defined as the column vector:

Γf = [1, 0.7, 0.4, 0]T
5. The single-factor evaluation matrix R is determined, b is computed using b = W ∘ R, and the weight/credibility of a technique is found by multiplying b with the fuzzy transformation Γf:

Weight = b · Γf

6. The process is repeated to determine the weight of each frame difference measure.
We include a numerical example for the determination of weights. For this example, training is shown on only 3 videos using a single FDM. The Accuracy and Error Rates, the degrees of membership after fuzzification, and the average values of each scale factor for a shape FDM are given in Table 1.
                  Degree of Membership (CUSA)        Degree of Membership (CUSE)
Video   CUSA      VG     G      A     P       CUSE   VG     G      A      P
1       0.73      0.45   0.8    0     0       0.41   0      0.34   0.3    0
2       0.5       0      0.75   0     0       0.21   0.16   0      0      0
3       0.66      0.05   0.45   0     0       0.56   0      0.11   0.95   0
AVG               0.17   0.66   0     0              0.05   0      0.41   0
Table 1: A numerical example for fuzzification of accuracy and error rates.
Using Table 1, the matrix R is given as:

R = [0.17  0.66  0     0
     0.05  0.11  0.41  0]

and W = [0.5 0.5]. Therefore b = W ∘ R is determined as b = [0.17 0.5 0.41 0].

Weight = b · Γf = [0.17 0.5 0.41 0] · [1, 0.7, 0.4, 0]T = 0.684
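The numerical example above can be reproduced in a few lines, assuming the max-min composition for the ∘ operator (which is consistent with b = [0.17, 0.5, 0.41, 0] as given):

```python
def fuzzy_compose(W, R):
    # Max-min composition b = W o R: b_j = max_i min(w_i, r_ij).
    return [max(min(w, row[j]) for w, row in zip(W, R))
            for j in range(len(R[0]))]

W = [0.5, 0.5]
R = [[0.17, 0.66, 0.00, 0.00],   # averaged CUSA memberships (VG, G, A, P)
     [0.05, 0.11, 0.41, 0.00]]   # averaged CUSE memberships (VG, G, A, P)
b = fuzzy_compose(W, R)
gamma_f = [1.0, 0.7, 0.4, 0.0]   # fuzzy transformation vector
weight = sum(bi * gi for bi, gi in zip(b, gamma_f))
print(b, round(weight, 3))  # [0.17, 0.5, 0.41, 0.0] 0.684
```

Repeating this computation for each FDM yields the per-measure weights described in step 6.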