noise removal efficient web data mining

8/3/2019 Noise Removal Efficient Web Data Mining


4/22/2012 1

Eliminating Noisy Information in Web Pages for Data Mining

Presented by

Jyothi.B

S1 SE

TKMIT

Guided by

Ms.Revathy



Noisy information

On web sites, noise refers to blocks such as copyright statements, privacy notices, and advertisements.

Web noise is of two types:

Global noise.

Local noise.



Web mining

Web mining is the application of data mining techniques to discover patterns from the web.

Web usage mining is the process of extracting useful information from server logs.

Web content mining is the process of discovering useful information from text, image, audio, or video data on the web.

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.



Sample web site



Style tree

This technique proposes a tree structure, called a style tree, to capture the common presentation styles and the actual contents of the pages in a given site.

A site style tree (SST) can be built for a web site.

Web page cleaning is then done using a cleaning technique based on this tree.



Cleaning Technique

Web page cleaning is a kind of preprocessing.

It is based on the observation that most web pages are automatically generated.

Parts of a page whose layouts and actual contents also appear in other pages of the site are more likely to be noise.

Parts of a page whose layouts or actual contents are quite different from other pages are usually the main contents of the page.



Entropy-based information measure

An information-based measure is proposed to determine which parts of the style tree indicate noise and which parts contain the main contents of the pages in the web site.

The proposed importance measure is entropy based.

Experimental results show an increase in the accuracy of web mining when the proposed web page cleaning method is used.



Data structure: Site Style Tree

Web pages are commonly represented by DOM trees.



Disadvantage of DOM

A DOM tree alone is insufficient.

It is hard to study the overall presentation style and content of a set of HTML pages, and to clean them, based on individual DOM trees.



DOM Trees and Style Tree

[Figure: DOM trees of two example pages, d1 and d2. Node labels: Root, BODY, TABLE, IMG, P, BR, A. Attribute labels: bgcolor=white, width=800, bgcolor=red.]



DOM Trees and Style Tree

The style tree is a compressed representation of the two DOM trees. It shows which parts of the DOM trees are common and which parts are different.

[Figure: style tree obtained from the two DOM trees. Node labels: Root, BODY, TABLE, IMG, BR, P, A. Attribute labels: width=800, bgcolor=white, bgcolor=red.]



Site Style Tree

A style node (S) represents a layout or presentation style. It has two components, denoted by (Es, n), where Es is a sequence of element nodes and n is the number of pages that have this particular style at this node level.

An element node E has three components, denoted by (TAG, Attr, Ss), where

TAG is the tag name.

Attr is the set of display attributes of TAG.

Ss is a set of style nodes below E.
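The two node types can be sketched as plain Python data classes. This is only an illustration of the (Es, n) and (TAG, Attr, Ss) tuples described above; the class and field names are assumptions, not names used by the original authors.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StyleNode:
    """A style node S = (Es, n). Field names are illustrative."""
    elements: List["ElementNode"] = field(default_factory=list)  # Es
    page_count: int = 0                                          # n


@dataclass
class ElementNode:
    """An element node E = (TAG, Attr, Ss). Field names are illustrative."""
    tag: str                                                     # TAG
    attrs: Tuple[Tuple[str, str], ...] = ()                      # Attr
    styles: List[StyleNode] = field(default_factory=list)        # Ss


# A BODY element whose children appear as the sequence (TABLE, IMG) on two pages.
table = ElementNode("TABLE", (("width", "800"),))
img = ElementNode("IMG")
body = ElementNode("BODY", (("bgcolor", "white"),),
                   [StyleNode([table, img], page_count=2)])
```

Keeping the page count on the style node rather than on the element is what later lets the importance measure ask how many pages share each layout.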



Importance measure

An entropy-based importance measure is used to determine the noisy elements in the style tree (ST).

It is based on the following assumptions:

1. The more presentation styles an element node has, the more important it is, and vice versa.

2. The more diverse the actual contents of an element node are, the more important the element node is, and vice versa.



The importance of an element node is given by combining its presentation importance and its content importance.



Importance Measure

[Figure: example site style tree. Node labels: Root, BODY, Table, Tr, Img, P, A, Text.]



Importance measure formula

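The composite importance of a node measures the importance of the element node together with its descendants.

For an internal node, it is based on the node's presentation styles and the importance of its descendants:

\[
\mathrm{CompImp}(E) = (1-\gamma^{l})\,\mathrm{NodeImp}(E) + \gamma^{l}\sum_{i=1}^{l} p_i\,\mathrm{CompImp}(S_i)
\]

where \gamma is the attenuating factor, which is set to 0.9, and p_i is the probability that E has the i-th child style node in E.Ss.

\[
\mathrm{NodeImp}(E) =
\begin{cases}
-\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\
1 & \text{if } m = 1
\end{cases}
\]

where m is the number of pages containing E, l is the number of style nodes in E.Ss, and p_i is the probability that a web page uses the i-th style node in E.Ss.

\[
\mathrm{CompImp}(S_i) = \frac{\sum_{j=1}^{k} \mathrm{CompImp}(E_j)}{k}
\]

where E_j is the j-th element node in S_i.Es and k is the number of element nodes in S_i.Es.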
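The internal-node formulas can be tried out with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes that p_i is estimated as the page count of the i-th style node divided by m (the number of pages containing E), and all function and parameter names are illustrative.

```python
import math
from typing import List

GAMMA = 0.9  # attenuating factor stated on the slide


def node_imp(style_page_counts: List[int], m: int) -> float:
    """NodeImp(E): entropy, with log base m, of how pages spread over E's style nodes."""
    if m == 1:
        return 1.0
    # Assumption: p_i = (pages using style i) / m.
    return -sum((n / m) * math.log(n / m, m) for n in style_page_counts if n > 0)


def comp_imp_style(element_comp_imps: List[float]) -> float:
    """CompImp(S): mean composite importance of the element nodes in S.Es."""
    return sum(element_comp_imps) / len(element_comp_imps)


def comp_imp_internal(style_page_counts: List[int],
                      style_comp_imps: List[float],
                      m: int,
                      gamma: float = GAMMA) -> float:
    """CompImp(E) for an internal node, blending NodeImp(E) with its child styles."""
    l = len(style_page_counts)
    weight = gamma ** l
    weighted_children = sum((n / m) * c
                            for n, c in zip(style_page_counts, style_comp_imps))
    return (1 - weight) * node_imp(style_page_counts, m) + weight * weighted_children


# One layout shared by all 4 pages: no presentation diversity.
shared = node_imp([4], 4)
# Four pages, four different layouts: maximal presentation diversity.
diverse = node_imp([1, 1, 1, 1], 4)
```

With a single shared style the node importance is zero, so the composite value is driven almost entirely by the descendants; with many styles the weight gamma ** l shrinks and the node's own diversity dominates.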



Importance measure formula

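Leaf nodes differ from internal nodes. The composite importance of a leaf node is based on the information in its actual contents (the parts of the node with no tags):

\[
\mathrm{CompImp}(E) =
\begin{cases}
1 & \text{if } m = 1 \\
1 - \dfrac{\sum_{i=1}^{l} H(a_i)}{l} & \text{if } m > 1
\end{cases}
\]

where a_i is an actual feature of the content in E, l is the number of such features, and H(a_i) is the information entropy of a_i within the context of E:

\[
H(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}
\]

where p_{ij} is the probability that a_i appears in E on the j-th page.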
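A small sketch of the leaf-node measure follows. It assumes that features are words, and that p_ij is estimated as the share of a word's occurrences that fall on page j; both choices, and all names, are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def feature_entropy(counts_per_page: List[int], m: int) -> float:
    """H(a_i): entropy, with log base m, of one feature's spread over the m pages."""
    total = sum(counts_per_page)
    return -sum((c / total) * math.log(c / total, m)
                for c in counts_per_page if c > 0)


def leaf_comp_imp(pages: List[List[str]]) -> float:
    """CompImp(E) for a leaf node; pages[j] holds the words of E on page j."""
    m = len(pages)
    if m == 1:
        return 1.0
    counts = [Counter(words) for words in pages]
    features = set().union(*counts)
    if not features:
        return 0.0  # assumption: a leaf with no content carries no importance
    total_entropy = sum(feature_entropy([c[f] for c in counts], m)
                        for f in features)
    return 1.0 - total_entropy / len(features)


# The same footer text on every page behaves like noise.
footer = leaf_comp_imp([["copyright", "2012"]] * 3)
# Different text on every page behaves like main content.
article = leaf_comp_imp([["alpha"], ["beta"], ["gamma"]])
```

Text repeated identically on every page has maximal entropy per feature and scores close to 0, while text unique to each page scores close to 1, which matches the second assumption behind the importance measure.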



Overall algorithm

1. Randomly crawl k pages from the given web site S
2. Set null SST with virtual root E
3. For each page W in the k pages do
4.     Ew = BuildPST(W)
5.     BuildSST(E, Ew)
6. End for
7. CalcCompImp(E)
8. MarkNoise(E)
9. MarkMeaningful(E)
10. For each target web page P do
11.     Ep = BuildPST(P)
12.     MapSST(E, Ep)
13. End for
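Steps 2 to 6 (merging the sampled pages into one SST) can be sketched as follows. The slides do not spell out BuildPST or BuildSST, so this is a simplified guess at the merge: each page is given as a nested (tag, children) tuple, style nodes are matched on their tag sequence only (display attributes are ignored), and every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Style:
    elements: List["Element"]
    pages: int = 0


@dataclass
class Element:
    tag: str
    styles: List[Style] = field(default_factory=list)


def merge(element: Element, dom_children: list) -> None:
    """Merge one page's child sequence under `element` into the site style tree."""
    if not dom_children:
        return
    signature = [tag for tag, _ in dom_children]
    for style in element.styles:
        if [e.tag for e in style.elements] == signature:
            break  # this layout was already seen at this node
    else:
        style = Style([Element(tag) for tag in signature])
        element.styles.append(style)
    style.pages += 1
    for child, (_, grandchildren) in zip(style.elements, dom_children):
        merge(child, grandchildren)


def build_sst(pages: list) -> Element:
    """Build an SST under a virtual root from a list of (tag, children) trees."""
    root = Element("root")
    for page in pages:
        merge(root, [page])
    return root


d1 = ("BODY", [("TABLE", []), ("IMG", []), ("TABLE", [("P", [])])])
d2 = ("BODY", [("TABLE", []), ("IMG", []), ("TABLE", [("A", [])])])
sst = build_sst([d1, d2])
body = sst.styles[0].elements[0]
last_table = body.styles[0].elements[2]
```

In this example both pages share the BODY layout, so `body` has a single style node counted twice, while the last TABLE ends up with two style nodes of one page each.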



Algorithm: MarkNoise

Noisy: an element node E in the SST is noisy if E and all of its descendants have composite importance less than a specified threshold t.

Input: E, the root element node of an SST.

Return: TRUE if E and all of its descendants are noisy, else FALSE.

MarkNoise(E)
1. For each S ∈ E.Ss do
2.     For each e ∈ S.Es do
3.         If (MarkNoise(e) == FALSE) then
4.             Return FALSE
5.         End if
6.     End for
7. End for
8. if(E.CompImp
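A runnable sketch of the marking step is given below. Because step 8 is cut off on the slide, the final test here follows the "Noisy" definition (composite importance below the threshold t) rather than the missing pseudocode. Unlike the early return in steps 3 and 4, this version visits every child so that each noisy subtree gets flagged. The class layout and names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    tag: str
    comp_imp: float
    # Each style node is reduced to its element sequence Es for brevity.
    styles: List[List["Element"]] = field(default_factory=list)
    noisy: bool = False


def mark_noise(e: Element, t: float) -> bool:
    """Flag e as noisy iff e and all of its descendants have comp_imp < t."""
    all_children_noisy = True
    for style in e.styles:
        for child in style:
            if not mark_noise(child, t):
                all_children_noisy = False
    e.noisy = all_children_noisy and e.comp_imp < t
    return e.noisy


link = Element("A", 0.02)
nav = Element("TABLE", 0.05, [[link]])
main = Element("P", 0.90)
root = Element("BODY", 0.80, [[nav, main]])
result = mark_noise(root, t=0.15)
```

Here the navigation table and its link fall below the threshold and are flagged, while the paragraph and the BODY node are not.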



Algorithm: Definitions

Maximal noisy element node: if a noisy element node E in the SST is not a descendant of any other noisy element node, we call E a maximal noisy element node.

Meaningful: if an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.

Maximal meaningful element node: if a meaningful element node E is not a descendant of any other meaningful element node, we say E is a maximal meaningful element node.
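These definitions can be illustrated with a short top-down traversal. The sketch assumes the noisy flags have already been set (a noisy node implies a fully noisy subtree), treats a node as meaningful only if it is itself not noisy, and flattens style nodes into a plain child list; these simplifications and all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    tag: str
    noisy: bool = False
    children: List["Element"] = field(default_factory=list)


def is_meaningful(e: Element) -> bool:
    """True if neither e nor any of its descendants is noisy."""
    return not e.noisy and all(is_meaningful(c) for c in e.children)


def maximal_nodes(e: Element, noisy_out: List[str], meaningful_out: List[str]) -> None:
    """Collect the tags of maximal noisy and maximal meaningful nodes under e."""
    if e.noisy:
        noisy_out.append(e.tag)        # topmost noisy node on this path
        return
    if is_meaningful(e):
        meaningful_out.append(e.tag)   # topmost meaningful node on this path
        return
    for child in e.children:           # mixed node: look further down
        maximal_nodes(child, noisy_out, meaningful_out)


root = Element("body", children=[
    Element("nav", noisy=True, children=[Element("link", noisy=True)]),
    Element("main", children=[Element("para")]),
    Element("mixed", children=[Element("ad", noisy=True), Element("text")]),
])
noisy, meaningful = [], []
maximal_nodes(root, noisy, meaningful)
```

Stopping at the first noisy or meaningful node on each path is what yields a simplified tree: everything below a maximal node can be dropped without losing the decision it encodes.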



Algorithm: Definitions

[Figure: example site style tree. Node labels: Root, BODY, Table, Tr, Img, P, A, Text.]



A simplified SST

[Figure: simplified SST. Node labels: Root, Body, Table, Img, Tr, Text.]



Algorithm: MapSST

MapSST uses the simplified SST, compares it with the page style tree, and gets the actual contents.

Input: E, the root element node of the simplified SST.

Input: Ep, the root element node of the page style tree.

Return: the main content of the page after cleaning.

MapSST(E, Ep)
1. If E is noisy then
2.     Delete Ep as noise
3.     Return NULL
4. End if
5. If E is meaningful then
6.     Ep is meaningful
7.     Return the content under Ep
8. Else
9.     returnContent = NULL
10.    S2 is the style node in Ep.Ss
11.    If (∃ S1 ∈ E.Ss ∧ S2 matches S1) then
12.        e1,i is the i-th element node in sequence S1.Es
13.        e2,i is the i-th element node in sequence S2.Es
14.        For each pair (e1,i, e2,i) do
15.            returnContent += MapSST(e1,i, e2,i)
16.        End for
17.        Return returnContent
18.    Else Ep is possibly meaningful
19.        Return the content under Ep
20.    End if
21. End if
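The mapping step can be sketched in Python as below. It is a simplified reading of the pseudocode, assuming that each SST node already carries a label ("noisy", "meaningful", or "mixed"), that style nodes match when their tag sequences are equal, and that content is collected as a list of text strings. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SstElement:
    tag: str
    label: str = "mixed"  # "noisy", "meaningful" or "mixed"
    styles: List[List["SstElement"]] = field(default_factory=list)


@dataclass
class PageElement:
    tag: str
    text: str = ""
    # A single page has exactly one style node per element: its child sequence.
    children: List["PageElement"] = field(default_factory=list)


def content_under(ep: PageElement) -> List[str]:
    """Collect all text at and below ep, in document order."""
    parts = [ep.text] if ep.text else []
    for child in ep.children:
        parts.extend(content_under(child))
    return parts


def map_sst(e: SstElement, ep: PageElement) -> List[str]:
    """Return the main content of the page subtree ep, guided by SST node e."""
    if e.label == "noisy":
        return []                      # steps 1-3: drop the subtree
    if e.label == "meaningful":
        return content_under(ep)       # steps 5-7: keep everything
    page_tags = [c.tag for c in ep.children]
    for style in e.styles:             # step 11: look for a matching style node
        if [s.tag for s in style] == page_tags:
            out: List[str] = []
            for e1, e2 in zip(style, ep.children):
                out.extend(map_sst(e1, e2))
            return out
    return content_under(ep)           # steps 18-19: unseen layout, keep it


sst = SstElement("BODY", "mixed",
                 [[SstElement("TABLE", "noisy"), SstElement("P", "meaningful")]])
page = PageElement("BODY", children=[
    PageElement("TABLE", children=[PageElement("A", "Home")]),
    PageElement("P", "Article text"),
])
cleaned = map_sst(sst, page)
unseen = map_sst(sst, PageElement("BODY", children=[PageElement("DIV", "New layout")]))
```

Keeping content under an unmatched layout mirrors the "possibly meaningful" branch: a layout not seen among the sampled pages is, by the cleaning observation, more likely to be main content than noise.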



Overall algorithm (revisited)

1. Randomly crawl k pages from the given web site S
2. Set null SST with virtual root E
3. For each page W in the k pages do
4.     Ew = BuildPST(W)
5.     BuildSST(E, Ew)
6. End for
7. CalcCompImp(E)
8. MarkNoise(E)
9. MarkMeaningful(E)
10. For each target web page P do
11.     Ep = BuildPST(P)
12.     MapSST(E, Ep)
13. End for



Execution Time

Building the SST always takes less than 20 seconds.

Computing the composite importance finishes in 2 seconds.

The final step of cleaning each page takes less than 0.1 second.



Advantages

Proposed system is faster than the existing one.

Very efficient in handling the noise.

Removes around 99% of the unwanted content.



Disadvantages

Faces difficulty in handling scripts inside the body.

Unformatted structure of a web page causes exceptions.

Only focused on HTML pages.



Conclusions

Proposes a technique to clean web pages for web mining.

Introduces a data structure, the SST, to capture layout and presentation styles.

Proposes an information-based measure to evaluate the importance of element nodes in the SST so as to detect noise.

Results show that proposed technique is highly effective.




