noise removal efficient web data mining

Upload: jyothibpillai

Post on 07-Apr-2018


  • 8/3/2019 Noise Removal Efficient Web Data Mining

    1/35

    4/22/2012 1

    Eliminating Noisy Information in Web Pages for Data Mining

    Presented by

    Jyothi.B

    S1 SE

    TKMIT

    Guided by

    Ms.Revathy

    Noisy information

    On web sites, blocks such as copyright notices, privacy notices, and advertisements are considered noise.

    Web noise is of two types:

    Global noise.

    Local noise.

    Web mining

    Web usage mining is the process of extracting useful information from server logs.

    Web content mining is the process of discovering useful information from text, image, audio, or video data on the web.

    Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site.

    Web mining is the application of data mining techniques to discover patterns from the web.

    Sample web site

    Style tree

    This technique proposes a tree structure, called the style tree, to capture the common presentation styles and the actual contents of the pages in a given site.

    A site style tree (SST) can be built for a web site.

    Web pages are then cleaned using a cleaning technique based on the SST.

    Cleaning Technique

    Web page cleaning is a kind of preprocessing.

    It is based on the observation that most web pages are automatically generated.

    Parts of a page whose layouts and actual contents also appear in other pages of the site are more likely to be noise.

    Parts of a page whose layouts or actual contents are quite different from other pages are usually the main content of the page.

    Entropy-based information measure

    The technique proposes an information-based measure to determine which parts of the style tree indicate noise and which parts contain the main contents of the pages in the web site.

    The proposed importance measure is entropy based.

    Experimental results show an increase in the accuracy of web mining when the proposed web page cleaning method is used.

    Data structure: Site Style Tree

    The pages of a web site are commonly represented by DOM trees.

    Disadvantage of DOM

    A DOM tree alone is insufficient.

    It is hard to study the overall presentation style and content of a set of HTML pages, and to clean them, based on individual DOM trees.

    DOM Trees and Style Tree

    [Figure: DOM trees d1 and d2 of two example pages, built from Root, BODY, TABLE, IMG, P, BR, and A nodes, with attributes bgcolor=white, bgcolor=red, and width=800.]

    DOM Trees and Style Tree

    A style tree is a compressed representation of the two DOM trees. It shows which parts of the DOM trees are common and which parts are different.

    [Figure: style tree combining d1 and d2, with Root, BODY, TABLE, IMG, BR, P, and A nodes and attributes width=800, bgcolor=white, and bgcolor=red.]

    Site Style Tree

    A style node S represents a layout or presentation style. It has two components, denoted by (Es, n), where Es is a sequence of element nodes and n is the number of pages that have this particular style at this node level.

    An element node E has three components, denoted by (TAG, Attr, Ss), where

    TAG is the tag name.

    Attr is the set of display attributes of TAG.

    Ss is the set of style nodes below E.
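The two node types above can be sketched as plain Python data classes. This is a minimal illustration, not the presenter's implementation: the class names, the `signature` matching key, and the `merge` routine (a rough stand-in for building the SST from page trees) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ElementNode:
    """Element node E = (TAG, Attr, Ss)."""
    tag: str
    attr: tuple = ()                             # display attributes, e.g. (("bgcolor", "red"),)
    styles: list = field(default_factory=list)   # Ss: style nodes below E

@dataclass
class StyleNode:
    """Style node S = (Es, n)."""
    elements: list   # Es: sequence of element nodes
    n: int = 1       # number of pages that have this style at this level

def signature(style):
    """Hypothetical matching key: the (tag, attr) sequence of a style node."""
    return tuple((e.tag, e.attr) for e in style.elements)

def merge(sst_elem, page_elem):
    """Merge one page's element node into the site style tree (illustrative only)."""
    for ps in page_elem.styles:
        match = next((s for s in sst_elem.styles if signature(s) == signature(ps)), None)
        if match is None:
            sst_elem.styles.append(ps)   # a new presentation style at this level
        else:
            match.n += 1                 # same style seen on one more page
            for e_sst, e_page in zip(match.elements, ps.elements):
                merge(e_sst, e_page)

def page(*children):
    """Build a tiny page tree: ROOT -> BODY -> children."""
    body = ElementNode("BODY", styles=[StyleNode(list(children))])
    return ElementNode("ROOT", styles=[StyleNode([body])])

# Three made-up pages: two share a layout, one differs below BODY.
pages = [
    page(ElementNode("TABLE"), ElementNode("IMG")),
    page(ElementNode("TABLE"), ElementNode("IMG")),
    page(ElementNode("TABLE"), ElementNode("P")),
]
sst = ElementNode("ROOT")
for p in pages:
    merge(sst, p)
```

After merging, the BODY element carries two style nodes, one shared by two pages and one used by a single page, which is the kind of common versus different structure the style tree is meant to expose.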

    Importance measure

    An entropy-based importance measure is used to determine noisy elements in the style tree (ST).

    It is based on the following assumptions:

    1. The more presentation styles an element node has, the more important it is, and vice versa.

    2. The more diverse the actual contents of an element node are, the more important the element node is, and vice versa.

    Importance of an element node is given by combining its

    presentation importance and content importance.

    [Figure: Importance measure. Example style tree with Root, BODY, Table, Img, Tr, P, A, and Text nodes.]

    Importance measure formula

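    The composite importance measure of a node combines the importance of the element node itself and that of its descendants.

    For an internal node, it is based on its presentation styles and the importance of its descendants:

    $$\mathrm{CompImp}(E) = (1 - \gamma^{l})\,\mathrm{NodeImp}(E) + \gamma^{l} \sum_{i=1}^{l} p_i\,\mathrm{CompImp}(S_i)$$

    where \gamma is the attenuating factor, which is set to 0.9, l is the number of child style nodes in E.Ss, and p_i is the probability that a web page uses the ith style node in E.Ss.

    $$\mathrm{NodeImp}(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\ 1 & \text{if } m = 1 \end{cases}$$

    where m is the number of pages containing E.

    $$\mathrm{CompImp}(S_i) = \frac{\sum_{j=1}^{k} \mathrm{CompImp}(E_j)}{k}$$

    where E_j is the jth element node in S_i.Es and k is the number of element nodes in S_i.Es.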
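The internal-node formulas can be traced with a short Python sketch. It assumes a hypothetical dictionary layout for the tree (`m` for the number of pages containing the element, `styles` as a list of `(page_count, child_elements)` pairs) and takes the importance of leaf nodes as a given `leaf_imp` value; none of these names come from the slides.

```python
import math

GAMMA = 0.9  # attenuating factor given on the slide

def node_imp(page_counts, m):
    """NodeImp(E): base-m entropy of how the m pages spread over E's style nodes.

    page_counts[i] is the number of pages using the ith child style node.
    """
    if m == 1:
        return 1.0
    return -sum((n / m) * math.log(n / m, m) for n in page_counts if n > 0)

def comp_imp(elem):
    """CompImp(E) for an element given as a dict (layout assumed for this sketch)."""
    styles = elem.get("styles", [])
    if not styles:
        return elem["leaf_imp"]          # leaf importance is supplied, not computed here
    m, l = elem["m"], len(styles)
    weighted = 0.0
    for n, children in styles:
        # CompImp(S_i): average composite importance of the element nodes in S_i.Es
        style_imp = sum(comp_imp(c) for c in children) / len(children)
        weighted += (n / m) * style_imp  # p_i * CompImp(S_i)
    decay = GAMMA ** l
    return (1 - decay) * node_imp([n for n, _ in styles], m) + decay * weighted

# One style shared by all 4 pages, with low-importance leaves below it.
uniform = {"m": 4, "styles": [(4, [{"leaf_imp": 0.0}, {"leaf_imp": 0.2}])]}
# Four different styles, one per page, each with a high-importance leaf.
diverse = {"m": 4, "styles": [(1, [{"leaf_imp": 1.0}]) for _ in range(4)]}

print(comp_imp(uniform), comp_imp(diverse))
```

With these made-up inputs the uniformly styled element scores close to 0.09 and the diversely styled one close to 1.0, matching the first assumption that more presentation styles mean higher importance.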

    Importance measure formula

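    Leaf nodes differ from internal nodes. The composite importance of a leaf node is based on the information in its actual contents, that is, the content of the node with the tags removed.

    $$\mathrm{CompImp}(E) = \begin{cases} 1 & \text{if } m = 1 \\ 1 - \dfrac{\sum_{i=1}^{l} H(a_i)}{l} & \text{if } m > 1 \end{cases}$$

    where a_i is an actual feature of the content in E, l is the number of features, and H(a_i) is the information entropy of a_i within the context of E:

    $$H(a_i) = -\sum_{j=1}^{m} p_{ij} \log_m p_{ij}$$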
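As a rough illustration of the leaf-node formula, the sketch below treats each feature as a word and takes p_ij as the share of that word's occurrences that fall on page j. That reading of p_ij, the function names, and the sample counts are assumptions for this example; the slides do not define p_ij.

```python
import math

def feature_entropy(counts):
    """H(a_i) for one feature; counts[j] = occurrences of the feature in E on page j.

    Assumes the feature occurs at least once and that there is more than one page.
    """
    m = len(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, m) for c in counts if c > 0)

def leaf_comp_imp(feature_counts):
    """CompImp(E) for a leaf node; feature_counts maps feature -> per-page counts."""
    m = len(next(iter(feature_counts.values())))
    if m == 1:
        return 1.0
    l = len(feature_counts)
    return 1.0 - sum(feature_entropy(c) for c in feature_counts.values()) / l

# A footer whose words repeat identically on all 4 pages (made-up data).
footer = {"copyright": [1, 1, 1, 1], "reserved": [1, 1, 1, 1]}
# An article block whose words each appear on a single page only.
article = {"entropy": [2, 0, 0, 0], "crawler": [0, 0, 3, 0]}

print(leaf_comp_imp(footer), leaf_comp_imp(article))
```

Under these assumptions the repeated footer gets an importance near 0 and the page-specific article block an importance near 1, in line with the second assumption that more diverse content means higher importance.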

    Overall algorithm

    1. Randomly crawl k pages from the given web site S

    2. Set null SST with virtual root E

    3. For each page W in the k pages do

    4.     Ew = BuildPST(W);

    5.     BuildSST(E, Ew);

    6. End for

    7. CalcCompImp(E);

    8. MarkNoise(E);

    9. MarkMeaningful(E);

    10. For each target web page P do

    11.     Ep = BuildPST(P)

    12.     MapSST(E, Ep)

    13. End for

    Algorithm-Mark Noise

    Input: E, the root element node of an SST.

    Return: TRUE if E and all of its descendants are noisy, else FALSE.

    Noisy: For an element node E in the SST, if E and all of its descendants have composite importance less than a specified threshold t, then we say element node E is noisy.

    MarkNoise(E)

    1. For each S ∈ E.Ss do

    2.     For each e ∈ S.Es do

    3.         If (MarkNoise(e) == FALSE) then

    4.             return FALSE

    5.         End if

    6.     End for

    7. End for

    8. If (E.CompImp < t) then mark E as noisy and return TRUE; otherwise return FALSE
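The MarkNoise steps can be sketched in Python as follows. The final comparison follows the slide's definition of a noisy node (composite importance below a threshold t). The `Elem` class, the sample importances, and the threshold value 0.15 are hypothetical choices for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Elem:
    comp_imp: float                              # precomputed composite importance
    styles: list = field(default_factory=list)   # E.Ss; each style is a list of Elem (S.Es)
    noisy: bool = False

def mark_noise(e, t):
    """Return True if e and all of its descendants are noisy, marking e on the way."""
    for style in e.styles:           # steps 1-7: recurse into every child element
        for child in style:
            if not mark_noise(child, t):
                return False         # one non-noisy descendant makes e non-noisy
    if e.comp_imp < t:               # step 8, per the definition of "noisy"
        e.noisy = True
        return True
    return False

# Made-up tree: a low-importance navigation bar and a high-importance content block.
nav = Elem(0.05, styles=[[Elem(0.02), Elem(0.03)]])
content = Elem(0.80)
root = Elem(0.40, styles=[[nav, content]])

all_noisy = mark_noise(root, t=0.15)   # t = 0.15 is an assumed threshold
```

Because the recursion returns as soon as it meets a non-noisy child, siblings that come after that child are not visited in this literal reading of the steps.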

    Algorithm - Definitions

    Maximal noisy element node: If a noisy element node E in the SST is not a descendant of any other noisy element node, we call E a maximal noisy element node.

    Meaningful: If an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.

    Maximal meaningful element node: If a meaningful element node E is not a descendant of any other meaningful element node, we say E is a maximal meaningful element node.
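These definitions can be made concrete with a small sketch that finds the maximal noisy and maximal meaningful nodes in a tree whose noisy flags are already set. The `Elem` class (with style nodes flattened into a single `children` list), the node names, and the choice to never count a noisy node as meaningful are assumptions for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Elem:
    name: str
    noisy: bool = False
    children: list = field(default_factory=list)  # element nodes below, style nodes flattened

def has_noisy_descendant(e):
    return any(c.noisy or has_noisy_descendant(c) for c in e.children)

def is_meaningful(e):
    # Assumption: a noisy node is never counted as meaningful.
    return not e.noisy and not has_noisy_descendant(e)

def maximal(e, pred, out=None):
    """Collect the top-most nodes satisfying pred, i.e. those not below another such node."""
    out = [] if out is None else out
    if pred(e):
        out.append(e.name)           # stop here: everything below is not maximal
    else:
        for c in e.children:
            maximal(c, pred, out)
    return out

# Made-up tree: a noisy navigation block, and a body with one paragraph and one ad.
nav = Elem("nav", noisy=True, children=[Elem("link", noisy=True)])
body = Elem("body", children=[Elem("p"), Elem("ad", noisy=True)])
root = Elem("root", children=[nav, body])

maximal_noisy = maximal(root, lambda e: e.noisy)
maximal_meaningful = maximal(root, is_meaningful)
```

Here `nav` and `ad` come out as maximal noisy nodes, while `p` is the only maximal meaningful node, since `body` and `root` each contain a noisy descendant.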

    [Figure: Algorithm definitions. Example style tree with Root, BODY, Table, Img, Tr, P, A, and Text nodes.]

    A simplified SST

    [Figure: simplified SST with Root, Body, Table, Img, Tr, and Text nodes.]

    Algorithm-MapSST

    MapSST uses the simplified SST, compares it with the page style tree, and extracts the actual contents.

    Input: E, the root element node of the simplified SST.

    Input: Ep, the root element node of the page style tree.

    Return: The main content of the page after cleaning.

    MapSST(E, Ep)

    1. If E is noisy then

    2.     Delete Ep as noise

    3.     Return NULL

    4. End if

    5. If E is meaningful then

    6.     Ep is meaningful

    7.     Return the content under Ep

    8. Else

    9.     returnContent = NULL

    10.     S2 is the style node in Ep.Ss

    11.     If (∃ S1 ∈ E.Ss such that S2 matches S1) then

    12.         e1,i is the ith element node in sequence S1.Es

    13.         e2,i is the ith element node in sequence S2.Es

    14.         For each pair (e1,i, e2,i) do

    15.             returnContent += MapSST(e1,i, e2,i)

    16.         End for

    17.         Return returnContent

    18.     Else Ep is possibly meaningful

    19.         Return the content under Ep

    20.     End if

    21. End if
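The MapSST steps can be sketched in Python as below. This is an illustration under simplifying assumptions: the `Elem` class is hypothetical, style nodes are plain lists of elements, two style nodes match when their tag sequences are equal, and an empty list stands in for NULL.

```python
from dataclasses import dataclass, field

@dataclass
class Elem:
    tag: str
    text: str = ""
    styles: list = field(default_factory=list)  # each style node is a list of child Elem
    noisy: bool = False
    meaningful: bool = False

def content(e):
    """All text under an element node, in document order."""
    parts = [e.text] if e.text else []
    for style in e.styles:
        for child in style:
            parts.extend(content(child))
    return parts

def matches(s1, s2):
    """Assumed matching rule: same sequence of tags."""
    return [c.tag for c in s1] == [c.tag for c in s2]

def map_sst(e, ep):
    if e.noisy:                                   # steps 1-4: drop Ep as noise
        return []
    if e.meaningful:                              # steps 5-7: keep everything under Ep
        return content(ep)
    result = []                                   # step 9
    s2 = ep.styles[0] if ep.styles else []        # step 10: the page has one style here
    s1 = next((s for s in e.styles if matches(s, s2)), None)
    if s1 is not None:                            # steps 11-17: recurse pairwise
        for e1, e2 in zip(s1, s2):
            result += map_sst(e1, e2)
        return result
    return content(ep)                            # steps 18-19: possibly meaningful

# Simplified SST: a noisy navigation table and a meaningful main block.
sst = Elem("ROOT", styles=[[Elem("TABLE", noisy=True), Elem("DIV", meaningful=True)]])
# A target page with the same layout (made-up text).
page = Elem("ROOT", styles=[[
    Elem("TABLE", text="Home | About"),
    Elem("DIV", styles=[[Elem("P", text="Style trees compress DOM trees.")]]),
]])

cleaned = map_sst(sst, page)
```

For this made-up page the navigation text is dropped and only the paragraph under the meaningful block is returned.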

    Overall algorithm-revisited

    1. Randomly crawl k pages from the given web site S

    2. Set null SST with virtual root E

    3. For each page W in the k pages do

    4.     Ew = BuildPST(W);

    5.     BuildSST(E, Ew);

    6. End for

    7. CalcCompImp(E);

    8. MarkNoise(E);

    9. MarkMeaningful(E);

    10. For each target web page P do

    11.     Ep = BuildPST(P)

    12.     MapSST(E, Ep)

    13. End for

    Execution Time

    The time taken to build the SST is always below 20 seconds.

    Computing the composite importance finishes in 2 seconds.

    The final step of cleaning each page takes less than 0.1 second.

    Advantages

    Proposed system is faster than the existing one.

    Very efficient in handling the noise.

    Removes around 99% of the unwanted content.

    Disadvantages

    Faces difficulty in handling scripts inside the body.

    Unformatted structure of a web page causes exceptions.

    Focuses only on HTML pages.

    Conclusions

    Proposes a technique to clean web pages for web mining.

    Introduces a data structure, the SST, to capture layout and presentation styles.

    Proposes an information-based measure to evaluate the importance of element nodes in the SST so as to detect noise.

    Results show that the proposed technique is highly effective.
