Web Mining Research: A Survey

Authors: Raymond Kosala & Hendrik Blockeel
Presenter: Ryan Patterson

April 23rd 2014, CS332 Data Mining


Outline
• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Review
• Exam Questions


Introduction
“The Web is huge, diverse, and dynamic . . . we are currently drowning in information and facing information overload.”

Web users encounter problems:
• Finding relevant information
• Creating new knowledge out of the information available on the Web
• Personalization of the information
• Learning about consumers or individual users


Web Mining
“Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.”

Web mining subtasks:
1. Resource finding
2. Information selection and pre-processing
3. Generalization
4. Analysis


Web Mining
Information Retrieval & Information Extraction
• Information Retrieval (IR)
    o the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant as possible
• Information Extraction (IE)
    o transforming a collection of documents into information that is more readily digested and analyzed
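The IR goal above (retrieve all relevant documents, and as few non-relevant ones as possible) is commonly measured with recall and precision. The slide does not name these measures; the sketch below is only an illustration, and the function name and document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four documents returned, three are truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High recall corresponds to "all relevant documents"; high precision corresponds to "as few of the non-relevant as possible".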


Live demo


Web Content Mining
Information Retrieval View

Unstructured Documents
• Most approaches use a “bag of words” representation to generate document features
    o ignores the sequence in which the words occur
• Document features can be reduced with selection algorithms
    o e.g. information gain
• Possible alternative document feature representations:
    o word positions in the document
    o phrases/terms (e.g. “annual interest rate”)
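The three ideas above (bag of words, phrase features, and feature selection by information gain) can be sketched in a few lines. This is a minimal illustration, not the method of any system in the survey; the tokenizer, function names, documents, and class labels are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count word occurrences; the order of the words is discarded."""
    return Counter(tokenize(text))


def phrase_features(text, n=3):
    """Count n-word phrases, which preserves some local word order."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    present = [lab for doc, lab in zip(docs, labels) if term in tokenize(doc)]
    absent = [lab for doc, lab in zip(docs, labels) if term not in tokenize(doc)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder


# Hypothetical documents: same words, different order.
doc_a = "the annual interest rate rose"
doc_b = "rose the rate interest annual"
print(bag_of_words(doc_a) == bag_of_words(doc_b))
print(phrase_features(doc_a)["annual interest rate"])

# Hypothetical labeled collection for feature selection.
docs = ["interest rate loan", "the rate of interest", "the football match", "match report"]
labels = ["finance", "finance", "sport", "sport"]
for term in ("interest", "the"):
    print(term, information_gain(docs, labels, term))
```

A term such as "interest" that separates the two classes scores high and would be kept; a term such as "the" that occurs equally in both classes scores zero and would be dropped.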

Semi-Structured Documents
• Utilize additional structural information gleaned from the document
    o HTML markup (intra-document structure)
    o HTML links (inter-document structure)
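As an illustration of the two kinds of structure named above, the sketch below pulls the title (markup inside the document) and the link targets (structure between documents) out of a page using Python's standard `html.parser` module. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collect the title text (intra-document) and link targets (inter-document)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical page used only for illustration.
page = """<html><head><title>Loan rates</title></head>
<body><h1>Rates</h1><a href="http://example.com/apply">Apply</a>
<a href="/faq.html">FAQ</a></body></html>"""

extractor = StructureExtractor()
extractor.feed(page)
print(extractor.title)
print(extractor.links)
```

Features such as "word appears in the title" or "page links to X" could then be added alongside the plain bag-of-words features.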


Web content mining, IR unstructured documents


Web content mining, IR semi structured documents


Web Content Mining
Database View
“the Database view tries . . . to transform a Web site to become a database so that . . . querying on the Web become[s] possible.”
• Uses Object Exchange Model (OEM)
    o represents semi-structured data by a labeled graph
• Database view algorithms typically start from manually selected Web sites
    o site-specific parsers
• Database view algorithms produce:
    o document level schema or DataGuides
        structural summary of semi-structured data
    o frequent substructures (sub-schema)
    o multi-layered database
        each layer is obtained by generalizations on lower layers
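To make the labeled-graph idea concrete, here is a minimal sketch that stores semi-structured data as an edge-labeled graph and lists each distinct label path once, which is roughly the kind of structural summary a DataGuide provides. This is a simplification under stated assumptions: the graph is acyclic, the node names and data are hypothetical, and the function is not an implementation of OEM or of any published DataGuide algorithm.

```python
# Edge-labeled graph: node ID -> list of (edge label, target).
# A target is either another node ID (complex object) or an atomic value (leaf).
# All node names and values are hypothetical.
oem = {
    "root": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("name", "Green Cafe"), ("address", "a1")],
    "r2": [("name", "Blue Diner"), ("phone", "555-0100")],
    "a1": [("street", "Main St"), ("city", "Springfield")],
}


def label_paths(graph, node="root", prefix=()):
    """Return the set of distinct label paths reachable from `node` (acyclic graphs only)."""
    paths = set()
    for label, target in graph.get(node, []):
        path = prefix + (label,)
        paths.add(path)
        if target in graph:  # target is a complex object, so descend into it
            paths |= label_paths(graph, target, path)
    return paths


for path in sorted(label_paths(oem)):
    print(".".join(path))
```

Although the two restaurant objects have different fields, the summary lists every path (for example `restaurant.address.city`) exactly once, which is what makes schema-like querying possible over irregular data.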


Web content mining, Database view


Web Structure Mining
“. . . we are interested in the structure of the hyperlinks within the Web itself”
• Inspired by the study of social networks and citation analysis
    o based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)
• Some algorithms calculate the quality/relevancy of each Web page
    o e.g. PageRank
• Others measure the completeness of a Web site
    o measuring frequency of local links on the same server
    o interpreting the nature of hierarchy of hyperlinks on one domain
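As a rough illustration of link-based page scoring, the sketch below runs a basic power-iteration form of PageRank on a tiny graph. The damping factor of 0.85, the iteration count, the handling of pages with no outgoing links, and the four-page graph are all assumptions chosen for the example; production implementations differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to the pages it links to.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C receives links from A, B, and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C, which has the most incoming links, ends up with the highest score, and page D, which has none, ends up with the lowest.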


Web Usage Mining
“. . . focuses on techniques that could predict user behavior while the user interacts with the Web.”
• Web usage is mined by parsing Web server logs
    o mapped into relational tables → data mining techniques applied
    o log data utilized directly
• Proxy servers and caching of Web data (by users or their ISPs) decrease server log accuracy
• Two applications:
    o personalized - user profile or user modeling in adaptive interfaces
    o impersonalized - learning user navigation patterns
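The log-parsing step can be sketched as follows: raw server log lines are mapped into table-like rows, and a simple navigation pattern (which page tends to follow which) is counted from them. The sketch assumes the Common Log Format and treats each host address as one user; the log lines and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Common Log Format: host ident user [timestamp] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log lines, already in time order.
LOG_LINES = [
    '10.0.0.1 - - [23/Apr/2014:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [23/Apr/2014:10:00:05 -0400] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [23/Apr/2014:10:00:09 -0400] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [23/Apr/2014:10:00:12 -0400] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [23/Apr/2014:10:00:30 -0400] "GET /cart.html HTTP/1.1" 200 512',
]


def parse_log(lines):
    """Turn raw log lines into (host, time, path, status) rows, skipping malformed lines."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            rows.append((match["host"], match["time"], match["path"], int(match["status"])))
    return rows


def transition_counts(rows):
    """Count consecutive page pairs requested by the same host."""
    by_host = defaultdict(list)
    for host, _time, path, _status in rows:
        by_host[host].append(path)
    counts = Counter()
    for paths in by_host.values():
        counts.update(zip(paths, paths[1:]))
    return counts


rows = parse_log(LOG_LINES)
print(transition_counts(rows).most_common(1))
```

Grouping by host is exactly where proxies and caching hurt: many users behind one proxy look like a single host, and cached page views never reach the server log at all.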


Review
• Web mining
    o 4 subtasks
    o IR & IE
• Web content mining
    o primarily intra-page analysis
    o IR view vs DB view
• Web structure mining
    o primarily inter-page analysis
• Web usage mining
    o primarily analysis of server activity logs


Web mining categories

| | Web Content Mining: IR View | Web Content Mining: DB View | Web Structure Mining | Web Usage Mining |
|---|---|---|---|---|
| View of Data | Unstructured; semi structured | Semi structured; Web site as DB | Links structure | Interactivity |
| Main Data | Text documents; hypertext documents | Hypertext documents | Links structure | Server logs; browser logs |
| Representation | Bag of words, n-grams; terms, phrases; concepts of ontology; relational | Edge-labeled graph (OEM); relational | Graph | Relational table; graphs |
| Method | TFIDF and variants; machine learning; statistical (incl. NLP) | Proprietary algorithms; ILP; (modified) association rules | Proprietary algorithms | Machine learning; statistical; (modified) association rules |
| Application Categories | Categorization; clustering; finding extraction rules; finding patterns in text; user modeling | Finding frequent sub-structures; Web site schema discovery | Categorization; clustering | Site construction, adaptation, and management; marketing; user modeling |



Exam Question 1
Q: Of the following Web mining paradigms:
• Information Retrieval
• Information Extraction
Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.

A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevance to the search query.


Exam Question 2
Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.

A: Either of the following decreases server log accuracy:
• Users connecting to a Web site through a proxy server
• Users (or their ISPs) utilizing Web data caching
Accurate server logs are required for accurate Web usage mining.


Exam Question 3
Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?

A: “Bag of words” representation.
