Post on 31-Dec-2015

Page 1: Qian Liu,  Computer and Information Sciences Department

Qian Liu, Computer and Information Sciences Department

A Presentation on

Extracting Patterns and Relations from the World Wide Web

Sergey Brin

Page 2: Qian Liu,  Computer and Information Sciences Department

Problem

•The World Wide Web as an information resource:

  •Huge

  •Widely distributed

  •Complex, various styles and formats

  •Scattered information

•So, if we could integrate the chunks of information...

Page 3: Qian Liu,  Computer and Information Sciences Department

Motivation

Discover information sources

Extract information of a particular data type automatically/with minimal human intervention

Integrate into a structured form

The largest and most diverse source of information

Page 4: Qian Liu,  Computer and Information Sciences Department

Applications

• To extract structured data from the entire World Wide Web

• Data types: books, movies, music, restaurants, etc.

Page 5: Qian Liu,  Computer and Information Sciences Department

Methods

Problem: To extract a relation of books --- (author, title) pairs from the Web.

Page 6: Qian Liu,  Computer and Information Sciences Department

Methods

Intuition: A small seed set of books (author, title pairs)

Find occurrences of them on the Web

Generate patterns

Search for books matching the patterns

Obtain a large list of books

Page 7: Qian Liu,  Computer and Information Sciences Department

Methods

Formal Definition of the Problem:

•World Wide Web

•Relation --- (author, title) pairs that occur on the Web

•Occurrences

•Every tuple of the relation occurs one or more times on the Web

• Each occurrence consists of all fields of the tuple

• The fields appear in close proximity to one another

Page 8: Qian Liu,  Computer and Information Sciences Department

Methods

Formal Definition of the Problem (Continued):

• Patterns

• A pattern matches one particular format of occurrences of tuples of the relation. It is a five-tuple: (order, urlprefix, prefix, middle, suffix)

• Represented by a class of regular expressions
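As a minimal sketch of how such a five-tuple could be turned into a regular expression: the `Pattern` type, the `AUTHOR_RE` and `TITLE_RE` field expressions, the helper names, and the convention that `order=True` means "author before title" are all assumptions made for illustration; the slides do not specify them.

```python
import re
from typing import List, NamedTuple, Tuple


class Pattern(NamedTuple):
    order: bool      # assumed encoding: True if the author precedes the title
    urlprefix: str   # the page URL must start with this string
    prefix: str      # literal text just before the first field
    middle: str      # literal text between the two fields
    suffix: str      # literal text just after the second field


# Hypothetical field shapes; a real system would tune these per relation.
AUTHOR_RE = r"[A-Z][A-Za-z .,&]{5,30}[A-Za-z.]"
TITLE_RE = r"[A-Z0-9][A-Za-z0-9 ,:'#!?;&]{4,45}[A-Za-z0-9?!]"


def compile_pattern(p: Pattern):
    """Build prefix + field + middle + field + suffix as one regex."""
    author = f"(?P<author>{AUTHOR_RE})"
    title = f"(?P<title>{TITLE_RE})"
    first, second = (author, title) if p.order else (title, author)
    return re.compile(
        re.escape(p.prefix) + first + re.escape(p.middle)
        + second + re.escape(p.suffix)
    )


def match_page(p: Pattern, url: str, text: str) -> List[Tuple[str, str]]:
    """Return the (author, title) pairs that the pattern finds on one page."""
    if not url.startswith(p.urlprefix):
        return []
    regex = compile_pattern(p)
    return [(m.group("author"), m.group("title")) for m in regex.finditer(text)]
```

The literal parts are escaped so that HTML markup in the context is matched verbatim, and only the two fields are left open.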

Page 9: Qian Liu,  Computer and Information Sciences Department

Methods

R': approximation of the relation R

Coverage (recall) = $\frac{|R' \cap R|}{|R|}$

Error rate = $\frac{|R' - R|}{|R'|}$

Precision = $\frac{|R' \cap R|}{|R'|}$
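The three measures on this slide can be computed with plain set operations. The sketch below assumes that both the approximation and the true relation are available as non-empty Python sets of (author, title) tuples; the function names are illustrative. In practice the true relation R is unknown, so these values can only be estimated from a sample.

```python
from typing import Set, Tuple

Book = Tuple[str, str]  # (author, title)


def coverage(approx: Set[Book], truth: Set[Book]) -> float:
    """Share of the true relation that the approximation recovered (recall)."""
    return len(approx & truth) / len(truth)


def error_rate(approx: Set[Book], truth: Set[Book]) -> float:
    """Share of the approximation that is not in the true relation."""
    return len(approx - truth) / len(approx)


def precision(approx: Set[Book], truth: Set[Book]) -> float:
    """Share of the approximation that is in the true relation."""
    return len(approx & truth) / len(approx)
```

Precision and error rate share the denominator |R'|, so they always sum to 1.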

Page 10: Qian Liu,  Computer and Information Sciences Department

Methods

Method: Dual Iterative Pattern Relation Expansion (DIPRE)

Basis:

•Find tuples from patterns.

•Find patterns from tuples.

Page 11: Qian Liu,  Computer and Information Sciences Department

Methods

Set of tuples → find all occurrences of the tuples and discover similarities in the occurrences → set of patterns with high coverage and low error rate

Set of patterns → find all matches to the patterns → set of tuples

Page 12: Qian Liu,  Computer and Information Sciences Department

Methods

1. Start with a small sample, e.g., five books.

2. Find all occurrences of the sample books on the WWW.

Keep the context of every occurrence (url and surrounding text).
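A rough sketch of step 2 for a single page follows. It assumes the page text is already available as a string, that "close proximity" means at most `max_gap` characters between the two fields, and that `context` characters on each side are kept; the `Occurrence` type, the function name, and both limits are hypothetical choices, not values from the slides.

```python
import re
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool   # assumed encoding: True if the author precedes the title
    url: str
    prefix: str   # text just before the first field
    middle: str   # text between the two fields
    suffix: str   # text just after the second field


def find_occurrences(author: str, title: str, url: str, text: str,
                     context: int = 10, max_gap: int = 50) -> List[Occurrence]:
    """Find places where both fields appear close together, in either order."""
    found = []
    for first, second, order in ((author, title, True), (title, author, False)):
        # Non-greedy gap so the nearest second field is chosen.
        gap = "(.{0,%d}?)" % max_gap
        regex = re.compile(re.escape(first) + gap + re.escape(second))
        for m in regex.finditer(text):
            found.append(Occurrence(
                author=author, title=title, order=order, url=url,
                prefix=text[max(0, m.start() - context):m.start()],
                middle=m.group(1),
                suffix=text[m.end():m.end() + context],
            ))
    return found
```

Keeping the order flag and the three context strings with each occurrence is what later allows occurrences to be grouped into patterns.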

Page 13: Qian Liu,  Computer and Information Sciences Department

Methods

3. Generate patterns based on the occurrences.

Requirements:

• Generate patterns for sets of occurrences with similar context

• Low error rate

• High coverage

Page 14: Qian Liu,  Computer and Information Sciences Department

Methods

Procedure:

•Group the occurrences by order and middle.

• For each group: set urlprefix, prefix, suffix.

Specificity of Pattern:

• Too specific?

• Too general?

• $\mathrm{specificity}(p) = |p.\mathrm{middle}| \cdot |p.\mathrm{urlprefix}| \cdot |p.\mathrm{prefix}| \cdot |p.\mathrm{suffix}|$
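The procedure on this slide could be sketched as below. Several details are assumptions rather than statements from the slides: the pattern prefix is taken as the longest common ending of the occurrence prefixes, the pattern suffix and urlprefix as longest common beginnings, groups with a single occurrence are skipped, and a pattern is accepted when specificity times group size exceeds a hypothetical `threshold`. The `Occurrence` and `Pattern` types are illustrative.

```python
import os
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Occurrence(NamedTuple):
    author: str
    title: str
    order: bool
    url: str
    prefix: str
    middle: str
    suffix: str


class Pattern(NamedTuple):
    order: bool
    urlprefix: str
    prefix: str
    middle: str
    suffix: str


def common_prefix(strings: Iterable[str]) -> str:
    return os.path.commonprefix(list(strings))


def common_suffix(strings: Iterable[str]) -> str:
    # Longest common ending = reversed common prefix of the reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def specificity(p: Pattern) -> int:
    """Product of the lengths of the four string fields."""
    return len(p.middle) * len(p.urlprefix) * len(p.prefix) * len(p.suffix)


def generate_patterns(occurrences: Iterable[Occurrence],
                      threshold: int = 0) -> List[Pattern]:
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ.order, occ.middle)].append(occ)  # group by order and middle

    patterns = []
    for (order, middle), group in groups.items():
        if len(group) < 2:
            continue  # assumption: one occurrence is too little evidence
        candidate = Pattern(
            order=order,
            urlprefix=common_prefix(o.url for o in group),
            prefix=common_suffix(o.prefix for o in group),
            middle=middle,
            suffix=common_prefix(o.suffix for o in group),
        )
        # Assumed acceptance test: too-general patterns score low and are dropped.
        if specificity(candidate) * len(group) > threshold:
            patterns.append(candidate)
    return patterns
```

A pattern with an empty prefix, suffix, or urlprefix has specificity 0 and is always rejected, which is one way to guard against the "too general" case.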

Page 15: Qian Liu,  Computer and Information Sciences Department

Methods

4. Search the Web for tuples matching the patterns.

5. Is the result large enough?

If yes, return.

If no, go to step 2.
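Steps 1 to 5 can be summarized as a loop. In this sketch the three callables `find_occurrences`, `generate_patterns`, and `match_patterns` are hypothetical placeholders for steps 2, 3, and 4, and `target_size` and `max_iterations` are assumed stopping parameters; the early exit when no new tuples appear is an addition for safety, not something the slides state.

```python
from typing import Callable, Hashable, Iterable, Set


def dipre(seed: Iterable[Hashable],
          find_occurrences: Callable[[Set], object],
          generate_patterns: Callable[[object], object],
          match_patterns: Callable[[object], Set],
          target_size: int,
          max_iterations: int = 10) -> Set:
    """Alternate between finding patterns from tuples and tuples from patterns."""
    tuples = set(seed)                                # step 1: small seed sample
    for _ in range(max_iterations):
        occurrences = find_occurrences(tuples)        # step 2
        patterns = generate_patterns(occurrences)     # step 3
        new_tuples = set(match_patterns(patterns))    # step 4
        if new_tuples <= tuples:
            break                                     # nothing new: stop early
        tuples |= new_tuples
        if len(tuples) >= target_size:                # step 5
            break
    return tuples
```

Passing the steps in as functions keeps the loop independent of how pages are fetched or how patterns are represented.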

Page 16: Qian Liu,  Computer and Information Sciences Department

Experiments

                                1st iteration   2nd iteration   3rd iteration
Unique (author, title) pairs                5            4047            9127
Occurrences                               199            3972            9938
Patterns                                    3             105             346
Result: unique pairs                     4047            9127           15257

Page 17: Qian Liu,  Computer and Information Sciences Department

Limitations of Study

1. Scalability problem: Limited experiments due to time constraints.

2. Problem with data: duplicate books.

3. Weak measure of safety in matching tuples with patterns: a tuple only has to match a single pattern.

Page 18: Qian Liu,  Computer and Information Sciences Department

Suggestions for Future Studies

1. Scan for larger numbers of patterns and tuples over a huge repository.

2. Include methods to disregard differences such as capitalization, space, how the author is listed in the book, and so on.
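One possible reading of suggestion 2 is to compare books on a normalized key, so that variants of the same book collapse into one entry. The rules below (lowercasing, collapsing whitespace, rewriting "Last, First" as "First Last") and the function names are assumptions for illustration and cover only the simplest cases.

```python
import re
from typing import Iterable, List, Tuple


def normalize_title(title: str) -> str:
    """Ignore capitalization and runs of whitespace."""
    return re.sub(r"\s+", " ", title).strip().lower()


def normalize_author(author: str) -> str:
    """Also rewrite 'Last, First' as 'First Last' before comparing."""
    if author.count(",") == 1:
        last, first = (part.strip() for part in author.split(","))
        author = f"{first} {last}"
    return re.sub(r"\s+", " ", author).strip().lower()


def dedupe(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first spelling seen for each normalized (author, title) key."""
    seen = {}
    for author, title in pairs:
        key = (normalize_author(author), normalize_title(title))
        seen.setdefault(key, (author, title))
    return list(seen.values())
```

For example, `("ASIMOV, Isaac", "the robots  of dawn")` and `("Isaac Asimov", "The Robots of Dawn")` map to the same key, so only the first of them is kept.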

Page 19: Qian Liu,  Computer and Information Sciences Department

Conclusions

• DIPRE --- a remarkable tool to extract structured data from the Web

• Minimal human intervention

• Application in domains other than books

• Finding books not listed in major online sources --- change in information flow
