Harvesting useful information on researchers' home pages


Page 1: Harvesting useful information on researchers' home pages

1TIM

Ta Nha Linh

13 March 2009

Harvesting useful information on researchers' home pages

Ta Nha Linh

Supervisor: Asst. Prof. Min-Yen Kan

Page 2: Harvesting useful information on researchers' home pages


Motivation

• Databases dedicated to scientific publications: CiteSeer, Google Scholar, ACM Portal, SpringerLink

• How about the authors of those publications?

• Publication-centric.

Page 3: Harvesting useful information on researchers' home pages


Motivation

• Researcher-centric database?

– Singapore Researchers Database: researchers sign up and enter their own data, under restricted conditions, in Singapore only

– Resilience Alliance Researchers Database: manual submission by researchers, in ecological and social sciences

– Some other similar databases: manual update, specific to certain organizations

Page 4: Harvesting useful information on researchers' home pages


• Goal: Automated system to build researchers database, for multiple disciplines

• Where to get the information? Their home pages.

– Basic information

– Contact information

– Educational history

– Publications

Page 5: Harvesting useful information on researchers' home pages


Challenges

• Different layouts

– Templates

– Personal pages

• Different content

– Pages introducing researchers

– CV-like

– Personal pages

• Different content structures

– Tables / lists

– Natural language text


Page 9: Harvesting useful information on researchers' home pages


Challenges

• Different data presentations, e.g. for e-mail addresses:

– hangli at microsoft dot com

– cs.duke.edu, junyang

– [email protected]

– erafalin(at)cs.tufts.edu

– <Image src=’email.jpg’/>

– Natalio.Krasnogor -replace all this by at symbol- nottingham.ac.uk

– wmt then the at-sign then uci dot edu
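The textual obfuscations listed above suggest a normalisation step before an address can be stored. The following is a minimal sketch, assuming a small hand-written set of rewrite rules; the `deobfuscate_email` helper and its patterns are illustrative and not part of TIM. It covers only the textual variants, not image-based addresses or addresses whose parts appear in reverse order.

```python
import re

# Hypothetical rewrite rules for textual "at" / "dot" obfuscations.
_AT = re.compile(
    r"\s*(?:\(at\)|\[at\]|then the at-sign then"
    r"|-replace all this by at symbol-|\bat\b)\s*",
    re.IGNORECASE,
)
_DOT = re.compile(r"\s*(?:\(dot\)|\[dot\]|\bdot\b)\s*", re.IGNORECASE)
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def deobfuscate_email(text):
    """Return a normalised address, or None if the text does not yield one."""
    # Replace only the first "at" marker, then every "dot" marker.
    candidate = _AT.sub("@", text.strip(), count=1)
    candidate = _DOT.sub(".", candidate)
    return candidate if _EMAIL.match(candidate) else None


if __name__ == "__main__":
    for raw in ("hangli at microsoft dot com", "erafalin(at)cs.tufts.edu"):
        print(raw, "->", deobfuscate_email(raw))
```

Strings that do not reduce to a well-formed address return `None`, so such rules can run as a filter on candidate text without producing spurious addresses.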

Page 10: Harvesting useful information on researchers' home pages


System Architecture

• Fields Identification (Tagging Core)

• Home page Identification

• Post Processing

Page 11: Harvesting useful information on researchers' home pages


Fields Identification - Purpose

• To map data in the page content to the corresponding fields in a predefined set of desired information.

• Current set includes:

Name – Position – Affiliation

Address – Phone – Fax – Email

BS year – BS major – BS university

MS year – MS major – MS university

PhD year – PhD major – PhD university

Research Interest – Publications

Page 12: Harvesting useful information on researchers' home pages


Fields Identification - Related works

• Tang et al (2007), (2008) – ArnetMiner

– Preprocessing: tokenize text into 5 categories

– Tagging of tokens using a Conditional Random Field (CRF)

– F1 = 83.37% (~1,000 researchers)

– Set of features used:

+ Content features (word, morphological, image features)

+ Pattern features (positive word, special token, researcher name features)

+ Term features (term, dictionary features)

Page 13: Harvesting useful information on researchers' home pages


Fields Identification - Related works

• Tang et al (2007), (2008) – ArnetMiner

– Takes the researcher’s name as input. This is important information that can be used when parsing other fields. TIM differs: it does not take the name as input.

– Based only on the text of the page. Stylistic information could also be of use.

Page 14: Harvesting useful information on researchers' home pages


Fields Identification - Related works

• Cai et al (2003)

– VIsion-based Page Segmentation (VIPS) algorithm to produce visual-based content structure of a web page

– Make use of DOM tree and visual cues on web pages

– May help in narrowing down relevant sections

– Drawback: needs a browser to obtain the visual information

Page 15: Harvesting useful information on researchers' home pages


Fields Identification - Related works

• Lee (2004) – PARCELS Stylistic Engine

– Made use of some heuristics proposed by Cai et al (2003)

– Parses the DOM tree for text-only and stylistic properties

– Text-only data is passed to another engine for further processing

– Stylistic data is stored in a vector for machine learning, to classify sections with a set of domain-specific tags

– The domain used was the news domain

Page 16: Harvesting useful information on researchers' home pages


Fields Identification - Method

• Input: a researcher home page

• CRF is employed as the automated learning model

• Features used

– Global features

– Lexicon features

– Context features

– Dictionary features

– Stylistic features

Page 17: Harvesting useful information on researchers' home pages


Fields Identification - Method

• Global features: apply to the current token

– Morphological features

– Initials

– Number

– Punctuation

• Lexicon features: apply to the current token

– Positive words for certain annotation fields: Position, Affiliation, Address, Phone, Fax, Email
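Token-level features of this kind are typically simple boolean tests on the token string. A minimal sketch follows, assuming invented positive-word lists (the slides do not give the actual lexicons) and a hypothetical `token_features` helper.

```python
import string

# Hypothetical positive-word lists; the real lexicons are not given in the slides.
POSITIVE_WORDS = {
    "position": {"professor", "lecturer", "director"},
    "affiliation": {"university", "department", "institute"},
    "phone": {"phone", "tel"},
}


def token_features(token):
    """Return global and lexicon features for a single token."""
    bare = token.lower().strip(string.punctuation)
    feats = {
        # Global (morphological) features
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "is_initial": len(token) == 2 and token[0].isupper() and token[1] == ".",
        "is_number": token.isdigit(),
        "is_punct": bool(token) and all(ch in string.punctuation for ch in token),
    }
    # Lexicon features: one flag per annotation field with a positive-word list
    for field, words in POSITIVE_WORDS.items():
        feats["lex_" + field] = bare in words
    return feats


if __name__ == "__main__":
    print(token_features("Professor"))
```

Each feature would become one column of the CRF training file, next to the token itself.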

Page 18: Harvesting useful information on researchers' home pages


Fields Identification - Method

• Context features: apply to the whole line

– Name context

– Address context

– Phone context: 'phone', 'tel', 'mobile'

– Fax context: 'fax', 'facsimile'

– Email context: 'email', 'e-mail'

– Bachelor (BS) context: appearance of 'B.S' or 'BS' or 'Bachelor'

– Master (MS) context: appearance of 'M.S' or 'MS' or 'Master'

– Ph.D (PhD) context: appearance of 'Ph.D' or 'Doctorate' or 'Doctor(ate) of Philosophy'

– Research-interest context: multiple line property

– Publication context: multiple line property

– Degree: helps to correctly differentiate BS/MS/PhD information when it is presented in prose style or on the same line
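The keyword-based contexts above can be sketched as one regular expression per context, evaluated once per line and then shared by every token on that line. The keywords are taken from the slide; the word-boundary matching, the case handling and the `line_context` name are assumptions.

```python
import re

# Keywords from the slide; word boundaries and case handling are assumptions.
CONTEXT_PATTERNS = {
    "phone": re.compile(r"\b(?:phone|tel|mobile)\b", re.IGNORECASE),
    "fax": re.compile(r"\b(?:fax|facsimile)\b", re.IGNORECASE),
    "email": re.compile(r"\be-?mail\b", re.IGNORECASE),
    # Degree abbreviations are matched case-sensitively to avoid e.g. "Ms".
    "bs": re.compile(r"\b(?:B\.S|BS|Bachelor)\b"),
    "ms": re.compile(r"\b(?:M\.S|MS|Master)\b"),
    "phd": re.compile(r"\b(?:Ph\.D|Doctorate|Doctor of Philosophy)\b"),
}


def line_context(line):
    """Return one boolean context flag per pattern for a whole line."""
    return {name: bool(pat.search(line)) for name, pat in CONTEXT_PATTERNS.items()}


if __name__ == "__main__":
    print(line_context("Ph.D. in Computer Science, 1999"))
```

A line written in prose style, such as "B.S. (1990), M.S. (1992)", sets both the `bs` and `ms` flags, which is the situation the separate Degree feature is meant to disambiguate.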

Page 19: Harvesting useful information on researchers' home pages


Fields Identification - Method

• Dictionaries

– ParsCit dictionary: detects male names, female names, popular last names, month names, place names and publisher names; each is a single feature

– Major dictionary: helps identify researchers' majors in their educational history; may also help with Research Interests

– Research dictionary: entries classified into high/mid/low confidence

– Universities dictionary: names of most universities, according to the Open Directory
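Dictionary features reduce to set-membership tests, one feature per word list, plus a confidence level for research terms. A minimal sketch with tiny invented word lists follows; the real dictionaries are far larger, and `dictionary_features` is a hypothetical name.

```python
# Tiny illustrative word lists (assumptions); the real dictionaries are much larger.
DICTIONARIES = {
    "male_name": {"john", "david"},
    "female_name": {"mary", "linda"},
    "month": {"january", "march"},
}

# Hypothetical research terms, each with a confidence level.
RESEARCH_TERMS = {"retrieval": "high", "learning": "mid", "analysis": "low"}


def dictionary_features(token):
    """Return one membership flag per dictionary plus a research confidence."""
    key = token.lower().strip(".,;:()")
    feats = {"dict_" + name: key in words for name, words in DICTIONARIES.items()}
    feats["research_conf"] = RESEARCH_TERMS.get(key, "none")
    return feats


if __name__ == "__main__":
    print(dictionary_features("March"))
```

Multi-word entries such as university names would need phrase matching over the line rather than a per-token lookup; that is left out of this sketch.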

Page 20: Harvesting useful information on researchers' home pages


Fields Identification - Method

• Stylistic features

– List feature

– Table features

– Section feature: based on HTML tags like <div>, <p>, <title>, header tags, list elements, table
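One way to derive such stylistic features without a browser is to walk the HTML with a streaming parser and record which elements enclose each piece of text. The sketch below uses Python's standard `html.parser`; the class name, the tracked tags and the flag names are assumptions rather than TIM's actual design.

```python
from html.parser import HTMLParser


class StyleTagger(HTMLParser):
    """Collect text chunks with flags for enclosing list/table/header tags."""

    TRACKED = {
        "li": "in_list",
        "td": "in_table", "th": "in_table",
        "h1": "in_header", "h2": "in_header", "h3": "in_header",
    }

    def __init__(self):
        super().__init__()
        self.depth = {"in_list": 0, "in_table": 0, "in_header": 0}
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        flag = self.TRACKED.get(tag)
        if flag:
            self.depth[flag] += 1

    def handle_endtag(self, tag):
        flag = self.TRACKED.get(tag)
        if flag and self.depth[flag] > 0:
            self.depth[flag] -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            flags = {name: count > 0 for name, count in self.depth.items()}
            self.chunks.append((text, flags))


def stylistic_features(html):
    """Return (text, flags) pairs for every non-empty text chunk."""
    tagger = StyleTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.chunks


if __name__ == "__main__":
    page = "<h1>Jane Doe</h1><ul><li>Paper A</li></ul><p>Bio</p>"
    for text, flags in stylistic_features(page):
        print(text, flags)
```

The counters assume reasonably well-formed markup; pages that omit closing tags such as `</li>` would need a more tolerant strategy.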

Page 21: Harvesting useful information on researchers' home pages


Fields Identification - Performance

• Data set of 40 home pages, cross validation

• Processed 29271 tokens with 29271 phrases; found: 29271 phrases; correct: 23444

• Overall: accuracy: 80.09%; precision: 80.09%; recall: 80.09%; FB1: 80.09

• Per-field results (last column: number of phrases found):

Field               Precision   Recall    FB1     Found
address             78.90%      74.57%    76.67   327
affiliation         30.27%      59.47%    40.12   1110
bs-major            88.89%      78.05%    83.12   36
bs-uni              68.67%      57.00%    62.30   83
bs-year             90.00%      72.00%    80.00   20
email               79.31%      70.77%    74.80   58
fax                 47.73%      72.41%    57.53   88
misc                85.23%      92.35%    88.65   22888
ms-major            71.43%      32.26%    44.44   14
ms-uni              52.94%      52.94%    52.94   85
ms-year             77.78%      56.00%    65.12   18
name                75.66%      51.34%    61.17   152
phd-major           83.33%      73.17%    77.92   36
phd-uni             74.56%      72.03%    73.28   114
phd-year            100.00%     74.07%    85.11   20
phone               53.38%      89.25%    66.80   311
position            79.46%      64.49%    71.20   112
publications        71.05%      43.27%    53.79   3240
research-interest   48.48%      36.04%    41.34   559
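The figures above follow the usual definitions of precision, recall and FB1 (the balanced F-measure). The short sketch below recomputes two of them; the function names are illustrative. Because every token receives exactly one tag (including misc), the number of phrases found equals the number of actual phrases, which is why the overall precision, recall and accuracy coincide.

```python
def precision_recall_f1(correct, found, actual):
    """Compute precision, recall and F1 from raw counts (as fractions)."""
    precision = correct / found if found else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall, in the units given."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Overall figures: 23444 correct out of 29271 found and 29271 actual phrases.
    p, r, f = precision_recall_f1(23444, 29271, 29271)
    print(f"overall: P={p:.2%} R={r:.2%} F1={f:.2%}")
    # Per-field check for "address" from its reported precision and recall.
    print(f"address FB1: {f1_from_rates(78.90, 74.57):.2f}")
```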

Page 22: Harvesting useful information on researchers' home pages


Fields Identification - Discussion

• Data fields to be annotated are similar to those of ArnetMiner.

– Extra: Name, Research Areas, Publications

– Missing: Image

• Use of stylistic features is minimal

Page 23: Harvesting useful information on researchers' home pages


Fields Identification - Discussion

• F1 is slightly lower than ArnetMiner’s (80.09% vs. 83.37%)

– ArnetMiner has the researcher name as input, and uses features referring to the researcher name to identify other fields. TIM has absolutely no prior knowledge about the page to be parsed.

– Identifying ‘Research Interest’ and ‘Publications’ is the most challenging. These fields are not always present and, when present, appear in various styles.

Page 24: Harvesting useful information on researchers' home pages


Home page Identification - Purpose

• Add-on component

• To complete automation of the system: finding home pages to input to the Fields Identification component.

Page 25: Harvesting useful information on researchers' home pages


Home page Identification – Related works

• Ahoy!

– Input: researcher name and institution name (optional)

– Uses MetaCrawler as a 'reference source', cross-filtered by an email database

– Heuristic-based filter: based entirely on the reference's title, URL and short textual extract (if supplied by the search engine)

– Ranking: based on 1/ person name match, 2/ institution URL match, 3/ page appears to be a home page

– URL Pattern Extraction and Generation: extract and learn the pattern on success, else generate a URL from a database of URL patterns

Page 26: Harvesting useful information on researchers' home pages


Home page Identification – Related works

• Ahoy!

– Dynamic search, high performance reported; its use of URL patterns is a good feature

– Does not serve the same purpose as TIM's Home page Identification, which should not take the researcher name as input

– The definition of ‘home page’ is not the same: Ahoy! classifies based on URL patterns, TIM classifies based on page contents

Page 27: Harvesting useful information on researchers' home pages


Home page Identification – Method

• Collect a list of university domains

• Use Yahoo! BOSS to search for professors in the institutions

• For each valid web page, fetch the page, scan for words indicating ‘phone’, ‘mail’ and ‘professor’.

• Count the number of appearances.

– #phone < 3 && #mail < 2 && #professor < 5 → Home page

• Home pages will be passed to Fields Identification component.
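The counting rule in the method above can be sketched as follows. The thresholds are the ones on the slide; counting case-insensitive substring occurrences is an assumption, since the slides do not say which word forms are matched, and the function names are illustrative. Fetching pages and querying the search engine are left out.

```python
# Upper bounds from the slide: a page qualifies only if every count is below its bound.
THRESHOLDS = {"phone": 3, "mail": 2, "professor": 5}


def keyword_counts(page_text):
    """Count case-insensitive substring occurrences of each keyword (assumed matching)."""
    lowered = page_text.lower()
    return {word: lowered.count(word) for word in THRESHOLDS}


def looks_like_home_page(page_text):
    """Apply the rule #phone < 3 && #mail < 2 && #professor < 5."""
    counts = keyword_counts(page_text)
    return all(counts[word] < limit for word, limit in THRESHOLDS.items())


if __name__ == "__main__":
    single = "Professor Jane Doe. Phone: 123. Email: jane at example dot edu"
    listing = "Professor A, email a. Professor B, email b."
    print(looks_like_home_page(single), looks_like_home_page(listing))
```

As written, the rule sets only upper bounds, so a page containing none of the keywords also passes; presumably the bounds are meant to screen out directory-style pages that list many people.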

Page 28: Harvesting useful information on researchers' home pages


Home page Identification – Discussion

• The query to Yahoo! BOSS is not optimal, but it covers the majority

• Drawback: the result set from Yahoo! BOSS may contain duplicate pages, or sub-pages of a researcher’s home page, which are treated as 2 different records.

– Needs high confidence in overall system performance, but researcher names are not unique.

– Best if duplication can be eliminated by analyzing URLs, but domain hierarchies differ within a department, between departments, and between institutions.

Page 29: Harvesting useful information on researchers' home pages


Post-processing - Purpose

• Input: CRF++ output file from Fields Identification.

• Group neighboring tokens identified with the same annotation tag

• Deduplication

• Store into database
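The grouping and deduplication steps can be sketched as below. The sketch assumes the usual layout of CRF++ test output: one token per line, whitespace-separated columns with the token first and the predicted tag last, and blank lines between sequences. The function names and the choice to skip `misc` tokens are assumptions; database storage is omitted.

```python
from itertools import groupby


def group_tagged_tokens(crf_output, skip_tag="misc"):
    """Merge runs of neighbouring tokens with the same predicted tag."""
    rows = []
    for line in crf_output.splitlines():
        cols = line.split()
        if len(cols) >= 2:
            rows.append((cols[0], cols[-1]))  # (token, predicted tag)
        else:
            rows.append((None, None))         # blank line: sequence boundary
    fields = []
    for tag, run in groupby(rows, key=lambda row: row[1]):
        if tag is not None and tag != skip_tag:
            fields.append((tag, " ".join(token for token, _ in run)))
    return fields


def deduplicate(fields):
    """Drop repeated (tag, text) pairs while keeping first-seen order."""
    seen = set()
    unique = []
    for field in fields:
        if field not in seen:
            seen.add(field)
            unique.append(field)
    return unique


if __name__ == "__main__":
    sample = "Jane\tname\nDoe\tname\nis\tmisc\nProfessor\tposition\n\nJane\tname\nDoe\tname\n"
    print(deduplicate(group_tagged_tokens(sample)))
```

Exact-match deduplication is the simplest option; near-duplicates such as differently formatted phone numbers would need normalisation first.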

Page 30: Harvesting useful information on researchers' home pages


Contribution

• Produced an automated system for fetching researchers’ information from the World Wide Web.

• Introduced a number of features for Fields Identification machine learning.

Page 31: Harvesting useful information on researchers' home pages


Future improvements

• Fields Identification

– Introduce more features, especially stylistic features

– Strengthen features targeting Name, Research Interest and Publications tags

– Cater for the <image> tag

– Be able to handle pages using HTML frames

– Be able to follow links on the page if necessary

• Home page Identification

– Improve heuristics

• Post-processing

– Be able to refine output from Fields Identification

• A new component providing a front end for users to query the database

Page 32: Harvesting useful information on researchers' home pages


THANK YOU!

Questions?