Information Extraction Research @ Yahoo! Labs Bangalore

Rajeev Rastogi
Yahoo! Labs Bangalore

The most visited site on the internet

• 600 million+ users per month

• Super popular properties
  – News, finance, sports
  – Answers, Flickr, del.icio.us
  – Mail, messaging
  – Search

Unparalleled scale

• 25 terabytes of data collected each day
  – Over 4 billion clicks every day
  – Over 4 billion emails per day
  – Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable

No other company on the planet processes as much data as we do!

Yahoo! Labs Bangalore

• Focus is on basic and applied research
  – Search
  – Advertising
  – Cloud computing
• University relations
  – Faculty research grants
  – Summer internships
  – Sharing data/computing infrastructure
  – Conference sponsorships
  – PhD co-op program

What does search look like today?

Search results of the future: Structured abstracts

[Figure: example structured abstracts for results from yelp.com, babycenter, epicurious, answers.com, LinkedIn, webmd, New York Times, Gawker]

Rank by price

Search results of the future: Intelligent ranking

A key technology for enabling search transformation

Information extraction (IE)

Reviews

Information extraction (IE)

• Goal: Extract structured records from Web pages

[Figure: example Web page with the extracted fields Name, Address, Category, Phone, Price, Map]

Multiple verticals

• Business, social networking, video, ….

One schema per vertical

[Figure: annotated example pages, one per vertical. Field labels shown: Price, Category, Address, Phone; Name, Title, Education, Connections; Posted by, Title, Date, Rating, Views]

IE on the Web is a hard problem

• Web pages are noisy
• Pages belonging to different Web sites have different layouts

Noise

Web page types

Template-based Hand-crafted

Template-based pages

• Pages within a Web site are generated using scripts and have very similar structure
  – Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
• E.g. search query: “Chinese Mirch New York”
  – 9 template-based pages in the top 10 results

Wrapper Induction

[Figure: wrapper induction pipeline]
Learn: Website pages -> Sample -> Sample pages -> Annotate pages -> Annotations -> Learn wrappers -> XPath rules
Extract: Website pages -> Apply wrappers -> Records

• Enables extraction from template-based pages

Example

XPath: /html/body/div/div/div/div/div/div/span
Generalize: /html/body//div//span

Filters

• Apply filters to prune from multiple candidates that match XPath expression

XPath: /html/body//div//span

Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
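To make the XPath-plus-filter step concrete, here is a minimal sketch. The page fragment, the helper name `extract_phone`, and the use of Python's standard-library `xml.etree.ElementTree` are illustrative assumptions and not the system described in the talk. ElementTree supports only a subset of XPath and needs well-formed markup, so real Web pages would require an HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
  <div><div>
    <span>Chinese Mirch</span>
    <span>(212) 532-3663</span>
    <span>Open daily</span>
  </div></div>
</body></html>"""

# Phone filter: (ddd) ddd-dddd
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def extract_phone(page: str) -> Optional[str]:
    """Return the first XPath candidate whose text passes the phone filter."""
    root = ET.fromstring(page)  # root is the <html> element
    # Relative form of the generalized rule /html/body//div//span.
    matches = root.findall("./body//div//span")
    # A span nested inside several divs is returned once per div,
    # so de-duplicate while keeping document order.
    candidates = list({id(e): e for e in matches}.values())
    for element in candidates:
        text = (element.text or "").strip()
        if PHONE.fullmatch(text):
            return text
    return None


print(extract_phone(PAGE))
```

The XPath rule alone matches all three spans here; the regex filter is what narrows the candidates down to the phone field.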

Limitations of wrappers

• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive

Research challenge

• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site

• Only annotate pages from a few sites initially as training data

Conditional Random Fields (CRFs)

• Models conditional probability distribution of label sequence y = y1,…,yn given input sequence x = x1,…,xn

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{|\mathbf{x}|} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

  – f_k: features, λ_k: weights
• Choose λ_k to maximize log-likelihood of training data
• Use Viterbi algorithm to compute label sequence y with highest probability
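The Viterbi decoding step for a linear-chain CRF can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function `toy_score` stands in for the weighted feature sum over (previous label, current label, x, t), and its weights, the start symbol, and the token list are all made-up assumptions. The normalizer Z(x) is ignored because it does not depend on y.

```python
from typing import Callable, Dict, List, Sequence

START = "<s>"  # assumed start-of-sequence label


def viterbi(x: Sequence[str], labels: Sequence[str],
            score: Callable[[str, str, Sequence[str], int], float]) -> List[str]:
    """Return the label sequence y maximising sum_t score(y[t-1], y[t], x, t)."""
    if not x:
        return []
    # best[t][y]: best total score of any labelling of x[0..t] ending in y
    best: List[Dict[str, float]] = [{y: score(START, y, x, 0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(x)):
        column, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + score(p, y, x, t))
            column[y] = best[-1][prev] + score(prev, y, x, t)
            pointers[y] = prev
        best.append(column)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    y = max(labels, key=lambda label: best[-1][label])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


def toy_score(y_prev: str, y: str, x: Sequence[str], t: int) -> float:
    """Hand-set stand-in for the weighted feature sum (illustrative only)."""
    s = 0.0
    if y == "Phone" and any(ch.isdigit() for ch in x[t]):
        s += 2.0
    if y == "Name" and x[t].isalpha():
        s += 2.0
    if y_prev == y:  # neighbouring tokens tend to share a label
        s += 0.5
    return s


tokens = ["Chinese", "Mirch", "(212)", "532-3663"]
print(viterbi(tokens, ["Name", "Phone"], toy_score))
```

In a trained CRF the score would come from the learned weights λ_k rather than hand-set constants; the dynamic program itself is unchanged.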

CRFs-based IE

Name

Category

Address

Phone

Noise

• Web pages can be viewed as labeled sequences

• Train CRF using pages from few Web sites
• Then use trained CRF to extract from remaining sites

Drawbacks of CRFs

• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work too well across Web sites that
  – contain long pages with lots of noise
  – have very different structure

An alternate approach that exploits site knowledge

• Build attribute classifiers for each attribute
  – Use pages from a few initial Web sites
• For each page from a new Web site
  – Segment page into sequence of fields (using static repeating text)
  – Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels
  – Uniqueness: an attribute occurs at most once in a page
  – Proximity: attribute values appear close together in a page
  – Structural: relative positions of attributes are identical across pages of a Web site
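The segmentation step can be sketched as follows, under a simplifying assumption: each page is already given as a list of text nodes, and any text that appears on every page of the site counts as static template text. The function name `segment` and the sample pages (including the "Cuisine:", "Address:" and "Phone:" captions) are hypothetical. At least two pages are needed, since with a single page every node would look static.

```python
from typing import List


def segment(pages: List[List[str]]) -> List[List[str]]:
    """Split each page into fields separated by static (repeating) text."""
    # Text nodes present on every page are treated as template text.
    static = set(pages[0]).intersection(*map(set, pages[1:]))
    segmented = []
    for nodes in pages:
        fields, current = [], []
        for text in nodes:
            if text in static:
                # A static node closes the field collected so far.
                if current:
                    fields.append(" ".join(current))
                    current = []
            else:
                current.append(text)
        if current:
            fields.append(" ".join(current))
        segmented.append(fields)
    return segmented


pages = [
    ["Home", "Chinese Mirch", "Cuisine:", "Chinese, Indian",
     "Address:", "120 Lexington Avenue", "Phone:", "(212) 532 3663"],
    ["Home", "Jewel of India", "Cuisine:", "Indian",
     "Address:", "15 W 44th St", "Phone:", "(212) 869 5544"],
]
for fields in segment(pages):
    print(fields)
```

One limitation of this simple rule: a value that happens to be identical on every sampled page (for example a shared city name) would be mistaken for template text, so a larger page sample helps.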

Attribute classifiers + constraints example

Page1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663

Page2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544

Page3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200

[Figure: labels assigned by the attribute classifiers to the fields of each page; some fields receive several candidate labels, e.g. "Category, Name" and "Name, Noise"]

Uniqueness constraint: Name
Precedence constraint: Name < Category

Page3 after applying the constraints:
21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
Name | Category | Address | Phone
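The disambiguation in the example above can be sketched as a small search over label assignments. This is one possible formulation and not necessarily the method used in the talk: the classifier scores below are invented for illustration, `disambiguate` is a hypothetical helper, and exhaustive enumeration is only practical for pages with a handful of fields.

```python
from itertools import product
from typing import Dict, List, Optional, Set, Tuple


def disambiguate(candidates: List[Dict[str, float]],
                 unique: Set[str],
                 precedence: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Pick the highest-scoring labelling that satisfies all constraints.

    candidates[i] maps each candidate label of field i to a classifier score.
    unique lists attributes that may occur at most once per page.
    precedence holds pairs (a, b) meaning a must appear before b.
    """
    best, best_score = None, float("-inf")
    for labels in product(*(c.keys() for c in candidates)):
        if any(labels.count(a) > 1 for a in unique):
            continue  # uniqueness violated
        position = {label: i for i, label in enumerate(labels)}
        if any(a in position and b in position and position[a] > position[b]
               for a, b in precedence):
            continue  # precedence violated
        total = sum(c[label] for c, label in zip(candidates, labels))
        if total > best_score:
            best, best_score = list(labels), total
    return best


# Fields of Page3 with made-up classifier scores.
page3 = [
    {"Category": 0.55, "Name": 0.45},               # "21 Club"
    {"Name": 0.50, "Category": 0.40, "Noise": 0.10},  # "American"
    {"Address": 0.90, "Noise": 0.10},               # "21 W 52nd St ..."
    {"Phone": 0.95, "Noise": 0.05},                 # "(212) 582 7200"
]
print(disambiguate(page3,
                   unique={"Name", "Category", "Address", "Phone"},
                   precedence=[("Name", "Category")]))
```

Taken alone, the classifiers would label "21 Club" as Category and "American" as Name; the precedence constraint rules that assignment out, and the best remaining one is Name, Category, Address, Phone.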

Performance evaluation: Datasets

• 100 pages from 5 restaurant Web sites with very different structure
  – www.citysearch.com
  – www.fromers.com
  – www.nymag.com
  – www.superpages.com
  – www.yelp.com
• Extract attributes: Name, Address, Phone number, Hours of operation, Description


Methods considered

• CRFs, attribute classifiers + constraints
• Features
  – Lexicon: words in the training Web pages
  – Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay, …
  – Attribute-level: number of words, overlap with title, …
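As an illustration of the regex and attribute-level features listed above, here is a minimal sketch. The feature names come from the slide, but their exact definitions are not given there, so the definitions below (and the helper names `regex_features` and `attribute_features`) are assumptions.

```python
import re
from typing import Dict

DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday", "mon", "tue", "wed", "thu", "fri", "sat", "sun"}


def regex_features(token: str) -> Dict[str, bool]:
    """Token-level pattern features (assumed definitions)."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": re.fullmatch(r"[0-9]+", token) is not None,
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }


def attribute_features(field: str, title: str) -> Dict[str, float]:
    """Field-level features: word count and overlap with the page title."""
    words = field.lower().split()
    title_words = set(title.lower().split())
    overlap = sum(w in title_words for w in words) / len(words) if words else 0.0
    return {"num_words": float(len(words)), "title_overlap": overlap}


print(regex_features("10019"))
print(attribute_features("21 Club", "21 Club - New York"))
```

A field that largely overlaps with the page title is a plausible Name candidate, while a five-digit number hints at the ZIP code inside an Address.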

Evaluation methodology

• Metrics
  – Precision, recall, F1 for attributes

• Test on one site, use pages from remaining 4 sites as training data

• Average measures over all 5 sites
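The evaluation protocol can be sketched as follows. The metric definitions are the standard ones. Representing extractions as (page, attribute, value) tuples and the helper names `prf` and `leave_one_site_out` are assumptions made for this illustration.

```python
from typing import Iterator, List, Set, Tuple

Extraction = Tuple[str, str, str]  # (page id, attribute, value)


def prf(predicted: Set[Extraction], gold: Set[Extraction]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted extractions against gold ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def leave_one_site_out(sites: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sites, test site) with each site held out once."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


SITES = ["citysearch", "fromers", "nymag", "superpages", "yelp"]
for train, test in leave_one_site_out(SITES):
    print("train on", train, "test on", test)

gold = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200")}
predicted = {("p1", "Name", "21 Club"), ("p1", "Phone", "American")}
print(prf(predicted, gold))
```

The reported numbers would then be the per-attribute scores averaged over the five held-out sites.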

Experimental results

            Precision            Recall
            CRF   Constraint     CRF   Constraint
Name        .39   1              .34   1
Phone       .02   1              .2    .99
Address     .01   .81            .16   .83
Hours       .22   1              .36   1
Desc        .13   .25            0     .15
Overall     .15   .81            .21   .76

Other IE scenarios: Browse page extraction

Similar-structured records

IE big picture/taxonomy

• Things to extract from
  – Template-based, browse, hand-crafted pages, text
• Things to extract
  – Records, tables, lists, named entities
• Techniques used
  – Structure-based (HTML tags, DOM tree paths), e.g. wrappers
  – Content-based (attribute values/models), e.g. dictionaries
  – Structure + Content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs
• Level of automation
  – Manual, supervised, unsupervised
