![Page 1: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/1.jpg)
Information Extraction Research @ Yahoo! Labs Bangalore
Rajeev Rastogi
Yahoo! Labs Bangalore
![Page 2: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/2.jpg)
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
– News, finance, sports
– Answers, flickr, del.icio.us
– Mail, messaging
– Search
![Page 3: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/3.jpg)
Unparalleled scale
• 25 terabytes of data collected each day
– Over 4 billion clicks every day
– Over 4 billion emails per day
– Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable
No other company on the planet processes as much data as we do!
![Page 4: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/4.jpg)
Yahoo! Labs Bangalore
• Focus is on basic and applied research
– Search
– Advertising
– Cloud computing
• University relations
– Faculty research grants
– Summer internships
– Sharing data/computing infrastructure
– Conference sponsorships
– PhD co-op program
![Page 5: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/5.jpg)
What does search look like today?
![Page 6: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/6.jpg)
Search results of the future: Structured abstracts
yelp.com
babycenter
epicurious
answers.com
webmd
New York Times
Gawker
![Page 7: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/7.jpg)
Search results of the future: Intelligent ranking
Rank by price
![Page 8: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/8.jpg)
A key technology for enabling search transformation
Information extraction (IE)
![Page 9: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/9.jpg)
Information extraction (IE)
• Goal: Extract structured records from Web pages
[Figure: example Web page with labeled attributes: Name, Address, Category, Phone, Price, Map, Reviews]
![Page 10: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/10.jpg)
Multiple verticals
• Business, social networking, video, ….
![Page 11: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/11.jpg)
One schema per vertical
[Figure: example pages from different verticals with labeled attributes: Name, Address, Category, Phone, Price, Title, Education, Connections, Posted by, Date, Rating, Views]
![Page 12: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/12.jpg)
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts
Noise
![Page 13: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/13.jpg)
Web page types
Template-based Hand-crafted
![Page 14: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/14.jpg)
Template-based pages
• Pages within a Web site are generated using scripts and have very similar structure
– Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
• E.g. search query: “Chinese Mirch New York”
– 9 template-based pages in the top 10 results
![Page 15: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/15.jpg)
Wrapper Induction
[Diagram: Learn phase: Website pages → Sample → Sample pages → Annotate pages → Annotations → Learn wrappers → XPath rules. Extract phase: Website pages → Apply wrappers → Records]
• Enables extraction from template-based pages
![Page 16: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/16.jpg)
Example
XPath: /html/body/div/div/div/div/div/div/span
Generalize: /html/body//div//span
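The effect of generalizing an XPath can be sketched with Python's standard-library `xml.etree.ElementTree`, which supports a subset of XPath. The two page snippets and the relative paths below are assumptions made up for illustration (ElementTree paths are relative to the root element, so `./body/...` stands in for `/html/body/...`); this is not the wrapper learner itself.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same site; the span sits at different depths.
PAGE_A = "<html><body><div><div><span>Chinese Mirch</span></div></div></body></html>"
PAGE_B = ("<html><body><div><div><div><span>Jewel of India</span>"
          "</div></div></div></body></html>")

SPECIFIC = "./body/div/div/span"   # fixed nesting depth, learned from PAGE_A
GENERAL = "./body//div//span"      # any depth, analogous to /html/body//div//span


def extract(page, path):
    """Return the text of every distinct element matched by `path`."""
    root = ET.fromstring(page)  # the root element is <html>
    seen, texts = set(), []
    for element in root.findall(path):
        # Nested descendant steps can reach the same element through more
        # than one <div>, so keep only the first occurrence.
        if id(element) not in seen:
            seen.add(id(element))
            texts.append(element.text)
    return texts


print(extract(PAGE_A, SPECIFIC))  # ['Chinese Mirch']
print(extract(PAGE_B, SPECIFIC))  # []
print(extract(PAGE_B, GENERAL))   # ['Jewel of India']
```

The specific path breaks as soon as the nesting depth changes, while the generalized path still matches; the price is that it may match more candidates, which is what the filters on the next slide address.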
![Page 17: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/17.jpg)
Filters
• Apply filters to prune the multiple candidates that match an XPath expression
XPath: /html/body//div//span
Regex Filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
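A minimal sketch of such a filter in Python, assuming the phone pattern is meant as "three digits in parentheses, a space, three digits, a hyphen, four digits". The candidate strings and the function name `filter_candidates` are hypothetical stand-ins for the text of nodes matched by the XPath.

```python
import re

# Phone pattern: (ddd) ddd-dddd, with the parentheses matched literally.
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def filter_candidates(candidates, pattern=PHONE):
    """Keep only the candidates whose whole text matches the pattern."""
    return [text for text in candidates if pattern.fullmatch(text.strip())]


# Hypothetical texts of all nodes matched by /html/body//div//span.
candidates = ["Chinese Mirch", "120 Lexington Avenue", "(212) 532-3663", "Open daily"]
print(filter_candidates(candidates))  # ['(212) 532-3663']
```

Using `fullmatch` rather than `search` keeps a longer string that merely contains a phone number from passing the filter.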
![Page 18: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/18.jpg)
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
– Need to learn a separate wrapper for each site
– Annotating example pages from thousands of sites can be time-consuming & expensive
![Page 19: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/19.jpg)
Research challenge
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
• Only annotate pages from a few sites initially as training data
![Page 20: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/20.jpg)
Conditional Random Fields (CRFs)
• Models conditional probability distribution of label sequence y = y_1, …, y_n given input sequence x = x_1, …, x_n

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{|\mathbf{x}|} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

– f_k: features, λ_k: weights
• Choose λ_k to maximize log-likelihood of training data
• Use Viterbi algorithm to compute label sequence y with highest probability
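The Viterbi step can be sketched in plain Python for a linear-chain model whose per-position score is the weighted feature sum Σ_k λ_k f_k(y_{t-1}, y_t, x, t). The three features, their weights, the two labels, and the token sequence below are assumptions chosen only to make the example small; they are not taken from the talk, and the weights are set by hand rather than learned.

```python
def viterbi(n_steps, labels, score):
    """Return the highest-scoring label sequence.

    score(prev_label, label, t) is the weighted feature sum at position t;
    prev_label is None at t == 0.
    """
    if n_steps == 0:
        return []
    # best[t][y]: best total score of any label sequence ending in y at step t
    best = [{y: score(None, y, 0) for y in labels}]
    back = []  # back[t - 1][y]: best predecessor of label y at step t
    for t in range(1, n_steps):
        column, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + score(p, y, t))
            column[y] = best[-1][prev] + score(prev, y, t)
            pointers[y] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def make_score(tokens, weights):
    """Build score() from three hypothetical binary features."""
    def score(prev, y, t):
        has_digit = any(ch.isdigit() for ch in tokens[t])
        active = {
            "digits_and_phone": has_digit and y == "Phone",
            "no_digits_and_name": not has_digit and y == "Name",
            "same_as_previous": prev == y,
        }
        return sum(weights[name] for name, on in active.items() if on)
    return score


tokens = ["Chinese", "Mirch", "(212)", "532-3663"]
weights = {"digits_and_phone": 2.0, "no_digits_and_name": 2.0, "same_as_previous": 0.5}
labels = ["Name", "Phone"]
print(viterbi(len(tokens), labels, make_score(tokens, weights)))
# ['Name', 'Name', 'Phone', 'Phone']
```

The normalizer Z(x) does not depend on y, so it can be ignored when only the most probable sequence is needed.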
![Page 21: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/21.jpg)
CRFs-based IE
Name
Category
Address
Phone
Noise
• Web pages can be viewed as labeled sequences
• Train CRF using pages from a few Web sites
• Then use trained CRF to extract from remaining sites
![Page 22: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/22.jpg)
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work too well across Web sites that
– contain long pages with lots of noise
– have very different structure
![Page 23: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/23.jpg)
An alternate approach that exploits site knowledge
• Build attribute classifiers for each attribute
– Use pages from a few initial Web sites
• For each page from a new Web site
– Segment page into sequence of fields (using static repeating text)
– Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels
– Uniqueness: an attribute occurs at most once in a page
– Proximity: attribute values appear close together in a page
– Structural: relative positions of attributes are identical across pages of a Web site
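One possible reading of "segment page using static repeating text" is sketched below: text nodes that occur on every page of a site are treated as static template text, and the runs of remaining nodes between them become fields. The slides do not spell out the segmentation procedure, so this interpretation, the page representation (a list of text nodes), and the sample pages are all assumptions.

```python
from itertools import groupby


def static_text(pages):
    """Text nodes that appear on every page are assumed to be template text."""
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)
    return common


def segment(page, static):
    """Split a page into fields: maximal runs of non-static text nodes."""
    return [
        " ".join(run)
        for is_static, run in groupby(page, key=lambda node: node in static)
        if not is_static
    ]


# Hypothetical pages from one site, already flattened into text nodes.
pages = [
    ["Name", "Chinese Mirch", "Cuisine", "Chinese, Indian", "Phone", "(212) 532 3663"],
    ["Name", "Jewel of India", "Cuisine", "Indian", "Phone", "(212) 869 5544"],
]
static = static_text(pages)
print(segment(pages[0], static))
# ['Chinese Mirch', 'Chinese, Indian', '(212) 532 3663']
```

A value that happens to repeat on every page would be mistaken for template text under this scheme, which is one reason a real segmenter would need more than exact matching.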
![Page 24: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/24.jpg)
Attribute classifiers + constraints example
Page1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
Page2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544
Page3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
Candidate labels assigned by the attribute classifiers (some fields receive more than one): Name; Name, Noise; Category; Category, Name; Address; Phone
Uniqueness constraint: Name
Precedence constraint: Name < Category
Final labels for Page3: 21 Club (Name) | American (Category) | 21 W 52nd St, New York, NY 10019 (Address) | (212) 582 7200 (Phone)
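How the uniqueness and precedence constraints resolve ambiguous labels can be sketched as a small search over label assignments. The classifier scores below are invented for illustration, and the assignment of the ambiguous candidate sets ("Name, Noise" and "Category, Name") to the first two fields of Page3 is an assumption; the exhaustive search is a stand-in for whatever inference the actual system uses.

```python
from itertools import product


def best_labeling(candidates, unique, precedence):
    """Pick the highest-scoring labeling that satisfies the constraints.

    candidates: one {label: score} dict per field, in page order.
    unique:     labels that may occur at most once in a page.
    precedence: (a, b) pairs meaning a must appear before b when both occur.
    """
    best, best_score = None, float("-inf")
    for labels in product(*(c.keys() for c in candidates)):
        if any(labels.count(a) > 1 for a in unique):
            continue  # violates uniqueness
        if any(a in labels and b in labels and labels.index(a) > labels.index(b)
               for a, b in precedence):
            continue  # violates precedence
        total = sum(c[label] for c, label in zip(candidates, labels))
        if total > best_score:
            best, best_score = list(labels), total
    return best


# Fields of Page3 with hypothetical classifier scores.
fields = ["21 Club", "American", "21 W 52nd St New York, NY 10019", "(212) 582 7200"]
candidates = [
    {"Name": 0.6, "Noise": 0.4},
    {"Name": 0.55, "Category": 0.45},
    {"Address": 0.9},
    {"Phone": 0.95},
]

greedy = [max(c, key=c.get) for c in candidates]
print(greedy)  # ['Name', 'Name', 'Address', 'Phone']

constrained = best_labeling(
    candidates,
    unique=["Name", "Category", "Address", "Phone"],
    precedence=[("Name", "Category")],
)
print(constrained)  # ['Name', 'Category', 'Address', 'Phone']
```

Picking the top label per field independently labels two fields as Name; enforcing uniqueness pushes the second field to Category, which also satisfies Name < Category.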
![Page 25: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/25.jpg)
Performance evaluation: Datasets
• 100 pages from 5 restaurant Web sites with very different structure
– www.citysearch.com
– www.fromers.com
– www.nymag.com
– www.superpages.com
– www.yelp.com
• Extract attributes: Name, Address, Phone num, Hours of operation, Description
![Page 26: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/26.jpg)
Methods considered
• CRFs, attribute classifiers + constraints
• Features
– Lexicon: Words in the training Web pages
– Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay, …
– Attribute-level: Num of words, Overlap with title, …
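The regex and attribute-level features could be computed along the following lines. The feature names come from the slide, but their exact definitions are not given, so every definition here (for example, what counts as a day, or how title overlap is measured) is an assumption.

```python
import re

DAYS = {"mon", "tue", "wed", "thu", "fri", "sat", "sun",
        "monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}

# Token-level features named on the slide; the definitions are guesses.
REGEX_FEATURES = {
    "isAlpha": lambda tok: tok.isalpha(),
    "isAllCaps": lambda tok: tok.isalpha() and tok.isupper(),
    "isNum": lambda tok: tok.isdigit(),
    "is5DigitNum": lambda tok: re.fullmatch(r"[0-9]{5}", tok) is not None,
    "isDay": lambda tok: tok.lower().rstrip(".") in DAYS,
}


def token_features(token):
    """Evaluate every regex-style feature on a single token."""
    return {name: test(token) for name, test in REGEX_FEATURES.items()}


def field_features(text, title):
    """Attribute-level features: word count and word overlap with the page title."""
    words = text.split()
    title_words = set(title.lower().split())
    overlap = sum(word.lower() in title_words for word in words)
    return {"numWords": len(words), "titleOverlap": overlap}


print(token_features("10016"))
print(field_features("21 Club", "21 Club - New York Restaurants"))
```

A 5-digit test is a cheap signal for US ZIP codes in addresses, and title overlap is a plausible signal for the Name attribute, since restaurant pages often repeat the name in the page title.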
![Page 27: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/27.jpg)
Evaluation methodology
• Metrics
– Precision, recall, F1 for attributes
• Test on one site, use pages from remaining 4 sites as training data
• Average measures over all 5 sites
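The protocol (hold out one site, train on the other four, average over the five folds) and the metrics can be sketched as follows. The site list is taken from the datasets slide; the predicted and gold sets are made-up examples, and representing an extraction as a (page, attribute, value) triple is an assumption.

```python
SITES = ["www.citysearch.com", "www.fromers.com", "www.nymag.com",
         "www.superpages.com", "www.yelp.com"]


def leave_one_site_out(sites):
    """Yield (test_site, training_sites) pairs, one per site."""
    for held_out in sites:
        yield held_out, [site for site in sites if site != held_out]


def precision_recall_f1(predicted, gold):
    """Compare sets of (page, attribute, value) triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average(per_site_scores):
    """Average each metric over the per-site (precision, recall, f1) tuples."""
    count = len(per_site_scores)
    return tuple(sum(column) / count for column in zip(*per_site_scores))


# Hypothetical extractions for one held-out site.
gold = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200"),
        ("p2", "Name", "Jewel of India"), ("p2", "Phone", "(212) 869 5544")}
predicted = {("p1", "Name", "21 Club"), ("p1", "Phone", "(212) 582 7200"),
             ("p2", "Name", "Indian")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.5 0.57
```

Holding out an entire site, rather than random pages, measures what the research challenge asks for: extraction from a site for which no page was annotated.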
![Page 28: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/28.jpg)
Experimental results
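| Attribute | Precision (CRF) | Precision (Constraint) | Recall (CRF) | Recall (Constraint) |
|---|---|---|---|---|
| Name | .39 | 1 | .34 | 1 |
| Phone | .02 | 1 | .2 | .99 |
| Address | .01 | .81 | .16 | .83 |
| Hours | .22 | 1 | .36 | 1 |
| Desc | .13 | .25 | 0 | .15 |
| Overall | .15 | .81 | .21 | .76 |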
![Page 29: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/29.jpg)
Other IE scenarios: Browse page extraction
Similar-structured records
![Page 30: Information Extraction Research @ Yahoo! Labs Bangalore](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56813b46550346895da4245a/html5/thumbnails/30.jpg)
IE big picture/taxonomy
• Things to extract from
– Template-based, browse, hand-crafted pages, text
• Things to extract
– Records, tables, lists, named entities
• Techniques used
– Structure-based (HTML tags, DOM tree paths), e.g. wrappers
– Content-based (attribute values/models), e.g. dictionaries
– Structure + Content (sequential/hierarchical relationships among attribute values), e.g. hierarchical CRFs
• Level of automation
– Manual, supervised, unsupervised