Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents. ICWE 2013: International Conference on Web Engineering, 8-12 July 2013, Aalborg, Denmark. Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze (L3S Research Center, DE)

Upload: besnik-fetahu

Post on 27-Jan-2015


Category:

Technology



DESCRIPTION

Paper presentation at ICWE2013. Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents http://icwe2013.webengineering.org/accepted-full-papers

TRANSCRIPT

Page 1: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents


10/04/23 ICWE 2013, Aalborg, Denmark


Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

ICWE 2013: International Conference on Web Engineering

8-12 July 2013, Aalborg, Denmark

Besnik Fetahu, Bernardo Pereira Nunes, Stefan Dietze (L3S Research Center, DE)

Page 2: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Outline

– Introduction

– Related Work

– Focused Knowledge Extraction

• Pre-Processing & Query Expansion

• Pattern Generation

• Contextual Structure

– Evaluation

– Results

– Conclusions


Page 3: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Introduction

• Motivation

– Large amounts of textual Web Documents

– Need for efficient techniques to query for relevant information

– Extraction of chunks of text: relations, named entities etc.

– Summaries as a means of highlighting the most important chunks of text

• Issues:

– Summaries as non-structured text

– Weak relationship between user interests and the importance of specific chunks of text in a corpus

Page 4: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Prominent Text Summarisation Approaches

• Heuristics for relation extraction

• Extraction of information based on predefined templates

• Sentence selection based on the presence of specific terms

• Latent Semantic Analysis (LSA) for measuring importance of specific terms

• Tree Kernels encoding relevant information for event detection

• Latent Dirichlet Allocation (LDA) for topic modelling

• Populating ontologies based on extracted information from text

[Slide labels grouping the approaches by research area: IE, IR, ML, SW]

Page 5: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction Overview

• Structured Summary Generation Components:

– Query Expansion and Reformulation

– Named Entity Definition and Co-Reference Resolution

– Pattern Generation

– Contextual Structure of Summaries


Page 6: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Pipeline

[Figure: pipeline from user query to structured summary]

• User query: “Stem Cell”

• Query typing and expansion: Anatomical structure, Biotechnology, Cloning, Cell biology, Developmental Biology, Stem Cell

• Corpus filtering: OR/AND of expanded query terms → filtered documents

• Annotate: NER, POS

• Patterns

• Structured summary (Entities, Actions):

Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health insurance → enroll → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.
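The corpus filtering step of the pipeline (OR/AND of expanded query terms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `matches` and `filter_documents`, the substring matching, and the toy corpus are all assumptions.

```python
from typing import Iterable, List


def matches(text: str, terms: Iterable[str], mode: str = "OR") -> bool:
    """Return True if `text` contains any (OR) or all (AND) of `terms`."""
    lowered = text.lower()
    hits = [term.lower() in lowered for term in terms]
    if mode == "AND":
        return all(hits)
    if mode == "OR":
        return any(hits)
    raise ValueError(f"unknown mode: {mode!r}")


def filter_documents(corpus: List[str], terms: List[str], mode: str = "OR") -> List[str]:
    """Keep only the documents that match the expanded query terms."""
    return [doc for doc in corpus if matches(doc, terms, mode)]


# Expanded query terms as on the slide; the corpus below is made up.
expanded_terms = ["Stem Cell", "Biotechnology", "Cloning"]
corpus = [
    "Congress debates funding for stem cell research.",
    "Cloning and biotechnology firms rely on stem cell lines.",
    "The Chicago Bears won their playoff game.",
]
any_match = filter_documents(corpus, expanded_terms, mode="OR")
all_match = filter_documents(corpus, expanded_terms, mode="AND")
```

Disjunction keeps every document that mentions at least one term, which favours recall; conjunction keeps only documents that mention all terms, which favours precision.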

Page 7: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Query Expansion

• Query (“Stem Cell”) → NER → http://dbpedia.org/page/Stem_cell

• Query Typing & Expansion

– DBpedia SPARQL Query Expansion:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?o ?label WHERE { <http://dbpedia.org/resource/Stem_cell> ?p ?o . ?o rdfs:label ?label }

• Query: “Stem Cell” is processed into:

– Typed Query:

• http://dbpedia.org/page/Stem_cell

– Expanded Query:

• http://dbpedia.org/page/Biotechnology

• http://dbpedia.org/page/Cloning

• http://dbpedia.org/page/Cell_biology

• http://dbpedia.org/page/Developmental_biology

– Conjunction/Disjunction of expanded query terms
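As a rough sketch of how the expansion query could be issued and its results read, the snippet below builds the query for a typed resource, composes a request URL, and parses a response in the W3C SPARQL JSON results format. The helper names are hypothetical, the `format` URL parameter is an assumption about the public DBpedia endpoint, and the sample payload is made up so that the example runs without network access.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode

# Public DBpedia endpoint; the "format" parameter below is an assumption.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_expansion_query(resource_uri: str) -> str:
    """Build the expansion query from the slide for one DBpedia resource."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?o ?label WHERE {\n"
        f"  <{resource_uri}> ?p ?o .\n"
        "  ?o rdfs:label ?label\n"
        "}"
    )


def build_request_url(query: str) -> str:
    """Compose a GET URL for the endpoint (not fetched in this sketch)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


def parse_bindings(payload: str) -> List[Tuple[str, str]]:
    """Extract (object URI, label) pairs from a SPARQL JSON result."""
    data = json.loads(payload)
    return [(row["o"]["value"], row["label"]["value"])
            for row in data["results"]["bindings"]]


query = build_expansion_query("http://dbpedia.org/resource/Stem_cell")
url = build_request_url(query)

# Made-up response with one binding, used instead of a live request.
sample = json.dumps({
    "head": {"vars": ["o", "label"]},
    "results": {"bindings": [
        {"o": {"type": "uri", "value": "http://dbpedia.org/resource/Cloning"},
         "label": {"type": "literal", "xml:lang": "en", "value": "Cloning"}},
    ]},
})
expanded = parse_bindings(sample)
```

The labels returned this way would serve as the expanded query terms that are later combined by conjunction or disjunction.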

Page 8: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Named Entity Definitions & Co-Reference Resolution

• Entities recognised using NER & NED tools (Stanford’s NLP toolkit)

• Construct a co-occurrence matrix of proper nouns appearing consecutively

• Sample entities: “Chicago Bears”, “playoff games”

• Co-reference resolution crucial for accurate knowledge extraction

$\mathrm{entity}_{Misc}[i] = \text{co-occurr}(term_i, term_{i+1}), \quad i = 1, \dots, k$
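A minimal sketch of the co-occurrence idea for consecutive proper nouns is given below. It assumes input that is already POS-tagged with Penn Treebank tags (NNP, NNPS); the function names and the `min_count` threshold are hypothetical and not taken from the slides.

```python
from collections import Counter
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, POS tag) pairs


def cooccurrence_counts(sentences: List[Tagged]) -> Counter:
    """Count how often two proper nouns appear directly next to each other."""
    counts: Counter = Counter()
    for sentence in sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1.startswith("NNP") and t2.startswith("NNP"):
                counts[(w1, w2)] += 1
    return counts


def merge_entities(sentence: Tagged, counts: Counter, min_count: int = 2) -> List[str]:
    """Join runs of consecutive proper nouns whose pair count reaches min_count."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in sentence:
        if tag.startswith("NNP"):
            # Split the run when the pair is too rare to be one entity.
            if current and counts[(current[-1], token)] < min_count:
                entities.append(" ".join(current))
                current = []
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Made-up tagged sentences for illustration.
sentences = [
    [("The", "DT"), ("Chicago", "NNP"), ("Bears", "NNPS"), ("won", "VBD")],
    [("Fans", "NNS"), ("cheered", "VBD"), ("the", "DT"),
     ("Chicago", "NNP"), ("Bears", "NNPS")],
]
counts = cooccurrence_counts(sentences)
merged = merge_entities(sentences[0], counts)
```

With the two sentences above, "Chicago" and "Bears" co-occur twice and are merged into a single candidate entity; raising `min_count` keeps them apart.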

Page 9: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Pattern Generation

• Determine topic terms (LDA) from the underlying filtered corpus

• Annotate topic terms using POS taggers

• Pattern items:

– POS tags from topic terms

– Query terms (incl. terms after expansion)

Topic terms (LDA): police found women men dr death people drug medical officers man problems study killed heart hospital test sex patients evidence dead drugs officer….

POS-tagged topic terms: police_NN found_VBD women_NNS men_NNS dr_VBP death_NN people_NNS drug_NN medical_JJ officers_NNS man_NN problems_NNS study_NN killed_VBD heart_NN hospital_NN test_NN sex_NN patients_NNS evidence_NN dead_NN drugs_NNS officer_NN

Pattern items:

– POS tags: NN → VBD → NNS → VBP → NN….

– Query terms: Stem Cell → Anatomical structure → Biotechnology → Cloning → Cell Biology → Developmental Biology
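The step from POS-tagged topic terms to pattern items can be sketched as follows. The `word_TAG` input format follows the slide; collapsing consecutive repeated tags (so that NNS NNS becomes NNS) is inferred from the tag sequence shown on the slide, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List


def tag_sequence(tagged_terms: str) -> List[str]:
    """Read the POS tag of each 'word_TAG' token, collapsing consecutive repeats."""
    tags = [token.rpartition("_")[2] for token in tagged_terms.split()]
    return [tag for tag, _ in groupby(tags)]


def pattern_items(tagged_terms: str, query_terms: List[str]) -> List[str]:
    """Distinct POS tags (first-seen order) followed by the query terms."""
    distinct_tags = list(dict.fromkeys(tag_sequence(tagged_terms)))
    return distinct_tags + list(query_terms)


tagged = "police_NN found_VBD women_NNS men_NNS dr_VBP death_NN"
sequence = tag_sequence(tagged)
items = pattern_items(tagged, ["Stem Cell", "Biotechnology"])
```

For the first six topic terms of the slide this yields the tag sequence NN, VBD, NNS, VBP, NN, and the pattern items are the distinct tags plus the (expanded) query terms.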

Page 10: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Pattern Generation (I)

• Construct a co-occurrence matrix of pattern items (POS tags, query terms)

• Automatically generate emerging patterns that reflect the syntactical relevance of chunks of text

• Patterns are sequences of co-occurring items, modelled as directed tree graphs

• For each pattern item, generate a directed tree graph with that item as the root node

• The pattern score conveys the importance of a pattern for a given corpus and query
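One way to picture the co-occurrence matrix and the rooted patterns is sketched below. The slides do not spell out how the directed tree graphs are grown, so the greedy strategy here (always follow the most frequent unvisited successor) is an assumption, as are the function names and the toy sequences.

```python
from collections import defaultdict
from typing import Dict, List

Matrix = Dict[str, Dict[str, int]]


def cooccurrence_matrix(sequences: List[List[str]]) -> Matrix:
    """Count how often item b directly follows item a in the sequences."""
    matrix: Matrix = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            matrix[a][b] += 1
    return matrix


def grow_pattern(root: str, matrix: Matrix, max_len: int = 5) -> List[str]:
    """Greedily follow the most frequent unvisited successor, starting at root."""
    path = [root]
    while len(path) < max_len:
        successors = {item: n for item, n in matrix.get(path[-1], {}).items()
                      if item not in path}
        if not successors:
            break
        # Ties are broken alphabetically to keep the result deterministic.
        path.append(max(successors, key=lambda item: (successors[item], item)))
    return path


# Made-up sequences of pattern items (POS tags and a query term).
sequences = [
    ["NN", "VB", "JJ", "RB"],
    ["NN", "VB", "RB"],
    ["Stem Cell", "NN", "VB", "JJ"],
]
matrix = cooccurrence_matrix(sequences)
patterns = {root: grow_pattern(root, matrix) for root in list(matrix)}
```

Each pattern item that has at least one successor becomes the root of one pattern; a fuller version would keep several branches per node rather than a single path.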

Page 11: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Pattern Generation (II)

• Automatically generated patterns show the sequence of important syntactical items to appear in a sentence

• Scoring mechanism of patterns as the marginal probability of co-occurring pattern items, based on the filtered corpus:

– Prior probability of a pattern item, as the head node of the directed tree graph

– Conditional probability of two consecutive pattern items

Generated Patterns                      Pattern Score (ψscore)
NN → JJ → VB → RB                       0.28571429
NN → VB → JJ → RB                       0.19949495
Stem Cell → NN → VB → RB → JJ           0.17361111
JJ → RB → VB → NN → Stem Cell           0.17347462
RB → JJ → NN → Stem Cell                0.16466599
NN → Stem Cell → RB → VB → JJ           0.16155811
RB → VB → Stem Cell → NN → JJ           0.16129665
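The scoring annotations on this slide (a prior for the head node and conditional probabilities for consecutive items) suggest a chain-style product. The sketch below is one possible reading, not the formula from the talk: it multiplies the prior of the first item by the conditional probability of each following item, with both estimated from simple counts over made-up item sequences.

```python
from collections import Counter
from typing import List


def pattern_score(pattern: List[str], sequences: List[List[str]]) -> float:
    """Prior of the head item times the conditionals of consecutive items."""
    unigrams = Counter(item for seq in sequences for item in seq)
    bigrams = Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))
    total = sum(unigrams.values())
    if not pattern or total == 0:
        return 0.0
    score = unigrams[pattern[0]] / total  # prior of the head node
    for prev, nxt in zip(pattern, pattern[1:]):
        if unigrams[prev] == 0:
            return 0.0
        score *= bigrams[(prev, nxt)] / unigrams[prev]  # estimate of P(nxt | prev)
    return score


# Made-up sequences of pattern items from a filtered corpus.
sequences = [
    ["NN", "VB", "JJ"],
    ["NN", "VB", "RB"],
    ["JJ", "NN", "VB", "JJ"],
]
score_jj = pattern_score(["NN", "VB", "JJ"], sequences)
score_rb = pattern_score(["NN", "VB", "RB"], sequences)
```

Under this reading, a pattern ranks higher when its head item is frequent in the filtered corpus and each step follows a common transition.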

Page 12: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Contextual Structure of Summaries

• Summaries generated as structured knowledge

• Decomposition of summaries into two structures:

– global (Entities, Actions) for entire corpus

– local (entity-context, action-context) for particular document

• Multiple summary perspectives based on generated context

• Enrichment with additional information from reference datasets (DBpedia)


Page 13: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Focused Knowledge Extraction - Contextual Structure of Summaries

[Figure: contextual structure of summaries, with global and local structures enabling multiple summary perspectives]

Example sentence: “The kinds of stem cell therapies being researched for the most part do not involve the politically sensitive use of embryonic stem cells.”

• Global structure:

– Entities: Stem cell, Therapies

– Actions: researched, involve

• Local structure:

– Entity-context: Stem Cell: Embryonic, sensitive

– Action-context: researched: Stem cell therapies ↔ most part
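The global and local structures of this example could be held in a small data structure such as the one below. The class name, field names, and the `perspective` method are hypothetical; they only illustrate how entity-context and action-context can support different views of the same summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredSummary:
    # Global structure: entities and actions across the corpus.
    entities: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    # Local structure: context within one particular document.
    entity_context: Dict[str, List[str]] = field(default_factory=dict)
    action_context: Dict[str, List[str]] = field(default_factory=dict)

    def perspective(self, entity: str) -> Dict[str, List[str]]:
        """View the summary from one entity: its context and the actions mentioning it."""
        related = [action for action, args in self.action_context.items()
                   if any(entity.lower() in arg.lower() for arg in args)]
        return {"context": self.entity_context.get(entity, []), "actions": related}


# Values taken from the example sentence on the slide.
summary = StructuredSummary(
    entities=["Stem cell", "Therapies"],
    actions=["researched", "involve"],
    entity_context={"Stem cell": ["embryonic", "sensitive"]},
    action_context={"researched": ["Stem cell therapies", "most part"]},
)
view = summary.perspective("Stem cell")
```

Asking for the "Stem cell" perspective returns its local context together with the actions whose arguments mention it.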

Page 14: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Evaluation Setup

• Dataset: New York Times, year 2007

• 40,000 articles with manually generated summaries

• Summary relevance w.r.t. the generated context (query)

• Coverage of the manually generated NYT summaries

• ROUGE-n metric to measure coverage of structured vs. manually generated summaries

$\text{ROUGE-}n = \frac{\text{matching } n\text{-grams from structured and manually generated summaries}}{\text{total } n\text{-grams}}$
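The n-gram overlap ratio can be computed as in the sketch below, which reports precision, recall, and F1 for a generated summary against a manually written one. Whitespace tokenisation, lowercasing, and clipped counts are simplifying assumptions, and the two example strings are made up.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matching n-grams, clipped per n-gram
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


generated = "democrats applauded stem cell research"
manual = "democrats back stem cell research funding"
p, r, f1 = rouge_n(generated, manual, n=1)
```

Recall divides the matches by the n-grams of the manual summary and precision by those of the generated summary, so the two differ whenever the summaries have different lengths.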

Page 15: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Results

• 10 queries used for evaluation (prominent events of 2007 from Time magazine [1])

• Human evaluation for summary relevance: 76% correctly generated

• 17 evaluators, each evaluating 20 summaries on average

[1] http://www.time.com/time/specials/2007/0,28757,1686204,00.html

Query                #Q. Terms   #Doc.   #Summ.
European Union       7           157     129
Super Bowl           13          370     325
US Congress          17          13      19
Virginia Tech        28          12      11
Stem Cell            5           105     86
Protest              2           129     103
Harry Potter         22          10      7
Global Warming       5           198     170
National Security    0           250     207
Terrorist Attacks    0           57      52

Table: Generated structured summaries for the different queries.

Page 16: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Results

• ROUGE-1 evaluation results for the 10 queries

• 25% precision and 32% recall as best performing results for ROUGE-1

[Figure: P/R/F1 measures based on the ROUGE-1 metric for the 10 queries used for evaluation]

Page 17: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Results - Sample Generated Summaries

Query: “Stem Cell”

Democrats → applauded → Mr. Spitzer Eliot (Gov) calls → insure → 500,000 children → lack → health insurance → enrol → 900,000 adults → are → eligible Medicaid → enrolled → issue debt → pay → stem cell research.

Congress’s Shift in Power → revives → Medicare Debate House Democrats → try to rush → legislation → requiring → government → negotiate → lower drug prices for Medicare beneficiaries → overturning → President Bush’s restrictions on embryonic stem cell research.

The nation → welcome → ambitious agenda → being offered → today by the new Congress Democratic majority → raising → minimum wage → advancing → stem cell research → restoring → oversight of the executive branch.

New study → suggesting → useful stem cells → be derived → amniotic fluid without → destroying → embryos.

Swarns, Rachel L → announced → 9 Aug. federal government → pays → studies on stem cell colonies, lines → created before → that date, government → does not encourage → destruction of additional embryos.

Stem cell research → has not produced → a single medical treatment → is morally wrong → to create human life → to destroy → for research.

The measure → allow → scientists → receiving → federal funds → use → embryonic stem cells from surplus embryos → generated → fertility clinics, after cell lines → had been derived → by others → using → nonfederal funds.

Page 18: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Conclusions

• Query-based generated summaries

• Contextualised Structured Summaries

– Typing and expanding of queries using reference datasets

– Automated pattern generation

• Incorporated user interests and syntactical relevance of chunks of text

• Multiple summary perspectives

• Overall good accuracy of generated summaries (76% judged correctly generated in the human evaluation)

• Infer new knowledge by interlinking summaries of different/same contexts


Page 19: Summaries on the fly: Query-based Extraction of Structured Knowledge from Web Documents

Thank you! Questions?