Presented at WIN2013

Page 1: Exploring Content with Wikipedia

Exploring Content with Semantic Transformations

using Collaborative Knowledge Bases

Yegin Genc

Prof. Jeffrey V. Nickerson

Page 2: Exploring Content with Wikipedia

OBJECTIVE

Understanding text automatically to support search-driven exploratory activities.

Page 3: Exploring Content with Wikipedia

EXPLORATORY SEARCH

LOOKUP: Fact retrieval, Known item search, Navigation

LEARN: Knowledge acquisition, Comprehension/interpretation, Comparison

INVESTIGATE: Accretion, Analysis, Exclusion/Negation

Marchionini, G. (2006)

Page 4: Exploring Content with Wikipedia

EXPLORATORY SEARCH

ILL-STRUCTURED PROBLEM

• No single right approach

• Problem definitions change as new information is gathered

Page 5: Exploring Content with Wikipedia

Foreign minorities, Germany

Page 6: Exploring Content with Wikipedia

Text: “Foreign Minorities Germany”

Page 7: Exploring Content with Wikipedia

Exploratory Search Task

Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.

Evaluation is based on relevancy and diversity.

Page 8: Exploring Content with Wikipedia

D: Documents, K: Concepts, W: Words

Seed Document

Candidates: n-grams (1 to 3)

Concepts: candidates that match a Wikipedia page title and are connected through the ontology

DOCUMENT – WORD: D (D x W), Tf-idf(D)

CONCEPT – WORD: K (W x K), Tf-idf(K)

DOCUMENT – CONCEPT: Θ (D x K)

$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$

Argsort(row.sum(Θ))
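The candidate and concept steps on this slide could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `titles` list is a hypothetical stand-in for a Wikipedia title index, tokenization is a plain whitespace split, and the ontology-connection filter mentioned on the slide is not modeled.

```python
def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams of length 1..max_n, joined by spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def match_concepts(text, titles, max_n=3):
    """Keep the n-gram candidates that equal a known page title.

    Matching is case-insensitive; each concept is reported once,
    in the order it is first generated (unigrams, then bigrams, ...).
    """
    tokens = text.lower().split()
    known = {t.lower() for t in titles}
    matched = []
    for gram in ngrams(tokens, max_n):
        if gram in known and gram not in matched:
            matched.append(gram)
    return matched


# Hypothetical title list standing in for a Wikipedia title index.
titles = ["Static analysis", "Observer pattern", "Aliasing", "Abstraction"]
text = "a static analysis is given for confinement on aliasing"
concepts = match_concepts(text, titles)
print(concepts)
```

A real pipeline would also need punctuation handling and some way to check that matched pages are linked to each other, which is left out here.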

Page 9: Exploring Content with Wikipedia

EXTRACTING CONCEPT NETWORK

“Representation independence formally characterizes the encapsulation provided by language constructs for data abstraction and justifies reasoning by simulation. Representation independence has been shown for a variety of languages and constructs but not for shared references to mutable state; indeed it fails in general for such languages. This article formulates representation independence for classes, in an imperative, object-oriented language with pointers, subclassing and dynamic dispatch, class oriented visibility control, recursive types and methods, and a simple form of module. An instance of a class is considered to implement an abstraction using private fields and so-called representation objects. Encapsulation of representation objects is expressed by a restriction, called confinement, on aliasing. Representation independence is proved for programs satisfying the confinement condition. A static analysis is given for confinement that accepts common designs such as the observer and factory patterns. The formalization takes into account not only the usual interface between a client and a class that provides an abstraction but also the interface (often called "protected") between the class and its subclasses.”

Page 11: Exploring Content with Wikipedia

WIKIPEDIA PAGES AS CONCEPTS

Solar System

“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”

(http://en.wikipedia.org/wiki/Solar_System)

Word Stem   Occ.   Freq.
abstract    53     0.056
program     44     0.046
langu       33     0.035
spec        16     0.017
comput      12     0.013
conceiv     12     0.013
dat         12     0.013

$$\beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}$$

β_k: Per-concept word distribution
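A minimal sketch of how such a per-concept word distribution might be computed. It assumes that {W_i ∈ k} denotes the number of occurrences of word stem W_i in the page for concept k, which is consistent with the Occ. and Freq. columns above, and that stemming has already been applied. The function name and the toy input are illustrative, not taken from the slides.

```python
from collections import Counter


def word_distribution(stems):
    """Map each stem to its relative frequency within one concept page."""
    counts = Counter(stems)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty page: no distribution to report
    return {stem: count / total for stem, count in counts.items()}


# Toy stemmed page text (made-up, far shorter than a real Wikipedia page).
stems = ["abstract", "program", "abstract", "langu"]
beta_k = word_distribution(stems)
print(beta_k)
```

Each value is an occurrence count divided by the total number of stems on the page, so the values for one concept sum to 1.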

Page 12: Exploring Content with Wikipedia

RANKING DOCUMENTS

D: Documents, K: Concepts, W: Words

DOCUMENT – WORD: D (D x W)

CONCEPT – WORD: K (W x K)

DOCUMENT – CONCEPT: Θ (D x K)

$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$
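The document-concept matrix on this slide is the product of the document-word and concept-word matrices. A small pure-Python sketch of that product is given below; the weights are made-up toy numbers rather than real tf-idf values, and `matmul` is an illustrative helper, not part of the original work.

```python
def matmul(a, b):
    """Multiply a (rows x inner) by b (inner x cols), both lists of lists."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0]) if b else 0
    return [
        [sum(row[w] * b[w][k] for w in range(inner)) for k in range(cols)]
        for row in a
    ]


# Toy weights: 2 documents x 3 words, and 3 words x 2 concepts.
doc_word = [[1.0, 0.0, 2.0],
            [0.0, 1.0, 1.0]]
word_concept = [[0.5, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]]

theta = matmul(doc_word, word_concept)  # 2 documents x 2 concepts
print(theta)
```

Each entry of `theta` accumulates, over all words, how strongly a document uses a word times how strongly that word belongs to a concept. With real data a sparse matrix library would be the more practical choice.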

Page 13: Exploring Content with Wikipedia

SORT DOCUMENTS

D: Documents, K: Concepts, W: Words

DOCUMENT – WORD: D (D x W)

CONCEPT – WORD: K (W x K)

DOCUMENT – CONCEPT: Θ (D x K)

$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$
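Sorting documents by the row sums of Θ could look like the sketch below. The descending order (highest total concept weight first) is an assumption about how the ranking is read; the values in `theta` are toy numbers.

```python
def rank_documents(theta):
    """Return document indices ordered by descending row sum of theta."""
    scores = [sum(row) for row in theta]
    return sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)


# Toy document-concept weights: 3 documents x 2 concepts.
theta = [[1.5, 1.0],
         [0.5, 1.5],
         [2.0, 2.0]]

order = rank_documents(theta)
print(order)
```

Here the row sums are 2.5, 2.0 and 4.0, so the third document is ranked first.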

Page 14: Exploring Content with Wikipedia

EXPERIMENT

Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.

• Data: 619 abstracts of the Journal of the ACM (JACM) and their references.

• Task: Select the top-k (k = 5, 10, 15, and 20) relevant abstracts.

• Observe: Relevancy (measured by LSA vector similarity) and Diversity (measured through the coverage of the references).
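The slides do not spell out how coverage of the references is computed. One possible reading, shown here purely as an assumption, is the fraction of the seed abstract's references that are cited by at least one of the selected abstracts. The function name and the reference identifiers are hypothetical.

```python
def reference_coverage(seed_refs, selected_refs):
    """Fraction of the seed's references cited by any selected abstract.

    seed_refs: iterable of reference identifiers for the seed abstract.
    selected_refs: list of iterables, one per selected abstract.
    """
    seed = set(seed_refs)
    if not seed:
        return 0.0  # nothing to cover
    covered = set().union(*selected_refs) & seed
    return len(covered) / len(seed)


# Hypothetical reference identifiers.
seed_refs = ["r1", "r2", "r3", "r4"]
selected_refs = [["r1", "x9"], ["r2", "r1"]]
coverage = reference_coverage(seed_refs, selected_refs)
print(coverage)
```

Under this reading, a result set that keeps citing the same few references scores lower than one that spreads across the seed's reference list.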

Page 15: Exploring Content with Wikipedia

MAXIMAL MARGINAL RELEVANCE

• A measure to increase the diversity of documents retrieved by an IR system

  - Similarity to query: BM25 (Xapian [1])
  - Similarity to results: LSA similarity (Gensim [2])

[1] http://xapian.org

[2] http://radimrehurek.com/gensim/
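A minimal sketch of greedy MMR selection in its standard form: at each step, pick the document that maximizes λ · sim(d, query) − (1 − λ) · max sim(d, already selected). The code takes precomputed similarity scores and does not call Xapian or Gensim; the value λ = 0.5, the toy scores, and the assumption that both similarities are on a comparable scale are illustrative choices, not details from the slides.

```python
def mmr_select(query_sim, doc_sim, k, lam=0.5):
    """Greedily select up to k document indices by Maximal Marginal Relevance.

    query_sim[i]: similarity of document i to the query.
    doc_sim[i][j]: similarity between documents i and j.
    lam: trade-off; 1.0 means pure relevance, lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scores: documents 0 and 1 are near-duplicates, document 2 is different.
query_sim = [0.9, 0.85, 0.3]
doc_sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.2],
           [0.1, 0.2, 1.0]]

picked = mmr_select(query_sim, doc_sim, k=2)
print(picked)
```

In this toy case the second pick is document 2 rather than the more relevant but redundant document 1. In practice raw BM25 scores would likely need normalizing before being mixed with LSA similarities.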

Page 16: Exploring Content with Wikipedia

MMR RESULTS

Page 17: Exploring Content with Wikipedia

WIKI-BASED MODEL VS MMR

Page 18: Exploring Content with Wikipedia

CONCLUDING REMARKS

• Our Wiki-based technique provides high diversity with low relevancy loss.

• Semantics embedded in concept networks extracted from Wikipedia can improve exploratory search tasks.