Exploring Content with Wikipedia

Exploring Content with Semantic Transformations

using Collaborative Knowledge Bases

Yegin Genc

Prof. Jeffrey V. Nickerson

OBJECTIVE

Understanding text automatically to support search driven exploratory activities.

EXPLORATORY SEARCH

LOOKUP: Fact retrieval, Known item search, Navigation

LEARN: Knowledge acquisition, Comprehension/interpretation, Comparison

INVESTIGATE: Accretion, Analysis, Exclusion/Negation

Marchionini, G. (2006)

EXPLORATORY SEARCH

ILL-STRUCTURED PROBLEM

• No single right approach

• Problem definitions change as new information is gathered

Foreign minorities, Germany

Text: “Foreign Minorities Germany”

Exploratory Search Task

Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.

Evaluation is based on relevancy and diversity.

D: Documents, K: Concepts, W: Words

Θ: DOCUMENT-CONCEPT matrix (D × K)

D: DOCUMENT-WORD matrix (D × W)

K: CONCEPT-WORD matrix (W × K)

$\Theta_{D \times K} = D_{D \times W} \cdot K_{W \times K}$

argsort(row.sum(Θ))
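A minimal sketch of the matrix product and the row-sum ranking, using plain Python lists. The toy matrices, the helper `matmul`, and the choice of descending order are assumptions for illustration; the slides do not state the sort direction or give any values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Toy data (assumed): 3 documents, 4 words, 2 concepts.
doc_word = [            # D: documents x words
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
]
word_concept = [        # K: words x concepts
    [0.5, 0.0],
    [0.0, 0.5],
    [0.5, 0.0],
    [0.0, 0.5],
]

theta = matmul(doc_word, word_concept)   # Θ: documents x concepts
scores = [sum(row) for row in theta]     # row.sum(Θ)

# argsort, highest score first (descending order is an assumption).
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
print(ranking)
```

Each row of `theta` holds one document's weight on every concept, so the row sum is that document's total concept weight.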

Seed Document

Candidates: n-grams (1 to 3)

Concepts: candidates that match a Wikipedia page title and are connected through the ontology

Tf-idf(D), Tf-idf(K)
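The candidate and concept steps can be sketched as follows. This is only an illustration: `page_titles` is a hypothetical stand-in for an index of Wikipedia page titles, the tokenizer is a simple assumption, and the ontology-connection filter and tf-idf weighting are left out.

```python
import re

def ngrams(text, max_n=3):
    """Return the set of word n-grams of length 1..max_n in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

# Hypothetical stand-in for a Wikipedia page-title index.
page_titles = {"data abstraction", "static analysis", "observer pattern"}

seed = "A static analysis is given for confinement that supports data abstraction."
candidates = ngrams(seed)

# Keep only candidates that exactly match a page title.
concepts = sorted(candidates & page_titles)
print(concepts)
```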

EXTRACTING CONCEPT NETWORK

“Representation independence formally characterizes the encapsulation provided by language constructs for data abstraction and justifies reasoning by simulation. Representation independence has been shown for a variety of languages and constructs but not for shared references to mutable state; indeed it fails in general for such languages. This article formulates representation independence for classes, in an imperative, object-oriented language with pointers, subclassing and dynamic dispatch, class oriented visibility control, recursive types and methods, and a simple form of module. An instance of a class is considered to implement an abstraction using private fields and so-called representation objects. Encapsulation of representation objects is expressed by a restriction, called confinement, on aliasing. Representation independence is proved for programs satisfying the confinement condition. A static analysis is given for confinement that accepts common designs such as the observer and factory patterns. The formalization takes into account not only the usual interface between a client and a class that provides an abstraction but also the interface (often called "protected") between the class and its subclasses.”

WIKIPEDIA PAGES AS CONCEPTS

Solar System

“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”

(http://en.wikipedia.org/wiki/Solar_System)

Word Stem   Occ.   Freq.
abstract    53     0.056
program     44     0.046
langu       33     0.035
spec        16     0.017
comput      12     0.013
conceiv     12     0.013
dat         12     0.013

$\beta_k = p(W_i \mid k) = \dfrac{|\{W_i \in k\}|}{\sum_{i=1}^{N} |\{W_i \in k\}|}$

βk: Per-concept word distribution
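As a rough illustration of the per-concept word distribution, the sketch below divides each stem's occurrence count on a concept page by the total number of stem occurrences on that page. The function name `word_distribution` and the toy stem list are assumptions; stemming itself is not shown.

```python
from collections import Counter

def word_distribution(stems):
    """Relative frequency of each stem among all stems of one concept page."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: n / total for stem, n in counts.items()}

# Toy stem sequence (assumed) standing in for one stemmed Wikipedia page.
page_stems = ["abstract", "program", "abstract", "langu",
              "abstract", "program", "dat", "spec"]

beta_k = word_distribution(page_stems)
print(beta_k["abstract"], beta_k["program"])
```

The frequencies of one concept sum to 1, which matches the Occ. and Freq. columns above being proportional to each other.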

http://en.wikipedia.org/wiki/Sun

http://en.wikipedia.org/wiki/Astronomical_objects

http://en.wikipedia.org/wiki/Gravity

http://en.wikipedia.org/wiki/Orbit

http://en.wikipedia.org/wiki/Formation_and_evolution_of_the_Solar_System

http://en.wikipedia.org/wiki/Molecular_cloud

RANKING DOCUMENTS

D: Documents, K: Concepts, W: Words

[Figure: matrix product diagram; axes labeled d (documents) and k (concepts)]



SORT DOCUMENTS

D: Documents, K: Concepts, W: Words

[Figure: matrix product diagram; axes labeled d (documents) and k (concepts)]



EXPERIMENT

Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.

• Data: 619 abstracts of the Journal of the ACM (JACM) and their references.

• Task: Select the top-k (k = 5, 10, 15, and 20) relevant abstracts.

• Observe: Relevancy (measured by LSA vector similarity) and Diversity (measured by the coverage of the references).

MAXIMAL MARGINAL RELEVANCE

• A measure to increase the diversity of documents retrieved by an IR system

- Similarity to query: BM25 (Xapian [1])

- Similarity to results: LSA similarity (Gensim [2])

1. http://xapian.org

2. http://radimrehurek.com/gensim/
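A minimal sketch of greedy MMR selection, assuming the standard formulation: at each step pick the candidate maximizing λ · sim(doc, query) − (1 − λ) · max sim(doc, already selected). The trade-off weight `lam`, the function name `mmr`, and the toy similarity tables are assumptions; in the setup above the two similarities would come from BM25 and LSA, which are not called here.

```python
def mmr(candidates, query_sim, pair_sim, k, lam=0.5):
    """Greedily select k documents, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((pair_sim(doc, s) for s in selected), default=0.0)
            return lam * query_sim(doc) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy similarity tables (assumed values, for illustration only).
query_scores = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_scores = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}

def pair_sim(x, y):
    return pair_scores[frozenset((x, y))]

diverse = mmr(query_scores, query_scores.get, pair_sim, k=2, lam=0.5)
relevant_only = mmr(query_scores, query_scores.get, pair_sim, k=2, lam=1.0)
print(diverse, relevant_only)
```

With `lam=0.5` the near-duplicate "b" is skipped in favor of "c"; with `lam=1.0` the selection reduces to plain relevance ranking.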

MMR RESULTS

WIKI-BASED MODEL VS MMR

CONCLUDING REMARKS

• Our Wiki-based technique provides high diversity with low relevancy loss.

• Semantics embedded in concept networks extracted from Wikipedia can improve exploratory search tasks.
