Exploring Content with Semantic Transformations
using Collaborative Knowledge Bases
Yegin Genc
Prof. Jeffrey V. Nickerson
OBJECTIVE
Understanding text automatically to support search-driven exploratory activities.
EXPLORATORY SEARCH
LOOKUP
• Fact retrieval
• Known item search
• Navigation
LEARN
• Knowledge acquisition
• Comprehension/interpretation
• Comparison
INVESTIGATE
• Accretion
• Analysis
• Exclusion/Negation
Marchionini, G. (2006)
EXPLORATORY SEARCH
ILL-STRUCTURED PROBLEM
• No single right approach
• Problem definitions change as new information is gathered
Foreign minorities, Germany
Text: “Foreign Minorities Germany”
Exploratory Search Task
Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.
Evaluation is based on relevancy and diversity.
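D: Documents, K: Concepts, W: Words
$$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$$
Θ: DOCUMENT–CONCEPT matrix (D x K), rows d, columns k
D: DOCUMENT–WORD matrix (D x W)
K: CONCEPT–WORD matrix (W x K)
Ranking: Argsort(row.sum(Θ))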
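The matrix product and the argsort step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `rank_documents` and the toy matrices are hypothetical, and sorting in descending order (highest row sum first) is an assumption, since the slide gives only Argsort(row.sum(Θ)).

```python
import numpy as np

def rank_documents(doc_word, concept_word):
    """Project documents onto concepts and rank them by total concept weight.

    doc_word:     (D x W) document-word matrix, e.g. tf-idf weights
    concept_word: (W x K) concept-word matrix
    Returns (theta, ranking), where ranking lists document indices, best first.
    """
    theta = doc_word @ concept_word        # (D x W) @ (W x K) -> (D x K)
    scores = theta.sum(axis=1)             # row sums of theta
    # Assumption: a larger row sum means a more relevant document.
    ranking = np.argsort(-scores, kind="stable")
    return theta, ranking

# Toy data (hypothetical): 3 documents, 3 words, 2 concepts.
doc_word = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 1.0],
    [0.5, 0.5, 0.0],
])
concept_word = np.array([
    [0.0, 0.1],
    [0.6, 0.0],
    [0.4, 0.9],
])
theta, ranking = rank_documents(doc_word, concept_word)
```

Here the second document carries the most weight on the two concepts, so it comes first in `ranking`.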
Seed Document
Candidates
n-grams(1 to 3)
Concepts
(candidates that match a Wikipedia page title and are connected through the ontology)
Tf-idf(D) Tf-idf(K)
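A minimal sketch of the candidate step, assuming simple lowercase tokenization and exact title matching. The `titles` set is a hypothetical stand-in for a real Wikipedia title index, and the ontology-connectivity filter mentioned above is not modeled here.

```python
import re

def ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def match_concepts(text, titles, max_n=3):
    """Keep only the candidate n-grams that equal a known page title."""
    normalized = {t.lower(): t for t in titles}
    return sorted(normalized[c] for c in ngrams(text, max_n) if c in normalized)

# Hypothetical title index; a real system would look titles up in Wikipedia.
titles = {"Data abstraction", "Static analysis", "Factory", "Solar System"}
text = "A static analysis is given for confinement; data abstraction justifies reasoning."
concepts = match_concepts(text, titles)
```

With this toy input, only the two titles that literally occur in the text survive as concepts.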
EXTRACTING CONCEPT NETWORK
“Representation independence formally characterizes the encapsulation provided by language constructs for data abstraction and justifies reasoning by simulation. Representation independence has been shown for a variety of languages and constructs but not for shared references to mutable state; indeed it fails in general for such languages. This article formulates representation independence for classes, in an imperative, object-oriented language with pointers, subclassing and dynamic dispatch, class oriented visibility control, recursive types and methods, and a simple form of module. An instance of a class is considered to implement an abstraction using private fields and so-called representation objects. Encapsulation of representation objects is expressed by a restriction, called confinement, on aliasing. Representation independence is proved for programs satisfying the confinement condition. A static analysis is given for confinement that accepts common designs such as the observer and factory patterns. The formalization takes into account not only the usual interface between a client and a class that provides an abstraction but also the interface (often called "protected") between the class and its subclasses.”
WIKIPEDIA PAGES AS CONCEPTS
Solar System: “The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)
Word Stem   Occ.   Freq.
abstract    53     0.056
program     44     0.046
langu       33     0.035
spec        16     0.017
comput      12     0.013
conceiv     12     0.013
dat         12     0.013
$$\beta_k = p(W_i \mid k) = \frac{|\{W_i \in k\}|}{\sum_{i=1}^{N} |\{W_i \in k\}|}$$
β_k: Per-concept word distribution
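The per-concept word distribution amounts to normalizing word counts on a concept's page, which is how the Occ. and Freq. columns relate. A minimal sketch, assuming the page text has already been tokenized and stemmed; the function name `word_distribution` and the toy stem list are hypothetical.

```python
from collections import Counter

def word_distribution(tokens):
    """Relative frequency of each word stem on one concept page."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Toy stemmed page (hypothetical): 6 tokens in total.
page_stems = ["abstract"] * 3 + ["program"] * 2 + ["langu"]
beta_k = word_distribution(page_stems)
```

Stacking one such distribution per concept, as columns over a shared vocabulary, gives a CONCEPT–WORD matrix of shape (W x K).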
RANKING DOCUMENTS
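D: Documents, K: Concepts, W: Words
$$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$$
Θ: DOCUMENT–CONCEPT matrix (D x K), rows d, columns k
D: DOCUMENT–WORD matrix (D x W)
K: CONCEPT–WORD matrix (W x K)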
SORT DOCUMENTS
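D: Documents, K: Concepts, W: Words
$$\Theta_{(D \times K)} = D_{(D \times W)} \cdot K_{(W \times K)}$$
Θ: DOCUMENT–CONCEPT matrix (D x K), rows d, columns k
D: DOCUMENT–WORD matrix (D x W)
K: CONCEPT–WORD matrix (W x K)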
EXPERIMENT
Given a journal abstract, rank other abstracts based on their relevancy to the seed abstract.
• Data: 619 abstracts of the Journal of the ACM (JACM) and their references.
• Task: Select the top-k (k = 5, 10, 15, and 20) relevant abstracts.
• Observe: Relevancy (measured by LSA vector similarity) and Diversity (measured through the coverage of the references).
MAXIMAL MARGINAL RELEVANCE
• a measure to increase the diversity of documents retrieved by an IR system
  - Similarity to query: BM25 (Xapian [1])
  - Similarity to results: LSA similarity (Gensim [2])
[1] http://xapian.org
[2] http://radimrehurek.com/gensim/
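The greedy selection behind MMR can be sketched as follows, using the standard formulation: at each step pick the document maximizing λ · sim(d, query) − (1 − λ) · max sim(d, already selected). This is an illustration only. It takes precomputed similarity scores rather than calling Xapian or Gensim, and the function name `mmr_select`, the value λ = 0.5, and the toy scores are assumptions. It also assumes both similarities are on comparable scales, which raw BM25 scores generally are not without normalization.

```python
import numpy as np

def mmr_select(query_sim, doc_sim, k, lam=0.5):
    """Greedily pick k document indices by Maximal Marginal Relevance.

    query_sim: length-n similarities between each document and the query
    doc_sim:   (n x n) pairwise document similarities
    lam:       trade-off; 1.0 is pure relevance, lower values favor diversity
    """
    query_sim = np.asarray(query_sim, dtype=float)
    doc_sim = np.asarray(doc_sim, dtype=float)
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max((doc_sim[i, j] for j in selected), default=0.0)
            score = lam * query_sim[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scores (hypothetical): documents 0 and 1 are near-duplicates.
query_sim = [0.9, 0.85, 0.3]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
picked = mmr_select(query_sim, doc_sim, k=2, lam=0.5)
relevance_only = mmr_select(query_sim, doc_sim, k=2, lam=1.0)
```

With λ = 0.5 the near-duplicate of the first pick is skipped in favor of the less relevant but more distinct third document; with λ = 1.0 the selection reduces to plain relevance ranking.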
MMR RESULTS
WIKI-BASED MODEL VS MMR
CONCLUDING REMARKS
• Our Wiki-based technique provides high diversity with low relevancy loss.
• Semantics embedded in concept networks extracted from Wikipedia can improve exploratory search tasks.