TRANSCRIPT
Analyzing the NSDL Collection
Peter Shin, Charles Cowart, Tony Fountain, Reagan Moore
San Diego Supercomputer Center
Introduction
• Education Impact and Evaluation Standing Committee (EIESC)
• Goal: Characterize the contents of the NSDL collection by topic, audience and type
• Motivation:
  • Enable search of the NSDL collection by subject, audience, and type
  • Inform future collection activities
Research Focus
• Intelligent information retrieval (IR) for the NSDL community: provide efficient discovery of and access to relevant materials
• Support queries on audience, topic, and type
Information Retrieval Techniques
• Keyword searches
  • Find documents that contain the list of words in the search.
  • Example: digestive system -> “Digestive System Web Resources for Students”
• Relevance-based search (probabilistic approach)
  • Instead of matching the actual keywords, find documents that contain words with meanings similar to those in the query.
  • Example: break down food -> NCC Food Group’s Food Stuff; break down food body -> “Squirrel Tales”
• Text categorization
  • Given the contents of a document, assign topic labels.
• Hybrid or combination approaches
  • Mix two or more of the above.
  • Example: first run a keyword search, then apply text categorization. (A sketch of the first two techniques follows this list.)
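A minimal sketch of the first two techniques, assuming a three-document toy corpus: exact keyword matching versus a relevance-style ranking using TF-IDF weights and cosine similarity. Everything here is illustrative, not the NSDL system's implementation.

import math
from collections import Counter

docs = {
    "d1": "digestive system web resources for students",
    "d2": "how the stomach breaks food down in the body",
    "d3": "squirrel tales a story about food and the body",
}

def keyword_search(query, docs):
    """Boolean AND: return documents containing every query word."""
    terms = query.lower().split()
    return [d for d, text in docs.items()
            if all(t in text.split() for t in terms)]

def doc_frequencies(docs):
    """Number of documents each term appears in."""
    return Counter(t for text in docs.values() for t in set(text.split()))

def tfidf(tokens, df, n):
    """TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n / df[t]) for t in tf if df.get(t)}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

n, df = len(docs), doc_frequencies(docs)
vecs = {d: tfidf(text.split(), df, n) for d, text in docs.items()}
qvec = tfidf("break down food".split(), df, n)

print(keyword_search("digestive system", docs))            # exact match: ['d1']
print(sorted(docs, key=lambda d: -cosine(qvec, vecs[d])))  # ['d2', 'd3', 'd1']

Note the contrast: the query "break down food" matches no document verbatim, but the similarity ranking still surfaces the documents about food and the body.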
Challenges
• Not enough metadata
  • e.g., audience type (intended grade level)
• Need metadata standards
  • No structure in the metadata
  • Example: Odd/Even Number – subject: number senses; no mathematics, algebra, or number theory
  • No concept map or ontology to capture complex topic relationships
  • Example: the relationship between algebra and calculus
• Need to make annotation easy and accurate
  • Assist with or automate labeling the documents with standard annotations
  • Possible errors in the existing hand-labeled documents
  • Example: mismatch between the contents of the front pages and their metadata
• Computationally intensive
  • Over 20,000 HTML documents with over 1,300,000 unique terms (a sizing sketch follows this list)
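To put the last point in perspective, a back-of-the-envelope sizing of the term-document matrix at the quoted scale; the average number of distinct terms per document is an assumed figure for illustration.

# Sizing sketch using the numbers quoted above.
n_docs, n_terms = 20_000, 1_300_000

dense_bytes = n_docs * n_terms * 8                    # 8-byte floats
print(f"dense matrix: {dense_bytes / 1e9:.0f} GB")    # ~208 GB

avg_terms_per_doc = 500   # assumption: distinct terms per document
# Sparse (CSR-style) storage: one value + one column index per nonzero.
sparse_bytes = n_docs * avg_terms_per_doc * (8 + 4)
print(f"sparse matrix: {sparse_bytes / 1e6:.0f} MB")  # ~120 MB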
Suggestions
• Involve the community to define methods, sample queries and evaluation criteria, generate metadata, perform comparative studies
• Create a training/testing data set that is annotated and checked for correctness which can be used by all researchers
• Provide a forum for sharing methods and results
• Build an evaluation testbed – collect data, algorithms, tools, and results, plus hardware and software; provide an online web portal interface to the resources
Status of the NSDL testbed at SDSC
• Monthly web crawl of the NSDL sites
• Persistent archive of the harvested materials
• Processing pipeline for various IR techniques
• Software resources:
  • Storage Resource Broker (SRB)
  • NSDL Archive Service (web crawling)
  • Various processing pipeline scripts that can run in parallel
  • SVMlight by Thorsten Joachims
  • Latent Semantic Indexing from Telcordia
  • Latent Dirichlet Allocation by David Blei from UC Berkeley
  • Cheshire – online catalog and full-text retrieval system (from UC Berkeley and the University of Liverpool)
• Hardware resources:
  • IBM DataStar – supercomputer with 10.4 teraflops of computing power
  • TeraGrid – collection of supercomputers with high-throughput communication
Summary
• Metadata evaluation is important and challenging.
• Information retrieval techniques are promising.
• NSDL community involvement is necessary to define evaluation methods.
• A collaborative testbed would facilitate analysis.
• An initial testbed is under development at SDSC.
Latent Semantic Indexing (LSI)
• Assumption:
  • If documents have many words in common, the documents are closely related.
• Applications:
  • Search engines
  • Archivist’s assistance
  • Automated writing assessment
  • Information filtering
• Drawbacks:
  • Not scalable (a minimal sketch follows this list)
  • No incremental update
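A minimal LSI sketch, assuming a toy 5-term x 3-document matrix and k = 2 latent concepts: factor the matrix with a truncated SVD, fold a query into the reduced space, and rank documents there. Real collections require sparse partial-SVD solvers, which is where the scalability drawback bites.

import numpy as np

# Rows = terms, columns = documents (toy counts standing in for TF-IDF).
A = np.array([
    [1.0, 0.0, 0.0],   # "digestive"
    [1.0, 0.0, 0.0],   # "system"
    [0.0, 1.0, 1.0],   # "food"
    [0.0, 1.0, 0.0],   # "stomach"
    [0.0, 0.0, 1.0],   # "squirrel"
])

k = 2                                   # assumed number of latent concepts
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fold a query into concept space: q_hat = Sigma_k^{-1} U_k^T q
q = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # query: "food stomach"
q_hat = (Uk.T @ q) / sk

# Documents live in the columns of Vt_k; rank them by cosine similarity.
sims = (Vtk.T @ q_hat) / (
    np.linalg.norm(Vtk, axis=0) * np.linalg.norm(q_hat) + 1e-12)
print(np.argsort(-sims))   # document indices, most similar first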
Clustering before LSI
• Idea:
  • Instead of searching the whole space, search within a concept space.
• Tasks:
  • Define the levels of granularity in the document space.
  • Cluster the documents according to the concept space.
  • Apply LSI within each cluster. (A sketch follows this list.)
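A sketch of the cluster-then-LSI idea. The slides do not name a clustering algorithm, so k-means stands in here; scikit-learn and randomly generated data are further assumptions made for brevity.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((200, 1000))   # toy matrix: 200 documents x 1000 terms

# Step 1: partition the documents into concept clusters.
n_clusters = 4
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# Step 2: build an independent low-rank LSI model inside each cluster,
# so a query routed to one cluster is matched in a much smaller space.
models = {}
for c in range(n_clusters):
    members = np.where(labels == c)[0]
    svd = TruncatedSVD(n_components=5, random_state=0).fit(X[members])
    models[c] = (members, svd)

The payoff is that each per-cluster SVD is far cheaper than one SVD over the whole collection, at the cost of having to route each query to the right cluster first.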
Process
• HTML documents -> strip formatting
• Pick out content words using “stop lists”
• Stemming -> list of words
• Discard words that appear too frequently or too sparsely
• Term weighting: each document in the term-document matrix is a “vector”
• Build concept clusters
• Apply LSI within a cluster
(A sketch of this pipeline follows.)
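A compact sketch of the preprocessing stages above. The stop list, the toy suffix stemmer (a stand-in for, e.g., a Porter stemmer), and the frequency cutoffs are all placeholder assumptions.

import re
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "is"}  # assumed stop list

def strip_formatting(html):
    """Remove HTML tags, keeping only the text content."""
    return re.sub(r"<[^>]+>", " ", html)

def stem(word):
    """Toy suffix-stripping stemmer (stand-in for a Porter stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(html):
    """Strip formatting, drop stop words, stem the remainder."""
    words = re.findall(r"[a-z]+", strip_formatting(html).lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_matrix(pages, min_df=1, max_df_ratio=0.7):
    """Discard too-frequent/too-sparse terms, then TF-IDF weight."""
    token_lists = [tokenize(p) for p in pages]
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n <= max_df_ratio)
    rows = []
    for toks in token_lists:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows   # each row is one document "vector"

pages = ["<p>The digestive system breaks down food</p>",
         "<p>Food for the body and the stomach</p>",
         "<p>Squirrels store food for winter</p>"]
vocab, matrix = build_matrix(pages)
print(vocab)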
Levels of Granularity
Collection -> Document -> Section -> Subsection
Building a Concept Space
[Diagram: the Collection -> Document -> Section -> Subsection hierarchy, with co-adjacent granules at each level and the granularity growing finer toward the subsection level]
Hypotheses
• Definition:
  • Significant terms: defined by the frequency of words in a granule
• Hypotheses:
  • As the granularity becomes finer, the number of significant terms in a granule goes down.
  • Within one granule, the number of significant terms shared with surrounding granules decreases with distance from that granule.
  • The appropriate level of granularity for a piece of knowledge is where the number of significant terms is at a maximum and the number of overlapping significant terms around the space is at a minimum. (A sketch follows this list.)
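A small sketch of how these quantities could be measured. The frequency threshold and the toy granules are assumptions; the slides define significance only via word frequency within a granule.

from collections import Counter

def significant_terms(text, min_freq=2):
    """Terms whose in-granule frequency reaches an assumed threshold."""
    counts = Counter(text.lower().split())
    return {t for t, c in counts.items() if c >= min_freq}

# One document granule split into two finer section granules (toy data).
document = "cells divide cells grow cells die tissue forms tissue grows"
sections = ["cells divide cells grow cells die",
            "tissue forms tissue grows"]

doc_sig = significant_terms(document)
sec_sigs = [significant_terms(s) for s in sections]

# Hypothesis 1: finer granules each yield fewer significant terms.
print(len(doc_sig), [len(s) for s in sec_sigs])   # 2 [1, 1]

# Hypothesis 2: overlap of significant terms between neighboring granules.
print(sec_sigs[0] & sec_sigs[1])                  # set() here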
Sample Data
• Web crawl:
  • For each document, the web crawler (written by Charles Cowart) gathers pages up to 20 levels deep.
  • Size: 200 GB and 1.7 million files