TRANSCRIPT
Analyzing the NSDL Collection
Peter Shin, Charles Cowart, Tony Fountain, Reagan Moore
San Diego Supercomputer Center
Introduction
• Education Impact and Evaluation Standing Committee (EIESC)
• Goal: Characterize the contents of the NSDL collection by topic, audience and type
• Motivation:
  • Enable search of the NSDL collection by subject, audience, and type
  • Inform future collection activities
Research Focus
• Intelligent information retrieval (IR) for the NSDL community: provide efficient discovery of and access to relevant materials
• Support queries on audience, topic, and type
Information Retrieval Techniques
• Keyword searches
  • Find documents that contain the list of words in the search.
  • Example: digestive system -> “Digestive System Web Resources for Students”
• Relevance-based search (probabilistic approach)
  • Instead of matching the actual keywords, find documents that contain words with meanings similar to those in the query.
  • Example: break down food -> NCC Food Group’s Food Stuff; break down food body -> “Squirrel Tales”
• Text categorization
  • Given the contents of a document, assign topic labels.
• Hybrid or combination approaches
  • Mix two or more of the above.
  • Example: first run a keyword search, then apply text categorization. (A sketch of the first two techniques follows this list.)
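A minimal sketch of the first two techniques, assuming a three-document toy corpus: exact keyword matching versus a relevance-style ranking using TF-IDF weights and cosine similarity. Everything here is illustrative, not the NSDL system's implementation.

import math
from collections import Counter

docs = {
    "d1": "digestive system web resources for students",
    "d2": "how the stomach breaks food down in the body",
    "d3": "squirrel tales a story about food and the body",
}

def keyword_search(query, docs):
    """Boolean AND: return documents containing every query word."""
    terms = query.lower().split()
    return [d for d, text in docs.items()
            if all(t in text.split() for t in terms)]

def doc_frequencies(docs):
    """Number of documents each term appears in."""
    return Counter(t for text in docs.values() for t in set(text.split()))

def tfidf(tokens, df, n):
    """TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n / df[t]) for t in tf if df.get(t)}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

n, df = len(docs), doc_frequencies(docs)
vecs = {d: tfidf(text.split(), df, n) for d, text in docs.items()}
qvec = tfidf("break down food".split(), df, n)

print(keyword_search("digestive system", docs))            # exact match: ['d1']
print(sorted(docs, key=lambda d: -cosine(qvec, vecs[d])))  # ['d2', 'd3', 'd1']

Note the contrast: the query "break down food" matches no document verbatim, but the similarity ranking still surfaces the documents about food and the body.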
Challenges
• Not enough metadata
  • e.g., audience type (intended grade level)
• Need metadata standards
  • No structure in the metadata
  • Example: Odd/Even Number – subject: number senses; no mathematics, algebra, or number theory
  • No concept map or ontology to capture complex topic relationships
  • Example: the relationship between algebra and calculus
• Need to make annotation easy and accurate
  • Assist with or automate labeling the documents with standard annotations
  • Possible errors in the existing hand-labeled documents
  • Example: mismatch between the contents of the front pages and their metadata
• Computationally intensive
  • Over 20,000 HTML documents with over 1,300,000 unique terms (a sizing sketch follows this list)
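To put the last point in perspective, a back-of-the-envelope sizing of the term-document matrix at the quoted scale; the average number of distinct terms per document is an assumed figure for illustration.

# Sizing sketch using the numbers quoted above.
n_docs, n_terms = 20_000, 1_300_000

dense_bytes = n_docs * n_terms * 8                    # 8-byte floats
print(f"dense matrix: {dense_bytes / 1e9:.0f} GB")    # ~208 GB

avg_terms_per_doc = 500   # assumption: distinct terms per document
# Sparse (CSR-style) storage: one value + one column index per nonzero.
sparse_bytes = n_docs * avg_terms_per_doc * (8 + 4)
print(f"sparse matrix: {sparse_bytes / 1e6:.0f} MB")  # ~120 MB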
Suggestions
• Involve the community to define methods, sample queries and evaluation criteria, generate metadata, perform comparative studies
• Create a training/testing data set that is annotated and checked for correctness which can be used by all researchers
• Provide a forum for sharing methods and results
• Build an evaluation testbed – collect data, algorithms, tools, and results, plus hardware and software; provide an online web portal interface to the resources
Status of the NSDL testbed at SDSC
• Monthly web crawl of the NSDL sites
• Persistent archive of the harvested materials
• Processing pipeline for various IR techniques
• Software resources:
  • Storage Resource Broker (SRB)
  • NSDL Archive Service (web crawling)
  • Various processing pipeline scripts that can run in parallel
  • SVMlight by Thorsten Joachims
  • Latent Semantic Indexing from Telcordia
  • Latent Dirichlet Allocation by David Blei from UC Berkeley
  • Cheshire – online catalog and full-text retrieval system (from UC Berkeley and the University of Liverpool)
• Hardware resources:
  • IBM DataStar – supercomputer with 10.4 teraflops of computing power
  • TeraGrid – collection of supercomputers with high-throughput communication
Summary
• Metadata evaluation is important and challenging.
• Information retrieval techniques are promising.
• NSDL community involvement is necessary to define evaluation methods.
• A collaborative testbed would facilitate analysis.
• An initial testbed is under development at SDSC.
Latent Semantic Indexing (LSI)
• Assumption:
  • If documents have many words in common, the documents are closely related.
• Applications:
  • Search engines
  • Archivist’s assistance
  • Automated writing assessment
  • Information filtering
• Drawbacks:
  • Not scalable (a minimal sketch follows this list)
  • No incremental update
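A minimal LSI sketch, assuming a toy 5-term x 3-document matrix and k = 2 latent concepts: factor the matrix with a truncated SVD, fold a query into the reduced space, and rank documents there. Real collections require sparse partial-SVD solvers, which is where the scalability drawback bites.

import numpy as np

# Rows = terms, columns = documents (toy counts standing in for TF-IDF).
A = np.array([
    [1.0, 0.0, 0.0],   # "digestive"
    [1.0, 0.0, 0.0],   # "system"
    [0.0, 1.0, 1.0],   # "food"
    [0.0, 1.0, 0.0],   # "stomach"
    [0.0, 0.0, 1.0],   # "squirrel"
])

k = 2                                   # assumed number of latent concepts
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fold a query into concept space: q_hat = Sigma_k^{-1} U_k^T q
q = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # query: "food stomach"
q_hat = (Uk.T @ q) / sk

# Documents live in the columns of Vt_k; rank them by cosine similarity.
sims = (Vtk.T @ q_hat) / (
    np.linalg.norm(Vtk, axis=0) * np.linalg.norm(q_hat) + 1e-12)
print(np.argsort(-sims))   # document indices, most similar first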
Clustering before LSI
• Idea:
  • Instead of searching the whole space, search within a concept space.
• Tasks:
  • Define the levels of granularity in the document space.
  • Cluster the documents according to the concept space.
  • Apply LSI within each cluster. (A sketch follows this list.)
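A sketch of the cluster-then-LSI idea. The slides do not name a clustering algorithm, so k-means stands in here; scikit-learn and randomly generated data are further assumptions made for brevity.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((200, 1000))   # toy matrix: 200 documents x 1000 terms

# Step 1: partition the documents into concept clusters.
n_clusters = 4
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# Step 2: build an independent low-rank LSI model inside each cluster,
# so a query routed to one cluster is matched in a much smaller space.
models = {}
for c in range(n_clusters):
    members = np.where(labels == c)[0]
    svd = TruncatedSVD(n_components=5, random_state=0).fit(X[members])
    models[c] = (members, svd)

The payoff is that each per-cluster SVD is far cheaper than one SVD over the whole collection, at the cost of having to route each query to the right cluster first.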
Process
• HTML documents -> strip formatting
• Pick out content words using “stop lists”
• Stemming -> list of words
• Discard words that appear too frequently or too sparsely
• Term weighting: each document in the term-document matrix is a “vector”
• Build concept clusters
• Apply LSI within a cluster
(A sketch of this pipeline follows.)
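A compact sketch of the preprocessing stages above. The stop list, the toy suffix stemmer (a stand-in for, e.g., a Porter stemmer), and the frequency cutoffs are all placeholder assumptions.

import re
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "is"}  # assumed stop list

def strip_formatting(html):
    """Remove HTML tags, keeping only the text content."""
    return re.sub(r"<[^>]+>", " ", html)

def stem(word):
    """Toy suffix-stripping stemmer (stand-in for a Porter stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(html):
    """Strip formatting, drop stop words, stem the remainder."""
    words = re.findall(r"[a-z]+", strip_formatting(html).lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_matrix(pages, min_df=1, max_df_ratio=0.7):
    """Discard too-frequent/too-sparse terms, then TF-IDF weight."""
    token_lists = [tokenize(p) for p in pages]
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n <= max_df_ratio)
    rows = []
    for toks in token_lists:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows   # each row is one document "vector"

pages = ["<p>The digestive system breaks down food</p>",
         "<p>Food for the body and the stomach</p>",
         "<p>Squirrels store food for winter</p>"]
vocab, matrix = build_matrix(pages)
print(vocab)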
Levels of Granularity
Collection -> Document -> Section -> Subsection
Building a Concept Space
[Diagram: the Collection -> Document -> Section -> Subsection hierarchy, with co-adjacent granules at each level and the granularity growing finer toward the subsection level]
Hypotheses
• Definition:
  • Significant terms: defined by the frequency of words in a granule
• Hypotheses:
  • As the granularity becomes finer, the number of significant terms in a granule goes down.
  • Within one granule, the number of significant terms shared with surrounding granules decreases with distance from that granule.
  • The appropriate level of granularity for a piece of knowledge is where the number of significant terms is at a maximum and the number of overlapping significant terms around the space is at a minimum. (A sketch follows this list.)
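A small sketch of how these quantities could be measured. The frequency threshold and the toy granules are assumptions; the slides define significance only via word frequency within a granule.

from collections import Counter

def significant_terms(text, min_freq=2):
    """Terms whose in-granule frequency reaches an assumed threshold."""
    counts = Counter(text.lower().split())
    return {t for t, c in counts.items() if c >= min_freq}

# One document granule split into two finer section granules (toy data).
document = "cells divide cells grow cells die tissue forms tissue grows"
sections = ["cells divide cells grow cells die",
            "tissue forms tissue grows"]

doc_sig = significant_terms(document)
sec_sigs = [significant_terms(s) for s in sections]

# Hypothesis 1: finer granules each yield fewer significant terms.
print(len(doc_sig), [len(s) for s in sec_sigs])   # 2 [1, 1]

# Hypothesis 2: overlap of significant terms between neighboring granules.
print(sec_sigs[0] & sec_sigs[1])                  # set() here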
Sample Data
• Web crawl:
  • For each document, the web crawler (written by Charles Cowart) gathers pages up to 20 levels deep.
  • Size: 200 GB and 1.7 million files