exploring bridges between open and vetted information domains by julie mason, data fountains service...

52
Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner Truppel, Programmer

Post on 20-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Exploring Bridges Between Open and Vetted Information Domains

By

Julie Mason, Data Fountains Service Manager

Johannes Ruscheinski, Lead Programmer

Wagner Truppel, Programmer

Page 2: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

SECTION I:

Background and Context of Our Projects

SECTION II:

Approaches to System Building

SECTION III:

Classification

SECTION IV:

Preview of Data Fountains Interface

INFOMINE, iVia, Data Fountains

Page 3: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

SECTION I:

Background and Context of Our Projects

INFOMINE, iVia, Data Fountains

Page 4: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

If you build it…

…they will come.

Page 5: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

T he E vo lu tion o f IN F O M IN E1993 - 2001

T he E vo lu tion o f IN F O M IN ET he E vo lu tion o f IN F O M IN E1993 - 20011993 - 2001

The Evolution of INFOMINEThe Evolution of INFOMINE1994-20041994-2004

The Evolution of INFOMINEThe Evolution of INFOMINE1994-20041994-2004

Guiding Principle…

To build a collaborative public domain academic Internet resources

finding tool through creative application of technologies.

Page 6: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

- INFOMINE is a virtual library containing over 100,000 links (A hybrid collection containing 26,000 librarian created links and 75,000 plus robot/crawler created links).

- Founded in January of 1994 it is one of the first Web- based services offered by a library anywhere.

- It is a cooperative effort of librarians from UC Riverside, other UCs (including UCLA, UCSC and the UC Shared Cataloging Project), three California State Universities, Wake Forest University and the University of Detroit. Special cooperative efforts are in process with the Library of Congress and NSDL.

Page 7: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Our Technologies and Architecture: • New and Interactive Collection Building Technologies to

Amplify Expert Effort / New Uses of Expertise to Refine Collection Building Technologies

1.) Focused crawling

2.) Rich Text Identification and Harvest

3.) Machine-based Classification

4.) Foundation Record

5.) Hybrid Collections Architecture

6.) Usage in Existing, Closed, Relatively Homogeneous Collections

7.) Cooperative Technology

8.) Appropriately Scaled and Modularly Designed

Page 8: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 9: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

http://infomine.ucr.edu/iVia/• The new open source software platform designed to help INFOMINE

and other virtual libraries scale well in terms of amplifying expert effort to enable the development of better and more representative collections.

• Automated and semi-automated Internet resource identification and collection is made possible through focused crawling software.

• Automated and semi-automated indexing or metadata generation is made possible through classifier software.

• A hybrid, two-tiered collection is supported. The first tier are our expert created records and the second tier is made up of machine created records.

• Brings together some of the best of expert created virtual library approaches with the best of automated approaches to collection building

Page 10: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 11: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 12: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 13: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 14: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Data Fountains (under development this year)

• A cooperative, cost-recovery based metadata generation service that will be an array of iVias, one for each participating project or subject community, and which will create metadata records for the participants.

• A big emphasis, in addition to fully-automated resource discovery and metadata generation, will be on semi-automated approaches that strongly involve and amplify the efforts of collection experts. They, in turn, work to refine and perfect machine approaches and processes.

• The metadata created can be bundled in differing “products” according to differing participant needs in terms of amount of metadata needed, type (natural language or controlled terminology), degree of relevance or comprehensiveness desired (highly relevant records or moderately relevant).

Page 15: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 16: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Page 17: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

INFOMINE, iVia, Data Fountains

Modular Architecture that Supports a Federated Array ofSubject Specific Focused Crawlers and Classifiers

Page 18: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 19: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 20: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 21: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 22: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 23: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 24: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 25: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

If This Was of Interest to You, Consider Joining Our Data Fountains Founders Group

A group of 10-20 collections or portal projects cooperatively defining products that will be generated from the Data Fountains

service.

If interested contact: Julie Mason, [email protected]

Page 26: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 27: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 28: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner
Page 29: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

A Brief Overview Of The iVia And Data Fountains Architectures

Dr. Johannes Ruscheinski

[email protected]

Page 30: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

iVia & Data Fountains Overview

• What are iVia and Data Fountains (DF)?

• Why do we need iVia and DF?

• Architecture overview of iVia

• Architecture overview of DF

INFOMINE, iVia, Data Fountains

Page 31: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

1. What are iVia and Data Fountains?

• Tools to find resources on the Web and to assign metadata to them, both manually and automatically

• A public Web interface to search a database of URL’s based on various metadata and full text

• A family of crawlers (spiders) that automatically and semi-automatically retrieve Web pages and classify them

• A sophisticated Web interface for human classification experts for assigning metadata to Web documents

INFOMINE, iVia, Data Fountains

Page 32: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

2. Why do we need iVia and DF?

• We already have powerful search engines (e.g., Google) so why do we need yet more and much smaller ones?• Precision is abysmal• Researchers claim that even the largest search engines cover a

shrinking percentage of the Web

• iVia offers highly specific search and browse capabilities allowing for higher search precision

• Both iVia and DF use focused crawling to find authoritative pages on indicated topics

INFOMINE, iVia, Data Fountains

Page 33: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

3. Architecture overview of iVia

INFOMINE, iVia, Data Fountains

Page 34: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

3.1 Master Database

• MySQL-based SQL database

• Contains both expert and robot generated records

• Contains metadata (URL’s, subjects, keywords, authors, titles, …)

• Contains full text in the form of compressed Web page content

INFOMINE, iVia, Data Fountains

Page 35: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

3.2 The Adder Interface

• Sophisticated Web interface for expert classification of Web pages

• Password-protected with varying privilege levels• Allows both adding of new resources and editing of

already existing ones• Has automatic resource duplicate checking• Contains an automatic metadata extractor• Configurable via a preferences screen

INFOMINE, iVia, Data Fountains

Page 36: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

3.3 Crawlers

• Add robot records to the master database• Assign metadata to crawled records• Three different types of crawlers in iVia/DF• Expert-guided crawler with drill-down and drill-out to

crawl single sites• Vlcrawler to crawl virtual libraries• Nalanda iVia Focused Crawler (NIFC) to crawl Web

communities defined around a given topic

INFOMINE, iVia, Data Fountains

Page 37: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

3.4 Search Engine

• Public search interface (e.g. http://infomine.ucr.edu)

• Based on inverted index databases built from the contents of the master database on a nightly basis

• Supports sophisticated searching through metadata and full-text

• Nested boolean queries, truncated searches, word proximity searches, etc

• Search results can be displayed in a wide variety of different themes (skins) that allow collaborating institutions to brand their interface

INFOMINE, iVia, Data Fountains

Page 38: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

4. Architecture overview of DF

INFOMINE, iVia, Data Fountains

Page 39: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

4.1 Seed Set Generator

• Seed sets are sets of URL’s that define a topic of interest• Seed sets can be supplied in various formats by a client (e.g.

simple text file with a list of URL’s)• Typically need around 200 highly topic-specific URL’s• Problem: most users would come up with only a few dozen• Solution: scout module uses a search engine such as Google to

fatten up the user-provided initial set

INFOMINE, iVia, Data Fountains

Page 40: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

4.2 Nalanda iVia Focused Crawler

• Primarily developed by Dr. Soumen Chakrabarti (IIT Bombay), a leading crawler researcher

• Sophisticated focused crawler using document classification methods and Web graph analysis techniques to stay on topic

• Supports user interaction via URL pattern blacklisting etc• Uses an apprentice classifier to prioritize links that should be

followed• Returns a list of URL’s likely to be on the initial seed set topic

INFOMINE, iVia, Data Fountains

Page 41: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

4.3 Distiller

• Attempts to rank URL’s returned by the NIFC according to their relevance to the client-provided topic

• Uses improved Kleinberg-like Web graph analysis to assign hub and authority values to each URL

• Returns scores for each provided URL

INFOMINE, iVia, Data Fountains

Page 42: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

4.4 Metadata Exporter

• Final stage of DF• Provides clients with convenient data formats to

incorporate the best on-topic URL’s into their own databases

• Allows different amounts/quality of metadata to be exported based on the client’s selected service model

• Supports various export types and file formats (simple URL lists, delimiter-separated file formats, XML file formats, MARC records and export via OAI-PMH)

INFOMINE, iVia, Data Fountains

Page 43: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automatic Document Classification

Wagner Truppel

[email protected]

Page 44: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Document Classification

• What is it?

– Doing to documents on the Internet what we already do to books in a library

• Why?

– To better organize information

– To speed up searches

– To rank search results

INFOMINE, iVia, Data Fountains

Page 45: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Example Subject Categories

• LCC: Library of Congress Categories• LCSH: Library of Congress Subject Headings

• Infomine Subject Categories– Biological, Agricultural, and Medical Sciences– Business and Economics– Cultural Diversity– Electronic Journals– Government Info– Maps and Geographical Information Systems– Physical Sciences, Engineering, and Mathematics– Social Sciences and Humanities– Visual and Performing Arts

INFOMINE, iVia, Data Fountains

Page 46: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Example

INFOMINE, iVia, Data Fountains

Page 47: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Example: Korea Rice Genome Database

• Is it about…

– Geography ?

– Agriculture ?

– Genetics ?

• Which Infomine category do we put it in ?

– Biological, Agricultural, and Medical Sciences

• Pretty obvious, right ?

– For humans, yes. But how do we automate it ?

INFOMINE, iVia, Data Fountains

Page 48: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automating Document Classification

• We need a way to measure document similarity• Each document is basically just a list of words, so we can

count how frequently each word appears in it• These word frequencies are one of many possible document

attributes• Document similarity is mathematically defined in terms of

document attributes

INFOMINE, iVia, Data Fountains

Page 49: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automating Document Classification

• The previous slide contains 51 words– document 6– word, of 3 each– we, a, in, is, each 2 each– All other words 1 each

• Note that we consider words such as word and words to be the same

• We also don’t care about capitalization• In general, we’d also ignore non-descriptive words such as we,

a, of, the, and so on

INFOMINE, iVia, Data Fountains

Page 50: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automating Document Classification• Not an easy task

– The distribution of words shows that the slide in question is not very rich in content

• The most frequent word (document) is not very descriptive• The most descriptive word (classification) does not appear

very frequently in the slide– How descriptive and how frequent a word should be depends

on the category

• The task is easier when:– we have a large number of content-rich documents– categories are characterized by very specific words which

don’t appear very frequently in other categories

INFOMINE, iVia, Data Fountains

Page 51: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automating Document Classification• Two documents sharing a large number of category-specific

words are considered to be very similar to each other• Document similarity can thus be quantified and computed

automatically• Documents can then be ranked by their similarity to each other• A large group of documents that are all very similar to each

other can then be considered to define the category they belong to (the set of all such groups is called the Training Corpus)

• One way to classify a document is then to put it in the same category as that of the training document that it’s most similar to

INFOMINE, iVia, Data Fountains

Page 52: Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner

Automating Document Classification

• The classification method just described is known as the Nearest Neighbor method

• There are other methods, which may be more suited for the classification of documents from the Internet– Naïve Bayes– Support Vector Machine (SVM)– Logistic Regression

• Infomine uses a flexible approach – supporting all of these methods – in an attempt to produce highly-accurate classifications

INFOMINE, iVia, Data Fountains