
1

CS 430: Information Discovery

Lecture 21

Non-Textual Materials 1

2

Course Administration

Discussion classes

• Attend!

• Speak!

3


Lecture 20

Material not covered in previous class

4

Information Retrieval Using PageRank

Simple Method

Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal.

Display the hits ranked by PageRank.

The disadvantage of this method is that it gives no attention to how closely a document matches a query.
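The simple method can be sketched in a few lines. This is only an illustration: the function name, the toy documents, and the PageRank values are all invented for the example.

```python
def simple_pagerank_retrieval(query_terms, documents, pagerank):
    """Return ids of all documents sharing a term with the query, ordered by PageRank."""
    query = set(query_terms)
    # Every document that shares at least one term is a hit; all hits are treated as equal.
    hits = [doc_id for doc_id, terms in documents.items() if query & set(terms)]
    # The only ordering criterion is the query-independent PageRank.
    return sorted(hits, key=lambda doc_id: pagerank[doc_id], reverse=True)


# Hypothetical toy data
documents = {
    "d1": ["bird", "song", "recording"],
    "d2": ["bird", "watching"],
    "d3": ["map", "projection"],
}
pagerank = {"d1": 0.2, "d2": 0.5, "d3": 0.3}

ranked = simple_pagerank_retrieval(["bird", "song"], documents, pagerank)
print(ranked)
```

In this toy data d1 matches both query terms and d2 only one, yet d2 is listed first because its PageRank is higher, which is the disadvantage noted above.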

5

Reference Pattern Ranking using Dynamic Document Sets

PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.

Concept of dynamic document sets. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.

With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.

6

Reference Pattern Ranking using Dynamic Document Sets

Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)

1. Search using conventional term weighting. Rank the hits using similarity between query and documents.

2. Select the highest ranking hits (e.g., top 5,000 hits).

3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.

4. Display the results ranked in the order of the reference patterns calculated.
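The four steps can be sketched as follows. This is not Teoma's actual implementation, whose details are not given here; it is a minimal illustration that uses a plain power-iteration PageRank as the "PageRank or similar algorithm". The function names, the damping factor of 0.85, and the toy data are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node with no out-links spreads its rank evenly over all nodes.
            targets = links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def dynamic_ranking(similarity, links, top_k):
    """similarity: {doc: query similarity}; links: {doc: [docs it references]}."""
    # Steps 1-2: rank hits by term-weighting similarity and keep the top_k.
    top = sorted(similarity, key=similarity.get, reverse=True)[:top_k]
    selected = set(top)
    # Step 3: keep only references between selected documents, then rank that subgraph.
    sub_links = {d: [t for t in links.get(d, []) if t in selected] for d in top}
    ranks = pagerank(sub_links)
    # Step 4: order the results by the query-specific ranks.
    return sorted(top, key=ranks.get, reverse=True)


# Hypothetical toy data: "d" has low similarity and is dropped before link analysis.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["b"]}
print(dynamic_ranking(similarity, links, top_k=3))
```

Because the link graph is rebuilt from the selected hits, the same document can receive a different rank for each query, unlike the fixed ranks produced by a global PageRank run.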

7

Combining Term Weighting with Reference Pattern Ranking

Combined Method

1. Find all documents that share a term with the query vector.

2. The similarity, using conventional term weighting, between the query and document j is sj.

3. The rank of document j using PageRank or other reference pattern ranking is pj.

4. Calculate a combined rank cj = λsj + (1 - λ)pj, where λ is a constant.

5. Display the hits ranked by cj.

This method is used in several commercial systems, but the details have not been published.
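A minimal sketch of steps 2 to 5, assuming that both scores have already been scaled to a comparable range (say 0 to 1) and that the weighting constant, called lam here, lies between 0 and 1. The names and numbers are illustrative only and do not describe any commercial system.

```python
def combined_ranking(similarity, reference_rank, lam=0.5):
    """Order documents by lam * s_j + (1 - lam) * p_j."""
    combined = {
        doc: lam * similarity[doc] + (1.0 - lam) * reference_rank[doc]
        for doc in similarity
    }
    ordered = sorted(combined, key=combined.get, reverse=True)
    return ordered, combined


# Hypothetical scores: d1 matches the query well, d2 is heavily referenced.
similarity = {"d1": 0.9, "d2": 0.4}
reference_rank = {"d1": 0.1, "d2": 0.8}

print(combined_ranking(similarity, reference_rank, lam=0.5)[0])
print(combined_ranking(similarity, reference_rank, lam=0.9)[0])
```

With equal weighting the heavily referenced document d2 comes first; raising the constant toward 1 lets term similarity dominate and d1 moves to the top.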

8

Cornell Note

Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, covering both theory and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).
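One way to make the two notions concrete is the mutually reinforcing iteration usually associated with Kleinberg's work: a good hub points to good authorities, and a good authority is pointed to by good hubs. The sketch below is a simplified illustration under that assumption, not a reproduction of any published code; the function name, iteration count, and toy graph are invented.

```python
import math


def hubs_and_authorities(links, iterations=20):
    """links: {doc: [docs it refers to]}. Returns (hub scores, authority scores)."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the documents that refer to it.
        auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the documents it refers to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: "survey" refers to many documents, "paperA" is referenced by many.
links = {"survey": ["paperA", "paperB", "paperC"], "blog": ["paperA"], "paperA": []}
hub, auth = hubs_and_authorities(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```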

9

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October 2003. http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf

"Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...."

10


Lecture 21

Non-Textual Materials 1

11

Examples of Non-textual Materials

Content                    Attribute

maps                       lat. and long., content
photograph                 subject, date and place
bird songs and images      field mark, bird song
software                   task, algorithm
data set                   survey characteristics
video                      subject, date, etc.

12

Possible Approaches to Information Discovery for Non-text Materials

Human indexing

Manually created metadata records

Automated information retrieval

Automatically created metadata records (e.g., image recognition)

Context: associated text, links, etc. (e.g., Google image search)

Multimodal: combine information from several sources

User expertise

Browsing: user interface design

13

Surrogates

Surrogates for searching

• Catalog records

• Finding aids

• Classification schemes

Surrogates for browsing

• Summaries (thumbnails, titles, skims, etc.)

14

Catalog Records for Non-Textual Materials

• General metadata standards, such as Dublin Core and MARC, can be used to create a textual catalog record of non-textual items.

• Subject-based metadata standards apply to specific categories of materials, e.g., FGDC for geospatial materials.

• Text-based searching methods can be used to search these catalog records.
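As a rough illustration of the last point, the sketch below stores two textual catalog records and searches them with ordinary term matching. The field names follow Dublin Core elements (title, subject, type, date, coverage), but the records themselves and the function names are invented for the example.

```python
import re

# Hypothetical catalog records for non-textual items, keyed by Dublin Core element names.
records = [
    {
        "identifier": "photo-001",
        "title": "Threshing crew at harvest",
        "subject": "agriculture; farm life",
        "type": "StillImage",
        "date": "1905",
    },
    {
        "identifier": "map-001",
        "title": "Topographic map of Santa Barbara",
        "subject": "topography",
        "type": "Image",
        "coverage": "Santa Barbara, California",
    },
]


def tokens(text):
    """Lower-case the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_catalog(records, query):
    """Return identifiers of records whose text contains every query term."""
    wanted = tokens(query)
    hits = []
    for record in records:
        record_terms = tokens(" ".join(record.values()))
        if wanted <= record_terms:
            hits.append(record["identifier"])
    return hits


print(search_catalog(records, "harvest crew"))
print(search_catalog(records, "santa barbara map"))
```

The search never looks at the photograph or the map itself; it can only find what the cataloguer chose to write down.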

15

Automated Creation of Metadata Records

Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field.

Examples

• Images -- characteristics of color, texture, shape, etc. (crude)

• Music -- optical recognition of score (good)

• Bird song -- spectral analysis of sounds (good)

• Fingerprints (good)
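To give a feel for why color characteristics are rated "crude", here is a minimal sketch of one such feature: a coarse color histogram compared by histogram intersection. Images are represented simply as lists of (r, g, b) tuples with values 0 to 255; the function names, the number of quantization levels, and the pixel data are all assumptions made for the example.

```python
def color_histogram(pixels, levels=2):
    """Quantize each RGB channel into `levels` bands and return normalized bin counts."""
    bins = [0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to a band index in range(levels).
        ri, gi, bi = (r * levels // 256, g * levels // 256, b * levels // 256)
        bins[(ri * levels + gi) * levels + bi] += 1
    total = len(pixels)
    return [count / total for count in bins]


def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical color distributions, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical four-pixel "images"
sky = [(30, 60, 200)] * 3 + [(250, 250, 250)]
sea = [(20, 80, 180)] * 4
forest = [(20, 150, 30)] * 4

print(histogram_similarity(color_histogram(sky), color_histogram(sea)))
print(histogram_similarity(color_histogram(sky), color_histogram(forest)))
```

The histogram rates the sky and sea images as similar because both are mostly blue, although they show different subjects. Metadata of this kind describes appearance, not content.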

16

Image Retrieval: Blobworld

17

Example 1: Photographs

Photographs in the Library of Congress's American Memory collections

In American Memory, each photograph is described by a MARC record.

The photographs are grouped into collections, e.g., The Northern Great Plains, 1880-1920: Photographs from the Fred Hultstrand and F.A. Pazandak Photograph Collections.

Information discovery is by:

• searching the catalog records

• browsing the collections

21

Photographs: Cataloguing Difficulties

Automatic

• Image recognition methods are very primitive

Manual

• Photographic collections can be very large

• Many photographs may show the same subject

• Photographs have little or no internal metadata (no title page)

• The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

22

Photographs: Difficulties for Users

Searching

• Often difficult to narrow the selection down by searching -- browsing is required

• Criteria may be different from those in catalog (e.g., graphical characteristics)

Browsing

• Offline. Handling many photographs is tedious. Photographs can be damaged by repeated handling

• Online. Viewing many images can be tedious. Screen quality may be inadequate.

23

Example 2: Mathematical Software

Netlib

• A digital library of mathematical software (Jack Dongarra and Eric Grosse).

• Exchange of software in numerical analysis, especially for supercomputers with vector or parallel architectures.

• Organization of material assumes that users are mathematicians and scientists who will incorporate the software into their own computer programs.

• The collections are arranged in a hierarchy. The editors use their knowledge of the specific field to decide the method of organization.

24

GAMS: Guide to Available Mathematical Software

25

Example 3: Geospatial Information

Example: Alexandria Digital Library at the University of California, Santa Barbara

• Funded by the NSF Digital Libraries Initiative since 1994.

• Collections include any data referenced by a geographical footprint.

terrestrial maps, aerial and satellite photographs, astronomical maps, databases, related textual information

• Program of research with practical implementation at the university's map library

26

Alexandria User Interface

27

Alexandria: Computer Systems and User Interfaces

Computer systems

• Digitized maps and geospatial information -- large files

• Wavelets provide multi-level decomposition of image

-> first level is a small coarse image

-> extra levels provide greater detail
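The idea of multi-level decomposition can be illustrated with the simplest wavelet, the Haar wavelet, applied to a single row of pixel values. Which wavelet Alexandria actually uses is not stated here, so the choice of Haar, the one-dimensional simplification, and all names below are assumptions.

```python
def haar_decompose(signal, levels):
    """Split a signal into a coarse approximation plus one list of details per level.

    The signal length must be divisible by 2 ** levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])  # what averaging throws away
        approx = [(a + b) / 2 for a, b in pairs]         # half-size, coarser version
    return approx, details


def haar_refine(approx, detail):
    """Add one level of detail back, doubling the resolution."""
    refined = []
    for average, difference in zip(approx, detail):
        refined.extend([average + difference, average - difference])
    return refined


row = [10, 12, 20, 24, 30, 30, 8, 4]
coarse, details = haar_decompose(row, levels=2)
print(coarse)  # small coarse version that can be delivered first

# Extra levels are added only if the user asks for greater detail.
medium = haar_refine(coarse, details[1])
full = haar_refine(medium, details[0])
print(medium, full)
```

A client can display the coarse version immediately and request the detail levels later, which suits large files delivered over a slow network.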

User interfaces

• Small size of computer displays

• Slow performance of Internet in delivering large files

-> retain state throughout a session

28

Alexandria: Information Discovery

Metadata for information discovery

Coverage: geographical area covered, such as the city of Santa Barbara or the Pacific Ocean.

Scope: varieties of information, such as topographical features, political boundaries, or population density.

Latitude and longitude provide basic metadata for maps and for geographical features.
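A minimal sketch of how coverage and scope might drive discovery: each item carries a latitude/longitude bounding box (its footprint) and a scope label, and a query returns the items whose footprint overlaps the query region. The class, the function names, the catalog entries, and the approximate coordinates are all invented for illustration, and the overlap test assumes that no box crosses the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    south: float
    west: float
    north: float
    east: float


def overlaps(a, b):
    """True if two latitude/longitude boxes share any area (no date-line crossing)."""
    return (a.south <= b.north and b.south <= a.north
            and a.west <= b.east and b.west <= a.east)


# Hypothetical catalog: (title, scope, footprint), coordinates approximate.
catalog = [
    ("Santa Barbara street map", "topographical features",
     Footprint(34.38, -119.78, 34.46, -119.63)),
    ("California county boundaries", "political boundaries",
     Footprint(32.5, -124.5, 42.0, -114.1)),
    ("Manhattan aerial photograph", "topographical features",
     Footprint(40.70, -74.02, 40.88, -73.91)),
]


def search(catalog, region, scope=None):
    """Return titles whose footprint overlaps the region, optionally filtered by scope."""
    return [title for title, item_scope, footprint in catalog
            if overlaps(footprint, region) and (scope is None or item_scope == scope)]


santa_barbara = Footprint(34.3, -119.9, 34.5, -119.6)
print(search(catalog, santa_barbara))
print(search(catalog, santa_barbara, scope="political boundaries"))
```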

29

Gazetteer

Gazetteer: a database and a set of procedures that translate between representations of geospatial references:

place names, geographic features, coordinates, postal codes, census tracts

Search engine tailored to peculiarities of searching for place names.
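The translation role of a gazetteer can be sketched as a lookup table plus a name-normalization step, so that variant spellings of a place name resolve to the same entry. The entries, the approximate coordinates, and the function names below are invented for the example; a real gazetteer handles far more variation than this.

```python
import re

# Hypothetical gazetteer entries with variant names and approximate coordinates.
PLACES = [
    {"names": ["Santa Barbara", "Santa Barbara, Calif."],
     "feature": "city", "lat": 34.42, "lon": -119.70},
    {"names": ["Ithaca", "Ithaca, N.Y."],
     "feature": "city", "lat": 42.44, "lon": -76.50},
]


def normalize(name):
    """Lower-case the name and collapse punctuation and spacing to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


# Index every variant name under its normalized form.
INDEX = {normalize(name): place for place in PLACES for name in place["names"]}


def lookup(name):
    """Translate a place name into a gazetteer entry, or None if it is unknown."""
    return INDEX.get(normalize(name))


def places_within(south, west, north, east):
    """Translate a coordinate box back into the names of the places it contains."""
    return [p["names"][0] for p in PLACES
            if south <= p["lat"] <= north and west <= p["lon"] <= east]


print(lookup("santa barbara, CALIF"))
print(places_within(34.0, -120.0, 35.0, -119.0))
```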

Research is making steady progress at feature extraction, using automatic programs to identify objects in aerial photographs or printed maps -- topic for long-term research.

30

Collections: Finding Aids and the EAD

Finding aid

• A list, inventory, index or other textual document created by an archive, library or museum to describe holdings.

• May provide fuller information than is normally contained within a catalog record or be less specific.

• Does not necessarily have a detailed record for every item.

The Encoded Archival Description (EAD)

• A format (XML DTD) used to encode electronic versions of finding aids.

• Heavily structured -- much of the information is derived from hierarchical relationships.
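The sketch below shows what "derived from hierarchical relationships" means in practice. It parses a small, made-up finding aid that uses EAD element names (archdesc, dsc, c01, c02, did, unittitle) and prints each component together with the titles of its ancestors. The sample content and the helper names are assumptions, and a real finding aid contains many more elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated finding aid.
EAD_SAMPLE = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Example Photograph Collection</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Portraits</unittitle></did>
        <c02 level="file">
          <did><unittitle>Studio portraits, 1900-1910</unittitle></did>
        </c02>
      </c01>
      <c01 level="series">
        <did><unittitle>Landscapes</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def is_component(element):
    """EAD components are tagged c, or c01, c02, ... when numbered."""
    return re.fullmatch(r"c(\d\d)?", element.tag) is not None


def walk(component, ancestors=()):
    """Yield the chain of titles from the collection down to each component."""
    title = component.findtext("did/unittitle", default="(untitled)")
    path = ancestors + (title,)
    yield path
    dsc = component.find("dsc")
    children = list(component) + (list(dsc) if dsc is not None else [])
    for child in children:
        if is_component(child):
            yield from walk(child, path)


root = ET.fromstring(EAD_SAMPLE)
paths = [" / ".join(path) for path in walk(root.find("archdesc"))]
for line in paths:
    print(line)
```

The file "Studio portraits, 1900-1910" says nothing about its collection or series; that context comes entirely from its position in the hierarchy.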

31

Collection-Level Metadata

Collection-level metadata is used to describe a group of items.

For example, one record might describe all the images in a photographic collection.

Note: There are proposals to add collection-level metadata records to Dublin Core. However, a collection is not a document-like object.

32

Collection-Level Metadata

33

Data Mining

• Extraction of information from online data.

• Not a topic of this course.
