HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS The Million Book Project Michael I. Shamos, Ph.D., J.D. Director, Universal Library School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania, USA

Post on 19-Dec-2015

HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS

The Million Book Project

Michael I. Shamos, Ph.D., J.D.

Director, Universal Library

School of Computer Science

Carnegie Mellon University

Pittsburgh, Pennsylvania, USA

Where is Pittsburgh?

The Universal Library

• Project of Carnegie Mellon University

• All published works of mankind digitized and online

• Instantly available

• Free to read

• In any language

• Anywhere in the world

• Searchable and browsable by humans and machines

• DEMO

Why Digitize?

• Books are inefficient carriers of information

• Heavy, expensive

• Environmentally harmful

• Linear, not hyperlinked

• Poorly indexed

• Not searchable

• Not easily transported

• MOST IMPORTANT: not everyone has every book

• IN FACT, no one has every book

How Do We Convey Information?

• Books

• Orally

• Observation

• Teaching (a combination of the above)

• The book is

– Information

– AND a physical carrier

• The information can be conveyed digitally

• We don’t CARE about the carrier

Objections to Digital Books

• People can’t read books from a screen

• Books are convenient

– You can carry them

– You can write in them

– You can put a place marker in them

– You can lend them to people

• Books are beautiful

• Books smell nice

How Many Books Are There?

• 1996 world published output: 800,000 books

• Total book titles ever published ~ 100M

• 1 book = 500 pp., 2000 char/page = 1 megabyte uncompressed (about 1 floppy disk)

– $10^8$ books = $10^{14}$ bytes = 100 terabytes

– Disk costs HK$10 per gigabyte

– 100 terabytes costs about HK$1 million

• Total books in WorldCat = 41,000,000

– Requires only 41 terabytes, HK$410,000
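
The arithmetic on this slide can be checked with a short sketch. It assumes one byte per character and decimal units (1 gigabyte = 10^9 bytes, 1 terabyte = 10^12 bytes); the function and constant names are illustrative and not part of the talk.

```python
BYTES_PER_BOOK = 500 * 2000   # 500 pages x 2000 characters, assuming 1 byte each
HKD_PER_GIGABYTE = 10         # disk price quoted on the slide
GIGABYTE = 10**9              # decimal units assumed
TERABYTE = 10**12


def storage_estimate(num_books):
    """Return (terabytes needed, disk cost in HK$) for num_books books."""
    total_bytes = num_books * BYTES_PER_BOOK
    terabytes = total_bytes / TERABYTE
    cost_hkd = total_bytes / GIGABYTE * HKD_PER_GIGABYTE
    return terabytes, cost_hkd


for label, count in [("All titles ever published", 10**8),
                     ("WorldCat", 41_000_000)]:
    tb, cost = storage_estimate(count)
    print(f"{label}: {tb:.0f} TB, about HK${cost:,.0f}")
```

Under these assumptions the sketch gives the slide's figures: 100 terabytes at about HK$1 million, and 41 terabytes at HK$410,000.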

We Can Store Everything

100 terabytes can store:

3,000,000,000 photographs (compressed)

100,000,000 books

10,000 movies

300 years of music

100 terabytes occupies 240 cubic feet on DVD

= 1 van 6 x 4 x 10 feet

We Can Send Everything

Human speech: 30 bits/sec

Gigabit Internet: 1,000,000,000 bits/sec

(This talk: < 1 millisecond including slides)

Feb. 2002 Fujitsu achieved 5 terabits per second on one optical fiber

100 terabytes = 800 terabits

It would take less than 3 minutes to transmit every book ever published
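
The transmission claim follows directly from the two figures above. A minimal sketch, assuming the full 5 terabits per second is sustained with no protocol overhead (the function name is illustrative):

```python
TOTAL_BYTES = 100 * 10**12          # 100 terabytes, every book ever published
LINK_BITS_PER_SECOND = 5 * 10**12   # 5 terabits per second on one fiber


def transmit_seconds(num_bytes, bits_per_second):
    """Time to send num_bytes over a link, ignoring overhead and latency."""
    return num_bytes * 8 / bits_per_second


seconds = transmit_seconds(TOTAL_BYTES, LINK_BITS_PER_SECOND)
print(f"{seconds:.0f} seconds, i.e. {seconds / 60:.2f} minutes")
```

With these numbers the transfer takes 160 seconds, which is under the 3 minutes stated on the slide.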

Why a Universal Library?

• The largest library in the world (U.S. Library of Congress) has less than 20% of all books

– Two hours to retrieve one book

– Must travel to Washington, DC

– No copying allowed

• Largest university library: 14 million (Harvard)

• Hong Kong University: 3 million

• Typical large U.S. university: 1 million

• Largest high school: 130,000 (Phillips Andover)

• Largest public high schools: 30,000 (U.S.)

• Average high school: 5,000 (U.S.)

Universal Library Goals

• Democratization of information

– Knowledge is power

• Education, distance learning

– “Library” for distance education

• Research, technology transfer

• Promotion of understanding

• Preservation of human culture

The Million Book Project

• A million books is a lot. CMU just reached 1 million.

• Idea: scan 1 million books in each of several countries. Make them available to everyone

• NSF provided $3 million to buy scanners for China and India

• China and India are each providing 500 full-time people for scanning

• Each country is scanning 1 million books over the next 3 years

• CMU is hosting, indexing, building infrastructure

Million Book Project Operation

Effect of the Million Book Project

• All books scanned (in many languages) will be available free to read to everyone over the Internet

• Many cultural artifacts and treasures are being scanned

• All works are fully keyword-indexed and searchable

• All participating countries will have complete copies (mirrors) of all content

• Knowledge will be available to all

Partners

• China

– Beijing University

– Chinese Academy of Science

– Fudan University

– Ministry of Education of China

– Nanjing University

– Shanghai Jiaotung University

– State Planning Commission of China

– Tsinghua University

– Zhejiang University

Partners

• India

– Arulmigu Kalasalingam College Of Engineering

– Goa University

– Indian Institute of Information Technology - Allahabad

– Indian Institute of Science

– International Institute of Information Technology - Hyderabad

– Shanmugha Arts, Science, Technology & Research Academy

– Tirumala Tirupati Devasthanams

– Maharashtra Industrial Development Corporation

– University of Pune

The Copyright Problem

• Compulsory License

– Owner CAN’T refuse; user MUST pay

– Limited in US (Music: 1.55¢/min, 8.0¢/song)

– Extensive compulsory licensing in Japan

• Flat-fee subscription (e.g. HBO)

• Free (subsidized by government)

• Public Lending Right (UK)

• “Buy” button

• Metered use (electric company)

• Micropayments

Roadblocks

• Biggest obstacle: librarians

• Belief that the project is too large

• No funding

– In the U.S., everyone assumes it is being done

– Outside the U.S., everyone assumes the U.S. is doing it

• Copyright

• Myriad of small independent digital libraries

Policy Challenges

• Convenience displaces quality (Gresham)

• What to digitize first?

• Suitable copyright law

• Economics (Who pays? Who gets?)

• Privacy

• Reliability of information

• Change in the nature of teaching, learning

LAYERED UL MODEL

[Diagram: layered model of the Universal Library. Layer labels: BASELINE UL SERVICES, VALUE-ADDED SERVICES. Component labels: Universal Library (digitized items), navigation tools, retriever service, custom catalogs, hypertext generators, searchers, translators, news agents, encyclopedia. User labels: human users, direct machine users.]

The Universal Dictionary

• A glossary containing every word in every language, with a translation

• Use: indexing the Universal Library

• Now has 1 million words (26 languages)

• 2 million by February (50 languages)

• 3 million by May (80 languages)

Q&A

Multilingual Searching

• Find all documents containing “elephant”

• Find all documents about elephants

– Even if the word “elephant” does not occur in the document

• Translation, transliteration

– Book titles, works of art, proper names

– Idioms, colloquial phrases

Use of © Content

• Philosophy: must pay for use

– Authors, publishers must not lose

• Implied license

• Bulk licensing

• Compulsory licensing

The Universal Dictionary

• Lexicon of all words in all languages, with English translations, e.g.

• Obtained from

– Web dictionaries

– Scanning + OCR

– Publishers machine-readable form

• Uses:

– Indexing the Universal Library

– Machine translation

– Spelling correction

– Linguistic studies

Technological Challenges

• Input (scanning, digitizing, OCR)

• Data representation

– text, kset, notations, images, web pages

• Navigation and Search

• Multilingual Issues

• Output (voice, pictures, virtual reality)

• Synthetic documents

Navigation

• Keyword searching does not scale

– Imagine $10^6$ hits

• Browsing, finding, searching, flying

• Fractal view

– Keys are granularity and connectivity

• View whole collections or one glyph

– Hyperbolic trees, virtual reality, discovered similarities

Hyperbolic Tree Navigation

Multilingual Issues

• Character sets

• Representations

Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé

Нода физически находится в здании Известий

• Multilingual navigation

• Translation assistance
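
The two sample lines on this slide appear to be the same bytes read under two character sets. The sketch below reproduces that effect under the assumption that the Russian text was stored as Windows-1251 and then displayed as Latin-1; the slide does not name the encodings, so both are guesses that happen to match the sample.

```python
original = "Нода физически находится в здании Известий"

# Store the Cyrillic text as Windows-1251 bytes (assumed source encoding).
raw = original.encode("cp1251")

# A viewer that wrongly assumes Latin-1 shows accented Latin letters instead.
misread = raw.decode("latin-1")
print(misread)

# No information is lost: reversing the wrong guess recovers the text.
recovered = misread.encode("latin-1").decode("cp1251")
print(recovered)
```

The point for a multilingual library is that the bytes alone do not say which representation was intended, so the character set has to be recorded or inferred.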

UNIVERSAL LIBRARY STATUS

• >10,000 digital volumes

• Public-domain issues of the New York Times

• Portal to hundreds of other collections

• Art, music, video, Internet radio

• Magazines, newspapers, journals

• Installing 1.25 terabytes

Visit www.ulib.org

Language Identification

• Given a string x, which language(s) is it from?

– What language is “peogwir” from?

• Given x, which language(s) does it seem to be from?

– “contrefaçon” “dazs” “chalupa” “mbwewe”

• Character set may be unknown

• Brief input (e.g. single word)

• Intermixed languages

– “Zeitgeist Fever”

• Neologisms, slang, abbreviations

Generative Approach

• Assume that the lexicon of a language $L$ is generated by a probabilistic finite-state machine $M_L$

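[Diagram: probabilistic finite-state machine. From the START OF WORD state "<", edges to the letters a … z are labeled with the probability that a word starts with that letter (PROB THAT WORD STARTS WITH A, …, PROB THAT WORD STARTS WITH Z). Later edges carry conditional probabilities such as PROB (a|<a), PROB (>|<a), PROB (a|<z), PROB (z|<z), PROB (z|<za). The product of the probabilities along a path ending at ">" gives the probability of the word, e.g. PRODUCT = PROB (<aza>), PRODUCT = PROB (<zaz>).]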

Problems

• Where do all the required probabilities come from?

• How can they all be stored?

• If string x does not actually occur in a language, its probability will be zero. Won’t work for neologisms or misspellings.

• “Moving trigrams” work

Generative Approach

• Let $p_L(y \mid x)$ be the probability that string $x$ is followed by string $y$ in language $L$ (i.e. the probability that, given a prefix $x$, the suffix is $y$)

• Then $p_L(x)$, the probability that $x = {<}x_1 x_2 x_3 \ldots x_n{>}$ was generated by $L$, is

$$p_L(x_1 \mid {<})\; p_L(x_2 \mid {<}x_1)\; p_L(x_3 \mid {<}x_1 x_2)\; p_L(x_4 \mid {<}x_1 x_2 x_3) \cdots p_L(x_n \mid {<}x_1 x_2 x_3 \ldots x_{n-1})\; p_L({>} \mid {<}x_1 x_2 x_3 \ldots x_{n-1} x_n)$$

• This computation requires huge memory, so approximate: assume $p_L(x_n \mid {<}x_1 x_2 x_3 \ldots x_{n-1}) \approx p_L(x_n \mid x_{n-2}\, x_{n-1})$

• So $p_L(x) \approx p_L(x_3 \mid {<}x_1 x_2)\; p_L(x_4 \mid x_2 x_3) \cdots p_L({>} \mid x_{n-1} x_n)$

• Try it
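
A minimal sketch of the moving-trigram idea, for readers who want to try it. Several things here are assumptions rather than the talk's method: the two tiny lexicons are invented, add-one smoothing (with an assumed inventory of 64 symbols) is used so that unseen strings never get probability zero, and the product starts with the factor for the second letter given the start marker and first letter.

```python
import math
from collections import defaultdict

ALPHABET_SIZE = 64  # assumed symbol inventory size, used only for smoothing


def trigrams(word):
    """Yield (two-symbol context, next symbol) pairs over '<' + word + '>'."""
    padded = "<" + word.lower() + ">"
    for i in range(2, len(padded)):
        yield padded[i - 2:i], padded[i]


def train(words):
    """Count how often each symbol follows each two-symbol context."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for context, symbol in trigrams(word):
            counts[context][symbol] += 1
    return counts


def log_prob(counts, word):
    """Approximate log p_L(word) as a sum of smoothed trigram log-probabilities."""
    total = 0.0
    for context, symbol in trigrams(word):
        seen = counts.get(context, {})
        numerator = seen.get(symbol, 0) + 1               # add-one smoothing
        denominator = sum(seen.values()) + ALPHABET_SIZE
        total += math.log(numerator / denominator)
    return total


def identify(word, models):
    """Return the name of the language whose model scores the word highest."""
    return max(models, key=lambda name: log_prob(models[name], word))


# Toy lexicons, invented for illustration only.
models = {
    "english": train(["the", "there", "then", "with", "which", "water",
                      "weather", "other", "mother", "thing"]),
    "german": train(["der", "die", "das", "und", "nicht", "schule",
                     "schon", "zeit", "geist", "zeitung", "ich", "auch"]),
}

print(identify("zeitgeist", models))
print(identify("feather", models))
```

Neither test word occurs in its lexicon, yet each still receives a nonzero score, which is the property the full generative model lacks for neologisms and misspellings. Real use would need lexicons of realistic size per language.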

Searching Mathematics

$$\int_0^{\infty} e^{-x^2} \sin x^2 \, dx$$

Has this integral ever been evaluated?

Searching Mathematics

$$\int_0^{\infty} e^{-x^2} \sin x^2 \, dx = \frac{\sqrt{\pi}\,\sqrt{2-\sqrt{2}}}{2^{9/4}}$$

MATHEMATICA C.F.:

Integrate[

Times[Power[E,Times[

-1,Power[V1,2]]],

Sin[Power[V1,2]]],

{V1,0,Infinity}]
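
One way to read the form above is as a canonical form in which the integration variable has been renamed to V1, so that the same integral written with a different variable name yields the same search key. The sketch below illustrates only that renaming step. It assumes "C.F." means canonical form, represents expressions as nested Python tuples, and is not Mathematica's own procedure.

```python
def canonicalize(expr, variables, mapping=None):
    """Rename variables to V1, V2, ... in order of first appearance."""
    if mapping is None:
        mapping = {}
    if isinstance(expr, str):
        if expr in variables:
            if expr not in mapping:
                mapping[expr] = "V%d" % (len(mapping) + 1)
            return mapping[expr]
        return expr                      # function names and constants
    if isinstance(expr, tuple):
        return tuple(canonicalize(part, variables, mapping) for part in expr)
    return expr                          # numbers


def gaussian_sine_integral(var):
    """The slide's integral as a nested tuple, with a chosen variable name."""
    return ("Integrate",
            ("Times",
             ("Power", "E", ("Times", -1, ("Power", var, 2))),
             ("Sin", ("Power", var, 2))),
            (var, 0, "Infinity"))


key_x = canonicalize(gaussian_sine_integral("x"), {"x"})
key_t = canonicalize(gaussian_sine_integral("t"), {"t"})
print(key_x == key_t)
print(key_x)
```

Two authors who wrote the integral in x and in t would then be indexed under the same key. A practical canonical form would also have to normalize operand order and equivalent algebraic rewritings, which this sketch does not attempt.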

Hierarchical Nature of Aboutness

• What does it mean to say that a book is “about” chemistry? Can a word be about chemistry?

• If one paragraph is about chemistry, is the book about chemistry?

• If the book is about chemistry, is every sentence in it about chemistry?

• Aboutness is central to cataloging and retrieval

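Aboutness Hierarchy

[Diagram: hierarchy of units. Labels: Universe, Collection, Book, Newspaper, Chapter, Article, Section, Paragraph, Sentence, Word, Glyph, Photograph, Object, 3D Artifact. Two annotations mark levels of the hierarchy: KEYWORD SEARCHING OCCURS HERE and SUBJECT SEARCHING OCCURS HERE.]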

Thesauri and Aboutness

• A set of numbered thesaurus entries defines a topic

• Thesaurus is topic-hierarchical

• 1011 Hindrance

– 1011.5 barrier, bar, gate, fence, wall, rampart, dam, moat …

• A word is “about” any topic to which it belongs

Dam:

– 241.1 lake

– 293.7 close (v.)

– 560.11 mother

– 757.2 horse

– 856.11 put a stop to (v.)

– 1011.5 barrier

Thesaurus + aboutness hierarchy can be used to disambiguate meanings without “understanding”

Note: topic numbers are language independent
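
A rough sketch of how thesaurus topics might disambiguate “dam” without understanding: pick the topic that shares the most words with the surrounding text. Only the word list for 1011.5 comes from the slide. The other word lists are invented for illustration, and the overlap count is one simple heuristic, not necessarily the method used in the project.

```python
# Topic 1011.5 uses the slide's word list; the other two lists are invented
# and keep only the slide's topic numbers and labels.
THESAURUS = {
    "241.1 lake": {"lake", "pond", "reservoir", "dam"},
    "757.2 horse": {"horse", "mare", "foal", "stallion", "dam"},
    "1011.5 barrier": {"barrier", "bar", "gate", "fence", "wall",
                       "rampart", "dam", "moat"},
}


def topics_of(word):
    """A word is 'about' every topic whose word list contains it."""
    return {topic for topic, words in THESAURUS.items() if word in words}


def disambiguate(word, context):
    """Pick the topic of `word` that shares the most words with the context."""
    def overlap(topic):
        return sum(1 for w in context if w != word and w in THESAURUS[topic])

    candidates = sorted(topics_of(word))
    return max(candidates, key=overlap) if candidates else None


print(disambiguate("dam", ["the", "mare", "and", "her", "foal"]))
print(disambiguate("dam", ["a", "wall", "and", "a", "moat"]))
```

Because topics are identified by number rather than by English label, the same lookup could in principle be driven by word lists in any language.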

Set Theory of Aboutness

• Given a finite universe $W$ of objects (e.g. all words)

• Define a topic $T \subseteq W$ to be a subset of $W$ (a wordlist)

• Topic inclusion (defines the hierarchy):

– Topic $T$ includes topic $S$ iff $S \subseteq T$

• Definition of aboutness:

– A subset $P \subseteq W$ of the universe (e.g., a book) is about topic $T$ iff $P \cap T \neq \emptyset$ (intersection is nonempty)

• Hierarchical nature of aboutness:

– If $P$ is about $S$ and $T$ includes $S$, then $P$ is also about $T$
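
These definitions map directly onto set operations. In the sketch below, the barrier word list is taken from the thesaurus slide, while the extra hindrance words and the sample “book” are invented for illustration.

```python
def includes(t, s):
    """Topic t includes topic s iff s is a subset of t."""
    return s <= t


def is_about(p, t):
    """Object p is about topic t iff their intersection is nonempty."""
    return not p.isdisjoint(t)


# S: topic 1011.5 from the thesaurus slide.
barrier = {"barrier", "bar", "gate", "fence", "wall", "rampart", "dam", "moat"}
# T: a broader topic containing S (the two extra words are hypothetical).
hindrance = barrier | {"obstacle", "delay"}
# P: a tiny stand-in for a book, as a set of its words (hypothetical).
book = {"the", "river", "dam", "was", "built"}

print(includes(hindrance, barrier))   # T includes S
print(is_about(book, barrier))        # P is about S
print(is_about(book, hindrance))      # hence P is also about T
```

The hierarchical property holds because any word shared by P and S also lies in T whenever S is a subset of T.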

We Can Search a Few Things

• Text

• In the Roman alphabet

• “Hidden” databases effectively unsearchable

• No images or two-dimensional structures

– math

– music

– dance notation . . .

• No subject index of photographs or art

– Corbis is one of the “best”