Terminology mining at OCLC Carol Jean Godby Research Scientist OCLC Online Computer Library Center, Inc. Ohio Academic Library Association May 19, 2006

Page 1: Terminology mining at OCLC

Terminology mining at OCLC

Carol Jean Godby
Research Scientist
OCLC Online Computer Library Center, Inc.

Ohio Academic Library Association
May 19, 2006

Page 2: Terminology mining at OCLC

Outline of this talk

• The need for terminology
• Sources of terminology
• Why the terminology problem is hard
• Three outcomes from terminology mining projects
• Conclusions and recommendations

Page 3: Terminology mining at OCLC

What terminology looks like

have, havei, havel, haven, havens, havera, haverty, havey, havice, havill, havilland

health care, health care coverage, health insurance, housing, housing policy
…….
world trade, world trade accord, world trade agreement, world trade center, world trade center bombing

Page 4: Terminology mining at OCLC

Two sources of terminology

From a human-managed resource
• Controlled vocabulary
• Subject classification scheme
• Dictionary, gazetteer, encyclopedia, subject terminology list

Algorithmically extracted from text

Page 5: Terminology mining at OCLC

Human-managed terminology

Strengths
• It represents important and persistent concepts.
• It is derived from linguistic or subject-matter expertise.
• The form has been standardized.
• It may provide a link to an ontology.
• It promises interoperability between online resources and traditionally published materials.

Weaknesses
• Literary warrant is based on traditionally published materials.
• Human effort is required to keep it current.
• The sources are not usually freely available and must be modified for use in automated systems.

Page 6: Terminology mining at OCLC

Automatically extracted terminology

Strengths
• It is timely.
• The style is closer to an ordinary user’s vocabulary.
• Coverage is not restricted to traditionally published material.

Weaknesses
• The data is noisy and difficult to organize.
• It is ephemeral.
• The problem is ill-defined.

Page 7: Terminology mining at OCLC

Mining terminology in four steps

1. Select a text
2. Assign part-of-speech tags
3. Apply the part-of-speech filter
4. Apply the post-extraction filter

Page 8: Terminology mining at OCLC

Andrew Newell Wyeth (born July 12, 1917) is an American realist painter, one of the best-known of the 20th century. He is sometimes referred to as the "Painter of the People" due to his popularity with the American public. Wyeth's favorite subject is the land and inhabitants around his hometown of Chadds Ford, Pennsylvania, and those near his summer home in Cushing, Maine.

Page 9: Terminology mining at OCLC

Step 2: Assign Part-of-Speech Tags

Andrew/noun Newell/noun Wyeth/noun (born/verb July/noun 12/noun, 1917/noun) is/verb an/article American/adjective realist/adjective painter/noun, one/pronoun of/preposition the/article best-known/adjective of/preposition the/article 20th/adjective century/noun. …..
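As a rough illustration of Step 2, the sketch below reproduces the tagging shown above with a hand-built lookup table. The lexicon, the `tokenize` and `tag` helpers, and the fallback rule (any unlisted token is treated as a noun) are assumptions made for this example only. A real project would use a trained tagger rather than a word list.

```python
import re

# Toy lexicon (an assumption for illustration): lowercase word -> coarse tag.
LEXICON = {
    "born": "verb", "is": "verb",
    "an": "article", "the": "article",
    "american": "adjective", "realist": "adjective",
    "best-known": "adjective", "20th": "adjective",
    "one": "pronoun", "of": "preposition",
}

def tokenize(text):
    """Split text into word tokens, keeping internal hyphens and dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def tag(text):
    """Return (token, tag) pairs; tokens missing from the lexicon default to 'noun'."""
    return [(tok, LEXICON.get(tok.lower(), "noun")) for tok in tokenize(text)]

sentence = ("Andrew Newell Wyeth (born July 12, 1917) is an American realist "
            "painter, one of the best-known of the 20th century.")
print(" ".join(f"{tok}/{t}" for tok, t in tag(sentence)))
```

The noun fallback is what makes names and numbers such as "Wyeth" and "1917" come out as nouns here; it would mis-tag most unseen verbs and adjectives in other text.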

Page 10: Terminology mining at OCLC

Step 3: Apply part-of-speech filters

Andrew Newell Wyeth (noun-noun-noun)
July 12 1917 (noun-noun-noun)
(an) American realist painter (adjective-noun-noun)
(one of the) best-known of the 20th century (adjective-adjective-preposition-article-adjective-noun)
Painter of the People (noun-preposition-article-noun)
Popularity with the American public (noun-preposition-article-adjective-noun)
Wyeth’s favorite subject (adjective-adjective-noun)
Land (noun)
Inhabitants around his hometown (noun-preposition-pronoun-noun)
Chadds Ford Pennsylvania (noun-noun-noun)
(his) Summer home in Cushing Maine (noun-noun-preposition-noun-noun)
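One common way to implement a part-of-speech filter like the one in Step 3 is to encode each tag as a single letter and match a regular expression over the resulting string. The sketch below does this for the Step 2 tagging. The letter codes and the particular noun-phrase pattern are assumptions chosen for illustration, not the filter used in the talk, so the output is a subset of the phrases listed above.

```python
import re

# One-letter codes for the coarse tags from Step 2 (the codes are an assumption).
CODES = {"noun": "N", "adjective": "A", "preposition": "P",
         "article": "D", "pronoun": "R", "verb": "V"}

# Hypothetical noun-phrase pattern: optional adjective/noun modifiers, a head noun,
# then any number of "preposition, optional article or pronoun, modifiers, noun".
NP_PATTERN = re.compile(r"[AN]*N(?:P[DR]?[AN]*N)*")

def noun_phrases(tagged):
    """Return the token spans whose tag sequence matches NP_PATTERN."""
    codes = "".join(CODES.get(t, "X") for _, t in tagged)  # one letter per token
    return [" ".join(tok for tok, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]

TAGGED = ("Andrew/noun Newell/noun Wyeth/noun born/verb July/noun 12/noun "
          "1917/noun is/verb an/article American/adjective realist/adjective "
          "painter/noun one/pronoun of/preposition the/article "
          "best-known/adjective of/preposition the/article 20th/adjective "
          "century/noun")
tagged = [tuple(pair.rsplit("/", 1)) for pair in TAGGED.split()]
print(noun_phrases(tagged))
```

Because the pattern requires a noun head before any preposition, it yields "20th century" rather than "best-known of the 20th century". Changing the pattern changes which phrases count as candidates.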

Page 11: Terminology mining at OCLC

Issues in the part-of-speech filter

The implementations are mature, accessible in the open source community, and reasonably accurate.

But:
• The filter must be designed, usually to select noun phrases.
• The noun phrases must be normalized (e.g., trim leading articles and pronouns; eliminate punctuation).
• The noun phrases may be long or short. Which do we choose?

Land and inhabitants around his hometown
or: Land, inhabitants, hometown

Best-known of the 20th century
or: 20th century
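The normalization and the long-versus-short choice can be sketched as two small functions. The word lists below and the rule of breaking a phrase at prepositions and conjunctions are assumptions for illustration; the talk does not prescribe a particular procedure.

```python
import string

# Assumed word lists, for illustration only.
LEADING_STOPWORDS = {"a", "an", "the", "his", "her", "its", "their", "one"}
CONNECTIVES = {"of", "in", "with", "around", "and"}

def normalize(phrase):
    """Strip punctuation from each word, then trim leading articles and pronouns."""
    words = [w.strip(string.punctuation) for w in phrase.split()]
    words = [w for w in words if w]
    while words and words[0].lower() in LEADING_STOPWORDS:
        words.pop(0)
    return " ".join(words)

def short_phrases(phrase):
    """Break a long phrase at connectives and normalize each piece."""
    pieces, current = [], []
    for word in phrase.split():
        if word.strip(string.punctuation).lower() in CONNECTIVES:
            pieces.append(" ".join(current))
            current = []
        else:
            current.append(word)
    pieces.append(" ".join(current))
    return [p for p in (normalize(piece) for piece in pieces) if p]

print(normalize("(an) American realist painter,"))
print(short_phrases("Land and inhabitants around his hometown"))
```

Keeping the long phrase preserves context; splitting it produces shorter candidates that are more likely to recur elsewhere in the text.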

Page 12: Terminology mining at OCLC

So…

By extracting noun phrases from text, the designer is already implementing a simple theory of terminology.

But:

• The result is an overwhelming number of phrases that occur only once.

• The short-phrase vs. long-phrase problem shows that terminology has no obvious formal boundaries.

• In other words, we can’t identify terminology by part of speech alone.
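The single-occurrence problem is easy to measure once candidate phrases have been extracted. The sketch below counts how many distinct phrases appear exactly once; the candidate list is hypothetical and stands in for real filter output.

```python
from collections import Counter

def singleton_share(phrases):
    """Return (fraction of distinct phrases seen once, sorted list of those phrases)."""
    counts = Counter(p.lower() for p in phrases)
    if not counts:
        return 0.0, []
    singletons = sorted(p for p, n in counts.items() if n == 1)
    return len(singletons) / len(counts), singletons

# Hypothetical output of the part-of-speech filter over a short text.
candidates = ["Wyeth", "American public", "Wyeth", "Chadds Ford",
              "summer home", "favorite subject"]
share, once = singleton_share(candidates)
print(f"{share:.0%} of distinct phrases occur only once: {once}")
```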

Page 13: Terminology mining at OCLC

Step 4: A post-extraction filter

The goal is to select terminology of interest, dramatically reducing the output from the part-of-speech filter.

Some criteria for a “good” term:
• It is accurately identified and represented.
• It is easily obtained.
• It represents a persistent concept.
• It reveals major or minor themes in the source document.

Possible outcomes:
• Named entities
• Lexicalized noun phrases
• Statistically improbable phrases

Page 14: Terminology mining at OCLC

Named entities

Goal
• Identify and categorize the proper names in a text.

Results
• Andrew Newell Wyeth (person), Chadds Ford (place), Pennsylvania (place), Cushing (place), Maine (place), July 12, 1917 (date).
• “Painter of the People” would be recognized, but special handling would be required to categorize it.
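A minimal sketch of the categorization step, assuming the names have already been recognized: it combines a tiny gazetteer with two surface patterns. The gazetteer contents, the patterns, and the `categorize` function are all assumptions for illustration; production named-entity systems rely on much richer models and resources.

```python
import re

# Toy gazetteer (an assumption): place names taken from the sample text.
PLACES = {"Chadds Ford", "Pennsylvania", "Cushing", "Maine"}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(rf"(?:{MONTHS}) \d{{1,2}},? \d{{4}}")
# Two or three capitalized words; a crude stand-in for a personal-name model.
PERSON = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}")

def categorize(name):
    """Assign a coarse category; the gazetteer is checked before the person pattern."""
    if DATE.fullmatch(name):
        return "date"
    if name in PLACES:
        return "place"
    if PERSON.fullmatch(name):
        return "person"
    return "unknown"

for name in ["Andrew Newell Wyeth", "Chadds Ford", "July 12, 1917",
             "Painter of the People"]:
    print(name, "->", categorize(name))
```

The order of the checks matters: "Chadds Ford" also fits the person pattern, so the gazetteer lookup has to come first. "Painter of the People" falls through to "unknown", which is the special-handling case noted above.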

Page 15: Terminology mining at OCLC

Named entities: scorecard

• Accurate?
  • 95% accuracy is reported on systems that recognize personal, corporate, and geographic names.
• Easily obtained? -- high
  • The named entity problem is conceptually simple and well-defined.
  • Most texts contain named entities.
  • Textual clues for names are machine-processable.
  • Software is mature and in the public domain.
• Represent persistent concepts? -- high
  • Something is assigned a name because it is persistent.
• Reveal major or minor document themes? -- medium to low
  • Documents that contain named entities may be about something else.

Page 16: Terminology mining at OCLC

Lexicalized noun phrases

The goal is to identify common noun phrases that ‘name’ a persistent concept.

Language can be used either to describe or to name.
• Descriptions
  • Are constructed from the rules of syntax for immediate use.
  • The forms are variable.
  • The meaning of a phrase is the sum of its words.
• Names
  • Are stored in a mental dictionary and then retrieved as needed.
  • The forms are frozen.
  • The meaning of a phrase may not be easily inferred.

So:

A lexicalized noun phrase has acquired word-like characteristics.
• It can be precisely defined.
• It can acquire other lexical meaning (connotations, “branding”, etc.).
• It is a candidate for inclusion in a dictionary, thesaurus, or term list.

Page 17: Terminology mining at OCLC

Textual cues for lexicalized noun phrases

Weak positive contexts:
• lists
  bread, milk, laundry detergent, kool-aid, tp, pasta sauce, olive oil
• compound noun modifiers
  American realist painter, stock market quote

Strong positive contexts: study of, information about, professor of, department of, journal of, so-called, bibliography on
  metadata applications, data processing, automatic classification, internet resources, digital watermarking, font readability, digital image processing

Strong negative contexts: very, -ly adverbs (extremely)
  different things, few messages, good point, interesting example, appealing idea, small extension, terse document, simple kind
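The strong cues above can be turned into a simple score: count the strong positive contexts in which a candidate phrase appears and subtract the strong negative ones. The scoring rule and the `cue_score` function are assumptions for illustration; only the cue lists come from the slide, and the `-ly` test is crude (it would also fire on a word such as "family").

```python
import re

# Strong positive cues from the slide; each is expected directly before the phrase.
POSITIVE_CUES = ["study of", "information about", "professor of", "department of",
                 "journal of", "so-called", "bibliography on"]

def cue_score(phrase, text):
    """Count strong positive contexts minus strong negative contexts for a phrase."""
    target = re.escape(phrase)
    positive = sum(
        len(re.findall(rf"\b{re.escape(cue)}\s+{target}\b", text, re.IGNORECASE))
        for cue in POSITIVE_CUES
    )
    # Negative cue: 'very' or a word ending in -ly directly before the phrase.
    negative = len(re.findall(rf"\b(?:very|\w+ly)\s+{target}\b", text, re.IGNORECASE))
    return positive - negative

text = ("She is a professor of data processing and edits the Journal of Data "
        "Processing. That was a very good point and an extremely good point.")
print(cue_score("data processing", text))
print(cue_score("good point", text))
```

A positive score suggests the phrase behaves like a name; a negative score suggests an ordinary description whose adjective can still be intensified.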

Page 18: Terminology mining at OCLC

A lexicalized noun phrase: Recurrent erosion

Page 19: Terminology mining at OCLC

Not a lexicalized noun phrase: Recurrent problem

Page 20: Terminology mining at OCLC

Lexicalized noun phrases: scorecard

• Accurate? -- medium
  • 70-80% agreement with human judges
  • But there is a natural upper limit.
    • A text is an imperfect reflection of linguistic knowledge.
    • Terminology is in a constant state of flux.
• Easily obtained? -- medium to low
  • The easiest cues are the least accurate; others are dependent on certain subject domains and styles of discourse.
  • Software is not in the public domain.
• Represent persistent concepts? -- medium to high
  • High agreement with dictionaries and subject schemes
• Reveal major or minor document themes? -- low
  • Documents that contain lexicalized noun phrases may be about something else.

Page 21: Terminology mining at OCLC

“Statistically improbable phrases”

A list of noun phrases automatically extracted from the full text of a book on Amazon.com.

The phrases are common in the book but uncommon in Amazon’s book corpus.

The phrases represent a “lexical signature” of the book.
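The idea of "common in the book but uncommon in the corpus" can be approximated by a relative-frequency ratio. Amazon's actual method is not described in this talk, so the sketch below is a generic stand-in: the `improbable_phrases` function, the add-one smoothing, and the sample counts are all assumptions.

```python
from collections import Counter

def improbable_phrases(book_phrases, corpus_phrases, top_n=3):
    """Rank a book's phrases by (frequency in book) / (smoothed frequency in corpus)."""
    book = Counter(book_phrases)
    corpus = Counter(corpus_phrases)
    book_total = sum(book.values())
    corpus_total = sum(corpus.values())

    def ratio(phrase):
        p_book = book[phrase] / book_total
        # Add-one smoothing so phrases absent from the corpus do not divide by zero.
        p_corpus = (corpus[phrase] + 1) / (corpus_total + len(corpus) + 1)
        return p_book / p_corpus

    return sorted(book, key=ratio, reverse=True)[:top_n]

# Hypothetical phrase occurrences for one book and for a background corpus.
book = ["informational economy"] * 5 + ["networking form"] * 4 + ["new york"] * 3
corpus = ["new york"] * 50 + ["networking form"] * 2 + ["world trade"] * 10
print(improbable_phrases(book, corpus, top_n=2))
```

A phrase such as "new york" is frequent in the book but even more frequent in the corpus, so it ranks last and drops out of the signature.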

Page 22: Terminology mining at OCLC
Page 23: Terminology mining at OCLC
Page 24: Terminology mining at OCLC

Statistically improbable phrases: scorecard

• Accurate?
  • Hard data unavailable, but output appears to be accurately parsed.
• Easily obtained? -- high
  • The post-extraction filter is based on common information-retrieval metrics available in the public domain.
• Represent persistent concepts? -- medium to low
  • Likely candidates: networking form, informational societies
  • Unlikely candidates: new spatial logic, instant wars, new technological paradigm
• Reveal major or minor document themes? -- medium
  • Lorcan Dempsey on the SIPs for The Rise of the Network Society:

This is interesting. For example, clicking on informational economy gives a list of books that may be of interest that would have been difficult to find otherwise. Of course, informational is a distinctive usage of Castells's. What did not show up was space of flows and space of places, phrases that are central to some of Castells's arguments in this book. The Books on related topics is a good list (this is a list based on number of shared SIPs), but it does not show the other two books in the trilogy of which this is the first part.

Page 25: Terminology mining at OCLC

Some observations and recommendations

The tests for good terminology show that the terminology mining problem is complex and needs to be decomposed.

The hierarchical structure of the problem suggests a development path.
1. The common early stages of processing (text preparation, part-of-speech tagging, noun phrase parsing)
  • Algorithms are well-understood and available in the public domain.
2. The post-extraction filters
  • Named entities
    • The problem is conceptually simple.
    • Software is mature and available in the public domain.
  • Lexicalized noun phrases
    • A comprehensive solution is expensive and error-prone, but there is low-hanging fruit.
    • Focus on domain-specific terminology.
    • In subjects with stable terminology, concentrate on automatic assignment of controlled vocabulary.
  • Document “aboutness”
    • Still an immature subject
    • Model with information retrieval metrics.

Page 26: Terminology mining at OCLC

For more information

“ANNIE - Open Source Information Extraction from The University of Sheffield.” http://www.aktors.org/technologies/annie/

Dempsey, Lorcan, 2005. “Amazon: making data work.” http://orweblog.oclc.org/archives/000658.html

“Natural Language Toolkit.” http://nltk.sourceforge.net/

Godby, Carol Jean, 2002. A computational study of lexicalized noun phrases in English. PhD dissertation. Ohio State University Department of Linguistics. http://www.ohiolink.edu/etd/view.cgi?osu1017343683