MW2011: Klavans, J. +, Computational Linguistics in Museums: Applications for Cultural Datasets


Upload: museums-and-the-web

Post on 10-May-2015


Category:

Technology



DESCRIPTION

As museums continue to develop more sophisticated techniques for managing and analyzing cultural data, many are beginning to encounter challenges when dealing with the nuances of language and automated processing tools. How might user-generated comments be harvested and processed to determine the nature of the comment? Is it possible to use existing collection documentation to derive relations between similar objects? How can we train systems to automatically recognize (disambiguate) different meanings of the same word? Can automated language processing lead to more compelling browsing interfaces for online collections? Luckily, a good deal of expertise and many tools exist within the field of computational linguistics that can be applied to these problems to achieve meaningful results. Informed by previous work in computational linguistics and relevant project experience, the authors will address a number of these questions, providing insight into how answers that impact museum practice might be found. The authors will share tools and resources that museum software developers can use to prototype and experiment with these techniques without being experts in language processing themselves. In addition, the authors will describe the work of the T3: Text, Tags, Trust research project and how they have applied these tools to a large shared dataset of object metadata and social tags collected by the Steve.museum project. Specific challenges regarding batch-processing tools and large datasets will be addressed. Best practices and algorithms will be shared for dealing with a number of sticky issues. Directions for future research and promising application areas will also be discussed. A presentation from Museums and the Web 2011 (MW2011).

TRANSCRIPT

Page 1: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

• Your spoken paper cannot be the same as your written paper

• Read more: Museums and the Web 2011 (MW2011): Presentation Guidelines | conference.archimuse.com

Page 2: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Judith Klavans

Robert Stein

Susan Chun

Raul Guerra

Computational Linguistics in Museums: Applications for Cultural Datasets

Page 3: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

COMPUTATIONAL LINGUISTICS

• Language - Words, Words, Words

• Use

• Meaning

• Syntax

• Shape of words

• Sounds

Page 4: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

APPLICATIONS

• Speech synthesis – 1980s Talking Machines for the Blind

• Intelligent search – pre-Google

• Finding names – who, what, where

• Translation

• Speech recognition

• Answering Questions – What is Watson?

Page 5: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

DOMAINS FOR COMPUTATIONAL LINGUISTICS

• Healthcare – interpreting patient records

• Government – helping people find information

• International Affairs – cross-language translation

• Law – analyzing Enron scandal email

• Marketing – Opinions on products

• Museums – analyzing text and tags associated with objects for better access

Page 6: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Computational Linguistics for Metadata Building+

Page 7: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Judith Klavans

Robert Stein

Susan Chun

Raul Guerra

Computational Linguistics in Museums: Applications for Cultural Datasets

Page 8: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

INTERDISCIPLINARY RESEARCH

Computational Linguistics

in Museums

Page 9: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Text, Tags, Trust

• Funded in 2008 by IMLS

• With the University of Maryland and a collaborative of museum partners

• Studying the relationships between social tags, scholarly text and resources, and the application of trust networks to improve access to museum collections.

Page 10: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

MW 2011 Contributions

• Which Computational Linguistic tools can or should be applied to tags?

• How do these tools impact tag analysis?

• What results differ from the initial steve.museum results from Trant 2007?

• So what – for CL?

• So what – for Museums?

Page 11: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Hard Challenges

• What do these words really mean?

• How can tags be related to other tags?

– across languages

– across users

• How are tags over museum objects related to tags over anything else?

• How can they be used?

Page 12: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

FINDING A NEEDLE IN THE HAYSTACK

Page 13: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

This canvas was the first one Gauguin painted during the two months he spent in Provence.... Gauguin had rebelled against Impressionism's reliance on the visible world, and he altered nature's shapes and colors to suggest his own more subjective reaction to the landscape.

While the rural subject and acidic colors show the influence of van Gogh, this image is more indebted to Paul Cézanne. In his careful integration of the haystack and farm buildings, Gauguin has echoed Cézanne's emphasis on geometric form.

Gallery Label

Page 14: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

TOOLS FOR TAGS

• Morphological Analysis – Conflate when possible

– Cats, cat

– Haystacks, haystack

– Painting, paint ?

• What words are verbs, nouns, adjectives?

• How should multi-word tags be handled?
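The conflation idea on this slide can be illustrated with a minimal Python sketch. It is not the tooling used in the project: the function names `conflate` and `group_tags` are hypothetical, and the rules cover only a few regular English plurals. "Painting" is deliberately left alone, since the slide itself marks conflating it with "paint" as an open question.

```python
from collections import defaultdict


def conflate(tag: str) -> str:
    """Reduce a tag to a rough base form using a few plural rules only.

    Illustrative heuristic, not a full morphological analyzer: irregular
    forms (e.g. "series", "mice") are not handled correctly.
    """
    word = tag.strip().lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # ponies -> pony
    if len(word) > 3 and word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                # churches -> church, boxes -> box
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]                # cats -> cat, haystacks -> haystack
    return word                         # painting stays painting


def group_tags(tags):
    """Group raw tags under their conflated form."""
    groups = defaultdict(set)
    for tag in tags:
        groups[conflate(tag)].add(tag)
    return dict(groups)


if __name__ == "__main__":
    raw = ["Cats", "cat", "Haystacks", "haystack", "Painting", "paint"]
    for base, variants in sorted(group_tags(raw).items()):
        print(base, sorted(variants))
```

Keeping the rules conservative means some variants stay separate, which is usually safer for tags than merging words that only look related.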

Page 15: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Raw Tags or Tokens

Page 16: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Results

25%

68%

93%

Page 17: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

1. NN=25205
2. JJ=6319
3. NNS=4041
4. NN_NN=2257
5. JJ_NN=1792
6. VBG=1043
7. VBN=727
8. NP=708
9. OD_NN=454
10. JJ_NNS=413

Page 18: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Top 10 POS Patterns:

1. NN=6706
2. NN_NN=1713
3. JJ_NN=1194
4. JJ=921
5. NNS=757
6. JJ_NNS=303
7. NN_NNS=300
8. VBG=238
9. NP=209
10. VBN_NN=202
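A ranked list of this shape can be produced by joining the part-of-speech labels of each tag's tokens and counting the resulting patterns. The Python sketch below assumes the tags have already been run through a tagger; the sample tags and their labels are hand-assigned for illustration, and `pos_pattern` and `top_patterns` are hypothetical names, not functions from the project.

```python
from collections import Counter


def pos_pattern(tagged_tag):
    """Join the POS labels of one tag's tokens, e.g. JJ_NN for 'blue sky'."""
    return "_".join(pos for _token, pos in tagged_tag)


def top_patterns(tagged_tags, n=10):
    """Return the n most frequent POS patterns with their counts."""
    counts = Counter(pos_pattern(tag) for tag in tagged_tags)
    return counts.most_common(n)


if __name__ == "__main__":
    # Hand-labelled sample; a real run would take tagger output instead.
    sample = [
        [("haystack", "NN")],
        [("landscape", "NN")],
        [("farm", "NN"), ("building", "NN")],
        [("blue", "JJ"), ("sky", "NN")],
        [("trees", "NNS")],
    ]
    for rank, (pattern, count) in enumerate(top_patterns(sample), start=1):
        print(f"{rank}. {pattern}={count}")
```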

Page 19: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Hard Challenges

• What do these words really mean?

• How can tags be related to other tags?

– across languages

– across users

• How are tags over museum objects related to tags over anything else?

• How can they be used?

Page 20: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Why Part of Speech?

• Integral to most language processing pipelines

• Precursor to parsing

• However, for social tags, parsing is not a meaningful step.

Research:

• Understand the nature of this kind of descriptive tagging.

• Link part of speech information with other lexical resources for disambiguation

Page 21: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

Gold

Orange

Necklace

Ripe

You shall know a word by the company it keeps. J.R. Firth
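One way to make Firth's remark concrete for tags is to record which other tags appear on the same object: "orange" next to "ripe" suggests the fruit, while "orange" next to "gold" and "necklace" suggests a color or material. The following Python sketch only counts co-occurrences. The object tag lists are invented for illustration, `cooccurrence` is a hypothetical name, and nothing here should be read as the method the authors used.

```python
from collections import Counter, defaultdict


def cooccurrence(tag_sets):
    """For every tag, count the other tags applied to the same object."""
    company = defaultdict(Counter)
    for tags in tag_sets:
        unique = {tag.lower() for tag in tags}   # ignore repeats on one object
        for tag in unique:
            for other in unique:
                if other != tag:
                    company[tag][other] += 1
    return company


if __name__ == "__main__":
    # Each inner list stands for the tags on one (invented) museum object.
    objects = [
        ["gold", "necklace"],
        ["orange", "ripe"],
        ["gold", "orange", "necklace"],
    ]
    company = cooccurrence(objects)
    print("orange keeps company with:", dict(company["orange"]))
```

The counts alone do not decide a sense; they are the raw evidence that a later disambiguation step could use.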

Page 22: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

WHAT ABOUT “NEW ENGLAND”

• Idioms / lexicalized phrases are more difficult

• Heuristic comparison to Wikipedia Titles matched 46% (30% distinct) of multiword tags

• E.g. “Grapes of Wrath”, “Irish Wolfhound”, “Franco-Prussian War”

*Klavans and Golbeck, 2010
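A title-matching heuristic of this general kind can be sketched as a normalized set lookup. The Python example below is an assumption-laden illustration, not the comparison reported on the slide: the title list is a tiny hand-written stand-in for a real title dump, and `normalize` and `match_multiword` are hypothetical names.

```python
def normalize(text: str) -> str:
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def match_multiword(tags, titles):
    """Return (matched multiword tags, all multiword tags)."""
    title_index = {normalize(title) for title in titles}
    multiword = [tag for tag in tags if len(normalize(tag).split()) > 1]
    matched = [tag for tag in multiword if normalize(tag) in title_index]
    return matched, multiword


if __name__ == "__main__":
    # Stand-in title list; a real run would load titles from a data dump.
    titles = ["New_England", "Grapes of Wrath", "Irish Wolfhound", "Franco-Prussian War"]
    tags = ["new england", "grapes of wrath", "blue sky", "haystack"]
    matched, multiword = match_multiword(tags, titles)
    print(f"matched {len(matched)} of {len(multiword)} multiword tags: {matched}")
```

A matched tag such as "new england" can then be treated as a single unit instead of being split into "new" and "england".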

Page 23: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

WISH LIST - BETTER WAYS TO TAME THE PROLIFERATION OF RICH BUT “NOISY” CONTENT

• Clustering over tags for similarity

• Clustering over tags and terms from text

• Matching over existing terms to identify meaningful units

• Apply machine learning techniques to guess meaning

• Bigrams, Trigrams, Thesauri, Corpus Analysis
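As a starting point for the first wish-list item, tag similarity can be approximated by how much two tags overlap in the objects they were applied to. The Python sketch below uses Jaccard similarity as one possible measure; the measure, the threshold, the object IDs, and the names `jaccard` and `similar_pairs` are all assumptions made for illustration, not something proposed in the talk.

```python
def jaccard(a, b) -> float:
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(tag_objects, threshold=0.5):
    """List tag pairs whose object sets overlap at or above the threshold."""
    tags = sorted(tag_objects)
    pairs = []
    for i, first in enumerate(tags):
        for second in tags[i + 1:]:
            score = jaccard(tag_objects[first], tag_objects[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


if __name__ == "__main__":
    # Invented mapping from tag to the IDs of objects carrying that tag.
    tag_objects = {"haystack": {1, 2, 3}, "hay": {1, 2}, "necklace": {4}}
    for first, second, score in similar_pairs(tag_objects):
        print(f"{first} ~ {second}: {score:.2f}")
```

Pairs found this way could seed a clustering step; the all-pairs loop is quadratic, so a large tag set would need an index or sampling.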

Page 24: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

ACKNOWLEDGEMENTS

Steve.museum project members

T3 and steve.museum museum partners

University of Maryland, T3 group

IMA Museum

……and other participants

Page 25: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets

THANK YOU!

Questions?

Page 26: MW2011: Klavans, J.  +, Computational Linguistics in Museums: Applications for Cultural Datasets