Page 1: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Sharing and Browsing Linguistic Data

EMELD Arizona:

Terry Langendoen

Scott Farrar

Page 2: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Since Santa Barbara

Focus on morpho-syntax

Decided to build ontology (to be discussed later in this talk)

Decided to build supporting tools
– smart search engine (Hedwig)
– editor

Some work on xml markup

Page 3: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

The Problem

Currently there is no general way for researchers in the endangered languages community to electronically share information.

The Web is the most likely tool that could provide a solution.

The current WWW is not adequate.

An Example from the WWW:

Page 4: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar
Page 5: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar
Page 6: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar
Page 7: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar
Page 8: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Further Complications

What about other data formats?
– lexicons
– grammatical descriptions
– (comparative) word lists
– paradigms
– etc.

Page 9: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Warumungu Description

'Grammatical case suffixes' are those which express grammatical relations (subject, object, indirect object), like /karriny-ji/ in (4). A noun without a case suffix is interpreted as having Absolutive case - /nanttu/ in (4) and /wangarri/ in (5) - or as being the main predicator, or as agreeing with some argument with Absolutive case - /kumppu/ and /pulyurrulyurru/ in (5).

(from J. Simpson 1998)

Page 10: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

(4) Karriny-ji +ajjul nyirri-njina nanttu, ngapa-kajji.
    people-ERG +3pl.S put-PAST.CONT humpy, water-LEST
    'The people were erecting humpies for fear of the rain.' [JS:PND:RS]

(5) Nyirri-nyi +ama wangarri kumppu pulyurrulyurru.
    place-PAST.PUN +he rock ABS big.ABS red.ABS
    'He placed a big red hill.' [JS:PND:RS]
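As a rough illustration of why interlinear examples like these are awkward to share as plain text, the sketch below pairs each morpheme in example (4) with its gloss. The function name `align_gloss` is hypothetical, and the sketch assumes that words are separated by spaces, morphemes by hyphens, and that both lines line up one-to-one; it is not a format used by the authors.

```python
def align_gloss(source_line, gloss_line):
    """Pair each morpheme of the source line with its gloss.

    Assumes space-separated words, hyphen-separated morphemes, and the
    same number of words and morphemes on both lines.
    """
    src_words = source_line.split()
    gloss_words = gloss_line.split()
    if len(src_words) != len(gloss_words):
        raise ValueError("source and gloss lines have different word counts")

    pairs = []
    for src, gloss in zip(src_words, gloss_words):
        # Drop surrounding punctuation and the leading '+' clitic marker.
        src = src.strip(".,").lstrip("+")
        gloss = gloss.strip(".,").lstrip("+")
        src_morphs = src.split("-")
        gloss_morphs = gloss.split("-")
        if len(src_morphs) != len(gloss_morphs):
            raise ValueError(f"morpheme mismatch in {src!r} / {gloss!r}")
        pairs.extend(zip(src_morphs, gloss_morphs))
    return pairs


example_4 = align_gloss(
    "Karriny-ji +ajjul nyirri-njina nanttu, ngapa-kajji.",
    "people-ERG +3pl.S put-PAST.CONT humpy, water-LEST",
)
```

Lines whose word or morpheme counts do not match raise an error instead of being aligned, which is one reason explicit markup is preferable to layout conventions.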

Page 11: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Chichewa Description

Other elements that appear as verbal prefixes include modals – for instance, -ngo- 'just, merely' – as well as directional elements -ka- 'go' and -dza- 'come'. These are placed in the immediate pre-OM position, after the tense. This is shown by the following:

(from Mchombo 1998)

Page 12: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

(8a) Mkângo s-ú-ná-ngo-wá-phwány-a maûngu . . .
     3-lion NEG-3SM-past-just-6OM-smash-fv 6-pumpkins . . .
     'The lion did not just smash them, the pumpkins . . .'

(8b) Mkângo u-ku-ká-phwány-á máûngu.
     3SM-pres.-go-smash-fv 6-pumpkins
     'The lion is going to smash some pumpkins.'

Page 13: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

A Solution

Take advantage of new Web technology

Build a community of practice on the Semantic Web

What is the Semantic Web?

Page 14: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

The Semantic Web

New markup: <xml>, <rdf>, <owl>

New tools: smart search engines, ontologies, new editors

Meaning is encoded explicitly.

Pages are interpreted by a reasoner.

Page 15: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

An Example from the Semantic Web

New markup adds functionality to existing <html> documents.

Example:

<rdf:Description rdf:about="#A110604">
  <rdf:type rdf:resource="#State" />
  <NS0:name>Tennessee</NS0:name>
</rdf:Description>

<rdf:Description rdf:about="#876555">
  <rdf:type rdf:resource="#Language" />
  <EMELD:name>Navajo</EMELD:name>
</rdf:Description>
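A minimal sketch of how a program might read descriptions like these, using only Python's standard library. The slide shows neither the enclosing `rdf:RDF` element nor the namespace declarations, so the wrapper and the `NS0` and `EMELD` namespace URIs below are assumptions made for illustration. Only the RDF namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace URIs: the slide does not show these declarations.
NS0_NS = "http://example.org/ns0#"
EMELD_NS = "http://example.org/emeld#"

document = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:NS0="{NS0_NS}" xmlns:EMELD="{EMELD_NS}">
  <rdf:Description rdf:about="#A110604">
    <rdf:type rdf:resource="#State" />
    <NS0:name>Tennessee</NS0:name>
  </rdf:Description>
  <rdf:Description rdf:about="#876555">
    <rdf:type rdf:resource="#Language" />
    <EMELD:name>Navajo</EMELD:name>
  </rdf:Description>
</rdf:RDF>
"""


def describe(rdf_xml):
    """Return (identifier, type, name) for each rdf:Description element."""
    root = ET.fromstring(rdf_xml)
    results = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        type_el = desc.find(f"{{{RDF_NS}}}type")
        rdf_type = type_el.get(f"{{{RDF_NS}}}resource") if type_el is not None else None
        # Accept a 'name' element from any namespace.
        name = next((child.text for child in desc if child.tag.endswith("}name")), None)
        results.append((about, rdf_type, name))
    return results


records = describe(document)
```

Because each description states its type explicitly, a search tool can tell that "Navajo" names a language and "Tennessee" names a state without guessing from surrounding text.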

Page 16: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Aardvark

nocturnal burrowing mammal of the grasslands of Africa that feeds on termites; sole extant representative of the order Tubulidentata

WordNet for 'aardvark'

Nouns:

  1. nocturnal burrowing mammal of the grasslands of Africa that feeds on termites; sole extant representative of the order Tubulidentata
  Synonyms: aardvark,ant_bear,anteater,Orycteropus_afer

Verbs:

Adjectives:

Adverbs:

Page 17: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

<html>
<head>
<rdf:RDF…
<Word rdf:about="aardvark">
  <hasSense rdf:resource="9385"/>
</Word>
<SynSet rdf:about="9385">
  <type rdf:resource="noun"/>
  <rdfs:comment>nocturnal burrowing mammal of the grasslands of Africa that feeds on termites; sole extant representative of the order Tubulidentata </rdfs:comment>
  <hasElement rdf:resource="aardvark"/>
  <hasElement rdf:resource="ant_bear"/>
  <hasElement rdf:resource="anteater"/>
  <hasElement rdf:resource="Orycteropus_afer"/>
</SynSet>
</rdf:RDF>
</head>
<body>
WordNet for 'aardvark'<br><br>
Nouns:<br><br>
&nbsp;&nbsp;1. nocturnal burrowing mammal of the grasslands of Africa that feeds on termites; sole extant representative of the order Tubulidentata<br>
&nbsp;&nbsp;Synonyms: aardvark,ant_bear,anteater,Orycteropus_afer<br><br>
Verbs:<br><br>
Adjectives:<br><br>
Adverbs:<br><br>
</body>
</html>
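To show what the embedded markup makes possible, the sketch below restates the RDF statements from this page as (subject, predicate, object) triples and answers a synonym query over them. The triples are transcribed by hand from the slide. The helper names `objects` and `synonyms` are hypothetical and do not describe how Hedwig works.

```python
# Statements from the embedded RDF, as (subject, predicate, object) triples.
triples = [
    ("aardvark", "hasSense", "9385"),
    ("9385", "type", "noun"),
    ("9385", "hasElement", "aardvark"),
    ("9385", "hasElement", "ant_bear"),
    ("9385", "hasElement", "anteater"),
    ("9385", "hasElement", "Orycteropus_afer"),
]


def objects(subject, predicate):
    """Return every object linked to the subject by the predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def synonyms(word):
    """Collect the other members of every synset in which the word has a sense."""
    found = []
    for synset in objects(word, "hasSense"):
        for element in objects(synset, "hasElement"):
            if element != word and element not in found:
                found.append(element)
    return found


result = synonyms("aardvark")
```

The same query against the visible page text would require parsing the "Synonyms:" line, which depends on that page's layout.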

Page 18: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

The Ontology

Crucial component of the Semantic Web

A resource that explicitly defines what entities can exist in a domain, i.e., the endangered languages community

A resource that defines what relations hold between entities

demo
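A toy sketch of the idea that an ontology defines entities and the relations between them, here limited to a single subclass relation. The concept names are hypothetical because the slides do not list the contents of the EMELD ontology. The upward walk stands in for what a reasoner does and is far simpler than a real one.

```python
# Hypothetical concepts; each entry reads "key is a subclass of value".
SUBCLASS_OF = {
    "ErgativeCase": "Case",
    "AbsolutiveCase": "Case",
    "Case": "MorphosyntacticFeature",
    "PastTense": "Tense",
    "Tense": "MorphosyntacticFeature",
}


def ancestors(concept):
    """Follow subclass links upward and return every superclass in order."""
    chain = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        chain.append(concept)
    return chain


def is_a(concept, candidate):
    """True if the concept equals the candidate or is one of its subclasses."""
    return concept == candidate or candidate in ancestors(concept)
```

With these definitions, `is_a("ErgativeCase", "Case")` holds even though no single statement says so directly about every kind of case, which is the kind of inference a search for "case" would rely on.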

Page 19: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

OWL Web Ontology Language

Plays a role analogous to that of <html> on the WWW

The most current “standard” Semantic Web language

Under development at the W3C:

www.w3c.org

Page 20: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Facilitating Tools

Search tools for the Semantic Web

Editors for composing Semantic Web pages

Reasoning engines

An extensible data model

Page 21: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

A Search Engine

EMELD Arizona’s prototype (Hedwig)

http://emeld.douglass.arizona.edu:8080/searchindex.html (temporarily out of service)

demo on Sunday

Page 22: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

An Editor

EMELD Arizona’s prototype (name?)

demo on Sunday

Page 23: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

A Good Data Model for Creating a Community of Practice

Language data should be searchable and comparable: broad access (centralized).

Authors or communities want control over their data (local/distributed).

Local control should be balanced with data interoperability (Semantic Web).
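One way to picture the balance between local control and interoperability: each dataset keeps its own gloss labels, and a separate mapping links those labels to shared concepts that a search can use. The morphemes and labels below come from the Warumungu and Chichewa examples earlier in the talk. The shared concept names and the mapping itself are hypothetical, not part of EMELD's ontology.

```python
# Hypothetical mapping from each dataset's own labels to shared concepts.
LOCAL_TO_SHARED = {
    "warumungu": {"ERG": "ErgativeCase", "PAST.PUN": "PastTense"},
    "chichewa": {"past": "PastTense", "NEG": "Negation"},
}

# Glossed morphemes per language, stored as (form, local label).
DATA = {
    "warumungu": [("ji", "ERG"), ("nyi", "PAST.PUN")],
    "chichewa": [("s", "NEG"), ("ná", "past")],
}


def search(shared_concept):
    """Find morphemes in every dataset whose local label maps to the concept."""
    hits = []
    for language, morphemes in DATA.items():
        mapping = LOCAL_TO_SHARED[language]
        for form, label in morphemes:
            if mapping.get(label) == shared_concept:
                hits.append((language, form, label))
    return hits


past_markers = search("PastTense")
```

The datasets themselves are never relabeled. Only the mapping refers to the shared vocabulary, so each community can keep its own terminology.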

Page 24: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Centralized Model

[Diagram labels: Warumungu, Wari, Mocovi, Biao Min, Archi, Hopi; Community]

Page 25: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Local Control with Broad Access

[Diagram labels: Semantic Web; ontology; Wari <xml>, Hopi <xml>, Archi <xml>; Community; tools]

Page 26: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Community Requirements

No need to standardize your terminology or abandon tradition.

No need to learn <xml> (it doesn’t hurt!)

Use EMELD tools to put your data on the Semantic Web

Maintain your data

Page 27: Sharing and Browsing Linguistic Data EMELD Arizona: Terry Langendoen Scott Farrar

Contact Info

Terry Langendoen
Scott Farrar

[email protected]@u.arizona.edu

See our website:

http://emeld.douglass.arizona.edu:8080