Enriching the semantic web tutorial session 1

Monnet is supported by the European Union under Grant No. 248458. Session 1: NLP and the Multilingual Semantic Web: Challenges and Opportunities. Tobias Wunner, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway (NUIG)

Upload: tobias-wunner

Post on 13-Jan-2015

DESCRIPTION

Tutorial at ESWC 2011 with John McCrae and Elena Montiel-Ponsoda

TRANSCRIPT

Page 1: Enriching the semantic web tutorial session 1

Monnet is supported by the European Union under Grant No. 248458

Session 1: NLP and the Multilingual Semantic Web: Challenges and Opportunities

Tobias Wunner

Digital Enterprise Research Institute (DERI)

National University of Ireland, Galway (NUIG)

Page 2: Enriching the semantic web tutorial session 1

What’s on the Web?

• Wikipedia

• 250 languages

• less than 25% in English

Articles   Languages
3.6M       English (1 billion words)
~1M        German, French
< 800k     Spanish, Polish, Italian
< 700k     Russian, Dutch, Chinese
< 200k     Slovenian
< 20k      Afrikaans, Irish

From: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons

[Chart: time axis 2001 to 2011, labelled values 1M and 3.5M]

Page 3: Enriching the semantic web tutorial session 1

What’s on the Web?

• Hudong Baike Chinese Encyclopedia

• 3.9m Chinese articles

Google Translate

Page 4: Enriching the semantic web tutorial session 1

Language use on the Web

• Term variations

Rarely used term variation: 300k results

Widely accepted term: 7m results (more results)

Page 5: Enriching the semantic web tutorial session 1

Language use on the Web

• Linguistic variations “as Gaeilge”

Case         Singular   Plural
Nominative   leigheas   leigheasanna
Genitive     leigheas   leighis
Dative       …          …
Vocative     …          …

Irish cases for the word “medicine”

[Slide callouts: Singular; Genitive plural]

Page 6: Enriching the semantic web tutorial session 1

Language use on the Web

• Linguistic variations - syntactic

TermLeasePaym + NOUN TermLeasePaym + ADJECTIVE

thirty times more results

Page 7: Enriching the semantic web tutorial session 1

Language use on the Web

• Linguistic variations – morphological (German compounding)

Pattern                           POS         de           en
COMPOUND (PaymDelay)              NOUN        Zahlung      payment
                                  NOUN        Verspätung   delay
ADJECTIVE + NOUN (delayed Paym)   ADJECTIVE   verspätet    delayed
                                  NOUN        Zahlung      payment
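The compound pattern can be illustrated with a naive splitter, a minimal sketch only. It assumes the compound for PaymDelay is written "Zahlungsverspätung" (the slide shows only its parts), uses a two-entry lexicon built from the slide, and handles just the linking element "s"; the name split_compound is illustrative.

```python
# Tiny German-English lexicon built from the two nouns on the slide.
LEXICON = {"zahlung": "payment", "verspätung": "delay"}

# Linking elements tried between the two parts (only "" and "s" in this sketch).
LINKERS = ("", "s")

def split_compound(word):
    """Split a two-part compound into lexicon entries, or return None."""
    w = word.lower()
    for i in range(1, len(w)):
        head, tail = w[:i], w[i:]
        for linker in LINKERS:
            if linker and not head.endswith(linker):
                continue
            # Strip the linking element from the first part before lookup.
            stem = head[: len(head) - len(linker)] if linker else head
            if stem in LEXICON and tail in LEXICON:
                return [stem, tail]
    return None

parts = split_compound("Zahlungsverspätung")
print(parts, [LEXICON[p] for p in parts])
```

Once the compound is split, it can be matched against the same lexicon entries as the ADJECTIVE + NOUN variant, which is what makes the two forms comparable.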

Page 8: Enriching the semantic web tutorial session 1

The Semantic Web

• Structured data in triples: <Subject> <Predicate> <Object>

• Resources identified by URI (Uniform Resource Identifier)

dbpedia:TCM rdfs:label "Traditional Chinese Medicine"@en
dbpedia:TCM rdfs:label "Medicina Tradicional China"@es
dbpedia:TCM owl:sameAs dbpedia:TraditionalChineseMedicine

DBpedia rdfs:label and owl:sameAs relationships

Linguistic and semantic information on the Semantic Web!

URI = http://dbpedia.org/resource/TCM, abbreviated to dbpedia:TCM (Turtle)
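A minimal sketch of how a prefixed name such as dbpedia:TCM maps to its full URI. The dbpedia namespace comes from the slide, the rdfs and owl namespaces are the standard W3C ones, and the function name expand is only illustrative.

```python
# Namespace table: "dbpedia" is taken from the slide; "rdfs" and "owl" are
# the standard W3C namespaces.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}

def expand(prefixed_name: str) -> str:
    """Expand a prefixed name like 'dbpedia:TCM' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local

# The owl:sameAs triple from the slide, written with prefixed names.
triple = ("dbpedia:TCM", "owl:sameAs", "dbpedia:TraditionalChineseMedicine")
expanded = tuple(expand(term) for term in triple)
print(expanded)
```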

Page 9: Enriching the semantic web tutorial session 1

The Semantic Web

• …is multilingual

Multilingual literals (STW - German economy Thesaurus)

Multilingual vocabularies (Rechtspraak.nl – Dutch case law dataset)

Page 10: Enriching the semantic web tutorial session 1

Language use on the Web

• Different resources, different labeling mechanisms!

• To some extent there is no linguistic right or wrong

--> Standards (formal agreements)

From http://www.nlm.nih.gov/mesh/MBrowser.html

MeSH (Medical Subject Headings)

Page 11: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search?

• Semantic Web Query Language (SPARQL)

• Semantic Web Search Engines

Page 12: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search with SPARQL?

• Matching pattern on graph of triples

• Choose labeling mechanism, e.g.

• …from RDFS vocabulary (label)

• …from SKOS vocabulary (preferred label)

• …other
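One way to cope with several labeling mechanisms is to try the label predicates in turn. The sketch below works on a hand-made in-memory list of triples. The ex: resources, their labels, and the preference order are invented for illustration; only rdfs:label and skos:prefLabel are real vocabulary terms.

```python
# Label predicates to try, in an assumed order of preference.
LABEL_PREDICATES = ["skos:prefLabel", "rdfs:label"]

# Hypothetical data: each object is a (text, language tag) pair.
TRIPLES = [
    ("ex:a", "rdfs:label", ("Traditional Chinese Medicine", "en")),
    ("ex:b", "skos:prefLabel", ("traditional medicine", "en")),
    ("ex:b", "skos:prefLabel", ("traditionelle Medizin", "de")),
]

def get_label(triples, subject, lang):
    """Return the first label of `subject` in language `lang`, or None."""
    for predicate in LABEL_PREDICATES:
        for s, p, (text, tag) in triples:
            if s == subject and p == predicate and tag == lang:
                return text
    return None

print(get_label(TRIPLES, "ex:a", "en"))
print(get_label(TRIPLES, "ex:b", "de"))
```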

Page 13: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search with SPARQL?

• Matching pattern on graph of triples

• Choose predicate according to labeling mechanism

• Query on literal value

<Subject>   <Predicate>   <Object>
Resource    rdfs:label    "Traditionelle chinesische Medizin"@de
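The matching step can be sketched as a tiny triple-pattern matcher in which None plays the role of a SPARQL variable. This is an illustration over in-memory data, not a SPARQL engine; the assumption that dbpedia:TCM carries the German label is made here only for the example, and the SPARQL string is shown for comparison and is not executed.

```python
# Roughly equivalent SPARQL, for comparison only (not executed here).
SPARQL = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource WHERE {
  ?resource rdfs:label "Traditionelle chinesische Medizin"@de .
}
"""

# Assumed data: objects are (text, language tag) pairs.
TRIPLES = [
    ("dbpedia:TCM", "rdfs:label", ("Traditional Chinese Medicine", "en")),
    ("dbpedia:TCM", "rdfs:label", ("Traditionelle chinesische Medizin", "de")),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Fix the predicate and the literal value, leave the subject open.
hits = match(TRIPLES, p="rdfs:label", o=("Traditionelle chinesische Medizin", "de"))
resources = [s for s, _, _ in hits]
print(resources)
```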

Page 14: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search with Sindice?

• Query all literals with Greek-encoded string “Χερσόνησος”

Page 15: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search embedded terms in URI?

• Example: “all resources with word traditional”

dbpedia:TraditionalChineseMedicine

dbpedia:TraditionalIrishMusic

dbpedia:IrishTraditionalMusic

...

with SPARQL filter

select ?subject where { ?subject ?predicate ?object . filter regex(str(?subject), ".*traditional.*chinese.*", "i") }
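The effect of such a regular-expression filter can be tried out locally. The sketch below applies Python's re module to the three example URIs, with case-insensitive matching because the terms are embedded in CamelCase; the helper name filter_uris is illustrative and this is not how a SPARQL endpoint evaluates the filter internally.

```python
import re

# The three example resources from the slide, as full DBpedia URIs.
URIS = [
    "http://dbpedia.org/resource/TraditionalChineseMedicine",
    "http://dbpedia.org/resource/TraditionalIrishMusic",
    "http://dbpedia.org/resource/IrishTraditionalMusic",
]

def filter_uris(uris, pattern):
    """Keep the URIs in which `pattern` matches anywhere, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [u for u in uris if rx.search(u)]

# The bare word matches all three resources.
print(filter_uris(URIS, "traditional"))
# The pattern from the slide's filter also requires "chinese" after it.
print(filter_uris(URIS, ".*traditional.*chinese.*"))
```

The second call returns only the Traditional Chinese Medicine resource, so the choice of pattern decides how broad the search is.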

Page 16: Enriching the semantic web tutorial session 1

What’s on the Semantic Web?

• How to search embedded terms?

• Example: “all resources with word traditional”

dbpedia:TraditionalChineseMedicine

dbpedia:TraditionalIrishMusic

dbpedia:IrishTraditionalMusic

...

with Sindice star-shaped queries (SIREn)

Results

Page 17: Enriching the semantic web tutorial session 1

NLP for the Semantic Web

1. Multilingual/Ontology-based Information Extraction (BioCaster, OpenCalais)

2. Ontology Localization (LabelTranslator)

3. Ontology-based Natural Language Generation (CLANN)

Page 18: Enriching the semantic web tutorial session 1

Multilingual/Ontology-based Information Extraction (BioCaster)

• Aggregates and processes health news

• Annotates news based on a multilingual ontology

• Uses proprietary format and SKOS-XL to maintain terminology

http://born.nii.ac.jp

concept = measles

Page 19: Enriching the semantic web tutorial session 1

Multilingual/Ontology-based Information Extraction (BioCaster)

• Example: “Risk of measles outbreak in Malta unlikely…”

http://born.nii.ac.jp

[DISEASE] [COUNTRY]
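The annotation in this example can be approximated with a dictionary lookup, sketched below. This is not BioCaster's actual pipeline; the per-language lexicons and the annotate function are hypothetical, and the German entry only shows that multilingual adaptation amounts to extending the lexicon.

```python
import re

# Hypothetical per-language lexicons mapping surface terms to ontology classes.
LEXICONS = {
    "en": {"measles": "DISEASE", "malta": "COUNTRY"},
    "de": {"masern": "DISEASE", "malta": "COUNTRY"},
}

def annotate(text, lang="en"):
    """Wrap every word found in the lexicon as [CLASS word]."""
    lexicon = LEXICONS[lang]

    def tag(match):
        word = match.group(0)
        label = lexicon.get(word.lower())
        return f"[{label} {word}]" if label else word

    return re.sub(r"\w+", tag, text)

print(annotate("Risk of measles outbreak in Malta unlikely"))
print(annotate("Masern in Malta", lang="de"))
```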

Page 20: Enriching the semantic web tutorial session 1

Multilingual/Ontology-based Information Extraction (BioCaster)

• Challenges

• Multilingual adaptation

• Adaptation of information extraction rules to other domains

• Use of proprietary format is undesirable

Page 21: Enriching the semantic web tutorial session 1

Multilingual Information Extraction (OpenCalais)

• Semantic markup of unstructured text

• Multilingual (English, French, Spanish)

• English

• 39 entities

• 75 relations

Page 22: Enriching the semantic web tutorial session 1

Multilingual Information Extraction (OpenCalais)

• Domain tuned (Finance, Biomedical)

• Only 15 base entities for non-English, no relations

• Demo

http://viewer.opencalais.com

Page 23: Enriching the semantic web tutorial session 1

Multilingual Information Extraction (OpenCalais)

• Challenges

• Multilingual adaptation of lexicon and extraction rules

• Domain adaptation of lexicon and extraction rules

Page 24: Enriching the semantic web tutorial session 1

Ontology Localisation (LabelTranslator)

• Multilingual ontology editor

• Linguistic annotations (Num., POS, Gender)

• … for a better translation

Part of speech

Number + Gender

Page 25: Enriching the semantic web tutorial session 1

Ontology Localisation (LabelTranslator)

“river”@en

“rivière”@fr

“fleuve”@fr

Ambiguous!

Page 26: Enriching the semantic web tutorial session 1

Ontology Localisation (LabelTranslator)

• Challenges

• Use linguistic features in the lexicon for better machine translation

• Use semantic features from the domain model as well
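A semantic feature can resolve an ambiguous label such as “river”, as the sketch below suggests. It is not LabelTranslator's algorithm. It assumes a hypothetical domain-model property flows_into and relies on the French distinction that a fleuve flows into the sea while a rivière flows into another watercourse.

```python
# Candidate French labels for the English label "river", each tied to a
# value of the hypothetical domain-model property `flows_into`.
CANDIDATES = {
    "river": [
        {"label": "fleuve", "flows_into": "sea"},
        {"label": "rivière", "flows_into": "river"},
    ],
}

def translate(label, flows_into):
    """Pick the French label whose semantic feature matches, else None."""
    for candidate in CANDIDATES.get(label, []):
        if candidate["flows_into"] == flows_into:
            return candidate["label"]
    return None

print(translate("river", "sea"))
print(translate("river", "river"))
```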

Page 27: Enriching the semantic web tutorial session 1

Natural Language Generation (CLANN)

• Controlled Language ANNotations (CLANN)

• To write domain specific grammars (meeting minutes)

• Intermediate representation

Domain ontology (e.g. meeting minutes)

MLink Grammar

Linked Grammar

Page 28: Enriching the semantic web tutorial session 1

Natural Language Generation (CLANN)

• Example: “John will present lemon model.”

Parse tree (abstract), dependency labels: nsubj, aux, dobj

Parse tree in MLINK:

:Sentence1 :hasRootNode [ rdf:type :TextNode ;
    :hasText "present" ;
    :hasSubType :Verb ;
    :hasObject [ rdf:type :TextNode ;
        :hasText "model" ;
        :hasObjectModifier [ rdf:type :TextNode ;
            :hasText "lemon"
        ]
    ]
] .
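The step from a parse tree to triples can be sketched as a recursive walk over a nested structure. The property names (:hasText, :hasObject, :hasObjectModifier) follow the slide; the dictionary layout, the blank-node identifiers and the function tree_to_triples are assumptions made for illustration, and :hasSubType is left out for brevity.

```python
from itertools import count

# The object branch of the example sentence as a nested dictionary.
TREE = {
    "text": "present",
    "children": {
        ":hasObject": {
            "text": "model",
            "children": {":hasObjectModifier": {"text": "lemon"}},
        }
    },
}

def tree_to_triples(node, counter=None):
    """Return (node id, triples) for a node and all of its descendants."""
    counter = counter if counter is not None else count(1)
    node_id = f"_:n{next(counter)}"
    triples = [
        (node_id, "rdf:type", ":TextNode"),
        (node_id, ":hasText", node["text"]),
    ]
    for prop, child in node.get("children", {}).items():
        child_id, child_triples = tree_to_triples(child, counter)
        triples.append((node_id, prop, child_id))
        triples.extend(child_triples)
    return node_id, triples

root, triples = tree_to_triples(TREE)
for triple in triples:
    print(triple)
```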

Page 29: Enriching the semantic web tutorial session 1

Natural Language Generation (CLANN)

• Challenges

• From text to triples?

• Domain adaptation (meeting minutes)

• Multilingual adaptation

Page 30: Enriching the semantic web tutorial session 1

Summary

• The Web and the Semantic Web are

• “Lingual” (variations within one language)

• Multilingual (between languages and cultures)

• NLP Applications need domain and multilingual adaptation

• Lexicon updates / extensions

• Extraction rules updates / extensions

• What do we need?

• Efficient adaptation and sharing of linguistic resources between ontology-based NLP applications

Page 31: Enriching the semantic web tutorial session 1

Links and resources

• Tutorial website

• http://tiny.cc/tvzlc

• The Monnet Project

• Multilingual Ontologies for Networked Knowledge

• http://www.monnet-project.eu/

• LexInfo

• http://lexinfo.net/