Enriching the Semantic Web: Tutorial Session 1
Tutorial at ESWC 2011 with John McCrae and Elena Montiel-Ponsoda
Monnet is supported by the European Union under Grant No. 248458
Session 1: NLP and the Multilingual Semantic Web: Challenges and Opportunities
Tobias Wunner
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway (NUIG)
What’s on the Web?
• Wikipedia
  • 250 languages
  • less than 25% in English

Articles   Languages
3.6M       English (1 billion words)
~1M        German, French
<800k      Spanish, Polish, Italian
<700k      Russian, Dutch, Chinese
<200k      Slovenian
<20k       Afrikaans, Irish
From: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons
[Chart: article count over time, 2001 to 2011; marked values 1M and 3.5M]
What’s on the Web?
• Hudong Baike Chinese Encyclopedia
• 3.9m Chinese articles
Google Translate
Language use on the Web
• Term variations
  • Rarely used term variation: 300k results
  • Widely accepted term: 7M more results
Language use on the Web
• Linguistic variations “as Gaeilge”

Irish cases for the word “medicine”
             Singular    Plural
Nominative   leigheas    leigheasanna
Genitive     leigheas    leighis
Dative       …           …
Vocative     …           …

(Callouts on the slide: Singular; Genitive plural)
Language use on the Web
• Linguistic variations - syntactic
TermLeasePaym + NOUN vs. TermLeasePaym + ADJECTIVE
One pattern returns thirty times more results
Language use on the Web
• Linguistic variations: morphological (German compounding)

Pattern                           Category    de           en
COMPOUND (PaymDelay)              NOUN        Zahlung      payment
                                  NOUN        Verspätung   delay
ADJECTIVE + NOUN (delayed Paym)   ADJECTIVE   verspätet    delayed
                                  NOUN        Zahlung      payment
The Semantic Web
• Structured data in triples: <Subject> <Predicate> <Object>
• Resources identified by URI (Uniform Resource Identifier)
dbpedia:TCM rdfs:label "Traditional Chinese Medicine"@en .
dbpedia:TCM rdfs:label "Medicina Tradicional China"@es .
dbpedia:TCM owl:sameAs dbpedia:TraditionalChineseMedicine .
DBpedia: rdfs:label and owl:sameAs relationships
Linguistic and semantic information on the Semantic Web!
URI = http://dbpedia.org/resource/TCM, abbreviated as dbpedia:TCM (Turtle)
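The triple structure and the prefixed-name shorthand can be sketched in a few lines of Python. This is only an illustration under assumptions: the in-memory list, the `expand` and `labels` helpers, and the tuple encoding of language-tagged literals are hypothetical, not part of any RDF library. The namespace URIs for rdfs and owl are the standard W3C ones.

```python
# Prefix table: dbpedia is the namespace shown on the slide; rdfs and owl
# are the standard W3C namespaces.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}

def expand(curie):
    """Expand a prefixed name such as 'dbpedia:TCM' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

# Each triple is (subject, predicate, object).
# Assumption: a language-tagged literal is encoded as a (text, language) pair.
TRIPLES = [
    ("dbpedia:TCM", "rdfs:label", ("Traditional Chinese Medicine", "en")),
    ("dbpedia:TCM", "rdfs:label", ("Medicina Tradicional China", "es")),
    ("dbpedia:TCM", "owl:sameAs", "dbpedia:TraditionalChineseMedicine"),
]

def labels(subject):
    """Return a {language: label} map for one subject."""
    return {obj[1]: obj[0] for s, p, obj in TRIPLES
            if s == subject and p == "rdfs:label"}
```

Calling `expand("dbpedia:TCM")` gives the full URI, and `labels("dbpedia:TCM")` collects one label per language tag.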
The Semantic Web
• …is multilingual
  • Multilingual literals (STW - German economics thesaurus)
  • Multilingual vocabularies (Rechtspraak.nl - Dutch case law dataset)
Language use on the Web
• Different resources, different labeling mechanisms!
• To some extent there is no linguistic right or wrong
--> Standards (formal agreements)
From http://www.nlm.nih.gov/mesh/MBrowser.html
MeSH (Medical Subject Headings)
What’s on the Semantic Web?
• How to search?
• Semantic Web Query Language (SPARQL)
• Semantic Web Search Engines
What’s on the Semantic Web?
• How to search with SPARQL?
  • Match a pattern on the graph of triples
  • Choose a labeling mechanism, e.g.
    • …from the RDFS vocabulary (label)
    • …from the SKOS vocabulary (preferred label)
    • …other
  • Choose the predicate according to the labeling mechanism
  • Query on the literal value
What’s on the Semantic Web?
<Subject> <Predicate> <Object>
Resource rdfs:label "Traditionelle chinesische Medizin"@de
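The search steps (pick the predicate for the labeling mechanism, then match on the literal value) can be sketched without an endpoint. This is a minimal sketch, assuming a hypothetical in-memory graph: `GRAPH`, `LABEL_PREDICATES`, `find_by_label` and the `ex:tcm` resource are made up for illustration. A real setup would send the equivalent SPARQL pattern to a triple store.

```python
# Hypothetical in-memory graph; literals are (text, language) pairs.
GRAPH = [
    ("dbpedia:TCM", "rdfs:label", ("Traditional Chinese Medicine", "en")),
    ("dbpedia:TCM", "rdfs:label", ("Traditionelle chinesische Medizin", "de")),
    ("ex:tcm", "skos:prefLabel", ("Traditional Chinese Medicine", "en")),
]

# Labeling mechanism -> predicate to use in the triple pattern.
LABEL_PREDICATES = {"rdfs": "rdfs:label", "skos": "skos:prefLabel"}

def find_by_label(text, lang, mechanism="rdfs"):
    """Match the pattern (?s, <label predicate>, "text"@lang); return every ?s."""
    predicate = LABEL_PREDICATES[mechanism]
    return [s for s, p, o in GRAPH if p == predicate and o == (text, lang)]
```

The same label text is found under different subjects depending on the mechanism chosen, which is why the predicate has to be picked per dataset.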
What’s on the Semantic Web?
• How to search with Sindice?
  • Query all literals with the Greek-encoded string “Χερσόνησος”
What’s on the Semantic Web?
• How to search embedded terms in URIs?
  • Example: “all resources with word traditional”
dbpedia:TraditionalChineseMedicine
dbpedia:TraditionalIrishMusic
dbpedia:IrishTraditionalMusic
...
with SPARQL filter (str() converts the URI to a string; the "i" flag ignores case)
select distinct ?subject where { ?subject ?predicate ?object . filter regex(str(?subject), "traditional", "i") }
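The effect of such a filter can be tried locally with Python's `re` module. This is a sketch, assuming a hypothetical list of URIs: `RESOURCES` and `uris_containing` are illustrative names. Case-insensitive matching matters because the word is embedded in camel case inside the URI.

```python
import re

# Hypothetical sample of resource URIs, using the names from the slide.
RESOURCES = [
    "http://dbpedia.org/resource/TraditionalChineseMedicine",
    "http://dbpedia.org/resource/TraditionalIrishMusic",
    "http://dbpedia.org/resource/IrishTraditionalMusic",
    "http://dbpedia.org/resource/TCM",
]

def uris_containing(word, uris):
    """Return the URIs whose text contains `word`, ignoring case."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [u for u in uris if pattern.search(u)]
```

An opaque identifier such as `.../TCM` is never found this way, which is one reason to query labels rather than URIs where labels exist.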
What’s on the Semantic Web?
• How to search embedded terms?
  • Example: “all resources with word traditional”
dbpedia:TraditionalChineseMedicine
dbpedia:TraditionalIrishMusic
dbpedia:IrishTraditionalMusic
...
with Sindice star-shaped queries (SIREn)
[Screenshot: results]
NLP for the Semantic Web
1. Multilingual/Ontology-based Information Extraction (BioCaster, OpenCalais)
2. Ontology Localization (LabelTranslator)
3. Ontology-based Natural Language Generation (CLANN)
Multilingual/Ontology-based Information Extraction (BioCaster)
• Aggregates and processes health news
• Annotates news based on a multilingual ontology
• Uses proprietary format and SKOS-XL to maintain terminology
http://born.nii.ac.jp
…
concept = measles
Multilingual/Ontology-based Information Extraction (BioCaster)
• Example: “Risk of measles outbreak in Malta unlikely…”
http://born.nii.ac.jp
Annotations: measles [DISEASE], Malta [COUNTRY]
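Annotation against a multilingual term list can be sketched as a plain dictionary lookup. This is not BioCaster's actual method or format, only a minimal illustration under assumptions: `LEXICON` and `annotate` are hypothetical, and the German and Spanish entries are added to show how several languages can point at one concept.

```python
import re

# Hypothetical mini-lexicon: surface term (lowercase) -> (concept, entity type).
LEXICON = {
    "measles": ("measles", "DISEASE"),
    "masern": ("measles", "DISEASE"),      # German term for the same concept
    "sarampión": ("measles", "DISEASE"),   # Spanish term for the same concept
    "malta": ("Malta", "COUNTRY"),
}

def annotate(text):
    """Return (surface form, concept, type) for every lexicon term found in text."""
    found = []
    for token in re.findall(r"\w+", text):
        entry = LEXICON.get(token.lower())
        if entry:
            found.append((token, entry[0], entry[1]))
    return found
```

Extending coverage to a new language then means adding lexicon entries. Multi-word terms and inflected forms would need more than this token lookup.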
Multilingual/Ontology-based Information Extraction (BioCaster)
• Challenges
  • Multilingual adaptation
  • Adaptation of information extraction rules to other domains
  • Use of a proprietary format is undesirable
Multilingual Information Extraction (OpenCalais)
• Semantic markup of unstructured text
• Multilingual (English, French, Spanish)
  • English
    • 39 entities
    • 75 relations
Multilingual Information Extraction (OpenCalais)
• Domain tuned (Finance, Biomedical)
• Only 15 base entities for non-English, no relations
• Demo
http://viewer.opencalais.com
Multilingual Information Extraction (OpenCalais)
• Challenges
• Multilingual adaptation of lexicon and extraction rules
• Domain adaptation of lexicon and extraction rules
Ontology Localisation (LabelTranslator)
• Multilingual ontology editor
• Linguistic annotations (Num., POS, Gender)
• … for a better translation
part of speech
Number + Gender
Ontology Localisation (LabelTranslator)
“river”@en
“rivière”@fr
“fleuve”@fr
Ambiguous!
Ontology Localisation (LabelTranslator)
• Challenges
• Use linguistic features in the lexicon for better machine translation
• Use semantic features from the domain model as well
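How a semantic feature could resolve the “river” ambiguity can be sketched as follows. This is not how LabelTranslator works internally. `CANDIDATES_FR`, `translate_label` and the `flows_into` feature are hypothetical, based on the usual French distinction: a fleuve flows into the sea, a rivière into another river.

```python
# Hypothetical candidate translations for the English label "river", each tied
# to a semantic feature that a domain model might record.
CANDIDATES_FR = {
    "river": [
        {"label": "fleuve", "flows_into": "sea"},
        {"label": "rivière", "flows_into": "river"},
    ]
}

def translate_label(label, features):
    """Return the French candidates whose features agree with the concept's."""
    options = CANDIDATES_FR.get(label, [])
    matching = [c["label"] for c in options
                if c["flows_into"] == features.get("flows_into")]
    # Without a distinguishing feature the label stays ambiguous: return all options.
    return matching or [c["label"] for c in options]
```

With `{"flows_into": "sea"}` only one candidate remains. With no features, both come back and the choice is left to a human or to other evidence.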
Natural Language Generation (CLANN)
• Controlled Language ANNotations (CLANN)
• To write domain specific grammars (meeting minutes)
• Intermediate representation
Domain ontology (e.g. meeting minutes)
MLink Grammar
LinkedGrammar
Natural Language Generation (CLANN)
• Example: “John will present lemon model.”
Parse tree (abstract), dependency labels: nsubj, aux, dobj
Parse tree in MLINK:
:Sentence1 :hasRootNode [
  rdf:type :TextNode ; :hasText "present" ; :hasSubType :Verb ;
  :hasObject [
    rdf:type :TextNode ; :hasText "model" ;
    :hasObjectModifier [ rdf:type :TextNode ; :hasText "lemon" ]
  ]
] .
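Going from such a node structure back to text can be sketched as a small recursive walk. The dictionary keys follow the predicates on the slide. The word-order rule (modifier before its head, object after the verb) and the `linearize` helper are assumptions for illustration, not CLANN's actual generation procedure.

```python
# Nested dict mirroring the node structure on the slide; key names follow the
# slide's predicates, and the linearization rule below is an assumption.
TREE = {
    "hasText": "present",
    "hasSubType": "Verb",
    "hasObject": {
        "hasText": "model",
        "hasObjectModifier": {"hasText": "lemon"},
    },
}

def linearize(node):
    """Emit words in English order: modifier before its head, object after the verb."""
    words = []
    if "hasObjectModifier" in node:
        words += linearize(node["hasObjectModifier"])
    words.append(node["hasText"])
    if "hasObject" in node:
        words += linearize(node["hasObject"])
    return words
```

`" ".join(linearize(TREE))` gives the verb phrase of the example sentence. The subject and the auxiliary are not part of the fragment shown, so they are not generated here.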
Natural Language Generation (CLANN)
• Challenges
• From text to triples?
• Domain adaptation (meeting minutes)
• Multilingual adaptation
Summary
• The Web and the Semantic Web are
  • “Lingual” (variations within one language)
  • Multilingual (between languages and cultures)
• NLP applications need domain and multilingual adaptation
  • Lexicon updates / extensions
  • Extraction rule updates / extensions
• What do we need?
  • Efficient adaptation and sharing of linguistic resources between ontology-based NLP applications
Links and resources
• Tutorial website
  • http://tiny.cc/tvzlc
• The Monnet Project
  • Multilingual Ontologies for Networked Knowledge
  • http://www.monnet-project.eu/
• LexInfo
  • http://lexinfo.net/