Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web
Felix Sasaki
DFKI / University of Applied Sciences Potsdam
W3C German-Austrian [email protected]
W3C Workshop “The Multilingual Web - Where Are We?” 26-27 October 2010, Madrid
Purpose of this talk (1)
• Show gaps
  – Between machines
  – Between machines and humans
• … which we need to fill to bridge gaps between humans
Purpose of this talk (2)
• Identify groups / communities
  – To fill gaps
  – To come together in new alliances
Basics: What are machines doing (not only on the Web)?
Language Technology
• Summarization
LT → “These texts are about ...”
Language Technology
• Machine Translation
このワークショップは…で開催される → LT → “The workshop takes place in …”
Language Technology
• Spell and grammar checking
“The worksop take place in …” → LT → “The workshop takes place in …”
• And many more applications
  – Coreference resolution, discourse analysis, named entity recognition, natural language generation, question answering, …
Text mining
• Finding out things you did not know
Text mining output:
• “Text A and text B are similar”
• “The text collection has clusters of topics: …”
Visualization of results
Basics: What are machines doing (not only on the Web)?
How are they doing it?
They are using resources
Resources in language technology
• Sample resources for summarization
LT → “These texts are about ...”
Resources: NLG output, text mining output, stop word list, …
Language Technology
• Sample resources in Machine Translation
このワークショップは…で開催される → LT → “The workshop takes place in …”
Resources: lexicon, grammar, (training) corpora, …
Generation
Language Technology
• Sample resources for spell and grammar checking
“The worksop take place in …” → LT → “The workshop takes place in …”
Resources: lexicon, grammar, …
Text mining
• Sample resources for text mining
Text mining output:
• “Text A and text B are similar”
• “The text collection has clusters of topics: …”
Resources: lexicon, stop word list, …
In general: you need three types of data: input, resources, workflow
Input → Workflow → Output
Resources, resources, … (used by the workflow)
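The split into input, resources and workflow can be made concrete with a small sketch. All names below (`Resource`, `Workflow`, `remove_stop_words`) are hypothetical and chosen only for illustration; they are not taken from any particular language technology tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical types: they only illustrate the split between
# input, resources and workflow described on the slide.
@dataclass
class Resource:
    name: str      # human-readable label, e.g. "stop word list"
    data: object   # the content of the resource itself


# A workflow step receives the current tokens plus all available resources.
Step = Callable[[List[str], Dict[str, Resource]], List[str]]


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)

    def run(self, text: str, resources: Dict[str, Resource]) -> List[str]:
        tokens = text.lower().split()
        for step in self.steps:
            tokens = step(tokens, resources)
        return tokens


def remove_stop_words(tokens: List[str], resources: Dict[str, Resource]) -> List[str]:
    # The step does not hard-code the stop words; it looks them up as a resource.
    stop_words = resources["stop_words"].data
    return [t for t in tokens if t not in stop_words]


resources = {"stop_words": Resource("stop word list", {"the", "in", "a"})}
workflow = Workflow(steps=[remove_stop_words])
output = workflow.run("The workshop takes place in Madrid", resources)
print(output)
```

Keeping the resource outside the step is the point of the sketch: the same workflow can be rerun with a different stop word list, and a description of the run can name exactly which resource was used.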
What gaps need to be filled for truly “multilingual content processing”?
• Gap 1: machines don’t use metadata available in the input
• Gap 2: machines don’t know about the workflow (input) data goes through
• Gap 3: machines don’t make explicit
  – “Who” they are
  – What resources they are using
Gap 1: machines don’t use metadata available in the input
• Input from www.postbank.de:
  „Ob Postbank direkt, Online-Banking, Online-Brokerage oder myBHW. Die häufigsten Fragen zu unseren Transaktionssystemen finden Sie an dieser Stelle.“
• Output via Google Translate:
  “Whether Postbank direct, online banking, online brokerage or myBHW. Frequently asked questions about our transaction systems can be found at this location.”
Fixed terminology should not have been translated. But the MT tool had no chance to “know” that. Why?
Gap 2: machines don’t know about processes data goes through
• Input from the database, the “hidden web”:
  „Ob <term>Postbank direkt</term>, <term>Online-Banking</term>, <term>Online-Brokerage</term> …“
• Output on the Web:
  „Ob <em>Postbank direkt</em>, <em>Online-Banking</em>, <em>Online-Brokerage</em> …“
In the publication process, the fixed terminology (= metadata) is lost on the Web.
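A minimal sketch of this loss, assuming the publication step is a simple tag rewrite. The function names are hypothetical, and the real publication process behind the example site is not known. The metadata-preserving variant borrows the `property="its:term"` span used in the RDFa example of this talk.

```python
import re

# Source content as stored in the database, with explicit term markup.
SOURCE = "Ob <term>Postbank direkt</term>, <term>Online-Banking</term> …"

TERM = re.compile(r"<term>(.*?)</term>")


def publish_lossy(source: str) -> str:
    # Terms become plain emphasis: a human still sees them highlighted,
    # but a machine can no longer tell a term from any other emphasis.
    return TERM.sub(r"<em>\1</em>", source)


def publish_preserving(source: str) -> str:
    # Assumption: the term status is carried along as machine-readable markup.
    return TERM.sub(r'<span property="its:term">\1</span>', source)


def recover_terms(published: str) -> list:
    # What a downstream tool (e.g. machine translation) could find again.
    return re.findall(r'<span property="its:term">(.*?)</span>', published)


lossy = publish_lossy(SOURCE)
kept = publish_preserving(SOURCE)
print(recover_terms(lossy))  # nothing is recoverable from the <em> version
print(recover_terms(kept))
```

With the preserving variant, a translation tool further down the chain could look up the marked spans and leave them untranslated.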
Gap 3: no common identification …
• Of metadata and process chains (previous slides)
• Of resources, e.g. what is a lexicon
  – In machine translation?
  – In localization?
  – For a human reader?
  – Ability to combine tools depends on knowing about them (capabilities, resources) in detail
Who can fill these gaps – people dealing with multilingual content
• Content producers
  – Allow for terminology identification in source formats / CMS
• Localizers
  – Make localization workflows aware of (process / source content) metadata
• “Machine” experts
  – Make their tools sensitive to source content metadata and expose their capabilities (what resources / workflows) in a clearly defined way
Who can fill these gaps – people dealing with multilingual content
• Users
  – Add metadata to source content
  – Use (machine translation) tools without knowing the details, e.g. in the browser!
• Browser vendors
  – Create APIs which make use of automatic tools / resource and workflow descriptions / source code metadata
• …
The people in this room!
How can they fill the gaps?
• All these groups need to agree upon one machine-readable information space for filling the gaps
• It’s actually already here – the Semantic Web!
What is the Semantic Web
• The Web as humans see it: Identification of “meaning” e.g. via (typographic or other) conventions
„Ob Postbank direkt …“
What is the Semantic Web
• The Web as machines see it: Identification of meaning via RDF-based mechanisms (here via RDFa)
„Ob <span property="its:term">Postbank direkt</span> …“
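As a rough illustration of how a machine could read this markup, the sketch below collects the text of every `span` carrying `property="its:term"`, using only Python's standard `html.parser`. It is not a full RDFa processor: prefix declarations, `vocab` and other RDFa features are ignored, and the class name `TermExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class TermExtractor(HTMLParser):
    """Collects the text inside <span property="its:term"> elements."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._depth = 0    # > 0 while we are inside a term span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested spans so the matching end tag is found correctly.
            if tag == "span":
                self._depth += 1
        elif tag == "span" and dict(attrs).get("property") == "its:term":
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.terms.append("".join(self._buffer))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


html = ('Ob <span property="its:term">Postbank direkt</span>, '
        '<span property="its:term">Online-Banking</span> …')
extractor = TermExtractor()
extractor.feed(html)
extractor.close()
print(extractor.terms)
```

A machine translation front end could run such an extraction first and then protect the collected strings from translation.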
What is the Semantic Web: RDF in 30 seconds
• A framework for making statements about resources, using URIs
• RDF can help to fill our gaps
  1. Metadata in the input
  2. Metadata for workflows
  3. Identify 1., 2. and language technology resources uniquely
• In one information space: the machine-readable Web
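The idea of statements about URI-identified resources can be sketched without any RDF library, using plain (subject, predicate, object) tuples. Every URI below lives under the placeholder namespace `http://example.org/` and every property name (`isTerm`, `processedBy`, `usesResource`) is invented for this illustration; none of them is an existing vocabulary.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each statement is a (subject, predicate, object) triple.
triples = {
    # 1. Metadata in the input: a fragment of the document is a fixed term.
    (EX + "doc/faq#t1", EX + "vocab/isTerm", "true"),
    # 2. Metadata for workflows: which tool processed the document.
    (EX + "doc/faq", EX + "vocab/processedBy", EX + "tool/mt-1"),
    # 3. Unique identification of the resource that tool relies on.
    (EX + "tool/mt-1", EX + "vocab/usesResource", EX + "lexicon/banking-de-en"),
}


def objects(subject: str, predicate: str) -> list:
    """All objects of statements with the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def resources_behind(document: str) -> list:
    """Follow document -> tool -> resource across the statements."""
    found = []
    for tool in objects(document, EX + "vocab/processedBy"):
        found.extend(objects(tool, EX + "vocab/usesResource"))
    return found


print(resources_behind(EX + "doc/faq"))
```

Because documents, tools and resources all share one identifier space, a single query can cross from content metadata to workflow metadata to the resources behind a tool.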
Instead of a summary: call for (participation in) project proposals
• Who needs to come together
  – Content producers, localizers, “machine” experts, browser vendors, users
• What should their work be based upon
  – Semantic Web technologies
  – Clear interfaces to the human (e.g. browser) Web, like RDFa
• What we do not need
  – Web-centred standardization of formats for language resources themselves; that is already done elsewhere (see this session)
• Where is the place to do that work?
  – W3C, since it needs to be part of core Web technologies
• For making it happen, we need a strong alliance of Web technologies, other fields and machine technologies
META-NET
• EU-funded project, closely related to “Multilingual Web”
• Main aim: build an alliance for improving language technologies in Europe
• Laaarge: soon 40+ participating organizations in 30+ countries
• Very important: bring users of language technology in
META-NET
• Users and language technology companies in Europe are not only large companies, but more and more SMEs
• META-NET targets these small and fast units, including you
• EU has started special funding programs for SMEs – see http://tinyurl.com/eu-lt-sme (“objective 4.1”)
META-NET
• Event: META-NET Forum
• Brussels, November 17th/18th
• Aim: Bring users / language technology developers / policy makers together
• Discuss a road map for the next 10 years of language technology and its applications
• Details and registration at http://www.meta-net.eu/events