Ilan Kernerman: Generating Multilingual Lexicographic Resources

MLODE, Leipzig, 2 September 2014. Generating multilingual lexicographic resources. Ilan Kernerman, K Dictionaries Ltd, Tel Aviv

Upload: mbruemmer

Post on 07-Jul-2015

DESCRIPTION

Ilan Kernerman (K Dictionaries Ltd, Tel Aviv) introduced K Dictionaries Ltd, which has a long tradition in lexicography and dictionary development. He shared his experience of the transition, driven by technological development, from traditional dictionaries to multilingual datasets, data management, and software engineering, architecture and design. Today the K Dictionaries resources comprise multilingual databases for over 20 major and some minor languages, including linguistic information on morphology and pronunciation, as well as lexicographic editorial tools and applications. The main focus is the quality of the language data: the data is first collected and edited manually by first-language speakers to build monolingual datasets, which are then extended and connected to form bilingual and multilingual datasets via automatic translations. The main goal is to derive value from traditional lexicography for applications such as machine translation, e-learning, word processing, text mining and search engines. Using linguistic linked open data is desirable because of its inherent interconnectedness and the vast amount of language data available. However, integrating this data suffers from the mediocre quality of automatically created content. The challenge is to arrive at automatically generated, high-quality content that can resolve complex cross-linguistic relations, which rarely have a 1:1 equivalence (for instance in compound words), and to extend the few existing quality-sensitive domains, e.g. education and healthcare, which are already interested in high-quality linguistic data.

TRANSCRIPT

Page 1: Ilan Kernerman: Generating Multilingual Lexicographic Resources

MLODE

Leipzig, 2 September 2014

Generating multilingual lexicographic resources

Ilan Kernerman

K Dictionaries Ltd, Tel Aviv

Page 2: Ilan Kernerman: Generating Multilingual Lexicographic Resources

K DICTIONARIES

MLODE • 20140902 2

Established in 1993, based in Tel Aviv

Focus on Technology-Driven Content

Create lexicographic data for 40+ languages

Cooperate worldwide with

- language editors, translators and technicians

- software engineers, architects and designers

- digital & print publishing partners and LT firms

- the academe and professional associations

Page 3: Ilan Kernerman: Generating Multilingual Lexicographic Resources

RESOURCES

Dictionaries for English learners & native speakers

Dictionaries for native & foreign language learning

Dictionaries for bilingual & multilingual translation

Multi-language/Multi-layer datasets

Lexicographic editorial tools & applications

Morphology & pronunciation

Language supplements, audio & pictures

Page 4: Ilan Kernerman: Generating Multilingual Lexicographic Resources

VISION

Page 5: Ilan Kernerman: Generating Multilingual Lexicographic Resources

MAPPING

AlternativeScripting

AlternativeSpelling

Antonym

CompositionalPhrase

CrossReference

Definition

Example

GeographicalUsage

GrammaticalGender

GrammaticalNumber

HomographNumber

Lemma

Morphology

PartOfSpeech

Pronunciation

RangeOfApplication

Register

SenseIndicator

SenseQualifier

SubCategorization

SubjectField

Synonym
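The tag names on this slide suggest an entry/sense data model. The following is a minimal Python sketch of how a subset of them could be held in memory. The nesting (which tags belong to the entry and which to a sense), the class names `Entry` and `Sense`, and the `translations` field are assumptions made for illustration; the slide gives only the tag names, not the schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sense:
    """One sense of an entry; field names mirror a subset of the slide's tags."""
    definition: str
    examples: list[str] = field(default_factory=list)
    sense_indicator: Optional[str] = None
    register: Optional[str] = None
    subject_field: Optional[str] = None
    synonyms: list[str] = field(default_factory=list)
    antonyms: list[str] = field(default_factory=list)
    # Assumed multilingual layer: language code -> translation equivalents.
    translations: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class Entry:
    """A dictionary entry; placing these tags at entry level is an assumption."""
    lemma: str
    part_of_speech: str
    homograph_number: Optional[int] = None
    pronunciation: Optional[str] = None
    senses: list[Sense] = field(default_factory=list)


# Example populated from the "runaway" sample shown later in the deck.
runaway = Entry(
    lemma="runaway",
    part_of_speech="noun",
    senses=[
        Sense(
            definition="a person, animal etc that runs away",
            examples=["The police caught the two runaways."],
            translations={"fr": ["fugitif/-ive"], "it": ["fuggiasco", "fuggitivo"]},
        )
    ],
)
```

Keeping translations on the sense rather than on the entry reflects the deck's point that equivalents are linked per sense, not per headword.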

Page 6: Ilan Kernerman: Generating Multilingual Lexicographic Resources

MULTI-LAYER

network

Monolingual

Multilingual

Bilingual

Page 7: Ilan Kernerman: Generating Multilingual Lexicographic Resources

EVOLUTION

1. Monolingual English learner’s dictionary

2. Bilingual English learner’s dictionary

3. Multilingual English dictionary

4. L2-English reversed indices

5. L2 bilingual glossaries

6. L2, L3 etc. multilingual dictionaries

Page 8: Ilan Kernerman: Generating Multilingual Lexicographic Resources

ENGLISH MULTILINGUAL

PASSWORD semi-bilingual dictionary

KEMD (44 languages):

Afrikaans | Arabic | Bulgarian | Catalan | Chinese (Simplified | Traditional) | Croatian | Czech | Danish | Dutch | English | Estonian | Farsi | Finnish | French | German | Greek | Hebrew | Hindi | Hungarian | Icelandic | Indonesian | Italian | Japanese | Korean | Latvian | Lithuanian | Malay | Norwegian | Polish | Portuguese (Brazil | Portugal) | Romanian | Russian | Serbian | Slovak | Slovene | Spanish | Swedish | Thai | Turkish | Ukrainian | Urdu | Vietnamese

Page 9: Ilan Kernerman: Generating Multilingual Lexicographic Resources

L2 MULTILINGUALS

Generating L2-English Index

― Produce L2 Index table

― Produce L2 Senses table

Editing L2 Index

― Include/Exclude HW in L2 Index

― Include/Exclude Sense (checkbox in Tree preview)

― Edit L2 HW and POS

― Edit the Entry (modify, add, remove, re-order Senses)

― Search Sense in English HW or Definition and add it

Translating Multilingually

― Link L2 HW via each English Sense to translations in all other languages (of the English multilingual)
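The generation and linking steps on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not K Dictionaries' actual tooling: the record layout, the sense identifiers (`runaway.n.1`, `stray.adj.1`) and the function names are hypothetical. The sample data reuses the Swedish headword "bortsprungen" and the translations shown in the raw samples later in the deck.

```python
from collections import defaultdict

# Hypothetical English multilingual data: one record per English sense.
ENGLISH_SENSES = [
    {
        "id": "runaway.n.1",
        "headword": "runaway",
        "pos": "noun",
        "definition": "a person, animal etc that runs away",
        "translations": {
            "sv": ["bortsprungen"],
            "fr": ["fugitif/-ive"],
            "ru": ["беглец"],
        },
    },
    {
        "id": "stray.adj.1",
        "headword": "stray",
        "pos": "adjective",
        "definition": "wandering or lost",
        "translations": {
            "sv": ["bortsprungen"],
            "fr": ["errant"],
            "ru": ["бездомный"],
        },
    },
]


def build_l2_index(senses, l2):
    """Reverse the English data: L2 headword -> ids of English senses it translates."""
    index = defaultdict(list)
    for sense in senses:
        for headword in sense["translations"].get(l2, []):
            index[headword].append(sense["id"])
    return dict(index)


def link_multilingual(senses, l2, headword, targets):
    """Follow each English sense of an L2 headword to its other translations."""
    by_id = {sense["id"]: sense for sense in senses}
    linked = []
    for sense_id in build_l2_index(senses, l2).get(headword, []):
        sense = by_id[sense_id]
        linked.append(
            {
                "english": sense["headword"],
                # The reversed entry inherits the English POS, which is one
                # reason the slide lists "Edit L2 HW and POS" as a manual step.
                "pos": sense["pos"],
                "translations": {t: sense["translations"].get(t, []) for t in targets},
            }
        )
    return linked


index = build_l2_index(ENGLISH_SENSES, "sv")
links = link_multilingual(ENGLISH_SENSES, "sv", "bortsprungen", ["fr", "ru"])
```

Here `index` maps "bortsprungen" to two English senses, and `links` carries one block of French and Russian equivalents per sense. The automatic reversal only proposes candidates; the include/exclude and re-ordering steps on the slide remain editorial decisions.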

Page 10: Ilan Kernerman: Generating Multilingual Lexicographic Resources

SAMPLE. INDEX PREVIEW (FRENCH)

Page 11: Ilan Kernerman: Generating Multilingual Lexicographic Resources

SAMPLE. INDEX PREVIEW (RUSSIAN)

Page 12: Ilan Kernerman: Generating Multilingual Lexicographic Resources

SAMPLE. EDIT BY DEFINITION

Page 13: Ilan Kernerman: Generating Multilingual Lexicographic Resources

SAMPLE. SWEDISH MULTILINGUAL (RAW)

bortsprungen 1

runaway noun

a person, animal etc that runs away

◊ The police caught the two runaways.

■ (also adjective) ◊ a runaway horse.

af wegloper | ar هارب، شارد، جامح | bg беглец | br fugitivo | ca fugitiu | cs uprchlík/-ice, uprchlý | de der/die Ausreißer(in); durchgebrannt | dk bortløben | el φυγάδας | es fugitivo | et põgenik | fa فراری | fi karkuri | fr fugitif/-ive | he בורח | hi अनियंत्रित, उच्छृंखल, बहुत सहज | hr odbjegao | hu szökevény | id pelarian | it fuggiasco, fuggitivo | ja 逃亡者 | ko 도망자 | lt pabėgėlis; pabėgęs | lv bēglis; izbēdzis | ml cabut lari | nl vluchteling | pl zbiec | ps فراری | pt fugitivo | ro evadat, fugar | ru беглец | sk utečenec/ka; na úteku, ktorý ušiel | sl ubežnik; pobegel | sr odbegao | th ผู้หลบหนี | tr kaçak, firari | tw 逃跑的人或動物 | uk утікач; дезертир | ur فرار ہو جانا | vi kẻ chạy trốn | zh 潜逃者,逃跑者

Page 14: Ilan Kernerman: Generating Multilingual Lexicographic Resources

SAMPLE. SWEDISH MULTILINGUAL (RAW)

bortsprungen 2

stray adjective

wandering or lost

◊ stray cats and dogs.

af weglopend | ar شارد، ضال، تائه | bg изгубен | br perdido | ca perdut, extraviat, llista de carrers | cs zatoulaný | de streunend | dk omstrejfende; herreløs | el αδέσποτος | es perdido, extraviado, callejero | et hulkuv | fa گمشده | fi kuljeksiva | fr errant | he חסר בית | hi भटका, भूला भटका | hr zalutao, zabludio | hu elkóborolt | id sesat | it randagio | ja はぐれた | ko 길잃은 | lt benamis, valkataujantis | lv noklīdis; klaiņojošs | ml terbiar | nl zwerf- | pl bezdomny | ps ورک شوی | pt perdido | ro rătăcit | ru бездомный | sk zatúlaný | sl klateški | sr izgubljen | th ซึ่งพลัดหลง | tr başıboş dolaşan | tw 漫遊的 | uk бездомний | ur لاوارث یا آوارہ | vi lạc, mất | zh 漫游的
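The raw translation lines in these two samples follow a visible pattern: a short language code, then one or more equivalents, with entries separated by `|`. A minimal Python sketch of parsing such a line into a lookup table is shown below. The function name `parse_translations` is hypothetical, and treating commas and semicolons as equal separators is a simplifying assumption: in the samples the semicolon often seems to separate noun and adjective equivalents, and that distinction is lost here. Script-specific separators such as the Arabic comma are also not handled.

```python
def parse_translations(raw: str) -> dict[str, list[str]]:
    """Parse 'code equivalents | code equivalents | ...' into a dictionary."""
    result: dict[str, list[str]] = {}
    for chunk in raw.split("|"):
        chunk = chunk.strip()  # tolerate missing spaces around the bar
        if not chunk:
            continue
        # The first whitespace-delimited token is the language code.
        code, _, text = chunk.partition(" ")
        # Simplification: commas and semicolons both separate equivalents.
        parts = [p.strip() for p in text.replace(";", ",").split(",")]
        result[code] = [p for p in parts if p]
    return result


raw = (
    "af weglopend | bg изгубен | es perdido, extraviado, callejero "
    "|dk omstrejfende; herreløs | nl zwerf-"
)
parsed = parse_translations(raw)
```

The language codes are kept exactly as they appear in the sample (for example `dk`, `br`, `tw`), since the deck does not say which code standard they follow.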

Page 15: Ilan Kernerman: Generating Multilingual Lexicographic Resources

MULTILINGUAL – ANOTHER TYPE

Ilan Kernerman • LD4LT • 20140321 15

http://kdictionaries-online.com/nlMLDSplus.aspx

http://kdictionaries-online.com/frMLDS.aspx?Languages=ar,zh,nl,de
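The second URL carries a comma-separated `Languages` query parameter. Assuming that this parameter selects which translation languages are displayed (the slide does not say so explicitly), a URL of the same shape could be assembled as in the sketch below. The helper name `mlds_url` is hypothetical, and nothing here checks whether the service is still reachable.

```python
from urllib.parse import urlencode

# Base address copied from the slide.
BASE = "http://kdictionaries-online.com/frMLDS.aspx"


def mlds_url(languages: list[str]) -> str:
    """Build a URL with a comma-separated Languages parameter."""
    # safe="," keeps the commas literal instead of percent-encoding them.
    query = urlencode({"Languages": ",".join(languages)}, safe=",")
    return f"{BASE}?{query}"


url = mlds_url(["ar", "zh", "nl", "de"])
```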

Page 16: Ilan Kernerman: Generating Multilingual Lexicographic Resources

LT APPLICATION

Machine translation

E-learning

Word processing

Text-mining, search engines, etc.

User & travel guides, menus, etc.

Text-To-Speech, STT, etc.

Globalization & localization

Page 17: Ilan Kernerman: Generating Multilingual Lexicographic Resources

THANK YOU

[θæŋk juː] interj. I thank you: Thank you for your attention!

Afrikaans: dankie
Arabic: شكرا، أشكرك
Bulgarian: благодаря
Chinese Simplified: 谢谢(你)
Chinese Traditional: 謝謝(你)
Croatian: hvala
Czech: děkuji
Danish: tak
Dutch: dank je
Estonian: aitäh, tänan teid
Farsi: ممنون
Finnish: kiitos
French: merci
German: danke
Greek: (σε, σας) ευχαριστώ
Hebrew: תודה
Hindi: धन्यवाद देने या मना करने का एक
Hungarian: köszönöm!
Icelandic: þakka þér
Indonesian: terima kasih
Italian: grazie
Japanese: ありがとう
Korean: 감사합니다
Latvian: paldies; pateicos
Lithuanian: ačiū
Malay: terima kasih
Norwegian: tusen takk (for)
Polish: dziękuję
Portuguese Brazil: obrigado/-da
Portuguese Portugal: obrigado/-da
Romanian: mulţumesc
Russian: благодарю
Serbian: hvala
Slovak: ďakujem
Slovene: hvala
Spanish: gracias
Swedish: tack [ska du/ni ha]!, tackar!
Thai: การแสดงความขอบคุณ
Turkish: teşekkür ederim
Ukrainian: дякую; спасибі
Urdu: آپ کا شکریہ
Vietnamese: cảm ơn