Transcript
Page 1: Sarawak Language Technology (SaLT) Research Group

SaLT Initiatives: Preservation and Maintenance of Sarawak Languages

Faculty of Computer Science and Information Technology

Universiti Malaysia Sarawak

Associate Professor Alvin W. Yeo

Page 2:

Overview

• Languages in Sarawak
• Maintenance and Revitalisation: Holistic Approach
  – Sarawak Language Technology (SaLT) Research Group
• SaLT Projects
  – Borneo Corpus Management System (BCMS)
  – Iban-English Machine Translation
    • TRanslation IBan-English (TRIBE)
  – Multimodal-INTegration (MINT) of Sketch and Melanau Daro-Matu Speech in Spatial Queries
  – Spoken Language Dialogue Systems (SLaDS)
  – Development of Language Tools
• Current findings

Page 3:

Where are we?

Page 4:

East Malaysia > Sarawak > Kuching

Kuching

Page 5:

Introduction (cont’d)

• Sarawak is a state rich in culture.
  – 27 ethnic groups in Sarawak (STB, 2005), each with its own culture and language.
  – Sarawak has 46 living languages and 1 extinct, according to the Ethnologue (Gordon, 2005).
  – Each ethnic group may have different languages.
  – Sarawak Dewan Bahasa dan Pustaka
    • 63 known languages in Sarawak

Page 6:

Rationale

Population           No. of languages   Cumulative no. of languages   Cumulative (%)
1 – 100              4                  4                             9%
101 – 500            8                  12                            27%
501 – 1,000          4                  16                            36%
1,001 – 5,000        18                 34                            76%
5,001 – 10,000       4                  38                            84%
10,001 – 50,000      6                  44                            98%
50,001 – 100,000     0                  44                            98%
100,001 and above    1                  45                            100%
Extinct              1                  1
No data available    1                  1

Cumulative languages and number of speakers (Ethnologue, 2005)
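
The cumulative columns of the table can be reproduced from the per-band counts. The following is a minimal sketch in Python, assuming the percentages are taken over the 45 languages that have population data (the extinct language and the one without data are left out); the variable names are illustrative only.

```python
from itertools import accumulate

# (population band, number of languages), copied from the table above
bands = [
    ("1-100", 4),
    ("101-500", 8),
    ("501-1,000", 4),
    ("1,001-5,000", 18),
    ("5,001-10,000", 4),
    ("10,001-50,000", 6),
    ("50,001-100,000", 0),
    ("100,001 and above", 1),
]

counts = [n for _, n in bands]
cumulative = list(accumulate(counts))   # running total of languages
total = cumulative[-1]                  # languages with population data
percentages = [round(100 * c / total) for c in cumulative]

for (band, n), c, p in zip(bands, cumulative, percentages):
    print(f"{band:<20}{n:>3}{c:>5}{p:>5}%")
```

Read this way, 34 of the 45 languages (76%) have 5,000 speakers or fewer, which is the point the table makes.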

Page 7:

Problem

• World’s linguistic and cultural diversity is under threat.
  – Many minority languages are on the brink of extinction.
• Minority language communities
  – Further disadvantaged economically and socially.
  – Dominant languages
  – Exogamy
• Revitalizing minority languages can bring economic and social benefits as well as cultural benefits.

Page 8:

Holistic Approach: Framework for Language Revitalization and Maintenance

[Diagram: layered framework. Labels recovered from the figure, grouped by layer:]

Preservation of Culture

Applications
• Internet: online presence
• Software applications and operating systems
• Hardware: input devices (keyboards, tablets/pen/stylus)

Supporting Technologies
• Web technologies: Java, Flash
• Methodologies: engaging communities; development lifecycles
• Computing technologies: Natural Language Processing, Image Processing, Speech Recognition and Generation

Stakeholders
• Community/civil society
• Research institutions
• Government agencies
• NGOs
• Industry
• Ethnic group organisations

People
• Trainers
• Translators
• Linguists
• IT spec.
• Comp. Scientists
• Researchers
• Social Scientists

Community readiness: ICT literacy

Page 9:

Sarawak Language Technologies (SaLT) Research Group

Page 10:

SaLT

• Role of technology in language maintenance and revitalisation
• Focus on revitalising and maintaining the existing conventional languages by building corpora, conducting research and developing tools for Sarawak ethnic languages.

Sarawak Language Technologies (SaLT) Research Group covers

• Codification of the ethnic languages
  – Creation of corpora of the various languages in Sarawak
• Research in computational linguistics projects
  – which involve the languages and peoples of Sarawak
• Development of tools: word processors, spell checkers

Page 11:

Language Technology

• Understanding and explication of language phenomena in a
  – computationally tractable form, resulting in
  – techniques for interchanging various linguistic forms
    • speech, text, morphology, syntax, semantics/meaning, discourse, knowledge,
  – thus leading to the creation and development of intelligent applications involving language.

Page 12:

Levels of Technology

[Diagram: levels of technology. Labels recovered from the figure:]

• INPUT (corpus)
• PROCESSOR (tagger, parser, multimodal integration)
• APPLICATION (machine translation, multimodal spatial application)

Specialist labels: Lexicographer/Linguist/comp. scientist; Linguist/comp. scientist

General and conceptual dictionary

Page 13:

Specialists Needed

• Lexicographers
• Computer scientists: DBA, SE & N/W (data maintenance & grid)
• Linguists
• Information Scientists
• Psychologists
• Anthropologists
• Computational Linguistics / Natural Language Processing

Page 14:

Current Projects

Page 15:

Current Projects (cont’d)

Page 16:

Roadmap for SaLT

Page 17:

Advisors and Organisations Involved

No.  Name                            Expertise                                                        Organisation
1.   Prof. Zaharin Yusoff            Computational Linguistics (CL) & Natural Language Proc. (NLP)    MMU
2.   Prof. Ahmad Zaki Abu Bakar      CL & NLP                                                         UTM
3.   AP Dr Normaziah Abdul Aziz      NLP & Artificial Intelligence                                    UIAM
4.   Prof. Dr. Tang Enya Kong        CL & NLP                                                         MMU
5.   Dr. Bali Ranaivo                NLP & CL                                                         MMU
6.   Prof. Dr. Zuraidah Mohd. Don    Linguistics                                                      UM
7.   Dr. Gerry Knowles               Phonetics and Phonology                                          MIQUEST Worldwide Sdn Bhd
8.   Professor Dr. Peter Songan      Community development                                            UNIMAS

Page 18:

Collaborators

Organisations Involved
1. Tun Jugah Foundation
2. Dewan Bahasa dan Pustaka (Sarawak Branch)
3. Melanau Association
4. Dayak Bidayuh National Association
5. Sarawak Museum
6. Pustaka Negeri Sarawak
7. Majlis Adat Istiadat

Universities Involved
1. UNIMAS (FCSIT, FCSHD, FSS, CLS)
2. Multimedia University
3. Universiti Teknologi Malaysia
4. Universiti Islam Antarabangsa Malaysia
5. Universiti Sains Malaysia
6. Universiti Malaya
7. Localisation Research Centre, University of Limerick, Ireland
8. University of Waikato, New Zealand

Page 19:

Team members

a. Staff

FCSIT
• AP Dr Alvin Yeo Wee (Head)
• AP Dr. Narayanan K.
• Dr Edwin Mit
• Suhaila Saee
• Sarah Flora Samson
• Nurfauza Jali
• Suriati Khartini Jali
• Sy. Fazlin Seyed Fadzir
• Lee Jun Choi

FCSHD
• Dr. Ng Giap Weng
• D’oria Islamiah
• Wan Norizan

CLS
• Dr. Ting Su Hie
• Salbia Hassan
• Yvonne Michelle Campbell

Page 20:

Team members (cont’d)

b. Research Assistants
1. Beatrice Chin (FCSIT)
2. Teh Lee Na (FCSIT)
3. Jennifer Wilfred (FCSIT)
4. Lai Nyong Fock (FCSIT)
5. Mohd. Hanafiah Semuni (FCSHD)
6. Loh Chee Wyai (FCSIT)
7. Ang Siaw Tiong (FCSIT)

c. Students

Level                                  No. of Students
Post-graduate: PhD                     2
Post-graduate: Master by Research      6
Post-graduate: Master by Coursework    5
Undergraduate                          22
Total                                  35

Page 21:

Borneo Corpus Management System (BCMS)

• Problem/Background:
  – Currently, no corpus management system exists to manage corpora in the minority languages of Sarawak
• Solution:
  – Build a system that is able to manage and maintain the corpora
• Objectives:
  – To design an easy-to-use Corpus Processing Toolkit for researchers
  – To integrate the various tools in one single platform
• Current Status:
  – Working on the Morphological Analysers and Spell Checkers

Page 22:

Corpus Manager (After processing)

[Screenshot of the Corpus Manager, with callouts:]

• Original Content / Processed Content
• Editable Content
• Used to highlight the extracted information in the content
• File tree that displays the processed files. Each file is stored in a folder based on its category

Page 23:

Corpus Analyser: Sentence Splitter

The output lists each sentence of the current document
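
What a sentence splitter of this kind does can be sketched in a few lines of Python. This is not the BCMS implementation, which the slides do not show; it is a simple rule-based illustration that assumes sentences end in ".", "!" or "?" and that the next sentence starts with an uppercase letter or a quotation mark. The function name `split_sentences` is hypothetical.

```python
import re

# A boundary is whitespace that follows ., ! or ? and precedes an
# uppercase letter or an opening quote (an assumed, simplified rule).
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')

def split_sentences(text: str) -> list[str]:
    """Return the sentences of `text`, one string per sentence."""
    parts = _BOUNDARY.split(text.strip())
    return [p.strip() for p in parts if p.strip()]

if __name__ == "__main__":
    sample = "Sarawak is rich in culture. Each group has its own language! Where are we?"
    for sentence in split_sentences(sample):
        print(sentence)
```

A rule this simple also splits after abbreviations such as "Dr." when a capitalised name follows, so a practical splitter for any given language would need an abbreviation list or a trained model.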

Page 24:

Iban-Corpus Development

• Problem/Background
  – Indigenous languages in Sarawak are slowly dying out due to:
  – One way to stem this “extinction” of languages:
    • Provide more local content – but how??
• Solution
  – Translate English documents to documents in minority languages
  – MT is needed to facilitate and accelerate the translation process
• Objectives
  – Identify a methodology that can be used to translate English to minority languages, taking Iban as a case study
• Current Status
  – Built Iban corpus of 23,833 words, with 3,831 distinct words
  – Constructed bilingual lexicon of 1,688 words, with 1,192 distinct words
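
The two corpus figures are a total word count (tokens) and a distinct word count (types). A minimal sketch of how such counts can be obtained is given below; the tokenisation rule (lowercased runs of letters, optionally joined by a hyphen or apostrophe) is an assumption for illustration, not the rule used by the project, and `corpus_stats` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed word pattern: letters, optionally joined by ' or - inside a word.
_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def corpus_stats(text: str) -> tuple[int, int, Counter]:
    """Return (number of tokens, number of distinct words, frequency table)."""
    words = _WORD.findall(text.lower())
    freq = Counter(words)
    return len(words), len(freq), freq

if __name__ == "__main__":
    tokens, types, freq = corpus_stats("The cat saw the other cat.")
    print(tokens, types)          # 6 4
    print(freq.most_common(2))
```

With the figures on the slide, 3,831 distinct words out of 23,833 gives a type-token ratio of about 0.16.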

Page 25:

Iban-English Machine Translation

• Problem/Background
  – Traditional knowledge (TK) is tacit knowledge; generally not stored and known only by the older generation, who speak little English
  – TK is very important. It needs to be preserved and protected.
  – Machine Translation (MT) can help to preserve TK
  – Translate available resources into English so that they are accessible to all, e.g. researchers (social scientists) and the younger generation
  – However, translation of closely related languages is easier
• Solution
  – Translate TK documents to English through a closely related language as pivot language
  – Case study: Iban as source language, Malay as pivot language and English as target language
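
The pivot idea can be illustrated at the lexicon level: a source-to-pivot lexicon and a pivot-to-target lexicon are chained to give source-to-target entries. The sketch below is an assumption-laden toy, not the TRIBE system; the function `compose_lexicons` is hypothetical and the three word entries are illustrative examples only, not taken from the project's lexicon.

```python
def compose_lexicons(src_to_pivot, pivot_to_tgt):
    """Chain two lexicons (word -> list of translations) through the pivot."""
    composed = {}
    for src_word, pivot_words in src_to_pivot.items():
        targets = []
        for pivot_word in pivot_words:
            for tgt in pivot_to_tgt.get(pivot_word, []):
                if tgt not in targets:      # keep order, drop duplicates
                    targets.append(tgt)
        if targets:                         # skip words with no path to the target
            composed[src_word] = targets
    return composed

# Toy entries for illustration only (Iban -> Malay, Malay -> English).
iban_to_malay = {"makai": ["makan"], "ai": ["air"], "rumah": ["rumah"]}
malay_to_english = {"makan": ["eat"], "air": ["water"], "rumah": ["house", "home"]}

iban_to_english = compose_lexicons(iban_to_malay, malay_to_english)

if __name__ == "__main__":
    for word, translations in iban_to_english.items():
        print(word, "->", ", ".join(translations))
```

The design point is reuse: once a Malay-English resource exists, each additional closely related language only needs a mapping into Malay.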

Page 26:

• Objectives
  – To demonstrate that the performance of translation through a pivot language is comparable with the performance of direct translation
    • Realise benefits (efficiency) of translating multiple “similar” languages through a common pivot language
• Current Status
  – Building of Iban corpus and lexicon
  – Linguistic comparison of the Iban and Malay languages

Page 27:

Page 28:

Multimodal Integration: Preamble

User sketching on the Wacom tablet with the CogSketch sketch interface, describing a place.

Dragon NaturallySpeaking software for capturing the speech with a microphone.

Page 29:

Multimodal Integration of Sketch and Melanau Daro-Matu Speech in Spatial Queries (MINT)

• Problem/Background
  – English: main communication medium
  – Language is unique and distinct
    • Individuals who use different languages may have different approaches to conceptualizing, communicating, reasoning and expressing their thoughts
  – Translation is not sufficient
  – Building the entire system for certain targeted speakers is time consuming
• Solution
  – Internationalisation (i18n)
  – Localisation (l10n)

Page 30:

• Objectives
  – Integrate Melanau Daro-Matu speech and sketch (image) modalities
  – Identify the interaction patterns of Melanau users.
  – Identify the similarities and differences of English, Malay and Melanau (extending to Iban as well)
  – Localise architecture and representation of multimodal integration in Melanau Daro-Matu, and other languages

Page 31:

[Diagram: MINT processing architecture. Labels recovered from the figure:]

• Input Capturing: Sketch, Speech
• Input Interpretation: Sketch Interpretation, Speech Interpretation
• Modalities Representation: Sketch Representation, Speech Representation
• Modalities Integration: Sketch and Speech Integration

Language-Dependent Components
• Tokenization
• Part-Of-Speech Tagging
  – Tagging using trained corpus
  – Tagging corrections acquired from templates
• Lexicon required
• Grammar rules required

Other labels: Transcription, Sentence Splitter, Annotated Text, Spatial information retrieval, Database Searching
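
The language-dependent steps named in the diagram (tokenization, tagging from a lexicon, tag corrections from templates, then retrieval of spatial information from the annotated text) can be sketched as follows. Everything here is an assumption made for illustration: the tiny lexicon, the single correction template, the tag names and the function names are hypothetical, and English is used in place of Melanau Daro-Matu because the slides give no Melanau data.

```python
import re

# Hypothetical toy lexicon: each word gets its most likely tag.
LEXICON = {
    "the": "DET", "a": "DET",
    "house": "NOUN", "river": "NOUN",
    "is": "VERB", "park": "VERB",
    "near": "PREP", "behind": "PREP",
}

# Correction templates: (previous tag, tag to change, corrected tag).
TEMPLATES = [("DET", "VERB", "NOUN")]

def tokenize(sentence):
    """Lowercase the sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def tag(tokens):
    """Tag from the lexicon (unknown words default to NOUN), then apply templates."""
    tags = [LEXICON.get(t, "NOUN") for t in tokens]
    for i in range(1, len(tags)):
        for prev_tag, old, new in TEMPLATES:
            if tags[i - 1] == prev_tag and tags[i] == old:
                tags[i] = new
    return list(zip(tokens, tags))

def spatial_relation(tagged):
    """Return (object, relation, landmark) around the first preposition, if any."""
    for i, (word, t) in enumerate(tagged):
        if t == "PREP":
            before = [w for w, tg in tagged[:i] if tg == "NOUN"]
            after = [w for w, tg in tagged[i + 1:] if tg == "NOUN"]
            if before and after:
                return (before[-1], word, after[0])
    return None

if __name__ == "__main__":
    tagged = tag(tokenize("The house is near the park."))
    print(tagged)
    print(spatial_relation(tagged))
```

In this toy, "park" is first tagged as a verb from the lexicon and then corrected to a noun because it follows a determiner, which is the kind of fix a correction template provides. Only the lexicon and templates would need replacing to localise the step to another language.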

Page 32:

Spoken Language Dialogue System (SLaDS)

• Problem/Background
  – Spoken language systems (SLS) have become an increasingly common human-system interface.
  – Many studies have been conducted by foreign researchers to unravel the challenges in the design of spoken language systems.
  – This study focuses on the design and development of a spoken language dialogue system within the context of Malaysian users.
• Solution
  – The project is performed by conducting a simulation test of the real SLS with local users.
  – The system is then evaluated using the Wizard of Oz method, with the objective of determining its efficacy.
  – The results of this testing will be useful for the future development of Malaysian SLS.

Page 33:

• Objectives
  – To investigate spoken language and interaction design, and their employment in the development of Spoken Language Dialogue Systems
  – To determine the efficacy of imported usability evaluation techniques applied to Spoken Language Dialogue Systems
  – To identify speech patterns to develop a predictive model for speech recognition
• Current Status
  – To date, the study is in its testing stage, capturing the dialogue content.
  – Respondents are prompted to interact with the system.
  – The dialogue from the interaction will be taped, transcribed and analysed.

Page 34:

Wizard’s Control Panel

User’s view

SCREENSHOTS VIDEO

Video showing interaction sample

Page 35:

Research Projects: Fundamental Research Grant

• Minority Languages Online (MiLO): Preserving Cultures by Mobilising Minority Languages (of Sarawak) Online (completed 30 June 2007)
  – Continued with CLS, Univ. of Waikato
  – Wikipedia approach to development of Bidayuh lexicon
• Bario Lakuh Digital Library (completed)
  – Recordings of Kelabit songs
  – Transcribed, translated
  – With audio and video

Page 36:

e-Vocabulary for Sarawak Malay

• Problems: Language endangerment
• Vocabulary of Sarawak Malay (original source)
• Main source: vocabulary book written by W.S.B. Buck from Bau, published by the Sarawak Civil Service on 11 May 1932.
• Total word entries: 1,026

Page 37:

AbiWord in Local Languages

Background
• One of the most widely used computer applications nowadays is the word processor.
• Open Source Software (OSS): can be used, studied, and redistributed in modified or unmodified form without restriction

Solution/Objectives
• AbiWord (comprehensive word processor) to be localised
• To identify the processes of translation of computing terminology

Page 38:

Current Status:

Task                         Progress
Data collection: Template    Ongoing
Interface: Toolbar           Completed
Interface: Menu              Completed
Interface: Submenu           Ongoing
Interface: Icon              Ongoing
Interface: Tooltips          Ongoing
Interface: Operation         Running

Page 39:

Screen shots

Interface

Example of Menu Panel

Page 40:

Current Findings: Challenges

• Resources of some languages available
  – Generally lacking; data collection very challenging
• Writing systems and grammar rules do not exist
• Lack of human resources
  – Fluent in the (untainted) form (translating, POS tagging)

Page 41:

Current Findings: Bright future

• Community Awareness
  – Associations of ethnic groups aware of need
  – Older generation interested, younger generation less so
• Protocol followed
  – Upper management support required to “open doors”
• Local researchers are interested
  – Colleagues & students
• Machine translation, speech to text, text to speech
• Development of speech corpus

Page 42:

Multi-ethnic Group

Page 43:

Concluding Remarks

• Decreasing number of speakers of languages in Sarawak
• Maintenance and Revitalisation: Holistic Approach
  – Sarawak Language Technology (SaLT) Research Group
• SaLT Projects
  – Machine translation, multimodal integration, spoken language dialogue system, corpus management systems, online dictionaries/repositories, digital libraries
• Challenges: community involvement and data collection and analysis
• Silver lining: committed NGOs and researchers
• Internationalisation and localisation approach

Page 44:

Acknowledgements

• Institutional support from
  – Universiti Malaysia Sarawak
  – Jugah Foundation, Melanau Association, Dewan Bahasa dan Pustaka (Sarawak Branch), Majlis Adat Istiadat, Dayak Bidayuh National Association
• Financial support grants
  – UNIMAS Fundamental Research Grant Scheme
  – Federal Ministry of Science, Technology and Innovation Science Fund Grant Scheme (01-09-SF0028, SF0029, SF0030)

Page 45:

Fifth International Cyberspace Conference on Ergonomics (CybErg 2008)
– Theme: Local knowledge, Global Applications
– Special Discussion on Maintenance and Preservation of Languages
– On-going 15 Sept – 15 Oct 2008
– Free Registration
– http://www.cyberg08.org/forum

Page 46:

Sixth International Conference on IT In Asia (CITA’09)
• Theme: “Enabling technologies for Knowledge-driven Society: People-Powered Systems”
• Tracks on Computational Linguistics, Human Computer Interaction, Software Engineering
• Kuching, Malaysia, 6-9 July 2009; Rainforest Music Festival

Page 47:

Thank You

Terima Kasih

Jian Kenin

