Sarawak Language Technology (SaLT) Research Group


Post on 11-Jan-2016






  • Sarawak Language Technology (SaLT) Research Group

    SaLT Initiatives: Preservation and Maintenance of Sarawak Languages
    Faculty of Computer Science and Information Technology
    Universiti Malaysia Sarawak
    Associate Professor Alvin W. Yeo

  • Overview

    Languages in Sarawak
    Maintenance and Revitalisation: Holistic Approach
    Sarawak Language Technology (SaLT) Research Group
    SaLT Projects
      Borneo Corpus Management System (BCMS)
      Iban-English Machine Translation: TRanslation IBan-English (TRIBE)
      Multimodal-INTegration (MINT) of Sketch and Melanau Daro-Matu Speech in Spatial Queries
      Speech Language Dialog Systems (SLaDS)
      Development of Language Tools
    Current findings

  • Where are we?

  • East Malaysia > Sarawak > Kuching

  • Introduction (contd)

    Sarawak is a state rich in culture.
    There are 27 ethnic groups in Sarawak (STB, 2005), each with its own culture and language.
    Sarawak has 46 living languages and 1 extinct language, according to the Ethnologue (Gordon, 2005).
    Each ethnic group may have different languages.
    Sarawak Dewan Bahasa dan Pustaka: 63 known languages in Sarawak.

  • Rationale

    Cumulative languages and number of speakers (Ethnologue, 2005)

    Population           No. of languages   Cumulative no. of languages   Cumulative (%)
    1-100                4                  4                             9%
    101-500              8                  12                            27%
    501-1000             4                  16                            36%
    1001-5000            18                 34                            76%
    5001-10,000          4                  38                            84%
    10,001-50,000        6                  44                            98%
    50,001-100,000       0                  44                            98%
    100,001 and above    1                  45                            100%
    Extinct              1                  1
    No data available    1                  1
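    The cumulative columns follow directly from the per-band counts. A minimal Python sketch of that arithmetic, assuming only the eight population bands with known speaker counts (the "Extinct" and "No data available" rows are left out, and the function name is illustrative):

```python
# Number of languages per speaker-population band, taken from the table above.
BANDS = [
    ("1-100", 4),
    ("101-500", 8),
    ("501-1000", 4),
    ("1001-5000", 18),
    ("5001-10,000", 4),
    ("10,001-50,000", 6),
    ("50,001-100,000", 0),
    ("100,001 and above", 1),
]


def cumulative_rows(bands):
    """Return (band, count, cumulative count, cumulative %) for each band."""
    total = sum(count for _, count in bands)
    running = 0
    rows = []
    for band, count in bands:
        running += count
        # Percentages are rounded to whole numbers, as in the slide.
        rows.append((band, count, running, round(100 * running / total)))
    return rows


if __name__ == "__main__":
    for row in cumulative_rows(BANDS):
        print(row)
```

    Read this way, 34 of the 45 languages with known counts (76%) have 5,000 speakers or fewer.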

  • Problem

    The world's linguistic and cultural diversity is under threat.
    Many minority languages are on the brink of extinction.
    Minority language communities
      Further disadvantaged economically and socially
      Dominant languages
      Exogamy
    Revitalizing minority languages can bring economic and social benefits as well as cultural benefits.

  • Holistic Approach: Framework for Language Revitalization and Maintenance

    Preservation of Culture
    Applications
      Internet: online presence
      Software applications and operating systems
      Hardware: input devices (keyboards, tablets/pen/stylus)
    Supporting Technologies
      Web technologies: Java, Flash
      Methodologies: engaging communities; development lifecycles
      Computing technologies: Natural Language Processing, Image Processing, Speech Recognition and Generation
    Stakeholders
      Community/civil society
      Research institutions
      Government agencies
      IT specialists
      Computer scientists
      Social scientists
    Community readiness: ICT literacy
    Ethnic group organisations

  • Sarawak Language Technologies (SaLT) Research Group

  • SaLT

    Role of technology in language maintenance and revitalisation
    Focus: revitalising and maintaining the existing conventional languages by building corpora, conducting research and developing tools for Sarawak ethnic languages.
    Sarawak Language Technologies (SaLT) Research Group covers:
      Codification of the ethnic languages
      Creation of corpora of the various languages in Sarawak
      Research in computational linguistics projects which involve the languages and peoples of Sarawak
      Development of tools: word processors, spellcheckers

  • Language Technology

    Understanding and explication of language phenomena in a computationally tractable form, resulting in techniques for interchanging various linguistic forms (speech, text, morphology, syntax, semantics/meaning, discourse, knowledge), thus leading to the creation and development of intelligent applications involving language.

  • Levels of Technology

  • Specialists Needed

    Lexicographers
    Computer scientists: DBA, SE & N/W (data maintenance & grid)
    Linguists
    Information Scientists
    Psychologists
    Anthropologists
    Computational Linguistics
    Natural Language Processing

  • Current Projects

  • Current Projects (contd)

  • Roadmap for SaLT

  • Advisors and Organisations Involved

    No   Name                            Expertise                                                            Organisation
    1    Prof. Zaharin Yusoff            Computational Linguistics (CL) & Natural Language Processing (NLP)   MMU
    2    Prof. Ahmad Zaki Abu Bakar      CL & NLP                                                             UTM
    3    AP Dr Normaziah Abdul Aziz      NLP & Artificial Intelligence                                        UIAM
    4    Prof. Dr. Tang Enya Kong        CL & NLP                                                             MMU
    5    Dr. Bali Ranaivo                NLP & CL                                                             MMU
    6    Prof. Dr. Zuraidah Mohd. Don    Linguistics                                                          UM
    7    Dr. Gerry Knowles               Phonetics and Phonology                                              MIQUEST Worldwide Sdn Bhd
    8    Professor Dr. Peter Songan      Community development                                                UNIMAS

  • Collaborators

    Organisations Involved
      Tun Jugah Foundation
      Dewan Bahasa dan Pustaka (Sarawak Branch)
      Melanau Association
      Dayak Bidayuh National Association
      Sarawak Museum
      Pustaka Negeri Sarawak
      Majlis Adat Istiadat

    Universities Involved
      UNIMAS (FCSIT, FCSHD, FSS, CLS)
      Multimedia University
      Universiti Teknologi Malaysia
      Universiti Islam Antarabangsa Malaysia
      Universiti Sains Malaysia
      Universiti Malaya
      Localisation Research Centre, University of Limerick, Ireland
      University of Waikato, New Zealand

  • Team members

    a. Staff

    FCSIT
      AP Dr Alvin Yeo Wee (Head)
      AP Dr. Narayanan K.
      Dr Edwin Mit
      Suhaila Saee
      Sarah Flora Samson
      Nurfauza Jali
      Suriati Khartini Jali
      Sy. Fazlin Seyed Fadzir
      Lee Jun Choi

    FCSHD
      Dr. Ng Giap Weng
      Doria Islamiah
      Wan Norizan

    CLS
      Dr. Ting Su Hie
      Salbia Hassan
      Yvonne Michelle Campbell

  • Team members (contd)

    b. Research Assistants
      Beatrice Chin (FCSIT)
      Teh Lee Na (FCSIT)
      Jennifer Wilfred (FCSIT)
      Lai Nyong Fock (FCSIT)
      Mohd. Hanafiah Semuni (FCSHD)
      Loh Chee Wyai (FCSIT)
      Ang Siaw Tiong (FCSIT)

    c. Students

    Level                      No. of Students
    Post-graduate
      PhD                      2
      Master by Research       6
      Master by Coursework     5
    Undergraduate              22
    Total                      35

  • Borneo Corpus Management System (BCMS)

    Problem/Background: There is currently no corpus management system for managing the corpora available in the minority languages of Sarawak.

    Solution: Build a system that is able to manage and maintain the corpora.

    Objectives:
      To design an easy-to-use Corpus Processing Toolkit for researchers
      To integrate the various tools in one single platform

    Current Status: Working on the Morphological Analysers and Spell Checkers
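    To make the spell-checking task concrete, here is a minimal word-list sketch in Python. It is not the BCMS implementation: it assumes a lexicon is already available as a set of known words, and it ranks suggestions with the standard library's `difflib.get_close_matches`. The function name `suggest` and the toy word list are illustrative only.

```python
import difflib


def suggest(word, lexicon, max_suggestions=3, cutoff=0.6):
    """Return [] if the word is in the lexicon, else the closest lexicon entries."""
    word = word.lower()
    if word in lexicon:
        return []
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1.
    return difflib.get_close_matches(
        word, sorted(lexicon), n=max_suggestions, cutoff=cutoff
    )


if __name__ == "__main__":
    # Toy word list for illustration only; a real lexicon would come from the corpus.
    toy_lexicon = {"rumah", "makan", "minum"}
    print(suggest("rumha", toy_lexicon))
```

    A word-list approach like this only works once a lexicon exists, which is one reason corpus building comes first.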

  • Corpus Manager (After processing)

    Editable content: used to highlight the extracted information in the content
    File tree: displays the processed files; each file is stored in a folder based on its category
    Original content
    Processed content

  • Corpus Analyser: Sentence Splitter

    The output is each sentence of the current document.
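    A sentence splitter of this kind can be sketched in a few lines of Python. This is an assumption about the general technique, not the Corpus Analyser's actual code: it simply splits on whitespace that follows a full stop, question mark or exclamation mark.

```python
import re

# A boundary is any run of whitespace preceded by ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split a document into sentences using end-of-sentence punctuation."""
    parts = _BOUNDARY.split(text.strip())
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "TK is very important. It needs to be preserved and protected."
    for sentence in split_sentences(sample):
        print(sentence)
```

    A rule this simple will wrongly split after abbreviations such as "Dr." or "Prof.", so a production splitter would need an abbreviation list or a trained model.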

  • Iban-Corpus Development

    Problem/Background
      Indigenous languages in Sarawak are slowly dying out due to:
      One way to stem this extinction of languages: provide more local content, but how?

    Solution
      Translate English documents into documents in minority languages
      Machine translation (MT) is needed to facilitate and accelerate the translation process

    Objectives
      Identify a methodology that can be used to translate English into minority languages, taking Iban as a case study

    Current Status
      Built an Iban corpus of 23,833 words with 3,831 distinct words
      Constructed a bilingual lexicon of 1,688 words with 1,192 distinct words
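    The two figures reported for the corpus can be read as a token count (all running words) and a type count (distinct words). A minimal Python sketch of how such counts might be computed, assuming a simple letter-based tokenizer; the tokenization rules actually used by the project are not stated in the slides.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by apostrophes or hyphens.
_WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def corpus_stats(text):
    """Return (number of word tokens, number of distinct words) for a text."""
    tokens = _WORD.findall(text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts)


if __name__ == "__main__":
    tokens, types = corpus_stats("the cat saw the other cat")
    print(tokens, types)
```

    The same `Counter` also gives word frequencies, which are a common starting point for building a lexicon.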

  • Iban-English Machine Translation

    Problem/Background
      Traditional knowledge (TK) is tacit knowledge; it is generally not stored and is known only by the older generation, who speak little English
      TK is very important. It needs to be preserved and protected.
      Machine Translation (MT) can help to preserve TK
      Translate available resources into English so that they are accessible to all, e.g. researchers (social scientists) and the younger generation
      However, translation between closely related languages is easier

    Solution
      Translate TK documents into English through a closely related language as the pivot language
      Case study: Iban as source language, Malay as pivot language and English as target language
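    The pivot idea can be illustrated at the lexicon level: a source-to-pivot lexicon and a pivot-to-target lexicon are composed into a source-to-target lexicon. The Python sketch below covers only this word-lookup step, not a full MT system, and its three toy entries are chosen for illustration; they are not drawn from the project's lexicon.

```python
def compose(source_to_pivot, pivot_to_target):
    """Build a source-to-target lexicon by chaining two lexicons through the pivot."""
    return {
        source: pivot_to_target[pivot]
        for source, pivot in source_to_pivot.items()
        if pivot in pivot_to_target
    }


def translate(tokens, lexicon):
    """Translate word by word; unknown words are kept and marked with <...>."""
    return [lexicon.get(token, f"<{token}>") for token in tokens]


if __name__ == "__main__":
    # Toy entries for illustration only (Iban -> Malay, Malay -> English).
    iban_to_malay = {"makai": "makan", "ai": "air", "rumah": "rumah"}
    malay_to_english = {"makan": "eat", "air": "water", "rumah": "house"}

    iban_to_english = compose(iban_to_malay, malay_to_english)
    print(translate(["makai", "ai"], iban_to_english))
```

    The appeal of this design is reuse: once the pivot-to-target resources exist, each additional closely related source language only needs a source-to-pivot mapping.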

  • Objectives
      To demonstrate that the performance of translation through a pivot language is comparable with the performance of direct translation
      To realise the benefits (efficiency) of translating multiple similar languages through a common pivot language

    Current Status
      Building of Iban corpus and lexicon
      Linguistic comparison of the Iban and Malay languages

  • Multimodal Integration: Preamble

    User sketching on the Wacom tablet with the CogSketch sketch interface, describing a place.
    Dragon NaturallySpeaking software for capturing the speech with a microphone.

  • Multimodal Integration of Sketch and Melanau Daro-Matu Speech in Spatial Queries (MINT)

    Problem/Background
      English: main communication medium
      Language is unique and distinct
      Individuals who use different languages may have different approaches to conceptualizing, communicating, reasoning and expressing their thoughts
      Translation alone is not sufficient
      Building the entire system for certain targeted speakers is time consuming

    Solution
      Internationalisation (i18n)
      Localisation (l10n)
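    In software terms, internationalisation means the program refers to messages by key, and localisation means supplying a catalogue of those messages per language. A minimal Python sketch under that assumption; the catalogue contents, the locale codes and the `localise` function are illustrative and are not part of MINT.

```python
# Message catalogues keyed by locale code. "ms" is deliberately incomplete
# to show the fallback behaviour.
CATALOGS = {
    "en": {"welcome": "Welcome", "thanks": "Thank you"},
    "ms": {"thanks": "Terima kasih"},
}


def localise(key, locale, catalogs=CATALOGS, fallback="en"):
    """Look up a message in the requested locale, then in the fallback locale."""
    for code in (locale, fallback):
        text = catalogs.get(code, {}).get(key)
        if text is not None:
            return text
    # Last resort: return the key itself so a missing message stays visible.
    return key


if __name__ == "__main__":
    print(localise("thanks", "ms"))
    print(localise("welcome", "ms"))
```

    With this separation, supporting another language means adding a catalogue rather than rebuilding the system, which addresses the time cost noted in the problem statement.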

  • Objectives
      Integrate Melanau Daro-Matu speech and sketch (image) modalities
      Identify the interaction patterns of Melanau users
      Identify the similarities and differences between English, Malay and Melanau (extending to Iban as well)
      Localise the architecture and representation of multimodal integration in Melanau Daro-Matu and other languages

  • Spoken Language Dialogue System (SLaDS)

    Problem/Background
      Spoken language systems (SLS) have become an increasingly common human-system interface. Many studies have been conducted by foreign researchers to unravel the challenges in the design of spoken language systems. This study focuses on the design and development of a spoken language dialogue system within the context of Malaysian users.

    Solution
      The project conducts a simulation test of a real SLS with local users. The system is then evaluated using the Wizard of Oz method, with the objective of determining its efficacy. The results of this testing will be useful for the future development of Malaysian SLS.

  • Objectives
      To investigate spoken language and interaction design, and their employment in the development of Spoken Language Dialogue Systems
      To determine the efficacy of imported usability evaluation techniques applied to Spoken Language Dialogue Systems
      To identify speech patterns in order to develop a predictive model for speech recognition

    Current Status
      To date, the study is in its testing stage, capturing the dialogue content. Respondents are prompted to interact with the system. The dialogue from the interaction will be taped, transcribed and analysed.
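    In a Wizard of Oz study, the "system" replies are chosen by a hidden human operator while every turn is recorded for later transcription and analysis. The Python sketch below shows one way such a session log could be structured. It is an assumption for illustration, not the SLaDS tooling, and the scripted `wizard` function stands in for the human operator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WozSession:
    """Record user and system turns while a wizard supplies the replies."""

    wizard: Callable[[str], str]
    log: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, user_utterance: str) -> str:
        # The wizard (a human in a real study) decides what the system says.
        reply = self.wizard(user_utterance)
        self.log.append(("user", user_utterance))
        self.log.append(("system", reply))
        return reply


if __name__ == "__main__":
    # Scripted stand-in for the human wizard, for demonstration only.
    session = WozSession(wizard=lambda utterance: "Where would you like to go?")
    session.turn("Hello")
    for speaker, text in session.log:
        print(f"{speaker}: {text}")
```

    Keeping the log as ordered (speaker, text) pairs makes it straightforward to export the dialogue for transcription checks and pattern analysis.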

  • Screenshots and Video

    Screenshots: Wizard's control panel; User's view
    Video: sample interaction

  • Research Projects: Fundamental Research Grant

    Minority Languages Online (MiLO): Preserving Cultures by Mobilising Minority Languages (of Sarawak) Online (completed 30 June 2007)
      Continued with CLS, Univ. of Waikato
      Wikipedia approach to development of Bidayuh lexicon

    Bario Lakuh Digital Library (completed)
      Recordings of Kelabit songs
      Transcribed, translated
      With audio and video

  • e-Vocabulary for Sarawak Malay

    Problems: Language endangerment
    Vocabulary of Sarawak Malay (original source)
      Main source: vocabulary book written by W.S.B. Buck from Bau, published by the Sarawak Civil Service on 11 May 1932
      Total word entries: 1,026

  • AbiWord in Local Languages

    Background
      One of the most widely used computer applications nowadays is the word processor.
      Open Source Software (OSS): can be used, studied, and redistributed in modified or unmodified form without restriction

    Solution/Objectives
      AbiWord (a comprehensive word processor) to be localised
      To identify the processes of translating computing terminology

  • Current Status:

    Task                 Progress
    Data collection
      Template           Ongoing
    Interface
      Toolbar            Completed
      Menu               Completed
      Submenu            Ongoing
      Icon               Ongoing
      Tooltips           Ongoing
      Operation          Running

  • Screenshots

    Interface
    Example of menu panel

  • Current Findings: Challenges

    Resources
      Available for some languages
      Generally lacking; data collection very challenging
      Writing systems and grammar rules do not exist
    Lack of human resources
      Fluent in the (untainted) form (translating, POS tagging)

  • Current Findings: Bright future

    Community awareness
      Associations of ethnic groups aware of need
      Advanced in age interested, younger generation not so
    Protocol followed
      Upper management support required to open doors
    Local researchers are interested
      Colleagues & students
      Machine translation, speech to text, text to speech
      Development of speech corpus

  • Multi-ethnic Group

  • Concluding Remarks

    Decreasing number of speakers of languages in Sarawak
    Maintenance and Revitalisation: Holistic Approach
    Sarawak Language Technology (SaLT) Research Group
    SaLT Projects
      Machine translation, multimodal integration, speech language dialog system, corpus management systems, online dictionaries/repositories, digital libraries
    Challenges: community involvement and data collection and analysis
    Silver lining: committed NGOs and researchers
    Internationalisation and localisation approach

  • Acknowledgements

    Institutional support from Universiti Malaysia Sarawak
    Jugah Foundation, Melanau Association, Dewan Bahasa dan Pustaka (Sarawak Branch), Majlis Adat Istiadat, Dayak Bidayuh National Association
    Financial support grants
      UNIMAS Fundamental Research Grant Scheme
      Federal Ministry of Science, Technology and Innovation Science Fund Grant Scheme (01-09-SF0028, SF0029, SF0030)

  • Fifth International Cyberspace Conference on Ergonomics (CybErg 2008)

    Theme: Local Knowledge, Global Applications
    Special Discussion on Maintenance and Preservation of Languages
    Ongoing: 15 Sept to 15 Oct 2008
    Free registration

  • Sixth International Conference on IT in Asia (CITA09)

    Theme: Enabling Technologies for Knowledge-driven Society: People-Powered Systems
    Tracks on Computational Linguistics, Human Computer Interaction, Software Engineering
    Kuching, Malaysia, 6-9 July 2009; Rainforest Music Festival

  • Thank You / Terima Kasih / Jian Kenin