Workshop: HLT Collaboration, 23-26 November 2008
HLT in South Africa: Yesterday, Today and Tomorrow
Justus Roux
Stellenbosch University Centre for Language and Speech Technology
AIM
• Brunfelsia Latifolia
• Focus on official government policy development on HLT in South Africa
• Role players in policy making
• Wish list regarding future planning and policies
YESTERDAY
1999 / 2000:
• First initiative by Pan South African Language Board (PanSALB) and the Department of Arts, Culture, Science and Technology (DACST) towards setting up a “Human Language Technology Project”
• Joint Steering Committee: DACST, PanSALB, Universities: Stellenbosch, Pretoria, UNISA, Bloemfontein, ICOMTEK (CSIR), private translation company
• Task to develop a Strategic Plan for HLT development in South Africa
YESTERDAY
Thinking at that time was very much influenced by:
– European Model for ‘Language Engineering’ and FP5 funding for HLT in Europe
– Recognition of particular realities in SA:
• Academic & technical realities – limited – training and reskilling programmes – technology transfer
• Financial realities – co-operation to be sought from Government, Academia, Private sector
• Political realities – official language situation > development of National Lexicographic Units (NLUs)
YESTERDAY
September 2000 – Report – The development of Human Language Technologies in South Africa – Strategic Planning.
Three steps
• Step 1 Create a SA model for HLT development and implementation
– Component 1: Applied research and capacity building (Specialised courses at tertiary institutions, short informal courses)
– Component 2: Production of language resources – standards – “Regulatory forum”
– Component 3: Developing enabling technologies – support to innovative projects – funding from Innovation Fund of DACST
– Component 4: Conscious steps to develop HLT industry
YESTERDAY
Step 2 Creation of a legal framework to ensure systematic acquisition of government resources
Amendment of the Legal Deposit Act (1997)
Step 3 Development of physical infrastructure to manage the implementation of the model
(NB Role of the NLUs as integral part)
• Virtual National Language and Speech Resource Centre
• Virtual National Electronic Language and Speech Data Network
• Regulatory Forum for Human Language Technologies
YESTERDAY
• Strategic plan was accepted (by DACST) and on 8 November 2001 a Ministerial Advisory Panel on HLT was inaugurated with the task to focus on the viability of the establishment of a “virtual national electronic language and speech network”
• 8 members – three of whom are at this meeting
• Report delivered to the Minister in September 2002
YESTERDAY
Recommendations
Recommendation #1 (Accepted)
A virtual HLT Centre to be established with a hub and spoke / nodes configuration
Structure of National Resource Centre for HLT (Virtual Centre: Hub and connected nodes)
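[Diagram: a Managerial Hub connected to nodes labelled Centre X, Centre Y, Uni A, Uni B, Uni C and Uni D. Language labels on the nodes: SA Eng, Afrikaans, N Sotho, Sign Lang, Venda, Tsonga, Xhosa, Swati, Zulu, Ndebele, N Sotho, Tswana. Further labels: NLU (?), Lang (?), LE.]
Managerial Hub:
• Coordination of Node Activities
• Data acquisition
• Data enhancement
• Data management & backup
• Training
LE = Language experts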
YESTERDAY
Recommendation #2 (Not accepted)
Establishment of an interim Implementation Secretariat for a period of one year
Instead, an HLT Steering Committee was appointed to oversee implementation within a period of five years
Recommendation #3 (Accepted – not implemented)
HLT development should take place in co-operation with Presidential National Commission on Information Society and Development
Recommendation #4 (Not accepted – not necessary)
Amendment of Legal Deposit Act (1997)
YESTERDAY
2002
Department of Science and Technology (DST) – National Research and Development Strategy – reference to ICT / HLT (Handout)
2003
• National Language Policy Framework (NLPF) approved by Cabinet (February) – specific reference to HLT in Section 3 (3.3)
• The development of an official HLT Strategy as one of the implementation mechanisms of the NLPF is suggested – Section 4 (4.8) (Refer “TODAY”)
• Establishment of an HLT Unit within National Language Service
• HLT Steering Committee appointed to oversee implementation of an HLT Resource Centre within a period of five years in collaboration with the HLT Unit of the National Language Service (NLS) (2003-2007)
YESTERDAY
2004
Department of Trade and Industry Report
Benchmarking of Technology – Trends and Technology Developments
Emphasis on the important role of HLT within the economic sector in South Africa.
Summary of technologies with potential high impact on ICT sector
(SA Dept Trade and Industry Report 2004: 10)
[Chart: technologies plotted by potential impact on industry (Limited to Pervasive) against South Africa's ability to respond (Low to High). Technologies shown: Mobile, Wireless, HLT, OSS, Telemedicine, Grid computing, Geomatics, RFID, Manufacturing (CAD, Robotics).]
YESTERDAY
2005
• Establishment of Meraka Institute with HLT Research Group. Initiative of Department of Science and Technology (DST)
• National Workshop on HLT (May 2005 – CSIR Conference Centre) – Roadmapping – Main issues and recommendations are in handout.
• During this period several workshops and conference tracks were held:
– PRASA annual conferences
– ALASA SIG on Language and Speech Technology Development
– ALASA International Conferences (special track)
– Roadmapping workshop with State IT Agency (SITA) – Steven Krauwer (BLARKS)
TODAY
Progress of Steering Committee to set up Resource Centre in collaboration with NLS (HLT Unit) (1)
• Draft HLT National Strategy document developed and submitted (Detail Dr Jokweni)
• Great amount of work, but little progress
• The Steering Committee had a strained working relationship with previous Chief Director of NLS, hence two instances of disagreement:
– Unilateral call by DAC (NLS) (2005) for tenders as management agent for the envisaged National Resource Centre – failure – no funds available
– Unilateral call for development proposals by DAC (2006) – Steering Committee was not involved (amount distributed to successful applicants – outputs imminent)
TODAY
Progress of Steering Committee to set up Resource Centre in collaboration with NLS (HLT Unit) (2)
• The Steering Committee has a good working relationship with the new Chief Director and staff of the NLS
– Submissions for funding submitted
Research Role Players in South Africa: Universities
[Diagram: three linked areas of work and the role players associated with them.]
• Language Resources: Text corpora, Spoken corpora, Dictionaries, Lexicons, Grammars, Terminology banks
• Enabling Technologies: Speech recognition, Morph analysis, Speech generation, POS tagging, Syntactic analysis, Semantic analysis
• Standardise Formats & Protocols: International Standards Organisation (ISO TC 37), SABS TC 37
• Research role players shown: Universities (Engineering, Computer Science), Dedicated R&D Centres, Meraka Institute, DST
• Research role players shown: Universities (Languages, Linguistics), Dedicated R&D Centres, NLS, PanSALB, DAC
TOMORROW
Wish list - Planning and policy
• Restructuring of the HLT Steering Committee: Real role players are needed to contribute to the debate (Request to the Minister through NLS / DAC)
• Establishment of the HLT Resource Centre as a priority.
– Render support services to HLT community
– Source of job creation
• Co-ordinated academic training at national level
– Standard curricula over and above specialised curricula
– Staff exchange programme (national & international)
– Recognition of modules across accredited institutions
• Applied research conducted in accordance with national priorities set by, for example, a body of experts from user sectors. (Roadmaps, annually updated.)
• Blue sky research within HLT remains imperative, also from a funding perspective.
TOMORROW
• National funding procedures for HLT research and training should be transparent and equitable
– Task for a Select Committee of National and International Experts (?)
• Address the particular interest in HLT research and training within Africa: imminent projects – Algeria, Morocco, Kenya, Nigeria and Gabon.
– Possibility of international funding, e.g. Association of African Universities (AAU) staff & student exchange programme
• Hopefully more insights to be gained from this workshop, not only with respect to international co-operation, but also regarding the positioning of HLT activities in South Africa.
FUNCTIONS OF HLT CENTRE
Importance of a National Resource Centre for HLT
• Acquiring, enhancing and managing text and speech data for HLT applications:
– Extremely costly
– Extremely time consuming
– Requires skilled language experts
• Therefore: Need to develop reusable resources
• General practice world wide:
– ELSNET (Europe), LDC (USA), (Japan)
Functions of a National Resource Centre for HLT
• Constitutes one of the integral components for effective HLT product development in all official languages of SA.
• Will interact with all other role players in the field to expedite service delivery in HLT applications.
• It will serve as a depository of raw and enhanced reusable text and speech resources of all SA languages for use by different communities / institutions for language related purposes, e.g. NLUs, Terminology development sections, translation services, education etc.
• It will serve as a language archive to document language and speech phenomena of the official languages of SA over a period of time as part of cultural heritage. (SA lost its ‘Sound Archive’)
Tasks of a National Resource Centre for HLT
Data acquisition
• Text data
– Different types / genres
  • Official / Formal (announcements, legislation)
  • Informal (magazines etc)
  • Literary (novels, drama etc)
– Sources:
  • Printed media: News agencies, Publishers
  • Government services (all levels, including Hansard)
Tasks of a National Resource Centre for HLT
Data acquisition
• Speech data
– Different types
  • Read speech
  • Spontaneous speech
– Different domains & conditions
  • Sport, news, interviews / noisy environments
– Different transmission modes
  • Telephone speech: mobile, fixed lines
  • Recorded speech (microphone)
– Different subjects
  • Male, Female, young, old, impaired
– Sources:
  • SABC archives
  • Own initiatives (!)
Tasks of a National Resource Centre for HLT
Data enhancement
Text
• Development and application of
– Tokenisers (word identification)
– Parts of speech taggers (nouns, verbs, adverbs etc)
– Morphological analysers (composition of words)
– Syntactic parsers (composition of phrases / sentences)
(With tools to be developed in collaboration with experts from Technology Component)
• Creation of machine readable lexicons (XML format)
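The first two tools in the list above can be illustrated with a minimal sketch. It assumes a regular-expression tokeniser and a lookup-based tagger; the names `tokenise`, `tag` and `TOY_LEXICON`, and the four-word lexicon itself, are hypothetical and chosen only for illustration. Real taggers for the official languages would need morphological analysis rather than whole-word lookup.

```python
import re

# Hypothetical toy lexicon (illustration only): surface form -> part of speech.
TOY_LEXICON = {
    "umuntu": "noun",
    "abantu": "noun",
    "uyahamba": "verb",
    "manje": "adverb",
}

# A word is a run of word characters, optionally joined by hyphens or
# apostrophes; any other non-space character is a token on its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split running text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def tag(tokens, lexicon=TOY_LEXICON):
    """Attach a part-of-speech label to each token by lexicon lookup.

    Tokens that are not in the lexicon are labelled "unknown".
    """
    return [(token, lexicon.get(token.lower(), "unknown")) for token in tokens]


if __name__ == "__main__":
    print(tag(tokenise("Umuntu uyahamba manje.")))
```

A lookup table is the simplest possible design and fails on any unseen word form, which is one reason the slide lists morphological analysers as a separate tool.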
A partial XML entry for the noun -ntu, class 1-2, is as follows:
<Entry>
  <Head>
    <Stem>ntu</Stem>
  </Head>
  <Body>
    <Tone>3.2.9</Tone>
    <MSI>
      <POS>
        <Noun>
          <Noun-features>
            <Class-pf-s>umu</Class-pf-s>
            <Class-pf-p>aba</Class-pf-p>
            <Class-no>1-2</Class-no>
            <Label>n</Label>
            <Dim>
              <Form>umntwana</Form>
              <Sense>baby, small child</Sense>
            </Dim>
            <Loc>
              <Form>kumuntu</Form>
Bosch SE, Pretorius L & Jones, J. Towards machine-readable lexicons for South African Bantu Languages. Nordic Journal of African Studies 16 (2): 131-145 (2007)
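How an application might read such an entry can be sketched with Python's standard `xml.etree.ElementTree` module. The slide's entry is partial, so `SAMPLE_ENTRY` below closes the open elements purely so that the fragment parses; those closing tags are an assumption, and the real entry contains more than is shown. The function name `read_noun_entry` and the plain concatenation of class prefix and stem are also illustrative assumptions, not the authors' method.

```python
import xml.etree.ElementTree as ET

# Assumption: closing tags added after <Loc> only to make the partial
# entry from the slide well-formed. The published entry continues further.
SAMPLE_ENTRY = """
<Entry>
  <Head><Stem>ntu</Stem></Head>
  <Body>
    <Tone>3.2.9</Tone>
    <MSI><POS><Noun><Noun-features>
      <Class-pf-s>umu</Class-pf-s>
      <Class-pf-p>aba</Class-pf-p>
      <Class-no>1-2</Class-no>
      <Label>n</Label>
      <Dim><Form>umntwana</Form><Sense>baby, small child</Sense></Dim>
      <Loc><Form>kumuntu</Form></Loc>
    </Noun-features></Noun></POS></MSI>
  </Body>
</Entry>
"""


def read_noun_entry(xml_text):
    """Extract the stem, noun class and a few derived forms from one entry."""
    entry = ET.fromstring(xml_text)
    stem = entry.findtext("Head/Stem")
    features = entry.find(".//Noun-features")
    return {
        "stem": stem,
        "class": features.findtext("Class-no"),
        # Simplification: prefix + stem, ignoring any sound changes.
        "singular": features.findtext("Class-pf-s") + stem,
        "plural": features.findtext("Class-pf-p") + stem,
        "diminutive": features.findtext("Dim/Form"),
        "locative": features.findtext("Loc/Form"),
    }


if __name__ == "__main__":
    print(read_noun_entry(SAMPLE_ENTRY))
```

Storing the class prefixes separately from the stem is what lets one entry serve both the singular and the plural form.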
Tasks of a National Resource Centre for HLT
Data enhancement (2)
Speech
• Orthographic transcriptions of speech (S to T)
• Phonetic transcription and annotation of speech
– Sound like utterances
  • Fluent speech
  • Repetitions, false starts etc
– Non sound like utterances
  • Background noise
  • Lip smacks etc
• Supportive software programmes (e.g. Praat)
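The annotation categories above could be held in a structure like the following sketch. It is not Praat's TextGrid format; the class `Interval`, the event labels and the function `summarise` are hypothetical names, and any time stamps used with it would be invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical label sets mirroring the two groups on the slide.
SPEECH_EVENTS = {"fluent", "repetition", "false_start"}
NON_SPEECH_EVENTS = {"background_noise", "lip_smack"}


@dataclass
class Interval:
    """One annotated stretch of a recording, with times in seconds."""
    start: float
    end: float
    event: str
    text: str = ""  # orthographic transcription, empty for non-speech


def summarise(intervals):
    """Return the total duration of speech and non-speech events."""
    totals = {"speech": 0.0, "non_speech": 0.0}
    for interval in intervals:
        if interval.end <= interval.start:
            raise ValueError(f"empty or reversed interval: {interval}")
        if interval.event in SPEECH_EVENTS:
            key = "speech"
        elif interval.event in NON_SPEECH_EVENTS:
            key = "non_speech"
        else:
            raise ValueError(f"unknown event label: {interval.event}")
        totals[key] += interval.end - interval.start
    return totals


if __name__ == "__main__":
    demo = [
        Interval(0.0, 0.5, "lip_smack"),
        Interval(0.5, 2.0, "fluent", "Ngithi ukujabula manje"),
        Interval(2.0, 2.5, "background_noise"),
    ]
    print(summarise(demo))
```

Rejecting unknown labels keeps annotations from different nodes comparable, since a typing error surfaces immediately instead of creating a new category.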
[Figure: annotated speech sample, Ukuja(bula). Speaker One – Ngithi ukujabula manje. Segment labels: u k u]
Tasks of a National Resource Centre for HLT
Data management & Software development
• Determine data needs in collaboration with HLT Unit in NLS for government applications
• Acquire the data with the assistance of language specialists at different nodes of the Centre
• Solicit development of appropriate software
• Manage, back-up, distribute data to users
• Commercialise resources: private sector developers
Tasks of a National Resource Centre for HLT
Training and Consultation
• Identify training needs and potential trainers
• Develop non-formal training curricula for the reskilling of interested language practitioners
• Organise HLT training workshops at different venues in the country encouraging language bodies to participate
• Create awareness of HLT potential in collaboration with the HLT Unit of NLS
Structure of National Resource Centre for HLT (Virtual Centre: Hub and connected nodes)
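[Diagram: a Managerial Hub connected to nodes labelled Centre X, Centre Y, Uni A, Uni B, Uni C and Uni D. Language labels on the nodes: SA Eng, Afrikaans, N Sotho, Sign Lang, Venda, Tsonga, Xhosa, Swati, Zulu, Ndebele, N Sotho, Tswana. Further labels: NLU (?), Lang (?), LE.]
Managerial Hub:
• Coordination of Node Activities
• Data acquisition
• Data enhancement
• Data management & backup
• Training
LE = Language experts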
Relationships
Seatla se sengwe se tlhapiswa ke se sengwe
(The one hand washes the other)
• No infringements on current lexicographic or terminological activities - Different foci
• Complementary activities:
– Raw or enhanced data to be supplied to NLUs / PanSALB / NLS
– NLUs could contribute to National depository
• Win-win situation for the sake of technological development of our languages
Concluding remarks
• Attempt to speed up activities in the development of HLT applications to provide services in a language of choice.
• To provide new resources and tools for lexicographic and terminological development.
• To provide a new range of job opportunities for graduates in African languages
• To keep South Africa abreast of new developments in the Information Society and avoid the marginalisation of the indigenous languages.