Korea Terminology Research Center for Language and Knowledge Engineering
Infrastructures for the Korean Language
Key-Sun Choi
• SIG-Korean Language Computing under Korea Information Science Society
  - 300 members
• Korea Information Society
  - linguistics oriented
• Purpose:
  - To improve Korean language processing technology
  - To promote the Korean software industry
    · In the planning phase (1993), targeted at Hangul word processors, machine translation and Korean linguistic research
• 1995 - 1997 (Phase 1): "word"
  - Joint project of two ministries + industry
    · Ministry of Science & Technology, Ministry of Culture
• 1998 - 2000 (Phase 2): "sentence"
  - Ministry of Science & Technology only + industry
  - Will be evaluated in October 2000
• 2001 - 2003 (Phase 3): "discourse" (not decided)
• http://kibs.kaist.ac.kr/
• Purpose
  - To promote Korean language research on the linguistics side
  - To prepare for language planning
    · for unification of South and North Korea
    · for international use of Korean
• Sponsor: Ministry of Culture
• Period: 1998 - 2007 (10 years)
• Items
  - corpus, dictionary, internationalization, terminology, education, font, old Korean, old Chinese characters
• http://www.sejong.or.kr/
[Figure: layered system architecture. Labels recovered from the diagram:]
Application Level: Word processor, MT system, Information Retrieval System, Automatic Speech Translation
Engine Level: MT engine, IR engine, Spell checker, Style checker, UI engine
Engine Module Level: MA1, MA2, TA1, TA2, PA1, PA2, WSD1, WSD2, DA1, DA2, RM1, RM2
Knowledge Level: Ontology, Common Knowledge, Domain Knowledge, Master DB
Knowledge Source Level: Basic DB, corpus, MRD, Knowledge extractor
Distributed Resource Management System
Users: End User, User (Programmer), User (Lexicographer), User (Dictionary)
• Title of Project
  - KIBS I: Integrated Korean Information Base
  - KIBS II: On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base
• Outline
  - Term: 1994.12.4 ~ 2004.9.30 (10 years)
  - Sponsor: Ministry of Science and Technology
  - Staff: 50 person/year
• Standard Module Interface
• Corpus and Electronic Dictionary Development and Management System
• Korean Part-of-Speech Tagging System
• Korean Syntactic Tagging System
• Korean/English Alignment System
• Terminological Data Base Development and Management System
• Standard Korean Input/Output Environment
• Standardized Methodology for the Construction of a Balanced Corpus
• Part-Of-Speech Transfer Dictionary Rules and an Example Package
• Tree-Tagged Corpus
• Word-Level Narrative Speech Data Base
• Hand-written Hangul scripts of high frequency
• Terminology Entries
• Domain-specific Corpus for Terminology Building
• Sublanguage Analysis and Extraction of Terminology
• Development/Management System for Information Base
• Development of Integrated Management System for Distributed Resources
• Syntactic Information Base for Syntactic Analysis/Generation
• Semantic Information Base for Semantic Analysis/Generation
• Additional Information on Language and GUI for Developing Applications
• Korean Concordance Program (KCP)
• Compound Noun Browser
• Corpus Browser
• Corpus Browser by Category
• Automatic English-to-Korean Transliteration System (TLEK)
• KAIST Ontology Browser
• Korean Morphological Analyser
• Korean Tagger
• Korean Syntactic Analyser
• Editing Support Tools for Electronic Dictionary
• Major Results
  - The first (KIBS I): 1997.6 ~ present (80 sites)
    · Text corpus: 10 million word phrases
    · POS-tagged corpus: 1 million word phrases
    · Syntactic-structure-tagged corpus: 10 thousand sentences
    · TDMS, speech DB samples, hand-written character DB samples
  - The second (KIBS II): 1998.12 ~ present (140 sites)
    · Raw corpus: 10 million word phrases; POS-tagged corpus: 200 thousand word phrases
  - The third (KIBS III): 2000 (pending)
    · Proper nouns: 10 thousand entries; compound nouns: 20 thousand entries; verb sentence pattern dictionary: 3 thousand entries, ...
  - Plan to maintain and distribute ...
• Dictionaries: total 420K entries (estimated now)
  - Machine Readable Dictionary (Hangul Society): 200K entries
  - Compound Noun, Proper Noun Classification, Internal Semantic Structure: 50K entries
  - Searched Compound Noun, Proper Noun: open
  - Verb Subcategorization: 10K frames (K-J comparison)
  - Thesaurus: Korean-Japanese-Chinese-English, 150K entries (quality not yet good)
  - Usage from corpus for each sense
  - Functional words
• Problem
  - Standardization of sense classification
  - Character code: Korean, Japanese, Chinese, ... (most important problem); now being transferred to Unicode
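The character-code problem noted above can be illustrated with a minimal sketch of re-encoding legacy Korean text as Unicode (UTF-8). The slides do not say which legacy encodings were involved; EUC-KR is assumed here only as a typical example, and the function names are hypothetical.

```python
# Illustrative sketch only. Assumption: the legacy data is EUC-KR encoded;
# the slides do not name the actual source encodings.

def to_unicode(raw: bytes, legacy_encoding: str = "euc-kr") -> str:
    """Decode legacy-encoded bytes into a Unicode string."""
    return raw.decode(legacy_encoding)


def transcode_to_utf8(raw: bytes, legacy_encoding: str = "euc-kr") -> bytes:
    """Re-encode legacy-encoded bytes as UTF-8."""
    return to_unicode(raw, legacy_encoding).encode("utf-8")


# Simulate the content of a legacy file, then convert it.
legacy_bytes = "한국어".encode("euc-kr")
utf8_bytes = transcode_to_utf8(legacy_bytes)
print(len(legacy_bytes), len(utf8_bytes))
```

A multilingual thesaurus (Korean, Japanese, Chinese, English) would need one such step per source encoding, which is one reason a single Unicode target simplifies the resource.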
• Corpus KWIC for Korean and Japanese
  - http://morph.kaist.ac.kr/kcp/
• Korean morphological analysis service
  - http://morph.kaist.ac.kr/
  - By email: send a text file and receive its POS tagging in reply
  - Graphic editor/debugger for Korean morphology
• Project Status
  - http://kibs.kaist.ac.kr/
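For readers unfamiliar with KWIC (keyword in context), the following is a minimal sketch of the idea, not the KCP implementation. It assumes whitespace tokenization and exact token matching; a real Korean concordancer would more likely match on morphemes, since particles attach to word phrases. The function name `kwic` and the sample text are hypothetical.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit.

    Assumption: tokens are whitespace-separated units and the keyword
    must match a token exactly.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


text = "the corpus browser shows the corpus in context"
for left, kw, right in kwic(text.split(), "corpus"):
    # Right-align the left context so the keyword column lines up.
    print(f"{left:>20} | {kw} | {right}")
```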
• Through world-wide terminology collection and its standardization and harmonization in the local society, distribution, publication and application in language and knowledge engineering are promoted.
• Through education and consultation on terminology R&D methodology for each subject field, high-quality, highly reliable terminology and its infrastructure and system are achieved.
Center of Terminology and Knowledge Engineering
Phase 1 (1998-2000): R&D Environment and Basic Data Collection
Integration of Working Terminology
• Terminology Collection (Basic S&T, Industry Standard, Economics)
• Electronic Terminology (Publication)
• R&D Environment (System Standardization)
• Terminology Theory and Education Infrastructure
Phase 2 (2001-2003): Value-Added Working System
Value-Added Terminology Integration
• Terminology Collection (Extended S&T)
• Extension & Maintenance (Industry Standards)
• High-Quality Terminology
• Application in Language Industry
• Verification for High-Reliability and Distribution
Phase 3 (2004-2007): Operation
Multi-lingual Terminology Integration
• Terminology Collection (Humanity and Social Science)
• Maintenance and Extension
• Large-Scale Knowledge Base for Terminology
• Terminology Education Curriculum Development
• Application Product Development
Phase 4 (2008 - ): Maintenance and Extension
Continuous Extension and Management
• Terminology Study Promotion
• Distribution of Terminology Information Base
• Continuous Terminology Extension and Management
• Basic Data (Corpus)
  - Corpus for Each Subject Domain
• Electronic Dictionary for Basic Vocabulary
  - Everyday Vocabulary consists of General Vocabulary and Everyday Terminology
• Internationalization of Korean Language
  - South-North Korean Terminology Standardization, Korean Language Input Methods
• Korean Language Engineering
  - Standardized Term Use for Information Retrieval, Machine Translation and Document Classification
• Language Engineering
  - Information Retrieval:
    · Effective Internet Information Creation and Information/Knowledge Acquisition
    · Multi-lingualism
  - Machine Translation:
    · Efficient Information Generation through Terminology and Vocabulary Collection and Standardization
  - Wordprocessor:
    · High Productivity by Spelling Correction, Summarization and Efficient Use
• Language, Information and Terminology
  - Language Education:
    · Technical Thinking and Technical Communication
    · Terminology-based Education
  - Language Study:
    · Domain-specific Language Study
• Support from Government, Organization and Industry according to each specialty
  - Ministry of Culture and Tourism (KORTERM Center Operation)
  - Ministry of Science and Technology (R&D Fund)
  - Ministry of Information and Telecommunication (R&D Fund)
  - Ministry of Diplomacy and Trade
  - Ministry of Industry and Resource
  - Ministry of Education
  - Korea Science and Technology Foundation (Event Support)
[Figure: Terminological Conceptual Space. Labels recovered from the diagram: Terminology Base (Collection); Non-standards; International Term Standard; Terminology Standard; Standardization & Harmonization; Terminology Symbolization; Terminology Access Standard Channel; Grid Size Controller; Application-Specific Dictionary; Language Education Adaptable to Student; Language & Knowledge Product; Language Education Environment; Terminology Information Environment; R&D Environment; Application; Use; R&D, Industry, Living, Communication]
[Figure: Organization. Labels recovered from the diagram: Test Suite (Language, Speech, Image); Specification Standardization (Language, Speech, Image)]
• Test Suites for IR/QA
  - Documents
    · 207,067 records (370MB)
    · Newspapers
  - Query Generation
    · 90 queries (through analysis of 300 quiz queries)
    · Queries for WH-questions and various other types of answers
    · For NLP problem solving
    · Relevant document set that includes the answer, obtained by using four kinds of commercialized IR systems with 16 kinds of methods
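The slides state only that the relevant document sets were obtained with four commercialized IR systems and 16 kinds of methods. One common way to combine several retrieval runs is pooling: take the top-ranked documents of every run and merge them into one candidate set for judging. The sketch below assumes that approach; the function name, run labels and document IDs are all hypothetical.

```python
def pool_candidates(runs, depth=10):
    """Merge the top-`depth` document IDs of every run into one pool.

    Assumption: `runs` maps a (system, method) label to a ranked list of
    document IDs, best first. This pooling scheme is illustrative and is
    not confirmed by the slides.
    """
    pool = set()
    for ranked_ids in runs.values():
        pool.update(ranked_ids[:depth])
    return pool


runs = {
    ("system_A", "method_1"): ["d3", "d1", "d7"],
    ("system_A", "method_2"): ["d1", "d9", "d2"],
    ("system_B", "method_1"): ["d7", "d4", "d1"],
}
pool = pool_candidates(runs, depth=2)
print(sorted(pool))
```

The pooled candidates would then be checked by hand for whether they actually contain the answer to the query.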
• Type Classification: About 300 Kinds
• Test Sentences and Test Queries: 5,000 Records
  - Extracted from textbooks and grammar books (1999-2000)
  - Will be extracted from real usage such as the web and newspapers (2000-2001)
  - Evaluation by Yes/No Question
  - Tested on 4 Commercialized English-Korean MT Systems
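Evaluation by yes/no question lends itself to a simple pass-rate score per MT system. The sketch below assumes each test record has been judged as a boolean (yes = the translation handles the tested phenomenon); the record layout, function name and sample labels are hypothetical and not taken from the test suite itself.

```python
from collections import defaultdict


def pass_rates(judgments):
    """Compute the fraction of 'yes' judgments per MT system.

    Assumption: `judgments` is an iterable of
    (system, sentence_type, passed) tuples with `passed` a bool.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for system, _sentence_type, passed in judgments:
        totals[system] += 1
        passes[system] += int(passed)
    return {system: passes[system] / totals[system] for system in totals}


sample = [
    ("mt1", "relative clause", True),
    ("mt1", "passive", False),
    ("mt2", "relative clause", True),
    ("mt2", "passive", True),
]
print(pass_rates(sample))
```

Grouping by sentence type instead of by system would show which of the roughly 300 types are hardest across systems.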
Metadata Input Workbench by XML