KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 12, 2011
Research Goal
• The goal of the research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.
• KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, etc.
Contents
• 1. Introduction to Korean Morphological Analysis
• 2. HanNanum Korean Morphological Analyzer & POS Tagger
• 3. Extension to KKAP (KAIST Korean Analysis Platform)
Features of Korean morphological analysis
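The word phrase 가시는 has four possible analyses:
• 가시/noun + 는/josa (thorn, prickle)
• 가시/verb + 는/eomi (leave, disappear)
• 가/verb + 시/eomi + 는/eomi (go)
• 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
Example sentences, one per analysis:
• 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.)
• 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
• 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
• 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)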
Difficulties of Korean morphological analysis
• Ambiguity of part-of-speech
• Ambiguity of morpheme segmentation
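Both kinds of ambiguity can be seen by enumerating every segmentation of 가시는 against a small dictionary. The following is a minimal sketch, not HanNanum's algorithm: the lexicon entries, the connection rules in `ALLOWED`, and the function name `analyses` are all illustrative assumptions, and the restoration of 갈 from the surface form 가 is simply hard-coded as a dictionary reading.

```python
# Toy lexicon: surface form -> list of (lemma, tag) readings.
# All entries are illustrative assumptions, not HanNanum's dictionary.
LEXICON = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],  # "갈" loses its final consonant before "시"
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Toy connection rules: (previous tag, next tag) pairs that may be adjacent.
ALLOWED = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}


def analyses(word):
    """Return every segmentation of `word` whose adjacent tags are allowed."""
    results = []

    def extend(pos, path):
        if pos == len(word):
            results.append(path)
            return
        # Try every dictionary entry that starts at `pos`.
        for end in range(pos + 1, len(word) + 1):
            for lemma, tag in LEXICON.get(word[pos:end], []):
                if path and (path[-1][1], tag) not in ALLOWED:
                    continue  # rejected by the connection check
                extend(end, path + [(lemma, tag)])

    extend(0, [])
    return results


for analysis in analyses("가시는"):
    print(" + ".join(f"{morph}/{tag}" for morph, tag in analysis))
```

With this toy data the function returns the same four analyses listed on the slide; choosing among them needs context, which is the tagger's job.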
HanNanum Korean Morphological Analyzer
• HanNanum has been under development since the 1990s.
• Written in the C programming language
• Module-based architecture
• Based on the KAIST morphologically analyzed corpus
• HMM-based and Maximum Entropy-based POS taggers
HanNanum Architecture
[Figure: HanNanum architecture, from INPUT to OUTPUT. Labels recoverable from the diagram:]
• Morphological Analyzer: Sentence Divisor, Analyzer, Dictionary Search, Connection Check, Phoneme Restoration, Tag Set Code Conversion
• Tables: Tag Set Table, Connection Info. Table
• Dictionaries: System Dictionary, User Dictionary, Number Dictionary (the label "Trie" appears twice in this area)
• Morpheme Chart (lattice form), with Segment Position and Inverse Segment Position
• Tag Mapper
• Tagger: Computation, Frequency Dictionary, Bigram Info.
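The architecture diagram labels dictionary storage as "Trie". As a rough illustration of why a trie suits dictionary search, the sketch below finds every dictionary entry that starts at a given position in one left-to-right pass. It is an assumption-laden toy: the class and method names are hypothetical, and it indexes whole Hangul syllables, whereas a real analyzer may well work on decomposed phonemes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.tags = []      # POS tags of the entry that ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, tag):
        """Add a dictionary entry `word` with POS `tag`."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.tags.append(tag)

    def prefixes(self, text, start=0):
        """Yield (end, tags) for every entry that begins at text[start]."""
        node = self.root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return  # no entry continues with this character
            if node.tags:
                yield i + 1, list(node.tags)


# Toy entries (illustrative, not taken from the system dictionary).
trie = Trie()
trie.insert("가", "verb")
trie.insert("가시", "noun")
trie.insert("가시", "verb")

print(list(trie.prefixes("가시는")))
# [(1, ['verb']), (2, ['noun', 'verb'])]
```

Each `(end, tags)` pair corresponds to one candidate edge that could be added to a morpheme chart.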
HMM-based POS Tagger
• Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, 389-394, 1994.
• Transition probability between word-phrase tags
• Transition probability between morpheme tags within a word phrase
• Probability of occurrence of a morpheme and its POS
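To make the three probabilities concrete, here is a minimal Viterbi sketch over a lattice of candidate analyses. It is not the model of the cited paper: as a simplifying assumption, the word-phrase transition is approximated by the transition from the last tag of one word phrase to the first tag of the next, and every number in `INTER`, `INTRA`, and `EMIT` is invented for illustration.

```python
import math


def candidate_score(cand, intra, emit, floor=1e-6):
    """Log-score of one analysis: emissions plus tag transitions inside it."""
    score = 0.0
    for i, (morph, tag) in enumerate(cand):
        score += math.log(emit.get((morph, tag), floor))
        if i > 0:
            score += math.log(intra.get((cand[i - 1][1], tag), floor))
    return score


def viterbi(lattice, inter, intra, emit, floor=1e-6):
    """lattice: list of word phrases, each a list of candidate analyses.
    Returns the highest-scoring sequence of analyses."""
    # best[j] = (log-score, path) of the best path ending in candidate j
    best = [(candidate_score(c, intra, emit, floor), [c]) for c in lattice[0]]
    for cands in lattice[1:]:
        new_best = []
        for c in cands:
            local = candidate_score(c, intra, emit, floor)
            options = []
            for prev_score, prev_path in best:
                # Assumed stand-in for the word-phrase tag transition.
                key = (prev_path[-1][-1][1], c[0][1])
                options.append((prev_score + math.log(inter.get(key, floor)), prev_path))
            score, path = max(options, key=lambda o: o[0])
            new_best.append((score + local, path + [c]))
        best = new_best
    return max(best, key=lambda b: b[0])[1]


# Invented toy probabilities for the phrase "갈증이 가시는".
INTER = {("josa", "noun"): 0.3, ("josa", "verb"): 0.5}
INTRA = {("noun", "josa"): 0.5, ("verb", "eomi"): 0.6, ("eomi", "eomi"): 0.2}
EMIT = {
    ("갈증", "noun"): 0.01, ("이", "josa"): 0.2,
    ("가시", "noun"): 0.01, ("가시", "verb"): 0.02,
    ("가", "verb"): 0.05, ("갈", "verb"): 0.001,
    ("시", "eomi"): 0.1, ("는", "eomi"): 0.3, ("는", "josa"): 0.3,
}

lattice = [
    [[("갈증", "noun"), ("이", "josa")]],
    [
        [("가시", "noun"), ("는", "josa")],
        [("가시", "verb"), ("는", "eomi")],
        [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
        [("갈", "verb"), ("시", "eomi"), ("는", "eomi")],
    ],
]

best = viterbi(lattice, INTER, INTRA, EMIT)
print(best[1])
```

With these made-up numbers the verb reading 가시/verb + 는/eomi scores highest after 갈증이; real probabilities would be estimated from a tagged corpus.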
Analysis Example
- POS-tagged Dictionary
- Check Connection rule
- Phoneme Restoration
- HMM-based Tagger: find the most suitable result among the candidates
Plug-In Component-based System
– Each function needed for Korean morphological analysis is implemented as a plug-in.
– A user can set up a workflow from existing plug-ins to suit his or her own goal.
[Figure: plug-in pool feeding a three-phase workflow]
• Plug-In Pool: Corpus-based Morph Analyzer, Chart-based Morph Analyzer, HMM POS Tagger, CRF POS Tagger, Unknown Noun Processor, Noun Extractor, Tag Mapper, Transliteration, Auto Spacing, Sentence Splitter, Input Filter, …
• Workflow slots: Phase 1 Supplement Plugin; Phase 2 Morphological Analyzer and Phase 2 Supplement Plugin; Phase 3 POS Tagger and Phase 3 Supplement Plugin
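The plug-in idea can be sketched as a chain of phases, each holding an optional major plug-in and any number of supplement plug-ins. This is only an illustration under stated assumptions: `Phase`, `run_workflow`, and the three toy plug-ins are hypothetical names, plug-ins are modeled as plain callables, and running the major plug-in before the supplements is an assumed ordering.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

Plugin = Callable[[Any], Any]  # a plug-in maps phase input to phase output


@dataclass
class Phase:
    name: str
    major: Optional[Plugin] = None
    supplements: List[Plugin] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        # Assumed order: major plug-in first, then supplements in sequence.
        if self.major is not None:
            data = self.major(data)
        for plugin in self.supplements:
            data = plugin(data)
        return data


def run_workflow(phases: List[Phase], data: Any) -> Any:
    """Pass the data through every phase in order."""
    for phase in phases:
        data = phase.run(data)
    return data


# Toy stand-ins for real plug-ins.
def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def whitespace_analyzer(sentences):
    return [s.split() for s in sentences]


def dummy_tagger(sentences):
    return [[(word, "unk") for word in s] for s in sentences]


workflow = [
    Phase("text preprocessing", supplements=[split_sentences]),
    Phase("morphological analysis", major=whitespace_analyzer),
    Phase("POS tagging", major=dummy_tagger),
]

result = run_workflow(workflow, "a b. c d.")
print(result)
```

Swapping one plug-in for another only changes the `workflow` list, which is the flexibility the slide describes.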
Flexible Workflow
[Figure: two example workflows assembled from the plug-in pool]
- Analysis of Announcement on Web
• Phases: Plain Text Processor, Morphological Analyzer, Morpheme Processor, POS Tagger
• Plug-ins: Sentence Splitter, Auto Spacing, Informal Input Filter, Chart-based Morphological Analyzer, Unknown Processor, HMM-based POS Tagger
• Input: "$$$$$ 장소 $$$$$" and "서울코엑스 3 층" (Place; Seoul COEX, 3rd floor)
• Output:
  – $$$$$ → $/su+$/su+$/su+$/su+$/su
  – 장소 → 장소/ncn
  – 서울 → 서울/nq
  – 코엑스 → 코엑스/ncn
  – 3 층 → 3/nnc+층/nbu
- Indexing of News Articles
• Phases: Plain Text Processor, Morphological Analyzer, Morpheme Processor
• Plug-ins: Sentence Splitter, Chart-based Morphological Analyzer, Noun Extractor
• Input: 지난 9 월 거제도에서 열린 축제 … (The festival held on Geoje Island last September …)
• Output: 9 월/n, 거제도/ncn, 축제/ncn
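For the indexing use case, the essential step is keeping only noun morphemes from tagged output. A minimal sketch follows, assuming the tagged result is already a list of word phrases, each a list of `(morpheme, tag)` pairs; the function name and the choice of the prefixes `nc` and `nq` as "index-worthy nouns" are assumptions, not the Noun Extractor plug-in's actual rule.

```python
def extract_nouns(tagged, noun_prefixes=("nc", "nq")):
    """Collect morphemes whose tag starts with one of `noun_prefixes`.

    tagged: list of word phrases, each a list of (morpheme, tag) pairs.
    """
    return [
        morph
        for eojeol in tagged
        for morph, tag in eojeol
        if tag.startswith(noun_prefixes)  # str.startswith accepts a tuple
    ]


# Tagged result for "서울코엑스 3 층", as shown on the slide.
tagged = [
    [("서울", "nq")],
    [("코엑스", "ncn")],
    [("3", "nnc"), ("층", "nbu")],
]

print(extract_nouns(tagged))
# ['서울', '코엑스']
```

Passing a different `noun_prefixes` tuple, for example `("n",)`, would also keep numerals and bound nouns.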
HanNanum Korean Morphological Analyzer
Workflow for Morphological Analysis
[Figure: Korean document analysis, extracting part-of-speech information from Korean text in three phases]
• Phase 1. Text Preprocessing: supplement plug-ins
• Phase 2. Morphological Analysis: one major plug-in and supplement plug-ins
• Phase 3. POS Tagging: one major plug-in and supplement plug-ins
• Plugin Pool: Sentence Segmentation, Input Filter, Auto Spacing, Chart-based Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction, Tag Mapper
• Input: 7 일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. … (The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, which is to be announced on the evening of the 7th. The AP reported that, among Nobel watchers in Sweden, the Korean poet Ko Un was named most often, together with the Syrian poet Adonis, as a likely winner of this year's prize. …)
• Output: 7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca ….
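The tagged output uses a compact `morpheme/tag+morpheme/tag` notation, one token per word phrase. The sketch below parses that notation into Python structures. It assumes the stray spaces from slide extraction have been removed so that each word phrase is a single whitespace-free token, and that no morpheme itself contains `+`; the function names are hypothetical.

```python
def parse_eojeol(analysis):
    """Parse one word phrase such as '7/nnc+일/nbu' into (morpheme, tag) pairs."""
    morphemes = []
    for part in analysis.split("+"):
        # Split on the last '/', so a morpheme may itself contain '/'.
        morph, tag = part.rsplit("/", 1)
        morphemes.append((morph, tag))
    return morphemes


def parse_line(line):
    """Parse a whitespace-separated line of tagged word phrases."""
    return [parse_eojeol(token) for token in line.split()]


sample = "7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm"
for eojeol in parse_line(sample):
    print(eojeol)
```

The resulting list of lists is a convenient input for later steps such as noun extraction or tag mapping.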
Open Source Project
• http://kldp.net/projects/hannanum/
• 2011.01.10: jhannanum 0.8.2 was released
GUI Demo
[Screenshot: GUI demo with panels for the plug-in pool, the workflow, information about a plug-in, workflow control, and input & output]
KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis
[Figure: Korean document analysis, turning a Korean document into an analyzed Korean document in four phases. The slide reuses the news-article sample from the morphological-analysis workflow as input.]
• Phase 1. Text Preprocessing: supplement plug-ins
• Phase 2. Morphological Analysis: one major plug-in and supplement plug-ins
• Phase 3. POS Tagging: one major plug-in and supplement plug-ins
• Phase 4. Parsing: one major plug-in and supplement plug-ins
• Plugin Pool: Sentence Segmentation, Input Filter, Auto Spacing, Chart-based Morph Analyzer, Unknown Term Processing, HMM-based POS Tagging, Noun Extraction, Tag Mapper, Chart Parser, Verb Phrase Extractor, Noun Phrase Extractor
Korean Syntactic Tree Tagged Corpus
• Registered at BoRA (Bank of Resource for Language and Annotation)
  – http://bora.or.kr
  – Corpus 5. Manual sentence analysis corpus
  – 31,091 sentences from 97 different sources
  – Length: 1 to 33 eojeols (average 11.35 eojeols)
• Related documents
  – Kong Joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean)
  – Byung Gyu Chang, Kong Joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421-429, 1997 (in Korean)
Questions & Comments