KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser Sangwon Park January 12, 2011


KKAP: KAIST Korean Analysis Platform Morphological Analyzer, POS Tagger, Parser

Sangwon Park

January 12, 2011


Research Goal

• The goal of this research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.

• KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, etc.


Contents

• 1. Introduction of Korean Morphological Analysis
• 2. HanNanum Korean Morphological Analyzer & POS Tagger
• 3. Extension to KKAP (KAIST Korean Analysis Platform)


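Features of Korean morphological analysis

The word phrase 가시는 has four possible analyses:

• 가시/noun + 는/josa (thorn, prickle)
• 가시/verb + 는/eomi (leave, disappear)
• 가/verb + 시/eomi + 는/eomi (go)
• 갈/verb + 시/eomi + 는/eomi (grind, sharpen)

Example sentences, one per analysis:

• 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus were really prickly.)
• 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
• 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
• 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)

Difficulties of Korean morphological analysis

• Ambiguity of part of speech
• Ambiguity of morpheme segmentation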
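
Both kinds of ambiguity can be seen by enumerating candidates mechanically. The following is a minimal sketch, not HanNanum code: `LEXICON` is a hypothetical toy dictionary with just enough entries for the 가시는 example, and `analyses` is an illustrative helper name. It lists every segmentation and every tag choice, including combinations that a later connection check would reject. The 갈 (grind) reading is not produced here because it needs phoneme restoration.

```python
# Toy lexicon: surface morpheme -> possible tags.
# Hypothetical entries for illustration only; not HanNanum's dictionary.
LEXICON = {
    "가시": ["noun", "verb"],
    "가": ["verb"],
    "시": ["eomi"],
    "는": ["josa", "eomi"],
}


def analyses(word):
    """Return every split of `word` into lexicon morphemes, with every tag choice."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        for tag in LEXICON.get(head, []):
            for rest in analyses(word[end:]):
                results.append([(head, tag)] + rest)
    return results


candidates = analyses("가시는")
for cand in candidates:
    print(" + ".join(f"{morph}/{tag}" for morph, tag in cand))
```

Even this three-syllable word phrase yields several candidates, which is why the analyzer needs both connection rules and a statistical tagger to choose among them.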


HanNanum Korean Morphological Analyzer

• Under development since the 1990s
• Written in the C programming language
• Module-based architecture
• Based on the KAIST morphologically analyzed corpus
• HMM-based and Maximum Entropy-based POS tagger


HanNanum Architecture

(Figure: architecture diagram. Labels recovered from the figure:)

• INPUT, OUTPUT
• Sentence Divisor
• Morphological Analyzer
• Analyzer, Phoneme Restoration, Connection Check, Dictionary Search, Tag Set Code Conversion
• Tag Set Table, Connection Info. Table
• System Dictionary, User Dictionary, Number Dictionary (the label "(Trie)" appears twice)
• Morpheme Chart (lattice form), Segment Position, Inverse Segment Position
• Tag Mapper
• Tagger, Computation, Frequency Dictionary, Bigram Info.
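
The "(Trie)" labels in the figure suggest that some dictionaries are stored as tries. As a rough sketch of why that structure suits dictionary search, the hypothetical `Trie` class below (not HanNanum's implementation) returns every dictionary entry that is a prefix of the remaining input in a single left-to-right pass, which is the lookup a chart-style analyzer needs at each segment position.

```python
class Trie:
    """Minimal character trie mapping dictionary entries to lists of tags."""

    def __init__(self):
        self.children = {}
        self.tags = []  # non-empty when a dictionary entry ends at this node

    def insert(self, word, tag):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.tags.append(tag)

    def prefixes(self, text):
        """Return (prefix, tags) for every entry that is a prefix of `text`."""
        found = []
        node = self
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.tags:
                found.append((text[: i + 1], list(node.tags)))
        return found


# Hypothetical entries, reusing the 가시는 example.
dic = Trie()
dic.insert("가", "verb")
dic.insert("가시", "noun")
dic.insert("가시", "verb")
print(dic.prefixes("가시는"))
```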


HMM-based POS Tagger

• Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, 389-394, 1994.

• Transition probability between word phrase tags

• Transition probability between morpheme tags within a word phrase

• Probability of occurrence of a morpheme and its POS
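
One way these three probabilities could combine is sketched below. Everything in it is an assumption made for illustration: the probability values are invented, the third quantity is modeled as P(morpheme | tag), and `phrase_tag` (first and last morpheme tag joined) is a hypothetical stand-in for however the cited paper defines a word phrase tag. The sketch scores candidates for one word phrase given the previous word phrase tag; a full tagger would instead search over the whole sentence, for example with the Viterbi algorithm.

```python
import math

# All probabilities are invented for illustration.
PHRASE_TRANS = {  # P(word phrase tag | previous word phrase tag)
    ("noun-josa", "noun-josa"): 0.4,
    ("noun-josa", "verb-eomi"): 0.2,
}
MORPH_TRANS = {  # P(morpheme tag | previous morpheme tag) inside a word phrase
    ("noun", "josa"): 0.6,
    ("verb", "eomi"): 0.7,
    ("eomi", "eomi"): 0.2,
}
EMIT = {  # assumed form of the third quantity: P(morpheme | tag)
    ("가시", "noun"): 0.01,
    ("가시", "verb"): 0.002,
    ("가", "verb"): 0.02,
    ("시", "eomi"): 0.05,
    ("는", "josa"): 0.1,
    ("는", "eomi"): 0.08,
}
FLOOR = 1e-9  # crude stand-in for smoothing of unseen events


def phrase_tag(cand):
    """Hypothetical word phrase tag: first and last morpheme tags joined."""
    return f"{cand[0][1]}-{cand[-1][1]}"


def log_score(cand, prev_phrase_tag):
    """Log-probability of one candidate analysis given the previous phrase tag."""
    score = math.log(PHRASE_TRANS.get((prev_phrase_tag, phrase_tag(cand)), FLOOR))
    prev = None
    for morph, tag in cand:
        if prev is not None:
            score += math.log(MORPH_TRANS.get((prev, tag), FLOOR))
        score += math.log(EMIT.get((morph, tag), FLOOR))
        prev = tag
    return score


candidates = [
    [("가시", "noun"), ("는", "josa")],
    [("가시", "verb"), ("는", "eomi")],
    [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
]
best = max(candidates, key=lambda c: log_score(c, "noun-josa"))
print(" + ".join(f"{m}/{t}" for m, t in best))
```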


Analysis Example

- POS-tagged dictionary

- Connection rule check

- Phoneme restoration

- HMM-based tagger: find the most suitable result among the candidates
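
Two of these steps can be illustrated with a small sketch. The code below is an assumption-laden toy, not HanNanum's logic: `CONNECTABLE` is a hypothetical three-entry connection table, and `restore_rieul` handles only one restoration pattern (re-attaching a dropped final ㄹ, as in 가 → 갈 before 시). The syllable arithmetic relies on the Unicode layout of precomposed Hangul syllables, where each syllable is 0xAC00 plus a base offset plus a final-consonant index from 0 to 27, and ㄹ has index 8.

```python
# Hypothetical connection table: tag pairs allowed to be adjacent in a word phrase.
CONNECTABLE = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}

HANGUL_BASE = 0xAC00
HANGUL_COUNT = 11172  # number of precomposed Hangul syllables
FINAL_RIEUL = 8       # final-consonant index of ㄹ in the Unicode layout


def restore_rieul(syllable):
    """Add a final ㄹ to an open syllable; return None if that is not possible."""
    code = ord(syllable) - HANGUL_BASE
    if not 0 <= code < HANGUL_COUNT or code % 28 != 0:
        return None  # not a Hangul syllable, or it already has a final consonant
    return chr(HANGUL_BASE + code + FINAL_RIEUL)


def connection_ok(cand):
    """True when every adjacent tag pair is listed in the connection table."""
    tags = [tag for _, tag in cand]
    return all(pair in CONNECTABLE for pair in zip(tags, tags[1:]))


candidates = [
    [("가시", "noun"), ("는", "josa")],
    [("가시", "noun"), ("는", "eomi")],  # rejected by the connection check
    [("가시", "verb"), ("는", "eomi")],
    [("가", "verb"), ("시", "eomi"), ("는", "eomi")],
    [(restore_rieul("가"), "verb"), ("시", "eomi"), ("는", "eomi")],
]
valid = [c for c in candidates if connection_ok(c)]
for cand in valid:
    print(" + ".join(f"{m}/{t}" for m, t in cand))
```

The four surviving candidates correspond to the thorn, disappear, go, and grind readings; choosing among them is left to the tagger.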


Plug-In Component-based System

– Each function of Korean morphological analysis is implemented as a plug-in.

– A user can set up a workflow from existing plug-ins to suit their own goal.

(Figure: plug-in pool. Labels recovered from the figure:)

• Plug-in categories: Phase1 Supplement Plugin, Phase2 Morphological Analyzer, Phase2 Supplement Plugin, Phase3 POS Tagger, Phase3 Supplement Plugin
• Plug-ins: Corpus-base Morph Analyzer, Chart-base Morph Analyzer, CRF POS Tagger, HMM POS Tagger, Unknown Noun Proc., Noun Extracting, Noun Extractor, Tag Mapping, Tag Mapper, Auto Spacing, Sentence Splitter, Input Filter, Transliteration
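
The plug-in idea can be sketched in a few lines. This is a hedged illustration only: the `Workflow` class, its `add` and `run` methods, and the three stand-in plug-ins are hypothetical names invented here, and they do not reproduce the actual HanNanum or jhannanum API. Each plug-in is just a function from one intermediate result to the next, and the workflow runs whatever was registered, phase by phase.

```python
class Workflow:
    """Runs plug-ins phase by phase; within a phase, in the order they were added."""

    def __init__(self):
        self.phases = {1: [], 2: [], 3: []}

    def add(self, phase, plugin):
        self.phases[phase].append(plugin)
        return self  # allow chained calls

    def run(self, data):
        for phase in sorted(self.phases):
            for plugin in self.phases[phase]:
                data = plugin(data)
        return data


# Stand-in plug-ins (hypothetical; real ones would do actual Korean analysis).
def sentence_splitter(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_morph_analyzer(sentences):
    return [sentence.split() for sentence in sentences]


def toy_tagger(token_lists):
    return [[(token, "unk") for token in tokens] for tokens in token_lists]


workflow = (
    Workflow()
    .add(1, sentence_splitter)
    .add(2, toy_morph_analyzer)
    .add(3, toy_tagger)
)
result = workflow.run("a b. c d.")
print(result)
```

Swapping one tagger for another, or dropping the tagging phase for an indexing task, then only changes which plug-ins are registered, not the driver code.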


Flexible Workflow

- Analysis of Announcement on Web

• Plain Text Processor: Sentence Splitter, Auto Spacing, Informal Input Filter
• Morphological Analyzer: Chart-based Morphological Analyzer
• Morpheme Processor: Unknown Processor
• POS Tagger: HMM-based POS Tagger

Input:

$$$$$ 장소 $$$$$
서울코엑스 3층

(장소: "venue"; 서울코엑스 3층: "Seoul COEX, 3rd floor")

Output:

$$$$$ → $/su+$/su+$/su+$/su+$/su
장소 → 장소/ncn
$$$$$ → $/su+$/su+$/su+$/su+$/su
서울 → 서울/nq
코엑스 → 코엑스/ncn
3층 → 3/nnc+층/nbu

- Indexing of News Articles

• Plain Text Processor: Sentence Splitter
• Morphological Analyzer: Chart-based Morphological Analyzer
• Morpheme Processor: Noun Extractor

Input:

지난 9월 거제도에서 열린 축제 … ("The festival held on Geoje Island last September …")

Output:

9월/n
거제도/ncn
축제/ncn


HanNanum Korean Morphological Analyzer

Workflow for Morphological Analysis

(Figure: workflow diagram. Each phase takes its plug-ins from the plug-in pool, and the figure marks them as "Major Plugin" or "Supplement Plugin". Labels recovered from the figure:)

• Phase 1. Text Preprocessing: Sentence Segmentation, Input Filter, Auto Spacing
• Phase 2. Morphological Analysis: Chart-base Morph Analyzer, Unknown Term Processing
• Phase 3. POS Tagging: HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction, Tag Mapper

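Korean Document Analysis: extract the part-of-speech information from Korean text

Input:

7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …

(English: "The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, which is due to be announced on the evening of the 7th. The AP reported that, among Nobel Prize watchers in Sweden, the Korean poet Ko Un was mentioned most often, together with the Syrian poet Adonis, as a candidate with a strong chance of winning this year's prize. …")

Output:

7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca …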
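
The tagged output above follows a simple textual pattern: morphemes within a word phrase are joined by "+", and each morpheme is followed by "/" and its tag. A minimal sketch of reading that format back into pairs is shown below. The function names are hypothetical, the parser assumes no morpheme itself contains "+", and treating every tag that starts with "n" as nominal is a simplifying assumption about the tag set rather than a documented rule.

```python
def parse_word_phrase(analysis):
    """Split 'm1/t1+m2/t2' into [(m1, t1), (m2, t2)].

    Assumes no morpheme contains '+'; stray spaces from text extraction are stripped.
    """
    pairs = []
    for part in analysis.split("+"):
        morph, _, tag = part.rpartition("/")
        pairs.append((morph.strip(), tag.strip()))
    return pairs


def nominals(analysis):
    """Morphemes whose tag starts with 'n' (assumed here to mark nominal tags)."""
    return [m for m, t in parse_word_phrase(analysis) if t.startswith("n")]


print(parse_word_phrase("7/nnc+일/nbu"))
print(nominals("수상자/ncn+로/jca"))
```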


Open Source Project

• http://kldp.net/projects/hannanum/
• 2011.01.10: jhannanum 0.8.2 was released


GUI Demo

Plug-in Pool

Workflow

Information of a plug-in

Workflow control

Input & Output


KKAP: KAIST Korean Analysis Platform

Workflow for Korean Analysis

(Figure: workflow diagram extended with a parsing phase. Each phase takes its plug-ins from the plug-in pool, and the figure marks them as "Major Plugin" or "Supplement Plugin". Labels recovered from the figure:)

• Phase 1. Text Preprocessing: Sentence Segmentation, Input Filter, Auto Spacing
• Phase 2. Morphological Analysis: Chart-base Morph Analyzer, Unknown Term Processing
• Phase 3. POS Tagging: HMM-based POS Tagging, Noun Extraction, Tag Mapper
• Phase 4. Parsing: Chart Parser, Verb Phrase Extractor, Noun Phrase Extractor

Korean Document Analysis

Input:

7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …

Analyzed Korean Document:

7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca …


Korean Syntactic Tree Tagged Corpus

• Registered at BoRA (Bank of Resource for Language and Annotation)
– http://bora.or.kr
– Corpus 5. Manual sentence analysis corpus
– 31,091 sentences from 97 different sources
– Length: 1 ~ 33 eojeols (space-delimited word phrases), average 11.35 eojeols

• Related documents
– Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (In Korean)
– Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421~429, 1997 (In Korean)


Questions & Comments
