Kyoungryol Kim Meeting Information Extraction from Meeting Announcement in Korean


Post on 04-Jan-2016


Page 1: Kyoungryol Kim Meeting Information Extraction from Meeting Announcement in Korean

Kyoungryol Kim

Meeting Information Extraction from Meeting Announcement in Korean


Table of Contents

1. Introduction: Motivation, Goal, Problem Definition, Contribution

2. The Proposed Method Finding

3. Discussion


Introduction


Motivation (1/3) : Necessity

Every day we receive many meeting announcements: conferences, seminars, workshops, meetings, appointments…

Meeting announcements account for 17% (30,201 out of 183,022) of the emails in the Enron Email Dataset*.

Smartphone era: many people manage their schedules with an online calendar on a smartphone, e.g. Google Calendar.

But typing on a touch-screen keyboard is difficult and causes many errors.

* Enron Email Dataset, August 21, 2009 version, http://www.cs.cmu.edu/~enron/


Goal

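Automatically extract schedule information from a meeting announcement and update the calendar with it.

Meeting Announcement:

무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다 .
날짜 : 7 월 19 일 ( 토 ) 오후 2 시
장소 : 민들레영토
민들레영토 오는길
지도와 같이 명동역 8 번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA 빌딩 1 층에 있습니다 .

(English: As the hot weather sets in, we will hold the regular team leaders' meeting to review 유니브캐스트's first half and plan its second-half operations. Date: July 19 (Sat), 2 PM. Venue: 민들레영토. Directions to 민들레영토: as shown on the map, leave 명동역 by Exit 8 and keep walking along the shops; it is on the 1st floor of the YMCA building.)

Extract -> Update:

startTime : 2011-07-19T14:00
isHeldAt : 서울시 중구 명동 1-3 민들레영토 (Latitude : 37.5687868, Longitude : 126.9797848)
locationLandmark : 서울지하철 4 호선 명동역 8 번출구 (Latitude : 37.5609660, Longitude : 126.9864660)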


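Problem Definition

To find the Meeting Location, the problem is divided into 3 parts :

1. Finding locations for each type of complexity.

2. Named entity disambiguation on found locations.

Example announcement:

무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다 .
날짜 : 7 월 19 일 ( 토 ) 오후 2 시
장소 : 민들레영토
기본 안건
- 제작지원비 지급 지연에 대한 설명
- 기금 조정 운영안
- 가을 워크샵 준비위 구성
- 기타 ( 기타 안건으로 상정할 것이 있으면 각 팀장들은 제안해 주시기 바랍니다 )
민들레영토 오는길
지도와 같이 명동역 8 번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA 빌딩 1 층에 있습니다 .
참고하세요

(English: As the hot weather sets in, we will hold the regular team leaders' meeting to review 유니브캐스트's first half and plan its second-half operations. Date: July 19 (Sat), 2 PM. Venue: 민들레영토. Main agenda: explanation of the delayed payment of production support funds; fund adjustment and operation plan; forming the preparation committee for the autumn workshop; other items (team leaders are asked to propose anything else they want on the agenda). Directions to 민들레영토: as shown on the map, leave 명동역 by Exit 8 and keep walking along the shops; it is on the 1st floor of the YMCA building. Please take note.)

Processing steps applied to the announcement:

1. Finding Locations (Location-type NER), together with Start/End Time Extraction

2. NE Disambiguation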


isHeldAt : 민들레영토, 민들레영토, YMCA 빌딩 1 층

locationLandmark : 명동역 8 번출구

3. Normalization & Co-reference

startTime : 2011-07-19T14:00

isHeldAt : 서울시 중구 명동 1-3 민들레영토 (Latitude : 37.5687868, Longitude : 126.9797848)

locationLandmark : 서울지하철 4 호선 명동역 8 번출구 (Latitude : 37.5609660, Longitude : 126.9864660)
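The slides do not say how the start time is extracted. As a minimal sketch only, the normalization from the announcement's date line to the `startTime` value could look like the following. The regular expression, the function name `parse_start_time` and the explicit `year` argument are all assumptions made for illustration (the announcement itself does not state a year).

```python
import re
from datetime import datetime

# Hypothetical pattern: "<month> 월 <day> 일 ... 오전|오후 <hour> 시"
DATE_RE = re.compile(
    r"(\d{1,2})\s*월\s*(\d{1,2})\s*일.*?(오전|오후)\s*(\d{1,2})\s*시"
)

def parse_start_time(text, year):
    """Return an ISO 8601 string for the first date expression, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day = int(match.group(1)), int(match.group(2))
    am_pm, hour = match.group(3), int(match.group(4))
    if am_pm == "오후" and hour < 12:    # PM: shift to 24-hour clock
        hour += 12
    elif am_pm == "오전" and hour == 12:  # 12 AM is midnight
        hour = 0
    return datetime(year, month, day, hour).strftime("%Y-%m-%dT%H:%M")

print(parse_start_time("날짜 : 7 월 19 일 ( 토 ) 오후 2 시", 2011))
```

The year has to come from outside the text (for example the email's send date), which is why it is a parameter here.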


Definition

Definition 1. Location Named Entity

A particular point or place in physical space (Wiktionary).

[Cyber Space] Exceptionally, if a cyber space is used as a place where people gather, then the cyber space can be a location.
  e.g. MSN에서 9시에 모입니다 . ("We meet on MSN at 9 o'clock.")

[Road, Street, Transportation] cannot be a location, unless it points to a particular place or is necessary to describe the location.
  e.g. 진천 I/C, 왼쪽에 석촌지하차도가 보임 ("석촌지하차도 is visible on the left")

[Bridge] can be a location.
  e.g. 납안교 , 한강대교

[Train/Subway Station, Bus-stop] can be a location.
  e.g. 도곡역 1번출구 , 뱅뱅사거리

[Address] A full or partial address can be a location.
  e.g. 전북 무주군 설천면 심곡리 43-15

[Organization, Company, Heritage, Building] can be a location if it is used to represent the location.

[Parenthesis] If the location becomes ambiguous when the string in parentheses is removed, then the string, including the parentheses, is part of the location.
  e.g. COEX 컨퍼런스센터 4층 (402호 ), 건국대학교 (서울 ) 의생명연구동 강당 , 경인교육대학교 (경기캠퍼스 ), 부산벡스코 (BEXCO) 컨벤션홀 201호 , 생명과학관 (녹지 ) 139호

[Enumeration] Different representations of the same location are recognized separately.
  e.g. 장소 ? 가야 레스토랑 . 전화 /215-654-8900, 주소 /1002 Skippack Pike, Blue Bell, PA 19422
  e.g. 전주 화산체육관 (전북 전주시 완산구 중화산동 1가 45번지 )
  e.g. 2. 장소 : 늘푸름 (오산시 은계동 91-8)
  (장소 = venue, 전화 = phone, 주소 = address)

Definition 2. Meeting Location

The Meeting Location is the Location where the meeting will be held.

Definition 3. Location Landmark

A Location Landmark is a Location that can be used as a landmark for getting to the meeting location.


Complexity of the problems



The Proposed Method

1) Location Named Entity Recognition
2) Relation Type Classification
3) Co-reference
4) Normalization


Overall Architecture

Input Document -> Named Entity Recognition (Location) -> Relation Type Classification -> Co-reference -> Normalization -> OUTPUT

Input Document:

무더운 날씨가 본격적으로 시작되는 즈음하여 유니브캐스트의 상반기 평가와 하반기 운영을 위한 정기팀장회의를 개최합니다 .
날짜 : 7 월 19 일 ( 토 ) 오후 2 시
장소 : 민들레영토
기본 안건
- 제작지원비 지급 지연에 대한 설명
- 기금 조정 운영안
- 가을 워크샵 준비위 구성
- 기타 ( 기타 안건으로 상정할 것이 있으면 각 팀장들은 제안해 주시기 바랍니다 )
민들레영토 오는길 (one copy of the announcement on the slide reads "명동 민들레영토 오는길")
지도와 같이 명동역 8 번 출구로 나오셔서 쭉 상가 끼고 걸어가시면 저기 YMCA 빌딩 1 층에 있습니다 .
참고하세요

(English summary: announcement of a regular team leaders' meeting on July 19 (Sat) at 2 PM at 민들레영토, with the agenda and directions from 명동역 Exit 8 to the 1st floor of the YMCA building.)

After Named Entity Recognition and Relation Type Classification:

isHeldAt : 민들레영토, 민들레영토, YMCA 빌딩 1 층

locationLandmark : 명동역 8 번출구

After Co-reference and Normalization:

서울시 중구 명동 YMCA 빌딩 1 층 민들레영토


1) Location Named Entity Recognition


Architecture of Location NER

Training the system (supervised learning):
  Training Corpus
  Tokenization
  TF-IDF Calculation (TF-IDF Score Data)
  Boundary Marking (IOB2)
  Gazetteer Extraction from the Web (Gazetteer)
  Feature Extraction
  CRFs Learning (CRFs Model)

Testing the system (actual use and evaluation):
  Input: Morpheme-level tokenized sentence list
  Feature Extraction
  Boundary Tagging (IOB2) by CRFs Model
  Boundary Merging
  Output: NE Annotated Email Document
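To make the Boundary Marking (IOB2) and Boundary Merging boxes concrete, here is a minimal sketch. In IOB2 the first token of every entity is tagged `B-`, the following tokens of the same entity `I-`, and everything else `O`. The tag name `LOC`, the function names and the token list are assumptions for illustration; in the actual system the tags would come from the CRFs model rather than from known spans.

```python
def spans_to_iob2(tokens, spans, label="LOC"):
    """Mark entity spans (start, end-exclusive token indices) with IOB2 tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

def merge_iob2(tokens, tags, label="LOC"):
    """Merge B-/I- tagged tokens back into entity strings."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["장소", ":", "YMCA", "빌딩", "1", "층"]
tags = spans_to_iob2(tokens, [(2, 6)])
print(tags)
print(merge_iob2(tokens, tags))
```

An `I-` tag without a preceding `B-` is ignored by `merge_iob2`; that is one possible policy for malformed tagger output, not something the slides specify.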


NER - Boundary Detection

Boundary Tagset : IOB2

Features

Linguistic :
  - {-2,-1,0,1,2} POS-level word
  - {-2,-1,0,1,2} POS-tag
  - POS-tag + length of the word

Orthographic : 18 types of the word
  - isKorean, isAlpha, isAlnum, 2DigitNum, ...

Gazetteer :
  - Person/Location Pronoun dictionary (ETRI 99)
  - from Training corpus : Heading words, Surrounding words, NE words
  - External resources :
    Person : Chosun/Joins.com Person DB (64,042)
    Location : Nate Local DB 35,335, Sigaji.com 8,193, Ofood 43,390, BusStop 19,431, Address, B/D 23,365, Subway 1,288, Hotel (Auction accommodation, hotelnjoy) 884, Country/Place name 11,946, School (Elementary~University) 21,957

Syntactic :
  - Position of the POS-level word in the chunk (relative: S/C/E, absolute)
  - Position of the chunk in the sentence (relative: S/SC/CE/E, absolute)
  - Position of the sentence in the document (relative: S/SC/CE/E, absolute)
  - TF-IDF
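The linguistic and orthographic features listed above can be sketched as follows. This is an illustration under assumptions, not the author's implementation: the slides only name four of the 18 orthographic types, so the definitions below (for instance restricting `isAlpha` to ASCII letters), the `<PAD>` placeholder and the POS tags in the usage example are all hypothetical.

```python
def window_features(words, pos_tags, i, offsets=(-2, -1, 0, 1, 2)):
    """Word and POS-tag features in a {-2..2} window around position i."""
    feats = {}
    for off in offsets:
        j = i + off
        inside = 0 <= j < len(words)
        feats[f"w[{off}]"] = words[j] if inside else "<PAD>"
        feats[f"pos[{off}]"] = pos_tags[j] if inside else "<PAD>"
    # "POS-tag + length of the word" as one combined feature
    feats["pos+len"] = f"{pos_tags[i]}_{len(words[i])}"
    return feats

def orthographic_features(word):
    """Four of the orthographic types named on the slide (assumed definitions)."""
    return {
        # all characters are precomposed Hangul syllables (U+AC00..U+D7A3)
        "isKorean": bool(word) and all("가" <= ch <= "힣" for ch in word),
        "isAlpha": word.isascii() and word.isalpha(),
        "isAlnum": word.isascii() and word.isalnum(),
        "2DigitNum": len(word) == 2 and word.isdigit(),
    }

words = ["장소", ":", "민들레영토"]
pos_tags = ["NNG", "SP", "NNP"]   # illustrative tags only
print(window_features(words, pos_tags, 2))
print(orthographic_features("YMCA"))
```

`str.isalpha()` alone is true for Hangul as well, which is why the ASCII check is added to keep `isAlpha` and `isKorean` distinct.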


Features : Gazetteer data

Location :

Shop Name (80,436)
  - Nate Local DB (3~10 chars.) (http://localinfo.nate.com)
  - Sigaji.com Shop DB (3~10 chars.) (http://sigaji.com/location/)
  - oFood (http://ofood.co.kr)

Hotel Name (884)
  - Auction Accommodation (http://accommodations.auction.co.kr)
  - Hotelnjoy (http://www.hotelnjoy.com)

Public Transportation (20,719)
  - Subway stations
  - Bus-stop names

Address (from Zipcode DB) (23,365)
  - Si/do, Gu/gun, Dong/myun/ri, B/D names
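The "(3~10 chars.)" note on the shop databases suggests that entries were filtered by length when the gazetteer was built. A minimal sketch of such a filter is shown below; the function name, the whitespace stripping and the sample names are assumptions, and the real crawling and cleaning steps are not described in the slides.

```python
def build_gazetteer(names, min_len=3, max_len=10):
    """Keep distinct names whose length lies within [min_len, max_len]."""
    gazetteer = set()
    for name in names:
        name = name.strip()
        if min_len <= len(name) <= max_len:
            gazetteer.add(name)
    return gazetteer

# Hypothetical raw entries, e.g. as crawled from a local-shop database
raw_shop_names = ["민들레영토", "카페", " 피오레웨딩컨벤션 ", "민들레영토"]
shop_gazetteer = build_gazetteer(raw_shop_names)
print(sorted(shop_gazetteer))
```

Dropping very short names is a common way to limit false matches against ordinary words, which may be the motivation for the lower bound.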


Evaluation Result (1/2) : Boundary Detection

Target : 13,076 sentences in 1,011 documents.
CRFs model, 10-fold cross validation, 3-order, exact matching.
The baseline applies the word and POS-tag features only.

B-Location
             baseline   B-Location
Precision    49.99%     47.93%
Recall       16.97%     64.84%
F-measure    24.34%     55.11%

I-Location
             baseline   I-Location
Precision    24.94%     77.99%
Recall       39.82%     69.58%
F-measure    32.99%     73.54%

[Bar charts: Precision, Recall and F-measure (0.00% to 100.00%) for B-Location and I-Location, baseline vs. full feature set]
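For reference, exact-match scoring can be sketched as below: a predicted item counts as correct only if it is identical to a gold item. This is a generic illustration over entity spans; the slide reports scores per tag (B-Location, I-Location) and does not say how scores were aggregated over the 10 folds, so the function and the sample spans are assumptions.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F-measure with exact matching of items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical spans as (sentence id, start token, end token)
gold = {(0, 2, 4), (0, 7, 8)}
predicted = {(0, 2, 4), (0, 6, 8)}   # second span has a wrong left boundary
print(exact_match_scores(gold, predicted))
```

Under exact matching the second prediction earns no credit even though it overlaps the gold span, which is why boundary errors are costly in this evaluation.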


2) Relation Type Classification


Architecture of Relation Type Classifier

Training the system (supervised learning):
  Training Corpus
  Tokenization
  Gazetteer Extraction from the Web (Gazetteer)
  Feature Extraction
  SVMs Learning (SVMs Model)

Testing the system (actual use and evaluation):
  Input: Location NE-tagged Document
  Feature Extraction
  Relation Type Classification by SVMs Model
  Template Generation
  Output: Extracted NE with Meeting-NE Relation Type


Statistics of Relation Types

Document-Location Relation Type Classification

Target : 1,844 Location-type Terms

  848 isHeldAt (45.99%)
  161 locationLandmark (8.73%)
  835 generalLocation (45.28%)

[Pie chart: General Location 835, isHeldAt 848, locationLandmark 161]
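The class shares follow directly from the counts and can be recomputed with a few lines. The dictionary keys reuse the relation type names from the slide; nothing else is assumed.

```python
counts = {"isHeldAt": 848, "locationLandmark": 161, "generalLocation": 835}

total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share}%")
```

The distribution is strongly skewed: locationLandmark has far fewer examples than the other two classes, which is worth keeping in mind when reading its lower precision and recall in the experiments.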


Features

Linguistic

1. Gazetteer

A. Named Entity Dictionary : Nate Local DB 35,335, Sigaji.com 8,193, Ofood 43,390, BusStop 19,431, Address, B/D 23,365, Subway 1,288, Country/Place name 11,946

B. From the Training Corpus :
  - Heading words in the current sentence
  - Heading words in the previous sentence
  - Words that make up the NE

2. Lexical Pattern

A. POS-tag of the words just before and after the NE

B. Is this NE the first location NE after a colon?

C. Is this term inside parentheses?

D. Is a parenthesis opened and closed right after the NE?

E. Is a direction word right after the NE?

Syntactic

3. Syntactic Features

A. Is the NE the first or the last location-type NE in the sentence?

B. Ratio of location NEs in the current sentence to those in the document

C. Relative position of the NEs in the sentence

D. Is the NE the longest location NE in the sentence?


Experiment : Features (1/3)

1. Gazetteer

A. Named Entity Dictionary
  Collected from the web. Check whether each morpheme, eojeol or term matches a word in the dictionary.

  I. Nate Local DB, Sigaji.com, Ofood
  II. Address, Building name
  III. Bus-stop, Subway station
  IV. Country name
  V. Location-related Vocabulary

B. From the Training Corpus :
  I. Heading words in the current sentence.
  II. Heading words in the previous sentence.
     A heading word is the word before the colon in the sentence.
     e.g. 장소 : 피오레웨딩컨벤션 ( 봉계동 여수 세무서 옆 ) (heading word: 장소, "venue")
  III. Eojeol-level words that make up the NE

Feature 1A

        isHeldAt                  locationLandmark          Acc.(%)
        P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
I       59.32   98.94   74.17     57.14    2.45    4.71     59.09
+II     60.45   93.99   73.58     47.62    6.13   10.87     58.39
+III    63.12   89.52   74.04     62.96   31.29   41.80     60.76
+IV     64.55   88.57   74.68     62.96   31.29   41.80     62.30
+V      70.64   86.45   77.75     70.91   47.85   57.14     67.20

Feature 1 (A+B)

        isHeldAt                  locationLandmark          Acc.(%)
        P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
+I      81.45   85.87   83.60     71.09   55.83   62.54     75.17
+II     80.56   84.92   82.68     69.40   57.06   62.63     75.03
+III    84.86   87.16   86.00     77.44   63.19   69.59     79.93
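The two gazetteer feature groups can be sketched as follows: a heading word is taken as the text before the first colon, and a candidate term is checked against the dictionary at term level and at eojeol level. This is a simplified illustration; morpheme-level matching would need a Korean morphological analyzer and is left out, and the function names and the tiny dictionary are assumptions.

```python
def heading_word(sentence):
    """Return the word before the first colon, or None if there is no colon."""
    head, colon, _ = sentence.partition(":")
    return head.strip() if colon else None

def gazetteer_features(term, gazetteer):
    """Dictionary match at term level and at eojeol (space-separated) level."""
    eojeols = term.split()
    return {
        "term_match": term in gazetteer,
        "eojeol_match": any(e in gazetteer for e in eojeols),
    }

sentence = "장소 : 피오레웨딩컨벤션 ( 봉계동 여수 세무서 옆 )"
gazetteer = {"피오레웨딩컨벤션", "세무서"}   # hypothetical entries

print(heading_word(sentence))
print(gazetteer_features("여수 세무서", gazetteer))
```

A colon inside a time expression such as "14:00" would also be treated as a heading separator by this sketch; a real system would need an extra check for that case.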


Experiment : Features (2/3)

2. Lexical Patterns

A. POS-tag of the words just before and after the NE

B. Is this NE the first location NE after a colon?

C. Is this NE inside parentheses?

D. Is a parenthesis opened and closed right after the NE?

E. Is a direction word right after the NE?
   34 direction words : 위 , 아래 , 밑 , 옆 , 앞 , 내 , 외 , … (above, below, under, beside, in front of, inside, outside, …)

   Example for A to E: 장소 : 피오레웨딩컨벤션 ( 봉계동 여수 세무서 옆 ) ("Venue: 피오레웨딩컨벤션 (봉계동, next to 여수 세무서)")

F. Does a unit of length appear in the 3 eojeols following the NE?
   [0-9]+(m|km|ft|yd|mile|미터|킬로미터|피트|야드|마일|리|초|분|시간)

G. Is a transportation word contained in the eojeol to the left?

Feature 1+2 (A~G)

        isHeldAt                  locationLandmark          Acc.(%)
        P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
+A      86.52   86.93   86.72     77.10   61.96   68.71     80.77
+B      86.94   86.22   86.58     73.79   65.64   69.48     80.63
+C      87.98   86.22   87.09     76.19   68.71   72.26     81.05
+D      88.38   86.93   87.65     75.33   69.33   72.20     80.63
+E      88.52   87.16   87.83     77.18   70.55   73.22     81.82
+F      88.76   87.40   88.07     77.70   70.55   73.95     82.10
+G      88.33   87.40   87.86     80.41   73.01   76.53     82.24
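Several of the lexical patterns (C, D, E and F) can be sketched over a space-tokenized sentence as below. The regular expression is the one given on the slide with the spaces removed; the direction-word set contains only the seven words the slide lists out of 34, and the function name, feature names and token-index convention (NE occupies `tokens[ne_start:ne_end]`) are assumptions.

```python
import re

UNIT_RE = re.compile(
    r"[0-9]+(m|km|ft|yd|mile|미터|킬로미터|피트|야드|마일|리|초|분|시간)"
)
DIRECTION_WORDS = {"위", "아래", "밑", "옆", "앞", "내", "외"}  # 7 of the 34

def lexical_features(tokens, ne_start, ne_end):
    """Lexical pattern features for the NE at tokens[ne_start:ne_end]."""
    before, after = tokens[:ne_start], tokens[ne_end:]
    return {
        # C: an unclosed "(" precedes the NE
        "in_parenthesis": before.count("(") > before.count(")"),
        # D: "(" directly after the NE and a ")" later in the sentence
        "paren_after": bool(after) and after[0] == "(" and ")" in after[1:],
        # E: direction word directly after the NE
        "direction_next": bool(after) and after[0] in DIRECTION_WORDS,
        # F: number + unit within the next 3 eojeols
        "unit_in_next3": any(UNIT_RE.search(t) for t in after[:3]),
    }

tokens = "장소 : 피오레웨딩컨벤션 ( 봉계동 여수 세무서 옆 )".split()
print(lexical_features(tokens, 5, 7))   # NE: 여수 세무서
print(lexical_features(tokens, 2, 3))   # NE: 피오레웨딩컨벤션
```

Note that the slide's pattern also lists time units (초, 분, 시간), so expressions such as a walking time would trigger feature F as well.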


Experiment : Features (3/3)

3. Syntactic Features

A. Is the NE the first or the last location-type NE in the sentence?
B. Ratio of NEs in the current sentence to those in the document (<25%, <50%, <75%, <100%, =100%)
C. Relative position of the sentence in the document (S / SC / CE / E)
D. Relative position of the eojeol in the sentence (S / SC / CE / E)
E. Relative position of the NEs in the sentence (S / SC / CE / E)
   e.g. (1 호선에서 갈아탈 경우 동묘역에서 6 호선을 갈아타고 봉화산방향으로 타고 오시면 2 번째 정거장이 보문역입니다 .
   ("If you transfer from Line 1, change to Line 6 at 동묘역 and ride toward 봉화산; the 2nd stop is 보문역.")
   NE positions: S, CE, E
F. Is the NE the longest location NE in the sentence?
G. Is this the only location NE in the sentence?
H. Is there an NE just before/after this NE in the sentence?
I. Is there an NE of the same type in the previous/next sentence?
J. Is there a phone number on the left or right side of the NE?
K. Surrounding word (on the right side of the NE, n=1, pos=etc,p,j,m,x)
L. Is a colon included in the current/previous/next sentence?
M. Does the sentence start/end with the NE?
N. # of chunks in the NE (max: 99)
O. Length of the NE (max: 300)
P. Is a location-related word included in the heading?
Q. Is a location-related word included in the heading of the previous sentence?
R. Is an NE that appears more than 2 times next to this NE?
S. Does the NE appear more than 2 times?
T. Transport dictionary feature (left / right)
U. Order of the NE in the sentence
V. Does the sentence start with a special character?

Feature 1+2+3 (G,H)

        isHeldAt                  locationLandmark          Acc.(%)
        P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
+GH     89.42   89.10   89.26     79.22   75.78   77.46     82.23
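Features B to E discretize a ratio or a position into a small set of labels. A minimal sketch is given below. The ratio buckets follow the thresholds on the slide. The slide does not define S / SC / CE / E, so the reading used here (S = first item, E = last item, SC = first half, CE = second half) is purely an assumption, as are the function names.

```python
def ratio_bucket(count_in_sentence, count_in_document):
    """Bucket the share of the document's NEs found in this sentence."""
    if count_in_document <= 0:
        raise ValueError("the document must contain at least one NE")
    ratio = count_in_sentence / count_in_document
    for limit, label in ((0.25, "<25%"), (0.5, "<50%"), (0.75, "<75%"), (1.0, "<100%")):
        if ratio < limit:
            return label
    return "=100%"

def relative_position(index, length):
    """Assumed reading: S = first, E = last, SC / CE = first / second half."""
    if index == 0:
        return "S"
    if index == length - 1:
        return "E"
    return "SC" if 2 * index < length - 1 else "CE"

print(ratio_bucket(1, 4))
print([relative_position(i, 5) for i in range(5)])
```

Discretizing in this way keeps the feature space small, which suits the sparse binary features typically fed to an SVM.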


Experiment : Relation Type Classification

Meeting-Location Relation Type Classification

Target : 1,844 Location-type NEs
SVMs 3-class (multi-class) classifier
Total Accuracy : 82.23%

             isHeldAt   locationLandmark
Precision    89.42%     79.22%
Recall       89.10%     75.78%
F-measure    89.26%     77.46%

[Bar chart: Precision, Recall and F-measure (0.00% to 100.00%) for isHeldAt and locationLandmark]


3) Normalization & Co-reference


Architecture of Normalizer

Address Format (A1 to A11):
  1 Country | 2 State/City | 3 City/Gu/Gun | 4 Dong/Eup/Myeon | 5 Ri | 6 House no. | 7 Org. | 8 B/D | 9 Floor | 10 Shop Name | 11 Room no.

Subway Format (S1 to S4):
  1 City | 2 Line no. | 3 Station Name | 4 Gate no.

Input Document:
  isHeldAt : 민들레영토, 민들레영토, YMCA 빌딩 1 층
  locationLandmark : 명동역 8 번출구

Verifying Elements (Pattern, Addr.):
  민들레영토, 민들레영토
  YMCA 빌딩 (A8) / 1 층 (A9)
  명동역 (S3) / 8 번출구 (S4)

Expansion (Subway; Open Map Services: Google Maps, Yahoo! Maps, Daum Map, Naver Map):
  민들레영토, 민들레영토
  YMCA 빌딩 (A8) / 1 층 (A9)
  서울시 (S1) 4 호선 (S2) 명동역 (S3) / 8 번출구 (S4)

Combine -> OUTPUT:
  A1 : 대한민국
  A2 : 서울시
  A3 : 중구
  A4 : 명동
  A5 : -
  A6 : -
  A7 : -
  A8 : YMCA 빌딩
  A9 : 1 층
  A10 : 민들레영토
  A11 : -
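The Combine step can be pictured as filling an 11-slot address template from several partial results. The sketch below merges slot dictionaries and marks unfilled slots with "-", as on the slide. The function names are assumptions, and the expanded values (A1 to A4) are hard-coded here; in the described system they would come from the subway data and the open map services.

```python
ADDRESS_SLOTS = [
    "Country", "State/City", "City/Gu/Gun", "Dong/Eup/Myeon", "Ri",
    "House no.", "Org.", "B/D", "Floor", "Shop Name", "Room no.",
]

def combine(*partials):
    """Merge partial slot fillings; earlier arguments win, gaps become '-'."""
    merged = {}
    for partial in partials:
        for slot, value in partial.items():
            merged.setdefault(slot, value)
    return {f"A{i}": merged.get(f"A{i}", "-")
            for i in range(1, len(ADDRESS_SLOTS) + 1)}

def to_address(slots):
    """Join the filled slots from the largest unit to the smallest."""
    return " ".join(v for v in slots.values() if v != "-")

verified = {"A8": "YMCA 빌딩", "A9": "1 층", "A10": "민들레영토"}
expanded = {"A1": "대한민국", "A2": "서울시", "A3": "중구", "A4": "명동"}

slots = combine(verified, expanded)
print(slots)
print(to_address(slots))
```

Keeping the slots explicit, rather than concatenating strings early, makes it easy to see which parts of an address are still missing after expansion.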


Discussion

1) Limitations
2) Applications


Limitations

1. Performance
   Both systems should be refined in more detail, with more sophisticated experiments.

2. Scaling Up
   Our corpus consists of 1,011 emails; a method for covering more real-world data still needs to be addressed.

3. Feature Selection
   We use a gazetteer of more than 165,000 words, and many of the resulting features are always zero in the training data. To save memory and maximize performance, these unsupported features need to be removed.
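Removing unsupported features, as suggested in the third limitation, can be sketched as a single pass over the training data: collect every feature that is non-zero at least once, then drop the rest. The dictionary-based feature representation, the function name and the sample feature names are assumptions for illustration.

```python
def prune_unsupported(feature_dicts):
    """Drop features whose value is zero in every training instance."""
    supported = set()
    for feats in feature_dicts:
        supported.update(name for name, value in feats.items() if value != 0)
    pruned = [
        {name: value for name, value in feats.items() if name in supported}
        for feats in feature_dicts
    ]
    return pruned, supported

# Hypothetical training instances with gazetteer and POS features
training = [
    {"gaz:민들레영토": 1, "gaz:한강대교": 0},
    {"gaz:민들레영토": 0, "gaz:한강대교": 0, "pos:NNP": 1},
]
pruned, supported = prune_unsupported(training)
print(sorted(supported))
print(pruned)
```

With a sparse representation that never stores zeros, the same effect is obtained simply by building the feature index from the training data only.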


Applications

1. Smartphone application
   Extract the start/end time and location from an email and add them to Google Calendar.

2. Contribution to the OpenStreetMap community
   Automatically upload the locations found to openstreetmap.com