The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
Gina-Anne Levow
Fifth SIGHAN Workshop, July 22, 2006


TRANSCRIPT

Page 1: The Third Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition

The Third Chinese Language Processing Bakeoff:
Word Segmentation and Named Entity Recognition

Gina-Anne Levow
Fifth SIGHAN Workshop
July 22, 2006

Page 2

Roadmap

- Bakeoff task motivation
- Bakeoff structure:
  - Materials and annotations
  - Tasks and conditions
  - Participants and timeline
- Results & discussion:
  - Word segmentation
  - Named entity recognition
- Observations & conclusions
- Thanks

Page 3

Bakeoff Task Motivation

- Core enabling technologies for Chinese language processing
- Word segmentation (WS)
  - Crucial tokenization in the absence of whitespace
  - Supports POS tagging, parsing, reference resolution, etc.
  - Fundamental challenges:
    - "Word" is not well or consistently defined; humans disagree
    - Unknown words impede performance
- Named entity recognition (NER)
  - Essential for reference resolution, IR, etc.
  - A common class of new, unknown words

Page 4

Data Source Characterization

- Five corpora, five providers
- Annotation guidelines available, but varied
- Simplified and traditional characters
- Range of encodings; all available in Unicode (UTF-8)
- Provided in a common XML format, converted to train/test form (LDC)

Page 5

Tasks and Tracks

Tasks:
- Word segmentation:
  - Training and truth: whitespace delimited
  - End-of-word tags replaced with a space; no others
- Named entity recognition:
  - Training and truth: similar to CoNLL 2-column format
  - NAMEX only: LOC, PER, ORG (LDC: +GPE)

Tracks:
- Closed: only the provided materials may be used
- Open: any materials may be used, but their use must be documented
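As a concrete illustration of the two training/truth formats above, here is a minimal sketch. The example sentence is invented, and the B-/I- tag prefixes are an assumption; the slide only specifies a CoNLL-like 2-column layout:

```python
def to_ws_format(words):
    """Word segmentation truth: words joined by single spaces
    (end-of-word tags replaced with a space, nothing else)."""
    return " ".join(words)

def to_ner_format(char_tag_pairs):
    """NER truth: CoNLL-like 2-column layout, one character per line."""
    return "\n".join(f"{ch}\t{tag}" for ch, tag in char_tag_pairs)

# Invented example: "Zhangsan goes to Beijing."
print(to_ws_format(["张三", "去", "北京"]))  # 张三 去 北京
print(to_ner_format([("张", "B-PER"), ("三", "I-PER"), ("去", "O"),
                     ("北", "B-LOC"), ("京", "I-LOC")]))
```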

Page 6

Structure: Participants & Timeline

Participants:
- 29 sites submitted runs for evaluation (36 initially registered)
- 144 runs submitted: ~2/3 WS, 1/3 NER
- Diverse groups: 11 PRC, 7 Taiwan, 5 US, 2 Japan; 1 each from Singapore, Korea, Hong Kong, Canada
- Mix of commercial sites (MSRA, Yahoo!, Alias-i, FR Telecom, etc.) and academic sites

Timeline:
- March 15: registration opened
- April 17: training data released
- May 15: test data released
- May 17: results due

Page 7

Word Segmentation: Results

Contrasts: left-to-right maximal match
- Baseline: uses only the training vocabulary
- Topline: uses only the testing vocabulary

Baseline:

Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.930   0.882  0.906    0.049  0.009  0.969
CKIP    0.915   0.870  0.892    0.042  0.030  0.954
MSRA    0.949   0.900  0.924    0.034  0.022  0.981
UPUC    0.869   0.790  0.828    0.088  0.011  0.951

Topline:

Source  Recall  Prec   F-score  OOV    Roov   Riv
CITYU   0.982   0.985  0.984    0.040  0.993  0.981
CKIP    0.980   0.987  0.983    0.042  0.997  0.979
MSRA    0.991   0.993  0.992    0.034  0.999  0.991
UPUC    0.961   0.976  0.968    0.088  0.989  0.958
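The left-to-right maximal-match contrast above can be sketched as follows: a greedy matcher that, at each position, takes the longest dictionary word. The baseline fills the dictionary from the training data, the topline from the test data; `max_len` and the example vocabulary here are illustrative assumptions:

```python
def max_match(text, vocab, max_len=4):
    """Greedy left-to-right maximum matching: at each position, take the
    longest vocabulary word starting there; fall back to one character."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in vocab:
                words.append(cand)
                i += n
                break
    return words

vocab = {"北京", "大学", "北京大学", "学生"}
print(max_match("北京大学学生", vocab))  # ['北京大学', '学生']
```

Because the matcher never segments words outside its dictionary correctly, this contrast directly exposes the OOV effect the later slides discuss.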

Page 8

Word Segmentation: CityU

CityU Closed:

Site  RunID  R      P      F      Roov   Riv
15    D      0.973  0.972  0.972  0.787  0.981
15    B      0.973  0.972  0.972  0.787  0.981
20           0.972  0.971  0.971  0.792  0.979
32           0.969  0.970  0.970  0.773  0.978

CityU Open:

Site  RunID  R      P      F      Roov   Riv
20           0.978  0.977  0.977  0.840  0.984
32           0.979  0.976  0.977  0.813  0.985
34           0.971  0.967  0.969  0.795  0.978
22           0.970  0.965  0.967  0.761  0.979

Page 9

Word Segmentation: CKIP

CKIP Closed:

Site  RunID  R      P      F      Roov   Riv
20           0.961  0.955  0.958  0.702  0.972
15    A      0.961  0.953  0.957  0.658  0.974
15    B      0.961  0.952  0.956  0.656  0.974
32           0.958  0.948  0.953  0.646  0.972

CKIP Open:

Site  RunID  R      P      F      Roov   Riv
20           0.964  0.955  0.959  0.704  0.975
34           0.959  0.949  0.954  0.672  0.972
32           0.958  0.948  0.953  0.647  0.972
2     A      0.953  0.946  0.949  0.679  0.965

Page 10

Word Segmentation: MSRA

MSRA Closed:

Site  RunID  R      P      F      Roov   Riv
32           0.964  0.961  0.963  0.612  0.976
26           0.961  0.953  0.957  0.499  0.977
9            0.959  0.955  0.957  0.494  0.975
1     A      0.955  0.956  0.956  0.650  0.966

MSRA Open:

Site  RunID  R      P      F      Roov   Riv
11    A      0.980  0.978  0.979  0.839  0.985
11    B      0.977  0.976  0.977  0.840  0.982
14           0.975  0.976  0.975  0.811  0.981
32           0.977  0.971  0.974  0.675  0.988

Page 11

Word Segmentation: UPUC

UPUC Closed:

Site  RunID  R      P      F      Roov   Riv
20           0.940  0.926  0.933  0.707  0.963
32           0.936  0.923  0.930  0.683  0.961
1     A      0.940  0.914  0.927  0.634  0.969
26    A      0.936  0.917  0.926  0.617  0.966

UPUC Open:

Site  RunID  R      P      F      Roov   Riv
34           0.949  0.939  0.944  0.768  0.966
2            0.942  0.928  0.935  0.711  0.964
20           0.940  0.927  0.933  0.741  0.959
7            0.944  0.922  0.933  0.680  0.970

Page 12

Word Segmentation: Overview

Word Segmentation: Overview

- F-scores: 0.481-0.797
- Best score: MSRA open task (FR Telecom)
- Best relative to topline: CityU open: >99%
- Most frequent top rank: MSRA
- Both F-scores and OOV recall were higher in the open track
- Overall good results: most systems outperform the baseline
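The F-scores and the Roov/Riv split reported in these tables can be computed as in the following simplified sketch. This is not the official bakeoff scorer; it assumes the usual convention that a predicted word counts as correct only if its character span exactly matches a gold word's span:

```python
def word_spans(words):
    """Map a word sequence onto ((start, end), word) character spans."""
    out, i = [], 0
    for w in words:
        out.append(((i, i + len(w)), w))
        i += len(w)
    return out

def ws_score(gold, pred, train_vocab):
    """Word-level precision/recall/F, plus recall split into OOV words
    (Roov: gold words absent from training) and in-vocabulary words (Riv)."""
    g, p = word_spans(gold), word_spans(pred)
    pred_set = {s for s, _ in p}
    correct = sum(1 for s, _ in g if s in pred_set)
    prec, rec = correct / len(p), correct / len(g)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    oov = [s for s, w in g if w not in train_vocab]
    iv = [s for s, w in g if w in train_vocab]
    roov = sum(1 for s in oov if s in pred_set) / len(oov) if oov else 0.0
    riv = sum(1 for s in iv if s in pred_set) / len(iv) if iv else 0.0
    return prec, rec, f, roov, riv

# Invented example: the system merged the last two gold words.
print(ws_score(["北京", "大学", "学生"], ["北京", "大学学生"], {"北京", "大学"}))
```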

Page 13

Word Segmentation: Discussion

- Continuing OOV challenges
- Highest F-scores on MSRA
  - Also the highest topline and baseline
  - Lowest OOV rate
- Lowest F-scores on UPUC
  - Also the lowest topline and baseline
  - Highest OOV rate (more than double that of every other corpus)
  - Smallest corpus (~1/3 the size of MSRA)
- Best scores on the most consistent corpus (vocabulary, annotation)
- UPUC also varies in genre: train: CTB; test: CTB, newswire (NW), broadcast news (BN)

Page 14

NER Results

Contrast: baseline labels a token as a named entity if it had a unique tag in training.

Source  P      R      F      PER-F  ORG-F  LOC-F  GPE-F
CITYU   0.611  0.467  0.529  0.587  0.516  0.503  N/A
LDC     0.493  0.378  0.428  0.395  0.290  0.259  0.539
MSRA    0.590  0.488  0.534  0.614  0.469  0.531  N/A
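The "unique tag in training" baseline above might be sketched as a simple lookup. The token-level granularity and the tag names used here are illustrative assumptions:

```python
from collections import defaultdict

def build_unique_tags(training_pairs):
    """Collect the set of tags each training token received, and keep only
    tokens that always carried exactly one entity (non-O) tag."""
    tags = defaultdict(set)
    for tok, tag in training_pairs:
        tags[tok].add(tag)
    return {tok: next(iter(ts)) for tok, ts in tags.items()
            if len(ts) == 1 and "O" not in ts}

def baseline_tag(tokens, unique):
    """Baseline: label a token as a named entity only if it had a unique
    entity tag in training; everything else gets O."""
    return [(t, unique.get(t, "O")) for t in tokens]

train = [("北京", "LOC"), ("张三", "PER"), ("去", "O"), ("北京", "LOC")]
unique = build_unique_tags(train)
print(baseline_tag(["张三", "去", "北京"], unique))
```

A token seen with two different entity tags in training is left untagged, which is one reason this baseline's recall stays low.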

Page 15

NER Results: CityU

CityU Closed:

Site  P      R      F      ORG-F  LOC-F  PER-F
3     0.914  0.867  0.890  0.805  0.921  0.909
19    0.920  0.854  0.886  0.805  0.925  0.887
21a   0.927  0.847  0.885  0.797  0.920  0.890
21b   0.924  0.849  0.885  0.798  0.924  0.892

CityU Open:

Site  P      R      F      ORG-F  LOC-F  PER-F
6     0.869  0.749  0.805  0.680  0.860  0.810

Page 16

NER Results: LDC

LDC Closed:

Site       P       R      F      ORG-F  LOC-F  PER-F
7          0.7616  0.662  0.708  0.521  0.286  0.742
6-gpe-loc  0.672   0.655  0.664  0.455  0.708  0.742
6          0.306   0.298  0.302  0.455  0.037  0.742

LDC Open:

Site       P       R      F      ORG-F  LOC-F  PER-F
3          0.803   0.726  0.763  0.658  0.305  0.788
8          0.814   0.594  0.688  0.585  0.170  0.657

Page 17

NER Results: MSRA

MSRA Closed:

Site  P      R      F      ORG-F  LOC-F  PER-F
14    0.889  0.842  0.865  0.831  0.854  0.901
21a   0.912  0.817  0.862  0.820  0.905  0.826
21b   0.884  0.829  0.856  0.770  0.901  0.849
3     0.881  0.823  0.851  0.815  0.906  0.794

MSRA Open:

Site  P      R      F      ORG-F  LOC-F  PER-F
10    0.922  0.902  0.912  0.859  0.903  0.960
14    0.908  0.892  0.899  0.840  0.910  0.926
11b   0.877  0.875  0.876  0.761  0.897  0.922
11a   0.864  0.840  0.852  0.694  0.874  0.920

Page 18

NER: Overview

NER: Overview

- Overall results:
  - Best F-score: MSRA open track: 0.91
  - Strong overall performance: only two results fell below the baseline
- Direct comparison of open vs. closed NER is difficult: only two sites entered both tracks
  - Only MSRA had large numbers of runs
  - There, open outperformed closed: the top 3 open runs beat the closed runs

Page 19

NER Observations

- Named entity recognition challenges: tagsets, variation, and corpus size
- Results on MSRA/CityU were much better than on LDC
  - The LDC corpus is substantially smaller
  - It also has a larger tagset: GPE, which is easily confused with ORG or LOC
- NER results are sensitive to corpus size, tagset, and genre

Page 20

Conclusions & Future Challenges

- Strong, diverse participation in WS & NER; many effective, competitive results
- Cross-task, cross-evaluation comparisons remain difficult
  - Scores are sensitive to corpus size, annotation consistency, tagset, genre, etc.
  - Need a corpus- and configuration-independent measure of progress
  - Encourage submissions that support comparisons
  - Extrinsic, task-oriented evaluation of WS/NER
- Continuing challenges: OOV, annotation consistency, encoding combinations and variation, code-switching

Page 21

Thanks

Data providers:
- Chinese Knowledge Information Processing Group, Academia Sinica, Taiwan: Keh-Jiann Chen, Henning Chiu
- City University of Hong Kong: Benjamin K. Tsou, Olivia Oi Yee Kwong
- Linguistic Data Consortium: Stephanie Strassel
- Microsoft Research Asia: Mu Li
- University of Pennsylvania / University of Colorado: Martha Palmer, Nianwen Xue

Workshop co-chairs: Hwee Tou Ng and Olivia Oi Yee Kwong

All participants!