Post on 18-Dec-2015
Unsupervised Morpheme Analysis – Overview of Morpho Challenge 2007 in CLEF
Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen
Helsinki University of Technology, Finland
My job at Helsinki: Multimodal Interfaces @ Adaptive Informatics (Research Centre of Academy of Finland)
Research topics of MMI group
• Continuous Speech Recognition
• Adaptive Natural Language Modelling
• Content-Based Image and Video Retrieval
• Multimodal Interfaces: Proactive audio-visual information navigation, Effective multilingual interaction, Intermodal cross-over of semantics
Motivation of Morpho Challenge
• To design statistical machine learning algorithms that discover which morphemes words consist of
• Follow-up to Morpho Challenge 2005 (segmentation of words into morphs)
• Morphemes are useful as vocabulary units for statistical language modeling in: Speech recognition, Machine translation, Information retrieval
The vocabulary problem
• Many applications require a large vocabulary: e.g. speech recognition, information retrieval, machine translation
• Agglutinative and highly-inflected languages suffer from a severe vocabulary explosion
• We need more efficient representation units
[Figure: Unique words per corpus size. x-axis: Corpus size (million words); y-axis: Unique words (millions)]
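A curve of unique words against corpus size can be computed by counting distinct word forms as the corpus is read. The following is a minimal sketch, not the organizers' code; the function name `vocabulary_growth`, the `step` parameter and the toy token list are assumptions made for illustration.

```python
def vocabulary_growth(tokens, step):
    """Return (tokens_seen, unique_types) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy input (hypothetical): a few inflected Finnish forms of two stems.
tokens = "talo talossa talosta taloon talo talossa auto autossa".split()
print(vocabulary_growth(tokens, step=2))
```

In a highly inflected language each stem contributes many distinct surface forms, so the number of unique types keeps growing as more text is read.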
Scientific objectives
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages and tasks
• To advance machine learning methodology
Morpho Challenge 2007
• Part of the EU Network of Excellence PASCAL's Challenge Program
• Organized in collaboration with CLEF
• Participation is open to all and free of charge
• Word sets are provided for: Finnish, English, German and Turkish
• Implement an unsupervised algorithm that discovers morpheme analysis of words in each language!
Thanks
Thanks to all who made Morpho Challenge 2007 possible:
• PASCAL network, CLEF, Leipzig corpora collection
• Morpho Challenge organizing committee
• Morpho Challenge program committee
• Morpho Challenge participants
• Morpho Challenge evaluation team
• CLEF 2007 organizers!
Rules
• Morpheme analyses are submitted to the organizers and two different evaluations are made
• Competition 1: Comparison to a linguistic morpheme "gold standard"
• Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words
Training data
• Word lists downloadable at our home page
• Each word in the list is preceded by its frequency
• Finnish: 3M sentences, 2.2M word types
• Turkish: 1M sentences, 620K word types
• German: 3M sentences, 1.3M word types
• English: 3M sentences, 380K word types
• Small gold standard sample available in each language
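A word list in which each word is preceded by its frequency could be loaded as shown below. This is a sketch under assumptions: the slides do not give the exact file layout, so one whitespace-separated "frequency word" pair per line is assumed, and the function name `read_word_list` and the sample counts are invented for illustration.

```python
def read_word_list(lines):
    """Parse lines of the assumed form '<frequency> <word>' into {word: frequency}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] = int(freq)
    return counts


# Hypothetical sample lines; the frequencies are made up.
sample = ["1234 talo", "56 talossa", "", "7 linuxiin"]
word_counts = read_word_list(sample)
print(word_counts)
```

With a real file, the same function can be called on an open file object, since iterating over it yields lines.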
Examples of gold standard analyses
• English: baby-sitters baby_N sit_V er_s +PL
• Finnish: linuxiin linux_N +ILL
• German: zurueckzubehalten zurueck_B zu be halt_V +INF
• Turkish: kontrole kontrol +DAT
1. A new linguistic evaluation method
• Problem: The unsupervised morphemes may have arbitrary names, not the same as the "real" linguistic morphemes, nor just subword strings
• Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs
• Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme
Evaluation measures
• F-measure = 2/(1/Precision + 1/Recall), the harmonic mean of precision and recall
• Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard
• Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
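The pair-based measures can be illustrated on a toy example. The sketch below is a simplification and not the official evaluation script: it enumerates all word pairs instead of drawing a large random sample, it takes the F-measure as the harmonic mean 2/(1/Precision + 1/Recall), and the names `sharing_pairs` and `pair_scores` as well as the labels `m1` to `m4` are hypothetical.

```python
from itertools import combinations


def sharing_pairs(analyses):
    """Return the unordered word pairs that share at least one morpheme label."""
    pairs = set()
    for w1, w2 in combinations(sorted(analyses), 2):
        if set(analyses[w1]) & set(analyses[w2]):
            pairs.add((w1, w2))
    return pairs


def pair_scores(proposed, gold):
    """Compute (precision, recall, F-measure) over words present in both analyses."""
    words = set(proposed) & set(gold)
    proposed_pairs = sharing_pairs({w: proposed[w] for w in words})
    gold_pairs = sharing_pairs({w: gold[w] for w in words})
    common = proposed_pairs & gold_pairs
    precision = len(common) / len(proposed_pairs) if proposed_pairs else 0.0
    recall = len(common) / len(gold_pairs) if gold_pairs else 0.0
    if precision > 0 and recall > 0:
        f_measure = 2 / (1 / precision + 1 / recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy data: gold labels are linguistic, proposed labels are arbitrary names.
gold = {"cats": ["cat_N", "+PL"], "dogs": ["dog_N", "+PL"],
        "cat": ["cat_N"], "walked": ["walk_V", "+PAST"]}
proposed = {"cats": ["m1", "m2"], "dogs": ["m3", "m2"],
            "cat": ["m1"], "walked": ["m4", "m2"]}
precision, recall, f_measure = pair_scores(proposed, gold)
print(precision, recall, f_measure)
```

Because only the sharing of a morpheme within a word pair is compared, the arbitrary label names never have to be mapped to linguistic ones. In the toy data the proposed label `m2` wrongly links "walked" to the plural nouns, which lowers precision while recall stays at 1.0.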
Participants
• Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt, D)
• Stefan Bordag, Univ. Leipzig, D
• Paul McNamee and James Mayfield, JHU, USA
• Daniel Zeman, Karlova Univ., CZ
• Christian Monson et al., CMU, USA
• Emily Pitler and Samarth Keshava, Univ. Yale, USA
• Morfessor MAP, Helsinki Univ. Tech, FI
• (Michael Tepper, Univ. Washington, USA)
Results: Finnish, 2.2M word types
[Bar chart: F-measure per method, axis from 0% to 50%. Methods as listed in the chart: Bernhard 2, Bernhard 1, Bordag 5a, Bordag 5, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP]
Results: Turkish, 620K word types
[Bar chart: F-measure per method, axis from 0% to 50%. Methods as listed in the chart: Zeman, Bordag 5a, Bordag 5, Bernhard 2, Bernhard 1, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper]
Results: German, 1.3M word types
[Bar chart: F-measure per method, axis from 0% to 50%. Methods as listed in the chart: Monson ParaMor-M., Bernhard 2, Bordag 5a, Bordag 5, Monson Morfessor, Bernhard 1, Monson ParaMor, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP]
Results: English, 380K word types
[Bar chart: F-measure per method, axis from 0% to 60%. Methods as listed in the chart: Bernhard 2, Bernhard 1, Pitler, Monson ParaMor-M., Monson ParaMor, Monson Morfessor, Zeman, Bordag 5a, Bordag 5, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper]
2. Practical evaluation
• Real world application for morpheme analysis: Information Retrieval
• Analysis is needed to handle morphology (inflection, compounding)
• CLEF collections for Finnish, German and English
Data sets
• Finnish (CLEF 2004): 55K documents from articles in Aamulehti 94-95; 50 test queries and 23K binary relevance assessments
• English (CLEF 2005): 107K documents from articles in Los Angeles Times 94 and Glasgow Herald 95; 50 test queries and 20K binary relevance assessments
• German (CLEF 2003): 300K documents from short articles in Frankfurter Rundschau 94, Der Spiegel 94-95 and SDA 94-95; 60 test queries and 23K binary relevance assessments
Reference methods
• Morfessor Baseline: our public code since 2002
• Morfessor Categories-MAP: improved, public since 2006
• dummy: no segmentation
• grammatical: gold standard segmentations
– all: all alternatives included
– first: only first alternative
• Porter: LEMUR's default stemmer
• Tepper: hybrid method based on Morfessor MAP
Evaluation 1/2
• Words in the documents and queries were replaced by the submitted segmentations
• New words:
– the CLEF collections contained words that were not in the original word list
– additional segmentations were requested
– if segmentation was not provided, words were indexed as such
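The replacement step, including the fallback for unanalysed words, can be sketched as follows. This is an illustration only: whitespace tokenization, the function name `segment_text` and the example segmentation are assumptions, not details taken from the evaluation setup.

```python
def segment_text(text, segmentations):
    """Replace each word by its morphemes; words without an analysis are kept as such."""
    index_terms = []
    for word in text.split():
        index_terms.extend(segmentations.get(word, [word]))
    return index_terms


# Hypothetical submitted segmentation for a single word.
segmentations = {"baby-sitters": ["baby", "sit", "er", "s"]}
terms = segment_text("baby-sitters wanted", segmentations)
print(terms)
```

The same function would be applied to both documents and queries, so that a query morpheme can match any document word containing it.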
Evaluation 2/2
• LEMUR-toolkit (http://www.lemurproject.org/)
• Okapi BM25 retrieval, default parameter settings
• Okapi seems to handle common morphemes poorly => stoplist for most common ones (above a fixed frequency threshold)
• Also an alternative set of non-stoplisted results with TFIDF
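A frequency-based stoplist for morphemes might be built as below. The slides give neither the threshold value nor whether collection or document frequency was counted, so this sketch assumes collection frequency; the names `build_stoplist` and `remove_stopped` and the threshold of 2 are hypothetical.

```python
from collections import Counter


def build_stoplist(segmented_docs, threshold):
    """Return the morphemes whose collection frequency exceeds `threshold`."""
    freq = Counter(m for doc in segmented_docs for m in doc)
    return {m for m, count in freq.items() if count > threshold}


def remove_stopped(doc, stoplist):
    """Drop stoplisted morphemes from one segmented document."""
    return [m for m in doc if m not in stoplist]


# Toy collection of already segmented documents.
docs = [["baby", "sit", "er", "s"], ["dog", "s"], ["walk", "er", "s"]]
stoplist = build_stoplist(docs, threshold=2)
filtered = remove_stopped(docs[0], stoplist)
print(stoplist, filtered)
```

Very frequent morphemes such as common suffixes occur in most documents and carry little topical information, which is the usual motivation for removing them before indexing.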
Results: Finnish
[Bar chart: Mean average precision per method, axis from 0.2 to 0.55. Methods as listed in the chart: Bernhard 2, Bernhard 1, Morfessor baseline, Morfessor MAP, Bordag 5a, Bordag 5, grammatical all, grammatical first, McNamee 5, McNamee 4, porter, McNamee 3, dummy, Zeman]
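The IR results are reported as mean average precision over the test queries. The sketch below shows the common uninterpolated definition computed from binary relevance judgments; it is an assumption that this matches the scoring tool used in the evaluation, and the document ids and function names are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant):
    """Uninterpolated average precision of one ranking against a set of relevant ids."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query average precision over (ranking, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with made-up rankings and relevance sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
map_score = mean_average_precision(runs)
print(map_score)
```

Relevant documents that are never retrieved still count in the denominator, so a run is penalised for missing them.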
Results: German
[Bar chart: Mean average precision per method, axis from 0.2 to 0.5. Methods as listed in the chart: Bernhard 1, Bernhard 2, Monson Morfessor, Morfessor MAP, Morfessor baseline, Bordag 5, Bordag 5a, Monson ParaMor-M., porter, McNamee 5, grammatical first, McNamee 4, Monson ParaMor, grammatical all, McNamee 3, Zeman]
Results: English
[Bar chart: Mean average precision per method, axis from 0.2 to 0.45. Methods as listed in the chart: porter, Bernhard 2, Bernhard 1, Morfessor baseline, grammatical first, Tepper, Monson Morfessor, Morfessor MAP, Pitler, grammatical all, McNamee 4, McNamee 5, Monson ParaMor-M., Bordag 5, Bordag 5a, dummy, McNamee 3, Monson ParaMor, Zeman]
Conclusions
• Analysis of new words important for Finnish, less so for German and English
• Porter stemming unbeaten for English (so far)
• Unsupervised morpheme analysis works very well for IR!
Future directions?
• Finnish, Turkish, English, German, ...?
• Language modeling, Speech recognition, Information Retrieval, ...?
• Venice, Budapest, ...?
• PASCAL, CLEF, ...?
Summary 2007
• 14 different unsupervised algorithms
• 8 participating research groups
• Evaluations for 4 languages (3 for IR)
• Good results in all languages and IR
• Full report and papers in the CLEF proceedings
• Details, presentations, links, info at website: http://www.cis.hut.fi/morphochallenge2007/
Acknowledgments
• Data from Leipzig and CLEF
• Gold standard providers in all languages!
• Workshop organization by CLEF
• Funding from PASCAL and Academy of Finland
• Competition participants!