-
MeCab, CaboCha
-
Spam ...
Morphological analysis covers tokenization, stemming / lemmatization, and part-of-speech tagging.
MeCab:
-
MeCab
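[Example output: nine morphemes, one per line, followed by EOS. The Japanese surface forms and feature values were lost in extraction; only the comma-separated skeleton of each line survives.]
Each line holds the surface form followed by nine comma-separated features: part of speech levels 1 to 4, conjugation type, conjugation form, base form, reading, pronunciation. Unused fields are written as *.
EOS: End of sentence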
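The output format above can be read back into named fields. The following is a minimal sketch, assuming the nine-field feature layout described on the slide and a tab between surface form and features; the field names in FIELDS and the romanized sample line are illustrative assumptions, not taken from the slide.

```python
# Hypothetical English names for the nine feature fields.
FIELDS = [
    "pos1", "pos2", "pos3", "pos4",
    "conj_type", "conj_form", "base", "reading", "pronunciation",
]


def parse_mecab_output(text):
    """Parse 'surface<TAB>f1,f2,...' lines up to the first EOS line."""
    morphemes = []
    for line in text.splitlines():
        if line == "EOS":
            break
        surface, _, feature = line.partition("\t")
        values = feature.split(",")
        morphemes.append({"surface": surface, **dict(zip(FIELDS, values))})
    return morphemes


# Made-up sample line (romanized), standing in for the lost Japanese example.
SAMPLE = "sakura\tnoun,general,*,*,*,*,sakura,sakura,sakura\nEOS\n"
parsed = parse_mecab_output(SAMPLE)
print(parsed[0]["surface"], parsed[0]["pos1"])
```

Splitting on the first tab only keeps the parser safe if a feature value ever contains a tab-free comma list of a different length; zip simply pairs whatever fields are present.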
-
TRIE-based dictionary lookup, as used in MeCab
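To illustrate why a TRIE suits dictionary lookup, here is a minimal sketch of common-prefix search: at each position in the input, one walk down the trie returns every dictionary word that starts there. This sketch uses nested Python dicts for readability; MeCab itself uses a double-array trie, and the class name, method names, and romanized word list below are assumptions made for illustration.

```python
class Trie:
    """Character trie with common-prefix search (illustrative, not MeCab's layout)."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def common_prefixes(self, text, start=0):
        """Return every dictionary word that is a prefix of text[start:]."""
        found = []
        node = self
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.is_word:
                found.append(text[start:end + 1])
        return found


def candidate_words(trie, text):
    """Map each start offset to the dictionary words beginning there."""
    return {i: trie.common_prefixes(text, i) for i in range(len(text))}


trie = Trie()
for w in ["to", "tokyo", "kyo", "kyoto"]:  # made-up toy dictionary
    trie.insert(w)
candidates = candidate_words(trie, "tokyoto")
print(candidates[0], candidates[2], candidates[5])
```

The candidates per offset are exactly the nodes of the word lattice that the later slides search with the Viterbi algorithm.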
-
[Figure: word lattice running from BOS to EOS; the Japanese node labels were lost in extraction.]
-
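[Figure: word lattice running from BOS to EOS; node labels lost in extraction.]
Heuristic methods (1980s), with KAKASI named as an example. The Japanese descriptions of the individual heuristics were lost in extraction.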
-
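[Figure: word lattice running from BOS to EOS; node labels lost in extraction.]
Viterbi algorithm (the accompanying Japanese text was lost in extraction)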
-
Viterbi algorithm, step by step (six animation frames)
[Figure: the lattice annotated with cost values, filled in progressively from frame to frame. The layout was lost in extraction. Values in the final frame: 700, 2700, 1000, 4200, 5700, 3150, 3200, 800, 1400, 2700, 1400, 1300, 6900, 4500, 800, 4550, 1500, 600, 7300, 600, 1200, 960, 500, 7400. Values that appear only in earlier frames: 7100, 7250, 7900, 7650, 8200, 8260.]
-
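Cost-based methods (1990s) ... the cost of a path is built up by adding connection costs and word costs. (The rest of the Japanese text was lost in extraction.)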
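The minimum-cost search over the lattice can be sketched as follows. This is a simplified illustration, assuming a path cost equal to the sum of word costs plus connection costs between adjacent words; the Node class, the single context id per word, and every cost value below are made up for the example and are not the numbers from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    surface: str
    start: int  # character offset where the word begins
    end: int    # character offset just past the word
    cost: int   # word cost
    ctx: str    # context id used to look up connection costs (hypothetical)


def viterbi(nodes, length, conn):
    """Return (surfaces, total cost) of the cheapest BOS-to-EOS path."""
    bos = Node("BOS", 0, 0, 0, "BOS")
    eos = Node("EOS", length, length, 0, "EOS")
    best = {bos: (0, None)}  # node -> (cheapest cumulative cost, predecessor)
    ends_at = {0: [bos]}     # offset -> reachable nodes ending there
    # Processing by start offset guarantees all predecessors are final.
    for node in sorted(list(nodes) + [eos], key=lambda n: n.start):
        candidates = [
            (best[prev][0] + conn[(prev.ctx, node.ctx)] + node.cost, prev)
            for prev in ends_at.get(node.start, [])
        ]
        if not candidates:
            continue  # no reachable word ends where this one starts
        best[node] = min(candidates, key=lambda c: c[0])
        ends_at.setdefault(node.end, []).append(node)
    if eos not in best:
        raise ValueError("no path from BOS to EOS")
    path, node = [], best[eos][1]
    while node is not bos:
        path.append(node.surface)
        node = best[node][1]
    return path[::-1], best[eos][0]


# Toy lattice for the romanized string "tokyoto" (all costs invented).
LATTICE = [
    Node("tokyo", 0, 5, 3000, "noun"),
    Node("to", 0, 2, 500, "particle"),
    Node("kyoto", 2, 7, 3200, "noun"),
    Node("to", 5, 7, 500, "particle"),
]
CONN = {
    ("BOS", "noun"): 500, ("BOS", "particle"): 3000,
    ("noun", "particle"): 200, ("particle", "noun"): 800,
    ("noun", "EOS"): 600, ("particle", "EOS"): 900,
}
print(viterbi(LATTICE, 7, CONN))
```

Each node keeps only its cheapest incoming path, so the work grows with the number of lattice edges rather than with the number of possible segmentations.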
-
(VisualMorphs)
-
Conditional Random Fields
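Adopted in MeCab 0.90. The slide compares CRF with HMM and cites a figure of 1/3 for CRF relative to HMM; the Japanese text around these terms was lost in extraction.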
-
CRF
[Figure sequence, five frames: the word lattice from BOS to EOS shown next to a table of 12 feature weights. The feature names and lattice labels were lost in extraction.]
Frame 1: all 12 weights are 0.
Frame 2: the weights are still 0; the frame is annotated "+1 -1".
Frame 3: the weights read +1, 0, -1, 0, +1, -1, +1, +1, -1, -1, -1, +1; the frame is annotated "+1 -1".
Frame 4: same weights; the frame is annotated "1".
Frame 5: same weights; the frame is annotated "+1 -1 = 0".
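One way to read the +1 / -1 annotations on these slides is as an additive weight update: features on the correct path gain 1, features on a wrong path lose 1. The sketch below shows that perceptron-style reading only; it is an assumption about what the slides depict and a deliberate simplification, since actual CRF training works with expected feature counts over all paths and a gradient-based optimizer. The feature names are hypothetical.

```python
def update(weights, correct_path, wrong_path):
    """Return new weights: +1 per feature on the correct path, -1 per feature on the wrong one."""
    updated = dict(weights)
    for feature in correct_path:
        updated[feature] = updated.get(feature, 0) + 1
    for feature in wrong_path:
        updated[feature] = updated.get(feature, 0) - 1
    return updated


def path_score(weights, path):
    """Score of a path = sum of the weights of its features."""
    return sum(weights.get(feature, 0) for feature in path)


# Hypothetical part-of-speech bigram features for two competing paths.
CORRECT = ["BOS-noun", "noun-particle", "particle-verb"]
WRONG = ["BOS-noun", "noun-noun", "noun-verb"]

weights = update({}, CORRECT, WRONG)
print(weights)
print(path_score(weights, CORRECT), path_score(weights, WRONG))
```

A feature shared by both paths receives +1 and -1 and ends at 0, so only the features that distinguish the two paths move.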
-
1 1,2,3 11
-
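[Figure: development timeline of JUMAN, ChaSen, MeCab, and Sen. The layout was lost in extraction, so the pairing of tools with years and versions is not recoverable.]
Labels in source order: 92 0.6; 05 5.1; JUMAN; 96.09.30; ChaSen; NAIST TRIE; MeCab; TRIE; back port; 03; Sen; MeCab; Java port; 06 0.9; 03 2.3.3; C++; 03 4.0; back port; 02.12.4; 96 3.1; 06 1.2.2; 94 0.6; 97 1.0; prolog , C 2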
-
MeCab
/grep *.cpp :-)()
-
MeCab
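Features (most of the Japanese text was lost in extraction):
  ChaSen is mentioned twice, in contexts that did not survive
  API: C/C++, Perl, Java, Python, Ruby, C# ...
  N-best output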
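use MeCab;
# The Japanese input string was lost in extraction; put the text to analyze here.
my $str = "";
my $mecab = MeCab::Tagger->new();
# Walk the linked list of nodes and print each surface form with its feature string.
for (my $n = $mecab->parseToNode($str); $n; $n = $n->{next}) {
    printf "%s\t%s\n", $n->{surface}, $n->{feature};
}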
-
MeCab /
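1. dic.csv (word dictionary)
  ,166,166,8487,,,,*,*,*,,,
  ,1306,1306,1849,,,,,*,*,,,
  ,1304,1304,7265,,,,,*,*,,,
  ....
  Columns: surface form, left context id, right context id, word cost, features (CSV). The surface forms and feature values of the three entries were lost in extraction.
  id,id -1 (the surrounding Japanese text was lost in extraction)
2. matrix.def (connection cost matrix)
  1306 166 -2559
  1304 1303 401
  166 1304 608
  Columns: right context id of the preceding word, left context id of the following word, connection cost.
[Figure: three lattice nodes labelled 1306 1849 1306, 166 8447 166, and 1304 7265 1304, joined by edges with costs -2559 and 608. The node labelled 166 shows 8447 where the dic.csv line shows 8487.]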
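How the two files combine into a path cost can be sketched as below. The ids and costs are the ones shown on the slide; the surface forms (w166, w1306, w1304) are placeholders because the originals were lost, and the loader functions are hypothetical helpers, not part of MeCab. A real matrix.def also starts with a header line giving the matrix dimensions, which the excerpt omits, so this sketch assumes a header-free excerpt.

```python
import csv
import io

# Excerpts from the slide; surface forms are placeholders.
DIC_CSV = """\
w166,166,166,8487
w1306,1306,1306,1849
w1304,1304,1304,7265
"""
MATRIX_DEF = """\
1306 166 -2559
1304 1303 401
166 1304 608
"""


def load_dictionary(text):
    """surface -> left id, right id, word cost, remaining feature columns."""
    entries = {}
    for row in csv.reader(io.StringIO(text)):
        surface, left_id, right_id, cost = row[:4]
        entries[surface] = {
            "left": int(left_id), "right": int(right_id),
            "cost": int(cost), "features": row[4:],
        }
    return entries


def load_matrix(text):
    """(right id of previous word, left id of next word) -> connection cost."""
    matrix = {}
    for line in text.splitlines():
        right_id, left_id, cost = map(int, line.split())
        matrix[(right_id, left_id)] = cost
    return matrix


def path_cost(words, dictionary, matrix):
    """Sum word costs plus connection costs between neighbours (BOS/EOS ignored)."""
    total = 0
    for i, word in enumerate(words):
        total += dictionary[word]["cost"]
        if i > 0:
            prev = dictionary[words[i - 1]]
            total += matrix[(prev["right"], dictionary[word]["left"])]
    return total


dictionary = load_dictionary(DIC_CSV)
matrix = load_matrix(MATRIX_DEF)
print(path_cost(["w1306", "w166", "w1304"], dictionary, matrix))
```

The three-word path corresponds to the figure: 1849 - 2559 + 8487 + 608 + 7265, using the dic.csv value 8487 for the middle word.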
-
MeCab /
4. char.def (character category definitions)
  NUMERIC 1 1 0
  ALPHA 1 1 0
  HIRAGANA 0 1 2
  0x00C0..0x00FF ALPHA
  0x3041..0x309F HIRAGANA
  ...
  Code point ranges are written in Unicode.
5. unk.def (unknown word definitions)
  KANJI,1285,1285,11426,,,*,*,*,*,*
  NUMERIC,1295,1295,27386,,,*,*,*,*,*
  ALPHA,1285,1285,13398,,,*,*,*,*,*
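The char.def excerpt can be read with a few lines of code. This is a sketch under the assumption that the three numbers after a category name are the invoke flag, the group flag, and the length, and that a range line names a single category; the function names and the DEFAULT fallback label are choices made for this example.

```python
# The char.def lines shown on the slide.
CHAR_DEF = """\
NUMERIC 1 1 0
ALPHA 1 1 0
HIRAGANA 0 1 2
0x00C0..0x00FF ALPHA
0x3041..0x309F HIRAGANA
"""


def parse_char_def(text):
    """Return (category settings, list of (low, high, category) ranges)."""
    categories, ranges = {}, []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if fields[0].startswith("0x"):
            low, _, high = fields[0].partition("..")
            high = high or low  # a single code point is its own range
            ranges.append((int(low, 16), int(high, 16), fields[1]))
        else:
            name, invoke, group, length = fields[:4]
            categories[name] = {
                "invoke": invoke == "1",
                "group": group == "1",
                "length": int(length),
            }
    return categories, ranges


def category_of(ch, ranges, default="DEFAULT"):
    """Look up the category of one character by its Unicode code point."""
    code_point = ord(ch)
    for low, high, name in ranges:
        if low <= code_point <= high:
            return name
    return default


categories, ranges = parse_char_def(CHAR_DEF)
print(category_of("\u3042", ranges), category_of("\u00e9", ranges))
```

The category found for a character selects which unk.def template (KANJI, NUMERIC, ALPHA, ...) supplies the ids and cost for an unknown word.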
-
MeCab
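The dictionary is plain CSV. MeCab itself interprets only the first 4 columns (surface form, left context id, right context id, cost); the columns after that are free-form, so extra information such as a URL can be appended. In the examples below an extra field (particle, cherry) has been added at the end of each entry:
  ,166,166,8487,,,,*,*,*,,particle
  ,1304,1304,7265,,,,,*,*,,,cherry
  ....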
-
CaboCha
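Based on Support Vector Machines (SVMs); PKE is mentioned in connection with the SVM; parsing uses the Cascaded Chunking Model. (The Japanese text around these terms was lost in extraction.)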
To get over these problems, we now introduce a new parsing algorithm called the cascaded chunking model.
The original idea of our model stems from the English parsing method proposed by Abney (1991).
The cascaded chunking model parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side.
Furthermore, training examples are extracted using this algorithm itself.
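The deterministic procedure described in the excerpt can be sketched as a loop that repeatedly asks one question per segment: does it modify its immediate right neighbour? The sketch below is a simplified reading of that idea, not CaboCha's implementation. The `modifies` callback stands in for the SVM classifier, the O/D tag names and the rule for removing decided segments are assumptions of this sketch, and forcing the second-to-last remaining segment to attach to the last one is an added assumption that guarantees termination.

```python
def cascaded_chunking(n, modifies):
    """Return head[i] for each of n segments; the last segment has head -1."""
    head = [-1] * n
    alive = list(range(n))          # segments still under consideration
    tags = {i: "O" for i in alive}  # O = undecided, D = modifies right neighbour
    while len(alive) > 1:
        # Each undecided segment looks only at its immediate right neighbour.
        for pos in range(len(alive) - 1):
            seg, right = alive[pos], alive[pos + 1]
            if tags[seg] == "O":
                forced = pos == len(alive) - 2  # assumption: penultimate always attaches
                if forced or modifies(seg, right):
                    tags[seg] = "D"
                    head[seg] = right
        # Remove decided segments that nothing to their left still depends on.
        kept = []
        for pos, seg in enumerate(alive):
            left_undecided = pos == 0 or tags[alive[pos - 1]] == "O"
            if tags[seg] == "D" and left_undecided:
                continue
            kept.append(seg)
        alive = kept
    return head


def oracle(gold):
    """Hypothetical stand-in for the classifier: answers from known heads."""
    return lambda seg, right: gold[seg] == right


print(cascaded_chunking(4, oracle([1, 3, 3, -1])))
print(cascaded_chunking(4, oracle([3, 2, 3, -1])))
```

Running the same loop over treebank sentences with the oracle is one way to picture the last sentence of the excerpt: every question the loop asks becomes a training example for the classifier.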
-
CaboCha
-
MeCab/CaboCha
MeCab: http://mecab.sourceforge.net/
CaboCha: http://chasen.org/~taku/software/cabocha/
(authors and title lost in extraction), Vol.41, No.11, pp.1208-1214, November 2000.
(authors and title lost in extraction), Vol.43, No.3, pp.685-695, March 2002.
MeCab: Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto (2004), Applying Conditional Random Fields to Japanese Morphological Analysis, EMNLP 2004.
CaboCha: (authors and title lost in extraction), Vol.43, No.6, pp.1834-1842, June 2002.
(authors and title lost in extraction), Vol.45, No.9, pp.2177-2185, September 2004.
(authors and title lost in extraction), Vol.19, No.3, pp.334-339, May 2004.