A Simple Introduction to Semantic Text Classification (20170331)
TRANSCRIPT
-
http://l.pulipuli.info/nccu/17-tm
2017/3/31
-
2
-
3
~~
(1)...
P4
P1
P2 /
P3
P4
P5
-
4
-
5
P4 P6
-
6
Weka
-
https://www.youtube.com/watch?v=-W3pnicVgn0
7
-
8
-
1.
2.
3.
4.
5.
6.
9
-
10
Weka 3.8
()
Google
CSV to ARFF
ARFF to CSV
https://docs.google.com/document/d/1FMJz4rWNGuJnSVEwJG5vFx_0lpLk0VqNTCO0oEHokPs/pub
https://docs.google.com/spreadsheets/
http://pulipulichen.github.io/jieba-js/weka_csv_arff.html
http://pulipulichen.github.io/jieba-js/weka_arff_csv.html
-
11
Part 1.
-
12
-
13
-
14
cm g
2.98 0.74
3.5 0.76
-
15
-
16
1.
2.
3.
? !
-
17
1
? !
46
26
: 8
: 2
2
-
,,,,,,
18
2
? !
2
2
-
19
3
? !
-
20
3
? !
5
6
5 1 1 1 1 1 0 0
6 1 1 1 0 0 1 1
1=; 0=
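The 1/0 rows above mark whether each term occurs in a document. A minimal sketch of building such presence vectors follows. The tokens `t1` to `t7` are placeholders chosen only to reproduce the two rows on the slide; they are not the original words.

```python
def build_vocabulary(docs):
    """Collect distinct tokens in first-seen order."""
    vocab = []
    seen = set()
    for tokens in docs:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab


def presence_vector(tokens, vocab):
    """Return 1 where the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


# Placeholder tokens: documents 5 and 6 share t1..t3 and differ in the rest.
doc5 = ["t1", "t2", "t3", "t4", "t5"]
doc6 = ["t1", "t2", "t3", "t6", "t7"]
vocab = build_vocabulary([doc5, doc6])
vec5 = presence_vector(doc5, vocab)
vec6 = presence_vector(doc6, vocab)
print(vec5)  # [1, 1, 1, 1, 1, 0, 0]
print(vec6)  # [1, 1, 1, 0, 0, 1, 1]
```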
-
21
https://www.wikiwand.com/zh-tw/%E5%90%91%E9%87%8F%E7%A9%BA%E9%96%93%E6%A8%A1%E5%9E%8B
Gerard Salton
d1, d2
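In the vector space model, two documents are compared as vectors. One common measure is cosine similarity; using it here is an assumption, since the slide text does not survive. The vectors reuse the 1/0 rows from the previous slide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


d1 = [1, 1, 1, 1, 1, 0, 0]
d2 = [1, 1, 1, 0, 0, 1, 1]
similarity = cosine_similarity(d1, d2)
print(round(similarity, 3))  # 0.6
```

The two documents share 3 of their 5 terms, so the similarity is 3 / (sqrt(5) * sqrt(5)) = 0.6.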
-
22
Part 2.
-
23
1
2
3
-
24
-
25
-
26
-
https://github.com/fxsjy/jieba
Python
PHP, Java, Node.js, .NET (C#), C++, R
27
Jieba
https://www.slideshare.net/ssuser4568b0/jieba
1.
2. Trie
3. DAG
4.
5.
HMM, Viterbi
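The slide names a dictionary (Trie), a DAG, and HMM/Viterbi. The sketch below illustrates only the dictionary and DAG part: list every dictionary word starting at each position, then pick the most probable path with dynamic programming. The dictionary and its frequencies are a made-up toy, not jieba's data or code, and the HMM step for unknown words is left out.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real dictionary).
FREQ = {
    "我": 5, "來到": 3, "來": 2, "到": 2,
    "北京": 4, "北": 1, "京": 1,
    "清華": 2, "清華大學": 3, "大學": 4,
    "清": 1, "華": 1, "大": 1, "學": 1,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """For each start index, list every end index that forms a dictionary word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # fall back to a single character
    return dag


def best_route(sentence, dag):
    """Right-to-left dynamic programming over summed log-probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Follow the best route from the start of the sentence."""
    dag = build_dag(sentence)
    route = best_route(sentence, dag)
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("我來到北京清華大學"))
```

With this toy dictionary the longer words win because one high-frequency word costs less than several single characters.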
-
28
Jieba-JS
https://goo.gl/YrSTn9
-
29
-
30
-
31
CSV to ARFF
1. CSV
[]
2. CSV to ARFF
3. ARFF
-
LibreOffice Calc, Google
Unicode
32
Microsoft Office
Big5
1. CSV
-
33
1. CSV
document
class
class: ?
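The CSV has two columns, `document` and `class`, and a document whose class is unknown gets `?`. A minimal sketch of producing such a file with Python's standard `csv` module follows; the row texts and the labels `A` and `B` are hypothetical.

```python
import csv
import io

# Hypothetical rows: labeled training documents plus one unlabeled document.
rows = [
    {"document": "first training text", "class": "A"},
    {"document": "second training text", "class": "B"},
    {"document": "text to classify", "class": "?"},  # unknown class
]


def to_csv(rows):
    """Serialize the rows with a document,class header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["document", "class"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

When saving to disk, writing the file as UTF-8 matches the Unicode advice on the earlier slide.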
-
34
1. CSV
CSV
https://docs.google.com/spreadsheets/d/1jaHDl0692t5OHzRlE4YOXJiRKTgv5xQajl_wqWGlQUY/export?format=csv
-
1. CSV
35
-
36
2. CSV to ARFF
CSV to ARFF
https://pulipulichen.github.io/jieba-js/weka_csv_arff.html
-
37
2. CSV to ARFF
1. CSV
2.
3.
-
38
3. ARFF&
train
test
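To show what the conversion produces, here is a rough sketch that renders (document, class) rows as ARFF text with one string attribute and one nominal class attribute. It is not the course's converter. Sending rows whose class is `?` to the test file is an assumption, and the relation names and sample rows are hypothetical.

```python
def to_arff(relation, rows, class_labels):
    """Render (document, class) pairs as ARFF text."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute document string",
        "@attribute class {" + ",".join(class_labels) + "}",
        "",
        "@data",
    ]
    for document, label in rows:
        # Escape backslashes and single quotes inside the quoted string.
        escaped = document.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{label}")
    return "\n".join(lines) + "\n"


rows = [
    ("first training text", "A"),
    ("second training text", "B"),
    ("text to classify", "?"),
]
labels = sorted({label for _, label in rows if label != "?"})
train_rows = [r for r in rows if r[1] != "?"]  # labeled rows
test_rows = [r for r in rows if r[1] == "?"]   # rows still to be classified
train_arff = to_arff("train", train_rows, labels)
test_arff = to_arff("test", test_rows, labels)
print(train_arff)
print(test_arff)
```

Both files declare the same class labels so that a model trained on one can be applied to the other.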
-
39
Weka
weka.filters.unsupervised.attribute.
StringToWordVector
-
40
1. CSV
2. CSV to ARFF
3. ARFF
train
test
-
41
Part 3.
-
42
Weka
Java
Windows, Mac OS, Linux
-
43
Weka
Weka 3.8.1
https://docs.google.com/document/d/1FMJz4rWNGuJnSVEwJG5vFx_0lpLk0VqNTCO0oEHokPs/pub
-
Weka (1/2)
Weka
C:\Program Files\Weka-3-8
RunWeka.ini
>
44
-
45
Weka (2/2)
fileEncoding=Cp1252
fileEncoding=utf-8
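The change is a single line in RunWeka.ini: `fileEncoding=Cp1252` becomes `fileEncoding=utf-8`. A small sketch of that replacement on the file's text follows; the sample fragment is hypothetical, and editing the real file under C:\Program Files may need administrator rights.

```python
def use_utf8(ini_text):
    """Rewrite the fileEncoding line, leaving every other line untouched."""
    lines = []
    for line in ini_text.splitlines():
        if line.strip().startswith("fileEncoding="):
            line = "fileEncoding=utf-8"
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical fragment standing in for the real RunWeka.ini contents.
sample = "# sample fragment\nfileEncoding=Cp1252\n"
updated = use_utf8(sample)
print(updated)
```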
-
46
&
1
2
3
4
-
1-1.
1-2.
1-3. ()
47
1.
train
test
-
48
2. &
2-1. Weka Explorer
2-2.
2-3. Meta
2-4. NaiveBayes
2-5.
StringToWordVector
2-6.
-
49
2-1. Weka Explorer
Weka 3.8 Explorer
-
50
2-2. (1/2)
Open file
train
-
51
2-2. (2/2)
6
document class
-
52
2-3. Meta
-
53
2-3. Meta
weka.classifiers.meta.
FilteredClassifier
-
54
2-4. NaiveBayes
weka.classifiers.bayes.
NaiveBayes
-
55
2-5. () StringToWordVector
weka.filters.
unsupervised.attribute.
StringToWordVector
-
56
-
57
2-6.
class
-
58
2. &
2-1. Weka Explorer
2-2.
2-3. Meta
2-4. NaiveBayes
2-5.
StringToWordVector
2-6.
-
59
3.
3-1.
3-2.
3-3.
-
Cross-validation: 610
(6)
60
3-1.
-
1~6
61
6
()
1
2
3
4
5
6
1.
2.
4/6=66.7%
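With 6 documents and 6 folds, each fold holds out one document and trains on the other five; accuracy is the share of held-out documents predicted correctly. A minimal sketch of that bookkeeping follows. The per-fold outcomes are hypothetical, chosen to match 4/6, and Weka may shuffle the data before splitting.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_instances) if i not in test]
        yield train, test
        start += size


def accuracy(correct_flags):
    """Fraction of held-out instances that were classified correctly."""
    return sum(correct_flags) / len(correct_flags)


folds = list(k_fold_indices(6, 6))
print(folds[0])  # ([1, 2, 3, 4, 5], [0])

# Hypothetical outcomes: 4 of the 6 held-out documents predicted correctly.
outcomes = [True, True, False, True, False, True]
print(f"{accuracy(outcomes):.1%}")  # 66.7%
```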
-
62
3-2.
Start
-
63
3-3.
Correctly Classified Instances: 66.7%
-
64
...
66.7%
31
-
3-1.
3-2.
3-3.
65
3.
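Steps 2 and 3 are done in the Explorer GUI. For reference, the same setup (FilteredClassifier wrapping StringToWordVector and NaiveBayes, with 6-fold cross-validation) can also be written as a command line. The sketch below only assembles the argument list. The file names `weka.jar` and `train.arff` are assumptions, and the flags `-t`, `-x`, `-F`, `-W` are recalled from Weka's command-line help, so check them against the installed version.

```python
def filtered_classifier_command(train_arff, folds, weka_jar="weka.jar"):
    """Build the java command as a list suitable for subprocess.run."""
    return [
        "java", "-cp", weka_jar,
        "weka.classifiers.meta.FilteredClassifier",
        "-t", train_arff,      # training file
        "-x", str(folds),      # number of cross-validation folds
        "-F", "weka.filters.unsupervised.attribute.StringToWordVector",
        "-W", "weka.classifiers.bayes.NaiveBayes",
    ]


command = filtered_classifier_command("train.arff", folds=6)
print(" ".join(command))
```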
-
66
4.
4-1.
4-2.
4-3.
4-4.
4-5. ARFF to CSV
4-6.
-
4-1. (1/2)
67
Supplied test set: [Set...]
-
4-1. (2/2)
68
Open file
test
class
-
69
4-2. (1/2)
More options...
Output predictions:
[Choose] CSV
-
70
4-2. (2/2)
outputDistribution: True
CSV
-
71
4-3.
Start
-
72
4-4. ARFF
Save result buffer
result.txt
-
73
4-5. ARFF to CSV (1/3)
ARFF to CSV
http://pulipulichen.github.io/jieba-js/weka_arff_csv.html
-
74
4-5. ARFF to CSV (2/3)
test
result.txt
-
75
4-5. ARFF to CSV (3/3)
.csv
-
76
4-6. (1/4) Google
Google
https://drive.google.com/drive
-
77
4-6. (2/4)
.csv
-
78
4-6. (3/4)
-
79
4-6. (4/4)
-
80
document class
predicted class
pro_dis:
pro_dis:
? *1 0
? *0.619 0.381
1896
? *1 0
Google
? 0.001 *0.999
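In the pro_dis columns, each row holds one probability per class, and the asterisk marks the probability of the predicted class. A small sketch of reading those cells follows, using the four rows shown above.

```python
def parse_distribution(cells):
    """Turn cells like ['*0.619', '0.381'] into (probabilities, starred index)."""
    probabilities = []
    predicted_index = None
    for index, cell in enumerate(cells):
        if cell.startswith("*"):
            predicted_index = index
            cell = cell[1:]
        probabilities.append(float(cell))
    return probabilities, predicted_index


rows = [["*1", "0"], ["*0.619", "0.381"], ["*1", "0"], ["0.001", "*0.999"]]
parsed = [parse_distribution(r) for r in rows]
for probabilities, predicted_index in parsed:
    print(predicted_index, max(probabilities))
```

A row such as 0.619 / 0.381 is a much less certain prediction than 1 / 0, which is worth checking before trusting the predicted class.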
-
4-1.
4-2.
4-3.
4-4.
4-5. ARFF to CSV
4-6.
81
4.
-
82
Part 4.
-
83
-
84
Information Gain
()
1100% 260%40%
166% 33% 2100% 3100%
()
2 1
2 1
2 3
1 1
1 2
2 2
2 2
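Information gain measures how much knowing an attribute's value reduces the entropy of the class. The sketch below uses the standard definition (class entropy minus the weighted entropy after splitting). The four-row dataset is hypothetical and is not the table on the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical data: one attribute separates the classes, the other does not.
labels = ["A", "A", "B", "B"]
separating = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_separating, gain_uninformative)
```

An attribute that splits the classes perfectly earns the full class entropy (1 bit here), while one that leaves each group evenly mixed earns nothing.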
-
85
-
86
1.
2.:
StringToWordVector
3. Class
4.
InfoGainAttributeEval
Ranker
5.
-
87
2. (1/2)
weka.filters.
unsupervised.attribute.
StringToWordVector
-
88
2. (2/2)
-
89
3. Class
class
-
90
4.
weka.attributeSelection.
InfoGainAttributeEval
weka.attributeSelection.
Ranker
-
91
5. (1/2)
Start
-
92
5. (2/2)
Selected attributes
-
93
Ranked attributes:
0.459 9
0.459 49
0.459 46
document class
28 15
80
1891
3D
-
94
document class
28 15
80
1891
3D
111
-
1.
2.:
StringToWordVector
3. Class
4.
InfoGainAttributeEval
Ranker
5.
95
-
96
Part 5.
-
97
1.
2.
3.
4. TF-IDF
5.
6.() ()
7.
-
98
1891
1891
-
99
tokenizer: CharacterNGramTokenizer -max 1 -min 1
StringToWordVector
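With min 1 and max 1, a character n-gram tokenizer emits every single character as a token, so no word segmentation is needed beforehand. The sketch below shows the idea in plain Python; it is not Weka's implementation, and the order of the emitted n-grams may differ from Weka's.

```python
def char_ngrams(text, min_n=1, max_n=1):
    """Return all character n-grams with lengths from min_n to max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


unigrams = char_ngrams("文本分類")
print(unigrams)  # ['文', '本', '分', '類']
print(char_ngrams("文本分類", 1, 2))
```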
-
100
-
101
(1/2)
-
102
(2/2)
-
103
? !
1 1 1 1 1 8
1 1 1 1 1 9
-
104
()
28 15
1 1 1 1 1
1 2 2 1 1
-
105
TF-IDF
...
1 0 1
0 1 1
TF-IDF
...
2 0 1
0 2 1
-
106
TF-IDF
=
=
-
107
TF-IDF
=
2
2
*2
-
108
TF-IDF
(1/2)
=
*3
-
109
TF-IDF
(2/2)
=
(
)
*()
-
110
Weka
2-5: IDFTransform
TFTransform
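As recalled from the option help of StringToWordVector, TFTransform replaces a count f with log(1 + f), and IDFTransform replaces it with f * log(number of documents / number of documents containing the term). The sketch below applies both to the counts from the earlier TF-IDF slide (2 0 1 and 0 2 1). Treat the formulas as an assumption to verify against your Weka version.

```python
import math


def tf_transform(count):
    """Dampen raw counts: log(1 + f)."""
    return math.log(1 + count)


def idf_transform(count, n_docs, n_docs_with_term):
    """Weight a count by how rare the term is across documents."""
    return count * math.log(n_docs / n_docs_with_term)


# Term counts for three terms in two documents.
counts = [
    [2, 0, 1],
    [0, 2, 1],
]
n_docs = len(counts)
doc_freq = [sum(1 for row in counts if row[t] > 0) for t in range(3)]
weighted = [
    [idf_transform(c, n_docs, doc_freq[t]) if c else 0.0 for t, c in enumerate(row)]
    for row in counts
]
print(doc_freq)
print(weighted)
```

The third term appears in both documents, so its weight drops to 0, while a term unique to one document keeps a positive weight.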
-
111
weka.classifiers.bayes.
NaiveBayes
weka.classifiers.functions.
Logistic
weka.classifiers.trees.
J48
weka.classifiers.functions.
SMO
weka.classifiers.functions.
MultilayerPerceptron
weka.classifiers.functions.
NeuralNetwork
-
Weka
112
-
Weka
:
WEKA
2015
ISBN: 978-986-379-067-9
113
510.25474 368
-
() ()
114
97 23
document
class
...
....
document
class
... 97
23
-
115
doNotOperateOnPerClassBasis: True
StringToWordVector
-
116
weka.classifiers.bayes.
NaiveBayes
weka.classifiers.functions.
MultilayerPerceptron
/
-
117
&
1
2
3
4 Weka
-
118
Part 6.
-
119
P4
P6
-
120
P6194.87%
()
-
121
?
-
122
-
123
http://l.pulipuli.info/nccu/17-tm
-
http://blog.pulipuli.info/