Speaker: Shau-Shiang Hung (洪紹祥)
Adviser: Shu-Chen Cheng (鄭淑真)
Date: 99/05/04
Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine Learning Methods for Medical Text Categorization," 2009 Pacific-Asia Conference on Circuits, Communications and Systems (PACCS), pp. 494-497, 2009.
Outline
• Introduction
• Document indexing
• Classification Algorithm
• Experiments
• Conclusion
Introduction
• Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents.
• Digital medical information is increasing rapidly with the development of networks.
• How to handle and organize it effectively is a problem in the field of medical informatics.
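Document indexing
• Because classifiers cannot directly interpret documents, the documents must be transformed into a form that classifiers can process.
• The vector space model (VSM) is a widely used statistical model: each document d_j is represented as a vector of term weights.
$d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{ij}, \ldots, w_{|T|j})$
Example: Document 1 = (historic site, tourism, service, ...)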
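As a rough illustration of the vector space model, the sketch below maps a tokenized document onto a fixed vocabulary and uses raw term counts as placeholder weights. The vocabulary, the sample document, and the helper name `to_vector` are assumptions made for this example and do not come from the slides.

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    """Return one weight per vocabulary term (here: the raw count in the document)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary T and tokenized document d_1
vocabulary = ["historic_site", "tourism", "service"]
doc1 = ["historic_site", "tourism", "historic_site", "hotel"]

vector = to_vector(doc1, vocabulary)
print(vector)
```

Terms outside the vocabulary (here `hotel`) are simply ignored, so every document ends up with a vector of the same dimension |T|.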
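Document indexing
A. Standard Term Frequency Inverse Document Frequency (TFIDF)
$tfidf(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$
where #(t_k, d_j) is the number of occurrences of term t_k in document d_j, |Tr| is the number of training documents, and #_Tr(t_k) is the number of training documents that contain t_k.
Example (term "historic site" in Document 1, with the term count normalized by all_j):
$tfidf(\text{historic site}, \text{Document 1}) = \frac{10}{100} \cdot \log \frac{1000}{100}$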
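The worked example can be checked with a few lines of Python. This is only a sketch: the base of the logarithm is not stated on the slide, so base 10 is assumed here, and the optional `doc_length` argument (used to divide the raw count, as the example's 10/100 suggests) and the function name are assumptions as well.

```python
import math

def tfidf(term_count, total_docs, docs_with_term, doc_length=None):
    """#(t_k, d_j) * log10(|Tr| / #Tr(t_k)).

    The count is divided by doc_length when one is given; this is an
    assumption that mirrors the 10/100 factor in the worked example.
    """
    tf = term_count / doc_length if doc_length else term_count
    return tf * math.log10(total_docs / docs_with_term)

# Numbers from the slide's example: 10/100 * log(1000/100)
weight = tfidf(term_count=10, total_docs=1000, docs_with_term=100, doc_length=100)
print(round(weight, 6))
```

A term that occurs in every training document gets weight 0, since log(|Tr| / |Tr|) = 0.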
Document indexing
• So that the weights fall in the [0,1] interval and the documents are represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization.
$w_{kj} = \frac{tfidf(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} (tfidf(t_s, d_j))^2}}$
Example: $w = \frac{10^{-1}}{\sqrt{\text{sum of the squared TFIDF values of all keywords in Document 1}}}$
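Cosine normalization amounts to dividing each weight by the Euclidean length of the document vector. The sketch below shows this under the assumption that the raw weights are non-negative; the function name and the sample values are hypothetical.

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return list(weights)  # all-zero vector: nothing to scale
    return [w / norm for w in weights]

raw = [0.1, 0.2, 0.2]  # hypothetical tfidf values for one document
normalized = cosine_normalize(raw)
print(normalized)
```

After normalization every document vector has length 1, so documents of very different sizes become directly comparable.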
Document indexing
B. Improvement: Term Frequency, Inverted Document Frequency and Inverted Entropy (TFIDFIE)
• In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification.
• For example, Term 1 (guest room) and Term 2 (scenery) have the same weight.
Document indexing
• To bring out the relation between terms and categories, term weighting also takes into account how the documents containing a term are distributed across the categories. This distribution can be measured by the information entropy H.
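$H(t_k, d_j) = -\sum_{l=1}^{|C|} \frac{DF_{kl}}{\#_{Tr}(t_k)} \log \frac{DF_{kl}}{\#_{Tr}(t_k)}$
where DF_kl is the number of training documents in category l that contain t_k, and |C| is the number of categories.
Example: $H(\text{guest room}, \text{Document 1}) = -\left(\frac{10}{100}\log\frac{10}{100} + \frac{20}{100}\log\frac{20}{100} + \frac{25}{100}\log\frac{25}{100} + \frac{15}{100}\log\frac{15}{100}\right)$
$tfidfie(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k) \cdot H(t_k, d_j)}$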
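The effect of the entropy term can be seen in a small sketch. Two things are assumed here because the slides do not settle them: base-10 logarithms, and a reading of the formula in which the entropy sits in the denominator inside the logarithm, i.e. tfidfie = #(t_k, d_j) · log(|Tr| / (#Tr(t_k) · H)). The function names and the category counts are hypothetical.

```python
import math

def category_entropy(df_per_category, docs_with_term):
    """H = -sum_l (DF_kl / #Tr(t_k)) * log10(DF_kl / #Tr(t_k)); empty categories are skipped."""
    h = 0.0
    for df in df_per_category:
        if df > 0:
            p = df / docs_with_term
            h -= p * math.log10(p)
    return h

def tfidfie(term_count, total_docs, docs_with_term, entropy):
    """#(t_k, d_j) * log10(|Tr| / (#Tr(t_k) * H)), under the reading assumed above.

    Raises ZeroDivisionError when entropy is 0 (term confined to one category);
    the slides do not say how that case is handled.
    """
    return term_count * math.log10(total_docs / (docs_with_term * entropy))

# Two hypothetical terms, each found in 100 of 1000 training documents
spread = category_entropy([25, 25, 25, 25], docs_with_term=100)   # evenly spread
focused = category_entropy([97, 1, 1, 1], docs_with_term=100)     # mostly one category

w_spread = tfidfie(10, 1000, 100, spread)
w_focused = tfidfie(10, 1000, 100, focused)
print(spread, focused, w_spread, w_focused)
```

Both terms would receive the same tfidf weight, but the term concentrated in one category has lower entropy and therefore a larger tfidfie weight, which is the behavior the improvement is after.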
As with tfidf, the tfidfie weights are cosine-normalized:
$w_{kj} = \frac{tfidfie(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} (tfidfie(t_s, d_j))^2}}$
Classification Algorithm
A. K-Nearest Neighbor (KNN)
B. Support Vector Machine (SVM)
C. Naïve Bayes (NB)
D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)
• Because immune algorithms by nature distinguish between self and non-self, they can be used in text categorization.
Classification Algorithm
• CSABAD
In text categorization:
• Antigen: a training text.
• B cell: an individual of the classifier.
• Antibody: the affinity between the individual and the training documents.
• The final classifier is composed of many memory B cells.
• The cosine of two vectors is used to measure the affinity f(x_i, d_j) between B cell x_i and antigen d_j.
• The affinity f(x_i) of B cell x_i with N antigens is defined as the average of all N affinities.
• The antibody selection probability P(x_i) is defined as follows:
$P(x_i) = \frac{\sum_{j=1}^{M} |f(x_i) - f(x_j)|}{\sum_{i=1}^{M} \sum_{j=1}^{M} |f(x_i) - f(x_j)|}$
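The affinity and selection-probability definitions can be sketched as follows. This is not the CSABAD algorithm itself, only the two quantities defined on this slide; the vectors, the function names, and the uniform fallback for identical affinities are assumptions made for the example.

```python
import math

def cosine_affinity(x, d):
    """Cosine of the angle between B cell vector x and antigen vector d."""
    dot = sum(a * b for a, b in zip(x, d))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def mean_affinity(x, antigens):
    """f(x_i): average affinity of one B cell over all N antigens."""
    return sum(cosine_affinity(x, d) for d in antigens) / len(antigens)

def selection_probabilities(affinities):
    """P(x_i) = sum_j |f(x_i) - f(x_j)| / sum_i sum_j |f(x_i) - f(x_j)|."""
    distances = [sum(abs(fi - fj) for fj in affinities) for fi in affinities]
    total = sum(distances)
    if total == 0.0:
        # All affinities equal: fall back to a uniform distribution (an assumption).
        return [1.0 / len(affinities)] * len(affinities)
    return [dist / total for dist in distances]

antigens = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical training vectors
b_cells = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # hypothetical B cell population
f = [mean_affinity(x, antigens) for x in b_cells]
p = selection_probabilities(f)
print(f, p)
```

A B cell whose affinity differs most from the rest of the population gets the highest selection probability, so crowded regions of similar antibodies are selected less often.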
Experiments
A. Data collection
• OHSUMED is a bibliographic document collection.
• The experiments use OHSCAL, a single-label subset of OHSUMED that consists of 11,162 documents in 10 categories.
Experiments
B. Experiment results and analysis
• The OHSCAL dataset was randomly divided into a training set and a test set in a 2:1 ratio.
• To reduce the effect of chance on the results, ten independent experiments were run on OHSCAL.
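The evaluation protocol (random 2:1 split, repeated ten times) might look like the sketch below. The integer stand-ins for documents, the per-run seeds, and the placeholder `evaluate` callable are all assumptions; a real run would train and score a classifier at that point.

```python
import random

def split_two_to_one(documents, rng):
    """Shuffle a copy of the documents and cut it into 2/3 training, 1/3 test."""
    shuffled = list(documents)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

def run_experiments(documents, evaluate, runs=10):
    """Average the score returned by evaluate(train, test) over independent splits."""
    scores = []
    for run in range(runs):
        rng = random.Random(run)  # a different seed for each independent run
        train, test = split_two_to_one(documents, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores), scores

documents = list(range(11162))  # stand-ins for the 11,162 OHSCAL documents

def evaluate(train, test):
    """Placeholder only: returns the test fraction instead of a real accuracy."""
    return len(test) / (len(train) + len(test))

mean_score, scores = run_experiments(documents, evaluate)
train, test = split_two_to_one(documents, random.Random(0))
print(len(train), len(test), len(scores))
```

Averaging over several independent splits is what keeps a single lucky or unlucky partition from dominating the reported result.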
Conclusion
• In this paper, we propose an improved approach called TFIDFIE. It considers how the training documents in which a term occurs are distributed across the categories.
• The experiments show that SVM and CSABAD significantly outperform kNN and Naive Bayes, and that TFIDFIE is more effective than TFIDF.
• Considering the characteristics of professional medical terms, future work will study feature selection for medical text classification.