Text Classification for Healthcare Information Support
Rey-Long Liu (劉瑞瓏), Dept. of Medical Informatics
Tzu Chi University, Taiwan
Background
Text categorization (TC) as a fundamental component for information processing
– Many TC techniques have been developed
Unfortunately, high-quality TC is often an unrealizable ideal
– Very high precision
– Very high recall
Background (Cont.)
An application scenario: healthcare information support
[Diagram: general users (e.g. patients) send queries and inquiries; information gathering systems collect relevant information; high-quality TC classifies both the gathered information and the inquiries into a classified information base; healthcare professionals confirm the classifications and provide consultancy]
Outline
Interaction as an approach to high-quality TC
– Main consideration
Reducing the amount of the interaction
– Criteria & straightforward interaction strategies
An intelligent interaction strategy: COM (Content Overlapping Measurement)
Empirical evaluation
– Chinese cancer text classification
Conclusion
Interaction for High-Quality TC
Interaction with the user
– Possibly a “final” approach
– More application scenarios: information recommendation & archiving (definitely relevant vs. potentially relevant)
Main consideration
– Reducing the number of interactions
Interaction for High-Quality TC (Cont.)
Evaluation criteria
– Confirmation Precision (CP): related to the cognitive load on users
  CP = (# wrong decisions identified) / (# decisions identified as potentially wrong)
     = (# necessary confirmations conducted) / (# confirmations conducted)
– Confirmation Recall (CR): related to the quality of TC
  CR = (# wrong decisions identified) / (# wrong decisions that should be identified)
     = (# necessary confirmations conducted) / (# confirmations that should be conducted)
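In code, the two criteria are simple ratios; a minimal sketch with hypothetical counts (the function names are mine, not from the slides):

```python
def confirmation_precision(wrong_identified, identified_as_potentially_wrong):
    """CP: fraction of conducted confirmations that were necessary,
    i.e. that actually caught a wrong classification decision."""
    return wrong_identified / identified_as_potentially_wrong

def confirmation_recall(wrong_identified, wrong_that_should_be_identified):
    """CR: fraction of all wrong decisions that the confirmations caught."""
    return wrong_identified / wrong_that_should_be_identified

# Hypothetical counts: 40 confirmations conducted, 30 of which caught a
# wrong decision, out of 50 wrong decisions in total.
print(confirmation_precision(30, 40))  # 0.75
print(confirmation_recall(30, 50))     # 0.6
```

A high CP means few confirmations were wasted on correct decisions (low cognitive load); a high CR means few wrong decisions slipped through unconfirmed (high-quality TC).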
Interaction for High-Quality TC (Cont.)
Straightforward interaction strategies

Uniform Confirmation (UC): preferring CR
(A) Set two thresholds, a Rejection Threshold (RT) and an Acceptance Threshold (AT), to identify the DOA range for confirmation
[Diagram: validation documents spread along the DOA axis from Min DOA to Max DOA, with RT and AT marked; o: positive validation document, x: negative validation document]
(B) Confirmation strategy:
Prob = 0 (when DOA(d, c) > AT)
Prob = 0 (when DOA(d, c) < RT)
Prob = 1.0 (when RT ≤ DOA(d, c) ≤ AT)
Interaction for High-Quality TC (Cont.)
Probabilistic Confirmation (PC): preferring CP
(A) Tune a single threshold in the hope of optimizing F1
[Diagram: validation documents spread along the DOA axis from Min DOA to Max DOA, with the classifier’s threshold (T) marked; o: positive validation document, x: negative validation document]
(B) Confirmation strategy: the confirmation probability is 0 at the extremes and 1.0 at the threshold:
Prob = 0 (when DOA(d, c) = Min)
Prob = 0 (when DOA(d, c) = Max)
Prob = 1.0 (when DOA(d, c) = threshold)
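The two strategies can be sketched as confirmation-probability functions. The linear fall-off in PC between the endpoints, and all names here, are assumptions; the slides only state the probabilities at the endpoints:

```python
def uc_prob(doa, rt, at):
    """Uniform Confirmation: confirm every document whose DOA falls
    inside [RT, AT], and none outside."""
    return 1.0 if rt <= doa <= at else 0.0

def pc_prob(doa, t, min_doa, max_doa):
    """Probabilistic Confirmation: probability 1.0 at the classifier's
    threshold t, assumed to fall linearly to 0 at Min DOA / Max DOA."""
    if doa <= t:
        return (doa - min_doa) / (t - min_doa)
    return (max_doa - doa) / (max_doa - t)

print(uc_prob(0.5, rt=0.3, at=0.7))                    # 1.0 (inside the band)
print(uc_prob(0.9, rt=0.3, at=0.7))                    # 0.0 (outside the band)
print(pc_prob(0.5, t=0.5, min_doa=0.0, max_doa=1.0))   # 1.0 (at the threshold)
print(pc_prob(0.25, t=0.5, min_doa=0.0, max_doa=1.0))  # 0.5 (halfway to Min)
```

UC confirms everything in the uncertain band (good CR, many confirmations); PC concentrates confirmations near the threshold where mistakes are likeliest (good CP, fewer confirmations).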
ICCOM: Interactive Confirmation by COM
[Architecture diagram]
Training: the training documents are split into (a) documents for classifier building, which go through feature selection and classifier building to produce the underlying classifier, and (b) documents for threshold tuning (validation).
ICCOM components: (1) Content Overlap Measurement (COM); (2) threshold tuning based on content overlapping; (3) COM classification.
Testing: each incoming document is either classified/filtered directly or routed to the set of documents to be confirmed.
ICCOM: Interactive Confirmation by COM (content overlapping measurement)
Procedure COM(c, d), where
(1) c is a category, and
(2) d is a document for thresholding or testing.
Return: Degree of content overlap (DCO) between d and c.
Begin
(1) DCO = 0;
(2) For each term t that is positively correlated with c but does not appear in d, do
(2.1) DCO = DCO − χ²(t, c);
(3) For each term t that is negatively correlated with c but appears in d, do
(3.1) DCO = DCO − (number of occurrences of t in d) × χ²(t, c);
(4) Return DCO;
End.
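A direct transcription of the procedure into Python; the dictionary-based representation of correlations, chi-square weights, and term counts is an assumption about data layout, not from the slides:

```python
def com(term_correlation, chi2, doc_term_counts):
    """Degree of content overlap (DCO) between document d and category c.

    term_correlation: dict mapping term -> 'pos' or 'neg' (correlation with c)
    chi2:             dict mapping term -> chi-square weight of (term, c)
    doc_term_counts:  dict mapping term -> number of occurrences in d

    DCO starts at 0 and only accumulates penalties, so it is always <= 0;
    values closer to 0 indicate stronger content overlap with c.
    """
    dco = 0.0
    for t, corr in term_correlation.items():
        if corr == 'pos' and t not in doc_term_counts:
            dco -= chi2[t]                           # step (2.1)
        elif corr == 'neg' and t in doc_term_counts:
            dco -= doc_term_counts[t] * chi2[t]      # step (3.1)
    return dco

correlation = {'tumor': 'pos', 'therapy': 'pos', 'engine': 'neg'}
weights = {'tumor': 4.0, 'therapy': 1.0, 'engine': 2.0}
doc = {'tumor': 3, 'engine': 2}
# 'therapy' is missing (-1.0) and 'engine' appears twice (-2 * 2.0):
print(com(correlation, weights, doc))  # -5.0
```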
ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
Feature type | Purpose | Underlying classifier | COM
Features that correlate with c | Discriminate c from others | Considered | Not considered
Features that correlate with other categories | Discriminate c from others | Considered | Not considered
Features that appear in c but do not appear in d | Validate content overlapping | Not considered | Considered
Features that do not appear in c but appear in d | Validate content overlapping | Not considered | Considered
ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
N: total number of documents,
A: # documents that are in c and contain t,
B: # documents that are not in c but contain t,
C: # documents that are in c but do not contain t, and
D: # documents that are not in c and do not contain t.
χ²(t, c) = N(AD − BC)² / ((A + B)(A + C)(B + D)(C + D))
A term t is “positively correlated” with c if AD > BC; otherwise it is “negatively correlated”.
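The formula and the correlation test translate directly; the guard against an empty denominator is my addition:

```python
def chi_square(A, B, C, D):
    """Chi-square association between term t and category c, from the
    2x2 contingency table defined on this slide:
      A: docs in c containing t        B: docs not in c containing t
      C: docs in c without t           D: docs not in c without t
    """
    N = A + B + C + D
    denominator = (A + B) * (A + C) * (B + D) * (C + D)
    if denominator == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denominator

def positively_correlated(A, B, C, D):
    """t is positively correlated with c iff AD > BC."""
    return A * D > B * C

# 100 documents; t appears in 40 of the 50 documents in c
# and in only 10 of the 50 documents outside c.
print(chi_square(40, 10, 10, 40))             # 36.0
print(positively_correlated(40, 10, 10, 40))  # True
```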
ICCOM: Interactive Confirmation by COM (thresholding)
[Diagram: documents spread along the DOA axis from Min DOA to Max DOA; o: positive validation document, x: negative validation document]
– Documents with DOA below the Rejection Threshold (RT) are rejected outright.
– For all other documents, COM is invoked to compute the DCO:
– Above the classifier’s threshold (T): acceptance if the DCO reaches the Positive Confirmation Threshold (PCT); otherwise confirmation.
– Between RT and T: rejection if the DCO falls to the Negative Confirmation Threshold (NCT); otherwise confirmation.
ICCOM: Interactive Confirmation by COM (collaboration with the classifier)
Procedure InteractiveHighQualityTC(c, d, T, RT, PCT, NCT), where
(1) c is a category,
(2) d is the document to be processed,
(3) T is the classifier’s threshold for c,
(4) RT is the rejection threshold for c,
(5) PCT is the positive confirmation threshold for c, and
(6) NCT is the negative confirmation threshold for c.
Return: A decision (acceptance, rejection, or confirmation) for d with respect to c.
Begin
(1) DOAd = Invoke the classifier to compute DOA of d with respect to c;
(2) If (DOAd ≤ RT), Return “rejection”;
(3) Else
(3.1) DCOd = Invoke COM to compute DCO of d with respect to c;
(3.2) If (DOAd ≥ T)
(3.2.1) If (DCOd ≥ PCT), Return “acceptance”;
(3.2.2) Return “confirmation”;
(3.3) Else
(3.3.1) If (DCOd ≤ NCT), Return “rejection”;
(3.3.2) Return “confirmation”;
End.
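A sketch of the procedure in Python, assuming comparison directions (≤/≥) inferred from context, since the operators were lost in extraction; COM is passed as a zero-argument callable so the DCO is computed only when needed, and all threshold values in the example are hypothetical:

```python
def interactive_high_quality_tc(doa, compute_dco, T, RT, PCT, NCT):
    """Return 'acceptance', 'rejection', or 'confirmation' for a document.

    doa:         the underlying classifier's DOA for the document
    compute_dco: zero-argument callable returning the document's DCO,
                 invoked only when the DOA clears the rejection threshold
    """
    if doa <= RT:
        return 'rejection'                 # far from c: reject without COM
    dco = compute_dco()
    if doa >= T:                           # the classifier would accept
        return 'acceptance' if dco >= PCT else 'confirmation'
    # RT < doa < T: the classifier would reject
    return 'rejection' if dco <= NCT else 'confirmation'

# Hypothetical thresholds: T=0.6, RT=0.2, PCT=-1.0, NCT=-5.0 (DCO is <= 0).
print(interactive_high_quality_tc(0.1, lambda: 0.0, 0.6, 0.2, -1.0, -5.0))   # rejection
print(interactive_high_quality_tc(0.8, lambda: -0.5, 0.6, 0.2, -1.0, -5.0))  # acceptance
print(interactive_high_quality_tc(0.8, lambda: -3.0, 0.6, 0.2, -1.0, -5.0))  # confirmation
print(interactive_high_quality_tc(0.4, lambda: -6.0, 0.6, 0.2, -1.0, -5.0))  # rejection
```

Only documents whose DOA and DCO disagree about the decision trigger a confirmation, which is how ICCOM keeps the number of interactions down.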
Empirical Evaluation
Chinese disease (cancer) texts
– 16 types of cancers (e.g. liver cancer, lung cancer, etc.) top-ranked by the Department of Health in Taiwan
– Collected by sending cancer names to “知識+” (Knowledge+) on Yahoo! Taiwan
– For each cancer, there are 5 subcategories: cause, symptom, curing, side-effect, and prevention
– Therefore, we have 80 (16*5) categories with 2850 documents
– 90% for training; 10% for testing
– 2-fold cross validation (classifier building vs. thresholding)
Empirical Evaluation (cont.)
Classification of cancer information:

Fold     | Best F1 by RO    | F1 RO+PC | CP RO+PC | F1 RO+UC | CP RO+UC | F1 RO+ICCOM | CP RO+ICCOM
1st fold | 0.3485 (FS=1500) | 0.8413   | 0.0969   | 0.9610   | 0.0848   | 0.9607      | 0.1117
2nd fold | 0.3270 (FS=1500) | 0.7823   | 0.1037   | 0.9656   | 0.0725   | 0.9433      | 0.1166
Empirical Evaluation (cont.)
Classification of 40 symptom descriptions without cancer names:

Fold     | Best F1 by RO   | F1 RO+PC | CP RO+PC | F1 RO+UC | CP RO+UC | F1 RO+ICCOM | CP RO+ICCOM
1st fold | 0.8919 (FS=300) | 0.9610   | 0.0676   | 0.9744   | 0.1017   | 0.9610      | 0.1429
2nd fold | 0.8718 (FS=300) | 0.9620   | 0.1000   | 0.9750   | 0.0580   | 0.9744      | 0.1569

Note: For the 40 test symptom documents, RO+ICCOM conducts 35 and 51 confirmations in the 1st and 2nd folds, respectively.
Conclusion
High-quality TC is essential but often unrealizable
Interactive confirmation may be a final resort
– Information recommendation & archiving
– Healthcare information support
COM as a classifier-independent strategy for interaction
Thank you!