Text Classification for Healthcare Information Support
Rey-Long Liu (劉瑞瓏), Dept. of Medical Informatics
Tzu Chi University, Taiwan
Background
Text categorization (TC) as a fundamental component for information processing
– Many TC techniques have been developed
Unfortunately, high-quality TC is often an unrealizable ideal
– Very high precision
– Very high recall
Background (Cont.)
An application scenario: healthcare information support
[Diagram: general users (e.g. patients) send queries and inquiries; information gathering systems collect relevant information; high-quality TC classifies both the gathered information and the inquiries into a classified information base; healthcare professionals confirm the classifications and provide consultancy]
Outline
Interaction as an approach to high-quality TC
– Main consideration
Reducing the amount of the interaction
– Criteria & straightforward interaction strategies
An intelligent interaction strategy: COM (Content Overlapping Measurement)
Empirical evaluation
– Chinese cancer text classification
Conclusion
Interaction for High-Quality TC
Interaction with the user
– Possibly a “final” approach
– More application scenarios: information recommendation & archiving (definitely relevant vs. potentially relevant)
Main consideration
– Reducing the number of interactions
Interaction for High-Quality TC (Cont.)
Evaluation criteria
– Confirmation Precision (CP): related to the cognitive load on users
  CP = (# wrong decisions identified) / (# decisions identified as potentially wrong)
     = (# necessary confirmations conducted) / (# confirmations conducted)
– Confirmation Recall (CR): related to the quality of TC
  CR = (# wrong decisions identified) / (# wrong decisions that should be identified)
     = (# necessary confirmations conducted) / (# confirmations that should be conducted)
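In code, the two criteria are simple ratios; a minimal sketch with hypothetical counts (the function names are mine, not from the slides):

```python
def confirmation_precision(wrong_identified, identified_as_potentially_wrong):
    """CP: fraction of conducted confirmations that were necessary,
    i.e. that actually caught a wrong classification decision."""
    return wrong_identified / identified_as_potentially_wrong

def confirmation_recall(wrong_identified, wrong_that_should_be_identified):
    """CR: fraction of all wrong decisions that the confirmations caught."""
    return wrong_identified / wrong_that_should_be_identified

# Hypothetical counts: 40 confirmations conducted, 30 of which caught a
# wrong decision, out of 50 wrong decisions in total.
print(confirmation_precision(30, 40))  # 0.75
print(confirmation_recall(30, 50))     # 0.6
```

A high CP means few confirmations were wasted on correct decisions (low cognitive load); a high CR means few wrong decisions slipped through unconfirmed (high-quality TC).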
Interaction for High-Quality TC (Cont.)
Straightforward interaction strategies

Uniform Confirmation (UC): preferring CR
(A) Set two thresholds, a Rejection Threshold (RT) and an Acceptance Threshold (AT), to identify the DOA range for confirmation
[Diagram: validation documents spread along the DOA axis from Min DOA to Max DOA, with RT and AT marked; o: positive validation document, x: negative validation document]
(B) Confirmation strategy:
Prob = 0 (when DOA(d, c) > AT)
Prob = 0 (when DOA(d, c) < RT)
Prob = 1.0 (when RT ≤ DOA(d, c) ≤ AT)
Interaction for High-Quality TC (Cont.)
Probabilistic Confirmation (PC): preferring CP
(A) Tune a single threshold in the hope of optimizing F1
[Diagram: validation documents spread along the DOA axis from Min DOA to Max DOA, with the classifier’s threshold (T) marked; o: positive validation document, x: negative validation document]
(B) Confirmation strategy: the confirmation probability is 0 at the extremes and 1.0 at the threshold:
Prob = 0 (when DOA(d, c) = Min)
Prob = 0 (when DOA(d, c) = Max)
Prob = 1.0 (when DOA(d, c) = threshold)
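The two strategies can be sketched as confirmation-probability functions. The linear fall-off in PC between the endpoints, and all names here, are assumptions; the slides only state the probabilities at the endpoints:

```python
def uc_prob(doa, rt, at):
    """Uniform Confirmation: confirm every document whose DOA falls
    inside [RT, AT], and none outside."""
    return 1.0 if rt <= doa <= at else 0.0

def pc_prob(doa, t, min_doa, max_doa):
    """Probabilistic Confirmation: probability 1.0 at the classifier's
    threshold t, assumed to fall linearly to 0 at Min DOA / Max DOA."""
    if doa <= t:
        return (doa - min_doa) / (t - min_doa)
    return (max_doa - doa) / (max_doa - t)

print(uc_prob(0.5, rt=0.3, at=0.7))                    # 1.0 (inside the band)
print(uc_prob(0.9, rt=0.3, at=0.7))                    # 0.0 (outside the band)
print(pc_prob(0.5, t=0.5, min_doa=0.0, max_doa=1.0))   # 1.0 (at the threshold)
print(pc_prob(0.25, t=0.5, min_doa=0.0, max_doa=1.0))  # 0.5 (halfway to Min)
```

UC confirms everything in the uncertain band (good CR, many confirmations); PC concentrates confirmations near the threshold where mistakes are likeliest (good CP, fewer confirmations).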
ICCOM: Interactive Confirmation by COM
[Architecture diagram]
Training: the training documents are split into (a) documents for classifier building, which go through feature selection and classifier building to produce the underlying classifier, and (b) documents for threshold tuning (validation).
ICCOM components: (1) Content Overlap Measurement (COM); (2) threshold tuning based on content overlapping; (3) COM classification.
Testing: each incoming document is either classified/filtered directly or routed to the set of documents to be confirmed.
ICCOM: Interactive Confirmation by COM (content overlapping measurement)
Procedure COM(c, d), where
(1) c is a category, and
(2) d is a document for thresholding or testing.
Return: Degree of content overlap (DCO) between d and c.
Begin
(1) DCO = 0;
(2) For each term t that is positively correlated with c but does not appear in d, do
(2.1) DCO = DCO − χ²(t, c);
(3) For each term t that is negatively correlated with c but appears in d, do
(3.1) DCO = DCO − (number of occurrences of t in d) × χ²(t, c);
(4) Return DCO;
End.
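A direct transcription of the procedure into Python; the dictionary-based representation of correlations, chi-square weights, and term counts is an assumption about data layout, not from the slides:

```python
def com(term_correlation, chi2, doc_term_counts):
    """Degree of content overlap (DCO) between document d and category c.

    term_correlation: dict mapping term -> 'pos' or 'neg' (correlation with c)
    chi2:             dict mapping term -> chi-square weight of (term, c)
    doc_term_counts:  dict mapping term -> number of occurrences in d

    DCO starts at 0 and only accumulates penalties, so it is always <= 0;
    values closer to 0 indicate stronger content overlap with c.
    """
    dco = 0.0
    for t, corr in term_correlation.items():
        if corr == 'pos' and t not in doc_term_counts:
            dco -= chi2[t]                           # step (2.1)
        elif corr == 'neg' and t in doc_term_counts:
            dco -= doc_term_counts[t] * chi2[t]      # step (3.1)
    return dco

correlation = {'tumor': 'pos', 'therapy': 'pos', 'engine': 'neg'}
weights = {'tumor': 4.0, 'therapy': 1.0, 'engine': 2.0}
doc = {'tumor': 3, 'engine': 2}
# 'therapy' is missing (-1.0) and 'engine' appears twice (-2 * 2.0):
print(com(correlation, weights, doc))  # -5.0
```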
ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
Feature type | Purpose | Underlying classifier | COM
Features that correlate with c | Discriminate c from others | Considered | Not considered
Features that correlate with other categories | Discriminate c from others | Considered | Not considered
Features that appear in c but do not appear in d | Validate content overlapping | Not considered | Considered
Features that do not appear in c but appear in d | Validate content overlapping | Not considered | Considered
ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)
N: total number of documents,
A: # documents that are in c and contain t,
B: # documents that are not in c but contain t,
C: # documents that are in c but do not contain t, and
D: # documents that are not in c and do not contain t.
χ²(t, c) = N(AD − BC)² / ((A + B)(A + C)(B + D)(C + D))
A term t is “positively correlated” with c if AD > BC; otherwise it is “negatively correlated”.
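The formula and the correlation test translate directly; the guard against an empty denominator is my addition:

```python
def chi_square(A, B, C, D):
    """Chi-square association between term t and category c, from the
    2x2 contingency table defined on this slide:
      A: docs in c containing t        B: docs not in c containing t
      C: docs in c without t           D: docs not in c without t
    """
    N = A + B + C + D
    denominator = (A + B) * (A + C) * (B + D) * (C + D)
    if denominator == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denominator

def positively_correlated(A, B, C, D):
    """t is positively correlated with c iff AD > BC."""
    return A * D > B * C

# 100 documents; t appears in 40 of the 50 documents in c
# and in only 10 of the 50 documents outside c.
print(chi_square(40, 10, 10, 40))             # 36.0
print(positively_correlated(40, 10, 10, 40))  # True
```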
ICCOM: Interactive Confirmation by COM (thresholding)
[Diagram: documents spread along the DOA axis from Min DOA to Max DOA; o: positive validation document, x: negative validation document]
– Documents with DOA below the Rejection Threshold (RT) are rejected outright.
– For all other documents, COM is invoked to compute the DCO:
– Above the classifier’s threshold (T): acceptance if the DCO reaches the Positive Confirmation Threshold (PCT); otherwise confirmation.
– Between RT and T: rejection if the DCO falls to the Negative Confirmation Threshold (NCT); otherwise confirmation.
ICCOM: Interactive Confirmation by COM (collaboration with the classifier)
Procedure InteractiveHighQualityTC(c, d, T, RT, PCT, NCT), where
(1) c is a category,
(2) d is the document to be processed,
(3) T is the classifier’s threshold for c,
(4) RT is the rejection threshold for c,
(5) PCT is the positive confirmation threshold for c, and
(6) NCT is the negative confirmation threshold for c.
Return: A decision (acceptance, rejection, or confirmation) for d with respect to c.
Begin
(1) DOAd = Invoke the classifier to compute DOA of d with respect to c;
(2) If (DOAd ≤ RT), Return “rejection”;
(3) Else
(3.1) DCOd = Invoke COM to compute DCO of d with respect to c;
(3.2) If (DOAd ≥ T)
(3.2.1) If (DCOd ≥ PCT), Return “acceptance”;
(3.2.2) Return “confirmation”;
(3.3) Else
(3.3.1) If (DCOd ≤ NCT), Return “rejection”;
(3.3.2) Return “confirmation”;
End.
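A sketch of the procedure in Python, assuming comparison directions (≤/≥) inferred from context, since the operators were lost in extraction; COM is passed as a zero-argument callable so the DCO is computed only when needed, and all threshold values in the example are hypothetical:

```python
def interactive_high_quality_tc(doa, compute_dco, T, RT, PCT, NCT):
    """Return 'acceptance', 'rejection', or 'confirmation' for a document.

    doa:         the underlying classifier's DOA for the document
    compute_dco: zero-argument callable returning the document's DCO,
                 invoked only when the DOA clears the rejection threshold
    """
    if doa <= RT:
        return 'rejection'                 # far from c: reject without COM
    dco = compute_dco()
    if doa >= T:                           # the classifier would accept
        return 'acceptance' if dco >= PCT else 'confirmation'
    # RT < doa < T: the classifier would reject
    return 'rejection' if dco <= NCT else 'confirmation'

# Hypothetical thresholds: T=0.6, RT=0.2, PCT=-1.0, NCT=-5.0 (DCO is <= 0).
print(interactive_high_quality_tc(0.1, lambda: 0.0, 0.6, 0.2, -1.0, -5.0))   # rejection
print(interactive_high_quality_tc(0.8, lambda: -0.5, 0.6, 0.2, -1.0, -5.0))  # acceptance
print(interactive_high_quality_tc(0.8, lambda: -3.0, 0.6, 0.2, -1.0, -5.0))  # confirmation
print(interactive_high_quality_tc(0.4, lambda: -6.0, 0.6, 0.2, -1.0, -5.0))  # rejection
```

Only documents whose DOA and DCO disagree about the decision trigger a confirmation, which is how ICCOM keeps the number of interactions down.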
Empirical Evaluation
Chinese disease (cancer) texts
– 16 types of cancers (e.g. liver cancer, lung cancer, etc.) top-ranked by the Department of Health in Taiwan
– Collected by sending cancer names to “知識+” (Knowledge+) on Yahoo! Taiwan
– For each cancer, there are 5 subcategories: cause, symptom, curing, side-effect, and prevention
– Therefore, we have 80 (16*5) categories with 2850 documents
– 90% for training; 10% for testing
– 2-fold cross validation (classifier building vs. thresholding)
Empirical Evaluation (cont.)
Classification of cancer information:

Fold     | Best F1 by RO    | F1 RO+PC | CP RO+PC | F1 RO+UC | CP RO+UC | F1 RO+ICCOM | CP RO+ICCOM
1st fold | 0.3485 (FS=1500) | 0.8413   | 0.0969   | 0.9610   | 0.0848   | 0.9607      | 0.1117
2nd fold | 0.3270 (FS=1500) | 0.7823   | 0.1037   | 0.9656   | 0.0725   | 0.9433      | 0.1166
Empirical Evaluation (cont.)
Classification of 40 symptom descriptions without cancer names:

Fold     | Best F1 by RO   | F1 RO+PC | CP RO+PC | F1 RO+UC | CP RO+UC | F1 RO+ICCOM | CP RO+ICCOM
1st fold | 0.8919 (FS=300) | 0.9610   | 0.0676   | 0.9744   | 0.1017   | 0.9610      | 0.1429
2nd fold | 0.8718 (FS=300) | 0.9620   | 0.1000   | 0.9750   | 0.0580   | 0.9744      | 0.1569

Note: For the 40 test symptom documents, RO+ICCOM conducts 35 and 51 confirmations in the 1st and 2nd folds, respectively.
Conclusion
High-quality TC is essential but often unrealizable
Interactive confirmation may be a final resort
– Information recommendation & archiving
– Healthcare information support
COM as a classifier-independent strategy for interaction
Thank you!