Data Mining in Radiology Reports

Data Mining in Radiology Reports Saeed Mehrabi Spring 2010 INFO-I535 Dr. Patrick W. Jamieson Dr. Josette Jones

Upload: saeedmehrabi

Post on 25-May-2015


DESCRIPTION

This presentation describes data mining of a large-scale corpus of radiology reports.

TRANSCRIPT

Page 1: Data Mining in Radiology Reports

Data Mining in Radiology Reports

Saeed Mehrabi

Spring 2010, INFO-I535

Dr. Patrick W. Jamieson

Dr. Josette Jones

Page 2: Data Mining in Radiology Reports

Outline

• Introduction to data and text mining

• Our data set

• Structuring free text

• Results

• Similar works

• Discussion

Page 3: Data Mining in Radiology Reports

What is Data Mining?

• Data mining is the extraction of useful patterns from data sources such as databases, texts, and the web.

• There is a big gap between stored data and knowledge, and the transition won’t occur automatically.

• Many interesting things you want to find cannot be found using database queries alone, for example:

“Find me people likely to buy my products”

“Who is likely to respond to my promotion?”

Page 4: Data Mining in Radiology Reports

Why data mining now?

• The data is abundant.

• The data is being warehoused.

• The computing power is affordable.

• The competitive pressure is strong.

• Data mining tools have become available

Page 5: Data Mining in Radiology Reports

Text Mining

Text mining applies and adapts data mining techniques to the text domain.

Structured vs. Free Text

• Structured text can be stored in a relational database.

• Providing the means to represent data available in text in structured format will make information exchange, data mining and information retrieval more feasible.

Page 6: Data Mining in Radiology Reports

Data Set

• Our corpus consists of:

594,000 de-identified radiology reports

36 million words

4.3 million sentences

• The reports were dictated by the Indiana University Radiology faculty, a group of 40 radiologists, from 1993 to 1998.

Page 7: Data Mining in Radiology Reports

Structuring Free Text

• Regular expressions were used to detect sentences in the reports.

• A regular expression is a concise and flexible way of matching strings of text, such as particular characters or words.

• Sentences were annotated to propositions, which are simply sentences that express the same concept for similar findings within reports.
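The slides do not show the regular expression that was actually used. As a minimal sketch of the idea, the Python below splits a report on sentence-ending punctuation followed by whitespace and a capital letter or digit. The pattern, the `split_sentences` helper, and the sample report are all assumptions made for illustration.

```python
import re
from typing import List

# Assumed boundary rule (not taken from the slides): '.', '!' or '?'
# followed by whitespace and then an uppercase letter or a digit.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(report: str) -> List[str]:
    """Split a free-text report into sentences using the assumed rule."""
    parts = SENTENCE_BOUNDARY.split(report.strip())
    return [part.strip() for part in parts if part.strip()]


# Hypothetical report text, invented for this example.
report = "The heart size is normal. The lungs are clear. No pleural effusion is seen."
sentences = split_sentences(report)
for sentence in sentences:
    print(sentence)
```

A rule this simple will split incorrectly after abbreviations such as "Dr." or "approx.", so a production pattern would need extra cases for dictated clinical text.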

Page 8: Data Mining in Radiology Reports

Structuring Free Text (Cont.)

• A proposition is a declarative sentence that is either true or false, but not both.

Today is a beautiful sunny day. (A proposition)

x + 2 = 4 (Not a proposition, because its truth depends on the value of x)

• Users can select propositions and map sentences to propositions.

Page 9: Data Mining in Radiology Reports
Page 10: Data Mining in Radiology Reports

Corpus Annotation

• To annotate each new sentence from the radiology reports, the software first proposes candidate propositions.

• The propositions suggested by the software are reviewed by experts and corrected as needed before validation.

• If no suitable proposition exists in the ontology, the expert can create a new one.
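The slides do not say how the software chooses which propositions to propose. One simple way to sketch the workflow above is to rank existing propositions by word overlap (Jaccard similarity) with the new sentence and return an empty list when nothing is close enough, which is the case where the expert would create a new proposition. The similarity measure, the threshold, the function names, and the sample propositions are all assumptions.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in for the real matcher."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def propose(sentence: str, propositions: List[str],
            top_k: int = 3, min_score: float = 0.3) -> List[Tuple[float, str]]:
    """Return up to top_k candidate propositions, best match first."""
    scored = [(jaccard(sentence, p), p) for p in propositions]
    scored = [(score, p) for score, p in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical ontology entries, invented for this example.
ontology = [
    "The heart size is normal.",
    "The lungs are clear.",
    "There is no pleural effusion.",
]
candidates = propose("Heart size is within normal limits.", ontology)
no_match = propose("Gallstones are present.", ontology)  # expert would add a new proposition
```

In this sketch the expert review step stays outside the code: the function only suggests, and a person accepts, corrects, or replaces the suggestion.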

Page 11: Data Mining in Radiology Reports
Page 12: Data Mining in Radiology Reports

Results

• The ontology of propositions is built in parallel with the experts' annotation of sentences to existing propositions.

• So far, 427,433 unique sentences from the corpus have been annotated.

These represent a total of 2,561,330 sentences, or 60% of all sentences in the corpus.
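The 60% figure can be checked from the counts given on the slides. The short calculation below uses only those counts; the corpus size is the rounded "4.3 million sentences" from the data set slide, so the result is approximate. The mean number of occurrences per unique sentence is a derived figure, not one stated in the presentation.

```python
unique_annotated = 427_433      # unique sentences annotated so far
total_annotated = 2_561_330     # sentence occurrences they represent
corpus_sentences = 4_300_000    # "4.3 million sentences" (rounded in the source)

coverage = total_annotated / corpus_sentences
mean_occurrences = total_annotated / unique_annotated

print(f"coverage of corpus: {coverage:.1%}")                         # about 59.6%
print(f"mean occurrences per unique sentence: {mean_occurrences:.2f}")  # about 5.99
```

Each annotated unique sentence therefore stands for roughly six sentences in the corpus, which is why annotating unique sentences covers the corpus quickly.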

Page 13: Data Mining in Radiology Reports

Results (Cont.)

• The propositions are categorized by main finding area, such as brain and skull, general radiology, and so on.

• Each proposition is stored in a relational database, together with information such as whether it is a normal or abnormal finding and the number of sentences mapped to it.

• We can find the most frequent (highest-ranked) propositions by sorting them by the number of sentences mapped to them. We can also count how many of them are normal or abnormal, and the number of normal and abnormal propositions and sentences in each category.
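The slides do not give the database schema. The sketch below uses Python's built-in `sqlite3` module with a hypothetical single-table layout to show the two kinds of query described above: ranking propositions by mapped sentence count, and counting normal and abnormal propositions and sentences per category. The table name, column names, and all row values are invented for illustration and are not results from the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per proposition.
conn.execute(
    """
    CREATE TABLE proposition (
        id             INTEGER PRIMARY KEY,
        text           TEXT    NOT NULL,
        category       TEXT    NOT NULL,
        is_normal      INTEGER NOT NULL,  -- 1 = normal finding, 0 = abnormal
        sentence_count INTEGER NOT NULL   -- sentences mapped to this proposition
    )
    """
)
# Toy rows with made-up counts.
rows = [
    ("The heart size is normal.", "Heart and Great Vessel", 1, 120),
    ("The lungs are clear.", "Lung, Mediastinum, and Pleura", 1, 200),
    ("There is a right pleural effusion.", "Lung, Mediastinum, and Pleura", 0, 35),
    ("There is cardiomegaly.", "Heart and Great Vessel", 0, 15),
]
conn.executemany(
    "INSERT INTO proposition (text, category, is_normal, sentence_count) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Rank propositions by the number of sentences mapped to them.
ranked = conn.execute(
    "SELECT text, sentence_count FROM proposition ORDER BY sentence_count DESC"
).fetchall()

# Count normal and abnormal propositions and sentences per category.
by_category = conn.execute(
    """
    SELECT category, is_normal, COUNT(*) AS n_propositions,
           SUM(sentence_count) AS n_sentences
    FROM proposition
    GROUP BY category, is_normal
    ORDER BY category, is_normal DESC
    """
).fetchall()

for row in by_category:
    print(row)
```

Grouping ranked propositions into intervals of 500 and splitting each interval by `is_normal` would give the kind of breakdown plotted in the charts on the following slides.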

Page 14: Data Mining in Radiology Reports

[Figure: chart titled "Number of normal and abnormal propositions within the 500 interval of highest ranked propositions". X-axis: Rank of Propositions, in intervals of 500 from 1-500 to 13501-13581. Y-axis: Number of Propositions (0 to 350). Series: Normal, Abnormal.]

Page 15: Data Mining in Radiology Reports

[Figure: chart titled "Number of normal and abnormal sentences mapped to the propositions". X-axis: Rank of Propositions, in intervals of 500 from 1-500 to 2001-2500. Y-axis: Number of Sentences (0 to 1,800,000). Series: Normal, Abnormal.]

Page 16: Data Mining in Radiology Reports

[Figure: chart titled "Number of normal and abnormal sentences mapped to the propositions". X-axis: Rank of Propositions, in intervals of 500 from 2501-3000 to 13501-13581. Y-axis: Number of Sentences (0 to 18,000). Series: Normal, Abnormal.]

Page 17: Data Mining in Radiology Reports

[Figure: chart titled "Number of normal and abnormal propositions based on report categories". X-axis: Categories of Findings (Brain and Skull; Breast; Face, Mastoids, and Neck; Gastrointestinal; General Radiology; Genitourinary; Heart and Great Vessel; Lung, Mediastinum, and Pleura; Miscellaneous Observation; Skeletal and Soft Tissue; Spine and Contents; Vascular and Lymphatic). Y-axis: Number of Propositions (0 to 2,000). Series: Normal, Abnormal.]

Page 18: Data Mining in Radiology Reports

[Figure: chart titled "Number of normal and abnormal sentences based on report categories". X-axis: Categories of Findings (Brain and Skull; Breast; Face, Mastoids, and Neck; Gastrointestinal; General Radiology; Genitourinary; Heart and Great Vessel; Lung, Mediastinum, and Pleura; Miscellaneous Observation; Skeletal and Soft Tissue; Spine and Contents; Vascular and Lymphatic). Y-axis: Number of Sentences (0 to 450,000). Series: Normal, Abnormal.]

Page 19: Data Mining in Radiology Reports

Similar Works

CLEF (Clinical E-Science Framework)

• Its corpus consists of both structured records and free-text documents (clinical narratives, radiology reports, and histopathology reports).

• It provides semantic annotation of clinical text to assist in the development and evaluation of an information extraction system.

Page 20: Data Mining in Radiology Reports

LEXIMER (LEXIcon Mediated Entropy Reduction)

Page 21: Data Mining in Radiology Reports

LEXIMER (Cont.)

• Phrase Isolation scans the report text and separates the content into phrases.

• Noise Reduction decreases the amount of non-clinically relevant information contained within the report.

• Signal Extraction pulls out the positive statements and recommendations from the clinically relevant phrases.
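LEXIMER's internals are not described on the slides, so the code below is not its implementation. It is a toy sketch of what a three-stage pipeline with the same stage names could look like, built on simple keyword rules. The lexicon, the negation and recommendation cues, the function names, and the sample report are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Toy resources, invented for this example.
CLINICAL_TERMS = {"effusion", "nodule", "fracture", "pneumonia", "ct", "follow-up"}
NEGATION_CUES = ("no ", "without ", "negative for ")
RECOMMENDATION_CUES = ("recommend", "suggest", "advise")


def isolate_phrases(report: str) -> List[str]:
    """Phrase isolation: split the report text on '.' and ';'."""
    return [p.strip().lower() for p in re.split(r"[.;]", report) if p.strip()]


def reduce_noise(phrases: List[str]) -> List[str]:
    """Noise reduction: keep only phrases that contain a term from the toy lexicon."""
    return [p for p in phrases if set(re.findall(r"[a-z\-]+", p)) & CLINICAL_TERMS]


def extract_signal(phrases: List[str]) -> Tuple[List[str], List[str]]:
    """Signal extraction: separate positive findings from recommendations."""
    findings, recommendations = [], []
    for phrase in phrases:
        padded = f" {phrase}"  # lets a cue match at the start of the phrase
        if any(cue in phrase for cue in RECOMMENDATION_CUES):
            recommendations.append(phrase)
        elif not any(f" {cue}" in padded for cue in NEGATION_CUES):
            findings.append(phrase)
    return findings, recommendations


# Hypothetical report text.
report = (
    "Comparison is made with the prior study. "
    "There is a 5 mm nodule in the right upper lobe. "
    "No pleural effusion. "
    "Follow-up CT is recommended in 6 months."
)
findings, recommendations = extract_signal(reduce_noise(isolate_phrases(report)))
```

Because each stage works on isolated phrases, context that spans a whole sentence is lost, which is the limitation raised in the discussion slide.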

Page 22: Data Mining in Radiology Reports

NLP using OLAP for assessing recommendations in radiology reports

• Database: 4,279,179 radiology reports from a single tertiary health care center

10-year period (1995-2004)

Consists of reports from the most common imaging modalities, together with patient demographics

• Leximer was used in conjunction with online analytic processing (OLAP) to classify reports into those with recommendations for imaging (IREC) and those without.

• IREC rates were determined for different patient age groups, genders, imaging modalities, indications, diseases, subspecialties, and referring physicians.
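Computing a rate for each value of a dimension such as modality or age group is the basic OLAP roll-up operation. The sketch below shows that aggregation in plain Python, assuming each report has already been classified as containing an imaging recommendation or not. The records, field names, and the `irec_rate_by` helper are hypothetical and carry no data from the cited study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical classified reports; "irec" marks a recommendation for imaging.
reports = [
    {"modality": "CT", "age_group": "40-59", "irec": True},
    {"modality": "CT", "age_group": "60+", "irec": False},
    {"modality": "CT", "age_group": "60+", "irec": True},
    {"modality": "XR", "age_group": "40-59", "irec": False},
    {"modality": "XR", "age_group": "60+", "irec": False},
    {"modality": "XR", "age_group": "60+", "irec": True},
    {"modality": "XR", "age_group": "60+", "irec": False},
]


def irec_rate_by(records: Iterable[dict], *dimensions: str) -> Dict[Tuple, float]:
    """Roll up the IREC rate over one or more dimensions."""
    totals = defaultdict(int)
    with_recommendation = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += 1
        with_recommendation[key] += int(record["irec"])
    return {key: with_recommendation[key] / totals[key] for key in totals}


by_modality = irec_rate_by(reports, "modality")
by_modality_and_age = irec_rate_by(reports, "modality", "age_group")

for key, rate in sorted(by_modality_and_age.items()):
    print(key, f"{rate:.0%}")
```

Passing more dimension names drills down into finer cells, and passing fewer rolls the same records up into coarser ones.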

Page 23: Data Mining in Radiology Reports

Discussion

• The CLEF work covers a very limited number of reports.

• In Leximer, the classification method is not validated, and phrases cannot convey the full meaning of a sentence.

• What distinguishes our work from others is the large amount of data mined and the consistent expert validation.

Page 24: Data Mining in Radiology Reports

References

• Friedlin, J., Mahoui, M., Jones, J., Kashyap, V., & Jamieson, P. (2010). Knowledge Discovery and Data Mining of Free Text Radiology. Submitted to the Journal of Biomedical Informatics.

• Roberts, A., Gaizauskas, R., Hepple, M., Demetriou, G., Guo, Y., Setzer, A., et al. (2008). Semantic Annotation of Clinical Text: The CLEF Corpus. Retrieved April 20, 2010, from ftp://ftp.dcs.shef.ac.uk/home/robertg/papers/lrec08-clefcorpus.pdf

• Dang PA, Kalra MK, Blake MA, Schultz TJ, Stout M, Lemay PR, Freshman DJ, Halpern EF, Dreyer KJ. Natural language processing using online analytic processing for assessing recommendations in radiology reports. J Am Coll Radiol. 2008 Mar;5(3):197-204.

• http://www.nuance.com/healthcare/products/radcube-for-radiology.asp