
7/10/07 - SEDE'07

DATA MINING APPLICATIONS

Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275

[email protected]

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.

Some slides used by permission from Dr. Eamonn Keogh, University of California Riverside; [email protected]


The 2000 ozone hole over the antarctic seen by EPTOMS http://jwocky.gsfc.nasa.gov/multi/multi.html#hole


OBJECTIVE

Explore some of the applications of data mining techniques.

Data Mining Applications Outline

• Introduction – Data Mining Overview
  • Classification (Prediction, Forecasting)
  • Clustering
  • Association Rules (Link Analysis)
• Applications
  • Fraud Detection & Illegal Activities
  • Facial Recognition
  • Cheating & Plagiarism
  • Bioinformatics
• Conclusions

Data Mining Overview

• Finding hidden information in a database
• Fit data to a model
• You must know what you are looking for
• You must know how to look for it


“If it looks like a duck,

walks like a duck, and

quacks like a duck, then

it’s a duck.”

Description: Classification (Profiling)
Behavior: Clustering (Similarity)
Associations: Link Analysis

“If it looks like a terrorist,

walks like a terrorist, and

quacks like a terrorist, then

it’s a terrorist.”

Classification Applications

• Teachers classify students’ grades as A, B, C, D, or F.
• Letter Recognition
• Handwriting Recognition
• Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
• Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254

Classification Example

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Grasshoppers

Katydids

(c) Eamonn Keogh, [email protected]

Figure: Antenna Length (vertical axis) versus Abdomen Length (horizontal axis), each on a scale of 1 to 10, for Grasshoppers and Katydids.

(c) Eamonn Keogh, [email protected]
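
The insect example can be sketched as a nearest-neighbor classifier over the two plotted measurements. This is only an illustration: the measurements below are made-up values, not the data behind the slide, and k-nearest-neighbor is one possible classification technique among many. The names `TRAINING` and `classify` are hypothetical.

```python
import math

# Hypothetical (abdomen_length, antenna_length) pairs; illustrative values only.
TRAINING = [
    ((1.0, 2.0), "Grasshopper"),
    ((2.0, 3.0), "Grasshopper"),
    ((3.0, 2.0), "Grasshopper"),
    ((5.0, 3.0), "Grasshopper"),
    ((6.0, 4.0), "Grasshopper"),
    ((2.0, 6.0), "Katydid"),
    ((3.0, 8.0), "Katydid"),
    ((4.0, 7.0), "Katydid"),
    ((6.0, 9.0), "Katydid"),
    ((7.0, 8.0), "Katydid"),
]


def classify(sample, training=TRAINING, k=3):
    """Label a sample by majority vote among its k nearest training points."""
    # Rank the annotated insects by Euclidean distance to the unlabeled one.
    ranked = sorted(training, key=lambda item: math.dist(sample, item[0]))
    votes = [label for _, label in ranked[:k]]
    # With two classes and an odd k, the vote cannot tie.
    return max(set(votes), key=votes.count)


unlabeled = (3.0, 7.5)
label = classify(unlabeled)
```

With these made-up points, an unlabeled insect with a long antenna falls among the Katydids, while one with a short antenna falls among the Grasshoppers.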

Clustering Applications

• Targeted Marketing
• Determining Gene Functionality
• Identifying Species
• Clustering vs. Classification
  • No prior knowledge
    • Number of clusters
    • Meaning of clusters
  • Unsupervised learning

http://149.170.199.144/multivar/ca.htm

What is Similarity?

(c) Eamonn Keogh, [email protected]

Association Rules Applications

• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• www.amazon.com
• Book Stores
• Department Stores
• Advertising
• Product Placement
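
A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both over a handful of made-up market baskets; the data and the names `BASKETS`, `support`, and `confidence` are assumptions for illustration, not taken from the talk.

```python
# Hypothetical market baskets; illustrative only.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Estimate of P(consequent | antecedent) from the baskets.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

Here two of the five baskets contain both items and three contain diapers, so the rule diapers => beer has support 2/5 and confidence 2/3.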


Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.


Fraud Detection

• Identify fraudulent behavior
• Used extensively in financial, law enforcement, health care, etc. sectors
• http://www.aaai.org/AITopics/html/fraud.html
• SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
• Neural Technologies: http://www.neuralt.com/fraud_management.html

Law Enforcement

• Identify suspect behavior and relationships
• I2 Inc.
  • Investigative analytic/visualization software
  • http://www.i2inc.com
• Social Network Analysis – Analyze patterns of relationships
  • Relationships: personal, religious, operational, etc.

Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag, vol. 3495, 2005, p. 287.


How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm

Facial Recognition

• Based upon features in face
• Convert face to a feature vector
• Less invasive than other biometric techniques
• http://www.face-rec.org
• http://computer.howstuffworks.com/facial-recognition.htm
• SIMS: http://www.casinoincidentreporting.com/Products.aspx

(c) Eamonn Keogh, [email protected]


Cheating on Multiple Choice Tests

• Similarity between tests based on number of common wrong answers. (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol. 27, no. 7, 2000, pp. 909-923.)
• The number of common correct answers is often ignored.
• H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol. 73, no. 4, pp. 349-351):

H-H = (Number of exact answers in common) / (Number of different answers)
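
As a rough sketch, the H-H index can be computed for a pair of answer sheets as follows. Two assumptions are made here and should be checked against the cited paper: the index is taken to be a ratio (common answers divided by differing answers), and "exact answers in common" is read as identical wrong answers, in line with the remark that common correct answers are often ignored. The function name `hh_index` and the sample sheets are hypothetical.

```python
import math


def hh_index(answers_a, answers_b, key):
    """Ratio of identical wrong answers to differing answers for two exams."""
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer sheets and key must have the same length")
    # Questions where both students chose the same incorrect option.
    wrong_in_common = sum(
        a == b and a != k for a, b, k in zip(answers_a, answers_b, key)
    )
    # Questions where the two students chose different options.
    differences = sum(a != b for a, b in zip(answers_a, answers_b))
    if differences == 0:
        # Identical sheets: the ratio is undefined, so report infinity
        # when any shared error exists and 0.0 otherwise.
        return math.inf if wrong_in_common else 0.0
    return wrong_in_common / differences


key = "ABCDABCDAB"
student_1 = "ABCCABDDBB"
student_2 = "ABCCABDDAC"
score = hh_index(student_1, student_2, key)
```

In this made-up pair the students share two identical wrong answers and differ on two questions, giving a ratio of 1.0. Any threshold for "suspicious" would have to come from the literature, not from this sketch.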

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

No/Little Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

Rampant Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


DNA

• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together

http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein

DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE

www.bioalgorithms.info; chapter 6; Gene Prediction
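
The DNA -> RNA -> Protein example on this slide can be reproduced in a few lines. The sketch below assumes the DNA string is the coding strand, so transcription amounts to replacing T with U, and it includes only the seven codons of the standard genetic code that this example needs. The names `CODON_TABLE`, `transcribe`, and `translate` are hypothetical.

```python
# Partial codon table (standard genetic code); only the codons in this example.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna):
    """Write the RNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA three bases at a time using the partial table above.

    Raises KeyError for any codon that is not in the partial table.
    """
    usable = len(rna) - len(rna) % 3  # ignore a trailing incomplete codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
protein = translate(rna)
```

The seven codons CCU GAG CCA ACU AUU GAU GAA translate to the one-letter amino acid codes P, E, P, T, I, D, E, which is how the slide's sequence spells PEPTIDE.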

miRNA

• Short (20-25 nt) sequence of noncoding RNA
• Known since 1993 but significance not widely appreciated until 2001
• Impact / Prevent translation of mRNA
• Generally reduce protein levels without impacting mRNA levels (animal cells)
• Functions
  • Causes some cancers
  • Guide embryo development
  • Regulate cell differentiation
  • Associated with HIV
  • …

Questions

• If each cell in an organism contains the same DNA –
  • How does each cell behave differently?
  • Why do cells behave differently during childhood?
  • What causes some cells to act differently – such as during disease?
• DNA contains many genes, but only a few are being transcribed – why?
• One answer - miRNA

http://www.time.com/time/magazine/article/0,9171,1541283,00.html

Human Genome

• Scientists originally thought there would be about 100,000 genes
• Appear to be about 20,000
• WHY?
• Almost identical to that of Chimps. What makes the difference?
• Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

Visualization from UCRdnaQT.mov


RNAi – Nobel Prize in Medicine 2006

Double stranded RNA

Short Interfering RNA (~20-25 nt)

RNA-Induced Silencing Complex

Binds to mRNA

Cuts RNA

siRNA may be artificially added to cell!

Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

Computer Science & Bioinformatics

• Algorithms
• Data Structures
• Improving efficiency
• Data Mining
• Biologists don’t usually understand or even appreciate what Computer Science can do
• Issues:
  • Scalability
  • Fuzzy
• We will look at:
  • Microarray Clustering
  • TCGR


Affymetrix GeneChip® Array

http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

Microarray Data Analysis

• Each probe location associated with gene
• Measure the amount of mRNA
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
  • Which genes are related to this disease?
  • Which genes behave in a similar manner?
  • What is the function of a gene?
• Clustering
  • Hierarchical
  • K-means
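
To make the question "which genes behave in a similar manner?" concrete, here is a minimal k-means sketch over expression profiles. The gene names and expression values are invented for illustration, and seeding the centroids with the first k profiles is an arbitrary simplification; real analyses typically use a library implementation and several random restarts.

```python
import math

# Hypothetical expression levels for four genes across three samples.
PROFILES = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (5.0, 5.1, 4.8),
    "geneC": (0.8, 1.1, 1.0),
    "geneD": (4.9, 5.3, 5.0),
}


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign each profile to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the profiles assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


genes = list(PROFILES)
assignment, centroids = kmeans([PROFILES[g] for g in genes], k=2)
clusters = {g: a for g, a in zip(genes, assignment)}
```

With these made-up profiles, the low-expression genes (geneA, geneC) end up in one cluster and the high-expression genes (geneB, geneD) in the other.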

Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004.

miRNA Research Issues

• Predict / Find miRNA in genomic sequence
• Predict miRNA targets
• Identify miRNA functions

Temporal CGR (TCGR) 2D Array

• Each Row represents counts for a particular window in sequence
  • First row – first window
  • Last row – last window
  • We start successive windows at the next character location
• Each Column represents the counts for the associated pattern in that window
  • Initially we have assumed order of patterns is alphabetic
• Size of TCGR depends on sequence length and subpattern length
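
The row and column layout described above can be sketched as a sliding-window count table. This is an illustration under stated assumptions, not the authors' implementation: only full-length windows are used, overlapping occurrences of a pattern inside a window are all counted, and the alphabet is taken to be A, C, G, U. The function name `tcgr` and the sample sequence are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Build a window-by-pattern count table.

    Rows are successive windows (each starting one character later);
    columns are all patterns of pattern_len in alphabetical order.
    """
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: i for i, p in enumerate(patterns)}
    rows = []
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        counts = [0] * len(patterns)
        # Count every (possibly overlapping) subpattern inside this window.
        for i in range(window - pattern_len + 1):
            sub = chunk[i:i + pattern_len]
            if sub in column:  # skip characters outside the alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


patterns, rows = tcgr("ACGUAC", window=4, pattern_len=2)
```

Under these assumptions a sequence of length n yields n - window + 1 rows and 4^pattern_len columns, so the six-character sample produces a 3 x 16 table.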


TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3

TCGR – Mature miRNA (Window=5; Pattern=3)

All Mature

Mus Musculus

Homo Sapiens

C Elegans

ACG CGC GCG UCG


POSITIVE

NEGATIVE

TCGRs for Xue Training Data

C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.


POSITIVE

NEGATIVE

TCGRs for Xue Test Data


Conclusions

• Not magic
• Doesn’t work for all applications
  • Stock Market Prediction
• Issues
  • Privacy
  • Data

Here are some infamous examples of failed data mining applications


Dallas Morning News

October 7, 2005


http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

BIG BROTHER?

• Total Information Awareness
  • http://infowar.net/tia/www.darpa.mil/iao/index.htm
  • http://www.govtech.net/magazine/story.php?id=45918
  • http://en.wikipedia.org/wiki/Information_Awareness_Office
• Terror Watch List
  • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  • http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
  • http://www.thedenverchannel.com/news/9559707/detail.html
• CAPPS
  • http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  • http://en.wikipedia.org/wiki/CAPPS
