A Brief History of Data Mining Society

TRANSCRIPT

Page 1: A Brief History of Data Mining Society

A Brief History of Data Mining Society

1989 IJCAI Workshop on Knowledge Discovery in Databases

Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994 Workshops on Knowledge Discovery in Databases

Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

Journal of Data Mining and Knowledge Discovery (1997)

Page 2: A Brief History of Data Mining Society

A Brief History of Data Mining Society

ACM SIGKDD conferences since 1998 and SIGKDD Explorations

More conferences on data mining: PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

ACM Transactions on KDD starting in 2007

Page 3: A Brief History of Data Mining Society

Conferences and Journals on Data Mining

KDD Conferences

ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)

SIAM Data Mining Conf. (SDM)

(IEEE) Int. Conf. on Data Mining (ICDM)

Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)

Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

Page 4: A Brief History of Data Mining Society

Where to Find References? DBLP, CiteSeer, Google

Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

Bioinformatics
Conferences: RECOMB, CSB, PSB, BIBE, etc.
Journals: Bioinformatics, BMC Bioinformatics, TCBB, …

Page 5: A Brief History of Data Mining Society

Top-10 Algorithms Finally Selected at ICDM’06

#1: Decision Tree (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#8: kNN (45 votes)
#9: Naive Bayes (45 votes)
#10: CART (34 votes)

Page 6: A Brief History of Data Mining Society

Association Rules

Page 7: A Brief History of Data Mining Society

Association Rules

Page 8: A Brief History of Data Mining Society

Association Rules

For a rule X ⇒ Y:

support, s: the probability that a transaction contains X ∪ Y

confidence, c: the conditional probability that a transaction containing X also contains Y

Page 9: A Brief History of Data Mining Society

Association Rules

Let’s look at an example:

Page 10: A Brief History of Data Mining Society

Association Rules

TID   Items
T100  1, 2, 5
T200  2, 4
T300  2, 3
T400  1, 2, 4
T500  1, 3
T600  2, 3
T700  1, 3
T800  1, 2, 3, 5
T900  1, 2, 3
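Using the nine transactions above, support and confidence can be computed directly from their definitions. This is a minimal sketch; the function names `support` and `confidence` are mine, not from the slides.

```python
# Transactions from the slide, as sets of item IDs
transactions = {
    "T100": {1, 2, 5}, "T200": {2, 4},       "T300": {2, 3},
    "T400": {1, 2, 4}, "T500": {1, 3},       "T600": {2, 3},
    "T700": {1, 3},    "T800": {1, 2, 3, 5}, "T900": {1, 2, 3},
}

def support(itemset):
    """s = fraction of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions.values()) / len(transactions)

def confidence(X, Y):
    """c = P(Y in t | X in t) = support(X ∪ Y) / support(X)."""
    return support(set(X) | set(Y)) / support(set(X))

# Rule {1} => {2}: itemset {1, 2} occurs in T100, T400, T800, T900
print(support({1, 2}))       # 4/9 ≈ 0.444
print(confidence({1}, {2}))  # (4/9) / (6/9) ≈ 0.667
```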

Page 11: A Brief History of Data Mining Society

Association Rules

Page 12: A Brief History of Data Mining Society

Classification

Page 13: A Brief History of Data Mining Society

Classification—A Two-Step Process

Classification constructs a model from a training set whose tuples carry class labels (the values of a classifying attribute), then uses the model to classify new data.

It predicts categorical class labels (discrete or nominal).

Page 14: A Brief History of Data Mining Society

Classification

Typical applications Credit approval Target marketing Medical diagnosis Fraud detection And much more

Page 15: A Brief History of Data Mining Society

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

Page 16: A Brief History of Data Mining Society

Decision Tree

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.

Page 17: A Brief History of Data Mining Society

Decision Tree Example

Page 18: A Brief History of Data Mining Society

Decision Tree Algorithm

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root.

Attributes are categorical (continuous-valued attributes are discretized in advance).

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
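The greedy, top-down induction above can be sketched as a small ID3-style recursion on the slides' buys_computer table. This is a sketch under the slides' assumptions (categorical attributes, information gain as the selection measure); helper names such as `id3` and `gain` are mine.

```python
from collections import Counter
from math import log2

# Training tuples from the slides' buys_computer table:
# (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(rows):
    """Info(D) over the class labels (last field of each row)."""
    total = len(rows)
    return -sum(c / total * log2(c / total)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, i):
    """Information gain of splitting `rows` on attribute index `i`."""
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    after = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(rows) - after

def id3(rows, attrs):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:                 # pure partition -> leaf
        return labels.pop()
    if not attrs:                        # no attributes left -> majority vote
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, ATTRS.index(a)))
    i = ATTRS.index(best)
    branches = {}
    for r in rows:
        branches.setdefault(r[i], []).append(r)
    rest = [a for a in attrs if a != best]
    return {best: {v: id3(p, rest) for v, p in branches.items()}}

tree = id3(DATA, ATTRS)
print(tree)  # the root test is "age", as the gain computations on the later slides show
```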

Page 19: A Brief History of Data Mining Society

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i
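The entropy formula can be checked numerically. A minimal sketch: the hypothetical helper `info` takes the per-class counts |C_{i,D}| and estimates each p_i as count / |D|.

```python
from math import log2

def info(counts):
    """Info(D) = -sum_i p_i log2(p_i), with p_i estimated as count_i / |D|."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# The buys_computer table has 9 "yes" and 5 "no" tuples:
print(round(info([9, 5]), 3))  # ≈ 0.940, the Info(D) value used on the later slides
```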

Page 20: A Brief History of Data Mining Society

Attribute Selection Measure: Information Gain (ID3/C4.5)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)

Page 21: A Brief History of Data Mining Society

(training data table repeated; see Page 15)

Page 22: A Brief History of Data Mining Society

Decision Tree

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

(training data table repeated; see Page 15)

Page 23: A Brief History of Data Mining Society

Decision Tree

Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694

(training data table repeated; see Page 15)

The term \frac{5}{14}I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’s and 3 no’s:

I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}

Page 24: A Brief History of Data Mining Society

Decision Tree

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

Since “age” obtains the highest information gain, we partition the tuples on age first.
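The gain computation above can be reproduced in a few lines (a sketch reusing the entropy definition from Page 19; the helper name `info` is mine):

```python
from math import log2

def info(counts):
    """Info(D) = -sum_i p_i log2(p_i) over the class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])        # ≈ 0.940
# age partitions: "<=30" -> (2 yes, 3 no), "31…40" -> (4, 0), ">40" -> (3, 2)
info_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])
gain_age = info_D - info_age  # ≈ 0.247 unrounded; 0.940 - 0.694 = 0.246 on the slide
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```

The slide's 0.246 comes from subtracting the two displayed three-decimal values; the unrounded difference is about 0.247.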

Page 25: A Brief History of Data Mining Society

Decision Tree

Page 26: A Brief History of Data Mining Society

Decision Tree