
Midterm Review: CIS 563 – Intro to data science 

Sagnik Basumallik 

Some Important Points from the slides: 

Types of Clusterings: hierarchical and partitional 

Clustering Algorithms: K‐means and its variants, Hierarchical clustering, Density‐based clustering 

k‐means complexity: O(n · K · I · d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. 

Problems with Selecting Initial Points 

Solutions to the Initial Centroids Problem: 
• Multiple runs: helps, but probability is not on your side 
• Sample and use hierarchical clustering to determine initial centroids 
• Select more than k initial centroids, then select among these initial centroids (e.g., the most widely separated ones) 
• Postprocessing 
• Bisecting K‐means: not as susceptible to initialization issues 
(A sketch of the multiple-runs approach follows below.) 
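As a rough illustration of the multiple-runs idea, here is a minimal sketch using scikit-learn's KMeans: run the algorithm several times from different random initial centroids and keep the run with the lowest SSE (exposed as inertia_ in scikit-learn). The dataset and the number of runs are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run K-means several times with different random initial centroids
# and keep the run with the lowest SSE (inertia_).
best_sse, best_model = np.inf, None
for seed in range(10):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    if km.inertia_ < best_sse:
        best_sse, best_model = km.inertia_, km

print(f"Best SSE over 10 runs: {best_sse:.2f}")
```

In practice KMeans does this internally via its n_init parameter; the explicit loop just makes the idea visible.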

What is postprocessing? 

Evaluating K‐means Clusters: the most common measure is the Sum of Squared Errors (SSE); see the formula further below. 

K‐means has problems when clusters are of differing: 
• Sizes 
• Densities 
• Non‐globular shapes 
K‐means also has problems when the data contains outliers. 

How to overcome? One solution is to use many clusters: this finds parts of clusters, which then need to be put back together. 

Types of Hierarchical Clustering: Agglomerative and Divisive 

Algorithms for each clustering process 

How to Define Inter‐Cluster Similarity: pros and cons 

Type | Pros | Cons 
MIN | Can handle non‐elliptical shapes | Sensitive to noise and outliers 
MAX | Less susceptible to noise and outliers | Tends to break large clusters; biased towards globular clusters 
Group Average | Less susceptible to noise and outliers | Biased towards globular clusters 
Distance Between Centroids | Less susceptible to noise and outliers | Biased towards globular clusters 
Ward's Method | Less susceptible to noise and outliers | Biased towards globular clusters 

SSE (the K‐means objective; Ward's method merges the pair of clusters that least increases it): 

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x) 

where C_i is the i-th cluster and m_i is its centroid. 
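A minimal sketch computing this SSE directly from the formula; X (the data points), labels (cluster assignments), and centroids are assumed NumPy arrays produced by whatever clustering was run.

```python
import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of dist(m_i, x)^2."""
    total = 0.0
    for i, m in enumerate(centroids):
        cluster_points = X[labels == i]             # the points in cluster C_i
        total += np.sum((cluster_points - m) ** 2)  # squared Euclidean distances
    return total
```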

Hierarchical Clustering: Problems and Limitations: 
• Once a decision is made to combine two clusters, it cannot be undone 
• No objective function is directly minimized 
• Different schemes have problems with one or more of the following: 
  o Sensitivity to noise and outliers 
  o Difficulty handling different-sized clusters and convex shapes 
  o Breaking large clusters 
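To see the linkage schemes from the table above in practice, here is a minimal sketch using SciPy's agglomerative clustering; "single" corresponds to MIN and "complete" to MAX, and the toy data and cluster count are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((20, 2))  # toy 2-D points

# single = MIN, complete = MAX, average = group average,
# centroid = distance between centroids, ward = Ward's method
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```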

Density‐based clustering: DBSCAN 
Strengths: 
• Resistant to noise 
• Can handle clusters of different shapes and sizes 
Does not work well with: 
• Varying densities 
• High‐dimensional data 
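A minimal sketch showing DBSCAN recovering a non-globular shape that K-means would split badly; the eps and min_samples values are illustrative choices, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-globular clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks noise points
```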

Measures of Cluster Validity 

1. Internal Measures:  

a. Cohesion and Separation 

b. Silhouette Coefficient: combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings (see the sketch after this list) 

c. Correlation 

2. External measures 

a. Entropy 

b. Purity 
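A minimal sketch of using the Silhouette Coefficient to compare clusterings with scikit-learn; the toy data and the range of k are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Silhouette combines cohesion (a = mean intra-cluster distance) and
# separation (b = mean distance to the nearest other cluster):
# s = (b - a) / max(a, b), in [-1, 1]; higher is better.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```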

SSE: as defined earlier, SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x); lower SSE indicates more cohesive clusters. 

Time Series 

1. SAX (Symbolic Aggregate approXimation) 

2. DTW (Dynamic Time Warping) 
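A minimal SAX sketch following the usual recipe (z-normalize, reduce with PAA, then discretize segment means at equiprobable Gaussian breakpoints); the segment count and alphabet size are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments, alphabet_size):
    x = (series - series.mean()) / series.std()        # z-normalize
    segments = np.array_split(x, n_segments)           # PAA: split into segments...
    paa = np.array([seg.mean() for seg in segments])   # ...and average each one
    # Breakpoints = equiprobable quantiles of the standard normal
    cuts = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(cuts, paa)               # bin index per segment
    return "".join(chr(ord("a") + int(s)) for s in symbols)

print(sax(np.array([1.0, 3, 4, 9, 8, 2, 1, 5, 7, 3]), n_segments=5, alphabet_size=3))
```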

Association Rule Mining: 

1. Find frequent itemsets 

2. Hash trees 

 

Given d items: 
• Total number of itemsets = 2^d 
• Total number of possible association rules: R = 3^d − 2^{d+1} + 1 
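For example, with d = 6 items: 2^6 = 64 itemsets, and 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602 possible association rules.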

Subset Operation: Given a transaction t, what are the possible subsets of size 3? 

Support and Confidence 
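For a rule X ⇒ Y mined over N transactions, with σ(·) denoting the support count:

support(X ⇒ Y) = σ(X ∪ Y) / N 
confidence(X ⇒ Y) = σ(X ∪ Y) / σ(X) 

A rule is strong when its support and confidence both meet the chosen minimum thresholds (minsup, minconf).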


Example 1: Find frequent itemsets and the strong association rules 

ID  List 

1  I1 I2 I3 

2  I2 I4 

3  I2 I3 

4  I1 I2 I4 

5  I1 I3 

6  I2 I3 

7  I1 I3 

8  I1 I2 I3 I5 

9  I1 I2 I3 
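Since the exercise leaves the answer open, here is a minimal brute-force sketch that counts itemset supports over the nine transactions above; the minimum support count of 2 is an assumed threshold, as the exercise does not state one.

```python
from itertools import combinations

transactions = [
    {"I1", "I2", "I3"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
items = sorted(set().union(*transactions))
min_sup = 2  # assumed minimum support count

for size in range(1, len(items) + 1):
    frequent = []
    for cand in combinations(items, size):
        count = sum(set(cand) <= t for t in transactions)  # support count
        if count >= min_sup:
            frequent.append((cand, count))
    if not frequent:
        break  # Apriori property: no frequent k-itemsets => none of size k+1
    print(f"frequent {size}-itemsets:", frequent)
```

The strong rules can then be read off by checking each frequent itemset's rule confidences against a minconf threshold.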

 

   


Example 2: Hash Trees 

{145} {124} {457} {125} {458} {159} {136} {234} {567} {345} {356}  

Length = 3 

Given Hash function:  
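The hash function itself is not shown above; in the widely used version of this example it is h(p) = p mod 3, sending items 1, 4, 7 to one branch, 2, 5, 8 to a second, and 3, 6, 9 to a third, with at most 3 candidates per leaf. Treating that as an assumption, here is a minimal sketch of the first split, plus the size-3 subset enumeration asked about earlier.

```python
from itertools import combinations

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
              (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6)]

def h(p):
    return p % 3  # assumed hash function

# First level of the hash tree: branch on the hash of the first item.
root = {}
for c in candidates:
    root.setdefault(h(c[0]), []).append(c)
for branch, members in sorted(root.items()):
    print(f"branch {branch}: {members}")

# Subset operation: all size-3 subsets of a transaction t, i.e. the
# candidate itemsets that could possibly be contained in t.
t = (1, 2, 3, 5, 6)  # hypothetical transaction
print(list(combinations(t, 3)))
```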

 

   


Example 3: DTW measures similarity between two sequences which may vary in time or speed. 

Let the two time series be given as: 

A = [1 3 4 9 8 2 1 5 7 3] 

B = [1 6 2 3 0 9 4 3 6 3] 
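A minimal sketch of the standard DTW dynamic program, using |a_i − b_j| as the local cost (the course materials may use a different local distance, e.g. squared difference).

```python
import numpy as np

def dtw(a, b):
    """Fill a cost matrix where each cell adds the local distance
    |a[i]-b[j]| to the cheapest of its three predecessor cells."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

A = [1, 3, 4, 9, 8, 2, 1, 5, 7, 3]
B = [1, 6, 2, 3, 0, 9, 4, 3, 6, 3]
print(dtw(A, B))  # DTW distance between the two series
```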

   


Example 4: 

 

   


Example 5: