TRANSCRIPT
Midterm Review: CIS 563 – Intro to Data Science
Sagnik Basumallik
Some Important Points from the slides:
Types of Clusterings: hierarchical and partitional
Clustering Algorithms: K‐means and its variants, Hierarchical clustering, Density‐based clustering
k‐means complexity
Problems with Selecting Initial Points
Solutions to Initial Centroids Problem:
• Multiple runs: helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial centroids
• Select more than k initial centroids, then select among these initial centroids (pick the most widely separated)
• Postprocessing
• Bisecting K‐means: not as susceptible to initialization issues
What is postprocessing? (Eliminate small clusters that may represent outliers; split "loose" clusters with relatively high SSE; merge clusters that are close and have relatively low SSE.)
Evaluating K‐means Clusters
K‐means has problems when clusters have differing sizes, differing densities, or non‐globular shapes, and when the data contains outliers.
How to overcome this? One solution is to use many clusters: each finds part of a true cluster, and the parts must then be put back together.
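A minimal sketch of Lloyd's K‐means with the "multiple runs" remedy from the notes above: run several random initializations and keep the one with the lowest SSE. The function name and parameter defaults are illustrative, not from the slides.

```python
import numpy as np

def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """Lloyd's algorithm with multiple random restarts; keeps the run
    with the lowest SSE (sum of squared distances to centroids)."""
    rng = np.random.default_rng(seed)
    best_sse, best_labels, best_centroids = np.inf, None, None
    for _ in range(n_init):
        # initialize centroids as k distinct random points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assignment step: nearest centroid for every point
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # update step: centroid = mean of its assigned points
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        sse = ((X - centroids[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best_labels, best_centroids = sse, labels, centroids
    return best_labels, best_centroids, best_sse
```

A single unlucky initialization can land both centroids in one true cluster; keeping the best of several runs makes that much less likely, which is exactly the "multiple runs" point above.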
Types of Hierarchical Clustering: Agglomerative and Divisive
Algorithms for each clustering process
How to Define Inter‐Cluster Similarity: pros and cons
Type                       | Pros                                    | Cons
MIN                        | Can handle non‐elliptical shapes        | Sensitive to noise and outliers
MAX                        | Less susceptible to noise and outliers  | Tends to break large clusters; biased towards globular clusters
Group Average              | Less susceptible to noise and outliers  | Biased towards globular clusters
Distance Between Centroids | Less susceptible to noise and outliers  | Biased towards globular clusters
Ward's Method              | Less susceptible to noise and outliers  | Biased towards globular clusters

Ward's Method is the hierarchical analogue of K‐means: the similarity of two clusters is based on the increase in squared error (SSE) when they are merged, where
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
Hierarchical Clustering: Problems and Limitations: Once a decision is made to combine two clusters, it
cannot be undone No objective function is directly minimized Different schemes have problems with
one or more of the following: Sensitivity to noise and outliers Difficulty handling different sized clusters
and convex shapes Breaking large clusters
Density‐based clustering: DBSCAN
Works well:
• Resistant to noise
• Can handle clusters of different shapes and sizes
Does not work well with:
• Varying densities
• High‐dimensional data
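A minimal DBSCAN sketch showing the core/noise mechanics behind the strengths listed above; the eps and min_pts defaults are arbitrary choices for toy data, not recommended settings.

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Minimal DBSCAN: labels >= 0 are cluster ids, -1 marks noise.
    Uses a full pairwise distance matrix, fine only for small n."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from this unvisited core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:  # only core points expand the cluster
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels
```

Because clusters grow by following chains of core points, shape does not matter; but a single global eps is why varying densities are a problem.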
Measures of Cluster Validity
1. Internal Measures:
a. Cohesion and Separation
b. Silhouette Coefficient: combines ideas of both cohesion and separation, for individual
points as well as for clusters and clusterings
c. Correlation
2. External measures
a. Entropy
b. Purity
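The two external measures can be sketched directly from their definitions (purity: fraction of points whose cluster's majority class matches their class; entropy: per-cluster class entropy weighted by cluster size, lower is better). The function names and list-based signatures are mine.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """clusters/labels: parallel lists of cluster ids and true class ids.
    Sums each cluster's majority-class count, divided by n."""
    n = len(labels)
    total = 0
    for c in set(clusters):
        members = [labels[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def entropy(clusters, labels):
    """Weighted average of per-cluster class entropy (bits)."""
    n = len(labels)
    h = 0.0
    for c in set(clusters):
        members = [labels[i] for i in range(n) if clusters[i] == c]
        m = len(members)
        for count in Counter(members).values():
            p = count / m
            h -= (m / n) * p * math.log2(p)
    return h
```

A perfect clustering gives purity 1.0 and entropy 0.0; putting everything in one cluster drives purity toward the largest class's share and entropy toward the class entropy of the whole dataset.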
SSE: sum of squared error; the most common internal measure of cluster cohesion (lower is better).
Time Series
1. SAX (Symbolic Aggregate approXimation)
2. DTW (Dynamic Time Warping)
Association Rule Mining:
1. find frequent itemsets
2. Hash trees
Given d items:
Total number of itemsets = 2^d
Total number of possible association rules: R = 3^d − 2^(d+1) + 1
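The standard closed form for the rule count is R = 3^d − 2^(d+1) + 1: each item goes into the antecedent, the consequent, or neither (3^d assignments), minus the cases where either side is empty. A small sketch checking it by brute force (function names are mine):

```python
from itertools import combinations

def count_rules(d):
    """Closed form: 3^d assignments of items to {antecedent, consequent,
    neither}, minus 2^d with empty antecedent and 2^d with empty
    consequent, plus 1 for the doubly-subtracted all-empty case."""
    return 3**d - 2**(d + 1) + 1

def count_rules_brute(d):
    """Direct count: for every itemset of size k, each of its nonempty
    proper subsets can serve as the antecedent (2^k - 2 splits)."""
    total = 0
    for k in range(1, d + 1):
        for itemset in combinations(range(d), k):
            total += 2**k - 2
    return total
```

For d = 6 both give 602 rules, which is why exhaustive rule generation is infeasible and support-based pruning matters.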
Subset Operation: Given a transaction t, what are the possible subsets of size 3?
Support and Confidence
Example 1: Find frequent itemsets and the strong association rules
ID List
1 I1 I2 I3
2 I2 I4
3 I2 I3
4 I1 I2 I4
5 I1 I3
6 I2 I3
7 I1 I3
8 I1 I2 I3 I5
9 I1 I2 I3
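For Example 1, support counts can be checked with a brute-force miner over the nine transactions above. The minimum support count of 2 is an assumption (the notes do not state a threshold); Apriori would search the same space but prune it using the anti-monotone property of support.

```python
from itertools import combinations

# the nine transactions from Example 1
transactions = [
    {"I1", "I2", "I3"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def frequent_itemsets(transactions, min_count=2):
    """Count support for every candidate itemset (fine for 5 items);
    returns {itemset tuple: support count} for the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_count:
                result[cand] = count
    return result
```

Strong rules are then read off the frequent itemsets: e.g. the confidence of I1,I2 → I3 is support({I1,I2,I3}) / support({I1,I2}).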
Example 2: Hash Trees
{145} {124} {457} {125} {458} {159} {136} {234} {567} {345} {356}
Length = 3
Given Hash function:
Example 3: DTW measures similarity between two sequences that may vary in time or speed.
Let the two time series be given as:
A = [1 3 4 9 8 2 1 5 7 3]
B = [1 6 2 3 0 9 4 3 6 3]
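The standard dynamic-programming formulation of DTW, applied to the two series above. Absolute difference is used as the local cost here; that is one common choice, and the course's worked example may use squared difference instead.

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW distance.
    D[i][j] = cost of the best warping of a[:i] onto b[:j]."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: match (diagonal), insertion, deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

A = [1, 3, 4, 9, 8, 2, 1, 5, 7, 3]
B = [1, 6, 2, 3, 0, 9, 4, 3, 6, 3]
```

Because warping can only lower the alignment cost, dtw(A, B) is never larger than the point-by-point (diagonal) distance between the two series.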
Example 4:
Example 5: