Clustering. Computational Journalism week 2

Frontiers of Computational Journalism, Columbia Journalism School. Week 2: Clustering. September 12, 2014


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2014. Syllabus at http://www.compjournalism.com/?p=113

TRANSCRIPT

Page 1: Clustering. Computational Journalism week 2

Frontiers of Computational Journalism

Columbia Journalism School

Week 2: Clustering

September 12, 2014

Page 2

Classification and Clustering

"Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general."

– Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Page 3

Vector representation of objects

x = (x1, x2, x3, …, xN)

Each xi is a numerical or categorical feature. N = the number of features, or the "dimension".

Page 4

Examples of vector representations

Obvious:
– movies watched / items purchased
– legislative voting history for a politician
– crime locations

Less obvious, but standard:
– document vector space model
– psychological survey results

Tricky research problem: disparate field types
– corporate filing document
– Wikileaks SIGACT

Page 5

What can we do with vectors?

Predict one variable based on others – this is called "regression" (supervised machine learning).

Group similar items together – this is classification or clustering. We may or may not know pre-existing classes.

Page 6

Distance metric

Intuitively: how (dis)similar are two items? Formally:

d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Page 7

Distance metric

d(x, y) ≥ 0 – distance is never negative
d(x, x) = 0 – "reflexivity": zero distance to self
d(x, y) = d(y, x) – "symmetry": x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z) – "triangle inequality": going direct is shorter
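As a concrete sketch (my illustration, not from the slides): Euclidean distance satisfies all four axioms, and a few asserts make that spot-checkable on sample points.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Spot-check the four metric axioms on sample points (a check, not a proof)
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # reflexivity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
```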

   

Page 8

Distance matrix

Data matrix for M objects of N dimensions: X is the M × N matrix whose rows are the object vectors x1, x2, …, xM, so Xi,j is feature j of object i.

Distance matrix: D is the M × M matrix with entries

Di,j = Dj,i = d(xi, xj)
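In code, building D from the data matrix is a double loop. This pure-Python sketch (my illustration, assuming Euclidean distance) fills only the upper triangle and mirrors it, using the symmetry Di,j = Dj,i:

```python
import math

def distance_matrix(X):
    """Given M points (the rows of the data matrix), return the M x M
    symmetric matrix D with D[i][j] = Euclidean d(x_i, x_j)."""
    M = len(X)
    D = [[0.0] * M for _ in range(M)]      # diagonal stays 0: d(x, x) = 0
    for i in range(M):
        for j in range(i + 1, M):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d          # symmetry: D_ij = D_ji
    return D

D = distance_matrix([(0, 0), (3, 4), (0, 4)])
print(D[0][1])   # → 5.0
```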

Page 9
Page 10

We think of a cluster like this…

Page 11

Real data isn't so simple…

Page 12

Many possible definitions of a cluster

Page 13

Many possible definitions of a cluster

• "every point inside is closer to the center of this cluster than the center of any other"
• "no point outside this cluster is closer than ε to any point inside"
• "every point in this cluster is closer to all points inside than any point outside"

Page 14

Different clustering algorithms

• Partitioning – keep adjusting clusters until convergence – e.g. k-means
• Agglomerative hierarchical – start with leaves, repeatedly merge clusters – e.g. MIN and MAX approaches
• Divisive hierarchical – start with root, repeatedly split clusters – e.g. binary split

Page 15

K-means demo

http://www.paused21.net/off/kmeans/bin/
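The partitioning idea behind the demo can be sketched in a few lines of pure Python. This is illustrative only; the naive seeding from the first k points is my simplification (real implementations seed randomly or with k-means++):

```python
def kmeans(points, k, iters=20):
    """Partitioning: repeatedly assign each point to its nearest center,
    then move each center to the mean of its cluster."""
    centers = [list(p) for p in points[:k]]   # naive seeding: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: join the nearest center (squared distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its members
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

pts = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(kmeans(pts, 2))   # two well-separated blobs come back as two clusters
```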

Page 16

Agglomerative – combining clusters

put each item into a leaf node
while num clusters > 1:
    find the two closest clusters
    merge them
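The loop above can be sketched directly in Python. This version (my illustration) uses the single-link / "min" inter-cluster distance from the next slide and stops at a requested number of clusters rather than at one:

```python
import math

def single_link_cluster(points, num_clusters):
    """Agglomerative clustering: start with one leaf cluster per item,
    then repeatedly merge the two clusters whose closest members
    (single link / "min") are nearest, until num_clusters remain."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = [[p] for p in points]              # each item in a leaf node
    while len(clusters) > num_clusters:
        best = None                               # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))       # merge the two closest clusters
    return clusters
```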

 

Page 17

single link or "min", complete link or "max", average

Page 18
Page 19

UK House of Lords voting clusters

Algorithm instructed to separate MPs into five clusters. Output:

1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
…

Page 20

Voting clusters with parties

LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
…

Page 21

Clustering algorithm

Input: data points (feature vectors). Output: a set of clusters, each of which is a set of points.

Visualization

Input: data points (feature vectors). Output: a picture of the points.

Page 22

Dimensionality reduction

Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower-dimensional points

y ∈ R^K, with K ≪ N. Probably K = 2 or K = 3.

Page 23

This is called "projection"

Projection from 3 to 2 dimensions

Page 24

Linear projections

Projects in a straight line to the closest point on the "screen." Mathematically,

y = Px

where P is a K × N matrix.

Projection from 2 to 1 dimensions
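A quick sketch of y = Px (my example, not from the slides): the simplest projection matrix keeps the first two coordinates and drops the rest, projecting 3D points straight onto the x-y "screen".

```python
def project(P, x):
    """Linear projection y = P x: P is a K x N matrix (a list of K rows),
    x is a length-N point; the result y has K components."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]

# Keep the first two coordinates, throw out the third
P = [[1, 0, 0],
     [0, 1, 0]]
print(project(P, [2, 5, 9]))   # → [2, 5]
```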

Page 25

Think of this as rotating to align the "screen" with the coordinate axes, then simply throwing out the values of the higher dimensions.

Projection from 3 to 2 dimensions

Page 26

Which direction should we look from? Principal components analysis: find the linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for projection.
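As an illustrative sketch (my addition, using power iteration in place of a full eigendecomposition): the top eigenvector of the covariance matrix, the direction of greatest variance, can be found by repeatedly applying the matrix and renormalizing.

```python
def first_principal_component(X, iters=100):
    """Direction of greatest variance: the top eigenvector of the
    covariance matrix, found here by power iteration."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    centered = [[v - m for v, m in zip(row, means)] for row in X]
    # covariance matrix C[i][j] over the centered data
    C = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
         for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]                 # renormalize each step
    return v
```

For points scattered along the line y = x, the result is the diagonal direction (1/√2, 1/√2).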

Page 27

Sometimes overlap is unavoidable

Page 28

Real data isn't so simple…

Page 29

Nonlinear projections

Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f() that is not linear. So it may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Page 30

Multidimensional scaling

Idea: try to preserve the distances between points "as much as possible."

If we have the distances between all points in a distance matrix,

Di,j = |xi − xj| for all i, j

we can recover the original coordinates {xi} exactly (up to rigid transformations). It's like working out the map of a country when you know how far each city is from every other.

Page 31

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Page 32

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M × M, where M is the number of points). The MDS formula therefore (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.

Page 33

MDS stress minimization

The formula actually minimizes "stress." Think of "springs" between every pair of points, where the spring between xi and xj has rest length di,j:

stress(x) = Σi,j ( |xi − xj| − di,j )²

Stress is zero if all the high-dimensional distances are matched exactly in the low dimension.
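Computing the stress of a candidate low-dimensional layout Y against the original distance matrix D is a direct translation of the formula (a sketch of my own; here each unordered pair is counted once rather than twice):

```python
def stress(Y, D):
    """Sum over point pairs of (low-dim distance minus original d_ij)^2.
    Zero exactly when every original distance is reproduced."""
    total = 0.0
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            d_low = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])) ** 0.5
            total += (d_low - D[i][j]) ** 2
    return total

# A layout that reproduces D exactly has zero stress
D = [[0.0, 5.0], [5.0, 0.0]]
print(stress([(0, 0), (3, 4)], D))   # → 0.0
```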

Page 34

Multi-dimensional scaling

Like "flattening" a stretchy structure into 2D, so that the distances between points are preserved (as much as possible).

Page 35

House of Lords MDS plot

Page 36

Robustness of results

Regarding these analyses of congressional voting, we could still ask:

• Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
• Are our underlying assumptions correct? (Do representatives really have "ideal points" in a preference space?)
• What are we trying to argue? What will be the effect of pointing out this result?

Page 37

Why do clusters have meaning?

What is the connection between mathematical and semantic properties?

Page 38

No unique "right" clustering

Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn't understand your context.

Page 39

Different libraries, different categories

Page 40