Clustering. Computational Journalism week 2

Frontiers of Computational Journalism, Columbia Journalism School. Week 2: Clustering. September 12, 2014


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2014. Syllabus at http://www.compjournalism.com/?p=113

TRANSCRIPT

Page 1: Clustering. Computational Journalism week 2

Frontiers of Computational Journalism

Columbia Journalism School

Week 2: Clustering

September 12, 2014

Page 2

Classification and Clustering

"Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general."

– Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Page 3

Vector representation of objects

x = (x1, x2, x3, …, xN)

Each xi is a numerical or categorical feature. N = the number of features, or the "dimension".

Page 4

Examples of vector representations

Obvious:
– movies watched / items purchased
– legislative voting history for a politician
– crime locations

Less obvious, but standard:
– document vector space model
– psychological survey results

Tricky research problem: disparate field types
– corporate filing document
– Wikileaks SIGACT

Page 5

What can we do with vectors?

Predict one variable based on others – this is called "regression" (supervised machine learning).

Group similar items together – this is classification or clustering. We may or may not know pre-existing classes.

Page 6

Distance metric

Intuitively: how (dis)similar are two items? Formally:

d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Page 7

Distance metric

d(x, y) ≥ 0 – distance is never negative
d(x, x) = 0 – "reflexivity": zero distance to self
d(x, y) = d(y, x) – "symmetry": x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z) – "triangle inequality": going direct is shorter
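As a concrete sketch (my illustration, not from the slides): Euclidean distance satisfies all four axioms, and a few asserts make that spot-checkable on sample points.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Spot-check the four metric axioms on sample points (a check, not a proof)
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # reflexivity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
```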

   

Page 8

Distance matrix

Data matrix for M objects of N dimensions: X is the M × N matrix whose rows are the object vectors x1, x2, …, xM, so Xi,j is feature j of object i.

Distance matrix: D is the M × M matrix with entries

Di,j = Dj,i = d(xi, xj)
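In code, building D from the data matrix is a double loop. This pure-Python sketch (my illustration, assuming Euclidean distance) fills only the upper triangle and mirrors it, using the symmetry Di,j = Dj,i:

```python
import math

def distance_matrix(X):
    """Given M points (the rows of the data matrix), return the M x M
    symmetric matrix D with D[i][j] = Euclidean d(x_i, x_j)."""
    M = len(X)
    D = [[0.0] * M for _ in range(M)]      # diagonal stays 0: d(x, x) = 0
    for i in range(M):
        for j in range(i + 1, M):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d          # symmetry: D_ij = D_ji
    return D

D = distance_matrix([(0, 0), (3, 4), (0, 4)])
print(D[0][1])   # → 5.0
```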

Page 9
Page 10

We think of a cluster like this…

Page 11

Real data isn't so simple…

Page 12

Many possible definitions of a cluster

Page 13

Many possible definitions of a cluster

• "every point inside is closer to the center of this cluster than the center of any other"
• "no point outside this cluster is closer than ε to any point inside"
• "every point in this cluster is closer to all points inside than any point outside"

Page 14

Different clustering algorithms

• Partitioning – keep adjusting clusters until convergence – e.g. k-means
• Agglomerative hierarchical – start with leaves, repeatedly merge clusters – e.g. MIN and MAX approaches
• Divisive hierarchical – start with root, repeatedly split clusters – e.g. binary split

Page 15

K-means demo

http://www.paused21.net/off/kmeans/bin/
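The partitioning idea behind the demo can be sketched in a few lines of pure Python. This is illustrative only; the naive seeding from the first k points is my simplification (real implementations seed randomly or with k-means++):

```python
def kmeans(points, k, iters=20):
    """Partitioning: repeatedly assign each point to its nearest center,
    then move each center to the mean of its cluster."""
    centers = [list(p) for p in points[:k]]   # naive seeding: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: join the nearest center (squared distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its members
        for c, members in enumerate(clusters):
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

pts = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
print(kmeans(pts, 2))   # two well-separated blobs come back as two clusters
```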

Page 16

Agglomerative – combining clusters

put each item into a leaf node
while num clusters > 1:
    find the two closest clusters
    merge them
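The loop above can be sketched directly in Python. This version (my illustration) uses the single-link / "min" inter-cluster distance from the next slide and stops at a requested number of clusters rather than at one:

```python
import math

def single_link_cluster(points, num_clusters):
    """Agglomerative clustering: start with one leaf cluster per item,
    then repeatedly merge the two clusters whose closest members
    (single link / "min") are nearest, until num_clusters remain."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = [[p] for p in points]              # each item in a leaf node
    while len(clusters) > num_clusters:
        best = None                               # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))       # merge the two closest clusters
    return clusters
```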

 

Page 17

single link or "min", complete link or "max", average

Page 18
Page 19

UK House of Lords voting clusters

Algorithm instructed to separate MPs into five clusters. Output:

1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
…

Page 20

Voting clusters with parties

LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
…

Page 21

Clustering algorithm

Input: data points (feature vectors). Output: a set of clusters, each of which is a set of points.

Visualization

Input: data points (feature vectors). Output: a picture of the points.

Page 22

Dimensionality reduction

Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower-dimensional points

y ∈ R^K, with K ≪ N. Probably K = 2 or K = 3.

Page 23

This is called "projection"

Projection from 3 to 2 dimensions

Page 24

Linear projections

Projects in a straight line to the closest point on the "screen." Mathematically,

y = Px

where P is a K × N matrix.

Projection from 2 to 1 dimensions
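A quick sketch of y = Px (my example, not from the slides): the simplest projection matrix keeps the first two coordinates and drops the rest, projecting 3D points straight onto the x-y "screen".

```python
def project(P, x):
    """Linear projection y = P x: P is a K x N matrix (a list of K rows),
    x is a length-N point; the result y has K components."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in P]

# Keep the first two coordinates, throw out the third
P = [[1, 0, 0],
     [0, 1, 0]]
print(project(P, [2, 5, 9]))   # → [2, 5]
```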

Page 25

Think of this as rotating to align the "screen" with the coordinate axes, then simply throwing out the values of the higher dimensions.

Projection from 3 to 2 dimensions

Page 26

Which direction should we look from? Principal components analysis: find the linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for projection.
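As an illustrative sketch (my addition, using power iteration in place of a full eigendecomposition): the top eigenvector of the covariance matrix, the direction of greatest variance, can be found by repeatedly applying the matrix and renormalizing.

```python
def first_principal_component(X, iters=100):
    """Direction of greatest variance: the top eigenvector of the
    covariance matrix, found here by power iteration."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    centered = [[v - m for v, m in zip(row, means)] for row in X]
    # covariance matrix C[i][j] over the centered data
    C = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
         for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]                 # renormalize each step
    return v
```

For points scattered along the line y = x, the result is the diagonal direction (1/√2, 1/√2).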

Page 27

Sometimes overlap is unavoidable

Page 28

Real data isn't so simple…

Page 29

Nonlinear projections

Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f() that is not linear. So it may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Page 30

Multidimensional scaling

Idea: try to preserve the distances between points "as much as possible."

If we have the distances between all points in a distance matrix,

Di,j = |xi − xj| for all i, j

we can recover the original coordinates {xi} exactly (up to rigid transformations). It's like working out the map of a country when you know how far each city is from every other.

Page 31

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Page 32

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M × M, where M is the number of points). The MDS formula therefore (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.

Page 33

MDS stress minimization

The formula actually minimizes "stress." Think of "springs" between every pair of points, where the spring between xi and xj has rest length di,j:

stress(x) = Σi,j ( |xi − xj| − di,j )²

Stress is zero if all the high-dimensional distances are matched exactly in the low dimension.
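Computing the stress of a candidate low-dimensional layout Y against the original distance matrix D is a direct translation of the formula (a sketch of my own; here each unordered pair is counted once rather than twice):

```python
def stress(Y, D):
    """Sum over point pairs of (low-dim distance minus original d_ij)^2.
    Zero exactly when every original distance is reproduced."""
    total = 0.0
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            d_low = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])) ** 0.5
            total += (d_low - D[i][j]) ** 2
    return total

# A layout that reproduces D exactly has zero stress
D = [[0.0, 5.0], [5.0, 0.0]]
print(stress([(0, 0), (3, 4)], D))   # → 0.0
```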

Page 34

Multi-dimensional scaling

Like "flattening" a stretchy structure into 2D, so that the distances between points are preserved (as much as possible).

Page 35

House of Lords MDS plot

Page 36

Robustness of results

Regarding these analyses of congressional voting, we could still ask:

• Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
• Are our underlying assumptions correct? (Do representatives really have "ideal points" in a preference space?)
• What are we trying to argue? What will be the effect of pointing out this result?

Page 37

Why do clusters have meaning?

What is the connection between mathematical and semantic properties?

Page 38

No unique "right" clustering

Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn't understand your context.

Page 39

Different libraries, different categories

Page 40