Hierarchical Clustering in Python and Beyond
TRANSCRIPT
Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
Hierarchical Clustering
Theory Practice Visualisation
Origins & definitions
Methods & considerations
Hierarchical theory
Metrics & performance
My use case
Python libraries
Example
Static
Interactive
Further ideas
Who am I?
All opinions expressed are my own
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity.
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
Origins
1930s: Anthropology & Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
Two main purposes
• Exploratory analysis – a standalone tool (data mining)
• As a component of a supervised learning pipeline, in which distinct classifiers or regression models are trained for each cluster (machine learning)
Clustering considerations
• Partitioning criteria (single / multi-level)
• Separation (exclusive / non-exclusive)
• Clustering space (full-space / sub-space)
• Similarity measure (distance / connectivity)
Use case: search keywords
[Diagram: your ranking domain (RD) with your pages (P) and their keywords (KW), alongside competing domains (CD) and competitors' pages (CP); the overlapping keywords are labelled "The competition!" and the keywords you don't yet rank for are labelled "Opportunity!"]
CD = Competing domains, CP = Competitor's pages
RD = Ranking domain, P = Your page, KW = Keyword
…x 100,000!!
Use case: search keywords
…so we have found 100,000 new KWs – now what?
How do we summarise and present these to a client?
Clients’ questions…
• Do search categories in general align with my website structure?
• Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.?
• Which keywords are not relevant?
Website-like structure
Requirements
• Need: visual insights; structure
• Allow targeting of the problem in hand
• May develop into a semi-supervised solution
Options for text clustering?
• High-dimensional and sparse data set
• Values correspond to word frequencies
• Recommended methods include: hierarchical clustering, k-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
Hierarchical Clustering: bringing structure
2 types:
• Agglomerative
• Divisive
Deterministic algorithms!
Attribution: Wikipedia
Agglomerative
Start with many "singleton" clusters…
…merge 2 at a time, continuously…
…and build a hierarchy.
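A minimal sketch of this bottom-up process with SciPy (the data is made up, and single linkage is assumed just for illustration):

```python
# A toy agglomerative run: merge the two closest clusters repeatedly.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points in 2-D: two obvious groups of three singletons each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# linkage() builds the full hierarchy bottom-up; each row of Z
# records one merge: (cluster_i, cluster_j, distance, new size).
Z = linkage(X, method='single')

# Cut the hierarchy into a flat clustering with 2 clusters.
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]
```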
Divisive
Start with one huge "macro" cluster…
…iteratively split it into 2 groups…
…and build a hierarchy.
Agglomerative method: Linkage types
• Single link (similarity between the two most similar elements – based on nearest neighbours)
• Complete link (similarity between the two most dissimilar elements)
Attribution: https://www.coursera.org/course/clusteranalysis
Agglomerative method: Linkage types
• Average link (average of similarity between all inter-cluster pairs) – computationally expensive (Na × Nb pairs)
• Trick: centroid link (similarity between the centroids of the two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
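To compare these linkage types side by side, a quick sketch with scipy.cluster.hierarchy (random, illustrative data):

```python
# Same data, different rules for inter-cluster similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 3)

for method in ['single', 'complete', 'average', 'centroid']:
    Z = linkage(X, method=method)
    # Column 2 of Z holds the merge distances.
    print(method, '- first merge at distance', Z[0, 2])
```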
Ward's criterion
• Minimise a function: the total in-cluster variance (sum of squared errors, SSE)
• As defined by, e.g.: SSE = Σ_k Σ_{x ∈ C_k} ||x − m_k||², where m_k is the centroid of cluster C_k
• Once clusters A and B are merged, the SSE will increase (the cluster becomes bigger) by: Δ(A, B) = n_A n_B / (n_A + n_B) · ||m_A − m_B||²
https://en.wikipedia.org/wiki/Ward's_method
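A toy check of that SSE-increase formula in plain NumPy (the two clusters are made up):

```python
import numpy as np

# Two toy clusters, A and B.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[4.0, 0.0], [4.0, 2.0]])

def sse(points):
    """Sum of squared distances to the cluster centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# SSE increase measured directly from the merge...
delta_direct = sse(np.vstack([A, B])) - sse(A) - sse(B)

# ...matches n_A * n_B / (n_A + n_B) * ||m_A - m_B||^2.
nA, nB = len(A), len(B)
mA, mB = A.mean(axis=0), B.mean(axis=0)
delta_formula = nA * nB / (nA + nB) * ((mA - mB) ** 2).sum()

print(delta_direct, delta_formula)  # 16.0 16.0
```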
Divisive clustering
• Top-down approach
• Criterion to split: Ward's criterion
• Handling noise: use a threshold to determine the termination criterion
Attribution: https://www.coursera.org/course/clusteranalysis
Similarity measures
This will certainly influence the shape of the clusters!
• Numerical: use a variation of the Minkowski distance (e.g. city block / Manhattan, Euclidean)
• Binary: Manhattan, Jaccard coefficient, Hamming
• Text: cosine similarity
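All of these are available in scipy.spatial.distance; a short illustration on two small binary vectors:

```python
from scipy.spatial import distance

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]

print(distance.cityblock(a, b))  # Manhattan / city block: 2
print(distance.euclidean(a, b))  # ~1.414
print(distance.jaccard(a, b))    # binary dissimilarity: 0.5
print(distance.hamming(a, b))    # fraction of differing positions: 0.4
print(distance.cosine(a, b))     # 1 - cosine similarity: ~0.333
```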
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word / topic / phrase)
If d1 and d2 are two term vectors, we can calculate the similarity between them as: cos(d1, d2) = (d1 · d2) / (||d1|| · ||d2||)
Attribution: https://www.coursera.org/course/clusteranalysis
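For example, on two tiny (made-up) term-frequency vectors:

```python
import numpy as np

# Term-frequency vectors for two documents (toy counts).
d1 = np.array([3, 0, 1, 2])
d2 = np.array([1, 1, 0, 2])

# cos(d1, d2) = d1 . d2 / (|d1| * |d2|)
cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)  # ~0.76
```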
Text clustering: preparations
Gather word documents = keyword phrases
Aggregate search words with URL "words"
• Add features where possible
  o I added URL words to my word set
• Stem words
  o Choose the right stemmer – too severe can be bad
• Stop words
  o NLTK tokeniser
  o Scikit-learn TF-IDF tokeniser
• Low-frequency cut-off
  o 2 => drop words appearing less than twice in the whole corpus
• High-frequency cut-off
  o 0.5 => drop words that appear in more than 50% of documents
• N-grams
  o Single words, bi-grams, tri-grams
• Beware of foreign languages
  o Separate datasets if possible
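Most of these knobs map directly onto scikit-learn's TfidfVectorizer; a sketch of the preparation step (the corpus is made up, and stemming would need a custom tokeniser, e.g. one of NLTK's stemmers):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keyword phrases standing in for the real corpus.
docs = ["cheap flights london", "flights to london cheap",
        "hotel deals paris", "paris hotel offers"]

vectorizer = TfidfVectorizer(
    stop_words='english',  # drop English stop words
    min_df=2,              # low-frequency cut-off
    max_df=0.5,            # high-frequency cut-off (50% of documents)
    ngram_range=(1, 3),    # single words, bi-grams, tri-grams
)
X = vectorizer.fit_transform(docs)  # sparse TF-IDF matrix, mostly zeros
print(X.shape)
```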
Text preparation
Dimensionality
• Get a sparse matrix
  o Mostly zeros
• Reduce the number of dimensions
  o PCA
  o Spectral clustering
• The "curse" of dimensionality
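On the resulting sparse matrix, scikit-learn's TruncatedSVD is one way to reduce dimensions without densifying the data first (a sketch, reusing the X from the vectoriser above):

```python
from sklearn.decomposition import TruncatedSVD

# Project the sparse TF-IDF matrix X down to a small dense space.
svd = TruncatedSVD(n_components=2)  # 2 for plotting; often ~100 in practice
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)
```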
Results: reduced dimensions
The dendrogram
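The dendrogram comes almost for free from SciPy's linkage output (a sketch, assuming the X_reduced matrix from the previous step):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Build the hierarchy on the reduced data and draw the merge tree.
Z = linkage(X_reduced, method='ward')
dendrogram(Z)
plt.show()
```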
Assess the quality of your clusters
• External (need ground-truth labels): purity, completeness & homogeneity, Adjusted Rand index, Normalised Mutual Information
• Internal (no labels needed): e.g. silhouette coefficient
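The external measures ship with scikit-learn; they all compare cluster assignments against ground-truth labels (the labels below are hypothetical):

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1, 2, 2]  # hypothetical ground truth
labels_pred = [0, 0, 1, 2, 2, 2]  # hypothetical cluster assignment

print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.normalized_mutual_info_score(labels_true, labels_pred))
```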
Topic labelling
Hierarchical Clustering: Beyond Python (!?)
Life on the inside: Elasticsearch
• Why not perform pre-processing and clustering inside Elasticsearch?
• Document store
• TF-IDF and more
• Stop words
• Language-specific analysers
Elasticsearch – try it!
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
Document storage in ES
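A sketch of storing one keyword document with the official elasticsearch-py client (the index, type and field names are made up; the doc_type argument matches the 1.x/2.x-era API current at the time of this talk):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node on localhost:9200

# Hypothetical keyword document.
doc = {
    'keyword': 'cheap flights london',
    'search_volume': 12000,
    'url': 'http://example.com/flights/london',
}
es.index(index='keywords', doc_type='keyword', id=1, body=doc)

# Retrieve it back.
print(es.get(index='keywords', doc_type='keyword', id=1)['_source'])
```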
Lingo 3G algorithm
• Lingo 3G: hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
Elasticsearch with clustering – Utopia?
Carrot2's Lingo3G in action: http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
Limitations of hierarchical clustering
• Can't undo what's done: a divisive method works on sub-clusters and cannot re-merge them; likewise, an agglomerative method will never split a cluster once it has been merged
• Every split or merge is final and cannot be refined later
• Methods may not scale well: checking all possible pairs of clusters drives the complexity up
There are extensions: BIRCH, CURE and CHAMELEON
Thank you!
A decent introductory course to clustering: https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent(ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / [email protected]
Attribution: http://wynway.com/
Extra slide: Why work inside the database?
1. Sharing data (management of): support concurrent access by multiple readers and writers
2. Data model enforcement: make sure all applications see clean, organised data
3. Scale: work with datasets too large to fit in memory (over a certain size, you need specialised algorithms to deal with the data -> a bottleneck); the database organises and exposes algorithms for you conveniently
4. Flexibility: use the data in new, unanticipated ways -> anticipate a broad set of ways of accessing the data