Page 1

Using geWorkbench: Hierarchical & SOM Clustering

Fan Lin, Ph.D.

Molecular Analysis Tools Knowledge Center

Columbia University and The Broad Institute of MIT and Harvard

Page 2

Overview

geWorkbench offers various clustering tools to help researchers discover gene patterns, reduce data dimensions, and visualize gene clusters in microarray experiments.

In this presentation, we will demonstrate how to analyze gene expression data with:
Hierarchical Clustering
SOM Clustering

Page 3

Clustering Calculation

Hierarchical and SOM Clustering can be run either on the local machine on which geWorkbench is installed or as services on caGrid.

In this demo, both clustering algorithms will be run on the local machine within geWorkbench. For more information on how to run them as caGrid services, please refer to this geWorkbench Tutorial:

http://wiki.c2b2.columbia.edu/workbench/index.php/Tutorial_-_Grid_Services

Page 4

Hierarchical Clustering 1. Overview

Hierarchical clustering is a method:
to group markers and/or arrays together based on the similarity of their expression profiles
to organize the resulting groups in a tree-like hierarchical structure
to elucidate the strength of their correlations.
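The grouping-into-a-tree idea above can be illustrated with a minimal sketch. It is not geWorkbench's implementation: the marker names and expression values are hypothetical, and Euclidean distance with single linkage is assumed purely for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(dist(x, y) for x in a["members"] for y in b["members"])


def build_tree(profiles):
    """Agglomerative clustering; returns a nested tuple of marker names.

    profiles: dict mapping a marker name to its expression vector.
    """
    clusters = [{"tree": name, "members": [vec]} for name, vec in profiles.items()]
    while len(clusters) > 1:
        # Find the two closest clusters and merge them into one tree node.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = {
            "tree": (clusters[i]["tree"], clusters[j]["tree"]),
            "members": clusters[i]["members"] + clusters[j]["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]["tree"]


# Hypothetical expression profiles (marker -> values across three arrays).
profiles = {
    "m1": (1.0, 1.1, 0.9),
    "m2": (1.0, 1.2, 1.0),
    "m3": (5.0, 5.1, 4.8),
    "m4": (5.2, 5.0, 4.9),
}
tree = build_tree(profiles)
print(tree)
```

Markers with similar profiles end up as siblings low in the tree, and the last merge joins the two most dissimilar groups at the root.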

Page 5

Hierarchical Clustering 2. Memory Requirement

The hierarchical clustering calculation is memory intensive. Clustering more than about 2000 markers requires adjusting the default memory settings.

For more information on how to increase the default memory settings, please refer to the geWorkbench FAQ titled: “How do I increase the amount of memory available to Java to run geWorkbench?”

(https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/GeWorkbench003)
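One way to see why memory use grows quickly with the number of markers is to count pairwise distances. The sketch below assumes that one 8-byte value is kept per pair of markers; that is an assumption made for illustration, not a description of how geWorkbench stores its data, and it ignores all other overhead.

```python
def distance_matrix_bytes(n_items, bytes_per_value=8):
    """Bytes needed to hold every pairwise distance once (upper triangle)."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_value


# The number of pairs grows roughly with the square of the marker count.
for n in (1000, 2000, 4000, 8000):
    print(n, "markers:", round(distance_matrix_bytes(n) / 2**20, 1), "MiB")
```

Doubling the number of markers roughly quadruples the number of pairs, which is why large marker sets call for a larger Java memory allocation.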

Page 6

Hierarchical Clustering 3. Data Requirement

A microarray dataset must be loaded in the Project Folders component.

The dataset must not contain missing values: an error message will be returned if Hierarchical Clustering is run on a dataset containing any missing values. Missing values can be filtered out or replaced using the Missing Value Filter (see figure below).

An annotation file corresponding to the array data is optional. If an annotation file is loaded, gene names will be used in the results display; otherwise probeset names will be used.
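The two ways of handling missing values mentioned above (filtering out or replacing) can be sketched outside geWorkbench as follows. The function names and data are hypothetical, and replacing a missing value with the marker's mean is only one possible rule, assumed here for illustration; it is not taken from the Missing Value Filter itself.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_markers_with_missing(data):
    """Keep only markers whose value is present on every array."""
    return {m: row for m, row in data.items() if not any(is_missing(v) for v in row)}


def fill_missing_with_row_mean(data):
    """Replace each missing value with the mean of that marker's present values."""
    filled = {}
    for marker, row in data.items():
        present = [v for v in row if not is_missing(v)]
        mean = sum(present) / len(present) if present else 0.0
        filled[marker] = [mean if is_missing(v) else v for v in row]
    return filled


# Hypothetical dataset: marker -> expression values across three arrays.
data = {"m1": [1.0, 2.0, 3.0], "m2": [4.0, None, 6.0]}
filtered = drop_markers_with_missing(data)
filled = fill_missing_with_row_mean(data)
print(filtered)
print(filled)
```

Filtering shrinks the marker list, while replacement keeps every marker at the cost of introducing estimated values.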

Page 7

Hierarchical Clustering 4. Sample Data for Demo

The microarray data file and the annotation file used in this presentation are:
webmatrix2_quantile_log2_dev1.2_mv0.exp
70_TFs_from_HG-U95Av2.na28.csv

Both files can be found in the tutorial data package. For more details, please refer to:

(http://wiki.c2b2.columbia.edu/workbench/index.php/Tutorial_-_Data)

Predefined data sets were used in this presentation:
Markers: A smaller “Marker Test Set” (left arrow) was created with 1018 markers randomly selected from the large marker pool. It can be downloaded from: https://cabig-kc.nci.nih.gov/Molecular/uploaded_files/b/bd/Marker_Test_Set.zip
Arrays/Phenotypes: 4 predefined array sets are selected (right arrow).

Page 8

Hierarchical Clustering 5. More on Sample Data for Demo

geWorkbench allows an analysis to be restricted to a subset of the input dataset:

The user can activate one or more sets of markers or arrays (Arrow 1). In that case the analysis will use the subset of the array data comprising the designated markers and arrays.

When marker or array sets are defined and activated, the user may select the "All Arrays" and/or the "All Markers" check-boxes to override any set selections (Arrow 2).

All Arrays: Use all arrays in the dataset.
All Markers: Use all markers in the dataset.
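The selection logic described on this slide can be sketched as a small function. The names are hypothetical, and the sketch assumes that when no set is activated the whole dataset is used.

```python
def select_submatrix(values, marker_names, array_names,
                     active_markers=None, active_arrays=None,
                     all_markers=False, all_arrays=False):
    """Return (markers, arrays, values) restricted to the activated sets.

    values[i][j] is the expression of marker i on array j.
    A missing or empty active set, or a checked "all" box, means no restriction.
    """
    if all_markers or not active_markers:
        rows = list(range(len(marker_names)))
    else:
        rows = [i for i, m in enumerate(marker_names) if m in active_markers]
    if all_arrays or not active_arrays:
        cols = list(range(len(array_names)))
    else:
        cols = [j for j, a in enumerate(array_names) if a in active_arrays]
    sub = [[values[i][j] for j in cols] for i in rows]
    return [marker_names[i] for i in rows], [array_names[j] for j in cols], sub


# Hypothetical 3-marker x 2-array dataset.
markers = ["m1", "m2", "m3"]
arrays = ["a1", "a2"]
values = [[1, 2], [3, 4], [5, 6]]

subset = select_submatrix(values, markers, arrays,
                          active_markers={"m1", "m3"}, active_arrays={"a2"})
override = select_submatrix(values, markers, arrays,
                            active_markers={"m1", "m3"}, active_arrays={"a2"},
                            all_markers=True)
print(subset)
print(override)
```

The second call mimics checking "All Markers": the activated marker set is ignored while the array restriction still applies.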

Page 9

Hierarchical Clustering 6. Setting Up Parameters

Three sets of parameters are required for the clustering calculation:

Clustering Method
This determines how cluster-to-cluster distances are computed when constructing the hierarchical tree. Options are:
Single Linkage: The distance between each member of one cluster and each member of the other cluster is computed, and the minimum of all these distances is used as the cluster-to-cluster distance.
Average Linkage: As in Single Linkage, but uses the average distance.
Total Linkage: As in Single Linkage, but uses the maximum distance.

Clustering Dimension
This indicates whether to cluster markers, arrays, or both. The calculation will apply to the marker/array set(s) defined in the previous slides (Slides 7 and 8). Select one of:
Marker
Array
Both

Clustering Metric
The values being clustered, whether markers or microarrays, can each be represented by vectors of numbers. Clustering Metric gives the choice of methods by which to calculate the distance between any two vectors:
Euclidean: Geometric distance.
Pearson's: Pearson’s correlation.
Spearman's: Spearman’s rank correlation.
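The clustering method and metric options on this slide can be sketched in a few lines. This is an illustration under assumptions, not geWorkbench's code: correlations are converted to distances as 1 - r (an assumed convention), ties are not averaged when ranking, and the function names are hypothetical. "Total Linkage" corresponds to what is often called complete linkage elsewhere.

```python
from math import dist, sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient (undefined for constant vectors)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Rank positions (1 = smallest); ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Metric: distance between two vectors. The "1 - r" conversion is an assumption.
METRICS = {
    "euclidean": dist,
    "pearson": lambda x, y: 1 - pearson(x, y),
    "spearman": lambda x, y: 1 - pearson(ranks(x), ranks(y)),
}

# Method: how the member-to-member distances are summarized.
LINKAGES = {"single": min, "average": mean, "total": max}


def cluster_distance(a, b, method="single", metric="euclidean"):
    """Cluster-to-cluster distance from all member-to-member distances."""
    pairwise = [METRICS[metric](x, y) for x in a for y in b]
    return LINKAGES[method](pairwise)


# Two hypothetical clusters of 2-dimensional vectors.
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
for method in ("single", "average", "total"):
    print(method, cluster_distance(a, b, method))
```

For the same pair of clusters, the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the total-linkage distance.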

Page 10

Hierarchical Clustering 7. Running the calculation

Now we are ready to run the calculation!

Select parameters as shown below, and click on “Analyze” to start the calculation (Arrow 1).

A progress bar will be visible during the calculation:
First it displays a message about computing distances (Arrow 2).
Then it displays a message about clustering (Arrow 3).

Page 11

Hierarchical Clustering 8. Clustering with different datasets

We got the result!

Results from the hierarchical clustering analysis are displayed in the Dendrogram component (picture below left). Rows represent markers and columns represent arrays.

A node representing the results is added to the Project Folders component and can be renamed (picture below right).

Page 12

Hierarchical Clustering 9. Adjust Clustering Display

The user can adjust the Dendrogram display with the following controls:
Gene height: Height in pixels of each cell.
Gene width: Width in pixels of each cell.
Intensity slider: Adjusts the intensity range of the red-blue color scale.

Page 13

Hierarchical Clustering 10. Clustering with different datasets

In Slide 9, we mentioned that there are 3 options for the clustering dimension. The pictures below illustrate the different results obtained by clustering:
markers only (left picture)
arrays only (middle picture)
both (right picture)

[Figure panels: Clustering by Markers, Clustering by Arrays, Clustering by Arrays & Markers]

Page 14

Hierarchical Clustering 11. Information on Individual Data

Selecting the bulb icon (arrow) activates the tool-tip feature on the Dendrogram display. Mousing over a Dendrogram cell then displays the following information about the expression measurement associated with that cell:
Chip: The array name (represented by the cell column).
Marker: The marker (probeset) name (represented by the cell row).
Signal: The expression value of the marker on the array.

Page 15

Hierarchical Clustering 12. Enable Selection

A very useful feature in Hierarchical Clustering is “Enable Selection”:

Checking this box allows a subtree of the Dendrogram to be selected interactively using the mouse.

The user can easily restore the display back to the original result by de-selecting the “Enable Selection” box.

Page 16

Hierarchical Clustering 13. Display a Subset of the Clustering

Select an area of interest. The selected area is highlighted in blue. Right-clicking on the selected area will restrict the display to just the selected portion of the tree.

The pictures below depict how to selectively display a subset of the tree after “Enable Selection” is activated:

1. Select markers by highlighting the left portion of the tree and right-clicking.

2. Select arrays by highlighting the top portion of the tree and right-clicking.

3. A small portion is now sliced out of the original tree.

Page 17

Hierarchical Clustering 14. Saving the Clustering Results

Right-clicking on the Dendrogram (Arrow 1) will show a menu with two entries:
Image Snapshot: Places a static image snapshot of the tree (as currently displayed) into the Project Folders component (Arrow 2).
Add to Set: Adds the markers and arrays represented in the currently displayed tree to a new set, called "Cluster Tree", in the Markers component (Arrow 3) and the Arrays component (Arrow 4). This is most useful if done after a subtree of markers has been selected.

Page 18

SOM Clustering 1. Overview

Self Organizing Map (SOM) is an algorithm to perform clustering of real vectors defined on an instance space of high dimensionality.

The clusters found are described by prototypical instances, referred to as neurons of the SOM, which are arranged topologically in the form of a one- or two-dimensional grid, the Self Organizing Map.

Page 19

SOM Clustering 2. Setup SOM Cluster Component

As of geWorkbench release 1.7, SOM Clustering is no longer a default component in the Analysis Panel.

To activate the SOM Clustering component, follow the steps below:

1. Select “Tool” from Menu Bar

2. Select “Component Configuration”

In “Component Configuration Manager”:

1. Select SOM Analysis and SOM Clusters, the two required components

2. Click on “Apply”

SOM Analysis is now in the Analysis Panel.

Page 20

SOM Clustering 3. SOM Analysis Sample Data & Parameters

Sample Data: The same set of data used in Hierarchical Clustering (Slide 7) will be used to demo SOM analysis.

Default parameter settings are used in this demo. The parameters are:
Number of Rows & Columns: The rows and columns in the target 2-dimensional “neuron” grid.
Radius: The extent of the neighborhood used in the Bubble function.
Iterations: The number of times that the dataset will be presented to the Map.
Alpha: The value used to scale the change of individual SOM vectors when a new expression vector is associated with a node.
Function: The neighborhood function, i.e. the convention (formula) used to update (adapt) an SOM vector once an expression vector has been added into a node's neighborhood. The options are Bubble and Gaussian.

To learn more about the SOM clustering algorithm, please refer to: http://en.wikipedia.org/wiki/Self-organizing_map

Click on “Analyze” to start SOM Analysis.
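To make the roles of these parameters concrete, here is a minimal SOM sketch in plain Python. It is a textbook-style illustration, not geWorkbench's implementation: the function names, default values, and data are hypothetical, prototypes are initialized from random data vectors, and alpha and radius are held constant (many SOM implementations shrink them over time).

```python
import math
import random


def train_som(data, rows=3, cols=3, radius=1.0, iterations=100,
              alpha=0.1, function="bubble", seed=0):
    """Train a rows x cols grid of prototype vectors ("neurons")."""
    rng = random.Random(seed)
    # One prototype per grid cell, started at a randomly chosen data vector.
    grid = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for _ in range(iterations):
        for vec in data:  # each iteration presents the whole dataset once
            # Best-matching node: the prototype closest to this vector.
            bmu = min(grid, key=lambda node: math.dist(grid[node], vec))
            for node, proto in grid.items():
                d = math.dist(node, bmu)  # distance measured on the grid
                if function == "bubble":
                    h = 1.0 if d <= radius else 0.0
                else:  # "gaussian"
                    h = math.exp(-(d * d) / (2 * radius * radius))
                if h:
                    # Move the prototype toward the vector, scaled by alpha.
                    for k in range(len(vec)):
                        proto[k] += alpha * h * (vec[k] - proto[k])
    return grid


def assign_clusters(data, grid):
    """Map each grid node to the indices of the vectors it matches best."""
    clusters = {node: [] for node in grid}
    for index, vec in enumerate(data):
        bmu = min(grid, key=lambda node: math.dist(grid[node], vec))
        clusters[bmu].append(index)
    return clusters


# Hypothetical expression vectors forming three loose groups.
data = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.9, 0.2], [10.0, 0.0]]
som = train_som(data)
clusters = assign_clusters(data, som)
for node, members in sorted(clusters.items()):
    print(node, members)
```

A 3 x 3 grid always yields nine clusters, some of which may be empty. Switching `function` to `"gaussian"` updates every node with a weight that falls off smoothly with grid distance instead of a hard cutoff at `radius`.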

Page 21

SOM Clustering 4. SOM Clusters Viewer

Running the example described in the previous slide, with 3 rows and 3 columns, produces the nine clusters (3x3 grid) shown below in the Dendrogram component of geWorkbench.

Page 22

SOM Clustering 5. Show Selected

If the "Show selected" box is checked, the user may then click on any of the clusters to zoom in on it.

Unchecking the "Show selected" box will return to the original display of all clusters.

Page 23

SOM Clustering 6. Customizing a Cluster Display

Right-clicking on a cluster produces the menu shown below (figure at left):

Properties: The "Properties" item allows the title, scale, axis labels and other aspects of the cluster graphs to be customized.
Zoom In/Zoom Out: Zoom in or zoom out on the graph of a particular cluster (figure at right).
Auto Range: Return the cluster to its original display size (fit to display area).
Image Snapshot: Add a snapshot of the selected cluster to the Project Folders component.
Add to Set: Add the markers in the selected cluster to a new set in the Markers component.

The customization can be done in the single cluster view or in the view with multiple bins.

Page 24

SOM Clustering 7. Label a Cluster Display (1)

The "Properties" item allows the title, scale, axis labels and other aspects of the cluster graphs to be customized.

By checking “Show Title” in “Chart Properties”, the user can add a title to the cluster display.

Page 25

SOM Clustering 7. Label a Cluster Display (2)

The labels are shown in the chart (arrows).

Page 26

Need More Information?

NCI is developing an extensive knowledge base to support various NCI molecular analysis tools. Visit us at NCI’s Molecular Analysis Tools Knowledge Center at: https://cabig-kc.nci.nih.gov/MediaWiki/index.php/Main_Page

For more information on how to use geWorkbench, please visit the geWorkbench section of the NCI Knowledge Center at: https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/GeWorkbench

Have a geWorkbench-related question? Find the answers in the geWorkbench FAQ section at: https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/GeWorkbench_FAQ

Need more help? Post your question in the geWorkbench Forum at: https://cabig-kc.nci.nih.gov/Molecular/forums/viewforum.php?f=3