visualizing microarray dataweb.math.ku.dk/~richard/courses/bioconductor2009/...1 elements of data...

Visualizing microarray data

Laurent Gautier

August 18th, 2009

Loading and installing more packages

> library("<package>")
> # if no such package
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("<package>")
> library("<package>")


1 Elements of data visualization

1.1 Shapes and screen resolution

Screen resolution

• only one dot is displayed per pixel

• there is a relatively small number of different geometric shapes a viewer can distinguish on a plot

Overplotting: plain plot

~300,000 data points are represented on a 500 × 500 image.

Overplotting: alpha blending

Sampling

30,000 points are sampled out of the 300,000.

Binning

[Figure: binned plot of m[, 2] against m[, 1], shaded by counts per bin (legend "Counts", from 1 to 4765)]

Smooth scatter plot

[Figure: smooth scatter plot of m[, 2] against m[, 1]]
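The remedies for overplotting listed above can be sketched in base R. This is a minimal sketch under stated assumptions: the matrix m is simulated here because the data behind the figures is not given, and the alpha value (0.05) and grid size (50 x 50) are arbitrary choices.

```r
# Simulated stand-in for the two-column matrix m of the figures (assumption).
set.seed(42)
n <- 300000
m <- cbind(rnorm(n, mean = 12), rnorm(n, mean = 12))

pdf(NULL)  # draw to a null device; drop this line to see the plots on screen

# 1. Plain plot: one small symbol per point, heavy overplotting.
plot(m[, 1], m[, 2], pch = ".")

# 2. Alpha blending: semi-transparent points make dense regions darker.
plot(m[, 1], m[, 2], pch = ".", col = rgb(0, 0, 0, alpha = 0.05))

# 3. Sampling: draw 30,000 of the 300,000 points.
idx <- sample(nrow(m), 30000)
plot(m[idx, 1], m[idx, 2], pch = ".")

# 4. Binning: count the points falling in each cell of a 50 x 50 grid.
bx <- cut(m[, 1], breaks = 50)
by <- cut(m[, 2], breaks = 50)
counts <- table(bx, by)
image(unclass(counts), col = gray(seq(1, 0, length.out = 32)))

# 5. Smooth scatter plot: a 2D kernel density estimate shown as shading.
smoothScatter(m[, 1], m[, 2])

invisible(dev.off())
```

Sampling is the cheapest option but hides rare points; binning and smoothing keep every point in the summary at the cost of showing counts or densities rather than individual observations.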

1.2 Colors

• If only two colors are used, avoid pairs that color-blind people cannot distinguish (red and green are one notorious example)

• Most people can keep track of only about a dozen different colors (or tone differences) on one plot
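One way to act on both points is to pick colors from a small palette designed for color-blind viewers. The sketch below uses the Okabe-Ito palette that ships with base R (grDevices, R 4.0.0 or later); this palette is a substitution made here for illustration and is not mentioned in the slides. The factor feed is a hypothetical grouping variable.

```r
# Okabe-Ito palette from grDevices (R >= 4.0.0); it was designed to remain
# distinguishable under the common forms of color blindness.
pal <- palette.colors(n = 8, palette = "Okabe-Ito")

# Hypothetical grouping variable with six levels, one color per level.
feed <- factor(c("casein", "horsebean", "linseed",
                 "meatmeal", "soybean", "sunflower"))

# Refuse to recycle colors silently when there are more groups than colors.
stopifnot(nlevels(feed) <= length(pal))

# One color per observation, looked up through the integer level codes.
group_col <- unname(pal[as.integer(feed)])
```

The resulting vector can be passed as the col argument of plot().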

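ColorBrewer’s palettes

[Figure: the ColorBrewer palettes]

Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral

Qualitative: Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3

Sequential: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd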

• People generally compare lengths better than areas

Pie chart vs barplot

[Figure: the same proportions for Blueberry, Cherry, Apple, Boston Cream, Other and Vanilla Cream, drawn as a pie chart and as a barplot (y axis from 0.00 to 0.30)]
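The comparison can be reproduced with a few lines of base R. The proportions below are those of the example on R's ?pie help page, which uses the same six labels; that they are the values behind the figure is an assumption.

```r
# Proportions from the example on the ?pie help page (assumed to match the figure).
pie_sales <- c(Blueberry = 0.12, Cherry = 0.30, Apple = 0.26,
               "Boston Cream" = 0.16, Other = 0.04, "Vanilla Cream" = 0.12)

pdf(NULL)  # null device; drop this line to see the plots on screen
op <- par(mfrow = c(1, 2))

# Angles and areas: Blueberry and Vanilla Cream are hard to tell apart.
pie(pie_sales)

# Lengths on a common axis: equal values are immediately seen as equal.
barplot(pie_sales, las = 2)

par(op)
invisible(dev.off())
```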

1.3 basic R objects

R objects (some of the)

1.4 basic R plots

plot

> x <- rnorm(50)
> plot(x)

[Figure: scatter plot of x against Index (0 to 50)]

Histogram

> hist(x)

[Figure: "Histogram of x"; x axis x, y axis Frequency]

Density estimate

> plot(density(x))

[Figure: "density.default(x = x)"; N = 50, Bandwidth = 0.367; y axis Density]

> y <- 2*x + rnorm(50, sd=0.3)
> plot(x, y)

[Figure: scatter plot of y against x]

> mycolors <- rep("black", length(x))
> mycolors[x < 0 | y < 0] <- "red"
> plot(x, y, col = mycolors)

[Figure: scatter plot of y against x, with the points where x < 0 or y < 0 drawn in red]

1.5 Lattice plots

Formulae

Formulae are used to describe a model, generally for the purpose of fitting or plotting.

y ~ x        Example: weight ~ age

y ~ x + z    Example: weight ~ age + gender

y ~ x | z    Example: weight ~ age | gender
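A formula is an ordinary R object that can be stored and inspected before it is handed to a fitting or plotting function. The short sketch below builds the three example formulae; weight, age and gender are only symbols here and do not need to exist as variables.

```r
# Formulae are created unevaluated: the variables need not exist yet.
f1 <- weight ~ age            # weight as a function of age
f2 <- weight ~ age + gender   # two terms on the right-hand side
f3 <- weight ~ age | gender   # age, conditioned on gender

class(f1)        # the class is "formula"
all.vars(f3)     # names of the variables mentioned in the formula

# A one-sided formula has no response; lattice uses this form for
# univariate displays such as histograms.
f4 <- ~ weight
length(f4)       # 2 components (operator and right-hand side)
length(f1)       # 3 components (operator, left-hand and right-hand side)
```

How the `|` operator is interpreted depends on the function receiving the formula; in lattice it requests one panel per level of the conditioning variable.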

Storing data in data.frame

> data(chickwts)
> head(chickwts)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean

> hist(chickwts$weight)

[Figure: "Histogram of chickwts$weight"; x axis chickwts$weight (100 to 450), y axis Frequency]

Using lattice

> library(lattice)
> p <- histogram(~ weight,
+                data = chickwts)
> print(p)

[Figure: lattice histogram of weight; y axis Percent of Total]

> p <- histogram(~ weight | feed,
+                data = chickwts)
> print(p)

[Figure: lattice histograms of weight (y axis Percent of Total), one panel per feed: casein, horsebean, linseed, meatmeal, soybean, sunflower]

> p <- densityplot(~ weight, groups = feed,
+                  data = chickwts,
+                  auto.key = TRUE)
> print(p)

[Figure: lattice density estimates of weight, one curve per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower); y axis Density]

1.6 ggplot2

Using ggplot2

> library(ggplot2)
> p <- ggplot(chickwts) +
+   aes(x = weight, col = feed) +
+   geom_density()
> print(p)

[Figure: density of weight, one colored curve per feed (casein, horsebean, linseed, meatmeal, soybean, sunflower)]

Density estimates

> p <- ggplot(chickwts) +
+   aes(x = weight) +
+   geom_density() +
+   facet_wrap(~ feed)
> print(p)

[Figure: density of weight, one panel per feed: casein, horsebean, linseed, meatmeal, soybean, sunflower]

Boxplot

> data(sleep)
> p <- ggplot(sleep) +
+   aes(x = factor(group), y = extra) +
+   geom_boxplot()
> print(p)

[Figure: boxplots of extra for factor(group) 1 and 2]

ExpressionSet objects

eset[i, j] subset the matrix with its associated data.frames

> library(golubEsets)
> data(Golub_Merge)
> eset <- Golub_Merge
> p <- ggplot(pData(eset)) +
+   aes(x = PS) +
+   geom_histogram() +
+   facet_wrap(~ Gender, ncol = 3)
> print(p)

[Figure: histograms of PS (y axis count), one panel per Gender: F, M, NA]

> exprs(eset)[exprs(eset) <= 0] <-
+   min(exprs(eset)[exprs(eset) > 0])
> library(limma)
> model <- model.matrix(~ pData(eset)$ALL.AML)
> fit <- lmFit(log2(exprs(eset)), model)
> tfit <- treat(fit, lfc=1)
> tt <- topTreat(tfit, coef=2, number=50)
> head(tt)
                   ID    logFC  AveExpr         t      P.Value    adj.P.Val
2288        M84526_at 9.819500 4.245026 11.641353 2.030486e-18 1.447533e-14
1882        M27891_at 7.449327 7.545176  7.692992 3.117432e-11 1.111209e-07
3252        U46499_at 4.664273 7.019749  7.224403 2.283971e-10 5.427476e-07
760         D88422_at 3.998790 7.925139  6.578039 3.467697e-09 5.671084e-06
1834        M23197_at 2.258431 8.058312  6.545160 3.977475e-09 5.671084e-06
6378 M83667_rna1_s_at 6.876969 4.061439  6.463448 5.590043e-09 6.641903e-06

> plot(exprs(eset)[1, ])

[Figure: exprs(eset)[1, ] against Index (0 to 70)]

Having data in data.frame

data.frame as a workhorse

• Have one variable per column

• Split tell-it-all names into separate variables

• Really, no variable value such as mutant_drugA_231209_grumpy (instead have 4 columns: strain, treatment, experiment_date, experimentalist_mood)
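Splitting such names takes only a few lines of base R. The sketch below is illustrative: the two sample names are made up, and reading the third field as a day-month-year date is an assumption about what 231209 stands for.

```r
# Hypothetical sample names that pack four variables into one string.
ids <- c("mutant_drugA_231209_grumpy", "wildtype_drugB_241209_cheerful")

# One row per name, one column per underscore-separated field.
parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))

samples <- data.frame(
  strain               = parts[, 1],
  treatment            = parts[, 2],
  # Assumption: the field is a date written as ddmmyy.
  experiment_date      = as.Date(parts[, 3], format = "%d%m%y"),
  experimentalist_mood = parts[, 4],
  stringsAsFactors     = FALSE
)
```

With one variable per column, each of the four variables can be used directly for grouping, coloring or faceting.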


Converting a wide data structure into a long one

> library(reshape)
> myids <- tt$ID[1:10]
> dataf <- melt(exprs(eset)[myids, ],
+               varnames = c("symbol", "sample"))
> dataf <- merge(dataf, pData(Golub_Merge),
+                by.x = "sample", by.y = 0)
> str(dataf)
'data.frame':   720 obs. of  14 variables:
 $ sample   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ symbol   : Factor w/ 10 levels "D88422_at","M11722_at",..: 8 5 10 1 4 7 9 3 2 6 ...
 $ value    : num  1 303 44 161 261 ...
 $ Samples  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ALL.AML  : Factor w/ 2 levels "ALL","AML": 1 1 1 1 1 1 1 1 1 1 ...
 $ BM.PB    : Factor w/ 2 levels "BM","PB": 1 1 1 1 1 1 1 1 1 1 ...
 $ T.B.cell : Factor w/ 2 levels "B-cell","T-cell": 1 1 1 1 1 1 1 1 1 1 ...
 $ FAB      : Factor w/ 4 levels "M1","M2","M4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Date     : Factor w/ 27 levels "","1/24/1984",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ Gender   : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ pctBlasts: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Treatment: Factor w/ 2 levels "Failure","Success": NA NA NA NA NA NA NA NA NA NA ...
 $ PS       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Source   : Factor w/ 4 levels "CALGB","CCG",..: 3 3 3 3 3 3 3 3 3 3 ...
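The same wide-to-long conversion can be illustrated on a toy matrix without any add-on package. This is a base R sketch of the idea, not the reshape-based code above: the matrix m and the annotation table pheno are made up, and as.data.frame(as.table(...)) is used in place of melt().

```r
# Made-up expression matrix: 3 probe sets (rows) x 2 samples (columns).
m <- matrix(c(10, 20, 30, 11, 21, 31), nrow = 3,
            dimnames = list(symbol = c("g1", "g2", "g3"),
                            sample = c("s1", "s2")))

# Wide to long: one row per (symbol, sample) pair, values in column "value".
long <- as.data.frame(as.table(m), responseName = "value")

# Hypothetical per-sample annotation, joined on the sample identifier.
pheno <- data.frame(sample = c("s1", "s2"), ALL.AML = c("ALL", "AML"))
long <- merge(long, pheno, by = "sample")

str(long)
```

In the long form every row is one measurement with all of its annotation attached, which is the layout that lattice and ggplot2 expect.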

ggplot2

> library(ggplot2)

New school plots

> p <- ggplot(dataf) +
+   aes(x = ALL.AML, y = log2(value)) +
+   geom_point() +
+   facet_wrap(~ symbol)
> print(p)

[Figure: log2(value) against ALL.AML (ALL, AML), one panel per probe set: D88422_at, M27891_at, M89957_at, M11722_at, M29474_at, U46499_at, M19507_at, M83667_rna1_s_at, M23197_at, M84526_at]

> p <- ggplot(dataf) +
+   aes(x = ALL.AML, y = log2(value), col = Source) +
+   geom_point() +
+   facet_wrap(~ symbol)
> print(p)

[Figure: log2(value) against ALL.AML (ALL, AML), one panel per probe set, points colored by Source: CALGB, CCG, DFCI, St-Jude]

Anatomy of a heatmap

• central matrix of array values

• hierarchical clustering on row values

• hierarchical clustering on column values
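The three components can be assembled by hand to see what a heatmap function does internally. This is a simplified sketch on a small random matrix (an assumption made for illustration); Euclidean distance and the default linkage of hclust() are used.

```r
# Small random matrix standing in for array values (assumption).
set.seed(1)
m <- matrix(rnorm(60), nrow = 10,
            dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))

row_hc <- hclust(dist(m))       # hierarchical clustering on row values
col_hc <- hclust(dist(t(m)))    # hierarchical clustering on column values

# Central matrix, rows and columns reordered to follow the dendrogram leaves.
m_ordered <- m[row_hc$order, col_hc$order]

pdf(NULL)  # null device; drop this line to see the plot on screen
# image() draws the rows of its input along the x axis, hence the transpose.
image(t(m_ordered), axes = FALSE)
invisible(dev.off())
```

The heatmap() function of package stats goes further than this sketch: by default it also reorders the dendrogram branches by row and column means and scales each row before coloring.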

Heatmap: iconic graphics for expression data

• A heatmap is not mandatory for each and every analysis of microarray data.

• Not every pattern in a heatmap is gold.

• Staring at heatmap diagrams can cause serious apophenia.

• Assessing the validity of clusters by means other than visual inspection is a good idea.

• Yet heatmaps can be useful.

?heatmap

About distance measures

• Dissimilarities are a generalization of distances

• They measure how dissimilar entities are

• The outcome of hierarchical clustering depends on the dissimilarity measure

Two dissimilarity measures, two outcomes

[Figure: three profiles a, b and c (y against x), followed by two dendrograms over a, b and c obtained with two different dissimilarity measures]

> library(bioDist)
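A three-profile toy example shows how the choice of measure changes the tree. The sketch below uses only base R; the profiles are made up, and the two measures compared (Euclidean distance and one minus the Pearson correlation) are chosen here for illustration and are not necessarily those behind the figure.

```r
# Made-up profiles measured at four points.
profiles <- rbind(a = c(1, 2, 3, 4),
                  b = c(6, 7, 8, 9),   # same shape as a, shifted upwards
                  c = c(4, 3, 2, 1))   # values close to a, opposite trend

d_euclid <- dist(profiles)                  # Euclidean distance
d_cor    <- as.dist(1 - cor(t(profiles)))   # correlation-based dissimilarity

hc_euclid <- hclust(d_euclid)
hc_cor    <- hclust(d_cor)

# First merge of each tree; negative entries are row indices of single profiles.
hc_euclid$merge[1, ]   # a and c are joined first
hc_cor$merge[1, ]      # a and b are joined first
```

Euclidean distance groups profiles with similar values, while the correlation-based measure groups profiles with similar shape, so neither tree is wrong: they answer different questions.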

• heatmap (stats)

• heatmap.2 (gplots)

• heatmap_2, heatmap_plus (Heatplus)

• circularmap, heatmapl (NeatMap)

ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples
  element names: exprs
phenoData
  sampleNames: 39, 40, ..., 33  (72 total)
  varLabels and varMetadata description:
    Samples: Sample index
    ALL.AML: Factor, indicating ALL or AML
    ...: ...
    Source: Source of sample
    (11 total)
featureData
  featureNames: AFFX-BioB-5_at, AFFX-BioB-M_at, ..., Z78285_f_at  (7129 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349
Annotation: hu6800

'data.frame':   72 obs. of  11 variables:
 $ Samples  : int  39 40 42 47 48 49 41 43 44 45 ...
 $ ALL.AML  : Factor w/ 2 levels "ALL","AML": 1 1 1 1 1 1 1 1 1 1 ...
 $ BM.PB    : Factor w/ 2 levels "BM","PB": 1 1 1 1 1 1 1 1 1 1 ...
 $ T.B.cell : Factor w/ 2 levels "B-cell","T-cell": 1 1 1 1 1 1 1 1 1 1 ...
 $ FAB      : Factor w/ 4 levels "M1","M2","M4",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Date     : Factor w/ 27 levels "","1/24/1984",..: 1 13 NA 27 9 NA NA NA 4 4 ...
 $ Gender   : Factor w/ 2 levels "F","M": 1 1 1 2 1 2 1 1 1 2 ...
 $ pctBlasts: int  NA NA NA NA NA NA NA NA NA NA ...
 $ Treatment: Factor w/ 2 levels "Failure","Success": NA NA NA NA NA NA NA NA NA NA ...
 $ PS       : num  0.78 0.68 0.42 0.81 0.94 0.84 0.99 0.66 0.97 0.88 ...
 $ Source   : Factor w/ 4 levels "CALGB","CCG",..: 3 3 3 3 3 3 3 3 3 3 ...

[Figure: "Histogram of exprs(Golub_Merge)"; x axis exprs(Golub_Merge) from -20000 to 60000, y axis Frequency]

package stats

> m <- exprs(Golub_Merge)
> dim(m)
[1] 7129   72

> # order() sorts in increasing order, so this keeps the 300 least
> # variable probe sets; add decreasing = TRUE for the most variable ones
> spli <- order(apply(m, 1, var))[1:300]
> m <- m[spli, ]
> heatmap(m, labRow="")

[Figure: heatmap of m with row and column dendrograms, columns labelled by sample]

package stats

> library(RColorBrewer)
> exprs_brk <-
+   c(quantile(m[m < 0], seq(0, 1, length=5)),
+     0,
+     quantile(m[m > 0], seq(0, 1, length=5)))
> all_aml <- as.integer(pData(Golub_Merge)$ALL.AML)
> # brewer.pal() needs n >= 3; only the first two colors are used
> mycol <- brewer.pal(3, "Set1")[all_aml]

> heatmap(m, labRow = "",
+         col = brewer.pal(10, "RdBu"),
+         breaks = exprs_brk,
+         ColSideColors = mycol)

[Figure: heatmap of m colored with the RdBu palette and quantile-based breaks, with a column side bar showing ALL or AML, columns labelled by sample]

package gplots

> library(gplots)
> heatmap.2(m, labRow="",
+           symbreaks = TRUE,
+           col = brewer.pal(10, "RdBu"),
+           trace = "none",
+           ColSideColors = mycol)

[Figure: heatmap.2 output with a "Color Key and Histogram" inset (Value from -200 to 200, y axis Count), columns labelled by sample]

package Heatplus

> library(Heatplus)
> addvar <- pData(Golub_Merge)[c("Gender", "ALL.AML")]
> heatmap_plus(m,
+              breaks = exprs_brk,
+              col = brewer.pal(10, "RdBu"),
+              addvar = addvar)

[Figure: heatmap_plus output, rows labelled by probe set identifier, columns labelled by sample, with annotation tracks for Gender and ALL.AML]

2 Visualizing data on the genome

• contextual information

• measured entities have respective positions on the genome

• the reference genome is an abstraction

2.1 Idiogram

> library(idiogram)

> library(golubEsets)
> data(Golub_Train)
> human_chr <- buildChromLocation("hu6800")
> expr_vector <-
+   assayData(Golub_Train)[["exprs"]][, 1]
> idiogram(expr_vector, human_chr, chr="1")

[Figure: idiogram of chromosome 1 (bands p36 to q44) with the expression values (0 to 20000) plotted along it]

chromLocation objects

> buildChromLocation("<package name>")

> prob2chro <- new.env()
> assign("probe_a1", "1", prob2chro)
> assign("probe_a2", "1", prob2chro)
> assign("probe_b", "X", prob2chro)
> assign("probe_c", "X", prob2chro)
> prob2symbol <- new.env()
> assign("probe_a1", "a", prob2symbol)
> assign("probe_a2", "a", prob2symbol)
> assign("probe_b", "b", prob2symbol)
> assign("probe_c", "c", prob2symbol)

> foobar_chrloc <- new("chromLocation",
+   organism = "foobar", # name
+   dataSource = "Dr. Moreau's lab",
+   chromLocs = list(gene_a=c(), gene_b=c()),
+   probesToChrom = c(1000, 1200, 200, 200),
+   chromInfo = c("1"=10000, "X"=500, "Y"=600),
+   geneSymbols = prob2symbol)

> library(RColorBrewer)
> col_index <- cut(expr_vector, 9)
> my_col <- brewer.pal(9, "BuGn")[col_index]
> idiogram(expr_vector, human_chr, chr="1",
+          col = my_col)

[Figure: idiogram of chromosome 1, points colored by expression value cut into 9 intervals]


> col_index <- cut(rank(expr_vector), 9)
> my_col <- brewer.pal(9, "BuGn")[col_index]
> idiogram(expr_vector, human_chr, chr="1",
+          col = my_col, pch = 16)

[Figure: idiogram of chromosome 1, filled points colored by the rank of the expression value cut into 9 intervals]
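The difference between cutting the values and cutting their ranks can be checked on simulated data. In this sketch x is a made-up, skewed vector standing in for the expression values, and a base R color ramp is used in place of the "BuGn" palette so that the example needs no add-on package.

```r
# Skewed stand-in for raw expression values (assumption).
set.seed(7)
x <- rexp(900)

equal_width <- cut(x, 9)         # 9 intervals of equal width on the value scale
equal_count <- cut(rank(x), 9)   # 9 intervals holding about the same number of points

table(equal_width)   # most values fall into the first intervals
table(equal_count)   # the intervals are filled evenly

# A factor used as an index selects colors through its integer level codes.
pal <- colorRampPalette(c("white", "darkgreen"))(9)
col_by_value <- pal[equal_width]
col_by_rank  <- pal[equal_count]
```

With skewed data, coloring by value spends most of the palette on a handful of extreme points, whereas coloring by rank uses every shade equally often but no longer reflects the size of the differences.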


2.2 Genome Atlases

Genome Atlas

> library(ecolitk)

http://tinyurl.com/6zqsc5

Using more of the available space

[Figure: cover of Kemisk Månedsblad, no. 4, April 2009, volume 90 (ISSN 0011-6335)]

2.3 HilbertVis

Hilbert’s curve

• fractal curve

• covers as much of the plane as possible

Iteration level = 1

[Figure: Hilbert curve at iteration level 1]

4 data points can be represented on a 2x2 grid.

[Figures: Hilbert curve at iteration levels 2, 3, 4 and 5]


Visualizing large vectors on screens

[Figure: vector size plotted against image size]

A square of 1000x1000 pixels can be used to represent 1,000,000 data points.
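The mapping from a position in the vector to a pixel can be written in a few lines. The sketch below implements the standard iterative index-to-coordinate conversion for the Hilbert curve; the function name hilbert_xy is made up for this example and this is not the code of the HilbertVis package. At iteration level k the curve visits 4^k points on a 2^k x 2^k grid.

```r
# Map position d (0-based) along a Hilbert curve of the given iteration level
# to integer grid coordinates (x, y), each in 0 .. 2^level - 1.
hilbert_xy <- function(d, level) {
  x <- 0
  y <- 0
  t <- d
  s <- 1
  while (s < 2^level) {
    rx <- (t %/% 2) %% 2
    ry <- (t + rx) %% 2          # lowest bit of (t XOR rx)
    if (ry == 0) {               # rotate or flip the current quadrant
      if (rx == 1) {
        x <- s - 1 - x
        y <- s - 1 - y
      }
      tmp <- x
      x <- y
      y <- tmp
    }
    x <- x + s * rx
    y <- y + s * ry
    t <- t %/% 4
    s <- s * 2
  }
  c(x = x, y = y)
}

level <- 3
n_points <- 4^level    # 64 points on an 8 x 8 grid
xy <- t(sapply(0:(n_points - 1), hilbert_xy, level = level))

# Consecutive positions in the vector land on neighboring grid cells.
steps <- rowSums(abs(diff(xy)))
```

Because neighbors in the vector stay neighbors on the grid, local features of a long vector appear as compact patches in the image. Level 10 gives a 1024 x 1024 grid, enough for just over a million data points.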

Comparing cartesian plots with Hilbert curve

[Figure: two Cartesian plots of x against seq (0 to 400), and the corresponding Hilbert curve display with a color scale from -4 to 4]

2.4 GenomeGraphs

GenomeGraphs

• Genomic information is traditionally represented in tracks

• Tracks can be:

– ORF, promoters, genes on the genome

– Sequence characteristics (GC-content, functional domains, complexity)

– Experimental signal associated with those sequences (RNA levels, methylation signals, copy-numbers)

> library(GenomeGraphs)

> ehs_mart <- useMart("ensembl",
+                     "hsapiens_gene_ensembl")
> minbase <- 180300000 #180292097
> maxbase <- 180500000 #180491933
> genesplus <- makeGeneRegion(start = minbase,
+                             end = maxbase,
+                             strand = "+",
+                             chromosome = "3",
+                             biomart = ehs_mart)
> genesmin <- makeGeneRegion(start = minbase,
+                            end = maxbase,
+                            strand = "-",
+                            chromosome = "3",
+                            biomart = ehs_mart)
> axis_trk <- makeGenomeAxis(add53 = TRUE,
+                            add35 = TRUE,
+                            littleTicks = TRUE)
> p <- gdPlot(list(genesplus, axis_trk, genesmin),
+             minBase = minbase, maxBase = maxbase)
> print(p)

[Figure: gene regions on the + and - strands of chromosome 3, on either side of a genome axis running from 180300000 to 180500000 and labelled 5' 3' and 3' 5']

> idiog <- makeIdeogram("3")
> p <- gdPlot(list(idiog, genesplus,
+                  axis_trk, genesmin),
+             minBase = minbase, maxBase = maxbase)
> print(p)

[Figure: the same tracks with an ideogram of chromosome 3 added on top; genome axis from 180300000 to 180500000]

> probepos_begin <- sort(runif(200, min=minbase,
+                              max=maxbase))
> probepos_end <- probepos_begin + 200
> probepos_signal <-
+   sin(seq(0, 6, length=200)) + rnorm(200, sd=0.05)
> probepos_signal <-
+   matrix(probepos_signal,
+          ncol=1)
> expression_trk <-
+   makeGenericArray(
+     intensity = probepos_signal,
+     probeStart = probepos_begin,
+     probeEnd = probepos_end,
+     dp = DisplayPars(color="darkblue",
+                      type="point"))
> gdPlot(list("+" = genesplus,
+             axis_trk, "-" = genesmin,
+             "log-ratio expression" = expression_trk),
+        minBase = minbase,
+        maxBase = maxbase)

[Figure: tracks "+" and "-" around the genome axis (180300000 to 180500000), with an additional expression track showing the simulated signal as points between -1 and 1]
