lab: using graphs for comparing transcriptome and...

Lab: Using Graphs for Comparing Transcriptome and Interactome

Data

S. Falcon, W. Huber, R. Gentleman

June 23, 2006

1 Introduction

In this lab we will use the graph, Rgraphviz, and RBGL Bioconductor packages to assess the re-lationship between sets of genes that cluster together based on expression value from microarrayexperiments (the transcriptome) and sets of genes whose proteins are known to physically interact(the interactome). The lab follows the discussion in section 22.2 of Gentleman et al. (2005), whichin turn is based on Balasubramanian et al. (2004).

The approach we will take is to create two graphs, one where edges represent the fact that twogenes are in the same cluster and the other where an edge represents the fact that the two genesphysically interact. If there is an association between clustering (which is based on gene expression)and physically interacting then we anticipate that when there is an edge between two proteins inone graph there will also be an edge between them in the other graph.

We test this hypothesis by creating the two graphs and counting how many edges they have incommon. Then, we generate a reference distribution by permuting the labels on one of the twographs (either one is fine) and for each permutation count edges in common. Finally, we comparethe observed number of edges with the permutation distribution and conclude that there is arelationship if the observed number is large, as determined by the reference distribution.

2 Required Packages

Exercise 1Use the library function to load the following packages: Biobase, graph, Rgraphviz, RBGL, RColor-Brewer, RbcBook1, and yeastExpData.

1

> library("Biobase")

> library("graph")

> library("Rgraphviz")

> library("RBGL")

> library("RbcBook1")

> library("yeastExpData")

3 The Data

The curated data in yeastExpData contains both gene expression data from a yeast cell-cycle ex-periment, including cluster membership (ccyclered), and protein-protein interaction (PPI) dataextracted from published papers (litG).

> data(ccyclered)

> data(litG)

Exercise 2What type of object is litG? How do you find out more about this class? Use the nodes methodto extract the first five nodes of litG.

It is an instance of the graphNEL class. You can use the manual page,class?graphNEL, to find out more.

> nodes(litG)[1:5]

[1] "YBL072C" "YBL083C" "YBR009C" "YBR010W" "YBR031W"

Exercise 3Explore the ccyclered data to determine what type of R object it is and what kind of data itcontains.

The ccyclered data is stored in a data.frame. You should also look at themanual page.

> class(ccyclered)

> str(ccyclered)

> dim(ccyclered)

> names(ccyclered)

2

4 Graph Basics

The RBGL package has a number of different graph algorithms implemented. In these next fewexercises we will see how to use a few of them. We will use the litG graph for our examples.

A graph can consist of one or more connected components. You can find them using the connect-edComp function.

> cc1 = connectedComp(litG)

> length(cc1)

[1] 2642

> cc1lens = sapply(cc1, length)

> table(cc1lens)

cc1lens1 2 3 4 5 6 7 8 12 13 36 88

2587 29 10 7 1 1 2 1 1 1 1 1

Exercise 4How many connected components are there? What is the size of the largest connected component?How many singletons are there? What are the elements (or values) stored in cc1? Create asubgraph of litG which has only the connected components of size 4 or more.

We first do the computations.

> cc1lens = sapply(cc1, length)

> good = cc1lens > 4

> whComp = cc1[good]

> sg1 = subGraph(unlist(whComp), litG)

> sg1

A graphNEL graph with undirected edgesNumber of Nodes = 182Number of Edges = 241

There are 2642 connected components. There largest component is of size88.

Lets plot the component of size 12 using the Rgraphviz package. We first compute the subgraph andthen lay it out. There are many options for node color, line color and type, node shape etc. If youare interested in more complex pictures you should read the vignette from the Rgraphviz package.

3

> sg12 = subGraph(cc1[cc1lens == 12][[1]], litG)

> l12 <- agopen(sg12, layoutType = "dot", nodeAttrs = makeNodeAttrs(sg12,

+ fillcolor = "steelblue2"), name = "")

> plot(l12)

YBL035C

YBR088C

YDL102W

YDR097CYJR006W

YJR043C

YNL102W

YPR135W

YMR167W

YPL164C YNL082W

YOL090W

Figure 1: The connected component of size 12.

Exercise 5Layout the graph using the two other layout types from Rgraphviz, namely neato and twopi.

You should plot these.

> l12 <- agopen(sg12, layoutType = "dot", nodeAttrs = makeNodeAttrs(sg12,

+ fillcolor = "steelblue2"), name = "")

4

Next let’s extract the largest component and use it to compute some other quantities. Here wecompute the shortest path between a rather arbitrarily selected pair of nodes.

> sg88 = subGraph(cc1[cc1lens == 88][[1]], litG)

> nN = nodes(sg88)

> sps = sp.between(sg88, nN[1], nN[81])

Exercise 6What sort of object is sps? What does the manual page say about it? Can you plot the graphand identify that this indeed is the shortest path? (You could color these nodes differently fromthe rest).

If you want to find the diameter of the graph it is defined as the longest shortest path between anytwo nodes. To compute this we use the function johnson.all.pairs.sp.

> allp = johnson.all.pairs.sp(sg88)

Exercise 7What type of object is allp? What data does it contain? What is the diameter of allp?

allp is a matrix of the shortest path distances between all pairs of nodes.You can find the diameter by finding the maximum value in allp.

> max(allp)

[1] 13

5 The Analysis

We now return to the analysis we discussed at the beginning of this lab. As suggested above wewant to create two graphs, one that reflects protein interactions and one that reflects clustered geneexpression. The first one of these graphs is litG and we will need to use the ccyclered to createthe second one.

Cho et al. (1998) discuss the k means clustering of 2885 Saccharomyces genes into 30 clusters withmeasurements taken over two synchronized cell cycles. These data are stored as ccyclered and wenext explain how to extract what is needed.

5

5.1 Cluster graph

The first step is to create a cluster graph from the ccyclered data in which edges are between allgenes that are in the same cluster. The clusters are given in the first column (named Cluster)of the data.frame. There is a specialized graph class, clusterGraph that can be used to representclusters.

We need to compute the set of genes in each cluster and you will do that by building a list, whereeach entry contains the names of the genes in that cluster.

Exercise 8Use the split function and the Y.name and Cluster columns of the ccyclered dataframe to createa list that maps gene name to cluster name. Store the list in a variable named clusts.

> clusts <- split(ccyclered[["Y.name"]], ccyclered[["Cluster"]])

Next we use the clusts list from the previous exercise to create a clusterGraph instance using new :

> cg1 <- new("clusterGraph", clusters = clusts)

Exercise 9How many connected components does the cluster graph cg1 have? Hint: apropos("connect").

> ccClust <- connectedComp(cg1)

5.2 PPI graph

We next turn our attention to a brief exploration of the literature-based PPI data stored as agraphNEL object.

Exercise 10Store the connected components of litG in a variable called ccLit.

Exercise 11a) Use listLen to compute the size of each connected component. Store the result in cclens.

b) Use table to summarize the sizes of the connected components stored in cclens.

c) How many singleton components are there?

6

> ccLit <- connectedComp(litG)

> cclens <- listLen(ccLit)

> table(cclens)

cclens1 2 3 4 5 6 7 8 12 13 36 88

2587 29 10 7 1 1 2 1 1 1 1 1

> nrSingletons <- table(cclens)["1"]

Creating an index vector that orders the connected components by size will allow us to easily accessthe smallest and largest components. In the example below, we list the genes in the eighth largestconnected component.

> ord <- order(cclens, decreasing = TRUE)

> ccLit[[ord[8]]]

[1] "YMR080C" "YLL026W" "YJR132W" "YDR172W" "YNL112W"[6] "YBR143C"

Exercise 12Use the subGraph method to create two new graphs sG1 and sG2, the first and second largestconnected components of of the litG graph.

> sG1 <- subGraph(ccLit[[ord[1]]], litG)

> sG2 <- subGraph(ccLit[[ord[2]]], litG)

Now we plot sG1 and sG2 using Rgraphviz:

> lsG1 <- agopen(sG1, layoutType = "neato", nodeAttrs = makeNodeAttrs(sG1),

+ name = "")

> lsG2 <- agopen(sG2, layoutType = "neato", nodeAttrs = makeNodeAttrs(sG2,

+ fillcolor = "#a6cee3"), name = "")

> plot(lsG1)

> plot(lsG2)

7

YDR382W

YER009W

YFL039C

YLR229C

YLR340W

YDL127W

YER111CYGR109C

YGR152C

YJL187C

YKL042W

YKL101W

YLR212C

YLR313C

YMR199W

YNL289W

YPL256C

YPR120C

YOR160W

YDR388W

YJL157C

YOR036W

YPL031C

YCL027W

YLL024C

YOR027W

YOR185C

YPL240C

YMR294W

YNL004W

YNL243W

YOL039W

YOR098C

YAL040C

YBR200W

YGR108W

YPL242C

YPR119W

YCL040W

YAL005C

YLR216C

YBR133C

YER114C

YDL179W

YHR005C

YJL194W

YLR079W

YLR319C

YMR109W

YDR432W

YDR356W

YEL003W

YHR061C

YHR172W

YLL021W

YNL126W

YPL016W

YDR085C

YHR102W

YOL016C

YBR160W

YMR308C

YHL007C

YKL068W

YAL029C

YBR109C

YGL016W

YLR293C

YML065W

YLR200W

YML094W

YMR092C

YMR186W

YDR309C

YHR129C

YAL041W

YBL016W

YBL079W

YOR127W

YPL174C

YDR103W

YDR323C

YDR184C

YHR069C

YOL021C

YOR181W

YBR155W

YPL140C

YBR009C

YBR010W

YNL030W

YNL031C

YOL139C

YAR007C

YBR073W

YER095W

YJL173C

YNL312W

YBL084C

YDR146C

YLR127C

YNL172W

YLR134W

YMR284W

YER179WYIL144W

YML104C

YOR191W

YDL008W

YDL030W

YDL042C

YDR004W

YGR162W

YMR117C

YDR386W

YDR485C

YDL043C

YDR118W

YMR106C

YML032C

YDR076W

YDR180W

YDL013W

YDR227W

Figure 2: The two largest PPI connected components.

5.3 Testing associations

It is now easy to determine how many pairs of genes have both a protein-protein interaction andare found in the same expression cluster. To compute this, find the intersection of the cluster-graphand the literature graph using intersection:

> commonG <- intersection(cg1, litG)

Exercise 13How many edges are common to the two graphs (cg1 and litG)?

Now we will try to determine whether the number of common edges is statistically interesting ornot. We will do this by generating a null distribution via permutation of node labels on the observedgraph.

Here is a function that can be used to generate values from the desired null distribution. Unfortu-nately, running this function with the current implementation is very slow.

> nodePerm <- function(g1, g2, B = 1000) {

+ n1 <- nodes(g1)

+ sapply(1:B, function(i) {

+ nodes(g1) <- sample(n1)

+ numEdges(intersection(g1, g2))

+ })

+ }

Exercise 14Describe what the nodePerm function is doing to make sure you understand how it works.

8

Since the nodePerm function is slow, we’ve computed 500 iterations ahead of time. Load theprecomputed result as follows:

> data(nPdist)

> summary(nPdist)

Exercise 15Plot the nPdist data and decide if the number of edges in common between litG and cg1 isstatistically interesting.

6 Some harder problems

In this section we present some problems that are more open ended. They are not formally part ofthis Lab, but are here for those who finish early, or who are particularly interested in these sortsof applications.

To answer the last two questions you will need to obtain a newer Bioconductor package called ScISI.

� Which of the expression clusters have intersections and with which of the literature clusters?

� Are there expression clusters that have a number of literature cluster edges going betweenthem (and hence suggesting that the expression clustering was too fine or that the genesinvolved in the literature cluster are not cell-cycle regulated).

� Are there known cell-cycle regulated protein complexes, and do the genes involved tend tocluster together in both graphs?

� Is the expression behavior of genes that are involved in multiple protein complexes differentfrom that of genes that are involved in only one complex?

The version number of R and packages loaded for generating this document are:

Version 2.3.1 Patched (2006-06-08 r38315)powerpc-apple-darwin8.6.0

attached base packages:[1] "tools" "methods" "stats" "graphics"[5] "grDevices" "utils" "datasets" "base"

other attached packages:yeastExpData RbcBook1 RBGL Rgraphviz

9

"0.6.0" "1.0.2" "1.8.1" "1.10.0"graph Ruuid Biobase

"1.10.6" "1.10.0" "1.10.0"

References

R. Balasubramanian, T. LaFramboise, D. Scholtens, et al. A graph theoretic approach to testingassociations between disparate sources of functional genomics data. Bioinformatics, 20:3353–3362, 2004.

R.J. Cho, M.J. Campbell, E.A. Winzeler, et al. A genome-wide transcriptional analysis of themitotic cell cycle. Molecular Cell, 2:65–73, 1998.

R. Gentleman, W. Huber, V. Carey, R. Irizarry, and S. Dudoit, editors. Bioinformatics and Com-putational Biology Solutions Using R and Bioconductor. Springer, 2005.

10

lab: using graphs for comparing transcriptome and...

Documents