
Preservation of protein-protein interaction networks

Simple simulated example

Peter Langfelder and Steve Horvath

May 31, 2011

Contents

1 Overview

1.a Setting up the R session

2 Calculation of module preservation

3 Analysis of module preservation statistics

A Simulation of PPI networks

1 Overview

This document contains a simple illustration of the use of the function modulePreservation [1] to study the preservation of complexes in protein-protein interaction (PPI) networks. We simulate two PPI networks. Each network contains 10 complexes with sizes between 10 and 50 proteins. Five of the 10 complexes, labeled 1–5, are preserved between the two networks, while the other five complexes (labeled 6–10) are not preserved.

We encourage readers unfamiliar with any of the functions used in this tutorial to type, in an active R session,

help(functionName)

(replace functionName with the actual name of the function) to get a detailed description of what the function does, what the input arguments mean, and what the output is.

1.a Setting up the R session

After starting R we execute a few commands to set the working directory and load the requisite packages:

# Display the current working directory

getwd();

# If necessary, change the path below to the directory where the data files are stored.

# "." means current directory. On Windows use a forward slash / instead of the usual \.

workingDir = ".";

setwd(workingDir);

# Load the package

library(WGCNA);

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);


2 Calculation of module preservation

We use simulated PPI networks that are generated using the code provided in Appendix A. For simplicity, we load the networks saved there.

load(file = "simulatedPPInetworks.RData");

The above command loads two objects, PPInetwork1 and PPInetwork2. Each of them is a list with two components: the component adjacency contains the network adjacency matrix, and the component labels contains the module (or protein complex) labels. The modules are labeled by numbers 1–10. Proteins that are not part of any complex carry the label 0. To get a basic idea of how big the network is, we can use

dim(PPInetwork1$adjacency)

which will tell us that the network contains 350 proteins. Also note that the columns of the adjacency matrix must carry protein names. In our example we named the simulated proteins simply "Protein.1"–"Protein.350":

colnames(PPInetwork1$adjacency)

Column names for the adjacency matrices are important because they allow the module preservation function to match proteins between reference and test networks. Here we use the same proteins in the same order, but in practice this may not be the case.

We next create "multi-adjacency" and module "multi-labels". These variables are lists with one component per data set. In this example we study two data sets, a reference set (1) and a test set (2). Note that the components of the list must be named. The names are used as identifiers for the data sets.

multiAdj = list( network1 = list(data = PPInetwork1$adjacency),

network2 = list(data = PPInetwork2$adjacency));

multiLabels = list(network1 = PPInetwork1$labels);
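Before running a long calculation it can be worth checking that the input lists have the expected shape. The following is a minimal sketch in base R, using a hypothetical 4-protein toy example (the names toyMultiAdj, adj1, adj2 and proteinNames are illustrative and are not part of the tutorial data). It checks that each data set is named, that each adjacency matrix carries column names, and that the proteins can be matched by name even when their order differs.

```r
# Toy stand-ins for the two networks (hypothetical 4-protein example)
proteinNames = paste0("Protein.", 1:4);
adj1 = diag(4);
dimnames(adj1) = list(proteinNames, proteinNames);
# Same proteins as adj1, but stored in a different order
adj2 = adj1[c(3, 1, 4, 2), c(3, 1, 4, 2)];

toyMultiAdj = list(network1 = list(data = adj1),
                   network2 = list(data = adj2));

# Every data set needs a name, and every adjacency needs column names
hasSetNames = !is.null(names(toyMultiAdj));
hasColNames = all(sapply(toyMultiAdj, function(set) !is.null(colnames(set$data))));

# Proteins shared between the reference and the test network
common = intersect(colnames(toyMultiAdj$network1$data),
                   colnames(toyMultiAdj$network2$data));
length(common)
```

The same checks can be applied to multiAdj once the simulated networks have been loaded.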

We now call the modulePreservation function to calculate network module preservation statistics. This calculation may take up to a few hours, depending on the available computational speed.

mp = modulePreservation(multiAdj, multiLabels, dataIsExpr = FALSE,

referenceNetworks = 1, restrictSummaryForGeneralNetworks = FALSE,

nPermutations = 100,

calculateCor.kIMall = FALSE, verbose = 3);

# Save the results

save(mp, file = "mp.RData");

We saved the results so that the calculation only needs to be run once. The results can be re-loaded using the following command:

load(file = "mp.RData");

3 Analysis of module preservation statistics

We now isolate the medianRank and the Z statistics and plot them as a function of module size.

stats = cbind(medianRank = mp$preservation$observed[[1]][[2]]$medianRank.pres[-c(1,2)],

mp$preservation$Z[[1]][[2]][-c(1,2), -1]);

moduleSizes = mp$preservation$Z[[1]][[2]][-c(1,2), 1];

# Order rows by module label

order = order(as.numeric(rownames(stats)))

stats = stats[order, ]

moduleSizes = moduleSizes[order]

labels = as.numeric(rownames(stats))

# Indicate preserved modules by red color and non-preserved by black color

preserved = c(1:5);


presInd = match(preserved, labels);

presColor = rep(1, length(labels));

presColor[presInd] = 2;

# Open a suitably sized graphics window or, alternatively, open a pdf file to hold the plot

sizeGrWindow(10,7);

#pdf(file=spaste("Plots/PPIsimulation-halfPreserved"), wi=10, he=8)

# Set sectioning and margins

par(mfrow = c(3,4))

par(mar = c(3.2, 3.2, 2, 0.5))

par(mgp = c(2.0, 0.6, 0))

# Plot the individual statistics

for (s in 1:ncol(stats))

{

min = min(stats[, s], na.rm = TRUE);

max = max(stats[, s], na.rm = TRUE);

if (s > 1)

{

if (min > -max/5) min = -max/5;

} else {

tmp = min; min = max; max = tmp;

}

plot(moduleSizes, stats[, s], main = colnames(stats)[s],

ylab = colnames(stats)[s], type = "n", xlab = "Module size",

cex.main = 1, ylim = c(min, max))

text(moduleSizes, stats[, s], labels = labels, col = presColor);

box = par("usr");

if (s==1) legend(x = box[2], y = (max+min)/2, xjust = 1, yjust = 0.5,

legend = c("Preserved", "Non-preserved"), fill = c(2,1), cex = 0.8)

if (s>1)

{

abline(h=0)

abline(h=2, col = "blue", lty = 2);

abline(h=10, col = "darkgreen", lty = 2);

}

}

# If plotting into a file, close it.

dev.off();

The resulting plot is shown in Figure 1. We note that in this example the composite statistics medianRank and Zsummary work best at separating the preserved and non-preserved modules. While medianRank appears largely independent of module size, the Z statistics for preserved modules show a marked dependence on module size. This agrees with the intuition that it is more significant to observe a preservation of a pattern among 50 proteins than among 10 proteins.
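The dashed lines drawn in the plots sit at Z = 2 and Z = 10. As a minimal sketch of how these two thresholds can be used to bin modules, the code below classifies a vector of Zsummary values. The values are hypothetical and are not results from this tutorial, and the category labels are an assumption that follows the usual reading of the thresholds in [1] (below 2: no evidence of preservation; between 2 and 10: weak to moderate evidence; above 10: strong evidence).

```r
# Hypothetical Zsummary values for four modules (not results from this tutorial)
Zsummary = c(module1 = 25.3, module2 = 11.0, module3 = 4.2, module4 = -0.7);

# Bin the values at the thresholds Z = 2 and Z = 10
evidence = cut(Zsummary, breaks = c(-Inf, 2, 10, Inf),
               labels = c("none", "weak to moderate", "strong"));
names(evidence) = names(Zsummary);
evidence
```

Because the Z statistics grow with module size, such a binning is best read together with medianRank when modules of very different sizes are compared.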


[Figure 1 appears here: a 3 × 4 grid of panels, one per statistic (medianRank, Zsummary, Zdensity, Zconnectivity, Z.propVarExplained, Z.meanKIM, Z.meanAdj, Z.meanClusterCoeff, Z.cor.kIM, Z.cor.kME, Z.cor.adj, Z.cor.clusterCoeff). Each panel plots the statistic (y axis) against module size (x axis, 10 to 50), with modules marked by their numeric labels 1–10.]

Figure 1: Module preservation statistics of simulated modules in this study. Each plot shows one of the preservation statistics (indicated in the title) as a function of the module size. Modules are labeled by their numeric labels; red color denotes preserved and black non-preserved modules. The blue and green dashed lines denote the thresholds Z = 2 and Z = 10. The statistics medianRank and Zsummary do the best job of distinguishing the preserved and non-preserved modules in this study.


A Simulation of PPI networks

Here we generate the reference and test networks used in this tutorial. We start by defining two functions, one for simulating a protein complex (a group of densely interconnected proteins), and one for simulating a network consisting of several complexes.

simulateComplex = function(nProteins, minScaledK, maxScaledK)

{

k = seq(from = maxScaledK, to=minScaledK, length.out = nProteins) * nProteins;

K = sum(k);

adjacency = matrix(1, nProteins, nProteins);

pMat = matrix(NA, nProteins, nProteins)

for (i in 1:(nProteins-1))

for (j in (i+1):nProteins)

{

p = k[i]*k[j] / (K - (k[i] + k[j])/2);

if (p >1) p = 1;

pMat[i,j] = pMat[j,i] = p;

adjacency[i,j] = adjacency[j,i] = sample(c(0,1), size = 1, prob = c(1-p, p))

}

adjacency;

}

simulateProteinNetwork = function(

complexSizes, nSingletons,

minScaledK = 0.2, maxScaledK = 0.9,

propMissingLinks = 0,

propInterComplexLinks = 0)

{

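# Number of complexes, derived from the input rather than taken from the global environment

nComplexes = length(complexSizes);

nProteins = sum(complexSizes) + nSingletons;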

adjacency = matrix(0, nProteins, nProteins);

diag(adjacency) = 1;

labels = rep(0, nProteins);

starts = c(1, cumsum(complexSizes)+1);

ends = c(cumsum(complexSizes), nProteins);

for (c in 1:nComplexes)

{

st = starts[c];

en = ends[c];

adj.complex = simulateComplex(complexSizes[c], minScaledK, maxScaledK);

adj.dst = as.dist(adj.complex);

leaveOut = sample(c(FALSE, TRUE), size = length(adj.dst),

prob = c(1-propMissingLinks, propMissingLinks),

replace = TRUE);

adj.dst[leaveOut] = 0;

adj.complex = as.matrix(adj.dst);

diag(adj.complex) = 1;

adjacency[st:en, st:en] = adj.complex;

labels[st:en] = c;

}

for (c1 in 1:(nComplexes+1))

{

if (c1 <= nComplexes)

{

c1x = c1 + 1

} else


c1x = c1;

for (c2 in c1x:(nComplexes + 1))

{

st1 = starts[c1];

en1 = ends[c1];

st2 = starts[c2];

en2 = ends[c2];

n1 = en1 - st1 + 1;

n2 = en2 - st2 + 1;

interAdj = sample(c(0, 1), size = n1*n2, prob = c(1-propInterComplexLinks, propInterComplexLinks),

replace = TRUE);

dim(interAdj) = c(n1, n2);

if (c1==c2)

{

interAdj = as.matrix(as.dist(interAdj));

diag(interAdj) = 1;

}

adjacency[st1:en1, st2:en2] = interAdj;

adjacency[st2:en2, st1:en1] = t(interAdj);

}

}

colnames(adjacency) = spaste("Protein.", c(1:nProteins));

rownames(adjacency) = spaste("Protein.", c(1:nProteins));

list(adjacency = adjacency, labels = labels);

}
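Within simulateComplex, each protein i receives a target connectivity k[i] that falls linearly from maxScaledK*nProteins to minScaledK*nProteins, and proteins i and j are connected with probability k[i]*k[j] / (K - (k[i] + k[j])/2), capped at 1. The following is a minimal sketch that computes the same probabilities in vectorized form so they can be inspected directly. The helper name connectionProb is hypothetical and is not used elsewhere in this tutorial.

```r
# Hypothetical helper: connection probabilities used by simulateComplex,
# computed for all protein pairs at once
connectionProb = function(nProteins, minScaledK = 0.2, maxScaledK = 0.9)
{
  # Target connectivities, decreasing from the first to the last protein
  k = seq(from = maxScaledK, to = minScaledK, length.out = nProteins) * nProteins;
  K = sum(k);
  # p[i,j] = k[i]*k[j] / (K - (k[i] + k[j])/2), capped at 1
  p = outer(k, k) / (K - outer(k, k, "+")/2);
  p[p > 1] = 1;
  # The diagonal is not sampled in simulateComplex
  diag(p) = NA;
  p;
}

pMat = connectionProb(10);
# Smallest and largest off-diagonal connection probability
range(pMat, na.rm = TRUE)
```

With the default settings, pairs of high-connectivity proteins are connected with probability 1, while pairs near the end of the sequence are connected only rarely, which is what gives each simulated complex a dense core.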

We next define basic parameters of the simulation.

nComplexes = 10;

nPreserved = 5;

preserved = c(1:nPreserved)

nNonPreserved = nComplexes - nPreserved;

nonPreserved = c(1:nComplexes)[-preserved];

complexSizes1 = seq(from = 50, to = 10, length.out = nPreserved);

complexSizes = rep(complexSizes1, 2);

nSingletons = 50;
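As a quick check of the parameters above, the following sketch recomputes the complex sizes and the total number of proteins. The variable name nProteinsTotal is introduced here for illustration only.

```r
# Recompute the complex sizes from the simulation parameters
nPreserved = 5;
complexSizes1 = seq(from = 50, to = 10, length.out = nPreserved);
complexSizes = rep(complexSizes1, 2);
nSingletons = 50;

# Sizes of the 10 complexes: 50, 40, 30, 20, 10, repeated twice
complexSizes

# Total number of proteins, including the singletons
nProteinsTotal = sum(complexSizes) + nSingletons;
nProteinsTotal
```

The total of 350 agrees with the dimensions of the adjacency matrix reported in Section 2.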

We call the simulation function twice to generate two separate networks that share the same complex structure but differ somewhat in the details of the connections within complexes. For simplicity we do not simulate any connections between proteins in different complexes, although the above functions support it.

set.seed(10);

PPInetwork1 = simulateProteinNetwork(complexSizes, nSingletons);

PPInetwork2 = simulateProteinNetwork(complexSizes, nSingletons);

The networks can be visualized, for example, using the image function:

sizeGrWindow(8,8);

#pdf(file = "Plots/networkImage.pdf", wi=8, he=8);

image(PPInetwork1$adjacency, xaxt = "none", yaxt = "none")

dev.off();

The plot is shown in Figure 2. The network image verifies that we have simulated 10 complexes of different sizes.

We now permute the proteins in the non-preserved complexes in the test data set.

starts = c(1, cumsum(complexSizes)+1);

ends = cumsum(complexSizes);

scramble = starts[ min(nonPreserved)]:ends[max(nonPreserved)];

newOrder = sample(scramble);

PPInetwork2$adjacency[scramble, scramble] = PPInetwork2$adjacency[newOrder, newOrder];

PPInetwork2$labels[scramble] = PPInetwork2$labels[newOrder];

Figure 2: Image of the simulated reference PPI network. Each row and column represents one protein; red color means not connected and white color means connected. Squares along the diagonal with dense connections correspond to simulated complexes.
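The effect of the permutation step can be seen on a small example. The sketch below, which uses a hypothetical 6-protein network (the names adj and permuted are illustrative), applies the same row-and-column reordering. The permuted matrix stays symmetric and keeps the same number of connections, but the connections are now attached to different protein positions, which is what breaks the correspondence with the reference network.

```r
set.seed(1);

# Toy network: proteins 1-3 form a fully connected complex, 4-6 are singletons
adj = matrix(0, 6, 6);
adj[1:3, 1:3] = 1;
diag(adj) = 1;

# Permute rows and columns together, as done for the test network
scramble = 1:6;
newOrder = sample(scramble);
permuted = adj;
permuted[scramble, scramble] = adj[newOrder, newOrder];

# The permuted matrix is still a valid (symmetric) adjacency matrix
isSymmetric(permuted)
# The total number of connections is unchanged
sum(permuted) == sum(adj)
```

Because the row and column names are left in place while the entries move, matching proteins by name between the two networks pairs each reference protein with a test protein that has different neighbors.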

Lastly, we save the networks for future use.

save(PPInetwork1, PPInetwork2, file = "simulatedPPInetworks.RData");

The resulting file is used as input at the start of this tutorial.


References

[1] Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath. Is my network module preserved and reproducible? PLoS Comput Biol, 7(1):e1001057, 01 2011.
