SUPPLEMENTARY INFORMATION

Materials and Methods

Whole genome sequencing

Exome Sequencing

Variant Validation

Identification of DLBCL Cancer Genes

Gene Expression Microarray Analysis

Gene Annotation and GO Term Enrichment

Biological Validation


Materials and Methods

Sample acquisition and processing

Archival lymphoma tumors (N=73) and normal tissue (N=34) from 73 patients were obtained

from the institutions that constitute the Hematologic Malignancies Research Consortium

(HMRC) 1. These cases were anonymized, shipped to Duke University, and processed in

accordance with a protocol approved by the Institutional Review Board at Duke University.

RNA and genomic DNA were extracted from these 73 cases in addition to 21 DLBCL cell lines

using column-based methods described previously1.

Whole genome sequencing

Library Preparation

Whole genome sequencing libraries were prepared using methods

described in the “Sample Preparation” section of the Agilent SureSelect protocol (pre-capture

portion). Genomic DNA was sheared to 500 bp using Covaris settings: duty cycle-10%,

intensity-5, frequency-200 cycles/burst, duration-135s, waterbath temperature-4°C, and

quantified by BioAnalyzer (Agilent) using the DNA1000 chip. Then, it was end-repaired, A-

tailed, and ligated to Illumina paired-end adapters at a ratio of 2μl per μg of DNA as quantified

by BioAnalyzer. The ligated library was amplified for 6 cycles using Illumina PE PCR primers

and 2x Phusion HF Master Mix. Post-PCR, the library was purified and assayed on BioAnalyzer

to determine size and concentration. Libraries were diluted to 5 pM for Illumina clustering and

paired-end sequenced over 9 days.

Sequence Alignment

Raw reads in fastq format 2 were masked for Illumina adapter sequences, barcodes, and Phred-

scaled base qualities of 10 and less using GATK3. All the alignments were output as BAM files 4

and merged using Picard (http://picard.sourceforge.net). PCR/optical duplicates were marked

with Picard, and base quality recalibration and localized Indel realignments were performed

using GATK 3. Read alignments were visualized with Integrative Genomics Viewer5.
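The quality-masking step can be illustrated with a minimal sketch. The masking in this study was done with GATK; the function below is only a hypothetical stand-alone illustration of masking bases with Phred-scaled quality of 10 or less, and it assumes Phred+33 quality encoding (older Illumina pipelines often used Phred+64, in which case `offset` would need to change).

```python
def mask_low_quality(seq, qual, threshold=10, offset=33):
    """Replace every base whose Phred score is <= threshold with 'N'.

    `offset` is the ASCII offset of the quality encoding (an assumption:
    33 for Sanger/Illumina 1.8+, 64 for older Illumina pipelines).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        "N" if ord(q) - offset <= threshold else base
        for base, q in zip(seq, qual)
    )


# 'I' encodes Q40, '+' encodes Q10 and '#' encodes Q2 under Phred+33,
# so the third and fifth bases are masked.
masked = mask_low_quality("ACGTAC", "II+I#I")
```

Masking rather than trimming keeps read lengths unchanged, which keeps the two reads of a pair aligned position for position.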

SAMtools mpileup with settings “-C50 -m3 -F0.0002” was run for the samples concurrently and

output to a VCF file. Individual SNVs and Indels were annotated with gene names and predicted

function using SequenceVariantAnalyzer6, dbSNP130, HapMap v3 allele frequencies, 1000

Genomes Project pilot 1 allele frequencies and CCDS Gene IDs using BEDTools, AWK, and


custom Python scripts. Predictions for phenotypic severity of variants were determined using

Mutation Assessor 7 (www.mutationassessor.org).

Structural Variant Calling and Annotation

The discovery of structural genetic aberrations was not the primary objective of this study. There

are major limitations imposed by the short-read format in the detection of such variants and the

methods for the detection of such variants are still evolving. We nevertheless surveyed our whole

genome sequencing data using established methods to identify copy number variation and

structural rearrangements.

Copy Number Variation

We identified copy number variants by using an approach similar to that described previously8.

In order to define the alterations in copy number throughout the genome, we began by

segmenting the genome into non-overlapping intervals of 200 kb each. Each of these intervals

represented an individual bin for identifying segmentation through a Hidden Markov Model. The

copy number calculations were computed based on the model.

Briefly, we computed the total number of sequencing reads mapping to each 200 kb interval in

both the tumor and the normal genomes from the same patient. The per-interval counts were

median-centered for each sample, and then a ratio was computed of the number of reads mapping

to the tumor and normal and log2-transformed. The copy number comparison between the

whole-genome DLBCL and its matching normal is depicted in 200 kb intervals below (Figure

S1). The y-axis is the log2 ratio between median-centered values for the DLBCL and matching

normal. We identified 7 deletions and 3 amplifications. Known cancer genes that were implicated

by these copy number alterations include the tumor suppressors PTEN (chromosome 10) and P16/CDKN2A

(chromosome 9). We also identified small deletions in the immunoglobulin heavy chain and

light chain loci that correspond to somatic rearrangement of these genes.
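The binning and ratio calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, "median-centered" is taken to mean dividing each bin count by the sample median, and bins with a zero count in either sample are left undefined rather than given a pseudocount. The Hidden Markov Model segmentation step is not shown.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(positions, n_bins, bin_size=200_000):
    """Count read start positions falling in each fixed-width bin."""
    counts = Counter(pos // bin_size for pos in positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def log2_ratios(tumor_counts, normal_counts):
    """Per-bin log2(tumor / normal) after median-centering each sample.

    Assumption: centering divides each count by the sample median.
    Bins with a zero count in either sample yield None.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("tumor and normal must have the same number of bins")
    t_med = median(tumor_counts)
    n_med = median(normal_counts)
    if t_med == 0 or n_med == 0:
        raise ValueError("median bin count is zero; cannot center")
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        if t == 0 or n == 0:
            ratios.append(None)
        else:
            ratios.append(math.log2((t / t_med) / (n / n_med)))
    return ratios


# The third bin has half the expected tumor signal: log2 ratio of -1,
# the value expected for a heterozygous deletion in a pure tumor sample.
example = log2_ratios([100, 200, 50], [100, 200, 100])
```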


Genetic Structural Rearrangements

We used Breakdancer v1.1 9 as the primary method for detecting structural rearrangements in our

whole-genome sequencing data.

Simulated structural variants

In order to assess the efficacy of this method in our data, we began by generating simulated

paired end reads as positive controls for 12 different structural variants including gene fusions,

insertions, deletions, and duplications involving either repetitive elements or unique coding

sequences. Simulated paired-end reads were generated at random along the length of the

simulated structural variant sequence, as well as for the corresponding normal sequence, in order

to simulate heterozygosity. The distances between paired reads were chosen to have the same

insert size distribution as was observed in actual sequencing data. Sampling frequency was

adjusted such that the “coverage” of the simulated reads was comparable to that of actual data.

Additionally, randomly chosen actual base quality calls were assigned to the bases in the

simulated reads to mimic the quality of real sequencing reads. Simulated reads were

concatenated to an actual fastq sequencing lane, and Breakdancer results with and without the

simulated reads were examined to determine whether the expected structural variant calls were

introduced by the simulated reads. We found that the sensitivity for simulated structural variants

was high; BreakDancer was able to detect abnormalities within 100 bp of the simulated

rearrangement in 10 of 12 cases. Of the two simulated variants that Breakdancer did not detect, one was an 80 bp insertion of a repetitive element into an exon, and the other was a 248 bp insertion of a unique exonic sequence into another exon.

Figure S1: Copy number alterations in the DLBCL whole genome compared to its matched control. Copy number changes are depicted on a log2 scale.

The results on simulated reads are listed below, with hg18 coordinates (Table S1):

Table S1: Simulated structural variants used to test Breakdancer sensitivity.

Structural variant simulated | Genomic break 1 (chr, strand, position) | Genomic break 2 (chr, strand, position) | Detected by Breakdancer
196 bp insertion of a repetitive element into the middle of an exon | 3 + 10158562 | not applicable | Yes
22 bp inversion | 3 - 188930304 | 3 - 188930281 | Yes
200 bp inversion | 8 + 128819825 | 8 + 128820026 | Yes
gene fusion | 19 + 50103049 | 3 - 188932141 | Yes
hanging insertion fused with a repetitive element | 17 - 7519987 | not applicable | Yes
fusion with another gene's promoter | 11 + 69166980 | not applicable | Yes
everted duplication | 8 + 128822002 | not applicable | Yes
80 bp insertion of non-exonic sequence into an exon | 1 - 92719084 | not applicable | Yes
35 bp deletion in exon | 11 + 69175116 | 11 + 69175153 | Yes
80 bp insertion of exonic sequence into another exon | 7 - 517064 | not applicable | Yes
80 bp insertion of repetitive sequence into an exon | 1 - 104037517 | not applicable | No
248 bp insertion of exonic sequence into another exon | 1 - 104038290 | 17 - 10486509 | No

We concluded that these methods produced accurate results and proceeded to apply them to our

whole genome sequencing data.
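The read simulation described in this subsection could be sketched roughly as below. This is a simplified illustration, not the scripts used in the study: all function names are hypothetical, insert sizes are drawn from a caller-supplied list standing in for the observed insert size distribution, and the assignment of real base quality strings is omitted. To mimic heterozygosity, the same function would be run on both the variant sequence and the corresponding normal sequence.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def pairs_for_coverage(seq_len, coverage, read_len):
    """Number of read pairs needed to reach the requested mean coverage."""
    return round(coverage * seq_len / (2 * read_len))


def simulate_pairs(seq, n_pairs, read_len, insert_sizes, rng):
    """Sample paired-end reads at random positions along `seq`.

    `insert_sizes` is an empirical list of fragment lengths (assumed to be
    taken from real data). Read 2 is reported as the reverse complement of
    the fragment's far end, as on a standard paired-end library.
    """
    usable = [s for s in insert_sizes if read_len <= s <= len(seq)]
    if not usable:
        raise ValueError("no insert size fits the template sequence")
    pairs = []
    for _ in range(n_pairs):
        insert = rng.choice(usable)
        start = rng.randint(0, len(seq) - insert)
        fragment = seq[start:start + insert]
        pairs.append(
            (fragment[:read_len], reverse_complement(fragment[-read_len:]))
        )
    return pairs


# Toy template; a real run would use the simulated variant sequence.
template = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
pairs = simulate_pairs(template, 20, 50, [200, 250, 300], random.Random(0))
```

Passing an explicit `random.Random` instance keeps the simulation reproducible, which makes it easier to compare Breakdancer output with and without the spiked-in reads.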

Breakdancer analysis of genomic rearrangements in DLBCL

The aligned reads of both the DLBCL and matching normal sample were analyzed with

Breakdancer in order to detect structural variant predictions unique to the DLBCL sample. The

mitochondrial chromosome (chromosome M) was used as a control. The positive control was the

number of reads supporting end-to-end joining of chromosome M because BreakDancer treats all

chromosomes as linear, whereas chromosome M is actually circular.


We detected one 4 kb deletion on chromosome 1 as the only rearrangement that was unique to

our DLBCL case (and not present in paired normal). The Breakdancer output in support of it is

indicated in Table S2, in addition to an example of a chrM to chrM rearrangement. There are two

possible explanations for why we did not observe more rearrangements: 1. They are not present,

and 2. The limitations of the short read sequencing platform and insert sizes reduced the ability

to detect such variants. The advent of longer read formats will provide new opportunities for the

comprehensive detection of genetic rearrangements.

Table S2: Rearrangements detected by whole genome sequencing.

Chr 1 | Position 1 | Orientation 1 | Chr 2 | Position 2 | Orientation 2 | Type | Size | Score | Reads
chr1 | 143804307 | 30+0- | chr1 | 143808440 | 0+27- | DEL | 4180 | 99 | 27
chrM | 1 | 237+2169- | chrM | 16631 | 2822+1245- | ITX | 15654 | 99 | 2017

Circos

Circos 10 (http://circos.ca/software/download/) was used to depict the whole-genome lymphoma

copy number and somatically acquired mutations separated by region (intergenic, regulatory,

exonic). Copy number alterations were depicted in Figure 1A by binning the read mappings into

15 kb windows, median centering and computing a ratio between the number of reads in the

lymphoma sample and its matched normal sample. Somatically acquired mutations in intergenic,

regulatory, and exonic regions were counted in 250 kb bins and depicted in Figure 1A. Bins in

which no mutations were detected are not plotted.

Exome Sequencing

Multiplexed Paired-End library preparation

Libraries were prepared as described in the Agilent SureSelect protocol, with modifications to

adapter sequences and addition of a sample pooling step prior to exome capture in order to

enable multiplexing.

Specifically, sheared DNA was purified, resuspended in 10 mM Tris-Cl pH 8.5, quantified on the

BioAnalyzer DNA1000 chip, end-repaired, and A-tailed. Barcoded 5' adapters were

prepared by annealing complementary oligos (Table S3).


Table S3: Sequences for barcoded 5' adapters.

Pool | Strand 1 sequence, 5' to 3' | Strand 2 (complement) sequence, 5' to 3' | Barcode sequence
1 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGCCTAAT | TTAGGCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | GCCTAA
1 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGTAGCCT | GGCTACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | GTAGCC
2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTTGGTCAT | TGACCAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | TGGTCA
2 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTATTGGCT | GCCAATAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | ATTGGC
3 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGATCTGT | CAGATCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | GATCTG
3 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTTCAAGTT | ACTTGAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | TCAAGT
4 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTCTGATCT | GATCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | CTGATC
4 | ACACTCTTTCCCTACACGACGCTCTTCCGATCTAAGCTAT | TAGCTTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | AAGCTA

After annealing, 5' adapters were mixed in equimolar ratio with universal 3' adapters (which have sequences identical to the Illumina PE sequence) and diluted to the same concentration as Illumina adapters. The eight barcoded adapters were mixed in pairs as indicated in Table S3, so that two barcoded adapters could be ligated to each sample. The purpose of assigning

two barcodes per sample was to reduce possible barcode-specific bias and increase sequence

diversity in the beginning of the reads, when clusters are resolved. The adapter:DNA ligation

ratio was 2μl adapter pool per μg of sheared DNA, as determined by BioAnalyzer. The ligated

library was amplified by Illumina PE PCR primers and 2x Phusion HF Master Mix. Post-PCR,

the library was purified and assayed on BioAnalyzer to determine size and concentration.
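The structure of the adapter oligos in Table S3 can be checked with a short sketch. Each strand 1 sequence consists of a common prefix, the 6-base barcode, and a trailing T, and each strand 2 sequence is the reverse complement of strand 1 without that final T. Reading the trailing T as the overhang that pairs with A-tailed inserts is our interpretation, and the names `COMMON_PREFIX` and `build_adapter` are hypothetical.

```python
# Common portion of every strand 1 oligo in Table S3.
COMMON_PREFIX = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_adapter(barcode):
    """Return (strand 1, strand 2) oligos for a 6-base barcode.

    Strand 1 = prefix + barcode + 'T' (assumed T overhang for ligation to
    A-tailed DNA); strand 2 = reverse complement of prefix + barcode.
    """
    duplex_region = COMMON_PREFIX + barcode
    return duplex_region + "T", reverse_complement(duplex_region)


# Barcodes as listed in Table S3, grouped by pool.
POOLS = {
    1: ["GCCTAA", "GTAGCC"],
    2: ["TGGTCA", "ATTGGC"],
    3: ["GATCTG", "TCAAGT"],
    4: ["CTGATC", "AAGCTA"],
}
adapters = {bc: build_adapter(bc) for pool in POOLS.values() for bc in pool}
```

Because the barcode sits immediately before the ligation junction, it is read as the first six bases of read 1, which is why mixing two barcodes per sample increases base diversity in the first sequencing cycles.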

Multiplexed Exome capture

Four libraries (125 ng each) were pooled, vacuum-dried at 45°C, and resuspended in 3.4 μl

water. Libraries were blocked and prepared according to Agilent protocol and hybridized against

Agilent SureSelect All Exome baits for 24 hours at 65°C in a thermal cycler. The biotinylated

exome baits, with exonic DNA hybridized, were purified on SPRI magnetic beads (Agilent) and


washed three times according to Agilent SureSelect protocol. The capture library pool was

amplified with 12 cycles of PCR, and molarity and size distribution were measured by

BioAnalyzer.

Illumina Sequencing

Libraries were diluted to 5 pM for Illumina clustering and paired-end sequenced over 9 days.

Exome Sequence Alignment and Variant Processing

Alignment steps were performed as described for whole genome sequence processing with the

following modifications to address the problems associated with some 100bp paired end reads

reading through the insert and into the opposite adapter. After the first alignment step with

BWA, any discordantly mapping or unmapped read pairs were re-aligned using Novoalign

(Novocraft.com), a Needleman–Wunsch algorithm based aligner 11, with a “-softclip” setting.

Remaining unmapped reads were clipped to 35bp and re-aligned with BWA to remove imperfect

adapter matches that GATK would not remove. This alignment strategy resulted in excellent

overlap in our exome sequencing data from the HapMap sample (NA12762) and the data from

the 1000 Genomes Project (described below).

Merging of data from different samples was performed using GATK, followed by extraction of

CCDS exons 12 using BEDTools 13. Overlaps between samples were computed using VCFtools

(http://vcftools.sourceforge.net/) and AWK scripts. 

SAMTools pileups were generated for 73 DLBCL primary tumors and 21 cell lines, 34 matching

normal, and 257 control exomes. These 257 control exomes consisted of one prepared in-house

(NA12762), as well as 256 from publicly available datasets, all of which were processed from

raw sequencing reads using methods identical to those used for our sequenced exomes.

These sequence variants were annotated for gene and predicted function using Sequence Variant

Analyzer 6. These data were collapsed by unique genomic position and used for annotation of the

data. SAMtools mpileup with settings “-C50 -m3 -F0.0002” was run on all 73 DLBCL primary

cases, 21 cell lines, 34 matching normal and 257 control exomes concurrently and output to a


VCF file. Each DLBCL variant was required to have an instance of genotype quality greater

than 30 and read depth greater than 5.
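A filter of this kind might look like the sketch below. It reads the requirement as "at least one sample carries a genotype with quality above 30 and depth above 5", which is our interpretation of the sentence above. The function names are hypothetical, and the colon-separated FORMAT and sample strings follow the general VCF layout rather than the exact fields emitted by the pipeline.

```python
def sample_passes(format_keys, sample_field, min_gq=30, min_dp=5):
    """True if one VCF sample column has GQ > min_gq and DP > min_dp."""
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    try:
        gq = float(values["GQ"])
        dp = int(values["DP"])
    except (KeyError, ValueError):
        # Missing keys or missing values (".") never pass the filter.
        return False
    return gq > min_gq and dp > min_dp


def variant_passes(format_keys, sample_fields, min_gq=30, min_dp=5):
    """True if at least one sample satisfies both thresholds."""
    return any(
        sample_passes(format_keys, field, min_gq, min_dp)
        for field in sample_fields
    )
```

Both comparisons are strict, matching the wording "greater than 30" and "greater than 5".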

Power Calculations and Statistics

Power calculations were determined by calculating the expected sample size needed to detect

somatic variants with a prevalence of 5% or greater in DLBCL. From our list of variants

discovered in our samples, the expected subsets discoverable for sample sizes n, or v(n), were

determined by calculating the average number of variants found over combinations of that n. In

cases where calculating the number of variants discovered over all possible combinations of size

n was computationally infeasible, the expected value was calculated for 1000 randomly selected

combinations. Figure 2C is a plot of d(n) as a function of n. d(n) was observed to exhibit

exponential decay behavior, and thus was fitted to the standard equation $d(n) = \alpha_0 e^{-\alpha_1 n}$; after log

transformation of d(n), linear regression was used to estimate the coefficients for the equation.

We observed an excellent fit ($r^2 = 0.8768$). 95% confidence intervals were calculated from the

regression model for log(d(n)), from which the confidence intervals for d(n) were reconstituted

by taking their exponential.
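The two computational steps in this section, averaging the number of variants discovered over combinations of n samples and fitting an exponential decay by log-linear regression, can be sketched as follows. This is a minimal illustration with hypothetical function names; it uses ordinary least squares on log-transformed values and does not reproduce the confidence interval calculation.

```python
import math
import random
from itertools import combinations


def expected_discovered(variant_sets, n, max_combos=1000, rng=None):
    """Mean number of distinct variants found in subsets of n samples.

    All combinations are enumerated when there are at most `max_combos`
    of them; otherwise `max_combos` random subsets are drawn.
    """
    rng = rng or random.Random(0)
    total = len(variant_sets)
    if math.comb(total, n) <= max_combos:
        subsets = combinations(range(total), n)
    else:
        subsets = (rng.sample(range(total), n) for _ in range(max_combos))
    sizes = [
        len(set().union(*(variant_sets[i] for i in idx))) for idx in subsets
    ]
    return sum(sizes) / len(sizes)


def fit_exponential_decay(ns, ds):
    """Fit d(n) = a0 * exp(-a1 * n) by least squares on log(d(n)).

    Returns (a0, a1). All values in `ds` must be positive.
    """
    ys = [math.log(d) for d in ds]
    k = len(ns)
    mean_n = sum(ns) / k
    mean_y = sum(ys) / k
    sxx = sum((n - mean_n) ** 2 for n in ns)
    sxy = sum((n - mean_n) * (y - mean_y) for n, y in zip(ns, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_n
    return math.exp(intercept), -slope


# Noise-free synthetic data: the fit should recover a0 = 5 and a1 = 0.3.
ns = [1, 2, 3, 4, 5]
a0, a1 = fit_exponential_decay(ns, [5 * math.exp(-0.3 * n) for n in ns])
```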

Publicly Available Controls

In addition to 256 publicly available exomes, 1000 Genomes Project pilot 1 (SNV calls for 179

individuals)14, HapMap 3 15, and NHLBI Exome Sequencing Project (URL:

http://evs.gs.washington.edu/EVS/) data were downloaded to gauge population allele

frequencies.

Coverage

A custom Python wrapper script was used to assemble coverage information from SAMtools and

BEDTools. SAMtools flagstat was used to compute the number and percent of reads that mapped

to the genome. Both depth and breadth of coverage for each exome were computed using

BEDTools. Coverage statistics for all the DLBCL samples sequenced in this study are

summarized in Figure 2A and Dataset S2.
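For readers unfamiliar with the two coverage measures, the sketch below shows how depth and breadth are defined over a list of per-base depths. In the study these values came from BEDTools output; the function name and the input format here are hypothetical.

```python
def coverage_summary(depths, min_depth=1):
    """Summarize per-base read depths over the target region.

    mean_depth: average number of reads covering a base.
    breadth: fraction of bases covered by at least `min_depth` reads.
    """
    if not depths:
        raise ValueError("no positions supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": sum(depths) / len(depths),
        "breadth": covered / len(depths),
    }


summary = coverage_summary([0, 10, 20, 30])
```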


Variant Validation

HapMap Sample Validation

The HapMap sample NA12762 (Coriell) was subjected to exome-sequencing and analysis as

described above. This allowed us to test our processing pipeline for overlap of SNV calls

between our methodology and that of the 1000 Genomes Project. We observed 99.5%

concordance of our genotype calls with those of the 1000 Genomes Project for NA12762.

SNP Array Validation

43 DLBCL DNA samples and NA12762 DNA were also hybridized to the Illumina Human

OmniExpress Beadchip, which includes 750,000 probes. Fluorescence signals were imported

into GenomeStudio software v. 2010.2 (Illumina) using a standard cluster file provided by the

platform, and SNVs were called using standard GenomeStudio algorithms.

We compared the SNV calls from the array against those from our whole-exome dataset. The

range of concordance percentages for the respective samples was 89 to 99% (Average 94%).

NA12762 showed 97% concordance between whole-exome data and Illumina SNV chip data.

This further validated the accuracy of our data processing pipeline.

Raindance Sequencing Validation

At the outset of the study, we selected 179 genes of interest (Table S4) for separate validation

using RainDance (Lexington, MA) technology, which relies on PCR amplification of exons of

interest in microdroplet reactions, followed by pooling for massively parallel sequencing using

standard Illumina protocols. The amplification and sequencing were carried out at

Expression Analysis Inc. (Durham, NC).


Table S4: 179 genes sequenced using Raindance technology.

ABCB1 CLEC3A HRAS MEN1 PIK3C2A SIGLEC1
ABCC1 CSDE1 HSP90AA2 MET PIK3C2B SIPA1
ABL2 CTCF HSP90AB1 MKRN2 PIK3C3 SLC35B2
ACBD5 CTNNA1 HSP90B1 MSH6 PIK3CA SLC7A11
AICDA CXADR HSPA8 MST1R PIK3CB SMAD2
AIM1 CYP3A4 IFNGR1 MYCBP2 PIK3CD SMAD4
AK7 DDC IL20RA MYST4 PIK3CG SMO
AKAP12 DNAH9 IL22RA2 NCOR1 PIK3R1 SPIB
AKT1 DUSP10 IRF4 NF1 PIK3R2 SRC
AKT2 EGFR JAK1 NF2 PIK3R3 STK4
ALDH1A2 EP300 KIAA1553 NQO1 PKD1L2 SUFU
ALK ERBB2 KIT NR4A3 PMS2 SUMO4
AP4M1 ERBB4 KITLG NRAS PRAMEF12 TEC
ATG5 ESR1 KRAS NT5E PRAMEF4 TGFBR1
ATM EVI1 LAMA4 NTRK3 PRAMEF7 TGFBR2
AURKB FANCA LAMA5 NUMB PRDM1 TNFAIP3
BCAR3 FBXO11 LATS1 OPRM1 PRDM16 TNR
BCDO2 FBXO30 LATS2 PACRG PTK2 TP53
BCL6 FBXO5 LHX2 PAPPA PTK2B TRAF3IP3
BCLAF1 FGFR3 LHX9 PARK2 RAET1G TTC19
BMP2 FGFR4 LMO2 PAX8 RAET1L ULBP1
BMP6 FIGNL1 LY6G6C PDE4DIP RAF1 ULBP2
BRAF FOXO1A LY6G6D PDE7B RB1 ULBP3
BRCA1 FOXO3 MAP2K1 PDGFRA RET VIL2
BRCA2 MTOR MAP2K2 PDK1 RIPK5 WNT2B
C11orf65 FZD10 MAP2K4 PDK2 RUNX1 YWHAZ
C6orf203 GLI2 MAP3K11 PDPK1 SASH1 ZDHHC14
CDH1 GSK3A MAP3K4 PDSS2 SAV1 ZIC1
CDKN1B HHIP MAP3K5 PERP SFRP1 ZNFX1
CIITA HIVEP2 MAP3K7IP2 PGBD3 SFRP2

RainDance sequencing of the 179 genes in 8 cases yielded over 100-fold sequencing coverage for those genes. These sequencing reads were then aligned and processed in the same way as the exomes had been. The overlap between RainDance and whole-exome sequencing variant calls for the respective samples was computed using custom shell scripts. Genotype calls were made for the presence or absence of a mutation call at any given position. The mean concordance of the SNV calls identified by the two methods was 99.1%, with a minimum concordance of 98.8% (Table S5), indicating that the two methods generate similar results despite major differences in methodologies and sequencing coverage.


Table S5: Overlap of Raindance SNVs with exome sequencing.

Sample | # SNVs concordant | # SNVs discordant | % concordance
BJAB | 1937 | 16 | 99.1807
DLBCL798 | 2567 | 24 | 99.0737
DLBCL832 | 1333 | 16 | 98.8139
DLBCL823 | 2265 | 25 | 98.9083
DLBCL827 | 1520 | 9 | 99.4114
DLBCL833 | 2234 | 16 | 99.2889
DLBCL825 | 1438 | 15 | 98.9677
DLBCL705 | 2055 | 23 | 98.8932
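The percentages in Table S5 follow directly from the concordant and discordant counts. The sketch below recomputes them; the counts are copied from the table, while the function and variable names are hypothetical and the original comparison was done with custom shell scripts rather than this code.

```python
def percent_concordance(concordant, discordant):
    """Percentage of compared SNV calls on which the two methods agree."""
    total = concordant + discordant
    if total == 0:
        raise ValueError("no calls to compare")
    return 100.0 * concordant / total


# (concordant, discordant) SNV counts per sample, from Table S5.
TABLE_S5 = {
    "BJAB": (1937, 16),
    "DLBCL798": (2567, 24),
    "DLBCL832": (1333, 16),
    "DLBCL823": (2265, 25),
    "DLBCL827": (1520, 9),
    "DLBCL833": (2234, 16),
    "DLBCL825": (1438, 15),
    "DLBCL705": (2055, 23),
}

per_sample = {
    name: round(percent_concordance(c, d), 4)
    for name, (c, d) in TABLE_S5.items()
}
mean_concordance = sum(per_sample.values()) / len(per_sample)
```

The per-sample mean rounds to 99.1% and the lowest value to 98.8%, in line with the figures quoted in the text.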

Sanger Sequencing Validation

Exon primers described previously16 were used to amplify regions of interest. Each 12.5 μl

reaction contained: 5 ng genomic DNA of interest, 6.25 μl of High Fidelity PCR Master Mix

(Roche, 12 140 314 001), and 300 nM of each primer. Amplification was carried out as

described by the manufacturer (94°C 2:00, 10 cycles of 94°C 0:10, 50°C 1:10, 72°C 0:45, 20

cycles of 94°C 0:15, 50°C 0:30, 74°C 0:45 incremented by 5s per cycle, 72°C 2:30). The

reaction specificity was verified by agarose gel, and reactions were purified with Agencourt

AMPure XP beads using manufacturer instructions (Agencourt, A63881).

We validated 118 variants at 32 unique loci in 26 genes using Sanger sequencing. We found the concordance between Sanger genotype calls and NGS to be excellent (111/118, 94%). We expect the concordance rates for variants we did not sample here to be comparable. The 7 discordant cases all occurred because next-generation sequencing called a variant that was not supported by Sanger sequencing (false positive). We did not observe any cases where a variant determined to be absent by NGS was found to actually be present by Sanger sequencing (false negative). Because our NGS calls for the absence of a variant are highly accurate, we are highly confident that variants called as somatic (present in the tumor, absent from the paired normal) are genuinely somatic.

Our NGS accuracy is high because we have deliberately used conservative cutoffs for identifying

our mutations, which must be supported by a minimum of 5 reads and quality score

corresponding to an error rate of <100. Thus, our identified mutations agreed well with our 3

methods of validation (Sanger, Raindance, and SNP array). NGS (or any form of sequencing for that matter) has the same accuracy for genotyping regardless of the population frequency of the variant it measures, so our high concordance rates for SNP array calls are indicative of the accuracy of our experimental and bioinformatic pipeline methods for identifying known and novel variants.

Table S6: Summary of Sanger sequencing results. Genes with an asterisk are depicted in the sample chromatograms in Figure S2. Function: NS=Nonsynonymous. Concordant: total experiments consistent between NGS and Sanger. Discordant: inconsistent between NGS and Sanger. NGS:+, Sanger:+: comparison of results where NGS was calling a variant as present. NGS:-, Sanger:-: comparison of results where NGS called a sample negative for the variant.

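Gene | Function | Concordant | Discordant | % agree | NGS:+ | Sanger:+ | % agree | NGS:- | Sanger:- | % agree
ALAS1 | NS | 0 | 2 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | NA
ANK2 | NS | 0 | 2 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | NA
*HIST1H1C | NS | 1 | 0 | 100.0 | 1 | 1 | 100.0 | 0 | 0 | NA
*HIST1H1C | NS | 2 | 2 | 50.0 | 2 | 2 | 100.0 | 0 | 0 | NA
*HIST1H2BK | NS | 2 | 2 | 50.0 | 2 | 2 | 100.0 | 0 | 0 | NA
*HIST1H2BK | NS | 1 | 1 | 50.0 | 1 | 1 | 100.0 | 0 | 0 | NA
*HIST1H2BK | NS | 1 | 1 | 50.0 | 1 | 1 | 100.0 | 0 | 0 | NA
*HIST1H2BK | NS | 1 | 1 | 50.0 | 1 | 1 | 100.0 | 0 | 0 | NA
ARID1A | Frameshift Deletion | 2 | 0 | 100.0 | 1 | 1 | 100.0 | 1 | 1 | 100.0
BSCL2 | NS | 3 | 0 | 100.0 | 2 | 2 | 100.0 | 1 | 1 | 100.0
CCDC46 | NS | 3 | 0 | 100.0 | 2 | 2 | 100.0 | 1 | 1 | 100.0
CEP72 | NS | 2 | 0 | 100.0 | 1 | 1 | 100.0 | 1 | 1 | 100.0
DECR2 | NS | 2 | 0 | 100.0 | 1 | 1 | 100.0 | 1 | 1 | 100.0
GPD2 | NS | 4 | 0 | 100.0 | 3 | 3 | 100.0 | 1 | 1 | 100.0
LRIG3 | NS | 3 | 0 | 100.0 | 2 | 2 | 100.0 | 1 | 1 | 100.0
LRIG3 | NS | 1 | 1 | 50.0 | 0 | 1 | 0.0 | 1 | 1 | 100.0
OR10AG1 | NS | 1 | 0 | 100.0 | 1 | 1 | 100.0 | 1 | 1 | 100.0
PLEKHA7 | NS | 3 | 0 | 100.0 | 2 | 2 | 100.0 | 1 | 1 | 100.0
*PIK3CD | NS | 2 | 0 | 100.0 | 1 | 1 | 100.0 | 1 | 1 | 100.0
ALDH1L2 | NS | 3 | 0 | 100.0 | 1 | 1 | 100.0 | 2 | 2 | 100.0
PIK3CD | NS | 3 | 0 | 100.0 | 1 | 1 | 100.0 | 2 | 2 | 100.0
PIM1 | NS | 3 | 0 | 100.0 | 1 | 1 | 100.0 | 2 | 2 | 100.0
TP53 | NS | 3 | 0 | 100.0 | 1 | 1 | 100.0 | 2 | 2 | 100.0
CIC | NS | 5 | 0 | 100.0 | 2 | 2 | 100.0 | 3 | 3 | 100.0
HAPLN3 | NS | 5 | 0 | 100.0 | 2 | 2 | 100.0 | 3 | 3 | 100.0
MGAT4A | NS | 8 | 0 | 100.0 | 5 | 5 | 100.0 | 3 | 3 | 100.0
MPL | NS | 6 | 0 | 100.0 | 3 | 3 | 100.0 | 3 | 3 | 100.0
POU2F2 | NS | 5 | 1 | 83.3 | 2 | 3 | 66.7 | 3 | 3 | 100.0
CARD11 | NS | 8 | 0 | 100.0 | 4 | 4 | 100.0 | 4 | 4 | 100.0
DOCK2 | NS | 6 | 0 | 100.0 | 2 | 2 | 100.0 | 4 | 4 | 100.0
KIF21B | NS | 6 | 0 | 100.0 | 2 | 2 | 100.0 | 4 | 4 | 100.0
MYD88 | NS | 16 | 1 | 94.1 | 11 | 10 | 90.9 | 6 | 6 | 100.0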


Figure S2: Representative chromatograms from Sanger experiments summarized in Table S6 (Starred genes). The top trace for each experiment depicts the trace expected if the genomic sequence matches the reference genome. The bottom trace is the chromatogram actually observed in DLBCL samples.


MLL3 Variants


Figure S3: Example variants discovered within the MLL3 gene, which was the most recurrently mutated gene in this study. Exome sequencing reads are visualized using the Integrative Genomics Viewer; grey color indicates matches to the reference genome, and mismatches are labelled in colored letters. A. Somatic mutation in DLBCL835 (top), clearly absent from the matched normal sequence (bottom). B. Six additional mutations are shown.


Identification of DLBCL Cancer Genes

The overall schema for identifying DLBCL cancer genes is summarized in Figure S4.

Genes mutated in DLBCL were identified by analyzing the 73 primary tumor samples. The

initial set of DLBCL mutations was determined from the 34 DLBCL primary tumors with

paired normal samples, which constituted the discovery set. Data from cell lines were not used in

this analysis.

Figure S4: Summary of DLBCL variant and cancer gene identification from 73 primary tumor samples.

For each of these cases, we identified mutations that were present in tumor but absent from the paired normal cases (somatically mutated). We eliminated common genetic variants by excluding those that occurred in the general population as identified from the following sources: dbSNP 17, publicly available data (pilot 1) from the 1000 Genomes Project 18, 256 recently published exomes from otherwise healthy individuals19-21, one additional HapMap exome that we sequenced in this study, and those found to have a minor allele frequency of greater than 1% in the 6500 exome dataset from the NHLBI Exome Sequencing Project.
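The filtering logic in the paragraph above amounts to a set difference followed by two exclusions. The sketch below is a simplified illustration with hypothetical names: variants are represented as (chromosome, position, reference, alternate) tuples, `known_common` stands in for the union of the population resources, and `maf` stands in for a lookup of minor allele frequencies from the NHLBI exome dataset.

```python
def somatic_rare_variants(tumor, normal, known_common, maf, maf_cutoff=0.01):
    """Variants in the tumor, absent from the paired normal, and not common.

    tumor, normal: iterables of variant keys for one tumor-normal pair.
    known_common: set of variant keys seen in population resources.
    maf: dict mapping variant key to minor allele frequency.
    """
    somatic = set(tumor) - set(normal)
    return {
        v for v in somatic
        if v not in known_common and maf.get(v, 0.0) <= maf_cutoff
    }


# Toy example with made-up coordinates.
tumor = {("1", 100, "A", "T"), ("2", 200, "C", "G"),
         ("3", 300, "G", "A"), ("4", 400, "T", "C")}
normal = {("1", 100, "A", "T")}              # germline, so not somatic
known_common = {("2", 200, "C", "G")}        # present in a population database
maf = {("3", 300, "G", "A"): 0.05}           # above the 1% cutoff
kept = somatic_rare_variants(tumor, normal, known_common, maf)
```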

We identified 5884 variants that were somatically mutated in at least one of these 34 tumor-normal

pairs. From this list, we identified 2589 variants that represented frameshift, nonsense, missense,

or loss of a stop codon changes, corresponding to 2140 genes. These 2140 genes were examined

in all primary DLBCL cases and found to have 4928 frameshift, nonsense, missense or loss of a

stop codon variants. Among these variants were also 125 variants from 58 genes that were

identified as potential mutational hotspots, occurring 4 or more times in DLBCLs (and none in

controls). We estimated the functional impact of each of these 4928 variants on the encoded

protein using a program that outputs a functional index score (described below). We also tallied

the number of variants by gene and noted whether the gene had been previously annotated as a

cancer gene in the COSMIC database 22. Finally, we estimated the rate of nonsynonymous

variation in these genes in normal controls. We limited this analysis to the 257 previously sequenced

exomes from otherwise healthy individuals because these cases have similar exonic coverage to

our DLBCLs, and were processed using methods identical to those used to characterize the

DLBCL exomes.
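The hotspot criterion mentioned above (a variant occurring 4 or more times in DLBCLs and never in controls) can be sketched with a counter. The names are hypothetical, and each case is counted at most once per variant, which is our reading of "occurring 4 or more times".

```python
from collections import Counter


def hotspot_variants(tumor_calls, control_variants, min_cases=4):
    """Variants seen in at least `min_cases` tumors and in no control.

    tumor_calls: dict mapping case ID to an iterable of variant keys.
    control_variants: set of variant keys observed in any control exome.
    """
    counts = Counter(
        v for calls in tumor_calls.values() for v in set(calls)
    )
    return {
        v for v, c in counts.items()
        if c >= min_cases and v not in control_variants
    }


# Toy data: "v1" recurs in four cases; "v2" also recurs but is in controls.
calls = {
    "case1": ["v1", "v2"],
    "case2": ["v1", "v2"],
    "case3": ["v1", "v2", "v3"],
    "case4": ["v1", "v2"],
}
hotspots = hotspot_variants(calls, control_variants={"v2"})
```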

We generated a statistical model for genes likely to be drivers. It takes into account 4 features:

gene size, background nonsynonymous mutation rates in normal samples, somatically acquired

events, and the rate of these events in carriers. Given that mutations are rare and the number of

genes is high relative to the number of samples, standard regression techniques do not apply.

Also chi-squared tests of independence, or other similar tests, for each individual gene, besides

having the obvious problem of multiple testing, would never account for important mutants that

occur with very low frequency but have other important characteristics. After filtering for genes

in which we observed a minimum of 1 somatic event and 1 additional rare event from the same

class or presence in the COSMIC database, we ranked genes based on their distance from known

cancer genes.


We calculated the distance of a gene from a pool of known cancer genes based on the 4 variables

listed above. Let μ and Σ be the mean vector and covariance matrix for the population of known

cancer genes, from which we calculate cancer gene candidate j’s distance. We use the well-known Mahalanobis distance

$$D = (g_j - \mu)^T \Sigma^{-1} (g_j - \mu), \quad \text{where} \quad \Sigma = \frac{n_1 \Sigma_1 + n_2 \Sigma_2}{n_1 + n_2 - 2}.$$

This distance has several desirable properties. Unlike the Euclidean metric, which is the special case of D where Σ=I (the identity matrix) and can only deal with circular forms, it accounts for the variances of the variables and the correlations among them. Here, because μ and Σ are unknown, we estimate them from the sample mean and sample covariance matrix for the population. In the special case where we assume that the population is multivariate normal, D is drawn from the chi-squared distribution with p degrees of freedom, where p is the number of variables. Therefore we can test whether each individual gene belongs to the population of cancer genes or not, based on this assumption.
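The distance and the chi-squared test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`solve`, `mahalanobis_sq`, `chi2_sf_even`) are hypothetical, D is taken to be the quadratic form as written (without a square root), and the closed-form tail probability holds only for an even number of degrees of freedom, such as the p = 4 variables used here.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis_sq(g, mu, sigma):
    """Return D = (g - mu)^T sigma^{-1} (g - mu) for one gene's feature vector g."""
    diff = [gi - mi for gi, mi in zip(g, mu)]
    z = solve(sigma, diff)  # z = sigma^{-1} (g - mu), without forming the inverse
    return sum(d * zi for d, zi in zip(diff, z))


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-squared variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

With an identity covariance matrix the function reduces to the squared Euclidean distance, which matches the special case noted in the text.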

We compared two populations: known cancer genes from the literature and our candidate novel cancer genes. The p-value measuring the level of distance is calculated from the F-distribution F(d1,d2) with d1 and d2 degrees of freedom respectively:

$$\frac{n_1 n_2\,(n_1 + n_2 - p - 1)}{(n_1 + n_2)\; p\; (n_1 + n_2 - 2)}\; D \;\sim\; F(p,\; n - p - 1), \qquad n = n_1 + n_2.$$
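As a small worked illustration of the scaling above, the sketch below converts a distance D into an F statistic and its degrees of freedom. It assumes the expression is the standard two-sample Hotelling form, that is, D multiplied by n1 n2 (n1 + n2 - p - 1) / ((n1 + n2) p (n1 + n2 - 2)) and referred to F(p, n1 + n2 - p - 1). The function name `f_statistic` is hypothetical, and the p-value lookup itself is left to a statistics library.

```python
def f_statistic(d, n1, n2, p):
    """Scale a distance d between two groups of sizes n1 and n2 (p variables)
    into an F statistic; returns (F, (df1, df2)).

    Assumes the two-sample Hotelling scaling; this is an illustration only.
    """
    n = n1 + n2
    if n - p - 1 <= 0:
        raise ValueError("need n1 + n2 > p + 1")
    scale = (n1 * n2 * (n - p - 1)) / (n * p * (n - 2))
    return scale * d, (p, n - p - 1)
```

For example, with n1 = n2 = 10 and p = 2 the scale factor is 1700/720, so a distance of 1 maps to an F statistic of about 2.36 on (2, 17) degrees of freedom.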

Genes closest in distribution ($P < 10^{-6}$) to known cancer genes were identified as DLBCL cancer genes. Previously annotated cancer genes were required to have at least one involved case, while genes that were not previously annotated as cancer genes were required to have a minimum of two involved primary tumor DLBCL cases. Using these statistics, we found that 90% of the known cancer genes and the newly identified DLBCL cancer genes had at least one variant with a functional index of 0.9 or higher, and a rate of nonsynonymous variants of less than one per case in the unmatched controls.

Using these criteria, we identified a total of 426 genes that were recurrently mutated. After excluding those for which more than two-thirds of the variants were also found by the NHLBI Exome Sequencing Project, 322 genes remained. Of these, 52 genes had been previously annotated as cancer genes. These 322 genes (Dataset S3) comprised 1418 variants, which are listed in Dataset S4.

Figure S5 (below) shows a side-by-side comparison of DLBCL cancer genes identified in this

study and those that were previously identified in COSMIC. The distributions were largely

identical in the two gene groups.

Figure S5

Figure S5A depicts the distribution of the number of cases affected by the genes annotated in COSMIC (orange line) as well as recurrently mutated genes in our data (blue line).

Figure S5B shows the distribution of non-synonymous variation in normal controls for the genes annotated in COSMIC (orange line) and the recurrently mutated genes in our data (blue line).

Figure S5C illustrates the distribution of the computed functional index scores for variants in the genes annotated in COSMIC (orange line) compared to that for the recurrently mutated genes in our data (blue line).

Overlap with other studies

We observed partial overlap in gene lists between our study and three other DLBCL studies that used similar methodologies and deep sequencing23-25, shown in Figure 5. We explored whether the degree of overlap changed when genes from the Lohr study were stratified by mutation frequency. As expected, the overlap increased as the number of patients in whom a gene was mutated increased (Figure S6, below). Similar results were observed when using genes from the other studies.
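The stratified comparison described above can be sketched as follows. This is a minimal illustration with hypothetical names (`percent_overlap`, `reference_counts`, `other_genes`) and placeholder gene labels; it assumes overlap is measured as the percentage of reference-study genes above a patient-count threshold that also appear in another study's gene list.

```python
def percent_overlap(reference_counts, other_genes, min_patients):
    """Percentage of reference genes mutated in more than `min_patients`
    patients that also appear in `other_genes`.

    reference_counts: dict mapping gene -> number of patients with a somatic
    mutation in the reference study. other_genes: iterable of gene names.
    """
    selected = {g for g, n in reference_counts.items() if n > min_patients}
    if not selected:
        return 0.0
    return 100.0 * len(selected & set(other_genes)) / len(selected)


# Placeholder data for illustration only (not taken from any study).
reference = {"GENE_A": 1, "GENE_B": 3, "GENE_C": 6}
other = {"GENE_B", "GENE_C"}
overlap_by_threshold = {t: percent_overlap(reference, other, t) for t in range(6)}
```

Evaluating the function over increasing thresholds gives one series of the kind plotted against the number of patients.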

[Figure S6 chart. x-axis: Number of Patients in Lohr study with Gene Somatically Mutated (>0, 58 genes; >1, 58 genes; >2, 54 genes; >3, 36 genes; >4, 28 genes; >5, 17 genes). y-axis: % Overlap (0 to 100). Series: Our work, Pasqualucci et al, Morin et al.]

However, even among these more frequently mutated genes, the genes that overlapped between studies were different. While some gene mutations were observed in multiple studies, including MYC26, B2M27 and PRDM1 23-25,28, a substantial portion of the genes identified in each study were not identified in the others. These observations again highlight the underlying genetic heterogeneity of these tumors and the importance of biological validation of these findings.

We further explored whether this degree of heterogeneity also occurred in two published studies

that applied exome sequencing to define the genetics of head and neck cancer. While one study29

reports 462 genes as recurrently mutated in the disease and the other30 reports 199 genes, only 20

genes overlapped (Venn diagram, Figure S7). These findings suggest that our observations

regarding the heterogeneity of DLBCLs may also hold true in other cancers.


Figure S6: The accompanying chart shows the degree of overlap between different studies and that of Lohr et al, as a function of the frequency of somatic events. We found increasing overlap as the number of somatic events increases.

Figure S7: The accompanying Venn diagram depicts the comparison of overlapping gene mutations from two head and neck cancer studies. The number in parentheses indicates the number of genes identified in the study.


Genes with low coverage

We noted a number of genes that were not covered in our analysis, either because they are not Version 36 CCDS genes (e.g. MLL2) or for methodological reasons (e.g. RERE). All genes with fewer than four supporting reads in all samples were systematically excluded from analysis. Genes with average depth less than 4 are listed in Table S7. These genes collectively comprise fewer than 5% of the total genes.
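The two coverage criteria above can be sketched as simple filters. This is an illustration only: the function names and the per-gene, per-sample depth dictionary are hypothetical, and the gene labels are placeholders.

```python
def mean_depth(depths):
    """Average read depth across samples (0.0 if no samples are given)."""
    return sum(depths) / len(depths) if depths else 0.0


def low_coverage_genes(depth_by_gene, min_depth=4):
    """Genes whose average depth across samples is below `min_depth`."""
    return sorted(g for g, d in depth_by_gene.items() if mean_depth(d) < min_depth)


def excluded_genes(depth_by_gene, min_reads=4):
    """Genes with fewer than `min_reads` supporting reads in every sample."""
    return sorted(g for g, d in depth_by_gene.items() if all(x < min_reads for x in d))


# Placeholder depths for three samples per gene (illustration only).
depths = {"GENE_A": [0, 2, 3], "GENE_B": [10, 0, 1], "GENE_C": [30, 25, 40]}
```

In this toy input, GENE_B has a low average depth but is well covered in one sample, so it is flagged by the first filter and not by the second.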

Table S7: Genes with average depth less than 4 in the exome sequencing data.

AANAT C20orf55 FAM44A JPH3 NKX2.3 RNF39 TMEM95 CCDS12550.1
AC005020.5 C20orf75 FAM53B JPH4 NKX2.8 RP11.299N6.3 TMEPAI CCDS12623.1
AC005077.5 C21orf2 FAM64A JSRP1 NKX6.2 RP4.695O20__B.10 TNFAIP8L1 CCDS12697.1
AC005691.1 C21orf56 FAM83H JUNB NKX6.3 RP4.697K14.7 TNFRSF12A CCDS12754.1
AC008267.6 C21orf58 FBXL14 JUND NME3 RP5.1187M17.10 TNFRSF18 CCDS13044.1
AC008623.4 C2orf54 FBXL15 KANK3 NOL10 RPP25 TNFRSF25 CCDS13045.1
AC008735.9 C3orf18 FBXO2 KCNG2 NOL3 RPS15 TNFRSF4 CCDS13269.1
AC008888.7 C6orf108 FCHSD2 KCNK12 NOXA1 RPS6KB2 TNFRSF6B CCDS13757.1
AC010409.6 C7orf43 FCRLB KCTD17 NPDC1 RUNX3 TNFSF13 CCDS13819.1
AC011498.7 C7orf50 FEV KIAA1191 NPPA S1PR5 TNFSF14 CCDS14024.1
AC020763.6 C8orf55 FFAR1 KIAA1522 NPPB SAMD11 TNFSF9 CCDS1405.1
AC084125.9 C9orf123 FGF22 KIAA1683 NPW SAMDC1 TNK1 CCDS14107.1
AC091132.16 C9orf140 FGF3 KIAA2013 NR2F6 SBK1 TNK2 CCDS14151.1
AC091152.18 C9orf151 FIBCD1 KIF19 NRBP2 SCAF1 TNNC2 CCDS14152.1
ADAM33 C9orf166 FIZ1 KIFC2 NRTN SCNN1D TNNI1 CCDS14199.1
ADAM8 C9orf167 FKRP KISS1 NTN1 SCRIB TNNI2 CCDS14215.1
ADAMTSL5 CABP4 FOSL1 KISS1R NTN2L SCRT2 TNNT3 CCDS14239.1
ADAT3 CACNG6 FOXF2 KLC3 NUDT16L1 SDF4 TPD52L2 CCDS14256.1
ADRA1D CAMSAP1 FOXI2 KLF1 NUDT8 SEMA6B TPSG1 CCDS14268.1
ADRB3 CASKIN1 FOXN4 KLF14 NUMBL SF3A2 TRAPPC5 CCDS14307.1
AF038458.1 CASZ1 FOXQ1 KLF16 NXNL2 SH3GL1 TRIM35 CCDS14311.1
AGRN CBX6 FRAT1 KLF2 OGFR SHF TRIM47 CCDS14315.2
AKT1S1 CCDC102A FSCN1 KLF4 OLFM1 SHROOM1 TRIM65 CCDS14316.1
AL031595.4 CCDC130 FST KLHL21 OLIG1 SIGLEC15 TRIM7 CCDS14416.1
AL031705.25 CCDC14 FSTL3 KLK12 OLIG2 SIPA1 TRIM72 CCDS14448.1
AL356390.24 CCDC88B FUT7 KLRG2 OR7E86P SIRT6 TRIM8 CCDS14672.1
AL390294.19 CCDC9 G0S2 KREMEN2 OSCAR SIX5 TSC22D4 CCDS14699.1
ALKBH6 CCR10 GAL3ST2 KRTAP12.4 P2RY11 SKI TSFM CCDS14732.1
AMN CD14 GAL3ST3 LCN12 PAK4 SLC16A11 TTC16 CCDS14740.1
ANKRD43 CD276 GALR3 LGI4 PAK6 SLC16A8 TTC9B CCDS14753.1
ANKRD9 CD70 GAS1 LIME1 PALM SLC19A1 TUBA8 CCDS14771.3
AP002796.3 CD8A GAS2L1 LIPA PARP10 SLC22A7 TUSC1 CCDS14778.1
AP4M1 CDC42EP1 GATA5 LOR PATZ1 SLC25A29 TWF2 CCDS14793.1
APC2 CDC42EP5 GATA6 LOXL1 PC SLC26A1 TYMS CCDS1630.1
APCDD1L CDKN1C GDF7 LPAR5 PCBP4 SLC30A6 TYSND1 CCDS1747.1
APOBEC3H CDKN2AIPNL GFRA4 LRFN3 PCDH8 SLC31A2 U62317.2 CCDS1921.2
APOE CDX2 GGT6 LRG1 PDCD1 SLC39A4 UFSP2 CCDS2098.1
APOLD1 CEBPB GIMAP1 LRP3 PDDC1 SLC8A2 UNCX CCDS2149.1
ARHGAP8 CECR6 GJA3 LRRC24 PDLIM7 SMAD6 UNKL CCDS2168.1
ARID5A CFD GJD4 LRRC26 PEX10 SMCR7 UPF1 CCDS2169.1
ARL6IP4 CGB2 GLI4 LRRC29 PFN3 SMPD4 USH1G CCDS2499.1
ARSA CHAC1 GLTPD1 LRRC32 PHOSPHO1 SMTN USP41 CCDS27.1
ARTN CHPF GLTPD2 LRRN4CL PITX3 SNAPC2 UTF1 CCDS28.1
ASB10 CHST3 GLTSCR2 LRTM1 PKMYT1 SNF1LK UTS2R CCDS30679.1
ASPHD1 CILP2 GPR113 LTB PLA2G6 SNPH VASN CCDS32648.1
ATG9B CITED4 GPR123 LTB4R2 PLEKHF1 SNX15 VAX2 CCDS33104.1
ATOH8 CLDN19 GPR132 LYL1 PLEKHG3 SOCS1 VGLL2 CCDS33145.1
ATP13A1 CLEC11A GPR150 LYSMD2 PLEKHH3 SOD3 VPS37C CCDS33507.1
ATP5D CNFN GPR153 MADCAM1 PLIN SOLH VPS37D CCDS33682.1
ATPBD3 CNO GPR27 MAFA PODNL1 SOST VSX2 CCDS34316.1
BAHD1 CNTD2 GPR44 MAFF POLD4 SOX1 VWC2 CCDS34795.1
BAI2 COL23A1 GPR78 MAP1S POMC SOX10 WDR27 CCDS35254.1
BAPX1 COL8A2 GPRC5C MAP6D1 POR SOX11 WFDC5 CCDS35278.1
BARX1 COMTD1 GRASP MAP7D1 POU4F1 SOX12 WFIKKN1 CCDS35319.1
BASP1 CPLX2 GRIN2D MASP2 PPM1M SOX15 WISP2 CCDS35362.1
BATF2 CPSF1 GRIN3B MAZ PPP1R13L SOX18 WNT6 CCDS35417.1
BCL11B CRCT1 GRRP1 MDK PPP1R14A SOX4 YDJC CCDS35437.1
BCL2L10 CRYGB H1FNT MEF2B PPP1R14B SOX8 YIF1B CCDS35442.1
BDNF CTBP1 HAGHL MEGF11 PPP1R16A SP5 YIPF2 CCDS35443.1
BEGAIN CWF19L2 HAPLN4 MEGF6 PPP2R2A SP8 ZAR1 CCDS35444.1
BFSP2 CXXC5 HAS1 MESDC1 PRAP1 SPERT ZBED3 CCDS41233.1
BHLHB3 CYLN2 HBM MESP1 PRCD SPIB ZBTB22 CCDS41800.1
BHLHB5 DACT3 HCRT MESP2 PRDM13 SPRED3 ZBTB46 CCDS41883.1
BHLHB8 DAK HES2 METRN PRDX5 SPSB4 ZDHHC1 CCDS42105.1
BID DCHS1 HES3 METTL10 PRELID1 SPTBN4 ZDHHC12 CCDS42144.1
BLOC1S3 DDN HES7 MFAP4 PRKCDBP ST6GALNAC4 ZDHHC24 CCDS42325.1
BNIP2 DEDD2 HEXIM2 MFHAS1 PRR15 STRA13 ZFP36 CCDS42855.1
BSG DLL1 HEYL MIB2 PRR18 SUV420H2 ZFP36L2 CCDS42980.1
BTBD14A DLL3 HIC1 MIDN PRRT2 SYDE1 ZFPM1 CCDS43266.1
C10orf22 DMWD HLA.B MKL1 PRRX2 SYT3 ZIC4 CCDS43779.1
C10orf47 DOHH HLA.DOA MLL2 PRSS33 SYTL1 ZNF187 CCDS43951.1
C10orf95 DOK3 HMX3 MMP17 PRSSL1 TACSTD2 ZNF219 CCDS43952.1
C11orf35 DPEP3 HOP MNT PRTN3 TAF1C ZNF238 CCDS43984.1
C11orf53 DRD1IP HOXD1 MNX1 PRX TAL1 ZNF331 CCDS4419.1
C13orf15 DRD4 HOXD11 MPST PSMD8 TAOK2 ZNF428 CCDS4456.1
C14orf152 DUS3L HOXD9 MPV17L PSORS1C1 TAS1R3 ZNF429 CCDS4544.1
C14orf178 DUSP15 HRH3 MRGPRE PSORS1C2 TBC1D10C ZNF444 CCDS4739.1
C14orf180 DUSP28 hsa.mir-126 MRGPRF PSPN TBC1D2B ZNF467 CCDS5.1
C14orf73 DUSP8 hsa.mir-497 MRPL34 PTBP1 TBL3 ZNF497 CCDS5040.1
C14orf80 DVL1 HSD11B1L MRPL38 PTF1A TBRG1 ZNF503 CCDS5101.1
C16orf14 EDG3 HSF4 MSC PTGER1 TBX10 ZNF512B CCDS5325.1
C16orf24 EFCAB4A HSPA12B MSMB PTGER3 TBXA2R ZNF517 CCDS6094.1
C16orf44 EFNA2 HSPB1 MSX1 PTGES TCAP ZNF524 CCDS6142.1
C16orf76 EGR4 IER2 MTAP PTGIR TCF15 ZNF536 CCDS6417.1
C17orf50 EMILIN1 IER5L MVD PTH2 TCF7 ZNF579 CCDS6421.1
C17orf56 EN1 IFITM5 MVP PUSL1 TCTEX1D4 ZNF581 CCDS6445.1
C17orf65 EN2 IGFALS MXD4 PYGO2 TDRD9 ZNF593 CCDS6512.1
C17orf82 ENDOG IGFBP2 MXRA7 QPRT TFF2 ZNF646 CCDS6513.1
C19orf19 EPHA10 IL11 MXRA8 QRICH2 TGFB1 ZNF672 CCDS6893.1
C19orf20 ERF IL17C MZF1 RAD54L2 TIGD1 ZNF688 CCDS7037.2
C19orf21 ESPN IL27 NAGS RAD9B TIGD5 ZNF696 CCDS7456.1
C19orf24 ESPNL IL28A NANOS1 RASIP1 TIMM13 ZNF775 CCDS7709.1
C19orf35 ETNK2 INA NARFL RASL10A TITF1 ZNF784 CCDS7714.1
C19orf6 EVI5L INSM1 NAT14 RASSF4 TLX3 ZNF787 CCDS772.1
C19orf60 EVX1 IQCF2 NAT6 RASSF7 TMEM10 ZSCAN10 CCDS7732.1
C19orf63 EXOSC6 IRF7 NBL1 RAX2 TMEM112B CCDS10025.1 CCDS7942.1
C1orf115 F2RL3 IRS2 NCLN RBMXL2 TMEM121 CCDS10456.1 CCDS8009.1
C1orf159 FAM100A IRX3 NDUFB7 REEP6 TMEM134 CCDS10652.1 CCDS8214.1
C1orf78 FAM109A IRX6 NEU4 RELL1 TMEM143 CCDS10682.1 CCDS8451.1
C1orf90 FAM110C ISG15 NEURL RERE TMEM157 CCDS10875.1 CCDS885.1
C1QL2 FAM128A ITGB4 NFIC RFNG TMEM160 CCDS1148.1 CCDS8964.1
C1QL4 FAM132A ITPKA NFKBID RGS11 TMEM16H CCDS11790.1 CCDS9181.1
C1QR FAM148A JAG2 NFKBIE RGS9BP TMEM200B CCDS11793.1 CCDS9659.1
C20orf144 FAM148B JAK3 NFKBIL1 RHBDD3 TMEM30B CCDS12146.1 CCDS9930.1
C20orf151 FAM43B JPH2 NKPD1 RNF126 TMEM86B CCDS12475.1

Gene Expression Microarray Analysis

Gene expression profiling was performed using standard Affymetrix protocols as described previously31. Briefly, 1 μg of total RNA was reverse-transcribed using an oligo-dT primer to synthesize cDNA. In vitro transcription using a T7 primer resulted in labeled cRNA, which was fragmented and hybridized to Affymetrix Gene 1.0 ST whole-genome microarrays. The arrays were scanned and data normalized as described previously1.

Tumor samples from 73 patients with DLBCL were freshly frozen, as were cell pellets from 21

DLBCL cell lines. These cases were profiled using Affymetrix Gene 1.0 ST arrays. The

molecular subgroups were distinguished using a Bayesian approach described previously31.

Gene Annotation and GO Term Enrichment

In order to better understand the biological processes that were potentially altered by gene

mutations, we used gene ontology32 annotations for biological processes. In all, 203 of the 322

genes had ontology annotation for at least one biological process. We chose all ontologies that

comprised genes that were collectively mutated at least 10 times. In all, 27 separate ontology

terms satisfied these criteria.

We noted a high degree of redundancy among several ontology categories. To reduce the

redundancy, we regrouped related ontologies into 12 larger groups, shown in Table S8. These 12

groups accounted for all 27 GO terms and comprised 1625 events. The frequency of events

occurring in each ontology group is also listed below. These ontology groups are listed in

clockwise order in Figure 3D.
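The term selection and regrouping steps described above can be sketched as follows. The source does not state how events were tallied when a gene carries several annotations, so this sketch makes two labeled assumptions: a gene's events count once toward every term that annotates it, and once toward every group that contains at least one of its terms. All function and variable names are hypothetical, and the toy input is illustrative only.

```python
from collections import defaultdict


def select_go_terms(gene_to_terms, gene_event_counts, min_events=10):
    """Return {term: total events} for terms whose genes are collectively
    mutated at least `min_events` times."""
    totals = defaultdict(int)
    for gene, terms in gene_to_terms.items():
        for term in set(terms):
            totals[term] += gene_event_counts.get(gene, 0)
    return {t: n for t, n in totals.items() if n >= min_events}


def group_event_proportions(gene_to_terms, gene_event_counts, term_to_group):
    """Proportion of events per ontology group (assumption: a gene counts
    once per group, even if several of its terms map to that group)."""
    group_totals = defaultdict(int)
    for gene, terms in gene_to_terms.items():
        groups = {term_to_group[t] for t in terms if t in term_to_group}
        for grp in groups:
            group_totals[grp] += gene_event_counts.get(gene, 0)
    total = sum(group_totals.values())
    return {g: n / total for g, n in group_totals.items()} if total else {}


# Toy input with placeholder gene labels (illustration only).
annotations = {"G1": ["cell cycle", "DNA replication"], "G2": ["apoptosis"], "G3": ["cell cycle"]}
events = {"G1": 6, "G2": 12, "G3": 5}
selected = select_go_terms(annotations, events)
groups = group_event_proportions(
    annotations, events, {"cell cycle": "Cell Cycle", "apoptosis": "Apoptosis"}
)
```

Because a gene can contribute to more than one group under these assumptions, the total number of events across groups can exceed the number of distinct variants.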


Table S8: Gene Ontology of DLBCL cancer genes.

No. | Ontology Group | Proportion of Total Events | GO Terms Comprised
1 | Apoptosis | 3.6% | Apoptosis
2 | Cell Adhesion | 11.2% | Cell adhesion; Homophilic cell adhesion
3 | Cell Cycle | 4.5% | Cell cycle; DNA replication
4 | Cell Development and Differentiation | 9.3% | Cell differentiation; Multicellular organismal development; Nervous system development
5 | Cell Metabolism | 5.4% | Metabolic process
6 | Chromatin Modification and Transcription | 14.8% | Chromatin modification; Regulation of transcription; Regulation of transcription, DNA-dependent; Transcription
7 | DNA Repair | 2% | DNA repair
8 | Immune Response | 3.1% | Immune response
9 | Membrane Transport | 12.9% | Electron transport; Ion transport; Transport
10 | Protein Modification | 12.6% | Protein amino acid phosphorylation; Protein transport; Proteolysis
11 | Signal Transduction | 17.2% | Cell surface receptor linked signal transduction; G-protein coupled receptor protein signaling pathway; Intracellular signaling cascade; Signal transduction; Transmembrane receptor protein tyrosine kinase signaling pathway
12 | Ubiquitin Cycle | 3.4% | Ubiquitin cycle

These analyses are necessarily restricted by the limitations of current annotations and our

existing knowledge of the interactions of different genes and signaling pathways. Nevertheless,

they provide a broad overview of the cellular and signaling functions that are potentially altered

by recurrent mutations in DLBCL and identify new aspects of the biology of the disease.

Biological Validation

Cell Culture

As described previously33, lymphoma cell lines were cultured in RPMI1640 media supplemented with 10% v/v fetal bovine serum (FBS) and 1% v/v penicillin/streptomycin (P/S) supplied at 10,000 U penicillin and 10 mg streptomycin/ml (BJAB, Farage, Karpas422, Pfeiffer, RL, SCI 1, SKI, Toledo, U2932, WSU_NHL, HT), RPMI1640 supplemented with 15% v/v FBS and 1% v/v P/S (RCK8, SUDHL4, SUDHL7), Iscove's modified Dulbecco's medium (IMDM) supplemented with 20% v/v human plasma and 1% P/S (RCK8, SUDHL4, SUDHL7), or alpha-MEM media supplemented with 10% FBS and 1% P/S (TMD8). Cells were grown in a 5% CO2 environment at 37°C.

Cell Viability Assays

The effect of the PI3K inhibitor BKM120 (Novartis) on the 22 lymphoma cell lines sequenced in this study was assayed in 96-well format using MTT viability assays. 40,000 cells were grown in drug at 10 concentrations: 25 μM and 9 serial 1:2 dilutions down to 0.05 μM. For each condition tested, there were 5 technical replicates. Two controls were included: media only and drug-free cells. If the IC50 was not reached within this drug concentration range, the experiment was repeated with an additional drug concentration of 50 μM. After 48 hours, 15 μl MTT was added to each well, and the plate was incubated at 37°C for 4 hours. 100 μl of MTT detergent solution was added, and color was developed in the dark at room temperature overnight, after which 570 nm absorbance was measured. After normalization to control wells, the IC50 values were calculated (Table S9).
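The dilution series and the normalization step can be illustrated with a short sketch. The source does not specify how IC50 values were derived from the normalized readings, so the log-linear interpolation below is an assumed stand-in rather than the method used; the function names are hypothetical, and the normalization assumes that the media-only wells serve as the blank and the drug-free wells as 100% viability.

```python
import math


def dilution_series(top=25.0, n_dilutions=9, factor=2.0):
    """Top concentration followed by serial 1:factor dilutions (in μM)."""
    return [top / factor ** i for i in range(n_dilutions + 1)]


def normalize(abs_drug, abs_media_only, abs_drug_free):
    """Viability as a fraction of drug-free cells after blank subtraction."""
    return (abs_drug - abs_media_only) / (abs_drug_free - abs_media_only)


def ic50_log_interp(concs, viability):
    """Concentration at which viability crosses 0.5, interpolated linearly
    on a log10 concentration scale; None if 0.5 is never crossed."""
    pts = sorted(zip(concs, viability))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(pts, pts[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None


series = dilution_series()  # 25 μM down to 25/512, roughly 0.05 μM
```

A return value of `None` corresponds to the case in which the IC50 is not reached within the tested concentration range.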


Cell Line Drug Sensitivity

Table S9: IC50 Values for 21 DLBCL cell lines.

DLBCL Cell Line | IC50 (μM) | MTOR Status
BJAB | 1.4706 | Wild-type
Farage | 0.752 | Wild-type
Karpas422 | 2.081 | Wild-type
Ly10 | 1.5249 | Wild-type
Ly19 | 1.8869 | Wild-type
Ly3 | 0.8929 | Wild-type
Ly7 | 2.5487 | Wild-type
Ly8 | 1.3736 | Wild-type
Pfeiffer | 1.0441 | Wild-type
RCK8 | 3.2553 | Wild-type
RL | 2.5977 | Wild-type
SCI_1 | 0.9853 | Wild-type
SKI | 0.2364 | Wild-type
SUDHL4 | 1.1134 | Wild-type
TMD8 | 0.7645 | Wild-type
Toledo | 2.0058 | Wild-type
U2932 | 5.3598 | Wild-type
WSU_NHL | 1.2561 | Wild-type
HT | 0.4352 | Mutated
Ly1 | 0.382 | Mutated
SUDHL7 | 0.2329 | Mutated

PI3 Kinase Dependence

Protein Structure Modeling of PIK3CD

To model the three-dimensional structure of the PIK3CD protein, we threaded the protein sequence (NP_005017) through the crystal structure of the PIK3CG protein (PDB:1HE8) using the PHYRE algorithm34. Given the very strong homology through the core structural domains, the output was of high quality. Thus, the catalytic, Ras binding, and C2 domains were easily discernible and properly oriented in the PIK3CD model as displayed in Figure 4D. The PyMol program (DeLano Scientific) was used for positioning of somatically mutated residues observed in our sequencing studies and rendering of the model figure.


Cell culture

FL5.12 myristoylated Akt (myrAkt) cells obtained from Jeffrey Rathmell’s laboratory were

cultured in RPMI (Invitrogen, 11875-093) supplemented with 10% FCS, 500 pg/mL

recombinant mouse IL3 (rmIL3), 2 mM L-glutamine, 10 mM HEPES, 1% v/v

Penicillin/Streptomycin (Invitrogen, 15140-122) and 0.1% v/v βME (Invitrogen, 21985-023).

Cells were split down to a density of 50k/ml daily.

Creation of PIK3CD wild-type and mutant expression constructs

The PIK3CD shuttle clone (Genecopoeia, GC-M0163-CF) open reading frame was inserted into

the pEF-DEST51 plasmid (Invitrogen, 12285-011) using Gateway cloning (Invitrogen, 12538-

120). The point mutation was created by site-directed mutagenesis (Stratagene, 200521). Wild-

type and mutant PIK3CD plasmid insert sequences were confirmed over the entire length of the

ORF by Sanger sequencing.

Transfection of FL5.12 myrAkt cell lines, IL3 withdrawal

For western blot analysis, 1.5 million FL5.12 myrAkt cells were transfected with 2 μg of wild-type or mutant PIK3CD plasmid by Amaxa (Nucleofector V, program G-016), concurrent with the addition of doxycycline at 1 μg/ml to the media to induce myrAkt expression. At 18 hours post-transfection, each transfection was split in half and washed twice in phosphate-buffered saline. The control cells were re-suspended in normal FL5.12 growth media, whereas the remainder were re-suspended in media lacking IL3. P-Akt was measured by western blot 3 hours later to compare the cells in which IL3 was replaced with those from which it was withdrawn.

Cells for the PI3K activity ELISA were transfected in a similar manner, but with 50 million cells transfected in 5 batches of 10 million cells by Amaxa. These cells were also subjected to IL3 withdrawal 18 hours post-transfection and harvested 3 hours later.

Western blot

RIPA Lysis buffer (1 × phosphate-buffered saline [PBS], 1% Nonidet P-40, 0.5% sodium

deoxycholate, 0.1% SDS, 10 mM phenylmethylsulfonyl fluoride, 1 μg/mL aprotinin, and 100

mM sodium orthovanadate) was added to 750,000 cells and incubated on ice for 30 minutes. The

mixture was spun down and the supernatant was transferred to a new tube as the whole cell


extract. A total of 20 μg of cell lysate was separated on a 4-18% Tris-Bis NuPAGE gel (Invitrogen) and transferred using the iBlot transfer device (Invitrogen), program 2, for 6 minutes. The blots were probed using 1:1000 rabbit anti-phospho-AKT (Cell Signaling Technologies, #4060), 1:2000 rabbit anti-total AKT (Cell Signaling Technologies, #9272), and 1:1000 mouse anti-β-actin (Santa Cruz Biotechnologies, sc-47778) overnight at 4°C. The antibodies were detected using 1:10,000 goat anti-rabbit or 1:10,000 goat anti-mouse horseradish peroxidase-conjugated antibodies (Santa Cruz Biotechnologies). Western Blotting Luminol Reagent (Santa Cruz Biotechnologies) was used to visualize the bands corresponding to each antibody.

[Figure S8 chart: PI3K Activity ELISA. y-axis: PI3K Activity Relative to IL3+ (0 to 100). Conditions: wild-type PIK3CD, IL3+; wild-type PIK3CD, IL3-; mutant PIK3CD, IL3+; mutant PIK3CD, IL3-.]

PI3K ELISA

After 3 hours of IL3 withdrawal, cells were washed in PBS and lysed in 80 μl lysis buffer consisting of 10 mM Tris pH 7.4, 150 mM NaCl, 1% Triton X-100, 1% deoxycholic acid, 0.1% SDS, and 5 mM EDTA, supplemented with 1% each of protease inhibitor cocktail (Sigma cat #P-8340), serine/threonine phosphatase inhibitor cocktail (Sigma cat #P-2850), tyrosine phosphatase inhibitor (Sigma cat #P-5726), and PMSF (100 mM stock). The lysate was vortexed for 10 s, incubated on ice for 10 minutes, vortexed again, and then incubated on ice again. It was then sonicated using Covaris settings (duty cycle: 5%, intensity: 4, cycles/burst: 200) for two 1-minute pulses. The lysate was centrifuged at 4°C for 10 minutes at 16,000 rcf, and the supernatant was transferred to a fresh tube.

100μg of protein was used per well for PI3K activity measurement by ELISA (Echelon

Biosciences part number K-1000s) per manufacturer instructions.

Figure S8: PI3 kinase ELISA activity measurements of FL5.12 myrAkt cells transfected with wild-type PIK3CD or mutant PIK3CD. Kinase activity of cells subject to IL3 removal compared to those not subject to removal is shown for each transfection.

References

1. Jima, D.D. et al. Deep sequencing of the small RNA transcriptome of normal and malignant human B cells identifies hundreds of novel microRNAs. Blood 116, e118-27 (2010).

2. Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. & Rice, P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38, 1767-71 (2010).

3. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-303 (2010).

4. Parmigiani, G. et al. Design and analysis issues in genome-wide somatic mutation studies of cancer. Genomics 93, 17-21 (2009).

5. Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-6 (2011).

6. Ge, D. et al. SVA: Software for Annotating and Visualizing Sequenced Human Genomes. Bioinformatics (2011).

7. Reva, B., Antipin, Y. & Sander, C. Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol 8, R232 (2007).

8. Chiang, D.Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods 6, 99-103 (2009).

9. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-81 (2009).

10. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res 19, 1639-45 (2009).

11. Needleman, S.B. & Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53 (1970).

12. Pruitt, K.D. et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19, 1316-23 (2009).

13. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).

14. Kamai, T. et al. Increased Rac1 activity and Pak1 overexpression are associated with lymphovascular invasion and lymph node metastasis of upper urinary tract cancer. BMC Cancer 10, 164 (2010).

15. Altshuler, D.M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-8 (2010).

16. Wood, L.D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108-13 (2007).

17. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-11 (2001).

18. Siva, N. 1000 Genomes project. Nat Biotechnol 26, 256 (2008).

19. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272-6 (2009).

20. Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75-8 (2010).

21. Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet 42, 969-72 (2010).

22. Forbes, S.A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res 39, D945-50 (2011).

23. Morin, R.D. et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 476, 298-303 (2011).

24. Pasqualucci, L. et al. Analysis of the coding genome of diffuse large B-cell lymphoma. Nat Genet 43, 830-7 (2011).

25. Lohr, J.G. et al. Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing. Proc Natl Acad Sci U S A 109, 3879-84 (2012).

26. Pasqualucci, L. et al. Hypermutation of multiple proto-oncogenes in B-cell diffuse large-cell lymphomas. Nature 412, 341-6 (2001).

27. Challa-Malladi, M. et al. Combined genetic inactivation of beta2-Microglobulin and CD58 reveals frequent escape from immune recognition in diffuse large B cell lymphoma. Cancer Cell 20, 728-40 (2011).

28. Mandelbaum, J. et al. BLIMP1 is a tumor suppressor gene frequently disrupted in activated B cell-like diffuse large B cell lymphoma. Cancer Cell 18, 568-79 (2010).

29. Agrawal, N. et al. Exome sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science 333, 1154-7 (2011).

30. Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157-60 (2011).

31. Dave, S.S. et al. Molecular diagnosis of Burkitt's lymphoma. N Engl J Med 354, 2431-42 (2006).

32. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).

33. Zhang, J. et al. Patterns of microRNA expression characterize stages of human B cell differentiation. Blood 113, 4586-94 (2009).

34. Kelley, L.A. & Sternberg, M.J. Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 4, 363-71 (2009).