2013 alumni-webinar

Post on 10-May-2015


I’ve got the Big Data Blues

C. Titus Brown
ctb@msu.edu

Microbiology, Computer Science, and BEACON

Outline

1. Genetics 101 and 102 - what you need to know.
2. Marek’s Disease – chicken cancer.
3. Generating lots of data – the sequencing revolution.
4. The problems of data analysis and data integration.
5. Some preliminary results on Marek’s Disease.
6. An apparent digression: chess and computers.
7. My actual research :)

Genetics 101: DNA to RNA to protein to phenotype…

http://commons.wikimedia.org/wiki/File:Spombe_Pop2p_protein_structure_rainbow.png; http://commons.wikimedia.org/wiki/File:Protein_CA2_PDB_12ca.png

…plus diploidy (2x each chromosome)

…plus regulation and interaction.

CANCER

• Physical agents
• Infectious agents
• Hormones
• Radiation
• Genetic factors
• Chemical carcinogens
• Lifestyle factors

(slide courtesy Suga Subramanian)

Herpesvirus and Cancer

• Epstein-Barr Virus
  – Burkitt’s lymphoma
  – Hodgkin’s lymphoma
  – Nasopharyngeal carcinoma
• Herpes Virus-8
  – Kaposi’s sarcoma
  – Multicentric lymphoma
• Mardivirus
  – Marek’s Disease

• Viral neoplastic disease
• Alpha-herpesvirus
• Model for Burkitt’s lymphoma

(slide courtesy Suga Subramanian)

Clinical Signs: Asymmetric Paralysis

http://partnersah.vet.cornell.edu/avian-atlas/

Visceral Lymphoma: Liver

(Image panels: NORMAL, LYMPHOMA)

Courtesy: John Dunn, USDA

Importance of Marek’s Disease

• Agricultural Impact
  – Economic losses (2 billion)
  – Viral evolution: increased virulence
  – Current vaccines: not enough
  – Long-term viral persistence
• Model System
  – Human herpes viral infections
  – Viral induced lymphoma

(slide courtesy Suga Subramanian)

MAREK’S DISEASE VIRUS (MDV)

INBRED CHICKEN LINES
• MD-RESISTANT LINE: LINE 62
• MD-SUSCEPTIBLE LINE: LINE 73

GENETIC RESISTANCE TO MAREK’S DISEASE?

(slide courtesy Suga Subramanian)

What happens when we infect?

…how does the virus specifically interact with genes?

…and what are the mechanisms of resistance?

Digression: DNA sequencing

• Observation of actual DNA sequence
• Counting of molecules

Image: Werner Van Belle

Fast, cheap, and easy to generate.

Image: Werner Van Belle

Applying sequencing to Marek’s Disease

Differentially expressed genes (DEGs) due to infection

Gene GO Analysis, IPA Pathway Analysis

DEGs in Md5-infected and not in Md5ΔMeq-infected groups?
– YES: Meq-dependent DEGs
– NO: DEGs not dependent on Meq

Meq-dependent DEGs:
– In Line 6 and not in Line 7: Meq-dependent DEGs involved in MD resistance
– In Line 7 and not in Line 6: Meq-dependent DEGs involved in MD susceptibility
– Otherwise: Meq-dependent DEGs common to both lines

Back to Marek’s disease:

(slide courtesy Suga Subramanian)
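The filtering logic on this slide is essentially set arithmetic. The sketch below is a minimal illustration and not the actual analysis pipeline: the function name, the `Md5dMeq` label, and the toy gene names are hypothetical, and it assumes Meq dependence is decided separately within each chicken line.

```python
def classify_meq_dependent_degs(degs):
    """Split Meq-dependent DEGs into resistance, susceptibility, and common.

    `degs` maps (line, virus) -> set of DEG names, where line is "6" or "7"
    and virus is "Md5" or "Md5dMeq" (hypothetical labels for this sketch).
    """
    # Meq-dependent: differentially expressed with Md5 but not with Md5 lacking Meq.
    meq_dependent = {
        line: degs[(line, "Md5")] - degs[(line, "Md5dMeq")]
        for line in ("6", "7")
    }
    return {
        "resistance": meq_dependent["6"] - meq_dependent["7"],
        "susceptibility": meq_dependent["7"] - meq_dependent["6"],
        "common": meq_dependent["6"] & meq_dependent["7"],
    }


# Toy input with made-up gene names, purely for illustration.
toy_degs = {
    ("6", "Md5"): {"geneA", "geneB", "geneC"},
    ("6", "Md5dMeq"): {"geneC"},
    ("7", "Md5"): {"geneB", "geneD", "geneE"},
    ("7", "Md5dMeq"): {"geneE"},
}
result = classify_meq_dependent_degs(toy_degs)
```

With the toy input, `geneA` lands in the resistance set, `geneD` in the susceptibility set, and `geneB` in the common set.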

LINE 6

MD-RESISTANCE: ROLE OF MEQ

MDV MDV-no Meq

Genes involved in MD-resistance that are regulated by Meq

Genes involved in MD-resistance that are not regulated by Meq

1031 1670

(slide courtesy Suga Subramanian)

Pathway Analysis: MD resistance

(slide courtesy Suga Subramanian)

LINE 7

MD-SUSCEPTIBILITY: ROLE OF MEQ

MDV MDV-no Meq

Genes involved in MD-susceptibility that are regulated by Meq

Genes involved in MD-susceptibility that are not regulated by Meq

650 540

(slide courtesy Suga Subramanian)

Pathway Analysis: MD susceptibility

(slide courtesy Suga Subramanian)

Next problem: data analysis & integration!

• Once you can generate virtually any data set you want…

• …the next problem becomes finding your answer in the data set!

• Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you have to hunt through 1 billion phone calls a day…

Digression: “Heuristics”

• What do computers do when the answer is either really, really hard to compute exactly, or actually impossible?

• They approximate! Or guess!

• The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.

Often explicit or implicit tradeoffs between compute “amount” and quality of result

http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
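The compute-versus-quality tradeoff can be sketched with a depth-limited minimax search. This is a toy illustration and not a chess engine: it uses a simple subtraction game (take 1 to 3 stones, whoever takes the last stone wins), and the function names and the "return 0 when unsure" heuristic are assumptions made for this example.

```python
def heuristic(stones):
    """Cheap guess used when the search runs out of depth: call it a draw."""
    return 0


def minimax(stones, depth, maximizing):
    """Score a position from the maximizing player's point of view.

    +1 means the maximizing player wins, -1 means they lose, and 0 means
    the search was cut off and the heuristic could not tell.
    """
    if stones == 0:
        # The previous player took the last stone, so the player to move lost.
        return -1 if maximizing else 1
    if depth == 0:
        return heuristic(stones)
    scores = [
        minimax(stones - take, depth - 1, not maximizing)
        for take in (1, 2, 3)
        if take <= stones
    ]
    return max(scores) if maximizing else min(scores)


shallow = minimax(8, depth=2, maximizing=True)  # cheap, but only a guess
deep = minimax(8, depth=8, maximizing=True)     # more compute, exact answer
```

Here the shallow search gives up and reports 0, while the deeper search discovers that 8 stones is a lost position for the player to move. Real engines replace the constant guess with a much better evaluation function, which is where the "identifying better ways to guess" work comes in.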

My actual research focus

What we do is think about ways to get computers to play chess better, by:

– Identifying better ways to guess;
– Speeding up the guessing process;
– Improving people’s ability to use the chess-playing computer.

Now, replace “play chess” with “analyze biological data”...

My actual research focus…

We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions.

This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and computational scientists.
• Lack of training (biology and medical curricula devoid of math and computing).

Not-so-secret sauce: “digital normalization”

• One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.

http://en.wikipedia.org/wiki/JPEG

Lossy compression


Restated:

Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.)
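As a rough sketch of the idea, digital normalization as published discards a sequencing read once the k-mers it contains have already been seen often enough, judged by the read's median k-mer count. The code below is a simplified illustration under stated assumptions: it uses an exact in-memory counter rather than the memory-efficient probabilistic counting used in practice, and the function names, `k`, and `cutoff` values are chosen only for the toy example.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only retained reads add to the counts
    return kept


# Five copies of one toy read plus one rare read.
common_read = "ACGTTGCAAG"
rare_read = "TTTTGGGGCC"
kept = digital_normalize([common_read] * 5 + [rare_read])
```

In this toy run, only two of the five redundant copies survive while the rare read is retained, which is the lossy part: redundant data is thrown away, but each distinct sequence is still represented.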

~2 GB – 2 TB of single-chassis RAM

Some diginorm examples:

1. Assembly of the H. contortus parasitic nematode genome.

2. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie.

3. Reference-free assembly of the lamprey (P. marinus) transcriptome.

1. The H. contortus problem

• A sheep parasite.

• ~350 Mbp genome

• Sequenced DNA from 6 individuals after whole genome amplification; estimated 10% heterozygosity (!?)

• Significant bacterial contamination.

(w/Robin Gasser, Paul Sternberg, and Erich Schwarz)

H. contortus life cycle

Refs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868;Prichard and Geary (2008), Nature 452, 157-158.

Assembly after digital normalization

• Diginorm readily enabled assembly of a 404 Mbp genome with N50 of 15.6 kb;

• Post-processing led to 73-94% complete genome.

• Diginorm helped by making analysis possible.
  – Highly variable population.
  – Lots of contamination from microbes.
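For readers unfamiliar with the N50 statistic quoted above: it is the contig length L such that contigs of length L or longer together cover at least half of the total assembly. A minimal sketch, with a hypothetical function name and made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


example_n50 = n50([2, 3, 4, 5, 6])
```

For the toy lengths, the total is 20, and the two longest contigs (6 and 5) already cover 11, so the N50 is 5.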

Next steps with H. contortus

• Publish the genome paper

• Identification of antibiotic targets for treatment in agricultural settings (animal husbandry).

• Serving as “reference approach” for a wide variety of parasitic nematodes, many of which have similar genomic issues.

2. Soil metagenome assembly

A “Grand Challenge” dataset (DOE/JGI)

Putting it in perspective:
• Total equivalent of ~1200 bacterial genomes
• Human genome ~3 billion bp

Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)

Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 billion      4.5 million                19%                 5.3 million
3.5 billion      5.9 million                22%                 6.8 million

Adina Howe

3. Sea lamprey gene expression

• Non-native
• Parasite of medium to large fishes
• Caused populations of host fishes to crash

Li Lab / Y-W C-D

Transcriptome results

• Started with 5.1 billion reads from 50 different tissues.

(4 years of computational research, and about 1 month of compute time, GO HERE)

• Final assembly contains ~95% of genes (est.)
• This is an extra 40% over previous work.
• Enabling studies in:
  – Basal vertebrate phylogeny
  – Biliary atresia
  – Evolutionary origin of brown fat (previously thought to be mammalian only!) – J Exp Biol. 2013
  – Pheromonal response in adults

What are the tissue-level changes in gene expression that support regeneration?

Transcriptome analysis of a regenerating vertebrate after SCI

• Brain; spinal cord
• RNA-Seq to determine differential expression profile after injury
• Sampling >weekly
• -/+ Dex

Ona Bloom

Challenges ahead

• We need more people working at the interface
  – “Priesthood” model doesn’t scale!
  – Cultural shifts in biology needed…

• We need more data!
  – Data often only makes sense in context of other data
  – This is a hard sell: “if you give us 1000x as much data, we might start to develop some idea of what it means.”

• We actually know very little about biology still!

Open science & sharing

• Science, and biology in particular, is in the middle of a transition to a “data intensive” field.

• The sharing ethos is not incentivized properly; you get more credit for discovering new stuff than for discoveries resulting from sharing.

• We are focused on sharing: methods, programs, educational materials…

Being disruptive?

Possible initiative from my lab: “We will analyze your data for you if we can make your data openly available in 1 yr.”

Will it work, or sink like a stone? Ask me in a year.

MSU’s role in my research

• MSU provides nice infrastructure, great administrative support, and a truly excellent community (students, profs, and other researchers).

• MSU is also uniquely interdisciplinary in many ways; very few “hard” boundaries in biology research.

Credits

• Marek’s Disease: Suga Subramanian and Hans Cheng (USDA)
• Haemonchus: Erich Schwarz (Caltech/Cornell), Paul Sternberg (Caltech), Robin Gasser (U. Melbourne)
• Lamprey: Weiming Li (MSU), Ona Bloom (Feinstein), Jen Morgan (MBL/Woods Hole)
• Great Prairie: Jim Tiedje (MSU), Janet Jansson (LBL), Susanna Tringe (Joint Genome Inst.)

Funding: MSU; USDA; NSF; NIH.

Drop me a line – ctb@msu.edu
