
Upload: jeffery-newton

Post on 13-Dec-2015


Page 1

Data Workflow Overview

Genomics High-Throughput Facility

Genome Analyzer IIx

Institute for Genomics and Bioinformatics

Computation Resources

● ~800 processors
● Sun Grid Engine

Storage Capacity

● ~100 TB (secured)
● Fast drives
● 30 TB for HTS

Public Web Servers

● HTTP, FTP
● Dedicated hosts
● User accounts

HTS: 700 GB/day

Bandwidth: 10 Gb/s

USER

Sample Analysis Requests (via web interface)

Analysis Results (FTP server)
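The capacity figures on this slide can be related with some quick arithmetic. The sketch below is only illustrative: it assumes decimal units (1 TB = 1000 GB, 1 byte = 8 bits), a sustained output of 700 GB/day, and that the full 10 Gb/s link is available, none of which the slide states. The function names are hypothetical.

```python
# Figures from the slide; decimal units are an assumption.
HTS_STORAGE_GB = 30 * 1000       # 30 TB reserved for HTS
HTS_OUTPUT_GB_PER_DAY = 700      # HTS output per day
BANDWIDTH_GBIT_PER_S = 10        # network bandwidth in gigabits per second


def days_until_full(storage_gb: float, output_gb_per_day: float) -> float:
    """Days of sequencing output that fit in the given storage."""
    return storage_gb / output_gb_per_day


def transfer_seconds(size_gb: float, bandwidth_gbit_per_s: float) -> float:
    """Ideal transfer time: gigabytes converted to gigabits, divided by link speed."""
    return size_gb * 8 / bandwidth_gbit_per_s


if __name__ == "__main__":
    days = days_until_full(HTS_STORAGE_GB, HTS_OUTPUT_GB_PER_DAY)
    secs = transfer_seconds(HTS_OUTPUT_GB_PER_DAY, BANDWIDTH_GBIT_PER_S)
    print(f"HTS storage holds about {days:.1f} days of output")
    print(f"One day of output moves in about {secs / 60:.1f} minutes at full link speed")
```

Under these assumptions the HTS storage holds roughly six weeks of output, and one day of output needs a little under ten minutes of ideal transfer time.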

Page 2

Data Analysis Workflow

IMAGES: 2-4 TB

INTENSITIES: 100-200 GB

Image Analysis

Firecrest

Base Calling

Bustard

BASE CALLS: 50-100 GB

SEQUENCES + SCORES: 20/30 GB

Synthesis

Gerald

GENOME ALIGNMENT: >100 GB

Alignment

ELAND + Reference Genome

READ COUNTS

Read Counting

Casava VDC

Sample-Specific Analysis, Visualization…

e.g. genome alignment, RNA-seq, ChIP-seq analysis

Downloadable files for HTS users: FASTQ files
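To make the data reduction along the workflow concrete, the sketch below tabulates the sizes quoted on this slide and computes how much smaller the FASTQ-level data is than the raw images. It assumes decimal units and reads "20/30 GB" as a 20 to 30 GB range; both are interpretations, not statements from the slide. The names `STAGES_GB` and `reduction_range` are hypothetical.

```python
# (data product, low estimate in GB, high estimate in GB), as quoted on the slide.
STAGES_GB = [
    ("IMAGES", 2000, 4000),            # 2-4 TB, assuming 1 TB = 1000 GB
    ("INTENSITIES", 100, 200),
    ("BASE CALLS", 50, 100),
    ("SEQUENCES + SCORES", 20, 30),    # "20/30 GB" read as a range (assumption)
]


def reduction_range(before, after):
    """Smallest and largest size ratio between two data products."""
    _, before_lo, before_hi = before
    _, after_lo, after_hi = after
    return before_lo / after_hi, before_hi / after_lo


if __name__ == "__main__":
    for name, lo, hi in STAGES_GB:
        print(f"{name:<20} {lo:>5}-{hi} GB")
    smallest, largest = reduction_range(STAGES_GB[0], STAGES_GB[-1])
    print(f"Images are roughly {smallest:.0f}x to {largest:.0f}x larger than sequences + scores")
```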

Page 3

Sequences, Scores (FASTQ)

@HWUSI-EAS1562_0001:8:1:1119:18138#0/1
ATATTCTTATATAAAAATATAATTATTTTAATATTTGGTCCTTTCGTACTAAAATAT
+HWUSI-EAS1562_0001:8:1:1119:18138#0/1
aaY`_aaY^a``[[`a\\\\aaa_^[aaZZWaaaXXY[VYaW^aaaa[aaa]a[a`

@HWUSI-EAS1562_0001:8:1:1119:13476#0/1
AGAAAGCTTTGAAAATTATGTATACGCCTCGTAAGCCCAGTCCAAAGTCAAGACCA
+HWUSI-EAS1562_0001:8:1:1119:13476#0/1
a_^`a`_a[[NOONN__V__`Y^`^X]R[]]]]]Q```Y````__`^W`YVUPR]]

Sequence identifier

Raw sequence

Phred base calling quality scores (0 to 62, encoded using ASCII 64 to 126)
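The encoding described above (score = ASCII code minus 64) can be decoded in a few lines. The following is a minimal sketch, assuming four-line FASTQ records as in the examples on this slide; `FastqRecord`, `read_fastq` and `phred64_scores` are illustrative names, not part of any Illumina tool. The sample record is the second read from the slide, truncated to its first 10 bases to keep the example short.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries, skipping blank lines."""
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield FastqRecord(header[1:], seq, qual)


def phred64_scores(quality: str) -> List[int]:
    """Decode quality characters: score = ASCII code - 64, valid range 0 to 62."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(s < 0 or s > 62 for s in scores):
        raise ValueError("quality character outside ASCII 64 to 126")
    return scores


if __name__ == "__main__":
    # Second read from the slide, truncated to 10 bases for brevity.
    sample = [
        "@HWUSI-EAS1562_0001:8:1:1119:13476#0/1",
        "AGAAAGCTTT",
        "+HWUSI-EAS1562_0001:8:1:1119:13476#0/1",
        "a_^`a`_a[[",
    ]
    for rec in read_fastq(sample):
        print(rec.identifier, rec.sequence, phred64_scores(rec.quality))
```

Other FASTQ variants use a different ASCII offset, so the offset of 64 applies only to files encoded as this slide describes.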

Page 4

Genome Alignment (ELAND)

HWUSI-EAS1562_0001:8:1:1119:18138#0/1 ATATTCTTATATAAAAATATAATTATTTTAATATTTGGTCCTTTCGTACTAAAATAT U1 0 147 255 chr1.fa 26532086 F 23G

HWUSI-EAS1562_0001:8:1:1119:13476#0/1 AGAAAGCTTTGAAAATTATGTATACGCCTCGTAAGCCCAGTCCAAAGTCAAGACCA U0 1 0 0 chr12.fa 90535786 F

Sequence identifier

Raw Sequence

Type of match

Number of exact/1-error/2-error matches

Chromosome/Position/Direction

Substitution
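The field labels above map onto the whitespace-separated columns of the two example lines. The sketch below splits such a line into named fields; it assumes the column order shown on this slide and at least nine columns per line, so reads without a reported position would need extra handling. `ElandHit` and `parse_eland_line` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ElandHit(NamedTuple):
    identifier: str
    sequence: str
    match_type: str
    exact_matches: int
    one_error_matches: int
    two_error_matches: int
    chromosome: str
    position: int
    direction: str
    substitution: Optional[str]


def parse_eland_line(line: str) -> ElandHit:
    """Split one alignment line into the fields labelled on the slide."""
    f = line.split()
    if len(f) < 9:
        raise ValueError("expected at least 9 whitespace-separated fields")
    return ElandHit(
        identifier=f[0],
        sequence=f[1],
        match_type=f[2],
        exact_matches=int(f[3]),
        one_error_matches=int(f[4]),
        two_error_matches=int(f[5]),
        chromosome=f[6],
        position=int(f[7]),
        direction=f[8],
        substitution=f[9] if len(f) > 9 else None,  # absent when no substitution is listed
    )


if __name__ == "__main__":
    line = ("HWUSI-EAS1562_0001:8:1:1119:13476#0/1 "
            "AGAAAGCTTTGAAAATTATGTATACGCCTCGTAAGCCCAGTCCAAAGTCAAGACCA "
            "U0 1 0 0 chr12.fa 90535786 F")
    print(parse_eland_line(line))
```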

Page 5

Read Counts (Casava VDC)

Matches with genes, exons, and splice junctions

Chromosome Gene Matches
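A chromosome / gene / matches table of this kind can be illustrated with a simple counting sketch. This is not how Casava computes its counts: it only tallies reads whose reported start position falls inside a gene interval, ignoring read length, exons and splice junctions. The gene names and coordinates in `GENES` are invented for illustration; the two read positions are taken from the alignment examples on the previous slide.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical annotation: (chromosome, gene name, start, end). Not real genes.
GENES: List[Tuple[str, str, int, int]] = [
    ("chr12.fa", "GENE_A", 90_500_000, 90_600_000),
    ("chr1.fa", "GENE_B", 26_000_000, 26_500_000),
]


def count_reads(hits: Iterable[Tuple[str, int]],
                genes: List[Tuple[str, str, int, int]]) -> Counter:
    """Count reads whose start position lies within each gene interval."""
    counts: Counter = Counter()
    for chrom, pos in hits:
        for g_chrom, name, start, end in genes:
            if chrom == g_chrom and start <= pos <= end:
                counts[(g_chrom, name)] += 1
    return counts


if __name__ == "__main__":
    # (chromosome, position) pairs from the two aligned reads shown earlier.
    hits = [("chr1.fa", 26532086), ("chr12.fa", 90535786)]
    for (chrom, gene), n in count_reads(hits, GENES).items():
        print(chrom, gene, n)
```

With these made-up intervals only the chr12 read is counted; the chr1 read lies outside the hypothetical GENE_B interval.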

Files for visualization (GenomeStudio)

Genome alignment, gene expression, RNA-seq and ChIP-seq analysis