Post on 09-Jan-2017

Storage and Analysis of Sensitive Large-Scale Biomedical Data in Sweden

Ola Spjuth

SNIC, UPPMAX and Science for Life Laboratory

Uppsala University, Sweden

ola.spjuth@farmbio.uu.se

Ola Spjuth

• Associate Professor in Pharmaceutical Bioinformatics

• Guest Researcher

• Co-Director

• Manager of Bioinformatics Compute and Storage facility

2003: First sequenced human genome - 13 years for $3 billion

2015: Human whole genome sequenced in 3 days for ~$1150

Massively parallel sequencing…

…requires supercomputers for analysis and storage

2010: Science for Life Laboratory inaugurated

An internationally leading center that develops and applies large-scale technologies for molecular biosciences with a focus on health and environment.

National platform since 2013

Stockholm node

Uppsala node

Data generation and delivery

1. Sample transfer

2. Data delivery

3. Analysis

Scientists

www.uppmax.uu.se/uppnex

High-performance computers and large-scale storage for bioinformatics analysis.

Sequence production 2014:

• Generated > 120 Tbp of sequence data

• 13.7 Gbp/hour, 3.8 Mbp/sec (on average)
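The average rates on this slide follow directly from the annual total. A minimal sketch of the arithmetic, assuming the 120 Tbp were spread evenly over a 365-day year (the slide only says "on average", so the even spread is an assumption):

```python
# Assumption: 120 Tbp produced evenly over a 365-day year.
TOTAL_BP = 120e12            # 120 Tbp, the 2014 total quoted on the slide
HOURS_PER_YEAR = 365 * 24


def average_rates(total_bp, hours=HOURS_PER_YEAR):
    """Return (base pairs per hour, base pairs per second)."""
    per_hour = total_bp / hours
    per_second = per_hour / 3600
    return per_hour, per_second


per_hour, per_second = average_rates(TOTAL_BP)
print(f"{per_hour / 1e9:.1f} Gbp/hour, {per_second / 1e6:.1f} Mbp/sec")
# prints "13.7 Gbp/hour, 3.8 Mbp/sec"
```

Since the slide says "> 120 Tbp", these rates are lower bounds under the same assumption.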

Hardware resources

milou: HP cluster of 208 nodes

pica: 6 (7) PB Hitachi storage

halvan: 2 TB high-memory computer

Fast network via SUNET

Backup via SNIC

Long-term storage at SweStore

nestor: 48 nodes production cluster

meles: 547 TB Hitachi storage

mosler: 24 nodes, 223 TB

smog: 100 nodes, ~300 TB

2015: 250 nodes

2016: 200 new nodes

+1 PB

+2 PB

A national e-Infrastructure for NGS

Software + reference data

Support

Education

Compute resources

Storage resources

Efficiency + automation

What we sequenced at NGI /

Chipster workbench on UPPMAX

UpCloud – smog - (OpenStack)

• Open catalogue of VMIs

• Hosted at Uppsala University

M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.

www.bioimg.org

Managing Virtual Machine Images

Mosler overview

• e-Infrastructure for working with sensitive data

• Copy of Norwegian solution (TSD)

• Designed to look like UPPMAX clusters

Mosler specifications

• High-performance computing in a virtualized environment (OpenStack)

• 2-factor authentication

• Restricted data transfer in/out

• Only accessible over remote desktop (ThinLinc) via

Mosler dashboard

• Aim: Compliant with all laws and regulations for analyzing sensitive data in Sweden

Consortia

DBA

Consortium member

MyResearch

Virtual environment

storage compute

Mosler

Data hosting

Data syncing

Access, analysis

Data hosting use case

Manager

DBA

Scientist

LifeGene

Virtual environment

storage compute

Mosler

1. Request for data

2. Approval

3. Data extraction

4. Data transfer

5. Access, analysis

Data extraction use case
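The five numbered steps of the data extraction use case form a strictly ordered workflow. A minimal sketch of that ordering as a small state machine follows; the `Step` and `DataRequest` names are hypothetical illustrations, not part of Mosler, and the slide does not say which role (Scientist, Manager, DBA) performs each step, so none is modelled:

```python
from enum import Enum


class Step(Enum):
    # Step names mirror the numbered labels on the slide.
    REQUEST = 1
    APPROVAL = 2
    EXTRACTION = 3
    TRANSFER = 4
    ACCESS = 5


class DataRequest:
    """Hypothetical record of one request moving through the five steps."""

    def __init__(self, scientist, dataset):
        self.scientist = scientist
        self.dataset = dataset
        self.completed = [Step.REQUEST]  # creating the record is step 1

    @property
    def current(self):
        return self.completed[-1]

    def advance(self, step):
        """Accept `step` only if it is the next one in the sequence."""
        if self.current is Step.ACCESS:
            raise ValueError("workflow already complete")
        expected = Step(self.current.value + 1)
        if step is not expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)


request = DataRequest("scientist", "cohort-subset")
for step in (Step.APPROVAL, Step.EXTRACTION, Step.TRANSFER, Step.ACCESS):
    request.advance(step)
print(request.current.name)
```

Rejecting out-of-order steps captures the point of the diagram: no data is extracted or transferred into the virtual environment before approval.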

Nov 2014

20M € total grant

4M € IT-infrastructure

X-Ten System

• First system able to deliver the $1000 genome

• Each run: 1.2 TB data

• 16 human genomes (30X) per run

• 3 days per run

• Population-scale genomics

• 15K genomes per year
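The X-Ten figures above imply a per-genome data volume and a yearly storage need. A rough sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB), run output split evenly across the 16 genomes, and only the raw run data counted (analysis output and backups are ignored):

```python
# Figures taken from the slide.
RUN_DATA_TB = 1.2          # data per run
GENOMES_PER_RUN = 16       # human genomes (30X) per run
GENOMES_PER_YEAR = 15_000  # stated yearly capacity


def per_genome_gb(run_tb=RUN_DATA_TB, genomes=GENOMES_PER_RUN):
    """Data per genome in GB, assuming an even split of each run."""
    return run_tb * 1000 / genomes


def yearly_storage_pb(genomes_per_year=GENOMES_PER_YEAR):
    """Raw data per year in PB at the stated yearly capacity."""
    return genomes_per_year * per_genome_gb() / 1e6


print(f"{per_genome_gb():.0f} GB per genome")
print(f"{yearly_storage_pb():.3f} PB per year")
```

Under these assumptions each genome takes roughly 75 GB, and 15K genomes per year add a little over 1 PB of raw data annually.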

Swedish Genome Initiative

Call for a reference variation Database (1000 genomes) and for Whole Human Genome (half price).

Goal: 5,000 genomes in 2015, 10,000 genomes in 2016

Data production

[Figure: NGI-Stockholm production (Jan-12 to Dec-15). x-axis: production date (Aug-11 to Jan-16); y-axis: gigabases (0 to 1,000,000). Annotation: conservative prediction (60% of maximum production).]

Whole Genome Sequencing

• Data on new scale, 80% expected to be sensitive

New challenges

• Funding for IT-infrastructure from KAW foundation

– Resources for data production (2 M EUR)

– Resources for scientists (2 M EUR)

• A national security project funded by Swedish Research Council (5 M EUR over 4 years): SNIC-Sens

SNIC-Sens

• 4-year project, started Jan 2015

• Project owner: SNIC (Ann-Charlotte Sonnhammer)

• Project leader: Ola Spjuth (until end of this week)

• Aims:

– Specifications for analyzing sensitive data in SNIC (hardware, legal, contracts, processes etc.)

– Evaluation on the use of public cloud providers (Google, Amazon)

– Make available e-Infrastructure for production and research of data generated at NGI, blueprint for other domains

SNIC-Sens roadmap

• Information classification workshop (21/5)

• Risk/vulnerability analysis (2/6)

• Specifications for hardware procurement

• Public tender (end of this week)

• Installation and testing of production system (Aug-Sept)

• Installation, configuration and testing of research system (Q3-Q4)

• Research system online (Q1 2016)

Two pilots for clinical data management

CML, Lucia Cavelier

MDR, Åsa Melhus
