powerpoint presentationidies.jhu.edu/wp-content/uploads/2018/10/00-szalay.pdf · • terabase...

Alex Szalay 2018

Post on 16-Apr-2020

TRANSCRIPT

Page 1: PowerPoint Presentationidies.jhu.edu/wp-content/uploads/2018/10/00-Szalay.pdf · • Terabase Search Engine –Parallel SQL server warehouse for 265 genomes –240B short reads in

Alex Szalay

2018

Page 2:

The Mission of IDIES

• Intellectual leadership in the “Science of Big Data”, together with MINDS

• Incubator for data intensive discoveries through “disruptive assistance”; increase JHU “agility” in Big Data research projects

• Vision and oversight of high performance and data intensive computing; operate HPC and Big Data facilities, 100G networks (HORnet)

• Train the next generation in data analytic skills

• Support unique data resources which give us competitive advantages and visibility

Page 3:

Main Components

• Six JHU schools participating

• Interdisciplinary faculty appointments

– 4+ Bloomberg Distinguished Professors, 7+ junior appts

– First three in the Mathematics of Big Data -> MINDS

• Endowed data collections

– “ownership” of certain unique data sets gives us visibility and a competitive edge (SDSS, Turbulence, Materials)

• Postdocs and graduate students

– Engaged in interdisciplinary research

– New, crosscutting training on the “Science of Big Data”

– π-shaped people

• Small “venture/seed funds” for game changing research equipment and rapid, disruptive ideas

• More than $16.5M IDC generated

• Started down the path towards a more sustainable future

Page 4:

New Disciplines Added

• Carey Business School joined in 2017

• Materials Science (HEMI, MEDE, PARADIM)

• Smart Cities (helping Baltimore city planning)

• Social Science (several projects with multi-TB data)

• Internet of Things (wireless sensor networks)

• Scalable cancer immunotherapy (towards ~TB/day)

• Large numerical simulations (2PB+ hosted)

=> SciServer

• Interactive collaborative data analytics environment

• Multi PB databases with hundreds of unique datasets

• Scalable scripting with iPython, Matlab and R

• Raised almost $30M in federal funds in 5 years

Page 5:

Current SciServer Projects

[Figure: diagram of current SciServer projects. Legend entries: domain, sub domain, project (active, new).]

Labels shown in the diagram: Astronomy, Materials Science, Life Sciences, Computer Science, WFIRST, SDSS, LSST, SkyQuery, Cosmological Simulations, Millennium, Indra, Virgo Data Center, Observing Virtual Universes, STScI, FragData, Rough Surfaces, Paradim, Neutron Scattering, Recount2, TSE, U01, CAAPA, HIPAA, Cancer Imm. Therapy, Electro-physics, Spike Sorting, Streaming Clustering, Social Sciences, Baltimore City Planning, NAOJ, China Migration, Biology, Chesapeake Bay, Precision Medicine, Kennedy Krieger, Eagle, BigWig, MEDE, ScaleFree, Tweet analysis, Business School, FinTechAI, eROSITA - MPE, Manga, eROSITA - GSFC, Education, AS.171.324, AS.171.205, Astroinformatics, Course Material, RDB, VC Img Inst, Fluid Dynamics, Earth Sciences, Ocean Circulation, Turbulence DB, Big Data in Turbulence, Gluseen

Page 6:

SDSS Skyserver

Prototype in 21st Century data access

– 2.6B web hits in 12 years

– 410M external SQL queries

– 7,000 papers and 450K citations

– 7,000,000 distinct users vs. 15,000 astronomers

– The emergence of the “Internet Scientist”

– The world’s most used astronomy facility today

– Collaborative server-side analysis by 10K astronomers

– SDSS earned the TRUST of the community

Page 7:

Some SkyServer Metrics

Total WWW access 2,605,708,373

Total SQL queries 410,309,546

Total CASJobs queries 36,546,986

Total distinct WWW users 6,887,585

Total distinct SQL users 240,011

Total distinct CASJobs users 9,979

[Charts: Annual WWW access, web hits in millions (axis 0 to 600), 2002 to 2012. New CASJobs users per month (axis 0 to 350), 2003-6 to 2017-9. New users per year (axis 0 to 900,000), 2000 to 2020, with series WWW and SQL*10.]
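The totals on this slide imply rough per-user averages for each access channel. A minimal Python sketch of that arithmetic follows; it uses only the six totals quoted on the slide, and the dictionary layout and the `per_user` helper are illustrative names, not part of SkyServer.

```python
# Totals copied from the "Some SkyServer Metrics" slide.
totals = {
    "WWW": {"requests": 2_605_708_373, "users": 6_887_585},
    "SQL": {"requests": 410_309_546, "users": 240_011},
    "CASJobs": {"requests": 36_546_986, "users": 9_979},
}


def per_user(channel: str) -> float:
    """Average number of requests per distinct user for one access channel."""
    entry = totals[channel]
    return entry["requests"] / entry["users"]


for name in totals:
    print(f"{name}: {per_user(name):,.0f} requests per distinct user")
```

These are plain means over the whole period; the slide gives no information about how activity is distributed across users, so a few heavy users could dominate each average.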

Page 8:

Immersive Turbulence

“… the last unsolved problem of classical physics…” Feynman

• Understand the nature of turbulence

– Consecutive snapshots of a large simulation of turbulence: 30TB

– Treat it as an experiment, play with the database!

– Shoot test particles (sensors) from your laptop into the simulation, like in the movie Twister

– 50TB MHD simulation

– Now: channel flow 100TB, MHD 256TB

– 8K³ simulation just arrived

• New paradigm for analyzing simulations!

• 68 Trillion points delivered in 5 years

• 650TB of simulations accessible

R. Burns, C. Meneveau, T. Zaki, G. Eyink, A. Szalay, E. Vishniac

Page 9:

Cosmology Simulations

• Millennium DB is the poster child / success story

– Built by Gerard Lemson (now at JHU)

– 600 registered users, 17.3M queries, 287B rows

http://gavo.mpa-garching.mpg.de/Millennium/

• Data size and scalability

– PB data sizes, trillion particles of dark matter

• Indra simulations:

– 512 different 1 Gpc/h boxes, 1024³ particles per simulation

– 35T particles total, 1.1PB

Bridget Falck (JHU), Tamás Budavári (JHU), Shaun Cole (Durham), Daniel Crankshaw (JHU),

László Dobos (Eötvös), Adrian Jenkins (Durham), Gerard Lemson (MPA), Mark Neyrinck (JHU),

Alex Szalay (JHU), and Jie Wang (Beijing)
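The Indra figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below reads the per-simulation particle count as 1024 cubed and takes "1.1PB" as decimal petabytes (an assumption). The snapshot count and the bytes per particle it prints are inferences from the quoted totals, not numbers stated on the slide.

```python
# Figures quoted on the slide.
n_simulations = 512
particles_per_snapshot = 1024 ** 3   # particles in one simulation box
total_particles = 35e12              # "35T particles total"
total_bytes = 1.1e15                 # "1.1PB", assumed to mean decimal petabytes

# Particles in one snapshot taken across the whole ensemble of boxes.
ensemble_particles = n_simulations * particles_per_snapshot

# Inferred, not stated on the slide: stored snapshots per simulation.
implied_snapshots = total_particles / ensemble_particles

# Inferred average storage cost per stored particle.
bytes_per_particle = total_bytes / total_particles

print(f"{ensemble_particles:.3e} particles per ensemble snapshot")
print(f"about {implied_snapshots:.0f} snapshots per simulation implied")
print(f"about {bytes_per_particle:.0f} bytes per stored particle")
```

One ensemble snapshot holds roughly 0.55 trillion particles, so the quoted 35T total only makes sense if each simulation stores several dozen snapshots.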

Page 10:

2PB Ocean Laboratory

• 1km resolution whole Earth model, 1 year run

• Collaboration between JHU, MIT, Columbia

– T. Haine, C. Hill, R. Abernathey, R. Gelderloos, G. Lemson, A. Szalay, NSF $1.8M

Page 11:

Materials Science

Wide and diverse projects at JHU

• Hopkins Extreme Materials Institute

10 year, >$70 million to work with Army Research Lab

Understand high-rate response from atoms to meters

– Complex variety of tools, techniques and data need to be shared across an array of disciplines

• Paradim ($25M NSF Center): stress-strain curves, high-speed video & x-ray data, 2D and 3D images with atomic to mm resolution, simulations with different physics at different scales

– Recent NSF Supplement for data infrastructure

– Collaboration with NanoHub

Page 12:

IDIES in Genomics

• ARIOC: GPU based aligner

– 50 times faster than anything else

– SQL Server BCP format option, supports methylation

• Terabase Search Engine

– Parallel SQL server warehouse for 265 genomes

– 240B short reads in the database, 1s search times

• SnapTron (in progress)

– Expression levels for 54,000 full RNA sequences

– Only place to do lateral searches across all samples at a given location along the genome

– Created C# code for compressed representation of data taken from the BigWig format in DB (1 week -> 1 min)

• Linking to CBioPortal

Page 13:

Cancer Immunotherapy

• Trick the immune system into identifying cancer cells

• Complex challenge, lots of tissue data, multicolor staining, image segmentation and measurement

• Analysis requires spatial statistics, correlation function

• Strong similarities to astronomy

• Increase amount of data collected 1000-fold

• Soon thousands of tissue samples, PBs of data

• Pattern recognition problem

• Scalability challenge
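The spatial statistics named on this slide, and the link to astronomy, can be illustrated with a two-point correlation function. This is a minimal sketch under stated assumptions, not the project's pipeline: cells are reduced to hypothetical 2D centroid positions, the estimator is the simple "natural" one (xi = DD/RR - 1, comparing pair counts in the data with pair counts in a uniform random sample), and all names and synthetic data are invented for illustration.

```python
import math
import random
from itertools import combinations


def pair_counts(points, bin_edges):
    """Count unique point pairs whose separation falls in each distance bin."""
    counts = [0] * (len(bin_edges) - 1)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d = math.hypot(x1 - x2, y1 - y2)
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


def two_point_correlation(data, randoms, bin_edges):
    """Natural estimator xi(r) = DD/RR - 1, with each pair count
    normalized by the number of unique pairs in its sample."""
    n_dd = len(data) * (len(data) - 1) / 2
    n_rr = len(randoms) * (len(randoms) - 1) / 2
    dd = pair_counts(data, bin_edges)
    rr = pair_counts(randoms, bin_edges)
    return [(d / n_dd) / (r / n_rr) - 1.0 for d, r in zip(dd, rr)]


# Synthetic demo: 20 tight clumps of 15 "cells" each, compared with
# 600 uniformly scattered reference points in the unit square.
rng = random.Random(0)
centers = [(rng.random(), rng.random()) for _ in range(20)]
clustered = [(cx + rng.gauss(0, 0.01), cy + rng.gauss(0, 0.01))
             for cx, cy in centers for _ in range(15)]
randoms = [(rng.random(), rng.random()) for _ in range(600)]
edges = [0.0, 0.025, 0.05, 0.1, 0.2]

xi = two_point_correlation(clustered, randoms, edges)
print([round(v, 2) for v in xi])
```

Positive values of xi at small separations indicate clustering beyond what a uniform distribution would produce. The all-pairs loop is quadratic in the number of points, which is why the scalability challenge on the slide matters at thousands of tissue samples; tree- or grid-based pair counting would be the usual next step.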

Page 14:

Image Mosaic

Page 15:

The Road Ahead

• Increase support with Machine Learning, provide translational to science projects

• Increase our agility, seed funds, hackathons, …

• Involve more of the younger faculty, get fresh ideas

• Develop a scale-out, cloud strategy

• Sharpen our focus, build on our unique strengths