Alex Szalay
2018
The Mission of IDIES
• Intellectual leadership in the “Science of Big Data”, together with MINDS
• Incubator for data intensive discoveries through “disruptive assistance”, increase JHU “agility” in Big Data research projects
• Vision and oversight of high performance and data intensive computing, operate HPC and Big Data facilities, 100G networks (HORnet)
• Train the next generation in data analytic skills
• Support unique data resources which give us competitive advantages and visibility
Main Components
• Six JHU schools participating
• Interdisciplinary faculty appointments
– 4+ Bloomberg Distinguished Professors, 7+ junior appts
– First three in the Mathematics of Big Data -> MINDS
• Endowed data collections
– “ownership” of certain unique data sets gives us visibility and a competitive edge (SDSS, Turbulence, Materials)
• Postdocs and graduate students
– Engaged in interdisciplinary research
– New, crosscutting training on the “Science of Big Data”
– π-shaped people
• Small “venture/seed funds” for game changing research equipment and rapid, disruptive ideas
• More than $16.5M IDC generated
• Started down the path towards a more sustainable future
New Disciplines Added
• Carey Business School joined in 2017
• Materials Science (HEMI, MEDE, PARADIM)
• Smart Cities (helping Baltimore city planning)
• Social Science (several projects with multi-TB data)
• Internet of Things (wireless sensor networks)
• Scalable cancer immunotherapy (towards ~TB/day)
• Large numerical simulations (2PB+ hosted)
=> SciServer
• Interactive collaborative data analytics environment
• Multi PB databases with hundreds of unique datasets
• Scalable scripting with iPython, Matlab and R
• Raised almost $30M in federal funds in 5 years
Current SciServer Projects
[Diagram: SciServer projects grouped by domain and sub-domain, with active and new projects marked]
• Domains: Astronomy, Materials Science, Life Sciences, Computer Science, Social Sciences, Biology, Business School, Education, Fluid Dynamics, Earth Sciences
• Projects and sub-domains shown: WFIRST, SDSS, LSST, SkyQuery, Cosmological Simulations, Millennium, Indra, Virgo Data Center, Observing Virtual Universes, STScI, FragData, Rough Surfaces, Paradim, Neutron Scattering, Recount2, TSE, U01, CAAPA, HIPAA, Cancer Imm. Therapy, Electro-physics, Spike Sorting, Streaming Clustering, Baltimore City Planning, NAOJ, China Migration, Chesapeake Bay, Precision Medicine, Kennedy Krieger, Eagle, BigWig, MEDE, ScaleFree, Tweet analysis, FinTechAI, eROSITA - MPE, Manga, eROSITA - GSFC, AS.171.324, AS.171.205, Astroinformatics, Course Material, RDB, VC Img Inst, Ocean Circulation, Turbulence DB, Big Data in Turbulence, Gluseen
SDSS Skyserver
Prototype in 21st Century data access
– 2.6B web hits in 12 years
– 410M external SQL queries
– 7,000 papers and 450K citations
– 7,000,000 distinct users vs. 15,000 astronomers
– The emergence of the “Internet Scientist”
– The world’s most used astronomy facility today
– Collaborative server-side analysis by 10K astronomers
– SDSS earned the TRUST of the community
Some SkyServer Metrics
Total WWW access 2,605,708,373
Total SQL queries 410,309,546
Total CASJobs queries 36,546,986
Total distinct WWW users 6,887,585
Total distinct SQL users 240,011
Total distinct CASJobs users 9,979
[Figure: Annual WWW access, web hits in millions, 2002 to 2012]
[Figure: New CASJobs users per month, 2003 to 2017]
[Figure: New users per year, 2000 to 2020, series: WWW and SQL*10]
Immersive Turbulence
“… the last unsolved problem of classical physics…” Feynman
• Understand the nature of turbulence
– Consecutive snapshots of a large simulation of turbulence: 30TB
– Treat it as an experiment, play with the database!
– Shoot test particles (sensors) from your laptop into the simulation, like in the movie Twister
– 50TB MHD simulation
– Now: channel flow 100TB, MHD 256TB
– 8K³ simulation just arrived
• New paradigm for analyzing simulations!
• 68 Trillion points delivered in 5 years
• 650TB of simulations accessible
R. Burns, C. Meneveau, T. Zaki, G. Eyink, A. Szalay, E. Vishniac
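The "shoot test particles into the simulation" idea can be illustrated with a minimal sketch. It is not the turbulence database's actual interface: the function velocity() below is a hypothetical stand-in for a remote velocity query, replaced here by an analytic steady 2D Taylor-Green vortex so the example runs on its own. The integrator is a plain midpoint (RK2) step, chosen only for brevity.

```python
import math

def velocity(x, y, t):
    """Hypothetical stand-in for a velocity query against stored snapshots.

    Here: a steady 2D Taylor-Green vortex (an assumption for illustration).
    """
    u = math.sin(x) * math.cos(y)
    v = -math.cos(x) * math.sin(y)
    return u, v

def advect(x, y, t, dt):
    """Advance one test particle by dt using the midpoint (RK2) rule."""
    u1, v1 = velocity(x, y, t)
    xm, ym = x + 0.5 * dt * u1, y + 0.5 * dt * v1
    u2, v2 = velocity(xm, ym, t + 0.5 * dt)
    return x + dt * u2, y + dt * v2

def track(particles, t0, t1, steps):
    """Follow each particle from t0 to t1; returns one position list per particle."""
    dt = (t1 - t0) / steps
    paths = [[p] for p in particles]
    t = t0
    for _ in range(steps):
        for path in paths:
            x, y = path[-1]
            path.append(advect(x, y, t, dt))
        t += dt
    return paths

def stream_function(x, y):
    """Conserved along exact trajectories of this particular flow."""
    return math.sin(x) * math.sin(y)

# Two "sensors" released at arbitrary (made-up) positions.
paths = track([(0.5, 1.0), (2.0, 0.3)], 0.0, 1.0, 100)
```

In a real setting each call to velocity() would be an interpolated lookup served by the database, so the laptop only receives the values along the particle paths rather than the full snapshots.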
Cosmology Simulations
• Millennium DB is the poster child / success story
– Built by Gerard Lemson (now at JHU)
– 600 registered users, 17.3M queries, 287B rows
– http://gavo.mpa-garching.mpg.de/Millennium/
• Data size and scalability
– PB data sizes, trillion particles of dark matter
• Indra simulations:
– 512 different 1 Gpc/h boxes, 1024³ particles per simulation
– 35T particles total, 1.1PB
Bridget Falck (JHU), Tamás Budavári (JHU), Shaun Cole (Durham), Daniel Crankshaw (JHU),
László Dobos (Eötvös), Adrian Jenkins (Durham), Gerard Lemson (MPA), Mark Neyrinck (JHU),
Alex Szalay (JHU), and Jie Wang (Beijing)
2PB Ocean Laboratory
• 1km resolution whole Earth model, 1 year run
• Collaboration between JHU, MIT, Columbia
– T. Haine, C. Hill, R. Abernathey, R. Gelderloos, G. Lemson, A. Szalay, NSF $1.8M
Materials Science
Wide and diverse projects at JHU
• Hopkins Extreme Materials Institute
– 10 years, >$70 million to work with Army Research Lab
– Understand high-rate response from atoms to meters
– Complex variety of tools, techniques and data need to be shared across array of disciplines
• Paradim ($25M NSF Center): Stress-strain curves, high-speed video & x-ray data, 2D and 3D images with atomic to mm resolution, simulations with different physics at different scales
– Recent NSF Supplement for data infrastructure
– Collaboration with NanoHub
IDIES in Genomics
• ARIOC: GPU based aligner
– 50 times faster than anything else
– SQL Server BCP format option, supports methylation
• Terabase Search Engine
– Parallel SQL server warehouse for 265 genomes
– 240B short reads in the database, 1s search times
• SnapTron (in progress)
– Expression levels for 54,000 full RNA sequences
– Only place to do lateral searches across all samples at a given location along the genome
– Created C# code for compressed representation of data taken from the BigWig format in DB (1 week -> 1 min)
• Linking to CBioPortal
Cancer Immunotherapy
• Trick the immune system into identifying cancer cells
• Complex challenge, lots of tissue data, multicolor staining, image segmentation and measurement
• Analysis requires spatial statistics, correlation functions
• Strong similarities to astronomy
• Increase amount of data collected 1000-fold
• Soon thousands of tissue samples, PBs of data
• Pattern recognition problem
• Scalability challenge
Image Mosaic
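The spatial-statistics step mentioned above can be sketched with a two-point correlation function, the same kind of measurement used for galaxy clustering. This is only an illustration under stated assumptions: the cell positions are synthetic, the names pair_fractions and correlation are hypothetical, and the simple "natural" estimator ξ(r) = DD(r)/RR(r) − 1 is used, comparing normalized pair counts in the data (DD) with those in a uniform random sample (RR). Nothing here is taken from the project's actual pipeline.

```python
import math
import random

def pair_fractions(points, edges):
    """Fraction of all point pairs whose separation falls in each radial bin."""
    counts = [0] * (len(edges) - 1)
    n = len(points)
    for i in range(n):
        xa, ya = points[i]
        for j in range(i + 1, n):
            xb, yb = points[j]
            r = math.hypot(xa - xb, ya - yb)
            for b in range(len(counts)):
                if edges[b] <= r < edges[b + 1]:
                    counts[b] += 1
                    break
    total = n * (n - 1) / 2
    return [c / total for c in counts]

def correlation(data, randoms, edges):
    """Natural estimator: xi = DD / RR - 1 per bin (nan where RR is empty)."""
    dd = pair_fractions(data, edges)
    rr = pair_fractions(randoms, edges)
    return [d / r - 1.0 if r > 0 else float("nan") for d, r in zip(dd, rr)]

# Synthetic "cells": 20 tight clumps of 10 points each in a unit square
# (made-up data standing in for segmented cell centroids).
rng = random.Random(0)
cells = []
for _ in range(20):
    cx, cy = rng.random(), rng.random()
    cells += [(rng.gauss(cx, 0.01), rng.gauss(cy, 0.01)) for _ in range(10)]

# Uniform random comparison sample over the same square.
randoms = [(rng.random(), rng.random()) for _ in range(400)]

edges = [0.0, 0.05, 0.1, 0.2, 0.4]
xi = correlation(cells, randoms, edges)
print(xi)
```

A strongly positive value in the smallest bin indicates clustering on that scale. The brute-force pair loop is quadratic in the number of cells, so at the data volumes described above a tree-based or database-side pair count would be needed.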
The Road Ahead
• Increase support with Machine Learning, provide translational support to science projects
• Increase our agility, seed funds, hackathons, …
• Involve more of the younger faculty, get fresh ideas
• Develop a scale-out, cloud strategy
• Sharpen our focus, build on our unique strengths