big data why it matterskoehl/teaching/ecs129/... · 2018. 2. 21. · big data: volume, velocity one...
TRANSCRIPT
Big Data
Why it matters
Patrice KOEHL
Department of Computer Science Genome Center
UC Davis
The three I’s of Big Data
Big Data is: - Ill-defined (what is it?)
- Immediate (we need to do something about it now)
- Intimidating (what if we don’t)
(loosely adapted from Forbes)
Big Data: Volume
Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB
Big Data: Volume
Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB
30KB
One page of text
5 MB
One song
5 GB
One movie 6 million books
1 TB
55 storeys of DVD
1 PB
Data up to 2003
5 EB
Data in 2011
1.8 ZB
NSA data center
1 YB
Big Data: Volume
Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB 1s 20 mins 11 days 30 years 300 30 million 30 billion …. centuries years years
30KB
One page of text
5 MB
One song
5 GB
One movie 6 million books
1 TB
55 storeys of DVD
1 PB
Data up to 2003
5 EB
Data in 2011
1.8 ZB
NSA data center
1 YB
Big Data: Volume, Velocity
One minute in the digital world
3+ million
searches launched
6 million users connected
1.3 million 30 hours
videos viewed
videos uploaded
(Intel, 2013) 204 million
e-mails sent
50 GB
of data generated at the Large Hadron Collider
640 TB IP data transferred
Big Data: Volume, Velocity, Variety
text
Numbers
Images
sound
Big Data: Challenges
� Volume and Velocity � Variety
¡ Structured, Unstructured…. ¡ Images, Sound, Numbers, Tables,…
� Security
� Reliability, Integrity, Validity
Big Data: Challenges
Large N: “Any dataset that is collected by a scientist whose data collection skills are far superior to her analysis skills” Computing issues: Ø Data transfer Ø Scalability of algorithms Ø Memory limitations Ø Distributed computing
Big Data: Challenges
Vizualization issues: The black screen problem
(Matloff, 2013)
Big Data: Challenges
Rule of thumb: N/P > 5….what if it does not hold anymore? Large P, “small” N: Ø Curse of dimensionality
(all data points seem equidistant)
Ø Non linearity Ø Dimension reduction
Big Data: Challenges and Opportunities
� Fourth Paradigm: data driven science
� Holistic approaches to major research efforts
� New paradigms in computing
� Digital Humanities
Data Knowledge Societal Benefit Basic Translational
Big Data: Enabling Dreams
� Understanding the physics of “Dark Energy” � How the brain works: from neurons to cognition � A holistic view of natural ecosystems � Understanding climate changes � From genotype to phenotype � Precision medicine � Big Humanities � ….
Big Data Dreams: Genomics
Big Data Dreams: Genomics
Genomics: Sequencing costs
$0.01
$0.10
$1.00
$10.00
$100.00
$1,000.00
$10,000.00
Cos
t per
Mba
se
http://www.genome.gov
$100
$1,000
$10,000
$100,000
$1,000,000
$10,000,000
$100,000,000
2001
20
02
2003
20
04
2005
20
06
2007
20
08
2009
20
10
2011
20
12
2013
Cos
t per
Hum
an G
enom
e
Genomics: Game changing technologies
Illumina HiSeq 2000
Capable of 600 Gb per run -> 1,000+ Gb 55 Gb/day 6 billion paired-end reads <$4,000 per human/plant genome <$200 per transcriptome Multiplex 384 pathogen isolates/lane ⇒ $10 (+ $50 library construction)/isolate
Gary Schroth (Illumina): “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days”.
Challenges: Library preparation & data analysis
Genomics @ UC Davis
Massively parallel DNA sequencing 5 Illumina Genome Analyzers 2 Pacific Biosystems RS
Cancer Genomics: Molecular Diagnostics
“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)
Genomics: actual costs
“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)
Genomics: actual costs
Assembling 22GB conifer genome: Data: - 16 billion pair reads (100 bases)
Processing: - 10 days for error correction
- 11 days for assembling “super-reads”
- 60 days to build contigs/scaffold
- 8 days to fill in gaps
http://www.homolog.us/blogs/2013/05/11/ steven-salzberg-at-bog13-assembling-22gb-conifer-genome/
Social Consequences of Commodity Sequencing
� The danger of misuse predict sensitivities to various industrial or environmental agents
discrimination by employers?
� The impact of information that is likely to be incomplete an indication of a 25 percent increase in the risk of cancer?
� Reversal of knowledge paradigm � Are the "products" of the Human Genome Project to be
patented and commercialized? Myriad genetics and BRCA1/2
� How to educate about genetic research and its implications?
Social Consequences of Commodity Sequencing
Social Consequences of Commodity Sequencing
How to Approach Big Data