big data why it matterskoehl/teaching/ecs129/... · 2018. 2. 21. · big data: volume, velocity one...

25
Big Data Why it matters Patrice KOEHL Department of Computer Science Genome Center UC Davis

Upload: others

Post on 05-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data

Why it matters

Patrice KOEHL

Department of Computer Science Genome Center

UC Davis

Page 2: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

The three I’s of Big Data

Big Data is: -  Ill-defined (what is it?)

-  Immediate (we need to do something about it now)

-  Intimidating (what if we don’t)

(loosely adapted from Forbes)

Page 3: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Volume

Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB

Page 4: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Volume

Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB

30KB

One page of text

5 MB

One song

5 GB

One movie 6 million books

1 TB

55 storeys of DVD

1 PB

Data up to 2003

5 EB

Data in 2011

1.8 ZB

NSA data center

1 YB

Page 5: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Volume

Byte Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte Zettabyte Yottabyte KB MB GB TB PB EB ZB YB 1000 bytes 1000 KB 1000 MB 1000 GB 1000 TB 1000 PB 1000 ZB 1000YB 1s 20 mins 11 days 30 years 300 30 million 30 billion …. centuries years years

30KB

One page of text

5 MB

One song

5 GB

One movie 6 million books

1 TB

55 storeys of DVD

1 PB

Data up to 2003

5 EB

Data in 2011

1.8 ZB

NSA data center

1 YB

Page 6: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Volume, Velocity

One minute in the digital world

3+ million

searches launched

6 million users connected

1.3 million 30 hours

videos viewed

videos uploaded

(Intel, 2013) 204 million

e-mails sent

50 GB

of data generated at the Large Hadron Collider

640 TB IP data transferred

Page 7: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Volume, Velocity, Variety

text

Numbers

Images

sound

Page 8: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Challenges

�  Volume and Velocity �  Variety

¡  Structured, Unstructured…. ¡  Images, Sound, Numbers, Tables,…

�  Security

�  Reliability, Integrity, Validity

Page 9: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Challenges

Large N: “Any dataset that is collected by a scientist whose data collection skills are far superior to her analysis skills” Computing issues: Ø  Data transfer Ø  Scalability of algorithms Ø  Memory limitations Ø  Distributed computing

Page 10: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Challenges

Vizualization issues: The black screen problem

(Matloff, 2013)

Page 11: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Challenges

Rule of thumb: N/P > 5….what if it does not hold anymore? Large P, “small” N: Ø  Curse of dimensionality

(all data points seem equidistant)

Ø  Non linearity Ø  Dimension reduction

Page 12: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Challenges and Opportunities

�  Fourth Paradigm: data driven science

�  Holistic approaches to major research efforts

�  New paradigms in computing

�  Digital Humanities

Data Knowledge Societal Benefit Basic Translational

Page 13: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data: Enabling Dreams

�  Understanding the physics of “Dark Energy” �  How the brain works: from neurons to cognition �  A holistic view of natural ecosystems �  Understanding climate changes �  From genotype to phenotype �  Precision medicine �  Big Humanities �  ….

Page 14: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data Dreams: Genomics

Page 15: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Big Data Dreams: Genomics

Page 16: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Genomics: Sequencing costs

$0.01

$0.10

$1.00

$10.00

$100.00

$1,000.00

$10,000.00

Cos

t per

Mba

se

http://www.genome.gov

$100

$1,000

$10,000

$100,000

$1,000,000

$10,000,000

$100,000,000

2001

20

02

2003

20

04

2005

20

06

2007

20

08

2009

20

10

2011

20

12

2013

Cos

t per

Hum

an G

enom

e

Page 17: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Genomics: Game changing technologies

Illumina HiSeq 2000

Capable of 600 Gb per run -> 1,000+ Gb 55 Gb/day 6 billion paired-end reads <$4,000 per human/plant genome <$200 per transcriptome Multiplex 384 pathogen isolates/lane ⇒ $10 (+ $50 library construction)/isolate

Gary Schroth (Illumina): “A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days”.

Challenges: Library preparation & data analysis

Page 18: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Genomics @ UC Davis

Massively parallel DNA sequencing 5 Illumina Genome Analyzers 2 Pacific Biosystems RS

Page 19: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Cancer Genomics: Molecular Diagnostics

Page 20: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)

Genomics: actual costs

Page 21: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

“A single lab with one HiSeq is able to generate more sequences than was in GenBank in 2009, every four days.” Gary Schroth (Illumina)

Genomics: actual costs

Assembling 22GB conifer genome: Data: - 16 billion pair reads (100 bases)

Processing: - 10 days for error correction

- 11 days for assembling “super-reads”

- 60 days to build contigs/scaffold

- 8 days to fill in gaps

http://www.homolog.us/blogs/2013/05/11/ steven-salzberg-at-bog13-assembling-22gb-conifer-genome/

Page 22: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Social Consequences of Commodity Sequencing

�  The danger of misuse predict sensitivities to various industrial or environmental agents

discrimination by employers?

�  The impact of information that is likely to be incomplete an indication of a 25 percent increase in the risk of cancer?

�  Reversal of knowledge paradigm �  Are the "products" of the Human Genome Project to be

patented and commercialized? Myriad genetics and BRCA1/2

�  How to educate about genetic research and its implications?

Page 23: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Social Consequences of Commodity Sequencing

Page 24: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

Social Consequences of Commodity Sequencing

Page 25: Big Data Why it matterskoehl/Teaching/ECS129/... · 2018. 2. 21. · Big Data: Volume, Velocity One minute in the digital world 3+ million searches launched 6 million users connected

How to Approach Big Data