CERN – IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Working with Large Data Sets Tim Smith CERN/IT Open Access and Research Data Session


Post on 13-Dec-2015


CERN – IT Department

CH-1211 Genève 23, Switzerland

www.cern.ch/it

Working with Large Data Sets

Tim Smith CERN/IT

Open Access and Research Data Session


OAI8 [Jun 2013] - 2

Research Big Data Research


How big is BIG ?

[Figure: size scale with ticks from 1 GB to 100 PB]

LHC Data

• 40M collisions/sec
• 150M readout channels
• Data rates around the Trigger and Filter stages: 10 PB/s, 10 TB/s, 6 GB/s
• 100 PB in 3 years
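The rates quoted on this slide imply large reduction factors between stages. The short Python sketch below works them out under stated assumptions: that 10 PB/s is the rate before the trigger, 10 TB/s the rate after the trigger, and 6 GB/s the rate written to storage after the filter (the slide does not state this ordering explicitly), and that units are decimal (1 PB = 10^15 bytes). All variable names are illustrative.

```python
# Illustrative arithmetic only; stage ordering and decimal units are assumptions.
GB = 10**9
TB = 10**12
PB = 10**15

detector_rate = 10 * PB  # bytes/s, assumed rate before the trigger
trigger_rate = 10 * TB   # bytes/s, assumed rate after the trigger
storage_rate = 6 * GB    # bytes/s, assumed rate after the filter

trigger_reduction = detector_rate / trigger_rate
filter_reduction = trigger_rate / storage_rate
total_reduction = detector_rate / storage_rate

# Volume that 6 GB/s would produce if sustained nonstop for 3 years.
seconds_3y = 3 * 365 * 24 * 3600
continuous_volume_pb = storage_rate * seconds_3y / PB

# Fraction of the time at 6 GB/s that would yield the quoted 100 PB.
implied_duty_fraction = 100 / continuous_volume_pb

print(f"trigger reduction: {trigger_reduction:,.0f}x")
print(f"filter reduction:  {filter_reduction:,.0f}x")
print(f"total reduction:   {total_reduction:,.0f}x")
print(f"3 years at 6 GB/s: {continuous_volume_pb:,.0f} PB")
print(f"implied duty:      {implied_duty_fraction:.0%}")
```

Under these assumptions the trigger removes about a factor of 1,000 and the filter a further factor of roughly 1,700. Sustaining 6 GB/s without pause for three years would give well over 500 PB, so the quoted 100 PB suggests that rate applies for only a fraction of the time.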


Primary Storage

• 6 GB/s
• 80,000 Disks
• 88,000 CPU Cores
• 18,000 1GB NICs
• 2,700 10GB NICs


Problems Size Brings

• Few places can store it
– Lots of Copies ?
– 10% at 10 centres
• Hard to transport
• Aggregation easier for processing than storage
• Models of Networked Analysis at Regional Centres

x 2 locations @ CERN


• Tier-1s: 11 Sites
• Tier-2s: 140 Sites
• 350,000 Cores


Worldwide LHC Computing Grid

• Distributed Data Management
– Limited Network resources
– Optimize / minimize movement
– File placement logic
– Deterministic / Static
• Site Data Management
– HSMs
– Transparent file access and movement
• Disk-Tape migration/recall


Research Data Infrastructure of today

• Distributed Data Management
– Network: a resource to schedule
– Dynamic data placement
– Data transfer services
– Expt replica management rules
• Site Data Management
– Indep. technology choices
– Decoupled tiers
– Disk caches
• Managed by owners
– Bulk 3rd party migration to tertiary by owners
• AAA: any data, any time, any where


Data Reduction / Analysis

• Publication

• Reduced

• Reconstructed

• Raw

Researchers: T2s, T1s
Analysis Coordinators: T1s
Production Managers: T0, T1s

[Figure axes: File Size, # Files]


Data Inflation

• Static: Storing 100 PB was a good challenge
• Dynamic: Analysing it means transformation, reduction, transport, replication, regeneration

In 2012:
• WAN: 300 PB
• LAN: 3 EB
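The contrast between static and dynamic data can be made concrete with a little arithmetic. The Python sketch below compares the 2012 traffic figures from the slide with the 100 PB stored. Treating the ratio as an "inflation factor" is an illustrative framing rather than a definition from the talk, and decimal units plus a 366-day year (2012 was a leap year) are assumptions.

```python
# Rough comparison of traffic volume against stored volume.
# Decimal units and the "inflation factor" framing are assumptions.
PB = 10**15
EB = 10**18

stored = 100 * PB       # static data volume quoted on the slide
wan_traffic = 300 * PB  # WAN volume quoted for 2012
lan_traffic = 3 * EB    # LAN volume quoted for 2012

wan_factor = wan_traffic / stored
lan_factor = lan_traffic / stored

# Average LAN throughput if the traffic were spread evenly over 2012.
seconds_2012 = 366 * 24 * 3600
avg_lan_gb_per_s = lan_traffic / seconds_2012 / 10**9

print(f"WAN traffic is {wan_factor:.0f}x the stored volume")
print(f"LAN traffic is {lan_factor:.0f}x the stored volume")
print(f"average LAN throughput: about {avg_lan_gb_per_s:.0f} GB/s")
```

On these figures, one year of LAN traffic is about 30 times the stored volume, which is far more than the storage number alone would suggest when sizing the infrastructure.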


More than Data

• Papers
• Tabular Data
• Correlation Matrices

• Internal Notes
• Wikis
• Presentations

• Quality monitoring data
• Filter / selection algorithms
• Formatters

• Calibration Data
• Conditions Data
• Log Books

Researchers: T2s, T1s
Analysis Coordinators: T1s
Production Managers: T0, T1s

• Workflows
• Contextual metadata
• SW: 10M LoC


Data for Tomorrow

• Digital Libraries
– DOIs, ORCID
– DataCite, OAIS
• Custom systems
– DBs
– HEP formats
• Mass Storage
– GUIDs
– Access
– Bit Preservation

30 TBs !


Bit Access / Bit Preservation

• Media Verification
– Hot / Cold Data
– Catching and correcting errors while you still can
– 10% of production drive capacity for 2.6 years
– (0.000065% data loss)

Bit Rot
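Media verification amounts to re-reading stored data and comparing it with a checksum recorded at write time, so that corruption is caught while a good copy may still exist. The Python sketch below illustrates the idea on ordinary files. It is not the tooling described in the talk: the choice of SHA-256, the in-memory `catalogue` dictionary, and the function names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(catalogue: Dict[Path, str]) -> List[Path]:
    """Return the files whose current checksum differs from the recorded one."""
    return [path for path, recorded in catalogue.items()
            if file_checksum(path) != recorded]


def demo() -> List[str]:
    """Write two files, corrupt one bit in one of them, and run verification."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.dat"
        bad = Path(tmp) / "bad.dat"
        for path in (good, bad):
            path.write_bytes(b"event data" * 1000)

        # Record checksums at "write time" (hypothetical catalogue).
        catalogue = {path: file_checksum(path) for path in (good, bad)}

        # Simulate bit rot by flipping a single bit in one file.
        data = bytearray(bad.read_bytes())
        data[123] ^= 0x01
        bad.write_bytes(bytes(data))

        return [path.name for path in verify(catalogue)]


if __name__ == "__main__":
    print("corrupted files:", demo())
```

A single flipped bit is enough to change the digest, so only the damaged file is reported. At the scale discussed in the talk, the cost is dominated by re-reading the media, which is why verification consumes a share of drive capacity.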


Bit Access / Bit Preservation

• Media Migration
– Drive and Media obsolescence
– 50% of current drive capacity for 2 years


Little Big Data

Long tail of science

Big facilities

[Figure axis: Data Size]

x (a small number)

x (a large number)


Concluding Remarks

• Data Management is best done by the data owner
– Make data management services available
– Empower the users!
• Don’t assume researchers’ behavioural patterns
– Decouple the layers
– Offer integratable building blocks, not an integrated solution
• Static Data is one thing, planning for Active Data quite another
– (The number you know) x (a large factor)
• Housekeeping (media verification and migration) takes non-negligible resources
• Preservation means combining expertise
– Storage managers, SW engineers, Librarians and Researchers
