Large Infrastructure Monitoring at CERN, by Matthias Braeger at Big Data Spain 2015


Upload: big-data-spain

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain 2015

Large Infrastructure Monitoring @ CERN

Matthias Bräger, CERN
Thursday, 15 October 2015, Big Data Spain


Agenda

Matthias Bräger, Software Engineer, CERN

▪  Big Data @ CERN
▪  In-Memory Data Grid & Streaming Analytics
▪  Concrete CERN Example


Physics data (> 100 PB)

Metadata of physics data

Sensor Data of technical installations

Log data

Configuration data

Documents

Media data

Others


European Organization for Nuclear Research

▪ Founded in 1954 (60 years ago!)
▪ 21 Member States
▪ ~ 3’360 staff, fellows, students...
▪ ~ 10’000 scientists from 113 different countries
▪ Budget: 1 billion CHF/year

http://cern.ch


From Physics to Industry


The world's biggest machine

LHC experiments: ATLAS, CMS, LHCb, ALICE

▪ Generated 30 petabytes in 2012
▪ > 100 PB in total!


LHC - Large Hadron Collider

▪ 27 km ring of superconducting magnets
▪ Started operation in 2010 with 3.5 + 3.5 TeV, 4 + 4 TeV in 2012
▪ 2013 – 2015 in Long Shutdown 1 (machine upgrade)
▪ Restarted in April 2015 with 6.5 + 6.5 TeV max


Some ATLAS facts
▪  25 m diameter, 46 m length, 7’000 tons
▪  100 million channels
▪  40 MHz collision rate (~ 1 PB/s)
▪  Run 1: 200 Hz (~ 320 MB/s) event rate after filtering
▪  Run 2: up to 1 kHz
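
The ATLAS figures above imply a very large reduction between collisions and recorded events. The following back-of-the-envelope sketch works this out from the quoted numbers only. Decimal units (1 MB = 10^6 bytes) and an unchanged event size in Run 2 are assumptions made for illustration, not statements from the talk.

```python
# Rough arithmetic on the ATLAS figures quoted on the slide.
COLLISION_RATE_HZ = 40e6     # 40 MHz collision rate
RUN1_EVENT_RATE_HZ = 200     # Run 1 event rate after filtering
RUN1_RATE_BYTES_S = 320e6    # ~320 MB/s after filtering (decimal units assumed)
RUN2_EVENT_RATE_HZ = 1_000   # Run 2: up to 1 kHz

# How many collisions occur for every event that is kept.
rejection_factor = COLLISION_RATE_HZ / RUN1_EVENT_RATE_HZ

# Average size of one recorded event in Run 1, in MB.
event_size_mb = RUN1_RATE_BYTES_S / RUN1_EVENT_RATE_HZ / 1e6

# Projected Run 2 output rate in GB/s, ASSUMING the event size stays the same.
run2_rate_gb_s = RUN2_EVENT_RATE_HZ * event_size_mb / 1e3

print(f"1 event kept per {rejection_factor:,.0f} collisions")
print(f"~{event_size_mb:.1f} MB per recorded event")
print(f"~{run2_rate_gb_s:.1f} GB/s at 1 kHz (same event size assumed)")
```

In other words, the filtering keeps roughly one collision in 200'000, at about 1.6 MB per recorded event.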


Is Hadoop used for storing the ~30 PB/year of physics data?

No ;-( Experimental data are mainly stored on tape.

CERN uses Hadoop for storing the metadata of the experimental data.


Physics Data Handling
▪  Run 1: 30 PB per year, demanding 100’000 processors, with peaks of 20 GB/s writing to tape spread across 80 tape drives
▪  Run 2: > 50 PB per year

CERN’s Computer Center (1st floor)


Physics Data Handling
2013: already more than 100 PB stored in total!
▪  > 88 PB on 55’000 tapes
▪  > 13 PB on disk
▪  > 150 PB free tape storage waiting for Run 2

CERN’s tape robot


Why tape storage?
▪  Cost of tape storage is a lot less than disk storage
▪  No electricity consumption when tapes are not being accessed
▪  Tape storage size = Data + Copy; Hadoop storage size = Data + 2 Copies
▪  No requirement to have all recorded physics data available within seconds

CERN’s tape robot
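
The storage-size comparison on this slide can be made concrete with a small calculation. The sketch below applies "Data + Copy" and "Data + 2 Copies" to a 100 PB data set, the total mentioned earlier in the talk. The function name is made up for illustration.

```python
def raw_capacity_pb(data_pb: float, extra_copies: int) -> float:
    """Capacity needed to hold the data plus the given number of extra copies."""
    return data_pb * (1 + extra_copies)

DATA_PB = 100  # roughly the total amount of physics data quoted in the talk

tape_pb = raw_capacity_pb(DATA_PB, extra_copies=1)    # Data + Copy
hadoop_pb = raw_capacity_pb(DATA_PB, extra_copies=2)  # Data + 2 Copies

print(f"Tape:   {tape_pb:.0f} PB")
print(f"Hadoop: {hadoop_pb:.0f} PB")
print(f"Extra capacity for the Hadoop scheme: {hadoop_pb - tape_pb:.0f} PB")
```

Under these assumptions the three-fold scheme needs 100 PB more raw capacity than tape for the same data, before any difference in cost per PB is considered.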


@ CERN: 3 HBase Clusters
▪  CASTOR Cluster with ~10 servers
-  ~ 100 GB of logs per day
-  > 120 TB of logs in total
▪  ATLAS Cluster with ~20 servers
-  Event index catalogue for experimental data in the Grid
▪  Monitoring Cluster with ~10 servers
-  Log events from CERN Computer Center


Metadata from physics event
Metadata are created upon recording of the physics event.
Example 1:
▪  Tape storage event log
-  On which tape is my file stored?
-  Is there a copy on disk?
-  List all events for a given tape or drive
-  Was the tape repacked?


Example 1: Tape Storage event log
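
The questions listed for the tape storage event log amount to lookups by file and by tape. The sketch below shows one way such lookups could work on a small in-memory log. It is not the CASTOR or HBase schema: the class names, field names and action values ("write", "repack") are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TapeEvent:
    timestamp: int
    tape_id: str
    drive_id: str
    file_name: str
    action: str  # hypothetical values: "write", "read", "repack"

class TapeEventLog:
    """Toy event log indexed by tape and by file (illustrative only)."""

    def __init__(self):
        self._by_tape = defaultdict(list)
        self._by_file = defaultdict(list)

    def record(self, event: TapeEvent) -> None:
        self._by_tape[event.tape_id].append(event)
        self._by_file[event.file_name].append(event)

    def tape_for_file(self, file_name: str):
        """Tape of the most recent write of the file, or None if never written."""
        writes = [e for e in self._by_file.get(file_name, []) if e.action == "write"]
        return max(writes, key=lambda e: e.timestamp).tape_id if writes else None

    def events_for_tape(self, tape_id: str):
        """All events for one tape, oldest first."""
        return sorted(self._by_tape.get(tape_id, []), key=lambda e: e.timestamp)

    def was_repacked(self, tape_id: str) -> bool:
        return any(e.action == "repack" for e in self._by_tape.get(tape_id, []))

log = TapeEventLog()
log.record(TapeEvent(1, "T001", "D1", "run1.raw", "write"))
log.record(TapeEvent(2, "T001", "D2", "run1.raw", "repack"))
log.record(TapeEvent(3, "T002", "D2", "run1.raw", "write"))

print(log.tape_for_file("run1.raw"))
print(log.was_repacked("T001"))
print(len(log.events_for_tape("T001")))
```

A store like HBase would typically express the same idea through row-key design rather than Python dictionaries, but the access patterns are the same.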


Metadata from physics event
Metadata are created upon recording of the physics event.
Example 2:
▪  Information about
-  event number
-  run number
-  timestamp
-  luminosity block number
-  trigger that selected the event, etc.


Example 2: ATLAS EventIndex catalogue
Prototype of an event-level metadata catalogue for all ATLAS events
▪  In 2011 and 2012, ATLAS produced 2 billion real events and 4 billion simulated events
▪  Migration to Hadoop for Run 2 of the LHC

Data are read from the brokers, decoded and stored into Hadoop.


Example 2: ATLAS EventIndex catalogue
The major use cases of the EventIndex project are:

▪  Event picking: give me the reference (pointer) to "this" event in "that" format for a given processing cycle.

▪  Production consistency checks: technical checks that processing cycles are complete (event counts match).

▪  Event service: give me the references (pointers) for “this” list of events, or for the events satisfying given selection criteria.
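
The three use cases above map naturally onto three operations on an index keyed by run and event number. The following is a minimal in-memory sketch of that idea, not the EventIndex implementation. The record fields follow the metadata listed earlier, while the class names, the format labels ("AOD", "RAW") and the pointer strings are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class EventRecord:
    run_number: int
    event_number: int
    lumi_block: int
    trigger: str
    references: dict = field(default_factory=dict)  # data format -> pointer

class EventIndex:
    """Toy event-level catalogue (illustrative only)."""

    def __init__(self):
        self._records = {}

    def add(self, record: EventRecord) -> None:
        self._records[(record.run_number, record.event_number)] = record

    def pick(self, run_number, event_number, data_format):
        """Event picking: pointer to one event in one format, or None."""
        record = self._records.get((run_number, event_number))
        return record.references.get(data_format) if record else None

    def count(self, run_number) -> int:
        """Consistency check helper: number of indexed events for a run."""
        return sum(1 for run, _ in self._records if run == run_number)

    def select(self, predicate, data_format):
        """Event service: pointers for all events matching a selection."""
        return [r.references[data_format] for r in self._records.values()
                if predicate(r) and data_format in r.references]

index = EventIndex()
index.add(EventRecord(200, 1, 10, "mu20", {"AOD": "file_a#0"}))
index.add(EventRecord(200, 2, 10, "e25", {"AOD": "file_a#1", "RAW": "file_r#7"}))
index.add(EventRecord(201, 1, 3, "mu20", {"AOD": "file_b#0"}))

print(index.pick(200, 2, "RAW"))
print(index.count(200))
print(index.select(lambda r: r.trigger == "mu20", "AOD"))
```

At the scale of billions of events the dictionary would be replaced by a distributed store, which is where Hadoop comes in for Run 2.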


Agenda

Matthias Bräger, Software Engineer, CERN

▪  Big Data @ CERN
▪  In-Memory Data Grid & Streaming Analytics
▪  Concrete CERN Example


Physics data (> 100 PB)

Metadata of physics data

Sensor Data of technical installations

Log data

Configuration data

Documents

Media data

Others


Growth of Data

Transactions, Sensors, Logs, M2M, ...


Big Data is first of all a cost factor!

The infrastructure has to be put in place to store the data.

To get a maximum return on investment, it requires good analytic tools and well-defined target goals to harvest the precious insights of your data.


The value of real time

Latency Matters


Uptime, SLAs, HA

Performance and Scale


The Shift

From: 90% of data in disk-based databases

To: 90% of data in memory

RAM is 58,000 times faster than disk and 2,000 times faster than solid-state drives (SSD)


Tiered Storage

▪  Process memory: 4 GB, 2,000,000+ TPS, latency in microseconds
▪  Local off-heap memory: 32 GB – 12 TB, 1,000,000 TPS, latency in microseconds
▪  Distributed memory (server RAM or Flash/SSD): 100s GB – 100s TB, 100,000 TPS, latency in milliseconds
▪  External data source (e.g., database, Hadoop, data warehouse): 1,000s TPS, latency in seconds
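
A read in a tiered store usually tries the fastest tier first and only falls through to slower tiers on a miss. The sketch below illustrates that read path with plain dictionaries standing in for the tiers. It is not how any specific product implements it: the class and tier names are hypothetical, and eviction and capacity limits are deliberately left out.

```python
class TieredStore:
    """Toy read-through lookup across storage tiers, fastest first."""

    def __init__(self, tiers, source):
        self.tiers = tiers    # list of (name, dict), ordered fastest to slowest
        self.source = source  # dict standing in for the external data source

    def get(self, key):
        """Return (value, name of the tier that served the read)."""
        for i, (name, tier) in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Promote the value into all faster tiers for the next read.
                for _, faster in self.tiers[:i]:
                    faster[key] = value
                return value, name
        # Miss in every tier: load from the external source and fill the tiers.
        value = self.source[key]
        for _, tier in self.tiers:
            tier[key] = value
        return value, "external"

store = TieredStore(
    tiers=[("process", {}), ("off-heap", {}), ("distributed", {"b": 2})],
    source={"a": 1, "b": 2},
)

print(store.get("a"))  # first read comes from the external source
print(store.get("a"))  # second read is served from process memory
print(store.get("b"))  # found in the distributed tier, then promoted
print(store.get("b"))
```

The point of the tiers is that most reads should stop at the first or second level, so the slow external source is only touched on a cold miss.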


Achieving High Availability with RAM?

In-memory data grids replicate the data to one or more nodes
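
Replication is what lets volatile RAM survive a node failure: every entry lives on more than one node. The sketch below shows the idea with a toy grid that writes each key to a primary node plus a configurable number of backups. The classes and the placement scheme (CRC32 of the key, then neighbouring nodes) are assumptions for illustration and do not describe a particular data grid product.

```python
import zlib

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

class ReplicatedGrid:
    """Toy data grid that keeps each entry on 1 + `replicas` nodes."""

    def __init__(self, nodes, replicas=1):
        if replicas >= len(nodes):
            raise ValueError("need more nodes than replicas")
        self.nodes = nodes
        self.replicas = replicas

    def owners(self, key):
        """Primary node followed by its backup nodes for this key."""
        start = zlib.crc32(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas + 1)]

    def put(self, key, value):
        for node in self.owners(key):
            if node.alive:
                node.data[key] = value

    def get(self, key):
        # Read from the first owner that is alive and holds the key.
        for node in self.owners(key):
            if node.alive and key in node.data:
                return node.data[key]
        raise KeyError(key)

grid = ReplicatedGrid([Node("n1"), Node("n2"), Node("n3")], replicas=1)
grid.put("sensor:42", 21.5)

primary = grid.owners("sensor:42")[0]
primary.alive = False          # simulate losing the primary node
print(grid.get("sensor:42"))   # still served from the backup copy
```

A real grid also has to re-replicate data after a failure and handle nodes rejoining, which this sketch omits.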


Why now?

Explosion in volume and velocity of data

Steep drop in price of RAM


In-Memory Data Grid (IMDG) Platforms

-  Scale of NoSQL
-  Low latency of in-memory databases
-  Reliability & fault tolerance
-  Transactional guarantees

Fast Big Data


Scale with data and processing needs

▪  Increase data in memory
▪  Reduce database dependency

Diagram labels, in three stages: (1) Application with DB; (2) Application with In-Memory store and DB; (3) Distributed In-Memory: several Applications, each with an In-Memory store, sharing one DB


Use Cases for IMDG

•  Cache to overcome legacy data bottlenecks
•  Cache for transient data
•  Primary store for modern apps
•  NoSQL database at in-memory speed
•  Data services fabric for real-time data integration
•  Compute grid at in-memory speed


In-Memory Data Grid solutions

Forrester Wave™: In-Memory Data Grids, Q3 2015


Analyzing data In-Motion
Streaming analytics and filtering with Complex Event Processing (CEP)

Diagram labels: Events; CEP engine; Actions; In-Memory Event stack; Hadoop Storage; Alarming
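
A typical CEP rule looks at a window of recent events rather than at a single value. The sketch below shows one such rule in plain Python: it triggers an action when a minimum number of readings exceed a threshold within a sliding time window. It does not use a real CEP engine or query language, and the class name and parameters are made up for illustration.

```python
from collections import deque

class ThresholdWindowRule:
    """Fires when at least `min_count` values above `threshold`
    were seen within the last `window_s` seconds."""

    def __init__(self, threshold, min_count, window_s):
        self.threshold = threshold
        self.min_count = min_count
        self.window_s = window_s
        self._hits = deque()  # timestamps of readings above the threshold

    def on_event(self, timestamp, value) -> bool:
        if value > self.threshold:
            self._hits.append(timestamp)
        # Drop hits that have fallen out of the sliding window.
        while self._hits and self._hits[0] <= timestamp - self.window_s:
            self._hits.popleft()
        return len(self._hits) >= self.min_count

rule = ThresholdWindowRule(threshold=80, min_count=3, window_s=10)
events = [(0, 85), (4, 90), (8, 70), (9, 95), (15, 60)]  # (seconds, value)

fired_at = [t for t, v in events if rule.on_event(t, v)]
print(fired_at)
```

Because only the timestamps inside the window are kept, the rule runs entirely in memory and each event is processed once as it arrives.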


Streaming Analytics

•  Many existing products, but still no standards
•  [logo]: Open Source, SQL-like query language
•  JEPC: An attempt to standardize event processing, from the Database Research Group, University of Marburg: http://www.mathematik.uni-marburg.de/~bhossbach/jepc/


Use cases

Influencing operations and decisions


Agenda

Matthias Bräger, Software Engineer, CERN

▪  Big Data @ CERN
▪  In-Memory Data Grid & Streaming Analytics
▪  Concrete CERN Example


Cooling

Access Control

Safety Systems

Network and Hardware Controls

Cryogenics

Electricity


TIM – Technical Infrastructure Monitoring
▪ Operational since 2005
▪ Used to monitor and control infrastructure at CERN
▪ 24/7 service
▪ ~ 100 different main users at CERN
▪ Since Jan. 2012 based on new server architecture with C2MON

CERN Control Center at LHC startup


Monitored systems: Cooling, Safety Systems, Electricity, Access, Network and Hardware Controls, Cryogenics

Data Acquisition & Filtering

TIM Server based on C2MON: > 91k data sensors, > 41k alarms, > 1300 rules, > 1200 commands

Client Tier: TIM Viewer, Alarm Console, Data Analysis, Video Viewer, Access Management, Web Apps


Data Acquisition & Filtering: ca. 400 million raw values per day, filtered down to ca. 2 million updates

TIM Server based on C2MON: > 91k data sensors, > 41k alarms, > 1300 rules, > 1200 commands

Client Tier: TIM Viewer, Alarm Console, Data Analysis, Video Viewer, Access Management, Web Apps
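
The two volume figures on this slide give a feel for how much work the acquisition layer removes from the server. The short calculation below turns them into per-second rates and a reduction factor. It assumes the load is spread evenly over the day, which is a simplification.

```python
RAW_VALUES_PER_DAY = 400e6  # ca. 400 million raw values per day
UPDATES_PER_DAY = 2e6       # ca. 2 million updates after filtering
SECONDS_PER_DAY = 24 * 60 * 60

raw_per_second = RAW_VALUES_PER_DAY / SECONDS_PER_DAY
updates_per_second = UPDATES_PER_DAY / SECONDS_PER_DAY
reduction_factor = RAW_VALUES_PER_DAY / UPDATES_PER_DAY
filtered_share = 1 - UPDATES_PER_DAY / RAW_VALUES_PER_DAY

print(f"raw values:  ~{raw_per_second:,.0f} per second")
print(f"updates:     ~{updates_per_second:.1f} per second")
print(f"reduction:   {reduction_factor:.0f} : 1")
print(f"filtered out: {filtered_share:.1%}")
```

So the server sees on the order of tens of updates per second instead of several thousand raw values, a reduction of about 200 to 1.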


C2MON - CERN Control and Monitoring Platform

Diagram labels: my DAQ with C2MON DAQ API; C2MON server; my app with C2MON client API

▪  Allows the rapid implementation of high-performance monitoring solutions
▪  Modular and scalable at all layers
▪  Optimized for High Availability & big data volume
▪  Based on In-Memory solution
▪  All written in Java

Currently used by two big monitoring systems @ CERN: TIM & DIAMON
Central LHC alarm system (“LASER”) in migration phase

http://cern.ch/c2mon


C2MON architecture

Diagram labels: DAQs; C2MON server; Application Tier; History / Backup

In-Memory
-  Configuration
-  Rule logic
-  Latest sensor values
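
Keeping configuration, rule logic and the latest sensor values together in memory means a rule can be re-evaluated the moment one of its inputs changes. The sketch below illustrates that pattern in Python. It is not the C2MON API: the class, method and tag names are hypothetical, and rules are plain Python callables rather than a rule language.

```python
class InMemoryMonitor:
    """Toy store of latest values with rules evaluated on every update."""

    def __init__(self):
        self.latest = {}  # tag name -> latest value
        self.rules = {}   # rule name -> (input tag names, condition callable)
        self.alarms = {}  # rule name -> current alarm state (bool)

    def add_rule(self, name, tags, condition):
        self.rules[name] = (tuple(tags), condition)

    def update(self, tag, value):
        """Store a new value and return the alarm state changes it caused."""
        self.latest[tag] = value
        changes = []
        for name, (tags, condition) in self.rules.items():
            # Only evaluate rules that use this tag and have all inputs present.
            if tag in tags and all(t in self.latest for t in tags):
                state = bool(condition(*(self.latest[t] for t in tags)))
                if self.alarms.get(name) != state:
                    self.alarms[name] = state
                    changes.append((name, state))
        return changes

monitor = InMemoryMonitor()
monitor.add_rule("cooling_fault", ("pump_on", "temp_c"),
                 lambda pump_on, temp_c: (not pump_on) and temp_c > 30)

print(monitor.update("pump_on", False))  # temperature not known yet
print(monitor.update("temp_c", 35))      # alarm activates
print(monitor.update("temp_c", 36))      # state unchanged, nothing reported
print(monitor.update("pump_on", True))   # alarm terminates
```

Only state changes are reported, so clients receive an alarm once when it activates and once when it terminates, not on every sensor update.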


C2MON Server

▪  C2MON server core: In-Memory Store (JCache - JSR-107), DAQ in, DAQ out, DAQ supervision, Cache persistence, Cache loading, Lifecycle, Configuration, Cache DB access
▪  C2MON server modules: Logging, Alarm, Rules, Benchmark, Video access, Client communication, Authentication


Open Source time-series databases

▪  OpenTSDB: Uses [logo] as storage model
▪  [logo]: Uses Apache [logo] as storage model
▪  [logo]: Natively time-series, using LMDB storage engine
▪  [logo]: Built on top of Apache Lucene™


Scenario 1: High availability

•  moderate data size
•  average throughput
•  min service interrupts
•  high availability

Diagram labels: DAQ processes; clustered JMS brokers; C2MON SERVER plus a second C2MON SERVER on standby; clustered JMS brokers; C2MON clients


Raw data filtering on DAQ layer

GIQO

Garbage In

Quality out
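
One common way to filter raw values close to the source is a value deadband: a reading is forwarded only if it differs enough from the last forwarded one. The sketch below shows that technique as an example of DAQ-side filtering. The talk does not spell out which filters are applied, so treat the choice of deadband filtering, the class name and the threshold as assumptions.

```python
class DeadbandFilter:
    """Forwards a value only if it moved at least `deadband`
    away from the last forwarded value."""

    def __init__(self, deadband: float):
        self.deadband = deadband
        self._last_sent = None

    def accept(self, value: float) -> bool:
        if self._last_sent is None or abs(value - self._last_sent) >= self.deadband:
            self._last_sent = value
            return True
        return False

sensor_filter = DeadbandFilter(deadband=0.5)
raw_values = [20.0, 20.1, 20.2, 20.7, 20.6, 21.3]

forwarded = [v for v in raw_values if sensor_filter.accept(v)]
print(forwarded)
```

Small fluctuations around the last reported value are dropped at the DAQ, so only meaningful changes travel on to the server.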


Scenario 2: High requirements

•  large data set
•  high throughput
•  min service interrupts
•  high availability

Diagram labels: DAQ processes; JMS broker clusters; C2MON SERVER CLUSTER with server array; JMS broker clusters; C2MON clients


C2MON Roadmap

▪  Offering C2MON to the Open Source community: http://cern.ch/c2mon
▪  Introduction of Complex Event Processing (CEP) module
▪  Migrating historical event store from relational database to time series database


IoT = Internet of Things or … Intranet of Things?

▪  Creating a smarter world happens first in the Intranet
▪  Challenge: Integrating heterogeneous systems and protocols
▪  Many IoT solutions available, but often closed products which are not compatible with each other

Internet of Things:
▪  Integrating and analysing monitoring data from a variety of installations of the same device type throughout the industry is essential.


Takeaways

▪  Data and High Availability services are more important than ever before for all modern organizations.
▪  Deriving value from collected data is key to success.
▪  In-Memory platforms are essential for high value & high velocity data storage and processing.


Credits & References

Many thanks to CERN & Software AG:
-  Sebastien Ponce (CERN), for providing information about CASTOR
-  Rainer Toebbicke (CERN), for providing information about the CERN HBase service
-  Jan Iven (CERN), for helping to find information about existing CERN Hadoop projects
-  Software AG/Terracotta Product & Engineering Team

References:
-  C2MON: http://cern.ch/c2mon
-  The ATLAS EventIndex: https://cds.cern.ch/record/1690609
-  Agile Infrastructure at CERN - Moving 9'000 Servers into a Private Cloud, Helge Meinhard (CERN): http://vimeo.com/93247922
-  CRAN, The Comprehensive R Archive Network: http://cran.r-project.org
-  Software AG Terracotta: http://www.terracotta.org


Questions? Thank you very much for your attention!
