Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)


Upload: brian-brazil

Post on 11-Jan-2017


Page 1: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Brian Brazil

Founder

Monitoring Hadoop with Prometheus

Making batch jobs manageable

Page 2: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Who am I?

Engineer passionate about running software reliably in production.

● TCD CS Degree

● Google SRE for 7 years, working on high-scale reliable systems such as AdWords, AdSense, Ad Exchange, Billing, Database

● Boxever TL Systems & Infrastructure, applied processes and technology to allow the company to scale and reduce operational load

● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper

● Founder of Robust Perception, making scalability and efficiency available to everyone

Page 3: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Prometheus

Inspired by Google’s Borgmon monitoring system.

Started in 2012 as an open source project by ex-Googlers working at SoundCloud.

Mainly written in Go. Publicly launched in early 2015.

100+ companies using it, including DigitalOcean, GoPro, Apple, Red Hat and Google.

Page 4: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Why monitor?

● Know when things go wrong

○ To call in a human to prevent a business-level issue, or prevent an issue in advance

● Be able to debug and gain insight

● Trending to see changes over time, and drive technical/business decisions

● To feed into other systems/processes (e.g. QA, security, automation)

Page 5: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Your Services Shouldn’t be a Black Box

Page 6: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Services have Internals

Page 7: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Monitor the Internals

Page 8: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Monitor as a Service, not as Machines

Page 9: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Inclusive Monitoring

Don’t monitor just at the edges:

● Instrument client libraries

● Instrument server libraries (e.g. HTTP/RPC)

● Instrument business logic

Library authors get information about usage.

Application developers get monitoring of common components for free.

Dashboards and alerting can be provided out of the box, customised for your organisation!

Page 10: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Prometheus is About Metrics, not Events

Event-based monitoring such as logging is limited in how much data you can have per event. Each piece of data about each event needs to be stored and processed, which is challenging to scale.

Metric-based monitoring allows you to have thousands of metrics, so you can track the performance of every subsystem.

Prometheus regularly polls in-memory state of metrics.
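To make the metrics-versus-events point concrete, here is a minimal sketch using only the Java standard library. The class and method names (RequestMetrics, onRequest, snapshot) are hypothetical and are not part of any Prometheus client; the idea is only that each event updates one number in memory, and a poller reads the current value whenever it likes.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration: a metric is in-memory state, not a stored event.
final class RequestMetrics {
    // One counter, no matter how many requests are handled.
    private final LongAdder requestsTotal = new LongAdder();

    // Called on every request: a cheap in-memory increment, nothing is shipped.
    public void onRequest() {
        requestsTotal.increment();
    }

    // Called by whatever polls the process: reads the current total.
    public long snapshot() {
        return requestsTotal.sum();
    }

    public static void main(String[] args) {
        RequestMetrics metrics = new RequestMetrics();
        for (int i = 0; i < 1000; i++) {
            metrics.onRequest();
        }
        // A poll sees one sample, regardless of how many events occurred.
        System.out.println("requests_total " + metrics.snapshot());
    }
}
```

The cost of a poll stays constant as traffic grows, which is why a process can afford to keep thousands of such metrics.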

Page 11: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

What about Hadoop?

Batch jobs such as MapReduces are a very common way to use Hadoop.

How do you check today that your regular jobs are working?

● Checking dashboards?

● Emails about every run?

● Emails on failure?

Page 12: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

What do you really care about?

The thing you want to know is:

Has my batch job been successful recently enough?

So let’s monitor that!

Page 13: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Introducing the Pushgateway

The Pushgateway holds metric state for ephemeral jobs.

Page 14: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

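Java snippet

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;
import org.apache.hadoop.mapred.JobClient;

CollectorRegistry registry = new CollectorRegistry();

JobClient.runJob(job); // Submit job to Hadoop and wait for completion.

// Only reached once the job has completed: record the time of this run.
Gauge lastSuccess = Gauge.build()
    .name("my_batch_job_last_success")
    .help("Last time my batch job succeeded, in unixtime.")
    .register(registry);
lastSuccess.setToCurrentTime();

// Hand the metric to the Pushgateway, which holds it for Prometheus to scrape.
PushGateway pg = new PushGateway("127.0.0.1:9091");
pg.pushAdd(registry, "my_batch_job");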
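For intuition about what such a push carries, the sketch below builds the URL and text payload by hand with the standard library only. It assumes the Pushgateway's documented /metrics/job/<job> path and the Prometheus text exposition format; the class PushPayloadSketch and its methods are hypothetical, and the job name is not URL-escaped here, which a real client library would take care of.

```java
// Hypothetical illustration of what a Pushgateway push contains.
final class PushPayloadSketch {
    // URL a push for the given job would target (assumed path layout).
    static String pushUrl(String gatewayAddress, String job) {
        return "http://" + gatewayAddress + "/metrics/job/" + job;
    }

    // Text exposition format for a single gauge with no labels.
    static String gaugePayload(String name, String help, long value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }

    public static void main(String[] args) {
        long nowSeconds = System.currentTimeMillis() / 1000;
        System.out.println("POST " + pushUrl("127.0.0.1:9091", "my_batch_job"));
        System.out.print(gaugePayload(
            "my_batch_job_last_success",
            "Last time my batch job succeeded, in unixtime.",
            nowSeconds));
    }
}
```

In practice the client library shown in the slide does this for you; the sketch only shows that the pushed state is a handful of lines of text keyed by job name.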

Page 15: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Prometheus Alerts

Prometheus has a powerful expression language that can be used in graphs, pre-calculation and alerts.

Let’s alert if our batch job hasn’t succeeded in a day:

ALERT MyBatchJobNotSuccessfulRecently
  IF time() - my_batch_job_last_success{job="my_batch_job"} > 86400
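The alert condition is just arithmetic on two timestamps. As a rough Java equivalent (the class BatchJobFreshness and its method are hypothetical, written only to spell out the comparison):

```java
// Hypothetical illustration of the comparison the alert expression performs.
final class BatchJobFreshness {
    // True when the last success is older than the allowed age, in seconds.
    static boolean isStale(long nowUnixSeconds, long lastSuccessUnixSeconds, long maxAgeSeconds) {
        return nowUnixSeconds - lastSuccessUnixSeconds > maxAgeSeconds;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        long lastSuccess = now - 2 * 86400; // pretend the job last succeeded two days ago
        System.out.println("should alert: " + isStale(now, lastSuccess, 86400));
    }
}
```

Note the comparison is strict: a job that last succeeded exactly 86400 seconds ago does not yet fire.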

Page 16: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

New World!

No longer have to manually check dashboards or emails every single day for every single batch job.

Monitoring and alerting is now aligned with what we care about.

More reliable, and scales better too!

Page 17: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Aside: Idempotency and Frequency

You shouldn’t care about a single failure.

To make things even easier to manage, write your batch jobs so that if one run fails the next run will automatically catch up.

Then run your batch jobs at least twice as often as needed.

Result: A single failure is automatically handled, and if there is a problem you run it again. No more messing with command line flags and config files!
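One way to get the catch-up behaviour is to have each run work out what is still outstanding rather than being told what to process. The sketch below assumes hourly input partitions and a stored "last processed hour" watermark; both are hypothetical details chosen for illustration, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: each run processes everything since the last success.
final class CatchUpBatch {
    // Hours after lastProcessedHour, up to and including currentHour.
    static List<Long> pendingHours(long lastProcessedHour, long currentHour) {
        List<Long> pending = new ArrayList<>();
        for (long hour = lastProcessedHour + 1; hour <= currentHour; hour++) {
            pending.add(hour);
        }
        return pending;
    }

    public static void main(String[] args) {
        // The run for hour 11 failed, so the watermark is still at 10.
        // The next run, at hour 12, picks up both hours with no manual flags.
        System.out.println("to process: " + pendingHours(10, 12));
    }
}
```

Because the watermark only advances on success, rerunning after a failure needs no special arguments, and running more often than strictly needed simply yields an empty list.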

Page 18: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Beyond Batch

Prometheus has integrations with 50+ other systems, including JMX, EC2, MySQL, PostgreSQL, Redis, MongoDB, CouchDB, RethinkDB, Collectd, Graphite, Nagios, InfluxDB, Django, Mtail, Heka, Memcached, RabbitMQ, Rsyslog, HAProxy, Meteor.js, Java, Haskell, Python, Go, Ruby, .Net, Machine, CloudWatch, Minecraft…

Easy to run, easy to use, easy to scale.

A single Prometheus can handle over 100k samples per second!

Page 19: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Powerful Data Model

All metrics have arbitrary multi-dimensional labels.

No need to force your model into dotted strings.

Can aggregate, cut, and slice along them.

Supports any double value; labels support full Unicode.
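As a rough picture of what aggregating along a label means, the following standard-library sketch sums samples by one label. The Sample class, the labels helper, and the label names (job, dc) are hypothetical and do not reflect Prometheus internals.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of summing labelled samples along one dimension.
final class LabelAggregation {
    static final class Sample {
        final Map<String, String> labels;
        final double value;

        Sample(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    // Builds a label map from alternating key/value arguments.
    static Map<String, String> labels(String... keyValues) {
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            result.put(keyValues[i], keyValues[i + 1]);
        }
        return result;
    }

    // Sums values grouped by one label; a missing label groups under "".
    static Map<String, Double> sumBy(String label, List<Sample> samples) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample sample : samples) {
            String key = sample.labels.getOrDefault(label, "");
            totals.merge(key, sample.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
            new Sample(labels("job", "namenode", "dc", "eu"), 2.0),
            new Sample(labels("job", "datanode", "dc", "eu"), 3.0),
            new Sample(labels("job", "datanode", "dc", "us"), 4.0));
        System.out.println("by dc:  " + sumBy("dc", samples));
        System.out.println("by job: " + sumBy("job", samples));
    }
}
```

The same samples can be cut by datacenter or by job without renaming anything, which is the advantage over encoding dimensions into a dotted metric name.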

Page 20: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Powerful Query Language

Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time.

Answer questions like:

● What’s the 95th percentile latency in the European datacenter?

● How full will the disks be in 4 hours?

● Which services are the top 5 users of CPU?

Can alert based on any query.
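The disk question is a prediction: fit a line through recent samples and extend it forward. The sketch below does a plain least-squares fit in Java, similar in spirit to what a query-language function such as predict_linear offers; it is not Prometheus's implementation, and the class name, sample values and units (seconds, percent full) are assumptions for illustration.

```java
// Hypothetical illustration of linear extrapolation over recent samples.
final class LinearPrediction {
    // Least-squares line through (times[i], values[i]), evaluated at time `at`.
    // Assumes at least two samples with distinct timestamps.
    static double predict(double[] times, double[] values, double at) {
        int n = times.length;
        double meanT = 0.0;
        double meanV = 0.0;
        for (int i = 0; i < n; i++) {
            meanT += times[i];
            meanV += values[i];
        }
        meanT /= n;
        meanV /= n;

        double covariance = 0.0;
        double variance = 0.0;
        for (int i = 0; i < n; i++) {
            covariance += (times[i] - meanT) * (values[i] - meanV);
            variance += (times[i] - meanT) * (times[i] - meanT);
        }
        double slope = covariance / variance;
        double intercept = meanV - slope * meanT;
        return intercept + slope * at;
    }

    public static void main(String[] args) {
        // Disk usage sampled hourly: 50%, 52%, 54% full.
        double[] times = {0, 3600, 7200};
        double[] percentFull = {50, 52, 54};
        double fourHoursLater = 7200 + 4 * 3600;
        System.out.println("predicted % full: " + predict(times, percentFull, fourHoursLater));
    }
}
```

With usage growing two points per hour, the fit lands at roughly 62% four hours after the last sample; an alert would compare that prediction against a threshold rather than the current value.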

Page 21: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Dashboards

Page 22: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

What does this all mean for Hadoop?

Due to its extensive integrations, Prometheus can monitor Hadoop and the rest of your infrastructure and applications.

With its powerful data model and query language, you can graph and alert on what matters - not what your monitoring system limits you to.

Better alerts with fewer false positives means more sleep, higher reliability and more confidence that your system is functioning correctly.

Page 23: Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Resources

Official Project Website: prometheus.io

Official Mailing List: [email protected]

Demo: demo.robustperception.io

Robust Perception Website: www.robustperception.io

Queries: [email protected]