Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)

Brian Brazil, Founder

Monitoring Hadoop with Prometheus

Making batch jobs manageable

Who am I?

Engineer passionate about running software reliably in production.

● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems & Infrastructure, applied processes and technology to allow the company to scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper
● Founder of Robust Perception, making scalability and efficiency available to everyone

Prometheus

Inspired by Google’s Borgmon monitoring system.

Started in 2012 as an open source project by ex-Googlers working at SoundCloud.

Mainly written in Go. Publicly launched in early 2015.

100+ companies using it including Digital Ocean, GoPro, Apple, Red Hat and Google.

Why monitor?

● Know when things go wrong
  ○ To call in a human to prevent a business-level issue, or prevent an issue in advance
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes (e.g. QA, security, automation)

Your Services Shouldn’t be a Black Box

Services have Internals

Monitor the Internals

Monitor as a Service, not as Machines

Inclusive Monitoring

Don’t monitor just at the edges:

● Instrument client libraries
● Instrument server libraries (e.g. HTTP/RPC)
● Instrument business logic

Library authors get information about usage.

Application developers get monitoring of common components for free.

Dashboards and alerting can be provided out of the box, customised for your organisation!

Prometheus is About Metrics, not Events

Event based monitoring such as logging is limited in how much data you can have per event. Each piece of data about each event needs to be stored and processed, which is challenging to scale.

Metric based monitoring allows you to have thousands of metrics, allowing you to track performance of every subsystem.

Prometheus regularly polls in-memory state of metrics.
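As a rough illustration of "polling in-memory state", here is a minimal sketch that uses only the JDK rather than the official Prometheus client library. The class `InMemoryMetrics`, the metric name `requests_total` and its help string are hypothetical; the rendered output follows the shape of the Prometheus text exposition format.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a metrics library: counters live in memory,
// and a scrape simply reads their current values.
final class InMemoryMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final Map<String, String> help = new ConcurrentHashMap<>();

    // Register a counter and its help text. Must be called before inc().
    void register(String name, String helpText) {
        counters.putIfAbsent(name, new LongAdder());
        help.putIfAbsent(name, helpText);
    }

    // Cheap enough to call on every event: no I/O, just an in-memory add.
    void inc(String name) {
        counters.get(name).increment();
    }

    // Called when the monitoring system polls; renders the current state.
    String scrape() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, LongAdder> e : new TreeMap<>(counters).entrySet()) {
            String name = e.getKey();
            out.append("# HELP ").append(name).append(' ').append(help.get(name)).append('\n');
            out.append("# TYPE ").append(name).append(" counter\n");
            out.append(name).append(' ').append(e.getValue().sum()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        InMemoryMetrics metrics = new InMemoryMetrics();
        metrics.register("requests_total", "Requests handled.");
        for (int i = 0; i < 3; i++) {
            metrics.inc("requests_total"); // one increment per event
        }
        System.out.print(metrics.scrape());
    }
}
```

The cost per event is a single in-memory increment, and the amount of data read at poll time depends on the number of metrics, not the number of events.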

What about Hadoop?

Batch jobs such as MapReduces are a very common way to use Hadoop.

How do you check today that your regular jobs are working?

● Checking dashboards?
● Emails about every run?
● Emails on failure?

What do you really care about?

The thing you want to know is:

Has my batch job been successful recently enough?

So let’s monitor that!

Introducing the Pushgateway

The Pushgateway holds metric state for ephemeral jobs.

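Java snippet

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;
import org.apache.hadoop.mapred.JobClient;

CollectorRegistry registry = new CollectorRegistry();

JobClient.runJob(job); // Submit job to Hadoop and wait for completion.

// runJob() throws if the job fails, so this point is reached only on success.
Gauge lastSuccess = Gauge.build()
    .name("my_batch_job_last_success")
    .help("Last time my batch job succeeded, in unixtime.")
    .register(registry);
lastSuccess.setToCurrentTime();

// Push the registry to the Pushgateway under the job name "my_batch_job".
PushGateway pg = new PushGateway("127.0.0.1:9091");
pg.pushAdd(registry, "my_batch_job");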

Prometheus Alerts

Prometheus has a powerful expression language that can be used in graphs, pre-calculation and alerts.

Let’s alert if our batch job hasn’t succeeded in a day:

ALERT MyBatchJobNotSuccessfulRecently
  IF time() - my_batch_job_last_success{job="my_batch_job"} > 86400
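The arithmetic behind the alert condition is just "now minus last success, compared against a threshold". A minimal Java sketch of that comparison follows; the class `BatchJobFreshness` and the timestamps are hypothetical, and in practice Prometheus evaluates the expression itself rather than your code.

```java
import java.time.Duration;

final class BatchJobFreshness {
    // Mirrors the alert expression: time() - last_success > threshold.
    static boolean shouldAlert(long nowUnixSeconds, long lastSuccessUnixSeconds, Duration maxAge) {
        return nowUnixSeconds - lastSuccessUnixSeconds > maxAge.getSeconds();
    }

    public static void main(String[] args) {
        Duration oneDay = Duration.ofDays(1); // 86400 seconds
        long now = 1_449_000_000L;            // hypothetical "current" unixtime

        // Last success one hour ago: within the day, no alert.
        System.out.println(shouldAlert(now, now - 3_600, oneDay));
        // Last success 25 hours ago: older than a day, alert.
        System.out.println(shouldAlert(now, now - 90_000, oneDay));
    }
}
```

Note that the condition depends only on the most recent success, so a failed run followed by a successful one never fires the alert.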

New World!

You no longer have to manually check dashboards or emails every single day for every single batch job.

Monitoring and alerting is now aligned with what we care about.

More reliable, and scales better too!

Aside: Idempotency and Frequency

You shouldn’t care about a single failure.

To make things even easier to manage, write your batch jobs so that if one run fails the next run will automatically catch up.

Then run your batch jobs at least twice as often as needed.

Result: A single failure is automatically handled, and if there is a problem you run it again. No more messing with command line flags and config files!
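One way to get this catch-up behaviour is to track a watermark of the last unit of work that completed, and have every run process everything after it. The sketch below assumes a daily job; the class `CatchUpJob`, its watermark field and the dates are hypothetical, and a real job would persist the watermark somewhere durable.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical daily batch job that records how far it has got, so a
// failed run is repaired by whichever run comes next.
final class CatchUpJob {
    private LocalDate lastProcessed; // watermark: last day fully processed

    CatchUpJob(LocalDate lastProcessed) {
        this.lastProcessed = lastProcessed;
    }

    // Process every day after the watermark, up to and including today.
    // The watermark advances only after a day succeeds, so an exception
    // leaves the remaining days for the next run. A run with nothing
    // pending does no work, which is what makes frequent runs safe.
    List<LocalDate> run(LocalDate today, Consumer<LocalDate> processDay) {
        List<LocalDate> done = new ArrayList<>();
        for (LocalDate day = lastProcessed.plusDays(1); !day.isAfter(today); day = day.plusDays(1)) {
            processDay.accept(day); // may throw; watermark stays put
            lastProcessed = day;
            done.add(day);
        }
        return done;
    }

    LocalDate lastProcessed() {
        return lastProcessed;
    }

    public static void main(String[] args) {
        CatchUpJob job = new CatchUpJob(LocalDate.of(2015, 12, 1));
        try {
            // Simulated failure while processing 2 December.
            job.run(LocalDate.of(2015, 12, 2), day -> {
                throw new IllegalStateException("cluster down");
            });
        } catch (IllegalStateException e) {
            System.out.println("run failed, watermark still " + job.lastProcessed());
        }
        // The next run picks up the missed day as well as the new one.
        System.out.println("caught up: " + job.run(LocalDate.of(2015, 12, 3), day -> { }));
        // An extra run on the same day finds nothing to do.
        System.out.println("extra run: " + job.run(LocalDate.of(2015, 12, 3), day -> { }));
    }
}
```

For this to hold, the per-day work itself also has to be safe to repeat, since a day that failed halfway will be processed again from the start.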

Beyond Batch

Prometheus has integrations with 50+ other systems, including JMX, EC2, MySQL, PostgreSQL, Redis, MongoDB, CouchDB, RethinkDB, Collectd, Graphite, Nagios, InfluxDB, Django, Mtail, Heka, Memcached, RabbitMQ, Rsyslog, HAProxy, Meteor.js, Java, Haskell, Python, Go, Ruby, .NET, Machine, CloudWatch, Minecraft…

Easy to run, easy to use, easy to scale.

A single Prometheus can handle over 100k samples per second!

Powerful Data Model

All metrics have arbitrary multi-dimensional labels.

No need to force your model into dotted strings.

Can aggregate, cut, and slice along them.

Values can be any double; labels support full Unicode.
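To make "aggregate, cut, and slice along labels" concrete, here is a small Java sketch of samples carrying label sets and being summed along one dimension. It only models the idea and is not how Prometheus stores or queries data; the class `LabelAggregation`, the metric `http_requests_total` and the labels `datacenter` and `path` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LabelAggregation {
    // One sample: a metric name, label name/value pairs, and a double value.
    static final class Sample {
        final String name;
        final Map<String, String> labels;
        final double value;

        Sample(String name, Map<String, String> labels, double value) {
            this.name = name;
            this.labels = labels;
            this.value = value;
        }
    }

    // Build a label set from alternating name/value arguments.
    static Map<String, String> labels(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Sum values grouped by one label, ignoring every other label.
    static Map<String, Double> sumBy(List<Sample> samples, String label) {
        Map<String, Double> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.labels.getOrDefault(label, ""), s.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
            new Sample("http_requests_total", labels("datacenter", "eu", "path", "/a"), 10),
            new Sample("http_requests_total", labels("datacenter", "eu", "path", "/b"), 5),
            new Sample("http_requests_total", labels("datacenter", "us", "path", "/a"), 7));

        // The same data, sliced along two different dimensions.
        System.out.println(sumBy(samples, "datacenter"));
        System.out.println(sumBy(samples, "path"));
    }
}
```

With dotted-string names, each of these views would need its own naming scheme decided up front; with labels, the dimension to aggregate over is chosen at query time.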

Powerful Query Language

Can multiply, add, aggregate, join, predict, take quantiles across many metrics in the same query. Can evaluate right now, and graph back in time.

Answer questions like:

● What’s the 95th percentile latency in the European datacenter?
● How full will the disks be in 4 hours?
● Which services are the top 5 users of CPU?
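The disk question above is a prediction over recent samples. As a rough sketch of the kind of arithmetic involved, the Java below fits a least-squares line through timestamped readings and extrapolates it 4 hours ahead. The class `LinearPrediction` and the sample readings are hypothetical, and this is an illustration of the idea, not Prometheus's implementation.

```java
final class LinearPrediction {
    // Fit a least-squares line through (t[i], v[i]) and evaluate it at `at`.
    static double predict(double[] t, double[] v, double at) {
        int n = t.length;
        if (n < 2 || v.length != n) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double meanT = 0, meanV = 0;
        for (int i = 0; i < n; i++) {
            meanT += t[i];
            meanV += v[i];
        }
        meanT /= n;
        meanV /= n;

        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (t[i] - meanT) * (v[i] - meanV);
            var += (t[i] - meanT) * (t[i] - meanT);
        }
        if (var == 0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        double slope = cov / var;
        return meanV + slope * (at - meanT);
    }

    public static void main(String[] args) {
        // Hypothetical disk usage in percent, sampled hourly (time in seconds).
        double[] t = {0, 3600, 7200};
        double[] v = {50, 52, 54};
        double fourHoursAhead = t[t.length - 1] + 4 * 3600;
        // Growing 2 points per hour from 54%, this comes out at about 62%.
        System.out.println(predict(t, v, fourHoursAhead));
    }
}
```

Alerting on the predicted value rather than the current one gives warning before the disk fills, instead of after.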

Can alert based on any query.

Dashboards

What does this all mean for Hadoop?

Due to its extensive integrations, Prometheus can monitor Hadoop and the rest of your infrastructure and applications.

With its powerful data model and query language, you can graph and alert on what matters - not what your monitoring system limits you to.

Better alerts with fewer false positives mean more sleep, higher reliability and more confidence that your system is functioning correctly.

Resources

Official Project Website: http://prometheus.io

Official Mailing List: https://groups.google.com/forum/#!forum/prometheus-developers

Demo: http://demo.robustperception.io

Robust Perception Website: http://www.robustperception.io

Queries: [email protected]
