apache flink training - metrics & monitoring
Apache Flink® Training
Flink v1.3 – 14.09.2017
Metrics and Monitoring
Metrics
<identifier, measurement>
Types
• Counter
• Meter (rate)
• Histogram
• Gauge (arbitrary value)
Exposed via MetricReporters
Also a REST API
Visualized in the WebUI
Example
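import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public static class MyMap extends RichMapFunction<String, String> {
    private Counter count;

    @Override
    public void open(Configuration config) {
        // Register the counter once, when the operator instance starts
        count = getRuntimeContext().getMetricGroup().counter("numRecordsIn");
    }

    @Override
    public String map(String input) {
        count.inc();
        return input; // placeholder: return the transformed record here
    }
}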
Other metric types
Gauge
• Value can be any object that implements toString()
Histogram
• No default implementation, but Flink provides a wrapper for Codahale/DropWizard histograms
Meter
• Your code calls meter.markEvent() or meter.markEvent(n)
• Flink counts events, and also reports the average rate
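To make the meter contract above concrete, here is a minimal plain-Java sketch of "count events, report an average rate". MeterSketch and its methods are hypothetical stand-ins written for illustration; this is not Flink's Meter implementation, and the elapsed time is passed in explicitly to keep the example deterministic.

```java
// Hypothetical stand-in for a meter; NOT Flink's Meter implementation.
final class MeterSketch {
    private long count;

    /** Records a single event. */
    void markEvent() {
        markEvent(1);
    }

    /** Records n events at once. */
    void markEvent(long n) {
        count += n;
    }

    /** Total number of events seen so far. */
    long getCount() {
        return count;
    }

    /** Average events per second over the given elapsed time. */
    double getRate(long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        MeterSketch meter = new MeterSketch();
        meter.markEvent();   // one event
        meter.markEvent(9);  // nine more events
        // 10 events over 2 seconds
        System.out.println(meter.getCount() + " events, " + meter.getRate(2000) + " events/s");
    }
}
```

In a real job the meter is obtained from the operator's MetricGroup and the rate is computed by Flink.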
Metric Groups
Metrics are attached to MetricGroups, which provide
context about what is being measured
Adding your own MetricGroups
Useful for categorizing your measurements
counter = getRuntimeContext().getMetricGroup().addGroup("MyMetrics").counter("myCounter");
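127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn
• host: 127.0.0.1
• taskmanager: taskmanager.ABCDE
• job: MyJob
• operator: MyOperator.1 (operator name and subtask index)
• metric: numRecordsIn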
Scope Formats
Dot-separated list of variables and constants
Variables are replaced at runtime
Configured in flink-conf.yaml
<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
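As a rough illustration of how a scope format turns into a metric identifier, the following plain-Java sketch substitutes variables into the operator format shown above. ScopeFormatSketch, expand and identifier are hypothetical names, and the variable values are example data; Flink performs this substitution internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; NOT part of Flink's API.
final class ScopeFormatSketch {

    /** Replaces each <variable> in a dot-separated format; constants are kept as-is. */
    static String expand(String format, Map<String, String> variables) {
        StringBuilder scope = new StringBuilder();
        for (String part : format.split("\\.")) {
            if (scope.length() > 0) {
                scope.append('.');
            }
            boolean isVariable = part.startsWith("<") && part.endsWith(">");
            String value = isVariable ? variables.get(part) : null;
            // Unknown variables are left untouched in this sketch
            scope.append(value != null ? value : part);
        }
        return scope.toString();
    }

    /** Appends user-defined group names and the metric name to the scope. */
    static String identifier(String scope, String... names) {
        return scope + "." + String.join(".", names);
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("<host>", "127.0.0.1");
        vars.put("<tm_id>", "ABCDE");
        vars.put("<job_name>", "MyJob");
        vars.put("<operator_name>", "MyOperator");
        vars.put("<subtask_index>", "1");

        String scope = expand(
            "<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>", vars);
        // prints 127.0.0.1.taskmanager.ABCDE.MyJob.MyOperator.1.numRecordsIn
        System.out.println(identifier(scope, "numRecordsIn"));
        // A user-defined group sits between the scope and the metric name
        System.out.println(identifier(scope, "MyMetrics", "myCounter"));
    }
}
```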
Scope Formats
Each metric is associated with one of 6 formats:
• metrics.scope.jm
• metrics.scope.jm.job
• metrics.scope.tm
• metrics.scope.tm.job
• metrics.scope.task
• metrics.scope.operator
Metric Reporter
Exposes metrics to the outside world
• Ganglia
• Graphite
• JMX
• StatsD
• or roll your own …
Example
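public static class Log4JReporter implements MetricReporter {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

    // Called once when the reporter is set up; config holds the reporter's settings
    public void open(MetricConfig config) {}

    // Called when the reporter is shut down
    public void close() {}
}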
Example, cont.
public static class Log4JReporter implements MetricReporter {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);

    // Maps each registered counter to its fully scoped identifier
    private final Map<Counter, String> counters = new ConcurrentHashMap<>();

    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.put((Counter) metric, group.getMetricIdentifier(metricName));
        }
    }

    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.remove(metric);
        }
    }
}
Example, cont.
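import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public static class Log4JReporter implements MetricReporter, Scheduled {
    private static final Logger LOG = LoggerFactory.getLogger(Log4JReporter.class);
    private final Map<Counter, String> counters = new ConcurrentHashMap<>();

    // open(MetricConfig) and close() are unchanged from the first step

    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.put((Counter) metric, group.getMetricIdentifier(metricName));
        }
    }

    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        if (metric instanceof Counter) {
            counters.remove(metric);
        }
    }

    // Called by Flink at the configured interval because the reporter is Scheduled
    public void report() {
        for (Map.Entry<Counter, String> metric : counters.entrySet()) {
            LOG.info(metric.getValue() + ": " + metric.getKey().getCount());
        }
    }
}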
Configuration
metrics.reporters: log
metrics.reporter.log.class: org.apache.flink.metrics.log4j.Log4JReporter
metrics.reporter.log.interval: 5 SECONDS
https://github.com/zentol/log4jreporter/blob/master/src/main/java/org/apache/flink/metrics/log4j/Log4JReporter.java
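The interval setting above pairs a number with a java.util.concurrent.TimeUnit name. A minimal sketch of turning such a value into milliseconds, assuming exactly that "<number> <UNIT>" shape; ReportIntervalSketch and toMillis are hypothetical names, and this is not Flink's own parsing code.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; NOT Flink's configuration parser.
final class ReportIntervalSketch {

    /** Converts a value such as "5 SECONDS" into milliseconds. */
    static long toMillis(String interval) {
        String[] parts = interval.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<number> <UNIT>' but got: " + interval);
        }
        long amount = Long.parseLong(parts[0]);
        // TimeUnit.valueOf throws IllegalArgumentException for unknown unit names
        TimeUnit unit = TimeUnit.valueOf(parts[1].toUpperCase(Locale.ROOT));
        return unit.toMillis(amount);
    }

    public static void main(String[] args) {
        System.out.println(toMillis("5 SECONDS")); // prints 5000
    }
}
```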
Monitoring REST API
Some available requests
/config
/overview
/jobmanager/metrics
/jobs
/jobs/<id>/metrics
/jobs/<id>/checkpoints
/jobs/<id>/vertices/<id>/metrics?get=0.numRecordsOutPerSecond
/taskmanagers
/taskmanagers/<id>/metrics?get=<metric>
...
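A minimal sketch of calling one of these requests from plain Java. The base URL http://localhost:8081 (the usual web UI port on a local setup), the job and vertex IDs, and the names RestMetricsSketch, vertexMetricUrl and fetch are all assumptions made for illustration; substitute the values of your own cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client helper for illustration; NOT part of Flink.
final class RestMetricsSketch {

    /** Builds the URL for /jobs/<id>/vertices/<id>/metrics?get=<metric>. */
    static String vertexMetricUrl(String baseUrl, String jobId, String vertexId, String metric) {
        return baseUrl + "/jobs/" + jobId + "/vertices/" + vertexId + "/metrics?get=" + metric;
    }

    /** Issues a GET request and returns the response body as a string. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Placeholder IDs; real ones come from /jobs and the job's vertex list
        String url = vertexMetricUrl(
            "http://localhost:8081", "myJobId", "myVertexId", "0.numRecordsOutPerSecond");
        System.out.println(url);
        // With a running cluster: System.out.println(fetch(url));
    }
}
```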
Available metrics
Many system metrics are built into Flink, including
• CPU, memory, threads, GC
• Classloading, network, cluster, IO
• Checkpointing, throughput, latency
Latency Monitoring
Latency tracking
env.getConfig().setLatencyTrackingInterval(msec)
Latency markers are injected by the sources, and flow
through the execution graph
• A marker waits in line behind any records queued in front of an operator,
but otherwise bypasses the operator's processing logic
Sinks track latency for each parallel source instance
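The idea of per-source tracking at the sink can be sketched in plain Java as follows. LatencyTrackingSketch, Marker and their fields are hypothetical; Flink's own latency markers and statistics are internal to the runtime, and this sketch keeps only the most recent latency per source instance.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of latency tracking at a sink; NOT Flink's implementation.
final class LatencyTrackingSketch {

    /** A marker stamped by a parallel source instance when it is emitted. */
    static final class Marker {
        final int sourceSubtask;
        final long emittedAtMillis;

        Marker(int sourceSubtask, long emittedAtMillis) {
            this.sourceSubtask = sourceSubtask;
            this.emittedAtMillis = emittedAtMillis;
        }
    }

    // Most recent latency observed for each parallel source instance
    private final Map<Integer, Long> latencyBySource = new HashMap<>();

    /** Called when a marker arrives at the sink. */
    void onMarker(Marker marker, long nowMillis) {
        latencyBySource.put(marker.sourceSubtask, nowMillis - marker.emittedAtMillis);
    }

    /** Latest latency in milliseconds, or -1 if no marker from that source was seen. */
    long latencyFor(int sourceSubtask) {
        return latencyBySource.getOrDefault(sourceSubtask, -1L);
    }

    public static void main(String[] args) {
        LatencyTrackingSketch sink = new LatencyTrackingSketch();
        sink.onMarker(new Marker(0, 1000L), 1250L);
        System.out.println(sink.latencyFor(0)); // prints 250
    }
}
```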
Back Pressure Monitoring
What is Back Pressure?
Records in your job flow downstream, from
sources to sinks
When a downstream operator can’t keep
up, it exerts back pressure that propagates
upstream
Detecting Back Pressure
• OK: < 10%
• LOW: 10 – 50%
• HIGH: > 50%
Configuration
jobmanager.web.backpressure.refresh-interval (default: 60000 msec)
jobmanager.web.backpressure.num-samples (default: 100)
jobmanager.web.backpressure.delay-between-samples (default: 50 msec)
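A rough sketch of how a set of samples maps onto the OK / LOW / HIGH levels, assuming each sample simply records whether the task was found blocked at that moment. BackPressureSketch and its methods are hypothetical, and treating exactly 10% as LOW and exactly 50% as LOW is an assumption read off the thresholds listed above.

```java
// Hypothetical classification helper for illustration; NOT Flink's implementation.
final class BackPressureSketch {

    enum Level { OK, LOW, HIGH }

    /** Fraction of samples in which the task was found blocked. */
    static double ratio(boolean[] blockedSamples) {
        if (blockedSamples.length == 0) {
            return 0.0;
        }
        int blocked = 0;
        for (boolean sample : blockedSamples) {
            if (sample) {
                blocked++;
            }
        }
        return (double) blocked / blockedSamples.length;
    }

    /** Maps a ratio to a level: below 10% is OK, up to 50% is LOW, above is HIGH. */
    static Level classify(double ratio) {
        if (ratio < 0.10) {
            return Level.OK;
        }
        if (ratio <= 0.50) {
            return Level.LOW;
        }
        return Level.HIGH;
    }

    public static void main(String[] args) {
        boolean[] samples = new boolean[100]; // matches num-samples = 100
        for (int i = 0; i < 30; i++) {
            samples[i] = true;                // 30 of 100 samples blocked
        }
        System.out.println(classify(ratio(samples))); // prints LOW
    }
}
```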
Slow sink?
E.g., slow database indexing may be
causing back pressure
Swap in a discarding sink to check whether the sink is the bottleneck