
Open Infrastructure Summit 2019

Demonstrating At Scale Monitoring Of OpenStack Cloud Using Prometheus

Anandeep Pannu <apannu@redhat.com>

Pradeep Kilambi <prad@redhat.com>


Definitions


Implications for Open Infrastructure

Critical Monitoring Features

● Portability across different footprints
● HA, scaling, persistence available for free
● Re-use platform capabilities, e.g. Prometheus
● Users integrate for the capabilities they want
● Stringent SLAs can be met
● Plug in different OSS components with the same API
● For each API, the SLAs achieved can be optimized
  ○ e.g. fault management uses the message bus directly
● Metrics metadata and declarative metrics for every component, so metrics can be incorporated automatically
● Data sensing, collection and processing
  ○ Either, some or all processed at the Edge
● Centralized access to reports, alerts
● Integration with Analytics

Service Assurance Framework Architecture

Architecture Overview: On-site infrastructure platform

[Diagram: on every infrastructure node (Controller, Compute, Ceph, RHEV, OpenShift) and application component (VM, container), host/VM metrics and events -- cpu, mem, kernel, net, hardware, syslog, /proc, pid -- are collected and carried over a dispatch-routed message distribution bus (AMQP 1.0) to a management cluster that exposes Prometheus-based K8s monitoring through the Prometheus Operator, MGMT cluster APIs and 3rd-party integrations.]

● Collectd container -- host / VM metrics collection framework
  ○ Collectd 5.8 with additional OPNFV Barometer-specific plugins not yet in the collectd project
    ● Intel RDT, Intel PMU, IPMI
    ● AMQP 1.0 client plugin
    ● Procevent -- process state changes
    ● Sysevent -- match syslog for critical errors
    ● Connectivity -- fast detection of interface link status changes
  ○ Integrated as part of TripleO (OSP Director); see the configuration sketch below
● Write plugins: write_syslog, write_kafka, write_prometheus, amqp_09, amqp1
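As a rough sketch of how the extra plugins could be enabled through TripleO, recent tripleo-heat-templates accept a CollectdExtraPlugins list in a heat environment file; the exact parameter name and plugin availability depend on the TripleO and collectd versions in use, so treat this as an assumption-laden illustration rather than the presenters' exact configuration:

# Hypothetical heat environment snippet: enable additional collectd plugins.
# CollectdExtraPlugins is an assumed TripleO parameter; verify the name against
# the tripleo-heat-templates release you deploy. The AMQP settings mirror the
# environment file shown later in this deck.
parameter_defaults:
  CollectdExtraPlugins:
    - procevent      # process state changes
    - sysevent       # match syslog for critical errors
    - connectivity   # fast detection of interface link status changes
  CollectdConnectionType: amqp1
  CollectdAmqpInstances:
    telemetry:
      format: JSON
      presettle: false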

AMQ 7 Interconnect - Native AMQP 1.0 Message Router

● Large Scale Message Networks
  ○ Offers shortest-path (least-cost) message routing
  ○ Used without a broker
  ○ High availability through redundant path topology and re-route (not clustering)
  ○ Automatic recovery from network partitioning failures
  ○ Reliable delivery without requiring storage
● QDR Router Functionality
  ○ Apache Qpid Dispatch Router (QDR)
  ○ Dynamically learns addresses of messaging endpoints
  ○ Stateless -- no message queuing, end-to-end transfer

[Diagram: clients and servers (Server A, B, C) exchanging messages through a mesh of interconnected routers -- high throughput, low latency, low operational costs.]

Prometheus Operator

Evolution

[Diagram, built up across several slides: a central site runs a Prometheus Operator++ cluster with Prometheus, Grafana, QDR routers and Smart Gateways (SG); remote sites, each with controller nodes (cntrl 1-3), Ceph and compute nodes, publish onto the AMQP / OpenStack networks and reach the central site over a Layer 3 network (Site 1, Site 2, ... Site 10).]

DCN Use Case

[Diagram: Distributed Compute Node (DCN) deployment -- a primary site (AZ0) hosting the undercloud + container registry, controller nodes, compute nodes (local ephemeral storage) and an optional Ceph Cluster 0, connected over an L3 routed network to DCN Sites 1 through n, each forming its own availability zone (AZ1 ... AZn) with compute nodes using local ephemeral storage.]

Deployment Stack

Configuration & Deployment

● Collectd and QDR profiles are integrated as part of TripleO

● Collectd and QDRs run as containers on the OpenStack nodes

● Configured via a heat environment file

● Each node runs a Qpid Dispatch Router alongside the collectd agent

● Collectd is configured to talk to the Qpid Dispatch Router and send metrics and events

● Relevant collectd plugins can be configured via the heat template file

TripleO Integration of Client-side Components

## This environment template to enable Service Assurance Client side bits
resource_registry:
  OS::TripleO::Services::MetricsQdr: ../docker/services/metrics/qdr.yaml
  OS::TripleO::Services::Collectd: ../docker/services/metrics/collectd.yaml

parameter_defaults:
  CollectdConnectionType: amqp1
  CollectdAmqpInstances:
    notify:
      notify: true
      format: JSON
      presettle: true
    telemetry:
      format: JSON
      presettle: false

TripleO Client-side Configuration: environments/metrics-collectd-qdr.yaml

cat > params.yaml <<EOF
---
parameter_defaults:
  CollectdConnectionType: amqp1
  CollectdAmqpInstances:
    telemetry:
      format: JSON
      presettle: true
  MetricsQdrConnectors:
  - host: qdr-white-normal-sa-telemetry.apps.dev7.nfvpe.site
    port: 443
    role: edge
    sslProfile: tlsProfile
    verifyHostname: false
EOF

TripleO Client-side Configuration: params.yaml

cd ~/tripleo-heat-templates
git checkout master
cd ~
cp overcloud-deploy.sh overcloud-deploy-overcloud.sh
sed -i 's/usr\/share\/openstack-/home\/stack\//g' overcloud-deploy-overcloud.sh
./overcloud-deploy-overcloud.sh \
  -e /usr/share/openstack-tripleo-heat-templates/environments/metrics-collectd-qdr.yaml \
  -e /home/stack/params.yaml

Client-side Deployment: using overcloud deploy with the collectd & QDR configuration and environment templates

There are 3 core components to the telemetry framework:

● Prometheus (and the AlertManager)

● Smart Gateway

● QPID Dispatch Router

Each of these components has a corresponding Operator that we'll use to spin up the various application components and objects.
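For illustration, operators like these are driven by custom resources. A minimal Prometheus custom resource for the upstream Prometheus Operator looks roughly like the following; the resource name and selector labels are made-up examples, not taken from the framework's manifests:

# Illustrative Prometheus custom resource for the upstream Prometheus Operator.
# The metadata name and selector labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: telemetry-prometheus
spec:
  replicas: 2
  # Select the ServiceMonitor objects whose targets this Prometheus instance scrapes.
  serviceMonitorSelector:
    matchLabels:
      app: smart-gateway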

To deploy the telemetry framework from the script, clone the telemetry-framework repo[1] into the directory below and run:

cd ~/src/github.com/redhat-service-assurance/telemetry-framework/deploy/

./deploy.sh CREATE

[1] https://github.com/redhat-service-assurance/telemetry-framework

Operators Custom Resources

Service Assurance Framework

Deploying Service Assurance Framework: From Operator to Application
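As one typical wiring pattern from operator to application (a sketch using the Prometheus Operator's standard objects, not necessarily the framework's exact manifests), a ServiceMonitor tells the operator-managed Prometheus which service endpoints, such as a Smart Gateway's metrics port, to scrape:

# Illustrative ServiceMonitor; names, labels and port are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: smart-gateway
  labels:
    app: smart-gateway        # matched by the Prometheus serviceMonitorSelector above
spec:
  selector:
    matchLabels:
      app: smart-gateway      # matches the Smart Gateway Service
  endpoints:
  - port: metrics             # named Service port exposing /metrics
    interval: 15s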

Demo

Warning CPU Usage Alert:
avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 75 and avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) < 90

Critical CPU Usage Alert:
avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
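Wrapped as an alerting rule for an operator-managed Prometheus, the critical alert above could look roughly like this; PrometheusRule is the Prometheus Operator's standard object for shipping rules, and the group and alert names here are illustrative:

# Illustrative PrometheusRule carrying the critical CPU alert shown above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-usage-alerts          # hypothetical name
spec:
  groups:
  - name: cpu.rules
    rules:
    - alert: CriticalCpuUsage
      expr: avg_over_time(sa_collectd_cpu_percent{type=~"system|user"}[1m]) > 90
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "CPU usage above 90% for the last minute"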

Architecture Demo: Service Assurance Framework

● https://telemetry-framework.readthedocs.io/en/master/

● https://quay.io/repository/redhat-service-assurance/smart-gateway-operator?tab=info

● https://github.com/redhat-service-assurance

[Diagram: Prometheus pull model -- targets expose metrics at /metrics over HTTP, the Prometheus server scrapes them, and PromQL over HTTP serves queries for visualization.]
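The pull model in the diagram corresponds to a scrape configuration on the Prometheus server; a minimal hand-written prometheus.yml sketch (job name and target address are placeholders) looks like:

# Minimal illustrative scrape configuration; job name and target are placeholders.
global:
  scrape_interval: 15s          # how often Prometheus pulls /metrics from targets
scrape_configs:
  - job_name: smart-gateway
    metrics_path: /metrics      # default path targets expose
    static_configs:
      - targets: ['smart-gateway.example.local:8081']

With the Prometheus Operator, this configuration is generated from ServiceMonitor objects rather than written by hand.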
