observability - mp3muncher.files.wordpress.com · web viewsampling. preserving causality. trace...

Observability

2. Taking tracing for a ride

Jaeger provides an example app called HotROD
Illustrates some standard instrumentation plus custom instrumentation patterns
Implemented in Go, as is the Jaeger backend
Build depends upon Masterminds/glide
Node dependency for compiling the front end
For production, needs Cassandra or Elasticsearch for persistence
All-in-one option available, which provides persistence etc all in a single process, ideal for dev etc
Prebuilt binaries are available on GitHub
Prebuilt Docker images available as well
Jaeger supports the illustration of dependencies using a force-directed graph
Jaeger also provides a Directed Acyclic Graph (DAG) view, which should be an easier read
Jaeger provides the means to search the traces and a means to view the nested calls

Difference between a span tag and a span log
Both provide annotation
Tags apply to the entire span
Logs represent specific events within the span
Log timestamps will be within the period of the span
Jaeger indexes both for search

Baggage is intended as a general key-value store associated with the context
Useful to help with tenancy linkages to the information
Can associate attribute information that may help understand why some executions are quick or not
This means we can calculate and attribute compute effort to information in the baggage, such as the tenant ID

Classic Jaeger use cases
Distributed transaction monitoring
Performance and latency optimisation
Root cause analysis
Service dependency analysis
Distributed context propagation

Distributed Tracing Fundamentals

Request correlation

Anatomy of distributed tracing
Special trace points can be placed at the edge of microservices
Inject trace point
Extract trace point
These special points also handle metadata movement across processes
These points capture and send the data to a backend
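A minimal sketch of the inject and extract trace points, assuming a plain dictionary as the carrier (for example HTTP headers). The header names `x-demo-trace` and `x-demo-baggage-` and the `SpanContext` class are made up for illustration and are not Jaeger's or OpenTracing's real wire format.

```python
from dataclasses import dataclass, field

# Hypothetical header names, chosen for this sketch only.
TRACE_HEADER = "x-demo-trace"
BAGGAGE_PREFIX = "x-demo-baggage-"


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


def inject(ctx, carrier):
    """Inject point: write the context into an outgoing carrier."""
    carrier[TRACE_HEADER] = f"{ctx.trace_id}:{ctx.span_id}"
    for key, value in ctx.baggage.items():
        carrier[BAGGAGE_PREFIX + key] = value


def extract(carrier):
    """Extract point: rebuild the context from an incoming carrier, or None."""
    raw = carrier.get(TRACE_HEADER)
    if raw is None:
        return None
    trace_id, span_id = raw.split(":", 1)
    baggage = {
        key[len(BAGGAGE_PREFIX):]: value
        for key, value in carrier.items()
        if key.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(trace_id, span_id, baggage)


# Caller side injects, callee side extracts the same context.
headers = {}
inject(SpanContext("abc123", "span1", {"tenant": "acme"}), headers)
restored = extract(headers)
```

The callee would normally start its own span as a child of `restored`, which is how the metadata keeps moving across process boundaries.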

Sampling
Preserving causality

Trace models

Event model
Trace points are recorded as events
Assuming the happens-before information is captured, a Directed Acyclic Graph can be constructed

Span model
Shared spans
Multi-server spans
Concept of parent and child spans
Single host with client spans
Original span model used by Dapper
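For the event model noted above, a small sketch of how recorded happens-before links give a DAG that can be walked in causal order. The event names are invented for illustration; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical trace events; each maps to the set of events that
# happened before it (its causal predecessors).
happens_before = {
    "client-send": set(),
    "server-recv": {"client-send"},
    "db-query": {"server-recv"},
    "cache-lookup": {"server-recv"},
    "server-send": {"db-query", "cache-lookup"},
    "client-recv": {"server-send"},
}


def causal_order(predecessors):
    """Return one ordering of the events that respects happens-before.

    static_order() raises graphlib.CycleError if the links are not a DAG.
    """
    return list(TopologicalSorter(predecessors).static_order())


order = causal_order(happens_before)
```

The two events that both follow `server-recv` have no link between them, so either may come first; only the recorded causality is guaranteed.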

Clock skew adjustment

Even with NTP, keeping server clocks in lock step to within 1 millisecond is not achievable

When spans reside within the same server, it’s fair to assume their timestamps are accurate relative to each other

Can compensate based on knowing when synchronous calls occur: the client span can’t end before the server call it is waiting on has finished

By analysing multiple calls across the same servers, the timing differential can help determine the likely skew value
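A rough sketch of the compensation idea, under the assumption (not from the notes) that the request and the response each spend the same time on the network. The function names are illustrative only, and real backends apply more safeguards than this.

```python
from statistics import median


def skew_adjustment(client_start, client_end, server_start, server_end):
    """Offset to add to the server timestamps so that the server span
    sits inside the client span of a synchronous call."""
    if client_start <= server_start and server_end <= client_end:
        return 0.0  # already consistent with causality
    client_duration = client_end - client_start
    server_duration = server_end - server_start
    if server_duration > client_duration:
        raise ValueError("server span is longer than the client span")
    # Assumption: equal network delay in each direction.
    one_way = (client_duration - server_duration) / 2
    return (client_start + one_way) - server_start


def estimate_skew(calls):
    """Median offset over several calls between the same pair of hosts."""
    return median(skew_adjustment(*call) for call in calls)


# Client saw 100..200 ms, server clock reported 250..310 ms.
offset = skew_adjustment(100, 200, 250, 310)
likely = estimate_skew([(100, 200, 250, 310), (0, 100, 150, 210), (0, 100, 20, 80)])
```

Here the server span is shifted by -130 ms so it lands at 120..180 ms, centred inside the client span.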

Trace analysis

Instrumentation basics with OpenTracing

Primary entities ...

Tracer
Singleton for creating spans
Exposes methods for transferring context across processes and components

Span
Interface for generating a trace point
Span represents a unit of work within a solution
Causal links to predecessors
StartSpan() and Finish()
Spans can be annotated
Span provides access to baggage

OpenTracing is just the API, therefore we need to use a concrete implementation
For recording information with a span you have tags and logs
Tags - key-value pairs
Logs - much like conventional logs, and can be used to record events, particularly where we don’t want to create a span
Only the act of creating the tracer is vendor specific
Applications should only need a single tracer
Service meshes often provide the mechanisms for this
Tracer can create a global instance that can be addressed
Some solutions will provide a dependency injection means to address it

Jaeger specifics ...

Example 2 (span and nested span)

Java Tracer config ...

Each created span is given an operation name in OpenTracing

The operation name is used for correlation and analysis

Create a span.

Should always close the span in a finally block

Use the span to record relevant information just as you would with logging
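The pattern in these notes can be sketched with a toy span class. This is a stand-in written for illustration, not the OpenTracing API: the `Span` class, its methods and the `format_greeting` function are all assumptions.

```python
import time


class Span:
    """Toy span: an operation name, tags for the whole span, timestamped logs."""

    def __init__(self, operation_name):
        self.operation_name = operation_name
        self.tags = {}
        self.logs = []
        self.start_time = time.time()
        self.end_time = None

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, event, **fields):
        # Each log entry carries its own timestamp plus an "event" field.
        self.logs.append((time.time(), {"event": event, **fields}))

    def finish(self):
        self.end_time = time.time()


def format_greeting(name):
    span = Span("format-greeting")
    try:
        span.set_tag("hello-to", name)
        greeting = f"Hello, {name}!"
        span.log("string-format", value=greeting)
        return greeting, span
    finally:
        span.finish()  # runs even if the body raises


greeting, span = format_greeting("Bryn")
```

Returning the span alongside the result is only done here so the recorded tag and log can be inspected afterwards.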

Span being annotated ...

Tracing an individual function as a child of the parent span

This approach does have the issue of sharing the span

In-process context propagation

Example 3 (scopes)

With a scope, a nested span would look like

Working with a scope manager, we would create a span and tell the scope manager

Example 4 - RPC
Each service instantiates its own tracer with unique naming
Need to change the server port through the configuration

Java can simplify the process by using the base class TracedController ...

Incorporating tag management

Example 5 - using baggage
Retrieving baggage

Example 6 - auto-instrumentation
Span references can either be ...
Scopes are handled by ...
Tracing solutions may not provide all the capabilities provided by things like an ELK stack
Recommend every span log has a key-value pair with key = “event” that describes the span log
In-process context propagation is difficult to solve, and different languages can solve it in different ways
Crossing processes means we need to introduce operations to pass the context
Inject
Extract
The means to pass context have a number of challenges ...
It is customary to start new spans for HTTP calls

OpenTracing recommended tags
span.kind - the role of the service in an RPC request; typical values are
client
server
producer - when messaging systems are involved
consumer - when messaging systems are involved
http.url - record the URL requested by the client or served by the server
http.method - GET, POST etc
Typically these are populated through the get method in the TracedController

Baggage
The term was originally coined by Prof. Rodrigo Fonseca, one of the authors of the X-Trace system

The Jaeger instrumentation libraries recognize a special HTTP header that can look like this: jaeger-baggage: k1=v1, k2=v2, .... It is useful for manually providing some baggage items for testing
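To show the shape of that header value, a small sketch that splits a `k1=v1, k2=v2` string into baggage items. The function name is made up, and this is not how the Jaeger libraries implement it; any encoding rules they apply are not covered here.

```python
def parse_baggage_header(value):
    """Parse 'k1=v1, k2=v2' into a dict, skipping malformed items."""
    baggage = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition("=")
        if sep and key:
            baggage[key.strip()] = val.strip()
    return baggage


# Value as it might be typed by hand when testing with curl or a browser plugin.
items = parse_baggage_header("tenant=acme, debug=true")
```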

Instrumentation can be simplified through auto-instrumentation in a vendor-neutral manner

http://github.com/opentracing-contrib/meta replaced by https://opentracing.io/registry/

Spring provides simple instrumentation by just adding a jar
This means no coding needed, except for response tags, baggage etc
Span names are generic
TracerResolver extension creates the tracer and tells OpenTracing about the global tracer. However, some consider this an antipattern

Spring provides an instrumentation capability if the appropriate bean is included

TracerResolver can instantiate the OpenTracing implementation https://github.com/opentracing-contrib/java-tracerresolver

Kafka has OpenTracing support through Spring, although it utilizes JSON serialisation rather than Avro

Instrumentation of Asynchronous Applications

Currently Jaeger can’t show the type of tracing going on, e.g. message, HTTP etc

The consumer’s (receiver’s) span is always a follows-from span, as there could always be multiple receivers, but the consumer will not know this

Having spans that run from the moment the producer creates the event to the consumer consuming it is at odds with OpenTracing principles

Each span should only be associated with a single process, so starting the consumer span on the event generation would be at odds

You would lose the ability to model the time impact of events waiting to be consumed

How would multiple consumers get represented?

Ability to support async e.g.
Node.js
Java: Futures, Executors

Tracing Standards and Ecosystem

The manual instrumentation approaches aren’t practical at scale
Most instrumentation trace points are next to process boundaries
These boundaries are often handled through frameworks
Therefore focus on instrumenting around the frameworks

Agent based
Zero-touch approach
Uses an approach sometimes called monkey-patching
Dynamically modifies the code, wrapping actions that would require spans etc
Java can do this with the command-line option -javaagent, which then loads a library that works with the java.lang.instrument feature
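A toy illustration of monkey-patching in Python: a method is replaced at runtime by a wrapper that records a span name before delegating to the original. `HttpClient`, `traced`, `patch` and `recorded_spans` are all hypothetical, and a real agent would hand the span to a tracer rather than a list.

```python
import functools

recorded_spans = []  # stands in for a tracer backend in this sketch


def traced(func):
    """Wrap a callable so every call is recorded before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        recorded_spans.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def patch(owner, attribute):
    """Swap owner.attribute for a traced version, with no change to the caller."""
    original = getattr(owner, attribute)
    setattr(owner, attribute, traced(original))


class HttpClient:  # hypothetical framework class being instrumented
    def get(self, url):
        return f"response from {url}"


patch(HttpClient, "get")
body = HttpClient().get("http://example.com/")
```

The application code that calls `get` is untouched, which is the zero-touch appeal, but the patch depends on the framework's internals staying the same.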

Monkey-patching approaches can be difficult to maintain
Some frameworks provide extensibility support
Agent model providers include ...
Datadog
Elastic
AppDynamics
New Relic
Apache SkyWalking
Agent models are often linked to a specific backend
github.com/opentracing-contrib/java-specialagent/

Requirements of an instrumentation API

Other frameworks
AWS X-Ray
Google Stackdriver
When a solution is distributed, or uses PaaS elements, you may experience the issue of not getting a cohesive solution

There have been attempts to define an industry-wide standard tracing format for wire-level communication, but none truly exists yet

Zipkin (Twitter)
Its naming using B3 has become a de facto standard
B3 comes from the naming convention of systems named after birds: Big Brother Bird (aka B3)

Tracing can often be used to refer to one or more different dimensions
Ben Sigelman suggested these could be: Analyzing, Recording, Transaction description, Federating
Could also be presented as

Tracing and its view points

This all points to knowing who is involved in the discussion

Standards work

Product notes
Dapper - Google
Zipkin - origins at Twitter
Jaeger - came from Uber
TChannel - RPC framework - Uber

Under the hood

Host your own
Customise and integrate
Bandwidth costs
Own the data

Emerging standards
Use OpenTracing to abstract, so only need to instrument once
B3 header option common ...
OpenCensus
W3C Trace Context format
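As an illustration of the two header styles mentioned, a sketch that parses a W3C `traceparent` value (version, trace id, parent id, flags, all lowercase hex) and maps it onto the common B3 multi-header names. The function names are made up, and the parser only handles the four-field layout rather than the whole specification.

```python
import re

# Four lowercase-hex fields: version, trace-id, parent-id, trace-flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the fields of a traceparent header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are not valid
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def to_b3_headers(ctx):
    """Express the same context using B3 multi-header names."""
    return {
        "X-B3-TraceId": ctx["trace_id"],
        "X-B3-SpanId": ctx["parent_id"],
        "X-B3-Sampled": "1" if ctx["sampled"] else "0",
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
b3 = to_b3_headers(ctx)
```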

Architecture and deployment modes
Basic model
Streamlined model

Components

Client
Library embedded in the application, aggregating calls and passing batches on
Client typically supports a feedback/control flow to allow tracer config changes
Client commonly uses UDP, so doesn’t need the IP of the collector

Agent
Jaeger implements the sidecar pattern
Supports communication to collectors, including load balancing and discovery
Agents allow client logic to be kept simple
Agents can be deployed as any of
Agent on bare metal
Kubernetes DaemonSet
Sidecar to the business app, e.g. in the same pod

Collector
Receives span data as JSON, Thrift or Protobuf
Using HTTP, TChannel or gRPC
Converts data to a normalised internal data model
Sends data to configured/pluggable data store
Provides adaptive sampling logic
Memory queueing to smooth out load spikes

Query service and UI
Search and retrieve traces, used by the Jaeger UI or another solution conversant with the API

Data mining
Post-processing, such as Spark, applied
Use tags so we can attribute spans to processes, and therefore charge based on backend usage

Implementing in a large organization

Why is it hard?

Reducing barriers to adoption
Standard frameworks
In-house adaptors and tooling
Jumpstart / accelerators
Preconfigured setups etc
Trace by default

Monolithic repos
Single repos
Easier to manage, locate source code
Easier to implement code analytics to support implementation
Mono repo increases chances of common framework adoption

Integration with existing infrastructure

Where to start
Many microservice solutions are broad rather than deep, so instrumenting the gateway and the first level or two can yield a lot of insight - 80/20
Incremental tracing rollout can accelerate ROI and shorten problem investigation
Successes will drive peer pressure to adopt

Creating culture
Communicate value
Incorporation into developer flows

Trace quality measurement
As a part of a wider code quality analytics set
Needs to be more than binary (applied or not), and account for correct application etc
Dimensions ...
Completeness
Has spans
Has client spans
Minimum client version check - which Jaeger version is being used
Quality
Meaningful endpoint name
Unique id
Other

Provide implementation and troubleshooting guide

Insights via data mining

Integration with Metrics and Logs

Integration with metrics
Standard metrics via tracing instrumentation
Adding context to metrics
Context-aware metrics APIs

Integration with logs
Semi-structured logs, e.g. Log4j, vs highly structured logs, e.g. JSON
The better the structure, the more efficient the indexing can be
SLF4J doesn’t support strongly structured logging, but when combined with a structured formatter for Logstash more structure can be applied
Resources/Logstash-spring.xml

Correlating logs with trace context
Scope manager is pluggable, so can be extended using a decorator pattern
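The notes describe this for Java; the same idea can be sketched with Python's standard `logging` and `contextvars` modules, where a filter stamps the current trace id onto every log record. The logger name, the `current_trace_id` variable and the id value are assumptions for the example.

```python
import contextvars
import io
import logging

# Whatever sets the active span would also set this (hypothetical) variable.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the active trace id onto each record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # captured in memory here; normally a file or stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("demo.orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("abc123")
logger.info("order accepted")
```

With the trace id in every line, a log search for that id brings back exactly the lines written while handling the traced request.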

Distributed Context Propagation

Turning the lights on

8. Sampling

Trade-off, as with logging, between performance and the cost of generating info
Consider tracing backend capacity
Tracing can easily generate more data than the business process itself
Sampling cuts down the tracing info being processed, and the reduction happens at source

Dapper without sampling imposed a 1.5% throughput and 16% latency overhead on the workload. Sampling at 0.01% reduced these figures to 0.06% and 0.20% respectively

Head-based sampling
Decide once per trace, at the trace start
Is an all-or-nothing model
Heavily used in production
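A minimal sketch of a probabilistic head-based decision, assuming a fixed sampling rate: the choice is made once when the trace starts and every later span simply inherits it. The dictionary layout and function names are illustrative only.

```python
import random


def start_trace(sampling_rate, rng):
    """Root of a trace: make the sampling decision exactly once."""
    return {"sampled": rng.random() < sampling_rate, "spans": 1}


def start_child(trace):
    """Downstream spans inherit the decision and never re-sample."""
    trace["spans"] += 1
    return trace["sampled"]


rng = random.Random(42)  # seeded so the example is repeatable
traces = [start_trace(0.1, rng) for _ in range(10_000)]
sampled_share = sum(t["sampled"] for t in traces) / len(traces)
children_agree = all(start_child(t) == t["sampled"] for t in traces)
```

Because children never decide for themselves, a trace is either captured in full or not at all, which is the all-or-nothing property.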

Rate-limit-based sampling
Use leaky bucket algorithm, aka reservoir sampling
Good when workloads are erratic
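One way to sketch a rate-limiting sampler is a credit bucket that refills over time and spends one credit per sampled trace. The class below is an illustration with an injected clock so that it can be exercised deterministically; it is not the implementation of any particular tracer.

```python
class RateLimitingSampler:
    """Samples at most roughly traces_per_second, allowing small bursts."""

    def __init__(self, traces_per_second, clock):
        self.rate = traces_per_second
        self.max_balance = max(traces_per_second, 1.0)
        self.balance = self.max_balance
        self.clock = clock
        self.last_tick = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill credits for the elapsed time, capped at the bucket size.
        earned = (now - self.last_tick) * self.rate
        self.balance = min(self.max_balance, self.balance + earned)
        self.last_tick = now
        if self.balance >= 1.0:
            self.balance -= 1.0
            return True
        return False


fake_now = [0.0]  # fake clock, advanced by hand
sampler = RateLimitingSampler(2.0, clock=lambda: fake_now[0])
burst = [sampler.is_sampled() for _ in range(5)]
fake_now[0] = 1.0  # one second later the bucket has refilled
after_refill = [sampler.is_sampled() for _ in range(5)]
```

However many requests arrive in a burst, only the available credits are spent, so the volume reaching the backend stays bounded when load is erratic.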

Adaptive sampling
Can overcome load surge for the backend by using Kafka for the events, so they are consumed more steadily

Sampling considerations
Jaeger provides the option to shed traffic when the DB is overloaded

Tracing with Service Meshes

Rather than using an ESB as a hub, microservices leverage the sidecar pattern to abstract the central services

Sidecar is implemented as a lightweight process or container in its own right

Sidecar benefits
Can be implemented in its own language
Collocates with the application, meaning limited latency
Each service instance has its own sidecar, so any failing sidecar does not disrupt the entire service
Sidecar can be used to compensate for features missing from the core service
Sidecar lifecycle and identity are aligned to the service

Service mesh is made up of 2 key components
Sidecars can emit uniformly named metrics about traffic in/out, latency and error rates etc
RED: Rate, Error, Duration
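A small sketch of how RED figures could be derived from a window of request records. The `Request` record and the choice of the median as the duration summary are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float
    is_error: bool


def red_metrics(requests, window_seconds):
    """Rate, Error and Duration for the requests seen in one time window."""
    count = len(requests)
    errors = sum(r.is_error for r in requests)
    return {
        "rate_per_second": count / window_seconds,
        "error_ratio": errors / count if count else 0.0,
        "median_duration_ms": (
            median(r.duration_ms for r in requests) if count else None
        ),
    }


# Four requests observed by a sidecar over a two-second window.
window = [
    Request(10.0, False),
    Request(20.0, False),
    Request(30.0, True),
    Request(40.0, False),
]
metrics = red_metrics(window, window_seconds=2)
```

Since every sidecar sees the same kind of traffic, the same three figures can be emitted for every service under the same names.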

Envoy can handle network traffic not only for gRPC and HTTP but also MySQL, Redis and others; as a result concise and rich trace data can be generated

Spring Boot OpenTracing tracing needs the Jaeger dependencies
Sidecar recognises tracing and can action new spans
However, the result is a lot more spans
Jaeger configuration is passed as values from the Docker file using env vars
Envoy doesn’t understand Jaeger’s default wire representation, but it does understand Zipkin’s, aka B3
Jaeger port forwarding is needed in a Docker environment

Istio
Tracing without the microservice using Spring Sleuth will result in the outbound call not being auto-instrumented with the context; the result is a new span being generated
Linkerd and Envoy require the app to propagate the context
Context propagation is the most challenging consideration
White-box tracing implementation recommended, because ...
More control over data collection
Ability to tag to the span key event values
Application logic does not need to know
Understanding which headers relate to tracing for propagation can be complex; white box hides this
Istio can create a service graph without needing tracing
Graph visualisation provided by Istio ...
Force-directed graph ... istio/force/force graph.html
Graphviz /dotvis

Why distributed tracing

Microservices and cloud native apps

Characteristics of microservices/cloud native solutions
Componentization via (micro)services
Smart endpoints and dumb pipes
Organized around business capabilities
Decentralized governance
Decentralized data management
Infrastructure automation
Design for failure
Evolutionary design

In 2015, the Cloud Native Computing Foundation (CNCF) was created as a vendor-neutral home for many emerging open source projects

CNCF charter: Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Monitoring tools under CNCF
Prometheus
Fluentd
OpenTracing
Jaeger

What is observability?
In control theory, a system is observable if the internal states of the system and, accordingly, its behavior can be determined by only looking at its inputs and outputs
However, this is not practical in software engineering terms
YouTube https://youtu.be/U4E0QxzswQc
Sometimes linked more widely with the idea of monitoring, metrics, logs and traces
Oxford Dictionaries’ definition of the verb “monitor” is “to observe and check the progress or quality of (something) over a period of time; keep under systematic review.”

3 pillars of observability
Metrics
Logs
Traces

Observability challenge of microservices
Whilst microservices yield benefits, they also have some challenges
Vijay Gill, Senior VP of Engineering at Databricks, goes as far as saying that the only good reason to adopt microservices is to be able to scale your engineering organization and to “ship the org chart”
Not a popular / common view
2018 “Global Microservices Trends” study [6] by Dimensional Research® found that over 91% of interviewed professionals are using or have plans to use microservices in their system
The same study [6] found that 73% find “troubleshooting is harder” in a microservices environment

Challenges
Orchestration of container deployment
Ability for microservices to locate each other
Reliability can actually drop with more components involved, e.g. multiple components at 99.9% availability don’t total 99.9%
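The arithmetic behind that point, assuming every component in a call chain must be up and that failures are independent (a simplification): the availabilities multiply, so ten components at 99.9% give roughly 99.0% overall.

```python
def serial_availability(per_component, components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which real systems only approximate.
    """
    return per_component ** components


for n in (1, 5, 10, 20):
    print(f"{n:>2} components at 99.9% -> {serial_availability(0.999, n):.4%}")
```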

Risk of latency rise as each microservice takes time invoking the next - need to consider max time, not min

Questions that we need to solve
What services did a call go through?
What did each service involved do?
Where did the error happen?
How have things differed from normal?
New services in the mix? Or services removed?
What was performance like?
What is the critical path for the request?
Who should be called?

Traditional monitoring tools
Traditional tools have limitations in the microservice space
Metrics are helpful as they are concise, numerical truths. But they can be aggregated, removing the nuances
Logs only show us a single instance of a stream
There are multiple forms of concurrency to deal with...

Ben Sigelman - KubeCon 2016

Concurrency where threads pick up and put down sessions means events can start on one thread and complete on another

Using timestamps to sequence across servers and logs is susceptible to clock skew

Distributed tracing

Bryan Cantrill. Visualizing Distributed Systems with Statemaps. Observability Practitioners Summit at KubeCon/CloudNativeCon NA 2018, December 10: https://sched.co/HfG2

Ben Sigelman. Keynote: OpenTracing and Containers: Depth, Breadth, and the Future of Tracing. KubeCon/CloudNativeCon North America, 2016, Seattle: https://sched.co/8fRU

Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed system tracing infrastructure. Technical Report dapper-2010-1, Google, April 2010.
