OpenZipkin Conf: Zipkin at Yelp


TRANSCRIPT

Zipkin @ Yelp

Prateek Agarwal

@prat0318

About me

- Prateek Agarwal

- Software Engineer, Infrastructure team @ Yelp

- Have worked on:
  - Python Swagger clients
  - Zipkin infrastructure
  - Maintaining Cassandra and ES clusters

Yelp’s Mission

Connecting people with great local businesses.

Yelp Stats (as of Q1 2016): 90M · 32 · 70% · 102M

Agenda

- Zipkin Infrastructure

- pyramid_zipkin / swagger_zipkin

- Lessons learned

- Future plans

Infrastructure overview

- 250+ services

- We <3 Python

- Pyramid/uwsgi framework

- SmartStack for service discovery

- Swagger for API schema declaration

- Zipkin transport: Kafka | Zipkin datastore: Cassandra

- Traces are generated on live traffic at a very low rate (0.005%)

- Traces can also be generated on-demand by providing a particular query param (see the config sketch below)
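To make the transport and sampling setup concrete, here is a minimal sketch of a Pyramid app factory that enables pyramid_zipkin with a Kafka transport handler and the 0.005% sample rate, assuming pyramid_zipkin's 2016-era registry settings and the old kafka-python producer API; the broker address and topic name are illustrative, and the on-demand query-param override is a Yelp-side convention not shown here.

```python
# Minimal sketch (not Yelp's actual config): enable pyramid_zipkin with a Kafka
# transport and a very low sample rate.
from kafka import KafkaClient, SimpleProducer  # old kafka-python (<1.0) API
from pyramid.config import Configurator


def kafka_transport_handler(stream_name, message):
    """Ship one encoded Zipkin span message to the Kafka topic used as transport."""
    kafka = KafkaClient('kafka-broker:9092')   # illustrative broker address
    producer = SimpleProducer(kafka)
    producer.send_messages(stream_name, message)


def main(global_config, **settings):
    # Settings keys as documented by pyramid_zipkin around this era.
    settings['zipkin.transport_handler'] = kafka_transport_handler
    settings['zipkin.stream_name'] = 'zipkin'    # Kafka topic carrying spans
    settings['zipkin.tracing_percent'] = 0.005   # trace 0.005% of live traffic
    config = Configurator(settings=settings)
    config.include('pyramid_zipkin')  # registers the tracing tween around every request
    return config.make_wsgi_app()
```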

Infrastructure overview

Let’s talk about a scenario where service A calls B.

pyramid_zipkin

- A simple decorator around every request

- Able to handle scribe | kafka transport

- Attaches a `unique_request_id` to every request

- No changes needed in the service logic

- Ability to add annotations using python’s `logging` module

- Ability to add custom spans (see the sketch after this list)

[Diagram: Service B running pyramid/uwsgi, wrapped by pyramid_zipkin]
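As a rough illustration of the custom-span bullet above, the sketch below uses py_zipkin's `zipkin_span` context manager (pyramid_zipkin is built on py_zipkin); the service name, span name, and the Cassandra helper are hypothetical, and the `logging`-module annotation hook mentioned above is a separate path not shown here.

```python
# Sketch: add a custom child span inside a request that pyramid_zipkin is
# already tracing, using py_zipkin's zipkin_span context manager.
from py_zipkin.zipkin import zipkin_span


def do_cassandra_query(business_id):
    """Hypothetical stand-in for a real datastore call."""
    return []


def fetch_reviews(business_id):
    # Opens a nested span under the server span pyramid_zipkin created for this
    # request; the span's timing covers the body of the `with` block.
    with zipkin_span(
        service_name='service_b',                  # illustrative names
        span_name='fetch_reviews_from_cassandra',
    ) as span:
        reviews = do_cassandra_query(business_id)
        # Attach extra key/value (binary) annotations to this span.
        span.update_binary_annotations({'business_id': str(business_id)})
        return reviews
```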

swagger_zipkin

- Eliminates the manual work of attaching zipkin headers

- Decorates over swagger clients:
  - swaggerpy (swagger v1.2)
  - bravado (swagger v2.0)

[Diagram: Service A’s swagger client wrapped by swagger_zipkin]
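A minimal sketch of what the decoration looks like, assuming swagger_zipkin's `ZipkinClientDecorator` wrapping a bravado (swagger v2.0) client; the petstore spec URL and operation are the standard bravado example, not a Yelp service.

```python
# Sketch: wrap a bravado client so every outgoing call automatically carries
# the current trace's zipkin (X-B3-*) headers.
from bravado.client import SwaggerClient
from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

client = SwaggerClient.from_url('http://petstore.swagger.io/v2/swagger.json')
zipkin_client = ZipkinClientDecorator(client)

# Calls look exactly like plain bravado calls; no manual header plumbing needed.
pet = zipkin_client.pet.getPetById(petId=42).result()
```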

Lessons Learned

- Cassandra is an excellent datastore for heavy writes

- Typical prod writes/sec: 15k

- It was able to even handle 100k writes/sec

Lessons Learned

- Allocating offheap memory for Cassandra helped reduce write latency by 2x

- Pending compactions also went down

Lessons Learned

- With more services added, fetching from Kafka became a bottleneck

- Solutions tried:
  - Adding more Kafka partitions
  - Running more instances of the collector
  - Adding multiple Kafka consumer threads (with appropriate changes in openzipkin/zipkin): WIN
  - Batching up messages before sending to Kafka (with appropriate changes in openzipkin/zipkin): BIG WIN (see the sketch after this list)

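The batching win also required collector-side changes in openzipkin/zipkin, but the producer-side idea can be sketched as below, reusing the old kafka-python API from the earlier sketch; the buffer size, topic, and broker address are illustrative, and this is not Yelp's actual implementation.

```python
# Sketch of the "batch before producing" idea: buffer encoded span messages and
# flush them to Kafka in one produce call instead of one call per span.
from kafka import KafkaClient, SimpleProducer  # old kafka-python (<1.0) API

BATCH_SIZE = 100                                # illustrative flush threshold
_producer = SimpleProducer(KafkaClient('kafka-broker:9092'))
_buffer = []


def batching_kafka_handler(stream_name, message):
    """Transport handler that sends spans to Kafka in batches of BATCH_SIZE."""
    _buffer.append(message)
    if len(_buffer) >= BATCH_SIZE:
        # send_messages accepts many messages in a single call.
        _producer.send_messages(stream_name, *_buffer)
        del _buffer[:]
```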

Future Plans

- To be used during deployments to check for degradations (see the sketch after this list):
  - Validate the differences in the number of downstream calls
  - Check against any new dependency sneaking in
  - Check time differences in the spans

- Create trace aggregation infrastructure using Splunk (WIP): a missing part of Zipkin

- Redeploy the zipkin dependency graph service after improvements:
  - The service was unprovisioned because it created 100s of GBs of /tmp files
  - These files got purged after the run (in ~1-2 hours)
  - Meanwhile, ops got alerted due to low remaining disk space
  - It didn’t give much of a value addition
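The deploy-time degradation check can be pictured with a small sketch: compare per-service span counts between a baseline trace and a post-deploy trace to surface new dependencies or call-count changes. The flat span shape with a 'serviceName' key is a simplifying assumption, not the exact Zipkin API payload.

```python
# Sketch: diff the downstream-call profile of two traces (baseline vs. after a
# deploy) to catch new dependencies or changed call counts.
from collections import Counter


def downstream_call_counts(spans):
    """Count spans per downstream service name."""
    return Counter(span['serviceName'] for span in spans)


def diff_traces(baseline_spans, candidate_spans):
    """Return {service: (baseline_count, candidate_count)} for every service
    whose call count changed, including brand-new dependencies."""
    before = downstream_call_counts(baseline_spans)
    after = downstream_call_counts(candidate_spans)
    return {
        service: (before[service], after[service])
        for service in set(before) | set(after)
        if before[service] != after[service]
    }


if __name__ == '__main__':
    baseline = [{'serviceName': 'service_b'}, {'serviceName': 'service_b'}]
    candidate = baseline + [{'serviceName': 'new_dependency'}]
    print(diff_traces(baseline, candidate))   # {'new_dependency': (0, 1)}
```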

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp