scylla summit 2016: using scylladb for a microservice-based pipeline in go

Using ScyllaDB for a Microservice Based Pipeline in Go

Henrik Johansson, Senior Developer, Eniro Initiatives

What will you hear?

• Initial system

▸ Tiny problem…

▸ Fix!

• New system

▸ Description

▸ Some conclusions

Initial system

Initial system - A pipelined transformation engine

• Message based
▪ Stored in Elasticsearch after each transformation
▪ Read from Elasticsearch before each transformation
▪ Small service handled a message in each step

• Batch oriented
▪ Decoupled but belonging to a batch
▪ Needed to maintain message to map relation
- Across the cluster

Initial system - Design view

[Diagram: Services chained together, each using Elastic; Signals pass through Kafka]

Initial system - Batches

• Batch oriented
▪ Needed to maintain message to map relation
- Across the cluster
▪ Batch flow
- On each initial input into the chain, create association msg <-> batch
- To reassemble a batch, just read the associated messages
- Assembly usually at a conceptual end state
▪ Redis
- Because, fast!

Initial system - A complication

• Redis
▪ Super fast
- Inserts as well as lookups
▪ Associations
- Lists
• Separate from the lookup, consistency issue?
- SCAN
• Simple iteration and matching, needs deduplication on client
▪ Chose SCAN
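The SCAN approach with client-side deduplication can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the `KeyScanner` interface, the `fakeScanner` stand-in and the `batchID:messageID` key layout are all assumptions made for the example. What it relies on from Redis is only that SCAN is cursor based, finishes when the cursor returns to 0, and may return the same key more than once.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyScanner is a hypothetical, minimal view of a Redis client: one call
// performs one SCAN step and returns a page of keys plus the next cursor.
// A returned cursor of 0 means the iteration is complete.
type KeyScanner interface {
	Scan(cursor uint64, match string, count int) (keys []string, next uint64, err error)
}

// ListBatch iterates with SCAN until the cursor returns to 0 and returns
// each matching key once. The key layout "batchID:messageID" is assumed.
func ListBatch(s KeyScanner, batchID string, count int) ([]string, error) {
	seen := make(map[string]struct{})
	var out []string
	var cursor uint64
	for {
		keys, next, err := s.Scan(cursor, batchID+":*", count)
		if err != nil {
			return nil, err
		}
		for _, k := range keys {
			if _, dup := seen[k]; dup {
				continue // SCAN may repeat keys, so drop duplicates here
			}
			seen[k] = struct{}{}
			out = append(out, k)
		}
		if next == 0 {
			return out, nil
		}
		cursor = next
	}
}

// fakeScanner stands in for Redis. It pages through a fixed slice that
// deliberately contains a duplicate and a key from another batch.
type fakeScanner struct{ keys []string }

func (f fakeScanner) Scan(cursor uint64, match string, count int) ([]string, uint64, error) {
	prefix := strings.TrimSuffix(match, "*")
	start := int(cursor)
	end := start + count
	if end > len(f.keys) {
		end = len(f.keys)
	}
	var page []string
	for _, k := range f.keys[start:end] {
		if strings.HasPrefix(k, prefix) {
			page = append(page, k)
		}
	}
	next := uint64(end)
	if end == len(f.keys) {
		next = 0
	}
	return page, next, nil
}

func main() {
	f := fakeScanner{keys: []string{"b1:m1", "b2:m9", "b1:m2", "b1:m1", "b1:m3"}}
	msgs, err := ListBatch(f, "b1", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(msgs)
}
```

Note that every SCAN step still walks part of the whole keyspace, not just the keys of one batch, so the work grows with the total number of stored associations.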

Initial system - Redis’ very nature

• Single threaded
▪ Implementation detail
- Not really…
• SCAN
▪ Iterate and sample in batches
▪ Small and large batches have different trade-offs
▪ Blocks… (the docs say it doesn’t, but it does)

Initial system - Redis’ very nature

• Millions or even billions in a batch?
• SCAN
▪ Blocks…
▪ It becomes really slow!
• The whole system is affected
▪ Inserts, lookups, everything stalls...

Because Redis is single threaded

Initial system - Batch post mortem

• Not really Redis’ fault
▪ All systems are designed in some way
• We should have known
▪ Hindsight is 20-20
▪ Will remember and do better next time


• We need a proper and fast way
▪ Association is really a relation… SQL?
▪ Surely we cannot…
• Why not?
▪ Scalability
▪ Simplicity
▪ Speed


• PostgreSQL
▪ We know it well
▪ Scalability is complicated
• MongoDB
▪ We know it well
▪ Sharding is complicated


• Custom app
▪ Clustering is fraught with danger…!
• Any other NoSQL
▪ So many to choose from...
▪ Each has its tradeoffs
- Complicated
- Overly simple
- Not delivering as promised

Initial system - Revival

• ScyllaDB
▪ Heard about it online since before 1.0
▪ Cassandra clone
- Less hassle, better performance
▪ Is SQL-ish, some sort of relations supported
▪ Developed by notably skilled people using interesting low-level technology


• Same API kept, data in ScyllaDB
▪ Lookups, inserts and listing using a composite primary key (mid, bid)
• Implementation time really short
▪ Installing Scylla the first time was somewhat hard
- New stuff and some now-removed quirks
- We had non-standard RHEL images
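As an illustration of the lookup, insert and listing operations over the (mid, bid) key, here is a minimal sketch. The table name, column types and CQL statements are hypothetical. The slide only names the key columns, so which column partitions the table is an assumption: here the batch id is the partition key so that listing a batch reads a single partition. The in-memory `Associations` type only models the intended semantics and makes the example runnable without a cluster. With gocql, the statements would typically be executed through `session.Query(stmt, args...)`.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical CQL for the association table (names and types assumed).
const (
	createTable = `CREATE TABLE IF NOT EXISTS batch_messages (
    bid text,
    mid text,
    PRIMARY KEY (bid, mid))`
	insertStmt = `INSERT INTO batch_messages (bid, mid) VALUES (?, ?)`
	lookupStmt = `SELECT mid FROM batch_messages WHERE bid = ? AND mid = ?`
	listStmt   = `SELECT mid FROM batch_messages WHERE bid = ?`
)

// Associations is an in-memory model of the table above: one "partition"
// per batch id, holding the set of message ids in that batch.
type Associations struct {
	byBatch map[string]map[string]struct{}
}

func NewAssociations() *Associations {
	return &Associations{byBatch: make(map[string]map[string]struct{})}
}

// Insert records that message mid belongs to batch bid. Like a CQL
// INSERT, repeating it for the same key is harmless.
func (a *Associations) Insert(bid, mid string) {
	if a.byBatch[bid] == nil {
		a.byBatch[bid] = make(map[string]struct{})
	}
	a.byBatch[bid][mid] = struct{}{}
}

// Lookup reports whether message mid is associated with batch bid.
func (a *Associations) Lookup(bid, mid string) bool {
	_, ok := a.byBatch[bid][mid]
	return ok
}

// List returns all message ids of a batch, sorted the way a clustering
// column would order them within a partition.
func (a *Associations) List(bid string) []string {
	mids := make([]string, 0, len(a.byBatch[bid]))
	for mid := range a.byBatch[bid] {
		mids = append(mids, mid)
	}
	sort.Strings(mids)
	return mids
}

func main() {
	fmt.Println(createTable)
	fmt.Println(insertStmt)
	fmt.Println(lookupStmt)
	fmt.Println(listStmt)

	a := NewAssociations()
	a.Insert("b1", "m2")
	a.Insert("b1", "m1")
	a.Insert("b1", "m2") // duplicate insert, no effect
	fmt.Println(a.Lookup("b1", "m1"), a.List("b1"))
}
```

Unlike the SCAN approach, listing here touches only the rows of the requested batch, and duplicates cannot occur because the key is unique.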


• Performance
▪ Lookups and inserts on par with previous implementation
▪ Listings superfast
- Dominated by things like marshalling and networking
- Queries really fast even in max batch scenario
• Ops maintenance?
▪ None, really… I know, it’s crazy!

New system

New system - A pipelined transformation engine take 2

• Stream based
▪ No mandatory persistence, each service stores results if needed
▪ Messages flow through NATS
▪ No state to maintain
▪ Small service handles a message in each step
▪ Services can easily be scaled independently

New system - Design view

[Diagram: two Services connected through NATS, each backed by Scylla]

New system - What do we store?

• Events
▪ When is a message payload rejected and why
- Historical data, significant volume over time
▪ When is a message payload no longer rejected and why
- Historical data, significant volume over time
▪ Queries by global identifier
- Listing and manual inspection for now
- Future work, proper visualization
▪ Many other valuable events available, future work

New system - What do we store?

• Actual data
▪ At certain points in the chain the data is more valuable
- Costly to produce
- Replay functionality
▪ Generate output artifacts regularly
- Nightly files
▪ APIs for online services
- Possibly at different stages of the chain

New system - How do we store?

• Complex data types
▪ Hundreds of (nested) attributes, i.e. documents
- Serialize and store metadata (id, time, etc.) and blob
- Resistant to data evolution problems, e.g. schema changes
- Hard to search, filter, …
• Simpler data types
▪ Store in table as per the usual way
- Logging
- Events

New system - Technologies?

• ScyllaDB, well obviously...
▪ Fairly small machines
- 3 nodes
- 16 GB RAM
- 6 Intel Xeon 2.6 GHz
▪ Read latency in single-digit milliseconds
- Can probably tune but no need yet
▪ Write latency really good
- Waiting for Prometheus support to get some reliable data (~40µs...)


• ScyllaDB
▪ Currently all data in the same instances
- Seems to be no problem at all
- Different keyspaces so moving to other nodes easy (knock, knock)
▪ Upgraded without any problems
- Have run all versions in production
- Upgrade scenario very easy so far
- 1.0 -> 1.1 was a bit trickier if memory serves


• NATS
▪ Fairly small machines
- 2 nodes in cluster configuration
- 16 GB RAM (probably way too much)
- 4 Intel Xeon 2.6 GHz
▪ Increased default payload size
- Some abnormally large entities
▪ Req-Rep primary pattern
- Sometimes many responses to stream large data sets
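The "one request, many responses" idea can be illustrated in-process. This sketch does not use the NATS client at all: a Go channel stands in for the reply subject, and the convention that an empty message ends the stream is an assumption made for the example, not something the slides state. The function names are hypothetical.

```go
package main

import "fmt"

// Chunk splits data into pieces of at most maxPayload bytes, so that no
// single reply exceeds the broker's payload limit.
func Chunk(data []byte, maxPayload int) [][]byte {
	if maxPayload <= 0 {
		return nil
	}
	var out [][]byte
	for len(data) > 0 {
		n := maxPayload
		if n > len(data) {
			n = len(data)
		}
		out = append(out, data[:n])
		data = data[n:]
	}
	return out
}

// Respond plays the replying service: it sends each chunk on the reply
// channel and then an empty message to mark the end of the stream.
func Respond(reply chan<- []byte, data []byte, maxPayload int) {
	for _, c := range Chunk(data, maxPayload) {
		reply <- c
	}
	reply <- nil // assumed end-of-stream marker
}

// Collect plays the requesting service: it reads replies until the
// end-of-stream marker arrives and reassembles the payload.
func Collect(reply <-chan []byte) []byte {
	var out []byte
	for msg := range reply {
		if len(msg) == 0 {
			break
		}
		out = append(out, msg...)
	}
	return out
}

func main() {
	reply := make(chan []byte)
	go Respond(reply, []byte("a large data set"), 5)
	fmt.Println(string(Collect(reply)))
}
```

Streaming a large result as several replies keeps each message small, which is an alternative to raising the maximum payload size for every oversized entity.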


• NATS
▪ There are potential use cases for persistent messaging
- NATS today is not persistent, by design
• It’s a good thing
- Keeping an eye on NATS Streaming
• Seems like NATS Streaming is to Kafka what Scylla is to Cassandra
- No immediate need yet but easy to envision cases


• Apps
▪ Written in Go
- Gocql for Scylla communication, works very well
- Go is a first-class language for NATS
- Excellent concurrency support, crucial in microservice-based systems
▪ Why not X?
- Wanted to try something else mostly
- Followed Go for a long time
- Like the founding principles as well as the implementation


• ScyllaDB - http://www.scylladb.com/

• NATS - http://nats.io/

• Go - https://golang.org/

• Gocql - http://gocql.github.io/

• Gb - https://getgb.io/
▪ Easy build tool for Go. There are many. I like this one.

Thank You!

Feedback is awesome!

Email: [email protected]

Twitter: @dahankzter
