Scylla Summit 2016: Using ScyllaDB for a Microservice-Based Pipeline in Go
Using ScyllaDB for a Microservice Based Pipeline in Go
Henrik Johansson, Senior Developer, Eniro Initiatives
What will you hear?
• Initial system
▸ Tiny problem…
▸ Fix!
• New system
▸ Description
▸ Some conclusions
Initial system
Initial system - A pipelined transformation engine
• Message based
▪ Stored in Elasticsearch after each transformation
▪ Read from Elasticsearch before each transformation
▪ Small service handled a message in each step
• Batch oriented
▪ Decoupled but belonging to a batch
▪ Needed to maintain message to map relation
- Across the cluster
Initial system - Design view
[Diagram labels: Elastic; Service, Service; Kafka; Signal, Signal; ...]
Initial system - Batches
• Batch oriented
▪ Needed to maintain message to map relation
- Across the cluster
▪ Batch flow
- On each initial input into the chain create association msg <-> batch
- To reassemble a batch just read the associated messages
- Assembly usually at a conceptual end state
▪ Redis
- Because, fast!
Initial system - A complication
• Redis
▪ Super fast
- Inserts as well as lookups
▪ Associations
- Lists
• Separate from the lookup, consistency issue?
- SCAN
• Simple iteration and matching, needs deduplication on client
▪ Chose SCAN
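The SCAN approach above can be sketched roughly as follows. Redis documents that SCAN may return the same element more than once, which is why the client has to deduplicate. This is only a sketch under assumptions: the `scanner` interface and `fakeRedis` are hypothetical stand-ins for a real Redis client so the example runs without a server, and the key layout `assoc:<batch>:<message>` is invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scanner is a hypothetical stand-in for a Redis client's SCAN call:
// it returns one page of keys plus the cursor for the next page.
// A returned cursor of 0 means the iteration is complete.
type scanner interface {
	Scan(cursor uint64, match string, count int) (keys []string, next uint64)
}

// fakeRedis serves fixed pages so the sketch runs without a server.
// Page i is returned for cursor i; duplicates across pages are intentional.
type fakeRedis struct{ pages [][]string }

func (f *fakeRedis) Scan(cursor uint64, match string, count int) ([]string, uint64) {
	prefix := strings.TrimSuffix(match, "*")
	var out []string
	for _, k := range f.pages[cursor] {
		if strings.HasPrefix(k, prefix) {
			out = append(out, k)
		}
	}
	next := cursor + 1
	if int(next) >= len(f.pages) {
		next = 0
	}
	return out, next
}

// messagesInBatch walks the whole keyspace page by page and collects the
// message ids associated with one batch, dropping duplicates on the client.
func messagesInBatch(s scanner, batch string) []string {
	prefix := "assoc:" + batch + ":" // assumed key layout
	seen := make(map[string]struct{})
	var cursor uint64
	for {
		keys, next := s.Scan(cursor, prefix+"*", 100)
		for _, k := range keys {
			seen[strings.TrimPrefix(k, prefix)] = struct{}{}
		}
		if next == 0 {
			break
		}
		cursor = next
	}
	ids := make([]string, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	r := &fakeRedis{pages: [][]string{
		{"assoc:b1:m1", "assoc:b2:m9"},
		{"assoc:b1:m2", "assoc:b1:m1"}, // m1 is returned twice
	}}
	fmt.Println(messagesInBatch(r, "b1")) // prints [m1 m2]
}
```

Note that the loop has to visit every page of the keyspace to list a single batch, so the cost grows with the total number of keys rather than with the size of the batch.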
Initial system - Redis’ very nature
• Single Threaded
▪ Implementation detail
- Not really…
• SCAN
▪ Iterate and sample in batches
▪ Small or large batches have different trade-offs
▪ Blocks… (Docs say it doesn’t, but it does)
Initial system - Redis’ very nature
• Millions or even billions in a batch?
• SCAN
▪ Blocks…
▪ It becomes really slow!
• The whole system is affected
▪ Inserts, lookups, everything stalls....
Because Redis is Single Threaded
Initial system - Batch post mortem
• Not really Redis’ fault
▪ All systems are designed in some way
• We should have known
▪ Hindsight is 20-20
▪ Will remember and do better next time
Initial system - Batch post mortem
• We need a proper and fast way
▪ Association is really a relation… SQL?
▪ Surely we cannot…
• Why not?
▪ Scalability
▪ Simplicity
▪ Speed
Initial system - Batch post mortem
• PostgreSQL
▪ We know it well
▪ Scalability is complicated
• MongoDB
▪ We know it well
▪ Sharding is complicated
Initial system - Batch post mortem
• Custom App
▪ Clustering is fraught with danger…!
• Any other NoSQL
▪ So many to choose from...
▪ Each has its tradeoffs
- Complicated
- Overly simple
- Not delivering as promised
Initial system - Revival
• ScyllaDB
▪ Heard about it online since before 1.0
▪ Cassandra clone
- Less hassle, better performance
▪ Is SQL-ish, some sort of relations supported
▪ Developed by notably skilled people using interesting low level technology
Initial system - Revival
• Same API kept, data in ScyllaDB
▪ Lookups, inserts and listing using compound primary key (mid, bid)
• Implementation time really short
▪ Install Scylla first time somewhat hard
- New stuff and some now removed quirks
- We had non-standard RHEL images
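One way to read "same API kept" is that the association store sat behind a small interface, so only the backend had to change. The sketch below is an assumption about that shape, not the actual code: `AssociationStore`, its method names and `memStore` are hypothetical, and an in-memory map keyed by (mid, bid) stands in for the ScyllaDB table so the example runs without a cluster. The talk does not say which column is the partition key, so no CQL schema is shown.

```go
package main

import (
	"fmt"
	"sort"
)

// key mirrors the (mid, bid) primary key mentioned on the slide:
// message id and batch id.
type key struct{ mid, bid string }

// AssociationStore is a hypothetical version of the API that stayed the
// same while the backend moved from Redis to ScyllaDB.
type AssociationStore interface {
	Insert(mid, bid string) error
	Lookup(mid, bid string) (bool, error)
	List(bid string) ([]string, error)
}

// memStore is an in-memory stand-in for the database-backed implementation.
// It is not safe for concurrent use.
type memStore struct{ rows map[key]struct{} }

func newMemStore() *memStore { return &memStore{rows: make(map[key]struct{})} }

// Insert records the association; inserting the same pair twice keeps one row.
func (s *memStore) Insert(mid, bid string) error {
	s.rows[key{mid, bid}] = struct{}{}
	return nil
}

// Lookup reports whether the message is associated with the batch.
func (s *memStore) Lookup(mid, bid string) (bool, error) {
	_, ok := s.rows[key{mid, bid}]
	return ok, nil
}

// List returns the ids of all messages in a batch, sorted for stable output.
func (s *memStore) List(bid string) ([]string, error) {
	var mids []string
	for k := range s.rows {
		if k.bid == bid {
			mids = append(mids, k.mid)
		}
	}
	sort.Strings(mids)
	return mids, nil
}

func main() {
	var store AssociationStore = newMemStore()
	store.Insert("m1", "b1")
	store.Insert("m2", "b1")
	store.Insert("m1", "b1") // duplicate insert, no extra row
	mids, _ := store.List("b1")
	fmt.Println(mids) // prints [m1 m2]
}
```

With the key acting as the row identity, duplicates disappear at write time, so listing a batch no longer needs the client-side deduplication that SCAN required.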
Initial system - Revival
• Performance
▪ Lookups and inserts on par with previous implementation
▪ Listings superfast
- Dominated by things like marshalling and networking
- Queries really fast even in max batch scenario
• Ops maintenance?
▪ None, really… I know, it’s crazy!
New system
New system - A pipelined transformation engine take 2
• Stream based
▪ No mandatory persistence, each service stores results if needed
▪ Messages flow through NATS
▪ No state to maintain
▪ Small service handles a message in each step
▪ Services can easily be scaled independently
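The stream-based design can be illustrated with plain Go channels. This is a sketch under assumptions, not the real system: channels stand in for NATS subjects (no NATS client API is used here), and the two steps (`strings.TrimSpace`, `strings.ToUpper`) are placeholders for real transformations. What it tries to show is that each stage is stateless and can be given its own number of workers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// stage runs `workers` copies of a stateless step, reading from in and
// writing to the returned channel. Channels stand in for NATS subjects.
func stage(in <-chan string, workers int, step func(string) string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range in {
				out <- step(msg)
			}
		}()
	}
	// Close the output once every worker has drained the input.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// runPipeline pushes msgs through two steps and collects the results.
func runPipeline(msgs []string) []string {
	src := make(chan string)
	go func() {
		for _, m := range msgs {
			src <- m
		}
		close(src)
	}()
	trimmed := stage(src, 1, strings.TrimSpace)
	upper := stage(trimmed, 3, strings.ToUpper) // this step is scaled to 3 workers
	var out []string
	for m := range upper {
		out = append(out, m)
	}
	sort.Strings(out) // ordering is not preserved across parallel workers
	return out
}

func main() {
	fmt.Println(runPipeline([]string{" a ", "b", " c"})) // prints [A B C]
}
```

Because no stage keeps state between messages, changing the worker count of one stage does not affect the others.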
New system - Design view
[Diagram labels: Service, Service; NATS; Scylla, Scylla]
New system - What do we store?
• Events
▪ When is a message payload rejected and why
- Historical data, significant volume over time
▪ When is a message payload no longer rejected and why
- Historical data, significant volume over time
▪ Queries by global identifier
- Listing and manual inspection for now
- Future work, proper visualization
▪ Many other valuable events available, future work
New system - What do we store?
• Actual data
▪ At certain points in the chain the data is more valuable
- Costly to produce
- Replay functionality
▪ Generate output artifacts regularly
- Nightly files
▪ APIs for online services
- Possibly at different stages of the chain
New system - How do we store?
• Complex data types
▪ Hundreds of (nested) attributes, i.e. documents
- Serialize and store metadata (id, time, etc.) and blob
- Resistant to data evolution problems, e.g. schema changes
- Hard to search, filter, …
• Simpler data types
▪ Store in table as per the usual way
- Logging
- Events
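The "metadata plus blob" layout for complex documents might look roughly like this in Go. It is a sketch under assumptions: the talk does not name the serialization format, so JSON is used here only because it is in the standard library, and `Record`, `Company`, `toRecord` and `fromRecord` are hypothetical names. The example also shows why the layout tolerates schema changes: an older blob still decodes after a field has been added to the struct.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Record is a hypothetical row: a few queryable metadata columns plus the
// serialized document as an opaque blob.
type Record struct {
	ID      string
	Created time.Time
	Payload []byte
}

// toRecord serializes any document (JSON is an assumption) into a Record.
func toRecord(id string, doc interface{}) (Record, error) {
	blob, err := json.Marshal(doc)
	if err != nil {
		return Record{}, err
	}
	return Record{ID: id, Created: time.Now().UTC(), Payload: blob}, nil
}

// fromRecord decodes the blob back into the caller's document type.
func fromRecord(r Record, out interface{}) error {
	return json.Unmarshal(r.Payload, out)
}

// Company stands in for a document with hundreds of nested attributes.
// Phone represents a field added after older blobs were written.
type Company struct {
	Name  string `json:"name"`
	Phone string `json:"phone,omitempty"`
}

func main() {
	// An "old" document written before the phone field existed.
	rec, err := toRecord("42", map[string]string{"name": "Acme"})
	if err != nil {
		panic(err)
	}
	var c Company
	if err := fromRecord(rec, &c); err != nil {
		panic(err)
	}
	// The missing field simply decodes to its zero value.
	fmt.Println(rec.ID, c.Name, c.Phone == "") // prints 42 Acme true
}
```

The trade-off is the one named on the slide: only the metadata columns are visible to the database, so searching or filtering on attributes inside the blob has to happen in the application.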
New system - Technologies?
• ScyllaDB, well obviously...
▪ Fairly small machines
- 3 nodes
- 16 GB RAM
- 6 Intel Xeon 2.6 GHz
▪ Read latency in single digit millis
- Can probably tune but no need yet
▪ Write latency really good
- Waiting for Prometheus support to get some reliable data (~40µs...)
New system - Technologies?
• ScyllaDB
▪ Currently all data in the same instances
- Seems to be no problem at all
- Different keyspaces so moving to other nodes easy (knock, knock)
▪ Upgraded without any problems
- Have run all versions in production
- Upgrade scenario very easy so far
- 1.0 -> 1.1 was a bit trickier if memory serves
New system - Technologies?
• NATS
▪ Fairly small machines
- 2 nodes in cluster configuration
- 16 GB RAM (probably way too much)
- 4 Intel Xeon 2.6 GHz
▪ Increased default payload size
- Some abnormally large entities
▪ Req-Rep primary pattern
- Sometimes many responses to stream large data sets
New system - Technologies?
• NATS
▪ There are potential use cases for persistent messaging
- NATS today is not persistent, by design
• It’s a good thing
- Keeping an eye on NATS Streaming
• Seems like NATS Streaming is to Kafka what Scylla is to Cassandra
- No immediate need yet but easy to envision cases
New system - Technologies?
• Apps
▪ Written in Go
- Gocql for Scylla communication, works very well
- Go is a first class language for NATS
- Excellent concurrency support, crucial in microservice based systems
▪ Why not X?
- Wanted to try something else mostly
- Followed Go for a long time
- Like the founding principles as well as implementation
New system - Technologies?
• ScyllaDB - http://www.scylladb.com/
• NATS - http://nats.io/
• Go - https://golang.org/
• Gocql - http://gocql.github.io/
• Gb - https://getgb.io/
▪ Easy build tool for Go. There are many. I like this one.