www.mapflat.com
Testing data streaming applications
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
Who’s talking?
● Swedish Institute of Computer Science (distributed system test + debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
Why stream processing?
● Increasing number of data-driven features
● 90+% fed by batch processing
  ○ Simpler, better tooling
  ○ 1+ hour data reaction time
● Stream processing for
  ○ 100 ms - 1 hour reaction
  ○ Decoupled, asynchronous microservices
[Diagram: data sources (user content, professional content, ads/partners, user behaviour, ads systems, system diagnostics) feed data-based features: recommendations, curated content, pushing, business intelligence, experiments, exploration]
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs
[Diagram: ad hoc point-to-point integrations between services and apps: DB polling, queues, aggregated logs, hourly NFS dumps, scp, HTTP, ETL into a data warehouse]
The unified log
● Publish data in streams
● Replicated, sharded append-only log
● Pub/sub with history
  ○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing
[Diagram: apps (Ads, Search, Feed) publish to streams; streams are tapped to a data lake]
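The "append-only log with history" idea can be illustrated with a tiny in-memory stand-in for Kafka (the `UnifiedLog` class and its method names are invented for illustration, not a real client API):

```python
class UnifiedLog:
    """Toy append-only log: records are never mutated, consumers track offsets."""

    def __init__(self):
        self._records = []  # append-only history

    def publish(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Read everything from a given offset, e.g. to replay after a bug fix."""
        return self._records[offset:]


log = UnifiedLog()
log.publish({"event": "click", "user": 1})
checkpoint = log.publish({"event": "view", "user": 2}) + 1
log.publish({"event": "click", "user": 3})

# A consumer that stopped at `checkpoint` resumes without data loss:
replayed = log.read_from(checkpoint)
```

Because history is retained, a fixed job can be re-run from offset 0 over the same input, which is what makes replay-based recovery possible.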
Stream processing
● Decoupled producers/consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix
[Diagram: apps (Ads, Search, Feed) publish to streams; jobs consume streams and publish derived streams; streams are tapped to a data lake and business intelligence]
Stream processing building blocks
● Aggregate
  ○ Calculate time windows
  ○ Aggregate state (in memory / local database / shared database)
● Filter
  ○ Slim down stream
  ○ Privacy, security concerns
● Join
  ○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
  ○ Join streams within time windows, e.g. click-through rate
● Transform
  ○ Bring data into the same “shape”, schema
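Three of the building blocks (transform, filter, aggregate) can be sketched in plain Python as a tumbling-window click count; the one-minute window and event shape are illustrative choices, not from the talk:

```python
from collections import Counter

WINDOW_MS = 60_000  # tumbling one-minute windows (illustrative choice)

def window_of(timestamp_ms):
    """Transform: map an event timestamp to its window start."""
    return timestamp_ms - timestamp_ms % WINDOW_MS

def clicks_per_window(events):
    """Filter out non-clicks, then aggregate counts per time window in memory."""
    counts = Counter()
    for event in events:
        if event["type"] != "click":  # filter: slim down the stream
            continue
        counts[window_of(event["ts"])] += 1  # aggregate: in-memory state
    return dict(counts)

events = [
    {"ts": 5_000, "type": "click"},
    {"ts": 59_999, "type": "view"},
    {"ts": 61_000, "type": "click"},
    {"ts": 62_000, "type": "click"},
]
result = clicks_per_window(events)
# → {0: 1, 60000: 2}
```

A streaming engine adds distribution, state management, and late-data handling on top, but the per-window logic is this simple function at its core.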
Stream processing technologies
● Spark Streaming
  ○ Ideal if you are already using Spark, same model
  ○ Bridges the gap between data scientists / data engineers, batch and stream
● Kafka Streams
  ○ Library - new, positions itself as a lightweight alternative
  ○ Tightly coupled to Kafka
● Others
  ○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
Egress
● Update database table, e.g. for a polling dashboard
● Create service index table n+1; notify service to switch
● Post to an external web service
● Push stream to client
[Diagram: jobs consume streams and write results to a service / app]
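The "index table n+1, then switch" egress pattern can be sketched as follows; the `tables` dict and `active` pointer are stand-ins for a real database and the service-notification step:

```python
# Hypothetical egress state: versioned tables plus an "active version" pointer.
tables = {}
active = {"version": 0}

def publish_index(version, rows):
    """Write table n+1 completely, then flip the pointer the service reads.
    The service never observes a partially written table."""
    tables[version] = rows       # build the new table alongside the old one
    active["version"] = version  # the "notify service to switch" step

def lookup(key):
    """Service side: read only through the active version."""
    return tables[active["version"]].get(key)

publish_index(1, {"user1": ["rec_a"]})
publish_index(2, {"user1": ["rec_b"]})
# lookup("user1") → ["rec_b"]
```

Keeping table n around after the switch also gives a cheap rollback path if version n+1 turns out to be bad.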
Test concepts
[Diagram: a test framework (e.g. JUnit, Scalatest) feeds test input to the system under test (SUT), which runs inside a test harness with a test fixture of 3rd party components (e.g. a DB); a test oracle checks the output; seams, IDEs, and build tools connect to the harness]
Potential test scopes
● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces. Each scope has a cost.
[Diagram: nested test scopes around streams, jobs, a service, and an app]
Stream application properties
● Output = function(input, code)
  ○ Perfect for testing!
  ○ Avoid: indeterministic processing, reading the wall clock
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common
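"Output = function(input, code)" breaks down as soon as a job reads the wall clock; passing the clock in as input keeps it deterministic. A minimal sketch, where the `is_session_expired` function and 30-minute gap are invented for illustration:

```python
SESSION_GAP_MS = 30 * 60 * 1000  # illustrative session timeout

def is_session_expired(last_event_ms, now_ms):
    """Take `now_ms` as a parameter instead of calling time.time():
    the same inputs always produce the same output."""
    return now_ms - last_event_ms > SESSION_GAP_MS

# Deterministic: the test picks the clock value, no sleeps, no flakiness.
expired = is_session_expired(0, SESSION_GAP_MS + 1)       # True
still_open = is_session_expired(0, SESSION_GAP_MS)        # False
```

The same principle applies to random seeds and iteration order: make every source of nondeterminism an explicit input.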
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
[Diagram: nested test scopes around streams, jobs, and a service]
Scopes to avoid
● Unit
  ○ Few stable interfaces
  ○ Not necessary
  ○ Avoid mocks, DI rituals
● Full system, including client
  ○ Client automation is fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
Stream application, example harness
● Scalatest
● Spark Streaming jobs
● IDE, CI, debug integration
[Diagram: Docker runs the fixtures (Kafka topic, DB); test input is pushed to Kafka, the test oracle polls the DB; driven from the IDE / Gradle]
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture

For absence tests, send dummy sync messages at the end.
[Diagram: Scalatest handles steps 2 and 7, Spark step 4, Docker step 1; driven from the IDE / Gradle]
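Step 6, polling the egress database until the expected output appears or a deadline passes, can be sketched as below; the `fake_poll` callback is a stand-in for a real database query, and the function name is invented:

```python
import time

def await_output(poll, predicate, timeout_s=30.0, interval_s=0.001):
    """Poll until predicate(poll()) holds, or fail after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = poll()
        if predicate(rows):
            return rows
        time.sleep(interval_s)
    raise TimeoutError("expected output never appeared in the database")

# Fake fixture for demonstration: the "database" fills up after a few polls.
state = {"polls": 0}

def fake_poll():
    state["polls"] += 1
    return ["row"] * 3 if state["polls"] >= 5 else []

rows = await_output(fake_poll, lambda r: len(r) == 3)
```

Polling with a generous timeout keeps the test fast when the pipeline is fast, while still tolerating slow CI machines; a fixed `sleep(30)` would make every run slow and still be flaky.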
Input generation
● Input & output is denormalised & wide
● Fields are frequently changed
  ○ Additions are compatible
  ○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
  ○ Unmaintainable
● Input generation routines
  ○ Robust to changes, reusable
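An input generation routine keeps tests robust to field changes: one builder owns the defaults, and each test overrides only the fields it cares about. A minimal sketch, with an invented event shape:

```python
def make_click_event(**overrides):
    """Builder with defaults for a hypothetical click event.
    When a field is added, it is added here once, not in every
    static JSON file scattered through the test suite."""
    event = {
        "type": "click",
        "user_id": "user-1",
        "ts": 1_000,
        "page": "/home",
    }
    event.update(overrides)
    return event

# Each test states only what matters to it:
late_click = make_click_event(ts=99_000)
other_user = make_click_event(user_id="user-2")
```

This is the dynamic equivalent of the test data builder pattern; property-based generators (e.g. ScalaCheck in a Scala codebase) take the same idea further.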
Test oracles
● Compare with expected output
● Check only the fields relevant for the test
  ○ Robust to field changes
  ○ Reusable for new, similar types
● Tip: use lenses
  ○ JSON: JsonPath (Java), Play JSON (Scala)
  ○ Case classes: Monocle
● Express invariants for each data type
  ○ Reuse for production data quality monitoring
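Both oracle styles can be sketched in a few lines: compare only the fields a test cares about, and express per-type invariants that can later run against production data. The field names below are illustrative:

```python
def assert_fields(actual, expected_subset):
    """Oracle that checks only the relevant fields; newly added
    fields in `actual` do not break existing tests."""
    for field, expected in expected_subset.items():
        assert actual[field] == expected, f"{field}: {actual[field]!r} != {expected!r}"

def click_invariants(record):
    """Invariants for one data type; reusable as a production
    data quality check over the same records."""
    assert record["ts"] >= 0
    assert record["type"] in {"click", "view"}

output = {"type": "click", "ts": 42, "user_id": "user-1", "new_field": "ignored"}
assert_fields(output, {"type": "click", "user_id": "user-1"})  # passes despite new_field
click_invariants(output)
```

In a Scala codebase the same effect is achieved with the lenses mentioned above, which focus on a path into a nested structure instead of comparing whole records.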
Data pipeline = yet another program
Don’t veer from best practices:
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coded locations, duplication, ...

In data engineering, slipping is part of the culture... :-(
● Mix in solid backend engineers
● Document the “golden path”
Testing with cloud services
● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
  ○ Distribute access tokens, etc.
  ○ Pay $ or $$$
Top anti-patterns
1. Test as an afterthought, or in production
   Data processing applications are well suited for testing!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components
Thank you. Questions?

Credits:
● Øyvind Løkling, Schibsted Media Group
● Content inspiration: Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images: Tracey Saxby, Integration and Application Network, University of Maryland Center for Environmental Science (ian.umces.edu/imagelibrary/)
Bonus slides
Quality testing variants
● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline
  ○ Beware of privacy
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
● Dedicated quality assessment pipelines
  ○ Reuse test oracle invariants in production
[Diagram: a quality assessment job writes metrics to a DB]
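"Odd code path => bump counter" can be sketched with a plain counter; in Spark the equivalent would be an accumulator, and the record format below is invented:

```python
from collections import Counter

counters = Counter()  # in Spark, this role is played by accumulators

def parse_event(raw):
    """Parse one raw record; count, rather than crash on, odd code paths."""
    try:
        user, ts = raw.split(",")
        return {"user": user, "ts": int(ts)}
    except ValueError:
        counters["unparseable_events"] += 1  # bump counter on the odd path
        return None

records = [parse_event(r) for r in ["alice,1", "garbage", "bob,2"]]
valid = [r for r in records if r is not None]
# → counters["unparseable_events"] == 1, 2 valid records
```

Exporting such counters per run gives a cheap quality time series: a sudden jump in `unparseable_events` flags an upstream schema change before consumers notice.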
Quality testing in the process
● Binary, self-contained
  ○ Validate in CI
● Relative vs. history
  ○ E.g. large drops
  ○ Precondition for publishing a dataset
● Push aggregates to DB
  ○ Standard ops: monitor, alert
[Diagram: a code change (∆) is gated on a quality comparison (∆?) against history in the DB]