Scalable Streaming Data Pipelines with Redis
Avram Lyon Scopely / @ajlyon / github.com/avram
LA Redis Meetup / April 18, 2016
Scopely
• Mobile games publisher and developer
• Diverse set of games
• Independent studios around the world
What kind of data?
• App opened
• Killed a walker
• Bought something
• Heartbeat
• Memory usage report
• App error
• Declined a review prompt
• Finished the tutorial
• Clicked on that button
• Lost a battle
• Found a treasure chest
• Received a push message
• Finished a turn
• Sent an invite
• Scored a Yahtzee
• Spent 100 silver coins
• Anything else any game designer or developer wants to learn about
How much? Recently:
Peak: 2.8 million events / minute
2.4 billion events / day
[Architecture diagram, primary data stream: Studio A, Studio B, and Studio C send events to Collection (the public API) over HTTP and SQS; Collection publishes to the primary Kinesis stream, with SQS failover, and uses Redis to cache app and system configurations. Downstream, the Enricher, the Data Warehouse Forwarder (to S3), and Ariel (realtime monitoring, to Elasticsearch) are chained by further Kinesis streams, each stage with its own idempotence step plus aggregation.]
Kinesis: a short aside
Kinesis
• Distributed, sharded streams. Akin to Kafka.
• Get an iterator over the stream, and checkpoint the current stream position occasionally.
• Workers coordinate shard leases and checkpoints in DynamoDB (via KCL)
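In production this loop lives inside the KCL; the following is only a minimal boto3 sketch of the same iterate-and-checkpoint idea, using the "checkpoint every 5 records" rule from the walkthrough below. `handle` and `save_checkpoint` are placeholders, not the talk's code.

```python
import boto3

kinesis = boto3.client("kinesis")

def consume_shard(stream_name, shard_id, last_checkpoint=None):
    """Read one shard from the last checkpoint, checkpointing every 5 records."""
    it = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        # Resume after the checkpointed sequence number, or start at the oldest record.
        ShardIteratorType="AFTER_SEQUENCE_NUMBER" if last_checkpoint else "TRIM_HORIZON",
        **({"StartingSequenceNumber": last_checkpoint} if last_checkpoint else {}),
    )["ShardIterator"]

    processed = 0
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in resp["Records"]:
            handle(record["Data"])            # process the event payload (placeholder)
            processed += 1
            if processed % 5 == 0:            # checkpoint every 5 records
                save_checkpoint(shard_id, record["SequenceNumber"])  # placeholder
        it = resp.get("NextShardIterator")
```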
Checkpointing
[Diagram walkthrough: Shard 0 holds records 1 through 22; the worker checkpoints every 5 records.]
• Worker A reads Shard 0 and checkpoints after record 10 (checkpoint for Shard 0: 10), then dies.
• Worker B takes over the shard lease and resumes from the checkpoint, so any records Worker A had already processed past record 10 get processed again.
Auxiliary Idempotence
• Idempotence keys at each stage
• Redis sets of idempotence keys by time window
• Gives resilience against various types of failures
Auxiliary Idempotence
• Gotcha: expiring a large set is O(N) in its cardinality
• Broke the window up into many small sets, partitioned by the first 2 bytes of the MD5 of the idempotence key
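A minimal sketch of that pattern with redis-py; the key naming, window size, and retention are illustrative assumptions, not the exact scheme from the talk.

```python
import hashlib
import time

import redis

r = redis.Redis()

WINDOW_SECONDS = 3600          # one idempotence window per hour (assumption)
RETENTION_WINDOWS = 3          # keep a few windows before keys expire (assumption)

def seen_before(idempotence_key: str) -> bool:
    """Return True if this key was already processed in the current window."""
    window = int(time.time()) // WINDOW_SECONDS
    # Partition by the first 2 bytes of the MD5 so each set stays small
    # and its eventual expiry stays cheap.
    bucket = hashlib.md5(idempotence_key.encode()).hexdigest()[:4]
    redis_key = f"idem:{window}:{bucket}"

    pipe = r.pipeline()
    pipe.sadd(redis_key, idempotence_key)           # 1 if new, 0 if already present
    pipe.expire(redis_key, WINDOW_SECONDS * RETENTION_WINDOWS)
    added, _ = pipe.execute()
    return added == 0
```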
[Primary data stream diagram shown again, moving on to the Enrichment stage.]
Enrichment
1. Deserialize
2. Reverse deduplication
3. Apply changes to application properties
4. Get current device and application properties
5. Generate Event ID
6. Emit.
[Diagram: Collection → Kinesis → Enrichment]
Idempotence Key: Device Token + API Key + Event Batch Sequence + Event Batch Session
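A sketch of how that key drives the reverse-deduplication step, reusing the seen_before helper sketched earlier; the field names, generate_event_id, and emit are illustrative placeholders.

```python
def enrich(batch: dict):
    """Sketch of steps 2, 5, and 6 of the enrichment stage."""
    for seq, event in enumerate(batch["events"]):
        # Reverse deduplication: skip events already seen for this batch session.
        idem_key = ":".join([
            batch["device_token"],
            batch["api_key"],
            str(seq),                        # event batch sequence (assumption)
            batch["batch_session"],          # event batch session (assumption)
        ])
        if seen_before(idem_key):
            continue
        event["event_id"] = generate_event_id()   # step 5 (placeholder)
        emit(event)                                # step 6: write to the next Kinesis stream (placeholder)
```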
Now we have a stream of well-described, denormalized event facts.
Preparing for Warehousing (SDW Forwarder)
[Diagram: enriched event data (e.g. dice app open, bees level complete, slots payment) is buffered into slices keyed by game and event name; each slice carries the superset of the properties in the batch plus the data, has a timer (0:10, 0:01, 0:05), and is emitted to SQS either by time or by size.]
Idempotence Key: Event ID
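A minimal sketch of that buffering logic; the thresholds and the flush callable are assumptions, and time-based flushes are only checked on add for brevity.

```python
import time
from collections import defaultdict

MAX_EVENTS_PER_SLICE = 500      # illustrative size threshold
MAX_SLICE_AGE_SECONDS = 60      # illustrative time threshold

class SliceBuffer:
    """Buffer enriched events into slices keyed by (game, event name),
    flushing a slice to SQS when it grows too large or too old."""

    def __init__(self, flush):
        self.flush = flush                      # callable that sends a slice to SQS
        self.slices = defaultdict(list)
        self.opened_at = {}

    def add(self, event):
        key = (event["game"], event["name"])
        self.slices[key].append(event)
        self.opened_at.setdefault(key, time.time())

        too_big = len(self.slices[key]) >= MAX_EVENTS_PER_SLICE
        too_old = time.time() - self.opened_at[key] >= MAX_SLICE_AGE_SECONDS
        if too_big or too_old:
            self.flush(key, self.slices.pop(key))
            del self.opened_at[key]
```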
K
But everything can die!
dice app open
bees level complete
slots payment
Shudder
ASG
SNS
SQS
K
But everything can die!
dice app open
bees level complete
slots payment
Shudder
ASG
SNS
SQS
HTTP“Prepare to Die!”
K
But everything can die!
dice app open
bees level complete
slots payment
Shudder
ASG
SNS
SQS
HTTP“Prepare to Die!”
emit!
emit!
emit!
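A sketch of reacting to that notification, reusing the SliceBuffer above and assuming the warning simply arrives as a message on an SQS queue; the queue URL is a placeholder, not Shudder's actual delivery mechanism.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/shutdown-notifications"  # placeholder

def watch_for_shutdown(buffer: "SliceBuffer"):
    """Poll the shutdown queue; on a termination notice, flush every open slice."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20, MaxNumberOfMessages=1)
        for msg in resp.get("Messages", []):
            # Any message on this queue means our instance is about to be terminated.
            for key in list(buffer.slices):
                buffer.flush(key, buffer.slices.pop(key))
                buffer.opened_at.pop(key, None)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            return
```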
Pipeline to HDFS
• Partitioned by event name and game, buffered in-memory and written to S3
• Picked up every hour by Spark job
• Converts to Parquet, loaded to HDFS
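A hedged sketch of that hourly job; the S3 layout, HDFS path, and the assumption that each record carries game and name fields are placeholders, not the actual Spark job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-warehouse-load").getOrCreate()

# Hypothetical layout: the forwarder's slices for one hour land under a single S3 prefix.
source = "s3://example-bucket/slices/2015/07/15/10/"            # placeholder path
events = spark.read.json(source)

# Convert the hour's slices to Parquet, partitioned by game and event name, on HDFS.
events.write.mode("append").partitionBy("game", "name").parquet("hdfs:///warehouse/events")
```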
A closer look at Ariel
Live Metrics (Ariel)
Enriched event data:
  name: game_end, time: 2015-07-15 10:00:00.000 UTC, _devices_per_turn: 1.0, event_id: 12345, device_token: AAAA, user_id: 100
  name: game_end, time: 2015-07-15 10:01:00.000 UTC, _devices_per_turn: 14.1, event_id: 12346, device_token: BBBB, user_id: 100
  name: game_end, time: 2015-07-15 10:01:00.000 UTC, _devices_per_turn: 14.1, event_id: 12347, device_token: BBBB, user_id: 100
Configured metrics:
  name: Cheating Games, predicate: _devices_per_turn > 1.5, target: event_id, type: DISTINCT, id: 1
  name: Cheating Players, predicate: _devices_per_turn > 1.5, target: user_id, type: DISTINCT, id: 2
Resulting Redis commands:
  PFADD /m/1/2015-07-15-10-00 12346
  PFADD /m/1/2015-07-15-10-00 12347
  PFADD /m/2/2015-07-15-10-00 BBBB
  PFADD /m/2/2015-07-15-10-00 BBBB
  PFCOUNT /m/1/2015-07-15-10-00 → 2
  PFCOUNT /m/2/2015-07-15-10-00 → 1
Configured metrics feed dashboards and alarms.
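A sketch of that aggregation step, using the /m/{metric id}/{time bucket} key scheme from the slide; the metric table, the lambda predicates, and the hour-level bucket derivation are simplified assumptions (the talk aggregates into 30-minute buckets).

```python
import redis

r = redis.Redis()

# Configured metrics, as on the slide: a predicate, a target field, and an id.
METRICS = [
    {"id": 1, "name": "Cheating Games",   "target": "event_id",
     "predicate": lambda e: e.get("_devices_per_turn", 0) > 1.5},
    {"id": 2, "name": "Cheating Players", "target": "user_id",
     "predicate": lambda e: e.get("_devices_per_turn", 0) > 1.5},
]

def record(event):
    """PFADD the event's target field into each matching metric's HLL for its time bucket."""
    # e.g. "2015-07-15 10:01:00.000 UTC" -> "2015-07-15-10-00" (hour bucket, simplified)
    bucket = event["time"][:13].replace(" ", "-") + "-00"
    for metric in METRICS:
        if metric["predicate"](event):
            r.pfadd(f"/m/{metric['id']}/{bucket}", event[metric["target"]])

def count(metric_id, bucket):
    """Read back the distinct count for one metric and time bucket."""
    return r.pfcount(f"/m/{metric_id}/{bucket}")
```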
HyperLogLog
• High-level algorithm (four-bullet version stolen from my colleague, Cristian)
• b bits of the hash are used as an index pointer (Redis uses b = 14, i.e. m = 16384 registers)
• The rest of the hash is inspected for the longest run of zeroes we can encounter (N)
• The register pointed to by the index is replaced with max(currentValue, N + 1)
• An estimator function is used to calculate the approximated cardinality
http://content.research.neustar.biz/blog/hll.html
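A toy sketch of the register-update rule above; this is not Redis's implementation, and the estimator and bias corrections are omitted.

```python
import hashlib

B = 14                      # index bits, as in Redis
M = 1 << B                  # 16384 registers
registers = [0] * M

def add(value: str):
    """Update one register: index = low b bits, rank = leading-zero run of the rest + 1."""
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")  # 64-bit hash
    index = h & (M - 1)                        # b bits used as the register index
    rest = h >> B                              # remaining 64 - b bits
    rank = (64 - B) - rest.bit_length() + 1    # leading zeroes in the rest, plus one
    registers[index] = max(registers[index], rank)
```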
[Previous Ariel diagram repeated.]
We can count different things
[Ariel architecture diagram: Kinesis → collector (idempotence) → aggregation workers doing PFADD into Redis; a web layer answers questions such as "Are installs anomalous?" with PFCOUNT.]
Pipeline Delay
• Pipelines back up
• Dashboards get outdated
• Alarms fire!
Alarm Clocks
• Push timestamp of current events to per-game pub/sub channel
• Take 99th percentile age as delay
• Use that time for alarm calculations
• Overlay delays on dashboards
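A sketch of that event clock, assuming a per-game pub/sub channel and a rolling in-memory sample for the 99th percentile; channel names, the sample size, and report_delay are placeholders.

```python
import time

import redis

r = redis.Redis()

def publish_event_time(game: str, event_time: float):
    """Workers push the timestamp of each event they process to a per-game channel."""
    r.publish(f"clock:{game}", event_time)

def watch_delay(game: str, sample_size: int = 1000):
    """Keep a rolling sample of event ages and report the 99th percentile as the delay."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"clock:{game}")
    ages = []
    for msg in pubsub.listen():
        if msg["type"] != "message":
            continue
        ages.append(time.time() - float(msg["data"]))
        ages = ages[-sample_size:]
        p99 = sorted(ages)[max(0, int(len(ages) * 0.99) - 1)]
        report_delay(game, p99)   # placeholder: alarms use this delay, dashboards overlay it
```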
[Same Ariel diagram, now with an event clock feeding the delay calculation alongside PFADD and PFCOUNT.]
Ariel 1.0
• ~30K metrics configured
• Aggregation into 30-minute buckets
• 12KB/30min/metric
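Rough arithmetic (mine, not from the slides): 12 KB per metric per 30-minute bucket across ~30K metrics is about 360 MB per bucket, or roughly 17 GB per day, so a 100 GB non-clustered Redis fills up in well under a week.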
Challenges
• Dataset size. RedisLabs non-cluster max = 100GB
• Packet/s limits: 250K in EC2-Classic
• Alarm granularity
Hybrid Datastore: Requirements
• Need to keep HLL sets to count distinct
• Redis is relatively finite
• HLL outside of Redis is messy
Hybrid Datastore: Plan
• Move older HLL sets to DynamoDB
• They’re just strings!
• Cache reports aggressively
• Fetch backing HLL data from DynamoDB as needed on web layer, merge using on-instance Redis
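A sketch of that web-layer merge, assuming the archived HLLs live in a DynamoDB table keyed by metric and bucket, with the HLL bytes stored in a binary attribute; the table name, key schema, and scratch-key naming are placeholders.

```python
import boto3
import redis

local = redis.Redis()                        # on-instance Redis used as a merge scratchpad
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ariel-hll-archive")  # placeholder table name

def distinct_count(metric_id: int, buckets: list[str]) -> int:
    """Merge archived HLL strings for the requested buckets and return the distinct count."""
    keys = []
    for bucket in buckets:
        item = table.get_item(Key={"metric": metric_id, "bucket": bucket}).get("Item")
        if not item:
            continue
        key = f"scratch:/m/{metric_id}/{bucket}"
        # HLLs are plain Redis strings, so the archived bytes can be written straight back.
        local.set(key, item["hll"].value, ex=300)   # "hll" assumed to be a binary attribute
        keys.append(key)
    if not keys:
        return 0
    merged = f"scratch:merged:{metric_id}"
    local.delete(merged)
    local.pfmerge(merged, *keys)
    return local.pfcount(merged)
```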
[Same Ariel diagram, now with the hybrid datastore: DynamoDB holds the old-data migration and report caches, and an on-instance Redis serves as the merge scratchpad.]
Much less memory…
Redis Roles
• Idempotence
• Configuration Caching
• Aggregation
• Clock
• Scratchpad for merges
• Cache of reports
Other Considerations
• Multitenancy. We run parallel stacks and give each game an assigned affinity, to insulate games from one another's pipeline delays
• Backfill. System is forward-looking only; can replay Kinesis backups to backfill, or backfill from warehouse
Thanks!
Questions?
scopely.com/jobs
@ajlyon / avram@scopely.com / github.com/avram