
Scalable Streaming Data Pipelines with Redis

Avram Lyon / Scopely / @ajlyon / github.com/avram

LA Redis Meetup / April 18, 2016

Scopely

• Mobile games publisher and developer

• Diverse set of games

• Independent studios around the world

What kind of data?

• App opened

• Killed a walker

• Bought something

• Heartbeat

• Memory usage report

• App error

• Declined a review prompt

• Finished the tutorial

• Clicked on that button

• Lost a battle

• Found a treasure chest

• Received a push message

• Finished a turn

• Sent an invite

• Scored a Yahtzee

• Spent 100 silver coins

• Anything else any game designer or developer wants to learn about

How much?

Recently:

• Peak: 2.8 million events / minute

• 2.4 billion events / day

[Diagram: Collection → Kinesis → Enrichment → Kinesis → Warehousing, Realtime Monitoring, and the Public API]

Primary Data Stream

[Diagram: Studio A, Studio B, and Studio C send events to HTTP and SQS collection endpoints; the collectors publish to Kinesis, with SQS as a failover path. Redis caches app configurations and system configurations for the collectors.]

[Diagram: the primary Kinesis stream (with SQS failover) feeds the Enricher, which emits enriched events to a second Kinesis stream. From there the Data Warehouse Forwarder writes to S3, and Ariel (Realtime) aggregates, with Elasticsearch as a sink. Each consumer stage keeps its own idempotence state.]

Kinesis: a short aside

Kinesis

• Distributed, sharded streams. Akin to Kafka.

• Get an iterator over the stream, and checkpoint the current stream position occasionally (see the sketch below)

• Workers coordinate shard leases and checkpoints in DynamoDB (via the KCL)
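A minimal boto3 sketch of the raw iterator mechanics; the KCL layers lease coordination and checkpointing on top, and the stream and shard names here are illustrative:

import boto3

kinesis = boto3.client("kinesis")

def handle(data: bytes):
    print(data)  # stand-in for real processing

shard_iterator = kinesis.get_shard_iterator(
    StreamName="primary-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in resp["Records"]:
        handle(record["Data"])
        # A real worker checkpoints record["SequenceNumber"] occasionally;
        # the KCL stores that checkpoint in DynamoDB.
    shard_iterator = resp.get("NextShardIterator")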

Checkpointing

A stream is made of shards (Shard 0, Shard 1, Shard 2). Consider Shard 0, holding records 1 through 22, and a worker that checkpoints every 5 records.

Worker A works through the shard, checkpointing at records 5 and 10, and then dies 🔥. The checkpoint for Shard 0 is still 10, so when Worker B takes over the shard lease it resumes from record 11, and everything Worker A processed past the checkpoint is delivered again.
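A dependency-free simulation of that failure (the crash point, record 14, is arbitrary):

records = list(range(1, 23))   # records 1..22 in Shard 0
CHECKPOINT_EVERY = 5
checkpoint = 0

def process(worker, record):
    print(f"{worker} processed {record}")

# Worker A processes records 1..14, checkpointing every 5th, then dies.
for r in records[:14]:
    process("Worker A", r)
    if r % CHECKPOINT_EVERY == 0:
        checkpoint = r          # last durable checkpoint: 10

# Worker B takes over the lease and resumes from the checkpoint,
# so records 11..14 are processed a second time. Hence idempotence.
for r in records[checkpoint:]:
    process("Worker B", r)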

Auxiliary Idempotence

• Idempotence keys at each stage

• Redis sets of idempotence keys by time window

• Gives resilience against re-delivery from checkpoint replays and other failures

Auxiliary Idempotence

• Gotcha: expiring a set deletes it, and deleting a big set is O(N)

• Broke the data into many small sets, partitioned by the first 2 bytes of the md5 of each idempotence key (see the sketch below)
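A hedged sketch of that scheme in redis-py; the key names, window format, and TTL are assumptions:

import hashlib
import redis

r = redis.Redis()

WINDOW = "2015-07-15-10-00"   # time window this set covers (assumed format)
TTL = 6 * 60 * 60             # retention for a window's sets (assumed)

def already_seen(idempotence_key: str) -> bool:
    # Partition into small sets by the first 2 bytes (4 hex chars) of the
    # md5 of the key, so expiry never has to delete one huge set.
    partition = hashlib.md5(idempotence_key.encode()).hexdigest()[:4]
    set_key = f"idem:{WINDOW}:{partition}"
    added = r.sadd(set_key, idempotence_key)
    r.expire(set_key, TTL)
    return added == 0         # 0 means the key was already in the set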

[The pipeline diagrams from above are shown again: Collection → Kinesis → Enrichment → Kinesis → Warehousing / Realtime Monitoring, each stage with its own idempotence state.]

Enrichment

1. Deserialize

2. Reverse deduplication

3. Apply changes to application properties

4. Get current device and application properties

5. Generate Event ID

6. Emit.

Idempotence Key: Device Token + API Key + Event Batch Sequence + Event Batch Session
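In code, that key is just a concatenation of those four components; the function name, ordering, and separator are illustrative:

def enrichment_idempotence_key(device_token: str, api_key: str,
                               batch_sequence: int, batch_session: str) -> str:
    # Components per the slide above; ":" separator is an assumption.
    return f"{device_token}:{api_key}:{batch_sequence}:{batch_session}"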

Now we have a stream of well-described, denormalized event facts.

Preparing for Warehousing (SDW Forwarder)

[Diagram: enriched event data arrives from Kinesis and is grouped into per-game, per-event-name slices ("dice app open", "bees level complete", "slots payment"), each with its own age timer (0:01, 0:05, 0:10). A slice is emitted to SQS when it grows big enough ("emitted by size") or old enough ("emitted by time").]

Each slice carries: Game, Event Name, the superset of properties in the batch, and the data itself.

Idempotence Key: Event ID
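A minimal sketch of that emit-by-size-or-time buffering; the thresholds and names are assumptions, and a real forwarder would also flush on a timer rather than only when an event arrives:

import time

def emit_slice(events):
    # Stand-in: a real forwarder would send one slice message to SQS.
    print(f"emitting slice of {len(events)} events")

class SliceBuffer:
    """Buffers events for one (game, event name) slice."""

    def __init__(self, max_events=500, max_age_seconds=10):
        self.events = []
        self.opened_at = None
        self.max_events = max_events
        self.max_age = max_age_seconds

    def add(self, event):
        if not self.events:
            self.opened_at = time.monotonic()
        self.events.append(event)
        if (len(self.events) >= self.max_events
                or time.monotonic() - self.opened_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.events:
            emit_slice(self.events)
            self.events = []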

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

HTTP“Prepare to Die!”

K

But everything can die!

dice app open

bees level complete

slots payment

Shudder

ASG

SNS

SQS

HTTP“Prepare to Die!”

emit!

emit!

emit!
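A hedged sketch of the instance-side reaction. In the real setup Shudder does the ASG → SNS → SQS plumbing and notifies the app over HTTP; this sketch polls SQS directly, and the queue URL and message shape are simplified assumptions:

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/instance-notices"

def wait_for_termination_then_flush(buffers):
    # Long-poll for a notice; on termination, flush every in-memory slice.
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
            notice = json.loads(msg["Body"])
            if "TERMINAT" in notice.get("LifecycleTransition", ""):
                for buf in buffers:
                    buf.flush()   # emit! emit! emit!
                return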

Pipeline to HDFS

• Partitioned by event name and game, buffered in memory, and written to S3

• Picked up every hour by a Spark job (see the sketch below)

• Converted to Parquet and loaded into HDFS
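The hourly job can be sketched in a few lines of PySpark; the paths and column names are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sdw-hourly-load").getOrCreate()

# Read the slices the forwarder buffered to S3 for one hour...
events = spark.read.json("s3://events-bucket/slices/2015/07/15/10/")

# ...and rewrite them as Parquet, partitioned by game and event name.
(events.write
    .partitionBy("game", "event_name")
    .mode("append")
    .parquet("hdfs:///warehouse/events"))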

A closer look at Ariel

Live Metrics (Ariel)

Enriched event data arrives from Kinesis:

name: game_end | time: 2015-07-15 10:00:00.000 UTC | _devices_per_turn: 1.0 | event_id: 12345 | device_token: AAAA | user_id: 100

name: game_end | time: 2015-07-15 10:01:00.000 UTC | _devices_per_turn: 14.1 | event_id: 12346 | device_token: BBBB | user_id: 100

name: game_end | time: 2015-07-15 10:01:00.000 UTC | _devices_per_turn: 14.1 | event_id: 12347 | device_token: BBBB | user_id: 100

Configured metrics:

name: Cheating Games | predicate: _devices_per_turn > 1.5 | target: event_id | type: DISTINCT | id: 1

name: Cheating Players | predicate: _devices_per_turn > 1.5 | target: user_id | type: DISTINCT | id: 2

Each matching event becomes a PFADD to the metric's HLL key for its time window, and reading a metric is a PFCOUNT:

PFADD /m/1/2015-07-15-10-00 12346
PFADD /m/1/2015-07-15-10-00 12347
PFADD /m/2/2015-07-15-10-00 BBBB
PFADD /m/2/2015-07-15-10-00 BBBB

PFCOUNT /m/1/2015-07-15-10-00 → 2
PFCOUNT /m/2/2015-07-15-10-00 → 1

[Diagram: the configured metrics feed dashboards and alarms.]

HyperLogLog

• High-level algorithm (four-bullet version borrowed from my colleague, Cristian)

• b bits of the hash are used as a register index (Redis uses b = 14, i.e. m = 16384 registers)

• The rest of the hash is inspected for the longest run of zeroes encountered (N)

• The register at that index is replaced with max(currentValue, N + 1)

• An estimator function calculates the approximate cardinality from the registers

http://content.research.neustar.biz/blog/hll.html
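A toy sketch of the update step those bullets describe; this is not Redis's actual implementation, the estimator is omitted, and the hash choice is arbitrary:

import hashlib

B = 14                      # register-index bits, as in Redis
M = 1 << B                  # 16384 registers
registers = [0] * M

def hll_add(value: str):
    # Take a 64-bit hash of the value.
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    index = h & (M - 1)     # low B bits pick the register
    rest = h >> B           # remaining 50 bits
    # n = N + 1, where N is the run of zeroes before the first 1-bit.
    n = 1
    while rest & 1 == 0 and n <= 64 - B:
        rest >>= 1
        n += 1
    registers[index] = max(registers[index], n)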


We can count different things: the same PFADD/PFCOUNT pattern works for any target field, e.g. distinct event_ids or distinct user_ids.

Ariel

[Diagram: Kinesis feeds a collector (with idempotence); workers PFADD into the aggregation store; the web layer answers questions like "Are installs anomalous?" with PFCOUNT.]

Pipeline Delay

• Pipelines back up

• Dashboards get outdated

• Alarms fire!

Alarm Clocks

• Push the timestamp of current events to a per-game pub/sub channel

• Take the 99th-percentile age as the delay (see the sketch after this list)

• Use that time for alarm calculations

• Overlay delays on dashboards
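A sketch of that clock in redis-py; the channel naming and the sliding sample are assumptions:

import time
import redis

r = redis.Redis()

def publish_event_clock(game: str, event_time_epoch: float):
    # Workers call this for each event they process.
    r.publish(f"clock:{game}", event_time_epoch)

def watch_delay(game: str, sample_size: int = 1000):
    # Subscriber keeps a sliding window of event ages; p99 is the delay.
    ages = []
    pubsub = r.pubsub()
    pubsub.subscribe(f"clock:{game}")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        ages.append(time.time() - float(message["data"]))
        ages = ages[-sample_size:]
        if len(ages) == sample_size:
            p99 = sorted(ages)[int(0.99 * (sample_size - 1))]
            print(f"{game} pipeline delay ≈ {p99:.1f}s")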

Ariel, now with clocks

[Diagram: as before, plus an Event Clock fed by the workers; alarm calculations consult the clock.]

Ariel 1.0

• ~30K metrics configured

• Aggregation into 30-minute buckets

• 12KB/30min/metric

Challenges

• Dataset size: RedisLabs non-cluster max = 100GB

• Packet/s limits: 250K in EC2-Classic

• Alarm granularity
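Back-of-the-envelope on the dataset size, assuming each metric's bucket is a full dense 12KB HLL: 30,000 metrics × 12KB per 30-minute bucket ≈ 360MB per 30 minutes, roughly 17GB per day, so a 100GB non-clustered instance fills up in under a week.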

Hybrid Datastore: Requirements

• Need to keep HLL sets to count distinct

• Redis is relatively finite

• HLL outside of Redis is messy

Hybrid Datastore: Plan

• Move older HLL sets to DynamoDB

• They’re just strings!

• Cache reports aggressively

• Fetch backing HLL data from DynamoDB as needed on the web layer, merging with on-instance Redis (see the sketch below)
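A hedged sketch of that read path; the table name, key schema, and attribute names are assumptions:

import boto3
import redis

scratch = redis.Redis()   # on-instance Redis, used only as a merge scratchpad
table = boto3.resource("dynamodb").Table("ariel-hll-archive")

def count_distinct(metric_id, windows):
    # Merge the archived HLL strings for each time window, then count.
    dest = f"scratch:{metric_id}"
    scratch.delete(dest)
    for window in windows:
        item = table.get_item(Key={"key": f"/m/{metric_id}/{window}"})["Item"]
        # The archived value is the raw Redis HLL string ("they're just strings!").
        part = f"{dest}:{window}"
        scratch.set(part, item["hll"].value)
        scratch.pfmerge(dest, part)
        scratch.delete(part)
    return scratch.pfcount(dest)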

Ariel, now with hybrid datastore

[Diagram: as before, plus DynamoDB holding the migrated old data and report caches, with on-instance Redis as a merge scratchpad.]

Much less memory…

Redis Roles

• Idempotence

• Configuration Caching

• Aggregation

• Clock

• Scratchpad for merges

• Cache of reports

Other Considerations

• Multitenancy. We run parallel stacks and give each game an assigned affinity, to insulate games from one another's pipeline delays

• Backfill. The system is forward-looking only; we can replay Kinesis backups to backfill, or backfill from the warehouse

Thanks!

Questions?

scopely.com/jobs

@ajlyon / avram@scopely.com / github.com/avram
