[whug] Big Brother is watching - or how we collect data about Allegro users

Big Brother is watching

Or how we collect data about Allegro users

About us

Marcin Kuthan

mkuthan.github.io

Maciej Arciuch

too lazy to have a GitHub account

We will talk about

● what clickstream is (what?)

● business and technical needs (why?)

● the overall system architecture and its technical aspects (how?)

Clickstream at Allegro

● What is clickstream?

● Collected from the frontend, web and mobile

● Over 400 million events per day

● The basis for many business decisions

● Several Big Data teams

How it should be

● Data available right away - low latency

● Well described, easily accessible to others

● Efficient data format

● Stable

● Scalable

And how it sometimes turns out...

Let's try again

● Need no. 1: faster!

● Queue + stream processing: data available after 2 s

● Stable and scalable

● New use cases:

○ the department detecting “scams”

○ recommendations, search engine


Need no. 2: space. Solution: the Avro format

● mature solution

● takes up (fairly) little space

● schemas: structure + documentation

● for batch and stream processing

● compatibility


● Uncompressed data: Avro takes 45% of the size of JSON

● Avro is a binary format: some compression algorithms are a poor fit (see Snappy: a compression ratio 9 times lower for Avro than for JSON)

● The realistic choice: GZip vs LZ4

○ we chose LZ4 - lower CPU usage at the cost of 20% worse compression

● Can be changed “on the fly” (Kafka)
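
In Kafka the codec is a producer-side setting and consumers decompress messages transparently, which is what makes an "on the fly" switch feasible. A minimal sketch, assuming the standard producer property `compression.type`; the object name, the helper and the broker address are hypothetical and not taken from the talk:

```scala
import java.util.Properties

object ProducerCompressionSketch {
  // Builds Kafka producer settings. The caller passes a placeholder broker address.
  def producerProps(bootstrapServers: String, compression: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    // Codec chosen per producer, e.g. "lz4" or "gzip"; readers need no change.
    props.put("compression.type", compression)
    props
  }
}
```

Switching from LZ4 to GZip would then be a one-value change in the producer configuration, for example `ProducerCompressionSketch.producerProps("localhost:9092", "gzip")`.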


Problem no. 3: mess. Solution: a central schema repository.

● single source of truth

● every component reads the latest schema from the repo

● compatibility check on commit

● we know what to compare against

● propagation to the metastore, files, etc.
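
The compatibility check on commit can be pictured with a heavily simplified model. The sketch below only encodes one Avro resolution rule (a field present in the new schema but missing in the old one needs a default) and ignores type changes; `Field`, `Schema` and `canReadOldData` are hypothetical names, not the schema-repo API:

```scala
object SchemaCompatibilitySketch {
  // Simplified, hypothetical model of a record schema.
  final case class Field(name: String, hasDefault: Boolean)
  final case class Schema(fields: List[Field])

  // The new (reader) schema can read data written with the old schema
  // if every field it adds carries a default value.
  def canReadOldData(oldSchema: Schema, newSchema: Schema): Boolean = {
    val oldNames = oldSchema.fields.map(_.name).toSet
    newSchema.fields.forall(f => oldNames.contains(f.name) || f.hasDefault)
  }
}
```

A commit hook could run such a check against the latest schema in the repository and reject the change when it returns false.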

Schema repository

● “schema review” - working on a schema through pull requests

● merge, deployment to DEV

● promotion to TEST and PROD

● it does not matter which we deploy first: code or schema

Schema repository

● Two competing implementations

○ https://github.com/schema-repo/schema-repo

○ https://github.com/confluentinc/schema-registry

● We use schema-repo

○ it came first and grew out of AVRO-1124 (https://issues.apache.org/jira/browse/AVRO-1124)

○ it keeps schemas in ZK, not in Kafka

Clickstream Ingestion System

[Diagram; labels: Source, Pageviews, MobileEvents, Errors, Buffer Kafka, ..., Clients Kafka]

Clickstream Ingestion System (cont)

[Diagram; labels: Pageviews, MobileEvents, Errors, ..., Clients Kafka, camus, camus2hive.sh]

Why Spark Streaming?

Why Apache Kafka?

Requirements

● Scalability

● At least once delivery

● Fault tolerance

● Back pressure

At least once - end2end

[Sequence diagram; participants: “Streaming Engine”, “Offset Store”, Kafka In, Kafka Out]

get_offsets(topic)

fetch_data(topic, partitions, offset_begins, offset_ends)

process_data

publish_results(topic, results)

commit_offsets(topic, partitions, offsets)
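
The call sequence on this slide can be sketched with in-memory stand-ins. The point is the ordering: results are published before offsets are committed, so a crash between the two steps causes a replay (duplicates) rather than data loss. All names below are hypothetical; no real Kafka or Spark API is used:

```scala
import scala.collection.mutable

object AtLeastOnceSketch {
  // Hypothetical in-memory offset store (stands in for the "Offset Store" box).
  final class OffsetStore {
    private var committed = 0
    def getOffset: Int = committed
    def commit(offset: Int): Unit = committed = offset
  }

  // Runs one batch: fetch, process, publish, and only then commit.
  // crashBeforeCommit simulates a failure between publishing and committing.
  def runBatch(input: Vector[String], out: mutable.Buffer[String], offsets: OffsetStore,
               batchSize: Int, crashBeforeCommit: Boolean): Unit = {
    val begin = offsets.getOffset                    // get_offsets
    val end = math.min(begin + batchSize, input.size)
    val data = input.slice(begin, end)               // fetch_data
    val results = data.map(_.toUpperCase)            // process_data
    out ++= results                                  // publish_results
    if (!crashBeforeCommit) offsets.commit(end)      // commit_offsets
  }
}
```

Running the same batch twice, first with `crashBeforeCommit = true`, publishes every record twice: nothing is lost, but downstream consumers have to tolerate duplicates.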

Exactly once

[Sequence diagram; participants: “Streaming Engine”, “Checkpoint Store”, Kafka In, Kafka Out]

load_checkpoint()

fetch_data(topic, partitions, offset_begins, offset_ends)

process_data

store_checkpoint(metadata, results)

publish_results(results)

[Diagram annotations: transactional, non-transactional, exactly once]
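
One way to read the checkpoint idea, as a sketch rather than the speakers' implementation: if the results and the offset they correspond to are stored in a single atomic step, a replayed batch can be detected and skipped. The names are hypothetical, and a plain assignment stands in for a real transaction:

```scala
object ExactlyOnceSketch {
  // Hypothetical checkpoint: the next offset to read plus the accumulated result.
  final case class Checkpoint(nextOffset: Int, total: Long)

  // Hypothetical in-memory store; one assignment stands in for a transaction.
  final class CheckpointStore {
    private var current = Checkpoint(0, 0L)
    def load(): Checkpoint = current
    def store(c: Checkpoint): Unit = current = c
  }

  // Applies records [from, until) only if the batch starts where the checkpoint ends.
  // Returns false when the batch was already applied (or is out of order).
  def processBatch(store: CheckpointStore, log: Vector[Int], from: Int, until: Int): Boolean = {
    val cp = store.load()                                    // load_checkpoint
    if (from != cp.nextOffset) false
    else {
      val sum = log.slice(from, until).map(_.toLong).sum     // fetch_data + process_data
      store.store(Checkpoint(until, cp.total + sum))         // store_checkpoint(metadata, results)
      true
    }
  }
}
```

This covers only the state held in the checkpoint store. Publishing to an output topic afterwards is a separate, non-atomic step and can still repeat after a failure.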

Exactly once - end2end

Apache Kafka

⇓

Streaming Engine

⇓

Apache Kafka

Fault tolerance

● yarn-cluster

● spark.yarn.maxAppAttempts

● spark.yarn.max.executor.failures

● spark.task.maxFailures

● spark.streaming.kafka.maxRetries
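
For illustration, the settings above could be passed to `spark-submit` as `--conf key=value` pairs. The keys come from the slide; every value below is a placeholder assumption, not a recommendation from the talk, and the object and helper names are hypothetical:

```scala
object SparkFaultToleranceSketch {
  // Keys are from the slide; all values are illustrative assumptions.
  val settings: Map[String, String] = Map(
    "spark.yarn.maxAppAttempts" -> "4",
    "spark.yarn.max.executor.failures" -> "16",
    "spark.task.maxFailures" -> "8",
    "spark.streaming.kafka.maxRetries" -> "3"
  )

  // Renders the settings as spark-submit arguments, sorted by key for stable output.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```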

Fault tolerance

● min.insync.replicas = 2

● acks = all

● topic replication factor >= 3

● Kafka clusters >= 4 nodes

● retries > 0

● retry.backoff.ms > 0
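
The interplay of the first three settings comes down to simple arithmetic: with `acks = all`, a write is accepted only while at least `min.insync.replicas` replicas are in sync, so the difference between the replication factor and that minimum is the number of replicas a partition can lose while still accepting writes. A small sketch with hypothetical names:

```scala
object KafkaDurabilitySketch {
  // Number of replicas of a partition that may be unavailable
  // while producers using acks=all can still write to it.
  def replicasThatCanFail(replicationFactor: Int, minInsyncReplicas: Int): Int = {
    require(minInsyncReplicas >= 1 && minInsyncReplicas <= replicationFactor,
      "min.insync.replicas must be between 1 and the replication factor")
    replicationFactor - minInsyncReplicas
  }
}
```

With a replication factor of 3 and `min.insync.replicas = 2`, the function yields 1: one replica can be down and writes continue, while every acknowledged write still exists on at least two brokers.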

Fault tolerance

Back pressure

[Diagram series: Source and Sink]

● input rate == processing rate

● input rate > processing rate

● steady state again

● initial state

spark.streaming.kafka.maxRatePerPartition*

* effectively requires a single topic
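
`spark.streaming.kafka.maxRatePerPartition` is expressed in records per second per Kafka partition, so it caps the size of a batch. A worked sketch of that arithmetic; the function name and the numbers in the usage note are assumptions for illustration:

```scala
object BatchSizeCapSketch {
  // Upper bound on records in one batch:
  // rate (records/s per partition) x number of partitions x batch interval (s).
  def maxRecordsPerBatch(maxRatePerPartition: Long, partitions: Int, batchSeconds: Long): Long =
    maxRatePerPartition * partitions * batchSeconds
}
```

For example, a rate of 1000, 10 partitions and a 5 second batch interval cap a batch at 50000 records. Since one rate applies to every partition the job reads, topics with very different volumes are hard to tune together.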

Back pressure

● Pull from source

● Process

● Push to sink (sync)

○ buffer.memory=not too much

○ block.on.buffer.full=true
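
The effect of a small buffer on the sink side can be pictured with a bounded queue from the JDK. This is an analogy, not the Kafka producer itself; `offerAll` is a hypothetical helper that uses the non-blocking `offer` so the example terminates:

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedBufferSketch {
  // Tries to hand n records to a bounded buffer without blocking
  // and returns how many were accepted before the buffer filled up.
  def offerAll(buffer: ArrayBlockingQueue[Int], n: Int): Int =
    (1 to n).count(i => buffer.offer(i))
}
```

With a capacity of 3 and 5 records, only 3 are accepted. A blocking `put` would make the caller wait instead, and that waiting is the back pressure: processing cannot run ahead of what the sink drains.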

Spark Streaming & Apache Kafka integration

● Receiver based approach / high level Kafka consumer

● Direct streams approach / low level Kafka consumer

Receiver based approach

[Diagram; labels: Source, Spark Executor (Receiver), HDFS (WAL), Spark Driver (StreamingContext), Offset Store, Sink]

● step 1: pull data

● step 2: store Write Ahead Log

● steps 3-4: send blocks’ ids and commit offset

● steps 5-6: process data and publish results

Driver checkpointing

[Diagram; labels: Spark Driver (StreamingContext), Blocks’ ids, HDFS (checkpoint), Failed Spark Driver, Restarted Spark Driver]

StreamingContext.getOrCreate(checkpointDir, functionToCreateContext)

Problems

1. Receiver occupies 1 core / executor

2. Data duplication

3. Additional latency

4. HDFS load

5. Complex back pressure

6. Controversial checkpointing

Other gotchas

● High Level Consumer rebalancing

● Spark partition != Kafka partition

// one receiver-based stream per unit of read parallelism, merged into a single DStream
val kafkaDStreams = (1 to readParallelism).map { _ =>
  KafkaUtils.createStream(...)
}
val unionDStream = ssc.union(kafkaDStreams)

Receivers - summary

Direct stream approach

[Diagram; labels: Source, Spark Executor, Spark Driver (StreamingContext), Offset Store, Sink]

● steps 1-2: fetch offset & distribute work

● steps 3-4: fetch, process & publish data

● steps 5, 5’ and 6: wait for completion & commit offset

Direct stream - good parts

● Low level Kafka consumer

● Straightforward fault tolerance for at least once

● Built-in natural back pressure

● No WAL

● Kafka partition == Spark partition

Direct stream - bad parts

● Built-in at least once: auto.offset.reset=smallest

● Offset Store based at least once: DIY

Direct stream - bad parts

● Lack of a Kafka connection pool

● Less mature/mainstream than the receiver based approach

Direct stream - summary

Key takeaways

● Avro schemas and a central schema repo - a way to reduce confusion

● Spark Streaming & Apache Kafka - an almost perfect couple

● Use Direct Streams

Thank you!

http://github.com/allegro

http://allegro.tech
