apache samza past, present and future

Apache Samza: Past, Present and Future. Kartik Paramasivam, Director of Engineering, Streams Infra @ LinkedIn

Upload: ed-yakabosky

Post on 13-Jan-2017


TRANSCRIPT

Page 1: Apache samza  past, present and future

Apache Samza: Past, Present and Future

Kartik Paramasivam, Director of Engineering, Streams Infra @ LinkedIn

Page 2: Apache samza  past, present and future

Agenda

1. Stream Processing

2. State of the Union

3. Apache Samza : Key Differentiators

4. Apache Samza Futures

Page 3: Apache samza  past, present and future

Stream Processing: Processing events as soon as they happen..

● Stateless Processing

■ Transformation etc.

■ Lookup adjunct data (lookup databases/call services )

■ Producing results for every event

● Stateful Processing

■ Triggering/Producing results periodically (time-windows)

● Maintain intermediate state

■ E.g. Joining across multiple streams of events.

● Common Issues

■ Scale !! Scale !! Scale !!

■ Reliability !!

■ Everything else (upgrades, debugging, diagnostics, security, ……)

Page 4: Apache samza  past, present and future

Stream Processing: State of the Union

[Chart: stream processing systems laid out roughly in order of appearance]

Millwheel

Storm

Heron

Spark Streaming

S4

Dempsy

Samza

Flink

Beam

Dataflow

Azure Stream Analytics

AWS Kinesis Analytics

GearPump

Kafka Streams

Orleans

Not meant to be an accurate timeline..

Yes It is CROWDED !!

Page 5: Apache samza  past, present and future

Apache Samza

● Top level Apache project since Dec 2014

● 5 big Releases (0.7, 0.8, 0.9, 0.10, 0.11)

● 62 Contributors

● 14 Committers

● Companies using : LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely …. https://cwiki.apache.org/confluence/display/SAMZA/Powered+By

● Applications at LinkedIn : from ~20 to ~200 in 2 years.

Page 6: Apache samza  past, present and future

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources

● Stream processing as a service AND as an embedded library

Page 7: Apache samza  past, present and future

Performance : Accessing Adjunct Data

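[Diagram: two ways for a Samza processor to access adjunct data]

Remote data access: Input stream → Samza Processor, which does a remote read against the Database for each lookup

Local data access: Input stream → Samza Processor, which does a local read against RocksDB; changes to the Database are captured (Databus, Brooklin) and applied to RocksDB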
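To make the local-access side of the slide concrete, here is a minimal sketch, assuming a plain in-memory map as a stand-in for the embedded RocksDB store and a hypothetical `Change` event as a stand-in for what Databus or Brooklin would deliver. None of these names come from Samza itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: keep a local copy of adjunct data current from a change-capture stream. */
final class LocalAdjunctStore {
    /** Hypothetical change event; a null value marks a delete. */
    static final class Change {
        final String key;
        final String value;
        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Stand-in for the embedded RocksDB store shown on the slide.
    private final Map<String, String> local = new HashMap<>();

    /** Apply one change captured from the source database. */
    void apply(Change change) {
        if (change.value == null) {
            local.remove(change.key);
        } else {
            local.put(change.key, change.value);
        }
    }

    /** Local read used while processing an input event; no network call involved. */
    Optional<String> lookup(String key) {
        return Optional.ofNullable(local.get(key));
    }

    public static void main(String[] args) {
        LocalAdjunctStore store = new LocalAdjunctStore();
        store.apply(new Change("member-1", "region=EU"));
        store.apply(new Change("member-1", "region=US")); // later update wins
        store.apply(new Change("member-2", "region=APAC"));
        store.apply(new Change("member-2", null));        // delete
        System.out.println(store.lookup("member-1").orElse("<missing>"));
        System.out.println(store.lookup("member-2").orElse("<missing>"));
    }
}
```

The point of the design is that the per-event lookup never leaves the process; the cost moves to applying the change stream in the background.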

Page 8: Apache samza  past, present and future

Performance : Maintaining Temporary State

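[Diagram: two ways for a Samza processor to maintain temporary state]

Remote data access: Input stream → Samza Processor, which reads and writes a Remote Database

Local data access: Input stream → Samza Processor, which reads and writes locally (RocksDB or In Memory Store) and backs up changes to a Kafka Change Log (log compacted)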
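The local-state pattern on this slide can be sketched as follows, assuming an in-memory map in place of RocksDB and a plain list in place of the log-compacted Kafka changelog. The class and method names are hypothetical and only illustrate the write-through and replay idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: local state whose writes are mirrored to a changelog for recovery. */
final class ChangelogBackedStore {
    /** One changelog record; stands in for a message on a log-compacted topic. */
    static final class Entry {
        final String key;
        final Long value;
        Entry(String key, Long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>(); // stand-in for RocksDB
    private final List<Entry> changelog;                     // stand-in for the Kafka changelog

    ChangelogBackedStore(List<Entry> changelog) {
        this.changelog = changelog;
    }

    /** Write locally and back the change up to the changelog. */
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Entry(key, value));
    }

    Long get(String key) {
        return local.get(key);
    }

    /** Rebuild state on a fresh host by replaying the changelog; the last write per key wins. */
    static ChangelogBackedStore restore(List<Entry> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        for (Entry e : changelog) {
            store.local.put(e.key, e.value);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Entry> log = new ArrayList<>();
        ChangelogBackedStore original = new ChangelogBackedStore(log);
        original.put("page-views:US", 1L);
        original.put("page-views:US", 2L);
        original.put("page-views:EU", 7L);

        ChangelogBackedStore recovered = ChangelogBackedStore.restore(log);
        System.out.println(recovered.get("page-views:US"));
        System.out.println(recovered.get("page-views:EU"));
    }
}
```

Because only the last value per key matters on replay, log compaction can discard older entries without affecting what `restore` rebuilds.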

Page 9: Apache samza  past, present and future

Performance : Let us talk numbers !

● 100x difference between using local state and a remote NoSQL store

● Local state details:

○ 1.1 million TPS on a single processing machine (SSD)

○ Used a 3-node Kafka cluster for storing the durable changelog

● Remote state details:

○ 8500 TPS when the Samza job was changed to access a remote NoSQL store

○ The NoSQL store was also on a 3-node (SSD) cluster
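As a quick check of the headline figure, dividing the two throughput numbers quoted above gives a ratio a little over 100:

```latex
\frac{\text{local-state throughput}}{\text{remote-state throughput}}
  = \frac{1.1 \times 10^{6}\ \text{TPS}}{8500\ \text{TPS}}
  \approx 129
```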

Page 10: Apache samza  past, present and future

Remote State : Asynchronous Event Processing

Event Loop(Single thread)

ProcessAsync

Remote DB /Services

Asynchronous I/O calls, using Java NIO, Netty...

Responses sent to main thread via callback

Event loop is woken up to process next message

task.max.concurrency > 1 to enable pipelining

Available with Samza 0.11
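The pipelining idea can be sketched in plain Java, assuming `CompletableFuture` as a stand-in for a non-blocking client and a semaphore playing the role of the concurrency limit. This is not Samza's implementation; `processAll` and `fakeRemote` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Sketch: a single dispatch loop issuing async calls with a bounded number in flight. */
final class AsyncPipeliningSketch {
    static List<String> processAll(List<String> messages, int maxConcurrency,
                                   Function<String, CompletableFuture<String>> remoteCall) {
        Semaphore inFlight = new Semaphore(maxConcurrency);
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String message : messages) {
            // The loop waits only when maxConcurrency calls are already outstanding.
            inFlight.acquireUninterruptibly();
            CompletableFuture<String> call = remoteCall.apply(message)
                    // The completion callback frees a slot so the loop can dispatch the next message.
                    .whenComplete((result, error) -> inFlight.release());
            pending.add(call);
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> call : pending) {
            results.add(call.join()); // collect results in input order
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for a remote DB or service call.
        Function<String, CompletableFuture<String>> fakeRemote =
                key -> CompletableFuture.supplyAsync(() -> key.toUpperCase());
        System.out.println(processAll(Arrays.asList("a", "b", "c"), 2, fakeRemote)); // prints [A, B, C]
    }
}
```

With a limit of 1 the loop degenerates to one call at a time; raising the limit lets several remote calls overlap, which is the effect the slide attributes to setting task.max.concurrency above 1.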

Page 11: Apache samza  past, present and future

Remote State: Synchronous Processing on Multiple Threads

Event Loop(Single thread)

Schedule Process()

Remote DB / Services

Built-in thread pool

Blocking I/O calls

Event loop is woken up by the worker thread

job.container.thread.pool.size = N

Available with Samza 0.11
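The thread-pool variant can be sketched the same way, assuming a fixed-size `ExecutorService` in place of the built-in pool that job.container.thread.pool.size configures. The names below are hypothetical, and the blocking call is simulated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: a dispatch loop handing blocking calls to a fixed pool of worker threads. */
final class ThreadPoolProcessingSketch {
    static List<String> processAll(List<String> messages, int poolSize,
                                   Function<String, String> blockingCall) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String message : messages) {
                // Each process() call runs on a worker thread, so blocking I/O does not stall the loop.
                Callable<String> task = () -> blockingCall.apply(message);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a blocking remote DB or service call.
        Function<String, String> fakeBlockingCall = key -> "value-for-" + key;
        System.out.println(processAll(Arrays.asList("a", "b", "c"), 2, fakeBlockingCall));
    }
}
```

Compared with the asynchronous sketch, user code stays synchronous and the concurrency comes from the pool size N.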

Page 12: Apache samza  past, present and future

Incremental Checkpointing : MVP for stateful apps

Some applications have ~ 2 TB state in production

Stateful apps don’t really work without incremental checkpointing

[Diagram: a Samza Task on Host 1 with local State, an Input stream (e.g. Kafka), a changelog and a checkpoint]

Page 13: Apache samza  past, present and future

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources

● Stream processing as a service AND as an embedded library

Page 14: Apache samza  past, present and future

Speed Thrills .. but can kill

● Local State Considerations:

○ State should NOT be reseeded under normal operations (e.g. upgrades, application restarts)

○ Minimal state should be reseeded

- If a container dies or is removed

- If a container is added

Page 15: Apache samza  past, present and future

How does Samza keep local state ‘stable’?

[Diagram: a Samza Job on YARN]

Input Stream partitions P0, P1, P2, P3 are consumed by Task-0, Task-1, Task-2, Task-3

Change-log partitions P0, P1, P2, P3

Hosts: Host-E, Host-B, Host-C

Coordinator Stream: Task-Container-Host Mapping

Container-0 -> Host-E

Container-1 -> Host-B

Container-2 -> Host-C

AM / JC and Yarn-RM: Ask: Host-E, Allocate: Host-E

Enable Continuous Scheduling
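The placement idea behind the container-to-host mapping can be sketched as below. This is a simplified illustration, not how the Samza AM or YARN actually negotiate resources: `place` is a hypothetical helper that prefers each container's last known host and falls back to any free host only when the preferred one is unavailable.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: prefer each container's previous host so its local state can be reused. */
final class HostAffinitySketch {
    static Map<String, String> place(List<String> containers,
                                     Map<String, String> lastKnownHost,
                                     Collection<String> availableHosts) {
        Set<String> free = new LinkedHashSet<>(availableHosts);
        Map<String, String> placement = new LinkedHashMap<>();
        // First pass: grant the preferred (last known) host wherever it is still available.
        for (String container : containers) {
            String preferred = lastKnownHost.get(container);
            if (preferred != null && free.remove(preferred)) {
                placement.put(container, preferred);
            }
        }
        // Second pass: remaining containers take any free host and must reseed their state.
        Iterator<String> spare = free.iterator();
        for (String container : containers) {
            if (!placement.containsKey(container) && spare.hasNext()) {
                placement.put(container, spare.next());
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, String> lastKnownHost = new HashMap<>();
        lastKnownHost.put("Container-0", "Host-E");
        lastKnownHost.put("Container-1", "Host-B");
        lastKnownHost.put("Container-2", "Host-C");
        // Host-E is gone in this made-up scenario; Host-D is a hypothetical spare.
        Map<String, String> placement = place(
                Arrays.asList("Container-0", "Container-1", "Container-2"),
                lastKnownHost,
                Arrays.asList("Host-B", "Host-C", "Host-D"));
        System.out.println(placement);
    }
}
```

In the made-up scenario only Container-0 lands on a new host, so only its state would have to be rebuilt from the change-log.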

Page 16: Apache samza  past, present and future

Backpressure in a Pipeline

[Diagram: Stream A → Job 1 → Stream B → Job 2 → Stream C]

● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline.

● Allows each stage to be independent of the next stage

Page 17: Apache samza  past, present and future

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources

● Stream processing as a service AND as an embedded library

Page 18: Apache samza  past, present and future

Pluggable system consumers

Samza Processor

Mongo DB

DynamoDB Streams

Kafka

Databus/Brooklin

Kinesis

ZeroMQ

… Azure EventHub, Azure Document DB, Google Pub-Sub etc.

Oracle, Espresso

Dynamo-DB

Page 19: Apache samza  past, present and future

Batch processing in Samza!! (NEW)

● HDFS system consumer for Samza

● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes

● Scenarios :

○ Experimentation and Testing

○ Re-processing of large datasets

○ Some datasets are readily available on HDFS (company specific)

Page 20: Apache samza  past, present and future

Samza - HDFS support

[Diagram: three flows through a (Samza) Processor]

HDFS input → (Samza) Processor → HDFS output

Kafka → (Samza) Processor → HDFS output

HDFS input → (Samza) Processor → Kafka

Labels on the diagram: New, Available since Samza 0.10

The batch job auto-terminates when the input is fully processed.
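The "same processor, different source" idea and the auto-termination behaviour can be sketched as follows. The `Source` interface here is hypothetical and much simpler than Samza's actual system consumer interface; it only illustrates that the processing function never learns where records come from, and that a bounded input lets the run loop finish.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Sketch: one processing function run over a pluggable, possibly bounded, source. */
final class PluggableSourceSketch {
    /** Hypothetical source abstraction: poll() returns null once a bounded input is exhausted. */
    interface Source {
        String poll();
    }

    /** Bounded source, standing in for a set of files on HDFS. */
    static final class BoundedSource implements Source {
        private final Iterator<String> records;

        BoundedSource(List<String> records) {
            this.records = records.iterator();
        }

        @Override
        public String poll() {
            return records.hasNext() ? records.next() : null;
        }
    }

    /** Runs the same processing function over any source and stops at end of input. */
    static List<String> run(Source source, Function<String, String> process) {
        List<String> output = new ArrayList<>();
        for (String record = source.poll(); record != null; record = source.poll()) {
            output.add(process.apply(record));
        }
        return output; // reached only when the input is bounded
    }

    public static void main(String[] args) {
        Function<String, String> processor = line -> line.trim().toLowerCase();
        Source batchInput = new BoundedSource(Arrays.asList(" PageView ", " Click "));
        System.out.println(run(batchInput, processor));
    }
}
```

A streaming source would implement the same interface and simply never signal end of input, so the identical `processor` function would keep running.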

Page 21: Apache samza  past, present and future

Reprocessing entire Dataset online

[Diagram labels: Updates, Database (espresso), Brooklin, Brooklin, Bootstrap, set offset=0, (Samza) Processor, (Samza) Processor, Kafka, output, Nearline Processing]

Page 22: Apache samza  past, present and future

Reprocessing ‘large’ Datasets “offline”

[Diagram labels: Updates, Database (espresso), Databus, Backup, Database Backup (HDFS), (Samza) Processor, (Samza) Processor, Kafka, output, Nearline Processing, Offline Processing]

Page 23: Apache samza  past, present and future

Samza batch pipelines

[Diagram labels: Kafka, HDFS, (Samza) Processor ×4, HDFS output ×2, HDFS input ×2]

Page 24: Apache samza  past, present and future

Samza-HDFS Early Performance Results !!

Benchmark : Count number of records grouped by <Field>

Data size : 250 GB

Number of files : 487

[Chart: Time (seconds) vs. Number of Containers, for Samza, Map/Reduce and Spark]

Page 25: Apache samza  past, present and future

Key Differentiators for Apache Samza

● Performance !!

● Stability

● Support for a variety of input sources (batch and streaming)

● Stream processing as a service AND (coming soon) as an embedded library

Page 26: Apache samza  past, present and future

Stream Processing as a Service

● Based on YARN

○ Yarn-RM high availability

○ Work preserving RM

○ Support for heterogeneous hardware with Node Labels (NEW)

● Easy upgrade of Samza framework : Use the Samza version deployed on the machine instead of packaging it with the application.

● Disk Quotas for local state (e.g. RocksDB state)

● Samza Management Service (SAMZA-REST) -> Next Slide

Page 27: Apache samza  past, present and future

Samza Management Service (Samza REST) (NEW)

[Diagram: SamzaREST instances run next to the YARN processes (RM/NM). Each of the three YARN Resource Managers (RM) is paired with an SRR box labelled /v1/jobs; each Node Manager (NM) on the nodes in the YARN cluster is paired with an SRN box, alongside the Samza Containers.]

1. Exposes /jobs resource to start, stop, get status of jobs etc.

2. Cleans up stores from dead jobs

Page 28: Apache samza  past, present and future

Agenda

1. Stream processing

2. State of the union

3. Apache Samza : Key differentiators

4. Apache Samza Futures

Page 29: Apache samza  past, present and future

Coming Soon : Samza as a Library

[Diagram: several application instances, each containing a Stream Processor, Code and a Job Coordinator; one of them is marked Leader]

● No YARN dependency

● Will use ZK for leader election

● Embed stream processing into your bigger application

StreamProcessor processor = new StreamProcessor(config, "job-name", "job-id");
processor.start();
processor.awaitStart();
…
processor.stop();

Page 30: Apache samza  past, present and future

Coming Soon: High Level API and Event Time (SAMZA-914/915)

Count the number of PageViews by Region, every 30 minutes.

@Override
public void init(Collection<SystemMessageStream> sources) {
  sources.forEach(source -> {
    // Key each page view by its region.
    Function<PageView, String> keyExtractor = view -> view.getRegion();
    source.map(msg -> new PageViewMessage(msg))
          .window(Windows.<PageViewMessage, String>intoSessionCounter(
              keyExtractor, WindowType.Tumbling, 30 * 60)); // 30 * 60 seconds = 30 minutes
  });
}
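Independently of the proposed API, the computation itself (count page views per region in 30-minute tumbling windows keyed on event time) can be sketched in plain Java. The `PageView` class, its fields and `countByRegion` are assumptions made for this illustration and are not part of Samza.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sketch: event-time tumbling-window counts of page views per region. */
final class TumblingWindowCountSketch {
    /** Hypothetical event with a region and a non-negative event timestamp in seconds. */
    static final class PageView {
        final String region;
        final long timestampSeconds;

        PageView(String region, long timestampSeconds) {
            this.region = region;
            this.timestampSeconds = timestampSeconds;
        }
    }

    /** Returns window start (seconds) -> region -> count. */
    static SortedMap<Long, Map<String, Long>> countByRegion(List<PageView> views, long windowSeconds) {
        SortedMap<Long, Map<String, Long>> windows = new TreeMap<>();
        for (PageView view : views) {
            // Tumbling windows do not overlap: each event belongs to exactly one window.
            long windowStart = (view.timestampSeconds / windowSeconds) * windowSeconds;
            windows.computeIfAbsent(windowStart, start -> new TreeMap<>())
                   .merge(view.region, 1L, Long::sum);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<PageView> views = Arrays.asList(
                new PageView("US", 10),
                new PageView("US", 100),
                new PageView("EU", 1700),
                new PageView("US", 1800));
        // prints {0={EU=1, US=2}, 1800={US=1}}
        System.out.println(countByRegion(views, 30 * 60));
    }
}
```

Because windows are assigned from the event's own timestamp rather than arrival time, late or reordered events still land in the window they belong to.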

Page 31: Apache samza  past, present and future

Coming Soon: First class support for Pipelines (SAMZA-1041)

public class MyPipeline implements PipelineFactory {
  public Pipeline create(Config config) {
    Processor myShuffler = getShuffle(config);
    Processor myJoiner = getJoin(config);
    Stream inStream1 = getStream(config, "inStream1");
    // … omitted for brevity

    // Wire the stages: inStream1 -> shuffle -> intermediate stream -> join (with inStream2) -> final output.
    PipelineBuilder builder = new PipelineBuilder();
    return builder.addInputStreams(myShuffler, inStream1)
                  .addOutputStreams(myShuffler, intermediateOutStream)
                  .addInputStreams(myJoiner, intermediateOutStream, inStream2)
                  .addOutputStreams(myJoiner, finalOutStream)
                  .build();
  }
}

[Diagram: input → Shuffle → Join → output]

Page 32: Apache samza  past, present and future

Future: Miscellaneous

● Exactly once processing

● Making it easier to auto-scale even with Local State (on-demand Standby containers)

● Turnkey Disaster Recovery for stateful applications

○ Easy Restore of changelog and checkpoints from some other datacenter.

● Improved support for Batch jobs

● SQL over Streams

● A default Dashboard :)

Page 33: Apache samza  past, present and future

Questions ?