have your cake and eat it too, further dispelling the myths of the lambda architecture

59
Have Your Cake & Eat It Too Further Dispelling the Myths of the Lambda Architecture Tyler Akidau Staff Software Engineer

Upload: dimos-raptis

Post on 13-Apr-2017

107 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Have Your Cake & Eat It TooFurther Dispelling the Myths of the Lambda ArchitectureTyler AkidauStaff Software Engineer

Page 2: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MillWheel Streaming

FlumeCloud Dataflow

- Stream Processing System- High-level API- Data Processing Service

Page 3: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Google Cloud Dataflow

Optimize

Schedule

GCS

GCS

Page 4: Have your cake and eat it too, further dispelling the myths of the lambda architecture

- Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...

- Robert Bradshaw, Daniel Mills,and more...

- Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...

MillWheel

Streaming Flume

Cloud Dataflow

Page 5: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Cloud Dataflow is unreleased.

Things may change.

Page 6: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Lambda vs Streaming

Strong Consistency

Reasoning About Time

1

2

3

Agenda

Page 7: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Lambda vs Streaming1

Page 8: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Page 9: Have your cake and eat it too, further dispelling the myths of the lambda architecture

The Lambda Architecture

Page 10: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Page 11: Have your cake and eat it too, further dispelling the myths of the lambda architecture

The Evolution of Streaming

Page 12: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Strong Consistency

Tools for Reasoning About Time

What does it take?

Page 13: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Strong Consistency2

Page 14: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Consistent Storage

Storage

Page 15: Have your cake and eat it too, further dispelling the myths of the lambda architecture

• Mostly correct is not good enough• Required for exactly-once processing• Required for repeatable results• Cannot replace batch without it

Why consistency is important

Page 16: Have your cake and eat it too, further dispelling the myths of the lambda architecture

• Sequencers (e.g. BigTable)• Leases (e.g. Spanner)• Federation of storage silos (e.g. Samza,

Dataflow)• RDDs (e.g. Spark)

How?

Page 17: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://research.google.com/pubs/pub41378.html

Page 18: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Reasoning About Time 3

Page 19: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time vs Stream TimeBatch vs Streaming

ApproachesDataflow API

Page 20: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time - When Events Happened

Stream Time - When Events Are Processed

Page 21: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Batch vs Streaming

Page 22: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MapReduce

Batch

Page 23: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MapReduce

[10:00 - 11:00)

[10:00 - 11:00)

[11:00 - 12:00)[12:00 -

13:00)[13:00 - 14:00)[14:00 -

15:00)[15:00 - 16:00)[16:00 -

17:00)[18:00 - 19:00)[19:00 -

20:00)[21:00 - 22:00)[22:00 -

23:00)[23:00 - 0:00)

Batch: Fixed Windows

Page 24: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MapReduce

[10:00 - 11:00)

[11:00 - 12:00)

Batch: User Sessions

Joan

Larry

Ingo

Amanda

Cheryl

Arthur

[11:00 - 12:00)

[10:00 - 11:00)

Page 25: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Streaming

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Page 26: Have your cake and eat it too, further dispelling the myths of the lambda architecture

UnorderedUnbounded

Of Varying Event Time Skew

Confounding characteristics of data streams

Page 27: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time Skew

Stre

am

Tim

e

Event Time

Skew

Page 28: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Approaches

Page 29: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1.Time-Agnostic Processing2.Approximation3.Stream Time Windowing4.Event Time Windowing

Approaches to reasoning about time

Page 30: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1.Time-Agnostic Processing - Filters

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

Example Input:Example Output:

Pros:

Cons:

Web server traffic logsAll traffic from specific domains

StraightforwardEfficientLimited utility

Page 31: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1.Time-Agnostic Processing - Hash Join

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

Example Input:Example Output:

Pros:

Cons:

Query & Click traffic Joined stream of Query + Click pairs

StraightforwardEfficientLimited utility

Page 32: Have your cake and eat it too, further dispelling the myths of the lambda architecture

2.Approximation via Online Algorithms

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

Example Input:Example Output:

Pros:Cons:

Twitter hashtagsApproximate top N hashtags per prefix

EfficientInexactComplicated Algorithms

Page 33: Have your cake and eat it too, further dispelling the myths of the lambda architecture

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

Web server request trafficPer-minute rate of received requestsStraightforwardResults reflect contents of streamResults don’t reflect events as they happenedIf approximating event time, usefulness varies

Example Input:Example Output:

Pros:

Cons:

3.Windowing by Stream Time

Page 34: Have your cake and eat it too, further dispelling the myths of the lambda architecture

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Event Time

Example Input:Example Output:

Pros:Cons:

Twitter hashtagsTop N hashtags by prefix per hour.Reflects events as they occurredMore complicated bufferingCompleteness issues

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

4.Windowing by Event Time - Fixed Windows

Page 35: Have your cake and eat it too, further dispelling the myths of the lambda architecture

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Event Time

Example Input:Example Output:

Pros:Cons:

User activity streamPer-session group of activitiesReflects events as they occurredMore complicated bufferingCompleteness issues

11:00

10:00

16:00

15:00

14:00

13:00

12:00

Stream Time

4.Windowing by Event Time - Sessions

Page 36: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Dataflow API

Page 37: Have your cake and eat it too, further dispelling the myths of the lambda architecture

What are you computing?

Where in event time?

When in stream time?

Page 38: Have your cake and eat it too, further dispelling the myths of the lambda architecture

What = Aggregation API

Where = Windowing API

When = Watermarks + Triggers API

Page 39: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Aggregation API

PCollection<KV<String, Double>> sums = Pipeline .begin() .read(“userRequests”) .apply(new Sum());

Page 40: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Aggregation API

2

4

7

0

1

6

33

8

918

9

16Sum

Page 41: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Streaming Mode

10:02 10:0010:06 10:04 Stream Time

24 3 1

63

3 87

0 24

16

3

38

9

0

4

7

0 33

2

10:02 10:0010:06 10:04 Event Time

1

38

9

046

18

7

0 23

432

74 3

6

3

0 33

2

Page 42: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Windowing API

PCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTE))); .apply(new Sum());

Page 43: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Windowing API

10:02 10:0010:06 10:04 Stream Time

24 3 1

63

3 87

0

10:02 10:0010:06 10:04 Event Time

10:02 10:0010:06 10:04 Event Time

24

16

3

38

9

0

4

7

0 33

2

FixedWindows

Sum

1

38

9

046

18

7

0 23

432

74 3

6

3

0 33

2

13 12616 1835 415

Page 44: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Watermarks

●f(S) -> E●S = a point in stream time (i.e. now)

●E = the point in event time up to which input data is complete as of S

Page 45: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time Skew

Stre

am

Tim

e

Event Time

Page 46: Have your cake and eat it too, further dispelling the myths of the lambda architecture

FixedWindows

Sum

10:02 10:00

13616 1835 415 12

10:06 10:04 Stream Time

24 3 1

63

3 87

0

10:02 10:0010:06 10:04 Event Time

10:01 10:0010:03 10:02 Event Time

24

16

3

38

9

0

4

7

0 33

2

1

38

9

046

18

7

0 23

432

74 3

6

3

0 33

2

Watermarks

Page 47: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Watermark Caveats

Too slow = more latency

Too fast = late data

Page 48: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers

When in stream time to emit?

Page 49: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers API

PCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTES)) .trigger(new AtWatermark()); .apply(new Sum());

Page 50: Have your cake and eat it too, further dispelling the myths of the lambda architecture

2

3

1

8

4

8

7

6 3

1313

Event Time10:05 10:0610:0110:00

10:0310:00

10:06

2

12 5

52020

99

10:0110:02

10:0510:04

10:02 10:03 10:04

5Late datum

Page 51: Have your cake and eat it too, further dispelling the myths of the lambda architecture

A Better Strategy

1.Once per stream time minute

2.At watermark3.Once per record for two weeks

Page 52: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1325 5

2

3

1

8

4

8

7

6 3

5

Event Time10:05 10:0610:0110:00

10:0310:00

10:06

2

12 5

52020

99

10:0110:02

10:0510:04

10:02 10:03 10:04

12

20

1320

5

1320

13

1325

2

12

13

9

20 520

25Late datum

91325

Page 53: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers APIPCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTE)) .trigger(new SequenceOf( new RepeatUntil( new AtPeriod(1, MINUTE), new AtWatermark()), new AtWatermark(), new RepeatUntil( new AfterCount(1), new AfterDelay( 14, DAYS, TimeDomain.EVENT_TIME)))); .apply(new Sum());

Page 54: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Lambda vs Streaming

Low-latency, approximate resultsComplete, correct results as soon as possible

Ability to deal with changes upstream

Page 55: Have your cake and eat it too, further dispelling the myths of the lambda architecture

One Last Thing...

What if I want sessions?

Page 56: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers APIPCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new Sessions(1, MINUTE)) .trigger(new SequenceOf( new RepeatUntil( new AtPeriod(1, MINUTE), new AtWatermark()), new AtWatermark(), new RepeatUntil( new AfterCount(1), new AfterDelay( 14, DAYS, TimeDomain.EVENT_TIME)))); .apply(new Sum());

Page 57: Have your cake and eat it too, further dispelling the myths of the lambda architecture

2

8

4

8

7

6 3

Event Time10:05 10:0610:0110:00

10:0310:00

10:0610:01

10:0210:05

10:04

10:02 10:03 10:04

22

3

1

9

1 minute2 7

1 minute2

2 7

1 minute

9 39 3

39 1 48

89 7 2525

25

5Late datum

25 83333

33

335 3838

6 3938 9

20 5

13

2

12

25

9

Page 58: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Summary

Lambda is greatStreaming by itself is better :-)

Strong Consistency = CorrectnessStreaming = Aggregation + Windowing +

TriggersTools For Reasoning About Time = Power +

Flexibility

Page 59: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Thank you!

Questions?Questions about this talk:

Questions about Cloud Dataflow:

[email protected] (Tyler Akidau)[email protected] (Eric Schmidt)