Apache Spark Performance: Past, Future and Present

Post on 21-Jan-2018

TRANSCRIPT

1. Spark Performance: Past, Future, and Present. Kay Ousterhout. Joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun
2. About Me: Apache Spark PMC member. Recent PhD graduate from UC Berkeley; thesis work on the performance of large-scale data analytics. Co-founder at Kelda (kelda.io)
3. How can I make this faster?
4. How can I make this faster?
5. Should I use a different cloud instance type?
6. Should I trade more CPU for less I/O by using better compression?
7. How can I make this faster? ???
8. How can I make this faster? ???
9. How can I make this faster? ???
10. Major performance improvements are possible via tuning and configuration, if only you knew which knobs to turn
11. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark's performance instrumentation
12. Example Spark job (Word Count): split the input file into words and emit a count of 1 for each: spark.textFile("hdfs://").flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://")
13. Example Spark job (Word Count): split the input file into words and emit a count of 1 for each; then, for each word, combine the counts and save the output: spark.textFile("hdfs://").flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://")
14. Spark Word Count job. Map stage (tasks on Worker 1 ... Worker n): split the input file into words and emit a count of 1 for each: spark.textFile("hdfs://").flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)). Reduce stage (tasks on Worker 1 ... Worker n): for each word, combine the counts and save the output: .reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://")
15. Spark Word Count job. Reduce stage (tasks on Worker 1 ... Worker n): for each word, combine the counts and save the output: .reduceByKey(lambda a, b: a + b).saveAsTextFile("hdfs://")
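The word-count pipeline in the slides can be sketched in plain Python (no Spark needed) to show what each stage computes. The sample input lines below are made up for illustration; only the flatMap / map / reduceByKey semantics come from the talk.

```python
from itertools import chain

# Stand-in for the HDFS input file (made-up sample lines).
lines = ["to be or not", "to be"]

# flatMap(lambda l: l.split(" ")): split each line into words.
words = list(chain.from_iterable(l.split(" ") for l in lines))

# map(lambda w: (w, 1)): emit a count of 1 for each word.
pairs = [(w, 1) for w in words]

# reduceByKey(lambda a, b: a + b): combine the counts for each word.
# In Spark this runs in the reduce stage, after a shuffle groups
# all pairs with the same key onto the same worker.
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark the shuffle between the map and reduce stages is what makes the reduce tasks' mix of network, disk, and CPU time interesting, which is where the talk goes next.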
16. What happens in a reduce task? (Timeline across compute, network, and disk; each segment is the time to handle one shuffle block.) (1) Request a few shuffle blocks. (2) Process local data. (3) Write output to disk. (4) Process data fetched remotely. (5) Continue fetching remote data.
17. What happens in a reduce task? (Timeline across compute, network, and disk; each segment is the time to handle one shuffle block.) The task is bottlenecked on network and disk, then on network alone, then on CPU.
18. What happens in a reduce task? (Timeline across compute, network, and disk; each segment is the time to handle one shuffle block.) The task is bottlenecked on network and disk, then on network alone, then on CPU.
19. What instrumentation exists today? (Timeline across compute, network, and disk.) Instrumentation is centered on the single, main task thread: shuffle read blocked time and executor computing time (!)
20. What instrumentation exists today? (The actual timeline vs. the instrumented version.)
21. What instrumentation exists today? Instrumentation is centered on the single, main task thread. Shuffle read and shuffle write blocked time are measured. Input read and output write blocked time are not instrumented, but it is possible to add them!
22. Instrumenting read and write time. "Process shuffle block (compute), then write output (disk)" is a lie. Reality: Spark processes and then writes one record at a time; most writes get buffered, and occasionally the buffer is flushed.
23. Spark processes and then writes one record at a time; most writes get buffered, and occasionally the buffer is flushed. Challenges with this reality: record-level instrumentation is too high overhead, and Spark doesn't know when buffers get flushed (HDFS does!)
24. Opportunities to improve instrumentation: tasks use fine-grained pipelining to parallelize resources, and instrumented times are blocked times only (the task is doing other things in the background).
25. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark's performance instrumentation
26. Four concurrent tasks on a worker (timeline of Tasks 1-8 sharing the machine).
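The blocked-time metrics described above amount to wrapping each blocking call on the main task thread in a timer and accumulating the elapsed wall-clock time. A minimal sketch of that pattern follows; the class and variable names are illustrative, not Spark's actual metric classes.

```python
import time

class BlockedTimer:
    """Accumulates wall-clock time spent blocked on one resource,
    the pattern behind metrics like shuffle-read blocked time."""

    def __init__(self):
        self.blocked_nanos = 0

    def __enter__(self):
        self._start = time.perf_counter_ns()
        return self

    def __exit__(self, *exc):
        self.blocked_nanos += time.perf_counter_ns() - self._start

shuffle_read_blocked = BlockedTimer()

# Wrap only the blocking fetch; here sleep() stands in for
# waiting on a remote shuffle block.
with shuffle_read_blocked:
    time.sleep(0.01)

# CPU time spent processing the fetched data is *not* counted,
# which is why these are blocked times, not total resource times.
print(shuffle_read_blocked.blocked_nanos)
```

This also shows the limitation the slides point out: anything that happens off the main thread, or behind a write buffer, never passes through such a timer.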
27. Concurrent tasks may contend for the same resource (e.g., the network). (Timeline of Tasks 1-8.)
28. What's the bottleneck? At a given time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times.
29. How much faster would my job be with 2x disk throughput? How would the runtimes for these disk writes change? How would that change the timing of (and contention for) other resources?
30. Today: tasks use pipelining to parallelize multiple resources. Proposal: build systems using monotasks that each consume just one resource.
31. Monotasks: each task uses one resource (network monotasks, compute monotasks, disk monotasks). Today's task becomes a network read, then CPU work, then a disk write. Monotasks don't start until all of their dependencies complete.
32. Dedicated schedulers control contention: a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk). (Monotasks for one of today's tasks.)
33. Spark today: tasks have non-uniform resource use, and 4 multi-resource tasks run concurrently. Monotasks: single-resource monotasks are scheduled by per-resource schedulers. API-compatible with Spark, with performance parity, and performance telemetry becomes trivial!
34. How much faster would the job run with 4x more machines? With input stored in memory (no disk read, no CPU time to deserialize)? With flash drives instead of disks? With faster shuffle read/write time? A 10x improvement is predicted with at most 23% error.
35. Monotasks: break jobs into single-resource tasks. Using single-resource monotasks provides clarity without sacrificing performance. It is a massive change to Spark internals (>20K lines of code).
36. This talk. Past: performance instrumentation in Spark. Future: a new architecture that provides performance clarity. Present: improving Spark's performance instrumentation
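One way to see why per-resource schedulers make telemetry trivial is a toy simulation of the monotask idea: each task becomes a network -> CPU -> disk chain, each resource serves one monotask at a time, and per-task resource time is then simply each monotask's service time. This is an illustrative sketch with made-up durations, not the actual monotasks implementation (which is a >20K-line change to Spark internals).

```python
# Toy simulation of per-resource monotask scheduling.
# Each task is a chain of monotasks: network read -> compute -> disk write,
# and a monotask doesn't start until its dependency completes.

tasks = {  # seconds each monotask needs on its resource (made-up numbers)
    "task1": {"network": 2, "cpu": 3, "disk": 1},
    "task2": {"network": 2, "cpu": 3, "disk": 1},
}

# Each dedicated scheduler runs one monotask at a time on its resource
# (like the talk's 1-monotask-per-core / per-disk schedulers).
resource_free_at = {"network": 0, "cpu": 0, "disk": 0}
finish = {}
for name, need in tasks.items():
    t = 0
    for resource in ("network", "cpu", "disk"):
        start = max(t, resource_free_at[resource])  # wait for dependency AND resource
        t = start + need[resource]
        resource_free_at[resource] = t
    finish[name] = t

# task2's network read overlaps task1's compute: resources are still
# pipelined across tasks, even though each monotask uses one resource.
print(finish)  # {'task1': 6, 'task2': 9}
```

Answering "what if disks were 2x faster?" in this model is just rerunning the simulation with smaller disk durations, which is essentially how the talk's 10x predictions (within 23% error) are made.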
37. Spark today: task resource use changes at fine time granularity, and 4 multi-resource tasks run concurrently. Monotasks: single-resource tasks lead to complete, trivial performance metrics. Can we get monotask-like per-resource metrics for each task today?
38. Can we get per-task resource use (compute, network, and disk over time)? Measure machine resource utilization?
39. Four concurrent tasks on a worker (timeline of Tasks 1-8).
40. Concurrent tasks may contend for the same resource (e.g., the network), and contention is controlled by lower layers (e.g., the operating system).
41. Can we get per-task resource use? Machine utilization includes other tasks, and we can't directly measure per-task I/O: it happens in the background, mixed with other tasks.
42. Can we get per-task resource use? Existing metrics give the total data read, but how long did it take? Use machine utilization metrics to get the bandwidth!
43. Existing per-task I/O counters (e.g., shuffle bytes read) + machine-level utilization (and bandwidth) metrics = complete metrics about the time spent using each resource.
44. Why do we care about performance clarity? Goal: provide performance clarity. The only way to improve performance is to know what to speed up. A typical performance evaluation involves a group of experts; practical performance tuning involves one novice.
45. Goal: provide performance clarity. The only way to improve performance is to know what to speed up. Some instrumentation exists already, focused on blocked times in the main task thread. There are many opportunities to improve it: (1) add read/write instrumentation at a lower level (e.g., HDFS); (2) add machine-level utilization info; (3) calculate per-resource time. More details at kayousterhout.org
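The arithmetic behind combining the two metric sources is simple: divide each task's existing byte counters by the machine-level bandwidth observed while it ran. A sketch of that calculation, with made-up numbers and illustrative field names (these are not Spark's actual metric names):

```python
def per_resource_seconds(task_counters, bandwidth_bytes_per_sec):
    """Estimate the time a task spent on each resource:
    bytes moved (per-task counter) / observed bandwidth (machine-level)."""
    return {
        resource: nbytes / bandwidth_bytes_per_sec[resource]
        for resource, nbytes in task_counters.items()
    }

# Existing per-task I/O counters (e.g., shuffle bytes read); made-up values.
counters = {"shuffle_read": 200e6, "disk_write": 100e6}

# Machine-level bandwidths measured while the task ran; made-up values.
bandwidth = {"shuffle_read": 100e6, "disk_write": 50e6}

print(per_resource_seconds(counters, bandwidth))
# {'shuffle_read': 2.0, 'disk_write': 2.0}
```

The caveat from the slides still applies: the machine-level bandwidth is shared across concurrent tasks, so this yields an estimate of per-task resource time rather than the exact, per-monotask measurement.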