Spark Summit EU talk by Zoltán Zvara

Post on 16-Apr-2017


TRANSCRIPT

Data-Aware Spark
Zoltán Zvara

zoltan.zvara@sztaki.hu

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 688191.

Introduction

• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Hadoop, Couchbase, etc.
• Multiple IoT and telecommunication use cases, with challenging data volume, velocity and distribution

Agenda

• Data-skew story
• Problem definitions & goals
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Tracing
• Visualization
• Conclusion

Motivation

• Applications aggregating IoT, social network and telecommunication data that tested well on toy data
• When deployed against the real dataset, the application seemed healthy at first
• However, it could become surprisingly slow or even crash
• What went wrong?

Our data-skew story

• We have use cases where map-side combine is not an option: groupBy, join
• Power laws (Pareto, Zipfian)
• "80% of the traffic is generated by 20% of the communication towers"
• Most of our data follows the 80–20 rule

The problem

• Default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• The data distribution is not known in advance
• Concept drifts are common
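To make the problem concrete, here is a minimal, self-contained sketch (not from the talk; names like `SkewDemo` and `zipfKeys` are invented) showing how hashCode-mod partitioning — the essence of Spark's default HashPartitioner — sends a Zipf-like key distribution to partitions very unevenly:

```scala
// Minimal sketch: Zipf-like keys + default hashCode-mod partitioning
// produce heavily skewed reducer-side partitions.
object SkewDemo {
  // Zipf-like draw: key k is emitted roughly scale/k times,
  // so "tower-1" dominates the stream (the 80-20 effect).
  def zipfKeys(nKeys: Int, scale: Int): Seq[String] =
    (1 to nKeys).flatMap(k => Seq.fill(scale / k)(s"tower-$k"))

  // Spark's HashPartitioner boils down to a non-negative
  // modulo of the key's hashCode.
  def partition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Count how many records land in each partition.
  def partitionSizes(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(partition(_, numPartitions))
        .map { case (p, ks) => p -> ks.size }
}
```

With 100 keys, scale 1000 and 8 partitions, the partition that happens to receive `tower-1` holds at least 1000 records, well above the roughly 650-record average — exactly the "some unknown partition contains a lot of records" situation described above.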

[Figure: even partitioning of skewed data at the stage boundary & shuffle produces slow tasks on the reducer side]

Ultimate goal

Spark to be a data-aware distributed data-processing framework

F1
Generic, engine-level load balancing & data-skew mitigation
Low level: works with any API built on top of it

F2
Completely online
Requires no offline statistics; does not require simulation on a subset of your data

F3
Distribution-agnostic
No need to identify the underlying distribution; does not require handling special corner cases

F4
For the batch & the streaming guys under one hood
Same architecture and same core logic for every workload; support for unified workloads

F5
System-aware
Dynamically understands system-specific trade-offs

F6
Does not need any user guidance or configuration
Out-of-the-box solution; the user does not need to play with the configuration:
spark.data-aware = true

Architecture

[Diagram: Master and Slaves]
1. Slaves build approximate local key distributions & statistics
2. Local statistics are collected to the Master
3. The Master computes the global key distribution & statistics
4. A new hash is constructed and redistributed to the Slaves

Driver perspective

• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
• Collects feedback

Executor perspective

• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM

Scalable sampling

• Key distributions are approximated with a strategy that is:
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable, by using a backoff strategy

[Figure: key-frequency counters sampled at rate s_i; when the counter map fills up to a bound T_B, it is truncated and the sampling rate is backed off to s_i / b]
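The backoff idea sketched above can be illustrated as follows. This is a hypothetical sketch, not the talk's implementation: the class name `BackoffSampler`, the truncation rule (keep the heaviest half) and the seed are all invented for illustration; only the sample-count-truncate-backoff cycle comes from the slides.

```scala
// Hypothetical sketch of backoff sampling: observe keys at rate `rate`;
// when the counter map grows past maxKeys, truncate it to the heaviest
// counters and divide the sampling rate by the backoff factor b.
class BackoffSampler(initialRate: Double, b: Double, maxKeys: Int) {
  private var rate = initialRate
  private val counters = scala.collection.mutable.Map.empty[Any, Double]
  private val rnd = new scala.util.Random(42) // fixed seed for reproducibility

  def observe(key: Any): Unit =
    if (rnd.nextDouble() < rate) {
      // Divide by the rate so counters estimate true frequencies.
      counters(key) = counters.getOrElse(key, 0.0) + 1.0 / rate
      if (counters.size > maxKeys) {
        // Truncate: keep only the heaviest half, then back off the rate.
        val keep = counters.toSeq.sortBy(-_._2).take(maxKeys / 2)
        counters.clear()
        counters ++= keep
        rate /= b
      }
    }

  def histogram: Map[Any, Double] = counters.toMap
}
```

The truncation keeps the sampler lightweight (bounded memory) while the heavy keys — the ones that matter for skew mitigation — survive each truncation, which is what makes the approximation robust to when a concept drift happens.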

Complexity-sensitive sampling

• Reducer run-time can highly correlate with the computational complexity of the values for a given key
• Calculating object size is costly in the JVM and in some cases shows little correlation to computational complexity (the function of the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific key, consider its complexity
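A minimal sketch of the idea, assuming a `Weightable` interface with a single `complexity` member (the record type `CallRecord` and the helper `weightedHistogram` are invented for illustration): instead of counting one per record, the histogram accumulates each value's complexity, so a key with few but expensive values still looks heavy.

```scala
// Complexity-sensitive sampling sketch: accumulate per-key weights
// rather than per-key record counts.
trait Weightable { def complexity: Double }

// Hypothetical user object: downstream cost scales with duration.
case class CallRecord(durationSec: Double) extends Weightable {
  def complexity: Double = durationSec
}

def weightedHistogram[K](records: Seq[(K, Any)]): Map[K, Double] =
  records.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map {
      case (_, w: Weightable) => w.complexity // complexity-sensitive path
      case _                  => 1.0          // fall back to plain counting
    }.sum
  }
```

With this weighting, a key carrying a single 10-second record outweighs a key carrying two 1-second records, matching the observation that reducer run-time follows value complexity rather than record count.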

Scalable sampling in numbers

• Optimized with micro-benchmarking
• Main factors:
  • Sampling strategy: aggressiveness (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current stage (mapper)
• Current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5–15% to the runtime
  • After repartitioning, we cut out the sampler's code path; in practice it adds 0.2–1.5% to the runtime

Decision to repartition

• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
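One of the listed strategies — distance of the global histogram from the uniform distribution — can be illustrated with a simple rule. This is illustrative only (the function name and the choice of max-deviation as the distance measure are assumptions, not the talk's actual strategy):

```scala
// Illustrative decision rule: normalize the global histogram and
// repartition when any partition's share deviates from the uniform
// share 1/n by more than a threshold.
def shouldRepartition(histogram: Seq[Double], threshold: Double): Boolean = {
  val total   = histogram.sum
  val uniform = 1.0 / histogram.size
  val maxDeviation = histogram.map(c => math.abs(c / total - uniform)).max
  maxDeviation > threshold
}
```

For example, a histogram of (80, 10, 10) records deviates from the uniform share of 1/3 by about 0.47 and would trigger repartitioning at a threshold of 0.2, while a perfectly balanced (25, 25, 25, 25) would not.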

Construction of the new hash function

• cut₁ (single-key cut): the top prominent keys are hashed to dedicated partitions
• cut₂ (uniform distribution): keys are distributed uniformly up to the level l
• cut₃ (probabilistic cut): a key is hashed here with probability proportional to the remaining space to reach l
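The core of the single-key cut can be sketched as follows. This is not the actual KeyIsolatorPartitioner source — the class name `KeyIsolatorSketch` and its shape are invented — it only illustrates the first cut: each heavy key gets a dedicated partition, and all remaining keys are hashed over the leftover partitions.

```scala
// Sketch of the single-key-cut idea: isolate the heaviest keys into
// dedicated partitions; hash everything else over the rest.
class KeyIsolatorSketch(heavyKeys: Seq[Any], numPartitions: Int)
    extends Serializable {
  // Heavy key -> its own partition index (0, 1, 2, ...).
  private val isolated: Map[Any, Int] = heavyKeys.zipWithIndex.toMap
  // Partitions left over for the non-heavy keys.
  private val rest = numPartitions - isolated.size

  def getPartition(key: Any): Int =
    isolated.getOrElse(key, {
      val raw = key.hashCode % rest
      isolated.size + (if (raw < 0) raw + rest else raw)
    })
}
```

A real implementation would extend Spark's `Partitioner` and add the uniform and probabilistic cuts for the mid-weight keys; the sketch shows only why evaluating this hash is more work than a plain `hashCode` modulo, which is what the micro-benchmark on the next slide measures.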

New hash function in numbers

• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example, on Strings):
  • Number of partitions: 512
  • HashPartitioner: AVG time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: AVG time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%

Repartitioning

• Reorganize the previous naive hashing, or buffer the data before hashing
• Usually happens in-memory
• In practice adds an additional 1–8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
• For streaming: update the hash in the DStream graph

Batch groupBy

MusicTimeseries: groupBy on tags from a listenings stream

[Figure: size of the biggest partition vs. number of partitions (171, 200, 400), comparing naive hashing (partition sizes 83M, 134M, 92M) with Dynamic Repartitioning; annotations: 64% reduction, over-partitioning overhead & blind scheduling, observed 3% map-side overhead, 38% reduction]

Batch join

• Joining tracks with tags
• The tracks dataset is skewed
• Number of partitions set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer stage runtime = 124 seconds
  • heavy map, only 0.9% overhead

Tracing

Goal: capture and record a data-point's lifecycle throughout the whole execution

We rewrote Spark's core to handle wrapped data-points.

Wrapper data-point[T]:
  payload: T
  apply(f), where f: T → U, f: T → Boolean, f: T → Iterable[U], …

Traceables in Spark

Each record is a Traceable object (a Wrapper implementation).

Apply a function on the payload and report/build the trace.
spark.wrapper.class = com.ericsson.ark.spark.Traceable

Traceable data-point[T]:
  payload: T
  trace: Trace
  apply(f), where f: T → U, f: T → Boolean, f: T → Iterable[U], …
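The Traceable shape above can be sketched in a few lines. This is a hypothetical simplification (the real `com.ericsson.ark.spark.Traceable` is not shown in the talk; here the trace is just a list of operator names): applying an operator both transforms the payload and extends the trace.

```scala
// Sketch of a traceable data-point: the payload travels together with
// its trace, and every applied operator is recorded.
case class Traceable[T](payload: T, trace: List[String]) {
  // Apply operator `f` (named `name`) to the payload and extend the trace.
  def apply[U](name: String, f: T => U): Traceable[U] =
    Traceable(f(payload), trace :+ name)
}
```

Chaining such applications across stages is what yields the trace graph on the next slide, where each recorded step becomes a vertex.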

Tracing implementation

Capture trace information (deep profiling):
• by reporting to an external service;
• by piggybacking aggregated trace routes on data-points.

[Figure: a trace graph of operators T_i, with metric annotations on its edges and vertices]

• 𝑇" can beany Spark operator• we attach useful metrics to edges

and vertices (current load, timespent at task)

F7
Provide data characteristics to monitor & debug applications
Users can sneak a peek into the dataflow; users can debug and monitor applications more easily

• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics

Execution visualization of Spark jobs

Repartitioning in the visualization

• Naive: heavy keys create heavy partitions (slow tasks)
• With Dynamic Repartitioning: the heavy key is isolated, and the size of the biggest partition is minimized

Conclusion

• Our Dynamic Repartitioning can handle data skew dynamically, on-the-fly, on any workload and arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Tracing can help us improve the co-location of related services
• Visualizations can aid developers in better understanding issues and bottlenecks of certain workloads
• Making Spark data-aware pays off

Thank you for your attention

Zoltán Zvara
zoltan.zvara@sztaki.hu
