big data analysis services in the cloud - knowledge grid · big data analysis services in the cloud...

67
Big Data Analysis Services in the Cloud Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy [email protected]

Upload: lamkien

Post on 02-Apr-2018

214 views

Category:

Documents


1 download

TRANSCRIPT

Big Data Analysis Services in the Cloud

Domenico Talia

DIMES, Università della Calabria & DtoK Lab

Italy

[email protected]

2

§  BigproblemsandBigdata

§  UsingCloudsfordatamining

§  Acollec7onofservicesforscalabledataanalysis§  Dataminingworkflows

§  DataMiningCloudFramework(DMCF)

§  JS4Cloudforprogrammingservice-orientedworkflows.

§  Finalremarks

Talkoutline

3

§  KnowledgeDiscoverymethodsanddataminingtechniquesareusedineveryapplica7ondomaintoextractusefulknowledgefromBigData.

§  DataMiningapplica7onsrangefrom§  Single-taskapplica7ons§  Parameter-sweepingapplica7ons/regularparallelapplica7ons§  Complexapplica7ons(Workflow-based,distributed,parallel).

§  CloudCompu3ngcanbeusedtoprovidedevelopersandend-userswithcompu7ngandstorageservicesandscalableexecu7onmechanismsneededtoefficientlyrunalltheseclassesofapplica3ons.

Goals(1)

4

§  DiscussCloudservicesforscalableexecu7onofdataanalysisworkflows.

§  Presentaprogrammingenvironmentfordataanalysis:DataMiningCloudFramework(DMCF).

§  IntroduceavisualprogramminginterfaceVL4Cloudandthescript-basedJS4Cloudlanguageforimplemen7ngservice-orientedworkflows.

§  EvaluatetheperformanceofdataminingworkflowsonDMCF.

§  Outlinesomeopenresearchissues.

Goals(2)

5

BigDataneversleeps

6

BigDataneedsscalableanalysis

•  Vo lume i s on l y onedimensionoftheproblem.

•  The most important issueisValue.

•  Scalabledata analysis is akey technology to obtainValuefromBigData.

7

BigDataneedsscalableanalysis

Combina7onof•  B i g d a t a a n a l i s y s a n dk n o w l e d g e - d i s c o v e r ytechniqueswith

•  scalablecompu7ngsystems for

•  aneffec3vestrategytoobtainnew insights in a shorterperiodof7me.

•  Cloudcompu3nghelps!

Dataanalysisasaservice

9

§  Knowledgediscovery(KDD)anddatamining(DM)apps:§  Needtoruncompute-anddata-intensiveprocesses/tasks§  AreoTenbasedondistribu3onofdata,algorithms,andusers.

§  PaaS(Pla&ormasaService)canbeanappropriatemodeltobuildframeworksthatallowuserstodesignandexecutescalabledataminingapplica7ons.

§  SaaS(So1wareasaService)canbeanappropriatemodeltoimplementscalabledataminingapplica7ons.

§  Thosetwocloudservicemodelscanbeeffec7velyexploitedfordeliveringdataanalysistoolsandapplica3onsasservices.

Dataanalysisasaservice

Servicesfordistributeddatamining

10

§  Byexploi7ngtheSOAmodelitispossibletodefinebasicservicesforsuppor3ngdistributeddataminingtasks/applica3ons.

§  Thoseservicescanaddressalltheaspectsofdataminingandinknowledgediscoveryprocesses•  dataselec3onandtransportservices,

•  dataanalysisservices,

•  knowledgemodelsrepresenta3onservices,and

•  knowledgevisualiza3onservices.

Servicesfordistributeddatamining

11

§  Dataminingtasksandapplica7onscanbeofferedashigh-levelservices.

§  AnewwaytodeliverydataanalysissoTwareis

dataanalysisasaservice(DAaaS)

Dataminingasaservice

12

§  Itispossibletodesignservicescorrespondingto

Single KDD Steps

All steps that compose a KDD process such as preprocessing, filtering, and visualization are expressed as services.

Single Data Mining Tasks

Here are included tasks such as classification, clustering, and association rules discovery.

Distributed Data Mining Patterns

This level implements, as services, patterns such as collective learning, parallel classification and meta-learning models.

Data Mining Applications and KDD processes This level includes the previous tasks and patterns composed in multi-step workflows.

Servicesfordistributeddatamining

13

§  Thiscollec7onofdataminingservicesimplementsan

OpenServiceFrameworkforDistributedDataMining

Data Mining Task Services

Distributed Data Mining patterns

Distributed Data Mining Applications and KDD

processes

KDD Step Services

Servicesfordistributeddatamining

14

§  Itallowsdevelopers toprogramdistributedKDDprocessesasacomposi3onofsingleand/oraggregatedservicesavailableoveraservice-orientedinfrastructure.

§  ThoseservicesshouldexploitotherbasicCloudservicesfordatatransfer,replicamanagement,dataintegra7onandquerying.

A Service-oriented Cloud workflow

Servicesfordistributeddatamining

15

§  By exploi7ng the Cloud services features it is possible todevelop data mining services accessible every 3me andeverywhere(remotelyandfromsmalldevices).

§  This approach can produce not only service-based distributeddataminingapplica7ons,butalso

§  Dataminingservicesforcommuni3es/virtualorganiza7ons.

§  Distributeddataanalysisservicesondemand.

§  A sort of knowledge discovery eco-system made by a large

numbersofdecentralizeddataanalysisservices.

DataanalysisonClouds:Systems

[16]

§  Spark-anopen-sourceframeworkforin-memorydataanalysisandmachinelearningdevelopedatUCBerkeley.

§  Sphere/Sector-acomputeservicebuiltontopofSectortoprovideasetofprogramminginterfacesfordistributeddataanalysisapplica7ons.

§  DMCF–theDataMiningCloudFrameworksuppor7ngCloud-baseddataanalysisappsasvisualandscript-basedworkflows.

§  SwiS/T–aworkflow-basedsystemusingfunc7onaldata-driventaskparallelismforscalingdata-intensiveapps.

§  Others:Mahout,HPC-ABDS,CloudFlows,...&commercialsystems.

TheDataMiningCloudFramework

TheDataMiningCloudFramework

§  TheDataMiningCloudFrameworksupportsworkflow-basedKDD applica;ons, expressed (visually and by a scriptlanguage) as a graphs that link together data sources, dataminingalgorithms,andvisualiza7ontools.

18

CoverType

Sample1

Sample3

Sample2

J48

J48

J48

Sample4

Model1

Model2

Model3 Splitter Voter Model

train

train

train

T2

T3

T4 T5 T1

Architecturecomponents

§  Computeisthecomputa7onalenvironmenttoexecuteCloudapplica7ons:§  Webrole:Web-basedapplica7ons.§  Workerrole:batchapplica7ons.§  VMrole:virtualmachineimages.

§  Storageprovidesscalablestorageelements:§  Blobs:storingbinaryandtextdata.§  Tables:non-rela7onaldatabases.§  Queues:communica7onbetweencomponents.

§  Fabriccontrollerlinksthephysicalmachinesofasingledatacenter:§  ComputeandStorageservicesarebuiltontopofthiscomponent.

19

Compute

Cloud Platform

Website

Web Role instances

Worker

Worker Role instances

Queues

Task Queue

Tables

Task Status Table Storage

Fabric

Blobs

Input datasets

Data mining models

TheDataMiningCloudFramework:Architecture

20

TheDataMiningCloudFramework:Execu3on

21

TheDataMiningCloudFramework:Mapping

22

Script-basedworkflows:JS4Cloud

Script-basedworkflows

§  WeextendedDMCFaddingascript-baseddataanalysisprogrammingmodelasamoreflexibleprogramminginterface.

§  Script-basedworkflowsareaneffec7vealterna7vetographicalprogramming.

§  Ascriptlanguageallowsexpertstoprogramcomplexapplica7onsmorerapidly,inamoreconcisewayandwithhigherflexibility.

§  Theideaistoofferascript-baseddataanalysislanguageasanaddi3onalandmoreflexibleprogramminginterfacetoskilledusers.

24

TheJS4Cloudscriptlanguage

§  JS4Cloud(JavaScriptforCloud)isalanguageforprogrammingdataanalysisworkflows.

§  MainbenefitsofJS4Cloud:§  itisbasedonawellknownscrip7nglanguage,sousersdonothaveto

learnanewlanguagefromscratch;

§  itimplementsadata-driventaskparallelismthatautoma7callyspawnsready-to-runtaskstotheavailableCloudresources;

§  itexploitsimplicitparallelismsoapplica7onworkflowscanbeprogrammedinatotallysequen7alway(nouserdu?esforworkpar??oning,synchroniza?onandcommunica?on).

25

JS4Cloudfunc3ons

JS4Cloudimplementsthreeaddi7onalfunc7onali7es,implementedbythesetoffunc7ons:§  Data.get,foraccessingoneoracollec7onofdatasetsstoredintheCloud;§  Data.define, fordefiningnewdataelementsthatwillbecreatedat

run7measaresultofatoolexecu7on;§  Tool, toinvoketheexecu7onofasoTwaretoolavailableintheCloudasa

service.

26

Script-basedapplica3ons

§  Code-definedworkflowsarefullyequivalenttographically-definedones:

Data.define Data.define

Data.define

Data.define

JS4CloudpaXerns

var DRef = Data.get("Customers");var nc = 5;var MRef = Data.define("ClustModel");K-Means({dataset:DRef, numClusters:nc, model:MRef});

Single task

28

JS4CloudpaXerns

Pipeline

var DRef = Data.get("Census");var SDRef = Data.define("SCensus");Sampler({input:DRef, percent:0.25, output:SDRef});var MRef = Data.define("CensusTree");J48({dataset:SDRef, confidence:0.1, model:MRef});

29

JS4CloudpaXerns

Data partitioning

var DRef = Data.get("CovType");var TrRef = Data.define("CovTypeTrain");var TeRef = Data.define("CovTypeTest");PartitionerTT({dataset:DRef, percTrain:0.70, trainSet:TrRef, testSet:TeRef});

30

JS4CloudpaXerns

var DRef = Data.get("NetLog");var PRef = Data.define("NetLogParts", 16);Partitioner({dataset:DRef, datasetParts:PRef});

Data partitioning

31

JS4CloudpaXerns

var M1Ref = Data.get("Model1");var M2Ref = Data.get("Model2");var M3Ref = Data.get("Model3");var BMRef = Data.define("BestModel");ModelChooser({model1:M1Ref, model2:M2Ref, model3:M3Ref, bestModel:BMRef});

Data aggregation

32

JS4CloudpaXerns

var MsRef = Data.get(new RegExp("^Model"));var BMRef = Data.define("BestModel");ModelChooser({models:MsRef, bestModel:BMRef});

Data aggregation

33

JS4CloudpaXerns

var TRef = Data.get("TrainSet");var nMod = 5;var MRef = Data.define("Model", nMod);var min = 0.1;var max = 0.5;for(var i=0; i<nMod; i++) J48({dataset:TRef, model:MRef[i], confidence:(min+i*(max-min)/(nMod-1))});

Parameter sweeping

34

JS4CloudpaXerns

var nMod = 16;var MRef = Data.define("Model", nMod);for(var i=0; i<nMod; i++) J48({dataset:TsRef[i], model:MRef[i], confidence:0.1});

Input sweeping

35

Parallelismexploita3on

36

Monitoringinterface

§  Asnapshotoftheapplica7onduringitsexecu7onmonitoredthroughtheprogramminginterface.

37

Exampleapplica3ons(1)

Finance: Predic7on of personalincomebasedoncensusdata

Networks: Discovery ofnetwork a]acks from loganalysis.

E-Health: Disease classifica7onbasedongeneanalysis

38

Exampleapplica3ons(2)

Smart City: Car trajectory pa]erndetec7onapplica7ons.

Biosciences:drugmetabolismassocia7onsinpharmacogenomics.

39

KDDCup99example

•  Input dataset: 46 million tuples•  Used Cloud: up to 64 virtual servers (single-core

1.66 GHz CPU, 1.75 GB of memory, and 225 GB of disk)

40

Turnaroundandspeedup

107 hours (4.5 days)

2 hours

7.6

50.8

41

Efficiency

0.96

0.8

0.9

42

Anotherapplica3onexample

•  Ensemblelearningworkflow(geneanalysisforclassifyingcancertypes)

•  Turnaround7me:162minuteson1server,11minuteson19servers.•  Speedup:14.8

43

TrajectorypaXerndetec3on

§  Analyzetrajectoriesofmobileuserstodiscovermovementpa]ernsandrules.

§  Aworkflowthatintegratesfrequentregions

detec7on,trajectorydatasynthe7za7onandtrajectorypa]ernextrac7on.

44

45

FrequentRegionsDetec3on•  Detectareasmoredenselypassedthrough•  Density-based clustering algorithm (DB-

Scan)•  Furtheranalysis:movementthroughareas

TrajectoryDataSynthe3za3on•  each point is subs7tuted by the dense region it

belongsto.•  trajectory representa7ons is changed from

movements between points into movementsbetweenfrequentregions

TrajectoryPaXernExtrac3on•  Discovery of pa]erns from structured

trajectories•  T-Apriori algorithm, i.e. ad-hoc modified

versionofApriori

Applica3onmainsteps

Applica3onworkflow

46

47

Workflowimplementa3on

§  DMCFvisualworkflowimplemen7ngthetrajectorypa]erndetec7onalgorithm§  Eachnoderepresentseitheradatasourceoradataminingtool§  Eachedgerepresentsanexecu7ondependencyamongnodes§  Somenodesarelabeledbythearraynota7on

§  Compactwaytorepresentmul7pleinstancesofthesamedatasetortool§  Veryusfeultobuildcomplexworkflows(data/taskparallelism,parametersweeping,etc.)

128paralleltasks!

DiscovereddenseregionsontheBeijingmap

49

n  vsseveraldatasizes(upto1287mestamps),fordifferentnumberofservers

w  itpropor7onallyincreaseswiththetheinputsize

w  itpropor7onallydecreaseswiththeincreaseofcompu7ngresources

n  vsthenumberofservers(upto64),fordifferentdatasizes

w  comparisonparallel\sequen7alexecu7on

w  D16(D128):itreducesfrom8.3(68)hourstoabout0.5(1.4)hours

Experimentalevalua3on

Turnaround3me

68 hours (3 days)

1.4 hours

49

50

n  speed-up

w  notabletrend,uptothecaseof16nodes

w  goodtrendforhighernumberofnodes(influenceofthesequen7alsteps)

n  scale-up

w  comparable7meswhendatasizeand

#serversincreasepropor7onallyw  DBSCANstep(parallel)takesmostofthe

total7mew  othersteps(sequen7al)increaseswith

largerdatasets

Experimentalevalua3on

Scalabilityindicators

14.5

49.3

50

FinalRemarks

Finalremarks

§  Dataminingandknowledgediscoverytoolsareneededtosupportfindingwhatisinteres3ngandvaluableinBigData.

§  Cloudcompu7ngsystemscaneffec7velybeusedasscalableplaiormsforservice-orienteddatamining.

§  Designandprogrammingtoolsareneededforsimplicityandscalabilityofcomplexdataanalysisprocesses.

§  TheDMCFanditsprogramminginterfacessupportusersinimplemen7ngandrunningscalabledatamining.

52

Bigisquiteamovingtarget?

53

ü Yes,butnotonly for the increasingsizeofdata.

ü Bigisamisleadingterm.

ü  It must include complexity (anddifficulty)ofhandlinghugeamountsofheterogeneousdata.

ü  Inthefirstreport(2001)onBigDatathetermBigwasnotactuallyused.

Bigisquiteamovingtarget?Someexample

54

§  Somedatachallengesexampleswefacetoday§  Scien7ficdataproducedatarateofhundredsofgigabits-per-secondthatmustbestored,filteredandanalyzed.

§  Tenmillionsofimagesperdaythatmustbemined(analyzed)inparallel.

§  Onebillionoftweets/postsqueriedinreal-7meonanin-memorydatabase.

Arethechallengesfacedtodaydifferentfromthechallengesfaced10yearsago?

55

§  Yes, because data sources aremuchmorethan10yearsago.

§  Yes, because we want to solvemorecomplexproblems.

§  No,becauseindataminingwes7llwork on data produced withdifferentgoals.

§  No, because computers wereinventedtoprocessdataquickly.

IsSize/Volumethemostimportantissueindataanalysis?

56

§  No , volume is only onedimensionoftheproblem.

§  The most important issue isValue.

§  Size and complexity representthe problem,Value is the realbenefit.

[57]

58

59

[60]

61

Smartalgorithmsandscalablesystems

62

•  HPCsystemsandCloudsequippedwithdataanalysistoolsarebecomingthemostused(anduseful)plaiormsforBigDataanalysis.

•  Newwaystoefficientlycomposedifferentdistributeddataminingmodelsandtoolsareneededfornewplaiorms.

•  MAINISSUES:Dataminingalgorithms,toolsandapplica7onsmustbedesignedandportedonsuchplaiormsfordevelopingextremedatadiscoverysolu7ons.

ScalableDataanalysis:OpenResearchIssues

63

•  Programmingabstractsforbigdataanaly?cs.TheMapReduceandtheworkflowmodelsareoTenusedonHPCandclouds,butmoreresearchworkisneededtodevelopotherscalable,adap7ve,general,higher-levelabstractprogrammingstructures&tools.

•  Dataandtoolintegra?onandopenness.Codecoordina7onanddataintegra7onaremainissuesinlarge-scaleapplica7onsthatusedataandcompu7ngresources.Standardformats,dataexchangemodelsandcommonAPIsareneeded.

•  Interoperabilityofbigdataanaly?csframeworks.Cloudserviceparadigmsmustbedesignedtoallowworldwidefedera7onandintegra7onofmul7pledataanaly7csframeworksandservices.

Scalabledataanalysis:OpenResearchIssues

64

•  Exascalecompu?ngsystemsrepresentthenextcompu?ngstepalso(inpar7cular)intheBigDatafield.

•  Thosesystemsraiseveryimportantresearchchallengesunderinves7ga7on

withthegoalof

•  Buildingscalablehw/swsystemscomposedofaverylargenumberofmul7-coreprocessorsexpectedtodeliveraperformanceof1018opera7onspersecond.

… the potential interoperability and scaling convergence of HPC computing and data analysis is crucial to the future.

D.A. Reed & J. Dongarra, CACM 2015

Ongoing&futurework

§  DtoKLabisastartupthatoriginatedfromourworkinthisarea.

www.dtoklab.com

§  Nuby3csisdeliveredonpubliccloudsasahigh-performanceSoTware-as-a-Service(SaaS)tosupportinnova7vedataanalysistoolsandapplica7ons.

§  Applica7onsintheareaofsocialdataanalysis,urbancompu3ng,airtrafficandothershavebeendevelopedbyJS4Cloud.

65

Somepublica3ons

§  D.Talia,P.Trunfio,F.Marozzo,DataAnalysisintheCloud,Elsevier,USA,2015.

§  D. Talia, P. Trunfio,Service-oriented distributedknowledgediscovery,CRCPress,USA,2012.

§  D.Talia,“CloudsforScalableBigDataAnaly3cs”,IEEEComputer,46(5),pp.98-101,2013

§  F. Marozzo, D. Talia, P. Trunfio, "A WorkflowManagement System for Scalable Data Miningon Clouds”, IEEE Transac?ons on ServiceCompu?ng,2016(toappear).

§  L. Belcastro, F. Marozzo, D. Talia, P. Trunfio,"UsingScalableDataMiningforPredic3ngFlightDelays”, ACM Transac?ons on IntelligentSystemsandTechnology(TIST),vol.8no.1,July2016.

66

Thanks

Q?Credits:P.Trunfio,E.Cesario,F.Marozzo