multiplatform solution for graph datasources


Upload: javier-dominguez-montes

Post on 09-Apr-2017


TRANSCRIPT

17 NOV 2016 @ BIG DATA SPAIN

@StratioBD

MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES

Multiplatform Spark solution for Graph datasources, Stratio

Javier Domínguez

JavierDominguezMontes

CTO SKILLS

PROFILE

JAVIER DOMÍNGUEZ

Studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies and is currently part of the Data Science team at Stratio Big Data, working with ML algorithms and profiling analysis based around Spark.

INDEX

1 INTRODUCTION

2 MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES

• Graph use cases

• Data Stores

• Machine learning

3 DEMO

• Dataset

• Main process explanation

• Notebooks show off

• Business example

4 THE END

• Results

• What's next?

INTRODUCTION


CUSTOMER DATA WILL GROW OVER 100X

[Chart: customer data volume over time. Time axis: 80's, 2000, 2010, 2015, 2020. Volume labels: 20GB-100GB, 500GB-2TB, 4TB-8TB, 100TB, >10PB]

VALUE IS THE DATA

VALUE IS UNDERSTANDING THE DATA

DO NOT STAY ON THE SURFACE OF KNOWLEDGE

MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES

• Graph use cases

• Data Stores

• Machine learning


Example of how to exploit a massive database from different stages and through several graph technologies

MACHINE LEARNING LIFE CYCLE WITH BIG DATA

Machine Learning life cycle

Show how a data scientist is able to take advantage of a graph database through different data sources and technologies thanks to our solution.

Use a massive dataset as an example.

Query the data source from different technologies like:

• GraphX

• GraphFrames

• Neo4j

And finally apply Machine Learning over our information!

BIG DATA SPAIN USE CASE

USE CASES


Batch queries

Making use of a massive graph data source implies running batch queries over it. We will need to run them with our distributed technologies... The easier the better.

Motifs filter example

import org.apache.spark.sql.DataFrame
import org.graphframes._

// GraphFrame is built from DataFrames: usersDf needs an "id" column,
// relationshipsDf needs "src" and "dst" columns.
val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)

// Search for two-hop paths: person_1 is related to person_2, who is linked to a technology.
val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'")
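To make the motif semantics concrete without a Spark cluster, the same two-hop pattern can be sketched on plain Scala collections. This is only an illustration: the `Edge` case class, the `findMotif` helper and the sample data are hypothetical and are not part of GraphFrames. As in the motif, the edge names only bind the two edges; they do not filter by edge type.

```scala
// Plain-Scala illustration of the motif
//   (person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)
// Edge, findMotif and the sample data are hypothetical, for illustration only.
object MotifSketch {
  final case class Edge(src: String, dst: String)

  // Every (person_1, person_2, technology) triple such that one edge goes
  // from person_1 to person_2 and another edge goes from person_2 to technology.
  def findMotif(edges: Seq[Edge]): Seq[(String, String, String)] =
    for {
      e1 <- edges
      e2 <- edges if e2.src == e1.dst
    } yield (e1.src, e1.dst, e2.dst)

  val sample: Seq[Edge] = Seq(
    Edge("Javier", "Ana"),
    Edge("Ana", "Neo4j"),
    Edge("Ana", "Spark"),
    Edge("Luis", "Javier")
  )

  // Equivalent of the filter step: keep the paths that start at Javier and end at Neo4j.
  val javierToNeo4j: Seq[(String, String, String)] =
    findMotif(sample).filter { case (p1, _, tech) => p1 == "Javier" && tech == "Neo4j" }
}
```

On a real dataset the nested loop is what Spark distributes as a join, which is why the motif is written declaratively instead.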

Online queries

Most of our clients or teammates will need fast and easy access to the information. We need a way to make easy queries and, of course, a graphical representation of our data!

We also need microservices, such as REST operations over our data store.


DATASTORES

Spark

Apache Spark is a fast and general-purpose engine for large-scale data processing.

GraphX

Spark API for the management and distributed computation of graphs. It comes with a great variety of graph algorithms:

• Connected components

• PageRank

• Triangle count

• SVD++
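As an illustration of the first algorithm in that list, here is a minimal single-machine sketch of connected components in plain Scala. It is not the GraphX implementation; it only reproduces the labelling convention GraphX uses, where every vertex ends up tagged with the lowest vertex id in its component. The object and function names are hypothetical.

```scala
// Single-machine sketch of connected components by label propagation.
// Names are hypothetical; this is not the GraphX code.
object ConnectedComponentsSketch {
  // Labels every vertex with the smallest vertex id in its component,
  // treating edges as undirected.
  def connectedComponents(vertices: Set[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Make sure edge endpoints are known vertices as well.
    val allVertices: Set[Long] = vertices ++ edges.flatMap { case (a, b) => Seq(a, b) }

    val neighbours: Map[Long, Set[Long]] =
      edges
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // Each round, a vertex adopts the smallest label among itself and its neighbours.
    @annotation.tailrec
    def propagate(labels: Map[Long, Long]): Map[Long, Long] = {
      val next = labels.map { case (v, label) =>
        val candidates = neighbours.getOrElse(v, Set.empty[Long]).map(labels)
        v -> (candidates + label).min
      }
      if (next == labels) labels else propagate(next)
    }

    propagate(allVertices.map(v => v -> v).toMap)
  }
}
```

For example, with vertices 1 to 6 and edges 1-2, 2-3 and 4-5, vertices 1, 2 and 3 share label 1, vertices 4 and 5 share label 4, and the isolated vertex 6 keeps its own id.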

GraphFrames

It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.


Neo4j

Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities. Big data alone used to be enough, but enterprise leaders need more than just volumes of information to make bottom-line decisions. You need real-time insights into how data is related.


MACHINE LEARNING


It's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results, even on a very large scale. The result? High-value predictions that can guide better decisions and smart actions in real time without human intervention.

Machine learning

SVD

It will relate all the existing objects in our dataset and infer possible new behaviors.

DEMO

• Dataset

• Main process explanation

• Notebooks show off


STRATIO INTELLIGENCE

Integration of different Open Source libraries of distributed machine learning algorithms.

Development environment adapted to each data scientist.

Real-time decisions based on machine learning models.

Integrated with all components of the Stratio Big Data Platform.

Comprehensive knowledge life cycle management.

DATASET

Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively.

This model is based on the idea of converting the declarations of the resources into expressions of the form subject-predicate-object, which are called triplets.

Subject: the resource, what we are describing.

Predicate: could be a property or a relationship with the object value.

Object value: the property's value or the related subject.

<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .

<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
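A minimal sketch of how lines in this illustrative notation could be turned into subject-predicate-object values in Scala. The `Triple` case class and the `parse` helper are hypothetical, and the regular expression only covers the simplified syntax shown on the slide, not the real Freebase RDF dump format.

```scala
// Parser for the simplified triplet notation shown above.
// Triple and parse are hypothetical names; real RDF dumps need a proper parser.
object TripleSketch {
  final case class Triple(subject: String, predicate: String, obj: String)

  // Matches: <'subject'> <'predicate'> object .
  // where the object is either a quoted string or a bare token such as a number.
  private val Line = """^<'([^']*)'>\s+<'([^']*)'>\s+(?:'([^']*)'|(\S+))\s*\.$""".r

  def parse(line: String): Option[Triple] = line.trim match {
    case Line(s, p, quoted, bare) =>
      // Exactly one of the two object groups matched; the other one is null.
      Some(Triple(s, p, if (quoted != null) quoted else bare))
    case _ => None
  }
}
```

Each parsed triple can then become either a vertex property (when the object is a literal, like 61) or an edge (when the object is itself a subject).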

Freebase / Google

Total triplets: 1.9 billion


PROCESS EXPLANATION


[Process diagram. Labels: RDF dataset; Cast; Transforms; GraphFrames (batch query); Extracts sample & transforms; GraphX; Neo4j (online query); K-core; Prune; Strongly connected graph; Apply algorithms; SVD; Behavior inference; Graph; Subject equality]


A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k.

Objective

Remove all nodes with fewer than K connections. At the end, we want only the most representative and connected elements in our graph. In our use case we used K=5.
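The pruning described in the definition can be sketched in a few lines of plain Scala. This is a single-machine illustration with hypothetical names, not the distributed job used in the demo: it repeatedly drops every edge touching a vertex whose degree is below k until nothing changes, and returns the surviving edges.

```scala
// Single-machine sketch of k-core pruning. Names are hypothetical.
object KCoreSketch {
  // Returns the undirected edges that survive after repeatedly deleting
  // all vertices of degree less than k. Self-loops and duplicates are ignored.
  def kCore(edges: Set[(Int, Int)], k: Int): Set[(Int, Int)] = {
    // Store each undirected edge once, as (smaller id, larger id).
    val normalised = edges.collect {
      case (a, b) if a != b => (math.min(a, b), math.max(a, b))
    }
    prune(normalised, k)
  }

  @annotation.tailrec
  private def prune(edges: Set[(Int, Int)], k: Int): Set[(Int, Int)] = {
    // Degree of every vertex in the current graph.
    val degree: Map[Int, Int] =
      edges.toSeq
        .flatMap { case (a, b) => Seq(a, b) }
        .groupBy(identity)
        .map { case (v, occurrences) => v -> occurrences.size }
    // Keep an edge only if both endpoints still have degree >= k.
    val kept = edges.filter { case (a, b) => degree(a) >= k && degree(b) >= k }
    if (kept == edges) edges else prune(kept, k)
  }
}
```

Removing a vertex lowers the degree of its neighbours, which is why a single pass is not enough and the pruning has to iterate until it reaches a fixed point.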

K-Core process


NOTEBOOKS SHOW OFF

BUSINESS EXAMPLE

Jaccard Graph Clustering

Node clusterization based on concrete relations, optimized for Big Data environments.

We've developed a straightforward functionality which is able to detect patterns and cluster data in a graph database thanks to daily machine learning processes.
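The slides do not show how the clustering is implemented, so the following is only a sketch of the underlying measure under stated assumptions: the Jaccard index of two sets is the size of their intersection divided by the size of their union, and node pairs whose index reaches a threshold are treated as similar. The names `jaccardIndex` and `similarPairs`, the threshold and the input shape are all hypothetical.

```scala
// Sketch of Jaccard-based similarity between nodes. Names, threshold and
// input shape are assumptions; the production pipeline is not shown in the deck.
object JaccardSketch {
  // Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty.
  def jaccardIndex[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = a.union(b).size
    if (unionSize == 0) 0.0 else a.intersect(b).size.toDouble / unionSize
  }

  // All unordered node pairs whose related-item sets reach the threshold.
  def similarPairs(related: Map[String, Set[String]], threshold: Double): Set[(String, String)] = {
    val nodes = related.keys.toSeq.sorted
    (for {
      i <- nodes.indices
      j <- (i + 1) until nodes.size
      if jaccardIndex(related(nodes(i)), related(nodes(j))) >= threshold
    } yield (nodes(i), nodes(j))).toSet
  }
}
```

The Jaccard distance is one minus this index. Feeding the similar pairs into a connected-components step as edges would group the nodes into clusters, which is one plausible reading of the "Jaccard Indexation" and "Connected Components" labels on this slide.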

Neo4j

Scala Graph functionalities

Jaccard Indexation

Connected Components

Java

HDFS / Parquet

Spark / GraphX

40B Jaccard distance calculations in the daily process

400K-node graph clustering

BANK USE CASE

THE END

• Results

• What's next?


WHAT'S NEXT?

Semantic search engine

Include Elasticsearch to run text searches, as a search engine does.

Apply more Machine Learning algorithms

• Connected components: As we've already done, try to cluster information thanks to its relationships.

• PageRank: Measure the importance of a subject.

• Triangle counting: Check possible triangle relationships inside our dataset to avoid redundancy.
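As a small illustration of the triangle counting item, here is a single-machine sketch in plain Scala. It is not the GraphX implementation, and the names are hypothetical. Each undirected edge is oriented from the smaller to the larger vertex id, so every triangle is counted exactly once, from the edge joining its two smallest vertices.

```scala
// Single-machine sketch of triangle counting. Names are hypothetical.
object TriangleSketch {
  // Counts the distinct triangles in an undirected graph given as an edge list.
  def countTriangles(edges: Seq[(Int, Int)]): Int = {
    // Keep each undirected edge once, oriented from the smaller to the larger id.
    val oriented: Set[(Int, Int)] =
      edges.collect { case (a, b) if a != b => (math.min(a, b), math.max(a, b)) }.toSet
    // For every vertex, its neighbours with a larger id.
    val higher: Map[Int, Set[Int]] =
      oriented.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
    // A triangle a < b < c is found once: from edge (a, b), with c in both sets.
    oriented.toSeq.map { case (a, b) =>
      higher(a).intersect(higher.getOrElse(b, Set.empty[Int])).size
    }.sum
  }
}
```

A complete graph on four vertices contains four triangles, and adding a dangling edge does not change the count.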

New Graph use cases

• Fraud detection

• Recommendation system

• Profiling

THANK YOU

UNITED STATES

Tel: (+1) 4085998830

EUROPE

Tel: (+34) 918286473


www.stratio.com

@StratioBD

WE ARE HIRING