Multiplatform solution for graph datasources
17 NOV 2016 @ BIG DATA SPAIN
@StratioBD
Multiplatform Spark solution for graph datasources, Stratio
Javier Domínguez
JavierDominguezMontes
CTO SKILLS
PROFILE
JAVIER DOMÍNGUEZ
Studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies and is currently part of the Data Science team at Stratio Big Data, working with ML algorithms and profiling analysis based around Spark.
INDEX
1 INTRODUCTION
• Business example
2 MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
• Graph use cases
• Data stores
• Machine learning
3 DEMO
• Dataset
• Main process explanation
• Notebooks show off
4 THE END
• Results
• What's next?
[Figure: CUSTOMER DATA WILL GROW OVER 100X. Timeline axis: 80's, 2000, 2010, 2015, 2020. Data volume labels: 20GB-100GB, 500GB-2TB, 4TB-8TB, 100TB, >10PB.]
MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
• Graph use cases
• Data stores
• Machine learning
Example of how to exploit a massive database from different stages and through several graph technologies
MACHINE LEARNING LIFE CYCLE WITH BIG DATA
Machine Learning life cycle
Show how a data scientist is able to take advantage of a graph database through different data sources and technologies thanks to our solution.
Use a massive dataset as an example.
Query the data source from different technologies like:
• GraphX
• GraphFrames
• Neo4j
And finally apply Machine Learning over our information!
BIG DATA SPAIN USE CASE
USE CASES
Batch queries
Making use of a massive graph data source implies making batch queries over it. We will need to make them with our distributed technologies... The easier the better.
Motifs filter example
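import org.apache.spark.sql.DataFrame
import org.graphframes._

// A GraphFrame is built from DataFrames: the vertices need an "id" column, the edges "src" and "dst".
// usersDf and relationshipsDf are assumed to be DataFrame versions of the users and relationships data.
val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)

// Search for a person related to a second person who, in turn, has an ability in a technology
val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'")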
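To make the motif pattern concrete without a Spark cluster, here is a minimal sketch in plain Scala collections. It is not how GraphFrames evaluates motifs; the `Edge` case class, the sample names and the `twoHop` helper are hypothetical and exist only for illustration. As in the motif above, the names inside the brackets are just variable names for the matched edges, so no edge label is filtered.

```scala
object MotifSketch {
  // Hypothetical toy data, not taken from the talk's dataset.
  final case class Edge(src: String, dst: String, label: String)

  val edges: Seq[Edge] = Seq(
    Edge("Javier", "Ana", "relation"),
    Edge("Ana", "Neo4j", "abilities"),
    Edge("Ana", "Spark", "abilities"),
    Edge("Luis", "Ana", "relation")
  )

  // Matches (person_1)-[e1]->(person_2); (person_2)-[e2]->(technology)
  // by joining every edge with the edges that start where it ends.
  def twoHop(es: Seq[Edge]): Seq[(String, String, String)] = {
    val bySrc = es.groupBy(_.src)
    for {
      e1 <- es
      e2 <- bySrc.getOrElse(e1.dst, Seq.empty)
    } yield (e1.src, e1.dst, e2.dst)
  }

  def main(args: Array[String]): Unit = {
    // Same idea as the filter on person_1.name and technology.name.
    val matches = twoHop(edges).filter { case (p1, _, tech) => p1 == "Javier" && tech == "Neo4j" }
    matches.foreach(println)
  }
}
```

On a real dataset the join would be distributed, which is what the GraphFrames `find` call takes care of.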
Online queries
Most of our clients or teammates will need fast and easy access to the information. We need a way to make easy queries and, of course, a graphic representation of our data!
We also need microservices, such as REST operations over our data store.
Spark
Apache Spark is a fast and general-purpose engine for large-scale data processing.
GraphX
Spark API for the management and distributed calculation of graphs. It comes with a great variety of graph algorithms:
• Connected components
• PageRank
• Triangle count
• SVD++
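As an illustration of the first algorithm in that list, the sketch below computes connected components on a single machine with plain Scala collections. It mirrors the convention GraphX uses of labelling each vertex with the lowest vertex id in its component, but the `components` helper and its simple fixed-point loop are assumptions made for this example, not the GraphX implementation.

```scala
object ConnectedComponentsSketch {
  // Labels every vertex with the smallest vertex id in its component,
  // treating edges as undirected, by repeating a min-label exchange
  // along every edge until nothing changes.
  // Assumes every edge endpoint is present in `vertices`.
  def components(vertices: Set[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(labels(a), labels(b))
        if (labels(a) != m || labels(b) != m) {
          labels = labels.updated(a, m).updated(b, m)
          changed = true
        }
      }
    }
    labels
  }
}
```

Vertices that end up with the same label belong to the same cluster, which is the property the clustering use cases later in the deck rely on.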
GraphFrames
It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.
DATASTORES
Neo4j
Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities. Big data alone used to be enough, but enterprise leaders need more than just volumes of information to make bottom-line decisions. You need real-time insights into how data is related.
MACHINE LEARNING
It's possible to quickly and automatically produce models that can analyze bigger, more complex data and deliver faster, more accurate results, even on a very large scale. The result? High-value predictions that can guide better decisions and smart actions in real time without human intervention.
Machine learning
SVD
Will relate all the existing objects in our dataset and infer possible new behaviors.
STRATIO INTELLIGENCE
Integration of different open source libraries of distributed machine learning algorithms.
Development environment adapted to each data scientist.
Real-time decisions based on models built with machine learning algorithms
Integrated with all components of the Stratio Big Data Platform
Comprehensive knowledge life cycle management
Freebase aimed to create a global resource that allowed people (and machines) to access common information more effectively.
This model is based on the idea of converting the declarations of the resources into expressions of the form subject-predicate-object, which are called triplets.
Subject: It's the resource, what we are describing.
Predicate: Could be a property or a relationship with the object value.
Object value: The property's value or the related subject.
<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .
<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
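The two example lines above can be turned into subject-predicate-object values with a few lines of Scala. This is a minimal sketch that only understands the simplified notation used on the slide, not the real Freebase RDF dump format; the `Triple` case class and the `parse` helper are hypothetical names introduced for illustration.

```scala
object TripleSketch {
  final case class Triple(subject: String, predicate: String, obj: String)

  // Pattern for the slide's simplified notation: <'subject'> <'predicate'> object .
  private val Line = """^<'(.+?)'>\s+<'(.+?)'>\s+(.+?)\s*\.$""".r

  // Returns None for lines that do not follow the notation.
  def parse(line: String): Option[Triple] = line.trim match {
    case Line(s, p, o) => Some(Triple(s, p, o.stripPrefix("'").stripSuffix("'")))
    case _             => None
  }
}
```

Each parsed triple maps naturally onto a graph edge: the subject and the object value become vertices and the predicate becomes the edge label.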
Freebase, Google
Total triplets: 1.9 billion
DATASET
PROCESS EXPLANATION
[Process diagram. Recoverable labels: RDF dataset; Transforms (cast, subject equality); Graph; GraphFrames: batch query; Extracts sample & transforms; Neo4j: online query; GraphX; Prune (K-core, strongly connected graph); Apply algorithms (SVD); Behavior inference.]
K-Core process
A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k. Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all vertices of degree less than k.
Objective
Remove all nodes with fewer than K connections. At the end, we want only the most representative and connected elements in our graph. In our use case we used K=5.
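The "repeatedly deleting all vertices of degree less than k" formulation translates almost directly into code. The following is a single-machine sketch in plain Scala, not the distributed job used in the talk; `kCoreVertices` is a hypothetical helper, and it assumes an undirected graph given as a set of edges with no self-loops and each edge listed once.

```scala
object KCoreSketch {
  // Repeatedly deletes vertices with degree < k and returns the surviving
  // vertices (the union of all k-cores of the graph).
  def kCoreVertices(edges: Set[(Int, Int)], k: Int): Set[Int] = {
    @annotation.tailrec
    def prune(current: Set[(Int, Int)]): Set[(Int, Int)] = {
      // Degree of every vertex in the current edge set.
      val degree: Map[Int, Int] =
        current.toSeq
          .flatMap { case (a, b) => Seq(a, b) }
          .groupBy(identity)
          .map { case (v, occurrences) => v -> occurrences.size }
      // Keep an edge only if both endpoints still have degree >= k.
      val kept = current.filter { case (a, b) => degree(a) >= k && degree(b) >= k }
      if (kept.size == current.size) current else prune(kept)
    }
    prune(edges).flatMap { case (a, b) => Set(a, b) }
  }
}
```

For a triangle 1-2-3 with a pendant vertex 4 attached to 3, `kCoreVertices(Set((1, 2), (2, 3), (1, 3), (3, 4)), 2)` returns `Set(1, 2, 3)`: vertex 4 has degree 1 and is pruned, while the triangle survives.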
Jaccard Graph Clustering
Node clusterization based on concrete relations, optimized for Big Data environments.
We've developed a straightforward functionality which is able to detect patterns and clusterize data in a graph database thanks to daily machine learning processes.
[Architecture diagram: Neo4j; Scala graph functionalities (Jaccard indexation, connected components); Java; HDFS / Parquet; Spark / GraphX]
40B Jaccard distance calculations in each daily process
400K nodes graph clustering
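The slides do not show how the Jaccard calculation is implemented, so the following is only a sketch of the standard definition: the Jaccard index of two sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that index. Comparing nodes by their neighbour sets is an assumption made here for illustration, as are the helper names.

```scala
object JaccardSketch {
  // Jaccard index: |A ∩ B| / |A ∪ B| (defined here as 0.0 when both sets are empty).
  def jaccardIndex[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  // Jaccard distance is the complement of the index.
  def jaccardDistance[A](a: Set[A], b: Set[A]): Double = 1.0 - jaccardIndex(a, b)

  // Neighbour set of every node in an undirected edge list.
  def neighbours(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (node, pairs) => node -> pairs.map(_._2).toSet }
}
```

One plausible way to combine this with the connected components step listed in the diagram is to link node pairs whose distance falls under a threshold and treat each resulting component as a cluster; that combination is a guess, not something the slides confirm.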
BANK USE CASE
WHAT'S NEXT?
Semantic search engine
Include Elasticsearch to run text searches, as in a search engine.
Apply more Machine Learning algorithms
• Connected components: As we've already done, try to cluster information thanks to their relationships.
• PageRank: Measure the importance of a subject.
• Triangle counting: Check possible triangle relationships inside our dataset to avoid redundancy.
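Of the algorithms listed above, triangle counting is the easiest to show in a few lines. This is a single-machine sketch in plain Scala rather than the GraphX implementation; `triangleCount` is a hypothetical helper and the graph is treated as undirected.

```scala
object TriangleCountSketch {
  // Counts distinct triangles in an undirected graph given as an edge list.
  def triangleCount(edges: Seq[(Int, Int)]): Int = {
    // Normalise each edge so the smaller id comes first, dropping self-loops and duplicates.
    val canonical: Set[(Int, Int)] =
      edges.collect { case (a, b) if a != b => (math.min(a, b), math.max(a, b)) }.toSet
    // For each vertex, the neighbours with a larger id.
    val higher: Map[Int, Set[Int]] =
      canonical.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
    // A triangle a < b < c is found exactly once: from edge (a, b), look for c adjacent to both.
    canonical.toSeq.map { case (a, b) =>
      (higher.getOrElse(a, Set.empty[Int]) intersect higher.getOrElse(b, Set.empty[Int])).size
    }.sum
  }
}
```

Ordering the vertices of each edge is what keeps a triangle from being counted three times, once per edge.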
New graph use cases
• Fraud detection
• Recommendation system
• Profiling
THANK YOU
UNITED STATES
Tel: (+1) 408 599 8830
EUROPE
Tel: (+34) 91 828 64 73
www.stratio.com