Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
TRANSCRIPT
![Page 1: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/1.jpg)
Traversing our way through Apache Spark GraphFrames and GraphX
Mo Patel
Data Day Texas 2017
![Page 2: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/2.jpg)
A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things
@mopatel
![Page 3: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/3.jpg)
What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
![Page 4: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/4.jpg)
What is a Graph?
Natural | Artificial
Image sources: Wikipedia
![Page 5: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/5.jpg)
Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
![Page 6: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/6.jpg)
Power of Graphs
• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut…
“Nobody goes there anymore. It's too crowded” – Yogi Berra
![Page 7: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/7.jpg)
Graph Databases cost money, Graph Analytics make money!
• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s Law (2^n)
• Memory Intensive
• Processing Intensive
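The growth rates named on this slide can be compared with a few lines of plain Scala. This is only an illustrative sketch: the object and function names are made up for this example, and the laws are taken in their simplest form (n^2 potential connections, 2^n possible subgroups).

```scala
object NetworkGrowth {
  // Metcalfe's law: potential pairwise connections grow on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n) * BigInt(n)

  // Reed's law: possible subgroups grow on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
}
```

Even at n = 40 the gap between the two is many orders of magnitude, which is one way to see why graph workloads become memory and processing intensive so quickly.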
![Page 8: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/8.jpg)
Graph Databases cost money, Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA
![Page 9: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/9.jpg)
Node Score in a Graph
• Use case: Find out how important an entity is in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
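To make the node-score idea concrete, here is a minimal PageRank sketch in plain Scala using power iteration. It is not the Spark, Aster, or iGraph implementation. The names `PageRankSketch` and `pageRank` are hypothetical, and the sketch assumes every edge endpoint appears in `nodes`. Ranks here are normalized to sum to 1, which may differ from the scaling a given library reports.

```scala
object PageRankSketch {
  // Power iteration with a uniform teleport; dangling nodes spread their mass evenly.
  def pageRank(nodes: Seq[String], edges: Seq[(String, String)],
               damping: Double = 0.85, iterations: Int = 50): Map[String, Double] = {
    val n = nodes.size
    // Outgoing neighbors per source node.
    val outLinks = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    var ranks = nodes.map(_ -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      // Rank held by nodes without outgoing edges.
      val danglingMass = nodes.filterNot(outLinks.contains).map(ranks).sum
      // Each source splits its rank equally among its outgoing edges.
      val incoming = edges
        .map { case (src, dst) => dst -> ranks(src) / outLinks(src).size }
        .groupBy(_._1)
        .map { case (dst, contribs) => dst -> contribs.map(_._2).sum }
      ranks = nodes.map { v =>
        v -> ((1 - damping) / n + damping * (incoming.getOrElse(v, 0.0) + danglingMass / n))
      }.toMap
    }
    ranks
  }
}
```

For a small star graph in which `a`, `b`, and `c` all point at `hub`, calling `PageRankSketch.pageRank(Seq("hub", "a", "b", "c"), Seq("a" -> "hub", "b" -> "hub", "c" -> "hub"))` gives `hub` the highest score, which is the "important entity" signal the use cases above rely on.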
![Page 10: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/10.jpg)
Communities in a Graph
• Use case: Detect similar nodes
– Behavioral Segmentation
– Crime Rings
– Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
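As one example of the community measures listed above, the local clustering coefficient of a node is the fraction of its neighbor pairs that are themselves connected. The following plain-Scala sketch treats edges as undirected. The names `ClusteringSketch`, `adjacency`, and `localClustering` are hypothetical and not taken from any of the tools mentioned.

```scala
object ClusteringSketch {
  // Undirected adjacency sets built from an edge list.
  def adjacency(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Fraction of a node's neighbor pairs that are linked to each other.
  def localClustering(adj: Map[String, Set[String]], v: String): Double = {
    val nbrs = adj.getOrElse(v, Set.empty[String]).toSeq
    val k = nbrs.size
    if (k < 2) 0.0
    else {
      val linked = nbrs.combinations(2).count(pair => adj(pair(0)).contains(pair(1)))
      2.0 * linked / (k * (k - 1))
    }
  }
}
```

A node whose neighbors all know each other scores 1.0, while a node that bridges otherwise unconnected neighbors scores close to 0.0, so tightly knit groups such as crime rings tend to show up as clusters of high-scoring nodes.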
![Page 11: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/11.jpg)
Growth in Graph
• Use case: Predict where the graph will grow or suggest new edges
– Event Prediction
– Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)
![Page 12: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/12.jpg)
GraphX
• Apache Spark library for conducting Graph Analytics
• Graph Operations: numEdges, numVertices, degrees, collectNeighbors
• Graph Analytics:
– PageRank
– Connected Components
– Triangle Count
http://spark.apache.org/graphx/
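To show what two of the analytics above compute, here is a small plain-Scala sketch of Connected Components and Triangle Count on an undirected graph. It does not use Spark or GraphX. All names (`GraphOpsSketch`, `adjacency`, `connectedComponents`, `triangleCount`) are hypothetical, and the sketch assumes every edge endpoint is listed in `vertices` and that there are no self-loops.

```scala
object GraphOpsSketch {
  type Adj = Map[Long, Set[Long]]

  // Undirected adjacency sets; isolated vertices get an empty set.
  def adjacency(vertices: Seq[Long], edges: Seq[(Long, Long)]): Adj = {
    val empty: Adj = vertices.map(_ -> Set.empty[Long]).toMap
    edges.foldLeft(empty) { case (adj, (a, b)) =>
      adj.updated(a, adj(a) + b).updated(b, adj(b) + a)
    }
  }

  // Label every vertex with the smallest vertex id reachable from it.
  def connectedComponents(adj: Adj): Map[Long, Long] = {
    var labels = adj.keys.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Each vertex adopts the smallest label among itself and its neighbors.
      val next = labels.map { case (v, l) => v -> (adj(v).map(labels) + l).min }
      changed = next != labels
      labels = next
    }
    labels
  }

  // Number of triangles each vertex participates in.
  def triangleCount(adj: Adj): Map[Long, Int] =
    adj.map { case (v, nbrs) =>
      v -> nbrs.toSeq.combinations(2).count(p => adj(p(0)).contains(p(1)))
    }
}
```

For vertices 1 to 5 with edges 1-2, 2-3, 3-1, and 4-5, vertices 1, 2, and 3 share component label 1 and each sits in one triangle, while 4 and 5 share label 4 and sit in none.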
![Page 13: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/13.jpg)
Property Graph
![Page 14: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/14.jpg)
GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
![Page 15: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/15.jpg)
GraphFrame
Vertices DataFrame
import org.graphframes.GraphFrame
// Vertices need an "id" column
val vertices = sqlContext.createDataFrame(List(
  ("a1", "Wine", "Beverage"), ("b2", "Beer", "Beverage"),
  ("c3", "Pretzel", "Snack"), ("d4", "Cheese", "Snack")
)).toDF("id", "name", "type")
Edges DataFrame
// GraphFrame expects the edge endpoint columns to be named "src" and "dst"
val edges = sqlContext.createDataFrame(List(
  ("a1", "d4", 15455), ("b2", "c3", 4849), ("a1", "c3", 40), ("b2", "d4", 134)
)).toDF("src", "dst", "count")
GraphFrame
val productsGraphFrame = GraphFrame(vertices, edges)
productsGraphFrame.vertices.filter("type = 'Snack'")  // string literal must be quoted
productsGraphFrame.edges.count()  // number of edges
![Page 16: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/16.jpg)
What is Synthetic Identity Fraud?
http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
![Page 17: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/17.jpg)
Why has Synthetic Identity Fraud emerged as a big problem?
Verafin
![Page 18: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/18.jpg)
How are Synthetic IDs created?
Verafin
![Page 19: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/19.jpg)
How are Financial Companies exploited?
Verafin
![Page 20: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/20.jpg)
What is the impact of Synthetic Identity Fraud?
Verafin
![Page 21: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/21.jpg)
How can Graph Analytics help solve the Synthetic Identity Problem?
Customer Address DataFrame
val customerAddresses = sqlContext.createDataFrame(List(
  ("a1", "123 Main Street", "123abc456efg"), ("b2", "345 High Street", "123abc456efg"), ("c3", "789 Park Ave", "123abc456efg")
)).toDF("id", "address", "customerid")
Add Fake Address
val fakeAddress = sqlContext.createDataFrame(List(
  ("d4", "999 Ocean Ave", "123abc456efg")
)).toDF("id", "address", "customerid")
val tempCustomerAddresses = customerAddresses.union(fakeAddress)
DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
![Page 22: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/22.jpg)
How can Graph Analytics help solve the Synthetic Identity Problem?
Master Address Connection Edges DataFrame
val masterAddressConnections = sqlContext.createDataFrame(List(
  ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4") // … (remaining pairs elided on the slide)
)).toDF("src", "dst")
// Keep connections whose destination, then source, is one of the customer's known address ids
val toEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")
val fromEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)
Detection GraphFrame
val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)
PageRank
// PageRank
val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank
val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()
![Page 23: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/23.jpg)
How do we decide if this address is fraud or not?
PageRank
id  pagerank
a1  0.9463535901944437
b2  0.9463535901944437
c3  0.9463535901944437
d4  0.15
Personalized PageRank
Source: a1
id  pagerank
a1  0.33343371928623045
c3  0.28341866139329586
b2  0.21580437563085933
d4  0.0
Source: b2
id  pagerank
b2  0.33343371928623045
a1  0.28341866139329586
c3  0.21580437563085933
d4  0.0
Source: c3
id  pagerank
c3  0.33343371928623045
b2  0.28341866139329586
a1  0.21580437563085933
d4  0.0
Source: d4
id  pagerank
d4  0.15
a1  0.0
b2  0.0
c3  0.0
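One possible way to turn the personalized scores on this slide into a decision is sketched below in plain Scala. The rule itself is an assumption made for illustration, not something stated in the talk: an address is flagged when no other address assigns it any personalized rank. The scores are the slide's values rounded to four decimals, and the names `FraudFlagSketch` and `isolated` are hypothetical.

```scala
object FraudFlagSketch {
  // Personalized PageRank scores from the slide (rounded), keyed by source id.
  val personalized: Map[String, Map[String, Double]] = Map(
    "a1" -> Map("a1" -> 0.3334, "c3" -> 0.2834, "b2" -> 0.2158, "d4" -> 0.0),
    "b2" -> Map("b2" -> 0.3334, "a1" -> 0.2834, "c3" -> 0.2158, "d4" -> 0.0),
    "c3" -> Map("c3" -> 0.3334, "b2" -> 0.2834, "a1" -> 0.2158, "d4" -> 0.0),
    "d4" -> Map("d4" -> 0.15, "a1" -> 0.0, "b2" -> 0.0, "c3" -> 0.0)
  )

  // Hypothetical rule: flag ids that receive zero rank from every other source.
  def isolated(scores: Map[String, Map[String, Double]]): Set[String] = {
    val ids = scores.keySet
    ids.filter(id => (ids - id).forall(src => scores(src).getOrElse(id, 0.0) == 0.0))
  }
}
```

With these scores, `FraudFlagSketch.isolated(FraudFlagSketch.personalized)` contains only `d4`, the fake address added earlier. In practice a threshold above zero and additional signals would likely be needed.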
![Page 24: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/24.jpg)
Future Directions and Thoughts
• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large scale Graph Analytics is still not scalable
![Page 25: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case](https://reader035.vdocuments.mx/reader035/viewer/2022081418/5888adaa1a28ab80248b5307/html5/thumbnails/25.jpg)
Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets