Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans
Post on 11-Jan-2017
5
Some Numbers
$12Bn cumulative processed
200k+ merchants
14k events/second
7 risk analysts
9
Spark Streaming allows us to do real-time data processing.
We can decide which events need a closer look.
15
[Diagram: Kafka -> Receiver -> Spark Engine, on a timeline t0, t1, t2, t3. Events arrive continuously. The engine builds an RDD with the events from t0 to t1, then an RDD with the events from t1 to t2; each RDD is then processed.]
16
Problems w/ Receivers
• The only way to get at-least-once delivery makes it hard to deploy new code
• Zookeeper is updated with which offsets to start from when data is received, not when it is processed
• We’re actually duplicating Kafka
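The second bullet can be illustrated with a small, library-free simulation. Everything named here is hypothetical: `log` is a list standing in for one Kafka partition, `store` is a dict standing in for Zookeeper, and `crash_at` simulates a failure between receiving an event and processing it. Nothing below calls Kafka, Zookeeper, or Spark; it only shows why the moment at which the offset is recorded matters.

```python
def run(log, store, commit_on_receive, crash_at=None):
    """Consume `log` from the offset saved in `store`; return processed events."""
    processed = []
    offset = store.get("offset", 0)
    while offset < len(log):
        event = log[offset]
        if commit_on_receive:
            store["offset"] = offset + 1  # recorded as soon as data arrives
        if crash_at == offset:
            return processed              # simulated crash before processing
        processed.append(event)
        if not commit_on_receive:
            store["offset"] = offset + 1  # recorded only after processing
        offset += 1
    return processed


def run_with_restart(log, commit_on_receive):
    """Crash once at offset 1, then restart from whatever offset was saved."""
    store = {}
    first = run(log, store, commit_on_receive, crash_at=1)
    second = run(log, store, commit_on_receive)
    return first + second


if __name__ == "__main__":
    print(run_with_restart(["a", "b", "c"], commit_on_receive=True))
    print(run_with_restart(["a", "b", "c"], commit_on_receive=False))
```

When the offset is recorded on receipt, the event that was in flight during the crash is skipped after the restart; when it is recorded after processing, the restart picks that event up again.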
18
[Diagram: Kafka -> Spark Engine with no Receiver, on a timeline t0, t1, t2, t3. The engine processes an RDD with the events from t0 to t1, then an RDD with the events from t1 to t2.]
19
General Structure
• Load Kafka offsets from Zookeeper
• Tell Spark Streaming to create a DStream that consumes from Kafka, starting at the specified offsets
• Define your processing step (e.g. filter out non-risky events)
• Define your output step (e.g. POST the data to the case management software)
• Save Kafka offset of most recently processed event to Zookeeper
• Start your streaming application, and grab some popcorn!
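The steps above can be sketched as a plain-Python control loop. This is a minimal model under stated assumptions, not the talk's actual job: `log` is a list standing in for a Kafka topic, `store` is a dict standing in for Zookeeper, and `run_job`, `load_offset`, `save_offset`, `keep`, and `output` are hypothetical names. In the real stack the batches would be RDDs from Spark Streaming's direct (receiverless) Kafka stream rather than list slices.

```python
def load_offset(store, topic):
    """Step 1: read the saved offset (0 if none was ever saved)."""
    return store.get(topic, 0)


def save_offset(store, topic, offset):
    """Step 5: persist the offset of the next unprocessed event."""
    store[topic] = offset


def run_job(log, store, topic, keep, output, batch_size=2):
    """Process `log` in micro-batches, saving the offset after each batch."""
    offset = load_offset(store, topic)
    while offset < len(log):
        # Step 2: one micro-batch, starting at the saved offset.
        batch = log[offset:offset + batch_size]
        # Step 3: processing step (here: a caller-supplied filter).
        selected = [event for event in batch if keep(event)]
        # Step 4: output step (here: a caller-supplied sink).
        for event in selected:
            output(event)
        # Step 5: save only after the batch has been fully handled.
        offset += len(batch)
        save_offset(store, topic, offset)


if __name__ == "__main__":
    store = {"orders": 1}  # pretend the first event was already processed
    log = ["gucci", "nice shoes", "cannabis", "sweet bag", "pharmacy"]
    sent = []
    run_job(log, store, "orders", lambda e: e != "nice shoes", sent.append)
    print(sent, store)
```

Saving the offset at the end of each batch, rather than when the data arrives, is the design choice that separates this structure from the receiver-based one.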
20
Example Filtering: Risky Products
hair extensions
pharmacy
vaporizer
gateway card
wifi pineapple
iPhone
gucci
cannabis
travel package
21
Risky Products
RDD for Time 0: hair extensions, gucci, cannabis, sweet bag, nice shoes, taylor swift t-shirt
Filter: hair extensions, gucci, cannabis
Map: {“title”: “hair exten…}, {“title”: “gucci”…}, {“title”: “cannab…}
HTTPS Post: Case Management Software
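The filter, map, and post stages on this slide can be sketched with three small functions. Several details are assumptions, since the slides do not give them: the matching rule (case-insensitive substring match against the risky-product list), the payload containing only a `title` field (the slide truncates the JSON), and the names `is_risky`, `to_payload`, and `process_batch`. The HTTPS call is passed in as a `post` callable because the case management endpoint is not given.

```python
import json

# Terms taken from the "Example Filtering: Risky Products" slide.
RISKY_PRODUCTS = [
    "hair extensions", "pharmacy", "vaporizer", "gateway card",
    "wifi pineapple", "iPhone", "gucci", "cannabis", "travel package",
]


def is_risky(title):
    """Filter step: True if any risky term appears in the title (assumed rule)."""
    lowered = title.lower()
    return any(term.lower() in lowered for term in RISKY_PRODUCTS)


def to_payload(title):
    """Map step: serialize the event as JSON (only `title` is known from the slide)."""
    return json.dumps({"title": title})


def process_batch(titles, post):
    """Run filter -> map -> post over one batch and return the payloads sent."""
    payloads = [to_payload(t) for t in titles if is_risky(t)]
    for payload in payloads:
        post(payload)  # stand-in for the HTTPS POST to case management
    return payloads


if __name__ == "__main__":
    batch = ["hair extensions", "gucci", "cannabis",
             "sweet bag", "nice shoes", "taylor swift t-shirt"]
    process_batch(batch, print)
```

Because `is_risky` and `to_payload` are plain functions of a single event, they are the kind of function that could be handed to PySpark's `DStream.filter` and `DStream.map`.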
24
The Future
• Time-Windowed Functions – A necessity for most of the non-trivial jobs
• Performance Tweaks – We haven’t spent any time on this, so lots of potential for gains
• Machine Learning – We could use the Risk Analyst decisions to build an ML model
• Improved Monitoring – We are only monitoring the basics right now
• Apache Cassandra – Others use it as a fast key/value store for their jobs
• Improved Receiverless API – An API to access Kafka / Zookeeper without hard work
25
Icon Credits
Credit Card by Rediffusion from the Noun Project
Money Bag by icon 54 from the Noun Project