marc schwering – using flink with mongodb to enhance relevancy in personalization

Using Flink with MongoDB to enhance relevancy in personalization

“How to use Flink with MongoDB?”

Marc Schwering Sr. Solution Architect – EMEA

[email protected] @m4rcsch


Agenda For This Session

•  Personalization Process Review
•  The Life of an Application
•  Separation of Concerns / Real World Architecture
•  Apache Spark and Flink Data Processing Projects
•  Clustering with Apache Flink
•  Next Steps


High Level Personalization Process

1. Profile created

2. Enrich with public data

3. Capture activity

4. Clustering analysis

5. Define Personas

6. Tag with personas

7. Personalize interactions

Batch analytics

Public data

Common technologies
•  R
•  Hadoop
•  Spark
•  Python
•  Java
•  Many other options

Personas change much less often than tagging


Evolution of a Profile (1)

{
  "_id" : ObjectId("553ea57b588ac9ef066428e1"),
  "ipAddress" : "216.58.219.238",
  "referrer" : "kay.com",
  "firstName" : "John",
  "lastName" : "Doe",
  "email" : "[email protected]"
}

•  <sample>
   –  Originating IP
   –  Demographic info
   –  Location
   –  Name
   –  Sex
   –  Email


Evolution of a Profile (n+1)

{
  "_id" : ObjectId("553e7dca588ac9ef066428e0"),
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : "229 W. 43rd St.",
  "city" : "New York",
  "state" : "NY",
  "zipCode" : "10036",
  "age" : 30,
  "email" : "[email protected]",
  "twitterHandle" : "johndoe",
  "gender" : "male",
  "interests" : [ "electronics", "basketball", "weightlifting",
                  "ultimate frisbee", "traveling", "technology" ],
  "visitedCounts" : { "watches" : 3, "shirts" : 1, "sunglasses" : 1, "bags" : 2 },
  "purchases" : [
    { "id" : 1, "desc" : "Power Oxford Dress Shoe", "category" : "Mens shoes" },
    { "id" : 2, "desc" : "Striped Sportshirt", "category" : "Mens shirts" }
  ],
  "persona" : "shoe-fanatic"
}
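Steps 4 to 6 of the process end with a "persona" field on the profile. A minimal Python sketch of that tagging step, assuming personas are represented as centroids over the "visitedCounts" categories. The category order, the persona names, and the centroid values are hypothetical and not taken from the talk.

```python
import math

# Hypothetical: fixed category order used to turn visitedCounts into a vector.
CATEGORIES = ["watches", "shirts", "sunglasses", "bags"]

# Hypothetical persona centroids, e.g. produced by an earlier clustering run.
PERSONA_CENTROIDS = {
    "watch-collector": [5.0, 0.0, 1.0, 0.0],
    "fashion-shopper": [0.0, 4.0, 2.0, 3.0],
}


def to_vector(profile):
    """Map a profile's visitedCounts onto the fixed category order."""
    counts = profile.get("visitedCounts", {})
    return [float(counts.get(category, 0)) for category in CATEGORIES]


def tag_persona(profile):
    """Return a copy of the profile tagged with the nearest persona."""
    vector = to_vector(profile)
    nearest = min(
        PERSONA_CENTROIDS,
        key=lambda name: math.dist(vector, PERSONA_CENTROIDS[name]),
    )
    return {**profile, "persona": nearest}


profile = {
    "firstName": "John",
    "visitedCounts": {"watches": 3, "shirts": 1, "sunglasses": 1, "bags": 2},
}
tagged = tag_persona(profile)
print(tagged["persona"])
```

Tagging is cheap and can run on every profile update, while the centroids only change when the batch clustering is re-run.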


One size/document fits all?

•  Profile Data
   –  Preferences
   –  Personal information
      •  Contact information
      •  DOB, gender, ZIP...
•  Customer Data
   –  Purchase History
   –  Marketing History
•  "Session Data"
   –  View History
   –  Shopping Cart Data
   –  Information Broker Data
•  Personalization Data
   –  Persona Vectors
   –  Product and Category recommendations

Application

Batch analytics


Separation of Concerns






Batch analytics Layer

Frontend - System

Profile Service Customer Service Session Service Persona Service
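One way to picture the split into four services is to break the single large profile document into one smaller document per service. The sketch below is an illustration only: the field-to-service mapping and the "profileId" reference field are assumptions, not something the slides specify.

```python
# Hypothetical mapping from document field to the service that owns it.
FIELD_OWNERS = {
    "firstName": "profile",
    "lastName": "profile",
    "email": "profile",
    "purchases": "customer",
    "visitedCounts": "session",
    "persona": "persona",
}


def split_by_service(doc):
    """Split one large profile document into one document per service."""
    parts = {}
    for field, value in doc.items():
        if field == "_id":
            continue
        # Unmapped fields default to the profile service in this sketch.
        service = FIELD_OWNERS.get(field, "profile")
        # Every part carries a reference back to the originating profile.
        part = parts.setdefault(service, {"profileId": doc.get("_id")})
        part[field] = value
    return parts


fat_profile = {
    "_id": "553e7dca588ac9ef066428e0",
    "firstName": "John",
    "purchases": [{"id": 1, "category": "Mens shoes"}],
    "visitedCounts": {"watches": 3},
    "persona": "shoe-fanatic",
}
parts = split_by_service(fat_profile)
for service, document in parts.items():
    print(service, document)
```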


Benefits

•  Code does less; documents and code stay focused
•  Ability to split
   –  Different teams
   –  New languages
   –  Defined dependencies


Advice for Developers (1)


KISS => Keep it simple and save!

=> Clean Code <=

•  Robert C. Martin: https://cleancoders.com/
•  M. Fowler, B. Meyer, et al.: Command Query Separation
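Command Query Separation means that a method either changes state (a command) or returns data (a query), but not both. A small Python sketch of the idea; the ShoppingCart class and its method names are hypothetical and only serve as an illustration.

```python
class ShoppingCart:
    """Hypothetical cart that keeps commands and queries apart."""

    def __init__(self):
        self._items = []

    def add_item(self, sku, quantity=1):
        """Command: changes state and returns nothing."""
        self._items.append((sku, quantity))

    def total_quantity(self):
        """Query: returns data and has no side effects."""
        return sum(quantity for _, quantity in self._items)


cart = ShoppingCart()
cart.add_item("watch-001", 2)
cart.add_item("shirt-042")
print(cart.total_quantity())
```

Because queries have no side effects, they can be called repeatedly or moved behind a separate read service without changing behavior.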

Analytics and Personalization

From Query to Clustering

Frontend – System

Architecture revised


Frontend – System, Backend – Systems

Data Processing


Advice for Developers (2)

•  OWN YOUR DATA! (but only relevant data)
•  Say no! (to direct data access, i.e. direct DB access)

Data Processing


Hadoop in a Nutshell

•  An open source distributed storage and distributed batch oriented processing framework

•  Hadoop Distributed File System (HDFS) to store data on commodity hardware

•  YARN as resource management platform

•  MapReduce as programming model working on top of HDFS
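The MapReduce programming model can be sketched in plain Python, without Hadoop, as three phases: map, shuffle, and reduce. The example counts purchases per category over profile documents shaped like the one shown earlier. All function names are illustrative, and a real Hadoop job would distribute these phases across the cluster.

```python
from collections import defaultdict


def map_phase(profile):
    """Map: emit one (category, 1) pair per purchase."""
    for purchase in profile.get("purchases", []):
        yield purchase["category"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(profiles):
    pairs = (pair for profile in profiles for pair in map_phase(profile))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


profiles = [
    {"purchases": [{"category": "Mens shoes"}, {"category": "Mens shirts"}]},
    {"purchases": [{"category": "Mens shoes"}]},
    {},
]
result = run_job(profiles)
print(result)
```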


Spark in a Nutshell

•  Spark is a top-level Apache project

•  Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB

•  Fast and general engine for large-scale data processing and analytics

•  Advanced DAG execution engine with support for data locality and in-memory computing


Flink in a Nutshell

•  Flink is a top-level Apache project

•  Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB

•  A distributed streaming dataflow engine

•  Streaming and batch

•  Iterative in-memory execution and handling

•  Cost-based optimizer


Latency of query operations

Query Aggregation MapReduce Cluster Algorithms

time

MongoDB Hadoop Spark/Flink

Iterative Algorithms / Clustering


K-Means in Pictures

•  Source: Wikipedia K-Means


K-Means as a Process
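As a process, K-Means repeats two steps until the centroids stop moving: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The following is a minimal single-machine Python sketch of that loop, not the Flink implementation from the talk. The sample points and starting centroids are made up.

```python
import math


def assign(points, centroids):
    """Assignment step: put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
        clusters[nearest].append(point)
    return clusters


def recompute(clusters, old_centroids):
    """Update step: move each centroid to the mean of its cluster."""
    updated = []
    for cluster, old in zip(clusters, old_centroids):
        if cluster:
            updated.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
        else:
            updated.append(old)  # keep the centroid of an empty cluster in place
    return updated


def k_means(points, centroids, max_iterations=20):
    """Iterate assignment and update until the centroids no longer change."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        updated = recompute(assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Each pass over the loop needs the full data set again, which is why the way an engine handles iterations matters for this kind of algorithm.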


Iterations in Hadoop and Spark


Iterations in Flink

•  Dedicated iteration operators
•  Tasks keep running for the iterations, not redeployed for each step
•  Caching and optimizations done automatically

Example


Result


More…?


Takeaways

•  Evolution is amazing and exciting!
   –  Be ready to learn new things, ask questions across silos!

•  Stay focused => Start and stay small
   –  Evaluate with BigDocuments but do a PoC focused on the topic

•  Extending functionality could be challenging
   –  Evolution is outpacing help channels
   –  A lot of options (Spark, Flink, Storm, Hadoop...)
   –  More than just a binary

•  Extending functionality is easy
   –  Aggregation, MapReduce
   –  Connectors opening a new variety of use cases
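As an illustration of the aggregation option, a MongoDB aggregation pipeline that counts purchases per category could look like the Python sketch below. It assumes profile documents shaped like the earlier example and a pymongo-style collection object with an aggregate method. The pipeline and the function name are illustrative, not code from the talk.

```python
# Pipeline stages: one document per purchase, grouped and counted by category,
# most frequent category first.
PURCHASES_BY_CATEGORY = [
    {"$unwind": "$purchases"},
    {"$group": {"_id": "$purchases.category", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]


def purchases_by_category(collection):
    """Run the pipeline on a pymongo-style collection and return the results."""
    return list(collection.aggregate(PURCHASES_BY_CATEGORY))
```

With pymongo, a call might look like purchases_by_category(client["shop"]["profiles"]), where the database and collection names are placeholders.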


Next Steps

•  Try out Flink
   –  http://flink.apache.org/
   –  https://github.com/mongodb/mongo-hadoop
   –  https://github.com/m4rcsch/flink-mongodb-example

•  Participate and ask questions!
   –  @m4rcsch
   –  [email protected]

•  We are hiring!! :)

Thank you!

Marc Schwering Sr. Solutions Architect – EMEA

[email protected] @m4rcsch
