Startup Safary | Fight against robots with enbrite.ly data platform

Fight against robots with enbrite.ly data platform Joe MÉSZÁROS

Upload: meszaros-jozsef

Post on 13-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Startup Safary | Fight against robots with enbrite.ly data platform

Fight against robots with enbrite.ly data platform

Joe MÉSZÁROS

Page 2: Startup Safary | Fight against robots with enbrite.ly data platform

Joe MÉSZÁROS

lead software engineer

@joemesz

joemeszaros

Page 3: Startup Safary | Fight against robots with enbrite.ly data platform

Who are we?

Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.

Page 4: Startup Safary | Fight against robots with enbrite.ly data platform

Ad display fraud (ad stacking, pixel stuffing)

Ad viewability

Page 5: Startup Safary | Fight against robots with enbrite.ly data platform

Brand safety

Detecting traffic that comes from unwanted categories (e.g. adult), countries and single domains

Page 6: Startup Safary | Fight against robots with enbrite.ly data platform

39%

Anti-fraud detection

Page 7: Startup Safary | Fight against robots with enbrite.ly data platform
Page 8: Startup Safary | Fight against robots with enbrite.ly data platform

What do we do?

DATA COLLECTION

DATA PROCESSING

ANALYZE

ANTI FRAUD

VIEWABILITY

BRAND SAFETY

REPORT + API

Page 9: Startup Safary | Fight against robots with enbrite.ly data platform

How do we do it?

DATA PLATFORM

...so we need to analyze vast amounts of data

Infrastructure + Big Data technologies = enbrite.ly data platform

Page 10: Startup Safary | Fight against robots with enbrite.ly data platform

Amazon Web Services (AWS)

● Most popular cloud service provider

● ~70 services, 13 geographical "regions"

● Amazon Big Data = Elastic MapReduce

● BUT do not trust the BIG guy (API problem)

https://aws.amazon.com/

Page 11: Startup Safary | Fight against robots with enbrite.ly data platform

Apache Hadoop

● de facto Big Data technology

● open source software

● distributed storage (HDFS) + data processing (MapReduce)

● ecosystem: many additional software projects

http://hadoop.apache.org/ | https://github.com/apache/hadoop

Page 12: Startup Safary | Fight against robots with enbrite.ly data platform

Apache Spark

● large-scale data processing engine

● open source software (popular)

● modules: core, SQL, streaming, graph, ML

● faster than Hadoop MapReduce

http://spark.apache.org/ | https://github.com/apache/spark

Page 13: Startup Safary | Fight against robots with enbrite.ly data platform

Data platform in numbers

20+ node cluster

16 services

110 servers

0.5 - 4 TB/day

100+ TB on S3

Page 14: Startup Safary | Fight against robots with enbrite.ly data platform

How do we do it?

DATA COLLECTION

Page 15: Startup Safary | Fight against robots with enbrite.ly data platform

How do we do it?

DATA PROCESSING

Page 16: Startup Safary | Fight against robots with enbrite.ly data platform

Let me tell you a short story...

Page 17: Startup Safary | Fight against robots with enbrite.ly data platform

Real world example

You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!

Page 18: Startup Safary | Fight against robots with enbrite.ly data platform

Real world example

THE IDEA: Analyse events which are too hasty and deviate from regular, humanlike profiles: too many clicks in a defined timeframe.

INPUT: Collected events on Amazon S3

OUTPUT: Invalid sessions
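The "too many clicks in a defined timeframe" idea can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions, not the enbrite.ly implementation: the class name `ClickBurstDetector`, the parameters `windowMillis` and `maxClicks`, and the choice of epoch-millisecond timestamps are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: flags a session when more than maxClicks clicks
// fall inside any time window shorter than windowMillis milliseconds.
class ClickBurstDetector {

    static boolean hasBurst(List<Long> clickTimestamps, long windowMillis, int maxClicks) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<Long> sorted = new ArrayList<>(clickTimestamps);
        Collections.sort(sorted);

        int start = 0; // index of the oldest click still inside the window
        for (int end = 0; end < sorted.size(); end++) {
            // Drop clicks from the front until the window spans less than windowMillis.
            while (start < end && sorted.get(end) - sorted.get(start) >= windowMillis) {
                start++;
            }
            // Clicks start..end all lie within one window.
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> human = Arrays.asList(0L, 4_000L, 9_000L, 15_000L);
        List<Long> bot = Arrays.asList(0L, 100L, 200L, 300L, 400L);
        // Allow at most 3 clicks in any 1-second window.
        System.out.println(hasBurst(human, 1_000L, 3)); // false
        System.out.println(hasBurst(bot, 1_000L, 3));   // true
    }
}
```

A sliding window is one possible reading of "defined timeframe"; the code on the later slides uses a simpler per-session click count.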

Page 19: Startup Safary | Fight against robots with enbrite.ly data platform

How to solve it?

Step 1: sessionize events

Step 2: detect too many clicks

code: https://github.com/enbritely/startup-safary

Page 20: Startup Safary | Fight against robots with enbrite.ly data platform

Step 1: event to session

// configure Spark application

// read events from HDFS into `lines`, one JSON-encoded event per line

// parse each JSON line into an Event
JavaRDD<Event> events = lines.map(Converter::jsonToEvent);

// keep only click events
JavaRDD<Event> clicks = events.filter(e -> e.type.equals("click"));

// group clicks by session id (groupBy yields an Iterable per key)
JavaPairRDD<String, Iterable<Event>> grouped = clicks.groupBy(Event::sessionId);

// build one Session per group; values() drops the session id key
JavaRDD<Session> sessions = grouped.mapValues(sessionizer).values();

Application code: https://github.com/enbritely/startup-safary
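For readers without a Spark cluster at hand, the filter and group-by steps can be tried locally with Java streams. This is a rough, Spark-free analogue, not the code from the repository: the class `LocalSessionizeSketch` and its minimal `Event` type are assumptions made only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Local, Spark-free illustration of the "filter clicks, group by session" steps.
class LocalSessionizeSketch {

    // Hypothetical minimal event; the real Event class lives in the linked repository.
    static class Event {
        final String sessionId;
        final String type;
        final long timestamp;

        Event(String sessionId, String type, long timestamp) {
            this.sessionId = sessionId;
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    // Keeps only click events and groups them by session id.
    static Map<String, List<Event>> clicksBySession(List<Event> events) {
        return events.stream()
                .filter(e -> "click".equals(e.type))
                .collect(Collectors.groupingBy(e -> e.sessionId));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("s1", "click", 10L),
                new Event("s1", "pageview", 11L),
                new Event("s2", "click", 12L),
                new Event("s1", "click", 13L));
        Map<String, List<Event>> grouped = clicksBySession(events);
        System.out.println(grouped.get("s1").size()); // 2
        System.out.println(grouped.get("s2").size()); // 1
    }
}
```

The stream version holds everything in one JVM's memory, which is exactly the limitation that Spark's distributed `groupBy` removes.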

Page 21: Startup Safary | Fight against robots with enbrite.ly data platform

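Step 1: event to session

// Sessionizer: builds one Session from the unordered click events of a group
// (Function is org.apache.spark.api.java.function.Function)
Function<Iterable<Event>, Session> sessionizer = unorderedEvents -> {
    List<Event> clickOrdered = sortByTimestamp(unorderedEvents);

    // all events in a group share the same session id; groups are never empty
    Session session = new Session(clickOrdered.get(0).sessionId());

    for (Event event : clickOrdered) {
        session.addClick(event.getTimestamp());
    }

    return session;
};

Application code: https://github.com/enbritely/startup-safary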
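The slide does not show the timestamp-sorting helper or the `Session` class. One way they might look is sketched below; the field names, the `clickTimestamps` list and the whole `SessionizerHelpersSketch` wrapper are assumptions for illustration, and the versions in the linked repository may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the helpers the sessionizer relies on.
class SessionizerHelpersSketch {

    static class Event {
        final String sessionId;
        final long timestamp;

        Event(String sessionId, long timestamp) {
            this.sessionId = sessionId;
            this.timestamp = timestamp;
        }

        long getTimestamp() {
            return timestamp;
        }
    }

    static class Session {
        final String sessionId;
        final List<Long> clickTimestamps = new ArrayList<>();

        Session(String sessionId) {
            this.sessionId = sessionId;
        }

        void addClick(long timestamp) {
            clickTimestamps.add(timestamp);
        }

        int getClickCount() {
            return clickTimestamps.size();
        }
    }

    // Copies the events into a list and orders them by ascending timestamp.
    static List<Event> sortByTimestamp(Iterable<Event> unorderedEvents) {
        List<Event> ordered = new ArrayList<>();
        unorderedEvents.forEach(ordered::add);
        ordered.sort(Comparator.comparingLong(Event::getTimestamp));
        return ordered;
    }

    public static void main(String[] args) {
        List<Event> sorted = sortByTimestamp(Arrays.asList(
                new Event("s1", 30L), new Event("s1", 10L), new Event("s1", 20L)));
        Session session = new Session(sorted.get(0).sessionId);
        for (Event e : sorted) {
            session.addClick(e.getTimestamp());
        }
        System.out.println(session.clickTimestamps); // [10, 20, 30]
    }
}
```

Sorting matters because grouped events arrive in no guaranteed order, and any time-based heuristic needs the clicks in sequence.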

Page 22: Startup Safary | Fight against robots with enbrite.ly data platform

Step 2: apply heuristic

Application code : https://github.com/enbritely/startup-safary

JavaRDD<String> badSessions = sessions

.filter(s -> s.getClickCount() > threshold)

.map(s -> s.sessionId + ":" + s.clickCount);

// save output to HDFS
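The same threshold heuristic can be checked locally with Java streams before submitting anything to a cluster. The sketch below mirrors the filter and the `sessionId:clickCount` output format from the slide; the `HeuristicSketch` class, its minimal `Session` type and the threshold value of 10 are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Local illustration of the "too many clicks" filter on already-built sessions.
class HeuristicSketch {

    // Hypothetical minimal session holding only what the heuristic needs.
    static class Session {
        final String sessionId;
        final int clickCount;

        Session(String sessionId, int clickCount) {
            this.sessionId = sessionId;
            this.clickCount = clickCount;
        }

        int getClickCount() {
            return clickCount;
        }
    }

    // Returns "sessionId:clickCount" for every session above the threshold.
    static List<String> badSessions(List<Session> sessions, int threshold) {
        return sessions.stream()
                .filter(s -> s.getClickCount() > threshold)
                .map(s -> s.sessionId + ":" + s.clickCount)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Session> sessions = Arrays.asList(
                new Session("a", 3), new Session("b", 50), new Session("c", 10));
        System.out.println(badSessions(sessions, 10)); // [b:50]
    }
}
```

Note that the comparison is strict, so a session with exactly `threshold` clicks is not flagged.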

Page 23: Startup Safary | Fight against robots with enbrite.ly data platform

Live demo!

● 4 node EMR (Hadoop) cluster

● Apache Spark 1.6.1

● 1 GB input events

build app : create-cluster : events S3 -> HDFS : submit app

Page 24: Startup Safary | Fight against robots with enbrite.ly data platform

Congratulations!

MISSION COMPLETED

YOU just saved the world with a simple idea within ~10 minutes.

Page 25: Startup Safary | Fight against robots with enbrite.ly data platform

WE ARE HIRING!

working @exPrezi office, K9

check out the company in Forbes :-)

amazing company culture

BUT the real reason ….

Page 26: Startup Safary | Fight against robots with enbrite.ly data platform

WE ARE HIRING!

… is our mood manager, Bigyó :)

Page 27: Startup Safary | Fight against robots with enbrite.ly data platform

BEYOND enbrite.ly

...our investor and event sponsor is looking for talented guys

Page 28: Startup Safary | Fight against robots with enbrite.ly data platform

Joe MÉSZÁROS

lead software engineer

@joemesz @enbritely

joemeszaros enbritely

THANK YOU!

?QUESTIONS?