startup safary | fight against robots with enbrite.ly data platform
Fight against robots with enbrite.ly data platform
Joe MÉSZÁROS
lead software engineer
@joemesz
joemeszaros
Who are we?
Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products (Antifraud, Brandsafety and Viewability), we provide actionable data to our customers.
Ad display fraud (ad stacking, pixel stuffing)
Ad viewability
Brand safety: detecting traffic that comes from unwanted categories (e.g. adult), countries and single domains
39% Anti fraud detection
DATA COLLECTION
ANALYZE
DATA PROCESSING
ANTI FRAUD
VIEWABILITY
BRAND SAFETY
REPORT + API
What do we do?
How do we do it?
DATA PLATFORM
...so we need to analyze vast amounts of data
Infrastructure + Big Data technologies = enbrite.ly data platform
Amazon Web Services (AWS)
● Most popular cloud service provider
● ~70 services, 13 geographical "regions"
● Amazon Big Data = Elastic MapReduce
● BUT do not trust the BIG guy (API problem)
https://aws.amazon.com/
Apache Hadoop
● de facto Big Data technology
● open source software
● distributed storage (HDFS) + data processing (MapReduce)
● ecosystem: many additional software projects
http://hadoop.apache.org/ | https://github.com/apache/hadoop
Apache Spark
● large-scale data processing engine
● open source software (popular)
● modules: core, sql, streaming, graph, ML
● faster than Hadoop MapReduce
http://spark.apache.org/ | https://github.com/apache/spark
Data platform in numbers
20+ node cluster
16 services
110 servers
0.5 - 4 TB / day
100+ TB on S3
How do we do it?
DATA COLLECTION
How do we do it?
DATA PROCESSING
Let me tell you a short story...
Real world example
You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!
Real world example
THE IDEA: Analyse events which are too hasty and deviate
from regular, humanlike profiles: too many clicks in a defined timeframe.
INPUT: Collected events on Amazon S3
OUTPUT: Invalid sessions
How to solve it?
Step 1: sessionize events
Step 2: detect too many clicks
code: https://github.com/enbritely/startup-safary
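The two steps can also be tried out locally on a handful of events before moving to a cluster. The following is a minimal sketch in plain Java, without Spark; `ClickEvent` and `LocalBotDetector` are hypothetical names chosen for this illustration and are not taken from the linked repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical, simplified event: only the fields the heuristic needs. */
final class ClickEvent {
    final String sessionId;
    final String type;
    final long timestamp;

    ClickEvent(String sessionId, String type, long timestamp) {
        this.sessionId = sessionId;
        this.type = type;
        this.timestamp = timestamp;
    }
}

/** Hypothetical local stand-in for the two-step pipeline. */
final class LocalBotDetector {

    /** Step 1: keep click events only and count them per session id. */
    static Map<String, Long> clicksPerSession(List<ClickEvent> events) {
        return events.stream()
                .filter(e -> "click".equals(e.type))
                .collect(Collectors.groupingBy(
                        e -> e.sessionId, TreeMap::new, Collectors.counting()));
    }

    /** Step 2: report sessions whose click count exceeds the threshold. */
    static List<String> invalidSessions(List<ClickEvent> events, long threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : clicksPerSession(events).entrySet()) {
            if (entry.getValue() > threshold) {
                // same "sessionId:clickCount" output format as the Spark job
                result.add(entry.getKey() + ":" + entry.getValue());
            }
        }
        return result;
    }
}
```

Calling `LocalBotDetector.invalidSessions(events, threshold)` returns one `sessionId:clickCount` string per flagged session, sorted by session id because the counts are collected into a `TreeMap`.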
Step 1: event to session
// configure Spark application
// read events from HDFS
JavaRDD<Event> events = lines.map(Converter::jsonToEvent);
// keep click events only
JavaRDD<Event> clicks = events.filter(e -> e.type.equals("click"));
// groupBy yields an Iterable of events per session id
JavaPairRDD<String, Iterable<Event>> grouped = clicks.groupBy(Event::sessionId);
// mapValues keeps the key, so take values() to get the sessions
JavaRDD<Session> sessions = grouped.mapValues(sessionizer).values();
Application code : https://github.com/enbritely/startup-safary
Step 1: event to session
// Sessionizer
(Function<Iterable<Event>, Session>) unorderedEvents -> {
    List<Event> clickOrdered = sortByTimestamp(unorderedEvents);
    // groupBy never yields an empty group, so the first event supplies the session id
    Session session = new Session(clickOrdered.get(0).sessionId());
    for (Event event : clickOrdered) {
        session.addClick(event.getTimestamp());
    }
    return session;
}
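The slides do not show the session class or the sorting helper used by the sessionizer. A possible shape for them is sketched below, assuming a session only needs its id and the ordered click timestamps; `TimedEvent`, `ClickSession` and `Sessionizing` are hypothetical names for this illustration, and the real classes in the repository may look different.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical event with only the fields used by the sessionizer. */
final class TimedEvent {
    final String sessionId;
    final long timestamp;

    TimedEvent(String sessionId, long timestamp) {
        this.sessionId = sessionId;
        this.timestamp = timestamp;
    }

    long getTimestamp() {
        return timestamp;
    }
}

/** Hypothetical session: an id plus click timestamps in insertion order. */
final class ClickSession {
    final String sessionId;
    private final List<Long> clickTimestamps = new ArrayList<>();

    ClickSession(String sessionId) {
        this.sessionId = sessionId;
    }

    void addClick(long timestamp) {
        clickTimestamps.add(timestamp);
    }

    int getClickCount() {
        return clickTimestamps.size();
    }

    /** Returns a copy so callers cannot modify the session's state. */
    List<Long> getClickTimestamps() {
        return new ArrayList<>(clickTimestamps);
    }
}

final class Sessionizing {

    /** Copies the events into a new list sorted by ascending timestamp. */
    static List<TimedEvent> sortByTimestamp(Iterable<TimedEvent> unordered) {
        List<TimedEvent> sorted = new ArrayList<>();
        for (TimedEvent event : unordered) {
            sorted.add(event);
        }
        sorted.sort(Comparator.comparingLong(TimedEvent::getTimestamp));
        return sorted;
    }

    /** Builds one session from all events of a single, non-empty group. */
    static ClickSession toSession(Iterable<TimedEvent> unordered) {
        List<TimedEvent> ordered = sortByTimestamp(unordered);
        if (ordered.isEmpty()) {
            throw new IllegalArgumentException("a session needs at least one event");
        }
        ClickSession session = new ClickSession(ordered.get(0).sessionId);
        for (TimedEvent event : ordered) {
            session.addClick(event.getTimestamp());
        }
        return session;
    }
}
```

Keeping the timestamps, and not just a counter, leaves room for heuristics that look at the timing between clicks.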
Step 2: apply heuristic
JavaRDD<String> badSessions = sessions
.filter(s -> s.getClickCount() > threshold)
.map(s -> s.sessionId + ":" + s.clickCount);
// save output to HDFS
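The filter above compares the total click count of a session with a threshold. The original idea speaks of too many clicks in a defined timeframe, which could be checked with a sliding window over the sorted click timestamps. The sketch below is one way to do that in plain Java; `ClickRateHeuristic`, its method names and the millisecond unit are assumptions made for this illustration, not part of the presented application.

```java
import java.util.List;

/** Hypothetical sliding-window variant of the "too many clicks" heuristic. */
final class ClickRateHeuristic {

    /**
     * Returns the largest number of clicks that fall into any time window
     * shorter than windowMillis. Timestamps must be sorted ascending.
     */
    static int maxClicksInWindow(List<Long> sortedTimestamps, long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        int best = 0;
        int start = 0;
        for (int end = 0; end < sortedTimestamps.size(); end++) {
            // advance the left edge until the window spans less than windowMillis
            while (sortedTimestamps.get(end) - sortedTimestamps.get(start) >= windowMillis) {
                start++;
            }
            best = Math.max(best, end - start + 1);
        }
        return best;
    }

    /** A session is suspicious if any window holds more clicks than the threshold. */
    static boolean isSuspicious(List<Long> sortedTimestamps, long windowMillis, int threshold) {
        return maxClicksInWindow(sortedTimestamps, windowMillis) > threshold;
    }
}
```

Each timestamp is visited at most twice, so the check is linear in the number of clicks once they are sorted, which the sessionizer already takes care of.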
Live demo!
● 4 node EMR (Hadoop) Cluster
● Apache Spark 1.6.1
● 1 GB input events
build app : create-cluster : events S3 -> HDFS : submit app
Congratulations!
MISSION COMPLETED
YOU just saved the world with a simple idea within ~10 minutes.
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
… is our mood manager, Bigyó :)
BEYOND enbrite.ly
...our investor and event sponsor is looking for talented guys
Joe MÉSZÁROS
lead software engineer
@joemesz @enbritely
joemeszaros enbritely
THANK YOU!
?QUESTIONS?