java/scala lab: Boris Trofimov - Scalding Big Data
TRANSCRIPT
Scalding Big ADta
or firing pots full of ads
Boris Trofimov @b0ris_1
Agenda
• Two stories on how AD is served inside an AD company
• Awesome Scalding
The stories mention one company that has built a multimillion-dollar
business on top of ordinary cookies
The story about shoes,
or Big Brother is watching you
(or: be careful while buying shoes)
What do these things have in common?
We will answer this question in a few slides.
They only seem simple at first glance.
The same is true of loading web sites.
Open any site with ads. The redirect chain in the first second looks roughly like this:
• ~20 ms: the publisher receives the request
• ~100 ms: the publisher sends the response
• ~150 ms: the content is delivered to the user
• ~170 ms: the site sends a request to the Ad Server
• ~200 ms: the Ad Server receives the ad request and redirects it to the Ad Exchange
• ~210 ms: the SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• ~280 ms: all bidders must send their decision (participate? and price) back, within an ~80 ms window
• ~300 ms: the SSP picks the winning bid and sends the redirect url back to the Ad Server
• ~350 ms: the Ad Server shows the page to the user, which redirects to the bidder's server
• ~400 ms: the user's web page requests the ad banner from the CDN, showing the ad and the bidder's 1x1 pixel (impression)
The first second…
~70% of users already have this cookie aboard.
Far more than one (>>1) independent companies take part in this auction.
[Architecture diagram: real-time side and offline side of the platform]
• Real time: the SSP (Ad Exchange) sends auction requests to the Bidder Farm; the Pixel Tracking Farm records impressions, clicks, and post-click activities
• Offline: hourly logs, 3rd-party data, householder data, etc. land in Hadoop's HDFS
• Hive, Oozie, and MapReduce jobs update user profiles; HBase keeps the user profiles; Scalding updates the profiles with new segments
• Data is exported to the Warehouse and to partners as a brand-new feed about user interests
• Data Scientists return info about new user interests as special markers (segments), each indicating a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>
Why do we need all this science?
• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog (see the sketch below)
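A minimal sketch of how such a targeting rule can reduce to a set check over segment ids; all segment names, ids, and the UserProfile shape below are hypothetical, for illustration only:

// Hypothetical segment ids; real ones come from the Data Scientists' feed.
val MAN = 101
val NYC = 205
val HAS_IPHONE = 310
val HAS_DOG = 412

case class UserProfile(cookieId: String, segments: Set[Int])

// A campaign matches a user when the profile contains ALL required segments.
def matchesCampaign(profile: UserProfile, required: Set[Int]): Boolean =
  required.subsetOf(profile.segments)

val campaign = Set(MAN, NYC, HAS_IPHONE, HAS_DOG)
val user = UserProfile("cookie-42", Set(MAN, NYC, HAS_IPHONE, HAS_DOG, 999))
assert(matchesCampaign(user, campaign)) // this user would see the ad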
Facts about Data Scientists
• Data Scientists do:
  – Audience Modeling: identifying new user interests [segments] and finding ways to track them
  – Audience Bridging
  – Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language
Facts about Realtime team
• Scala, Java
• RESTful services
• Akka
• In-memory cache: Aerospike, Redis
Facts about Offline team
• The tasks we solve over Hadoop:
  – As a storage to keep all the logs we need
  – As a profile DB to keep all users and their interests [segments]
  – As a MapReduce engine to run transformation jobs between data sets
  – As a warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped in Oozie workflow(s)
• Developing a next-gen platform version based on Spark Streaming/Kafka
Scalding in a nutshell
• Concise DSL
• Configurable source(s) and sink(s)
• Data transform operations (a sketch follows this list):
  – map/flatMap
  – pivot/unpivot
  – project
  – groupBy/reduce/foldLeft
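A minimal sketch of a fields-based pipeline exercising a few of these operations; the input schema ('user, 'site, 'price) and the job name are assumptions for illustration:

import com.twitter.scalding._

class AdSpendJob(args: Args) extends Job(args) {
  // Assumed input: a TSV of (user, site, price in cents), one impression per line.
  Tsv(args("input"), ('user, 'site, 'price)).read
    .map('price -> 'priceUsd) { cents: Double => cents / 100.0 } // map: derive a new field
    .project('site, 'priceUsd)                                   // project: keep two fields
    .groupBy('site) { _.sum[Double]('priceUsd -> 'totalUsd) }    // groupBy + reduce: spend per site
    .write(Tsv(args("output")))
}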
Just one example (Java way)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Just one example (Scalding way)

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )                                      // source
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )                              // sink

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase the text and remove punctuation before splitting.
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
  }
}
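A job like this is typically launched through Scalding's Hadoop Tool runner; a sketch, assuming the job is packaged into a hypothetical fat jar myjobs.jar and the paths are placeholders:

hadoop jar myjobs.jar com.twitter.scalding.Tool WordCountJob --hdfs \
  --input /user/boris/input.txt --output /user/boris/output.tsv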
Transform operations
Use Case 1: Split
• Motivation: reuse calculated streams (see the note after the snippet)

val common = Tsv("./file").read.map(...)
val branch1 = common.map(...).write(Tsv("output1"))
val branch2 = common.groupBy(...).write(Tsv("output2"))
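Because common is an ordinary pipe value, the Cascading planner splits the stream: the shared map over ./file is computed once within the flow and feeds both branches.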
Use Case 2: Exotic Sources - JDBC (out of the box)

case object YourTableSource extends JDBCSource {
  override val tableName = "tableName"
  override val columns = List(
    varchar("col1", 64),
    date("col2"),
    tinyint("col3"),
    double("col4") // no trailing comma: Scala 2.10/2.11 rejects one here
  )
  override def currentConfig =
    ConnectionSpec("www.gt.com", "username", "password", "mysql")
}

YourTableSource.read.map(...) ...
Use Case 2: Exotic Sources - HBASE

HBaseSource (https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL
• GET_LIST
• SCAN_RANGE
HBaseRawSource (https://github.com/andry1/SpyGlass)
• Advanced filtering via a Base64-encoded Scan

val hbs3 = new HBaseSource(
  tableName,
  quorum,        // ZooKeeper quorum
  'key,
  List("data"),  // column families
  List('data),
  sourceMode = SourceMode.SCAN_ALL)
  .read

// MUST_PASS_ONE, GREATER_OR_EQUAL and toBytes come from the HBase client API
// (FilterList.Operator, CompareFilter.CompareOp and Bytes respectively).
val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"),
    GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
Use Case 3: Join
• Motivation: joining two streams by key
• Different join strategies:
  – joinWithLarger
  – joinWithSmaller
  – joinWithTiny
• Inner, left, and right join modes

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
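A minimal sketch of a non-inner join, reusing the hypothetical pipes above; the joiner parameter accepts a Cascading joiner such as LeftJoin:

import cascading.pipe.joiner.LeftJoin

// Keep every row of pipe1 even when there is no matching 'id2 in pipe2.
val leftJoined = pipe1.joinWithSmaller('id1 -> 'id2, pipe2, joiner = new LeftJoin)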
Use Case 4: Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the resulting path can be passed into any Scalding job, for instance via its Args object
val fileName = fl.path
...

class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line: String => ... /* using the data json object */ ... }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)
Use Case 5: Bridging Profiles

Motivation: bridge information from different sources and build a complete person profile.
• The company's own private cookie is obtained thanks to the 1x1-pixel impression ("imp")
• Two SSP cookies (ssp_cookie_Id1 and ssp_cookie_Id2) can be bridged via the private cookie
• Profiles can also be bridged via IP address
Bridging Profiles
General task definition:
• Build a graph
• Identify its connected components

Connected components: let's Scalding it
class ConnectedComponentsJob(args : Args) extends Job(args) {
  import scala.math.{max, min}
  import cascading.pipe.joiner.LeftJoin

  // Iterative label propagation: every vertex carries a group id ('gid),
  // initially equal to its own id; on every pass the smaller gid wins.
  // NOTE: this is a sketch. In practice each pass runs as a separate job
  // and convergence is detected via a Hadoop counter, since a pipe's size
  // cannot be inspected while the flow is being built.
  var attempt = 0
  while (attempt < 20) { // 20 passes as an upper bound on the graph diameter
    val vertexes = Tsv(args("vertexes")).read // 'id \t 'gid; gid defaults to the vertex id
    val edges = Tsv(args("edges")).read       // 'gid_a \t 'gid_b

    // Attach the current gid of both endpoints to every edge.
    var groups = vertexes
      .joinWithSmaller('id -> 'id_b,
        vertexes.joinWithSmaller('id -> 'id_a, edges)
          .discard('id)
          .rename('gid -> 'gid_a))
      .discard('id)
      .rename('gid -> 'gid_b)

    // Keep only edges whose endpoints still disagree, normalized to (max, min).
    groups = groups
      .filter('gid_a, 'gid_b) { gid: (String, String) => gid._1 != gid._2 }
      .project('gid_a, 'gid_b)
      .map(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid: (Integer, Integer) =>
        (max(gid._1, gid._2), min(gid._1, gid._2))
      }

    // Conceptual convergence check (see NOTE): stop when no disagreements remain.
    // val count = groups.groupBy(('gid_a, 'gid_b)) { _.size }
    // if (count == 0) attempt = 20

    // For every old group id, the new group id is its smallest neighbour.
    val new_groups = groups
      .groupBy('gid_a) { _.min('gid_b) }
      .rename(('gid_a, 'gid_b) -> ('source, 'target))

    // Left-join vertexes against the relabelling and keep the smaller gid.
    val new_vertexes = vertexes
      .joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
      .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) {
        param: (Integer, Integer, Integer, Integer) =>
          val (id, gid, source, target) = param
          if (target != null) (id, min(gid, target)) else (id, gid)
      }

    new_vertexes.write(Tsv(args("vertexes")))
    attempt += 1
  }
}
Other nice things
• Typed pipes (a sketch follows this list)
• Elegant and fast matrix operations
• Simple migration to Spark/Kafka
• A way to retrieve data from Hive's HCatalog
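A minimal sketch of the earlier word count rewritten with the typed API; field symbols disappear in favour of compile-time checked tuples (paths come from Args, as before):

import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.split("\\s+") }
    .groupBy(identity) // group identical words together
    .size              // count occurrences per word
    .write(TypedTsv[(String, Long)](args("output")))
}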
Useful Resources
• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive
Thank you!