Java/Scala Lab: Boris Trofimov - Scorching Big Data


Page 1:

Scalding Big ADta

or firing pots with ads

Boris Trofimov @b0ris_1

Page 2:

Agenda

• Two stories about how AD is served inside an AD company
• Awesome Scalding

The stories mention one company that has built a multimillion-dollar business on top of ordinary cookies

Page 3:

The story about shoes, or Big Brother is watching you

Page 4:

We will answer this question in a few slides

...or be careful while buying shoes

Page 5:

What do these things have in common?

Page 6:
Page 7:

They only seem simple at first glance

The same goes for loading web sites

Page 8:

Open any site with ads

Page 9:

[Timeline: the first second of loading an ad-supported page, with approximate timing marks]

• ~20 ms: publisher receives the request
• ~100 ms: publisher sends the response
• ~150 ms: content delivered to the user
• ~170 ms: the site sends a request to the Ad Server
• ~200-210 ms: the Ad Server receives the ad request and redirects to the Ad Exchange
• the SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• all bidders must send their decision (participate? & price) back; the auction takes ~80 ms
• ~280-300 ms: the SSP picks the winning bid and sends the redirect url back to the Ad Server
• ~350 ms: the Ad Server shows the page to the user, which redirects to the bidder's server
• ~400 ms: the user's web page asks for the ad banner from a CDN; the ad and the bidder's 1x1 pixel (impression) are shown

...and that is just the first second. ~70% of users have this cookie aboard, and far more than one (>>1) independent companies take part in the auction.

Page 10:

[Architecture diagram: real-time and offline data flow, steps 0-9]

• Real time: the SSP (Ad Exchange) sends auction requests to the Bidder Farm; the Pixel Tracking Farm collects impressions, clicks, and post-click activities.
• Offline: hourly logs, 3rd-party data, and householder data land in Hadoop's HDFS; Hive, Oozie, and MapReduce jobs update user profiles; HBase keeps the user profiles, and Scalding jobs update them with new segments; data is exported to the Warehouse, and a brand-new feed about user interests goes to Partners.
• Data Scientists return info about new user interests with special markers (segments) that indicate a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>.

Page 11:

Why do we need all this science?

• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and own a dog (see the sketch below)
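On the offline side this kind of targeting boils down to a segment-set lookup over the <cookie_id – segment_id> feed. A minimal Scalding sketch, assuming that feed as input and hypothetical segment ids (the job name and ids are illustrative, not from the talk):

import com.twitter.scalding._

class AudienceTargetingJob(args : Args) extends Job(args) {
  // hypothetical segment ids for: man, lives in NYC, has an iPhone, has a dog
  val required = Set(101, 202, 303, 404)

  // input feed rows: (cookie_id, segment_id)
  TypedPipe.from(TypedTsv[(String, Int)](args("input")))
    .group                                   // group segment ids by cookie
    .toSet                                   // (cookie_id, Set[segment_id])
    .toTypedPipe
    .filter { case (_, segments) => required.subsetOf(segments) }
    .keys                                    // cookies matching the whole audience
    .write(TypedTsv[String](args("output")))
}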

Page 12:

Facts about Data Scientists

• Data Scientists do:
  – Audience Modeling: identifying new user interests [segments] and finding ways to track them
  – Audience Bridging
  – Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language

Page 13:

Facts about the Realtime team

• Scala, Java
• RESTful services
• Akka
• In-memory cache: Aerospike, Redis

Page 14:

Facts about the Offline team

• The tasks we solve over Hadoop:
  – as a Storage to keep all the logs we need
  – as a Profile DB to keep all users and their interests [segments]
  – as a MapReduce Engine to run data-transformation jobs
  – as a Warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped by Oozie workflow(s)
• Developing the next-gen platform version based on Spark Streaming/Kafka

Page 15:
Page 16:

Scalding in a nutshell

[Diagram: data flows from hdfs through a Scalding job and back to hdfs]

• Concise DSL
• Configurable source(s) and sink(s)
• Data transform operations (see the sketch below):
  – map/flatMap
  – pivot/unpivot
  – project
  – groupBy/reduce/foldLeft
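Most of these operations are one-liners in the fields-based DSL. A minimal sketch of pivot/unpivot, assuming a hypothetical Tsv input with fields 'key, 'w, 'x, 'y (names are illustrative):

import com.twitter.scalding._

class PivotExampleJob(args : Args) extends Job(args) {
  Tsv(args("input"), ('key, 'w, 'x, 'y)).read
    // wide -> long: one ('col, 'value) row per original column
    .unpivot(('w, 'x, 'y) -> ('col, 'value))
    // long -> wide: rebuild the original columns per key
    .groupBy('key) { _.pivot(('col, 'value) -> ('w, 'x, 'y)) }
    .write(Tsv(args("output")))
}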

Page 17:

Just one example (Java way)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Page 18:

Just one example (Scalding way)

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {

  TextLine( args("input") )                                      // Source
    .flatMap('line -> 'word) { line : String => tokenize(line) } // Transform operations
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )                              // Sink

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
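A job like this can be exercised without a cluster via Scalding's JobTest harness. A minimal sketch, normally wrapped in a unit test (file names and the sample line are hypothetical):

import com.twitter.scalding._

// feed one line into the job and check the (word, count) pairs on the sink
JobTest(new WordCountJob(_))
  .arg("input", "input.txt")
  .arg("output", "output.tsv")
  .source(TextLine("input.txt"), List((0, "hack hack hack and hack")))
  .sink[(String, Int)](Tsv("output.tsv")) { buffer =>
    assert(buffer.toMap == Map("hack" -> 4, "and" -> 1))
  }
  .run
  .finish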

Page 19:

Use Case 1: Split

• Motivation: reuse calculated streams

val common  = Tsv("./file").map(...)
val branch1 = common.map(...).write(Tsv("output1"))
val branch2 = common.groupBy(...).write(Tsv("output2"))

Page 20:

Use Case 2: Exotic Sources (JDBC, out of the box)

case object YourTableSource extends JDBCSource {
  override val tableName = "tableName"
  override val columns = List(
    varchar("col1", 64),
    date("col2"),
    tinyint("col3"),
    double("col4")
  )
  override def currentConfig =
    ConnectionSpec("www.gt.com", "username", "password", "mysql")
}

YourTableSource.read.map(...) ...

Page 21:

Use Case 2: Exotic Sources (HBase)

HBaseSource (https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL
• GET_LIST
• SCAN_RANGE

HBaseRawSource (https://github.com/andry1/SpyGlass)
• Advanced filtering via base64Scan

val hbs3 = new HBaseSource(
    tableName,
    quorum,
    'key,
    List("data"),
    List('data),
    sourceMode = SourceMode.SCAN_ALL)
  .read

val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"),
    GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)

new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...

Page 22:

Use Case 3: Join

• Motivation: joining two streams by key
• Different join strategies:
  – joinWithLarger
  – joinWithSmaller
  – joinWithTiny
• Inner, Left, Right strategies

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file

val joinedPipe  = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)

Page 23:

Use Case 4: Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the next value can be passed into any Scalding job, for instance via the Args object
val fileName = fl.path
...

class Job(val args : Args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line => ... /* using the data json object */ ...
  }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)

Page 24:

Use Case 5: Bridging Profiles

Motivation: bridge information from different sources and build a complete person profile

[Diagram: ssp_cookie_Id1 and ssp_cookie_Id2 are bridged via the company's own private cookie (obtained thanks to the 1x1-pixel impression), and profiles are also bridged via ip address]

Page 25:

Bridging Profiles

General task definition:
• Build a graph
• Identify connected components

Page 26:

Connected components: let's scalding it

class ConnectedComponentsJob(args : Args) extends Job(args) {

  var attempt = 0
  while( attempt < 20 ) {

    // 'vertex \t 'gid; by default gid is equal to the vertex id
    val vertexes = Tsv( args("vertexes") ).read
    // 'gid_a \t 'gid_b
    val edges = Tsv( args("edges") )

    // attach the current group id of both endpoints to every edge
    var groups = vertexes
      .joinWithSmaller('id -> 'id_b,
        vertexes.joinWithSmaller('id -> 'id_a, edges)
          .discard('id)
          .rename('gid -> 'gid_a))
      .discard('id)
      .rename('gid -> 'gid_b)

    // keep only edges whose endpoints still disagree, ordered as (max, min)
    groups = groups
      .filter('gid_a, 'gid_b) { gid : (String, String) => gid._1 != gid._2 }
      .project('gid_a, 'gid_b)
      .map(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid : (Integer, Integer) =>
        (max(gid._1, gid._2), min(gid._1, gid._2)) }

    // if no disagreeing edges are left, the components have converged
    val count = groups.groupBy( ('gid_a, 'gid_b) ){ _.size }
    if (count == 0) attempt = 20

    // for every group pick the smallest neighbouring group id
    val new_groups = groups
      .groupBy('gid_a) { _.min('gid_b) }
      .rename(('gid_a, 'gid_b) -> ('source, 'target))

    // propagate the smaller group id onto every vertex
    val new_vertexes = vertexes
      .joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
      .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) {
        param : (Integer, Integer, Integer, Integer) =>
          val (id, gid, source, target) = param
          if (target != null)
            ( id, min(gid, target) )
          else
            ( id, gid )
      }

    new_vertexes.write( Tsv( args("vertexes") ) )
    attempt += 1
  }
}
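Each pass propagates the smaller group id across every edge, so a component's minimum id spreads one hop further per iteration; the loop therefore converges after at most as many passes as the graph diameter, and the cap of 20 simply bounds the number of MapReduce rounds.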

Page 27:

Other nice things

• Typed pipes (see the sketch below)
• Elegant and fast matrix operations
• Simple migration to Spark/Kafka
• A way to retrieve data from Hive's HCatalog
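For a taste of typed pipes: the word count from slide 18 rewritten against the typed API, where the compiler tracks tuple types instead of field symbols. A minimal sketch (the job name is illustrative):

import com.twitter.scalding._

class TypedWordCountJob(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))               // TypedPipe[String], one line each
    .flatMap { line => line.toLowerCase.split("\\s+") } // one word per element
    .map { word => (word, 1L) }                         // (word, 1)
    .sumByKey                                           // add up the 1s per word
    .write(TypedTsv[(String, Long)](args("output")))
}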

Page 28:

Useful Resources

• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive

Page 29:

Thank you!