java/scala lab: Boris Trofimov - Scalding Big Data
TRANSCRIPT
Scalding Big ADta
or firing pots full of ads
Boris Trofimov @b0ris_1
Agenda
• Two stories on how AD is served inside an AD company
• Awesome Scalding
The stories mention one company that has built a multimillion-dollar
business on top of ordinary cookies
The story about shoes,
or Big Brother is watching you
(or: be careful while buying shoes)
What do these things have in common?
We will answer this question in a few slides.
They only seem simple at first glance.
The same is true of loading web sites.
Open any site with ads. The redirect chain in the first second looks roughly like this:
• ~20 ms: the publisher receives the request
• ~100 ms: the publisher sends the response
• ~150 ms: the content is delivered to the user
• ~170 ms: the site sends a request to the Ad Server
• ~200 ms: the Ad Server receives the ad request and redirects it to the Ad Exchange
• ~210 ms: the SSP (Ad Exchange) receives the ad request and opens an RTB auction; every bidder/DSP receives info about the user: ssp_cookie_id, geo data, site url
• ~280 ms: all bidders must send their decision (participate? and price) back, within an ~80 ms window
• ~300 ms: the SSP picks the winning bid and sends the redirect url back to the Ad Server
• ~350 ms: the Ad Server shows the page to the user, which redirects to the bidder's server
• ~400 ms: the user's web page requests the ad banner from the CDN, showing the ad and the bidder's 1x1 pixel (impression)
The first second…
~70% of users already have this cookie aboard.
Far more than one (>>1) independent companies take part in this auction.
[Architecture diagram: real-time side and offline side of the platform]
• Real time: the SSP (Ad Exchange) sends auction requests to the Bidder Farm; the Pixel Tracking Farm records impressions, clicks, and post-click activities
• Offline: hourly logs, 3rd-party data, householder data, etc. land in Hadoop's HDFS
• Hive, Oozie, and MapReduce jobs update user profiles; HBase keeps the user profiles; Scalding updates the profiles with new segments
• Data is exported to the Warehouse and to partners as a brand-new feed about user interests
• Data Scientists return info about new user interests as special markers (segments), each indicating a new fact about the user, e.g. the user is a man who has an iPhone, lives in NYC, and has a dog. Major format: <cookie_id – segment_id>
Why do we need all this science?
• Deep audience targeting
• Case: a customer would like to show an ad to all men who live in NYC, have an iPhone, and have a dog (see the sketch below)
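A minimal sketch of how such a targeting rule can reduce to a set check over segment ids; all segment names, ids, and the UserProfile shape below are hypothetical, for illustration only:

// Hypothetical segment ids; real ones come from the Data Scientists' feed.
val MAN = 101
val NYC = 205
val HAS_IPHONE = 310
val HAS_DOG = 412

case class UserProfile(cookieId: String, segments: Set[Int])

// A campaign matches a user when the profile contains ALL required segments.
def matchesCampaign(profile: UserProfile, required: Set[Int]): Boolean =
  required.subsetOf(profile.segments)

val campaign = Set(MAN, NYC, HAS_IPHONE, HAS_DOG)
val user = UserProfile("cookie-42", Set(MAN, NYC, HAS_IPHONE, HAS_DOG, 999))
assert(matchesCampaign(user, campaign)) // this user would see the ad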
Facts about Data Scientists
• Data Scientists do:
  – Audience Modeling: identifying new user interests [segments] and finding ways to track them
  – Audience Bridging
  – Insights and Analytics
• They use IBM Netezza as a local warehouse
• They use the R language
Facts about Realtime team
• Scala, Java
• RESTful services
• Akka
• In-memory cache: Aerospike, Redis
Facts about Offline team
• The tasks we solve over Hadoop:
  – As a storage to keep all the logs we need
  – As a profile DB to keep all users and their interests [segments]
  – As a MapReduce engine to run transformation jobs between data sets
  – As a warehouse to export data via Hive
• We use Cloudera CDH 5.1.2
• Major language: Scala
• Pure MapReduce jobs & Scalding/Cascading
• All MapReduce applications are wrapped in Oozie workflow(s)
• Developing a next-gen platform version based on Spark Streaming/Kafka
Scalding in a nutshell
• Concise DSL
• Configurable source(s) and sink(s)
• Data transform operations (a sketch follows this list):
  – map/flatMap
  – pivot/unpivot
  – project
  – groupBy/reduce/foldLeft
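A minimal sketch of a fields-based pipeline exercising a few of these operations; the input schema ('user, 'site, 'price) and the job name are assumptions for illustration:

import com.twitter.scalding._

class AdSpendJob(args: Args) extends Job(args) {
  // Assumed input: a TSV of (user, site, price in cents), one impression per line.
  Tsv(args("input"), ('user, 'site, 'price)).read
    .map('price -> 'priceUsd) { cents: Double => cents / 100.0 } // map: derive a new field
    .project('site, 'priceUsd)                                   // project: keep two fields
    .groupBy('site) { _.sum[Double]('priceUsd -> 'totalUsd) }    // groupBy + reduce: spend per site
    .write(Tsv(args("output")))
}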
Just one example (Java way)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Just one example (Scalding way)

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )                                      // source
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )                              // sink

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase the text and remove punctuation before splitting.
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
  }
}
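A job like this is typically launched through Scalding's Hadoop Tool runner; a sketch, assuming the job is packaged into a hypothetical fat jar myjobs.jar and the paths are placeholders:

hadoop jar myjobs.jar com.twitter.scalding.Tool WordCountJob --hdfs \
  --input /user/boris/input.txt --output /user/boris/output.tsv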
Transform operations
Use Case 1: Split
• Motivation: reuse calculated streams (see the note after the snippet)

val common = Tsv("./file").read.map(...)
val branch1 = common.map(...).write(Tsv("output1"))
val branch2 = common.groupBy(...).write(Tsv("output2"))
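Because common is an ordinary pipe value, the Cascading planner splits the stream: the shared map over ./file is computed once within the flow and feeds both branches.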
Use Case 2: Exotic Sources - JDBC (out of the box)

case object YourTableSource extends JDBCSource {
  override val tableName = "tableName"
  override val columns = List(
    varchar("col1", 64),
    date("col2"),
    tinyint("col3"),
    double("col4") // no trailing comma: Scala 2.10/2.11 rejects one here
  )
  override def currentConfig =
    ConnectionSpec("www.gt.com", "username", "password", "mysql")
}

YourTableSource.read.map(...) ...
Use Case 2: Exotic Sources - HBASE

HBaseSource (https://github.com/ParallelAI/SpyGlass)
• SCAN_ALL
• GET_LIST
• SCAN_RANGE
HBaseRawSource (https://github.com/andry1/SpyGlass)
• Advanced filtering via a Base64-encoded Scan

val hbs3 = new HBaseSource(
  tableName,
  quorum,        // ZooKeeper quorum
  'key,
  List("data"),  // column families
  List('data),
  sourceMode = SourceMode.SCAN_ALL)
  .read

// MUST_PASS_ONE, GREATER_OR_EQUAL and toBytes come from the HBase client API
// (FilterList.Operator, CompareFilter.CompareOp and Bytes respectively).
val scan = new Scan()
scan.setCaching(caching)
val activity_filters = new FilterList(MUST_PASS_ONE, {
  val scvf = new SingleColumnValueFilter(toBytes("family"), toBytes("column"),
    GREATER_OR_EQUAL, toBytes(value))
  scvf.setFilterIfMissing(true)
  scvf.setLatestVersionOnly(true)
  val scvf2 = ...
  List(scvf, scvf2)
})
scan.setFilter(activity_filters)
new HBaseRawSource(tableName, quorum, families,
  base64Scan = convertScanToBase64(scan)).read. ...
Use Case 3: Join
• Motivation: joining two streams by key
• Different join strategies:
  – joinWithLarger
  – joinWithSmaller
  – joinWithTiny
• Inner, left, and right join modes

val pipe1 = Tsv("file1").read
val pipe2 = Tsv("file2").read // small file
val pipe3 = Tsv("file3").read // huge file
val joinedPipe = pipe1.joinWithTiny('id1 -> 'id2, pipe2)
val joinedPipe2 = pipe1.joinWithLarger('id1 -> 'id2, pipe3)
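A minimal sketch of a non-inner join, reusing the hypothetical pipes above; the joiner parameter accepts a Cascading joiner such as LeftJoin:

import cascading.pipe.joiner.LeftJoin

// Keep every row of pipe1 even when there is no matching 'id2 in pipe2.
val leftJoined = pipe1.joinWithSmaller('id1 -> 'id2, pipe2, joiner = new LeftJoin)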
Use Case 4: Distributed Caching and Counters

// somewhere outside the Job definition
val fl = DistributedCacheFile("/user/boris/zooKeeper.json")
// the resulting path can be passed into any Scalding job, for instance via its Args object
val fileName = fl.path
...

class MyJob(args: Args) extends Job(args) {
  // once we receive fl.path we can read it like an ordinary file
  val fileName = args("fileName")
  lazy val data = readJSONFromFile(fileName)
  ...
  Tsv(args("input")).read.map('line -> 'word) {
    line: String => ... /* using the data json object */ ... }
}

// counter example
Stat("jdbc.call.counter", "myapp").incBy(1)
Use Case 5: Bridging Profiles

Motivation: bridge information from different sources and build a complete person profile.
• The company's own private cookie is obtained thanks to the 1x1-pixel impression ("imp")
• Two SSP cookies (ssp_cookie_Id1 and ssp_cookie_Id2) can be bridged via the private cookie
• Profiles can also be bridged via IP address
Bridging Profiles
General task definition:
• Build a graph
• Identify its connected components

Connected components: let's Scalding it
class ConnectedComponentsJob(args : Args) extends Job(args) {
  import scala.math.{max, min}
  import cascading.pipe.joiner.LeftJoin

  // Iterative label propagation: every vertex carries a group id ('gid),
  // initially equal to its own id; on every pass the smaller gid wins.
  // NOTE: this is a sketch. In practice each pass runs as a separate job
  // and convergence is detected via a Hadoop counter, since a pipe's size
  // cannot be inspected while the flow is being built.
  var attempt = 0
  while (attempt < 20) { // 20 passes as an upper bound on the graph diameter
    val vertexes = Tsv(args("vertexes")).read // 'id \t 'gid; gid defaults to the vertex id
    val edges = Tsv(args("edges")).read       // 'gid_a \t 'gid_b

    // Attach the current gid of both endpoints to every edge.
    var groups = vertexes
      .joinWithSmaller('id -> 'id_b,
        vertexes.joinWithSmaller('id -> 'id_a, edges)
          .discard('id)
          .rename('gid -> 'gid_a))
      .discard('id)
      .rename('gid -> 'gid_b)

    // Keep only edges whose endpoints still disagree, normalized to (max, min).
    groups = groups
      .filter('gid_a, 'gid_b) { gid: (String, String) => gid._1 != gid._2 }
      .project('gid_a, 'gid_b)
      .map(('gid_a, 'gid_b) -> ('gid_a, 'gid_b)) { gid: (Integer, Integer) =>
        (max(gid._1, gid._2), min(gid._1, gid._2))
      }

    // Conceptual convergence check (see NOTE): stop when no disagreements remain.
    // val count = groups.groupBy(('gid_a, 'gid_b)) { _.size }
    // if (count == 0) attempt = 20

    // For every old group id, the new group id is its smallest neighbour.
    val new_groups = groups
      .groupBy('gid_a) { _.min('gid_b) }
      .rename(('gid_a, 'gid_b) -> ('source, 'target))

    // Left-join vertexes against the relabelling and keep the smaller gid.
    val new_vertexes = vertexes
      .joinWithSmaller('id -> 'source, new_groups, joiner = new LeftJoin)
      .mapTo(('id, 'gid, 'source, 'target) -> ('id, 'gid)) {
        param: (Integer, Integer, Integer, Integer) =>
          val (id, gid, source, target) = param
          if (target != null) (id, min(gid, target)) else (id, gid)
      }

    new_vertexes.write(Tsv(args("vertexes")))
    attempt += 1
  }
}
Other nice things
• Typed pipes (a sketch follows this list)
• Elegant and fast matrix operations
• Simple migration to Spark/Kafka
• A way to retrieve data from Hive's HCatalog
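A minimal sketch of the earlier word count rewritten with the typed API; field symbols disappear in favour of compile-time checked tuples (paths come from Args, as before):

import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.toLowerCase.split("\\s+") }
    .groupBy(identity) // group identical words together
    .size              // count occurrences per word
    .write(TypedTsv[(String, Long)](args("output")))
}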
Useful Resources
• http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
• http://www.adopsinsider.com/ad-serving/diagramming-the-ssp-dsp-and-rtb-redirect-path/
• https://github.com/twitter/scalding
• https://github.com/ParallelAI/SpyGlass
• https://github.com/branky/cascading.hive
Thank you!