How LinkedIn Uses Scalding for Data-Driven Product Development

Using Scalding for Data-Driven Product Development
Sasha Ovsankin, LinkedIn

Upload: sasha-ovsankin

Post on 11-Aug-2014


DESCRIPTION

Slides from the Cascading meetup May 29, 2014 http://www.meetup.com/cascading/events/177491292/

TRANSCRIPT

Page 1: How LinkedIn Uses Scalding for Data Driven Product Development

Using Scalding for Data-Driven Product Development

Sasha Ovsankin, LinkedIn

Page 2: How LinkedIn Uses Scalding for Data Driven Product Development

http://linkedin.com/in/sashao

• Studied Mathematical Physics at Moscow University
• Software Engineering background
• Work at LinkedIn on Email Experience
• Publish open source at https://github.com/SashaOv
• Publish music at SoundCloud

Page 3: How LinkedIn Uses Scalding for Data Driven Product Development

/home

Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples

Page 4: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/hadoop/overview

[Diagram: Hadoop ecosystem at LinkedIn. Components shown: Online Apps, Databases, NoSQL Data Stores, ETL, Hadoop, HDFS, Hadoop Flows, Tracking/logging, Analytics, Data Products, Messaging, Message delivery.]

Page 5: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/hadoop/practices

• All online data end up in HDFS
– Mostly encoded in Avro
• Production Process
– CI/Automatic Build (more info forthcoming)
– Production Review
– Operations and Monitoring (more info at http://lnkd.in/gridops2013)
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem

Page 6: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/hadoop/dev-tools

• PIG
• Java MR
• Scalding
• Many others, not covered today

Page 7: How LinkedIn Uses Scalding for Data Driven Product Development

/hadoop/dev-tools/PIG

• Relatively mature tool
– First official release in 2008
• Easy to learn
• Availability of experienced people
• Extensible via UDFs

Page 8: How LinkedIn Uses Scalding for Data Driven Product Development

/hadoop/dev-tools/Java

• Java MR
– Maximum flexibility with Hadoop API
– Verbose
• Cascading
– Retain (some) Java flexibility
– Less verbose

Page 9: How LinkedIn Uses Scalding for Data Driven Product Development

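/hadoop/dev-tools/Scalding

http://github.com/twitter/scalding

• Scala-based DSL
• Built on Cascading, a stable and mature framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      // Read the input file; each record carries its text in the 'line field
      TextLine(args("input"))
        // Split every line on whitespace, emitting one 'word per token
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        // Count the occurrences of each word
        .groupBy('word) { _.size }
        // Write (word, count) pairs as tab-separated values
        .write(Tsv(args("output")))
    }

• Succinct and powerful
• High level of abstraction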
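To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The object and method names (`WordCountSketch`, `wordCount`) are illustrative and do not come from the talk.

```scala
object WordCountSketch {
  // In-memory analogue of the Scalding job: same flatMap / groupBy / size steps,
  // but on a local Seq[String] instead of a distributed pipe.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // one element per whitespace-separated token
      .groupBy(identity)             // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    wordCount(Seq("a b a", "b c")).foreach { case (w, n) => println(s"$w\t$n") }
}
```

The shape of the two programs is nearly identical; the Scalding version differs mainly in naming its fields and in reading from and writing to Hadoop sources.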

Page 10: How LinkedIn Uses Scalding for Data Driven Product Development

…/tools/comparison

                                     PIG                 Java/Scala
Debugging: stack traces              No*                 Yes
Code reuse                           Macros, jobs        Classes, packages, modules, frameworks…
Custom data structures/algorithms    UDF                 Native
Packaging                            Fat jars            Thin jars
Avro support                         Partial             Native
Unit testing                         PigUnit (in Java)   Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

                                     PIG                 Java MR             Scalding
LOC count                            Small*              Large               Small
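The "Custom data structures/algorithms: Native" and "Unit testing" rows can be illustrated with a small sketch: the record type is an ordinary case class, and the logic is a pure function that any standard test framework can exercise without a cluster. `EmailEvent` and `openRateByCampaign` are hypothetical names made up for this example; they are not taken from the talk or from LinkedIn's code.

```scala
object EmailStats {
  // Hypothetical record type; in a real flow this might be decoded from Avro.
  final case class EmailEvent(memberId: Long, campaign: String, opened: Boolean)

  // Pure function: fraction of events that were opened, per campaign.
  // Being free of Hadoop dependencies, it can be unit tested directly.
  def openRateByCampaign(events: Seq[EmailEvent]): Map[String, Double] =
    events.groupBy(_.campaign).map { case (campaign, es) =>
      campaign -> es.count(_.opened).toDouble / es.size
    }
}
```

A test then reduces to building a few `EmailEvent` values and asserting on the returned map, which is the kind of check that JUnit, TestNG, or a Scala test framework would wrap.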

Page 11: How LinkedIn Uses Scalding for Data Driven Product Development

…/tools/buyers-guide

If you need…                                                                   Then use…
Quick-and-dirty simple scripts, existing UDFs                                  PIG, Hive
Complex flows, full access to Avro, debugging, unit testing, productization    Scalding
Full flexibility of Hadoop API but not too complex processing                  Java MR

Page 12: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/email-experience

• Goal
– Improve messaging users’ experience
• Plan
– Track
– Experiment
– Optimize
– Personalize
• Implementation
– Generate messages offline
– Apply sophisticated relevance algorithms
– Shorten the release cycle to facilitate fast iteration

Page 13: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/email-experience/overview

[Diagram: email experience architecture. Components shown: Content sources (PIG), Content sources (Scalding), Content sources (Crunch), HDFS, Targeting and Relevance (Scalding, Java), Email/Message production (Java MR), Framework (Java), Online Delivery System.]

Page 14: How LinkedIn Uses Scalding for Data Driven Product Development

…/email-experience/why-scalding

• Scala + Map Reduce = match made in heaven

    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500

• Stack traces (yeah!)
• Native Avro support
• Integrates well with CI/build system
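The REPL one-liner on this slide maps a range and then reduces it, which is the map/reduce idea in miniature. The slide does not show where `pow` comes from; with `scala.math.pow` the result would be typed `Double`, so the sketch below makes an assumption and squares with plain `Int` multiplication to get an `Int` total. The names `SumOfSquares`, `squares`, `sumOfSquares`, and `closedForm` are illustrative only.

```scala
object SumOfSquares {
  // Map step: square each element. Int multiplication keeps the element type Int.
  def squares(n: Int): Seq[Int] = (1 to n).map(x => x * x)

  // Reduce step: combine the squared values into one total (requires n >= 1,
  // because reduce on an empty collection throws).
  def sumOfSquares(n: Int): Int = squares(n).reduce(_ + _)

  // Closed form n(n+1)(2n+1)/6 as a cross-check; Long avoids overflow in the product.
  def closedForm(n: Int): Long = n.toLong * (n + 1) * (2L * n + 1) / 6

  def main(args: Array[String]): Unit =
    println(sumOfSquares(1000)) // prints 333833500
}
```

On a cluster the same two steps are distributed: the map runs on each input split, and the reduce combines the partial results.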

Page 15: How LinkedIn Uses Scalding for Data Driven Product Development

…/email-experience/code

Page 16: How LinkedIn Uses Scalding for Data Driven Product Development

…/email-experience/code/2

Page 17: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/…/scalding/status

• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
– Pretty happy with readability and maintainability
• ~10 flows are currently in production, and counting
• Currently ~12 people are coding in Scalding
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]

Page 18: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/…/scalding/users

• Data science
• Enterprise services
• Email experience
• Content

Page 19: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/…/scalding/what-to-improve

• Better Scala language IDE tools
• One-click development (-> demo)
• Monitoring and troubleshooting
– Counters (implemented in 0.9)
– Better troubleshooting of the ser/de process
• Better tools for tuning of jobs
– Setting the number of mappers and reducers
• Best practices

Page 20: How LinkedIn Uses Scalding for Data Driven Product Development

/home

Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples

Page 21: How LinkedIn Uses Scalding for Data Driven Product Development

/linkedin/join-us

• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Data Science and Software Engineering

Questions?