How LinkedIn Uses Scalding for Data-Driven Product Development

Post on 11-Aug-2014

Category:

Data & Analytics

DESCRIPTION

Slides from the Cascading meetup May 29, 2014 http://www.meetup.com/cascading/events/177491292/

TRANSCRIPT

Using Scalding for Data-Driven Product Development

Sasha Ovsankin, LinkedIn

http://linkedin.com/in/sashao

• Studied Mathematical Physics at Moscow University
• Software Engineering background
• Work at LinkedIn on Email Experience
• Publish open source at https://github.com/SashaOv
• Publish music at SoundCloud

/home

Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples

/linkedin/hadoop/overview

[Diagram components: Online Apps, Databases, NoSQL Data Stores, ETL, Hadoop (HDFS, Hadoop Flows), Tracking/logging, Analytics, Data Products, Messaging, Message delivery]

/linkedin/hadoop/practices

• All online data end up in HDFS
  – Mostly encoded in Avro
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem

/linkedin/hadoop/dev-tools

• PIG
• Java MR
• Scalding
• + many others, not covered today

/hadoop/dev-tools/PIG

• Relatively mature tool
  – First official release 2008
• Easy to learn
• Availability of experienced people
• Extendable via UDF

/hadoop/dev-tools/Java

• Java MR
  – Maximum flexibility with Hadoop API
  – Verbose
• Cascading
  – Retain (some) Java flexibility
  – Less verbose

/hadoop/dev-tools/Scalding

http://github.com/twitter/scalding

• Scala-based DSL
• Built on Cascading, a stable and mature framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    // Reads lines of text, splits them into words, and writes a count per word.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

• Succinct and powerful
• High level of abstraction
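To illustrate the "API similar to Scala collections" point, here is a minimal sketch of the same word count written against ordinary in-memory Scala collections, with no Hadoop or Scalding involved. The object and method names (`WordCountSketch`, `countWords`) are hypothetical and not from the slides, and the empty-token filter is an addition the Scalding job above does not have.

```scala
// Plain Scala collections version of the word count (illustrative only).
// WordCountSketch and countWords are hypothetical names, not from the slides.
object WordCountSketch {
  // Split each line on whitespace, drop empty tokens, and count occurrences per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toSeq) // same shape as the job's flatMap
      .filter(_.nonEmpty)                            // assumption: ignore empty tokens
      .groupBy(identity)                             // same shape as groupBy('word)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("to be or not to be", "that is the question")
    // Print word/count pairs sorted by word, tab-separated like a Tsv sink.
    countWords(sample).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The `flatMap`, `groupBy`, and `size` steps line up one-to-one with the Scalding job, which is the similarity the slide refers to.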

…/tools/comparison

                                    PIG                 Java/Scala
Debugging: stack traces             No*                 Yes
Code reuse                          Macros, jobs        Classes, packages, modules, frameworks…
Custom data structures/algorithms   UDF                 Native
Packaging                           Fat jars            Thin jars
Avro support                        Partial             Native
Unit testing                        PigUnit (in Java)   Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

            PIG      Java MR   Scalding
LOC count   Small*   Large     Small
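One way to read the "Unit testing" row: when per-record logic lives in an ordinary Scala function, it can be checked with any standard test framework without starting Hadoop. The sketch below uses plain `assert` to stay dependency-free; `TokenizerSketch` and `tokenize` are hypothetical names, not code from the talk.

```scala
// Illustrative sketch: job logic kept in a plain function so it can be unit tested.
// TokenizerSketch and tokenize are hypothetical names, not from the slides.
object TokenizerSketch {
  // The kind of per-line function a job could pass to flatMap.
  def tokenize(line: String): Seq[String] =
    line.trim.split("""\s+""").toSeq.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Checks that a JUnit or TestNG test could express in the same way.
    assert(tokenize("hello  world") == Seq("hello", "world"))
    assert(tokenize("   ").isEmpty)
    println("all checks passed")
  }
}
```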

…/tools/buyers-guide

If you need…                                                                  Then use…
Quick-and-dirty simple scripts, existing UDFs                                 PIG, Hive
Complex flows, full access to Avro, debugging, unit testing, productization   Scalding
Full flexibility of Hadoop API but not too complex processing                 Java MR

/linkedin/email-experience

• Goal
  – Improve messaging users’ experience
• Plan
  – Track
  – Experiment
  – Optimize
  – Personalize
• Implementation
  – Generate messages offline
  – Apply sophisticated relevance algorithms
  – Shorten the release cycle to facilitate fast iteration

/linkedin/email-experience/overview

[Diagram components: Content sources (PIG), Content sources (Scalding), Content sources (Crunch), HDFS, Targeting and Relevance (Scalding, Java), Email/Message production (Java MR), Framework (Java), Online Delivery System]

…/email-experience/why-scalding

• Scala + Map Reduce = match made in heaven

    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500

• Stack traces (yeah!)
• Native Avro support
• Integrates well with CI/build system
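The REPL one-liner above can be sketched as a small standalone program. The slide does not show where `pow` comes from, so this version assumes plain integer multiplication, which keeps the result an `Int`; the names `SumOfSquaresSketch`, `sumOfSquares`, and `closedForm` are hypothetical.

```scala
// Illustrative sketch of the map-then-reduce shape on an in-memory range.
// Assumption: squaring is done with integer multiplication, since the slide's
// pow is not shown. All names here are hypothetical.
object SumOfSquaresSketch {
  // "Map" squares each element; "reduce" combines the squares with +.
  // Requires n >= 1 (reduce fails on an empty range); Int overflows for large n.
  def sumOfSquares(n: Int): Int =
    (1 to n).map(x => x * x).reduce(_ + _)

  // Closed form n(n+1)(2n+1)/6, computed in Long as an independent cross-check.
  def closedForm(n: Int): Long =
    n.toLong * (n + 1) * (2L * n + 1) / 6

  def main(args: Array[String]): Unit = {
    println(sumOfSquares(1000)) // prints 333833500
    println(closedForm(1000))   // prints 333833500
  }
}
```

A Hadoop job has the same two phases, except that the map and reduce functions run distributed over records in HDFS rather than over a local range.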

…/email-experience/code

…/email-experience/code/2

/linkedin/…/scalding/status

• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability and maintainability
• ~10 flows are currently in production, and counting
• Currently ~12 people are coding in Scalding
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]

/linkedin/…/scalding/users

• Data science
• Enterprise services
• Email experience
• Content

/linkedin/…/scalding/what-to-improve

• Better Scala language IDE tools
• One-click development (-> demo)
• Monitoring and troubleshooting
  – Counters – implemented in 0.9
  – Better troubleshooting of the ser/de process
• Better tools for tuning of jobs
  – Setting # of mappers and reducers
• Best practices

/home

Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples

/linkedin/join-us

• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Data Science and Software Engineering

Questions?
