How LinkedIn Uses Scalding for Data-Driven Product Development

Post on 11-Aug-2014

Category:

Data & Analytics

DESCRIPTION

Slides from the Cascading meetup on May 29, 2014: http://www.meetup.com/cascading/events/177491292/

TRANSCRIPT

  • Using Scalding for Data-Driven Product Development
      • Sasha Ovsankin, LinkedIn
  • http://linkedin.com/in/sashao
      • Studied Mathematical Physics at Moscow University
      • Software engineering background
      • Work at LinkedIn on Email Experience
      • Publish open source at https://github.com/SashaOv
      • Publish music at SoundCloud
  • /home
      • Scalding is a must-have tool in your arsenal of Hadoop development.
      • Hadoop ecosystem at LinkedIn
      • Hadoop development tools
      • Scalding: why and how
      • What we do with Scalding, code examples
  • /linkedin/hadoop/overview (diagram labels)
      • Online Apps
      • Databases
      • NoSQL Data Stores
      • Hadoop HDFS
      • Hadoop Flows
      • Tracking/logging
      • Analytics
      • Data Products
      • Messaging
      • Message delivery
  • /linkedin/hadoop/practices
      • All online data end up in HDFS
      • Mostly encoded in Avro
      • Production process
          • CI/automatic build (more info forthcoming)
          • Production review
          • Operations and monitoring (more info at http://lnkd.in/gridops2013)
      • Result: thousands of jobs running in production (more info at http://lnkd.in/big-data-ecosystem)
  • /linkedin/hadoop/dev-tools
      • PIG
      • Java MR
      • Scalding
      • Many others that this talk does not cover
  • /hadoop/dev-tools/PIG
      • Relatively mature tool (first official release in 2008)
      • Easy to learn
      • Availability of experienced people
      • Extendable via UDFs
  • /hadoop/dev-tools/Java
      • Java MR
          • Maximum flexibility with the Hadoop API
          • Verbose
      • Cascading
          • Retains (some) Java flexibility
          • Less verbose
  • /hadoop/dev-tools/Scalding
      • http://github.com/twitter/scalding
      • Scala-based DSL
      • Built on Cascading, a stable and mature framework
      • Uses an API similar to Scala collections:
          class WordCountJob(args: Args) extends Job(args) {
            TextLine(args("input"))
              .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
              .groupBy('word) { _.size }
              .write(Tsv(args("output")))
          }
      • Succinct and powerful
      • High level of abstraction
  • /tools/comparison
      • Debugging (stack traces)
          • PIG: No*
          • Java/Scala: Yes
      • Code reuse
          • PIG: Macros, jobs
          • Java/Scala: Classes, packages, modules, frameworks
      • Custom data structures/algorithms
          • PIG: UDF
          • Java/Scala: Native
      • Packaging
          • PIG: Fat jars
          • Java/Scala: Thin jars
      • Avro support
          • PIG: Partial
          • Java/Scala: Native
      • Unit testing
          • PIG: PigUnit (in Java)
          • Java/Scala: Standard unit testing frameworks (JUnit, TestNG, MRUnit), Scalding tests
      • LOC count
          • PIG: Small*
          • Java MR: Large
          • Scalding: Small
  • /tools/buyers-guide
      • If you need: quick-and-dirty simple scripts, existing UDFs. Then use: PIG, Hive
      • If you need: complex flows, full access to Avro, debugging, unit testing, productization. Then use: Scalding
      • If you need: full flexibility of the Hadoop API but not too complex processing. Then use: Java MR
  • /linkedin/email-experience
      • Goal
          • Improve messaging users' experience
      • Plan
          • Track
          • Experiment
          • Optimize
          • Personalize
      • Implementation
          • Generate messages offline
          • Apply sophisticated relevance algorithms
          • Shorten the release cycle to facilitate fast iteration
  • /linkedin/email-experience/overview (diagram labels)
      • Content sources (PIG)
      • HDFS
      • Content sources (Scalding)
      • Content sources (Crunch)
      • Targeting, Relevance (Scalding, Java)
      • Email/Message production (Java MR)
      • Framework (Java)
      • Online Delivery System
  • /email-experience/why-scalding
      • Scala + Map Reduce = match made in heaven
          scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
          res20: Int = 333833500
      • Stack traces (yeah!)
      • Native Avro support
      • Integrates well with CI/build system
  • /email-experience/code
  • /email-experience/code/2
  • /linkedin//scalding/status
      • Started >1 year ago
      • Thousands of production LOC written in Scalding by our team
      • Pretty happy with readability and maintainability
      • ~10 flows are currently in production, and counting
      • Currently ~12 people are coding in Scalding
      • Created Scalding user group
      • Growing interest
      • Learning: Scala[Scalding] < Scala[ _ ]
  • /linkedin//scalding/users
      • Data science
      • Enterprise services
      • Email experience
      • Content
  • /linkedin//scalding/what-to-improve
      • Better Scala language IDE tools
      • One-click development (-> demo)
      • Monitoring and troubleshooting
          • Counters implemented in 0.9
          • Better troubleshooting of the ser/de process
      • Better tools for tuning of jobs
          • Setting the number of mappers and reducers
      • Best practices
  • /home
      • Scalding is a must-have tool in your arsenal of Hadoop development.
      • Hadoop ecosystem at LinkedIn
      • Hadoop development tools
      • Scalding: why and how
      • What we do with Scalding, code examples
  • /linkedin/join-us
      • Work on unique and interesting problems
      • Be part of a great engineering community
      • Use the latest tools and technologies
      • Help connect the world's professionals to help them become more productive and successful
      • We are looking for amazing people interested in Data Science and Software Engineering
      • Questions?
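
The /hadoop/dev-tools/Scalding slide says the Scalding API resembles Scala collections, and the /email-experience/why-scalding slide shows a map/reduce one-liner in the REPL. As a rough illustration of that resemblance, here is a minimal sketch that uses only the Scala standard library, with no Hadoop, Cascading, or Scalding involved. The names `CollectionsSketch`, `wordCount`, and `sumOfSquares` are hypothetical and do not come from the talk. Because `scala.math.pow` returns a `Double`, the sketch squares integers directly so the result stays an `Int`.

```scala
object CollectionsSketch {
  // Word count on an in-memory Seq, following the same flatMap -> groupBy -> size
  // shape as the WordCountJob slide. This is plain Scala, not a Scalding job.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toSeq) // split each line on whitespace
      .filter(_.nonEmpty)                           // drop the empty token left by leading whitespace
      .groupBy(identity)                            // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  // Sum of squares of 1..n using map and reduce, in integer arithmetic.
  // Assumes n >= 1, since reduce throws on an empty collection.
  def sumOfSquares(n: Int): Int =
    (1 to n).map(i => i * i).reduce(_ + _)
}
```

Calling `CollectionsSketch.sumOfSquares(1000)` gives 333833500, the value shown in the REPL line on the slide. A real Scalding job would run the same kind of pipeline over HDFS data instead of an in-memory `Seq`.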