Slides from the Cascading meetup May 29, 2014


  • Using Scalding for Data-Driven Product Development. Sasha Ovsankin, LinkedIn
  • Studied Mathematical Physics at Moscow University; Software Engineering background; Work at LinkedIn on Email Experience; Publish open source at; Publish music at SoundCloud
  • /home: Scalding is a must-have tool in your Hadoop development arsenal. Hadoop ecosystem at LinkedIn; Hadoop development tools; Scalding: why and how; What we do with Scalding, code examples.
  • /linkedin/hadoop/overview (diagram labels): Online Apps, Databases, NoSQL Data Stores, Hadoop, HDFS, Hadoop Flows, Tracking/logging, Analytics, Data Products, Messaging, Message delivery
  • /linkedin/hadoop/practices: All online data ends up in HDFS, mostly encoded in Avro. Production process: CI/automatic build; more info forthcoming; production review; operations and monitoring; more info at. Result: thousands of jobs running in production; more info at
  • /linkedin/hadoop/dev-tools: PIG; Java MR; Scalding; plus many others that this talk does not cover.
  • /hadoop/dev-tools/PIG: Relatively mature tool (first official release 2008); easy to learn; experienced people are available; extensible via UDFs.
  • /hadoop/dev-tools/Java: Java MR: maximum flexibility with the Hadoop API, but verbose. Cascading: retains (some) Java flexibility and is less verbose.
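  • /hadoop/dev-tools/Scalding: Scala-based DSL; built on Cascading, a stable and mature framework; uses an API similar to Scala collections:

        import com.twitter.scalding._

        // Counts how often each word occurs in the input text.
        class WordCountJob(args : Args) extends Job(args) {
          TextLine( args("input") )
            // Split each line on whitespace, emitting one 'word per token
            .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
            // Group identical words and count the size of each group
            .groupBy('word) { _.size }
            .write( Tsv( args("output") ) )
        }

    Succinct and powerful; high level of abstraction.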
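
The collections analogy on the slide above can be made concrete with a minimal sketch that runs the same word count on an in-memory `List[String]`, using only the Scala standard library. The object name `LocalWordCount` and the sample input are hypothetical, chosen for illustration; they are not part of the original deck.

```scala
// Hypothetical illustration: the WordCountJob pipeline expressed on plain
// Scala collections, with no Hadoop or Scalding dependency.
object LocalWordCount {
  // Split each line on whitespace, group identical words, count each group.
  def count(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toList) // mirrors .flatMap('line -> 'word)
      .filter(_.nonEmpty)                            // drop empty tokens caused by leading whitespace
      .groupBy(identity)                             // mirrors .groupBy('word)
      .map { case (word, occurrences) => word -> occurrences.size } // mirrors _.size

  def main(args: Array[String]): Unit = {
    val counts = count(List("to be or not to be", "be quick"))
    // Print one tab-separated word/count pair per line, sorted by word.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The chain of `flatMap`, `groupBy`, and a size count has the same shape as the Scalding job, which is the similarity the slide points to; the Scalding version runs the steps as a Hadoop flow instead of in memory.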
  • /tools/comparison

                                          | PIG               | Java/Scala
      Debugging: stack traces             | No*               | Yes
      Code reuse                          | Macros, jobs      | Classes, packages, modules, frameworks
      Custom data structures/algorithms   | UDF               | Native
      Packaging                           | Fat jars          | Thin jars
      Avro support                        | Partial           | Native
      Unit testing                        | PigUnit (in Java) | Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

                  | PIG    | Java MR | Scalding
      LOC count   | Small* | Large   | Small
  • /tools/buyers-guide

      If you need                                                                   | Then use
      Quick-and-dirty simple scripts, existing UDFs                                 | PIG, Hive
      Complex flows, full access to Avro, debugging, unit testing, productization   | Scalding
      Full flexibility of Hadoop API, but not too complex processing                | Java MR
  • /linkedin/email-experience: Goal: improve the messaging user experience. Plan: track, experiment, optimize, personalize. Implementation: generate messages offline; apply sophisticated relevance algorithms; shorten the release cycle to facilitate fast iteration.
  • /linkedin/email-experience/overview (diagram labels): Content sources (PIG); HDFS; Content sources (Scalding); Content sources (Crunch); Targeting, Relevance (Scalding, Java); Email/Message production (Java MR); Framework (Java); Online Delivery System
  • /email-experience/why-scalding: Scala + MapReduce = match made in heaven:

        scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
        res20: Int = 333833500

    Stack traces (yeah!); native Avro support; integrates well with CI/build system.
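
The one-liner above maps onto MapReduce because addition is associative, so partial sums computed on independent splits can be combined in any grouping. Here is a minimal sketch of that idea in plain Scala; the object name `SumOfSquares` and the split size of 100 are hypothetical choices for illustration, and `grouped` only simulates input splits in memory.

```scala
// Hypothetical illustration of why map + reduce parallelizes: each split is
// processed independently, and the partial results are then combined.
object SumOfSquares {
  // "Map" step: square each element of one split.
  // "Combine" step: collapse the split into a single partial sum.
  def partialSum(split: Seq[Int]): Long =
    split.map(x => x.toLong * x).sum

  // Simulate input splits with grouped(), then reduce the partial sums.
  // Assumes non-empty input, because reduce has no value for an empty iterator.
  def total(data: Seq[Int], splitSize: Int): Long =
    data.grouped(splitSize).map(partialSum).reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(total(1 to 1000, 100)) // prints 333833500
}
```

Changing `splitSize` changes how the work is partitioned but not the result, which is the property a reducer relies on.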
  • /linkedin//scalding/status: Started more than a year ago; thousands of production LOC written in Scalding by our team; pretty happy with readability and maintainability; ~10 flows currently in production, and counting; ~12 people currently coding in Scalding; created a Scalding user group; growing interest. Learning: Scala[Scalding] < Scala[ _ ]
  • /linkedin//scalding/users: Data science; Enterprise services; Email experience; Content
  • /linkedin//scalding/what-to-improve: Better Scala language IDE tools; one-click development (-> demo); monitoring and troubleshooting (counters implemented in 0.9; better troubleshooting of the ser/de process); better tools for tuning jobs (setting the number of mappers and reducers); best practices.
