Scalding by Adform Research, Alex Gryzlov

Wordcount in MapReduce

Upload: vasil-remeniuk

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Scalding by Adform Research, Alex Gryzlov

Wordcount in MapReduce

Page 2: Cascading

Tap / Pipe / Sink abstraction over Map / Reduce in Java

Page 3: Cascading

Page 4: Wordcount in Cascading
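The wordcount source on this slide did not survive the transcript. A minimal sketch of the classic Cascading 2.x wordcount, written in Scala against the Java API — the paths and field names here are placeholders, not the slide's originals:

```scala
import cascading.flow.hadoop.HadoopFlowConnector
import cascading.operation.aggregator.Count
import cascading.operation.regex.RegexSplitGenerator
import cascading.pipe.{Each, Every, GroupBy, Pipe}
import cascading.scheme.hadoop.TextLine
import cascading.tap.hadoop.Hfs
import cascading.tuple.Fields

object CascadingWordCount {
  def main(args: Array[String]): Unit = {
    // Taps: where data comes from and where it goes
    val source = new Hfs(new TextLine(new Fields("offset", "line")), "input.txt")
    val sink   = new Hfs(new TextLine(), "output")

    // Pipes: the Map/Reduce logic, assembled step by step
    var pipe: Pipe = new Pipe("wordcount")
    pipe = new Each(pipe, new Fields("line"),
      new RegexSplitGenerator(new Fields("word"), "\\s+"))   // line -> one tuple per word
    pipe = new GroupBy(pipe, new Fields("word"))             // group identical words
    pipe = new Every(pipe, new Count(new Fields("count")))   // count each group

    new HadoopFlowConnector().connect(source, sink, pipe).complete()
  }
}
```

Note how much plumbing the Tap / Pipe / Sink abstraction still requires — this is the verbosity Scalding removes.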

Page 5: Scalding

• Scala wrapper for Cascading

• Just like working with in-memory collections (map/filter/sort…)

• Built-in parsers for {T|C}SV, date annotations, etc.

• Helper algorithms, e.g.:

  • approximations (Algebird library)

  • matrix API

Page 6: Wordcount in Scalding
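The code for this slide is also missing from the transcript. The canonical fields-API wordcount from the Scalding tutorial looks roughly like this (class and field names are the tutorial's conventions, not necessarily the slide's):

```scala
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                      // read lines from the input file
    .flatMap('line -> 'word) { line: String =>
      line.toLowerCase.split("\\s+")           // split each line into words
    }
    .groupBy('word) { _.size }                 // count occurrences per word
    .write(Tsv(args("output")))                // write (word, count) pairs as TSV
}
```

Compare with the Cascading version: the collection-style map/filter/groupBy calls replace the explicit Pipe assembly.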

Page 7:

Run the WordCountJob in local mode with the given input and output
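The command itself is not in the transcript. Assuming the jar and entry-class conventions shown on the EMR slide later in this deck, a local-mode run would look roughly like this (job class and paths are hypothetical):

```shell
# Run the job against local files instead of HDFS (--local)
hadoop jar job.jar com.twitter.scalding.Tool \
  com.example.WordCountJob \
  --local \
  --input input.txt \
  --output output.txt
```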

Page 8: Building and Deploying

• Get sbt

• sbt assembly produces a jar file in target/scala_2.10

• sbt s3-upload produces the jar and uploads it to S3
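For reference, sbt assembly is not built into sbt; it comes from the sbt-assembly plugin, enabled with a line like the following in project/plugins.sbt (the version shown is an assumption typical of the Scala 2.10 era, not taken from the deck):

```scala
// project/plugins.sbt — provides the `assembly` task that builds a fat jar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
```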

Page 9: Running on EMR

• hadoop fs -get s3://dev-adform-test/madeup-job.jar job.jar

• hadoop jar job.jar \
    com.twitter.scalding.Tool \              (entry class)
    com.adform.dspr.MadeupJob \              (Scalding job class)
    --hdfs \                                 (run in HDFS mode)
    --logs s3://dev-adform-test/logs \       (parameter)
    --meta s3://dev-adform-test/metadata \   (parameter)
    --output s3://dev-adform-test/output     (parameter)

For more complicated workflows you would have to use applications like Oozie or Pentaho, or write a custom runner app; check out https://gitz.adform.com/dco/dco-amazon-runner

Page 10: Development

• Two APIs:

• Fields – everything is a string

• Typed – working with classes, e.g. Request/Transaction

Page 11: Development

• Fields:

  • No need to parse columns

  • Redundancy

  • No IDE support such as auto-completion

• Typed:

  • All the benefits of types, especially compile-time checking

  • More manual work with parsing

  • The API can sometimes be confusing (TypedPipe/Grouped/CoGrouped…)
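A small sketch of the typed API for contrast, using a made-up Request class like the one mentioned on the previous slide (field layout and job name are assumptions):

```scala
import com.twitter.scalding._

// hypothetical domain class, as on the slide
case class Request(user: String, url: String)

class TopUrlsJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String)](args("input")))  // columns parsed into a tuple
    .map { case (user, url) => Request(user, url) }          // the manual-parsing step
    .groupBy(_.url)                                          // key by URL -> Grouped
    .size                                                    // count requests per URL
    .write(TypedTsv[(String, Long)](args("output")))         // output shape checked at compile time
}
```

A type mismatch anywhere in this pipeline fails at compile time, whereas the fields API would only fail at runtime.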

Page 12: Downsides

• A lot of configuring and googling random issues

• Scarce documentation; you have to read source code / Stack Overflow

• IntelliJ is slow

• Boilerplate code for parsing data

Page 13: Some tips

• In local mode you specify files as input/output; in HDFS mode, folders

• You can use Hadoop API to read files from HDFS directly, but only on submitting node, not in the pipeline

• As a workaround for previous problem, you can use a distributed cache mechanism, but that only works on Hadoop 1 AFAIK

• Default memory limit per mapper/reducer is ~200 MB; it can be raised by overriding Job.config and adding "mapred.child.java.opts" -> "-Xmx<NUMBER>m"
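The override from the last tip can be sketched like this (the job class name and heap size are placeholders):

```scala
import com.twitter.scalding._

class MadeupJob(args: Args) extends Job(args) {
  // raise the per-mapper/reducer JVM heap above the ~200 MB default
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("mapred.child.java.opts" -> "-Xmx2048m")

  // ... pipeline definition goes here ...
}
```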

Page 14: Resources

• https://github.com/twitter/scalding/wiki Wiki

• https://github.com/twitter/scalding/tree/develop/tutorial Basic stuff

• https://github.com/twitter/scalding/tree/develop/scalding-core/src/main/scala/com/twitter/scalding/examples Advanced examples, e.g., iterative jobs

• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation

• http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf

• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014