scalding @ coursera
DESCRIPTION
A lightning talk I gave about how Coursera decided to use Scalding.
TRANSCRIPT
Scalding @ Coursera
Daniel Chia @DanielJHChia
Software Engineer, Infrastructure
Overview
• Context
• Growing Needs
• Hive / Pig / Scalding
Technical (Online Stack)
• 100% hosted on AWS
• Service-oriented architecture
• Mix of MySQL and Cassandra for persistence
• Scala
Existing Warehouse
Streaming
Future Warehouse Flow
S3
Event Data
Need 1: Expressive
• Joins
• Aggregations
• Secondary sort
• Multiple map-reduce
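To make the first two needs concrete, here is a minimal sketch on plain in-memory Scala collections rather than on a cluster. The `ClickEvent` type, its fields, and the sample rows are assumptions for illustration; the slides do not show the real event schema.

```scala
// Hypothetical event record; field names are assumed, not taken from the talk.
final case class ClickEvent(userId: Long, courseId: Long, timestamp: Long)

object ExpressiveSketch {
  val events: Seq[ClickEvent] = Seq(
    ClickEvent(1L, 10L, 1411359695744L),
    ClickEvent(1L, 11L, 1411359600000L),
    ClickEvent(2L, 10L, 1411359700000L)
  )

  // Aggregation: number of events per course.
  val eventsPerCourse: Map[Long, Int] =
    events.groupBy(_.courseId).map { case (courseId, es) => courseId -> es.size }

  // Secondary sort: within each user, order events by timestamp
  // and keep the earliest one.
  val firstEventPerUser: Map[Long, ClickEvent] =
    events.groupBy(_.userId).map { case (userId, es) =>
      userId -> es.sortBy(_.timestamp).head
    }
}
```

On a real cluster the grouping and the per-key sort would each map onto shuffle stages, which is why being able to chain them in one program matters.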
Need 2: Semi-structured Data
• Increased usage of Cassandra
• Events data
{
"timestamp": 1411359695744,
"membershipState": "LearnerEnrolled"
}
{
  "typeName": "multipart",
  "definition": {
    "assignmentParts": {
      "id1": {
        "typeName": "plainText",
        "order": 0,
        "definition": {
          "prompt": "Write a sentence describing what you think about cereal."
        }
      },
      "id2": {
        "typeName": "richText",
        "order": 1,
        "definition": {
          "prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal."
        }
      },
      "id3": {
        "typeName": "url",
        "order": 2,
        "definition": {
          "prompt": "Post a link to your favorite cereal."
        }
      },
      "id4": {
        "typeName": "plainText",
        "order": 3,
        "definition": {
…
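One way nested records like the sample above could be handled in Scala is to map the "typeName" discriminator onto a small set of case classes. This is only a sketch: the trait, the case classes, and `AssignmentPartDecoder` are hypothetical names, and the JSON parsing step itself is left out because the talk does not say which parser was used.

```scala
// Hypothetical model of the assignment parts seen in the sample document.
sealed trait AssignmentPart {
  def order: Int
  def prompt: String
}
final case class PlainText(order: Int, prompt: String) extends AssignmentPart
final case class RichText(order: Int, prompt: String) extends AssignmentPart
final case class Url(order: Int, prompt: String) extends AssignmentPart

object AssignmentPartDecoder {
  // Takes fields already extracted from the JSON and picks the case class
  // named by "typeName". Unknown type names yield None, so one unexpected
  // record does not have to fail a whole batch.
  def fromFields(typeName: String, order: Int, prompt: String): Option[AssignmentPart] =
    typeName match {
      case "plainText" => Some(PlainText(order, prompt))
      case "richText"  => Some(RichText(order, prompt))
      case "url"       => Some(Url(order, prompt))
      case _           => None
    }
}
```

For example, `AssignmentPartDecoder.fromFields("url", 2, "Post a link to your favorite cereal.")` yields a `Url` value, while an unrecognized type name yields `None`.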
Choices
• Hive
• Pig
• Scalding
Hive
• SQL-like language
• Great for simple rollups and aggregations
• Procedural transforms difficult to express
Pig
• Mature
• Procedural
• Pig Latin + Lots of UDFs
Scalding – Pros
• Succinct
• Expressive
• All code in one language
• Re-use online data models
Scalding – Pros
• Easy to test
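A minimal sketch of why this style is easy to test: if the per-record logic lives in plain Scala functions, it can be checked against small in-memory collections with no cluster involved. The `Enrollment` type and `learnersPerCourse` are hypothetical; only the "LearnerEnrolled" state comes from the event sample earlier in the talk. Scalding also ships a `JobTest` helper for running whole jobs against in-memory sources, which is not shown here.

```scala
// Hypothetical record; field names are assumed for illustration.
final case class Enrollment(userId: Long, courseId: Long, membershipState: String)

object EnrollmentStats {
  // Counts enrolled learners per course, ignoring any other membership state.
  def learnersPerCourse(rows: Seq[Enrollment]): Map[Long, Int] =
    rows
      .filter(_.membershipState == "LearnerEnrolled")
      .groupBy(_.courseId)
      .map { case (courseId, group) => courseId -> group.size }
}

object EnrollmentStatsCheck {
  // A tiny hand-rolled test; a real project would use a test framework.
  def run(): Unit = {
    val rows = Seq(
      Enrollment(1L, 10L, "LearnerEnrolled"),
      Enrollment(2L, 10L, "LearnerEnrolled"),
      Enrollment(3L, 10L, "Unenrolled"),
      Enrollment(4L, 20L, "LearnerEnrolled")
    )
    assert(EnrollmentStats.learnersPerCourse(rows) == Map(10L -> 2, 20L -> 1))
  }
}
```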
Scalding – Cons
• Have to learn Scala
• Heavyweight for simple, experimental work
• Many layers abstracted from MapReduce
Scalding – Example
• User event data
• Want to join with course and topic data
Scalding – Example
val events = TypedTsv … /* load data */ .toTypedPipe
val courses = TypedTsv … .toTypedPipe
val topics = TypedTsv … .toTypedPipe
Scalding – Example
events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */
Scalding – Example
events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .sketch(reducers = 100)
  .leftJoin(topics.groupBy(_.topicId))
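The shape of the two chained left joins can be sketched on plain in-memory Scala collections. This is not Scalding code: `leftJoin` here is a hypothetical helper, the record types and sample rows are assumed, and skew handling (the `sketch` step) is not modeled. It is meant only to show what each join produces and why the second join reaches into `_._2`.

```scala
// Hypothetical record types; the slides do not show the real schemas.
final case class Event(userId: Long, courseId: Long)
final case class Course(courseId: Long, topicId: Long)
final case class Topic(topicId: Long, name: String)

object LeftJoinSketch {
  // Left join on a key: every left row is kept, and the right side is None
  // when nothing matches. Assumes at most one right row per key.
  def leftJoin[K, L, R](left: Seq[L], right: Seq[R])(
      leftKey: L => K,
      rightKey: R => K
  ): Seq[(L, Option[R])] = {
    val index: Map[K, R] = right.map(r => rightKey(r) -> r).toMap
    left.map(l => (l, index.get(leftKey(l))))
  }

  val events: Seq[Event] = Seq(Event(1L, 10L), Event(2L, 10L), Event(3L, 99L))
  val courses: Seq[Course] = Seq(Course(10L, 100L))
  val topics: Seq[Topic] = Seq(Topic(100L, "Data"))

  // First join: attach the course, if any, to each event.
  val withCourse: Seq[(Event, Option[Course])] =
    leftJoin(events, courses)(_.courseId, _.courseId)

  // Second join: key on the topic of the optional course (the _2 element),
  // so events without a course simply get no topic.
  val withTopic: Seq[((Event, Option[Course]), Option[Topic])] =
    leftJoin(withCourse, topics)(_._2.map(_.topicId), t => Option(t.topicId))
}
```

Because both joins are left joins, the event for the unknown course 99 survives to the end with `None` for both course and topic instead of being dropped.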
Scalding – Wish-list
• More documentation
• Scala 2.11 soon, please?