hadoop @ … and other stuff. who am i? i'm this guy! [email protected] staff

Hadoop @… and other stuff

Who Am I?

I'm this guy!

• http://www.linkedin.com/in/richardbhpark• [email protected]

Staff

Hadoop… what is it good for?

Directly influenced by Hadoop

Indirectly influenced by Hadoop

Additionally, 50% for business analytics

A long long time ago (or 2009)

• 40 Million members

• Apache Hadoop 0.19

• 20 node cluster

• Machines built from Frys (pizza boxes)

• PYMK in 3 days!

Now-ish

• Over 5000 nodes

• 6 clusters (1 production, 1 dev, 2 etl, 2 test)

• Apache Hadoop 1.04 (Hadoop 2.0 soon-ish)

• Security turned on

• About 900 users

• 15-20K Hadoop Jobs submissions a day

• PYMK < 12 hours!

Current Setup

• Use Avro (mostly)

• Dev/Adhoc clustero Used for development and testing of workflowso For analytic queries

• Prod clusterso Data that will appear on our websiteo Only reviewed workflows

• ETL clusters

• Walled off

Three Common Problems

Hadoop Cluster (not to scale)

Data In Data OutProcess

Data

Data In

Databases (c. 2009-2010)

• Originally pulled directly through JDBC on backup DBo Pulled deltas when available and merged

• Data comes extra late (wait for replication of replicas)o Large data pulls affected by daily locks

• Very manual. Schema, connections, repairs (manual)

• No delta’s meant no Scoop

• Costly (Oracle)

Live SiteOffline Data copy

Live SiteLive Site

Offline Data copy

Offline Data copy

DWH

24 hr 5-12 hrHadoop

Databases (Present)

• Commit logs/deltas from Production

• Copied directly to HDFS

• Converted/merged to Avro

• Schema is inferred

HadoopLive SiteLive Site

Live Site < 12 hr

Datastores

Databases (Future 2014?)

• Diffs sent directly to Hadoop

• Avro format

• Lazily merge

• Explicit schema

HadoopDatabus ( < 15 min )

Webtrack (c. 2009-2011)

• Flat files (xml)

• Pulled from every servers periodically, grouped and gzipped

• Uploaded into Hadoop

• Failures nearly untraceable

HadoopNAS NAS

?

I seriously don’t knowhow many hops and copies

Webtrack (Present)

• Apache Kafka!! Yay!

• Avro in, Avro out

• Automatic pulls into Hadoop

• Auditing

HadoopKafkaKafka

KafkaKafka

KafkaKafka

KafkaKafka

5-10 mins end to end

Apache Kafka

• LinkedIn Events

• Service metrics

• Use schema registryo Compact data (md5)o Auto registero Validate schemao Get latest schema

• Migrating to Kafka 0.8o Replication

Apache Kafka + Hadoop = Camus

• Avro only

• Uses zookeepero Discover new topicso Find all brokerso Find all partitions

• Mappers pull from Kafka

• Keeps offsets in HDFS

• Partitions into hour

• Counts incoming events

Kafka Auditing

• Use Kafka to Audit itself

• Tool to audit and alert

• Compare counts

• Kafka 0.8?

Lesson’s We Learned

• Avoid lots of small files

• Automation with Auditing = sleep for me

• Group similar data = smaller/faster

• Spend time writing to spend less time readingo Convert to binary, partition, compress

• Future: o adaptive replication (higher for new, lower for old)o Metadata store (hcat)o Columnar store (Orc?, Parquett?)

Processing Data

Pure Java

• Time consuming writing jobs

• Little code re-use

• Shoot yourself in the face

• Only used when necessaryo Performanceo Memory

• Lots of libraries to help (boiler plate stuff)

Little Piggy (Apache Pig)

• Mainly a pigsty (Pig 11.0)

• Used by data products

• Transparent

• Good performance, tunable

• UDF’s, Datafu

• Tuples and bags? WTF

Hive

• Hive 11

• Only for Adhoc querieso Biz ops, PM’s, analyst

• Hard to tune

• Easy to use

• Lots of adoption

• Etl data in external tables :/

• Hive server 2 for JDBC

Disturbing Mascot

Future in Processing

• Giraph

• Impala, Shark/Spark… etc

• Tez

• Crunch

• Other?

• Say no to streaming

Workflows

Azkaban

• Run hadoop jobs in order

• Run regular schedules

• Be notified on failures

• Understand how flows are executed

• View execution history

• Easy to use

Azkaban @ LinkedIn

• Used in LinkedIn since early 2009

• Powers all our Hadoop data products

• Been using 2.0+ since late 2012

• 2.0 and 2.1 quietly released early 2013

Azkaban @ LinkedIn

• One Azkaban instance per cluster

• 6 clusters total

• 900 Users

• 1500 projects

• 10,000 flows

• 2500 flow executing per day

• 6500 jobs executing per day

Azkaban (before)

Engineer designed UI...

Azkaban 2.0

Azkaban Features

• Schedules DAGs for executions

• Web UI

• Simple job files to create dependencies

• Authorization/Authentication

• Project Isolation

• Extensible through plugins (works with any version of Hadoop)

• Prison for dark wizards

Azkaban - Upload

• Zip Job files, jars, project files

Azkaban - Execute

Azkaban - Schedule

Azkaban - Viewer Plugins

HDFS Browser

Reportal

Future Azkaban Work

• Higher availability

• Generic Triggering/Actions

• Embedded graphs

• Conditional branching

• Admin client

Data Out

Voldemort

• Distributed Key-Value Store

• Based on Amazon Dynamo

• Pluginable

• Open-source

Voldemort Read-Only

• Filesystem store for RO

• Create data files and index on Hadoop

• Copy data to Voldemort

• Swap

Voldemort + Hadoop

• Transfers are parallel

• Transfer records in bulk

• Ability to Roll back

• Simple, operationally low maintenance

• Why not Hbase, Cassandra?o Legacy, and no compelling reason to changeo Simplicity is niceo Real answer: I don’t know. It works, we’re happy.

Apache Kafka

• Reverse the flow

• Messages produced by Hadoop

• Consumer upstream takes action

• Used for emails, r/w store updates, where Voldemort doesn’t make sense etc

Nearing the End

Misc Hadoop at LinkedIn

• Believe in optimizationo File size, task count and utilizationo Reviews, culture

• Strict limitso Quotas size/file counto 10K task limit

• Use capacity schedulero Default queue with 15m limito marathon for others

We do a lot with little…

• 50-60% cluster utilization o Or about 5x more work than some other companies

• Every job is reviewed for productiono Teaches good practiceso Schedule to optimize utilizationo Prevents future headaches

• These keep our team size smallo Since 2009, hadoop users grew 90x, clusters grew

25x, LinkedIn employees grew 15xo hadoop team 5x (to 5 people)

More info

Our data site: data.linkedin.com

Kafka: kafka.apache.com

Azkaban: azkaban.github.io/azkaban2

Voldemort: project-voldemort.com

The End

hadoop @ … and other stuff. who am i? i'm this guy! [email protected] staff

Documents