Hadoop @ … and other stuff

Upload: piers-wilson

Post on 15-Jan-2016

Page 1: Hadoop @ … and other stuff. Who Am I? I'm this guy!  rpark@linkedin.com Staff

Hadoop @… and other stuff

Page 2

Who Am I?

I'm this guy!

• http://www.linkedin.com/in/richardbhpark
• rpark@linkedin.com

Staff

Page 3

Hadoop… what is it good for?

Directly influenced by Hadoop

Indirectly influenced by Hadoop

Additionally, 50% for business analytics

Page 4

A long long time ago (or 2009)

• 40 Million members

• Apache Hadoop 0.19

• 20 node cluster

• Machines built from Fry's (pizza boxes)

• PYMK (People You May Know) in 3 days!

Page 5

Now-ish

• Over 5000 nodes

• 6 clusters (1 production, 1 dev, 2 etl, 2 test)

• Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)

• Security turned on

• About 900 users

• 15-20K Hadoop job submissions a day

• PYMK < 12 hours!

Page 6

Current Setup

• Use Avro (mostly)

• Dev/Adhoc cluster
  o Used for development and testing of workflows
  o For analytic queries

• Prod clusters
  o Data that will appear on our website
  o Only reviewed workflows

• ETL clusters

• Walled off

Page 7

Three Common Problems

[Diagram: Hadoop Cluster (not to scale), with three stages: Data In, Process Data, Data Out]

Page 8

Data In

Page 9

Databases (c. 2009-2010)

• Originally pulled directly through JDBC on backup DB
  o Pulled deltas when available and merged

• Data comes extra late (wait for replication of replicas)
  o Large data pulls affected by daily locks

• Very manual: schema, connections, repairs

• No deltas meant no Sqoop

• Costly (Oracle)

[Diagram: Live Site, Offline Data copy, DWH, Hadoop; latencies labeled 24 hr and 5-12 hr]

Page 10

Databases (Present)

• Commit logs/deltas from Production

• Copied directly to HDFS

• Converted/merged to Avro

• Schema is inferred

[Diagram: Live Site to Hadoop, < 12 hr]
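
The "converted/merged" step on this slide can be pictured as applying a batch of commit-log deltas to the previous snapshot. The following is only a minimal sketch: the slides do not describe the record layout or the merge job, so the `(key, seq, row)` delta tuple, the use of `None` as a delete marker, and the function name `merge_deltas` are all assumptions made for illustration.

```python
def merge_deltas(snapshot, deltas):
    """Apply commit-log deltas to a snapshot keyed by primary key.

    snapshot: dict of key -> (seq, row)
    deltas:   iterable of (key, seq, row); row=None marks a delete
              (assumed convention, not taken from the slides)
    """
    merged = dict(snapshot)
    for key, seq, row in deltas:
        current = merged.get(key)
        # Skip deltas that are older than (or equal to) what we already hold.
        if current is not None and current[0] >= seq:
            continue
        merged[key] = (seq, row)
    # Deleted rows are kept as tombstones during the batch, then dropped.
    return {k: v for k, v in merged.items() if v[1] is not None}


if __name__ == "__main__":
    snapshot = {1: (1, "alice"), 2: (1, "bob")}
    deltas = [(1, 2, "alice-v2"), (2, 3, None), (3, 1, "carol")]
    print(merge_deltas(snapshot, deltas))
```

Keeping the sequence number next to each row is what lets late or replayed deltas be ignored safely.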

Page 11

Datastores

Databases (Future 2014?)

• Diffs sent directly to Hadoop

• Avro format

• Lazily merge

• Explicit schema

[Diagram: Databus to Hadoop (< 15 min)]

Page 12

Webtrack (c. 2009-2011)

• Flat files (xml)

• Pulled from every server periodically, grouped and gzipped

• Uploaded into Hadoop

• Failures nearly untraceable

[Diagram: NAS, NAS, ?, Hadoop]

I seriously don’t know how many hops and copies

Page 13

Webtrack (Present)

• Apache Kafka!! Yay!

• Avro in, Avro out

• Automatic pulls into Hadoop

• Auditing

[Diagram: multiple Kafka instances feeding Hadoop; 5-10 mins end to end]

Page 14

Apache Kafka

• LinkedIn Events

• Service metrics

• Use schema registry
  o Compact data (md5)
  o Auto register
  o Validate schema
  o Get latest schema

• Migrating to Kafka 0.8
  o Replication
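
One way to read "Compact data (md5)" is that each message carries a 16-byte MD5 fingerprint of its schema instead of the schema itself, and readers look the schema up in the registry. That reading is an assumption; the slides do not spell out the wire format. The sketch below illustrates the idea with a hypothetical in-memory `SchemaRegistry` class and hypothetical `encode`/`decode` helpers. It is not the API of any real registry, and schema validation is left out.

```python
import hashlib
import json


class SchemaRegistry:
    """Toy in-memory registry keyed by the MD5 of a canonical schema string."""

    def __init__(self):
        self._by_id = {}    # 16-byte MD5 digest -> schema
        self._latest = {}   # topic -> digest of the most recently registered schema

    def register(self, topic, schema):
        # Canonicalize so that key order does not change the fingerprint.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = hashlib.md5(canonical.encode("utf-8")).digest()
        self._by_id.setdefault(schema_id, schema)
        self._latest[topic] = schema_id
        return schema_id

    def lookup(self, schema_id):
        return self._by_id[schema_id]

    def latest(self, topic):
        return self._by_id[self._latest[topic]]


def encode(schema_id, payload):
    """Prefix the serialized payload with the 16-byte schema fingerprint."""
    return schema_id + payload


def decode(registry, message):
    """Split a message into (schema, payload) using the fingerprint prefix."""
    schema_id, payload = message[:16], message[16:]
    return registry.lookup(schema_id), payload
```

With this layout a producer can auto-register its schema on first use, and every message pays a fixed 16 bytes rather than the full schema text.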

Page 15

Apache Kafka + Hadoop = Camus

• Avro only

• Uses ZooKeeper
  o Discover new topics
  o Find all brokers
  o Find all partitions

• Mappers pull from Kafka

• Keeps offsets in HDFS

• Partitions data by hour

• Counts incoming events
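
The last three bullets (offsets, hourly partitioning, event counts) can be sketched together. This is an illustration of the idea only, not Camus code: the `topic/hourly/YYYY/MM/DD/HH` path layout, the `(offset, timestamp_ms, payload)` event tuple, and the function names are assumptions, and a plain list stands in for a Kafka partition.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_path(topic, timestamp_ms):
    """Build an hourly output directory (hypothetical layout) from an event time."""
    t = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{topic}/hourly/{t:%Y/%m/%d/%H}"


def pull_partition(topic, events, start_offset):
    """Consume events from start_offset onward.

    events: list of (offset, timestamp_ms, payload) in offset order.
    Returns (per-hour event counts, offset to resume from on the next run).
    """
    counts = Counter()
    next_offset = start_offset
    for offset, timestamp_ms, _payload in events:
        if offset < start_offset:
            continue  # already ingested by an earlier run
        counts[hourly_path(topic, timestamp_ms)] += 1
        next_offset = offset + 1
    return counts, next_offset
```

Persisting the returned offset (the slide says in HDFS) is what makes each run pick up where the previous one stopped, and the per-hour counts are the kind of numbers an audit step can later check.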

Page 16

Kafka Auditing

• Use Kafka to Audit itself

• Tool to audit and alert

• Compare counts

• Kafka 0.8?
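
"Compare counts" can be illustrated with a small check that lines up what producers say they sent against what landed downstream. This is a sketch under assumptions: the `(topic, hour)` key, the `tolerance` parameter, and the function name `audit` are hypothetical and not taken from the actual auditing tool.

```python
def audit(produced, consumed, tolerance=0.0):
    """Return buckets whose consumed count deviates from the produced count.

    produced, consumed: dict of (topic, hour) -> event count
    tolerance: allowed relative deviation, e.g. 0.01 for 1% (assumed knob)
    """
    alerts = []
    for key, sent in sorted(produced.items()):
        got = consumed.get(key, 0)
        # Flag both loss (got < sent) and duplication (got > sent).
        if sent and abs(sent - got) / sent > tolerance:
            alerts.append((key, sent, got))
    return alerts


if __name__ == "__main__":
    produced = {("PageViewEvent", "2013060100"): 100, ("PageViewEvent", "2013060101"): 50}
    consumed = {("PageViewEvent", "2013060100"): 100, ("PageViewEvent", "2013060101"): 48}
    for key, sent, got in audit(produced, consumed):
        print(f"ALERT {key}: produced={sent} consumed={got}")
```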

Page 17

Lessons We Learned

• Avoid lots of small files

• Automation with Auditing = sleep for me

• Group similar data = smaller/faster

• Spend time writing to spend less time reading
  o Convert to binary, partition, compress

• Future:
  o Adaptive replication (higher for new, lower for old)
  o Metadata store (hcat)
  o Columnar store (ORC?, Parquet?)

Page 18

Processing Data

Page 19

Pure Java

• Writing jobs is time-consuming

• Little code re-use

• Shoot yourself in the face

• Only used when necessary
  o Performance
  o Memory

• Lots of libraries to help (boilerplate stuff)

Page 20

Little Piggy (Apache Pig)

• Mainly a pigsty (Pig 0.11)

• Used by data products

• Transparent

• Good performance, tunable

• UDFs, DataFu

• Tuples and bags? WTF

Page 21

Hive

• Hive 0.11

• Only for ad hoc queries
  o Biz ops, PMs, analysts

• Hard to tune

• Easy to use

• Lots of adoption

• ETL data in external tables :/

• HiveServer2 for JDBC

Disturbing Mascot

Page 22

Future in Processing

• Giraph

• Impala, Shark/Spark… etc

• Tez

• Crunch

• Other?

• Say no to streaming

Page 23

Workflows

Page 24

Azkaban

• Run Hadoop jobs in order

• Run regular schedules

• Be notified on failures

• Understand how flows are executed

• View execution history

• Easy to use

Page 25

Azkaban @ LinkedIn

• Used in LinkedIn since early 2009

• Powers all our Hadoop data products

• Been using 2.0+ since late 2012

• 2.0 and 2.1 quietly released early 2013

Page 26

Azkaban @ LinkedIn

• One Azkaban instance per cluster

• 6 clusters total

• 900 Users

• 1500 projects

• 10,000 flows

• 2500 flows executing per day

• 6500 jobs executing per day

Page 27

Azkaban (before)

Engineer designed UI...

Page 28

Azkaban 2.0

Page 29

Azkaban Features

• Schedules DAGs for executions

• Web UI

• Simple job files to create dependencies

• Authorization/Authentication

• Project Isolation

• Extensible through plugins (works with any version of Hadoop)

• Prison for dark wizards
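
To make "simple job files to create dependencies" concrete: an Azkaban job file is a small key=value properties file, and a `dependencies` line naming other jobs is what links jobs into a DAG. The sketch below parses a few such files and derives a valid run order. It is an illustration only and not how Azkaban itself schedules flows; the job names and commands are made up, and the helpers `parse_job` and `execution_order` are hypothetical. It uses `graphlib`, which needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter


def parse_job(text):
    """Parse key=value lines of a job file, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def execution_order(job_files):
    """job_files: dict of job name -> job file text. Returns a valid run order."""
    graph = {}
    for name, text in job_files.items():
        deps = parse_job(text).get("dependencies", "")
        # Map each job to the set of jobs that must finish before it.
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    jobs = {
        "load": "type=command\ncommand=echo load",
        "transform": "type=command\ncommand=echo transform\ndependencies=load",
        "publish": "type=command\ncommand=echo publish\ndependencies=transform",
    }
    print(execution_order(jobs))
```

A cycle in the dependencies makes `TopologicalSorter` raise `CycleError`, which is a reasonable stand-in for rejecting an invalid flow at upload time.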

Page 30

Azkaban - Upload

• Zip Job files, jars, project files

Page 31

Azkaban - Execute

Page 32

Azkaban - Schedule

Page 33

Azkaban - Viewer Plugins

HDFS Browser

Reportal

Page 34

Future Azkaban Work

• Higher availability

• Generic Triggering/Actions

• Embedded graphs

• Conditional branching

• Admin client

Page 35

Data Out

Page 36

Voldemort

• Distributed Key-Value Store

• Based on Amazon Dynamo

• Pluggable

• Open-source

Page 37

Voldemort Read-Only

• Filesystem store for RO

• Create data files and index on Hadoop

• Copy data to Voldemort

• Swap
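
The build, copy, swap sequence can be sketched as a store that holds several complete data versions and serves reads from whichever one is marked current. This is a toy in-memory model written for illustration, with hypothetical class and method names; it is not Voldemort's implementation, which works on data and index files on disk.

```python
class ReadOnlyStore:
    """Toy model of a read-only store with versioned data sets."""

    def __init__(self):
        self._versions = {}    # version number -> complete key/value data set
        self._current = None   # version serving reads
        self._previous = None  # version to fall back to on rollback

    def fetch(self, version, data):
        """Copy a freshly built data set in; it is not visible to reads yet."""
        self._versions[version] = dict(data)

    def swap(self, version):
        """Atomically switch reads to an already-fetched version."""
        if version not in self._versions:
            raise KeyError(f"version {version} has not been fetched")
        self._previous, self._current = self._current, version

    def rollback(self):
        """Switch reads back to the version that was live before the last swap."""
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self._current, self._previous = self._previous, None

    def get(self, key):
        return self._versions[self._current][key]
```

Because a new version only becomes visible at `swap`, readers never see a half-copied data set, and keeping the old version around is what makes rollback cheap.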

Page 38

Voldemort + Hadoop

• Transfers are parallel

• Transfer records in bulk

• Ability to Roll back

• Simple, operationally low maintenance

• Why not HBase, Cassandra?
  o Legacy, and no compelling reason to change
  o Simplicity is nice
  o Real answer: I don’t know. It works, we’re happy.

Page 39

Apache Kafka

• Reverse the flow

• Messages produced by Hadoop

• Consumer upstream takes action

• Used for emails, r/w store updates, and other cases where Voldemort doesn’t make sense

Page 40

Nearing the End

Page 41

Misc Hadoop at LinkedIn

• Believe in optimization
  o File size, task count and utilization
  o Reviews, culture

• Strict limits
  o Quotas size/file count
  o 10K task limit

• Use capacity scheduler
  o Default queue with 15m limit
  o marathon for others

Page 42

We do a lot with little…

• 50-60% cluster utilization
  o Or about 5x more work than some other companies

• Every job is reviewed for production
  o Teaches good practices
  o Schedule to optimize utilization
  o Prevents future headaches

• These keep our team size small
  o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x
  o Hadoop team 5x (to 5 people)

Page 43

More info

Our data site: data.linkedin.com

Kafka: kafka.apache.org

Azkaban: azkaban.github.io/azkaban2

Voldemort: project-voldemort.com

Page 44

The End