Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

DESCRIPTION

At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use a scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop itself generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner, we also use Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets), and more.

TRANSCRIPT

  • Adam Kawa Data Engineer @ Spotify Hadoop Operations Powered By Hadoop
  • 1. How many times has Coldplay been streamed this month? 2. How many times was Get Lucky streamed during its first 24h? 3. Who was the most popular artist in NYC last week? Labels, Advertisers, Partners
  • 1. What song to recommend Jay-Z when he wakes up? 2. Is Adam Kawa bored with Coldplay today? 3. How to get Arun to subscribe to Spotify Premium? Data Scientists
  • (Big) Data At Spotify Data generated by +24M monthly active users and for users! - 2.2 TB of compressed data from users per day - 64 TB of data generated in Hadoop each day (triplicated)
  • Data Infrastructure At Spotify Apache Hadoop YARN Many other systems including - Kafka, Cassandra, Storm, Luigi in production - Giraph, Tez, Spark in evaluation mode
  • Probably the largest commercial Hadoop cluster in Europe! - 694 heterogeneous nodes - 14.25 PB of data consumed - ~12,000 jobs each day Apache Hadoop
  • March 2013 Tricky questions were asked!
  • 1. How many servers do you need to buy to survive one year? 2. What will you do to use them efficiently? 3. If we agree, don't come back to us this year! OK? Finance Department
  • One of the Data Engineers responsible for answering these questions! Adam Kawa
  • Examples of how to analyze various metrics, logs and files - generated by Hadoop - using Hadoop - to understand Hadoop - to avoid guesstimates! The Topic Of This Talk
  • This knowledge can be useful to - measure how fast HDFS is growing - define an empirical retention policy - measure the performance of jobs - optimize the scheduler - and more What To Use It For
  • 1. Analyzing HDFS 2. Analyzing MapReduce and YARN Agenda
  • HDFS Garbage Collection On The NameNode
  • We don't have any full GC pauses on the NN. Our GC stops the NN for less than 100 msec, on average! :) Adam Kawa @ Hadoop User Mailing List December 16th, 2013
  • Today, between 12:05 and 13:00 we had 5 full GC pauses on the NN. They stopped the NN for 34min47sec in total! :( Adam Kawa @ Spotify office, Stockholm January 13th, 2014
  • What happened between 12:05 and 13:00?
  • The NameNode was receiving the block reports from all the DataNodes Quick Answer!
  • 1. We started the NN while the DNs were running 2. 502 DNs immediately registered to the NN within 1.2 sec (based on logs from the DNs) 3. 502 DNs started sending their block reports: dfs.blockreport.initialDelay = 30 minutes, 17 block reports per minute (on average), +831K blocks in each block report (on average) 4. This generated high memory pressure on the NN, and the NN ran into Full GC !!! Detailed Answer
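The dfs.blockreport.initialDelay property mentioned above lives in hdfs-site.xml and takes a value in seconds; spreading 502 initial block reports randomly over 30 minutes works out to roughly 502 / 30 ≈ 17 reports per minute, which matches the average above. A minimal sketch of the property (the 30-minute value is the one from the slides):

    <!-- hdfs-site.xml: delay each DataNode's first block report by a
         random amount of up to 30 minutes (1800 seconds), so that a
         restarted NameNode is not hit by all reports at once -->
    <property>
      <name>dfs.blockreport.initialDelay</name>
      <value>1800</value>
    </property>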
  • Hadoop told us everything!
  • Enable GC logging for the NameNode (see the sketch below). Visualize it, e.g. with GCViewer. Analyze memory usage patterns, GC pauses, misconfiguration. Collecting The GC Stats
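A minimal sketch of what enabling GC logging for the NameNode can look like, using standard HotSpot flags of that era in hadoop-env.sh (the log path is just an example); the resulting log file is what GCViewer visualizes:

    # hadoop-env.sh: turn on GC logging for the NameNode JVM
    export HADOOP_NAMENODE_OPTS="-verbose:gc \
      -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps \
      -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hadoop/namenode-gc.log \
      $HADOOP_NAMENODE_OPTS"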
  • [GC timeline chart] The blue line shows the heap used by the NN over time: loading FsImage, replaying the edit logs, the first block report processed, then 25 and 131 block reports processed, 5 min 39 sec of Full GC, 40 more block reports processed, and further Full GC pauses after that.
  • The CMS collector was starting at 98.5% of heap. We fixed that! (See the sketch below.)
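The fix hinted at above can be expressed with standard CMS flags: force the collector to start at a fixed, lower occupancy instead of the adaptive threshold that had crept up to ~98.5%. A sketch, assuming CMS on the NameNode as the slides state (the 75% value is illustrative, not necessarily the one Spotify chose):

    # hadoop-env.sh: start CMS concurrent collections well before the
    # old generation is nearly full, rather than at ~98.5% occupancy
    export HADOOP_NAMENODE_OPTS="-XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=75 \
      -XX:+UseCMSInitiatingOccupancyOnly \
      $HADOOP_NAMENODE_OPTS"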
  • What happened in HDFS between mid-December 2013 and mid-January 2014?
  • HDFS: HDFS Metadata
  • A persistent checkpoint of HDFS metadata. It contains information about files + directories. A binary file. HDFS FsImage File
  • Converts the content of FsImage to text formats, e.g. a tab-separated file or XML. Output is easily analyzed by any tools, e.g. Pig, Hive (see the sketch below). HDFS Offline Image Viewer
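As a concrete illustration, here is a minimal Python sketch over a tab-separated dump produced by the Offline Image Viewer, e.g. with something like hdfs oiv -p Delimited -i fsimage -o fsimage.tsv (the Delimited processor and its exact column layout vary across Hadoop versions, so the column indices below are assumptions):

    # fsimage_stats.py - per-top-level-directory size and small-file stats
    # from a delimited FsImage dump; column positions are assumed.
    import csv
    from collections import defaultdict

    bytes_per_top = defaultdict(int)
    files_per_top = defaultdict(int)

    with open("fsimage.tsv") as f:
        for row in csv.reader(f, delimiter="\t"):
            path, filesize = row[0], int(row[5])   # assumed columns
            if filesize == 0:
                continue                           # skip directories
            top = "/" + path.split("/")[1]         # e.g. /app-logs
            bytes_per_top[top] += filesize
            files_per_top[top] += 1

    for top, total in sorted(bytes_per_top.items(), key=lambda kv: -kv[1]):
        n = files_per_top[top]
        print("%-20s %10.1f GB %9d files  avg %9.1f KB"
              % (top, total / 1e9, n, total / float(n) / 1e3))

A low average file size per directory, as in the /app-logs example on the next slides, is a quick way to spot small-file offenders.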
  • 50% of the data created during last 3 months
  • Anything interesting?
  • 1. NO data added that day 2. Many more files added after
  • The migration to YARN
  • Where did the small files come from?
  • An interactive visualization of data in HDFS: Twitter's HDFS-DU. /app-logs: avg. file size = 253 KB, no. of dirs = 595K, no. of files = 60.6M
  • Statistics broken down by user/group name. Candidates for duplicate datasets. Inefficient MapReduce jobs: small files, skewed files. More Uses Of FsImage File
  • You can analyze FsImage to learn how fast HDFS grows You can combine it with external datasets - number of daily/monthly active users - total size of logs generated by users - number of queries / day run by data analysts Advanced HDFS Capacity Planning
  • You can also use the "trend button" in Ganglia. Simplified HDFS Capacity Planning. If we do NOTHING, we might fill the cluster in September ...
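The same trend idea is easy to reproduce outside Ganglia: fit a line to daily usage samples and extrapolate the fill date. A minimal sketch with made-up numbers (both the samples and the capacity value are hypothetical, not Spotify's):

    # capacity_trend.py - linear extrapolation of HDFS usage growth
    import numpy as np

    days = np.arange(10)                      # day index of each sample
    used_pb = np.array([9.8, 9.9, 10.1, 10.2, 10.4,
                        10.5, 10.7, 10.8, 11.0, 11.1])  # PB used (made up)
    capacity_pb = 20.0                        # hypothetical raw capacity

    slope, intercept = np.polyfit(days, used_pb, 1)
    days_left = (capacity_pb - used_pb[-1]) / slope
    print("Growing %.2f PB/day; full in ~%.0f days if we do NOTHING"
          % (slope, days_left))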
  • What will we do to survive longer than September?
  • HDFS Retention
  • Question: How many days after its creation is a dataset no longer accessed? Possible Solution: You can use modification_time and access_time from FsImage (see the sketch below). Empirical Retention Policy
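A minimal sketch of that solution, again assuming the delimited FsImage dump and its column layout from earlier ("dataset" is approximated here by the first two path components, which is a simplification):

    # retention_sketch.py - how long after creation is each dataset
    # still being read? Uses modification_time and access_time columns.
    import csv
    from datetime import datetime

    FMT = "%Y-%m-%d %H:%M"           # assumed timestamp format in the dump

    def dataset_of(path):
        return "/".join(path.split("/")[:3])    # e.g. /metadata/artist

    max_age = {}
    with open("fsimage.tsv") as f:
        for row in csv.reader(f, delimiter="\t"):
            path, mtime, atime = row[0], row[2], row[3]   # assumed columns
            if int(row[5]) == 0:
                continue                                  # skip directories
            created = datetime.strptime(mtime, FMT)
            accessed = datetime.strptime(atime, FMT)
            age = (accessed - created).days
            ds = dataset_of(path)
            max_age[ds] = max(age, max_age.get(ds, 0))

    # Datasets never read again soon after creation are retention candidates
    for ds, age in sorted(max_age.items(), key=lambda kv: kv[1]):
        print("%-40s last accessed %4d days after creation" % (ds, age))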
  • Logs and core datasets are accessed even many years after creation. Many reports are not accessed even an hour after creation. Most intermediate datasets are needed for less than a week. 10% of data has not been accessed for a year. Our Retention Facts
  • HDFS Hot Datasets
  • Some files/directories will be accessed more often than others, e.g. fresh logs, core datasets, dictionary files. Idea: To process it faster, increase its replication factor while it's hot. To save disk space, decrease its replication factor when it becomes cold. Hot Dataset
  • How to find them?
  • Logs all filesystem access requests sent to the NN. Easy to parse and aggregate: a tab-separated line for each request (see the sketch below). HDFS Audit Log 2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4 cmd=open src=/metadata/artist/2013-11-27/part-00061.avro dst=null perm=null
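A minimal sketch of finding hot datasets from this log, combined with the replication-factor idea from the previous slide via the standard hadoop fs -setrep command (the top-10 cutoff and the factor of 5 are hypothetical choices, not Spotify's policy):

    # hot_datasets.py - count 'open' requests per path in the HDFS audit
    # log, then raise the replication factor of the hottest paths.
    import re
    import subprocess
    from collections import Counter

    FIELD = re.compile(r"(\w+)=(\S+)")   # key=value pairs in an audit line

    opens = Counter()
    with open("hdfs-audit.log") as f:
        for line in f:
            fields = dict(FIELD.findall(line))
            if fields.get("cmd") == "open":
                opens[fields["src"]] += 1

    # Bump the hottest files to a higher replication factor while hot
    for path, count in opens.most_common(10):
        print("%8d opens  %s" % (count, path))
        subprocess.call(["hadoop", "fs", "-setrep", "5", path])

A matching job would later lower the replication factor again once a path drops out of the hot set, which is the "save disk space when it becomes cold" half of the idea.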