Amazon Elastic MapReduce Developer Guide

API Version 2009-11-30


Amazon Elastic MapReduce: Developer Guide
Amazon Web Services
Copyright © 2012 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

The following are trademarks or registered trademarks of Amazon: Amazon, Amazon.com, Amazon.com Design, Amazon DevPay, Amazon EC2, Amazon Web Services Design, AWS, CloudFront, EC2, Elastic Compute Cloud, Kindle, and Mechanical Turk. In addition, Amazon.com graphics, logos, page headers, button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other countries. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon.

All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


Contents

Welcome (p. 1)
Understand Amazon EMR (p. 2)
    Overview of Amazon EMR (p. 2)
        Architectural Overview of Amazon EMR (p. 3)
        Elastic MapReduce Features (p. 4)
    Amazon EMR Concepts (p. 6)
        Job Flows and Steps (p. 6)
        Hadoop and MapReduce (p. 7)
    Associated AWS Product Concepts (p. 11)
Using Amazon EMR (p. 15)
    Setting Up Your Environment to Run a Job Flow (p. 17)
    Create a Job Flow (p. 23)
        How to Create a Streaming Job Flow (p. 24)
        How to Create a Job Flow Using Hive (p. 32)
        How to Create a Job Flow Using Pig (p. 40)
        How to Create a Job Flow Using a Custom JAR (p. 48)
        How to Create a Cascading Job Flow (p. 56)
        Launch an HBase Cluster on Amazon EMR (p. 64)
    View Job Flow Details (p. 72)
    Terminate a Job Flow (p. 77)
    Customize a Job Flow (p. 79)
        Add Steps to a Job Flow (p. 79)
            Wait for Steps to Complete (p. 81)
            Add More than 256 Steps to a Job Flow (p. 82)
        Bootstrap Actions (p. 84)
        Resizing Running Job Flows (p. 96)
        Calling Additional Files and Libraries (p. 104)
            Using Distributed Cache (p. 104)
            Running a Script in a Job Flow (p. 109)
    Connect to the Master Node in an Amazon EMR Job Flow (p. 110)
        Connect to the Master Node Using SSH (p. 111)
        Web Interfaces Hosted on the Master Node (p. 115)
        Open an SSH Tunnel to the Master Node (p. 116)
        Configure Foxy Proxy to View Websites Hosted on the Master Node (p. 117)
    Use Cases (p. 122)
        Cascading (p. 122)
        Pig (p. 126)
        Streaming (p. 129)
    Building Binaries Using Amazon EMR (p. 131)
    Using Tagging (p. 136)
    Protect a Job Flow from Termination (p. 136)
    Lower Costs with Spot Instances (p. 141)
        Choosing What to Launch as Spot Instances (p. 142)
        Spot Instance Pricing in Amazon EMR (p. 144)
        Availability Zones and Regions (p. 144)
        Launching Spot Instances in Job Flows (p. 145)
        Changing the Number of Spot Instances in a Job Flow (p. 151)
        Troubleshooting Spot Instances (p. 154)
    Store Data with HBase (p. 155)
        HBase Job Flow Prerequisites (p. 155)
        Launch an HBase Cluster on Amazon EMR (p. 156)
        Connect to HBase Using the Command Line (p. 164)
        Back Up and Restore HBase (p. 165)
        Terminate an HBase Cluster (p. 174)
        Configure HBase (p. 174)
        Access HBase Data with Hive (p. 178)
        View the HBase User Interface (p. 180)
        View HBase Log Files (p. 180)


        Monitor HBase with CloudWatch (p. 181)
        Monitor HBase with Ganglia (p. 181)
    Troubleshooting (p. 183)
        Things to Check When Your Amazon EMR Job Flow Fails (p. 183)
        Amazon EMR Logging (p. 187)
        Enable Logging and Debugging (p. 187)
        Use Log Files (p. 190)
        Monitor Hadoop on the Master Node (p. 199)
        View the Hadoop Web Interfaces (p. 200)
        Troubleshooting Tips (p. 204)
    Monitor Metrics with Amazon CloudWatch (p. 209)
    Monitor Performance with Ganglia (p. 220)
    Distributed Copy Using S3DistCp (p. 227)
    Export, Query, and Join Tables in Amazon DynamoDB (p. 234)
        Prerequisites for Integrating Amazon EMR (p. 235)
        Step 1: Create a Key Pair (p. 235)
        Step 2: Create a Job Flow (p. 236)
        Step 3: SSH into the Master Node (p. 241)
        Step 4: Set Up a Hive Table to Run Hive Commands (p. 244)
        Hive Command Examples for Exporting, Importing, and Querying Data (p. 248)
        Optimizing Performance (p. 255)
    Use Third Party Applications With Amazon EMR (p. 258)
        Parse Data with HParser (p. 258)
        Using Karmasphere Analytics (p. 259)
        Launch a Job Flow on the MapR Distribution for Hadoop (p. 260)
Write Amazon EMR Applications (p. 263)
    Common Concepts for API Calls (p. 263)
    Use SDKs to Call Amazon EMR APIs (p. 265)
        Using the AWS SDK for Java to Create an Amazon EMR Job Flow (p. 266)
        Using the AWS SDK for .Net to Create an Amazon EMR Job Flow (p. 267)
        Using the Java SDK to Sign a Query Request (p. 267)
    Use Query Requests to Call Amazon EMR APIs (p. 268)
        Why Query Requests Are Signed (p. 269)
        Components of a Query Request in Amazon EMR (p. 269)
        How to Generate a Signature for a Query Request in Amazon EMR (p. 270)
Configure Amazon EMR (p. 274)
    Configure User Permissions with IAM (p. 274)
        Set Policy for an IAM User (p. 277)
    Configure IAM Roles for Amazon EMR (p. 280)
    Set Access Permissions on Files Written to Amazon S3 (p. 285)
    Using Elastic IP Addresses (p. 287)
    Specify the Amazon EMR AMI Version (p. 290)
    Hadoop Configuration (p. 299)
        Supported Hadoop Versions (p. 300)
        Configuration of hadoop-user-env.sh (p. 302)
        Upgrading to Hadoop 1.0 (p. 302)
            Hadoop Version Behavior (p. 303)
        Hadoop 0.20 Streaming Configuration (p. 304)
        Hadoop Default Configuration (AMI 1.0) (p. 304)
            Hadoop Configuration (AMI 1.0) (p. 304)
            HDFS Configuration (AMI 1.0) (p. 307)
            Task Configuration (AMI 1.0) (p. 308)
            Intermediate Compression (AMI 1.0) (p. 311)
        Hadoop Memory-Intensive Configuration Settings (AMI 1.0) (p. 311)
        Hadoop Default Configuration (AMI 2.0 and 2.1) (p. 314)
            Hadoop Configuration (AMI 2.0 and 2.1) (p. 314)
            HDFS Configuration (AMI 2.0 and 2.1) (p. 318)
            Task Configuration (AMI 2.0 and 2.1) (p. 318)


            Intermediate Compression (AMI 2.0 and 2.1) (p. 321)
        Hadoop Default Configuration (AMI 2.2) (p. 322)
            Hadoop Configuration (AMI 2.2) (p. 322)
            HDFS Configuration (AMI 2.2) (p. 326)
            Task Configuration (AMI 2.2) (p. 326)
            Intermediate Compression (AMI 2.2) (p. 329)
        Hadoop Default Configuration (AMI 2.3) (p. 330)
            Hadoop Configuration (AMI 2.3) (p. 330)
            HDFS Configuration (AMI 2.3) (p. 334)
            Task Configuration (AMI 2.3) (p. 334)
            Intermediate Compression (AMI 2.3) (p. 337)
        File System Configuration (p. 338)
        JSON Configuration Files (p. 340)
        Multipart Upload (p. 343)
        Hadoop Data Compression (p. 344)
        Setting Permissions on the System Directory (p. 345)
        Hadoop Patches (p. 346)
    Hive Configuration (p. 348)
        Supported Hive Versions (p. 349)
        Share Data Between Hive Versions (p. 353)
        Differences from Apache Hive Defaults (p. 353)
        Interactive and Batch Modes (p. 355)
        Creating a Metastore Outside the Hadoop Cluster (p. 357)
        Using the Hive JDBC Driver (p. 359)
        Additional Features of Hive in Amazon EMR (p. 362)
        Upgrade to Hive 0.8 (p. 368)
            Upgrade the Configuration Files (p. 368)
            Upgrade the Metastore (p. 369)
                Upgrade to Hive 0.8 (MySQL on the Master Node) (p. 369)
                Upgrade to Hive 0.8 (MySQL on Amazon RDS) (p. 373)
    Pig Configuration (p. 377)
        Supported Pig Versions (p. 377)
        Pig Version Details (p. 379)
    Performance Tuning (p. 381)
    Running Job Flows on an Amazon VPC (p. 381)
Appendix: Compare Job Flow Types (p. 389)
Appendix: Amazon EMR Resources (p. 391)
Document History (p. 396)
Glossary (p. 393)
Index (p. 401)


Welcome

This is the Amazon Elastic MapReduce (Amazon EMR) Developer Guide. This guide provides a conceptual overview of Amazon EMR, an overview of related AWS products, and detailed information on all functionality available from Amazon EMR.

Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing.

How Do I...?

• Decide whether Amazon EMR is right for my needs: Amazon Elastic MapReduce detail page
• Get started with Amazon EMR: Getting Started Guide
• Learn about troubleshooting job flows: Troubleshooting (p. 183)
• Learn how to create a job flow: Create a Job Flow (p. 23)
• Learn about bootstrap actions: Bootstrap Actions (p. 84)
• Learn about Hadoop cluster configuration: Hadoop Configuration (p. 299)
• Learn about the Amazon EMR API: Write Amazon EMR Applications (p. 263)
• Compare different job flow types: Appendix: Compare Job Flow Types (p. 389)


Understand Amazon EMR

Topics

• Overview of Amazon EMR (p. 2)

• Amazon EMR Concepts (p. 6)

• Associated AWS Product Concepts (p. 11)

This introduction to Amazon Elastic MapReduce (Amazon EMR) provides a summary of this web service. After reading this section, you should understand the service features, know how Amazon EMR interacts with other AWS products, and understand the basic functions of Amazon EMR.

In this guide, we assume that you have read and completed the instructions described in the Getting Started Guide, which provides information on creating your Amazon Elastic MapReduce (Amazon EMR) account and credentials.

You should be familiar with the following:

• Hadoop. For more information, go to http://hadoop.apache.org/core/.

• Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and Amazon SimpleDB. For more information, see the Amazon Elastic Compute Cloud User Guide, the Amazon Simple Storage Service Developer Guide, and the Amazon SimpleDB Developer Guide, respectively.

Overview of Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) is a data analysis tool that simplifies the set-up and management of a computer cluster, the source data, and the computational tools that help you implement sophisticated data processing jobs quickly.

Typically, data processing involves performing a series of relatively simple operations on large amounts of data. In Amazon EMR, each operation is called a step and a sequence of steps is a job flow. A job flow that processes encrypted data might look like the following example.

Step 1: Decrypt data
Step 2: Process data
Step 3: Encrypt data
Step 4: Save data

Amazon EMR uses Hadoop to divide up the work among the instances in the cluster, track status, and combine the individual results into one output. For an overview of Hadoop, see What Is Hadoop? (p. 8).

Amazon EMR takes care of provisioning a Hadoop cluster, running the job flow, terminating the job flow, moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. Amazon EMR removes most of the cumbersome details of setting up the hardware and networking required by the Hadoop cluster, such as monitoring the setup, configuring Hadoop, and executing the job flow. Together, Amazon EMR and Hadoop provide all of the power of Hadoop processing with the ease, low cost, scalability, and power that Amazon S3 and Amazon EC2 offer.

Architectural Overview of Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) works in conjunction with Amazon EC2 to create a Hadoop cluster, and with Amazon S3 to store scripts, input data, log files, and output results. The Amazon EMR process is outlined in the following table.


Amazon EMR Process

1. Upload to Amazon S3 the data you want to process, as well as the mapper and reducer executables that process the data, and then send a request to Amazon EMR to start a job flow.

2. Amazon EMR starts a Hadoop cluster, which loads any specified bootstrap actions and then runs Hadoop on each node.

3. Hadoop executes a job flow by downloading data from Amazon S3 to core and task nodes. Alternatively, the data is loaded dynamically at run time by mapper tasks.

4. Hadoop processes the data and then uploads the results from the cluster to Amazon S3.

5. The job flow is completed and you retrieve the processed data from Amazon S3.

For details on mapping legacy job flows to instance groups, see Mapping Legacy Job Flows to Instance Groups (p. 102).
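
In code, the first part of this process (uploading inputs and requesting a job flow) can be expressed with the AWS SDK for Java covered later in this guide. The following is a minimal sketch, not a complete application: the bucket name, key pair name, script locations, streaming JAR path, and instance types are placeholder or typical values rather than values defined by this guide.

```java
import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.s3.AmazonS3Client;

public class StartStreamingJobFlow {
    public static void main(String[] args) {
        BasicAWSCredentials credentials =
            new BasicAWSCredentials("access-key-id", "secret-access-key");

        // Step 1: upload the input data and the mapper/reducer scripts to Amazon S3.
        AmazonS3Client s3 = new AmazonS3Client(credentials);
        s3.putObject("mybucket", "input/data.txt", new File("data.txt"));
        s3.putObject("mybucket", "scripts/mapper.py", new File("mapper.py"));
        s3.putObject("mybucket", "scripts/reducer.py", new File("reducer.py"));

        // A streaming step: Hadoop's streaming JAR invokes the mapper and reducer scripts.
        StepConfig streamingStep = new StepConfig()
            .withName("Streaming word count")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("/home/hadoop/contrib/streaming/hadoop-streaming.jar")
                .withArgs("-input", "s3n://mybucket/input",
                          "-output", "s3n://mybucket/output",
                          "-mapper", "s3n://mybucket/scripts/mapper.py",
                          "-reducer", "s3n://mybucket/scripts/reducer.py"));

        // Step 2: ask Amazon EMR to start a Hadoop cluster and run the step.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);
        RunJobFlowResult result = emr.runJobFlow(new RunJobFlowRequest()
            .withName("My streaming job flow")
            .withLogUri("s3n://mybucket/logs")
            .withSteps(streamingStep)
            .withInstances(new JobFlowInstancesConfig()
                .withEc2KeyName("my-key-pair")
                .withHadoopVersion("0.20.205")
                .withInstanceCount(3)
                .withMasterInstanceType("m1.small")
                .withSlaveInstanceType("m1.small")));

        System.out.println("Started job flow " + result.getJobFlowId());
    }
}
```

The same request can also be made from the Amazon EMR console or the CLI; the SDK form is shown here only because it maps directly onto the process steps above.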

Elastic MapReduce Features

Topics

• Bootstrap Actions (p. 4)

• Configurable Data Storage (p. 4)

• Hadoop and Step Logging (p. 5)

• Hive Support (p. 5)

• Resizeable Running Job Flows (p. 5)

• Secure Data (p. 5)

• Supports Hadoop Methods (p. 5)

• Multiple Sequential Steps (p. 5)

The following sections describe the features available in Amazon Elastic MapReduce (Amazon EMR).

Bootstrap Actions

A bootstrap action is a mechanism that lets you run a script on Elastic MapReduce cluster nodes before Hadoop starts. Bootstrap action scripts are stored in Amazon S3 and passed to Amazon EMR when creating a new job flow. Bootstrap action scripts are downloaded from Amazon S3 and executed on each node before the job flow is executed.

By using bootstrap actions, you can install software on the node, modify the default Hadoop site configuration, or change the way Java parameters are used to run Hadoop daemons.

Both predefined and custom bootstrap actions are available. The predefined bootstrap actions include Configure Hadoop, Configure Daemons, and Run-if. You can write custom bootstrap actions in any language already installed on the job flow instance, such as Ruby, Python, Perl, or bash.

You can specify a bootstrap action from the command line interface, from the Amazon EMR console, or from the Amazon EMR API when starting a job flow. For more information, see Bootstrap Actions (p. 84).
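
As an illustration, a custom bootstrap action can be described programmatically with the AWS SDK for Java roughly as follows. This is a sketch; the script path and argument are placeholders.

```java
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public class BootstrapActionExample {
    // Builds a bootstrap action that runs a custom script from Amazon S3
    // on every node before Hadoop starts.
    public static BootstrapActionConfig installExtraSoftware() {
        return new BootstrapActionConfig()
            .withName("Install extra software")
            .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                .withPath("s3://mybucket/bootstrap/install-software.sh")
                .withArgs("--verbose"));
    }
}
```

The resulting configuration would then be attached to the job flow request (for example, through RunJobFlowRequest.withBootstrapActions()) when the job flow is created.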

Configurable Data Storage

Amazon EMR supports the Hadoop Distributed File System (HDFS). HDFS is fault-tolerant, scalable, and easily configurable. The default configuration is already optimized for most job flows. Generally, the configuration needs to be changed only for very large clusters. Configuration changes are accomplished using bootstrap actions. For more information, see Hadoop Configuration (p. 299).

Hadoop and Step Logging

Amazon EMR provides detailed logs you can use to debug both Hadoop and Amazon EMR. For more information on how to create logs, view logs, and use them to troubleshoot a job flow, see Troubleshooting (p. 183).

Hive Support

Amazon Elastic MapReduce (Amazon EMR) supports Apache Hive. Hive is an integrated data warehouse infrastructure built on top of Hadoop. It provides tools to simplify data summarization and provides ad hoc querying and analysis of large datasets stored in Hadoop files. Hive provides a simple query language called Hive QL, which is based on SQL.

For more information on the supported versions of Hive, see Hive Configuration (p. 348).
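
For example, the AWS SDK for Java includes a StepFactory helper that builds the Hive-related steps of a job flow. The sketch below assumes that helper and its newInstallHiveStep/newRunHiveScriptStep methods; the script location is a placeholder, and the method signatures should be checked against the SDK version you use.

```java
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

public class HiveSteps {
    // Returns the steps needed to install Hive and run a Hive script stored in Amazon S3.
    public static StepConfig[] buildHiveSteps() {
        StepFactory stepFactory = new StepFactory();

        StepConfig installHive = new StepConfig()
            .withName("Install Hive")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newInstallHiveStep());

        StepConfig runHiveScript = new StepConfig()
            .withName("Run Hive script")
            .withActionOnFailure("CANCEL_AND_WAIT")
            .withHadoopJarStep(
                stepFactory.newRunHiveScriptStep("s3://mybucket/scripts/report.q"));

        return new StepConfig[] { installHive, runHiveScript };
    }
}
```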

Resizeable Running Job Flows

The ability to resize a running job flow lets you increase or decrease the number of nodes in a running cluster. Core nodes contain the Hadoop Distributed File System (HDFS). After a job flow is running, you can increase the number of core nodes. Task nodes also run Hadoop, but do not contain HDFS. After a job flow is running, you can also increase and decrease the number of task nodes. For more information, see Resizing Running Job Flows (p. 96).

Secure Data

Amazon EMR provides an authentication mechanism to ensure that data stored in Amazon S3 is secured against unauthorized access. By default, only the AWS Account owner can access the data uploaded to Amazon S3. Other users can access the data only if you explicitly edit security permissions.

You can send data to Amazon S3 using the secure HTTPS protocol. Amazon EMR always uses a secure channel to send data between Amazon S3 and Amazon EC2. For added security, you can encrypt your data before uploading it to Amazon S3. For more information on AWS security, go to the AWS Security Center.

Supports Hadoop Methods

Amazon EMR supports job flows based on streaming, Hive, Pig, Custom JAR, and Cascading. Streaming enables you to write application logic in any language and to process large amounts of data using the Hadoop framework. Hive and Pig offer nonprogramming options with their SQL-like scripting languages. Custom JAR files enable you to write Java-based MapReduce functions. Cascading is an API with built-in MapReduce support that lets you create complex distributed processes. For more information, see Using Amazon EMR (p. 15).

Multiple Sequential Steps

Amazon EMR supports job flows with multiple, sequential steps, including the ability to add steps while a job flow runs. Individual steps can combine to create more sophisticated job flows. Additionally, you can incrementally add steps to a running job flow to help with debugging. For more information, see Add Steps to a Job Flow (p. 79).
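
The following sketch shows one way to add a step to a job flow that is already running, using the AddJobFlowSteps API through the AWS SDK for Java; the job flow ID, JAR location, and arguments are placeholders.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class AddStepToRunningJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("access-key-id", "secret-access-key"));

        // A custom JAR step appended to an existing job flow; the job flow keeps
        // running its earlier steps and picks this one up when they finish.
        StepConfig extraStep = new StepConfig()
            .withName("Additional processing")
            .withActionOnFailure("CANCEL_AND_WAIT")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mybucket/jars/processing.jar")
                .withArgs("s3n://mybucket/input", "s3n://mybucket/output-2"));

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
            .withJobFlowId("j-XXXXXXXXXXXXX")
            .withSteps(extraStep));
    }
}
```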


Amazon EMR Concepts

Topics

• Job Flows and Steps (p. 6)

• Hadoop and MapReduce (p. 7)

This section describes the concepts and terminology you need to understand and use Amazon Elastic MapReduce (Amazon EMR).

Job Flows and Steps

A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow.

You can track the progress of a job flow by checking its state. The following diagram shows the life cycle of a job flow and how each part of the job flow process maps to a particular job flow state.

A successful Amazon Elastic MapReduce (Amazon EMR) job flow follows this process: Amazon EMR first provisions a Hadoop cluster. During this phase, the job flow state is STARTING. Next, any user-defined bootstrap actions are run. During this phase, the job flow state is BOOTSTRAPPING. After all bootstrap actions are completed, the job flow state is RUNNING. The job flow sequentially runs all job flow steps during this phase. After all steps run, the job flow state transitions to SHUTTING_DOWN and the job flow shuts down the cluster. All data stored on a cluster node is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists. Finally, when all job flow activity is complete, the job flow state is marked as COMPLETED.

You can configure a job flow to go into a WAITING state once it completes processing of all steps. A job flow in the WAITING state continues running, waiting for you to add steps or manually terminate it. When you manually terminate a job flow, the Hadoop cluster shuts down and the job flow state is SHUTTING_DOWN. When the job flow activity is complete, the final job flow state is TERMINATED. Creating a WAITING job flow is useful when troubleshooting. For more information on troubleshooting, see Debug Job Flows with Steps (p. 206).

Any failure during the job flow process terminates the job flow and shuts down all cluster nodes. Any data stored on a cluster node is deleted. The job flow state is marked as FAILED.


For a complete list of job flow states, see the JobFlowExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.

You can also track the progress of job flow steps by checking their state. The following diagram shows the processing of job flow steps and how each step maps to a particular state.

A job flow contains one or more steps. Steps are processed in the order in which they are listed in the job flow. Steps are run following this sequence: all steps have their state set to PENDING. The first step is run and the step's state is set to RUNNING. When the step is completed, the step's state changes to COMPLETED. The next step in the queue is run, and the step's state is set to RUNNING. After each step completes, the step's state is set to COMPLETED and the next step in the queue is run. Steps are run until there are no more steps. Processing flow returns to the job flow.

If a step fails, the step state is FAILED and all remaining steps with a PENDING state are marked as CANCELLED. No further steps are run, and processing returns to the job flow.

Data is normally communicated from one step to the next using files stored on the cluster's Hadoop Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When the cluster is shut down, all data is deleted. The final step in a job flow typically stores the processing results in an Amazon S3 bucket.

For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.
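
To make the state model concrete, the sketch below polls both the job flow state and the individual step states through the DescribeJobFlows API using the AWS SDK for Java. The job flow ID is a placeholder, and production code would add error handling.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;
import com.amazonaws.services.elasticmapreduce.model.StepDetail;

public class PollJobFlowState {
    public static void main(String[] args) throws InterruptedException {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("access-key-id", "secret-access-key"));

        while (true) {
            JobFlowDetail jobFlow = emr.describeJobFlows(
                new DescribeJobFlowsRequest().withJobFlowIds("j-XXXXXXXXXXXXX"))
                .getJobFlows().get(0);

            String state = jobFlow.getExecutionStatusDetail().getState();
            System.out.println("Job flow state: " + state);

            // Each step reports its own state (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED).
            for (StepDetail step : jobFlow.getSteps()) {
                System.out.println("  Step " + step.getStepConfig().getName()
                    + ": " + step.getExecutionStatusDetail().getState());
            }

            // Stop polling once the job flow reaches a terminal state.
            if (state.equals("COMPLETED") || state.equals("FAILED")
                    || state.equals("TERMINATED")) {
                break;
            }
            Thread.sleep(30 * 1000); // Wait 30 seconds between polls.
        }
    }
}
```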

Hadoop and MapReduce

Topics

• What Is Hadoop? (p. 8)

• What Is MapReduce? (p. 8)

• Instance Groups (p. 9)


• Supported Hadoop Versions (p. 10)

• Supported File Systems (p. 10)

This section explains the roles of Apache Hadoop and MapReduce in Amazon Elastic MapReduce (Amazon EMR) and how these two methodologies work together to process data.

What Is Hadoop?

Apache Hadoop is an open-source Java software framework that supports massive data processing across a cluster of servers. Hadoop uses a programming model called MapReduce that divides a large data set into many small fragments. Hadoop distributes a data fragment and a copy of the MapReduce executable to each of the slave nodes in a Hadoop cluster. Each slave node runs the MapReduce executable on its subset of the data. Hadoop then combines the results from all of the nodes into a finished output. Amazon EMR enables you to upload that output into an Amazon S3 bucket you designate.

For more information about Hadoop, go to http://hadoop.apache.org.

What Is MapReduce?

MapReduce is a combination of mapper and reducer executables that work together to process data. The mapper executable processes the raw data into key/value pairs, called intermediate results. The reducer executable combines the intermediate results, applies additional algorithms, and produces the final output, as described in the following process.

MapReduce Process

1. Amazon Elastic MapReduce (Amazon EMR) starts your instances in two security groups: one for the master node and another for the core and task nodes.

2. Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node.

3. Hadoop distributes the data files and the MapReduce executable to the core and task nodes of the cluster. Hadoop handles machine failures and manages network communication between the master, core, and task nodes. In this way, developers do not need to know how to perform distributed programming or handle the details of data redundancy and failover.

4. The mapper function uses an algorithm that you supply to parse the data into key/value pairs. These key/value pairs are passed to the reducer. As an example, for a job flow that counts the number of times a word appears in a document, the mapper might take each word in a document and assign it a value of 1. Each word is a key in this case, and all values are 1.

5. The reducer function collects the results from all of the mapper functions in the cluster, eliminates redundant keys by combining the values of all like keys, performs the designated operation on all the values for each key, and then outputs the results. Continuing with the previous example, the reducer takes all of the word counts from all of the mapper functions running in the cluster, adds up the number of times each word was found, and then outputs that result to Amazon S3.

You can write the executables in any programming language. Mapper and reducer applications written in Java are compiled into a JAR file. Executables written in other programming languages use the Hadoop streaming utility to implement the mapper and reducer algorithms.


The mapper executable reads the input from standard input and the reducer outputs data through standard output. By default, each line of input/output represents a record and the first tab on each line of the output separates the key and value.

For more information about MapReduce, go to How Map and Reduce operations are actually carried out (http://wiki.apache.org/hadoop/HadoopMapReduce).
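
For readers who prefer code, the word-count example above might be written against the Hadoop 0.20 Java MapReduce API roughly as follows. This is a minimal sketch of the standard WordCount pattern rather than code shipped with Amazon EMR; the input and output paths are passed as arguments when the JAR step runs.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word and writes the total.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3n://mybucket/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. s3n://mybucket/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```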

Instance Groups

Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server infrastructure to run the Hadoop cluster. Amazon EMR refers to this cluster as a job flow, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task.

Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes.

If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining are core or task nodes.

For more information about instance groups, see Resizing Running Job Flows (p. 96).

Master Instance Group

The master instance group manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. For more information, see View Logs Using SSH (p. 197).

As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.

Note
The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the instance controller will be unable to launch and manage instances.

Core Instance Group

The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow. However, you can add more core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.

Caution
Removing HDFS from a running node runs the risk of losing data.

For more information about core instance groups, see Resizing Running Job Flows (p. 96).


Task Instance Group

The task instance group contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress.

Task nodes are managed by the master node. While a job flow is running, you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

For more information about task instance groups, see Resizing Running Job Flows (p. 96).
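
As a sketch of how such a resize can be requested programmatically, the following uses the ModifyInstanceGroups API through the AWS SDK for Java to set the task instance group of a running job flow to ten nodes. The job flow ID is a placeholder and error handling is omitted.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupDetail;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

public class ResizeTaskGroup {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("access-key-id", "secret-access-key"));

        // Look up the task instance group of the running job flow.
        JobFlowDetail jobFlow = emr.describeJobFlows(
            new DescribeJobFlowsRequest().withJobFlowIds("j-XXXXXXXXXXXXX"))
            .getJobFlows().get(0);

        for (InstanceGroupDetail group : jobFlow.getInstances().getInstanceGroups()) {
            if ("TASK".equals(group.getInstanceRole())) {
                // Request ten task nodes; Amazon EMR adds or removes nodes to match.
                emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                    .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId(group.getInstanceGroupId())
                        .withInstanceCount(10)));
            }
        }
    }
}
```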

Supported Hadoop Versions

Amazon Elastic MapReduce (Amazon EMR) allows you to choose to run either Hadoop version 0.18, Hadoop version 0.20, or Hadoop version 0.20.205.

For more information on Hadoop configuration, see Hadoop Configuration (p. 299).

Supported File Systems

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a job flow:

• Hadoop Distributed File System (HDFS)

• Amazon S3 Native File System (S3N)

• Local file system

• Legacy Amazon S3 Block File System

HDFS and S3N are the two main file systems used with Amazon EMR.

HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the job flows and the Hadoop cluster nodes managing the individual steps. For more information on how HDFS works, see http://hadoop.apache.org/docs/hdfs/current/hdfs_user_guide.html.

The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written with other tools. For information on how Amazon S3 and Hadoop work together, see http://wiki.apache.org/hadoop/AmazonS3.

The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an Amazon EC2 instance which comes with a preconfigured block of preattached disk storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during the life of the associated Amazon EC2 instance. The amount of this disk storage varies by Amazon EC2 instance type. It is ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information about Amazon EC2 instances, see Amazon Elastic Compute Cloud.

The Amazon S3 Block File System is a legacy file storage system. We strongly discourage the use of this system.

For more information on how to use and configure file systems in Amazon EMR, see File System Configuration (p. 338).
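
To make the distinction concrete, the following sketch shows how the file systems appear as URIs in the arguments of two hypothetical custom JAR steps: input is read from S3N, intermediate data is kept on HDFS (and disappears when the cluster shuts down), and the final result is written back to S3N. The bucket and JAR names are placeholders.

```java
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class FileSystemUris {
    // Two chained steps: the first writes intermediate output to HDFS,
    // the second reads it from HDFS and stores the final result in Amazon S3.
    public static StepConfig[] buildSteps() {
        StepConfig firstPass = new StepConfig()
            .withName("First pass")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mybucket/jars/analysis.jar")
                .withArgs("s3n://mybucket/input",   // input read from S3N
                          "hdfs:///intermediate")); // intermediate data kept on HDFS

        StepConfig secondPass = new StepConfig()
            .withName("Second pass")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3n://mybucket/jars/analysis.jar")
                .withArgs("hdfs:///intermediate",      // read back from HDFS
                          "s3n://mybucket/output"));   // final results persisted to S3N

        return new StepConfig[] { firstPass, secondPass };
    }
}
```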


Associated AWS Product Concepts

Topics

• Amazon EC2 Concepts (p. 11)

• Amazon S3 Concepts (p. 14)

• AWS Identity and Access Management (IAM) (p. 14)

• Regions (p. 14)

• Data Storage (p. 14)

This section describes AWS concepts and terminology you need to understand to use Amazon Elastic MapReduce (Amazon EMR) effectively.

Amazon EC2 Concepts

Topics

• Amazon EC2 Instances (p. 11)

• Reserved Instances (p. 13)

• Elastic IP Address (p. 13)

• Amazon EC2 Key Pairs (p. 13)

The following sections describe Amazon EC2 features used by Amazon EMR.

Amazon EC2 Instances

Amazon EMR enables you to choose the number and kind of Amazon EC2 instances that comprise the cluster that processes your job flow. Amazon EC2 offers several basic types.

• Standard—You can use Amazon EC2 standard instances for most applications.

• High-CPU—These instances have proportionally more CPU resources than memory (RAM) for compute-intensive applications.

• High-Memory—These instances offer large memory sizes for high throughput applications, including database and memory caching applications.

• Cluster Compute—These instances provide proportionally high CPU resources with increased network performance. They are well suited for demanding network-bound applications.

• High Storage—These instances provide proportionally high storage resources. They are well suited for data warehouse applications.

Note
Amazon EMR does not support micro instances at this time.

The following list describes all of the instance types that Amazon EMR supports. For each type, the Amazon EC2 API name, RAM (GiB), compute units, disk drive storage (GiB), platform (bits), and I/O performance are shown.

• Small (default): m1.small; 1.7 GiB RAM; 1 compute unit; 150 GiB disk; 32-bit platform; Moderate I/O performance
• Large: m1.large; 7.5 GiB RAM; 4 compute units; 840 GiB disk; 64-bit platform; High I/O performance
• Extra Large: m1.xlarge; 15 GiB RAM; 8 compute units; 1680 GiB disk; 64-bit platform; High I/O performance
• High-CPU Medium: c1.medium; 1.7 GiB RAM; 5 compute units; 340 GiB disk; 32-bit platform; Moderate I/O performance
• High-CPU Extra Large: c1.xlarge; 7 GiB RAM; 20 compute units; 1680 GiB disk; 64-bit platform; High I/O performance
• High-Memory Extra Large: m2.xlarge; 17.1 GiB RAM; 6.5 compute units; 420 GiB disk; 64-bit platform; Moderate I/O performance
• High-Memory Double Extra Large: m2.2xlarge; 34.2 GiB RAM; 13 compute units; 850 GiB disk; 64-bit platform; Moderate I/O performance
• High-Memory Quadruple Extra Large: m2.4xlarge; 68.4 GiB RAM; 26 compute units; 1680 GiB disk; 64-bit platform; High I/O performance
• Cluster Compute Quadruple Extra Large Instance*: cc1.4xlarge; 23 GiB RAM; 33.5 compute units; 1690 GiB disk; 64-bit platform; Very High I/O performance (10 Gigabit Ethernet)
• Cluster Compute Eight Extra Large**: cc2.8xlarge; 60.5 GiB RAM; 88 compute units; 3360 GiB disk; 64-bit platform; Very High I/O performance (10 Gigabit Ethernet)
• High Storage*: hs1.8xlarge; 117 GiB RAM; 35 compute units; 49152 GiB disk; 64-bit platform; Very High I/O performance (10 Gigabit Ethernet)
• Cluster GPU***: cg1.4xlarge; 23 GiB RAM; 33.5 compute units; 1680 GiB disk; 64-bit platform; Very High I/O performance (10 Gigabit Ethernet)

*Cluster Compute Quadruple Extra Large instances and High Storage instances are supported only in the US East (Northern Virginia) Region.

**Cluster Compute Eight Extra Large instances are only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions.

***Cluster GPU instances have 22 GB, with 1 GB reserved for GPU operation.

The practical limit of the amount of data you can process depends on the number and type of Amazon EC2 instances selected as your cluster nodes, and on the size of your intermediate and final data. This is because the input, intermediate, and output data sets reside on the cluster nodes while your job flow runs. For example, the maximum amount of data that you can process on a 20-node cluster is 34 TB (20 Extra Large instances x 1.69 TB of hard disk per Amazon EC2 instance = 34 TB).

The default maximum number of Amazon EC2 instances you can specify is 20. If you need more instances, you can make a formal request. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Related Topics

• Request additional Amazon EC2 instances

• Amazon EC2 Instance Types

• High Performance Computing (HPC)


Reserved Instances

Reserved Instances provide guaranteed capacity and are an additional Amazon EC2 pricing option. You make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually consume, and if you don't use an instance, you don't pay usage charges for it.

To use a Reserved Instance with Amazon EMR, launch your job flow in the same Availability Zone as your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East. If you launch a job flow that uses two m1.small instances in the same Availability Zone in Region US-East, one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you have a sufficient number of available Reserved Instances for the total number of instances you want to launch, you are guaranteed capacity. Your Reserved Instances are used before any On-Demand Instances are created.

You can use Reserved Instances by using either the Amazon EMR console, the command line interface (CLI), Amazon EMR API actions, or the AWS SDKs.

Related Topics

• Amazon EC2 Reserved Instances

Elastic IP Address

Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance. You control the addresses associated with your account until you choose to explicitly release them.

You can associate one Elastic IP address with only one job flow at a time. To ensure our customers are efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated with your account are not mapped to a job flow or Amazon EC2 instance. When Elastic IP addresses are mapped to an instance, they are free of charge.

For more information about enabling Elastic IP addresses with Amazon EMR, see Using Elastic IP Addresses (p. 287). For more information about using IP addresses in AWS, go to the Using Elastic IP Addresses section in the Amazon Elastic Compute Cloud User Guide.

Amazon EC2 Key Pairs

When Amazon EMR starts an Amazon EC2 instance, it uses a 2048-bit RSA key pair that you have named. Amazon EC2 stores the public key. Amazon EMR stores the private key and uses the private key to validate all requests.

The key pair ensures that only you can access your job flows. When you launch an instance using your key pair name, the public key becomes part of the instance metadata. This allows you to access the cluster node securely.

Although specifying the key pair is optional, we strongly recommend that you use key pairs. This key pair becomes associated with all of the nodes created to process your job flow. The key pair name creates a handle you can use to access the master node in the Hadoop cluster. With the key pair name, you can log in to the master node without using a password, enabling you to monitor the progress of your job flows. On the master node, you can retrieve detailed job flow processing status and statistics.

For more information on how to create and use an Amazon EC2 key pair with Amazon EMR, see "Creating an Amazon EC2 Key Pair" in the Getting Started Guide.


Amazon S3 Concepts

Topics

• Buckets (p. 14)

• Multipart Upload (p. 14)

The following sections describe Amazon S3 features used by Amazon EMR.

Buckets

Amazon EMR requires Amazon S3 buckets to hold the input and output data of your Hadoop processing. Amazon EMR uses the Amazon S3 Native File System for Hadoop processing. Amazon S3 uses the hostname method for accessing data, which places restrictions on bucket names used in Amazon EMR job flows.

For more information on creating Amazon S3 buckets for use with Amazon EMR, see Setting Up Your Environment to Run a Job Flow (p. 17). For more information on Amazon S3 buckets, go to Working with Amazon S3 Buckets in the Amazon S3 Developer Guide.

Multipart Upload

Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.

For more information about enabling multipart uploads with Amazon EMR, see Multipart Upload (p. 343). For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload in the Amazon S3 Developer Guide.
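
Outside of the job flow itself, the same AWS SDK for Java can use multipart upload when you push large input files to Amazon S3, for example through its TransferManager helper. The sketch below assumes that helper; the bucket name, key, and file path are placeholders.

```java
import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadExample {
    public static void main(String[] args) throws InterruptedException {
        TransferManager transferManager = new TransferManager(
            new BasicAWSCredentials("access-key-id", "secret-access-key"));
        try {
            // TransferManager splits large files into parts and uploads them in parallel.
            Upload upload = transferManager.upload(
                "mybucket", "input/large-dataset.gz", new File("large-dataset.gz"));
            upload.waitForCompletion();
            System.out.println("Upload complete.");
        } finally {
            transferManager.shutdownNow(); // Release the underlying thread pool.
        }
    }
}
```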

AWS Identity and Access Management (IAM)

Amazon Elastic MapReduce (Amazon EMR) supports AWS Identity and Access Management (IAM) policies. IAM is a web service that enables AWS customers to manage users and user permissions. For more information about enabling IAM policies with Amazon EMR, see Configure User Permissions with IAM (p. 274). For more information on IAM, go to Using IAM in the Using AWS Identity and Access Management guide.

Regions

You can choose the geographical region where Amazon EC2 creates the cluster to process your data. You might choose a region to optimize latency, minimize costs, or address regulatory requirements. Setting a region-specific endpoint guarantees where your data resides. For the list of regions and endpoints supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.
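
When calling the service programmatically, the region is chosen by pointing the client at a region-specific endpoint. A minimal sketch with the AWS SDK for Java follows; the endpoint shown is assumed to be the US East endpoint of this era, so confirm the value against the Regions and Endpoints reference.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;

public class ChooseRegionEndpoint {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("access-key-id", "secret-access-key"));

        // Direct all Amazon EMR API calls to a specific region.
        emr.setEndpoint("https://elasticmapreduce.us-east-1.amazonaws.com");
    }
}
```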

Data Storage

Amazon EMR uses Amazon S3 and Amazon SimpleDB data storage systems when processing a job flow. For more information about using Amazon S3 with Hadoop, go to http://wiki.apache.org/hadoop/AmazonS3. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.


Using Amazon EMR

Topics

• Setting Up Your Environment to Run a Job Flow (p. 17)

• Create a Job Flow (p. 23)

• View Job Flow Details (p. 72)

• Terminate a Job Flow (p. 77)

• Customize a Job Flow (p. 79)

• Connect to the Master Node in an Amazon EMR Job Flow (p. 110)

• Use Cases (p. 122)

• Building Binaries Using Amazon EMR (p. 131)

• Using Tagging (p. 136)

• Protect a Job Flow from Termination (p. 136)

• Lower Costs with Spot Instances (p. 141)

• Store Data with HBase (p. 155)

• Troubleshooting (p. 183)

• Monitor Metrics with Amazon CloudWatch (p. 209)

• Monitor Performance with Ganglia (p. 220)

• Distributed Copy Using S3DistCp (p. 227)

• Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR (p. 234)

• Use Third Party Applications With Amazon EMR (p. 258)

This section covers the fundamentals of creating, managing, and troubleshooting a job flow using Amazon Elastic MapReduce (Amazon EMR). All supported job flow types are described. Information on using the Amazon EMR console, the CLI, SDKs, and API is included.

If you have not signed up to use Amazon EMR, instructions are provided in the Getting Started Guide.

Tip
We strongly recommend that you work through the examples in the Getting Started Guide to get a basic understanding of Amazon EMR.

Amazon EMR offers a variety of interfaces, including a console, a command line interface (CLI), a query API, AWS SDKs, and libraries. Each interface offers a different balance of ease and functionality. The interface you choose depends on your knowledge of Hadoop, your programming skills, and the functionality you require:


• The Amazon EMR console provides a graphical interface from which you can launch Amazon EMR job flows and monitor their progress.

• The CLI combines full compatibility with the Amazon EMR API without requiring a programming environment. The Ruby-based Amazon EMR CLI is available for download at Amazon Elastic MapReduce Ruby Client (http://aws.amazon.com/developertools/2264).

• The Amazon EMR API, SDKs, and libraries offer the most flexibility but require a programming environment and software development skills. For more information on using the query API to access Amazon EMR, see Write Amazon EMR Applications (p. 263) in this guide. The AWS SDKs provide support for Java, C#, and .NET. For more information on the AWS SDKs, refer to the list of current AWS SDKs. Libraries are available for Perl and PHP. For more information about the Perl and PHP libraries, see Sample Code & Libraries (http://aws.amazon.com/code/Elastic-MapReduce).

The following table compares the functionality of the Amazon EMR interfaces.

API/SDK/Libraries

CLIAmazonEMRConsole

Function

Create multiple job flows

Define bootstrap actions in a job flow

View logs for Hadoop jobs, tasks, and task attempts usinga graphical interface

Implement Hadoop data processing programmatically

Monitor job flows in real time

Provide verbose job flow details

Resize running job flows

Run job flows with multiple steps

Select version of Hadoop, Hive, and Pig

Specify the MapReduce executable in multiple computerlanguages

Specify the number and type of Amazon Amazon EC2instances that process the data

Transfer data to and from Amazon S3 automatically

Terminate job flows in real time

The following sections describe how to use Amazon Elastic MapReduce (Amazon EMR) with each of theinterface types.


Setting Up Your Environment to Run a Job Flow

This section walks you through how to set up the required resources and permissions to run a job flow. The tasks that follow show you how to create the resources that your job flow uses to process data. Once created, you can reuse these resources for other job flows. Depending on your application, however, it may make operational sense to create new resources for each job flow.

The tasks that must be completed before you create a job flow are as follows:

1. Choose a Region (p. 17)
2. Create and Configure an Amazon S3 Bucket (p. 19)
3. Create an Amazon EC2 Key Pair and PEM File (p. 20)
4. Modify Your PEM File (p. 21)
5. For CLI and API users only, Get Security Credentials (p. 21)
6. For CLI users only, optionally Create a Credentials File (p. 22)

The following sections provide instructions on how to perform each of the tasks.

Choose a Region

AWS enables you to place resources in multiple locations. Locations are composed of Regions and Availability Zones within those Regions. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low-latency network connectivity to other Availability Zones in the same Region.

All Amazon EC2 instances, key pairs, security groups, and Amazon Elastic MapReduce (Amazon EMR) job flows must be located in the same Region. To optimize performance and reduce latency, other resources (such as Amazon S3 buckets) should also be located in the same Region as your job flow.

For more information about Regions and Availability Zones, go to Using Regions and Availability Zones in the Amazon Elastic Compute Cloud User Guide.

Note
Not all AWS products offer the same support in all Regions. For example, Cluster Compute instances are available only in the US East (Northern Virginia) Region, and the Asia Pacific (Sydney) Region supports only Hadoop 1.0.3 and later. Confirm that you are working in the appropriate Region for the resources you want to use.

You must ensure that you use the same Region for each resource you create. Use the table below to identify the correct Region name.

If your Amazon EMR Region is... | The Amazon EMR CLI and API Region is... | The Amazon S3 Region is... | The Amazon EC2 Region is...
US East (Virginia) | us-east-1 | US Standard | US East (Virginia)
US West (Oregon) | us-west-2 | Oregon | US West (Oregon)
US West (N. California) | us-west-1 | Northern California | US West (N. California)
EU West (Ireland) | eu-west-1 | Ireland | EU West (Ireland)
Asia Pacific (Singapore) | ap-southeast-1 | Singapore | Asia Pacific (Singapore)
Asia Pacific (Sydney) | ap-southeast-2 | Sydney | Asia Pacific (Sydney)
Asia Pacific (Tokyo) | ap-northeast-1 | Tokyo | Asia Pacific (Tokyo)
South America (Sao Paulo) | sa-east-1 | Sao Paulo | South America (Sao Paulo)

Using the Amazon EMR Console to Specify a Region

To select a region in Amazon EMR

• From the Amazon EMR console, select the Region from the drop-down list.

Using the CLI to Specify a Region

Specify the Region with the --region parameter, as in the following example. If the --region parameter is not specified, the job flow is created in the us-east-1 Region.

$ ./elastic-mapreduce --create --alive --stream --input myawsbucket \
--output myawsbucket --log-uri s3n://myawsbucket/logs --region eu-west-1

Tip
To reduce the number of parameters required each time you issue a command from the CLI, you can store information such as the Region in your credentials.json file. For more information on creating a credentials.json file, see Create a Credentials File (p. 22).


Using the API to Specify a Region

To select a Region, configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown in the following example:

client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");

Once your application has specified a Region by setting the endpoint, you can set the Availability Zone for your job flow's Amazon EC2 instances with a query request that contains an Instances.Placement.AvailabilityZone parameter, as in the following example. If you do not specify the Availability Zone for your job flow, Amazon EMR launches the job flow instances in the best Availability Zone in that Region based on system health and available capacity.

https://elasticmapreduce.amazonaws.com?Operation=...
&Instances.Placement.AvailabilityZone=eu-west-1a
&...

For more information about the parameters in an Amazon EMR request, see API Reference.
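If you are building your client with the AWS SDK for Java, the same two settings (the regional endpoint and the Availability Zone) can be expressed in code. The following is a minimal sketch under that assumption; the credentials are placeholders, and the rest of the job flow configuration is omitted.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.PlacementType;

public class RegionSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient client = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("AccessKeyID", "SecretAccessKey"));

        // Send Amazon EMR requests to the EU (Ireland) regional endpoint.
        client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");

        // Optionally pin the job flow's EC2 instances to an Availability Zone;
        // if omitted, Amazon EMR chooses the best Availability Zone in the Region.
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withPlacement(new PlacementType().withAvailabilityZone("eu-west-1a"));

        // ... add instance counts, types, and steps, then submit a RunJobFlowRequest.
    }
}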

Note
For more information on specifying Regions from the CLI and API, see Available Region Endpoints for the AWS SDKs.

Create and Configure an Amazon S3 Bucket

Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. To conform with Amazon S3 requirements, DNS requirements, and restrictions in the supported data analysis tools, we recommend the following guidelines for bucket names. All bucket names must:

• Be between 3 and 63 characters long
• Contain only lowercase letters, numbers, or periods (.)
• Not contain a dash (-) or underscore (_)

For additional details on valid bucket names, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

This section shows you how to use the AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference.
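As a sketch of the programmatic route mentioned above, the following uses the Amazon S3 client from the AWS SDK for Java to create a bucket and then grant authenticated users read (list) access, mirroring the console steps in this section. The bucket name is a placeholder and must be globally unique.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AccessControlList;
import com.amazonaws.services.s3.model.GroupGrantee;
import com.amazonaws.services.s3.model.Permission;

public class CreateBucketSketch {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("AccessKeyID", "SecretAccessKey"));

        String bucketName = "mylog-uri"; // placeholder; bucket names are globally unique
        s3.createBucket(bucketName);

        // Grant authenticated AWS users permission to list the bucket contents,
        // keeping full control with the bucket owner.
        AccessControlList acl = s3.getBucketAcl(bucketName);
        acl.grantPermission(GroupGrantee.AuthenticatedUsers, Permission.Read);
        s3.setBucketAcl(bucketName, acl);
    }
}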

Using the AWS Management Console to Create an Amazon S3 Bucket

To create an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
2. Click Create Bucket. The Create a Bucket dialog box opens.
3. Enter a bucket name, such as mylog-uri. This name should be globally unique, and cannot be the same name used by another bucket.
4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same Region as your job flow. Refer to Choose a Region (p. 17) for guidance on choosing a Region.

5. Click Create.

You created a bucket with the URI s3n://mylog-uri/.

Note
If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR job flow logs.

Note
For more information on specifying Region-specific buckets, refer to Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs.

After you create your bucket, you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and authenticated users read access.

Using the AWS Management Console to configure an Amazon S3 bucket

To set permissions on an Amazon S3 bucket

1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

2. In the Buckets pane, right-click the bucket you just created.

3. Select Properties.

4. In the Properties pane, select the Permissions tab.

5. Click Add more permissions.

6. Select Authenticated Users in the Grantee field.

7. To the right of the Grantee drop-down list, select List.

8. Click Save.

You have created a bucket and restricted permissions to authenticated users.

Create an Amazon EC2 Key Pair and PEM File

Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch. The PEM file associated with this key pair is required to ssh directly to the master node of the cluster running your job flow.

To create an Amazon EC2 key pair

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

2. From the Amazon EC2 console, select a Region.

3. In the Navigation pane, click Key Pairs.


4. On the Key Pairs page, click Create Key Pair.

5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair.

6. Click Create.

7. Save the resulting PEM file in a safe location.

Your Amazon EC2 key pair and an associated PEM file are created.
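If you prefer to create the key pair programmatically rather than in the console, the following is a minimal sketch using the Amazon EC2 client from the AWS SDK for Java; it saves the returned private key material as the PEM file used later to connect to the master node. The key pair name, endpoint, and file path are placeholders.

import java.io.FileWriter;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.CreateKeyPairRequest;
import com.amazonaws.services.ec2.model.CreateKeyPairResult;

public class CreateKeyPairSketch {
    public static void main(String[] args) throws Exception {
        AmazonEC2Client ec2 = new AmazonEC2Client(
                new BasicAWSCredentials("AccessKeyID", "SecretAccessKey"));

        // Use the Amazon EC2 endpoint for the same Region as your job flow.
        ec2.setEndpoint("ec2.us-east-1.amazonaws.com");

        CreateKeyPairResult result = ec2.createKeyPair(
                new CreateKeyPairRequest().withKeyName("mykeypair"));

        // The private key is returned only at creation time; save it as the PEM file.
        FileWriter writer = new FileWriter("mykeypair.pem");
        writer.write(result.getKeyPair().getKeyMaterial());
        writer.close();
    }
}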

Modify Your PEM File

Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your job flow, allowing you to test job flow steps or troubleshoot your cluster environment. To log in directly to the master node of your running job flow, you can use ssh or PuTTY. You use your PEM file to authenticate to the master node. The PEM file requires a modification based on the tool you use on your operating system: you use the CLI to connect on Linux or UNIX computers, and you use PuTTY to connect on Microsoft Windows computers. For more information on how to install the Amazon EMR CLI or how to install PuTTY, go to the Getting Started Guide.

To modify your PEM file

• Create a local permissions file:

If you are using Linux or UNIX, set the permissions on the PEM file for your Amazon EC2 key pair. For example, if you saved the file as mykeypair.pem, the command looks like the following:

$ chmod og-rwx mykeypair.pem

If you are using Microsoft Windows:

a. Download PuTTYgen.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
b. Launch PuTTYgen.
c. Click Load. Select the PEM file you created earlier.
d. Click Open.
e. Click OK on the PuTTYgen Notice telling you the key was successfully imported.
f. Click Save private key to save the key in the PPK format.
g. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.
h. Enter a name for your PuTTY private key, such as mykeypair.ppk.
i. Click Save.
j. Exit the PuTTYgen application.

Your PEM file is modified to allow you to log in directly to the master node of your running job flow.

Get Security Credentials

AWS assigns you an Access Key ID and a Secret Access Key to identify you as the sender of your request. AWS uses these security credentials to help protect your data. You include your Access Key ID in all AWS requests made through the CLI or API. The AWS Management Console provides these security credentials automatically.

Note
Your Secret Access Key is a shared secret between you and AWS. Keep this key secret. Amazon uses this key to bill you for the AWS services you use. Never include your key in your requests to AWS, and never email your key to anyone, even if an inquiry appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key.

To get your Access Key ID and Secret Access Key

1. Go to the AWS website.

2. Click My Account to display a list of options.

3. Click Security Credentials and log in to your AWS account. Your Access Key ID is displayed in the Access Credentials section. Your Secret Access Key remains hidden as a further precaution.
4. To display your Secret Access Key, click Show in the Your Secret Access Key area.

You have your Access Key ID and a Secret Access Key to securely identify yourself to AWS. You need this information to create a credentials file, as described in the following section.

Create a Credentials File

You can use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. The credentials file provides information required for many commands. The credentials file is a convenient place for you to store command parameters so you don't have to repeatedly enter the information.

Your credentials are used to calculate the signature value for every request you make. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json. You can edit the credentials.json file and include your AWS credentials. If you do not have a credentials.json file, you must include your credentials in every request you make.

To create your credentials file

1. Create a file named credentials.json on your computer.

2. Add the following lines to your credentials file:


{ "access-id": "AccessKeyID", "private-key": "PrivateKey", "key-pair": "KeyName", "key-pair-file": "location of key pair file", "region": "Region", "log-uri": "location of bucket on Amazon S3" }

The access-id and private-key are the AWS Access Key ID and a Secret Access Key described in GetSecurity Credentials (p. 21). The key-pair and key-pair-file are the Amazon EC2 key pair and the pathand name of PEM file you created in Create an Amazon EC2 Key Pair and PEM File (p. 20). The regionis the Region you selected in Choose a Region (p. 17). The log-uri is the path to the bucket you createdin Create and Configure an Amazon S3 Bucket (p.19) using the format s3n://BucketName/FolderName.

Your credentials.json file is configured.

Each of the preceding tasks guided you through the steps to set up the objects and permissions requiredfor a job flow.You are now ready to create your job flow. Instructions on how to create a job flow are atCreate a Job Flow (p. 23).

Create a Job Flow

Topics

• Choose a Job Flow Type (p. 23)

• Choose Job Flow Interface (p. 24)

• Identify Data, Scripts, and Log File locations (p. 24)

• How to Create a Streaming Job Flow (p. 24)

• How to Create a Job Flow Using Hive (p. 32)

• How to Create a Job Flow Using Pig (p. 40)

• How to Create a Job Flow Using a Custom JAR (p. 48)

• How to Create a Cascading Job Flow (p. 56)

• Launch an HBase Cluster on Amazon EMR (p. 64)

This section covers the basics of creating a job flow using Amazon Elastic MapReduce (Amazon EMR). You can create a job flow using the Amazon EMR console, by downloading and installing the Command Line Interface (CLI), or by creating a query request with the Query API. The interface-specific details for using either the Amazon EMR console, the CLI, or the API are covered in the following sections.

For information about creating the objects and setting the permissions needed to create a job flow, see Setting Up Your Environment to Run a Job Flow (p. 17). For information on the job flow process and how individual steps are processed, see Job Flows and Steps (p. 6).

Choose a Job Flow Type

Choose one of the supported job flow types. Your choice of job flow type depends on several factors, including the format of the data and your level of programming knowledge. For information on comparing the supported job flow types, see Appendix: Compare Job Flow Types (p. 389).


Choose Job Flow Interface

Choose the manner in which you want to create your job flow. The description of each job flow type in this section includes details on how to create a job flow using the Amazon EMR console, the CLI, or the Query API. The Amazon EMR console provides a graphical interface to launch Elastic MapReduce job flows and monitor their progress. The CLI offers full compatibility with the Elastic MapReduce API without requiring a programming environment. The Elastic MapReduce API, AWS SDKs, and libraries offer the most flexibility, but require a programming environment and software development skills.

Identify Data, Scripts, and Log File locations

You need to plan the job flow you want to run and specify where Amazon EMR finds the information. Typically, the MapReduce program or script is located in a bucket on Amazon S3. Your job flow input, output, and job flow logs are also typically located on Amazon S3.

Required Amazon S3 buckets must exist before you can create a job flow. You must upload any required scripts or data referenced in the job flow to Amazon S3. The following table describes example data, scripts, and log file locations.

Information | Example Location on Amazon S3
script or program | s3://myawsbucket/wordcount/wordSplitter.py
log files | s3://myawsbucket/wordcount/logs
input data | s3://myawsbucket/wordcount/input
output data | s3://myawsbucket/wordcount/output

For information on how to upload objects to Amazon S3, go to Add an Object to Your Bucket in the Amazon Simple Storage Service Getting Started Guide.
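If you want to upload the script and input data programmatically instead of through the console, the following is a small sketch using the Amazon S3 client from the AWS SDK for Java; the bucket, keys, and local file names are placeholders chosen to mirror the example locations in the preceding table.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class UploadJobFlowFilesSketch {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("AccessKeyID", "SecretAccessKey"));

        // Upload the mapper script and one input file to the example locations.
        s3.putObject("myawsbucket", "wordcount/wordSplitter.py",
                new File("wordSplitter.py"));
        s3.putObject("myawsbucket", "wordcount/input/input01.txt",
                new File("input01.txt"));
    }
}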

How to Create a Streaming Job Flow

This section covers the basics of creating and launching a streaming job flow using Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a streaming job flow using either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).

A streaming job flow reads input from standard input and then runs a script or executable (called a mapper) against each input. The result from each of the inputs is saved locally, typically on a Hadoop Distributed File System (HDFS) partition. Once all the input is processed by the mapper, a second script or executable (called a reducer) processes the mapper results. The results from the reducer are sent to standard output. You can chain together a series of streaming job flows, where the output of one streaming job flow becomes the input of another job flow.

The mapper and the reducer can each be referenced as a file, or you can supply a Java class. You can implement the mapper and reducer in any of the supported languages, including Ruby, Perl, Python, PHP, or Bash.

The example that follows is based on the Amazon EMR Word Count Example. This example shows how to use Hadoop streaming to count the number of times each word occurs within a text file. In this example, the input is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/wordcount/input. The mapper is a Python script that counts the number of times a word occurs in each input string and is located at s3://elasticmapreduce/samples/wordcount/wordSplitter.py. The reducer references a standard Hadoop library package called aggregate. Aggregate provides a special Java class and a list of simple aggregators that perform aggregations such as sum, max, and min over a sequence of values. The output is saved to an Amazon S3 bucket you created in Setting Up Your Environment to Run a Job Flow (p. 17).

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a streaming job flow.

To create a streaming job flow

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, do the following:
a. Enter a name in the Job Flow Name field. This name is optional, and does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select Streaming in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following field descriptions as a guide, and then click Continue.

Input Location*: Specify the URI where the input data resides in Amazon S3. The value must be in the form BucketName/path.
Output Location*: Specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path.
Mapper*: Specify either a class name that refers to a mapper class in Hadoop, or a path on Amazon S3 where the mapper executable, such as a Python program, resides. The path value must be in the form BucketName/path/MapperExecutable.
Reducer*: Specify either a class name that refers to a reducer class in Hadoop, or a path on Amazon S3 where the reducer executable, such as a Python program, resides. The path value must be in the form BucketName/path/ReducerExecutable. Amazon EMR supports the special aggregate keyword. For more information, go to the Aggregate library supplied by Hadoop.
Extra Args: Optionally, enter a list of arguments (space-separated strings) to pass to the Hadoop streaming utility. For example, you can specify additional files to load into the distributed cache.

* Required parameter


5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following field descriptions as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.
Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.
Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).


6. In the ADVANCED OPTIONS page, set additional configuration options, using the following field descriptions as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.
Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).
Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important: You can enable debugging for a job flow only when you initially create the job flow.
Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.
Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).
Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).

7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).


8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow, your request is processed; when it succeeds, a message appears.


9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a streaming job flow. Replace the sample output bucket (myawsbucket) with your own Amazon S3 bucket information.

To create a job flow

• Use the information below to create your job flow.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --stream \
--input s3n://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--reducer aggregate \
--output s3n://myawsbucket

If you are using Microsoft Windows, enter the following:

c:\ ruby elastic-mapreduce --create --stream --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate --output s3n://myawsbucket

The output looks similar to the following.

Created jobflow JobFlowID


By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

API

This section describes the Amazon EMR API Query request parameters you need to create a streaming job flow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

The Args argument contains location information for your input data, output data, mapper, reducer, and cache file, as shown in the following example.

"Name": "streaming job flow",
"HadoopJarStep": {
  "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
  "Args": [
    "-input", "s3n://elasticmapreduce/samples/wordcount/input",
    "-output", "s3n://myawsbucket",
    "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
    "-reducer", "aggregate"
  ]
}

Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/.
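If you are using the AWS SDK for Java rather than a hand-built query request, the same streaming step can be expressed with the SDK's model classes. The following is a sketch only: the jar path and step arguments mirror the Args example above, while the job flow name, key pair, instance settings, and log bucket are placeholders.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class StreamingJobFlowSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("AccessKeyID", "SecretAccessKey"));

        // The streaming step: same jar and arguments as the query example above.
        HadoopJarStepConfig streamingJar = new HadoopJarStepConfig()
                .withJar("/home/hadoop/contrib/streaming/hadoop-streaming.jar")
                .withArgs("-input", "s3n://elasticmapreduce/samples/wordcount/input",
                          "-output", "s3n://myawsbucket",
                          "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
                          "-reducer", "aggregate");

        StepConfig step = new StepConfig()
                .withName("streaming step")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(streamingJar);

        // Placeholder instance settings; a single m1.small node, as in the CLI default.
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(1)
                .withMasterInstanceType("m1.small")
                .withSlaveInstanceType("m1.small")
                .withEc2KeyName("mykeypair");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("streaming job flow")
                .withLogUri("s3n://myawsbucket/logs")
                .withInstances(instances)
                .withSteps(step);

        RunJobFlowResult result = emr.runJobFlow(request);
        // Store the job flow ID; it is needed to describe or terminate the job flow later.
        System.out.println("Created job flow " + result.getJobFlowId());
    }
}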

How to Create a Job Flow Using Hive

This section covers the basics of creating a job flow using Hive in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a job flow using Hive with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).

For advanced information on Hive configuration options, see Hive Configuration (p. 348).

A job flow using Hive enables you to create a data analysis application using a SQL-like language. The example that follows is based on the Amazon EMR sample: Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances. This sample describes how to correlate customer click data to specific advertisements.

In this example, the Hive script is located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs/model-build. All of the data processing instructions are located in the Hive script. The script requires additional libraries that are located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/hive-ads/libs. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/hive-ads/tables. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17).


Amazon EMR Console

This example describes how to use the Amazon EMR console to create a job flow using Hive.

To create a job flow using Hive

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, do the following:
a. Enter a name in the Job Flow Name field. We recommend you use a descriptive name. It does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select Hive in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, specify whether you want to run the Hive job from a script or interactively from the master node. If you are running Hive from a script, enter values in the boxes using the following field descriptions as a guide. Click Continue.

Script Location*: Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.
Input Location: Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Hive script as a parameter named INPUT.
Output Location: Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form BucketName/path. If specified, this will be passed to the Hive script as a parameter named OUTPUT.
Extra Args: Optionally, enter a list of arguments (space-separated strings) to pass to Hive.

* Required parameter


5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following field descriptions as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.
Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.
Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).


6. In the ADVANCED OPTIONS page, set additional configuration options, using the following field descriptions as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.
Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).
Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important: You can enable debugging for a job flow only when you initially create the job flow.
Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.
Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).
Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).

7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).


8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow, your request is processed; when it succeeds, a message appears.


9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a job flow using Hive.

To create a job flow using Hive

• Use the information below to create your job flow.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --name "Test Hive" \
--hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q \
--args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

If you are using Microsoft Windows, enter the following:

c:\ ruby elastic-mapreduce --create --name "Test Hive" --hive-script s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q --args "-d","LIBS=s3n://elasticmapreduce/samples/hive-ads/libs","-d","INPUT=s3n://elasticmapreduce/samples/hive-ads/tables","-d","OUTPUT=s3n://myawsbucket/hive-ads/output/"

The output looks similar to the following.

Created job flow JobFlowID


By default, this command launches a job flow to run on a two-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

API

This section describes the Amazon EMR API Query request parameters you need to create a job flow using Hive. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

The Args argument contains location information for your input data, output data, and LIBS, as shown in the following example.

"Name": "Hive job flow",
"HadoopJarStep": {
  "Jar": "s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
  "Args": [
    "s3://us-west-1.elasticmapreduce/libs/hive/hive-script",
    "--base-path", "s3://us-west-1.elasticmapreduce/libs/hive/",
    "--run-hive-script",
    "--args",
    "-f", "s3n://elasticmapreduce/samples/hive-ads/libs/model-build.q",
    "-d LIBS=s3n://elasticmapreduce/samples/hive-ads/libs"
  ]
}

Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/.

How to Create a Job Flow Using Pig

This section covers the basics of creating a job flow using Pig in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a job flow using Pig with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).

A job flow using Pig takes SQL-like commands written in Pig Latin and converts those commands into Hadoop MapReduce algorithms. The examples that follow are based on the Amazon EMR sample: Apache Log Analysis using Pig. The sample evaluates Apache log files and then generates a report containing the total bytes transferred, a list of the top 50 IP addresses, a list of the top 50 external referrers, and the top 50 search terms using Bing and Google. The Pig script is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig. Input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/pig-apache/input. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17).

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a job flow using Pig.


To create a job flow using Pig

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, enter the following:
a. Enter a name in the Job Flow Name field. We recommend you use a descriptive name. It does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select Pig Program in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, indicate whether you want to run Pig from a script, or interactively from the master node. If you are running it from a script, enter values in the boxes using the following field descriptions as a guide. Click Continue.

Script Location*: Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.
Input Location: Optionally, specify the URI where your input files reside in Amazon S3. The value must be in the form BucketName/path/. If specified, this will be passed to the Pig script as a parameter named INPUT.
Output Location: Optionally, specify the URI where you want the output stored in Amazon S3. The value must be in the form s3://BucketName/path. If specified, this will be passed to the Pig script as a parameter named OUTPUT.
Extra Args: Optionally, enter a list of arguments (space-separated strings) to pass to Pig.

* Required parameter


5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following field descriptions as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.
Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.
Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).


6. In the ADVANCED OPTIONS page, set additional configuration options, using the following field descriptions as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.
Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).
Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important: You can enable debugging for a job flow only when you initially create the job flow.
Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.
Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).
Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).

7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).


8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow, your request is processed; when it succeeds, a message appears.


9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a job flow using Pig.

To create a job flow using Pig

• Use the information below to create your job flow.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --name "Test Pig" \
--pig-script s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig \
--ami-version 2.0 \
--args "-p,INPUT=s3n://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3n://myawsbucket/pig-apache/output"

If you are using Microsoft Windows, enter the following:

c:\ ruby elastic-mapreduce --create --name "Test Pig" --pig-script s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig --ami-version 2.0 --args "-p,INPUT=s3n://elasticmapreduce/samples/pig-apache/input,-p,OUTPUT=s3n://myawsbucket/pig-apache/output"

The output looks similar to the following.

Created job flow JobFlowID


By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.

API

This section describes the Amazon EMR API Query request parameters you need to create a job flow using Pig. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

The Args argument contains location information for your input data and output data, as shown in the following example.

"Name": "Pig job flow",
"HadoopJarStep": {
  "Jar": "s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
  "Args": [
    "s3://us-west-1.elasticmapreduce/libs/pig/pig-script",
    "--base-path", "s3://us-west-1.elasticmapreduce/libs/pig/",
    "--run-pig-script",
    "--ami-version 2.0",
    "--args",
    "-f", "s3n://elasticmapreduce/samples/pig-apache/do-reports2.pig",
    "-p", "INPUT=s3n://elasticmapreduce/samples/pig-apache/input",
    "-p", "OUTPUT=s3n://myawsbucket/pig-apache/output"
  ]
}

Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/.

How to Create a Job Flow Using a Custom JAR

This section covers the basics of creating a job flow using a custom JAR file in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a job flow using a custom JAR with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow (p. 17).

A job flow using a custom JAR file enables you to write a program to process your data using the Java programming language. The example that follows is based on the Amazon EMR sample: CloudBurst.

In this example, the JAR file is located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar. All of the data processing instructions are located in the JAR file, and the program is referenced by the main class org.myorg.WordCount. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/cloudburst/input. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17).

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a job flow using a custom JAR file.

To create a job flow using a custom JAR file

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, enter the following in the Define Job Flow section of the Create a New Job Flow dialog box:
a. Enter a name in the Job Flow Name field. We recommend you use a descriptive name. It does not need to be unique.
b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select Custom JAR in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following field descriptions as a guide, and then click Continue.

JAR Location*: Specify the URI where your JAR file resides in Amazon S3. The value must be in the form BucketName/path/JarFileName.
JAR Arguments*: Enter a list of arguments (space-separated strings) to pass to the JAR file.

* Required parameter


5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following field descriptions as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.
Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.
Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).


6. In the ADVANCED OPTIONS page, set additional configuration options, using the following field descriptions as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.
Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).
Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important: You can enable debugging for a job flow only when you initially create the job flow.
Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.
Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).
Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).

7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).

8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow your request is processed; when it succeeds, a message appears.

9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This section explains how to run a job flow that uses a custom JAR file.

To create a job flow using a Custom JAR

• Use the information in the following table to create your job flow:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --name "Test custom JAR" \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg s3n://myawsbucket/cloud \
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
--arg 24 --arg 128 --arg 16

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3n://myawsbucket/cloud --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

Note
The individual --arg values above could also be represented as --args followed by a comma-separated list of values, rather than as the separate --arg parameters shown in the preceding examples.


The output looks similar to the following.

Created job flow JobFlowID

By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
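For example, a minimal sketch of a larger cluster, reusing the sample cloudburst JAR from the preceding example (the job flow name and the instance settings are illustrative only):

$ ./elastic-mapreduce --create --name "Test custom JAR on multiple nodes" \
--num-instances 5 \
--instance-type m1.large \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg s3n://myawsbucket/cloud \
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
--arg 24 --arg 128 --arg 16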

API

This section describes the Amazon EMR API Query request parameters you need to create a job flow using a custom JAR file. For an explanation of the parameters unique to RunJobFlow, see RunJobFlow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

To start a job flow specifying a JAR file, send a RunJobFlow request similar to the following.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=Test custom JAR
&LogUri=s3://myawsbucket/subdir
&Instances.MasterInstanceType=m1.small
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A48%3A32.000Z
&Signature=calculated value

How to Create a Cascading Job Flow

Cascading is an open-source Java library that provides a Query API, a Query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications. A Cascading job flow is treated as a custom JAR in the Amazon EMR console.

This section covers the basics of creating a Cascading job flow using a custom JAR file in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a job flow with either the Amazon EMR console, the CLI, or the Query API. Before you create your job flow you'll need to create objects and permissions; for more information see Setting Up Your Environment to Run a Job Flow (p. 17).

The examples that follow are based on the Amazon EMR sample: LogAnalyzer for Amazon CloudFront. LogAnalyzer is implemented using Cascading. This sample generates usage reports containing total traffic volume, object popularity, a breakdown of traffic by client IP address, and edge location. Reports are formatted as tab-delimited text files, and saved to the Amazon S3 bucket that you specify.

In this example, the Java JAR is located in an Amazon S3 bucket at elasticmapreduce/samples/cloudfront/logprocessor.jar. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/cloudfront/input. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow (p. 17).

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a job flow using a custom JAR file.

To create a job flow using Cascading

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click Create New Job Flow.

3. In the DEFINE JOB FLOW page, enter the following:

a. Enter a name in the Job Flow Name field. We recommend that you use a descriptive name. It does not need to be unique.

b. Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select Custom JAR in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue.

JAR Location*: Specify the URI where your JAR file resides in Amazon S3. The value must be in the form BucketName/path/JarName.

JAR Arguments*: Enter a list of arguments (space-separated strings) to pass to the JAR file.

* Required parameter


5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.

Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.

Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).

* Required parameter


6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.

Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).

Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.

Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183).
Important
You can enable debugging for a job flow only when you initially create the job flow.

Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.

Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).

Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).

7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).

8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow your request is processed; when it succeeds, a message appears.

9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

CLI

This example describes how to use the CLI to create a job flow using Cascading.

To create a job flow using Cascading

• Use the information in the following table to create your job flow:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --name "Test Cascading" \
--jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar \
--args "-input,s3n://elasticmapreduce/samples/cloudfront/input,\
-start,any,-end,2010-12-27-02 300,-output,\
s3n://myawsbucket/cloudfront/output/2010-12-27-02,\
-overallVolumeReport,-objectPopularityReport,-clientIPReport,\
-edgeLocationReport"

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --create --name "Test Cascading" --jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3n://elasticmapreduce/samples/cloudfront/input,-start,any,-end,2010-12-27-02,300,-output,s3n://myawsbucket/cloudfront/output/2010-12-27-02,-overallVolumeReport,-objectPopularityReport,-clientIPReport,-edgeLocationReport"

The output looks similar to the following.


Created job flow JobFlowID

By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
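For example, a minimal sketch that runs the Cascading example above on a larger cluster simply adds the two parameters to the same create command (the instance count and type are illustrative only):

$ ./elastic-mapreduce --create --name "Test Cascading" \
--num-instances 4 \
--instance-type m1.large \
--jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar \
--args "-input,s3n://elasticmapreduce/samples/cloudfront/input,\
-start,any,-end,2010-12-27-02 300,-output,\
s3n://myawsbucket/cloudfront/output/2010-12-27-02,\
-overallVolumeReport,-objectPopularityReport,-clientIPReport,\
-edgeLocationReport"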

API

This section describes the Amazon EMR API Query request parameters you need to create a Cascading job flow. For an explanation of the parameters unique to RunJobFlow, go to RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

The Args argument contains location information for your input data, output data, and args, as shown in the following example.

"Name": "Cascading job flow",
"HadoopJarStep": {
  "Jar": "s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar",
  "Args": [
    "-input", "s3n://elasticmapreduce/samples/cloudfront/input",
    "-start", "any",
    "-end", "2010-12-27-02 300",
    "-output", "s3n://myawsbucket/cloudfront/output/2010-12-27-02",
    "-overallVolumeReport",
    "-objectPopularityReport",
    "-clientIPReport",
    "-edgeLocationReport"
  ]
}

Note
All paths are prefixed with their location. The prefix "s3://" refers to the s3n file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/.

Launch an HBase Cluster on Amazon EMR

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud: easy scaling, low cost, pay only for what you use, and ease of use. The EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase (p. 155).

The following procedure shows how to launch an HBase job flow with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase (p. 174).

Note
HBase configuration can only be done at launch time.


For production environments, we recommend that you launch HBase on one job flow and launch any analysis tools, such as Hive, on a separate job flow. This ensures that HBase has ready access to the CPU and memory resources it requires.

To launch an HBase cluster using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click Create New Job Flow.

3. In the DEFINE JOB FLOW page, enter the following:

a. Enter a name in the Job Flow Name field. We recommend that you use a descriptive name. It does not need to be unique.

b. Select a version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select HBase in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, indicate whether you want to preload the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase (p. 165). When you are finished making selections, click Continue.

Restore from Backup: Specify whether to preload the HBase cluster with data stored in Amazon S3.

Backup Location*: Specify the URI where the backup to restore from resides in Amazon S3.

Backup Version: Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.

Schedule Regular Backups: Specify whether to schedule automatic incremental backups. The first backup will be a full backup to create a baseline for future incremental backups.

Consistent Backup*: Specify whether the backups should be consistent. A consistent backup is one that pauses write operations during the initial backup stage (synchronization across nodes). Any write operations thus paused are placed in a queue and resume when synchronization completes.

Backup Frequency*: The number of Days/Hours/Minutes between scheduled backups.

Backup Location*: The Amazon S3 URI where backups will be stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.

Backup Start Time*: Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8 PM UTC.

Install Additional Packages: Optionally, add Hive or Pig to the HBase cluster. Because of performance considerations, best practice is to run HBase on one cluster and Hive or Pig on a different cluster. For testing purposes, however, you may wish to run Hive or Pig on the same cluster as HBase.

* Required parameter

5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.

Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.

Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).

* Required parameter

6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.

Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.

Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).

Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.

Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183).
Important
You can enable debugging for a job flow only when you initially create the job flow.

Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.

Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).

Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).


7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).

8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow your request is processed; when it succeeds, a message appears.

9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

To launch an HBase cluster using the CLI

• Specify --hbase when you launch a job flow using the CLI.

The following example shows how to launch a job flow running HBase from the CLI. We recommend that you run at least two instances in the HBase job flow. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.


The CLI implicitly launches the HBase job flow with keep alive and termination protection set.

elastic-mapreduce --create --hbase --name "$USER HBase Cluster" \
--num-instances 2 \
--instance-type cc1.4xlarge

To launch an HBase cluster using the API

• You need to run the hbase-setup bootstrap action when you launch HBase using the API in order to install and configure HBase on the cluster. You also need to add a step to start the HBase master. These are shown in the following example. The region, us-east-1, would be replaced by the region in which to launch the cluster. For a list of regions supported by Amazon EMR, see Choose a Region (p. 17).

https://us-east-1.elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=HBase Cluster
&LogUri=s3://myawsbucket/subdir
&Instances.MasterInstanceType=m1.xlarge
&Instances.SlaveInstanceType=m1.xlarge
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&BootstrapActions.member.1.Name=InstallHBase
&BootstrapActions.member.1.ScriptBootstrapAction.Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/setup-hbase
&Steps.member.1.Name=StartHBase
&Steps.member.1.ActionOnFailure=CANCEL_AND_WAIT
&Steps.member.1.HadoopJarStep.Jar=/home/hadoop/lib/hbase-0.92.0.jar
&Steps.member.1.HadoopJarStep.Args.member.1=emr.hbase.backup.Main
&Steps.member.1.HadoopJarStep.Args.member.2=--start-master
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A48%3A32.000Z
&Signature=calculated value

View Job Flow Details

This section describes the methods used to view the details of Amazon Elastic MapReduce (Amazon EMR) job flows. You can view job flows in any state.


Example Using the Amazon EMR console

After you start a job flow, you can monitor its status and retrieve extended information about its execution. This section explains how to view the details of a job flow using the Amazon EMR console.

To view the details of a job flow

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Select the job flow you want to view.

The Job Flow pane appears, providing detailed information about the selected job flow.

Example using the CLI

To view job flow details from the CLI, use the --list parameter to list job flows. This section presents some of these variations.

To list job flows created in the last two days

• Use the --list parameter with no additional arguments to display job flows created during the last two days as follows:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --list

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --list

The response is similar to the following:


j-1YE2DN7RXJBWU     FAILED         Example Job Flow
   CANCELLED        Custom Jar
j-3GJ4FRRNKGY97     COMPLETED      ec2-67-202-3-73.compute-1.amazonaws.com     Example job flow
j-5XXFIQS8PFNW      COMPLETED      ec2-67-202-51-30.compute-1.amazonaws.com    demo 3/24 s1
   COMPLETED        Custom Jar

The example response shows that three job flows were created in the last two days. The indented lines are the steps of the job flow. The information for a job flow is in the following order: the job flow ID, the job flow state, the DNS name of the master node, and the job flow name. The information for a job flow step is in the following order: step state, and step name.

If no job flows were created in the previous two days, this command produces no output.

To list active job flows

• Use the --list and --active parameters as follows:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --list --active

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --list --active

The response lists job flows that are in the state of STARTING, RUNNING, or SHUTTING_DOWN.

To list only running or terminated job flows

• Use the --state parameter as follows:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --list --state RUNNING --state TERMINATED

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --list --state RUNNING --state TERMINATED

The response lists job flows that are running or terminated.

You can get information about a job flow using the --describe parameter and specifying a job flow ID.


To retrieve information about a job flow

• Use the --describe parameter with a valid job flow ID.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --describe --jobflow JobFlowID

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --describe --jobflow JobFlowID

The response looks similar to the following:

{ "JobFlows": [ { "Name": "Development Job Flow (requires manual termination)", "LogUri": "s3n:\/\/AKIAIOSFODNN7EXAMPLE\/FileName\/", "ExecutionStatusDetail": { "StartDateTime": null, "EndDateTime": null, "LastStateChangeReason": "Starting instances", "CreationDateTime": DateTimeStamp, "State": "STARTING", "ReadyDateTime": null }, "Steps": [], "Instances": { "MasterInstanceId": null, "Ec2KeyName": "KeyName", "NormalizedInstanceHours": 0, "InstanceCount": 5, "Placement": { "AvailabilityZone": "us-east-1a" }, "SlaveInstanceType": "m1.small", "HadoopVersion": "0.20", "MasterPublicDnsName": null, "KeepJobFlowAliveWhenNoSteps": true, "InstanceGroups": [ { "StartDateTime": null, "SpotPrice": null, "Name": "Master Instance Group", "InstanceRole": "MASTER", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING",

API Version 2009-11-3075

Amazon Elastic MapReduce Developer GuideView Job Flow Details

Page 82: Amazon elastic map reduce

"Market": "ON_DEMAND", "ReadyDateTime": null, "InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 1 }, { "StartDateTime": null, "SpotPrice": null, "Name": "Task Instance Group", "InstanceRole": "TASK", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING", "Market": "ON_DEMAND", "ReadyDateTime": null, "InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 2 }, { "StartDateTime": null, "SpotPrice": null, "Name": "Core Instance Group", "InstanceRole": "CORE", "EndDateTime": null, "LastStateChangeReason": "", "CreationDateTime": DateTimeStamp, "LaunchGroup": null, "InstanceGroupId": "InstanceGroupID", "State": "PROVISIONING", "Market": "ON_DEMAND", "ReadyDateTime": null, "InstanceType": "m1.small", "InstanceRunningCount": 0, "InstanceRequestCount": 2 } ], "MasterInstanceType": "m1.small" }, "BootstrapActions": [], "JobFlowId": "JobFlowID" } ]}

Example using the API

The DescribeJobFlows operation in the Amazon EMR API returns details about specified job flows. You specify a job flow by the job flow ID, creation date, or state. Amazon EMR returns descriptions of job flows that are up to two months old. Specifying an older date returns an error. If you do not specify a CreatedAfter value, Amazon EMR uses the default of two months.


To return information about a job flow identified by its job flow ID

• Issue a request similar to the following, replacing JobFlowID, AccessKeyID, and CalculatedValue with the values required for your job flow.

https://elasticmapreduce.amazonaws.com?JobFlowIds.member.1=JobFlowID
&Operation=DescribeJobFlows
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A49%3A59.000Z
&Signature=CalculatedValue

For more information about the input parameters unique to DescribeJobFlows, go to DescribeJobFlows.

To return information about job flows in a specific state

• Issue a request similar to the following, replacing COMPLETED, AccessKeyID, and CalculatedValue with the values required for your job flows.

https://elasticmapreduce.amazonaws.com?JobFlowStates=COMPLETED
&Operation=DescribeJobFlows
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A49%3A59.000Z
&Signature=CalculatedValue

For more information about the input parameters unique to DescribeJobFlows, go to DescribeJobFlows.

Terminate a Job Flow

This section describes the methods to terminate a job flow. You can terminate job flows in the STARTING, RUNNING, or WAITING states. A job flow in the WAITING state must be terminated or it runs indefinitely, generating charges to your account. You can terminate a job flow that fails to leave the STARTING state or is unable to complete a step.

If you are terminating a job flow that has termination protection set on it, you must first disable termination protection before you can terminate the job flow. For more information, see Terminating a Protected Job Flow (p. 140).

Depending on the configuration of the job flow, it may take from 5 to 20 minutes for the job flow to completely terminate and release allocated resources, such as Amazon EC2 instances.

Amazon EMR Console

You can terminate a job flow using the Amazon EMR console.


To terminate a job flow

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Select the job flow you want to terminate.

3. Click Terminate.
The Terminate Job Flow(s) confirmation dialog box appears.

4. Click Yes, Terminate.
Amazon Elastic MapReduce (Amazon EMR) terminates the instances in the cluster and stops saving log data.

Using the CLI

To terminate a job flow, use the --terminate parameter and specify the job flow to terminate. The example that follows uses job flow j-C019299B1X.

To terminate a job flow

• If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --terminate j-C019299B1X

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --terminate j-C019299B1X

The response is similar to the following:

Terminated job flow j-C019299B1X

API

The TerminateJobFlows operation ends step processing, uploads any log data from Amazon EC2 to Amazon S3 (if configured), and terminates the Hadoop cluster. A job flow also terminates automatically if you set KeepJobFlowAliveWhenNoSteps to False in a RunJobFlow request.

You can use this action to terminate either a single job flow or a list of job flows by their job flow IDs.

The following request shows how to terminate a job flow using a <JobFlowID>.


To terminate a job flow

• Issue a request similar to the following. In this example, three job flows are terminated.

https://elasticmapreduce.amazonaws.com?JobFlowIds.member.1=j-3UN6SOUERO2AG
&JobFlowIds.member.2=j-3UN6WX5RR438r7
&JobFlowIds.member.3=j-3UN6DUER23849
&Operation=TerminateJobFlows
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A53%3A50.000Z
&Signature=calculated value

The response contains the request ID.

For more information about the input parameters unique to TerminateJobFlows, go to TerminateJobFlows. For more information about the generic parameters in the request, see Common Request Parameters.

Customize a Job Flow

Topics

• Add Steps to a Job Flow (p. 79)

• Bootstrap Actions (p. 84)

• Resizing Running Job Flows (p. 96)

• Calling Additional Files and Libraries (p. 104)

This section describes the methods available for customizing an Amazon Elastic MapReduce (Amazon EMR) job flow.

Add Steps to a Job Flow

Topics

• Wait for Steps to Complete (p. 81)

• Add More than 256 Steps to a Job Flow (p. 82)

This section describes the methods for adding steps to a job flow.

You can add steps to a running job flow only if you set the KeepJobFlowAliveWhenNoSteps parameter to True when you create the job flow. This value keeps the Hadoop cluster engaged even after the completion of a job flow.

The Amazon Elastic MapReduce (Amazon EMR) console does not support adding steps to a job flow.

Example using the CLI

The following example creates a simple job flow and then adds a step to the job flow.

To add a step to a job flow

1. Create a job flow:


If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --alive --stream

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --create --alive --stream

The --stream parameter adds a streaming step using default parameters. The default parameters are the word count example that is available in the Amazon EMR console. The --alive parameter keeps the job flow running even when all steps have been completed. This job flow will need to be explicitly terminated.

The output looks similar to the following.

Created job flow JobFlowID

2. Add a step:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce -j JobFlowID \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--main-class org.myorg.WordCount \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg hdfs:///cloudburst/output/1 \
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
--arg 24 --arg 128 --arg 16

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce -j JobFlowID --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --main-class org.myorg.WordCount --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg hdfs:///cloudburst/output/1 --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16

This command runs an example job flow step that downloads and runs the JAR file. The arguments are passed to the main function in the JAR file.

If your JAR file has a manifest, you do not need to specify the JAR file's main class using --main-class, as shown in the preceding example.


Example using the API

The steps parameter defines the location and input parameters for the Hadoop JAR steps that perform the processing on the input data. Each step is identified by a member number.

Typically, you specify all job flow steps in a RunJobFlow request. The value of AddJobFlowSteps is that you can add steps to a job flow while it is already loaded onto the Amazon EC2 instances. You typically add steps to modify the data processing or to aid in debugging a job flow when you are working interactively with the job flow, that is, you are adding steps to the job flow while the job flow execution is paused.

The name parameter helps you distinguish step results, so it is best to make each name unique. Amazon EMR does not check for the uniqueness of step names.

The remainder of the steps parameter specifies the JAR file and the input parameters used to process the data.

When you debug a job flow, you must set the RunJobFlow parameter KeepJobFlowAliveWhenNoSteps to True and ActionOnFailure to CANCEL_AND_WAIT.

Note
The maximum number of steps allowed in a job flow is 256. For ways to overcome this limitation, go to the section called "Add More than 256 Steps to a Job Flow" (p. 82).

To add steps to a job flow

• Send a request similar to the following.

https://elasticmapreduce.amazonaws.com?JobFlowId=JobFlowID
&Steps.member.1.Name=MyStep2
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=s3://myawsbucket/MySecondJar
&Steps.member.1.HadoopJarStep.MainClass=MainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Operation=AddJobFlowSteps
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A51%3A51.000Z
&Signature=calculated value

For more information about the parameters unique to AddJobFlowSteps, see AddJobFlowSteps. For more information about the generic parameters in the request, see Common Request Parameters.

The response contains the request ID.

Wait for Steps to Complete

When you submit steps to a job flow using the command line interface (CLI), you can specify that the CLI should wait until the job flow has completed all pending steps before accepting additional commands. This can be useful, for example, if you are using a step to copy data from Amazon S3 into HDFS and need to be sure that the copy operation is complete before you run the next step in the job flow. You do this by specifying the --wait-for-steps parameter after you submit the copy step.


The --wait-for-steps parameter does not ensure that the step completes successfully, just that it has finished running. If, as in the earlier example, you need to ensure the step was successful before submitting the next step, check the job flow status. If the step failed, the job flow will be in the FAILED status.
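For example, one way to perform that check is a minimal sketch using the --describe parameter covered in View Job Flow Details, where JobFlowID stands for the identifier returned when you created the job flow:

$ ./elastic-mapreduce --describe --jobflow JobFlowID

Inspect the "State" value in the JSON response; a value of FAILED indicates that you should not submit the next step.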

Although you can add the --wait-for-steps parameter in the same CLI command that adds a step to the job flow, it is best to add it in a separate CLI command. This ensures that the --wait-for-steps argument is parsed and applied after the step is created. This is illustrated in the example that follows.

To wait until a step completes

• Add the --wait-for-steps parameter to the job flow. This is illustrated in the following example, where JobFlowID is the job flow identifier that Amazon EMR returned when you created the job flow. The JAR, main class, and arguments specified in the first CLI command are from the Word Count sample application; this command adds a step to the job flow. The second CLI command causes the job flow to wait until all of the currently pending steps have completed before accepting additional commands.

$ ./elastic-mapreduce -j JobFlowID \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--main-class org.myorg.WordCount \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg hdfs:///cloudburst/output/1 \
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
--arg 24 --arg 128 --arg 16

$ ./elastic-mapreduce -j JobFlowID \
--wait-for-steps

Add More than 256 Steps to a Job Flow

Amazon Elastic MapReduce (Amazon EMR) currently limits the number of steps in a job flow to 256. If your job flow is long-running (such as a Hive data warehouse) or complex, you may require more than 256 steps to process your data.

You can employ several methods to get around this limitation:

1. Have each step submit several jobs to Hadoop. This does not allow you unlimited steps, but it is the easiest solution if you need a fixed number of steps greater than 256.

2. Write a workflow program that runs in a step on a long-running job flow and submits jobs to Hadoop. You could have the workflow program either:

• Listen to an Amazon SQS queue to receive information about new steps to run.

• Check an Amazon S3 bucket on a regular schedule for files containing information about the new steps to run.

3. Write a workflow program that runs on an Amazon EC2 instance outside of Amazon EMR and submits jobs to your job flows using SSH.

4. Manually SSH into the master node and submit job flows.

You can add more steps to a job flow by using the SSH shell to connect to the master node and submitting queries directly to the software running on the master node, such as Hive and Hadoop.


You can SSH directly into the master node using a conventional SSH connection, as outlined in Monitor Hadoop on the Master Node (p. 199). Or you can use the --ssh command line argument to pass queries in and save yourself the process of establishing a new SSH connection.

CLI

To manually submit steps to Hadoop on the master node

• From a terminal or command-line window, call the CLI client, specifying the --ssh parameter, and set its value to the command you want to run on the master node. The CLI uses its connection to the master node to run the command.

elastic-mapreduce --jobflow JobFlowID --scp myjar.jar \
--ssh "hadoop jar myjar.jar"

The preceding example uses the --scp parameter to copy the JAR file myjar.jar from your local directory to the master node of job flow JobFlowID. The example uses the --ssh parameter to command the copy of Hadoop running on the master node to run myjar.jar.

CLI

To manually submit queries to Hive on the master node

1. If Hive is not already installed, use the following command to install it.

elastic-mapreduce --jobflow JobFlowID --hive-interactive

2. Create a Hive script file containing the query or command you wish to run. The following example script creates two tables, aTable and anotherTable, and copies the contents of one table to another, replacing all data.

---- sample Hive script file: my-hive.q ----
create table aTable (aColumn string) ;
create table anotherTable like aTable;
insert overwrite table anotherTable select * from aTable

3. Call the CLI client, specifying the --ssh parameter, and set its value to a Hive script containing the command you want to run on the master node. The CLI uses its connection to the master node and your .pem credentials file to run the command.

elastic-mapreduce --jobflow JobFlowID --scp my-hive.q \
--ssh "hive -f my-hive.q"

The preceding example connects to Hive on the master node of the JobFlowID job flow and runs the query contained in the script file my-hive.q.


To manually submit tasks based on Python files to Hadoop while connected using SSH

• Use the Hadoop streaming jar, as shown in the example below.

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
 -input s3n://elasticmapreduce/samples/wordcount/input \
 -output hdfs:///rubish/1 \
 -mapper s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
 -reducer aggregate

Bootstrap Actions

Topics

• Bootstrap Action Basics (p. 84)

• Using Predefined Bootstrap Actions (p. 85)

• Using Custom Bootstrap Actions (p. 91)

Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute with root privileges if you use sudo.
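As an illustration, a minimal custom bootstrap action script might look like the following sketch. The script contents are hypothetical; you would upload the script to Amazon S3 and reference it with the --bootstrap-action syntax shown later in this section.

#!/bin/bash
# Hypothetical bootstrap action sketch. Bootstrap actions run as the hadoop
# user by default, so sudo is used for the steps that need root.
set -e
sudo mkdir -p /mnt/var/scratch              # root-only change (illustrative)
sudo chown hadoop:hadoop /mnt/var/scratch
# Exiting with a nonzero code would cause Amazon EMR to treat this
# bootstrap action as failed (see the Note that follows).
exit 0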

Note
If the bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the job flow. If just a few instances fail, then an attempt is made to reallocate the failed instances and continue. Refer to the job flow lastStateChangeReason error code to identify failures caused by a bootstrap action.

All three Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per job flow by providing multiple --bootstrap-action parameters from the CLI or API.

From the CLI, references to bootstrap action scripts are passed to Elastic MapReduce by adding the bootstrap-action parameter after the create parameter. The syntax for a bootstrap-action parameter is as follows:

--bootstrap-action "s3://myawsbucket/FileName" --args "arg1","arg2"

From the Amazon EMR console, you can optionally specify a bootstrap action while creating a job flow on the Bootstrap Actions page in the Job Flow Creation Wizard.

For more information on how to reference a bootstrap action from the API, go to the Amazon Elastic MapReduce API Reference.


Using Predefined Bootstrap Actions

Topics

• Configure Daemons (p. 85)

• Configure Hadoop (p. 85)

• Configure Memory-Intensive Workloads (p. 90)

• Run If (p. 91)

• Shutdown Actions (p. 91)

Amazon provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Elastic MapReduce by using the bootstrap-action parameter.

You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters.
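For example, a minimal sketch that combines two of the predefined bootstrap actions described below (the argument values are illustrative only):

$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.userlog.retain.hours=4"

Here each --args parameter applies to the --bootstrap-action that precedes it; this pairing is an assumption about the CLI syntax based on the examples in this section.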

Configure Daemons

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

The following example sets the heap size to 2048 and configures the Java namenode option.

Example

$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19

Configure Hadoop

This bootstrap action allows you to set cluster-wide Hadoop settings. This script provides two types of command line options:

• Option 1: Enables you to upload an XML file containing configuration settings to Amazon S3. The bootstrap action merges the new configuration settings with the existing Hadoop configuration.

• Option 2: Allows you to specify a Hadoop key-value pair from the command line that overrides the existing Hadoop configuration.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop.

The following example demonstrates how to change the configuration for the maximum number of map tasks in the hadoop-config-file.xml file.


Example

$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"

The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.

Note
The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file, for example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapred.userlog.retain.hours</name><value>4</value></property>
</configuration>

The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

For Hadoop 0.18, the name and location of the configuration file is /conf/hadoop-site.xml. The default hadoop-site.xml properties are as follows.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
  <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
  <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
  <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
  <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
  <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
  <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
  <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
  <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
  <property><name>dfs.namenode.handler.count</name><value>20</value></property>
  <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
  <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
  <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>io.sort.factor</name><value>40</value></property>
  <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property>
  <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
  <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <property><name>tasktracker.http.threads</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
  <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
  <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
  <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>io.file.buffer.size</name><value>65536</value></property>
  <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
  <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
  <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
  <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property>
  <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
  <property><name>io.sort.mb</name><value>150</value></property>
  <property><name>hadoop.job.history.user.location</name><value>none</value></property>
  <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
  <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
</configuration>

In Hadoop 0.20, the configuration file names and locations are core-site.xml, hdfs-site.xml, and mapred-site.xml.

The default core-site.xml properties are as follows.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
  <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
  <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
  <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property>
  <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
  <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
</configuration>

The default hdfs-site.xml properties are listed below.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
  <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
  <property><name>dfs.namenode.handler.count</name><value>20</value></property>
  <property><name>io.file.buffer.size</name><value>65536</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
  <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
  <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
  <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
</configuration>

The default mapred-site.xml properties are listed below.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>
  <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
  <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
  <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
  <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
  <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property>
  <property><name>io.sort.factor</name><value>40</value></property>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>tasktracker.http.threads</name><value>20</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <property><name>hadoop.job.history.user.location</name><value>none</value></property>
  <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
  <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
  <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
  <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
  <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
  <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
  <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property>
  <property><name>io.sort.mb</name><value>150</value></property>
</configuration>

Configure Memory-Intensive Workloads

This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads.

Note
The memory-intensive bootstrap action should be used only with AMI versions 1.0.1 and earlier. Using the memory-intensive bootstrap action with AMI versions 2.0.0 and later may cause your job flow to fail.

The following Hadoop configuration parameters are set:

Parameters modified in hadoop.env.sh

• HADOOP_JOBTRACKER_HEAPSIZE

• HADOOP_NAMENODE_HEAPSIZE

• HADOOP_TASKTRACKER_HEAPSIZE

• HADOOP_DATANODE_HEAPSIZE

Parameters modified in mapred-site.xml

• mapred.child.java.opts

• mapred.tasktracker.map.tasks.maximum

• mapred.tasktracker.reduce.tasks.maximum

The bootstrap script is located at s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive.

The default configurations for cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.

For information about the configuration values for each supported Amazon EC2 instance type, see Hadoop Memory-Intensive Configuration Settings (AMI 1.0) (p. 311).

The following example creates a default job flow with the memory-intensive bootstrap action. The bootstrap action modifies the Hadoop cluster configuration settings to the recommended configuration for an Amazon EC2 m1.small instance.

Example

$ ./elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive


Run If

You can use this predefined bootstrap action to conditionally run a command when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that MapReduce can download and execute.

The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

The following example echoes the string running on master node if the node is a master.

Example

$ ./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
  --args "instance.isMaster=true,echo running on master node"

Note
You must use commas to separate commands that you specify with the --args option.

Shutdown Actions

A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a job flow is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

Note
Shutdown action scripts are not guaranteed to run if the node terminates with an error.
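The following sketch shows one way a bootstrap action script could install a shutdown action. The script name and the message it logs are hypothetical; only the shutdown-actions directory and the 60-second limit come from the description above.

#!/bin/bash
# Hypothetical bootstrap action: installs a shutdown action script.
set -e
DIR=/mnt/var/lib/instance-controller/public/shutdown-actions
mkdir -p "$DIR"

cat > "$DIR/record-shutdown.sh" <<'EOF'
#!/bin/bash
# Runs when the job flow is terminated; must complete within 60 seconds.
echo "job flow terminated at $(date)" >> /mnt/var/log/shutdown-actions.log
EOF

chmod +x "$DIR/record-shutdown.sh"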

Using Custom Bootstrap Actions

Topics

• Running Custom Bootstrap Actions from the CLI (p. 91)

• Running Custom Bootstrap Actions from the Amazon EMR Console (p. 92)

In addition to the predefined bootstrap actions, you can write a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

Running Custom Bootstrap Actions from the CLI

The following example uses a bootstrap action script that downloads and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at: http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

The sample script looks like the following:

#!/bin/bash
set -e
bucket=elasticmapreduce
path=samples/bootstrap-actions/file.tar.gz
wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path
mkdir -p /home/hadoop/contents
tar -C /home/hadoop/contents -xzf file.tar.gz


To create a job flow with a custom bootstrap action

• Create the job flow.

If you are using Linux or UNIX, enter:

    $ ./elastic-mapreduce --create --stream --alive \
      --input s3n://elasticmapreduce/samples/wordcount/input \
      --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
      --output s3n://myawsbucket \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh

If you are using Microsoft Windows, enter:

    ruby elastic-mapreduce --create --stream --alive --input s3n://elasticmapreduce/samples/wordcount/input --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --output s3n://myawsbucket --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

Running Custom Bootstrap Actions from the Amazon EMR Console

The example in the following procedure creates a predefined word count sample job flow with a bootstrap action script that downloads and extracts a compressed tar archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

To create a job flow with a custom bootstrap action

1. Start a new job flow:

a. From the Amazon EMR console select a Region.


b. Click Create a New Job Flow.

The Create a New Job Flow page appears.

2. In the DEFINE JOB FLOW page, enter the following information:

a. Enter a name in the Job Flow Name field. We recommend that you use a descriptive name. It does not need to be unique.

b. Select Run a sample application.

c. Select Word Count (Streaming) from the menu and click Continue.

3. In the SPECIFY PARAMETERS page, replace the myawsbucket text in the Output Location text field with the name of a valid Amazon S3 bucket, and then click Continue.


4. On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.

5. On the ADVANCED OPTIONS page, accept the default parameters and click Continue.

6. On the BOOTSTRAP ACTIONS page, select Configure your Bootstrap Actions.

Enter the following information:

a. Select Custom Action from the Action Type drop-down list box.

b. Enter the following text in the Amazon S3 Location text box:

s3://elasticmapreduce/bootstrap-actions/download.sh

c. Click Continue.


7. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow, your request is processed; when it succeeds, a message appears.

8. Click Close.


The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

While the job flow master node is running, you can connect to the master node and see the log files that the bootstrap action script generated, stored in the /mnt/var/log/bootstrap-actions/1 directory.

Related Topics

• View Bootstrap Action Log Files (p. 197)

Resizing Running Job Flows

Topics

• Parameters for Resizing Job Flows (p. 97)

• Arrested State (p. 100)

• Legacy Job Flows (p. 102)

• Library Files (p. 103)

You can increase or decrease the number of nodes in a running job flow. A job flow contains a single master node. The master node controls any slave nodes that are present. There are two types of slave nodes: core nodes, which hold data to process in the Hadoop Distributed File System (HDFS), and task nodes, which do not contain HDFS. Task nodes also run your Hadoop jobs. After a job flow is running, you can increase, but not decrease, the number of core nodes, and you can either increase or decrease the number of task nodes.

You can modify the size of a running job flow using either the API or the CLI. The AWS Management Console allows you to monitor job flows that you resized, but it does not provide the option to resize job flows.

Nodes within a job flow are managed by instance groups. All job flows require a master instance group containing a single master node. Job flows using slave nodes require a core instance group that contains at least one core node. Additionally, if a job flow has a core instance group, it can also have a task instance group containing one or more task nodes.

When your job flow runs, Hadoop determines the number of mapper and reducer tasks needed to process the data. Larger job flows should have more tasks for better resource use and shorter processing time. Typically, an Amazon Elastic MapReduce (Amazon EMR) job flow remains the same size for its entire run; you set the number of tasks when you create the job flow. When you resize a running job flow, you can vary the processing during the job flow execution. Therefore, instead of using a fixed number of tasks, you can vary the number of tasks during the life of the job flow. There are two configuration options to help set the ideal number of tasks:

• mapred.map.tasksperslot

• mapred.reduce.tasksperslot

You can set both options in the mapred-conf.xml file. When you submit a job flow to the cluster, the job client checks the current total number of map and reduce slots available cluster wide. The job client then uses the following equations to set the number of tasks:

• mapred.map.tasks = mapred.map.tasksperslot * map slots in cluster

• mapred.reduce.tasks = mapred.reduce.tasksperslot * reduce slots in cluster

The job client only reads the tasksperslot parameter if the number of tasks is not configured. You can override the number of tasks at any time, either for all job flows via a bootstrap action or individually per job by adding a step to change the configuration.
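For example, a cluster-wide override could be applied at job flow creation with the Configure Hadoop (p. 85) bootstrap action. The bootstrap action path and its -m (mapred-site key=value) argument in this sketch are assumptions to verify against that section; the values shown are illustrative.

$ ./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.map.tasksperslot=2,-m,mapred.reduce.tasksperslot=1.75"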

Amazon Elastic MapReduce (Amazon EMR) withstands slave node failures and continues job flow execution even if a slave node becomes unavailable. Amazon EMR automatically provisions additional slave nodes to replace those that fail.

You can have a different number of slave nodes for each job flow step. You can also add a step to a running job flow to modify the number of its slave nodes. Because all steps are guaranteed to run sequentially, you can specify the number of running slave nodes for any job flow step.

Parameters for Resizing Job Flows

The Amazon EMR CLI provides parameters so you can control how you resize a running job flow.

Parameters to Increase or Decrease Nodes

You can increase or decrease the number of nodes in a running job flow. The parameters are listed in the following table.

--modify-instance-group INSTANCE_GROUP_ID
    Modify an existing instance group.

--instance-count INSTANCE_COUNT
    Set the count of nodes for an instance group.

    Note
    You are only allowed to increase the number of nodes in a core instance group. You can increase or decrease the number of nodes in a task instance group. Master instance groups cannot be modified.
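For example, the following sketch sets a core or task instance group of a running job flow to 10 nodes. InstanceGroupID is a placeholder for the instance group ID, which you can find with the --describe command shown later in this section.

$ ./elastic-mapreduce --modify-instance-group InstanceGroupID --instance-count 10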

Parameters to Add an Instance Group to a Running Job Flow

You can add an instance group to your running job flow. The parameters are listed in the following table.


--add-instance-group ROLE
    Add an instance group to an existing job flow. The role may be TASK only. Currently, Amazon Elastic MapReduce (Amazon EMR) does not permit adding core or master instance groups to a running job flow.

--instance-count INSTANCE_COUNT
    Set the count of nodes for an instance group.

--instance-type INSTANCE_TYPE
    Set the type of Amazon EC2 instance to create nodes for an instance group.
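For example, the following sketch combines the parameters above to add a task instance group with two m1.small nodes to a running job flow; verify the exact flag combination against the CLI help.

$ ./elastic-mapreduce --jobflow JobFlowID \
  --add-instance-group TASK --instance-type m1.small --instance-count 2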

Parameters to Specify an Instance Group when Creating a Job Flow

You can specify instance groups when you create a job flow. The parameters are listed in the following table.

--instance-group TYPE
    Set the instance group type. A type is MASTER, CORE, or TASK.

--instance-count INSTANCE_COUNT
    Set the count of nodes for an instance group.

--instance-type INSTANCE_TYPE
    Set the type of Amazon EC2 instance for nodes in an instance group.
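As a sketch of how these parameters fit together when creating a job flow (the flag layout is assumed from the table above):

$ ./elastic-mapreduce --create --alive \
  --instance-group MASTER --instance-type m1.small --instance-count 1 \
  --instance-group CORE --instance-type m1.small --instance-count 2 \
  --instance-group TASK --instance-type m1.small --instance-count 2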

The --describe command describes all instance groups and node types. If you run elastic-mapreduce --jobflow JobFlowID --describe, you see a section called InstanceGroups. You can see that your job flow contains a master instance group and, potentially, core and task instance groups.


Example

$ ./elastic-mapreduce --jobflow JobFlowID --describe

{
  "JobFlows": [
    {
      "Name": "Development Job Flow (requires manual termination)",
      "LogUri": "s3n:\/\/myawsbucket\/FileName\/",
      "ExecutionStatusDetail": {
        "StartDateTime": null,
        "EndDateTime": null,
        "LastStateChangeReason": "Starting instances",
        "CreationDateTime": DateTimeStamp,
        "State": "STARTING",
        "ReadyDateTime": null
      },
      "Steps": [],
      "Instances": {
        "MasterInstanceId": null,
        "Ec2KeyName": "KeyName",
        "NormalizedInstanceHours": 0,
        "InstanceCount": 5,
        "Placement": { "AvailabilityZone": "us-east-1a" },
        "SlaveInstanceType": "m1.small",
        "HadoopVersion": "0.20",
        "MasterPublicDnsName": null,
        "KeepJobFlowAliveWhenNoSteps": true,
        "InstanceGroups": [
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Master Instance Group",
            "InstanceRole": "MASTER",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 1
          },
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Task Instance Group",
            "InstanceRole": "TASK",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 2
          },
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Core Instance Group",
            "InstanceRole": "CORE",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 2
          }
        ],
        "MasterInstanceType": "m1.small"
      },
      "BootstrapActions": [],
      "JobFlowId": "JobFlowID"
    }
  ]
}

Arrested State

An instance group goes into arrested state if it encounters too many errors while trying to start the new cluster nodes. For example, if new nodes fail while performing bootstrap actions, the instance group goes into an arrested state, rather than continuously provisioning new nodes. After you resolve the underlying issue, reset the desired number of nodes on the job flow's instance group, and then the instance group resumes allocating nodes.

The command --describe returns all instance groups and node types, and so you can see the state of the instance groups for the job flow. If Amazon Elastic MapReduce (Amazon EMR) detects any kind of fault with an instance group, it changes the group's state to ARRESTED.

Use the --modify-instance-group command to reset a job flow in the ARRESTED state.

Modifying the instance group instructs Amazon EMR to attempt to provision nodes again. No running nodes are restarted or terminated.

To reset a job flow in an arrested state

• Enter the --modify-instance-group command as follows:


If you are using Linux or UNIX, enter the following at the command-line prompt:

    $ ./elastic-mapreduce \
      --modify-instance-group InstanceGroupID \
      --instance-count COUNT

If you are using Microsoft Windows, enter the following at the command-line prompt:

    ruby elastic-mapreduce --modify-instance-group InstanceGroupID --instance-count COUNT

InstanceGroupID is the ID of the arrested instance group and COUNT is the number of nodes you want in the instance group.

Tip
You do not need to change the number of nodes from the original configuration to free a running job flow. Set --instance-count to the same count as the original setting.


Legacy Job Flows

Before October 2010, Amazon EMR did not have the concept of instance groups. Job flows built for Amazon EMR before the option to resize running job flows became available are considered legacy job flows. Previously, the Amazon EMR architecture did not use instance groups to manage nodes, and only one type of slave node existed. Legacy job flows reference slaveInstanceType and other now deprecated fields. Amazon EMR continues to support legacy job flows; you do not need to modify them to run them correctly.

Job Flow Behavior

If you run a legacy job flow and only configure master and slave nodes, you observe a slaveInstanceType and other deprecated fields associated with your job flows.

Mapping Legacy Job Flows to Instance Groups

Before October 2010, all cluster nodes were either master nodes or slave nodes. An Amazon Elastic MapReduce (Amazon EMR) configuration could typically be represented like the following diagram.

Old Amazon EMR Model

1. A legacy job flow launches and a request is sent to Amazon EMR to start the job flow.

2. Amazon EMR creates a Hadoop cluster.

3. The legacy job flow runs on a cluster consisting of a single master node and the specified number of slave nodes.


Job flows created using the older model are fully supported and function as originally designed. The Amazon EMR API and commands map directly to the new model. Master nodes remain master nodes and become part of the master instance group. Slave nodes still run HDFS and become core nodes that join the core instance group.

Note
No task instance group or task nodes are created as part of a legacy job flow; however, you can add them to a running job flow at any time.

The following diagram illustrates how a legacy job flow now maps to master and core instance groups.

Old Amazon EMR Model Remapped to Current Architecture

1. A request is sent to Amazon EMR to start a job flow.

2. Amazon EMR creates a Hadoop cluster with a master instance group and core instance group.

3. The master node is added to the master instance group.

4. The slave nodes are added to the core instance group.

Library Files

Amazon EMR provides a library file containing a JAR file to create a job flow step programmatically instead of directly through the CLI.


The JAR file to programmatically resize a running job flow is available at s3://elasticmapreduce/libs/resize-job-flow/0.1/resize-job-flow.jar and supports the optional arguments described in the following table.

--help
    List all help information.

--modify-instance-group ROLE/InstanceGroupID
    Apply changes to the named instance group, specified by either role or instance group ID. Instance group roles: MASTER, CORE, or TASK.

--set-instance-count COUNT
    Change the number of nodes of the named instance group.

--add-instance-group ROLE
    Add an instance group with the named role. Instance group roles: TASK. Currently, Amazon EMR does not permit adding core or master instance groups to a running job flow.

--instance-count COUNT
    Specify the number of nodes for the named instance group.

--instance-type TYPE
    Specify the type of Amazon EC2 instances used to create nodes in the new instance group.

--no-wait
    The job flow continues in the RUNNING state after the step makes a request to create or resize an instance group.

--on-failure STATE
    Step state if one of the resizing actions fails: FAIL or CONTINUE.

--on-arrested STATE
    Job flow state if an instance group enters the ARRESTED state: FAIL, WAIT, or CONTINUE.

The JAR file is configured to write to stderr. Only error and fatal messages are reported. The JAR file includes the source code.

The job flow step looks similar to:

s3://elasticmapreduce/libs/resize-job-flow/0.1/resize-job-flow.jar \
  --add-instance-group task --instance-type InstanceType --instance-count 10

For more information on how to add a job flow step, refer to Add Steps to a Job Flow (p. 79).

Calling Additional Files and Libraries

Topics

• Using Distributed Cache (p. 104)

• Running a Script in a Job Flow (p. 109)

The following sections describe advanced procedures on how to use additional files and libraries within a job flow.

Using Distributed Cache

Topics


• Supported File Types (p. 105)

• Location of Cached Files (p. 105)

• Access Cached Files From Mapper and Reducer Applications (p. 106)

• Amazon EMR Console (p. 106)

• CLI (p. 107)

• API (p. 108)

Distributed Cache is a Hadoop feature that allows you to transfer files from a distributed file system to the local file system. It can distribute data and text files as well as more complex types such as archives and jars. If your job flow depends on applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. Using Distributed Cache can boost efficiency when a map or a reduce task needs access to common data. A cluster node can read files from its local file system, instead of retrieving the files from other cluster nodes.

You invoke Distributed Cache when you create the job flow. The files are cached just before starting the Hadoop job and remain cached for the duration of the job. You can cache files stored on any Hadoop-compatible file system, for example HDFS or S3 native. The default size of the file cache is 10 GB. To change the size of the cache, reconfigure the Hadoop parameter local.cache.size using the Configure Hadoop (p. 85) bootstrap action.
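For example, the cache size could be raised to roughly 20 GB with the Configure Hadoop bootstrap action. The bootstrap action path and its -c (core-site key=value) argument in this sketch are assumptions to verify against Configure Hadoop (p. 85); the value is in bytes.

$ ./elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-c,local.cache.size=21474836480"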

Supported File Types

Distributed Cache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.

Archives are one or more files packaged using a utility, such as gzip. Distributed Cache passes the compressed files to each slave node and decompresses the archive as part of caching. Distributed Cache supports the following compression formats:

• zip

• tgz

• tar.gz

• tar

• jar

Location of Cached Files

Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.

Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks. A symlink is an alias to a file location, not the actual file location. The value of the Hadoop parameter mapred.local.dir specifies the location of temporary files. Amazon Elastic MapReduce (Amazon EMR) sets this parameter to /mnt/var/lib/hadoop/mapred/. Cache files are located in a subdirectory of the temporary file location at /mnt/var/lib/hadoop/mapred/taskTracker/archive/.

If you cache a single file, Distributed Cache puts the file in the archive directory. If you cache an archive, Distributed Cache decompresses the file and creates a subdirectory in /archive with the same name as the archive file name. The individual files are located in the new subdirectory.
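For example, you could confirm what was cached by listing the archive directory on a slave node; the exact subdirectory layout under archive/ depends on the source location of the cached files.

$ ls -R /mnt/var/lib/hadoop/mapred/taskTracker/archive/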

You can use Distributed Cache only when creating streaming job flows.


Access Cached Files From Mapper and Reducer Applications

To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
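As a sketch, a streaming mapper could read a cached lookup file by the name given after the pound (#) sign, directly from its working directory. The mapper below is hypothetical; the cache file name matches the CLI example later in this section.

#!/bin/bash
# Hypothetical streaming mapper: emits a count for each input line that
# appears in the cached lookup file, which is symlinked into the task's
# current working directory.
while read line; do
  if grep -qF "$line" ./sample_dataset_cached.dat; then
    echo -e "${line}\t1"
  fi
done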

For more information, go to Hadoop Distributed Cache (http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#DistributedCache).

Amazon EMR Console

You can use the Amazon EMR console to create job flows that use Distributed Cache.

To specify Distributed Cache files

1. Launch the Create New Job Flow wizard, specify a streaming job flow, and click Continue. For information on how to launch the Create New Job Flow wizard and specify a streaming job flow, go to How to Create a Streaming Job Flow (p. 24).

The Specify Parameters page opens.

2. In the Extra Args field, include the files and archives to save to the cache. The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

To add an individual file to the Distributed Cache, specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. For example:

    -cacheFile s3n://bucket_name/file_name#cache_file_name

To add an archive file to the Distributed Cache, enter -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. For example:

    -cacheArchive s3n://bucket_name/archive_name#cache_archive_name


3. Proceed with configuring and launching your streaming job flow.

Your job flow copies the files to the cache location before processing any job flow steps.

CLI

You can use the Amazon EMR CLI to create job flows that use Distributed Cache. To add files or archives to the Distributed Cache using the CLI, you specify the options --cache or --cache-archive on the CLI command line.

To specify Distributed Cache files

• Create a streaming job flow and add the following parameters. For information on how to create a streaming job flow using the CLI, go to How to Create a Streaming Job Flow (p. 24).

The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

To add an individual file to the Distributed Cache, specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

To add an archive file to the Distributed Cache, enter --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

The output looks similar to the following.

Created jobflow JobFlowID

Your job flow copies the files to the cache location before processing any job flow steps.


Example 1

The following command shows the creation of a streaming job flow and uses --cache to add one file, sample_dataset_cached.dat, to the cache.

./elastic-mapreduce --create --stream \
  --input s3n://my_bucket/my_input \
  --output s3n://my_bucket/my_output \
  --mapper s3n://my_bucket/my_mapper.py \
  --reducer s3n://my_bucket/my_reducer.py \
  --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat

Example 2

The following command shows the creation of a streaming job flow and uses --cache-archive to add an archive of files to the cache.

./elastic-mapreduce --create --stream \
  --input s3n://my_bucket/my_input \
  --output s3n://my_bucket/my_output \
  --mapper s3n://my_bucket/my_mapper.py \
  --reducer s3n://my_bucket/my_reducer.py \
  --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached

API

This section describes the Amazon EMR API Query request parameters needed to use Distributed Cache.

To specify Distributed Cache files

• Create a streaming job flow and add the following parameters. For information on how to create a streaming job flow using the CLI, go to How to Create a Streaming Job Flow (p. 24).

The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.

To add an individual file to the Distributed Cache, specify -cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache.

To add an archive file to the Distributed Cache, enter -cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache.

The following JSON example describes a simple streaming job flow that uses the Distributed Cache to store the file sample_data.dat.

[
  {
    "Name": "streaming job flow referencing distributed cache",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://myawsbucket",
        "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "-reducer", "aggregate",
        "-cache", "s3n://my_bucket/sample_data.dat#sample_data_cached.dat"
      ]
    }
  }
]

All paths are prefixed with their location. “s3://” refers to the Amazon S3 file system. “s3n://” refers to the Amazon S3 native file system. If you use HDFS, prepend the path with hdfs:///. Make sure to use three slashes (///), as in hdfs:///home/hadoop/sampleInput2/.

Running a Script in a Job Flow

Amazon Elastic MapReduce (Amazon EMR) enables you to run a script at any time during step processing in your job flow. You can specify a step that runs a script when you create your job flow, or you can add such a step if your job flow is in the WAITING state. For more information about adding steps, go to Add Steps to a Job Flow (p. 79). For more information on running an interactive job flow, go to Interactive and Batch Modes (p. 355).

If you want to run a script before step processing begins, use a bootstrap action. For more information on bootstrap actions, go to Bootstrap Actions (p. 84).

If you want to run a script immediately before job flow shutdown, use a shutdown action. For more information on shutdown actions, go to Shutdown Actions (p. 91).

You can only run multi-step job flows from the CLI and the API. The Amazon EMR console does not support multiple steps.

CLI

This section describes how to add a step to run a script. The script-runner.jar JAR takes as arguments the path to a script and any additional arguments for the script. The JAR file runs the script with the passed arguments. Script-runner.jar is located at s3://elasticmapreduce/libs/script-runner/script-runner.jar.

The job flow containing a step that runs a script looks similar to the following:

./elastic-mapreduce --create --alive --name "My Development Jobflow" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --args "s3://myawsbucket/script-path/my_script.sh"

This job flow runs the script my_script.sh on the master node when the step is processed.


API

This section describes the Amazon EMR API Query request needed to add a step to run a script. The response includes a <JobFlowID>.

The Amazon EMR JSON sample below contains a step that specifies the JAR s3://elasticmapreduce/libs/script-runner/script-runner.jar and passes the location and file name of the script.

[
  {
    "Name": "streaming job flow",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://myawsbucket",
        "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "-reducer", "aggregate"
      ]
    }
  },
  {
    "Name": "My Script Step",
    "HadoopJarStep": {
      "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
      "Args": [
        "s3://myawsbucket/script-path/my_script.sh"
      ]
    }
  }
]

This job flow runs the script my_script.sh on the master node when the step is processed.

Connect to the Master Node in an Amazon EMR Job Flow

Topics

• Connect to the Master Node Using SSH (p. 111)

• Web Interfaces Hosted on the Master Node (p. 115)

• Open an SSH Tunnel to the Master Node (p. 116)

• Configure Foxy Proxy to View Websites Hosted on the Master Node (p. 117)

Often when you run an Amazon Elastic MapReduce (Amazon EMR) job flow, all you need to do is launch the analysis and then collect the output from an Amazon S3 bucket. There are other times, however, when you'll want to interact with the master node while the job flow is running. For example, you may want to connect to the master node to run interactive queries, check log files, monitor performance using an application such as Ganglia that runs on the master node, debug a problem with the job flow, and more. The following sections describe techniques you can use to connect to the master node.

In an Amazon EMR job flow, the master node is an EC2 instance that coordinates the EC2 instances that are running as task and core nodes. The master node exposes a public DNS name that you can use to connect to it.

Note
You can connect to the master node only while the job flow is running. Once the job flow terminates, the EC2 instance acting as the master node is terminated and no longer available. You also must specify an Amazon EC2 key pair when you launch the job flow, as you will use the key pair as the credentials for the SSH connection. If you are launching the job flow from the console, the Amazon EC2 key pair is specified on the ADVANCED OPTIONS pane of the Create a New Job Flow wizard.

Connect to the Master Node Using SSH

Secure Shell (SSH) is a network protocol you can use to create a secure connection to a remote computer. Once you've made this connection, it's as if the terminal on your local computer is running on the remote computer. Commands you issue locally will run on the remote computer and the output of those commands from the remote computer will appear in your terminal window.

When you use SSH with Amazon Web Services (AWS), you are connecting to an EC2 instance, which is a virtual server running in the cloud. When working with Amazon EMR, the most common use of SSH is to connect to the EC2 instance that is acting as the master node of the job flow.

Using SSH to connect to the master node gives you the ability to monitor and interact with the job flow. You can issue Linux commands on the master node, run applications such as HBase, Hive, and Pig interactively, browse directories, read log files, and more.
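For example, after connecting you might run commands such as the following. The HDFS listing is illustrative, the bootstrap action log directory is the one described earlier in this guide, and the hive command assumes Hive was installed on the job flow.

hadoop fs -ls /                          # browse the HDFS root
ls /mnt/var/log/bootstrap-actions/1/     # read bootstrap action log files
hive                                     # start an interactive Hive session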

In order to connect to the master node using SSH, you need the public DNS name of the master node. You also must specify an Amazon EC2 key pair when you launch the job flow, as you will use the key pair as the credentials for the SSH connection. If you are launching the job flow from the console, the Amazon EC2 key pair is specified on the ADVANCED OPTIONS pane of the Create a New Job Flow wizard.

To locate the public DNS name of the master node using the Amazon EMR console

• In the Amazon EMR console, select the job from the list of running job flows in the WAITING or RUNNING state. Details about the job flow appear in the lower pane.


The DNS name you use to connect to the instance is listed on the Description tab as Master Public DNS Name.

To locate the public DNS name of the master node using the CLI

• If you have the Amazon EMR CLI installed, you can retrieve the public DNS name of the master by running the following command.

elastic-mapreduce --list

This returns a list of all the currently active job flows in the following format. In the example below, ec2-204-236-242-218.compute-1.amazonaws.com is the public DNS name of the master node for the job flow j-3L7WK3E07HO4H.

j-3L7WK3E07HO4H WAITING ec2-204-236-242-218.compute-1.amazonaws.com My Job Flow

OpenSSH is installed on most Linux, Unix, and Mac OS X operating systems. Windows users can use an application called PuTTY to connect to the master node. Following are platform-specific instructions for opening an SSH connection.

To configure the permissions of the keypair file using Linux/Unix/Mac OS X

• Before you can use the keypair file to create an SSH connection, you must set permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in the user's home directory, the command is:

chmod og-rwx ~/mykeypair.pem


If you do not do this, SSH will return an error saying that your private key file is unprotected and will reject the key. You only need to configure these permissions the first time you use the private key to connect.

To connect to the master node using Linux/Unix/Mac OS X

1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.

2. Check that SSH is installed by running the following command. If SSH is installed, this command returns the SSH version number. If SSH is not installed, you'll need to install the OpenSSH package from a repository.

ssh -v

3. To establish the connection to the master node, enter the following command line, which assumes the PEM file is in the user's home directory. Replace ec2-107-22-74-202.compute-1.amazonaws.com with the Master Public DNS Name of your job flow and replace ~/mykeypair.pem with the location and filename of your PEM file.

ssh hadoop@ec2-107-22-74-202.compute-1.amazonaws.com -i ~/mykeypair.pem

A warning states that the authenticity of the host you are connecting to can't be verified.

4. Type yes to continue.

Note
If you are asked to log in, enter hadoop.

To connect to the master node using the CLI on Linux/Unix/Mac OS X

• If you have the Amazon EMR CLI installed, you can open an SSH connection to the master node by issuing the following command. This is a handy shortcut for frequent CLI users. It requires that your credentials.json file is configured so the "keypair" value is set to the name of the key pair you used to launch the job flow and the "key-pair-file" value is set to the full path to your key pair .pem file, that the permissions on the .pem file are set to og-rwx as shown in To configure the permissions of the keypair file using Linux/Unix/Mac OS X (p. 112), and that you have OpenSSH installed on your machine. In the example below, replace j-3L7WK3E07HO4H with the identifier of the job flow to connect to.

elastic-mapreduce -j j-3L7WK3E07HO4H --ssh

To close an SSH connection using Linux/Unix/Mac OS X

• When you are done working on the master node, you can close the SSH connection using the exit command.

exit


To connect to the master node using PuTTY on Windows

1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

2. Launch PuTTYgen.

3. Click Load.

4. Select the PEM file you created earlier. Note that you may have to change the search parameters from a file of type “PuTTY Private Key Files (*.ppk)” to “All Files (*.*)”.

5. Click Open.

6. Click OK on the PuTTYgen notice telling you the key was successfully imported.

7. Click Save private key to save the key in the PPK format.

8. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.

9. Enter a name for your PuTTY private key, such as mykeypair.ppk.

10. Click Save.

11. Close PuTTYgen. You only need to perform steps 1-9 the first time that you use the private key.

12. Start PuTTY.

13. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. The input looks similar to hadoop@ec2-107-22-74-202.compute-1.amazonaws.com.

14. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling the SSH authentication pane appears.

15. For Private key file for authentication, click Browse and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

16. Click Open.

A PuTTY Security Alert pops up.

17. Click Yes for the PuTTY Security Alert.


Note
If you are asked to log in, enter hadoop.

Web Interfaces Hosted on the Master Node

Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet. There are several ways you can access these web interfaces:

• Use SSH to connect to the master node and use the text-based browser, Lynx, to view the websites from the SSH terminal. The following example shows how to open the Hadoop JobTracker user interface using Lynx. This is the easiest and quickest way to access these web interfaces. The disadvantage is that Lynx is a text-based browser with a limited user interface that cannot display graphics.

lynx http://localhost:9100/

• Create an SSH tunnel to the master node and manually configure your browser to use the SOCKS proxy you've just created for all URLs. This has the advantage of being relatively easy to configure (see your web browser's documentation for details). The disadvantage is you must then manually disable the proxy in your browser to resume normal web browsing. The following screenshot shows the settings you'd use to manually configure Safari to view the web interfaces over a SOCKS proxy.

• Create an SSH tunnel to the master node and use a browser add-on, such as Foxy Proxy, to automatically filter URLs based on text patterns and use the SOCKS proxy you've created only for domains that match the form of an EC2 instance's public DNS name. This requires that you install an add-on and configure the appropriate patterns in it, but once that is done, the add-on automatically handles turning the proxy on and off when you switch between viewing websites hosted on the master node and those on the Internet. The process of configuring Foxy Proxy, an add-on for the Firefox browser, is described in Configure Foxy Proxy to View Websites Hosted on the Master Node (p. 117).

The following table lists web interfaces you can view on the master node. The Hadoop interfaces are available on all job flows. Other web interfaces, such as Ganglia, are only available if additional features have been added to the job flow.

Name of Interface                URI

Hadoop MapReduce job tracker     http://master-public-dns-name:9100/

Hadoop HDFS name node            http://master-public-dns-name:9101/

Hadoop MapReduce task tracker    http://master-public-dns-name:9103/

Ganglia Metrics Reports          http://master-public-dns-name/ganglia/

HBase Interface                  http://master-public-dns-name:60010/master-status

For more information about the Hadoop web interfaces, see View the Hadoop Web Interfaces (p. 200). For more information about the Ganglia web interface, see Monitor Performance with Ganglia (p. 220).

Open an SSH Tunnel to the Master Node

Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet. In order to connect to the local webserver on the master node, you can create an SSH tunnel between your computer and the master node. This is also known as port forwarding and creates a SOCKS proxy server. For more information about the sites you might want to view on the master node, see Web Interfaces Hosted on the Master Node (p. 115).

Before you begin, you'll need the public DNS name of the master node. For information on how to locate this value, see To locate the public DNS name of the master node using the Amazon EMR console (p. 111). You also must specify an Amazon EC2 key pair when you launch the job flow, as you will use the key pair as the credentials for the SSH connection. If you are launching the job flow from the console, the Amazon EC2 key pair is specified on the ADVANCED OPTIONS pane of the Create a New Job Flow wizard.

To create an SSH tunnel to the master node using Linux/Unix/Mac OS X

• Open an SSH tunnel on your local machine using the following command:

ssh -i path-to-keyfile -ND port_number hadoop@master-public-DNS-name

The following shows the command with example values filled in.

ssh -i ~/ec2-keys/myKeyPairName -ND 8157 hadoop@ec2-107-22-74-202.compute-1.amazonaws.com

After you issue this command, the terminal will remain open and not return a command prompt. It is now acting as a SOCKS server.


To create an SSH tunnel to the master node using the CLI on Linux/Unix/Mac OS X

• If you have the Amazon EMR CLI installed, you can open an SSH tunnel to the master node and use it as a SOCKS server by issuing the following command. This is a handy shortcut for frequent CLI users. It requires that your credentials.json file is configured so the "keypair" value is set to the name of the key pair you used to launch the job flow and the "key-pair-file" value is set to the full path to your key pair .pem file, that the permissions on the .pem file are set to og-rwx as shown in To configure the permissions of the keypair file using Linux/Unix/Mac OS X (p. 112), and that you have OpenSSH installed on your machine. In the example below, replace j-3L7WK3E07HO4H with the identifier of the job flow.

elastic-mapreduce -j j-3L7WK3E07HO4H --socks

Note
The --socks feature is available only on CLI version 2012-06-12 and later. To find out what version of the CLI you have, run elastic-mapreduce --version at the command line. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264.

Once you've created an SSH tunnel to the master node, you can browse the websites hosted there using the text-based browser Lynx, or set up proxies in Firefox using the Foxy Proxy add-on. This latter technique gives you full access to the graphical version of the web pages hosted locally on the master node. For more information, see Configure Foxy Proxy to View Websites Hosted on the Master Node (p. 117).

If you are using a browser with a SOCKS proxy configured, as described in Configure Foxy Proxy to View Websites Hosted on the Master Node (p. 117), you can use the browser to access any job flow launched in the same region as the one you used to create the SSH tunnel. This works because all of the master nodes you launch in a region share the same security group and thus are able to access each other.

To close an SSH tunnel using Linux/Unix/Mac OS X

• In the terminal, press Ctrl+C.

Configure Foxy Proxy to View Websites Hosted on the Master Node

Foxy Proxy is an add-on for the Firefox browser which provides a set of proxy management tools. You can configure it to automatically use a proxy server on URLs that match patterns corresponding to the domains used by EC2 instances. If you have created an SSH tunnel to the master node as described in Open an SSH Tunnel to the Master Node (p. 116), you can configure Foxy Proxy to use the SOCKS proxy you've created to connect to EC2 instances. This will enable you to view the web interfaces available on the master node. For more information about the available web interfaces, see Web Interfaces Hosted on the Master Node (p. 115).

Note
The following tutorial uses FoxyProxy Standard version 3.6.2.

To install FoxyProxy

1. Download and install the standard version of FoxyProxy from http://foxyproxy.mozdev.org/downloads.html.

2. Restart Firefox after installing FoxyProxy.


Page 124: Amazon elastic map reduce

To configure FoxyProxy to connect to a SOCKS server

1. On the Firefox Tools menu, click FoxyProxy Standard, and then select Options.

FoxyProxy displays the FoxyProxy Standard window.

2. Click Add New Proxy.

3. Click General.

4. Enter a proxy name and verify that Perform remote DNS lookups on hostnames loading through this proxy is selected.


Page 125: Amazon elastic map reduce

5. Click Proxy Details.

6. Select Manual Proxy Configuration and enter the host name and port number of the host where you ran the ssh command as the Hadoop user in step 1.

a. The SOCKS proxy (or SSH tunnel) is running on your desktop, so enter localhost and port 8157.

b. Select the SOCKS proxy? check box.

c. Select SOCKS v5.


Page 126: Amazon elastic map reduce

7. Click the URL Patterns button. Next you'll add URL patterns that cause URLs of the form *ec2*.amazonaws.com* and *ec2.internal* to use the proxy.

8. Select Add New Pattern.

9. a. Select the Enabled check box.

b. Enter a name in the Pattern Name box.

c. Enter the following URL pattern in the URL pattern box: *ec2*.amazonaws.com*

d. Select the Wildcards option.

e. Select OK.

10. Select Add New Pattern again to add the second pattern.


Page 127: Amazon elastic map reduce

11. a. Select the Enabled check box.

b. Enter a name in the Pattern Name box.

c. Enter the following URL pattern in the URL pattern box: *ec2.internal*

d. Select the Wildcards option.

e. Select OK.

12. On the FoxyProxy Options pane, select the down arrow on the Select Mode drop-down menu, and then select Use proxies based on their predefined patterns and priorities.

13. Select Close.


Page 128: Amazon elastic map reduce

Now that you've configured FoxyProxy when you enter a URL that matches the pattern*ec2*.amazonaws.com* into the Firefox browser, Firefox uses the proxy to connect to the master nodeof the job flow.You can now load the web pages listed in Web Interfaces Hosted on the Master Node (p. 115)in your browser using URLs, such as the following, where the text in red would be replaced by the publicDNS name of the master node of your job flow.

http://ec2-107-22-74-202.compute-1.amazonaws.com:9100

Use Cases

Topics

• Cascading (p. 122)

• Pig (p. 126)

• Streaming (p. 129)

This section describes job flow use cases using Amazon Elastic MapReduce (Amazon EMR). Basic information about creating, managing, and debugging job flows is available at Using Amazon EMR (p. 15).

Cascading

Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications.

Multitool is a Cascading application that provides a simple command line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system.

You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using either the Amazon EMR command line interface or the Amazon EMR console. Amazon EMR supports all Multitool arguments.

The Multitool JAR file is at s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar.

The Multitool source code, along with a number of other tools, is available for download from the project website at http://www.cascading.org/modules.html.

For additional samples and tips for using Multitool, go to Cascading.Multitool - Tips on using the Multitool and Generate usage reports. For more information about Cascading, go to http://www.cascading.org.

To create a Cascading job flow using the CLI

• Create a job flow referencing the Cascading Multitool JAR file and supply the appropriate Multitool arguments as follows:

$ ./elastic-mapreduce --create \
--jar s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar \
--args [args]
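For example, a job flow that reads data from one Amazon S3 location and writes results to another might supply the -input and -output arguments listed in the parameter table later in this section. This is only a sketch with placeholder bucket names; see the Multitool documentation for the complete argument syntax:

$ ./elastic-mapreduce --create \
--jar s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar \
--args -input,s3n://myawsbucket/input,-output,s3n://myawsbucket/multitool-output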


To create a Cascading job flow using the Amazon EMR console

1. Start a new job flow:

a. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

b. Select a Region.

c. Click Create a New Job Flow.

The Create a New Job Flow page appears.

2. In the DEFINE JOB FLOW page, enter the following information:

a. Enter a name in the Job Flow Name field. We recommend that you use a descriptive name. It does not need to be unique.

b. Select Run your own application.

c. Select Custom JAR from the menu and click Continue.


3. In the SPECIFY PARAMETERS page, specify the job flow parameters:

a. Specify the Jar Location for the Multitool JAR file in the Specify Parameters dialog box, for example:

s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar

b. Specify any arguments for the job flow.

All Multitool arguments are supported, including those listed in the following table.

Parameter    Description
-input       Location of input file.
-output      Location of output files.
-start       Use data created after the start time.
-end         Use data created by the end time.

c. Click Continue.


4. On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.

5. On the ADVANCED OPTIONS page, accept the default parameters and click Continue.

6. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).


7. Click Continue to accept the defaults for the remaining wizard steps, and then click Create Job Flow at the end to launch the Cascading job flow.

Pig

Topics

• Creating a Job Flow Using Pig (p. 126)

• Running Pig in Batch Mode (p. 126)

• Call User Defined Functions from Pig (p. 127)

• Additional Pig Functions (p. 129)

• How to Configure the Pig Installation (p. 129)

Amazon Elastic MapReduce (Amazon EMR) enables you to run Pig scripts in two modes:

• Interactive

• Batch

Typically, you use interactive mode to troubleshoot your job flow and batch mode in production. After you revise the Pig Latin script using interactive mode, you should upload it to Amazon S3 and use batch mode to run job flows. For more information about using SSH, go to View Logs Using SSH (p. 197).

In interactive mode, you SSH as the Hadoop user into the master node in the Hadoop cluster and run the Pig Latin script on it so that you can debug it. The interactivity of this mode enables you to revise the Pig Latin script more quickly than you could in batch mode.

In batch mode, you create a Pig script using Pig Latin and then load that script into Amazon S3. When you run a job flow using Pig, the first step is to download that script from Amazon S3 so that it is used in the MapReduce job flow. When you use the Amazon EMR console, this download is done automatically for you. You use batch mode to run job flows in production.

Creating a Job Flow Using Pig

To run Pig in interactive mode, use the --alive option with the --create command so that the job flow remains active until you terminate it.

$ ./elastic-mapreduce --create --alive --name "Testing Pig -- $USER" \
--num-instances 5 --instance-type instanceType \
--pig-interactive

The return is similar to the following:

Created jobflow JobFlowID

You are now running Pig in interactive mode and can execute Pig queries.

Running Pig in Batch Mode

The following process shows how to run Pig in batch mode and assumes that you stored the Pig script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, see the Amazon S3 Getting Started Guide.

To run Pig in batch mode, create a job flow with a step that executes a Pig script stored on Amazon S3.


$ ./elastic-mapreduce --create \
--name "$USER's Pig JobFlow" \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output

The --args option provides arguments to the Pig script. The first --args option specifies the location of the Pig script in Amazon S3. In the second --args option, the -p option provides a way to pass values into the script (INPUT) and to specify where to store results (OUTPUT). Within the Pig script these parameters are available as ${variable}. So, in this example Pig replaces ${INPUT} and ${OUTPUT} with the values passed in. These variables are substituted as a preprocessing step, so they can occur anywhere in a Pig script. The return is similar to the following:

Created jobflow JobFlowID
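For reference, the following is a minimal sketch of what a parameterized script such as myquery.q might contain; the field layout and filter condition are illustrative only:

-- $INPUT and $OUTPUT are replaced with the values passed via -p on the command line
raw = LOAD '$INPUT' USING TextLoader() AS (line:chararray);
errors = FILTER raw BY line MATCHES '.*ERROR.*';
STORE errors INTO '$OUTPUT';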

You might need to add Pig as a new step in an existing job flow. Adding steps can help you test and develop Pig scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start.

To add a Pig script to an existing job flow in batch mode, specify the location in Amazon S3 of a Pig script and associate it with an existing job flow.

$ ./elastic-mapreduce --jobflow JobFlowID \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output

Call User Defined Functions from Pig

Topics

• Call JAR files from Pig (p. 127)

• Call Python/Jython Scripts from Pig (p. 128)

Pig provides the ability to call user defined functions (UDFs) from within Pig scripts. You can do this to implement custom processing to use in your Pig scripts. The languages currently supported are Java, Python/Jython, and JavaScript (JavaScript support is still experimental).

The following sections describe how to register your functions with Pig so you can call them either from the Pig shell or from within Pig scripts. For more information about using UDFs with Pig, go to http://pig.apache.org/docs/r0.9.2/udf.html.

Call JAR files from Pig

You can use custom JAR files with Pig using the REGISTER command in your Pig script. The JAR file can be stored locally or on a remote file system such as Amazon S3. When the Pig script runs, Amazon EMR downloads the JAR file automatically to the master node and then uploads the JAR file to the Hadoop distributed cache. In this way, the JAR file is automatically used as necessary by all instances in the cluster.

To use JAR files with Pig

1. Upload your custom JAR file into Amazon S3.

2. Use the REGISTER command in your Pig script to specify the Amazon S3 location of the custom JAR file.


REGISTER s3://myawsbucket/path/to/my/uploaded.jar;
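After the JAR file is registered, you invoke a UDF it contains by its fully qualified class name. The following sketch assumes a hypothetical UDF class named com.example.UPPER packaged in the uploaded JAR:

REGISTER s3://myawsbucket/path/to/my/uploaded.jar;
-- com.example.UPPER is a hypothetical class; substitute a class from your own JAR
a = LOAD 's3://myawsbucket/input/data.txt' USING TextLoader() AS (line:chararray);
b = FOREACH a GENERATE com.example.UPPER(line);
STORE b INTO 's3://myawsbucket/output/';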

Call Python/Jython Scripts from Pig

You can register Python scripts with Pig and then call functions in those scripts from the Pig shell or in a Pig script. You do this by specifying the location of the script with the register keyword.

Because Pig is written in Java, it uses the Jython script engine to parse Python scripts. For more information about Jython, go to http://www.jython.org/.

To call a Python/Jython script from Pig

1. Write a Python script and upload the script to a location in Amazon S3. This should be a bucket owned by the same account that creates the Pig job flow, or that has permissions set so the account that created the job flow can access it. In this example, the script is uploaded to s3://myawsbucket/pig/python.

2. Start a Pig job flow. If you'll be accessing Pig from the Grunt shell, run an interactive job flow. If you're running Pig commands from a script, start a scripted Pig job flow. In this example, we'll start an interactive job flow. For more information about how to create a Pig job flow, see Creating a Job Flow Using Pig (p. 126).

3. Because we've launched an interactive job flow, we'll now SSH into the master node where we can run the Grunt shell. For more information about how to SSH into the master node, see SSH into the Master Node.

4. Run the Grunt shell for Pig by typing pig at the command line.

pig

5. Register the Jython library and your Python script with Pig using the register keyword at the Grunt command prompt, as shown in the following, where you would specify the location of your script in Amazon S3.

grunt> register 'lib/jython.jar';
grunt> register 's3://myawsbucket/pig/python/myscript.py' using jython as myfunctions;

6. Load the input data. The following example loads input from an Amazon S3 location.

grunt> input = load 's3://myawsbucket/input/data.txt' using TextLoader as (line:chararray);

7. You can now call functions in your script from within Pig by referencing them using myfunctions.


grunt> output = foreach input generate myfunctions.myfunction($0);
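For reference, the following is a minimal sketch of what a Jython UDF script such as myscript.py might contain. The outputSchema decorator is provided by Pig when the script is registered using jython; the function body is illustrative only:

# myscript.py -- a hypothetical Jython UDF script registered as myfunctions
@outputSchema("word:chararray")
def myfunction(value):
    # Return the input field in upper case.
    return value.upper()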

Additional Pig Functions

The Amazon EMR development team has created additional Pig functions that simplify string manipulation and make it easier to format date-time information. These are available at http://aws.amazon.com/code/2730.

How to Configure the Pig Installation

For information about how to configure Pig on Amazon EMR as well as the versions of Pig that you can run on Amazon EMR and their patches, see Pig Configuration (p. 377).

Streaming

Hadoop streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon Elastic MapReduce (Amazon EMR) API or command line just like a standard JAR file.

This section describes how to use Hadoop streaming with Amazon EMR.

Note
Apache Hadoop Streaming is an independent tool. As such, we do not describe all of its functions and parameters. For a complete description of Apache Hadoop Streaming, go to http://hadoop.apache.org/common/docs/r0.20.0/streaming.html.

Using the Hadoop Stream Utility

This section describes how to use Hadoop's streaming utility.

Hadoop Process

1. Write your mapper and reducer executable in the programming language of your choice. Follow the directions in Hadoop's documentation to write your streaming executables. The programs should read their input from standard input and output data through standard output. By default, each line of input/output represents a record and the first tab on each line is used as a separator between the key and value.

2. Test your executables locally and upload them to Amazon S3.

3. Use the Amazon EMR command line interface or Amazon EMR console to run your program.


Example Running a Streaming Job Flow from the Command Line Interface

The following example shows a standard invocation of the hadoop-streaming utility.

$ ./elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as s3n://myawsbucket] \
--reducer aggregate

In this example, both the mapper and the reducer are executables that read the input from an Amazon S3 bucket and write the output of the job flow to the Amazon S3 bucket specified by output. The mapper parameter specifies the Python executable that the streaming utility runs as the mapper.

Each mapper script launches as a separate process in the Hadoop cluster. Each reducer executable turns the output of the mapper executable into the data output by the job flow.

The input, output, mapper, and reducer parameters are required by most Hadoop streaming job flows. The following table describes these and other, optional parameters.

-input
    Description: Location on Amazon S3 of the input data.
    Required: Yes
    Type: String
    Default: None
    Constraint: URI. If no protocol is specified then it uses the cluster's default file system.

-output
    Description: Location on Amazon S3 where Amazon EMR uploads the processed data.
    Required: Yes
    Type: String
    Default: None
    Constraint: URI
    Default: If a location is not specified, Amazon EMR uploads the data to the location specified by input.

-mapper
    Description: Name of the mapper executable.
    Required: Yes
    Type: String
    Default: None

-reducer
    Description: Name of the reducer executable.
    Required: Yes
    Type: String
    Default: None

-cacheFile
    Description: Location on Amazon S3 of the mapper executable.
    Required: No
    Type: String
    Default: None
    Constraints: [URI]#[symlink name to create in working directory]

-cacheArchive
    Description: JAR file to extract into the working directory.
    Required: No
    Type: String
    Default: None
    Constraints: [URI]#[symlink directory name to create in working directory]

-combiner
    Description: Combines results.
    Required: No
    Type: String
    Default: None
    Constraints: Java class name

The following code sample shows the wordSplitter.py executable identified in the previous hadoop command.

#!/usr/bin/python
import sys

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line.rstrip()
            words = line.split()
            for word in words:
                print "LongValueSum:" + word + "\t" + "1"
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
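The example job flow above uses the built-in aggregate reducer. If you prefer to supply your own reducer executable, the following sketch (not part of the AWS samples) sums the counts emitted by wordSplitter.py:

#!/usr/bin/python
import sys

def main():
    totals = {}
    for line in sys.stdin:
        # Mapper output lines look like "LongValueSum:word<TAB>1".
        key, sep, count = line.rstrip().partition("\t")
        if not sep:
            continue
        totals[key] = totals.get(key, 0) + int(count)
    for key, total in totals.items():
        print key + "\t" + str(total)

if __name__ == "__main__":
    main()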

Building Binaries Using Amazon EMR

You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your job flow. Programs that you use with Amazon EMR must be compiled on a system running the same version of Debian used by Amazon EMR. For a 32-bit version (m1.small), you should compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, you need to compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information on Amazon EC2 instance versions, go to Amazon EC2 Instances (p. 11). Supported programming languages include C++, Cython, and C#.
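For example, when compiling a small C++ program directly on the master node, you might set the word size explicitly (a sketch; use -m32 on a 32-bit instance type such as m1.small and -m64 on a 64-bit instance type):

$ g++ -m64 -O2 -o myprogram myprogram.cpp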

The following process outlines the steps involved to build and test your application using Amazon EMR.

Process for Building a Module

1. Create an interactive job flow.

2. Identify the job flow ID and Public DNS name of the master node.

3. SSH as the Hadoop user to the master node of your Hadoop cluster.

4. Copy source files to the master node.

5. Build binaries with any necessary optimizations.

6. Copy binaries from the master node to Amazon S3.

7. Close the SSH connection.

8. Terminate the job flow.

The details for each of these steps are covered in the sections that follow.

To create an interactive job flow

• Create an interactive job flow with a single-node Hadoop cluster using the desired instance type:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --alive --name "Interactive Job Flow" \
--num-instances=1 --master-instance-type=m1.large --hive-interactive

If you are using Microsoft Windows, enter the following:

C:\> ruby elastic-mapreduce --create --alive --name "Interactive Job Flow" --num-instances=1 --master-instance-type=m1.large --hive-interactive

The output looks similar to:

Created jobflow JobFlowID

To identify the job flow ID and Public DNS name of the master node

• Identify your job flow:

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --list

If you are using Microsoft Windows, enter the following:

C:\> ruby elastic-mapreduce --list

The output looks similar to the following.

j-SLRI9SCLK7UC     STARTING     ec2-75-101-168-82.compute-1.amazonaws.com     Interactive Job Flow
                   PENDING      Hive Job

The response includes the job flow ID and the Public DNS Name. You use this information to connect to the master node.

Typically you need to wait one or two minutes after launching the job flow before the Public DNS Name is assigned.


To SSH as the Hadoop user to the master node

• Use the credentials created for your Amazon EC2 key pair to log in to the master node. Instructions for creating credentials are located at Create a Credentials File (p. 22).

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --ssh --jobflow JobFlowID

If you are using Microsoft Windows:

a. Start PuTTY.

b. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. In this example, the input looks similar to hadoop@ec2-75-101-168-82.compute-1.amazonaws.com.

c. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.

d. Click Browse for Private key file for authentication, and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

e. Click OK.

f. Click Open to connect to your master node.

g. A PuTTY Security Alert pops up. Click Yes.

When you successfully connect to the master node, the output looks similar to the following:

Using username "hadoop".
Authenticating with public key "imported-openssh-key"
Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
--------------------------------------------------------------------------------

Welcome to Amazon EMR running Hadoop and Debian/Lenny.

Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.

The Hadoop UI can be accessed via the following commands:

  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/

--------------------------------------------------------------------------------


To copy source files to the master node

• Copy your source files to the master node:

a. Put your source files on Amazon S3. To learn how to create buckets and move files with Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.

b. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

$ mkdir SourceDestination

You now have a destination folder for your source files.

c. Copy your source files from Amazon S3 to the Hadoop cluster by entering a command similar to the following:

$ hadoop fs -get s3://myawsbucket/SourceFiles SourceDestination

Your source files are now located in your destination folder on the master node of your Hadoop cluster.

Build binaries with any necessary optimizations

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information to determine how to install your build environment.

To identify system specifications

• Use the following commands to verify the architecture you are using to build your binaries:

a. To view the version of Debian, enter the following command:

master$ cat /etc/issue

The output looks similar to the following.

Debian GNU/Linux 5.0

b. To view the Public DNS Name and processor size, enter the following command:

master$ uname -a

The output looks similar to the following.

Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux

c. To view the processor speed, enter the following command:


master$ cat /proc/cpuinfo

The output looks similar to the following.

processor       : 0
vendor_id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
flags           : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
...

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. Once your binaries are built, you can copy the files to Amazon S3.

To copy binaries from the master node to Amazon S3

• Copy the binaries to Amazon S3 by entering the following command:

$ hadoop fs -put BinaryFiles s3://myawsbucket/BinaryDestination

Your binaries are now stored in your Amazon S3 bucket.

To close the SSH connection

• Enter the following command from the Hadoop command-line prompt:

$ exit

You are no longer connected to your cluster via SSH.

To terminate the job flow

• If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --terminate JobFlowID

If you are using Microsoft Windows, enter the following:

C:\> ruby elastic-mapreduce --terminate JobFlowID

Your job flow is terminated.


Important
Terminating a job flow deletes all files and executables saved to the cluster. Remember to save all required files before terminating a job flow.

Using Tagging

Amazon Elastic MapReduce (Amazon EMR) automatically tags each Amazon EC2 instance it launches with key-value pairs that identify the job flow and the instance group to which the instance belongs. This makes it easy to filter your Amazon EC2 instances to show, for example, only those instances belonging to a particular job flow or to show all of the currently running instances in the task-instance group. This is especially useful if you are running several job flows concurrently or managing large numbers of Amazon EC2 instances.

These are the predefined key-value pairs that Amazon EMR assigns:

Key                                          Value
aws:elasticmapreduce:job-flow-id             <job-flow-identifier>
aws:elasticmapreduce:instance-group-role     <group-role>

The values are further defined as follows:

• The <job-flow-identifier> is the ID of the job flow the instance is provisioned for. It appears in the format j-XXXXXXXXXXXXX.

• The <group-role> is one of the following values: master, core, or task. These values correspond to the master instance group, core instance group, and task instance group.

You can view and filter on the tags that Amazon EMR adds. For more information on how to do this, go to Using Tags in the Amazon Elastic Compute Cloud User Guide. Because the tags set by Amazon EMR are system tags and cannot be edited or deleted, the sections on displaying and filtering tags are the most relevant.
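For example, with the Amazon EC2 API tools you can list only the instances that belong to a particular job flow by filtering on the tag that Amazon EMR sets. This is a sketch; substitute your own job flow ID, and note that the exact syntax depends on the version of the EC2 tools you have installed:

$ ec2-describe-instances --filter "tag:aws:elasticmapreduce:job-flow-id=j-XXXXXXXXXXXXX"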

Note
Amazon EMR adds tags to the Amazon EC2 instance when its status is updated to running. If there's a latency period between the time the Amazon EC2 instance is provisioned and the time its status is set to running, the tags set by Amazon EMR will not appear until the instance starts. If you don't see the tags, wait a few minutes and refresh the view; the Amazon EMR tags should then appear.

Protect a Job Flow from Termination

Termination protection ensures that the Amazon EC2 instances in your job flow are not shut down by an accident or error. This protection is especially useful if your job flow contains data in instance storage that you need to recover before those instances are terminated.

By default, termination protection is disabled on job flows. When termination protection is not enabled, you can terminate job flows either through calls to the TerminateJobFlows API, through the Amazon EMR console, or by using the command line interface. In addition, the master node may terminate a task node that has become unresponsive or has returned an error.


When termination protection is enabled, you must explicitly remove termination protection from the job flow before you can terminate the job flow. With termination protection enabled, TerminateJobFlows can't terminate the job flow and users can't terminate the job flow using the CLI. Users terminating the job flow using the Amazon EMR console will receive an extra confirmation box asking if they want to remove termination protection before terminating the job flow.

If you attempt to terminate a protected job flow with the API or CLI, the API returns an error, and the CLI exits with a non-zero return code.

In the case of an error, the job flow will end, but the Amazon EC2 instances will persist. Furthermore, if the ActionOnFailure flag for the job flow has been set to "terminate and close," having termination protection enabled changes the job flow's ActionOnFailure behavior to "close and wait."

Note
Use job flow termination protection judiciously because it can lead to additional charges for the persistent Amazon EC2 instances.

Termination Protection in Amazon EMR and Amazon EC2

Termination protection of job flows in Amazon Elastic MapReduce (Amazon EMR) is analogous to setting the disableAPITermination flag on an Amazon EC2 instance. In the event of a conflict between the termination protection set in Amazon EC2 and that set in Amazon EMR, the Amazon EMR job flow protection status overrides that set by Amazon EC2 on the given instance. For example, if you use the Amazon EC2 console to enable termination protection on an Amazon EC2 instance in an Amazon EMR job flow that has termination protection disabled, Amazon EMR will turn off termination protection on that Amazon EC2 instance and shut down the instance when the rest of the job flow terminates.

Termination Protection and Spot Instances

Amazon EMR termination protection does not prevent an Amazon EC2 Spot Instance from terminating when the Spot price rises above the maximum bid price. For more information about the behavior of Amazon EC2 Spot Instances in Amazon EMR, go to Lower Costs with Spot Instances (p. 141).

Termination Protection and Keep Alive

Enabling termination protection on a job flow is similar to enabling keep alive on a job flow (using the --alive argument in the CLI), but the protection each offers is different. Keep alive causes instances in a job flow to persist after the job flow has successfully completed, but still allows the job flow to be terminated by calls to TerminateJobFlows and by errors. Termination protection allows the job flow to terminate after successful completion, but keeps it persistent in the case of user actions, errors, and TerminateJobFlows calls.

The following table compares the protections offered by termination protection and keep alive.

Protects against termination from...    Termination Protection    Keep Alive
Successful completion                   No                        Yes
User actions                            Yes                       No
TerminateJobFlows API                   Yes                       No
Errors                                  Yes                       No

Protecting a New Job Flow

You can specify that a new job flow be protected from termination during job flow creation.

Launch a job flow with termination protection using the Amazon EMR console

1. From the Amazon EMR console, click Create New Job Flow to launch the Create a New Job Flow wizard.

2. Follow the instructions for the type of job flow you are creating at Create a Job Flow (p. 23).

3. On the ADVANCED OPTIONS page of the Create a New Job Flow wizard, set Termination Protection to Yes.

4. Continue through the wizard, following the directions for the type of job flow you are launching. For more information, see Create a Job Flow (p. 23).

Launch a job flow with termination protection using the CLI

• Specify --with-termination-protection during the job flow creation call. The following example shows setting termination protection on the WordCount sample application.


elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 2 --stream \
--input s3://elasticmapreduce/samples/wordcount/input \
--output s3://myawsbucket/wordcount/output/2011-03-25 \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate \
--with-termination-protection

For additional information about launching job flows using the CLI, see the instructions for each job flow type in Create a Job Flow (p. 23).

Launch a job flow with termination protection using the API

• Call RunJobFlow and set the Instances.TerminationProtected request argument to true.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.small
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams

For additional information about launching job flows using the API, see the instructions for each job flow type in Create a Job Flow (p. 23).

Protecting an Existing Job Flow

You can add termination protection to an already running job flow using either the CLI or the API.

Note
You cannot currently add termination protection to a running job flow using the Amazon EMR console.

To enable termination protection for an existing job flow using the CLI

• Set the --set-termination-protection flag to true. This is shown in the following example, where JobFlowID is the identifier of the job flow on which to enable termination protection.

elastic-mapreduce --set-termination-protection true --jobflow JobFlowID


To enable termination protection for an existing job flow using the API

• Call SetTerminationProtection and set TerminationProtected to true. This is shown in the following example, where JobFlowID is the identifier of the job flow on which to enable termination protection.

https://elasticmapreduce.amazonaws.com?Operation=SetTerminationProtection
&JobFlowId=JobFlowID
&TerminationProtected=true

Terminating a Protected Job Flow

If you want to terminate a protected job flow, you must first disable termination protection. After termination protection is disabled, you can terminate the job flow from the Amazon EMR console, from the CLI, or programmatically using the TerminateJobFlows API.

To terminate a job flow with termination protection set using the Amazon EMR console.

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Select the job flow you wish to terminate.

3. Click the Terminate button.

4. Click Yes, Terminate on the confirmation dialog box to confirm that you wish to disable termination protection and terminate the job flow.


To terminate a job flow with termination protection set using the CLI

1. Disable termination protection by setting --set-termination-protection to false. This is shown in the following example, where JobFlowID is the identifier of the job flow on which to disable termination protection.

elastic-mapreduce --set-termination-protection false --jobflow JobFlowID

2. Terminate the job flow using the --terminate parameter and specifying the job flow identifier of the job flow to terminate.

elastic-mapreduce --terminate JobFlowID

To terminate a job flow with termination protection set using the API

1. Disable termination protection by calling the SetTerminationProtection action and setting the TerminationProtected flag to false. This is shown in the following example, where JobFlowID is the identifier of the job flow on which to disable termination protection.

https://elasticmapreduce.amazonaws.com?Operation=SetTerminationProtection
&JobFlowId=JobFlowID
&TerminationProtected=false

2. Terminate the job flow using the TerminateJobFlows action.

https://elasticmapreduce.amazonaws.com?JobFlowIds.member.1=JobFlowID
&Operation=TerminateJobFlows
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A53%3A50.000Z
&Signature=calculated value

Lower Costs with Spot Instances

When Amazon EC2 has unused capacity, it offers EC2 instances at a reduced cost, called the Spot Price. This price fluctuates based on availability and demand. You can purchase Spot Instances by placing a request that includes the highest bid price you are willing to pay for those instances. When the Spot Price is below your bid price, your Spot Instances are launched and you are billed the Spot Price. If the Spot Price rises above your bid price, Amazon EC2 terminates your Spot Instances.

For more information about Spot Instances, go to Using Spot Instances in the Amazon Elastic Compute Cloud User Guide.

The following video describes how Spot Instances work in Amazon Elastic MapReduce (Amazon EMR) and walks you through the process of launching a job flow on Spot Instances from the Amazon EMR console: Using Spot Instances with Amazon Elastic MapReduce.


Additional video instruction includes:

• Amazon EC2 - Deciding on your Spot Bidding Strategy, describes strategies to use when setting a bid price for Spot Instances.

• Amazon EC2 - Managing Interruptions for Spot Instance Workloads, describes ways to handle Spot Instance termination.

If your workload is flexible in terms of time of completion or required capacity, Spot Instances can significantly reduce the cost of running your job flows. Workloads that are ideal for using Spot Instances include: application testing, time-insensitive workloads, and long-running job flows with fluctuations in load.

Note
Spot Instances are not recommended for job flows that are time-critical or that need guaranteed capacity. These job flows should be launched using on-demand instance groups.

When Should You Use Spot Instances?

There are several scenarios in which Spot Instances are useful for running an Amazon EMR job flow.

Long-Running Job Flows and Data Warehouses

If you are running a persistent Amazon EMR job flow, such as a data warehouse, that has a predictable variation in computational capacity, you can handle peak demand at lower cost with Spot Instances. Launch your master and core instance groups as on-demand to handle the normal capacity and launch the task instance group as Spot Instances to handle your peak load requirements.

Cost-Driven Workloads

If you are running transient job flows for which lower cost is more important than the time to completion, and losing partial work is acceptable, you can run the entire job flow (master, core, and task instance groups) as Spot Instances to benefit from the largest cost savings.

Data-Critical Workloads

If you are running a job flow for which lower cost is more important than time to completion, but losing partial work is not acceptable, launch the master and core instance groups as on-demand and supplement with a task instance group of Spot Instances. Running the master and core instance groups as on-demand ensures that your data is persisted in HDFS and that the job flow is protected from termination due to Spot market fluctuations, while providing cost savings that accrue from running the task instance group as Spot Instances.

Application Testing

When you are testing a new application in order to prepare it for launch in a production environment, you can run the entire job flow (master, core, and task instance groups) as Spot Instances to reduce your testing costs.

Choosing What to Launch as Spot Instances

When you launch a job flow in Amazon Elastic MapReduce (Amazon EMR), you can choose to launch any or all of the instance groups (master, core, and task) as Spot Instances. Because each type of instance group plays a different role in the job flow, the implications of launching each instance group as Spot Instances vary.

When you launch an instance group either as on-demand or as Spot Instances, you cannot change its classification while the job flow is running. In order to change an on-demand instance group to Spot Instances or vice versa, you must terminate the job flow and launch a new one.


The following table shows launch configurations for using Spot Instances in various applications.

Project                    Master Instance Group    Core Instance Group    Task Instance Group
Long-running job flows     on-demand                on-demand              spot
Cost-driven workloads      spot                     spot                   spot
Data-critical workloads    on-demand                on-demand              spot
Application testing        spot                     spot                   spot

Master Instance Group as Spot Instances

The master node controls and directs the job flow. When it terminates, the job flow ends, so you should only launch the master node as a Spot Instance if you are running a job flow where sudden termination is acceptable. This might be the case if you are testing a new application, have a job flow that periodically persists data to an external store such as Amazon S3, or are running a job flow where cost is more important than ensuring the job flow's completion.

When you launch the master instance group as a Spot Instance, the job flow will not start until that Spot Instance request is fulfilled. This is something to take into consideration when selecting your bid price.

You can only add a Spot Instance master node when you launch the job flow. Master nodes cannot be added or removed from a running job flow.

Typically, you would only run the master node as a Spot Instance if you are running the entire job flow (all instance groups) as Spot Instances.

Core Instance Group as Spot Instances

Core nodes process data and store information using HDFS. Because termination of core nodes can result in data loss and possible termination of the job flow, you would typically only run core nodes as Spot Instances if you are either not running task nodes or running task nodes as Spot Instances.

When you launch the core instance group as Spot Instances, Amazon EMR waits until it can provision all of the requested core instances before launching the instance group. This means that if you request a core instance group with six nodes, the instance group will not launch if there are only five nodes available at or below your bid price. In this case, Amazon EMR continues to wait until all six core nodes are available at or below your bid price, or until you terminate the job flow.

You can add Spot Instance core nodes either when you launch the job flow or later to add capacity to a running job flow. You cannot remove core nodes from a running job flow.

Task Instance Group as Spot Instances

The task nodes process data but do not hold persistent data in HDFS. If they terminate because the Spot Price has risen above your bid price, no data is lost and the effect on your job flow is minimal.

When you launch the task instance group as Spot Instances, Amazon EMR will provision as many task nodes as it can at your bid price. This means that if you request a task instance group with six nodes, and only five Spot Instances are available at your bid price, Amazon EMR will launch the instance group with five nodes, adding the sixth later if it can.

Launching the task instance group as Spot Instances is a strategic way to expand the capacity of your job flow while minimizing costs. If you launch your master and core instance groups as on-demand instances, their capacity is guaranteed for the run of the job flow and you can add task instances to the instance group as needed to handle peak traffic or to speed up data processing.

You can add and remove Spot Instance task nodes from a running job flow.

Spot Instance Pricing in Amazon EMR

There are two components in Amazon Elastic MapReduce (Amazon EMR) billing: the cost for the EC2 instances launched by the job flow and the charge Amazon EMR adds for managing the job flow. When you use Spot Instances, the Spot Price may change due to fluctuations in supply and demand, but the Amazon EMR rate remains fixed.

When you purchase Spot Instances, you can set the bid price only when you launch the instance group. It can't be changed later. This is something to consider when setting the bid price for an instance group in a long-running job flow.

You can launch different instance groups at different bid prices. For example, in a job flow running entirely on Spot Instances, you might choose to set the bid price for the master instance group at a higher price than the task instance group, because if the master terminates, the job flow ends, whereas terminated task instances can be replaced.

If you start and stop instances in the job flow, partial hours are billed as full hours. If instances are terminated because the Spot Price rose above your bid price, you are not charged either the Amazon EC2 or Amazon EMR charges for the partial hour.

You can look up the current Spot Price and the on-demand price for instances on the Amazon EC2 Pricing page.
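If you prefer the command line to the pricing page, the Amazon EC2 API tools can also display recent Spot Price history (a sketch; the exact options depend on the version of the EC2 tools you have installed):

$ ec2-describe-spot-price-history --instance-type m1.large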

Availability Zones and Regions

When you launch a job flow, you have the option to specify a Region and an Availability Zone within that Region.

If you do not specify an Availability Zone when you launch a job flow, Amazon Elastic MapReduce (Amazon EMR) selects the Availability Zone with the lowest Spot Instance pricing and the largest available capacity of the EC2 instance types specified for your core instance group, and then launches the master, core, and task instance groups in that Availability Zone.

Because of fluctuating Spot Prices between Availability Zones, selecting the Availability Zone with the lowest initial price (or allowing Amazon EMR to select it for you) might not result in the lowest price for the life of the job flow. For optimal results, you should study the history of Availability Zone pricing before choosing the Availability Zone for your job flow.

Note
Because Amazon EMR selects the Availability Zone based on the free capacity of the EC2 instance type you specified for the core instance group, your job flow may end up in an Availability Zone with less capacity in other EC2 instance types. For example, if you are launching your core instance group as Large and the master instance group as Extra Large, you may launch into an Availability Zone with insufficient unused Extra Large capacity to fulfill a Spot Instance request for your master node. If you run into this situation, you can launch the master instance group as on-demand, even if you are launching the core instance group as Spot Instances.

If you specify an Availability Zone for the job flow, Amazon EMR launches all of the instance groups in that Availability Zone.

All instance groups in a job flow are launched into a single Availability Zone, regardless of whether they are on-demand or Spot Instances. The reason for using a single Availability Zone is that the additional data transfer costs and performance overhead make running instance groups in multiple Availability Zones undesirable.


Note
Selecting the Availability Zone is currently not available in the Amazon EMR console. Amazon EMR assigns an Availability Zone to job flows launched from the Amazon EMR console as described above.

Launching Spot Instances in Job Flows

When you launch a new instance group, you can launch the EC2 instances in that group either as on-demand or as Spot Instances. The procedure for launching Spot Instances is the same as launching on-demand instances, except that you specify additional information such as the market type and the bid price.

Amazon EMR Console

To launch an entire job flow on Spot Instances

1. Create a job flow in the Amazon EMR console as described in Create a Job Flow (p. 23), following the instructions for the type of job flow you want to create. On the CONFIGURE EC2 INSTANCES page, you will specify which instance groups to run as Spot Instances, and the bid price for each.

2. To run the master node as a Spot Instance, select the Request Spot Instance check box under the Master Instance Group heading and enter the maximum hourly rate you are willing to pay per instance in the Spot Bid Price text box that appears. You can look up the current Spot Price for instances on the Amazon EC2 Pricing page. In most cases, you will want to enter a price higher than the current Spot Price.

3. To run the core nodes as Spot Instances, select the Request Spot Instance check box under the Core Instance Group heading and enter the maximum hourly rate you are willing to pay per instance in the Spot Bid Price text box that appears.


4. To run the task nodes as Spot Instances, select the Request Spot Instance check box under the Task Instance Group heading and enter the maximum hourly rate you are willing to pay per instance in the Spot Bid Price text box that appears.

5. Click Continue and proceed to create the job flow as described in Create a Job Flow (p. 23).

CLI

To launch an entire job flow on Spot Instances

• To specify that an instance group should be launched as Spot Instances, use the --bid-price parameter. The following example shows how to create a job flow where the master, core, and task instance groups are all running as Spot Instances. The following code launches a job flow only after the requests for the master and core instances have been completely fulfilled.

elastic-mapreduce --create --alive --name "Spot Cluster" \
--instance-group master --instance-type m1.large --instance-count 1 --bid-price 0.25 \
--instance-group core --instance-type m1.small --instance-count 4 --bid-price 0.03 \
--instance-group task --instance-type c1.medium --instance-count 2 --bid-price 0.10

Java SDK

To launch an entire job flow on Spot Instances

• To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code shows how to define master, core, and task instance groups that run as Spot Instances.

InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
    .withInstanceCount(1)
    .withInstanceRole("MASTER")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.25");

InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
    .withInstanceCount(4)
    .withInstanceRole("CORE")
    .withInstanceType("m1.small")
    .withMarket("SPOT")
    .withBidPrice("0.03");

InstanceGroupConfig instanceGroupConfigTask = new InstanceGroupConfig()
    .withInstanceCount(2)
    .withInstanceRole("TASK")
    .withInstanceType("c1.medium")
    .withMarket("SPOT")
    .withBidPrice("0.10");

API

To launch an entire job flow on Spot Instances

• To specify that an instance group should be launched as Spot Instances, set the BidPrice and Market properties of the InstanceGroupDetail members of the InstanceGroupDetailList.


The code below shows how to define master, core, and task instance groups that run as Spot Instances.

Example Sample Request

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.large
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Instances.InstanceGroups.member.1.InstanceRole=MASTER
&Instances.InstanceGroups.member.1.Market=SPOT
&Instances.InstanceGroups.member.1.BidPrice=.25
&Instances.InstanceGroups.member.2.InstanceRole=CORE
&Instances.InstanceGroups.member.2.Market=SPOT
&Instances.InstanceGroups.member.2.BidPrice=.03
&Instances.InstanceGroups.member.3.InstanceRole=TASK
&Instances.InstanceGroups.member.3.Market=SPOT
&Instances.InstanceGroups.member.3.BidPrice=.03
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams

Example Sample Response

<RunJobFlowResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <RunJobFlowResult>
    <JobFlowId>j-3UN6WX5RRO2AG</JobFlowId>
  </RunJobFlowResult>
  <ResponseMetadata>
    <RequestId>8296d8b8-ed85-11dd-9877-6fad448a8419</RequestId>
  </ResponseMetadata>
</RunJobFlowResponse>


Amazon EMR Console

To launch only the task instance group on Spot Instances

1. Create a job flow in the Amazon EMR console as described in Create a Job Flow (p. 23), following the instructions for the type of job flow you want to create. On the CONFIGURE EC2 INSTANCES page, specify which instance groups to run as Spot Instances, and the bid price for each.

2. To run the task nodes as Spot Instances, select the Request Spot Instance check box under the Task Instance Group heading and enter the maximum hourly rate you are willing to pay per instance in the Spot Bid Price text box that appears. You can look up the current Spot Price for instances on the Amazon EC2 Pricing page. In most cases, you will want to enter a price higher than the current Spot Price.

3. Click Continue and proceed to create the job flow as described in Create a Job Flow (p. 23).

CLI

To launch only the task instance group on Spot Instances

• To specify that an instance group should be launched as Spot Instances, use the --bid-price parameter. The following example shows how to create a job flow where only the task instance group is running as Spot Instances. The code below will launch a job flow even if the request for Spot Instances can't be fulfilled. In that case, Amazon EMR will add task nodes to the job flow if it is still running when the Spot Price falls below the bid price.

elastic-mapreduce --create --alive --name "Spot Task Group" \
--instance-group master --instance-type m1.large \
--instance-count 1 \
--instance-group core --instance-type m1.large \
--instance-count 2 \
--instance-group task --instance-type m1.small \
--instance-count 4 --bid-price 0.03


Java SDK

To launch only the task instance group on Spot Instances

• To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code creates a task instance group of type m1.large with an instance count of 10. It specifies $0.35 as the maximum bid price, and will run as Spot Instances.

InstanceGroupConfig instanceGroupConfigMaster = new InstanceGroupConfig()
    .withInstanceCount(1)
    .withInstanceRole("MASTER")
    .withInstanceType("m1.large");

InstanceGroupConfig instanceGroupConfigCore = new InstanceGroupConfig()
    .withInstanceCount(4)
    .withInstanceRole("CORE")
    .withInstanceType("m1.small");

InstanceGroupConfig instanceGroupConfig = new InstanceGroupConfig()
    .withInstanceCount(10)
    .withInstanceRole("TASK")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.35");

API

To launch only the task instance group on Spot Instances

• To specify that an instance group should be launched as Spot Instances, set the BidPrice and Market properties of the TASK InstanceGroupDetail member of the InstanceGroupDetailList. The following code shows how to define only the task instance group to run as Spot Instances.

The following example shows how to format the request.


Example Sample Request

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.large
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Instances.InstanceGroups.member.1.InstanceRole=MASTER
&Instances.InstanceGroups.member.2.InstanceRole=CORE
&Instances.InstanceGroups.member.3.InstanceRole=TASK
&Instances.InstanceGroups.member.3.Market=SPOT
&Instances.InstanceGroups.member.3.BidPrice=.03
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams

Example Sample Response

<RunJobFlowResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <RunJobFlowResult>
    <JobFlowId>j-3UN6WX5RRO2AG</JobFlowId>
  </RunJobFlowResult>
  <ResponseMetadata>
    <RequestId>8296d8b8-ed85-11dd-9877-6fad448a8419</RequestId>
  </ResponseMetadata>
</RunJobFlowResponse>

Changing the Number of Spot Instances in a Job Flow

With some restrictions, you can modify the number of Spot Instances in a job flow. For example, you cannot change the number of instances in the master instance group; it always has one instance. To prevent data loss, you can add, but not remove, core nodes from an instance group. Task nodes do not store state, and so they can be added or removed from a running job flow.


You can only define an instance group as running as Spot Instances when it is created. So, for example, if you launched the core instance group as on-demand, you would not be able to change its market type to Spot Instances later.

Note
It's possible to automatically modify the number of slave nodes in a job flow between job flow steps. You just include a predefined step in your workflow that changes the number of requested Spot Instances. For example:

elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
--jar s3://us-east-1.elasticmapreduce/libs/resize-job-flow/0.1/resize-job-flow.jar \
--args --modify-instance-group,task,--instance-count,4

The following examples show how to add task and core instance groups to a running job flow by increasing the number of instances in an instance group. For task nodes, you can also decrease the number of instances. The procedure is the same as shown in the examples that follow, but instead of specifying a larger number of instances than is currently running, you would specify a smaller number.

Note
If you are running a job flow that contains only a master node, you cannot add instance groups to that job flow. A job flow must have one or more core instances for you to be able to add or modify instance groups.

Note
Adding nodes to a running job flow is not currently supported in the Amazon EMR console.

CLI

To add Spot Instances to a running job flow

• You can use the following command to add a task instance group of Spot Instances to a running job flow:

elastic-mapreduce --jobflow j-xxxxxxxxxxxxx --add-instance-group task \
  --instance-type m1.small --instance-count 5 --bid-price 0.05

CLI

To increase or decrease the number of Spot Instances in a running job flow

• You can change the number of requested Spot Instances in a running job flow by calling the set-num-core-group-instances or set-num-task-group-instances commands on the command line. Note that you can only increase the number of CORE instances in your job flow, while you can increase or decrease the number of TASK instances. Setting the number of TASK InstanceGroup instances to zero will remove all Spot Instances.

elastic-mapreduce --jobflow j-xxxxxxxxxxxxx --set-num-task-group-instances 5

This will change the number of requested TASK InstanceGroup instances to 5.

Java SDK

To add Spot Instances to a running job flow

• To specify that an instance group should be launched as Spot Instances, set the withBidPrice and withMarket properties on the InstanceGroupConfig object that you instantiate for the instance group. The following code creates a task instance group of type m1.large with an instance count of 10. It specifies $0.35 as the maximum bid price, and will run as Spot Instances.

When you make the call to modify the instance group, pass this object instance in.

InstanceGroupConfig instanceGroupConfig = new InstanceGroupConfig()
    .withInstanceCount(10)
    .withInstanceRole("TASK")
    .withInstanceType("m1.large")
    .withMarket("SPOT")
    .withBidPrice("0.35");
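The SDK operation that attaches this new group to a running job flow is AddInstanceGroups. The following is a minimal sketch, assuming an AmazonElasticMapReduceClient named emr has already been constructed as in the earlier SDK examples; the job flow ID is a placeholder.

// Attach the Spot task instance group defined above to a running job flow.
AddInstanceGroupsRequest addRequest = new AddInstanceGroupsRequest()
    .withJobFlowId("j-XXXXXXXXXXXXX")
    .withInstanceGroups(instanceGroupConfig);

AddInstanceGroupsResult addResult = emr.addInstanceGroups(addRequest);

// The result includes the identifiers assigned to the new instance groups.
System.out.println("Added instance groups: " + addResult.getInstanceGroupIds());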

API

To add Spot Instances to a running job flow

• The following sample request increases the number of task nodes in the task instance group to eight and requests that they be launched as Spot Instances with an hourly bid price of .35.

The following is an example of the request you would send in to Amazon EMR.

Sample Request

https://elasticmapreduce.amazonaws.com?Operation=ModifyInstanceGroups
&InstanceGroups.member.1.InstanceGroupId=i-3UN6WX5RRO2AG
&InstanceGroups.member.1.InstanceRequestCount=8
&InstanceGroups.member.1.InstanceRole=TASK
&InstanceGroups.member.1.Market=SPOT
&InstanceGroups.member.1.BidPrice=.35
&AuthParams


Sample Response

<ModifyInstanceGroupsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <ResponseMetadata>
    <RequestId>2690d7eb-ed86-11dd-9877-6fad448a8419</RequestId>
  </ResponseMetadata>
</ModifyInstanceGroupsResponse>
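To make the equivalent call from the AWS SDK for Java, you identify the instance group by ID and set the new instance count. The following is a minimal sketch, assuming a client named emr as in the earlier SDK examples; the instance group ID is a placeholder (you can look it up with DescribeJobFlows).

// Request eight instances for an existing Spot task instance group.
InstanceGroupModifyConfig modifyConfig = new InstanceGroupModifyConfig()
    .withInstanceGroupId("ig-XXXXXXXXXXXXX")
    .withInstanceCount(8);

emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
    .withInstanceGroups(modifyConfig));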

Troubleshooting Spot Instances

The following topics address issues that may arise when you use Spot Instances. For additional information on how to debug job flow issues, go to the section called "Troubleshooting" (p. 183).

Why haven't I received my Spot Instances?

Spot Instances are provisioned based on availability and bid price. If you haven't received the Spot Instances you requested, that means either your bid price is lower than the current Spot Price, or there is not enough supply at your bid price to fulfill your request.

Master and core instance groups will not be fulfilled until all of the requested instances can be provisioned. Task nodes are fulfilled as they become available.
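One way to check how much of a Spot request has been fulfilled is to compare the requested and running instance counts for each instance group. The following is a minimal sketch using the DescribeJobFlows call in the AWS SDK for Java, assuming a client named emr as in the earlier SDK examples; the job flow ID is a placeholder.

// Compare requested vs. running instances for each instance group in a job flow.
DescribeJobFlowsResult describeResult = emr.describeJobFlows(
    new DescribeJobFlowsRequest().withJobFlowIds("j-XXXXXXXXXXXXX"));

for (JobFlowDetail jobFlow : describeResult.getJobFlows()) {
    for (InstanceGroupDetail group : jobFlow.getInstances().getInstanceGroups()) {
        System.out.println(group.getInstanceRole() + " (" + group.getMarket() + "): "
            + group.getInstanceRunningCount() + " of " + group.getInstanceRequestCount()
            + " instances running, state " + group.getState());
    }
}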

One way to address unfulfilled Spot Instance requests is to terminate the job flow and launch a new one, specifying a higher bid price. Reviewing the price history on Spot Instances will tell you which bids have been successful in the recent past and can help you determine which bid is the best balance of cost savings and likelihood of being fulfilled. To review the Spot Instance price history, go to Spot Instances on the Amazon EC2 Pricing page.

Another option is to change the type of instance you request. For example, if you requested four Extra Large instances and the request has not been filled after a period of time, you might consider relaunching the job flow and placing a request for four Large instances instead. Because the base rate is different for each instance type, you would want to adjust your bid price accordingly. For example, if you bid 80% of the on-demand rate for an Extra Large instance, you might choose to adjust your bid price on the new Spot Instance request to reflect 80% of the on-demand rate for a Large instance.

The final variable in the fulfillment of a Spot Instance request is whether there is unused capacity in your Region. You can try launching the Spot Instance request in a different Region. Before selecting this option, however, consider the implications of data transfer across Regions. For example, if the Amazon Simple Storage Service (Amazon S3) bucket housing your data is in Region us-east-1 and you launch a job flow as Spot Instances in us-west-1, the additional cross-Region data transfer costs will likely outweigh any cost savings from using Spot Instances.

Why did my Spot Instances terminate?

By design, Spot Instances are terminated by Amazon EC2 when the Spot Instance price rises above your bid price.

If your bid price is equal to or lower than the Spot Instance price, the instances might have terminated normally at the end of the job flow, or they might have terminated because of an error. For more information on how to debug job flow errors, go to the section called "Troubleshooting" (p. 183).


How do I check the price history on Spot Instances?

To review the Spot Instance price history, go to Spot Instances on the Amazon EC2 Pricing page. This pricing information is updated at regular intervals.

Store Data with HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on top of Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and is highly efficient for batch inserts, updates, and deletes.

HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).

Additionally, HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3). You can also restore from a previously created backup when launching an HBase cluster.

What Can I Do with HBase?

You can use HBase for random, repeated access to and modification of large volumes of data. HBase provides low-latency lookups and range scans, along with efficient updates and deletions of individual records.

Here are several HBase use cases for you to consider:

• Reference data for Hadoop analytics. With its direct integration with Hadoop and Hive and rapid access to stored data, HBase can be used to store reference data used by multiple Hadoop tasks or across multiple Hadoop clusters. This data can be stored directly on the cluster running Hadoop tasks or on a separate cluster. Types of analytics include analytics requiring fast access to demographic data, IP address geolocation lookup tables, and product dimensional data.

• Real-time log ingestion and batch log analytics. HBase's high write throughput, optimization for sequential data, and efficient storage of sparse data make it a great solution for real-time ingestion of log data. At the same time, its integration with Hadoop and optimization for sequential reads and scans makes it equally suited for batch analysis of that log data after ingestion. Common use cases include ingestion and analysis of application logs, clickstream data, and in-game usage data.

• Store for high frequency counters and summary data. Counter increments aren't just database writes, they're read-modify-writes, so they're a very expensive operation for a relational database. However, because HBase is a nonrelational, distributed database, it supports very high update rates and, given its consistent reads and writes, provides immediate access to that updated data. In addition, if you want to run more complex aggregations on the data (such as max-mins, averages, and group-bys), you can run Hadoop jobs directly and feed the aggregated results back into HBase.

HBase Job Flow Prerequisites

An Amazon EMR job flow should meet the following requirements in order to run HBase. Amazon EMR currently runs HBase version 0.92.0.


• A version of the Amazon EMR command line interface (CLI) that supports HBase (Optional)—CLI version 2012-06-12 and later. To find out what version of the CLI you have, run elastic-mapreduce --version at the command line. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264. If you do not have the latest version of the CLI installed, you can use the Amazon EMR console to launch HBase clusters.

• At least two instances (Optional)—The job flow's master node runs the HBase master server and Zookeeper, and slave nodes run the HBase region servers. For best performance, HBase job flows should run on at least two EC2 instances, but you can run HBase on a single node for evaluation purposes.

• Persistent job flow—HBase only runs on persistent job flows. The CLI and Amazon EMR console automatically create HBase job flows with the --alive flag set.

• An Amazon EC2 key pair set—To run HBase shell commands, you'll need to be able to use the Secure Shell (SSH) network protocol to connect with the master node. To do this, you must set an Amazon EC2 key pair when you create the job flow.

• The correct instance type—HBase is only supported on the following instance types: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, cc2.8xlarge, or hs1.8xlarge. The cc2.8xlarge instance type is only supported in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.

• The correct AMI and Hadoop versions—HBase job flows are currently supported only on this beta version AMI and Hadoop 0.20.205 or later. The CLI and Amazon EMR console automatically set the correct AMI on HBase job flows.

• Ganglia (Optional)—If you want to monitor HBase performance metrics, you can use a bootstrap action to install Ganglia when you create the job flow.

• An Amazon S3 bucket for logs (Optional)—The logs for HBase are available on the master node. If you'd like these logs copied to Amazon S3, specify an Amazon S3 bucket to receive log files when you create the job flow.

Launch an HBase Cluster on Amazon EMR

When you launch HBase on Amazon EMR, you get the benefits of running in the Amazon Web Services (AWS) cloud—easy scaling, low cost, pay only for what you use, and ease of use. The EMR team has tuned HBase to run optimally on AWS. For more information about HBase and running it on Amazon EMR, see Store Data with HBase (p. 155).

The following procedure shows how to launch an HBase job flow with the default settings. If your application needs custom settings, you can configure HBase as described in Configure HBase (p. 174).

Note
HBase configuration can only be done at launch time.

For production environments, we recommend that you launch HBase on one job flow and launch any analysis tools, such as Hive, on a separate job flow. This ensures that HBase has ready access to the CPU and memory resources it requires.

To launch an HBase cluster using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click Create New Job Flow.


3. In the DEFINE JOB FLOW page, enter the following:

a. Enter a name in the Job Flow Name field. We recommend that you use a descriptive name. It does not need to be unique.

b. Select a version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

c. Select Run your own application.

d. Select HBase in the drop-down list.

e. Click Continue.


4. In the SPECIFY PARAMETERS page, indicate whether you want to preload the HBase cluster with data stored in Amazon S3 and whether you want to schedule regular backups of your HBase cluster. Use the following table for guidance on making your selections. For more information about backing up and restoring HBase data, see Back Up and Restore HBase (p. 165). When you are finished making selections, click Continue.

• Restore from Backup: Specify whether to preload the HBase cluster with data stored in Amazon S3.

• Backup Location*: Specify the URI where the backup to restore from resides in Amazon S3.

• Backup Version: Optionally, specify the version name of the backup at Backup Location to use. If you leave this field blank, Amazon EMR uses the latest backup at Backup Location to populate the new HBase cluster.

• Schedule Regular Backups: Specify whether to schedule automatic incremental backups. The first backup will be a full backup to create a baseline for future incremental backups.

• Consistent Backup*: Specify whether the backups should be consistent. A consistent backup is one which pauses write operations during the initial backup stage (synchronization across nodes). Any write operations thus paused are placed in a queue and resume when synchronization completes.

• Backup Frequency*: The number of Days/Hours/Minutes between scheduled backups.

• Backup Location*: The Amazon S3 URI where backups will be stored. The backup location for each HBase cluster should be different to ensure that differential backups stay correct.

• Backup Start Time*: Specify when the first backup should occur. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8 p.m. UTC.

• Install Additional Packages: Optionally, add Hive or Pig to the HBase cluster. Because of performance considerations, best practice is to run HBase on one cluster and Hive or Pig on a different cluster. For testing purposes, however, you may wish to run Hive or Pig on the same cluster as HBase.

* Required parameter

5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.

Note
Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.


• Instance Count: Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.

• Instance Type: Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.

• Request Spot Instances: Specify whether to run master, core, or task nodes on Spot Instances. For more information, see Lower Costs with Spot Instances (p. 141).

6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.

• Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File (p. 20). If you do not enter a value in this field, you cannot SSH into the master node.

• Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC (p. 381).

• Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.

• Enable Debugging: Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR. If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files. For more information, see Troubleshooting (p. 183). Important: You can enable debugging for a job flow only when you initially create the job flow.

• Keep Alive: Select Yes to cause the job flow to continue running when all processing is completed.

• Termination Protection: Select Yes to ensure the job flow is not shut down due to accident or error. For more information, see Protect a Job Flow from Termination (p. 136).

• Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM (p. 274).


7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue. For more information about bootstrap actions, see Bootstrap Actions (p. 84).

8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct. After you click Create Job Flow, your request is processed; when it succeeds, a message appears.


9. Click Close.

The Amazon EMR console shows the new job flow starting. Starting a new job flow may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button for the latest view of the job flow's progress.

To launch an HBase cluster using the CLI

• Specify --hbase when you launch a job flow using the CLI.

The following example shows how to launch a job flow running HBase from the CLI. We recommend that you run at least two instances in the HBase job flow. The --instance-type parameter must be one of the following: m1.large, m1.xlarge, c1.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, hs1.8xlarge, or cc2.8xlarge. The cc2.8xlarge instance type is only available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions. The cc1.4xlarge and hs1.8xlarge instance types are only supported in the US East (Northern Virginia) Region.


The CLI implicitly launches the HBase job flow with keep alive and termination protection set.

elastic-mapreduce --create --hbase --name "$USER HBase Cluster" \
  --num-instances 2 \
  --instance-type cc1.4xlarge

To launch an HBase cluster using the API

• You need to run the hbase-setup bootstrap action when you launch HBase using the API in order to install and configure HBase on the cluster. You also need to add a step to start the HBase master. These are shown in the following example. The region, us-east-1, would be replaced by the region in which to launch the cluster. For a list of regions supported by Amazon EMR, see Choose a Region (p. 17).

https://us-east-1.elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=HBase Cluster
&LogUri=s3://myawsbucket/subdir
&Instances.MasterInstanceType=m1.xlarge
&Instances.SlaveInstanceType=m1.xlarge
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Steps.member.1.Name=InstallHBase
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.BootstrapAction.ScriptBootstrapAction=s3://us-east-1.elasticmapreduce/bootstrap-actions/setup-hbase
&Steps.member.1.Name=StartHBase
&Steps.member.1.ActionOnFailure=CANCEL_AND_WAIT
&Steps.member.1.HadoopJarStep.Jar=/home/hadoop/lib/hbase-0.92.0.jar
&Steps.member.1.HadoopJarStep.Args.member.1=emr.hbase.backup.Main
&Steps.member.1.HadoopJarStep.Args.member.2=--start-master
&AWSAccessKeyId=AccessKeyID
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2009-01-28T21%3A48%3A32.000Z
&Signature=calculated value
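If you prefer the AWS SDK for Java to the Query API, the equivalent request configures the same setup-hbase bootstrap action and start-master step shown in the sample above. The following is a minimal sketch; the bucket name, key pair, and instance settings are placeholders, and the client object emr is assumed to be already constructed as in the earlier SDK examples.

// Install HBase with the setup-hbase bootstrap action, then start the HBase master as a step.
BootstrapActionConfig installHBase = new BootstrapActionConfig()
    .withName("Install HBase")
    .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
        .withPath("s3://us-east-1.elasticmapreduce/bootstrap-actions/setup-hbase"));

StepConfig startHBase = new StepConfig()
    .withName("Start HBase")
    .withActionOnFailure("CANCEL_AND_WAIT")
    .withHadoopJarStep(new HadoopJarStepConfig()
        .withJar("/home/hadoop/lib/hbase-0.92.0.jar")
        .withArgs("emr.hbase.backup.Main", "--start-master"));

RunJobFlowResult hbaseResult = emr.runJobFlow(new RunJobFlowRequest()
    .withName("HBase Cluster")
    .withLogUri("s3://myawsbucket/subdir")
    .withInstances(new JobFlowInstancesConfig()
        .withMasterInstanceType("m1.xlarge")
        .withSlaveInstanceType("m1.xlarge")
        .withInstanceCount(4)
        .withEc2KeyName("myec2keyname")
        .withPlacement(new PlacementType("us-east-1a"))
        .withKeepJobFlowAliveWhenNoSteps(true))
    .withBootstrapActions(installHBase)
    .withSteps(startHBase));

System.out.println("Launched HBase job flow " + hbaseResult.getJobFlowId());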

Connect to HBase Using the Command Line

After you create an HBase job flow, the next step is to connect to HBase so you can begin reading and writing data.

To open the HBase shell

1. Use SSH to connect to the master server in the HBase job flow. For information about how to connect to the master node using SSH, see Connect to the Master Node Using SSH (p. 111).


2. Run hbase shell. The HBase shell will open with a prompt similar to the following example.

hbase(main):001:0>

You can issue HBase shell commands from the prompt. For a description of the shell commands and information on how to call them, type help at the HBase prompt and press Enter.

Create a Table

The following command will create a table named 't1' that has a single column family named 'f1'.

hbase(main):001:0>create 't1', 'f1'

Put a Value

The following command will put value 'v1' for row 'r1' in table 't1' and column 'f1'.

hbase(main):001:0>put 't1', 'r1', 'f1', 'v1'

Get a Value

The following command will get the values for row 'r1' in table 't1'.

hbase(main):001:0>get 't1', 'r1'
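You can perform the same table operations programmatically. The following is a minimal sketch using the HBase 0.92 Java client API rather than the shell; it is not specific to Amazon EMR, and the ZooKeeper quorum value is a placeholder for the public DNS name of your master node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Point the client at the HBase cluster's master node.
        conf.set("hbase.zookeeper.quorum", "master-public-dns-name");

        // Equivalent of: create 't1', 'f1'
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor descriptor = new HTableDescriptor("t1");
        descriptor.addFamily(new HColumnDescriptor("f1"));
        admin.createTable(descriptor);

        // Equivalent of: put 't1', 'r1', 'f1', 'v1'
        HTable table = new HTable(conf, "t1");
        Put put = new Put(Bytes.toBytes("r1"));
        put.add(Bytes.toBytes("f1"), Bytes.toBytes(""), Bytes.toBytes("v1"));
        table.put(put);

        // Equivalent of: get 't1', 'r1'
        Result result = table.get(new Get(Bytes.toBytes("r1")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("f1"), Bytes.toBytes(""))));
        table.close();
    }
}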

Back Up and Restore HBase

Amazon EMR provides the ability to back up your HBase data to Amazon S3, either manually or on an automated schedule. You can perform both full and incremental backups. Once you have a backed-up version of HBase data, you can restore that version to an HBase cluster. You can restore to an HBase cluster that is currently running, or launch a new cluster prepopulated with backed-up data.

During the backup process, HBase continues to execute write commands. Although this ensures that your cluster remains available throughout the backup, there is the risk of inconsistency between the data being backed up and any write operations being executed in parallel. To understand the inconsistencies that might arise, you have to consider that HBase distributes write operations across the nodes in its cluster. If a write operation happens after a particular node is polled, that data will not be included in the backup archive. You may even find that earlier writes to the HBase cluster (sent to a node that has already been polled) might not be in the backup archive, whereas later writes (sent to a node before it was polled) are included.

If a consistent backup is required, you must pause writes to HBase during the initial portion of the backup process (synchronization across nodes). You can do this by specifying the --consistent flag when requesting a backup. With this flag, writes during this period will be queued and executed as soon as the synchronization completes. You can also schedule recurring backups, which will resolve any inconsistencies over time, as data that is missed on one backup pass will be backed up on the following pass.

When you back up HBase data, you should specify a different backup directory for each cluster. An easy way to do this is to use the job flow identifier as part of the path specified for the backup directory. For example, s3://mybucket/backups/j-ABABABABAB. This ensures that any future incremental backups reference the correct HBase cluster.

When you are ready to delete old backup files that are no longer needed, we recommend that you first do a full backup of your HBase data. This ensures that all data is preserved and provides a baseline for future incremental backups. Once the full backup is done, you can navigate to the backup location and manually delete the old backup files.

Back Up and Restore HBase Using the Console

The console provides the ability to launch a new HBase cluster and populate it with data from a previous backup of an HBase cluster. It also gives you the ability to schedule periodic incremental backups of a new HBase cluster. Additional backup and restore functionality, such as the ability to restore data to an already running cluster, do manual backups, and schedule automated full backups, is available using the Amazon EMR CLI. For more information, see Back Up and Restore HBase Using the CLI (p. 168).

To populate a new HBase cluster with archived data using the console

1. On the SPECIFY PARAMETERS pane of the Create a New Job Flow wizard, set Restore from Backup to Yes. For more information about the Create a New Job Flow wizard, see Launch an HBase Cluster on Amazon EMR (p. 156).

2. Specify the location of the backup you wish to load into the new HBase cluster in Backup Location. This should be an Amazon S3 URL of the form s3://myawsbucket/backups/.

3. You have the option to specify the name of a backup version to load by setting a value for Backup Version. If you do not set a value for Backup Version, Amazon EMR loads the latest backup in the specified location.


To schedule automated backups of HBase data using the console

1. On the SPECIFY PARAMETERS pane of the Create a New Job Flow wizard, set Schedule Regular Backups to Yes. For more information about the Create a New Job Flow wizard, see Launch an HBase Cluster on Amazon EMR (p. 156).

2. Specify whether the backups should be consistent. A consistent backup is one which pauses write operations during the initial backup stage (synchronization across nodes). Any write operations thus paused are placed in a queue and resume when synchronization completes.

3. Set how often backups should occur by entering a number for Backup Frequency and selecting Days, Hours, or Minutes from the drop-down box. The first automated backup that runs will be a full backup; after that, Amazon EMR will save incremental backups based on the schedule you specify.

4. Specify the location in Amazon S3 where the backups should be stored. Each HBase cluster should be backed up to a separate location in Amazon S3 to ensure that incremental backups are calculated correctly.

5. Specify when the first backup should occur by setting a value for Backup Start Time. You can set this to now, which causes the first backup to start as soon as the cluster is running, or enter a date and time in ISO format. For example, 2012-06-15T20:00Z would set the start time to June 15, 2012 at 8 p.m. UTC.


Back Up and Restore HBase Using the CLI

Running HBase on Amazon EMR provides many ways to back up your data: you can create full or incremental backups, run backups manually, and schedule automatic backups. The following table lists all the flags and parameters you can set in order to back up HBase data. Following the table are examples of commands that use these flags and parameters to back up data in various ways.

• --backup-dir: The directory where a backup exists or should be created.

• --backup-version: (Optional) Specifies the version number of an existing backup to restore. If the backup version is not specified in a restore operation, Amazon EMR uses the latest backup, as determined by lexicographical order. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z.

• --consistent: (Optional) Pauses all write operations to the HBase cluster during the backup process, to ensure a consistent backup.

• --disable-full-backups: Turn off scheduled full backups by passing this flag into a call with --hbase-schedule-backup.

• --disable-incremental-backups: Turn off scheduled incremental backups by passing this flag into a call with --hbase-schedule-backup.

• --full-backup-time-interval: An integer that specifies the period of time units to elapse between automated full backups of the HBase cluster. Used with --hbase-schedule-backup, this parameter creates regularly scheduled full backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --full-backup-time-unit.

• --full-backup-time-unit: The unit of time to use with --full-backup-time-interval to specify how often automatically scheduled backups should run. This can take any one of the following values: minutes, hours, days.

• --hbase-backup: Create a one-time backup of HBase data to the location specified by --backup-dir. If the --incremental flag is also set, it creates an incremental backup; otherwise a full backup is created.

• --hbase-restore: Restore a backup from the location specified by --backup-dir and (optionally) the version specified by --backup-version.

• --hbase-schedule-backup: Schedule an automated backup of HBase data. This can set an incremental backup, a full backup, or both, depending on the flags used to set the intervals and time units. The first backup in the schedule begins immediately unless a value is specified by --start-time.

• --incremental: When used with --hbase-backup, it creates a one-time incremental backup. If no previous backup exists at that location, Amazon EMR creates a full backup. This backup does not pause writes to the HBase cluster and as such, may be inconsistent.

• --incremental-backup-time-interval: An integer that specifies the period of time units to elapse between automated incremental backups of the HBase cluster. Used with --hbase-schedule-backup, this parameter creates regularly scheduled incremental backups. If this period schedules a full backup at the same time as an incremental backup is scheduled, only the full backup is created. Used with --incremental-backup-time-unit.

• --incremental-backup-time-unit: The unit of time to use with --incremental-backup-time-interval to specify how often automatically scheduled incremental backups should run. This can take any one of the following values: minutes, hours, days.

• --start-time: (Optional) Specifies the time that a backup schedule should start. If this is not set, the first backup begins immediately. This should be in ISO date-time format. You can use this to ensure your first data load process has completed before performing the initial backup or to have the backup occur at a specific time each day.


To manually back up HBase data

• Run --hbase-backup in the CLI and specify the job flow and the backup location in Amazon S3. Amazon EMR tags the backup with a name derived from the time the backup was launched. This is in the format YYYYMMDDTHHMMSSZ, for example: 20120809T031314Z. If you want to label your backups with another name, you can create a location in Amazon S3 (such as backups in the example below) and use the location name as a way to tag the backup files.

The following example backs up the HBase data to s3://myawsbucket/backups, with the timestamp as the version name. This backup does not pause writes to the HBase cluster and as such, may be inconsistent.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example backs up data to the same location, but sets a value for the version name. You can use the version name later to specify which backup to restore to an HBase cluster. This backup does not pause writes to the HBase cluster and as such, may be inconsistent.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example creates an incremental backup based on the most recent backup in the location set by --backup-dir. If no previous backup exists at that location, Amazon EMR creates a full backup. This backup does not pause writes to the HBase cluster and as such, may be inconsistent.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
  --incremental --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example backs up data, and uses the --consistent flag to enforce backup consistency. This flag causes all writes to the HBase cluster to pause during the backup.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-backup \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
  --consistent

To schedule automated backups of HBase data

1. Call --hbase-schedule-backup on the HBase job flow and specify the backup time interval and units. If you do not specify a start time, the first backup starts immediately. The following example creates a weekly full backup, with the first backup starting immediately.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Full Backups" \
  --hbase-schedule-backup \
  --full-backup-time-interval 7 --full-backup-time-unit days \
  --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a weekly full backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Full Backups" \
  --hbase-schedule-backup \
  --full-backup-time-interval 7 --full-backup-time-unit days \
  --backup-dir s3://mybucket/backups/j-ABABABABABA \
  --start-time 2012-06-15T20:00Z

The following example creates a daily incremental backup. The first incremental backup will begin immediately.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Incremental Backups" \
  --hbase-schedule-backup \
  --incremental-backup-time-interval 24 \
  --incremental-backup-time-unit hours \
  --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates a daily incremental backup, with the first backup starting on 15 June 2012, 8 p.m. UTC time.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Incremental Backups" \
  --hbase-schedule-backup \
  --incremental-backup-time-interval 24 \
  --incremental-backup-time-unit hours \
  --backup-dir s3://mybucket/backups/j-ABABABABABA \
  --start-time 2012-06-15T20:00Z

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting immediately. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Incremental Backups" \
  --hbase-schedule-backup \
  --full-backup-time-interval 7 \
  --full-backup-time-unit days \
  --incremental-backup-time-interval 24 \
  --incremental-backup-time-unit hours \
  --backup-dir s3://mybucket/backups/j-ABABABABABA

The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Incremental Backups" \
  --hbase-schedule-backup \
  --full-backup-time-interval 7 \
  --full-backup-time-unit days \
  --incremental-backup-time-interval 24 \
  --incremental-backup-time-unit hours \
  --backup-dir s3://mybucket/backups/j-ABABABABABA \
  --start-time 2012-06-15T20:00Z

2. The following example creates both a weekly full backup and a daily incremental backup, with the first full backup starting on June 15, 2012. Each time the schedule has the full backup and the incremental backup scheduled for the same time, only the full backup will run. The --consistent flag is set, so both the incremental and full backups will pause write operations during the initial portion of the backup process to ensure data consistency.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Schedule Periodic Incremental Backups" \
  --hbase-schedule-backup \
  --full-backup-time-interval 7 \
  --full-backup-time-unit days \
  --incremental-backup-time-interval 24 \
  --incremental-backup-time-unit hours \
  --backup-dir s3://mybucket/backups/j-ABABABABABA \
  --start-time 2012-06-15T20:00Z \
  --consistent

To turn off automated backups

• Call the job flow with the --hbase-schedule-backup parameter and set the --disable-full-backups or --disable-incremental-backups flag, or both flags. The following example turns off full backups.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Disable Periodic Full Backups" \
  --hbase-schedule-backup --disable-full-backups

The following example turns off incremental backups.


elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Disable Periodic Incremental Backups" \
  --hbase-schedule-backup --disable-incremental-backups

The following example turns off both full and incremental backups.

elastic-mapreduce --jobflow j-ABABABABABA \
  --name "Disable All Periodic Backups" \
  --hbase-schedule-backup --disable-full-backups \
  --disable-incremental-backups

To restore data to a running HBase cluster

• Run an --hbase-restore step and specify the job flow, the backup location in Amazon S3, and (optionally) the name of the backup version. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This is the version with the name that is lexicographically greatest.

The following example restores the HBase cluster to the latest version of backup data stored in s3://myawsbucket/backups, overwriting any data stored in the HBase cluster.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example restores the HBase cluster to the specified version of backup data stored in s3://myawsbucket/backups, overwriting any data stored in the HBase cluster.

elastic-mapreduce --jobflow j-ABABABABABA --hbase-restore \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
  --backup-version 20120809T031314Z

To populate a new HBase cluster with archived data

• Add --hbase-restore and --backup-dir to the --create step in the CLI.

You can optionally specify --backup-version to indicate which version in the backup directory to load. If you do not specify a value for --backup-version, Amazon EMR loads the last version in the backup directory. This will either be the version with the name that is lexicographically last or, if the version names are based on timestamps, the latest version.

The following example creates a new HBase cluster and loads it with the latest version of data in s3://myawsbucket/backups/j-ABABABABABA.


elastic-mapreduce --create --name "My HBase Restored" \
  --hbase --hbase-restore \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA

This example creates a new HBase cluster and loads it with the specified version of data in s3://myawsbucket/backups/j-ABABABABABA.

elastic-mapreduce --create --name "My HBase Restored" \
  --hbase --hbase-restore \
  --backup-dir s3://myawsbucket/backups/j-ABABABABABA \
  --backup-version 20120809T031314Z

Terminate an HBase Cluster

Amazon EMR launches HBase clusters with termination protection turned on. This prevents the cluster from being terminated inadvertently or in the case of an error. Before you terminate the cluster, you must first disable termination protection. For more information, see Terminating a Protected Job Flow (p. 140).

Configure HBase

Although the default settings should work for most applications, you have the flexibility to modify your HBase configuration settings. To do this, you run one of two bootstrap action scripts:

• configure-hbase-daemons—Configures properties of the master, regionserver, and zookeeper daemons. These properties include heap size and options to pass to the Java Virtual Machine (JVM) when the HBase daemon starts. You set these properties as arguments in the bootstrap action. This bootstrap action modifies the /home/hadoop/conf/hbase-user-env.sh configuration file on the HBase cluster.

• configure-hbase—Configures HBase site-specific settings such as the port the HBase master should bind to and the maximum number of times the client should retry an action. You can set these one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3. This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster.

Note
These scripts, like other bootstrap actions, can only be run when the job flow is created; you cannot use them to change the configuration of an HBase cluster that is currently running.

When you run the configure-hbase or configure-hbase-daemons bootstrap actions, the values you specify override the default values. Any values you don't explicitly set receive the default values.

Configuring HBase with these bootstrap actions is analogous to using bootstrap actions in Amazon EMR to configure Hadoop settings and Hadoop daemon properties. The difference is that HBase does not have per-process memory options. Instead, memory options are set using the --daemon-opts argument, where daemon is replaced by the name of the daemon to configure.


Configure HBase Daemons

Amazon EMR provides a bootstrap action, s3://region.elasticmapreduce/bootstrap-actions/setup-hbase/configure-hbase-daemons, that you can use to change the configuration of HBase daemons, where region is the region into which you're launching your HBase cluster, for example s3://us-west-1.elasticmapreduce/bootstrap-actions/setup-hbase/configure-hbase-daemons. For a list of regions supported by Amazon EMR, see Choose a Region (p. 17). The bootstrap action can only be run when the HBase job flow is launched.

To Configure HBase Daemons

• Add a bootstrap action, configure-hbase-daemons, when you launch the HBase job flow. You can use this bootstrap action to configure one or more daemons.

The following example creates a new HBase cluster and uses the configure-hbase-daemons bootstrap action to set values for zookeeper-opts and hbase-master-opts, which configure the options used by the zookeeper and master node components of the HBase cluster.

elastic-mapreduce --create --hbase --name "My HBase Cluster" \
  --bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase-daemons \
  --args '--zookeeper-opts=-Xmx1000M -XX:GCTimeRatio=19,--hbase-master-opts=-Xmx4000M'

Note
When you specify the arguments for this bootstrap action, you must put single quotes around the value of the --args property, to keep the shell from breaking the arguments up. You must also include a space character between JVM arguments; in the example above there is a space between -Xmx1000M and -XX:GCTimeRatio=19.

You can set the following properties with the configure-hbase-daemons bootstrap action.

• hbase-master-opts: Options that control how the JVM runs the master daemon. If set, these settings override the default HBASE_MASTER_OPTS variables.

• regionserver-opts: Options that control how the JVM runs the region server daemon. If set, these settings override the default HBASE_REGIONSERVER_OPTS variables.

• zookeeper-opts: Options that control how the JVM runs the zookeeper daemon. If set, these settings override the default HBASE_ZOOKEEPER_OPTS variables.

For more information about these options, go to http://hbase.apache.org/configuration.html#hbase.env.sh.


Configure HBase Site Settings

Amazon EMR provides a bootstrap action, s3://region.elasticmapreduce/bootstrap-actions/setup-hbase/configure-hbase, that you can use to change the configuration of HBase, where region is the region into which you're launching your HBase cluster, for example s3://us-west-1.elasticmapreduce/bootstrap-actions/setup-hbase/configure-hbase. For a list of all the regions supported by Amazon EMR, see Choose a Region (p. 17). You can set configuration values one-by-one, as arguments in the bootstrap action, or you can specify the location of an XML configuration file in Amazon S3.

Setting them one-by-one is useful if you only need to set a few configuration settings. Setting them using an XML file is useful if you have many changes to make, or if you want to save your configuration settings for reuse.

This bootstrap action modifies the /home/hadoop/conf/hbase-site.xml configuration file on the HBase cluster. The bootstrap action can only be run when the HBase job flow is launched.

For a complete list of the HBase site settings that you can configure, go to http://hbase.apache.org/configuration.html#hbase.site.

To specify individual site settings

• Set the configure-hbase bootstrap action when you launch the HBase job flow, and specify the values within hbase-site.xml to change. The following example illustrates how to change the hbase.hregion.max.filesize setting.

elastic-mapreduce --create --hbase --name "My HBase Cluster" \
  --bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args -s,hbase.hregion.max.filesize=52428800
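If you launch the job flow with the AWS SDK for Java instead of the CLI, the same site setting can be supplied as arguments to the configure-hbase bootstrap action. The following is a minimal sketch; it shows only the additional bootstrap action and assumes the rest of the HBase job flow request (instance settings, the setup-hbase bootstrap action, and the start-master step) is built as described earlier. The way the arguments are split here mirrors the -s,name=value form used by the CLI example above.

// Classes are in com.amazonaws.services.elasticmapreduce.model.
// Override hbase.hregion.max.filesize through the configure-hbase bootstrap action.
BootstrapActionConfig configureHBase = new BootstrapActionConfig()
    .withName("Configure HBase")
    .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
        .withPath("s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase")
        .withArgs("-s", "hbase.hregion.max.filesize=52428800"));

// Add it alongside the other HBase bootstrap actions when building the RunJobFlowRequest,
// for example: runJobFlowRequest.withBootstrapActions(installHBase, configureHBase);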

To specify the site settings with an XML file

• Create a custom version of hbase-site.xml. Your custom file must be valid XML. To reduce the chance of introducing errors, start with the default copy of hbase-site.xml, located on the Amazon EMR HBase master node at /home/hadoop/conf/hbase-site.xml, and edit a copy of that file instead of creating a file from scratch. You can give your new file a new name, or leave it as hbase-site.xml.

Upload your custom hbase-site.xml file to an Amazon S3 bucket. It should have permissions set so the AWS account that launches the job flow can access the file. If the AWS account launching the job flow also owns the Amazon S3 bucket, it will have access.

Set the configure-hbase bootstrap action when you launch the HBase job flow, and pass in the location of your custom hbase-site.xml file.

The following example sets the HBase site configuration values to those specified in the file s3://myawsbucket/my-hbase-site.xml.

elastic-mapreduce --create --hbase --name "My HBase Cluster" \
  --bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args --site-config-file s3://myawsbucket/my-hbase-site.xml

HBase Site Settings to Optimize

You can set any or all of the HBase site settings to optimize the HBase cluster for your application's workload. We recommend the following settings as a starting point in your investigation.

zookeeper.session.timeout

The default timeout is three minutes (180000 ms). If a region server crashes, this is how long it takes the master server to notice the absence of the region server and start recovery. If you want the master server to recover faster, you can reduce this value to a shorter time period. The following example uses one minute, or 60000 ms.

--bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args "zookeeper.session.timeout=60000"

hbase.regionserver.handler.count

This defines the number of threads the region server keeps open to serve requests to tables. The default of 10 is low, in order to prevent users from killing their region servers when using large write buffers with a high number of concurrent clients. The rule of thumb is to keep this number low when the payload per request approaches the MB range (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The following example raises the number of open threads to 30.

--bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args "hbase.regionserver.handler.count=30"

hbase.hregion.max.filesize

This parameter governs the size, in bytes, of the individual regions. By default, it is set to 256 MB. If you are writing a lot of data into your HBase cluster and it's causing frequent splitting, you can increase this size to make individual regions bigger. It will reduce splitting, but it will take more time to load balance regions from one server to another.

--bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args "hbase.hregion.max.filesize=1073741824"

hbase.hregion.memstore.flush.size

This parameter governs the maximum size of memstore, in bytes, before it is flushed to disk. By default it is 64 MB. If your workload consists of short bursts of write operations, you might want to increase this limit so all writes stay in memory during the burst and get flushed to disk later. This can boost performance during bursts.


--bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase \
  --args "hbase.hregion.memstore.flush.size=134217728"

Access HBase Data with Hive

To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive. Running HBase and Hive separately can improve performance because this allows HBase to fully utilize the cluster resources.

Although it is not recommended for most use cases, you can also run Hive and HBase on the same cluster.

A copy of HBase is installed on the AMI with Hive to provide connection infrastructure to access your HBase cluster. The following sections show how to use the client portion of the copy of HBase on your Hive job flow to connect to HBase on another cluster.

The connection between the Hive and HBase clusters is structured as shown in the following diagram.

You can use Hive to connect to HBase and manipulate data, performing such actions as exporting data to Amazon S3, importing data from Amazon S3, and querying HBase data.

Note
You can only connect your Hive job flow to a single HBase cluster.

To connect Hive to HBase

1. Create an interactive Hive job flow. Use Hive version 0.7 or later and AMI version 2.0.4 or later. The following example shows how to launch such a job flow using the Amazon EMR CLI.


elastic-mapreduce --create --alive --instance-type m2.4xlarge --num-instances 4 \
  --hive-interactive --hive-versions latest

2. Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 111).

3. Launch the Hive shell with the following command.

hive

4. Connect the HBase client on your Hive job flow to the HBase cluster that contains your data. In the following example, public-DNS-name is replaced by the public DNS name of the master node of the HBase cluster, for example: ec2-50-19-76-67.compute-1.amazonaws.com. For more information, see To locate the public DNS name of the master node using the Amazon EMR console (p. 111).

set hbase.zookeeper.quorum=public-DNS-name;

To access HBase data from Hive

• After the connection between the Hive and HBase clusters has been made (as shown in the previous procedure), you can access the data stored on the HBase cluster by creating an external table in Hive.

The following example, when run from the Hive prompt, creates an external table that references data stored in an HBase table called inputTable. You can then reference inputTable in Hive statements to query and modify data stored in the HBase cluster.

add jar lib/emr-metrics-1.0.jar ;
add jar lib/protobuf-java-2.4.0.jar ;

set hbase.zookeeper.quorum=ec2-107-21-163-157.compute-1.amazonaws.com ;

create external table inputTable (key string, value string)
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties ("hbase.columns.mapping" = ":key,fam1:col1")
    tblproperties ("hbase.table.name" = "inputTable");

select count(*) from inputTable ;


View the HBase User Interface

HBase provides a web-based user interface that you can use to monitor your HBase cluster. When you run HBase on Amazon EMR, the web interface runs on the master node and can be viewed using port forwarding, also known as creating an SSH tunnel.

To view the HBase User Interface

1. Use SSH to tunnel into the master node and create a secure connection. For information on how to create an SSH tunnel to the master node, see Open an SSH Tunnel to the Master Node (p. 116).

2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for domains of the type *ec2*.amazonaws.com*. For a tutorial on how to do this, see Configure FoxyProxy to View Websites Hosted on the Master Node (p. 117).

3. With the proxy set and the SSH connection open, you can view the HBase UI by opening a browser window with http://master-public-dns-name:60010/master-status, where master-public-dns-name is the public DNS address of the master server in the HBase job flow. For information on how to locate the public DNS name of a master node, see To locate the public DNS name of the master node using the Amazon EMR console (p. 111).

View HBase Log Files

As part of its operation, HBase writes log files with details about configuration settings, daemon actions, and exceptions. These log files can be useful for debugging issues with HBase as well as for tracking performance.

If you configure your job flow to persist log files to Amazon S3, you should know that logs are written to Amazon S3 every five minutes, so there may be a slight delay for the latest log files to be available.

To view HBase logs on the master node

• You can view the current HBase logs by using SSH to connect to the master node and navigating to the /mnt/var/log/hbase directory. These logs will not be available after the job flow ends. For information about how to connect to the master node using SSH, see Connect to the Master Node Using SSH (p. 111). After you have connected to the master node using SSH, you can navigate to the log directory using a command like the following.

cd /mnt/var/log/hbase
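Once in that directory, you can list the logs and follow one as it is written. The exact log file names vary by cluster, so the pattern below is only illustrative.

# List the HBase log files and follow them as new entries arrive
ls /mnt/var/log/hbase
tail -f /mnt/var/log/hbase/*.log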

To view HBase logs on Amazon S3

• To access HBase logs and other job-flow logs on Amazon S3, and to have them available after the job flow ends, you must specify an Amazon S3 bucket to receive these logs when you create the job flow. This is done using the --log-uri option, as shown in the following example.

elastic-mapreduce --create --alive --name "$USER's HBase Test" \
  --num-instances 2 \
  --instance-type m1.xlarge \
  --bootstrap-action s3://beta.elasticmapreduce/hbase-beta/install-hbase-stage-1 \
    --args s3://beta.elasticmapreduce/hbase-beta \
  --bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args -m,dfs.support.append=true \
  --log-uri s3://myawsbucket/logfiles

Monitor HBase with CloudWatch

Amazon EMR reports three metrics to Amazon CloudWatch that you can use to monitor your HBase backups. These metrics are pushed to Amazon CloudWatch at five-minute intervals, and are provided without charge. For more information about using Amazon CloudWatch to monitor Amazon EMR metrics, see Monitor Metrics with Amazon CloudWatch (p. 209).

Metric: HBaseBackupFailed
Description: Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Count

Metric: HBaseMostRecentBackupDuration
Description: The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes since the backup started. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Minutes

Metric: HBaseTimeSinceLastSuccessfulBackup
Description: The number of elapsed minutes since the last successful HBase backup started on your cluster. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Minutes
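As an illustration, you could retrieve one of these metrics with the Amazon CloudWatch command line tools. This sketch assumes the CloudWatch CLI is installed and configured, that the metrics are published in the AWS/ElasticMapReduce namespace with a JobFlowId dimension, and that j-ABABABABABAB is replaced with your job flow identifier.

# Retrieve the HBaseBackupFailed metric in five-minute periods for one day
mon-get-stats HBaseBackupFailed \
  --namespace "AWS/ElasticMapReduce" \
  --dimensions "JobFlowId=j-ABABABABABAB" \
  --statistics Maximum \
  --period 300 \
  --start-time 2012-06-01T00:00:00 \
  --end-time 2012-06-02T00:00:00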

Monitor HBase with Ganglia

The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your job flow, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. For more information about the Ganglia open-source project, go to http://ganglia.info/. For more information about using Ganglia with Amazon EMR job flows, see Monitor Performance with Ganglia (p. 220).

You can install Ganglia on an Amazon EMR job flow by calling two bootstrap actions. The first, install-ganglia, installs Ganglia. The second, configure-hbase-for-ganglia, configures HBase to publish metrics to Ganglia.


Note
You must specify these bootstrap actions when you launch the HBase cluster; Ganglia reporting cannot be added to an HBase cluster that is already running.

Once the HBase job flow has been launched with Ganglia reporting configured, you can use port forwarding to access the Ganglia graphs and reports.

Ganglia also stores log files on the server at /mnt/var/log/ganglia/rrds. If you configured your job flow to persist log files to an Amazon S3 bucket, the Ganglia log files will be persisted there as well.

To configure an HBase job flow for Ganglia

• Launch the job flow and specify both the install-ganglia and configure-hbase-for-ganglia bootstrap actions. This is shown in the following example, where you would replace us-east-1 with the region where your HBase cluster was launched. For a list of regions supported by Amazon EMR, see Choose a Region (p. 17).

elastic-mapreduce --create --hbase --name "My HBase Cluster" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
  --bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase-for-ganglia

To view HBase metrics in the Ganglia web interface

1. Use SSH to tunnel into the master node and create a secure connection. For information on how to create an SSH tunnel to the master node, see Open an SSH Tunnel to the Master Node (p. 116).

2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for domains of the type *ec2*.amazonaws.com*. For a tutorial on how to do this, see Configure FoxyProxy to View Websites Hosted on the Master Node (p. 117).

3. With the proxy set and the SSH connection open, you can view the Ganglia metrics by opening a browser window with http://master-public-dns-name/ganglia/, where master-public-dns-name is the public DNS address of the master server in the HBase job flow. For information on how to locate the public DNS name of a master node, see To locate the public DNS name of the master node using the Amazon EMR console (p. 111).

To view Ganglia log files on the master node

• If the job flow is still running, you can access the log files by using SSH to connect to the master node and navigating to the /mnt/var/log/ganglia/rrds directory. For information about how to use SSH to connect to the master node, see Connect to the Master Node Using SSH (p. 111).

To view Ganglia log files on Amazon S3

• If you configured the job flow to persist log files to Amazon S3 when you launched it, the Ganglia log files will be written there as well. Logs are written to Amazon S3 every five minutes, so there may be a slight delay before the latest log files are available. For more information, see View HBase Log Files (p. 180).


Troubleshooting

Topics

• Things to Check When Your Amazon EMR Job Flow Fails (p. 183)

• Amazon EMR Logging (p. 187)

• Enable Logging and Debugging (p. 187)

• Use Log Files (p. 190)

• Monitor Hadoop on the Master Node (p. 199)

• View the Hadoop Web Interfaces (p. 200)

• Troubleshooting Tips (p. 204)

Things to Check When Your Amazon EMR Job Flow Fails

There are many reasons why a job flow might fail. The following lists the most common issues and how you can fix them.

Does your path to Amazon Simple Storage Service (Amazon S3) have at least three slashes?

When you specify an Amazon S3 bucket, you must include a terminating slash at the end of the URL. For example, instead of referencing a bucket as “s3n://myawsbucket”, you should use “s3n://myawsbucket/”; otherwise, Hadoop will fail your job flow in most cases.
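For example, using the streaming --output option as an illustration (the bucket name is a placeholder):

# Likely to fail: no terminating slash
--output s3n://myawsbucket

# Correct: terminating slash included
--output s3n://myawsbucket/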

Are you trying to recursively traverse input directories?

Hadoop does not recursively search input directories for files. If you have a directory structure such as /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt, etc., and you specify /corpus/ as the input parameter to your job flow, Hadoop will not find any input files because the /corpus/ directory is empty and Hadoop does not check the contents of the subdirectories. Similarly, Hadoop does not recursively check the subdirectories of Amazon S3 buckets.

The input files must be directly in the input directory or Amazon S3 bucket you specify, not in subdirectories.
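As an illustration (the bucket and paths below are placeholders), the first form locates the files because it points at the directory that directly contains them, while the second does not:

# Finds the input files: points directly at the directory containing them
--input s3n://myawsbucket/corpus/01/

# Finds no input files: /corpus/ itself contains only subdirectories
--input s3n://myawsbucket/corpus/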

Does your output directory already exist?

If you specify an output path that already exists, Hadoop will fail the job flow in most cases. This means that if you run a job flow once and then run it again with exactly the same parameters, it will likely work the first time and then never again; after the first run, the output path exists and causes all successive runs to fail.
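One way to avoid this, sketched below with placeholder paths, is to remove the output left behind by a previous run before resubmitting the step, or to give each run its own output path.

# Remove the HDFS output directory from a previous run (run on the master node)
hadoop fs -rmr hdfs:///examples/output

# Alternatively, write each run to a fresh output location
--output s3n://myawsbucket/output/run-2/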

Are you trying to specify a resource using an HTTP URL?

Hadoop does not accept resource locations specified using the http:// prefix. You cannot reference a resource using an HTTP URL. For example, passing in http://mysite/myjar.jar as the JAR parameter will cause the job flow to fail. For more information about how to reference files in Amazon Elastic MapReduce (Amazon EMR), go to File System Configuration (p. 338).


Are you referencing an Amazon S3 bucket using an invalid name format?

If you attempt to use a bucket name such as “myawsbucket.1”, your job flow will fail because Amazon EMR requires bucket names to be valid RFC 2396 host names; the name cannot end with a number. In addition, because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-). For more information about how to format Amazon S3 bucket names, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

Are you passing in invalid streaming arguments?

Hadoop streaming supports only the following arguments. If you pass in arguments other than those listed below, the job flow will fail.

-blockAutoGenerateCacheFiles
-cacheArchive
-cacheFile
-cmdenv
-combiner
-debug
-input
-inputformat
-inputreader
-jobconf
-mapper
-numReduceTasks
-output
-outputformat
-partitioner
-reducer
-verbose

In addition, Hadoop streaming only recognizes arguments passed in using Java syntax; that is, preceded by a single hyphen. If you pass in arguments preceded by a double hyphen, the job flow will fail.
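For example, when passing arguments to Hadoop streaming itself (as opposed to options of the elastic-mapreduce command), the single-hyphen form below is recognized and the double-hyphen form is not. The paths and script names are placeholders.

# Recognized by Hadoop streaming (Java-style, single hyphen)
-input s3n://myawsbucket/input/ -output s3n://myawsbucket/output/ -mapper mapper.py -reducer reducer.py

# Fails the job flow (double hyphens)
--input s3n://myawsbucket/input/ --output s3n://myawsbucket/output/ --mapper mapper.py --reducer reducer.py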

Are you passing the correct credentials into SSH?

If you are unable to SSH into the master node, it is most likely an issue with your security credentials.

First, check that the .pem file containing your SSH key has the proper permissions. You can use chmod to change the permissions on your .pem file, as shown in the following example, where you would replace mykey.pem with the name of your own .pem file.

chmod og-rwx mykey.pem

The second possibility is that you are not using the key pair you specified when you created the job flow. This is easy to do if you have created multiple key pairs. Check the job flow details in the Amazon EMR console (or --describe from the CLI) for the name of the key pair that was specified when the job flow was created.


Once you have verified that you are using the correct key pair and that permissions are set correctly on the .pem file, you can use the following command to SSH into the master node, where you would replace mykey.pem with the name of your .pem file and master-public-dns-name with the public DNS name of the master node (available through --describe in the CLI or through the Amazon EMR console).

ssh -i mykey.pem hadoop@master-public-dns-name

For step-by-step instructions on how to set up your credentials and SSH into the master node, go to How to View Logs Using SSH.

Are you using a custom JAR when running DistCp?

You cannot run DistCp by specifying a JAR residing on the AMI. Instead, you should use the samples/distcp/distcp.jar file in the elasticmapreduce Amazon S3 bucket. The following example shows how to call the Amazon Elastic MapReduce (Amazon EMR) version of DistCp. Replace j-ABABABABABAB with the identifier of your job flow.

Note
DistCp is deprecated on Amazon EMR; we recommend that you use S3DistCp instead. For more information about S3DistCp, see Distributed Copy Using S3DistCp (p. 227).

elastic-mapreduce --jobflow j-ABABABABABAB \
  --jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
  --arg s3n://elasticmapreduce/samples/wordcount/input \
  --arg hdfs:///samples/wordcount/input

If you are using IAM, do you have the proper Amazon EC2 policies set?

Because Amazon EMR uses EC2 instances as nodes, IAM users of Amazon EMR also need to have certain Amazon EC2 policies set in order for Amazon EMR to be able to manage those instances on the IAM user's behalf. If you do not have the required permissions set, Amazon EMR returns the error: "User account is not authorized to call EC2."

For a list of the Amazon EC2 policies your IAM account needs to have set in order to run Amazon EMR, go to Example Policies for Amazon EMR (p. 277).

Do you have enough HDFS space for your job flow?

If you do not, Amazon EMR will return the following error: "Cannot replicate block, only managed to replicate to zero nodes." This error occurs when you generate more data in your job flow than can be stored in HDFS. You will see this error only while the job flow is running, because when the job flow ends, the HDFS space it was using is released.

The amount of HDFS space available to a job flow depends on the number and type of EC2 instances that are used as core nodes. All of the disk space on each EC2 instance is available to be used by HDFS. For more information on the amount of local storage for each EC2 instance type, go to Instance Types and Families in the Amazon Elastic Compute Cloud User Guide.


The other factor that can affect the amount of HDFS space available is the replication factor, which is the number of copies of each data block that are stored in HDFS for redundancy. The replication factor increases with the number of nodes in the job flow: there are 3 copies of each data block for a job flow with 10 or more nodes, 2 copies of each block for a job flow with 4 to 9 nodes, and 1 copy (no redundancy) for job flows with 3 or fewer nodes. The total HDFS space available is divided by the replication factor. In some cases, such as increasing the number of nodes from 9 to 10, the increase in replication factor can actually cause the amount of available HDFS space to decrease.

For example, a job flow with ten core nodes of type m1.large would have 2833 GB of space available to HDFS ((10 nodes x 850 GB per node) / replication factor of 3).

If your job flow exceeds the amount of space available to HDFS, you can add additional core nodes to your cluster or use data compression to create more HDFS space. If your job flow is one that can be stopped and restarted, you may consider using core nodes of a larger Amazon EC2 instance type. You might also consider adjusting the replication factor. Be aware, though, that decreasing the replication factor reduces the redundancy of HDFS data and your job flow's ability to recover from lost or corrupted HDFS blocks.
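To see how much HDFS space a running job flow actually has, one option (a sketch, run from the master node after connecting with SSH) is to ask HDFS for a capacity report:

# Reports configured capacity, DFS used, and DFS remaining for the cluster and each datanode
hadoop dfsadmin -report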

Have you checked the log files for clues?

Log files can provide insight into why a job flow failed. There are two options for viewing the log files: directly on the EC2 instances while the job flow is running, or from an Amazon S3 bucket if you launch the job flow with log-file archiving enabled.

While the job flow is running, each EC2 instance (or node) gathers Hadoop and Amazon EMR logs in the /mnt/var/log directory. The Hadoop logs include the daemon logs and the task and attempt logs. Amazon EMR logs include step logs, bootstrap action logs, and more. Because these log files are stored on EC2 instances, which are terminated when the job flow ends, you can only access them while the job flow is running. This can be problematic if your job flow terminates suddenly and you need the log file information to analyze what went wrong.

To make sure that log-file information is available even after the job flow ends, you can instruct Amazon EMR to periodically copy the log files from /mnt/var/log to an Amazon S3 bucket. This incurs Amazon S3 storage fees, but can be a great help in figuring out why a job flow failed. It's a best practice to have Amazon S3 log-file archiving turned on during your Hadoop application development and testing, and to archive logs for a representative portion of your data on production job flows.

If you want Amazon EMR to archive logs to Amazon S3, you must turn this on when you launch the job flow; it is not enabled by default and cannot be added to an already running job flow.

Note that there is a delay between when a node writes a log file and when Amazon EMR propagates it to the Amazon S3 bucket. This means it may take up to five minutes after a job flow ends for Amazon EMR to push the final logs to the Amazon S3 bucket.

For information about how to archive job flow logs to Amazon S3, see Enable Logging and Debugging (p. 187). To learn more about how to view and interpret the log files, see Use Log Files (p. 190).

Have your job flows finished terminating?

Depending on the configuration of the job flow, it may take 5-20 minutes for the job flow to terminate and release allocated resources, such as Amazon EC2 instances. If you get an EC2 QUOTA EXCEEDED error when you attempt to launch a job flow, it may be because resources from a recently terminated job flow have not yet been released. In this case, you can either request that your Amazon EC2 quota be increased, or you can wait twenty minutes and re-launch the job flow.


Amazon EMR Logging

There are two types of logs that store information about your job flow: step-level logs generated by Amazon Elastic MapReduce (Amazon EMR) and Hadoop job logs generated by Apache Hadoop. You need to examine both log types to have complete information about your job flow.

Amazon EMR step-level logs contain information about the job flow and the results of each step. These logs are useful when you are debugging problems that you encounter initializing and running the job flow. For example, a step-level log contains status information such as Streaming Command Failed!.

Hadoop logs contain information about Hadoop jobs, tasks, and task attempts. They are the standard log files generated by Apache Hadoop.

The following image shows the relationship between Amazon EMR job flow steps and Hadoop jobs, tasks, and task attempts.

Both step-level logs and Hadoop logs are generated by default and stored on the master node of the job flow. You can access them while the job flow is running by using SSH to connect to the master node as described in Connect to the Master Node Using SSH (p. 111). When the job flow ends, the master node is terminated and you will no longer be able to access those logs using SSH. To be able to access the log files of a terminated job flow, you can direct Amazon EMR to copy the step-level and Hadoop log files to an Amazon S3 bucket as described in Enable Logging and Debugging (p. 187).

If you specify that the log files are to be copied to an Amazon S3 bucket, you have the option to have Amazon EMR create an index over those log files to generate debugging information and reports. This index is stored in Amazon SimpleDB and can be accessed by clicking the Debug button in the Amazon EMR console.

The options to copy log files to Amazon S3 and to create a debugging index in SimpleDB can only be initialized when the job flow is launched. You cannot add them to an already running job flow.

When you are building your job flow, we recommend that you enable debugging on a small but representative subset of your data.

Enable Logging and Debugging

Topics

• Enable Logging and Debugging Using the Amazon EMR Console (p. 188)

• Enable Logging and Debugging Using the CLI (p. 188)

• Enable Logging and Debugging Using the API (p. 189)

The following topics describe how to initialize logging and debugging on your job flow. You can use the Amazon EMR console, the CLI, or the API to enable logging and debugging.


Enable Logging and Debugging Using the Amazon EMR Console

This section describes how to configure logging and debugging from the Amazon EMR console.

To initialize logging and debugging for a job flow

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click the Create New Job Flow button to launch a new job flow. Then follow the instructions in the wizard. For more information about launching a job flow, see Create a Job Flow (p. 23).

3. On the ADVANCED OPTIONS pane, enter a value for Amazon S3 Log Path that indicates where you want Amazon EMR to copy the log files.

4. For Enable Debugging, click Yes. Amazon EMR creates an index of the log files in SimpleDB.

When you enable debugging, the Debug button in the Amazon EMR console displays debugging information. This display links to the log files after Amazon EMR uploads the log files to your bucket on Amazon S3. It takes a few minutes for the log file uploads to complete after the step completes, so if the links are pending in the console display, the log files are not yet uploaded.

Amazon EMR periodically updates the status of Hadoop jobs, tasks, and task attempts. You can use Refresh List in the debugging panes to get the most up-to-date status of these items.

Enable Logging and Debugging Using the CLI

You can enable logging when you create a job flow by setting two options:

• --log-uri pathToLogFilesOnAmazonS3

This provides a storage location for log files. You must create an Amazon S3 bucket and specify a log-uri in the command that creates the job flow, or set the parameter in the credentials.json file.

• --enable-debugging


This turns on debugging, which uses SimpleDB to store and access an index of the logs. Specify the log-uri for step-level logging. A minimal example that combines both options is shown below.
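The following is a sketch of creating a job flow with both logging and debugging enabled from the CLI; the bucket name is a placeholder, and the bucket must already exist.

elastic-mapreduce --create --alive --name "Debuggable job flow" \
  --log-uri s3://myawsbucket/logs/ \
  --enable-debugging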

Enable Logging and Debugging Using the API

To use the debugging functionality with the API, you must enable debugging when creating a job flow. It is not possible to enable logging after a job flow is created.

• While the job flow is running, you can access the Hadoop web interface.

To enable Hadoop step-level debugging, you must add the following job flow step, for example:

Action=RunJobFlow
&Steps.member.1.ActionOnFailure=TERMINATE_JOB_FLOW
&Instances.Ec2KeyName=MyKeyName
&Steps.member.1.Name=Setup%20Hadoop%20Debugging
&LogUri=s3%3A%2F%2FYourBucket%2Flogs%2F
&Signature=calculated value
&Instances.SlaveInstanceType=m1.small
&Name=Development%20Job%20Flow%20%20%28requires%20manual%20termination%29
&AWSAccessKeyId=calculated value
&Instances.MasterInstanceType=m1.small
&Instances.InstanceCount=1
&Timestamp=2010-05-26T11%3A25%3A40-07%3A00
&SignatureVersion=2
&Steps.member.1.HadoopJarStep.Args.member.1=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fstate-pusher%2F0.1%2Ffetch
&SignatureMethod=HmacSHA1
&Steps.member.1.HadoopJarStep.Jar=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fscript-runner%2Fscript-runner.jar
&Instances.KeepJobFlowAliveWhenNoSteps=true

• If you specified a LogUri when the job flow was created, you can download the log files from your job flow after they are copied to a bucket on Amazon S3.

To enable this option, you must specify a LogUri parameter, for example:

https://elasticmapreduce.amazonaws.com/?Action=RunJobFlow
&Steps.member.1.ActionOnFailure=TERMINATE_JOB_FLOW
&Instances.Ec2KeyName=MyKeyName
&Steps.member.1.Name=Setup%20Hive
&LogUri=s3%3A%2F%2FYourBucket%2Flogs%2F
&Signature=calculated value
&Instances.SlaveInstanceType=instanceType
&Name=Development%20Job%20Flow%20%20%28requires%20manual%20termination%29
&AWSAccessKeyId=calculated value
&Instances.MasterInstanceType=instanceType
&Instances.InstanceCount=COUNT
&Timestamp=2010-05-26T11%3A29%3A20-07%3A00
&SignatureVersion=VERSION
&Steps.member.1.HadoopJarStep.Args.member.1=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fhive%2Fhive-script
&SignatureMethod=HmacSHA1
&Steps.member.1.HadoopJarStep.Args.member.2=--base-path
&Steps.member.1.HadoopJarStep.Args.member.3=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fhive%2F
&Steps.member.1.HadoopJarStep.Args.member.4=--install-hive
&Steps.member.1.HadoopJarStep.Jar=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fscript-runner%2Fscript-runner.jar
&Instances.KeepJobFlowAliveWhenNoSteps=BOOL

• You can enable Hadoop debugging and use the Amazon EMR console to inspect the progress of your job flow and access log files from your job flow located on Amazon S3.

To enable Hadoop debugging and use the Amazon EMR console, you must run a special step at the beginning of your job flow in addition to specifying a LogUri, for example:

Action=RunJobFlow
&Steps.member.1.ActionOnFailure=TERMINATE_JOB_FLOW
&Instances.Ec2KeyName=MyKeyName
&Steps.member.1.Name=Setup%20Pig
&LogUri=s3%3A%2F%2FYourBucket%2Flogs%2F
&Signature=calculated value
&Instances.SlaveInstanceType=m1.small
&Name=Development%20Job%20Flow%20%20%28requires%20manual%20termination%29
&AWSAccessKeyId=calculated value
&Instances.MasterInstanceType=m1.small
&Instances.InstanceCount=1
&Timestamp=2010-05-26T11%3A31%3A31-07%3A00
&SignatureVersion=2
&Steps.member.1.HadoopJarStep.Args.member.1=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fpig%2Fpig-script
&Steps.member.1.HadoopJarStep.Args.member.2=--base-path
&Steps.member.1.HadoopJarStep.Args.member.3=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fpig%2F
&Steps.member.1.HadoopJarStep.Args.member.4=--install-pig
&SignatureMethod=HmacSHA1
&Steps.member.1.HadoopJarStep.Jar=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fscript-runner%2Fscript-runner.jar
&Instances.KeepJobFlowAliveWhenNoSteps=true

Use Log Files

Topics

• Log File Directories (p. 191)

• Example Log Files (p. 191)

• Troubleshoot Using Log Files (p. 194)

• View Job Flow Logs (p. 196)

• View Bootstrap Action Log Files (p. 197)

Both Amazon Elastic MapReduce (Amazon EMR) and Hadoop produce log files, which describe the completion status of every step and task within a job flow. Amazon EMR groups the log files from all of the Amazon EC2 instances into one location that you specify in the LogUri parameter in the RunJobFlow operation.


Log File Directories

When you look in Amazon S3 at the bucket you specified with the LogUri parameter, you find folders labeled with job IDs. Within each folder is a folder labeled Steps, and within that folder is a folder for each of the steps in the job flow. Each step folder contains a link to a variety of log files named syslog, stdout, controller, and stderr. Hadoop generates the files logged in syslog and Amazon EMR generates the files logged in stdout and stderr, as shown in the following example.

Task Logs: 'task_200807301447_0001_m_000000_0'

stdout logs
map: key = test
map: key = test2

stderr logs

syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,507 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,120 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000000_0' done.
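For example, assuming the log files have been archived to Amazon S3 under the LogUri you specified, you could list one step's logs with s3cmd; the bucket, prefix, and job flow ID below are placeholders, and the folder names follow the layout described above.

# List the log files archived for the first step of a job flow
s3cmd ls s3://myawsbucket/logs/j-ABABABABABAB/steps/1/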

Example Log Files

Topics

• Steps Stderr Example (p. 191)

• Steps Syslog Example (p. 191)

• Task Attempt Stderr Example (p. 192)

• Task Attempt Syslog Example (p. 193)

This section contains samples of some of the log files you might inspect. For more information about where in the AWS Management Console you access these log files, see Debugging (p. 204).

Steps Stderr Example

The following example comes from the stderr link on the Steps panel.

Streaming Command Failed!

Steps Syslog Example

The following example comes from the syslog link on the Steps panel. These logs correspond to the stderr message, Streaming Command Failed!

2010-01-19 23:27:26,529 WARN org.apache.hadoop.mapred.JobClient (main): Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2010-01-19 23:27:30,143 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process: 12
2010-01-19 23:27:30,397 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process: 12


2010-01-19 23:27:31,092 INFO org.apache.hadoop.streaming.StreamJob (main): getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
2010-01-19 23:27:31,093 INFO org.apache.hadoop.streaming.StreamJob (main): Running job: job_201001192327_0001
2010-01-19 23:27:31,093 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2010-01-19 23:27:31,093 INFO org.apache.hadoop.streaming.StreamJob (main): UNDEF/bin/hadoop job -Dmapred.job.tracker=domU-12-31-39-0C-24-54.compute-1.internal:9001 -kill job_201001192327_0001
2010-01-19 23:27:31,094 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-0C-24-54.compute-1.internal:9100/jobdetails.jsp?jobid=job_201001192327_0001
2010-01-19 23:27:32,105 INFO org.apache.hadoop.streaming.StreamJob (main): map 0% reduce 0%
2010-01-19 23:27:53,908 INFO org.apache.hadoop.streaming.StreamJob (main): map 5% reduce 0%
2010-01-19 23:27:54,917 INFO org.apache.hadoop.streaming.StreamJob (main): map 8% reduce 0%
2010-01-19 23:28:08,121 INFO org.apache.hadoop.streaming.StreamJob (main): map 15% reduce 0%
2010-01-19 23:28:10,169 INFO org.apache.hadoop.streaming.StreamJob (main): map 17% reduce 3%
2010-01-19 23:28:22,040 INFO org.apache.hadoop.streaming.StreamJob (main): map 17% reduce 6%
2010-01-19 23:28:26,107 INFO org.apache.hadoop.streaming.StreamJob (main): map 24% reduce 6%
2010-01-19 23:28:28,371 INFO org.apache.hadoop.streaming.StreamJob (main): map 25% reduce 6%
2010-01-19 23:28:33,432 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2010-01-19 23:28:33,434 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2010-01-19 23:28:33,434 INFO org.apache.hadoop.streaming.StreamJob (main): UNDEF/bin/hadoop job -Dmapred.job.tracker=domU-12-31-39-0C-24-54.compute-1.internal:9001 -kill job_201001192327_0001
2010-01-19 23:28:33,435 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-0C-24-54.compute-1.internal:9100/jobdetails.jsp?jobid=job_201001192327_0001
2010-01-19 23:28:33,435 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not Successful!
2010-01-19 23:28:33,435 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...

Task Attempt Stderr Example

Entries in the following example contain error messages from Hadoop and the Mapper script. The first error message is a stack trace from the Ruby script, where it threw an exception while processing input. The second error message (prepended by log4j) is a warning from Hadoop stating that it failed to find appenders. The first message explains why the script failed. The second is a benign message from Hadoop about initializing the logging subsystem.

/mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201001192327_0001/attempt_201001192327_0001_m_000001_3/work/./raising_wordcount.rb:17: Invalid input, refusing to proceed after receiving "work" (RuntimeError)
    from /mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201001192327_0001/attempt_201001192327_0001_m_000001_3/work/./raising_wordcount.rb:12:in `each'
    from /mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201001192327_0001/attempt_201001192327_0001_m_000001_3/work/./raising_wordcount.rb:12
log4j:WARN No appenders could be found for logger (org.apache.hadoop.streaming.PipeMapRed).
log4j:WARN Please initialize the log4j system properly.

Task Attempt Syslog Example

The following syslog comes from a job flow where the data submitted to the mapper was in the wrong format.

2010-01-19 23:59:56,659 INFO org.apache.hadoop.metrics.jvm.JvmMetrics (main): Initializing JVM Metrics with processName=MAP, sessionId=
2010-01-19 23:59:56,846 INFO org.apache.hadoop.mapred.MapTask (main): Host name: domU-12-31-39-03-7D-E1.compute-1.internal
2010-01-19 23:59:56,848 INFO org.apache.hadoop.mapred.MapTask (main): numReduceTasks: 1
2010-01-19 23:59:56,867 INFO org.apache.hadoop.mapred.MapTask (main): io.sort.mb = 150
2010-01-19 23:59:57,873 INFO org.apache.hadoop.mapred.MapTask (main): data buffer = 119537664/149422080
2010-01-19 23:59:57,873 INFO org.apache.hadoop.mapred.MapTask (main): record buffer = 393216/491520
2010-01-19 23:59:59,380 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://elasticmapreduce/samples/wordcount/input/0009' for reading
2010-01-19 23:59:59,574 INFO org.apache.hadoop.streaming.PipeMapRed (main): PipeMapRed exec [/mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201001192358_0001/attempt_201001192358_0001_m_000000_2/work/./wrong_format_wordcount.rb]
2010-01-19 23:59:59,744 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2010-01-19 23:59:59,744 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
2010-01-19 23:59:59,747 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2010-01-19 23:59:59,757 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2010-01-20 00:00:00,536 INFO org.apache.hadoop.streaming.PipeMapRed (main): R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
2010-01-20 00:00:04,235 INFO org.apache.hadoop.streaming.PipeMapRed (Thread-6): Records R/W=90635/1
2010-01-20 00:00:04,359 INFO org.apache.hadoop.streaming.PipeMapRed (Thread-5): MRErrorThread done
2010-01-20 00:00:04,425 INFO org.apache.hadoop.streaming.PipeMapRed (main): mapRedFinished
2010-01-20 00:00:04,425 INFO org.apache.hadoop.mapred.MapTask (main): Starting flush of map output
2010-01-20 00:00:04,425 INFO org.apache.hadoop.mapred.MapTask (main): bufstart = 0; bufend = 69475; bufvoid = 149422080
2010-01-20 00:00:04,425 INFO org.apache.hadoop.mapred.MapTask (main): kvstart = 0; kvend = 7221; length = 491520
2010-01-20 00:00:04,828 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1938)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorCombiner.reduce(ValueAggregatorCombiner.java:55)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorCombiner.reduce(ValueAggregatorCombiner.java:34)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:921)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:802)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:715)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:233)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216)

The lines starting with "at" show that the combiner is trying to parse records output by the mapper but the records are in the wrong format.

Troubleshoot Using Log Files

Topics

• Log Files (p. 194)

• Check Step Log Files (p. 195)

Debugging errors in large, distributed applications is difficult. Amazon EMR makes it easier by collecting the log files from the cluster and storing them in a location you specify on Amazon S3. If you do not specify a log URI in the RunJobFlow request, Amazon EMR does not collect logs.

Important
In this section, all relative Amazon S3 paths should be prefixed with your log URI and the job flow identifier to get the actual log locations.

Log Files

The Amazon EMR job flow provides a JAR or streaming file and initiates the Hadoop application on your Amazon EC2 instances. Both Amazon EMR and Hadoop produce log files, which describe the completion status of every step and task within a job flow. Amazon EMR groups the log files from all of the cluster nodes into one location that you specify in the LogUri parameter in the RunJobFlow action.

Log File Directories

When you look in Amazon S3 at the bucket you specified with the LogUri parameter, you find folders labeled with job IDs. Within each folder is a folder labeled Steps, and within that folder is a folder for each of the steps in the job flow. Each step folder contains a link to a variety of log files named syslog, stdout, controller, and stderr. Hadoop generates the files logged in syslog and Amazon EMR generates the files logged in stdout and stderr, as shown in the following example.

Task Logs: 'task_200807301447_0001_m_000000_0'

stdout logs
map: key = test
map: key = test2

stderr logs

syslog logs
2008-07-30 14:51:16,410 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-07-30 14:51:16,507 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2008-07-30 14:51:17,120 INFO org.apache.hadoop.mapred.TaskRunner: Task 'task_200807301447_0001_m_000000_0' done.

Check Step Log Files

If you provide a custom JAR file and there is a failure, the first things to check are the step log files. Amazon EMR uploads these log files to steps/<step number>/ every few minutes. Each step creates the following four logs:

• controller—Contains files generated by Amazon EMR that arise from errors encountered while trying to run your step

If your step fails while loading, you can find the stack trace in this log.

• syslog—Contains logs from non-Amazon software, such as Apache and Hadoop

• stdout—Contains status generated by your mapper and reducer executables

• stderr—Contains your step's standard error messages

To debug a job flow using step log files

1. Use SSH to connect to the master node. For information on how to do this, see Connect to the Master Node Using SSH (p. 111).

2. Use cat to view the log files.

The following example looks into the syslog files. You can use the same procedure with any of the other three logs: controller, stdout, and stderr.

$ cat /mnt/var/log/hadoop/steps/1/syslog
2009-03-25 18:43:27,145 WARN org.apache.hadoop.mapred.JobClient (main): Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-03-25 18:43:28,828 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : unknown host: examples
$ exit

This error from Hadoop indicates that it was trying to look for a host called examples. If we look back at our request, we see that the output path was set to hdfs://examples/output. This is incorrect because we want Hadoop to access the local HDFS system with the path /examples/output. We instead need to specify hdfs:///examples/output.

3. Specify the output of the streaming job on the command line and submit another step to the job flow.

./elastic-mapreduce --jobflow j-36U2JMAE73054 --stream --output hdfs:///examples/output

4. List the job flows to see if it completes.

$ ./elastic-mapreduce --list -n 5
j-36U2JMAE73054    WAITING    ec2-67-202-20-49.compute-1.amazonaws.com    Example job flow
   FAILED       Example Streaming Step
   COMPLETED    Example Streaming Step


This time the job succeeded. We can run the job again, but this time send the output to a bucket in Amazon S3.

5. Create a bucket in Amazon S3.

Bucket names in Amazon S3 must be unique, so choose a unique name for your bucket. The following example uses s3cmd. For more information about creating buckets, see the AWS Amazon Elastic MapReduce Getting Started Guide.

$ s3cmd mb s3://myawsbucket
Bucket s3://myawsbucket/ created

s3cmd requires you to specify Amazon S3 paths using the prefix s3://. Amazon EMR requires the prefix s3n:// for files stored in Amazon S3.

6. Add a step to the job flow to send output to this bucket.

$ ./elastic-mapreduce -j j-36U2JMAE73054 --stream --output s3n://my-example-bucket/output/1
Added steps to j-36U2JMAE73054

The protocol of the output URL is s3n. This tells Hadoop to use the Amazon S3 Native File System for the output location. The host part of the URL is the bucket, and this is followed by the path.

7. Terminate the job flow.

$ ./elastic-mapreduce -j j-36U2JMAE73054 --terminate

8. Confirm that the job flow is shutting down.

$ ./elastic-mapreduce --list -n 5

There are other options that you can specify when creating and adding steps to job flows. Use the --help option to find out what they are.

View Job Flow Logs

Topics

• View Logs on the Master Node Using the Command Line Interface (p. 196)

• Download Job Flow Logs from Amazon S3 (p. 197)

• View Logs Using SSH (p. 197)

This section describes the methods available for viewing job flow logs.

View Logs on the Master Node Using the Command Line Interface

Using the command line interface (CLI), it is possible to run job flows that execute multiple steps. This is useful for developing multi-step streaming jobs and for debugging job flows. With the CLI, you can construct a development job flow that continues to run until terminated by the user. This is useful for debugging when a step fails, because you can add another step to your active job flow rather than restart the job flow.

Have a look at the status of the job flow. You can see if the job flow is started or whether the cluster nodes are starting up.


After the job flow transitions into either the WAITING or RUNNING state, you can use SSH to log into the master node. For more information, see Connect to the Master Node Using SSH (p. 111).

After you log into the master node, you can inspect the log files. If you specified a log URI, then log files are automatically saved to your Amazon S3 bucket. There is a delay of five minutes between the time the log files complete their writes and when they are saved to your Amazon S3 bucket. Often it is quicker to see results by viewing the logs directly on the cluster than waiting for the saved files to appear in your bucket. The directory on the cluster node to look in is /mnt/var/log/hadoop/steps/1.

This directory contains log files for the first step. The second step is in /mnt/var/log/hadoop/steps/2, and so on. The log files are:

• controller—this is the log file of the process that attempts to execute your step

• syslog—this is a log output by Hadoop which describes the execution of your Hadoop job by the job flow step

• stderr—this is the stderr channel of Hadoop's attempt to execute your job flow

• stdout—this is the stdout channel of Hadoop's attempt to execute your job flow

These files do not appear until the step runs for some time, finishes, or fails.

Download Job Flow Logs from Amazon S3

Instead of viewing logs on the master node, you can download the logs from a bucket on Amazon S3. You can download the data in a bucket using the Amazon S3 Organizer plug-in for Firefox. To download the plug-in, go to http://www.s3fox.net/.

To download log files from Amazon S3

1. Locate the value you supplied for LogURI in the Amazon EMR console, CLI, or RunJobFlow request.

The LogURI is a path to a bucket on Amazon S3 of the form s3n://[bucketName]/[path].

2. Download the logs in the bucket using the Amazon S3 Organizer plug-in with Firefox, or the Amazon S3 GET Bucket operation (a command line alternative is sketched after this procedure).

Amazon S3 downloads the log files in the bucket.
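If you prefer a command line tool to the Firefox plug-in, the same download can be done with s3cmd; in this sketch the bucket, prefix, and job flow ID are placeholders.

# Recursively download all log files for a job flow to a local directory
s3cmd get --recursive s3://myawsbucket/logs/j-ABABABABABAB/ ./jobflow-logs/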

View Logs Using SSH

You can set up an SSH tunnel between your host and the master node where you can look on the file system for log files or at the job flow statistics published by the Hadoop web server. The master node in the cluster contains summary information of all of the work done by the slave nodes. You can, however, explore the working and error logs on each slave node in an effort to resolve problems occurring in the execution of the job flow.

Elastic MapReduce starts your instances in two security groups: one for the master node and another for the core and task nodes. The master security group opens a port for communication with the service. It also opens the SSH port to allow you to connect via SSH as the Hadoop user directly to the cluster nodes, using the proper credentials. The core and task nodes start in a separate security group that only allows interaction with the master node. Because these security groups are associated with your account, you can reconfigure them using the standard Amazon EC2 interfaces.

View Bootstrap Action Log Files

Bootstrap action log files can help you identify and diagnose the results of your bootstrap actions. These log files are on the master node of your Hadoop cluster.


To view bootstrap action log files

1. Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 111).

2. From the command prompt, change to the bootstrap action log directory.

$ cd /mnt/var/log/bootstrap-actions/

Log files for each bootstrap action are located in a subdirectory. The subdirectory name is based on the order of the bootstrap actions. For example, 1 for the first action, 2 for the second action, and so forth.

Note
The bootstrap action logs are also saved to your LogURI if you specify one.

LogURI/JobFlowID/node/NodeID/bootstrap-actions/ActionNumber

3. Change to the folder for the bootstrap action log file you want to view. For example, to access the first bootstrap action log file, enter the following:

$ cd 1

4. View the contents of the log file.

$ cat stderr

The contents of the log file should look similar to the following:

--2010-02-20 01:26:24--  http://elasticmapreduce.s3.amazonaws.com/samples/bootstrap-actions/file.tar.gz
Resolving elasticmapreduce.s3.amazonaws.com... 72.21.211.147
Connecting to elasticmapreduce.s3.amazonaws.com|72.21.211.147|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
x-amz-id-2: W4ggqTgUmMWylfuQksXgBi5hxrgmiHp8LYrZH184CXpUW+s1/jfOvKmhoG/NIFJz
x-amz-request-id: D673608F6C6114D2
Date: Sat, 20 Feb 2010 01:26:25 GMT
x-amz-meta-s3fox-filesize: 153
x-amz-meta-s3fox-modifiedtime: 1256233644776
Last-Modified: Thu, 22 Oct 2009 17:47:44 GMT
ETag: "47a007dae0ff192c166764259246388c"
Content-Type: application/gzip
Content-Length: 153
Connection: Keep-Alive
Server: AmazonS3
Length: 153 [application/gzip]
Saving to: `file.tar.gz'

0K 100% 24.3M=0s

2010-02-20 01:26:24 (24.3 MB/s) - `file.tar.gz' saved [153/153]


Monitor Hadoop on the Master Node

There are two main ways to monitor Hadoop on the master node. You can use SSH to log into the master node and interact with Hadoop using the command line, from which you can browse directories and issue commands. Or, you can create an SSH tunnel to view the web interfaces that Hadoop publishes as local websites on the master node. These web interfaces display information about the job flow, including map and reduce tasks, jobs, and attempts.

To view the Amazon EMR logs on the master node

1. Use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 111).

2. Navigate to /mnt/var/log/hadoop/steps/1 to see the logs on the master node for the first step. The second step log files are in /mnt/var/log/hadoop/steps/2, and so on. The log files are:

• controller—Log file of the process that attempts to execute your step

• syslog—Log file generated by Hadoop that describes the execution of your Hadoop job by the job flow step

• stderr—A stderr log file generated by Hadoop when it attempts to execute your job flow

• stdout—The stdout log file generated by Hadoop when it attempts to execute your job flow

These log files do not appear until the step runs for some time, finishes, or fails. These logs contain counter and status information.

Note
If you specified a log URI where Amazon Elastic MapReduce (Amazon EMR) uploads log files onto Amazon S3, you can inspect the log files on Amazon S3. There is, however, a five-minute delay between when the log files stop being written and when they are saved in a bucket on Amazon S3. So, it is generally faster to look at the log files on the master node, especially if the step failed quickly.


View the Hadoop Web Interfaces

Apache Hadoop publishes a web interface that displays status about the job flow. This web interface is hosted on the master node of the job flow. You can view this web interface by using SSH to create a tunnel to the master node and configuring a SOCKS proxy so your browser can view websites hosted on the master node using the SSH tunnel. For information about how to do this, see Web Interfaces Hosted on the Master Node (p. 115).

The web user interfaces for Hadoop are located at the URLs in the following table, where master-public-dns-name is the public DNS name of the master node of the job flow. For information about how to locate the public DNS name of the master node for a job flow, see Connect to the Master Node Using SSH (p. 111).

Name of Interface           URL
MapReduce job tracker       http://master-public-dns-name:9100/
HDFS name node              http://master-public-dns-name:9101/
MapReduce task tracker      http://master-public-dns-name:9103/

To relocate these web interfaces, edit conf/hadoop-default.xml.
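As a quick check that the tunnel and proxy are working, you can also fetch a page through the SOCKS proxy from the command line. This sketch assumes the SSH tunnel exposes a SOCKS proxy on local port 8157, which is an arbitrary choice, and that master-public-dns-name is replaced with your master node's public DNS name.

# Fetch the job tracker page through the local SOCKS proxy
curl --socks5-hostname localhost:8157 http://master-public-dns-name:9100/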

To view the Hadoop Distributed File System web interface

• Open http://master-public-dns-name:9101/. The Hadoop Distributed File System (HDFS) web interface is hosted on the master node of the job flow. You can access the web interface by using SSH to create a tunnel to the master node and configuring a SOCKS proxy so your browser can view websites hosted on the master node. For information about how to do this, see Web Interfaces Hosted on the Master Node (p. 115).

To view the job flow status using the Hadoop web interface

1. To access the Hadoop Job Tracker UI running on the master node, open http://master-public-dns-name:9100/. This web interface is hosted on the master node of the job flow. You can access the web interface by using SSH to create a tunnel to the master node and configuring a SOCKS proxy so your browser can view websites hosted on the master node. For information about how to do this, see Web Interfaces Hosted on the Master Node (p. 115).

The Cluster Summary shows that there were two slave nodes in the cluster and that each performed four tasks. The Completed Jobs section shows that the map and reduce job flows are 100% complete.

2. Click a job flow ID.

Hadoop displays information about the selected job flow.

This display shows a variety of file system and job flow counters.

3. Choose one of the following actions:


• To find out more about the killed tasks, click an entry in the Failed/Killed Task Attempts column.

• To get more information about the mapper tasks, click map. Hadoop displays all of the tasks completed and their status. In this example, all of the mapper tasks completed successfully.

• To display task counters, click an entry in the Counters column. Hadoop displays the task counter information.


• To get information about tasks, click a task. Hadoop displays task information.

4. On the All Task Attempts pane, choose one of the following actions:

• To get information about the node that ran the task, click an entry in the Machine column. Hadoop displays host information.


• To see the task logs, click an entry in the Task Logs column. Hadoop displays the logs.

Troubleshooting Tips

Topics

• Troubleshooting Job Flows (p. 206)

• Commonly Logged Errors and Details (p. 208)

Use the Amazon EMR console to access the log files at the different step and Hadoop job execution levels. You can use these logs to debug your applications.

Before you can use the debugging functionality in the console, you must enable debugging when youcreate a job flow. For more information, see Enable Logging and Debugging (p. 187).

The following procedure shows you how to debug a job flow using the Amazon EMR console.

To debug a failed job flow

1. In the Amazon EMR console, click the check box next to the failed job flow you want to debug and click Debug.


Note
By default, the list is sorted alphabetically by the Name column. To sort the results based on another column, click the column title once (for ascending order) or twice (for descending order).

The Steps pane displays the steps in the selected job flow.

Each row provides links pointing to Hadoop logs generated as part of each step. If the links are labeled (log not uploaded yet), click Refresh List.

2. Click one of the following links in the Log Files column in the row marked FAILED:

• controller—Contains files generated by Amazon Elastic MapReduce (Amazon EMR) that arise from errors encountered while trying to run your step

If your step fails while loading, you can find the stack trace in this log. Errors loading or accessing your application are often described here. Missing mapper file errors are often described here.

• stderr—Contains your step's standard error messages

Application loading errors are often described here. Sometimes contains stack trace.

• stdout—Contains status generated by your mapper and reducer executables

Application loading errors are often described here. Sometimes contains application error messages.

• syslog—Contains logs from non-Amazon software, such as Apache and Hadoop

Streaming errors are often described here.

3. If you can't resolve the problem by looking at these log files, click View All Tasks for All Jobs.

This action skips over the Jobs pane, which does not associate links to log files.

The Tasks pane displays the Hadoop tasks in the jobs.

Time elapsed during a task is a good indication of trouble; the longer the elapsed time, the greater the likelihood of trouble.

• To easily see the time elapsed in a task, click the Elapsed Time column title to sort the results by elapsed time.

4. On the Tasks pane, click View Attempts for the task that failed.

The Task Attempts pane displays the task attempts in the selected task.

5. On the Task Attempts pane, click one or more of the links in the Log Files column for the task attempt that failed:

• stderr—Contains task attempt error messages

• stdout—Contains task attempt output logs

• syslog—Contains logs generated by Hadoop.


Troubleshooting Job Flows

Topics

• Debug Job Flows with No Steps (p. 206)

• Debug Job Flows with Steps (p. 206)

• Troubleshooting Task Attempts (p. 207)

• Checking Hadoop Failures (p. 207)

• Troubleshooting Cluster Nodes and Instance Groups (p. 207)

This section describes how you can troubleshoot your job flows using the log files produced by Hadoop and Amazon EMR.

Debug Job Flows with No Steps

Amazon EMR allows you to create a job flow containing no steps. The effect is to create a Hadoop cluster and then stop processing. You can add additional steps using AddJobFlowSteps. As soon as you issue that request, Amazon EMR continues the job flow and you can see whether or not the step completed successfully.

To develop and debug a job flow starting without steps

1. In a RunJobFlow request, set KeepJobFlowAliveWhenNoSteps to true and ActionOnFailure to CANCEL_AND_WAIT.

CANCEL_AND_WAIT stops job flow execution but does not terminate the Hadoop cluster. The default value, TERMINATE, stops the job flow and terminates the cluster. CANCEL_AND_WAIT enables you to revise your JARs or add steps and retry the job flow without incurring the expense of downloading the data from Amazon S3 to Amazon EC2.

2. Send the RunJobFlow request.

3. To view the Hadoop system, use SSH to connect to the master node. For more information, see Connect to the Master Node Using SSH (p. 111).

4. In an AddJobFlowSteps request, set ActionOnFailure to CANCEL_AND_WAIT.

5. Send the AddJobFlowSteps request.

6. Inspect the log files using a tool like Amazon S3 Organizer to see if there were errors.

Using this procedure, you can work on a step to make sure it completes successfully before adding the next step; a sample CLI command for creating such a job flow follows. For more information about adding steps, go to Add Steps to a Job Flow (p. 79).
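The following is a minimal sketch of this workflow using the Amazon EMR CLI; the job flow name and instance settings are placeholders. The --alive flag corresponds to setting KeepJobFlowAliveWhenNoSteps to true, so the cluster stays running with no steps. Steps added later with AddJobFlowSteps can then specify CANCEL_AND_WAIT as their ActionOnFailure.

elastic-mapreduce --create --alive --name "debug-no-steps" \
  --instance-type m1.small --num-instances 3

Because the cluster stays alive with no steps, you can connect to the master node, inspect the log files, and submit revised steps without relaunching the cluster.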

When you are ready for production, set KeepJobFlowAliveWhenNoSteps to false and ActionOnFailure to TERMINATE_JOB_FLOW.

This value automatically terminates the Hadoop cluster after running the job flow.

Note
When you use the console to run a job flow, the value of ActionOnFailure is always CONTINUE.

Debug Job Flows with Steps

You might want to debug a job flow with steps.

To develop and debug a job flow with steps

1. In a RunJobFlow request, set ActionOnFailure to CANCEL_AND_WAIT.


This value stops job flow execution but does not terminate the Hadoop cluster. The default value, TERMINATE, stops the job flow and terminates the cluster. CANCEL_AND_WAIT enables you to revise your JAR files or add steps and retry the job flow without incurring the expense of downloading the data from Amazon S3 to Amazon EC2.

2. Send the RunJobFlow request.

3. Inspect the log files using a tool like Amazon S3 Organizer to see if there were errors.

4. Change the step that caused the error and resubmit it in an AddJobFlowSteps request, setting ActionOnFailure to CANCEL_AND_WAIT, as in the sample request that follows.
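The following is a minimal sketch of such an AddJobFlowSteps request, modeled on the query-style request shown later in this guide for S3DistCp. The job flow identifier, step name, JAR location, and signature values are placeholders.

https://elasticmapreduce.amazonaws.com?JobFlowId=jobflow-identifier
&Steps.member.1.Name="Retry Step"
&Steps.member.1.ActionOnFailure=CANCEL_AND_WAIT
&Steps.member.1.HadoopJarStep.Jar=s3://mybucket/my-revised-job.jar
&Operation=AddJobFlowSteps
&AWSAccessKeyId=access-key-identifier
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=timestamp
&Signature=calculated-value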

Troubleshooting Task Attempts

If your JAR file successfully started or you created a streaming job, the next place to look for failures is in the task attempts. The Map and Reduce functions you wrote execute in the context of a task. Tasks can execute multiple times as "task attempts" because of failures or speculative execution. Amazon EMR uploads task attempt logs into task-attempts/.

If one of the tasks failed, you can look at the task logs to determine what happened. These files are also available on the nodes under /mnt/var/log/hadoop/userlogs/. Searching the log files on each node in the cluster, however, makes this a laborious way to debug.

Task-attempt log files are similar in format to the step log files.

Checking Hadoop Failures

In rare cases, Hadoop itself might fail. To see if that is the case, you must look at the Hadoop daemon logs.

To view the daemon log files

• Look under /mnt/var/log/hadoop/ on each node or under daemons/<instance id>/ on Amazon S3.

Note
Not all cluster nodes run all daemons.

When developing your application, we recommend that you enable both types of debugging, step level and Hadoop job level, and run a small but representative subset of your data to make sure your application works. To enable step level debugging, select Yes for Enable Debugging and enter an Amazon S3 bucket URI in the Amazon S3 Log Path field.

Troubleshooting Cluster Nodes and Instance Groups

When a node fails to come up, Amazon EMR stops attempting to contact the node and puts the associated instance group into a failed state. After some time, the failed node causes the instance group to change to an ARRESTED state.

A node could fail to come up if:

• Hadoop or the cluster is somehow broken and does not accept a new node into the cluster

• A bootstrap action fails on the new node

• The node is not functioning correctly and fails to check in with Hadoop

If an instance group is in the ARRESTED state and the job flow is in a WAITING state, you can add a job flow step to reset the desired number of slave nodes. Adding the step resumes processing of the job flow and puts the instance group back into a RUNNING state.


For details on how to reset a job flow in an arrested state, refer to Arrested State (p. 100).

Commonly Logged Errors and Details

The following sections describe common errors for each job flow type.

Topics

• Custom JAR Common Errors (p. 208)

• Hive and Pig Common Errors (p. 208)

• Streaming Common Errors (p. 209)

Custom JAR Common Errors

The following table describes common errors for custom JAR job flows.

Error: General
Where to look: You can usually find the cause of a custom JAR error in the syslog file. Link to it from the Steps pane. If you can't determine the problem there, check the Hadoop task attempt error message, which you link to from the Task Attempts pane.

Error: JAR throws an exception before creating a job
Where to look: If the main program of your custom JAR throws an exception while creating the Hadoop job, the best place to look is the syslog file. Link to it from the Steps pane.

Error: JAR throws an error inside a map task
Where to look: If your custom JAR and mapper throw an exception while processing input data, the best place to look is the syslog file. Link to it on the Task Attempts pane.

Hive and Pig Common Errors

The following table describes common errors for Hive or Pig job flows.

Error: General
Where to look: You can usually find the cause of a Hive or Pig error in the syslog file, which you link to from the Steps pane. If you can't determine the problem there, check the Hadoop task attempt error message. Link to it on the Task Attempts pane.

Error: Syntax or semantic error in the Hive script
Where to look: If a step fails, look at the stdout file (which you link to from the Steps pane) of the step that ran the Hive script. If the error is not there, look in the syslog file. Link to it on the Task Attempts pane.

Error: Job fails when running interactively
Where to look: If you are running Hive interactively on the master node and the job flow failed, select the syslog. Link to it on the Task Attempts pane for the task in the interactive step that failed.


Streaming Common Errors

The following table describes common errors for streaming job flows.

Error: General
Where to look: You can usually find the cause of a streaming error in a syslog file. Link to it on the Steps pane.

Error: Data sent to the mapper in the wrong format
Where to look: You can find the error message in the syslog file of a failed task attempt. Link to it on the Task Attempts pane.

Error: Misconfigured time limit
Where to look: Your mapper or reducer script does not produce output within the configured time limit (600 seconds, by default). Find the error in the syslog of the failed task attempt. You can change the time limit by passing an extra argument, for example -jobconf mapred.task.timeout=800000. This is the number of milliseconds Amazon EMR waits before terminating a task that neither reads input, writes output, nor updates its status string. A sample command follows this table.

Error: Exit with error
Where to look: Your mapper or reducer script exits with an error. Find the error in the stderr file of the failed task attempt.
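The following is a minimal sketch of raising the task timeout when creating a streaming job flow. It assumes the Ruby CLI accepts a --jobconf option for streaming job flows; if your CLI version does not, the same property can be set another way, such as through a Hadoop configuration bootstrap action. The bucket paths, mapper script, and timeout value are placeholders.

elastic-mapreduce --create --stream \
  --input s3://myawsbucket/input \
  --output s3://myawsbucket/output \
  --mapper s3://myawsbucket/mapper.py \
  --reducer aggregate \
  --jobconf mapred.task.timeout=800000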

Monitor Metrics with Amazon CloudWatch

When you're running a job flow, you often want to track its progress and health. Amazon Elastic MapReduce (Amazon EMR) records metrics that can help you monitor your job flow. It makes these metrics available in the Amazon EMR console and in the Amazon CloudWatch console, where you can track them with your other AWS metrics. In Amazon CloudWatch you can set alarms to warn you if a metric goes outside of parameters you specify.

Metrics are updated every five minutes. This interval is not configurable. Metrics are archived for two weeks; after that period, the data is discarded.

These metrics are automatically collected and pushed to Amazon CloudWatch for every Amazon EMR job flow. There is no charge for the Amazon EMR metrics reported in Amazon CloudWatch; they are provided as part of the Amazon EMR service.

Note
Viewing Amazon EMR metrics in Amazon CloudWatch is supported only for job flows launched with AMI 2.0.3 or later and running Hadoop 0.20.205 or later. For more information about selecting the AMI version for your job flow, see Specify the Amazon EMR AMI Version (p. 290).

Video Tour of Amazon EMR Metrics

Available online is a video tutorial, Monitoring Amazon EMR Metrics, that walks you through the metrics that Amazon EMR provides in the Amazon EMR console and pushes to Amazon CloudWatch.


How Do I Use Amazon EMR Metrics?

The metrics reported by Amazon EMR provide information that you can analyze in different ways. The table below shows some common uses for the metrics. These are suggestions to get you started, not a comprehensive list. For the complete list of metrics reported by Amazon EMR, go to Metrics Reported by Amazon EMR in Amazon CloudWatch (p. 216).

How do I track the progress of my job flow?
Relevant metrics: Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.

How do I detect job flows that are idle?
Relevant metrics: The IsIdle metric tracks whether a job flow is live, but not currently running tasks. You can set an alarm to fire when the job flow has been idle for a given period of time, such as thirty minutes.

How do I detect when a node runs out of storage?
Relevant metrics: The HDFSUtilization metric is the percentage of disk space currently used. If this rises above an acceptable level for your application, such as 80% of capacity used, you may need to resize your job flow and add more core nodes.

Access Amazon CloudWatch Metrics

There are many ways to access the metrics that Amazon EMR pushes to Amazon CloudWatch. You can view them through either the Amazon EMR console or the Amazon CloudWatch console, or you can retrieve them using the Amazon CloudWatch CLI or the Amazon CloudWatch API. The following procedures show you how to access the metrics using these various tools.

To view metrics in the Amazon EMR console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. To view metrics for a job flow, click on it to display the Job Flow Details pane.


3. Select the Monitoring tab to view information about that job flow. This loads the pane with reports about the progress and health of the job flow.

To view metrics in the Amazon CloudWatch console

1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.


2. Click the All Metrics link in the Navigation pane.

3. Scroll down to the metric that you want to graph. An easy way to find the Amazon EMR metrics you want is to search on the job flow identifier of the job flow to monitor.

4. Click a metric to display the graph.

To access metrics from the Amazon CloudWatch CLI

• Call mon-get-stats, as in the following example. You can learn more about this and other metrics-related functions in the Amazon CloudWatch Developer Guide.
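The following is a minimal sketch of retrieving an Amazon EMR metric with the Amazon CloudWatch CLI. The metric, job flow identifier, and time range shown are placeholders; the AWS/ElasticMapReduce namespace is the one under which Amazon EMR metrics are published.

mon-get-stats HDFSUtilization \
  --namespace "AWS/ElasticMapReduce" \
  --dimensions "JobFlowId=j-3GY8JC4179IOJ" \
  --statistics Average \
  --period 300 \
  --start-time 2012-04-19T00:00:00 \
  --end-time 2012-04-19T12:00:00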


To access metrics from the Amazon CloudWatch API

• Call GetMetricStatistics. You can learn more about this and other metrics-related functions in the Amazon CloudWatch API Reference.

Setting Alarms on Metrics

Amazon EMR pushes metrics to Amazon CloudWatch, which means you can use Amazon CloudWatch to set alarms on your Amazon EMR metrics. You can, for example, configure an alarm in Amazon CloudWatch to send you an email any time the HDFS utilization rises above 80%.

The following topics give you a high-level overview of how to set alarms using Amazon CloudWatch. For detailed instructions, go to Using Amazon CloudWatch in the Amazon CloudWatch Developer Guide.

View a Video Tutorial on Setting Alarms

Available online, a video tutorial, Setting Alarms on Amazon EMR Metrics, shows you how to set up an alarm on an Amazon EMR metric using the Amazon CloudWatch console.

Set alarms using the Amazon CloudWatch console

1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

2. Click the Create Alarm button. This launches the Create Alarm Wizard.

3. Scroll through the Amazon EMR metrics to locate the metric you want to place an alarm on. An easy way to display just the Amazon EMR metrics in this dialog box is to search on the job flow identifier of your job flow. Select the metric to create an alarm on and click Continue.


4. Fill in the Name, Description, Threshold, and Time values for the metric, and click Continue.

5. Choose Alarm as the alarm state. If you want Amazon CloudWatch to send you an email when the alarm state is reached, choose either a pre-existing Amazon SNS email subscription list or Create New Email Topic. If you select Create New Email Topic, you can set the name and email addresses for a new email subscription list. This list will be saved and appear in the drop-down box for future alarms. Click Continue.


Note
If you use Create New Email Topic to create a new Amazon SNS topic, the email addresses must be verified before they will receive notifications. Emails are only sent when the alarm enters an alarm state. If this alarm state change happens before the email addresses are verified, they will not receive a notification.

6. At this point the Create Alarm Wizard gives you a chance to review the alarm you're about to create. If you need to make any changes, you can use the Edit links on the right. Click Create Alarm.


Note
For more information about how to set alarms using the Amazon CloudWatch console, go to Create an Alarm that Sends Email in the Amazon CloudWatch Developer Guide.

To set an alarm using the Amazon CloudWatch CLI

• Call mon-put-metric-alarm, as in the following example. You can learn more about this and other alarm-related functions in the Amazon CloudWatch Developer Guide.
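The following is a minimal sketch of the HDFS-utilization alarm described above, set with the Amazon CloudWatch CLI. The alarm name, job flow identifier, account number, and Amazon SNS topic ARN are placeholders you would replace with your own values.

mon-put-metric-alarm --alarm-name "EMR-HDFS-utilization-high" \
  --alarm-description "Alarm when HDFS utilization exceeds 80%" \
  --metric-name HDFSUtilization --namespace "AWS/ElasticMapReduce" \
  --dimensions "JobFlowId=j-3GY8JC4179IOJ" \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:my-alarm-topic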

To set an alarm using the Amazon CloudWatch API

• Call PutMetricAlarm. You can learn more about this and other alarm-related functions in the Amazon CloudWatch API Reference.

Metrics Reported by Amazon EMR in Amazon CloudWatch

The following table lists all of the metrics that Amazon EMR reports in the Amazon EMR console and pushes to Amazon CloudWatch.

Amazon EMR Metrics

Amazon EMR sends data for several metrics to Amazon CloudWatch. All Amazon EMR job flows automatically send metrics in five-minute intervals. Metrics are archived for two weeks; after that period, the data is discarded.

Note
Amazon EMR pulls metrics from a job flow. If a job flow becomes unreachable, no metrics will be reported until the job flow becomes available again.

CoreNodesPending
The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.
Use Case: Monitor job flow health
Units: Count

CoreNodesRunning
The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.
Use Case: Monitor job flow health
Units: Count

HBaseBackupFailed
Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Count

HBaseMostRecentBackupDuration
The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes since the backup started. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Minutes

HBaseTimeSinceLastSuccessfulBackup
The number of elapsed minutes since the last successful HBase backup started on your cluster. This metric is only reported for HBase job flows.
Use Case: Monitor HBase backups
Units: Minutes

HDFSBytesRead
The number of bytes read from HDFS.
Use Case: Analyze job flow performance, Monitor job flow progress
Units: Count

HDFSBytesWritten
The number of bytes written to HDFS.
Use Case: Analyze job flow performance, Monitor job flow progress
Units: Count

HDFSUtilization
The percentage of HDFS storage currently used.
Use Case: Analyze job flow performance
Units: Percent

IsIdle
Indicates that a job flow is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals, and a value of 1 indicates only that the job flow was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should alarm when this value has been 1 for more than one consecutive 5-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.
Use Case: Monitor job flow performance
Units: Count

JobsFailed
The number of jobs in the job flow that have failed.
Use Case: Monitor job flow health
Units: Count

JobsRunning
The number of jobs in the job flow that are currently running.
Use Case: Monitor job flow health
Units: Count

LiveDataNodes
The percentage of data nodes that are receiving work from Hadoop.
Use Case: Monitor job flow health
Units: Percent

LiveTaskTrackers
The percentage of task trackers that are functional.
Use Case: Monitor job flow health
Units: Percent

MapSlotsOpen
The unused map task capacity. This is calculated as the maximum number of map tasks for a given job flow, less the total number of map tasks currently running in that job flow.
Use Case: Analyze job flow performance
Units: Count

MissingBlocks
The number of blocks in which HDFS has no replicas. These might be corrupt blocks.
Use Case: Monitor job flow health
Units: Count

ReduceSlotsOpen
Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given job flow, less the number of reduce tasks currently running in that job flow.
Use Case: Analyze job flow performance
Units: Count

RemainingMapTasks
The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.
Use Case: Monitor job flow progress
Units: Count

RemainingMapTasksPerSlot
The ratio of the total map tasks remaining to the total map slots available in the cluster.
Use Case: Analyze job flow performance
Units: Ratio

RemainingReduceTasks
The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.
Use Case: Monitor job flow progress
Units: Count

RunningMapTasks
The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.
Use Case: Monitor job flow progress
Units: Count

RunningReduceTasks
The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.
Use Case: Monitor job flow progress
Units: Count

S3BytesRead
The number of bytes read from Amazon S3.
Use Case: Analyze job flow performance, Monitor job flow progress
Units: Count

S3BytesWritten
The number of bytes written to Amazon S3.
Use Case: Analyze job flow performance, Monitor job flow progress
Units: Count

TaskNodesPending
The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.
Use Case: Monitor job flow health
Units: Count

TaskNodesRunning
The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists.
Use Case: Monitor job flow health
Units: Count

TotalLoad
The total number of concurrent data transfers.
Use Case: Monitor job flow health
Units: Count


Dimensions for Amazon EMR Metrics

Amazon EMR data can be filtered using any of the dimensions in the following table.

JobFlowId
The identifier for a job flow. You can find this value by clicking on the job flow in the Amazon EMR console. It takes the form j-XXXXXXXXXXXXX.

JobId
The identifier of a job within a job flow. You can use this to filter the metrics returned from a job flow down to those that apply to a single job within the job flow. JobId takes the form job_XXXXXXXXXXXX_XXXX.

Monitor Performance with Ganglia

The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your job flow, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. For more information about the Ganglia open-source project, go to http://ganglia.info/.

Topics

• Initialize Ganglia on a Job Flow (p. 220)

• View Ganglia Metrics (p. 221)

• Ganglia Reports (p. 222)

• Hadoop Metrics in Ganglia (p. 227)

Initialize Ganglia on a Job Flow

To set up Ganglia monitoring on a job flow, you must specify the Ganglia bootstrap action when you create the job flow. You cannot add Ganglia monitoring to a job flow that is already running. Amazon Elastic MapReduce (Amazon EMR) then installs the monitoring agents and the aggregator that Ganglia uses to report data.

When you create a new job flow using the CLI, you can specify the Ganglia bootstrap action by adding the following parameter to your job flow call:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia

The following command illustrates the use of the bootstrap-action parameter when starting a new job flow. In this example, you start the Word Count sample job flow provided by Amazon EMR and launch three instances.

elastic-mapreduce --create --alive --instance-type m1.xlarge --num-instances 3 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia --stream \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3://myawsbucket/wordcount/output/2012-04-19 \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate

View Ganglia Metrics

Ganglia provides a web-based user interface that you can use to view the metrics Ganglia collects. When you run Ganglia on Amazon EMR, the web interface runs on the master node and can be viewed using port forwarding, also known as creating an SSH tunnel. For more information about viewing web interfaces on Amazon EMR, see Web Interfaces Hosted on the Master Node (p. 115).

To view the Ganglia Web Interface

1. Use SSH to tunnel into the master node and create a secure connection. For information on how to create an SSH tunnel to the master node, see Open an SSH Tunnel to the Master Node (p. 116).

2. Install a web browser with a proxy tool, such as the FoxyProxy plug-in for Firefox, to create a SOCKS proxy for domains of the type *ec2*.amazonaws.com*. For a tutorial on how to do this, see Configure FoxyProxy to View Websites Hosted on the Master Node (p. 117).

3. With the proxy set and the SSH connection open, you can view the Ganglia web interface by opening a browser window with http://master-public-dns-name/ganglia/, where master-public-dns-name is the public DNS address of the master node of the job flow. For information on how to locate the public DNS name of a master node, see To locate the public DNS name of the master node using the Amazon EMR console (p. 111).


Ganglia Reports

When you open the Ganglia web reports in a browser, you see an overview of the cluster's performance, with graphs detailing the load, memory usage, CPU utilization, and network traffic of the cluster. Below the cluster statistics are graphs for each individual server in the cluster. In the preceding job flow creation example, we launched three instances, so in the following reports there are three instance charts showing the cluster data.


The default graph for the node instances is Load, but you can use the Metric drop-down list to change the statistic displayed in the node-instance graphs.


You can drill down into the full set of statistics for a given instance by selecting the node from the drop-down list or by clicking the corresponding node-instance chart.


This opens the Host Overview for the node.


If you scroll down, you can view charts of the full range of statistics collected on the instance.


Hadoop Metrics in Ganglia

Ganglia reports Hadoop metrics for each node instance. The various types of metrics are prefixed by category: distributed file system (dfs.*), Java virtual machine (jvm.*), MapReduce (mapred.*), and remote procedure calls (rpc.*). You can view a complete list of these metrics by clicking the Gmetrics link on the Host Overview page.

Distributed Copy Using S3DistCp

Topics

• S3DistCp Options (p. 227)

• Adding S3DistCp as a Step in a Job Flow (p. 230)

• S3DistCp Versions Supported in Amazon EMR (p. 233)

Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner, sharing the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, go to http://hadoop.apache.org/docs/stable/distcp.html.

S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

S3DistCp is stored as a JAR file on Amazon S3. It's stored at s3://region.elasticmapreduce/libs/s3distcp/version/s3distcp.jar, where region is one of the regions supported by Amazon EMR, such as us-west-1, and version is either the explicit S3DistCp version number (e.g. 1.0.5) or 1.latest, which is always the latest version available. For a complete list of the available regions, go to the Regions and Endpoints documentation.

Note
If you are using AWS Identity and Access Management (IAM) roles in your job flow, use the version of S3DistCp at s3://sa-east-1.elasticmapreduce/libs/s3distcp/role/s3distcp.jar instead of the location above. For more information, see Configure IAM Roles for Amazon EMR (p. 280).

If a file already exists in the destination location, S3DistCp overwrites it. This is true for destinations in both Amazon S3 and HDFS.

If S3DistCp is unable to copy some or all of the specified files, the job flow step fails and returns a non-zero error code. If this occurs, S3DistCp does not clean up partially copied files.

S3DistCp Options

When you call S3DistCp, you can specify options that change how it copies and compresses data. These are described in the following table. The options are added to the step using either the --arg or --args syntax, examples of which are shown following the table.


--src,LOCATION (Required)
Location of the data to copy. This can be either an HDFS or Amazon S3 location.
Example: --src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node

--dest,LOCATION (Required)
Destination for the data. This can be either an HDFS or Amazon S3 location.
Example: --dest,hdfs:///output

--srcPattern,PATTERN (Optional)
A regular expression that filters the copy operation to a subset of the data at --src. If neither --srcPattern nor --groupBy is specified, all data at --src is copied to --dest.
If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').
Example: --srcPattern,.*daemons.*-hadoop-.*

--groupBy,PATTERN (Optional)
A regular expression that causes S3DistCp to concatenate files that match the expression. For example, you could use this option to combine all of the log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping.
Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the job flow will fail on the S3DistCp step and return an error.
If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire --args string must be enclosed in single quotes (').
When --groupBy is specified, only files that match the specified pattern are copied. You do not need to specify --groupBy and --srcPattern at the same time.
Example: --groupBy,.*subnetid.*([0-9]+-[0-9]+-[0-9]+-[0-9]+).*

--targetSize,SIZE (Optional)
The size, in mebibytes (MiB), of the files to create based on the --groupBy option. This value must be an integer. When --targetSize is set, S3DistCp will attempt to match this size; the actual size of the copied files may be larger or smaller than this value.
If the files concatenated by --groupBy are larger than the value of --targetSize, they will be broken up into part files, which will be named sequentially with a numeric value appended to the end. For example, a file concatenated into myfile.gz would be broken into parts as: myfile0.gz, myfile1.gz, and so on.
Example: --targetSize,2

--outputCodec,CODEC (Optional)
Specifies the compression codec to use for the copied files. This can take the values: gzip, lzo, snappy, or none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you do not specify a value for --outputCodec, the files are copied over with no change in their compression.
Example: --outputCodec,lzo

--deleteOnSuccess (Optional)
If the copy operation is successful, this option causes S3DistCp to delete the copied files from the source location. This is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you don't want to copy the same files twice.
Example: --deleteOnSuccess

--disableMultipartUpload (Optional)
Disables the use of multipart upload. For more information about multipart upload, see Multipart Upload (p. 343).
Example: --disableMultipartUpload

--multipartUploadChunkSize,SIZE (Optional)
The size, in MiB, of the multipart upload part size. By default, S3DistCp uses multipart upload when writing to Amazon S3. The default chunk size is 16 MiB.
Example: --multipartUploadChunkSize,32

--numberFiles (Optional)
Prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by --startingIndex.
Example: --numberFiles

--startingIndex,INDEX (Optional)
Used with --numberFiles to specify the first number in the sequence.
Example: --startingIndex,1

--outputManifest,FILENAME (Optional)
Creates a text file, compressed with Gzip, that contains a list of all the files that were copied by S3DistCp.
Example: --outputManifest,manifest-1.gz

--previousManifest,PATH (Optional)
Reads a manifest file that was created during a previous call to S3DistCp using the --outputManifest flag. When the --previousManifest flag is set, S3DistCp excludes the files listed in the manifest from the copy operation. If --outputManifest is specified along with --previousManifest, files listed in the previous manifest will also appear in the new manifest file, although the files will not be recopied.
Example: --previousManifest,/usr/bin/manifest-1.gz

--copyFromManifest (Optional)
Reverses the behavior of --previousManifest to cause S3DistCp to use the specified manifest file as a list of files to copy, instead of a list of files to exclude from copying.
Example: --copyFromManifest --previousManifest,/usr/bin/manifest-1.gz

--s3Endpoint ENDPOINT (Optional)
Specifies the Amazon S3 endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. If not set, the default endpoint is s3.amazonaws.com. For a list of the Amazon S3 endpoints, see Regions and Endpoints.
Example: --s3Endpoint,s3-eu-west-1.amazonaws.com

In addition to the options above, S3DistCp implements the Tool interface, which means that it supports the generic Hadoop options; an example is sketched below.
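As one illustration, a generic -D option could be prepended to the S3DistCp arguments to set a Hadoop configuration property for the copy step. This is a minimal sketch under the assumption that S3DistCp's generic-option parsing handles -D as other Tool-based programs do; the property, value, and paths shown are placeholders.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '-Dmapred.task.timeout=1800000,--src,s3://myawsbucket/logs/,--dest,hdfs:///output'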

Adding S3DistCp as a Step in a Job Flow

You can call S3DistCp by adding it as a step in your job flow.

To add an S3DistCp step to a job flow using the CLI

• Add a step to the job flow that calls S3DistCp, passing in the parameters that specify how S3DistCp should perform the copy operation. For more information about adding steps to a job flow, see Add Steps to a Job Flow (p. 79).

The following example copies daemon logs from Amazon S3 to hdfs:///output. In this CLI command:

• --jobflow specifies the job flow to add the copy step to.

• --jar is the location of the S3DistCp JAR file.

• --args is a comma-separated list of the option name-value pairs to pass in to S3DistCp. For a complete list of the available options, see S3DistCp Options (p. 227). You can also specify the options singly, using multiple --arg parameters. Both forms are shown in the examples below.


You can use either the --args or --arg syntax to pass options into the job flow step. The --args parameter is a convenient way to pass in several --arg parameters at one time. It splits the string passed in on comma (,) characters to parse them into arguments. This syntax is shown in the following example. Note that the value passed in by --args is enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

elastic-mapreduce --jobflow jobflow-identifier \
  --jar s3://region.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args 'S3DistCp-OptionName1,S3DistCp-OptionValue1,S3DistCp-OptionName2,S3DistCp-OptionValue2,S3DistCp-OptionName3,S3DistCp-OptionValue3'

If the value of an S3DistCp option contains a comma, you cannot use --args; you must instead use individual --arg parameters to pass in the S3DistCp option names and values. Only the --src and --dest arguments are required. Note that the option values are enclosed in single quotes ('). This prevents asterisks (*) and any other special characters in any regular expressions from being expanded by the Linux shell.

elastic-mapreduce --jobflow jobflow-identifier \
  --jar s3://region.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg S3DistCp-OptionName1 --arg 'S3DistCp-OptionValue1' \
  --arg S3DistCp-OptionName2 --arg 'S3DistCp-OptionValue2' \
  --arg S3DistCp-OptionName3 --arg 'S3DistCp-OptionValue3'

Example Specify an option value that contains a comma

In this example, --srcPattern is set to '.*[a-zA-Z,]+'. The inclusion of a comma in the --srcPattern regular expression requires the use of individual --arg parameters.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
  --arg --src --arg 's3://myawsbucket/logs/j-3GY8JC4179IOJ/node/' \
  --arg --dest --arg 'hdfs:///output' \
  --arg --srcPattern --arg '.*[a-zA-Z,]+'


Example Copy log files from Amazon S3 to HDFS

This example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,--dest,hdfs:///output,--srcPattern,.*daemons.*-hadoop-.*'

Example Load Amazon CloudFront logs into HDFS

This example loads Amazon CloudFront logs into HDFS. In the process it changes the compression format from Gzip (the CloudFront default) to LZO. This is useful because data compressed using LZO can be split into multiple maps as it is decompressed, so you don't have to wait until decompression is complete, as you do with Gzip. This provides better performance when you analyze the data using Amazon EMR. This example also improves performance by using the regular expression specified in the --groupBy option to combine all of the logs for a given hour into a single file. Amazon EMR job flows are more efficient when processing a few large LZO-compressed files than when processing many small Gzip-compressed files.

elastic-mapreduce --jobflow j-3GY8JC4179IOK \
  --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'

Consider the case in which the preceding example is run over the following CloudFront log files.

s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.dshVLXFE.gz
s3://myawsbucket/cf/XABCD12345678.2012-02-23-02.LpLfuShd.gz

S3DistCp copies, concatenates, and compresses the files into the following two files, where the file name is determined by the match made by the regular expression.

hdfs:///local/2012-02-23-01.lzo
hdfs:///local/2012-02-23-02.lzo


To add an S3DistCp step to a job flow using the API

• Send a request similar to the following example. The arguments specified by Steps.member.1.HadoopJarStep.Args.member alternate between argument names and values, with the values URL encoded.

https://elasticmapreduce.amazonaws.com?JobFlowId=jobflow-identifier
&Steps.member.1.Name="S3DistCp Step"
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
&Steps.member.1.HadoopJarStep.Args.member.1=--src
&Steps.member.1.HadoopJarStep.Args.member.2=s3%3A%2F%2Fbucket%2Fcf
&Steps.member.1.HadoopJarStep.Args.member.3=--dest
&Steps.member.1.HadoopJarStep.Args.member.4=hdfs%3A%2F%2F%2Flocal
&Steps.member.1.HadoopJarStep.Args.member.5=--srcPattern
&Steps.member.1.HadoopJarStep.Args.member.6=.%2A%5Ba-zA-Z%5D%2B
&Steps.member.1.HadoopJarStep.Args.member.7=--groupBy
&Steps.member.1.HadoopJarStep.Args.member.8=.%2A%5Ba-zA-Z%5D%2B
&Operation=AddJobFlowSteps
&AWSAccessKeyId=access-key-identifier
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2011-12-28T21%3A51%3A51.000Z
&Signature=calculated-value

S3DistCp Versions Supported in Amazon EMR

Amazon EMR supports the following versions of S3DistCp.

1.0.6 (6 August 2012): Adds the --s3Endpoint option.

1.0.5 (27 June 2012): Improves the ability to specify which version of S3DistCp to run.

1.0.4 (19 June 2012): Improves the --deleteOnSuccess option.

1.0.3 (12 June 2012): Adds support for the --numberFiles and --startingIndex options.

1.0.2 (6 June 2012): Improves file naming when using groups.

1.0.1 (19 January 2012): Initial release of S3DistCp.


Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR

Topics

• Prerequisites for Integrating Amazon EMR with Amazon DynamoDB (p. 235)

• Step 1: Create a Key Pair (p. 235)

• Step 2: Create a Job Flow (p. 236)

• Step 3: SSH into the Master Node (p. 241)

• Step 4: Set Up a Hive Table to Run Hive Commands (p. 244)

• Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB (p. 248)

• Optimizing Performance for Amazon EMR Operations in Amazon DynamoDB (p. 255)

In the following sections, you will learn how to use Amazon Elastic MapReduce (Amazon EMR) with a customized version of Hive that includes connectivity to Amazon DynamoDB to perform operations on data stored in Amazon DynamoDB, such as:

• Loading Amazon DynamoDB data into the Hadoop Distributed File System (HDFS) and using it as input into an Amazon EMR job flow.

• Querying live Amazon DynamoDB data using SQL-like statements (HiveQL).

• Joining data stored in Amazon DynamoDB and exporting it or querying against the joined data.

• Exporting data stored in Amazon DynamoDB to Amazon S3.

• Importing data stored in Amazon S3 to Amazon DynamoDB.

To perform each of the tasks above, you'll launch an Amazon EMR job flow, specify the location of the data in Amazon DynamoDB, and issue Hive commands to manipulate the data in Amazon DynamoDB.

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Developers can create a database table and grow its request traffic or storage without limit. DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent, fast performance. Using Amazon EMR and Hive you can quickly and efficiently process large amounts of data, such as data stored in Amazon DynamoDB. For more information about Amazon DynamoDB, go to the Amazon DynamoDB Developer Guide.

Apache Hive is a software layer that you can use to query MapReduce job flows using a simplified, SQL-like query language called HiveQL. It runs on top of the Hadoop architecture. For more information about Hive and HiveQL, go to the HiveQL Language Manual.

There are several ways to launch an Amazon EMR job flow: you can use the Amazon EMR console, the Amazon EMR command line interface (CLI), or you can program your job flow using the AWS SDK or the API. You can also choose whether to run a Hive job flow interactively or from a script. In this section, we will show you how to launch an interactive Hive job flow from the Amazon EMR console and the CLI.

Using Hive interactively is a great way to test query performance and tune your application. Once you have established a set of Hive commands that will run on a regular basis, consider creating a Hive script that Amazon EMR can run for you. For more information about how to run Hive from a script, go to How to Create a Job Flow Using Hive.

Warning
Amazon EMR read or write operations on an Amazon DynamoDB table count against your established provisioned throughput, potentially increasing the frequency of provisioned throughput exceptions. For large requests, Amazon EMR implements retries with exponential backoff to manage the request load on the Amazon DynamoDB table. Running Amazon EMR jobs concurrently with other traffic may cause you to exceed the allocated provisioned throughput level. You can monitor this by checking the ThrottleRequests metric in Amazon CloudWatch. If the request load is too high, you can relaunch the job flow and set the Read Percent Setting (p. 256) or Write Percent Setting (p. 256) to a lower value to throttle the Amazon EMR operations. For information about Amazon DynamoDB throughput settings, see Provisioned Throughput.

Prerequisites for Integrating Amazon EMR with Amazon DynamoDB

To use Amazon Elastic MapReduce (Amazon EMR) and Hive to manipulate data in Amazon DynamoDB, you need the following:

• An Amazon Web Services account. If you do not have one, you can get an account by going to http://aws.amazon.com and clicking Create an AWS Account.

• An Amazon DynamoDB table that contains data.

• A customized version of Hive that includes connectivity to Amazon DynamoDB (Hive 0.7.1.3 or later or, if you are using the binary data type, Hive 0.8.1.5 or later). These versions of Hive require the Amazon EMR AMI version 2.0 or later and Hadoop 0.20.205. The latest version of Hive provided by Amazon EMR is available by default when you launch an Amazon EMR job flow from the AWS Management Console or from a version of the Amazon EMR command line client (CLI) downloaded after 11 December 2011. If you launch a job flow using the AWS SDK or the API, you must explicitly set the AMI version to latest and the Hive version to 0.7.1.3 or later. For more information about Amazon EMR AMIs and Hive versioning, go to Specifying the Amazon EMR AMI Version and to Configuring Hive in the Amazon Elastic MapReduce Developer Guide.

• Support for Amazon DynamoDB connectivity. This is loaded on the Amazon EMR AMI version 2.0.2 or later.

• (Optional) An Amazon S3 bucket. For instructions about how to create a bucket, see Get Started With Amazon Simple Storage Service. This bucket is used as a destination when exporting Amazon DynamoDB data to Amazon S3 or as a location to store a Hive script.

• (Optional) A Secure Shell (SSH) client application to connect to the master node of the Amazon EMR job flow and run HiveQL queries against the Amazon DynamoDB data. SSH is used to run Hive interactively. You can also save Hive commands in a text file and have Amazon EMR run the Hive commands from the script. In this case an SSH client is not necessary, though the ability to SSH into the master node is useful even in non-interactive job flows, for debugging purposes.

An SSH client is available by default on most Linux, Unix, and Mac OS X installations. Windows users can install and use an SSH client called PuTTY.

• (Optional) An Amazon EC2 key pair. This is only required for interactive job flows. The key pair provides the credentials the SSH client uses to connect to the master node. If you are running the Hive commands from a script in an Amazon S3 bucket, an EC2 key pair is optional.

Step 1: Create a Key Pair

To run Hive interactively to manage data in Amazon DynamoDB, you will need a key pair to connect to the Amazon EC2 instances launched by Amazon Elastic MapReduce (Amazon EMR). You will use this key pair to connect to the master node of the Amazon EMR job flow to run a HiveQL script (a language similar to SQL).


To generate a key pair

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

2. In the Navigation pane, select a Region from the Region drop-down menu. This should be the same region that your Amazon DynamoDB database is in.

3. Click Key Pairs in the Navigation pane.

The console displays a list of key pairs associated with your account.

4. Click Create Key Pair.

5. Enter a name for the new key pair, such as mykeypair, in the Key Pair Name field and click Create.

You are prompted to download the key file.

6. Download the private key file and keep it in a safe place. You will need it to access any instances that you launch with this key pair.

Important
If you lose the key pair, you cannot connect to your Amazon EC2 instances.

For more information about key pairs, see Getting an SSH Key Pair in the Amazon EC2 User Guide.

Step 2: Create a Job Flow

For Hive to run on Amazon Elastic MapReduce (Amazon EMR), you must create a job flow with Hive enabled. This sets up the necessary applications and infrastructure for Hive to connect to Amazon DynamoDB. The following procedures explain how to create an interactive Hive job flow from the AWS Management Console and the CLI.

Topics

• To start a job flow using the AWS Management Console (p. 236)

• To start a Job Flow using a command line client (p. 241)

To start a job flow using the AWS Management Console

1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

This opens the Amazon Elastic MapReduce console, which you can use to launch and manage job flows.

2. Select a region from the Region drop-down box. This is the region in which you'll create the Amazon EMR job flow. To avoid cross-region data transfer charges, this should be the same region that hosts your Amazon DynamoDB data. Similarly, if you are exporting data to Amazon S3, the Amazon S3 bucket should be in the same region as both the Amazon DynamoDB table and the Amazon EMR job flow to avoid cross-region data transfer charges.

3. Click the Create New Job Flow button.


4. On the DEFINE NEW JOB FLOW page, do the following:

• Give your Job Flow a name, such as "My Job Flow".

• Select which version of Hadoop to run on your cluster in Hadoop Version. You can choose to run the Amazon distribution of Hadoop or one of two MapR distributions. For more information about MapR distributions for Hadoop, see http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-mapr.html.

• Select the Run your own application radio button.

• In the Choose a Job Type drop-down menu, choose Hive Program.

Click Continue.


5. On the SPECIFY PARAMETERS page, select the Start an Interactive Hive Session radio button.

Hive is an open-source tool that runs on top of Hadoop to provide a way to query job flows using a simplified SQL syntax. Select an interactive session to issue commands from a terminal window.

Later, once you've established a set of queries that you'd like to run on a regular basis, you can save your queries as a script in an Amazon S3 bucket and have Amazon EMR run them for you without an interactive session.

Click Continue.

6. On the CONFIGURE EC2 INSTANCES page, set the number and type of instances to process the data in parallel.

In the Master Instance Group, for Instance Type, use an m1.small master node. In the Core Instance Group, for Instance Count use the default value 2 and for Instance Type use the default value m1.small. If you need more processing power, select larger options.


Click Continue.

7. On the ADVANCED OPTIONS page, select the key pair you created earlier in the Amazon EC2 Key Pair drop-down menu.

Leave the rest of the settings on this page at the default values. For example, Amazon VPC Subnet Id should remain set to Proceed without a VPC Subnet ID.

Click Continue.


8. In the Bootstrap Actions dialog:

Select the Proceed with no Bootstrap Actions radio button.

Click Continue.

9. In the Review dialog:

Review the settings for your Job Flow.


Click Create Job Flow.

Note
When the confirmation window closes, your new job flow appears in the list of job flows in the Amazon Elastic MapReduce console with the status STARTING. If you do not see your job flow with the STARTING status, click Refresh to see the job flow. It takes a few minutes for Amazon EMR to provision the Amazon EC2 instances for your job flow. Your job flow is ready for use when the status is WAITING.

To start a job flow using a command line client

1. Download the Amazon EMR Ruby command line client (CLI). If you downloaded the Amazon EMR CLI before 11 December 2011, you will need to download and install the latest version to get support for AMI versioning, Amazon EMR AMI version 2.0, and Hadoop 0.20.205.

2. Install the command line client and set up your credentials. For information about how to do this, go to Sign Up and Install the Command Line Interface in the Amazon Elastic MapReduce Developer Guide.

3. Use the following syntax to start a new job flow, specifying your own values for the instance size and your own job flow name for "myJobFlowName":

elastic-mapreduce --create --alive --num-instances 3 \
  --instance-type m1.small \
  --name "myJobFlowName" \
  --hive-interactive --hive-versions 0.7.1.1 \
  --ami-version latest \
  --hadoop-version 0.20.205

You must use the same account to create the Amazon EMR job flow that you used to store data in Amazon DynamoDB. This ensures that the credentials passed in by the CLI will match those required by Amazon DynamoDB.

Note
After you create the job flow, you should wait until its status is WAITING before continuing to the next step.

Step 3: SSH into the Master Node

When the job flow's status is WAITING, the master node is ready for you to connect to it. With an active SSH session into the master node, you can execute command line operations.

To SSH into the master node

1. Locate the Master Public DNS Name.

In the Amazon Elastic MapReduce console, select the job from the list of running job flows in the WAITING state. Details about the job flow appear in the lower pane.


The DNS name used to connect to the instance is listed on the Description tab as Master Public DNS Name. Use this name in the next step.

2. Use SSH to open up a terminal connection to the master node.

Use the SSH application available on most Linux, Unix, and Mac OS X installations. Windows users can use an application called PuTTY to connect to the master node. The following are platform-specific instructions for opening an SSH connection.

To connect to the master node using Linux/Unix/Mac OS X

1. Open a terminal window. This is found at Applications/Utilities/Terminal on Mac OS X and at Applications/Accessories/Terminal on many Linux distributions.

2. Set the permissions on the PEM file for your Amazon EC2 key pair so that only the key owner has permissions to access the key. For example, if you saved the file as mykeypair.pem in the user's home directory, the command is:

chmod og-rwx ~/mykeypair.pem

If you do not perform this step, SSH will return an error saying that your private key file is unprotected and will reject the key. You only need to perform this step the first time you use the private key to connect.

3. To establish the connection to the master node, enter the following command line, which assumesthe PEM file is in the user's home directory. Replace master-public-dns-name with theMaster Public DNS Name of your job flow and replace ~/mykeypair.pem with the location andfilename of your PEM file.

ssh hadoop@master-public-dns-name -i ~/mykeypair.pem

A warning states that the authenticity of the host you are connecting to can't be verified.

4. Type yes to continue.

NoteIf you are asked to log in, enter hadoop.


Now, you should see a Hadoop command prompt and you are ready to start a Hive interactive session.

To connect to the master node using PuTTY on Windows

1. Download PuTTYgen.exe and PuTTY.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

2. Launch PuTTYgen.

3. Click Load.

4. Select the PEM file you created earlier. Note that you may have to change the search parameters from file of type "PuTTY Private Key Files (*.ppk)" to "All Files (*.*)".

5. Click Open.

6. Click OK on the PuTTYgen notice telling you the key was successfully imported.

7. Click Save private key to save the key in the PPK format.

8. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.

9. Enter a name for your PuTTY private key, such as mykeypair.ppk.

10. Click Save.

11. Close PuTTYgen. You only need to perform steps 1-9 the first time that you use the private key.

12. Start PuTTY.

13. Select Session in the Category list. Enter hadoop@DNS in the Host Name field, where DNS is the Master Public DNS Name of your job flow.

14. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.

15. For Private key file for authentication, click Browse and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

16. Click Open.

A PuTTY Security Alert pops up.


17. Click Yes for the PuTTY Security Alert.

Note: If you are asked to log in, enter hadoop.

Now, you should see a Hadoop command prompt and you are ready to start a Hive interactive session.

Step 4: Set Up a Hive Table to Run Hive Commands

Apache Hive is a data warehouse application you can use to query data contained in Amazon Elastic MapReduce (Amazon EMR) job flows using a SQL-like language. Because we launched the job flow as a Hive application, Amazon EMR will install Hive on the Amazon EC2 instances it launches to process the job flow. To learn more about Hive, go to http://hive.apache.org/.

If you've followed the previous instructions to set up a job flow and SSH into the master node, you are ready to use Hive interactively.

To run Hive commands interactively

1. At the hadoop command prompt for the current master node, type hive.

You should see a hive prompt: hive>

2. Enter a Hive command that maps a table in the Hive application to the data in Amazon DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive, and any queries using this table run against the live data in Amazon DynamoDB, consuming the table's read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.

The following shows the syntax for mapping a Hive table to an Amazon DynamoDB table.

CREATE EXTERNAL TABLE hive_tablename (hive_column1_name column1_datatype, hive_column2_name column2_datatype...)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodb_tablename",
"dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name...");

When you create a table in Hive from Amazon DynamoDB, you must create it as an external table using the keyword EXTERNAL. The difference between external and internal tables is that the data in internal tables is deleted when an internal table is dropped. This is not the desired behavior when connected to Amazon DynamoDB, and thus only external tables are supported.

For example, the following Hive command creates a table named "hivetable1" in Hive that references the Amazon DynamoDB table named "dynamodbtable1". The Amazon DynamoDB table "dynamodbtable1" has a hash-and-range primary key schema. The hash key element is "name" (string type), the range key element is "year" (numeric type), and each item has an attribute value for "holidays" (string set type).

CREATE EXTERNAL TABLE hivetable1 (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

Line 1 uses the HiveQL CREATE EXTERNAL TABLE statement. For "hivetable1", you need to establish a column for each attribute name-value pair in the Amazon DynamoDB table, and provide the data type. These values are not case-sensitive, and you can give the columns any name (except reserved words).

Line 2 uses the STORED BY statement. The value of STORED BY is the name of the class that handles the connection between Hive and Amazon DynamoDB. It should be set to 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'.

Line 3 uses the TBLPROPERTIES statement to associate "hivetable1" with the correct table and schema in Amazon DynamoDB. Provide TBLPROPERTIES with values for the dynamodb.table.name parameter and dynamodb.column.mapping parameter. These values are case-sensitive.

Note: All Amazon DynamoDB attribute names for the table must have corresponding columns in the Hive table. Otherwise, the Hive table won't contain the name-value pair from Amazon DynamoDB. If you do not map the Amazon DynamoDB primary key attributes, Hive generates an error. If you do not map a non-primary key attribute, no error is generated, but you won't see the data in the Hive table. If the data types do not match, the value will be null.

Then you can start running Hive operations on "hivetable1". Queries run against "hivetable1" are internally run against the Amazon DynamoDB table "dynamodbtable1" of your Amazon DynamoDB account, consuming read or write units with each execution.

When you run Hive queries against an Amazon DynamoDB table, you need to ensure that you have provisioned a sufficient amount of read capacity units.

For example, suppose that you have provisioned 100 units of Read Capacity for your DynamoDB table. This will let you perform 100 reads, or 102,400 bytes, per second. If that table contains 20 GB of data (21,474,836,480 bytes), and your Hive query performs a full table scan, you can estimate how long the query will take to run:

21,474,836,480 / 102,400 = 209,715 seconds = 58.25 hours

The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more Elastic MapReduce nodes will not help.

In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned Read Capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Elastic MapReduce console; you will be able to view the individual mapper task status, and statistics for data reads.

You can also log on to the Hadoop interface on the master node and see the Hadoop statistics. This will show you the individual map task status and some data read statistics. For more details, see the following Amazon Elastic MapReduce documentation:

• Web Interfaces Hosted on the Master Node

• View the Hadoop Web Interfaces

Sample HiveQL statements to perform tasks such as exporting or importing data from Amazon DynamoDB and joining tables are listed in Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB (p. 248).


You can also create a file that contains a series of commands, launch a job flow, and reference that file to perform the operations. For more information, see Interactive and Batch Modes in the Amazon Elastic MapReduce Developer Guide.

To cancel a Hive request

When you execute a Hive query, the initial response from the server includes the command to cancel the request. To cancel the request at any time in the process, use the Kill Command from the server response.

1. Enter Ctrl+C to exit the command line client.

2. At the shell prompt, enter the Kill Command from the initial server response to your request.

Alternatively, you can run the following command from the command line of the master node to kill the Hadoop job, where job-id is the identifier of the Hadoop job and can be retrieved from the Hadoop user interface. For more information about the Hadoop user interface, go to How to Use the Hadoop User Interface in the Amazon Elastic MapReduce Developer Guide.

hadoop job -kill job-id

Data Types for Hive and Amazon DynamoDB

The following table shows the available Hive data types and how they map to the corresponding Amazon DynamoDB data types.

Hive type            Amazon DynamoDB type
string               string (S)
bigint or double     number (N)
binary               binary (B)
array                number set (NS), string set (SS), or binary set (BS)

The bigint type in Hive is the same as the Java long type, and the Hive double type is the same as the Java double type in terms of precision. This means that if you have numeric data stored in Amazon DynamoDB that has precision higher than is available in the Hive datatypes, using Hive to export, import, or reference the Amazon DynamoDB data could lead to a loss in precision or a failure of the Hive query.
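As a small illustration of how these types line up, the following sketch maps a hypothetical Amazon DynamoDB table of sensor readings into Hive, declaring the fractional reading as double and the timestamp as bigint. The table name, attribute names, and columns are invented for this example.

-- Hypothetical table and attributes, shown only to illustrate the double/bigint mapping.
CREATE EXTERNAL TABLE sensor_readings (sensor_id string, reading double, taken_at bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "SensorReadings",
"dynamodb.column.mapping" = "sensor_id:SensorId,reading:Reading,taken_at:TakenAt");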

Exports of the binary type from Amazon DynamoDB to Amazon Simple Storage Service (Amazon S3) or HDFS are stored as a Base64-encoded string. If you are importing data from Amazon S3 or HDFS into the Amazon DynamoDB binary type, it should be encoded as a Base64 string.

Hive Options

You can set the following Hive options to manage the transfer of data out of Amazon DynamoDB. These options only persist for the current Hive session. If you close the Hive command prompt and reopen it later on the job flow, these settings will have returned to the default values.


dynamodb.throughput.read.percent

Set the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default read rate, which means that Hive will attempt to consume half of the read provisioned throughput resources in the table. Increasing this value above 0.5 increases the read request rate. Decreasing it below 0.5 decreases the read request rate. This read rate is approximate. The actual read rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live read traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available.

dynamodb.throughput.write.percent

Set the rate of write operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default write rate, which means that Hive will attempt to consume half of the write provisioned throughput resources in the table. Increasing this value above 0.5 increases the write request rate. Decreasing it below 0.5 decreases the write request rate. This write rate is approximate. The actual write rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live write traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available, or this is the initial data upload to the table and there is no live traffic yet.

dynamodb.endpoint

Specify the endpoint in case you have tables in different regions. The default endpoint is dynamodb.us-east-1.amazonaws.com. For the list of available Amazon DynamoDB endpoints, see Regions and Endpoints.

Example: SET dynamodb.endpoint=dynamodb.us-west-2.amazonaws.com;

dynamodb.max.map.tasks

Specify the maximum number of map tasks when reading data from Amazon DynamoDB. This value must be equal to or greater than 1.

dynamodb.retry.duration

Specify the number of minutes to use as the timeout duration for retrying Hive commands. This value must be an integer equal to or greater than 0. The default timeout duration is two minutes.

These options are set using the SET command as shown in the following example.

SET dynamodb.throughput.read.percent=1.0;

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;
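The remaining options described above are set the same way. As a rough sketch (the endpoint value comes from the option descriptions above; the other values here are only illustrative), you might combine several settings at the start of a session before running an export:

-- Illustrative settings; adjust the values for your own table and region.
SET dynamodb.endpoint=dynamodb.us-west-2.amazonaws.com;
SET dynamodb.max.map.tasks=2;
SET dynamodb.retry.duration=10;

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;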

If you are using the AWS SDK for Java, you can use the -e option of Hive to pass in the command directly, as shown in the last line of the following example.

steps.add(new StepConfig()
    .withName("Run Hive Script")
    .withHadoopJarStep(new HadoopJarStepConfig()
        .withJar("s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar")
        .withArgs("s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
            "--base-path", "s3://us-east-1.elasticmapreduce/libs/hive/",
            "--run-hive-script", "--args",
            "-e", "SET dynamodb.throughput.read.percent=1.0;")));

Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB

The following examples use Hive commands to perform operations such as exporting data to Amazon S3 or HDFS, importing data to Amazon DynamoDB, joining tables, querying tables, and more.

Operations on a Hive table reference data stored in Amazon DynamoDB. Hive commands are subject to the Amazon DynamoDB table's provisioned throughput settings, and the data retrieved includes the data written to the Amazon DynamoDB table at the time the Hive operation request is processed by Amazon DynamoDB. If the data retrieval process takes a long time, some data returned by the Hive command may have been updated in Amazon DynamoDB since the Hive command began.

Hive commands DROP TABLE and CREATE TABLE only act on the local tables in Hive and do not create or drop tables in Amazon DynamoDB. If your Hive query references a table in Amazon DynamoDB (for example, to write to or read from it), that table must already exist in Amazon DynamoDB before you run the query. For more information on creating and deleting tables in Amazon DynamoDB, go to Working with Tables in Amazon DynamoDB.

Note: When you map a Hive table to a location in Amazon S3, do not map it to the root path of the bucket, s3://mybucket, as this may cause errors when Hive writes the data to Amazon S3. Instead, map the table to a subpath of the bucket, s3://mybucket/mypath.


Exporting Data from Amazon DynamoDB

You can use Hive to export data from Amazon DynamoDB.

To export an Amazon DynamoDB table to an Amazon S3 bucket

• Create a Hive table that references data stored in Amazon DynamoDB. Then you can call the INSERT OVERWRITE command to write the data to an external directory. In the following example, s3://bucketname/path/subpath/ is a valid path in Amazon S3. Adjust the columns and datatypes in the CREATE command to match the values in your Amazon DynamoDB table. You can use this to create an archive of your Amazon DynamoDB data in Amazon S3.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE DIRECTORY 's3://bucketname/path/subpath/' SELECT * FROM hiveTableName;

To export an Amazon DynamoDB table to an Amazon S3 bucket using formatting

• Create an external table that references a location in Amazon S3. This is shown below as s3_export. During the CREATE call, specify row formatting for the table. Then, when you use INSERT OVERWRITE to export data from Amazon DynamoDB to s3_export, the data will be written out in the specified format. In the following example, the data is written out as comma-separated values (CSV).

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;


To export an Amazon DynamoDB table to an Amazon S3 bucket without specifying a column mapping

• Create a Hive table that references data stored in Amazon DynamoDB. This is similar to the preceding example, except that you are not specifying a column mapping. The table must have exactly one column of type map<string, string>. If you then create an EXTERNAL table in Amazon S3, you can call the INSERT OVERWRITE command to write the data from Amazon DynamoDB to Amazon S3. You can use this to create an archive of your Amazon DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables that are exported this way. Exporting data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.

CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1");

CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3TableName SELECT * FROM hiveTableName;

To export an Amazon DynamoDB table to an Amazon S3 bucket using data compression

• Hive provides several compression codecs you can set during your Hive session. Doing so causes the exported data to be compressed in the specified format. The following example compresses the exported files using the Lempel-Ziv-Oberhumer (LZO) algorithm.

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE lzo_compression_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE lzo_compression_table SELECT * FROM hiveTableName;

The available compression codecs are:

• org.apache.hadoop.io.compress.GzipCodec


• org.apache.hadoop.io.compress.DefaultCodec

• com.hadoop.compression.lzo.LzoCodec

• com.hadoop.compression.lzo.LzopCodec

• org.apache.hadoop.io.compress.BZip2Codec

• org.apache.hadoop.io.compress.SnappyCodec

To export an Amazon DynamoDB table to HDFS

• Use the following Hive command, where hdfs:///directoryName is a valid HDFS path and hiveTableName is a table in Hive that references Amazon DynamoDB. This export operation is faster than exporting an Amazon DynamoDB table to Amazon S3 because Hive 0.7.1.1 uses HDFS as an intermediate step when exporting data to Amazon S3. The following example also shows how to set dynamodb.throughput.read.percent to 1.0 in order to increase the read request rate.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

SET dynamodb.throughput.read.percent=1.0;

INSERT OVERWRITE DIRECTORY 'hdfs:///directoryName' SELECT * FROM hiveTableName;

You can also export data to HDFS using formatting and compression as shown above for the export to Amazon S3. To do so, simply replace the Amazon S3 directory in the examples above with an HDFS directory.
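For example, a sketch of a comma-delimited, LZO-compressed export to HDFS might look like the following; hdfs:///directoryName is a placeholder path, and hiveTableName is assumed to be a Hive table already mapped to Amazon DynamoDB as in the earlier examples.

-- Same pattern as the Amazon S3 compression example, with an HDFS location instead.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE hdfs_export (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///directoryName';

INSERT OVERWRITE TABLE hdfs_export SELECT * FROM hiveTableName;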

To read non-printable UTF-8 character data in Hive

• You can read and write non-printable UTF-8 character data with Hive by using the STORED AS SEQUENCEFILE clause when you create the table. A SequenceFile is a Hadoop binary file format; you will need to use Hadoop to read this file. The following example shows how to export data from Amazon DynamoDB into Amazon S3. You can use this functionality to handle non-printable UTF-8 encoded characters.

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
STORED AS SEQUENCEFILE
LOCATION 's3://bucketname/path/subpath/';

INSERT OVERWRITE TABLE s3_export SELECT * FROM hiveTableName;

Importing Data to Amazon DynamoDB

When you write data to Amazon DynamoDB using Hive, you should ensure that the number of write capacity units is greater than the number of mappers in the job flow. For example, job flows that run on m1.xlarge EC2 instances produce 8 mappers per instance. In the case of a job flow that has 10 instances, that would mean a total of 80 mappers. If your write capacity units are not greater than the number of mappers in the job flow, the Hive write operation may consume all of the write throughput, or attempt to consume more throughput than is provisioned. For details about the number of mappers produced by each EC2 instance type, go to Task Configuration (AMI 2.0).

The number of mappers in Hadoop is controlled by the input splits. If there are too few splits, your write command might not be able to consume all the write throughput available.

If an item with the same key exists in the target Amazon DynamoDB table, it will be overwritten. If no item with the key exists in the target Amazon DynamoDB table, the item is inserted.
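If you have spare write capacity and want an import to consume more of it, you can raise the write rate for the session before running the INSERT. The sketch below assumes hiveTableName is already mapped to Amazon DynamoDB and s3_import is an external table over your source data, as in the examples that follow.

-- Attempt to use roughly all of the table's provisioned write throughput for this session.
SET dynamodb.throughput.write.percent=1.0;

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM s3_import;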

To import a table from Amazon S3 to Amazon DynamoDB

• You can use Amazon Elastic MapReduce (Amazon EMR) and Hive to write data from Amazon S3 to Amazon DynamoDB.

CREATE EXTERNAL TABLE s3_import(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM s3_import;

To import a table from an Amazon S3 bucket to Amazon DynamoDB without specifying a column mapping

• Create an EXTERNAL table that references data stored in Amazon S3 that was previously exported from Amazon DynamoDB. Before importing, ensure that the table exists in Amazon DynamoDB and that it has the same key schema as the previously exported Amazon DynamoDB table. In addition, the table must have exactly one column of type map<string, string>. If you then create a Hive table that is linked to Amazon DynamoDB, you can call the INSERT OVERWRITE command to write the data from Amazon S3 to Amazon DynamoDB. Because there is no column mapping, you cannot query tables that are imported this way. Importing data without specifying a column mapping is available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.


CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/path/subpath/';

CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM s3TableName;

To import a table from HDFS to Amazon DynamoDB

• You can use Amazon EMR and Hive to write data from HDFS to Amazon DynamoDB.

CREATE EXTERNAL TABLE hdfs_import(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///directoryName';

CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
"dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");

INSERT OVERWRITE TABLE hiveTableName SELECT * FROM hdfs_import;

Querying Data in Amazon DynamoDB

The following examples show the various ways you can use Amazon EMR to query data stored in Amazon DynamoDB.

To find the largest value for a mapped column (max)

• Use Hive commands like the following. In the first command, the CREATE statement creates a Hive table that references data stored in Amazon DynamoDB. The SELECT statement then uses that table to query data stored in Amazon DynamoDB. The following example finds the largest order placed by a given customer.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

SELECT max(total_cost) from hive_purchases where customerId = 717;

To aggregate data using the GROUP BY clause

• You can use the GROUP BY clause to collect data across multiple records. This is often used with an aggregate function such as sum, count, min, or max. The following example returns a list of the largest orders from customers who have placed more than three orders.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

SELECT customerId, max(total_cost) from hive_purchases GROUP BY customerId HAVING count(*) > 3;

To join two Amazon DynamoDB tables

• The following example maps two Hive tables to data stored in Amazon DynamoDB. It then calls a join across those two tables. The join is computed on the cluster and returned. The join does not take place in Amazon DynamoDB. This example returns a list of customers and their purchases for customers that have placed more than two orders.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

CREATE EXTERNAL TABLE hive_customers(customerId bigint, customerName string, customerAddress array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Customers",
"dynamodb.column.mapping" = "customerId:CustomerId,customerName:Name,customerAddress:Address");

Select c.customerId, c.customerName, count(*) as count from hive_customers c
JOIN hive_purchases p ON c.customerId = p.customerId
GROUP BY c.customerId, c.customerName HAVING count > 2;


To join two tables from different sources

• In the following example, Customer_S3 is a Hive table that loads a CSV file stored in Amazon S3 and hive_purchases is a table that references data in Amazon DynamoDB. The following example joins together customer data stored as a CSV file in Amazon S3 with order data stored in Amazon DynamoDB to return a set of data that represents orders placed by customers who have "Miller" in their name.

CREATE EXTERNAL TABLE hive_purchases(customerId bigint, total_cost double, items_purchased array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Purchases",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

CREATE EXTERNAL TABLE Customer_S3(customerId bigint, customerName string, customerAddress array<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

Select c.customerId, c.customerName, c.customerAddress from Customer_S3 c
JOIN hive_purchases p ON c.customerid = p.customerid
where c.customerName like '%Miller%';

Note: In the preceding examples, the CREATE TABLE statements were included in each example for clarity and completeness. When running multiple queries or export operations against a given Hive table, you only need to create the table once, at the beginning of the Hive session.

Optimizing Performance for Amazon EMR Operations in Amazon DynamoDB

Amazon Elastic MapReduce (Amazon EMR) operations on an Amazon DynamoDB table count as read operations, and are subject to the table's provisioned throughput settings. Amazon EMR implements its own logic to try to balance the load on your Amazon DynamoDB table to minimize the possibility of exceeding your provisioned throughput. At the end of each Hive query, Amazon EMR returns information about the job flow used to process the query, including how many times your provisioned throughput was exceeded. You can use this information, as well as Amazon CloudWatch metrics about your Amazon DynamoDB throughput, to better manage the load on your Amazon DynamoDB table in subsequent requests.

The following factors influence Hive query performance when working with Amazon DynamoDB tables.

Provisioned Read Capacity Units

When you run Hive queries against an Amazon DynamoDB table, you need to ensure that you have provisioned a sufficient amount of read capacity units.

For example, suppose that you have provisioned 100 units of Read Capacity for your DynamoDB table. This will let you perform 100 reads, or 102,400 bytes, per second. If that table contains 20 GB of data (21,474,836,480 bytes), and your Hive query performs a full table scan, you can estimate how long the query will take to run:

21,474,836,480 / 102,400 = 209,715 seconds = 58.25 hours

The only way to decrease the time required would be to adjust the read capacity units on the source DynamoDB table. Adding more Elastic MapReduce nodes will not help.

In the Hive output, the completion percentage is updated when one or more mapper processes are finished. For a large DynamoDB table with a low provisioned Read Capacity setting, the completion percentage output might not be updated for a long time; in the case above, the job will appear to be 0% complete for several hours. For more detailed status on your job's progress, go to the Elastic MapReduce console; you will be able to view the individual mapper task status, and statistics for data reads.

You can also log on to the Hadoop interface on the master node and see the Hadoop statistics. This will show you the individual map task status and some data read statistics. For more details, see the following Amazon Elastic MapReduce documentation:

• Web Interfaces Hosted on the Master Node

• View the Hadoop Web Interfaces

Read Percent Setting

By default, Amazon EMR manages the request load against your Amazon DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default read rate using the dynamodb.throughput.read.percent parameter when you set up the Hive table. For more information about setting the read percent parameter, see Hive Options (p. 246).

Write Percent Setting

By default, Amazon EMR manages the request load against your Amazon DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default write rate using the dynamodb.throughput.write.percent parameter when you set up the Hive table. For more information about setting the write percent parameter, see Hive Options (p. 246).

Retry Duration Setting

By default, Amazon EMR will re-run a Hive query if it has not returned a result within two minutes, the default retry interval. You can adjust this interval by setting the dynamodb.retry.duration parameter when you run a Hive query. For more information about setting the retry duration parameter, see Hive Options (p. 246).
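For example, to give long-running commands more time before they are retried, you could lengthen the timeout at the start of the Hive session; the 30-minute value below is only illustrative.

-- Raise the retry timeout from the default of two minutes to 30 minutes (illustrative value).
SET dynamodb.retry.duration=30;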

Number of Map Tasks

The mapper daemons that Hadoop launches to process your requests to export and query data stored in Amazon DynamoDB are capped at a maximum read rate of 1 MiB per second to limit the read capacity used. If you have additional provisioned throughput available on Amazon DynamoDB, you can improve the performance of Hive export and query operations by increasing the number of mapper daemons. To do this, you can either increase the number of EC2 instances in your job flow or increase the number of mapper daemons running on each EC2 instance.

You can increase the number of EC2 instances in a job flow by stopping the current job flow and re-launching it with a larger number of EC2 instances. You specify the number of EC2 instances in the Configure EC2 Instances dialog box if you're launching the job flow from the Amazon Elastic MapReduce console, or with the --num-instances option if you're launching the job flow from the CLI.

The number of map tasks run on an instance depends on the EC2 instance type. For a list of the supported EC2 instance types and the number of mappers each one provides, go to Task Configuration.

Another way to increase the number of mapper daemons is to change the mapred.tasktracker.map.tasks.maximum configuration parameter of Hadoop to a higher value. This has the advantage of giving you more mappers without increasing either the number or the size of EC2 instances, which saves you money. A disadvantage is that setting this value too high can cause the EC2 instances in your job flow to run out of memory. To set mapred.tasktracker.map.tasks.maximum, launch the job flow and specify the Configure Hadoop bootstrap action, passing in a value for mapred.tasktracker.map.tasks.maximum as one of the arguments of the bootstrap action. This is shown in the following example.

--bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args -s,mapred.tasktracker.map.tasks.maximum=10

For more information about bootstrap actions, go to Using Custom Bootstrap Actions in the Amazon Elastic MapReduce Developer Guide.

Parallel Data Requests

Multiple data requests, either from more than one user or more than one application, to a single table may drain read provisioned throughput and slow performance.

Process Duration

Data consistency in Amazon DynamoDB depends on the order of read and write operations on each node. While a Hive query is in progress, another application might load new data into the Amazon DynamoDB table or modify or delete existing data. In this case, the results of the Hive query might not reflect changes made to the data while the query was running.

Avoid Exceeding Throughput

When running Hive queries against Amazon DynamoDB, take care not to exceed your provisioned throughput, because this will deplete capacity needed for your application's calls to DynamoDB::Get. To ensure that this is not occurring, you should regularly monitor the read volume and throttling on application calls to DynamoDB::Get by checking logs and monitoring metrics in Amazon CloudWatch.

Request Time

Scheduling Hive queries that access an Amazon DynamoDB table when there is lower demand on the Amazon DynamoDB table improves performance. For example, if most of your application's users live in San Francisco, you might choose to export daily data at 4 a.m. PST, when the majority of users are asleep, and not updating records in your Amazon DynamoDB database.

Time-Based Tables

If the data is organized as a series of time-based Amazon DynamoDB tables, such as one table per day, you can export the data when the table becomes no longer active. You can use this technique to back up data to Amazon S3 on an ongoing basis.
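A sketch of that approach, using invented table and bucket names: once the day's table is no longer being written to, map it in Hive and copy it to a dated Amazon S3 prefix.

-- Hypothetical names: the DynamoDB table "Orders-2012-05-14" and the bucket "mybucket".
CREATE EXTERNAL TABLE orders_20120514 (customerId bigint, total_cost double, items_purchased array<String>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Orders-2012-05-14",
"dynamodb.column.mapping" = "customerId:CustomerId,total_cost:Cost,items_purchased:Items");

INSERT OVERWRITE DIRECTORY 's3://mybucket/orders/2012-05-14/' SELECT * FROM orders_20120514;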


Archived Data

If you plan to run many Hive queries against the data stored in Amazon DynamoDB and your application can tolerate archived data, you may want to export the data to HDFS or Amazon S3 and run the Hive queries against a copy of the data instead of Amazon DynamoDB. This will conserve your read operations and provisioned throughput.
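For example, after exporting a comma-delimited copy to Amazon S3 (as shown in the export examples earlier), you can point an external table at the exported files and run repeated queries against that copy instead of Amazon DynamoDB; the path and column names below are placeholders.

-- Queries against this table read the exported copy in Amazon S3, not Amazon DynamoDB.
CREATE EXTERNAL TABLE purchases_archive (customerId bigint, total_cost double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/archive/purchases/';

SELECT customerId, max(total_cost) FROM purchases_archive GROUP BY customerId;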

Viewing Hadoop Logs

If you run into an error, you can investigate what went wrong by viewing the Hadoop logs and user interface. For more information on how to do this, go to How to Monitor Hadoop on a Master Node and How to Use the Hadoop User Interface in the Amazon Elastic MapReduce Developer Guide.

Use Third Party Applications With Amazon EMR

You can run several popular big-data applications on Amazon EMR with utility pricing. This means you pay a nominal additional hourly fee for the third-party application while your job flow is running. It allows you to use the application without having to purchase an annual license.

HParser

A tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files. For more information about running HParser with Amazon EMR, see Parse Data with HParser (p. 258).

Karmasphere

Graphical desktop tools for working with large structured and unstructured data sets on Amazon Elastic MapReduce (Amazon EMR). Karmasphere Analytics can launch new Amazon EMR job flows or interact with job flows launched with Karmasphere Analytics enabled. For more information about using Karmasphere with Amazon EMR, see Using Karmasphere Analytics (p. 259).

MapR Distribution for Hadoop

An open, enterprise-grade distribution that makes Hadoop easier and more dependable. For ease of use, MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. For dependability, MapR provides high availability with a self-healing no-NameNode architecture, data protection with snapshots, disaster recovery, and cross-cluster mirroring. For more information about using MapR with Amazon EMR, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

Parse Data with HParser

Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files.

HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.

In our stock trade information example, you could use HParser running on top of Amazon EMR to efficiently parse the data across a cluster of machines. The cluster will automatically shut down when all of your files have been converted, ensuring you are only charged for the resources used. This makes your legacy data available for analysis, without incurring ongoing IT infrastructure expenses.

The following tutorial walks you through the process of using HParser hosted on Amazon EMR to process custom text files into an easy-to-analyze XML format. The parsing logic for this sample has been defined for you using HParser, and is stored in the transformation services file (services_basic.tar.gz). This file, along with other content needed to run this tutorial, has been preloaded onto Amazon Simple Storage Service (Amazon S3) at s3n://elasticmapreduce/samples/informatica/. You will reference these files when you run the HParser job.

For a step-by-step tutorial of how to run HParser on Amazon EMR, see Parse Data with HParser on Amazon EMR.

For more information about HParser and how to use it, go to http://www.informatica.com/us/products/b2b-data-exchange/hparser/.

Using Karmasphere Analytics

Karmasphere Analytics provides graphical desktop tools for working with large structured and unstructured data sets on Amazon Elastic MapReduce (Amazon EMR). Karmasphere Analytics can launch new Amazon EMR job flows or interact with job flows launched with Karmasphere Analytics enabled. For pricing details, tutorials, and download instructions, go to http://aws.amazon.com/elasticmapreduce/karmasphere.

Enabling Karmasphere Analytics on a Job Flow

You can enable Karmasphere Analytics on a job flow by setting the supported products field to karmasphere-enterprise-utility when you launch the job flow. Karmasphere Analytics will then be able to access the job flow and its EC2 instances. You will be billed an hourly rate for Karmasphere Analytics usage. For pricing details, go to http://aws.amazon.com/elasticmapreduce/karmasphere.

Note: You cannot enable Karmasphere Analytics on a job flow that is currently running. The supported products field must be set when the job flow is launched. You can set the supported products field either through the Amazon EMR command-line interface or the RunJobFlow API; it is not supported in the Amazon EMR console. Job flows launched by Karmasphere Analytics automatically have this field set.

To Launch a Job Flow with Karmasphere Analytics Enabled

• When you create a new job flow, set the supported products field by adding the following parameter to your job flow call: --with-supported-products karmasphere-enterprise-utility. The following example launches a job flow with Karmasphere Analytics enabled. The example shows an interactive job flow running on five m1.xlarge EC2 instances.

elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 5 \
--with-supported-products karmasphere-enterprise-utility

Launch a Job Flow on the MapR Distribution for Hadoop

MapR is a third-party application that offers an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable. For ease of use, MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. For dependability, MapR provides high availability with a self-healing no-NameNode architecture, and data protection with snapshots, disaster recovery, and cross-cluster mirroring. For more information about MapR, go to http://www.mapr.com/.

There are two versions of MapR available on Amazon EMR:

• M3 Edition (v1.2.8): A complete, easy to use, dependable distribution available at no additional charge over standard hourly Amazon EMR fees.

• M5 Edition (v1.2.8): The full set of MapR functionality, including high availability, mirroring, snapshots, and data placement control, all for a nominal additional hourly fee over standard hourly Amazon EMR fees. For complete details, including pricing, see the MapR on Amazon EMR Detail Page.

The MapR distribution for Hadoop does not support the following features of Amazon EMR:

• Decreasing the number of task nodes in a cluster.

• Amazon Virtual Private Cloud (Amazon VPC)

• Ganglia

• LZO compression/decompression

• Snappy compression/decompression

• AWS Identity and Access Management (IAM) roles

Note: The MapR distribution for Hadoop is not supported in the Asia Pacific (Sydney) Region.

Launch an Amazon EMR Job Flow with MapR

You can launch any standard job flow on a MapR distribution by specifying MapR when you set the Hadoop version. You can do this with job flows launched from the Amazon EMR console, the CLI, or API.

To launch an Amazon EMR job flow with MapR using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click Create New Job Flow to launch the Create a New Job Flow wizard. For more information about using the Create a New Job Flow wizard to launch a job flow, see the instructions for each job flow type under Create a Job Flow (p. 23).

3. On the DEFINE JOB FLOW page, in the Hadoop Version dropdown list, select either Hadoop 0.20.205 (MapR M5 Edition v1.2.8) or Hadoop 0.20.205 (MapR M3 Edition v1.2.8). This will launch your job flow on the corresponding MapR distribution.


4. Continue through the wizard, following the directions for the type of job flow you are launching, as described in Create a Job Flow (p. 23).

To launch an Amazon EMR job flow with MapR using the CLI

• Set the --with-supported-products parameter to either mapr-m3 or mapr-m5 to run your job flow on the corresponding version of the MapR Hadoop distribution.

The following example launches a job flow running with the M3 Edition of MapR.

elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 5 \
--with-supported-products mapr-m3

For additional information about launching job flows using the CLI, see the instructions for each job flow type in Create a Job Flow (p. 23).

To launch an Amazon EMR job flow with MapR using the API

• In the request to RunJobFlow, set a member of the SupportedProducts list to either mapr-m3 or mapr-m5, corresponding to the version of the MapR distribution you'd like to run the job flow on.

The following example request launches a job flow running with the M3 Edition of MapR.

https://elasticmapreduce.amazonaws.com?Action=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.xlarge
&Instances.SlaveInstanceType=m1.xlarge
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&SupportedProducts.member.1=mapr-m3
&AuthParams

For additional information about launching job flows using the API, see the instructions for each job flow type in Create a Job Flow (p. 23).


Write Amazon EMR Applications

Topics

• Common Concepts for API Calls (p. 263)

• Use SDKs to Call Amazon EMR APIs (p. 265)

• Use Query Requests to Call Amazon EMR APIs (p. 268)

There are two ways you can programmatically call the functionality exposed by the Amazon Elastic MapReduce (Amazon EMR) API: submit a Query request over HTTP/HTTPS, or call wrapper functions in one of the AWS SDKs.

Calling the APIs using a Query request provides the lowest-level access to the APIs, but at the cost of having to write code to manage connection details, such as calculating the hash to sign the request, handling errors, and retrying requests. However, by using Query requests directly, you have access to the full functionality of the web service and can use any programming language. For more information about how to call Amazon EMR using a Query request, go to Use Query Requests to Call Amazon EMR APIs (p. 268).

The AWS SDKs provide language-specific functions that wrap the web service's API and simplify connecting to the web service, handling many of the connection details for you. The trade-off is that an SDK may not exist for the language you wish to use, and the entire API may not be represented in the wrapper functions. You can work around this last limitation by using the SDK wrapper functions to create and submit a Query request. For more information about calling Amazon EMR using one of the SDKs, go to Use SDKs to Call Amazon EMR APIs (p. 265).

Common Concepts for API Calls

Topics

• Endpoints for Amazon EMR (p. 264)

• Specifying Job Flow Parameters in Amazon EMR (p. 264)

• Availability Zones in Amazon EMR (p. 264)

• How to Use Additional Files and Libraries in Amazon EMR Job Flows (p. 264)

• Amazon EMR Sample Applications (p. 265)

When you write an application that calls the Amazon Elastic MapReduce (Amazon EMR) API, there are several concepts that apply regardless of whether you are calling the API directly using a Query request or are calling one of the wrapper functions of an SDK.


Endpoints for Amazon EMR

An endpoint is a URL that is the entry point for a web service. Every web service request, whether it originates from a Query or a call to an SDK function, must contain an endpoint. The endpoint specifies the AWS Region where job flows are created, described, or terminated. It has the form elasticmapreduce.regionname.amazonaws.com. If the Region name is not specified, Amazon EMR uses the default Region, us-east-1.

For a list of the endpoints for Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.

Specifying Job Flow Parameters in Amazon EMR

The Instances parameters enable you to configure the type and number of Amazon EC2 instances to create nodes to process the data. Hadoop spreads the processing of the data across multiple job flow nodes. The master node is responsible for keeping track of the health of the core and task nodes and polling the nodes for job result status. The core and task nodes do the actual processing of the data. If you have a single-node job flow, the node serves as both the master and a core node.

The KeepJobAlive parameter in a RunJobFlow request determines whether to terminate the cluster when it runs out of job flow steps to execute. Set this value to False when you know that the job flow is running as expected. When you are troubleshooting the job flow and adding steps while the job flow execution is suspended, set the value to True. This reduces the amount of time and expense of uploading the results to Amazon Simple Storage Service (Amazon S3), only to repeat the process after modifying a step to restart the job flow.

If KeepJobAlive is true, after successfully getting the job flow to complete its work, you must send a TerminateJobFlows request or the job flow continues to run and generate AWS charges.

For more information about parameters that are unique to RunJobFlow, see RunJobFlow. For more information about the generic parameters in the request, see Common Request Parameters.

Availability Zones in Amazon EMR

Amazon EMR uses Amazon Elastic Compute Cloud (Amazon EC2) instances as nodes to process job flows. These Amazon EC2 instances have locations composed of Availability Zones (AZ) and Regions. Regions are dispersed and located in separate geographic areas. Availability Zones are distinct locations within a Region insulated from failures in other Availability Zones. Each Availability Zone provides inexpensive, low-latency network connectivity to other Availability Zones in the same Region. For a list of the regions and endpoints for Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.

The AvailabilityZone parameter specifies the general location of the job flow. This parameter is optional and, in general, we discourage its use. When AvailabilityZone is not specified, Amazon EMR automatically picks the best AvailabilityZone for the job flow. You might find this parameter useful if you want to colocate your instances with other existing running instances, and your job flow needs to read or write data from those instances. For more information, go to the Amazon Elastic Compute Cloud Developer Guide.

How to Use Additional Files and Libraries in Amazon EMR Job Flows

There are times when you might like to use additional files or custom libraries with your mapper or reducer applications. For example, you might like to use a library that converts a PDF file into plain text.


To cache a file for the mapper or reducer to use when using Hadoop streaming

• In the JAR args field, add the following argument:

-cacheFile s3n://bucket/path_to_executable#local_path

The file is then available at local_path in the working directory of the mapper, which can reference the file.

Amazon EMR Sample Applications

AWS provides tutorials that show you how to create complete applications, including:

• Contextual Advertising using Apache Hive and Amazon EMR with High Performance Computing instances

• Parsing Logs with Apache Pig and Elastic MapReduce

• Processing and Loading Data from Amazon S3 to the Vertica Analytic Database

• Finding Similar Items with Amazon EMR, Python, and Hadoop Streaming

• ItemSimilarity

• Word Count Example

For more Amazon EMR code examples, go to Sample Code & Libraries.

Use SDKs to Call Amazon EMR APIs

Topics

• Using the AWS SDK for Java to Create an Amazon EMR Job Flow (p. 266)

• Using the AWS SDK for .Net to Create an Amazon EMR Job Flow (p. 267)

• Using the Java SDK to Sign a Query Request (p. 267)

The AWS SDKs provide functions that wrap the API and take care of many of the connection details, such as calculating signatures, handling request retries, and error handling. The SDKs also contain sample code, tutorials, and other resources to help you get started writing applications that call AWS. Calling the wrapper functions in an SDK can greatly simplify the process of writing an AWS application.

One disadvantage to using the SDKs is that the implementation of the wrapper functions sometimes lags behind changes to the web service's API, meaning that there may be a period between the time that a new web service API is released and when a wrapper function for it becomes available in the SDKs. You can overcome this disadvantage by using the SDKs to generate a raw Query request.

For more information about how to download and use the AWS SDKs, go to Sample Code & Libraries. SDKs are currently available for the following languages/platforms:

• Android

• iOS

• Java

• PHP

• Python

• Ruby

• Windows and .NET


Using the AWS SDK for Java to Create an Amazon EMR Job Flow

The AWS SDK for Java provides three packages with Amazon Elastic MapReduce (Amazon EMR) functionality:

• com.amazonaws.services.elasticmapreduce

• com.amazonaws.services.elasticmapreduce.model

• com.amazonaws.services.elasticmapreduce.util

For more information about these packages, go to the AWS SDK for Java API Reference.

The following example illustrates how the SDKs can simplify programming with Amazon EMR. The code sample below uses the StepFactory object, a helper class for creating common Amazon EMR step types, to create an interactive Hive job flow with debugging enabled.

AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);

AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

StepFactory stepFactory = new StepFactory();

StepConfig enableDebugging = new StepConfig()
    .withName("Enable Debugging")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

StepConfig installHive = new StepConfig()
    .withName("Install Hive")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newInstallHiveStep());

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Hive Interactive")
    .withSteps(enableDebugging, installHive)
    .withLogUri("s3://myawsbucket/")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("keypair")
        .withHadoopVersion("0.20")
        .withInstanceCount(5)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small"));

RunJobFlowResult result = emr.runJobFlow(request);


Using the AWS SDK for .Net to Create an Amazon EMR Job Flow

The following example illustrates how the SDKs can simplify programming with Amazon EMR. The code sample below uses the StepFactory object, a helper class for creating common Amazon EMR step types, to create an interactive Hive job flow with debugging enabled.

var emrClient = AWSClientFactory.CreateAmazonElasticMapReduceClient(RegionEndpoint.USWest2);
var stepFactory = new StepFactory();

var enableDebugging = new StepConfig{
    Name = "Enable Debugging",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = stepFactory.NewEnableDebuggingStep()
};

var installHive = new StepConfig{
    Name = "Install Hive",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = stepFactory.NewInstallHiveStep()
};

var instanceConfig = new JobFlowInstancesConfig{
    Ec2KeyName = "keypair",
    HadoopVersion = "0.20",
    InstanceCount = 5,
    KeepJobFlowAliveWhenNoSteps = true,
    MasterInstanceType = "m1.small",
    SlaveInstanceType = "m1.small"
};

var request = new RunJobFlowRequest{
    Name = "Hive Interactive",
    Steps = {enableDebugging, installHive},
    LogUri = "s3://myawsbucket",
    Instances = instanceConfig
};

var result = emrClient.RunJobFlow(request);

Using the Java SDK to Sign a Query Request

The following example uses the amazon.webservices.common package of the AWS SDK for Java to generate an AWS signature version 2 Query request signature. To do so, it creates an RFC 2104-compliant HMAC signature. For more information about HMAC, go to HMAC: Keyed-Hashing for Message Authentication. For more information about how to format and sign Query requests using AWS signature version 2, go to How to Generate a Signature for a Query Request in Amazon EMR (p. 270).

Note
Java is used in this case as a sample implementation; you can use the programming language of your choice to implement the HMAC algorithm to sign Query requests.


package amazon.webservices.common;

import java.security.SignatureException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;

/**
 * This class defines common routines for generating
 * authentication signatures for AWS Platform requests.
 */
public class Signature {
    private static final String HMAC_SHA256_ALGORITHM = "HmacSHA256";

    /**
     * Computes RFC 2104-compliant HMAC signature.
     *
     * @param data
     *            The signed data.
     * @param key
     *            The signing key.
     * @return
     *            The Base64-encoded RFC 2104-compliant HMAC signature.
     * @throws
     *            java.security.SignatureException when signature generation fails
     */
    public static String calculateRFC2104HMAC(String data, String key)
            throws java.security.SignatureException {
        String result;
        try {
            // get an HMAC-SHA256 key from the raw key bytes
            SecretKeySpec signingKey = new SecretKeySpec(key.getBytes(), HMAC_SHA256_ALGORITHM);

            // get an HMAC-SHA256 Mac instance and initialize it with the signing key
            Mac mac = Mac.getInstance(HMAC_SHA256_ALGORITHM);
            mac.init(signingKey);

            // compute the HMAC on the input data bytes
            byte[] rawHmac = mac.doFinal(data.getBytes());

            // Base64-encode the HMAC (DatatypeConverter stands in for the undefined
            // Encoding helper referenced in the original sample)
            result = DatatypeConverter.printBase64Binary(rawHmac);

        } catch (Exception e) {
            throw new SignatureException("Failed to generate HMAC : " + e.getMessage());
        }
        return result;
    }
}

Use Query Requests to Call Amazon EMR APIs

Topics


• Why Query Requests Are Signed (p. 269)

• Components of a Query Request in Amazon EMR (p. 269)

• How to Generate a Signature for a Query Request in Amazon EMR (p. 270)

Query requests are HTTP or HTTPS requests that use the HTTP verb GET or POST and a Query parameter named Action or Operation that specifies the API you are calling. Action is used throughout this documentation, although Operation is also supported for backward compatibility with other AWS Query APIs.

Calling the API using a Query request is the most direct way to access the web service, but requires that your application handle low-level details such as generating the hash to sign the request, and error handling. The benefit of calling the service using a Query request is that you are assured of having access to the complete functionality of the API.

Note
The Query interface used by AWS is similar to REST, but does not adhere completely to the REST principles.

Why Query Requests Are Signed

Query requests travel over the Internet using either HTTP or HTTPS, and are vulnerable to being intercepted and altered in transit. To prevent this and ensure that the incoming request is both from a valid AWS account and unaltered, AWS requires all requests to be signed.

To sign a Query request, you calculate a digital signature using a cryptographic hash function over the text of the request and your AWS secret key. A cryptographic hash is a one-way function that returns unique results based on the input.

When Amazon Elastic MapReduce (Amazon EMR) receives the request, it re-calculates the signature using the request text and the secret key that matches the AWS access key in the request. If the two signatures match, Amazon EMR knows that the query has not been altered and that the request originated from your account. This is one reason why it is important to safeguard your secret key. Any malicious user who obtains it would be able to make AWS calls, and incur charges, on your account.

For additional security, you should transmit your Query requests using Secure Sockets Layer (SSL) by using HTTPS. SSL encrypts the transmission, protecting your Query request from being viewed in transit. For more information about securing your Query requests, see Making Secure Requests to Amazon Web Services.

Note
The signature format that AWS uses has been refined over time to increase security and ease of use. Amazon EMR uses signature version 2, and the samples and instructions in this document reflect the version 2 protocol.

Components of a Query Request in Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) requires that each HTTP or HTTPS Query request formatted for signature version 2 contain the following:

• Endpoint—Also known as the host part of the HTTP request. This is the DNS name of the machine to which you send the Query request. This is different for each AWS Region. For the complete list of endpoints supported by Amazon EMR, go to Regions and Endpoints in the Amazon Web Services General Reference.
The endpoint, elasticmapreduce.amazonaws.com, shown in the example below, is the default endpoint and maps to the Region us-east-1.

• Action—Specifies the action that you want Amazon EMR to perform.


This value determines the parameters that are used in the request. For descriptions of all Amazon EMR actions and their parameters, see the Amazon Elastic MapReduce API Reference.

The action in the example below is DescribeJobFlows, which causes Amazon EMR to return details about one or more job flows.

• Required and optional parameters—Each action in Amazon EMR has a set of required and optional parameters that define the API call. For a list of parameters that must be included in every Amazon EMR action, see Common Request Parameters.
In the example below, the action DescribeJobFlows has a single parameter, JobFlowIds.member.1, which specifies the identifier of the job flow for which you want Amazon EMR to return details.

• AccessKeyId—A value distributed by AWS when you sign up for an AWS Account.
To view your Access Key ID, go to AWS Security Credentials in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

For information about how Amazon EMR uses the AccessKeyId to validate your call, go to Why Query Requests Are Signed (p. 269).

• Timestamp—This is the time at which you make the request. Including this in the Query request helps prevent third parties from intercepting your request and re-submitting it to Amazon EMR.

• SignatureVersion—The version of the AWS signature protocol you're using. For Amazon EMR the current supported version is 2.

• SignatureMethod—The hash-based protocol you are using to calculate the signature. This can be either HMAC-SHA1 or HMAC-SHA256 for version 2 AWS signatures.

• Signature—This is a calculated value that ensures the signature is valid and has not been tampered with in transit. Amazon EMR currently uses version 2 of the AWS signature protocol for signing requests.

Following is an example Query request formulated as an HTTPS GET request. (Note that in the actual Query request, there would be no spaces or newline characters. The request would appear as a continuous line of text. The version below has been formatted for human readability.)

https://elasticmapreduce.amazonaws.com?
Action=DescribeJobFlows
&JobFlowIds.member.1=JobFlowID
&AWSAccessKeyId=AccessKeyID
&Timestamp=2009-01-28T21%3A49%3A59.000Z
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Signature=calculated value

Note
Be sure to URI encode the GET request. For example, blank spaces in your HTTP GET request should be encoded as %20. Although an unencoded space is normally allowed by the HTTP protocol specification, use of unencoded characters creates an invalid signature in your Query request.

How to Generate a Signature for a Query Request in Amazon EMR

Web service requests are sent across the Internet and thus are vulnerable to tampering. To check that the request has not been altered, Amazon Elastic MapReduce (Amazon EMR) calculates the signature to determine if any of the parameters or parameter values were changed en route. Amazon EMR requires a signature as part of every request.


The following topics describe the steps needed to calculate a signature using AWS signature version 2.

Format the Query Request

Before you can sign the Query request, you must put the request into a completely unambiguous format. This is needed because there are different—and yet correct—ways to format a Query request, but the different variations would result in different HMAC signatures. Putting the request into an unambiguous, canonical format before signing it ensures that your application and Amazon EMR will calculate the same signature for a given request.

The unambiguous string to sign is built up by concatenating the Query request components together as follows. As an example, let's generate the string to sign for the following call to DescribeJobFlows.

https://elasticmapreduce.amazonaws.com?Action=DescribeJobFlows
&Version=2009-03-31
&AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2011-10-03T15%3A19%3A30

To create the string to sign (signature version 2)

1. Start with the request method (either GET or POST), followed by a newline character. (In the following, for human readability, the newline character is represented as \n.)

GET\n

2. Add the HTTP host header in lowercase, followed by a newline character. The port information is omitted if it is the standard port for the protocol (port 80 for HTTP and port 443 for HTTPS), but included if it is a non-standard port.

elasticmapreduce.amazonaws.com\n

3. Add the URL-encoded version of the absolute path component of the URI (this is everything between the HTTP host header and the question mark character (?) that begins the query string parameters), followed by a newline character. If the absolute path is empty, use a forward slash (/).

/\n

4. Add the query string components (the name-value pairs, not including the initial question mark (?)) as UTF-8 characters which are URL encoded per RFC 3986 (hexadecimal characters must be uppercased) and sorted using lexicographic byte ordering. Separate parameter names from their values with the equal sign character (=) (ASCII character 61), even if the value is empty. Separate pairs of parameters and values with the ampersand character (&) (ASCII code 38). All reserved characters must be escaped. All unreserved characters must not be escaped.


Concatenate the parameters and their values to make one long string with no spaces between them. Spaces within a parameter value are allowed, but must be URL encoded as %20. In the concatenated string, period characters (.) are not escaped. RFC 3986 considers the period character an unreserved character, and thus it is not URL encoded. (One way to perform this encoding and sorting in code is sketched after this procedure.)

Note
RFC 3986 does not specify what happens with ASCII control characters, extended UTF-8 characters, and other characters reserved by RFC 1738. Since any values may be passed into a string value, these other characters should be percent encoded as %XY where X and Y are uppercase hex characters. Extended UTF-8 characters take the form %XY%ZA... (this handles multi-bytes). The space character should be represented as '%20'. Spaces should not be encoded as the plus sign (+) as this will cause an error.

The following example shows the query string components of a call to DescribeJobFlows, processed as described above.

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31

5. The string to sign for the call to DescribeJobFlows takes the following form:

GET\n
elasticmapreduce.amazonaws.com\n
/\n
AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31
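The following minimal sketch shows one way to perform the encoding and sorting described in step 4. It is an illustration in Java rather than part of the Amazon EMR tooling, and it assumes the parameter names and values are supplied by your application.

import java.net.URLEncoder;
import java.util.Map;
import java.util.TreeMap;

public class CanonicalQueryString {
    // TreeMap sorts keys by their natural String ordering, which matches the byte-order
    // sort required by signature version 2 for the ASCII parameter names used here.
    public static String build(TreeMap<String, String> params) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // URLEncoder produces application/x-www-form-urlencoded output; adjust it to RFC 3986.
    private static String encode(String value) throws Exception {
        return URLEncoder.encode(value, "UTF-8")
                .replace("+", "%20")
                .replace("*", "%2A")
                .replace("%7E", "~");
    }
}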

Calculate the Signature

After you've created the canonical string as described in Format the Query Request (p. 271), you calculate the signature by creating a hash-based message authentication code (HMAC) using either the HMAC-SHA1 or HMAC-SHA256 protocols.
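As a minimal sketch, the Signature helper class shown earlier in Using the Java SDK to Sign a Query Request can be applied to the canonical string built above; reading the secret key from a command-line argument here is purely for illustration.

public class SignExample {
    public static void main(String[] args) throws Exception {
        String secretKey = args[0];   // your AWS secret access key (illustration only)

        String stringToSign =
              "GET\n"
            + "elasticmapreduce.amazonaws.com\n"
            + "/\n"
            + "AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows"
            + "&SignatureMethod=HmacSHA256&SignatureVersion=2"
            + "&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31";

        // Compute the HMAC-SHA256 signature over the canonical string.
        String signature = Signature.calculateRFC2104HMAC(stringToSign, secretKey);

        // URL encode the Base64 value before appending it as the Signature parameter.
        System.out.println("&Signature=" + java.net.URLEncoder.encode(signature, "UTF-8"));
    }
}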

You then add the value returned to the Query request as a signature parameter, as shown below. You can then use the signed request in an HTTP or HTTPS call. Amazon EMR will then return the results of the call formatted as a response. For more information about the inputs and outputs of the Amazon EMR API calls, go to the Amazon Elastic MapReduce Developer Guide.

https://elasticmapreduce.amazonaws.com?AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
&Action=DescribeJobFlows
&SignatureMethod=HmacSHA256
&SignatureVersion=2
&Timestamp=2011-10-03T15%3A19%3A30
&Version=2009-03-31
&Signature=lptk88A0LEP2KfwM3ima33DUjY0e%2FyfF7YfitJ%2FQw6I%3D

The AWS SDKs offer functions to generate Query request signatures. To see an example using the AWS SDK for Java, go to Using the Java SDK to Sign a Query Request (p. 267).


Troubleshooting Request Signatures in Amazon EMR

This section describes some error codes you might see when you are initially developing code to generate the signature to sign Query requests.

SignatureDoesNotMatch Signing Error in Amazon EMR

The following error response is returned when Amazon EMR attempts to validate the request signature by recalculating the signature value and generates a value that does not match the signature you appended to the request. This can occur because the request was altered between the time you sent it and the time it reached the Amazon EMR endpoint (this is the case the signature is designed to detect) or because the signature was calculated improperly. A common cause of the error message below is not properly creating the string to sign, such as forgetting to URL encode characters such as the colon (:) and the forward slash (/) in Amazon S3 bucket names.

<ErrorResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <Error>
    <Type>Sender</Type>
    <Code>SignatureDoesNotMatch</Code>
    <Message>The request signature we calculated does not match the signature you
    provided. Check your AWS Secret Access Key and signing method. Consult the
    service documentation for details.</Message>
  </Error>
  <RequestId>7589637b-e4b0-11e0-95d9-639f87241c66</RequestId>
</ErrorResponse>

IncompleteSignature Signing Error in Amazon EMR

The following error indicates that the signature is missing information or has been improperly formed.

<ErrorResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <Error>
    <Type>Sender</Type>
    <Code>IncompleteSignature</Code>
    <Message>Request must contain a signature that conforms to AWS standards</Message>
  </Error>
  <RequestId>7146d0dd-e48e-11e0-a276-bd10ea0cbb74</RequestId>
</ErrorResponse>


Configure Amazon EMR

Topics

• Configure User Permissions with IAM (p. 274)

• Configure IAM Roles for Amazon EMR (p. 280)

• Set Access Permissions on Files Written to Amazon S3 (p. 285)

• Using Elastic IP Addresses (p. 287)

• Specify the Amazon EMR AMI Version (p. 290)

• Hadoop Configuration (p. 299)

• Hive Configuration (p. 348)

• Pig Configuration (p. 377)

• Performance Tuning (p. 381)

• Running Job Flows on an Amazon VPC (p. 381)

This section shows you how to customize the environment configuration used to run a job flow. After reading this section, you should understand Hadoop cluster configuration options, Hive configuration settings, and recommendations for performance tuning your job flow environment.

Configure User Permissions with IAM

AWS Identity and Access Management (IAM) enables you to create users under your Amazon Web Services (AWS) account. You can define policies that limit the actions those users can take with your AWS resources. For example, you can choose to give an IAM user the ability to view, but not to create or terminate, Amazon Simple Storage Service (Amazon S3) buckets in your AWS account. IAM is available at no charge to all AWS account holders; you do not need to sign up for IAM. You can use IAM through the Amazon EMR console, the Amazon EMR CLI, and programmatically through the Amazon EMR API and the AWS SDKs.

Hidden Job Flows

By default, if an IAM user launches a job flow, that job flow is hidden from other IAM users on the AWS account. For example, if an IAM user uses the CLI to run the --list command, the CLI will only list hidden job flows launched by that IAM user, not hidden job flows launched by other IAM users on the AWS account.


This filtering occurs on all Amazon EMR interfaces—the console, CLI, API, and SDKs—and prevents IAM users from accessing and inadvertently changing job flows created by other IAM users. It is useful for job flows that are intended to be viewed by only a single IAM user and the main AWS account.

Note
This filtering does not prevent IAM users from viewing the underlying resources of the job flow, such as EC2 instances, by using AWS interfaces outside of Amazon EMR.

Visible Job Flows

You also have the option to make a job flow visible and accessible to all IAM users under a single AWS account. This visibility can be set when you launch the job flow, or it can be added to a job flow that is already running.

Using this feature, you can make it possible for all IAM users on your account to access the job flow and, by configuring the policies of the IAM groups they belong to, control how those users interact with the job flow. For example, Devlin, a developer, belongs to a group that has an IAM policy that grants full access to all Amazon EMR functionality. He could launch a job flow that is visible to all other IAM users on his company's AWS account. A second IAM user, Ann, a data analyst with the same company, could then run queries on that job flow. Because Ann does not launch or terminate job flows, the IAM policy for the group she is in would only contain the permissions necessary for her to run her queries.

To make a job flow visible to all IAM users using the Amazon EMR console

• On the ADVANCED OPTIONS pane of the Create Job Flow Wizard, select the Visible to All IAM Users checkbox. Using the console, IAM user visibility can only be set when the job flow is created. To add IAM user visibility to a running job flow, use the Amazon EMR CLI or the Amazon EMR API, as described in the following procedures.


To make a job flow visible to all IAM users using the Amazon EMR CLI

• If you are adding IAM user visibility to a new job flow, add the --visible-to-all-users flag to the job flow call as shown in the following example.

elastic-mapreduce --create --alive \
--instance-type m1.xlarge --num-instances 2 \
--visible-to-all-users

If you are adding IAM user visibility to an existing job flow, you can use the --set-visible-to-all-users option of the Amazon EMR CLI, and specify the identifier of the job flow to modify. This is shown in the following example, where job-flow-identifier would be replaced by the job flow identifier of your job flow. The visibility of a running job flow can be changed only by the IAM user that created the job flow or the AWS account that owns the job flow.

elastic-mapreduce --set-visible-to-all-users true --jobflow job-flow-identifier

The Amazon EMR CLI is available for download at Amazon Elastic MapReduce Ruby Client.

To make a job flow visible to all IAM users using the Amazon EMR API

• If you are adding IAM user visibility to a new job flow, call RunJobFlow and set VisibleToAllUsers=true, as shown in the following example.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&VisibleToAllUsers=true
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.small
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams

If you are adding IAM user visibility to an existing job flow, call SetVisibleToAllUsers and set VisibleToAllUsers to true, as shown in the following example. The visibility of a running job flow can be changed only by the IAM user that created the job flow or the AWS account that owns the job flow.


https://elasticmapreduce.amazonaws.com?Operation=SetVisibleToAllUsers
&VisibleToAllUsers=true
&JobFlowIds.member.1=j-3UN6WX5RRO2AG
&AuthParams

Set Policy for an IAM User

In the IAM console, you can select an Amazon EMR policy template to set IAM account permissions for access to Amazon EMR. Or you can create a custom policy using the following examples as guidelines. Amazon EMR provides the following policy templates:

• Amazon Elastic MapReduce Full Access—Provides access to all Amazon EMR functionality.

• Amazon Elastic MapReduce Read Only Access—Provides access to view details and debugging information about job flows.

For more information, go to Creating and Listing Groups in Using AWS Identity and Access Management.

To add a permission to a user or group, write a policy that contains the permission and attach the policy to the user or group. You cannot specify a specific Amazon EMR resource in a policy, such as a specific job flow. You can only specify Allow or Deny access to Amazon EMR API actions.

In an IAM policy, to specify Amazon EMR actions, the action name must be prefixed with the lowercase string elasticmapreduce. You use wildcards to specify all actions related to Amazon EMR. The wildcard "*" matches zero or multiple characters.

For a complete list of Amazon EMR actions, refer to the API action names in the Amazon EMR API Reference. For more information about permissions and policies, go to Permissions and Policies in the Using AWS Identity and Access Management guide.

Users with permission to use Amazon EMR API actions can create and manage job flows as described elsewhere in this guide. Users must use their own AWS Access ID and secret key to authenticate Amazon EMR commands. For more information on creating job flows, go to Using Amazon EMR (p. 15).

Example Policies for Amazon EMR

This section shows several sample policies for controlling user access to Amazon EMR. For information about attaching policies to users, go to Managing IAM Policies in the Using AWS Identity and Access Management Guide.


Example 1: Deny a group use of Amazon EMR

The following policy denies permission to run any Amazon EMR API action.

{ "Statement":[{ "Action":["elasticmapreduce:*"], "Effect":"Deny", "Resource":"*" }]}


Example 2: Allow full access to Amazon EMR

The following policy gives permissions for all actions required to use Amazon EMR. This policy includes actions for Amazon EC2, Amazon S3, Amazon CloudWatch, and Amazon SimpleDB, as well as for all Amazon EMR actions. Amazon EMR relies on these additional services to perform such actions as launching instances, writing log files, or managing Hadoop jobs and tasks.

Note
In the following policy, access to Amazon S3 is limited to the buckets matching the pattern *elasticmapreduce/*, which includes buckets that store resources such as Amazon EMR sample applications and bootstrap actions. If you want IAM users to access other Amazon S3 buckets, such as buckets that contain data to load into a job flow, those buckets must be explicitly added to the list of resources.
Access to Amazon SimpleDB in the following policy is limited to the resources used by Amazon EMR for debugging.

{ "Statement": [ { "Action": [ "elasticmapreduce:*", "ec2:AuthorizeSecurityGroupIngress", "ec2:CancelSpotInstanceRequests", "ec2:CreateSecurityGroup", "ec2:CreateTags", "ec2:DescribeAvailabilityZones", "ec2:DescribeInstances", "ec2:DescribeKeyPairs", "ec2:DescribeSecurityGroups", "ec2:DescribeSpotInstanceRequests", "ec2:DescribeSubnets", "ec2:ModifyImageAttribute", "ec2:ModifyInstanceAttribute", "ec2:RequestSpotInstances", "ec2:RunInstances", "ec2:TerminateInstances", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics", "cloudwatch:PutMetricData" ], "Effect": "Allow", "Resource": ["*"] },{ "Action": [ "s3:GetObject", "s3:ListBucket", "sdb:CreateDomain", "sdb:Select", "sdb:GetAttributes", "sdb:PutAttributes", "sdb:BatchPutAttributes" ], "Effect": "Allow", "Resource": [ "arn:aws:s3:::*elasticmapreduce/*", "arn:aws:sdb:*:*:*ElasticMapReduce*/*"

API Version 2009-11-30279

Amazon Elastic MapReduce Developer GuideExample Policies for Amazon EMR

Page 286: Amazon elastic map reduce

] } ]}

Note
The ec2:TerminateInstances action enables the IAM user to terminate any of the EC2 instances associated with the AWS account, even those that are not part of an Amazon EMR job flow.

Example 3: Allow requests from a certain IP address or range

The following policy denies any request made under the AWS account that does not come from the specified IP address ranges.

{ "Statement":[{ "Effect":"Deny", "Action":"*", "Resource":"*", "Condition":{ "NotIpAddress":{ "aws:SourceIp":["10.1.2.0/24","10.1.3.0/24"] } } }]}

This policy uses the AWS-wide key called aws:SourceIp to specify the range of valid IP addresses. For information about AWS-wide policy keys, go to Element Descriptions in the Using AWS Identity and Access Management Guide.

Related Topics

• How to Write a Policy (Using AWS Identity and Access Management Guide)

Configure IAM Roles for Amazon EMR

Topics

• Launch an Amazon EMR Job Flow with an IAM Role (p. 281)

• EMRJobflowDefault IAM Role (p. 282)

• Custom IAM Roles (p. 283)

• Access AWS Resources Using IAM Roles (p. 284)

An AWS Identity and Access Management (IAM) role is a way to delegate access so IAM users or services in AWS can act on your AWS resources. You create an IAM role and assign permissions to it, such as the ability to read and write data in one of your Amazon S3 buckets. When an IAM user or a service in AWS assumes that IAM role, they gain the specified permissions to access your AWS resources.

Amazon EMR uses IAM roles so that applications running on the EC2 instances of your job flow can access your AWS resources without the need to distribute your AWS account or IAM user credentials to those EC2 instances.


With IAM roles, not only is your account information more secure, but you can refine the IAM roles to limit the actions these applications take on your behalf. For example, with IAM roles, you can grant an application the ability to read from the S3 bucket that contains your input data, but restrict its ability to launch new EC2 instances.

For more information about IAM roles, go to Delegating API Access by Using Roles.

Launch an Amazon EMR Job Flow with an IAM Role

The following versions of Amazon EMR components are required to use IAM roles:

• AMI version 2.3.0 or later.

• If you are using Hive, version 0.8.1.6 or later.

• If you are using the CLI, version 2012-12-17 or later.

• If you are using s3DistCP, use the version at s3://sa-east-1.elasticmapreduce/libs/s3distcp/role/s3distcp.jar.

Note
Launching a job flow with IAM roles is currently not supported in the Amazon EMR console. If you need to use IAM roles in your job flow, launch the job flow using the CLI or the API.

To launch a job flow with an IAM role using the CLI

• Add the --jobflow-role parameter to the command that creates the job flow and specify the name of the IAM role to apply to the EC2 instances in the job flow. The following example shows how to create an interactive Hive job flow that uses the default IAM role provided by Amazon EMR.

./elastic-mapreduce --create --alive --num-instances 3 \
--instance-type m1.small \
--name "myJobFlowName" \
--hive-interactive --hive-versions 0.8.1.6 \
--ami-version 2.3.0 \
--hadoop-version latest \
--jobflow-role EMRJobflowDefault

To set a default IAM role for the CLI

• If you launch most or all of your job flows with a specific IAM role, you can set that IAM role as the default for the CLI, so you don't need to specify it at the command line. Add a jobflow-role field in the credentials.json file you created when you installed the CLI. For more information about credentials.json, see Create a Credentials File.

The following example shows the contents of a credentials.json file that causes the CLI to always launch job flows with a user-defined IAM role, MyCustomRole.

{"access-id": "AccessKeyID","private-key": "PrivateKey",

API Version 2009-11-30281

Amazon Elastic MapReduce Developer GuideLaunch an Amazon EMR Job Flow with an IAM Role

Page 288: Amazon elastic map reduce

"key-pair": "KeyName","jobflow-role": "MyCustomRole","key-pair-file": "location of key pair file","region": "Region","log-uri": "location of bucket on Amazon S3"}

You can override the IAM role specified in credentials.json at any time by specifying a different IAM role at the command line as shown in the preceding procedure.

To launch a job flow with an IAM role using the API

• Add a JobFlowRole argument to the call to the RunJobFlow action that specifies the name of the IAM role. This is shown in the following example, which sets the IAM role for the job flow to EMRJobflowDefault.

https://elasticmapreduce.amazonaws.com?Action=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.small
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&JobFlowRole=EMRJobflowDefault
&AuthParams

If you do not specify the name of a role when you launch the job flow, the job flow is launched without roles enabled, and any applications on the job flow that need to access AWS resources must use pre-roles authentication methods.

EMRJobflowDefault IAM Role

To simplify using IAM roles, Amazon EMR provides a default IAM role called EMRJobflowDefault. If you launch a job flow using the CLI and specify the IAM role as EMRJobflowDefault, the CLI will check to see whether an IAM role with that name already exists for your account. If not, it will create the IAM role on your behalf.

If you are using an IAM user with the CLI, your IAM user must have iam:CreateRole, iam:PutRolePolicy, iam:CreateInstanceProfile, iam:AddRoleToInstanceProfile, iam:PassRole, and iam:ListInstanceProfiles permissions for the CLI to succeed in creating the default IAM role and launching the job flow with that IAM role.


The permissions set in the automatically generated EMRJobflowDefault IAM role are as follows.

{ "Statement": [ { "Action": [ "cloudwatch:*", "dynamodb:*", "ec2:Describe*", "elasticmapreduce:Describe*", "rds:Describe*", "s3:*", "sdb:*", "sns:*", "sqs:*", ], "Effect": "Allow", "Resource": [ "*" ]}]}

This set of permissions provides applications running on your job flow access to the full functionality of Amazon EMR, Amazon CloudWatch, Amazon S3, Amazon RDS, Amazon SimpleDB, and Amazon DynamoDB. It also provides access to a subset of the functionality of Amazon EC2, that is, the set of actions required by Hadoop to process job flows.

If your application doesn't require access to all of the services listed earlier, you can create a custom IAM role to use when launching job flows that is limited to just the access your application requires. For information on how to do that, see Custom IAM Roles (p. 283).

Custom IAM Roles

If the default IAM role provided by Amazon EMR, EMRJobflowDefault, does not meet your needs, you can create a custom IAM role and use that instead. For example, if your application does not access Amazon DynamoDB, you should remove Amazon DynamoDB permissions in your custom IAM role. Creating and managing IAM roles is described in the AWS Identity and Access Management documentation.

• Creating a Role

• Modifying a Role

• Deleting a Role

We recommend that you use the permissions in EMRJobflowDefault as a starting place when developing a custom IAM role to use with Amazon EMR. To ensure that you always have access to the original version of this IAM role, we recommend that you generate EMRJobflowDefault using the Amazon EMR CLI, copy the contents of EMRJobflowDefault, create a new IAM role, paste in the permissions, and modify those.

The following is an example of a custom IAM role for use with Amazon EMR. This example is for a job flow that does not use Amazon RDS or Amazon DynamoDB. The access to Amazon SimpleDB is included to permit debugging from the console. Access to Amazon CloudWatch is included so the job flow can report metrics. Amazon SNS and Amazon SQS permissions are included for messaging.


{ "Statement": [ { "Action": [ "cloudwatch:*", "ec2:Describe*", "elasticmapreduce:Describe*", "s3:*", "sdb:*", "sns:*", "sqs:*", ], "Effect": "Allow", "Resource": [ "*" ]}]}

Important
If you use the IAM CLI or API to create an IAM role and its associated instance profile, and give the instance profile a different name than the IAM role, you should use the name of the instance profile, not the name of the IAM role, when specifying an IAM role to use in an Amazon EMR job flow. For simplicity, we recommend you give a new IAM role the same name as its associated instance profile. For more information about instance profiles, go to Instance Profiles.

Access AWS Resources Using IAM Roles

If you've launched your job flow with an IAM role, applications running on the EC2 instances of that job flow can use the IAM role to obtain temporary account credentials to use when calling services in AWS.

The version of Hadoop available on AMI 2.3.0 and later has already been updated to make use of IAM roles. If your application runs strictly on top of the Hadoop architecture, and does not directly call any service in AWS, it should work with IAM roles with no modification.

If your application calls services in AWS directly, you'll need to update it to take advantage of IAM roles. This means that instead of obtaining account credentials from /home/hadoop/conf/core-site.xml on the EC2 instances in the job flow, your application will now either use an SDK to access the resources using IAM roles, or call the EC2 instance metadata to obtain the temporary credentials.
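For instance, a minimal sketch with the AWS SDK for Java (assuming a version that includes InstanceProfileCredentialsProvider) lets the SDK pull the temporary credentials from the instance metadata automatically instead of reading keys from core-site.xml; the Amazon S3 call is only an illustration.

import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;

public class RoleCredentialsFromSdkExample {
    public static void main(String[] args) {
        // The provider fetches and refreshes temporary credentials from the EC2
        // instance metadata service using the IAM role attached to the instance.
        InstanceProfileCredentialsProvider provider = new InstanceProfileCredentialsProvider();

        AmazonS3Client s3 = new AmazonS3Client(provider);
        System.out.println("Buckets visible to this IAM role: " + s3.listBuckets().size());
    }
}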

To access AWS resources with IAM roles using an SDK

• The following topics show how to use several of the AWS SDKs to access temporary account credentials using IAM roles. Each topic starts with a version of an application that does not use IAM roles and then walks you through the process of converting that application to use IAM roles.

• Using IAM Roles for EC2 Instances with the SDK for Java in the AWS SDK for Java Developer Guide

• Using IAM Roles for EC2 Instances with the SDK for .NET in the AWS SDK for .NET Developer Guide

• Using IAM Roles for EC2 Instances with the SDK for PHP in the AWS SDK for PHP Developer Guide


• Using IAM Roles for EC2 Instances with the SDK for Ruby in the AWS SDK for Ruby Developer Guide

To obtain temporary account credentials from EC2 instance metadata

• Call the following URL from an EC2 instance that is running with the specified IAM role. In the example that follows, we've used the default IAM role, EMRJobflowDefault. This URL returns the temporary security credentials (AccessKeyId, SecretAccessKey, SessionToken, and Expiration) associated with the IAM role.

GET http://169.254.169.254/latest/meta-data/iam/security-credentials/EMRJobflowDefault
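The same request can be issued programmatically. The following minimal sketch (plain Java, no SDK) simply prints the JSON document returned by the metadata service and assumes it runs on an instance launched with the EMRJobflowDefault role.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RoleCredentialsFromMetadataExample {
    public static void main(String[] args) throws Exception {
        // Works only on an EC2 instance launched with the IAM role named in the URL.
        URL url = new URL(
            "http://169.254.169.254/latest/meta-data/iam/security-credentials/EMRJobflowDefault");

        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            // The response is JSON containing AccessKeyId, SecretAccessKey,
            // SessionToken, and Expiration.
            System.out.println(line);
        }
        reader.close();
    }
}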

For more information about writing applications that use IAM roles, go to Granting Applications that Run on Amazon EC2 Instances Access to AWS Resources.

For more information on how to use temporary security credentials, go to Using Temporary Security Credentials to Access AWS.

Set Access Permissions on Files Written to Amazon S3

When you write a file to an Amazon Simple Storage Service (Amazon S3) bucket, by default, you are the only one able to read that file. The assumption is that you will write files to your own buckets, and this default setting protects the privacy of your files.

However, if you are running a job flow and you want the output written to the Amazon S3 bucket of another AWS user, and you want that other AWS user to be able to read that output, you must do two things:

• Have the other AWS user grant you write permissions for their Amazon S3 bucket. The job flow you launch runs under your AWS credentials, so any job flows you launch will also be able to write to that other AWS user's bucket.

• Set read permissions for the other AWS user on the files that you or the job flow write to the Amazon S3 bucket. The easiest way to set these read permissions is to use canned access control lists (ACLs), a set of pre-defined access policies defined by Amazon S3.

For information about how the other AWS user can grant you permissions to write files to the other user's Amazon S3 bucket, see Editing Bucket Permissions in the Amazon Simple Storage Service Console User Guide.

For your job flow to use canned ACLs when it writes files to Amazon S3, set the fs.s3.canned.acl job flow configuration option to the canned ACL to use. The following table lists the currently defined canned ACLs.


Canned ACL            Description

AuthenticatedRead
Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AuthenticatedUsers group grantee is granted Permission.Read access.

BucketOwnerFullControl
Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.

BucketOwnerRead
Specifies that the owner of the bucket is granted Permission.Read. The owner of the bucket is not necessarily the same as the owner of the object.

LogDeliveryWrite
Specifies that the owner is granted Permission.FullControl and the GroupGrantee.LogDelivery group grantee is granted Permission.Write access, so that access logs can be delivered.

Private
Specifies that the owner is granted Permission.FullControl.

PublicRead
Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read access.

PublicReadWrite
Specifies that the owner is granted Permission.FullControl and the GroupGrantee.AllUsers group grantee is granted Permission.Read and Permission.Write access.

There are many ways to set the job flow configuration options, depending on the type of job flow you are running. The following procedures show how to set the option for common cases.

To write files using canned ACLs in Hive

• From the Hive command prompt, set the fs.s3.canned.acl configuration option to the canned ACL you want to have the job flow set on files it writes to Amazon S3. To access the Hive command prompt, connect to the master node using SSH, and type Hive at the Hadoop command prompt. For more information, see Connect to the Master Node Using SSH (p. 111).

The following example sets the fs.s3.canned.acl configuration option to BucketOwnerFullControl, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command is case sensitive and contains no quotation marks or spaces.

hive> set fs.s3.canned.acl=BucketOwnerFullControl;
      create table acl (n int) location 's3://acltestbucket/acl/';
      insert overwrite table acl select count(n) from acl;

The last two lines of the example create a table that is stored in Amazon S3 and write data to the table.


To write files using canned ACLs in Pig

• From the Pig command prompt, set the fs.s3.canned.acl configuration option to the canned ACL you want to have the job flow set on files it writes to Amazon S3. To access the Pig command prompt, connect to the master node using SSH, and type Pig at the Hadoop command prompt. For more information, see Connect to the Master Node Using SSH (p. 111).

The following example sets the fs.s3.canned.acl configuration option to BucketOwnerFullControl, which gives the owner of the Amazon S3 bucket complete control over the file. Note that the set command includes one space before the canned ACL name and contains no quotation marks.

pig> set fs.s3.canned.acl BucketOwnerFullControl;
     store somedata into 's3://acltestbucket/pig/acl';

To write files using canned ACLs in a custom JAR

• Set the fs.s3.canned.acl configuration option using Hadoop with the -D flag. This is shown in the example below.

hadoop jar hadoop-examples.jar wordcount -Dfs.s3.canned.acl=BucketOwnerFullControl s3://mybucket/input s3://mybucket/output
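If you control the source of the custom JAR, a minimal sketch (assuming the Hadoop 0.20 org.apache.hadoop.mapreduce API referenced elsewhere in this guide) sets the same option programmatically; the class name and bucket paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CannedAclJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as passing -Dfs.s3.canned.acl=BucketOwnerFullControl on the command line.
        conf.set("fs.s3.canned.acl", "BucketOwnerFullControl");

        Job job = new Job(conf, "word count");
        job.setJarByClass(CannedAclJobExample.class);
        // ... set mapper, reducer, and output key/value classes here ...

        FileInputFormat.addInputPath(job, new Path(args[0]));    // for example, s3://mybucket/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // for example, s3://mybucket/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}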

Using Elastic IP Addresses

Topics

• Assign an Elastic IP Address to a New Job Flow (p. 288)

• Assigning an Elastic IP Address to a Running Job Flow (p. 288)

• View Allocated Elastic IP Addresses using Amazon EC2 (p. 289)

• Manage Elastic IP Addresses using Amazon EC2 (p. 290)

To help you manage your resources, you can change the dynamically assigned IP address of the master node of your running job flow to a static Elastic IP address. Elastic IP addresses enable you to quickly remap the dynamically assigned IP address of the job flow's master node to a static IP address. An Elastic IP address is associated with your AWS account, not with a particular job flow. You control your Elastic IP address until you choose to explicitly release it. For more information about Elastic IP addresses, go to Using Instance IP Addresses (http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/using-instance-addressing.html).

By default, the master node of your running job flow is assigned a dynamic IP address that is reachable from the Internet. The dynamic IP address is associated with the master node of your running job flow until it is stopped, terminated, or replaced with an Elastic IP address. When a job flow with an Elastic IP address is stopped or terminated, the Elastic IP address is not released and remains associated with your AWS account.


Assign an Elastic IP Address to a New Job Flow

From the Amazon Elastic MapReduce (Amazon EMR) CLI, you can allocate an Elastic IP address and assign it to a new job flow. (For information on assigning an Elastic IP address to a running job flow, see Assigning an Elastic IP Address to a Running Job Flow (p. 288).)

You can assign Elastic IP addresses from the Amazon EMR CLI. Amazon EMR does not support assignment of Elastic IP addresses from the Amazon EMR console or through the Amazon EMR API.

To assign an Elastic IP address to a job flow

• Create a job flow and add the --eip parameter.

For information on how to create a job flow using the CLI, go to Create a Job Flow (p. 23).

The CLI allocates an Elastic IP address and waits until the Elastic IP address is successfully assigned to the job flow. This assignment can take up to two minutes to complete.

Note
If you want to use a previously allocated Elastic IP address, use the --eip parameter followed by your allocated Elastic IP address. If the allocated Elastic IP address is in use by another job flow, the other job flow loses the Elastic IP address and is assigned a new dynamic IP address.

You have successfully created a job flow and assigned it an Elastic IP address. It may take one or two minutes before the instance is available from the assigned address.

Assigning an Elastic IP Address to a Running Job Flow

From the Amazon Elastic MapReduce (Amazon EMR) CLI, you can allocate an Elastic IP address and assign it to a running job flow. If you assign an Elastic IP address that is currently associated with another job flow, the other job flow is assigned a new dynamic IP address.

Amazon EMR does not support assignment of Elastic IP addresses from the Amazon EMR console or through the Amazon EMR API.

To assign an Elastic IP address using a job flow ID

1. If you do not currently have a running job flow, create a job flow.
For information on creating a job flow, go to Create a Job Flow (p. 23).

2. Identify your job flow:

Your job flow must have a Public DNS Name before you can assign an Elastic IP address. Typically, a job flow is assigned a Public DNS Name one or two minutes after launching the job flow.

If you are using...      Enter the following...

Linux or UNIX            & ./elastic-mapreduce --list

Microsoft Windows        c:\ruby elastic-mapreduce --list


The output looks similar to the following.

j-SLRI9SCLK7UC STARTING ec2-75-101-168-82.compute-1.amazonaws.com New Job Flow PENDING Streaming Job

The response includes the job flow ID and the Public DNS Name. You need the job flow ID to perform the next step.

3. Allocate and assign an Elastic IP address to the job flow:

If you are using...      Enter the following...

Linux or UNIX            & ./elastic-mapreduce job_flow_ID --eip

Microsoft Windows        c:\ruby elastic-mapreduce job_flow_ID --eip

This allocates an Elastic IP address and associates it with the named job flow.

Note
If you want to use a previously allocated Elastic IP address, include your Elastic IP address, Elastic_IP, as follows:

If you are using...      Enter the following...

Linux or UNIX            & ./elastic-mapreduce job_flow_ID --eip Elastic_IP

Microsoft Windows        c:\ruby elastic-mapreduce job_flow_ID --eip Elastic_IP

You have successfully assigned an Elastic IP address to your job flow.

View Allocated Elastic IP Addresses using Amazon EC2

Once you have allocated an Elastic IP address, you can reuse it on other job flows. To learn how to identify your currently allocated IP addresses, go to Using Elastic IP Addresses in the Amazon Elastic Compute Cloud User Guide.

Note
By default, each AWS customer has a limit of five Elastic IP addresses that can be associated with their account. If you would like to increase this limit, please submit a Request to Increase Elastic IP Address Limit (http://aws.amazon.com/contact-us/eip_limit_request/) to increase your maximum number of Elastic IP addresses.


Manage Elastic IP Addresses using Amazon EC2

Amazon EC2 allows you to manage your Elastic IP Addresses from the Amazon EC2 console, the Amazon EC2 command line interface, and the Amazon EC2 API.

To learn more about using Amazon EC2 to create and manage your Elastic IP addresses, go to Using Elastic IP Addresses in the Amazon Elastic Compute Cloud User Guide.

Specify the Amazon EMR AMI Version

Topics

• AMI Version Numbers (p. 290)

• Default AMI and Hadoop Versions (p. 291)

• Specifying the AMI Version for a New Job Flow (p. 291)

• Check the AMI Version of a Running Job Flow (p. 293)

• Amazon EMR AMIs and Hadoop Versions (p. 294)

• Amazon EMR AMI Deprecation (p. 294)

• AMI Versions Supported in Amazon EMR (p. 294)

Amazon Elastic MapReduce (Amazon EMR) uses Amazon Machine Images (AMIs) to initialize the Amazon EC2 instances it launches to run a job flow. The AMIs contain the Linux operating system, Hadoop, and other software used to run the job flow. These AMIs are specific to Amazon EMR and can be used only in the context of running a job flow. Periodically, Amazon EMR updates these AMIs with new versions of Hadoop and other software, so users can take advantage of improvements and new features.

For general information about AMIs, go to Using AMIs in the Amazon Elastic Compute Cloud User Guide. For details about the software versions included in the Amazon EMR AMIs, go to the section called “AMI Versions Supported in Amazon EMR” (p. 294).

If your application depends on a specific version or configuration of Hadoop, you might want to delay upgrading to the new AMI until you have tested your application on it. AMI versioning gives you the option to specify which AMI version your job flow uses to launch Amazon EC2 instances.

Specifying the AMI version during job flow creation is optional; if you do not provide an AMI-version parameter, and you are using the CLI, your job flows will run on the most recent AMI version. This means you always have the latest software running on your job flows, but you must ensure that your application will work with new changes as they are released.

If you specify an AMI version when you create a job flow, your instances will be created using that AMI. This provides stability for long-running or mission-critical applications. The trade-off is that your application will not have access to new features on more up-to-date AMI versions.

AMI Version Numbers

AMI version numbers are composed of three parts: major-version.minor-version.patch. The current version of the Amazon EMR CLI provides three ways to specify which version of the AMI to use to launch your job flow.

• Fully specified—If you specify the AMI version using all three parts (e.g. --ami-version 2.0.1), your job flow will be launched on exactly that version. The preceding example would launch a job flow using AMI 2.0.1. This is useful if you are running an application that depends on a specific AMI version and you want to ensure that AMI version is the one used to launch your job flows. The downside is you will not benefit from new features and improvements that are released on subsequent AMIs.


• Major-minor version specified—If you specify just the major and minor version for the AMI (e.g. --ami-version 2.0), your job flow will be launched on the AMI that matches those specifications and which has the latest patches. The preceding example would launch a job flow using AMI 2.0.4, since .4 is the latest patch for the 2.0 AMI series that is not deprecated. This scenario ensures a measure of stability in the AMI version, while ensuring that you receive the benefits of new patches and bug fixes.

• Latest version specified—If you use the keyword latest instead of a version number for the AMI (e.g. --ami-version latest), the job flow will be launched with the latest version available. At this writing, the preceding example would launch a job flow using AMI 2.1.1, because that is the latest version currently available. This is the most dynamic way to run your job flows, as AMIs are updated regularly. This configuration is best for prototyping and testing and is not recommended for production environments.

Default AMI and Hadoop Versions

If you don't specify the AMI and Hadoop versions for the job flow, Amazon EMR launches your job flow with default versions. The default versions returned depend on the interface you use to launch the job flow.

Interface                                    Default AMI and Hadoop versions
Amazon EMR console                           Latest AMI and Hadoop versions
API                                          AMI 1.0, Hadoop 0.18
SDK                                          AMI 1.0, Hadoop 0.18
CLI (version 2012-07-30 and later)           Latest AMI and Hadoop versions
CLI (versions 2011-12-08 to 2012-07-09)      AMI 2.1.3, Hadoop 0.20.205
CLI (version 2011-12-11 and earlier)         AMI 1.0, Hadoop 0.18

To determine which version of the CLI you have installed, run the following command.

$ ./elastic-mapreduce --version

Specifying the AMI Version for a New Job Flow

You can specify which AMI version a new job flow should use when you create it. For details about the default configuration and applications available on AMI versions, see AMI Versions Supported in Amazon EMR (p. 294).

Note
AMI versioning is not currently supported in the Amazon EMR console. Job flows created through the Amazon EMR console use the latest version available.

To specify an AMI version using the CLI

• When creating a job flow using the CLI, add the --ami-version parameter. If you do not specify this parameter, or if you specify --ami-version latest, the most recent AMI version will be used.

The following example specifies the AMI completely and will launch a job flow on AMI 2.0.1.


$ ./elastic-mapreduce --create --alive --name "Static AMI Version" \ --ami-version 2.0.1 \ --num-instances 5 --instance-type m1.small

The following example specifies the AMI using just the major and minor version. It will launch the job flow on the AMI that matches those specifications and which has the latest patches. This example would launch a job flow using AMI 2.0.5, since .5 is the latest patch for the 2.0 AMI series.

$ ./elastic-mapreduce --create --alive --name "Major-Minor AMI Version" \
  --ami-version 2.0 \
  --num-instances 5 --instance-type m1.small

The following example specifies that the job flow should be launched with the most current version available. At this writing, this example would launch a job flow using AMI 2.2.0, because that is the latest version currently available.

$ ./elastic-mapreduce --create --alive --name "Latest AMI Version" \
  --ami-version latest \
  --num-instances 5 --instance-type m1.small

To specify an AMI version using the API

• When creating a job flow using the API, add the AmiVersion and the HadoopVersion parameters to the request string, as shown in the following example. If you do not specify these parameters, Amazon EMR will create the job flow using the version 1.0 AMI and Hadoop 0.20. For more information, go to RunJobFlow in the Amazon Elastic MapReduce API Reference.

https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
&Name=MyJobFlowName
&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&AmiVersion=1.0
&HadoopVersion=0.20
&Instances.MasterInstanceType=m1.small
&Instances.SlaveInstanceType=m1.small
&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname
&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&AuthParams


Check the AMI Version of a Running Job Flow

If you need to find out which AMI version a job flow is running, you can retrieve this information using the console, the CLI, or the API.

To check the current AMI version using the console

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click a job flow. The Ami Version and other details about the job flow are displayed in the navigation pane that appears.

To check the current AMI version using the CLI

• Use the --describe parameter to retrieve the AMI version of a job flow. In the following example, JobFlowID is the identifier of the job flow. The AMI version is returned along with other information about the job flow.

$ ./elastic-mapreduce --describe --jobflow JobFlowID
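
If you only need the AMI version from this output, you can filter the response. The following is a minimal sketch that assumes a Unix shell; JobFlowID is a placeholder as in the preceding example.

$ ./elastic-mapreduce --describe --jobflow JobFlowID | grep AmiVersion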

To check the current AMI version using the API

• Call DescribeJobFlows to check which AMI version a job flow is using. The version is returned as part of the response data, as shown in the following example. For the complete response syntax, go to DescribeJobFlows in the Amazon Elastic MapReduce API Reference.

<DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">


  <DescribeJobFlowsResult>
    <JobFlows>
      <member>
        ...
        <AmiVersion>
          2.1.3
        </AmiVersion>
        ...
      </member>
    </JobFlows>
  </DescribeJobFlowsResult>
  <ResponseMetadata>
    <RequestId>
      9cea3229-ed85-11dd-9877-6fad448a8419
    </RequestId>
  </ResponseMetadata>
</DescribeJobFlowsResponse>

Amazon EMR AMIs and Hadoop Versions

An AMI can contain multiple versions of Hadoop. If the AMI you specify has multiple versions of Hadoop available, you can select the version of Hadoop you want to run as described in the section called “Hadoop Configuration” (p. 299). You cannot specify a Hadoop version that is not available on the AMI. For a list of the versions of Hadoop supported on each AMI, go to AMI Versions Supported in Amazon EMR (p. 294).

Amazon EMR AMI Deprecation

Eighteen months after an AMI version is released, the Amazon EMR team might choose to deprecate that AMI version and no longer support it. In addition, the Amazon EMR team might deprecate an AMI before eighteen months have elapsed if a security risk or other issue is identified in the software or operating system of the AMI. If a job flow is running when its AMI is deprecated, the job flow will not be affected. You will not, however, be able to create new job flows with the deprecated AMI version. The best practice is to plan for AMI obsolescence and move to new AMI versions as soon as is practical for your application.

Before an AMI is deprecated, the Amazon EMR team will send out an announcement specifying the date on which the AMI version will no longer be supported.

AMI Versions Supported in Amazon EMR

Amazon EMR supports the AMI versions listed below. You can specify the AMI version to use when you create a job flow. If you do not specify an AMI version, Amazon EMR creates the job flow using the default AMI version. For information about default AMI configurations, see Default AMI and Hadoop Versions (p. 291).

AMI version 2.3.0 (released 19 December 2012)
Same as 2.2.4, with the following additions:
• Adds support for IAM roles. For more information, see Configure IAM Roles for Amazon EMR (p. 280).

AMI version 2.2.4 (released 6 December 2012)
Same as 2.2.3, with the following additions:
• Improves error handling in the Snappy decompressor. For more information, go to HADOOP-8151.
• Fixes an issue with MapFile.Reader reading LZO or Snappy compressed files. For more information, go to HADOOP-8423.
• Updates the kernel to the AWS version of 3.2.30-49.59.

AMI version 2.2.3 (released 30 November 2012)
Same as 2.2.1, with the following additions:
• Improves HBase backup functionality.
• Updates the AWS SDK for Java to version 1.3.23.
• Resolves issues with the job tracker user interface.
• Improves Amazon S3 file system handling in Hadoop.
• Improves NameNode functionality in Hadoop.

AMI version 2.2.2 (released 23 November 2012)
Deprecated.

AMI version 2.2.1 (released 30 August 2012)
Same as 2.2.0, with the following additions:
• Fixes an issue with HBase backup functionality.
• Enables multipart upload by default for files larger than the Amazon S3 block size specified by fs.s3n.blockSize. For more information, see Multipart Upload (p. 343).

AMI version 2.2.0 (released 6 August 2012)
Same as 2.1.3, with the following additions:
• Adds support for Hadoop 1.0.3.
• No longer includes Hadoop 0.18 and Hadoop 0.20.205.
Operating system: Debian 6.0.5 (Squeeze)
Applications: Hadoop 1.0.3, Hive 0.8.1.3, Pig 0.9.2.2, HBase 0.92.0
Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7
File system: ext3 for root, xfs for ephemeral
Kernel: Amazon Linux

AMI version 2.1.4 (released 30 August 2012)
Same as 2.1.3, with the following additions:
• Fixes issues in the Native Amazon S3 file system.
• Enables multipart upload by default. For more information, see Multipart Upload (p. 343).

AMI version 2.1.3 (released 6 August 2012)
Same as 2.1.2, with the following additions:
• Fixes issues in HBase.

AMI version 2.1.2 (released 6 August 2012)
Same as 2.1.1, with the following additions:
• Adds support for Amazon CloudWatch metrics when using MapR.
• Improves the reliability of reporting metrics to Amazon CloudWatch.

AMI version 2.1.1 (released 3 July 2012)
Same as 2.1.0, with the following additions:
• Improves the reliability of log pushing.
• Adds support for HBase in Amazon VPC.
• Improves DNS retry functionality.

AMI version 2.1.0 (released 12 June 2012)
Same as AMI 2.0.5, with the following additions:
• Supports launching HBase clusters. For more information, see Store Data with HBase (p. 155).
• Supports running MapR Edition M3 and Edition M5. For more information, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).
• Enables HDFS append by default; dfs.support.append is set to true in hdfs/hdfs-default.xml. The default value in code is also set to true.
• Fixes a race condition in the instance controller.
• Changes mapreduce.user.classpath.first to default to true. This configuration setting indicates whether to load classes first from the job flow's JAR file or the Hadoop system lib directory. This change was made to provide a way for you to easily override classes in Hadoop.
• Uses Debian 6.0.5 (Squeeze) as the operating system.

AMI version 2.0.5 (released 19 April 2012)
Note: Because of an issue with AMI 2.0.5, this version is deprecated. We recommend that you use a different AMI version instead.
Same as AMI 2.0.4, with the following additions:
• Improves Hadoop performance by reinitializing the recycled compressor object for mappers only if they are configured to use the GZip compression codec for output.
• Adds a configuration variable to Hadoop called mapreduce.jobtracker.system.dir.permission that can be used to set permissions on the system directory. For more information, see Setting Permissions on the System Directory (p. 345).
• Changes InstanceController to use an embedded database rather than the MySQL instance running on the box. MySQL remains installed and running by default.
• Improves the collectd configuration. For more information about collectd, go to http://collectd.org/.
• Fixes a rare race condition in InstanceController.
• Changes the default shell from dash to bash.
• Uses Debian 6.0.4 (Squeeze) as the operating system.

AMI version 2.0.4 (released 30 January 2012)
Same as AMI 2.0.3, with the following additions:
• Changes the default for fs.s3n.blockSize to 33554432 (32 MiB).
• Fixes a bug in reading zero-length files from Amazon S3.

AMI version 2.0.3 (released 24 January 2012)
Same as AMI 2.0.2, with the following additions:
• Adds support for Amazon EMR metrics in Amazon CloudWatch.
• Improves performance of seek operations in Amazon S3.

AMI version 2.0.2 (released 17 January 2012)
Same as AMI 2.0.1, with the following additions:
• Adds support for the Python API Dumbo. For more information about Dumbo, go to https://github.com/klbostee/dumbo/wiki/.
• The AMI now runs the Network Time Protocol Daemon (NTPD) by default. For more information about NTPD, go to http://en.wikipedia.org/wiki/Ntpd.
• Updates the Amazon Web Services SDK to version 1.2.16.
• Improves the way Amazon S3 file system initialization checks for the existence of Amazon S3 buckets.
• Adds support for configuring the Amazon S3 block size to facilitate splitting files in Amazon S3. You set this in the fs.s3n.blockSize parameter. You set this parameter by using the configure-hadoop bootstrap action. The default value is 9223372036854775807 (8 EiB).
• Adds a /dev/sd symlink for each /dev/xvd device. For example, /dev/xvdb now has a symlink pointing to it called /dev/sdb. Now you can use the same device names for AMI 1.0 and 2.0.

AMI version 2.0.1 (released 19 December 2011)
Same as AMI 2.0 except for the following bug fixes:
• Task attempt logs are pushed to Amazon S3.
• Fixed /mnt mounting on 32-bit AMIs.
• Uses Debian 6.0.3 (Squeeze) as the operating system.

AMI version 2.0.0 (released 11 December 2011)
Operating system: Debian 6.0.2 (Squeeze)
Applications: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1
Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7
File system: ext3 for root, xfs for ephemeral
Kernel: Amazon Linux
Note: Added support for the Snappy compression/decompression library.

AMI version 1.0.1 (released 3 April 2012)
Same as AMI 1.0 except for the following change:
• Updates sources.list to the new location of the Lenny distribution in archive.debian.org.

AMI version 1.0.0 (released 26 April 2011)
Operating system: Debian 5.0 (Lenny)
Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)
Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7
File system: ext3 for root and ephemeral
Kernel: Red Hat
Note: This was the last AMI released before the CLI was updated to support AMI versioning. For backward compatibility, job flows launched with versions of the CLI downloaded before 11 December 2011 use this version.

Note
The cc2.8xlarge instance type is supported only on AMI 2.0.0 or later. The hs1.8xlarge instance type is supported only on AMI 2.3 or later.

Hadoop Configuration

Apache Hadoop runs on the EC2 instances that a job flow launches in order to process the job flow's data. Depending on the Amazon Machine Image (AMI) version you use to launch the job flow, you have many options as to which version of Hadoop to run, each with different options and configurations. For more information about the supported AMI versions, see Specify the Amazon EMR AMI Version (p. 290).

The following sections describe the various configuration settings and mechanisms available in Amazon Elastic MapReduce (Amazon EMR).

Topics

• Supported Hadoop Versions (p. 300)

• Configuration of hadoop-user-env.sh (p. 302)

• Upgrading to Hadoop 1.0 (p. 302)

• Hadoop 0.20 Streaming Configuration (p. 304)

• Hadoop Default Configuration (AMI 1.0) (p. 304)

• Hadoop Memory-Intensive Configuration Settings (AMI 1.0) (p. 311)

• Hadoop Default Configuration (AMI 2.0 and 2.1) (p. 314)

• Hadoop Default Configuration (AMI 2.2) (p. 322)

• Hadoop Default Configuration (AMI 2.3) (p. 330)

• File System Configuration (p. 338)

• JSON Configuration Files (p. 340)

• Multipart Upload (p. 343)

• Hadoop Data Compression (p. 344)

• Setting Permissions on the System Directory (p. 345)

• Hadoop Patches (p. 346)


Supported Hadoop Versions

You can choose to run one of four Hadoop versions. You set the --hadoop-version as shown in the following table. We recommend using the latest version of Hadoop to take advantage of performance enhancements and new functionality.

Hadoop version    Configuration parameters
1.0.3             --hadoop-version 1.0.3 --ami-version 2.3
0.20.205          --hadoop-version 0.20.205 --ami-version 2.0
0.20              --hadoop-version 0.20 --ami-version 1.0
0.18              --hadoop-version 0.18 --ami-version 1.0

For details about the default configuration and software available on AMIs used by Amazon Elastic MapReduce (Amazon EMR), see Specify the Amazon EMR AMI Version (p. 290).

To specify the Hadoop version when creating a job flow with the CLI

• Add the --hadoop-version option and specify the version number. The following example creates a waiting job flow running Hadoop 1.0.3. The version of Hadoop you specify must be available on the AMI you are using for the job flow. For details about the version of Hadoop available on an AMI, see AMI Versions Supported in Amazon EMR (p. 294).

$ ./elastic-mapreduce --create --alive --name "Test Hadoop" \
  --hadoop-version 1.0.3 \
  --num-instances 5 --instance-type m1.small

Hadoop 1.0 New Features

Hadoop 1.0.3 support in Amazon EMR includes the features listed in Hadoop Common Releases, including:

• A RESTful API to HDFS, providing a complete FileSystem implementation for accessing HDFS over HTTP.

• Support for executing new writes in HBase while an hflush/sync is in progress.

• Performance-enhanced access to local files for HBase.

• The ability to run Hadoop, Hive, and Pig jobs as another user, similar to the following:

$ export HADOOP_USER_NAME=usernamehere

By exporting the HADOOP_USER_NAME environment variable, the job is then executed by the specified username (a short sketch follows this list).

Note
If HDFS is used, you need to either change the permissions on HDFS to allow READ and WRITE access to the specified username, or disable permission checks on HDFS. This is done by setting the configuration variable dfs.permissions to false in the hdfs-site.xml file and then restarting the NameNode, similar to the following:

API Version 2009-11-30300

Amazon Elastic MapReduce Developer GuideSupported Hadoop Versions

Page 307: Amazon elastic map reduce

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

• The S3 file split size variable is renamed from fs.s3.blockSize to fs.s3.block.size, and the default is set to 64 MB. This is for consistency with the variable name added in patch HADOOP-5861.
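
The following is a minimal sketch of running work as another user, as described in the list above. The username, directory, JAR file, and class names are hypothetical placeholders, and the HDFS permissions caveat from the Note above still applies.

$ export HADOOP_USER_NAME=analyst
$ hadoop fs -mkdir /user/analyst          # created and owned by "analyst" rather than "hadoop"
$ hadoop jar my-application.jar MyMainClass ...   # the job also runs as "analyst"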

Setting access permissions on files written to Amazon S3 is also supported in Hadoop 1.0.3 with Amazon EMR. For more information, see Set Access Permissions on Files Written to Amazon S3 (p. 285).

For a list of the patches applied to the Amazon EMR version of Hadoop 1.0.3, see Hadoop 1.0.3 Patches (p. 346).

Hadoop 0.20 New Features

Hadoop 0.18 was not designed to efficiently handle multiple small files. The following enhancements in Hadoop 0.20 and later improve the performance of processing small files:

• Hadoop 0.20 and later assigns multiple tasks per heartbeat. A heartbeat is a method that periodically checks to see if the client is still alive. By assigning multiple tasks, Hadoop can distribute tasks to slave nodes faster, thereby improving performance. The time taken to distribute tasks is an important part of a job's overall processing time.

• Historically, Hadoop processes each task in its own Java Virtual Machine (JVM). If you have many small files that take only a second to process, the overhead is great when you start a JVM for each task. Hadoop 0.20 and later can share one JVM for multiple tasks, thus significantly improving your processing time.

• Hadoop 0.20 and later allows you to process multiple files in a single map task, which reduces the overhead associated with setting up a task. A single task can now process multiple small files.

Hadoop 0.20 and later also supports the following features:

• A new command line option, -libjars, enables you to include a specified JAR file in the class path of every task.

• The ability to skip individual records rather than entire files. In previous versions of Hadoop, failures in record processing caused the entire file containing the bad record to be skipped. Jobs that previously failed can now return partial results.

In addition to the Hadoop 0.18 streaming parameters, Hadoop 0.20 and later introduces the three new streaming parameters listed in the following table:

Parameter    Definition
-files       Specifies comma-separated files to copy to the MapReduce cluster.
-archives    Specifies comma-separated archives to restore to the compute machines.
-D           Specifies a value for the key you enter, in the form <key>=<value>.

For a list of the patches applied to the Amazon EMR version of Hadoop 0.20.205, see Hadoop 0.20.205 Patches (p. 347).


Configuration of hadoop-user-env.sh

When you run a Hadoop daemon or job, a number of scripts are executed as part of the initialization process. The script that runs when you enter hadoop at the Hadoop command line is a shell script located in /home/hadoop/bin/hadoop. This script is responsible for setting up the Java classpath, configuring the Java memory settings, determining which main class to run, and executing the actual Java process.

As part of the Hadoop configuration, the hadoop script executes a file called conf/hadoop-env.sh. The hadoop-env.sh script can set various environment variables. The conf/hadoop-env.sh script is used so that the main bin/hadoop script remains unmodified. Amazon Elastic MapReduce (Amazon EMR) creates a hadoop-env.sh script on every node in a cluster in order to configure the amount of memory for every Hadoop daemon launched.

Additionally, Amazon EMR provides a user-customizable script, conf/hadoop-user-env.sh, to allow you to override the default Hadoop settings that Amazon EMR configures.

You should put your custom overrides for the Hadoop environment variables in conf/hadoop-user-env.sh. Custom overrides could include items such as changes to Java memory settings or additional JAR files in the classpath. The script is also where Amazon EMR writes data when you use a bootstrap action to configure memory or to specify additional Java arguments.

Examples of environment variables that you can specify in hadoop-user-env.sh include:

• export HADOOP_DATANODE_HEAPSIZE="128"

• export HADOOP_JOBTRACKER_HEAPSIZE="768"

• export HADOOP_NAMENODE_HEAPSIZE="256"

• export HADOOP_OPTS="-server"

• export HADOOP_TASKTRACKER_HEAPSIZE="512"

Bootstrap actions run before Hadoop starts and before any steps are run. In some cases it is necessary to configure the Hadoop environment variables referenced in the Hadoop launch script.

If the script /home/hadoop/conf/hadoop-user-env.sh exists when Hadoop launches, Amazon EMR executes this script and any options are passed on to bin/hadoop.

For example, if you want to add a JAR file to the Hadoop daemon classpath, you can use a bootstrap action such as:

#!/bin/bash
echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
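
Any of the environment variables listed earlier can be overridden in the same way. The following sketch lowers the DataNode heap size on every node; the heap value and the Amazon S3 location where you would store the script are hypothetical.

#!/bin/bash
# Hypothetical bootstrap action: override the default DataNode heap size
echo "export HADOOP_DATANODE_HEAPSIZE=256" >> /home/hadoop/conf/hadoop-user-env.sh

You would then reference the script when creating the job flow, for example with --bootstrap-action s3://mybucket/lower-datanode-heap.sh, where the bucket and file name are placeholders.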

For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

Upgrading to Hadoop 1.0

This section describes how to upgrade your Amazon Elastic MapReduce (Amazon EMR) deployment to Hadoop 1.0.3.

Note
The following information applies to Hadoop 0.20 and later, including Hadoop 1.0.3.

Many Hadoop jobs that run successfully on Hadoop 0.18 run without modification on Hadoop 0.20 and later. However, before you engage in a full upgrade, we recommend recompiling your Hadoop jobs against Hadoop 1.0.3 and testing on small subsets of your data.


Streaming jobs should also work without modification, but we recommend using the new streaming parameters introduced with version 0.20. These are summarized in the following table.

Hadoop 0.18      Hadoop 0.20    Type
-cacheFile       -files         Comma-separated URIs
-cacheArchive    -archives      Comma-separated URIs
-jobconf         -D             key=value

When using Amazon EMR with Hadoop 0.20 and later, we offer the additional guidance listed below:

• You should recompile Cascading applications with the Hadoop 1.0.3 version specified so they can take advantage of the new features available in this version.

• Full support is provided for Pig scripts.

• All Amazon EMR sample applications are compatible. The Amazon EMR console only supports Hadoop 1.0.3, so samples default to 1.0.3 once launched.

Hadoop Version Behavior

The version of Hive and Pig installed on your job flow depends on the Hadoop version installed on your job flow. For Hadoop version 1.0.3, Hive version 0.8.1 and Pig version 0.9.2 are used. For Hadoop version 0.20.205, Hive version 0.7.1 and Pig version 0.9.1 are used. For Hadoop version 0.20, Hive version 0.5 and Pig version 0.6 are used. For Hadoop version 0.18, Hive version 0.4 and Pig version 0.3 are used. The version can be selected by setting HadoopVersion in JobFlowInstancesConfig.

The Amazon EMR console supports Hadoop 1.0.3 with Hive 0.8.1 and Pig 0.9.2.

The default version of Hadoop for the Amazon EMR console and the command line interface is Hadoop 1.0.3 with Hive 0.8.1 and Pig 0.9.2. You can continue running Hadoop 0.18 with Hive 0.4 for the remainder of the Hadoop 0.18 lifecycle. Additional versions of Hive are available on the command line interface through Hive versioning; for more information, go to Supported Hive Versions (p. 349).

For all job flows run from the Amazon EMR APIs or Java SDK, the default version of Hadoop is 0.18 with Hive 0.4 and Pig 0.3. This is to maintain compatibility with existing libraries and systems. You can continue running Hadoop 0.18 with Hive 0.4 and Pig 0.3 from the Elastic MapReduce API or Java SDK for the remainder of the Hadoop 0.18 lifecycle, but you should consider upgrading as soon as possible to take advantage of the features and performance improvements found in Hadoop 1.0.3, Hive 0.8.1, and Pig 0.9.2.

For more information, see Default AMI and Hadoop Versions (p. 291).

You can choose to continue running Hadoop 0.18 with Hive 0.4 using either the command line interface or the Amazon EMR API with the HadoopVersion parameter in the RunJobFlow function. This parameter accepts the values 0.18, 0.20, 0.20.205, and 1.0.3. We have regenerated the client libraries to support the new API. Old clients and libraries continue to default to Hadoop 0.18. If you update to the new clients and want to run Hadoop 0.18, you must explicitly specify version 0.18 in your requests.

The CLI defaults to running Hadoop 1.0.3. To run Hadoop version 0.18, you can either use an earlier version of the Ruby client or specify --hadoop-version 0.18 when creating the job flow, as shown in the sketch below. As with other options in the command line client, you can specify the --hadoop-version parameter in your .credentials file.
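
The following is a minimal sketch of launching a job flow on Hadoop 0.18; the job flow name and instance settings are placeholders, and AMI 1.0 is specified because it is the AMI series that includes Hadoop 0.18.

$ ./elastic-mapreduce --create --alive --name "Hadoop 0.18 Job Flow" \
  --hadoop-version 0.18 --ami-version 1.0 \
  --num-instances 3 --instance-type m1.small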


Hadoop 0.20 Streaming Configuration

Hadoop 0.20 and later supports the three streaming parameters described in the following table, in addition to the version 0.18 parameters.

Parameter       Description
-files          Specifies comma-separated files to copy to the MapReduce cluster.
-archives       Specifies comma-separated archives to restore to the compute machines.
-D KEY=VALUE    Sets a Hadoop configuration variable. KEY is a Hadoop configuration variable, such as mapred.map.tasks, and VALUE is the new value.

The -files and -archives parameters are similar to -cacheFile and -cacheArchive of Hadoop 0.18, except that they accept comma-separated values.
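
The following is a minimal sketch of a streaming job that uses these parameters, run from the master node. The streaming JAR path, bucket names, and script names are assumptions and may differ depending on your AMI version.

$ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.reduce.tasks=5 \
    -files s3n://mybucket/scripts/mapper.py,s3n://mybucket/scripts/reducer.py \
    -input s3n://mybucket/input \
    -output s3n://mybucket/output \
    -mapper mapper.py \
    -reducer reducer.py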

Hadoop Default Configuration (AMI 1.0)

Topics

• Hadoop Configuration (AMI 1.0) (p. 304)

• HDFS Configuration (AMI 1.0) (p. 307)

• Task Configuration (AMI 1.0) (p. 308)

• Intermediate Compression (AMI 1.0) (p. 311)

This section describes the default configuration settings that Amazon Elastic MapReduce (Amazon EMR) uses to configure a Hadoop cluster launched with Amazon Machine Image (AMI) version 1.0. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

Hadoop Configuration (AMI 1.0)

The following Amazon Elastic MapReduce (Amazon EMR) default configuration settings are appropriate for most workloads.

If your job flow tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size. These and other memory-intensive configuration settings are described in Hadoop Memory-Intensive Configuration Settings (AMI 1.0) (p. 311).

The following tables list the default configuration settings for each Amazon EC2 instance type in job flows launched with Amazon EMR AMI version 1.0. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

m1.small

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  768
HADOOP_NAMENODE_HEAPSIZE                    256
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    128
mapred.child.java.opts                      -Xmx725m
mapred.tasktracker.map.tasks.maximum        2
mapred.tasktracker.reduce.tasks.maximum     1

m1.large

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  3072
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 1536
HADOOP_DATANODE_HEAPSIZE                    256
mapred.child.java.opts                      -Xmx1600m
mapred.tasktracker.map.tasks.maximum        4
mapred.tasktracker.reduce.tasks.maximum     2

m1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  9216
HADOOP_NAMENODE_HEAPSIZE                    3072
HADOOP_TASKTRACKER_HEAPSIZE                 3072
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1600m
mapred.tasktracker.map.tasks.maximum        8
mapred.tasktracker.reduce.tasks.maximum     4

c1.medium

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  768
HADOOP_NAMENODE_HEAPSIZE                    256
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    256
mapred.child.java.opts                      -Xmx362m
mapred.tasktracker.map.tasks.maximum        4
mapred.tasktracker.reduce.tasks.maximum     2

c1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  3072
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 1536
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx747m
mapred.tasktracker.map.tasks.maximum        8
mapred.tasktracker.reduce.tasks.maximum     4

m2.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  12288
HADOOP_NAMENODE_HEAPSIZE                    4096
HADOOP_TASKTRACKER_HEAPSIZE                 3072
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx2048m
mapred.tasktracker.map.tasks.maximum        4
mapred.tasktracker.reduce.tasks.maximum     2

m2.2xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  24576
HADOOP_NAMENODE_HEAPSIZE                    8192
HADOOP_TASKTRACKER_HEAPSIZE                 3072
HADOOP_DATANODE_HEAPSIZE                    1024
mapred.child.java.opts                      -Xmx3200m
mapred.tasktracker.map.tasks.maximum        8
mapred.tasktracker.reduce.tasks.maximum     4

m2.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  49152
HADOOP_NAMENODE_HEAPSIZE                    16384
HADOOP_TASKTRACKER_HEAPSIZE                 3072
HADOOP_DATANODE_HEAPSIZE                    2048
mapred.child.java.opts                      -Xmx3733m
mapred.tasktracker.map.tasks.maximum        16
mapred.tasktracker.reduce.tasks.maximum     8

cc1.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  10240
HADOOP_NAMENODE_HEAPSIZE                    5120
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1024m
mapred.tasktracker.map.tasks.maximum        12
mapred.tasktracker.reduce.tasks.maximum     3

cg1.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  10240
HADOOP_NAMENODE_HEAPSIZE                    5120
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1024m
mapred.tasktracker.map.tasks.maximum        12
mapred.tasktracker.reduce.tasks.maximum     3

HDFS Configuration (AMI 1.0)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.


dfs.block.size
  Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
  Default value: 134217728 (128 MB)

dfs.replication
  Definition: This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate.
  Default value: 1 for clusters with fewer than four nodes, 2 for clusters with fewer than ten nodes, 3 for all other clusters
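
You can override these HDFS defaults when you create a job flow by using the configure-hadoop bootstrap action. The following is a sketch only; it assumes that the -h flag targets hdfs-site settings in the same way that -m targets mapred-site settings elsewhere in this guide, and the values shown are examples.

$ ./elastic-mapreduce --create --alive --name "Custom HDFS settings" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-h,dfs.block.size=67108864,-h,dfs.replication=2"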

Task Configuration (AMI 1.0)

Topics

• Tasks per Machine (p. 308)

• Tasks per Job (AMI 1.0) (p. 309)

• Task JVM Settings (AMI 1.0) (p. 309)

• Avoiding Job Flow Slowdowns (AMI 1.0) (p. 310)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

• mapred.tasktracker.map.tasks.maximum

• mapred.tasktracker.reduce.tasks.maximum

Amazon Elastic MapReduce (Amazon EMR) provides defaults that are entirely dependent on the Amazon EC2 instance type. The following table shows the default settings.

Amazon EC2 instance name    Mappers    Reducers
m1.small                    2          1
m1.medium                   2          1
m1.large                    4          2
m1.xlarge                   8          4
c1.medium                   4          2
c1.xlarge                   8          4
m2.xlarge                   4          2
m2.2xlarge                  8          4
m2.4xlarge                  16         8
cc1.4xlarge                 12         3
cg1.4xlarge                 12         3

Note
The number of default mappers is based on the memory available on each Amazon EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out-of-memory errors.

Tasks per Job (AMI 1.0)

When your job flow runs, Hadoop creates a number of map and reduce tasks. These determine the number of tasks that can run simultaneously during your job flow. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.

Parameter                     Description
mapred.map.tasks              Target number of map tasks to run. The actual number of tasks created is sometimes different from this number.
mapred.map.tasksperslot       Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.
mapred.reduce.tasks           Number of reduce tasks to run.
mapred.reduce.tasksperslot    Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon Elastic MapReduce (Amazon EMR). They only take effect if mapred.*.tasks is not defined (a bootstrap-action sketch for setting a cluster-wide value follows the list below). The order of precedence is:

1. mapred.map.tasks set by the Hadoop job

2. mapred.map.tasks set in mapred-conf.xml on the master node

3. mapred.map.tasksperslot if neither of those is defined
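
As a sketch of how one of these values can be set for the whole job flow, the following command uses the configure-hadoop bootstrap action, assuming the -m flag writes the setting into mapred-site.xml as in the other configure-hadoop examples in this section; the value is an example only.

$ ./elastic-mapreduce --create --alive --name "Custom reducer count" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.reduce.tasks=10"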

Task JVM Settings (AMI 1.0)

You can configure the amount of heap space for tasks as well as other JVM options with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this setting, with the defaults per instance type shown in the following table.

Amazon EC2 instance name    Default JVM value
m1.small                    -Xmx725m
m1.large                    -Xmx1600m
m1.xlarge                   -Xmx1600m
c1.medium                   -Xmx362m
c1.xlarge                   -Xmx747m
m2.xlarge                   -Xmx2048m
m2.2xlarge                  -Xmx3200m
m2.4xlarge                  -Xmx1024m
cc1.4xlarge                 -Xmx1024m
cg1.4xlarge                 -Xmx1024m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

Example Modifying JVM using a bootstrap action

$ ./elasticmapreduce --create --alive --name "JVM infinite reuse" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Configuring infinite JVM reuse" \--args "-m,mapred.job.reuse.jvm.num.tasks=-1"

NoteAmazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you canoverride it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 0means do not reuse tasks.

Avoiding Job Flow Slowdowns (AMI 1.0)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your job flow. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the job flow progresses, some machines complete their tasks. Hadoop then schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. If, however, you are running a job where the task execution has side effects (for example, a zero-reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and disables it for reducers in AMI 1.0. You can override these settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).


Speculative Execution Parameters

Parameter                                    Default setting
mapred.map.tasks.speculative.execution       true
mapred.reduce.tasks.speculative.execution    false

Example Enabling reducer speculative execution using a bootstrap action

$ ./elasticmapreduce --create --alive --name "Reducer speculative execution" \

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Enable reducer speculative execution" \--args "-m,mapred.reduce.tasks.speculative.execution=true"

Intermediate Compression (AMI 1.0)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many job flows. To reduce this bottleneck, Amazon Elastic MapReduce (Amazon EMR) enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the LZO codec.

You can modify the default compression settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter                              Value
mapred.compress.map.output             true
mapred.map.output.compression.codec    com.hadoop.compression.lzo.LzoCodec

Example Enabling/disabling compression using a bootstrap action

$ ./elasticmapreduce --create --alive --name "Reducer speculative execution" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Disable compression" \--args "mapred.compress.map.output=false" \--args "mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Gzip Codec"

Hadoop Memory-Intensive Configuration Settings (AMI 1.0)

The Amazon Elastic MapReduce (Amazon EMR) default configuration settings are appropriate for most workloads. However, based on your job flow's specific memory and processing requirements, you might want to modify the configuration settings.

Note
The memory-intensive settings are set by default in AMI 2.0.0 and later. You should only need to adjust these settings for AMI versions 1.0.1 and earlier.


For example, if your job flow tasks are memory-intensive, you can use fewer tasks per core node and reduce your job tracker heap size. A predefined bootstrap action is available to configure your job flow on startup. For more information, see the Configure Memory-Intensive Workloads (p. 90) bootstrap action.
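
A sketch of launching a job flow with that predefined bootstrap action is shown below. The Amazon S3 path is the one commonly used for the memory-intensive configuration and should be treated as an assumption to confirm against the bootstrap action documentation referenced above.

$ ./elastic-mapreduce --create --alive --name "Memory-intensive workload" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive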

The following tables list the recommended configuration settings for each Amazon EC2 instance type. The default configurations for the cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads; therefore, the recommended configuration settings for these instances are not listed.

m1.small

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  512
HADOOP_NAMENODE_HEAPSIZE                    512
HADOOP_TASKTRACKER_HEAPSIZE                 256
HADOOP_DATANODE_HEAPSIZE                    128
mapred.child.java.opts                      -Xmx512m
mapred.tasktracker.map.tasks.maximum        2
mapred.tasktracker.reduce.tasks.maximum     1

m1.large

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  3072
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1024m
mapred.tasktracker.map.tasks.maximum        3
mapred.tasktracker.reduce.tasks.maximum     1

m1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  9216
HADOOP_NAMENODE_HEAPSIZE                    3072
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1024m
mapred.tasktracker.map.tasks.maximum        8
mapred.tasktracker.reduce.tasks.maximum     3

c1.medium

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  768
HADOOP_NAMENODE_HEAPSIZE                    512
HADOOP_TASKTRACKER_HEAPSIZE                 256
HADOOP_DATANODE_HEAPSIZE                    128
mapred.child.java.opts                      -Xmx512m
mapred.tasktracker.map.tasks.maximum        2
mapred.tasktracker.reduce.tasks.maximum     1

c1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  2048
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx512m
mapred.tasktracker.map.tasks.maximum        7
mapred.tasktracker.reduce.tasks.maximum     2

m2.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  4096
HADOOP_NAMENODE_HEAPSIZE                    2048
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx3072m
mapred.tasktracker.map.tasks.maximum        3
mapred.tasktracker.reduce.tasks.maximum     1

m2.2xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  8192
HADOOP_NAMENODE_HEAPSIZE                    4096
HADOOP_TASKTRACKER_HEAPSIZE                 1024
HADOOP_DATANODE_HEAPSIZE                    1024
mapred.child.java.opts                      -Xmx4096m
mapred.tasktracker.map.tasks.maximum        6
mapred.tasktracker.reduce.tasks.maximum     2

m2.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  8192
HADOOP_NAMENODE_HEAPSIZE                    8192
HADOOP_TASKTRACKER_HEAPSIZE                 1024
HADOOP_DATANODE_HEAPSIZE                    1024
mapred.child.java.opts                      -Xmx4096m
mapred.tasktracker.map.tasks.maximum        14
mapred.tasktracker.reduce.tasks.maximum     4

Hadoop Default Configuration (AMI 2.0 and 2.1)

Topics

• Hadoop Configuration (AMI 2.0 and 2.1) (p. 314)

• HDFS Configuration (AMI 2.0 and 2.1) (p. 318)

• Task Configuration (AMI 2.0 and 2.1) (p. 318)

• Intermediate Compression (AMI 2.0 and 2.1) (p. 321)

This section describes the default configuration settings that Amazon Elastic MapReduce (Amazon EMR) uses to configure a Hadoop cluster launched with Amazon Machine Image (AMI) version 2.0 or 2.1. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

Hadoop Configuration (AMI 2.0 and 2.1)

The following Amazon Elastic MapReduce (Amazon EMR) default configuration settings for job flows launched with Amazon EMR AMI 2.0 or 2.1 are appropriate for most workloads.

If your job flow tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size.


The following tables list the default configuration settings for each Amazon EC2 instance type in job flows launched with the Amazon EMR AMI version 2.0 or 2.1. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

m1.small

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  768
HADOOP_NAMENODE_HEAPSIZE                    256
HADOOP_TASKTRACKER_HEAPSIZE                 256
HADOOP_DATANODE_HEAPSIZE                    128
mapred.child.java.opts                      -Xmx384m
mapred.tasktracker.map.tasks.maximum        2
mapred.tasktracker.reduce.tasks.maximum     1

m1.large

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  3072
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1152m
mapred.tasktracker.map.tasks.maximum        3
mapred.tasktracker.reduce.tasks.maximum     1

m1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  9216
HADOOP_NAMENODE_HEAPSIZE                    3072
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1024m
mapred.tasktracker.map.tasks.maximum        8
mapred.tasktracker.reduce.tasks.maximum     3

c1.medium

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  768
HADOOP_NAMENODE_HEAPSIZE                    256
HADOOP_TASKTRACKER_HEAPSIZE                 256
HADOOP_DATANODE_HEAPSIZE                    128
mapred.child.java.opts                      -Xmx384m
mapred.tasktracker.map.tasks.maximum        2
mapred.tasktracker.reduce.tasks.maximum     1

c1.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  3072
HADOOP_NAMENODE_HEAPSIZE                    1024
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx512m
mapred.tasktracker.map.tasks.maximum        7
mapred.tasktracker.reduce.tasks.maximum     2

m2.xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  12288
HADOOP_NAMENODE_HEAPSIZE                    4096
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx3072m
mapred.tasktracker.map.tasks.maximum        3
mapred.tasktracker.reduce.tasks.maximum     1

m2.2xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  24576
HADOOP_NAMENODE_HEAPSIZE                    8192
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx3584m
mapred.tasktracker.map.tasks.maximum        6
mapred.tasktracker.reduce.tasks.maximum     2

m2.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  49152
HADOOP_NAMENODE_HEAPSIZE                    16384
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx3072m
mapred.tasktracker.map.tasks.maximum        14
mapred.tasktracker.reduce.tasks.maximum     4

cc1.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  10240
HADOOP_NAMENODE_HEAPSIZE                    5120
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1216m
mapred.tasktracker.map.tasks.maximum        12
mapred.tasktracker.reduce.tasks.maximum     3

cc2.8xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  40152
HADOOP_NAMENODE_HEAPSIZE                    16384
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx2048m
mapred.tasktracker.map.tasks.maximum        24
mapred.tasktracker.reduce.tasks.maximum     6

cg1.4xlarge

Parameter                                   Value
HADOOP_JOBTRACKER_HEAPSIZE                  10240
HADOOP_NAMENODE_HEAPSIZE                    5120
HADOOP_TASKTRACKER_HEAPSIZE                 512
HADOOP_DATANODE_HEAPSIZE                    512
mapred.child.java.opts                      -Xmx1152m
mapred.tasktracker.map.tasks.maximum        12
mapred.tasktracker.reduce.tasks.maximum     3

HDFS Configuration (AMI 2.0 and 2.1)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

dfs.block.size
  Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
  Default value: 134217728 (128 MB)

dfs.replication
  Definition: This determines how many copies of each block to store for durability. For small clusters, we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate.
  Default value: 1 for clusters with fewer than four nodes, 2 for clusters with fewer than ten nodes, 3 for all other clusters

Task Configuration (AMI 2.0 and 2.1)

Topics

• Tasks per Machine (p. 319)

• Tasks per Job (AMI 2.0 and 2.1) (p. 319)

• Task JVM Settings (AMI 2.0 and 2.1) (p. 320)

• Avoiding Job Flow Slowdowns (AMI 2.0 and 2.1) (p. 321)


There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

• mapred.tasktracker.map.tasks.maximum

• mapred.tasktracker.reduce.tasks.maximum

Amazon Elastic MapReduce (Amazon EMR) provides defaults that are entirely dependent on the Amazon EC2 instance type. The following table shows the default settings for job flows launched with AMI 2.0 or 2.1.

Amazon EC2 instance name    Mappers    Reducers
m1.small                    2          1
m1.large                    3          1
m1.xlarge                   8          3
c1.medium                   2          1
c1.xlarge                   7          2
m2.xlarge                   3          1
m2.2xlarge                  6          2
m2.4xlarge                  14         4
cc1.4xlarge                 12         3
cc2.8xlarge                 24         6
cg1.4xlarge                 12         3

Note
The number of default mappers is based on the memory available on each Amazon EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out-of-memory errors.

Tasks per Job (AMI 2.0 and 2.1)

When your job flow runs, Hadoop creates a number of map and reduce tasks. These determine the number of tasks that can run simultaneously during your job flow. Run too few tasks and you have nodes sitting idle; run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.


Parameter                     Description
mapred.map.tasks              Target number of map tasks to run. The actual number of tasks created is sometimes different from this number.
mapred.map.tasksperslot       Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.
mapred.reduce.tasks           Number of reduce tasks to run.
mapred.reduce.tasksperslot    Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon Elastic MapReduce (Amazon EMR). They only take effect if mapred.*.tasks is not defined. The order of precedence is:

1. mapred.map.tasks set by the Hadoop job

2. mapred.map.tasks set in mapred-conf.xml on the master node

3. mapred.map.tasksperslot if neither of those is defined

Task JVM Settings (AMI 2.0 and 2.1)

You can configure the amount of heap space for tasks as well as other JVM options with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this setting, with the defaults per instance type shown in the following table.

Amazon EC2 instance name    Default JVM value
m1.small                    -Xmx384m
m1.large                    -Xmx1152m
m1.xlarge                   -Xmx1024m
c1.medium                   -Xmx384m
c1.xlarge                   -Xmx512m
m2.xlarge                   -Xmx3072m
m2.2xlarge                  -Xmx3584m
m2.4xlarge                  -Xmx3072m
cc1.4xlarge                 -Xmx1216m
cc2.8xlarge                 -Xmx2048m
cg1.4xlarge                 -Xmx1152m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.


Example Modifying JVM using a bootstrap action

$ ./elasticmapreduce --create --alive --name "JVM infinite reuse" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Configuring infinite JVM reuse" \--args "-m,mapred.job.reuse.jvm.num.tasks=-1"

NoteAmazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you canoverride it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 0means do not reuse tasks.

Avoiding Job Flow Slowdowns (AMI 2.0 and 2.1)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your job flow. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the job flow progresses, some machines complete their tasks. Hadoop then schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. If, however, you are running a job where the task execution has side effects (for example, a zero-reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.0 or 2.1. You can override these settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

Speculative Execution Parameters

Parameter                                    Default setting
mapred.map.tasks.speculative.execution       true
mapred.reduce.tasks.speculative.execution    true

Example Disabling reducer speculative execution using a bootstrap action

$ ./elasticmapreduce --create --alive --name "Reducer speculative execution" \

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Disable reducer speculative execution" \--args "-m,mapred.reduce.tasks.speculative.execution=false"

Intermediate Compression (AMI 2.0 and 2.1)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many job flows. To reduce this bottleneck, Amazon Elastic MapReduce (Amazon EMR) enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

The following table presents the default values for the parameters that affect intermediate compression.

API Version 2009-11-30321

Amazon Elastic MapReduce Developer GuideHadoop Default Configuration (AMI 2.0 and 2.1)

Page 328: Amazon elastic map reduce

ValueParameter

truemapred.compress.map.output

org.apache.hadoop.io.compress.SnappyCodecmapred.map.output.compression.codec

Example Enabling/disabling compression using a bootstrap action

$ ./elasticmapreduce --create --alive --name "Reducer speculative execution" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Disable compression" \--args "mapred.compress.map.output=false" \--args "mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Gzip Codec"

Hadoop Default Configuration (AMI 2.2)

Topics

• Hadoop Configuration (AMI 2.2) (p. 322)

• HDFS Configuration (AMI 2.2) (p. 326)

• Task Configuration (AMI 2.2) (p. 326)

• Intermediate Compression (AMI 2.2) (p. 329)

This section describes the default configuration settings Amazon Elastic MapReduce (Amazon EMR) uses to configure a Hadoop cluster launched with Amazon Machine Image (AMI) version 2.2. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

Hadoop Configuration (AMI 2.2)

The following Amazon Elastic MapReduce (Amazon EMR) default configuration settings for job flows launched with Amazon EMR AMI 2.2 are appropriate for most workloads.

If your job flow tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size.

The following tables list the default configuration settings for each Amazon EC2 instance type in job flows launched with the Amazon EMR AMI version 2.2. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

m1.small

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   576
HADOOP_NAMENODE_HEAPSIZE                     192
HADOOP_TASKTRACKER_HEAPSIZE                  192
HADOOP_DATANODE_HEAPSIZE                     96
mapred.child.java.opts                       -Xmx288m
mapred.tasktracker.map.tasks.maximum         2
mapred.tasktracker.reduce.tasks.maximum      1

m1.large

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   2304
HADOOP_NAMENODE_HEAPSIZE                     768
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx864m
mapred.tasktracker.map.tasks.maximum         3
mapred.tasktracker.reduce.tasks.maximum      1

m1.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   6912
HADOOP_NAMENODE_HEAPSIZE                     2304
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx768m
mapred.tasktracker.map.tasks.maximum         8
mapred.tasktracker.reduce.tasks.maximum      3

c1.medium

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   576
HADOOP_NAMENODE_HEAPSIZE                     192
HADOOP_TASKTRACKER_HEAPSIZE                  192
HADOOP_DATANODE_HEAPSIZE                     96
mapred.child.java.opts                       -Xmx288m
mapred.tasktracker.map.tasks.maximum         2
mapred.tasktracker.reduce.tasks.maximum      1

c1.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   2304
HADOOP_NAMENODE_HEAPSIZE                     768
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx384m
mapred.tasktracker.map.tasks.maximum         7
mapred.tasktracker.reduce.tasks.maximum      2

m2.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   9216
HADOOP_NAMENODE_HEAPSIZE                     3072
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2304m
mapred.tasktracker.map.tasks.maximum         3
mapred.tasktracker.reduce.tasks.maximum      1

m2.2xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   18432
HADOOP_NAMENODE_HEAPSIZE                     6144
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2688m
mapred.tasktracker.map.tasks.maximum         6
mapred.tasktracker.reduce.tasks.maximum      2

m2.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   36864
HADOOP_NAMENODE_HEAPSIZE                     12288
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2304m
mapred.tasktracker.map.tasks.maximum         14
mapred.tasktracker.reduce.tasks.maximum      4

cc1.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   7680
HADOOP_NAMENODE_HEAPSIZE                     3840
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx912m
mapred.tasktracker.map.tasks.maximum         12
mapred.tasktracker.reduce.tasks.maximum      3

cc2.8xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   30114
HADOOP_NAMENODE_HEAPSIZE                     12288
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx1536m
mapred.tasktracker.map.tasks.maximum         24
mapred.tasktracker.reduce.tasks.maximum      6

cg1.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   7680
HADOOP_NAMENODE_HEAPSIZE                     3840
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx864m
mapred.tasktracker.map.tasks.maximum         12
mapred.tasktracker.reduce.tasks.maximum      3

HDFS Configuration (AMI 2.2)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

Parameter: dfs.block.size
Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
Default Value: 134217728 (128 MB)

Parameter: dfs.replication
Definition: This determines how many copies of each block to store for durability. For small clusters we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate.
Default Value: 1 for clusters < four nodes, 2 for clusters < ten nodes, 3 for all other clusters
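If your durability or storage requirements differ from the defaults, you can set dfs.replication when the job flow is created. The following sketch assumes the configure-hadoop bootstrap action accepts an -h shorthand for hdfs-site settings, analogous to the -m shorthand used elsewhere in this guide for mapred-site settings; the replication factor of 3 is only an example.

$ ./elastic-mapreduce --create --alive --name "Custom HDFS replication" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Set dfs.replication" \
--args "-h,dfs.replication=3"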

Task Configuration (AMI 2.2)

Topics

• Tasks per Machine (p. 326)

• Tasks per Job (AMI 2.2) (p. 327)

• Task JVM Settings (AMI 2.2) (p. 328)

• Avoiding Job Flow Slowdowns (AMI 2.2) (p. 328)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

• mapred.tasktracker.map.tasks.maximum

• mapred.tasktracker.reduce.tasks.maximum

Amazon Elastic MapReduce (Amazon EMR) provides defaults that are entirely dependent on the Amazon EC2 instance type. The following table shows the default settings for job flows launched with AMI 2.2.


Amazon EC2 Instance Name    Mappers    Reducers
m1.small                    2          1
m1.large                    3          1
m1.xlarge                   8          3
c1.medium                   2          1
c1.xlarge                   7          2
m2.xlarge                   3          1
m2.2xlarge                  6          2
m2.4xlarge                  14         4
cc1.4xlarge                 12         3
cc2.8xlarge                 24         6
cg1.4xlarge                 12         3

Note: The number of default mappers is based on the memory available on each Amazon EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.
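For example, to run more map tasks per node while keeping total memory use roughly constant, a single configure-hadoop bootstrap action can raise the map slot count and lower the per-task heap at the same time. The values below are a sketch only; choose numbers that fit the memory of your instance type.

$ ./elastic-mapreduce --create --alive --name "More map slots" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Tune map slots and task heap" \
--args "-m,mapred.tasktracker.map.tasks.maximum=4" \
--args "-m,mapred.child.java.opts=-Xmx576m"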

Tasks per Job (AMI 2.2)

When your job flow runs, Hadoop creates a number of map and reduce tasks. These determine the number of tasks that can run simultaneously during your job flow. Run too few tasks and you have nodes sitting idle, run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.

Parameter: mapred.map.tasks
Description: Target number of map tasks to run. The actual number of tasks created is sometimes different than this number.

Parameter: mapred.map.tasksperslot
Description: Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.

Parameter: mapred.reduce.tasks
Description: Number of reduce tasks to run.

Parameter: mapred.reduce.tasksperslot
Description: Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon Elastic MapReduce (Amazon EMR). They only take effect if mapred.*.tasks is not defined. The order of precedence is:

1. mapred.map.tasks set by the Hadoop job

2. mapred.map.tasks set in mapred-conf.xml on the master node


3. mapred.map.tasksperslot if neither of those are defined
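As an illustration, the following command fixes the number of reduce tasks for jobs on the job flow by setting mapred.reduce.tasks through the configure-hadoop bootstrap action. This is a sketch only: the value of 10 is arbitrary, and a value set by an individual Hadoop job still takes precedence, as described above.

$ ./elastic-mapreduce --create --alive --name "Fixed reducer count" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Set mapred.reduce.tasks" \
--args "-m,mapred.reduce.tasks=10"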

Task JVM Settings (AMI 2.2)

You can configure the amount of heap space for tasks, as well as other JVM options, with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this setting, with the defaults per instance type shown in the following table.

Amazon EC2 Instance Name    Default JVM value
m1.small                    -Xmx384m
m1.large                    -Xmx1152m
m1.xlarge                   -Xmx1024m
c1.medium                   -Xmx384m
c1.xlarge                   -Xmx512m
m2.xlarge                   -Xmx3072m
m2.2xlarge                  -Xmx3584m
m2.4xlarge                  -Xmx3072m
cc1.4xlarge                 -Xmx1216m
cc2.8xlarge                 -Xmx2048m
cg1.4xlarge                 -Xmx1152m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

Example Modifying JVM using a bootstrap action

$ ./elasticmapreduce --create --alive --name "JVM infinite reuse" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Configuring infinite JVM reuse" \--args "-m,mapred.job.reuse.jvm.num.tasks=-1"

NoteAmazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you canoverride it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 0means do not reuse tasks.

Avoiding Job Flow Slowdowns (AMI 2.2)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your job flow. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the job flow progresses, some machines complete their tasks. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. If, however, you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.2. You can override these settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

Speculative Execution Parameters

Parameter                                     Default Setting
mapred.map.tasks.speculative.execution        true
mapred.reduce.tasks.speculative.execution     true

Example Disabling reducer speculative execution using a bootstrap action

$ ./elastic-mapreduce --create --alive --name "Reducer speculative execution" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Disable reducer speculative execution" \
--args "-m,mapred.reduce.tasks.speculative.execution=false"

Intermediate Compression (AMI 2.2)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many job flows. To reduce this bottleneck, Amazon Elastic MapReduce (Amazon EMR) enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter                               Value
mapred.compress.map.output              true
mapred.map.output.compression.codec     org.apache.hadoop.io.compress.SnappyCodec

Example Enabling/disabling compression using a bootstrap action

$ ./elastic-mapreduce --create --alive --name "Disable compression" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Disable compression" \
--args "-m,mapred.compress.map.output=false" \
--args "-m,mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"


Hadoop Default Configuration (AMI 2.3)

Topics

• Hadoop Configuration (AMI 2.3) (p. 330)

• HDFS Configuration (AMI 2.3) (p. 334)

• Task Configuration (AMI 2.3) (p. 334)

• Intermediate Compression (AMI 2.3) (p. 337)

This section describes the default configuration settings Amazon Elastic MapReduce (Amazon EMR) uses to configure a Hadoop cluster launched with Amazon Machine Image (AMI) version 2.3. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

Hadoop Configuration (AMI 2.3)

The following Amazon Elastic MapReduce (Amazon EMR) default configuration settings for job flows launched with Amazon EMR AMI 2.3 are appropriate for most workloads.

If your job flow tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size.

The following tables list the default configuration settings for each Amazon EC2 instance type in job flows launched with the Amazon EMR AMI version 2.3. For more information about the AMI versions supported by Amazon EMR, see Specify the Amazon EMR AMI Version (p. 290).

m1.small

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   576
HADOOP_NAMENODE_HEAPSIZE                     192
HADOOP_TASKTRACKER_HEAPSIZE                  192
HADOOP_DATANODE_HEAPSIZE                     96
mapred.child.java.opts                       -Xmx288m
mapred.tasktracker.map.tasks.maximum         2
mapred.tasktracker.reduce.tasks.maximum      1

m1.large

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   2304
HADOOP_NAMENODE_HEAPSIZE                     768
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx864m
mapred.tasktracker.map.tasks.maximum         3
mapred.tasktracker.reduce.tasks.maximum      1

m1.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   6912
HADOOP_NAMENODE_HEAPSIZE                     2304
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx768m
mapred.tasktracker.map.tasks.maximum         8
mapred.tasktracker.reduce.tasks.maximum      3

c1.medium

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   576
HADOOP_NAMENODE_HEAPSIZE                     192
HADOOP_TASKTRACKER_HEAPSIZE                  192
HADOOP_DATANODE_HEAPSIZE                     96
mapred.child.java.opts                       -Xmx288m
mapred.tasktracker.map.tasks.maximum         2
mapred.tasktracker.reduce.tasks.maximum      1

c1.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   2304
HADOOP_NAMENODE_HEAPSIZE                     768
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx384m
mapred.tasktracker.map.tasks.maximum         7
mapred.tasktracker.reduce.tasks.maximum      2

m2.xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   9216
HADOOP_NAMENODE_HEAPSIZE                     3072
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2304m
mapred.tasktracker.map.tasks.maximum         3
mapred.tasktracker.reduce.tasks.maximum      1

m2.2xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   18432
HADOOP_NAMENODE_HEAPSIZE                     6144
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2688m
mapred.tasktracker.map.tasks.maximum         6
mapred.tasktracker.reduce.tasks.maximum      2

m2.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   36864
HADOOP_NAMENODE_HEAPSIZE                     12288
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx2304m
mapred.tasktracker.map.tasks.maximum         14
mapred.tasktracker.reduce.tasks.maximum      4

cc1.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   7680
HADOOP_NAMENODE_HEAPSIZE                     3840
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx912m
mapred.tasktracker.map.tasks.maximum         12
mapred.tasktracker.reduce.tasks.maximum      3

cc2.8xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   30114
HADOOP_NAMENODE_HEAPSIZE                     12288
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx1536m
mapred.tasktracker.map.tasks.maximum         24
mapred.tasktracker.reduce.tasks.maximum      6

hs1.8xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   30114
HADOOP_NAMENODE_HEAPSIZE                     12288
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx1536m
mapred.tasktracker.map.tasks.maximum         24
mapred.tasktracker.reduce.tasks.maximum      6

cg1.4xlarge

Parameter                                    Value
HADOOP_JOBTRACKER_HEAPSIZE                   7680
HADOOP_NAMENODE_HEAPSIZE                     3840
HADOOP_TASKTRACKER_HEAPSIZE                  384
HADOOP_DATANODE_HEAPSIZE                     384
mapred.child.java.opts                       -Xmx864m
mapred.tasktracker.map.tasks.maximum         12
mapred.tasktracker.reduce.tasks.maximum      3

HDFS Configuration (AMI 2.3)

The following table describes the default Hadoop Distributed File System (HDFS) parameters and their settings.

Parameter: dfs.block.size
Definition: The size of HDFS blocks. When operating on data stored in HDFS, the split size is generally the size of an HDFS block. Larger numbers provide less task granularity, but also put less strain on the cluster NameNode.
Default Value: 134217728 (128 MB)

Parameter: dfs.replication
Definition: This determines how many copies of each block to store for durability. For small clusters we set this to 2 because the cluster is small and easy to restart in case of data loss. You can change the setting to 1, 2, or 3 as your needs dictate.
Default Value: 1 for clusters < four nodes, 2 for clusters < ten nodes, 3 for all other clusters

Task Configuration (AMI 2.3)

Topics

• Tasks per Machine (p. 334)

• Tasks per Job (AMI 2.3) (p. 335)

• Task JVM Settings (AMI 2.3) (p. 336)

• Avoiding Job Flow Slowdowns (AMI 2.3) (p. 337)

There are a number of configuration variables for tuning the performance of your MapReduce jobs. This section describes some of the important task-related settings.

Tasks per Machine

Two configuration options determine how many tasks are run per node, one for mappers and the other for reducers. They are:

• mapred.tasktracker.map.tasks.maximum

• mapred.tasktracker.reduce.tasks.maximum

Amazon Elastic MapReduce (Amazon EMR) provides defaults that are entirely dependent on the Amazon EC2 instance type. The following table shows the default settings for job flows launched with AMI 2.3.


Amazon EC2 Instance Name    Mappers    Reducers
m1.small                    2          1
m1.large                    3          1
m1.xlarge                   8          3
c1.medium                   2          1
c1.xlarge                   7          2
m2.xlarge                   3          1
m2.2xlarge                  6          2
m2.4xlarge                  14         4
cc1.4xlarge                 12         3
cc2.8xlarge                 24         6
hs1.8xlarge                 24         6
cg1.4xlarge                 12         3

Note: The number of default mappers is based on the memory available on each Amazon EC2 instance type. If you increase the default number of mappers, you also need to modify the task JVM settings to decrease the amount of memory allocated to each task. Failure to modify the JVM settings appropriately could result in out of memory errors.

Tasks per Job (AMI 2.3)

When your job flow runs, Hadoop creates a number of map and reduce tasks. These determine the number of tasks that can run simultaneously during your job flow. Run too few tasks and you have nodes sitting idle, run too many and there is significant framework overhead.

Amazon EMR determines the number of map tasks from the size and number of files of your input data. You configure the reducer setting. There are four settings you can modify to adjust the reducer setting.

The parameters for configuring the reducer setting are described in the following table.

Parameter: mapred.map.tasks
Description: Target number of map tasks to run. The actual number of tasks created is sometimes different than this number.

Parameter: mapred.map.tasksperslot
Description: Target number of map tasks to run as a ratio to the number of map slots in the cluster. This is used if mapred.map.tasks is not set.

Parameter: mapred.reduce.tasks
Description: Number of reduce tasks to run.

Parameter: mapred.reduce.tasksperslot
Description: Number of reduce tasks to run as a ratio of the number of reduce slots in the cluster.

The two tasksperslot parameters are unique to Amazon Elastic MapReduce (Amazon EMR). They only take effect if mapred.*.tasks is not defined. The order of precedence is:


1. mapred.map.tasks set by the Hadoop job

2. mapred.map.tasks set in mapred-conf.xml on the master node

3. mapred.map.tasksperslot if neither of those are defined

Task JVM Settings (AMI 2.3)

You can configure the amount of heap space for tasks, as well as other JVM options, with the mapred.child.java.opts setting. Amazon EMR provides a default -Xmx value in this setting, with the defaults per instance type shown in the following table.

Amazon EC2 Instance Name    Default JVM value
m1.small                    -Xmx384m
m1.large                    -Xmx1152m
m1.xlarge                   -Xmx1024m
c1.medium                   -Xmx384m
c1.xlarge                   -Xmx512m
m2.xlarge                   -Xmx3072m
m2.2xlarge                  -Xmx3584m
m2.4xlarge                  -Xmx3072m
cc1.4xlarge                 -Xmx1216m
cc2.8xlarge                 -Xmx2048m
hs1.8xlarge                 -Xmx2048m
cg1.4xlarge                 -Xmx1152m

You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.

Use the mapred.job.reuse.jvm.num.tasks option to configure the JVM reuse settings.

Example Modifying JVM using a bootstrap action

$ ./elasticmapreduce --create --alive --name "JVM infinite reuse" \--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \--bootstrap-name "Configuring infinite JVM reuse" \--args "-m,mapred.job.reuse.jvm.num.tasks=-1"

NoteAmazon EMR sets the value of mapred.job.reuse.jvm.num.tasks to 20, but you canoverride it with a bootstrap action. A value of -1 means infinite reuse within a single job, and 0means do not reuse tasks.


Avoiding Job Flow Slowdowns (AMI 2.3)

In a distributed environment, you are going to experience random delays, slow hardware, failing hardware, and other problems that collectively slow down your job flow. This is known as the stragglers problem. Hadoop has a feature called speculative execution that can help mitigate this issue. As the job flow progresses, some machines complete their tasks. Hadoop schedules duplicate copies of the remaining tasks on nodes that are free. Whichever copy of a task finishes first is the successful one, and the other copies are killed. This feature can substantially cut down on the run time of jobs. The general design of a MapReduce algorithm is such that the processing of map tasks is meant to be idempotent. If, however, you are running a job where the task execution has side effects (for example, a zero reducer job that calls an external resource), it is important to disable speculative execution.

You can enable speculative execution for mappers and reducers independently. By default, Amazon EMR enables it for mappers and reducers in AMI 2.3. You can override these settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

Speculative Execution Parameters

Parameter                                     Default Setting
mapred.map.tasks.speculative.execution        true
mapred.reduce.tasks.speculative.execution     true

Example Disabling reducer speculative execution using a bootstrap action

$ ./elastic-mapreduce --create --alive --name "Reducer speculative execution" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Disable reducer speculative execution" \
--args "-m,mapred.reduce.tasks.speculative.execution=false"

Intermediate Compression (AMI 2.3)

Hadoop sends data between the mappers and reducers in its shuffle process. This network operation is a bottleneck for many job flows. To reduce this bottleneck, Amazon Elastic MapReduce (Amazon EMR) enables intermediate data compression by default. Because it provides a reasonable amount of compression with only a small CPU impact, we use the Snappy codec.

You can modify the default compression settings with a bootstrap action. For more information on using bootstrap actions, refer to Bootstrap Actions (p. 84).

The following table presents the default values for the parameters that affect intermediate compression.

Parameter                               Value
mapred.compress.map.output              true
mapred.map.output.compression.codec     org.apache.hadoop.io.compress.SnappyCodec

Example Enabling/disabling compression using a bootstrap action

$ ./elastic-mapreduce --create --alive --name "Disable compression" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Disable compression" \
--args "-m,mapred.compress.map.output=false" \
--args "-m,mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

File System Configuration

Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems you can use when processing job flow steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following table lists the available file systems, with recommendations on when it's best to use them.

File System: HDFS
Prefix: hdfs:// or no prefix
Description: HDFS is used by the master and core nodes. Its advantage is that it's fast; its disadvantage is that it's ephemeral storage which is reclaimed when the job flow ends. It's best used for caching the results produced by intermediate job-flow steps.

File System: Amazon S3 native
Prefix: s3:// or s3n://
Description: Amazon S3 native is a persistent and fault-tolerant file system. It continues to exist after the job flow ends. Its disadvantage is that it's slower than HDFS because of the round-trip to Amazon S3. It's best used for storing the input to a job flow, the output of the job flow, and the results of intermediate job flow steps where re-computing the step would be onerous.
Note: Paths that specify only a bucket name must end with a terminating slash. In other words, all Amazon S3 URIs must have at least three slashes. For example, specify s3n://myawsbucket/ instead of s3n://myawsbucket. The URI s3n://myawsbucket/myfolder, however, is also valid.

File System: Amazon S3 block
Prefix: s3bfs://
Description: Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your job flow to fail. It may be required by legacy applications.

Note: The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system.


Upload Large Files with the S3 Native File System

The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB with Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your job flow, see the section called "Multipart Upload" (p. ?).

Access File Systems

You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.

To access a local HDFS

• Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.

hdfs:///path-to-data

/path-to-data

To access a remote HDFS

• Include the IP address of the master node in the URI as shown in the following examples.

hdfs://master-ip-address/path-to-data

master-ip-address/path-to-data

To access the Amazon S3 native file system

• Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the URIs below to the same location.

s3n://bucket-name/path-to-file-in-bucket

s3://bucket-name/path-to-file-in-bucket

Note: Because of the file syntax difference between Hadoop running on Amazon EMR and standard Apache Hadoop, it is recommended that you use the s3n:// prefix to highlight the fact that you are using the S3 native file system.
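For example, from the master node you can pass either form of the URI to ordinary Hadoop shell commands. The bucket and key names below are placeholders for your own values.

hadoop fs -ls s3n://myawsbucket/input/
hadoop fs -copyToLocal s3n://myawsbucket/input/data.txt /home/hadoop/data.txt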


To access the Amazon S3 block file system

• Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.

The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GiB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated.

Caution: Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use the Amazon S3 native file system instead.

s3bfs://bucket-name/path-to-file-in-bucket

JSON Configuration Files

Topics

• Node Settings (p. 340)

• Job Flow Configuration (p. 341)

When Amazon Elastic MapReduce (Amazon EMR) creates a Hadoop cluster, each node contains a pair of JSON files containing configuration information about the node and the currently running job flow. These files are in the /mnt/var/lib/info directory, and accessible by scripts running on the node.

Node Settings

Settings for an Elastic MapReduce cluster node are contained in the instance.json file.

The following table describes the contents of the instance.json file.

Parameter: isMaster
Description: Indicates whether this node is the master node.
Type: Boolean

Parameter: isRunningNameNode
Description: Indicates whether this node is running the Hadoop name node daemon.
Type: Boolean

Parameter: isRunningDataNode
Description: Indicates whether this node is running the Hadoop data node daemon.
Type: Boolean

Parameter: isRunningJobTracker
Description: Indicates whether this node is running the Hadoop job tracker daemon.
Type: Boolean

Parameter: isRunningTaskTracker
Description: Indicates whether this node is running the Hadoop task tracker daemon.
Type: Boolean

The following example shows the contents of an instance.json file:


{ "instanceGroupId":"Instance_Group_ID", "isMaster": Boolean, "isRunningNameNode": Boolean, "isRunningDataNode": Boolean, "isRunningJobTracker": Boolean,"isRunningTaskTracker": Boolean}

Example to identify settings in JSON file using a bootstrap action

This example demonstrates how to execute the command-line function echo to display the string "Running on master node" on a master node by evaluating the JSON file parameter instance.isMaster.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --alive --name "RunIf" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
--bootstrap-name "Run only on master" \
--args "instance.isMaster=true,echo,'Running on master node'"

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --create --alive --name "RunIf" --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --bootstrap-name "Run only on master" --args "instance.isMaster=true,echo,'Running on master node'"
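If the run-if bootstrap action does not cover your case, a custom bootstrap action or step script can inspect instance.json directly. The following is a minimal sketch that assumes the boolean is serialized as a lowercase true and uses grep rather than a JSON parser; adapt it to whatever tools are available on your AMI.

#!/bin/bash
# Hypothetical helper: run follow-on work only on the master node.
INSTANCE_JSON=/mnt/var/lib/info/instance.json
if grep -q '"isMaster": *true' "$INSTANCE_JSON"; then
  echo "Running on master node"
  # place master-only setup here
else
  echo "Running on a slave node"
fi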

Job Flow Configuration

Information about the currently running job flow is contained in the job-flow.json file.

The following table describes the contents of the job-flow.json file.

Parameter: jobFlowId
Description: Contains the ID for the job flow.
Type: String

Parameter: jobFlowCreationInstant
Description: Contains the time that the job flow was created.
Type: Long

Parameter: instanceCount
Description: Contains the number of nodes in an instance group.
Type: Integer

Parameter: masterInstanceId
Description: Contains the ID for the master node.
Type: String

Parameter: masterPrivateDnsName
Description: Contains the private DNS name of the master node.
Type: String

Parameter: masterInstanceType
Description: Contains the Amazon EC2 instance type of the master node.
Type: String

Parameter: slaveInstanceType
Description: Contains the Amazon EC2 instance type of the slave nodes.
Type: String

Parameter: hadoopVersion
Description: Contains the version of Hadoop running on the cluster.
Type: String

Parameter: instanceGroups
Description: A list of objects specifying each instance group in the cluster. Each object contains:
  instanceGroupId: Unique identifier for this instance group. Type: String
  instanceGroupName: User-defined name of the instance group. Type: String
  instanceRole: One of Master, Core, or Task. Type: String
  instanceType: The Amazon EC2 type of the node, such as "m1.small". Type: String
  requestedInstanceCount: The target number of nodes for this instance group. Type: Long

The following example shows the contents of a job-flow.json file.

{
  "jobFlowId": "JobFlowID",
  "jobFlowCreationInstant": CreationInstanceID,
  "instanceCount": Count,
  "masterInstanceId": "MasterInstanceID",
  "masterPrivateDnsName": "Name",
  "masterInstanceType": "Amazon_EC2_Instance_Type",
  "slaveInstanceType": "Amazon_EC2_Instance_Type",
  "hadoopVersion": "Version",
  "instanceGroups": [
    {
      "instanceGroupId": "InstanceGroupID",
      "instanceGroupName": "Name",
      "instanceRole": "Master",
      "marketType": "Type",
      "instanceType": "AmazonEC2InstanceType",
      "requestedInstanceCount": Count
    },
    {
      "instanceGroupId": "InstanceGroupID",
      "instanceGroupName": "Name",
      "instanceRole": "Core",
      "marketType": "Type",
      "instanceType": "AmazonEC2InstanceType",
      "requestedInstanceCount": Count
    },
    {
      "instanceGroupId": "InstanceGroupID",
      "instanceGroupName": "Name",
      "instanceRole": "Task",
      "marketType": "Type",
      "instanceType": "AmazonEC2InstanceType",
      "requestedInstanceCount": Count
    }
  ]
}

Multipart Upload

Multipart upload allows you to upload a single file to Amazon S3 as a set of parts. Using the AWS Java SDK, you can upload these parts incrementally and in any order. Using the multipart upload method can result in faster uploads and shorter retries than when uploading a single large file.

The Amazon EMR configuration parameters for multipart upload are described in the following table.

Configuration Parameter Name: fs.s3n.multipart.uploads.enabled
Default Value: True
Description: A boolean type that indicates whether to enable multipart uploads.

Configuration Parameter Name: fs.s3n.ssl.enabled
Default Value: True
Description: A boolean type that indicates whether to use http or https.

You modify the configuration parameters for multipart uploads using a bootstrap action.

Amazon EMR Console

This procedure explains how to disable multipart upload using the Amazon EMR console.

To disable multipart uploads with a predefined bootstrap action

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

2. Click the Create New Job Flow button and fill out the Create a New Job Flow wizard. For more information about creating job flows, see Create a Job Flow (p. 23).

3. On the BOOTSTRAP ACTIONS pane of the wizard, select Configure your Bootstrap Actions.

4. For Action Type select Configure Hadoop.

5. In Optional Arguments, replace the default value with the following:

-c fs.s3n.multipart.uploads.enabled=false

6. If you have more bootstrap actions to add, click Add another Bootstrap Action. When all of your bootstrap actions are added, click Continue to go to the REVIEW pane of the Create a New Job Flow wizard.

CLI

This procedure explains how to disable multipart upload using the CLI. The command creates a job flow in a waiting state with multipart upload disabled.


If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "enable multipart upload" \
--args "-c,fs.s3n.multipart.uploads.enabled=false"

If you are using Microsoft Windows, enter the following:

c:\ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "enable multipart upload" --args "-c,fs.s3n.multipart.uploads.enabled=false"

This job flow remains in the WAITING state until it is terminated.

Using the API

For information on using Amazon S3 multipart uploads programmatically, go to Using the AWS SDK for Java for Multipart Upload in the Amazon S3 Developer Guide.

For more information on the AWS SDK for Java, go to the AWS SDK for Java detail page.

Hadoop Data Compression

Topics

• Output Data Compression (p. 344)

• Intermediate Data Compression (p. 345)

• How to Process Compressed Files (p. 345)

• Using the Snappy Library with Amazon EMR (p. 345)

Output Data Compression

This compresses the output of your Hadoop job. If you are using TextOutputFormat the result is a gzip'ed text file. If you are writing to SequenceFiles then the result is a SequenceFile which is compressed internally. This can be enabled by setting the configuration setting mapred.output.compress to true.

If you are running a streaming job you can enable this by passing the streaming job these arguments.

-jobconf mapred.output.compress=true
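For context, the argument is passed through to Hadoop streaming when the step is defined. The following sketch assumes a version of the Ruby CLI that supports the --jobconf option for streaming job flows; the mapper, reducer, and bucket names are placeholders.

$ ./elastic-mapreduce --create --stream --name "Compressed output" \
--mapper s3://myawsbucket/code/mapper.py \
--reducer s3://myawsbucket/code/reducer.py \
--input s3://myawsbucket/input/ \
--output s3://myawsbucket/output/ \
--jobconf mapred.output.compress=true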

You can also use a bootstrap action to automatically compress all job outputs. Here is how to do that with the Ruby client.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.output.compress=true"

Finally, if you are writing a Custom Jar, you can enable output compression with the following line when creating your job.

FileOutputFormat.setCompressOutput(conf, true);

Intermediate Data Compression

If your job shuffles a significant amount of data from the mappers to the reducers, you can see a performance improvement by enabling intermediate compression. Intermediate compression compresses the map output and decompresses it when it arrives on the slave node. The configuration setting is mapred.compress.map.output. You can enable this similarly to output compression.

When writing a Custom Jar, use the following command:

conf.setCompressMapOutput(true);
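On a job flow where intermediate compression is not already on by default, a bootstrap action can enable it cluster-wide; the following command is a sketch using the same configure-hadoop pattern shown elsewhere in this guide.

$ ./elastic-mapreduce --create --alive --name "Intermediate compression" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "Enable map output compression" \
--args "-m,mapred.compress.map.output=true"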

How to Process Compressed Files

Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.

To index LZO files, you can use the hadoop-lzo library which can be downloaded from https://github.com/kevinweil/hadoop-lzo. Note that because this is a third-party library, Amazon Elastic MapReduce (Amazon EMR) does not offer developer support on how to use this tool. For usage information, see the hadoop-lzo readme file.

Using the Snappy Library with Amazon EMR

Snappy is a compression and decompression library that is optimized for speed. It is available on Amazon EMR AMIs version 2.0 and later and is used as the default for intermediate compression. For more information about Snappy, go to http://code.google.com/p/snappy/. For more information about Amazon EMR AMI versions, go to Specify the Amazon EMR AMI Version (p. 290).

Setting Permissions on the System Directory

You can set the permissions on the system directory by using a bootstrap action to modify the mapreduce.jobtracker.system.dir.permission configuration variable. This is useful if you are running a custom configuration in which users other than "hadoop user" submit jobs to Hadoop.

Amazon Elastic MapReduce (Amazon EMR) added this configuration variable to Hadoop in AMI 2.0.5, and it is available in AMI versions 2.0.5 and later. For more information about AMI versions, see Specify the Amazon EMR AMI Version (p. 290).


To set permissions on the system directory

• Specify a bootstrap action that sets the permissions for the directory using numerical permissions codes (octal notation). The following example bootstrap action, with a permissions code of 777, gives full permissions to the user, group, and world. For more information about octal notation, go to http://en.wikipedia.org/wiki/File_system_permissions#Octal_notation.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapreduce.jobtracker.system.dir.permission=777"

Hadoop Patches

The following sections detail the patches the Amazon Elastic MapReduce (Amazon EMR) team has applied to the Hadoop versions loaded on Amazon EMR AMIs.

Topics

• Hadoop 1.0.3 Patches (p. 346)

• Hadoop 0.20.205 Patches (p. 347)

Hadoop 1.0.3 Patches

The Amazon EMR team has applied the following patches to Hadoop 1.0.3 on the Amazon EMR AMI version 2.2.

Patch: All of the patches applied to the Amazon EMR version of Hadoop 0.20.205
Description: See Hadoop 0.20.205 Patches (p. 347) for details.

Patch: HADOOP-5861
Description: Files stored on the native Amazon S3 file system, those with URLs of the form s3n://, now report a block size determined by fs.s3n.block.size. For more information, go to https://issues.apache.org/jira/browse/HADOOP-5861.
Status: Fixed
Fixed in AWS Hadoop Version: 1.0.3
Fixed in Apache Hadoop Version: 0.21.0

Patch: HADOOP-6346
Description: Supports specifying a pattern to RunJar.unJar that determines which files are unpacked. For more information, go to https://issues.apache.org/jira/browse/HADOOP-6346.
Status: Fixed
Fixed in AWS Hadoop Version: 1.0.3
Fixed in Apache Hadoop Version: 0.21.0

Patch: MAPREDUCE-967
Description: Changes the TaskTracker node so it does not fully unjar job jars into the job cache directory. For more information, go to https://issues.apache.org/jira/browse/MAPREDUCE-967.
Status: Fixed
Fixed in AWS Hadoop Version: 1.0.3
Fixed in Apache Hadoop Version: 0.21.0

Patch: MAPREDUCE-2219
Description: Changes the JobTracker service to remove the contents of mapred.system.dir during startup instead of removing the directory itself. For more information, go to https://issues.apache.org/jira/browse/MAPREDUCE-2219.
Status: Fixed
Fixed in AWS Hadoop Version: 1.0.3
Fixed in Apache Hadoop Version: 0.22.0

Hadoop 0.20.205 Patches

The Amazon EMR team has applied the following patches to Hadoop 0.20.205 on the Amazon EMR AMI version 2.0.

Patch: Add hadoop-lzo
Description: Install the hadoop-lzo third-party package. For more information about hadoop-lzo, go to https://github.com/kevinweil/hadoop-lzo.
Status: Third-party Package
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: n/a

Patch: Install the hadoop-snappy library
Description: Add the hadoop-snappy library to provide access to the snappy compression. For more information about this library, go to http://code.google.com/p/hadoop-snappy/.
Status: Third-party Library
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: n/a

Patch: MAPREDUCE-1597/2021/2046
Description: Fixes to how CombineFileInputFormat handles split locations and files that can be split. For more information about these patches, go to https://issues.apache.org/jira/browse/MAPREDUCE-1597, https://issues.apache.org/jira/browse/MAPREDUCE-2021, and https://issues.apache.org/jira/browse/MAPREDUCE-2046.
Status: Resolved, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.22.0

Patch: HADOOP-6436
Description: Remove the files generated by automake and autoconf of the native build and use the host's automake and autoconf to generate the files instead. For more information about this patch, go to https://issues.apache.org/jira/browse/HADOOP-6436.
Status: Closed, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.22.0, 0.23.0

Patch: MAPREDUCE-2185
Description: Prevent an infinite loop from occurring when creating splits using CombineFileInputFormat. For more information about this patch, go to https://issues.apache.org/jira/browse/MAPREDUCE-2185.
Status: Closed, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.23.0

Patch: HADOOP-7082
Description: Change Configuration.writeXML to not hold a lock while outputting. For more information about this patch, go to https://issues.apache.org/jira/browse/HADOOP-7082.
Status: Resolved, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.22.0

Patch: HADOOP-7015
Description: Update RawLocalFileSystem#listStatus to deal with a directory that has changing entries, as in a multi-threaded or multi-process environment. For more information about this patch, go to https://issues.apache.org/jira/browse/HADOOP-7015.
Status: Closed, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.23.0

Patch: HADOOP-4675
Description: Update the Ganglia metrics to be compatible with Ganglia 3.1. For more information about this patch, go to https://issues.apache.org/jira/browse/HADOOP-4675.
Status: Resolved, Fixed
Fixed in AWS Hadoop Version: 0.20.205
Fixed in Apache Hadoop Version: 0.22.0

Hive Configuration

Amazon Elastic MapReduce (Amazon EMR) provides support for Apache Hive. Amazon EMR supports several versions of Hive, which you can install on any running job flow. Amazon EMR also allows you to run multiple versions concurrently, allowing you to control your Hive version upgrade. The following sections describe the Hive configurations using Amazon EMR.

Topics

• Supported Hive Versions (p. 349)

• Share Data Between Hive Versions (p. 353)


• Differences from Apache Hive Defaults (p. 353)

• Interactive and Batch Modes (p. 355)

• Creating a Metastore Outside the Hadoop Cluster (p. 357)

• Using the Hive JDBC Driver (p. 359)

• Additional Features of Hive in Amazon EMR (p. 362)

• Upgrade to Hive 0.8 (p. 368)

Supported Hive Versions

You can choose to run Hive in several different configurations. You set the --hadoop-version, --hive-versions, and --ami-version parameters in the job creation call as shown in the following table.

The default configuration for Amazon EMR is the latest version of Hive running on the latest AMI version.

The Amazon EMR console does not support Hive versioning and always loads the latest version of Hive.

Versions of the Amazon EMR CLI released on 9 April 2012 and later load the latest version of Hive by default. To use a version of Hive other than the latest, specify the --hive-versions parameter when you create the job flow. Versions of the Amazon EMR CLI released prior to 9 April 2012 load the default configuration of Hive.

Calls to the API will launch the default configuration of Hive, unless you specify --hive-versions as an argument to the step that loads Hive onto the job flow during the call to RunJobFlow.

Hive Version: 0.8.1.6
Compatible Hadoop Versions: 1.0.3
Hive Version Notes:
• Adds support for IAM roles. For more information, see Configure IAM Roles for Amazon EMR (p. 280).

Hive Version: 0.8.1.5
Compatible Hadoop Versions: 1.0.3
Hive Version Notes:
• Adds support for the new Amazon DynamoDB binary data type.
• Adds the patch Hive-2955, which fixes an issue where queries consisting only of metadata always return an empty value.
• Adds the patch Hive-1376, which fixes an issue where Hive would crash on an empty result set generated by "where false" clause queries.
• Fixes the RCFile interaction with Amazon Simple Storage Service (Amazon S3).
• Replaces JetS3t with the AWS SDK for Java.
• Uses BatchWriteItem for puts to Amazon DynamoDB.
• Adds schemaless mapping of Amazon DynamoDB tables into a Hive table using a Hive map<string, string> column.

Hive Version: 0.8.1.4
Compatible Hadoop Versions: 1.0.3
Hive Version Notes: Updates the HBase client on Hive job flows to version 0.92.0 to match the version of HBase used on HBase job flows. This fixes issues that occurred when connecting to an HBase job flow from a Hive job flow.

Hive Version: 0.8.1.3
Compatible Hadoop Versions: 1.0.3
Hive Version Notes: Adds support for Hadoop 1.0.3.

Hive Version: 0.8.1.2
Compatible Hadoop Versions: 1.0.3, 0.20.205
Hive Version Notes: Fixes an issue with duplicate data in large job flows.

Hive Version: 0.8.1.1
Compatible Hadoop Versions: 1.0.3, 0.20.205
Hive Version Notes: Adds support for MapR and HBase.

Hive Version: 0.8.1
Compatible Hadoop Versions: 1.0.3, 0.20.205
Hive Version Notes: Introduces new features and improvements. The most significant of these are as follows. For complete information about the changes in Hive 0.8.1, go to the Apache Hive 0.8.1 Release Notes.
• Support Binary DataType (HIVE-2380)
• Support Timestamp DataType (HIVE-2272)
• Provide a Plugin Developer Kit (HIVE-2244)
• Support INSERT INTO append semantics (HIVE-306)
• Support Per-Partition SerDe (HIVE-2484)
• Support Import/Export facilities (HIVE-1918)
• Support Bitmap Indexes (HIVE-1803)
• Support RCFile Block Merge (HIVE-1950)
• Incorporate Group By Optimization (HIVE-1694)
• Enable HiveServer to accept -hiveconf option (HIVE-2139)
• Support --auxpath option (HIVE-2355)
• Add a new builtins subproject (HIVE-2523)
• Insert overwrite table db.tname fails if partition already exists (HIVE-2617)
• Add a new input format that passes multiple GZip files to each mapper, so fewer mappers are needed. (HIVE-2089)
• Incorporate JDBC Driver improvements (HIVE-559, HIVE-1631, HIVE-2000, HIVE-2054, HIVE-2144, HIVE-2153, HIVE-2358, HIVE-2369, HIVE-2456)

Hive Version: 0.7.1.4
Compatible Hadoop Versions: 0.20.205
Hive Version Notes: Prevents the "SET" command in Hive from changing the current database of the current session.

Hive Version: 0.7.1.3
Compatible Hadoop Versions: 0.20.205
Hive Version Notes: Adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options (p. 246).

Hive Version: 0.7.1.2
Compatible Hadoop Versions: 0.20.205
Hive Version Notes: Modifies the way files are named in Amazon S3 for dynamic partitions. It prepends file names in Amazon S3 for dynamic partitions with a unique identifier. Using Hive 0.7.1.2 you can run queries in parallel with set hive.exec.parallel=true. It also fixes an issue with filter pushdown when accessing Amazon DynamoDB with sparse data sets.

Hive Version: 0.7.1.1
Compatible Hadoop Versions: 0.20.205
Hive Version Notes: Introduces support for accessing Amazon DynamoDB, as detailed in Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR (p. 234). It is a minor version of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database, and log location of 0.7.1 on the job flow.

Hive Version: 0.7.1
Compatible Hadoop Versions: 0.20.205, 0.20, 0.18
Hive Version Notes: Improves Hive query performance for a large number of partitions and for Amazon S3 queries. Changes Hive to skip commented lines.

Hive Version: 0.7
Compatible Hadoop Versions: 0.20, 0.18
Hive Version Notes: Improves Recover Partitions to use less memory, fixes the hashCode method, and introduces the ability to use the HAVING clause to filter on groups by expressions.

Hive Version: 0.5
Compatible Hadoop Versions: 0.20, 0.18
Hive Version Notes: Fixes issues with FileSinkOperator and modifies UDAFPercentile to tolerate null percentiles.

Hive Version: 0.4
Compatible Hadoop Versions: 0.20, 0.18
Hive Version Notes: Introduces the ability to write to Amazon S3, run Hive scripts from Amazon S3, and recover partitions from table data stored in Amazon S3. Also creates a separate namespace for managing Hive variables.

For additional details about the changes in a version of Hive, go to Supported Hive Versions (p. 349). For information about Hive patches and functionality developed by the Amazon EMR team, go to Additional Features of Hive in Amazon EMR (p. 362).

To specify the Hive version when creating the job flow

• Use the --hive-versions parameter. The following command-line example creates an interactive Hive job flow running Hadoop 0.20 and Hive 0.7.1.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \
--hadoop-version 0.20 \
--num-instances 5 --instance-type m1.large \
--hive-interactive \
--hive-versions 0.7.1

Note: The --hive-versions parameter must come after any reference to the parameters --hive-interactive, --hive-script, or --hive-site.

To specify the latest Hive version when creating the job flow

• Use the --hive-versions parameter with the latest keyword. The following command-line example creates an interactive Hive job flow running the latest version of Hive.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \
--hadoop-version 0.20 \
--num-instances 5 --instance-type m1.large \
--hive-interactive \
--hive-versions latest

To specify the Hive version for a job flow that is interactive and uses a Hive script

• If you have a job flow that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example illustrates setting both the interactive and the script version of Hive to use 0.7.1.2.

$ ./elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ \
--name "Testing m1.large AMI 1" \
--ami-version latest --hadoop-version 0.20 \
--instance-type m1.large --num-instances 5 \
--hive-interactive --hive-versions 0.7.1.2 \
--hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2

To load multiple versions of Hive for a given job flow

• Use the --hive-versions parameter and separate the version numbers by comma. The followingcommand-line example creates an interactive job flow running Hadoop 0.20 and multiple versionsof Hive. With this configuration, you can use any of the installed versions of Hive on the job flow.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \ --hadoop-version 0.20 \ --num-instances 5 --instance-type m1.large \ --hive-interactive \ --hive-versions 0.5,0.7.1

To call a specific version of Hive

• Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
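For example, the following sketch connects to the master node of an interactive job flow and starts the Hive 0.7.1 command line interface. It assumes the placeholder MasterNodeDNS and key file name used elsewhere in this guide and that Hive 0.7.1 is one of the installed versions.

ssh -i $HOME/mysecretkey.pem hadoop@MasterNodeDNS
hive-0.7.1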

Note
If you have multiple versions of Hive loaded on a job flow, calling hive will access the default version of Hive or the version loaded last if there are multiple --hive-versions parameters specified in the job flow creation call. When the comma-separated syntax is used with --hive-versions to load multiple versions, hive will access the default version of Hive.

Note
When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.


Display the Hive Version

You can use the --print-hive-version command to display the version of Hive currently in use for a given job flow. This is a useful command to call after you have upgraded to a new version of Hive to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running. The syntax for this is as follows, where JobFlowID is the identifier of the job flow to check the Hive version on.

elastic-mapreduce --jobflow JobFlowID --print-hive-version

Share Data Between Hive Versions

You can take advantage of Hive bug fixes and performance improvements on your existing Hive job flows by upgrading your version of Hive. Different versions of Hive, however, have different schemas. To share data between two versions of Hive, you can create an external table in each version of Hive with the same LOCATION parameter.
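For example, a minimal sketch of such a shared table, assuming a hypothetical bucket named mybucket; you would run the same statement from the Hive prompt of each job flow:

CREATE EXTERNAL TABLE IF NOT EXISTS shared_table (key int, value string)
LOCATION 's3://mybucket/shared-table/';

Each job flow keeps its own metastore entry for the table, while both read and write the same Amazon S3 data.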

To share data between Hive versions

1. Start a job flow with the new version of Hive. This procedure assumes that you already have a job flow with the old version of Hive running.
2. Configure the two job flows to allow communication: on the job flow with the old version of Hive, configure the insert overwrite directory to the location of the HDFS of the job flow with the new version of Hive.
3. Export and reimport the data.

Differences from Apache Hive Defaults

Topics

• Input Format (p. 353)

• Combine Splits Input Format (p. 354)

• Log files (p. 354)

• Thrift Service Ports (p. 354)

• Hive Authorization (p. 355)

This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.

Input Format

The Apache Hive default input format is text. The Amazon EMR default input format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You can specify the hive.base.inputformat option in Hive to select a different file format, for example:

hive>set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;

To switch back to the default Amazon EMR input format, you would enter the following:


hive>set hive.base.inputformat=default;

Combine Splits Input Format

If you have many GZip files in your Hive job flow, you can optimize performance by passing multiple files to each mapper. This reduces the number of mappers needed in your job flow and can help your job flows complete faster. You do this by specifying that Hive use the HiveCombineSplitsInputFormat input format and setting the split size, in bytes. This is shown in the following example.

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveCombineSplitsInput Format;hive> set mapred.min.split.size=100000000;

Note
This input format was added with Hive 0.8.1 and is available only in job flows running Hive 0.8.1 or later.

Log files

Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named hive.log. Amazon EMR saves Hive logs to /mnt/var/log/apps/. In order to support concurrent versions of Hive, the version of Hive you run determines the log file name, as shown in the following table.

Hive Version: 0.4
Log File Name: hive.log

Hive Version: 0.5
Log File Name: hive_05.log

Hive Version: 0.7
Log File Name: hive_07.log

Hive Version: 0.7.1
Log File Name: hive_07_1.log (minor versions of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1)

Hive Version: 0.8.1
Log File Name: hive_081.log (minor versions of Hive 0.8.1, such as Hive 0.8.1.1, share the same log file location as Hive 0.8.1)
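For example, to follow the Hive 0.7.1 log on the master node, you might run a command like the following (a sketch; substitute the log file name for your Hive version from the table above):

tail -f /mnt/var/log/apps/hive_07_1.log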

Thrift Service Ports

Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on port 10000. In order to support concurrent versions of Hive, Amazon EMR operates Hive 0.5 on port 10000, Hive 0.7 on port 10001, Hive 0.7.1 on port 10002, and Hive 0.8.1 on port 10003. For more information about thrift services, go to http://wiki.apache.org/thrift/.


Hive Authorization

Amazon EMR does not support Hive authorization. Amazon EMR clusters run with authorization disabled. You cannot use Hive authorization in your Amazon EMR job flow.

Interactive and Batch Modes

Amazon EMR enables you to run Hive scripts in two modes:

• Interactive

• Batch

Typically, you use interactive mode to troubleshoot your job flow and use batch mode in production.

In interactive mode, you use SSH to connect as the Hadoop user to the master node in the Hadoop cluster and use the Hive Command Line Interface to develop and run your Hive script. Interactive mode enables you to revise the Hive script more easily than batch mode. After you successfully revise the Hive script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run production job flows.

In batch mode, you upload your Hive script to Amazon S3, and then execute it using a job flow. You can pass parameter values into your Hive script and reference resources in Amazon S3. Variables in Hive scripts use the dollar sign and curly braces, for example:

${VariableName}

In the Amazon EMR CLI, use the -d parameter to pass values into the Hive script as in the following example.

$ ./elastic-mapreduce --create \ --name "Hive Job Flow" \ --hive-script \ --args s3://myawsbucket/myquery.q \ --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=3://myawsbucket/output

Using batch mode, you can pass parameter values into a Hive script from the Specify Parameters page of the Create a New Job Flow wizard found in the Amazon EMR console. The values go into the Extra Args field. For example, you could enter:

-d VariableName=Value

The Amazon EMR console and the Amazon EMR command line interface (CLI) both support interactive and batch modes.

Running Hive in Interactive Mode

You can run Hive in interactive mode from both the CLI and the Amazon EMR console.

• To start an interactive job flow from the command line, use the --alive option with the --create parameter so that the job flow remains active until you terminate it, for example:

$ ./elastic-mapreduce --create --alive --name "Hive job flow" \ --num-instances 5 --instance-type m1.large \ --hive-interactive


The return output is similar to the following:

Created jobflow JobFlowID

Add additional steps from the Amazon Elastic MapReduce (Amazon EMR) CLI or ssh directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

You start an interactive job flow from the Amazon EMR console in the Create a New Job Flow wizard.

To start an interactive job flow from the Amazon EMR console

1. Click Create New Job Flow and launch the Create a New Job Flow wizard.

2. Enter a Job Flow Name, and choose a Hive Program Job Type. Click Continue.

3. From the Specify Parameters page, select Start an Interactive Hive Session, enter the appropriate inputs for Script Location, Input Location, and Output Location, and then click Continue.

4. Choose the appropriate Amazon EC2 instance type, EC2 key pair, and debugging levels for your job flow on the Configure EC2 Instances page, and then click Continue.

5. Add any bootstrap actions on the Bootstrap Actions page, and then click Continue.

6. Review your job, and then click Continue.

The job flow begins. When the job flow is in the WAITING state, you can add steps to your job flow from the Amazon EMR CLI or ssh directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

Adding steps can help you test and develop Hive scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start. The following procedure shows you how to use the command line to add Hive as a new step to an existing job flow.

To add Hive to an existing job flow

• Enter the following command, replacing the location with an Amazon S3 bucket containing a Hive script and the <JobFlowID> from your job:

$ ./elastic-mapreduce --jobflow JobFlowID \
--hive-script \
--args s3://location/myquery.q \
--args -d,INPUT=s3://location/input,-d,OUTPUT=s3://location/output

Running Hive in Batch Mode

The following procedure shows how to run Hive in batch mode from the command line. The procedure assumes that you stored the Hive script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, go to the Amazon S3 Getting Started Guide.

To create a job flow with a step that executes a Hive script

• Enter the following command, substituting the replaceable parameters with the actual values from your job:

$ ./elastic-mapreduce --create \ --name "Hive job flow" \

API Version 2009-11-30356

Amazon Elastic MapReduce Developer GuideInteractive and Batch Modes

Page 363: Amazon elastic map reduce

--hive-script \ --args s3://myawsbucket/myquery.q \ --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

The --args option provides arguments to the Hive script. The first --args option here specifies the location of the Hive script in Amazon S3. In the second --args option, the -d provides a way to pass values (INPUT, OUTPUT) into the script. Within the Hive script, these parameters are available as ${variable}. In this example, Hive replaces ${INPUT} and ${OUTPUT} with the values you passed in. These variables are substituted during a preprocessing step, so the variables can occur anywhere in the Hive script.

The return output is similar to the following:

Created jobflow JobFlowID
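To illustrate how the passed-in values are used, the following is a minimal sketch of what a script such as myquery.q might contain. The table and column names are hypothetical; only ${INPUT} and ${OUTPUT} correspond to the values passed with -d above.

CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (line string)
LOCATION '${INPUT}';

INSERT OVERWRITE DIRECTORY '${OUTPUT}'
SELECT line FROM raw_logs WHERE line LIKE '%ERROR%';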

Creating a Metastore Outside the Hadoop Cluster

Hive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.

When a job flow terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.

If you have multiple job flows that share common data and update the metastore, you should locate the shared metastore on persistent storage.

To share the metastore between job flows, override the default location of the MySQL database to an external persistent storage location.

Note
Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two job flows, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a job flow using the reconfigured metastore location.

To create a metastore located outside of the cluster

1. Create a MySQL database. Relational Database Service (RDS) provides a cloud-based MySQL database. Instructions on how to create an Amazon RDS database are at http://aws.amazon.com/rds/.

2. Modify your security groups to allow JDBC connections between your MySQL database and the ElasticMapReduce-Master security group. Instructions on how to modify your security groups for access are at http://aws.amazon.com/rds/faqs/#31.

3. Set the JDBC configuration values in hive-site.xml:

a. Create a hive-site.xml configuration file containing the following information:


<configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotEx ist=true</value> <description>JDBC connect string for a JDBC metastore</description>

</property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>username</value> <description>Username to use against metastore database</description>

</property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>password</value> <description>Password to use against metastore database</description>

</property></configuration>

<hostname> is the DNS address of the Amazon RDS instance running MySQL. <username>and <password> are the credentials for your MySQL database.

The MySQL JDBC drivers are installed by Amazon EMR.

Note
The value property should not contain any spaces or carriage returns. It should appear all on one line.

b. Save your hive-site.xml file to a location on Amazon S3, such as s3://myawsbucket/conf/hive-site.xml.

4. Create a job flow and specify the Amazon S3 location of the new Hive configuration file, for example:

$ ./elastic-mapreduce --create --alive \
--name "Hive job flow" \
--hive-interactive \
--hive-site=s3://myawsbucket/conf/hive-site.xml

The --hive-site parameter installs the configuration values in hive-site.xml in the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.

5. Connect to the master node of your job flow. Instructions on how to connect to the master node are available in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

6. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:


CREATE EXTERNAL TABLE IF NOT EXISTS table_name
(
  key int,
  value int
)
LOCATION 's3://myawsbucket/hdfs/';

7. Add your Hive script to the running job flow.

Your Hive job flow now runs using the external metastore. Launch all additional Hive job flows that share this metastore by specifying the same metastore location in their hive-site.xml configuration.
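For example, a second job flow can share the same metastore by pointing at the same hive-site.xml file (a sketch that reuses the hypothetical bucket from the preceding steps):

$ ./elastic-mapreduce --create --alive \
--name "Second Hive job flow" \
--hive-interactive \
--hive-site=s3://myawsbucket/conf/hive-site.xml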

Using the Hive JDBC Driver

The Hive JDBC driver provides a mechanism to move data from one database format to another. Installing a JDBC client requires you to download the JDBC driver and install the client software correctly. You can use the Hive JDBC driver to connect to a SQL client. An example of connecting to the SQuirrel SQL client follows.

To download JDBC drivers

• Download Hive 0.5 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/0196055244487017 and save the files locally.

Download Hive 0.7 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/1818074809286277 and save the files locally.

Download Hive 0.7.1 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/8084613472207189 and save the files locally.

Download Hive 0.8.1 JDBC drivers from http://aws.amazon.com/developertools/4897392426085727 and save the files locally.

You need only download the drivers appropriate to the version(s) of Hive you want to access.

To install SQuirrel SQL client

1. Download SQuirrel SQL client from http://squirrel-sql.sourceforge.net/.

2. Open the self-extracting JAR file, and follow the wizard instructions to install the software.

3. From the command line, create an SSH tunnel to the master node of your Hive job flow as follows:

If you are installing Hive 0.5 drivers, enter the following:
ssh -o ServerAliveInterval=10 -L 10000:localhost:10000 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

If you are installing Hive 0.7 drivers, enter the following:
ssh -o ServerAliveInterval=10 -L 10001:localhost:10001 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

If you are installing Hive 0.7.1 drivers, enter the following:
ssh -o ServerAliveInterval=10 -L 10002:localhost:10002 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

If you are installing Hive 0.8.1 drivers, enter the following:
ssh -o ServerAliveInterval=10 -L 10003:localhost:10003 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

The MasterNodeDNS is the public DNS name of the master node of the Hadoop cluster and mysecretkey.pem is the name of the private key file for your Amazon EC2 key pair.

4. Add the JDBC driver to SQuirrel SQL:

a. Open SQuirrel SQL and click the Drivers tab.

b. Double-click JDBC ODBC Bridge to add attributes.

c. Type org.apache.hadoop.hive.jdbc.HiveDriver in the Class Name field, and then click Add.

d. Navigate to the location of your JDBC drivers.

e. Add the following JAR files:

If you are installing Hive 0.5 drivers, add the following:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.5.0.jar
hive/lib/hive-jdbc-0.5.0.jar
hive/lib/hive-metastore-0.5.0.jar
hive/lib/hive-service-0.5.0.jar
hive/lib/libfb303.jar
hive/lib/log4j-1.2.15.jar
lib/commons-logging-1.0.4.jar

If you are installing Hive 0.7 drivers, add the following:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.7.0.jar
hive/lib/hive-jdbc-0.7.0.jar
hive/lib/hive-metastore-0.7.0.jar
hive/lib/hive-service-0.7.0.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.5.6.jar
slf4j-log4j12-1.5.6.jar

If you are installing Hive 0.7.1 drivers, add the following:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.7.1.jar
hive/lib/hive-jdbc-0.7.1.jar
hive/lib/hive-metastore-0.7.1.jar
hive/lib/hive-service-0.7.1.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar

If you are installing Hive 0.8.1 drivers, add the following:
hadoop-0.20-core.jar
hive/lib/hive-exec-0.8.1.jar
hive/lib/hive-jdbc-0.8.1.jar
hive/lib/hive-metastore-0.8.1.jar
hive/lib/hive-service-0.8.1.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar

f. Click OK.

5. Add a new alias:

a. Click the Alias tab, and then click + to add a new alias.

b. Enter the following information in the Add Alias dialog:

Field: Name
Description: Enter the name of the alias.

Field: Driver
Description: Select the JDBC driver from the list.

Field: User Name
Description: Enter your local machine login.

Field: Password
Description: Enter your local machine password.

c. Enter the URL information in the Add Alias dialog based on the version of Hive:

If you are installing Hive 0.5 drivers, enter: jdbc:hive://localhost:10000/default
If you are installing Hive 0.7 drivers, enter: jdbc:hive://localhost:10001/default
If you are installing Hive 0.7.1 drivers, enter: jdbc:hive://localhost:10002/default
If you are installing Hive 0.8.1 drivers, enter: jdbc:hive://localhost:10003/default

d. Click OK.

The SQuirrel SQL client is ready to use.
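Once connected through the alias, you can verify the connection with a simple Hive statement from the SQuirrel SQL editor, for example (a sketch; it simply lists the tables visible in the job flow's metastore):

SHOW TABLES;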


For more information about using Hive and the JDBC interface, go to http://wiki.apache.org/hadoop/Hive/HiveClient and http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface.

Additional Features of Hive in Amazon EMR

We have extended Hive with new features that integrate Hive with Amazon Web Services (AWS), such as the ability to read and write from Amazon S3. For information about which versions of Hive support these additional features, see Hive Patches (p. 365).

Topics

• Write Data Directly to Amazon S3 (p. 362)

• Use Hive to Access Resources in Amazon S3 (p. 362)

• Use Hive to Recover Partitions (p. 363)

• Variables in Hive (p. 363)

• Make JDBC Connections in Hive (p. 364)

• Persist Hive Schema (p. 364)

• Amazon EMR Hive Steps (p. 365)

• Hive Patches (p. 365)

Write Data Directly to Amazon S3

The Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (Amazon S3) are handled differently within Amazon EMR and Hive. The version of Hive installed with Amazon EMR is extended with the ability to write directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement but it means that HDFS and Amazon S3 behave differently within Hive.

A consequence of Hive writing directly to Amazon S3 is that you cannot read and write within the same Hive statement to the same table if that table is located in Amazon S3. The following example shows how to use multiple Hive statements to update a table in Amazon S3.

To update a table in Amazon S3 using Hive

1. From a Hive prompt or script, create a temporary table in the job flow's local HDFS filesystem.
2. Write the results of a Hive query to the temporary table.
3. Copy the contents of the temporary table to Amazon S3. This is shown in the following example.

create temporary table tmp like my_s3_table;
insert overwrite table tmp select ... ;
insert overwrite table my_s3_table select * from tmp;

Use Hive to Access Resources in Amazon S3

The version of Hive installed in Amazon EMR enables you to reference resources, such as JAR files, located in Amazon S3.

add jar s3://elasticmapreduce/samples/hive-ads/lib/jsonserde.jar


You can also reference scripts located in Amazon S3 to execute custom map and reduce operations. This is shown in the following example.

from logs select transform (line)
using 's3://mybucket/scripts/parse-logs.pl'
as (time string, exception_type string, exception_details string)

The ability to initialize Hive from a file stored in Amazon S3 was introduced with Hive 0.8.1. Versions of Hive prior to 0.8.1 do not support initializing Hive from Amazon S3. For example, in the following Hive command, -i s3n://myawsbucket/hive_init.sql succeeds if run with Hive 0.8.1 or later, and fails if run with an earlier version of Hive.

hive -i s3n://myawsbucket/hive_init.sql -f s3n://myawsbucket/hive_example.sql

Use Hive to Recover Partitions

We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. The following example shows this.

create external table raw_impression (json string)
partitioned by (dt string)
location 's3://elastic-mapreduce/samples/hive-ads/tables/impressions';
alter table raw_impression recover partitions;

The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention, e.g., dt=2009-01-01.
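For example, table data laid out as follows (a sketch using a hypothetical bucket and table root) could be picked up by the recover partitions statement:

s3://mybucket/tables/impressions/dt=2009-01-01/part-00000
s3://mybucket/tables/impressions/dt=2009-01-02/part-00000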

Variables in Hive

You can include variables in your scripts by using the dollar sign and curly braces.

add jar ${LIB}/jsonserde.jar

You pass the values of these variables to Hive on the command line using the -d parameter, as the following example shows.

-d LIB=s3://elasticmapreduce/samples/hive-ads/lib

You can also pass the values into steps that execute Hive scripts.


elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
--args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

Make JDBC Connections in Hive

When you start an interactive Hive session using either the Amazon EMR console or the CLI, Amazon EMR installs Hive on the job flow and a Hive server starts on the master node. The Hive server accepts JDBC connections from the Hive JDBC driver on port 10000.

To establish a connection from a remote machine

1. Start an SSH tunnel.

ssh -i my_private_key.pem hadoop@<master_node> -N -L 1234:localhost:10000

Replace <master_node> with the DNS name of the master node of your job flow. Alternatively, youcan establish an SSH tunnel using Java secure channel (JSch).

2. Connect to the Hive server using the JDBC connection string.

jdbc:hive://localhost:1234/default

Alternatively, you can connect from a machine running in Amazon EC2 that is either in the ElasticMapReduce-master or ElasticMapReduce-slave security group.

Persist Hive Schema

By default, Hive keeps its schema information on the master node and that information ceases to exist when the job flow terminates. You can use the hive-site.xml feature to override the default location of the metadata store and replace it with a location that persists. In the following example, the default location is replaced by a MySQL instance that is already running in Amazon EC2.

The first step is to create a Hive site configuration file and store it in Amazon S3 so that it can override the location of the metadata store.

<configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://ec2-72-44-33-189.compute-1.amazon aws.com:3306/hive?user=user12&password=abababa7&create=true</value> <description>JDBC connect string for a JDBC metastore</description> </property>

<property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description>

API Version 2009-11-30364

Amazon Elastic MapReduce Developer GuideAdditional Features of Hive in Amazon EMR

Page 371: Amazon elastic map reduce

</property> </configuration>

In this example, the Hive site configuration file is in the following Amazon S3 location:

s3://mybucket/conf/hive-site.xml

Next, using the Amazon EMR command line, you can use hive-site to install the configuration file in a job flow.

elastic-mapreduce --jobflow $JOBFLOW \
--hive-site=s3://mybucket/conf/hive-site.xml

Amazon EMR Hive Steps

The Amazon EMR CLI provides a convenient way to access Hive steps. You can also access Hive steps from programs that call directly into the Amazon EMR web service.

The following shows the Hive install step returned by running the --describe command on a job flow that has Hive installed on it.

"StepConfig": { "ActionOnFailure": "TERMINATE_JOB_FLOW", "Name": "Setup Hive", "HadoopJarStep": { "MainClass": null, "Jar": "s3:\/\/us-east-1.elasticmapreduce\/libs\/script-runner\/script-runner.jar", "Args": [ "s3:\/\/us-east-1.elasticmapreduce\/libs\/hive\/0.4\/install-hive" ], "Properties": [] } }

In this example, you can see that a custom JAR called script-runner executes a script called install-hive, which resides in Amazon S3.

Notice that the install scripts are region-specific. If you're launching a job flow in eu-west-1, for example, you should reference the install script in the bucket eu-west-1.elasticmapreduce rather than the bucket us-east-1.elasticmapreduce.

Hive Patches

The Amazon EMR team has created the following patches for Hive.


Patch: Write to Amazon S3
Description: Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a "move" operation. Removes redundant operations like moving data to and from the same location.
Status: Submitted
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: n/a (HIVE-2318)

Patch: Scripts in Amazon S3
Description: Enables Hive to download the Hive scripts in Amazon S3 buckets and run them. Saves you the step of copying scripts to HDFS before running them.
Status: Committed
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: 0.7.0 (HIVE-1624)

Patch: Recover partitions
Description: Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.
Status: Not Submitted
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: n/a

Patch: Variables in Hive
Description: Create a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'.
Status: Committed
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: 0.8.0 (HIVE-2020)

Patch: Report progress while writing to Amazon S3
Description: FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed.
Status: Not Submitted
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: n/a

Patch: Fix compression arguments
Description: Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files.
Status: Submitted
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: n/a (HIVE-2266)

Patch: Fix UDAFPercentile to tolerate null percentiles
Description: Fixes an issue where UDAFPercentile would throw a null pointer exception when passed a null percentile list.
Status: Committed
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: 0.8.0 (HIVE-2298)

Patch: Fix hashCode method in DoubleWritable class
Description: Fixes the hashCode() method of the DoubleWritable class of Hive and prevents the HashMap (of type DoubleWritable) from behaving as a LinkedList.
Status: Committed
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: 0.7.0 (HIVE-1629)

Patch: Recover partitions, version 2
Description: Improved version of Recover Partitions that uses less memory.
Status: Not Submitted
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: n/a

Patch: HAVING clause
Description: Use the HAVING clause to directly filter on group by expressions (instead of using nested queries). Integrates Hive with other data analysis tools that rely on the HAVING expression.
Status: Committed
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: 0.7.0 (HIVE-1790)

Patch: Improve Hive query performance
Description: Reduces startup time for queries spanning a large number of partitions.
Status: Committed
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: 0.8.0 (HIVE-2299)

Patch: Improve Hive query performance for Amazon S3 queries
Description: Reduces startup time for Amazon S3 queries. Set hive.optimize.s3.query=true to enable optimization. The optimization flag assumes that the partitions are stored in standard Hive partitioning format: "HIVE_TABLE_ROOT/partition1=value1/partition2=value2". This is the format used by Hive to create partitions when you do not specify a custom location. The partitions in an optimized query should have the same prefix, with HIVE_TABLE_ROOT as the common prefix.
Status: Not Submitted
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: n/a

Patch: Skip comments in Hive scripts
Description: Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.
Status: Committed
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: 0.8.0 (HIVE-2259)

Patch: Limit Recover Partitions
Description: Improves performance recovering partitions from Amazon S3 when there are many partitions to recover.
Status: Not Submitted
Fixed in AWS Hive Version: 0.8.1
Fixed in Apache Hive Version: n/a

Upgrade to Hive 0.8

Running Hive 0.8 on Amazon EMR offers several improvements, such as support for Hadoop 1.0.3, Binary and Timestamp data types, a plugin developer kit, and JDBC driver improvements. (For the full list of new features, see Supported Hive Versions (p. 349).) To use these improvements, you can either launch a new job flow on an AMI that supports Hive 0.8 or upgrade an older Hive job flow to Hive 0.8.

Upgrading an existing job flow to Hive 0.8 is useful if you have a long-running job flow that you don't want to restart, or you are storing the Hive metastore outside of the job flow, for example, running a MySQL database on Amazon RDS.

Note
Hadoop 1.0.3 (AMI 2.2.0 or later) does not support Hive 0.7 or Hive 0.5 and you must use Hive 0.8 or later.

Topics

• Upgrade the Configuration Files (p. 368)

• Upgrade the Metastore (p. 369)

Upgrade the Configuration Files

The Hive configuration files changed between Hive 0.7 and Hive 0.8. In Hive 0.8 the hive-default.xml file was deprecated; all configuration settings now are listed in hive-site.xml.

If your application sets values in hive-default.xml, you have two options:

• Set your Hive settings in a script instead of the configuration file. For example:

set hive.s3.optimize.query=true;

• Move your settings to the new Hive 0.8 version of hive-site.xml. When you do so, take care not to overwrite the values set by Amazon EMR.

You can obtain a copy of the Hive 0.8 version of hive-site.xml with the Amazon EMR settings included by launching a new job flow that uses Hive 0.8 and copying it from hive/conf/ on the master node. To do so, use SSH to connect to the master node and use the scp utility to copy the file to your local machine. For more information about how to use SSH to connect to the master node, see Connect to the Master Node Using SSH (p. 111).
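For example, a minimal sketch of copying the file from the master node, assuming the placeholder MasterNodeDNS and key file names used elsewhere in this guide and that Hive is installed under the hadoop user's home directory:

scp -i $HOME/mysecretkey.pem hadoop@MasterNodeDNS:hive/conf/hive-site.xml ./hive-site-0.8.xml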


Upgrade the Metastore

The Hive metastore stores the metadata for Hive tables and partitions. Amazon EMR uses a MySQL database to contain the metastore.

By default, Amazon EMR creates the MySQL database on the master node. In this scenario, the metastore is deleted when the job flow terminates. For the metastore to persist between job flows, you can specify that the Hive job flow use a remote metastore, such as a MySQL database hosted on Amazon RDS. For more information about how to create a remote metastore, see Creating a Metastore Outside the Hadoop Cluster (p. 357).

If you create a new job flow using Hive 0.8 and let it create a new metastore on the master node (the default behavior), it will have the new schema. No updates are required.

If you have an existing metastore, created with Hive 0.7 or earlier, that you want to reuse, you must update the schema to the Hive 0.8 format. Apache Hive provides scripts you can use to update metastore schemas from one version to another; their use is explained in the following procedures.

If you are updating a metastore created with a version of Hive prior to 0.7, this may require multiple steps. For example, a metastore created with Hive 0.5 would first need to be updated to the Hive 0.6 schema, then the Hive 0.7 schema, before it could be updated to the Hive 0.8 schema. For a list of the Apache update scripts for previous versions of Hive, go to http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/.

Note
The transformation scripts only work in one direction; after you've converted your Hive metastore to the Hive 0.8 format, you cannot use the scripts to convert the metastore back to the Hive 0.7 format. It is recommended that you back up your metastore before you begin the upgrade process.

Upgrade to Hive 0.8 (MySQL on the Master Node)

The following procedures explain how to upgrade a Hive metastore stored in a MySQL database hosted on the master node of a job flow (the default behavior).

If your metastore is stored remotely, as described in Creating a Metastore Outside the Hadoop Cluster (p. 357), please use the instructions at Upgrade to Hive 0.8 (MySQL on Amazon RDS) (p. 373) instead.

To upgrade the metastore from 0.7 to 0.8 (MySQL on the master node)

1. Stop running Hive processes during the upgrade procedure. This ensures that the database is not altered while the upgrade scripts are running.

2. Use SSH to connect to the master node of the Hive cluster to upgrade. For more information about how to use SSH to connect to the master node, see Connect to the Master Node Using SSH (p. 111). If you have the Amazon EMR CLI installed, you can use the following command to connect, where jobflowid is the identifier of the job flow to connect to.

elastic-mapreduce --ssh -j jobflowid

3. Copy the upgrade scripts from Apache to the master node. You can do this by running the wget utility on the master node. (The view=co request variable in the following URLs ensures that you get the plain-text version of the scripts, not the HTML-encoded version.)


wget -O 008-HIVE-2246.mysql.sql http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql?view=co

wget -O 009-HIVE-2215.mysql.sql http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/009-HIVE-2215.mysql.sql?view=co

wget -O upgrade-0.7.0-to-0.8.0.mysql.sql http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/upgrade-0.7.0-to-0.8.0.mysql.sql?view=co

4. Launch the MySQL Monitor using the following command.

mysql -u root

5. From the MySQL Monitor command prompt, find the name of the database that contains your Hive metastore. You can do this by running the following command.

mysql> show databases;

This returns results such as the following. The database that starts with "hive_" is the one that contains your Hive metastore. In the example below, this is "hive_071".

+--------------------+
| Database           |
+--------------------+
| information_schema |
| InstanceController |
| hive_071           |
| mysql              |
+--------------------+
4 rows in set (0.00 sec)

6. Exit MySQL Monitor.

mysql> exit

7. Back up your MySQL metastore database. You can do this using the mysqldump utility, as shown in the example below, where hive_071 is the name of the database containing the metastore.


mysqldump --opt hive_071 > metastore_backup.sql -u root

8. Export the current metastore schema to a file. The following mysqldump command extracts only theschema content.

mysqldump --skip-add-drop-table --no-data \
hive_071 > my-schema-0.7.1.mysql.sql -u root

9. Compare your current schema against the official Apache Hive schema listed at http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/. If you are upgrading from 0.7 to 0.8, compare the schema you exported against the version in hive-schema-0.7.0.mysql.sql (see the example command after the following list). The schemas should match. If you have made custom changes to the schema, you may need to roll those back in order for the upgrade scripts to work properly.

Differences you may find:

• Missing tables. By default, Hive only creates schema elements when they are used. If you have not created a certain type of Hive catalog object, the corresponding table will not exist in your schema. You must create these missing tables for the upgrade script to succeed. You can do this by hand or by running the official schema DDL script (located at http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql?view=co) against your metastore. This script ignores tables that already exist, and will only create those that are missing.

• Extra tables. If your schema contains tables named NUCLEUS_TABLES or SEQUENCE_TABLE, these will not affect the upgrade script. You do not need to remove them.

• Reversed Column Constraint Names. If a table has multiple constraints, the names may be reversed between your schema and the canonical schema. For example, if a table contains PARTITIONS_FK1 and PARTITIONS_FK2 which reference SDS.SD_ID and TBLS.TBL_ID, your schema may instead connect PARTITIONS_FK1 to TBLS.TBL_ID and PARTITIONS_FK2 to SDS.SD_ID. This will not affect the upgrade script and can be ignored.

• Changes in Column and Constraint Names. If your schema contains tables with unique keys named "UNIQUE<tab_name>" or columns named "IDX", you will need to rename these to "UNIQUE_<tab_name>" and "INTEGER_IDX" before you run the upgrade script. The reason for this is explained in HIVE-1435.
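One way to compare the exported schema with the official one is the diff utility. The following is a sketch, assuming you have downloaded hive-schema-0.7.0.mysql.sql from the Apache repository above into the same directory as the exported file:

diff my-schema-0.7.1.mysql.sql hive-schema-0.7.0.mysql.sql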

10. Launch the MySQL Monitor using the following command.

mysql -u root

11. Run the two update scripts using the source command in the MySQL command line. This is shown in the following example, where hive_071 is the database containing the metastore.

mysql> use hive_071;mysql> source upgrade-0.7.0-to-0.8.0.mysql.sql;


The script should complete without error.

12. Exit MySQL Monitor.

mysql> exit

13. Export the upgraded metastore schema to a file. The following mysqldump command extracts only the schema content.

mysqldump --skip-add-drop-table --no-data \
hive_071 > my-schema-0.8.1.mysql.sql

14. Compare this schema to the official Apache Hive 0.8 schema: http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql. They should match.

15. Back up the upgraded metastore.

mysqldump --opt hive_071 > metastore_upgraded_backup.sql \ -u root

To move your upgraded configuration files and metastore to a Hive 0.8 job flow (MySQL on the master node)

1. Create a new job flow on Hadoop 1.0.3 (AMI version 2.2 or later).

2. Use SSH to connect to the master node. For more information on how to do this, see Connect to the Master Node Using SSH (p. 111).

3. Use the scp utility to copy your upgraded configuration files and upgraded metastore backup file to the new job flow.

4. Copy custom configuration settings from hive-default.xml and hive-site.xml to the hive-site.xml on the new job flow (found on the master node at hive/conf/). Take care not to overwrite any settings used by Amazon EMR.

5. Import the upgraded metastore to the MySQL database (whether locally on the master node, or on a remote server).

mysql -u root

mysql>show databases;

This returns results such as the following. The database that starts with "hive_" is the one that contains your Hive metastore. In the example below, this is "hive_081".

+--------------------+
| Database           |
+--------------------+
| information_schema |
| InstanceController |
| hive_081           |
| mysql              |
+--------------------+
4 rows in set (0.00 sec)

6. Replace this empty metastore, hive_081 in the preceding example, with your upgraded metastore. You can use the following commands to move your database to the new location.

mysql> drop database hive_081;mysql> create database hive_081;mysql> use database hive_081;mysql> source metastore_upgraded_backup.sql;

Upgrade to Hive 0.8 (MySQL on Amazon RDS)

The following procedures explain how to upgrade a Hive metastore stored outside of the job flow, as described in Creating a Metastore Outside the Hadoop Cluster (p. 357).

If your metastore is stored on the master node (the default behavior), please use the procedures in Upgrade to Hive 0.8 (MySQL on the Master Node) (p. 369) instead.

In order to connect to the Amazon RDS database from the master node, you need to add the EC2 security group ElasticMapReduce-master to the DB Security Groups. You can do this as described in the following procedure.

To configure security groups to enable connections to Amazon RDS from the master node

1. From the Amazon RDS Console, click DB Security Groups in the left pane.

2. Select the security group to modify in the center pane. This should be the security group that was used to launch the MySQL database hosting the Hive metastore. The default security group is default.

3. In the information pane at the bottom, select EC2 Security Group for Connection Type and select ElasticMapReduce-master for EC2 Security Group.

4. Click Add.


To upgrade the metastore from 0.7 to 0.8 (MySQL on Amazon RDS)

1. Stop running Hive processes during the upgrade procedure. This ensures that the database is not altered while the upgrade scripts are running.

2. Back up the Amazon RDS database as described in Creating a DB Snapshot in the Amazon Relational Database Service User Guide.

3. Download the following schema upgrade scripts from Apache Hive to your local machine. (The view=co request variable in the following URLs ensures that you get the plain-text version of the scripts, not the HTML-encoded version.)

• http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql?view=co

• http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/009-HIVE-2215.mysql.sql?view=co

4. Use SSH to connect to the master node of the Hive cluster to upgrade. For more information, see Connect to the Master Node Using SSH (p. 111). If you have the Amazon EMR CLI installed, you can use the following command to connect, where jobflowid is the identifier of the job flow to connect to.

elastic-mapreduce --ssh -j jobflowid

5. Use the MySQL monitor installed on the master node to connect to the Amazon RDS database. Before you can do this, you must have completed the steps in the preceding procedure, To configure security groups to enable connections to Amazon RDS from the master node.

From the master node, run the following command to connect to the Amazon RDS database, where myinstance is the name of the database, mydnsnameexample is the custom portion of the DNS name assigned to the database, and mymasteruser is the master user on the database. For more information about how to connect to an Amazon RDS instance, go to Connecting to a DB Instance Running the MySQL Database Engine in the Amazon Relational Database Service User Guide.

mysql -h myinstance.mydnsnameexample.rds.amazonaws.com -P 3306 -u mymasteruser -p

6. From the MySQL Monitor command prompt, find the name of the database that contains your Hive metastore. You can do this by running the following command.

mysql> show databases;

This returns results such as the following. The database that starts with "hive_" is the one that contains your Hive metastore. In the example below, this is "hive_071".

+--------------------+
| Database           |
+--------------------+
| information_schema |
| InstanceController |
| hive_071           |
| mysql              |
+--------------------+
4 rows in set (0.00 sec)

7. Exit MySQL Monitor.

mysql> exit

8. Export the current metastore schema from Amazon RDS. The following mysqldump command extracts only the schema content.

mysqldump --skip-add-drop-table --no-data hive_071 \
-h myinstance.mydnsnameexample.rds.amazonaws.com \
-P 3306 -u mymasteruser -p > my-schema-0.7.1.mysql.sql

9. Compare your current schema against the official Apache Hive schema listed at http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/. If you are upgrading from 0.7 to 0.8, compare the schema you exported against the version in hive-schema-0.7.0.mysql.sql. The schemas should match. If you have made custom changes to the schema, you may need to roll those back in order for the upgrade scripts to work properly.

Differences you may find:

• Missing tables. By default, Hive only creates schema elements when they are used. If you have not created a certain type of Hive catalog object, the corresponding table will not exist in your schema. You must create these missing tables for the upgrade script to succeed. You can do this by hand or by running the official schema DDL script (located at http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql?view=co) against your metastore. This script ignores tables that already exist, and will only create those that are missing.

• Extra tables. If your schema contains tables named NUCLEUS_TABLES or SEQUENCE_TABLE, these will not affect the upgrade script. You do not need to remove them.

• Reversed Column Constraint Names. If a table has multiple constraints, the names may be reversed between your schema and the canonical schema. For example, if a table contains PARTITIONS_FK1 and PARTITIONS_FK2 which reference SDS.SD_ID and TBLS.TBL_ID, your schema may instead connect PARTITIONS_FK1 to TBLS.TBL_ID and PARTITIONS_FK2 to SDS.SD_ID. This will not affect the upgrade script and can be ignored.

• Changes in Column and Constraint Names. If your schema contains tables with unique keys named "UNIQUE<tab_name>" or columns named "IDX", you will need to rename these to "UNIQUE_<tab_name>" and "INTEGER_IDX" before you run the upgrade script. The reason for this is explained in HIVE-1435.

10. Use the MySQL monitor installed on the master node to connect to the Amazon RDS database.

mysql -h myinstance.mydnsnameexample.rds.amazonaws.com \
-P 3306 -u mymasteruser -p

11. Select the database to run the upgrade scripts against. This is shown in the following example, where hive_071 is the database containing the metastore.

mysql> use hive_071;

12. Run the two upgrade scripts you downloaded from Apache Hive in Step 3. Cut and paste the SQL commands from the scripts into the MySQL monitor session connected to Amazon RDS. The scripts should complete without error.

13. Exit MySQL Monitor.

mysql> exit

14. Export the upgraded metastore schema from Amazon RDS. The following mysqldump command extracts only the schema content.

mysqldump --skip-add-drop-table --no-data hive_071 \
-h myinstance.mydnsnameexample.rds.amazonaws.com \
-P 3306 -u mymasteruser -p > my-schema-0.8.1.mysql.sql

15. Compare this schema to the official Apache Hive 0.8 schema: http://svn.apache.org/viewvc/hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql. They should match.

16. Back up the upgraded metastore by backing up the Amazon RDS database as described in Creating a DB Snapshot in the Amazon Relational Database Service User Guide.

To move your upgraded configuration files and metastore to a Hive 0.8 job flow (MySQL on Amazon RDS)

1. Create a new job flow on Hadoop 1.0.3 (AMI version 2.2 or later). Follow the instructions at Creating a Metastore Outside the Hadoop Cluster (p. 357) to have the Hive job flow use the metastore you upgraded to Hive 0.8 in the previous procedure.

2. Use SSH to connect to the master node. For more information on how to do this, see Connect to the Master Node Using SSH (p. 111).

3. Use scp to copy your upgraded configuration files to the new job flow.

4. Copy custom configuration settings from hive-default.xml and hive-site.xml to the hive-site.xml on the new job flow (found on the master node at hive/conf/). Take care not to overwrite any settings used by Amazon EMR.


Pig Configuration

Amazon Elastic MapReduce (Amazon EMR) supports Apache Pig, a platform you can use to analyze large data sets. For more information about Pig, go to http://pig.apache.org/. Amazon EMR supports several versions of Pig. The following sections describe how to configure Pig on Amazon EMR.

Topics

• Supported Pig Versions (p. 377)

• Pig Version Details (p. 379)

Supported Pig Versions

The versions of Pig you can run depend on the version of the Amazon Elastic MapReduce (Amazon EMR) AMI and the version of Hadoop you are using. The table below shows which AMI versions and versions of Hadoop are compatible with the different versions of Pig. We recommend using the latest available version of Pig to take advantage of performance enhancements and new functionality. To select the configuration, use the --ami-version, --hadoop-version, and --pig-versions parameters in the job flow creation call.

The default configuration for Amazon EMR job flows launched with AMI version 2.2 and later is Hadoop 1.0.3 with Pig 0.9.2.1. The default configuration for Amazon EMR job flows launched with AMI version 1.0 is Hadoop 0.18 with Pig 0.3. For more information about the Amazon EMR AMIs and AMI versioning, see Specify the Amazon EMR AMI Version (p. 290).

The Amazon EMR console does not support Pig versioning and always launches the latest version of Pig.

The version of the Amazon EMR CLI released on 9 April 2012 is the first version to support Pig versioning. Job flows created with versions of the CLI downloaded before 9 April 2012 do not support Pig versioning and use the default configuration of Pig. Job flows created with versions of the Amazon EMR CLI downloaded on 9 April 2012 or later will use the latest version of Pig available on the AMI, unless otherwise specified using the --pig-versions parameter. You can download the latest version of the CLI from http://aws.amazon.com/code/Elastic-MapReduce/2264.

Calls to the API will launch the default configuration of Pig unless you specify --pig-versions as an argument to the step that loads Pig onto the job flow during the call to RunJobFlow.

Pig Version: 0.3
Hadoop Version: 0.18
AMI Version: 1.0
Configuration Parameters: --pig-versions 0.3 --hadoop-version 0.18 --ami-version 1.0

Pig Version: 0.6
Hadoop Version: 0.20
AMI Version: 1.0
Configuration Parameters: --pig-versions 0.6 --hadoop-version 0.20 --ami-version 1.0

Pig Version: 0.9.1
Hadoop Version: 0.20.205
AMI Version: 2.0
Configuration Parameters: --pig-versions 0.9.1 --hadoop-version 0.20.205 --ami-version 2.0

Pig Version: 0.9.2
Hadoop Version: 1.0.3
AMI Version: 2.2 and later
Configuration Parameters: --pig-versions 0.9.2 --hadoop-version 1.0.3 --ami-version 2.2

Pig Version: 0.9.2.1
Hadoop Version: 1.0.3
AMI Version: 2.2 and later
Configuration Parameters: --pig-versions 0.9.2.1 --hadoop-version 1.0.3 --ami-version 2.2

Pig Version: 0.9.2.2
Hadoop Version: 1.0.3
AMI Version: 2.2 and later
Configuration Parameters: --pig-versions 0.9.2.2 --hadoop-version 1.0.3 --ami-version 2.2

To specify the Pig version when creating the job flow

• Use the --pig-versions parameter. The following command-line example creates an interactive Pig job flow running Hadoop 1.0.3 and Pig 0.9.2. In the following, instanceType would be replaced by an EC2 instance type such as m1.small.

elastic-mapreduce --create --alive --name "Test Pig" \--hadoop-version 1.0.3 \--ami-version 2.2 \--num-instances 5 --instance-type instanceType \--pig-interactive \--pig-versions 0.9.2

To specify the latest Pig version when creating the job flow

• Use the --pig-versions parameter with the latest keyword. The following command-line example creates an interactive Pig job flow running the latest version of Pig. In the following, instanceType would be replaced by an EC2 instance type such as m1.small.

elastic-mapreduce --create --alive --name "Test Latest Pig" \--hadoop-version 1.0.3 \--ami-version 2.2 \--num-instances 5 --instance-type instanceType \--pig-interactive \--pig-versions latest

To load multiple versions of Pig for a given job flow

• Use the --pig-versions parameter and separate the version numbers by commas. The following command-line example creates an interactive Pig job flow running Hadoop 0.20.205 and Pig 0.9.1 and Pig 0.9.2. With this configuration, you can use either version of Pig on the job flow. In the following, instanceType would be replaced by an EC2 instance type such as m1.small.

elastic-mapreduce --create --alive --name "Test Pig" \
--hadoop-version 0.20.205 \
--ami-version 2.0 \
--num-instances 5 --instance-type instanceType \
--pig-interactive \
--pig-versions 0.9.1,0.9.2

If you have multiple versions of Pig loaded on a job flow, calling pig will access the default version of Pig (currently 0.9.2), or the version loaded last if there are multiple --pig-versions parameters specified in the job flow creation call. When the comma-separated syntax is used with --pig-versions to load multiple versions, pig will access the default version of Pig.

To call a specific version of Pig

• Add the version number to the call. For example, pig-0.9.1 or pig-0.9.2. You would do this, for example, in an interactive Pig job flow by using SSH to connect to the master node and then running a command like the following from the terminal.

pig-0.9.1

To display the Pig version

• You can use the --print-pig-version command to display the version of Pig currently in use for a given job flow. This is a useful command to call after you have upgraded to a new version of Pig to confirm that the upgrade succeeded, or when you are using multiple versions of Pig and need to confirm which version is currently running. The syntax for this is as follows, where JobFlowID is the identifier of the job flow to check the Pig version on.

elastic-mapreduce --jobflow JobFlowID --print-pig-version
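
For example, the following call prints the Pig version for a specific job flow; j-EXAMPLEJOBFLOWID is a placeholder for your own job flow identifier.

# j-EXAMPLEJOBFLOWID is a placeholder; substitute the identifier of your job flow
elastic-mapreduce --jobflow j-EXAMPLEJOBFLOWID --print-pig-version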

Pig Version Details

You can configure which version of Pig to run on Amazon Elastic MapReduce (Amazon EMR) job flows. For more information on how to do this, see Pig Configuration (p. 377). The following sections describe different Pig versions and the patches applied to the versions loaded on Amazon EMR.

New Features of Pig 0.9.2

Pig 0.9.2.2 adds support for Hadoop 1.0.3.


Pig 0.9.2.1 adds support for MapR. For more information, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).

Pig 0.9.2 includes several performance improvements and bug fixes. For complete information about the changes for Pig 0.9.2, go to the Pig 0.9.2 Change Log.

Pig 0.9.2 Patches

Apache Pig 0.9.2 is a maintenance release of Pig. The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.2.

PIG-1429
Description: Add the Boolean data type to Pig as a first class data type. For more information, go to https://issues.apache.org/jira/browse/PIG-1429.
Status: Committed. Fixed in Apache Pig Version: 0.10.

PIG-1824
Description: Support import modules in Jython UDF. For more information, go to https://issues.apache.org/jira/browse/PIG-1824.
Status: Committed. Fixed in Apache Pig Version: 0.10.

PIG-2010
Description: Bundle registered JARs on the distributed cache. For more information, go to https://issues.apache.org/jira/browse/PIG-2010.
Status: Committed. Fixed in Apache Pig Version: 0.11.

PIG-2456
Description: Add a ~/.pigbootup file where the user can specify default Pig statements. For more information, go to https://issues.apache.org/jira/browse/PIG-2456.
Status: Committed. Fixed in Apache Pig Version: 0.11.

PIG-2623
Description: Support using Amazon S3 paths to register UDFs. For more information, go to https://issues.apache.org/jira/browse/PIG-2623.
Status: Committed. Fixed in Apache Pig Version: 0.10, 0.11.

Pig 0.9.1 Patches

The Amazon EMR team has applied the following patches to the Amazon EMR version of Pig 0.9.1.

Support JAR files and Pig scripts in dfs
Description: Add support for running scripts and registering JAR files stored in HDFS, Amazon S3, or other distributed file systems. For more information, go to https://issues.apache.org/jira/browse/PIG-1505.
Status: Committed. Fixed in Apache Pig Version: 0.8.0.

Support multiple file systems in Pig
Description: Add support for Pig scripts to read data from one file system and write it to another. For more information, go to https://issues.apache.org/jira/browse/PIG-1564.
Status: Not Committed. Fixed in Apache Pig Version: n/a.

Add Piggybank datetime and string UDFs
Description: Add datetime and string UDFs to support custom Pig scripts. For more information, go to https://issues.apache.org/jira/browse/PIG-1565.
Status: Not Committed. Fixed in Apache Pig Version: n/a.

Performance Tuning

Amazon Elastic MapReduce (Amazon EMR) enables you to specify the number and kind of Amazon EC2 instances in the cluster. These specifications are the primary means of affecting the speed with which your job flow completes. There are, however, a number of Hadoop parameter values that govern the operation of Amazon EC2 instances at a much finer level of granularity.

By default, Amazon EMR sets many Hadoop parameters. Some of these parameter values can be overridden by parameter values set in a RunJobFlow request. For more information, see RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. Hadoop parameters govern such things as the number of mapper and reducer tasks assigned to each node in the cluster, the amount of memory allocated for these tasks, the number of threads, timeouts, and other configuration parameters for the various Hadoop components.

Hadoop configuration parameters reside in Hadoop's JobConf file. You set the Hadoop configuration parameters by including them in your JAR file. For streaming jobs you can specify JobConf parameters using the --jobconf option. For more information from the Hadoop website, go to the Hadoop Map/Reduce Tutorial.
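
For example, the following command-line sketch creates a streaming job flow and uses --jobconf to override a Hadoop setting. The bucket and script locations are placeholders, and mapred.reduce.tasks is a standard Hadoop parameter that sets the number of reduce tasks.

# mybucket and the mapper/reducer script locations are placeholders for your own Amazon S3 paths
elastic-mapreduce --create --stream --name "Streaming with JobConf override" \
--input s3n://mybucket/input \
--output s3n://mybucket/output \
--mapper s3n://mybucket/scripts/mapper.py \
--reducer s3n://mybucket/scripts/reducer.py \
--jobconf mapred.reduce.tasks=10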

JobConf parameters often act in concert with related parameters or the entire framework, and therefore they are more difficult to set. For more information, go to Job Configuration on the Hadoop website.

To assist with debugging and performance tuning, Amazon EMR keeps a log of the Hadoop settings (from the Hadoop JobConf) that were used to execute each job flow. These XML files are stored under jobs/ in Amazon S3 or at /mnt/var/log/hadoop/history/ on the master node.
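
For example, after connecting to the master node with SSH, you can list the JobConf XML files directly; the Amazon S3 prefix of the copies pushed to Amazon S3 depends on the log URI you specified when creating the job flow.

# On the master node: list the JobConf XML files recorded for completed jobs
ls /mnt/var/log/hadoop/history/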

Running Job Flows on an Amazon VPC

Topics

• Restricting Permissions with IAM on Amazon VPC (p. 383)

• Setting up an Amazon VPC to Host Job Flows (p. 384)

• Launching job flows on Amazon VPC (p. 386)

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a private area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables, and network gateways. For more information about Amazon VPC, go to the Amazon Virtual Private Cloud User Guide.

When you launch an Amazon Elastic MapReduce (Amazon EMR) job flow, you can choose to launch it either on the AWS cloud (the default) or on an Amazon VPC.

Reasons why you might choose to launch your job flow on Amazon VPC include:

• Processing sensitive data
Launching a job flow on Amazon VPC is similar to launching the job flow on a private network with additional tools, such as routing tables and Network ACLs, for defining who has access to the network. If you are processing sensitive data in your job flow, you may want the additional access control that launching your job flow on Amazon VPC provides.

• Accessing resources on an internal network
If your data store is located on a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the job flow on an Amazon VPC and connect to your data center through a VPN connection, enabling the job flow to access resources on your internal network. For example, if you have an Oracle database on a private VPN, launching your job flow on an Amazon VPC connected to that VPN makes it possible for the job flow to access the Oracle database.

The following diagram illustrates how an Amazon EMR job flow runs on Amazon VPC. The job flow is launched within a VPC subnet. Through the Internet gateway the job flow is able to contact resources on the AWS cloud such as Amazon S3 buckets.

Because access to and from the AWS cloud is a requirement of the job flow, you must connect an Internet gateway to the VPC subnet hosting the job flow. If your application has components you do not want connected to the Internet gateway, you can launch those components in other subnets you create within your VPC. In addition, because of the need to access the AWS cloud, you cannot use Network Address Translation (NAT) when you are running Amazon EMR in Amazon VPC.

Amazon EMR running on Amazon VPC uses two security groups, ElasticMapReduce-master and ElasticMapReduce-slave, which control access to the master and slave nodes. Both the slave and master nodes connect to Amazon S3 through the Internet gateway.


The following diagram shows how to set up an Amazon VPC so that the job flow can access resources on a local VPN.

Note
In order for an Amazon EMR job flow to run inside a VPC, it must be able to connect to the AWS cloud through an Internet gateway. You cannot use Network Address Translation (NAT) with the job flow.

Restricting Permissions with IAM on Amazon VPC

When you launch a job flow on Amazon VPC, you can use IAM to control access to job flows and restrict actions via policies just as you would with job flows launched on the AWS cloud. For more information about how IAM works with Amazon EMR, go to Using AWS Identity and Access Management.

You can also use IAM to control who can create and administer VPC subnets. For more information about administering policies and actions, go to Configuring User Permissions in the IAM User's Guide.

By default, all IAM users can see all of the VPC subnets for the account, and any user can launch a job flow in any subnet.

You can limit access to the ability to administer the VPC subnet, while still allowing users to launch job flows into VPC subnets. To do so, create one user account which has permissions to create and configure VPC subnets and a second user account that can launch job flows but which can't modify Amazon VPC settings.

To allow users to launch job flows in an Amazon VPC without the ability to modify the Amazon VPC

1. Create the Amazon VPC and launch Amazon EMR into a subnet of that Amazon VPC using an account with permissions to administer Amazon VPC and Amazon EMR.

2. Create a second user account with permissions to call the RunJobFlow, DescribeJobFlows, TerminateJobFlows, and AddJobFlowSteps actions in the Amazon EMR API. You should also create an IAM policy that allows this user to launch EC2 instances. An example of this is shown below.

{ "Statement": [ { "Action": [ "ec2:AuthorizeSecurityGroupIngress",

API Version 2009-11-30383

Amazon Elastic MapReduce Developer GuideRestricting Permissions with IAM on Amazon VPC

Page 390: Amazon elastic map reduce

"ec2:CancelSpotInstanceRequests", "ec2:CreateSecurityGroup", "ec2:CreateTags", "ec2:DescribeAvailabilityZones", "ec2:DescribeInstances", "ec2:DescribeSecurityGroups", "ec2:DescribeSpotInstanceRequests", "ec2:ModifyImageAttribute", "ec2:ModifyInstanceAttribute", "ec2:RequestSpotInstances", "ec2:RunInstances", "ec2:TerminateInstances" ], "Effect": "Allow", "Resource": "*" }, { "Action": [ "elasticmapreduce:AddInstanceGroups", "elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeJobFlows", "elasticmapreduce:ModifyInstanceGroups", "elasticmapreduce:RunJobFlow" "elasticmapreduce:TerminateJobFlows" ], "Effect": "Allow", "Resource": "*" }

}

Users with the IAM permissions set above will be able to launch job flows within the VPC subnet, but will not be able to change the Amazon VPC configuration.

Note
You should be cautious when granting ec2:TerminateInstances permissions because this action gives the recipient the ability to shut down any Amazon EC2 instance in the account, including those outside of Amazon EMR.

Setting up an Amazon VPC to Host Job Flows

Before you can launch job flows on an Amazon VPC, you must create an Amazon VPC, a VPC subnet, and an Internet gateway. The following instructions describe how to create an Amazon VPC capable of hosting Amazon EMR job flows using the Amazon VPC console.

To create a VPC subnet to run Amazon EMR job flows

1. Sign in to the AWS Management Console and open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

2. Create an Amazon VPC by clicking Get started creating a VPC. Make sure that the Region drop-down box is set to the same Region where you'll be running your job flow. In this example, we're creating an Amazon VPC in the US East (N. Virginia) Region.


3. Choose the VPC configuration by selecting one of the radio buttons.

If the data used in the job flow is available on the Internet (e.g., Amazon S3 or Amazon RDS), select VPC with a Single Public Subnet Only.

If the data used in the job flow is stored locally (e.g., an Oracle database), select VPC with Public and Private subnets and Hardware VPN Access.

4. Confirm the Amazon VPC settings. In order to work with Amazon EMR, the Amazon VPC must have both an Internet Gateway and a subnet.


5. A dialog box confirms that the Amazon VPC was successfully created. Click Close.

You cannot use Network Address Translation (NAT) when you are using Amazon EMR on Amazon VPC.

Once you've created an Amazon VPC, you need to locate its subnet identifier; you'll use this value to launch the Amazon EMR job flow on the Amazon VPC.

To find the Amazon VPC subnet identifier

• Click on Subnets in the navigation menu of the Amazon VPC console. The right pane displays information about the Amazon VPC, including its subnet identifier.
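
If you prefer the command line, the Amazon EC2 API tools provide the same information; this sketch assumes the tools are installed and configured with your credentials.

# Lists the subnets in your account, including each subnet identifier (subnet-xxxxxxxx)
ec2-describe-subnets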

Launching job flows on Amazon VPC

Once you have a VPC subnet that is configured to host Amazon EMR job flows, launching job flows on that VPC subnet is as simple as specifying the subnet identifier during the job flow creation.

If the VPC subnet does not have an Internet gateway, the job flow creation call will fail with the error: "Subnet not correctly configured, missing route to an Internet gateway."


When the job flow is launched, Amazon EMR adds two security groups to the Amazon VPC: ElasticMapReduce-slave and ElasticMapReduce-master. By default, the ElasticMapReduce-master security group does not allow inbound SSH connections. If you require this functionality, you can add it to the security group.

To manage the job flow on an Amazon VPC, Amazon EMR attaches a network device to the master node and manages it through this device. You can view this device using the Amazon EC2 API DescribeInstances. If you disconnect this device, the job flow will fail.

Once the job flow is created, it will be able to access AWS services such as Amazon S3 to connect to data stores.

Note
Amazon VPC currently does not support CC1 instances. Thus, you cannot specify a cc1.4xlarge instance type for nodes of a job flow launched in an Amazon VPC.

To launch a job flow on an Amazon VPC using the Amazon EMR console

1. In the Amazon EMR console, click the Create New Job Flow button.

2. Follow the instructions in the Create a New Job Flow wizard, selecting options that match the job flow you want to launch.

3. When you reach the ADVANCED OPTIONS page, choose the Amazon VPC subnet you created previously from the Amazon VPC Subnet Id drop-down box. If you have not created an Amazon VPC subnet, click on the Create a VPC link underneath the drop-down box to open the Amazon VPC console and create an Amazon VPC and subnet.

4. Continue the Create a Job Flow Wizard until it is complete and the job flow is launched. It will be launched within the Amazon VPC subnet you specified in Step 3.

To launch a job flow on an Amazon VPC using the CLI

• Once your Amazon VPC is configured, you can launch Amazon EMR job flows on it by using the --subnet argument and specifying the subnet address. This is illustrated in the following example, which creates a long-running job flow on the specified VPC subnet.


elastic-mapreduce --create --alive --subnet subnet-identifier
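
The --subnet argument combines with the other job flow options used throughout this guide. In the following sketch, subnet-EXAMPLE is a placeholder for the subnet identifier you located earlier, and the instance settings are illustrative values.

# subnet-EXAMPLE is a placeholder; replace it with your own VPC subnet identifier
elastic-mapreduce --create --alive --name "Job Flow in a VPC" \
--subnet subnet-EXAMPLE \
--num-instances 5 --instance-type m1.small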

To launch a job flow on an Amazon VPC using the API

• Once your Amazon VPC is configured, you can launch Amazon EMR job flows on it by providing the VPC subnet identifier as the value for Ec2SubnetId, an optional String parameter on the JobFlowInstancesConfig structure.

https://elasticmapreduce.amazonaws.com?
 Operation=RunJobFlow&
 Name=MyJobFlowName&
 LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir&
 Instances.MasterInstanceType=m1.small&
 Instances.SlaveInstanceType=m1.small&
 Instances.InstanceCount=4&
 Instances.Ec2KeyName=myec2keyname&
 Instances.Placement.AvailabilityZone=us-east-1a&
 Instances.KeepJobFlowAliveWhenNoSteps=true&
 Instances.Ec2SubnetId=subnet-identifier&
 Steps.member.1.Name=MyStepName&
 Steps.member.1.ActionOnFailure=CONTINUE&
 Steps.member.1.HadoopJarStep.Jar=MyJarFile&
 Steps.member.1.HadoopJarStep.MainClass=MyMainClass&
 Steps.member.1.HadoopJarStep.Args.member.1=arg1&
 Steps.member.1.HadoopJarStep.Args.member.2=arg2&
 AWSAccessKeyId=AWS Access Key ID&
 SignatureVersion=2&
 SignatureMethod=HmacSHA256&
 Timestamp=2009-01-28T21%3A48%3A32.000Z&
 Signature=calculated value


Appendix: Compare Job Flow Types

This section provides a comparison of the job flow types supported in Amazon Elastic MapReduce (Amazon EMR). In most cases, you can use any job flow type to process large amounts of data with Amazon EMR. Choosing the method that is right for you depends on the structure of your data, your current knowledge of a scripting or programming language, and how much effort you want to expend writing MapReduce code.

Cascading

Cascading is a Java library that simplifies using the Hadoop MapReduce API. The API is based on pipes and filters, providing features like splitting and joining data streams.

Custom JAR

The Custom JAR job flow type supports MapReduce programs written in Java. You can leverage your existing knowledge of Java using this method. While you have the most flexibility of any job flow type in designing your job flow, you must know Java and the MapReduce API. Custom JAR is a low level interface. You are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks in your JAR.
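
As a minimal command-line sketch, the following creates a custom JAR job flow; the JAR location and its input and output arguments are placeholders, and the full set of options is described in How to Create a Job Flow Using a Custom JAR (p. 48).

# The JAR and its arguments are placeholders; the JAR's main class reads the
# input path and writes results to the output path
elastic-mapreduce --create --name "Custom JAR job flow" \
--jar s3n://mybucket/myjar.jar \
--arg s3n://mybucket/input \
--arg s3n://mybucket/output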

Hadoop Streaming

Hadoop streaming is the built-in utility provided with Hadoop. Streaming supports any scripting language, such as Python or Ruby. It is easy to read and debug, numerous libraries and data are available, and it is fast and simple. You can script your data analysis process and avoid writing code by using the existing libraries. Streaming is a low level interface. You are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks via scripts.


Hive

Hive is an open-source project that uses a SQL-like language. If you are familiar with SQL, the transition to using Hive is fairly easy. Hive allows customizations using Java JARs.
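
For example, a minimal sketch of creating an interactive Hive job flow with the elastic-mapreduce CLI; the name, instance count, and instance type are illustrative values, and the procedure is described in How to Create a Job Flow Using Hive (p. 32).

# Creates a long-running job flow with Hive installed for interactive use over SSH
elastic-mapreduce --create --alive --name "Interactive Hive" \
--num-instances 3 --instance-type m1.small \
--hive-interactive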

The following information resources are available for Hive:

• Hive overview: http://wiki.apache.org/hadoop/Hive

• Hive video tutorial: http://aws.amazon.com/articles/2862

• Running Hive on Amazon Elastic MapReduce (Amazon EMR): http://aws.amazon.com/articles/2857

• Additional features of Hive in Amazon Elastic MapReduce: http://aws.amazon.com/articles/2856

• Operating a data warehouse with Hive, Amazon EMR and Amazon SimpleDB: http://aws.amazon.com/articles/2854

• Contextual Advertising using Apache Hive and Amazon Elastic MapReduce (Amazon EMR) with High Performance Computing instances: http://aws.amazon.com/articles/2855

• Hive QL Language Manual: http://wiki.apache.org/hadoop/Hive/LanguageManual
This document explains the SQL-like language called Hive QL. Hive converts QL into MapReduce algorithms for job flows that you can then run using Amazon Elastic MapReduce.

Pig

Pig is an open-source project that uses its own language, called Pig Latin. If you have existing scripts written in Pig Latin, you can use them on Amazon EMR with little or no modification.

The following information resources are available for Pig and Pig Latin:

• Pig tutorial: Apache Log Analysis using Pig
This tutorial shows you how to analyze Apache logs using Pig and Elastic MapReduce.

• Pig video tutorial: Video that shows how to use a Pig script with the Amazon EMR console and SSH
This video tutorial shows you how to use Pig in batch and interactive modes with Elastic MapReduce.

• Sample Pig script: Parsing Logs with Apache Pig and Elastic MapReduce
This document shows a sample Pig script.

• PiggyBank functions: String Manipulation and DateTime Functions For Pig
This is a list of five functions that AWS added to the Pig library.

• Pig Latin: http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html
This document explains the SQL-like language called Pig Latin. Pig converts Pig Latin into MapReduce job flows that you can then run using Elastic MapReduce.

• Pig video tutorial: Using a Pig Script with the Console Video Tutorial

HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of the Apache Software Foundation's Hadoop project and runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and is highly efficient for batch inserts, updates, and deletes. For more information, see Store Data with HBase (p. 155).
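
As a sketch, the following command launches a job flow with HBase installed, assuming the --hbase option described in Store Data with HBase (p. 155); the name, instance count, and instance type are illustrative values.

# Launches a long-running cluster with HBase installed; choose an instance type
# with enough memory for your HBase workload
elastic-mapreduce --create --alive --name "HBase cluster" \
--hbase \
--num-instances 3 --instance-type m1.large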


Appendix: Amazon EMR Resources

Topics

• Amazon EMR Documentation (p. 391)

• Getting Help (p. 392)

• Tutorials (p. 392)

This section provides additional resources to help you use Amazon Elastic MapReduce (Amazon EMR).

Amazon EMR Documentation

Resource: Amazon EMR Getting Started Guide
Description: The Getting Started Guide provides a quick tutorial of the service based on a simple use case. Examples and instructions for the Amazon EMR console are included.

Resource: Amazon EMR API Reference
Description: The API Reference describes Amazon EMR operations, errors, and data structures.

Resource: Amazon EMR Technical FAQ
Description: The FAQ covers the top 20 questions developers ask about this product.

Resource: Release notes
Description: The release notes give a high-level overview of the current release. They specifically note any new features, corrections, and known issues.

Resource: AWS Developer Resource Center
Description: A central starting point to find documentation, code samples, release notes, and other information to help you build innovative applications with AWS.

Resource: Amazon EMR Console
Description: Location of the Amazon EMR console.

Resource: Discussion Forums
Description: A community-based forum for developers to discuss technical questions related to Amazon Web Services.

Resource: AWS Support Center
Description: The home page for AWS Technical Support, including access to our Developer Forums, Technical FAQs, Service Status page, and Premium Support.

Resource: AWS Premium Support
Description: The primary web page for information about AWS Premium Support, a one-on-one, fast-response support channel to help you build and run applications on AWS Infrastructure Services.

Resource: Amazon EMR product information
Description: The primary web page for information about Amazon EMR.

Resource: Contact Us
Description: A central contact point for inquiries concerning AWS billing, account, events, abuse, etc.

Resource: Conditions of Use
Description: Detailed information about the copyright and trademark usage at Amazon.com and other topics.

Getting Help

The AWS Support Center is the home page for AWS Technical Support. The page includes links to our Discussion Forums where you can ask questions of fellow developers and Amazon support personnel. On the same page, you can find links to the Elastic MapReduce Technical FAQ (Amazon EMR Technical FAQ) and the Service Status page. To get answers using the Elastic MapReduce documentation, go to the Amazon EMR Developer Guide.

Tutorials

Introduction to Amazon EMR

For additional CLI tutorials, go to http://aws.amazon.com/articles/Elastic-MapReduce.


Glossary

Amazon Machine Image: An Amazon Machine Image (AMI) is similar to the root drive of your computer. It contains the operating system and can also include software and layers of your application such as database servers, middleware, web servers, etc. AMIs are encrypted machine images stored in Amazon Elastic Block Store or Amazon Simple Storage Service.

authentication: The process of proving your identity to the system.

Access Key ID: A string that AWS distributes to uniquely identify each AWS user; it is an alphanumeric token associated with your Secret Access Key.

block: A data set. Amazon Elastic MapReduce (Amazon EMR) breaks large amounts of data into subsets. Each subset is called a data block. Amazon Elastic MapReduce (Amazon EMR) assigns an ID to each block and uses a hash table to keep track of block processing.

bootstrap action: Default or custom actions that you specify to run a script or an application on all nodes of a job flow before Hadoop starts.

bucket: A container for objects stored in Amazon S3. Every object is contained in a bucket. For example, if the object named photos/puppy.jpg is stored in the johnsmith bucket, then authorized users can access the object with the URL http://johnsmith.s3.amazonaws.com/photos/puppy.jpg.

Cascading: Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications.

core instance group: An instance group managing core nodes. Core instance groups must always contain at least one core node.

core node: A core node is an Amazon EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). It is managed by the master node, which schedules the Hadoop tasks that run on core and task nodes and monitors their status. While a job flow is running you can increase, but not decrease, the number of core nodes. Because core nodes store data and cannot be removed from a job flow, Amazon EC2 instances assigned as core nodes are capacity that you need to allot for the entire job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.


endpoint: A URL that identifies a host and port as the entry point for a web service. Every web service request contains an endpoint. Most AWS products provide regional endpoints to enable faster connectivity.

HMAC: HMAC (Hash-based Message Authentication Code) is a specific construction for calculating a message authentication code (MAC) involving a cryptographic hash function in combination with a secret key. You can use it to verify both the data integrity and the authenticity of a message at the same time. AWS calculates the HMAC using a standard, cryptographic hash algorithm, such as SHA-256.
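
As an illustration, the HMAC-SHA256 signature used by Signature Version 2 requests (such as the RunJobFlow query example earlier in this guide) can be sketched with the openssl command line; the string to sign and the Secret Access Key below are placeholders.

# "StringToSign" and "YourSecretAccessKey" are placeholders; the output is the
# Base64-encoded HMAC-SHA256 signature
echo -n "StringToSign" | openssl dgst -sha256 -hmac "YourSecretAccessKey" -binary | openssl base64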

intermediate results: Processing output created by the map step in the MapReduce process.

job flow: A job flow specifies the complete processing of the data. It consists of one or more steps, which specify all of the functions to be performed on the data.

key: The unique identifier for an object in a bucket. Every object in a bucket has exactly one key. Because a bucket and key together uniquely identify each object, you can think of Amazon S3 as a basic data map between the bucket + key, and the object itself. You can uniquely address every object in Amazon S3 through the combination of the web service endpoint, bucket name, and key, for example: http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, where doc is the name of the bucket, and 2006-03-01/AmazonS3.wsdl is the key.

mapper: An executable that splits the raw data into key/value pairs. The reducer uses the output of the mapper, called the intermediate results, as its input.

master instance group: The instance group managing the master node. There can be only one master instance group per job flow.

master node: A process running on an Amazon Machine Image that keeps track of the work its core and task nodes complete.

metadata: The metadata is a set of name-value pairs that describe the object. These include default metadata such as the date last modified and standard HTTP metadata such as Content-Type. The developer can also specify custom metadata at the time the object is stored.

node: After an Amazon Machine Image (AMI) is launched, the resulting running system is referred to as a node. All instances based on the same AMI start out identical and any information on them is lost when the node terminates or fails.

object: The fundamental entity stored in Amazon S3. Objects consist of object data and metadata. The data portion is opaque to Amazon S3.

reducer: An executable that uses the intermediate results from the mapper and processes them into the final output.

Secret Access Key: A key that Amazon Web Services assigns to you when you sign up for an AWS Account. In request authentication, it is the private key in a public/private key pair. (Sometimes called simply a "secret key.")

service endpoint: See endpoint.

shutdown actions: A predefined bootstrap action that launches a script that executes a series of commands in parallel before terminating the job flow.

signature: Refers to a digital signature, which is a mathematical way to confirm the authenticity of a digital message. AWS uses signatures to authenticate the requests you send to our web services.


slave node: Represents any nonmaster node in a Hadoop cluster. See core node and task node.

step: A single function applied to the data in a job flow. The sum of all steps comprises a job flow.

step type: The type of work done in a step. There are a limited number of step types, such as moving data from Amazon S3 to Amazon EC2 or moving data from Amazon EC2 to Amazon S3.

streaming: A utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java.

task instance group: An instance group managing task nodes.

task node: A task node is an Amazon EC2 instance that runs Hadoop map and reduce tasks and does not store data. It is managed by the master node, which schedules the Hadoop tasks that run on core nodes and task nodes and monitors their status. While a job flow is running you can increase and decrease the number of task nodes. Because task nodes do not store data and can be added and removed from a job flow, you can use them to manage the amount of Amazon EC2 instance capacity your job flow uses, increasing it to handle peak loads, and decreasing it later. Task nodes run only a TaskTracker Hadoop daemon.

tuning: Selecting the number and type of Amazon Machine Images to run a Hadoop job flow most efficiently.


Document History

The following table describes the important changes to the documentation since the last release of Amazon Elastic MapReduce (Amazon EMR).

API version: 2009-03-31.

Latest documentation update: December 20, 2012.

High Storage Instances (December 20, 2012): Amazon Elastic MapReduce supports hs1.8xlarge instances. For more information, go to Hadoop Default Configuration (AMI 2.0 and 2.1) (p. 314).
IAM Roles (December 20, 2012): Amazon Elastic MapReduce supports IAM Roles. For more information, go to Configure IAM Roles for Amazon EMR (p. 280).
Hive 0.8.1.6 (December 20, 2012): Amazon Elastic MapReduce supports Hive 0.8.1.6. For more information, go to Supported Hive Versions.
AMI 2.3.0 (December 20, 2012): Amazon Elastic MapReduce supports AMI 2.3.0. For more information, go to AMI Versions Supported in Amazon EMR.
AMI 2.2.4 (December 6, 2012): Amazon Elastic MapReduce supports AMI 2.2.4. For more information, go to AMI Versions Supported in Amazon EMR.
AMI 2.2.3 (November 30, 2012): Amazon Elastic MapReduce supports AMI 2.2.3. For more information, go to AMI Versions Supported in Amazon EMR.
Hive 0.8.1.5 (November 30, 2012): Amazon Elastic MapReduce supports Hive 0.8.1.5. For more information, go to Hive Configuration.
Asia Pacific (Sydney) Region (November 12, 2012): Adds support for Amazon EMR in the Asia Pacific (Sydney) Region.
Visible To All IAM Users (October 1, 2012): Added support for making a job flow visible to all IAM users on an AWS account. For more information, see Configure User Permissions with IAM (p. 274).
Hive 0.8.1.4 (September 17, 2012): Updates the HBase client on Hive job flows to version 0.92.0 to match the version of HBase used on HBase job flows. This fixes issues that occurred when connecting to an HBase job flow from a Hive job flow.
AMI 2.2.1 (August 30, 2012): Fixes an issue with HBase backup functionality. Enables multipart upload by default for files larger than the Amazon S3 block size specified by fs.s3n.blockSize. For more information, see Multipart Upload (p. 343).
AMI 2.1.4 (August 30, 2012): Fixes issues in the Native Amazon S3 file system. Enables multipart upload by default. For more information, see Multipart Upload (p. 343).
Hadoop 1.0.3, AMI 2.2.0, Hive 0.8.1.3, Pig 0.9.2.2 (August 6, 2012): Support for Hadoop 1.0.3. For more information see Supported Hadoop Versions (p. 300).
AMI 2.1.3 (August 6, 2012): Fixes issues with HBase.
AMI 2.1.2 (August 6, 2012): Support for Amazon CloudWatch metrics when using MapR.
AMI 2.1.1 (July 9, 2012): Improves the reliability of log pushing, adds support for HBase in Amazon VPC, and improves DNS retry functionality.
Major-Minor AMI Versioning (July 9, 2012): Improves AMI versioning by adding support for major-minor releases. Now you can specify the major-minor version for the AMI and always have the latest patches applied. For more information, see Specify the Amazon EMR AMI Version (p. 290).
Hive 0.8.1.2 (July 9, 2012): Fixes an issue with duplicate data in large job flows.
S3DistCp 1.0.5 (June 27, 2012): Provides better support for specifying the version of S3DistCp to use.
Store Data with HBase (June 12, 2012): Amazon EMR supports HBase, an open source, non-relational, distributed database modeled after Google's BigTable. For more information, see Store Data with HBase (p. 155).
Launch a Job Flow on the MapR Distribution for Hadoop (June 12, 2012): Amazon EMR supports MapR, an open, enterprise-grade distribution that makes Hadoop easier and more dependable. For more information, see Launch a Job Flow on the MapR Distribution for Hadoop (p. 260).
Connect to the Master Node in an Amazon EMR Job Flow (June 12, 2012): Added information about how to connect to the master node using both SSH and a SOCKS proxy. For more information, see Connect to the Master Node in an Amazon EMR Job Flow (p. 110).
Hive 0.8.1 (May 30, 2012): Amazon Elastic MapReduce supports Hive 0.8.1. For more information, go to Hive Configuration.
HParser (April 30, 2012): Added information about running Informatica HParser on Amazon EMR. For more information, see Parse Data with HParser (p. 258).
AMI 2.0.5 (April 19, 2012): Enhancements to performance and other updates. For details, see AMI Versions Supported in Amazon EMR (p. 294).
Pig 0.9.2 (April 9, 2012): Amazon Elastic MapReduce supports Pig 0.9.2. Pig 0.9.2 adds support for user-defined functions written in Python and other improvements. For more information, go to Pig 0.9.2 Patches.
Pig versioning (April 9, 2012): Amazon Elastic MapReduce supports the ability to specify the Pig version when launching a job flow. For more information, go to Pig Configuration.
Hive 0.7.1.4 (April 9, 2012): Amazon Elastic MapReduce supports Hive 0.7.1.4. For more information, go to Hive Configuration.
AMI 1.0.1 (April 3, 2012): Updates sources.list to the new location of the Lenny distribution in archive.debian.org.
Hive 0.7.1.3 (March 13, 2012): Support for new version of Hive, version 0.7.1.3, which adds the dynamodb.retry.duration variable which you can use to configure the timeout duration for retrying Hive queries. This version of Hive also supports setting the Amazon DynamoDB endpoint from within the Hive command-line application.
Support for IAM in the console (February 28, 2012): Support for AWS Identity and Access Management (IAM) in the Amazon EMR console. Improvements for S3DistCp and support for Hive 0.7.1.2 are also included.
Support for CloudWatch Metrics (January 31, 2012): Support for monitoring job flow metrics and setting alarms on metrics.
Support for S3DistCp (January 19, 2012): Support for distributed copy using S3DistCp.
Support for Amazon DynamoDB (January 18, 2012): Support for exporting and querying data stored in Amazon DynamoDB.
AMI 2.0.2 and Hive 0.7.1.1 (January 17, 2012): Support for Amazon EMR AMI 2.0.2 and Hive 0.7.1.1.
Cluster Compute Eight Extra Large (cc2.8xlarge) (December 21, 2011): Support for Cluster Compute Eight Extra Large (cc2.8xlarge) instances in job flows.
Hadoop 0.20.205 (December 11, 2011): Support for Hadoop 0.20.205. For more information see Supported Hadoop Versions (p. 300).
Pig 0.9.1 (December 11, 2011): Support for Pig 0.9.1. For more information see Supported Pig Versions (p. 377).
AMI versioning (December 11, 2011): You can now specify which version of the Amazon EMR AMI to use to launch your job flow. All Amazon EC2 instances in the job flow will be initialized with the AMI version that you specify. For more information see Specify the Amazon EMR AMI Version (p. 290).
Amazon EMR job flows on Amazon Virtual Private Cloud (Amazon VPC) (December 11, 2011): You can now launch Amazon EMR job flows inside of your Amazon Virtual Private Cloud (Amazon VPC) for greater control over network configuration and access. For more information see Running Job Flows on an Amazon VPC (p. 381).
Spot Instances (August 19, 2011): Support for launching job flow instance groups as Spot Instances added. For more information see Lower Costs with Spot Instances (p. 141).
Hive 0.7.1 (July 25, 2011): Support for Hive 0.7.1 added. For more information see Supported Hive Versions (p. 349).
Termination Protection (April 14, 2011): Support for a new Termination Protection feature. For more information see Protect a Job Flow from Termination (p. 136).
Tagging (March 9, 2011): Support for Amazon EC2 tagging. For more information see Using Tagging (p. 136).
IAM Integration (February 21, 2011): Support for Amazon Identity and Access Management. For more information see AWS Identity and Access Management (IAM) (p. 14) and Configure User Permissions with IAM (p. 274).
Elastic IP Support (February 21, 2011): Support for Elastic IP addresses. For more information see Elastic IP Address (p. 13) and Using Elastic IP Addresses (p. 287).
Environment Configuration (February 21, 2011): Expanded sections on Environment Configuration and Performance Tuning. For more information see Performance Tuning (p. 381) and Configure Amazon EMR (p. 274).
Distributed Cache (February 21, 2011): For more information on using DistributedCache to upload files and libraries, see Using Distributed Cache (p. 104).
How to build modules using Amazon Elastic MapReduce (Amazon EMR) (February 21, 2011): For more information see Building Binaries Using Amazon EMR (p. 131).
Comparison of job flow types (February 21, 2011): For more information see Appendix: Compare Job Flow Types (p. 389).
Amazon S3 multipart upload (January 6, 2010): Support of Amazon S3 multipart upload through the AWS Java SDK. For more information see Multipart Upload (p. 343).
Hive 0.70 (December 8, 2010): Support for Hive 0.70 and concurrent versions of Hive 0.5 and Hive 0.7 on same cluster. Note: You need to update the Elastic MapReduce Command Line Interface to resize running job flows and modify instance groups. For more information see Hive Configuration (p. 348).
JDBC Drivers for Hive (December 8, 2010): Support for JDBC with Hive 0.5 and Hive 0.7. For more information see Using the Hive JDBC Driver (p. 359).
Support HPC (November 14, 2010): Support for cluster compute instances. For more information see Amazon EC2 Instances (p. 11).
Bootstrap Actions (November 14, 2010): Expanded content and samples for bootstrap actions. For more information see Bootstrap Actions (p. 84).
Cascading job flows (November 14, 2010): Description of Cascading job flow support. For more information see How to Create a Cascading Job Flow (p. 56) and Cascading (p. 122).
Resize Running Job Flow (October 19, 2010): Support for resizing a running job flow. New node types task and core replace slave node. For more information see Architectural Overview of Amazon EMR (p. 3), Resizeable Running Job Flows (p. 5), and Resizing Running Job Flows (p. 96).
Appendix: Configuration Options (October 19, 2010): Expanded information on configuration options available in Amazon EMR. For more information, refer to Hadoop Configuration (p. 299).
Guide revision (October 19, 2010): This release features a reorganization of the Amazon EMR Developer Guide.


Index

bzip2, 345
gzip, 345
LZO, 345

A
add step to job flow, 79
additional libraries, 264
Amazon EC2, 11
Amazon EC2 instance types, 11
Amazon EC2 Instances, 11
Amazon EMR concepts, 6
Amazon S3, 14
Amazon S3 buckets, 14
Amazon S3 native file system, 10, 14
API Requests
  SDK, 263
architectural diagram, 3
architectural overview, 3
Args, 32, 110
arrested job flow, 100
arrested state, 100
availability zones, 14, 264
AWS concepts, 11

B
bootstrap actions, 4, 84
  custom, 91, 92
  predefined, 85
bucket names, 19
buckets, 14

C
cascading, 56
Cascading, 57, 389
cluster nodes, 9
cluster tuning, 381
command
  --active, 73, 343
  --alive, 74
  --ami-version, 290
  --bootstrap-action, 91
  --create, 23, 24, 32, 40, 49, 56, 57
  --details, 74
  --hadoop-version, 300, 349, 377
  --hive-script, 32
  --hive-site, 357
  --hive-versions, 349
  --list, 73, 343
  --pig-script, 40
  --steps, 79
  --stream, 24
  --terminate, 77
  -d, 355
concepts, AWS, 11
configuration, 4
configure Hadoop, 302
configure hadoop-user-env.sh, 302
core node, 9
create job flow, 23
custom JAR, 49
customer support, 391

D
data compression
  intermediate, 344
  output, 344
data security, 5
data storage, 4, 10
debug, 206
debug using log files, 194
debugging
  hadoop, 5
  step, 5
debugging job flow, 183
describe job flow, 72
distributed cache, 108
document history, 396
download log files, 196, 197

E
endpoints
  Europe, 264
  North America, 264

F
failures, 207
features, 2
file systems, 10, 14
FoxyProxy, 117

G
generating signatures, 267

H
Hadoop, 8, 129
  data compression, 344
  failures, 207
  process, 129
  user interface, 200
Hadoop 0.20.205
  patches, 347
Hadoop 1.0.3
  patches, 346
Hadoop configuration, 340
hadoop debugging, 5
HBase, 389
HDFS, 10
Hive, 5, 32, 389
  batch, 355
  data sharing, 353
  interactive, 355
  patches, 365
  versioning, 353
Hive version
  0.4, 348
  0.5, 348
  0.7, 348
  0.8.1, 348

I
IAM, 274, 280
instance types, 11
interfaces
  comparison, 15

J
job flow, 5, 6, 126
  add steps, 79
  cascading, 5
  create, 23, 130
  custom JAR, 5
  debug, 206
    job flow with steps, 206
    job flow without steps, 206
  details, 74
  download logs from Amazon S3, 196, 197
  Hive, 5, 355
  list, 72
  monitoring, 197, 206
  Pig, 5
  resizing, 5
  states, 72
  status using SSH, 199
  streaming, 5
  terminate, 77
job flow process
  example, 2
JSON files, 340

K
key pair, 13

L
libraries, additional, 264
list job flows, 72
log files, 183, 190, 194, 345
  directories, 191, 194
  download from Amazon S3, 196, 197
  step, 195
  used to debug, 194

M
MapReduce, 8
MapReduce process, 8
master node, 9
monitor job flow, 199
monitoring
  job flows, 197, 206
multipart upload, 14, 343

N
new features, 396
node, 9

P
performance tuning, 381
Pig, 40, 389
Pig 0.9.1
  patches, 380
Pig 0.9.2
  patches, 380
policies, 274
  examples, 277
predefined bootstrap actions, 85
  ConfigureDaemons, 85
  ConfigureHadoop, 85
  Memory-Intensive, 90
  RunIf, 91
  Shutdown, 91

Q
Query, 269

R
regions, 14
related resources, 391
Reserved Instances, 13
roles, 280

S
S3N (see Amazon S3 native file system)
security, 5
service overview, 2
signature, generating, 267
SimpleDB, 14
SSH, 5, 197, 199
  Hive, 355
  interactive mode, 355
  into master node, 9
step
  log files, 195
  states, 72
streaming, 24
  utility, 129
Streaming, 389

T
task, 203, 207
task node, 9
terminate job flow, 77
troubleshooting, 183

U
updates, 396
user interface
  Hadoop, 200

V
version, 396

W
what's new, 396

Z
zones, availability, 264
