Introduction to analyzing big data using Amazon Web Services
This tutorial accompanies the BARC seminar given at Whitehead on January 31, 2013. It contains instructions for:
1. Getting started with Amazon Web Services
2. Navigating S3 buckets from the command line
3. Creating and logging into an Amazon EC2 instance
4. Running the case study MapReduce job
This tutorial assumes you are working in a UNIX environment and are reasonably comfortable with using command line tools. Any commands that should be entered at the terminal will be denoted by Courier New font in a text box. Prompts are preceded by a "$", whereas command line output is not. For example:
$ echo "Hello World"
Hello World
Getting started with Amazon Web Services
Sign up for an AWS account
1. Go to http://aws.amazon.com/
2. Click "Sign up" and follow the steps. You will need to enter credit card information to use AWS, even if you only use free services. All steps presented in this tutorial besides the case study use free resources. The case study will cost around $10.
3. After you have signed up, log in to your account.
Go to the AWS Console
The AWS console is where you can access the user interface for all the Amazon Cloud services. To go to the console, either click "My Account/Console" -> "AWS Management Console", or go to the URL https://console.aws.amazon.com/console/home. This will bring you to a page looking something like this:
We will return to this page to start using each of the services covered here.
Navigating S3 buckets from the command line
S3cmd is a useful tool for interfacing with Amazon S3 from the command line. Here we will go over how to install and set it up.
Setting up s3cmd
You can download s3cmd using apt-get:
$ sudo apt-get install s3cmd
Before you can use s3cmd, you will need to configure it using your AWS credentials. To do this, run:
$ s3cmd --configure
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key:
This will prompt you to enter your access key. You can find this at the AWS console. From the console homepage, you will see your username in the top right corner. Click it to bring up a drop-down box, and go to the "Security Credentials" option. Scroll down until you see the heading "Access Credentials". Here you will find your access key (I have blanked out my personal key below, but you should see yours there):
Copy the key under "Access Key ID" and paste it into the command line prompt. It will now ask for your Secret key.
Access Key: XXXXXXXXX
Secret Key:
To get your secret key, click "Show" under "Secret Access Key". Copy this and paste it into the command line prompt. You will then be given a series of prompts. Press "Enter" at all of these to leave the default settings. Continue pressing "Enter" until the prompt:
Test access with supplied credentials? [Y/n]
Press "Y", then "Enter" to verify that everything worked. When asked if you want to save your settings, press "Y" again. This will save your credentials file at ~/.s3cfg. This file is required to run s3cmd.
S3cmd examples
Here we give several examples using s3cmd. To see the full range of options, type:
$ s3cmd --help
Example 1: Make a new s3 bucket, upload a file to the bucket, and view bucket contents. Replace "mgymrek" with your own identifier.
$ s3cmd mb s3://mgymrek-test-s3
Bucket 's3://mgymrek-test-s3/' created
$ echo "Hello world" > hello-world.txt
$ s3cmd put hello-world.txt s3://mgymrek-test-s3/
$ s3cmd ls s3://mgymrek-test-s3
2013-01-30 04:31        12   s3://mgymrek-test-s3/hello-world.txt
If you navigate to the S3 console, you will see your new bucket and can view its contents. CAVEAT: avoid using anything besides dashes "-" and lower case letters for paths in S3. Many hours of unhappy debugging can be saved by following this simple rule of
thumb. Never use "_" in a filename. For some reason this will break downstream steps!
Example 2: View and download data from a public repository. The 1000 Genomes data is available at the s3 bucket s3://1000genomes. Below we will view the contents of this directory and download a small file from it:
$ s3cmd ls s3://1000genomes/
                 DIR   s3://1000genomes/alignment_indices/
                 DIR   s3://1000genomes/changelog_details/
                 DIR   s3://1000genomes/data/
                 DIR   s3://1000genomes/phase1/
                 DIR   s3://1000genomes/pilot_data/
                 DIR   s3://1000genomes/release/
                 DIR   s3://1000genomes/sequence_indices/
                 DIR   s3://1000genomes/technical/
...
$ s3cmd get s3://1000genomes/README.alignment_data
s3://1000genomes/README.alignment_data -> ./README.alignment_data  [1 of 1]
 16244 of 16244   100% in    0s   280.73 kB/s  done
This downloads the README file documenting how samples were aligned to your local computer (the file itself is not important here; we just chose a small file as an example).
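s3cmd can also list and download whole directories recursively. As a sketch, the commands below browse one sample's alignment directory and pull it down locally (the exact layout under s3://1000genomes/ may change over time, and alignment directories contain multi-gigabyte BAM files, so only do this somewhere with room):
$ s3cmd ls s3://1000genomes/phase1/data/NA18499/alignment/
$ s3cmd get --recursive s3://1000genomes/phase1/data/NA18499/alignment/ local-alignments/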
Creating and logging into an Amazon EC2 instance
Here we will see how to start EC2 compute nodes from the AWS console, and how to connect to a virtual instance from the command line. Before beginning this step, go to the EC2 console at https://console.aws.amazon.com/ec2/v2/home?region=us-east-1.
Generate a key-pair
In order to log into any of your EC2 instances, you will need a key-pair that will be used to verify your identity. From the EC2 console, in the menu on the left hand side under "Network & Security", click the "Key Pairs" option. Then click the "Create Key Pair" button at the top of the page. First you will be asked to provide a name for your key-pair:
Clicking "Create" will generate a key-pair and automatically download a file <keypair-name>.pem. Store this file in a location you will remember. Mine is stored in ~/keys/mgymrek_key.pem. You will need to change the permissions of this file so it is readable only by you, or else you will run into problems when we try to use the key to SSH into an instance.
$ chmod 400 ~/keys/mgymrek_key.pem
Set up your default security group
To make sure you'll be able to log into your EC2 instances via SSH, we just need to make sure the security settings allow this. On the left hand menu of the EC2 console, go to "Security Groups" under the heading "Network & Security". Select "Default" from the list at the top, and then go to the "Inbound" tab on the bottom panel. For the option "Create a new rule", select "SSH". Then click "Add Rule" followed by "Apply Rule Changes". You should then be all set for the next step, launching the actual instance.
Launch an EC2 instance
Back at the EC2 console, at the menu on the left side click "EC2 Dashboard". Click the blue button "Launch Instance". This will bring up a pop-up screen. Select "Classic Wizard" and click "Continue".
Options marked with a star are eligible for free-tier pricing, which we'll use here. We will use the fourth option down "Ubuntu Server 12.04.1 LTS". Click "Select" next to that option.
This will bring you to the next set of options, "Instance Details". Here leave everything the same, except change the "Availability Zone" option to "us-east-1a". It is important to always use the same zone, as data transfer from one zone to another is charged, whereas transfer within the same zone is always free. Click "Continue".
We don’t need to set any advanced configuration, so hit “Continue” three times more until you get to the “Create Key Pair” step. Select “Choose from your existing Key Pairs” and make sure the key pair you just created is selected.
Click “Continue” to go to “Configure Firewall”. Select “Choose one or more of your existing Security Groups” and select the default group.
Hit "Continue" once more to review the settings for this instance. Everything should be all set, so click "Launch". This will bring you to a page listing all your existing EC2 instances. The instance can take a minute or two to start up, and may say "Pending". When "Status Checks" shows a green checkmark, you are ready to continue to the next step.
Log in and explore the EC2 instance
Selecting your new instance at the console will bring up information in the bottom panel about that instance. At the top of the bottom panel you will find the public DNS address, which you can use to SSH into your instance. This is shown highlighted below:
We can now SSH into this instance, using the key-pair that we generated earlier.
$ ssh -i ~/keys/mgymrek_key.pem ubuntu@<public-dns>
The "-i" argument takes the location where you stored your private key. We log in with the user name "ubuntu". Answer "yes" at the prompt "Are you sure you want to continue connecting". This will log you into your "micro" EC2 instance. From here you can do just about anything you can do from the command line on any Ubuntu machine. Below are a couple of simple examples. Note that the boxes are shaded a different color to denote that you are now entering commands on the EC2 instance, rather than on your own machine.
Example 1: Look at the default storage.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  773M  6.8G  11% /
udev            288M  8.0K  288M   1% /dev
tmpfs           119M  160K  118M   1% /run
So we have about 7 GB to use on this machine. It is pretty small; that's why it's called a "micro" instance, and that's why these are free to use!
Example 2: Install software
Using "sudo apt-get install" requires first running "sudo apt-get update" to update repository information. In this example, we update the package lists, then install R.
$ sudo apt-get update
$ sudo apt-get install r-base-core
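To double-check that R installed correctly, you can ask for its version or run a one-line expression (a quick sanity check on the instance; not part of the workflow below):
$ R --version
$ Rscript -e 'summary(rnorm(100))'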
Example 3: Transfer data (for free) from an S3 bucket.
Data transfer between AWS services is free, as long as it stays in the same geographic region. This tutorial assumes we are working in the zone "us-east-1a". Using s3cmd on an instance requires installing s3cmd and getting your credentials file onto the instance. This will come up again in the next section when we run a MapReduce job and need to call s3cmd. First upload your credentials file from your computer to the instance (run this from your own machine):
$ scp -i ~/keys/mgymrek_key.pem /home/mgymrek/.s3cfg ubuntu@<public-dns>:/home/ubuntu/.s3cfg
Then install s3cmd on the instance, and transfer the test file we made earlier:
$ sudo apt-get install s3cmd
$ s3cmd get s3://mgymrek-test-s3/hello-world.txt
$ ls -l
-rw-rw-r-- 1 ubuntu ubuntu 12 Jan 30 05:41 hello-world.txt
Terminate the EC2 instance
When you are done running the instance, you should terminate it. At the console, right click on your instance, and select "Terminate".
Running the case study MapReduce job
In this simple Elastic MapReduce (EMR) example we will run SNP calling on the genomes of ten individuals from the 1000 Genomes Project (only on chromosome 20 to keep the example small). All examples here use the s3 bucket s3://mgymrek-barc-example.
Install and configure the elastic-mapreduce command line tool
It is possible to run EMR jobs from the AWS console, but there are great command line tools that give you more flexibility and are quite easy to use. First download the elastic-mapreduce tools, which are based on Ruby (you will need to have Ruby installed):
$ sudo apt-get install ruby-full
$ wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip -P emr/
$ unzip emr/elastic-mapreduce-ruby.zip -d emr/
To configure elastic-mapreduce, you will need to create a file with your credentials in the same directory where you unzipped the software. Create a file credentials.json there with the following contents (note the orange color will denote the contents of a file):
{
  "access-id": "XXX",
  "private-key": "XXX",
  "key-pair": "mgymrek_key",
  "key-pair-file": "/home/mgymrek/keys/mgymrek_key.pem",
  "log-uri": "s3://mgymrek-test-s3/log"
}
Here "access-id" and "private-key" are the access key and secret key we went over earlier, "key-pair" is the name of the key created above, and "key-pair-file" is the path to the ".pem" file containing that key. "log-uri" is a location in one of your s3 buckets where EMR can store log files. To check that you have configured everything correctly, try:
$ cd emr
$ ./elastic-mapreduce
{ "access-id": "XXX", "private-key": "XXX", "key-pair": "mgymrek_key", "key-pair-file": "/home/mgymrek/keys/mgymrek_key.pem", "log-uri": "s3://mgymrek-test-s3/log", }
If everything is configured correctly, you should see a long help message with all the options for this tool. If something is wrong, you will get an error message complaining about your access credentials. Make sure credentials.json is in the emr directory.
Prepare inputs: list of sample IDs
The input to an EMR job consists of one line per map task. In this example, each map task will run SNP calling on a single genome, so our input file will have one line with the accession of each genome to process. We create a file called genomeids.txt:
NA18499
NA18501
NA18502
NA18504
NA18505
NA18507
NA18508
NA18510
NA18511
NA18516
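The ten accessions above were picked for this example. If you want to see which samples are available yourself, one rough way is to list the per-sample directories in the 1000 Genomes bucket and strip the paths down to IDs (the sed parsing below is illustrative; you would still want to confirm that chromosome 20 low-coverage BAMs exist for each sample you pick):
$ s3cmd ls s3://1000genomes/phase1/data/ | sed 's|.*/data/||; s|/$||' > all-sample-ids.txt
$ head all-sample-ids.txt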
Bootstrapping: install software on each mapper
Each map instance in an EMR task is basically a fresh node, with nothing you need already installed on it. Any data, software, or general configuration each mapper needs in order to complete its map tasks can be set up using something called a bootstrap script. This script runs when a map instance is started, before it processes any map tasks. In our case, we will want to do the following:
• Download our s3cfg file so we can use s3cmd. In this example, the s3cfg file is transferred from an s3 bucket. This is probably not very secure and there may be a better way to do this.
• Install all the software we’ll need for SNP calling. We will perform SNP calling using VarScan. For this, we’ll need to install VarScan, java, and samtools.
• Create directories where we’ll store different data files. The storage space on the mapper nodes can be accessed in the /mnt directory. We will create all data directories here.
• Download and unzip the human reference genome. We’ll need this for the steps required for SNP calling.
We put all of these steps into a bash script named download-snptools.sh that will run on startup:
#!/bin/bash
set -e

# Download s3cfg file
wget -S -T 10 -t 5 http://s3.amazonaws.com/mgymrek-barc-example/misc/.s3cfg
sudo mv .s3cfg /mnt/

# Install Java, s3cmd, and samtools
sudo apt-get update
sudo apt-get install -y default-jre s3cmd samtools

# Transfer VarScan from S3
sudo s3cmd -c /mnt/.s3cfg get s3://mgymrek-barc-example/tools/VarScan.v2.3.3.jar /mnt/VarScan.v2.3.3.jar

# Make directories to store data
sudo mkdir /mnt/alignments; sudo mkdir /mnt/genome; sudo mkdir /mnt/varscan
sudo chmod -R 777 /mnt/alignments/
sudo chmod -R 777 /mnt/genome/
sudo chmod -R 777 /mnt/varscan/

# Download and unzip reference genome
sudo s3cmd -c /mnt/.s3cfg get s3://mgymrek-barc-example/human_g1k_v37.fasta.gz
mv human_g1k_v37.fasta.gz /mnt/genome/
gunzip /mnt/genome/human_g1k_v37.fasta.gz
If you run this example yourself, note that you will need to change the path to the s3cfg file to where you have it stored. Anywhere that you can access using wget is fine. I have removed my file from this location for security reasons. The VarScan jar file and reference genome are still at the s3 buckets referenced above, so you are welcome to download them from there.
Create the mapper
Mappers follow a basic structure: the mapper reads from standard input, each line of input is a separate task, and when no lines of input are left, it terminates. We will write the mapper in Python, but it can be in any language (as long as it is either supported by default on the map instances, or you install it during the bootstrap stage). Here our mapper will take a genome accession from 1000 Genomes as each line of input. A single task will then consist of calling SNPs in that genome using VarScan. The mapper will need to do the following for each task:
• Download the BAM files for that genome from the 1000 Genomes bucket.
• Call samtools and VarScan for SNP calling.
• Upload the results to S3.
The code for this mapper, which is in the file snpcall-mapper.py, is below:
#!/usr/bin/python
import sys
import os

S3_ONEKGDATAPATH = "s3://1000genomes/phase1/data"
S3_VARSCANPATH = "s3://mgymrek-barc-example/varscan"
ALIGNPATH = "/mnt/alignments"
PILEUPPATH = "/mnt/pileups"
VARSCANPATH = "/mnt/varscan"
GENOMEPATH = "/mnt/genome/human_g1k_v37.fasta"
S3CONFIG = "/mnt/.s3cfg"

for line in sys.stdin:
    sample = line.strip()
    # download BAM alignment
    bamfile = "%s.chrom20.ILLUMINA.bwa.YRI.low_coverage.20101123.bam" % sample
    cmd = "s3cmd -c %s get %s/%s/alignment/%s %s/%s" % (S3CONFIG, S3_ONEKGDATAPATH, sample, bamfile, ALIGNPATH, bamfile)
    os.system(cmd)
    # Create pileup and run VarScan
    resultsfile = "%s/%s.varscan" % (VARSCANPATH, sample)
    cmd = "samtools mpileup -f %s %s/%s | java -jar /mnt/VarScan.v2.3.3.jar mpileup2snp > %s" % (GENOMEPATH, ALIGNPATH, bamfile, resultsfile)
    os.system(cmd)
    # Upload results to s3
    cmd = "s3cmd -c %s put %s %s/%s.varscan" % (S3CONFIG, resultsfile, S3_VARSCANPATH, sample)
    os.system(cmd)
One caveat about the mapper is that you should avoid printing any output to standard out. Anything printed there is assumed to be input to the reducer. If you want to print debugging messages, be sure to write them to standard error instead; you can then view them in the logs later to debug.
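For example, a logging line like the following could be added inside the mapper's for loop (illustrative; not part of the script above):
import sys
# Goes to standard error, so it shows up in the task logs
# instead of being treated as mapper output for a reducer.
sys.stderr.write("Processing sample %s\n" % sample)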
Upload data and scripts to s3
The last step before actually running the EMR job is to put the inputs, mapper, and bootstrap script into s3. We will also need to make sure that the permissions and metadata for each are set correctly, which we can do from the S3 console.
Inputs
$ s3cmd put genomeids.txt s3://mgymrek-barc-example/inputs/genomeids.txt
Mapper
$ s3cmd put snpcall-mapper.py s3://mgymrek-barc-example/scripts/snpcall-mapper.py
S3 will automatically recognize that this is a Python script because of the header line. To check this, you can navigate to your mapper script from the S3 console. Select the mapper, right click and select "Properties", then select the "Metadata" tab on the right. You should see:
Bootstrap script
$ s3cmd put download-snptools.sh s3://mgymrek-barc-example/scripts/download-snptools.sh
Again, S3 will recognize the file type automatically. But we will have to set the permissions for the bootstrap script so that the mappers are allowed to open it. In the S3 console, navigate to the bootstrap script. Right click, select “Properties”, and then select the “Permissions” tab. Click “Add more permissions”. Set “Grantee: Everyone” and select “Open/download”. Then click “Save”.
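If you prefer to stay at the command line, s3cmd can set the same public-read permission on the object; a sketch of the equivalent step (this should match the "Grantee: Everyone" / "Open/download" setting above):
$ s3cmd setacl --acl-public s3://mgymrek-barc-example/scripts/download-snptools.sh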
Run the EMR job!
We will run the job from the command line. There are quite a few options we will need to set. First we'll show the command, which is a bit intimidating. Then we'll go through what each of those options means.
$ ./emr/elastic-mapreduce --create --stream --alive \
  --name snpcalls \
  --num-instances 3 \
  --slave-instance-type m1.medium \
  --master-instance-type m1.small \
  --availability-zone us-east-1a \
  --input s3n://mgymrek-barc-example/inputs/ \
  --mapper s3://mgymrek-barc-example/scripts/snpcall-mapper.py \
  --output s3n://mgymrek-barc-example/logs012913 \
  --bootstrap-action s3://mgymrek-barc-example/scripts/download-snptools.sh \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,mapred.map.tasks.speculative.execution=false,-s,mapred.tasktracker.map.tasks.maximum=1,-s,mapred.map.tasks=1,-s,mapred.reduce.tasks=0,-s,mapred.tasktracker.reduce.tasks.maximum=0"
Name: a unique identifier to reference this specific EMR job.
Num-instances: the total number of compute nodes used in the job. This includes the master and slaves, so it needs to be at least two to allow for one master and one slave. Here we are doing a small example, so we set it to three. On big jobs, you can set this to tens or even hundreds of mappers to make your jobs extremely parallelized.
Slave-instance-type: what type of EC2 node to use for the slave. This depends on the memory and space requirements of your map job. Here we don't need that much RAM, and we only need enough space to process a single genome at a time, so m1.medium is enough. (It has 3+ GB RAM, enough to handle loading the human genome.)
Master-instance-type: what type of EC2 node to use for the master. Usually the master is not doing anything that intense, it just distributes jobs. The smallest type of instance you are allowed to use is m1.small, so that's what we use here.
Availability-zone: as mentioned above, to keep S3 data transfer free, always specify the same zone. A good default is to just always use us-east-1a.
Input: specify the location in S3 from where EMR should read the input. This is always a folder, and EMR will read from any file that is in this folder. The "s3n://" instead of "s3://" prefix is used to specify something that will be read from standard input. Always use the s3n prefix for input to EMR.
Mapper: specify the location in S3 to find the mapper script. Again, the mapper script is an executable file that processes one line of standard input at a time. Beyond that requirement, you can basically write whatever you want into the mapper script.
Output: specify a location in S3 where EMR can write any output from this job. It includes any log messages, and any output from the reducer if you use a reducer. Caveat: this location must be unique for each EMR job. You must specify a folder that does not already exist, or else EMR will complain.
Bootstrap-action: specify bootstrap scripts to run upon startup of each map node. You can specify as many bootstrap actions as you want. Here we specified two of them:
1. The first is our custom bootstrap script we defined earlier to download all necessary software and data.
2. The second is one of the bootstrap scripts already available from Amazon. It allows configuring Hadoop, which is the framework behind all the MapReduce jobs. The arguments we set look kind of cryptic, but they are all there to ensure that we don't have any reducers, and that each mapper runs only one task at a time so we don't run out of space. A full set of the arguments you can set is here: http://hadoop.apache.org/docs/r1.0.0/mapred-default.html. This page is very useful for any kind of advanced configuration of Hadoop for the specific needs of your jobs.
Ok, now we are finally ready to run the EMR job! Simply run the above command at the command line. This command is in the shell script file run-EMR-example.sh.
After you run the command above, you should see a message that a job flow was created. Cross your fingers before moving on.
$ sh run-EMR-example.sh
Created job flow j-2XFP2BJELRIL0
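Because we passed --alive, the cluster stays up (and keeps costing money) even after all map tasks finish, so remember to shut it down when you are done. A sketch of checking on and terminating job flows with the same tool (using the job flow ID printed above):
$ ./emr/elastic-mapreduce --list --active
$ ./emr/elastic-mapreduce --terminate -j j-2XFP2BJELRIL0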
Now, go to the AWS console. Navigate to the "Elastic MapReduce" console and you should see a job listed. At first its state will say "Starting".
You can select the job and view its properties in the tabs below. For instance, the "Steps" tab will tell you which steps have been completed so far. Eventually, if all is good, the state will change to "Bootstrapping" while the bootstrap steps are running.
Finally, the status will change to “Running”, indicating that your EMR job is off and running!
View EMR progress
You can view the progress of your EMR job at the console. At the bottom panel, select the "Monitoring" tab. This will show you graphs of all kinds of helpful and fun information, such as how many jobs are running, how many are remaining, how many mappers are working at the moment, etc.
Examine outputs
Output of our map tasks will be in the directory we specified, which in this example was s3://mgymrek-barc-example/varscan/. We can look at what files are there using s3cmd (or on the S3 console):
$ s3cmd ls s3://mgymrek-barc-example/varscan/
2013-01-31 03:05    318566   s3://mgymrek-barc-example/varscan/NA18499.varscan
We can see that only one sample has finished so far. We can download this file and view the SNP calls:
$ s3cmd get s3://mgymrek-barc-example/varscan/NA18499.varscan .
$ head NA18499.varscan | cut -f 1,2,3,4
Chrom   Position   Ref   Var
20      102441     T     C
20      181967     A     C
20      192514     C     T
20      207923     C     T
20      208168     T     C
20      222417     T     C
20      227246     A     G
20      239697     G     C
20      253772     C     T
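Once all ten samples have finished, you can pull the whole results directory down in one go instead of file by file; a sketch using s3cmd's recursive get:
$ s3cmd get --recursive s3://mgymrek-barc-example/varscan/ varscan-results/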
You can either transfer all the files to your local computer for downstream analysis, or do more computing with them on the cloud!
The sequel: debugging EMR jobs
A lot of things can go wrong during MapReduce jobs, and debugging these is a whole other tutorial. Here are some general tips:
• You can log onto individual slave instances and look into the logs to get an idea if something went wrong. To do so, go to your EC2 console, find the slaves that are running, and SSH into them. You will need to use the same key-pair you've been using, and this time log in with username "hadoop" instead of "ubuntu".
$ ssh -i ~/keys/mgymrek_key.pem hadoop@<public-dns>
• Navigate to the logs, which are stored in /mnt/var/log/. From there you can view the standard output and standard error of the bootstrap and mapper scripts. Logs for the bootstrap action are in the "bootstrap-actions" folder, and logs for map tasks are in the "hadoop" folder.
$ ls /mnt/var/log/
bootstrap-actions  hadoop  instance-controller  instance-state  service-nanny
• At the EMR console, after selecting your job, the bottom tab will usually give informative status messages, such as the last state change or why something didn’t work. For example:
This error message tells us that the bootstrap action failed on the master. Then to figure out the problem, we can SSH into the master node, go to the logs for the bootstrap action, and likely find some helpful error message (a concrete sketch of this appears at the end of these tips).
• Make sure your slave instance type is big enough for the job. If you find you start an EMR job but can't SSH into the slave even though the job says it's running, it may be running out of memory and stalling. Try bumping up the RAM.
• If all else fails, there is a ton of documentation scattered on the internet. Googling problems tends to work.
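To make the log-digging tips above concrete, here is a rough sketch of inspecting a failed bootstrap action on the master node. The numbered subdirectory and stderr file are assumptions about how EMR lays out /mnt/var/log/bootstrap-actions; list the directory first to see what is actually there:
$ ssh -i ~/keys/mgymrek_key.pem hadoop@<master-public-dns>
$ ls /mnt/var/log/bootstrap-actions/
$ less /mnt/var/log/bootstrap-actions/1/stderr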