

Introduction to analyzing big data using Amazon Web Services

This tutorial accompanies the BARC seminar given at Whitehead on January 31, 2013. It contains instructions for:

1. Getting started with Amazon Web Services
2. Navigating S3 buckets from the command line
3. Creating and logging into an Amazon EC2 instance
4. Running the case study map reduce job

This tutorial assumes you are working in a UNIX environment and are reasonably comfortable with using command line tools. Any commands that should be entered at the terminal will be denoted by Courier New font in a text box. Prompts are preceded by a "$", whereas command line output is not. For example:

$ echo "Hello World"
Hello World

Getting started with Amazon Web Services

Sign up for an AWS account

1. Go to http://aws.amazon.com/
2. Click "Sign up" and follow the steps. You will need to enter credit card information to use AWS, even if you only use free services. All steps presented in this tutorial besides the case study use free resources. The case study will cost around $10.
3. After you have signed up, log in to your account.

Go to the AWS Console

The AWS console is where you can access the user interface for all the Amazon Cloud services. To go to the console, either click "My Account/Console" > "AWS Management Console", or go to the URL https://console.aws.amazon.com/console/home. This will bring you to a page that looks something like this:


We will return to this page to start using each of the services covered here.


Navigating S3 buckets from the command line

S3cmd is a useful tool for interfacing with Amazon S3 from the command line. Here we will go over how to install and set it up.

Setting up s3cmd

You can download s3cmd using apt-get:

$ sudo apt-get install s3cmd

Before you can use s3cmd, you will need to configure it using your AWS credentials. To do this, run:

$ s3cmd --configure
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.
Access key and Secret key are your identifiers for Amazon S3
Access Key:

This will prompt you to enter your access key. You can find this at the AWS console. From the console homepage, you will see your username in the top right corner. Click it to bring up a drop-down box, and go to the "Security Credentials" option. Scroll down until you see the heading "Access Credentials". Here you will find your access key (I have blanked out my personal key below, but you should see yours there):


Copy the key under "Access Key ID" and paste it into the command line prompt. It will now ask for your Secret key.

Access Key: XXXXXXXXX
Secret Key:

To get your secret key, click "Show" under "Secret Access Key". Copy this and paste it into the command line prompt. You will then be given a series of prompts. Press "Enter" at all of these to leave the default settings. Continue pressing "Enter" until the prompt:

Test access with supplied credentials? [Y/n]

Press "Y", then "Enter" to verify that everything worked. When asked if you want to save your settings, press "Y" again. This will save your credentials file at ~/.s3cfg. This file is required to run s3cmd.

S3cmd examples

Here we give several examples using s3cmd. To see the full range of options, type:

$ s3cmd --help

Example 1: Make a new S3 bucket, upload a file to the bucket, and view the bucket contents. Replace "mgymrek" with your own identifier.

$ s3cmd mb s3://mgymrek-test-s3
Bucket 's3://mgymrek-test-s3/' created
$ echo "Hello world" > hello-world.txt
$ s3cmd put hello-world.txt s3://mgymrek-test-s3/
$ s3cmd ls s3://mgymrek-test-s3
2013-01-30 04:31        12   s3://mgymrek-test-s3/hello-world.txt

If you navigate to the S3 console, you will see your new bucket and can view its contents.

CAVEAT: avoid using anything besides dashes "-" and lower case letters for paths in S3. Many hours of unhappy debugging can be saved by following this simple rule of thumb. Never use "_" in a filename. For some reason this will break downstream steps!
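If you are scripting uploads, one simple guard is to sanitize names before they reach S3. This is just a sketch of our own convention (not an s3cmd feature): tr lower-cases the name and turns underscores into dashes.

$ name="My_Test_File"
$ clean=$(echo "$name" | tr 'A-Z_' 'a-z-')
$ echo "$clean"
my-test-file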


Example 2: View and download data from a public repository. The 1000 Genomes data is available at the S3 bucket s3://1000genomes. Below we view the contents of this directory and download a small file from it:

$ s3cmd ls s3://1000genomes/
                       DIR   s3://1000genomes/alignment_indices/
                       DIR   s3://1000genomes/changelog_details/
                       DIR   s3://1000genomes/data/
                       DIR   s3://1000genomes/phase1/
                       DIR   s3://1000genomes/pilot_data/
                       DIR   s3://1000genomes/release/
                       DIR   s3://1000genomes/sequence_indices/
                       DIR   s3://1000genomes/technical/
. . .
$ s3cmd get s3://1000genomes/README.alignment_data
s3://1000genomes/README.alignment_data -> ./README.alignment_data  [1 of 1]
 16244 of 16244   100% in    0s   280.73 kB/s  done

This downloaded the README file documenting how samples were aligned to your local computer (the actual file is not important here; we just chose a small file to download as an example).
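Listing works at any depth of the bucket. For example, you can peek at the alignment files for one of the samples we will use in the case study later (output omitted here; this is just a sketch of the kind of exploration you can do):

$ s3cmd ls s3://1000genomes/phase1/data/NA18499/alignment/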


Creating and logging into an Amazon EC2 instance

Here we will see how to start EC2 compute nodes from the AWS console, and how to connect to a virtual instance from the command line. Before beginning this step, go to the EC2 console at https://console.aws.amazon.com/ec2/v2/home?region=us-east-1.

Generate a key-pair

In order to log into any of your EC2 instances, you will need a key-pair that will be used to verify your identity. From the EC2 console, in the menu on the left hand side under "Network & Security", click the "Key Pairs" option. Then click the "Create Key Pair" button at the top of the page. First you will be asked to provide a name for your key-pair.

Clicking "Create" will generate a key-pair and automatically download a file <keypair-name>.pem. Store this file in a location you will remember. Mine is stored in ~/keys/mgymrek_key.pem. You will need to change the permissions of this file so that it is only readable by you, or else you will run into problems when we try to use the key to SSH into an instance.

$ chmod 400 ~/keys/mgymrek_key.pem

Set up your default security group

To make sure you'll be able to log into your EC2 instances over SSH, we just need to make sure the security settings allow this. On the left hand menu of the EC2 console, go to "Security Groups" under the heading "Network & Security". Select "Default" from the list at the top, and then go to the "Inbound" tab on the bottom panel. For the option "Create a new rule", select "SSH". Then click "Add Rule" followed by "Apply Rule Changes". You should then be all set for the next step, launching the actual instance.


Launch an EC2 instance

Back at the EC2 console, in the menu on the left side click "EC2 Dashboard". Click the blue "Launch Instance" button. This will bring up a pop-up screen. Select "Classic Wizard" and click "Continue".

Options marked with a star are eligible for free-tier pricing, which we'll use here. We will use the fourth option down, "Ubuntu Server 12.04.1 LTS". Click "Select" next to that option.


This will bring you to the next set of options, "Instance Details". Here leave everything the same, except change the "Availability Zone" option to "us-east-1a". It is important to always use the same zone, as data transfer from one zone to another is charged, whereas transfer within the same zone is free. Click "Continue".

We don't need to set any advanced configuration, so hit "Continue" three more times until you get to the "Create Key Pair" step. Select "Choose from your existing Key Pairs" and make sure the key pair you just created is selected.

Click "Continue" to go to "Configure Firewall". Select "Choose one or more of your existing Security Groups" and select the default group.


Hit "Continue" once more to review the settings for this instance. Everything should be all set, so click "Launch". This will bring you to a page listing all your existing EC2 instances. The instance can take a minute or two to start up, and may say "Pending". When "Status Checks" shows a green checkmark, you are ready to continue to the next step.

Log in and explore the EC2 instance

Selecting your new instance at the console will bring up information about that instance in the bottom panel. At the top of the bottom panel you will find the public DNS address, which you can use to SSH into your instance. We can now SSH into this instance, using the key-pair that we generated earlier.


$ ssh -i ~/keys/mgymrek_key.pem ubuntu@<instance-public-DNS>

The "-i" argument takes the location where you stored your private key file. We log in with the user name "ubuntu". Answer "yes" at the prompt "Are you sure you want to continue connecting". This will log you into your "micro" EC2 instance. From here you can do just about anything you can do from the command line on any Ubuntu machine. Below are a couple of simple examples. Note that the following commands are entered on the EC2 instance, rather than on your own machine.

Example 1: Look at the default storage.

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  773M  6.8G  11% /
udev            288M  8.0K  288M   1% /dev
tmpfs           119M  160K  118M   1% /run

So we have about 7GB we can use on this machine. It is pretty small; that's why it's called a "micro" instance, and that's why these are free to use!

Example 2: Install software.

Using "sudo apt-get install" requires first running "sudo apt-get update" to update repository information. In this example, we update the package lists, then install R.

$ sudo apt-get update
$ sudo apt-get install r-base-core

Example 3: Transfer data (for free) from an S3 bucket.

Data transfer between AWS services is free, as long as it stays in the same geographic region. This tutorial assumes we are working in the zone "us-east-1a". Using s3cmd on an instance requires installing s3cmd and getting your credentials file onto the instance. This will come up again in the next section when we run a MapReduce job and need to call s3cmd. First upload your credentials file from your computer to the instance:


$ scp -i ~/keys/mgymrek_key.pem /home/mgymrek/.s3cfg ubuntu@<instance-public-DNS>:/home/ubuntu/.s3cfg

Then install s3cmd on the instance, and transfer the test file we made earlier:

$ sudo apt-get install s3cmd
$ s3cmd get s3://mgymrek-test-s3/hello-world.txt
$ ls -l
-rw-rw-r-- 1 ubuntu ubuntu 12 Jan 30 05:41 hello-world.txt

Terminate the EC2 instance

When you are done running the instance, you should terminate it. At the console, right click on your instance and select "Terminate".
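Keep in mind that, by default, anything stored on the instance is lost once it is terminated, so copy off any files you want to keep first. A sketch using scp from your own machine (results.txt is a hypothetical file standing in for whatever you produced on the instance):

$ scp -i ~/keys/mgymrek_key.pem ubuntu@<instance-public-DNS>:/home/ubuntu/results.txt .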


Running the case study map reduce job

In this simple Elastic MapReduce (EMR) example we will run SNP calling on the genomes of ten individuals from the 1000 Genomes Project (only on chromosome 20, to keep the example small). All examples here use the S3 bucket s3://mgymrek-barc-example.

Install and configure the elastic-mapreduce command line tool

It is possible to run EMR jobs from the AWS console, but there are great command line tools that give you more flexibility and are quite easy to use. First download the elastic-mapreduce tools, which are based on ruby (you will need to have ruby installed):

$ sudo apt-get install ruby-full
$ wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
$ unzip elastic-mapreduce-ruby.zip -d emr/

To configure elastic-mapreduce, you will need to create a file with your credentials in the same directory where you unzipped the software. Create a file credentials.json there with the following contents (boxes like the one below denote the contents of a file rather than terminal commands):

{
  "access-id": "XXX",
  "private-key": "XXX",
  "key-pair": "mgymrek_key",
  "key-pair-file": "/home/mgymrek/keys/mgymrek_key.pem",
  "log-uri": "s3://mgymrek-test-s3/log"
}

Here "access-id" and "private-key" are the access key and secret key we went over earlier. "key-pair" is the name of the key created above, and "key-pair-file" is the path to the ".pem" file containing that key. "log-uri" is a location in one of your S3 buckets where EMR can store log files. To check that you have configured everything correctly, try:


$ cd emr
$ ./elastic-mapreduce

If everything is configured correctly, you should see a long help message with all the options for this tool. If something is wrong, you will get an error message complaining about your access credentials. Make sure credentials.json is in the emr directory.

Prepare inputs: list of sample IDs

The input to an EMR job contains one line per map task. In this example, each map task will run SNP calling on a single genome, so our input file will have one line with the accession of each genome to process. We create a file called genomeids.txt:

NA18499
NA18501
NA18502
NA18504
NA18505
NA18507
NA18508
NA18510
NA18511
NA18516

Bootstrapping: install software on each mapper

Each map instance in an EMR task is basically a fresh node with nothing you need already installed on it. Any data, software, or general configuration each mapper needs in order to complete its map tasks can be set up using a bootstrap script. This script runs when a map instance is started, before it processes any map tasks. In our case, we will want to do the following:


• Download our s3cfg file so we can use s3cmd. In this example, the s3cfg file is transferred from an S3 bucket. This is probably not very secure and there may be a better way to do this.

• Install all the software we'll need for SNP calling. We will perform SNP calling using VarScan. For this, we'll need to install VarScan, java, and samtools.

• Create directories where we'll store different data files. The storage space on the mapper nodes can be accessed in the /mnt directory. We will create all data directories here.

• Download and unzip the human reference genome. We'll need this for the steps required for SNP calling.

We put all of these steps into a bash script named download-snptools.sh that will run on startup:

#!/bin/bash
set -e
# Download s3cfg file
wget -S -T 10 -t 5 http://s3.amazonaws.com/mgymrek-barc-example/misc/.s3cfg
sudo mv .s3cfg /mnt/
# Install Java, s3cmd, and samtools
sudo apt-get update
sudo apt-get install -y default-jre s3cmd samtools
# Transfer VarScan from S3
sudo s3cmd -c /mnt/.s3cfg get s3://mgymrek-barc-example/tools/VarScan.v2.3.3.jar /mnt/VarScan.v2.3.3.jar
# Make directories to store data
sudo mkdir /mnt/alignments; sudo mkdir /mnt/genome; sudo mkdir /mnt/varscan
sudo chmod -R 777 /mnt/alignments/
sudo chmod -R 777 /mnt/genome/
sudo chmod -R 777 /mnt/varscan/
# Download and unzip reference genome
sudo s3cmd -c /mnt/.s3cfg get s3://mgymrek-barc-example/human_g1k_v37.fasta.gz
mv human_g1k_v37.fasta.gz /mnt/genome/
gunzip /mnt/genome/human_g1k_v37.fasta.gz

If you run this example yourself, note that you will need to change the path to the s3cfg file to wherever you have it stored; anywhere you can access using wget is fine. I have removed my file from this location for security reasons. The VarScan jar file and reference genome are still at the S3 buckets referenced above, so you are welcome to download them from there.
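Before handing the script to EMR, you can at least check it for shell syntax errors locally. This is only a parse check (bash -n does not execute any of the commands):

$ bash -n download-snptools.sh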


Create the mapper

Mappers follow a basic structure: the mapper reads from standard input, each line of standard input is a separate task, and when no lines of input are left, it terminates. We will write the mapper in python, but it can be in any language (as long as it is either supported by default on the map instances, or you install it during the bootstrap stage). Here each line of input is a genome accession from 1000 Genomes, and a single task consists of calling SNPs in that genome using VarScan. The mapper will need to do the following for each task:

• Download the bam files for that genome from the 1000 Genomes bucket.
• Call samtools and VarScan for SNP calling.
• Upload the results to S3.

The code for this mapper, which is in the file snpcall-mapper.py, is below:

#!/usr/bin/python
import sys
import os

S3_ONEKGDATAPATH = "s3://1000genomes/phase1/data"
S3_VARSCANPATH = "s3://mgymrek-barc-example/varscan"
ALIGNPATH = "/mnt/alignments"
PILEUPPATH = "/mnt/pileups"
VARSCANPATH = "/mnt/varscan"
GENOMEPATH = "/mnt/genome/human_g1k_v37.fasta"
S3CONFIG = "/mnt/.s3cfg"

for line in sys.stdin:
    sample = line.strip()
    # download BAM alignment
    bamfile = "%s.chrom20.ILLUMINA.bwa.YRI.low_coverage.20101123.bam" % sample
    cmd = "s3cmd -c %s get %s/%s/alignment/%s %s/%s" % (S3CONFIG, S3_ONEKGDATAPATH, sample, bamfile, ALIGNPATH, bamfile)
    os.system(cmd)
    # Create pileup and run VarScan
    resultsfile = "%s/%s.varscan" % (VARSCANPATH, sample)
    cmd = "samtools mpileup -f %s %s/%s | java -jar /mnt/VarScan.v2.3.3.jar mpileup2snp > %s" % (GENOMEPATH, ALIGNPATH, bamfile, resultsfile)
    os.system(cmd)
    # Upload results to s3
    cmd = "s3cmd -c %s put %s %s/%s.varscan" % (S3CONFIG, resultsfile, S3_VARSCANPATH, sample)
    os.system(cmd)

One caveat about the mapper is that you should avoid printing any output to standard out, because anything printed there is assumed to be input to the reducer. If you want to print debugging messages, be sure to write them to standard error instead; you can then view them in the logs later to debug.
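Because the mapper just reads accessions from standard input, you can exercise the streaming contract locally before submitting anything to EMR. This is only a sketch: the py_compile step catches syntax errors, and the piped run shows how standard output (reducer input) and standard error (log messages) are separated; the real pipeline will only work on a bootstrapped node with samtools, VarScan, and the /mnt directories in place.

$ python -m py_compile snpcall-mapper.py
$ echo "NA18499" | python snpcall-mapper.py > map-output.txt 2> map-errors.log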


Upload data and scripts to S3

The last step before actually running the EMR job is to put the inputs, mapper, and bootstrap script into S3. We will also need to make sure that the permissions and metadata for each are set correctly, which we can do from the S3 console.

Inputs

$ s3cmd put genomeids.txt s3://mgymrek-barc-example/inputs/genomeids.txt

Mapper

$ s3cmd put snpcall-mapper.py s3://mgymrek-barc-example/scripts/snpcall-mapper.py

S3 will automatically recognize that this is a python script because of the header line. To check this, navigate to your mapper script in the S3 console, right click it, select "Properties", and then select the "Metadata" tab on the right; the content type should show that it was detected as a python script.

Bootstrap script

$ s3cmd put download-snptools.sh s3://mgymrek-barc-example/scripts/download-snptools.sh

Again, S3 will recognize the file type automatically, but we have to set the permissions for the bootstrap script so that the mappers are allowed to open it. In the S3 console, navigate to the bootstrap script. Right click, select "Properties", and then select the "Permissions" tab. Click "Add more permissions", set "Grantee: Everyone", select "Open/Download", and then click "Save".
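Alternatively, s3cmd has a setacl command that can make an object publicly readable from the command line. Treat this as a sketch and check s3cmd --help for your version, since we only used the console method in this tutorial:

$ s3cmd setacl --acl-public s3://mgymrek-barc-example/scripts/download-snptools.sh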


Run the EMR job!

We will run the job from the command line. There are quite a few options we will need to set. First we'll show the command, which is a bit intimidating; then we'll go through what each of those options means.

$ ./emr/elastic-mapreduce --create --stream --alive \
--name snpcalls \
--num-instances 3 \
--slave-instance-type m1.medium \
--master-instance-type m1.small \
--availability-zone us-east-1a \
--input s3n://mgymrek-barc-example/inputs/ \
--mapper s3://mgymrek-barc-example/scripts/snpcall-mapper.py \
--output s3n://mgymrek-barc-example/logs012913 \
--bootstrap-action s3://mgymrek-barc-example/scripts/download-snptools.sh \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.map.tasks.speculative.execution=false,-s,mapred.tasktracker.map.tasks.maximum=1,-s,mapred.map.tasks=1,-s,mapred.reduce.tasks=0,-s,mapred.tasktracker.reduce.tasks.maximum=0"

Name: a unique identifier to reference this specific EMR job.

Num-instances: the total number of compute nodes used in the job. This includes the master and slaves, so it needs to be at least two to allow for one master and one slave. Here we are doing a small example, so we set it to three. On big jobs, you can set this to tens or even hundreds of mappers to make your jobs extremely parallelized.

Slave-instance-type: what type of EC2 node to use for the slaves. This depends on the memory and space requirements of your map job. Here we don't need that much RAM, and we only need enough space to process a single genome at a time, so m1.medium is enough (it has 3+ GB RAM, enough to handle loading the human genome).


Master-instance-type: what type of EC2 node to use for the master. Usually the master is not doing anything that intense; it just distributes jobs. The smallest type of instance you are allowed to use is m1.small, so that's what we use here.

Availability-zone: as mentioned above, to keep S3 data transfer free, always specify the same zone. A good default is to just use us-east-1a always.

Input: the location in S3 from which EMR should read the input. This is always a folder, and EMR will read from every file in this folder. The "s3n://" prefix (instead of "s3://") is used to specify something that will be read from standard input; always use the s3n prefix for input to EMR.

Mapper: the location in S3 of the mapper script. Again, the mapper script is an executable file that processes one line of standard input at a time. Beyond that requirement, you can basically write whatever you want into the mapper script.

Output: a location in S3 where EMR can write any output from this job. It includes any log messages, and any output from the reducer if you use a reducer. Caveat: this location must be unique for each EMR job. You must specify a folder that does not already exist, or else EMR will complain.

Bootstrap-action: bootstrap scripts to run on startup of each map node. You can specify as many bootstrap actions as you want. Here we specified two of them:

1. The first is the custom bootstrap script we defined earlier to download all necessary software and data.
2. The second is one of the bootstrap scripts already available from Amazon. It allows configuring hadoop, which is the framework behind all the mapreduce jobs. The arguments we set look kind of cryptic, but they are all there to ensure that we don't have any reducers, and that each mapper runs only one task at a time so we don't run out of space. A full list of the arguments you can set is here: http://hadoop.apache.org/docs/r1.0.0/mapred-default.html. This page is very useful for any kind of advanced configuration of hadoop for the specific needs of your jobs.
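To make the configure-hadoop arguments above a little less cryptic, here is a rough reading of each property (our own annotations; the properties themselves are standard Hadoop settings):

-s,mapred.map.tasks.speculative.execution=false   # don't launch duplicate copies of slow map tasks
-s,mapred.tasktracker.map.tasks.maximum=1         # at most one map task at a time per node
-s,mapred.map.tasks=1                             # hint for the number of map tasks to use
-s,mapred.reduce.tasks=0                          # no reduce phase; mapper output is final
-s,mapred.tasktracker.reduce.tasks.maximum=0      # no reduce slots on the nodes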

Ok, now we are finally ready to run the EMR job! Simply run the above command at the command line. This command is in the shell script file run-EMR-example.sh. After you run it, you should see a message that a job flow was created. Cross your fingers before moving on.

$ sh run-EMR-example.sh
Created job flow j-2XFP2BJELRIL0


Now, go to the AWS console. Navigate to the "Elastic MapReduce" console and you should see a job listed. At first its state will say "Starting".

You can select the job and view its properties in the tabs below. For instance, the "Steps" tab will tell you which steps have been completed so far. Eventually, if all is well, the state will change to "Bootstrapping" while the bootstrap steps are running.

Finally, the status will change to "Running", indicating that your EMR job is off and running!

View EMR progress

You can view the progress of your EMR job at the console. In the bottom panel, select the "Monitoring" tab. This will show you graphs of all kinds of helpful and fun information, such as how many jobs are running, how many are remaining, how many mappers are working at the moment, etc.
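If you prefer the command line, the elastic-mapreduce client can also list your job flows and their current state. This is a sketch; the exact flags can vary by version, so check ./elastic-mapreduce --help:

$ ./emr/elastic-mapreduce --list --active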


Examine outputs

Output from our map tasks will be in the S3 location the mapper uploads to, which in this example is s3://mgymrek-barc-example/varscan/. We can look at what files are there using s3cmd (or on the S3 console):

$ s3cmd ls s3://mgymrek-barc-example/varscan/
2013-01-31 03:05    318566   s3://mgymrek-barc-example/varscan/NA18499.varscan

We can see that only one sample has finished so far. We can download this file and view the SNP calls:

$ s3cmd get s3://mgymrek-barc-example/varscan/NA18499.varscan .
$ head NA18499.varscan | cut -f 1,2,3,4
Chrom   Position   Ref   Var
20      102441     T     C
20      181967     A     C
20      192514     C     T
20      207923     C     T
20      208168     T     C
20      222417     T     C
20      227246     A     G
20      239697     G     C
20      253772     C     T

You can either transfer all the files to your local computer for downstream analysis, or do more computing with them on the cloud!
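To pull down everything under the varscan prefix in one go, rather than file by file, something like the following sketch should work (assuming your s3cmd version supports --recursive for get; the wc line is just a rough count of calls per sample):

$ mkdir varscan-results
$ s3cmd get --recursive s3://mgymrek-barc-example/varscan/ varscan-results/
$ wc -l varscan-results/*.varscan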

The sequel: debugging EMR jobs

A lot of things can go wrong during MapReduce jobs, and debugging them is a whole other tutorial. Here are some general tips:

• You can log onto individual slave instances and look at the logs to get an idea of what went wrong. To do so, go to your EC2 console, find the slaves that are running, and SSH into them. You will need to use the same key-pair you've been using, and this time log in with username "hadoop" instead of "ubuntu".

$ ssh -i ~/keys/mgymrek_key.pem hadoop@<slave-public-DNS>

• Navigate to the logs, which are stored in /mnt/var/log/. From there you can view the standard output and standard error of the bootstrap and mapper scripts. Logs for the bootstrap action are in the "bootstrap-actions" folder, and logs for map tasks are in the "hadoop" folder.

$ ls /mnt/var/log/
bootstrap-actions  hadoop  instance-controller  instance-state  service-nanny


• At the EMR console, after selecting your job, the bottom tab will usually give informative status messages, such as the last state change or why something didn't work. For example, you might see a message that the bootstrap action failed on the master. To figure out the problem, you can then SSH into the master node, go to the logs for the bootstrap action, and likely find a helpful error message there.

• Make sure your slave instance type is big enough for the job. If you find you start an EMR job but can't SSH into a slave even though the job says it's running, it may be running out of memory and stalling. Try bumping up the RAM.

• If all else fails, there is a ton of documentation scattered around the internet. Googling problems tends to work.