GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 – Introduction to Text Analytics

Upload: anissa-wilkerson

Post on 22-Dec-2015

TRANSCRIPT

  • Slide 1
  • Slide 2
  • GOOGLE N-GRAMS ON AMAZON WEB SERVICES PART 3 Thomas Tiahrt, MA, PhD Computer Science 482 Introduction to Text Analytics
  • Slide 3
  • Version 1
      • Data created July 2009
      • Version 1 file format: n-gram \t year \t match_count \t page_count \t volume_count \n
      • n-gram is the 1-gram, 2-gram, 3-gram, 4-gram, or 5-gram
      • year is the publication year
      • match_count is the number of occurrences of the n-gram in that year
      • page_count is the number of pages on which the n-gram appeared
      • volume_count is the number of books in which the n-gram occurred
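The Version 1 record layout can be illustrated with a small parser. This is a minimal sketch, assuming each line holds exactly the five tab-separated fields named on the slide; the names `NgramRecord` and `parse_record` and the sample line are hypothetical and are not part of the dataset or the course material.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One row of the Version 1 file format (field names follow the slide)."""
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_record(line: str) -> NgramRecord:
    """Split one tab-separated line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    ngram, year, match_count, page_count, volume_count = fields
    return NgramRecord(ngram, int(year), int(match_count),
                       int(page_count), int(volume_count))


# Made-up sample line for illustration only; the counts are not real data.
sample = "the end of the day\t1955\t12\t11\t9\n"
record = parse_record(sample)
print(record.ngram, record.year, record.match_count)
```

Keeping the n-gram as a single string (words separated by spaces) means the tab is the only delimiter the parser has to care about.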
  • Slide 4
  • AWS Public Dataset
      • http://aws.amazon.com/datasets/8172056142375670
      • Stored in AWS Simple Storage Service (S3)
  • Slide 5
  • AWS Public Dataset
      • Stored as compressed data
      • Luckily, Hadoop supports:
          • GZIP
          • BZIP2
          • LZO (see below)
          • DEFLATE (zlib implementation)
      • But Hadoop does not support WinZip
      • And Hadoop supports LZO only if you create a version with it yourself
  • Slide 6
  • Hadoop Compression Formats (Source: Hadoop: The Definitive Guide)
      Compression format   Tool           Algorithm   Filename extension   Multiple files?   Able to be split?
      DEFLATE (zlib)       No CLI tools   DEFLATE     .deflate             No                No
      gzip                 gzip           DEFLATE+    .gz                  No                No
      bzip2                bzip2          bzip2       .bz2                 No                Yes
      LZO                  lzop           LZO         .lzo                 No                No
  • Slide 7
  • Hadoop Compression Formats (Source: Hadoop: The Definitive Guide)
      Compression format   Codec class
      DEFLATE (zlib)       org.apache.hadoop.io.compress.DefaultCodec
      gzip                 org.apache.hadoop.io.compress.GzipCodec
      bzip2                org.apache.hadoop.io.compress.BZip2Codec
      LZO                  com.hadoop.compression.lzo.LzopCodec
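Hadoop chooses a codec from the filename extension, so the two compression tables can be read together as one lookup. The sketch below models that lookup in plain Python for illustration; it does not call Hadoop. The class names are the codec classes as I understand them from Hadoop and the separate hadoop-lzo library, and `CompressionFormat`, `FORMATS`, and `format_for` are hypothetical names.

```python
import os
from typing import NamedTuple, Optional


class CompressionFormat(NamedTuple):
    name: str
    codec_class: str
    splittable: bool


# Extension -> format details. Treat this as an illustrative table, not as
# something read from a live Hadoop installation.
FORMATS = {
    ".deflate": CompressionFormat(
        "DEFLATE", "org.apache.hadoop.io.compress.DefaultCodec", False),
    ".gz": CompressionFormat(
        "gzip", "org.apache.hadoop.io.compress.GzipCodec", False),
    ".bz2": CompressionFormat(
        "bzip2", "org.apache.hadoop.io.compress.BZip2Codec", True),
    ".lzo": CompressionFormat(
        "LZO", "com.hadoop.compression.lzo.LzopCodec", False),
}


def format_for(filename: str) -> Optional[CompressionFormat]:
    """Look up the compression format by extension; None if unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    return FORMATS.get(extension)


for name in ("part-00000.bz2", "part-00000.gz", "archive.zip"):
    print(name, "->", format_for(name))
```

The `splittable` flag matters for job planning: a file in a format that cannot be split is processed by a single map task, however large it is.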
  • Slide 8
  • Project Assignment I
      • Use the nwcdatabucket as the bucket for input
      • Use the tmp folder in nwcdatabucket
      • Input is nwcdatabucket/tmp
      • Write Python code (in more than one .py file)
      • Find the twenty most frequently occurring 5-grams for a 10 year period
      • You may hard-code the 10 year period, e.g. 1950 to 1959
      • You need not worry about error checking the range
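One possible shape for the map step is sketched below. It assumes the job runs under Hadoop Streaming and that every input line is a Version 1 record with five tab-separated fields; both are assumptions, since the slides do not say how the data in nwcdatabucket/tmp is stored. The names `map_line` and `run` and the sample lines are hypothetical, and this is not presented as the expected solution.

```python
import io
import sys

START_YEAR = 1950  # hard-coded, as the assignment permits
END_YEAR = 1959    # inclusive, which gives a 10 year period


def map_line(line):
    """Return 'ngram<TAB>match_count' for in-range years, otherwise None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return None  # skip rows that do not match the assumed layout
    ngram, year, match_count = fields[0], fields[1], fields[2]
    if not (year.isdigit() and match_count.isdigit()):
        return None
    if START_YEAR <= int(year) <= END_YEAR:
        return "%s\t%s" % (ngram, match_count)
    return None


def run(stdin, stdout):
    """Read records from stdin and write the filtered key/value pairs."""
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")


# Demonstration on made-up lines; the counts are not real data.
demo_input = io.StringIO(
    "the end of the day\t1949\t5\t5\t4\n"
    "the end of the day\t1955\t12\t11\t9\n"
)
run(demo_input, sys.stdout)
```

In an actual streaming job the last lines would be replaced by `run(sys.stdin, sys.stdout)`, so that the mapper reads whatever Hadoop pipes into it.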
  • Slide 9
  • Project Assignment II
      • Setting reducers
          • Use the extra arguments at the bottom of the first page
          • The following creates 1 reducer: -D mapred.reduce.tasks=1
      • Upload your results as a text file
      • Upload your Python code modules
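A single reducer is convenient for a "top twenty" question because every key then reaches the same process, so one pass can produce a global ranking. The sketch below shows one way such a reduce step might look. It assumes input lines of the form n-gram, tab, count, already sorted by n-gram, which is how Hadoop Streaming delivers mapper output to a reducer. The names `parse_pair` and `reduce_stream` and the sample lines are hypothetical.

```python
import heapq
from itertools import groupby

TOP_N = 20


def parse_pair(line):
    """Split 'ngram<TAB>count' into (ngram, int(count))."""
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return ngram, int(count)


def reduce_stream(lines, top_n=TOP_N):
    """Sum counts per n-gram from key-sorted lines; return the top_n totals.

    The result is a list of (total, ngram) tuples, largest total first.
    """
    pairs = (parse_pair(line) for line in lines if line.strip())
    totals = (
        (sum(count for _, count in group), ngram)
        for ngram, group in groupby(pairs, key=lambda pair: pair[0])
    )
    return heapq.nlargest(top_n, totals)


# Demonstration on made-up, key-sorted lines; the counts are not real data.
demo_lines = [
    "a b c d e\t3\n",
    "a b c d e\t4\n",
    "f g h i j\t10\n",
    "k l m n o\t1\n",
]
for total, ngram in reduce_stream(demo_lines, top_n=2):
    print("%s\t%d" % (ngram, total))
```

With more than one reducer, each reducer would see only part of the key space, and its local top twenty would need a further merge step to give the overall answer.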
  • Slide 10
  • The end has come. End of the Part 3 PowerPoint.