
Post on 27-Jan-2015










Solving Real World Problems with Hadoop and SQL -> Hadoop

Masahji Stewart

Tuesday, April 5, 2011

Solving Real World Problems with Hadoop

Word Count

Input:

    MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster...

Output:

    as              1
    certain         1
    collectively    1
    datasets        1
    framework       1
    huge            1
    number          1
    on              1
    referred        1
    to              1

    MapReduce       1
    cluster         1
    computers       1
    distributable   1
    for             1
    kinds           1
    of              2
    problems        1

    (nodes),        1
    a               3
    is              1
    large           1
    processing      1
    using           1

Word Count (Mapper)

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Extract: word = "MapReduce", word = "is", word = "a", ...
                word.set(itr.nextToken());
                // Emit: "MapReduce", 1; "is", 1; "a", 1; ...
                context.write(word, one);
            }
        }
    }

Word Count (Reducer)

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum: key = "of", sum = 2
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // Emit: "of", 2
            context.write(key, result);
        }
    }

Word Count (Running)

    $ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
        -D mapred.reduce.tasks=3 \
        input_file out

    11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
    11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
    11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
    11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
    11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0%
    11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0%
    11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33%
    11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66%
    11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100%
    11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
    11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
    ...
    11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286
    11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27
    11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27
    11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24

Word Count (Output)

A file per reducer:

    hadoop@ip-10-245-210-191:~$ hadoop fs -ls out
    Found 3 items
    -rw-r--r-- 2 hadoop supergroup 90 2011-04-03 21:21 /user/hadoop/out/part-r-00000
    -rw-r--r-- 2 hadoop supergroup 80 2011-04-03 21:21 /user/hadoop/out/part-r-00001
    -rw-r--r-- 2 hadoop supergroup 49 2011-04-03 21:21 /user/hadoop/out/part-r-00002

    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00000
    as              1
    certain         1
    collectively    1
    datasets        1
    framework       1
    ...

    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00001
    MapReduce       1
    cluster         1
    computers       1
    distributable   1
    for             1
    ...

    hadoop@ip-10-245-210-191:~$ hadoop fs -cat out/part-r-00002
    (nodes),        1
    a               3
    is              1
    large           1
    processing      1
    using           1
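Which reducer file a word ends up in is decided by the partitioner. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below applies that formula in plain Java. It is only an illustration: the class and method names are made up for this example, and it hashes `String` keys, whose hash codes differ from those of Hadoop's `Text`, so the partition numbers it prints need not match the part files shown above.

```java
// Plain-Java sketch of hash partitioning; it does not use any Hadoop classes.
class PartitionSketch {

    // Map a key to one of numReduceTasks partitions. Masking with
    // Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Three reducers, as in "-D mapred.reduce.tasks=3".
        for (String word : new String[] {"as", "MapReduce", "a", "of"}) {
            System.out.println(word + " -> partition " + partitionFor(word, 3));
        }
    }
}
```

Every occurrence of a key hashes to the same partition, which is why each word appears in exactly one output file.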

Word Count

Input -> Split -> Map -> Shuffle/Sort -> Reduce -> Output

Input:

    MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster...

Split:

    MapReduce is a framework for processing
    huge datasets on certain kinds of distributable
    problems using a large number of computers
    (nodes), collectively
    referred to as a
    cluster

Output:

    as 1, certain 1, collectively 1, datasets 1, framework 1, huge 1, number 1, on 1, referred 1, to 1
    MapReduce 1, cluster 1, computers 1, distributable 1, for 1, kinds 1, of 2, problems 1
    (nodes), 1, a 3, is 1, large 1, processing 1, using 1
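The Split, Map, Shuffle/Sort and Reduce stages can be imitated in a single process to see what each one contributes. The following is a minimal sketch in plain Java, not the Hadoop implementation: the class `WordCountFlowSketch` and its methods are hypothetical names, each input line stands in for one split, and everything runs in memory on one machine.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the word count data flow; no Hadoop classes are used.
class WordCountFlowSketch {

    // Map: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle/Sort: group the emitted values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // Run all stages over the given lines (one line per "split").
    static TreeMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList(
            "MapReduce is a framework for processing",
            "huge datasets on certain kinds of distributable",
            "problems using a large number of computers",
            "(nodes), collectively",
            "referred to as a",
            "cluster");
        for (Map.Entry<String, Integer> e : wordCount(splits).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On a cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; the grouping and summing logic is the same.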

Log Processing (Date IP COUNT)

Input:

    - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "" "Mozilla/5.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "" "Mozilla/5.0"
    ...

Output:

    18/Jul/2010         1
    18/Jul/2010         3
    18/Jul/2010         1
    18/Jul/2010         1
    18/Jul/2010         1
    19/Jul/2010         1

Log Processing (Mapper)

    public static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    public static class ExtractDateAndIpMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text ip = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException {

            String text = value.toString();
            Matcher matcher = LOG_PATTERN.matcher(text);
            while (matcher.find()) {
                try {
                    // Extract: key = date <TAB> ip, i.e. groups 5 and 1 of LOG_PATTERN
                    ip.set(matcher.group(5) + "\t" + matcher.group(1));
                    // Emit: "18/Jul/2010\t189.186.9.181", 1 ...
                    context.write(ip, one);
                } catch (InterruptedException ex) {
                    throw new IOException(ex);
                }
            }
        }
    }
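To see how the pattern turns a log line into the emitted key, the regular expression can be tried outside Hadoop. The sketch below is a plain-Java illustration with assumptions stated up front: the class name `LogKeySketch` and the method `dateAndIp` are hypothetical, and the sample line pairs the IP address from the Emit annotation (189.186.9.181) with the `/friends` request purely for demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch: apply the slide's LOG_PATTERN to a single access-log line.
class LogKeySketch {

    static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+\\-]\\d{4})\\] "
        + "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // Returns "date<TAB>ip" for a matching line, or null when the line does not match.
    static String dateAndIp(String line) {
        Matcher matcher = LOG_PATTERN.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        // Group 1 is the client IP; group 5 is the date part inside the brackets.
        return matcher.group(5) + "\t" + matcher.group(1);
    }

    public static void main(String[] args) {
        // Assumed sample line; the IP address is taken from the Emit annotation.
        String line = "189.186.9.181 - - [18/Jul/2010:16:21:35 -0700] "
            + "\"GET /friends HTTP/1.0\" 200 9894 \"-\" \"Mozilla/5.0\"";
        System.out.println(dateAndIp(line));
    }
}
```

Because the key combines date and IP, the summing reducer from the word count example yields one count per IP per day without any changes.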

Log Processing (main)

    public class LogAggregator {
        ...
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: LogAggregator <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "LogAggregator");
            job.setJarByClass(LogAggregator.class);
            job.setMapperClass(ExtractDateAndIpMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run it!

Log Processing (Running)

    $ hadoop jar target/hadoop-recipes-1.0.jar \
        -libjars hadoop-examples.jar data/access.log log_results

    11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with
    11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
    11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-examples.jar in /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
    11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
    11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/hadoop-recipes/hadoop-examples.jar
    11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%

The JAR passed with -libjars is placed into the Distributed Cache.

Log Processing (Output)

    $ hadoop fs -ls log_results
    Found 2 items
    -rwxrwxrwx 1 masahji staff 0 2011-04-04 00:51 log_results/_SUCCESS
    -rwxrwxrwx 1 masahji staff 168 2011-04-04 00:51 log_results/part-r-00000

    $ hadoop fs -cat log_results/part-r-00000
    18/Jul/2010         1
    18/Jul/2010         3
    18/Jul/2010         1
    18/Jul/2010         1
    18/Jul/2010         1
    19/Jul/2010         1
    ...

Hadoop Streaming

Task Tracker Mapper / Reducer




Basic grep

Input:

    ...
    搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
    吉日 吉日 [ji2 ri4] /propitious day/lucky day/
    吉祥 吉祥 [ji2 xiang2] /lucky/auspicious/propitious/
    咄咄 咄咄 [duo1 duo1] /to cluck one's tongue/tut-tut/
    喜鵲 喜鹊 [xi3 que4] /black-billed magpie, legendary bringer of good luck/
    ...

Output:

    ...
    匯出 汇出 [hui4 chu1] /to export data (e.g. from a database)/
    搜索 搜索 [sou1 suo3] /to search/.../internet search/database search/
    數據庫 数据库 [shu4 ju4 ku4] /database/
    數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
    資料庫 资料库 [zi1 liao4 ku4] /database/
    ...

Basic grep

The mapper and reducer can be scripts or Java classes:

    $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input data/cedict.txt.gz \
        -output streaming/grep_database_mandarin \
        -mapper 'grep database' \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer
    ...
    11/04/04 05:27:58 INFO streaming.StreamJob: map 100% reduce 100%
    11/04/04 05:27:58 INFO streaming.StreamJob: Job complete: job_local_0001
    11/04/04 05:27:58 INFO streaming.StreamJob: Output: streaming/grep_database_mandarin

    $ hadoop fs -cat streaming/grep_database_mandarin/part-00000
    匯出 汇出 [hui4 chu1] /to remit (money)//to export data (e.g. from a database)/
    搜索 搜索 [sou1 suo3] /to search/to look for sth/internet search/database search/
    數據庫 数据库 [shu4 ju4 ku4] /database/
    數據庫軟件 数据库软件 [shu4 ju4 ku4 ruan3 jian4] /database software/
    資料庫 资料库 [zi1 liao4 ku4] /database/
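With Hadoop Streaming the mapper is any program that reads input lines on standard input and writes its output lines to standard output, which is why `grep database` works unchanged. As a rough illustration of what that mapper does to each line, here is a plain-Java sketch; `StreamingGrepSketch` and `grep` are hypothetical names, the sample lines come from the dictionary excerpt above, and the data is held in memory instead of being read from standard input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the filtering done by the 'grep database' streaming mapper.
class StreamingGrepSketch {

    // Keep only the lines that contain the search term, in their original order.
    static List<String> grep(Iterable<String> lines, String needle) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(needle)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "吉日 吉日 [ji2 ri4] /propitious day/lucky day/",
            "數據庫 数据库 [shu4 ju4 ku4] /database/",
            "吉祥 吉祥 [ji2 xiang2] /lucky/auspicious/propitious/");
        for (String line : grep(input, "database")) {
            System.out.println(line);
        }
    }
}
```

The identity reducer then passes the surviving lines through unchanged, so the job behaves like a distributed `grep`.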

Ruby Example (ignore ip list)

Input:

    - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 96 "-" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "" "Mozilla/5.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 51 "-" "Mozilla/5.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 94 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /healthcheck HTTP/1.0" 200 24 "-" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "" "Mozilla/5.0"
    ...

Output:

    - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "" "Mozilla/5.0"
    - - [19/Jul/2010] "GET /music HTTP/1.1" 200 30151 "" "Mozilla/5.0"
    - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "" "Mozilla/5.0"
    ...

Ruby Example (ignore ip list)

    #!/usr/bin/env ruby

    ignore = %w( 192.168 10)
    log_regex = /^([\d.]+)\s/

    while(line = STDIN.gets)
      next unless line =~ log_regex
      ip = $1

      print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
    end
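The script prints a line only when its IP address matches none of the ignored prefixes, where a prefix must be followed by a dot or by the end of the address (so `10` covers `10.0.0.1` but not `100.1.1.1`). A possible Java rendering of that check is sketched below. The class `IgnoreIpSketch` and the method `isIgnored` are hypothetical, and unlike the Ruby regular expression this version treats the dots inside a prefix literally.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the prefix test used by the Ruby filter script.
class IgnoreIpSketch {

    // Same ignore list as the script: private-network prefixes.
    static final List<String> IGNORE = Arrays.asList("192.168", "10");

    // True when the address equals an ignored prefix or starts with "<prefix>.".
    static boolean isIgnored(String ip) {
        for (String prefix : IGNORE) {
            if (ip.equals(prefix) || ip.startsWith(prefix + ".")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String ip : new String[] {"192.168.1.5", "10.0.0.1", "100.1.1.1", "189.186.9.181"}) {
            System.out.println(ip + (isIgnored(ip) ? " ignored" : " kept"));
        }
    }
}
```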

Ruby Example (ignore ip list)

    $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input data/access.log \
        -output out/streaming/filter_ips \
        -mapper './script/filter_ips' \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer
    11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with 11/04/04
    11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not
    11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/
    11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
    11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
    ...

    $ hadoop fs -cat out/streaming/filter_ips/part-00000
    ...
    - - [18/Jul/2010:16:21:35 -0700] "-" 400 0 "-" "-"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/arrows.gif HTTP/1.1" 200 729 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/error.gif HTTP/1.1" 200 996 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /images/success.gif HTTP/1.1" 200 1024 "" "Mozilla/4.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /Sevenfold_Skate/content_flags/ HTTP/1.1" 200 82 "-" "Mozilla/5.0"
    - - [18/Jul/2010:16:21:35 -0700] "GET /friends HTTP/1.0" 200 9894 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0")
    - - [18/Jul/2010:16:21:35 -0700] "GET /music HTTP/1.1" 200 30151 "" "Mozilla/5.0"
    - - [19/Jul/2010:16:21:35 -0700] "GET /javascripts/synctree_web.js HTTP/1.1" 200 43754 "" "Mozilla/5.0"

SQL -> Hadoop

Simple Query

Query:

    SELECT first_name, last_name
    FROM people
    WHERE first_name = 'John'
       OR favorite_movie_id = 2

Input:

    id  first_name  last_name  favorite_movie_id
    1   John        Mulligan   3
    2   Samir       Ahmed      5
    3   Royce       Rollins    2
    4   John        Smith      2

Output:

    first_name  last_name
    John        Mulligan
    John        Smith
    Royce       Rollins

Simple Query (Mapper)

    public class SimpleQuery {
        ...
        public static class SelectAndFilterMapper
                extends Mapper<Object, Text, TextArrayWritable, Text> {
            ...
            public void map(Object key, Text value, Context context)
                    throws IOException {

                String[] row = value.toString().split(DELIMITER);

                try {
                    // WHERE first_name = 'John' OR favorite_movie_id = 2
                    if (row[FIRST_NAME_COLUMN].equals("John") ||
                        row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {

                        // SELECT first_name, last_name
                        columns.set(new String[] {
                            row[FIRST_NAME_COLUMN],
                            row[LAST_NAME_COLUMN]
                        });

                        context.write(columns, blank);
                    }
                } catch (InterruptedException ex) {
                    throw new IOException(ex);
                }
            }
        }
        ...
    }

Simple Query (Running)

    $ hadoop jar target/hadoop-recipes-1.0.jar \
        data/people.tsv out/simple_query

    ...
    11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
    11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
    11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
    11/04/04 09:19:15 INFO mapred.JobClient: FileSystemCounters
    11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_READ=306296
    11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=398676
    11/04/04 09:19:15 INFO mapred.JobClient: Map-Reduce Framework
    11/04/04 09:19:15 INFO mapred.JobClient: Reduce input groups=3
    11/04/04 09:19:15 INFO mapred.JobClient: Combine output records=0
    11/04/04 09:19:15 INFO mapred.JobClient: Map input records=4
    11/04/04 09:19:15 INFO mapred.JobClient: Reduce shuffle bytes=0
    11/04/04 09:19:15 INFO mapred.JobClient: Reduce output records=3
    11/04/04 09:19:15 INFO mapred.JobClient: Spilled Records=6
    11/04/04 09:19:15 INFO mapred.JobClient: Map output bytes=54
    11/04/04 09:19:15 INFO mapred.JobClient: Combine input records=0
    11/04/04 09:19:15 INFO mapred.JobClient: Map output records=3
    11/04/04 09:19:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
    11/04/04 09:19:15 INFO mapred.JobClient: Reduce input records=3
    ...

Simple Query (Running)

    $ hadoop fs -cat out/simple_query/part-r-00000
    John    Mulligan
    John    Smith
    Royce   Rollins
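The query maps onto a single filter-and-project pass over the rows, and the sorted order of the result comes from the shuffle sorting the map output keys. The sketch below reproduces that behaviour in plain Java on the four sample rows. It makes several assumptions: the class `SimpleQuerySketch` is hypothetical, the rows are tab-separated, and the column positions follow the order shown in the input table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the SELECT ... WHERE query over tab-separated rows.
class SimpleQuerySketch {

    // Assumed column positions: id, first_name, last_name, favorite_movie_id.
    static final int FIRST_NAME = 1;
    static final int LAST_NAME = 2;
    static final int FAVORITE_MOVIE_ID = 3;

    static List<String> query(List<String> rows) {
        List<String> selected = new ArrayList<>();
        for (String line : rows) {
            String[] row = line.split("\t");
            // WHERE first_name = 'John' OR favorite_movie_id = 2
            if (row[FIRST_NAME].equals("John") || row[FAVORITE_MOVIE_ID].equals("2")) {
                // SELECT first_name, last_name
                selected.add(row[FIRST_NAME] + "\t" + row[LAST_NAME]);
            }
        }
        // Stand-in for the shuffle, which delivers keys to the reducer in sorted order.
        Collections.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        List<String> people = Arrays.asList(
            "1\tJohn\tMulligan\t3",
            "2\tSamir\tAhmed\t5",
            "3\tRoyce\tRollins\t2",
            "4\tJohn\tSmith\t2");
        for (String line : query(people)) {
            System.out.println(line);
        }
    }
}
```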

Join Query

Query:

    SELECT first_name, last_name, name, movies.image
    FROM people
    JOIN movies ON ( people.favorite_movie_id =

Input (people):

    id  first_name  last_name  favorite_...
    1   John        Mulligan   3
    2   Samir       Ahmed      5
    3   Royce       Rollins    2
    4   John        Smith      2

Input (movies):

    id  name        image
    2   The Matrix
    3   Gatacca
    4   AI
    5   Avatar

Output:

    first_name  last_name  name        image
    John        Mulligan   Gatacca
    Samir       Ahmed      Avatar
    Royce       Rollins    The Matrix
    John        Smith      The Matrix
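One common way to express such a join in MapReduce is a reduce-side join: every row is keyed by its join column and tagged with the table it came from, and each key group is then paired up. The plain-Java sketch below illustrates that idea on the sample tables. It is not the author's reducer: the class `JoinSketch`, its fields and the column positions are assumptions, and the `image` column is left out because its values are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch of a reduce-side join between people and movies.
class JoinSketch {

    // Sample rows: people = {id, first_name, last_name, favorite_movie_id}.
    static final List<String[]> PEOPLE = Arrays.asList(
        new String[] {"1", "John", "Mulligan", "3"},
        new String[] {"2", "Samir", "Ahmed", "5"},
        new String[] {"3", "Royce", "Rollins", "2"},
        new String[] {"4", "John", "Smith", "2"});

    // Sample rows: movies = {id, name}.
    static final List<String[]> MOVIES = Arrays.asList(
        new String[] {"2", "The Matrix"},
        new String[] {"3", "Gatacca"},
        new String[] {"4", "AI"},
        new String[] {"5", "Avatar"});

    static List<String> join(List<String[]> people, List<String[]> movies) {
        // "Map": key each row by its join column and tag it with its source table.
        TreeMap<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] p : people) {
            grouped.computeIfAbsent(p[3], k -> new ArrayList<>())
                   .add(new String[] {"people", p[1], p[2]});
        }
        for (String[] m : movies) {
            grouped.computeIfAbsent(m[0], k -> new ArrayList<>())
                   .add(new String[] {"movies", m[1]});
        }
        // "Reduce": within one key, pair every people record with every movies record.
        List<String> joined = new ArrayList<>();
        for (List<String[]> group : grouped.values()) {
            for (String[] left : group) {
                if (!left[0].equals("people")) {
                    continue;
                }
                for (String[] right : group) {
                    if (right[0].equals("movies")) {
                        joined.add(left[1] + "\t" + left[2] + "\t" + right[1]);
                    }
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        for (String line : join(PEOPLE, MOVIES)) {
            System.out.println(line);
        }
    }
}
```

A movie that nobody lists as a favorite (AI in the sample data) produces a key group with no `people` record and therefore no output row, which matches inner-join semantics.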

Join Query (Mapper)

    public static class SelectAndFilterMapper
            extends Mapper<Object, Text, Text, TextArrayWritable> {
        ...
        public void map(Object key, Text value, Context context) throws IOException {

            String[] row = value.toString().split(DELIMITER);
            // The input file name tells us which table this row came from.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

            try {
                if (fileName.startsWith("people")) {
                    // Tag the selected columns with their source table.
                    columns.set(new String[] {
                        "people",
                        row[PEOPLE_FIRST_NAME_COLUMN],
                        row[PEOPLE_LAST_NAME_COLUMN]
                    });
                    joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
                } else if (fileName.startsWith("movies")) {
                    columns.set(new String[] {
                        "movies",
                        row[MOVIES_NAME_COLUMN],
                        row[MOVIES_IMAGE_COLUMN]
                    });
                    joinKey.set(row[MOVIES_ID_COLUMN]);
                }

                // Rows from both tables are emitted under the movie id.
                context.write(joinKey, columns);

            } catch (InterruptedException ex) {
                throw new IOException(ex);
            }
        ...
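To see what the mapper hands to the shuffle, here is a minimal plain-Java sketch of the same tagging step without any Hadoop classes. The slides elide DELIMITER and the column constants, so the tab delimiter and the column positions below are assumptions taken from the input tables. The class name MapSideTagging and the file names people.tsv and movies.tsv are hypothetical.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;

class MapSideTagging {
    // Assumed values: the slides do not show these constants.
    static final String DELIMITER = "\t";
    static final int PEOPLE_FIRST_NAME_COLUMN = 1;
    static final int PEOPLE_LAST_NAME_COLUMN = 2;
    static final int PEOPLE_FAVORITE_MOVIE_ID_COLUMN = 3;
    static final int MOVIES_ID_COLUMN = 0;
    static final int MOVIES_NAME_COLUMN = 1;
    static final int MOVIES_IMAGE_COLUMN = 2;

    // Returns the (join key, tagged columns) pair for one input line,
    // or null when the file belongs to neither table.
    static Map.Entry<String, String[]> map(String fileName, String line) {
        // The -1 limit keeps a trailing empty field, such as a blank image column.
        String[] row = line.split(DELIMITER, -1);
        if (fileName.startsWith("people")) {
            return new AbstractMap.SimpleEntry<String, String[]>(
                row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN],
                new String[] {
                    "people", row[PEOPLE_FIRST_NAME_COLUMN], row[PEOPLE_LAST_NAME_COLUMN]
                });
        } else if (fileName.startsWith("movies")) {
            return new AbstractMap.SimpleEntry<String, String[]>(
                row[MOVIES_ID_COLUMN],
                new String[] {
                    "movies", row[MOVIES_NAME_COLUMN], row[MOVIES_IMAGE_COLUMN]
                });
        }
        return null;
    }

    public static void main(String[] args) {
        String[][] inputs = {
            {"people.tsv", "1\tJohn\tMulligan\t3"},
            {"movies.tsv", "3\tGatacca\t"},
        };
        for (String[] input : inputs) {
            Map.Entry<String, String[]> pair = map(input[0], input[1]);
            System.out.println(pair.getKey() + " -> " + Arrays.toString(pair.getValue()));
        }
    }
}
```

Both sample lines come out under key 3, which is what lets a single reduce call see John Mulligan and Gatacca together.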

Join Query (Reducer)

    public static class CombineMapsReducer
            extends Reducer<Text, TextArrayWritable, Text, TextArrayWritable> {
        ...
        public void reduce(Text key, Iterable<TextArrayWritable> values, Context context)
                throws IOException, InterruptedException {

            LinkedList<String[]> people = new LinkedList<String[]>();
            LinkedList<String[]> movies = new LinkedList<String[]>();

            // Split the values for this join key back into their source tables,
            // using the tag the mapper stored at position 0.
            for (TextArrayWritable val : values) {
                String dataset = val.getTextAt(0).toString();
                if (dataset.equals("people")) {
                    people.add(new String[] {
                        val.getTextAt(1).toString(),
                        val.getTextAt(2).toString(),
                    });
                }
                if (dataset.equals("movies")) {
                    movies.add(new String[] {
                        val.getTextAt(1).toString(),
                        val.getTextAt(2).toString(),
                    });
                }
            }

            // Pair every person with every movie that shares the key.
            for (String[] person : people) {
                for (String[] movie : movies) {
                    columns.set(new String[] {
                        person[0], person[1], movie[0], movie[1]
                    });
                    context.write(BLANK, columns);
                }
            }
        ...
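The reduce step can be tried on its own with plain Java collections. The sketch below is an illustration, not the deck's code: it takes the tagged values that would arrive for a single join key and returns the joined rows. The class name ReduceSideJoin is hypothetical, and the sample values for key 2 are taken from the input tables with the image field left empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReduceSideJoin {
    // Each value is {tag, column1, column2}, where tag is "people" or "movies".
    static List<String[]> reduce(List<String[]> values) {
        List<String[]> people = new ArrayList<String[]>();
        List<String[]> movies = new ArrayList<String[]>();

        // Separate the values by the table tag stored at position 0.
        for (String[] val : values) {
            if (val[0].equals("people")) {
                people.add(val);
            } else if (val[0].equals("movies")) {
                movies.add(val);
            }
        }

        // Cross product of the two sides; an empty side yields no rows,
        // which is what makes this an inner join.
        List<String[]> joined = new ArrayList<String[]>();
        for (String[] person : people) {
            for (String[] movie : movies) {
                joined.add(new String[] { person[1], person[2], movie[1], movie[2] });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Values that share join key 2 in the sample data.
        List<String[]> valuesForKey2 = Arrays.asList(
            new String[] {"people", "Royce", "Rollins"},
            new String[] {"movies", "The Matrix", ""},
            new String[] {"people", "John", "Smith"});
        for (String[] row : reduce(valuesForKey2)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Buffering both sides in memory mirrors the two LinkedLists in the reducer. It is fine for small groups, but a key with very many values on both sides would need a different strategy.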

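The nested loops in the reducer emit people X movies for each join key: every person row is paired with every movie row that shares the key. The columns written to the output correspond to the projection:

SELECT first_name, last_name, name, movies.image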



What is Hive?

“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.”

Hive Features



Query Processor



Functions / UDFs, UDAFs, UDTFs

Hive Demo
