hack reduce mr-intro

Upload: montrealouvert

Post on 20-Jun-2015

Page 1: Hack reduce mr-intro

HackReduce: MapReduce Intro

Hopper.com (Greg Lu)

Page 3: Hack reduce mr-intro

NASDAQ,DELL,1997-08-26,83.87,84.75,82.50,82.81,48736000,10.35
NASDAQ,DITC,2002-10-24,1.56,1.69,1.53,1.60,133600,1.60
NASDAQ,DLIA,2008-01-28,1.91,2.31,1.91,2.23,760800,2.23
NASDAQ,DWCH,2002-07-10,3.09,3.14,3.09,3.14,2400,1.57
NASDAQ,DYNT,2008-12-29,0.31,0.31,0.29,0.30,26900,0.30
NASDAQ,DMLP,2003-10-21,17.65,17.94,17.58,17.59,4800,9.73
NASDAQ,DORM,1997-02-07,7.88,7.88,7.63,7.75,7400,3.87
NASDAQ,DXPE,2004-10-25,5.19,5.24,5.00,5.00,7600,2.50
NASDAQ,DEST,2009-03-17,4.55,5.03,4.55,5.03,6800,5.03
NASDAQ,DBRN,1992-01-02,8.88,9.25,8.75,8.88,84800,2.22
NASDAQ,DXYN,1998-11-25,6.38,6.44,6.19,6.25,211100,6.25
NASDAQ,DEAR,1998-12-08,10.50,11.50,10.50,10.50,5800,6.45

(The slide brackets the rows above into three groups: InputSplit 1, InputSplit 2, InputSplit 3.)

datasets/nasdaq/daily_prices/NASDAQ_daily_prices_subset.csv

...

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    if (args.length != 2) {
        System.err.println("Usage: " + getClass().getName() + " <input> <output>");
        System.exit(2);
    }

    // Creating the MapReduce job (configuration) object
    Job job = new Job(conf);
    job.setJarByClass(getClass());
    job.setJobName(getClass().getName());

    // The Nasdaq/NYSE data dumps come in as a CSV file (text input), so we configure
    // the job to use this format.
    job.setInputFormatClass(TextInputFormat.class);

[...]

org.hackreduce.examples.stockexchange.MarketCapitalization (expanded version)

The input format defines how the data is split and which mapper each split is assigned to.

Page 4: Hack reduce mr-intro

InputSplit 1:

NASDAQ,DELL,1997-08-26,83.87,84.75,82.50,82.81,48736000,10.35
NASDAQ,DITC,2002-10-24,1.56,1.69,1.53,1.60,133600,1.60
NASDAQ,DLIA,2008-01-28,1.91,2.31,1.91,2.23,760800,2.23
NASDAQ,DWCH,2002-07-10,3.09,3.14,3.09,3.14,2400,1.57

datasets/nasdaq/daily_prices

public int run(String[] args) throws Exception {
    [...]

    // Tell the job which Mapper and Reducer to use (classes defined above)
    job.setMapperClass(MarketCapitalizationMapper.class);
    job.setReducerClass(MarketCapitalizationReducer.class);

    // This is what the Mapper will be outputting to the Reducer
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(DoubleWritable.class);

    // This is what the Reducer will be outputting
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Setting the input folder of the job
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Preparing the output folder by first deleting it if it exists
    Path output = new Path(args[1]);
    FileSystem.get(conf).delete(output, true);
    FileOutputFormat.setOutputPath(job, output);

org.hackreduce.examples.stockexchange.MarketCapitalization (expanded version)

datasets/nasdaq/daily_prices/NASDAQ_daily_prices_subset.csv

Point the job to the custom classes that we created in order to process the data.

Define the types of the (key, value) pairs that we’ll be outputting from the mappers and the result of the job itself.

Now we’ll show the MarketCapitalizationMapper class

Page 5: Hack reduce mr-intro

public static class MarketCapitalizationMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    // Assumed declaration: the slide uses sdf without showing it. The pattern matches dates like 1997-08-26.
    private final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String inputString = value.toString();

        String[] attributes = inputString.split(",");

        if (attributes.length != 9)
            throw new IllegalArgumentException("Input string given did not have 9 values in CSV format");

        // Declared outside the try block so they are still in scope when the pair is emitted
        String stockSymbol;
        double stockPriceClose;
        int stockVolume;

        try {
            String exchange = attributes[0];
            stockSymbol = attributes[1];
            Date date = sdf.parse(attributes[2]);
            double stockPriceOpen = Double.parseDouble(attributes[3]);
            double stockPriceHigh = Double.parseDouble(attributes[4]);
            double stockPriceLow = Double.parseDouble(attributes[5]);
            stockPriceClose = Double.parseDouble(attributes[6]);
            stockVolume = Integer.parseInt(attributes[7]);
            double stockPriceAdjClose = Double.parseDouble(attributes[8]);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Input string contained an unknown value that couldn't be parsed");
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Input string contained an unknown number value that couldn't be parsed");
        }

        // Emit (symbol, close price * volume) for this line
        double marketCap = stockPriceClose * stockVolume;
        context.write(new Text(stockSymbol), new DoubleWritable(marketCap));
    }

}

InputSplit 1:

NASDAQ,DELL,1997-08-26,83.87,84.75,82.50,82.81,48736000,10.35
NASDAQ,DITC,2002-10-24,1.56,1.69,1.53,1.60,133600,1.60
NASDAQ,DLIA,2008-01-28,1.91,2.31,1.91,2.23,760800,2.23
NASDAQ,DWCH,2002-07-10,3.09,3.14,3.09,3.14,2400,1.57

datasets/nasdaq/daily_prices
datasets/nasdaq/daily_prices/NASDAQ_daily_prices_subset.csv

This job doesn’t do a whole lot, but this is where the processing is occurring.

org.hackreduce.examples.stockexchange.MarketCapitalization (expanded version)

Page 6: Hack reduce mr-intro

datasets/nasdaq/daily_prices
datasets/nasdaq/daily_prices/NASDAQ_daily_prices_subset.csv

InputSplit 1:

NASDAQ,DELL,1997-08-26,83.87,84.75,82.50,82.81,48736000,10.35
NASDAQ,DITC,2002-10-24,1.56,1.69,1.53,1.60,133600,1.60
NASDAQ,DLIA,2008-01-28,1.91,2.31,1.91,2.23,760800,2.23
NASDAQ,DWCH,2002-07-10,3.09,3.14,3.09,3.14,2400,1.57

MarketCapitalizationMapper (line-by-line) emits:

(DELL, 82.81*48736000)
(DITC, 1.60*133600)
(DLIA, 2.23*760800)
(DWCH, 3.14*2400)

(sorted and partitioned to specific reducers)

MarketCapitalizationReducer
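As a rough illustration of the map step, the sketch below applies the same per-line logic to the four rows of InputSplit 1 in plain Java, without Hadoop. The class name MapStepSketch and the method mapLine are hypothetical names chosen for this example; only the column positions and the close-times-volume product come from the slides.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch of the map step; it is not part of the hackreduce code base.
public class MapStepSketch {

    // Turns one CSV line into a (stockSymbol, close * volume) pair, like the mapper on the slide.
    public static Map.Entry<String, Double> mapLine(String line) {
        String[] attributes = line.split(",");
        if (attributes.length != 9) {
            throw new IllegalArgumentException("Input string given did not have 9 values in CSV format");
        }
        String stockSymbol = attributes[1];
        double stockPriceClose = Double.parseDouble(attributes[6]);
        long stockVolume = Long.parseLong(attributes[7]);
        return new SimpleEntry<>(stockSymbol, stockPriceClose * stockVolume);
    }

    public static void main(String[] args) {
        // The four rows shown as InputSplit 1 on the slides
        String[] inputSplit = {
            "NASDAQ,DELL,1997-08-26,83.87,84.75,82.50,82.81,48736000,10.35",
            "NASDAQ,DITC,2002-10-24,1.56,1.69,1.53,1.60,133600,1.60",
            "NASDAQ,DLIA,2008-01-28,1.91,2.31,1.91,2.23,760800,2.23",
            "NASDAQ,DWCH,2002-07-10,3.09,3.14,3.09,3.14,2400,1.57"
        };
        for (String line : inputSplit) {
            Map.Entry<String, Double> pair = mapLine(line);
            System.out.println(String.format(Locale.ROOT, "(%s, %.2f)", pair.getKey(), pair.getValue()));
        }
    }
}
```

In a real job, Hadoop calls map once per line of the split and collects the emitted pairs through the Context object instead of returning them.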

Page 7: Hack reduce mr-intro

public static class MarketCapitalizationReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    NumberFormat currencyFormat = NumberFormat.getCurrencyInstance(Locale.getDefault());

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
        double highestCap = 0.0;
        for (DoubleWritable value : values) {
            highestCap = Math.max(highestCap, value.get());
        }
        context.write(key, new Text(currencyFormat.format(highestCap)));
    }
}

org.hackreduce.examples.stockexchange.MarketCapitalization (expanded version)

Pairs coming from different mappers, but arriving at the same reducer:

(DELL, 82.81*48736000)
(DELL, 31.92*18678500)
(DELL, 23.85*16038700)
(DELL, 30.38*68759800)
(...)

Output of this reducer:

(DELL, $4,035,828,160.00)
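The grouping and reduce steps can be approximated locally as well. The following is a minimal sketch in plain Java, assuming an in-memory list of pairs stands in for the mapper output; ReduceStepSketch, groupByKey, reduceMax and formatCap are hypothetical names for this example. It pins the currency format to Locale.US so the result is reproducible, whereas the reducer on the slide uses Locale.getDefault().

```java
import java.text.NumberFormat;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch of the shuffle and reduce steps; it does not use Hadoop.
public class ReduceStepSketch {

    // Stand-in for the shuffle: collect all values per key, with keys kept in sorted order.
    public static Map<String, List<Double>> groupByKey(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Same logic as the reducer on the slide: keep the highest value seen for one key.
    public static double reduceMax(Iterable<Double> values) {
        double highestCap = 0.0;
        for (double value : values) {
            highestCap = Math.max(highestCap, value);
        }
        return highestCap;
    }

    // Assumption: Locale.US is used here instead of the default locale.
    public static String formatCap(double cap) {
        return NumberFormat.getCurrencyInstance(Locale.US).format(cap);
    }

    public static void main(String[] args) {
        // Pairs taken from the slides, as if emitted by different mappers
        List<Map.Entry<String, Double>> pairs = Arrays.asList(
            new SimpleEntry<>("DELL", 82.81 * 48736000),
            new SimpleEntry<>("DITC", 1.60 * 133600),
            new SimpleEntry<>("DELL", 31.92 * 18678500),
            new SimpleEntry<>("DELL", 23.85 * 16038700),
            new SimpleEntry<>("DELL", 30.38 * 68759800));

        for (Map.Entry<String, List<Double>> group : groupByKey(pairs).entrySet()) {
            System.out.println(group.getKey() + " " + formatCap(reduceMax(group.getValue())));
        }
    }
}
```

Starting highestCap at 0.0 works here because close price times volume is never negative; a reducer over values that can be negative would need a different starting point.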

Page 8: Hack reduce mr-intro

DAIO $1,515,345.00
DAKT $63,656,600.00
DANKY $89,668,857.00
DARA $1,464,720.00
DASTY $14,141,055.00
DATA $2,888,325.00
DAVE $5,144,800.00
DBLE $1,040,996.00
DBLEP $79,584.00
DBRN $131,023,326.00
DBTK $7,405,366.00
DCAI $20,058,990.00
DCGN $10,372,992.00
DCOM $12,298,208.00
DCTH $3,285,652.00
DDDC $79,176.00
DDIC $3,684,100.00
DDMX $7,811,204.00
DDRX $12,480,500.00
DDSS $4,545,438.00
DEAR $4,375,800.00
DECK $271,081,580.00
DEER $5,363,740.00
DEIX $5,285,892.00

/tmp/nasdaq_marketcaps/part-r-00000

Page 9: Hack reduce mr-intro

We can dynamically increase the size of your cluster if you need the processing power, but job run time is typically bottlenecked by the code.

If your job takes longer than 10 minutes to run, come see us.