CrystalBall - Compute Relative Frequency in Hadoop
![Page 1: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/1.jpg)
Big Data Project on
Crystal Ball
Submitted By:
Sushil Sedai (984474)
Suvash Shah (984461)
Submitted to: Prof. Prem Nair
![Page 2: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/2.jpg)
Pair approach (Mapper) – pseudo code
method map(docid id, doc d)
    for each term w in doc d do
        total = 0;
        for each neighbor u in Neighbor(w) do
            Emit(Pair(w, u), 1);
            total++;
        Emit(Pair(w, *), total);
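The mapper pseudocode above can be sketched in plain Java without any Hadoop dependency. This is only an illustration, not the code shown on the slides: the class name `PairMapperSketch` and the string encoding of emitted records are hypothetical, and `Neighbor(w)` is assumed to mean every term after `w` up to the next occurrence of `w`, an interpretation that is consistent with the sample outputs later in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PairMapperSketch {
    /** Assumed neighborhood: terms after index i, stopping at the next occurrence of the same term. */
    static List<String> neighbors(List<String> terms, int i) {
        List<String> result = new ArrayList<>();
        for (int j = i + 1; j < terms.size() && !terms.get(j).equals(terms.get(i)); j++) {
            result.add(terms.get(j));
        }
        return result;
    }

    /** Mirrors the pseudocode: one "(w,u)<TAB>1" record per neighbor, then "(w,*)<TAB>total". */
    static List<String> map(String line) {
        List<String> terms = Arrays.asList(line.trim().split("\\s+"));
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            String w = terms.get(i);
            int total = 0;
            for (String u : neighbors(terms, i)) {
                emitted.add("(" + w + "," + u + ")\t1");
                total++;
            }
            emitted.add("(" + w + ",*)\t" + total);
        }
        return emitted;
    }

    public static void main(String[] args) {
        map("34 56 29 12 34 56 92 10 34 12").forEach(System.out::println);
    }
}
```

The special `(w,*)` record carries the marginal count for `w`, which the reducer needs before it can turn any `(w,u)` count into a relative frequency.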
![Page 3: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/3.jpg)
Pair approach (Mapper) – Java Code
![Page 4: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/4.jpg)
Pair approach (Reducer) – pseudo code
method reduce(Pair p, Iterable<Int> values)
    if p.secondValue == *
        if p.firstValue != currentValue
            currentValue = p.firstValue;
            marginal = sum(values);
        else
            marginal += sum(values);
    else
        Emit(p, sum(values) / marginal);
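A minimal plain-Java sketch of the reducer logic above follows. It assumes, as the pseudocode implicitly does, that keys arrive sorted so that `(w,*)` is seen before any `(w,u)` and that all pairs sharing a left word reach the same reducer. The class name `PairReducerSketch` and the in-memory `output` map are hypothetical stand-ins for the Hadoop reducer and its context.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PairReducerSketch {
    private String current = null;  // left word whose marginal is being tracked
    private double marginal = 0;    // total neighbor count for `current`
    final Map<String, Double> output = new LinkedHashMap<>();

    static int sum(List<Integer> values) {
        int s = 0;
        for (int v : values) {
            s += v;
        }
        return s;
    }

    /** One reduce call; (w,*) must precede every (w,u) for the same w. */
    void reduce(String first, String second, List<Integer> values) {
        if (second.equals("*")) {
            if (!first.equals(current)) {
                current = first;
                marginal = sum(values);
            } else {
                marginal += sum(values);
            }
        } else {
            output.put("(" + first + "," + second + ")", sum(values) / marginal);
        }
    }

    public static void main(String[] args) {
        PairReducerSketch r = new PairReducerSketch();
        r.reduce("10", "*", Arrays.asList(2));
        r.reduce("10", "12", Arrays.asList(1));
        r.reduce("10", "34", Arrays.asList(1));
        System.out.println(r.output);
    }
}
```

Because `marginal` is a `double`, the division is floating-point, which avoids the integer-division pitfall of dividing two `int` counts.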
![Page 5: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/5.jpg)
Pair approach (Reducer) – Java Code
![Page 6: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/6.jpg)
Pair approach - input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
![Page 7: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/7.jpg)
Pair approach – Output (Reducer1)
(10,12) 0.5
(10,34) 0.5
(12,10) 0.09090909090909091
(12,18) 0.09090909090909091
(12,34) 0.36363636363636365
(12,56) 0.18181818181818182
(12,79) 0.09090909090909091
(12,92) 0.18181818181818182
(18,12) 0.25
(18,29) 0.125
(18,34) 0.25
(18,56) 0.125
(18,79) 0.125
(18,92) 0.125
(29,10) 0.06666666666666667
(29,12) 0.26666666666666666
(29,18) 0.06666666666666667
(29,34) 0.26666666666666666
(29,56) 0.13333333333333333
(29,79) 0.06666666666666667
(29,92) 0.13333333333333333
(34,10) 0.08333333333333333
(34,12) 0.25
(34,18) 0.08333333333333333
(34,29) 0.08333333333333333
(34,56) 0.25
(34,79) 0.08333333333333333
(34,92) 0.16666666666666666
(56,10) 0.1
(56,12) 0.3
(56,29) 0.1
(56,34) 0.3
(56,92) 0.2
(92,10) 0.3333333333333333
(92,12) 0.3333333333333333
(92,34) 0.3333333333333333
![Page 8: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/8.jpg)
Pair approach – Output (Reducer2)
(79,12) 0.2
(79,18) 0.2
(79,34) 0.2
(79,56) 0.2
(79,92) 0.2
![Page 9: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/9.jpg)
Stripe approach (Mapper) – pseudo code
method map(docid id, doc d)
    Stripe H;
    for each term w in doc d do
        clear(H);
        for each neighbor u in Neighbor(w) do
            if H.containsKey(u)
                H{u} += 1;
            else
                H.add(u, 1);
        Emit(w, H);
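The stripe mapper above can be sketched in plain Java as follows. This is an illustration under assumptions rather than the slide's code: `StripeMapperSketch` is a hypothetical name, `Neighbor(w)` is again taken to be the terms after `w` up to its next occurrence, and a fresh map is created per term instead of clearing a shared one, because this sketch keeps the emitted stripes in a list rather than serializing them immediately.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StripeMapperSketch {
    /** Returns one (term, stripe) record per term occurrence in the line. */
    static List<Map.Entry<String, Map<String, Integer>>> map(String line) {
        String[] terms = line.trim().split("\\s+");
        List<Map.Entry<String, Map<String, Integer>>> emitted = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            Map<String, Integer> stripe = new LinkedHashMap<>();
            // Walk the assumed neighborhood: stop at the next occurrence of terms[i].
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                stripe.merge(terms[j], 1, Integer::sum); // containsKey / add in one call
            }
            emitted.add(new SimpleEntry<>(terms[i], stripe));
        }
        return emitted;
    }

    public static void main(String[] args) {
        map("34 56 29 12 34 56 92 10 34 12").forEach(System.out::println);
    }
}
```

Compared with the pair approach, each term occurrence produces a single record, so fewer but larger key-value pairs cross the shuffle.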
![Page 10: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/10.jpg)
Stripe approach (Mapper) – Java Code
![Page 11: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/11.jpg)
Stripe approach (Reducer) – pseudo code
method reduce(Text key, Stripes [H1, H2, …])
    H = elementWiseSum(H1, H2, …);
    total = sumValues(H);
    for each Item h in H do
        h.secondValue /= total;
    Emit(key, H);
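As a rough plain-Java sketch of this reducer: the stripes received for one key are first summed element-wise, and each count is then divided by the stripe total. The class name `StripeReducerSketch` is hypothetical, and a sorted `TreeMap` is used only to make the printed order deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StripeReducerSketch {
    /** Merges all stripes for one key and converts counts to relative frequencies. */
    static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        int total = 0;
        for (int count : merged.values()) {
            total += count;
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> h1 = new LinkedHashMap<>();
        h1.put("34", 1);
        h1.put("12", 1);
        List<Map<String, Integer>> stripes = new ArrayList<>();
        stripes.add(h1);
        System.out.println(reduce(stripes)); // stripe for key 10 from Mapper1
    }
}
```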
![Page 12: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/12.jpg)
Stripe approach (Reducer) – Java Code
![Page 13: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/13.jpg)
Stripe approach (Reducer) – Java Code
![Page 14: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/14.jpg)
Stripe approach – input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
![Page 15: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/15.jpg)
Stripe approach – Output(Reducer1)
10 [ (34,0.5000) (12,0.5000) ]
12 [ (56,0.1818) (92,0.1818) (34,0.3636) (18,0.0909) (79,0.0909) (10,0.0909) ]
18 [ (56,0.1250) (92,0.1250) (34,0.2500) (79,0.1250) (29,0.1250) (12,0.2500) ]
29 [ (56,0.1333) (92,0.1333) (34,0.2667) (18,0.0667) (79,0.0667) (10,0.0667) (12,0.2667) ]
34 [ (56,0.2500) (92,0.1667) (18,0.0833) (79,0.0833) (29,0.0833) (10,0.0833) (12,0.2500) ]
56 [ (92,0.2000) (34,0.3000) (29,0.1000) (10,0.1000) (12,0.3000) ]
92 [ (34,0.3333) (10,0.3333) (12,0.3333) ]
![Page 16: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/16.jpg)
Stripe approach – Output(Reducer2)
79 [ (56,0.2000) (92,0.2000) (34,0.2000) (18,0.2000) (12,0.2000) ]
![Page 17: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/17.jpg)
Hybrid approach (Mapper) – pseudo code
method map(docid id, doc d)
    HashMap H;
    for each term w in doc d do
        for each neighbor u in Neighbor(w) do
            if H.contains(Pair(w, u))
                H{Pair(w, u)} += 1;
            else
                H.add(Pair(w, u), 1);
    for each Pair p in H do
        Emit(p, H(p));
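The hybrid mapper emits pairs like the pair approach but aggregates their counts in memory first. A plain-Java sketch, with the hypothetical class name `HybridMapperSketch`, a string key of the form `(w,u)`, and the same assumed definition of `Neighbor(w)` as the terms after `w` up to its next occurrence, might look like this:

```java
import java.util.Map;
import java.util.TreeMap;

final class HybridMapperSketch {
    /** Counts every (w,u) pair for the whole input line, then returns the aggregated pairs. */
    static Map<String, Integer> map(String line) {
        String[] terms = line.trim().split("\\s+");
        Map<String, Integer> counts = new TreeMap<>(); // sorted only for readable output
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                String pair = "(" + terms[i] + "," + terms[j] + ")";
                counts.merge(pair, 1, Integer::sum); // first sighting stores 1, later ones add 1
            }
        }
        return counts; // each entry corresponds to one Emit(p, H(p))
    }

    public static void main(String[] args) {
        map("18 29 12 34 79 18 56 12 34 92").forEach((p, c) -> System.out.println(p + "\t" + c));
    }
}
```

Note that no `(w,*)` record is emitted here: the hybrid reducer rebuilds a stripe per left word and derives the total from it.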
![Page 18: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/18.jpg)
Hybrid approach (Mapper) – Java Code
![Page 19: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/19.jpg)
Hybrid approach (Reducer) – pseudo code
prev = null;
HashMap H;
method reduce(Pair p, Iterable<Int> values)
    if prev != null and p.firstValue != prev
        total = sumValues(H);
        for each item h in H
            h.secondValue /= total;
        Emit(prev, H);
        clear(H);
    end if
    prev = p.firstValue;
    H.add(p.secondValue, sum(values));
method close
    // flush the stripe for the last key
    total = sumValues(H);
    for each item h in H
        h.secondValue /= total;
    Emit(prev, H);
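The reducer logic above can be sketched in plain Java as a small stateful class. It assumes pairs arrive grouped by their first element, and it emits the stripe under the previous left word when the key changes and once more in `close()`. The names `HybridReducerSketch`, `flush`, and the in-memory `output` map are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HybridReducerSketch {
    private String prev = null;                                   // left word of the stripe being built
    private final Map<String, Integer> stripe = new TreeMap<>();  // neighbor -> count for `prev`
    final Map<String, Map<String, Double>> output = new LinkedHashMap<>();

    /** One reduce call per pair; pairs with the same first element must be adjacent. */
    void reduce(String first, String second, List<Integer> values) {
        if (prev != null && !first.equals(prev)) {
            flush();
        }
        prev = first;
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        stripe.merge(second, sum, Integer::sum);
    }

    /** Called once after the last pair so the final stripe is not lost. */
    void close() {
        if (prev != null) {
            flush();
        }
    }

    private void flush() {
        int total = 0;
        for (int c : stripe.values()) {
            total += c;
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        output.put(prev, frequencies);
        stripe.clear();
    }
}
```

A caller would feed the sorted pairs through `reduce` and then call `close()`, after which `output` holds one normalized stripe per left word.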
![Page 20: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/20.jpg)
Hybrid approach (Reducer) – Java Code
![Page 21: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/21.jpg)
Hybrid approach (Reducer) – Java Code
![Page 22: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/22.jpg)
Hybrid approach - Input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
![Page 23: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/23.jpg)
Hybrid approach – Output(Reducer1)
10 (12,0.5) (34,0.5)
12 (10,0.09090909) (18,0.09090909) (34,0.36363637) (56,0.18181819) (79,0.09090909) (92,0.18181819)
18 (12,0.25) (29,0.125) (34,0.25) (56,0.125) (79,0.125) (92,0.125)
29 (10,0.06666667) (12,0.26666668) (18,0.06666667) (34,0.26666668) (56,0.13333334) (79,0.06666667) (92,0.13333334)
34 (10,0.083333336) (12,0.25) (18,0.083333336) (29,0.083333336) (56,0.25) (79,0.083333336) (92,0.16666667)
56 (10,0.1) (12,0.3) (29,0.1) (34,0.3) (92,0.2)
92 (10,0.33333334) (12,0.33333334) (34,0.33333334)
![Page 24: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/24.jpg)
Hybrid approach – Output(Reducer2)
79 (12,0.2) (18,0.2) (34,0.2) (56,0.2) (92,0.2)
![Page 25: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/25.jpg)
Comparison
![Page 26: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/26.jpg)
Apache Spark
Write a Java program on Spark to calculate the total number of students entering MUM across different entries. The program should display the total number of students by country.
![Page 27: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/27.jpg)
Spark - Java Code
![Page 28: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/28.jpg)
Spark - input
2014 Feb Nepal 20
2014 Feb India 15
2014 Oct Italy 2
2014 July France 1
2015 Feb Nepal 10
2015 Feb India 25
2015 Oct Italy 7
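The aggregation this task asks for is a sum of the last column grouped by the country column. The sketch below shows that logic in plain Java streams rather than Spark, so it runs without a cluster; in Spark's Java API the same shape would typically be expressed with `mapToPair` followed by `reduceByKey`. The class name `StudentsByCountrySketch` and the whitespace-separated column layout (year, month, country, count) are assumptions based on the input shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class StudentsByCountrySketch {
    /** Sums the student count (4th column) per country (3rd column). */
    static Map<String, Integer> totals(List<String> lines) {
        return lines.stream()
                .map(line -> line.trim().split("\\s+"))
                .collect(Collectors.groupingBy(
                        fields -> fields[2],
                        TreeMap::new,
                        Collectors.summingInt(fields -> Integer.parseInt(fields[3]))));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "2014 Feb Nepal 20",
                "2014 Feb India 15",
                "2014 Oct Italy 2",
                "2014 July France 1",
                "2015 Feb Nepal 10",
                "2015 Feb India 25",
                "2015 Oct Italy 7");
        totals(input).forEach((country, n) -> System.out.println("(" + country + "," + n + ")"));
    }
}
```

The `TreeMap` prints countries alphabetically, so the order of lines may differ from a Spark run, where the order of a `reduceByKey` result is not guaranteed.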
![Page 29: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/29.jpg)
Spark - Output
(France,1)
(Italy,9)
(Nepal,30)
(India,40)
![Page 30: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/30.jpg)
Tools Used
• VMware Player Pro 7
• cloudera-quickstart-vm-5.4.0-0-vmware
• Eclipse Version: Luna Service Release 2 (4.4.2)
• Windows 8.1
![Page 31: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/31.jpg)
References
• http://glebche.appspot.com/static/hadoop-ecosystem/mapreduce-job-java.html
• https://hadoopi.wordpress.com/2013/06/05/hadoop-implementing-the-tool-interface-for-mapreduce-driver/
• http://www.bogotobogo.com/Hadoop/BigData_hadoop_Apache_Spark.php
![Page 32: CrystalBall - Compute Relative Frequency in Hadoop](https://reader030.vdocuments.mx/reader030/viewer/2022012903/55cee640bb61ebab108b4728/html5/thumbnails/32.jpg)
Thank You