Using Derivation-Free Optimization Methods in the Hadoop Cluster with TeraSort
Renato dos Santos Alves & Sarosh Farjam
Projeto de Experimentos (Design of Experiments) ~ 03.7.2014
Sequence
• Abstract
• Introduction
• Workload Analysis of Search Engines
• Benchmarking Methodology and Decisions
• Scalable Data Generation Tool
• Case Studies
• Conclusions
Introduction
• Running the MapReduce cluster benchmark TeraSort under a DFO (derivative-free optimization) method.
• Each iteration of the DFO method produces new values for the Hadoop configuration parameters.
• To apply these parameters, which are specified within the framework, we need a tool that assists with cluster configuration so that the TeraSort application runs properly.
• Chef server and Chef client.
TeraSort Benchmark
TeraSort includes 3 MapReduce applications:
● TeraGen: generates the data.
● TeraSort: samples the input data and uses the samples with MapReduce to sort the data.
● TeraValidate: validates that the output is sorted.
DFO Method
• Derivative-free optimization (DFO) is a discipline within mathematical optimization.
• It refers to problems for which derivative information is unavailable, or to methods that do not use derivatives.
• The derivative of a function of a real variable measures the sensitivity to change of a quantity (the dependent variable) that is determined by another quantity (the independent variable). E.g., the derivative of the position of a moving object with respect to time is the object's velocity.
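In symbols, the definition and the velocity example read as follows. This is the standard textbook formulation; the symbols $f$, $s$, $v$ and $t$ are introduced here only for illustration.

$$
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h},
\qquad
v(t) = s'(t) = \frac{ds}{dt}.
$$

A derivative-free method never evaluates $f'$. It works only with sampled values $f(x_1), f(x_2), \dots$ of the objective.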
Algorithm BOBYQA
• BOBYQA (Bound Optimization BY Quadratic Approximation) is a numerical optimization algorithm by Michael J. D. Powell.
• It is also the name of Powell's Fortran 77 implementation of the algorithm.
• BOBYQA solves bound-constrained optimization problems without using derivatives of the objective function, which makes it a derivative-free algorithm.
• The algorithm solves the problem using a trust-region method that forms quadratic models by interpolation. One new point is computed on each iteration, usually by solving a trust-region subproblem subject to the bound constraints.
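Written as formulas, the bullets above correspond roughly to the following. The notation is chosen here for illustration and follows the usual description of the method. The problem has the form

$$
\min_{x \in \mathbb{R}^n} F(x)
\quad \text{subject to} \quad
a_i \le x_i \le b_i, \quad i = 1, \dots, n.
$$

At iteration $k$, a quadratic model $Q_k$ interpolates $F$ at $m$ points $y_1, \dots, y_m$, and the step $d_k$ approximately solves the trust-region subproblem with radius $\Delta_k$:

$$
Q_k(y_j) = F(y_j), \quad j = 1, \dots, m;
\qquad
\min_{d} \; Q_k(x_k + d)
\quad \text{s.t.} \quad
a \le x_k + d \le b, \quad \lVert d \rVert \le \Delta_k.
$$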
Algorithm COBYLA
• Constrained Optimization BY Linear Approximation (COBYLA) is a numerical optimization method for constrained problems where the derivative of the objective function is not known.
• It was invented by Michael J. D. Powell while he was working for Westland Helicopters.
• COBYLA proceeds by iteratively approximating the actual constrained optimization problem with linear programming problems.
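As a simplified sketch of one iteration (notation assumed here, and omitting details such as the merit function used when the linear program is infeasible): given $n + 1$ simplex vertices $x^0, \dots, x^n$, COBYLA builds linear models $\hat f$ and $\hat c_i$ that interpolate the objective $f$ and the constraints $c_i$ at those vertices, and then solves

$$
\min_{x} \; \hat f(x)
\quad \text{s.t.} \quad
\hat c_i(x) \ge 0 \ \text{for all } i,
\qquad
\lVert x - x^0 \rVert \le \rho,
$$

where $\rho$ is the current trust-region radius, which is reduced as the iterations progress.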
Hadoop Environment
• A physical cluster with 29 nodes was used:
• A master Hadoop server (responsible for running the JobTracker and NameNode services).
• 28 Hadoop slaves (dedicated to running the TaskTracker and DataNode services).
• 2 Gigabit Ethernet to provide connectivity between the 29 nodes.
Hadoop Environment
• A front-end server provides access to the cluster. This server is configured as a Chef Server and is also used to organize the executions of the DFO TeraSort application: it synchronizes the DFO runs and updates the Hadoop parameter settings at each iteration of the DFO method.
Experiment Execution
• Nemesis, a server that is not part of the cluster, is used as the front end for running the TeraSort application: it runs the DFO method and updates the Hadoop settings based on the method's output.
• The synchronization of the TeraSort executions and the Hadoop updates with the output of the DFO method is performed by the dfo_hadoop_terasort application, which runs on the front-end server.
• The dfo_hadoop_terasort application is supplied with a file that contains the initial values of the Hadoop configuration parameters, the restrictions that keep these values out of unwanted ranges, an exit (stopping) value for the objective function, a tolerance value for the restrictions, and the maximum number of iterations. By processing the input file and interacting with the Hadoop cluster, it discovers which parameter values have the greatest impact on making the TeraSort application run faster, and it writes an output file with the best configuration parameters found.
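The inputs listed above (initial values, restrictions, a tolerance, and an iteration limit) map naturally onto a bounded derivative-free search loop. The following is a minimal sketch in C, not the dfo_hadoop_terasort implementation: it uses a plain compass (coordinate) search as a stand-in for BOBYQA/COBYLA, and `mock_runtime` is a hypothetical replacement for "apply the parameters, run TeraSort, measure the time". All names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical black-box objective. In the real experiment this would
 * reconfigure Hadoop, run TeraSort, and return the runtime in seconds.
 * Here it is a smooth mock whose minimum (1100 s) lies at (8, 4). */
static double mock_runtime(const double *x, size_t n)
{
    assert(n == 2);
    double a = x[0] - 8.0, b = x[1] - 4.0;
    return 1100.0 + 5.0 * a * a + 20.0 * b * b;
}

typedef double (*objective_fn)(const double *x, size_t n);

/* Compass search: try a step of size `step` along each coordinate in
 * both directions, keep any improvement, and halve the step when no
 * direction improves. Stops when step < tol or after max_evals
 * objective evaluations. x is updated in place; returns the best value. */
static double compass_search(objective_fn f, double *x, size_t n,
                             const double *lb, const double *ub,
                             double step, double tol, int max_evals)
{
    double best = f(x, n);
    int evals = 1;
    while (step >= tol && evals < max_evals) {
        int improved = 0;
        for (size_t i = 0; i < n && evals < max_evals; ++i) {
            for (int dir = -1; dir <= 1 && evals < max_evals; dir += 2) {
                double old = x[i];
                double trial = old + dir * step;
                /* Keep the trial point inside the bound restrictions. */
                if (trial < lb[i]) trial = lb[i];
                if (trial > ub[i]) trial = ub[i];
                if (trial == old) continue;
                x[i] = trial;
                double val = f(x, n);
                ++evals;
                if (val < best) { best = val; improved = 1; }
                else x[i] = old;   /* reject the trial point */
            }
        }
        if (!improved) step *= 0.5;
    }
    return best;
}
```

Each objective evaluation here costs a function call; in the experiment each one would cost a full TeraSort run, which is why the evaluation limit matters in practice.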
Experiment Execution
• The cluster was composed of 28 slave servers, each with two processors, for a total of 56 available processing slots. It was decided to keep 10% of this total free for tasks that have to be executed more than once because of failures. We used about 100 gigabytes of data generated by Hadoop TeraGen.
• To compare the optimization of the execution time of the jobs, two DFO methods were run, BOBYQA and COBYLA, aiming to identify which method best suits the TeraSort application supplied by Hadoop ....
• Two runs were performed with each algorithm, with 50 iterations, to identify at which point the executions converge to a better runtime.
Switching COBYLA, BOBYQA
/* COBYLA algorithm: Constrained Optimization BY Linear Approximations */
opt = nlopt_create(NLOPT_LN_COBYLA, N);
/* To switch to BOBYQA, comment out the line above and uncomment this one: */
//opt = nlopt_create(NLOPT_LN_BOBYQA, N);
nlopt_set_lower_bounds(opt, lb);
nlopt_set_upper_bounds(opt, ub);
/* nlopt_set_max_objective maximizes the value returned by objetivo ("objective") */
nlopt_set_max_objective(opt, objetivo, NULL);
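The slides do not show the `objetivo` callback itself. NLopt objective callbacks take the arguments `(unsigned n, const double *x, double *grad, void *data)` and return a `double`; with derivative-free algorithms such as COBYLA and BOBYQA the `grad` argument is not used. The sketch below only illustrates what such a callback might look like. It is written without the NLopt header so that it compiles on its own; the `job_runner` struct and `fake_terasort` are hypothetical, and returning the negated runtime is just one assumed way to make a maximizing call favor shorter runs. The original code passes NULL as the data argument, so the authors' callback presumably obtains its state differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context passed through the void* argument: a function
 * that applies the parameters, runs the job, and returns seconds. */
struct job_runner {
    double (*run_seconds)(unsigned n, const double *params);
    unsigned calls;   /* number of jobs launched so far */
};

/* Same argument list and return type as an NLopt objective callback. */
static double objetivo_sketch(unsigned n, const double *x,
                              double *grad, void *data)
{
    struct job_runner *runner = data;
    (void)grad;  /* derivative-free: no gradient is computed */
    assert(runner != NULL && runner->run_seconds != NULL);
    runner->calls++;
    /* Negate so that maximizing this value minimizes the runtime. */
    return -runner->run_seconds(n, x);
}

/* Mock job used only for illustration: a made-up linear runtime model. */
static double fake_terasort(unsigned n, const double *params)
{
    (void)n;
    return 2000.0 - 10.0 * params[0];
}
```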
Experiment Execution
Commands
• First: generate the TeraSort input data.
• TeraGen will generate approximately 100 GB (100 000 179 688 bytes).
• $ hadoop jar $HADOOP_HOME/hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
• $ hadoop jar hadoop-examples-1.0.4.jar teragen 1000000000 terasort-input
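The row count passed to teragen follows from the fixed 100-byte row size stated in the usage line. As a quick check of the arithmetic:

$$
10^{9}\ \text{rows} \times 100\ \tfrac{\text{bytes}}{\text{row}}
= 10^{11}\ \text{bytes}
= 100\ \text{GB (decimal)}.
$$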
Commands
• Second
• [hadoop@nemesis otimizacao]$ nohup sudo time ./dfo_hadoop_terasort < entrada > log_execucao_terasort &
Results
• We used the DFO method with the BOBYQA and COBYLA algorithms.
• The main difference between them is the variation in the execution time of the jobs across iterations of the dfo_hadoop_terasort application.
• This difference is characterized mainly by how each algorithm approximates the objective function from the sampled points: in quadratic or linear form, respectively.
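TeraSort 50 A

| Iteration | Time F (s) |
|---|---|
| 1 | 1491 |
| 2 | 1501 |
| 3 | 1447 |
| 4 | 1889 |
| 5 | 2076 |
| 6 | 1466 |
| 7 | 1470 |
| 8 | 1319 |
| 9 | 1897 |
| 10 | 1611 |
| 11 | 1440 |
| 12 | 1588 |
| 13 | 1321 |
| 14 | 1897 |
| 15 | 1289 |
| 16 | 1704 |
| 17 | 1294 |
| 18 | 1313 |
| 19 | 1728 |
| 20 | 1971 |
| 21 | 1842 |
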
[Figure: TeraSort 50 A using BOBYQA, execution progress. x-axis: number of iterations (1 to 21); y-axis: time in seconds; labeled value: 1289.]
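TeraSort 50 C

| Iteration | Time F (s) |
|---|---|
| 1 | 1587 |
| 2 | 1721 |
| 3 | 1473 |
| 4 | 1669 |
| 5 | 1801 |
| 6 | 1833 |
| 7 | 1486 |
| 8 | 1709 |
| 9 | 1510 |
| 10 | 1962 |
| 11 | 1988 |
| 12 | 1934 |
| 13 | 1898 |
| 14 | 2277 |
| 15 | 1933 |
| 16 | 1516 |
| 17 | 1601 |
| 18 | 1561 |
| 19 | 1639 |
| 20 | 1515 |
| 21 | 1507 |
| 22 | 2205 |
| 23 | 1838 |
| 24 | 2419 |
| 25 | 1744 |
| 26 | 1566 |
| 27 | 1619 |
| 28 | 1890 |
| 29 | 1988 |
| 30 | 1875 |
| 31 | 1620 |
| 32 | 1780 |
| 33 | 1607 |
| 34 | 1536 |
| 35 | 1621 |
| 36 | 1580 |
| 37 | 1626 |
| 38 | 1675 |
| 39 | 2065 |
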
[Figure: TeraSort 50 C using BOBYQA, execution progress. x-axis: number of iterations (1 to 38); y-axis: time in seconds; labeled value: 1473.]
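TeraSort 50 B

| Iteration | Time F (s) |
|---|---|
| 1 | 1437 |
| 2 | 1343 |
| 3 | 1335 |
| 4 | 1228 |
| 5 | 1213 |
| 6 | 1240 |
| 7 | 1198 |
| 8 | 1203 |
| 9 | 1231 |
| 10 | 1178 |
| 11 | 1174 |
| 12 | 1187 |
| 13 | 1186 |
| 14 | 1204 |
| 15 | 1128 |
| 16 | 1150 |
| 17 | 1190 |
| 18 | 1165 |
| 19 | 1190 |
| 20 | 1208 |
| 21 | 1204 |
| 22 | 1113 |
| 23 | 1171 |
| 24 | 1185 |
| 25 | 1190 |
| 26 | 1170 |
| 27 | 1155 |
| 28 | 1211 |
| 29 | 1159 |
| 30 | 1198 |
| 31 | 1206 |
| 32 | 1144 |
| 33 | 1177 |
| 34 | 1179 |
| 35 | 1232 |
| 36 | 1157 |
| 37 | 1201 |
| 38 | 1150 |
| 39 | 1195 |
| 40 | 1178 |
| 41 | 1237 |
| 42 | 1196 |
| 43 | 1233 |
| 44 | 1356 |
| 45 | 1400 |
| 46 | 1674 |
| 47 | 1424 |
| 48 | 1365 |
| 49 | 1366 |
| 50 | 1320 |
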
[Figure: TeraSort 50 B using COBYLA, execution progress. x-axis: number of iterations (1 to 50); y-axis: time in seconds; labeled value: 1113.]
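TeraSort 50 New

| Iteration | Time F (s) |
|---|---|
| 1 | 1442 |
| 2 | 1298 |
| 3 | 1285 |
| 4 | 1285 |
| 5 | 1274 |
| 6 | 1329 |
| 7 | 1343 |
| 8 | 1314 |
| 9 | 1289 |
| 10 | 1308 |
| 11 | 1304 |
| 12 | 1322 |
| 13 | 1345 |
| 14 | 1421 |
| 15 | 1369 |
| 16 | 1336 |
| 17 | 1348 |
| 18 | 1335 |
| 19 | 1333 |
| 20 | 1307 |
| 21 | 1367 |
| 22 | 1369 |
| 23 | 1352 |
| 24 | 1362 |
| 25 | 1390 |
| 26 | 1350 |
| 27 | 1324 |
| 28 | 1382 |
| 29 | 1347 |
| 30 | 1339 |
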
[Figure: TeraSort 50 New using COBYLA, execution progress. x-axis: number of iterations (1 to 30); y-axis: time in seconds; labeled value: 1274.]
[Figure: four execution-progress panels shown together: TeraSort 50 B using COBYLA, TeraSort 50 A using BOBYQA, TeraSort 50 New using COBYLA, and TeraSort 50 C using BOBYQA. x-axis of each panel: number of iterations.]
The DFO method was used with the BOBYQA and COBYLA algorithms. The main difference between them is the variation in the execution time of the jobs across iterations of the dfo_hadoop_terasort application, which is characterized mainly by how each algorithm approximates the objective function from the sampled points: in quadratic or linear form, respectively.
[Figure: Difference between Algorithms COBYLA and BOBYQA. Series: TeraSort 50 New/COBYLA, TeraSort 50 B/COBYLA, TeraSort 50 A/BOBYQA, TeraSort 50 C/BOBYQA. x-axis: iterations 1 to 49; y-axis: time in seconds.]
Conclusion
• The convergence of the total time proves to be more stable with COBYLA, without many fluctuations, when compared to the BOBYQA algorithm.
• The speedup obtained with the BOBYQA algorithm in the execution of the TeraSort application is 12% on average.
• The results reported for the COBYLA algorithm show an average speedup of 21.15% in the execution of the TeraSort application over the initial Hadoop settings, a greater optimization than with the BOBYQA algorithm.
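The slides do not state how the speedup percentages were computed. One common definition, assumed here, is the relative reduction of the best observed runtime with respect to the runtime of the first iteration, which ran with the initial settings. The sketch below implements that assumed definition; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest runtime observed in a run. */
static double best_time(const double *times, size_t n)
{
    assert(n > 0);
    double best = times[0];
    for (size_t i = 1; i < n; ++i)
        if (times[i] < best) best = times[i];
    return best;
}

/* Assumed definition: percentage reduction of the best runtime
 * relative to the first iteration (the initial settings). */
static double improvement_percent(const double *times, size_t n)
{
    assert(n > 0 && times[0] > 0.0);
    return 100.0 * (times[0] - best_time(times, n)) / times[0];
}
```

With the TeraSort 50 B values (first iteration 1437 s, best iteration 1113 s) this definition gives about 22.5%. It does not necessarily reproduce the averages quoted in the conclusion, which may have been computed differently.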