
LARGE-SCALE NON-LINEAR REGRESSION WITHIN THE MAPREDUCE FRAMEWORK

    by

    Ahmed Khademzadeh

    A thesis submitted to the College of Engineering at

    Florida Institute of Technology

    in partial fulfillment of the requirements

    for the degree of

    Master of Science

    in

    Computer Science

    Melbourne, Florida

    July 2013

© Copyright by Ahmed Khademzadeh 2013. All Rights Reserved.

    The author grants permission to make single copies

We the undersigned committee hereby approve the attached thesis

LARGE-SCALE NON-LINEAR REGRESSION WITHIN THE MAPREDUCE FRAMEWORK

by Ahmed Khademzadeh

Philip Chan, Ph.D.
Associate Professor, Computer Sciences
Principal Adviser

Marius Silaghi, Ph.D.
Assistant Professor, Computer Sciences

Georgios C. Anagnostopoulos, Ph.D.
Associate Professor, Electrical & Computer Engineering

William D. Shoaff, Ph.D.
Associate Professor and Department Head, Computer Sciences

Abstract

    Large-scale Non-linear Regression within the MapReduce Framework

    By: Ahmed Khademzadeh

    Thesis Advisor: Philip Chan, Ph.D.

Regression models have many applications in real-world problems such as finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us construct better models from the data. The issue with big datasets is that they need a long time to be processed, or even just to be read, on a single machine. This research employs MapReduce to model large-scale non-linear regression problems in a parallel fashion. The MRRT (MapReduce Regression Tree) algorithm divides the feature space into overlapping subspaces and then shuffles each subspace's data items to a node in the cluster. Each node in the cluster then constructs a regression tree for the subspace of the data it has received. Different versions of the algorithm (overlapping/non-overlapping subspaces and weighted/unweighted prediction using neighboring models) are proposed and compared with the regression tree (RT) algorithm implemented in the Matlab libraries.

Experiments on synthetic and real datasets show that the MRRT algorithm, which is devised to be fast and scalable for the MapReduce framework, not only has close to linear speedup and close to optimum scalability, but also outperforms the RT algorithm in terms of accuracy (in most cases) and improves the prediction time by more than 80%. Although MRRT is designed for the MapReduce framework, it can also be used on a single machine; in that case it improves the learning time by 60% (in most cases) compared to the RT algorithm and shows close to linear scalability (compared to the RT algorithm, which has roughly quadratic scalability).

Contents

    Abstract iii

    Preface xiii

    Acknowledgments xiv

    1 Introduction 1

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.3 Overview of Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.4 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.5 Overview of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2 Literature Review 6

    2.1 Approximating Non-linear Regression Using Piecewise Regression . . . . . . . . 6

    2.1.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.1.2 Non-linear Regression via Piecewise Linear Regression . . . . . . . . . . 7

    2.1.3 Piecewise Regression with Regression Trees . . . . . . . . . . . . . . . . 8

    2.1.4 Piecewise Linear Approximation of Time Series . . . . . . . . . . . . . . 10

    2.1.5 Online Approximation of Non-linear Models . . . . . . . . . . . . . . . . 12

    2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Why <Key, Value> Pairs? . . . . . . . . . . . . . . . . . . . . 14

    2.2.2 Is That All MapReduce Does? . . . . . . . . . . . . . . . . . . . . . . . 14

    2.2.3 MapReduce for Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.2.4 MapReduce and Iterative Tasks . . . . . . . . . . . . . . . . . . . . . . . 19

    2.2.5 Arguments about Using or not Using MapReduce . . . . . . . . . . . . 20

3 Approach 22

    3.1 MapReduce Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    3.1.1 Map1: Finding the Min and Max of Dimension that Is Being Split . . . 25

    3.1.2 Reduce1: Finding Split Points Along the Dimension that Is Being Split 26

    3.1.3 Map2: Shuffling the Data Among Cluster Nodes . . . . . . . . . . . . . 27

    3.1.4 Reduce2: Constructing the Tree Regression Models for Each Subspace . 28

    3.1.5 Using the MRRT Model to Predict . . . . . . . . . . . . . . . . . . . . . 28

    3.2 Slope-changing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    3.2.1 Choosing Good Split Points . . . . . . . . . . . . . . . . . . . . . . . . . 28

    3.2.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 30

    3.2.3 Map1: Finding Candidate Split Points . . . . . . . . . . . . . . . . . . . 31

    3.2.4 Reduce1 : Generating a Split Point Set from Candidate Set . . . . . . . 35

    3.2.5 Map2 : Shuffling the Data Points Based on Split Points . . . . . . . . . 39

    3.2.6 Reduce2 : Finding the Linear Model for Each Subspace . . . . . . . . . 39

    3.2.7 Using the Slope-changing Model to Predict . . . . . . . . . . . . . . . . 40

    4 Empirical Evaluation 42

    4.1 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    4.1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    4.1.2 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    4.1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    4.2 Overview of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    4.2.1 Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    4.2.2 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    4.3 Overview of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

    4.4 MRRT Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    4.4.1 Number of Dimensions to Split Along . . . . . . . . . . . . . . . . . . . 47

    4.4.2 Overlapping Subspaces and Neighbor-weighted Predictions . . . . . . . 50

    4.4.3 Comparing the Accuracy of MRRT and the Baseline Algorithm . . . . . 54

    4.4.4 Choosing the Dimension to Split Along . . . . . . . . . . . . . . . . . . 58

    4.4.5 Prediction Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.4.6 Speedup of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.4.7 Scalability of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . 69

    4.4.8 Could MRRT Be Used as a Sequential Algorithm? . . . . . . . . . . . . 72

    4.5 Slope-changing Experiments Results . . . . . . . . . . . . . . . . . . . . . . . . 77

    4.5.1 Slope-changing Algorithm Limitation . . . . . . . . . . . . . . . . . . . . 77

4.5.2 Comparing Accuracy of Slope-changing Algorithm to Baseline Algorithm 78

    4.5.3 Comparing Runtime of Slope-changing Algorithm to Baseline Algorithm 79

    5 Concluding Remarks 80

    5.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5.2 Possible Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    A Synthetic Datasets Details 83

    Bibliography 88

List of Tables

    4.1 Summary of Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    4.2 Summary of Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on synthetic datasets. As can be seen, neither method of dividing the feature space supersedes the other, and there is no obvious reason to prefer one over the other based on this experiment. The learning time of the algorithm is also similar for both methods. . . . . . . . . 48

4.4 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on real datasets. As can be seen, the two-dimension split wins in accuracy and the one-dimension split wins in learning time. The accuracy difference is minor, but the learning time difference is significant. . . . . . . . . 48

4.5 Comparing accuracy of MRRT(W) and MRRT, both with no overlap. As can be seen, the MRRT(W) algorithm works better than MRRT on most datasets. 53

    4.6 Comparing accuracy of Weighted Overlapping MapReduce Regression Tree and

    baseline algorithm on 10-dimensional synthetic datasets. Numbers in the table

    are RMSE values. MRRT(WO) algorithm always performs better than baseline

    algorithm, when splitting the feature space is done along one dimension and if

    the dimension to split is chosen properly. . . . . . . . . . . . . . . . . . . . . . 55

    4.7 Comparing accuracy of Weighted Overlapping MapReduce Regression Tree and

    baseline algorithm on 20-dimensional synthetic datasets. Numbers in the table

    are RMSE values. MRRT(WO) algorithm always performs better than baseline

    algorithm, when splitting the feature space is done along one dimension and if

    the dimension to split is chosen properly. . . . . . . . . . . . . . . . . . . . . . 56

4.8 Comparing accuracy of MRRT(WO) and baseline algorithm on real datasets. Numbers in the table are RMSE values. MRRT(WO) algorithm always performs better than baseline algorithm, when splitting the feature space is done along one dimension and if the dimension to split is chosen properly. . . . . . 57

    4.9 Dimensions with lowest RMSE on synthetic datasets and rank of same dimen-

    sion on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . 61

    4.10 Dimensions with lowest RMSE on sample of synthetic datasets and RMSE of

    dataset when divided along same dimension using MRRT(WO) algorithm. . . . 62

    4.11 Dimensions with lowest RMSE on real datasets and rank of same dimension

    on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . . . . 63

    4.12 Dimensions with lowest RMSE on sample of real datasets and RMSE of dataset

    when divided along same dimension using MRRT(WO) algorithm. . . . . . . . 64

4.13 Comparing prediction time of MRRT(O), MRRT(WO) and baseline algorithm on the 20-dimensional ttoy20d3 synthetic test set containing 1000 test items for different cluster sizes. MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . 65

4.14 Comparing prediction time of MRRT(O), MRRT(WO) and baseline algorithm on the real datasets' test sets containing 4111 test items for different cluster sizes. MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . . . . . . 66

4.15 Comparing learning time of MRRT(WOS) and baseline algorithm on the 20-dimensional ttoy20d3 synthetic dataset for different numbers of subspaces. MRRT(WOS) always performs better than the baseline algorithm, while also having better accuracy. . . . . . . . . 73

4.16 Comparing accuracy of MRRT(WOS) and baseline algorithm on the 20-dimensional ttoy20d3 synthetic datasets for different numbers of subspaces when the dataset is divided into subspaces along the first dimension. MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases. . . . . . . . . 74

    4.17 Comparing learning time of MRRT(WOS) and baseline algorithm on real datasets

    on different number of subspaces. MRRT(WOS) algorithm’s learning time is

    always less than baseline algorithm except in one case when number of sub-

    spaces is 32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.18 Comparing accuracy of MRRT(WOS) and baseline algorithm on three real datasets for different numbers of subspaces when the dataset is divided into subspaces along the first dimension. MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases, and it mostly decreases with an increasing number of subspaces. . . . . . . . . 76

    4.19 Summary of synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    4.20 Comparing accuracy of slope-changing algorithm (PWC and FPS versions) and

    baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . . . 78

    4.21 Comparing learning time of slope-changing algorithm (PWC and FPS versions)

    and baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . 79

List of Figures

2.1 A regression tree (left), and the corresponding 2-dimensional feature space (right). Each tree node corresponds to a subspace of the feature space [19] . . . 7

    2.2 MapReduce Execution Overview [3]. . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.1 Dataset distribution among cluster nodes with overlap to decrease borderline

    data points prediction error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    3.2 Different overlap factors of subspaces on cluster nodes . . . . . . . . . . . . . . 25

3.3 Bad split points cause bad piecewise linear models and higher prediction error 29

3.4 Good split points help to build better piecewise linear models and lower prediction error. . . . . . . . . 30

3.5 Finding the data points with maximum target value by gridifying data points and using an initial random seed . . . . . . . . 32

    3.6 Using Parzen Window Classifier to find areas with many candidate split points [18] 36

    4.1 Splitting the feature space to subspaces. . . . . . . . . . . . . . . . . . . . . . . 47

    4.2 Summary of MRRT versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional datasets (gtoy10d1, gtoy10d2, ptoy10d1, ptoy10d2, ttoy10d1 and ttoy10d2) with different overlap values when dividing along the first dimension. . . . . . . . 51

4.4 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional datasets (ptoy20d1, ptoy20d2, ttoy20d1 and ttoy20d2) with different overlap values when dividing along the first dimension. . . . . . . . 53

    4.5 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on IHEPC1

    real dataset with different overlap values when dividing along first dimension. . 54

4.6 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . 59

4.7 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . 60

4.8 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on the IHEPC1 real dataset with overlap = 0.75, when splitting along different dimensions. . . 63

4.9 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. The runtime of the same algorithms on a single machine is 2609.6 seconds. . . . . . . . . 67

4.10 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the IHEPC1, IHEPC2, and IHEPC3 real datasets respectively with overlap = 0.75 when splitting along the first dimension. . . . . . . . . 68

4.11 Analyzing scalability of baseline, MRRT(WO) and MRRT(WOS) algorithms on ttoy20d3 datasets with overlap = 0.75 when changing the dataset size from 50,000 items to 1,000,000 data items. . . . . . . . . 69

4.12 Analyzing scalability of baseline, MRRT(WO) and MRRT(WOS) algorithms on IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when changing the dataset size from 103,557 items to 2,071,148 data items. . . . . 71

4.13 Comparing runtime of MRRT(WO), MRRT(WOS) and baseline algorithm on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. . . 72

4.14 Comparing runtime of MRRT(WO), MRRT(WOS) and baseline algorithm on IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when splitting along the first dimension. . . . . . . . . 74

List of Algorithms

    1 Basic Regression Tree Construction Algorithm . . . . . . . . . . . . . . . . . . 9

    2 MapReduce Regression Tree Algorithm - Main Method . . . . . . . . . . . . . . 23

    3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round 24

    4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce

    Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce

    Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce

    Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    7 MapReduce Regression Tree Algorithm - Prediction . . . . . . . . . . . . . . . 27

    8 Slope-changing Algorithm - Main Method . . . . . . . . . . . . . . . . . . . . . 30

    9 Slope-changing Algorithm - Initialization . . . . . . . . . . . . . . . . . . . . . . 31

    10 Slope-changing Algorithm - Map Phase of First MapReduce Round . . . . . . . 31

    11 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Parzen

    Window Classifier Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    12 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Fitness

    Proportional Selection Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    13 Slope-changing Algorithm - Map Phase of Second MapReduce Round . . . . . 38

    14 Slope-changing Algorithm - Reduce Phase of Second MapReduce Round . . . . 39

    15 Slope-changing Algorithm - Prediction . . . . . . . . . . . . . . . . . . . . . . . 40

Preface

Acknowledgments

Chapter 1

    Introduction

    1.1 Motivation

The goal is approximating a non-linear regression model using piecewise linear models for large-scale datasets. Regression models have many applications in the real world, among which we can name trend lines, finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us find better models from the data. The issue with big datasets is that they also need a long time to be processed on a single machine. When the dataset is very large (terabyte scale), even reading the content of the dataset takes a very long time (a high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will require three hours to read a 4 TB dataset! [12]). For this reason we need to use parallel and distributed methods to process big datasets.

There are many options for parallel data processing. We have decided to use the MapReduce programming model as the distributed data processing framework. MapReduce is a programming model introduced by Google in 2004 for processing large datasets [3]. We have chosen MapReduce as the distributed data processing framework for the following reasons:

• MapReduce handles many of the issues with large-scale distributed data processing, such as the distributed file system. The Google File System (GFS) is the original file system it uses. GFS makes all data transfer and distribution across different cluster nodes transparent to the programmer. The user simply copies a file onto the cluster, and GFS decides how the file is distributed among cluster nodes, keeps track of the chunks of the file, and also manages replication of the chunks on different nodes for fault tolerance purposes.

• Fault tolerance is another thing that MapReduce takes care of. The programmer does not need to worry about resolving node failures. If a node fails, MapReduce itself manages the problem and assigns its tasks to other cluster nodes.

• Code and data migration is also managed by MapReduce. All the mapper nodes in the cluster run the same map code on their data. MapReduce takes care of delivering the code to all mappers and running the code on the nodes. The result of the map round needs to be shuffled among cluster nodes (delivered to reducers), and MapReduce takes care of this data shuffling too. The reduce phase and the code migration in that phase are also managed by the MapReduce framework.

• MapReduce simplifies solving a distributed data processing problem by introducing a high-level programming model for distributed data processing. It helps programmers concentrate on program logic, while all the details and issues related to the distributed nature of the solution are managed by MapReduce. Although MapReduce restricts us and reduces flexibility in some ways, it gives us a standard way of describing distributed data processing algorithms.

• MapReduce is one of the common ways of solving distributed data processing problems in industry these days.

Details about how MapReduce works are explained in section 2.2.

    1.2 Problem Statement

    We are handling a large-scale non-linear regression problem. Regression is a supervised learn-

    ing technique in which the algorithm tries to find a model from a dataset to generate a

    numerical prediction for future data items. We will call the numerical dependent variable

    (target variable) y, and try to approximate its value as a function of other numerical values

x. Here x is a vector consisting of n numerical values x1, x2, . . . , xn, where n is the number of features (attributes) of each data item in the dataset.

y = f(x) + ε   (1.1)

In the above equation, ε is the difference between the actual and predicted values of the target variable. The predicted value for y is f(x) and is denoted by the symbol ŷ. There are different ways to handle a non-linear regression problem.

We intend to find a solution for large-scale datasets. Handling large-scale datasets can be very slow if parallel and distributed data processing techniques and frameworks are not used. For the reasons mentioned in section 1.1, the programming model we have employed to handle large-scale datasets is MapReduce. When using MapReduce, the method that would be employed to solve the problem sequentially needs to be recast and translated into the MapReduce programming model. Some details about how MapReduce works are explained in section 2.2.

Designing an algorithm for the MapReduce framework (map and reduce phases) entails issues such as deciding what processing the cluster nodes need to perform on their local pieces of data and what information they need to extract, in order to compensate for the fact that no single node has a global view of the data. The other challenge when designing a MapReduce-based algorithm is how the final result is aggregated. Generally, one problem can be handled by MapReduce in several different ways, and choosing the best way to make use of MapReduce's capabilities is the main challenge. Since there is no communication between different nodes during the map and reduce phases, and results can only be communicated once the map phase is done, an effective strategy is needed for extracting useful information from the partial views that the different mappers have of the partial data in their hands, and for making use of this information in the reduce phase (or in subsequent MapReduce rounds).

    1.3 Overview of Approach

In this work, two different distributed algorithms for approximating the non-linear regression model of a dataset using piecewise regression are suggested. Both algorithms are designed for the MapReduce framework.

The first algorithm is called the MapReduce Regression Tree (MRRT) algorithm. This algorithm divides the feature space into equal-size partitions (equal-size in terms of volume, not the number of data points in the partition). To form the partitions, the feature space is divided along one of its dimensions. This dimension is selected randomly or using a pre-processing method that works on a sample of the dataset. Data items belonging to different subspaces are then sent to different reducers, and all reducers construct regression tree models (in parallel) for the partition they have received. Although a reducer technically needs only one partition of the feature space to generate its model, we send the left and right partitions of each partition to the reducer too (overlapping subspaces). This way each reducer receives three partitions instead of one (the leftmost and rightmost partitions of the dataset have only one neighbor, and the corresponding reducers receive two partitions instead of three). Since we are sending extra information to each reducer, the amount of data that needs to be transferred over the network and processed on each machine increases. This redundancy has a beneficial side effect: it increases the accuracy of the final model by decreasing the prediction error of data items located near the borderlines. The algorithm uses a weighted prediction mechanism in order to increase accuracy further. Details of this algorithm are explained in section 3.1.
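As a rough illustration of this data-shuffling step, the following hedged sketch (not the thesis implementation; the function names, the equal-width partitioning along a single pre-chosen dimension, and the clamping of boundary values are assumptions made for the example) routes a data item to the reducer owning its own partition plus the reducers owning the two neighboring partitions:

    # Illustrative sketch (not the thesis code): routing a data item to its own
    # partition and to its left/right neighbors to realize overlapping subspaces.

    def partition_index(value, lo, hi, num_partitions):
        """Index of the equal-width partition along the chosen dimension."""
        width = (hi - lo) / num_partitions
        idx = int((value - lo) / width)
        return min(max(idx, 0), num_partitions - 1)   # clamp boundary values

    def target_reducers(value, lo, hi, num_partitions):
        """Reducers that should receive this item: its partition plus neighbors."""
        idx = partition_index(value, lo, hi, num_partitions)
        neighbors = {idx - 1, idx, idx + 1}
        return sorted(i for i in neighbors if 0 <= i < num_partitions)

    # Example: value 2.4 in [0, 10) with 5 partitions goes to reducers 0, 1 and 2.
    print(target_reducers(2.4, 0.0, 10.0, 5))   # [0, 1, 2]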

The second algorithm is called the Slope-changing algorithm. In this algorithm the dataset is distributed among cluster nodes in a random fashion. Every mapper has part of the dataset at hand and finds a set of candidate split points in that part of the dataset. Split points are points that will be used to split the feature space into smaller subspaces. These candidate split points include points with locally maximum or minimum target values. Points at which the model's slope changes sharply are also selected by the mappers as candidate split points. All the candidate split points found by the mappers are sent to a single reducer in order to select the final split point set from this set of candidate points. Two different methods are suggested for making this selection: one uses a Parzen window classifier, and the other uses fitness proportional selection. After the split point set is selected, it is sent to all mappers in the cluster. All mappers use these split points to partition the data according to the subspaces formed by the split points. All mappers then send the data points pertaining to a certain subspace to a certain reducer. This way each reducer receives all the data points of a certain subspace from all mappers, and can construct a linear model for that subspace. In this way a piecewise linear model over all subspaces of the feature space is constructed. This piecewise linear model is used to predict the target value for future test items based on the subspace in which the test item is located. Details of this algorithm are explained in section 3.2.
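The candidate-split-point idea can be illustrated with a small, hedged sketch for a sorted 1-D chunk held by one mapper (the function name, the slope-change threshold, and the exact tests are illustrative assumptions, not the thesis code):

    # Illustrative sketch: a mapper scanning its local, sorted 1-D chunk for
    # candidate split points -- local extrema of the target and points where the
    # slope between consecutive points changes sharply.

    def candidate_split_points(xs, ys, slope_change_threshold=1.0):
        candidates = []
        for i in range(1, len(xs) - 1):
            left_slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
            right_slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
            is_extremum = (ys[i] > ys[i - 1] and ys[i] > ys[i + 1]) or \
                          (ys[i] < ys[i - 1] and ys[i] < ys[i + 1])
            sharp_change = abs(right_slope - left_slope) > slope_change_threshold
            if is_extremum or sharp_change:
                candidates.append(xs[i])
        return candidates

    xs = [0, 1, 2, 3, 4, 5]
    ys = [0, 1, 2, 1, 0, 3]          # peak at x=2, valley at x=4
    print(candidate_split_points(xs, ys))   # [2, 4]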

    Both MRRT and Slope-changing algorithms divide the feature space into subspaces and

    find models for each subspace, but they are dividing the feature spaces differently. MRRT

    divides the space in equal-size subspaces, but size and number of subspaces in Slope-changing

    algorithm might be different and is determined by split points chosen by the algorithm.

    Another difference is that MRRT constructs regression tree models for subspaces, but Slope-

    changing algorithm constructs linear models for subspaces. MRRT also uses overlapping

    subspaces, but Slope-changing algorithm does not use overlapping subspaces.

    1.4 Overview of Contributions

    Because of the limitation of Slope-changing algorithm (that is discussed in section 4.5.1), it

is not applicable to high-dimensional datasets. For this reason we only list the contributions of the MRRT algorithm:

• Overlapping subspaces (coupled with weighted prediction) not only address the distributed nature of the data, but also help to improve accuracy over the baseline (regression tree) algorithm. If the preProcess method is employed to choose the dimension to split, MRRT improves the accuracy for 8 out of 10 synthetic datasets, by 1.1% to 32.86%, and for all three real datasets, by 4.66%, 13.24%, and 22.73% respectively.

• The MRRT algorithm exhibits close to linear speedup (for two out of the four datasets experimented on) and near-optimum scalability for all datasets.

• Although MRRT's prediction is done sequentially and not on a MapReduce framework, it improves the prediction time by more than 80% compared to the regression tree algorithm.

• MRRT can be used on a single machine, and in that case it improves the learning time by 60% (in most cases) compared to the regression tree algorithm.

• MRRT needs to choose a dimension to split along. The preProcess method we have proposed for MRRT (to choose the dimension to split) increases the accuracy of the model for 11 out of 13 datasets compared to the model constructed by the regression tree algorithm.

    1.5 Overview of Chapters

In chapter 2 we review the literature related to regression, piecewise regression and regression trees. We also talk about MapReduce and about some large-scale problems that have been solved using MapReduce. Limitations of MapReduce and arguments about these limitations are also discussed in that chapter. Chapter 3 presents details of the algorithms we propose for the piecewise approximation of a non-linear model within MapReduce. The next chapter presents the empirical evaluation of the algorithms and compares them with the baseline (regression tree) algorithm. Chapter 5 summarizes the findings and presents the concluding remarks.

Chapter 2

    Literature Review

    In this thesis two distributed MapReduce-based algorithms for approximating large-scale non-

    linear regression using piecewise regression are proposed. Two major parts of the problem are

approximating non-linear regression using piecewise regression, and the MapReduce framework. We review the literature related to these two major subproblems in the following sections.

2.1 Approximating Non-linear Regression Using Piecewise Regression

    2.1.1 Linear Regression

    Linear regression could be used when there is a linear (or roughly linear) dependency between

    x and y (x and y are introduced in section 1.2). In this case the learning algorithm tries to

    model y as a linear function of x:

y = β0 + β1x + ε   (2.1)

In the above equation the x and β1 vectors have a size equal to the number of dimensions of the feature space, and ε is the difference between the actual and predicted values of the target variable (the error term). We use the symbol ŷ to indicate the value of the target variable predicted by the model, and we have ŷ = β0 + β1x. The learning algorithm tries to learn the β0 and β1 values (called weights) from the training items in the dataset. When learning the weights, the objective is to minimize the difference between the actual and predicted values over all data items (as an example, this difference could be measured by minimizing the sum of squared differences between the actual and predicted target


values):

∑_{i=1}^{n} (y^(i) − ŷ^(i))² = ∑_{i=1}^{n} (y^(i) − (β0 + β1 · x^(i)))²   (2.2)

We use <x^(k), y^(k)> to indicate the kth data item in the dataset.
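As a concrete and purely illustrative example of minimizing equation 2.2, a NumPy least-squares fit on a tiny dataset might look as follows; the column of ones is appended so that the first coefficient plays the role of β0 (the data values are made up for the example):

    # A minimal least-squares sketch of equation 2.2 using NumPy (illustrative,
    # not the thesis implementation).
    import numpy as np

    X = np.array([[1.0], [2.0], [3.0], [4.0]])        # n data items, 1 feature
    y = np.array([2.1, 3.9, 6.2, 8.1])                # target values y^(i)

    X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend intercept column
    beta, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)

    y_hat = X1 @ beta                                 # predictions y-hat
    sse = float(np.sum((y - y_hat) ** 2))             # the quantity minimized
    print(beta, sse)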

    2.1.2 Non-linear Regression via Piecewise Linear Regression

One of the advantages of linear regression is its simplicity, and one of its disadvantages is its globality: a single model covers the whole feature space. When the relation between x and y is complex and non-linear, even the best possible linear model will have a high average prediction error. Partitioning the feature space into smaller subspaces and constructing a model for each subspace can help in finding a better model and reducing the error. Piecewise methods use this idea and find constant or linear models for each subspace of the feature space instead of one global linear model.

A constant model for a subspace containing a set of data items s1 = {<x^(1), y^(1)>, <x^(2), y^(2)>, . . . , <x^(n), y^(n)>} is calculated as follows:

ŷ(s1) = (1 / size(s1)) ∑_{k∈s1} y^(k)   (2.3)

and the prediction for any new data item that lies in this subspace is ŷ(s1).

In most cases it is better to find a linear model for each subspace of the feature space. In that case, equation 2.2, given in the previous section, is used by the linear regression learning algorithm to find a linear model for each subspace.
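A minimal sketch of a piecewise-constant model in the sense of equation 2.3, assuming equal-width intervals along a single feature (the helper names and the interval scheme are illustrative, not taken from the thesis):

    # Illustrative sketch of a piecewise-constant model (equation 2.3): the
    # feature space is cut into equal-width intervals along one feature, and the
    # prediction for a subspace is the mean target value of the items inside it.
    from collections import defaultdict

    def fit_piecewise_constant(xs, ys, lo, hi, num_subspaces):
        width = (hi - lo) / num_subspaces
        sums, counts = defaultdict(float), defaultdict(int)
        for x, y in zip(xs, ys):
            idx = min(int((x - lo) / width), num_subspaces - 1)
            sums[idx] += y
            counts[idx] += 1
        return {idx: sums[idx] / counts[idx] for idx in sums}, width

    def predict(model, width, lo, x):
        idx = min(int((x - lo) / width), max(model))
        return model.get(idx)

    model, width = fit_piecewise_constant([0.5, 1.5, 1.8, 3.2], [1.0, 2.0, 4.0, 9.0],
                                          lo=0.0, hi=4.0, num_subspaces=4)
    print(predict(model, width, 0.0, 1.6))   # mean of the items in [1, 2): 3.0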

Figure 2.1: A regression tree (left), and the corresponding 2-dimensional feature space (right). Each tree node corresponds to a subspace of the feature space [19].


    Dividing the feature space into subspaces can be done in different ways. A simple way

    of dividing the feature space into smaller subspaces is using equal-size subspaces. It also is

possible to let the algorithm decide on the borderlines of the subspaces. The regression tree, presented in the next section, uses a recursive method to divide the feature space into subspaces.

    Figure 2.1 depicts a regression tree and also shows how the feature space is divided into

    smaller subspaces based on this regression tree. Leaves of the regression tree are models for

    each subspace of the feature space (ŷi is a model for Ri).

    2.1.3 Piecewise Regression with Regression Trees

    Regression tree is a piecewise method that recursively partitions the feature space into smaller

    subspaces. The tree itself consists of nodes and edges. Every node contains a simple condition,

e.g. xi < 10 (i.e., whether its ith feature's value is smaller than 10 or not), and one of the branches is chosen based on the current data item's answer to this question. To find the prediction for a new data item, the tree is traversed starting from the root until a leaf is reached. Leaves of the regression tree contain a model such as a linear model or a constant model.

Constructing a regression tree is an iterative task. In each iteration a feature and a corresponding threshold value need to be chosen by the algorithm. We call such a pair <Feature, Value> a split point. Selecting split points can be a critical task when constructing piecewise models. When selecting a split point pair among different candidate split point pairs, a metric is used to evaluate the different trees corresponding to the different split point pairs. The tree and corresponding split point that perform best according to the metric are chosen to be used in the next iteration. The basic regression tree algorithm can use the Sum of Squared Errors (SSE) to evaluate a tree T [8]:

S = ∑_{c∈leaves(T)} ∑_{i∈c} (y^(i) − ŷ(c))²   (2.4)

where

ŷ(c) = (1 / size(c)) ∑_{i∈c} y^(i)   (2.5)

is the predicted value for all data items landing in that leaf.

Algorithm 1 lists the basic algorithm for constructing a regression tree. In this algorithm, first all the data items of the dataset are assigned to the root node (line 2). The ŷ(c) and SSE values are then calculated for the root node (lines 3-4). Afterward a repetitive task (lines 6-32) is applied to each leaf of the tree, and each leaf is populated with two children until a certain condition holds (lines 26-30). For each leaf of the tree all possible split pairs <Feature, Value> are examined and the pair that reduces the SSE of the leaf the most is chosen (lines 12-25). If the chosen pair reduces the SSE by more than a threshold δ, the node is populated with two children; otherwise that leaf is kept untouched (lines 26-31). If the number of data items of a node is less than a threshold q, that node is also kept untouched (lines 8-10).

Algorithm 1 Basic Regression Tree Construction Algorithm

 1: procedure ConstructRegTree(dataset)
 2:   root.dataItems = dataset
 3:   root.ŷ(c) = (1/size(dataset)) ∑_{i∈dataset} y^(i)
 4:   root.sse = ∑_{i∈dataset} (y^(i) − root.ŷ(c))²
 5:   queue.add(root)
 6:   while !queue.isEmpty do
 7:     node = queue.remove
 8:     if size(node.dataItems) < q then
 9:       continue
10:     end if
11:     bestSplitPair.sse = ∞
12:     for splitPair ∈ allSplitPairs do
13:       left.dataItems = splitDataItems(node.dataItems, splitPair, left)
14:       right.dataItems = splitDataItems(node.dataItems, splitPair, right)
15:       left.ŷ(c) = (1/size(left.dataItems)) ∑_{i∈left.dataItems} y^(i)
16:       left.sse = ∑_{i∈left.dataItems} (y^(i) − left.ŷ(c))²
17:       right.ŷ(c) = (1/size(right.dataItems)) ∑_{i∈right.dataItems} y^(i)
18:       right.sse = ∑_{i∈right.dataItems} (y^(i) − right.ŷ(c))²
19:       if bestSplitPair.sse > left.sse + right.sse then
20:         bestSplitPair.splitPair = splitPair
21:         bestSplitPair.sse = left.sse + right.sse
22:         bestSplitPair.left = left
23:         bestSplitPair.right = right
24:       end if
25:     end for
26:     if node.sse − bestSplitPair.sse > δ then
27:       node.left = bestSplitPair.left
28:       node.right = bestSplitPair.right
29:       queue.add(node.left)
30:       queue.add(node.right)
31:     end if
32:   end while
33: end procedure
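For illustration only, the greedy split search of lines 12-25 could be written in Python roughly as follows, under the assumption that the candidate thresholds are the observed feature values; this is a sketch, not the thesis implementation:

    # Illustrative Python sketch of the greedy split search (lines 12-25 of
    # Algorithm 1): try every <feature, value> pair observed in the node and keep
    # the one with the smallest combined SSE of the two children.
    import numpy as np

    def sse(y):
        return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

    def best_split(X, y):
        best = (np.inf, None, None)                 # (sse, feature, value)
        for f in range(X.shape[1]):
            for v in np.unique(X[:, f]):
                mask = X[:, f] < v
                if mask.all() or (~mask).all():
                    continue                        # split must be non-trivial
                total = sse(y[mask]) + sse(y[~mask])
                if total < best[0]:
                    best = (total, f, float(v))
        return best

    X = np.array([[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5]])
    y = np.array([1.0, 1.2, 9.0, 9.5])
    print(best_split(X, y))   # splitting feature 0 at 8.0 separates the two groups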

One of the issues with the basic regression tree algorithm is its use of a greedy method to select the feature and value to split on. There are two problems with a greedy method for selecting the split points. First, since greedy methods make their decisions based on a locally optimal

    choice, their final model might be a suboptimal model in terms of accuracy. Second, when

the number of dimensions and the size of the dataset are large, finding these split points (even greedily)

    would have a very high runtime. We need to find methods to increase accuracy and decrease

    the runtime.

Regression trees and piecewise linear regression were proposed for the setting in which the dataset is not distributed. When the dataset is large, algorithms to generate the regression model can be very slow (splitting all data items of all leaf nodes into two subsets for every <Feature, Value> pair is an expensive task for a high-volume, high-dimensional dataset). Thus for large-scale datasets, new technologies, techniques and algorithms need to be used to perform the task more efficiently. Section 2.2 discusses MapReduce, the framework we have used for distributed data processing.

    2.1.4 Piecewise Linear Approximation of Time Series

    Piecewise linear representation (PLR) is generally used to approximate time series with

    straight lines (hyper planes). Piecewise linear representation is more efficient than other

    modeling techniques in terms of storage, transmission and computation and has several ap-

    plications in clustering, classification, similarity search, etc. [10].

Piecewise linear representation algorithms are also called Segmentation Algorithms (SAs). Three different specifications have been defined for SAs. For a time series T, find the best representation that

    • Includes only K segments,

    • The error for each segment does not exceed a threshold, and

    • The total error does not exceed a threshold.

    A PLR can be either online or batch [10].

PLR algorithms can be divided into three categories: bottom-up, top-down, and sliding-window. The bottom-up approach finds approximations of small pieces of the time series and builds the final solution by merging them. The top-down approach recursively divides the time series until a stopping criterion is satisfied [10, 13]. The sliding-window approach grows a segment until the error exceeds a threshold: it starts from the first point of T and adds points to the segment while the sum of errors is less than a threshold. At that point a segment is generated and the process continues, generating a new segment from the next point. Several optimizations have been proposed for this algorithm: 1) adding more than one point in each iteration of the process of finding one segment, and 2) since the error is monotonically non-decreasing, methods such as binary search can be used [10].
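A hedged sketch of the sliding-window idea follows (not the exact algorithm of [10]; the error measure, threshold, and boundary handling are assumptions made for the example):

    # Illustrative sliding-window segmentation sketch: grow a segment point by
    # point and close it when the SSE of a straight-line fit exceeds max_error.
    import numpy as np

    def line_fit_error(xs, ys):
        coeffs = np.polyfit(xs, ys, deg=1)            # least-squares line
        return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

    def sliding_window_segments(ys, max_error=0.5):
        xs = np.arange(len(ys), dtype=float)
        segments, start = [], 0
        for end in range(2, len(ys) + 1):
            if line_fit_error(xs[start:end], ys[start:end]) > max_error:
                segments.append((start, end - 2))     # last prefix that still fit
                start = end - 2                       # next segment shares the boundary
        segments.append((start, len(ys) - 1))
        return segments

    ys = np.array([0, 1, 2, 3, 4, 3, 2, 1, 0], dtype=float)
    print(sliding_window_segments(ys))                # [(0, 4), (4, 8)]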

    Top-down methods find good split points and split T into two segments. An approximate

linear model is calculated for each part and the error of each part is computed. If the error is less than a threshold, the model for that part is finalized; otherwise the algorithm recursively repeats the process. Bottom-up methods start from the smallest possible segments (n/2 segments in total). They find the cost of merging each pair of adjacent segments and merge the adjacent pair that has the lowest cost. This process is repeated as long as the minimum cost of merging is smaller than a threshold [10, 13].

    Keogh et al. propose a new online algorithm called SWAB (Sliding-Window And Bottom-

    up). SWAB uses a sliding buffer of size close to 6 segments and uses bottom-up on that frame.

    After segmentation the leftmost segment is reported and the corresponding data is removed

    from the frame and more data are read into the frame [10].

    D. Lemire suggests that instead of having linear models for each interval of a time series,

we could have models of different degrees for different intervals [13]. Some intervals may have constant models, some linear, and so on. The method is called adaptive because the degree of the model in an interval is decided adaptively. The reason the adaptive method is suggested is that piecewise linear models might locally over-fit the data by trying to find a linear model where a constant model would fit the data better. Since time series datasets can be very large, the efficiency of the algorithm is very important. The adaptive method proposed in this paper tries to improve the quality of the model while keeping the cost of model construction the same as that of the top-down [11] method.

Different algorithms with different advantages and disadvantages can be used for approximating time series [13]. Optimal adaptive segmentation uses dynamic programming to find the best segmentation and thus has high complexity (Ω(n²)). The top-down method, on the other hand, selects the worst segment and divides it into two smaller segments iteratively until the complexity of the model reaches the maximum allowed complexity. The adaptive top-down algorithm first applies the top-down algorithm to the time series, and then replaces a linear model segment with two constant model segments if the error can be reduced by this replacement. Another version of adaptive top-down first constructs a top-down constant model and then merges constant models in order to obtain linear models. The optimal algorithm is not practical because it takes a very long time (weeks) to generate results for a time series with one million data points. Adaptive top-down is slightly slower than the top-down algorithm, but generates results of higher quality.


    2.1.5 Online Approximation of Non-linear Models

XCSF and LWPR are two algorithms for online linear approximation of an unknown

    function. These methods cluster the input space into small subspaces and find a linear model

    for each subspace and use a weighted sum to find the final model. For this we need to

    first structure the feature space into small subspaces in order to exploit the linearity of the

    target function in each subspace, and then we need to find the linear models in each patch.

    There are several solutions for the second step, but the first step is not straightforward.

XCSF is an evolutionary algorithm that uses a GA [22], and LWPR (Locally Weighted Projection Regression) is a statistics-based algorithm; both approximate non-linear multi-dimensional functions online and incrementally [20].

Receptive Fields (RFs) is the notion used by LWPR for the ellipsoidal subspaces; XCSF refers to the subspaces as classifiers (another term for RFs) [17]. Both algorithms have an empty population of RFs at the beginning, and add new members to this population when a new uncovered data item is received. An n-dimensional ellipsoid that is not necessarily axis-aligned can be represented by a positive semi-definite, symmetric matrix D. The squared distance of a data item x from the center c of this space can then be defined as:

d² = (x − c)ᵀ · D · (x − c)   (2.6)

If this distance equals one, the data item lies on the surface of the RF. This is how the subspaces are represented in both methods. A linear model for each subspace can be expressed as:

p(x) = ∑_{k=1}^{n} b_k · x_k + b0   (2.7)

One data item can be covered by several subspaces; in that case a weighted combination of the linear models of those subspaces is used as the model's prediction for the input data item [17, 22, 20].
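A small illustrative sketch of equations 2.6 and 2.7 and of such a weighted combination follows; the Gaussian weighting and the dictionary-based receptive field representation are assumptions for the example and only mirror, rather than reproduce, the LWPR/XCSF weighting schemes described next:

    # Illustrative sketch: membership of a point in an ellipsoidal receptive
    # field via d^2 = (x - c)^T D (x - c), and a weighted combination of the
    # local linear models of the RFs that cover the point.
    import numpy as np

    def squared_distance(x, center, D):
        diff = x - center
        return float(diff @ D @ diff)                # equation 2.6

    def local_prediction(x, b, b0):
        return float(b @ x + b0)                     # equation 2.7

    def combined_prediction(x, receptive_fields):
        weights, preds = [], []
        for rf in receptive_fields:
            d2 = squared_distance(x, rf["center"], rf["D"])
            if d2 < 1.0:                             # x is covered by this RF
                weights.append(np.exp(-d2))          # closer RFs count more
                preds.append(local_prediction(x, rf["b"], rf["b0"]))
        if not weights:
            return None                              # x not covered by any RF
        return float(np.average(preds, weights=weights))

    rfs = [{"center": np.zeros(2), "D": np.eye(2) * 0.5, "b": np.array([1.0, 0.0]), "b0": 0.0},
           {"center": np.ones(2),  "D": np.eye(2) * 0.5, "b": np.array([0.0, 2.0]), "b0": 1.0}]
    print(combined_prediction(np.array([0.5, 0.5]), rfs))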

LWPR assigns a Gaussian activity weight to each subspace based on its distance to the data item, and ignores weights that are smaller than a threshold for the sake of performance. This way closer subspaces have a more significant effect on the final prediction compared to farther subspaces. XCSF, on the other hand, only assigns weights to subspaces with a distance of less than 1. In XCSF, weights are proportional to the inverse of the prediction error [17, 22, 20].

    Finding a linear model for each subspace is straightforward and can be done using least

    squares methods. XCSF uses RLS (Recursive Least Squares), and LWPR uses incremental

    partial least squares (incremental version of PLS) to find the linear model in a subspace [17,


    22, 20].

    Learning the locality (which is the shape and location of receptive fields) is done by a steady

    state genetic algorithm in XCSF, and by a stochastic gradient descent in LWPR [17, 22, 20].

In XCSF, each RF keeps an estimate of its current prediction error, which is used to calculate its fitness for the GA. Fitness is shared among the RFs that cover the same inputs. Tournament selection is used for the GA's selection task, and the crossover and mutation operators are applied to the location of the center and to the stretch and rotation, which are defined by the matrix D. When the population reaches a maximum size, some RFs are deleted from crowded regions of the input space using a proportionate selection probability. During this process, the algorithm tries to generalize RFs by making their coverage area larger while keeping their accuracy sufficiently high [17, 22, 20].

The centers of the subspaces are not changed in LWPR; instead the D matrix is changed (the size and orientation of the ellipsoids). This optimization is done by an incremental gradient descent based on stochastic leave-one-out cross-validation. For this purpose D is first decomposed into a triangular matrix and then updated. The cost function is the activity-weighted error plus a penalty term that prevents the subspaces from shrinking over iterations [17, 22, 20].

XCSF and LWPR are compared in [17]; for comparison purposes LWPR is tuned to hit a low target error (by decreasing the size of the RFs and changing the learning rate and penalty value), namely the target error hit by XCSF. Then XCSF's maximum population size is set to be roughly equal to LWPR's number of RFs [17, 22, 20].

    2.2 MapReduce

MapReduce is a programming model for processing large datasets. Programs written based on this programming model run on a cluster of nodes called a MapReduce cluster. There are two kinds of nodes in such a cluster: mappers and reducers. Mappers run the part of the program called the map procedure, and reducers run another part of the code called the reduce procedure. All mappers and reducers run the same code on different data. Mappers (the map procedure) read the input data from the hard disk of the machine they are running on, and process the data to generate intermediate results. The data received by a mapper is assumed to be in the form of <key, value> pairs (for example, the key could be a line number or a file name and the value the corresponding content). Each mapper processes its part of the data and generates its result as <key, value> pairs. An input <key, value> pair and an output <key, value> pair might not have anything to do with each other; for example, the input could be a <line number, line text> pair while the output is a <word, count> pair.

One mapper might generate many <key, value> pairs with different keys and values. The <key, value> pairs generated by the mappers are then sent to reducers for the next phase of processing. The pairs are not sent to reducers randomly; instead they are partitioned among reducers based on the key of each pair. For example, of all the <key, value> pairs generated by all mappers, those with key equal to key1 are sent to one particular reducer.

Each reducer receives a group of <key, value> pairs generated by the mappers and processes them in order to generate the final result. Since the map and reduce phases are run in parallel by all mappers, a large dataset that is distributed among cluster nodes is processed by the MapReduce framework much faster than it could be processed on a single machine.

2.2.1 Why <Key, Value> Pairs?

When data is processed by mappers, we need a way to aggregate the results generated by the different mappers. For example, if the ultimate task is counting the number of words starting with a, b, c, and d in a huge set of text files, each mapper can generate the result for the part of the data it has locally, and we need a way to aggregate the results from all mappers. Having <key, value> pairs lets us ask all mappers to send the count of all words starting with a certain character to a certain reducer, enabling that reducer to gather all the partial results and calculate the final result. For this purpose all mappers would generate results like <a, count of words starting with a>, and all the results having a as their key would be sent to one particular reducer [3].
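A minimal single-process sketch of this <key, value> flow follows (illustrative only; no Hadoop API is involved and all function names are made up for the example):

    # Each "mapper" emits <first letter, 1> pairs, the shuffle groups pairs by
    # key, and each "reducer" sums the counts for one letter.
    from collections import defaultdict

    def map_phase(chunk):
        for word in chunk.split():
            if word and word[0].lower() in "abcd":
                yield (word[0].lower(), 1)            # emit <letter, 1>

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:                      # all values for a key go
            groups[key].append(value)                 # to the same reducer
        return groups

    def reduce_phase(key, values):
        return (key, sum(values))                     # total count per letter

    chunks = ["a big cat drove away", "birds and bees buzz daily"]
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    print([reduce_phase(k, v) for k, v in shuffle(pairs).items()])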

The key concept is that the programmer is aware of the way mappers need to generate their results (as <key, value> pairs), and also of the way data is shuffled from mappers to reducers, and needs to decide how to use this programming model to solve the problem at hand.

    2.2.2 Is That All MapReduce Does?

So far we have talked about how the MapReduce programming model helps us solve a data processing problem in a parallel fashion. But that is not all a MapReduce implementation offers us (different MapReduce implementations are available, among which we can name Hadoop [12], an open source implementation). Once you have written a MapReduce program you are done, and Hadoop (or any other MapReduce implementation) takes care of the rest of the problems. The framework sends the mapper procedure to all mappers, and the reducer procedure to all reducers. Then it asks the mappers to run the code on their local data and generate the result based on what is specified in the code. After the results are generated, the framework takes care of shuffling the data among the reducers. After the reducers receive the data, it asks them to run the reducer procedure on the received <key, value> pairs.

Figure 2.2: MapReduce Execution Overview [3].

A question here is how a programmer divides and copies a file onto the cluster nodes so that it can be processed by the MapReduce framework. The programmer does not need to do such a task. The MapReduce framework has a distributed file system (the Google File System, or GFS, and the Hadoop Distributed File System, or HDFS, in the Hadoop implementation of MapReduce) that facilitates this task. All that is needed is to run the distributed file system and issue a command like: copy bigFile.txt onto the cluster. The rest of the work is done by the framework. Another question here is what happens if a certain mapper fails in the middle of a run. The answer is that the MapReduce framework takes care of this issue as well. When the distributed file system copies the data onto the cluster, it replicates different chunks of the data on different nodes (based on the replication factor indicated by the user in the configuration file), and when a certain mapper fails, its task is assigned to other mappers. The MapReduce framework also takes care of other lower-level tasks such as network communication. There are also nodes in a MapReduce cluster whose task is bookkeeping; they keep track of cluster nodes, mappers, reducers, data replication, etc.
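In Hadoop, for example, the copy step above amounts to a single shell command along the lines of hadoop fs -put bigFile.txt /data/bigFile.txt (the destination path here is only illustrative); HDFS then splits the file into blocks, replicates each block according to the configured replication factor, and spreads the replicas over the cluster nodes.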

Figure 2.2 illustrates the execution of a MapReduce task on a MapReduce cluster. The user program is distributed by the master among the worker nodes. Some of the worker nodes act as mappers and some as reducers. The mappers read the data and run the mapper procedure on it. The intermediate data that is generated is then sent to the reducer nodes. The reducer nodes process the < key, value > pairs they have received and generate the final result [3].

    2.2.3 MapReduce for Clustering

One important large-scale data processing task is data clustering. Several MapReduce-based versions of different clustering algorithms have been presented recently. In this section we review three clustering algorithms to see how they use the power of MapReduce in order to cluster data.

Zhao et al. argue that all previous research on parallel k-means suffers from two problems [24]. First, it assumes that all the data fit in main memory, and second, it uses a restricted programming model. For these two reasons, those works are not applicable to peta-scale datasets. Since distance calculation (performed n*k times in each iteration, where n is the number of data points and k is the number of clusters) is the most expensive step of the algorithm, they exploit the parallelism of MapReduce to decrease this cost. The Map function assigns each data point to its closest center, and the Reduce function updates the centroids. There is one more function, called Combine, that aggregates the intermediate results of the Map functions. A global variable called centers holds the list of all centers and is used by all map tasks. Map tasks generate pairs of < cluster index, data point >. The Combine method aggregates the results of the same map task: it calculates the partial sum of the data points assigned to the same cluster, and its output is pairs of < cluster index, (partial sum, count) >. The Reduce function adds up the partial sums for each cluster and calculates the new centroids. The output of Reduce is pairs of < cluster index, new centroid >. The speedup they achieved on 4 machines is around 3 for the biggest dataset (8GB), which is a good speedup. The speedup is also larger for bigger datasets, which is a good indication. The authors do not discuss the iterative nature of the algorithm or how this issue is handled. They also do not discuss the accuracy of the method and only report speed-up, scale-up, and size-up [24].
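A rough sketch of one iteration of these three functions, written here in Python over plain in-memory lists rather than on Hadoop as in the original work, could look like the following; the pair contents mirror the description above:

import math
from collections import defaultdict

def closest(point, centers):
    # Index of the center nearest to the point (Euclidean distance).
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans_map(points, centers):
    # Map: emit a < cluster index, data point > pair for every local data point.
    for p in points:
        yield (closest(p, centers), p)

def kmeans_combine(mapped):
    # Combine: within one map task, emit < cluster index, (partial sum, count) >.
    sums, counts = {}, defaultdict(int)
    for idx, p in mapped:
        sums[idx] = [s + x for s, x in zip(sums.get(idx, [0.0] * len(p)), p)]
        counts[idx] += 1
    return [(idx, (sums[idx], counts[idx])) for idx in sums]

def kmeans_reduce(combined):
    # Reduce: add up the partial sums and counts, then output the new centroids.
    sums, counts = {}, defaultdict(int)
    for idx, (s, n) in combined:
        sums[idx] = [a + b for a, b in zip(sums.get(idx, [0.0] * len(s)), s)]
        counts[idx] += n
    return {idx: [v / counts[idx] for v in sums[idx]] for idx in sums}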

Ferreira Cordeiro et al. present an algorithm for clustering very large multi-dimensional datasets with MapReduce [7]. Since such a dataset does not fit on one or even several disks, parallel processing is the only solution, and in that case I/O cost and network cost are the two things that need to be balanced. Best of both Worlds (BoW) is the solution the authors suggest in this paper. They have worked on the largest real dataset used so far in database subspace clustering (a Twitter crawl of more than 12TB, and Yahoo! operational data of 5 petabytes; merely reading 1TB from a modern 1TB disk takes around 3 hours). The contribution of the paper is combining a sequential clustering algorithm with a parallelization method in an efficient way. Sequential subspace clustering algorithms can be plugged into this solution, and the system balances the I/O and network costs. The sequential algorithm that is plugged into the parallel algorithm finds hyper-rectangle-shaped beta-clusters in the multi-dimensional space, and it can be density-based or k-means-based [7].

The I/O-optimal version of the algorithm (ParC) reads the dataset only once and thereby reduces the I/O cost. Another algorithm, SnI (sample and ignore), improves the network cost but reads the data twice. Depending on the number of reducers, either of the two can be the winner. BoW is a combined algorithm that decides which of these algorithms to use based on the number of reducers, keeping the cost at min(ParC, SnI) for any number of reducers. ParC partitions the dataset across the cluster nodes (using one of these methods: random, address space, or arrival order), finds beta-clusters in each partition (on the reducers), and finally merges the clusters (on a single machine). SnI, on the other hand, first samples the dataset (exploiting the skewed distribution of the data) and then clusters the sample using ParC, ignoring the un-sampled data items. This way SnI avoids processing many of the data items that belong to big clusters that are already covered by the sample. SnI reads the data twice: in the first read it samples the data, and in the second read it only maps the data items that are not explained by the sample clusters and skips the other points, which greatly reduces the network cost. In the sample step of the algorithm, the mappers send each point with probability Sr to a single reducer. That reducer then clusters the sampled data using the plugged-in clustering algorithm and passes the cluster descriptions to the next phase. In the ignore phase, each mapper reads its partition again, ignores the data points that fit into the clustering found in the sample phase, and sends the other data items to r reducers. Those reducers cluster the data points using the plugged-in clustering algorithm and pass the clustering descriptions to one machine. That machine merges all the clusterings found in the second phase with the clustering found in phase one [7].

Both ParC and SnI have their own benefits. ParC optimizes I/O by reading the data file once, and SnI optimizes the network cost by reducing the number of data points that need to be transferred over the network, at the cost of reading the data file twice. To take advantage of both methods, we need a combined method that selects one of them based on cost. A cost-based optimization method is used to select the better algorithm adaptively. The cost formula uses file size, network speed, disk speed, startup cost, and plugin cost to calculate the total cost of each algorithm. The BoW algorithm first calculates both costParC and costSnI and then selects and runs the better one based on these parameters. Experiments have been carried out to check the accuracy, scalability, and performance of the cost-based method. The authors show that the quality of the clustering matches the quality of sequential clustering while the speed-up is close to linear. The cost-based method is also shown to be the best of both worlds in all cases [7].
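The adaptive choice itself is simple to express; the sketch below is only schematic, and the two cost functions are placeholders standing in for the paper's analytic cost model rather than its actual formulas:

def cost_parc(p):
    # Placeholder: one full read of the file, all of it shuffled over the network.
    return (p["startup"] + p["file_size"] / p["disk_speed"]
            + p["file_size"] / p["net_speed"] + p["plugin_cost"])

def cost_sni(p):
    # Placeholder: two reads, but only roughly the sampled fraction crosses the network.
    return (p["startup"] + 2 * p["file_size"] / p["disk_speed"]
            + p["sample_rate"] * p["file_size"] / p["net_speed"] + p["plugin_cost"])

def bow(params, run_parc, run_sni):
    # Keep the total cost at min(costParC, costSnI) by running the cheaper variant.
    return run_parc() if cost_parc(params) <= cost_sni(params) else run_sni()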

Ene et al. have designed the first approximate versions of the metric k-center and k-median algorithms for MapReduce [6]. They assume that a set V of n data points and their pairwise distances are given, and they try to place similar points in the same clusters. The output of the algorithm is k data points that are considered to be the center points of the k clusters. The algorithms first sample the data (in a way that the sample represents all the data well) in order to decrease the dataset size. The sampling method incrementally adds new points to the final sample set only if they are not already well represented by that set. Sampling is different for k-median and k-center due to their different natures; sampling for k-median needs more effort because it has to consider each point's distance from its cluster center. A version of the sampling algorithm that can be run on MapReduce is presented in the paper [6].

The MapReduce version of sampling is an iterative algorithm, and each iteration consists of three MapReduce operations. The first MapReduce operation partitions the data arbitrarily among the machines (mappers), and then each reducer constructs two sets (S, the final sample set, and H, from which a pivot is selected). In the next MapReduce step, all the mappers pass the H and S sets to a single reducer, and that reducer finds the pivot point. In the last MapReduce step, the mappers send the pivot, S, one partition of R (the remaining data items that have not been sampled yet), and the distance matrix to the reducers, and the reducers get rid of the well-represented points. These steps are iterated until the number of remaining points in R falls below a certain threshold. K-center tries to minimize the maximum distance between a cluster center and the points in that cluster, while k-median tries to minimize the sum of the distances of all the points in a cluster from the cluster center (both problems are known to be NP-hard). K-center uses the sample produced by the sampling algorithm: the mappers send all the points in the sample, along with their pairwise distances, to one reducer, and that reducer runs a simple local clustering algorithm. K-median needs more information, and its sample should carry information about all the points that are to be clustered. For each un-sampled point, the closest sample point is selected and its weight is increased by 1. In k-median, the sampling is done first, and then partitions of the original dataset, along with the sample and part of the distance graph, are sent to the reducers. Each reducer finds the weights of the sample points partially. Then, in another MapReduce round, the partial weights are summed up. The last step is a simple clustering of the sample that takes the weight of each sample point into account [6].
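The weighting step for k-median is straightforward to sketch; the code below is only an illustration of the idea (the helper names are invented, not the authors'), with one call to partial_weights per reducer and a second round that sums the partial weights:

import math
from collections import Counter

def partial_weights(partition, sample):
    # For every un-sampled point in this partition, find its closest sample point
    # and increase that sample point's weight by one.
    weights = Counter()
    for p in partition:
        weights[min(range(len(sample)), key=lambda i: math.dist(p, sample[i]))] += 1
    return weights

def total_weights(partials):
    # Second MapReduce round: sum the partial weights of each sample point.
    total = Counter()
    for w in partials:
        total.update(w)
    return total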

    2.2.4 MapReduce and Iterative Tasks

Many machine learning and data mining algorithms work on data iteratively, but MapReduce is not well suited for tasks with cyclic data flow. There are frameworks, such as Twister [5], Spark [23], and HaLoop [1], that support iteration. Dave et al. present a cloud-based pattern for large-scale iterative data processing problems [2]. As a case study, they have implemented CloudClustering, which shows how iterative data processing problems can be handled on the cloud.

CloudClustering is a distributed version of the k-means clustering algorithm, implemented on Microsoft's Windows Azure platform. The authors introduce a way to balance the performance versus fault-tolerance trade-off (the main trade-off when solving iterative problems on the cloud) using data affinity and a buddy system. Some methods use a central pool of stateless tasks in order to handle the fault-tolerance issue, but this can lead to low performance because a cluster node might need to receive different parts of the data in different iterations (i.e., there is no affinity between data and workers) [2].

Windows Azure handles fault tolerance by means of reliable queues. When a worker takes a task from the queue, the message becomes invisible, and if it is not deleted after a timeout, it reappears in the queue. This way, if a worker fails, the task will be done by another worker node. One of the issues with iterative tasks on the cloud is the stopping criterion, which can be handled in two different ways in this problem. If no data point changes cluster from one iteration to the next, we are done; this method needs to keep track of the previous cluster of each data point. The other method checks the maximum amount of centroid movement and stops if it falls below a certain threshold; this method works on read-only data, but it cannot guarantee convergence [2].
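The two stopping criteria can be stated in a few lines; the following is only an illustration of what each criterion has to remember, not code from CloudClustering:

import math

def no_membership_change(prev_assignment, new_assignment):
    # Stop when no data point changed cluster between iterations; requires
    # remembering every point's previous cluster.
    return prev_assignment == new_assignment

def centroids_converged(prev_centers, new_centers, threshold):
    # Stop when the largest centroid movement falls below the threshold; needs
    # only the centroid lists, but does not guarantee that memberships are stable.
    return max(math.dist(a, b) for a, b in zip(prev_centers, new_centers)) < threshold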

The proposed architecture uses Windows Azure's queuing system and includes one master and a pool of worker nodes. The input dataset is stored centrally and is partitioned by the master. The workers download a task containing the address of the corresponding part of the partition and the centroid list, and perform the task. This design works best in terms of fault tolerance, but since data affinity is not considered, its performance is not good. The other extreme is having one queue per worker, which solves the problem of data affinity (the master assigns the same partition of the data to the same worker in each iteration) but suffers from a fault-tolerance problem (there is no other worker to take over the current task in case the worker fails). The buddy system groups workers into buddy groups, and a queue is shared among all members of each buddy group. The size of the buddy group then defines the balance between fault tolerance and performance [2].

    2.2.5 Arguments about Using or not Using MapReduce

Schwarzkopf et al. have listed seven different assumptions and simplifications employed by researchers in cloud research that threaten the practical applicability and scientific integrity of that research [16].

One of the issues they point out in their paper is unnecessary distributed parallelism. Very large datasets and frameworks such as MapReduce have led researchers to employ distributed parallelism more and more. Since the new high-performance computing frameworks offer a fascinating simplicity and handle complicated issues like communication, synchronization, and data motion, a lot of people are willing to use these frameworks without considering whether they are actually useful for the problem at hand. Frameworks such as MapReduce reduce the engineering time needed to design a distributed version of an algorithm, but they mostly increase the runtime. For this reason, the speedup of a program must be measured to show that the distributed solution outperforms the sequential solution. Furthermore, even if we are sure that a parallel solution would be beneficial for the problem at hand, we still need to make sure that distributing the data over several machines is necessary. They also point out that, as Rowstron et al. have shown, with today's multicore processors and huge amounts of RAM we might not need a distributed solution for many problems [15]; we would then be able to make use of fast communication mechanisms such as shared memory and also avoid data motion [16].

Another issue they mention in their paper is forcing the abstraction. MapReduce is designed to alleviate the I/O bottleneck of big data by distributing the data over several hard disks, and the time needed to process a job on a single machine is also assumed to be long. Some solutions iterate and generate many short MapReduce jobs, while it would be better to have the smallest possible number of jobs running iteratively on each system. Domain-specific systems (for stream processing, iterative processing, and graph processing) have also emerged that seem far more justified than using MapReduce for every problem [16].

Since many machine learning and data mining algorithms are iterative, MapReduce is not inherently an iterative programming model, and some other algorithms do not fit this model for other reasons, many alternatives to and extensions of MapReduce have been provided by different research and industrial groups in recent years. Some theoretical studies have been done to show that Hadoop (an open-source implementation of MapReduce) has limitations. Empirical studies have also been done, and frameworks such as HaLoop [1] and Twister [5] present classes of algorithms that Hadoop is not a good fit for, try to extend Hadoop to solve those problems more efficiently, and of course outperform Hadoop at least when running those special algorithms. Jimmy Lin provides reasons why we need to either revise current algorithms to run on MapReduce or devise new algorithms that follow the MapReduce programming model. He suggests that, since MapReduce is currently the widely used solution for large-scale data processing problems, instead of devising new frameworks we can drop the iterative solution and try to use (or devise) alternative solutions that fit MapReduce, for problems where MapReduce is "good enough". He discusses three classes of problems to justify his claim: iterative graph algorithms (e.g., PageRank), gradient descent (e.g., for training a logistic regression classifier), and EM (e.g., for k-means and HMM training) [16].

Jimmy Lin argues that extensions of Hadoop support iterative constructs and thus alleviate some of these problems, but the problem with all these frameworks is that they are not Hadoop. It costs a lot for an organization to maintain another framework (other than Hadoop) only for graph and iterative algorithms. A better solution would be to try to solve the above-mentioned problems by changing the algorithms so that they can run on Hadoop. It is enough if MapReduce performs better than the alternative currently used to solve a given problem; MapReduce does not need to beat all the alternatives. For example, MapReduce performs a lot better than GIZA++ for word alignment, and is also considered an advance when used for k-means clustering [16]. The Hadoop stack is the standard and widely used platform for large-scale data analysis. Any large-scale data analysis needs to be able to process different types of structured and unstructured data and to run different types of algorithms (graph, text, relational data, machine learning, etc.). No single programming model or framework can meet all the needs and be the best in every aspect, such as performance, fault tolerance, expressivity, simplicity, and abstracting low-level features such as synchronization. So the question is: is adopting and deploying a new framework to solve a problem worth it (in terms of cost, time, generality of the framework, having personnel who have mastered the framework, etc.) [14]?

Chapter 3

    Approach

In this chapter we introduce two different piecewise regression algorithms. The first algorithm is called MapReduce Regression Tree (MRRT), and the second one is called the Slope-changing algorithm. Both algorithms try to find a piecewise regression model for a dataset within the MapReduce framework. The MapReduce Regression Tree algorithm is a regression-tree-based algorithm that can be used within the MapReduce framework. The Slope-changing algorithm, on the other hand, introduces a non-greedy method to find good candidate split points and uses this candidate set to find the final set of split points. The performance of these two algorithms is analyzed and compared in chapter 4.

    3.1 MapReduce Regression Tree

Algorithm 2 lists the pseudocode of the MapReduce Regression Tree algorithm. This algorithm partitions the feature space into smaller subspaces, but constructs a regression tree model (instead of a linear model as in the Slope-changing algorithm) for each subspace. The generated regression tree models are used to predict the target value of new data items.

Unlike the Slope-changing algorithm, which selects the split points based on the logic that maximum, minimum, and slope-changing points are good candidates, this algorithm does not choose the split points based on any heuristic, and the feature space is not divided into subspaces along different dimensions. The feature space is divided into subspaces of equal size (in terms of the volume of the subspace, not the number of data items in each subspace), and it is divided into smaller subspaces along a single dimension of the feature space. This dimension is chosen randomly or using the preProcess method, which is described below.


Algorithm 2 MapReduce Regression Tree Algorithm - Main Method

1: function MR-Regression-Tree-Learn
2:   dimToSplit = preProcess(dataset)
3:   rangeValues = Map1(dimToSplit)   ▷ All mappers find the min and max value of the dimension that is being split
4:   splitPoints = Reduce1(rangeValues, dimToSplit, nMappers)   ▷ Split points are specified based on dimension size and number of mappers
5:   Map2(splitPoints, dimToSplit)   ▷ Data is shuffled among reducers
6:   models = Reduce2()   ▷ Each reducer finds the model for the received data
7: end function

The preProcess method retrieves a random sample of the dataset (in our experiments we used 10% of each dataset) and runs the piecewise Regression Tree algorithm separately on each dimension of the dataset. One piecewise regression tree model is generated for each dimension. Each model is then tested against a validation set, and the dimension whose model has the smallest RMSE on the validation set is chosen as the dimension along which the MapReduce Regression Tree algorithm splits the dataset.
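A minimal sketch of this selection step is shown below. It is only an illustration: fit_piecewise_rt is an assumed helper (standing in for the piecewise Regression Tree learner) that fits a model splitting only along one given dimension and returns a prediction function.

import math
import random

def rmse(predict, X, y):
    # Root mean squared error of a prediction function on a validation set.
    return math.sqrt(sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y))

def pre_process(X, y, val_X, val_y, fit_piecewise_rt, sample_rate=0.10):
    # Draw a random sample of the training data (10% in our experiments).
    idx = random.sample(range(len(X)), max(1, int(sample_rate * len(X))))
    sample_X, sample_y = [X[i] for i in idx], [y[i] for i in idx]
    best_dim, best_err = 0, float("inf")
    for dim in range(len(X[0])):
        # Fit one single-dimension piecewise model and score it on the validation set.
        err = rmse(fit_piecewise_rt(sample_X, sample_y, dim), val_X, val_y)
        if err < best_err:
            best_dim, best_err = dim, err
    return best_dim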

    Figure 3.1: Dataset distribution among cluster nodes with overlap to decrease borderline datapoints prediction error

When dividing a feature space into subspaces, the models constructed for two adjacent subspaces might have different predictions for a data point that is located on the borderline. The same holds for data points that are located in one subspace but are close to the borderline: for these data points, the neighboring model might yield a better prediction than the model of the subspace the data point is actually located in. For this reason, smoothening methods try to reduce the problem by using a weighted average of the predictions of neighboring models for data points close to the borderline. The number of these models is two in a two-dimensional feature space, and can be larger in an n-dimensional feature space depending on the location of the data item (the data item might be close to a borderline in one dimension but not in another).

Since we are using a distributed method to solve the regression problem, we have more resources at hand and can afford a little redundant computation in order to increase the accuracy of the model. Based on this reasoning and the borderline prediction problem explained above, we decided to use overlapping subspaces and let each mapper have more data than it strictly needs to construct its model. Figure 3.1 depicts how 7 partitions of a dataset that is partitioned along the x axis are assigned to 7 mappers. All cluster nodes except the first and the last one receive three partitions of the dataset: every node receives the left and right neighbors of the partition it is constructing the model for (we call this the main partition, and the left and right partitions the neighbor partitions). The reason the first and last nodes receive only two partitions instead of three is that their main partition has only one neighboring partition. Distributing the dataset this way lets the system construct each model based on three partitions but predict target values only for test items located in the main partition. This way we do not have any borderline data items, and thus we do not need the prediction smoothening methods used in the Slope-changing algorithm.

Algorithm 3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round

1: function Map1(dimToSplit)
2:   for all Mappers do   ▷ dataset is the local part of the dataset on the node this mapper runs on
3:     minValue = +∞
4:     maxValue = −∞
5:     for all dataPoint ∈ dataset do
6:       minValue = min(dataPoint[dimToSplit], minValue)
7:       maxValue = max(dataPoint[dimToSplit], maxValue)
8:     end for
9:   end for
10:  send < 1, < minValue, maxValue > >   ▷ By indicating the key as 1, all information is sent to one reducer
11: end function

Figure 3.2 depicts different overlap factors when overlapping the subspaces on the cluster nodes. When the overlap factor is 1, each node receives its own subspace plus two neighboring subspaces that are as big as its own subspace. When the overlap factor is 0.5, the size of the neighboring subspaces that each node receives is half the size of its own subspace. With an overlap factor of 0, there is no overlapping at all.

Figure 3.2: Different overlap factors of subspaces on cluster nodes
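The boundaries of the data range sent to each node follow directly from the split dimension's range, the number of nodes, and the overlap factor; a small illustrative helper (not the thesis implementation) makes this explicit:

def partition_ranges(min_value, max_value, n_nodes, overlap=1.0):
    # Each node's range is its own (main) subspace extended into each neighbor by
    # overlap * stepSize. With overlap = 1 an interior node covers three full
    # subspaces; the first and last nodes cover only two because the extension is
    # clipped at the ends of the dimension.
    step = (max_value - min_value) / n_nodes
    ranges = []
    for i in range(n_nodes):
        start, end = min_value + i * step, min_value + (i + 1) * step
        ranges.append((max(min_value, start - overlap * step),
                       min(max_value, end + overlap * step)))
    return ranges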

Now that we have discussed the concepts behind how the MapReduce Regression Tree algorithm works, let us briefly explain each line of Algorithm 2. The preProcess method selects the dimension along which to partition the dataset. Then the first round of MapReduce is started. In the map phase of the first MapReduce round, each mapper finds the range (min, max) of the data items in the portion of the dataset that it owns. This information is sent to one reducer. The reducer receives rangeValues and nMappers (the number of mappers) and decides which portion of the dataset should be sent to each node. In the map phase of the second round of MapReduce, all mappers receive the splitPoints information and send each of their data items to two or three reducers (each partition of the dataset is sent to several reducers due to the overlapping). In the reduce phase of the second MapReduce round, each reducer therefore holds two or three partitions of the dataset and constructs a regression tree for the portion of the dataset it has received. Each of these phases is explained in the following sections.
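To make the overall flow concrete, the following sketch simulates the two rounds locally in Python; build_tree is an assumed stand-in for the regression tree learner run by each reducer, partition_ranges is the helper sketched above, and the real system of course runs these phases on a cluster rather than in one process:

def mrrt_learn(partitions, dim_to_split, build_tree, overlap=1.0):
    # Round 1, map: every "mapper" reports the local min and max of dimToSplit.
    local_ranges = [(min(p[dim_to_split] for p in part),
                     max(p[dim_to_split] for p in part)) for part in partitions]
    # Round 1, reduce: a single reducer turns the global range into per-node split points.
    lo = min(r[0] for r in local_ranges)
    hi = max(r[1] for r in local_ranges)
    split_points = partition_ranges(lo, hi, len(partitions), overlap)
    # Round 2, map: every mapper sends each local point to the reducers whose
    # overlapping range contains it (two or three of them).
    shuffled = [[] for _ in partitions]
    for part in partitions:
        for p in part:
            for i, (start, end) in enumerate(split_points):
                if start <= p[dim_to_split] <= end:
                    shuffled[i].append(p)
    # Round 2, reduce: each reducer builds a regression tree on the data it received.
    return [build_tree(points) for points in shuffled]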

3.1.1 Map1: Finding the Min and Max of the Dimension that Is Being Split

Algorithm 3 lists the steps of this phase of the first MapReduce round. All mappers process the portion of the dataset they own and find the minimum and maximum value of the dimension that is to be split. All the mappers then send these minimum and maximum values to the same reducer; this is why the key of the emitted < key, value > pair is 1 for all mappers.

Algorithm 4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce Round

1: function Reduce1(rangeValues, dimToSplit, nMappers)
2:   minValue = min(rangeValues.minValues)   ▷ min value of dimToSplit dimension
3:   maxValue = max(rangeValues.maxValues)   ▷ max value of dimToSplit dimension
4:   stepSize = (maxValue − minValue) / nMappers   ▷ stepSize in dimToSplit dimension when partitioning the dataset
5:   splitPoints[1].start = minValue   ▷ start and end of the partition for the first mapper are calculated differently
6:   splitPoints[1].end = minValue + 2 * stepSize
7:   for i = 2, nMappers − 1 do
8:     splitPoints[i].start = minValue + (i − 2) * stepSize
9:     splitPoints[i].end = minValue + (i + 1) * stepSize
10:  end for
11:  splitPoints[nMappers].start = minValue + (nMappers − 2) * stepSize
12:  splitPoints[nMappers].end = minValue + nMappers * stepSize   ▷ start and end of the partition for the last mapper are calculated differently
13:  send splitPoints to all mappers   ▷ To be used by the mappers of the next round
14: end function

Algorithm 5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce Round

1: function Map2(splitPoints, dimToSplit, nMappers)
2:   for all Mappers do
3:     for all dataPoint ∈ dataset do
4:       for i = 1, nMappers do
5:         if splitPoints[i].start < dataPoint[dimToSplit] < splitPoints[i].end then
6:           send < i, dataPoint > to the corresponding reducer
7:         end if
8:       end for
9:     end for
10:  end for
11: end function

3.1.2 Reduce1: Finding Split Points Along the Dimension that Is Being Split

Algorithm 4 lists the reduce phase of the first MapReduce round. In this algorithm, all the maximum and minimum values sent by the mappers are used to find the overall maximum and minimum value of the dimension that is being split. Using these two values, the range of the dimension is obtained, and the stepSize of the split points along that dimension is computed by dividing this range by the number of mappers. The start and end values of each mapper's range along the dimension being split are then computed and stored in the splitPoints array. Every mapper is assigned a partition three times as large as stepSize, except those whose main partition is the first or last partition of the dataset; those two are assigned a partition only twice as large as stepSize.
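For reference, the split point computation of Algorithm 4 transcribes directly into ordinary code; the sketch below keeps the same formulas, with the 1-based mapper indices of the pseudocode:

def reduce1(range_values, n_mappers):
    # range_values: the (local min, local max) pairs emitted by the round-1 mappers.
    min_value = min(lo for lo, _ in range_values)
    max_value = max(hi for _, hi in range_values)
    step = (max_value - min_value) / n_mappers
    split_points = {}
    for i in range(1, n_mappers + 1):
        if i == 1:                       # first node: main partition plus right neighbor
            start, end = min_value, min_value + 2 * step
        elif i == n_mappers:             # last node: left neighbor plus main partition
            start = min_value + (n_mappers - 2) * step
            end = min_value + n_mappers * step
        else:                            # interior node: left neighbor, main, right neighbor
            start = min_value + (i - 2) * step
            end = min_value + (i + 1) * step
        split_points[i] = (start, end)
    return split_points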

Algorithm 6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce Round

1: function Reduce2(dataPoints)   ▷ dataPoints: data items sent to this reducer
2:   for all Reducers do
3:     models[i] = treeRegressionModel(dataPoints)
4:   end for
5: end function

    3.1.3 Map2: Shuffling the Data Among Cluster Nodes

The split points found in the Reduce1 phase are used in this phase to shuffle the data. Algorithm 5 lists how the data is shuffled among the cluster nodes. Each mapper sends each data item in its local portion of the dataset to 2 or 3 reducers: the data points located in the first or last partition of the feature space are sent to two reducers, and all other data points are sent to three reducers. This causes some redundancy in the amount of data transferred over the network, but it solves the problem of predicting the target value of borderline data items and also increases the accuracy of the model.
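The shuffling logic of Algorithm 5 can be written down in a few lines; in this illustrative sketch, emit(i, point) stands for sending the pair < i, point > to reducer i over the network, and boundary values are included so that points exactly at the global min or max are not dropped:

def map2(local_data, split_points, dim_to_split, emit):
    # Send every local data point to each reducer whose overlapping range contains
    # its value along dimToSplit (two reducers for points in the first or last
    # partition, three for all other points).
    for point in local_data:
        for i, (start, end) in split_points.items():
            if start <= point[dim_to_split] <= end:
                emit(i, point)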

    Algorithm 7 MapReduce Regression Tree Algorithm - Prediction

    1: function MR-Regression-Tree-Test(models, dataP