-
LARGE-SCALE NON-LINEAR REGRESSION WITHIN
THE MAPREDUCE FRAMEWORK
by
Ahmed Khademzadeh
A thesis submitted to the College of Engineering at
Florida Institute of Technology
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Computer Science
Melbourne, Florida
July 2013
-
© Copyright by Ahmed Khademzadeh 2013. All Rights Reserved.
The author grants permission to make single copies
-
We the undersigned committee hereby approve the attached thesis
LARGE-SCALE NON-LINEAR REGRESSION WITHIN THE MAPREDUCE FRAMEWORK
by Ahmed Khademzadeh
Philip Chan, Ph.D., Associate Professor, Computer Sciences, Principal Adviser
Marius Silaghi, Ph.D., Assistant Professor, Computer Sciences
Georgios C. Anagnostopoulos, Ph.D., Associate Professor, Electrical & Computer Engineering
William D. Shoaff, Ph.D., Associate Professor and Department Head, Computer Sciences
-
Abstract
Large-scale Non-linear Regression within the MapReduce Framework
By: Ahmed Khademzadeh
Thesis Advisor: Philip Chan, Ph.D.
Regression models have many applications in real-world problems in fields such as finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us construct better models from the data. The issue with big datasets is that they need a long time to be processed, or even to be read, on a single machine. This research employs MapReduce to model large-scale non-linear regression problems in a parallel fashion. The MRRT (MapReduce Regression Tree) algorithm divides the feature space into overlapping subspaces and then shuffles each subspace's data items to a node in the cluster. Each node in the cluster then constructs a regression tree for the subspace of the data it has received. Different versions of the algorithm (overlapping/non-overlapping subspaces and weighted/unweighted prediction using neighboring models) are proposed and compared with the regression tree (RT) algorithm implemented in Matlab libraries.

Experiments on synthetic and real datasets show that the MRRT algorithm, devised to be fast and scalable within the MapReduce framework, not only has close-to-linear speedup and close-to-optimum scalability, but also outperforms the RT algorithm in accuracy (in most cases) and improves prediction time by more than 80%. Although MRRT is designed for the MapReduce framework, it can also be used on a single machine; in that case it improves learning time by 60% (in most cases) compared to the RT algorithm, and exhibits close-to-linear scalability (compared to the RT algorithm, which scales roughly quadratically).
-
Contents
Abstract iii
Preface xiii
Acknowledgments xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview of Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Overview of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 6
2.1 Approximating Non-linear Regression Using Piecewise Regression . . . . . . . . 6
2.1.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Non-linear Regression via Piecewise Linear Regression . . . . . . . . . . 7
2.1.3 Piecewise Regression with Regression Trees . . . . . . . . . . . . . . . . 8
2.1.4 Piecewise Linear Approximation of Time Series . . . . . . . . . . . . . . 10
2.1.5 Online Approximation of Non-linear Models . . . . . . . . . . . . . . . . 12
2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Why <Key, Value> Pairs? . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Is That All MapReduce Does? . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 MapReduce for Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 MapReduce and Iterative Tasks . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.5 Arguments about Using or not Using MapReduce . . . . . . . . . . . . 20
-
3 Approach 22
3.1 MapReduce Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 Map1: Finding the Min and Max of Dimension that Is Being Split . . . 25
3.1.2 Reduce1: Finding Split Points Along the Dimension that Is Being Split 26
3.1.3 Map2: Shuffling the Data Among Cluster Nodes . . . . . . . . . . . . . 27
3.1.4 Reduce2: Constructing the Tree Regression Models for Each Subspace . 28
3.1.5 Using the MRRT Model to Predict . . . . . . . . . . . . . . . . . . . . . 28
3.2 Slope-changing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Choosing Good Split Points . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Map1: Finding Candidate Split Points . . . . . . . . . . . . . . . . . . . 31
3.2.4 Reduce1 : Generating a Split Point Set from Candidate Set . . . . . . . 35
3.2.5 Map2 : Shuffling the Data Points Based on Split Points . . . . . . . . . 39
3.2.6 Reduce2 : Finding the Linear Model for Each Subspace . . . . . . . . . 39
3.2.7 Using the Slope-changing Model to Predict . . . . . . . . . . . . . . . . 40
4 Empirical Evaluation 42
4.1 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Overview of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Overview of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 MRRT Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.1 Number of Dimensions to Split Along . . . . . . . . . . . . . . . . . . . 47
4.4.2 Overlapping Subspaces and Neighbor-weighted Predictions . . . . . . . 50
4.4.3 Comparing the Accuracy of MRRT and the Baseline Algorithm . . . . . 54
4.4.4 Choosing the Dimension to Split Along . . . . . . . . . . . . . . . . . . 58
4.4.5 Prediction Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.6 Speedup of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.7 Scalability of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.8 Could MRRT Be Used as a Sequential Algorithm? . . . . . . . . . . . . 72
4.5 Slope-changing Experiments Results . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.1 Slope-changing Algorithm Limitation . . . . . . . . . . . . . . . . . . . . 77
-
4.5.2 Comparing Accuracy of Slope-changing Algorithm to Baseline Algorithm 78
4.5.3 Comparing Runtime of Slope-changing Algorithm to Baseline Algorithm 79
5 Concluding Remarks 80
5.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Possible Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A Synthetic Datasets Details 83
Bibliography 88
-
List of Tables
4.1 Summary of Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Summary of Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on synthetic datasets. Neither method of dividing the feature space supersedes the other, and there is no obvious reason to prefer one over the other based on this experiment. The learning times of the two methods are also similar. . . . . 48
4.4 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on real datasets. The two-dimension split wins in accuracy and the one-dimension split wins in learning time. The accuracy difference is minor, but the learning time difference is significant. . . . . 48
4.5 Comparing accuracy of MRRT(W) and MRRT, both with no overlap. The MRRT(W) algorithm works better than MRRT on most datasets. . . . . 53
4.6 Comparing accuracy of the Weighted Overlapping MapReduce Regression Tree and the baseline algorithm on 10-dimensional synthetic datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 55
4.7 Comparing accuracy of the Weighted Overlapping MapReduce Regression Tree and the baseline algorithm on 20-dimensional synthetic datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 56
-
4.8 Comparing accuracy of MRRT(WO) and the baseline algorithm on real datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 57
4.9 Dimensions with lowest RMSE on synthetic datasets and rank of same dimen-
sion on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . 61
4.10 Dimensions with lowest RMSE on sample of synthetic datasets and RMSE of
dataset when divided along same dimension using MRRT(WO) algorithm. . . . 62
4.11 Dimensions with lowest RMSE on real datasets and rank of same dimension
on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . . . . 63
4.12 Dimensions with lowest RMSE on sample of real datasets and RMSE of dataset
when divided along same dimension using MRRT(WO) algorithm. . . . . . . . 64
4.13 Comparing prediction time of MRRT(O), MRRT(WO) and the baseline algorithm on the 20-dimensional ttoy20d3 synthetic test set containing 1000 test items, on different cluster sizes. The MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . . 65
4.14 Comparing prediction time of MRRT(O), MRRT(WO) and the baseline algorithm on the real datasets' test sets containing 4111 test items, on different cluster sizes. The MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . . 66
4.15 Comparing learning time of MRRT(WOS) and the baseline algorithm on the 20-dimensional ttoy20d3 synthetic dataset for different numbers of subspaces. MRRT(WOS) always performs better than the baseline algorithm, while also having better accuracy. . . . . 73
4.16 Comparing accuracy of MRRT(WOS) and the baseline algorithm on 20-dimensional ttoy20d3 synthetic datasets for different numbers of subspaces, when the dataset is divided into subspaces along the first dimension. The MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases. . . . . 74
4.17 Comparing learning time of MRRT(WOS) and the baseline algorithm on real datasets for different numbers of subspaces. The MRRT(WOS) algorithm's learning time is always less than the baseline algorithm's, except in one case when the number of subspaces is 32. . . . . 75
-
4.18 Comparing accuracy of MRRT(WOS) and the baseline algorithm on three real datasets for different numbers of subspaces, when the dataset is divided into subspaces along the first dimension. The MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases, and it mostly decreases as the number of subspaces increases. . . . . 76
4.19 Summary of synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.20 Comparing accuracy of slope-changing algorithm (PWC and FPS versions) and
baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . . . 78
4.21 Comparing learning time of slope-changing algorithm (PWC and FPS versions)
and baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . 79
-
List of Figures
2.1 A regression tree (left), and the corresponding 2-dimensional feature space (right). Each tree node corresponds to a subspace in the feature space [19] . . . . 7
2.2 MapReduce Execution Overview [3]. . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Dataset distribution among cluster nodes with overlap, to decrease the prediction error for borderline data points . . . . 23
3.2 Different overlap factors of subspaces on cluster nodes . . . . . . . . . . . . . . 25
3.3 Bad split points cause bad piecewise linear models and higher prediction error . . . . 29
3.4 Good split points help produce better piecewise linear models and lower prediction error. . . . . 30
3.5 Finding the data points with maximum target value by gridifying data points and using an initial random seed . . . . 32
3.6 Using Parzen Window Classifier to find areas with many candidate split points [18] 36
4.1 Splitting the feature space to subspaces. . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Summary of MRRT versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional
datasets (gtoy10d1, gtoy10d2, ptoy10d1, ptoy10d2, ttoy10d1 and ttoy10d2)
datasets with different overlap values when dividing along first dimension. . . . 51
4.4 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional
datasets (ptoy20d1, ptoy20d2, ttoy20d1 and ttoy20d2) datasets with different
overlap values when dividing along first dimension. . . . . . . . . . . . . . . . . 53
4.5 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on IHEPC1
real dataset with different overlap values when dividing along first dimension. . 54
4.6 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . . 59
-
4.7 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . . 60
4.8 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on the IHEPC1 real dataset with overlap = 0.75, when splitting along different dimensions. . . . . 63
4.9 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. The runtime of the same algorithms on a single machine is 2609.6 seconds. . . . . 67
4.10 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the IHEPC1, IHEPC2, and IHEPC3 real datasets respectively with overlap = 0.75 when splitting along the first dimension. . . . . 68
4.11 Analyzing scalability of the baseline, MRRT(WO) and MRRT(WOS) algorithms on ttoy20d3 datasets with overlap = 0.75 when changing the dataset size from 50,000 items to 1,000,000 data items. . . . . 69
4.12 Analyzing scalability of the baseline, MRRT(WO) and MRRT(WOS) algorithms on the IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when changing the dataset size from 103,557 items to 2,071,148 data items. . . . . 71
4.13 Comparing runtime of MRRT(WO), MRRT(WOS) and the baseline algorithm on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. . . . . 72
4.14 Comparing runtime of MRRT(WO), MRRT(WOS) and the baseline algorithm on the IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when splitting along the first dimension. . . . . 74
-
List of Algorithms
1 Basic Regression Tree Construction Algorithm . . . . . . . . . . . . . . . . . . 9
2 MapReduce Regression Tree Algorithm - Main Method . . . . . . . . . . . . . . 23
3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round 24
4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7 MapReduce Regression Tree Algorithm - Prediction . . . . . . . . . . . . . . . 27
8 Slope-changing Algorithm - Main Method . . . . . . . . . . . . . . . . . . . . . 30
9 Slope-changing Algorithm - Initialization . . . . . . . . . . . . . . . . . . . . . . 31
10 Slope-changing Algorithm - Map Phase of First MapReduce Round . . . . . . . 31
11 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Parzen
Window Classifier Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
12 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Fitness
Proportional Selection Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
13 Slope-changing Algorithm - Map Phase of Second MapReduce Round . . . . . 38
14 Slope-changing Algorithm - Reduce Phase of Second MapReduce Round . . . . 39
15 Slope-changing Algorithm - Prediction . . . . . . . . . . . . . . . . . . . . . . . 40
-
Preface
-
Acknowledgments
-
Chapter 1
Introduction
1.1 Motivation
The goal is to approximate a non-linear regression model using piecewise linear models for large-scale datasets. Regression models have many applications in the real world, among which we can name trend analysis, finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us find better models from the data. The issue with big datasets is that they also need a long time to be processed on a single machine. When the dataset is very large (terabyte scale), even reading the content of the dataset takes a very long time (a high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will require three hours to read a 4 TB dataset! [12]). For this reason we need parallel and distributed methods to process big datasets.
There are many options for parallel data processing. We have decided to use the MapReduce programming model, introduced by Google in 2004 for processing large datasets [3], as the distributed data processing framework, for the following reasons:
• MapReduce handles many of the issues with large-scale distributed data processing, such as the distributed file system. Google File System (GFS) is the original file system it uses. GFS makes all data transfer and distribution across cluster nodes transparent to the programmer. The user simply copies a file to the cluster, and GFS decides how the file is distributed among cluster nodes, keeps track of the file's chunks, and manages replication of those chunks on different nodes for fault tolerance.
• Fault tolerance is another thing that MapReduce takes care of. The programmer does
-
not need to worry about node failures. If a node fails, MapReduce manages the problem and reassigns its tasks to other cluster nodes.
• Code and data migration is also managed by MapReduce. All the mapper nodes in the cluster run the same map code on their data. MapReduce takes care of delivering the code to all mappers and running it on the nodes. The result of the map round needs to be shuffled among cluster nodes (delivered to reducers), and MapReduce takes care of this data shuffling too. The reduce phase, and code migration in this phase, is also managed by the MapReduce framework.
• MapReduce simplifies solving a distributed data processing problem by introducing a high-level programming model for distributed data processing. It helps programmers concentrate on program logic, while all the details and issues related to the distributed nature of the solution are managed by MapReduce. Although MapReduce restricts us and reduces flexibility in some ways, it gives us a standard way of describing distributed data processing algorithms.
• MapReduce is one of the common ways of solving distributed data processing problems in industry these days.
Details about how MapReduce works are explained in section 2.2.
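As an illustration of the programming model the bullets above describe, the map/shuffle/reduce flow can be sketched as a minimal single-process simulation (the function names here are our own; a real framework such as Hadoop additionally handles distribution, fault tolerance, and file management):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round: map every record to (key, value)
    pairs, group the values by key (the shuffle), then reduce each
    key's value list to a single result."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            shuffled[key].append(value)     # shuffle: group by key
    return {key: reduce_fn(key, values)     # reduce phase
            for key, values in shuffled.items()}

# The canonical word-count example [3]:
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(["a b a", "b c"], count_map, count_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```

On a cluster, the loop over records and the per-key reductions run on different machines; the simulation only shows the data flow.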
1.2 Problem Statement
We are handling a large-scale non-linear regression problem. Regression is a supervised learning technique in which the algorithm tries to find a model from a dataset to generate numerical predictions for future data items. We call the numerical dependent variable (target variable) y, and try to approximate its value as a function of other numerical values x. Here x is a vector consisting of n numerical values x1, x2, . . . , xn, where n is the number of features (attributes) of each data item in the dataset.

y = f(x) + ε    (1.1)

In the above equation, ε is the difference between the actual and predicted values of the target variable. The predicted value for y is f(x) and is indicated by the symbol ŷ. There are different ways to handle non-linear regression problems.
We intend to find a solution for large-scale datasets. Handling large-scale datasets could
be very slow if parallel and distributed data processing techniques and frameworks are not
-
used. For the reasons mentioned in section 1.1, the programming model we have employed to handle large-scale datasets is MapReduce. When using MapReduce, the method employed to solve the problem sequentially needs to be recast and translated into the MapReduce programming model. Details about how MapReduce works are explained in section 2.2.
Designing an algorithm for the MapReduce framework (map and reduce phases) entails issues such as deciding what processing the cluster nodes need to do on their local pieces of data, and what information they need to extract to compensate for not having a global view of the data on each node of the cluster. Another challenge when designing a MapReduce-based algorithm is how the final result is aggregated. Generally a problem can be handled by MapReduce in several different ways, and choosing the best way to make use of MapReduce's capabilities is the main challenge. Since there is no communication between nodes during the map and reduce phases, and results can only be communicated when the map phase is done, choosing an effective strategy for extracting useful information from the partial views the mappers have of the partial data in hand, and making use of this information in the reduce phase (or in subsequent MapReduce rounds), is a problem that needs to be addressed.
1.3 Overview of Approach
In this work two different distributed algorithms for approximating the non-linear regression model of a dataset using piecewise regression are proposed. Both algorithms are designed for the MapReduce framework.
The first algorithm is called the MapReduce Regression Tree (MRRT) algorithm. This algorithm divides the feature space into equal-size partitions (equal-size in terms of volume, not the number of data points in the partition). To form the partitions, the feature space is divided along one dimension, selected randomly or using a pre-processing method that works on a sample of the dataset. Data items belonging to different subspaces are then sent to different reducers, and all reducers construct regression tree models (in parallel) for the partitions they have received. Although a reducer technically needs only one partition of the feature space to generate its model, we send the left and right neighbors of each partition to the reducer too (overlapping subspaces). This way each reducer receives three partitions instead of one (the leftmost and rightmost partitions have only one neighbor, so their reducers receive two partitions instead of three). Since we are sending extra information to each
-
reducer, the data that needs to be transferred over the network and processed on each machine increases. This redundancy has a good side effect: it increases the accuracy of the final model by decreasing the prediction error for data items located near the borderlines. The algorithm uses a weighted prediction mechanism to further increase accuracy. Details of this algorithm are explained in section 3.1.
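A rough sketch of this shuffling step (the function and its parameters are our own illustration, not the thesis's implementation) routes each data point to the reducer that owns its partition and to the two neighboring reducers, so that every reducer ends up with its own partition plus the left and right neighbors:

```python
def assign_to_reducers(data, dim, num_parts, lo, hi):
    """Divide dimension `dim` of the feature space into `num_parts`
    equal-width partitions over [lo, hi]; route each point (x, y) to
    its own partition's reducer and to both neighboring reducers
    (overlapping subspaces)."""
    width = (hi - lo) / num_parts
    buckets = {r: [] for r in range(num_parts)}
    for x, y in data:
        p = min(int((x[dim] - lo) / width), num_parts - 1)
        for r in (p - 1, p, p + 1):        # own partition + neighbors
            if 0 <= r < num_parts:
                buckets[r].append((x, y))
    return buckets

# Example: three partitions over [0, 1); the middle reducer receives
# the data of all three partitions, the boundary reducers of two.
buckets = assign_to_reducers(
    [([0.1], 1.0), ([0.5], 2.0), ([0.9], 3.0)],
    dim=0, num_parts=3, lo=0.0, hi=1.0)
# len(buckets[1]) == 3
```

In the actual MapReduce setting this routing happens in the map phase, with the partition index serving as the key that determines the receiving reducer.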
The second algorithm is called the Slope-changing algorithm. In this algorithm the dataset is distributed among cluster nodes in a random fashion. Every mapper has part of the dataset at hand and finds a set of candidate split points in that part. Split points are points that will be used to split the feature space into smaller subspaces. These candidate split points include points with locally maximum or minimum target values; points where the model's slope changes sharply are also selected by the mappers as candidates. All the candidate split points found by the mappers are sent to a single reducer, which selects the split point set from this candidate set. Two different methods are suggested for selecting the final split points from the candidates: one uses a Parzen window classifier, the other fitness proportional selection. After the split point set is selected, it is sent to all mappers in the cluster. All mappers use these split points to partition the data based on the subspaces the split points form. Each mapper then sends the data points pertaining to a certain subspace to a certain reducer. This way each reducer receives all the data points of a certain subspace from all mappers, and can construct a linear model for that subspace. A piecewise linear model over all subspaces of the feature space is thus constructed; it will be used to predict the target value for future test items based on the subspace in which the test item is located. Details of this algorithm are explained in section 3.2.
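A mapper's local search for candidates can be sketched in one dimension as follows (the `slope_threshold` parameter and the function name are illustrative assumptions, not values from the thesis):

```python
def candidate_split_points(points, slope_threshold=1.0):
    """Scan a mapper's local (x, y) pairs, sorted by x, and flag as
    candidate split points the local minima/maxima of the target value
    and the points where the slope between consecutive points changes
    sharply."""
    points = sorted(points)
    candidates = []
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        left = (y1 - y0) / (x1 - x0)       # slope approaching the point
        right = (y2 - y1) / (x2 - x1)      # slope leaving the point
        is_extremum = (y1 > y0 and y1 > y2) or (y1 < y0 and y1 < y2)
        sharp_turn = abs(right - left) > slope_threshold
        if is_extremum or sharp_turn:
            candidates.append(x1)
    return candidates
```

Each mapper would emit its candidate list to the single reducer, which then applies the Parzen window or fitness proportional selection step described above.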
Both the MRRT and Slope-changing algorithms divide the feature space into subspaces and find models for each subspace, but they divide the feature space differently. MRRT divides the space into equal-size subspaces, whereas the size and number of subspaces in the Slope-changing algorithm may vary and are determined by the split points chosen by the algorithm. Another difference is that MRRT constructs regression tree models for the subspaces, while the Slope-changing algorithm constructs linear models. MRRT also uses overlapping subspaces; the Slope-changing algorithm does not.
1.4 Overview of Contributions
Because of the limitation of the Slope-changing algorithm (discussed in section 4.5.1), it is not applicable to high-dimensional datasets. For this reason we only list the contributions of the
-
MRRT algorithm:
• Overlapping subspaces (coupled with weighted prediction) not only solve the data distributedness problem, but also help improve accuracy over the baseline (regression tree) algorithm. If the preProcess method is employed to choose the dimension to split along, MRRT improves accuracy for 8 out of 10 synthetic datasets, by 1.1% to 32.86%, and for all three real datasets, by 4.66%, 13.24%, and 22.73% respectively.
• The MRRT algorithm shows close-to-linear speedup (for two out of the four datasets experimented on) and near-optimum scalability for all datasets.
• Although MRRT's prediction is done sequentially and not on a MapReduce framework, it improves prediction time by more than 80% compared to the regression tree algorithm.
• MRRT can be used on a single machine, and in that case it improves learning time by 60% (in most cases) compared to the regression tree algorithm.
• MRRT needs to choose a dimension to split along. The preProcess method we have proposed for MRRT (to choose this dimension) increases the accuracy of the model for 11 out of 13 datasets compared to the model constructed by the regression tree algorithm.
1.5 Overview of Chapters
In chapter 2 we review the literature related to regression, piecewise regression, and regression trees. We also discuss MapReduce and some large-scale problems that have been solved using it. Limitations of MapReduce, and arguments about these limitations, are also discussed in that chapter. Chapter 3 presents details of the algorithms we propose for piecewise approximation of non-linear models within MapReduce. Chapter 4 presents the empirical evaluation of the algorithms and compares them with the baseline (regression tree) algorithm. Chapter 5 summarizes the findings and presents concluding remarks.
-
Chapter 2
Literature Review
In this thesis two distributed MapReduce-based algorithms for approximating large-scale non-linear regression using piecewise regression are proposed. The two major parts of the problem are approximating non-linear regression using piecewise regression, and the MapReduce framework. We review the literature related to these two subproblems in the following sections.
2.1 Approximating Non-linear Regression Using Piece-
wise Regression
2.1.1 Linear Regression
Linear regression can be used when there is a linear (or roughly linear) dependency between x and y (x and y are introduced in section 1.2). In this case the learning algorithm tries to model y as a linear function of x:

y = β0 + β1x + ε    (2.1)

In the above equation, the sizes of the x and β1 vectors are equal to the number of dimensions in the feature space, and ε is the difference between the actual and predicted values of the target variable (the error term). We use the symbol ŷ to indicate the value of the target variable predicted by the model, and we have ŷ = β0 + β1x. The learning algorithm tries to learn the β0 and β1 values (called weights) from the training items in the dataset. When learning the weights, the objective is to minimize the difference between actual and predicted values over all data items (as an example, this difference can be measured by minimizing the sum of squared differences between actual and predicted target
-
values):

Σ_{i=1}^{n} (y^(i) − ŷ^(i))² = Σ_{i=1}^{n} (y^(i) − (β0 + β1 · x^(i)))²    (2.2)

We use < x^(k), y^(k) > to indicate the kth data item in the dataset.
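As a concrete sketch of minimizing this sum of squared errors (our own illustration using NumPy, not the thesis's Matlab implementation), the weights β0 and β1 can be found by solving a linear least-squares problem with an intercept column prepended to the data matrix:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares for equation 2.2: find beta0 (intercept)
    and beta1 (weight vector) minimizing the sum of squared residuals.
    X is an (m, n) matrix of feature vectors, y an (m,) target vector."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]                       # beta0, beta1

# Usage: recover y = 2 + 3*x exactly from noise-free data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
beta0, beta1 = fit_linear(X, y)
# beta0 ≈ 2.0, beta1[0] ≈ 3.0
```

The same fit is applied per subspace in the piecewise methods described next.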
2.1.2 Non-linear Regression via Piecewise Linear Regression
One of the advantages of linear regression is its simplicity, and one of its disadvantages is its globality. When the relation between x and y is complex and non-linear, even the best possible linear model will have a high average prediction error. Partitioning the feature space into smaller subspaces and constructing a model for each subspace might help in finding a better model and reducing the error. Piecewise methods use this idea and find constant or linear models for each subspace of the feature space instead of one global linear model.
A constant model for a subspace containing a set of data items s1 = {<x^(1), y^(1)>, <x^(2), y^(2)>, . . . , <x^(n), y^(n)>} is calculated as follows:

ŷ(s1) = (1 / size(s1)) ∑_{k∈s1} y^(k)   (2.3)

and the prediction for any new data item that lies in this subspace is ŷ(s1).
In most cases it is better to find a linear model for each subspace of the feature space. In
this case, Equation 2.2 from the previous section is used by the linear regression learning
algorithm to find a linear model for each subspace.
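As an illustration of the equal-size-subspace idea, the sketch below (plain Python; the function names are our own) fits the constant model of Equation 2.3, i.e. the mean of y, in each one-dimensional subspace:

```python
def fit_piecewise_constant(data, boundaries):
    """Fit a constant model (the mean of y, Eq. 2.3) for each subspace.
    `data` is a list of (x, y) pairs with 1-D x; `boundaries` are the
    sorted edges separating the 1-D subspaces."""
    models = [[] for _ in range(len(boundaries) + 1)]
    for x, y in data:
        idx = sum(1 for b in boundaries if x >= b)  # which subspace x falls in
        models[idx].append(y)
    # ŷ(s) = (1/size(s)) * sum of y over items in s; None if a subspace is empty
    return [sum(ys) / len(ys) if ys else None for ys in models]

def predict_piecewise(models, boundaries, x):
    """Prediction for a new item is the constant model of its subspace."""
    idx = sum(1 for b in boundaries if x >= b)
    return models[idx]
```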
Figure 2.1: A regression tree (left), and the corresponding 2-dimensional feature space (right).
Each node of the tree corresponds to a subspace of the feature space; for each leaf ℓ and
corresponding region Rℓ, the estimate of the target value is the average ŷℓ of the observed y
values within that region [19].
Dividing the feature space into subspaces can be done in different ways. A simple way
of dividing the feature space into smaller subspaces is to use equal-size subspaces. It is also
possible to let the algorithm decide on the borders of the subspaces. The regression tree,
presented in the next section, uses a recursive method to divide the feature space into subspaces.
Figure 2.1 depicts a regression tree and also shows how the feature space is divided into
smaller subspaces based on this regression tree. The leaves of the regression tree are models for
the subspaces of the feature space (ŷi is a model for Ri).
2.1.3 Piecewise Regression with Regression Trees
A regression tree is a piecewise method that recursively partitions the feature space into smaller
subspaces. The tree itself consists of nodes and edges. Every internal node contains a simple
condition, e.g. xi < 10 (i.e., whether the data item's ith feature value is smaller than 10), and one
of the branches is chosen based on the current data item's answer to this question. To find the
prediction for a new data item, the tree is traversed from the root until a leaf is reached.
The leaves of a regression tree contain a model, such as a linear model or a constant model.
Constructing a regression tree is an iterative task. In each iteration, a feature and a cor-
responding threshold value need to be chosen by the algorithm. We call a pair
<Feature, Value> a split point. Selecting split points is a critical task when con-
structing piecewise models. When selecting a split point pair among different candidate split
point pairs, a metric is used to evaluate the different trees corresponding to the different split point
pairs. The tree and corresponding split point that perform best according to the metric are
chosen for the next iteration. The basic regression tree algorithm can use the Sum of Squared
Errors (SSE) to evaluate a tree T [8]:
S = ∑_{c∈leaves(T)} ∑_{i∈c} (y^(i) − ŷ^(c))²   (2.4)

where

ŷ^(c) = (1 / size(c)) ∑_{i∈c} y^(i)   (2.5)

is the predicted value for all data items landing in that leaf.
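Equations 2.4 and 2.5 translate directly to code. In the sketch below (plain Python; helper names are our own), each leaf is represented simply by the list of target values that land in it:

```python
def leaf_prediction(ys):
    """ŷ(c): the mean of the target values in a leaf (Eq. 2.5)."""
    return sum(ys) / len(ys)

def tree_sse(leaves):
    """Sum of squared errors over all leaves of a tree (Eq. 2.4).
    `leaves` is a list in which each leaf is the list of its y values."""
    total = 0.0
    for ys in leaves:
        y_hat = leaf_prediction(ys)
        total += sum((y - y_hat) ** 2 for y in ys)
    return total
```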
Algorithm 1 lists the basic algorithm for constructing a regression tree. In this algorithm,
first all the data items of the dataset are assigned to the root node (line 2). The ŷ(c) and SSE
values are then calculated for the root node (lines 3-4). Afterward, a repetitive task (lines 6-
32) is applied to each leaf of the tree, and each leaf is populated with two children until
a certain condition holds (lines 26-30). For each leaf of the tree, all possible split pairs
<Feature, Value> are examined and the pair that reduces the SSE of the leaf the most is chosen
(lines 12-25). If the chosen pair reduces the SSE by more than a threshold δ, the node
is populated with two children; otherwise, the leaf is left untouched (lines 26-31). If the
number of data items in a node is less than a threshold q, that node is also left untouched
(lines 8-10).
Algorithm 1 Basic Regression Tree Construction Algorithm
1: procedure ConstructRegTree(dataset)
2:   root.dataItems = dataset
3:   root.ŷ(c) = (1/size(dataset)) ∑_{i∈dataset} y^(i)
4:   root.sse = ∑_{i∈dataset} (y^(i) − root.ŷ(c))²
5:   queue.add(root)
6:   while !queue.isEmpty do
7:     node = queue.remove
8:     if size(node.dataItems) < q then
9:       continue
10:    end if
11:    bestSplitPair.sse = ∞
12:    for splitPair ∈ allSplitPairs do
13:      left.dataItems = splitDataItems(node.dataItems, splitPair, left)
14:      right.dataItems = splitDataItems(node.dataItems, splitPair, right)
15:      left.ŷ(c) = (1/size(left.dataItems)) ∑_{i∈left.dataItems} y^(i)
16:      left.sse = ∑_{i∈left.dataItems} (y^(i) − left.ŷ(c))²
17:      right.ŷ(c) = (1/size(right.dataItems)) ∑_{i∈right.dataItems} y^(i)
18:      right.sse = ∑_{i∈right.dataItems} (y^(i) − right.ŷ(c))²
19:      if bestSplitPair.sse > left.sse + right.sse then
20:        bestSplitPair.splitPair = splitPair
21:        bestSplitPair.sse = left.sse + right.sse
22:        bestSplitPair.left = left
23:        bestSplitPair.right = right
24:      end if
25:    end for
26:    if node.sse − bestSplitPair.sse > δ then
27:      node.left = bestSplitPair.left
28:      node.right = bestSplitPair.right
29:      queue.add(node.left)
30:      queue.add(node.right)
31:    end if
32:  end while
33: end procedure
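The following is a simplified Python sketch of Algorithm 1 for one-dimensional x, using recursion in place of the explicit queue and taking the candidate thresholds from the data values themselves (all names are our own, for illustration only):

```python
def build_tree(items, q=2, delta=1e-9):
    """Greedy regression-tree construction, sketched after Algorithm 1.
    `items` is a list of (x, y) pairs with 1-D x; q is the minimum node
    size and delta the minimum SSE reduction required to split."""
    ys = [y for _, y in items]
    y_hat = sum(ys) / len(ys)                      # ŷ(c), Eq. 2.5
    sse = sum((y - y_hat) ** 2 for y in ys)        # node SSE
    node = {"y_hat": y_hat, "sse": sse, "split": None, "left": None, "right": None}
    if len(items) < q:                             # lines 8-10
        return node
    best_sse, best_t, best_left, best_right = float("inf"), None, None, None
    for t, _ in items:                             # candidate split values from the data
        left = [(x, y) for x, y in items if x < t]
        right = [(x, y) for x, y in items if x >= t]
        if not left or not right:
            continue
        part_sse = 0.0                             # lines 13-18: SSE of the two children
        for part in (left, right):
            mean = sum(y for _, y in part) / len(part)
            part_sse += sum((y - mean) ** 2 for _, y in part)
        if part_sse < best_sse:                    # lines 19-24: keep the best pair
            best_sse, best_t, best_left, best_right = part_sse, t, left, right
    if best_t is not None and sse - best_sse > delta:  # lines 26-31
        node["split"] = best_t
        node["left"] = build_tree(best_left, q, delta)
        node["right"] = build_tree(best_right, q, delta)
    return node

def tree_predict(node, x):
    """Traverse from the root to a leaf and return the leaf's constant model."""
    while node["left"] is not None:
        node = node["left"] if x < node["split"] else node["right"]
    return node["y_hat"]
```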
One of the issues with the basic regression tree algorithm is its greedy method for
selecting the feature and value to split on. There are two problems with a greedy method for selecting
the split points. First, since greedy methods make their decisions based on locally optimal
choices, the final model might be suboptimal in terms of accuracy. Second, when the
number of dimensions and the size of the dataset are large, finding these split points (even greedily)
has a very high runtime. We need methods that increase accuracy and decrease
the runtime.
Regression trees and piecewise linear regression were proposed for the case when the dataset is not dis-
tributed. When the dataset is large, the algorithms that generate the regression model can be
very slow (splitting all data items of all leaf nodes into two subsets for all different
<Feature, Value> pairs is an expensive task for a high-volume, high-dimensional dataset). Thus,
for large-scale datasets, new technologies, techniques, and algorithms need to be used to per-
form the task more efficiently. Section 2.2 discusses MapReduce, the framework
we have used for distributed data processing.
2.1.4 Piecewise Linear Approximation of Time Series
Piecewise linear representation (PLR) is generally used to approximate time series with
straight lines (hyperplanes). Piecewise linear representation is more efficient than other
modeling techniques in terms of storage, transmission, and computation, and has several ap-
plications in clustering, classification, similarity search, etc. [10].
Piecewise linear representation algorithms are also called Segmentation Algorithms (SAs). Three dif-
ferent specifications have been defined for SAs. For a time series T, find the best representation
that

• includes only K segments,

• keeps the error for each segment below a threshold, or

• keeps the total error below a threshold.
A PLR can be either online or batch [10].
PLR algorithms can be divided into three categories: bottom-up, top-down, and sliding-
window. The bottom-up approach finds approximations of small pieces of the time series and builds
the final solution by merging them. The top-down approach recursively divides the time series
until a stopping criterion is satisfied [10, 13]. The sliding-window approach grows a segment until the
error exceeds a threshold: it starts from the first point of T and adds points
to the segment while the sum of errors is less than a threshold. At that point a segment is generated and
the process continues to generate a new segment from the next point. Several optimizations have been
proposed for this algorithm: 1) advancing by more than one point in each iteration of the process
of finding one segment, and 2) since the error is monotonically non-decreasing, methods such as
binary search can be used [10].
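A minimal sketch of the sliding-window idea, assuming constant models per segment for brevity (the algorithms discussed in [10] typically fit lines; function name is our own):

```python
def sliding_window_segment(ts, max_error):
    """Sliding-window segmentation sketch: grow each segment while the SSE
    of its best constant fit stays within `max_error`, then start a new
    segment from the next point. Returns half-open index intervals."""
    segments = []
    start = 0
    while start < len(ts):
        end = start + 1
        # Grow the segment until adding one more point exceeds the error bound
        while end < len(ts):
            window = ts[start:end + 1]
            mean = sum(window) / len(window)
            sse = sum((v - mean) ** 2 for v in window)
            if sse > max_error:
                break
            end += 1
        segments.append((start, end))  # segment covers ts[start:end]
        start = end
    return segments
```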
Top-down methods find good split points and split T into two segments. An approximate
linear model is calculated for each part, along with its error. If the error
is less than a threshold, the model for that part is finalized; otherwise the algorithm recursively
repeats the process. Bottom-up methods start from the smallest possible segments (n/2
segments in total). They find the cost of merging each pair of adjacent segments and merge the
adjacent pair that has the lowest cost. This process is repeated while the minimum cost of
merging remains smaller than a threshold [10, 13].
Keogh et al. propose a new online algorithm called SWAB (Sliding-Window And Bottom-
up). SWAB uses a sliding buffer with a size close to 6 segments and applies bottom-up on that frame.
After segmentation, the leftmost segment is reported, the corresponding data is removed
from the frame, and more data is read into the frame [10].
D. Lemire suggests that instead of having linear models for each interval of a time series,
we could have models of different degrees for different intervals [13]. Some intervals may
have constant models, some linear, etc. The method is called adaptive because the degree of the
model in each interval is decided adaptively. The reason the adaptive method is suggested is
that piecewise linear models might locally over-fit the data by trying to find a linear model
where a constant model would fit the data better. Since time series datasets
can be very large, the efficiency of the algorithm is very important. The adaptive method proposed
in this paper tries to improve the quality of the model while keeping the cost of model
construction the same as the top-down method [11].
Different algorithms with different advantages and disadvantages can be used for ap-
proximating time series [13]. Optimal adaptive segmentation uses dynamic programming to
find the best segmentation and is thus of high complexity (Ω(n²)). The top-down method, on
the other hand, iteratively selects the worst segment and divides it into two smaller segments
until the complexity of the model reaches the maximum allowed complexity. The adaptive top-down
algorithm first applies the top-down algorithm to the time series, and then replaces linear-model seg-
ments with two constant-model segments wherever the error can be reduced by this replacement.
Another version of adaptive top-down first constructs a top-down constant model and then
merges constant models in order to obtain linear models. The optimal algorithm is not prac-
tical because it takes a very long time (weeks) to generate results for a time series with one
million data points. Adaptive top-down is slightly slower than the top-down algorithm, but
generates results of higher quality.
2.1.5 Online Approximation of Non-linear Models
XCSF and LWPR are two algorithms for online linear approximation of an unknown
function. These methods cluster the input space into small subspaces, find a linear model
for each subspace, and use a weighted sum to form the final model. For this we need to
first structure the feature space into small subspaces in order to exploit the linearity of the
target function in each subspace, and then find the linear model in each patch.
There are several solutions for the second step, but the first step is not straightforward.
XCSF is an evolutionary algorithm that uses a GA [22], and LWPR (Locally Weighted
Projection Regression) is a statistics-based algorithm; both incrementally approximate
non-linear multi-dimensional functions online [20].
Receptive Fields (RFs) is the notion used by LWPR for the ellipsoidal subspaces; XCSF
refers to the subspaces as classifiers (another term for RFs) [17]. Both algorithms start with an empty
population of RFs and add new members to this population when a new
uncovered data item is received. An n-dimensional ellipsoid that is not necessarily axis-
aligned can be represented by a positive semi-definite, symmetric matrix D. The
squared distance of a data item x from the center c of this subspace is then defined as:

d² = (x − c)ᵀ · D · (x − c)   (2.6)
If this distance equals one, the data item lies on the surface of the RF. This is how the
subspaces are defined in both methods. A linear model for each subspace can be expressed as:

p(x) = ∑_{k=1}^{n} b_k·x_k + b_0   (2.7)

One data item can be covered by several subspaces; in that case a weighted combination of the
linear models of those subspaces is used as the model's prediction for the input data
item [17, 22, 20].
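Equations 2.6 and 2.7 and the weighted combination can be sketched as follows. The Gaussian weighting here is an LWPR-style choice used for illustration (XCSF weights covering classifiers differently, as discussed below), and all function names are our own:

```python
import math

def squared_distance(x, c, D):
    """d² = (x − c)ᵀ · D · (x − c) as in Eq. 2.6, for plain Python lists.
    D is a symmetric positive semi-definite matrix (list of rows)."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    Dd = [sum(D[i][j] * diff[j] for j in range(len(diff))) for i in range(len(diff))]
    return sum(di * ddi for di, ddi in zip(diff, Dd))

def weighted_prediction(x, rfs):
    """Combine the local linear models (Eq. 2.7) of the receptive fields,
    weighted by a Gaussian of the squared distance to each RF's center."""
    num, den = 0.0, 0.0
    for c, D, b0, b in rfs:  # each RF: center, shape matrix, intercept, weights
        w = math.exp(-squared_distance(x, c, D))
        p = b0 + sum(bk * xk for bk, xk in zip(b, x))  # local linear model p(x)
        num += w * p
        den += w
    return num / den
```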
LWPR assigns a Gaussian activity weight to each subspace based on its distance to the data
item and, for the sake of performance, ignores weights that are smaller than a threshold.
This way, closer subspaces have a more significant effect on the final prediction compared to farther
ones. XCSF, on the other hand, only assigns weights to subspaces with a distance of less
than 1. In XCSF, weights are inversely proportional to the prediction error [17, 22, 20].
Finding a linear model for each subspace is straightforward and can be done using least
squares methods. XCSF uses RLS (Recursive Least Squares), and LWPR uses incremental
partial least squares (an incremental version of PLS) to find the linear model in a subspace [17, 22, 20].
Learning the locality (the shape and location of the receptive fields) is done by a steady-
state genetic algorithm in XCSF, and by stochastic gradient descent in LWPR [17, 22, 20].
In XCSF, each RF keeps an approximate value of its current prediction error, which is used
to calculate its fitness for the GA. Fitness is shared among the RFs that cover the same inputs.
Tournament selection is used for the GA's selection task, and the crossover and mutation operators
are applied to the center location, stretch, and rotation, which are defined by the matrix D.
When the population reaches a maximum size, some RFs are deleted from crowded regions
of the input space using a proportionate selection probability. During this process, the algorithm
tries to generalize RFs by enlarging their coverage area while keeping their accuracy sufficiently
high [17, 22, 20].
In LWPR, the centers of the subspaces are not changed; only the D matrix (the size and
orientation of the ellipsoids) is adapted. This optimization is done by incremental gradient descent
based on stochastic leave-one-out cross-validation. For this purpose, D is first decomposed into
a triangular matrix and then updated. The cost function is the activity-weighted error plus a
penalty term that prevents the subspaces from shrinking over iterations [17, 22, 20].
XCSF and LWPR are compared in [17]. For comparison purposes, LWPR is tuned to hit
a low target error (by decreasing the size of the RFs and changing the learning rate and penalty
value), namely the target error hit by XCSF. Then XCSF's maximum population size is set to be
roughly equal to LWPR's number of RFs [17, 22, 20].
2.2 MapReduce
MapReduce is a programming model for processing large datasets. Programs written in
this programming model run on a cluster of nodes called a MapReduce cluster. There are
two kinds of nodes in such a cluster: mappers and reducers. Mappers run the part of the program
called the map procedure, and reducers run another part of the code called the reduce procedure. All
mappers and reducers run the same code on different data. Mappers (the map procedure) read the
input data from the hard disk of the machine they run on, and process the data to generate
intermediate results. The data received by a mapper is assumed to be in the form of <key, value>
pairs. Each mapper processes its part of the data and generates its result as <key, value> pairs as
well. An input <key, value> pair and an output <key, value> pair might not have anything to do
with each other.
One mapper might generate many <key, value> pairs with different keys and values.
The <key, value> pairs generated by the mappers are then sent to reducers for the next phase
of processing. The <key, value> pairs are not sent to reducers randomly; instead, they are
partitioned among the reducers based on the key of each pair. For example, of all the <key, value>
pairs generated by all mappers, those with key equal to key1 are sent to one
certain reducer.

Each reducer receives a group of <key, value> pairs generated by the mappers and processes
them in order to generate the final result. Since the map and reduce phases are run in parallel
by all mappers and reducers, a large dataset that is distributed among cluster nodes is processed by the
MapReduce framework much faster than is possible on a single machine.
2.2.1 Why <Key, Value> Pairs?
When data is processed by mappers, we need a way to aggregate the results generated by different
mappers. For example, if the ultimate task is counting the number of words starting with a, b, c,
and d in a huge set of text files, each mapper can generate the result for the part of the data
it holds locally, and we need a way to aggregate the results from all mappers. Having <key, value>
pairs lets us request that all mappers send the count of all words starting with a certain
character to a certain reducer, so that this reducer has all the partial results and can
calculate the final result. For this purpose, all mappers would generate results like <a, count>, and
all the results having a as their key would be sent to one certain reducer [3].

The key concept is that the programmer is aware of the way mappers need to generate
results (<key, value> pairs) and of the way data is shuffled from mappers to
reducers, and needs to decide how to use this programming model to solve the problem
at hand.
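The word-counting example above can be sketched as a single-process simulation of the map, shuffle, and reduce phases (plain Python; this illustrates the programming model only, not an actual distributed run, and the function names are our own):

```python
from collections import defaultdict

def map_phase(file_content):
    """Map: emit one <first-letter, 1> pair per word starting with a-d."""
    for word in file_content.split():
        first = word[0].lower()
        if first in "abcd":
            yield (first, 1)

def shuffle(pairs):
    """Group pairs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the partial counts received for one key."""
    return (key, sum(values))

# Simulate the job with two 'mappers', each holding part of the data
pairs = list(map_phase("apple dog cat")) + list(map_phase("banana dig ant"))
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real MapReduce run, each `map_phase` call would execute on a different machine and the shuffle would move pairs over the network, but the dataflow is the same.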
2.2.2 Is That All MapReduce Does?
So far we have talked about how the MapReduce programming model helps us solve data
processing problems in a parallel fashion. But this is not all that a MapReduce implementation
offers us (different MapReduce implementations are available, among which we can name
Hadoop [12], an open-source implementation). Once you have written your MapReduce code,
you are done, and Hadoop (or any other MapReduce implementation) takes care of the rest of the
problems. The framework sends the map procedure to all mappers and the reduce procedure
to all reducers. It then asks the mappers to run the code on their local data and generate
results based on what is specified in the code. After the results are generated, the framework takes
care of shuffling the data among the reducers. After the reducers receive the data, it asks them
to run the reduce procedure on the received <key, value> pairs.
Figure 2.2: MapReduce Execution Overview [3].
A question here is how the programmer divides and copies the data file onto the cluster nodes
so that it can be processed by the MapReduce framework. The programmer does not need to do this task.
The MapReduce framework has a distributed file system (the Google File System, or GFS, and the Hadoop
Distributed File System, or HDFS, in the Hadoop implementation of MapReduce) that facilitates
it. All you need to do is run the distributed file system and issue a command like
copy bigFile.txt on the cluster; the rest of the work is done by the framework. Another question
is what happens if a certain mapper fails in the middle of a run. The answer is that the MapReduce
framework takes care of this issue as well. When the distributed file system copies the data onto the
cluster, it replicates different chunks of the data on different mappers (based on the replication factor
indicated in the configuration file by the user), and when a certain mapper fails, its task is
assigned to other mappers. The MapReduce framework also takes care of other lower-level tasks
such as network communication. There are nodes in a MapReduce cluster whose task
is bookkeeping: they keep track of cluster nodes, mappers, reducers, data replication, etc.
Figure 2.2 illustrates the execution of a MapReduce task on a MapReduce cluster. The user
program is distributed by the master among the worker nodes. Some of the worker nodes work
as mappers and some as reducers. Data is read by the mappers, which then run the map
procedure on it. The intermediate data that is generated is then sent to the reducer
nodes. The reducer nodes process the <key, value> pairs they have received and generate the
final result [3].
2.2.3 MapReduce for Clustering
One large-scale data processing task is data clustering. Several algorithms have been presented
recently for different clustering problems on the MapReduce framework. In this section we
review three clustering algorithms to see how they use the power of MapReduce to
cluster data.
Zhao et al. argue that all previous research on parallel k-means suffers
from two problems [24]. First, it assumes that all the data fits in main memory, and second,
it uses a restricted programming model. For these two reasons, those works are not
applicable to peta-scale datasets. Since distance calculation (performed n·k times in each
iteration, where n is the number of data points and k is the number of clusters) is the most expensive
step of the algorithm, the authors try to exploit the parallelism of MapReduce to decrease this
cost. The Map function assigns each data point to its closest center, and the Reduce function updates
the centroids. There is one more function, called Combine, that aggregates the intermediate
results of the Map functions. A global variable called centers holds the list of all centers and is used
by all map tasks. Map tasks generate <cluster index, data point> pairs. The Combine
method aggregates the results of the same map task: it calculates the partial sum of the
data points assigned to the same cluster and outputs <cluster index, partial sum> pairs.
The Reduce function aggregates all the partial sums for each cluster and calculates the new
centroids; its output is <cluster index, new centroid> pairs. The speedup achieved
on 4 machines is around 3 for the biggest dataset (8 GB), which is a good speedup, and
the speedup grows with dataset size, which is a good indication. The authors do not
discuss the iterative nature of the algorithm or how this issue is handled. They
also do not discuss the accuracy of the method; they only report speed-up, scale-up, and
size-up [24].
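One iteration of the Map/Combine/Reduce scheme described above might be sketched as follows for one-dimensional points (a single-process simulation in plain Python; the pair contents follow our reading of [24], and all names are our own):

```python
def kmeans_map(points, centers):
    """Map: assign each point to its closest center, emitting <index, point>."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
        yield (idx, p)

def kmeans_combine(pairs):
    """Combine: partial sum and count per cluster, local to one map task."""
    partial = {}
    for idx, p in pairs:
        s, n = partial.get(idx, (0.0, 0))
        partial[idx] = (s + p, n + 1)
    return partial  # <cluster index, (partial sum, count)>

def kmeans_reduce(partials):
    """Reduce: aggregate the partial sums and emit the new centroids."""
    totals = {}
    for partial in partials:
        for idx, (s, n) in partial.items():
            ts, tn = totals.get(idx, (0.0, 0))
            totals[idx] = (ts + s, tn + n)
    return {idx: s / n for idx, (s, n) in totals.items()}

# One iteration over two map tasks (1-D points for brevity)
centers = [0.0, 10.0]
p1 = kmeans_combine(kmeans_map([1.0, 2.0], centers))
p2 = kmeans_combine(kmeans_map([9.0, 12.0], centers))
new_centers = kmeans_reduce([p1, p2])
```

A full k-means run would repeat this map/combine/reduce round until the centroids stop moving, which is exactly the iterative aspect the paper does not elaborate on.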
Ferreira Cordeiro et al. present an algorithm for clustering very large multi-dimensional datasets
with MapReduce [7]. Since such a dataset does not fit on one or even several disks, parallel
processing is the only solution, and I/O cost and network cost are the two things
that need to be balanced. Best of both Worlds (BoW) is the solution the authors
suggest in this paper. They have worked on the largest real datasets used so far in database
subspace clustering (a Twitter crawl of more than 12 TB, and Yahoo! operational data of 5 petabytes;
merely reading 1 TB from a modern 1 TB disk takes around 3 hours). The contribution of the paper is
combining a sequential clustering algorithm with a parallelization method in an efficient way.
Sequential subspace clustering algorithms can be plugged into this solution, and the system
balances the I/O and network cost. The sequential algorithm that is plugged into the
parallel algorithm finds beta-clusters of hyper-rectangle shape in the multi-dimensional
space, and can be density-based or k-means-based [7].
The I/O-optimal version of the algorithm (ParC) reads the dataset once, reducing the
I/O. Another algorithm, SnI (sample and ignore), improves the network cost but reads the data
twice. Depending on the number of reducers, either of the two can be the winner. BoW
is a combined algorithm that decides which of these algorithms to use based on the number of
reducers, keeping the cost at min(ParC, SnI) for any number of reducers. ParC partitions
the dataset across the cluster (mappers) using one of these methods: random, address space, or
arrival order; finds beta-clusters in each partition (reducers); and finally merges the
clusters (on a single machine). SnI, on the other hand, first samples the dataset (exploiting the
skewed distribution of the data), and then clusters the sample using ParC, ignoring the un-
sampled data items. This way SnI avoids processing many of the data items that belong to
big clusters that were already sampled. SnI reads the data twice: in the first read it samples
the data, and in the second read it maps only the data items not covered by the sample
clustering. The network cost is greatly reduced by this technique. In the sample step of the
algorithm, mappers map each point with probability Sr to a single reducer. That reducer then
clusters the data using the plug-in clustering algorithm and passes the cluster descriptions to
the next phase. In the ignore phase, each mapper reads its partition again, ignores the data points
that fit into the clustering found in the sample phase, and sends the other data items to r reducers.
Those reducers cluster the data points using the plug-in clustering algorithm and pass the
clustering descriptions to one machine. That machine merges all the clusterings found in the second
phase with the clustering found in the first phase [7].
Both ParC and SnI have their own benefits. ParC optimizes I/O by reading the data file
once, and SnI optimizes the network cost by reducing the number of data points that need to be
transferred over the network, at the cost of reading the data file twice. To take advantage of
the benefits of both methods, we need a combined method that selects one of them
based on cost. A cost-based optimization method is used to select the better algorithm
adaptively. The cost formula uses file size, network speed, disk speed, startup cost, and
plug-in cost to calculate the total cost of each algorithm. The BoW algorithm first calculates
both costParC and costSnI, selects the better one based on the parameters, and calls it.
Experiments have been done to check the accuracy, scalability, and performance of the cost-
based method. The authors have shown that the quality of the clustering matches the quality
of sequential clustering while the speed-up is close to linear. The cost-based method has also
been shown to be the best of both worlds in all cases [7].
Ene et al. have designed the first approximate versions of the metric k-center and k-median
algorithms for MapReduce [6]. They assume that a set V of n data points and their cor-
responding pairwise distances are given, and try to group similar points into the same clusters. The
output of the algorithm is k data points that are considered to be the centers of the k
clusters. The algorithms first sample the data (in a way that the sample represents all the
data well) in order to decrease the dataset size. The sampling method incrementally adds new
points to the final sample set only if they are not already well represented by that set.
Sampling is different for k-median and k-center due to their different natures: sampling for
k-median needs more effort because it must consider each point's distance from its cluster
center. A version of the algorithm that can be run on MapReduce is presented in the paper [6].
The MapReduce version of sampling is an iterative algorithm, with three MapReduce operations
in each iteration. The first MapReduce operation partitions the data arbitrarily
among machines (mappers), and then each reducer constructs two sets (S, the final set, and H, from
which a pivot is selected). In the next MapReduce step, all the mappers pass the H and S sets to
a single reducer, which finds the pivot point. In the last MapReduce step, mappers
send the pivot, S, one partition of R (the remaining data items that are not sampled yet), and the
distance matrix to the reducers, and the reducers remove the well-represented points. These steps
are iterated until the number of remaining points in R falls below a certain threshold. K-center
tries to minimize the maximum distance between a cluster center and the points in that cluster,
while k-median tries to minimize the sum of the distances of all the points in a cluster from the
cluster center (both problems are known to be NP-hard). K-center uses the sample produced
by the sampling algorithm: the mappers map all the points in the sample along with their
pairwise distances to one reducer, and the reducer runs a simple local clustering algorithm. K-
median needs more information, and its sample must carry information about all the nodes
that are to be clustered: for each un-sampled point, the closest sample point is selected and
its weight is increased by 1. In k-median, the sampling is done first, and then partitions of
the original dataset, along with the sample and part of the distance graph, are sent to the reducers.
Each reducer finds partial weights of the sample points. Then, in another MapReduce round, the
partial weights are summed up. The last step is a simple clustering on the sample considering
the weight of each sample point [6].
2.2.4 MapReduce and Iterative Tasks
Many machine learning and data mining algorithms work iteratively on data, but
MapReduce is not well-suited to tasks with cyclic data flow. There are frameworks, such
as Twister [5], Spark [23], and HaLoop [1], that support iteration. Dave et al. present a cloud-
based pattern for large-scale iterative data processing problems [2]. As a case study, they have
implemented CloudClustering, which shows how iterative data processing problems
can be handled on the cloud.
CloudClustering is a distributed version of the k-means clustering algorithm, implemented
on Microsoft's Windows Azure platform. The authors introduce a way to balance the performance
versus fault-tolerance trade-off (the main trade-off when solving iterative problems on the
cloud) using data affinity and a buddy system. Some methods use a central pool of state-
less tasks in order to handle fault tolerance, but this can lead to low performance
because a cluster node might need to receive different parts of the data in different iterations
(i.e., there is no affinity between data and workers) [2].
Windows Azure handles fault tolerance by means of reliable queues. When a worker takes
a task from the queue, the message becomes invisible, and if it is not deleted within a timeout, it
reappears in the queue. This way, if a worker fails, the task will be done by another worker
node. One of the issues with iterative tasks on the cloud is the stopping criterion, which can be
handled in two different ways in this problem. If no data point changes clusters from
one iteration to the next, we are done; this method needs to keep track of the previous cluster
of each data point. The other method checks the maximum centroid movement
and stops if it falls below a certain threshold; this method works on read-only memory,
but cannot guarantee convergence [2].
The proposed architecture uses Windows Azure's queuing system and includes one
master and a pool of worker nodes. The input dataset is stored centrally and is partitioned by
the master. The workers download a task containing the address of the corresponding part
of the partition and the centroid list, and perform the task. This method works best in terms
of fault tolerance, but since data affinity is not considered, performance is poor in this
system. The other extreme is having one queue per worker, which solves the problem of
data affinity (the master assigns the same partition of data to the same worker in each iteration)
but suffers from a fault-tolerance problem (there is no other worker to take over the current
task if the worker fails). The buddy system groups workers into buddy groups, and a queue
is shared among all members of each buddy group. The size of the buddy group then defines a
balance between fault tolerance and performance [2].
2.2.5 Arguments about Using or Not Using MapReduce
Schwarzkopf et al. have listed seven assumptions and simplifications employed by researchers
in cloud research that threaten the practical applicability and scientific integrity
of that research [16].
One of the issues they point out is unnecessary distributed parallelism. Very large
datasets and frameworks such as MapReduce have pushed researchers to
employ distributed parallelism more and more. Since the new high-performance computing
frameworks offer fascinating simplicity and handle complicated issues like communication,
synchronization, and data motion, many people are willing to use these frameworks without
considering whether they are useful for the problem at hand. Frameworks
such as MapReduce reduce the engineering time needed to design a distributed
version of an algorithm, but they mostly increase the runtime. For this reason, the speedup of
a program must be measured to show that the distributed solution outperforms the sequential
solution. Furthermore, even if we are sure that a parallel solution would benefit the problem
at hand, we need to make sure that we actually need to distribute the data over several
machines. They also point out that, as Rowstron et al. have shown, with today's multicore
processors and huge amounts of RAM we might not need a distributed solution for
many problems [15]; we would then be able to make use of fast communication mechanisms
such as shared memory and also avoid data motion [16].
Another issue they mention is forcing the abstraction. MapReduce is
designed to alleviate the I/O bottleneck of big data by distributing data over several hard
disks, and the time needed to process a job on a single machine is assumed to be long. Some
solutions iterate and generate many short MapReduce jobs, while it would be better
to have the smallest number of jobs running iteratively on each system. Domain-specific
systems (for stream processing, iterative processing, and graph processing) have also emerged,
and these seem far more justified than using MapReduce for every problem [16].
Since many machine learning and data mining algorithms are iterative, MapReduce
is not inherently an iterative programming model, and some other algorithms do not fit
this model for other reasons, many alternatives to and extensions of MapReduce have been
provided by different research and industrial groups in recent years. Some theoretical studies
have been done to show that Hadoop (an open-source implementation of MapReduce) has
limitations. Empirical studies have also been done, and frameworks such as HaLoop [1] and
Twister [5] present a class of algorithms that Hadoop is not a good fit for, and try to extend
Hadoop and solve those problems more efficiently than Hadoop does; of course, they outperform
Hadoop at least when running that special algorithm. Jimmy Lin provides reasons why we
need to either revise current algorithms to run on MapReduce or devise new algorithms
that follow the MapReduce programming model. He suggests that, since MapReduce is currently
the widely used solution for large-scale data processing problems, we can get rid of the iterative
solution and try to use (or devise) alternative solutions that fit MapReduce, instead
of devising new frameworks for algorithms for which MapReduce is "good enough". He discusses
three classes of problems to justify his claim: iterative graph algorithms (e.g., PageRank),
gradient descent (e.g., for training a logistic regression classifier), and EM (e.g., for k-means
and HMM training) [16].
Jimmy Lin argues that extensions of Hadoop support iterative constructs and thus
alleviate some of these problems, but the problem with all these frameworks is that they are not
Hadoop. It costs an organization a lot to maintain another framework (other than Hadoop)
just for graph and iterative algorithms. A better solution would be to try to solve the
above-mentioned problems by changing the algorithms so that they can run on Hadoop.
MapReduce only needs to perform better than the alternative currently used to solve the
problem; it does not need to beat all the alternatives. For example, MapReduce
performs a lot better than GIZA++ for the word-alignment algorithm, and is also considered
an advance when used for k-means clustering [16]. The Hadoop stack is the standard and
widely used platform for large-scale data analysis. Any large-scale data analysis needs to
be able to process different types of structured and unstructured data and to run different
types of algorithms (graph, text, relational data, ML, etc.). No single programming model
or framework can meet all these needs and be the best in all aspects, such as
performance, fault tolerance, expressivity, simplicity, and the abstraction of low-level features
such as synchronization. Now the question is: is adopting and deploying a new framework to
solve a problem worth it (in terms of cost, time, generality of the framework, having personnel
who have mastered the framework, etc.) [14]?
Chapter 3
Approach
In this chapter we introduce two different piecewise regression algorithms. The first is
called the MapReduce Regression Tree (MRRT) algorithm, and the second is called the
Slope-changing algorithm. Both algorithms try to find a piecewise regression model for a dataset
within the MapReduce framework. The MapReduce Regression Tree algorithm is a regression-tree-based
algorithm that can be used within the MapReduce framework. The Slope-changing algorithm, on
the other hand, tries to introduce a non-greedy method to find good candidate split points
and uses this candidate set to find the final set of split points. The performance of these
two algorithms is analyzed and compared in Chapter 4.
3.1 MapReduce Regression Tree
Algorithm 2 lists the pseudocode for the MapReduce Regression Tree algorithm. This algorithm
partitions the feature space into smaller subspaces, but constructs a regression tree model
(instead of the linear model used in the Slope-changing algorithm) for each subspace. The
generated regression tree models are used to predict the target value of new data items.
Unlike the Slope-changing algorithm, which selects split points based on the logic that
maximum, minimum, and slope-changing points are good candidates, this algorithm does not
choose split points based on any heuristic, and the feature space is not divided into subspaces
along different dimensions. The feature space is divided into subspaces of equal size (in terms
of the volume of the subspace, not the number of data items in each subspace), and it is divided
into smaller subspaces along one dimension of the feature space. This dimension is chosen
randomly or using the preProcess method. The preProcess method retrieves a random sample of
the dataset (in our experiments we used 10% of each dataset) and runs the piecewise
Algorithm 2 MapReduce Regression Tree Algorithm - Main Method
1: function MR-Regression-Tree-Learn
2:     dimToSplit = preProcess(dataset)
3:     rangeValues = Map1(dimToSplit)        ▷ All mappers find min and max value of the
4:                                           ▷ dimension that is being split
5:     splitPoints = Reduce1(rangeValues, dimToSplit, nMappers)   ▷ Split points are specified
6:                                           ▷ based on dimension size and number of mappers
7:     Map2(splitPoints, dimToSplit)         ▷ Data is shuffled among reducers
8:     models = Reduce2()                    ▷ Each reducer finds the model for the received data
9: end function
Regression Tree algorithm on each dimension of the dataset. One piecewise regression
tree model is generated per dimension. Each model is then tested against a
validation set, and the dimension whose model has the lowest RMSE on the validation set
is chosen as the dimension along which the MapReduce Regression Tree algorithm splits
the dataset.
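The dimension selection performed by preProcess can be sketched as follows. This Python sketch is illustrative only: it substitutes a simple two-bucket mean model for the thesis's piecewise regression tree, and all names (`pre_process`, `fit_bucket_model`, etc.) are our assumptions. The essential logic is the same: fit one single-dimension model per feature, score each on a validation set by RMSE, and return the dimension with the lowest error.

```python
def rmse(preds, targets):
    """Root mean squared error between predictions and targets."""
    return (sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)) ** 0.5

def fit_bucket_model(train, dim):
    """Toy 1-D model: split dim's range in half, predict the mean target of each half."""
    xs = [row[dim] for row, _ in train]
    mid = (min(xs) + max(xs)) / 2.0
    lo = [y for row, y in train if row[dim] <= mid]
    hi = [y for row, y in train if row[dim] > mid]
    lo_mean = sum(lo) / len(lo) if lo else 0.0
    hi_mean = sum(hi) / len(hi) if hi else lo_mean
    return lambda row: lo_mean if row[dim] <= mid else hi_mean

def pre_process(train, validation, n_dims):
    """Pick the dimension whose single-dimension model has the lowest validation RMSE."""
    best_dim, best_err = 0, float("inf")
    for d in range(n_dims):
        model = fit_bucket_model(train, d)
        err = rmse([model(row) for row, _ in validation],
                   [y for _, y in validation])
        if err < best_err:
            best_dim, best_err = d, err
    return best_dim
```

In the thesis's setting, `fit_bucket_model` would be replaced by the piecewise regression tree learner, and `train` would be the 10% random sample mentioned above.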
Figure 3.1: Dataset distribution among cluster nodes with overlap to decrease the prediction error of borderline data points
When dividing a feature space into subspaces, the models constructed for two adjacent
subspaces might have different predictions for a data point located on the
borderline. The same holds for data points that are located in one subspace but close to the
borderline. For these data points, the neighboring model might give a better prediction than
the model of the subspace the data point actually lies in. For this reason, smoothing methods
try to reduce the problem by using a weighted average of the predictions of neighboring models for
data points close to the borderline. The number of such models is two in a two-dimensional
feature space, and may be larger in an n-dimensional feature space, depending on the location
of the data item (the data item might be close to a borderline in one dimension but not in
another).
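A minimal one-dimensional sketch of such a smoothing scheme: near a border, blend the local model's prediction with the neighbor model's, weighted by distance to the border. The linear blend across a band of width `2 * band` is our assumption; the thesis does not fix a particular weighting.

```python
def smoothed_predict(x, border, band, left_model, right_model):
    """Within `band` of `border`, return a distance-weighted blend of the two models."""
    if x < border - band:
        return left_model(x)
    if x > border + band:
        return right_model(x)
    # weight of the right model grows linearly from 0 to 1 across the band
    w = (x - (border - band)) / (2.0 * band)
    return (1.0 - w) * left_model(x) + w * right_model(x)
```

Exactly on the border (`x == border`) the two models contribute equally; far from the border only the local model is used, so the blend changes nothing for interior points.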
Since we use a distributed method to solve the regression problem, we have more
resources at hand, and we might be able to afford a little redundant computation in
order to increase the accuracy of the model. Based on this logic and the borderline
prediction problem explained above, we decided to use overlapping subspaces and let
each mapper have more data than it strictly needs to construct its model. Figure 3.1
depicts how 7 partitions of a dataset, partitioned along the x axis, are assigned to 7 mappers.
Every cluster node except the first and the last receives three partitions of the dataset:
the partition it constructs the model for (we call it the main partition) plus the partitions
to its left and right (the neighbor partitions). The first and last nodes receive only two
partitions because their main partitions have just one neighboring partition each.
Distributing the dataset this way lets the system construct each model from 3 partitions
but predict target values only for test items located in the main partition. This way we
have no borderline data items, and thus we do not need the prediction smoothing methods
used in the Slope-changing algorithm.
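The assignment in Figure 3.1 can be captured in a one-line helper. This Python sketch is illustrative (the function name and 0-based indexing are ours): node i receives its main partition plus whichever of the two neighboring partitions exist.

```python
def partitions_for_node(i, n_nodes):
    """Return the partition indices shipped to node i (0-based): its main
    partition plus the neighbor on each side, clipped at the ends."""
    return [j for j in (i - 1, i, i + 1) if 0 <= j < n_nodes]
```

With 7 nodes, the interior nodes each receive three partitions while nodes 0 and 6 receive two, matching the figure.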
Algorithm 3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round
1: function Map1(dimToSplit)
2:     for all Mappers do
3:         ▷ dataset is the local part of the dataset on the node this mapper is running on
4:         minValue = +∞
5:         maxValue = −∞
6:         for all dataPoint ∈ dataset do
7:             minValue = min(dataPoint[dimToSplit], minValue)
8:             maxValue = max(dataPoint[dimToSplit], maxValue)
9:         end for
10:    end for
11:    send <1, <minValue, maxValue>>
12:    ▷ By indicating the key as 1, all information is sent to one reducer
13: end function
Figure 3.2 depicts different overlap factors when overlapping subspaces on cluster nodes.
When the overlap factor is 1, each node receives its own subspace plus two neighboring
subspaces, each as big as its own subspace. When the overlap factor is 0.5, the size of
Figure 3.2: Different overlap factors of subspaces on cluster nodes
the neighboring subspaces that each node receives is half the size of its own subspace. With
an overlap factor of 0, no overlapping exists.
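Under these definitions, each node's extended range can be sketched as its own subspace widened on each side by the overlap factor times the subspace width, clipped at the ends of the dimension. The Python helper below is illustrative; the names and the clipping behavior are our assumptions.

```python
def node_range(i, n_nodes, min_val, max_val, factor):
    """Extended (start, end) range of node i's subspace along the split
    dimension, widened by `factor` * subspace width on each side."""
    step = (max_val - min_val) / n_nodes
    start = min_val + i * step - factor * step
    end = min_val + (i + 1) * step + factor * step
    # clip at the ends of the dimension
    return max(start, min_val), min(end, max_val)
```

With `factor = 1` this reproduces the triple-width partitions of Figure 3.1; with `factor = 0` each node sees only its own subspace.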
Now that we have discussed the concepts behind how the MapReduce Regression Tree
algorithm works, let us briefly explain each line of Algorithm 2. The preProcess method
selects the dimension along which to partition the dataset. Then the first round of MapReduce
is started. In the map phase of the first MapReduce round, each mapper finds the range
(min, max) of the data items in the portion of the dataset that it owns. This information is
sent to one reducer. The reducer receives rangeValues and nMappers (the number of mappers)
and decides which portions of the dataset should be sent to each mapper. In the map phase of
the second MapReduce round, all mappers receive the splitPoints information and send the
data items they hold to two or three reducers (each partition of the dataset is sent to
several reducers due to overlapping). In the reduce phase of the second MapReduce round,
each reducer therefore holds two or three partitions of the dataset and constructs a
regression tree for the portion of the dataset it has received. Each of these phases is
explained in the following sections.
3.1.1 Map1: Finding the Min and Max of Dimension that Is Being
Split
Algorithm 3 lists the steps of this phase of the first MapReduce round. All mappers process the
portion of the dataset they own and find the minimum and maximum value of the dimension
Algorithm 4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce Round
1: function Reduce1(rangeValues, dimToSplit, nMappers)
2:     minValue = min(rangeValues.minValues)       ▷ min value of dimToSplit dimension
3:     maxValue = max(rangeValues.maxValues)       ▷ max value of dimToSplit dimension
4:     stepSize = (maxValue − minValue)/nMappers
5:     ▷ stepSize in dimToSplit dimension when partitioning the dataset
6:     splitPoints[1].start = minValue                   ▷ start and end of the partition for the
7:     splitPoints[1].end = minValue + 2 ∗ stepSize      ▷ first mapper is calculated differently
8:     for i = 2, nMappers − 1 do
9:         splitPoints[i].start = minValue + (i − 2) ∗ stepSize
10:        splitPoints[i].end = minValue + (i + 1) ∗ stepSize
11:    end for
12:    splitPoints[nMappers].start = minValue + (nMappers − 2) ∗ stepSize
13:    splitPoints[nMappers].end = minValue + nMappers ∗ stepSize
14:    ▷ start and end of the partition for the last mapper is calculated differently
15:    send splitPoints to all mappers       ▷ To be used by mappers of next round
16: end function
that is supposed to be split. All mappers then send these minimum and maximum values
to the same reducer. This is why the key of the emitted <key, value> pair is 1 for all
mappers.
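A minimal sequential Python sketch of one mapper's work in this phase (the MapReduce plumbing is omitted, and the function name is ours): scan the local data, track the min and max of the split dimension, and emit them under the constant key 1.

```python
def map1(local_data, dim_to_split):
    """One mapper's Map1: emit <1, (min, max)> for the split dimension."""
    lo, hi = float("inf"), float("-inf")
    for point in local_data:
        lo = min(lo, point[dim_to_split])
        hi = max(hi, point[dim_to_split])
    # the constant key 1 routes every mapper's range to the same reducer
    return (1, (lo, hi))
```

Because every mapper emits the same key, the shuffle phase delivers all local ranges to a single reducer, which can then compute the global range.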
Algorithm 5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce Round
1: function Map2(splitPoints, dimToSplit, nMappers)
2:     for all Mappers do
3:         for dataPoint ∈ dataset do
4:             for i = 1, nMappers do
5:                 if splitPoints[i].start < dataPoint[dimToSplit] < splitPoints[i].end then
6:                     send <i, dataPoint> to the corresponding reducer
7:                 end if
8:             end for
9:         end for
10:    end for
11: end function
3.1.2 Reduce1: Finding Split Points Along the Dimension that Is
Being Split
Algorithm 4 lists the reduce phase of the first MapReduce round. In this algorithm, all the
maximum and minimum values sent by the mappers are used to find the maximum and minimum
values of the dimension being split. Using these two values, the range of the dimension is
found, and the stepSize of split points along that dimension is computed by dividing the range
by the number of mappers. The start and end values for each mapper along the dimension being
split are then found and stored in the splitPoints array. Every mapper receives a partition
three times as big as stepSize, except the two mappers whose main partition is the first or
last partition of the dataset; those two receive partitions only twice as big as stepSize.
Algorithm 6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce Round
1: function Reduce2(dataPoints)
2:     Input: dataPoints: data items sent to this reducer
3:     for all Reducers do
4:         models[i] = treeRegressionModel(dataPoints)
5:     end for
6: end function
3.1.3 Map2: Shuffling the Data Among Cluster Nodes
The split points found in the Reduce1 phase are used in this phase to shuffle the data.
Algorithm 5 shows how the data is shuffled among cluster nodes. Each mapper sends each data
item in its local portion of the dataset to two or three reducers: data points located in the
first or last partition of the feature space are sent to two reducers, and all other data
points are sent to three reducers. This adds redundancy to the amount of data transferred
over the network, but it solves the problem of predicting target values for borderline data
items and also increases the accuracy of the model.
Algorithm 7 MapReduce Regression Tree Algorithm - Prediction
1: function MR-Regression-Tree-Test(models, dataP