-
LARGE-SCALE NON-LINEAR REGRESSION WITHIN
THE MAPREDUCE FRAMEWORK
by
Ahmed Khademzadeh
A thesis submitted to the College of Engineering at
Florida Institute of Technology
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Computer Science
Melbourne, Florida
July 2013
-
© Copyright by Ahmed Khademzadeh 2013. All Rights Reserved.
The author grants permission to make single copies
-
We the undersigned committee hereby approve the attached thesis
LARGE-SCALE NON-LINEAR REGRESSION WITHIN THE MAPREDUCE FRAMEWORK
by Ahmed Khademzadeh
Philip Chan, Ph.D., Associate Professor, Computer Sciences, Principal Adviser
Marius Silaghi, Ph.D., Assistant Professor, Computer Sciences
Georgios C. Anagnostopoulos, Ph.D., Associate Professor, Electrical & Computer Engineering
William D. Shoaff, Ph.D., Associate Professor and Department Head, Computer Sciences
-
Abstract
Large-scale Non-linear Regression within the MapReduce Framework
By: Ahmed Khademzadeh
Thesis Advisor: Philip Chan, Ph.D.
Regression models have many applications in real-world problems in fields such as finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us construct better models from the data. The issue with big datasets is that they need a long time to be processed, or even to be read, on a single machine. This research employs MapReduce to model large-scale non-linear regression problems in a parallel fashion. The MRRT (MapReduce Regression Tree) algorithm divides the feature space into overlapping subspaces and then shuffles each subspace's data items to a node in the cluster. Each node in the cluster then constructs a regression tree for the subspace of the data it has received. Different versions of the algorithm (overlapping/non-overlapping subspaces and weighted/unweighted prediction using neighboring models) are proposed and compared with the regression tree (RT) algorithm implemented in Matlab libraries.

Experiments on synthetic and real datasets show that the MRRT algorithm, devised to be fast and scalable within the MapReduce framework, not only has close-to-linear speedup and close-to-optimum scalability, but also outperforms the RT algorithm in accuracy (in most cases) and improves prediction time by more than 80%. Although MRRT is designed for the MapReduce framework, it can also be used on a single machine; in that case it improves learning time by 60% (in most cases) compared to the RT algorithm, and exhibits close-to-linear scalability (compared to the RT algorithm, which scales roughly quadratically).
-
Contents
Abstract iii
Preface xiii
Acknowledgments xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview of Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Overview of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 6
2.1 Approximating Non-linear Regression Using Piecewise Regression . . . . . . . . 6
2.1.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Non-linear Regression via Piecewise Linear Regression . . . . . . . . . . 7
2.1.3 Piecewise Regression with Regression Trees . . . . . . . . . . . . . . . . 8
2.1.4 Piecewise Linear Approximation of Time Series . . . . . . . . . . . . . . 10
2.1.5 Online Approximation of Non-linear Models . . . . . . . . . . . . . . . . 12
2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Why <Key, Value> Pairs? . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Is That All MapReduce Does? . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 MapReduce for Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 MapReduce and Iterative Tasks . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.5 Arguments about Using or not Using MapReduce . . . . . . . . . . . . 20
-
3 Approach 22
3.1 MapReduce Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 Map1: Finding the Min and Max of Dimension that Is Being Split . . . 25
3.1.2 Reduce1: Finding Split Points Along the Dimension that Is Being Split 26
3.1.3 Map2: Shuffling the Data Among Cluster Nodes . . . . . . . . . . . . . 27
3.1.4 Reduce2: Constructing the Tree Regression Models for Each Subspace . 28
3.1.5 Using the MRRT Model to Predict . . . . . . . . . . . . . . . . . . . . . 28
3.2 Slope-changing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Choosing Good Split Points . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Map1: Finding Candidate Split Points . . . . . . . . . . . . . . . . . . . 31
3.2.4 Reduce1 : Generating a Split Point Set from Candidate Set . . . . . . . 35
3.2.5 Map2 : Shuffling the Data Points Based on Split Points . . . . . . . . . 39
3.2.6 Reduce2 : Finding the Linear Model for Each Subspace . . . . . . . . . 39
3.2.7 Using the Slope-changing Model to Predict . . . . . . . . . . . . . . . . 40
4 Empirical Evaluation 42
4.1 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Overview of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Overview of Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 MRRT Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.1 Number of Dimensions to Split Along . . . . . . . . . . . . . . . . . . . 47
4.4.2 Overlapping Subspaces and Neighbor-weighted Predictions . . . . . . . 50
4.4.3 Comparing the Accuracy of MRRT and the Baseline Algorithm . . . . . 54
4.4.4 Choosing the Dimension to Split Along . . . . . . . . . . . . . . . . . . 58
4.4.5 Prediction Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.6 Speedup of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.7 Scalability of MRRT Algorithm . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.8 Could MRRT Be Used as a Sequential Algorithm? . . . . . . . . . . . . 72
4.5 Slope-changing Experiments Results . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.1 Slope-changing Algorithm Limitation . . . . . . . . . . . . . . . . . . . . 77
-
4.5.2 Comparing Accuracy of Slope-changing Algorithm to Baseline Algorithm 78
4.5.3 Comparing Runtime of Slope-changing Algorithm to Baseline Algorithm 79
5 Concluding Remarks 80
5.1 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Possible Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A Synthetic Datasets Details 83
Bibliography 88
-
List of Tables
4.1 Summary of Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Summary of Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on synthetic datasets. Neither method of dividing the feature space supersedes the other, and there is no obvious reason to prefer one over the other based on this experiment. The learning times of the two methods are also similar. . . . . 48
4.4 Comparing accuracy and learning time of MRRT when dividing the feature space along one dimension versus two dimensions on real datasets. The two-dimension split wins in accuracy and the one-dimension split wins in learning time. The accuracy difference is minor, but the learning time difference is significant. . . . . 48
4.5 Comparing accuracy of MRRT(W) and MRRT, both with no overlap. The MRRT(W) algorithm works better than MRRT on most datasets. . . . . 53
4.6 Comparing accuracy of the Weighted Overlapping MapReduce Regression Tree and the baseline algorithm on 10-dimensional synthetic datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 55
4.7 Comparing accuracy of the Weighted Overlapping MapReduce Regression Tree and the baseline algorithm on 20-dimensional synthetic datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 56
-
4.8 Comparing accuracy of MRRT(WO) and the baseline algorithm on real datasets. Numbers in the table are RMSE values. The MRRT(WO) algorithm always performs better than the baseline algorithm when the feature space is split along one dimension and the dimension to split along is chosen properly. . . . . 57
4.9 Dimensions with lowest RMSE on synthetic datasets and rank of same dimen-
sion on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . 61
4.10 Dimensions with lowest RMSE on sample of synthetic datasets and RMSE of
dataset when divided along same dimension using MRRT(WO) algorithm. . . . 62
4.11 Dimensions with lowest RMSE on real datasets and rank of same dimension
on samples using MRRT(O) and MRRT(WO) algorithms. . . . . . . . . . . . . 63
4.12 Dimensions with lowest RMSE on sample of real datasets and RMSE of dataset
when divided along same dimension using MRRT(WO) algorithm. . . . . . . . 64
4.13 Comparing prediction time of MRRT(O), MRRT(WO) and the baseline algorithm on the 20-dimensional ttoy20d3 synthetic test set containing 1000 test items, on different cluster sizes. The MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . . 65
4.14 Comparing prediction time of MRRT(O), MRRT(WO) and the baseline algorithm on the real datasets' test sets containing 4111 test items, on different cluster sizes. The MRRT(WO) and MRRT(O) algorithms reduce prediction time by more than 80% compared to the baseline algorithm in all cases. . . . . 66
4.15 Comparing learning time of MRRT(WOS) and the baseline algorithm on the 20-dimensional ttoy20d3 synthetic dataset for different numbers of subspaces. MRRT(WOS) always performs better than the baseline algorithm, while also having better accuracy. . . . . 73
4.16 Comparing accuracy of MRRT(WOS) and the baseline algorithm on 20-dimensional ttoy20d3 synthetic datasets for different numbers of subspaces, when the dataset is divided into subspaces along the first dimension. The MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases. . . . . 74
4.17 Comparing learning time of MRRT(WOS) and the baseline algorithm on real datasets for different numbers of subspaces. The MRRT(WOS) algorithm's learning time is always less than the baseline algorithm's, except in one case when the number of subspaces is 32. . . . . 75
-
4.18 Comparing accuracy of MRRT(WOS) and the baseline algorithm on three real datasets for different numbers of subspaces, when the dataset is divided into subspaces along the first dimension. The MRRT(WOS) algorithm's RMSE is lower than the baseline algorithm's in all cases, and it mostly decreases as the number of subspaces increases. . . . . 76
4.19 Summary of synthetic datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.20 Comparing accuracy of slope-changing algorithm (PWC and FPS versions) and
baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . . . 78
4.21 Comparing learning time of slope-changing algorithm (PWC and FPS versions)
and baseline algorithm on four datasets. . . . . . . . . . . . . . . . . . . . . . . 79
-
List of Figures
2.1 A regression tree (left), and the corresponding 2-dimensional feature space (right). Each tree node corresponds to a subspace in the feature space [19] . . . . 7
2.2 MapReduce Execution Overview [3]. . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Dataset distribution among cluster nodes with overlap, to decrease the prediction error for borderline data points . . . . 23
3.2 Different overlap factors of subspaces on cluster nodes . . . . . . . . . . . . . . 25
3.3 Bad split points cause bad piecewise linear models and higher prediction error . . . . 29
3.4 Good split points help produce better piecewise linear models and lower prediction error. . . . . 30
3.5 Finding the data points with maximum target value by gridifying data points and using an initial random seed . . . . 32
3.6 Using Parzen Window Classifier to find areas with many candidate split points [18] 36
4.1 Splitting the feature space to subspaces. . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Summary of MRRT versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional
datasets (gtoy10d1, gtoy10d2, ptoy10d1, ptoy10d2, ttoy10d1 and ttoy10d2)
datasets with different overlap values when dividing along first dimension. . . . 51
4.4 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional
datasets (ptoy20d1, ptoy20d2, ttoy20d1 and ttoy20d2) datasets with different
overlap values when dividing along first dimension. . . . . . . . . . . . . . . . . 53
4.5 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on IHEPC1
real dataset with different overlap values when dividing along first dimension. . 54
4.6 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 10-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . . 59
-
4.7 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on 20-dimensional datasets with overlap = 0.75, when splitting along different dimensions. . . . . 60
4.8 Analyzing accuracy of MRRT(O) and MRRT(WO) algorithms on the IHEPC1 real dataset with overlap = 0.75, when splitting along different dimensions. . . . . 63
4.9 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. The runtime of the same algorithms on a single machine is 2609.6 seconds. . . . . 67
4.10 Speedup of MRRT(O) and MRRT(WO) algorithms in log scale and linear scale on the IHEPC1, IHEPC2, and IHEPC3 real datasets respectively with overlap = 0.75 when splitting along the first dimension. . . . . 68
4.11 Analyzing scalability of the baseline, MRRT(WO) and MRRT(WOS) algorithms on ttoy20d3 datasets with overlap = 0.75 when changing the dataset size from 50,000 items to 1,000,000 data items. . . . . 69
4.12 Analyzing scalability of the baseline, MRRT(WO) and MRRT(WOS) algorithms on the IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when changing the dataset size from 103,557 items to 2,071,148 data items. . . . . 71
4.13 Comparing runtime of MRRT(WO), MRRT(WOS) and the baseline algorithm on the ttoy20d3 dataset with overlap = 0.75 when splitting along the first dimension. . . . . 72
4.14 Comparing runtime of MRRT(WO), MRRT(WOS) and the baseline algorithm on the IHEPC1, IHEPC2, and IHEPC3 real datasets with overlap = 0.75 when splitting along the first dimension. . . . . 74
-
List of Algorithms
1 Basic Regression Tree Construction Algorithm . . . . . . . . . . . . . . . . . . 9
2 MapReduce Regression Tree Algorithm - Main Method . . . . . . . . . . . . . . 23
3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round 24
4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce
Round . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7 MapReduce Regression Tree Algorithm - Prediction . . . . . . . . . . . . . . . 27
8 Slope-changing Algorithm - Main Method . . . . . . . . . . . . . . . . . . . . . 30
9 Slope-changing Algorithm - Initialization . . . . . . . . . . . . . . . . . . . . . . 31
10 Slope-changing Algorithm - Map Phase of First MapReduce Round . . . . . . . 31
11 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Parzen
Window Classifier Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
12 Slope-changing Algorithm - Reduce Phase of First MapReduce Round (Fitness
Proportional Selection Version) . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
13 Slope-changing Algorithm - Map Phase of Second MapReduce Round . . . . . 38
14 Slope-changing Algorithm - Reduce Phase of Second MapReduce Round . . . . 39
15 Slope-changing Algorithm - Prediction . . . . . . . . . . . . . . . . . . . . . . . 40
-
Preface
-
Acknowledgments
-
Chapter 1
Introduction
1.1 Motivation
The goal is to approximate a non-linear regression model using piecewise linear models for large-scale datasets. Regression models have many applications in the real world, among which we can name trend analysis, finance, epidemiology, and environmental science. Big datasets are everywhere these days, and bigger datasets help us find better models from the data. The issue with big datasets is that they also need a long time to be processed on a single machine. When the dataset is very large (terabyte scale), even reading the content of the dataset takes a very long time (a high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will require three hours to read a 4 TB dataset! [12]). For this reason we need parallel and distributed methods to process big datasets.
There are many options for parallel data processing. We have decided to use the MapReduce programming model, introduced by Google in 2004 for processing large datasets [3], as the distributed data processing framework, for the following reasons:
• MapReduce handles many of the issues with large-scale distributed data processing, such as the distributed file system. Google File System (GFS) is the original file system it uses. GFS makes all data transfer and distribution across cluster nodes transparent to the programmer. The user simply copies a file to the cluster, and GFS decides how the file is distributed among cluster nodes, keeps track of the file's chunks, and manages replication of those chunks on different nodes for fault tolerance.
• Fault tolerance is another thing that MapReduce takes care of. The programmer does
-
not need to worry about node failures. If a node fails, MapReduce manages the problem and reassigns its tasks to other cluster nodes.
• Code and data migration is also managed by MapReduce. All the mapper nodes in the cluster run the same map code on their data. MapReduce takes care of delivering the code to all mappers and running it on the nodes. The result of the map round needs to be shuffled among cluster nodes (delivered to reducers), and MapReduce takes care of this data shuffling too. The reduce phase, and code migration in this phase, is also managed by the MapReduce framework.
• MapReduce simplifies solving a distributed data processing problem by introducing a high-level programming model for distributed data processing. It helps programmers concentrate on program logic, while all the details and issues related to the distributed nature of the solution are managed by MapReduce. Although MapReduce restricts us and reduces flexibility in some ways, it gives us a standard way of describing distributed data processing algorithms.
• MapReduce is one of the common ways of solving distributed data processing problems in industry these days.
Details about how MapReduce works are explained in section 2.2.
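As an illustration of the programming model the bullets above describe, the map/shuffle/reduce flow can be sketched as a minimal single-process simulation (the function names here are our own; a real framework such as Hadoop additionally handles distribution, fault tolerance, and file management):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round: map every record to (key, value)
    pairs, group the values by key (the shuffle), then reduce each
    key's value list to a single result."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            shuffled[key].append(value)     # shuffle: group by key
    return {key: reduce_fn(key, values)     # reduce phase
            for key, values in shuffled.items()}

# The canonical word-count example [3]:
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(["a b a", "b c"], count_map, count_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```

On a cluster, the loop over records and the per-key reductions run on different machines; the simulation only shows the data flow.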
1.2 Problem Statement
We are handling a large-scale non-linear regression problem. Regression is a supervised learning technique in which the algorithm tries to find a model from a dataset to generate numerical predictions for future data items. We call the numerical dependent variable (target variable) y, and try to approximate its value as a function of other numerical values x. Here x is a vector consisting of n numerical values x1, x2, . . . , xn, where n is the number of features (attributes) of each data item in the dataset.

y = f(x) + ε    (1.1)

In the above equation, ε is the difference between the actual and predicted values of the target variable. The predicted value for y is f(x) and is indicated by the symbol ŷ. There are different ways to handle non-linear regression problems.
We intend to find a solution for large-scale datasets. Handling large-scale datasets could
be very slow if parallel and distributed data processing techniques and frameworks are not
-
used. For the reasons mentioned in section 1.1, the programming model we have employed to handle large-scale datasets is MapReduce. When using MapReduce, the method employed to solve the problem sequentially needs to be recast and translated into the MapReduce programming model. Details about how MapReduce works are explained in section 2.2.
Designing an algorithm for the MapReduce framework (map and reduce phases) entails issues such as deciding what processing the cluster nodes need to do on their local pieces of data, and what information they need to extract to compensate for not having a global view of the data on each node of the cluster. Another challenge when designing a MapReduce-based algorithm is how the final result is aggregated. Generally a problem can be handled by MapReduce in several different ways, and choosing the best way to make use of MapReduce's capabilities is the main challenge. Since there is no communication between nodes during the map and reduce phases, and results can only be communicated when the map phase is done, choosing an effective strategy for extracting useful information from the partial views the mappers have of the partial data in hand, and making use of this information in the reduce phase (or in subsequent MapReduce rounds), is a problem that needs to be addressed.
1.3 Overview of Approach
In this work two different distributed algorithms for approximating the non-linear regression model of a dataset using piecewise regression are proposed. Both algorithms are designed for the MapReduce framework.
The first algorithm is called the MapReduce Regression Tree (MRRT) algorithm. This algorithm divides the feature space into equal-size partitions (equal-size in terms of volume, not the number of data points in the partition). To form the partitions, the feature space is divided along one dimension, selected randomly or using a pre-processing method that works on a sample of the dataset. Data items belonging to different subspaces are then sent to different reducers, and all reducers construct regression tree models (in parallel) for the partitions they have received. Although a reducer technically needs only one partition of the feature space to generate its model, we send the left and right neighbors of each partition to the reducer too (overlapping subspaces). This way each reducer receives three partitions instead of one (the leftmost and rightmost partitions have only one neighbor, so their reducers receive two partitions instead of three). Since we are sending extra information to each
-
reducer, the data that needs to be transferred over the network and processed on each machine increases. This redundancy has a good side effect: it increases the accuracy of the final model by decreasing the prediction error for data items located near the borderlines. The algorithm uses a weighted prediction mechanism to further increase accuracy. Details of this algorithm are explained in section 3.1.
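A rough sketch of this shuffling step (the function and its parameters are our own illustration, not the thesis's implementation) routes each data point to the reducer that owns its partition and to the two neighboring reducers, so that every reducer ends up with its own partition plus the left and right neighbors:

```python
def assign_to_reducers(data, dim, num_parts, lo, hi):
    """Divide dimension `dim` of the feature space into `num_parts`
    equal-width partitions over [lo, hi]; route each point (x, y) to
    its own partition's reducer and to both neighboring reducers
    (overlapping subspaces)."""
    width = (hi - lo) / num_parts
    buckets = {r: [] for r in range(num_parts)}
    for x, y in data:
        p = min(int((x[dim] - lo) / width), num_parts - 1)
        for r in (p - 1, p, p + 1):        # own partition + neighbors
            if 0 <= r < num_parts:
                buckets[r].append((x, y))
    return buckets

# Example: three partitions over [0, 1); the middle reducer receives
# the data of all three partitions, the boundary reducers of two.
buckets = assign_to_reducers(
    [([0.1], 1.0), ([0.5], 2.0), ([0.9], 3.0)],
    dim=0, num_parts=3, lo=0.0, hi=1.0)
# len(buckets[1]) == 3
```

In the actual MapReduce setting this routing happens in the map phase, with the partition index serving as the key that determines the receiving reducer.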
The second algorithm is called the Slope-changing algorithm. In this algorithm the dataset is distributed among cluster nodes in a random fashion. Every mapper has part of the dataset at hand and finds a set of candidate split points in that part. Split points are points that will be used to split the feature space into smaller subspaces. These candidate split points include points with locally maximum or minimum target values; points where the model's slope changes sharply are also selected by the mappers as candidates. All the candidate split points found by the mappers are sent to a single reducer, which selects the split point set from this candidate set. Two different methods are suggested for selecting the final split points from the candidates: one uses a Parzen window classifier, the other fitness proportional selection. After the split point set is selected, it is sent to all mappers in the cluster. All mappers use these split points to partition the data based on the subspaces the split points form. Each mapper then sends the data points pertaining to a certain subspace to a certain reducer. This way each reducer receives all the data points of a certain subspace from all mappers, and can construct a linear model for that subspace. A piecewise linear model over all subspaces of the feature space is thus constructed; it will be used to predict the target value for future test items based on the subspace in which the test item is located. Details of this algorithm are explained in section 3.2.
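A mapper's local search for candidates can be sketched in one dimension as follows (the `slope_threshold` parameter and the function name are illustrative assumptions, not values from the thesis):

```python
def candidate_split_points(points, slope_threshold=1.0):
    """Scan a mapper's local (x, y) pairs, sorted by x, and flag as
    candidate split points the local minima/maxima of the target value
    and the points where the slope between consecutive points changes
    sharply."""
    points = sorted(points)
    candidates = []
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        left = (y1 - y0) / (x1 - x0)       # slope approaching the point
        right = (y2 - y1) / (x2 - x1)      # slope leaving the point
        is_extremum = (y1 > y0 and y1 > y2) or (y1 < y0 and y1 < y2)
        sharp_turn = abs(right - left) > slope_threshold
        if is_extremum or sharp_turn:
            candidates.append(x1)
    return candidates
```

Each mapper would emit its candidate list to the single reducer, which then applies the Parzen window or fitness proportional selection step described above.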
Both the MRRT and Slope-changing algorithms divide the feature space into subspaces and find models for each subspace, but they divide the feature space differently. MRRT divides the space into equal-size subspaces, whereas the size and number of subspaces in the Slope-changing algorithm may vary and are determined by the split points chosen by the algorithm. Another difference is that MRRT constructs regression tree models for the subspaces, while the Slope-changing algorithm constructs linear models. MRRT also uses overlapping subspaces; the Slope-changing algorithm does not.
1.4 Overview of Contributions
Because of the limitation of the Slope-changing algorithm (discussed in section 4.5.1), it is not applicable to high-dimensional datasets. For this reason we only list the contributions of the
-
MRRT algorithm:
• Overlapping subspaces (coupled with weighted prediction) not only solve the data distributedness problem, but also help improve accuracy over the baseline (regression tree) algorithm. If the preProcess method is employed to choose the dimension to split along, MRRT improves accuracy for 8 out of 10 synthetic datasets, by 1.1% to 32.86%, and for all three real datasets, by 4.66%, 13.24%, and 22.73% respectively.
• The MRRT algorithm shows close-to-linear speedup (for two out of the four datasets experimented on) and near-optimum scalability for all datasets.
• Although MRRT's prediction is done sequentially and not on a MapReduce framework, it improves prediction time by more than 80% compared to the regression tree algorithm.
• MRRT can be used on a single machine, and in that case it improves learning time by 60% (in most cases) compared to the regression tree algorithm.
• MRRT needs to choose a dimension to split along. The preProcess method we have proposed for MRRT (to choose this dimension) increases the accuracy of the model for 11 out of 13 datasets compared to the model constructed by the regression tree algorithm.
1.5 Overview of Chapters
In chapter 2 we review the literature related to regression, piecewise regression, and regression trees. We also discuss MapReduce and some large-scale problems that have been solved using it. Limitations of MapReduce, and arguments about these limitations, are also discussed in that chapter. Chapter 3 presents details of the algorithms we propose for piecewise approximation of non-linear models within MapReduce. Chapter 4 presents the empirical evaluation of the algorithms and compares them with the baseline (regression tree) algorithm. Chapter 5 summarizes the findings and presents concluding remarks.
-
Chapter 2
Literature Review
In this thesis two distributed MapReduce-based algorithms for approximating large-scale non-linear regression using piecewise regression are proposed. The two major parts of the problem are approximating non-linear regression using piecewise regression, and the MapReduce framework. We review the literature related to these two subproblems in the following sections.
2.1 Approximating Non-linear Regression Using Piece-
wise Regression
2.1.1 Linear Regression
Linear regression can be used when there is a linear (or roughly linear) dependency between x and y (x and y are introduced in section 1.2). In this case the learning algorithm tries to model y as a linear function of x:

y = β0 + β1x + ε    (2.1)

In the above equation, the sizes of the x and β1 vectors are equal to the number of dimensions in the feature space, and ε is the difference between the actual and predicted values of the target variable (the error term). We use the symbol ŷ to indicate the value of the target variable predicted by the model, and we have ŷ = β0 + β1x. The learning algorithm tries to learn the β0 and β1 values (called weights) from the training items in the dataset. When learning the weights, the objective is to minimize the difference between actual and predicted values over all data items (as an example, this difference can be measured by minimizing the sum of squared differences between actual and predicted target
-
values):

Σ_{i=1}^{n} (y^(i) − ŷ^(i))² = Σ_{i=1}^{n} (y^(i) − (β0 + β1 · x^(i)))²    (2.2)

We use < x^(k), y^(k) > to indicate the kth data item in the dataset.
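As a concrete sketch of minimizing this sum of squared errors (our own illustration using NumPy, not the thesis's Matlab implementation), the weights β0 and β1 can be found by solving a linear least-squares problem with an intercept column prepended to the data matrix:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares for equation 2.2: find beta0 (intercept)
    and beta1 (weight vector) minimizing the sum of squared residuals.
    X is an (m, n) matrix of feature vectors, y an (m,) target vector."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]                       # beta0, beta1

# Usage: recover y = 2 + 3*x exactly from noise-free data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
beta0, beta1 = fit_linear(X, y)
# beta0 ≈ 2.0, beta1[0] ≈ 3.0
```

The same fit is applied per subspace in the piecewise methods described next.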
2.1.2 Non-linear Regression via Piecewise Linear Regression
One of the advantages of linear regression is its simplicity, and one of its disadvantages is its globality. When the relation between x and y is complex and non-linear, even the best possible linear model will have a high average prediction error. Partitioning the feature space into smaller subspaces and constructing a model for each subspace might help in finding a better model and reducing the error. Piecewise methods use this idea and find constant or linear models for each subspace of the feature space instead of one global linear model.
A constant model for a subspace containing a set of data items s1 = {<x^(1), y^(1)>, <x^(2), y^(2)>, . . . , <x^(n), y^(n)>} is calculated as follows:

ŷ(s1) = (1 / size(s1)) ∑_{k∈s1} y^(k)   (2.3)

and the prediction for any new data item that lies in this subspace is ŷ(s1).
In most cases it is better to find a linear model for each subspace of the feature space. In
this case, Equation 2.2 from the previous section is used by the linear regression learning
algorithm to find a linear model for each subspace.
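As an illustration of the equal-size-subspace idea, the sketch below (plain Python; the function names are our own) fits the constant model of Equation 2.3, i.e. the mean of y, in each one-dimensional subspace:

```python
def fit_piecewise_constant(data, boundaries):
    """Fit a constant model (the mean of y, Eq. 2.3) for each subspace.
    `data` is a list of (x, y) pairs with 1-D x; `boundaries` are the
    sorted edges separating the 1-D subspaces."""
    models = [[] for _ in range(len(boundaries) + 1)]
    for x, y in data:
        idx = sum(1 for b in boundaries if x >= b)  # which subspace x falls in
        models[idx].append(y)
    # ŷ(s) = (1/size(s)) * sum of y over items in s; None if a subspace is empty
    return [sum(ys) / len(ys) if ys else None for ys in models]

def predict_piecewise(models, boundaries, x):
    """Prediction for a new item is the constant model of its subspace."""
    idx = sum(1 for b in boundaries if x >= b)
    return models[idx]
```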
Figure 2.1: A regression tree (left), and the corresponding 2-dimensional feature space (right).
Each node of the tree corresponds to a subspace of the feature space; for each leaf ℓ and
corresponding region Rℓ, the estimate of the target value is the average ŷℓ of the observed y
values within that region [19].
Dividing the feature space into subspaces can be done in different ways. A simple way
of dividing the feature space into smaller subspaces is to use equal-size subspaces. It is also
possible to let the algorithm decide on the borders of the subspaces. The regression tree,
presented in the next section, uses a recursive method to divide the feature space into subspaces.
Figure 2.1 depicts a regression tree and also shows how the feature space is divided into
smaller subspaces based on this regression tree. The leaves of the regression tree are models for
the subspaces of the feature space (ŷi is a model for Ri).
2.1.3 Piecewise Regression with Regression Trees
A regression tree is a piecewise method that recursively partitions the feature space into smaller
subspaces. The tree itself consists of nodes and edges. Every internal node contains a simple
condition, e.g. xi < 10 (i.e., whether the data item's ith feature value is smaller than 10), and one
of the branches is chosen based on the current data item's answer to this question. To find the
prediction for a new data item, the tree is traversed from the root until a leaf is reached.
The leaves of a regression tree contain a model, such as a linear model or a constant model.
Constructing a regression tree is an iterative task. In each iteration, a feature and a cor-
responding threshold value need to be chosen by the algorithm. We call a pair
<Feature, Value> a split point. Selecting split points is a critical task when con-
structing piecewise models. When selecting a split point pair among different candidate split
point pairs, a metric is used to evaluate the different trees corresponding to the different split point
pairs. The tree and corresponding split point that perform best according to the metric are
chosen for the next iteration. The basic regression tree algorithm can use the Sum of Squared
Errors (SSE) to evaluate a tree T [8]:
S = ∑_{c∈leaves(T)} ∑_{i∈c} (y^(i) − ŷ^(c))²   (2.4)

where

ŷ^(c) = (1 / size(c)) ∑_{i∈c} y^(i)   (2.5)

is the predicted value for all data items landing in that leaf.
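Equations 2.4 and 2.5 translate directly to code. In the sketch below (plain Python; helper names are our own), each leaf is represented simply by the list of target values that land in it:

```python
def leaf_prediction(ys):
    """ŷ(c): the mean of the target values in a leaf (Eq. 2.5)."""
    return sum(ys) / len(ys)

def tree_sse(leaves):
    """Sum of squared errors over all leaves of a tree (Eq. 2.4).
    `leaves` is a list in which each leaf is the list of its y values."""
    total = 0.0
    for ys in leaves:
        y_hat = leaf_prediction(ys)
        total += sum((y - y_hat) ** 2 for y in ys)
    return total
```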
Algorithm 1 lists the basic algorithm for constructing a regression tree. In this algorithm,
first all the data items of the dataset are assigned to the root node (line 2). The ŷ(c) and SSE
values are then calculated for the root node (lines 3-4). Afterward, a repetitive task (lines 6-
32) is applied to each leaf of the tree, and each leaf is populated with two children until
a certain condition holds (lines 26-30). For each leaf of the tree, all possible split pairs
<Feature, Value> are examined and the pair that reduces the SSE of the leaf the most is chosen
(lines 12-25). If the chosen pair reduces the SSE by more than a threshold δ, the node
is populated with two children; otherwise, the leaf is left untouched (lines 26-31). If the
number of data items in a node is less than a threshold q, that node is also left untouched
(lines 8-10).
Algorithm 1 Basic Regression Tree Construction Algorithm
1: procedure ConstructRegTree(dataset)
2:   root.dataItems = dataset
3:   root.ŷ(c) = (1/size(dataset)) ∑_{i∈dataset} y^(i)
4:   root.sse = ∑_{i∈dataset} (y^(i) − root.ŷ(c))²
5:   queue.add(root)
6:   while !queue.isEmpty do
7:     node = queue.remove
8:     if size(node.dataItems) < q then
9:       continue
10:    end if
11:    bestSplitPair.sse = ∞
12:    for splitPair ∈ allSplitPairs do
13:      left.dataItems = splitDataItems(node.dataItems, splitPair, left)
14:      right.dataItems = splitDataItems(node.dataItems, splitPair, right)
15:      left.ŷ(c) = (1/size(left.dataItems)) ∑_{i∈left.dataItems} y^(i)
16:      left.sse = ∑_{i∈left.dataItems} (y^(i) − left.ŷ(c))²
17:      right.ŷ(c) = (1/size(right.dataItems)) ∑_{i∈right.dataItems} y^(i)
18:      right.sse = ∑_{i∈right.dataItems} (y^(i) − right.ŷ(c))²
19:      if bestSplitPair.sse > left.sse + right.sse then
20:        bestSplitPair.splitPair = splitPair
21:        bestSplitPair.sse = left.sse + right.sse
22:        bestSplitPair.left = left
23:        bestSplitPair.right = right
24:      end if
25:    end for
26:    if node.sse − bestSplitPair.sse > δ then
27:      node.left = bestSplitPair.left
28:      node.right = bestSplitPair.right
29:      queue.add(node.left)
30:      queue.add(node.right)
31:    end if
32:  end while
33: end procedure
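The following is a simplified Python sketch of Algorithm 1 for one-dimensional x, using recursion in place of the explicit queue and taking the candidate thresholds from the data values themselves (all names are our own, for illustration only):

```python
def build_tree(items, q=2, delta=1e-9):
    """Greedy regression-tree construction, sketched after Algorithm 1.
    `items` is a list of (x, y) pairs with 1-D x; q is the minimum node
    size and delta the minimum SSE reduction required to split."""
    ys = [y for _, y in items]
    y_hat = sum(ys) / len(ys)                      # ŷ(c), Eq. 2.5
    sse = sum((y - y_hat) ** 2 for y in ys)        # node SSE
    node = {"y_hat": y_hat, "sse": sse, "split": None, "left": None, "right": None}
    if len(items) < q:                             # lines 8-10
        return node
    best_sse, best_t, best_left, best_right = float("inf"), None, None, None
    for t, _ in items:                             # candidate split values from the data
        left = [(x, y) for x, y in items if x < t]
        right = [(x, y) for x, y in items if x >= t]
        if not left or not right:
            continue
        part_sse = 0.0                             # lines 13-18: SSE of the two children
        for part in (left, right):
            mean = sum(y for _, y in part) / len(part)
            part_sse += sum((y - mean) ** 2 for _, y in part)
        if part_sse < best_sse:                    # lines 19-24: keep the best pair
            best_sse, best_t, best_left, best_right = part_sse, t, left, right
    if best_t is not None and sse - best_sse > delta:  # lines 26-31
        node["split"] = best_t
        node["left"] = build_tree(best_left, q, delta)
        node["right"] = build_tree(best_right, q, delta)
    return node

def tree_predict(node, x):
    """Traverse from the root to a leaf and return the leaf's constant model."""
    while node["left"] is not None:
        node = node["left"] if x < node["split"] else node["right"]
    return node["y_hat"]
```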
One of the issues with the basic regression tree algorithm is its greedy method for
selecting the feature and value to split on. There are two problems with a greedy method for selecting
the split points. First, since greedy methods make their decisions based on locally optimal
choices, the final model might be suboptimal in terms of accuracy. Second, when the
number of dimensions and the size of the dataset are large, finding these split points (even greedily)
has a very high runtime. We need methods that increase accuracy and decrease
the runtime.
Regression trees and piecewise linear regression were proposed for the case when the dataset is not dis-
tributed. When the dataset is large, the algorithms that generate the regression model can be
very slow (splitting all data items of all leaf nodes into two subsets for all different
<Feature, Value> pairs is an expensive task for a high-volume, high-dimensional dataset). Thus,
for large-scale datasets, new technologies, techniques, and algorithms need to be used to per-
form the task more efficiently. Section 2.2 discusses MapReduce, the framework
we have used for distributed data processing.
2.1.4 Piecewise Linear Approximation of Time Series
Piecewise linear representation (PLR) is generally used to approximate time series with
straight lines (hyperplanes). Piecewise linear representation is more efficient than other
modeling techniques in terms of storage, transmission, and computation, and has several ap-
plications in clustering, classification, similarity search, etc. [10].
Piecewise linear representation algorithms are also called Segmentation Algorithms (SAs). Three dif-
ferent specifications have been defined for SAs. For a time series T, find the best representation
that

• includes only K segments,

• keeps the error for each segment below a threshold, or

• keeps the total error below a threshold.
A PLR can be either online or batch [10].
PLR algorithms can be divided into three categories: bottom-up, top-down, and sliding-
window. The bottom-up approach finds approximations of small pieces of the time series and builds
the final solution by merging them. The top-down approach recursively divides the time series
until a stopping criterion is satisfied [10, 13]. The sliding-window approach grows a segment until the
error exceeds a threshold: it starts from the first point of T and adds points
to the segment while the sum of errors is less than a threshold. At that point a segment is generated and
the process continues to generate a new segment from the next point. Several optimizations have been
proposed for this algorithm: 1) advancing by more than one point in each iteration of the process
of finding one segment, and 2) since the error is monotonically non-decreasing, methods such as
binary search can be used [10].
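A minimal sketch of the sliding-window idea, assuming constant models per segment for brevity (the algorithms discussed in [10] typically fit lines; function name is our own):

```python
def sliding_window_segment(ts, max_error):
    """Sliding-window segmentation sketch: grow each segment while the SSE
    of its best constant fit stays within `max_error`, then start a new
    segment from the next point. Returns half-open index intervals."""
    segments = []
    start = 0
    while start < len(ts):
        end = start + 1
        # Grow the segment until adding one more point exceeds the error bound
        while end < len(ts):
            window = ts[start:end + 1]
            mean = sum(window) / len(window)
            sse = sum((v - mean) ** 2 for v in window)
            if sse > max_error:
                break
            end += 1
        segments.append((start, end))  # segment covers ts[start:end]
        start = end
    return segments
```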
Top-down methods find good split points and split T into two segments. An approximate
linear model is calculated for each part, along with its error. If the error
is less than a threshold, the model for that part is finalized; otherwise the algorithm recursively
repeats the process. Bottom-up methods start from the smallest possible segments (n/2
segments in total). They find the cost of merging each pair of adjacent segments and merge the
adjacent pair that has the lowest cost. This process is repeated while the minimum cost of
merging remains smaller than a threshold [10, 13].
Keogh et al. propose a new online algorithm called SWAB (Sliding-Window And Bottom-
up). SWAB uses a sliding buffer with a size close to 6 segments and applies bottom-up on that frame.
After segmentation, the leftmost segment is reported, the corresponding data is removed
from the frame, and more data is read into the frame [10].
D. Lemire suggests that instead of having linear models for each interval of a time series,
we could have models of different degrees for different intervals [13]. Some intervals may
have constant models, some linear, etc. The method is called adaptive because the degree of the
model in each interval is decided adaptively. The reason the adaptive method is suggested is
that piecewise linear models might locally over-fit the data by trying to find a linear model
where a constant model would fit the data better. Since time series datasets
can be very large, the efficiency of the algorithm is very important. The adaptive method proposed
in this paper tries to improve the quality of the model while keeping the cost of model
construction the same as the top-down method [11].
Different algorithms with different advantages and disadvantages can be used for ap-
proximating time series [13]. Optimal adaptive segmentation uses dynamic programming to
find the best segmentation and is thus of high complexity (Ω(n²)). The top-down method, on
the other hand, iteratively selects the worst segment and divides it into two smaller segments
until the complexity of the model reaches the maximum allowed complexity. The adaptive top-down
algorithm first applies the top-down algorithm to the time series, and then replaces linear-model seg-
ments with two constant-model segments wherever the error can be reduced by this replacement.
Another version of adaptive top-down first constructs a top-down constant model and then
merges constant models in order to obtain linear models. The optimal algorithm is not prac-
tical because it takes a very long time (weeks) to generate results for a time series with one
million data points. Adaptive top-down is slightly slower than the top-down algorithm, but
generates results of higher quality.
2.1.5 Online Approximation of Non-linear Models
XCSF and LWPR are two algorithms for online linear approximation of an unknown
function. These methods cluster the input space into small subspaces, find a linear model
for each subspace, and use a weighted sum to form the final model. For this we need to
first structure the feature space into small subspaces in order to exploit the linearity of the
target function in each subspace, and then find the linear model in each patch.
There are several solutions for the second step, but the first step is not straightforward.
XCSF is an evolutionary algorithm that uses a GA [22], and LWPR (Locally Weighted
Projection Regression) is a statistics-based algorithm; both incrementally approximate
non-linear multi-dimensional functions online [20].
Receptive Fields (RFs) is the notion used by LWPR for the ellipsoidal subspaces; XCSF
refers to the subspaces as classifiers (another term for RFs) [17]. Both algorithms start with an empty
population of RFs and add new members to this population when a new
uncovered data item is received. An n-dimensional ellipsoid that is not necessarily axis-
aligned can be represented by a positive semi-definite, symmetric matrix D. The
squared distance of a data item x from the center c of this subspace is then defined as:

d² = (x − c)ᵀ · D · (x − c)   (2.6)
If this distance equals one, the data item lies on the surface of the RF. This is how the
subspaces are defined in both methods. A linear model for each subspace can be expressed as:

p(x) = ∑_{k=1}^{n} b_k·x_k + b_0   (2.7)

One data item can be covered by several subspaces; in that case a weighted combination of the
linear models of those subspaces is used as the model's prediction for the input data
item [17, 22, 20].
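Equations 2.6 and 2.7 and the weighted combination can be sketched as follows. The Gaussian weighting here is an LWPR-style choice used for illustration (XCSF weights covering classifiers differently, as discussed below), and all function names are our own:

```python
import math

def squared_distance(x, c, D):
    """d² = (x − c)ᵀ · D · (x − c) as in Eq. 2.6, for plain Python lists.
    D is a symmetric positive semi-definite matrix (list of rows)."""
    diff = [xi - ci for xi, ci in zip(x, c)]
    Dd = [sum(D[i][j] * diff[j] for j in range(len(diff))) for i in range(len(diff))]
    return sum(di * ddi for di, ddi in zip(diff, Dd))

def weighted_prediction(x, rfs):
    """Combine the local linear models (Eq. 2.7) of the receptive fields,
    weighted by a Gaussian of the squared distance to each RF's center."""
    num, den = 0.0, 0.0
    for c, D, b0, b in rfs:  # each RF: center, shape matrix, intercept, weights
        w = math.exp(-squared_distance(x, c, D))
        p = b0 + sum(bk * xk for bk, xk in zip(b, x))  # local linear model p(x)
        num += w * p
        den += w
    return num / den
```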
LWPR assigns a Gaussian activity weight to each subspace based on its distance to the data
item and, for the sake of performance, ignores weights that are smaller than a threshold.
This way, closer subspaces have a more significant effect on the final prediction compared to farther
ones. XCSF, on the other hand, only assigns weights to subspaces with a distance of less
than 1. In XCSF, weights are inversely proportional to the prediction error [17, 22, 20].
Finding a linear model for each subspace is straightforward and can be done using least
squares methods. XCSF uses RLS (Recursive Least Squares), and LWPR uses incremental
partial least squares (an incremental version of PLS) to find the linear model in a subspace [17, 22, 20].
Learning the locality (the shape and location of the receptive fields) is done by a steady-
state genetic algorithm in XCSF, and by stochastic gradient descent in LWPR [17, 22, 20].
In XCSF, each RF keeps an approximate value of its current prediction error, which is used
to calculate its fitness for the GA. Fitness is shared among the RFs that cover the same inputs.
Tournament selection is used for the GA's selection task, and the crossover and mutation operators
are applied to the center location, stretch, and rotation, which are defined by the matrix D.
When the population reaches a maximum size, some RFs are deleted from crowded regions
of the input space using a proportionate selection probability. During this process, the algorithm
tries to generalize RFs by enlarging their coverage area while keeping their accuracy sufficiently
high [17, 22, 20].
In LWPR, the centers of the subspaces are not changed; only the D matrix (the size and
orientation of the ellipsoids) is adapted. This optimization is done by incremental gradient descent
based on stochastic leave-one-out cross-validation. For this purpose, D is first decomposed into
a triangular matrix and then updated. The cost function is the activity-weighted error plus a
penalty term that prevents the subspaces from shrinking over iterations [17, 22, 20].
XCSF and LWPR are compared in [17]. For comparison purposes, LWPR is tuned to hit
a low target error (by decreasing the size of the RFs and changing the learning rate and penalty
value), namely the target error hit by XCSF. Then XCSF's maximum population size is set to be
roughly equal to LWPR's number of RFs [17, 22, 20].
2.2 MapReduce
MapReduce is a programming model for processing large datasets. Programs written in
this programming model run on a cluster of nodes called a MapReduce cluster. There are
two kinds of nodes in such a cluster: mappers and reducers. Mappers run the part of the program
called the map procedure, and reducers run another part of the code called the reduce procedure. All
mappers and reducers run the same code on different data. Mappers (the map procedure) read the
input data from the hard disk of the machine they run on, and process the data to generate
intermediate results. The data received by a mapper is assumed to be in the form of <key, value>
pairs. Each mapper processes its part of the data and generates its result as <key, value> pairs as
well. An input <key, value> pair and an output <key, value> pair might not have anything to do
with each other.
One mapper might generate many <key, value> pairs with different keys and values.
The <key, value> pairs generated by the mappers are then sent to reducers for the next phase
of processing. The <key, value> pairs are not sent to reducers randomly; instead, they are
partitioned among the reducers based on the key of each pair. For example, of all the <key, value>
pairs generated by all mappers, those with key equal to key1 are sent to one
certain reducer.

Each reducer receives a group of <key, value> pairs generated by the mappers and processes
them in order to generate the final result. Since the map and reduce phases are run in parallel
by all mappers and reducers, a large dataset that is distributed among cluster nodes is processed by the
MapReduce framework much faster than is possible on a single machine.
2.2.1 Why <Key, Value> Pairs?
When data is processed by mappers, we need a way to aggregate the results generated by different
mappers. For example, if the ultimate task is counting the number of words starting with a, b, c,
and d in a huge set of text files, each mapper can generate the result for the part of the data
it holds locally, and we need a way to aggregate the results from all mappers. Having <key, value>
pairs lets us request that all mappers send the count of all words starting with a certain
character to a certain reducer, so that this reducer has all the partial results and can
calculate the final result. For this purpose, all mappers would generate results like <a, count>, and
all the results having a as their key would be sent to one certain reducer [3].

The key concept is that the programmer is aware of the way mappers need to generate
results (<key, value> pairs) and of the way data is shuffled from mappers to
reducers, and needs to decide how to use this programming model to solve the problem
at hand.
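The word-counting example above can be sketched as a single-process simulation of the map, shuffle, and reduce phases (plain Python; this illustrates the programming model only, not an actual distributed run, and the function names are our own):

```python
from collections import defaultdict

def map_phase(file_content):
    """Map: emit one <first-letter, 1> pair per word starting with a-d."""
    for word in file_content.split():
        first = word[0].lower()
        if first in "abcd":
            yield (first, 1)

def shuffle(pairs):
    """Group pairs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the partial counts received for one key."""
    return (key, sum(values))

# Simulate the job with two 'mappers', each holding part of the data
pairs = list(map_phase("apple dog cat")) + list(map_phase("banana dig ant"))
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real MapReduce run, each `map_phase` call would execute on a different machine and the shuffle would move pairs over the network, but the dataflow is the same.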
2.2.2 Is That All MapReduce Does?
So far we have talked about how the MapReduce programming model helps us solve data
processing problems in a parallel fashion. But this is not all that a MapReduce implementation
offers us (different MapReduce implementations are available, among which we can name
Hadoop [12], an open-source implementation). Once you have written your MapReduce code,
you are done, and Hadoop (or any other MapReduce implementation) takes care of the rest of the
problems. The framework sends the map procedure to all mappers and the reduce procedure
to all reducers. It then asks the mappers to run the code on their local data and generate
results based on what is specified in the code. After the results are generated, the framework takes
care of shuffling the data among the reducers. After the reducers receive the data, it asks them
to run the reduce procedure on the received <key, value> pairs.
Figure 2.2: MapReduce Execution Overview [3].
A question here is how the programmer divides and copies the data file onto the cluster nodes
so that it can be processed by the MapReduce framework. The programmer does not need to do this task.
The MapReduce framework has a distributed file system (the Google File System, or GFS, and the Hadoop
Distributed File System, or HDFS, in the Hadoop implementation of MapReduce) that facilitates
it. All you need to do is run the distributed file system and issue a command like
copy bigFile.txt on the cluster; the rest of the work is done by the framework. Another question
is what happens if a certain mapper fails in the middle of a run. The answer is that the MapReduce
framework takes care of this issue as well. When the distributed file system copies the data onto the
cluster, it replicates different chunks of the data on different mappers (based on the replication factor
indicated in the configuration file by the user), and when a certain mapper fails, its task is
assigned to other mappers. The MapReduce framework also takes care of other lower-level tasks
such as network communication. There are nodes in a MapReduce cluster whose task
is bookkeeping: they keep track of cluster nodes, mappers, reducers, data replication, etc.
Figure 2.2 illustrates the execution of a MapReduce task on a MapReduce cluster. The user
program is distributed by the master among the worker nodes. Some of the worker nodes work
as mappers and some as reducers. Data is read by the mappers, which then run the map
procedure on it. The intermediate data that is generated is then sent to the reducer
nodes. The reducer nodes process the <key, value> pairs they have received and generate the
final result [3].
2.2.3 MapReduce for Clustering
One large-scale data processing task is data clustering. Several algorithms have been presented
recently for different clustering problems on the MapReduce framework. In this section we
review three clustering algorithms to see how they use the power of MapReduce to
cluster data.
Zhao et al. argue that all previous research on parallel k-means suffers
from two problems [24]. First, it assumes that all the data fits in main memory, and second,
it uses a restricted programming model. For these two reasons, those works are not
applicable to peta-scale datasets. Since distance calculation (performed n·k times in each
iteration, where n is the number of data points and k is the number of clusters) is the most expensive
step of the algorithm, the authors try to exploit the parallelism of MapReduce to decrease this
cost. The Map function assigns each data point to its closest center, and the Reduce function updates
the centroids. There is one more function, called Combine, that aggregates the intermediate
results of the Map functions. A global variable called centers holds the list of all centers and is used
by all map tasks. Map tasks generate <cluster index, data point> pairs. The Combine
method aggregates the results of the same map task: it calculates the partial sum of the
data points assigned to the same cluster and outputs <cluster index, partial sum> pairs.
The Reduce function aggregates all the partial sums for each cluster and calculates the new
centroids; its output is <cluster index, new centroid> pairs. The speedup achieved
on 4 machines is around 3 for the biggest dataset (8 GB), which is a good speedup, and
the speedup grows with dataset size, which is a good indication. The authors do not
discuss the iterative nature of the algorithm or how this issue is handled. They
also do not discuss the accuracy of the method; they only report speed-up, scale-up, and
size-up [24].
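One iteration of the Map/Combine/Reduce scheme described above might be sketched as follows for one-dimensional points (a single-process simulation in plain Python; the pair contents follow our reading of [24], and all names are our own):

```python
def kmeans_map(points, centers):
    """Map: assign each point to its closest center, emitting <index, point>."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
        yield (idx, p)

def kmeans_combine(pairs):
    """Combine: partial sum and count per cluster, local to one map task."""
    partial = {}
    for idx, p in pairs:
        s, n = partial.get(idx, (0.0, 0))
        partial[idx] = (s + p, n + 1)
    return partial  # <cluster index, (partial sum, count)>

def kmeans_reduce(partials):
    """Reduce: aggregate the partial sums and emit the new centroids."""
    totals = {}
    for partial in partials:
        for idx, (s, n) in partial.items():
            ts, tn = totals.get(idx, (0.0, 0))
            totals[idx] = (ts + s, tn + n)
    return {idx: s / n for idx, (s, n) in totals.items()}

# One iteration over two map tasks (1-D points for brevity)
centers = [0.0, 10.0]
p1 = kmeans_combine(kmeans_map([1.0, 2.0], centers))
p2 = kmeans_combine(kmeans_map([9.0, 12.0], centers))
new_centers = kmeans_reduce([p1, p2])
```

A full k-means run would repeat this map/combine/reduce round until the centroids stop moving, which is exactly the iterative aspect the paper does not elaborate on.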
Ferreira Cordeiro et al. present an algorithm for clustering very large multi-dimensional datasets
with MapReduce [7]. Since such a dataset does not fit on one or even several disks, parallel
processing is the only solution, and I/O cost and network cost are the two things
that need to be balanced. Best of both Worlds (BoW) is the solution the authors
suggest in this paper. They have worked on the largest real datasets used so far in database
subspace clustering (a Twitter crawl of more than 12 TB, and Yahoo! operational data of 5 petabytes;
merely reading 1 TB from a modern 1 TB disk takes around 3 hours). The contribution of the paper is
combining a sequential clustering algorithm with a parallelization method in an efficient way.
Sequential subspace clustering algorithms can be plugged into this solution, and the system
balances the I/O and network cost. The sequential algorithm that is plugged into the
parallel algorithm finds beta-clusters of hyper-rectangle shape in the multi-dimensional
space, and can be density-based or k-means-based [7].
The I/O-optimal version of the algorithm (ParC) reads the dataset once, reducing the
I/O. Another algorithm, SnI (sample and ignore), improves the network cost but reads the data
twice. Depending on the number of reducers, either of the two can be the winner. BoW
is a combined algorithm that decides which of these algorithms to use based on the number of
reducers, keeping the cost at min(ParC, SnI) for any number of reducers. ParC partitions
the dataset across the cluster (mappers) using one of these methods: random, address space, or
arrival order; finds beta-clusters in each partition (reducers); and finally merges the
clusters (on a single machine). SnI, on the other hand, first samples the dataset (exploiting the
skewed distribution of the data), and then clusters the sample using ParC, ignoring the un-
sampled data items. This way SnI avoids processing many of the data items that belong to
big clusters that were already sampled. SnI reads the data twice: in the first read it samples
the data, and in the second read it maps only the data items not covered by the sample
clustering. The network cost is greatly reduced by this technique. In the sample step of the
algorithm, mappers map each point with probability Sr to a single reducer. That reducer then
clusters the data using the plug-in clustering algorithm and passes the cluster descriptions to
the next phase. In the ignore phase, each mapper reads its partition again, ignores the data points
that fit into the clustering found in the sample phase, and sends the other data items to r reducers.
Those reducers cluster the data points using the plug-in clustering algorithm and pass the
clustering descriptions to one machine. That machine merges all the clusterings found in the second
phase with the clustering found in the first phase [7].
Both ParC and SnI have their own benefits. ParC optimizes I/O by reading the data file
once, and SnI optimizes the network cost by reducing the number of data points that need to be
transferred over the network, at the cost of reading the data file twice. To take advantage of
the benefits of both methods, we need a combined method that selects one of them
based on cost. A cost-based optimization method is used to select the better algorithm
adaptively. The cost formula uses file size, network speed, disk speed, startup cost, and
plug-in cost to calculate the total cost of each algorithm. The BoW algorithm first calculates
both costParC and costSnI, selects the better one based on the parameters, and calls it.
Experiments have been done to check the accuracy, scalability, and performance of the cost-
based method. The authors have shown that the quality of the clustering matches the quality
of sequential clustering while the speed-up is close to linear. The cost-based method has also
been shown to be the best of both worlds in all cases [7].
Ene et al. have designed the first approximate versions of the metric k-center and k-median
algorithms for MapReduce [6]. They assume that a set V of n data points and their cor-
responding pairwise distances are given, and try to group similar points into the same clusters. The
output of the algorithm is k data points that are considered to be the centers of the k
clusters. The algorithms first sample the data (in a way that the sample represents all the
data well) in order to decrease the dataset size. The sampling method incrementally adds new
points to the final sample set only if they are not already well represented by that set.
Sampling is different for k-median and k-center due to their different natures: sampling for
k-median needs more effort because it must consider each point's distance from its cluster
center. A version of the algorithm that can be run on MapReduce is presented in the paper [6].
The MapReduce version of sampling is an iterative algorithm, with three MapReduce operations
in each iteration. The first MapReduce operation partitions the data arbitrarily
among machines (mappers), and then each reducer constructs two sets (S, the final set, and H, from
which a pivot is selected). In the next MapReduce step, all the mappers pass the H and S sets to
a single reducer, which finds the pivot point. In the last MapReduce step, mappers
send the pivot, S, one partition of R (the remaining data items that are not sampled yet), and the
distance matrix to the reducers, and the reducers remove the well-represented points. These steps
are iterated until the number of remaining points in R falls below a certain threshold. K-center
tries to minimize the maximum distance between a cluster center and the points in that cluster,
while k-median tries to minimize the sum of the distances of all the points in a cluster from the
cluster center (both problems are known to be NP-hard). K-center uses the sample produced
by the sampling algorithm: the mappers map all the points in the sample along with their
pairwise distances to one reducer, and the reducer runs a simple local clustering algorithm. K-
median needs more information, and its sample must carry information about all the nodes
that are to be clustered: for each un-sampled point, the closest sample point is selected and
its weight is increased by 1. In k-median, the sampling is done first, and then partitions of
the original dataset, along with the sample and part of the distance graph, are sent to the reducers.
Each reducer finds partial weights of the sample points. Then, in another MapReduce round, the
partial weights are summed up. The last step is a simple clustering on the sample considering
the weight of each sample point [6].
2.2.4 MapReduce and Iterative Tasks
Many machine learning and data mining algorithms work iteratively on data, but
MapReduce is not well-suited to tasks with cyclic data flow. There are frameworks, such
as Twister [5], Spark [23], and HaLoop [1], that support iteration. Dave et al. present a cloud-
based pattern for large-scale iterative data processing problems [2]. As a case study, they have
implemented CloudClustering, which shows how iterative data processing problems
can be handled on the cloud.
CloudClustering is a distributed version of the k-means clustering algorithm, implemented
on Microsoft's Windows Azure platform. The authors introduce a way to balance the performance
versus fault-tolerance trade-off (the main trade-off when solving iterative problems on the
cloud) using data affinity and a buddy system. Some methods use a central pool of state-
less tasks in order to handle fault tolerance, but this can lead to low performance
because a cluster node might need to receive different parts of the data in different iterations
(i.e., there is no affinity between data and workers) [2].
Windows Azure handles fault tolerance by means of reliable queues. When a worker takes
a task from the queue, the message becomes invisible, and if it is not deleted within a timeout, it
reappears in the queue. This way, if a worker fails, the task will be done by another worker
node. One of the issues with iterative tasks on the cloud is the stopping criterion, which can be
handled in two different ways in this problem. If no data point changes clusters from
one iteration to the next, we are done; this method needs to keep track of the previous cluster
of each data point. The other method checks the maximum centroid movement
and stops if it falls below a certain threshold; this method works on read-only memory,
but cannot guarantee convergence [2].
The proposed architecture uses Windows Azure's queuing system and includes one
master and a pool of worker nodes. The input dataset is stored centrally and is partitioned by
the master. The workers download a task containing the address of the corresponding part
of the partition and the centroid list, and perform the task. This method works best in terms
of fault tolerance, but since data affinity is not considered, performance is poor in this
system. The other extreme is having one queue per worker, which solves the problem of
data affinity (the master assigns the same partition of data to the same worker in each iteration)
but suffers from a fault-tolerance problem (there is no other worker to take over the current
task if the worker fails). The buddy system groups workers into buddy groups, and a queue
is shared among all members of each buddy group. The size of the buddy group then defines a
balance between fault tolerance and performance [2].
2.2.5 Arguments about Using or Not Using MapReduce
Schwarzkopf et al. have listed seven assumptions and simplifications employed by researchers
in cloud research that threaten the practical applicability and scientific integrity
of that research [16].
One of the issues they point out is unnecessary distributed parallelism. Very large
datasets and frameworks such as MapReduce have pushed researchers to
employ distributed parallelism more and more. Since the new high-performance computing
frameworks offer fascinating simplicity and handle complicated issues like communication,
synchronization, and data motion, many people are willing to use these frameworks without
considering whether they are useful for the problem at hand. Frameworks
such as MapReduce reduce the engineering time needed to design a distributed
version of an algorithm, but they mostly increase the runtime. For this reason, the speedup of
a program must be measured to show that the distributed solution outperforms the sequential
solution. Furthermore, even if we are sure that a parallel solution would benefit the problem
at hand, we need to make sure that we actually need to distribute the data over several
machines. They also point out that, as Rowstron et al. have shown, with today's multicore
processors and huge amounts of RAM we might not need a distributed solution for
many problems [15]; we would then be able to make use of fast communication mechanisms
such as shared memory and also avoid data motion [16].
Another issue they mention is forcing the abstraction. MapReduce is
designed to alleviate the I/O bottleneck of big data by distributing data over several hard
disks, and the time needed to process a job on a single machine is assumed to be long. Some
solutions iterate and generate many short MapReduce jobs, while it would be better
to have the smallest number of jobs running iteratively on each system. Domain-specific
systems (for stream processing, iterative processing, and graph processing) have also emerged,
and these seem far more justified than using MapReduce for every problem [16].
Since many machine learning and data mining algorithms are iterative, MapReduce
is not inherently an iterative programming model, and some other algorithms do not fit
this model for other reasons, many alternatives to and extensions of MapReduce have been
provided by different research and industrial groups in recent years. Some theoretical studies
have been done to show that Hadoop (an open-source implementation of MapReduce) has
limitations. Empirical studies have also been done, and frameworks such as HaLoop [1] and
Twister [5] present a class of algorithms that Hadoop is not a good fit for, and try to extend
Hadoop and solve those problems more efficiently than Hadoop does; of course, they outperform
Hadoop at least when running that special algorithm. Jimmy Lin provides reasons why we
need to either revise current algorithms to run on MapReduce or devise new algorithms
that follow the MapReduce programming model. He suggests that, since MapReduce is currently
the widely used solution for large-scale data processing problems, we can get rid of the iterative
solution and try to use (or devise) alternative solutions that fit MapReduce, instead
of devising new frameworks for algorithms for which MapReduce is "good enough". He discusses
three classes of problems to justify his claim: iterative graph algorithms (e.g., PageRank),
gradient descent (e.g., for training a logistic regression classifier), and EM (e.g., for k-means
and HMM training) [16].
Jimmy Lin argues that extensions of Hadoop support iterative constructs and thus
alleviate some of these problems, but the problem with all these frameworks is that they are not
Hadoop. It costs an organization a lot to maintain another framework (other than Hadoop)
just for graph and iterative algorithms. A better solution would be to try to solve the
above-mentioned problems by changing the algorithms so that they can run on Hadoop.
MapReduce only needs to perform better than the alternative currently used to solve the
problem; it does not need to beat all the alternatives. For example, MapReduce
performs a lot better than GIZA++ for the word-alignment algorithm, and is also considered
an advance when used for k-means clustering [16]. The Hadoop stack is the standard and
widely used platform for large-scale data analysis. Any large-scale data analysis needs to
be able to process different types of structured and unstructured data and to run different
types of algorithms (graph, text, relational data, ML, etc.). No single programming model
or framework can meet all these needs and be the best in all aspects, such as
performance, fault tolerance, expressivity, simplicity, and the abstraction of low-level features
such as synchronization. Now the question is: is adopting and deploying a new framework to
solve a problem worth it (in terms of cost, time, generality of the framework, having personnel
who have mastered the framework, etc.) [14]?
Chapter 3
Approach
In this chapter we introduce two different piecewise regression algorithms. The first is
called the MapReduce Regression Tree (MRRT) algorithm, and the second is called the
Slope-changing algorithm. Both algorithms try to find a piecewise regression model for a dataset
within the MapReduce framework. The MapReduce Regression Tree algorithm is a regression-tree-based
algorithm that can be used within the MapReduce framework. The Slope-changing algorithm, on
the other hand, tries to introduce a non-greedy method to find good candidate split points
and uses this candidate set to find the final set of split points. The performance of these
two algorithms is analyzed and compared in Chapter 4.
3.1 MapReduce Regression Tree
Algorithm 2 lists the pseudocode for the MapReduce Regression Tree algorithm. This algorithm
partitions the feature space into smaller subspaces, but constructs a regression tree model
(instead of the linear model used in the Slope-changing algorithm) for each subspace. The
generated regression tree models are used to predict the target value of new data items.
Unlike the Slope-changing algorithm, which selects split points based on the logic that
maximum, minimum, and slope-changing points are good candidates, this algorithm does not
choose split points based on any heuristic, and the feature space is not divided into subspaces
along different dimensions. The feature space is divided into subspaces of equal size (in terms
of the volume of the subspace, not the number of data items in each subspace), and it is divided
into smaller subspaces along one dimension of the feature space. This dimension is chosen
randomly or using the preProcess method. The preProcess method retrieves a random sample of
the dataset (in our experiments we used 10% of each dataset) and runs the piecewise
Algorithm 2 MapReduce Regression Tree Algorithm - Main Method
1: function MR-Regression-Tree-Learn
2:     dimToSplit = preProcess(dataset)
3:     rangeValues = Map1(dimToSplit)        ▷ All mappers find min and max value of the
4:                                           ▷ dimension that is being split
5:     splitPoints = Reduce1(rangeValues, dimToSplit, nMappers)   ▷ Split points are specified
6:                                           ▷ based on dimension size and number of mappers
7:     Map2(splitPoints, dimToSplit)         ▷ Data is shuffled among reducers
8:     models = Reduce2()                    ▷ Each reducer finds the model for the received data
9: end function
Regression Tree algorithm on each dimension of the dataset. One piecewise regression
tree model is generated per dimension. Each model is then tested against a
validation set, and the dimension whose model has the lowest RMSE on the validation set
is chosen as the dimension along which the MapReduce Regression Tree algorithm splits
the dataset.
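The dimension selection performed by preProcess can be sketched as follows. This Python sketch is illustrative only: it substitutes a simple two-bucket mean model for the thesis's piecewise regression tree, and all names (`pre_process`, `fit_bucket_model`, etc.) are our assumptions. The essential logic is the same: fit one single-dimension model per feature, score each on a validation set by RMSE, and return the dimension with the lowest error.

```python
def rmse(preds, targets):
    """Root mean squared error between predictions and targets."""
    return (sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)) ** 0.5

def fit_bucket_model(train, dim):
    """Toy 1-D model: split dim's range in half, predict the mean target of each half."""
    xs = [row[dim] for row, _ in train]
    mid = (min(xs) + max(xs)) / 2.0
    lo = [y for row, y in train if row[dim] <= mid]
    hi = [y for row, y in train if row[dim] > mid]
    lo_mean = sum(lo) / len(lo) if lo else 0.0
    hi_mean = sum(hi) / len(hi) if hi else lo_mean
    return lambda row: lo_mean if row[dim] <= mid else hi_mean

def pre_process(train, validation, n_dims):
    """Pick the dimension whose single-dimension model has the lowest validation RMSE."""
    best_dim, best_err = 0, float("inf")
    for d in range(n_dims):
        model = fit_bucket_model(train, d)
        err = rmse([model(row) for row, _ in validation],
                   [y for _, y in validation])
        if err < best_err:
            best_dim, best_err = d, err
    return best_dim
```

In the thesis's setting, `fit_bucket_model` would be replaced by the piecewise regression tree learner, and `train` would be the 10% random sample mentioned above.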
Figure 3.1: Dataset distribution among cluster nodes with overlap to decrease the prediction error of borderline data points
When dividing a feature space into subspaces, the models constructed for two adjacent
subspaces might have different predictions for a data point located on the
borderline. The same holds for data points that are located in one subspace but close to the
borderline. For these data points, the neighboring model might give a better prediction than
the model of the subspace the data point actually lies in. For this reason, smoothing methods
try to reduce the problem by using a weighted average of the predictions of neighboring models for
data points close to the borderline. The number of such models is two in a two-dimensional
feature space, and may be larger in an n-dimensional feature space, depending on the location
of the data item (the data item might be close to a borderline in one dimension but not in
another).
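A minimal one-dimensional sketch of such a smoothing scheme: near a border, blend the local model's prediction with the neighbor model's, weighted by distance to the border. The linear blend across a band of width `2 * band` is our assumption; the thesis does not fix a particular weighting.

```python
def smoothed_predict(x, border, band, left_model, right_model):
    """Within `band` of `border`, return a distance-weighted blend of the two models."""
    if x < border - band:
        return left_model(x)
    if x > border + band:
        return right_model(x)
    # weight of the right model grows linearly from 0 to 1 across the band
    w = (x - (border - band)) / (2.0 * band)
    return (1.0 - w) * left_model(x) + w * right_model(x)
```

Exactly on the border (`x == border`) the two models contribute equally; far from the border only the local model is used, so the blend changes nothing for interior points.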
Since we use a distributed method to solve the regression problem, we have more
resources at hand, and we might be able to afford a little redundant computation in
order to increase the accuracy of the model. Based on this logic and the borderline
prediction problem explained above, we decided to use overlapping subspaces and let
each mapper have more data than it strictly needs to construct its model. Figure 3.1
depicts how 7 partitions of a dataset, partitioned along the x axis, are assigned to 7 mappers.
Every cluster node except the first and the last receives three partitions of the dataset:
the partition it constructs the model for (we call it the main partition) plus the partitions
to its left and right (the neighbor partitions). The first and last nodes receive only two
partitions because their main partitions have just one neighboring partition each.
Distributing the dataset this way lets the system construct each model from 3 partitions
but predict target values only for test items located in the main partition. This way we
have no borderline data items, and thus we do not need the prediction smoothing methods
used in the Slope-changing algorithm.
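The assignment in Figure 3.1 can be captured in a one-line helper. This Python sketch is illustrative (the function name and 0-based indexing are ours): node i receives its main partition plus whichever of the two neighboring partitions exist.

```python
def partitions_for_node(i, n_nodes):
    """Return the partition indices shipped to node i (0-based): its main
    partition plus the neighbor on each side, clipped at the ends."""
    return [j for j in (i - 1, i, i + 1) if 0 <= j < n_nodes]
```

With 7 nodes, the interior nodes each receive three partitions while nodes 0 and 6 receive two, matching the figure.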
Algorithm 3 MapReduce Regression Tree Algorithm - Map Phase of First MapReduce Round
1: function Map1(dimToSplit)
2:     for all Mappers do
3:         ▷ dataset is the local part of the dataset on the node this mapper is running on
4:         minValue = +∞
5:         maxValue = −∞
6:         for all dataPoint ∈ dataset do
7:             minValue = min(dataPoint[dimToSplit], minValue)
8:             maxValue = max(dataPoint[dimToSplit], maxValue)
9:         end for
10:    end for
11:    send <1, <minValue, maxValue>>
12:    ▷ By indicating the key as 1, all information is sent to one reducer
13: end function
Figure 3.2 depicts different overlap factors when overlapping subspaces on cluster nodes.
When the overlap factor is 1, each node receives its own subspace plus two neighboring
subspaces, each as big as its own subspace. When the overlap factor is 0.5, the size of
Figure 3.2: Different overlap factors of subspaces on cluster nodes
the neighboring subspaces that each node receives is half the size of its own subspace. With
an overlap factor of 0, no overlapping exists.
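Under these definitions, each node's extended range can be sketched as its own subspace widened on each side by the overlap factor times the subspace width, clipped at the ends of the dimension. The Python helper below is illustrative; the names and the clipping behavior are our assumptions.

```python
def node_range(i, n_nodes, min_val, max_val, factor):
    """Extended (start, end) range of node i's subspace along the split
    dimension, widened by `factor` * subspace width on each side."""
    step = (max_val - min_val) / n_nodes
    start = min_val + i * step - factor * step
    end = min_val + (i + 1) * step + factor * step
    # clip at the ends of the dimension
    return max(start, min_val), min(end, max_val)
```

With `factor = 1` this reproduces the triple-width partitions of Figure 3.1; with `factor = 0` each node sees only its own subspace.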
Now that we have discussed the concepts behind how the MapReduce Regression Tree
algorithm works, let us briefly explain each line of Algorithm 2. The preProcess method
selects the dimension along which to partition the dataset. Then the first round of MapReduce
is started. In the map phase of the first MapReduce round, each mapper finds the range
(min, max) of the data items in the portion of the dataset that it owns. This information is
sent to one reducer. The reducer receives rangeValues and nMappers (the number of mappers)
and decides which portions of the dataset should be sent to each mapper. In the map phase of
the second MapReduce round, all mappers receive the splitPoints information and send the
data items they hold to two or three reducers (each partition of the dataset is sent to
several reducers due to overlapping). In the reduce phase of the second MapReduce round,
each reducer therefore holds two or three partitions of the dataset and constructs a
regression tree for the portion of the dataset it has received. Each of these phases is
explained in the following sections.
3.1.1 Map1: Finding the Min and Max of Dimension that Is Being
Split
Algorithm 3 lists the steps of this phase of the first MapReduce round. All mappers process the
portion of the dataset they own and find the minimum and maximum value of the dimension
Algorithm 4 MapReduce Regression Tree Algorithm - Reduce Phase of First MapReduce Round
1: function Reduce1(rangeValues, dimToSplit, nMappers)
2:     minValue = min(rangeValues.minValues)       ▷ min value of dimToSplit dimension
3:     maxValue = max(rangeValues.maxValues)       ▷ max value of dimToSplit dimension
4:     stepSize = (maxValue − minValue)/nMappers
5:     ▷ stepSize in dimToSplit dimension when partitioning the dataset
6:     splitPoints[1].start = minValue                   ▷ start and end of the partition for the
7:     splitPoints[1].end = minValue + 2 ∗ stepSize      ▷ first mapper is calculated differently
8:     for i = 2, nMappers − 1 do
9:         splitPoints[i].start = minValue + (i − 2) ∗ stepSize
10:        splitPoints[i].end = minValue + (i + 1) ∗ stepSize
11:    end for
12:    splitPoints[nMappers].start = minValue + (nMappers − 2) ∗ stepSize
13:    splitPoints[nMappers].end = minValue + nMappers ∗ stepSize
14:    ▷ start and end of the partition for the last mapper is calculated differently
15:    send splitPoints to all mappers       ▷ To be used by mappers of next round
16: end function
that is supposed to be split. All mappers then send these minimum and maximum values
to the same reducer. This is why the key of the emitted <key, value> pair is 1 for all
mappers.
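A minimal sequential Python sketch of one mapper's work in this phase (the MapReduce plumbing is omitted, and the function name is ours): scan the local data, track the min and max of the split dimension, and emit them under the constant key 1.

```python
def map1(local_data, dim_to_split):
    """One mapper's Map1: emit <1, (min, max)> for the split dimension."""
    lo, hi = float("inf"), float("-inf")
    for point in local_data:
        lo = min(lo, point[dim_to_split])
        hi = max(hi, point[dim_to_split])
    # the constant key 1 routes every mapper's range to the same reducer
    return (1, (lo, hi))
```

Because every mapper emits the same key, the shuffle phase delivers all local ranges to a single reducer, which can then compute the global range.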
Algorithm 5 MapReduce Regression Tree Algorithm - Map Phase of Second MapReduce Round
1: function Map2(splitPoints, dimToSplit, nMappers)
2:     for all Mappers do
3:         for dataPoint ∈ dataset do
4:             for i = 1, nMappers do
5:                 if splitPoints[i].start < dataPoint[dimToSplit] < splitPoints[i].end then
6:                     send <i, dataPoint> to the corresponding reducer
7:                 end if
8:             end for
9:         end for
10:    end for
11: end function
3.1.2 Reduce1: Finding Split Points Along the Dimension that Is
Being Split
Algorithm 4 lists the reduce phase of the first MapReduce round. In this algorithm, all the
maximum and minimum values sent by the mappers are used to find the maximum and minimum
values of the dimension being split. Using these two values, the range of the dimension is
found, and the stepSize of split points along that dimension is computed by dividing the range
by the number of mappers. The start and end values for each mapper along the dimension being
split are then found and stored in the splitPoints array. Every mapper receives a partition
three times as big as stepSize, except the two mappers whose main partition is the first or
last partition of the dataset; those two receive partitions only twice as big as stepSize.
Algorithm 6 MapReduce Regression Tree Algorithm - Reduce Phase of Second MapReduce Round
1: function Reduce2(dataPoints)
2:     Input: dataPoints: data items sent to this reducer
3:     for all Reducers do
4:         models[i] = treeRegressionModel(dataPoints)
5:     end for
6: end function
3.1.3 Map2: Shuffling the Data Among Cluster Nodes
The split points found in the Reduce1 phase are used in this phase to shuffle the data.
Algorithm 5 shows how the data is shuffled among cluster nodes. Each mapper sends each data
item in its local portion of the dataset to two or three reducers: data points located in the
first or last partition of the feature space are sent to two reducers, and all other data
points are sent to three reducers. This adds redundancy to the amount of data transferred
over the network, but it solves the problem of predicting target values for borderline data
items and also increases the accuracy of the model.
Algorithm 7 MapReduce Regression Tree Algorithm - Prediction
1: function MR-Regression-Tree-Test(models, dataP