Implementation of linear regression and logistic regression on Spark


  • Parallel implementation of ML algorithms on Spark

    Dalei Li EIT Digital

    https://github.com/lidalei/LinearLogisticRegSpark

    1


  • Overview Linear regression + l2 regularization

    Normal equation

    Logistic regression + l2 regularization

    Gradient descent

    Newton's method

    Hyper-parameter optimization

    Experiments

    2

  • Tools

    IntelliJ + sbt

    Scala 2.11.8 + Spark 2.0.1

    3

  • Linear regression Problem formulation

    Closed-form solution

    Computation reformulation

    4
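    In symbols (a standard formulation, not copied from the slide; X is the n×d design matrix, y the label vector, λ the l2 strength, x_i the i-th feature row):

      \theta = (X^\top X + \lambda I)^{-1} X^\top y,
      \qquad X^\top X = \sum_i x_i x_i^\top,
      \qquad X^\top y = \sum_i y_i x_i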

  • Linear regression Data set - UCI YearPredictionMSD, text file

    515,345 songs, each with 90 numerical audio features and a year label

    Core computation - normal equation terms and RMSE (see the sketch after this slide)

    5

    Implemented outer product + vector addition
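    A minimal sketch of that computation (an assumed helper, not the repo's exact code), accumulating the normal equation terms with treeAggregate:

      import breeze.linalg.{DenseMatrix, DenseVector}
      import org.apache.spark.rdd.RDD

      // Sketch: accumulate X'X = sum_i x_i x_i' and X'y = sum_i y_i x_i as a
      // sum of per-instance outer products and scaled vectors; treeAggregate
      // combines partial sums on the executors before they reach the driver.
      def normalEquationTerms(
          data: RDD[(DenseVector[Double], Double)], // (features, label)
          d: Int                                    // number of features
      ): (DenseMatrix[Double], DenseVector[Double]) =
        data.treeAggregate((DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
          seqOp = { case ((xtx, xty), (x, y)) => (xtx + x * x.t, xty + x * y) },
          combOp = { case ((m1, v1), (m2, v2)) => (m1 + m2, v1 + v2) }
        )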

  • Workflow

    6

    Read file (Spark SQL text) → RegexTokenizer → StandardScaler (center data) → solve normal equation (add l2 regularization; LAPACK) → evaluation (RMSE)
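    The solve step itself is then a single d×d linear system. A sketch using Breeze, whose \ operator delegates to LAPACK (matching the LAPACK box above):

      import breeze.linalg.{DenseMatrix, DenseVector}

      // Sketch: add the l2 term lambda*I on the diagonal, then solve
      // (X'X + lambda*I) theta = X'y. Breeze's \ calls into LAPACK.
      def solveNormalEquation(
          xtx: DenseMatrix[Double],
          xty: DenseVector[Double],
          lambda: Double
      ): DenseVector[Double] =
        (xtx + DenseMatrix.eye[Double](xtx.rows) * lambda) \ xty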

  • Validation

    7

    Spark ML linear regression with the normal solver vs. my implementation (both with 0.1 l2 regularization)

    Randomly split the data set into 70% train + 30% test. The RMSEs on the test set are also essentially identical, with less than 0.5% difference.

  • Logistic regression Problem formulation

    Gradient descent

    Newton's method

    Computation reformulation - gradient and Hessian matrix

    8
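    In symbols (the standard forms for l2-regularized cross-entropy loss, not copied from the slide; σ is the logistic function, S a diagonal matrix):

      \nabla J(\theta) = X^\top (\sigma(X\theta) - y) + \lambda \theta
      H = X^\top S X + \lambda I, \quad S_{ii} = \sigma(x_i^\top \theta)\,(1 - \sigma(x_i^\top \theta))
      \text{Newton step: } \theta \leftarrow \theta - H^{-1} \nabla J(\theta)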

  • Logistic regression Data set - UCI HIGGS, csv file

    11 million instances (21 low-level + 7 high-level numerical features, binary label)

    Core computation - gradient and Hessian matrix

    9

    treeReduce reduces the pressure of the final combine operations on the driver (see the gradient sketch below).
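    A sketch of one gradient evaluation under those definitions (an assumed helper using Breeze; the appended all-one bias column is handled upstream):

      import breeze.linalg.DenseVector
      import breeze.numerics.sigmoid
      import org.apache.spark.rdd.RDD

      // Sketch: each instance contributes (sigma(theta'x) - y) * x to the
      // gradient; treeAggregate sums partial results in a tree pattern so the
      // driver merges only a handful of vectors instead of one per partition.
      def gradient(
          data: RDD[(DenseVector[Double], Double)], // (features, label)
          theta: DenseVector[Double],
          lambda: Double
      ): DenseVector[Double] = {
        val g = data.treeAggregate(DenseVector.zeros[Double](theta.length))(
          seqOp = { case (acc, (x, y)) => acc + x * (sigmoid(theta dot x) - y) },
          combOp = _ + _
        )
        g + theta * lambda // l2 regularization term
      }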

  • Workflow

    10

    Read file (Spark SQL csv) → VectorAssembler → DF to RDD (Scala case class Instance(features, label)) → gradient descent / Newton's method → evaluation (cross entropy, confusion matrix)

    Gradient descent - add l2 regularization; Newton's method - append an all-one (bias) column
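    A sketch of that input path (column handling simplified; HIGGS.csv has the label first, then the feature columns):

      import org.apache.spark.ml.feature.VectorAssembler
      import org.apache.spark.ml.linalg.Vector
      import org.apache.spark.sql.SparkSession

      case class Instance(features: Vector, label: Double)

      val spark = SparkSession.builder().getOrCreate()
      // inferSchema makes all columns doubles; they come in named _c0 ... _c28
      val df = spark.read.option("inferSchema", "true").csv("HIGGS.csv")

      // Pack the feature columns into a single vector column
      val assembler = new VectorAssembler()
        .setInputCols(df.columns.tail)
        .setOutputCol("features")

      // Lower the DataFrame to an RDD of a plain case class for the
      // hand-written gradient descent / Newton's method loop
      val instances = assembler.transform(df).select("features", "_c0").rdd
        .map(r => Instance(r.getAs[Vector](0), r.getDouble(1)))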

  • Validation

    11

    Spark ML logistic regression with L-BFGS vs. my implementation of Newton's method

    Randomly split the data set into 70% train + 30% test. The learned θ vectors are almost identical; the last component is the bias.

  • Grid search to find the optimal hyper-parameters with the best generalization error

    Estimate generalization error

    k-Fold cross validation

    Hyper-parameter optimization

    12

    A hyper-parameter is a parameter used in the training process that is not part of the classifier itself. It controls what kind of parameters can, or tend to, be selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.

  • Grid search

    Grid - [polynomial expansion degree] x [l2 regularization]

    Polynomial expansion is a memory killer

    Degree 3 on 7 features results in 119 features

    Be careful when exploiting parallelism

    13

    To increase temporal locality, accesses to a data frame should be clustered in time.

    Polynomial expansion does not include a constant column.
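    For example, Spark ML's PolynomialExpansion (assuming a DataFrame `data` with a 7-dimensional "features" column):

      import org.apache.spark.ml.feature.PolynomialExpansion

      // Degree-3 expansion of 7 features yields C(7+3, 3) - 1 = 119 output
      // features, since the constant column is not included.
      val poly = new PolynomialExpansion()
        .setInputCol("features")
        .setOutputCol("polyFeatures")
        .setDegree(3)

      val expanded = poly.transform(data)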

  • K-Fold

    14

    DF (Spark SQL data frame) → persist, randomSplit → map => [([train_i], test)], i.e., [([DF], DF)] → map => [(train, test)], i.e., [(union[DF], DF)]
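    A sketch of this flow (assumed helper name): persist the source DF, split it into k equally weighted folds, then pair each fold (test) with the union of the others (train):

      import org.apache.spark.sql.DataFrame

      // Sketch of the k-fold construction above
      def kFold(df: DataFrame, k: Int): Seq[(DataFrame, DataFrame)] = {
        df.persist() // each fold is read k times, so keep the source cached
        val folds = df.randomSplit(Array.fill(k)(1.0 / k))
        folds.indices.map { i =>
          val train = folds.indices.filter(_ != i).map(j => folds(j)).reduce(_ union _)
          (train, folds(i)) // (union of the other k-1 folds, fold i as test)
        }
      }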

  • k-Fold + polynomial expansion (PE)

    15

    [Diagram: k-fold cross validation combined with polynomial expansion]

  • Experiments

    16

    Spark 2.0.2 standalone mode

    3 cores + 5GB mem per executor; each worker node holds an exact copy of the read-in file

    http://spark.apache.org/docs/latest/cluster-overview.html

    In total, we have 3 physical machines with 12GB mem + 8 cores.

    Driver - executes the Scala program

    Worker - executes tasks

    Executor - each application runs one or more executor processes on worker nodes

    Job - triggered by an action

    Task - a unit of work executed on an executor. The number of tasks matches the number of partitions, which is at least the number of input blocks (128MB each). If set manually, use 2-4 partitions for each CPU in the cluster.

    Stage - a set of tasks

    Local file - must exist with the same path and content on each worker node.

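    For reference, a submission command matching these settings (standard spark-submit flags; the class name and jar path are placeholders):

      spark-submit \
        --master spark://b2.lxd:7077 \
        --class Main \
        --executor-memory 5G \
        --executor-cores 3 \
        target/scala-2.11/app.jar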

  • Performance test ML Settings

    Logistic regression on HIGGS

    Train-test split, 70% + 30%

    Only the 7 high-level features were used

    Test unit 1 - 100 iterations of full gradient descent + training error on the training set; initial learning rate 0.001, l2 regularization 0.1

    Test unit 2 - make predictions on the test set and compute the confusion matrix

    17

  • Performance and speedup curve

    18

    [Chart: training time (s) and training speedup for local mode and 1-5 executors. Speedup values: 1, 1.822, 2.372, 2.693, 3.641, 4.43.]

    Running time vs. number of executors (average of 2 runs). Except for local mode, all tests had enough memory.

    Local mode did not have enough memory, so the data could not be persisted in memory; its running time is therefore much higher.

    Adding executors reduces the running time roughly linearly.

  • Grid search 10% of the original data, i.e., 1.1 million instances, using the 7 high-level features only

    Grid

    Polynomial degrees - 1, 2, 3

    l2 regularization - 0, 0.001, 0.01, 0.1, 0.5

    3-Fold cross validation

    100 iterations of gradient descent with initial learning rate 0.01

    2 executors with 10GB mem + 5 cores each

    Result - 4400s training time, final test accuracy 62.4%

    19

    Confusion matrix: TP = 117,605, TN = 88,664, FP = 66,529, FN = 57,786, i.e., accuracy = (117,605 + 88,664) / 330,584 ≈ 62.4%.

  • Conclusion Persist data that is used more than once (incl. when the lineage branches; see the sketch after this slide)

    Change the default cluster settings, e.g., the default memory per executor is only 1GB

    Make use of the Spark UI to find bottlenecks

    Use Spark built-in functions if possible

    They are also good examples when implementing missing functions

    Don't use accumulators in a transformation unless approximate results are acceptable

    Always start with small data to debug faster

    Future work - obey the train-test split

    20
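    A sketch of the persist point (names are placeholders): when a DataFrame feeds more than one action, persist it so each branch does not recompute the lineage from the input file:

      // `features` feeds both the split and the count, so cache it once
      val features = assembler.transform(rawDF)
      features.persist()

      val Array(train, test) = features.randomSplit(Array(0.7, 0.3))
      println(features.count()) // reuses the cached partitions, as does the split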

  • Q&A Thank you!

    Useful links

    Master - spark://ip:7077, e.g., spark://b2.lxd:7077

    Cluster - http://ip:8080/

    Spark UI - http://ip:4040/

    https://spark.apache.org/docs/latest/programming-guide.html

    http://spark.apache.org/docs/latest/submitting-applications.html, package a jar - sbt package

    21


  • Backup slides

    22

  • Training time vs. # executors

    23

    [Chart: training time (s) and test accuracy for local mode and 1-5 executors.]

  • Spark UI

    24

    Jobs timeline

  • Spark UI

    25

    Executor summary

  • Numerical stability

    26
