
SCALING UP TO BIG DATA: ALGORITHMIC ENGINEERING + HPC

Statistical and Computational Analytics for Big Data

June 12, 2015

Andrew Rau-Chaplin

Faculty of Computer Science

Dalhousie University

www.cs.dal.ca/~arc


THE PROBLEM


STEP 1: SOLVE THE PROBLEM ON SMALL DATA

Typical Process

• start by using small data sets!

• focus on identifying those machine learning techniques that are best suited to the problem


STEP 2: SCALE-UP


SCALING UP

1) Asymptotic Analysis

2) Algorithmic Engineering

3) Parallelism & HPC


HOW TO SCALE TO “BIG DATA”

• Sorry, I don’t have a recipe!

• Talk about our experience scaling up analytical techniques

• Highlight approaches and technology choices which we have found helped in multiple settings

• Perhaps they should have been obvious?


EFFICIENT FRONTIER APPROACHES TO TREATY OPTIMIZATION

www.Risk-Analytics-Lab.ca

Joint work with

• Omar Carmona Cortes

• I. Cook and J. Gaiser-Porter

1) Asymptotic Analysis applied to search and optimization


CATASTROPHE MODELING

[Diagram: Cat Model with components Exposure, Event Catalog, Hazard, Vulnerability, and Loss; outputs: Event Loss Table (ELT), Simulated Event Losses]

A Program ≈ multiple layers, over ~15 ELTs, covering ~5 models and ~200K events

A Portfolio ≈ 3-4K Programs, each with multiple layers, with 40K ELTs, over 100 models, covering 1M events


A TREATY OPTIMIZATION PROBLEM

Optimization from a Primary Insurer’s or Broker’s Perspective!

Find a Pareto Frontier

[Figure: Pareto frontier plot with axes Risk and Expected Return; regions labeled Dominated and Infeasible]


MORE FORMALLY

Given: a fixed number of contractual layers and a simulated set of expected loss distributions (one per layer), plus a model of reinsurance market costs

The Task: identify optimal combinations of shares (also called placements) in order to build a Pareto frontier
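To make the notion of a Pareto frontier concrete, here is a minimal sketch that filters a set of candidate placements down to the non-dominated ones. It assumes each candidate has already been scored as a (risk, expected return) pair, that lower risk and higher return are both preferred, and the names `pareto_frontier` and `candidates` are illustrative only; the sample values are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (risk, expected_return); assumed scoring


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing risk."""
    # Sort by risk ascending; for equal risk, put the higher return first.
    ordered = sorted(set(points), key=lambda p: (p[0], -p[1]))
    frontier = []
    best_return = float("-inf")
    for risk, ret in ordered:
        # A point survives only if it beats every lower-risk point on return.
        if ret > best_return:
            frontier.append((risk, ret))
            best_return = ret
    return frontier


# Made-up candidate scores, for illustration only.
candidates = [(1.0, 2.0), (2.0, 3.0), (2.5, 2.5), (3.0, 5.0), (3.0, 4.0), (4.0, 4.5)]
frontier = pareto_frontier(candidates)
print(frontier)
```

Every point dropped by the filter is dominated: some other candidate offers at least as much return for no more risk.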


INPUTS/OUTPUTS

Treaty Optimizer

Discretization = 10%, 5%, or 1%


THE APPROACH


• Aggregate the loss data

• Location Event Trial year

• Discretized search parameters

• Calculate results for all combinations of shares

• Use a big parallel machine
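The exhaustive approach can be sketched as follows. This is only an illustration under stated assumptions: the per-layer return and risk numbers are invented, and `toy_objectives` is a hypothetical linear stand-in for the real simulation-based evaluation, which the slides do not specify.

```python
from itertools import product


def share_levels(step_percent: int) -> list:
    """Discretized shares from 0% to 100% inclusive, as fractions."""
    return [pct / 100 for pct in range(0, 101, step_percent)]


def toy_objectives(shares, layer_returns, layer_risks):
    """Hypothetical linear model: (risk, expected return) of a placement."""
    expected_return = sum(s * r for s, r in zip(shares, layer_returns))
    risk = sum(s * k for s, k in zip(shares, layer_risks))
    return risk, expected_return


def enumerate_all(layer_returns, layer_risks, step_percent):
    """Score every combination of shares across all layers."""
    levels = share_levels(step_percent)
    results = []
    for shares in product(levels, repeat=len(layer_returns)):
        results.append((shares, toy_objectives(shares, layer_returns, layer_risks)))
    return results


# Three layers at 10% discretization: 11 levels per layer.
results = enumerate_all([3.0, 2.0, 1.0], [2.0, 1.0, 0.5], step_percent=10)
print(len(results))
```

Even this toy case visits 11^3 = 1,331 combinations; each additional layer multiplies the count by the number of share levels.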


THE PROBLEM

• Works for a small number of layers!

• Results in a large number of computations

• Number of computations increases exponentially with the number of dimensions

Number of Layers

Number of share intervals

Asymptotic Analysis!

• ((# of trials) * (discretization)^(# of layers)) / (number of processors)

• Example: (1,000,000 * 100^15) / 1000 = 10^33 computations
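The arithmetic in the example can be checked directly. The sketch below reproduces the slide's formula; the rate of 10^9 evaluations per second per processor is purely an assumption, added to give a feel for the scale.

```python
def brute_force_work(trials: int, share_levels: int, layers: int, processors: int) -> int:
    """Computations per processor for exhaustive enumeration."""
    return trials * share_levels ** layers // processors


# Slide example: 1M trials, 1% discretization (100 levels), 15 layers, 1000 processors.
work = brute_force_work(1_000_000, 100, 15, 1000)

# Assumed rate (not from the slides): 1e9 evaluations per second per processor.
seconds_per_year = 365 * 24 * 3600
years_at_assumed_rate = work / 1e9 / seconds_per_year

print(work)
print(years_at_assumed_rate)
```

Adding processors only divides the total by a constant, while each extra layer multiplies it by the number of share levels.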


ROUND 1:

Need a better algorithm!

Use an evolutionary search approach

Population-Based Incremental Learning (Di-PBIL)

Single risk measure (i.e., 2D Pareto frontier)

Variance

Value At Risk (VaR)

Tail Value at Risk (TVaR)

Prototype in R (with multithreading)

Questions

Quality: How close to the exact method?

Performance: How fast? How big a problem can we now handle?
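For readers unfamiliar with Population-Based Incremental Learning, the following is a minimal sketch of the general idea over discrete share levels. It is a textbook-style PBIL with one categorical distribution per layer, not the Di-PBIL variant from the talk, whose details are not given in the slides. The fitness function, `TARGET`, and all parameter values are hypothetical.

```python
import random


def pbil_discrete(fitness, num_layers, num_levels, pop_size=50,
                  iterations=200, learning_rate=0.1, seed=0):
    """Maximize `fitness` over tuples of discrete levels, one level per layer."""
    rng = random.Random(seed)
    # One categorical distribution over share levels per layer, initially uniform.
    probs = [[1.0 / num_levels] * num_levels for _ in range(num_layers)]
    best, best_fit = None, float("-inf")
    for _ in range(iterations):
        # Sample a population from the current distributions.
        population = [
            tuple(rng.choices(range(num_levels), weights=probs[layer])[0]
                  for layer in range(num_layers))
            for _ in range(pop_size)
        ]
        leader = max(population, key=fitness)
        leader_fit = fitness(leader)
        if leader_fit > best_fit:
            best, best_fit = leader, leader_fit
        # Shift each layer's distribution toward the leader's chosen level.
        for layer, level in enumerate(leader):
            for k in range(num_levels):
                target = 1.0 if k == level else 0.0
                probs[layer][k] += learning_rate * (target - probs[layer][k])
    return best, best_fit


# Hypothetical objective: get close to a hidden target placement.
TARGET = (3, 10, 17, 5, 15, 8, 12)


def toy_fitness(levels):
    return -sum((l - t) ** 2 for l, t in zip(levels, TARGET))


# 7 layers at 5% discretization gives 21 share levels per layer.
best, best_fit = pbil_discrete(toy_fitness, num_layers=7, num_levels=21)
print(best, best_fit)
```

The search evaluates 50 × 200 = 10,000 candidates here, compared with 21^7 (about 1.8 billion) for full enumeration of the same space.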


QUALITY: HOW CLOSE TO THE EXACT METHOD?

Percentage of the time Di-PBIL finds the same solution as the exact method


QUALITY: HOW CLOSE TO THE EXACT METHOD?

Average error when Di-PBIL does not find the same solution as the exact method

Error always less than 6/100ths of a percent.


PERFORMANCE: HOW FAST WHEN COMPARED TO THE EXACT METHOD?

Time on a single core to compute a single point on the efficient frontier for 7 layers and 5% discretization

Enumeration: weeks

Di-PBIL: 2-15 minutes


PERFORMANCE: HOW BIG A PROBLEM CAN WE NOW SOLVE?

Time on a single core to compute a single point on the efficient frontier at 5% discretization

Solution times are no longer exponential in the number of layers!


ROUND 2:


Single risk metric → Multiple risk metrics (e.g., 1-in-100-year TVaR + 1-in-5-year VaR)

2-D Pareto front → 3-D+ Pareto front

Di-PBIL → Mo-PBIL

Prototype in R → Prototype in C++

Advantages

Search for whole front, not point by point

Multiple Risk Metrics

Performance!
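As an illustration of risk metrics such as a 1-in-100-year TVaR and a 1-in-5-year VaR, here is a minimal sketch computing both from simulated annual losses. It uses one simple empirical convention (the worst N / return-period trial years form the tail); other conventions exist, and the loss data below is synthetic.

```python
def value_at_risk(losses, return_period):
    """Smallest loss within the worst 1-in-`return_period` trial years."""
    ordered = sorted(losses, reverse=True)
    tail_count = max(1, len(ordered) // return_period)
    return ordered[tail_count - 1]


def tail_value_at_risk(losses, return_period):
    """Average loss over the worst 1-in-`return_period` trial years."""
    ordered = sorted(losses, reverse=True)
    tail_count = max(1, len(ordered) // return_period)
    tail = ordered[:tail_count]
    return sum(tail) / len(tail)


# Synthetic data: 1000 trial years with losses 1, 2, ..., 1000.
annual_losses = list(range(1, 1001))

tvar_100 = tail_value_at_risk(annual_losses, 100)  # mean of the worst 10 years
var_5 = value_at_risk(annual_losses, 5)            # threshold of the worst 200 years
print(tvar_100, var_5)
```

Each metric becomes one additional axis of the Pareto front, which is why two metrics plus expected return give a 3-D front.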


ROUND 2: OPTIMIZED MO-PBIL

Mo-PBIL: complete frontier (60-70 points) for a 7-layer program at 5% discretization in 16 seconds!

Setup: 500 iterations, population of 128

2 × Xeon E5-2660 processors


SUMMARY

Evolutionary techniques work well for Treaty Optimization!

Can now solve practical problem instances with practical performance.

Compared multiple evolutionary search methods

Single Objective: DE, PSO, GA, PBIL

Multi Objective: VEPSO, MODE, SPEA2, NSGA2

Evaluation Results

All work and can produce high quality solutions

Differences

Ease of use

Performance

Don’t compute exactly what you can compute approximately!

Parallelism is great, but it only buys you a constant factor!


OPTIMIZE LAYER STRUCTURE, NOT JUST SHARES!

Treaty Optimizer 2

[Diagram: the optimizer searches over layer attachment (Att) and exhaustion (Exh) points; an Aggregate Simulation Engine performs population evaluation; results are plotted as Risk versus Expected Return]

Inputs

• Discretization = d%

• Risk Measure

• Premium function

• # reinstatements

• Aggregate terms (i.e., 3rd event cover)

• Set of ELTs

• 100K Year Event Table (YET)


ACCELERATING NGRAM BASED TEXT ANALYSIS

2) Algorithm engineering techniques applied to text analytics problems


DOCUMENT RELATEDNESS

Important task in many text mining applications

Represented by a score between 0 and 1

Unsupervised corpus-based methods: Google Trigram Method, Semantic Text Similarity, etc.


GOOGLE TRIGRAM METHOD (GTM)

Unigram:
apple 6878789
eating 14987879

Trigram:
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50

Document Relatedness: abstracted as a function of word relatedness

Word Relatedness: find the frequency of w1 and w2 in Unigram; find the co-occurrence of w1 and w2 in Trigram

GTM Distance Function [formula not recovered in the transcript]


GTM EXAMPLE *

D1: An autograph is the signature of someone famous which is specially written for a fan to keep.

D2: Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says.

* Proposed by Islam, Milios, and Keselj, “Text Relatedness using Google Tri-grams”


CHALLENGES IN SCALING UP GTM

Measuring the relatedness between a pair of documents is too slow in the existing work

The size of Unigram is roughly 200 MB; the size of Trigram is 20 GB.

High complexity of N-to-N pairwise document relatedness computation.

Volume of documents is growing rapidly


WORD RELATEDNESS PRECOMPUTATION

Tokenize! - Assign each word a numeric ID

Precompute! - Compute all the word relatedness scores in advance for lookups

Build in-memory data structures! - Dictionary structure to store word relatedness in memory

Hashing vs Arrays (207,761,290 pairs of words)

[Figure labels: 9 GB, 3 GB]
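The tokenize / precompute / in-memory steps can be sketched as below. This is an assumption-laden illustration rather than the talk's implementation: word pairs are packed into a single integer key, stored in sorted typed arrays, and looked up by binary search as one compact alternative to a hash table. The class name `RelatednessTable` and the scores are invented.

```python
from array import array
from bisect import bisect_left


def build_vocabulary(words):
    """Tokenize: assign each distinct word a numeric ID."""
    return {w: i for i, w in enumerate(sorted(set(words)))}


def pair_key(a: int, b: int, vocab_size: int) -> int:
    """Pack an unordered pair of word IDs into one integer."""
    lo, hi = (a, b) if a <= b else (b, a)
    return lo * vocab_size + hi


class RelatednessTable:
    """Precomputed word-pair scores held in two parallel sorted arrays."""

    def __init__(self, vocab, scored_pairs):
        self.vocab = vocab
        n = len(vocab)
        items = sorted((pair_key(vocab[w1], vocab[w2], n), s)
                       for w1, w2, s in scored_pairs)
        self.keys = array("q", (k for k, _ in items))    # 64-bit integer keys
        self.scores = array("f", (s for _, s in items))  # 32-bit float scores

    def lookup(self, w1: str, w2: str) -> float:
        """Binary-search the key array; unknown pairs score 0.0."""
        if w1 not in self.vocab or w2 not in self.vocab:
            return 0.0
        key = pair_key(self.vocab[w1], self.vocab[w2], len(self.vocab))
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.scores[i]
        return 0.0


# Invented scores, for illustration only.
vocab = build_vocabulary(["autograph", "signature", "name", "document"])
table = RelatednessTable(vocab, [("autograph", "signature", 0.5),
                                 ("name", "signature", 0.25)])
print(table.lookup("signature", "autograph"))
```

Typed arrays store 12 bytes per pair in this layout, with none of the per-entry object overhead of a general-purpose hash table.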


SHARED MEMORY MULTITHREADING

Multithreaded implementation: makes use of a multi-core shared-memory machine.

Amortize I/O costs: Each thread running on a separate core fetches documents from the shared memory and computes the relatedness between them.

Lots of language- and library-based approaches: OpenMP, …
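The structure of such a shared-memory computation can be sketched as follows: documents are loaded once, and worker threads score pairs against the shared list. `toy_relatedness` is a Jaccard word-overlap stand-in, not GTM. In CPython the global interpreter lock limits the speed-up for pure-Python CPU-bound work, so this shows the shape of the approach rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_relatedness(doc_a: str, doc_b: str) -> float:
    """Stand-in score (Jaccard word overlap); not the GTM measure."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def all_pairs_relatedness(docs, workers=4):
    """Score every document pair, with threads sharing the in-memory docs."""
    pairs = list(combinations(range(len(docs)), 2))

    def score(pair):
        i, j = pair
        return pair, toy_relatedness(docs[i], docs[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score, pairs))


docs = ["a b c", "b c d", "x y"]
scores = all_pairs_relatedness(docs)
print(scores)
```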


MULTITHREADED IMPLEMENTATION PERFORMANCE

The speed-up analysis

Experiments use 2000 documents from ACM Paper Abstracts collections


HORIZONTAL SCALING: HADOOP

• Scaling for free?

• Data parallelism

• Solves problem partitioning

• Solves task mapping

• Solves fault tolerance

• Challenges

• Shared data structures?

• How to amortize I/O costs?


HADOOP

• Shared data structures?

• Use multi-threaded mappers!

• Example: Each multithreaded mapper constructs the word relatedness dictionary and takes input blocks for document relatedness computation.

• How to amortize I/O costs?

• Map over task definitions, not the raw data

• Example: Map over blocks of similarity computations
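Mapping over task definitions rather than raw data might look like the sketch below: the N × N pair matrix is cut into blocks, and each small tuple describing a block becomes one mapper input. The block layout and the function names are assumptions; the slides do not give the actual task format.

```python
def block_tasks(num_docs: int, block_size: int):
    """Describe the upper triangle of the pair matrix as block tasks.

    Each task is (row_start, row_end, col_start, col_end), end exclusive.
    """
    starts = range(0, num_docs, block_size)
    tasks = []
    for r in starts:
        for c in starts:
            if c >= r:  # skip blocks strictly below the diagonal
                tasks.append((r, min(r + block_size, num_docs),
                              c, min(c + block_size, num_docs)))
    return tasks


def pairs_in_task(task):
    """Document pairs (i, j) with i < j that one mapper would compute."""
    r0, r1, c0, c1 = task
    return [(i, j) for i in range(r0, r1) for j in range(c0, c1) if i < j]


tasks = block_tasks(num_docs=10, block_size=4)
all_pairs = [p for t in tasks for p in pairs_in_task(t)]
print(len(tasks), len(all_pairs))
```

Each task is a few integers, so the framework shuffles task descriptions while every mapper reads documents from its own in-memory copy.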


HADOOP IMPLEMENTATION PERFORMANCE

[Figure: Hadoop size-up performance on 20 instances; x-axis: Number of Files (2000 to 10000); y-axis: Time in Minutes (0 to 40)]

• Hardware: Amazon EC2 m3.xlarge nodes, each with 4 cores and 15 GB of memory.

• Hadoop: AWS EMR

• Dataset: 10,000 documents from the ACM paper abstracts collection


HADOOP IMPLEMENTATION PERFORMANCE

[Figure: Hadoop scale-up performance; x-axis labeled "Number of Files" with ticks 1, 2, 3, 4, 5, 6, 8, 10, 20, 25; y-axis: Time in Minutes (0 to 35)]

The scale-up performance shows the running time of the implementation while keeping the ratio between input size and nodes fixed.

2000 texts per node, using between 2 and 10 nodes.


SUMMARY

• Speed-up of ~10,000,000x from initial research prototype

• GPU implementation: ~10x cost/performance

Algorithm Engineering works!

• Compress your data

• Precompute!

• Exploit in-memory data structures

• Multi-thread

• Scale horizontally


SUMMARY

Asymptotic Analysis

• O( (# of Docs)^2 * (# of words per doc)^2 )

[Figure: Runtime as a function of # of documents; x-axis: Documents (0 to 200,000); y-axis: Hours (0 to 4000)]

You can’t outrun the asymptotic complexity!
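A quick calculation shows what the quadratic term in the number of documents means in practice. The sketch counts document pairs only; the words-per-document value is an arbitrary assumption used to illustrate the second quadratic factor.

```python
def pairwise_comparisons(num_docs: int) -> int:
    """Number of unordered document pairs."""
    return num_docs * (num_docs - 1) // 2


def word_level_work(num_docs: int, words_per_doc: int) -> int:
    """Pairs times word-by-word comparisons within each pair."""
    return pairwise_comparisons(num_docs) * words_per_doc ** 2


pairs_100k = pairwise_comparisons(100_000)
pairs_200k = pairwise_comparisons(200_000)
growth = pairs_200k / pairs_100k

# Assumed document length of 150 words, for illustration only.
work_200k = word_level_work(200_000, 150)
print(pairs_100k, pairs_200k, growth, work_200k)
```

Doubling the collection roughly quadruples the work, so a constant-factor speed-up is consumed by a modest increase in input size.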


VOLAP: A FULLY SCALABLE CLOUD-BASED SYSTEM FOR REAL-TIME OLAP ON HIGH VELOCITY DATA

3) Applying cloud technologies to accelerate real-time performance

Joint work with

• F. Dehne, D. Robillard

• Q. Kong, N. Burkee


REAL-TIME OLAP ON THE CLOUD


Business Analytics based on OLAP

High dimensional data with hierarchies

Real-time inserts and queries

Solution

Multiple server processors

Dynamic load balancing

Strong session serialization


PERFORMANCE


CLOUD BASED SYSTEM ARCHITECTURE


THE DISTRIBUTED PDCR-TREE

The building block for the CR-OLAP system

DC-Tree: A fully dynamic index structure which explicitly represents dimension hierarchies

“The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses”, [Kriegel2000]

PDC-Tree: A multi-threaded DC-Tree

“Parallel Real-Time OLAP on Multi-core Processors”, [Dehne2012]

Distributed PDCR-Tree: A new distributed in-memory index structure for the cloud

Add support for distributed memory

Array-based implementation to support efficient migration

Add support for both ordered and unordered dimensions
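One way to see why an array-based layout helps migration: if tree nodes live in flat arrays and refer to children by index rather than by pointer, a subtree can be copied into a compact, position-independent block and shipped to another server. The sketch below illustrates only that general idea; `FlatTree` is hypothetical and is not the PDCR-tree, whose node layout the slides do not describe.

```python
class FlatTree:
    """Tree stored in parallel lists; children are referenced by index."""

    def __init__(self):
        self.values = []
        self.children = []  # children[i] lists the child indices of node i

    def add_node(self, value, parent=None) -> int:
        index = len(self.values)
        self.values.append(value)
        self.children.append([])
        if parent is not None:
            self.children[parent].append(index)
        return index

    def extract_subtree(self, root: int) -> "FlatTree":
        """Copy the subtree under `root` into a new compact FlatTree."""
        out = FlatTree()
        stack = [(root, None)]
        while stack:
            old, new_parent = stack.pop()
            new = out.add_node(self.values[old], new_parent)
            # Push in reverse so siblings are re-added in original order.
            for child in reversed(self.children[old]):
                stack.append((child, new))
        return out


# A tiny made-up dimension hierarchy.
tree = FlatTree()
root = tree.add_node("All")
canada = tree.add_node("Canada", root)
usa = tree.add_node("USA", root)
tree.add_node("Halifax", canada)
tree.add_node("Toronto", canada)

migrated = tree.extract_subtree(canada)
print(migrated.values, migrated.children)
```

Because the extracted block uses indices relative to itself, it could be serialized as plain arrays without any pointer fix-up on the receiving side.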


DATA INGESTION PERFORMANCE


LOAD BALANCING PERFORMANCE


SCALE-UP QUERY PERFORMANCE


SUMMARY

Cloud

A model for high velocity data

There are great tools

New Issues

Design for elasticity

Think about fault-tolerance from the start

Old Issues

Nothing is for free! Still need to worry about:

Compression

Pre-computation

Efficient data structures

Multithreading


Big Data

HPC: Clusters, Clouds, & GPU

Algorithm Engineering

SCALING UP TO BIG DATA

Andrew Rau-Chaplin

Risk Analytics Lab,

Faculty of Computer Science

Dalhousie University,

Halifax, Canada

www.Risk-Analytics-Lab.ca