TRANSCRIPT
NAVIGATING BIG DATA
with High-Throughput, Energy-Efficient Data Partitioning
Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross
Columbia University
Presenter: Mengjie Mao
JOINs = Cross Reference

Two tables, SALES and WEATHER, are cross-referenced on their common DATE key:
JOIN_DATE(SALES, WEATHER)
JOINs = Cross Reference

[Figure: fraction of runtime spent in joins for each of the 22 TPC-H queries, with the average marked.]

JOINs = 47% of TPC-H Execution Time
Naïve JOINs of BIG DATA Thrash the Cache

A naive join scans the large SALES table sequentially while doing a random lookup into WEATHER for every record.

Small table doesn’t fit into cache ➛ lookups thrash
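To make the slide's access pattern concrete, here is a minimal C sketch of a naive hash join (my illustration; the table sizes, hash function, and single-entry buckets are all hypothetical simplifications, not the authors' code). The scan of SALES is prefetch-friendly, but each probe lands at an effectively random bucket:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { N_WEATHER = 1 << 20,   /* "small" table: 1M records (hypothetical) */
       N_SALES   = 1 << 24,   /* big table: 16M records                   */
       N_BUCKETS = 1 << 21 }; /* power of two for cheap masking           */

typedef struct { uint32_t date; int32_t temp;   } Weather;
typedef struct { uint32_t date; int32_t amount; } Sale;

/* One entry per bucket for brevity; a real hash join chains collisions. */
static uint32_t bucket_date[N_BUCKETS];
static int32_t  bucket_temp[N_BUCKETS];

static uint32_t hash_date(uint32_t d) { return (d * 2654435761u) & (N_BUCKETS - 1); }

int main(void) {
    Weather *weather = malloc(N_WEATHER * sizeof *weather);
    Sale    *sales   = malloc(N_SALES   * sizeof *sales);
    for (size_t i = 0; i < N_WEATHER; i++)
        weather[i] = (Weather){ (uint32_t)i, (int32_t)(i % 40) };
    for (size_t i = 0; i < N_SALES; i++)
        sales[i] = (Sale){ (uint32_t)((i * 7) % N_WEATHER), 1 };

    /* Build: hash the small WEATHER table. */
    for (size_t i = 0; i < N_WEATHER; i++) {
        uint32_t b = hash_date(weather[i].date);
        bucket_date[b] = weather[i].date;
        bucket_temp[b] = weather[i].temp;
    }

    /* Probe: the scan of SALES is sequential, but each iteration does one
       random lookup into the hash table. Once that table outgrows the
       cache, nearly every probe is a miss. */
    long long matches = 0;
    for (size_t i = 0; i < N_SALES; i++) {
        uint32_t b = hash_date(sales[i].date);
        if (bucket_date[b] == sales[i].date)
            matches++;  /* stand-in for emitting (date, amount, temp) */
    }
    printf("%lld matches\n", matches);
    free(weather); free(sales);
    return 0;
}
```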
Partitioned JOIN

Partition both tables on the join key, SALES into SALES_0 … SALES_3 and WEATHER into WEATHER_0 … WEATHER_3, so that matching keys land in matching partitions. Then join each pair independently (a C sketch follows):

JOIN_DATE(SALES_0, WEATHER_0)
JOIN_DATE(SALES_1, WEATHER_1)
JOIN_DATE(SALES_2, WEATHER_2)
JOIN_DATE(SALES_3, WEATHER_3)
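The same join after partitioning, as a hedged C sketch (the partition count, the dense date range, and the counting-sort scatter are my simplifying assumptions): each WEATHER partition's lookup table now holds only a slice of the keys, so it stays cache-resident during its join.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N_W = 1 << 20,           /* WEATHER records                     */
       N_S = 1 << 24,           /* SALES records                       */
       P   = 64 };              /* number of partitions (illustrative) */

typedef struct { uint32_t date; int32_t val; } Rec;

/* Range-partition on date, assuming dates are dense in [0, N_W). */
static int part_of(uint32_t date) { return (int)(date / (N_W / P)); }

/* Scatter records into contiguous partitions (a counting sort on the
   partition number); off[p]..off[p+1] delimits partition p. */
static Rec *partition(const Rec *in, size_t n, size_t *off) {
    size_t cur[P];
    memset(off, 0, (P + 1) * sizeof *off);
    for (size_t i = 0; i < n; i++) off[part_of(in[i].date) + 1]++;
    for (int p = 0; p < P; p++) off[p + 1] += off[p];
    memcpy(cur, off, P * sizeof *cur);
    Rec *out = malloc(n * sizeof *out);
    for (size_t i = 0; i < n; i++) out[cur[part_of(in[i].date)]++] = in[i];
    return out;
}

int main(void) {
    Rec *w = malloc(N_W * sizeof *w), *s = malloc(N_S * sizeof *s);
    for (size_t i = 0; i < N_W; i++) w[i] = (Rec){ (uint32_t)i, (int32_t)(i % 40) };
    for (size_t i = 0; i < N_S; i++) s[i] = (Rec){ (uint32_t)((i * 7) % N_W), 1 };

    size_t woff[P + 1], soff[P + 1];
    Rec *wp = partition(w, N_W, woff);
    Rec *sp = partition(s, N_S, soff);

    /* Join each (SALES_p, WEATHER_p) pair. The per-partition lookup
       table holds only N_W/P entries, so it stays in cache and the
       "random" probes now hit. */
    int32_t *lut = malloc((N_W / P) * sizeof *lut);
    long long matches = 0;
    for (int p = 0; p < P; p++) {
        uint32_t base = (uint32_t)p * (N_W / P);
        memset(lut, 0xff, (N_W / P) * sizeof *lut);        /* -1 = empty */
        for (size_t i = woff[p]; i < woff[p + 1]; i++)
            lut[wp[i].date - base] = wp[i].val;
        for (size_t i = soff[p]; i < soff[p + 1]; i++)
            if (lut[sp[i].date - base] != -1) matches++;
    }
    printf("%lld matches\n", matches);
    free(w); free(s); free(wp); free(sp); free(lut);
    return 0;
}
```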
JOIN Runtime

[Figure: runtime comparison of NaiveJoin vs. PartitionedJoin. The partitioned join's time divides into Partition SALES, Partition WEATHER, and the per-partition Joins (0, 1, 2, 3), and is shorter overall.]

Partition ≈ 50% of the runtime of state-of-the-art Joins, Aggregates, and Sorts
Data Partitioning is...

• broadly applicable in the database domain
  • partitioned operations reduce I/O cost, increase parallel processing, and reduce data-shipping costs
• widely used by commercial databases
  • Oracle 11g, IBM DB2, Microsoft SQL Server
• applicable outside the database domain
  • divide-and-conquer, map-reduce
SW Partitioning Performance

[Figure: partitioning throughput (GB/s) vs. number of partitions (0-600), for 1 thread and 16 threads, against a potential system memory throughput of 25.6 GB/s. Software partitioning tops out around 3 GB/s, far below what the memory system can supply.]
Research Overview

[Diagram: a multicore system in which each core is paired with a HARP unit and stream buffers (SBin, SBout) alongside its L1, sharing the L2s, memory controller, and memory. The HARP units and stream buffers are the new modules.]

Hardware accelerated range partitioner (HARP): 7.8X more performance @ 7.5X less energy

Streaming framework: can keep up with the throughput of HARP
Remainder of the Talk

• Brief System Overview
• HARP Microarchitecture
• HARP and Streaming Framework Evaluation
• Other work accelerating big data processing
HARP System Architecture

[Diagram: the HARP unit and its stream buffers SBin and SBout sit next to each core and its L1, connected to the shared L2s, memory controller, and memory.]

HARP communicates to/from memory through stream buffers: SBin, SBout

HW partitioning with SW configuration
Range Partition

Example: a table with columns X, Y, Z is partitioned on its key column. Incoming keys: 15, 8, 27, 20, 52, 29, 16, 31.

Splitters 10, 20, 30 divide the key space into four partitions (key <= splitter exits left; key > splitter continues right):

• key <= 10: 8
• 10 < key <= 20: 15, 20, 16
• 20 < key <= 30: 27, 29
• key > 30: 52, 31

Records flow from SBin, through the splitter comparisons, to SBout.
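The splitter rule on this slide as a minimal C sketch (partitions are 0-indexed here; the sequential loop illustrates the comparison semantics, not HARP's parallel hardware):

```c
#include <stdio.h>

/* Sorted splitters {10, 20, 30} carve the key space into four
   partitions: key <= 10, 10 < key <= 20, 20 < key <= 30, key > 30. */
static const int splitters[] = { 10, 20, 30 };
enum { N_SPLIT = 3 };

/* A key belongs to the first partition whose splitter it does not
   exceed (the same <= test HARP's comparators apply), else the last. */
static int find_partition(int key) {
    for (int i = 0; i < N_SPLIT; i++)
        if (key <= splitters[i]) return i;
    return N_SPLIT;
}

int main(void) {
    const int keys[] = { 15, 8, 27, 20, 52, 29, 16, 31 };  /* slide's example */
    for (int i = 0; i < (int)(sizeof keys / sizeof *keys); i++)
        printf("key %2d -> partition %d\n", keys[i], find_partition(keys[i]));
    return 0;
}
```

Running it groups the keys exactly as on the slide: {8}, {15, 20, 16}, {27, 29}, {52, 31}.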
HARP Microarchitecture

HARP ISA: set_splitter, partition_start, partition_stop

[Diagram: records flow from SBin through (1) a Serializer, (2) a conveyor of <= comparators, one per splitter, and (3) a Merge stage with per-partition write buffers, each with a write-enable (WE) signal, out to SBout.]

Step 1: HARP Configuration — set_splitter loads the splitter values (10, 20, 30) into the comparators.
Step 2: Signal HARP to Start Processing — partition_start.
Step 3: Serialize SBin Cachelines into Records.
Step 4: Comparator Conveyor — each record rides past the comparators until its first <= match; the slide traces key 15 down the conveyor into its partition (labeled part2).
Step 5: Merge Output Records to SBout.
Step 6: Drain In-Flight Records and Signal HARP to Stop Processing — partition_stop.
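set_splitter, partition_start, and partition_stop are HARP's actual instruction names from the slide; everything else below (the C wrapper functions, the buffer sizes, and printing in place of stream-buffer writes) is a hypothetical functional model of the six steps, not the accelerator's real interface:

```c
#include <stdio.h>

enum { N_SPLIT = 3, N_PART = N_SPLIT + 1,
       LINE_RECS = 4 };              /* records per cacheline-sized line */

static int splitters[N_SPLIT];
static int buf[N_PART][LINE_RECS];   /* per-partition write buffers      */
static int fill[N_PART];

/* Step 1: configuration. Models the set_splitter instruction. */
static void set_splitter(int idx, int key) { splitters[idx] = key; }

/* Step 5: merge one buffered line out to SBout (printed here). */
static void flush(int p) {
    if (fill[p] == 0) return;
    printf("SBout <- partition %d:", p);
    for (int i = 0; i < fill[p]; i++) printf(" %d", buf[p][i]);
    printf("\n");
    fill[p] = 0;
}

/* Steps 3-4: take one serialized record and walk it down the
   comparator conveyor; it drops into the first partition whose
   splitter it does not exceed, or the last partition otherwise. */
static void push_record(int key) {
    int p = N_SPLIT;
    for (int i = 0; i < N_SPLIT; i++)
        if (key <= splitters[i]) { p = i; break; }
    buf[p][fill[p]++] = key;
    if (fill[p] == LINE_RECS) flush(p);  /* full line -> merge stage */
}

/* Step 6: drain in-flight records. Models partition_stop. */
static void partition_stop(void) {
    for (int p = 0; p < N_PART; p++) flush(p);
}

int main(void) {
    set_splitter(0, 10); set_splitter(1, 20); set_splitter(2, 30);
    /* Step 2: partition_start would signal the hardware here. */
    const int from_sbin[] = { 15, 8, 27, 20, 52, 29, 16, 31 };
    for (int i = 0; i < (int)(sizeof from_sbin / sizeof *from_sbin); i++)
        push_record(from_sbin[i]);
    partition_stop();
    return 0;
}
```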
Streaming Framework Architecture

[Diagram: the same system view, highlighting the stream buffers that move data between memory and HARP.]

Software-controlled data streaming in/out.

Inspired by Jouppi's work: "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," ISCA 1990.
Evaluation Methodology

• HARP
  • Bluespec System Verilog implementation
  • Cycle-accurate simulation in BlueSim
  • Synthesis, P&R with Synopsys (32nm std cells)
• Streaming framework
  • 3 versions of 1GB table memcpy: c-lib, ASM (scalar), ASM (vector) — see the sketch after this list
  • Conservative area/power estimates with CACTI
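To make the memcpy baseline concrete, here is a rough sketch of that kind of bandwidth measurement (the timer choice, page-touching, and 2x traffic accounting are my assumptions; the slide's ASM scalar/vector variants would replace the memcpy call):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (1ULL << 30)   /* 1 GB table, as on the slide */

int main(void) {
    char *src = malloc(BYTES), *dst = malloc(BYTES);
    if (!src || !dst) return 1;
    memset(src, 1, BYTES);    /* touch pages so faults aren't timed */
    memset(dst, 0, BYTES);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, BYTES);  /* the c-lib variant from the slide */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* A copy reads and writes every byte, so count 2x the size. */
    printf("%.2f GB/s\n", 2.0 * (double)BYTES / sec / 1e9);
    free(src); free(dst);
    return 0;
}
```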
Area and Power Overheads

[Figure: two charts vs. number of partitions (15, 31, 63, 127, 255, 511), splitting the overhead between HARP and the stream buffers: area as % of a Xeon core (axis 0-15%) and power as % of a Xeon core (axis 0-10%).]
Performance Evaluation

[Figure: partitioning throughput (GB/s, axis 0-8) vs. number of partitions (0-600) for 1 thread, 16 threads, and 1 thread + HARP, with annotated speedups of 7.8x and 8.8x for the HARP configuration over software partitioning.]
HARP Energy vs. SW

[Figure: partitioning energy (J/GB, axis 0-20) vs. number of partitions (0-600) for 1 thread, 16 threads, and 1 thread + HARP, with annotated energy reductions of 6.3x and 7.3x for the HARP configuration.]
Big Data on Little Clients: LINQits

• A flexible hardware template that can be mapped onto programmable logic or ASICs
• Accelerates a domain-specific query language called LINQ
• Offers a graceful and transparent migration path from software to hardware
• Improves energy efficiency by 8.9 to 30.6 times and performance by 10.7 to 38.1 times compared to optimized, multithreaded C programs running on conventional ARM A9 processors
References

• Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning, L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013.
• LINQits: Big Data on Little Clients, Eric S. Chung, John D. Davis, and Jaewon Lee. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013.
FIN