TRANSCRIPT
NAVIGATING BIG DATA
with High-Throughput, Energy-Efficient Data Partitioning
Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross
Columbia University
Presenter: Mengjie Mao
JOINs = Cross Reference

Two tables, SALES and WEATHER, are cross-referenced on their common DATE key:
JOIN_DATE(SALES, WEATHER)
JOINs = Cross Reference

[Figure: fraction of runtime spent in joins for each of the 22 TPC-H queries, with the average marked.]

JOINs = 47% of TPC-H Execution Time
Naïve JOINs of BIG DATA Thrash the Cache

A naive join scans the large SALES table sequentially while doing a random lookup into WEATHER for every record.

Small table doesn’t fit into cache ➛ lookups thrash
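To make the slide's access pattern concrete, here is a minimal C sketch of a naive hash join (my illustration; the table sizes, hash function, and single-entry buckets are all hypothetical simplifications, not the authors' code). The scan of SALES is prefetch-friendly, but each probe lands at an effectively random bucket:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { N_WEATHER = 1 << 20,   /* "small" table: 1M records (hypothetical) */
       N_SALES   = 1 << 24,   /* big table: 16M records                   */
       N_BUCKETS = 1 << 21 }; /* power of two for cheap masking           */

typedef struct { uint32_t date; int32_t temp;   } Weather;
typedef struct { uint32_t date; int32_t amount; } Sale;

/* One entry per bucket for brevity; a real hash join chains collisions. */
static uint32_t bucket_date[N_BUCKETS];
static int32_t  bucket_temp[N_BUCKETS];

static uint32_t hash_date(uint32_t d) { return (d * 2654435761u) & (N_BUCKETS - 1); }

int main(void) {
    Weather *weather = malloc(N_WEATHER * sizeof *weather);
    Sale    *sales   = malloc(N_SALES   * sizeof *sales);
    for (size_t i = 0; i < N_WEATHER; i++)
        weather[i] = (Weather){ (uint32_t)i, (int32_t)(i % 40) };
    for (size_t i = 0; i < N_SALES; i++)
        sales[i] = (Sale){ (uint32_t)((i * 7) % N_WEATHER), 1 };

    /* Build: hash the small WEATHER table. */
    for (size_t i = 0; i < N_WEATHER; i++) {
        uint32_t b = hash_date(weather[i].date);
        bucket_date[b] = weather[i].date;
        bucket_temp[b] = weather[i].temp;
    }

    /* Probe: the scan of SALES is sequential, but each iteration does one
       random lookup into the hash table. Once that table outgrows the
       cache, nearly every probe is a miss. */
    long long matches = 0;
    for (size_t i = 0; i < N_SALES; i++) {
        uint32_t b = hash_date(sales[i].date);
        if (bucket_date[b] == sales[i].date)
            matches++;  /* stand-in for emitting (date, amount, temp) */
    }
    printf("%lld matches\n", matches);
    free(weather); free(sales);
    return 0;
}
```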
Partitioned JOIN

Partition both tables on the join key, SALES into SALES_0 … SALES_3 and WEATHER into WEATHER_0 … WEATHER_3, so that matching keys land in matching partitions. Then join each pair independently (a C sketch follows):

JOIN_DATE(SALES_0, WEATHER_0)
JOIN_DATE(SALES_1, WEATHER_1)
JOIN_DATE(SALES_2, WEATHER_2)
JOIN_DATE(SALES_3, WEATHER_3)
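The same join after partitioning, as a hedged C sketch (the partition count, the dense date range, and the counting-sort scatter are my simplifying assumptions): each WEATHER partition's lookup table now holds only a slice of the keys, so it stays cache-resident during its join.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N_W = 1 << 20,           /* WEATHER records                     */
       N_S = 1 << 24,           /* SALES records                       */
       P   = 64 };              /* number of partitions (illustrative) */

typedef struct { uint32_t date; int32_t val; } Rec;

/* Range-partition on date, assuming dates are dense in [0, N_W). */
static int part_of(uint32_t date) { return (int)(date / (N_W / P)); }

/* Scatter records into contiguous partitions (a counting sort on the
   partition number); off[p]..off[p+1] delimits partition p. */
static Rec *partition(const Rec *in, size_t n, size_t *off) {
    size_t cur[P];
    memset(off, 0, (P + 1) * sizeof *off);
    for (size_t i = 0; i < n; i++) off[part_of(in[i].date) + 1]++;
    for (int p = 0; p < P; p++) off[p + 1] += off[p];
    memcpy(cur, off, P * sizeof *cur);
    Rec *out = malloc(n * sizeof *out);
    for (size_t i = 0; i < n; i++) out[cur[part_of(in[i].date)]++] = in[i];
    return out;
}

int main(void) {
    Rec *w = malloc(N_W * sizeof *w), *s = malloc(N_S * sizeof *s);
    for (size_t i = 0; i < N_W; i++) w[i] = (Rec){ (uint32_t)i, (int32_t)(i % 40) };
    for (size_t i = 0; i < N_S; i++) s[i] = (Rec){ (uint32_t)((i * 7) % N_W), 1 };

    size_t woff[P + 1], soff[P + 1];
    Rec *wp = partition(w, N_W, woff);
    Rec *sp = partition(s, N_S, soff);

    /* Join each (SALES_p, WEATHER_p) pair. The per-partition lookup
       table holds only N_W/P entries, so it stays in cache and the
       "random" probes now hit. */
    int32_t *lut = malloc((N_W / P) * sizeof *lut);
    long long matches = 0;
    for (int p = 0; p < P; p++) {
        uint32_t base = (uint32_t)p * (N_W / P);
        memset(lut, 0xff, (N_W / P) * sizeof *lut);        /* -1 = empty */
        for (size_t i = woff[p]; i < woff[p + 1]; i++)
            lut[wp[i].date - base] = wp[i].val;
        for (size_t i = soff[p]; i < soff[p + 1]; i++)
            if (lut[sp[i].date - base] != -1) matches++;
    }
    printf("%lld matches\n", matches);
    free(w); free(s); free(wp); free(sp); free(lut);
    return 0;
}
```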
JOIN Runtime

[Figure: runtime comparison of NaiveJoin vs. PartitionedJoin. The partitioned join's time divides into Partition SALES, Partition WEATHER, and the per-partition Joins (0, 1, 2, 3), and is shorter overall.]

Partition ≈ 50% of the runtime of state-of-the-art Joins, Aggregates, and Sorts
Data Partitioning is...

• broadly applicable in the database domain
  • partitioned operations reduce I/O cost, increase parallel processing, and reduce data-shipping costs
• widely used by commercial databases
  • Oracle 11g, IBM DB2, Microsoft SQL Server
• applicable outside the database domain
  • divide-and-conquer, map-reduce
SW Partitioning Performance

[Figure: partitioning throughput (GB/s) vs. number of partitions (0-600), for 1 thread and 16 threads, against a potential system memory throughput of 25.6 GB/s. Software partitioning tops out around 3 GB/s, far below what the memory system can supply.]
Research Overview

[Diagram: a multicore system in which each core is paired with a HARP unit and stream buffers (SBin, SBout) alongside its L1, sharing the L2s, memory controller, and memory. The HARP units and stream buffers are the new modules.]

Hardware accelerated range partitioner (HARP): 7.8X more performance @ 7.5X less energy

Streaming framework: can keep up with the throughput of HARP
Remainder of the Talk

• Brief System Overview
• HARP Microarchitecture
• HARP and Streaming Framework Evaluation
• Other work accelerating big data processing
HARP System Architecture

[Diagram: the HARP unit and its stream buffers SBin and SBout sit next to each core and its L1, connected to the shared L2s, memory controller, and memory.]

HARP communicates to/from memory through stream buffers: SBin, SBout

HW partitioning with SW configuration
Range Partition

Example: a table with columns X, Y, Z is partitioned on its key column. Incoming keys: 15, 8, 27, 20, 52, 29, 16, 31.

Splitters 10, 20, 30 divide the key space into four partitions (key <= splitter exits left; key > splitter continues right):

• key <= 10: 8
• 10 < key <= 20: 15, 20, 16
• 20 < key <= 30: 27, 29
• key > 30: 52, 31

Records flow from SBin, through the splitter comparisons, to SBout.
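The splitter rule on this slide as a minimal C sketch (partitions are 0-indexed here; the sequential loop illustrates the comparison semantics, not HARP's parallel hardware):

```c
#include <stdio.h>

/* Sorted splitters {10, 20, 30} carve the key space into four
   partitions: key <= 10, 10 < key <= 20, 20 < key <= 30, key > 30. */
static const int splitters[] = { 10, 20, 30 };
enum { N_SPLIT = 3 };

/* A key belongs to the first partition whose splitter it does not
   exceed (the same <= test HARP's comparators apply), else the last. */
static int find_partition(int key) {
    for (int i = 0; i < N_SPLIT; i++)
        if (key <= splitters[i]) return i;
    return N_SPLIT;
}

int main(void) {
    const int keys[] = { 15, 8, 27, 20, 52, 29, 16, 31 };  /* slide's example */
    for (int i = 0; i < (int)(sizeof keys / sizeof *keys); i++)
        printf("key %2d -> partition %d\n", keys[i], find_partition(keys[i]));
    return 0;
}
```

Running it groups the keys exactly as on the slide: {8}, {15, 20, 16}, {27, 29}, {52, 31}.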
HARP Microarchitecture

HARP ISA: set_splitter, partition_start, partition_stop

[Diagram: records flow from SBin through (1) a Serializer, (2) a conveyor of <= comparators, one per splitter, and (3) a Merge stage with per-partition write buffers, each with a write-enable (WE) signal, out to SBout.]

Step 1: HARP Configuration — set_splitter loads the splitter values (10, 20, 30) into the comparators.
Step 2: Signal HARP to Start Processing — partition_start.
Step 3: Serialize SBin Cachelines into Records.
Step 4: Comparator Conveyor — each record rides past the comparators until its first <= match; the slide traces key 15 down the conveyor into its partition (labeled part2).
Step 5: Merge Output Records to SBout.
Step 6: Drain In-Flight Records and Signal HARP to Stop Processing — partition_stop.
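set_splitter, partition_start, and partition_stop are HARP's actual instruction names from the slide; everything else below (the C wrapper functions, the buffer sizes, and printing in place of stream-buffer writes) is a hypothetical functional model of the six steps, not the accelerator's real interface:

```c
#include <stdio.h>

enum { N_SPLIT = 3, N_PART = N_SPLIT + 1,
       LINE_RECS = 4 };              /* records per cacheline-sized line */

static int splitters[N_SPLIT];
static int buf[N_PART][LINE_RECS];   /* per-partition write buffers      */
static int fill[N_PART];

/* Step 1: configuration. Models the set_splitter instruction. */
static void set_splitter(int idx, int key) { splitters[idx] = key; }

/* Step 5: merge one buffered line out to SBout (printed here). */
static void flush(int p) {
    if (fill[p] == 0) return;
    printf("SBout <- partition %d:", p);
    for (int i = 0; i < fill[p]; i++) printf(" %d", buf[p][i]);
    printf("\n");
    fill[p] = 0;
}

/* Steps 3-4: take one serialized record and walk it down the
   comparator conveyor; it drops into the first partition whose
   splitter it does not exceed, or the last partition otherwise. */
static void push_record(int key) {
    int p = N_SPLIT;
    for (int i = 0; i < N_SPLIT; i++)
        if (key <= splitters[i]) { p = i; break; }
    buf[p][fill[p]++] = key;
    if (fill[p] == LINE_RECS) flush(p);  /* full line -> merge stage */
}

/* Step 6: drain in-flight records. Models partition_stop. */
static void partition_stop(void) {
    for (int p = 0; p < N_PART; p++) flush(p);
}

int main(void) {
    set_splitter(0, 10); set_splitter(1, 20); set_splitter(2, 30);
    /* Step 2: partition_start would signal the hardware here. */
    const int from_sbin[] = { 15, 8, 27, 20, 52, 29, 16, 31 };
    for (int i = 0; i < (int)(sizeof from_sbin / sizeof *from_sbin); i++)
        push_record(from_sbin[i]);
    partition_stop();
    return 0;
}
```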
Streaming Framework Architecture

[Diagram: the same system view, highlighting the stream buffers that move data between memory and HARP.]

Software-controlled data streaming in/out.

Inspired by Jouppi's work: "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," ISCA 1990.
Evaluation Methodology

• HARP
  • Bluespec System Verilog implementation
  • Cycle-accurate simulation in BlueSim
  • Synthesis, P&R with Synopsys (32nm std cells)
• Streaming framework
  • 3 versions of 1GB table memcpy: c-lib, ASM (scalar), ASM (vector) — see the sketch after this list
  • Conservative area/power estimates with CACTI
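To make the memcpy baseline concrete, here is a rough sketch of that kind of bandwidth measurement (the timer choice, page-touching, and 2x traffic accounting are my assumptions; the slide's ASM scalar/vector variants would replace the memcpy call):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (1ULL << 30)   /* 1 GB table, as on the slide */

int main(void) {
    char *src = malloc(BYTES), *dst = malloc(BYTES);
    if (!src || !dst) return 1;
    memset(src, 1, BYTES);    /* touch pages so faults aren't timed */
    memset(dst, 0, BYTES);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, BYTES);  /* the c-lib variant from the slide */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* A copy reads and writes every byte, so count 2x the size. */
    printf("%.2f GB/s\n", 2.0 * (double)BYTES / sec / 1e9);
    free(src); free(dst);
    return 0;
}
```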
Area and Power Overheads

[Figure: two charts vs. number of partitions (15, 31, 63, 127, 255, 511), splitting the overhead between HARP and the stream buffers: area as % of a Xeon core (axis 0-15%) and power as % of a Xeon core (axis 0-10%).]
Performance Evaluation

[Figure: partitioning throughput (GB/s, axis 0-8) vs. number of partitions (0-600) for 1 thread, 16 threads, and 1 thread + HARP, with annotated speedups of 7.8x and 8.8x for the HARP configuration over software partitioning.]
HARP Energy vs. SW

[Figure: partitioning energy (J/GB, axis 0-20) vs. number of partitions (0-600) for 1 thread, 16 threads, and 1 thread + HARP, with annotated energy reductions of 6.3x and 7.3x for the HARP configuration.]
Big Data on Little Clients: LINQits

• A flexible hardware template that can be mapped onto programmable logic or ASICs
• Accelerates a domain-specific query language called LINQ
• Offers a graceful and transparent migration path from software to hardware
• Improves energy efficiency by 8.9 to 30.6 times and performance by 10.7 to 38.1 times compared to optimized, multithreaded C programs running on conventional ARM A9 processors
References

• Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning, L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013.
• LINQits: Big Data on Little Clients, Eric S. Chung, John D. Davis, and Jaewon Lee. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013.
FIN