a parallel streaming framework for multi-processing-unit ... · references "parallel data˜ow...
Post on 22-Mar-2019
223 Views
Preview:
TRANSCRIPT
References"Parallel Data�ow Scheme for Streaming (Un)Structured Data" by H. Vo, D. Osmari, B. Summa, J. Comba, V. Pascucci and C. Silva, SCI Institute, University of Utah, Technical Report #UUSCI-2009-004, 2009.
"Multi-Threaded Streaming Pipeline For VTK" by H. Vo and C. Silva, SCI Institute, University of Utah, Technical Report #UUSCI-2009-005, 2009.
www.sci.utah.edu
A Parallel Streaming Framework for Multi-Processing-Unit SystemsHuy Vo, Brian Summa, Valerio Pascucci, Claudio Silva
SCI Institute, University of Utah, USAPresenter: Huy Vo Daniel Osmari, Joao Comba
UFRGS, Brazil
ENew Event E
Data Request
ENew Event
E
E E
E
E
E
E E
E
Data Request
POLICY
SCOPE
DISTR
IBUTE
DCE
NTR
ALIZE
D
PUSH PULL
SCHEDULER RESOURCES
Trigger,Execute &Feedback
Distribute
RESO
URC
ES
INPUT
PERFORMCOMPUTATION
OUTPUT
Demand-Driven RequestsPulling Data
Event-Driven RequestsPushing Data
Module
Message
Data Dependency
Executive
Our Uni�ed Model
ENew Event
E
E
E
E E
E
Data Request
A
B
C
D
Piece #2
Piece #1
A
B
Merge
A
B
A
B
Streamable/Separable Data Structure
Concurrency Execution
Task Pipeline Data
Motivation Pull(), Push() and ReleaseInputs()Current data�ow system problems(with VTK, AVS, Data Explorer,...):• API “polluted” over the years (many designs in 80s-90s)• Do not map well to modern highly-parallel architecture: Multi- Core, GPUs, ...• Large datasets not well supported
Need a uni�ed data�ow architecture supporting:• Streaming• Heterogeneous architecture• Efficiency• Concurrency
A
B
Piece #1 Piece #2
Piece #1
A
B
A
B
RequestPiece #1
Piece #1
A
B
RequestPiece #2
Piece #2
Streaming Push Streaming Pull
• Use a priority queue with topological order of the modules as priority ranks• Include streaming piece numbers to priority ranks• Resources distributing strategy is based on execution time• Maximize concurrency but still taking data dependency into account
• Pipelining GPU branch with separate CUDA devices• Load-balancing between CPU/GPU is converged overstreaming time• Achieved a throughput of 250M pairs/s on (2) Xeon w5580 and a Quadro FX5800 gives• Achieved a throughput over 350M pairs/s on (2) Xeon e5550 and (2) Tesla C1060s
Task Scheduler & Resource Distribution
System Architecture
Execution Parallelism
Classi�cation of Data�ow Streaming Model Preliminary Results
Ongoing Progress
Complete Implementation for CPU Thread Resources
Basic Concrete Implementation of CUDA execution for GPU Resources • Connections can direct data transfer to appropriate cudaMemcpy() calls • A custom scheduling strategy to minimize host-device data transfer
• Dynamic Load-Balancing for GPU kernels • Concrete implementation for MPI to run on cluster • Incorporate TBB/OpenMP for Task Scheduling • A generalized streaming data structure
vtkContourFilter
vtkDataSetMapper
vtkActor
vtkDataSetReader
vtkContourFilter
vtkDataSetMapper
vtkActor
vtkDataSetReader
vtkRenderWindow
vtkRenderer
ContourFilter
DataSetMapper
Actor
ContourFilter
DataSetMapper
Actor
DataSetReader
RenderWindow
Renderer
11
2 2
3 3
44
5
5
6
6
7
7
9
9 8
8
10
(a) (b)
10
Data Block and Priority Rank
11
12
13
14
15
16
viewportChangedEvent()
while (LOD>0)
// Set LOD of DataAccesses
...
Push(<list of DataAccess>)
LOD = LOD - 1
ProgressiveRenderer
...DataAccess DataAccess DataAccess
Average1 image
12 imagesProgressiveRenderer
...DataAccess DataAccess DataAccess
Average1 image
12 images
viewportChangedEvent()
while (LOD>0)
// Propagate LOD to DataAccesses
...
Pull(Average Module)
LOD = LOD - 1
Push Pull
Better Pipeline ParallelismSimpler Interface
Compute() {
// input dependent computation
..
ReleaseInputs()
// input independent computation
..
}
vtkImageDataStreamer5
vtkImageGaussianSmooth
vtkImageLuminance
vtkImageThreshhold
vtkPNGReader
vtkDataSetMapper
vtkActor
1
2vtkImageDataStreamGenerator
2
3
4
6
7
vtkRenderer
9
8
vtkImageDataStreamMerger6
vtkImageGaussianSmooth
vtkImageLuminance
vtkImageThreshhold
vtkPNGReader
vtkDataSetMapper
vtkActor
1
3
4
5
vtkImageBlend
7
8
9
vtkImageDataStreamMerger
vtkDataSetMapper
vtkActor
vtkImageDataStreamGenerator
vtkImageGaussianSmooth
vtkImageLuminance
vtkImageThreshhold
vtkPNGReader
vtkImageDataStreamGenerator
vtkImageGaussianSmooth
vtkPNGReader
CPU/GPU Sort
Streaming Simpli�cation of Tetrahedral Meshes
Multi-View Visualization ~3x speed up on 4-core machine
ProgressiveRenderer
GrayScale GrayScale GrayScale
Average
StandardDeviation
Di�erenceL2 Norm
EdgeDetection
Downsample Downsample Downsample
...DataAccess DataAccess DataAccess
...
1 image 12 images
Streaming Simplication of Tetrahedral MeshesModels (Tets) Original Streaming Quadric Pipeline
Inst. Inst.Torso 1.0M 8.5s 8.4s 7.9s 3.0sFighter 1.4M 10.1s 9.5s 8.4s 3.6sRbl 3.8M 37.4s 35.6s 31.5s 8.9sMito 5.5M 52.0s 53.1s 44.9s 10.2s
UpdateQuadrics UpdateQuadrics UpdateQuadrics
TetStreamMerger
TetStreamingMesh
Simpli�cationBuf
Decimate
UpdateQuadrics UpdateQuadrics UpdateQuadrics
Simpli�cationBuf
Decimate
Simpli�cationBuf
Decimate
Simpli�cationBuf
Decimate
TetStreamMerger
TetStreamingMesh
CPU Threaded Radix Sort
GPU Presort
GPU Scatter
Streaming Unsorted Data
CPU Threaded Merge
CPU Threaded Accumulate
Sorted Output
GPU Scan Blocks
GPU Sum Blocks
top related