Parallel Hashing 1
7/31/2019 Parallel Hashing 1
http://slidepdf.com/reader/full/parallel-hashing-1 1/42
Parallel Hashing
John Erol Evangelista
Definition of Terms
• GPU. Graphics Processing Unit.
• Parallel architecture. Architecture where calculations are done simultaneously.
• Serial architecture. Architecture where calculations are done serially.
• Voxel. 3D analog of the pixel.
• Kernel. A program that runs on the GPU.
Definition of Terms
• Thread. Smallest unit of processing.
• Latency. Time delay.
• Cache. Fast storage for recently used data.
• Race condition. A situation where the output depends on the timing of events.
GPU
• Graphics Processing Unit
• Its highly parallel architecture was recognized for its fast number-crunching abilities, giving rise to techniques for applying the GPU to non-graphical purposes.
Data Structures
• Applications rely on data structures that can be both built and used efficiently in a parallel environment.
• Defining parallel-friendly data
structures that can be efficiently
created, updated and accessed is a
significant research challenge.
Voxel
• 3D analog of the pixel
• Number of expected occupied voxels: O(N²).
• Storing the full N³ grid is extremely wasteful, since most of the grid is empty.
Hash Table
• Popular for these types of data (voxels)
since they can be constructed to answer
queries in O(1) memory accesses.
Figure 1.2. GPU hash tables are being constructed and queried every frame to perform Boolean intersections for these two animated models. Blue parts of one model represent voxels inside the other model, while green parts mark surface intersections. These images were produced using a 128³ voxel grid for point clouds of approximately 160k points. We achieve frame rates between 25–29 fps on a GTX 280, with the actual computation of the intersection and flood-fill requiring between 15–19 ms. Most of the time per frame is devoted to actual rendering of the meshes.
Application
Hash Tables
Figure 1.3. While allocating storage for the value of every possible key in an array allows directly indexing into the structure, it is wasteful when the array is mostly unused (top). A hash table can be used instead, which allocates far less space than the array (bottom). In this example, each slot holds both a key and its value. The table is indexed into using a hash function h(k). Because multiple keys may map to the same location, the key contained in the slot and the query key are compared on a retrieval to ensure the right value is returned.
Hash Tables
• Need to be adapted to a parallel environment:
• Serialization
• Memory accesses are slow
• Many probes may be required
CUDA
• Stands for “Compute Unified Device Architecture”
• Provides essential functionality for parallel applications, such as scattered writes in memory and atomic operations
CUDA C
• A high-level GPU programming language that extends C with extra constructs for dealing with the hardware.
How it works
• Programs that run on the GPU are called kernels and typically consist of just a few small functions.
• Kernels are executed in parallel by threads, each performing the same instructions on different data.
• e.g., a program computing the hash function value of every input key.
Limitation
• Copying data to and from the GPU is very expensive.
• Kernels do not have access to the host system’s memory.
• Solution: use data structures that can be built and used entirely in parallel, allowing data to stay on the GPU while it is being processed.
How it works
• Threads are grouped into thread blocks of up to 512 threads, which are assigned to different streaming multiprocessors (SMs) for execution.
• Thread blocks are queued up for the SMs and fed in as earlier thread blocks finish.
How it works
• Thread blocks can complete execution before others are even started, so there is no way to globally synchronize all the threads without finishing the kernel.
• Threads in the same block can locally synchronize using execution barriers, guaranteeing that they have all reached the same point before continuing.
How it works
• Multiple thread blocks can be handled by an SM simultaneously, but there is a hard limit on the number of threads an SM can handle.
How it works
• Each SM breaks its thread blocks into groups of 32 consecutive threads called warps.
• SMs manage when each of their warps will be executed on their SIMD cores, with each step running the same instruction in lockstep, even when a branch occurs.
Types of memory
• low-latency shared memory
• high-latency global memory
Low latency memory
• Used as a cache for global memory
• Scratchpad for threads working in the same thread block
• Fast but small
• Partitioned; does not persist between kernel invocations
Global Memory
• Abundant and shared, but slow
• To hide latency, SMs automatically context-switch to other warps while memory transactions are being performed
• Reads up to 128-byte segments of memory with a single transaction
• Memory requests of threads in a warp are coalesced together into fewer transactions
Atomic Operations
• Performed when race conditions are difficult or impossible to avoid.
• Perform a series of actions that cannot be interrupted.
• e.g., incrementing a counter
Fermi architecture
• Higher compute capability, more functionality
• Efficient atomic operations and a cached memory hierarchy to further reduce latency when accessing global memory
Hashing on GPU
• Open Addressing
• While they can be very fast for both construction and retrieval on a GPU, problems arise when trying to make a compact table: in the worst case, the whole table would have to be traversed to terminate a query.
Hashing on a GPU
• Chaining
• The number of probes increases greatly as the number of slots shrinks.
• Linked lists are horribly inefficient on a GPU
Hashing on a GPU
• Collision-free hashing
• A large enough table gives a constant probability of no collision
• Increased construction time, and inherently sequential in some implementations
Hashing on a GPU
• Multiple-choice Hashing
• Each item has several candidate slots; choose the one with the lowest occupancy
• Cuckoo Hashing
• A variation of open addressing that limits the slots an item can fall into
• Uses multiple hash functions
Performance Metrics
• Construction time
• Retrieval efficiency
• Memory usage
Open Addressing
• Race conditions may occur (multiple threads attempting to insert an item into the same location simultaneously)
Open Addressing
Figure 3.1. Examples of linear probing (left) and quadratic probing (right).
Open Addressing
• The parallel construction assigns each input item to a thread, then has each thread simultaneously probe the hash table for empty slots
• Collisions force serialization of access to the table
Parameters
• Number of slots: S_T ≥ N, where S_T is the number of slots and N is the number of items in the input. Typically S_T ≈ 1.25N.
• Probe sequence:

| Probing scheme | Hash function |
| --- | --- |
| Linear probing | h(k) = g(k) + iteration |
| Quadratic probing | h(k) = g(k) + c₀ · iteration + c₁ · iteration² |
| Double hashing | h(k) = g(k) + jump(k) · iteration |

Table 3.1. Open addressing hashing schemes
Parameters
• Maximum allowed length of probe sequence. Used to terminate a probe sequence that is taking too much time: min(N, 10000).
Hash Function
• Perfect hash function. Benefits are minimal, since the hash tables can be constructed in a way that effectively limits the number of probes required to find an item to just one or two.
• Simple randomized hash functions work well in practice.
Hash Function
• g(k) = ((f(a, k) + b) mod p) mod S_T
• where a and b are randomly generated constants, p is a prime number, and S_T is the number of slots available in the hash table
Implementation
Algorithm 3.1. Process for creating an open addressing hash table.

1: allocate enough memory for table[], which will contain S_T 64-bit slots
2: repeat
3:   fill each slot with ∅
4:   generate a new hash function for the current attempt
5:   for all key-value pairs (k, v) in the input do
6:     repeat
7:       atomically check-and-set table[location]
8:       advance location to the next location in the probe sequence
9:     until ∅ is found or max probes hit
10:   end for
11: until hash table is built

Listing 3.1. Parallel insertion of items into an open addressing table.
```c
__device__ bool insert_entry(const unsigned key,
                             const unsigned value,
                             const unsigned table_size,
                             Entry *table) {
    // Manage the key and its value as a single 64-bit entry.
    Entry entry = ((Entry)key << 32) + value;

    // Figure out where the item needs to be hashed into.
    unsigned index = hash_function(key);
    unsigned double_hash_jump = jump_function(key) + 1;

    // Keep trying to insert the entry into the hash table
    // until an empty slot is found.
    Entry old_entry;

    for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
        // Move the index so that it points somewhere within the table.
        index %= table_size;

        // Atomically check the slot and insert the key if empty.
        old_entry = atomicCAS(table + index, SLOT_EMPTY, entry);

        // If the slot was empty, the item was inserted safely.
        if (old_entry == SLOT_EMPTY) return true;

        // Move the insertion index.
        if (method == LINEAR)         index += 1;
        else if (method == QUADRATIC) index += attempt * attempt;
        else                          index += attempt * double_hash_jump;
    }

    return false;
}
```
Parallel Retrieval
• Follows the same search pattern as construction
Construction Rates
Figure 3.2. Effect of input size on construction and retrieval rates for tables containing 1.25N slots on both the GTX 280 (top) and GTX 470 (bottom).
Memory Usage
Figure 3.3. Effect of the table size on construction and retrieval rates for tables containing 10 million items.
Limitations
• Performance drops significantly for
compact tables
• High variability in probe sequence
length
• Removing items from the table is difficult.
Sources
• Alcantara, D. Efficient Hash Tables on a GPU.