Intro Parallel Processing 566


TRANSCRIPT

  • Slide 1/51

    Introduction to Parallel Processing

    Shantanu Dutt

    University of Illinois at Chicago

  • Slide 2/51

    Acknowledgements

    Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides),
    w/ some modifications and augmentations by Shantanu Dutt

    John Urbanic, Parallel Computing: Overview (slides), w/ some modifications
    and augmentations by Shantanu Dutt

    John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department
    of Computer Science, Rice University (slides), w/ some modifications and
    augmentations by Shantanu Dutt

  • Slide 3/51

    Outline

    Moore's Law and its limits
    Different uni-processor performance enhancement techniques and their limits
    Classification of parallel computations
    Classification of parallel architectures - Distributed and Shared memory
    Simple examples of parallel processing
    Example applications
    Future advances
    Summary

    Some text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

  • Slide 4/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Moore's Law & Need for Parallel Processing

    Chip performance doubles every 18-24 months
    Power consumption is proportional to frequency
    Limits of serial computing:
      Heating issues
      Limit to transmission speeds
      Leakage currents
      Limit to miniaturization
    Multi-core processors already commonplace.
    Most high-performance servers already parallel.

  • Slide 5/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Quest for Performance

    Advancements:
      Pipelining
      Superscalar architecture
      Out-of-order execution
      Caches
      Instruction set design
    Parallelism:
      Multi-core processors
      Clusters
      Grid
    This is the future

  • Slide 6/51

    Top text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

    Pipelining

    Illustration of a pipeline using the fetch, load, execute, store stages.
    At the start of execution: wind up.
    At the end of execution: wind down.
    Pipeline stalls due to data dependency (RAW, WAR), resource conflict, or
    incorrect branch prediction hit performance and speedup.
    Pipeline depth: no. of cycles in execution simultaneously.
    Intel Pentium 4: 35 stages.

  • Slide 7/51

    Pipelining

    Tpipe(n) is the pipelined time to process n instructions
      = fill-time + n * max{ti},  where ti = exec. time of the ith stage
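    A quick worked example of this formula (illustrative numbers, not from the
    slides), assuming a k-stage pipeline whose slowest stage takes one cycle, so
    that fill-time = (k-1) cycles:

        T_{pipe}(n) = \text{fill-time} + n \cdot \max_i\{t_i\} = (k-1) + n
        k = 5,\ n = 100:\quad T_{pipe}(100) = 4 + 100 = 104 \text{ cycles}
        \text{unpipelined: } \approx n \cdot k = 500 \text{ cycles}
        \Rightarrow \text{speedup} \approx 500 / 104 \approx 4.8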

  • Slide 8/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Cache

    Desire for fast, cheap, and non-volatile memory
    Memory speed grows at 7% per annum while processor speed grows at 50% p.a.
    Cache: fast, small memory. L1 and L2 caches.
    Retrieval from main memory takes several hundred clock cycles; retrieval from
    the L1 cache takes on the order of one clock cycle, and from the L2 cache on
    the order of 10 clock cycles.
    Cache hit and miss. Prefetch is used to avoid cache misses at the start of
    program execution.
    Cache lines are used to amortize the latency of a cache miss.
    Order of search: L1 cache -> L2 cache -> RAM -> Disk
    Cache coherency: correctness of data. Important for distributed parallel
    computing.
    Limit to cache improvement: improving cache performance will at most improve
    efficiency to match processor efficiency.
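    To make the cache-line and locality points concrete, a minimal C sketch (an
    illustration, not code from the slides; the array size N is an arbitrary
    choice). Both loop nests compute the same sum, but the first walks memory
    sequentially and reuses each fetched cache line, while the second jumps by a
    whole row per access and misses far more often:

        #include <stdio.h>

        #define N 4096                      /* illustrative size: a 128 MB array */
        static double a[N][N];

        int main(void) {
            double sum = 0.0;

            /* Cache-friendly: consecutive ix accesses fall in the same cache line. */
            for (int iy = 0; iy < N; iy++)
                for (int ix = 0; ix < N; ix++)
                    sum += a[iy][ix];

            /* Cache-unfriendly: stride of N doubles, so most accesses miss. */
            for (int ix = 0; ix < N; ix++)
                for (int iy = 0; iy < N; iy++)
                    sum += a[iy][ix];

            printf("sum = %f\n", sum);
            return 0;
        }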

  • Slide 9/51

    [Figure with annotations:]
    (exs. of limited data parallelism)
    (exs. of limited & low-level functional parallelism)
    (single-instr. multiple data)
    Instruction-level parallelism: degree generally low and dependent on how the
    sequential code has been written, so not v. effective

  • Slide 10/51  (no transcribed text)

  • Slide 11/51

    Thus need development of explicit parallel algorithms that are based on a
    fundamental understanding of the parallelism inherent in a problem, and
    exploiting that parallelism with minimum interaction/communication between
    the parallel parts

  • Slide 12/51  (no transcribed text)

  • Slide 13/51  (no transcribed text)

  • Slide 14/51

    (simultaneous multi-threading)

    (multi-threading)

  • Slide 15/51  (no transcribed text)

  • Slide 16/51  (no transcribed text)

  • Slide 17/51  (no transcribed text; from Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur)

  • Slide 18/51  (no transcribed text)

  • Slide 19/51  (no transcribed text)

  • Slide 20/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Applications of Parallel Processing

  • Slide 21/51  (no transcribed text)

  • Slide 22/51  (no transcribed text)

  • Slide 23/51  (no transcribed text)

  • Slide 24/51  (no transcribed text)

  • Slide 25/51  (no transcribed text)

  • Slide 26/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Example problems & solutions

    Easy parallel situation: each data part is independent. No communication is
    required between the execution units solving two different parts.

    Heat equation:
      The initial temperature is zero on the boundaries and high in the middle.
      The boundary temperature is held at zero.
      The calculation of an element is dependent upon its neighbor elements.

    [Figure: independent data parts  data1  data2  ...  dataN]

  • Slide 27/51

    Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

    find out if I am MASTER or WORKER
    if I am MASTER
      initialize array
      send each WORKER starting info and subarray
      do until all WORKERS converge
        gather from all WORKERS convergence data
        broadcast to all WORKERS convergence signal
      end do
      receive results from each WORKER
    else if I am WORKER
      receive from MASTER starting info and subarray
      do until solution converged {
        update time
        non-blocking send neighbors my border info
        non-blocking receive neighbors' border info
        update interior of my portion of solution array
        wait for non-blocking communication to complete
        update border of my portion of solution array
        determine if my solution has converged
        if so { send MASTER convergence signal
                recv. from MASTER convergence signal }
      end do }
      send MASTER results
    endif

    Serial code:

      do iy = 2, ny-1
        do ix = 2, nx-1
          u2(ix,iy) = u1(ix,iy) + cx*(u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) &
                                + cy*(u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
        enddo
      enddo

    [Figure: Master (can be one of the workers), Workers, Problem Grid]
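    As a concrete reading of the WORKER step "update interior of my portion of
    solution array", here is a hedged C sketch assuming a strip decomposition
    (each worker owns a block of rows plus one halo row on each side filled by
    the border exchange). The names NX, NY, first_row, last_row, cx, cy are
    illustrative assumptions, not identifiers from the slides:

        #include <stdio.h>

        #define NX 100          /* grid width  (arbitrary, for illustration) */
        #define NY 100          /* grid height (arbitrary, for illustration) */

        /* Update only the interior rows of this worker's strip; the strip's own
           border rows are updated later, once halo data from the neighbors has
           arrived (the "wait for non-blocking communication" step above). */
        void update_interior(double u2[][NX], double u1[][NX],
                             int first_row, int last_row, double cx, double cy) {
            for (int iy = first_row + 1; iy < last_row; iy++)
                for (int ix = 1; ix < NX - 1; ix++)
                    u2[iy][ix] = u1[iy][ix]
                               + cx * (u1[iy][ix+1] + u1[iy][ix-1] - 2.0 * u1[iy][ix])
                               + cy * (u1[iy+1][ix] + u1[iy-1][ix] - 2.0 * u1[iy][ix]);
        }

        int main(void) {                      /* tiny single-strip demo */
            static double u1[NY][NX], u2[NY][NX];
            u1[NY/2][NX/2] = 100.0;           /* hot spot in the middle */
            update_interior(u2, u1, 1, NY - 2, 0.1, 0.1);
            printf("%f\n", u2[NY/2][NX/2 + 1]);
            return 0;
        }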

  • Slide 28/51

    How to interconnect the multiple cores/processors is a major consideration
    in a parallel architecture.

  • Slide 29/51  (figure/table only; column headings "Tflops" and "kW")

  • Slide 30/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Parallelism - A simplistic understanding

    Multiple tasks at once.
    Distribute work into multiple execution units.

    A classification of parallelism:
      Data parallelism
      Functional or control parallelism

    Data parallelism - divide the dataset and solve each sector similarly on a
    separate execution unit.
    Functional parallelism - divide the 'problem' into different tasks and execute
    the tasks on different units. What would functional parallelism look like for
    the example on the right? (See the sketch below.)

    [Figure: sequential vs. data-parallel execution of the same computation]
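    A minimal C sketch of the two kinds of parallelism (an illustration under
    assumed names; the slides give no code here). The data-parallel phase runs
    the same summation on two halves of an array; the functional-parallel phase
    runs two different tasks, a sum and a maximum, on the same array. POSIX
    threads are assumed:

        #include <pthread.h>
        #include <stdio.h>

        #define N 1000000
        static double data[N];
        static double half_sum[2];          /* data-parallel partial results */
        static double total_sum, maximum;   /* functional-parallel results   */

        /* Data parallelism: same operation, different data section per thread. */
        static void *sum_section(void *arg) {
            int id = *(int *)arg;                     /* 0 or 1: which half */
            long start = (long)id * (N / 2), end = start + N / 2;
            double s = 0.0;
            for (long i = start; i < end; i++)
                s += data[i];
            half_sum[id] = s;
            return NULL;
        }

        /* Functional parallelism: different operations on the same data. */
        static void *task_sum(void *arg) {
            (void)arg;
            double s = 0.0;
            for (long i = 0; i < N; i++) s += data[i];
            total_sum = s;
            return NULL;
        }
        static void *task_max(void *arg) {
            (void)arg;
            double m = data[0];
            for (long i = 1; i < N; i++) if (data[i] > m) m = data[i];
            maximum = m;
            return NULL;
        }

        int main(void) {
            for (long i = 0; i < N; i++) data[i] = (double)(i % 97);

            pthread_t t[2];
            int ids[2] = {0, 1};

            /* Data-parallel phase: both threads run sum_section. */
            pthread_create(&t[0], NULL, sum_section, &ids[0]);
            pthread_create(&t[1], NULL, sum_section, &ids[1]);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            printf("data-parallel sum = %f\n", half_sum[0] + half_sum[1]);

            /* Functional-parallel phase: one thread sums, the other finds the max. */
            pthread_create(&t[0], NULL, task_sum, NULL);
            pthread_create(&t[1], NULL, task_max, NULL);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            printf("sum = %f, max = %f\n", total_sum, maximum);
            return 0;
        }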

  • Slide 31/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Data Parallelism

    Functional Parallelism

  • Slide 32/51

    Flynn's Classification

    Flynn's classical taxonomy:
    Single Instruction, Single Data (SISD): your single-core uni-processor PC
    Single Instruction, Multiple Data (SIMD): special-purpose low-granularity
    multi-processor machine with a single control unit relaying the same
    instruction to all processors (with different data) every clock cycle
    Multiple Instruction, Single Data (MISD): pipelining is a major example
    Multiple Instruction, Multiple Data (MIMD): the most prevalent model.
    SPMD (Single Program Multiple Data) is a very useful subset. Note that this
    is very different from SIMD. Why?

    Note that data vs. control parallelism is another classification, independent
    of the above.

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

  • Slide 33/51

    Flynn's Classification (contd.)

  • Slide 34/51

    Flynn's Classification (contd.)

  • Slide 35/51

    Flynn's Classification (contd.)

  • Slide 36/51

    Flynn's Classification (contd.)

  • Slide 37/51

    Flynn's Classification (contd.)

  • Slide 38/51

    Flynn's Classification (contd.)

    Data Parallelism: SIMD and SPMD fall into this category

    Functional Parallelism: MISD falls into this category

    Parallel Arch Classification

  • Slide 39/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Parallel Arch. Classification

    Multi-processor architectures:

    Distributed memory - most prevalent architecture model for # processors > 8
      Indirect interconnection n/ws
      Direct interconnection n/ws
    Shared memory
      Uniform Memory Access (UMA)
      Non-Uniform Memory Access (NUMA) - distributed shared memory

  • Slide 40/51

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Distributed-Memory (Message-Passing) Architectures

    Each processor P (with its own local cache C) is connected to exclusive local
    memory, i.e. no other CPU has direct access to it.
    Each node comprises at least one network interface (NI) that mediates the
    connection to a communication network.
    On each CPU runs a serial process that can communicate with other processes
    on other CPUs by means of the network.
    Non-blocking vs. blocking communication.
    Direct vs. indirect communication/interconnection network.

    Example: a 2x4 mesh n/w (direct connection n/w)
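    A minimal MPI sketch of the non-blocking communication mentioned above (an
    assumed illustration, not code from the slides): each process on a ring posts
    a non-blocking receive and send for one value, can overlap local work, and
    then waits for both transfers; a blocking version would use MPI_Send and
    MPI_Recv instead:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int right = (rank + 1) % size, left = (rank + size - 1) % size;
            double mine = (double)rank, from_left = -1.0;

            /* Non-blocking exchange: post receive and send, overlap local work. */
            MPI_Request reqs[2];
            MPI_Irecv(&from_left, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(&mine, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
            /* ... work that does not need from_left could go here ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            printf("rank %d received %f from rank %d\n", rank, from_left, left);
            MPI_Finalize();
            return 0;
        }

    With standard MPI tooling this would be built with mpicc and launched with,
    e.g., mpirun -np 4.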

  • Slide 41/51

    The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)

    Has 56 compute nodes/computers and a master node. "Master" here has a
    different meaning than the master node in a parallel algorithm (e.g., the one
    we saw for the finite-element heat distribution problem): here it is generally
    a system front-end where you log in and perform various tasks before
    submitting your parallel code to run on several compute nodes, whereas the
    algorithmic master would actually be one of the compute nodes, and generally
    distributes data to the other compute nodes, monitors progress of the
    computation, determines the end of the computation, etc., and may also
    additionally perform a part of the computation.

    Compute nodes are divided among 14 zones, each zone containing 4 nodes which
    are connected as a ring network. Zones are connected to each other by a
    higher-level n/w.

    Each node (compute or master) has 2 processors. The processors on some nodes
    are single-core, and dual-core on others; see http://accc.uic.edu/service/arg/nodes

  • Slide 42/51

    System Computational Actions in a Message-Passing Program

    (a) Two basic parallel processes X, Y, and their data dependency:

        Proc. X:  a := b+c;        Proc. Y:  b := x*y;

    (b) Their mapping to a message-passing multicomputer, with processor/core
        P(X) containing X, P(Y) containing Y, and a link (direct or indirect)
        between the two processors:

        Proc. X:  recv(P2, b);  a := b+c;
        Proc. Y:  b := x*y;  send(P1, b);

        Message passing of data item b over the link.
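    One possible MPI realization of the mapping in (b) (an assumed sketch, not
    the slides' code), with process X on rank 0 and process Y on rank 1; the
    blocking receive enforces X's dependency on the b produced by Y:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double a, b, c = 1.0, x = 2.0, y = 3.0;

            if (rank == 0) {                 /* process X */
                MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                a = b + c;
                printf("X: a = %f\n", a);
            } else if (rank == 1) {          /* process Y */
                b = x * y;
                MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }

            MPI_Finalize();
            return 0;
        }

    This assumes two processes, e.g. mpirun -np 2.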

  • Slide 43/51

    [Figure: dual-core and quad-core chips, with L1 and L2 caches]

    Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur

    Distributed Shared Memory Arch.: UMA

    Flat memory model.
    Memory bandwidth and latency are the same for all processors and all memory
    locations.
    Simplest example: dual-core processor.
    Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
    Cache-coherent UMA: consistent cache values of the same data item in
    different processor/core caches.

  • Slide 44/51

    System Computational Actions in a Shared-Memory Program

    (a) Two basic parallel processes X, Y, and their data dependency:

        Proc. X:  a := b+c;        Proc. Y:  b := x*y;

    (b) Their mapping to a shared-memory multiprocessor, processors P(X) and P(Y)
        connected to a shared memory:

        Proc. X:  a := b+c;        Proc. Y:  b := x*y;

    Possible actions by the O.S. for Y (the writer):
    (i) Since b is a shared data item (e.g., designated by the compiler or
    programmer), check b's location to see if it can be written to (all previous
    reads done: read_cntr for b = 0).
    (ii) If so, write b to its location and mark its status bit as written by Y.
    Initialize read_cntr for b to a pre-determined value.

    Possible actions by the O.S. for X (the reader):
    (i) Since b is a shared data item (e.g., designated by the compiler or
    programmer), check b's location to see if it has been written to by Y or by
    any process (if we don't care about the writing process).
    (ii) If so {read b & decrement read_cntr for b}, else go to (i) and busy-wait
    (check periodically).
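    A shared-memory analogue in C (an illustrative sketch, not the slides'
    mechanism): a written-flag plus busy-waiting plays the role of the status bit
    and the read-check described above. C11 atomics and POSIX threads are
    assumed; a production program would normally use a lock or condition variable
    rather than a busy-wait:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static double b, x = 2.0, y = 3.0, c = 1.0, a;
        static atomic_int b_written = 0;     /* plays the role of b's status bit */

        static void *proc_Y(void *arg) {     /* the writer */
            (void)arg;
            b = x * y;                               /* b := x*y            */
            atomic_store(&b_written, 1);             /* mark b as written   */
            return NULL;
        }

        static void *proc_X(void *arg) {     /* the reader */
            (void)arg;
            while (atomic_load(&b_written) == 0)     /* busy-wait for b     */
                ;                                    /* (check repeatedly)  */
            a = b + c;                               /* a := b+c            */
            printf("X: a = %f\n", a);
            return NULL;
        }

        int main(void) {
            pthread_t tx, ty;
            pthread_create(&tx, NULL, proc_X, NULL);
            pthread_create(&ty, NULL, proc_Y, NULL);
            pthread_join(tx, NULL);
            pthread_join(ty, NULL);
            return 0;
        }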

  • Slide 45/51

    Most text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

    Distributed Shared Memory Arch.: NUMA

    Memory is physically distributed but logically shared.
    The physical layout is similar to the distributed-memory message-passing case.
    The aggregated memory of the whole system appears as one single address space.
    Due to the distributed nature, memory access performance varies depending on
    which CPU accesses which parts of memory (local vs. remote access).
    Two locality domains linked through a high-speed connection called
    HyperTransport (in general via a link, as in message-passing archs, only here
    these links are used by the O.S. to transmit read/write non-local data
    to/from processor/non-local memory).
    Advantage: scalability (compared to UMAs).
    Disadvantages: a) locality problems and connection congestion; b) not a
    natural parallel programming/algorithm model (it is easier to partition data
    among processors instead of thinking of all of it occupying a large monolithic
    address space that each processor can access).

    [Figure: 2x2 mesh connection]

  • Slide 46/51  (no transcribed text)

  • Slide 47/51  (no transcribed text)

  • Slide 48/51

    An example of an SPMD message-passing parallel program

    (program text shown in figure; not transcribed)

  • Slide 49/51

    SPMD message-passing parallel program (contd.)

    [Figure: program text; only the fragment "node xor D" is legible]

  • Slide 50/51

    Most text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur

    Summary

    Serial computers / microprocessors will probably not get much faster -
    parallelization is unavoidable.
    Pipelining, cache, and other optimization strategies for serial computers are
    reaching a plateau.
    Data and functional parallelism.
    Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
    Parallel architectures intro:
      Distributed memory
      Shared memory
        Uniform Memory Access
        Non-Uniform Memory Access
    Application examples.
    Parallel program/algorithm examples.

  • Slide 51/51

    Fundamentals of Parallel Processing

    Additional References

    Computer Organization and Design - Patterson & Hennessy
    Modern Operating Systems - Tanenbaum
    Concepts of High Performance Computing - Georg Hager, Gerhard Wellein
    Cramming More Components onto Integrated Circuits - Gordon Moore, 1965
    Introduction to Parallel Computing - https://computing.llnl.gov/tutorials/parallel_comp
    The Landscape of Parallel Computing Research: A View from Berkeley, 2006