
# Architecting an Autograder for Parallel Code

Razvan Carbunescu, Aditya Devarakonda, Jay Alameda, James Demmel, Steven I. Gordon, Susan Mehringer

July 15, 2014

## Talk Outline

• XSEDE Parallel Computing Course

• Programming assignments

• Autograder: correctness, performance, feedback, resource management

• Course results

## XSEDE Parallel Computing Course

• Created from UC Berkeley course CS267

• Lectures converted for online use (quizzes added)

• Course offered in 2013 for a ‘Certificate of Completion’

• Course offered in 2014 for credit at 18 universities in the US and abroad, with local instructors

(Figure: map of universities offering the course for credit)

## Programming assignments

• HW1 – Optimizing Matrix Multiply

• HW2 – Parallel Particle Simulator

• HW3 – Parallel Knapsack

(Figure: knapsack illustration, taken from the Wikipedia article on Knapsack)

## HW 1 – Optimizing Matrix Multiply

C(i,j) = C(i,j) + A(i,:) · B(:,j)

• Naïve code is just 3 loops, but reaches only ~3% of arithmetic peak

• Students are given naïve and blocked code, and must provide ‘efficient’ code

• Students learn about: memory access, caching, SIMD, and using libraries
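The starter code handed to students is not reproduced here; the following is a minimal sketch of what the `square_dgemm` interface with cache blocking might look like, assuming column-major storage and an untuned tile size (`BLOCK` is illustrative):

```c
#include <stddef.h>

#define BLOCK 32  /* tile edge; a real submission would tune this to the cache */

/* C(i,j) += A(i,:) * B(:,j) for n-by-n matrices.
   Column-major layout is an assumption of this sketch. */
void square_dgemm(int n, double *A, double *B, double *C)
{
    for (int jb = 0; jb < n; jb += BLOCK)
        for (int kb = 0; kb < n; kb += BLOCK)
            for (int ib = 0; ib < n; ib += BLOCK) {
                int jmax = jb + BLOCK < n ? jb + BLOCK : n;
                int kmax = kb + BLOCK < n ? kb + BLOCK : n;
                int imax = ib + BLOCK < n ? ib + BLOCK : n;
                /* multiply one tile: operands small enough to stay in cache */
                for (int j = jb; j < jmax; j++)
                    for (int k = kb; k < kmax; k++) {
                        double bkj = B[k + j * (size_t)n];
                        for (int i = ib; i < imax; i++)
                            C[i + j * (size_t)n] += A[i + k * (size_t)n] * bkj;
                    }
            }
}
```

Blocking alone does not reach peak; students still need SIMD and register-level tuning, which is why the slide mentions those topics.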

## HW 2 – Parallel Particle Simulation

• Simplified particle simulator

• Introduces OpenMP, MPI and CUDA

• Students given working O(n²) code and must provide O(n) code

• Students learn about: locks and domain decomposition
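The assignment's code is not shown here; a hypothetical sketch of the binning idea behind the O(n) requirement (`count_close_pairs`, `CUTOFF`, and the struct layout are all illustrative, and the real code applies forces rather than counting pairs). Particles only interact within a cutoff radius, so sorting them into cutoff-sized bins lets each particle be checked against its own bin and the 8 neighbors, O(n) on average instead of O(n²):

```c
#include <stdlib.h>

#define CUTOFF 0.01   /* interaction radius; value is illustrative */

typedef struct { double x, y; } particle_t;

/* Count interacting pairs in a size-by-size box via binning.
   Allocation error checks omitted for brevity. */
int count_close_pairs(particle_t *p, int n, double size)
{
    int nb = (int)(size / CUTOFF) + 1;           /* bins per side */
    int *head = malloc((size_t)nb * nb * sizeof(int)); /* first particle in bin */
    int *next = malloc((size_t)n * sizeof(int)); /* linked list of bin members */
    for (int b = 0; b < nb * nb; b++) head[b] = -1;
    for (int i = 0; i < n; i++) {
        int b = (int)(p[i].x / CUTOFF) + nb * (int)(p[i].y / CUTOFF);
        next[i] = head[b]; head[b] = i;
    }
    int pairs = 0;
    for (int i = 0; i < n; i++) {
        int bx = (int)(p[i].x / CUTOFF), by = (int)(p[i].y / CUTOFF);
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int x = bx + dx, y = by + dy;
                if (x < 0 || y < 0 || x >= nb || y >= nb) continue;
                for (int j = head[x + nb * y]; j >= 0; j = next[j]) {
                    if (j <= i) continue;        /* count each pair once */
                    double ddx = p[i].x - p[j].x, ddy = p[i].y - p[j].y;
                    if (ddx * ddx + ddy * ddy < CUTOFF * CUTOFF) pairs++;
                }
            }
    }
    free(head); free(next);
    return pairs;
}
```

The same spatial decomposition is what MPI students partition across ranks, with locks (or per-bin ownership) guarding concurrent bin updates in the OpenMP version.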

## HW 3 – Parallel Knapsack

• 0-1 Knapsack problem

• Introduces UPC

• Students given inefficient parallel UPC code

• Students learn about: analyzing/minimizing communication, pipeline parallelism
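The UPC starter code is not shown; a minimal serial sketch of the 0-1 knapsack recurrence the assignment builds on, T(i,w) = max(T(i−1,w), T(i−1,w−wt[i]) + val[i]) (`MAXCAP` and the helper name are illustrative):

```c
#include <string.h>

#define MAXCAP 1024   /* illustrative capacity bound; caller must keep cap <= MAXCAP */

/* Serial 0-1 knapsack: returns the best value achievable with capacity cap.
   Iterating w downward lets one row of the DP table be updated in place,
   so each item is used at most once. */
int knapsack(int n, const int *wt, const int *val, int cap)
{
    int best[MAXCAP + 1];
    memset(best, 0, sizeof best);
    for (int i = 0; i < n; i++)
        for (int w = cap; w >= wt[i]; w--)
            if (best[w - wt[i]] + val[i] > best[w])
                best[w] = best[w - wt[i]] + val[i];
    return best[cap];
}
```

In the parallel UPC version, each row depends on the previous row, so threads that own column blocks proceed in a pipeline (wavefront), which is where the "pipeline parallelism" lesson comes from.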

## Autograder challenges

• Testing Correctness

• Testing Performance

• Feedback / automation

• Resource management

## Correctness

• What is the right answer? Does it exist?

(Figure: several candidate answers within ±ε of a reference value; which one is "correct"?)

Problems introduced by parallelism:

• Race conditions (non-benign)

• Deadlock / livelock / starvation

• Floating point and non-determinism

Problems exacerbated by parallelism:

• Output size compared to input (gathering, testing)

• Input type and size (precomputed vs random)
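Why exact equality fails as a check: reordering a floating point sum changes the result, so a parallel reduction need not match the serial answer bit for bit. A small demonstration (function names are illustrative):

```c
/* Summing the same values in two orders gives two different answers,
   because floating point addition is not associative. */
double sum_forward(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

double sum_reverse(const double *a, int n)
{
    double s = 0.0;
    for (int i = n - 1; i >= 0; i--) s += a[i];
    return s;
}
```

On the input {1.0, 1e16, -1e16} the forward sum is 0.0 (the 1.0 is absorbed by 1e16) while the reverse sum is 1.0. This is why the graders compare against an error norm with a tolerance rather than testing equality.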

## Performance

• What is a ‘fast’ or ‘good’ parallel code?

(Figure: strong-scaling and weak-scaling plots)

• Sequential metrics: time, percentage of peak

• Strong scaling and speedup

• Weak scaling

• Input dependent performance
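The standard definitions, as a sketch (helper names are illustrative): strong scaling fixes the problem size while adding processors, so ideal time is t1/p; weak scaling grows the problem with p, so ideal time is constant.

```c
/* Strong scaling: fixed problem on p processors.
   speedup = t1 / tp, efficiency = speedup / p (1.0 is ideal). */
double strong_efficiency(double t1, double tp, int p)
{
    return t1 / (tp * p);
}

/* Weak scaling: per-processor problem size fixed as p grows.
   efficiency = t1 / tp (1.0 is ideal). */
double weak_efficiency(double t1, double tp)
{
    return t1 / tp;
}
```

A grader can average these efficiencies across a sweep of processor counts to turn a scaling curve into a single score.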

## Feedback / automation

• Providing performance data

• Multiple submission capability

## Resource Management

• Allocation time vs scaling tests

• Latency due to utilization

• Student limits on allocation

## Autograder structure

The autograder is split into 2 parts.

First part (given to students):

• Focuses on correctness and performance

• Given to students at the start of the assignment

• Integrated into the assignment starting code

• Uses other auxiliary files (job scripts, etc.)

• Instant feedback to the student

• Limited scaling information

• Varies heavily from assignment to assignment

HW1 specifics:

• Floating point round-off meant using an error norm instead of equality for correctness checks

• Performance was determined from percentage of peak floating point rate

• Students were required to provide the defined interface function square_dgemm, with compilation options included as comments

HW2 specifics:

• No previous correctness check existed except visual inspection

• Implemented empirical statistical checks based on the average and minimum interaction distances between particles

• I/O and correctness checking were turned off for performance runs

• Performance was determined from the fitted exponent x of the serial algorithm’s O(n^x) running time, from average strong and weak scaling over 1–16 threads for OpenMP and MPI, and from speedup at different problem sizes for CUDA

HW3 specifics:

• Correctness was implemented via a value check

• Used average strong and weak scaling efficiency over 1–16 threads and 16–256 threads to check the 2 different stages of UPC (shared and distributed)

Second part (instructor-only):

• Focuses on final runs and calculating grades

• Very easily modifiable

• Relatively few changes between assignments

• Uses a private copy of autograder.cpp for correctness/performance checks

• Not available to students

## Course results

• Students worked individually or in groups of 2

• Most universities marked HW3 as optional to allow extra time for final projects

• Homework results (~150 students started, including audits):

| Assignment | Submissions | Max | Median |
|---|---|---|---|
| HW1 | 75 | 94 | 41 |
| HW2 | 57 | 97 | 30 |
| HW3 | 17 | 10 | 5 |

• 2013 had 345 students and 36/23/18 submissions, with 18 ‘Certificates of Completion’

• From the universities that finished and communicated data (4 out of 18), 38 students started and 25 finished the course, with 17 A’s, 4 B’s, 2 C’s, and the rest auditing
