July 15, 2014

Architecting an autograder for parallel code

Razvan Carbunescu, Aditya Devarakonda, Jay Alameda,

James Demmel, Steven I. Gordon, Susan Mehringer

Talk Outline

• Course that motivated autograder

• Autograder concepts and challenges

• Autograder implementation

• Course results

XSEDE Parallel Computing Course

• Created from UC Berkeley course CS267

• Lectures converted for online use (quizzes added)

• Programming assignments require autograder

• Course offered in 2013 for ‘Certificate of Completion’

• Course offered in 2014 for credit at 18 universities in the US and abroad with local instructors

Universities offering course for credit

Programming assignments

• HW1 - Optimizing Matrix Multiply

• HW2 - Parallel Particle Simulator

• HW3 – Parallel Knapsack

* Bottom picture taken from Wikipedia article on Knapsack

$C(i,j) = C(i,j) + A(i,:) \cdot B(:,j)$

HW 1 – Optimizing Matrix Multiply

• Naïve code is only 3 loops, but also reaches only 3% of arithmetic peak

• Students given naïve and blocked code, must provide ‘efficient’ code

• Students learn about: memory access, caching, SIMD and using libraries
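
The difference between the naïve and blocked starting codes can be sketched as follows. This is an illustrative Python sketch, not the course's starter code: the function names, the row-major flat-list layout, and the `block` parameter are assumptions made for the example. Blocking changes only the order in which the same multiply-adds are performed, so that each tile of A, B and C is reused while it is still in cache.

```python
def naive_matmul(A, B, n):
    """Three-loop reference: C(i,j) = C(i,j) + A(i,:) * B(:,j).

    Matrices are flat row-major lists of length n*n (an assumption
    of this sketch).
    """
    C = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i * n + k] * B[k * n + j]
            C[i * n + j] = s
    return C


def blocked_matmul(A, B, n, block=2):
    """Same arithmetic, reordered into block x block tiles."""
    C = [0.0] * (n * n)
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # Multiply one tile of A by one tile of B and
                # accumulate into the matching tile of C.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        s = C[i * n + j]
                        for k in range(kk, min(kk + block, n)):
                            s += A[i * n + k] * B[k * n + j]
                        C[i * n + j] = s
    return C
```

In pure Python the blocked version will not be faster; the sketch only shows the loop structure that a compiled implementation would tune.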

HW 2 – Parallel Particle Simulation

• Simplified particle simulator

• Introduces OpenMP, MPI and CUDA

• Students given working O(n^2) code and must provide O(n) code

• Students learn about: synchronization, locks and domain decomposition
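
One common way to get from O(n^2) to O(n) in a short-range particle simulation is spatial binning. The sketch below is a serial illustration under stated assumptions, not the course solution: the function names are hypothetical, particles are 2D points, and interactions are assumed to vanish beyond a `cutoff` distance. With bins of side `cutoff`, each particle only needs to be compared with particles in its own and the 8 neighboring bins, which is O(n) work when the particle density is bounded.

```python
import math


def neighbor_pairs_naive(points, cutoff):
    """O(n^2) reference: test every pair of particles."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if dx * dx + dy * dy <= cutoff * cutoff:
                pairs.add((i, j))
    return pairs


def neighbor_pairs_binned(points, cutoff):
    """Binned version: only look in the 3x3 block of bins around each bin."""
    bins = {}
    for idx, (x, y) in enumerate(points):
        key = (math.floor(x / cutoff), math.floor(y / cutoff))
        bins.setdefault(key, []).append(idx)

    pairs = set()
    for (bx, by), members in bins.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j:  # record each pair once
                            ddx = points[i][0] - points[j][0]
                            ddy = points[i][1] - points[j][1]
                            if ddx * ddx + ddy * ddy <= cutoff * cutoff:
                                pairs.add((i, j))
    return pairs
```

The same bins are a natural unit for domain decomposition: each thread or process can own a contiguous group of bins and exchange only the boundary bins with its neighbors.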

HW 3 – Parallel Knapsack

• 0-1 Knapsack problem

• Introduces UPC

• Students given inefficient parallel UPC code

• Students learn about: analyzing/minimizing communication, pipeline parallelism
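
For reference, the 0-1 knapsack problem has a standard dynamic-programming solution. The serial Python sketch below is only an illustration of that recurrence; it is not the UPC code handed to students, and the function name is hypothetical.

```python
def knapsack_01(weights, values, capacity):
    """Best total value using each item at most once.

    best[c] holds the best value achievable with total weight <= c
    using the items processed so far.
    """
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Walk capacities downward so each item is counted at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]
```

Each item's update depends only on the table left by the previous item, and only on entries at lower capacities. That dependency pattern is what makes communication analysis and pipelining relevant once the table is distributed across threads.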

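Talk Outline

• Course that motivated autograder

• Autograder concepts and challenges

• Autograder implementation

• Course results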

Autograder Concepts

• Testing Correctness

• Testing Performance

• Feedback / automation

• Resource management

Correctness

• What is the right answer? Does it exist?

[Figure: values shown with +ε / –ε tolerance bands; one case marked ???]

Correctness

Problems introduced by parallelism

• Race conditions (non-benign)

• Deadlock / livelock / starvation

• Floating Point and non-determinism

Problems exacerbated by parallelism

• Output size compared to input (gathering, testing)

• Input type and size (precomputed vs random)
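
A small Python illustration of why floating point and non-determinism interact: floating-point addition is not associative, so a parallel reduction that combines partial sums in a different order can legitimately produce a different last bit. The variable names and the tolerance are choices made for this sketch.

```python
import math

# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact comparison fails even though both results are "correct".
exact_match = left_to_right == right_to_left

# A tolerance-based comparison accepts both orderings.
tolerant_match = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

Here `exact_match` is `False` and `tolerant_match` is `True`, which is why a grader for parallel code generally needs a tolerance rather than bitwise equality.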

Performance

• What is a ‘fast’ or ‘good’ parallel code?

[Figure panels: strong scaling, weak scaling]

Performance

• Sequential metrics: time, percentage of peak

• Strong scaling and speedup

• Weak scaling

• Input dependent performance

• Overhead of correctness check

• Overhead of I/O operations
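
The scaling metrics above have standard definitions, sketched here in Python. The function and parameter names are assumptions of this example: `t1` is the time on one thread or process and `tp` the time on `p` of them.

```python
def speedup(t1, tp):
    """How many times faster the parallel run is than the serial run."""
    return t1 / tp


def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: ideal time on p workers is t1 / p."""
    return t1 / (p * tp)


def weak_scaling_efficiency(t1, tp):
    """Problem size grows with p (fixed work per worker): ideal time stays t1."""
    return t1 / tp
```

For example, a run that takes 8 s serially and 4 s on 4 threads has a speedup of 2 and a strong scaling efficiency of 0.5.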

Feedback / automation

• Providing fast correctness answer

• Providing performance data

• Submission/grade feedback

• Multiple submission capability

• Need for adaptability

Resource Management

• Allocation time vs scaling tests

• Latency due to utilization

• Student limits on allocation

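Talk Outline

• Course that motivated autograder

• Autograder concepts and challenges

• Autograder implementation

• Course results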

Autograder implementation

• Split into 2 parts: autograder.cpp and grade.py

Autograder.cpp

• Focuses on correctness and performance

• Given to students at start of assignment

• Parts integrated in assignment starting code

• Used other auxiliary files (job scripts, etc.)

• Instant feedback to student

• Limited scaling information

• Varies heavily from assignment to assignment

HW1 Autograder Implementation

• Floating point round-off meant using error norm instead of equalities for correctness checks

• Performance was determined from percentage of peak floating point rate

• Students required to provide defined interface function square_dgemm with compilation options included as comments
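
The two HW1 checks can be sketched as follows. This is an illustration in Python, not the contents of autograder.cpp: the elementwise error bound, the `slack` factor, the flat row-major layout, and the function names are all assumptions of this sketch. The flop count of 2n^3 for an n x n matrix multiply is the standard one.

```python
import sys


def within_error_bound(A, B, C, n, slack=3.0):
    """Accept C if each entry is within a round-off bound of a reference A*B.

    Assumed bound per entry: slack * eps * n * sum_k |A(i,k)| * |B(k,j)|.
    """
    eps = sys.float_info.epsilon
    for i in range(n):
        for j in range(n):
            reference = sum(A[i * n + k] * B[k * n + j] for k in range(n))
            magnitude = sum(abs(A[i * n + k]) * abs(B[k * n + j]) for k in range(n))
            if abs(C[i * n + j] - reference) > slack * eps * n * magnitude:
                return False
    return True


def percent_of_peak(n, seconds, peak_gflops):
    """Achieved rate (2*n^3 flops / time) as a percentage of machine peak."""
    achieved_gflops = 2.0 * n**3 / seconds / 1e9
    return 100.0 * achieved_gflops / peak_gflops
```

A bound that scales with n and with the magnitudes of the inputs tolerates the reordering of additions that blocking and SIMD introduce, while still rejecting results that are actually wrong.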


HW2 Autograder Implementation

• No previous correctness check existed except visual inspection

• Implemented empirical statistic checks based on the average and minimum interaction distances for particles

• I/O and correctness turned off for performance runs

• Performance determined from the coefficient x of the O(n^x) serial algorithm, from average strong and weak scaling for 1-16 threads for OpenMP and MPI, and from speedup for different problem sizes for CUDA
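
Two of the particle-simulation checks can be sketched in Python. Both functions are hypothetical illustrations rather than the autograder's code: the first estimates x in O(n^x) as the least-squares slope of log(time) against log(n), and the second applies thresholds (passed in as parameters, since their values are not given here) to the minimum and average interaction distances.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(n), i.e. x in O(n^x)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def distance_stats_ok(min_dist, avg_dist, min_threshold, avg_threshold):
    """Statistical sanity check: particles should not get 'too close'.

    A run passes if both the minimum and the average interaction
    distance stay at or above their (assumed) thresholds.
    """
    return min_dist >= min_threshold and avg_dist >= avg_threshold
```

A fitted exponent close to 1 indicates the linear-time behavior the assignment asks for, while a value close to 2 indicates the original all-pairs algorithm.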


HW3 Autograder Implementation

• Correctness was implemented via value check

• Performance used average strong and weak scaling efficiency for 1-16 threads and 16-256 threads to check the 2 different stages of UPC (shared and distributed)

Grade.py

• Focuses on final runs and calculating grades

• Very easily modifiable

• Relatively few changes between assignments

• Uses a private copy of autograder.cpp for correctness/performance checks

• Not available to students

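Talk Outline

• Course that motivated autograder

• Autograder concepts and challenges

• Autograder implementation

• Course results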

Course results

• Universities used different grading schemes based on data from autograder

• High drop-off for undergraduate students (CS267 is a graduate course)

• Students worked individually or in groups of 2

• Most universities had HW3 marked as optional to allow for extra time for final projects

Homework results

• ~150 students started (includes audits)

• 75 HW1 submissions (max: 94, median: 41)



• 2013 had 345 students and 36/23/18 submissions with 18 ‘Certificates of Completion’

• From the universities that finished and communicated data (4 out of 18), we have 38 starting students, 25 of whom finished the course, with 17 A’s, 4 B’s, 2 C’s and the rest auditing
