  • July 15, 2014

    Architecting an autograder for parallel code

    Razvan Carbunescu, Aditya Devarakonda, Jay Alameda,

    James Demmel, Steven I. Gordon, Susan Mehringer

  • Talk Outline

    • Course that motivated autograder

    • Autograder concepts and challenges

    • Autograder implementation

    • Course results

  • XSEDE Parallel Computing Course

    • Created from UC Berkeley course CS267

    • Lectures converted for online use (quizzes added)

    • Programming assignments require autograder

    • Course offered in 2013 for ‘Certificate of Completion’

    • Course offered in 2014 for credit at 18 universities in the US and abroad with local instructors

  • Universities offering course for credit

  • Programming assignments

    • HW1 - Optimizing Matrix Multiply

    • HW2 - Parallel Particle Simulator

    • HW3 – Parallel Knapsack

    * Bottom picture taken from Wikipedia article on Knapsack

    [Figure, matrix multiply: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

  • HW 1 – Optimizing Matrix Multiply

    • Naïve code is 3 loops but achieves only 3% of arithmetic peak

    • Students given naïve and blocked code, must provide ‘efficient’ code

    • Students learn about: memory access, caching, SIMD and using libraries
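To make the difference between the naïve and blocked starting points concrete, here is a minimal sketch of both loop orders. It is written in Python only for brevity and is not the course's starter code; the function names and the block size are assumptions for illustration. In pure Python the blocking will not show a cache benefit, but the loop structure is the same one a compiled version would use.

```python
def matmul_naive(A, B, n):
    """Three nested loops: C(i,j) = C(i,j) + A(i,k) * B(k,j)."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, n, block=4):
    """Same arithmetic, but iterated tile by tile so that a small
    working set of A, B and C is reused before moving on.
    The block size of 4 is an arbitrary illustrative value."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + block, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Both functions perform the same 2n^3 floating point operations; only the order of memory accesses changes.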

  • HW 2 – Parallel Particle Simulation

    • Simplified particle simulator

    • Introduces OpenMP, MPI and CUDA

    • Students given working O(n^2) code and must provide O(n) code

    • Students learn about: synchronization, locks and domain decomposition
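One common way to go from an all-pairs search to roughly linear work is spatial binning, sketched below. This is an assumption about the kind of solution expected, not the course's reference code: the names `pairs_bruteforce` and `pairs_binned` are hypothetical, and the linear cost only holds when the particle density per bin stays bounded.

```python
import math
from collections import defaultdict


def pairs_bruteforce(points, cutoff):
    """O(n^2): test every pair of particles."""
    found = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < cutoff:
                found.add((i, j))
    return found


def pairs_binned(points, cutoff):
    """Place particles in square bins of side `cutoff`; any interacting
    pair must then sit in the same bin or in adjacent bins."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        bins[(int(x // cutoff), int(y // cutoff))].append(idx)

    found = set()
    for (bx, by), members in bins.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty bins while iterating.
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) < cutoff:
                            found.add((i, j))
    return found
```

The same bins are a natural unit for domain decomposition: each thread or rank can own a block of bins and only exchange the particles near its boundary.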

  • HW 3 – Parallel Knapsack

    • 0-1 Knapsack problem

    • Introduces UPC

    • Students given inefficient parallel UPC code

    • Students learn about: analyzing/minimizing communication, pipeline parallelism
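For readers unfamiliar with the problem, a minimal serial sketch of the standard dynamic programming solution to 0-1 knapsack follows. It is in Python rather than UPC and is not the assignment's code; the function name is hypothetical. It is included only to show the dependency structure: each item's update reads the table left by the previous item, which is the dependency a parallel version has to communicate or pipeline around.

```python
def knapsack_01(weights, values, capacity):
    """Return the best total value achievable with each item used at most once.

    best[c] holds the best value for capacity c using the items seen so far.
    """
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Walk capacities downward so each item is counted at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]
```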

  • Autograder Concepts

    • Testing Correctness

    • Testing Performance

    • Feedback / automation

    • Resource management

  • Correctness

    • What is the right answer? Does it exist?

    [Figure: diagram with ε labels]

  • Correctness

    Problems introduced by parallelism

    • Race conditions (non-benign)

    • Deadlock / livelock / starvation

    • Floating Point and non-determinism

    Problems exacerbated by parallelism

    • Output size compared to input (gathering, testing)

    • Input type and size (precomputed vs random)
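The floating point point above can be shown in a few lines. This is an illustrative sketch, not part of the autograder: the helper name and the input values are chosen only to make the effect visible. Summing the same four numbers in two different orders, as two different parallel reductions might, gives two different answers.

```python
import math


def sum_in_order(values, order):
    """Accumulate values[i] in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16, 1.0]

# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost.
left_to_right = sum_in_order(values, [0, 1, 2, 3])  # 1.0

# Cancelling the large terms first keeps both small terms.
regrouped = sum_in_order(values, [0, 2, 1, 3])  # 2.0

# math.fsum tracks the lost low-order bits and returns the exact sum.
exact = math.fsum(values)  # 2.0
```

Because of this, an exact equality test against a reference output can reject a correct parallel code.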

  • Performance

    • What is a ‘fast’ or ‘good’ parallel code?

    [Figure panels: Strong scaling | Weak scaling]

  • Performance

    • Sequential metrics: time, percentage of peak

    • Strong scaling and speedup

    • Weak scaling

    • Input dependent performance

    • Overhead of correctness check

    • Overhead of I/O operations
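The scaling metrics listed above have standard definitions, sketched here as small helpers. The function names are hypothetical, and the talk does not say exactly which formulas the autograder used; these are the textbook forms.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: ideal time on p workers is t1 / p,
    so efficiency is t1 / (p * tp). A value of 1.0 is perfect."""
    return t1 / (p * tp)


def weak_scaling_efficiency(t1, tp):
    """Problem size grows with the worker count (fixed work per worker):
    ideal time stays at t1, so efficiency is t1 / tp."""
    return t1 / tp
```

For example, a run that takes 8 seconds on 1 thread and 2 seconds on 8 threads has a speedup of 4 and a strong scaling efficiency of 0.5.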

  • Feedback / automation

    • Providing fast correctness answer

    • Providing performance data

    • Submission/grade feedback

    • Multiple submission capability

    • Need for adaptability

  • Resource Management

    • Allocation time vs scaling tests

    • Latency due to utilization

    • Student limits on allocation

  • Autograder implementation

    • Split into 2 parts: autograder.cpp and grade.py

  • Autograder.cpp

    • Focuses on correctness and performance

    • Given to students at start of assignment

    • Parts integrated in assignment starting code

    • Used other auxiliary files (job scripts, etc.)

    • Instant feedback to student

    • Limited scaling information

    • Varies heavily from assignment to assignment

  • HW1 Autograder Implementation

    • Floating point round-off meant using error norm instead of equalities for correctness checks

    • Performance was determined from percentage of peak floating point rate

    • Students required to provide defined interface function square_dgemm with compilation options included as comments
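A check based on an error norm rather than equality could look like the sketch below. The talk does not state which norm or tolerance the autograder used, so the Frobenius norm, the `1e-12` tolerance, and the function names are all assumptions made for illustration.

```python
import math


def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def relative_error(C, C_ref):
    """Norm of the difference, scaled by the norm of the reference."""
    diff = [[c - r for c, r in zip(row_c, row_r)]
            for row_c, row_r in zip(C, C_ref)]
    denom = frobenius_norm(C_ref)
    # Fall back to the absolute error when the reference is all zeros.
    return frobenius_norm(diff) / denom if denom > 0.0 else frobenius_norm(diff)


def matrices_agree(C, C_ref, tol=1e-12):
    """Accept results that differ from the reference only by round-off.
    The tolerance is a hypothetical value, not the course's."""
    return relative_error(C, C_ref) <= tol
```

A result that differs from the reference in the last few bits passes this check, while one with a wrong entry does not.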

  • HW2 Autograder Implementation

    • No previous correctness check except visual

    • Implemented empirical statistic checks based on the average and minimum interaction distances for particles

    • I/O and correctness turned off for performance runs

    • Performance determined from the estimated exponent x of the O(n^x) serial algorithm, from average strong and weak scaling for 1-16 threads for OpenMP and MPI, and from speedup for different problem sizes for CUDA
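One plausible reading of the serial metric is an empirical exponent x in O(n^x), estimated from timings at several problem sizes. The sketch below fits that exponent as the least-squares slope of log(time) against log(n). This is an assumption about the approach, not the autograder's actual code, and `fit_exponent` is a hypothetical name.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(n).

    If time is approximately c * n^x, then log(time) = log(c) + x * log(n),
    so the slope of the fitted line estimates x.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A fitted value near 1 would indicate the linear behavior the assignment asks for, while a value near 2 would indicate that the all-pairs loop is still in place.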

  • HW3 Autograder Implementation

    • Correctness was implemented via value check

    • Performance was measured using average strong and weak scaling efficiency for 1-16 threads and 16-256 threads, to check the 2 different stages of UPC (shared and distributed)

  • Grade.py

    • Focuses on final runs and calculating grades

    • Very easily modifiable

    • Relatively few changes between assignments

    • Uses a private copy of autograder.cpp for correctness/performance checks

    • Not available to students

  • Course results

    • Universities used different grading schemes based on data from autograder

    • High drop-off for undergraduate students (CS267 is a graduate course)

    • Students worked individually or in groups of 2

    • Most universities had HW3 marked as optional to allow for extra time for final projects

  • Homework results

    • ~150 students started (includes audits)

    • 75 HW1 submissions (max: 94, median: 41)

    • 57 HW2 submissions (max: 97, median: 30)

    • 17 HW3 submissions (max: 10, median: 5)

    • 2013 had 345 students and 36/23/18 submissions, with 18 ‘Certificates of Completion’

    • From the universities that finished and communicated data (4 out of 18), we have 38 starting students, of whom 25 finished the course with 17 A’s, 4 B’s and 2 C’s, and the rest auditing