Parallel Real-Time Scheduling from Theory to Implementation
Jing Li
Department of Computer Science, Ying Wu College of Computing
New Jersey Institute of Technology
Outline
Ø Why do we demand parallelism?
Ø My research: exploiting parallelism in real-time systems
Ø Two examples of my research
Future of Computing Performance
[Figure: 40 Years of Microprocessor Trend Data. Frequency (MHz) on a logarithmic axis from 10^0 to 10^4, plotted against year, 1970 to 2020.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
[Image: The fastest sloth working at the DMV (from Zootopia).]
Power Density
Ø Clock frequency hits the power wall.
Multi-Core Systems
[Figure: 40 Years of Microprocessor Trend Data. Number of logical cores and frequency (MHz) on a logarithmic axis from 10^0 to 10^4, plotted against year, 1970 to 2020. Same data source as the previous figure.]
Multi-Core = Parallelism?
Ø Concurrent execution of jobs
[Diagram labels: cores; single multi-core machine; jobs: Job 1, Job 2, Job 3]
Ø Parallel execution of a job
[Diagram labels: cores; single multi-core machine; jobs]
Lane Keeping Assist System in Cars
Since the system interacts with the physical world, its computation must be completed under a time constraint.
Applications Benefit from Intra-Job Parallelism
Ø Motion planning program in a self-driving car
From Kim et al. [ICCPS13]
Cyber-Physical Systems (CPS)
CPS are built from, and depend upon, the seamless integration of computational algorithms and physical components.
E.g., robotics, drones, autonomous vehicles, etc.
Applications Benefit from Intra-Job Parallelism
Ø Web search
A response must arrive within 100 ms for users to find the service responsive.*
[Diagram: Search the web. Labels: Query, Doc. index search, doc, 2nd phase ranking, Snippet generator, Response]
*Jeff Dean et al. (Google), "The tail at scale," Communications of the ACM 56.2 (2013).
Interactive Cloud Services (ICS)
E.g., web search, online gaming, stock trading, etc.
Real-Time Systems
The performance of the systems depends not only upon their functional aspects, but also upon their temporal aspects.
Real-time performance:
1) Provide a hard guarantee of meeting jobs’ deadlines (e.g., CPS)
2) Optimize latency-related objectives for jobs (e.g., ICS)
New Generation of Real-Time Systems
Characteristics:
Ø New classes of applications with complex functionalities
Ø Increasing computational demand of each application
Ø Consolidating multiple applications onto a shared platform
Ø Rapid increase in the number of cores per chip
Demand: leverage parallelism within the applications to improve real-time performance and system efficiency
My research: parallel real-time scheduling from theory to implementation
Outline
Ø Why do we demand parallelism?
Ø My research: exploiting parallelism in real-time systems
Ø Example 1: parallel real-time scheduling for meeting deadlines
Ø Example 2: parallel real-time scheduling for a target latency
Real-Time Hybrid Simulation (RTHS)
Since the numerical simulation interacts with the physical specimen, its computation must be completed by its deadline.
[Diagram labels: Numerical simulation; Cyber-physical boundary; Physical specimen^]
^Robert L. and Terry L. Bowen Large Scale Structures Laboratory at Purdue University
How to Allocate Cores to Multiple Parallel Jobs?
Ø A real-world system consists of many parallel jobs
Ø Each job demands different real-time performance
  - E.g., jobs need to meet their deadlines in RTHS
Ø The goal of parallel real-time scheduling: smartly allocate cores to parallel jobs to meet their deadlines
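How to Analyze a Parallel Job?
Ø An example: computation on an array A with m elements (the DAG is drawn for m = 3)

    int i = 0;
    // Iterations are independent and may run in parallel.
    parallel_for (; i < m; i++) {
        compute( A[i] );
    }
    // Runs only after every iteration has finished.
    bar();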
Directed Acyclic Graph (DAG) Model
Naturally captures programs generated by parallel languages such as Cilk Plus, Intel TBB and OpenMP.
Node: sequential computation
Edge: dependence between nodes
Unit node: single instruction
[The DAG Model slide repeats four times as an animation, adding the legend: available node, completed node, unavailable node.]
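The node states in the legend can be traced with a small simulation. The sketch below is an illustration only and is not taken from the slides: it assumes unit-time nodes, a hypothetical five-node DAG, and a greedy scheduler that in every step runs as many available nodes as there are cores. The function name `simulate_greedy` and the `preds` input format are assumptions made for this example.

```python
def simulate_greedy(preds, num_cores):
    """Run a DAG of unit-time nodes greedily; return the nodes run in each step.

    preds maps every node to the set of nodes it depends on
    (a hypothetical input format chosen for this sketch).
    """
    completed = set()
    steps = []
    while len(completed) < len(preds):
        # A node is available once all of its predecessors are completed;
        # every other uncompleted node is still unavailable.
        available = sorted(
            n for n in preds if n not in completed and preds[n] <= completed
        )
        if not available:
            raise ValueError("graph has a cycle")
        running = available[:num_cores]  # at most one node per core
        steps.append(running)
        completed.update(running)
    return steps


# Hypothetical DAG: 'a' forks into 'b', 'c', 'd', which join at 'e'.
example_preds = {
    "a": set(),
    "b": {"a"},
    "c": {"a"},
    "d": {"a"},
    "e": {"b", "c", "d"},
}
schedule = simulate_greedy(example_preds, num_cores=2)
```

With two cores the three middle nodes need two steps, so the run takes four steps in total; with three cores they all run in the same step and the run takes three.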
Work and Span
Work T1: execution time on one core
Span T∞: execution time on ∞ cores (critical-path length)
For the example DAG: T1 = 18, T∞ = 9
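As a rough illustration of these two definitions, the sketch below computes work and span for a weighted DAG. The graph is a hypothetical one shaped like the parallel_for example with m = 3; the node names and execution times are made up, so the result does not reproduce the T1 = 18, T∞ = 9 of the slide's DAG. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def work_and_span(weights, preds):
    """Return (work, span) of a DAG.

    weights: node -> execution time of that node.
    preds:   node -> iterable of nodes it depends on (missing means none).
    """
    graph = {node: preds.get(node, ()) for node in weights}
    # Work: total execution time, i.e. the running time on one core.
    work = sum(weights.values())
    # Span: longest weighted path, found by visiting nodes in topological order.
    finish = {}
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[node]), default=0)
        finish[node] = start + weights[node]
    span = max(finish.values(), default=0)
    return work, span


# Hypothetical weights: setup, three parallel iterations, then bar().
weights = {"init": 1, "compute_0": 4, "compute_1": 4, "compute_2": 4, "bar": 2}
preds = {
    "compute_0": ["init"],
    "compute_1": ["init"],
    "compute_2": ["init"],
    "bar": ["compute_0", "compute_1", "compute_2"],
}
work, span = work_and_span(weights, preds)
```

Here the work is the sum of all node times (15), while the span follows a single path init, one compute node, bar (7).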
Federated Scheduling (FS)
For parallel tasks, FS has the best bound in terms of schedulability.
FS assigns n_i dedicated cores to each parallel task.
n_i: the minimum number of cores needed for a task to meet its deadline

$n_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil$

where D_i is the deadline (equal to the period), L_i is the worst-case span, and C_i is the worst-case work.
[Diagram labels: tasks; cores]
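The core-count formula can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a task whose work exceeds its deadline, rounding the quotient up to a whole number of cores, and using a made-up deadline of 12 together with the work and span values from the Work and Span slide. The helper names are hypothetical. The second function uses the standard bound for greedy schedulers, completion time ≤ L + (C − L)/n, to confirm that the chosen core count is enough.

```python
import math


def dedicated_cores(work, span, deadline):
    """Cores needed so a parallel task meets its deadline: ceil((C - L) / (D - L))."""
    if span >= deadline:
        raise ValueError("span must be shorter than the deadline")
    # At least one core, even if the formula rounds to zero (assumption of this sketch).
    return max(1, math.ceil((work - span) / (deadline - span)))


def greedy_completion_bound(work, span, cores):
    """Upper bound on completion time under a greedy scheduler on `cores` cores."""
    return span + (work - span) / cores


# C = 18 and L = 9 come from the example DAG; D = 12 is a hypothetical deadline.
n = dedicated_cores(work=18, span=9, deadline=12)
bound = greedy_completion_bound(work=18, span=9, cores=n)
```

With these numbers n is 3 and the bound equals the deadline of 12, whereas two cores would only guarantee completion by 13.5.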
FS Platform
Ø Middleware platform providing FS service in Linux
Ø Works with the GNU OpenMP runtime system
Ø Runs OpenMP programs with minimum modification
[Diagram labels: cores; Linux; Federated Scheduling (FS); OpenMP Runtime (three instances)]
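To make the idea of dedicated cores concrete, the sketch below partitions core IDs among tasks so that no core is shared. It is an assumption-laden illustration, not the FS platform's actual code: the function name, the task names, and the first-fit order are all hypothetical.

```python
def assign_dedicated_cores(core_demands, total_cores):
    """Give each task its own disjoint block of core IDs.

    core_demands: task name -> number of dedicated cores (n_i) it needs.
    Raises ValueError if the machine does not have enough cores.
    """
    assignment = {}
    next_core = 0
    for task, demand in core_demands.items():
        if next_core + demand > total_cores:
            raise ValueError(f"not enough cores for task {task!r}")
        assignment[task] = list(range(next_core, next_core + demand))
        next_core += demand
    return assignment


# Hypothetical demands for two parallel tasks on an 8-core machine.
core_sets = assign_dedicated_cores({"task_a": 3, "task_b": 2}, total_cores=8)
# On Linux, a process could then be pinned to its block with
# os.sched_setaffinity(pid, core_sets["task_a"]); whether the FS platform
# does it this way is not stated in the slides.
```

Because the blocks never overlap, each task's runtime can schedule its own threads without interference from the other tasks.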
Parallelism Improves RTHS Accuracy
An RTHS experiment simulates a nine-story building with a first-story damper.
Ø Previously, sequential processing power limited the rate to 575 Hz
Ø Parallel execution now allows a rate of 3000 Hz
Ø Reduction in error for acceleration and displacement
Ø Parallelism increases accuracy via faster actuation and sensing
[Figure: normalized error (%) versus time (sec), sequential (575 Hz) compared with parallel (3000 Hz).]
System for Interactive Cloud Services
Online system: job arrival times are not known in advance
Objective: maximize the number of jobs that meet a target latency T
[Diagram labels: Query; Aggregator (two instances); Doc. index search; doc; 2nd phase ranking; Snippet generator]
Workload Distribution Has a Long Tail
[Figure: distribution of job sequential execution time in ms (work) for the Bing search workload, with the target latency marked.]
Ø Large jobs must run in parallel to meet target latency
Ø Always run large jobs in full parallelism?
Parallelize Large Jobs According to Load
Tail-Control Strategy: when load is low, run all jobs in parallel; when load is high, run large jobs sequentially.
Latency = Processing Time + Waiting Time
At low load: processing time dominates latency
At high load: waiting time dominates latency
[Diagram: two schedules of requests on core1, core2, and core3 over time, each compared against the target latency; one is labeled "Miss 0 request" and the other "Miss 1 request".]
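The strategy stated above can be written as a simple decision rule. The sketch below is only an illustration of that rule, not the tail-control algorithm as implemented: the thresholds `large_job_ms` and `high_load`, the function names, and the choice to keep small jobs parallel under high load are all assumptions, and a real runtime would not know a job's work in advance.

```python
def choose_parallelism(work_ms, load, num_cores, large_job_ms=50.0, high_load=0.7):
    """Return how many cores a job may use under a tail-control-style rule.

    work_ms:      sequential execution time of the job (assumed known here).
    load:         system utilization between 0 and 1.
    large_job_ms: hypothetical threshold above which a job counts as large.
    high_load:    hypothetical utilization above which load counts as high.
    """
    if load < high_load:
        # Low load: processing time dominates latency, so parallelize everything.
        return num_cores
    # High load: waiting time dominates, so large jobs give up their parallelism.
    return 1 if work_ms >= large_job_ms else num_cores


def latency(processing_ms, waiting_ms):
    """Latency = processing time + waiting time."""
    return processing_ms + waiting_ms


low_load_choice = choose_parallelism(work_ms=200, load=0.2, num_cores=3)
high_load_choice = choose_parallelism(work_ms=200, load=0.9, num_cores=3)
```

The same 200 ms job gets all three cores at low load but only one core at high load, which frees the other cores for the many small jobs that can still meet the target.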
The Inner Workings of Tail-Control
We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks (TBB) and evaluate it on the Bing search workload.
[Animation over three further slides with the same title and text; recoverable labels: Target Latency; default work-stealing.]
Conclusion
Exploit the untapped efficiency in parallel computing platforms and drastically improve the real-time performance of applications.
Ø System guaranteed to meet deadlines for CPS
  - Develop provably good schedulers for parallel applications
  - Incorporate real-time scheduling into parallel runtime system
Ø System optimized to meet target latency for ICS
  - Design and implement strategy to optimize real-time performance