
Mälardalen University
School of Innovation, Design and Engineering

Västerås, Sweden

Thesis for Bachelor of Computer Science

LOAD BALANCING OF PARALLEL TASKS

USING MEMORY BANDWIDTH

RESTRICTIONS

Tommy [email protected]

Linus Sens Ingels, [email protected]

Examiner: Moris Behnam, Mälardalen University, Västerås, Sweden

Supervisor: Jakob Danielsson, Mälardalen University, Västerås, Sweden

June 20, 2019


Abstract

Shared resource contention is a significant problem in multi-core systems and can have a negative impact on the system. Memory contention occurs when the different cores in a processor access the same memory resource, resulting in a conflict. It is possible to limit memory contention through resource reservation, where a part of the system or an application is reserved a partition of the shared resource.

We investigated how applying memory bandwidth restrictions using MemGuard can aid in synchronizing the execution times of parallel tasks, and when memory bandwidth restrictions are applicable. We conducted three experiments. First, we pinpointed when the memory bandwidth saturates a core. Second, we compared the performance of our adaptive memory partitioning scheme against static and no partitioning. Finally, we tested how our adaptive partitioning scheme and static partitioning can isolate a workload from an interfering memory-intensive workload running on a separate core.

As the experiments were only conducted on one system, pinpointing the general point of contention was difficult, since it can differ significantly from system to system. Through our experiments, we see that memory bandwidth partitioning can decrease the execution time of feature detection algorithms, which means that memory bandwidth partitioning can potentially help threads reach their synchronization points simultaneously.


Table of Contents

1. Introduction
   1.1. Research Questions
2. Background
   2.1. Memory
        2.1.1. DRAM
        2.1.2. Memory Bandwidth
        2.1.3. Memory Contention
   2.2. Parallel Computing
   2.3. Related Work
3. Method
4. Implementation
   4.1. Using MemGuard
        4.1.1. Adaptive Memory Bandwidth Partitioning
5. Experiments
   5.1. EX1: Memory Bandwidth Saturation
   5.2. EX2: No Partitioning and Static Partitioning vs. AMBP
   5.3. EX3: Isolation
6. Results
   6.1. EX1: Saturation on the Memory Bus
        6.1.1. TC1 - Point of Saturation
   6.2. EX2: No Partitioning and Static Partitioning vs. AMBP
        6.2.1. Test case 1 - No Partitioning
        6.2.2. Test case 2 - Static Partitioning
        6.2.3. Test case 3 - AMBP
   6.3. EX3: Isolation
        6.3.1. Test case 1 - No Partitioning
        6.3.2. Test case 2 - Static Partitioning
        6.3.3. Test case 3 - AMBP
7. Discussion
8. Conclusions
   8.1. Future Work
References


1. Introduction

Most general-purpose computers use a multi-core architecture to run everyday tasks [1]. An integral part of the multi-core processor is the memory subsystem, used for storing run-time data. The internal memory subsystem, such as the caches, TLBs and the system memory bus, is however often physically shared between the different cores of a multi-core computer, which means that tasks running concurrently on different cores compete for the shared memory bandwidth [2]. The processing time for memory access requests in multi-core systems may vary drastically due to several factors, including the location of the memory and the state of the DRAM (Dynamic Random Access Memory) chips or the DRAM controller [1]. The DRAM controller can use scheduling algorithms based on First-Ready First-Come-First-Serve (FR-FCFS) [1, 3] to boost DRAM throughput by reordering memory requests, which means that requests from one core can affect requests from other cores. Such variability in DRAM throughput and temporal predictability poses significant challenges for real-time systems, because a workload change on one core during run time can invalidate the timing of tasks on other cores.

Resource reservation and reclaiming techniques can be used to assign partitions of a shared resource to different applications [1, 2]. Areas such as CPU and disk management have applied resource reclaiming techniques successfully, using constant-bandwidth servers [4] and high-throughput disk scheduling [5]. Memory bandwidth reservation is less investigated, and only a few solutions [4, 1] are applicable to Commercial Off-The-Shelf (COTS) computers due to their highly dynamic memory service rate.

Memory bandwidth restrictions are very interesting for multi-core applications since they can provide a solution to memory bandwidth being a performance bottleneck for the system [6]. In this study, we investigate how memory bandwidth restrictions can be used to synchronize the execution times of parallel tasks. We also aim to find an answer to when bandwidth restrictions are most suitable to apply. This could be beneficial to real-time applications [6] with strict time constraints.

1.1. Research Questions

We present the research questions as follows:

• RQ1: When does the assigned bandwidth reach a point of saturation for a thread?

• RQ2: How can we use memory bandwidth restrictions to force forked threads to reach their synchronization points simultaneously?


2. Background

Memory management is a complex procedure [7, p. 6] and involves units such as the different levels of cache, Translation Lookaside Buffers (TLB), the memory bus and DRAM. All user-space processes need to have their private section of memory, but can share it with other processes. For the CPU to have fast access to frequently used data and instructions, cache memory is commonly integrated directly into the CPU [8]. This way, the CPU does not have to read from main memory, which results in an overall speedup of a program. Cache memory is typically composed of multiple levels, where each level is slower but larger than the previous. The system can also use disk space as additional memory; this is made possible by the MMU (memory management unit) present in modern CPUs. The MMU enables a memory management technique called virtual memory, where a process accesses memory using a virtual address instead of its physical location. The MMU translates the virtual address to the address of the physical memory location using a page table [7, p. 182]. The page table is handled by the kernel and consists of pages, which are chunks of memory used by processes. Pages are loaded and allocated by the kernel as a process needs them (demand paging). When a process wants to use a page that is not yet ready, a page fault occurs, and the kernel takes control of the CPU from the process to get the page ready. If the desired memory page is in main memory but the MMU does not know where, a minor page fault happens. If the desired page is not in memory at all, meaning that the kernel has to load it from disk, a major page fault occurs. Both minor and major page faults impact performance, particularly in memory-intensive programs.

The following subsections describe the memory system in general, DRAM, memory bandwidth, memory contention, and parallel computing.

2.1. Memory

Memory is one of the primary components in any computer system [9, p. 25]. Memory should be very fast, abundant and at the same time very cheap. However, no current memory technology can satisfy all of the traits mentioned above at the same time. The memory system is therefore built in layers with different speeds and sizes that form a hierarchy (see Figure 1) [10, p. 375]. The structure of this hierarchy is similar to a pyramid consisting of several layers, where the top layers are the closest memory to the CPU and the bottom layers are furthest away from the CPU [9, p. 25–28]. We itemize the hierarchy below:

• The top layer consists of the internal registers of the CPU. The registers have a minimal storage capacity, which is determined by the CPU architecture, and are usually 32x32 or 64x64 bits large.

• The second layer consists of the cache memory. Cache memory holds the most frequently requested cache lines from the main memory (RAM) and gives the CPU fast access to the data. The cache hardware checks if the cache line is present when a program requests to read a memory word. If it finds the corresponding cache line, it does not send a request to the main memory. Such an occurrence is called a cache hit and has a much lower time penalty than a request sent to the main memory, which is called a cache miss. Usually, the cache memory consists of three levels, each larger and slower than the previous level.


• Main memory or RAM (Random Access Memory) is the third layer in the memory hierarchy and stores all the data that is currently being used by the computer. Reading from RAM is faster than reading from secondary storage, but much slower than reading from the cache memory.

• The last layer of the hierarchy is secondary storage (i.e., a hard drive). This storage is non-volatile, which means that the stored content is not lost when the power is switched off. Usually, this memory is the slowest to access.

Figure 1: The memory hierarchy. A pyramid with CPU registers at the top, followed by CPU cache, main memory, and secondary storage at the bottom; memory size increases toward the bottom while speed increases toward the top. Source: Adapted from [10, Fig. 5.3]

2.1.1. DRAM

Dynamic Random Access Memory (DRAM), as seen in Figure 2, stores values in a cell as a charge in a single capacitor [10, p. 379–380], which means DRAM must be refreshed periodically to maintain data consistency. An entire row is refreshed with a read cycle followed by a write cycle. DRAMs are internally organized in multiple banks, each consisting of a series of rows of cells. A precharge command is sent to open or close a bank. An activate command containing the row address is sent to have a row transferred to a buffer. DRAMs buffer rows for repeated access, which improves performance by decreasing the access time considerably. Once a row is in a buffer, it can be further transferred by successive column addresses, usually 4–16 bits. By specifying the starting address, it is also possible to do a block transfer. These operations (precharge, activate, accessing a column) can take anywhere from just a few to several dozen nanoseconds [11, p. 22–23]. With multiple banks within the same channel, different banks can handle any operation at any given time, but only one bank can read or write column data on the bus at a time. Multiple independent channels can be used to handle simultaneous reads and writes on the memory bus, increasing concurrency.

Figure 2: Internal organization of DRAM, showing the ACT (activate), PRE (precharge) and RD/WR (read/write) commands that select rows and columns within a bank. Source: Adapted from [10, Fig. 5.4]

2.1.2. Memory Bandwidth

Memory bandwidth is the number of bytes that can be transferred in a given time frame. The bandwidth of memory systems using DRAM is usually not a big problem today, as data rates have increased substantially over the generations and multiple channels are used [11, p. 15–16]. Still, there are many applications (e.g., video processing) where much higher bandwidth is useful. The processor cores of a system share the memory bandwidth, as depicted in Figure 3, meaning that tasks running concurrently on different cores have to compete for the available bandwidth [2].

There are different ways of assigning memory bandwidth to a shared memory interface (temporal scheduling) [12]. One is static partitioning, where the assignment of memory bandwidth to each core is determined offline and is unchanged over time. If the bandwidth is instead subject to scheduling, it is dynamically assigned. With dynamic bandwidth assignment, an assignment scheme can be made that follows the memory requirements of a scheduled workload over time, which in turn might improve performance significantly.

Figure 3: Shared memory system. Each core (CPU1, CPU2, ..., CPUn) has private L1 and L2 caches, while the L3 cache, the memory bus, and main memory are shared by all cores.


2.1.3. Memory Contention

Memory bus contention occurs when the cores in a processor compete for the memory bandwidth in a system [13]. One of the factors causing memory bus contention is that the number of cores and the computational power of a processor increase at a much faster rate than the available memory bandwidth for every new generation of technology [13, 14]. This means more cores are demanding memory bandwidth at the same time, resulting in bus contention, which can negatively impact the whole system through higher latencies and worse throughput.

Tudor et al. performed various experiments to get a better understanding of how memory bus contention affects programs running in parallel. They discovered that the effect of memory bus contention is proportional to the number of active cores and the problem size, and also observed a 1000% increase in cache misses when executing on a system with 48 cores.

2.2. Parallel Computing

Parallel computing can be achieved through various approaches; one of these is the fork-join model [15], as seen in Figure 4. Fork-join divides a control flow into multiple independent flows. After a flow has finished its calculations, it is placed in a join state, where it waits for the remaining flows. After the join, one flow continues. Fork-join models are often based on divide-and-conquer algorithms. These algorithms divide the problem into smaller independent sub-problems running in parallel, and as they complete their calculations, the results are combined into a final solution.

Figure 4: An example of the fork-join model where the workload is divided into n threads. A single thread T0 prepares the workload and forks it to threads T0 ... Tn; each thread executes the algorithm, the threads synchronize, and T0 writes back the data. Source: Adapted from [16, Fig. 3]

Parallelizing applications using a fork-join model can be an efficient way of increasingapplication throughput. The fork-join model, however, presents two problems:

Memory usage - If an application is memory intensive, there is a risk that the potential speedup gained from parallelism vanishes, or at least decreases, due to memory contention [17]. Unfavorable impacts such as resource contention are a direct consequence of shared resources and cache memory, which may lead to reduced performance when applications run in parallel [18]. Eklov et al. refer to recent studies suggesting that applications with higher memory bandwidth demands generate more memory contention and are also more sensitive to it. Results from these studies show that applications can experience slowdowns of 3–23% when memory contention occurs.

Synchronization - The fork-join model is commonly used to parallelize workloads for multi-core systems. It has a starting point, the fork stage, where an application is divided into subthreads. The end stage is called join, at which all threads synchronize and return to the initial state. This presents a problem, since the fork-join model will only be as fast as its slowest executing thread. Using bandwidth restrictions, it may be possible to load-balance threads in such a way that they reach the synchronization point at the same time [1, 2, 12, 19, 20].

2.3. Related Work

Behnam et al. [19] and Inam et al. [21] developed a Multi-Resource Server (MRS) that can be used to reserve CPU and memory bandwidth on multi-core systems with low overhead. The servers are useful for both hard and soft real-time systems and can partition bandwidth and help with composability and prediction of the system. The authors point out that contemporary scheduling of real-time tasks on multi-core architectures is inherently unpredictable, and tasks performed by another core may negatively impact the computer system. One of the causes for this is contention for physically shared memory, and the fact that general-purpose hardware has no techniques to combat the theft of bandwidth from other subsystems. The MRS opens up the possibility of encapsulating a complete subsystem of tasks in a server, which shares an allocation of CPU- and memory-bandwidth resources. Yun et al. [2, 1], who developed a bandwidth reservation system called MemGuard, argue that the MRS approach does not address the problem of wasted bandwidth. In MemGuard, the available memory bandwidth consists of guaranteed and best-effort memory bandwidth [1]. Regardless of the memory access activities of other cores, a core is always guaranteed the minimum bandwidth that it has reserved. After each core's guaranteed bandwidth has been met in a given period, cores can be assigned additional best-effort bandwidth. The authors demonstrated that MemGuard can provide isolation of memory performance while having minimal influence on overall throughput.

In contrast to Behnam et al., since we use MemGuard, which does not restrict resources at the task level, our method restricts bandwidth for a whole core. Their work also differs from ours in that only a portion of the resources can be allocated to a server, whereas we measure the maximum bandwidth of the system and redistribute all of it across the cores. Since we are using MemGuard, additional best-effort bandwidth can be assigned dynamically if needed.

Inam et al. wrote an article about the problems of memory contention in multi-core systems [20]. They believe that measuring the consumption of memory bandwidth may play a key role in understanding and resolving performance bottlenecks in multi-core applications. They also envision that tracking the consumed memory bandwidth will help in making more intelligent scheduling decisions, preventing memory contention and evening out the load on the memory bus by spreading memory accesses over time. Performance counters, available in most modern processors, can be used to measure the consumed memory bandwidth by collecting low-level information about events occurring in the CPU during execution. The authors continue by saying that continuously determining and tracking the consumed memory bandwidth is challenging without dedicated external hardware that monitors the memory bus. However, rough estimates of used memory bandwidth and events may be enough for soft real-time systems to get a fair assessment of consumed memory bandwidth. This becomes much harder for hard real-time systems, which require safe estimates, meaning that no underestimates are allowed. There is no event that exactly shows how much memory bandwidth is consumed; multiple events have to be monitored to determine the number and size of memory accesses.

Inam's work is very relevant with respect to how we measure the memory bandwidth usage of the system. We use performance counters to count last-level cache misses and use the data to get a rough estimate of the used bandwidth for a core. According to Inam et al., this method may not be sufficient for tracking the consumed bandwidth, which can be a limitation of our work.

Agrawal et al. have studied the problem of determining the worst-case response time of sequential tasks, each with a different bandwidth-to-core assignment [12]. Their article provides a general framework for performing response-time analysis under dynamically partitioned bandwidth. They then demonstrate how their proposed analysis technique can be used in a time-triggered memory scenario, showing that dynamic allocation of bandwidth outperforms static allocation under varying memory-intensive workloads. A memory bandwidth regulation model, like MemGuard, was used to do this; these models use hardware performance counters to monitor the number of memory transactions performed by each core. The idea of memory bandwidth regulation is that a core is given a specified amount of bandwidth expressed as the number of memory transactions it is allowed to perform during a regulation period.

This work shares similarities with ours in terms of static and dynamic memory bandwidth partitioning with the use of MemGuard. What differs is that we dynamically partition bandwidth by predicting future needs, while they already know the characteristics of the workload and can therefore distribute the bandwidth more efficiently.

3. Method

Memory-intensive parallel applications and resource contention can have a significant impact on the performance of a computer system; such a scenario can, however, be mitigated using memory bandwidth partitioning. If memory bandwidth is partitioned between the cores of a multi-core system, a core is saturated when it can no longer utilize all of the bandwidth it has been assigned. This may cause various problems, such as synchronization and performance issues, due to other cores not being able to use memory bandwidth that is reserved for the saturated core.

This study aims to investigate how memory bandwidth partitioning can aid in load balancing of parallel tasks, and to find the point of saturation for a thread that is bound to a core. We also aim to implement a solution that continuously partitions bandwidth between cores based on an algorithm that takes previous execution times and bandwidth into account.

The methodology used to answer the research questions is based on action research [22, 23] (see Figure 5). Below follows a summary of each phase of the iterations:


Figure 5: Model of our work phases, an iterative cycle of study and plan, take action, collect and analyze evidence, and reflect. Source: Adapted from [22, Fig. 2] and [23, Fig. 1]

1. Study and Plan: In the study and plan phase, we investigated state-of-the-art research on memory contention in multi-core systems and on memory bandwidth restriction methods. We discovered that new restriction methods have been developed in the form of a Multi-Resource Server (MRS) [19, 21] and MemGuard [1]. The MRS can encapsulate a subsystem of tasks in order to combat the theft of memory bandwidth from other subsystems, and MemGuard can partition memory bandwidth by first assigning guaranteed bandwidth to a core and then additional best-effort bandwidth once the guaranteed bandwidth has been met in a given period. This provides a solution to wasted memory bandwidth. Our initial search on memory contention led us to an article by R. Inam et al. [20] where they discuss the use of hardware performance counters to measure bandwidth consumption. They believe that this might play a key role in further understanding and resolving performance bottlenecks in multi-core applications.

Based on the information gathered from our investigation of state-of-the-art research, we refined the research problem. We started by discussing the problems of forked threads, where memory contention and synchronization issues may decrease the potential speedup gained from parallelism, and how memory bandwidth restrictions may be able to balance the execution times of forked threads. Based on the research problem, the research questions were then formed.

2. Take Action: To answer our research questions, we implemented a memory bandwidth partitioning scheme called Adaptive Memory Bandwidth Partitioning, referred to as AMBP in the subsequent sections of this thesis. AMBP partitions memory bandwidth between the cores of a system based on the execution times of parallel tasks.

3. Collect and Analyze Evidence: We conducted three experiments using memory-heavy workloads to investigate the performance of the different memory bandwidth partitioning schemes. We monitored the memory bandwidth usage of each core during run time. The memory bandwidth usage was calculated with the help of performance counters reporting the number of last-level cache (LLC) misses.


4. Reflect: Finally, the results gained from the experiments were validated and compared against each other.

4. Implementation

OpenCV is an open-source library for computer vision and machine learning [24]. Our implementation uses OpenCV's SIFT [25] object detection algorithm to process frames from a video, which is costly in terms of memory bandwidth usage. In OpenCV, a frame is represented as a matrix with the height and width of the frame. Each frame is divided into several sub-frames called regions of interest (ROI), which are assigned to different threads bound to separate cores.

4.1. Using MemGuard

MemGuard can monitor and regulate the memory bandwidth usage of each CPU core by utilizing hardware performance counters [2]. In MemGuard, the available memory bandwidth consists of the guaranteed and the best-effort memory bandwidth [1]. Regardless of the memory access activities of other cores, a core will always be guaranteed the minimum bandwidth that it has reserved. After each core's guaranteed bandwidth has been met in a certain period, cores can be assigned additional best-effort bandwidth [2].

MemGuard is loaded into the Linux kernel manually as a kernel module using the insmod command. Several configuration interfaces can be accessed after MemGuard has been loaded, through three files: Control, Usage, and Limit. Control is used by the module to set or switch between different settings, such as the maximum bandwidth usage of the cores, and also the reservation and reclaim controls that determine how much bandwidth a core can use and how it can distribute surplus bandwidth [1]. Usage displays how much memory bandwidth a core has used in a period and the total budget for each core. Limit is used to set a different memory bandwidth limit in MB/s for each core. We assign bandwidth using percentages of the system's maximum bandwidth, because we found it easier to determine how the bandwidth was divided between the cores this way. The percentage is later multiplied by the maximum bandwidth to get the value in MB/s.

Performance counters are used to collect low-level data about different activities in the CPU, such as the number of cache misses or prefetch misses [20]. We use perf events, Linux-based performance counters that have access to multiple software and hardware events [26]. Once the application executes, the performance counters are initiated and started. The counters are reset as threads are created, and the total number of LLC misses for a core is read before the thread joins. An LLC miss leads to the corresponding block of data being fetched from DRAM. We calculate the memory bandwidth usage of a core from the LLC misses together with the CPU's cache line size and the corresponding thread's execution time.

4.1.1. Adaptive Memory Bandwidth Partitioning

Execution times of tasks can fluctuate quite drastically, especially under heavy memory loads such as feature detection algorithms. For example, consider a feature detection algorithm [17] which is run frame by frame. There is a high chance that the execution times of the currently executed frame and the next coming frames differ significantly. It can therefore be challenging to create bandwidth partitions based on the execution time of only one frame. It can be even more difficult when the frame is forked to multiple cores, as exemplified earlier in Figure 4, since the thread execution can become dependent on the frame characteristics. To solve the execution time fluctuation issue, we implement our adaptive memory bandwidth partitioning (AMBP) using a Weighted Moving Average (WMA), described in Equation 1, of the last five execution times for each core. The WMA partially solves the fluctuation issue, since the memory bandwidth partitions assigned by AMBP are based on a larger comparison base of execution times, rather than just one, which in turn makes AMBP react at a slower pace. Figure 6 depicts the system architecture of our program.

Figure 6: System architecture. (The diagram shows our contribution, AMBP, in userspace between the monitored application and the MemGuard kernel module: AMBP monitors application characteristics, calculates the WMA for each core, calculates bandwidth partitions for each core, and actuates the partitions.)

The system model shows how AMBP is implemented and how it is connected to the monitored applications and to MemGuard.

The first step of our model depicts the monitored application. AMBP receives the execution times of the frame from the application. The second step in our model is AMBP, which consists of four different steps:

1. AMBP receives the values from the monitored application; this data is used in the next step to calculate the WMA for each core.

2. AMBP calculates the WMA for each core by using the values received from the application. The formula used for calculating the WMA execution times can be described as:

\[
\mathrm{WMA}_{\mathit{core}} = \frac{\sum_{i=1}^{n} \text{execution time}_i \cdot i}{\sum_{i=1}^{n} i} \tag{1}
\]

where n is the number of values the WMA is based on, with execution time_1 being the value from the least recently processed frame, thus having the lowest weight.


3. The bandwidth partitions for each core are calculated according to:

\[
\mathrm{BW}_{\mathit{core}} = \frac{\mathrm{WMA}_{\mathit{core}}}{\mathrm{WMA}_{\text{all cores}}} \cdot \max \mathrm{BW} \tag{2}
\]

where max BW is the maximum memory bandwidth of the system and BW_core is the new guaranteed bandwidth for a certain core.

4. AMBP actuates the new bandwidth partitions by sending them to MemGuard.

AMBP is implemented as an interface to MemGuard. The last step of our model shows that MemGuard receives the calculated partitions from AMBP and actuates the new partitions.
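As a minimal sketch of the four steps above, the following code implements Equations 1 and 2, with the actuation step reduced to returning the computed MB/s partitions (the hand-off to MemGuard is platform-specific and omitted):

```python
# Sketch of the AMBP loop: WMA over the last five execution times per core
# (Equation 1), then proportional bandwidth partitions (Equation 2).
HISTORY = 5  # number of frames the WMA is based on

def wma(exec_times):
    """Weighted moving average; the most recent time gets the highest weight."""
    times = exec_times[-HISTORY:]
    n = len(times)
    weights = range(1, n + 1)  # least recent frame -> weight 1
    return sum(t * w for t, w in zip(times, weights)) / sum(weights)

def partitions(per_core_times, max_bw_mb):
    """Split max_bw_mb among cores in proportion to their WMA execution time."""
    wmas = [wma(t) for t in per_core_times]
    total = sum(wmas)
    return [w / total * max_bw_mb for w in wmas]

# Example: cores whose ROIs take three times as long get three times the
# bandwidth of the faster cores.
history = [[1.0] * 5, [1.0] * 5, [3.0] * 5, [3.0] * 5]
print(partitions(history, 12000))  # [1500.0, 1500.0, 4500.0, 4500.0]
```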


5. Experiments

We conducted three experiments in order to answer our research questions. All of the experiments were repeated four times to secure the integrity of the collected data. The experiments use OpenCV's SIFT corner detection algorithm to process frames from two different videos (see Figure 7): a lower-resolution video of 1280x720 where some ROIs should be faster to process than others, and a higher-resolution video of 1920x1080 with a uniform corner distribution among the different ROIs. Each frame is divided into four ROIs which are assigned to different threads, where each thread is bound to a separate CPU core. In each test run, a total of 100 frames are processed.

Figure 7: A processed frame from each video used in the experiments: (a) 1280x720, (b) 1920x1080. Keypoints of different colors belong to different ROIs.
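A minimal sketch of the frame division, assuming a quadrant layout (the thesis only states that each frame is split into four ROIs bound to separate cores):

```python
# Sketch: divide a frame into four ROIs, one per worker thread/core.
# A quadrant layout is assumed here; the text only states that each
# frame is split into four ROIs bound to separate CPU cores.
def quadrant_rois(width, height):
    """Return four (x, y, w, h) rectangles covering the frame."""
    hw, hh = width // 2, height // 2
    return [
        (0,  0,  hw,         hh),           # top-left     -> core 0
        (hw, 0,  width - hw, hh),           # top-right    -> core 1
        (0,  hh, hw,         height - hh),  # bottom-left  -> core 2
        (hw, hh, width - hw, height - hh),  # bottom-right -> core 3
    ]

rois = quadrant_rois(1920, 1080)
print(rois)
# Together the four ROIs cover the whole 1920x1080 frame exactly once.
assert sum(w * h for _, _, w, h in rois) == 1920 * 1080
```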

When the SIFT application is set to use memory bus partitioning, MemGuard partitions the guaranteed bandwidth using either AMBP or static partitioning. Static partitioning was used in all of the experiments so that we could control the bandwidth assignment of each core manually in between executions. The maximum bandwidth is measured by the SIFT application using mbw (Memory BandWidth benchmark) [27].

The testing platform is described in Table 1. HyperThreading, ACPI, and CPU idle were disabled during the experiments to limit the number of factors that could affect the results.

CPU Model               Intel Core i7 7700HQ
Frequency               3.4 GHz
Number of Cores         4
Number of Threads       8
L1 Cache Size           256 KiB
L2 Cache Size           1 MiB
L3 Cache Size           6 MiB
Main Memory Channels    2
Main Memory Type        DDR4 SDRAM
Main Memory Size        2 x 4 GB
Main Memory Speed       2400 MHz
Secondary Storage Type  SSD

Table 1: Hardware platform


5.1. EX1: Memory Bandwidth Saturation

Experiment 1 (EX1) consists of one test case with two purposes: to find out when the system should use memory bandwidth restrictions, and when the assigned bandwidth saturates a core, i.e., the point from which additional memory bus partitions will not be beneficial to the execution time. The 1920x1080 resolution video was used to load the memory bus as much as possible, with SIFT running on cores 0-3.

TC1 - Point of Saturation
TC1 measures the average time it takes for a specific core to process a frame when a statically partitioned share of the memory bandwidth is assigned to the core. The test is conducted over an interval of 10% to 50% on the first core, to see how the execution time is affected by memory reservation restraints. By comparing the core's execution times for a frame, we should see when the assigned memory bandwidth saturates the core. N is the variable of the assigned bandwidth for the specific core.

1. Insert the MemGuard module into the kernel

2. Enable MemGuard

3. N = 10%

4. Assign N of the maximum bandwidth to the first core and distribute the rest of the bandwidth evenly among the other three cores.

5. Execute the application

6. N = N + 5%

7. Return to step 4 until 50% assigned bandwidth has been reached
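The sweep defined by steps 3-7 can be expressed compactly; per step 4, the bandwidth remaining after core 0's share is split evenly over the other three cores:

```python
# Sketch of the EX1 sweep: core 0 is assigned N% for N = 10, 15, ..., 50,
# and the other three cores share the remainder evenly (step 4).
def saturation_sweep():
    for n in range(10, 55, 5):
        rest = (100 - n) / 3  # even split among cores 1-3
        yield [n, rest, rest, rest]

for assignment in saturation_sweep():
    # In the real experiment each assignment would be actuated through
    # MemGuard before executing the SIFT application.
    print(assignment)
```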

5.2. EX2: No Partitioning and Static Partitioning vs. AMBP

Experiment 2 consists of three test cases that measure the execution time of the application when using static partitioning, no partitioning, and AMBP. The purpose of this experiment is to investigate AMBP's performance against static and no partitioning in terms of decreasing the execution time of threads. All of the test cases were performed using SIFT on cores 0-3. We used both the 1280x720 and the 1920x1080 video in this experiment.

TC1 - No Partitioning
Test case 1 measures the total execution time of the program when no memory bandwidth partitioning is used.

1. Execute the application

TC2 - Static Partitioning
Test case 2 measures the total execution time of the application when the memory bandwidth is statically partitioned to 25% for each core.

1. Insert the MemGuard module into the kernel

2. Enable MemGuard.

3. Statically assign each core 25% of the maximum bandwidth.


4. Execute the application

TC3 - AMBP
Test case 3 measures the total execution time of the application when the memory bandwidth is partitioned between the cores using AMBP.

1. Insert the MemGuard module into the kernel

2. Enable MemGuard and set application to use AMBP.

3. Execute the application

5.3. EX3: Isolation

The third experiment has three test cases that determine whether the SIFT application is isolated when an interfering workload runs on a separate core. This experiment tests how a high bit rate video impacts the execution time of the SIFT algorithm running on cores 0-2. The first two test cases show how no partitioning and static partitioning handle the isolated workload compared to AMBP.

TC1 - No Partitioning
Test case 1 shows how the program handles the isolated workload when no partitioning is used.

1. Start playing a high bit rate video on core 3

2. Execute the application

TC2 - Static Partitioning
Test case 2 shows how the program handles the isolated workload when static partitioning is used.

1. Insert the MemGuard module into the kernel

2. Enable MemGuard

3. Statically assign 30% of the maximum bandwidth as guaranteed memory bandwidth to each of cores 0-2 and 10% to core 3.

4. Start playing a high bit rate video on core 3

5. Execute the application

TC3 - AMBP
Test case 3 shows how the program handles the isolated workload when AMBP is used.

1. Insert the MemGuard module into the kernel

2. Enable MemGuard and set application to use AMBP

3. Statically assign 10% of the maximum bandwidth as guaranteed memory bandwidth to core 3, and let the program dynamically partition the remaining bandwidth among cores 0-2.

4. Start playing a high bit rate video on core 3

5. Execute the application


6. Results

This section presents the data collected from the conducted experiments and is structured as follows: Section 6.1 presents the results from the saturation experiment, Section 6.2 presents the No Partitioning and Static Partitioning vs. AMBP experiment, and Section 6.3 presents the isolation experiment.

6.1. EX1: Saturation on the Memory Bus

The data from TC1 shows that assigning more bandwidth to the core improves the execution time on that core. At around 25%, this improvement almost became non-existent, which indicates that the core is saturated and does not need more bandwidth. Below follows a more detailed description of TC1.

6.1.1. TC1 - Point of Saturation

TC1 is repeated over an interval of 10% to 50% of bandwidth assigned to one of the cores. Each iteration increases the assigned bandwidth by 5%. The results presented in Figure 8 and Table 2 show that the average execution time decreases and the average bandwidth usage increases as the core is assigned more bandwidth, all the way up to 50%. However, at around 25% assigned bandwidth, the core's used bandwidth ceased to increase as it was assigned additional bandwidth. The same pattern can be observed for the average execution time of the core; between 20%-25% assigned bandwidth the improvement gradually becomes almost non-existent.

Figure 8: Point of Saturation, test case 1 - Execution times to process a frame for a core with 10% - 50% bandwidth assigned to it. This is a visual example to show that there is a point of saturation for the core. No average values are depicted in this graph.


Assigned bandwidth   Avg. execution time (s)   Avg. used bandwidth
10%                  0.295                     10.38%
15%                  0.243                     13.45%
20%                  0.218                     15.70%
25%                  0.210                     19.09%
30%                  0.201                     19.85%
35%                  0.199                     20.37%
40%                  0.189                     21.61%
45%                  0.187                     22.79%
50%                  0.185                     23.19%

Table 2: Point of Saturation, test case 1 - Average execution time to process a frame and the used bandwidth for a core with 10% - 50% of guaranteed bandwidth assigned to it.

Figure 8 shows that when a core is assigned 50% of the maximum bandwidth, the execution times start to vary more frame-to-frame. These variations are not as prominent in the other cores.

6.2. EX2: No Partitioning and Static Partitioning vs. AMBP

When processing the 1280x720 video (see Table 3), both static partitioning and AMBP use a higher amount of bandwidth on average and have lower execution times than no partitioning. Compared to static partitioning, AMBP has a clear edge in both of these measurements, particularly when best-effort was used. The results from processing the 1920x1080 video (see Table 4) differ in terms of performance, as static partitioning and AMBP are similar regardless of whether best-effort is used or not. No partitioning performs best, with at least a 16% improvement in execution time compared to AMBP and static partitioning when using best-effort, which is a complete turnaround in contrast to when the 1280x720 video was used.

Partitioning   Best-effort   Max 30%   Avg. used BW (%)        Exec. time (s)
None           n/a           n/a       71.91 (68.32 - 75.56)   11.03 (10.81 - 11.26)
AMBP           No            No        82.36 (79.15 - 85.73)   10.54 (10.13 - 10.94)
AMBP           No            Yes       81.82 (74.57 - 89.43)   10.46 (9.59 - 11.26)
AMBP           Yes           No        86.15 (79.73 - 94.25)   10.24 (9.68 - 10.91)
AMBP           Yes           Yes       85.02 (79.29 - 93.13)   10.34 (9.42 - 10.92)
Static         No            n/a       73.58 (63.62 - 84.21)   11.46 (10.03 - 12.83)
Static         Yes           n/a       75.53 (71.25 - 78.31)   11.48 (11.01 - 11.84)

Table 3: Average used bandwidth and total execution times for all test cases when running the 1280x720 resolution video. Minimum/maximum values are shown in parentheses.


Partitioning   Best-effort   Max 30%   Avg. used BW (%)        Exec. time (s)
None           n/a           n/a       88.08 (85.97 - 89.44)   18.40 (18.18 - 18.60)
AMBP           No            No        74.61 (72.10 - 80.77)   24.27 (23.49 - 24.73)
AMBP           No            Yes       75.80 (72.71 - 78.45)   23.52 (22.92 - 24.16)
AMBP           Yes           No        82.39 (79.11 - 85.62)   21.41 (20.72 - 22.86)
AMBP           Yes           Yes       83.47 (75.52 - 87.35)   21.34 (20.45 - 23.04)
Static         No            n/a       76.58 (73.58 - 80.35)   22.99 (22.48 - 23.64)
Static         Yes           n/a       82.34 (80.60 - 83.88)   21.22 (20.91 - 21.46)

Table 4: Average used bandwidth and total execution times for all test cases when running the 1920x1080 resolution video. Minimum/maximum values are shown in parentheses.

6.2.1. Test case 1 - No Partitioning

In our first test case we test the performance with no partitioning. The results (see Figures 9 and 10) show that the execution times when running the 1920x1080 resolution video are more stable than when running the 1280x720 resolution video.

Figure 9: Test case 1 - Execution time of a frame using no partitioning for each core. The 1280x720 resolution video was used in this test. No average values are depicted in this graph.


Figure 10: Test case 1 - Execution time of a frame using no partitioning for each core. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.

6.2.2. Test case 2 - Static Partitioning

In test case 2 we test the performance of static partitioning. Each core has 25% of the maximum bandwidth assigned. Figures 11 and 12 show that using best-effort decreases the execution time when running both the lower and the higher resolution video.

Figure 11: Test case 2 - Execution time of a frame using a 25% static partition for each core. The 1280x720 resolution video was used in this test. No average values are depicted in this graph.


Figure 12: Test case 2 - Execution time of a frame using a 25% static partition for each core. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.

6.2.3. Test case 3 - AMBP

Test case 3 tests the performance of AMBP. Figures 13 and 14 show that using best-effort or setting a limit of 30% bandwidth per core has little to no impact on the execution times. However, during the high resolution test, a few frames show that best-effort can greatly improve the execution time, although on very few occasions.

Figure 13: Test case 3 - Execution time of a frame using AMBP for each core. The test shows the execution time when a core can use at most 30% of the maximum bandwidth and when the usage is unrestricted. The 1280x720 resolution video was used in this test. No average values are depicted in this graph.


Figure 14: Test case 3 - Execution time of a frame using AMBP partitioning for each core. The test shows the execution time when a core can use at most 30% of the maximum bandwidth and when the usage is unrestricted. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.

6.3. Experiment 3: Isolation

Figures 16 and 17 show that static partitioning and AMBP have very similar performance during the isolation experiment, with consistent minor changes in the execution times of the frames. This behavior continues when using best-effort, but with a lower execution time overall. With no partitioning (see Figure 15), the execution times show negligible variance for the majority of the frames, but increase relatively strongly during some periods.

6.3.1. Test case 1 - No Partitioning

With no partitioning (Figure 15), the execution times show that the program is very stable during the majority of the execution, with some periodic discrepancies.


Figure 15: Test case 1 - Execution time of a frame using no partitioning when an interfering workload runs on core 3. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.

6.3.2. Test case 2 - Static Partitioning

Figure 16 shows that there is no discernible interference from the core playing a high bit rate video when static partitioning is used.

Figure 16: Test case 2 - Execution time of a frame using static partitioning when an interfering workload runs on core 3. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.

6.3.3. Test case 3 - AMBP

With AMBP (Figure 17), the execution times show more variance than with static partitioning but are still consistent.


Figure 17: Test case 3 - Execution time of a frame using AMBP when an interfering workload runs on core 3. The 1920x1080 resolution video was used in this test. No average values are depicted in this graph.


7. Discussion

Based on the results from experiment 1, the execution times decrease on average as the assigned bandwidth is increased. Table 2 shows that the average used bandwidth is not proportional to the bandwidth assigned to the core. This means that a core on average rarely uses more than 25% of the maximum bandwidth when processing a frame with our current workload. However, increasing the assigned bandwidth allows a core to use more bandwidth as needed, meaning that the execution times should decrease. The point of saturation is hard to pinpoint, but according to Figure 8, the system saturates at roughly 30%, since this is where the average execution time stabilizes. The results of this test presumably vary greatly depending on the processing speed of the CPU used, and how much of its processing power is needed to process a frame. In Figure 8, we can see that when a core is assigned 50% of the maximum bandwidth, it becomes more unstable. We think the limited bandwidth on the other cores may cause this instability by forcing the Linux task scheduler to schedule processes to the core with higher bandwidth, thus potentially increasing the execution time on that core. It could also be theorized that this is a cache-related problem. However, we argue that this is not the case: the frames will always fill the cache memory of our test platform, since the size of a frame is around 6 MB (1920x1080x3 bytes) and the LLC is 6 MiB.
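The cache argument at the end of this paragraph rests on simple arithmetic, sketched below; 3 bytes per pixel assumes 8-bit BGR frames as decoded by OpenCV.

```python
# Sketch: compare the size of one decoded 1920x1080 frame against the 6 MiB
# LLC of the test platform (3 bytes/pixel assumes 8-bit BGR data).
frame_bytes = 1920 * 1080 * 3   # 6,220,800 bytes, about 6.2 MB
llc_bytes = 6 * 1024 * 1024     # 6 MiB = 6,291,456 bytes

print(frame_bytes, llc_bytes, frame_bytes / llc_bytes)
# A single frame occupies nearly the entire LLC, so consecutive frames
# cannot stay cache-resident alongside the algorithm's working set.
```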

Experiment 2 compares the performance of AMBP to no bandwidth partitioning and to static bandwidth partitioning. The results gathered from all of the test cases show that AMBP utilizes the bandwidth better than no partitioning when processing the 1280x720 resolution video. The results from the 1920x1080 resolution video, however, are different in terms of the performance of AMBP. AMBP has similar performance to static partitioning, with no partitioning having the highest performance: at least a 16% improvement in execution time compared to AMBP and static partitioning when best-effort is used. We believe that the 1280x720 resolution video does not saturate the memory bus, which makes the bandwidth partitioning less relevant. However, when the 1920x1080 resolution video is processed, the memory bus is saturated, which could mean that the overhead from processing the video becomes more significant. Another possible cause of this behavior is that the way we measure the maximum available bandwidth of the system might not be optimal. With no partitioning, there is no bandwidth usage limit, meaning that the system might be able to use more memory bandwidth. On average, we see that using best-effort on top of guaranteed bandwidth is crucial for better performance, as less bandwidth stays unused and the cores might be able to use more than the maximum measured bandwidth. As best-effort always requires that each core first be satisfied in its guaranteed part of the bandwidth, no partitioning should still consistently be able to cross this limit. Setting a maximum of 30% guaranteed bandwidth (where the point of saturation could be) when using AMBP did not result in improved performance. This makes sense, since the used bandwidth rarely (or never) reached over 30% per core for a frame. Using previous thread execution times as a basis for AMBP yielded a high variance in the frame execution times. As the results show that AMBP has potential to reduce execution times compared to no partitioning, we think improving the AMBP algorithm can provide more visible results supporting the potential decrease in execution time.

In our third and last experiment, we tested how the application handles interference from another bandwidth-heavy process executed on a separate core. The results from EX3.TC2 and EX3.TC3 both show consistent execution times between frames, with a moderately higher variance using AMBP. Comparing these results to test case 1, where no partitioning was used, we see that they have more consistent execution times. We argue that the inconsistency in test case 1 could be caused by interference from the bandwidth-heavy process. Since we do not see the same inconsistency when using bandwidth partitioning, this could imply that the OpenCV workload was never affected by the isolated bandwidth-heavy process. We suspected that the results would look like this, since the bandwidth-heavy process was isolated to a separate core with a statically assigned part of the maximum bandwidth, meaning it could take neither processing power from the cores used by the OpenCV workload nor guaranteed bandwidth reserved for them. Static partitioning, however, performed best for pure isolation purposes due to the small variances in execution time.


8. Conclusions

We have investigated when it is suitable to apply memory bandwidth restrictions. Using our framework AMBP, we successfully show how memory bandwidth restrictions can be used to synchronize the execution times of forked threads.

We investigated a thread's point of saturation when executing the SIFT algorithm, i.e., the point from which additional memory bus partitions will not be beneficial to the execution time. In our case, the point of saturation occurs at around 30% used bandwidth. We argue that bandwidth restrictions are suitable to apply to parallel tasks when an interfering application executes on an adjacent core. We statically assigned 10% of the maximum bandwidth to the isolated core, which means that the SIFT application shares the remaining 90% of the bandwidth. Restricting the bandwidth usage to 30% per core for the remaining cores achieved the best benefits for the SIFT application. Assigning an additional 5% to each core would not make a significant impact, as seen in Table 2. However, making a general cross-platform estimation of the point of saturation is hard. The point of saturation can depend on several different factors; memory bus capacity and CPU clock frequency can, for example, differ dramatically. We therefore conclude that bandwidth restrictions can help in isolating tasks when an interfering task exceeds the point of saturation, leading to more predictable execution times, which answers RQ1.

Based on the results from the experiments, we can see that memory bandwidth partitioning can decrease the execution time of frames to the point that they are lower than when no bandwidth partitioning is used. However, the results show that the average execution time of the SIFT application is lower when no bandwidth partitioning is used, meaning that an algorithm based solely on WMA execution times to predict the future bandwidth requirements of a core is not the optimal solution. Decreased average execution times could mean that the threads are more synchronized, which suggests that memory bandwidth partitioning can potentially help the threads reach their synchronization points simultaneously. These results answer RQ2.

8.1. Future Work

We think that future work could make use of a more extensive variety of workloads when conducting experiments. Using different types of workloads may yield more evident results than using only OpenCV as a workload, seeing that the point of saturation was fairly low on our test system. Using other test systems should also make a difference in how much bandwidth can be utilized before reaching the point of saturation.

The results for AMBP show improvements compared to no partitioning in some instances, but it is clear that further improvements to the algorithm are needed. For example, basing the partitioning on retired instructions for a thread instead of its execution time could be worth exploring, as it would give a more accurate representation of the processing power needed by the thread. Using an alternative to MemGuard could also be viable, seeing that its functionality and documentation are quite limited.


References

[1] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, "Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms," IEEE Transactions on Computers, vol. 65, no. 2, pp. 562–576, Feb. 2015.

[2] H. Yun, "Improving Real-Time Performance on Multicore Platforms Using MemGuard," Real-Time Linux Workshop (RTLWS), 2013.

[3] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, "Fair Queuing Memory Systems," in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06). IEEE, Dec. 2006, pp. 208–222. [Online]. Available: http://ieeexplore.ieee.org/document/4041848/

[4] M. Caccamo, G. Buttazzo, and D. Thomas, "Efficient reclaiming in reservation-based real-time systems with variable execution times," IEEE Transactions on Computers, vol. 54, no. 2, pp. 198–213, Feb. 2005. [Online]. Available: http://ieeexplore.ieee.org/document/1377158/

[5] P. Valente and F. Checconi, "High Throughput Disk Scheduling with Fair Bandwidth Distribution," IEEE Transactions on Computers, vol. 59, no. 9, pp. 1172–1186, Sep. 2010. [Online]. Available: http://ieeexplore.ieee.org/document/5467047/

[6] N. Dagieu, A. Spyridakis, and D. Raho, "Memguard, Memory Bandwidth Management in Mixed Criticality Virtualized Systems - Memguard KVM Scheduling," in UBICOMM 2016: The Tenth International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, 2016, pp. 21–27. [Online]. Available: https://www.semanticscholar.org/paper/Memguard%2C-Memory-Bandwidth-Management-in-Mixed-KVM-Dagieu-Spyridakis/b556e6146d9ddf9da26f3cc0c92034111c19dbd1

[7] B. Ward, How Linux Works: What Every Superuser Should Know, 2nd ed. No Starch Press, 2014.

[8] M. Rouse, "What is cache memory? - Definition from WhatIs.com," 2019. [Online]. Available: https://searchstorage.techtarget.com/definition/cache-memory

[9] A. Tanenbaum and H. Bos, Modern Operating Systems, 4th ed. New Jersey: Pearson Education, Inc., 2015.

[10] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, MIPS Edition: The Hardware/Software Interface, 5th ed. Morgan Kaufmann, 2013.

[11] B. Jacob, "The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–77, Jan. 2009.

[12] A. Agrawal, R. Mancuso, R. Pellizzoni, and G. Fohler, "Analysis of Dynamic Memory Bandwidth Regulation in Multi-core Real-Time Systems," Sep. 2018. [Online]. Available: http://arxiv.org/abs/1809.05921

[13] B. M. Tudor, Y. M. Teo, and S. See, "Understanding Off-Chip Memory Contention of Parallel Programs in Multicore Systems," in 2011 International Conference on Parallel Processing. IEEE, Sep. 2011, pp. 602–611. [Online]. Available: http://ieeexplore.ieee.org/document/6047228/

[14] Rogue Wave Software Inc, "3.11. Memory Bandwidth," 2011. [Online]. Available: https://docs.roguewave.com/threadspotter/2011.2/manual_html_linux/manual_html/ch_intro_bw.html

[15] M. McCool, A. D. Robison, and J. Reinders, Structured Parallel Programming: Patterns for Efficient Computation. Waltham: Elsevier, Inc., 2012.

[16] J. Danielsson, M. Jagemar, M. Behnam, and M. Sjodin, "Investigating execution characteristics of feature-detection algorithms," in 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA). IEEE, 2017, pp. 1–4.

[17] J. Danielsson, M. Jagemar, M. Behnam, M. Sjodin, and T. Seceleanu, "Measurement-Based Evaluation of Data-Parallelism for OpenCV Feature-Detection Algorithms," in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC). IEEE, Jul. 2018, pp. 701–710.

[18] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten, "Design and Evaluation of the Bandwidth Bandit," Uppsala, Tech. Rep., 2012. [Online]. Available: http://www.it.uu.se/research/publications/reports/2012-003/

[19] M. Behnam, R. Inam, T. Nolte, and M. Sjodin, "Multi-core composability in the face of memory-bus contention," ACM SIGBED Review, vol. 10, no. 3, pp. 35–42, Oct. 2013.

[20] R. Inam, M. Sjodin, and M. Jagemar, "Bandwidth measurement using performance counters for predictable multicore software," in Proceedings of 2012 IEEE 17th International Conference on Emerging Technologies & Factory Automation (ETFA 2012). IEEE, Sep. 2012, pp. 1–4.

[21] R. Inam, N. Mahmud, M. Behnam, T. Nolte, and M. Sjodin, "The Multi-Resource Server for predictable execution on multi-core platforms," in 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, Apr. 2014, pp. 1–12.

[22] University of Bristol, "Action Research — School of Education — University of Bristol." [Online]. Available: http://www.bris.ac.uk/education/study/continuing-professional-development-cpd/actionresearch/

[23] M. Riel, "Understanding Collaborative Action Research." [Online]. Available: http://cadres.pepperdine.edu/ccar/define.html

[24] OpenCV Team, "About - OpenCV library," 2019. [Online]. Available: https://opencv.org/about.html

[25] OpenCV Team, "Introduction to SIFT (Scale-Invariant Feature Transform)." [Online]. Available: https://docs.opencv.org/3.4.6/da/df5/tutorial_py_sift_intro.html

[26] B. Gregg, "Linux perf Examples." [Online]. Available: http://www.brendangregg.com/perf.html

[27] Canonical Ltd, "Ubuntu Manpage: mbw - Memory BandWidth benchmark," 2019. [Online]. Available: https://manpages.ubuntu.com/manpages/cosmic/man1/mbw.1.html