L2 Cache Performance Analysis and Optimizations for Processing HDF5 Data on Multi-core Nodes

Rajdeep Bhowmik #1, Madhusudhan Govindaraju #2

Grid and Cloud Computing Research Laboratory # Department of Computer Science, SUNY Binghamton

P.O. Box 6000, Binghamton, NY 13902-6000, USA [email protected] [email protected]

Abstract — It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors that are available in grid and cloud environments. Scientific middleware libraries that do not adapt to this programming paradigm can suffer from severe performance limitations while executing on emerging multi-core processors. In this paper, we focus on the utilization of a critical shared resource on chip multiprocessors (CMPs), the L2 cache. The way in which an application schedules and assigns processing work to each thread determines the access pattern of the shared L2 cache, which may either enhance or diminish the effects of memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization in order to make informed processing and scheduling decisions in multi-threaded programming. In this paper, using the TAU toolkit for performance feedback from dual- and quad-core machines, we analyze and recommend methods for effective scheduling of threads on multi-core nodes to augment the performance of scientific applications processing HDF5 data. We discuss the benefits that can be achieved by using L2 Cache-Affinity and L2 Balanced-Set based scheduling algorithms for improving L2 cache performance and, consequently, the overall execution time.

I. INTRODUCTION

Hierarchical Data Format (HDF) [1] is a data model, library, and file format for storing and managing data. It supports a wide variety of data types, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 (the current version of HDF) libraries are relevant when the challenges of storing and accessing data push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. HDF5 also allows users to easily share data across a wide variety of computational platforms using applications written in different programming languages. HDF5 is a user-defined format like XML and also a binary format allowing compression to help reduce the size of data files. Data formats such as NetCDF4 [2] are implemented on top of it. Scientific and engineering disciplines such as physics, cosmology, medicine, and meteorology rely on HDF tools. HDF5, and its previous version HDF4, are de facto standards in use by many scientific and engineering applications, including the NASA Earth Observing System and DOE ASCI projects. As these applications plan to move to cloud environments, we focus on cache optimizations for typical dual- and quad-core node configurations.
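To make the data model concrete, the following minimal C sketch writes and then reads back a small 2D array of doubles using the HDF5 1.8 C API. The file name, dataset name, and dimensions are illustrative assumptions; this is not the benchmark code used in this paper.

```c
/* Minimal sketch: write and read a 2D HDF5 dataset (HDF5 1.8 C API).
 * File name, dataset name, and dimensions are illustrative assumptions. */
#include <hdf5.h>
#include <stdio.h>

#define NX 4
#define NY 6

int main(void)
{
    double wdata[NX][NY], rdata[NX][NY];
    hsize_t dims[2] = {NX, NY};

    /* Fill the array with sample values. */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            wdata[i][j] = i * NY + j;

    /* Create a file, a 2D dataspace, and a dataset of native doubles. */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/dset2d", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, wdata);

    /* Read the dataset back into a second buffer. */
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata);
    printf("rdata[1][2] = %.1f\n", rdata[1][2]);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```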

The current trend in the microprocessor industry is the wide availability of chip multiprocessors (CMPs), commonly referred to as multi-core processors, from leading vendors. Performance improvement cannot be guaranteed by directly porting current middleware and application code onto multi-core systems. Degradation in overall performance can be observed for memory-intensive concurrent threads due to contention for shared memory bandwidth [7, 8]. Applications executing on uni-processors with a large private cache can generate less memory traffic than the same applications running on multi-core nodes, where each core has a smaller share of the cache. Estimates show that the gap between processor and memory performance is expected to grow by 50% per year [9]. Though the number of cores on a single chip will continue to grow, cache resources will not keep up at the same rate due to space limitations on the chip. Therefore, the cache-per-core ratio will decrease, and issues related to memory bandwidth, memory latency, and cache fragmentation will hinder performance [10]. Middleware library programmers therefore need to analyze the behavior of each thread, divide the workload, and schedule the threads while taking into consideration the interaction between the various chip resources.

Though much work has been done on optimizing memory subsystem performance on multi-core processors, those findings have not been incorporated into HDF5 processing. Multi-threaded programming in many scientific data processing libraries is not designed for efficient execution on multi-core processors, as L1 and L2 caching are not taken into account. A simple model is to grant all threads equal time slices, but as our results show, those threads will not always achieve similar performance. On multi-core processors, L2 cache allocation among concurrently running threads is controlled by hardware. Unfair cache sharing can occur depending on the cache requirements of the co-runner(s). A thread's cache occupancy affects its cache miss rate and, as a result, the rate at which the thread must wait on data to execute instructions. Therefore, the co-runner(s) play a significant part in the CPU performance of a thread. Uneven load distribution and cache affinity for each thread should be taken into account to minimize the number of shared L2 cache misses and improve the overall turnaround time.

In this paper we present how two scheduling schemes tailored for the L2 cache, L2 Balanced-Set and L2 Cache Affinity, allow efficient utilization of all available cores to achieve optimal turnaround time for processing HDF5 data. Recommendations based on our performance data can facilitate the re-structuring and re-tooling of current multi-threaded programming models to avoid performance pitfalls and to ensure that programs scale gracefully with an increase in the number of processing cores. We study the performance of processing representative HDF5 datasets on dual- and quad-core machines with varying dataset workload per processing thread, and quantify the performance gain that can be realized in each case. We believe application- and feature-specific pluggable modules for HDF5 data processing can be designed and developed based on these results. We chose to present detailed performance results for 2- and 4-core systems as they are the most commonly available nodes in grid and cloud infrastructures.

II. RESEARCH CHALLENGES ADDRESSED IN THIS PAPER

The primary goal of our research is to facilitate the collection, analysis, and feedback provision on performance limitations and optimization opportunities during scheduling and processing of HDF5 dataset workloads on multi-core nodes. We try to address the following three research questions:

(1) What are the inter-dependencies among L2 cache size, number of cores, co-scheduling of processing threads, working set size, and reuse patterns of HDF5 datasets that create optimal L2 cache behavior and application performance?

(2) If the data access patterns and the working datasets are known in HDF5-based scientific applications, how can we direct optimization efforts so that L2 cache misses are minimized and overall throughput is improved?

(3) How should data processing be divided among different cores to take advantage of optimizations enabled by L2 Cache Affinity and L2 Balanced-Set scheduling?

III. THREAD SCHEDULING ON MULTIPLE CORES

L2 Cache Affinity Based Scheduling: In modern multiprocessor operating systems, schedulers exploit cache affinity by scheduling a process on a recently used processor whenever possible. When a process runs on a high-affinity processor, it finds most of its state already in the cache and thus runs more efficiently. This idea can also be used to build efficient multi-core scheduling algorithms. Frequent thread migrations can prevent the scheduler from exploiting cache affinity, and can in turn degrade overall performance. Kazempour et al. [11] evaluated the performance effects of exploiting cache affinity on multi-core processors. They studied both multi-core uni-processors and multi-core multiprocessors, and evaluated the effects of exploiting both L1 and L2 cache affinity. They hypothesized that cache affinity does not affect performance on multi-core processors because: (1) on multi-core uni-processors, reloading the L1 cache state is cheap; and (2) on multi-core multiprocessors, L2 cache affinity is generally low due to cache sharing. Their first hypothesis was confirmed: exploiting cache affinity on multi-core uni-processors has no measurable impact on performance, even when the L1 cache is relatively large, the scheduling time quantum is small, and L2 cache retention is low. The performance data did not validate their second hypothesis. Even though the upper-bound performance improvements from exploiting cache affinity on multi-core multiprocessors are lower than on uni-core multiprocessors, they are still significant: 11% on average and a maximum of 27%.
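On Linux, the cache-affinity idea can be expressed directly with thread-affinity calls. The sketch below is our own illustration (not the scheduler evaluated in [11]); it pins a worker thread to the core that prepared its data, under the assumption that core 0 holds the warm L2 state.

```c
/* Sketch: pin a processing thread to a chosen core so it can reuse the L2
 * state left behind by the data-preparation step. The core id and the empty
 * processing routine are assumptions for illustration only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *process_chunk(void *arg)
{
    /* ... process the portion of the dataset assigned to this thread ... */
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t worker;
    pthread_attr_t attr;
    cpu_set_t cpus;

    /* Suppose core 0 prepared the dataset; start the worker on core 0
     * (affinity-based) instead of letting the kernel place it elsewhere. */
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);

    if (pthread_create(&worker, &attr, process_chunk, NULL) != 0)
        fprintf(stderr, "could not create pinned worker\n");
    pthread_join(worker, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}
```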

L2 Balanced-Set Scheduling: Fedorova et al. [12] proposed an L2-conscious scheduling algorithm for balanced-set scheduling on multi-core architectures. They investigated ways to leverage the operating system scheduler to reduce pressure on the L2 cache. The idea is that all runnable threads are separated into subsets, or groups, such that the combined working set of each group fits in the cache. A group at a time is then executed for the duration of the scheduled time slice. By ensuring that the working set of each scheduled group fits in the cache, this algorithm reduces cache miss ratios. One of the findings of their work is that working set size alone is not a good indicator of a workload's cache behavior; instead, the reuse pattern of memory locations in the working set plays a more significant role. To estimate the cache miss ratio produced by a group of threads, they used a cache model for single-threaded workloads developed by Berg and Hagersten [13] and adapted it to multi-threaded workloads. Using this model they estimated the cache miss ratios of multi-threaded workloads to within 17% of their actual values, on average. This level of accuracy is sufficient to distinguish between workloads that fit in the cache and those that thrash. Once the cache miss ratio of a group of threads is estimated, it is possible to decide which threads should be scheduled together. By scheduling threads in groups that have low cache miss ratios, the overall miss ratio is kept low, and the overall instructions per cycle (IPC) remains high.
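A rough approximation of the balanced-set idea is to greedily group runnable threads so that the estimated working sets in each group together fit in the shared L2 cache. The C sketch below shows only this grouping step; it assumes per-thread working-set estimates are available and, unlike [12], ignores reuse patterns and the Berg-Hagersten cache model.

```c
/* Sketch: greedily partition thread workloads into groups whose combined
 * estimated working-set size fits in the shared L2 cache. This is a
 * simplification of balanced-set scheduling; reuse patterns are ignored
 * and the working-set estimates below are invented for illustration. */
#include <stdio.h>
#include <stddef.h>

#define L2_BYTES   (2u * 1024u * 1024u)   /* 2 MB shared L2 (assumption) */
#define NTHREADS   6
#define MAX_GROUPS NTHREADS

int main(void)
{
    /* Estimated working-set size of each runnable thread, in bytes. */
    size_t wss[NTHREADS] = { 512u<<10, 1u<<20, 768u<<10, 256u<<10, 1536u<<10, 512u<<10 };

    int group_of[NTHREADS];
    size_t group_fill[MAX_GROUPS] = {0};
    int ngroups = 0;

    for (int t = 0; t < NTHREADS; t++) {
        int g;
        /* Place the thread in the first group where its working set still fits. */
        for (g = 0; g < ngroups; g++)
            if (group_fill[g] + wss[t] <= L2_BYTES)
                break;
        if (g == ngroups)
            ngroups++;                 /* open a new group */
        group_fill[g] += wss[t];
        group_of[t] = g;
    }

    /* Each group would then be co-scheduled for a full time slice. */
    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d -> group %d\n", t, group_of[t]);
    printf("%d group(s), each fitting in a %u-byte L2\n", ngroups, L2_BYTES);
    return 0;
}
```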

IV. RELATED WORK

Symbiotic Jobscheduling [14] is a method for co-scheduling threads on simultaneous multithreading (SMT) processors in a way that minimizes resource contention. This method could be adapted to co-schedule threads on single-threaded cores sharing caches. It works by trying (or sampling) a large number of thread assignments and picking the ones with the best observed rate of instructions per cycle. The drawback of this solution is the required sampling phase, during which the workload is not scheduled optimally. The number of samples required increases with the number of threads and cores. Fedorova et al. designed a cache-aware scheduler that compensates threads hurt by cache contention by giving them extra CPU time [15]. This algorithm accomplishes the effect of fair cache sharing, but it was not designed to improve overall performance.

In the work on Utility Cache Partitioning [16], a custom hardware solution estimates each application's number of hits and misses for every possible number of cache ways that could be allocated to it. The technique is based on stack-distance profiles. The cache is then partitioned so as to minimize the number of cache misses for the co-running applications.

Tam et al. [17] address cache contention via software-based cache partitioning. The cache is partitioned among applications using page coloring. Each application is reserved a portion of the cache, and physical memory is allocated such that the application's cache lines map only into that reserved portion. The size of the allocated cache portion is determined based on the marginal utility of allocating additional cache lines to that application. Marginal utility is estimated via an application's reuse distance profile, which is similar to a stack-distance profile and is approximated online using hardware counters. Software cache partitioning, like hardware cache partitioning, is used to isolate workloads that hurt each other. In this paper, we do not consider software-based cache allocation or cache partitioning techniques.

In earlier work [3, 4], we focused on analyzing the number of CPU cycles taken by each core to process XML datasets. In this paper, we focus specifically on the effect of scheduling algorithms on L2 cache misses for HDF5-based scientific applications. The results of this paper apply to processing of HDF5 application data that is memory bound, where cache utilization is critical for overall performance.

V. PERFORMANCE RESULTS

We conducted performance tests on a dual-core machine (each core an out-of-order Intel Core2 processor running at 2.13 GHz, with a 32 KB Level 1 data cache per core and a shared 2 MB Level 2 data cache) and a quad-core machine (each core an out-of-order Intel Xeon processor running at 1.86 GHz, with a 32 KB Level 1 data cache per core and a shared 4 MB Level 2 data cache). We consider synthetic datasets and chose data sizes that reveal insights on L2 cache optimizations.

TAU Performance Profiling Tool: We collected extensive performance data using the TAU Performance System [5], a portable profiling and tracing toolkit for performance analysis of parallel programs, and the Performance Application Programming Interface (PAPI) [6]. PAPI provides a consistent interface and methodology for using the performance counter hardware found in most major microprocessors. We compiled and ran our benchmarking code using gcc-3.4.6, TAU version 2.19, and PAPI version 4.0.0, after configuring and building a Linux 2.6.25 kernel with support for hardware performance counters on both the dual- and quad-core machines. We used the source-based instrumentation feature provided by TAU to instrument and compile our benchmark programs, and the PAPI package to count hardware performance events, such as cache misses, while the programs execute on the processor. We identified which PAPI events our system supports and then tested the compatibility of the metrics to be profiled by PAPI. Example metrics include PAPI_L2_DCM (Level 2 Data Cache Misses) and PAPI_L2_TCM (Level 2 Total Cache Misses). In our thread scheduling experiments, we used CPU affinity mechanisms with the taskset command to set or retrieve the CPU affinity of a running process given its PID, and to launch a new process with a given CPU affinity. We installed the schedutils (Linux scheduler utilities) package for this purpose. CPU affinity is a scheduler property that overrides the kernel's process scheduling and binds a process to a specific CPU or set of CPUs.
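For readers unfamiliar with the counter interface, the sketch below shows the low-level building blocks: it binds the current process to core 0 with sched_setaffinity (the role played by the taskset command in our setup) and reads PAPI_L2_TCM around a region of interest. It is a simplified stand-in, not our TAU-instrumented benchmark; the array size and the work loop are arbitrary.

```c
/* Sketch: pin the current process to core 0 and count L2 total cache misses
 * (PAPI_L2_TCM) around a region of interest. The dataset size and the work
 * performed are illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* number of doubles, arbitrary for the sketch */

int main(void)
{
    /* Bind this process to core 0 (as `taskset -c 0 ./a.out` would). */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);
    sched_setaffinity(0, sizeof(cpu_set_t), &cpus);

    /* Set up a PAPI event set with the L2 total-cache-miss counter. */
    int eventset = PAPI_NULL;
    long long misses = 0;
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L2_TCM);

    double *data = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) data[i] = i;      /* populate the dataset  */

    PAPI_start(eventset);
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += data[i];   /* region being measured */
    PAPI_stop(eventset, &misses);

    printf("sum = %.0f, L2 total cache misses = %lld\n", sum, misses);
    free(data);
    return 0;
}
```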

A. L2 Cache Affinity Based Scheduling

The Cache Affinity based scheduling experiments have been specifically designed to demonstrate the effects of the shared L2 cache on HDF5 data processing. In our experiments, we use a single core to prepare HDF5 datasets for processing. When the dataset preparation is completed, the lower part of the dataset remains in the L2 cache, which is shared by all cores. Given the widespread use of 2D arrays in scientific computing, we chose to process this type of data in our experiments. Processing the data consists of reading from disk, populating the 2D array with values, and performing simple floating point operations on that data. HDF5 datasets can also consist of other objects, such as higher-dimensional arrays of a specific data type, raster images, tables, and metadata. We plan to address such datasets in future work.

For Non Affinity based scheduling on the dual-core and quad-core machines, we use one core for dataset preparation, flush the cache contents, and study the cases when two and four cores are used for processing the entire dataset. Flushing the cache contents simulates a thread being scheduled on a core that it has not previously been scheduled on. The same situation is also encountered when the processing of a large file begins with a part of the file that is not resident in the cache. For Affinity based scheduling we use one core for dataset preparation, leave the cache contents intact, and study the cases when two and four cores are used for processing. When a processing core executes the thread that processes the array elements in the lower portion of the dataset, its cache misses are serviced from the shared L2 cache, and not from system memory. Accessing the shared L2 cache within the chip is an order of magnitude faster than accessing system memory. As a result, the core processing the final array elements in the dataset with Affinity based scheduling has fewer L2 cache misses and a lower combined processing time, despite processing the same number of elements under both scheduling schemes. For Non Affinity based scheduling, however, after incurring the initial misses as a result of the flushed cache, the threads compete for space in the L2 cache. This is a result of each thread attempting to process more data than can fit in the L2 cache, resulting in an increased number of cache misses and increased processing time.

We execute a single processing thread on each core for both the dual-core and quad-core machines. We define combined processing time as the sum of the running times of the individual threads. This metric provides a measure of the non-idle cycles and relative workload of each core during processing. In Fig. 1 and Fig. 2, "NA" represents Non Affinity and "A" represents Affinity.
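The even-distribution setup can be summarized by the following simplified C sketch, which splits the rows of a 2D array between two worker threads and reports the sum of the per-thread running times, i.e., the combined processing time metric used in the figures. The array dimensions and the floating point work are invented for illustration; the sketch omits HDF5 I/O, cache flushing, and affinity control.

```c
/* Sketch: even distribution of a 2D array over two worker threads, with
 * per-thread timing. "Combined processing time" is the sum of the per-thread
 * times. Sizes and the amount of work are illustrative assumptions. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 2048
#define COLS 1024
#define NWORKERS 2

static double *grid;                         /* ROWS x COLS array of doubles */
static double thread_seconds[NWORKERS];

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (ROWS / NWORKERS);        /* each worker gets half the rows */
    long hi = lo + (ROWS / NWORKERS);
    double t0 = now();
    volatile double acc = 0.0;

    for (long i = lo; i < hi; i++)
        for (long j = 0; j < COLS; j++)
            acc += grid[i * COLS + j] * 0.5;  /* simple floating-point work */

    thread_seconds[id] = now() - t0;
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    grid = malloc((size_t)ROWS * COLS * sizeof(double));
    for (long i = 0; i < (long)ROWS * COLS; i++) grid[i] = i;

    for (long id = 0; id < NWORKERS; id++)
        pthread_create(&tid[id], NULL, worker, (void *)id);
    for (long id = 0; id < NWORKERS; id++)
        pthread_join(tid[id], NULL);

    printf("combined processing time = %.3f ms\n",
           (thread_seconds[0] + thread_seconds[1]) * 1e3);
    free(grid);
    return 0;
}
```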

Fig. 1. Combined Processing Time and L2 cache misses with Affinity vs. Non Affinity based scheduling for even distribution of dataset workload on a dual-core CMP with 2 MB L2 cache size and increasing dataset size. "NA" represents Non Affinity and "A" represents Affinity.

Experiments for Even Dataset Distribution: Fig. 1(a) and Fig. 1(b) present the processing time and total number of L2 cache misses for even distribution of increasing dataset workloads on the dual-core CMP given a 2MB L2 cache for both Affinity and Non Affinity based scheduling. Even distribution of dataset workload means each core is assigned half of the array elements to process. We denote the two processing cores as Core 0 and Core 1. When the processing is conducted with Non Affinity based scheduling, both cores suffer approximately the same number of L2 cache misses, resulting in similar combined processing times. However, when Core 0 uses Affinity based scheduling the number of L2 cache misses is reduced, resulting in less combined processing time. For the dual-core machine, the improvement in combined processing time and number of L2 cache misses incurred is in the range of 36.71%-64.14% and 57.13%-97.65% respectively.

Fig. 2(a) and Fig. 2(b) present processing time and total number of L2 cache misses for even distribution of increasing dataset workloads on a quad-core CMP with a 4 MB L2 cache for both Affinity and Non Affinity based scheduling. Here, when the total dataset size is 4 MB, each core processes 1 MB of the dataset and all data fits entirely in the 4 MB shared L2 cache. We denote the processing cores as Core 0, Core 1, Core 2 and Core 3. For Non Affinity based scheduling, all cores suffer a comparable number of L2 cache misses, resulting in similar combined processing time. For Affinity based scheduling, there are fewer overall L2 cache misses, resulting in reduced combined processing time. Our experiments show improvements in combined processing time and L2 cache misses in the range of 34.60%-61.47% and 52.97%-97.93% respectively. For both the dual-core and quad-core machines, as dataset size increases, eventually growing beyond the size of the L2 cache, the number of L2 cache misses increases even for the core scheduled using Cache Affinity. As expected, performance improvements are best when the dataset size is the smallest and fits entirely in the shared L2 cache.

Fig. 2. Combined Processing Time and L2 cache misses with Affinity vs. Non Affinity based scheduling for even distribution of dataset workload on a quad-core CMP with 4 MB L2 cache size and increasing dataset size. "NA" represents Non Affinity and "A" represents Affinity.

The behavior of the Non Affinity core with respect to Affinity based scheduling can be seen in Fig. 2. For the dual-core machine with a 2 MB shared L2 cache, the number of L2 cache misses in Core 1 (Non Affinity core) is comparable with that of Core 0 (Affinity core) as long as the total dataset fits entirely in the cache and each core processes an equal portion of the dataset. Once the total dataset grows larger than the L2 cache size, the portions of the dataset processed by Cores 0 and 1 no longer fit into the L2 cache together, causing L2 cache misses on Core 1 to be similar for both Affinity and Non Affinity based scheduling. For example, with a total dataset size of 2 MB and Affinity based scheduling, each core processes 1 MB of data: Core 0 has 262 L2 cache misses and Core 1 has 281 L2 cache misses. When the total dataset size is 4 MB and Affinity based scheduling is employed, each core processes 2 MB: Core 0 incurs 1109 L2 cache misses and Core 1 incurs 18720 L2 cache misses. Core 1's L2 cache misses are roughly equivalent to the 18813 L2 cache misses incurred with Non Affinity based scheduling. This is a result of the dataset size relative to the L2 cache, as seen by Core 1's L2 cache misses remaining comparable with subsequent increases in dataset size.

Core 0, given the advantage of cache affinity in Affinity based scheduling, still suffers an increasing number of L2 cache misses because the larger datasets processed by the two cores result in thrashing. Similar observations can be made for the quad-core machine about the three Non Affinity cores, Core 1, Core 2 and Core 3, with respect to the number of L2 cache misses incurred. The L2 cache misses also impact the combined processing times: on this machine, a 13.37%-91.85% improvement in cache misses results in a 9.05%-61.65% improvement in total running time.

Experiments for Uneven Dataset Distribution: In Fig. 3, we present the processing time and the L2 cache miss behavior for uneven distribution of the dataset workload among the two cores of the dual-core machine using Affinity based scheduling. Fig. 1 showed the imbalance in processing times and L2 cache misses between the two processing cores due to cache affinity in Affinity based scheduling on the dual-core machine. In Fig. 3(a), we discuss an optimization that performs uneven distribution of the dataset workload between the processing threads to balance the loads on the two cores. The core that has the advantage of cache affinity in Affinity based scheduling is assigned 75% of the total dataset workload for all dataset sizes. We tried different percentage settings; only the results for the 75% assignment are presented for brevity. As turnaround time is governed by the longest running thread, this optimization reduces the turnaround time of the application when the dataset does not fit in the L2 cache. When the total dataset size is less than or equal to the L2 cache size, the Non Affinity core performs better than the Affinity core, as it has its portion of the dataset already resident in the L2 cache and is entrusted with less processing work. For example, when the total dataset size is 2 MB, it fits entirely in the L2 cache and the workload is unevenly distributed among the two processing cores: Core 0 (Affinity core) processes 1.5 MB of the dataset and Core 1 (Non Affinity core) processes 0.5 MB. The turnaround times of the application processing 1 MB and 2 MB of HDF5 data are 36.39 milliseconds and 59.79 milliseconds respectively. In comparison, Core 1 (Non Affinity core) takes 13.26 milliseconds and 20.19 milliseconds to process 1 MB and 2 MB of HDF5 data respectively. Once the total dataset size grows beyond the capacity of the shared L2 cache, Core 0 (Affinity core), although it has more data to process, starts performing better than Core 1 (Non Affinity core), at a decreasing rate. Note that as the dataset size is increased further, Core 0 starts performing worse than Core 1, because more data must be processed than the L2 cache can hold. Despite this, the turnaround time is still better for uneven distribution. For example, Core 0 takes 195.31 milliseconds and Core 1 takes 187.59 milliseconds to process 16 MB of HDF5 data with uneven distribution of the dataset, whereas Core 0 takes 162.91 milliseconds and Core 1 takes 257.92 milliseconds with even distribution. The turnaround time of the application is therefore 195.31 milliseconds for uneven distribution and 257.92 milliseconds for even distribution. For the 4 MB, 8 MB and 16 MB HDF5 datasets, uneven distribution improves the turnaround time over even distribution by 16.23%, 23.27% and 24.27% respectively on the dual-core machine with its 2 MB L2 cache. Similar observations are made regarding the turnaround times and L2 cache misses for the quad-core machine, but those results are not presented here due to space constraints. The benefits of uneven distribution depend on the relationship between the dataset size processed per core and the portion of it that can stay resident in the L2 cache. Optimization requires uneven distribution if and only if the dataset is larger than the L2 cache, which is often the case for HDF5 files.
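As a check on the 16 MB case, the reported improvement follows directly from the per-core times quoted above, since turnaround time is determined by the longest running thread:

\[
\frac{T_{\text{even}} - T_{\text{uneven}}}{T_{\text{even}}}
  = \frac{257.92\,\text{ms} - 195.31\,\text{ms}}{257.92\,\text{ms}}
  \approx 0.2427,
\]

which matches the 24.27% improvement stated above.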

Fig. 3. Turnaround Time and L2 cache misses with Affinity for Even vs. Uneven distribution of dataset workload on a dual-core CMP with 2 MB L2 cache size and increasing dataset size.


Fig. 3(b) indicates that the number of L2 cache misses decreases for uneven distribution relative to even distribution with Affinity based scheduling when the total dataset size is larger than the L2 cache. When the dataset size is less than or equal to the L2 cache size, the number of L2 cache misses is comparable for both even and uneven distribution. The total number of L2 cache misses decreases in the range of 2.07% to 10.49% as the dataset size increases from 4 MB to 16 MB. As the dataset size increases, Core 0 (Affinity core) incurs a greater number of L2 cache misses due to the greater processing workload and an increase in the portion of the dataset not resident in the L2 cache. Increased thrashing occurs as the two threads have their own access patterns to satisfy within the space-constrained L2 cache. However, uneven distribution of the dataset workload still performs better than even distribution in terms of the total number of L2 cache misses incurred.

B. L2 Balanced-Set Scheduling

In Fig. 4, we compare our L2 Balanced-Set algorithm with two other cases: Naïve and Optimized. We consider a 3 MB HDF5 dataset to be processed by two threads on the dual-core machine with its 2 MB shared L2 cache. We do not take into account any kind of cache affinity or data reuse patterns in this experiment; only whether the working set fits in the L2 cache is considered for Balanced-Set scheduling. Therefore, the L2 cache contents are entirely flushed after generating the HDF5 dataset. Case 1 is Naïve scheduling: the dataset is uniformly divided between the two cores, each processing 1.5 MB of HDF5 data. We observe that the combined processing time is 211.25 milliseconds and the total number of L2 cache misses is 21971, as shown in Fig. 4(a) and Fig. 4(b) respectively. Case 2 denotes Optimized scheduling, in which we optimize Case 1. As the shared L2 cache size is 2 MB on the dual-core machine, we schedule a 2 MB portion of the dataset for processing on Core 0 and a 1 MB portion on Core 1. We observe that the combined processing time is 162.06 milliseconds and the total number of L2 cache misses is 16530, as shown in Fig. 4(a) and Fig. 4(b) respectively. Case 2 is therefore approximately 23.29% and 24.76% more efficient than Case 1 with respect to combined processing time and total L2 cache misses. Case 3 denotes our L2 Balanced-Set scheduling. In Case 3, we balance the distribution of the dataset between the two cores in such a way that the dataset size processed by each core is smaller and the total data processed in each phase by both cores fits in the L2 cache. The processing of the entire dataset is done in phases in this algorithm, with both cores doing equal work during each individual phase. Core 0 and Core 1 each process 1 MB of the dataset in the first phase. Upon completion of the first phase, we assign 0.5 MB to Core 0 and 0.5 MB to Core 1 in the second phase, thus completing the processing of the 3 MB dataset. If the data processed in each phase fits in the L2 cache, then cache misses will be lower than in Case 2. We see from Fig. 4(a) and Fig. 4(b) that the combined processing time is 104.67 milliseconds and the total number of L2 cache misses incurred is 9175 in Case 3, confirming our hypothesis. Balanced-Set scheduling thus gives 35.41% and 44.50% improvement over Optimized scheduling, and 50.45% and 58.24% improvement over Naïve scheduling, in terms of combined processing time and number of L2 cache misses respectively.
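The phase structure of Case 3 can be sketched as follows. This is our own simplified illustration: process_range() is a hypothetical placeholder for the HDF5 processing kernel, synchronization between phases is only indicated by a comment, and the constants mirror the dual-core configuration described above.

```c
/* Sketch of the phased Balanced-Set schedule of Case 3: in each phase the
 * two cores together process at most L2_BYTES of data, so the per-phase
 * working set fits in the shared L2 cache. process_range() is a hypothetical
 * stand-in for the HDF5 processing kernel. */
#include <stdio.h>
#include <stddef.h>

#define MB(x)        ((size_t)(x) * 1024u * 1024u)
#define L2_BYTES     MB(2)          /* shared L2 on the dual-core machine */
#define DATASET_SIZE MB(3)          /* total HDF5 data to process         */
#define NCORES       2

static void process_range(int core, size_t offset, size_t length)
{
    /* Placeholder: process `length` bytes starting at `offset` on `core`. */
    printf("core %d: bytes [%zu, %zu)\n", core, offset, offset + length);
}

int main(void)
{
    size_t done = 0;
    int phase = 0;

    while (done < DATASET_SIZE) {
        /* Per-phase budget: the whole phase must fit in the L2 cache. */
        size_t phase_bytes = DATASET_SIZE - done;
        if (phase_bytes > L2_BYTES)
            phase_bytes = L2_BYTES;

        size_t per_core = phase_bytes / NCORES;   /* equal work per core */
        printf("phase %d (%zu bytes total):\n", ++phase, phase_bytes);
        for (int c = 0; c < NCORES; c++)
            process_range(c, done + (size_t)c * per_core, per_core);

        done += phase_bytes;
        /* A real implementation would place a barrier here so both cores
         * finish the phase before the next slice is scheduled. */
    }
    return 0;
}
```

For the 3 MB dataset this yields the two phases described above: 1 MB per core in the first phase and 0.5 MB per core in the second.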

Fig. 4. Processing Time and L2 cache misses for a 3 MB HDF5 dataset with Naïve, Optimized, and Balanced-Set scheduling on a dual-core CMP with 2 MB L2 cache.

Similarly, for the quad-core machine, in Fig. 5 we consider the performance difference between three flavours of Balanced-Set scheduling: (1) Balanced-Naïve, (2) Balanced-Optimal, and (3) Balanced-Utilized. The difference between the three approaches is that Balanced-Naïve schedules processing threads on all the available cores while Balanced-Optimal does not, and Balanced-Utilized uses all available cores by pairing each processing core with a helper core. We consider a 12 MB HDF5 dataset to be processed on the quad-core machine with its 4 MB shared L2 cache. We take data reuse patterns into consideration for Balanced-Optimal and Balanced-Utilized scheduling in this experiment; therefore, the L2 cache contents are not flushed after generating the dataset in these two cases. For Balanced-Naïve scheduling, we focus only on the optimizations required to always have the working set resident in the L2 cache. Case 1 denotes Balanced-Naïve scheduling, where each core processes 1 MB of HDF5 data in each phase. It therefore takes 3 phases to complete the processing of the entire 12 MB dataset, and each core processes 3 MB of HDF5 data in total. The processing times for Core 0, Core 1, Core 2, and Core 3 are 209.13, 209.46, 205.59 and 209.73 milliseconds respectively, which add up to 833.91 milliseconds as shown in Fig. 5(a). Similarly, the L2 cache misses for Core 0, Core 1, Core 2, and Core 3 are 17538, 17616, 17577 and 17583 respectively, which add up to 70314 total L2 cache misses as shown in Fig. 5(b).

Fig. 5. Processing Time and L2 cache misses for a 12 MB HDF5 dataset with Balanced-Naïve, Balanced-Optimal and Balanced-Utilized scheduling on a quad-core CMP with 4 MB L2 cache size.

Case 2 denotes Balanced-Optimal scheduling, where the processing threads use only half of the available cores and process the data in phases, allowing independent co-running workloads to run on the remaining cores. In the first phase of execution, Core 0 and Core 1 each process 2 MB of HDF5 data from the lower part of the dataset (resident in the L2 cache). In the second and third phases, each core is responsible for processing 2 MB from the lower and upper parts of the dataset, resulting in processing of the entire 12 MB dataset. In this case, the scheduling algorithm takes advantage of the L2 cache size and data reuse rather than the total number of available cores. Fig. 5(a) shows that the combined processing time for Case 2 is 608.19 milliseconds, a 27.07% improvement over Case 1. We also observe from Fig. 5(b) that Case 2 incurs approximately 21.09% fewer L2 cache misses than Case 1. Note that in both Fig. 5(a) and Fig. 5(b), the processing time and total L2 cache misses incurred by Core 2 and Core 3 for Case 2 are 0, as the workload is run only on Core 0 and Core 1. Therefore, we can conclude that the reuse pattern of the working set is a more important factor than the available number of cores for optimized scheduling of threads.

Case 3 denotes Balanced-Utilized scheduling, where each processing thread is associated with a helper thread. Case 3 is an extension of Case 2, with helper threads running on the previously unutilized cores. The helper threads do not hinder the processing threads: they only pre-fetch data into the L2 cache that will be needed during subsequent phases, without evicting data that is currently being used. During the last phase, the helper threads become processing threads to complete the processing of the dataset.

In Phase 1, the cache is already populated with the bottom portion of the dataset. As such, Cores 0 and 1 each process 1 MB of data (for a total of 2 MB) and Cores 2 and 3 remain idle with no pre-fetching to do. In Phase 2, the cache contains the data that needs to be processed by Cores 0 and 1. However, in the next phase the cache will not contain the necessary data, so Cores 2 and 3 must each pre-fetch 1 MB of data into the cache, replacing the previous phase's processed data. This means that in this phase, 2 MB of data is processed and 2 MB of data is pre-fetched. Note that the helper cores fetch data into an unused portion of the L2 cache and therefore do not add to the cache misses. Furthermore, the Balanced-Set scheduling requirement is met, as the working sets of all cores fit into the L2 cache. Phases 3 and 4 proceed in the same fashion as Phase 2. Thus, at the end of Phase 3 a total of 6 MB has been processed, and at the end of Phase 4 the total reaches 8 MB. In Phase 5, Cores 2 and 3 determine that no more data needs to be pre-fetched and become processing cores. All four cores process data in this phase, and its completion finishes the processing of all 12 MB of HDF5 data. We observe from Fig. 5(a) and Fig. 5(b) that in Case 3 the combined processing time is 461.97 milliseconds and the total number of L2 cache misses incurred is 36456. We conclude that, in terms of combined processing time and number of L2 cache misses, the Balanced-Utilized scheduling approach gives 24.04% and 34.29% improvement over Balanced-Optimal scheduling and 44.60% and 48.15% improvement over Balanced-Naïve scheduling respectively.
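The helper-core pattern can be sketched with a per-phase barrier: processing threads work on the current slice while helper threads touch the next slice to pull it into the shared L2 cache. The sketch below is our own simplification of Case 3 (smaller total size, fixed slice sizes, and helpers that never convert into processing threads); it is not the code used in our experiments.

```c
/* Sketch of the Balanced-Utilized helper pattern: in each phase, two
 * processing threads work on the current slice while two helper threads
 * touch the next slice to warm the shared L2 cache. A barrier separates
 * phases. Slice sizes and the touch loop are illustrative simplifications. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MB(x)    ((size_t)(x) * 1024u * 1024u)
#define SLICE    MB(1)              /* per-thread slice per phase      */
#define NPHASES  4
#define NPROC    2                  /* processing threads (cores 0, 1) */
#define NHELP    2                  /* helper threads (cores 2, 3)     */

static double *data;
static pthread_barrier_t phase_barrier;

static void *processor(void *arg)
{
    long id = (long)arg;
    for (int p = 0; p < NPHASES; p++) {
        size_t base = ((size_t)p * NPROC + id) * (SLICE / sizeof(double));
        volatile double acc = 0.0;
        for (size_t i = 0; i < SLICE / sizeof(double); i++)
            acc += data[base + i];                 /* process current slice */
        pthread_barrier_wait(&phase_barrier);      /* end of phase          */
    }
    return NULL;
}

static void *helper(void *arg)
{
    long id = (long)arg;
    for (int p = 0; p < NPHASES; p++) {
        /* Touch the slice the processors will need in the NEXT phase. */
        if (p + 1 < NPHASES) {
            size_t base = ((size_t)(p + 1) * NPROC + id) * (SLICE / sizeof(double));
            volatile double sink = 0.0;
            for (size_t i = 0; i < SLICE / sizeof(double); i += 8)
                sink += data[base + i];            /* pull lines into L2    */
        }
        pthread_barrier_wait(&phase_barrier);
    }
    return NULL;
}

int main(void)
{
    size_t n = NPHASES * NPROC * (SLICE / sizeof(double));
    data = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) data[i] = 1.0;

    pthread_barrier_init(&phase_barrier, NULL, NPROC + NHELP);
    pthread_t tid[NPROC + NHELP];
    for (long i = 0; i < NPROC; i++) pthread_create(&tid[i], NULL, processor, (void *)i);
    for (long i = 0; i < NHELP; i++) pthread_create(&tid[NPROC + i], NULL, helper, (void *)i);
    for (int i = 0; i < NPROC + NHELP; i++) pthread_join(tid[i], NULL);

    pthread_barrier_destroy(&phase_barrier);
    free(data);
    printf("processed %zu MB in %d phases\n", (n * sizeof(double)) >> 20, NPHASES);
    return 0;
}
```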

VI. CONCLUSION

Observations from the performance results in the previous section reveal the following important considerations for scheduling threads to process HDF5 datasets on multi-core processors:

(1) Our framework can be used to quantify the extent to which affinity based scheduling performs better than non-affinity based scheduling (in terms of combined processing turnaround times and shared L2 cache misses) for a given application that requires processing of HDF5 data.

(2) Combined processing time and L2 cache misses with Affinity based scheduling are reduced by approximately 36.71%-64.14% and 57.13%-97.65% respectively on a dual-core CMP and 34.60%-61.47% and 52.97%-97.93% respectively on a quad-core CMP as opposed to Non Affinity based scheduling.


(3) With Affinity based scheduling on the dual-core CMP, uneven distribution of the dataset workload reduces turnaround time and the total number of L2 cache misses by 16.23%-24.27% and 2.07%-10.49% respectively, relative to even distribution. As long as the total dataset size remains less than or equal to the L2 cache size, combined processing time remains comparable for even and uneven distribution with Affinity based scheduling.

(4) If the data is processed in phases such that each phase fits in the L2 cache, our experiments show that Balanced-Set scheduling results in 35.41% and 44.50% improvement over Optimized scheduling and 50.45% and 58.24% improvement over Naïve scheduling, for combined processing time and number of L2 cache misses respectively.

(5) When threads process datasets, data reuse plays a more important role than the size of the dataset and number of available processing cores. Thus Balanced-Utilized scheduling results in 24.04% and 34.29% improvement over Balanced-Optimal scheduling and 44.60% and 48.15% improvement over Balanced-Naïve scheduling in terms of combined processing time and number of L2 cache misses respectively.

VII. FUTURE WORK

In future work, we will perform analyses for a larger number of representative HDF5 datasets containing different commonly used objects, such as multi-dimensional arrays, raster images, and tables. We will quantify the gains for emerging 8-, 16-, 32-, and 64-core and asymmetric multi-core systems. We plan to use a dynamic marker mechanism, in which we keep track of the progress of threads on each core to determine work allocation and dynamically make adjustments based on processing performance and efficiency. We will also study ways to improve I/O performance during HDF5 dataset processing. We will continue to use our experimental framework to study memory access patterns for other scientific data formats such as NeXus and NetCDF.

REFERENCES

[1] Hierarchical Data Format (HDF5). http://www.hdfgroup.org/HDF5/
[2] Network Common Data Form (NetCDF4). http://www.unidata.ucar.edu/software/netcdf/
[3] Rajdeep Bhowmik, Chaitali Gupta, Madhusudhan Govindaraju, and Aneesh Aggarwal, "Optimizing XML Processing for Grid Applications Using an Emulation Framework," in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS '08), Miami, Florida, April 2008.
[4] Rajdeep Bhowmik and Madhusudhan Govindaraju, "Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors," in Proceedings of the 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010), Melbourne, Victoria, Australia, May 2010.
[5] S. Shende and A. D. Malony, "The TAU Parallel Performance System," International Journal of High Performance Computing Applications, SAGE Publications, 20(2):287–311, Summer 2006.
[6] S. Browne, C. Deane, G. Ho, and P. Mucci, "PAPI: A Portable Interface to Hardware Performance Counters," in Proceedings of the Department of Defense HPCMP Users Group Conference, June 1999.
[7] M. Lee, Y. Ryu, S. Hong, and C. Lee, "Performance Impact of Resource Conflicts on Chip Multi-Processor Servers," in Proceedings of the Workshop on State-of-the-art in Scientific and Parallel Computing, 2006.
[8] Pradeep Dubey, "CMP Challenges," ICCD Panel Talk, http://www.iccdconference.org/proceedings/2005/cmp1-pradeepdubey.pdf, 2005.
[9] Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julien Langou, Piotr Luszczek, and Stanimire Tomov, "The Impact of Multicore on Math Software," Lecture Notes in Computer Science, vol. 4699, pp. 1–10, Springer, Sept. 2007.
[10] J. Dongarra, D. Gannon, G. Fox, and K. Kennedy, "The impact of multicore on computational science software," CTWatch Quarterly, The Promise and Perils of the Coming Multicore Revolution and Its Impact, 3(1):1175–1220, February 2007.
[11] Vahid Kazempour, Alexandra Fedorova, and Pouya Alagheband, "Performance Implications of Cache Affinity on Multicore Processors," in Proceedings of the 14th International Euro-Par Conference on Parallel Processing, 2008.
[12] Alexandra Fedorova, Margo Seltzer, Christopher Small, and Daniel Nussbaum, "Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design," in Proceedings of the USENIX Annual Technical Conference, 2005.
[13] E. Berg and E. Hagersten, "Efficient Data-Locality Analysis of Long-Running Applications," TR 2004-021, Univ. of Uppsala, May 2004.
[14] A. Snavely and D. M. Tullsen, "Symbiotic jobscheduling for a simultaneous multithreaded processor," SIGARCH Comput. Archit. News, 28(5):234–244, 2000.
[15] A. Fedorova, M. I. Seltzer, and M. D. Smith, "Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler," in Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT '07), pages 25–38, 2007.
[16] M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423–432, 2006.
[17] D. K. Tam, R. Azimi, L. B. Soares, and M. Stumm, "RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations," in ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 121–132, 2009.
